mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-12 19:45:56 +00:00

### Description
Updates the Python version of the example docs to show how to run the same code that the CLI runs, but using Python. Rather than copying the command that would be run via the terminal and using the subprocess library to run it, this updates it to use the supported code exposed in the inference directory. For now, only the Wikipedia example has been updated, to get some opinions on this before updating all other connector docs. Would close out https://github.com/Unstructured-IO/unstructured/issues/1445
30 lines
1.2 KiB
Bash
#!/usr/bin/env bash

# Processes the files in a Microsoft SharePoint site
# through Unstructured's library in 2 processes.

# Structured outputs are stored in sharepoint-ingest-output/

# NOTE, this script is not ready-to-run!
# You must enter an MS SharePoint app client-id, client secret and SharePoint site URL
# before running.

# To get the credentials for your SharePoint app, follow these steps:
# https://github.com/vgrem/Office365-REST-Python-Client/wiki/How-to-connect-to-SharePoint-Online-and-and-SharePoint-2013-2016-2019-on-premises--with-app-principal

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cd "$SCRIPT_DIR"/../../.. || exit 1

# --files-only is a flag (it takes no value): process only files within the site(s).
PYTHONPATH=. ./unstructured/ingest/main.py \
    sharepoint \
    --client-id "<Microsoft Sharepoint app client-id>" \
    --client-cred "<Microsoft Sharepoint app client-secret>" \
    --site "<e.g. https://contoso.sharepoint.com or https://contoso.admin.sharepoint.com to process all sites within tenant>" \
    --files-only \
    --output-dir sharepoint-ingest-output \
    --num-processes 2 \
    --path "Shared Documents" \
    --verbose
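
The script's header notes that the credentials must be filled in before running. One possible way to avoid editing the placeholders by hand is sketched below: it reads the values from environment variables and refuses to continue when any are missing. The variable names `SHAREPOINT_CLIENT_ID`, `SHAREPOINT_CLIENT_CRED` and `SHAREPOINT_SITE`, and the helper functions `missing_vars` and `run_sharepoint_ingest`, are hypothetical names chosen for this illustration; the ingest CLI does not read them itself. The sketch also assumes `--files-only` is a boolean flag that takes no value.

```shell
# Print the name of every listed variable that is unset or empty.
# eval is used for a portable indirect lookup, so only pass trusted,
# literal variable names (never user input).
missing_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Run the launcher given as "$@" with the sharepoint options appended.
# Passing the launcher as arguments makes a dry run possible,
# for example: run_sharepoint_ingest printf '%s\n'
run_sharepoint_ingest() {
  "$@" \
    sharepoint \
    --client-id "$SHAREPOINT_CLIENT_ID" \
    --client-cred "$SHAREPOINT_CLIENT_CRED" \
    --site "$SHAREPOINT_SITE" \
    --files-only \
    --output-dir sharepoint-ingest-output \
    --num-processes 2 \
    --path "Shared Documents" \
    --verbose
}
```

With the three variables exported, a caller could first check that `missing_vars SHAREPOINT_CLIENT_ID SHAREPOINT_CLIENT_CRED SHAREPOINT_SITE` prints nothing, and then run `run_sharepoint_ingest env PYTHONPATH=. ./unstructured/ingest/main.py` from the repository root. Keeping the secret in the environment rather than in the script also keeps it out of version control.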