Roman Isecke 9d81971fcb
update ingest python doc (#1446)
### Description
Updating the python version of the example docs to show how to run the
same code that the CLI runs, but using python. Rather than copying the
same command that would be run via the terminal and using the subprocess
library to run it, this updates it to use the supported code exposed in
the inference directory.

For now only the wikipedia one has been updated to get some opinions on
this before updating all other connector docs.

Would close out
https://github.com/Unstructured-IO/unstructured/issues/1445
2023-10-03 10:01:41 -04:00

30 lines
1.2 KiB
Bash
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

#!/usr/bin/env bash
# Processes the Unstructured-IO/unstructured repository
# through Unstructured's library in 2 processes.
# Structured outputs are stored in sharepoint-ingest-output/
# NOTE, this script is not ready-to-run!
# You must enter a MS Sharepoint app client-id, client secret and sharepoint site url
# before running.
# To get the credentials for your Sharepoint app, follow these steps:
# https://github.com/vgrem/Office365-REST-Python-Client/wiki/How-to-connect-to-SharePoint-Online-and-and-SharePoint-2013-2016-2019-on-premises--with-app-principal
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cd "$SCRIPT_DIR"/../../.. || exit 1
PYTHONPATH=. ./unstructured/ingest/main.py \
sharepoint \
--client-id "<Microsoft Sharepoint app client-id>" \
--client-cred "<Microsoft Sharepoint app client-secret>" \
--site "<e.g https://contoso.sharepoint.com or https://contoso.admin.sharepoint.com to process all sites within tenant>" \
--files-only "Flag to process only files within the site(s)" \
--output-dir sharepoint-ingest-output \
--num-processes 2 \
--path "Shared Documents" \
--verbose