Ahmet Melek b7674fb97e
feat: confluence connector (cloud) (#906)
* Add confluence connector and an example script

* add test script, add dependency installations

* add authentication secret variables for ci tests and actions

* add dependency installation commands for workflows

* add dependency installation commands for workflows

* Update ingest test fixtures (#907)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* add ingest test fixtures update workflow for python 3.10, update example script with dummy values

* change workflow name to avoid confusion

* change workflow name to avoid confusion

* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update

* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions

* Update ingest test fixtures (#911)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* revert back the test python version matrix

* recompile dependencies

* modifications for shellcheck

* update changelog and version

* changelog and version

* remove comments

* Update ingest test fixtures (#915)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* add the option to state the number of spaces to be fetched

* add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users

* add help message

* add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test

* change test names

* rename connector arg

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* change arg name for connector

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* add comment to example

* change arg names

* add new tests to ingest test

* shellcheck remove redundant statement

* Update ingest test fixtures (#932)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* Update ingest test fixtures (#936)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* linting

* change file extensions to parse as html

* Update ingest test fixtures (#943)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* remove old fixtures

* update version to 0.8.2-dev3

* change file to trigger CI

* change file to trigger CI

* change file to trigger CI

* change file to trigger CI

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-18 19:29:41 +01:00

#!/usr/bin/env bash
# Processes all the documents in all spaces within a Confluence domain, using the `unstructured` library.
# Structured outputs are stored in the confluence-ingest-output directory.
SCRIPT_DIR=$(dirname "$(realpath "$0")")
cd "$SCRIPT_DIR"/../../.. || exit 1
# For security, keep your authentication variables in a separate file and source it
# rather than hardcoding them, then pass them to the command below:
# source "./../../secrets/confluence.txt"
# ...
# --confluence-user-email "$CONFLUENCE_USER_EMAIL"
# --confluence-api-token "$CONFLUENCE_API_TOKEN"
# Other arguments that you can use:
# --confluence-max-num-of-spaces 10
# --> The maximum number of spaces to be ingested. Set to 10 in this example.
# --confluence-list-of-spaces testteamsp1,testteamsp2
# --> A comma-separated list of space ids for the spaces to be ingested.
# --> Avoid using --confluence-list-of-spaces and --confluence-max-num-of-spaces at the same time.
# --confluence-max-num-of-docs-from-each-space 250
# --> The maximum number of documents to be ingested from each space. Set to 250 in this example.
PYTHONPATH=. ./unstructured/ingest/main.py \
    --metadata-exclude filename,file_directory,metadata.data_source.date_processed \
    --confluence-url https://unstructured-ingest-test.atlassian.net \
    --confluence-user-email 12345678@unstructured.io \
    --confluence-api-token ABCDE1234ABDE1234ABCDE1234 \
    --structured-output-dir confluence-ingest-output \
    --num-processes 2
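
The comments in the script suggest sourcing credentials from a separate file and warn against combining the two space-selection options. A minimal sketch of how that could be enforced before calling the ingest command is shown below. The function name `build_confluence_args`, the array `CONFLUENCE_ARGS`, and the variables `CONFLUENCE_SPACES` and `CONFLUENCE_MAX_SPACES` are hypothetical names introduced for this example; only the `--confluence-*` flags and the two credential variables come from the script itself.

```bash
#!/usr/bin/env bash
# Sketch only: builds the Confluence arguments from environment variables.
# CONFLUENCE_SPACES and CONFLUENCE_MAX_SPACES are hypothetical variable names;
# the flags are the ones documented in the example script.

build_confluence_args() {
  CONFLUENCE_ARGS=()

  # Fail early if the credentials were not sourced or exported.
  if [ -z "${CONFLUENCE_USER_EMAIL:-}" ] || [ -z "${CONFLUENCE_API_TOKEN:-}" ]; then
    echo "CONFLUENCE_USER_EMAIL and CONFLUENCE_API_TOKEN must be set" >&2
    return 1
  fi

  # The two space-selection options should not be used at the same time.
  if [ -n "${CONFLUENCE_SPACES:-}" ] && [ -n "${CONFLUENCE_MAX_SPACES:-}" ]; then
    echo "Set only one of CONFLUENCE_SPACES or CONFLUENCE_MAX_SPACES" >&2
    return 1
  fi

  CONFLUENCE_ARGS+=(--confluence-user-email "$CONFLUENCE_USER_EMAIL")
  CONFLUENCE_ARGS+=(--confluence-api-token "$CONFLUENCE_API_TOKEN")

  # Add at most one space-selection option.
  if [ -n "${CONFLUENCE_SPACES:-}" ]; then
    CONFLUENCE_ARGS+=(--confluence-list-of-spaces "$CONFLUENCE_SPACES")
  elif [ -n "${CONFLUENCE_MAX_SPACES:-}" ]; then
    CONFLUENCE_ARGS+=(--confluence-max-num-of-spaces "$CONFLUENCE_MAX_SPACES")
  fi
}
```

One possible use is to call `build_confluence_args || exit 1` after sourcing the secrets file, then append `"${CONFLUENCE_ARGS[@]}"` to the `main.py` invocation in place of the hardcoded email and token. Keeping the arguments in an array preserves quoting for values that contain spaces or commas.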