unstructured/test_unstructured_ingest/test-ingest-confluence-diff.sh

45 lines
1.5 KiB
Bash
Raw Normal View History

feat: confluence connector (cloud) (#906) * Add confluence connector and an example script * add test script, add dependency installations * add authentication secret variables for ci tests and actions * add dependency installation commands for workflows * add dependency installation commands for workflows * Update ingest test fixtures (#907) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add add ingest test fixtures update workflow for python 3.10, update example script with dummy values * change workflow name to avoid confusion * change workflow name to avoid confusion * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions * Update ingest test fixtures (#911) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * revert back the test python version matrix * recompile dependencies * modifications for shellcheck * update changelog and version * changelog and version * remove comments * Update ingest test fixtures (#915) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add the option to state the number of spaces to be fetched * add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users * add help message * add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test * change test names * rename connector arg Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * change arg name for connector Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * add comment to example * change arg names * add new tests to ingest test * shellcheck remove redundant statement * Update ingest test fixtures (#932) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * Update ingest test fixtures (#936) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * linting * change file extensions to parse as html * Update ingest test fixtures (#943) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * remove old fixtures * update version to 0.8.2-dev3 * change file to trigger CI * change file to trigger CI * change file to trigger CI * change file to trigger CI --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-18 19:29:41 +01:00
#!/usr/bin/env bash
set -e
# Description: This test checks if all the processed content is the same as the expected outputs
feat: confluence connector (cloud) (#906) * Add confluence connector and an example script * add test script, add dependency installations * add authentication secret variables for ci tests and actions * add dependency installation commands for workflows * add dependency installation commands for workflows * Update ingest test fixtures (#907) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add add ingest test fixtures update workflow for python 3.10, update example script with dummy values * change workflow name to avoid confusion * change workflow name to avoid confusion * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions * Update ingest test fixtures (#911) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * revert back the test python version matrix * recompile dependencies * modifications for shellcheck * update changelog and version * changelog and version * remove comments * Update ingest test fixtures (#915) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add the option to state the number of spaces to be fetched * add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users * add help message * add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test * change test names * rename connector arg Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * change arg name for connector Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * add comment to example * change arg names * add new tests to ingest test * shellcheck remove redundant statement * Update ingest test fixtures (#932) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * Update ingest test fixtures (#936) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * linting * change file extensions to parse as html * Update ingest test fixtures (#943) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * remove old fixtures * update version to 0.8.2-dev3 * change file to trigger CI * change file to trigger CI * change file to trigger CI * change file to trigger CI --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-18 19:29:41 +01:00
SCRIPT_DIR=$(dirname "$(realpath "$0")")
cd "$SCRIPT_DIR"/.. || exit 1
OUTPUT_FOLDER_NAME=confluence-diff
OUTPUT_DIR=$SCRIPT_DIR/structured-output/$OUTPUT_FOLDER_NAME
DOWNLOAD_DIR=$SCRIPT_DIR/download/$OUTPUT_FOLDER_NAME
max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}
CI=${CI:-"false"}
feat: confluence connector (cloud) (#906) * Add confluence connector and an example script * add test script, add dependency installations * add authentication secret variables for ci tests and actions * add dependency installation commands for workflows * add dependency installation commands for workflows * Update ingest test fixtures (#907) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add add ingest test fixtures update workflow for python 3.10, update example script with dummy values * change workflow name to avoid confusion * change workflow name to avoid confusion * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions * Update ingest test fixtures (#911) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * revert back the test python version matrix * recompile dependencies * modifications for shellcheck * update changelog and version * changelog and version * remove comments * Update ingest test fixtures (#915) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add the option to state the number of spaces to be fetched * add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users * add help message * add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test * change test names * rename connector arg Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * change arg name for connector Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * add comment to example * change arg names * add new tests to ingest test * shellcheck remove redundant statement * Update ingest test fixtures (#932) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * Update ingest test fixtures (#936) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * linting * change file extensions to parse as html * Update ingest test fixtures (#943) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * remove old fixtures * update version to 0.8.2-dev3 * change file to trigger CI * change file to trigger CI * change file to trigger CI * change file to trigger CI --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-18 19:29:41 +01:00
# shellcheck disable=SC1091
source "$SCRIPT_DIR"/cleanup.sh
function cleanup() {
cleanup_dir "$OUTPUT_DIR"
if [ "$CI" == "true" ]; then
cleanup_dir "$DOWNLOAD_DIR"
fi
}
trap cleanup EXIT
feat: confluence connector (cloud) (#906) * Add confluence connector and an example script * add test script, add dependency installations * add authentication secret variables for ci tests and actions * add dependency installation commands for workflows * add dependency installation commands for workflows * Update ingest test fixtures (#907) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add add ingest test fixtures update workflow for python 3.10, update example script with dummy values * change workflow name to avoid confusion * change workflow name to avoid confusion * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions * Update ingest test fixtures (#911) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * revert back the test python version matrix * recompile dependencies * modifications for shellcheck * update changelog and version * changelog and version * remove comments * Update ingest test fixtures (#915) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add the option to state the number of spaces to be fetched * add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users * add help message * add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test * change test names * rename connector arg Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * change arg name for connector Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * add comment to example * change arg names * add new tests to ingest test * shellcheck remove redundant statement * Update ingest test fixtures (#932) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * Update ingest test fixtures (#936) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * linting * change file extensions to parse as html * Update ingest test fixtures (#943) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * remove old fixtures * update version to 0.8.2-dev3 * change file to trigger CI * change file to trigger CI * change file to trigger CI * change file to trigger CI --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-18 19:29:41 +01:00
if [ -z "$CONFLUENCE_USER_EMAIL" ] || [ -z "$CONFLUENCE_API_TOKEN" ]; then
echo "Skipping Confluence ingest test because the CONFLUENCE_USER_EMAIL or CONFLUENCE_API_TOKEN env var is not set."
exit 0
fi
PYTHONPATH=. ./unstructured/ingest/main.py \
confluence \
feat: confluence connector (cloud) (#906) * Add confluence connector and an example script * add test script, add dependency installations * add authentication secret variables for ci tests and actions * add dependency installation commands for workflows * add dependency installation commands for workflows * Update ingest test fixtures (#907) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add add ingest test fixtures update workflow for python 3.10, update example script with dummy values * change workflow name to avoid confusion * change workflow name to avoid confusion * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions * Update ingest test fixtures (#911) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * revert back the test python version matrix * recompile dependencies * modifications for shellcheck * update changelog and version * changelog and version * remove comments * Update ingest test fixtures (#915) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add the option to state the number of spaces to be fetched * add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users * add help message * add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test * change test names * rename connector arg Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * change arg name for connector Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * add comment to example * change arg names * add new tests to ingest test * shellcheck remove redundant statement * Update ingest test fixtures (#932) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * Update ingest test fixtures (#936) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * linting * change file extensions to parse as html * Update ingest test fixtures (#943) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * remove old fixtures * update version to 0.8.2-dev3 * change file to trigger CI * change file to trigger CI * change file to trigger CI * change file to trigger CI --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-18 19:29:41 +01:00
--download-dir "$DOWNLOAD_DIR" \
Feat: Create a naive hierarchy for elements (#1268) ## **Summary** By adding hierarchy to unstructured elements, users will have more information for implementing vector db/LLM chunking strategies. For example, text elements could be queried by their preceding title element. The hierarchy is implemented by a parent_id tag in the element's metadata. ### Features - Introduces a parent_id to ElementMetadata (The id of the parent element, not a pointer) - Creates a rule set for assigning hierarchies. Sensible default is assigned, with an optional override parameter - Sets element parent ids if there isn't an existing parent id or matches the ruleset ### How it works Hierarchies are assigned via a parent id field in element metadata. Elements are read sequentially and evaluated against a ruleset. For example take the following elements: 1. Title, "This is the Title" 2. Text, "this is the text" And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of 2 will be the id of 1. The algorithm for determining this is more complex and resolves several edge cases, so please read the code for further details. ### Schema Changes ``` @dataclass class ElementMetadata: coordinates: Optional[CoordinatesMetadata] = None data_source: Optional[DataSourceMetadata] = None filename: Optional[str] = None file_directory: Optional[str] = None last_modified: Optional[str] = None filetype: Optional[str] = None attached_to_filename: Optional[str] = None + parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None + category_depth: Optional[int] = None ... ``` ### Testing ``` from unstructured.partition.auto import partition from typing import List elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto") for element in elements: print( f"Category: {getattr(element, 'category', '')}\n"\ f"Text: {getattr(element, 'text', '')}\n" f"ID: {element.id}\n" \ f"Parent ID: {element.metadata.parent_id}\n"\ f"Depth: {element.metadata.category_depth}\n" \ ) ``` ### Additional Notes Implementing this feature revealed a possibly undesired side-effect in how element metadata are processed. In `unstructured/partition/common.py` the `_add_element_metadata` is invoked as part of the `add_metadata_with_filetype` decorator for filetype partitioning. This method is intended to add additional information to the metadata generated with the element including filename and filetype, however the existing metadata is merged into a newly created metadata object rather than the other way around. Because of the way it's structured, new metadata fields can easily be forgotten and pose debugging challenges to developers. This likely warrants a new issue. I'm guessing that the implementation is done this way to avoid issues with deserializing elements, but could be wrong. --------- Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
2023-09-14 11:23:16 -04:00
--metadata-exclude filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
--num-processes "$max_processes" \
feat: confluence connector (cloud) (#906) * Add confluence connector and an example script * add test script, add dependency installations * add authentication secret variables for ci tests and actions * add dependency installation commands for workflows * add dependency installation commands for workflows * Update ingest test fixtures (#907) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add add ingest test fixtures update workflow for python 3.10, update example script with dummy values * change workflow name to avoid confusion * change workflow name to avoid confusion * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions * Update ingest test fixtures (#911) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * revert back the test python version matrix * recompile dependencies * modifications for shellcheck * update changelog and version * changelog and version * remove comments * Update ingest test fixtures (#915) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add the option to state the number of spaces to be fetched * add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users * add help message * add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test * change test names * rename connector arg Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * change arg name for connector Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * add comment to example * change arg names * add new tests to ingest test * shellcheck remove redundant statement * Update ingest test fixtures (#932) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * Update ingest test fixtures (#936) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * linting * change file extensions to parse as html * Update ingest test fixtures (#943) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * remove old fixtures * update version to 0.8.2-dev3 * change file to trigger CI * change file to trigger CI * change file to trigger CI * change file to trigger CI --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-18 19:29:41 +01:00
--preserve-downloads \
--reprocess \
--output-dir "$OUTPUT_DIR" \
--verbose \
--url https://unstructured-ingest-test.atlassian.net \
--user-email "$CONFLUENCE_USER_EMAIL" \
--api-token "$CONFLUENCE_API_TOKEN" \
--spaces testteamsp,MFS
feat: confluence connector (cloud) (#906) * Add confluence connector and an example script * add test script, add dependency installations * add authentication secret variables for ci tests and actions * add dependency installation commands for workflows * add dependency installation commands for workflows * Update ingest test fixtures (#907) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add add ingest test fixtures update workflow for python 3.10, update example script with dummy values * change workflow name to avoid confusion * change workflow name to avoid confusion * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions * Update ingest test fixtures (#911) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * revert back the test python version matrix * recompile dependencies * modifications for shellcheck * update changelog and version * changelog and version * remove comments * Update ingest test fixtures (#915) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add the option to state the number of spaces to be fetched * add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users * add help message * add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test * change test names * rename connector arg Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * change arg name for connector Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * add comment to example * change arg names * add new tests to ingest test * shellcheck remove redundant statement * Update ingest test fixtures (#932) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * Update ingest test fixtures (#936) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * linting * change file extensions to parse as html * Update ingest test fixtures (#943) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * remove old fixtures * update version to 0.8.2-dev3 * change file to trigger CI * change file to trigger CI * change file to trigger CI * change file to trigger CI --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-18 19:29:41 +01:00
"$SCRIPT_DIR"/check-diff-expected-output.sh $OUTPUT_FOLDER_NAME