unstructured/test_unstructured_ingest/test-ingest-elasticsearch.sh

#!/usr/bin/env bash

set -e

SCRIPT_DIR=$(dirname "$(realpath "$0")")
cd "$SCRIPT_DIR"/.. || exit 1
echo "SCRIPT_DIR: $SCRIPT_DIR"
OUTPUT_FOLDER_NAME=elasticsearch
OUTPUT_DIR=$SCRIPT_DIR/structured-output/$OUTPUT_FOLDER_NAME
DOWNLOAD_DIR=$SCRIPT_DIR/download/$OUTPUT_FOLDER_NAME

# Create the Elasticsearch test container and check that it is reachable
sh scripts/elasticsearch-test-helpers/create-and-check-es.sh
wait

# Kill the container so the script can be repeatedly run using the same ports
trap 'echo "Stopping Elasticsearch Docker container"; docker stop es-test' EXIT

PYTHONPATH=. ./unstructured/ingest/main.py \
    elasticsearch \
    --download-dir "$DOWNLOAD_DIR" \
    --metadata-exclude filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
    --num-processes 2 \
    --preserve-downloads \
    --reprocess \
    --output-dir "$OUTPUT_DIR" \
    --verbose \
    --index-name movies \
    --url http://localhost:9200 \
    --jq-query '{ethnicity, director, plot}'

echo "SCRIPT_DIR: $SCRIPT_DIR"
sh "$SCRIPT_DIR"/check-diff-expected-output.sh "$OUTPUT_FOLDER_NAME"