#!/usr/bin/env bash
set -e
if [ -z "$UNS_API_KEY" ]; then
  echo "Skipping ingest test against api because the UNS_API_KEY env var is not set."
  exit 8
fi
SRC_PATH=$(dirname "$(realpath "$0")")
SCRIPT_DIR=$(dirname "$SRC_PATH")
cd "$SCRIPT_DIR"/.. || exit 1
OUTPUT_FOLDER_NAME=api-ingest-output
OUTPUT_ROOT=${OUTPUT_ROOT:-$SCRIPT_DIR}
OUTPUT_DIR=$OUTPUT_ROOT/structured-output/$OUTPUT_FOLDER_NAME
WORK_DIR=$OUTPUT_ROOT/workdir/$OUTPUT_FOLDER_NAME
max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}
# shellcheck disable=SC1091
source "$SCRIPT_DIR"/cleanup.sh
function cleanup() {
  cleanup_dir "$OUTPUT_DIR"
  cleanup_dir "$WORK_DIR"
}
trap cleanup EXIT
TEST_FILE_NAME=layout-parser-paper-with-table.pdf
# --pdf-infer-table-structure is passed below to validate that partition arguments are forwarded to the api
RUN_SCRIPT=${RUN_SCRIPT:-./unstructured/ingest/main.py}
PYTHONPATH=${PYTHONPATH:-.} "$RUN_SCRIPT" \
  local \
  --api-key "$UNS_API_KEY" \
  --metadata-exclude coordinates,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
  --partition-by-api \
  --strategy hi_res \
  --pdf-infer-table-structure \
  --reprocess \
  --output-dir "$OUTPUT_DIR" \
  --verbose \
  --num-processes "$max_processes" \
  --input-path "example-docs/$TEST_FILE_NAME" \
  --work-dir "$WORK_DIR"
RESULT_FILE_PATH="$OUTPUT_DIR/example-docs/$TEST_FILE_NAME.json"
# validate that there is at least one table with text_as_html in the results;
# jq -e exits non-zero when the result is false or when the file cannot be read or parsed
if ! jq -e 'any(.[]; .metadata.text_as_html != null)' "$RESULT_FILE_PATH" >/dev/null; then
  echo "No table with text_as_html found in $RESULT_FILE_PATH but at least one was expected."
  exit 1
fi