**Summary**
The CI workflow for evaluation previously saved the metric outputs directly to
the `metrics/` folder. They are now organized into subfolders, e.g.
`metrics/text-extraction` and `metrics/element-type`, to keep the folder tidy.
Additionally, the Azure connector is added to
`full_python_matrix_tests` in this PR.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
### Description
Update any use of OpenAI for generating embeddings in the ingest tests
to use Huggingface
**Bonus Changes:**
* Remove duplicate delta table test
* Delete the delta table destination directory at the beginning of the test
to make sure it doesn't already exist, preventing the test from breaking.
Addresses a cluster of HTML-related bugs:
- empty table is identified as bulleted-table
- `partition_html()` emits empty (no text) tables (#1928)
- `.text_as_html` contains inappropriate `<br>` elements in invalid
locations.
- cells enclosed in `<thead>` and `<tfoot>` elements are dropped (#1928)
- `.text_as_html` contains whitespace padding
Each of these is addressed in a separate commit below.
Fixes #1928.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
When passed an absolute file path for the input document path, the local
connector incorrectly writes the output file to the wrong directory.
Also, in the single-file input path case we were including the parent
path as part of the destination; instead, when a single file is
specified as input, the output file should be located directly in the
specified output directory. This fixes things so that the output in this
case is written to `output-dir/input-filename.json`. Note: this change
meant we needed to bump the file path of some expected results.
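As a rough illustration of the intended mapping (a minimal sketch with hypothetical names, not the connector's actual code):

```python
from pathlib import Path

def output_path_for_single_file(output_dir: str, input_path: str) -> Path:
    """For a single-file input, write directly under the output dir."""
    # use only the file name, never the input file's parent directories
    return Path(output_dir) / f"{Path(input_path).name}.json"

# an absolute input path no longer leaks into the destination
assert output_path_for_single_file(
    "output-dir", "/abs/example-docs/UDHR_first_article_all.txt"
) == Path("output-dir/UDHR_first_article_all.txt.json")
```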
## Changes
- Fix for incorrect output path of files partitioned via the local
connector when the input path is a file path (rather than directory)
- Updated single-local-file test to validate the flow where we specify
an absolute file path (since this was particularly broken)
## Testing
Note: running the updated `local-single-file` test without the changes
to the local connector will result in a final output copy of:
```
Copying /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/workdir/local-single-file/partitioned/a48c2abec07a9a31860429f94e5a6ade.json -> /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/../example-docs/language-docs/UDHR_first_article_all.txt.json
```
where the output path is the input path rather than the expected
`output-dir/input-filename.json`.
With this change, the file is written to the expected directory.
---------
Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
### Description
This adds the basic implementation of pushing the generated JSON output
of partition to MongoDB. None of this code provisions the MongoDB
instance, so things like adding a search index around the embedding
content must be done by the user. Any sort of schema validation would
also have to take place via user-specific configuration on the database.
This update makes no assumptions about the configuration of the database
itself.
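For orientation, a minimal sketch of what pushing partition output to MongoDB can look like with `pymongo` (connection string, database, and collection names are hypothetical):

```python
import json

from pymongo import MongoClient

# hypothetical connection details; provisioning is entirely up to the user
client = MongoClient("mongodb://localhost:27017")
collection = client["unstructured"]["elements"]

# load the JSON output produced by partition and insert it as-is;
# no client-side schema validation is applied
with open("output/example.json") as f:
    elements = json.load(f)
collection.insert_many(elements)
```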
### Description
To avoid requiring additional dependencies on cloud-related CLIs (i.e.
gcloud and az), the overhead work associated with destination ingest
tests now interacts with those providers using Python and the existing
dependencies already used to run our code.
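For example, the GCS cleanup step can use `gcsfs`, which the GCS connector already depends on, rather than shelling out to gcloud (a minimal sketch; the bucket path is hypothetical):

```python
from gcsfs import GCSFileSystem

fs = GCSFileSystem()
# remove a temporary test destination without the gcloud CLI
fs.rm("gs://utic-test-ingest-fixtures-output/some-test-run", recursive=True)
```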
When testing ingest tests, one often wants to keep the `.json` output or
generated metrics files around for inspection after the fact. This
updates the bash condition to actually honor the comment that mentions
`# export UNSTRUCTURED_CLEANUP_DEV_FIXTURES=1`
**Test Instructions**
Run:
export UNSTRUCTURED_CLEANUP_DEV_FIXTURES=1
./test_unstructured_ingest/src/s3.sh
./test_unstructured_ingest/evaluation-metrics.sh text-extraction
and witness that test directories/files do not get cleaned up, e.g.
`test_unstructured_ingest/metrics-tmp/`. One can also add a `set -x` at
the top of `test_unstructured_ingest/cleanup.sh` to see what is getting
skipped (it's a lot!).
- The copy script only went through one layer of subdirectories, so it did
not find the match between the manifest file and the structured output.
It is now edited to search all subdirectories.
- `set -e` causes the script to exit on any non-zero exit status rather
than `exit 0`; all scripts that need to run the copy script are fixed to
`set +e` right before the diff check, then switch back to `set -e` after.
- Edit the default evaluation metrics output folder from `metrics` to
`metrics-tmp` to accommodate the diff check.
- Add a script that checks the differences between the old eval metrics
output (`metrics`) and the new eval metrics output (`metrics-tmp`).
- Allow the overwrite destination to be set to `OUTPUT_ROOT` instead
of defaulting to the script dir.
## Test
Run
```bash
OVERWRITE_FIXTURES=true OUTPUT_ROOT=/tmp ./test_unstructured_ingest/src/s3.sh
```
With this change we should find new files generated under
`/tmp/expected-structured-output/s3`.
Without this change there will be no such new files.
Canonicalize the JSON produced for ingest tests such that incidental
changes in the _form_ of the JSON objects (keys moving around) that do
not change the _content_ of the JSON object do not trigger an
ingest-test failure.
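The essence of canonicalization is serializing with a deterministic key order, along the lines of this minimal sketch (not necessarily the exact helper used by the tests):

```python
import json

def canonicalize(obj: dict) -> str:
    """Serialize with sorted keys so key order can't cause spurious diffs."""
    return json.dumps(obj, sort_keys=True, indent=2, ensure_ascii=False)

# same content, different key order -> identical canonical form
assert canonicalize({"a": 1, "b": 2}) == canonicalize({"b": 2, "a": 1})
```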
Closes #2038.
### Summary
The `fast` strategy should not fall back to a more expensive strategy.
### Testing
For
[9493801-p17.pdf](https://github.com/Unstructured-IO/unstructured/files/13292884/9493801-p17.pdf),
the following code should return an empty list.
```python
from unstructured.partition.auto import partition

elements = partition(filename="9493801-p17.pdf", strategy="fast")
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
- parametrize the output folder paths and expected output folder paths
in the comparison scripts
- now allows the user to use the env var `OUTPUT_ROOT` to control where
the output and expected output are
- currently assumes the output from the test and the expected output are
in the same directory; this may need separation later
## Test
Run
```bash
OUTPUT_ROOT=/tmp ./test_unstructured_ingest/test-ingest-src.sh
```
and it should show files as changed but not be able to show a diff,
since there is no expected output content at `OUTPUT_ROOT`.
Then run
```bash
cp -R test_unstructured_ingest/expected-* /tmp/
OUTPUT_ROOT=/tmp ./test_unstructured_ingest/test-ingest-src.sh
```
and we can see an actual line-by-line diff (since CI and local instances
produce different results).
### Summary
Closes #2011
`languages` was missing from the metadata when partitioning pdfs via
`hi_res` and `fast` strategies and missing from image partitions via
`hi_res`. This PR adds `languages` to the relevant function calls so it
is included in the resulting elements.
### Testing
On the main branch, `partition_image` will include `languages` when
`strategy='ocr_only'`, but not when `strategy='hi_res'`:
```
filename = "example-docs/english-and-korean.png"
from unstructured.partition.image import partition_image
elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor'])
elements[0].metadata.languages
elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor'])
elements[0].metadata.languages
```
For `partition_pdf`, `'ocr_only'` will include `languages` in the
metadata, but `'fast'` and `'hi_res'` will not.
```
filename = "example-docs/korean-text-with-tables.pdf"
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename, strategy="ocr_only", languages=['kor'])
elements[0].metadata.languages
elements = partition_pdf(filename, strategy="fast", languages=['kor'])
elements[0].metadata.languages
elements = partition_pdf(filename, strategy="hi_res", languages=['kor'])
elements[0].metadata.languages
```
On this branch, `languages` is included in the metadata regardless of
strategy.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Page breaks are reliably indicated by `w:lastRenderedPageBreak` elements
present in the document XML. Page breaks are NOT reliably indicated by
"hard" page breaks inserted by the author; when present, those are
redundant to a `w:lastRenderedPageBreak` element, so they cause
over-counting if used. Use rendered page breaks only.
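A minimal sketch of counting rendered page breaks with `python-docx` (illustrative only, not the partitioner's actual code):

```python
from docx import Document

document = Document("example-docs/sample.docx")  # hypothetical path
# w:lastRenderedPageBreak marks where Word last rendered a page boundary
breaks = document.element.xpath(".//w:lastRenderedPageBreak")
page_count = len(breaks) + 1  # N breaks separate N + 1 pages
```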
Closes #1064
When using the `--partition-by-api` flag via unstructured-ingest, none
of the partition arguments are forwarded, meaning that these options are
disregarded. With this change, we now pass through all of the relevant
partition arguments to the api.
## Changes
* parse and pass relevant partition arguments to the API in
unstructured-ingest
* bonus: leverage an existing `partition.api` function to call out to
the API rather than including duplicative request logic in
unstructured-ingest
* bonus: `--pdf-infer-table-structure` is now a flag, not an arg (it
defaults to false anyway; this is more succinct and consistent with
similar parameters)
* bonus: adds `hi_res_model_name` so a user can specify the model to
leverage when using a `hi_res` strategy
## Testing
* updated the `against_api.sh` source test script to specify a partition
argument and validate that the response from the API respected that
argument
* manually ran a request and validated that it was processed with
chipper as specified (not sure if we want to bake a chipper request into
the CI tests) (validated that the response leveraged the chipper model):
```
PYTHONPATH=. ./unstructured/ingest/main.py \
local \
--output-dir /tmp/ingest-requests/chipper \
--verbose \
--reprocess \
--strategy hi_res \
--partition-by-api \
--hi-res-model-name chipper \
--api-key "$API_KEY" \
--input-path 'example-docs/layout-parser-paper-with-table.pdf'
```
Intermittently, the various destination tests will fail with:
```
--- Cleanup done ---
gs://utic-test-ingest-fixtures-output/1699377964/example-docs/
deleting gs://utic-test-ingest-fixtures-output/1699377964
Removing objects:
ERROR: (gcloud.storage.rm) The following URLs matched no objects or files:
-gs://utic-test-ingest-fixtures-output/1699377964
Last ran script: gcs.sh
Error: Process completed with exit code 1.
```
Reference trace
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/6787927424/job/18452240764?pr=2020)
After some investigation it looks like this error is due to collisions
that occur because we're assuming 1-second timestamp accuracy is
sufficient when generating (and deleting) "unique" test destination
location names. The likelihood of a collision is actually pretty high
given that we run these tests against a test matrix.
Instead we should just use a uuid for these unique destinations.
## Changes
- Use uuidgen instead of `date +%s` for unique destinations
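The same idea in Python, for illustration (the test scripts themselves use `uuidgen` in bash):

```python
import uuid

# a collision-resistant destination name, independent of wall-clock time
destination = f"gs://utic-test-ingest-fixtures-output/{uuid.uuid4()}"
```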
### Summary
Click-decorated functions cannot (properly) be called outside of the
click interface. This makes it difficult to reuse the setup
functionality in `measure_text_edit_distance` or
`measure_element_type_accuracy`. This PR removes the click decoration
and separates it into a wrapper function purely to execute the command.
### Technical Details
- Changed as suggested in the response to [this StackOverflow
post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command)
- The locations of these now-distinct functions are separate: the
`_command` click-decorated functions stay in `ingest/evaluate.py`, and
the core functions `measure_text_edit_distance` and
`measure_element_type_accuracy` are moved into the
`unstructured/metrics/` folder (which is a more logical location for
them).
- Initial test added for `measure_text_edit_distance`
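A minimal sketch of the wrapper pattern described above (hypothetical option wiring):

```python
import click

def measure_text_edit_distance(output_dir: str) -> None:
    """Core logic, importable and callable from other Python code."""
    ...

@click.command()
@click.option("--output_dir", type=str)
def measure_text_edit_distance_command(output_dir: str) -> None:
    """Thin click wrapper that only parses CLI args and delegates."""
    measure_text_edit_distance(output_dir=output_dir)
```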
### Test
Run `sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction`;
functionality is unchanged.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Closes #1782
This PR:
- Extends ingest pipeline so that it is possible to select an embedding
provider from a range of providers
- Modifies the ingest embedding test to be a diff test, since the
embedding vectors are reproducible after supporting multiple providers
Additional info on the chosen provider for the test:
- Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic
even when no seed is set
- Took 6.84s to pass a unit test with the provider (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs locally, making it
zero cost
For all these reasons, testing embedding modules with the Huggingface
model seems to make sense.
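A quick determinism check along these lines (the model name here is an assumption, not necessarily the one used in the test):

```python
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# two runs over the same text produce identical vectors, no seed required
assert embeddings.embed_query("hello world") == embeddings.embed_query("hello world")
```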
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
### Description
* A full schema was introduced to map the type of all output content
from the JSON partition output to a flattened table structure, to
leverage table-based destination connectors. The delta table
destination connector was updated to take advantage of this (a sketch
of the flattening idea follows this list).
* The existing method to convert to a dataframe was updated because it
had a bug in it: object content in the metadata would have its key name
changed when flattened, but would then be omitted since it didn't exist
in the `_get_metadata_table_fieldnames` response.
* A unit test was added to make sure we handle all possible values in an
Element when converting to a table
* The delta table ingest test was split into a source and a destination
test (looking ahead to splitting these up in CI)
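A minimal sketch of the flattening idea (hypothetical helper; the real schema maps many more fields):

```python
from typing import Any, Dict, Optional

def flatten(value: Any, prefix: str = "", row: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Flatten nested element/metadata dicts into flat table columns."""
    row = {} if row is None else row
    if isinstance(value, dict):
        for key, nested in value.items():
            flatten(nested, f"{prefix}_{key}" if prefix else key, row)
    else:
        row[prefix] = value
    return row

element = {"type": "Title", "metadata": {"page_number": 1, "languages": ["eng"]}}
assert flatten(element) == {
    "type": "Title",
    "metadata_page_number": 1,
    "metadata_languages": ["eng"],
}
```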
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Description
Update all destination tests to match pattern:
* Don't omit any metadata to check full schema
* Move azure cognitive dest test from src to dest
* Split delta table test into separate src and dest tests
* Fix azure cognitive search and add it to the dest tests being run (it
wasn't being run originally)
This PR resolves
[CORE-2453](https://unstructured-ai.atlassian.net/browse/CORE-2453):
- parametrizes the output folder so that ingest output files can be
saved somewhere other than where the scripts are; this is set by
env `OUTPUT_ROOT`
- parametrizes the python path `PYTHONPATH` to first check an existing
definition before defaulting to `.`, the current folder
- parametrizes the run script that carries out ingest using `RUN_SCRIPT`;
the default is still `./unstructured/ingest/main.py`
These changes allow us to run ingest tests with more control. To test:
- run `OUTPUT_ROOT=/tmp
./test_unstructured_ingest/src/local-single-file.sh`: the output should
now be in `/tmp` instead of in the ingest test folder
- run `RUN_SCRIPT=/hope/you/do/not/have/this/folder
./test_unstructured_ingest/src/local-single-file.sh`: this should raise
an error because the system can't find `/hope/you/do/not/have/this/folder`
- run `RUN_SCRIPT=./unstructured/ingest/main.py
./test_unstructured_ingest/src/local-single-file.sh` should run as
normal
- do the following
```bash
cp ./unstructured/ingest/main.py /tmp/main.py
OUTPUT_ROOT=/tmp PYTHONPATH=$(pwd) RUN_SCRIPT=./unstructured/ingest/main.py ./test_unstructured_ingest/src/local-single-file.sh
```
This will run and generate output at `/tmp`.
### Description
To always support the latest changes to the partition method and the
possible kwargs it supports, the ingest CLI has been refactored to take
in a valid JSON string representing those values, allowing a user more
flexibility in controlling the partition method.
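Conceptually, the CLI then only needs to deserialize one string and splat it into the partition call, roughly like this sketch (hypothetical, not the actual CLI code):

```python
import json

from unstructured.partition.auto import partition

# a single JSON string can carry arbitrary partition kwargs
partition_kwargs = json.loads('{"strategy": "fast", "languages": ["eng"]}')
elements = partition(filename="example-docs/fake-text.txt", **partition_kwargs)
```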
### Summary
To combine ingest and holistic metrics efforts, add the `doctype` field
to the results from the functions in evaluate.py for use in subsequent
aggregation functions.
### Test
Run `sh ./test_unstructured_ingest/evaluation-metrics.sh
text-extraction` and there will be a new doctype column with the file's
doctype extension.
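For illustration, the doctype is just the source file's extension, e.g.:

```python
from pathlib import Path

# a minimal sketch: derive the doctype column value from the file name
doctype = Path("example-docs/fake-email.eml").suffix.lstrip(".")
assert doctype == "eml"
```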
<img width="508" alt="Screenshot 2023-11-01 at 2 23 11 PM"
src="https://github.com/Unstructured-IO/unstructured/assets/42684285/44583da9-e7ef-4142-be72-c2247b954bcf">
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
### Description
This splits the source ingest tests from the destination ingest tests,
since they follow different patterns:
* src tests pull data from a source and compare the partitioned content
to the expected results
* destination tests leverage the local connector to produce results to
push to a destination, with overhead to create temporary locations at
those destinations to write to and to delete them when done
Only the src tests create partitioned content that needs to be checked,
so the updated ingest test CI job only needs to run these.
Refactor the evaluation scripts, including
`unstructured/ingest/evaluation.py` and
`test_unstructured_ingest/evaluation-metrics.sh`, for more structured
code and usage.
- The script now uses only a single Python script call with params
- Adds a function to build strings for output_args (`--output_dir
--output_list`) and source_args (`--source_dir --source_args`)
- Now accepts the evaluation to call as a param; currently only accepts
`text-extraction` and `element-type`
Example call:
```sh
evaluation-metrics.sh text-extraction
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
### Description
### Google Drive
The existing service account parameter was expanded to support either a
file path or a JSON value to generate the credentials when instantiating
the Google Drive client.
### GCS
Google Cloud Storage already supports the value being passed in, from
their docstring:
> - you may supply a token generated by the
[gcloud](https://cloud.google.com/sdk/docs/)
utility; this is either a python dictionary, the name of a file
containing the JSON returned by logging in with the gcloud CLI tool,
or a Credentials object.
I tested this locally:
```python
from gcsfs import GCSFileSystem
import json
with open("/Users/romanisecke/.ssh/google-cloud-unstructured-ingest-test-d4fc30286d9d.json") as json_file:
json_data = json.load(json_file)
print(json_data)
fs = GCSFileSystem(token=json_data)
print(fs.ls(path="gs://utic-test-ingest-fixtures/"))
```
`['utic-test-ingest-fixtures/ideas-page.html',
'utic-test-ingest-fixtures/nested-1',
'utic-test-ingest-fixtures/nested-2']`
Closes: #1891 (check the issue for more info)
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
Co-authored-by: Yao You <yao@unstructured.io>
### Summary
Update `ocr_only` strategy in `partition_pdf()`. This PR adds the
functionality to get accurate coordinate data when partitioning PDFs and
Images with the `ocr_only` strategy.
- Add functionality to perform OCR region grouping based on the OCR text
taken from `pytesseract.image_to_string()`
- Add functionality to get layout elements from OCR regions (ocr_layout)
for both `tesseract` and `paddle`
- Add functionality to determine the `source` of merged text regions
when merging text regions in `merge_text_regions()`
- Merge multiple test functions related to "ocr_only" strategy into
`test_partition_pdf_with_ocr_only_strategy()`
- This PR also fixes [issue
#1792](https://github.com/Unstructured-IO/unstructured/issues/1792)
### Evaluation
```
# Image
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/double-column-A.jpg ocr_only xy-cut image
# PDF
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf ocr_only xy-cut pdf
```
### Test
- **Before update**
All elements have the same coordinate data

- **After update**
All elements have accurate coordinate data

---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
This PR introduces `clean_pdfminer_inner_elements`, which deletes
pdfminer elements inside other detection origins such as YoloX or
detectron. This function returns the cleaned document.
Also, the ingest-test fixtures were updated to reflect the new standard
output.
The best way to check that this function is working properly is to check
the new test `test_clean_pdfminer_inner_elements` in
`test_unstructured/partition/utils/test_processing_elements.py`.
---------
Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
### Description
* If the contents of a doc were updated by the process of
reading/downloading it, this was not being persisted. To fix this, the
data being passed around was updated to use a multiprocessing-safe dict
rather than the JSON string. That dict is now updated after the
`get_file` method is called.
* Wikipedia connector was updated to use a static filename rather than
one requiring a call to fetch data.
* The read config param `re_download` was not being leveraged by the
source node; this was fixed.
* Added fix: reversed the chunking and embedding order so that chunking
runs before embedding
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Summary
A follow-up to
https://github.com/Unstructured-IO/unstructured/pull/1801: I forgot to
remove the lines that pass `extract_tables` to inference, and noted the
table regression if we only do one OCR pass for the entire doc.
**Tech details:**
* stop passing the `extract_tables` parameter to inference
* added a table extraction ingest test for images, which was skipped
before; the "text_as_html" field contains the OCR output from the table
OCR refactor PR
* replaced `assert_called_once_with` with `call_args` so that the unit
tests don't need to test additional parameters
* added `error_margin` as an ENV var when comparing bounding boxes of
`ocr_region` with `table_element`
* added more tests for tables and noted the table regression in the test
for partition pdf
### Test
* To verify we stop passing the `extract_tables` parameter to inference,
run the test `test_partition_pdf_hi_res_ocr_mode_with_table_extraction`
before this branch and you will see a warning like `Table OCR from
get_tokens method will be deprecated....`, which means it called the
table OCR in the inference repo. This branch removes the warning.
Carrying `skip_infer_table_types` to `infer_table_structure` in the
partition flow. Now PPT/X, DOC/X, etc. `Table` elements should not have
a `text_as_html` field.
Note: I've continued to exclude this var from partitioners that go
through the HTML flow; if we've already got the HTML, it doesn't make
sense to carry the infer variable along, since we're not 'infer-ing' the
HTML table in these cases.
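The carrying logic amounts to roughly this (a hypothetical sketch, not the actual partitioner code):

```python
# hypothetical illustration of mapping skip_infer_table_types to the flag
skip_infer_table_types = ["pptx", "docx"]
filetype = "docx"
infer_table_structure = filetype not in skip_infer_table_types
assert infer_table_structure is False  # so no text_as_html for docx tables
```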
TODO:
✅ add unit tests
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: amanda103 <amanda103@users.noreply.github.com>
**Executive Summary**
This PR adds the evaluation metrics to our current workflow. It verifies
the flow: when code is pushed, it gets evaluated against our gold
standard and the output goes into a `.tsv` file.
**Technical Details**
- Adds evaluation metrics to the test-ingest workflow
- Makes use of the `structured-output` from `test-ingest` and compares
it to the gold standard uploaded in S3, downloading it locally when
making the comparison. The current folder in use is
`s3://utic-dev-tech-fixtures/small-cct`. This dir is editable in the
shell script.
- With this PR, only one file from one connector is used for comparison.
**Misc**
- There are not many overlapping files between test-ingest and the gold
standard. More files will be added.
**Outputs**
Two `.tsv` files are saved under `test_unstructured_ingest/metrics/`.


---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
### Description
* The priority of this was to fix deserialization of ingest docs.
Previously, the source metadata wasn't being persisted.
* To help debug this, source metadata was added to the local ingest doc
as well.
* Unit test added to make sure the metadata itself was persisted.
* As part of serialization, docs were forced to fetch source metadata,
if they hadn't already, to add it to the generated dict/JSON. This
shouldn't happen when the underlying variable `_source_metadata` is
`None`; this way the doc can be serialized without any calls being made.
* Serialization was moved to the `to_dict` method to make it more
universal.
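A minimal sketch of the lazy-serialization behavior described above (hypothetical class shape):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExampleIngestDoc:
    _source_metadata: Optional[dict] = field(default=None)

    def to_dict(self) -> dict:
        # serialize only what we already have; never trigger a fetch here
        return {"source_metadata": self._source_metadata}

doc = ExampleIngestDoc()
assert doc.to_dict() == {"source_metadata": None}  # no remote calls made
```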
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Description
Given that many of the options associated with the `Click`-based CLI
ingest commands are added dynamically from a number of configs, a check
was incorporated to make sure there are no duplicate entries, preventing
new configs from overwriting already-added options.
### Issues found and their fixes:
* a duplicate api-key option set on the Notion command conflicted with
the api key used for the unstructured API; added a notion prefix
* retry logic configs had duplicates in biomed; removed since this is
not handled by the pipeline
**Executive Summary.** Introducing strict type-checking as preparation
for adding the chunk-overlap feature revealed a type mismatch for
regex-metadata between chunking tests and the (authoritative)
ElementMetadata definition. The implementation of regex-metadata aspects
of chunking passed the tests but did not produce the appropriate
behaviors in production where the actual data-structure was different.
This PR fixes these two bugs.
1. **Over-chunking.** The presence of `regex-metadata` in an element was
incorrectly being interpreted as a semantic boundary, leading to such
elements being isolated in their own chunks.
2. **Discarded regex-metadata.** regex-metadata present on the second or
later elements in a section (chunk) was discarded.
**Technical Summary**
The type of `ElementMetadata.regex_metadata` is `Dict[str,
List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text":
"this matched", "start": 7, "end": 19}`.
Multiple regexes can be specified, each with a name like "mail-stop",
"version", etc. Each of those may produce its own set of matches, like:
```python
>>> element.regex_metadata
{
"mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
"version": [
{"text": "current: v1.7.2", "start": 7, "end": 21},
{"text": "supersedes: v1.7.0", "start": 22, "end": 40},
],
}
```
*Forensic analysis*
* The regex-metadata feature was added by Matt Robinson on 06/16/2023
commit: 4ea71683. The regex_metadata data structure is the same as when
it was added.
* The chunk-by-title feature was added by Matt Robinson on 08/29/2023
commit: f6a745a7. The mistaken regex-metadata data structure in the
tests is present in that commit.
Looks to me like a mis-remembering of the regex-metadata data-structure
and insufficient type-checking rigor (type-checker strictness level set
too low) to warn of the mistake.
**Over-chunking Behavior**
The over-chunking looked like this:
Chunking three elements with regex metadata should combine them into a
single chunk (`CompositeElement` object), subject to maximum size rules
(default 500 chars).
```python
elements: List[Element] = [
Title(
"Lorem Ipsum",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
),
),
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit.",
metadata=ElementMetadata(
regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
),
),
Text(
"In rhoncus ipsum sed lectus porta volutpat.",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
),
),
]
chunks = chunk_by_title(elements)
assert chunks == [
CompositeElement(
"Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
" ipsum sed lectus porta volutpat."
)
]
```
Observed behavior looked like this:
```python
chunks => [
CompositeElement('Lorem Ipsum')
CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]
```
The fix changed the approach from breaking on any metadata field not in
a specified group (`regex_metadata` was missing from this group) to only
breaking on specified fields (whitelisting instead of blacklisting).
This avoids over-chunking every time we add a new metadata field, and is
also simpler and easier to understand. This change in approach is
discussed in more detail in #1790.
**Dropping regex-metadata Behavior**
Chunking this section:
```python
elements: List[Element] = [
Title(
"Lorem Ipsum",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
),
),
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit.",
metadata=ElementMetadata(
regex_metadata={
"dolor": [RegexMetadata(text="dolor", start=12, end=17)],
"ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
}
),
),
Text(
"In rhoncus ipsum sed lectus porta volutpat.",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
),
),
]
```
...should produce this regex_metadata on the single produced chunk:
```python
assert chunk == CompositeElement(
"Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
" ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
"dolor": [RegexMetadata(text="dolor", start=25, end=30)],
"ipsum": [
RegexMetadata(text="Ipsum", start=6, end=11),
RegexMetadata(text="ipsum", start=19, end=24),
RegexMetadata(text="ipsum", start=81, end=86),
],
}
```
but instead produced this:
```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```
Which is the regex-metadata from the first element only.
The fix was to remove the consolidation+adjustment process from inside
the "list-attribute-processing" loop (because regex-metadata is not a
list) and process regex metadata separately.
This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.
Co-authored-by: benjats07 <benjats07@users.noreply.github.com>
### Description
Pivoted from using the retry logic as a decorator, as this posed too many
limitations on what can be passed in as a parameter at runtime. Moved
this to a class-based approach that can now be instantiated with
appropriate loggers, leveraging the `--verbose` flag to set the log
level. This also mitigates how much new code is being forked from the
backoff library. The existing Notion client that was using the previous
decorator has been refactored to use the new class approach, and the
airtable connector was updated to support retry logic as well. Default
log handlers were introduced which apply to all instances of the retry
handler when it starts, backs off, and gives up.
A generic approach was added for configuring the retry parameters in the
CLI, added to the growing set of common configs across all CLI
commands.
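A minimal sketch of the class-based shape (hypothetical names; the real implementation wires in the `--verbose`-driven loggers and default handlers):

```python
import logging

import backoff

class RetryHandler:
    """Wraps backoff.on_exception so loggers can be injected at runtime."""

    def __init__(self, exception: type, max_tries: int, logger: logging.Logger):
        self._decorate = backoff.on_exception(
            backoff.expo, exception, max_tries=max_tries, logger=logger
        )

    def __call__(self, fn, *args, **kwargs):
        # apply retry behavior at call time rather than at definition time
        return self._decorate(fn)(*args, **kwargs)
```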
Omitted CHANGELOG entry as this is mostly just a refactor of the retry
code. All other connectors will be updated to support retry in another
PR but this helps limit the number of changes to review in this one.
### Extra fixes
* Updated the local and salesforce source connectors to set
`ingest_doc_cls` in a `__post_init__` method, since this variable can't
be serialized.
### Testing
Both the airtable and notion ingest tests can be run locally. While they
might not pass due to text changes (to be expected when running
locally), the process can be viewed in the logs to validate.
Associated issue: #1488