unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-02 03:45:24 +00:00

Author	SHA1	Message	Date
Steve Canny	b8a8de33f4	fix(ingest): canonicalize ingest JSON (#2080 ) Canonicalize JSON produced for ingest tests so that incidental changes in the _form_ of a JSON object (keys moving around) that do not change the _content_ of that object do not trigger an ingest-test failure.	2023-11-15 00:52:58 -08:00
Christine Straub	475066ba7c	Fix: fast strategy fallback to ocr only (#2055 ) Closes #2038. ### Summary The `fast` strategy should not fall back to a more expensive strategy. ### Testing For [9493801-p17.pdf](https://github.com/Unstructured-IO/unstructured/files/13292884/9493801-p17.pdf), the following code should return an empty list. ``` elements = partition(filename=filename, strategy="fast") ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-11-14 18:46:41 +00:00
John	1ead5a27df	Jj/2011 missing languages metadata (#2037 ) ### Summary Closes #2011 `languages` was missing from the metadata when partitioning pdfs via `hi_res` and `fast` strategies and missing from image partitions via `hi_res`. This PR adds `languages` to the relevant function calls so it is included in the resulting elements. ### Testing On the main branch, `partition_image` will include `languages` when `strategy='ocr_only'`, but not when `strategy='hi_res'`: ``` filename = "example-docs/english-and-korean.png" from unstructured.partition.image import partition_image elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor']) elements[0].metadata.languages elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor']) elements[0].metadata.languages ``` For `partition_pdf`, `'ocr_only'` will include `languages` in the metadata, but `'fast'` and `'hi_res'` will not. ``` filename = "example-docs/korean-text-with-tables.pdf" from unstructured.partition.pdf import partition_pdf elements = partition_pdf(filename, strategy="ocr_only", languages=['kor']) elements[0].metadata.languages elements = partition_pdf(filename, strategy="fast", languages=['kor']) elements[0].metadata.languages elements = partition_pdf(filename, strategy="hi_res", languages=['kor']) elements[0].metadata.languages ``` On this branch, `languages` is included in the metadata regardless of strategy --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>	2023-11-13 16:47:05 +00:00
Roman Isecke	ba4477ac20	feat: support table conversion for tabular destination connectors (#1917 ) ### Description * A full schema was introduced to map the type of all output content from the json partition output and mapped to a flattened table structure to leverage table-based destination connectors. The delta table destination connector was updated at the moment to take advantage of this. * Existing method to convert to a dataframe was updated because it had a bug in it. Object content in the metadata would have the key name changed when flattened but then this would be omitted since it didn't exist in the `_get_metadata_table_fieldnames` response. * Unit test was added to make sure we handle all values possible in an Element when converting to a table * Delta table ingest test was split into a source and destination test (looking ahead to split these up in CI) --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-11-03 16:47:21 +00:00
Christine Straub	1f0c563e0c	refactor: `partition_pdf()` for `ocr_only` strategy (#1811 ) ### Summary Update `ocr_only` strategy in `partition_pdf()`. This PR adds the functionality to get accurate coordinate data when partitioning PDFs and Images with the `ocr_only` strategy. - Add functionality to perform OCR region grouping based on the OCR text taken from `pytesseract.image_to_string()` - Add functionality to get layout elements from OCR regions (ocr_layout) for both `tesseract` and `paddle` - Add functionality to determine the `source` of merged text regions when merging text regions in `merge_text_regions()` - Merge multiple test functions related to "ocr_only" strategy into `test_partition_pdf_with_ocr_only_strategy()` - This PR also fixes [issue #1792](https://github.com/Unstructured-IO/unstructured/issues/1792) ### Evaluation ``` # Image PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/double-column-A.jpg ocr_only xy-cut image # PDF PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf ocr_only xy-cut pdf ``` ### Test - Before update All elements have the same coordinate data ![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/aae0195a-2943-4fa8-bdd8-807f2f09c768) - After update All elements have accurate coordinate data ![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/0f6c6202-9e65-4acf-bcd4-ac9dd01ab64a) --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-10-30 20:13:29 +00:00
Roman Isecke	a2af72bb79	local connector metadata and deserialization fix (#1800 ) ### Description * Priority of this was to fix deserialization of ingest docs. Currently the source metadata wasn't being persisted * To help debug this, source metadata was added to the local ingest doc as well. * Unit test added to make sure the metadata itself was persisted. * As part of serialization, it was forcing docs to fetch source metadata if it hadn't already to add to the generated dict/json. This shouldn't have happened if the underlying variable `_source_metadata` was `None`. This way the doc can be serialized without any calls being made. * Serialization was moved to the `to_dict` method to make it more universal. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-10-23 15:51:52 +00:00
ryannikolaidis	80c3c24ca5	ingest retry strategy refactor <- Ingest test fixtures update (#1780 ) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: benjats07 <benjats07@users.noreply.github.com>	2023-10-18 04:33:57 +00:00
Inscore	8ab40c20c1	fix: correct PDF list item parsing (#1693 ) The current implementation removes elements from the beginning of the element list and duplicates the list items --------- Co-authored-by: Klaijan <klaijan@unstructured.io> Co-authored-by: yuming <305248291@qq.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2023-10-11 20:38:36 +00:00
Christine Straub	5d14a2aea0	feat: shrink bboxes by top left (#1633 ) Closes #1573. ### Summary - update `shrink_bbox()` to keep top left rather than center ### Evaluation Run the following command for this [PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf). ``` PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy> ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2023-10-06 05:16:11 +00:00
Christine Straub	b30d6a601e	Fix/1209 tweak xycut ordering output (#1630 ) Closes GH Issue #1209. ### Summary - add swapped `xycut` sorting - update `xycut` sorting evaluation script PDFs: - [sbaa031.073.pdf](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7234218/pdf/sbaa031.073.pdf) - [multi-column-2p.pdf](https://github.com/Unstructured-IO/unstructured/files/12796147/multi-column-2p.pdf) - [11723901.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12360085/11723901.pdf) ### Testing ``` elements = partition_pdf("sbaa031.073.pdf", strategy="hi_res") print("\n\n".join([str(el) for el in elements])) ``` ### Evaluation ``` PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py sbaa031.073.pdf hi_res xycut_only ```	2023-10-05 07:41:38 +00:00
Christine Straub	94fbbed189	feat: bbox shrinking in xycut algo, better natural reading order (#1560 ) Closes GH Issue #1233. ### Summary - add functionality to shrink all bounding boxes along x and y axes (still centered around the same center point) before running xy-cut sort ### Evaluation Run the following command for this [PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf). PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>	2023-09-29 03:48:02 +00:00
Klaijan	d26d591d6a	feat: get embedded url, associate text and start index for pdf (#1539 ) Executive Summary Adds PDF functionality to capture hyperlinks (external or internal) for the pdf fast strategy, along with the associated text. Technical Details - `pdfminer` associates an `annotation` (links and URIs) with a bounding box rather than with text. Therefore, the link and text are not a perfect pair; they are matched by logic and calculation based on bounding box overlap. - There is no word-level bounding box, only character-level (accessed using `LTChar`). Thus, to get to word level, a window slides through the text. Alphanumeric and non-alphanumeric runs are captured separately, meaning a word that contains both is split at the first non-alphanumeric character. - The bounding box is calculated from the start and stop coordinates of the corresponding word found above. The calculation simply uses the distance between two points. The result now contains `links` in `metadata` as shown below: ``` "links": [ { "text": "link", "url": "https://github.com/Unstructured-IO/unstructured", "start_index": 12 }, { "text": "email", "url": "mailto:unstructuredai@earlygrowth.com", "start_index": 30 }, { "text": "phone number", "url": "tel:6505124019", "start_index": 49 } ] ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-09-27 13:43:32 -04:00
shreyanid	32bfebccf7	feat: introduce language detection function for text partitioning function (#1453 ) ### Summary Uses `langdetect` to detect all languages present in the input document. ### Details - Converts all language codes (whether user inputted or detected using `langdetect`) to a standard ISO 639-3 code. - Adds `languages` field to the metadata - Will revisit how to nonstandardly represent simplified vs traditional Chinese scripts internally (separate PR). - Update ingest test results to add `languages` field to documents. Some other side effects are changes in order of some elements and changes in element categorization ### Test You can test the detect_languages function individually by importing the function and inputting a text sample and optionally a language: ``` text = "My lubimy mleko i chleb." doc_langs = detect_languages(text) print(doc_langs) ``` -> ['ces', 'pol', 'slk'] --------- Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: shreyanid <shreyanid@users.noreply.github.com> Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2023-09-26 18:09:27 +00:00
Klaijan	00181b88df	feat: pdf auto strategy groups broken numbered and bullet list items (#1393 ) Summary Adds logic to combine broken numbered lists for the pdf fast strategy. Details Previously the numbered list items in the `layout-parser-paper-fast.pdf` file were read as: ``` '1. An oﬀ-the-shelf toolkit for applying DL models for layout detection, character' 'recognition, and other DIA tasks (Section 3)' '2. A rich repository of pre-trained neural network models (Model Zoo) that' 'underlies the oﬀ-the-shelf usage' '3. Comprehensive tools for eﬃcient document image data annotation and model' 'tuning to support diﬀerent levels of customization' '4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)' ``` Now they are read as: ``` '1. An oﬀ-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3)' '2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the oﬀ-the-shelf usage' '3. Comprehensive tools for eﬃcient document image data annotation and model tuning to support diﬀerent levels of customization' '4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)' ``` The added logic leverages `ElementType` and `coordinates` to determine whether the following lines are part of the previously detected `ListItem` or not. Test Adds a test that checks that the number of elements is lower than in the original version with the broken numbered list. The test also checks whether the first detected numbered list item ends with the previously broken line. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-09-13 21:30:06 +00:00
Christine Straub	483b09b3c9	Feat/1136 elements ordering for pdf (#1161 ) ### Summary Address [#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for `hi_res` and `fast` strategies. The `ocr_only` strategy does not include coordinates. - add functionality to switch sort mode between the current `basic` sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies - add the script to evaluate the `xy-cut` sorting approach - add jupyter notebook to provide evaluation and visualization for the `xy-cut` sorting approach ### Evaluation ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy> ``` Here, the file should be under the project root directory. For example, ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast ```	2023-08-24 17:46:19 -07:00
qued	79f734d3f9	fix: better extractable check (#900 ) auto strategy was choosing the fast strategy in cases where the pdf contents were just a flat image, resulting in no output. This PR changes the behavior of auto so that elements that can be extracted by fast are extracted, a cursory examination of the elements is made to see if there are elements with text present, and if so then these elements are used as the output. Otherwise fallback strategies come into play.	2023-07-07 23:41:37 -05:00
qued	db4c5dfdf7	feat: coordinate systems (#774 ) Added the CoordinateSystem class for tracking the system in which coordinates are represented, and changing the system if desired.	2023-06-20 11:19:55 -05:00
kravetsmic	7fd7d7afae	feat(biomed connector): added additional params (#468 ) (#623 ) Unstructured-ingest biomed connector: Adds max retries, max request time with backoff and decay. --------- Co-authored-by: Crag Wolfe <crag@unstructuredai.io>	2023-06-15 01:57:45 -07:00
Yuming Long	2fbb1ccd30	Chore(ingest) : add tests on PDFs with fast strategy (#614 ) Summary * Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted * Added PDFs ingest tests with fast strategy with new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh Updated ingest tests procedure: * Processing files with hi_res strategy, and preserve downloads to repo files-ingest-download/<ingest_test_name> * Reprocessing all PDFs with fast strategy from local file files-ingest-download, the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name> Test * Reproduce tests with ./scripts/ingest-test-fixtures-update.sh , should expect no update. Also don't need any secret tokens since relevant tests won't produce PDFs.	2023-06-12 19:02:48 +00:00

19 Commits
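
Two of the commits above change how bounding boxes are shrunk before xy-cut sorting: #1560 shrinks each box around its center point, and #1633 updates `shrink_bbox()` to keep the top-left corner instead. The log does not show the real function's signature, so the following is only a minimal sketch of the difference. The function names, the `(x1, y1, x2, y2)` tuple layout, and the single `factor` parameter are all assumptions made for illustration, not the library's API.

```python
from typing import Tuple

# Hypothetical box layout: (x1, y1, x2, y2), with (x1, y1) as the top-left corner.
BBox = Tuple[float, float, float, float]


def shrink_about_center(bbox: BBox, factor: float) -> BBox:
    """Scale width and height by `factor`, keeping the center point fixed."""
    x1, y1, x2, y2 = bbox
    new_w = (x2 - x1) * factor
    new_h = (y2 - y1) * factor
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    return (cx - new_w / 2, cy - new_h / 2, cx + new_w / 2, cy + new_h / 2)


def shrink_from_top_left(bbox: BBox, factor: float) -> BBox:
    """Scale width and height by `factor`, keeping the top-left corner fixed."""
    x1, y1, x2, y2 = bbox
    return (x1, y1, x1 + (x2 - x1) * factor, y1 + (y2 - y1) * factor)


box = (0.0, 0.0, 10.0, 20.0)
centered = shrink_about_center(box, 0.5)
anchored = shrink_from_top_left(box, 0.5)
```

With the top-left variant, the corner that a top-to-bottom, left-to-right sort would most plausibly compare stays where it was, which may be why the later commit preferred it; the log itself does not state the motivation.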