A DOCX document that has no sections can still contain one or more
tables. Such files are never created by Word but Word can open them just
fine. These can be and are generated by other applications.
Use the `Document.iter_inner_content()` method recently added upstream in
`python-docx` to capture both paragraphs and tables from a section-less
DOCX document.
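As a rough sketch of what that looks like (illustrative only; the file path is hypothetical, and `iter_inner_content()` yields `Paragraph` and `Table` objects in document order):
```python
from docx import Document
from docx.table import Table
from docx.text.paragraph import Paragraph

document = Document("example-docs/teams-chat.docx")  # hypothetical sectionless file
for item in document.iter_inner_content():
    if isinstance(item, Paragraph):
        print("paragraph:", item.text)
    elif isinstance(item, Table):
        print("table with", len(item.rows), "rows")
```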
This generalizes the fix for MS Teams chat-transcripts (an example of
sectionless-docx) implemented in #1825.
Courtesy @cdpierse.
Adds a test to PR #1529 in accordance with feedback.
Description from original PR:
In Python, the default behaviour of `requests.get` without a `timeout`
set is to hang indefinitely. We have a production use case where the
desired behaviour is to raise a timeout error rather than have the
application hang.
This PR adds a new optional keyword parameter `request_timeout` to
`partition`, which is passed to `file_and_type_from_url` in the case
where we are fetching from a URL, and then on to `requests.get`.
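For example (the URL here is illustrative), the new parameter would be used like:
```python
from unstructured.partition.auto import partition

# Raise instead of hanging if the server does not respond within 10 seconds.
elements = partition(url="https://example.com/some-document.pdf", request_timeout=10)
```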
---------
Co-authored-by: Charles Pierse <charlespierse@gmail.com>
In DOCX, like HTML, a table cell can itself contain a table. This is not
uncommon and is typically used for formatting purposes.
When a DOCX table is nested, create nested HTML tables to reflect that
structure and create a plain-text table that captures all the text in
nested tables, formatting it as a reasonable facsimile of a table.
This implements the solution described and spiked in PR #1952.
---------
Co-authored-by: Bruno Bornsztein <bruno.bornsztein@gmail.com>
### Summary
Click-decorated functions cannot (properly) be called outside of the
click interface. This makes it difficult to reuse the setup
functionality in `measure_text_edit_distance` or
`measure_element_type_accuracy`. This PR removes the click decoration and
separates it into a wrapper function purely to execute the command.
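A minimal sketch of the pattern (option names are assumed; the real functions take more parameters):
```python
import click

def measure_text_edit_distance(output_dir: str, source_dir: str) -> None:
    """Core logic: importable and callable directly from other Python code."""
    ...

@click.command()
@click.option("--output_dir", required=True)
@click.option("--source_dir", required=True)
def measure_text_edit_distance_command(output_dir: str, source_dir: str) -> None:
    """Thin click wrapper that only parses CLI arguments and delegates."""
    measure_text_edit_distance(output_dir=output_dir, source_dir=source_dir)
```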
### Technical Details
- Changed as suggested in [this StackOverflow answer](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command)
- The locations of these now distinct functions are separate: the
`_command` click-decorated functions stay in ingest/evaluate.py, and the
core functions measure_text_edit_distance and
measure_element_type_accuracy are moved into the unstructured/metrics/
folder (which is a more logical location for them).
- Initial test added for measure_text_edit_distance
### Test
`sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction`
functionality is unchanged.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Summary:
Close: https://github.com/Unstructured-IO/unstructured/issues/1920
* stop passing an empty string from `languages` to tesseract, which would
result in passing an empty string to the `-l` language config for the
tesseract CLI
* also stop passing duplicate language codes from `languages` to
tesseract OCR
* if we fail to convert any ISO languages from the `languages`
parameter, proceed with OCR using `eng` as the default
### Test
* First confirm the tesseract error `Estimating resolution as X` before
this:
* on the `unstructured-api` repo with main branch, run `make
run-web-app`
* curl to test error from empty string, or just any wrong input like `-F
'languages="eng,de"'`:
```
curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'languages=""' \
-F 'strategy=hi_res' \
-F 'pdf_infer_table_structure=True' \
| jq -C . | less -R
```
* after this change:
* in your unstructured API env, cd to the unstructured repo and install it locally with `pip install -e .`
* check out this branch
* run `make run-web-app` again in the api repo
* the curl command now returns output and you will see a warning in the log
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Closes #1782
This PR:
- Extends ingest pipeline so that it is possible to select an embedding
provider from a range of providers
- Modifies the ingest embedding test to be a diff test, since the
embedding vectors are reproducible after supporting multiple providers
Additional info on the chosen provider for the test:
- Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic
even when there's no seed set
- Took 6.84s to pass a unit test with the provider (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs locally, making it
zero cost
For all these reasons, testing embedding modules with the HuggingFace
model seems to make sense
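A minimal sketch of exercising the provider directly via langchain (assumes `sentence-transformers` is installed; langchain supplies the default model name):
```python
from langchain.embeddings import HuggingFaceEmbeddings

# Runs locally and produces deterministic vectors, which is what makes a diff test viable.
embeddings = HuggingFaceEmbeddings()
vectors = embeddings.embed_documents(["Unstructured ingest embedding test."])
print(len(vectors), len(vectors[0]))
```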
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
### Description
* A full schema was introduced to map the types of all output content
from the JSON partition output to a flattened table structure, in order
to leverage table-based destination connectors. The delta table
destination connector was updated to take advantage of this.
* The existing method to convert to a dataframe was updated because it
had a bug: object content in the metadata would have its key name
changed when flattened, but was then omitted because the new name didn't
exist in the `_get_metadata_table_fieldnames` response.
* Unit test was added to make sure we handle all values possible in an
Element when converting to a table
* Delta table ingest test was split into a source and destination test
(looking ahead to split these up in CI)
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
- This PR adds a function to check if a piece of text only contains a
bullet (no text) to prevent creating an empty element.
- Also fixed a test that had a typo.
*Reviewer:* May be quicker to review commit by commit as they are quite
distinct and well-groomed to each focus on a single clean-up task.
Clean up odds-and-ends in the docx partitioner in preparation for adding
nested-tables support in a closely following PR.
1. Remove obsolete TODOs now in GitHub issues, which is probably where
they belong in future anyway.
2. Remove local DOCX "workaround" code that has been implemented
upstream and is now obsolete.
3. "Clean" the docx tests, introducing strict typing, extracting a
fixture or two, and generally tightening things up.
4. Extract docx-local versions of
`unstructured.partition.common.convert_ms_office_table_to_text()` which
will be the base for adding nested-table support. More information on
why this is required in that commit.
**Executive Summary.** When the elements in a _section_ are combined
into a _chunk_, the metadata in each of the elements is _consolidated_
into a single `ElementMetadata` instance. There are two main problems
with the current implementation:
1. The current algorithm simply uses the metadata of the first element
as the metadata for the chunk. This produces:
- **empty chunk metadata** when the first element has no metadata, such
as a `PageBreak("")`
- **missing chunk metadata** when the first element contains only
partial metadata such as a `Header()` or `Footer()`
- **misleading metadata** when the first element contains values
applicable only to that element, such as `category_depth`, `coordinates`
(bounding-box), `header_footer_type`, or `parent_id`
2. List metadata such as `emphasized_text_content`,
`emphasized_text_tags`, `link_texts` and `link_urls` is only combined
when it is unique within the combined list. These lists are "unzipped"
pairs. For example, the first `link_texts` item corresponds to the first
`link_urls` value. When an item is removed from one (because it matches
a prior entry) but not the other (say the same text "here" but a different
URL), the positional correspondence is broken and downstream processing
will at best be wrong, at worst raise an exception.
### Technical Discussion
Element metadata cannot be determined in the general case simply by
sampling that of the first element. At the same time, a simple union of
all values is also not sufficient. To effectively consolidate the
current variety of metadata fields we need five distinct strategies,
selecting which to apply to each field based on that field's provenance
and other characteristics.
The five strategies are:
- `FIRST` - Select the first non-`None` value across all the elements.
Several fields are determined by the document source (`filename`,
`file_directory`, etc.) and will not change within the output of a
single partitioning run. They might not appear in every element, but
they will be the same whenever they do appear. This strategy takes the
first one that appears, if any, as proxy for the value for the entire
chunk.
- `LIST` - Consolidate the four list fields like
`emphasized_text_content` and `link_urls` by concatenating them in
element order (no set semantics apply). All values from `elements[n]`
appear before those from `elements[n+1]` and existing order is
preserved.
- `LIST_UNIQUE` - Combine only unique elements across the (list) values
of the elements, preserving order in which a unique item first appeared.
- `REGEX` - Regex metadata has its own rules, including adjusting the
`start` and `end` offset of each match based on its new position in the
concatenated text.
- `DROP` - Not all metadata can or should appear in a chunk. For
example, a chunk cannot be guaranteed to have a single `category_depth`
or `parent_id`.
Other strategies such as `COORDINATES` could be added to consolidate the
bounding box of the chunk from the coordinates of its elements, roughly
`min(lefts)`, `max(rights)`, etc. Others could be `LAST`, `MAJORITY`, or
`SUM` depending on how metadata evolves.
The proposed strategy assignments are these:
- `attached_to_filename`: FIRST,
- `category_depth`: DROP,
- `coordinates`: DROP,
- `data_source`: FIRST,
- `detection_class_prob`: DROP, # -- ? confirm --
- `detection_origin`: DROP, # -- ? confirm --
- `emphasized_text_contents`: LIST,
- `emphasized_text_tags`: LIST,
- `file_directory`: FIRST,
- `filename`: FIRST,
- `filetype`: FIRST,
- `header_footer_type`: DROP,
- `image_path`: DROP,
- `is_continuation`: DROP, # -- not expected, added by chunking, not
before --
- `languages`: LIST_UNIQUE,
- `last_modified`: FIRST,
- `link_texts`: LIST,
- `link_urls`: LIST,
- `links`: DROP, # -- deprecated field --
- `max_characters`: DROP, # -- unused in code, probably remove from
ElementMetadata --
- `page_name`: FIRST,
- `page_number`: FIRST,
- `parent_id`: DROP,
- `regex_metadata`: REGEX,
- `section`: FIRST, # -- section unconditionally breaks on new section
--
- `sent_from`: FIRST,
- `sent_to`: FIRST,
- `subject`: FIRST,
- `text_as_html`: DROP, # -- not expected, only occurs in TableSection
--
- `url`: FIRST,
**Assumptions:**
- each .eml file is partitioned->chunked separately (not in batches), therefore sent-from, sent-to, and subject will not change within a section.
### Implementation
Implementation of this behavior requires two steps:
1. **Collect** all non-`None` values from all elements, each in a
sequence by field-name. Fields not populated in any of the elements do
not appear in the collection.
```python
all_meta = {
    "filename": ["memo.docx", "memo.docx"],
    "link_texts": [["here", "here"], ["and here"]],
    "parent_id": ["f273a7cb", "808b4ced"],
}
```
2. **Apply** the specified strategy to each item in the overall
collection to produce the consolidated chunk meta (see implementation).
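For illustration only, step 2 might look roughly like this (the `CS` enum and `strategies` mapping names are assumed, not the actual implementation):
```python
from enum import Enum

class CS(Enum):  # consolidation strategy (illustrative)
    FIRST = "first"
    LIST = "list"
    DROP = "drop"

strategies = {"filename": CS.FIRST, "link_texts": CS.LIST, "parent_id": CS.DROP}

chunk_meta_kwargs = {}
for field_name, values in all_meta.items():
    strategy = strategies.get(field_name, CS.DROP)  # default assumed for this sketch
    if strategy is CS.FIRST:
        chunk_meta_kwargs[field_name] = values[0]
    elif strategy is CS.LIST:
        # concatenate per-element lists in element order, no set semantics
        chunk_meta_kwargs[field_name] = [v for lst in values for v in lst]
    # CS.DROP: the field is simply omitted from the chunk metadata
```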
### Factoring
For the following reasons, the implementation of metadata consolidation
is extracted from its current location in `chunk_by_title()` to a
handful of collaborating methods in `_TextSection`.
- The current implementation of metadata consolidation "inline" in
`chunk_by_title()` already has too many moving pieces to be understood
without extended study. Adding strategies to that would make it worse.
- `_TextSection` is the only section type where metadata is consolidated
(the other two section types always have exactly one element, so already
have exactly one metadata.)
- `_TextSection` is already the expert on all the information required
to consolidate metadata, in particular the elements that make up the
section and their text.
Some other problems were also fixed in that transition, such as mutation
of elements during the consolidation process.
### Technical Risk: adding new `ElementMetadata` field breaks metadata
If each metadata field requires a strategy assignment to be consolidated
and a developer adds a new `ElementMetadata` field without adding a
corresponding strategy mapping, metadata consolidation could break or
produce incorrect results.
This risk can be mitigated multiple ways:
1. Add a test that verifies a strategy is defined for each field
(recommended).
2. Define a default strategy, either `DROP` or `FIRST` for scalar types,
`LIST` for list types.
3. Raise an exception when an unknown metadata field is encountered.
This PR implements option 1 such that a developer will be notified
before merge if they add a new metadata field but do not define a
strategy for it.
### Other Considerations
- If end-users can in-future add arbitrary metadata fields _before_
chunking, then we'll need to define metadata-consolidation behavior for
such fields. Depending on how we implement user-defined metadata fields
we might:
- Require explicit definition of a new metadata field before use,
perhaps with a method like `ElementMetadata.add_custom_field()` which
requires a consolidation strategy to be defined (and/or has a default
value).
- Have a default strategy, perhaps `DROP` or `FIRST`, or `LIST` if the
field is type `list`.
### Further Context
Metadata is only consolidated for `TextSection` because the other two
section types (`TableSection` and `NonTextSection`) can only contain a
single element.
---
## Further discussion on consolidation strategy by field
### document-static
These fields are very likely to be the same for all elements in a single
document:
- `attached_to_filename`
- `data_source`
- `file_directory`
- `filename`
- `filetype`
- `last_modified`
- `sent_from`
- `sent_to`
- `subject`
- `url`
*Consolidation strategy:* `FIRST` - use first one found, if any.
### section-static
These fields are very likely to be the same for all elements in a single
section, which is the scope we really care about for metadata
consolidation:
- `section` - an EPUB document-section unconditionally starts a new
section.
*Consolidation strategy:* `FIRST` - use first one found, if any.
### consolidated list-items
These `List` fields are consolidated by concatenating the lists from
each element that has one:
- `emphasized_text_contents`
- `emphasized_text_tags`
- `link_texts`
- `link_urls`
- `regex_metadata` - special case, this one gets indexes adjusted too.
*Consolidation strategy:* `LIST` - concatenate lists across elements.
### dynamic
These fields are likely to hold unique data for each element:
- `category_depth`
- `coordinates`
- `image_path`
- `parent_id`
*Consolidation strategy:*
- `DROP` as likely misleading.
- `COORDINATES` strategy could be added to compute the bounding box from
all bounding boxes.
- Consider allowing if they are all the same, perhaps an `ALL` strategy.
### slow-changing
These fields are somewhere in-between, likely to be common between
multiple elements but varied within a document:
- `header_footer_type` - *strategy:* drop as not-consolidatable
- `languages` - *strategy:* take first occurrence
- `page_name` - *strategy:* take first occurrence
- `page_number` - *strategy:* take first occurrence; will all be the same
when `multipage_sections` is `False`. Worst-case semantics are "this
chunk began on this page".
### N/A
These field types do not figure in metadata-consolidation:
- `detection_class_prob` - I'm thinking this is for debug and should not
appear in chunks, but need confirmation.
- `detection_origin` - for debug only
- `is_continuation` - is _produced_ by chunking, never by partitioning
(not in our code anyway).
- `links` (deprecated, probably should be dropped)
- `max_characters` - is unused as far as I can tell and is unreferenced
in source code; should probably be removed from `ElementMetadata`.
- `text_as_html` - only appears in a `Table` element, each of which
appears in its own section so needs no consolidation. Never appears in
`TextSection`.
*Consolidation strategy:* `DROP` any that appear (several never will)
### Summary
Closes #1520
Partial solution to #1521
- Adds an abstraction layer between the user API and the partitioner
implementation
- Adds comments explaining paragraph chunking
- Makes edits to pass strict type-checking for both text.py and
test_text.py
Closes #1870
Defining both `languages` and `ocr_languages` raises a ValueError, but
the API defaults `ocr_languages` to an empty string, so users who define
`languages` automatically hit the ValueError.
This fix checks whether `ocr_languages` is an empty string and converts
it to `None` to avoid this.
### Testing
On the main branch, the following will raise the ValueError, but it will
correctly partition on this branch
```
from unstructured.partition.auto import partition
filename = "example-docs/category-level.docx"
elements = partition(filename, languages=['spa'], ocr_languages="")
elements[0].metadata.languages
```
---------
Co-authored-by: yuming <305248291@qq.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Austin Walker <awalk89@gmail.com>
This PR adds an `include_header` argument for partition_csv and
partition_tsv. This is related to the following feature request
https://github.com/Unstructured-IO/unstructured/issues/1751.
`include_header` is already part of partition_xlsx. The work here is in
line with the current usage and testing of the `include_header` argument
in partition_xlsx.
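Example usage (sketch; the example file path is illustrative):
```python
from unstructured.partition.csv import partition_csv

# Keep the header row in the emitted table text / text_as_html, mirroring partition_xlsx.
elements = partition_csv("example-docs/stanley-cups.csv", include_header=True)
print(elements[0].metadata.text_as_html)
```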
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Fix `TypeError: string indices must be integers`. The `annotation_dict`
variable is now set to `None` when the instance is not a dict, and logic
was added to skip the lookup when the value is `None`.
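A minimal sketch of the guard described above (the surrounding loop and variable names are assumed for illustration):
```python
for annotation in annotations:  # assumed surrounding loop
    annotation_dict = annotation if isinstance(annotation, dict) else None
    if annotation_dict is None:
        continue  # skip the lookup that raised "string indices must be integers"
    # ... proceed to index into annotation_dict as before ...
```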
### Summary
Update `ocr_only` strategy in `partition_pdf()`. This PR adds the
functionality to get accurate coordinate data when partitioning PDFs and
Images with the `ocr_only` strategy.
- Add functionality to perform OCR region grouping based on the OCR text
taken from `pytesseract.image_to_string()`
- Add functionality to get layout elements from OCR regions (ocr_layout)
for both `tesseract` and `paddle`
- Add functionality to determine the `source` of merged text regions
when merging text regions in `merge_text_regions()`
- Merge multiple test functions related to "ocr_only" strategy into
`test_partition_pdf_with_ocr_only_strategy()`
- This PR also fixes [issue
#1792](https://github.com/Unstructured-IO/unstructured/issues/1792)
### Evaluation
```
# Image
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/double-column-A.jpg ocr_only xy-cut image
# PDF
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf ocr_only xy-cut pdf
```
### Test
- **Before update**
All elements have the same coordinate data

- **After update**
All elements have accurate coordinate data

---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Courtesy @phowat, created a branch in the repo to make some changes and
merge quickly.
Closes #1486.
* **Fixes issue where tables from markdown documents were being treated
as text** Problem: Tables from markdown documents were being treated as
text, and not being extracted as tables. Solution: Enable the `tables`
extension when instantiating the `python-markdown` object. Importance:
This will allow users to extract structured data from tables in markdown
documents.
#### Testing:
On `main` run the following (run `git checkout fix/md-tables --
example-docs/simple-table.md` first to grab the example table from this
branch)
```python
from unstructured.partition.md import partition_md
elements = partition_md("example-docs/simple-table.md")
print(elements[0].category)
```
Output should be `UncategorizedText`. Then run the same code on this
branch and observe the output is `Table`.
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
This PR introduces `clean_pdfminer_inner_elements`, which deletes
pdfminer elements nested inside elements from other detection origins
such as YoloX or detectron.
This function returns the cleaned document.
Also, the ingest-test fixtures were updated to reflect the new standard
output.
The best way to check that this function is working properly is to look
at the new test `test_clean_pdfminer_inner_elements` in
`test_unstructured/partition/utils/test_processing_elements.py`
---------
Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
### disassociated-titles
**Executive Summary**. Section titles are often combined with the prior
section and then missing from the section they belong to.
_Chunk combination_ is a behavior in which two successive small chunks
are combined into a single chunk that better fills the chunk window.
Chunking can be and by default is configured to combine sequential small
chunks that will together fit within the full chunk window (default 500
chars).
Combination is only valid for "whole" chunks. The current implementation
attempts to combine at the element level (in the sectioner), meaning a
small initial element (such as a `Title`) is combined with the prior
section without considering the remaining length of the section that
title belongs to. This frequently causes a title element to be removed
from the chunk it belongs to and added to the prior, otherwise
unrelated, chunk.
Example:
```python
elements: List[Element] = [
Title("Lorem Ipsum"), # 11
Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."), # 55
Title("Rhoncus"), # 7
Text("In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."), # 57
]
chunks = chunk_by_title(elements, max_characters=80, combine_text_under_n_chars=80)
# -- want --------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('Rhoncus\n\nIn rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')
# -- got ---------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nRhoncus')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')
```
**Technical Summary.** Combination cannot be effectively performed at
the element level, at least not without complicating things with
arbitrary look-ahead into future elements. Much more straightforward is
to combine sections once they have been formed from the element stream.
**Fix.** Introduce an intermediate stream processor that accepts a
stream of sections and emits a stream of sometimes-combined sections.
The solution implemented in this PR builds upon introducing `_Section`
objects to replace the `List[Element]` primitive used previously:
- `_TextSection` gets the `.combine()` method and `.text_length`
property which allows a combining client to produce a combined section
(only text-sections are ever combined).
- `_SectionCombiner` is introduced to encapsulate the logic of
combination, acting as a "filter", accepting a stream of sections and
emitting the same type, just with some resulting from two or more
combined input sections: `(Iterable[_Section]) -> Iterator[_Section]`.
- `_TextSectionAccumulator` is a helper to `_SectionCombiner` that takes
responsibility for repeatedly accumulating sections, characterizing
their length and doing the actual combining (calling
`_Section.combine(other_section)`) when instructed. Very similar in
concept to `_TextSectionBuilder`, just at the section level instead of
element level.
- Remove attempts to combine sections at the element level from
`_split_elements_by_title_and_table()` and install `_SectionCombiner` as
filter between sectioner and chunker.
- yolox has better recall than yolox_quantized, the current default
model, for table detection
- update logic so that when `infer_table_structure=True` the default
model is `yolox` instead of `yolox_quantized`
- user can still override the default by passing in a `model_name` or
setting the env variable `UNSTRUCTURED_HI_RES_MODEL_NAME`
## Test:
Partition the attached file with
```python
from unstructured.partition.pdf import partition_pdf
yolox_elements = partition_pdf(filename, strategy="hi_res", infer_table_structure=True)
yolox_quantized_elements = partition_pdf(filename, strategy="hi_res", infer_table_structure=True, model_name="yolox_quantized")
```
Compare the table elements between those two; the yolox (default)
elements should have more complete tables.
[AK_AK-PERS_CAFR_2008_3.pdf](https://github.com/Unstructured-IO/unstructured/files/13191198/AK_AK-PERS_CAFR_2008_3.pdf)
- add helper to run inference over an image or pdf of table and compare
it against a ground truth csv file
- this metric generates a similarity score between 0 and 1, where 1 is a
perfect match and 0 is no match at all
- add example docs for testing
- NOTE: this metric is only relevant to table structure detection.
Therefore the input should be just the table area in an image/pdf file;
we are not evaluating table element detection in this metric
### sectioner-does-not-consider-separator-length
**Executive Summary.** A primary responsibility of the sectioner is to
minimize the number of chunks that need to be split mid-text. It does
this by computing text-length of the section being formed and
"finishing" the section when adding another element would extend its
text beyond the window size.
When element-text is consolidated into a chunk, the text of each element
is joined, separated by a "blank-line" (`"\n\n"`). The sectioner does
not currently consider the added length of separators (2-chars each) and
so forms sections that need to be split mid-text when chunked.
Chunk-splitting should only be necessary when the text of a single
element is longer than the chunking window.
**Example**
```python
elements: List[Element] = [
Title("Chunking Priorities"), # 19 chars
ListItem("Divide text into manageable chunks"), # 34 chars
ListItem("Preserve semantic boundaries"), # 28 chars
ListItem("Minimize mid-text chunk-splitting"), # 33 chars
] # 114 chars total but 120 chars with separators
chunks = chunk_by_title(elements, max_characters=115)
```
Want:
```python
[
CompositeElement(
"Chunking Priorities"
"\n\nDivide text into manageable chunks"
"\n\nPreserve semantic boundaries"
),
CompositeElement("Minimize mid-text chunk-splitting"),
]
```
Got:
```python
[
CompositeElement(
"Chunking Priorities"
"\n\nDivide text into manageable chunks"
"\n\nPreserve semantic boundaries"
"\n\nMinimize mid-text chunk-spli"),
)
CompositeElement("tting")
```
### Technical Summary
Because the sectioner does not consider separator (`"\n\n"`) length when
it computes the space remaining in the section, it over-populates the
section and when the chunker concatenates the element text (each
separated by the separator) the text exceeds the window length and the
chunk must be split mid-text, even though there was an even element
boundary it could have been split on.
### Fix
Consider separator length in the space-remaining computation.
The solution here extracts both the `section.text_length` and
`section.space_remaining` computations to a `_TextSectionBuilder` object
which removes the need for the sectioner
(`_split_elements_by_title_and_table()`) to deal with primitives
(List[Element], running text length, separator length, etc.) and allows
it to focus on the rules of when to start a new section.
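A minimal sketch of the idea, with assumed names (the real `_TextSectionBuilder` carries more responsibilities):
```python
TEXT_SEPARATOR = "\n\n"

class _TextSectionBuilder:
    def __init__(self, maxlen: int) -> None:
        self._maxlen = maxlen
        self._texts: list[str] = []

    @property
    def text_length(self) -> int:
        # separators only appear *between* element texts, hence the max(..., 0)
        separators_len = len(TEXT_SEPARATOR) * max(len(self._texts) - 1, 0)
        return sum(len(t) for t in self._texts) + separators_len

    @property
    def remaining_space(self) -> int:
        # the next element also brings a separator, unless it would be the first one
        separator_cost = len(TEXT_SEPARATOR) if self._texts else 0
        return self._maxlen - self.text_length - separator_cost
```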
This solution may seem like overkill at the moment and indeed it would
be except it forms the foundation for adding section-level chunk
combination (fix: dissociated title elements) in the next PR. The
objects introduced here will gain several additional responsibilities in
the next few chunking PRs in the pipeline and will earn their place.
* **Removed `ebooklib` as a dependency** `ebooklib` is licensed under
AGPL3, which is incompatible with the Apache 2.0 license. Thus it is
being removed.
We have established that overlapping bounding boxes do not have a
one-size-fits-all solution, so different cases need to be handled
differently to avoid information loss. We have manually identified the
cases/categories of overlapping. Now we need a method to
programmatically classify overlapping-bbox cases within detected
elements in a document, and return a report about it (a list of cases
with metadata). This fits two purposes:
- **Evaluation**: We can have a pipeline using the DVC data registry
that assess the performance of a detection model against a set of
documents (PDF/Images), by analysing the overlapping-bboxes cases it
has. The metadata in the output can be used for generating metrics for
this.
- **Scope overlapping cases**: Manual inspection gives us a clue about
currently present cases of overlapping bboxes. We need to propose
solutions to fix those in code. This method generates a report by
analysing several aspects of two overlapping regions. This data can be
used to profile and specify the necessary changes that will fix each
case.
- **Fix overlapping cases**: We could introduce this functionality in
the flow of a partition method (such as `partition_pdf`) to handle the
calls to post-processing methods that fix overlapping. Tested on ~331
documents, the worst time per page is around 5 ms. For a document such as
`layout-parser-paper.pdf` it takes 4.46 ms.
Introduces functionality to take a list of unstructured elements (which
contain bounding boxes) and identify pairs of bounding boxes which
overlap and which case is pertinent to the pairing. This PR includes the
following methods in `utils.py`:
- **`ngrams(s, n)`**: Generate n-grams from a string
- **`calculate_shared_ngram_percentage(string_A, string_B, n)`**:
Calculate the percentage of `common_ngrams` between `string_A` and
`string_B` with reference to the total number of ngrams in `string_A`.
- **`calculate_largest_ngram_percentage(string_A, string_B)`**:
Iteratively call `calculate_shared_ngram_percentage` starting from the
biggest ngram possible until the shared percentage is >0.0%
- **`is_parent_box(parent_target, child_target, add=0)`**: True if the
`child_target` bounding box is nested in the `parent_target`. Box format:
[`x_bottom_left`, `y_bottom_left`, `x_top_right`, `y_top_right`]. The
parameter `add` is the pixel error tolerance for extra pixels outside
the parent region.
- **`calculate_overlap_percentage(box1, box2,
intersection_ratio_method="total")`**: Box format: [`x_bottom_left`,
`y_bottom_left`, `x_top_right`, `y_top_right`]. Calculates the
percentage of overlapped region with reference to biggest element-region
(`intersection_ratio_method="parent"`), the smallest element-region
(`intersection_ratio_method="partial"`), or to the disjunctive union
region (`intersection_ratio_method="total"`).
- **`identify_overlapping_or_nesting_case`**: Identify if there are
nested or overlapping elements. If overlapping is present,
it identifies the case calling the method `identify_overlapping_case`.
- **`identify_overlapping_case`**: Classifies the overlapping case for
an element_pair input in one of 5 categories of overlapping.
- **`catch_overlapping_and_nested_bboxes`**: Catch overlapping and
nested bounding box cases across a list of elements. The params
`nested_error_tolerance_px` and `sm_overlap_threshold` help control
the separation of the cases.
The overlapping/nested elements cases that are being caught are:
1. **Nested elements**
2. **Small partial overlap**
3. **Partial overlap with empty content**
4. **Partial overlap with duplicate text (sharing 100% of the text)**
5. **Partial overlap without sharing text**
6. **Partial overlap sharing**
{`calculate_largest_ngram_percentage(...)`}% **of the text**
Here is a snippet to test it:
```
from unstructured.partition.auto import partition
from unstructured.utils import catch_overlapping_and_nested_bboxes

model_name = "yolox_quantized"
target = "sample-docs/layout-parser-paper-fast.pdf"
elements = partition(filename=target, strategy='hi_res', model_name=model_name)
overlapping_flag, overlapping_cases = catch_overlapping_and_nested_bboxes(elements)
for case in overlapping_cases:
    print(case, "\n")
```
Here is a screenshot of a json built with the output list
`overlapping_cases`:
<img width="377" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/38184042/a6fea64b-d40a-4e01-beda-27840f4f4b3a">
Closes
[#1859](https://github.com/Unstructured-IO/unstructured/issues/1859).
* **Fixes elements partitioned from an image file missing certain
metadata** Metadata for image files, like file type, was being handled
differently from other file types. This caused a bug where other
metadata, like the file name, was missed. This change brings metadata
handling for image files more in line with the handling for other file
types, so that the file name and other metadata fields are captured.
Additionally:
* Added test to verify filename is being captured in metadata
* Cleaned up `CHANGELOG.md` formatting
#### Testing:
The following produces output `None` on `main`, but outputs the filename
`layout-parser-paper-fast.jpg` on this branch:
```python
from unstructured.partition.auto import partition
elements = partition("example-docs/layout-parser-paper-fast.jpg")
print(elements[0].metadata.filename)
```
### `chunk_by_title()` interface is "rude"
**Executive Summary.** Perhaps the most commonly specified option for
`chunk_by_title()` is `max_characters` (default: 500), which specifies
the chunk window size.
When a user specifies this value, they get an error message:
```python
>>> chunks = chunk_by_title(elements, max_characters=100)
ValueError: Invalid values for combine_text_under_n_chars, new_after_n_chars, and/or max_characters.
```
A few of the things that might reasonably pass through a user's mind at
such a moment are:
* "Is `110` not a valid value for `max_characters`? Why would that be?"
* "I didn't specify a value for `combine_text_under_n_chars` or
`new_after_n_chars`, in fact I don't know what they are because I
haven't studied the documentation and would prefer not to; I just want
smaller chunks! How could I supply an invalid value when I haven't
supplied any value at all for these?"
* "Which of these values is the problem? Why are you making me figure
that out for myself? I'm sure the code knows which one is not valid, why
doesn't it share that information with me? I'm busy here!"
In this particular case, the problem is that
`combine_text_under_n_chars` (defaults to 500) is greater than
`max_characters`, which means it would never take effect (which is
actually not a problem in itself).
To fix this, once figuring out that was the problem, probably after
opening an issue and maybe reading the source code, the user would need
to specify:
```python
>>> chunks = chunk_by_title(
... elements, max_characters=100, combine_text_under_n_chars=100
... )
```
This and other stressful user scenarios can be remedied by:
* Using "active" defaults for the `combine_text_under_n_chars` and
`new_after_n_chars` options.
* Providing a specific error message for each way a constraint may be
violated, such that direction to remedy the problem is immediately clear
to the user.
An *active default* is for example:
* Make the default for `combine_text_under_n_chars: int | None = None`
such that the code can detect when it has not been specified.
* When not specified, set its value to `max_characters`, the same as its
current (static) default.
This particular change would avoid the behavior in the motivating
example above.
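A minimal sketch of that active-default pattern, assuming this simplified signature (the real `chunk_by_title()` takes more arguments):
```python
from typing import Optional

def chunk_by_title(
    elements,
    max_characters: int = 500,
    combine_text_under_n_chars: Optional[int] = None,
):
    # Caller didn't specify a value: fall back to max_characters, which matches the
    # previous static default of 500 when max_characters is also left at its default.
    if combine_text_under_n_chars is None:
        combine_text_under_n_chars = max_characters
    ...
```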
Another alternative for this argument is simply:
```python
combine_text_under_n_chars = min(max_characters, combine_text_under_n_chars)
```
### Fix
1. Add constraint-specific error messages.
2. Use "active" defaults for `combine_text_under_n_ chars` and
`new_after_n_chars`.
3. Improve docstring to describe active defaults, and explain other
argument behaviors, in particular identifying suppression options like
`combine_text_under_n_chars = 0` to disable chunk combining.
### Summary
Closes unstructured-api issue
[188](https://github.com/Unstructured-IO/unstructured-api/issues/188)
The test and gist were using different versions of the same file
(jpg/pdf), creating what looked like a bug when there wasn't one. The
API is correctly using the `strategy` kwarg.
### Testing
#### Checkout to `main`
- Comment out the `@pytest.mark.skip` decorators for the
`test_partition_via_api_with_no_strategy` test
- Add an API key to your env:
- Add `from dotenv import load_dotenv; load_dotenv()` to the top of the
file and have `UNS_API_KEY` defined in `.env`
- Run `pytest test_unstructured/partition/test_api.py -k
"test_partition_via_api_with_no_strategy"`
^the test will fail
#### Checkout to this branch
- (make the same changes as above)
- Run `pytest test_unstructured/partition/test_api.py -k
"test_partition_via_api_with_no_strategy"`
### Other
`make tidy` and `make check` made linting changes to additional files
### Summary
A follow-up to
https://github.com/Unstructured-IO/unstructured/pull/1801: I forgot to
remove the lines that pass `extract_tables` to inference, and noted the
table regression if we only do one OCR pass for the entire doc.
**Tech details:**
* stop passing `extract_tables` parameter to inference
* added table extraction ingest test for image, which was skipped
before, and the "text_as_html" field contains the OCR output from the
table OCR refactor PR
* replaced `assert_called_once_with` with `call_args` so that the unit
tests don't need to test additional parameters
* added `error_margin` as an ENV variable used when comparing bounding
boxes of `ocr_region` with `table_element`
* added more tests for tables and noted the table regression in test for
partition pdf
### Test
* to verify we stopped passing the `extract_tables` parameter to
inference, run the test
`test_partition_pdf_hi_res_ocr_mode_with_table_extraction` before this
branch and you will see a warning like `Table OCR from get_tokens method
will be deprecated....`, which means it called the table OCR in the
inference repo. This branch removes the warning.
This PR resolves #1816
- current docx partitioning assumes all contents are in sections
- this is not true for MS Teams chat transcripts exported to docx
- now the code checks whether there are sections; if not, it iterates
through the paragraphs and partitions the contents of the paragraphs
Carrying `skip_infer_table_types` to `infer_table_structure` in the
partition flow. Now Table elements from PPT/X, DOC/X, etc. should not
have a `text_as_html` field.
Note: I've continued to exclude this var from partitioners that go
through the HTML flow; I think if we've already got the HTML it doesn't
make sense to carry the infer variable along, since we're not 'infer-ing'
the HTML table in these cases.
TODO:
✅ add unit tests
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: amanda103 <amanda103@users.noreply.github.com>
Closes `unstructured-inference` issue
[#265](https://github.com/Unstructured-IO/unstructured-inference/issues/265).
Cleaned up the kwarg handling, taking opportunities to turn instances of
handling kwargs as dicts into just using them as normal parameters in
function signatures.
#### Testing:
Should just pass CI.
**Executive Summary.** Code inspection in preparation for adding the
chunk-overlap feature revealed a bug causing split-chunks to be inserted
out-of-order. For example, elements like this:
```
Text("One" + 400 chars)
Text("Two" + 400 chars)
Text("Three" + 600 chars)
Text("Four" + 400 chars)
Text("Five" + 600 chars)
```
Should produce chunks:
```
CompositeElement("One ...") # (400 chars)
CompositeElement("Two ...") # (400 chars)
CompositeElement("Three ...") # (500 chars)
CompositeElement("rest of Three ...") # (100 chars)
CompositeElement("Four") # (400 chars)
CompositeElement("Five ...") # (500 chars)
CompositeElement("rest of Five ...") # (100 chars)
```
but produced this instead:
```
CompositeElement("Five ...") # (500 chars)
CompositeElement("rest of Five ...") # (100 chars)
CompositeElement("Three ...") # (500 chars)
CompositeElement("rest of Three ...") # (100 chars)
CompositeElement("One ...") # (400 chars)
CompositeElement("Two ...") # (400 chars)
CompositeElement("Four") # (400 chars)
```
This PR fixes that behavior, which was introduced on Oct 9 this year in
commit f98d5e65 when chunk splitting was added.
**Technical Summary**
The essential transformation of chunking is:
```
elements           sections                  chunks
List[Element]  ->  List[List[Element]]  ->   List[CompositeElement]
```
1. The _sectioner_ (`_split_elements_by_title_and_table()`) _groups_
semantically-related elements into _sections_ (`List[Element]`), in the
best case, that would be a title (heading) and the text that follows it
(until the next title). A heading and its text is often referred to as a
_section_ in publishing parlance, hence the name.
2. The _chunker_ (`chunk_by_title()` currently) does two things:
1. first it _consolidates_ the elements of each section into a single
`CompositeElement` object (a "chunk"). This includes both joining the
element text into a single string as well as consolidating the metadata
of the section elements.
2. then if necessary it _splits_ the chunk into two or more
`CompositeElement` objects when the consolidated text is too long to
fit in the specified window (`max_characters`).
Chunk splitting is only required when a single element (like a big
paragraph) has text longer than the specified window. Otherwise a
section and the chunk that derives from it reflects an even element
boundary.
`chunk_by_title()` was elaborated in commit f98d5e65 to add this
"chunk-splitting" behavior.
At the time there was some notion of wanting to "split from the end
backward" such that any small remainder chunk would appear first, and
could possibly be combined with a small prior chunk. To accomplish this,
split chunks were _inserted_ at the beginning of the list instead of
_appended_ to the end.
The `chunked_elements` variable (`List[CompositeElement]`) holds the
sequence of chunks that result from the chunking operation and is the
returned value of `chunk_by_title()`. The "split-from-the-end" chunks
were inserted at the beginning of this list, which unfortunately produces
the out-of-order behavior: the insertion happened at the beginning of the
"all-chunks-in-document" list, not a sublist just for the current chunk.
Further, the "split-from-the-end" behavior can produce no benefit
because chunks are never combined, only _elements_ are combined (across
semantic boundaries into a single section when a section is small) and
sectioning occurs _prior_ to chunking.
The fix is to rework the chunk-splitting passage into a straightforward
iterative algorithm that works both when a chunk must be split and when
it doesn't. This algorithm is also very easily extended to implement
split-chunk overlap, which is coming up in an immediately following PR.
```python
# -- split chunk into CompositeElement objects maxlen or smaller --
text_len = len(text)
start = 0
remaining = text_len
while remaining > 0:
    end = min(start + max_characters, text_len)
    chunked_elements.append(CompositeElement(text=text[start:end], metadata=chunk_meta))
    start = end - overlap
    remaining = text_len - end
```
*Forensic analysis*
The out-of-order-chunks behavior was introduced in commit 4ea71683 on
10/09/2023 in the same PR in which chunk-splitting was introduced.
---------
Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
This PR resolves #1807
- fix a bug where the code fails when a table's tagged content contains
a `thead` tag for the rows but no `tbody` tag
- now when there is no `tbody` in a table section we look for `thead`
instead
- when neither is found, return an empty table
This PR resolves #1754
- the function wrapper tries to use `cast` to convert kwargs into `str`,
but when a value is `None`, `cast(str, None)` still returns `None`
- the fix replaces the conversion with a simple `str()` function call
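For illustration, the runtime difference between the two:
```python
from typing import cast

value = None
cast(str, value)  # typing.cast is a no-op at runtime -> still None
str(value)        # actually converts -> "None"
```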
### Description
Given that many of the options associated with the `Click` based cli
ingest commands are added dynamically from a number of configs, a check
was incorporated to make sure there were no duplicate entries to prevent
new configs from overwriting already added options.
### Issues that were found and fixes:
* duplicate api-key option set on Notion command conflicts with api key
used for unstructured api. Added notion prefix.
* retry logic configs had duplicates in biomed. Removed since this is
not handled by the pipeline.
### Summary
Closes #1798
Fixes language detection of elements with empty strings: This resolves a
warning message that was raised by `langdetect` if the language was
attempted to be detected on an empty string. Language detection is now
skipped for empty strings.
### Testing
On the main branch this will log the warning "No features in text", but
nothing will be logged on this branch.
```
from unstructured.documents.elements import NarrativeText, PageBreak
from unstructured.partition.lang import apply_lang_metadata
elements = [NarrativeText("Sample text."), PageBreak("")]
elements = list(
apply_lang_metadata(
elements=elements,
languages=["auto"],
detect_language_per_element=True,
),
)
```
### Other
Also changes imports in test_lang.py so imports are explicit
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
**Executive Summary.** Introducing strict type-checking as preparation
for adding the chunk-overlap feature revealed a type mismatch for
regex-metadata between chunking tests and the (authoritative)
ElementMetadata definition. The implementation of regex-metadata aspects
of chunking passed the tests but did not produce the appropriate
behaviors in production where the actual data-structure was different.
This PR fixes these two bugs.
1. **Over-chunking.** The presence of `regex-metadata` in an element was
incorrectly being interpreted as a semantic boundary, leading to such
elements being isolated in their own chunks.
2. **Discarded regex-metadata.** regex-metadata present on the second or
later elements in a section (chunk) was discarded.
**Technical Summary**
The type of `ElementMetadata.regex_metadata` is `Dict[str,
List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text":
"this matched", "start": 7, "end": 19}`.
Multiple regexes can be specified, each with a name like "mail-stop",
"version", etc. Each of those may produce its own set of matches, like:
```python
>>> element.regex_metadata
{
"mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
"version": [
{"text": "current: v1.7.2", "start": 7, "end": 21},
{"text": "supersedes: v1.7.0", "start": 22, "end": 40},
],
}
```
*Forensic analysis*
* The regex-metadata feature was added by Matt Robinson on 06/16/2023
commit: 4ea71683. The regex_metadata data structure is the same as when
it was added.
* The chunk-by-title feature was added by Matt Robinson on 08/29/2023
commit: f6a745a7. The mistaken regex-metadata data structure in the
tests is present in that commit.
Looks to me like a mis-remembering of the regex-metadata data-structure
and insufficient type-checking rigor (type-checker strictness level set
too low) to warn of the mistake.
**Over-chunking Behavior**
The over-chunking looked like this:
Chunking three elements with regex metadata should combine them into a
single chunk (`CompositeElement` object), subject to maximum size rules
(default 500 chars).
```python
elements: List[Element] = [
Title(
"Lorem Ipsum",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
),
),
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit.",
metadata=ElementMetadata(
regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
),
),
Text(
"In rhoncus ipsum sed lectus porta volutpat.",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
),
),
]
chunks = chunk_by_title(elements)
assert chunks == [
CompositeElement(
"Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
" ipsum sed lectus porta volutpat."
)
]
```
Observed behavior looked like this:
```python
chunks => [
CompositeElement('Lorem Ipsum')
CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]
```
The fix changed the approach from breaking on any metadata field not in
a specified group (`regex_metadata` was missing from this group) to only
breaking on specified fields (whitelisting instead of blacklisting).
This avoids overchunking every time we add a new metadata field and is
also simpler and easier to understand. This change in approach is
discussed in more detail in #1790.
**Dropping regex-metadata Behavior**
Chunking this section:
```python
elements: List[Element] = [
Title(
"Lorem Ipsum",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
),
),
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit.",
metadata=ElementMetadata(
regex_metadata={
"dolor": [RegexMetadata(text="dolor", start=12, end=17)],
"ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
}
),
),
Text(
"In rhoncus ipsum sed lectus porta volutpat.",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
),
),
]
```
...should produce this regex_metadata on the single produced chunk:
```python
assert chunk == CompositeElement(
"Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
" ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
"dolor": [RegexMetadata(text="dolor", start=25, end=30)],
"ipsum": [
RegexMetadata(text="Ipsum", start=6, end=11),
RegexMetadata(text="ipsum", start=19, end=24),
RegexMetadata(text="ipsum", start=81, end=86),
],
}
```
but instead produced this:
```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```
Which is the regex-metadata from the first element only.
The fix was to remove the consolidation+adjustment process from inside
the "list-attribute-processing" loop (because regex-metadata is not a
list) and process regex metadata separately.
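A rough sketch of that separate processing (the helper name and exact placement are assumed): each element's matches get shifted by the length of the chunk text that precedes that element's text.
```python
def consolidate_regex_metadata(elements, separator="\n\n"):
    """Merge per-element regex_metadata, adjusting offsets to the concatenated text."""
    chunk_regex_meta = {}
    running_offset = 0
    for element in elements:
        for regex_name, matches in (element.metadata.regex_metadata or {}).items():
            for match in matches:
                chunk_regex_meta.setdefault(regex_name, []).append(
                    {
                        "text": match["text"],
                        "start": match["start"] + running_offset,
                        "end": match["end"] + running_offset,
                    }
                )
        running_offset += len(element.text) + len(separator)
    return chunk_regex_meta
```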
### Description
Currently, linting only takes place over the base unstructured directory,
but we support Python files throughout the repo. It makes sense for all
those files to also abide by the same linting rules, so the entire repo
is now inspected when the linters are run. Along with that, autoflake was
added as a linter, which has a lot of added benefits such as removing
unused imports that would currently break flake8 and require manual
intervention.
The only real relevant changes in this PR are in the `Makefile`,
`setup.cfg`, and `requirements/test.in`. The rest is the result of
running the linters.
The current code assumes the first line of csv and tsv files is a
header line. Most csv and tsv files don't have a header line, and even
for those that do, dropping this line may not be the desired behavior.
Here is a snippet of code that demonstrates the current behavior and the
proposed fix
```
import pandas as pd
from lxml.html.soupparser import fromstring as soupparser_fromstring
c1 = """
Stanley Cups,,
Team,Location,Stanley Cups
Blues,STL,1
Flyers,PHI,2
Maple Leafs,TOR,13
"""
f = "./test.csv"
with open(f, 'w') as ff:
    ff.write(c1)
print("Suggested Improvement Keep First Line")
table = pd.read_csv(f, header=None)
html_text = table.to_html(index=False, header=False, na_rep="")
text = soupparser_fromstring(html_text).text_content()
print(text)
print("\n\nOriginal Looses First Line")
table = pd.read_csv(f)
html_text = table.to_html(index=False, header=False, na_rep="")
text = soupparser_fromstring(html_text).text_content()
print(text)
```
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
**Executive Summary**
Adds a function to calculate the percent match between two element-type
frequency outputs from the `get_element_type_frequency` function.
**Technical Detail**
- The function takes two `Dict` input which both should be output from
`get_element_type_frequency`
- Implementors can define the weight `category_depth_weight` they want
to give to items that match on `type` but differ in `category_depth`
- The function first loops through the output item list to find and
count exact matches, collecting the remaining values for both output and
source in new lists (of `dict` type). Then it loops through the
remaining source items that were not an exact match to find `type`
matches, which are weighted by the `category_depth_weight` factor
defined earlier (default 0.5)
**Output**
output
```
{
("Title", 0): 2,
("Title", 1): 1,
("NarrativeText", None): 3,
("UncategorizedText", None): 1,
}
```
source
```
{
("Title", 0): 1,
("Title", 1): 2,
("NarrativeText", None): 5,
}
```
With this output and source, and a weight of 0.5, the % match will yield
5.5 / 8: 5 exact matches plus 1 partial match with 0.5 weight.
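A rough sketch (not the actual implementation; names are illustrative) of how that 5.5 / 8 figure can be computed from the two frequency dicts above:
```python
def percent_match(output, source, category_depth_weight=0.5):
    # exact matches on the full (type, category_depth) key
    exact = sum(min(count, source.get(key, 0)) for key, count in output.items())
    # leftovers after exact matching
    rem_out = {k: c - min(c, source.get(k, 0)) for k, c in output.items()}
    rem_src = {k: c - min(c, output.get(k, 0)) for k, c in source.items()}
    # partial matches: same element type, different category_depth
    out_by_type = {}
    for (el_type, _), count in rem_out.items():
        out_by_type[el_type] = out_by_type.get(el_type, 0) + count
    partial = 0.0
    for (el_type, _), count in rem_src.items():
        matched = min(count, out_by_type.get(el_type, 0))
        partial += matched * category_depth_weight
        out_by_type[el_type] = out_by_type.get(el_type, 0) - matched
    return (exact + partial) / sum(source.values())  # -> 5.5 / 8 for the example above
```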
---------
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
### Summary
Closes #1714
Changes the default value for `languages` to `None` for elements that
don't have text or whose language can't be detected.
### Testing
```
from unstructured.partition.auto import partition
filename = "example-docs/handbook-1p.docx"
elements = partition(filename=filename, detect_language_per_element=True)
# PageBreak elements don't have text and will be collected here
none_langs = [element for element in elements if element.metadata.languages is None]
none_langs[0].text
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Fixes recursion limit error that was being raised when partitioning
Excel documents of a certain size.
Previously we used a recursive method to find subtables within an excel
sheet. However this would run afoul of Python's recursion depth limit
when there was a contiguous block of more than 1000 cells within a
sheet. This function has been updated to use the NetworkX library which
avoids Python recursion issues.
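Illustrative only (the real `_get_connected_components` differs in detail): networkx can replace a recursive flood-fill for grouping contiguous occupied cells like this:
```python
import networkx as nx

def connected_cell_groups(occupied_cells):
    """occupied_cells: iterable of (row, col) tuples for non-empty cells."""
    cells = set(occupied_cells)
    graph = nx.Graph()
    graph.add_nodes_from(cells)
    for row, col in cells:
        # connect each occupied cell to its right and lower neighbors, if occupied
        for neighbor in ((row + 1, col), (row, col + 1)):
            if neighbor in cells:
                graph.add_edge((row, col), neighbor)
    return list(nx.connected_components(graph))
```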
* Updated `_get_connected_components` to use `networkx` graph methods
rather than implementing our own algorithm for finding contiguous groups
of cells within a sheet.
* Added a test and example doc that replicates the `RecursionError`
prior to the change.
* Added `networkx` to `extra_xlsx` dependencies and `pip-compile`d.
#### Testing:
The following run from a Python terminal should raise a `RecursionError`
on `main` and succeed on this branch:
```python
import sys
from unstructured.partition.xlsx import partition_xlsx
old_recursion_limit = sys.getrecursionlimit()
try:
sys.setrecursionlimit(1000)
filename = "example-docs/more-than-1k-cells.xlsx"
partition_xlsx(filename=filename)
finally:
sys.setrecursionlimit(old_recursion_limit)
```
Note: the recursion limit is different in different contexts. Checking
my own system, the default in a notebook seems to be 3000, but in a
terminal it's 1000. The documented Python default recursion limit is
1000.
The function `under_non_alpha_ratio` in
`unstructured.partition.text_type` was producing a divide-by-zero error.
After investigation I found this was possible when the function was
passed a string of all spaces.
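An illustrative reproduction of the guard (the return value for the empty case is assumed; the actual fix lives in `unstructured/partition/text_type.py`):
```python
def under_non_alpha_ratio(text: str, threshold: float = 0.5) -> bool:
    """True when the proportion of alpha characters (ignoring spaces) is below threshold."""
    alpha_count = sum(1 for char in text if char.strip() and char.isalpha())
    total_count = sum(1 for char in text if char.strip())
    if total_count == 0:
        # a string of all spaces previously triggered ZeroDivisionError here
        return False
    return alpha_count / total_count < threshold
```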
---------
Co-authored-by: cragwolfe <crag@unstructured.io>