unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-28 00:05:55 +00:00

Author	SHA1	Message	Date
Yao You	b530e0a2be	fix: partition docx from teams output (#1825 ) This PR resolves #1816 - current docx partition assumes all contents are in sections - this is not true for MS Teams chat transcript exported to docx - now the code checks if there are sections or not; if not then iterate through the paragraphs and partition contents in the paragraphs	2023-10-24 15:17:02 +00:00
Amanda Cameron	0584e1d031	chore: fix infer_table bug (#1833 ) Carrying `skip_infer_table_types` to `infer_table_structure` in partition flow. Now PPT/X, DOC/X, etc. Table elements should not have a `text_as_html` field. Note: I've continued to exclude this var from partitioners that go through html flow, I think if we've already got the html it doesn't make sense to carry the infer variable along, since we're not 'infer-ing' the html table in these cases. TODO: ✅ add unit tests --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: amanda103 <amanda103@users.noreply.github.com>	2023-10-24 00:11:53 +00:00
qued	7fdddfbc1e	chore: improve kwarg handling (#1810 ) Closes `unstructured-inference` issue [#265](https://github.com/Unstructured-IO/unstructured-inference/issues/265). Cleaned up the kwarg handling, taking opportunities to turn instances of handling kwargs as dicts to just using them as normal in function signatures. #### Testing: Should just pass CI.	2023-10-23 04:48:28 +00:00
Steve Canny	82c8adba3f	fix: split-chunks appear out-of-order (#1824 ) Executive Summary. Code inspection in preparation for adding the chunk-overlap feature revealed a bug causing split-chunks to be inserted out-of-order. For example, elements like this: ``` Text("One" + 400 chars) Text("Two" + 400 chars) Text("Three" + 600 chars) Text("Four" + 400 chars) Text("Five" + 600 chars) ``` Should produce chunks: ``` CompositeElement("One ...") # (400 chars) CompositeElement("Two ...") # (400 chars) CompositeElement("Three ...") # (500 chars) CompositeElement("rest of Three ...") # (100 chars) CompositeElement("Four") # (400 chars) CompositeElement("Five ...") # (500 chars) CompositeElement("rest of Five ...") # (100 chars) ``` but produced this instead: ``` CompositeElement("Five ...") # (500 chars) CompositeElement("rest of Five ...") # (100 chars) CompositeElement("Three ...") # (500 chars) CompositeElement("rest of Three ...") # (100 chars) CompositeElement("One ...") # (400 chars) CompositeElement("Two ...") # (400 chars) CompositeElement("Four") # (400 chars) ``` This PR fixes that behavior that was introduced on Oct 9 this year in commit: f98d5e65 when adding chunk splitting. Technical Summary The essential transformation of chunking is: ``` elements sections chunks List[Element] -> List[List[Element]] -> List[CompositeElement] ``` 1. The _sectioner_ (`_split_elements_by_title_and_table()`) _groups_ semantically-related elements into _sections_ (`List[Element]`), in the best case, that would be a title (heading) and the text that follows it (until the next title). A heading and its text is often referred to as a _section_ in publishing parlance, hence the name. 2. The _chunker_ (`chunk_by_title()` currently) does two things: 1. first it _consolidates_ the elements of each section into a single `ConsolidatedElement` object (a "chunk"). This includes both joining the element text into a single string as well as consolidating the metadata of the section elements. 2. 
then if necessary it _splits_ the chunk into two or more `ConsolidatedElement` objects when the consolidated text is too long to fit in the specified window (`max_characters`). Chunk splitting is only required when a single element (like a big paragraph) has text longer than the specified window. Otherwise a section and the chunk that derives from it reflect an even element boundary. `chunk_by_title()` was elaborated in commit f98d5e65 to add this "chunk-splitting" behavior. At the time there was some notion of wanting to "split from the end backward" such that any small remainder chunk would appear first, and could possibly be combined with a small prior chunk. To accomplish this, split chunks were _inserted_ at the beginning of the list instead of _appended_ to the end. The `chunked_elements` variable (`List[CompositeElement]`) holds the sequence of chunks that result from the chunking operation and is the returned value for `chunk_by_title()`. This is the list that "split-from-the-end" chunks were inserted at the beginning of, which unfortunately produces this out-of-order behavior because the insertion was at the beginning of this "all-chunks-in-document" list, not a sublist just for this chunk. Further, the "split-from-the-end" behavior can produce no benefit because chunks are never combined, only _elements_ are combined (across semantic boundaries into a single section when a section is small) and sectioning occurs _prior_ to chunking. The fix is to rework the chunk-splitting passage to a straightforward iterative algorithm that works both when a chunk must be split and when it doesn't. This algorithm is also very easily extended to implement split-chunk-overlap which is coming up in an immediately following PR. 
```python
# -- split chunk into CompositeElements objects maxlen or smaller --
text_len = len(text)
start = 0
remaining = text_len
while remaining > 0:
    end = min(start + max_characters, text_len)
    chunked_elements.append(CompositeElement(text=text[start:end], metadata=chunk_meta))
    start = end - overlap
    remaining = text_len - end
```
Forensic analysis The out-of-order-chunks behavior was introduced in commit f98d5e65 on 10/09/2023 in the same PR in which chunk-splitting was introduced. --------- Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com> Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2023-10-21 01:37:34 +00:00
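A minimal, self-contained sketch of the iterative split loop described in #1824, operating on plain strings. The function name `split_text` and the argument checks are illustrative assumptions and not part of the library; the real code appends `CompositeElement` objects to `chunked_elements`.

```python
from typing import List


def split_text(text: str, max_characters: int, overlap: int = 0) -> List[str]:
    """Split `text` into in-order windows of at most `max_characters` characters.

    Consecutive windows share `overlap` characters. Hypothetical helper for
    illustration only.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")

    pieces: List[str] = []
    text_len = len(text)
    start = 0
    remaining = text_len
    while remaining > 0:
        end = min(start + max_characters, text_len)
        # Appending (not inserting at index 0) keeps the pieces in document order.
        pieces.append(text[start:end])
        start = end - overlap
        remaining = text_len - end
    return pieces


if __name__ == "__main__":
    print([len(piece) for piece in split_text("x" * 600, 500)])
    print(split_text("abcdefghij", 4, overlap=1))
```

Because every piece is appended to the end of the result, a text that fits in the window and a text that must be split go through the same loop.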
Yuming Long	ce40cdc55f	Chore (refactor): support table extraction with pre-computed ocr data (#1801 ) ### Summary Table OCR refactor: move the OCR part for the table model from the inference repo to the unst repo. * Before this PR, the table model extracted OCR tokens with texts and bounding boxes and filled the tokens into the table structure in the inference repo. This meant we needed to do an additional OCR for tables. * After this PR, we use the OCR data from the entire-page OCR and pass the OCR tokens to the inference repo, which means we only do one OCR for the entire document. Tech details: * Combined env `ENTIRE_PAGE_OCR` and `TABLE_OCR` into `OCR_AGENT`; this means we use the same OCR agent for the entire page and for tables since we only do one OCR. * Bump inference repo to `0.7.9`, which allows the table model in inference to use pre-computed OCR data from the unst repo. Please check in [PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256). * All notebook lint changes are made by `make tidy` * This PR also fixes [issue](https://github.com/Unstructured-IO/unstructured/issues/1564); I've added a test for the issue in `test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages` * Add the same scaling logic to the image [similar to previous Table OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113), but now scaling is applied to the entire image ### Test * Not much to test manually, except that table extraction still works * But due to the change in scaling and the use of pre-computed OCR data from the entire page, there are some slight (better) changes in table output; here is a comparison of test outputs I found from the same test `test_partition_image_with_table_extraction`: screen shot for table in `layout-parser-paper-with-table.jpg`: <img width="343" alt="expected" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c"> before refactor: <img width="709" alt="before" 
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e"> after refactor: <img width="705" alt="after" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d"> ### TODO (added as a ticket) There is still some cleanup to do in the inference repo since the unst repo now has duplicate logic, but we can keep it as a fallback plan. If we want to remove anything OCR related in inference, here are the items that are deprecated and can be removed: * [`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77) (already noted in code) * parameter `extract_tables` in inference * [`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88) * [`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197) * env `TABLE_OCR` ### Note If we want to fall back to an additional table OCR (we may need this when using paddle for tables), we need to: * pass `infer_table_structure` to inference with the `extract_tables` parameter * stop passing `infer_table_structure` to `ocr.py` --------- Co-authored-by: Yao You <yao@unstructured.io>	2023-10-21 00:24:23 +00:00
Yao You	3437a23c91	fix: partition html fail with table without tbody (#1817 ) This PR resolves #1807 - fix a bug where the code fails when table-tagged content contains no `tbody` tag but uses a `thead` tag for the rows - now when there is no `tbody` in a table section we try to look for `thead` instead - when neither is found, return an empty table	2023-10-20 23:21:59 +00:00
Yao You	aa7b7c87d6	fix: model_name being None raises attribution error (#1822 ) This PR resolves #1754 - function wrapper tries to use `cast` to convert kwargs into `str` but when a value is `None` `cast(str, None)` still returns `None` - fix replaces the conversion to simply using `str()` function call	2023-10-20 21:08:17 +00:00
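A short illustration of the behavior #1822 describes, using only the standard library: `typing.cast` is a hint for the type checker and returns its argument unchanged at runtime, while `str()` performs a real conversion. The variable names are illustrative.

```python
from typing import cast

model_name = None

# cast() only informs the type checker; at runtime the value passes through as-is.
as_cast = cast(str, model_name)

# str() actually converts, so the result is always a str instance.
as_str = str(model_name)

print(type(as_cast).__name__)
print(repr(as_str))
```

Note that `str(None)` yields the string `"None"`, so callers that need to distinguish a missing value still have to check for `None` before converting.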
Roman Isecke	63861f537e	Add check for duplicate click options (#1775 ) ### Description Given that many of the options associated with the `Click` based cli ingest commands are added dynamically from a number of configs, a check was incorporated to make sure there were no duplicate entries to prevent new configs from overwriting already added options. ### Issues that were found and fixes: * duplicate api-key option set on Notion command conflicts with api key used for unstructured api. Added notion prefix. * retry logic configs had duplicates in biomed. Removed since this is not handled by the pipeline.	2023-10-20 14:00:19 +00:00
John	fb2a1d42ce	Jj/1798 languages warning (#1805 ) ### Summary Closes #1798 Fixes language detection of elements with empty strings: This resolves a warning message that was raised by `langdetect` if the language was attempted to be detected on an empty string. Language detection is now skipped for empty strings. ### Testing on the main branch this will log the warning "No features in text", but it will not log anything on this branch. ``` from unstructured.documents.elements import NarrativeText, PageBreak from unstructured.partition.lang import apply_lang_metadata elements = [NarrativeText("Sample text."), PageBreak("")] elements = list( apply_lang_metadata( elements=elements, languages=["auto"], detect_language_per_element=True, ), ) ``` ### Other Also changes imports in test_lang.py so imports are explicit --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-20 04:15:28 +00:00
Steve Canny	d9c2516364	fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted (#1779 ) Executive Summary. Introducing strict type-checking as preparation for adding the chunk-overlap feature revealed a type mismatch for regex-metadata between chunking tests and the (authoritative) ElementMetadata definition. The implementation of regex-metadata aspects of chunking passed the tests but did not produce the appropriate behaviors in production where the actual data-structure was different. This PR fixes these two bugs. 1. Over-chunking. The presence of `regex-metadata` in an element was incorrectly being interpreted as a semantic boundary, leading to such elements being isolated in their own chunks. 2. Discarded regex-metadata. regex-metadata present on the second or later elements in a section (chunk) was discarded. Technical Summary The type of `ElementMetadata.regex_metadata` is `Dict[str, List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text": "this matched", "start": 7, "end": 19}`. Multiple regexes can be specified, each with a name like "mail-stop", "version", etc. Each of those may produce its own set of matches, like: ```python >>> element.regex_metadata { "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}], "version": [ {"text": "current: v1.7.2", "start": 7, "end": 21}, {"text": "supersedes: v1.7.0", "start": 22, "end": 40}, ], } ``` Forensic analysis * The regex-metadata feature was added by Matt Robinson on 06/16/2023 commit: 4ea71683. The regex_metadata data structure is the same as when it was added. * The chunk-by-title feature was added by Matt Robinson on 08/29/2023 commit: f6a745a7. The mistaken regex-metadata data structure in the tests is present in that commit. Looks to me like a mis-remembering of the regex-metadata data-structure and insufficient type-checking rigor (type-checker strictness level set too low) to warn of the mistake. 
Over-chunking Behavior The over-chunking looked like this: Chunking three elements with regex metadata should combine them into a single chunk (`CompositeElement` object), subject to maximum size rules (default 500 chars). ```python elements: List[Element] = [ Title( "Lorem Ipsum", metadata=ElementMetadata( regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]} ), ), Text( "Lorem ipsum dolor sit amet consectetur adipiscing elit.", metadata=ElementMetadata( regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]} ), ), Text( "In rhoncus ipsum sed lectus porta volutpat.", metadata=ElementMetadata( regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]} ), ), ] chunks = chunk_by_title(elements) assert chunks == [ CompositeElement( "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus" " ipsum sed lectus porta volutpat." ) ] ``` Observed behavior looked like this: ```python chunks => [ CompositeElement('Lorem Ipsum') CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.') CompositeElement('In rhoncus ipsum sed lectus porta volutpat.') ] ``` The fix changed the approach from breaking on any metadata field not in a specified group (`regex_metadata` was missing from this group) to only breaking on specified fields (whitelisting instead of blacklisting). This avoids overchunking every time we add a new metadata field and is also simpler and easier to understand. This change in approach is discussed in more detail here #1790. 
Dropping regex-metadata Behavior Chunking this section: ```python elements: List[Element] = [ Title( "Lorem Ipsum", metadata=ElementMetadata( regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]} ), ), Text( "Lorem ipsum dolor sit amet consectetur adipiscing elit.", metadata=ElementMetadata( regex_metadata={ "dolor": [RegexMetadata(text="dolor", start=12, end=17)], "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)], } ), ), Text( "In rhoncus ipsum sed lectus porta volutpat.", metadata=ElementMetadata( regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]} ), ), ] ``` ..should produce this regex_metadata on the single produced chunk: ```python assert chunk == CompositeElement( "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus" " ipsum sed lectus porta volutpat." ) assert chunk.metadata.regex_metadata == { "dolor": [RegexMetadata(text="dolor", start=25, end=30)], "ipsum": [ RegexMetadata(text="Ipsum", start=6, end=11), RegexMetadata(text="ipsum", start=19, end=24), RegexMetadata(text="ipsum", start=81, end=86), ], } ``` but instead produced this: ```python regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]} ``` Which is the regex-metadata from the first element only. The fix was to remove the consolidation+adjustment process from inside the "list-attribute-processing" loop (because regex-metadata is not a list) and process regex metadata separately.	2023-10-19 22:16:02 -05:00
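A self-contained sketch of the offset adjustment that the expected output in #1779 implies: when element texts are joined with a separator, each match's `start`/`end` shifts by the length of everything that precedes its element. The helper name `consolidate_regex_metadata` and the `"\n\n"` separator default are assumptions made for illustration (the separator is inferred from the expected chunk text above); this is not the library's implementation.

```python
from typing import Dict, List, Tuple, TypedDict


class RegexMetadata(TypedDict):
    text: str
    start: int
    end: int


def consolidate_regex_metadata(
    parts: List[Tuple[str, Dict[str, List[RegexMetadata]]]],
    separator: str = "\n\n",
) -> Dict[str, List[RegexMetadata]]:
    """Merge per-element regex matches, shifting offsets into the joined text."""
    consolidated: Dict[str, List[RegexMetadata]] = {}
    offset = 0
    for text, regex_meta in parts:
        for name, matches in regex_meta.items():
            for match in matches:
                consolidated.setdefault(name, []).append(
                    RegexMetadata(
                        text=match["text"],
                        start=match["start"] + offset,
                        end=match["end"] + offset,
                    )
                )
        # The next element starts after this text plus the separator.
        offset += len(text) + len(separator)
    return consolidated


parts = [
    ("Lorem Ipsum", {"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}),
    (
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        {
            "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
            "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
        },
    ),
    (
        "In rhoncus ipsum sed lectus porta volutpat.",
        {"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]},
    ),
]
merged = consolidate_regex_metadata(parts)
joined = "\n\n".join(text for text, _ in parts)
print(merged)
```

Every adjusted span can be checked by slicing the joined text, for example `joined[81:86]` for the last `ipsum` match.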
Mallori Harrell	00635744ed	feat: Adds local embedding model (#1619 ) This PR adds a local embedding model option as an alternative to using our OpenAI embedding brick. This brick uses LangChain's HuggingFaceEmbeddings.	2023-10-19 11:51:36 -05:00
Roman Isecke	b265d8874b	refactoring linting (#1739 ) ### Description Currently linting only takes place over the base unstructured directory but we support python files throughout the repo. It makes sense for all those files to also abide by the same linting rules so the entire repo was set to be inspected when the linters are run. Along with that autoflake was added as a linter which has a lot of added benefits such as removing unused imports for you that would currently break flake and require manual intervention. The only real relevant changes in this PR are in the `Makefile`, `setup.cfg`, and `requirements/test.in`. The rest is the result of running the linters.	2023-10-17 12:45:12 +00:00
Léa	89fa88f076	fix: stop csv and tsv dropping the first line of the file (#1530 ) The current code assumes the first line of csv and tsv files is a header line. Most csv and tsv files don't have a header line, and even for those that do, dropping this line may not be the desired behavior. Here is a snippet of code that demonstrates the current behavior and the proposed fix:
```
import pandas as pd
from lxml.html.soupparser import fromstring as soupparser_fromstring

c1 = """
Stanley Cups,,
Team,Location,Stanley Cups
Blues,STL,1
Flyers,PHI,2
Maple Leafs,TOR,13
"""

f = "./test.csv"
with open(f, 'w') as ff:
    ff.write(c1)

print("Suggested Improvement Keep First Line")
table = pd.read_csv(f, header=None)
html_text = table.to_html(index=False, header=False, na_rep="")
text = soupparser_fromstring(html_text).text_content()
print(text)

print("\n\nOriginal Loses First Line")
table = pd.read_csv(f)
html_text = table.to_html(index=False, header=False, na_rep="")
text = soupparser_fromstring(html_text).text_content()
print(text)
```
--------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com> Co-authored-by: Yao You <yao@unstructured.io>	2023-10-16 17:59:35 -05:00
Klaijan	ba4c649cf0	feat: calculate element type percent match (#1723 ) Executive Summary Adds a function to calculate the percent match between two element-type frequency outputs of the `get_element_type_frequency` function. Technical Detail - The function takes two `Dict` inputs, both of which should be outputs of `get_element_type_frequency` - Implementors can set the weight `category_depth_weight` given to items that match on `type` but differ in `category_depth` - The function first loops through the output items to find and count exact matches, collecting the remaining values for both output and source in new lists (of `dict` type). It then loops through the source items that were not exact matches to find `type` matches, which are weighted by `category_depth_weight` (default 0.5). Output output ``` { ("Title", 0): 2, ("Title", 1): 1, ("NarrativeText", None): 3, ("UncategorizedText", None): 1, } ``` source ``` { ("Title", 0): 1, ("Title", 1): 2, ("NarrativeText", None): 5, } ``` With this output and source and a weight of 0.5, the percent match is 5.5 / 8: 5 exact matches plus 1 partial match weighted at 0.5. --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2023-10-16 17:57:28 +00:00
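The two-pass matching described in #1723 can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `percent_match_sketch` is hypothetical, leftovers are pooled per `type`, and the score is normalized by the total source count (which reproduces the 5.5 / 8 example). It is not the library's implementation.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

Key = Tuple[str, Optional[int]]  # (element type, category_depth)


def percent_match_sketch(
    output: Dict[Key, int],
    source: Dict[Key, int],
    category_depth_weight: float = 0.5,
) -> float:
    total_source = sum(source.values())
    if total_source == 0:
        return 0.0  # assumption: nothing to match against

    matched = 0.0
    leftover_output: Counter = Counter()  # unmatched counts, keyed by type only
    leftover_source: Counter = Counter()

    # Pass 1: exact matches on (type, category_depth).
    for key in set(output) | set(source):
        out_n, src_n = output.get(key, 0), source.get(key, 0)
        exact = min(out_n, src_n)
        matched += exact
        leftover_output[key[0]] += out_n - exact
        leftover_source[key[0]] += src_n - exact

    # Pass 2: same type but different depth counts at a reduced weight.
    for element_type, src_n in leftover_source.items():
        matched += category_depth_weight * min(src_n, leftover_output[element_type])

    return min(matched / total_source, 1.0)


output = {
    ("Title", 0): 2,
    ("Title", 1): 1,
    ("NarrativeText", None): 3,
    ("UncategorizedText", None): 1,
}
source = {("Title", 0): 1, ("Title", 1): 2, ("NarrativeText", None): 5}
print(percent_match_sketch(output, source))
```

With the example dictionaries from the commit message, pass 1 finds 5 exact matches and pass 2 finds one `Title` match at a different depth, giving 5.5 / 8.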
John	6d7fe3ab02	fix: default to None for the languages metadata field (#1743 ) ### Summary Closes #1714 Changes the default value for `languages` to `None` for elements that don't have text or the language can't be detected. ### Testing ``` from unstructured.partition.auto import partition filename = "example-docs/handbook-1p.docx" elements = partition(filename=filename, detect_language_per_element=True) # PageBreak elements don't have text and will be collected here none_langs = [element for element in elements if element.metadata.languages is None] none_langs[0].text ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-14 22:46:24 +00:00
qued	cf31c9a2c4	fix: use nx to avoid recursion limit (#1761 ) Fixes recursion limit error that was being raised when partitioning Excel documents of a certain size. Previously we used a recursive method to find subtables within an excel sheet. However this would run afoul of Python's recursion depth limit when there was a contiguous block of more than 1000 cells within a sheet. This function has been updated to use the NetworkX library which avoids Python recursion issues. * Updated `_get_connected_components` to use `networkx` graph methods rather than implementing our own algorithm for finding contiguous groups of cells within a sheet. * Added a test and example doc that replicates the `RecursionError` prior to the change. * Added `networkx` to `extra_xlsx` dependencies and `pip-compile`d. #### Testing: The following run from a Python terminal should raise a `RecursionError` on `main` and succeed on this branch: ```python import sys from unstructured.partition.xlsx import partition_xlsx old_recursion_limit = sys.getrecursionlimit() try: sys.setrecursionlimit(1000) filename = "example-docs/more-than-1k-cells.xlsx" partition_xlsx(filename=filename) finally: sys.setrecursionlimit(old_recursion_limit) ``` Note: the recursion limit is different in different contexts. Checking my own system, the default in a notebook seems to be 3000, but in a terminal it's 1000. The documented Python default recursion limit is 1000.	2023-10-14 19:38:21 +00:00
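To illustrate why replacing recursion removes the depth limit described in #1761, here is a standard-library sketch that groups contiguous cells with an explicit queue (breadth-first search). It is a swapped-in technique for illustration: the commit itself adopted `networkx` graph methods, and the function name `connected_cell_groups` and the 4-neighbor adjacency rule are assumptions.

```python
from collections import deque
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, int]  # (row, column)


def connected_cell_groups(cells: Iterable[Cell]) -> List[Set[Cell]]:
    """Group cells that touch horizontally or vertically, without recursion."""
    unvisited = set(cells)
    groups: List[Set[Cell]] = []
    while unvisited:
        seed = unvisited.pop()
        group = {seed}
        queue = deque([seed])
        while queue:
            row, col = queue.popleft()
            neighbors = ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1))
            for neighbor in neighbors:
                if neighbor in unvisited:
                    unvisited.remove(neighbor)
                    group.add(neighbor)
                    queue.append(neighbor)
        groups.append(group)
    return groups


# A single column of 5000 contiguous cells plus a separate two-cell block.
cells = {(row, 0) for row in range(5000)} | {(0, 5), (0, 6)}
groups = connected_cell_groups(cells)
print(sorted(len(group) for group in groups))
```

The queue grows on the heap rather than the call stack, so the size of a contiguous block is bounded by memory and not by `sys.getrecursionlimit()`.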
qued	95728ead0f	fix: zero divide in under_non_alpha_ratio (#1753 ) The function `under_non_alpha_ratio` in `unstructured.partition.text_type` was producing a divide-by-zero error. After investigation I found this was a possibility when the function was passed a string of all spaces. --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-13 21:20:01 +00:00
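A minimal sketch of the kind of guard #1753 describes. The function name, the `threshold` default, and the exact ratio definition (alphabetic characters over non-whitespace characters) are assumptions for illustration and are not copied from `unstructured.partition.text_type`.

```python
def under_non_alpha_ratio_sketch(text: str, threshold: float = 0.5) -> bool:
    """Return True when the share of alphabetic characters is below `threshold`.

    Whitespace is ignored, so an all-spaces string has zero countable
    characters and must be handled before dividing.
    """
    total_count = sum(1 for char in text if not char.isspace())
    if total_count == 0:
        return False  # assumption: nothing to measure, so not "under" the ratio
    alpha_count = sum(1 for char in text if char.isalpha())
    return alpha_count / total_count < threshold


print(under_non_alpha_ratio_sketch("     "))
print(under_non_alpha_ratio_sketch("----a"))
print(under_non_alpha_ratio_sketch("hello world"))
```

Checking only `len(text) == 0` is not enough, because a string of spaces is non-empty yet contributes nothing to the denominator.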
Steve Canny	4b84d596c2	docx: add hyperlink metadata (#1746 )	2023-10-13 06:26:14 +00:00
qued	8100f1e7e2	chore: process chipper hierarchy (#1634 ) PR to support schema changes introduced from [PR 232](https://github.com/Unstructured-IO/unstructured-inference/pull/232) in `unstructured-inference`. Specifically what needs to be supported is: * Change to the way `LayoutElement` from `unstructured-inference` is structured, specifically that this class is no longer a subclass of `Rectangle`, and instead `LayoutElement` has a `bbox` property that captures the location information and a `from_coords` method that allows construction of a `LayoutElement` directly from coordinates. * Removal of `LocationlessLayoutElement` since chipper now exports bounding boxes, and if we need to support elements without bounding boxes, we can make the `bbox` property mentioned above optional. * Getting hierarchy data directly from the inference elements rather than in post-processing * Don't try to reorder elements received from chipper v2, as they should already be ordered. #### Testing: The following demonstrates that the new version of chipper is inferring hierarchy. ```python from unstructured.partition.pdf import partition_pdf elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper") children = [el for el in elements if el.metadata.parent_id is not None] print(children) ``` Also verify that running the traditional `hi_res` gives different results: ```python from unstructured.partition.pdf import partition_pdf elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res") ``` --------- Co-authored-by: Sebastian Laverde Alfonso <lavmlk20201@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-10-13 01:28:46 +00:00
ryannikolaidis	40523061ca	fix: _add_embeddings_to_elements bug resulting in duplicated elements (#1719 ) Currently when the OpenAIEmbeddingEncoder adds embeddings to Elements in `_add_embeddings_to_elements` it overwrites each Element's `to_dict` method, mistakenly resulting in each Element having identical values with the exception of the actual embedding value. This was due to the way it leverages a nested `new_to_dict` method to overwrite. Instead, this updates the original definition of Element itself to accommodate the `embeddings` field when available. This also adds a test to validate that values are not duplicated.	2023-10-12 21:47:32 +00:00
Roman Isecke	ebf0722dcc	roman/ingest continue on error (#1736 ) ### Description Add flag to raise an error on failure but default to only log it and continue with other docs	2023-10-12 21:33:10 +00:00
Steve Canny	d726963e42	serde tests round-trip through JSON (#1681 ) Each partitioner has a test like `test_partition_x_with_json()`. What these do is serialize the elements produced by the partitioner to JSON, then read them back in from JSON and compare the before and after elements. Because our element equality (`Element.__eq__()`) is shallow, this doesn't tell us a lot, but if we take it one more step, like `List[Element] -> JSON -> List[Element] -> JSON` and then compare the JSON, it gives us some confidence that the serialized elements can be "re-hydrated" without losing any information. This actually showed up a few problems, all in the serialization/deserialization (serde) code that all elements share.	2023-10-12 19:47:55 +00:00
Inscore	8ab40c20c1	fix: correct PDF list item parsing (#1693 ) The current implementation removes elements from the beginning of the element list and duplicates the list items --------- Co-authored-by: Klaijan <klaijan@unstructured.io> Co-authored-by: yuming <305248291@qq.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2023-10-11 20:38:36 +00:00
John	9500d04791	detect document language across all partitioners (#1627 ) ### Summary Closes #1534 and #1535 Detects document language using `langdetect` package. Creates new kwargs for user to set the document language (`languages`) or detect the language at the element level instead of the default document level (`detect_language_per_element`) --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Austin Walker <austin@unstructured.io>	2023-10-11 01:47:56 +00:00
Klaijan	ee75ce25e2	feat: element type frequency (#1688 ) Executive Summary Add function that returns frequency of given element types and depth. --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2023-10-11 00:36:44 +00:00
shreyanid	9d228c7ecb	feat: calculate metric for percent of text missing (#1701 ) ### Summary Missing text is a particularly important metric of quality for the Unstructured library because information from the document is not being captured and therefore not usable by downstream applications. Add function to calculate the percent of text missing relative to the source transcription. Function takes 2 text strings (output and source) as input, and returns the percentage of text missing as a decimal. ### Technical Details - The 2 input strings are both assumed to already contain clean and concatenated text (CCT) - Implementation compares the bags of words (frequency counts for each word present in the text) of each input text - Duplicated/extra text is not penalized - Value is limited to the range [0, 1] ### Test - Several edge cases are covered in the test function (missing text, duplicated text, spaced out words, etc). - Can test other cases or text inputs by calling the function with 2 CCT strings as "output" and "source"	2023-10-10 20:54:49 +00:00
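A rough sketch of the bag-of-words comparison outlined in #1701, using `collections.Counter`. The function name `percent_missing_sketch` and the whitespace tokenization are simplifying assumptions; the library's own `bag_of_words` may tokenize differently.

```python
from collections import Counter


def percent_missing_sketch(output: str, source: str) -> float:
    """Fraction of source words (by frequency) that do not appear in output."""
    source_bag = Counter(source.split())
    output_bag = Counter(output.split())
    total_source_words = sum(source_bag.values())
    if total_source_words == 0:
        return 0.0  # assumption: an empty source has nothing to miss

    # Counter subtraction keeps only positive counts, so duplicated or extra
    # words in the output are not penalized.
    missing = sum((source_bag - output_bag).values())
    return min(max(missing / total_source_words, 0.0), 1.0)


print(percent_missing_sketch("the cat sat", "the cat sat on the mat"))
print(percent_missing_sketch("a a b c d extra", "a b c d"))
```

In the first call the output lacks `on`, one `the`, and `mat`, which is three of the six source words.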
Yuming Long	e597ec7a0f	Fix: skip empty annotation bbox (#1665 ) Address: https://github.com/Unstructured-IO/unstructured/issues/1663 ## Summary To measure the overlap between an element bbox and an annotation bbox, we find the intersection of the two bboxes and divide it by the size of the annotation bbox; this causes a zero division error if the size of the annotation bbox is 0. * this PR fixes the zero division error in function `check_annotations_within_element` * it also fixes the error `TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'` by no longer inserting empty words with a None bbox into the list of words in function `get_word_bounding_box_from_element` ## Test Reproduce with the code and document the user mentioned; there should be no error:
```
from unstructured.partition.auto import partition

elements = partition(
    filename="./IZSAM8.2_221012.pdf",
    strategy="fast",
)
```
	2023-10-10 20:48:44 +00:00
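A small sketch of the overlap computation and zero-area guard described in #1665. The function name `overlap_fraction` and the `(x0, y0, x1, y1)` tuple layout are assumptions for illustration, not the signature of `check_annotations_within_element`.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlap_fraction(element_bbox: BBox, annotation_bbox: BBox) -> float:
    """Share of the annotation bbox area that lies inside the element bbox."""
    ex0, ey0, ex1, ey1 = element_bbox
    ax0, ay0, ax1, ay1 = annotation_bbox

    annotation_area = max(ax1 - ax0, 0) * max(ay1 - ay0, 0)
    if annotation_area == 0:
        return 0.0  # skip empty annotation boxes instead of dividing by zero

    intersection_width = max(min(ex1, ax1) - max(ex0, ax0), 0)
    intersection_height = max(min(ey1, ay1) - max(ey0, ay0), 0)
    return (intersection_width * intersection_height) / annotation_area


print(overlap_fraction((0, 0, 10, 10), (5, 5, 15, 15)))
print(overlap_fraction((0, 0, 10, 10), (3, 3, 3, 8)))
```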
Mallori Harrell	a5d7ae4611	Feat: Bag of words for testing metric (#1650 ) This PR adds the `bag_of_words` function to count the frequency of words for evaluation. Testing
```python
from unstructured.cleaners.core import bag_of_words

string = "The dog loved the cat, but the cat loved the cow."
print(bag_of_words(string))
```
--------- Co-authored-by: Mallori Harrell <mallori@Malloris-MacBook-Pro.local> Co-authored-by: Klaijan <klaijan@unstructured.io> Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com> Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2023-10-10 18:46:01 +00:00
Amanda Cameron	f98d5e65ca	chore: adding max_characters to other element type chunking (#1673 ) This PR adds the `max_characters` (hard max) param to non-table element chunking. Additionally updates the `num_characters` metadata to `max_characters` to make it clearer which param we're referencing. To test: ``` from unstructured.partition.html import partition_html filename = "example-docs/example-10k-1p.html" chunk_elements = partition_html( filename, chunking_strategy="by_title", combine_text_under_n_chars=0, new_after_n_chars=50, max_characters=100, ) for chunk in chunk_elements: print(len(chunk.text)) # previously we were only respecting the "soft max" (default of 500) for elements other than tables # now we should see that all the elements have text fields under 100 chars. ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-09 19:42:36 +00:00
Klaijan	33edbf84f5	feat: add calculate edit distance feature (#1656 ) Executive Summary Adds a function to calculate the edit distance (Levenshtein distance) between two strings. The function can return either: 1. score (similarity = 1 - distance/source_len) 2. distance (raw Levenshtein distance) Technical details - The `weights` param defaults to (2,1,1) for (insertion, deletion, substitution), meaning that we penalize the insertions needed to turn the output (target) into the source (reference). In other words, missing extraction is penalized more heavily. - The function takes 2 strings on the assumption that both strings are already clean and concatenated (CCT) Important Note! The test case needs to be updated to use CCT once the function is ready. The current test covers only the basic functionality of edit distance, not edit distance on CCT input as intended. --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-07 01:21:14 +00:00
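A pure-Python sketch of a weighted Levenshtein distance with the (insertion, deletion, substitution) weights from #1656, plus the similarity score formula quoted there. The function names and the empty-source handling are assumptions for illustration; this is not the library's implementation.

```python
from typing import Tuple


def weighted_edit_distance(
    output: str, source: str, weights: Tuple[int, int, int] = (2, 1, 1)
) -> int:
    """Cost of turning `output` into `source` with per-operation weights."""
    insertion, deletion, substitution = weights
    # previous[j] = cost of turning the processed prefix of output into source[:j]
    previous = [j * insertion for j in range(len(source) + 1)]
    for i, out_char in enumerate(output, start=1):
        current = [i * deletion]
        for j, src_char in enumerate(source, start=1):
            substitute = previous[j - 1] + (0 if out_char == src_char else substitution)
            delete = previous[j] + deletion
            insert = current[j - 1] + insertion
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


def similarity_score(output: str, source: str, weights: Tuple[int, int, int] = (2, 1, 1)) -> float:
    """similarity = 1 - distance / source_len, as stated in the commit message."""
    if not source:
        return 1.0 if not output else 0.0  # assumption: undefined case chosen here
    return 1 - weighted_edit_distance(output, source, weights) / len(source)


print(weighted_edit_distance("abc", "abcd"))  # one missing character: an insertion
print(weighted_edit_distance("abcd", "abc"))  # one extra character: a deletion
print(similarity_score("abc", "abcd"))
```

With the default weights, text missing from the output costs twice as much as extra text, which matches the stated intent of penalizing missed extraction more heavily.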
Yuming Long	dcd6d0ff67	Refactor: support entire page OCR with `ocr_mode` and `ocr_languages` (#1579 ) ## Summary Second part of the OCR refactor to move it from the inference repo to the unstructured repo; the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/231. This PR adds OCR processing logic for entire-page OCR and supports two OCR modes, "entire_page" or "individual_blocks". The updated workflow for the `hi_res` partition: * pass the document as data/filename to the inference repo to get `inferred_layout` (DocumentLayout) * pass the document as data/filename to the OCR module, which first opens the document (creating a temp file/dir as needed) and splits the document by pages (converting PDF pages to image pages for PDF files) * if ocr mode is `"entire_page"` * OCR the entire image * merge the OCR layout with the inferred page layout * if ocr mode is `"individual_blocks"` * from the inferred page layout, find elements with no extracted text and crop the entire image by the bboxes of those elements * replace each empty text element with the text obtained from OCR of the cropped image * return all merged PageLayouts and form a DocumentLayout for later processing This PR also bumps `unstructured-inference==0.7.2` since the branch relies on the OCR refactor from unstructured-inference. ## Test
```
from unstructured.partition.auto import partition

entrie_page_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="entire_page", ocr_languages="eng+kor", strategy="hi_res")
individual_blocks_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="individual_blocks", ocr_languages="eng+kor", strategy="hi_res")
print([el.text for el in entrie_page_ocr_mode_elements])
print([el.text for el in individual_blocks_ocr_mode_elements])
```
latest output: ``` # entrie_page ['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. 
Its better to have many email', 'accounts.', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASUREWH HARUTOM\|2] 팬 입니다. 팬 으 로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 불 공 평 함 을 LRU, 이 일 을 통해 저 희 의 의 혹 을 전 달 하여 귀 사 의 진지한 민 과 적극적인 답 변 을 받을 수 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were', 'successfully sent', '4. Use the hashtag of Haruto on your tweet to show that vou have sent vour email]', '메 고'] # individual_blocks ['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASURES HARUTOM\| 2] 팬 입니다. 팬 으로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 habe ERO, 이 머 일 을 적극 저 희 의 ASS 전 달 하여 귀 사 의 진지한 고 2 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were ciiccecefisliy cant', 'VULLESSIULY Set 4. Use the hashtag of Haruto on your tweet to show that you have sent your email'] ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-10-06 22:54:49 +00:00
Christine Straub	5d14a2aea0	feat: shrink bboxes by top left (#1633 ) Closes #1573. ### Summary - update `shrink_bbox()` to keep top left rather than center ### Evaluation Run the following command for this [PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf). ``` PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy> ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2023-10-06 05:16:11 +00:00
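The difference between shrinking a box around its center and shrinking it while keeping its top-left corner can be sketched as follows. This is a minimal illustration only: the function names, the `(x1, y1, x2, y2)` tuple layout, and the `factor` parameter are assumptions made for the example and are not taken from the actual `shrink_bbox()` implementation.

```python
from typing import Tuple

# Assumed layout for this sketch: (x1, y1, x2, y2), with (x1, y1) the top-left corner.
BBox = Tuple[float, float, float, float]


def shrink_keep_center(bbox: BBox, factor: float) -> BBox:
    """Scale width and height by `factor`, keeping the center point fixed."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * factor / 2
    half_h = (y2 - y1) * factor / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def shrink_keep_top_left(bbox: BBox, factor: float) -> BBox:
    """Scale width and height by `factor`, keeping the top-left corner fixed."""
    x1, y1, x2, y2 = bbox
    return (x1, y1, x1 + (x2 - x1) * factor, y1 + (y2 - y1) * factor)


box = (10.0, 20.0, 110.0, 60.0)
centered = shrink_keep_center(box, 0.5)   # the box moves inward on all four sides
anchored = shrink_keep_top_left(box, 0.5)  # only the right and bottom edges move
```

With the top-left variant, the corner where an element starts stays where it was, so any sort that keys on that corner sees the same position before and after shrinking.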
Benjamin Torres	e0201e9a11	feat/add sources from unstructured inference (#1538 ) This PR adds support for the `source` property from `unstructured_inference`, allowing the user to see the origin of the data under the `detection_origin` field when the environment variable UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is set. To try this feature, you can use this code: ``` from unstructured.partition.pdf import partition_pdf_or_image yolox_elements = partition_pdf_or_image(filename='example-docs/loremipsum-flat.pdf', strategy='hi_res', model_name='yolox') sources = [e.detection_origin for e in yolox_elements] print(sources) ``` This prints 'yolox' as the source for all the elements	2023-10-05 20:26:47 +00:00
Sebastian Laverde Alfonso	e90a979f45	fix: Better logic for setting `category_depth` metadata for `Title` elements (#1517 ) This PR promotes the `category_depth` metadata for `Title` elements from `None` to 0, whenever `Headline` and/or `Subheadline` types (that are also mapped to `Title` elements with depth 1 and 2) are present. An additional test has been added to `test_common.py` to check the improvement. More tests of how this logic fixes the behaviour can be found in an adapted version of the colab [here](https://colab.research.google.com/drive/1LoScFJBYUhkM6X7pMp8cDaJLC_VoxGci?usp=sharing). --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-10-05 17:51:06 +00:00
Newel H	e34396b2c9	Feat: Native hierarchies for elements from pptx documents (#1616 ) ## Summary Improve title detection in pptx documents The default title textboxes on a pptx slide are now categorized as titles. Improve hierarchy detection in pptx documents List items, and other slide text are properly nested under the slide title. This will enable better chunking of pptx documents. Hierarchy detection is improved by determining category depth via the following: - Check if the paragraph item has a level parameter via the python pptx paragraph. If so, use the paragraph level as the category_depth level. - If the shape being checked is a title shape and the item is not a bullet or email, the element will be set as a Title with a depth corresponding to the enumerated paragraph increment (e.g. 1st line of title shape is depth 0, second is depth 1 etc.). - If the shape is not a title shape but the paragraph is a title, the increment will match the level + 1, so that all paragraph titles are at least 1 to set them below the slide title element	2023-10-05 12:55:45 -04:00
Christine Straub	b30d6a601e	Fix/1209 tweak xycut ordering output (#1630 ) Closes GH Issue #1209. ### Summary - add swapped `xycut` sorting - update `xycut` sorting evaluation script PDFs: - [sbaa031.073.pdf](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7234218/pdf/sbaa031.073.pdf) - [multi-column-2p.pdf](https://github.com/Unstructured-IO/unstructured/files/12796147/multi-column-2p.pdf) - [11723901.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12360085/11723901.pdf) ### Testing ``` elements = partition_pdf("sbaa031.073.pdf", strategy="hi_res") print("\n\n".join([str(el) for el in elements])) ``` ### Evaluation ``` PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py sbaa031.073.pdf hi_res xycut_only ```	2023-10-05 07:41:38 +00:00
ryannikolaidis	9960ce5f00	fix: chunking fails with detection_class_prob in metadata (#1637 )	2023-10-04 22:14:21 +00:00
Klaijan	0a65fc2134	feat: xlsx subtable extraction (#1585 ) Executive Summary Unstructured is now able to capture subtables, along with other text element types, within an `.xlsx` sheet. Technical Details - The function now reads the Excel file without a header by default - Leverages a connected-components search, based on depth-first search (DFS), to find subtables within the sheet - It also handles overlapping table or text cases - A row with only a single cell of data is not considered a table, and is therefore passed on to determine its element type as text - A connected component can contain a table title, header, or footer. We count the rows from the top and from the bottom up to the first row that is not single-cell or empty to identify that text Result This table now reads as: <img width="747" alt="image" src="https://github.com/Unstructured-IO/unstructured/assets/2177850/6b8e6d01-4ca5-43f4-ae88-6104b0174ed2"> ``` [ { "type": "Title", "element_id": "3315afd97f7f2ebcd450e7c939878429", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "3315afd97f7f2ebcd450e7c939878429", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Quarterly revenue</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>1</td>\n </tr>\n <tr>\n <td>Group financial performance</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>2</td>\n </tr>\n <tr>\n <td>Segmental results</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>3</td>\n </tr>\n <tr>\n <td>Segmental analysis</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>4</td>\n </tr>\n <tr>\n <td>Cash flow</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>5</td>\n </tr>\n 
</tbody>\n</table>" }, "text": "Financial performance" }, { "type": "Table", "element_id": "17f5d512705be6f8812e5dbb801ba727", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "3315afd97f7f2ebcd450e7c939878429", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Quarterly revenue</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>1</td>\n </tr>\n <tr>\n <td>Group financial performance</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>2</td>\n </tr>\n <tr>\n <td>Segmental results</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>3</td>\n </tr>\n <tr>\n <td>Segmental analysis</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>4</td>\n </tr>\n <tr>\n <td>Cash flow</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>5</td>\n </tr>\n </tbody>\n</table>" }, "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nQuarterly revenue\nNine quarters to 30 June 2023\n\n\n1\n\n\nGroup financial performance\nFY 22\nFY 23\n\n2\n\n\nSegmental results\nFY 22\nFY 23\n\n3\n\n\nSegmental analysis\nFY 22\nFY 23\n\n4\n\n\nCash flow\nFY 22\nFY 23\n\n5\n\n\n" }, { "type": "Title", "element_id": "8a9db7161a02b427f8fda883656036e1", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "8a9db7161a02b427f8fda883656036e1", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n 
<td>Mobile customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>6</td>\n </tr>\n <tr>\n <td>Fixed broadband customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>7</td>\n </tr>\n <tr>\n <td>Marketable homes passed</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>8</td>\n </tr>\n <tr>\n <td>TV customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>9</td>\n </tr>\n <tr>\n <td>Converged customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>10</td>\n </tr>\n <tr>\n <td>Mobile churn</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>11</td>\n </tr>\n <tr>\n <td>Mobile data usage</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>12</td>\n </tr>\n <tr>\n <td>Mobile ARPU</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>13</td>\n </tr>\n </tbody>\n</table>" }, "text": "Operational metrics" }, { "type": "Table", "element_id": "d5d16f7bf9c7950cd45fae06e12e5847", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "8a9db7161a02b427f8fda883656036e1", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Mobile customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>6</td>\n </tr>\n <tr>\n <td>Fixed broadband customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>7</td>\n </tr>\n <tr>\n <td>Marketable homes passed</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>8</td>\n </tr>\n <tr>\n <td>TV customers</td>\n <td>Nine 
quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>9</td>\n </tr>\n <tr>\n <td>Converged customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>10</td>\n </tr>\n <tr>\n <td>Mobile churn</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>11</td>\n </tr>\n <tr>\n <td>Mobile data usage</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>12</td>\n </tr>\n <tr>\n <td>Mobile ARPU</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>13</td>\n </tr>\n </tbody>\n</table>" }, "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nMobile customers\nNine quarters to 30 June 2023\n\n\n6\n\n\nFixed broadband customers\nNine quarters to 30 June 2023\n\n\n7\n\n\nMarketable homes passed\nNine quarters to 30 June 2023\n\n\n8\n\n\nTV customers\nNine quarters to 30 June 2023\n\n\n9\n\n\nConverged customers\nNine quarters to 30 June 2023\n\n\n10\n\n\nMobile churn\nNine quarters to 30 June 2023\n\n\n11\n\n\nMobile data usage\nNine quarters to 30 June 2023\n\n\n12\n\n\nMobile ARPU\nNine quarters to 30 June 2023\n\n\n13\n\n\n" }, { "type": "Title", "element_id": "f97e9da0e3b879f0a9df979ae260a5f7", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "f97e9da0e3b879f0a9df979ae260a5f7", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Average foreign exchange rates</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>14</td>\n </tr>\n <tr>\n <td>Guidance rates</td>\n <td>FY 23/24</td>\n <td></td>\n <td></td>\n <td>14</td>\n </tr>\n </tbody>\n</table>" }, "text": "Other" }, { "type": "Table", "element_id": 
"080e1a745a2a3f2df22b6a08d33d59bb", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "f97e9da0e3b879f0a9df979ae260a5f7", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Average foreign exchange rates</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>14</td>\n </tr>\n <tr>\n <td>Guidance rates</td>\n <td>FY 23/24</td>\n <td></td>\n <td></td>\n <td>14</td>\n </tr>\n </tbody>\n</table>" }, "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nAverage foreign exchange rates\nNine quarters to 30 June 2023\n\n\n14\n\n\nGuidance rates\nFY 23/24\n\n\n14\n\n\n" } ] ```	2023-10-04 13:30:23 -04:00
Yao You	19d8bff275	feat: change default hi_res model to yolox quantized (#1607 )	2023-10-04 03:28:47 +00:00
Amanda Cameron	1fb464235a	chore: Table chunking (#1540 ) This change adds to our `add_chunking_strategy` logic so that we are able to chunk Table elements' `text` and `text_as_html` params. In order to keep the functionality under the same `by_title` chunking strategy, we have renamed `combine_under_n_chars` to `max_characters`. It functions the same way for combining elements under Titles, and also specifies a chunk size (in chars) for TableChunk elements. *Renaming the variable to `max_characters` will also reflect the 'hard max' we will implement for large elements in follow-up PRs. Additionally, some lint changes snuck in when I ran `make tidy`, hence the minor changes in unrelated files :) TODO: ✅ add unit tests --> note: added to unit tests where I could! In some unit tests I just clarified that the chunking strategy was now 'by_title' because we don't have a file example that has Table elements to test the 'by_num_characters' chunking strategy ✅ update changelog To manually test: ``` In [1]: filename="example-docs/example-10k.html" In [2]: from unstructured.chunking.title import chunk_table_element In [3]: from unstructured.partition.auto import partition In [4]: elements = partition(filename) # element at -2 happens to be a Table, and we'll get chunks of char size 4 here In [5]: chunks = chunk_table_element(elements[-2], 4) # examine text and text_as_html params In [6]: for c in chunks: print(c.text) print(c.metadata.text_as_html) ``` --------- Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-10-03 09:40:34 -07:00
Newel H	bcd0eee753	Feat: Detect all text in HTML Heading tags as titles (#1556 ) ## Summary This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item, address categorize it as a title. ## Testing ``` from unstructured.partition.html import partition_html elements = partition_html(url="https://www.eda.gov/grants/2015") ``` Before, the date headers at the given url would not be correctly parsed as titles, after this change they are now correctly identified. A unit test to verify the functionality has been added: `test_html_partition::test_html_heading_title_detection` that includes values that were previously detected as narrative text and uncategorized text	2023-10-03 11:54:36 -04:00
Klaijan	d6efd52b4b	fix: isalnum referenced before assignment (#1586 ) Executive Summary Fix a bug in the `get_word_bounding_box_from_element` function that prevented `partition_pdf` from running. Technical Details - The function originally defined `isalnum` at the first index. It now uses a conditional on a flag value.	2023-10-03 11:25:20 -04:00
unifyh	89bd2faaf7	fix: Fix various cases of HTML text missing after partition (#1587 ) Fix 4 cases of text missing after partition: 1. Text immediately after `<body>` ```html <body> missing1 <div>hello</div> </body> ``` 2. Text inside container and immediately after `<br/>` ```html <div>hello<br/>missing2</div> ``` 3. Text immediately after a text opening tag, if said tag contains `<br/>` ```html <p>missing3<br/>hello</p> ``` 4. Text inside `<body>` if it is the only content (different cause from case 1) ```html <body>missing4</body> ``` Also fix problem causing `test_unstructured/documents/test_html.py::test_exclude_tag_types` to not work as intended. This will close GitHub Issue#1543	2023-10-03 04:17:51 +00:00
Yao You	ad59a879cc	chore: bump inference to 0.6.6 (#1563 ) - bump `unstructured-inference` to `0.6.6` - specify default model name for element detection to be `detectron2_onnx` to keep current behavior - NOTE: the updated inference package would by default use yolox as the element detection model; this will be evaluated and enabled in a separate PR --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2023-09-29 19:09:57 +00:00
Christine Straub	94fbbed189	feat: bbox shrinking in xycut algo, better natural reading order (#1560 ) Closes GH Issue #1233. ### Summary - add functionality to shrink all bounding boxes along x and y axes (still centered around the same center point) before running xy-cut sort ### Evaluation Run the following command for this [PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf). PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>	2023-09-29 03:48:02 +00:00
qued	e5d08662d4	enhancement: memory efficient xml partitioning (#1547 ) Closes #1236. Partitions XML documents iteratively in most cases, never loading the entire tree into memory. This ends up being much faster. ( The exception is when the argument `xml_path` is passed to filter elements. I was not able to find a way in Python to compare XPaths while streaming the elements, aside from writing a custom XPath parser. So the shortest way forward was to bite the bullet and load the whole tree in memory when filtering by XPath.) Memory usage is about 20% of usage on `main` when processing a 470MB XML file. Time to process is 10s vs 900s. Output is slightly different, but appears to be an improvement, adding lines of text that are skipped in current partitioning. No text is lost.	2023-09-28 02:34:06 +00:00
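The idea of partitioning XML iteratively instead of loading the whole tree can be sketched with the standard library's `xml.etree.ElementTree.iterparse`. This is a minimal illustration under stated assumptions: the helper name `iter_leaf_text` and the "yield text of leaf elements" rule are made up for the example and do not reproduce the actual partitioning logic.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def iter_leaf_text(source: BinaryIO) -> Iterator[str]:
    """Yield the stripped text of each leaf element as the parser reaches it."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # An element with no children is treated as a leaf in this sketch.
        if len(elem) == 0 and elem.text and elem.text.strip():
            yield elem.text.strip()
        # Release the element's text, attributes, and children once handled.
        # The emptied element stays attached to its parent until the parent
        # itself is cleared, so memory use is reduced rather than eliminated.
        elem.clear()


xml_bytes = b"<root><item>one</item><group><item>two</item></group><empty/></root>"
texts = list(iter_leaf_text(io.BytesIO(xml_bytes)))
```

Because elements are processed on their `end` event and then cleared, the text of earlier elements does not accumulate in memory while later ones are being read.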
Austin Walker	f34c277bca	fix: add backwards compatibility to ElementMetadata (#1526 ) Fixes https://github.com/Unstructured-IO/unstructured-api/issues/237 The problem: The `ElementMetadata` class was not able to ignore fields that it didn't know about. This surfaced in `partition_via_api`, when the hosted api schema is newer than the local `unstructured` version. In `ElementMetadata.from_json()` we get errors such as `TypeError: __init__() got an unexpected keyword argument 'parent_id'`. The fix: The `from_json` methods for these dataclasses should drop any unexpected fields before calling `__init__`. To verify: This shouldn't throw an error ``` from unstructured.staging.base import elements_from_json import json test_api_result = json.dumps([ { "type": "Title", "element_id": "2f7cc75f6467bba468022c4c2875335e", "metadata": { "filename": "layout-parser-paper.pdf", "filetype": "application/pdf", "page_number": 1, "new_field": "foo", }, "text": "LayoutParser: A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis" } ]) elements = elements_from_json(text=test_api_result) print(elements) ```	2023-09-27 18:40:56 +00:00
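The "drop any unexpected fields before calling `__init__`" approach can be sketched with a plain dataclass. The `Metadata` class and its `from_dict` method below are hypothetical stand-ins for illustration; they are not the real `ElementMetadata` class or its `from_json` signature.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for a metadata dataclass with a fixed schema."""

    filename: Optional[str] = None
    filetype: Optional[str] = None
    page_number: Optional[int] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Metadata":
        # Keep only keys that match declared dataclass fields, so a newer
        # producer can add fields without breaking an older consumer.
        known = {f.name for f in fields(cls)}
        return cls(**{key: value for key, value in data.items() if key in known})


meta = Metadata.from_dict(
    {
        "filename": "layout-parser-paper.pdf",
        "filetype": "application/pdf",
        "page_number": 1,
        "new_field": "foo",  # unknown to this version of the class; silently dropped
    }
)
```

Without the filtering step, passing `new_field` straight to the constructor would raise a `TypeError` about an unexpected keyword argument, which is the failure mode described in the commit message.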
Klaijan	d26d591d6a	feat: get embedded url, associate text and start index for pdf (#1539 ) Executive Summary Adds PDF functionality to capture hyperlinks (external or internal) for the pdf fast strategy, along with their associated text. Technical Details - `pdfminer` associates an `annotation` (links and URIs) with a bounding box rather than with text. Therefore, the link and text are not a perfect pair; they are matched by logic and calculation based on bounding-box overlap. - There is no word-level bounding box, only character-level (accessed using `LTChar`). Thus, in order to get to word level, a window slides through the text. Words are captured as alphanumeric and non-alphanumeric runs separately, meaning a word is split at the first non-alphanumeric character if it contains both. - The bounding box is calculated using the start and stop coordinates of the corresponding word found above. The calculation simply uses the distance between two points. The result now contains `links` in `metadata` as shown below: ``` "links": [ { "text": "link", "url": "https://github.com/Unstructured-IO/unstructured", "start_index": 12 }, { "text": "email", "url": "mailto:unstructuredai@earlygrowth.com", "start_index": 30 }, { "text": "phone number", "url": "tel:6505124019", "start_index": 49 } ] ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-09-27 13:43:32 -04:00
Newel H	55315cf645	Feat: Native hierarchies for docx element types (#1505 ) Improves hierarchy from docx files by leveraging natural hierarchies built into docx documents. Hierarchy can now be detected from an indentation level for list bullets/numbers and by style name (e.g. Heading 1, List Bullet 2, List Number). Hierarchy detection is improved by determining category depth via the following: 1. Check if the paragraph item has an indentation level (ilvl) xpath - these are typically on list bullet/numbers. Return the indentation level if it exists 2. Check the name of the paragraph style if it contains any category depth information (e.g. Heading 1 vs Heading 2 or List Bullet vs List Bullet 2). Return the category depth if found, else default to depth of 0. 3. Check the paragraph ilvl via the paragraph's style name. Outside of the paragraph's metadata, docx stores default ilvls for various style names, which requires a complex lookup. This check is yet to be implemented, as the above methods cover most usecases but the implementation is stubbed out. --- Co-authored-by: Steve Canny <stcanny@gmail.com>	2023-09-27 11:32:46 -04:00
Steve Canny	ab29de8dbd	Rfctr: Refactor PPTX partitioning to more closely align with how pptx documents are structured This refactor solves a problem or two, the big one being recursing into group-shapes to get all shapes on the slide, but mostly lays the groundwork to allow us to refine further aspects such as list-item detection, off-slide shape detection, and image-capture going forward.	2023-09-26 15:43:55 -04:00

553 Commits