unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-30 01:17:43 +00:00

Author	SHA1	Message	Date
Mallori Harrell	a5d7ae4611	Feat: Bag of words for testing metric (#1650 ) This PR adds the `bag_of_words` function to count the frequency of words for evaluation. Testing ```Python from unstructured.cleaners.core import bag_of_words string = "The dog loved the cat, but the cat loved the cow." print(bag_of_words) --------- Co-authored-by: Mallori Harrell <mallori@Malloris-MacBook-Pro.local> Co-authored-by: Klaijan <klaijan@unstructured.io> Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com> Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2023-10-10 18:46:01 +00:00
Amanda Cameron	f98d5e65ca	chore: adding max_characters to other element type chunking (#1673 ) This PR adds the `max_characters` (hard max) param to non-table element chunking. Additionally updates the `num_characters` metadata to `max_characters` to make it clearer which param we're referencing. To test: ``` from unstructured.partition.html import partition_html filename = "example-docs/example-10k-1p.html" chunk_elements = partition_html( filename, chunking_strategy="by_title", combine_text_under_n_chars=0, new_after_n_chars=50, max_characters=100, ) for chunk in chunk_elements: print(len(chunk.text)) # previously we were only respecting the "soft max" (default of 500) for elements other than tables # now we should see that all the elements have text fields under 100 chars. ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-09 19:42:36 +00:00
Klaijan	33edbf84f5	feat: add calculate edit distance feature (#1656 ) Executive Summary Adds function to calculate edit distance (Levenshtein distance) between two strings. The function can return as: 1. score (similarity = 1 - distance/source_len) 2. distance (raw levenshtein distance) Technical details - The `weights` param is set to default at (2,1,1) for (insertion, deletion, substitution), meaning that we will penalize the insertion we need to add from output (target) in comparison with the source (reference). In other word, the missing extraction will be penalized higher. - The function takes in 2 strings in an assumption that both string are already clean and concatenated (CCT) Important Note! Test case needs to be updated to use CCT once the function is ready. It is now only tested the "functionality" of edit distance, not the edit distance with CCT as its intended to be. --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-07 01:21:14 +00:00
Yuming Long	dcd6d0ff67	Refactor: support entire page OCR with `ocr_mode` and `ocr_languages` (#1579 ) ## Summary Second part of OCR refactor to move it from inference repo to unstructured repo, first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/231. This PR adds OCR process logics to entire page OCR, and support two OCR modes, "entire_page" or "individual_blocks". The updated workflow for `Hi_res` partition: * pass the document as data/filename to inference repo to get `inferred_layout` (DocumentLayout) * pass the document as data/filename to OCR module, which first open the document (create temp file/dir as needed), and split the document by pages (convert PDF pages to image pages for PDF file) * if ocr mode is `"entire_page"` * OCR the entire image * merge the OCR layout with inferred page layout * if ocr mode is `"individual_blocks"` * from inferred page layout, find element with no extracted text, crop the entire image by the bboxes of the element * replace empty text element with the text obtained from OCR the cropped image * return all merged PageLayouts and form a DocumentLayout subject for later on process This PR also bump `unstructured-inference==0.7.2` since the branch relay on OCR refactor from unstructured-inference. ## Test ``` from unstructured.partition.auto import partition entrie_page_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="entire_page", ocr_languages="eng+kor", strategy="hi_res") individual_blocks_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="individual_blocks", ocr_languages="eng+kor", strategy="hi_res") print([el.text for el in entrie_page_ocr_mode_elements]) print([el.text for el in individual_blocks_ocr_mode_elements]) ``` latest output: ``` # entrie_page ['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'accounts.', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASUREWH HARUTOM\|2] 팬 입니다. 팬 으 로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 불 공 평 함 을 LRU, 이 일 을 통해 저 희 의 의 혹 을 전 달 하여 귀 사 의 진지한 민 과 적극적인 답 변 을 받을 수 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were', 'successfully sent', '4. Use the hashtag of Haruto on your tweet to show that vou have sent vour email]', '메 고'] # individual_blocks ['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASURES HARUTOM\| 2] 팬 입니다. 팬 으로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 habe ERO, 이 머 일 을 적극 저 희 의 ASS 전 달 하여 귀 사 의 진지한 고 2 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were ciiccecefisliy cant', 'VULLESSIULY Set 4. Use the hashtag of Haruto on your tweet to show that you have sent your email'] ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-10-06 22:54:49 +00:00
Christine Straub	5d14a2aea0	feat: shrink bboxes by top left (#1633 ) Closes #1573. ### Summary - update `shrink_bbox()` to keep top left rather than center ### Evaluation Run the following command for this [PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf). ``` PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy> ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2023-10-06 05:16:11 +00:00
Benjamin Torres	e0201e9a11	feat/add sources from unstructured inference (#1538 ) This PR adds support for `source` property from `unstructured_inference`, allowing the user to be able to see the origin of the data under `detection_origin`field environment variable UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true In order to try this feature you can use this code: ``` from unstructured.partition.pdf import partition_pdf_or_image yolox_elements = partition_pdf_or_image(filename='example-docs/loremipsum-flat.pdf', strategy='hi_res', model_name='yolox') sources = [e.detection_origin for e in yolox_elements] print(sources) ``` And will print 'yolox' as source for all the elements	2023-10-05 20:26:47 +00:00
Sebastian Laverde Alfonso	e90a979f45	fix: Better logic for setting `category_depth` metadata for `Title` elements (#1517 ) This PR promotes the `category_depth` metadata for `Title` elements from `None` to 0, whenever `Headline` and/or `Subheadline` types (that are also mapped to `Title` elements with depth 1 and 2) are present. An additional test to `test_common.py` has been added to check on the improvement. More test of how this logic fixes the behaviour can be found in a adapted version on the colab [here](https://colab.research.google.com/drive/1LoScFJBYUhkM6X7pMp8cDaJLC_VoxGci?usp=sharing). --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-10-05 17:51:06 +00:00
Newel H	e34396b2c9	Feat: Native hierarchies for elements from pptx documents (#1616 ) ## Summary Improve title detection in pptx documents The default title textboxes on a pptx slide are now categorized as titles. Improve hierarchy detection in pptx documents List items, and other slide text are properly nested under the slide title. This will enable better chunking of pptx documents. Hierarchy detection is improved by determining category depth via the following: - Check if the paragraph item has a level parameter via the python pptx paragraph. If so, use the paragraph level as the category_depth level. - If the shape being checked is a title shape and the item is not a bullet or email, the element will be set as a Title with a depth corresponding to the enumerated paragraph increment (e.g. 1st line of title shape is depth 0, second is depth 1 etc.). - If the shape is not a title shape but the paragraph is a title, the increment will match the level + 1, so that all paragraph titles are at least 1 to set them below the slide title element	2023-10-05 12:55:45 -04:00
Christine Straub	b30d6a601e	Fix/1209 tweak xycut ordering output (#1630 ) Closes GH Issue #1209. ### Summary - add swapped `xycut` sorting - update `xycut` sorting evaluation script PDFs: - [sbaa031.073.pdf](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7234218/pdf/sbaa031.073.pdf) - [multi-column-2p.pdf](https://github.com/Unstructured-IO/unstructured/files/12796147/multi-column-2p.pdf) - [11723901.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12360085/11723901.pdf) ### Testing ``` elements = partition_pdf("sbaa031.073.pdf", strategy="hi_res") print("\n\n".join([str(el) for el in elements])) ``` ### Evaluation ``` PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py sbaa031.073.pdf hi_res xycut_only ```	2023-10-05 07:41:38 +00:00
ryannikolaidis	9960ce5f00	fix: chunking fails with detection_class_prob in metadata (#1637 )	2023-10-04 22:14:21 +00:00
Klaijan	0a65fc2134	feat: xlsx subtable extraction (#1585 ) Executive Summary Unstructured is now able to capture subtables, along with other text element types within the `.xlsx` sheet. Technical Details - The function now reads the excel without header as default - Leverages the connected components search to find subtables within the sheet. This search is based on dfs search - It also handle the overlapping table or text cases - Row with only single cell of data is considered not a table, and therefore passed on the determine the element type as text - In connected elements, it is possible to have table title, header, or footer. We run the count for the first non-single empty rows from top and bottom to determine those text Result This table now reads as: <img width="747" alt="image" src="https://github.com/Unstructured-IO/unstructured/assets/2177850/6b8e6d01-4ca5-43f4-ae88-6104b0174ed2"> ``` [ { "type": "Title", "element_id": "3315afd97f7f2ebcd450e7c939878429", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "3315afd97f7f2ebcd450e7c939878429", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Quarterly revenue</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>1</td>\n </tr>\n <tr>\n <td>Group financial performance</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>2</td>\n </tr>\n <tr>\n <td>Segmental results</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>3</td>\n </tr>\n <tr>\n <td>Segmental analysis</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>4</td>\n </tr>\n <tr>\n <td>Cash flow</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>5</td>\n </tr>\n </tbody>\n</table>" }, "text": "Financial performance" }, { "type": "Table", "element_id": "17f5d512705be6f8812e5dbb801ba727", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "3315afd97f7f2ebcd450e7c939878429", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Quarterly revenue</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>1</td>\n </tr>\n <tr>\n <td>Group financial performance</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>2</td>\n </tr>\n <tr>\n <td>Segmental results</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>3</td>\n </tr>\n <tr>\n <td>Segmental analysis</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>4</td>\n </tr>\n <tr>\n <td>Cash flow</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>5</td>\n </tr>\n </tbody>\n</table>" }, "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nQuarterly revenue\nNine quarters to 30 June 2023\n\n\n1\n\n\nGroup financial performance\nFY 22\nFY 23\n\n2\n\n\nSegmental results\nFY 22\nFY 23\n\n3\n\n\nSegmental analysis\nFY 22\nFY 23\n\n4\n\n\nCash flow\nFY 22\nFY 23\n\n5\n\n\n" }, { "type": "Title", "element_id": "8a9db7161a02b427f8fda883656036e1", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "8a9db7161a02b427f8fda883656036e1", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Mobile customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>6</td>\n </tr>\n <tr>\n <td>Fixed broadband customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>7</td>\n </tr>\n <tr>\n <td>Marketable homes passed</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>8</td>\n </tr>\n <tr>\n <td>TV customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>9</td>\n </tr>\n <tr>\n <td>Converged customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>10</td>\n </tr>\n <tr>\n <td>Mobile churn</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>11</td>\n </tr>\n <tr>\n <td>Mobile data usage</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>12</td>\n </tr>\n <tr>\n <td>Mobile ARPU</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>13</td>\n </tr>\n </tbody>\n</table>" }, "text": "Operational metrics" }, { "type": "Table", "element_id": "d5d16f7bf9c7950cd45fae06e12e5847", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "8a9db7161a02b427f8fda883656036e1", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Mobile customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>6</td>\n </tr>\n <tr>\n <td>Fixed broadband customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>7</td>\n </tr>\n <tr>\n <td>Marketable homes passed</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>8</td>\n </tr>\n <tr>\n <td>TV customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>9</td>\n </tr>\n <tr>\n <td>Converged customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>10</td>\n </tr>\n <tr>\n <td>Mobile churn</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>11</td>\n </tr>\n <tr>\n <td>Mobile data usage</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>12</td>\n </tr>\n <tr>\n <td>Mobile ARPU</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>13</td>\n </tr>\n </tbody>\n</table>" }, "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nMobile customers\nNine quarters to 30 June 2023\n\n\n6\n\n\nFixed broadband customers\nNine quarters to 30 June 2023\n\n\n7\n\n\nMarketable homes passed\nNine quarters to 30 June 2023\n\n\n8\n\n\nTV customers\nNine quarters to 30 June 2023\n\n\n9\n\n\nConverged customers\nNine quarters to 30 June 2023\n\n\n10\n\n\nMobile churn\nNine quarters to 30 June 2023\n\n\n11\n\n\nMobile data usage\nNine quarters to 30 June 2023\n\n\n12\n\n\nMobile ARPU\nNine quarters to 30 June 2023\n\n\n13\n\n\n" }, { "type": "Title", "element_id": "f97e9da0e3b879f0a9df979ae260a5f7", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "f97e9da0e3b879f0a9df979ae260a5f7", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Average foreign exchange rates</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>14</td>\n </tr>\n <tr>\n <td>Guidance rates</td>\n <td>FY 23/24</td>\n <td></td>\n <td></td>\n <td>14</td>\n </tr>\n </tbody>\n</table>" }, "text": "Other" }, { "type": "Table", "element_id": "080e1a745a2a3f2df22b6a08d33d59bb", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "f97e9da0e3b879f0a9df979ae260a5f7", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Average foreign exchange rates</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>14</td>\n </tr>\n <tr>\n <td>Guidance rates</td>\n <td>FY 23/24</td>\n <td></td>\n <td></td>\n <td>14</td>\n </tr>\n </tbody>\n</table>" }, "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nAverage foreign exchange rates\nNine quarters to 30 June 2023\n\n\n14\n\n\nGuidance rates\nFY 23/24\n\n\n14\n\n\n" } ] ```	2023-10-04 13:30:23 -04:00
Yao You	19d8bff275	feat: change default hi_res model to yolox quantized (#1607 )	2023-10-04 03:28:47 +00:00
Amanda Cameron	1fb464235a	chore: Table chunking (#1540 ) This change is adding to our `add_chunking_strategy` logic so that we are able to chunk Table elements' `text` and `text_as_html` params. In order to keep the functionality under the same `by_title` chunking strategy we have renamed the `combine_under_n_chars` to `max_characters`. It functions the same way for the combining elements under Title's, as well as specifying a chunk size (in chars) for TableChunk elements. *renaming the variable to `max_characters` will also reflect the 'hard max' we will implement for large elements in followup PRs Additionally -> some lint changes snuck in when I ran `make tidy` hence the minor changes in unrelated files :) TODO: ✅ add unit tests --> note: added where I could to unit tests! Some unit tests I just clarified that the chunking strategy was now 'by_title' because we don't have a file example that has Table elements to test the 'by_num_characters' chunking strategy ✅ update changelog To manually test: ``` In [1]: filename="example-docs/example-10k.html" In [2]: from unstructured.chunking.title import chunk_table_element In [3]: from unstructured.partition.auto import partition In [4]: elements = partition(filename) # element at -2 happens to be a Table, and we'll get chunks of char size 4 here In [5]: chunks = chunk_table_element(elements[-2], 4) # examine text and text_as_html params ln [6]: for c in chunks: print(c.text) print(c.metadata.text_as_html) ``` --------- Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-10-03 09:40:34 -07:00
Newel H	bcd0eee753	Feat: Detect all text in HTML Heading tags as titles (#1556 ) ## Summary This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item, address categorize it as a title. ## Testing ``` from unstructured.partition.html import partition_html elements = partition_html(url="https://www.eda.gov/grants/2015") ``` Before, the date headers at the given url would not be correctly parsed as titles, after this change they are now correctly identified. A unit test to verify the functionality has been added: `test_html_partition::test_html_heading_title_detection` that includes values that were previously detected as narrative text and uncategorized text	2023-10-03 11:54:36 -04:00
Klaijan	d6efd52b4b	fix: isalnum referenced before assignment (#1586 ) Executive Summary Fix bug on the `get_word_bounding_box_from_element` function that prevent `partition_pdf` to run. Technical Details - The function originally first define `isalnum` on the first index. Now switched to conditional on flag value.	2023-10-03 11:25:20 -04:00
unifyh	89bd2faaf7	fix: Fix various cases of HTML text missing after partition (#1587 ) Fix 4 cases of text missing after partition: 1. Text immediately after `<body>` ```html <body> missing1 <div>hello</div> </body> ``` 2. Text inside container and immediately after `<br/>` ```html <div>hello<br/>missing2</div> ``` 3. Text immediately after a text opening tag, if said tag contains `<br/>` ```html <p>missing3<br/>hello</p> ``` 4. Text inside `<body>` if it is the only content (different cause from case 1) ```html <body>missing4</body> ``` Also fix problem causing `test_unstructured/documents/test_html.py::test_exclude_tag_types` to not work as intended. This will close GitHub Issue#1543	2023-10-03 04:17:51 +00:00
Yao You	ad59a879cc	chore: bump inference to 0.6.6 (#1563 ) - bump `unstructured-inference` to `0.6.6` - specify default model name for element detection to be `detectron2_onnx` to keep current behavior - NOTE: the updated inference package by default would use yolox as element detection model; this will be evaluated and enabled in a separated PR --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2023-09-29 19:09:57 +00:00
Christine Straub	94fbbed189	feat: bbox shrinking in xycut algo, better natural reading order (#1560 ) Closes GH Issue #1233. ### Summary - add functionality to shrink all bounding boxes along x and y axes (still centered around the same center point) before running xy-cut sort ### Evaluation Run the followin gcommand for this [PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf). PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>	2023-09-29 03:48:02 +00:00
qued	e5d08662d4	enhancement: memory efficient xml partitioning (#1547 ) Closes #1236. Partitions XML documents iteratively in most cases, never loading the entire tree into memory. This ends up being much faster. ( The exception is when the argument `xml_path` is passed to filter elements. I was not able to find a way in Python to compare XPaths while streaming the elements, aside from writing a custom XPath parser. So the shortest way forward was to bite the bullet and load the whole tree in memory when filtering by XPath.) Memory usage is about 20% of usage on `main` when processing a 470MB XML file. Time to process is 10s vs 900s. Output is slightly different, but appears to be an improvement, adding lines of text that are skipped in current partitioning. No text is lost.	2023-09-28 02:34:06 +00:00
Austin Walker	f34c277bca	fix: add backwards compatibility to ElementMetadata (#1526 ) Fixes https://github.com/Unstructured-IO/unstructured-api/issues/237 The problem: The `ElementMetadata` class was not able to ignore fields that it didn't know about. This surfaced in `partition_via_api`, when the hosted api schema is newer than the local `unstructured` version. In `ElementMetadata.from_json()` we get errors such as `TypeError: __init__() got an unexpected keyword argument 'parent_id'`. The fix: The `from_json` methods for these dataclasses should drop any unexpected fields before calling `__init__`. To verify: This shouldn't throw an error ``` from unstructured.staging.base import elements_from_json import json test_api_result = json.dumps([ { "type": "Title", "element_id": "2f7cc75f6467bba468022c4c2875335e", "metadata": { "filename": "layout-parser-paper.pdf", "filetype": "application/pdf", "page_number": 1, "new_field": "foo", }, "text": "LayoutParser: A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis" } ]) elements = elements_from_json(text=test_api_result) print(elements) ```	2023-09-27 18:40:56 +00:00
Klaijan	d26d591d6a	feat: get embedded url, associate text and start index for pdf (#1539 ) Executive Summary Adds PDF functionality to capture hyperlink (external or internal) for pdf fast strategy along with associate text. Technical Details - `pdfminer` associates `annotation` (links and uris) with bounding box rather than text. Therefore, the link and text matching is not a perfect pair but rather a logic-based and calculation matching from bounding box overlapping. - There is no word-level bounding box. Only character-level (access using `LTChar`). Thus in order to get to word-level, there is a window slicing through the text. The words are captured in alphanumeric and non-alphanumeric separately, meaning it will split the word if contains both, on the first encounter of non-alphanumeric.) - The bounding box calculation is calculated using start and stop coordinates for the corresponding word calculated from above. The calculation is simply using distance between two dots. The result now contains `links` in `metadata` as shown below: ``` "links": [ { "text": "link", "url": "https://github.com/Unstructured-IO/unstructured", "start_index": 12 }, { "text": "email", "url": "mailto:unstructuredai@earlygrowth.com", "start_index": 30 }, { "text": "phone number", "url": "tel:6505124019", "start_index": 49 } ] ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-09-27 13:43:32 -04:00
Newel H	55315cf645	Feat: Native hierarchies for docx element types (#1505 ) Improves hierarchy from docx files by leveraging natural hierarchies built into docx documents. Hierarchy can now be detected from an indentation level for list bullets/numbers and by style name (e.g. Heading 1, List Bullet 2, List Number). Hierarchy detection is improved by determining category depth via the following: 1. Check if the paragraph item has an indentation level (ilvl) xpath - these are typically on list bullet/numbers. Return the indentation level if it exists 2. Check the name of the paragraph style if it contains any category depth information (e.g. Heading 1 vs Heading 2 or List Bullet vs List Bullet 2). Return the category depth if found, else default to depth of 0. 3. Check the paragraph ilvl via the paragraph's style name. Outside of the paragraph's metadata, docx stores default ilvls for various style names, which requires a complex lookup. This check is yet to be implemented, as the above methods cover most usecases but the implementation is stubbed out. --- Co-authored-by: Steve Canny <stcanny@gmail.com>	2023-09-27 11:32:46 -04:00
Steve Canny	ab29de8dbd	Rfctr: Refactor PPTX partitioning to more closely align with how pptx documents are structured This refactor solves a problem or two, the big one being recursing into group-shapes to get all shapes on the slide, but mostly lays the groundwork to allow us to refine further aspects such as list-item detection, off-slide shape detection, and image-capture going forward.	2023-09-26 15:43:55 -04:00
shreyanid	32bfebccf7	feat: introduce language detection function for text partitioning function (#1453 ) ### Summary Uses `langdetect` to detect all languages present in the input document. ### Details - Converts all language codes (whether user inputted or detected using `langdetect`) to a standard ISO 639-3 code. - Adds `languages` field to the metadata - Will revisit how to nonstandardly represent simplified vs traditional Chinese scripts internally (separate PR). - Update ingest test results to add `languages` field to documents. Some other side effects are changes in order of some elements and changes in element categorization ### Test You can test the detect_languages function individually by importing the function and inputting a text sample and optionally a language: ``` text = "My lubimy mleko i chleb." doc_langs = detect_languages(text) print(doc_langs) ``` -> ['ces', 'pol', 'slk'] --------- Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: shreyanid <shreyanid@users.noreply.github.com> Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2023-09-26 18:09:27 +00:00
Benjamin Torres	5d193c8e5a	fix/bad formed formula (#1481 ) @ron-unstructured reported that loading files with: ``` from unstructured.partition.pdf import partition_pdf elements_yolox = partition_pdf(filename="1706.03762.pdf", strategy='hi_res', model_name="yolox") print(elements_yolox) ``` Throws an error. After debugging the execution I found that the issue is that an object of class Formula is being created, however, this class doesn't contain an __init__ method. This PR solves the issue of adding a constructor method with an empty string for the element. The file can be found at: https://drive.google.com/drive/folders/1hDumyps0hA4_d-GZxs3Hij15Cpa5fjWY?usp=sharing After this PR is merged this file is correctly processed	2023-09-23 02:36:22 +00:00
rvztz	2f52df180f	Adds data source properties to onedrive, reddit and slack (#1281 )	2023-09-20 04:26:36 +00:00
Amanda Cameron	e359afafbe	fix: coordinates bug on pdf parsing (#1462 ) Addresses: https://github.com/Unstructured-IO/unstructured/issues/1460 We were raising an error with invalid coordinates, which prevented us from continuing to return the element and continue parsing the pdf. Now instead of raising the error we'll return early. to test: ``` from unstructured.partition.auto import partition elements = partition(url='https://www.apple.com/environment/pdf/Apple_Environmental_Progress_Report_2022.pdf', strategy="fast") ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-09-19 19:25:31 -07:00
Steve Canny	b54994ae95	rfctr: docx partitioning (#1422 ) Reviewers: I recommend reviewing commit-by-commit or just looking at the final version of `partition/docx.py` as View File. This refactor solves a few problems but mostly lays the groundwork to allow us to refine further aspects such as page-break detection, list-item detection, and moving python-docx internals upstream to that library so our work doesn't depend on that domain-knowledge.	2023-09-19 15:32:46 -07:00
shreyanid	eb8ce89137	chore: function to map between standard and Tesseract language codes (#1421 ) ### Summary In order to convert between incompatible language codes from packages used for OCR, this change adds a function to map between any standard language codes and tesseract OCR specific codes. Users can input language information to `languages` in any Tesseract-supported langcode or any ISO 639 standard language code. ### Details - Introduces the [python-iso639](https://pypi.org/project/python-iso639/) package for matching standard language codes. Recompiles all dependencies. - If a language is not already supplied by the user as a Tesseract specific langcode, supplies all possible script/orthography variants of the language to the Tesseract OCR agent. ### Test Added many unit tests for a variety of language combinations, special cases, and variants. For general testing, call partition functions with any lang codes in the languages parameter (Tesseract or standard). for example, ``` from unstructured.partition.auto import partition elements = partition(filename="example-docs/layout-parser-paper.pdf", strategy="hi_res", languages=["en", "chi"]) print("\n\n".join([str(el) for el in elements])) ``` should supply eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert to Tesseract	2023-09-18 08:42:02 -07:00
Amanda Cameron	a9f18eddb8	chore: adding test case for odt tables (#1434 ) ODT table extraction is happening! Just added to an existing example-doc and an accompanying test case.	2023-09-16 22:29:44 -07:00
Yao You	b534b2a6cd	Chore: bump inference package version to 0.5.28 and new release (#1355 ) This bump removes the preprocessing before table structure extraction and improves the OCR results for tables. --------- Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2023-09-15 18:26:15 -07:00
John	de4d496fcf	Fix bbox coordinates for ocr_only strategy (#1325 ) ### Summary Duplicate PR of #1259 because of issues with checks Closes #1227, which found that `nan` values were present in the coordinates being generated for some elements. This breaks logic out from `add_pytesseract_bbox_to_elements` to new functions `_get_element_box` and `convert_multiple_coordinates_to_new_system`. It also updates the logic to check that the current bounding box matches the first character of the element's text (as to avoid the `~` characters that `pytesseract.image_to_boxes` includes, but are not present in `pytesseract.image_to_string`. ### Testing ``` from unstructured.partition.image import partition_image from PIL import Image, ImageDraw filename="example-docs/layout-parser-paper-with-table.jpg" elements = partition_image(filename=filename, strategy="ocr_only") image = Image.open(filename) draw = ImageDraw.Draw(image) for i, element in enumerate(elements): print(i, element.metadata.coordinates) if element.metadata.coordinates: draw.polygon(element.metadata.coordinates.points, outline="red", width=2) output = "example-docs/box-layout-parser-paper-with-table.jpg" image.save(output) image.close() ``` --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-09-15 15:11:16 -05:00
qued	0d61c98481	fix: Pass partition_image kwargs downstream (#1426 ) `partition_pdf` allows for passing a `model_name` parameter. Given the similarity between the image and PDF pipelines, the expected behavior is that `partition_image` should support the same parameter, but `partition_image` was unintentionally not passing along its `kwargs`. This was corrected by adding the kwargs to the downstream call. #### Testing: ```python from unstructured.partition.image import partition_image output1 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="detectron2_onnx") output2 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="yolox") # These shouldn't be the same, since they were produced using different models. assert output1 != output2 ``` The assertion should fail on `main`, but pass on this branch.	2023-09-15 15:09:58 -05:00
Amanda Cameron	50db2abd9f	fix: updating element types (#1394 ) This PR adds an arg to the html partition flow called `source_format` if anything other than "html" we will return non-HTML elements to conform with the file type we received. addresses: https://github.com/Unstructured-IO/unstructured/issues/726	2023-09-15 11:51:22 -05:00
Sebastian Laverde Alfonso	40b1d0d092	feat: improved chipper elements mapping and new category_depth metadata (#1308 ) Two changes: 1. Improved mapping of `chipper` element types `Headline` (to `Title`), `Subheadline`(to `Title`) and `Abstract`( to `NarrativeText`. 2. New element metadata `category_depth`: `None` unless is `Headline` (`category_depth=1`), or `Subheadline` (`category_depth=2`). The update of `category_depth` happens during the transform `normalize_layout_element`. --------- Co-authored-by: Yao You <theyaoyou@gmail.com> Co-authored-by: Yao You <yao@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: LaverdeS <LaverdeS@users.noreply.github.com> Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com> Co-authored-by: Benjamin Torres <benjamin@unstructured.io>	2023-09-15 14:43:17 +00:00
Newel H	cd704e873b	Feat: Create a naive hierarchy for elements (#1268 ) ## Summary By adding hierarchy to unstructured elements, users will have more information for implementing vector db/LLM chunking strategies. For example, text elements could be queried by their preceding title element. The hierarchy is implemented by a parent_id tag in the element's metadata. ### Features - Introduces a parent_id to ElementMetadata (The id of the parent element, not a pointer) - Creates a rule set for assigning hierarchies. Sensible default is assigned, with an optional override parameter - Sets element parent ids if there isn't an existing parent id or matches the ruleset ### How it works Hierarchies are assigned via a parent id field in element metadata. Elements are read sequentially and evaluated against a ruleset. For example take the following elements: 1. Title, "This is the Title" 2. Text, "this is the text" And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of 2 will be the id of 1. The algorithm for determining this is more complex and resolves several edge cases, so please read the code for further details. ### Schema Changes ``` @dataclass class ElementMetadata: coordinates: Optional[CoordinatesMetadata] = None data_source: Optional[DataSourceMetadata] = None filename: Optional[str] = None file_directory: Optional[str] = None last_modified: Optional[str] = None filetype: Optional[str] = None attached_to_filename: Optional[str] = None + parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None + category_depth: Optional[int] = None ... ``` ### Testing ``` from unstructured.partition.auto import partition from typing import List elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto") for element in elements: print( f"Category: {getattr(element, 'category', '')}\n"\ f"Text: {getattr(element, 'text', '')}\n" f"ID: {element.id}\n" \ f"Parent ID: {element.metadata.parent_id}\n"\ f"Depth: {element.metadata.category_depth}\n" \ ) ``` ### Additional Notes Implementing this feature revealed a possibly undesired side-effect in how element metadata are processed. In `unstructured/partition/common.py` the `_add_element_metadata` is invoked as part of the `add_metadata_with_filetype` decorator for filetype partitioning. This method is intended to add additional information to the metadata generated with the element including filename and filetype, however the existing metadata is merged into a newly created metadata object rather than the other way around. Because of the way it's structured, new metadata fields can easily be forgotten and pose debugging challenges to developers. This likely warrants a new issue. I'm guessing that the implementation is done this way to avoid issues with deserializing elements, but could be wrong. --------- Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>	2023-09-14 11:23:16 -04:00
Ahmet Melek	bd33a52ee0	fix: coordinates metadata hinders chunking (#1374 ) Closes https://github.com/Unstructured-IO/unstructured/issues/1373 This PR: - drops the `coordinates` metadata field in `chunk_by_title` to fix https://github.com/Unstructured-IO/unstructured/issues/1373 (read issue for the details) - adds relevant test that checks the particular case	2023-09-14 10:10:03 +00:00
Klaijan	00181b88df	feat: pdf auto strategy groups broken numbered and bullet list items(#1393 ) Summary Adds logic to combine broken numbered list for pdf fast strategy. Details Previously the document reads the numbered list items part of the `layout-parser-paper-fast.pdf` file as: ``` '1. An oﬀ-the-shelf toolkit for applying DL models for layout detection, character' 'recognition, and other DIA tasks (Section 3)' '2. A rich repository of pre-trained neural network models (Model Zoo) that' 'underlies the oﬀ-the-shelf usage' '3. Comprehensive tools for eﬃcient document image data annotation and model' 'tuning to support diﬀerent levels of customization' '4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)' ``` Now it reads: ``` '1. An oﬀ-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3)' '2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the oﬀ-the-shelf usage' '3. Comprehensive tools for eﬃcient document image data annotation and model' tuning to support diﬀerent levels of customization' '4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)' ``` The added logic leverages `ElementType` and `coordinates` to determine whether the following lines is a part of the previously detected `ListItem` or not. Test Add test that checks the element length less than original version with broken numbered list. The test also checks whether the first detected numbered list ends with previously broken line. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-09-13 21:30:06 +00:00
shreyanid	d87c83d7b6	chore: refactor languages parameter for text_type functions (#1399 ) ### Summary In order to support language functionality other than Tesseract OCR, we want to represent languages provided for either partitioning accuracy or OCR as a standard list of langcodes as strings. To identify element types such as NarrativeText and Title, continue the refactor into functions that use language checks to determine those potential classifications. ### Details Replaces `language` with `languages` (a list of strings) as a parameter to `is_possible_narrative_text` and `is_possible_title`. ### Test Call `is_possible_narrative_text` and `is_possible_title` with text in a variety of languages and different inputs for `languages`. The resulting element classifications should be no different from the current outputs. ex: see `test_text_type_handles_multi_language_examples` in `test_unstructured/partition/test_text_type.py`.	2023-09-13 19:46:36 +00:00
shreyanid	1b7c99d878	chore: refactor languages parameter for auto partition (#1400 ) ### Summary In order to support language functionality other than Tesseract OCR, we want to represent languages provided for either partitioning accuracy or OCR as a standard list of langcodes as strings. ### Details Follows the pattern established with PDFs in #1334. Adds languages (a list of strings) as a parameter to partition in auto.py. Marks ocr_languages for deprecation. ### Test Call partition with a variety of filetypes (especially pdfs/images), strategies, languages, or ocr_languages. - inclusion of ocr_languages as a parameter should display a deprecation warning and may proceed with partitioning if no other conflicts - the other valid call outputs should be no different from the current outputs	2023-09-13 13:07:28 -04:00
shreyanid	2b571eb9a3	chore: refactor languages parameter for image partition functions (#1395 ) Adds languages (a list of strings) as a parameter to `partition_image`. Marks ocr_languages for deprecation.	2023-09-13 04:11:58 +00:00
Amanda Cameron	7fd81dc7df	Table processing test for RTF (#1388 ) This PR does two things: 1. Adds test case (and alters sample doc) for rtf and epub files with table 2. Adds `xls/x` file extension to `skip_infer_table_types` default list --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2023-09-12 18:27:05 -07:00
qued	6595632a57	enhancement: backup text categorization (#1322 ) Currently there are some cases when `partition_pdf` is run using the `hi_res` strategy, in which elements can come back with category `UncategorizedText`. This happens when the detection model fails to detect an element, but we're able to find it anyway either because it was embedded in the PDF, or we found it using OCR. This commit is to allow for attempting to categorize these uncategorized elements using our text-based classification function, `element_from_text`.	2023-09-12 20:32:48 +00:00
shreyanid	c2853e4ac3	refactor languages parameter for pdf partition functions (#1334 ) ### Summary In order to support language functionality other than Tesseract OCR, we want to represent languages provided for either partitioning accuracy or OCR as a standard list of langcodes as strings. ### Details Adds `languages` (a list of strings) as a parameter to pdf partitioning functions. Marks `ocr_languages` for deprecation. Adds a new file `lang.py` for language-related helper functions. Coming up: langcode standardization, language detection ### Test Call `partition_pdf` or `partition_pdf_or_image` with a variety of strategies, languages, or `ocr_languages`. - inclusion of `ocr_languages` as a parameter should display a deprecation warning - the other valid call outputs should be no different from the current outputs. ex: ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf(filename="example-docs/DA-1p.pdf", strategy="hi_res", languages=["eng", "spa"]) print("\n\n".join([str(el) for el in elements])) ```	2023-09-12 16:15:26 +00:00
John	c58b261feb	chunk_by_title decorator (#1304 ) ### Summary Partial solution to #1185. Related to #1222. Creates decorator from `chunk_by_title` cleaning brick. Breaks a document into sections based on the presence of Title elements. Also starts a new section under the following conditions: - If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to True. - If the length of the section exceeds `new_after_n_chars` characters. The default is 1500. The chunking function does not split individual elements, so it's possible for a section to exceed that threshold if an individual element if over `new_after_n_chars characters`, which could occur with a long NarrativeText element. Combines sections under these conditions - Sections under `combine_under_n_chars` characters are combined. The default is 500. ### Testing from unstructured.partition.html import partition_html url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0" chunks = partition_html(url=url, chunking_strategy="by_title") for chunk in chunks: print(chunk) print("\n\n" + "-"*80) input()	2023-09-11 21:00:14 +00:00
Amanda Cameron	a501d1d18f	Adding table extraction to partition_html (#1324 ) Adding table extraction to HTML partitioning. This PR utilizes 'table' HTML elements to extract and parse HTML tables and return them in partitioning. ``` # checkout this branch, go into ipython shell In [1]: from unstructured.partition.html import partition_html In [2]: path_to_html = "{html sample file with table}" In [3]: elements = partition_html(path_to_html) ``` you should see the table in the elements list!	2023-09-11 11:14:11 -07:00
cragwolfe	d0749d181f	fix: avoid PDF sorting error on negative coords (#1361 ) The default sorting algorithm for PDF's, "xycut," would cause an error when partitioning a document if Y coordinate points were negative. This change checks for that condition (or more broadly, any negative coordinates) and falls back to the "basic" sort if that is the case. This PR does not address the underlying issue of "bad points" which still should be investigated. However, the sorting code should be less brittle to unexpected bounding boxes in the first case. Resolves: https://github.com/Unstructured-IO/unstructured/issues/1296	2023-09-10 19:29:49 -07:00
suraj chauhan	f4bf1fa270	Chore: Libmagic detection for "application/octet-stream" when it is not a zip file. (#1347 ) Addressed the issue #494 . Updated the `_detect_filetype_from_octet_stream()` function to use libmagic to infer the content type of file when it is not a zip file.	2023-09-08 18:49:00 +00:00
pravin-unstructured	8641fe39dc	Add Model Probabilities to Hi-Res strategy MetaData for Images + PDFs. (#1323 ) If a layout model is used from unstructured-inference, you get back class probabilities in the element metadata from partition. extra-pdf-image-in in requirements already has the newest version of unstructured-inference in there without a pinned version. Is there any place else that the unstructured-inference version needs to be updated to the required release version, 0.5.22?	2023-09-07 22:56:43 -04:00
Wahab Alshahin	e4e25c9feb	Add clean_ligatures to core cleaners (#1326 ) # Background [Ligatures](https://en.wikipedia.org/wiki/Ligature_(writing)#Ligatures_in_Unicode_(Latin_alphabets)) can sometimes show up during the text extraction process when they should not. Very common examples of this are with the Latin `f` related ligatures which can be very subtle to spot by eye (see example below), but can wreak havoc later. ```python "ﬀ": "ff", "ﬁ": "fi", "ﬂ": "fl", "ﬃ": "ffi", "ﬄ": "ffl", ``` Several libraries already do something like this. Most recently, `pdfplumber` added this sort of capability as part of the text extraction process, see https://github.com/jsvine/pdfplumber/issues/598 Instead of incorporating any sort of breaking change to the PDF text processing in `unstructured`, it is best to add this as another cleaner and allow users to opt in. In turn, the `clean_ligatures` method has been added in this PR - with accompanying tests. # Example Here is an example PDF that causes the issue. For example: `Beneﬁts`, which should be `Benefits`. [example.pdf](https://github.com/Unstructured-IO/unstructured/files/12544344/example.pdf) ```bash curl -X 'POST' \ 'https://api.unstructured.io/general/v0/general' \ -H 'accept: application/json' \ -H 'Content-Type: multipart/form-data' \ -H 'unstructured-api-key: ${UNSTRUCTURED_API_KEY}' \ -F 'files=@example.pdf' \ -s \| jq -C . ``` # Notes An initial list of mappings was added with the most common ligatures. There is some subjectivity to this, but this should be a relatively safe starting set. Can always be expanded as needed.	2023-09-07 21:30:18 +00:00

1 2 3 4 5 ...

326 Commits