unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-07 17:12:48 +00:00

Author	SHA1	Message	Date
Steve Canny	f2fee0c32f	fix(auto): partition() passes strategy to DOC,ODT (#3278 ) Summary Remedy gap where `strategy` argument passed to `partition()` was not forwarded to `partition_doc()` or `partition_odt()` and so was not making its way to `partition_docx()`.	2024-06-26 00:29:47 +00:00
Yao You	edddf9f6ee	Feat/pass down strategy to partition ppt as well (#3274 ) Following the same pattern of https://github.com/Unstructured-IO/unstructured/pull/3273 and pass down `strategy` parameter to `partition_ppt` as well.	2024-06-22 02:23:58 +00:00
Steve Canny	16df6944dd	fix(auto): partition() passes strategy to PPTX,DOCX (#3273 ) Summary Remedy gap where `strategy` argument passed to `partition()` was not forwarded to `partition_pptx()` or `partition_docx()`.	2024-06-22 00:16:39 +00:00
Christine Straub	9552fbbfbf	chore: bump unstructured-inference 0.7.35 (#3205 ) ### Summary - bump unstructured-inference to `0.7.35` which fixed syntax for generated HTML tables - update unit tests and ingest test fixtures to reflect changes in the generated HTML tables - cut a release for `0.14.6` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-06-14 18:11:38 +00:00
Christine Straub	35ec21ecd0	fix: decide table extraction (#3090 ) This PR aims to add backward compatibility for the deprecated `pdf_infer_table_structure` parameter. A missing part of turning table extraction for PDFs and Images off by default in https://github.com/Unstructured-IO/unstructured/pull/3035, which was turned on in https://github.com/Unstructured-IO/unstructured/pull/2588. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-05-23 20:37:15 +00:00
Steve Canny	b4ee019170	rfctr: flatten test_unstructured/partition (#3073 ) Summary Some partitioner test modules are placed in directories by themselves or with one other test module. This unnecessarily obscures where to find the test module corresponding to a partitioner. Move partitioner test modules to mirror the directory structure of `unstructured/partition`.	2024-05-23 00:51:08 +00:00
Matt Robinson	ec987dcbb2	BREAKING CHANGE: revert table extraction off by default for PDFs and images (#3035 ) ### Summary Closes #3021 . Turns table extraction for PDFs and images off by default. The default behavior originally changed in #2588 . The reason for reversion is that some users did not realize turning off table extraction was an option and experience long processing times for PDFs and images with the new default behavior. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2024-05-17 15:28:11 +00:00
Matt Robinson	612905e311	build: wolfi base image for Dockerfile (#3016 ) ### Summary Updates the `Dockerfile` to use the Chainguard `wolfi-base` image to reduce CVEs. Also adds a step in the docker publish job that scans the images and checks for CVEs before publishing. The job will fail if there are high or critical vulnerabilities. ### Testing Run `make docker-run-dev` and then `python3.11` once you're in. And that point, you can try: ```python from unstructured.partition.auto import partition elements = partition(filename="example-docs/DA-1p.pdf", skip_infer_table_types=["pdf"]) elements ``` Stop the container once you're done.	2024-05-15 22:53:15 +00:00
Yuming Long	542d442699	chore CORE-4775: remove html page number metadata field (#2942 ) ### Summary Rip out page_number metadata fields until we have page counting for all kinds of html files (not just limited to news articles with multiple `<article>` tags) ### Test Unit tests `test_add_chunking_strategy_on_partition_html_respects_multipage` and `test_add_chunking_strategy_title_on_partition_auto_respects_multipage` removed since they rely on the `page_number` fields from the SEC html file - now test moved to mock test for chunk_by_title -> revisit those tests when we find test file for this Also changed the element ids from partition outputs for html files - element id change due to page number change (in element id hashing) -> todo ticket: update other deterministic element id tests per crag's comment --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2024-04-30 15:20:26 +00:00
Marco Lüthy	0d80886578	fix: parse URL response Content-Type according to RFC 9110 (#2950 ) Currently, `file_and_type_from_url()` does not correctly handle the `Content-Type` header. Specifically, it assumes that the header contains only the mime-type (e.g. `text/html`), however, [RFC 9110](https://www.rfc-editor.org/rfc/rfc9110#field.content-type) allows for additional directives — specifically the `charset` — to be returned in the header. This leads to a `ValueError` when loading a URL with a response Content-Type header such as `text/html; charset=UTF-8`. To reproduce the issue: ```python from unstructured.partition.auto import partition url = "https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/" partition(url=url) ``` Which will result in the following exception: ```python { "name": "ValueError", "message": "Invalid file. The FileType.UNK file type is not supported in partition.", "stack": "--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[1], line 4 1 from unstructured.partition.auto import partition 3 url = \"https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/\" ----> 4 partition(url=url) File ~/miniconda3/envs/ai-tasks/lib/python3.11/site-packages/unstructured/partition/auto.py:541, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, date_from_file_object, starting_page_number, **kwargs) 539 else: 540 msg = \"Invalid file\" if not filename else f\"Invalid file 
{filename}\" --> 541 raise ValueError(f\"{msg}. The {filetype} file type is not supported in partition.\") 543 for element in elements: 544 element.metadata.url = url ValueError: Invalid file. The FileType.UNK file type is not supported in partition." } ``` This PR fixes the issue by parsing the mime-type out of the `Content-Type` header string. Closes #2257	2024-04-29 22:53:44 -07:00
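The fix described in #2950 amounts to dropping any parameters that follow the media type in the `Content-Type` value. A minimal sketch of that idea, assuming a hypothetical helper name `mime_type_from_content_type` (this is not the code in `file_and_type_from_url()`, only an illustration of the parsing step):

```python
def mime_type_from_content_type(content_type: str) -> str:
    """Return the bare media type from a Content-Type header value.

    Hypothetical helper for illustration; not part of `unstructured`.
    """
    # Parameters such as `charset` follow the first ";" (RFC 9110, Content-Type).
    media_type = content_type.split(";", 1)[0]
    # Media types are case-insensitive, so normalize before comparing.
    return media_type.strip().lower()


print(mime_type_from_content_type("text/html; charset=UTF-8"))
```

With the header from the report above, the helper yields `text/html`, which can then be matched against the known mime-types instead of falling through to `FileType.UNK`.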
Michał Martyniak	2d1923ac7e	Better element IDs - deterministic and document-unique hashes (#2673 ) Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842 Main changes compared to part one: * hash computation includes element's sequence number on page, page number, document filename and its text * there are more test for deterministic behavior of IDs returned by partitioning functions + their uniqueness (guaranteed at the document level, and high probability across multiple documents) This PR addresses the following issue: https://github.com/Unstructured-IO/unstructured/issues/2461	2024-04-24 00:05:20 -07:00
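The hashing scheme described in #2673 can be sketched as below. The commit message only says which inputs go into the hash (sequence number on page, page number, filename, text); the function name, the SHA-256 choice, the field separator, and the 32-character truncation are all assumptions made for illustration and may differ from the library's implementation:

```python
import hashlib
from typing import Optional


def sketch_element_id(
    text: str,
    filename: Optional[str],
    page_number: Optional[int],
    sequence_number: int,
) -> str:
    """Deterministic ID built from the four inputs named in the commit message.

    Hypothetical helper; hash function and truncation length are assumptions.
    """
    # A separator keeps ("ab", "c") and ("a", "bc") from hashing identically.
    data = "\x1f".join([str(filename), text, str(page_number), str(sequence_number)])
    return hashlib.sha256(data.encode("utf-8")).hexdigest()[:32]


first = sketch_element_id("Hello", "doc.pdf", 1, 0)
repeat = sketch_element_id("Hello", "doc.pdf", 1, 0)
second = sketch_element_id("Hello", "doc.pdf", 1, 1)
print(first == repeat, first == second)
```

Because the sequence number is part of the input, two elements with identical text on the same page still receive different IDs, while re-partitioning the same document reproduces the same IDs.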
Michał Martyniak	cb1e91058e	Introduce `start_page` argument to partitioning functions that assign `element.metadata.page_number` (#2884 ) This small change will be useful for users who partition only fragments of their PDF documents. It's a small step towards addressing this issue: https://github.com/Unstructured-IO/unstructured/issues/2461 Related PRs: * https://github.com/Unstructured-IO/unstructured/pull/2842 * https://github.com/Unstructured-IO/unstructured/pull/2673	2024-04-15 21:03:42 +00:00
Filip Knefel	bdfd975115	chore: change table extraction defaults (#2588 ) Change default values for table extraction - works in pair with [this](https://github.com/Unstructured-IO/unstructured-api/pull/370) `unstructured-api` PR We want to move away from `pdf_infer_table_structure` parameter, in this PR: - We change how it's treated wrt `skip_infer_table_types` parameter. Whether to extract tables from pdf now follows from the rule: `pdf_infer_table_structure && "pdf" not in skip_infer_table_types` - We set it to `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` by default - We remove it from the examples in documentation - We describe it as deprecated in favor of `skip_infer_table_types` in documentation More detailed description of how we want parameters to interact - if `pdf_infer_table_structure` is False tables will never be extracted from pdf - if `pdf_infer_table_structure` is True tables will be extracted from pdf unless it's skipped via `skip_infer_table_types` - by default `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` --------- Co-authored-by: Filip Knefel <filip@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-22 10:08:49 +00:00
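The rule stated in #2588 can be written out as a small predicate. This is a sketch only: the function name `should_extract_pdf_tables` is hypothetical, and the defaults shown are the ones this commit introduced (later reverted by #3035, listed above):

```python
from typing import List, Optional


def should_extract_pdf_tables(
    pdf_infer_table_structure: bool = True,
    skip_infer_table_types: Optional[List[str]] = None,
) -> bool:
    """Apply the rule `pdf_infer_table_structure && "pdf" not in skip_infer_table_types`.

    Hypothetical helper for illustration; not part of `unstructured`.
    """
    # Treat a missing skip list as the empty default described in the commit.
    skip = skip_infer_table_types if skip_infer_table_types is not None else []
    return pdf_infer_table_structure and "pdf" not in skip


print(should_extract_pdf_tables())
print(should_extract_pdf_tables(skip_infer_table_types=["pdf"]))
print(should_extract_pdf_tables(pdf_infer_table_structure=False))
```

Either switch can turn extraction off, but only both together leave it on, which is why the deprecated flag could be kept for compatibility while `skip_infer_table_types` became the preferred control.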
Filip Knefel	6af6604057	feat: introduce `date_from_file_object` parameter to partitions (#2563 ) Introduce `date_from_file_object` to `partition*` functions, by default set to `False`. If set to `True` and the file is provided via the `file` parameter, partition will attempt to infer the last modified date from `file`'s contents; otherwise, the last modified metadata will be set to `None`. --------- Co-authored-by: Filip Knefel <filip@unstructured.io> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-18 01:09:44 +00:00
Steve Canny	d9f8467187	fix(xlsx): xlsx subtable algorithm (#2534 ) Reviewers: It may be easier to review each of the two commits separately. The first adds the new `_SubtableParser` object with its unit-tests and the second one uses that object to replace the flawed existing subtable-parsing algorithm. Summary There are a cluster of bugs in `partition_xlsx()` that all derive from flaws in the algorithm we use to detect "subtables". These are encountered when the user wants to get multiple document-elements from each worksheet, which is the default (argument `find_subtable = True`). This PR replaces the flawed existing algorithm with a `_SubtableParser` object that encapsulates all that logic and has thorough unit-tests. Additional Context This is a summary of the failure cases. There are a few other cases but they're closely related and this was enough evidence and scope for my purposes. This PR fixes all these bugs: ```python # # -- ✅ CASE 1: There are no leading or trailing single-cell rows. # -> this subtable functions never get called, subtable is emitted as the only element # # a b -> Table(a, b, c, d) # c d # -- ✅ CASE 2: There is exactly one leading single-cell row. # -> Leading single-cell row emitted as `Title` element, core-table properly identified. # # a -> [ Title(a), # b c Table(b, c, d, e) ] # d e # -- ❌ CASE 3: There are two-or-more leading single-cell rows. # -> leading single-cell rows are included in subtable # # a -> [ Table(a, b, c, d, e, f) ] # b # c d # e f # -- ❌ CASE 4: There is exactly one trailing single-cell row. # -> core table is dropped. trailing single-cell row is emitted as Title # (this is the behavior in the reported bug) # # a b -> [ Title(e) ] # c d # e # -- ❌ CASE 5: There are two-or-more trailing single-cell rows. # -> core table is dropped. trailing single-cell rows are each emitted as a Title # # a b -> [ Title(e), # c d Title(f) ] # e # f # -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows. 
# -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b c Table(b, c, d, e), # d e Title(f) ] # f # -- ✅ CASE 7: There are two leading and one trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g) ] # g # -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g), # g Title(h) ] # h # -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below. # -> First cell is mistakenly emitted as title, remaining cells are dropped. # # a b c -> [ Title(a) ] # -- ❌ CASE 10: Single-row subtable with one leading single-cell row. # -> Leading single-row cell is correctly identified as title, core-table is mis-identified # as a `Title` and truncated. # # a -> [ Title(a), # b c d Title(b) ] ```	2024-02-13 20:29:17 -08:00
Matt Robinson	ccf0477080	enhancement: process `.p7s` files with `partition_email` (#2521 ) ### Summary Closes #2489, which reported an inability to process `.p7s` files. PR implements two changes: - If the user selected content type for the email is not available and there is another valid content type available, fall back to the other valid content type. - For signed message, extract the signature and add it to the metadata ### Testing ```python from unstructured.partition.auto import partition filename = "example-docs/eml/signed-doc.p7s" elements = partition(filename=filename) # should get a message about fall back logic print(elements[0]) # "This is a test" elements[0].metadata.to_dict() # Will see the signature ```	2024-02-07 22:31:49 +00:00
Ahmet Melek	be71633415	refactor: isolate ingest dependencies into local scopes (#2509 ) This PR: - Moves ingest dependencies into local scopes to be able to import ingest connector classes without the need of installing imported external dependencies. This allows lightweight use of the classes (not the instances. to use the instances as intended you'll still need the dependencies). - Upgrades the embed module dependencies from `langchain` to `langchain-community` module (to pass CI [rather than introducing a pin]) - Does pip-compile - Does minor refactors in other files to pass `ruff 2.0` checks which were introduced by pip-compile	2024-02-06 21:28:55 +00:00
John	db67805ec6	feat: add support for partitioning .heic files (#2454 ) .heic files are an image filetype we have not supported. #### Testing ``` from unstructured.partition.image import partition_image png_filename = "example-docs/DA-1p.png" heic_filename = "example-docs/DA-1p.heic" png_elements = partition_image(png_filename, strategy="hi_res") heic_elements = partition_image(heic_filename, strategy="hi_res") for i in range(len(heic_elements)): print(heic_elements[i].text == png_elements[i].text) ``` --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-01-30 04:49:00 +00:00
Yao You	97fb10db4a	fix: default hi_res model rely on inference setting (#2441 ) - there are multiple places setting the default `hi_res_model_name` in both `unstructured` and `unstructured-inference` - they lead to inconsistency and unexpected behaviors - this fix removes a helper in `unstructured` that tries to set the default hi_res layout detection model; instead we rely on the `unstructured-inference` to provide that default when no explicit model name is passed in ## test ```bash UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true ipython ``` ```python from unstructured.partition.auto import partition # find a pdf file elements = partition("foo.pdf", strategy="hi_res") assert elements[0].metadata.detection_origin == "yolox" ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2024-01-29 16:44:41 +00:00
Matt Robinson	4d5038d9fd	enhancement: add support from bitmap images (#2414 ) ### Summary Adds support for bitmap images (`.bmp`) in both file detection and partitioning. Bitmap images will be processed with `partition_image` just like JPGs and PNGs. ### Testing ```python from unstructured.file_utils.filetype import detect_filetype from unstructured.partition.auto import partition from PIL import Image filename = "example-docs/layout-parser-paper-with-table.jpg" bmp_filename = "~/tmp/ayout-parser-paper-with-table.bmp" img = Image.open(filename) img.save(bmp_filename) detect_filetype(filename=bmp_filename) # Should be FileType.BMP elements = partition(filename=bmp_filename) ```	2024-01-17 22:50:36 +00:00
Matt Robinson	36faf677c0	enhancement: file detection for `.wav` files (#2387 ) ### Summary Adds filetype detection for `.wav` audio files ### Testing ```python from unstructured.file_utils.filetype import detect_filetype filename = "example-docs/CantinaBand3.wav" detect_filetype(filename=filename) # Should be FileType.WAV ```	2024-01-15 16:50:49 +00:00
Steve Canny	23edf2e911	feature(chunking): add basic strategy and overlap (#2367 ) This PR culminates the restructuring of chunking over my prior dozen-or-so commits by adding the new options to the API and documentation. Separately I'll be adding a new ingest test to defend against regression, although the integration test included in this PR will do a pretty good job of that too.	2024-01-10 22:19:24 +00:00
Steve Canny	22cbdce7ca	fix(html): unequal row lengths in HTMLTable.text_as_html (#2345 ) Fixes #2339 Fixes to HTML partitioning introduced with v0.11.0 removed the use of `tabulate` for forming the HTML placed in `HTMLTable.text_as_html`. This had several benefits, but part of `tabulate`'s behavior was to make row-length (cell-count) uniform across the rows of the table. Lacking this prior uniformity produced a downstream problem reported in On closer inspection, the method used to "harvest" cell-text was producing more text-nodes than there were cells and was sensitive to where whitespace was used to format the HTML. It also "moved" text to different columns in certain rows. Refine the cell-text gathering mechanism to get exactly one text string for each row cell, eliminating whitespace formatting nodes and producing strict correspondence between the number of cells in the original HTML table row and that placed in HTML.text_as_html. HTML tables that are uniform (every row has the same number of cells) will produce a uniform table in `.text_as_html`. Merged cells may still produce a non-uniform table in `.text_as_html` (because the source table is non-uniform).	2024-01-04 21:53:19 +00:00
Christine Straub	5b0ae3fd8b	Refactor: rename image extraction kwargs (#2303 ) Currently, we're using different kwarg names in partition() and partition_pdf(), which has implications for the API since it goes through partition(). ### Summary - rename `extract_element_types` -> `extract_image_block_types` - rename `image_output_dir_path` to `extract_image_block_output_dir` - rename `extract_to_payload` -> `extract_image_block_to_payload` - rename `pdf_extract_images` -> `extract_images_in_pdf` in `partition.auto` - add unit tests to test element extraction for `pdf/image` via `partition.auto` ### Testing CI should pass.	2024-01-04 17:52:00 +00:00
Christine Straub	dd144456de	Feat: return base64 encoded images for PDF's (#2310 ) Closes #2302. ### Summary - add functionality to get a Base64 encoded string from a PIL image - store base64 encoded image data in two metadata fields: `image_base64` and `image_mime_type` - update the "image element filter" logic to keep all image elements in the output if a user specifies image extraction ### Testing ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf( filename="example-docs/embedded-images-tables.pdf", strategy="hi_res", extract_element_types=["Image", "Table"], extract_to_payload=True, ) ``` or ``` from unstructured.partition.auto import partition elements = partition( filename="example-docs/embedded-images-tables.pdf", strategy="hi_res", pdf_extract_element_types=["Image", "Table"], pdf_extract_to_payload=True, ) ```	2023-12-27 05:39:01 +00:00
John	5c0043aa7d	chore: add hi_res_model_name kwarg (#2289 ) Closes #2160 Explicitly adds `hi_res_model_name` as kwarg to relevant functions and notes that `model_name` is to be deprecated. Testing: ``` from unstructured.partition.auto import partition filename = "example-docs/DA-1p.pdf" elements = partition(filename, strategy="hi_res", hi_res_model_name="yolox") ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Steve Canny <stcanny@gmail.com> Co-authored-by: Christine Straub <christinemstraub@gmail.com> Co-authored-by: Yao You <yao@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-12-22 15:06:54 +00:00
Steve Canny	093a11d058	rfctr(chunking): split oversized chunks on word boundary (#2297 ) The text of an oversized chunk is split on an arbitrary character boundary (mid-word). The `chunk_by_character()` strategy introduces the idea of allowing the user to specify a separator to use for chunk-splitting. For `langchain` this is typically "\n\n", "\n", or " "; blank-line, newline, or word boundaries respectively. Even if the user is allowed to specify a separator, we must provide fall-back for when a chunk contains no such character. This can be done incrementally, like blank-line is preferable to newline, newline is preferable to word, and word is preferable to arbitrary character. Further, there is nothing particular to `chunk_by_character()` in providing such a fall-back text-splitting strategy. It would be preferable for all strategies to split oversized chunks on even-word boundaries for example. Note that while a "blank-line" ("\n\n") may be common in plain text, it is unlikely to appear in the text of an element because it would have been interpreted as an element boundary during partitioning. Add _TextSplitter with basic separator preferences and fall-back and apply it to chunk-splitting for all strategies. The `by_character` chunking strategy may enhance this behavior by adding the option for a user to specify a particular separator suited to their use case.	2023-12-21 05:45:36 +00:00
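The separator-preference fall-back described in #2297 can be illustrated as follows. This is a simplified sketch, not the `_TextSplitter` added by the commit: the function name, the default separator order, and the whitespace trimming are assumptions:

```python
from typing import Sequence, Tuple


def split_oversized(
    text: str, max_len: int, separators: Sequence[str] = ("\n", " ")
) -> Tuple[str, str]:
    """Split `text` into (head, remainder) with len(head) <= max_len.

    Hypothetical helper. Separators are tried in order of preference; if none
    occurs inside the window, the text is cut on an arbitrary character boundary.
    """
    if len(text) <= max_len:
        return text, ""
    window = text[:max_len]
    for sep in separators:
        # Use the last occurrence so the head is as long as possible.
        idx = window.rfind(sep)
        if idx > 0:
            return window[:idx].rstrip(), text[idx + len(sep):].lstrip()
    # Last resort: no preferred separator found, split mid-word.
    return window, text[max_len:]


print(split_oversized("alpha beta gamma", 12))
print(split_oversized("abcdefghij", 4))
```

The first call splits on the word boundary before `gamma`; the second has no separator in the window and falls back to a plain character cut.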
Christine Straub	a7c3f5f570	Refactor: importation consistency for `partition_pdf()` and `partition_image()` (#2282 ) Closes #2278. This PR also removes the `extract_tables_in_pdf` mentioned in issue #2280.	2023-12-15 22:29:58 +00:00
Christine Straub	69d0ee1aea	Refactor: support merging `extracted` layout with `inferred` layout (#2158 ) ### Summary This PR is the second part of `pdfminer` refactor to move it from `unstructured-inference` repo to `unstructured` repo, the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/294. This PR adds logic to merge the extracted layout with the inferred layout. The updated workflow for the `hi_res` strategy: * pass the document (as data/filename) to the `inference` repo to get `inferred_layout` (DocumentLayout) * pass the `inferred_layout` returned from the `inference` repo and the document (as data/filename) to the `pdfminer_processing` module, which first opens the document (create temp file/dir as needed), and splits the document by pages * if is_image is `True`, return the passed inferred_layout(DocumentLayout) * if is_image is `False`: * get extracted_layout (TextRegions) from the passed document(data/filename) by pdfminer * merge `extracted_layout` (TextRegions) with the passed `inferred_layout` (DocumentLayout) * return the `inferred_layout `(DocumentLayout) with updated elements (all merged LayoutElements) as merged_layout (DocumentLayout) * pass merged_layout and the document (as data/filename) to the `OCR` module, which first opens the document (create temp file/dir as needed), and splits the document by pages (convert PDF pages to image pages for PDF file) ### Note This PR also fixes issue #2164 by using functionality similar to the one implemented in the `fast` strategy workflow when extracting elements by `pdfminer`. ### TODO * image extraction refactor to move it from `unstructured-inference` repo to `unstructured` repo * improving natural reading order by applying the current default `xycut` sorting to the elements extracted by `pdfminer`	2023-12-01 20:56:31 +00:00
Christine Straub	e114e5c418	Refactor: partition pdf (#2074 ) ### Summary - add constants for strategies - add `_process_uncategorized_text_elements()` to remove code block duplication ### Testing CI should pass.	2023-11-15 21:41:02 -08:00
qued	92ddf3a337	feat: enable request timeout (#2013 ) Courtesy @cdpierse. Adds a test to PR #1529 in accordance with feedback. Description from original PR: In python the default behaviour of `requests.get` without a `timeout` being set is to hang indefinitely. We have a production use case where the desired behaviour would be to raise a timeout error rather than have the application just hang. This PR adds a new optional keyword parameter `request_timeout` to `partition` which is passed to `file_and_type_from_url` in the case where we are fetching from a URL. This is then passed to `requests.get` --------- Co-authored-by: Charles Pierse <charlespierse@gmail.com>	2023-11-08 00:44:58 +00:00
John	b92cab7fbd	fix languages 500 error with empty string for ocr_languages (#1968 ) Closes #1870 Defining both `languages` and `ocr_languages` raises a ValueError, but the api defaults to `ocr_languages` being an empty string, so if users define `languages` they are automatically hitting the ValueError. This fix checks if `ocr_languages` is an empty string and converts it to `None` to avoid this. ### Testing On the main branch, the following will raise the ValueError, but it will correctly partition on this branch ``` from unstructured.partition.auto import partition filename = "example-docs/category-level.docx" elements = partition(filename,languages=['spa'],ocr_languages="") elements[0].metadata.languages ``` --------- Co-authored-by: yuming <305248291@qq.com> Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com> Co-authored-by: Austin Walker <awalk89@gmail.com>	2023-11-01 22:02:00 +00:00
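The behaviour described in #1968 can be sketched as a normalization step followed by the conflict check. Both function names below are hypothetical, and the error message is invented for illustration; only the logic (empty string becomes `None`, defining both parameters raises `ValueError`) comes from the commit message:

```python
from typing import List, Optional


def normalize_ocr_languages(ocr_languages: Optional[str]) -> Optional[str]:
    """Treat the API's empty-string default as 'not provided'. Hypothetical helper."""
    return None if ocr_languages == "" else ocr_languages


def check_language_args(
    languages: Optional[List[str]], ocr_languages: Optional[str]
) -> None:
    """Raise if both parameters are defined after normalization. Hypothetical helper."""
    if languages is not None and normalize_ocr_languages(ocr_languages) is not None:
        raise ValueError("Only one of `languages` and `ocr_languages` may be defined.")


# With the fix, an empty `ocr_languages` no longer conflicts with `languages`.
check_language_args(["spa"], "")
print(normalize_ocr_languages(""))
```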
Yuming Long	01a0e003d9	Chore: stop passing extract_tables to inference and note table regression on entire doc OCR (#1850 ) ### Summary A follow up ticket on https://github.com/Unstructured-IO/unstructured/pull/1801, I forgot to remove the lines that pass extract_tables to inference, and noted the table regression if we only do one OCR for entire doc Tech details: * stop passing `extract_tables` parameter to inference * added table extraction ingest test for image, which was skipped before, and the "text_as_html" field contains the OCR output from the table OCR refactor PR * replaced `assert_called_once_with` with `call_args` so that the unit tests don't need to test additional parameters * added `error_margin` as ENV when comparing bounding boxes of`ocr_region` with `table_element` * added more tests for tables and noted the table regression in test for partition pdf ### Test * for stop passing `extract_tables` parameter to inference, run test `test_partition_pdf_hi_res_ocr_mode_with_table_extraction` before this branch and you will see warning like `Table OCR from get_tokens method will be deprecated....`, which means it called the table OCR in inference repo. This branch removed the warning.	2023-10-24 17:13:28 +00:00
Amanda Cameron	0584e1d031	chore: fix infer_table bug (#1833 ) Carrying `skip_infer_table_types` to `infer_table_structure` in partition flow. Now PPT/X, DOC/X, etc. Table elements should not have a `text_as_html` field. Note: I've continued to exclude this var from partitioners that go through html flow, I think if we've already got the html it doesn't make sense to carry the infer variable along, since we're not 'infer-ing' the html table in these cases. TODO: ✅ add unit tests --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: amanda103 <amanda103@users.noreply.github.com>	2023-10-24 00:11:53 +00:00
qued	7fdddfbc1e	chore: improve kwarg handling (#1810 ) Closes `unstructured-inference` issue [#265](https://github.com/Unstructured-IO/unstructured-inference/issues/265). Cleaned up the kwarg handling, taking opportunities to turn instances of handling kwargs as dicts to just using them as normal in function signatures. #### Testing: Should just pass CI.	2023-10-23 04:48:28 +00:00
Yuming Long	ce40cdc55f	Chore (refactor): support table extraction with pre-computed ocr data (#1801 ) ### Summary Table OCR refactor, move the OCR part for table model in inference repo to unst repo. * Before this PR, table model extracts OCR tokens with texts and bounding box and fills the tokens to the table structure in inference repo. This means we need to do an additional OCR for tables. * After this PR, we use the OCR data from entire page OCR and pass the OCR tokens to inference repo, which means we only do one OCR for the entire document. Tech details: * Combined env `ENTIRE_PAGE_OCR` and `TABLE_OCR` to `OCR_AGENT`, this means we use the same OCR agent for entire page and tables since we only do one OCR. * Bump inference repo to `0.7.9`, which allow table model in inference to use pre-computed OCR data from unst repo. Please check in [PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256). * All notebooks lint are made by `make tidy` * This PR also fixes [issue](https://github.com/Unstructured-IO/unstructured/issues/1564), I've added test for the issue in `test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages` * Add same scaling logic to image [similar to previous Table OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113), but now scaling is applied to entire image ### Test * Not much to manually testing expect table extraction still works * But due to change on scaling and use pre-computed OCR data from entire page, there are some slight (better) changes on table output, here is an comparison on test outputs i found from the same test `test_partition_image_with_table_extraction`: screen shot for table in `layout-parser-paper-with-table.jpg`: <img width="343" alt="expected" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c"> before refactor: <img width="709" alt="before" 
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e"> after refactor: <img width="705" alt="after" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d"> ### TODO (added as a ticket) Still have some clean up to do in inference repo since now unst repo have duplicate logic, but can keep them as a fall back plan. If we want to remove anything OCR related in inference, here are items that is deprecated and can be removed: * [`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77) (already noted in code) * parameter `extract_tables` in inference * [`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88) * [`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197) * env `TABLE_OCR` ### Note if we want to fallback for an additional table OCR (may need this for using paddle for table), we need to: * pass `infer_table_structure` to inference with `extract_tables` parameter * stop passing `infer_table_structure` to `ocr.py` --------- Co-authored-by: Yao You <yao@unstructured.io>	2023-10-21 00:24:23 +00:00
Léa	89fa88f076	fix: stop csv and tsv dropping the first line of the file (#1530 ) The current code assumes the first line of a csv or tsv file is a header line. Most csv and tsv files don't have a header line, and even for those that do, dropping this line may not be the desired behavior. Here is a snippet of code that demonstrates the current behavior and the proposed fix: ``` import pandas as pd from lxml.html.soupparser import fromstring as soupparser_fromstring c1 = """ Stanley Cups,, Team,Location,Stanley Cups Blues,STL,1 Flyers,PHI,2 Maple Leafs,TOR,13 """ f = "./test.csv" with open(f, 'w') as ff: ff.write(c1) print("Suggested Improvement Keep First Line") table = pd.read_csv(f, header=None) html_text = table.to_html(index=False, header=False, na_rep="") text = soupparser_fromstring(html_text).text_content() print(text) print("\n\nOriginal Loses First Line") table = pd.read_csv(f) html_text = table.to_html(index=False, header=False, na_rep="") text = soupparser_fromstring(html_text).text_content() print(text) ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com> Co-authored-by: Yao You <yao@unstructured.io>	2023-10-16 17:59:35 -05:00
John	6d7fe3ab02	fix: default to None for the languages metadata field (#1743 ) ### Summary Closes #1714 Changes the default value for `languages` to `None` for elements that don't have text or the language can't be detected. ### Testing ``` from unstructured.partition.auto import partition filename = "example-docs/handbook-1p.docx" elements = partition(filename=filename, detect_language_per_element=True) # PageBreak elements don't have text and will be collected here none_langs = [element for element in elements if element.metadata.languages is None] none_langs[0].text ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-14 22:46:24 +00:00
John	9500d04791	detect document language across all partitioners (#1627 ) ### Summary Closes #1534 and #1535 Detects document language using `langdetect` package. Creates new kwargs for user to set the document language (`languages`) or detect the language at the element level instead of the default document level (`detect_language_per_element`) --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Austin Walker <austin@unstructured.io>	2023-10-11 01:47:56 +00:00
Amanda Cameron	f98d5e65ca	chore: adding max_characters to other element type chunking (#1673 ) This PR adds the `max_characters` (hard max) param to non-table element chunking. Additionally updates the `num_characters` metadata to `max_characters` to make it clearer which param we're referencing. To test: ``` from unstructured.partition.html import partition_html filename = "example-docs/example-10k-1p.html" chunk_elements = partition_html( filename, chunking_strategy="by_title", combine_text_under_n_chars=0, new_after_n_chars=50, max_characters=100, ) for chunk in chunk_elements: print(len(chunk.text)) # previously we were only respecting the "soft max" (default of 500) for elements other than tables # now we should see that all the elements have text fields under 100 chars. ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-09 19:42:36 +00:00
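The "hard max" described in the entry above can be illustrated with a minimal sketch. The helper name `split_to_max_characters` is hypothetical and is not part of `unstructured`; it only shows the guarantee that no chunk's text exceeds `max_characters`, and the library's actual chunker may choose split points differently (for example, on word boundaries).

```python
from typing import List


def split_to_max_characters(text: str, max_characters: int) -> List[str]:
    """Cut text into consecutive pieces, each at most max_characters long.

    Hypothetical helper for illustration only; it splits at fixed offsets
    and ignores word boundaries.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    return [
        text[start:start + max_characters]
        for start in range(0, len(text), max_characters)
    ]


pieces = split_to_max_characters("a" * 250, max_characters=100)
print([len(piece) for piece in pieces])
```

Every piece respects the limit, and concatenating the pieces gives back the original text.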
Yuming Long	dcd6d0ff67	Refactor: support entire page OCR with `ocr_mode` and `ocr_languages` (#1579 ) ## Summary Second part of the OCR refactor to move it from the inference repo to the unstructured repo; the first part was done in https://github.com/Unstructured-IO/unstructured-inference/pull/231. This PR adds OCR processing logic for entire-page OCR and supports two OCR modes, "entire_page" or "individual_blocks". The updated workflow for the `hi_res` partition: * pass the document as data/filename to the inference repo to get `inferred_layout` (DocumentLayout) * pass the document as data/filename to the OCR module, which first opens the document (creating a temp file/dir as needed) and splits the document by pages (converting PDF pages to image pages for a PDF file) * if ocr mode is `"entire_page"` * OCR the entire image * merge the OCR layout with the inferred page layout * if ocr mode is `"individual_blocks"` * from the inferred page layout, find elements with no extracted text and crop the entire image by the bboxes of those elements * replace each empty text element with the text obtained from OCR of the cropped image * return all merged PageLayouts and form a DocumentLayout for later processing This PR also bumps `unstructured-inference==0.7.2` since the branch relies on the OCR refactor from unstructured-inference. ## Test ``` from unstructured.partition.auto import partition entire_page_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="entire_page", ocr_languages="eng+kor", strategy="hi_res") individual_blocks_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="individual_blocks", ocr_languages="eng+kor", strategy="hi_res") print([el.text for el in entire_page_ocr_mode_elements]) print([el.text for el in individual_blocks_ocr_mode_elements]) ``` latest output: ``` # entire_page ['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. 
Its better to have many email', 'accounts.', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASUREWH HARUTOM\|2] 팬 입니다. 팬 으 로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 불 공 평 함 을 LRU, 이 일 을 통해 저 희 의 의 혹 을 전 달 하여 귀 사 의 진지한 민 과 적극적인 답 변 을 받을 수 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were', 'successfully sent', '4. Use the hashtag of Haruto on your tweet to show that vou have sent vour email]', '메 고'] # individual_blocks ['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASURES HARUTOM\| 2] 팬 입니다. 팬 으로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 habe ERO, 이 머 일 을 적극 저 희 의 ASS 전 달 하여 귀 사 의 진지한 고 2 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were ciiccecefisliy cant', 'VULLESSIULY Set 4. Use the hashtag of Haruto on your tweet to show that you have sent your email'] ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-10-06 22:54:49 +00:00
Christine Straub	b30d6a601e	Fix/1209 tweak xycut ordering output (#1630 ) Closes GH Issue #1209. ### Summary - add swapped `xycut` sorting - update `xycut` sorting evaluation script PDFs: - [sbaa031.073.pdf](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7234218/pdf/sbaa031.073.pdf) - [multi-column-2p.pdf](https://github.com/Unstructured-IO/unstructured/files/12796147/multi-column-2p.pdf) - [11723901.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12360085/11723901.pdf) ### Testing ``` elements = partition_pdf("sbaa031.073.pdf", strategy="hi_res") print("\n\n".join([str(el) for el in elements])) ``` ### Evaluation ``` PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py sbaa031.073.pdf hi_res xycut_only ```	2023-10-05 07:41:38 +00:00
Klaijan	0a65fc2134	feat: xlsx subtable extraction (#1585 ) Executive Summary Unstructured is now able to capture subtables, along with other text element types, within an `.xlsx` sheet. Technical Details - The function now reads the Excel file without a header by default - Leverages a connected-components search to find subtables within the sheet. This search is based on depth-first search (DFS) - It also handles overlapping table or text cases - A row with only a single cell of data is not considered a table, and is therefore passed on to determine the element type as text - In connected elements, it is possible to have a table title, header, or footer. We run the count for the first non-single empty rows from the top and bottom to determine that text Result This table now reads as: <img width="747" alt="image" src="https://github.com/Unstructured-IO/unstructured/assets/2177850/6b8e6d01-4ca5-43f4-ae88-6104b0174ed2"> ``` [ { "type": "Title", "element_id": "3315afd97f7f2ebcd450e7c939878429", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "3315afd97f7f2ebcd450e7c939878429", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Quarterly revenue</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>1</td>\n </tr>\n <tr>\n <td>Group financial performance</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>2</td>\n </tr>\n <tr>\n <td>Segmental results</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>3</td>\n </tr>\n <tr>\n <td>Segmental analysis</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>4</td>\n </tr>\n <tr>\n <td>Cash flow</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>5</td>\n </tr>\n 
</tbody>\n</table>" }, "text": "Financial performance" }, { "type": "Table", "element_id": "17f5d512705be6f8812e5dbb801ba727", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "3315afd97f7f2ebcd450e7c939878429", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Quarterly revenue</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>1</td>\n </tr>\n <tr>\n <td>Group financial performance</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>2</td>\n </tr>\n <tr>\n <td>Segmental results</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>3</td>\n </tr>\n <tr>\n <td>Segmental analysis</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>4</td>\n </tr>\n <tr>\n <td>Cash flow</td>\n <td>FY 22</td>\n <td>FY 23</td>\n <td></td>\n <td>5</td>\n </tr>\n </tbody>\n</table>" }, "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nQuarterly revenue\nNine quarters to 30 June 2023\n\n\n1\n\n\nGroup financial performance\nFY 22\nFY 23\n\n2\n\n\nSegmental results\nFY 22\nFY 23\n\n3\n\n\nSegmental analysis\nFY 22\nFY 23\n\n4\n\n\nCash flow\nFY 22\nFY 23\n\n5\n\n\n" }, { "type": "Title", "element_id": "8a9db7161a02b427f8fda883656036e1", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "8a9db7161a02b427f8fda883656036e1", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n 
<td>Mobile customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>6</td>\n </tr>\n <tr>\n <td>Fixed broadband customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>7</td>\n </tr>\n <tr>\n <td>Marketable homes passed</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>8</td>\n </tr>\n <tr>\n <td>TV customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>9</td>\n </tr>\n <tr>\n <td>Converged customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>10</td>\n </tr>\n <tr>\n <td>Mobile churn</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>11</td>\n </tr>\n <tr>\n <td>Mobile data usage</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>12</td>\n </tr>\n <tr>\n <td>Mobile ARPU</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>13</td>\n </tr>\n </tbody>\n</table>" }, "text": "Operational metrics" }, { "type": "Table", "element_id": "d5d16f7bf9c7950cd45fae06e12e5847", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "8a9db7161a02b427f8fda883656036e1", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Mobile customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>6</td>\n </tr>\n <tr>\n <td>Fixed broadband customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>7</td>\n </tr>\n <tr>\n <td>Marketable homes passed</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>8</td>\n </tr>\n <tr>\n <td>TV customers</td>\n <td>Nine 
quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>9</td>\n </tr>\n <tr>\n <td>Converged customers</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>10</td>\n </tr>\n <tr>\n <td>Mobile churn</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>11</td>\n </tr>\n <tr>\n <td>Mobile data usage</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>12</td>\n </tr>\n <tr>\n <td>Mobile ARPU</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>13</td>\n </tr>\n </tbody>\n</table>" }, "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nMobile customers\nNine quarters to 30 June 2023\n\n\n6\n\n\nFixed broadband customers\nNine quarters to 30 June 2023\n\n\n7\n\n\nMarketable homes passed\nNine quarters to 30 June 2023\n\n\n8\n\n\nTV customers\nNine quarters to 30 June 2023\n\n\n9\n\n\nConverged customers\nNine quarters to 30 June 2023\n\n\n10\n\n\nMobile churn\nNine quarters to 30 June 2023\n\n\n11\n\n\nMobile data usage\nNine quarters to 30 June 2023\n\n\n12\n\n\nMobile ARPU\nNine quarters to 30 June 2023\n\n\n13\n\n\n" }, { "type": "Title", "element_id": "f97e9da0e3b879f0a9df979ae260a5f7", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "f97e9da0e3b879f0a9df979ae260a5f7", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Average foreign exchange rates</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>14</td>\n </tr>\n <tr>\n <td>Guidance rates</td>\n <td>FY 23/24</td>\n <td></td>\n <td></td>\n <td>14</td>\n </tr>\n </tbody>\n</table>" }, "text": "Other" }, { "type": "Table", "element_id": 
"080e1a745a2a3f2df22b6a08d33d59bb", "metadata": { "filename": "vodafone.xlsx", "file_directory": "example-docs", "last_modified": "2023-10-03T17:51:34", "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "parent_id": "f97e9da0e3b879f0a9df979ae260a5f7", "languages": [ "spa", "ita" ], "page_number": 1, "page_name": "Index", "text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Topic</td>\n <td>Period</td>\n <td></td>\n <td></td>\n <td>Page</td>\n </tr>\n <tr>\n <td>Average foreign exchange rates</td>\n <td>Nine quarters to 30 June 2023</td>\n <td></td>\n <td></td>\n <td>14</td>\n </tr>\n <tr>\n <td>Guidance rates</td>\n <td>FY 23/24</td>\n <td></td>\n <td></td>\n <td>14</td>\n </tr>\n </tbody>\n</table>" }, "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nAverage foreign exchange rates\nNine quarters to 30 June 2023\n\n\n14\n\n\nGuidance rates\nFY 23/24\n\n\n14\n\n\n" } ] ```	2023-10-04 13:30:23 -04:00
Yao You	19d8bff275	feat: change default hi_res model to yolox quantized (#1607 )	2023-10-04 03:28:47 +00:00
Amanda Cameron	1fb464235a	chore: Table chunking (#1540 ) This change adds to our `add_chunking_strategy` logic so that we are able to chunk Table elements' `text` and `text_as_html` params. In order to keep the functionality under the same `by_title` chunking strategy, we have renamed `combine_under_n_chars` to `max_characters`. It functions the same way for combining elements under Titles, as well as specifying a chunk size (in chars) for TableChunk elements. *Renaming the variable to `max_characters` will also reflect the 'hard max' we will implement for large elements in follow-up PRs. Additionally -> some lint changes snuck in when I ran `make tidy`, hence the minor changes in unrelated files :) TODO: ✅ add unit tests --> note: added to unit tests where I could! In some unit tests I just clarified that the chunking strategy was now 'by_title', because we don't have a file example that has Table elements to test the 'by_num_characters' chunking strategy ✅ update changelog To manually test: ``` In [1]: filename="example-docs/example-10k.html" In [2]: from unstructured.chunking.title import chunk_table_element In [3]: from unstructured.partition.auto import partition In [4]: elements = partition(filename) # element at -2 happens to be a Table, and we'll get chunks of char size 4 here In [5]: chunks = chunk_table_element(elements[-2], 4) # examine text and text_as_html params In [6]: for c in chunks: print(c.text) print(c.metadata.text_as_html) ``` --------- Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-10-03 09:40:34 -07:00
Yao You	ad59a879cc	chore: bump inference to 0.6.6 (#1563 ) - bump `unstructured-inference` to `0.6.6` - specify default model name for element detection to be `detectron2_onnx` to keep current behavior - NOTE: the updated inference package would use yolox as the element detection model by default; this will be evaluated and enabled in a separate PR --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2023-09-29 19:09:57 +00:00
shreyanid	32bfebccf7	feat: introduce language detection function for text partitioning function (#1453 ) ### Summary Uses `langdetect` to detect all languages present in the input document. ### Details - Converts all language codes (whether user inputted or detected using `langdetect`) to a standard ISO 639-3 code. - Adds `languages` field to the metadata - Will revisit how to nonstandardly represent simplified vs traditional Chinese scripts internally (separate PR). - Update ingest test results to add `languages` field to documents. Some other side effects are changes in order of some elements and changes in element categorization ### Test You can test the detect_languages function individually by importing the function and inputting a text sample and optionally a language: ``` text = "My lubimy mleko i chleb." doc_langs = detect_languages(text) print(doc_langs) ``` -> ['ces', 'pol', 'slk'] --------- Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: shreyanid <shreyanid@users.noreply.github.com> Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2023-09-26 18:09:27 +00:00
shreyanid	eb8ce89137	chore: function to map between standard and Tesseract language codes (#1421 ) ### Summary In order to convert between incompatible language codes from packages used for OCR, this change adds a function to map between any standard language codes and Tesseract OCR-specific codes. Users can input language information to `languages` in any Tesseract-supported langcode or any ISO 639 standard language code. ### Details - Introduces the [python-iso639](https://pypi.org/project/python-iso639/) package for matching standard language codes. Recompiles all dependencies. - If a language is not already supplied by the user as a Tesseract-specific langcode, supplies all possible script/orthography variants of the language to the Tesseract OCR agent. ### Test Added many unit tests for a variety of language combinations, special cases, and variants. For general testing, call partition functions with any lang codes in the languages parameter (Tesseract or standard). For example, ``` from unstructured.partition.auto import partition elements = partition(filename="example-docs/layout-parser-paper.pdf", strategy="hi_res", languages=["en", "chi"]) print("\n\n".join([str(el) for el in elements])) ``` should supply eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert to Tesseract	2023-09-18 08:42:02 -07:00
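The expansion described in the entry above (a standard code such as `chi` fanning out to every Tesseract script variant) can be sketched with a small lookup table. Everything here is an assumption for illustration: the function name `to_tesseract_string`, the two-language table, and the alias map are hypothetical, and the real change relies on the python-iso639 package rather than hard-coded dictionaries.

```python
from typing import Dict, List

# Illustrative subset only; the real mapping covers far more languages.
TESSERACT_VARIANTS: Dict[str, List[str]] = {
    "eng": ["eng"],
    "chi": ["chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert"],
}

# Assumed aliases from common ISO 639 codes to the keys used above.
ALIASES: Dict[str, str] = {
    "en": "eng",
    "eng": "eng",
    "zh": "chi",
    "zho": "chi",
    "chi": "chi",
}


def to_tesseract_string(languages: List[str]) -> str:
    """Expand standard language codes into a '+'-joined Tesseract language string."""
    codes: List[str] = []
    for lang in languages:
        key = ALIASES.get(lang.lower())
        if key is None:
            continue  # unknown codes are skipped in this sketch
        for code in TESSERACT_VARIANTS[key]:
            if code not in codes:  # keep first-seen order, drop duplicates
                codes.append(code)
    return "+".join(codes)


print(to_tesseract_string(["en", "chi"]))
```

For the input from the commit message, `["en", "chi"]`, this yields the same string the message quotes: `eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert`.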
Amanda Cameron	a9f18eddb8	chore: adding test case for odt tables (#1434 ) ODT table extraction is happening! Just added to an existing example-doc and an accompanying test case.	2023-09-16 22:29:44 -07:00
shreyanid	1b7c99d878	chore: refactor languages parameter for auto partition (#1400 ) ### Summary In order to support language functionality other than Tesseract OCR, we want to represent languages provided for either partitioning accuracy or OCR as a standard list of langcodes as strings. ### Details Follows the pattern established with PDFs in #1334. Adds languages (a list of strings) as a parameter to partition in auto.py. Marks ocr_languages for deprecation. ### Test Call partition with a variety of filetypes (especially pdfs/images), strategies, languages, or ocr_languages. - inclusion of ocr_languages as a parameter should display a deprecation warning and may proceed with partitioning if no other conflicts - the other valid call outputs should be no different from the current outputs	2023-09-13 13:07:28 -04:00
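The deprecation behavior described in the entry above can be sketched with the standard `warnings` module. This is a hedged illustration, not the code in `auto.py`: the helper name `resolve_languages`, the conflict rule (raising when both arguments are given), and the `["eng"]` default are assumptions made for the example.

```python
import warnings
from typing import List, Optional


def resolve_languages(
    languages: Optional[List[str]] = None,
    ocr_languages: Optional[str] = None,
) -> List[str]:
    """Return a list of language codes, accepting the deprecated string form.

    Hypothetical helper: `ocr_languages` is a '+'-joined string such as
    "eng+kor"; `languages` is the newer list-of-strings form.
    """
    if ocr_languages is not None:
        warnings.warn(
            "ocr_languages is deprecated; use languages instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if languages is not None:
            # Assumed conflict rule for this sketch.
            raise ValueError("pass either languages or ocr_languages, not both")
        languages = ocr_languages.split("+")
    return languages if languages is not None else ["eng"]  # assumed default


print(resolve_languages(languages=["deu"]))
```

Calling it with only `ocr_languages="eng+kor"` emits a `DeprecationWarning` and then proceeds with `["eng", "kor"]`, which mirrors the "warn, then continue if there are no conflicts" behavior the commit describes.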


119 Commits