**Summary**
In preparation for consolidating post-partitioning metadata decorators,
extract `partition.common` module into a sub-package (directory) and
extract `partition.common.metadata` module to house the
metadata-specific objects shared by partitioners.
**Additional Context**
- This new module will be the home of the new consolidated metadata
decorator.
- The consolidated decorator is a step toward removing post-processing
decorators from _delegating_ partitioners. A delegating partitioner is
one that converts its file to a different format and "delegates" actual
partitioning to the partitioner for that target format. 10 of the 20
partitioners are delegating partitioners.
- Removing decorators from delegating partitioners will allow us to
avoid "double-decorating", i.e. running those decorators twice, once on
the principal partitioner and again on the proxy partitioner.
- This will allow us to send `**kwargs` to either partitioner, removing
the knowledge of which arguments to send for each file-type from
auto-partition.
- And this will allow pluggable auto-partitioners which all have a
`partition_x(filename, *, file, **kwargs) -> list[Element]` interface.
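A minimal sketch of that uniform interface (an assumption for
illustration, not code from this PR):
```python
from typing import IO, Optional

from unstructured.documents.elements import Element


def partition_x(
    filename: Optional[str] = None,
    *,
    file: Optional[IO[bytes]] = None,
    **kwargs,
) -> list[Element]:
    """Hypothetical partitioner for file-type "x" conforming to the interface."""
    ...
```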
This PR enhances the `pdfminer` image cleanup process by repositioning
the duplicate-image removal step. It optimizes the removal of duplicated
`pdfminer` images by performing the cleanup before merging elements
rather than after. This change reduces execution time and improves the
overall processing speed of PDF documents.
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
**Summary**
Remove dead code in `unstructured.file_utils`.
**Additional Context**
These modules were added in 12/2022 and 1/2023 and are not referenced by
any code. Removing them reduces unnecessary complexity. They can of
course be recovered from Git history if we decide we want them again in
the future.
This PR implements splitting of `pdfminer` elements (`groups of text
chunks`) into smaller bounding boxes (`text lines`). This implementation
prevents loss of information from the object detection model and
facilitates more effective removal of duplicated `pdfminer` text. This
PR also addresses #3430.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
- Remove constraint pins for `Office365-REST-Python-Client`,
`weaviate-client`, and `platformdirs`. Removing the pin for `Office365`
brought to light some bugs in the Onedrive connector, so some changes
were also made to
`unstructured/ingest/v2/processes/connectors/onedrive.py`.
- Also, as part of updating dependencies `unstructured-client` was
updated to `0.25.8`, which introduced a new default for the `strategy`
param and required updating a test fixture.
- The `hubspot.sh` integration test was failing and is now ignored in CI
with this PR per discussion with @rbiseck3.
May be easiest to review commit-by-commit.
This PR vectorizes the computation of element overlap to speed up
deduplication process of extracted elements.
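As a rough illustration, a hedged NumPy sketch of the kind of
vectorized pairwise IOU computation this refers to (not the PR's exact
code):
```python
import numpy as np


def boxes_iou(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return a (len(a), len(b)) matrix of pairwise IOU values.

    `a` and `b` are arrays of (x1, y1, x2, y2) boxes, shapes (N, 4) and (M, 4).
    """
    # intersection coordinates, broadcast across all (a, b) pairs
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / np.maximum(union, 1e-10)
```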
## test
This PR adds unit tests for the new vectorized IOU and subregion
computation functions.
In addition, running partition on large files with many elements like
this slide:
[002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf)
shows a reduction of runtime from around 15min on the main branch to
less than 4min with this branch.
Profiling results show that the new implementation greatly reduces the
time cost of computation, and now most of the time is spent on getting
the coordinates from a list of bboxes.

This PR changes the way the analysis tools can be used:
- by default if `analysis` is set to `True` in `partition_pdf` and the
strategy is resolved to `hi_res`:
- for each file, 4 layout dumps are produced and saved as JSON files
(`object_detection`, `extracted`, `ocr`, `final`) - in a similar way to
the current `object_detection` dump
- the drawing functions/classes now accept these dumps accordingly,
instead of instances of the internal classes (like `TextRegion` and
`DocumentLayout`)
- it makes it possible to use the lightweight JSON files to render the
bboxes of a given file after the partition is done
- `_partition_pdf_or_image_local` has been refactored and most of the
analysis code is now encapsulated in the `save_analysis_artifiacts`
function; to do this, a helper function `render_bboxes_for_file` was
added
<img width="338" alt="Screenshot 2024-08-28 at 14 37 56"
src="https://github.com/user-attachments/assets/10b6fbbd-7824-448d-8c11-52fc1b1b0dd0">
### Summary
Updates the file detection logic for OLE files to check the storage
content of the file to more reliably differentiate between DOC, PPT,
XLS, and MSG files. This corrects a bug that caused file-type detection
to be incorrect in cases where the `filetype` library guessed an
incorrect MIME type, such as `'application/vnd.ms-excel'` for a `.msg`
file.
As part of this work, the `"msg"` extra was removed because the
`python-oxmsg` package is now a base dependency.
### Testing
Using a test `.msg` file that returns `'application/vnd.ms-excel'` from
`filetype.guess_mime`.
```python
from unstructured.file_utils.filetype import detect_filetype
filename = "test-file.msg"
detect_filetype(filename=filename) # result should be FileType.MSG
```
**Summary**
Do not assume MSG format when an OLE "container" file cannot be
differentiated into DOC, PPT, XLS, or MSG. Fall back to extension-based
identification in that case.
**Additional Context**
DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly,
a Microsoft-proprietary Zip format which "contains" a filesystem of
discrete files and directories.
An OLE "container" is easily identified by inspecting the first 8 bytes
of the file, so all we need to do is differentiate between the four
subtypes we can process. The `filetype` module does a good job of this,
but it is not perfect and does not identify MSG files.
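For reference, the 8-byte signature check looks roughly like this (a
sketch, not this PR's implementation):
```python
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # CFBF header signature


def is_ole_file(file_path: str) -> bool:
    with open(file_path, "rb") as f:
        return f.read(8) == OLE_MAGIC
```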
Previously we assumed MSG format when none of DOC, PPT, or XLS was
detected, but we discovered that `filetype` is not completely reliable
at detecting these types.
Change the behavior to remove the assumption of MSG format.
`_OleFileDifferentiator` returns `None` in this case and file-type
detection falls back to the filename extension.
Note that a file with no filename (and no metadata_filename) or an
incorrect extension will not be correctly identified in this case;
however, we're assuming for now that this will be rare in practice.
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html`
HTML introduced by `partition_docx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.
**Additional Context**
- nested tables appear as their extracted text in the parent cell (no
nested `<table>` elements in `.text_as_html`).
- DOCX `.text_as_html` is minified (no extra whitespace or `<thead>`,
`<tbody>`, `<tfoot>` elements).
**Summary**
Use more sophisticated algorithm for splitting oversized `Table`
elements into `TableChunk` elements during chunking to ensure element
text and HTML are "synchronized" and HTML is always parseable.
**Additional Context**
Table splitting now has the following characteristics:
- `TableChunk.metadata.text_as_html` is always a parseable HTML
`<table>` subtree.
- `TableChunk.text` is always the text in the HTML version of the table
fragment in `.metadata.text_as_html`. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger
than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger
than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and
unneeded elements or attributes. This maximizes the semantic "density"
of each chunk.
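For illustration, a small hedged example of splitting an oversized
`Table` element with the basic chunker (the table content and
`max_characters` value are made up):
```python
from unstructured.chunking.basic import chunk_elements
from unstructured.documents.elements import ElementMetadata, Table

table = Table(
    text="Header-1 Header-2 a b c d",
    metadata=ElementMetadata(
        text_as_html=(
            "<table><tr><td>Header-1</td><td>Header-2</td></tr>"
            "<tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>"
        )
    ),
)

for chunk in chunk_elements([table], max_characters=20):
    # each TableChunk's .text matches the text in its parseable HTML fragment
    print(repr(chunk.text), "|", chunk.metadata.text_as_html)
```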
This PR reverts `pytesseract` dependency to `unstructured.pytesseract`
fork due to the unavailability of some recent release versions of
`pytesseract` on PyPI.
This PR also addresses an issue encountered during the publication of
`unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI
does not allow direct dependencies from Version Control System URLs like
GitHub in the `install_requires` or `extras_require` sections of the
`setup.py` file.
### Summary
Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by
updating to `nltk==3.8.2` and closes #3511. This CVE had previously been
mitigated in #3361.
---------
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
As described in #3381, some clients, perhaps including Adobe PDF
Converter, map JPEG images to the invalid `image/jpg` MIME-type. Prior
to v1.0.0, `python-pptx` would not load these images, which caused image
extraction to fail.
Update the `python-pptx` dependency to `v1.0.1` or above to ensure this
upstream fix is always available.
Fixes: #3381
Update `partition_email` and `partition_msg` to capture the cc, bcc,
and message-id fields.
Docs PR: https://github.com/Unstructured-IO/docs/pull/135/files
Testing
```python
from test_unstructured.unit_utils import example_doc_path
from unstructured.partition.email import partition_email

elements = partition_email(
    filename=example_doc_path("eml/fake-email-header.eml"), include_headers=True
)
print(elements)
elements[0].metadata.to_dict()
```
Note to reviewers:
Tests in `test_unstructured/partition/test_email.py` were refactored and
rearranged to group similar tests together, so it will be easiest to
review those changes commit by commit.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
# Description:
Passing the `max_pages` argument allows rejecting PDF files that exceed
this page-count limit when the `hi_res` strategy is chosen. By default,
PDF files with an unlimited number of pages are allowed.
# Testing:
```python
from unstructured.partition.auto import partition
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res') # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=4) # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=2) # should raise PdfMaxPagesExceededError
```
**Summary**
A DOC, PPT, or XLS file sent to partition() as a file-like object is
misidentified as an MSG file and raises an exception in python-oxmsg
(which is used to process MSG files).
**Fix**
DOC, PPT, XLS, and MSG are all Microsoft OLE-based files, a.k.a.
Compound File Binary Format (CFBF) files. These can be reliably
distinguished by inspecting magic bytes in certain locations. `libmagic`
is unreliable at this (or doesn't try), reporting the generic
`"application/x-ole-storage"`, which corresponds to the "container" CFBF
format (vaguely like a Microsoft Zip format) in which all these document
types are stored.
Unconditionally use `filetype.guess_mime()` provided by the `filetype`
package that is part of the base unstructured install. Unlike
`libmagic`, this package reliably detects the distinguished MIME-type
(e.g. `"application/msword"`) for OLE file subtypes.
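For example (the file path here is hypothetical):
```python
import filetype

# unlike libmagic, this reports the distinguished OLE-subtype MIME-type
mime = filetype.guess_mime("example-docs/fake.doc")
print(mime)  # "application/msword"
```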
Fixes #3364
**Summary**
The `content_type` argument received by `partition()` from the API is
sometimes unreliable for MS-Office 2007+ MIME-types. What we've observed
is that it gets the MS-Office bit right but falls down on distinguishing
PPTX from DOCX or XLSX.
Confirmation of these types is simple, fast, and reliable. Confirm all
MS-Office `content_type` argument values asserted by callers of
`detect_filetype()` and correct swapped values.
**Summary**
In preparation for fixing a cluster of bugs with automatic file-type
detection and paving the way for some reliability improvements, refactor
`unstructured.file_utils.filetype` module and improve thoroughness of
tests.
**Additional Context**
Factor type-recognition process into three distinct strategies that are
attempted in sequence. Attempted in order of preference,
type-recognition falls to the next strategy when the one before it is
not applicable or cannot determine the file-type. This provides a clear
basis for organizing the code and tests at the top level.
Consolidate the existing tests around these strategies, adding
additional cases to achieve better coverage.
Several bugs were uncovered in the process. Small ones were just fixed,
bigger ones will be remedied in following PRs.
**Summary**
Replace conditional explicit import of partitioner modules in
`.partition.auto` with the new `_PartitionerLoader` class. This avoids
unbound variable warnings and is much less noisy.
`_PartitionerLoader` makes use of the new `FileType` property
`.importable_package_dependencies` to determine whether all required
packages are importable before dispatching the file to its partitioner.
It uses `FileType.extra_name` to form a helpful error message when a
dependency is not installed, so the caller knows which `pip install`
extra to specify to remedy the error.
`_PartitionerLoader` uses the `FileType` properties
`.partitioner_module_qname` and `.partitioner_function_name` to load
the partitioner once its dependencies are verified. Loaded partitioners
are cached with module-lifetime scope for efficiency.
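A hedged sketch of that loading strategy (property names follow the
description above; the implementation details are assumptions):
```python
import importlib
import importlib.util
from typing import Any, Callable, Dict


class _PartitionerLoader:
    """Dispatch file-types to partitioners, verifying dependencies first."""

    # cache persists for the lifetime of the module
    _partitioners: Dict[Any, Callable[..., list]] = {}

    def get(self, file_type) -> Callable[..., list]:
        for package in file_type.importable_package_dependencies:
            if importlib.util.find_spec(package) is None:
                raise ImportError(
                    f"partitioning a {file_type.name} file requires extra"
                    f' dependencies; run: pip install "unstructured[{file_type.extra_name}]"'
                )
        if file_type not in self._partitioners:
            module = importlib.import_module(file_type.partitioner_module_qname)
            self._partitioners[file_type] = getattr(
                module, file_type.partitioner_function_name
            )
        return self._partitioners[file_type]
```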
### Summary
Currently, the email partitioner removes only `=\n` characters during
the clearing process. However, email content sometimes contains `=\r\n`
characters, especially when read from file-like objects such as
`SpooledTemporaryFile` (the file type used in our API). This PR updates
the email partitioner to remove both `=\n` and `=\r\n` characters during
the clearing process.
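A minimal sketch of the clearing step in its assumed form:
```python
import re

raw_email_text = "Make sure to =\r\nRSVP!"  # illustrative CRLF soft line break
cleaned = re.sub(r"=\r?\n", "", raw_email_text)
assert cleaned == "Make sure to RSVP!"
```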
### Testing
```python
import tempfile

from unstructured.partition.email import partition_email

filename = "example-docs/eml/family-day.eml"
elements = partition_email(filename=filename)
print(f"From filename: {elements[3].text}")

with open(filename, "rb") as test_file:
    spooled_temp_file = tempfile.SpooledTemporaryFile()
    spooled_temp_file.write(test_file.read())
    spooled_temp_file.seek(0)
    elements = partition_email(file=spooled_temp_file)
print(f"From spooled_temp_file: {elements[3].text}")
```
**Results:**
- on `main`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to = RSVP!
```
- on `PR`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to RSVP!
```
This PR aims to improve the organization and readability of our example
documents used in unit tests, specifically focusing on PDF and image
files.
### Summary
- Created two new subdirectories in the `example-docs` folder:
- `pdf/`: for all PDF example files
- `img/`: for all image example files
- Moved relevant PDF files from `example-docs/` to `example-docs/pdf/`
- Moved relevant image files from `example-docs/` to `example-docs/img/`
- Updated file paths in affected unit & ingest tests to reflect the new
directory structure
### Testing
All unit & ingest tests should be updated and verified to work with the
new file structure.
## Notes
Other file types (e.g., office documents, HTML files) remain in the root
of `example-docs/` for now.
## Next Steps
Consider similar reorganization for other file types if this structure
proves to be beneficial.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
**Summary**
Elaborate the `FileType` enum to be a complete descriptor of file-types.
Add methods to allow `STR_TO_FILETYPE`, `EXT_TO_FILETYPE` and
`FILETYPE_TO_MIMETYPE` mappings to be replaced, removing those redundant
and noisy declarations.
In the process, fix some lingering file-type identification and
`.metadata.filetype` errors that had been skipped in the tests.
**Additional Context**
Gathering the various attributes of a file-type into the `FileType` enum
eliminates the duplication inherent in the separate `STR_TO_FILETYPE`
etc. mappings and makes access to those values convenient for callers.
These attributes include what MIME-type a file-type should record in
metadata and what MIME-types and extensions map to that file-type. These
values and others are made available as methods and properties directly
on the `FileType` class and members. Because all attributes are defined
in the `FileType` enum there is no risk of inconsistency across multiple
locations and any changes happen in one and only one place. Further
attributes and methods will be added in later commits to support other
file-type related operations like mapping to a partitioner and verifying
its dependencies are installed.
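A hedged illustration of consolidated attribute access on the
`FileType` enum (the property name and import path are assumptions):
```python
from unstructured.file_utils.filetype import FileType

# the MIME-type this file-type records in element metadata, formerly
# looked up via the separate FILETYPE_TO_MIMETYPE mapping
print(FileType.DOCX.mime_type)
```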
**Summary**
In preparation for further work on auto file-type detection, improve
`filetype.py` and related modules:
- improve docstrings
- improve type annotations
- extract domain model to `.model` module
Closes #3159.
This PR extends language specification capability to `PaddleOCR` in
addition to `TesseractOCR`. Users can now specify OCR languages for both
OCR engines when using `partition_pdf()`.
### Testing
```python
import os

from unstructured.partition.pdf import partition_pdf

os.environ["OCR_AGENT"] = (
    "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
)
elements = partition_pdf(
    filename=<file_path>,
    strategy=strategy,
    languages=["chi_sim"],  # Chinese (simplified)
    infer_table_structure=True,
)
```
**Summary**
Improve file-detection tests in preparation for additional work and bug
fixes.
**Additional Context**
- Add type annotations.
- Use mocks instead of `monkeypatch` in most cases and verify calls to
mock. This revealed a dozen broken tests, broken in that the mocks
weren't being called so a different code path than intended was being
exercised.
- Use `example_doc_path()` instead of hard-coded paths.
- Add actual test files for cases where they were being constructed in
temporary directories.
- Make test names consistent and more descriptive of behavior under
test.
**Summary**
Replace legacy HTML parser with recursive version that captures all
content and provides flexibility to add new metadata. It's also
substantially faster although that's just a happy side-effect.
**Additional Context**
The prior HTML parsing algorithm that makes up the core of HTML
partitioning was buggy and very difficult to reason about because it did
not conform to the inherently recursive structure of HTML. The new
version retains `lxml` as the performant and reliable base library but
uses `lxml`'s custom element classes to efficiently classify HTML
elements by their behaviors (block-item and inline (phrasing) primarily)
and give those elements the desired partitioning behaviors.
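A hedged sketch of `lxml`'s custom element-class mechanism (the class
names here are illustrative, not the PR's):
```python
from lxml import etree


class Phrasing(etree.ElementBase):
    """Behavior for inline (phrasing) elements like <b> and <i>."""


class BlockItem(etree.ElementBase):
    """Behavior for block-level elements like <p> and <div>."""


# map tag names (HTML elements have no namespace) to element classes
lookup = etree.ElementNamespaceClassLookup()
namespace = lookup.get_namespace(None)
namespace["b"] = namespace["i"] = Phrasing
namespace["p"] = namespace["div"] = BlockItem

parser = etree.HTMLParser()
parser.set_element_class_lookup(lookup)

root = etree.fromstring("<p>foo <b>bar</b></p>", parser)
print(type(root.find(".//p")).__name__)  # BlockItem
```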
This solves a host of existing problems with content being skipped and
elements (paragraphs) being divided improperly, but also provides a
clear domain model for reasoning about its behavior and reliably
adjusting it to suit our existing and future purposes.
The parser's operation is recursive, closely modeling the recursive
structure of HTML itself. Its behavior is based on the HTML Standard
and reliably produces proper and explainable results even for novel
cases.
Fixes #2325, Fixes #2562, Fixes #2675, Fixes #3168, Fixes #3227,
Fixes #3228, Fixes #3230, Fixes #3237, Fixes #3245, Fixes #3247,
Fixes #3255, Fixes #3309
### BEHAVIOR DIFFERENCES
#### `emphasized_text_tags` encoding is changed:
- `<strong>` is encoded as `"b"` rather than `"strong"`.
- `<em>` is encoded as `"i"` rather than `"em"`.
- `<span>` is no longer recorded in `emphasized_text_tags` (because
without the CSS we can't tell whether it's used for emphasis or if so
what kind).
- nested emphasis (e.g. bold+italic) is encoded as multiple characters
("bi").
- `emphasized_text_contents` is broken on emphasis-change boundaries,
like:
```html
<p>foo <b>bar <i>baz</i> bada</b> bing</p>
```
produces:
```json
{
"emphasized_text_contents": ["bar", "baz", "bada"],
"emphasized_text_tags": ["b", "bi", "b"]
}
```
whereas previously it would have produced:
```json
{
"emphasized_text_contents": ["bar baz bada", "baz"],
"emphasized_text_tags": ["b", "i"]
}
```
#### `<pre>` text is preserved as it appears in the HTML
Except that a leading newline is removed if present (it has to be at
position 0 of the text). Likewise, a trailing newline is stripped, but
only if it appears in the very last position ([-1]) of the `<pre>` text.
The old parser stripped all leading and trailing whitespace.
Result is that:
```html
<pre>
foo
bar
baz
</pre>
```
parses to `"foo\nbar\nbaz"` which is the same result produced for:
```html
<pre>foo
bar
baz</pre>
```
This equivalence is the same behavior exhibited by a browser, which is
why we did the extra work to make it this way.
#### Whitespace normalization
Leading and trailing whitespace are removed from element text, just as
it is removed in the browser. Runs of whitespace within the element text
are reduced to a single space character (like in the browser). Note this
means that `\t`, `\n`, and ` ` are replaced with a regular space
character. All text derived from elements is whitespace normalized
except the text within a `<pre>` tag. Any leading or trailing newline is
trimmed from `<pre>` element text; all other whitespace is preserved
just as it appeared in the HTML source.
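A minimal sketch (assumed form, not the parser's actual code) of this
normalization:
```python
import re


def normalize(text: str) -> str:
    # collapse runs of whitespace (including \t and \n) to one space, trim ends
    return re.sub(r"\s+", " ", text).strip()


assert normalize("  foo\t\n bar  ") == "foo bar"
```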
#### `link_start_indexes` metadata is no longer captured. Rationale:
- It was frequently wrong, often `-1`.
- It was deprecated but then added back in a community PR.
- Maintaining it across any possible downstream transformations (e.g.
chunking) would be expensive and almost certainly lead to wrong values
as distant code evolves.
- It is complex to compute and recompute when whitespace is normalized,
adding substantial complexity to the code and reducing readability and
maintainability.
#### `<br/>` element is replaced with a single newline (`"\n"`)
but that is usually replaced with a space in `Element.text` when it is
normalized. The newline is preserved within a `<pre>` element.
- Related: _No paragraph-break on `<br/><br/>`_
#### Empty `h1..h6` elements are dropped.
HTML heading elements (`<h1..h6>`) are "skipped" (do not generate a
`Title` element) when they contain no text or contain only whitespace.
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
**Note**
This refines the new HTML parser but _does not install it_. This is why
no changes to ingest test expectations or other unit-tests are required
here. Installing the new parser will happen in the next PR #3218.
**Summary**
The initial version of the parser (purposely) raised on a block element
nested inside a phrasing element. While such nesting is not valid
according to the HTML Standard, it is accepted by the browser and does
happen in the wild.
The refinements here handle this situation similarly to how the browser
does, breaking phrasing at the block element boundaries and starting it
up again after the block element.
Unfortunately this adds complexity to the parser, but it makes the
parser robust against pretty much any HTML we're likely to encounter and
partitions it consistent with how it would be rendered in the browser.
### Summary
Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705), which
highlights the risk of remote code execution when running
`nltk.download`. Removes `nltk.download` in favor of a `.tgz` file with
the appropriate NLTK data files and checks the SHA256 hash to validate
the download. An error is now raised if `nltk.download` is invoked.
The logic for determining the NLTK download directory is borrowed from
`nltk`, so users can still set `NLTK_DATA` as they did previously.
### Testing
1. Create a directory called `~/tmp/nltk_test`. Set
`NLTK_DATA=${HOME}/tmp/nltk_test`.
2. From a python interactive session, run:
```python
from unstructured.nlp.tokenize import download_nltk_packages
download_nltk_packages()
```
3. Run `ls ~/tmp/nltk_test/nltk_data`. You should see the downloaded
data.
---------
Co-authored-by: Steve Canny <stcanny@gmail.com>
**Summary**
In preparation for further work on auto-partitioning (`partition()`),
improve typing and organize `test_auto.py` by introducing categories.
The table metric that considers spans is not used and it distorts the
output, so I have removed that code. I have left `table_as_cells` in
the source code, though - it may still be useful to users.
This pull request adds table detection metrics.
One case I considered:
Case: two tables are predicted and matched with one table in the ground
truth.
Question: is this matching correct in both cases, or just for one
table?
There are two subcases:
- the table was predicted by the OD model as two sub-tables (split in
half into two non-overlapping sub-tables) -> in my opinion, both are
correct
- it is a false positive from the table-matching script in
get_table_level_alignment -> 1 good, 1 wrong
As we don't have bounding boxes, I followed the notebook's calculation
script and assumed the pessimistic second-subcase interpretation.
Change the unstructured-client pin to set a minimum version instead of
a maximum version, and run `make pip-compile`.
Integration tests that were dependent on the old version of the client
are removed. These tests should be replicated in/moved to the SDK
repo(s).
This pull request fixes the table-counting metric for three cases:
- False negatives: when a table exists in the ground truth but none of
the predicted tables matches it, the table should count as 0 and the
file should not be skipped entirely (previously it was np.NaN).
- False positives: when a predicted table doesn't match any
ground-truth table, it should be counted as 0; currently it is skipped
in processing (matched_indices == -1).
- The file should be skipped entirely only if there are no tables in
either the ground truth or the prediction.
In short, the previous metric calculation didn't account for OD
mistakes.
**Summary**
The `python-docx` error `docx.opc.exceptions.PackageNotFoundError`
arises both when no file exists at the given path and when the file
exists but is not a ZIP archive (and so is not a DOCX file).
This ambiguity is unwelcome when diagnosing the error, as the two
possible conditions generally indicate different courses of action to
resolve it.
Add detailed validation to `DocxPartitionerOptions` to distinguish these
two and provide more precise exception messages.
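A hedged sketch (an assumed form, not the PR's actual code) of the
distinguishing validation:
```python
import os
import zipfile


def _validate_docx_path(file_path: str) -> None:
    # distinguish "no such file" from "file exists but is not a ZIP archive"
    if not os.path.isfile(file_path):
        raise FileNotFoundError(f"no such file or directory: '{file_path}'")
    if not zipfile.is_zipfile(file_path):
        raise ValueError(f"not a ZIP archive (so not a DOCX file): '{file_path}'")
```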
**Additional Context**
- `python-pptx` shares the same OPC-Package (file) loading code used by
`python-docx`, so the same ambiguity will be present in `python-pptx`.
- It would be preferable for this distinguished exception behavior to be
upstream in `python-docx` and `python-pptx`. If we're willing to take
the version bump it might be worth considering doing that instead.
This PR adds new capabilities for drawing bboxes for each layout
(extracted, inferred, OCR, and final), plus a dump of the OD model
output as a JSON file, for better analysis.
---------
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: Michal Martyniak <michal.martyniak@deepsense.ai>