unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-06-27 02:30:08 +00:00

Author	SHA1	Message	Date
Yuming Long	c04235c168	fix [NEX-49] : Fix TypeError for empty HTML content (#4032 ) ### Summary Addressed a TypeError that occurred when partitioning empty or whitespace-only HTML content. ## Test * unit test `test_unstructured/partition/html/test_partition.py::test_partition_html_with_empty_content_raises_error` can reproduce the TypeError before the fix * the test now passes	2025-06-25 18:13:20 +00:00
jiajun-unstructured	b0dbd71aff	Parallelize tests (#4024 )	2025-06-16 23:29:35 +00:00
Yuming Long	55ad5fd637	fix chunking text None type has no attribute strip (#4018 ) ### Summary To fix error `Error in chunk: 512: {"detail":"'NoneType' object has no attribute 'strip'"}` I found the logs under the same org (we can assume this is the same job). Screenshot: ![Screenshot 2025-06-11 at 10 15 57 AM](https://github.com/user-attachments/assets/c50ada55-eef1-43f7-9e27-9b9ae339a6fb) Stack trace from the `utic-api` ES log doc: ![Screenshot 2025-06-11 at 2 01 01 PM](https://github.com/user-attachments/assets/7e84fa24-4eb6-45e8-b195-a11d3d124bfa) ### Notes Longer term we should make the partitioners (vlm + utic-api) not return text as Null --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2025-06-12 18:28:46 +00:00
Pluto	ec209c6b5f	Remove IDs from HTML code (#4012 ) In this pull request parent-child relationship for elements generated with v2 parser is based on actual element IDs instead of IDs baked somewhere in the HTML script. With some extra bug fixing it allowed for significantly simplifying json -> HTML script	2025-06-11 11:55:02 +00:00
luke-kucing	a7e90f7990	resolve CVEs and HF issue (#4009 ) update reqs to resolve CVEs and add the HF ENV to stop it from reaching out updated the Dockerfile with ENV HF_HUB_OFFLINE=1 to stop it from pinging HF. This was an issue for a gov customer. and updated requirements to resolve some open CVEs --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: luke-kucing <luke-kucing@users.noreply.github.com>	2025-06-04 18:52:58 +00:00
jordan-homan	570ee078a4	fix: throw validation error when json is passed with invalid unstructured json (#4002 ) ### Notes Adds validation if `json` / `ndjson` are not valid unstructured schema. ### Testing Manually tested serverless API with example json: ``` test_length = [] = 200 test_invalid = [{"invalid": "schema"}] = 422 test_invalid_ndjson ={"hi": "there"} = 422 test_chunk = [{"type":"Header","element_id":"a23fdadef9277f217563e217ebd074d5" ... = 200 ```	2025-05-19 18:24:44 +00:00
Austin Walker	e3417d7e98	fix: Fix for Pillow error when extracting PNG images (#3998 ) When I tried to partition a PNG file and extract images, I got an error from Pillow: ``` WARNING unstructured:pdf_image_utils.py:230 Image Extraction Error: Skipping the failed image Traceback (most recent call last): File "/Users/austin/.pyenv/versions/unstructured/lib/python3.10/site-packages/PIL/JpegImagePlugin.py", line 666, in _save rawmode = RAWMODE[im.mode] KeyError: 'RGBA' ``` The issue is that a PNG has an additional layer that cannot be saved off in jpeg format. We can fix this with a quick conversion. I added a png test case that is now passing with this fix.	2025-05-08 21:57:05 +00:00
Yao You	b814ece39f	fix: properly handle the case when an element's text is None (#3995 ) Some elements, like `Image`, can have `None` as the value of their `text` attribute. In that case the current chunking logic fails because it expects the field to always have a length and to be splittable. The fix is to update the logic to use `element.text or ""` when checking length, and to add flow control that exits early to avoid calling split on `None`.	2025-05-05 18:08:11 +00:00
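As a rough illustration of the guard described in commit b814ece39f above, the sketch below treats a `None` text as an empty string before measuring or splitting it. `StubElement`, `text_length`, and `split_words` are hypothetical stand-ins for illustration only, not the library's chunking code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StubElement:
    """Hypothetical stand-in for an element whose text may be None."""
    text: Optional[str]


def text_length(element: StubElement) -> int:
    # `element.text or ""` turns None into an empty string, so len() is safe.
    return len(element.text or "")


def split_words(element: StubElement) -> List[str]:
    text = element.text or ""
    if not text:
        # Early exit: nothing to split, and we never call .split on None.
        return []
    return text.split()


image_like = StubElement(text=None)
paragraph = StubElement(text="hello chunking world")
lengths = (text_length(image_like), text_length(paragraph))
words = (split_words(image_like), split_words(paragraph))
```

The same `or ""` idiom also covers an empty string, which is usually the desired behavior when only the length matters.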
Philippe PRADOS	d570f4624b	Fix sort_page_element. Ensures that sorting is stable and not random. (#3978 ) `sort_page_element()` uses the element id to sort the elements, so two executions of the same code on the same file produce different results: the order of the elements is random. This makes it impossible to write stable unit tests, for example, or to obtain reproducible results.	2025-04-07 15:57:20 +00:00
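The commit message above does not say how the ordering was made deterministic. One generic way to get reproducible results, sketched here under that caveat, is to sort on position only and break ties with the original index instead of a randomly generated id. The dictionary layout and the function name `sort_elements_reproducibly` are assumptions made for this example.

```python
import uuid


def sort_elements_reproducibly(elements):
    """Sort by (top, left); ties fall back to input order, never to the id."""
    indexed = list(enumerate(elements))
    indexed.sort(key=lambda pair: (pair[1]["top"], pair[1]["left"], pair[0]))
    return [element for _, element in indexed]


# Ids are random on every run, as in the problem the commit describes.
elements = [
    {"id": uuid.uuid4().hex, "top": 10, "left": 5, "text": "first"},
    {"id": uuid.uuid4().hex, "top": 10, "left": 5, "text": "second"},
    {"id": uuid.uuid4().hex, "top": 2, "left": 0, "text": "header"},
]
ordered = [element["text"] for element in sort_elements_reproducibly(elements)]
```

Because the random id never enters the sort key, two runs over the same input yield the same order.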
cragwolfe	dfa17bd3a0	fix: hi_res PDF parsing: only uncategorized text for extracted elements (#3975 )	2025-04-04 14:38:23 -07:00
qued	3f07840b80	chore: deprecate stage_for_label_studio (#3968 ) This PR is to address [a CVE](https://github.com/advisories/GHSA-rgv9-w7jp-m23g) that appeared in a recent scan. The CVE has to do with the package `label_studio_sdk`. This relates to the tool Label Studio, a data labeling platform. We built a staging function that takes a list of elements and converts it to a format suitable for passing to the LabelStudio platform. We don't use the package with the vulnerability in the actual function, we only use it to test the output of the function against the Label Studio API schema. Even the test where we use it is sort of questionable in value, since it's really testing the schema against an old version of the LabelStudio API (we are testing against a recording of the Label Studio API's responses stored using `vcrpy`). Label Studio has fixed the vulnerability as of version 1.0.10 of their SDK, but we're stuck on 1.0.5 because 1.0.6 and above require `numpy<2.0.0`. This leaves us with several choices of resolution, some of which are: 1. Downgrade `numpy` to upgrade `label_studio_sdk` to >=1.0.10 to resolve the CVE 2. Drop `label_studio_sdk` by either removing or rewriting the test. 3. Drop test and dev dependencies from the `unstructured` image. We've decided to do 2. _and_ 3. This PR handles 2., with 3. to be a follow-on PR. Here we add a deprecation notice to `stage_for_label_studio` and remove the offending test. Normally good practice would be to add a warning of future deprecation to the function for a reasonable amount of time, but in order to address the CVE immediately, we're deprecating it right away. ### Testing Install the dependencies (`make install`) into a fresh environment, and `pip list \| grep label` should have no results. The scan artifact in CI should contain no "high" or "critical" CVEs.	2025-03-26 23:37:03 +00:00
Sri Sudarsan	349728162e	Matches prefix to verify presence of DOCX,PPTX,XLSX files instead of standard file names (#3959 ) Instead of looking for the presence of `word/document.xml`, `ppt/presentation.xml` and `xl/workbook.xml` to identify DOCX, PPTX and XLSX files, we look for the prefixes `word/document.xml`, `ppt/presentation.xml` and `xl/workbook*.xml`, as certain files generated by Office 365 contain files with different names. Fixes https://github.com/Unstructured-IO/unstructured/issues/3937 --------- Co-authored-by: Yao You <theyaoyou@gmail.com>	2025-03-21 16:27:13 +00:00
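A minimal sketch of prefix-based detection, in the spirit of commit 349728162e above. The exact prefixes and the helper names (`PREFIXES`, `detect_office_type`, `make_zip`) are assumptions for illustration and are not taken from the library's file detector.

```python
import io
import zipfile

# Assumed prefixes; any member such as "word/document2.xml" counts as a hit.
PREFIXES = {
    "docx": "word/document",
    "pptx": "ppt/presentation",
    "xlsx": "xl/workbook",
}


def detect_office_type(data: bytes):
    """Return 'docx', 'pptx', 'xlsx', or None based on archive member names."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
    for file_type, prefix in PREFIXES.items():
        if any(n.startswith(prefix) and n.endswith(".xml") for n in names):
            return file_type
    return None


def make_zip(member_name: str) -> bytes:
    """Build a tiny in-memory zip with one member, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(member_name, "<xml/>")
    return buffer.getvalue()


detected = [
    detect_office_type(make_zip("word/document2.xml")),
    detect_office_type(make_zip("xl/workbook.xml")),
    detect_office_type(make_zip("notes.txt")),
]
```

An exact-name check would miss the first archive, which is the situation the commit describes.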
Antonio Jose Jimeno Yepes	0fa5174bd7	Image within div or span with no text is annotated as Image (#3962 ) Ticket: https://unstructured-ai.atlassian.net/browse/ML-942 The following uncompressed HTML document can be used to test the transformation using the `partition_html` function from the VLM partitioner. [recalibrating-risk-report.pdf.json.html.zip](https://github.com/user-attachments/files/19330528/recalibrating-risk-report.pdf.json.html.zip)	2025-03-20 04:09:02 +00:00
Yao You	7de630e45e	Feat/bump numpy to 2 (#3961 ) This PR updates a few dependencies so that they are compatible with `numpy>=2`.	2025-03-18 21:33:48 +00:00
Yao You	4e424efd22	feat: use lxml instead of bs4 to parse hOCR data (#3960 ) - `lxml` is a much faster library than `bs4` when the input data is regular - since the hOCR data is guaranteed to be regular (programmatically generated) we don't need `bs4` here to parse the data - `lxml` improves parsing speed by about 10x Example runtime profiling locally using the same `hocr` data from 1 page pdf, where `agent.hocr_to_dataframe_bs4` is the current method on main and `agent.hocr_to_dataframe` is the PR's method. ![Screenshot 2025-03-17 at 12 14 59 PM](https://github.com/user-attachments/assets/7c483857-8711-4d72-8954-e83510fef783)	2025-03-18 00:36:19 +00:00
ryannikolaidis	66bf4b0198	feat: support extracting image url in html (#3955 ) also removes mimetype when base64 is not included in image metadata --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>	2025-03-13 22:41:10 +00:00
Yao You	2dceac34b5	Feat/remove reference of PageLayout.elements (#3943 ) This PR removes usage of `PageLayout.elements` from partition function, except for when `analysis=True`. This PR updates the partition logic so that `PageLayout.elements_array` is used everywhere to save memory and cpu cost. Since the analysis function is intended for investigation and not for general document processing purposes, this part of the code is left for a future refactor. `PageLayout.elements` uses a list to store layout elements' data while `elements_array` uses a `numpy` array to store the data, which has much lower memory requirements. Testing with `memory_profiler` shows the difference is usually around 10x.	2025-03-12 15:21:21 +00:00
Yao You	8759b0aac9	feat: allow passing down of ocr agent and table agent (#3954 ) This PR allows passing down both `ocr_agent` and `table_ocr_agent` as parameters to specify the `OCRAgent` class for the page and tables, if any, respectively. Both default to `tesseract`, consistent with the present default behavior. We used to rely on env variables to specify the agents, but the OS environment can be changed during runtime outside of the caller's control. Passing the agents down as parameters ensures that the specification is independent of env changes. ## testing Use `example-docs/img/layout-parser-paper-with-table.jpg` and run partition with two different settings. Note that this test requires the `paddleocr` extra. ```python from unstructured.partition.auto import partition from unstructured.partition.utils.constants import OCR_AGENT_TESSERACT, OCR_AGENT_PADDLE elements = partition(f, strategy="hi_res", skip_infer_table_types=[], ocr_agent=OCR_AGENT_TESSERACT, table_ocr_agent=OCR_AGENT_PADDLE) elements_alt = partition(f, strategy="hi_res", skip_infer_table_types=[], ocr_agent=OCR_AGENT_PADDLE, table_ocr_agent=OCR_AGENT_TESSERACT) ``` We should see both runs finish, with slight differences in the table element's text attribute.	2025-03-11 16:36:31 +00:00
ryannikolaidis	0001a33dba	fix: pass extract image args to all partitioners (#3950 ) This is needed in order for the user to specify whether to extract the base64 for images, which are now parsed by the html partitioner. ## Testing Adds test that validates this by calling the auto-partitioner with appropriate arguments partitioning an html file with base64 embedded image.	2025-03-10 04:15:08 +00:00
ryannikolaidis	c0457c1cc3	feat: include images when partitioning html (#3945 ) Currently we [filter img tags](`2addb19473/unstructured/partition/html/partition.py (L226-L229)`) before tags are converted to Elements by the html partitioner. More importantly we also don’t currently have a defined “block” / mapping to support these. This adds these mappings and logic to process. It also respects `extract_image_block_types` and `extract_image_block_to_payload` (as we do with pdfs) to determine whether base64 is included in the metadata. The partitioned Image Elements sets the text to the img tag’s alt text if available. The partitioned Image Elements include the [url in the metadata](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py#L209) (rather than image_base64) if the img tag src is a url. ## Testing unit tests have been added for explicit coverage. existing integration tests and other unit test fixtures have been updated to account for `Image` elements now present --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>	2025-03-08 01:25:21 +00:00
Pluto	74b0647aa2	Fix json bytes content type detection (#3941 ) Fixes the order of content type detection strategies for byte-encoded JSON. Example: ``` json_bytes = json.dumps([{"example": "data"}]).encode("utf-8") file_buffer = io.BytesIO(json_bytes) detect_filetype(file=file_buffer, metadata_file_path="filename.pdf") ``` Before: PDF. Now: JSON.	2025-03-07 10:33:33 +00:00
Nathan Van Gheem	19373de5ff	Enable dynamic file type registration (#3946 ) The purpose of this PR is to enable registering new file types dynamically. The PR enables this through 2 primary functions: 1. `unstructured.file_utils.model.create_file_type` This registers the new `FileType` enum which enables the rest of unstructured to understand a new type of file 2. `unstructured.file_utils.model.register_partitioner` Decorator that enables registering a partitioner function to run for a file type. --------- Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2025-03-06 22:09:42 +00:00
Marek Połom	f333d7fe7f	feat: Json elements to HTML converter (#3936 ) ## NOTE `test_unstructured_ingest/expected-structured-output-html` contains all test HTML fixtures. Original JSON files, from which these HTML fixtures are generated, were taken from `test_unstructured_ingest/expected-structured-output`	2025-03-04 13:57:35 +00:00
Yao You	43b682ad3f	feat: allow extraction of camel cased element type names (#3938 ) This PR allows element types with CamelCase names to be extractable using the `extract_image_block_types` variable. Before: specifying `extract_image_block_types=["NarrativeText"]` (or any casing for `NarrativeText`) would raise a warning that it doesn't match any available types, and no image would be extracted for this element type. Now: specifying `extract_image_block_types=["NarrativeText"]` extracts images for this element type. ## testing ```python from unstructured.partition.auto import partition f = "example-docs/pdf/embedded-images-tables.pdf" elements = partition(f, strategy="hi_res", extract_image_block_types=["narrativetext"]) ``` Without this PR no figures would be extracted. With this PR a local folder would be created to contain images of the narrative text elements in a path like `./figures/figure-1-1.jpg` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2025-03-04 01:33:05 +00:00
Pluto	0df50fe6e8	Fix file detection when spooled file is passed (#3932 ) This pull request fixes the scenario where a `SpooledTemporaryFile` is passed to `detect_filetype`. In such cases some weird number was assigned as `name` (and it couldn't be overwritten, as `SpooledTemporaryFile` can't have fields assigned 😩 ), so I added another scenario to our object factory that handles this type of file. For `BytesIO` the `name` attr is None, as it should be, and other metadata fields are leveraged for file type recognition.	2025-02-20 13:00:25 +00:00
Pluto	3973a30b8c	Feat: Add pdfminer parameters configuration (#3918 ) This pull request adds the ability to configure multiple pdfminer parameters (with the simple possibility to extend for the additional parameters). One of the parameters overwrites the default from LA Params config class. Example: ```python3 partition( filename=example_doc_path("pdf/layout-parser-paper-fast.pdf"), pdfminer_line_margin=1.123, pdfminer_char_margin=None, pdfminer_line_overlap=0.0123, pdfminer_word_margin=3.21, ) assert pdfminer_mock.call_args.kwargs == { "line_margin": 1.123, "line_overlap": 0.0123, "word_margin": 3.21, } ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: plutasnyy <plutasnyy@users.noreply.github.com>	2025-02-17 11:41:20 +00:00
Philippe PRADOS	b521bce9c6	Add password with PDF files (#3721 ) Add password with PDF files Must be combined with [PR 392 in unstructured-inference](https://github.com/Unstructured-IO/unstructured-inference/pull/392) --------- Co-authored-by: John J <43506685+Coniferish@users.noreply.github.com>	2025-02-11 17:39:16 +00:00
Roman Isecke	92be4eb2dd	bugfix/fix ndjson detection (#3905 ) ### Description NDJSON files were being detected as JSON due to having the same mime-type. This adds additional logic to skip mime-type based detection if extension is `.ndjson`	2025-02-11 14:21:28 +00:00
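The idea in commit 92be4eb2dd above can be sketched as an extension check that runs before any MIME-type mapping. This is an illustrative outline only: `detect_json_flavor` and its return values are hypothetical and do not mirror the library's detector.

```python
import os


def detect_json_flavor(filename: str, mime_type: str):
    """Prefer the `.ndjson` extension over the shared JSON MIME type."""
    extension = os.path.splitext(filename)[1].lower()
    if extension == ".ndjson":
        # Skip MIME-based detection entirely: both flavors share a MIME type.
        return "ndjson"
    if mime_type == "application/json":
        return "json"
    return None


results = [
    detect_json_flavor("events.ndjson", "application/json"),
    detect_json_flavor("events.json", "application/json"),
    detect_json_flavor("events.bin", "application/octet-stream"),
]
```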
qued	b10379c14c	Fix: plug security issue partition system files via include (#3908 ) #### Summary A recent security review showed that it was possible to partition arbitrary local files in cases where the filetype supports an "include" functionality that brings in the content of files external to the partitioned file. This affects `rst` and `org` files. #### Fix This PR fixes the above issue by passing the parameter `sandbox=True` in all cases where `pypandoc.convert_file` is called. Note I also added the parameter to a call to this method in the ODT code. I haven't investigated whether there was a security issue with ODT files, but it seems better to use pandoc in sandbox mode given the security issues we know about. #### Testing To verify that the tests that are added with this PR find the relevant issue: - Remove the `sandbox=True` text from `unstructured/file_utils/file_conversion.py` line 17. - Run the tests `test_unstructured.partition.test_rst.test_rst_wont_include_external_files` and `test_unstructured.partition.test_org.test_org_wont_include_external_files`. Both should fail due to the partitioning containing the word "wombat", which only appears in a file external to the partitioned file. - Add the parameter back in, and the tests pass.	2025-02-06 03:27:18 +00:00
Pluto	5bb95b5841	Fix parsing table cells (#3904 ) This PR: - Fixes removing HTML tags that exist in <td> cells - stripping function was in general problematic to implement in easy and straightforward way (you can't modify `descendants` in-place). So I decided instead of patching something in table cell I added stripping everywhere in the same consistent way. This is why some tests needed small edits with removing one white-space in each tag. I believe this won't cause any problems for downstream tasks. Tested HTML: ```html <table class="Table"> <tbody> <tr> <td colspan="2"> Some text </td> <td> <input checked="" class="Checkbox" type="checkbox"/> </td> </tr> </tbody> </table> ``` Before & After ```html '<table class="Table" id="..."> <tbody> <tr> <td colspan="2">Some text</td><td></td></tr></tbody></table>' '<table class="Table" id="..."><tbody><tr><td colspan="2">Some text</td><td><input checked="" type="checkbox"/></td></tr></tbody></table>'' ```	2025-02-05 15:28:49 +00:00
Yao You	9d58b34ab4	Fix/fix table id checking logic (#3898 ) - there is a bug in deciding if a page has tables before performing table extraction. This logic checks if the id associated with Table type element is True - however, it should be checking if the id is `None` because sometimes the id can be 0 (the first type of element in the page) - the fix updates the logic - adds a unit test for this specific case	2025-01-31 10:19:14 -08:00
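The pitfall described in commit 9d58b34ab4 above is the usual truthiness trap: `0` is a valid id but evaluates as false. The two helper functions below are hypothetical and only contrast the two checks.

```python
def has_table_truthy(table_id) -> bool:
    # Buggy variant: an id of 0 is falsy, so the page looks table-free.
    return bool(table_id)


def has_table_not_none(table_id) -> bool:
    # Fixed variant: only a missing id (None) means "no table".
    return table_id is not None


checks = {
    table_id: (has_table_truthy(table_id), has_table_not_none(table_id))
    for table_id in (None, 0, 3)
}
```

Only the id `0` separates the two checks, which is why the bug shows up when the `Table` type happens to be the first element type on the page.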
Yao You	a9ff1e70b2	Fix/fix ocr region to elements bug (#3891 ) This PR fixes a bug in `build_layout_elements_from_ocr_regions` where texts are joined in the wrong order. The bug is due to incorrect masking of the `ocr_regions` after some are already selected as one of the final groups. The fix uses a simpler masking method: the same indices that add the regions to the final groups are used to mask them so they are not considered again. ## Testing This PR adds a unit test specifically aimed at this bug. Without the fix the test would fail. Additionally, any PDF file with repeated text has the potential to trigger this bug, e.g., create a simple pdf using the test text ```python "LayoutParser: \n\nA Unified Toolkit for Deep Learning Based Document Image\n\nLayoutParser for Deep Learning" ``` and partitioning with `ocr_only` mode on the main branch would hit this bug and output text where the position of the second "LayoutParser" is incorrect. ```python [ 'LayoutParser:', 'A Unified Toolkit for Deep Learning Based Document Image', 'for Deep Learning LayoutParser', ] ```	2025-01-29 12:11:17 +00:00
fzowl	0fbdd4ea36	Refactoring VoyageAI integration (#3878 ) Using VoyageAI's python package directly, allowing more features than through langchain	2025-01-28 21:45:40 +00:00
David Huggins-Daines	9e5ff225f6	fix: Correctly patch pdfminer to avoid unnecessarily and unsuccessfully repairing PDFs with long content streams, causing needless and endless OCR (#3822 ) Fixes: #3815 Verified on my very large documents that it doesn't unnecessarily and unsuccessfully "repair" them. You may or may not wish to keep the version check in `patch_psparser`. Since ~you're pinning the version of pdfminer.six and since it isn't guaranteed that the bug in question will be fixed in the next pdfminer.six release (but it is rather serious, so I should hope so), then perhaps you just want to unconditionally patch it.~ it seems like pinning of versions is only operative when running from Docker (good!) so never mind! Keep that version check! Also corrected an import so that if you do feel like using a newer version of pdfminer.six, it won't break on you. --------- Authored-by: David Huggins-Daines <dhdaines@logisphere.ca>	2025-01-24 14:27:25 -06:00
Yao You	8f2a719873	Feat/refactor layoutelement textregion to vectorized data structure (#3881 ) This PR refactors the data structure for `list[LayoutElement]` and `list[TextRegion]` used in partition pdf/image files. - new data structure replaces a list of objects with one object with `numpy` array to store data - this only affects partition internal steps and it doesn't change input or output signature of `partition` function itself, i.e., `partition` still returns `list[Element]` - internally `list[LayoutElement]` -> `LayoutElements`; `list[TextRegion]` -> `TextRegions` - current refactor stops before cleaning up pdfminer elements inside inferred layout elements -> the clean-up algorithm needs to be refactored before the data structure refactor can move forward. So the current refactor converts the array data structure into a list data structure with an `element_array.as_list()` call. This is the last step before turning `list[LayoutElement]` into `list[Element]` as return - a future PR will update this last step so that we build `list[Element]` from the `LayoutElements` data structure instead. The goal of this PR is to replace the data structure as much as possible without changing underlying logic. There are a few places where the slicing or filtering logic was simple enough to be converted into vector data structure operations. Those are refactored to be vector based. As a result there are some small improvements observed in the ingest test. This is likely because the vector operations cleaned up some previous inconsistency in data types and operations. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2025-01-23 17:11:38 +00:00
Yao You	27cd53bd45	fix: fix multiple values for infer_table_structure (#3870 ) This PR fixes a bug when using `partition` to partition an email with image attachments with hi_res and table structure inference enabled -> the partitioning of the image would encounter a value error: `got multiple values for keyword argument 'infer_table_structure'`. This is because, when passing `kwargs` on to partition "other" types of files in this [block](`50ea6fe7fc/unstructured/partition/auto.py (L270-L280)`), `infer_table_structure` is packaged into `partitioning_kwargs`. Then, for email at least, when there are attachments that can be partitioned with `hi_res` we pass that dict of `kwargs` right back into the `partition` entry point -> so when we get [here](`50ea6fe7fc/unstructured/partition/auto.py (L222-L235)`) we are both specifying `infer_table_structure` explicitly and have it in the `kwargs` variable. The fix is to detect first if `kwargs` already contains `infer_table_structure` and, if yes, use that and pop it from `kwargs`. --------- Co-authored-by: Kamil Plucinski <kamil.plucinski@deepsense.ai> Co-authored-by: christinestraub <christinemstraub@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2025-01-17 18:41:04 +00:00
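The duplicate-keyword failure described in commit 27cd53bd45 above can be reproduced in isolation. In this sketch `partition_attachment`, `dispatch_buggy`, and `dispatch_fixed` are hypothetical functions, not the code in `auto.py`; note that plain Python reports this situation as a `TypeError`.

```python
def partition_attachment(infer_table_structure=False, **kwargs):
    """Hypothetical downstream partitioner that echoes what it received."""
    return {"infer_table_structure": infer_table_structure, **kwargs}


def dispatch_buggy(infer_table_structure, kwargs):
    # If kwargs also carries the key, the call below passes it twice.
    return partition_attachment(
        infer_table_structure=infer_table_structure, **kwargs
    )


def dispatch_fixed(infer_table_structure, kwargs):
    kwargs = dict(kwargs)  # copy so the caller's dict is not mutated
    # Prefer the value already in kwargs and remove it to avoid the clash.
    infer_table_structure = kwargs.pop(
        "infer_table_structure", infer_table_structure
    )
    return partition_attachment(
        infer_table_structure=infer_table_structure, **kwargs
    )


forwarded = {"infer_table_structure": True, "strategy": "hi_res"}
try:
    dispatch_buggy(False, forwarded)
    buggy_error = None
except TypeError as error:
    buggy_error = str(error)
fixed_result = dispatch_fixed(False, forwarded)
```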
Pluto	8685905bd1	Character confidence threshold (#3860 ) This change adds the ability to filter out characters predicted by Tesseract with low confidence scores. Some notes: - I intentionally disabled it by default; I think some low score(like 0.9-0.95 for Tesseract) could be a safe choice though - I wanted to use character bboxes and combine them into word bbox later. However, a bug in Tesseract in some specific scenarios returns incorrect character bboxes (unit tests caught it 🥳 ). More in comment in the code	2025-01-13 13:12:46 +00:00
Christine Straub	8378c26035	Feat/contain nltk assets in docker image (#3853 ) This pull request adds NLTK data to the Docker image by pre-packaging the data to ensure a more reliable and efficient deployment process, as the required NLTK resources are readily available within the container. Current updated solution: - Dockerfile Update: Integrated NLTK data directly into the Docker image, ensuring that the API can operate independently of external data sources. The data is stored at /home/notebook-user/nltk_data. - Environment Variable Setup: Configured the NLTK_PATH environment variable, enabling Python scripts to automatically locate and use the embedded NLTK data. This eliminates the need for manual configuration in deployment environments. - Code Cleanup: Removed outdated code in tokenize.py and related scripts that previously downloaded NLTK data from S3. This streamlines the codebase and removes unnecessary dependencies. - Script Updates: Updated tokenize.py and test_tokenize.py to utilize the NLTK_PATH variable, ensuring consistent access to the embedded data across all environments. - Dependency Elimination: Fully eliminated reliance on the S3 bucket for NLTK data, mitigating risks from network failures or access changes. - Improved System Reliability: By embedding assets within the Docker image, the API now has a self-contained setup that ensures consistent behavior regardless of deployment location. - Updated the Dockerfile to copy the local NLTK data to the appropriate directory within the container. - Adjusted the application setup to verify the presence of NLTK assets during the container build process.	2025-01-08 22:00:13 +00:00
Roman Isecke	50ea6fe7fc	feat: add ndjson support (#3845 ) ### Description Add ndjson file type support and treat it the same as json files.	2024-12-19 14:39:26 +00:00
Steve Canny	b3a2dd4755	fix: html incorrectly categorizing text (#3841 ) Fixes #3666 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-12-18 18:46:54 +00:00
Steve Canny	9ece0b5ad2	fix: improve false-positive Title elements on Chinese text (#3836 ) Summary Improve element-type mapping for Chinese text. Fixes bug where Chinese text would produce large numbers of false-positive `Title` elements. Fixes #3084 --------- Co-authored-by: scanny <scanny@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2024-12-18 01:16:42 +00:00
Steve Canny	b5ff79d8db	fix: refine filetype detection (#3828 ) Summary Fixes a bug where a CSV file with asserted content-type `application/vnd.ms-excel` was incorrectly identified as an XLS file and failed partitioning. Additional Context The `content_type` argument to partitioning is often authored by the client system (e.g. Unstructured SDK) and is both unreliable and outside the control of the user. In this case the `.csv -> XLS` mapping is correct for certain purposes (Excel is often used to load and edit CSV files) but not for partitioning, and the user has no readily available way to override the mapping. XLS files as well as seven other common binary file types can be efficiently detected 100% of the time (at least 99.999%) using code we already have in the file detector. - Promote this direct-inspection strategy to be tried first. - When DOC, DOCX, EPUB, ODT, PPT, PPTX, XLS, or XLSX is detected, use that file-type. - When one of those types is NOT detected, clear the asserted `content_type` when it matches any of those types. This prevents the problem seen in the bug where the asserted content type was used to determine the file-type. - The remaining content_type, guess MIME-type, and filename-extension mapping strategies are tried, in that order, only when direct inspection fails. This is largely the same as it was before. - Fix #3781 while we were in the neighborhood. - Fix #3596 as well, essentially an earlier report of #3781.	2024-12-17 00:56:21 +00:00
Steve Canny	10f0d54ac2	build: remove ruff version upper bound (#3829 ) Summary Remove pin on `ruff` linter and fix the handful of lint errors a newer version catches.	2024-12-16 23:01:22 +00:00
Steve Canny	3b718ec89a	rfctr: prep for pluggable partitioners (#3806 ) Summary Prepare auto-partitioning for pluggable partitioners. Move toward a uniform partitioner call signature in `auto/partition()` such that a custom or override partitioner can be registered without requiring code changes. Additional Context The central job of `auto/partition()` is to detect the file-type of the given file and use that to dispatch partitioning to the corresponding partitioner function e.g. `partition_pdf()` or `partition_docx()`. In the existing code, each partitioner function is called with parameters "hand-picked" from the available parameters passed to the `partition()` function. This is unnecessary and couples those partitioners tightly with the dispatch function. The desired state is that all available arguments are passed as `kwargs` and the partitioner function "self-selects" the arguments it will be sensitive to, applies its own appropriate default values when the argument is omitted, and simply ignore any arguments it doesn't use. Note that achieving this requires no changes to partitioner functions because they already do precisely this. So the job is to pass all arguments (other than `filename` and `file`) to the partitioner as `kwargs`. This will allow additional or alternate partitioners to be registered at runtime and dispatched to, because as long as they have the signature `partition_x(filename, file, kwargs) -> list[Element]` then they can be dispatched to without customization.	2024-12-10 20:44:34 +00:00
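The dispatch style described in commit 3b718ec89a above, where every argument is forwarded as `kwargs` and each partitioner picks out what it needs, might look roughly like this. The registry, the decorator, and both toy partitioners are hypothetical and exist only for this example; they are not the library's API.

```python
import io
from typing import Callable, Dict, List

_PARTITIONERS: Dict[str, Callable[..., List[str]]] = {}


def register(file_type: str):
    """Hypothetical decorator that maps a file type to a partitioner."""
    def decorator(func):
        _PARTITIONERS[file_type] = func
        return func
    return decorator


@register("txt")
def partition_txt(filename=None, file=None, *, keep_blank=False, **kwargs):
    # Uses `keep_blank`, applies its own default, ignores everything else.
    lines = [line.strip() for line in file.read().splitlines()]
    return [line for line in lines if line or keep_blank]


@register("csv")
def partition_csv(filename=None, file=None, *, delimiter=",", **kwargs):
    # Uses `delimiter` only; unrelated arguments are silently ignored.
    return [c for row in file.read().splitlines() for c in row.split(delimiter)]


def partition(file_type, filename=None, file=None, **kwargs):
    # The dispatcher forwards all keyword arguments without hand-picking.
    return _PARTITIONERS[file_type](filename=filename, file=file, **kwargs)


options = {"delimiter": ";", "keep_blank": False}
txt_out = partition("txt", file=io.StringIO(" a \n\nb"), **options)
csv_out = partition("csv", file=io.StringIO("x;y"), **options)
```

Because the dispatcher never inspects the options, a partitioner registered later needs no change to `partition` itself.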
Magnus F	1e2da6df46	fix: ipv4 address regex (#3808 ) I noticed the ipv4 regex is wrong (it only captures one- or two-digit octets, e.g. `n.nn.n.nn`). Here's a correction and a bumped test for it. If you wish I can break out the ipv4 test to its own case, so we don't interfere with the existing `EMAIL_META_DATA_INPUT` ipv6 extraction test. Side note: The comment at `unstructured/nlp/patterns.py#95` includes a bad ipv4 address example (last octet is wrongfully left-padded with a zero). I left it as it is because I'm not sure if the intention is to include "non-conventional" ipv4 addresses, like octal or hexadecimal octets.	2024-12-09 14:19:13 -08:00
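To make the problem in commit 1e2da6df46 above concrete, the sketch below contrasts a pattern limited to one- or two-digit octets with one that accepts 0 to 255. Neither pattern is copied from `unstructured/nlp/patterns.py`; both are written here for illustration.

```python
import re

# Illustrative "too narrow" pattern: each octet is at most two digits.
NARROW_IPV4 = re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{1,2}\.\d{1,2}\b")

# Each octet is 0-255 with no left padding (conventional dotted decimal).
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
STRICT_IPV4 = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")

header = "Received: from 192.168.1.10 by 8.8.8.8"
narrow_hits = NARROW_IPV4.findall(header)
strict_hits = STRICT_IPV4.findall(header)
out_of_range = STRICT_IPV4.findall("256.1.1.1")
```

The strict pattern deliberately rejects zero-padded octets, so whether "non-conventional" forms should match remains a separate decision, as the commit's side note points out.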
Steve Canny	4379d883a3	chunk: relax table segregation during chunking (#3812 ) Summary Relax table-segregation rule applied during chunking such that a `Table` and `Text`-subtype elements can be combined into a single chunk when the chunking window allows. Additional Context Until now, `Table` elements have always been segregated during chunking, i.e. a chunk that contained a table would never contain any other element. In certain scenarios, especially when a large chunking window of say 2000 characters is used, this behavior can reduce retrieval effectiveness by isolating the table from surrounding context. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-12-09 18:57:22 +00:00
Christine Straub	9076d56d9f	fix: resolve merging conflict error	2024-12-07 19:40:11 -08:00
Tracy Shen	8c58bc57db	fix doctype parsing error (#3811 ) - per [ticket](https://unstructured-ai.atlassian.net/browse/ML-551), there is a bug in the `unstructured` lib under metrics/evaluate.py that incorrectly retrieves the file extension before the conversion to cct file from paths like '.pdf.txt' (see the screenshot below) - the current behavior is shown in the top example - the correct version is shown in the bottom example of the screenshot. ![image](https://github.com/user-attachments/assets/6d82de85-3b54-4e77-a637-28a27fcb279d) - in addition, I also observed that the doctypes returned are not aligned: some are returned with a leading '.' and some without the dot. - therefore, I aligned them to be output in the same form, which is '.*'.	2024-12-06 23:55:01 +00:00
Christine Straub	7d06c120dc	Merge branch 'main' into ML-593/quote-standardization	2024-12-05 10:27:26 -08:00
Christine Straub	c0c3fd673f	test: enhance quote standardization tests with additional Unicode scenarios	2024-12-04 13:02:07 -08:00

726 Commits