unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-06-27 02:30:08 +00:00

Author	SHA1	Message	Date
luke-kucing	147add9a04	Luke/CVE bump (#3928 ) bumping dependencies and updating the tokenizer constraint 0.16.22	2025-02-19 17:23:31 +00:00
Pluto	3403db1ad4	Release 0.16.21 (#3924 ) 0.16.21	2025-02-17 15:06:24 +00:00
Pluto	3973a30b8c	Feat: Add pdfminer parameters configuration (#3918 ) This pull request adds the ability to configure multiple pdfminer parameters (and makes it simple to extend to additional parameters). One of the parameters overwrites the default from the `LAParams` config class. Example: ```python3 partition( filename=example_doc_path("pdf/layout-parser-paper-fast.pdf"), pdfminer_line_margin=1.123, pdfminer_char_margin=None, pdfminer_line_overlap=0.0123, pdfminer_word_margin=3.21, ) assert pdfminer_mock.call_args.kwargs == { "line_margin": 1.123, "line_overlap": 0.0123, "word_margin": 3.21, } ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: plutasnyy <plutasnyy@users.noreply.github.com>	2025-02-17 11:41:20 +00:00
Philippe PRADOS	b521bce9c6	Add password with PDF files (#3721 ) Add password with PDF files Must be combined with [PR 392 in unstructured-inference](https://github.com/Unstructured-IO/unstructured-inference/pull/392) --------- Co-authored-by: John J <43506685+Coniferish@users.noreply.github.com>	2025-02-11 17:39:16 +00:00
Roman Isecke	92be4eb2dd	bugfix/fix ndjson detection (#3905 ) ### Description NDJSON files were being detected as JSON due to having the same mime-type. This adds additional logic to skip mime-type based detection if extension is `.ndjson`	2025-02-11 14:21:28 +00:00
Yao You	723c0740e0	Feat/vectorize layout merging (#3900 ) This PR rewrites the logic in `unstructured_inference` that merges extracted with inferred layout using vectorized operations. The goal is to: - vectorize the operation to improve memory and cpu efficiency - apply logic equally without order being a factor (the `unstructured_inference` version uses loops and modifies the content of the inner loop on the fly -> the order of the outer loop, which is the order of extracted elements, becomes a factor determining the merging results) - rewrite the loop into clear steps with clear rules - set the stage for follow-up improvements While this PR aims to reproduce the existing behavior as much as possible, it is not an exact replica of the looped version. Because order is no longer a factor, some extracted elements that used to be not considered part of a larger inferred element (due to suboptimal processing order) are now properly merged. This led to changes in one ingest test. For example, the change shows that we now properly merge the section number with the section title as the full title element. ## Test: Since the goal of this refactor is to preserve as much existing behavior as possible we rely on existing tests. As mentioned above the one file that changed output during ingest test is a net positive change. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2025-02-07 20:25:57 +00:00
cragwolfe	7ff0ff890d	chore: utils update (#3909 )	2025-02-07 05:58:23 +00:00
qued	b10379c14c	Fix: plug security issue partition system files via include (#3908 ) #### Summary A recent security review showed that it was possible to partition arbitrary local files in cases where the filetype supports an "include" functionality that brings in the content of files external to the partitioned file. This affects `rst` and `org` files. #### Fix This PR fixes the above issue by passing the parameter `sandbox=True` in all cases where `pypandoc.convert_file` is called. Note I also added the parameter to a call to this method in the ODT code. I haven't investigated whether there was a security issue with ODT files, but it seems better to use pandoc in sandbox mode given the security issues we know about. #### Testing To verify that the tests that are added with this PR find the relevant issue: - Remove the `sandbox=True` text from `unstructured/file_utils/file_conversion.py` line 17. - Run the tests `test_unstructured.partition.test_rst.test_rst_wont_include_external_files` and `test_unstructured.partition.test_org.test_org_wont_include_external_files`. Both should fail due to the partitioning containing the word "wombat", which only appears in a file external to the partitioned file. - Add the parameter back in, and the tests pass. 0.16.20	2025-02-06 03:27:18 +00:00
Pluto	5852260a75	Release 0.16.19 (#3906 ) 0.16.19	2025-02-05 16:45:21 +00:00
Pluto	5bb95b5841	Fix parsing table cells (#3904 ) This PR: - Fixes removing HTML tags that exist in <td> cells - stripping function was in general problematic to implement in easy and straightforward way (you can't modify `descendants` in-place). So I decided instead of patching something in table cell I added stripping everywhere in the same consistent way. This is why some tests needed small edits with removing one white-space in each tag. I believe this won't cause any problems for downstream tasks. Tested HTML: ```html <table class="Table"> <tbody> <tr> <td colspan="2"> Some text </td> <td> <input checked="" class="Checkbox" type="checkbox"/> </td> </tr> </tbody> </table> ``` Before & After ```html '<table class="Table" id="..."> <tbody> <tr> <td colspan="2">Some text</td><td></td></tr></tbody></table>' '<table class="Table" id="..."><tbody><tr><td colspan="2">Some text</td><td><input checked="" type="checkbox"/></td></tr></tbody></table>'' ```	2025-02-05 15:28:49 +00:00
Caleb Bartholomew	451ad97ce2	Add timeout to scarf telemetry requests (#3792 ) Resolves #3791 by setting a default timeout of 10 seconds.	2025-02-01 22:05:42 -08:00
Yao You	9d58b34ab4	Fix/fix table id checking logic (#3898 ) - there is a bug in deciding if a page has tables before performing table extraction. This logic checks if the id associated with Table type element is True - however, it should be checking if the id is `None` because sometimes the id can be 0 (the first type of element in the page) - the fix updates the logic - adds a unit test for this specific case	2025-01-31 10:19:14 -08:00
luke-kucing	a368aac4a3	minor bump to resolve open CVEs (#3895 ) small minor version change to trigger workflows. and fix the open CVEs we had.	2025-01-30 18:24:56 +00:00
cragwolfe	918a3d0deb	fix: allow users to install package with python3.13 or higher (#3893 ) Although, python3.13 is not officially supported or tested in CI just yet.	2025-01-30 14:52:24 +00:00
David Huggins-Daines	11ff9e7659	fix(ci): Use non-deprecated way of invoking ruff in `make tidy` (#3825 ) I noticed that `make tidy` wasn't working in my development environment. This happens if you, a developer, forget to follow the specific instructions in `README.md` and install exactly the right versions of the necessary tools, including a quite old version of Ruff. This version will nonetheless warn you: warning: `ruff <path>` is deprecated. Use `ruff check <path>` instead. So this fixes that, in order to future-proof and avoid confusion!	2025-01-29 21:18:02 -08:00
cragwolfe	55debafa8f	release: 0.16.17 (#3892 ) Co-authored-by: Yao You <yao@unstructured.io> 0.16.17	2025-01-29 06:49:49 -06:00
Yao You	a9ff1e70b2	Fix/fix ocr region to elements bug (#3891 ) This PR fixes a bug in `build_layout_elements_from_ocr_regions` where texts are joined in the wrong order. The bug is due to incorrect masking of the `ocr_regions` after some are already selected as one of the final groups. The fix uses a simpler method: it masks the regions using the same indices that add them to the final groups, so they are not considered again. ## Testing This PR adds a unit test specifically aimed at this bug. Without the fix the test would fail. Additionally, any PDF file with repeated text has the potential to trigger this bug. e.g., creating a simple pdf using the test text ```python "LayoutParser: \n\nA Unified Toolkit for Deep Learning Based Document Image\n\nLayoutParser for Deep Learning" ``` and partitioning with `ocr_only` mode on main branch would hit this bug and output text where the position of the second "LayoutParser" is incorrect. ```python [ 'LayoutParser:', 'A Unified Toolkit for Deep Learning Based Document Image', 'for Deep Learning LayoutParser', ] ```	2025-01-29 12:11:17 +00:00
fzowl	0fbdd4ea36	Refactoring VoyageAI integration (#3878 ) Using VoyageAI's python package directly, allowing more features than through langchain	2025-01-28 21:45:40 +00:00
cragwolfe	238f985dda	feat: add --images support to unstructured-get-json.sh (#3888 ) E.g., now can run: ```bash # extracts base64 encoded image data for `Table` and `Image` elements $ unstructured-get-json.sh --trace --verbose --images /t/docs/Captur-1317-5_ENG-p5.pdf # also extracts `Title` elements (see screenshot) $ IMAGE_BLOCK_TYPES='"title","table","image"' unstructured-get-json.sh --trace --verbose --images /t/docs/Captur-1317-5_ENG-p5.pdf ``` It was discovered during testing that "narrativetext" does not work, probably due to camel casing of NarrativeText 😬 ![image](https://github.com/user-attachments/assets/e6414a57-81e1-4560-b1b2-dce3b1c2c804)	2025-01-27 16:09:13 -08:00
cragwolfe	b5b13076dd	chore: utils update (#3889 )	2025-01-27 15:47:10 -08:00
Christine Straub	a447b813a9	added auto_download logic to download data runtime (#3883 ) - Add auto-download of NLTK data for Python environments: when the user imports `tokenize`, it automatically downloads the NLTK data. - Added `AUTO_DOWNLOAD_NLTK` flag in `tokenize.py` to download `NLTK_DATA` 0.16.16	2025-01-27 19:11:32 +00:00
David Huggins-Daines	9e5ff225f6	fix: Correctly patch pdfminer to avoid unnecessarily and unsuccessfully repairing PDFs with long content streams, causing needless and endless OCR (#3822 ) Fixes: #3815 Verified on my very large documents that it doesn't unnecessarily and unsuccessfully "repair" them. You may or may not wish to keep the version check in `patch_psparser`. Since ~you're pinning the version of pdfminer.six and since it isn't guaranteed that the bug in question will be fixed in the next pdfminer.six release (but it is rather serious, so I should hope so), then perhaps you just want to unconditionally patch it.~ it seems like pinning of versions is only operative when running from Docker (good!) so never mind! Keep that version check! Also corrected an import so that if you do feel like using a newer version of pdfminer.six, it won't break on you. --------- Authored-by: David Huggins-Daines <dhdaines@logisphere.ca>	2025-01-24 14:27:25 -06:00
Roman Isecke	e230364a2c	bugfix/drop use of ndjson dep, use local code (#3886 ) ### Description Avoid using the ndjson dependency due to the limiting license that exists on it	2025-01-24 15:31:02 +00:00
Yao You	8f2a719873	Feat/refactor layoutelement textregion to vectorized data structure (#3881 ) This PR refactors the data structure for `list[LayoutElement]` and `list[TextRegion]` used in partition pdf/image files. - new data structure replaces a list of objects with one object with `numpy` array to store data - this only affects partition internal steps and it doesn't change input or output signature of `partition` function itself, i.e., `partition` still returns `list[Element]` - internally `list[LayoutElement]` -> `LayoutElements`; `list[TextRegion]` -> `TextRegions` - current refactor stops before cleaning up pdfminer elements inside inferred layout elements -> the clean-up algorithm needs to be refactored before the data structure refactor can move forward. So current refactor converts the array data structure into list data structure with `element_array.as_list()` call. This is the last step before turning `list[LayoutElement]` into `list[Element]` as return - a future PR will update this last step so that we build `list[Element]` from `LayoutElements` data structure instead. The goal of this PR is to replace the data structure as much as possible without changing underlying logic. There are a few places where the slicing or filtering logic was simple enough to be converted into vector data structure operations. Those are refactored to be vector based. As a result there are some small improvements observed in ingest test. This is likely because the vector operations cleaned up some previous inconsistency in data types and operations. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2025-01-23 17:11:38 +00:00
Tracy Shen	8d0b68aeae	release 0.16.5 (#3884 ) this PR is to release to 0.16.5 which has below updates: - Update `unstructured-inference` to 0.8.6 in requirements which removed `layoutparser` dependency libs - Update `pdfminer-six` to 20240706 0.16.15	2025-01-23 04:01:13 +00:00
Tracy Shen	afecf1b742	update unstructured-inference lib (#3880 ) update unstructured-inference to 0.8.6 in requirements in extra-pdf-image.in 0.8.6 has pdfminer=20240706 (newer version)	2025-01-22 23:11:28 +00:00
Pluto	efd9f648a7	Bump 0.16.14 (#3879 ) 0.16.14	2025-01-20 11:51:01 +00:00
Yao You	27cd53bd45	fix: fix multiple values for infer_table_structure (#3870 ) This PR fixes a bug that occurs when using `partition` to partition an email with image attachments with hi_res and table structure inference enabled -> the partitioning of the image would encounter a value error: `got multiple values for keyword argument 'infer_table_structure'`. This is because, when passing `kwargs` to partition "other" types of files in this [block](`50ea6fe7fc/unstructured/partition/auto.py (L270-L280)`), `infer_table_structure` is packaged into `partitioning_kwargs`. Then, for email at least, when there are attachments that can be partitioned with `hi_res`, we pass that dict of `kwargs` right back into the `partition` entry point -> so when we get [here](`50ea6fe7fc/unstructured/partition/auto.py (L222-L235)`) we are both specifying `infer_table_structure` explicitly and have it in the `kwargs` variable. The fix is to first detect whether `kwargs` already contains `infer_table_structure` and, if so, use that value and pop it from `kwargs`. --------- Co-authored-by: Kamil Plucinski <kamil.plucinski@deepsense.ai> Co-authored-by: christinestraub <christinemstraub@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2025-01-17 18:41:04 +00:00
Christine Straub	38eb661338	Set the version to 0.16.3 (#3862 ) Co-authored-by: Kamil Plucinski <kamil.plucinski@deepsense.ai> 0.16.13	2025-01-13 14:45:05 +00:00
Pluto	8685905bd1	Character confidence threshold (#3860 ) This change adds the ability to filter out characters predicted by Tesseract with low confidence scores. Some notes: - I intentionally disabled it by default; I think some low score(like 0.9-0.95 for Tesseract) could be a safe choice though - I wanted to use character bboxes and combine them into word bbox later. However, a bug in Tesseract in some specific scenarios returns incorrect character bboxes (unit tests caught it 🥳 ). More in comment in the code	2025-01-13 13:12:46 +00:00
Christine Straub	8378c26035	Feat/contain nltk assets in docker image (#3853 ) This pull request adds NLTK data to the Docker image by pre-packaging the data to ensure a more reliable and efficient deployment process, as the required NLTK resources are readily available within the container. Current updated solution: - Dockerfile Update: Integrated NLTK data directly into the Docker image, ensuring that the API can operate independently of external data sources. The data is stored at /home/notebook-user/nltk_data. - Environment Variable Setup: Configured the NLTK_PATH environment variable, enabling Python scripts to automatically locate and use the embedded NLTK data. This eliminates the need for manual configuration in deployment environments. - Code Cleanup: Removed outdated code in tokenize.py and related scripts that previously downloaded NLTK data from S3. This streamlines the codebase and removes unnecessary dependencies. - Script Updates: Updated tokenize.py and test_tokenize.py to utilize the NLTK_PATH variable, ensuring consistent access to the embedded data across all environments. - Dependency Elimination: Fully eliminated reliance on the S3 bucket for NLTK data, mitigating risks from network failures or access changes. - Improved System Reliability: By embedding assets within the Docker image, the API now has a self-contained setup that ensures consistent behavior regardless of deployment location. - Updated the Dockerfile to copy the local NLTK data to the appropriate directory within the container. - Adjusted the application setup to verify the presence of NLTK assets during the container build process.	2025-01-08 22:00:13 +00:00
cragwolfe	1a94d95e47	chore: dependency bumps, release commit for 0.16.12 (#3831 ) 0.6.12 0.16.12	2025-01-05 13:50:19 -08:00
bovlb	e2d02808a0	Fix documentation link in CONTRIBUTING.md (#3846 )	2025-01-03 15:14:03 -08:00
luke-kucing	0245661ded	minor comment to trigger new container workflow (#3848 )	2024-12-20 01:02:32 +00:00
Roman Isecke	50ea6fe7fc	feat: add ndjson support (#3845 ) ### Description Add ndjson file type support and treat it the same as json files.	2024-12-19 14:39:26 +00:00
Steve Canny	b3a2dd4755	fix: html incorrectly categorizing text (#3841 ) Fixes #3666 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-12-18 18:46:54 +00:00
Steve Canny	9ece0b5ad2	fix: improve false-positive Title elements on Chinese text (#3836 ) Summary Improve element-type mapping for Chinese text. Fixes bug where Chinese text would produce large numbers of false-positive `Title` elements. Fixes #3084 --------- Co-authored-by: scanny <scanny@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2024-12-18 01:16:42 +00:00
Ribhu Lahiri	9a9bf4c4f5	Added contributing from archived repo (#3616 ) Added `CONTRIBUTING.md` from the archived repo as mentioned in the issue: https://github.com/Unstructured-IO/unstructured/issues/3540 Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>	2024-12-17 02:53:17 +00:00
Steve Canny	b5ff79d8db	fix: refine filetype detection (#3828 ) Summary Fixes a bug where a CSV file with asserted content-type `application/vnd.ms-excel` was incorrectly identified as an XLS file and failed partitioning. Additional Context The `content_type` argument to partitioning is often authored by the client system (e.g. Unstructured SDK) and is both unreliable and outside the control of the user. In this case the `.csv -> XLS` mapping is correct for certain purposes (Excel is often used to load and edit CSV files) but not for partitioning, and the user has no readily available way to override the mapping. XLS files as well as seven other common binary file types can be efficiently detected 100% of the time (at least 99.999%) using code we already have in the file detector. - Promote this direct-inspection strategy to be tried first. - When DOC, DOCX, EPUB, ODT, PPT, PPTX, XLS, or XLSX is detected, use that file-type. - When one of those types is NOT detected, clear the asserted `content_type` when it matches any of those types. This prevents the problem seen in the bug where the asserted content type was used to determine the file-type. - The remaining content_type, guess MIME-type, and filename-extension mapping strategies are tried, in that order, only when direct inspection fails. This is largely the same as it was before. - Fix #3781 while we were in the neighborhood. - Fix #3596 as well, essentially an earlier report of #3781.	2024-12-17 00:56:21 +00:00
Steve Canny	10f0d54ac2	build: remove ruff version upper bound (#3829 ) Summary Remove pin on `ruff` linter and fix the handful of lint errors a newer version catches.	2024-12-16 23:01:22 +00:00
Steve Canny	b092fb7f47	fix: add .grype.yaml (#3834 ) Summary CVE-2024-11053 https://curl.se/docs/CVE-2024-11053.html (severity Low) was published on Dec 11, 2024 and began failing CI builds on open-core on Dec 13, 2024 when it appeared in `grype` apparently misclassified as a critical vulnerability. The severity reported on the CVE is "Low" so it should not fail builds. Add a `.grype.yaml` file to ignore this CVE until grype is updated.	2024-12-16 19:39:55 +00:00
Steve Canny	3b718ec89a	rfctr: prep for pluggable partitioners (#3806 ) Summary Prepare auto-partitioning for pluggable partitioners. Move toward a uniform partitioner call signature in `auto/partition()` such that a custom or override partitioner can be registered without requiring code changes. Additional Context The central job of `auto/partition()` is to detect the file-type of the given file and use that to dispatch partitioning to the corresponding partitioner function e.g. `partition_pdf()` or `partition_docx()`. In the existing code, each partitioner function is called with parameters "hand-picked" from the available parameters passed to the `partition()` function. This is unnecessary and couples those partitioners tightly with the dispatch function. The desired state is that all available arguments are passed as `kwargs` and the partitioner function "self-selects" the arguments it will be sensitive to, applies its own appropriate default values when the argument is omitted, and simply ignores any arguments it doesn't use. Note that achieving this requires no changes to partitioner functions because they already do precisely this. So the job is to pass all arguments (other than `filename` and `file`) to the partitioner as `kwargs`. This will allow additional or alternate partitioners to be registered at runtime and dispatched to, because as long as they have the signature `partition_x(filename, file, kwargs) -> list[Element]` then they can be dispatched to without customization.	2024-12-10 20:44:34 +00:00
Steve Canny	b981d7197f	release: prepare release 0.16.11 (#3819 ) Release only, no code changes. 0.16.11	2024-12-09 23:48:00 +00:00
Magnus F	1e2da6df46	fix: ipv4 address regex (#3808 ) I noticed the ipv4 regex is wrong (it only captures one- or two-digit octets, e.g. `n.nn.n.nn`). Here's a correction and a bumped test for it. If you wish I can break out the ipv4 test to its own case, so we don't interfere with the existing `EMAIL_META_DATA_INPUT` ipv6 extraction test. Side note: The comment at `unstructured/nlp/patterns.py#95` includes a bad ipv4 address example (the last octet is wrongly left-padded with a zero). I left it as it is because I'm not sure if the intention is to include "non-conventional" ipv4 addresses, like octal or hexadecimal octets.	2024-12-09 14:19:13 -08:00
Steve Canny	4379d883a3	chunk: relax table segregation during chunking (#3812 ) Summary Relax table-segregation rule applied during chunking such that a `Table` and `Text`-subtype elements can be combined into a single chunk when the chunking window allows. Additional Context Until now, `Table` elements have always been segregated during chunking, i.e. a chunk that contained a table would never contain any other element. In certain scenarios, especially when a large chunking window of say 2000 characters is used, this behavior can reduce retrieval effectiveness by isolating the table from surrounding context. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-12-09 18:57:22 +00:00
Christine Straub	18d6c81c47	Update CHANGELOG.md (#3818 )	2024-12-09 14:47:53 +00:00
Christine Straub	2f06d5a2a2	test: fix lint error	2024-12-07 19:51:43 -08:00
Christine Straub	9076d56d9f	fix: resolve merging conflict error	2024-12-07 19:40:11 -08:00
Tracy Shen	59e6cff611	release 0.16.10 (#3816 ) release 0.16.10 so that competitor-eval can install and take advantage of the latest change in the metric calculation 0.16.10	2024-12-07 17:24:45 +00:00
Tracy Shen	8c58bc57db	fix doctype parsing error (#3811 ) - per [ticket](https://unstructured-ai.atlassian.net/browse/ML-551), there is a bug in the `unstructured` lib under metrics/evaluate.py that incorrectly retrieves the file extension before the conversion to cct file from paths like '.pdf.txt'. (see screenshot below) - the current behavior is shown in the top example - the correct version is shown in the bottom example of the screenshot. ![image](https://github.com/user-attachments/assets/6d82de85-3b54-4e77-a637-28a27fcb279d) - in addition, I also observed that the doctypes returned are not aligned: some are returned with a leading '.' and some without the dot. - therefore, I aligned them so they are all output in the same form, which is '.*'.	2024-12-06 23:55:01 +00:00
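
The doctype fix in commit 8c58bc57db concerns recovering the original extension from a converted path such as `report.pdf.txt`. A minimal sketch of that idea follows; `original_doctype` is a hypothetical helper written for illustration and is not the code in `metrics/evaluate.py`. It assumes the converted file keeps the original extension as the second-to-last suffix and always returns the leading-dot form.

```python
from pathlib import Path


def original_doctype(path: str) -> str:
    """Return the pre-conversion extension with a leading dot, e.g. '.pdf'.

    Hypothetical helper: assumes converted files are named like
    'report.pdf.txt', so the original type is the second-to-last suffix.
    """
    suffixes = Path(path).suffixes
    if len(suffixes) >= 2:
        # 'report.pdf.txt' -> ['.pdf', '.txt'] -> '.pdf'
        return suffixes[-2].lower()
    # Fall back to the only suffix, or an empty string when there is none.
    return suffixes[-1].lower() if suffixes else ""
```

Always returning the dotted form keeps doctype values comparable, which is the alignment the commit message describes.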

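The table-id bug described in commit 9d58b34ab4 is an instance of a common Python pitfall: an id of 0 is falsy, so a truthiness check treats it as missing. The sketch below illustrates the difference; the `class_ids` mapping and both function names are assumptions made for this example, not the project's actual code.

```python
from typing import Dict


def has_table_id_buggy(class_ids: Dict[str, int]) -> bool:
    # Truthiness check: wrongly reports False when the Table id is 0.
    return bool(class_ids.get("Table"))


def has_table_id(class_ids: Dict[str, int]) -> bool:
    # Explicit None check: an id of 0 still counts as present.
    return class_ids.get("Table") is not None
```

With `{"Table": 0, "Title": 1}` the first function misses the table and the second finds it.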

1739 Commits