unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-12 00:18:56 +00:00

Author	SHA1	Message	Date
Sebastian Oßner	6066a264cb	fix: update container link in README.md (#2889 ) Just a tiny fix for a broken link that bothered me :) Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2024-05-20 09:24:22 -04:00
Weaviate Git Bot	60f10fe6dd	Updated Weaviate Docker image url (auto PR by bot) (#2659 ) This minor change updates the URL of the [Weaviate Docker image](https://weaviate.io/developers/weaviate/installation/docker-compose). Instead of the standard Docker registry, Weaviate now makes use of a custom registry running at `cr.weaviate.io`. Thanks in advance for merging. 🤖 beep boop, the Weaviate bot PS: Please note that the Weaviate Bot automates this PR; apologies if PR formatting is missing. If you have questions, feel free to reach out via our [forum](https://forum.weaviate.io) or [Slack](https://weaviate.io/slack). Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2024-05-20 09:23:07 -04:00
Tom Heaton	84cec1f40d	fix: Add `pip` as explicit dep in `environment.yml` to prevent warning (#2574 ) Removes this warning: > Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you. Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2024-05-20 09:22:24 -04:00
Na'aman Hirschfeld	8802535e95	chore: add py.typed (#3043 ) This PR adds `py.typed`, which will fix issues of the following type: (screenshot not uploaded) --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2024-05-20 08:52:04 -04:00
Matt Robinson	d7608014c0	improve: add Python 3.12 support (#3033 ) (#3047 ) ### Summary Closes #2959. Updates the dependency and CI to add support for Python 3.12. The MongoDB ingest tests were disabled due to jobs like [this one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333) failing due to issues with the `bson` package. `bson` is a dependency for the AstraDB connector, but `pymongo` does not work when `bson` is installed from `pip`. This issue is documented by MongoDB [here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun off #3049 to resolve this. Issue seems unrelated to Python 3.12, though unsure why this didn't surface previously. Disables the `argilla` tests because `argilla` does not yet support Python 3.12. We can add the `argilla` tests back in once the PR referenced below is merged. You can still use the `stage_for_argilla` function if you're on `python<3.12` and you install `argilla` yourself. - https://github.com/argilla-io/argilla/pull/4837 --------- Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>	2024-05-19 23:03:15 +00:00
Christine Straub	76831f154b	refactor: `partition_pdf()` pass `kwargs` through `fast` strategy pipeline (#3040 ) This PR aims to pass `kwargs` through `fast` strategy pipeline, which was missing as part of the previous PR - https://github.com/Unstructured-IO/unstructured/pull/3030. I also did some code refactoring in this PR, so I recommend reviewing this PR commit by commit. ### Summary - pass `kwargs` through `fast` strategy pipeline, which will allow users to specify additional params like `sort_mode` - refactor: code reorganization - cut a release for `0.14.0` ### Testing CI should pass 0.14.0	2024-05-17 20:55:11 +00:00
Matt Robinson	9cd0e706ab	fix: reenable arm64 builds for docker (#3045 ) ### Summary Closes #3034 and reenables ARM64 in the docker build and publish job. This was taken out in #3039 because we've only built `libreoffice` for AMD64 and not ARM64. If Chainguard publishes an `apk` for `libreoffice`, we can support a Chainguard image for both architectures. The smoke test now differs for both architectures, to reflect differences in the directory structure. ### Testing Build and publish ran successfully for ARM64 (job [here](https://github.com/Unstructured-IO/unstructured/actions/runs/9129712470/job/25104907497)) and AMD64 (job [here](https://github.com/Unstructured-IO/unstructured/actions/runs/9129712470/job/25104907826)).	2024-05-17 19:27:20 +00:00
amadeusz-ds	1c8b2b23eb	feat: add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR config parameters (#3014 ) This PR introduces GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR controlling where temporary files are stored during partition flow, via tempfile.tempdir. #### Edit: Renamed prefixes from STORAGE_ to UNSTRUCTURED_CACHE_ #### Edit 2: Renamed prefixes from UNSTRUCTURED_CACHE to GLOBAL_WORKING_DIR_	2024-05-17 19:16:10 +00:00
Matt Robinson	ec987dcbb2	BREAKING CHANGE: revert table extraction off by default for PDFs and images (#3035 ) ### Summary Closes #3021 . Turns table extraction for PDFs and images off by default. The default behavior originally changed in #2588 . The reason for reversion is that some users did not realize turning off table extraction was an option and experience long processing times for PDFs and images with the new default behavior. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2024-05-17 15:28:11 +00:00
David Potter	df8d39a4d4	fix: allow AstraDB to prevent indexing on metadata columns with long text (#3003 ) Thanks to @erichare from AstraDB Adds support for specifying the indexing options for various columns in Astra DB, allowing users to avoid a situation where long text columns are by-default indexed. Changes to: test_unstructured_ingest/python/test-ingest-astra-output.py are forward looking from AstraDB	2024-05-17 04:12:37 +00:00
Matt Robinson	934f1a464a	fix: disable arm build for chainguard (#3039 ) ### Summary Temporarily disables the ARM build due to the error in [this CI job](https://github.com/Unstructured-IO/unstructured/actions/runs/9114507405/job/25058629166). Will add back support for ARM using the rockylinux container once we show this works.	2024-05-17 00:22:10 +00:00
Steve Canny	f320889b4f	feat(docx): add strategy parameter to DOC and ODT (#3042 ) Summary Because DOCX now supports the `strategy` argument to control aspects of image extraction, `partition_doc()` and `partition_odt()` need to support it too because they delegate partitioning to `partition_docx()`. This will allow image extraction to work the same way for those two additional document-types.	2024-05-16 22:14:02 +00:00
Steve Canny	8644a3b09a	fix(odt): fix disk-space leak in partition_odt() (#3037 ) Remedy disk-space leak where `partition_odt()` would leave an on-disk copy of each `.odt` file passed as a file-like object. `partition_odt()` creates a temporary file in which it writes each source-document provided as a file-like object. This file is not deleted and disk consumption grows without bound. The `convert_and_partition_docx()` function used to convert ODT->DOCX uses `pandoc` (a command-line program) to do the conversion. Because this command-line program operates in a different memory space, the source file cannot be passed as an in-memory object and needs to be on the filesystem. When the ODT source-document is passed as a file-like object, it is written to disk so the conversion program has access to it. It is not deleted afterward. Fix this by writing the temporary source ODT file in a `TemporaryDirectory` and also use that location to write the conversion-target DOCX file. That directory is automatically removed when `partition_odt()` completes. While we're in there, improve the factoring of `partition_odt()`. - Extract `convert_and_partition_docx()` from `partition.docx` (used only by `partition_odt()`) to `_convert_odt_to_docx()` in `partition.odt` where it is used. Decouple file conversion from calling `partition_docx()` with the converted file as the `partition_docx()` call is `partition_odt()`'s natural responsibility. - Improve docstrings, typing, and comments. - All tests pass both before and after.	2024-05-16 20:04:10 +00:00
Steve Canny	0de9215db4	fix: use raw strings for regex patterns (#3029 ) Summary Avoid `SyntaxWarning` and/or `SyntaxError` messages when importing `unstructured.nlp.patterns` by using raw strings (`"r"` prefix) for regex patterns which may contain `\x` character sequences not recognized by the Python parser for normal strings. Fixes: #2495	2024-05-16 16:50:25 +00:00
Jan Kanty Milczek	e6ada05c55	Feat: form parsing placeholders (#3034 ) Allows introduction of form extraction in the future - sets up the FormKeysValues element & format, puts in an empty function call in the partition_pdf_or_image pipeline.	2024-05-16 14:21:31 +00:00
Christine Straub	1fb0fe5cf5	enhancement: `partition_pdf()` skip unnecessary element sorting (#3030 ) This PR aims to skip element sorting when determining whether embedded text can be extracted. The extracted elements in this step are returned as final elements only for the `fast` strategy pipeline and are never used for other strategy pipelines (`hi_res`, `ocr`). Removing element sorting in this step and adding it to the `fast` strategy pipeline later will improve performance and reduce execution time. ### Summary - skip element sorting when determining whether embedded text can be extracted. - add `_partition_pdf_with_pdfparser()` function for `fast` strategy pipeline ### Testing CI should pass.	2024-05-16 06:02:56 +00:00
Taha Yassine	ecdfb7a07c	fix: update link in readme (#2493 ) Fix link to the "Run the library in a container" section.	2024-05-15 22:48:53 -07:00
Steve Canny	aeca8bef88	rfctr(odt): organize and improve test_odt.py (#3031 ) Summary In preparation for adding more tests related to image extraction, improve the `partition_odt()` test suite: - Add type annotations to type-check clean on strict mode. - Improve test names. - Simplify tests where possible. - Remove a couple duplicated tests	2024-05-16 01:04:06 +00:00
Matt Robinson	612905e311	build: wolfi base image for Dockerfile (#3016 ) ### Summary Updates the `Dockerfile` to use the Chainguard `wolfi-base` image to reduce CVEs. Also adds a step in the docker publish job that scans the images and checks for CVEs before publishing. The job will fail if there are high or critical vulnerabilities. ### Testing Run `make docker-run-dev` and then `python3.11` once you're in. At that point, you can try: ```python from unstructured.partition.auto import partition elements = partition(filename="example-docs/DA-1p.pdf", skip_infer_table_types=["pdf"]) elements ``` Stop the container once you're done.	2024-05-15 22:53:15 +00:00
Steve Canny	094e3542cb	feat(docx): add strategy parameter to partition_docx() (#3026 ) Summary The behavior of an image sub-partitioner can be partially determined by the partitioning strategy, for example whether it is "hi_res" or "fast". Add this parameter to `partition_docx()` so it can pass it along to `DocxPartitionerOptions` which will make it available to any image sub-partitioners.	2024-05-15 21:05:32 +00:00
Steve Canny	a164b01c7e	rfctr(doc): spruce up test_doc.py (#3024 ) Summary In preparation for adding more tests related to image extraction, improve the `partition_doc()` test suite: - Remove redundant DOCX -> DOC file conversions on most tests. - Add type annotations to type-check clean on strict mode. - Improve test names. - Simplify tests where possible. - Remove one duplicated test Speed was roughly doubled: 24 tests in 20s -> 23 tests in 8s.	2024-05-15 18:32:51 +00:00
Steve Canny	b1b8eae359	fix(doc): fix disk-space leak (#3019 ) Summary Remedy disk-space leak where `partition_doc()` would leave a copy of each `.doc` file passed as a file-like object on disk. Additional Context `partition_doc()` creates a temporary file in which it writes each source-document provided as a file-like object. This file is not deleted and disk consumption grows without bound. The `convert_office_doc()` function used to convert DOC->DOCX uses a command-line program provided with LibreOffice to do the conversion. Because this command-line program operates in a different memory space, the source file cannot be passed as an in-memory object and needs to be on the filesystem. When the DOC file is passed as a file-like object, it is written to disk so the conversion program has access to it. It is not deleted afterward. Fix this by writing the temporary source DOC file in the TemporaryDirectory already being used to write the conversion-target DOCX file. That directory is automatically removed when `partition_doc()` completes.	2024-05-15 16:33:42 +00:00
Steve Canny	12b30d2810	rfctr(docx): extract DocxPartitionerOptions (#3018 ) Reviewers: Probably easier to review first and second commits separately as the first one adds all the new code and tests (without installing it), and the second one installs it into the partitioner along with the required changes to code and tests. Summary Enable communication of partitioning options to sub-partitioners, in particular to the pluggable `PicturePartitioner` coming in a closely subsequent PR to implement image-extraction and OCR for DOCX, DOC, and ODT formats. Additional Context In general, validation of partitioning options as well as assigning default values and computing derived partitioning settings can be extracted from partitioners into a neatly encapsulated separate object. This simplifies the core partitioning code by removing the noise associated with computing metadata values and deciding how to access the source document, etc. However, better factoring aside, having the partition-time "settings" available in a single object allows partitioning of certain document features, for example images, to be readily _delegated_ to a sub-partitioner while still giving it access to all the relevant partitioning settings for the current document. This is particularly important when a sub-partitioner is "pluggable" at runtime and must rely on a clearly-defined (and simple as possible) interface to operate smoothly.	2024-05-15 00:50:31 +00:00
Steve Canny	db186dc23b	rfctr(doc): organize test_doc.py (#3017 ) Summary Organize DOC tests into related groups with markers. This makes it easier to assess coverage and find tests related to particular behaviors. This is in preparation for adding tests related to DOC image extraction. No code changes, purely line-block moves. - Move module-level fixtures to the bottom. - Organize tests into related groups with markers.	2024-05-14 20:57:31 +00:00
Steve Canny	b4a6009c09	rfctr(docx): improve typing etc. in prep for docx image extraction (#3015 ) Summary Noisy but trivial changes to `partition_docx()` environs and tests in preparation for DOCX image extraction. These changes are extracted here so they don't distract on the changes of substance to follow in the next PR.	2024-05-14 19:32:17 +00:00
Steve Canny	3f8e6b79c5	rfctr(docx): move docx unit tests to bottom (#3011 ) No code changes, strictly this single block move. Move `Describe_DocxPartitioner` unit-test class to bottom so `DescribeDocxPartitionerOptions` unit-test to follow in subsequent commit will be together with it. Integration tests first, then unit tests, for consistency with other test modules e.g. test_pptx. I added `Describe_DocxPartitioner` soon after I arrived, before we adopted the convention of placing unit-tests after integration tests. Move this so we can maintain that consistency with the block of tests to follow in a closely subsequent PR.	2024-05-13 22:05:12 +00:00
Matt Robinson	f4b01a4aad	build(deps): bump versions for security hygiene (#3008 ) ### Summary Version bumps to keep on top of security scans.	2024-05-13 15:30:09 +00:00
John	45d7bcb399	get params with defaults (#3004 ) Extract repeated logic into `get_call_args_with_defaults` function	2024-05-13 13:56:55 +00:00
Steve Canny	e4c895923d	fix(csv): partition_csv() raises on long lines (#2998 ) Summary The CSV delimiter-sniffer requires whole lines to properly detect the delimiter character. Limiting bytes read produced partial lines when lines were very long. Limit bytes but read whole lines. Fixes #2643.	2024-05-10 21:19:31 +00:00
John	8eee14d589	paragraph grouper type hint (#3002 ) Fix type hint for paragraph_grouper param. `paragraph_grouper` can be set to `False`, but the type hint did not reflect this previously.	2024-05-10 19:37:07 +00:00
John	593aa47802	fix: ppt parameters include_page_breaks and include_slide_notes (#2996 ) Pass the parameters `include_slide_notes` and `include_page_breaks` to `partition_pptx` from `partition_ppt`. Also update the .ppt example doc we use for testing so it has slide notes and a PageBreak (and second page)	2024-05-10 17:57:36 +00:00
John	293a4a1152	remove unused links param (#3001 ) The `links` param in `partition_pdf` was never used by the partitioner, but added when that metadata element was created. This removes the unused parameter since `links` are extracted during partitioning.	2024-05-10 16:55:17 +00:00
Michał Martyniak	f8c119a777	Quickfix: re-apply skip accuracy calculation enhancement (#3000 ) Changes introduced [in this PR](https://github.com/Unstructured-IO/unstructured/pull/2977) have been overwritten by mistake after merging https://github.com/Unstructured-IO/unstructured/pull/2973 - git did not detect a conflict there due to redesign in evaluate.py	2024-05-10 11:13:33 +00:00
John	d829b669e6	Add starting_page_num param to partition_image (#2987 ) Add missing starting_page_num param to partition_image Closes #2985	2024-05-09 21:31:35 +00:00
Michał Martyniak	2f25d8f79e	Support for concurrent processing of documents during evaluation (#2973 ) Currently, CCT eval takes a long time for any of the test_metrics CI runs. Documents in an eval set are evaluated sequentially, and it appears that a max of 1 cpu core is currently utilized. This implies there could be a large speedup by running eval across multiple docs concurrently (probably with multiprocessing). Things done in this PR: - [x] concurrent.futures.ProcessPoolExecutor instead of sequential for-loop - [x] refactor/reorganization of redundant pieces of code without changing the inner logic too much. Without that we'd have 3 places where documents are being processed. Take a look at `BaseMetricsCalculator` class and classes that inherit from it. - [x] string paths manipulation is now reworked and relies on `pathlib.Path()`	2024-05-09 21:25:47 +00:00
amadeusz-ds	648ec33b44	feat(evaluation): skip accuracy calculation (#2977 ) Skip accuracy calculation for files for which output and ground truth sizes differ greatly. ~10% speed up on local machine, keeping the same metrics. --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2024-05-09 08:01:08 +00:00
John	e15adb418b	fix param types for partition_image and _pdf (#2988 ) Make the `filename` and `file` params for `partition_image` and `partition_pdf` match the other partitioners	2024-05-09 03:02:43 +00:00
Christine Straub	b64a48440d	chore: bump unstructured-inference 0.7.31 (#2981 ) 0.13.7	2024-05-08 16:26:58 +00:00
John	ef47d530f6	feat: add chunking to partition_tsv (#2982 ) Closes #2980	2024-05-07 23:09:27 +00:00
Yao You	668dd0122f	remove unnecessary warning log (#2978 ) The warning log about default model not set is no longer needed. This PR removes this log to reduce confusion.	2024-05-07 17:04:44 +00:00
Pluto	4397dd6a10	Add calculation of table related metrics based on table_as_cells (#2898 ) This pull request adds metrics that are calculated based on table_as_cells instead of text_as_html. This change is required for comprehensive metrics calculation, as previously every colspan or rowspan predicted was considered an incorrect prediction (even if it was a correct prediction). This change has to be merged after https://github.com/Unstructured-IO/unstructured/pull/2892 which introduces table_as_cells field	2024-05-07 13:57:38 +00:00
Christine Straub	0cd07d78f9	feat: `partition_pdf()` add ability to get `cid` ratio (#2970 ) This PR adds the ability to get the ratio of `cid` characters in embedded text extracted by `pdfminer`. This PR is the second part of moving `cid` related code from `unstructured-inference` to `unstructured` and works together with https://github.com/Unstructured-IO/unstructured-inference/pull/342.	2024-05-04 05:21:27 +00:00
Steve Canny	cb55245f70	rfctr: extract OCRAgent.get_agent() out of PDF subtree (#2965 ) Summary File-types other than PDF need to use OCR on extracted images. Extract `OCRAgent.get_agent()` such that any file-type partitioner can use it without risking dependency on PDF-only extras.	2024-05-03 19:39:22 +00:00
Steve Canny	17c2d075a8	rfctr improve partitioner typing (#2963 ) Summary Remedy the persistent type errors when importing `unstructured`. Give the partitioner type annotations a general scrubbing while we're at it.	2024-05-03 16:11:55 +00:00
Steve Canny	39b74a2370	fix(test): Remedy macOS-only test failure not triggered by CI (#2957 ) Summary A crude and OS-specific mechanism was used to detect when a path represented a temp-file. Change that to be robust across operating systems and localized configurations. The specific problem was for DOC files but this PR fixes it for PPT too which was prone to the same problem.	2024-05-02 18:21:18 +00:00
Steve Canny	7dea2fa4a1	rfctr: tidy up ppt+doc tests (#2956 ) Summary Make tests for DOC and PPT formats more concise and readable in preparation for adding one or two.	2024-05-02 16:00:00 +00:00
Steve Canny	601594d373	fix(docx): fix short-row DOCX table (#2943 ) Summary The DOCX format allows a table row to start late and/or end early, meaning cells at the beginning or end of a row can be omitted. While there are legitimate uses for this capability, using it in practice is relatively rare. However, it can happen unintentionally when adjusting cell borders with the mouse. Accommodate this case and generate accurate `.text` and `.metadata.text_as_html` for these tables.	2024-05-02 00:45:52 +00:00
Steve Canny	eff84afe24	chore: update python-docx version dependency (#2952 ) Summary `unstructured` will use table features added in the most recent version of `python-docx`. Also update the `lxml` version constraint because `lxml>4.9.2` will not install on Apple Silicon (https://github.com/Unstructured-IO/unstructured/issues/1707). `python-docx` requires `lxml` although other file formats require it as well.	2024-05-01 21:36:31 +00:00
Yuming Long	542d442699	chore CORE-4775: remove html page number metadata field (#2942 ) ### Summary Remove page_number metadata fields until we have page counting for all kinds of html files (not just limited to news articles with multiple `<article>` tags) ### Test Unit tests `test_add_chunking_strategy_on_partition_html_respects_multipage` and `test_add_chunking_strategy_title_on_partition_auto_respects_multipage` removed since they rely on the `page_number` fields from the SEC html file - now test moved to mock test for chunk_by_title -> revisit those tests when we find test file for this Also changed the element ids from partition outputs for html files - element id change due to page number change (in element id hashing) -> todo ticket: update other deterministic element id tests per crag's comment --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2024-04-30 15:20:26 +00:00
Marco Lüthy	0d80886578	fix: parse URL response Content-Type according to RFC 9110 (#2950 ) Currently, `file_and_type_from_url()` does not correctly handle the `Content-Type` header. Specifically, it assumes that the header contains only the mime-type (e.g. `text/html`), however, [RFC 9110](https://www.rfc-editor.org/rfc/rfc9110#field.content-type) allows for additional directives — specifically the `charset` — to be returned in the header. This leads to a `ValueError` when loading a URL with a response Content-Type header such as `text/html; charset=UTF-8`. To reproduce the issue: ```python from unstructured.partition.auto import partition url = "https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/" partition(url=url) ``` Which will result in the following exception: ```python { "name": "ValueError", "message": "Invalid file. The FileType.UNK file type is not supported in partition.", "stack": "--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[1], line 4 1 from unstructured.partition.auto import partition 3 url = \"https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/\" ----> 4 partition(url=url) File ~/miniconda3/envs/ai-tasks/lib/python3.11/site-packages/unstructured/partition/auto.py:541, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, date_from_file_object, starting_page_number, **kwargs) 539 else: 540 msg = \"Invalid file\" if not filename else f\"Invalid file 
{filename}\" --> 541 raise ValueError(f\"{msg}. The {filetype} file type is not supported in partition.\") 543 for element in elements: 544 element.metadata.url = url ValueError: Invalid file. The FileType.UNK file type is not supported in partition." } ``` This PR fixes the issue by parsing the mime-type out of the `Content-Type` header string. Closes #2257 0.13.6	2024-04-29 22:53:44 -07:00
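
Commit 0d80886578 above says the fix parses the mime-type out of the `Content-Type` header string. The following is a minimal sketch of that idea only, not the code merged in the PR; the helper name `mime_type_from_content_type` is hypothetical and does not come from `unstructured`.

```python
def mime_type_from_content_type(header_value: str) -> str:
    """Return the bare media type from a Content-Type header value.

    Hypothetical helper for illustration; not part of `unstructured`.
    """
    # RFC 9110 allows parameters after the media type, separated by ";",
    # e.g. "text/html; charset=UTF-8". Keep only the part before the first ";".
    media_type, _, _parameters = header_value.partition(";")
    # Type and subtype are case-insensitive, so normalise to lower case.
    return media_type.strip().lower()


print(mime_type_from_content_type("text/html; charset=UTF-8"))
```

With this approach a header carrying a `charset` parameter and a header carrying only the mime-type resolve to the same value, so the file-type lookup sees `text/html` in both cases.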

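Commit e4c895923d above describes limiting the bytes read for the CSV delimiter-sniffer while still reading whole lines. A minimal sketch of that approach with the standard library follows; the function name `sniff_delimiter`, the `max_bytes` default, and the candidate delimiter set are assumptions made for illustration and are not taken from `unstructured`.

```python
import csv
import io
from typing import TextIO


def sniff_delimiter(file: TextIO, max_bytes: int = 8192) -> str:
    """Guess the delimiter from a bounded sample made of whole lines.

    Hypothetical helper for illustration; not the library's implementation.
    """
    # readlines(hint) stops once the total size read exceeds `hint`, but it
    # never returns a partial line, so a very long first line stays intact.
    sample_lines = file.readlines(max_bytes)
    sample = "".join(sample_lines)
    # Restrict the sniffer to a small set of plausible delimiters (assumed).
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    return dialect.delimiter


print(sniff_delimiter(io.StringIO("a;b;c\n1;2;3\n4;5;6\n")))
```

Reading by line count rather than by a fixed byte slice trades a strict upper bound on the sample size for the guarantee that the sniffer never sees a truncated row.
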

1447 Commits