**Summary**
In preparation for pluggable auto-partitioners, simplify metadata
processing as discussed.
**Additional Context**
- Pluggable auto-partitioners require partitioners to have a consistent
call signature. An arbitrary partitioner provided at runtime needs to
have a call signature that is known and consistent, essentially
`partition_x(filename, *, file, **kwargs)`.
- The current `auto.partition()` is highly coupled to each distinct
file-type partitioner, deciding which arguments to forward to each.
- This is driven by the existence of "delegating" partitioners, those
that convert their file-type and then call a second partitioner to do
the actual partitioning. Both the delegating and proxy partitioners are
decorated with metadata-post-processing decorators and those decorators
are not idempotent. We call the situation where those decorators would
run twice "double-decorating". For example, EPUB converts to HTML and
calls `partition_html()` and both `partition_epub()` and
`partition_html()` are decorated.
- In the past, double-decorating has been avoided by not sending the
arguments the metadata decorators are sensitive to on to the proxy
partitioner. This is obscure, complex to reason about, error-prone, and
overall not a viable strategy. The better solution is to not decorate
delegating partitioners and let the proxy partitioner handle all the
metadata.
- The first step in that preparation is to simplify metadata processing
by removing unused or unwanted legacy parameters.
- `date_from_file_object` is a misnomer because a file object never
contains last-modified data (see the sketch after this list).
- It can never produce useful results in the API, where last-modified
information must be provided by `metadata_last_modified`.
- It is an undocumented parameter, so it is not in use.
- Using it can produce incorrect metadata.
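For illustration, a minimal sketch of why a file object cannot supply last-modified data (the `example.pdf` path is a placeholder): the timestamp lives in filesystem metadata attached to the path, not in the stream of bytes.

```python
import datetime as dt
import io
import os

def last_modified(filename: str) -> str:
    # Filesystem metadata is keyed by the path, so a filename can yield a timestamp.
    return dt.datetime.fromtimestamp(os.path.getmtime(filename)).isoformat()

print(last_modified("example.pdf"))  # placeholder path backed by real filesystem metadata

with open("example.pdf", "rb") as f:
    file = io.BytesIO(f.read())
# `file` carries only bytes; there is no mtime to read from it, so any
# "date from file object" value has to come from somewhere else entirely.
```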
**Summary**
In preparation for consolidating post-partitioning metadata decorators,
extract the `partition.common` module into a sub-package (directory) and
extract the `partition.common.metadata` module to house metadata-specific
objects shared by partitioners.
**Additional Context**
- This new module will be the home of the new consolidated metadata
decorator.
- The consolidated decorator is a step toward removing post-processing
decorators from _delegating_ partitioners. A delegating partitioner is
one that converts its file to a different format and "delegates" actual
partitioning to the partitioner for that target format. 10 of the 20
partitioners are delegating partitioners.
- Removing decorators from delegating partitioners will allow us to
avoid "double-decorating", i.e. running those decorators twice, once on
the principal partitioner and again on the proxy partitioner.
- This will allow us to send `**kwargs` to either partitioner, removing
the knowledge of which arguments to send for each file-type from
auto-partition.
- And this will allow pluggable auto-partitioners, which all share the
`partition_x(filename, *, file, **kwargs) -> list[Element]` interface
sketched below.
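For illustration, a minimal sketch (not the library's actual code) of the uniform signature every pluggable partitioner would share:

```python
from typing import IO, Optional

from unstructured.documents.elements import Element


def partition_x(
    filename: Optional[str] = None,
    *,
    file: Optional[IO[bytes]] = None,
    **kwargs,
) -> list[Element]:
    """Every pluggable partitioner exposes this same shape, so `auto.partition()`
    can forward `**kwargs` without file-type-specific argument routing."""
    ...
```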
This PR improves the `pdfminer` image cleanup process by repositioning
the duplicate image removal step: duplicated `pdfminer` images are now
removed before merging elements rather than after, which reduces
execution time and speeds up overall processing of PDF documents.
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
**Summary**
Remove dead code in `unstructured.file_utils`.
**Additional Context**
These modules were added in 12/2022 and 1/2023 and are not referenced by
any code. Removing to reduce unnecessary complexity. These can of course
be recovered from Git history if we decide we want them again in future.
This PR adds the requirement files for base and extras to the ingest
cache's hash key.
- The current workflow uses only the ingest requirements to generate the
hash key for the GitHub Actions cache.
- Sometimes only base or extra requirements (like extra-pdf.txt) are
updated but no ingest requirements are; the ingest test would then fetch
a cache with outdated non-ingest dependencies.
- When we generate a new ingest cache we actually do check the base and
extra requirements first and generate a base env before layering the
ingest dependencies on top.
- This PR allows the ingest step to recognize changes to non-ingest
dependencies and trigger new cache generation when either ingest or
base/extra requirement files change.
This PR also bumps the `setup-python` action version in the cache
actions and adds installation of `virtualenv` for the ingest cache
action to avoid errors like
https://github.com/Unstructured-IO/unstructured/actions/runs/10905551870/job/30265057515?pr=3641#step:3:111
This PR fixes high memory usage when computing intersection areas.
- it now converts the coordinates into half-precision floating point
numbers instead of double precision
- it removes some intermediate variables to free up memory
## test
Using a memory profiler like `memory_profiler` in `ipython`:
```ipython
## cell 1
from unstructured.partition.pdf_image.pdfminer_processing import areas_of_boxes_and_intersection_area
import numpy as np
%load_ext memory_profiler
## cell 2
%%memit
coords = np.random.rand(40000).reshape((10000,4)).astype(np.float16)
## cell 3
%%memit
inter_area, boxa_area, boxb_area = areas_of_boxes_and_intersection_area(coords, coords)
```
The peak memory and incremental memory from cell 3 should be close to
```
peak memory: 730.55 MiB, increment: 573.22 MiB
```
On the main branch `coords` is double precision, and running the same
code with
```
coords = np.random.rand(40000).reshape((10000,4)).astype(np.float64)
```
would result in peak memory usage of more than 4 GiB.
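For a rough sense of why the dtype matters, a back-of-the-envelope sketch (assuming the intermediates are pairwise 10,000 x 10,000 arrays, as in the example above):

```python
import numpy as np

n = 10_000
print(n * n * np.dtype(np.float64).itemsize / 2**20)  # ~762.9 MiB per pairwise intermediate
print(n * n * np.dtype(np.float16).itemsize / 2**20)  # ~190.7 MiB per pairwise intermediate
```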
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Bumps max version of `protobuf<5.0` and sets min version of
`chromadb>0.4.14` in `requirements/ingest/chroma.in`. Also fixes some
type hints in `unstructured/ingest/v2/processes/connectors/chroma.py`
This PR implements splitting of `pdfminer` elements (`groups of text
chunks`) into smaller bounding boxes (`text lines`). This implementation
prevents loss of information from the object detection model and
facilitates more effective removal of duplicated `pdfminer` text. This
PR also addresses #3430.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
- Remove constraint pins for `Office365-REST-Python-Client`,
`weaviate-client`, and `platformdirs`. Removing the pin for `Office365`
brought to light some bugs in the Onedrive connector, so some changes
were also made to
`unstructured/ingest/v2/processes/connectors/onedrive.py`.
- Also, as part of updating dependencies `unstructured-client` was
updated to `0.25.8`, which introduced a new default for the `strategy`
param and required updating a test fixture.
- The `hubspot.sh` integration test was failing and is now ignored in CI
with this PR per discussion with @rbiseck3.
May be easiest to review commit-by-commit.
Adds the bash script `process-pdf-parallel-through-api.sh` that allows
splitting up a PDF into smaller parts (splits) to be processed through
the API concurrently, and is re-entrant. If any of the splits fail to
process, one can attempt reprocessing those splits by rerunning the
script.
Note: requires the `qpdf` command line utility.
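For readers curious about the core mechanism, a rough Python sketch of the split step (assumptions: `qpdf` is on `PATH` and the page count is known; the real work is done by the bash script described above):

```python
import subprocess


def split_pdf(src: str, pages_per_split: int, total_pages: int) -> list[str]:
    """Split `src` into page-range PDFs, e.g. pages_1_to_6.pdf, pages_7_to_12.pdf, ..."""
    parts = []
    for start in range(1, total_pages + 1, pages_per_split):
        end = min(start + pages_per_split - 1, total_pages)
        out = f"pages_{start}_to_{end}.pdf"
        subprocess.run(
            ["qpdf", src, "--pages", src, f"{start}-{end}", "--", out], check=True
        )
        parts.append(out)
    return parts
```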
The below command line output shows the scenario where just one split
had to be reprocessed through the API to create the final
`layout-parser-paper_combined.json` output.
```
$ BATCH_SIZE=20 PDF_SPLIT_PAGE_SIZE=6 STRATEGY=hi_res \
./scripts/user/process-pdf-parallel-through-api.sh example-docs/pdf/layout-parser-paper.pdf
> % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
Skipping processing for /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_1_to_6.json as it already exists.
Skipping processing for /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_7_to_12.json as it already exists.
Valid JSON output created: /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_13_to_16.json
Processing complete. Combined JSON saved to /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_combined.json
```
Bonus change to `unstructured-get-json.sh` to point to the standard
hosted Serverless API, but allow using the Free API with `--freemium`.
This PR:
- changes the interface of the analysis tools to expose drawing params
as function parameters rather than `env_config` (environment variables)
- restructures the analysis package
This PR aims to expand removal of `pdfminer` elements to include those
inside all `non-pdfminer` elements, not just `tables`.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
This PR vectorizes the computation of element overlap to speed up the
deduplication of extracted elements.
## test
This PR adds unit tests for the new vectorized IOU and subregion
computation functions.
In addition, running partition on large files with many elements like
this slide:
[002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf)
shows a reduction of runtime from around 15min on the main branch to
less than 4min with this branch.
Profiling results show that the new implementation greatly reduces the
computation time; most of the remaining time is spent on getting the
coordinates from a list of bboxes.
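For reference, a minimal sketch of the vectorized-IOU idea using numpy broadcasting (an illustration of the approach, not the library's exact implementation; boxes are `(x1, y1, x2, y2)`):

```python
import numpy as np


def vectorized_iou(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    # Broadcast (n, 1, 4) against (1, m, 4) to get all pairwise overlaps at once.
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)
```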

This PR changes the way the analysis tools can be used:
- by default if `analysis` is set to `True` in `partition_pdf` and the
strategy is resolved to `hi_res`:
- for each file, 4 layout dumps are produced and saved as JSON files
(`object_detection`, `extracted`, `ocr`, `final`), in a similar way to
the current `object_detection` dump
- the drawing functions/classes now accept these dumps accordingly
instead of the internal class instances (like `TextRegion`,
`DocumentLayout`)
- this makes it possible to use the lightweight JSON files to render
the bboxes of a given file after partitioning is done
- `_partition_pdf_or_image_local` has been refactored and most of the
analysis code is now encapsulated in the `save_analysis_artifiacts`
function
- to do this, a helper function `render_bboxes_for_file` is added
<img width="338" alt="Screenshot 2024-08-28 at 14 37 56"
src="https://github.com/user-attachments/assets/10b6fbbd-7824-448d-8c11-52fc1b1b0dd0">
### Summary
Updates the file detection logic for OLE files to check the storage
content of the file to more reliably differentiate between DOC, PPT,
XLS, and MSG files. This corrects a bug that caused file type detection
to be incorrect in cases where the `filetype` library guessed an
incorrect MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file.
As part of this work, the `"msg"` extra was removed because the
`python-oxmsg` package is now a base dependency.
### Testing
Using a test `.msg` file that returns `'application/vnd.ms-excel'` from
`filetype.guess_mime`.
```python
from unstructured.file_utils.filetype import detect_filetype
filename = "test-file.msg"
detect_filetype(filename=filename) # result should be FileType.MSG
```
**Summary**
Do not assume MSG format when an OLE "container" file cannot be
differentiated into DOC, PPT, XLS, or MSG. Fall back to extension-based
identification in that case.
**Additional Context**
DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly,
a Microsoft-proprietary Zip format which "contains" a filesystem of
discrete files and directories.
An OLE "container" is easily identified by inspecting the first 8 bytes
of the file, so all we need to do is differentiate between the four
subtypes we can process. The `filetype` module does a good job of this
but it is not perfect and does not identify MSG files.
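A minimal sketch of that 8-byte check, using the standard OLE/Compound File signature (illustration only, not the library's implementation):

```python
OLE_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def is_ole_container(filename: str) -> bool:
    with open(filename, "rb") as f:
        return f.read(8) == OLE_SIGNATURE
```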
Previously we assumed MSG format when none of DOC, PPT, or XLS was
detected, but we discovered that `filetype` is not completely reliable
at detecting these types.
Change the behavior to remove the assumption of MSG format:
`_OleFileDifferentiator` returns `None` in this case and file-type
detection falls back to the filename extension.
Note that a file with no filename and no `metadata_filename`, or one
with an incorrect extension, will not be correctly identified in this
case; however, we assume for now that this will be rare in practice.
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html`
HTML introduced by `partition_docx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.
**Additional Context**
- nested tables appear as their extracted text in the parent cell (no
nested `<table>` elements in `.text_as_html`).
- DOCX `.text_as_html` is minified (no extra whitespace or `<thead>`,
`<tbody>`, `<tfoot>` elements).
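A hedged illustration of inspecting the result (the `tables.docx` filename is a placeholder):

```python
from unstructured.partition.docx import partition_docx

elements = partition_docx(filename="tables.docx")  # placeholder document containing a table
table = next(e for e in elements if e.category == "Table")
# Minified HTML: no <thead>/<tbody>/<tfoot>, nested tables flattened to their text.
print(table.metadata.text_as_html)
```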
Closes #3543.
### Summary
This PR addresses an issue with the NLTK data download process.
Previously, when downloading NLTK data, a nested "nltk_data" directory
was created within the parent "nltk_data" directory if the parent
directory already existed. This redundant directory structure led to two
significant problems:
- errors in checking if data had already been downloaded, potentially
causing redundant downloads in subsequent calls.
- failures in loading models from the downloaded NLTK data due to
incorrect path resolution.
This fix modifies the NLTK data download logic to prevent creation of
unnecessary nested directories. If the download path ends with
"nltk_data" and that directory already exists, we now use the existing
directory instead of creating a new nested one.
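A minimal sketch of the path handling described above (illustration only, not the library's exact code):

```python
import os


def resolve_nltk_data_dir(download_dir: str) -> str:
    # Reuse an existing ".../nltk_data" directory rather than creating a
    # redundant nested "nltk_data/nltk_data" inside it.
    if os.path.basename(download_dir.rstrip(os.sep)) == "nltk_data" and os.path.isdir(download_dir):
        return download_dir
    return os.path.join(download_dir, "nltk_data")
```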
### Testing
CI should pass.
### Summary
Bumps to `nltk==3.9.1` and resolves
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705). An
NLTK version bump was originally introduced in #3512 and rolled back in
#3527 because `nltk==3.8.2` was yanked from PyPI, and also because we
observed significant slowdowns in processing time after bumping to
`nltk==3.8.2`. The processing time regression does not appear in
`nltk==3.9.1`.
### Testing
After the bump, CI should pass. Additionally, we verified locally that
file processing takes around the amount of time we would expect for a
long `.docx` file.
```python
In [1]: from unstructured.partition.auto import partition
In [2]: filename = "test-doc.docx"
In [3]: %timeit partition(filename=filename)
3.92 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
**Summary**
Use a more sophisticated algorithm for splitting oversized `Table`
elements into `TableChunk` elements during chunking to ensure element
text and HTML are "synchronized" and HTML is always parseable.
**Additional Context**
Table splitting now has the following characteristics:
- `TableChunk.metadata.text_as_html` is always a parseable HTML
`<table>` subtree.
- `TableChunk.text` is always the text in the HTML version of the table
fragment in `.metadata.text_as_html`. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger
than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger
than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and
unneeded elements or attributes. This maximizes the semantic "density"
of each chunk.
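A hedged usage sketch of observing this behavior (the file name and size limit are placeholders; `chunk_elements` is assumed to be the basic chunking entry point):

```python
from unstructured.chunking.basic import chunk_elements
from unstructured.partition.docx import partition_docx

elements = partition_docx(filename="big-table.docx")  # placeholder document with a large table
chunks = chunk_elements(elements, max_characters=500)
for chunk in chunks:
    if chunk.metadata.text_as_html is not None:
        # For TableChunk items, `chunk.text` is the text of the parseable
        # <table> fragment in `chunk.metadata.text_as_html`.
        print(type(chunk).__name__, chunk.metadata.text_as_html)
```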
This PR reverts the `pytesseract` dependency to the
`unstructured.pytesseract` fork due to the unavailability of some recent
release versions of `pytesseract` on PyPI.
This PR also addresses an issue encountered during the publication of
`unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI
does not allow direct dependencies from Version Control System URLs like
GitHub in the `install_requires` or `extras_require` sections of the
`setup.py` file.
### Summary
Updates to the latest `wolfi-base` image to pull in more recent
package versions. A notable update is that upgrading to
`libreoffice==24.2.5.2` resolves several CVEs.
---------
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Closes #3521.
This PR resolves an installation error with `pytesseract>=0.3.12` that
occurred during `pip install unstructured[pdf]==0.15.3`.
### Testing
**Run following command in main branch and this PR**
```
pip uninstall -y pytesseract && pip install ".[pdf]"
```
**Results**
- `main` branch
```
INFO: pip is looking at multiple versions of unstructured[pdf] to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement pytesseract>=0.3.12; extra == "pdf" (from unstructured[pdf]) (from versions: 0.1, 0.1.3, 0.1.4, 0.1.5, 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.2.0, 0.2.2, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.3.7, 0.3.8, 0.3.9, 0.3.10)
ERROR: No matching distribution found for pytesseract>=0.3.12; extra == "pdf"
```
- this `PR`
`pytesseract-0.3.13` should be installed successfully.
This PR removes the custom index URL for `paddlepaddle` installation in
`extra-paddleocr.in`, resolving a `setup.py` configuration error. It now
uses `paddlepaddle==3.0.0b1` directly from PyPI, simplifying the
installation process.
---------
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>