unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-01 18:43:04 +00:00

Author	SHA1	Message	Date
Steve Canny	718891a447	rfctr(part): remove double-decoration 5 (#3692 ) Summary Remove double-decoration from EML and MSG. Additional Context - These needed to wait until the end because `partition_email()` and `partition_msg()` can use any other partitioner for one of their attachments. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-10-04 21:01:32 +00:00
Steve Canny	4711a8dc26	rfctr(part): remove double-decoration 4 (#3690 ) Summary Install new `@apply_metadata()` on TXT. Additional Context - Both EML and MSG delegate to both HTML and TXT to partition the message-body, depending on which MIME-part body payload is selected (`text/plain` or `text/html`). This PR prepares the way to remove decorators from EML and MSG in the next PR. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-10-03 16:41:31 +00:00
Steve Canny	9bd91a836e	rfctr(part): remove double-decoration 3 (#3687 ) Summary Install new `@apply_metadata()` on HTML and remove decorators from delegating partitioners EPUB, MD, ORG, RST, and RTF. Additional Context - All five of these delegating partitioners delegate to `partition_html()` so they're something of a matched set. EML and MSG also partially delegate to HTML but that's a harder problem (they also delegate to all other partitioners for attachments) that we'll address a couple PRs later . - Replace use of `@process_metadata()` and `@add_metadata_with_filetype()` decorators with `@apply_metadata()` on `partition_html()`. - Remove all decorators from delegating partitioners; this removes the "double-decorating".	2024-10-02 21:04:37 +00:00
Steve Canny	17092198d0	rfctr(part): remove double-decoration 2 (#3686 ) Summary Install new `@apply_metadata()` on PPTX, TSV, XLSX, and XML and remove decoration from PPT. Additional Context - Alphabetical order turns out to be hard, so this is the remaining "easy" delegating partitioner and the remaining principal partitioners. - Replace use of `@process_metadata()` and `@add_metadata_with_filetype()` decorators with `@apply_metadata()` on principal partitioners (those that do not delegate to other partitioners). - Remove all decorators from delegating partitioners (PPT in this case); this removes the "double-decorating".	2024-10-02 18:52:59 +00:00
Steve Canny	bba60260b2	rfctr(part): remove double-decoration 1 (#3685 ) Summary Install new `@apply_metadata()` on CSV and DOCX and remove decoration from DOC and ODT. Additional Context - Working in alphabetical order and keeping PR size manageable, replace use of `@process_metadata()` and `@add_metadata_with_filetype()` decorators with `@apply_metadata()` on principal partitioners (those that do not delegate to other partitioners). - Remove all decorators from delegating partitioners (DOC and ODT in this case); this removes the "double-decorating".	2024-10-01 22:40:58 +00:00
Steve Canny	c05e1babf1	rfctr(meta): refine @apply_metadata() decorator (#3667 ) Summary Refine `@apply_metadata()` replacement decorator. Note it has not been installed yet. - Apply `metadata_last_modified` arg with the `@apply_metadata()` decorator. No need for redundant processing in each partitioner. - Add "unique-ify" step to fix any cases where the same `Element` or `ElementMetadata` instance was used more than once in the element stream. This prevents unexpected "multi-mutation" in downstream processes. - Apply "global" metadata items before computing hash-ids. In particular, `.metadata.filename` is used in the hash computation and will produce different results if that's not already settled. - Compute hash-ids _before_ computing `.metadata.parent_id`. This removes the need for mapping UUID element-ids to their hash counterpart and doing a fixup of `.parent_id` after applying hash-ids to elements. Additional Context - The `@apply_metadata()` decorator replaces the four metadata-related decorators: `@process_metadata()`, `@add_metadata_with_filetype()`, `@add_metadata()`, and `@add_filetype()`. - It will be installed on each partitioner in a series of following PRs.	2024-10-01 21:28:32 +00:00
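The "unique-ify" step described in c05e1babf1 can be illustrated with a minimal sketch. This is not the library's implementation: `Elem`, `Meta`, and `uniqueify()` are hypothetical stand-ins for `Element`, `ElementMetadata`, and whatever the decorator does internally. It only shows the idea of giving every position in the element stream its own instance so that a later mutation touches exactly one element.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Meta:
    """Hypothetical stand-in for ElementMetadata."""
    filename: str = ""


@dataclass
class Elem:
    """Hypothetical stand-in for Element."""
    text: str
    metadata: Meta = field(default_factory=Meta)


def uniqueify(elements):
    """Return a list in which no element or metadata instance appears twice."""
    seen_elems = set()
    seen_metas = set()
    result = []
    for el in elements:
        if id(el) in seen_elems:
            # Same element object used again: replace it with a full copy.
            el = copy.deepcopy(el)
        elif id(el.metadata) in seen_metas:
            # Distinct element sharing a metadata object: copy the metadata only.
            el.metadata = copy.deepcopy(el.metadata)
        seen_elems.add(id(el))
        seen_metas.add(id(el.metadata))
        result.append(el)
    return result
```

After this step, setting `.metadata.filename` on one element no longer changes any other element, which is the "multi-mutation" the commit message refers to.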
Tomasz Cąkała	75c4998bc7	Fix: partition on empty or whitespace-only text files (#3675 ) This is a fix for this [bug](https://github.com/Unstructured-IO/unstructured/issues/3674): auto partition fails on text files which are empty or contain only whitespace. Inference of the .txt file type fails if the file has only whitespace. To Reproduce: ``` from tempfile import NamedTemporaryFile from unstructured.partition.auto import partition with NamedTemporaryFile(mode="w", suffix=".txt") as f: f.write(" \n") f.seek(0) elements = partition(filename=f.name) ```	2024-09-28 21:16:33 -07:00
Steve Canny	50d75c47d3	rfctr(part): add new decorator to replace four (#3650 ) Summary In preparation for pluggable auto-partitioners, add a new metadata decorator to replace the four existing ones. Additional Context "Global" metadata items, those applied to all element on all partitioners, are applied using a decorator. Currently there are four decorators where there only needs to be one. Consolidate those into a single metadata decorator. One or two additional behaviors of the new decorator will allow us to remove decorators from delegating partitioners which is a prerequisite for pluggable auto-partitioners.	2024-09-25 23:15:50 +00:00
Steve Canny	44bad216f3	rfctr(part): prepare for pluggable auto-partitioners 3 (#3661 ) Summary Remove unused `include_metadata` parameter. Additional Context - The `include_metadata` parameter was originally added circa v0.7.12 as a mechanism for avoiding the "double-decorating" problem on delegating partitioners. - It turns out it doesn't fully address that problem, is now unused, and is unnecessary for the solution we'll be adding as part of pluggable partitioners. - Remove the unnecessary complexity introduced by this unused parameter.	2024-09-25 18:17:48 +00:00
Steve Canny	086b8d6f8a	rfctr(part): prepare for pluggable auto-partitioners 2 (#3657 ) Summary Step 2 in prep for pluggable auto-partitioners, remove `regex_metadata` field from `ElementMetadata`. Additional Context - "regex-metadata" was an experimental feature that didn't pan out. - It's implemented by one of the post-partitioning metadata decorators, so get rid of it as part of the cleanup before consolidating those decorators.	2024-09-24 17:33:25 +00:00
Yao You	903efb0c6d	fix: fix occasional key error when mapping parent id (#3658 ) This PR fixes an occasional `KeyError` when calling `assign_and_map_hash_ids`. - This happens when the input `elements` has duplicated element instances or metadata. - When there are duplications the logic to iterate through all elements and map their parent ids will raise an error when an already mapped parent id is up for mapping. - The fix adds a logic to check if the parent id exists in `old_to_new_mapping` and if it doesn't we skip mapping it ## test This PR adds a unit test on this case and the test would fail without the fix.	2024-09-24 16:39:11 +00:00
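The failure mode fixed in 903efb0c6d can be sketched as follows. The function and the dict-based elements are hypothetical; the real code lives in `assign_and_map_hash_ids` and works on `Element` objects. The point is only the membership check: when the same instance appears twice in the stream, its parent id has already been rewritten by the time it is visited again, so a bare lookup would raise `KeyError`.

```python
def remap_parent_ids(elements, old_to_new_mapping):
    """Rewrite parent ids in place, skipping ids that are not in the mapping."""
    for el in elements:
        parent_id = el.get("parent_id")
        if parent_id is None:
            continue
        # A duplicated instance was already remapped on its first visit, so its
        # parent id is now a *new* id and is absent from the mapping: skip it.
        if parent_id in old_to_new_mapping:
            el["parent_id"] = old_to_new_mapping[parent_id]
    return elements


# The same child object appears twice in the stream.
child = {"id": "uuid-2", "parent_id": "uuid-1"}
mapping = {"uuid-1": "hash-1", "uuid-2": "hash-2"}
remap_parent_ids([child, child], mapping)
```

Without the `in` check, the second visit would look up `"hash-1"` in the mapping and fail.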
Austin Walker	6428d19e5a	fix: update python SDK syntax for forward compatibility (#3656 ) Wrap the `shared.PartitionParameters` usage with `operations.PartitionRequest`. This syntax has been deprecated since v0.23.0 of the SDK, and will be unsupported in v0.26.0.	2024-09-24 16:37:38 +00:00
Steve Canny	3bab9d93e6	rfctr(part): prepare for pluggable auto-partitioners 1 (#3655 ) Summary In preparation for pluggable auto-partitioners simplify metadata as discussed. Additional Context - Pluggable auto-partitioners require partitioners to have a consistent call signature. An arbitrary partitioner provided at runtime needs to have a call signature that is known and consistent. Basically `partition_x(filename, *, file, **kwargs)`. - The current `auto.partition()` is highly coupled to each distinct file-type partitioner, deciding which arguments to forward to each. - This is driven by the existence of "delegating" partitioners, those that convert their file-type and then call a second partitioner to do the actual partitioning. Both the delegating and proxy partitioners are decorated with metadata-post-processing decorators and those decorators are not idempotent. We call the situation where those decorators would run twice "double-decorating". For example, EPUB converts to HTML and calls `partition_html()` and both `partition_epub()` and `partition_html()` are decorated. - The way double-decorating has been avoided in the past is to avoid sending the arguments the metadata decorators are sensitive to to the proxy partitioner. This is very obscure, complex to reason about, error-prone, and just overall not a viable strategy. The better solution is to not decorate delegating partitioners and let the proxy partitioner handle all the metadata. - This first step in preparation for that is part of simplifying the metadata processing by removing unused or unwanted legacy parameters. - `date_from_file_object` is a misnomer because a file-object never contains last-modified data. - It can never produce useful results in the API where last-modified information must be provided by `metadata_last_modified`. - It is an undocumented parameter so not in use. - Using it can produce incorrect metadata.	2024-09-23 22:23:10 +00:00
Steve Canny	03c2bf8f1f	rfctr(part): extract partition.common submodules (#3649 ) Summary In preparation for consolidating post-partitioning metadata decorators, extract `partition.common` module into a sub-package (directory) and extract `partition.common.metadata` module to house metadata-specific objects shared by partitioners. Additional Context - This new module will be the home of the new consolidated metadata decorator. - The consolidated decorator is a step toward removing post-processing decorators from _delegating_ partitioners. A delegating partitioner is one that converts its file to a different format and "delegates" actual partitioning to the partitioner for that target format. 10 of the 20 partitioners are delegating partitioners. - Removing decorators from delegating partitioners will allow us to avoid "double-decorating", i.e. running those decorators twice, once on the principal partitioner and again on the proxy partitioner. - This will allow us to send `**kwargs` to either partitioner, removing the knowledge of which arguments to send for each file-type from auto-partition. - And this will allow pluggable auto-partitioners which all have a `partition_x(filename, *, file, **kwargs) -> list[Element]` interface.	2024-09-20 20:35:28 +00:00
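The "double-decorating" problem described in 3bab9d93e6 can be reproduced with a toy example. Everything here is hypothetical (`count_applications`, the `*_demo` functions, dict-based elements); it stands in for the real metadata decorators only to show why a non-idempotent decorator must not wrap both a delegating partitioner and the partitioner it delegates to.

```python
import functools


def count_applications(func):
    """Hypothetical non-idempotent post-processing decorator."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        elements = func(*args, **kwargs)
        for el in elements:
            # Each run changes the result again, so running twice is observable.
            el["applied"] = el.get("applied", 0) + 1
        return elements
    return wrapper


@count_applications
def partition_html_demo(text):
    """Stand-in for the partitioner that does the actual work."""
    return [{"text": text}]


@count_applications
def partition_epub_double_demo(text):
    """Delegating partitioner that is also decorated: post-processing runs twice."""
    return partition_html_demo(text)


def partition_epub_single_demo(text):
    """Delegating partitioner left undecorated: post-processing runs once."""
    return partition_html_demo(text)
```

Calling `partition_epub_double_demo("x")` yields an element with `applied == 2`, while the undecorated delegating variant yields `applied == 1`, which is the behavior the refactoring series aims for.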
Matt Robinson	7d66a236f1	fix: correctly install mesa-gl for arm (#3647 ) ### Summary Fixes the `arm64` image builds, which will be available again starting in version `0.15.13`. A fix was implemented upstream in https://github.com/Unstructured-IO/base-images/pull/47 and a workaround that installed `x86` packages in the `unstructured` repo was removed. ### Testing See [this job](https://github.com/Unstructured-IO/unstructured/actions/runs/10948943594/job/30401108059?pr=3647) for a successful `arm64` build on the feature branch. 0.15.13	2024-09-20 13:32:47 +00:00
Christine Straub	0ed69a1ac3	refactor: pdfminer image cleanup (#3648 ) This PR aims to remove `clean_pdfminer_duplicate_image_elements()` function, as its functionality has already been integrated into the `remove_duplicate_elements()` function in [PR #3630](https://github.com/Unstructured-IO/unstructured/pull/3630).	2024-09-19 18:57:02 +00:00
Christine Straub	be88eef06f	perf: optimize pdfminer image cleanup process for improved performance (#3630 ) This PR enhances `pdfminer` image cleanup process by repositioning the duplicate image removal step. It optimizes the removal of duplicated pdfminer images by performing the cleanup before merging elements, rather than after. This improvement reduces execution time and enhances the overall processing speed of PDF documents. --------- Co-authored-by: Yao You <theyaoyou@gmail.com>	2024-09-19 14:05:05 +00:00
Steve Canny	cd074bb32b	chore(file): remove dead code (#3645 ) Summary Remove dead code in `unstructured.file_utils`. Additional Context These modules were added in 12/2022 and 1/2023 and are not referenced by any code. Removing to reduce unnecessary complexity. These can of course be recovered from Git history if we decide we want them again in future.	2024-09-19 06:45:33 +00:00
Yao You	22998354db	add requirements files to ingest cache hash key (#3641 ) This PR adds the requirement files for base and extras to the ingest cache's hash key. - The current workflow uses only the ingest requirements to generate the hash key for the GitHub Actions cache - Sometimes only base or extra requirements (like extra-pdf.txt) are updated but not any ingest requirements -> this would mean the ingest test would fetch a cache with outdated non-ingest dependencies - When we generate a new ingest cache we actually do first check base and extra requirements and generate a base env before layering the ingest dependencies on top. - This PR allows the ingest step to recognize changes to non-ingest dependencies and trigger new cache generation when either ingest or base/extra requirement files change. This PR also bumps the setup python action version in cache actions; it also adds installation of `virtualenv` for the ingest cache action to avoid errors like https://github.com/Unstructured-IO/unstructured/actions/runs/10905551870/job/30265057515?pr=3641#step:3:111	2024-09-18 18:39:14 -05:00
Yao You	2d3cd45b23	Fix/reduce memory usage (#3629 ) This PR fixes the high memory usage when computing intersection areas. - it now converts the coordinates into half precision floating point numbers instead of double - removes some intermediate variables to free up memory usage ## test Using a memory profiler like `memory_profiler` in `ipython`: ```ipython ## cell 1 from unstructured.partition.pdf_image.pdfminer_processing import areas_of_boxes_and_intersection_area import numpy as np %load_ext memory_profiler ## cell 2 %%memit coords = np.random.rand(40000).reshape((10000,4)).astype(np.float16) ## cell 3 %%memit inter_area, boxa_area, boxb_area = areas_of_boxes_and_intersection_area(coords, coords) ``` The peak memory and incremental memory from cell 3 should be close to ``` peak memory: 730.55 MiB, increment: 573.22 MiB ``` On main branch the `coords` is double precision and running the same code with ``` coords = np.random.rand(40000).reshape((10000,4)).astype(np.float64) ``` would result in peak memory usage more than 4GiB --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-09-17 11:00:26 -05:00
John	46e04b165a	build(deps): bump protobuf pin (#3625 ) Bumps max version of `protobuf<5.0` and sets min version of `chromadb>0.4.14` in `requirements/ingest/chroma.in`. Also fixes some type hints in `unstructured/ingest/v2/processes/connectors/chroma.py`	2024-09-16 19:39:47 +00:00
Matt Robinson	ba93f9a26a	fix: reenable arm64 build (#3626 ) ### Summary Reverts the CI change in #3624 and reenables the `arm64` build and publish steps.	2024-09-13 16:15:01 +00:00
Matt Robinson	8b7e5bbeac	fix: temporarily disable arm64 build (#3624 ) ### Summary Per [this job](https://github.com/Unstructured-IO/unstructured/actions/runs/10842120429/job/30087252047), `arm64` builds are currently failing, likely because the workaround for the broken `mesa-gl` package from the `wolfi` repository only works for `amd64`. Temporarily disabling the `arm64` build in order to push out the latest `amd64` image with security patches, then will revert and work the fix for the `arm64` image. - https://github.com/Unstructured-IO/base-images/pull/44 0.15.12	2024-09-13 13:47:39 +00:00
John	159b8a9082	remove more dependency pins (#3621 ) Remove `langchain-community>=0.2.5` and `wrapt>=1.14.0` pins and add `importlib-metadata>=8.5.0` pin	2024-09-13 01:55:14 +00:00
Christine Straub	87a88a3c87	feat: improve pdfminer element processing (#3618 ) This PR implements splitting of `pdfminer` elements (`groups of text chunks`) into smaller bounding boxes (`text lines`). This implementation prevents loss of information from the object detection model and facilitates more effective removal of duplicated `pdfminer` text. This PR also addresses #3430. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-09-12 21:17:27 +00:00
qued	639ca591d8	fix: Table metric typo (#3623 ) It looks like we put columns when we meant rows in one of the table metrics. @pravin-unstructured flagged this.	2024-09-12 19:47:53 +00:00
John	ab94c6c5d1	chore: remove pins (#3579 ) - Remove constraint pins for `Office365-REST-Python-Client`, `weaviate-client`, and `platformdirs`. Removing the pin for `Office365` brought to light some bugs in the Onedrive connector, so some changes were also made to `unstructured/ingest/v2/processes/connectors/onedrive.py`. - Also, as part of updating dependencies `unstructured-client` was updated to `0.25.8`, which introduced a new default for the `strategy` param and required updating a test fixture. - The `hubspot.sh` integration test was failing and is now ignored in CI with this PR per discussion with @rbiseck3. May be easiest to review commit-by-commit.	2024-09-12 13:48:59 +00:00
Roman Isecke	ebf16055d8	feat/add deprecation warning to all embed code (#3614 ) ### Description Related PR to move the code over: https://github.com/Unstructured-IO/unstructured-ingest/pull/92 Also removed the console script that exposes ingest.	2024-09-10 23:48:39 +00:00
cragwolfe	e9690b2738	feat: utility script to process large PDFs through the API by script (#3591 ) Adds the bash script `process-pdf-parallel-through-api.sh` that allows splitting up a PDF into smaller parts (splits) to be processed through the API concurrently, and is re-entrant. If any of the splits fail to process, one can attempt reprocessing those split(s) by rerunning the script. Note: requires the `qpdf` command line utility. The below command line output shows the scenario where just one split had to be reprocessed through the API to create the final `layout-parser-paper_combined.json` output. ``` $ BATCH_SIZE=20 PDF_SPLIT_PAGE_SIZE=6 STRATEGY=hi_res \ ./scripts/user/process-pdf-parallel-through-api.sh example-docs/pdf/layout-parser-paper.pdf > % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 Skipping processing for /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-pars\ er-paper_pages_1_to_6.json as it already exists. Skipping processing for /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_7_to_12.json as it already exists. Valid JSON output created: /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_13_to_16.json Processing complete. Combined JSON saved to /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_combined.json ``` Bonus change to `unstructured-get-json.sh` to point to the standard hosted Serverless API, but allow using the Free API with --freemium.	2024-09-10 11:40:35 -07:00
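The splitting and re-entrant behavior described in e9690b2738 can be sketched in a few lines. The script itself is bash and relies on `qpdf`; the helpers below (`page_ranges`, `pending_ranges`) are hypothetical and only model the bookkeeping: which page ranges exist for a given split size, and which of them still need processing on a rerun.

```python
def page_ranges(total_pages, split_size):
    """Return inclusive 1-based (first, last) page ranges of at most split_size pages."""
    if total_pages < 1 or split_size < 1:
        raise ValueError("total_pages and split_size must be positive")
    return [
        (start, min(start + split_size - 1, total_pages))
        for start in range(1, total_pages + 1, split_size)
    ]


def pending_ranges(ranges, completed):
    """Re-entrancy: keep only the ranges whose output does not exist yet."""
    done = set(completed)
    return [r for r in ranges if r not in done]


# A 16-page document with a split size of 6, as in the log output above.
ranges = page_ranges(16, 6)
todo = pending_ranges(ranges, completed=[(1, 6), (7, 12)])
```

With these inputs the ranges are pages 1 to 6, 7 to 12, and 13 to 16, and only the last one remains to be processed, matching the scenario shown in the commit message.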
cragwolfe	71208ca2ee	doc: emphasize deprecation of ingest (#3610 ) Given that unstructured-ingest is now maintained in [its own repo](https://github.com/Unstructured-IO/unstructured-ingest), update documentation references in this repo to point there. Note that the forked, deprecated unstructured.ingest [in this repo ](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest)will be removed in the near future, once CI is updated properly. 0.15.10	2024-09-09 16:03:44 -07:00
Matt Robinson	dc1128c21c	build(release): version 0.15.10 (#3609 ) ### Summary Release for version `0.15.10`.	2024-09-09 21:42:20 +00:00
Matt Robinson	cf32672bc5	build(deps): bumps for 2024-09-09 (#3608 ) ### Summary Dependency bumps for 2024-09-09.	2024-09-09 16:45:18 +00:00
cragwolfe	3bb0ee1e79	chore: fix tests breaking on main (#3603 ) Fix API tests (really more like integration tests) that run only on main. Also use less compute intensive files to speed up test time and remove a useless test. Tests in `test_unstructured/partition/test_api.py` pass, temporarily running outside of main per screenshot: ![image](https://github.com/user-attachments/assets/f15d440a-2574-40f2-98b4-adf57fbae704) https://github.com/Unstructured-IO/unstructured/actions/runs/10754098974/job/29824415513	2024-09-08 21:25:52 +00:00
Matt Robinson	c060467018	build(deps): bump cryptography version (#3599 ) ### Summary Bumps to the latest version of the `cryptography` library to address `GHSA-h4gh-qq45-vh27`.	2024-09-05 19:06:43 +00:00
Pawel Kmiecik	f25eb60585	fix: expose drawing options as function params rather than env config (#3598 ) This PR: - changes the interface of analysis tools to expose drawing params as function parameters rather than env_config (=environmental variables) - restructures analysis package	2024-09-05 15:51:43 +00:00
Christine Straub	acd070c5d5	feat: enhance `pdfminer` element cleanup (#3593 ) This PR aims to expand removal of `pdfminer` elements to include those inside all `non-pdfminer` elements, not just `tables`. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-09-04 12:02:50 +00:00
Yao You	d51fb134e6	Feat/improve iou speed (#3582 ) This PR vectorizes the computation of element overlap to speed up the deduplication process of extracted elements. ## test This PR adds unit tests for the new vectorized IOU and subregion computation functions. In addition, running partition on large files with many elements like this slide: [002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf) shows a reduction of runtime from around 15min on the main branch to less than 4min with this branch. Profiling results show that the new implementation greatly reduces the time cost of computation and now most of the time is spent on getting the coordinates from a list of bboxes. ![Screenshot 2024-08-30 at 9 29 27 PM](https://github.com/user-attachments/assets/6c186838-54c7-483b-ac3e-7342c23ff3a6)	2024-09-03 00:06:18 +00:00
Pawel Kmiecik	404f780bbb	feat: make analysis drawing more flexible (#3574 ) This PR changes the way the analysis tools can be used: - by default if `analysis` is set to `True` in `partition_pdf` and the strategy is resolved to `hi_res`: - for each file 4 layout dumps are produced and saved as JSON files (`object_detection`, `extracted`, `ocr`, `final`) - similar way to the current `object_detection` dump - the drawing functions/classes now accept these dumps accordingly instead of the internal classes instances (like `TextRegion`, `DocumentLayout`) - it makes it possible to use the lightweight JSON files to render the bboxes of a given file after the partition is done - `_partition_pdf_or_image_local` has been refactored and most of the analysis code is now encapsulated in `save_analysis_artifiacts` function - to do this, helper function `render_bboxes_for_file` is added <img width="338" alt="Screenshot 2024-08-28 at 14 37 56" src="https://github.com/user-attachments/assets/10b6fbbd-7824-448d-8c11-52fc1b1b0dd0">	2024-09-02 11:06:11 +00:00
Matt Robinson	04322d1632	build(deps): removed unnecessary jupyter deps (#3583 ) ### Summary Removes unnecessary `jupyter` and `ipython` dev dependencies to reduce CVE surface area.	2024-08-31 05:21:40 +00:00
Matt Robinson	6ba8135bf9	fix: check ole storage content to differentiate filetypes (#3581 ) ### Summary Updates the file detection logic for OLE files to check the storage content of the file to more reliably differentiate between DOC, PPT, XLS and MSG files. This corrects a bug that caused file type detection to be incorrect in cases where the `filetype` library guessed an incorrect MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file. As part of this work, the `"msg"` extra was removed because the `python-oxmsg` package is now a base dependency. ### Testing Using a test `.msg` file that returns `'application/vnd.ms-excel'` from `filetype.guess_mime`. ```python from unstructured.file_utils.filetype import detect_filetype filename = "test-file.msg" detect_filetype(filename=filename) # result should be FileType.MSG ``` 0.15.9	2024-08-30 15:12:46 -04:00
John	ddb6cb631d	chore: remove minimum version pins for pins older than 6 mo (#3577 ) Remove a number of pins in `requirements/deps/constraints.txt` and `make pip-compile`	2024-08-29 15:35:14 +00:00
Austin Walker	f440eb476c	feat: Support encoding parameter in partition_csv (#3564 ) See added test file. Added support for the encoding parameter, which can be passed directly to `pd.read_csv`.	2024-08-28 14:19:58 +00:00
John	f21c853ade	bug: fix file_conversion disk leak (#3562 ) Fix disk space leaks and Windows errors when accessing file.name on a NamedTemporaryFile Uses of `NamedTemporaryFile(..., delete=False)` and/or uses of `file.name` of NamedTemporaryFile have been replaced with TemporaryFileDirectory to avoid a known issue: - https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile - https://github.com/Unstructured-IO/unstructured/issues/3390 The first 7 commits each address an individual occurrence of the issue if reviewers want to review commit-by-commit.	2024-08-27 22:02:24 +00:00
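The pattern adopted in f21c853ade can be sketched with the standard library alone. The helper name and file name below are hypothetical; the sketch assumes the goal is to hand a real path to some converter without leaving files behind. Writing into a `tempfile.TemporaryDirectory()` means the file is closed before anything else opens it by name (which avoids the Windows restriction on reopening an open `NamedTemporaryFile`), and the whole directory is removed when the context exits.

```python
import os
import tempfile


def process_via_tempdir(data, filename="input.bin"):
    """Write data to a file in a temporary directory, use it by path, then clean up.

    Returns the file size seen while the file existed and the path it had.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, filename)
        with open(path, "wb") as f:
            f.write(data)
        # The file is closed here, so another process or library could open it
        # by path. os.path.getsize() stands in for that external consumer.
        size = os.path.getsize(path)
    # Leaving the context manager deletes the directory and everything in it.
    return size, path


size, path = process_via_tempdir(b"hello")
```

After the call returns, `path` no longer exists on disk, so nothing accumulates across repeated conversions.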
Matt Robinson	4194a07d12	build(deps): replace pillow-heif with pi-heif (#3571 ) ### Summary Closes #2664 and replaces `pillow-heif` with `pi-heif` due to more permissive licensing on the binary wheel for `pi-heif`. 0.15.8	2024-08-27 11:54:35 -04:00
David Potter	ddba928344	Potter/mixedbread embedder (#3513 ) Thanks to @huangrpablo and @juliuslipp we now have a mixedbread.ai embedder!	2024-08-27 14:52:13 +00:00
Christine Straub	affd997c39	refactor(ci): remove unused environment variables (#3568 ) This PR removes the unused env `TABLE_OCR` from CI.	2024-08-26 19:19:58 +00:00
Matt Robinson	09d84bc46b	build(deps): version bumps for 2024-08-26 (#3567 ) ### Summary Version bumps for 2024-08-26.	2024-08-26 15:15:25 -04:00
Christine Straub	ac10ba4fc1	build(deps): bump unstructured.paddleocr to 2.8.1.0 (#3561 ) ### Summary - Bump `unstructured.paddleocr` to 2.8.1.0 - Remove `opencv-python` and `opencv-contrib-python` constraint pins - Fix `0.15.7` changelog	2024-08-23 14:17:29 -07:00
Steve Canny	32bb77aafb	fix(file): no default OLE subtype (#3516 ) Summary Do not assume MSG format when an OLE "container" file cannot be differentiated into DOC, PPT, XLS, or MSG. Fall back to extension-based identification in that case. Additional Context DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly, a Microsoft-proprietary Zip format which "contains" a filesystem of discrete files and directories. An OLE "container" is easily identified by inspecting the first 8 bytes of the file, so all we need to do is differentiate between the four subtypes we can process. The `filetype` module does a good job of this but it is not perfect and does not identify MSG files. Previously we assumed MSG format when none of DOC, PPT, or XLS was detected, but we discovered that `filetype` is not completely reliable at detecting these types. Change the behavior to remove the assumption of MSG format. `_OleFileDifferentiator` returns `None` in this case and filetype detection falls back to use filename-extension. Note a file with no filename and no metadata_filename or an incorrect extension will not be correctly identified in this case, however we're assuming for now that will be rare in practice.	2024-08-22 19:16:53 +00:00
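The two-stage detection described in 32bb77aafb can be sketched as below. This is not the code of `_OleFileDifferentiator`; the function names and the extension table are hypothetical. The only external fact relied on is the 8-byte signature that opens every OLE (Compound File Binary) container. The sketch shows the container check plus the extension-based fallback used when the subtype cannot be determined from content.

```python
import os

# Signature found in the first 8 bytes of every OLE / Compound File Binary file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical extension table for the four OLE subtypes named in the commit.
_OLE_SUBTYPE_BY_EXTENSION = {".doc": "DOC", ".msg": "MSG", ".ppt": "PPT", ".xls": "XLS"}


def is_ole_container(head):
    """Return True when the leading bytes carry the OLE container signature."""
    return head[:8] == OLE_MAGIC


def subtype_from_extension(filename):
    """Fallback: guess the OLE subtype from the extension, or None if unknown."""
    extension = os.path.splitext(filename)[1].lower()
    return _OLE_SUBTYPE_BY_EXTENSION.get(extension)
```

Returning `None` for an unknown extension mirrors the commit's decision not to assume MSG by default; a caller would treat that as "file type not identified".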
John	b4a6aa5559	chore: remove fsspec pin (#3554 ) remove fsspec pin	2024-08-21 21:57:42 +00:00


1586 Commits