unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-14 17:37:27 +00:00

Author	SHA1	Message	Date
Ronny H	51427b3103	Renamed OpenAiEmbeddingConfig dataclass (#2546 )	2024-02-14 17:24:52 +00:00
Matt Robinson	882370022e	fix: don't treat double quote enclosed text as JSON (#2544 ) ### Summary Closes #2444. Treats JSON serializable content that results in a string as plain text. Even though this is valid JSON per [RFC 4627](https://www.ietf.org/rfc/rfc4627.txt), this is valid JSON, but in almost every cases were really want to treat this as a text file. ### Testing 1. Put `"This is not a JSON"` is a text file `notajson.txt` 2. Run the following ```python from unstructured.file_utils.filetype import _is_text_file_a_json _is_text_file_a_json(filename="notajson.txt") # Should be False ```	2024-02-14 13:41:43 +00:00
Christine Straub	d11a83ce65	refactor: embedded text processing modules (#2535 ) This PR is similar to ocr module refactoring PR - https://github.com/Unstructured-IO/unstructured/pull/2492. ### Summary - refactor "embedded text extraction" related modules to use decorator - `@requires_dependencies` on functions that require external libraries and import those libraries inside those functions instead of on module level. - add missing test cases for `pdf_image_utils.py` module to improve average test coverage ### Testing CI should pass.	2024-02-13 21:19:07 -08:00
Steve Canny	d9f8467187	fix(xlsx): xlsx subtable algorithm (#2534 ) Reviewers: It may be easier to review each of the two commits separately. The first adds the new `_SubtableParser` object with its unit-tests and the second one uses that object to replace the flawed existing subtable-parsing algorithm. Summary There are a cluster of bugs in `partition_xlsx()` that all derive from flaws in the algorithm we use to detect "subtables". These are encountered when the user wants to get multiple document-elements from each worksheet, which is the default (argument `find_subtable = True`). This PR replaces the flawed existing algorithm with a `_SubtableParser` object that encapsulates all that logic and has thorough unit-tests. Additional Context This is a summary of the failure cases. There are a few other cases but they're closely related and this was enough evidence and scope for my purposes. This PR fixes all these bugs: ```python # # -- ✅ CASE 1: There are no leading or trailing single-cell rows. # -> this subtable functions never get called, subtable is emitted as the only element # # a b -> Table(a, b, c, d) # c d # -- ✅ CASE 2: There is exactly one leading single-cell row. # -> Leading single-cell row emitted as `Title` element, core-table properly identified. # # a -> [ Title(a), # b c Table(b, c, d, e) ] # d e # -- ❌ CASE 3: There are two-or-more leading single-cell rows. # -> leading single-cell rows are included in subtable # # a -> [ Table(a, b, c, d, e, f) ] # b # c d # e f # -- ❌ CASE 4: There is exactly one trailing single-cell row. # -> core table is dropped. trailing single-cell row is emitted as Title # (this is the behavior in the reported bug) # # a b -> [ Title(e) ] # c d # e # -- ❌ CASE 5: There are two-or-more trailing single-cell rows. # -> core table is dropped. trailing single-cell rows are each emitted as a Title # # a b -> [ Title(e), # c d Title(f) ] # e # f # -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b c Table(b, c, d, e), # d e Title(f) ] # f # -- ✅ CASE 7: There are two leading and one trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g) ] # g # -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g), # g Title(h) ] # h # -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below. # -> First cell is mistakenly emitted as title, remaining cells are dropped. # # a b c -> [ Title(a) ] # -- ❌ CASE 10: Single-row subtable with one leading single-cell row. # -> Leading single-row cell is correctly identified as title, core-table is mis-identified # as a `Title` and truncated. # # a -> [ Title(a), # b c d Title(b) ] ```	2024-02-13 20:29:17 -08:00
David Potter	1a706771fa	feature: add octoai for embeddings (#2538 ) Thanks to Pedro at OctoAI we have a new embedding option. The following PR adds support for the use of OctoAI embeddings. Forked from the original OpenAI embeddings class. We removed the use of the LangChain adaptor, and use OpenAI's SDK directly instead. Also updated out-of-date example script. Including new test file for OctoAI. # Testing Get a token from our platform at: https://www.octoai.cloud/ For testing one can do the following: ``` export OCTOAI_TOKEN=<your octo token> python3 examples/embed/example_octoai.py ``` ## Testing done Validated running the above script from within a locally built container via `make docker-start-dev` --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-10 15:27:06 +00:00
David Potter	d11c70cf83	bug: fix check_connection for (#2497 ) fixes check_connection for: azure opensearch postgres For Azure, the check_connection in fsspec.py actually worked better. Adding check_connection for Databricks Volumes --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-09 14:33:12 +00:00
Steve Canny	dd6576c603	rfctr(xlsx): cleaning in prep for XLSX algorithm replacement (#2524 ) Reviewers: It may be faster to review each of the three commits separately since they are groomed to only make one type of change each (typing, docstrings, test-cleanup). Summary There are a cluster of bugs in `partition_xlsx()` that all derive from flaws in the algorithm we use to detect "subtables". These are encountered when the user wants to get multiple document-elements from each worksheet, which is the default (argument `find_subtable = True`). These commits clean up typing, lint, and other non-behavior-changing aspects of the code in preparation for installing a new algorithm that correctly identifies and partitions contiguous sub-regions of an Excel worksheet into distinct elements. Additional Context This is a summary of the failure cases. There are a few other cases but they're closely related and this was enough evidence and scope for my purposes: ```python # # -- ✅ CASE 1: There are no leading or trailing single-cell rows. # -> this subtable functions never get called, subtable is emitted as the only element # # a b -> Table(a, b, c, d) # c d # -- ✅ CASE 2: There is exactly one leading single-cell row. # -> Leading single-cell row emitted as `Title` element, core-table properly identified. # # a -> [ Title(a), # b c Table(b, c, d, e) ] # d e # -- ❌ CASE 3: There are two-or-more leading single-cell rows. # -> leading single-cell rows are included in subtable # # a -> [ Table(a, b, c, d, e, f) ] # b # c d # e f # -- ❌ CASE 4: There is exactly one trailing single-cell row. # -> core table is dropped. trailing single-cell row is emitted as Title # (this is the behavior in the reported bug) # # a b -> [ Title(e) ] # c d # e # -- ❌ CASE 5: There are two-or-more trailing single-cell rows. # -> core table is dropped. trailing single-cell rows are each emitted as a Title # # a b -> [ Title(e), # c d Title(f) ] # e # f # -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b c Table(b, c, d, e), # d e Title(f) ] # f # -- ✅ CASE 7: There are two leading and one trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g) ] # g # -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g), # g Title(h) ] # h # -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below. # -> First cell is mistakenly emitted as title, remaining cells are dropped. # # a b c -> [ Title(a) ] # -- ❌ CASE 10: Single-row subtable with one leading single-cell row. # -> Leading single-row cell is correctly identified as title, core-table is mis-identified # as a `Title` and truncated. # # a -> [ Title(a), # b c d Title(b) ] ```	2024-02-08 23:33:41 +00:00
Ahmet Melek	f9f2cacb58	build(release): release commit for 0.12.4 (#2525 ) Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> 0.12.4	2024-02-08 21:18:29 +00:00
Matt Robinson	389dbb63d7	fix: add missing dep files to manifest (#2516 ) ### Summary Closes #2484. Adds missing dependency files to `MANIFEST.in` so they are included in the Python distribution. Also updates the manifest to look for ingest dependencies in the `requirements/ingest` subdirectory. --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com> Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>	2024-02-08 01:30:13 +00:00
Matt Robinson	ccf0477080	enhancement: process `.p7s` files with `partition_email` (#2521 ) ### Summary Closes #2489, which reported an inability to process `.p7s` files. PR implements two changes: - If the user selected content type for the email is not available and there is another valid content type available, fall back to the other valid content type. - For signed message, extract the signature and add it to the metadata ### Testing ```python from unstructured.partition.auto import partition filename = "example-docs/eml/signed-doc.p7s" elements = partition(filename=filename) # should get a message about fall back logic print(elements[0]) # "This is a test" elements[0].metadata.to_dict() # Will see the signature ```	2024-02-07 22:31:49 +00:00
David Potter	0c834517d8	fix: change opensearch port (#2517 ) change opensearch port to see if fixes CI. We think there may be a conflict with the elasticsearch docker port. Also adding simple retry to vector query. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-07 21:25:04 +00:00
Ronny H	67a8fd9809	Added example to use SaaS API URL in partition_via_api (#2512 ) To test: > cd docs && make html Changelog: added an example to use `SaaS API URL` in `partition_via_api` using `api_url` param. --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2024-02-07 19:25:42 +00:00
Filip Knefel	5defe79bf2	docs: add information about MIME type of extracted images (#2515 ) Include information about what mime type is expected when extracting images. Co-authored-by: Filip Knefel <filip@unstructured.io>	2024-02-07 08:40:24 +00:00
Ahmet Melek	be71633415	refactor: isolate ingest dependencies into local scopes (#2509 ) This PR: - Moves ingest dependencies into local scopes to be able to import ingest connector classes without the need of installing imported external dependencies. This allows lightweight use of the classes (not the instances. to use the instances as intended you'll still need the dependencies). - Upgrades the embed module dependencies from `langchain` to `langchain-community` module (to pass CI [rather than introducing a pin]) - Does pip-compile - Does minor refactors in other files to pass `ruff 2.0` checks which were introduced by pip-compile	2024-02-06 21:28:55 +00:00
David Potter	138625438f	fix: add title to Vectara upload (#2511 ) Small improvement to Vectara requested by Ofer at Vectara In the "Document" construct, every document can have a title. If it's there, in the UI it will show up above the document (otherwise you get "Untitled") --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-06 19:49:53 +00:00
Christine Straub	29b9ea7ba6	refactor: ocr modules (#2492 ) The purpose of this PR is to refactor OCR-related modules to reduce unnecessary module imports to avoid potential issues (most likely due to a "circular import"). ### Summary - add `inference_utils` module (unstructured/partition/pdf_image/inference_utils.py) to define unstructured-inference library related utility functions, which will reduce importing unstructured-inference library functions in other files - add `conftest.py` in `test_unstructured/partition/pdf_image/` directory to define fixtures that are available to all tests in the same directory and its subdirectories ### Testing CI should pass	2024-02-06 17:11:55 +00:00
David Potter	0f0b58dfe7	bug: remove vectara requirements (#2491 ) I accidentally added Vectara to setup and makefile. But there are no dependencies for Vectara This removes Vectara from those files. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-01 20:41:53 +00:00
David Potter	c100ce28a7	feat: add Vectara destination connector (#2357 ) Thanks to Ofer at Vectara, we now have a Vectara destination connector. - There are no dependencies since it is all REST calls to API - --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-01 14:38:34 +00:00
Christine Straub	94001a208d	feat: improve table cell data (#2457 ) The purpose of this PR is to pass embedded text through table processing sub-pipeline later later use.	2024-02-01 05:29:19 +00:00
Christophe Jolif	ccc2302b33	feat: add the ability to specify a custom OCR besides the ones natively supported (#2462 ) This is nice to natively support both Tesseract and Paddle. However, one might already use another OCR and might want to keep using it (for quality reasons, for cost reasons etc...). This PR adds the ability for the user to specify its own OCR agent implementation that is then called by unstructured. I am new to unstructured so don't hesitate to let me know if you would prefer this being done differently and I will rework the PR. --------- Co-authored-by: Yao You <theyaoyou@gmail.com> Co-authored-by: Yao You <yao@unstructured.io>	2024-01-31 16:38:14 -06:00
Christine Straub	8b1de4c2b8	fix: `partition_pdf()` not working when using chipper model with file (#2479 ) Closes #2480. ### Summary - fixed an error introduced by PR [#2347](https://github.com/Unstructured-IO/unstructured/pull/2347) - https://github.com/Unstructured-IO/unstructured/pull/2347/files#diff-cefa2d296ae7ffcf5c28b5734d5c7d506fbdb225c05a0bc27c6b755d5424ffdaL373 - updated `test_partition_pdf_with_model_name()` to test more model names ### Testing The updated test function `test_partition_pdf_with_model_name()` should work on this branch, but fails on the `main` branch.	2024-01-31 17:36:59 +00:00
qued	399dd60311	build(deps): unpin pillow (#2472 ) Removed `pillow` pin and recompiled. I think it was originally there to address a conflict, which, as far as I can tell, no longer exists. Also a security vulnerability was discovered in the older version of `pillow`. #### Testing: CI should pass.	2024-01-30 21:29:08 +00:00
John	5adc04ac27	bug: fix typo in makefile (#2474 ) fix typo in makefile: `.PHONE` -> `.PHONY`	2024-01-30 18:12:35 +00:00
qued	007fc45739	chore: new black changes (#2473 ) Update `black` and apply changes to affected files. I separated this PR so we can have a look at the changes and decide whether we want to: 1. Go forward with the new formatting 2. Change the black config to make the old formatting valid 3. Get rid of black entirely and just use `ruff` 4. Do something I haven't thought of	2024-01-30 17:12:35 +00:00
John	db67805ec6	feat: add support for partitioning .heic files (#2454 ) .heic files are an image filetype we have not supported. #### Testing ``` from unstructured.partition.image import partition_image png_filename = "example-docs/DA-1p.png" heic_filename = "example-docs/DA-1p.heic" png_elements = partition_image(png_filename, strategy="hi_res") heic_elements = partition_image(heic_filename, strategy="hi_res") for i in range(len(heic_elements)): print(heic_elements[i].text == png_elements[i].text) ``` --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-01-30 04:49:00 +00:00
John	9320311a19	fix: check languages args (#2435 ) This PR is the last in a series of PRs for refactoring and fixing the language parameters (`languages` and `ocr_languages` so we can address incorrect input by users. See #2293 It is recommended to go though this PR commit-by-commit and note the commit message. The most significant commit is "update check_languages..."	2024-01-29 20:12:08 +00:00
Yao You	97fb10db4a	fix: default hi_res model rely on inference setting (#2441 ) - there are multiple places setting the default `hi_res_model_name` in both `unstructured` and `unstructured-inference` - they lead to inconsistency and unexpected behaviors - this fix removes a helper in `unstructured` that tries to set the default hi_res layout detection model; instead we rely on the `unstructured-inference` to provide that default when no explicit model name is passed in ## test ```bash UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true ipython ``` ```python from unstructured.partition.auto import partition # find a pdf file elements = partition("foo.pdf", strategy="hi_res") assert elements[0].metadata.detection_origin == "yolox" ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2024-01-29 16:44:41 +00:00
Ahmet Melek	4e226d818f	build(release): release commit for 0.12.3 (#2466 ) 0.12.3	2024-01-29 13:45:23 +00:00
David Potter	74dcca44ca	fix: link_texts was breaking postgres destination connector (#2460 ) Formatting of link_texts was breaking metadata storage. Turns out it didn't need any conforming and came in correctly from json. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-27 04:29:38 +00:00
Ronny H	d5a6f4b82c	Docs updates (#2458 ) To test: > cd docs && make html Change logs: * Updates the best practice for table extraction to use `skip_infer_table_types` instead of `pdf_infer_table_structure`. * Fixed CSS issue with a duplicate search box. * Fixed RST warning message * Fixed typo on the Intro page.	2024-01-25 20:31:28 +00:00
Antonio Jose Jimeno Yepes	d8b3bdb919	Check chipper version and prevent running pdfminer with chipper (#2347 ) We have added a new version of chipper (Chipperv3), which needs to allow unstructured to effective work with all the current Chipper versions. This implies resizing images with the appropriate resolution and make sure that Chipper elements are not sorted by unstructured. In addition, it seems that PDFMiner is being called when calling Chipper, which adds repeated elements from Chipper and PDFMiner. To evaluate this PR, you can test the code below with the attached PDF. The code writes a JSON file with the generated elements. The output can be examined with `cat out.un.json \| python -m json.tool`. There are three things to check: 1. The size of the image passed to Chipper, which can be identiied in the layout_height and layout_width attributes, which should have values 3301 and 2550 as shown in the example below: ``` [ { "element_id": "c0493a7872f227e4172c4192c5f48a06", "metadata": { "coordinates": { "layout_height": 3301, "layout_width": 2550, ``` 2. There should be no repeated elements. 3. Order should be closer to reading order. The script to run Chipper from unstructured is: ``` from unstructured import __version__ print(__version__.__version__) import json from unstructured.partition.auto import partition from unstructured.staging.base import elements_to_json elements = json.loads(elements_to_json(partition("Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf", strategy="hi_res", model_name="chipperv3"))) with open('out.un.json', 'w') as w: json.dump(elements, w) ``` [Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf](https://github.com/Unstructured-IO/unstructured/files/13817273/Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf) --------- Co-authored-by: Antonio Jimeno Yepes <antonio@unstructured.io>	2024-01-25 02:33:32 +00:00
Matt Robinson	4613e52e11	fix: treat yaml files as plain text (#2446 ) ### Summary Closes #2412. Adds support for YAML MIME types and treats them as plain text. In response to `500` errors that the API currently returns if the MIME type is `text/yaml`.	2024-01-24 17:48:36 +00:00
David Potter	9fea85dc21	fix: remove none value keys from flattened dictionary (#2442 ) When a partitioned or embedded document json has null values, those get converted to a dictionary with None values. This happens in the metadata. I have not see it in other keys. Chroma and Pinecone do not like those None values. `flatten_dict` has been modified with a `remove_none` arg to remove keys with None values. Also, Pinecone has been pinned at 2.2.4 because at 3.0 and above it breaks our code. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-23 21:52:11 +00:00
Matt Robinson	a155e7a43b	enhancement: add mongodb driver (#2445 ) ### Summary Adds a driver with `unstructured` version information to the MongoDB driver. ### Testing, Good to go as long as the MongoDB ingest test run successfully.	2024-01-23 17:45:31 +00:00
ryannikolaidis	44cb2ce645	fix: path to databricks-volumes requirements (#2443 ) setup.py is currently pointing to the wrong location for the databricks-volumes extra requirements. This PR updates to point to the correct location. ## Testing Tested by installing from local source with `pip install .`	2024-01-23 03:28:36 +00:00
Roman Isecke	a8de52e94f	feat: databricks volumes dest added (#2391 ) ### Description This adds in a destination connector to write content to the Databricks Unity Catalog Volumes service. Currently there is an internal account that can be used for testing manually but there is not dedicated account to use for testing so this is not being added to the automated ingest tests that get run in the CI. To test locally: ```shell #!/usr/bin/env bash path="testpath/$(uuidgen)" PYTHONPATH=. python ./unstructured/ingest/main.py local \ --num-processes 4 \ --output-dir azure-test \ --strategy fast \ --verbose \ --input-path example-docs/fake-memo.pdf \ --recursive \ databricks-volumes \ --catalog "utic-dev-tech-fixtures" \ --volume "small-pdf-set" \ --volume-path "$path" \ --username "$DATABRICKS_USERNAME" \ --password "$DATABRICKS_PASSWORD" \ --host "$DATABRICKS_HOST" ```	2024-01-23 01:25:51 +00:00
jakub-sandomierz-deepsense-ai	283f796f58	fix: Fix and use FSSpec dest check_connection (#2420 ) FSSpec destination connectors did not use `check_connection`. There was an error when trying to `ls` destination directory - it may not exist at the moment of creation of connector. Now `check_connection` calls `ls` on bucket root and this method is called on `initialize` of destination connector.	2024-01-22 09:31:02 +00:00
Ronny H	149f894d0a	Fixed sphinx-build error by pinning alabaster=-0.7.13 (#2436 ) 0.12.2	2024-01-20 14:36:48 -08:00
Ronny H	ea11118655	v0.12.2 changelog and version (#2434 )	2024-01-20 01:02:00 +00:00
Ronny H	4c772f6ed7	Updated docs on API Params and Filetype Supports (#2433 ) To test: > cd docs && make html Changelogs: * Fixed sphinx error due to malformed rst table on partition page * Updated API Params, ie. `extract_image_block_types` and `extract_image_block_to_payload` * Updated image filetype supports	2024-01-19 16:07:57 -08:00
Matt Robinson	2d3a7f1c48	fix: fix table index error by bumping `unstructured-inference` (#2430 ) ### Summary Closes #2417. Bumps `unstructured-inference` to pull in the fix implemented in https://github.com/Unstructured-IO/unstructured-inference/pull/317	2024-01-19 22:42:32 +00:00
John	c34fac9c3a	enhancement: add _clean_ocr_languages_arg helper function (#2413 ) This PR is one in a series of PRs for refactoring and fixing the languages parameter so it can address incorrect input by users. #2293 This PR adds _clean_ocr_languages_arg. There are no calls to this function yet, but it will be called in later PRs related to this series.	2024-01-19 19:59:08 +00:00
Ronny H	c81d4e34be	v0.12.1 release (#2429 ) 0.12.1	2024-01-19 17:56:06 +00:00
ryannikolaidis	2e97494613	fix: fsspec connectors returning data source version as integer (#2427 ) Connector data source versions should always be string values, however we were using the integer checksum value for the version for fsspec connectors. This casts that value to a string. ## Changes * Cast the checksum value to a string when assigning the version value for fsspec connectors. * Adds test to validate that these connectors will assign a string value when an integer checksum is fetched. ## Testing Unit test added.	2024-01-19 15:58:01 +00:00
Christine Straub	7378a378f6	enhancement: allow setting image block crop padding parameter (#2415 ) Closes #2320 . ### Summary In certain circumstances, adjusting the image block crop padding can improve image block extraction by preventing extracted image blocks from being clipped. ### Testing - PDF: [LM339-D_2-2.pdf](https://github.com/Unstructured-IO/unstructured/files/13968952/LM339-D_2-2.pdf) - Set two environment variables `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD` (e.g. `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD = 40`, `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD = 20` ``` elements = partition_pdf( filename="LM339-D_2-2.pdf", extract_image_block_types=["image"], ) ```	2024-01-19 06:28:32 +00:00
Ahmet Melek	1a305866d1	fix: chroma destination connector serialization (#2425 ) This fixes the serialization of the ChromaDB destination connector. Presence of the _collection object breaks serialization due to TypeError: cannot pickle 'module' object. This removes that object before serialization.	2024-01-19 02:36:25 +00:00
Ahmet Melek	a9ad8ac8d1	fix: update flatten dict to support flattening tuples (#2423 ) This PR updates flatten_dict function to support flattening tuples. This is necessary for objects like Coordinates, when the object is not written to the disk, therefore not being converted to a list before getting flattened.	2024-01-19 00:21:22 +00:00
John	fa9f6ccc17	refactor: use _get_iso639_language_object (#2424 ) This refactor removes `_convert_to_standard_langcode` and replaces it with calling `_get_iso639_language_object` with a string slice. Use of TESSERACT_LANGUAGES_AND_CODES, which was added to `_convert_to_standard_langcode` previously, is moved to the relevant part where `_convert_to_standard_langcode` was previously called. If/else statements replace the list comprehension for readability and `langdetect_langs.append("zho")` replaces `_convert_to_standard_langcode("zh")` since that always returned `"zho"`.	2024-01-19 00:14:45 +00:00
Austin Walker	cfee86f5de	chore: Update base image (#2426 ) Propagating the openssl revert made in the base image: https://github.com/Unstructured-IO/base-images/pull/13 Note that I messed up and wrote over the existing 9.2-9 image. Any current prs will need to rebase in order to get a working dockerfile.	2024-01-18 22:34:43 +00:00
David Potter	4a34765fdf	chore: add postgres extra (#2422 ) postgres missing in setup.py Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-18 16:32:53 +00:00

... 3 4 5 6 7 ...

1393 Commits