This change adds to our `add_chunking_strategy` logic so that we are
able to chunk Table elements' `text` and `text_as_html` params. To keep
the functionality under the same `by_title` chunking strategy, we have
renamed `combine_under_n_chars` to `max_characters`. It functions the
same way for combining elements under Titles, as well as specifying a
chunk size (in chars) for TableChunk elements.
*renaming the variable to `max_characters` will also reflect the 'hard
max' we will implement for large elements in follow-up PRs
Additionally, some lint changes snuck in when I ran `make tidy`, hence
the minor changes in unrelated files :)
TODO:
✅ add unit tests
--> note: added unit tests where I could! For some unit tests I just
clarified that the chunking strategy is now 'by_title', because we don't
have an example file with Table elements to test the
'by_num_characters' chunking strategy
✅ update changelog
To manually test:
```
In [1]: filename="example-docs/example-10k.html"
In [2]: from unstructured.chunking.title import chunk_table_element
In [3]: from unstructured.partition.auto import partition
In [4]: elements = partition(filename)
# element at -2 happens to be a Table, and we'll get chunks of char size 4 here
In [5]: chunks = chunk_table_element(elements[-2], 4)
# examine text and text_as_html params
In [6]: for c in chunks:
   ...:     print(c.text)
   ...:     print(c.metadata.text_as_html)
```
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
## Summary
This will increase the accuracy of hierarchies in HTML documents and
provide more accurate element categorization. If text is in an HTML
heading tag and is not a list item, categorize it as a title.
## Testing
```
from unstructured.partition.html import partition_html
elements = partition_html(url="https://www.eda.gov/grants/2015")
```
Before, the date headers at the given URL were not correctly parsed as
titles; after this change they are correctly identified.
A unit test verifying the functionality has been added:
`test_html_partition::test_html_heading_title_detection`, which includes
values that were previously detected as narrative text and uncategorized
text.
**Executive Summary**
Fixes a bug in the `get_word_bounding_box_from_element` function that
prevented `partition_pdf` from running.
**Technical Details**
- The function originally called `isalnum` on the first index up front.
It now uses a conditional on a flag value.
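For intuition, a minimal sketch (not the actual implementation) of tracking the current word's character class with a flag instead of calling `isalnum` on the first character up front:
```python
def split_words(chars: list) -> list:
    """Group characters into words, splitting when the character class
    (alphanumeric vs. non-alphanumeric) changes."""
    words, current, is_alnum = [], "", None  # is_alnum is the flag value
    for ch in chars:
        if ch.isspace():
            if current:
                words.append(current)
            current, is_alnum = "", None
        elif is_alnum is None or ch.isalnum() == is_alnum:
            current += ch
            is_alnum = ch.isalnum()
        else:
            # character class changed: close the word, start a new one
            words.append(current)
            current, is_alnum = ch, ch.isalnum()
    if current:
        words.append(current)
    return words
```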
### Description
Updates the Python version of the example docs to show how to run the
same code that the CLI runs, but using Python. Rather than copying the
command that would be run via the terminal and running it with the
subprocess library, this updates the docs to use the supported code
exposed in the inference directory.
For now only the Wikipedia doc has been updated, to get some opinions on
this before updating all the other connector docs.
Would close out
https://github.com/Unstructured-IO/unstructured/issues/1445
Fix 4 cases of text missing after partition:
1. Text immediately after `<body>`
```html
<body>
missing1
<div>hello</div>
</body>
```
2. Text inside container and immediately after `<br/>`
```html
<div>hello<br/>missing2</div>
```
3. Text immediately after a text opening tag, if said tag contains
`<br/>`
```html
<p>missing3<br/>hello</p>
```
4. Text inside `<body>` if it is the only content (different cause from
case 1)
```html
<body>missing4</body>
```
Also fixes a problem causing
`test_unstructured/documents/test_html.py::test_exclude_tag_types` to
not work as intended.
Closes GitHub issue #1543.
The home directory for our dockerfile changed and broke this script. To
verify, try running the benchmark script:
```
export DOCKER_TEST=true
./scripts/performance/benchmark.sh
```
I'll pull in the latest changelog before merging.
- bump `unstructured-inference` to `0.6.6`
- specify the default model name for element detection as
`detectron2_onnx` to keep current behavior
- NOTE: the updated inference package would use yolox as the element
detection model by default; this will be evaluated and enabled in a
separate PR
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Occasionally the ES test can fail because the index fails to be created
on the first try. Experiments show that adding a timeout doesn't help,
but adding a retry mitigates the issue. See the history of commits in
branch yao/bump-inference-to-0.6.6:
https://github.com/Unstructured-IO/unstructured/pull/1563
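The retry is conceptually like this minimal sketch (names hypothetical; the real test uses the Elasticsearch client's own index-creation call):
```python
import time

def create_index_with_retry(client, index_name, attempts=3, wait=1.0):
    """Retry index creation a few times instead of relying on a timeout."""
    for attempt in range(attempts):
        try:
            client.indices.create(index=index_name)
            return
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the failure
            time.sleep(wait)
```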
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Fixes
```
docker exec unstructured-smoke-test /bin/bash -c /home/notebook-user/test_unstructured_ingest/test-ingest-wikipedia.sh
/home/notebook-user/test_unstructured_ingest/test-ingest-wikipedia.sh: line 10: python: command not found
```
in
https://github.com/Unstructured-IO/unstructured/blob/6ad4971/scripts/docker-smoke-test.sh#L43
that was preventing docker images from being built.
Closes GH Issue #1233.
### Summary
- add functionality to shrink all bounding boxes along x and y axes
(still centered around the same center point) before running xy-cut sort
### Evaluation
Run the following command for this
[PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf):
```
PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
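For intuition, shrinking a box toward its center looks roughly like this sketch (not the library's code):
```python
def shrink_bbox(x1, y1, x2, y2, factor=0.9):
    """Shrink a bounding box toward its center by the given factor,
    keeping the same center point."""
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * factor / 2
    half_h = (y2 - y1) * factor / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)
```
Smaller boxes overlap their neighbors less, which can help xy-cut find cleaner horizontal and vertical cuts.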
- resolves an issue where the deltalake writer occasionally results in
SIGABRT on Linux, even though the writer finished writing the table
properly
- this was first observed in an ingest test
- putting the writer into a separate process mitigates this problem by
forcing Python to let the deltalake Rust backend finish its tasks
## Test
To test this it is best to set up an instance on a Linux system, since
the problem has only been observed on Linux so far. Run
```bash
PYTHONPATH=. ./unstructured/ingest/main.py delta-table --num-processes 2 --metadata-exclude coordinates,filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.date_created,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth --table-uri ../tables/delta/ --preserve-downloads --verbose delta-table --write-column json_data --mode overwrite --table-uri file:///tmp/delta
```
Without this fix we would occasionally encounter `SIGABRT`.
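The mitigation is conceptually this sketch (with a hypothetical `write_table` standing in for the actual deltalake write):
```python
from multiprocessing import Process

def write_in_subprocess(write_fn, *args, **kwargs):
    """Run the deltalake write in a child process so the Rust backend
    finishes all of its tasks before that interpreter exits."""
    proc = Process(target=write_fn, args=args, kwargs=kwargs)
    proc.start()
    proc.join()

# usage (hypothetical writer function):
# write_in_subprocess(write_table, table_uri="file:///tmp/delta", data=rows)
```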
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
This also follows what I have seen as the recommended way to define a
package like this.
Also bumps minor versions from pip compile
Testing:
`pip install -e .`
Everything should build as normal
```
❯ pip install -e .
Obtaining file:///Users/trevor/dev/unstructured
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting scarf@ https://packages.unstructured.io/scarf.tgz (from unstructured==0.10.17.dev16)
  Using cached https://packages.unstructured.io/scarf.tgz (1.1 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
```
When the new release goes out, I will test a plain pip install to verify
that functionality still works.
Closes #1236. Partitions XML documents iteratively in most cases*, never
loading the entire tree into memory. This ends up being much faster.
(* The exception is when the argument `xml_path` is passed to filter
elements. I was not able to find a way in Python to compare XPaths while
streaming the elements, aside from writing a custom XPath parser. So the
shortest way forward was to bite the bullet and load the whole tree in
memory when filtering by XPath.)
Memory usage is about 20% of usage on `main` when processing a 470MB XML
file. Time to process is 10s vs 900s.
Output is slightly different, but appears to be an improvement, adding
lines of text that are skipped in current partitioning. No text is lost.
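The streaming approach is essentially `lxml`'s `iterparse`; a minimal sketch (the actual element handling lives in the partitioner):
```python
from lxml import etree

def iter_leaf_text(filename):
    """Stream text from an XML file without building the whole tree."""
    for _, element in etree.iterparse(filename, events=("end",)):
        if element.text and element.text.strip():
            yield element.text.strip()
        element.clear()  # free the processed element so memory stays flat
```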
### Description
Optionally adds chunking to the CLI: adds a flag to trigger chunking and
exposes the parameters used by the `chunk_by_title` method. Chunking
runs before the embedding step.
Opened to replace original PR
https://github.com/Unstructured-IO/unstructured/pull/1531
Fixes https://github.com/Unstructured-IO/unstructured-api/issues/237
The problem:
The `ElementMetadata` class was not able to ignore fields that it didn't
know about. This surfaced in `partition_via_api` when the hosted API
schema is newer than the local `unstructured` version. In
`ElementMetadata.from_json()` we get errors such as `TypeError:
__init__() got an unexpected keyword argument 'parent_id'`.
The fix:
The `from_json` methods for these dataclasses should drop any unexpected
fields before calling `__init__`.
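The filtering is conceptually this sketch (not the exact implementation):
```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class MetadataSketch:  # stand-in for ElementMetadata
    filename: Optional[str] = None
    page_number: Optional[int] = None

    @classmethod
    def from_json(cls, data: dict) -> "MetadataSketch":
        # drop any fields this version doesn't know about before __init__
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})
```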
To verify:
This shouldn't throw an error
```
from unstructured.staging.base import elements_from_json
import json
test_api_result = json.dumps([
    {
        "type": "Title",
        "element_id": "2f7cc75f6467bba468022c4c2875335e",
        "metadata": {
            "filename": "layout-parser-paper.pdf",
            "filetype": "application/pdf",
            "page_number": 1,
            "new_field": "foo",
        },
        "text": "LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis",
    }
])
elements = elements_from_json(text=test_api_result)
print(elements)
```
**Executive Summary**
Adds PDF functionality to capture hyperlinks (external or internal) for
the pdf fast strategy, along with the associated text.
**Technical Details**
- `pdfminer` associates an `annotation` (links and URIs) with a bounding
box rather than with text. Therefore, the link and text matching is not
a perfect pairing but rather a logic-based match calculated from
bounding box overlap.
- There is no word-level bounding box, only character-level (accessed
using `LTChar`). Thus, to get to word level, a window is sliced through
the text. Alphanumeric and non-alphanumeric characters are captured
separately, meaning a word containing both is split at the first
non-alphanumeric character.
- The bounding box for a word is calculated from the start and stop
coordinates of the word obtained above. The calculation simply uses the
distance between two points.
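The overlap test between a link annotation's box and a word's box is roughly this sketch (coordinates as `(x1, y1, x2, y2)`; not the exact matching logic):
```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test between two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2
```
Words whose boxes overlap an annotation's box are taken as the link's associated text.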
The result now contains `links` in `metadata` as shown below:
```
"links": [
{
"text": "link",
"url": "https://github.com/Unstructured-IO/unstructured",
"start_index": 12
},
{
"text": "email",
"url": "mailto:unstructuredai@earlygrowth.com",
"start_index": 30
},
{
"text": "phone number",
"url": "tel:6505124019",
"start_index": 49
}
]
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
Improves hierarchy detection for docx files by leveraging the natural
hierarchies built into docx documents. Hierarchy can now be detected
from the indentation level of list bullets/numbers and from the style
name (e.g. Heading 1, List Bullet 2, List Number).
Hierarchy detection is improved by determining category depth via the
following:
1. Check if the paragraph item has an indentation level (ilvl) xpath -
these are typically on list bullets/numbers. Return the indentation
level if it exists.
2. Check the name of the paragraph style for any category depth
information (e.g. Heading 1 vs Heading 2, or List Bullet vs List Bullet
2). Return the category depth if found, else default to a depth of 0
(see the sketch after this list).
3. Check the paragraph ilvl via the paragraph's style name. Outside of
the paragraph's metadata, docx stores default ilvls for various style
names, which requires a complex lookup. This check is yet to be
implemented, as the above methods cover most use cases, but the
implementation is stubbed out.
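A sketch of the style-name check from step 2 (the regex and off-by-one handling are assumptions, not the exact implementation):
```python
import re

def depth_from_style_name(style_name: str) -> int:
    """Extract a trailing level number from style names like 'Heading 2'
    or 'List Bullet 3'; absent a number, default to depth 0."""
    match = re.search(r"(\d+)\s*$", style_name)
    return int(match.group(1)) - 1 if match else 0
```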
---
Co-authored-by: Steve Canny <stcanny@gmail.com>
### Description
This PR is two-fold:
**Embeddings:**
* Embeddings are incorporated into the sharepoint source connector,
which will now call out to OpenAI and create embeddings if the flag is
passed in and the API key is provided.
**Writing vector content (embeddings) to Azure cognitive search index:**
* The schema for the index expected to exist in Azure has been updated
to include the vector field type, and a test script has been added to
push the embedding content produced by the Sharepoint connector to the
index.
Some important notes about other changes in here:
* The embedding code had to be updated to patch the `to_dict` method on
elements so that `embeddings` is added to the dict output when present.
While the code originally added the embedding content, it was lost when
`to_dict` was called to save the content as JSON.
This refactor solves a problem or two, the big one being recursing into
group shapes to get all shapes on the slide, but mostly it lays the
groundwork to let us refine further aspects such as list-item detection,
off-slide shape detection, and image capture going forward.
### Summary
Uses `langdetect` to detect all languages present in the input document.
### Details
- Converts all language codes (whether user inputted or detected using
`langdetect`) to a standard ISO 639-3 code.
- Adds `languages` field to the metadata
- Will revisit how to represent simplified vs. traditional Chinese
scripts internally, which the standard codes do not distinguish
(separate PR).
- Updates ingest test results to add the `languages` field to documents.
Some other side effects are changes in the order of some elements and
changes in element categorization.
### Test
You can test the `detect_languages` function individually by importing
the function and inputting a text sample, and optionally a language:
```
from unstructured.partition.lang import detect_languages

text = "My lubimy mleko i chleb."
doc_langs = detect_languages(text)
print(doc_langs)
```
-> ['ces', 'pol', 'slk']
---------
Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
This updates the docker image download URL to pass through the Scarf
gateway, which allows anonymous tracking of downloads.
Related to:
https://github.com/Unstructured-IO/unstructured#chart_with_upwards_trend-analytics
Testing:
```
docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
```
Result:
The image should download.
* Partitions Salesforce data as XML instead of text for improved detail
and flexibility
* Partitions the `htmlbody` instead of the `textbody` for Salesforce
emails
### Description
New [Azure Cognitive
Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search)
destination connector added. It takes each JSON element from the files
created via partition and writes that content to an index.
**Bonus bug fix:** Due to a recent change where the default version of
Python used in the repo was bumped from `3.8` to `3.10`, running
`pip-compile` now runs against that version rather than the lowest
version we support, which is still `3.8`. This breaks the setup for the
lower versions because some of the versions pulled in by `pip-compile`
exist for `3.10` but not for `3.8`. `pip-compile` was updated to run as
a script that checks the version of Python being used first, which helps
guarantee that all dependencies meet the minimum Python version
requirement.
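The guard is conceptually a check like this sketch (the repo's actual script may differ):
```python
import sys

# pip-compile must run under the lowest supported Python so that pinned
# versions exist for every version we support
if sys.version_info[:2] != (3, 8):
    sys.exit(f"Run pip-compile with Python 3.8, not {sys.version.split()[0]}")
```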
Closes out https://github.com/Unstructured-IO/unstructured/issues/1466
@ron-unstructured reported that loading files with:
```
from unstructured.partition.pdf import partition_pdf
elements_yolox = partition_pdf(filename="1706.03762.pdf", strategy='hi_res', model_name="yolox")
print(elements_yolox)
```
throws an error. After debugging the execution I found that the issue is
that an object of class `Formula` is being created; however, this class
doesn't contain an `__init__` method. This PR solves the issue by adding
a constructor method with an empty string for the element.
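The fix is conceptually this sketch (base class and fields assumed):
```python
class Formula(Element):  # Element is the library's base element class
    """Element type for formulas; previously lacked an __init__."""

    def __init__(self, text: str = ""):
        # default to an empty string so construction never fails
        super().__init__()
        self.text = text
```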
The file can be found at:
https://drive.google.com/drive/folders/1hDumyps0hA4_d-GZxs3Hij15Cpa5fjWY?usp=sharing
After this PR is merged, this file is processed correctly.
We've created a custom domain, downloads.unstructured.io, that redirects
to quay.io (using https://scarf.sh/). This custom domain allows us to
swap the underlying container registry without impacting users. It also
provides us with important metrics about container and package usage,
without surfacing PII like IP addresses.
The Python package follows the same pattern at packages.unstructured.io.
Addresses
[#1332](https://github.com/Unstructured-IO/unstructured/issues/1332)
with `unstructured-inference` PR
[#208](https://github.com/Unstructured-IO/unstructured-inference/pull/208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if
`el.metadata.image_path` is `True`
### Testing
```
from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"
strategy = "hi_res"  # image extraction requires a layout-detection strategy

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path="<directory path>",
)
```
Closes https://github.com/Unstructured-IO/unstructured/issues/1319,
closes https://github.com/Unstructured-IO/unstructured/issues/1372
This module:
- implements EmbeddingEncoder classes which track embedding-related data
- implements an `embed_documents` method which receives a list of
Elements, obtains embeddings for the text within the Elements, updates
the Elements with an attribute named `embeddings`, and returns the
updated Elements (see the sketch below)
- uses langchain to obtain the embeddings
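A minimal usage sketch (encoder class name and parameters assumed from the description above):
```python
from unstructured.partition.auto import partition
from unstructured.embed.openai import OpenAIEmbeddingEncoder  # name assumed

elements = partition(filename="example-docs/fake-text.txt")
encoder = OpenAIEmbeddingEncoder(api_key="<OPENAI_API_KEY>")
elements = encoder.embed_documents(elements=elements)
print(elements[0].embeddings)  # embeddings attribute added to each Element
```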
-----
- The PR additionally fixes a JSON de-serialization issue on the
metadata fields.
To test the changes, run `examples/embed/example.py`