unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-06-27 02:30:08 +00:00

Author	SHA1	Message	Date
Roman Isecke	9049e4e2be	feat/remove ingest code, use new dep for tests (#3595 ) ### Description Alternative to https://github.com/Unstructured-IO/unstructured/pull/3572 but maintaining all ingest tests, running them by pulling in the latest version of unstructured-ingest. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com> Co-authored-by: Christine Straub <christinemstraub@gmail.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-10-15 10:01:34 -05:00
Steve Canny	cbe1b35621	rfctr(chunk): prep for adding TableSplitter (#3510 ) Summary Mechanical refactoring in preparation for adding (pre-chunk) `TableSplitter` in a PR stacked on this one.	2024-08-12 18:04:49 +00:00
Michał Martyniak	001fa17c86	Preparing the foundation for better element IDs (#2842 ) Part one of the issue described here: https://github.com/Unstructured-IO/unstructured/issues/2461 It does not change how hashing algorithm works, just reworks how ids are assigned: > Element ID Design Principles > > 1. A partitioning function can assign only one of two available ID types to a returned element: a hash or UUID. > 2. All elements that are returned come with an ID, which is never None. > 3. No matter which type of ID is used, it will always be in string format. > 4. Partitioning a document returns elements with hashes as their default IDs. Big thanks to @scanny for explaining the current design and suggesting ways to do it right, especially with chunking. Here's the next PR in line: https://github.com/Unstructured-IO/unstructured/pull/2673 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>	2024-04-16 21:14:53 +00:00
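The four ID principles in the commit above can be illustrated with a small sketch. Everything here is hypothetical: `SketchElement`, `assign_id`, and the choice of a truncated SHA-256 over the element text are assumptions made for illustration, not the library's actual hashing scheme.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass
class SketchElement:
    """Hypothetical stand-in for a partitioned element."""
    text: str
    id: str = ""


def assign_id(element: SketchElement, use_uuid: bool = False) -> SketchElement:
    """Give the element a string ID: a hash by default, a UUID on request."""
    if use_uuid:
        element.id = str(uuid.uuid4())
    else:
        # Assumption: hash only the text; the real scheme may use more inputs.
        digest = hashlib.sha256(element.text.encode("utf-8")).hexdigest()
        element.id = digest[:32]
    return element
```

With this sketch, the same text always yields the same hash ID, while the UUID variant yields a fresh string on every call; in both cases the ID is a string and never `None`.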
John	b6c1882cc3	chore: add tests and small fixes in utils.py (#2554 ) Linting and typing fixes, and add tests to improve test coverage in utils.py. On the main branch, run `coverage run -m pytest test_unstructured/test_utils.py` and then `coverage report -m unstructured/utils.py` to see test coverage for `utils.py`. Check out this branch and do the same. The percent coverage should increase to 88%. --------- Co-authored-by: David Potter <potterdavidm@gmail.com>	2024-03-06 21:58:10 +00:00
erjieyong	4d12c61cb8	added parent_element as output for overlapping cases (#2507 ) To provide more utility to the `catch_overlapping_and_nested_bboxes` and `identify_overlapping_or_nesting_case` functions, included parent_element as part of the output. This allows the user to - identify the parent element in the overlapping case: `nested {type} in {type}`. Currently, if the element types are the same, an example output would be `nested Image in Image`, which is confusing. - easily identify elements to keep or delete	2024-02-21 00:13:09 -08:00
Matt Robinson	882370022e	fix: don't treat double quote enclosed text as JSON (#2544 ) ### Summary Closes #2444. Treats JSON serializable content that results in a string as plain text. Even though this is valid JSON per [RFC 4627](https://www.ietf.org/rfc/rfc4627.txt), in almost all cases we really want to treat this as a text file. ### Testing 1. Put `"This is not a JSON"` in a text file `notajson.txt` 2. Run the following
```python
from unstructured.file_utils.filetype import _is_text_file_a_json

_is_text_file_a_json(filename="notajson.txt")  # Should be False
```
	2024-02-14 13:41:43 +00:00
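The rule described in the commit above (content that parses as JSON but yields a bare string is treated as plain text) can be sketched roughly as follows. The function name `looks_like_json_document` is hypothetical and this is not necessarily how `_is_text_file_a_json` is implemented.

```python
import json


def looks_like_json_document(text: str) -> bool:
    """Return True only if `text` parses as JSON and is not a bare string."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Not parseable as JSON at all, so treat it as plain text.
        return False
    # A quoted string is valid JSON but is treated as plain text here.
    return not isinstance(parsed, str)
```

Under this sketch, `'"This is not a JSON"'` is rejected while an object such as `'{"a": 1}'` is accepted.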
Sebastian Laverde Alfonso	c11a2ff478	feat: method to catch and classify overlapping bounding boxes (#1803 ) We have established that overlapping bounding boxes do not have a one-size-fits-all solution, so different cases need to be handled differently to avoid information loss. We have manually identified the cases/categories of overlapping. Now we need a method to programmatically classify overlapping-bboxes cases within detected elements in a document, and return a report about it (list of cases with metadata). This fits two purposes: - Evaluation: We can have a pipeline using the DVC data registry that assesses the performance of a detection model against a set of documents (PDF/Images), by analysing the overlapping-bboxes cases it has. The metadata in the output can be used for generating metrics for this. - Scope overlapping cases: Manual inspection gives us a clue about currently present cases of overlapping bboxes. We need to propose solutions to fix those in code. This method generates a report by analysing several aspects of two overlapping regions. This data can be used to profile and specify the necessary changes that will fix each case. - Fix overlapping cases: We could introduce this functionality in the flow of a partition method (such as `partition_pdf`) to handle the calls to post-processing methods to fix overlapping. Tested on ~331 documents, the worst time per page is around 5ms. For a document such as `layout-parser-paper.pdf` it takes 4.46 ms. Introduces functionality to take a list of unstructured elements (which contain bounding boxes) and identify pairs of bounding boxes which overlap and which case is pertinent to the pairing. This PR includes the following methods in `utils.py`: - `ngrams(s, n)`: Generate n-grams from a string - `calculate_shared_ngram_percentage(string_A, string_B, n)`: Calculate the percentage of `common_ngrams` between `string_A` and `string_B` with reference to the total number of ngrams in `string_A`. 
- `calculate_largest_ngram_percentage(string_A, string_B)`: Iteratively call `calculate_shared_ngram_percentage` starting from the biggest ngram possible until the shared percentage is >0.0% - `is_parent_box(parent_target, child_target, add=0)`: True if the `child_target` bounding box is nested in the `parent_target`. Box format: [`x_bottom_left`, `y_bottom_left`, `x_top_right`, `y_top_right`]. The parameter `add` is the pixel error tolerance for extra pixels outside the parent region - `calculate_overlap_percentage(box1, box2, intersection_ratio_method="total")`: Box format: [`x_bottom_left`, `y_bottom_left`, `x_top_right`, `y_top_right`]. Calculates the percentage of overlapped region with reference to the biggest element-region (`intersection_ratio_method="parent"`), the smallest element-region (`intersection_ratio_method="partial"`), or to the disjunctive union region (`intersection_ratio_method="total"`). - `identify_overlapping_or_nesting_case`: Identify if there are nested or overlapping elements. If overlapping is present, it identifies the case by calling the method `identify_overlapping_case`. - `identify_overlapping_case`: Classifies the overlapping case for an element_pair input in one of 5 categories of overlapping. - `catch_overlapping_and_nested_bboxes`: Catch overlapping and nested bounding boxes cases across a list of elements. The params `nested_error_tolerance_px` and `sm_overlap_threshold` help control the separation of the cases. The overlapping/nested elements cases that are being caught are: 1. Nested elements 2. Small partial overlap 3. Partial overlap with empty content 4. Partial overlap with duplicate text (sharing 100% of the text) 5. Partial overlap without sharing text 6. 
Partial overlap sharing {`calculate_largest_ngram_percentage(...)`}% of the text Here is a snippet to test it:
```
from unstructured.partition.auto import partition
from unstructured.utils import catch_overlapping_and_nested_bboxes

model_name = "yolox_quantized"
target = "sample-docs/layout-parser-paper-fast.pdf"
# Partition the target document with the hi_res strategy
elements = partition(filename=target, strategy='hi_res', model_name=model_name)
overlapping_flag, overlapping_cases = catch_overlapping_and_nested_bboxes(elements)
for case in overlapping_cases:
    print(case, "\n")
```
Here is a screenshot of a json built with the output list `overlapping_cases`: <img width="377" alt="image" src="https://github.com/Unstructured-IO/unstructured/assets/38184042/a6fea64b-d40a-4e01-beda-27840f4f4b3a">	2023-10-25 12:17:34 +00:00
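Two of the ideas in the commit above, the shared n-gram percentage and the nested-box check with a pixel tolerance, can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, n-grams are taken over whitespace-separated words (the commit only says "from a string"), and `n` is assumed to be at least 1. Boxes follow the format given in the commit: [`x_bottom_left`, `y_bottom_left`, `x_top_right`, `y_top_right`].

```python
from typing import List, Sequence, Tuple


def word_ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Word-level n-grams of `text` (assumption: split on whitespace)."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def shared_ngram_percentage(string_a: str, string_b: str, n: int) -> float:
    """Percentage of n-grams of `string_a` that also occur in `string_b`."""
    grams_a = word_ngrams(string_a, n)
    if not grams_a:
        return 0.0
    grams_b = set(word_ngrams(string_b, n))
    shared = sum(1 for gram in grams_a if gram in grams_b)
    return 100.0 * shared / len(grams_a)


def is_nested(parent: Sequence[float], child: Sequence[float],
              tolerance: float = 0.0) -> bool:
    """True if `child` lies inside `parent`, allowing `tolerance` pixels of spill."""
    px1, py1, px2, py2 = parent
    cx1, cy1, cx2, cy2 = child
    return (
        cx1 >= px1 - tolerance
        and cy1 >= py1 - tolerance
        and cx2 <= px2 + tolerance
        and cy2 <= py2 + tolerance
    )
```

The percentage is measured against the n-grams of the first string only, so it is not symmetric in its two arguments, which matches the "with reference to `string_A`" wording of the commit.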
qued	8100f1e7e2	chore: process chipper hierarchy (#1634 ) PR to support schema changes introduced from [PR 232](https://github.com/Unstructured-IO/unstructured-inference/pull/232) in `unstructured-inference`. Specifically what needs to be supported is: * Change to the way `LayoutElement` from `unstructured-inference` is structured, specifically that this class is no longer a subclass of `Rectangle`, and instead `LayoutElement` has a `bbox` property that captures the location information and a `from_coords` method that allows construction of a `LayoutElement` directly from coordinates. * Removal of `LocationlessLayoutElement` since chipper now exports bounding boxes, and if we need to support elements without bounding boxes, we can make the `bbox` property mentioned above optional. * Getting hierarchy data directly from the inference elements rather than in post-processing * Don't try to reorder elements received from chipper v2, as they should already be ordered. #### Testing: The following demonstrates that the new version of chipper is inferring hierarchy.
```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper")
children = [el for el in elements if el.metadata.parent_id is not None]
print(children)
```
Also verify that running the traditional `hi_res` gives different results:
```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")
```
--------- Co-authored-by: Sebastian Laverde Alfonso <lavmlk20201@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-10-13 01:28:46 +00:00
Alvaro Bartolome	e52dd5c179	feat: add `requires_dependencies` decorator (#302 ) * Add `requires_dependencies` decorator * Use `requires_dependencies` on Reddit & S3 * Fix bug in `requires_dependencies` To use named args the decorator needs to be also wrapped * Add `requires_dependencies` integration tests * Add `requires_dependencies` in `Competition.md` * Update `CHANGELOG.md` * Bump version 0.4.16-dev5 * Ignore `F401` unused imports in `requires_dependencies` tests * Apply suggestions from code review * Add `functools.wraps` to keep docs, & annotations * Use `requires_dependencies` in `GitHubConnector`
	2023-02-28 14:50:39 +00:00
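A decorator of the kind this commit describes could look roughly like the sketch below. Only the name `requires_dependencies` and the use of `functools.wraps` come from the commit message; the signature, the `extras` parameter, and the error text are assumptions, and the actual implementation in `utils.py` may differ.

```python
import functools
import importlib.util
from typing import Any, Callable, List, Optional, Union


def _is_installed(name: str) -> bool:
    """Best-effort check that a module can be found without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def requires_dependencies(
    dependencies: Union[str, List[str]],
    extras: Optional[str] = None,  # hypothetical: name of a pip "extra" to suggest
) -> Callable:
    """Raise ImportError at call time if any named dependency is missing."""
    names = [dependencies] if isinstance(dependencies, str) else list(dependencies)

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)  # keeps the wrapped function's docs and annotations
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            missing = [name for name in names if not _is_installed(name)]
            if missing:
                hint = f" (try installing the '{extras}' extra)" if extras else ""
                raise ImportError(
                    f"Missing required dependencies: {', '.join(missing)}{hint}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

The outer function exists so the decorator can take arguments, which is likely what the "needs to be also wrapped" note in the commit refers to; checking at call time rather than import time lets a module with optional connectors be imported even when some dependencies are absent.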
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282 ) * Apply import sorting ruff . --select I --fix * Remove unnecessary open mode parameter ruff . --select UP015 --fix * Use f-string formatting rather than .format * Remove extraneous parentheses Also use "" instead of str() * Resolve missing trailing commas ruff . --select COM --fix * Rewrite list() and dict() calls using literals ruff . --select C4 --fix * Add () to pytest.fixture, use tuples for parametrize, etc. ruff . --select PT --fix * Simplify code: merge conditionals, context managers ruff . --select SIM --fix * Import without unnecessary alias ruff . --select PLR0402 --fix * Apply formatting via black * Rewrite ValueError somewhat Slightly unrelated to the rest of the PR * Apply formatting to tests via black * Update expected exception message to match 0d81564 * Satisfy E501 line too long in test * Update changelog & version * Add ruff to make tidy and test deps * Run 'make tidy' * Update changelog & version * Update changelog & version * Add ruff to 'check' target Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
asymness	28a4ae985d	feat: Implement utility functions for reading and writing `.jsonl` files (#22 ) * Implement save_as_jsonl and read_from_jsonl utility functions * Add unit tests for save_as_jsonl and read_from_jsonl utility functions * Add example of using save_as_jsonl with prodigy staging brick * Bump version and update changelog * remove accidentally added prodigy json file * added "the" in jsonl description Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2022-10-04 09:51:11 -04:00
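The `.jsonl` helpers named in this commit write one JSON object per line. A minimal sketch is shown below; the function names come from the commit message, but the signatures and behaviour shown here are assumptions rather than the library's actual code.

```python
import json
from typing import Any, Dict, List


def save_as_jsonl(data: List[Dict[str, Any]], filename: str) -> None:
    """Write each record as one JSON object per line."""
    with open(filename, "w", encoding="utf-8") as f:
        for record in data:
            f.write(json.dumps(record) + "\n")


def read_from_jsonl(filename: str) -> List[Dict[str, Any]]:
    """Read a .jsonl file back into a list of records, skipping blank lines."""
    with open(filename, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because `json.dumps` escapes newlines inside strings, each record stays on a single line and the file round-trips cleanly.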

11 Commits