unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-06-27 02:30:08 +00:00

Author	SHA1	Message	Date
Sebastian Laverde Alfonso	c11a2ff478	feat: method to catch and classify overlapping bounding boxes (#1803)

We have established that overlapping bounding boxes do not have a one-size-fits-all solution, so different cases need to be handled differently to avoid information loss. We have manually identified the cases/categories of overlapping. Now we need a method to programmatically classify overlapping-bbox cases within the detected elements of a document and return a report about them (a list of cases with metadata). This serves three purposes:

- Evaluation: We can have a pipeline using the DVC data registry that assesses the performance of a detection model against a set of documents (PDFs/images) by analysing the overlapping-bbox cases it produces. The metadata in the output can be used to generate metrics for this.
- Scope overlapping cases: Manual inspection gives us a clue about the currently present cases of overlapping bboxes. We need to propose solutions to fix those in code. This method generates a report by analysing several aspects of two overlapping regions. This data can be used to profile and specify the changes needed to fix each case.
- Fix overlapping cases: We could introduce this functionality into the flow of a partition method (such as `partition_pdf`) to handle the calls to post-processing methods that fix overlapping.

Tested on ~331 documents, the worst time per page is around 5 ms. For a document such as `layout-parser-paper.pdf` it takes 4.46 ms.

Introduces functionality to take a list of unstructured elements (which contain bounding boxes) and identify pairs of bounding boxes that overlap, along with the case that applies to each pair. This PR includes the following methods in `utils.py`:

- `ngrams(s, n)`: Generate n-grams from a string.
- `calculate_shared_ngram_percentage(string_A, string_B, n)`: Calculate the percentage of `common_ngrams` between `string_A` and `string_B` with reference to the total number of n-grams in `string_A`.
- `calculate_largest_ngram_percentage(string_A, string_B)`: Iteratively call `calculate_shared_ngram_percentage`, starting from the biggest n-gram possible, until the shared percentage is >0.0%.
- `is_parent_box(parent_target, child_target, add=0)`: True if the `child_target` bounding box is nested in the `parent_target`. Box format: [`x_bottom_left`, `y_bottom_left`, `x_top_right`, `y_top_right`]. The parameter `add` is the pixel error tolerance for extra pixels outside the parent region.
- `calculate_overlap_percentage(box1, box2, intersection_ratio_method="total")`: Box format: [`x_bottom_left`, `y_bottom_left`, `x_top_right`, `y_top_right`]. Calculates the percentage of the overlapped region with reference to the biggest element-region (`intersection_ratio_method="parent"`), the smallest element-region (`intersection_ratio_method="partial"`), or the disjunctive union region (`intersection_ratio_method="total"`).
- `identify_overlapping_or_nesting_case`: Identify whether there are nested or overlapping elements. If overlapping is present, it identifies the case by calling the method `identify_overlapping_case`.
- `identify_overlapping_case`: Classifies the overlapping case for an element_pair input into one of 5 categories of overlapping.
- `catch_overlapping_and_nested_bboxes`: Catch overlapping and nested bounding-box cases across a list of elements. The params `nested_error_tolerance_px` and `sm_overlap_threshold` help control the separation of the cases.

The overlapping/nested element cases that are being caught are:

1. Nested elements
2. Small partial overlap
3. Partial overlap with empty content
4. Partial overlap with duplicate text (sharing 100% of the text)
5. Partial overlap without sharing text
6. Partial overlap sharing {`calculate_largest_ngram_percentage(...)`}% of the text

Here is a snippet to test it:

```
from unstructured.partition.auto import partition
from unstructured.utils import catch_overlapping_and_nested_bboxes

model_name = "yolox_quantized"
target = "sample-docs/layout-parser-paper-fast.pdf"

# Partition with the hi_res strategy so the elements carry bounding boxes.
elements = partition(filename=target, strategy="hi_res", model_name=model_name)

# Returns a flag plus the list of overlapping/nested cases with their metadata.
overlapping_flag, overlapping_cases = catch_overlapping_and_nested_bboxes(elements)
for case in overlapping_cases:
    print(case, "\n")
```

Here is a screenshot of a JSON built with the output list `overlapping_cases`:

<img width="377" alt="image" src="https://github.com/Unstructured-IO/unstructured/assets/38184042/a6fea64b-d40a-4e01-beda-27840f4f4b3a">	2023-10-25 12:17:34 +00:00
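The n-gram and bounding-box helpers described in the commit above can be sketched roughly as follows. This is a minimal illustration written from the descriptions in the commit message, not the code merged in `utils.py`: the whitespace tokenization, the names `shared_ngram_percentage`, `is_nested`, and `overlap_percentage`, and the reading of `"total"` as the area of the union are all assumptions.

```python
from typing import List, Sequence, Tuple

# Box format: [x_bottom_left, y_bottom_left, x_top_right, y_top_right]
Box = Sequence[float]


def ngrams(s: str, n: int) -> List[Tuple[str, ...]]:
    """Word-level n-grams of a string (whitespace tokenization is an assumption)."""
    words = s.split()
    return [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]


def shared_ngram_percentage(string_a: str, string_b: str, n: int) -> float:
    """Percentage of string_a's distinct n-grams that also occur in string_b."""
    grams_a = set(ngrams(string_a, n))
    if not grams_a:
        return 0.0
    common = grams_a & set(ngrams(string_b, n))
    return 100.0 * len(common) / len(grams_a)


def is_nested(parent: Box, child: Box, add: float = 0) -> bool:
    """True if child lies inside parent, allowing `add` pixels of slack per side."""
    return (
        child[0] >= parent[0] - add
        and child[1] >= parent[1] - add
        and child[2] <= parent[2] + add
        and child[3] <= parent[3] + add
    )


def overlap_percentage(box1: Box, box2: Box, method: str = "total") -> float:
    """Intersection area as a percentage of a reference area chosen by `method`."""
    width = min(box1[2], box2[2]) - max(box1[0], box2[0])
    height = min(box1[3], box2[3]) - max(box1[1], box2[1])
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not intersect
    intersection = width * height
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    if method == "parent":
        reference = max(area1, area2)
    elif method == "partial":
        reference = min(area1, area2)
    else:  # "total": assumed here to mean the area of the union of both boxes
        reference = area1 + area2 - intersection
    return 100.0 * intersection / reference
```

A classifier in the spirit of `identify_overlapping_case` could then check nesting first, compare the overlap percentage against a small-overlap threshold, and fall back to the shared n-gram percentage to separate duplicate, partially shared, and unshared text.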
qued	8100f1e7e2	chore: process chipper hierarchy (#1634)

PR to support schema changes introduced from [PR 232](https://github.com/Unstructured-IO/unstructured-inference/pull/232) in `unstructured-inference`. Specifically what needs to be supported is:

* Change to the way `LayoutElement` from `unstructured-inference` is structured, specifically that this class is no longer a subclass of `Rectangle`, and instead `LayoutElement` has a `bbox` property that captures the location information and a `from_coords` method that allows construction of a `LayoutElement` directly from coordinates.
* Removal of `LocationlessLayoutElement` since chipper now exports bounding boxes, and if we need to support elements without bounding boxes, we can make the `bbox` property mentioned above optional.
* Getting hierarchy data directly from the inference elements rather than in post-processing
* Don't try to reorder elements received from chipper v2, as they should already be ordered.

#### Testing:

The following demonstrates that the new version of chipper is inferring hierarchy.

```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper")
children = [el for el in elements if el.metadata.parent_id is not None]
print(children)
```

Also verify that running the traditional `hi_res` gives different results:

```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")
```

---------

Co-authored-by: Sebastian Laverde Alfonso <lavmlk20201@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-10-13 01:28:46 +00:00
Alvaro Bartolome	e52dd5c179	feat: add `requires_dependencies` decorator (#302)

* Add `requires_dependencies` decorator
* Use `requires_dependencies` on Reddit & S3
* Fix bug in `requires_dependencies`
  To use named args, the decorator also needs to be wrapped
* Add `requires_dependencies` integration tests
* Add `requires_dependencies` in `Competition.md`
* Update `CHANGELOG.md`
* Bump version 0.4.16-dev5
* Ignore `F401` unused imports in `requires_dependencies` tests
* Apply suggestions from code review
* Add `functools.wraps` to keep docs & annotations
* Use `requires_dependencies` in `GitHubConnector`	2023-02-28 14:50:39 +00:00
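A decorator of the kind this commit describes might look like the sketch below. It is not the merged implementation: the `extras` parameter, the error wording, and the install hint are assumptions added for illustration. It does show the two points the commit message calls out: the extra wrapping layer needed so the decorator can take named arguments, and `functools.wraps` to preserve the decorated function's docstring and annotations.

```python
import importlib
from functools import wraps
from typing import Callable, List, Optional, Union


def requires_dependencies(
    dependencies: Union[str, List[str]],
    extras: Optional[str] = None,  # hypothetical: name of a pip extra to suggest
) -> Callable:
    """Decorator factory: verify that modules are importable before each call."""
    if isinstance(dependencies, str):
        dependencies = [dependencies]

    # The outer function receives the arguments; this inner layer receives the
    # function being decorated. Without it the decorator could not take args.
    def decorator(func: Callable) -> Callable:
        @wraps(func)  # keep func's __name__, __doc__, and annotations
        def wrapper(*args, **kwargs):
            missing = []
            for dep in dependencies:
                try:
                    importlib.import_module(dep)
                except ImportError:
                    missing.append(dep)
            if missing:
                hint = f' Try: pip install "unstructured[{extras}]"' if extras else ""
                raise ImportError(
                    f"Missing required dependencies: {', '.join(missing)}.{hint}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

Checking at call time rather than at decoration time means a module with optional connectors can still be imported when their dependencies are absent; the error only surfaces when the decorated function is actually used.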
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282)

* Apply import sorting
  ruff . --select I --fix
* Remove unnecessary open mode parameter
  ruff . --select UP015 --fix
* Use f-string formatting rather than .format
* Remove extraneous parentheses
  Also use "" instead of str()
* Resolve missing trailing commas
  ruff . --select COM --fix
* Rewrite list() and dict() calls using literals
  ruff . --select C4 --fix
* Add () to pytest.fixture, use tuples for parametrize, etc.
  ruff . --select PT --fix
* Simplify code: merge conditionals, context managers
  ruff . --select SIM --fix
* Import without unnecessary alias
  ruff . --select PLR0402 --fix
* Apply formatting via black
* Rewrite ValueError somewhat
  Slightly unrelated to the rest of the PR
* Apply formatting to tests via black
* Update expected exception message to match 0d81564
* Satisfy E501 line too long in test
* Update changelog & version
* Add ruff to make tidy and test deps
* Run 'make tidy'
* Update changelog & version
* Update changelog & version
* Add ruff to 'check' target
  Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
asymness	28a4ae985d	feat: Implement utility functions for reading and writing `.jsonl` files (#22)

* Implement save_as_jsonl and read_from_jsonl utility functions
* Add unit tests for save_as_jsonl and read_from_jsonl utility functions
* Add example of using save_as_jsonl with prodigy staging brick
* Bump version and update changelog
* remove accidentally added prodigy json file
* added "the" in jsonl description

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2022-10-04 09:51:11 -04:00
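For readers unfamiliar with the format, a `.jsonl` file holds one JSON document per line. The pair of helpers named in this commit could be sketched as below. The function names come from the commit message, but the signatures and bodies are assumptions, not the code that was merged.

```python
import json
from typing import Any, Dict, List


def save_as_jsonl(data: List[Dict[str, Any]], filename: str) -> None:
    """Write each record as one JSON object per line."""
    with open(filename, "w", encoding="utf-8") as output_file:
        output_file.writelines(json.dumps(record) + "\n" for record in data)


def read_from_jsonl(filename: str) -> List[Dict[str, Any]]:
    """Read a .jsonl file back into a list of records, skipping blank lines."""
    with open(filename, encoding="utf-8") as input_file:
        return [json.loads(line) for line in input_file if line.strip()]
```

Because every line is an independent JSON document, records can be appended or streamed without parsing the whole file.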

5 Commits