5 Commits

Author SHA1 Message Date
Sebastian Laverde Alfonso
c11a2ff478
feat: method to catch and classify overlapping bounding boxes (#1803)
We have established that overlapping bounding boxes does not have a
one-fits-all solution, so different cases need to be handled differently
to avoid information loss. We have manually identified the
cases/categories of overlapping. Now we need a method to
programmatically classify overlapping-bboxes cases within detected
elements in a document, and return a report about it (list of cases with
metadata). This fits two purposes:

- **Evaluation**: We can have a pipeline using the DVC data registry
that assess the performance of a detection model against a set of
documents (PDF/Images), by analysing the overlapping-bboxes cases it
has. The metadata in the output can be used for generating metrics for
this.
- **Scope overlapping cases**: Manual inspection give us a clue about
currently present cases of overlapping bboxes. We need to propose
solutions to fix those on code. This method generates a report by
analysing several aspects of two overlapping regions. This data can be
used to profile and specify the necessary changes that will fix each
case.
- **Fix overlapping cases**: We could introduce this functionality in
the flow of a partition method (such as `partition_pdf`, to handle the
calls to post-processing methods to fix overlapping. Tested on ~331
documents, the worst time per page is around 5ms. For a document such as
`layout-parser-paper.pdf` it takes 4.46 ms.

Introduces functionality to take a list of unstructured elements (which
contain bounding boxes) and identify pairs of bounding boxes which
overlap and which case is pertinent to the pairing. This PR includes the
following methods in `utils.py`:

- **`ngrams(s, n)`**: Generate n-grams from a string
- **`calculate_shared_ngram_percentage(string_A, string_B, n)`**:
Calculate the percentage of `common_ngrams` between `string_A` and
`string_B` with reference to the total number of ngrams in `string_A`.
- **`calculate_largest_ngram_percentage(string_A, string_B)`**:
Iteratively call `calculate_shared_ngram_percentage` starting from the
biggest ngram possible until the shared percentage is >0.0%
- **`is_parent_box(parent_target, child_target, add=0)`**: True if the
`child_target` bounding box is nested in the `parent_target` Box format:
[`x_bottom_left`, `y_bottom_left`, `x_top_right`, `y_top_right`]. The
parameter 'add' is the pixel error tolerance for extra pixels outside
the parent region
- **`calculate_overlap_percentage(box1, box2,
intersection_ratio_method="total")`**: Box format: [`x_bottom_left`,
`y_bottom_left`, `x_top_right`, `y_top_right`]. Calculates the
percentage of overlapped region with reference to biggest element-region
(`intersection_ratio_method="parent"`), the smallest element-region
(`intersection_ratio_method="partial"`), or to the disjunctive union
region (`intersection_ratio_method="total"`).
- **`identify_overlapping_or_nesting_case`**: Identify if there are
nested or overlapping elements. If overlapping is present,
it identifies the case calling the method `identify_overlapping_case`.
- **`identify_overlapping_case`**: Classifies the overlapping case for
an element_pair input in one of 5 categories of overlapping.
- **`catch_overlapping_and_nested_bboxes`**: Catch overlapping and
nested bounding boxes cases across a list of elements. The params
`nested_error_tolerance_px` and `sm_overlap_threshold` help controling
the separation of the cases.

The overlapping/nested elements cases that are being caught are:
1. **Nested elements**
2. **Small partial overlap**
3. **Partial overlap with empty content**
4. **Partial overlap with duplicate text (sharing 100% of the text)**
5. **Partial overlap without sharing text**
6. **Partial overlap sharing**
{`calculate_largest_ngram_percentage(...)`}% **of the text**

Here is a snippet to test it:
```
from unstructured.partition.auto import partition

model_name = "yolox_quantized"
target = "sample-docs/layout-parser-paper-fast.pdf"
elements = partition(filename=file_path_i, strategy='hi_res', model_name=model_name)
overlapping_flag, overlapping_cases = catch_overlapping_bboxes(elements)
for case in overlapping_cases:
    print(case, "\n")
```
Here is a screenshot of a json built with the output list
`overlapping_cases`:
<img width="377" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/38184042/a6fea64b-d40a-4e01-beda-27840f4f4b3a">
2023-10-25 12:17:34 +00:00
qued
8100f1e7e2
chore: process chipper hierarchy (#1634)
PR to support schema changes introduced from [PR
232](https://github.com/Unstructured-IO/unstructured-inference/pull/232)
in `unstructured-inference`.

Specifically what needs to be supported is:
* Change to the way `LayoutElement` from `unstructured-inference` is
structured, specifically that this class is no longer a subclass of
`Rectangle`, and instead `LayoutElement` has a `bbox` property that
captures the location information and a `from_coords` method that allows
construction of a `LayoutElement` directly from coordinates.
* Removal of `LocationlessLayoutElement` since chipper now exports
bounding boxes, and if we need to support elements without bounding
boxes, we can make the `bbox` property mentioned above optional.
* Getting hierarchy data directly from the inference elements rather
than in post-processing
* Don't try to reorder elements received from chipper v2, as they should
already be ordered.

#### Testing:

The following demonstrates that the new version of chipper is inferring
hierarchy.

```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper")
children = [el for el in elements if el.metadata.parent_id is not None]
print(children)

```
Also verify that running the traditional `hi_res` gives different
results:
```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")

```

---------

Co-authored-by: Sebastian Laverde Alfonso <lavmlk20201@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-10-13 01:28:46 +00:00
Alvaro Bartolome
e52dd5c179
feat: add requires_dependencies decorator (#302)
* Add `requires_dependencies` decorator

* Use `required_dependencies` on Reddit & S3

* Fix bug in `requires_dependencies`

To used named args the decorator needs to be also wrapped

* Add `requires_dependencies` integration tests

* Add `requires_dependencies` in `Competition.md`

* Update `CHANGELOG.md`

* Bump version 0.4.16-dev5

* Ignore `F401` unused imports in `requires_dependencies` tests

* Apply suggestions from code review

* Add `functools.wrap` to keep docs, & annotations

* Use `requires_dependencies` in `GitHubConnector`
2023-02-28 14:50:39 +00:00
Tom Aarsen
5eb1466acc
Resolve various style issues to improve overall code quality (#282)
* Apply import sorting

ruff . --select I --fix

* Remove unnecessary open mode parameter

ruff . --select UP015 --fix

* Use f-string formatting rather than .format

* Remove extraneous parentheses

Also use "" instead of str()

* Resolve missing trailing commas

ruff . --select COM --fix

* Rewrite list() and dict() calls using literals

ruff . --select C4 --fix

* Add () to pytest.fixture, use tuples for parametrize, etc.

ruff . --select PT --fix

* Simplify code: merge conditionals, context managers

ruff . --select SIM --fix

* Import without unnecessary alias

ruff . --select PLR0402 --fix

* Apply formatting via black

* Rewrite ValueError somewhat

Slightly unrelated to the rest of the PR

* Apply formatting to tests via black

* Update expected exception message to match
0d81564

* Satisfy E501 line too long in test

* Update changelog & version

* Add ruff to make tidy and test deps

* Run 'make tidy'

* Update changelog & version

* Update changelog & version

* Add ruff to 'check' target

Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
2023-02-27 11:30:54 -05:00
asymness
28a4ae985d
feat: Implement utility functions for reading and writing .jsonl files (#22)
* Implement save_as_jsonl and read_from_jsonl utility functions

* Add unit tests for save_as_jsonl and read_from_jsonl utility functions

* Add example of using save_as_jsonl with prodigy staging brick

* Bump version and update changelog

* remove accidentally added prodigy json file

* added "the" in jsonl description

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2022-10-04 09:51:11 -04:00