This pull request allows predictions to be returned in raw cell
representation from the table transformer. This will later be used to save
predictions in a cell format for simpler metrics calculation.
This PR has to be merged after
https://github.com/Unstructured-IO/unstructured-inference/pull/335
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842
Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more tests for deterministic behavior of the IDs returned by
partitioning functions and for their uniqueness (guaranteed at the document
level, and highly probable across multiple documents)
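A minimal sketch of the idea, not the library's actual implementation: the function name, the field separator, the choice of SHA-256 and the truncation length are all assumptions made for illustration. It only shows why hashing the sequence number, page number, filename and text together gives IDs that are deterministic yet distinct for repeated text.

```python
import hashlib


def element_id_sketch(text: str, filename: str, page_number: int, sequence_number: int) -> str:
    """Hypothetical helper: derive a deterministic ID from the four inputs named above."""
    # A separator keeps adjacent fields from running into each other.
    payload = "\x1f".join([filename, str(page_number), str(sequence_number), text])
    # SHA-256 and the 32-character truncation are illustrative choices only.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


# Same inputs give the same ID; the same text at a different position gives a different ID.
first = element_id_sketch("Total", "report.pdf", page_number=1, sequence_number=0)
again = element_id_sketch("Total", "report.pdf", page_number=1, sequence_number=0)
repeat = element_id_sketch("Total", "report.pdf", page_number=1, sequence_number=1)
```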
This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
This PR adds a third OCR provider, alongside Tesseract and Paddle: the
[Google Cloud Vision API](https://cloud.google.com/vision).
It can be used similarly to other OCR methods: set the `OCR_AGENT`
environment variable to the path to the OCR module
(`unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`).
You also need to set the credentials to use Google APIs, for instance by
setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
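For instance, both variables could be set from Python before the partitioning code runs. The module path is the one given above; the credentials path is a placeholder, not a real file.

```python
import os

# Module path taken from the description above.
os.environ["OCR_AGENT"] = (
    "unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision"
)
# Placeholder path: point this at your own service-account key file.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"
```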
---------
Co-authored-by: christinestraub <christinemstraub@gmail.com>
**Summary**
The `.section` field in `ElementMetadata` is dead code, possibly a
remainder from a prior iteration of `partition_epub()`. In any case, it
is not populated by any partitioner. Remove it and any code that uses
it.
**Summary**
A few additional small, mechanical odds and ends required for PPTX image
extraction.
The big one is removing the leading underscore from
`PptxPartitionerOptions` because now client code that implements a
custom Picture-shape sub-partitioner will need to reference this class.
This PR aims to remove duplicate embedded images taken by `PDFminer`.
### Summary
- add `clean_pdfminer_duplicate_image_elements()` to remove embedded
images with similar `bboxes` and the same `text`
- add env_config `EMBEDDED_IMAGE_SAME_REGION_THRESHOLD` to consider the
bounding boxes of two embedded images as the same region
- refactor: reorganize `clean_pdfminer_inner_elements()`
Part one of the issue described here:
https://github.com/Unstructured-IO/unstructured/issues/2461
It does not change how the hashing algorithm works; it only reworks how IDs are
assigned:
> Element ID Design Principles
>
> 1. A partitioning function can assign only one of two available ID
types to a returned element: a hash or UUID.
> 2. All elements that are returned come with an ID, which is never
None.
> 3. No matter which type of ID is used, it will always be in string
format.
> 4. Partitioning a document returns elements with hashes as their
default IDs.
Big thanks to @scanny for explaining the current design and suggesting
ways to do it right, especially with chunking.
Here's the next PR in line:
https://github.com/Unstructured-IO/unstructured/pull/2673
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>
Add support for `start_index` in HTML link extraction (closes #2625)
Testing
```
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json
html_text = """<html>
<p>Hello there I am a <a href="/link">very important link!</a></p>
<p>Here is a list of my favorite things</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li>
<li>Dogs</li>
</ul>
<a href="/loner">A lone link!</a>
</html>"""
elements = partition_html(text=html_text)
print(elements_to_json(elements))
```
---------
Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
**Summary**
Delegate partitioning of PPTX Picture (image, to a first approximation)
shapes to a distinct sub-partitioner and allow the default picture
sub-partitioner to be replaced at run-time by one of the user's
choosing.
**Summary**
As we move to adding pluggable sub-partitioners, `partition_pptx()` will
need to become sensitive to the `strategy` argument, in particular when
it is set to "hi_res". Up until now there were no expensive operations
(inference, OCR, etc.) incurred while partitioning PPTX so this argument
was ignored.
After this PR, `partition_pptx()` still won't do anything with that
value, other than pass it along to `_PptxPartitionerOptions` for
safe-keeping, but now it's ready for use by a `PicturePartitioner` (to
come in a subsequent PR).
Closes #2362.
Previously, when an HTML document contained a `div` with a nested tag, e.g. a
`<b>` or `<span>`, the element created from the `div` contained only the
text up to the inline element. This PR adds support for extracting text
from tag tails in HTML.
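To illustrate what a "tail" is, here is a small sketch using the standard library's `xml.etree.ElementTree` as a stand-in for whatever parser the partitioner uses. The helper name is hypothetical. The text that follows a child's closing tag is stored on the child as `.tail`, not on the parent, which is why reading only the parent's `.text` stops at the first inline element.

```python
import xml.etree.ElementTree as ET


def text_with_tails(element: ET.Element) -> str:
    """Hypothetical helper: collect an element's text, its children's text, and their tails."""
    parts = [element.text or ""]
    for child in element:
        parts.append(text_with_tails(child))
        # Text after the child's closing tag lives on the child as `.tail`.
        parts.append(child.tail or "")
    return "".join(parts)


div = ET.fromstring("<div>shares at $<span>5.22</span> per share</div>")
only_leading_text = div.text  # stops at the nested tag
full_text = text_with_tails(div)
```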
### Testing
```
from unstructured.partition.html import partition_html
html_text = """
<html>
<body>
<div>
the Company issues shares at $<div style="display:inline;"><span>5.22</span></div> per share. There is more text
</div>
</body>
</html>
"""
elements = partition_html(text=html_text)
print(''.join([str(el).strip() for el in elements]))
```
**Expected behavior**
```
the Company issues shares at $5.22per share. There is more text
```
**Reviewers:** Likely quicker to review commit-by-commit.
**Summary**
In preparation for adding a PPTX `Picture` shape _sub-partitioner_,
extract management of PPTX partitioning-run options to a separate
`_PptxPartitioningOptions` object similar to those used in chunking and
XLSX partitioning. This provides several benefits:
- Extract code dealing with applying defaults and computing derived
values from the main partitioning code, leaving it less cluttered and
focused on the partitioning algorithm itself.
- Allow the options set to be passed to helper objects, prominently
including sub-partitioners, without requiring a long list of parameters
or requiring the caller to couple itself to the particular option values
the helper object requires.
- Allow options behaviors to be thoroughly and efficiently tested in
isolation.
**Summary**
As an initial step in reducing the complexity of the monolithic
`partition_xlsx()` function, extract all argument-handling to a separate
`_XlsxPartitionerOptions` object which can be fully covered by isolated
unit tests.
**Additional Context**
This code was from a prior XLSX bug-fix branch that did not get
committed because of time constraints. I wanted to revisit it here
because I need the benefits of this as part of some new work on PPTX
that will require a separate options object that can be passed to
delegate objects.
This approach was incubated in the chunking context and has produced a
lot of opportunities there to decompose the logic into smaller
components that are more understandable and isolated-test-able, without
having to pass an extended list of option values in every sub-call. As
well as decluttering the code, this removes coupling where the caller
needs to know which options a subroutine might need to reference.
This PR is the second part of fixing "embedded text not getting merged
with inferred elements"; the first part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/331.
### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when removing pdfminer (embedded) elements that were merged with
inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
introduced in the [first
part](https://github.com/Unstructured-IO/unstructured-inference/pull/331)
when removing pdfminer (embedded) elements that were merged with
inferred elements
- bump `unstructured-inference` to 0.7.25
### Testing
PDF:
[pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)
```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```
```
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(
    filename="pwc-financial-statements-p114.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image"],
)
table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)
```
---------
Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
**Summary**
This final PR in the "orig_elements" series adds what is needed so that
`.metadata.orig_elements`, when present on a chunk (element), is
serialized to JSON when the chunk is serialized, for instance, to be
used in an HTTP response payload.
It also provides for deserializing such a JSON payload into chunks that
contain the `.orig_elements` metadata.
**Additional Context**
Note that `.metadata.orig_elements` is always `Optional[list[Element]]`
when in memory. However, those original elements are serialized as
Base64-encoded gzipped JSON and are in that form (str) when present as
JSON or as "element-dicts" which is an intermediate
serialization/deserialization format. That is, serialization is `Element
-> dict -> JSON` and deserialization is `JSON -> dict -> Element` and
`.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms.
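The round trip can be sketched with the standard library alone. The function names and the sample element dicts below are hypothetical, and using `gzip.compress` for the "gzipped" step is an assumption; the library's own serde functions may differ in detail.

```python
import base64
import gzip
import json
from typing import Any, Dict, List


def encode_orig_elements(element_dicts: List[Dict[str, Any]]) -> str:
    """Hypothetical: element dicts -> JSON -> gzip -> Base64 text."""
    raw = json.dumps(element_dicts).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decode_orig_elements(payload: str) -> List[Dict[str, Any]]:
    """Hypothetical: the inverse of `encode_orig_elements`."""
    raw = gzip.decompress(base64.b64decode(payload))
    return json.loads(raw.decode("utf-8"))


orig = [{"type": "Title", "text": "Introduction"}, {"type": "NarrativeText", "text": "Hello."}]
encoded = encode_orig_elements(orig)
# In the dict and JSON forms the field is a plain string, so the chunk serializes as usual.
chunk_dict = {"type": "CompositeElement", "metadata": {"orig_elements": encoded}}
chunk_json = json.dumps(chunk_dict)
restored = decode_orig_elements(json.loads(chunk_json)["metadata"]["orig_elements"])
```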
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
Change the default values for table extraction. This works in tandem with
[this](https://github.com/Unstructured-IO/unstructured-api/pull/370)
`unstructured-api` PR.
We want to move away from the `pdf_infer_table_structure` parameter. In this
PR:
- We change how it's treated with respect to the `skip_infer_table_types` parameter.
Whether to extract tables from a PDF now follows from the rule:
`pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set it to `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]` by default
- We remove it from the examples in documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in
documentation
A more detailed description of how we want the parameters to interact:
- if `pdf_infer_table_structure` is False, tables will never be extracted
from PDFs
- if `pdf_infer_table_structure` is True, tables will be extracted from
PDFs unless that is skipped via `skip_infer_table_types`
- by default, `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`
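The interaction rule above can be written as a small predicate. The function name is hypothetical; only the two parameter names and their defaults come from this description.

```python
from typing import List, Optional


def should_infer_pdf_tables(
    pdf_infer_table_structure: bool = True,
    skip_infer_table_types: Optional[List[str]] = None,
) -> bool:
    """Hypothetical helper encoding the rule stated above."""
    # `None` stands in for the default empty list to avoid a mutable default argument.
    skip = [] if skip_infer_table_types is None else skip_infer_table_types
    return pdf_infer_table_structure and "pdf" not in skip
```

With the defaults this returns `True`; passing `pdf_infer_table_structure=False` or `skip_infer_table_types=["pdf"]` turns PDF table extraction off.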
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
**Summary**
The serialization and deserialization (serde) of
`metadata.orig_elements` will be located in `unstructured.staging.base`
alongside `elements_to_json()` and other existing serde functions.
Improve the typing, readability, and structure of that module before
adding the new serde functions for `metadata.orig_elements`.
**Reviewers:** The commits are well-groomed and are probably quicker to
review commit-by-commit than as all files-changed at once.
Minor refactor after conversation with @scanny
Updates docstring and how chunking options are accessed.
`self._kwargs.get()` should only be used in the `lazyproperty`
definition of an instance's attribute. Other calls should use
`self.<attribute>`
Creates a composite metric to represent the table structure score. It is
the average of the existing row and column index and content scores.
This PR adds a new property to
`unstructured.metrics.table_eval.TableEvaluation`:
`composite_structure_acc`, which is computed from the element level row
and column index and content accuracy scores. This new metric is meant
to offer a single number to represent the performance of table structure
extraction model/algorithms.
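As a rough sketch, assuming the composite is an unweighted mean of the four element-level scores (the function name and argument names here are hypothetical):

```python
from statistics import mean


def composite_structure_acc_sketch(
    row_index_acc: float,
    col_index_acc: float,
    row_content_acc: float,
    col_content_acc: float,
) -> float:
    """Hypothetical: unweighted mean of the four element-level accuracy scores."""
    return mean([row_index_acc, col_index_acc, row_content_acc, col_content_acc])


score = composite_structure_acc_sketch(1.0, 0.5, 1.0, 0.5)
```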
This PR also refactors the eval computation logic so it uses a constant
`table_eval_metrics` instead of hard coding the name of the metrics in
multiple places in the code.
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
**Summary**
Add the actual behavior to populate `.metadata.orig_elements` during
chunking, when so instructed by the `include_orig_elements` option.
**Additional Context**
The underlying structures to support this, namely the
`.metadata.orig_elements` field and the `include_orig_elements` chunking
option, were added in closely prior PRs. This PR adds the behavior to
actually populate that metadata field during chunking when the option is
set.
Introduce `date_from_file_object` to the `partition*` functions; it is set to
`False` by default.
If it is set to `True` and the file is provided via the `file` parameter, partition
will attempt to infer the last modified date from `file`'s contents;
otherwise the last modified metadata will be set to `None`.
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Extend `get_mean_grouping` to accept a list of filenames, given either as
a list of strings or as a txt file.
---------
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
**Summary**
Add `include_orig_elements: bool = True` as a new chunking option. This
PR does not implement _adding_ original elements to chunks, only
accepting this parameter as a chunking option and assigning `True` to it
as a default value when it is omitted as a keyword argument.
Note this will need to be added in other repositories as well in order
to fully support this new option by all access methods. In particular it
will need to be added in `unstructured-api` in order to become available
via the SDKs.
**Summary**
Some typing modernization in `elements.py` which will get changes to add
the `orig_elements` metadata field.
Also some additions to `unit_util.py` to enable simplified mocking that
will be required in the next PR.
Files were being created as a side effect from running tests in
`test_unstructured/metrics/test_evaluate.py`. The updated decorator
removes the created directory and its files after the tests run.
Testing
On the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch no files are created.
**Summary**
Add `metadata.is_continuation = True` to metadata of second-and-later
text-split chunks formed from an oversized non-table element. Previously
this metadata was only present on text-split `TableChunk` elements.
This enables downstream filtering of intentionally redundant metadata on
chunk elements that may not be desired for all purposes.
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
This PR redefines the `table_level_acc` metric as follows:
- for each predicted table use sequence matching ratio as its accuracy
- as a prerequisite for the sequence matching we sort the table cells by
row then column for both predicted and ground truth to ensure they are
ordered the same
- average the accuracy over all predicted tables
- any prediction without a matching ground truth (false positive)
decreases the score
- a prediction that splits a ground truth table into smaller tables also
gets a low score, with perfectly equal splits getting the lowest score
This new definition makes the metric a value between 0 and 1 per
file. It replaces the existing definition, where the metric is the ratio
of the number of predicted tables that match a ground truth table to
the number of ground truth tables. The existing metric actually gives
higher values for predictions that split tables and can exceed 1.
The new definition prefers predictions that do not split ground truth
tables.
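The redefined metric can be sketched with `difflib.SequenceMatcher`. This is an illustration under stated assumptions, not the code in this PR: the cell shape, the function names, and the idea that the caller already supplies each prediction paired with its matched ground truth (or `None` for a false positive) are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Any, Dict, List, Optional, Tuple

Cell = Dict[str, Any]  # hypothetical cell shape: {"row": int, "col": int, "text": str}


def ordered_cell_texts(cells: List[Cell]) -> List[str]:
    """Sort cells by row, then column, so both tables are compared in the same order."""
    ordered = sorted(cells, key=lambda cell: (cell["row"], cell["col"]))
    return [str(cell["text"]) for cell in ordered]


def table_level_acc_sketch(pairs: List[Tuple[List[Cell], Optional[List[Cell]]]]) -> float:
    """Average the per-table matching ratio; an unmatched prediction scores 0."""
    if not pairs:
        return 0.0  # assumption: no predicted tables yields 0
    scores = []
    for predicted, ground_truth in pairs:
        if ground_truth is None:
            scores.append(0.0)  # a false positive pulls the average down
            continue
        matcher = SequenceMatcher(
            None, ordered_cell_texts(predicted), ordered_cell_texts(ground_truth)
        )
        scores.append(matcher.ratio())
    return sum(scores) / len(scores)


truth = [{"row": 0, "col": 0, "text": "a"}, {"row": 0, "col": 1, "text": "b"}]
shuffled = list(reversed(truth))  # same cells, different input order
perfect = table_level_acc_sketch([(shuffled, truth)])
with_false_positive = table_level_acc_sketch([(shuffled, truth), (shuffled, None)])
```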
**Summary**
Fixes: #2308
**Additional context**
Through a somewhat deep call-chain, partitioning a file-like object
(e.g. io.BytesIO) having its `.name` attribute set to a path not
pointing to an actual file on the local filesystem would raise
`FileNotFoundError` when the last-modified date was being computed for
the document.
This scenario is a legitimate partitioning call, where `file.name` is
used downstream to describe the source of, for example, a bytes payload
downloaded from the network.
**Fix**
- explicitly check for the existence of a file at the given path before
accessing it to get its modified date. Return `None` (already a
legitimate return value) when no such file exists.
- Generally clean up the implementations.
- Add unit tests that exercise all cases.
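The guard described in the first bullet might look like the following sketch. The helper name and the ISO-8601 return format are assumptions made for illustration.

```python
import io
import os
from datetime import datetime
from typing import Optional


def last_modified_or_none(path: Optional[str]) -> Optional[str]:
    """Hypothetical helper: modified date of `path`, or None when no such file exists."""
    # Check for an actual file first instead of letting getmtime() raise FileNotFoundError.
    if not path or not os.path.isfile(path):
        return None
    return datetime.fromtimestamp(os.path.getmtime(path)).isoformat(timespec="seconds")


# A file-like object whose `.name` only describes where the bytes came from.
payload = io.BytesIO(b"downloaded bytes")
payload.name = "/no/such/dir/downloaded.pdf"
result = last_modified_or_none(payload.name)
```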
---------
Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
The current way table structure metrics are computed does not cover
cases where no table is found and all stats are empty.
This PR fixes this and adds some hardening tests for the table eval
processor.
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
Linting and typing fixes, and add tests to improve test coverage in
utils.py
On the main branch, run `coverage run -m pytest
test_unstructured/test_utils.py` and then `coverage report -m
unstructured/utils.py` to see test coverage for `utils.py`. Check out
this branch and do the same. The percent coverage should increase to 88%.
---------
Co-authored-by: David Potter <potterdavidm@gmail.com>
**Summary**
This is the final step in adding pluggable chunking-strategies. It
introduces the `chunk()` function to replace calls to strategy-specific
chunkers in the `@add_chunking_strategy` decorator. The `chunk()`
function then uses a mapping of chunking-strategy names (e.g.
"by_title", "basic") to chunking functions (chunkers) to dispatch the
chunking call. This allows other chunkers to be added at runtime rather
than requiring a code change, which is what "pluggable" chunkers is.
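The dispatch idea can be sketched as a name-to-function registry. Everything here except the name `chunk` is hypothetical (the registry, `register_chunker`, and the toy "pairs" strategy), and plain strings stand in for elements.

```python
from typing import Any, Callable, Dict, List

Chunker = Callable[..., List[str]]

# Hypothetical registry mapping strategy names to chunking functions.
_chunkers: Dict[str, Chunker] = {}


def register_chunker(name: str, chunker: Chunker) -> None:
    """Add a chunker at runtime, with no change to the dispatching code."""
    _chunkers[name] = chunker


def chunk(elements: List[str], chunking_strategy: str, **kwargs: Any) -> List[str]:
    """Look up the chunker by name and delegate the call to it."""
    chunker = _chunkers.get(chunking_strategy)
    if chunker is None:
        raise ValueError(f"unrecognized chunking strategy {chunking_strategy!r}")
    return chunker(elements, **kwargs)


def chunk_pairs(elements: List[str], **kwargs: Any) -> List[str]:
    """Toy strategy: join every two elements into one chunk."""
    return [" ".join(elements[i : i + 2]) for i in range(0, len(elements), 2)]


register_chunker("pairs", chunk_pairs)
chunks = chunk(["a", "b", "c"], chunking_strategy="pairs")
```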
**Additional Information**
- Move the `@add_chunking_strategy` to the new `chunking.dispatch`
module since it coheres strongly with that operation, but publish it
from `chunking(.__init__)` (as it was before) so users don't couple to
the way we organize the chunking sub-package. Also remove the third
level of nesting as it's unrequired in this case.
- Add unit tests for the `@add_chunking_strategy` decorator which was
previously uncovered by any direct test.
This PR adds grouping functionality to `evaluate.py`.
To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py` or
call `get_mean_grouping(<doctype or connector>, <dataframe or path to
tsv file>, <export directory>, "element_type")`
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
The purpose of this PR is to support using the same type of parameters
as `partition_*()` when using `partition_via_api()`. This PR works
together with `unstructured-api` [PR
#368](https://github.com/Unstructured-IO/unstructured-api/pull/368).
**Note:** This PR will support extracting image blocks ("Image", "Table")
via `partition_via_api()`.
### Summary
- update `partition_via_api()` to convert all list type parameters to
JSON formatted strings before passing them to the unstructured client
SDK
- add a unit test function to test extracting image blocks via
`partition_via_api()`
- add a unit test function to test list type parameters passed to API
via unstructured client sdk
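The list-to-JSON conversion in the first bullet could be sketched like this. The helper name is hypothetical; the real function may handle more cases.

```python
import json
from typing import Any, Dict


def jsonify_list_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: encode list-valued parameters as JSON strings, leave the rest."""
    return {
        key: json.dumps(value) if isinstance(value, list) else value
        for key, value in params.items()
    }


prepared = jsonify_list_params(
    {"strategy": "hi_res", "extract_image_block_types": ["image", "table"]}
)
```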
### Testing
```
from unstructured.partition.api import partition_via_api
elements = partition_via_api(
filename="example-docs/embedded-images-tables.pdf",
api_key="YOUR-API-KEY",
strategy="hi_res",
extract_image_block_types=["image", "table"],
)
image_block_elements = [el for el in elements if el.category == "Image" or el.category == "Table"]
print("\n\n".join([el.metadata.image_mime_type for el in image_block_elements]))
print("\n\n".join([el.metadata.image_base64 for el in image_block_elements]))
```
**Summary**
A pluggable chunking strategy needs its own local set of chunking
options that subclasses a base-class in `unstructured`.
Extract distinct `_ByTitleChunkingOptions` and `_BasicChunkingOptions`
for the existing two chunking strategies and move their
strategy-specific option setting and validation to the respective
subclass.
This was also a good opportunity for us to clean up a few odds and ends
we'd been meaning to.
Might be worth looking at the commits individually as they are cohesive
incremental steps toward the goal.
### Summary
Detects headers and footers when using `partition_pdf` with the fast
strategy. Identifies elements that are positioned in the top or bottom
5% of the page as headers or footers. If no coordinate information is
available, an element won't be detected as a header or footer.
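A minimal sketch of the 5% rule, assuming `y` is measured from the top of the page and that an element must lie entirely inside the band to qualify. The function name and both assumptions are illustrative and may not match the implementation.

```python
from typing import Optional


def header_or_footer(
    top: Optional[float],
    bottom: Optional[float],
    page_height: float,
    margin: float = 0.05,
) -> Optional[str]:
    """Hypothetical: classify an element by its vertical position on the page."""
    # Without coordinates nothing can be detected.
    if top is None or bottom is None:
        return None
    if bottom <= page_height * margin:
        return "Header"
    if top >= page_height * (1 - margin):
        return "Footer"
    return None
```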
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Separate the aggregating functionality of `text_extraction_accuracy` to
a stand-alone function to avoid duplicated eval effort if the granular
level eval is already available.
To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py`
locally
**Summary**
Refactoring as part of `partition_xlsx()` algorithm replacement that was
delayed by some CI challenges.
A separate PR because it is cohesive and relatively independent from the
prior PR.
This PR adds new table evaluation metrics prepared by @leah1985.
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows
TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`
**Summary**
In order to accommodate customized chunkers other than those directly
provided by `unstructured`, some further modularization is necessary
such that a new chunker can be added as a "plug-in" without modifying
the `unstructured` library code.
This PR is the straightforward refactoring required for this process,
such as typing changes. There are also some other small changes we've been
meaning to make, like making all chunking options accept `None` to
represent their default value so the broad field of callers (e.g.
ingest, unstructured-api, SDK) don't need to determine and set default
values for chunking arguments, which leads to diverging defaults.
Isolating these "noisy" but easy to accept changes in this preparatory
PR reduces the noise in the more substantive changes to follow.
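The "`None` means default" convention can be sketched as follows. The class name, the single option shown and the default value of 500 are illustrative assumptions.

```python
from typing import Optional

DEFAULT_MAX_CHARACTERS = 500  # illustrative value only


class ChunkingOptionsSketch:
    """Hypothetical options object that resolves `None` to the default in one place."""

    def __init__(self, max_characters: Optional[int] = None):
        self._max_characters = max_characters

    @property
    def max_characters(self) -> int:
        # Callers may pass None through untouched; the default is decided only here.
        if self._max_characters is None:
            return DEFAULT_MAX_CHARACTERS
        return self._max_characters
```

Because every caller can forward `None`, the default lives in a single location and cannot drift between ingest, the API and the SDKs.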
To make the `catch_overlapping_and_nested_bboxes` and
`identify_overlapping_or_nesting_case` functions more useful, this PR includes
`parent_element` as part of the output.
This allows the user to
- identify the parent element in the overlapping case: `nested {type*}
in {type*}`. Currently, if the element types are the same, an example case
output would be `nested Image in Image`, which is confusing.
- easily identify elements to keep or delete
**Summary**
For whatever reason, the `@add_chunking_strategy` decorator was not
present on `partition_json()`. This broke the only way to accomplish a
"chunking-only" workflow using the REST API. This PR remedies that
problem.