521 Commits

Author SHA1 Message Date
Pluto
df1f7bcd0e
Save table prediction in cells format (#2892)
This pull request allows the table transformer to return predictions in
raw cell representation. This will later be used to save predictions in
a cells format for simpler metrics calculation.

This PR has to be merged after
https://github.com/Unstructured-IO/unstructured-inference/pull/335
2024-04-25 11:14:48 +00:00
Michał Martyniak
2d1923ac7e
Better element IDs - deterministic and document-unique hashes (#2673)
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more tests for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)
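
As a rough illustration of the first bullet (not the library's actual implementation), a deterministic ID can be derived by hashing the four inputs together. The function name, the separator, the choice of SHA-256 and the 32-character truncation are all assumptions made for this sketch.

```python
import hashlib


def element_id(text: str, sequence_number: int, page_number: int, filename: str) -> str:
    """Hypothetical helper: hash the four inputs named in the PR description."""
    # Join the fields with a separator that is unlikely to appear in the data,
    # so different field combinations do not collapse into the same string.
    payload = "\x1f".join([filename, str(page_number), str(sequence_number), text])
    # SHA-256 and the 32-character truncation are assumptions of this sketch.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


first = element_id("Hello", 0, 1, "report.pdf")
again = element_id("Hello", 0, 1, "report.pdf")
other = element_id("Hello", 1, 1, "report.pdf")
```

The same inputs always produce the same ID, while two elements with identical text at different positions on a page get different IDs.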

This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
2024-04-24 00:05:20 -07:00
Dimitri Lozeve
abb0174181
Integration with the Google Cloud Vision API (#2902)
This PR adds a third OCR provider, alongside Tesseract and Paddle: the
[Google Cloud Vision API](https://cloud.google.com/vision).

It can be used similarly to other OCR methods: set the `OCR_AGENT`
environment variable to the path to the OCR module
(`unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`).
You also need to set the credentials to use Google APIs, for instance by
setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
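
A minimal sketch of that setup from Python, assuming the variables are set before any partitioning call is made. The credentials path is a placeholder, not a real file.

```python
import os

# Module path quoted from the PR description above.
os.environ["OCR_AGENT"] = (
    "unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision"
)
# Hypothetical path to a service-account key file; replace with your own.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"
```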

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-04-23 21:11:39 +00:00
Steve Canny
05ff975081
fix: remove unused ElementMetadata.section (#2921)
**Summary**
The `.section` field in `ElementMetadata` is dead code, possibly a
remainder from a prior iteration of `partition_epub()`. In any case, it
is not populated by any partitioner. Remove it and any code that uses
it.
2024-04-22 23:58:17 +00:00
Steve Canny
4dc8327149
rfctr(pptx): make PptxPartitionerOptions public (#2901)
**Summary**
A few additional small, mechanical odds and ends required for PPTX image
extraction.

The big one is removing the leading underscore from
`PptxPartitionerOptions` because now client code that implements a
custom Picture-shape sub-partitioner will need to reference this class.
2024-04-19 04:50:06 +00:00
Christine Straub
ac5048bf30
enhancement: remove duplicate embedded images (#2897)
This PR aims to remove duplicate embedded images taken by `PDFminer`.

### Summary
- add `clean_pdfminer_duplicate_image_elements()` to remove embedded
images with similar `bboxes` and the same `text`
- add env_config `EMBEDDED_IMAGE_SAME_REGION_THRESHOLD` to consider the
bounding boxes of two embedded images as the same region
- refactor: reorganize `clean_pdfminer_inner_elements()`
2024-04-18 23:07:47 +00:00
Michał Martyniak
001fa17c86
Preparing the foundation for better element IDs (#2842)
Part one of the issue described here:
https://github.com/Unstructured-IO/unstructured/issues/2461

It does not change how the hashing algorithm works, just reworks how IDs
are assigned:
> Element ID Design Principles
> 
> 1. A partitioning function can assign only one of two available ID
types to a returned element: a hash or UUID.
> 2. All elements that are returned come with an ID, which is never
None.
> 3. No matter which type of ID is used, it will always be in string
format.
> 4. Partitioning a document returns elements with hashes as their
default IDs.

Big thanks to @scanny for explaining the current design and suggesting
ways to do it right, especially with chunking.


Here's the next PR in line:
https://github.com/Unstructured-IO/unstructured/pull/2673

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>
2024-04-16 21:14:53 +00:00
Michał Martyniak
cb1e91058e
Introduce start_page argument to partitioning functions that assign element.metadata.page_number (#2884)
This small change will be useful for users who partition only fragments
of their PDF documents.
It's a small step towards addressing this issue:
https://github.com/Unstructured-IO/unstructured/issues/2461

Related PRs:
* https://github.com/Unstructured-IO/unstructured/pull/2842
* https://github.com/Unstructured-IO/unstructured/pull/2673
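
The effect of `start_page` can be pictured with a small stand-alone sketch. `number_pages` is a hypothetical helper, not part of `unstructured`; it only shows how page numbers shift when a fragment does not begin at page 1.

```python
from typing import Iterator, List, Tuple


def number_pages(pages: List[str], start_page: int = 1) -> Iterator[Tuple[int, str]]:
    """Hypothetical helper: pair each page with a number beginning at start_page."""
    return enumerate(pages, start=start_page)


# A fragment holding pages 5 to 7 of a longer document.
numbered = list(number_pages(["page a", "page b", "page c"], start_page=5))
```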
2024-04-15 21:03:42 +00:00
MiXiBo
0506aff788
add support for start_index in html links extraction (#2600)
add support for start_index in html links extraction (closes #2625)

Testing
```
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json


html_text = """<html>
        <p>Hello there I am a <a href="/link">very important link!</a></p>
        <p>Here is a list of my favorite things</p>
        <ul>
            <li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li>
            <li>Dogs</li>
        </ul>
        <a href="/loner">A lone link!</a>
    </html>"""

elements = partition_html(text=html_text)
print(elements_to_json(elements))
```

---------

Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-04-12 06:14:20 +00:00
Steve Canny
3e643c4cb3
feat(pptx): add pluggable PPTX Picture sub-partitioner (#2880)
**Summary**
Delegate partitioning of PPTX Picture (image, to a first approximation)
shapes to a distinct sub-partitioner and allow the default picture
sub-partitioner to be replaced at run-time by one of the user's
choosing.
2024-04-12 06:00:01 +00:00
Steve Canny
2cba949f18
feat(pptx): partition_pptx() accepts strategy arg (#2879)
**Summary**
As we move to adding pluggable sub-partitioners, `partition_pptx()` will
need to become sensitive to the `strategy` argument, in particular when
it is set to "hi_res". Up until now there were no expensive operations
(inference, OCR, etc.) incurred while partitioning PPTX so this argument
was ignored.

After this PR, `partition_pptx()` still won't do anything with that
value, other than pass it along to `_PptxPartitionerOptions` for
safe-keeping, but now it's ready for use by a `PicturePartitioner` (to
come in a subsequent PR).
2024-04-11 22:36:16 +00:00
Christine Straub
4656b8cbe5
Fix: partition_html() partially extracts text (#2852)
Closes #2362.

Previously, when an HTML document contained a `div` with a nested tag, e.g. a
`<b>` or `<span>`, the element created from the `div` contained only the
text up to the inline element. This PR adds support for extracting text
from tag tails in HTML.
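
For readers unfamiliar with "tails": in ElementTree-style parsers, text that follows a child element's closing tag is stored on that child's `.tail`, not on the parent's `.text`. The sketch below uses the standard-library `xml.etree.ElementTree` purely for illustration; it is not the code path `partition_html()` uses.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring("<div>issues shares at $<span>5.22</span> per share.</div>")
span = div.find("span")

# The parent's .text stops at the first child element.
head_only = div.text
# The remainder of the sentence lives on the child's .tail.
tail_text = span.tail
# itertext() walks text and tails in document order.
full_text = "".join(div.itertext())
```

Reading only `div.text` reproduces the truncated output described above, while including tails recovers the whole sentence.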

### Testing
```
from unstructured.partition.html import partition_html

html_text = """
<html>
<body>
    <div>
        the Company issues shares at $<div style="display:inline;"><span>5.22</span></div> per share. There is more text
    </div>
</body>
</html>
"""

elements = partition_html(text=html_text)
print(''.join([str(el).strip() for el in elements]))
```

**Expected behavior**
```
the Company issues shares at $5.22per share. There is more text
```
2024-04-08 19:18:55 +00:00
Steve Canny
2c7e0289aa
rfctr(pptx): extract _PptxPartitionerOptions (#2853)
**Reviewers:** Likely quicker to review commit-by-commit.

**Summary**

In preparation for adding a PPTX `Picture` shape _sub-partitioner_,
extract management of PPTX partitioning-run options to a separate
`_PptxPartitionerOptions` object similar to those used in chunking and
XLSX partitioning. This provides several benefits:
- Extract code dealing with applying defaults and computing derived
values from the main partitioning code, leaving it less cluttered and
focused on the partitioning algorithm itself.
- Allow the options set to be passed to helper objects, prominently
including sub-partitioners, without requiring a long list of parameters
or requiring the caller to couple itself to the particular option values
the helper object requires.
- Allow options behaviors to be thoroughly and efficiently tested in
isolation.
2024-04-08 19:01:03 +00:00
Christine Straub
a9b6506724
Fix: partition_html() fails parsing simple html (#2849)
Closes #2520.

Previously, `partition_html()` did not extract text from `<b>` tags
inside container tags (like `<div>`, `<pre>`). This PR provides support
for extracting text from `<b>` tags inside container tags.

### Testing
```
from unstructured.partition.html import partition_html

html_text = """
<!DOCTYPE html>
<html>
<head>
 <title>A page</title>
</head>
<body>
<div>
    <h1>Header 1</h1>
    <p>Text </p>
    <h2>Header 2</h2>
    <pre><b>Param1</b> = Y<br><b>Param2</b> = 1<br><b>Param3</b> = 2<br><b>Param4</b> = A
    <br><b>Param5</b> = A,B,C,D,E<br><b>Param6</b> = 7<br><b>Param7</b> = Five<br></pre>
</div>
</body>
</html>
"""

elements = partition_html(text=html_text)
print("\n\n".join([str(el) for el in elements]))
```

**Expected behavior**
```
Header 1

Text

Header 2

Param1 = Y

Param2 = 1

Param3 = 2

Param4 = A

Param5 = A,B,C,D,E

Param6 = 7

Param7 = Five
```
2024-04-08 18:09:41 +00:00
Pawel Kmiecik
63fc2a1061
feat: element types extension (#2700)
This PR adds some new element types that can be used especially by
pdf/image partition.
2024-04-04 07:49:55 +00:00
Steve Canny
1ce60f2bba
rfctr(xlsx): extract _XlsxPartitionerOptions (#2838)
**Summary**
As an initial step in reducing the complexity of the monolithic
`partition_xlsx()` function, extract all argument-handling to a separate
`_XlsxPartitionerOptions` object which can be fully covered by isolated
unit tests.
    
**Additional Context**
This code was from a prior XLSX bug-fix branch that did not get
committed because of time constraints. I wanted to revisit it here
because I need the benefits of this as part of some new work on PPTX
that will require a separate options object that can be passed to
delegate objects.

This approach was incubated in the chunking context and has produced a
lot of opportunities there to decompose the logic into smaller
components that are more understandable and isolated-test-able, without
having to pass an extended list of option values in every sub-call. As
well as decluttering the code, this removes coupling where the caller
needs to know which options a subroutine might need to reference.
2024-04-03 23:27:33 +00:00
Klaijan
8a239b346c
feat: add cleanup fixtures for test_evaluate (#2701)
This PR adds `@pytest.mark.usefixtures("_cleanup_after_test")` to the
tests in `test_evaluate` that do not have it.
2024-04-02 15:10:59 +00:00
Ahmet Melek
d46792214a
feat: add vertexai embeddings (#2693)
This PR:
- Adds VertexAI embeddings as an embedding provider

Testing
- Tested with pinecone destination connector on
[this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693)
job run.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2024-03-28 21:15:36 +00:00
Christine Straub
887e6c9094
refactor: use env_config instead of SUBREGION_THRESHOLD_FOR_OCR constant (#2697)
The purpose of this PR is to introduce a new env_config for the
subregion threshold for OCR.

### Testing
CI should pass.
2024-03-28 20:28:35 +00:00
Christine Straub
08fafc564f
Fix: embedded text not getting merged with inferred elements (#2679)
This PR is the second part of fixing "embedded text not getting merged
with inferred elements", the first part is done in
https://github.com/Unstructured-IO/unstructured-inference/pull/331.

### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when removing pdfminer (embedded) elements that were merged with
inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
introduced in the [first
part](https://github.com/Unstructured-IO/unstructured-inference/pull/331)
when removing pdfminer (embedded) elements that were merged with
inferred elements
- bump `unstructured-inference` to 0.7.25

### Testing
PDF:
[pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)

```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```

```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="pwc-financial-statements-p114.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image"],
)

table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)
```

---------

Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-03-23 03:59:23 +00:00
Steve Canny
56fbaaed10
feat(chunking): add metadata.orig_elements serde (#2680)
**Summary**
This final PR in the "orig_elements" series adds what is needed so that
`.metadata.orig_elements`, when present on a chunk (element), is
serialized to JSON when the chunk is serialized, for instance, to be
used in an HTTP response payload.

It also provides for deserializing such a JSON payload into chunks that
contain the `.orig_elements` metadata.

**Additional Context**
Note that `.metadata.orig_elements` is always `Optional[list[Element]]`
when in memory. However, those original elements are serialized as
Base64-encoded gzipped JSON and are in that form (str) when present as
JSON or as "element-dicts" which is an intermediate
serialization/deserialization format. That is, serialization is `Element
-> dict -> JSON` and deserialization is `JSON -> dict -> Element` and
`.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms.
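
The round trip described above can be sketched with the standard library. The helper names are hypothetical and the use of the `gzip` module is an assumption taken from the wording "gzipped"; the library's own serde functions may use a different compression routine.

```python
import base64
import gzip
import json
from typing import Any, Dict, List


def encode_orig_elements(element_dicts: List[Dict[str, Any]]) -> str:
    """Hypothetical helper: element-dicts -> JSON -> gzip -> Base64 string."""
    raw = json.dumps(element_dicts).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decode_orig_elements(payload: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: reverse of encode_orig_elements()."""
    raw = gzip.decompress(base64.b64decode(payload))
    return json.loads(raw.decode("utf-8"))


original = [{"type": "Title", "text": "Intro"}, {"type": "NarrativeText", "text": "Body"}]
encoded = encode_orig_elements(original)
decoded = decode_orig_elements(encoded)
```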

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-03-22 21:53:26 +00:00
Klaijan
fd8b682194
fix: mean group add param (#2684)
2024-03-22 15:16:23 +00:00
Filip Knefel
bdfd975115
chore: change table extraction defaults (#2588)
Change default values for table extraction - works in pair with
[this](https://github.com/Unstructured-IO/unstructured-api/pull/370)
`unstructured-api` PR

We want to move away from `pdf_infer_table_structure` parameter, in this
PR:
- We change how it's treated wrt `skip_infer_table_types` parameter.
Whether to extract tables from pdf now follows from the rule:
`pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set it to `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]` by default
- We remove it from the examples in documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in
documentation

More detailed description of how we want parameters to interact
- if `pdf_infer_table_structure` is False tables will never be extracted
from pdf
- if `pdf_infer_table_structure` is True tables will be extracted from
pdf unless it's skipped via `skip_infer_table_types`
- by default, `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`
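
The interaction rule above is small enough to write out. `should_extract_pdf_tables` is a hypothetical name used only for this sketch; the parameter names and defaults are the ones given in the description.

```python
from typing import List, Optional


def should_extract_pdf_tables(
    pdf_infer_table_structure: bool = True,
    skip_infer_table_types: Optional[List[str]] = None,
) -> bool:
    """Hypothetical helper mirroring the rule stated in the PR description."""
    # None stands in for the documented default of an empty list.
    skip = skip_infer_table_types if skip_infer_table_types is not None else []
    return pdf_infer_table_structure and "pdf" not in skip
```

With the new defaults the function returns `True`, and either setting `pdf_infer_table_structure=False` or adding `"pdf"` to `skip_infer_table_types` turns extraction off.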

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-22 10:08:49 +00:00
Steve Canny
31bef433ad
rfctr: prepare to add orig_elements serde (#2668)
**Summary**
The serialization and deserialization (serde) of
`metadata.orig_elements` will be located in `unstructured.staging.base`
alongside `elements_to_json()` and other existing serde functions.
Improve the typing, readability, and structure of that module before
adding the new serde functions for `metadata.orig_elements`.

**Reviewers:** The commits are well-groomed and are probably quicker to
review commit-by-commit than as all files-changed at once.
2024-03-20 21:27:59 +00:00
John
9ac4445e74
refactor title.py (#2657)
Minor refactor after conversation with @scanny

Updates docstring and how chunking options are accessed.
`self._kwargs.get()` should only be used in the `lazyproperty`
definition of an instance's attribute. Other calls should use
`self.<attribute>`
2024-03-19 17:48:23 +00:00
Yao You
2eb0b25e0d
Feat: single table structure eval metric (#2655)
Creates a composite metric to represent the table structure score. It is
the average of the existing row and column index and content scores.

This PR adds a new property to
`unstructured.metrics.table_eval.TableEvaluation`:
`composite_structure_acc`, which is computed from the element level row
and column index and content accuracy scores. This new metric is meant
to offer a single number to represent the performance of table structure
extraction model/algorithms.
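
A minimal sketch of such a composite, assuming it is a plain unweighted mean of the four element-level scores. The function below is illustrative and is not the `TableEvaluation` property itself.

```python
from statistics import mean


def composite_structure_acc(
    col_index_acc: float,
    row_index_acc: float,
    col_content_acc: float,
    row_content_acc: float,
) -> float:
    """Assumed definition: unweighted mean of the four element-level scores."""
    return mean([col_index_acc, row_index_acc, col_content_acc, row_content_acc])


score = composite_structure_acc(1.0, 0.5, 0.75, 0.75)
```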

This PR also refactors the eval computation logic so it uses a constant
`table_eval_metrics` instead of hard coding the name of the metrics in
multiple places in the code.

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2024-03-19 15:15:32 +00:00
Steve Canny
1af41d5f90
feat(chunking): add .orig_elements behavior to chunking (#2656)
**Summary**
Add the actual behavior to populate `.metadata.orig_elements` during
chunking, when so instructed by the `include_orig_elements` option.

**Additional Context**
The underlying structures to support this, namely the
`.metadata.orig_elements` field and the `include_orig_elements` chunking
option, were added in closely prior PRs. This PR adds the behavior to
actually populate that metadata field during chunking when the option is
set.
2024-03-18 19:27:39 +00:00
Filip Knefel
6af6604057
feat: introduce date_from_file_object parameter to partitions (#2563)
Introduce `date_from_file_object` to `partition*` functions, by default
set to `False`.
If set to `True` and the file is provided via the `file` parameter, the
partitioner will attempt to infer the last-modified date from `file`'s
contents; otherwise the last-modified metadata will be set to `None`.

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-18 01:09:44 +00:00
Klaijan
ccda40f750
feat: grouping eval takes list of filenames (#2635)
Add a feature to `get_mean_grouping` that allows the input to be a list
of filenames, given either as a list of strings or as a txt file.

---------

Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-17 17:19:55 +00:00
Steve Canny
137ea67336
feat(chunking): add include_orig_elements chunking option (#2649)
**Summary**
Add `include_orig_elements: bool = True` as a new chunking option. This
PR does not implement _adding_ original elements to chunks, only
accepting this parameter as a chunking option and assigning `True` to it
as a default value when it is omitted as a keyword argument.

Note this will need to be added in other repositories as well in order
to fully support this new option by all access methods. In particular it
will need to be added in `unstructured-api` in order to become available
via the SDKs.
2024-03-15 18:48:07 +00:00
Steve Canny
94535e353c
rfctr: prepare for adding metadata.orig_elements field (#2647)
**Summary**
Some typing modernization in `elements.py` which will get changes to add
the `orig_elements` metadata field.

Also some additions to `unit_util.py` to enable simplified mocking that
will be required in the next PR.
2024-03-14 21:31:58 +00:00
John
fe300fe56d
fix: teardown fixture for tests and update pre-commit-config (#2565)
Files were being created as a side effect from running tests in
`test_unstructured/metrics/test_evaluate.py`. The updated decorator
removes the created directory and its files after the tests run.

Testing
On the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch no files are created.
2024-03-12 22:16:39 +00:00
Steve Canny
8ea203adf7
feat(chunking): composite text gets is_continuation (#2639)
**Summary**
Add `metadata.is_continuation = True` to metadata of second-and-later
text-split chunks formed from an oversized non-table element. Previously
this metadata was only present on text-split `TableChunk` elements.

This enables downstream filtering of intentionally redundant metadata on
chunk elements that may not be desired for all purposes.
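
The behavior can be illustrated with a simplified splitter. `split_text` is a hypothetical helper that cuts at fixed character offsets, which is a simplification made only to keep the sketch short and is not how the chunker chooses split points.

```python
from typing import Any, Dict, List


def split_text(text: str, max_characters: int) -> List[Dict[str, Any]]:
    """Hypothetical helper: fixed-width splits, flagging every split after the first."""
    chunks: List[Dict[str, Any]] = []
    for start in range(0, len(text), max_characters):
        chunk: Dict[str, Any] = {"text": text[start : start + max_characters], "metadata": {}}
        if start > 0:
            # Second-and-later splits are marked as continuations.
            chunk["metadata"]["is_continuation"] = True
        chunks.append(chunk)
    return chunks


pieces = split_text("abcdefgh", 3)
```

A downstream consumer could then drop repeated metadata from any piece whose `is_continuation` flag is set.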

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-03-12 19:44:41 +00:00
Yao You
911f9983c1
feat: redefine table level acc (#2620)
This PR redefines the `table_level_acc` metric as follows:
- for each predicted table use sequence matching ratio as its accuracy
- as a prerequisite for the sequence matching we sort the table cells by
row then column for both predicted and ground truth to ensure they are
ordered the same
- average all predicted table accuracy
- any prediction without a matching ground truth (false positive) would
decrease the score
- a prediction that splits a ground truth table into smaller tables would
also have a low score, with perfectly equal splits having the lowest score

This new definition makes the new metric a value between 0 and 1 per
file. This replaces the existing definition, where the metric is defined
as the ratio of (the number of predicted tables that have a match in the
ground truth) to (the number of ground truth tables). This existing metric
actually gives higher values for predictions that split tables and can be higher than
1. The new definition prefers predictions that do not split ground truth
tables.
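
The new definition can be sketched with `difflib.SequenceMatcher` from the standard library. The cell shape, the function names, the way predictions are paired with ground truth, and the score returned for an empty input are assumptions of this sketch, not the evaluation code itself.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Cell = Tuple[int, int, str]  # (row, column, content); a hypothetical cell shape


def table_accuracy(predicted: List[Cell], ground_truth: List[Cell]) -> float:
    """Sequence-matching ratio after sorting both tables by row, then column."""
    order = lambda cell: (cell[0], cell[1])
    pred = [content for _, _, content in sorted(predicted, key=order)]
    truth = [content for _, _, content in sorted(ground_truth, key=order)]
    return SequenceMatcher(None, pred, truth).ratio()


def table_level_acc(pairs: List[Tuple[List[Cell], Optional[List[Cell]]]]) -> float:
    """Average over predicted tables; an unmatched prediction (None) scores 0."""
    scores = [table_accuracy(p, t) if t is not None else 0.0 for p, t in pairs]
    # Returning 0.0 when there are no predictions is an assumption of this sketch.
    return sum(scores) / len(scores) if scores else 0.0


truth = [(0, 0, "a"), (0, 1, "b"), (1, 0, "c"), (1, 1, "d")]
shuffled_prediction = [(1, 1, "d"), (0, 0, "a"), (1, 0, "c"), (0, 1, "b")]
perfect = table_level_acc([(shuffled_prediction, truth)])
with_false_positive = table_level_acc([(shuffled_prediction, truth), (shuffled_prediction, None)])
```

Because every predicted table contributes one term to the average, a false positive pulls the score down and the result stays between 0 and 1.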
2024-03-08 17:00:57 +00:00
Steve Canny
b27ad9b6aa
fix: raises on file-like object with .name not a valid path (#2614)
**Summary**
Fixes: #2308

**Additional context**
Through a somewhat deep call-chain, partitioning a file-like object
(e.g. io.BytesIO) having its `.name` attribute set to a path not
pointing to an actual file on the local filesystem would raise
`FileNotFoundError` when the last-modified date was being computed for
the document.

This scenario is a legitimate partitioning call, where `file.name` is
used downstream to describe the source of, for example, a bytes payload
downloaded from the network.

**Fix**
- explicitly check for the existence of a file at the given path before
accessing it to get its modified date. Return `None` (already a
legitimate return value) when no such file exists.
- Generally clean up the implementations.
- Add unit tests that exercise all cases.
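
The existence check in the first bullet can be sketched as follows. `last_modified_from_file` is a hypothetical name and the ISO-8601 return format is an assumption; the point is only that a `.name` which is not a real path yields `None` instead of raising.

```python
import datetime as dt
import io
import os
from typing import IO, Optional


def last_modified_from_file(file: IO[bytes]) -> Optional[str]:
    """Hypothetical helper: modified date of file.name, or None if no such file."""
    name = getattr(file, "name", None)
    # Check that the path points at a real file before asking for its mtime.
    if not isinstance(name, str) or not os.path.isfile(name):
        return None
    mtime = os.path.getmtime(name)
    return dt.datetime.fromtimestamp(mtime).isoformat(timespec="seconds")


# A bytes payload downloaded from the network, labeled with its source.
payload = io.BytesIO(b"downloaded bytes")
payload.name = "https://example.com/report.pdf"
result = last_modified_from_file(payload)
```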

---------

Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
2024-03-07 19:02:04 +00:00
Pawel Kmiecik
e35306cfc7
fix: table evaluation metrics fix calculations when no tables found in predictions (#2619)
The current way table structure metrics are computed does not cover
cases when no table is found and all stats are empty.

This PR fixes this and adds some hardening tests for the table eval
processor.

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2024-03-07 18:39:19 +00:00
Steve Canny
b59e4b69ce
rfctr: prepare for fix to raises on file-like-object with name not a path to a file (#2617)
**Summary**
Improve typing and other mechanical refactoring in preparation for fix
to issue 2308.
2024-03-06 23:46:54 +00:00
John
b6c1882cc3
chore: add tests and small fixes in utils.py (#2554)
Linting and typing fixes, and add tests to improve test coverage in
utils.py

On the main branch, run `coverage run -m pytest
test_unstructured/test_utils.py` and then `coverage report -m
unstructured/utils.py` to see test coverage for `utils.py`. Check out
this branch and do the same. The percent coverage should increase to 88%.

---------

Co-authored-by: David Potter <potterdavidm@gmail.com>
2024-03-06 21:58:10 +00:00
Steve Canny
4096a38371
rfctr(chunking): extract chunking-strategy dispatch (#2545)
**Summary**
This is the final step in adding pluggable chunking-strategies. It
introduces the `chunk()` function to replace calls to strategy-specific
chunkers in the `@add_chunking_strategy` decorator. The `chunk()`
function then uses a mapping of chunking-strategy names (e.g.
"by_title", "basic") to chunking functions (chunkers) to dispatch the
chunking call. This allows other chunkers to be added at runtime rather
than requiring a code change, which is what makes chunkers "pluggable".
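
A name-to-function dispatch of this kind can be sketched in a few lines. Only `chunk()` and the idea of a strategy-name mapping come from the description; `register_chunker`, the `_chunkers` registry and the toy chunkers are hypothetical and operate on plain strings to keep the example self-contained.

```python
from typing import Callable, Dict, List

Chunker = Callable[[List[str]], List[str]]

# Hypothetical registry mapping strategy names to chunking functions.
_chunkers: Dict[str, Chunker] = {}


def register_chunker(name: str, chunker: Chunker) -> None:
    """Add or replace a chunker at runtime, with no change to chunk() itself."""
    _chunkers[name] = chunker


def chunk(elements: List[str], chunking_strategy: str) -> List[str]:
    """Look up the chunker for the named strategy and delegate to it."""
    try:
        chunker = _chunkers[chunking_strategy]
    except KeyError:
        raise ValueError(f"unknown chunking strategy: {chunking_strategy!r}") from None
    return chunker(elements)


# Toy chunkers standing in for real strategies.
register_chunker("basic", lambda elements: [" ".join(elements)])
register_chunker("by_line", lambda elements: list(elements))
```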

**Additional Information**
- Move the `@add_chunking_strategy` to the new `chunking.dispatch`
module since it coheres strongly with that operation, but publish it
from `chunking(.__init__)` (as it was before) so users don't couple to
the way we organize the chunking sub-package. Also remove the third
level of nesting as it's unrequired in this case.
- Add unit tests for the `@add_chunking_strategy` decorator which was
previously uncovered by any direct test.
2024-03-05 23:19:29 +00:00
Klaijan
3ff6de4f50
refactor: refactor var name for consistency (#2609)
refactor variable name for consistency.
2024-03-05 09:08:25 +00:00
Klaijan
6a4b7a134b
feat: element type accuracy grouping (#2594)
This PR allows grouping functionality in `evaluate.py`.

To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py` or
call `get_mean_grouping(<doctype or connector>, <dataframe or path to
tsv file>, <export directory>, "element_type")`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2024-03-01 15:18:37 +00:00
Christine Straub
ee8b0f93dc
feat: pass list type parameters via client sdk (#2567)
The purpose of this PR is to support using the same type of parameters
as `partition_*()` when using `partition_via_api()`. This PR works
together with `unstructured-api` [PR
#368](https://github.com/Unstructured-IO/unstructured-api/pull/368).

**Note:** This PR will support extracting image blocks("Image", "Table")
via partition_via_api().

### Summary
- update `partition_via_api()` to convert all list type parameters to
JSON formatted strings before passing them to the unstructured client
SDK
- add a unit test function to test extracting image blocks via
`partition_via_api()`
- add a unit test function to test list type parameters passed to API
via unstructured client sdk
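
The conversion in the first bullet amounts to JSON-encoding any list-valued parameter and leaving the rest untouched. `serialize_list_params` is a hypothetical helper written for this sketch, not the function used inside `partition_via_api()`.

```python
import json
from typing import Any, Dict


def serialize_list_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: JSON-encode list values, pass other values through."""
    return {
        key: json.dumps(value) if isinstance(value, list) else value
        for key, value in params.items()
    }


request_params = serialize_list_params(
    {"strategy": "hi_res", "extract_image_block_types": ["Image", "Table"]}
)
```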

### Testing
```
from unstructured.partition.api import partition_via_api

elements = partition_via_api(
    filename="example-docs/embedded-images-tables.pdf",
    api_key="YOUR-API-KEY",
    strategy="hi_res",
    extract_image_block_types=["image", "table"],
)

image_block_elements = [el for el in elements if el.category == "Image" or el.category == "Table"]
print("\n\n".join([el.metadata.image_mime_type for el in image_block_elements]))
print("\n\n".join([el.metadata.image_base64 for el in image_block_elements]))
```
2024-02-26 19:17:06 +00:00
Steve Canny
51cf6bf716
rfctr(chunking): extract strategy-specific chunking options (#2556)
**Summary**
A pluggable chunking strategy needs its own local set of chunking
options that subclasses a base-class in `unstructured`.

Extract distinct `_ByTitleChunkingOptions` and `_BasicChunkingOptions`
for the existing two chunking strategies and move their
strategy-specific option setting and validation to the respective
subclass.

This was also a good opportunity for us to clean up a few odds and ends
we'd been meaning to.

Might be worth looking at the commits individually as they are cohesive
incremental steps toward the goal.
2024-02-23 18:22:44 +00:00
Matt Robinson
b4d9ad8130
enhancement: detect headers in partition_pdf with fast strategy (#2455)
### Summary

Detects headers and footers when using `partition_pdf` with the fast
strategy. Identifies elements that are positioned in the top or bottom
5% of the page as headers or footers. If no coordinate information is
available, an element won't be detected as a header or footer.
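
The 5% rule can be sketched as a small classifier. The function name, the convention that coordinates are measured downward from the top of the page, and the requirement that the whole element sit inside the margin are assumptions of this sketch.

```python
from typing import Optional


def header_or_footer(
    y_top: Optional[float],
    y_bottom: Optional[float],
    page_height: float,
    margin: float = 0.05,
) -> Optional[str]:
    """Hypothetical helper; y values are measured downward from the page top."""
    # Without coordinates nothing can be detected.
    if y_top is None or y_bottom is None:
        return None
    if y_bottom <= page_height * margin:
        return "Header"
    if y_top >= page_height * (1 - margin):
        return "Footer"
    return None
```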

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2024-02-23 16:56:09 +00:00
Klaijan
daaf1775b4
feat: separate evaluate grouping function (#2572)
Separate the aggregating functionality of `text_extraction_accuracy` to
a stand-alone function to avoid duplicated eval effort if the granular
level eval is already available.

To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py`
locally
2024-02-23 05:45:20 +00:00
Steve Canny
d3242fb546
rfctr(xlsx): extract connected components (#2575)
**Summary**
Refactoring as part of `partition_xlsx()` algorithm replacement that was
delayed by some CI challenges.

A separate PR because it is cohesive and relatively independent from the
prior PR.
2024-02-22 22:50:48 +00:00
Pawel Kmiecik
ff9d46f9dc
feat(eval): table evaluation metrics (#2558)
This PR adds new table evaluation metrics prepared by @leah1985 
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows

TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`
2024-02-22 16:35:46 +00:00
Steve Canny
1947375b2e
rfctr(chunking): preparation for plug-in chunkers, Part I (#2550)
**Summary**
In order to accommodate customized chunkers other than those directly
provided by `unstructured`, some further modularization is necessary
such that a new chunker can be added as a "plug-in" without modifying
the `unstructured` library code.

This PR is the straightforward refactoring required for this process
like typing changes. There are also some other small changes we've been
meaning to make like making all chunking options accept `None` to
represent their default value so the broad field of callers (e.g.
ingest, unstructured-api, SDK) don't need to determine and set default
values for chunking arguments leading to diverging defaults.

Isolating these "noisy" but easy to accept changes in this preparatory
PR reduces the noise in the more substantive changes to follow.
2024-02-21 23:16:13 +00:00
erjieyong
4d12c61cb8
added parent_element as output for overlapping cases (#2507)
To provide more utility to the `catch_overlapping_and_nested_bboxes` and
`identify_overlapping_or_nesting_case` functions, included
parent_element as part of the output.

This would allow users to
- identify the parent element in the overlapping case: `nested {type*}
in {type*}`. Currently, if the element types are the same, an example case
output would be `nested Image in Image`, which is confusing.
- easily identify elements to keep or delete
2024-02-21 00:13:09 -08:00
Steve Canny
f1c52c3e3f
fix(json): partition_json() does not chunk (#2564)
**Summary**
For whatever reason, the `@add_chunking_strategy` decorator was not
present on `partition_json()`. This broke the only way to accomplish a
"chunking-only" workflow using the REST API. This PR remedies that
problem.
2024-02-21 01:35:16 +00:00