**Summary**
As an initial step in reducing the complexity of the monolithic
`partition_xlsx()` function, extract all argument-handling to a separate
`_XlsxPartitionerOptions` object which can be fully covered by isolated
unit tests.
**Additional Context**
This code was from a prior XLSX bug-fix branch that did not get
committed because of time constraints. I wanted to revisit it here
because I need the benefits of this as part of some new work on PPTX
that will require a separate options object that can be passed to
delegate objects.
This approach was incubated in the chunking context and has produced a
lot of opportunities there to decompose the logic into smaller
components that are more understandable and isolated-test-able, without
having to pass an extended list of option values in ever sub-call. As
well as decluttering the code, this removes coupling where the caller
needs to know which options a subroutine might need to reference.
Closes#2836.
The `partition_html()` only considers elements with limited depth when
determining if an HTML tag (`etree`) element contains text, to avoid
becoming the text representation of a giant div. This PR increases the
limit value.
The mongodb redact method was created because we wanted part of the url
to be exposed to the user during logging. Thus it did not use the
dataclass `enhanced_field(sensitive=True)` solution.
This changes it to use our standard redacted solution. This also
minimizes the amount of work to be done in platform.
**Summary**
Add an `--include-orig-elements` option to the Ingest CLI to allow users
to specify that corresponding new chunking parameter.
**Reviewer** A lot of this is cleanup, the second commit is where the
actual adding of this option are. The first commit fixes a number of
inaccuracies in the documentation and does some other clean-up.
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
This PR is the second part of fixing "embedded text not getting merged
with inferred elements", the first part is done in
https://github.com/Unstructured-IO/unstructured-inference/pull/331.
### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when removing pdfminer (embedded) elements that were merged with
inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
introduced in the [first
part](https://github.com/Unstructured-IO/unstructured-inference/pull/331)
when removing pdfminer (embedded) elements that were merged with
inferred elements
- bump `unstructured-inference` to 0.7.25
### Testing
PDF:
[pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)
```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```
```
elements = partition_pdf(
filename="pwc-financial-statements-p114.pdf",
strategy="hi_res",
infer_table_structure=True,
extract_image_block_types=["Image"],
)
table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)
```
---------
Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
**Summary**
This final PR in the "orig_elements" series adds the needful such that
`.metadata.orig_elements`, when present on a chunk (element), is
serialized to JSON when the chunk is serialized, for instance, to be
used in an HTTP response payload.
It also provides for deserializing such a JSON payload into chunks that
contain the `.orig_elements` metadata.
**Additional Context**
Note that `.metadata.orig_elements` is always `Optional[list[Element]]`
when in memory. However, those original elements are serialized as
Base64-encoded gzipped JSON and are in that form (str) when present as
JSON or as "element-dicts" which is an intermediate
serialization/deserialization format. That is, serialization is `Element
-> dict -> JSON` and deserialization is `JSON -> dict -> Element` and
`.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms.
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
Change default values for table extraction - works in pair with
[this](https://github.com/Unstructured-IO/unstructured-api/pull/370)
`unstructured-api` PR
We want to move away from `pdf_infer_table_structure` parameter, in this
PR:
- We change how it's treated wrt `skip_infer_table_types` parameter.
Whether to extract tables from pdf now follows from the rule:
`pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set it to `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]` by default
- We remove it from the examples in documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in
documentation
More detailed description of how we want parameters to interact
- if `pdf_infer_table_structure` is False tables will never extracted
from pdf
- if `pdf_infer_table_structure` is True tables will be extracted from
pdf unless it's skipped via `skip_infer_table_types`
- on default `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
### Description
This PR resolved the following open issue:
[bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319).
To do so, the following changes were made:
* All aws configs were added as input parameters to the CLI
* These were mapped to the bedrock embedder when an embedder is
generated via `get_embedder`
* An ingest test was added to call the aws bedrock service
* Requirements for boto were bumped because the first version to
introduce the bedrock runtime, which is required to hit the bedrock
service, was introduced in version `1.34.63`, which was ahead of the
version of boto pinned.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Thanks to @mogith-pn from Clarifai we have a new destination connector!
This PR intends to add Clarifai as a ingest destination connector.
Access via CLI and programmatic.
Documentation and Examples.
Integration test script.
**Summary**
The serialization and deserialization (serde) of
`metadata.orig_elements` will be located in `unstructured.staging.base`
alongside `elements_to_json()` and other existing serde functions.
Improve the typing, readability, and structure of that module before
adding the new serde functions for `metadata.orig_elements`.
**Reviewers:** The commits are well-groomed and are probably quicker to
review commit-by-commit than as all files-changed at once.
Minor refactor after conversation with @scanny
Updates docstring and how chunking options are accessed.
`self._kwargs.get()` should only be used in the `lazyproperty`
definition of an instance's attribute. Other calls should use
`self.<attribute>`
Creates a compounding metric to represent table structure score. It is
an average of existing row and col index and content score.
This PR adds a new property to
`unstructured.metrics.table_eval.TableEvaluation`:
`composite_structure_acc`, which is computed from the element level row
and column index and content accuracy scores. This new metric is meant
to offer a single number to represent the performance of table structure
extraction model/algorithms.
This PR also refactors the eval computation logic so it uses a constant
`table_eval_metrics` instead of hard coding the name of the metrics in
multiple places in the code.
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
**Summary**
Add the actual behavior to populate `.metadata.orig_elements` during
chunking, when so instructed by the `include_orig_elements` option.
**Additional Context**
The underlying structures to support this, namely the
`.metadata.orig_elements` field and the `include_orig_elements` chunking
option, were added in closely prior PRs. This PR adds the behavior to
actually populate that metadata field during chunking when the option is
set.
Introduce `date_from_file_object` to `partition*` functions, by default
set to `False`.
If set to `True` and file is provided via `file` parameter, partition
will attempt to infer last modified date from `file`'s contents
otherwise last modified metadata will be set to `None`.
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Add features to `get_mean_grouping` to allow input as a list of
filenames in the format of List of strings or txt file.
---------
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
**Summary**
Add `include_orig_elements: bool = True` as a new chunking option. This
PR does not implement _adding_ original elements to chunks, only
accepting this parameter as a chunking option and assigning `True` to it
as a default value when it is omitted as a keyword argument.
Note this will need to be added in other repositories as well in order
to fully support this new option by all access methods. In particular it
will need to be added in `unstructured-api` in order to become available
via the SDKs.
JavaScript uses `//` for comments instead of `#`
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Fixes Onedrive bug the same way Ryan fixed the Sharepoint error. (both
are microsoft products)
https://github.com/Unstructured-IO/unstructured/pull/2591https://github.com/Unstructured-IO/unstructured/pull/2592/files
We are seeing occurrences of inconsistency in the timestamps returned by
Onedrive when fetching created and modified dates. Furthermore, in
future versions of this library, a datetime object will be returned
rather than a string.
Changes
This adds logic to guarantee Onedrive dates will be properly formatted
as ISO, regardless of the format provided by the onedrive library.
Bumps timestamp format output to include timezone offset (as we do with
others)
Adds unit tests for isofomat.
json_to_dict already unit tested here:
https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py
Adds small change for AstraDB to allow them to see what source called
their api
**Summary**
Some typing modernization in `elements.py` which will get changes to add
the `orig_elements` metadata field.
Also some additions to `unit_util.py` to enable simplified mocking that
will be required in the next PR.
**Summary**
Use omnibus `kwargs` dict for `ChunkingOptions` state rather than
explicit option parameters.
**Additional Context**
While articulating explicit options for `ChunkingOptions` and its (now
several) sub-classes provides some type-safety, it induces a large
amount of redundancy which complicates updates to the base class and
especially patches to the base class from client code that adds custom
chunkers.
In particular, it makes custom chunkers brittle to any new attributes
added to `ChunkingOptions` (the base class).
Use a single omnibus `kwargs` argument to `ChunkingOptions` and its
subclasses allowing each to pull out the options it is interested in and
happily ignore the rest.
The type safety provided by explicit parameters and types is only
afforded to the single place the options item is called from which is
the custom chunker itself. Because this is "internal" code and not part
of the public interface, this is a manageably small loss in type-safety.
Files were being created as a side effect from running tests in
`test_unstructured/metrics/test_evaluate.py`. The updated decorator
removes the created directory and its files after the tests run.
Testing
on the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch no files are created
**Summary**
Add `metadata.is_continuation = True` to metadata of second-and-later
text-split chunks formed from an oversized non-table element. Previously
this metadata was only present on text-split `TableChunk` elements.
This enables downstream filtering of intentionally redundant metadata on
chunk elements that may not be desired for all purposes.
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
To test:
> cd docs && make html
Changelogs:
* added a note to have **Storage Object Viewer** IAM Role for the GCS
source connector.
* added a note to have **Storage Object Creator** IAM Role for the GCS
destination connector.
## 0.12.6
### Enhancements
* **Improve ability to capture embedded links in `partition_pdf()` for
`fast` strategy** Previously, a threshold value that affects the capture
of embedded links was set to a fixed value by default. This allows users
to specify the threshold value for better capturing.
* **Refactor `add_chunking_strategy` decorator to dispatch by name.**
Add `chunk()` function to be used by the `add_chunking_strategy`
decorator to dispatch chunking call based on a chunking-strategy name
(that can be dynamic at runtime). This decouples chunking dispatch from
only those chunkers known at "compile" time and enables runtime
registration of custom chunkers.
### Features
* **Added Unstructured Platform Documentation** The Unstructured
Platform is currently in beta. The documentation provides how-to guides
for setting up workflow automation, job scheduling, and configuring
source and destination connectors.
### Fixes
* **Partitioning raises on file-like object with `.name` not a local
file path.** When partitioning a file using the `file=` argument, and
`file` is a file-like object (e.g. io.BytesIO) having a `.name`
attribute, and the value of `file.name` is not a valid path to a file
present on the local filesystem, `FileNotFoundError` is raised. This
prevents use of the `file.name` attribute for downstream purposes to,
for example, describe the source of a document retrieved from a network
location via HTTP.
* **Fix SharePoint dates with inconsistent formatting** Adds logic to
conditionally support dates returned by office365 that may vary in date
formatting or may be a datetime rather than a string.
* **Include warnings** about the potential risk of installing a version
of `pandoc` which does not support RTF files + instructions that will
help resolve that issue.
* **Incorporate the `install-pandoc` Makefile recipe** into relevant
stages of CI workflow, ensuring it is a version that supports RTF input
files.
* **Fix Google Drive source key** Allow passing string for source
connector key.
* **Fix table structure evaluations calculations** Replaced special
value `-1.0` with `np.nan` and corrected rows filtering of files metrics
basing on that.
* **Fix Sharepoint-with-permissions test** Ignore permissions metadata,
update test.
* **Fix table structure evaluations for edge case** Fixes the issue when
the prediction does not contain any table - no longer errors in such
case.
This PR redefines the `table_level_acc` metric as follow:
- for each predicted table use sequence matching ratio as its accuracy
- as a prerequisite for the sequence matching we sort the table cells by
row then column for both predicted and ground truth to ensure they are
ordered the same
- average all predicted table accuracy
- any prediction without a matching ground truth (false positive) would
decrease the score
- prediction that splits ground truth into smaller tables would also
have low score with perfectly equal splits having lowest score
This new definition makes the new metric a value between 0 and 1 per
file. This replaces the existing definition where the metric is defined
as (the number of predicted table that has a match to ground truth) to
(the number of ground truth table). This existing metric actually gives
higher values for predictions that splits tables and can be higher than
1. The new definition prefers predictions that do not split ground truth
tables.
The docker-publish github actions workflow builds amd and arm images of
the repository and tests them before publishing. These tests have been
failing since [this
commit](ee8b0f93dc)
with an error `UNS_API_KEY environment variable not set`.
The issue is that [this
line](b27ad9b6aa/.github/workflows/docker-publish.yml (L62))
in the workflow is actually blowing away the value assigned to the file
in the previous line
## Changes
* Update line that was overwriting the assignment of UNS_API_KEY to the
uns_test_env_file in the docker-publish workflow to leverage the `>>`
operator so that UNSTRUCTURED_HF_TOKEN assignment is only appended.
* [bonus]: arithmetic expansion in version-sync.sh to keep shell-check
happy
## Testing
To validate, I edited the docker-publish workflow to trigger on push
(and to run the test but not publish the workflow) in [this
commit](0f04f5f0f7).
The successful test results can be reviewed
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/8199826803).
**Summary**
Fixes: #2308
**Additional context**
Through a somewhat deep call-chain, partitioning a file-like object
(e.g. io.BytesIO) having its `.name` attribute set to a path not
pointing to an actual file on the local filesystem would raise
`FileNotFoundError` when the last-modified date was being computed for
the document.
This scenario is a legitimate partitioning call, where `file.name` is
used downstream to describe the source of, for example, a bytes payload
downloaded from the network.
**Fix**
- explicitly check for the existence of a file at the given path before
accessing it to get its modified date. Return `None` (already a
legitimate return value) when no such file exists.
- Generally clean up the implementations.
- Add unit tests that exercise all cases.
---------
Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
The current way table structure metrics are computed does not cover
cases when none table is found and all stats are empty.
This PR fixes this + adds some hardenning tests for table eval
processor.
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
### Description
The only real change here is adding this line:
```python
apply_name_overload=apply_name_overload,
```
Everything else is from running `make tidy`
We are activating to configure the annotation threshold for links as an
optional parameter.
The reason for the change is that we ran into issues extracting simple
text links from PDF documents that were created with MS Word. The sample
PDF from unstructured worked with a default value of 0.9, and the PDF
generated with Word resulted in a threshold of approx 0.67.
We do use unstructured in together with langchain within an automated
container deployment and to access by default the setting
'annotation_threshold' (refactored from 'threshold') can be very
helpful.
---------
Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Linting and typing fixes, and add tests to improve test coverage in
utils.py
On the main branch, run `coverage run -m pytest
test_unstructured/test_utils.py` and then `coverage report -m
unstructured/utils.py` to see test coverage for `utils.py`. Check out to
this branch and do the same. The percent coverage should increase to 88%
---------
Co-authored-by: David Potter <potterdavidm@gmail.com>
### Description
Currently the requirements associated with an extra in the `setup.py` is
being dynamically generated using the `load_requirements()` method in
the same file. This is being passed in all the `.in` files which then
get read line by line to generate the requirements associated with an
extra. Unless the `.in` file itself has a version pin, this will never
respect the `.txt` files being generated by `pip-compile`. This fix
updates all the inputs to `load_requirements()` to use the `.txt` files
themselves.
Adding `metadata.data_source.permissions_data` to
sharepoint-with-permissions.sh --metadata-exclude to prevent sharepoint
deprecation warning from ruining test.
Updating expected-structured-output
As per Ahmet's comment. We do want to check sharepoint permissions
metadata at some point. But that will take a separate type of test. A
file diff test is too unstable. Permissions checking will be later down
the road.
**Summary**
This is the final step in adding pluggable chunking-strategies. It
introduces the `chunk()` function to replace calls to strategy-specific
chunkers in the `@add_chunking_strategy` decorator. The `chunk()`
function then uses a mapping of chunking-strategy names (e.g.
"by_title", "basic") to chunking functions (chunkers) to dispatch the
chunking call. This allows other chunkers to be added at runtime rather
than requiring a code change, which is what "pluggable" chunkers is.
**Additional Information**
- Move the `@add_chunking_strategy` to the new `chunking.dispatch`
module since it coheres strongly with that operation, but publish it
from `chunking(.__init__)` (as it was before) so users don't couple to
the way we organize the chunking sub-package. Also remove the third
level of nesting as it's unrequired in this case.
- Add unit tests for the `@add_chunking_strategy` decorator which was
previously uncovered by any direct test.