Minor refactor after conversation with @scanny
Updates docstring and how chunking options are accessed.
`self._kwargs.get()` should only be used in the `lazyproperty`
definition of an instance's attribute. Other calls should use
`self.<attribute>`
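A minimal sketch of the convention, assuming a kwargs-backed options class like the one described later in this log (the class shape, attribute name, and default are illustrative):

```python
from unstructured.utils import lazyproperty


class ChunkingOptions:
    def __init__(self, **kwargs):
        self._kwargs = kwargs

    @lazyproperty
    def max_characters(self) -> int:
        """The only place `self._kwargs.get()` should appear for this option."""
        return self._kwargs.get("max_characters") or 500

    def _validate(self) -> None:
        # Everywhere else, read the option through the attribute.
        if self.max_characters <= 0:
            raise ValueError("'max_characters' must be > 0")
```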
Creates a composite metric to represent the table structure score. It is
an average of the existing row and column index and content accuracy scores.
This PR adds a new property to
`unstructured.metrics.table_eval.TableEvaluation`:
`composite_structure_acc`, which is computed from the element-level row
and column index and content accuracy scores. This new metric is meant
to offer a single number to represent the performance of table-structure
extraction models/algorithms.
This PR also refactors the eval computation logic so it uses a constant
`table_eval_metrics` instead of hard-coding the metric names in
multiple places in the code.
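A sketch of the arithmetic, assuming the four element-level scores are already computed (the function shape is illustrative, not the actual `TableEvaluation` code):

```python
import numpy as np


def composite_structure_acc(
    row_index_acc: float,
    col_index_acc: float,
    row_content_acc: float,
    col_content_acc: float,
) -> float:
    """Single-number table-structure score: the mean of the four scores."""
    return float(np.mean([row_index_acc, col_index_acc, row_content_acc, col_content_acc]))
```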
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
**Summary**
Add the actual behavior to populate `.metadata.orig_elements` during
chunking, when so instructed by the `include_orig_elements` option.
**Additional Context**
The underlying structures to support this, namely the
`.metadata.orig_elements` field and the `include_orig_elements` chunking
option, were added in closely prior PRs. This PR adds the behavior to
actually populate that metadata field during chunking when the option is
set.
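A sketch of what this enables for a caller, assuming the chunker accepts the option as described (the file name is hypothetical):

```python
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.docx import partition_docx

elements = partition_docx(filename="report.docx")
chunks = chunk_by_title(elements, include_orig_elements=True)

# Each chunk now carries the pre-chunking elements it was formed from.
for chunk in chunks:
    print([e.category for e in chunk.metadata.orig_elements])
```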
Introduce `date_from_file_object` to `partition*` functions, set to
`False` by default.
If set to `True` and a file is provided via the `file` parameter,
partition will attempt to infer the last-modified date from `file`'s
contents; otherwise the last-modified metadata will be set to `None`.
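A sketch of the opt-in behavior (the file name is hypothetical):

```python
import io

from unstructured.partition.docx import partition_docx

with open("report.docx", "rb") as f:
    file = io.BytesIO(f.read())

# Without date_from_file_object=True, metadata.last_modified stays None here.
elements = partition_docx(file=file, date_from_file_object=True)
print(elements[0].metadata.last_modified)
```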
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Add features to `get_mean_grouping` to allow input as a list of
filenames, provided either as a list of strings or as a `.txt` file, as
sketched below.
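A sketch of the two new input forms, assuming the `get_mean_grouping` call shape shown later in this log (file names and argument order are illustrative):

```python
from unstructured.metrics.evaluate import get_mean_grouping

# Filenames supplied as a list of strings...
get_mean_grouping("doctype", ["a.pdf.txt", "b.docx.txt"], "export_dir", "element_type")

# ...or as a path to a .txt file listing one filename per line.
get_mean_grouping("doctype", "filenames.txt", "export_dir", "element_type")
```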
---------
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
**Summary**
Add `include_orig_elements: bool = True` as a new chunking option. This
PR does not implement _adding_ original elements to chunks, only
accepting this parameter as a chunking option and assigning `True` to it
as a default value when it is omitted as a keyword argument.
Note that this option will also need to be added in other repositories
in order to be fully supported by all access methods. In particular, it
will need to be added in `unstructured-api` in order to become available
via the SDKs.
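A sketch of what this PR covers: the option is accepted and defaulted, nothing more (populating the field lands in a separate PR, described above):

```python
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.text import partition_text

elements = partition_text(text="A title\n\nSome body text.")

# Accepted as a chunking option; True is assumed when the argument is
# omitted. Actually adding original elements to chunks comes later.
chunks = chunk_by_title(elements, include_orig_elements=True)
```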
JavaScript uses `//` for comments instead of `#`
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Fixes a OneDrive bug the same way Ryan fixed the SharePoint error (both
are Microsoft products):
https://github.com/Unstructured-IO/unstructured/pull/2591
https://github.com/Unstructured-IO/unstructured/pull/2592/files
We are seeing occurrences of inconsistency in the timestamps returned by
OneDrive when fetching created and modified dates. Furthermore, in
future versions of this library, a datetime object will be returned
rather than a string.
## Changes
- This adds logic to guarantee OneDrive dates will be properly formatted
as ISO, regardless of the format provided by the OneDrive library.
- Bumps the timestamp format output to include the timezone offset (as
we do with others).
- Adds unit tests for isoformat handling.
`json_to_dict` is already unit tested here:
https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py
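A sketch of the kind of normalization described, assuming `dateutil` handles the inconsistent string formats (the function name is hypothetical):

```python
from datetime import datetime

from dateutil import parser


def ensure_isoformat(value) -> str:
    """Return an ISO 8601 string with timezone offset from either a string
    or a datetime, whichever the OneDrive client happens to return."""
    dt = value if isinstance(value, datetime) else parser.parse(value)
    return dt.astimezone().isoformat()
```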
Adds a small change for AstraDB to allow them to see what source called
their API.
**Summary**
Some typing modernization in `elements.py`, which will receive changes
to add the `orig_elements` metadata field.
Also some additions to `unit_util.py` to enable simplified mocking that
will be required in the next PR.
**Summary**
Use omnibus `kwargs` dict for `ChunkingOptions` state rather than
explicit option parameters.
**Additional Context**
While articulating explicit options for `ChunkingOptions` and its (now
several) sub-classes provides some type-safety, it induces a large
amount of redundancy which complicates updates to the base class and
especially patches to the base class from client code that adds custom
chunkers.
In particular, it makes custom chunkers brittle to any new attributes
added to `ChunkingOptions` (the base class).
Using a single omnibus `kwargs` argument for `ChunkingOptions` and its
subclasses allows each to pull out the options it is interested in and
happily ignore the rest.
The type-safety provided by explicit parameters and types is only
afforded at the single place the options object is constructed, which is
the custom chunker itself. Because this is "internal" code and not part
of the public interface, this is a manageably small loss in type-safety.
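A sketch of the pattern, reusing the hypothetical kwargs-backed base class sketched near the top of this log (names are illustrative):

```python
from unstructured.utils import lazyproperty


class CustomChunkingOptions(ChunkingOptions):  # base class as sketched earlier
    """A plug-in chunker's options: pull out only what you need; options
    added to the base class later pass through `kwargs` untouched."""

    @lazyproperty
    def merge_headers(self) -> bool:
        return bool(self._kwargs.get("merge_headers", False))
```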
Files were being created as a side effect from running tests in
`test_unstructured/metrics/test_evaluate.py`. The updated decorator
removes the created directory and its files after the tests run.
## Testing
On the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch, no files are created.
**Summary**
Add `metadata.is_continuation = True` to metadata of second-and-later
text-split chunks formed from an oversized non-table element. Previously
this metadata was only present on text-split `TableChunk` elements.
This enables downstream filtering of intentionally redundant metadata on
chunk elements that may not be desired for all purposes.
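A sketch of the downstream filtering this enables (the sample text is hypothetical; `is_continuation` is unset, not `False`, on ordinary chunks):

```python
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.text import partition_text

elements = partition_text(text="A heading\n\n" + "word " * 500)
chunks = chunk_by_title(elements, max_characters=200)

# `is_continuation` is True only on second-and-later splits of an oversized
# element -- now for plain text as well as tables.
lead_chunks = [c for c in chunks if not c.metadata.is_continuation]
```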
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
To test:
> cd docs && make html
Changelogs:
* added a note to have **Storage Object Viewer** IAM Role for the GCS
source connector.
* added a note to have **Storage Object Creator** IAM Role for the GCS
destination connector.
## 0.12.6
### Enhancements
* **Improve ability to capture embedded links in `partition_pdf()` for
`fast` strategy** Previously, a threshold value that affects the capture
of embedded links was fixed by default. This change allows users to
specify the threshold value for better capture of embedded links.
* **Refactor `add_chunking_strategy` decorator to dispatch by name.**
Add `chunk()` function to be used by the `add_chunking_strategy`
decorator to dispatch chunking call based on a chunking-strategy name
(that can be dynamic at runtime). This decouples chunking dispatch from
only those chunkers known at "compile" time and enables runtime
registration of custom chunkers.
### Features
* **Added Unstructured Platform Documentation** The Unstructured
Platform is currently in beta. The documentation provides how-to guides
for setting up workflow automation, job scheduling, and configuring
source and destination connectors.
### Fixes
* **Partitioning raises on file-like object with `.name` not a local
file path.** When partitioning a file using the `file=` argument, and
`file` is a file-like object (e.g. io.BytesIO) having a `.name`
attribute, and the value of `file.name` is not a valid path to a file
present on the local filesystem, `FileNotFoundError` is raised. This
prevents use of the `file.name` attribute for downstream purposes to,
for example, describe the source of a document retrieved from a network
location via HTTP.
* **Fix SharePoint dates with inconsistent formatting** Adds logic to
conditionally support dates returned by office365 that may vary in date
formatting or may be a datetime rather than a string.
* **Include warnings** about the potential risk of installing a version
of `pandoc` which does not support RTF files + instructions that will
help resolve that issue.
* **Incorporate the `install-pandoc` Makefile recipe** into relevant
stages of CI workflow, ensuring it is a version that supports RTF input
files.
* **Fix Google Drive source key** Allow passing string for source
connector key.
* **Fix table structure evaluation calculations** Replaced the special
value `-1.0` with `np.nan` and corrected the filtering of file-metric
rows based on that.
* **Fix Sharepoint-with-permissions test** Ignore permissions metadata,
update test.
* **Fix table structure evaluations for edge case** Fixes the issue
where the prediction contains no tables - no longer errors in such
a case.
This PR redefines the `table_level_acc` metric as follows:
- for each predicted table, use the sequence-matching ratio as its accuracy
- as a prerequisite for the sequence matching, we sort the table cells by
row then column for both predicted and ground-truth tables to ensure they
are ordered the same
- average the accuracy over all predicted tables
- any prediction without a matching ground truth (a false positive)
decreases the score
- a prediction that splits a ground-truth table into smaller tables also
scores low, with perfectly equal splits scoring lowest
This new definition makes the metric a value between 0 and 1 per
file. It replaces the existing definition, where the metric was the ratio
of (the number of predicted tables with a match to ground truth) to (the
number of ground-truth tables). That existing metric actually gives
higher values to predictions that split tables and can exceed 1. The new
definition prefers predictions that do not split ground-truth tables.
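A sketch of the per-table scoring, assuming cells carry row/column indices and content (the cell dict keys are hypothetical):

```python
from difflib import SequenceMatcher


def predicted_table_acc(predicted_cells: list, ground_truth_cells: list) -> float:
    """Sequence-matching ratio in [0, 1] for one predicted table against its
    matched ground truth, after sorting both by row then column."""

    def ordered_contents(cells):
        return [c["content"] for c in sorted(cells, key=lambda c: (c["row"], c["col"]))]

    return SequenceMatcher(
        None, ordered_contents(predicted_cells), ordered_contents(ground_truth_cells)
    ).ratio()
```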
The docker-publish github actions workflow builds amd and arm images of
the repository and tests them before publishing. These tests have been
failing since [this
commit](ee8b0f93dc)
with an error `UNS_API_KEY environment variable not set`.
The issue is that [this
line](b27ad9b6aa/.github/workflows/docker-publish.yml (L62))
in the workflow was blowing away the value assigned to the file
in the previous line.
## Changes
* Update the line that was overwriting the assignment of `UNS_API_KEY` to
the uns_test_env_file in the docker-publish workflow to use the `>>`
operator, so that the `UNSTRUCTURED_HF_TOKEN` assignment is only appended.
* [bonus]: arithmetic expansion in version-sync.sh to keep shellcheck
happy
## Testing
To validate, I edited the docker-publish workflow to trigger on push
(and to run the tests but not publish) in [this
commit](0f04f5f0f7).
The successful test results can be reviewed
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/8199826803).
**Summary**
Fixes: #2308
**Additional context**
Through a somewhat deep call-chain, partitioning a file-like object
(e.g. io.BytesIO) having its `.name` attribute set to a path not
pointing to an actual file on the local filesystem would raise
`FileNotFoundError` when the last-modified date was being computed for
the document.
This scenario is a legitimate partitioning call, where `file.name` is
used downstream to describe the source of, for example, a bytes payload
downloaded from the network.
**Fix**
- explicitly check for the existence of a file at the given path before
accessing it to get its modified date. Return `None` (already a
legitimate return value) when no such file exists.
- Generally clean up the implementations.
- Add unit tests that exercise all cases.
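A sketch of the fix's shape (the actual helper lives in the partitioning internals and may differ):

```python
import os
from datetime import datetime
from typing import Optional


def last_modified_date(filename: str) -> Optional[str]:
    """ISO-formatted last-modified date, or None when `filename` does not
    name a file on the local filesystem (e.g. it describes a network source)."""
    if not os.path.isfile(filename):
        return None
    return datetime.fromtimestamp(os.path.getmtime(filename)).isoformat()
```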
---------
Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
The current way table structure metrics are computed does not cover
cases where no table is found and all stats are empty.
This PR fixes this and adds some hardening tests for the table eval
processor.
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
### Description
The only real change here is adding this line:
```python
apply_name_overload=apply_name_overload,
```
Everything else is from running `make tidy`.
This change makes the annotation threshold for links configurable as an
optional parameter.
The reason for the change is that we ran into issues extracting simple
text links from PDF documents that were created with MS Word. The sample
PDF from unstructured worked with the default value of 0.9, while the
PDF generated with Word required a threshold of approximately 0.67.
We use unstructured together with LangChain within an automated
container deployment, and being able to set `annotation_threshold`
(refactored from `threshold`) is very helpful.
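A sketch of the intended usage (the file name is hypothetical; the threshold a given document needs will vary):

```python
from unstructured.partition.pdf import partition_pdf

# Word-generated PDFs may need a lower threshold than the 0.9 default
# for their link annotations to be captured.
elements = partition_pdf("word-generated.pdf", strategy="fast", annotation_threshold=0.6)
```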
---------
Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Linting and typing fixes, and add tests to improve test coverage in
utils.py
On the main branch, run `coverage run -m pytest
test_unstructured/test_utils.py` and then `coverage report -m
unstructured/utils.py` to see test coverage for `utils.py`. Check out
this branch and do the same. The percent coverage should increase to 88%.
---------
Co-authored-by: David Potter <potterdavidm@gmail.com>
### Description
Currently the requirements associated with an extra in `setup.py` are
dynamically generated using the `load_requirements()` method in
the same file. It is passed all the `.in` files, which are then
read line by line to generate the requirements associated with an
extra. Unless the `.in` file itself has a version pin, this will never
respect the `.txt` files generated by `pip-compile`. This fix
updates all the inputs to `load_requirements()` to use the `.txt` files
themselves.
Adding `metadata.data_source.permissions_data` to the
`--metadata-exclude` list in sharepoint-with-permissions.sh to prevent
the SharePoint deprecation warning from breaking the test.
Updating expected-structured-output.
As per Ahmet's comment, we do want to check SharePoint permissions
metadata at some point, but that will take a separate type of test; a
file-diff test is too unstable. Permissions checking will come later down
the road.
**Summary**
This is the final step in adding pluggable chunking-strategies. It
introduces the `chunk()` function to replace calls to strategy-specific
chunkers in the `@add_chunking_strategy` decorator. The `chunk()`
function then uses a mapping of chunking-strategy names (e.g.
"by_title", "basic") to chunking functions (chunkers) to dispatch the
chunking call. This allows other chunkers to be added at runtime rather
than requiring a code change, which is what makes chunkers "pluggable".
**Additional Information**
- Move the `@add_chunking_strategy` to the new `chunking.dispatch`
module since it coheres strongly with that operation, but publish it
from `chunking(.__init__)` (as it was before) so users don't couple to
the way we organize the chunking sub-package. Also remove the third
level of nesting as it's unrequired in this case.
- Add unit tests for the `@add_chunking_strategy` decorator which was
previously uncovered by any direct test.
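A self-contained sketch of the dispatch idea (not the library's actual registration API):

```python
from typing import Callable, Dict, List

# A name -> chunker mapping, standing in for the one chunk() consults.
_chunkers: Dict[str, Callable] = {}


def register_chunker(name: str, chunker: Callable) -> None:
    """Make a chunker dispatchable by strategy name, even from client code."""
    _chunkers[name] = chunker


def chunk(elements: List, chunking_strategy: str, **kwargs) -> List:
    """Dispatch the chunking call by name rather than by compile-time import."""
    try:
        chunker = _chunkers[chunking_strategy]
    except KeyError:
        raise ValueError(f"unrecognized chunking strategy {chunking_strategy!r}")
    return chunker(elements, **kwargs)
```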
Closes #2577
Testing:
```python
from unstructured.partition.html import partition_html

cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)

links = []
for element in elements:
    if element.metadata.link_urls:
        relative_link = element.metadata.link_urls[0][1:]
        if relative_link.startswith("2024"):
            links.append(f"{cnn_lite_url}{relative_link}")

print(links)
```
---------
Co-authored-by: ron-unstructured <ronny@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
The Google Drive Service account key can be a dict or a file path (str).
We have successfully been using the path, but the dict can also end up
being stored as a string that needs to be deserialized, and the
deserialization can have issues with single and double quotes.
This PR adds grouping functionality to `evaluate.py`.
To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py` or
call `get_mean_grouping(<doctype or connector>, <dataframe or path to
tsv file>, <export directory>, "element_type")`
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
We are seeing occurrences of inconsistency in the timestamps returned by
office365.sharepoint when fetching created and modified dates.
Furthermore, in future versions of this library, a datetime object will
be returned rather than a string.
## Changes
- This adds logic to guarantee SharePoint dates will be properly
formatted as ISO, regardless of the format provided by the sharepoint
library.
- Bumps timestamp format output to include timezone offset (as we do
with others)
## Testing
Unit test added to validate this datetime handling across various
formats.
---------
Co-authored-by: David Potter <potterdavidm@gmail.com>
The purpose of this PR is to support using the same type of parameters
as `partition_*()` when using `partition_via_api()`. This PR works
together with the `unstructured-api` [PR
#368](https://github.com/Unstructured-IO/unstructured-api/pull/368).
**Note:** This PR will support extracting image blocks ("Image", "Table")
via `partition_via_api()`.
### Summary
- update `partition_via_api()` to convert all list type parameters to
JSON formatted strings before passing them to the unstructured client
SDK
- add a unit test function to test extracting image blocks via
`partition_via_api()`
- add a unit test function to test list type parameters passed to API
via unstructured client sdk
### Testing
```python
from unstructured.partition.api import partition_via_api

elements = partition_via_api(
    filename="example-docs/embedded-images-tables.pdf",
    api_key="YOUR-API-KEY",
    strategy="hi_res",
    extract_image_block_types=["image", "table"],
)

image_block_elements = [el for el in elements if el.category in ("Image", "Table")]
print("\n\n".join(el.metadata.image_mime_type for el in image_block_elements))
print("\n\n".join(el.metadata.image_base64 for el in image_block_elements))
```
This PR removes `extract_image_block_to_payload` section from "API
Parameters" page. The "unstructured" API does not support the
`extract_image_block_to_payload` parameter, and it is always set to
`True` internally on the API side when trying to extract image blocks
via the API. Users only need to specify `extract_image_block_types`
parameter when extracting image blocks via the API.
**NOTE:** The `extract_image_block_to_payload` parameter is only used
when calling `partition()`, `partition_pdf()`, and `partition_image()`
functions directly.
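For contrast, a sketch of the local call where the flag does apply (file name as in the example earlier in this log):

```python
from unstructured.partition.pdf import partition_pdf

# Locally the payload flag must be set explicitly; the hosted API sets it
# internally, so API callers pass only extract_image_block_types.
elements = partition_pdf(
    "embedded-images-tables.pdf",
    strategy="hi_res",
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
)
```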
### Testing
CI should pass.
Thanks to Eric Hare @erichare at DataStax, we have a new destination
connector.
This Pull Request implements an integration with [Astra
DB](https://datastax.com), which allows the Astra DB Vector Database
to work with Unstructured's set of integrations.
To create your Astra account and authenticate with your
`ASTRA_DB_APPLICATION_TOKEN` and `ASTRA_DB_API_ENDPOINT`, follow these
steps:
1. Create an account at https://astra.datastax.com
2. Login and create a new database
3. From the database page, in the right-hand panel, you will find your
API Endpoint
4. Beneath that, you can create a Token to be used
Some notes about Astra DB:
- Astra DB is a Vector Database which allows for high-performance
database transactions, and enables modern GenAI apps [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html)
- It supports similarity search via a number of methods [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics)
- It also supports non-vector tables / collections
**Summary**
A pluggable chunking strategy needs its own local set of chunking
options that subclasses a base-class in `unstructured`.
Extract distinct `_ByTitleChunkingOptions` and `_BasicChunkingOptions`
for the existing two chunking strategies and move their
strategy-specific option setting and validation to the respective
subclass.
This was also a good opportunity for us to clean up a few odds and ends
we'd been meaning to.
Might be worth looking at the commits individually as they are cohesive
incremental steps toward the goal.
### Summary
Detects headers and footers when using `partition_pdf` with the fast
strategy. Identifies elements that are positioned in the top or bottom
5% of the page as headers or footers. If no coordinate information is
available, an element won't be detected as a header or footer.
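A sketch of the positional rule, assuming a top-left coordinate origin where y increases downward (the actual implementation may normalize coordinates differently):

```python
from typing import Optional


def header_or_footer(top: float, bottom: float, page_height: float) -> Optional[str]:
    """Classify an element by vertical position: entirely within the top 5%
    of the page -> Header; entirely within the bottom 5% -> Footer."""
    if bottom <= 0.05 * page_height:
        return "Header"
    if top >= 0.95 * page_height:
        return "Footer"
    return None  # also the result when coordinates are unavailable
```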
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Separate the aggregating functionality of `text_extraction_accuracy`
into a stand-alone function to avoid duplicated eval effort when the
granular-level eval is already available.
To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py`
locally
**Summary**
Refactoring as part of `partition_xlsx()` algorithm replacement that was
delayed by some CI challenges.
A separate PR because it is cohesive and relatively independent from the
prior PR.
This PR adds new table evaluation metrics prepared by @leah1985.
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows
TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`
**Summary**
In order to accommodate customized chunkers other than those directly
provided by `unstructured`, some further modularization is necessary
such that a new chunker can be added as a "plug-in" without modifying
the `unstructured` library code.
This PR contains the straightforward refactoring required for this
process, like typing changes. There are also some other small changes
we've been meaning to make, like having all chunking options accept
`None` to represent their default value, so the broad field of callers
(e.g. ingest, unstructured-api, SDK) doesn't need to determine and set
default values for chunking arguments, which leads to diverging defaults.
Isolating these "noisy" but easy to accept changes in this preparatory
PR reduces the noise in the more substantive changes to follow.
To provide more utility to the `catch_overlapping_and_nested_bboxes` and
`identify_overlapping_or_nesting_case` functions, this includes
parent_element as part of the output.
This allows users to:
- identify the parent element in the overlapping case: `nested {type*}
in {type*}`. Currently, if the element types are the same, an example
case output would be `nested Image in Image`, which is confusing.
- easily identify elements to keep or delete
**Summary**
For whatever reason, the `@add_chunking_strategy` decorator was not
present on `partition_json()`. This broke the only way to accomplish a
"chunking-only" workflow using the REST API. This PR remedies that
problem.