**Summary**
This is the final step in adding pluggable chunking-strategies. It
introduces the `chunk()` function to replace calls to strategy-specific
chunkers in the `@add_chunking_strategy` decorator. The `chunk()`
function then uses a mapping of chunking-strategy names (e.g.
"by_title", "basic") to chunking functions (chunkers) to dispatch the
chunking call. This allows other chunkers to be added at runtime rather
than requiring a code change, which is what "pluggable" chunkers is.
**Additional Information**
- Move the `@add_chunking_strategy` to the new `chunking.dispatch`
module since it coheres strongly with that operation, but publish it
from `chunking(.__init__)` (as it was before) so users don't couple to
the way we organize the chunking sub-package. Also remove the third
level of nesting as it's unrequired in this case.
- Add unit tests for the `@add_chunking_strategy` decorator which was
previously uncovered by any direct test.
Closes #2577
Testing:
```
from unstructured.partition.html import partition_html
cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)
links = []
for element in elements:
if element.metadata.link_urls:
relative_link = element.metadata.link_urls[0][1:]
if relative_link.startswith("2024"):
links.append(f"{cnn_lite_url}{relative_link}")
print(links)
```
---------
Co-authored-by: ron-unstructured <ronny@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Google Drive Service account key can be a dict or a file path(str)
We have successfully been using the path. But the dict can also end up
being stored as a string that needs to be deserialized. The
deserialization can have issues with single and double quotes.
This PR allow grouping functionality on `evaluate.py`
To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py` or
call `get_mean_grouping(<doctype or connector>, <dataframe or path to
tsv file>, <export directory>, "element_type")`
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
We are seeing occurrences of inconsistency in the timestamps returned by
office365.sharepoint when fetching created and modified dates.
Furthermore, in future versions of this library, a datetime object will
be returned rather than a string.
## Changes
- This adds logic to guarantee SharePoint dates will be properly
formatted as ISO, regardless of the format provided by the sharepoint
library.
- Bumps timestamp format output to include timezone offset (as we do
with others)
## Testing
Unit test added to validate this datetime handling across various
formats.
---------
Co-authored-by: David Potter <potterdavidm@gmail.com>
The purpose of this PR is to support using the same type of parameters
as `partition_*()` when using `partition_via_api()`. This PR works
together with `unsturctured-api` [PR
#368](https://github.com/Unstructured-IO/unstructured-api/pull/368).
**Note:** This PR will support extracting image blocks("Image", "Table")
via partition_via_api().
### Summary
- update `partition_via_api()` to convert all list type parameters to
JSON formatted strings before passing them to the unstructured client
SDK
- add a unit test function to test extracting image blocks via
`parition_via_api()`
- add a unit test function to test list type parameters passed to API
via unstructured client sdk
### Testing
```
from unstructured.partition.api import partition_via_api
elements = partition_via_api(
filename="example-docs/embedded-images-tables.pdf",
api_key="YOUR-API-KEY",
strategy="hi_res",
extract_image_block_types=["image", "table"],
)
image_block_elements = [el for el in elements if el.category == "Image" or el.category == "Table"]
print("\n\n".join([el.metadata.image_mime_type for el in image_block_elements]))
print("\n\n".join([el.metadata.image_base64 for el in image_block_elements]))
```
This PR removes `extract_image_block_to_payload` section from "API
Parameters" page. The "unstructured" API does not support the
`extract_image_block_to_payload` parameter, and it is always set to
`True` internally on the API side when trying to extract image blocks
via the API. Users only need to specify `extract_image_block_types`
parameter when extracting image blocks via the API.
**NOTE:** The `extract_image_block_to_payload` parameter is only used
when calling `partition()`, `partition_pdf()`, and `partition_image()`
functions directly.
### Testing
CI should pass.
Thanks to Eric Hare @erichare at DataStax we have a new destination
connector.
This Pull Request implements an integration with [Astra
DB](https://datastax.com) which allows for the Astra DB Vector Database
to be compatible with Unstructured's set of integrations.
To create your Astra account and authenticate with your
`ASTRA_DB_APPLICATION_TOKEN`, and `ASTRA_DB_API_ENDPOINT`, follow these
steps:
1. Create an account at https://astra.datastax.com
2. Login and create a new database
3. From the database page, in the right hand panel, you will find your
API Endpoint
4. Beneath that, you can create a Token to be used
Some notes about Astra DB:
- Astra DB is a Vector Database which allows for high-performance
database transactions, and enables modern GenAI apps [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html)
- It supports similarity search via a number of methods [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics)
- It also supports non-vector tables / collections
**Summary**
A pluggable chunking strategy needs its own local set of chunking
options that subclasses a base-class in `unstructured`.
Extract distinct `_ByTitleChunkingOptions` and `_BasicChunkingOptions`
for the existing two chunking strategies and move their
strategy-specific option setting and validation to the respective
subclass.
This was also a good opportunity for us to clean up a few odds and ends
we'd been meaning to.
Might be worth looking at the commits individually as they are cohesive
incremental steps toward the goal.
### Summary
Detects headers and footers when using `partition_pdf` with the fast
strategy. Identifies elements that are positioned in the top or bottom
5% of the page as headers or footers. If no coordinate information is
available, an element won't be detected as a header or footer.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Separate the aggregating functionality of `text_extraction_accuracy` to
a stand-alone function to avoid duplicated eval effort if the granular
level eval is already available.
To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py`
locally
**Summary**
Refactoring as part of `partition_xlsx()` algorithm replacement that was
delayed by some CI challenges.
A separate PR because it is cohesive and relatively independent from the
prior PR.
This PR adds new table evaluation metrics prepared by @leah1985
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows
TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`
**Summary**
In order to accommodate customized chunkers other than those directly
provided by `unstructured`, some further modularization is necessary
such that a new chunker can be added as a "plug-in" without modifying
the `unstructured` library code.
This PR is the straightforward refactoring required for this process
like typing changes. There are also some other small changes we've been
meaning to make like making all chunking options accept `None` to
represent their default value so the broad field of callers (e.g.
ingest, unstructured-api, SDK) don't need to determine and set default
values for chunking arguments leading to diverging defaults.
Isolating these "noisy" but easy to accept changes in this preparatory
PR reduces the noise in the more substantive changes to follow.
To provide more utility to the `catch_overlapping_and_nested_bboxes` and
`identify_overlapping_or_nesting_case` functions, included
parent_element as part of the output.
This would allow user to
- identify the parent element in the overlapping case: `nested {type*}
in {type*}`. Currently, if the element types is similar, an example case
output would be `nested Image in Image` which is confusing.
- easily identify elements to keep or delete
**Summary**
For whatever reason, the `@add_chunking_strategy` decorator was not
present on `partition_json()`. This broke the only way to accomplish a
"chunking-only" workflow using the REST API. This PR remedies that
problem.
The current `test-ingest-src.sh` and `evaluation-metrics` do not allow
passing the `EXPORT_DIR` (`OUTPUT_ROOT` in `evaluation-metrics`). It is
currently saving at the current working directory
(`unstructured/test_unstructured_ingest`). When running the eval from
`core-product`, all outputs is now saved at
`core-product/upstream-unstructured/test_unstructured_ingest` which is
undesirable.
This PR modifies two scripts to accommodate such behavior:
1. `test-ingest-src.sh` - assign `EVAL_OUTPUT_ROOT` to the value set
within the environment if exist, or the current working directory if
not. Then calls to run `evaluation-metrics.sh`.
2. `evaluation-metrics.sh` - accepting param from `test-ingest-src.sh`
if exist, or to the value set within the environment if exist, or the
current directory if not.
(Note: I also add param to `evaluation-metrics.sh` because it makes
sense to allow a separate run to be able to specify an export directory)
This PR should work in sync with another PR under `core-product`, which
I will add the link here later.
**To test:**
Run the script below, change `$SCRIPT_DIR` as needed to see the result.
```
export OVERWRITE_FIXTURES=true
./upstream-unstructured/test_unstructured_ingest/src/s3.sh
SCRIPT_DIR=$(dirname "$(realpath "$0")")
bash -x ./upstream-unstructured/test_unstructured_ingest/evaluation-metrics.sh text-extraction "$SCRIPT_DIR"
```
----
This PR also updates the requirements by `make pip-compile` since the
`click` module was not found.
Reported bug: Text from docx shapes is not included in the `partition`
output.
Fix: Extend docx partition to search for text tags nested inside
structures responsible for creating the shape.
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
### Summary
Closes#2444. Treats JSON serializable content that results in a string
as plain text. Even though this is valid JSON per [RFC
4627](https://www.ietf.org/rfc/rfc4627.txt), this is valid JSON, but in
almost every cases were really want to treat this as a text file.
### Testing
1. Put `"This is not a JSON"` is a text file `notajson.txt`
2. Run the following
```python
from unstructured.file_utils.filetype import _is_text_file_a_json
_is_text_file_a_json(filename="notajson.txt") # Should be False
```
This PR is similar to ocr module refactoring PR -
https://github.com/Unstructured-IO/unstructured/pull/2492.
### Summary
- refactor "embedded text extraction" related modules to use decorator -
`@requires_dependencies` on functions that require external libraries
and import those libraries inside those functions instead of on module
level.
- add missing test cases for `pdf_image_utils.py` module to improve
average test coverage
### Testing
CI should pass.
**Reviewers:** It may be easier to review each of the two commits
separately. The first adds the new `_SubtableParser` object with its
unit-tests and the second one uses that object to replace the flawed
existing subtable-parsing algorithm.
**Summary**
There are a cluster of bugs in `partition_xlsx()` that all derive from
flaws in the algorithm we use to detect "subtables". These are
encountered when the user wants to get multiple document-elements from
each worksheet, which is the default (argument `find_subtable = True`).
This PR replaces the flawed existing algorithm with a `_SubtableParser`
object that encapsulates all that logic and has thorough unit-tests.
**Additional Context**
This is a summary of the failure cases. There are a few other cases but
they're closely related and this was enough evidence and scope for my
purposes. This PR fixes all these bugs:
```python
#
# -- ✅ CASE 1: There are no leading or trailing single-cell rows.
# -> this subtable functions never get called, subtable is emitted as the only element
#
# a b -> Table(a, b, c, d)
# c d
# -- ✅ CASE 2: There is exactly one leading single-cell row.
# -> Leading single-cell row emitted as `Title` element, core-table properly identified.
#
# a -> [ Title(a),
# b c Table(b, c, d, e) ]
# d e
# -- ❌ CASE 3: There are two-or-more leading single-cell rows.
# -> leading single-cell rows are included in subtable
#
# a -> [ Table(a, b, c, d, e, f) ]
# b
# c d
# e f
# -- ❌ CASE 4: There is exactly one trailing single-cell row.
# -> core table is dropped. trailing single-cell row is emitted as Title
# (this is the behavior in the reported bug)
#
# a b -> [ Title(e) ]
# c d
# e
# -- ❌ CASE 5: There are two-or-more trailing single-cell rows.
# -> core table is dropped. trailing single-cell rows are each emitted as a Title
#
# a b -> [ Title(e),
# c d Title(f) ]
# e
# f
# -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b c Table(b, c, d, e),
# d e Title(f) ]
# f
# -- ✅ CASE 7: There are two leading and one trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b Title(b),
# c d Table(c, d, e, f),
# e f Title(g) ]
# g
# -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b Title(b),
# c d Table(c, d, e, f),
# e f Title(g),
# g Title(h) ]
# h
# -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below.
# -> First cell is mistakenly emitted as title, remaining cells are dropped.
#
# a b c -> [ Title(a) ]
# -- ❌ CASE 10: Single-row subtable with one leading single-cell row.
# -> Leading single-row cell is correctly identified as title, core-table is mis-identified
# as a `Title` and truncated.
#
# a -> [ Title(a),
# b c d Title(b) ]
```
Thanks to Pedro at OctoAI we have a new embedding option.
The following PR adds support for the use of OctoAI embeddings.
Forked from the original OpenAI embeddings class. We removed the use of
the LangChain adaptor, and use OpenAI's SDK directly instead.
Also updated out-of-date example script.
Including new test file for OctoAI.
# Testing
Get a token from our platform at: https://www.octoai.cloud/
For testing one can do the following:
```
export OCTOAI_TOKEN=<your octo token>
python3 examples/embed/example_octoai.py
```
## Testing done
Validated running the above script from within a locally built container
via `make docker-start-dev`
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
fixes check_connection for:
azure
opensearch
postgres
For Azure, the check_connection in fsspec.py actually worked better.
Adding check_connection for Databricks Volumes
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
**Reviewers:** It may be faster to review each of the three commits
separately since they are groomed to only make one type of change each
(typing, docstrings, test-cleanup).
**Summary**
There are a cluster of bugs in `partition_xlsx()` that all derive from
flaws in the algorithm we use to detect "subtables". These are
encountered when the user wants to get multiple document-elements from
each worksheet, which is the default (argument `find_subtable = True`).
These commits clean up typing, lint, and other non-behavior-changing
aspects of the code in preparation for installing a new algorithm that
correctly identifies and partitions contiguous sub-regions of an Excel
worksheet into distinct elements.
**Additional Context**
This is a summary of the failure cases. There are a few other cases but
they're closely related and this was enough evidence and scope for my
purposes:
```python
#
# -- ✅ CASE 1: There are no leading or trailing single-cell rows.
# -> this subtable functions never get called, subtable is emitted as the only element
#
# a b -> Table(a, b, c, d)
# c d
# -- ✅ CASE 2: There is exactly one leading single-cell row.
# -> Leading single-cell row emitted as `Title` element, core-table properly identified.
#
# a -> [ Title(a),
# b c Table(b, c, d, e) ]
# d e
# -- ❌ CASE 3: There are two-or-more leading single-cell rows.
# -> leading single-cell rows are included in subtable
#
# a -> [ Table(a, b, c, d, e, f) ]
# b
# c d
# e f
# -- ❌ CASE 4: There is exactly one trailing single-cell row.
# -> core table is dropped. trailing single-cell row is emitted as Title
# (this is the behavior in the reported bug)
#
# a b -> [ Title(e) ]
# c d
# e
# -- ❌ CASE 5: There are two-or-more trailing single-cell rows.
# -> core table is dropped. trailing single-cell rows are each emitted as a Title
#
# a b -> [ Title(e),
# c d Title(f) ]
# e
# f
# -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b c Table(b, c, d, e),
# d e Title(f) ]
# f
# -- ✅ CASE 7: There are two leading and one trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b Title(b),
# c d Table(c, d, e, f),
# e f Title(g) ]
# g
# -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b Title(b),
# c d Table(c, d, e, f),
# e f Title(g),
# g Title(h) ]
# h
# -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below.
# -> First cell is mistakenly emitted as title, remaining cells are dropped.
#
# a b c -> [ Title(a) ]
# -- ❌ CASE 10: Single-row subtable with one leading single-cell row.
# -> Leading single-row cell is correctly identified as title, core-table is mis-identified
# as a `Title` and truncated.
#
# a -> [ Title(a),
# b c d Title(b) ]
```
### Summary
Closes#2484. Adds missing dependency files to `MANIFEST.in` so they are
included in the Python distribution. Also updates the manifest to look
for ingest dependencies in the `requirements/ingest` subdirectory.
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
### Summary
Closes#2489, which reported an inability to process `.p7s` files. PR
implements two changes:
- If the user selected content type for the email is not available and
there is another valid content type available, fall back to the other
valid content type.
- For signed message, extract the signature and add it to the metadata
### Testing
```python
from unstructured.partition.auto import partition
filename = "example-docs/eml/signed-doc.p7s"
elements = partition(filename=filename) # should get a message about fall back logic
print(elements[0]) # "This is a test"
elements[0].metadata.to_dict() # Will see the signature
```
change opensearch port to see if fixes CI. We think there may be a
conflict with the elasticsearch docker port.
Also adding simple retry to vector query.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
To test:
> cd docs && make html
Changelog: added an example to use `SaaS API URL` in `partition_via_api`
using `api_url` param.
---------
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
This PR:
- Moves ingest dependencies into local scopes to be able to import
ingest connector classes without the need of installing imported
external dependencies. This allows lightweight use of the classes (not
the instances. to use the instances as intended you'll still need the
dependencies).
- Upgrades the embed module dependencies from `langchain` to
`langchain-community` module (to pass CI [rather than introducing a
pin])
- Does pip-compile
- Does minor refactors in other files to pass `ruff 2.0` checks which
were introduced by pip-compile
Small improvement to Vectara requested by Ofer at Vectara
In the "Document" construct, every document can have a title. If it's
there, in the UI it will show up above the document (otherwise you get
"Untitled")
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
The purpose of this PR is to refactor OCR-related modules to reduce
unnecessary module imports to avoid potential issues (most likely due to
a "circular import").
### Summary
- add `inference_utils` module
(unstructured/partition/pdf_image/inference_utils.py) to define
unstructured-inference library related utility functions, which will
reduce importing unstructured-inference library functions in other files
- add `conftest.py` in `test_unstructured/partition/pdf_image/`
directory to define fixtures that are available to all tests in the same
directory and its subdirectories
### Testing
CI should pass
I accidentally added Vectara to setup and makefile. But there are no
dependencies for Vectara
This removes Vectara from those files.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
Thanks to Ofer at Vectara, we now have a Vectara destination connector.
- There are no dependencies since it is all REST calls to API
-
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
This is nice to natively support both Tesseract and Paddle. However, one
might already use another OCR and might want to keep using it (for
quality reasons, for cost reasons etc...).
This PR adds the ability for the user to specify its own OCR agent
implementation that is then called by unstructured.
I am new to unstructured so don't hesitate to let me know if you would
prefer this being done differently and I will rework the PR.
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
Removed `pillow` pin and recompiled. I think it was originally there to
address a conflict, which, as far as I can tell, no longer exists. Also
a security vulnerability was discovered in the older version of
`pillow`.
#### Testing:
CI should pass.
Update `black` and apply changes to affected files. I separated this PR
so we can have a look at the changes and decide whether we want to:
1. Go forward with the new formatting
2. Change the black config to make the old formatting valid
3. Get rid of black entirely and just use `ruff`
4. Do something I haven't thought of
.heic files are an image filetype we have not supported.
#### Testing
```
from unstructured.partition.image import partition_image
png_filename = "example-docs/DA-1p.png"
heic_filename = "example-docs/DA-1p.heic"
png_elements = partition_image(png_filename, strategy="hi_res")
heic_elements = partition_image(heic_filename, strategy="hi_res")
for i in range(len(heic_elements)):
print(heic_elements[i].text == png_elements[i].text)
```
---------
Co-authored-by: christinestraub <christinemstraub@gmail.com>