Thanks to @tullytim we have a new Kafka source and destination
connector. It also works with hosted Kafka via Confluent.
Documentation will be added to the Docs repo.
### Summary
Explicitly replaces all old docs pages with a link to the new docs. This
was required because 404 redirects didn't work for pages that previously
existed, though they worked non-existing paths that never existed.
Summary:
- bump unstructured-inference to `0.7.33`
- cut a release for `0.14.2`
- add some dependencies that previously came through from the
layoutparser extras.
### Summary
Switches to installing `libreoffice` from the Wolfi repository and
upgrades the `libreoffice` version to `libreoffice==24.x.x`. Resolves a
medium vulnerability in the old `libreoffice` version. Security scanning
with `anchore/grype` was also added to the `test_dockerfile` job.
Requirements were bumped to resolve a vulnerability in the `requests`
library.
### Testing
`test_dockerfile` passes with the updates.
### Summary
Closes#2959. Updates the dependency and CI to add support for Python
3.12.
The MongoDB ingest tests were disabled due to jobs like [this
one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333)
failing due to issues with the `bson` package. `bson` is a dependency
for the AstraDB connector, but `pymongo` does not work when `bson` is
installed from `pip`. This issue is documented by MongoDB
[here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun
off #3049 to resolve this. Issue seems unrelated to Python 3.12, though
unsure why this didn't surface previously.
Disables the `argilla` tests because `argilla` does not yet support
Python 3.12. We can add the `argilla` tests back in once the PR
references below is merged. You can still use the `stage_for_argilla`
function if you're on `python<3.12` and you install `argilla` yourself.
- https://github.com/argilla-io/argilla/pull/4837
---------
Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
**Summary**
`unstructured` will use table features added in the most recent version
of `python-docx`.
Also update the `lxml` version constraint because `lxml>4.9.2` will not
install on Apple Silicon
(https://github.com/Unstructured-IO/unstructured/issues/1707).
`python-docx` requires `lxml` although other file formats require it as
well.
Cut a release.
Run pip-compile on mac to avoid `nvidia-*` requirements creeping into
`requirements/extra-pdf-image.txt`. This should fix arm64 image builds
that have been breaking on main.
Update: The cli shell script works when sending documents to the free
api, but the paid api is down, so waiting to test against it.
- The first commit adds docstrings and fixes type hints.
- The second commit reorganizes `test_unstructured_ingest` so it matches
the structure of `unstructured/ingest`.
- The third commit contains the primary changes for this PR.
- The `.chunk()` method responsible for sending elements to the correct
method is moved from `ChunkingConfig` to `Chunker` so that
`ChunkingConfig` acts as a config object instead of containing
implementation logic. `Chunker.chunk()` also now takes a json file
instead of a list of elements. This is done to avoid redundant
serialization if the file is to be sent to the api for chunking.
---------
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842
Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more test for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)
This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
**Summary**
The `.section` field in `ElementMetadata` is dead code, possibly a
remainder from a prior iteration of `partition_epub()`. In any case, it
is not populated by any partitioner. Remove it and any code that uses
it.
Part one of the issue described here:
https://github.com/Unstructured-IO/unstructured/issues/2461
It does not change how hashing algorithm works, just reworks how ids are
assigned:
> Element ID Design Principles
>
> 1. A partitioning function can assign only one of two available ID
types to a returned element: a hash or UUID.
> 2. All elements that are returned come with an ID, which is never
None.
> 3. No matter which type of ID is used, it will always be in string
format.
> 4. Partitioning a document returns elements with hashes as their
default IDs.
Big thanks to @scanny for explaining the current design and suggesting
ways to do it right, especially with chunking.
Here's the next PR in line:
https://github.com/Unstructured-IO/unstructured/pull/2673
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>
### Description
* The `consistent-deps.sh` was fixed to take into account the ingest
dependencies, causing some errors to show up. New constriants were added
to make that script pass.
* Update all requirements without constraint on pydantic, allowing the
latest version to be pulled in.
* `pikepdf` is causing a conflict but there's a fix on their `main`
branch, just need for the next release to be published. Opened up a
question here to see if we can get that out any sooner: [Do releases
happen on a
schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now
added `lxml<5` to the constraints.
A couple optimizations:
* `constraints.in` renamed to `constraints.txt` since the whole point is
all dependencies are already pinned and the file never gets compiled
* `constraints.txt` moved to a `requirements/deps` directory as this
never gets compiled by `pip-compile`
* Other dependency files updated to reference the new location of
`base.in` and `constraints.txt`
* make file updated since it was originally written to avoid the
`base.in` and `constraints.in` file
**Summary**
Add an `--include-orig-elements` option to the Ingest CLI to allow users
to specify that corresponding new chunking parameter.
**Reviewer** A lot of this is cleanup, the second commit is where the
actual adding of this option are. The first commit fixes a number of
inaccuracies in the documentation and does some other clean-up.
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
**Summary**
This final PR in the "orig_elements" series adds the needful such that
`.metadata.orig_elements`, when present on a chunk (element), is
serialized to JSON when the chunk is serialized, for instance, to be
used in an HTTP response payload.
It also provides for deserializing such a JSON payload into chunks that
contain the `.orig_elements` metadata.
**Additional Context**
Note that `.metadata.orig_elements` is always `Optional[list[Element]]`
when in memory. However, those original elements are serialized as
Base64-encoded gzipped JSON and are in that form (str) when present as
JSON or as "element-dicts" which is an intermediate
serialization/deserialization format. That is, serialization is `Element
-> dict -> JSON` and deserialization is `JSON -> dict -> Element` and
`.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms.
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
Change default values for table extraction - works in pair with
[this](https://github.com/Unstructured-IO/unstructured-api/pull/370)
`unstructured-api` PR
We want to move away from `pdf_infer_table_structure` parameter, in this
PR:
- We change how it's treated wrt `skip_infer_table_types` parameter.
Whether to extract tables from pdf now follows from the rule:
`pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set it to `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]` by default
- We remove it from the examples in documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in
documentation
More detailed description of how we want parameters to interact
- if `pdf_infer_table_structure` is False tables will never extracted
from pdf
- if `pdf_infer_table_structure` is True tables will be extracted from
pdf unless it's skipped via `skip_infer_table_types`
- on default `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Thanks to @mogith-pn from Clarifai we have a new destination connector!
This PR intends to add Clarifai as a ingest destination connector.
Access via CLI and programmatic.
Documentation and Examples.
Integration test script.
JavaScript uses `//` for comments instead of `#`
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
To test:
> cd docs && make html
Changelogs:
* added a note to have **Storage Object Viewer** IAM Role for the GCS
source connector.
* added a note to have **Storage Object Creator** IAM Role for the GCS
destination connector.
Closes #2577
Testing:
```
from unstructured.partition.html import partition_html
cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)
links = []
for element in elements:
if element.metadata.link_urls:
relative_link = element.metadata.link_urls[0][1:]
if relative_link.startswith("2024"):
links.append(f"{cnn_lite_url}{relative_link}")
print(links)
```
---------
Co-authored-by: ron-unstructured <ronny@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Google Drive Service account key can be a dict or a file path(str)
We have successfully been using the path. But the dict can also end up
being stored as a string that needs to be deserialized. The
deserialization can have issues with single and double quotes.
This PR removes `extract_image_block_to_payload` section from "API
Parameters" page. The "unstructured" API does not support the
`extract_image_block_to_payload` parameter, and it is always set to
`True` internally on the API side when trying to extract image blocks
via the API. Users only need to specify `extract_image_block_types`
parameter when extracting image blocks via the API.
**NOTE:** The `extract_image_block_to_payload` parameter is only used
when calling `partition()`, `partition_pdf()`, and `partition_image()`
functions directly.
### Testing
CI should pass.
Thanks to Eric Hare @erichare at DataStax we have a new destination
connector.
This Pull Request implements an integration with [Astra
DB](https://datastax.com) which allows for the Astra DB Vector Database
to be compatible with Unstructured's set of integrations.
To create your Astra account and authenticate with your
`ASTRA_DB_APPLICATION_TOKEN`, and `ASTRA_DB_API_ENDPOINT`, follow these
steps:
1. Create an account at https://astra.datastax.com
2. Login and create a new database
3. From the database page, in the right hand panel, you will find your
API Endpoint
4. Beneath that, you can create a Token to be used
Some notes about Astra DB:
- Astra DB is a Vector Database which allows for high-performance
database transactions, and enables modern GenAI apps [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html)
- It supports similarity search via a number of methods [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics)
- It also supports non-vector tables / collections
Thanks to Pedro at OctoAI we have a new embedding option.
The following PR adds support for the use of OctoAI embeddings.
Forked from the original OpenAI embeddings class. We removed the use of
the LangChain adaptor, and use OpenAI's SDK directly instead.
Also updated out-of-date example script.
Including new test file for OctoAI.
# Testing
Get a token from our platform at: https://www.octoai.cloud/
For testing one can do the following:
```
export OCTOAI_TOKEN=<your octo token>
python3 examples/embed/example_octoai.py
```
## Testing done
Validated running the above script from within a locally built container
via `make docker-start-dev`
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
To test:
> cd docs && make html
Changelog: added an example to use `SaaS API URL` in `partition_via_api`
using `api_url` param.
---------
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
This PR:
- Moves ingest dependencies into local scopes to be able to import
ingest connector classes without the need of installing imported
external dependencies. This allows lightweight use of the classes (not
the instances. to use the instances as intended you'll still need the
dependencies).
- Upgrades the embed module dependencies from `langchain` to
`langchain-community` module (to pass CI [rather than introducing a
pin])
- Does pip-compile
- Does minor refactors in other files to pass `ruff 2.0` checks which
were introduced by pip-compile
Thanks to Ofer at Vectara, we now have a Vectara destination connector.
- There are no dependencies since it is all REST calls to API
-
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
.heic files are an image filetype we have not supported.
#### Testing
```
from unstructured.partition.image import partition_image
png_filename = "example-docs/DA-1p.png"
heic_filename = "example-docs/DA-1p.heic"
png_elements = partition_image(png_filename, strategy="hi_res")
heic_elements = partition_image(heic_filename, strategy="hi_res")
for i in range(len(heic_elements)):
print(heic_elements[i].text == png_elements[i].text)
```
---------
Co-authored-by: christinestraub <christinemstraub@gmail.com>
This PR is the last in a series of PRs for refactoring and fixing the
language parameters (`languages` and `ocr_languages` so we can address
incorrect input by users. See #2293
It is recommended to go though this PR commit-by-commit and note the
commit message. The most significant commit is "update
check_languages..."
To test:
> cd docs && make html
Change logs:
* Updates the best practice for table extraction to use
`skip_infer_table_types` instead of `pdf_infer_table_structure`.
* Fixed CSS issue with a duplicate search box.
* Fixed RST warning message
* Fixed typo on the Intro page.
### Description
This adds in a destination connector to write content to the Databricks
Unity Catalog Volumes service. Currently there is an internal account
that can be used for testing manually but there is not dedicated account
to use for testing so this is not being added to the automated ingest
tests that get run in the CI.
To test locally:
```shell
#!/usr/bin/env bash
path="testpath/$(uuidgen)"
PYTHONPATH=. python ./unstructured/ingest/main.py local \
--num-processes 4 \
--output-dir azure-test \
--strategy fast \
--verbose \
--input-path example-docs/fake-memo.pdf \
--recursive \
databricks-volumes \
--catalog "utic-dev-tech-fixtures" \
--volume "small-pdf-set" \
--volume-path "$path" \
--username "$DATABRICKS_USERNAME" \
--password "$DATABRICKS_PASSWORD" \
--host "$DATABRICKS_HOST"
```
To test:
> cd docs && make html
Changelogs:
* Fixed sphinx error due to malformed rst table on partition page
* Updated API Params, ie. `extract_image_block_types` and
`extract_image_block_to_payload`
* Updated image filetype supports
Closes#2320 .
### Summary
In certain circumstances, adjusting the image block crop padding can
improve image block extraction by preventing extracted image blocks from
being clipped.
### Testing
- PDF:
[LM339-D_2-2.pdf](https://github.com/Unstructured-IO/unstructured/files/13968952/LM339-D_2-2.pdf)
- Set two environment variables
`EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and
`EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD`
(e.g. `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD = 40`,
`EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD = 20`
```
elements = partition_pdf(
filename="LM339-D_2-2.pdf",
extract_image_block_types=["image"],
)
```
To test:
> cd docs && make html
> click "Ask AI" button on the bottom right-hand corner
Changelogs:
* Installed kapa.ai widget
* fixed sphinx errors in opensearch & elasticsearch documentation