### Description
This refactors the current ingest CLI process to support better
granularity in how the steps are run:
* Both multiprocessing and async are now supported. Since many of the
steps are IO-bound, such as downloading and uploading content, we can
achieve better parallelization by using async here (see the sketch after
this list).
* Destination step broken up into a stager step and an upload step. This
allows steps that require manipulating the data between formats, such as
converting the elements JSON into a CSV format for tabular destinations,
to be pulled out of the step that does the actual upload.
* The process of writing content to a local destination has been pulled
out as its own dedicated destination connector, meaning you no longer
need to persist the content locally once the process is done if the
content was uploaded elsewhere.
* Quick update to the chunker/partition step to use the Python client.
* Moved uncompress support into a pipeline step, since it can apply to
any concrete files that have been downloaded, regardless of where they
came from.
* Leverage the last modified date to mark files for reprocessing, even
if the file already exists locally.
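Below is a minimal sketch (not the pipeline's actual code) of how the
IO-bound steps such as download and upload can be fanned out with
`asyncio`; the function names here are illustrative only:
```python
import asyncio

async def download_one(url: str) -> bytes:
    # Stand-in for an IO-bound call (HTTP download, blob fetch, etc.).
    await asyncio.sleep(0.1)
    return b"..."

async def download_all(urls: list[str]) -> list[bytes]:
    # Fan all downloads out concurrently instead of one worker per file.
    return await asyncio.gather(*(download_one(u) for u in urls))

contents = asyncio.run(download_all(["https://example.com/a", "https://example.com/b"]))
```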
### Callouts
Retry configs haven't been moved over yet. This is an open question
because the intent was for them to wrap potential connection errors, but
now any of the other steps that leverage an API might run into network
connection issues. Should those be isolated in each of the steps and
wrapped with the same retry configs? Or do we need to expose a unique
retry config for each step? That would bloat the input params even more.
### Testing
* If you want to run the new code as an SDK, there's an example file
that was added to highlight how to do that:
[example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py)
* If you want to run the new code as an isolated CLI:
```shell
PYTHONPATH=. python unstructured/ingest/v2/main.py --help
```
* If you want to see which commands have been migrated to the new
version, there's now a `v2` short help text next to those commands when
running the current CLI:
```shell
PYTHONPATH=. python unstructured/ingest/main.py --help
Usage: main.py [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
airtable
azure
biomed
box
confluence
delta-table
discord
dropbox
elasticsearch
fsspec
gcs
github
gitlab
google-drive
hubspot
jira
local v2
mongodb
notion
onedrive
opensearch
outlook
reddit
s3 v2
salesforce
sftp
sharepoint
slack
wikipedia
```
You can run any of the local or s3-specific ingest tests and these
should now work.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Summary
Closes #2959. Updates the dependencies and CI to add support for Python
3.12.
The MongoDB ingest tests were disabled due to jobs like [this
one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333)
failing due to issues with the `bson` package. `bson` is a dependency
for the AstraDB connector, but `pymongo` does not work when `bson` is
installed from `pip`. This issue is documented by MongoDB
[here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun
off #3049 to resolve this. The issue seems unrelated to Python 3.12,
though it's unclear why this didn't surface previously.
Disables the `argilla` tests because `argilla` does not yet support
Python 3.12. We can add the `argilla` tests back in once the PR
references below is merged. You can still use the `stage_for_argilla`
function if you're on `python<3.12` and you install `argilla` yourself.
- https://github.com/argilla-io/argilla/pull/4837
---------
Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
### Summary
Closes #3021. Turns table extraction for PDFs and images off by
default. The default behavior originally changed in #2588. The reason
for the reversion is that some users did not realize turning off table
extraction was an option and experienced long processing times for PDFs
and images with the new default behavior.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Thanks to @erichare from AstraDB
Adds support for specifying the indexing options for various columns in
Astra DB, allowing users to avoid a situation where long text columns
are indexed by default.
The changes to test_unstructured_ingest/python/test-ingest-astra-output.py
are forward-looking from AstraDB.
### Summary
Rip out the page_number metadata fields until we have page counting for
all kinds of HTML files (not just news articles with multiple
`<article>` tags).
### Test
Unit tests
`test_add_chunking_strategy_on_partition_html_respects_multipage` and
`test_add_chunking_strategy_title_on_partition_auto_respects_multipage`
were removed since they rely on the `page_number` fields from the SEC
HTML file; the tests have been moved to mock tests for `chunk_by_title`.
Revisit those tests when we find a suitable test file.
Also changed the element IDs in partition outputs for HTML files; the
element IDs change because the page number feeds into element ID
hashing. TODO ticket: update other deterministic element ID tests per
Crag's comment.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
This pull request makes it possible to return predictions in raw cell
representation from the table transformer. It will later be used to
save predictions in a cells format for simpler metrics calculation.
This PR has to be merged after
https://github.com/Unstructured-IO/unstructured-inference/pull/335
Update: The CLI shell script works when sending documents to the free
API, but the paid API is down, so waiting to test against it.
- The first commit adds docstrings and fixes type hints.
- The second commit reorganizes `test_unstructured_ingest` so it matches
the structure of `unstructured/ingest`.
- The third commit contains the primary changes for this PR.
- The `.chunk()` method responsible for sending elements to the correct
method is moved from `ChunkingConfig` to `Chunker` so that
`ChunkingConfig` acts as a config object instead of containing
implementation logic. `Chunker.chunk()` also now takes a json file
instead of a list of elements. This is done to avoid redundant
serialization if the file is to be sent to the api for chunking.
---------
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842
Main changes compared to part one:
* hash computation includes the element's sequence number on the page,
the page number, the document filename, and its text
* there are more tests for the deterministic behavior of IDs returned by
partitioning functions, plus their uniqueness (guaranteed at the
document level, and with high probability across multiple documents)
This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
Part one of the issue described here:
https://github.com/Unstructured-IO/unstructured/issues/2461
It does not change how the hashing algorithm works, just reworks how IDs
are assigned:
> Element ID Design Principles
>
> 1. A partitioning function can assign only one of two available ID
types to a returned element: a hash or UUID.
> 2. All elements that are returned come with an ID, which is never
None.
> 3. No matter which type of ID is used, it will always be in string
format.
> 4. Partitioning a document returns elements with hashes as their
default IDs.
Big thanks to @scanny for explaining the current design and suggesting
ways to do it right, especially with chunking.
Here's the next PR in line:
https://github.com/Unstructured-IO/unstructured/pull/2673
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>
This PR aims to update the "Ingest Test Fixtures Update PR" CI to update
the ingest test fixtures only if the OVERWRITE_FIXTURES variable is not
`false` and the OUTPUT_DIR directory is not empty.
Add support for start_index in HTML links extraction (closes #2625).
Testing:
```
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json
html_text = """<html>
<p>Hello there I am a <a href="/link">very important link!</a></p>
<p>Here is a list of my favorite things</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li>
<li>Dogs</li>
</ul>
<a href="/loner">A lone link!</a>
</html>"""
elements = partition_html(text=html_text)
print(elements_to_json(elements))
```
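If the new behavior works as described, each anchor's position is
surfaced in the element metadata; the snippet below assumes the field is
named `link_start_indexes`, alongside the existing `link_texts` and
`link_urls`:
```python
from unstructured.partition.html import partition_html

elements = partition_html(text=html_text)  # html_text from the snippet above
for el in elements:
    if el.metadata.link_texts:
        # Each entry in link_start_indexes (field name assumed) should be the
        # character offset of the corresponding link within el.text.
        print(list(zip(el.metadata.link_texts, el.metadata.link_urls,
                       el.metadata.link_start_indexes)))
```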
---------
Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
This PR:
- Fixes occasional collection deletion failures for AstraDB by putting
collection deletion statements inside a trap statement, using click
commands to do this.
Testing:
- Run ingest astradb destination test
The MongoDB redact method was created because we wanted part of the URL
to be exposed to the user during logging, so it did not use the
dataclass `enhanced_field(sensitive=True)` solution.
This changes it to use our standard redaction solution, which also
minimizes the amount of work to be done in platform.
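For illustration, a minimal sketch of what the standard redaction
approach looks like on a config dataclass; the class name and import
path are assumptions, with only `enhanced_field(sensitive=True)` taken
from the description above:
```python
from dataclasses import dataclass

from unstructured.ingest.enhanced_dataclass import enhanced_field  # path assumed

@dataclass
class MongoDBAccessConfig:  # hypothetical name
    # Marking the field sensitive lets the shared redaction logic scrub the
    # value from logs, replacing the connector-specific redact method.
    uri: str = enhanced_field(sensitive=True)
```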
**Summary**
Add an `--include-orig-elements` option to the Ingest CLI to allow users
to specify that corresponding new chunking parameter.
**Reviewer** A lot of this is cleanup; the second commit is where this
option is actually added. The first commit fixes a number of
inaccuracies in the documentation and does some other clean-up.
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
This PR is the second part of fixing "embedded text not getting merged
with inferred elements"; the first part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/331.
### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when removing pdfminer (embedded) elements that were merged with
inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
introduced in the [first
part](https://github.com/Unstructured-IO/unstructured-inference/pull/331)
when removing pdfminer (embedded) elements that were merged with
inferred elements
- bump `unstructured-inference` to 0.7.25
### Testing
PDF:
[pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)
```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
filename="pwc-financial-statements-p114.pdf",
strategy="hi_res",
infer_table_structure=True,
extract_image_block_types=["Image"],
)
table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)
```
---------
Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
**Summary**
This final PR in the "orig_elements" series adds what is needed so that
`.metadata.orig_elements`, when present on a chunk (element), is
serialized to JSON when the chunk is serialized, for instance, to be
used in an HTTP response payload.
It also provides for deserializing such a JSON payload into chunks that
contain the `.orig_elements` metadata.
**Additional Context**
Note that `.metadata.orig_elements` is always `Optional[list[Element]]`
when in memory. However, those original elements are serialized as
Base64-encoded gzipped JSON and are in that form (str) when present as
JSON or as "element-dicts", which is an intermediate
serialization/deserialization format. That is, serialization is `Element
-> dict -> JSON` and deserialization is `JSON -> dict -> Element` and
`.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms.
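As a rough illustration of that wire format (these are not the library's
exact helpers), the Base64-encoded gzipped JSON transform looks like
this:
```python
import base64
import gzip
import json

def compress_orig_elements(element_dicts: list[dict]) -> str:
    # Element -> dict -> JSON -> gzip -> Base64 string, per the scheme above.
    raw = json.dumps(element_dicts).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")

def decompress_orig_elements(payload: str) -> list[dict]:
    # The reverse: Base64 string -> gzip -> JSON -> element-dicts.
    return json.loads(gzip.decompress(base64.b64decode(payload)))
```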
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
Change default values for table extraction - works in tandem with
[this](https://github.com/Unstructured-IO/unstructured-api/pull/370)
`unstructured-api` PR.
We want to move away from the `pdf_infer_table_structure` parameter; in
this PR:
- We change how it's treated with respect to the `skip_infer_table_types`
parameter. Whether to extract tables from a PDF now follows from the
rule: `pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set it to `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]` by default
- We remove it from the examples in documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in
documentation
More detailed description of how we want the parameters to interact:
- if `pdf_infer_table_structure` is False, tables will never be
extracted from PDFs
- if `pdf_infer_table_structure` is True, tables will be extracted from
PDFs unless skipped via `skip_infer_table_types`
- by default, `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`
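The combined rule reduces to a single boolean, sketched here with the
defaults described above:
```python
from typing import Optional

def should_extract_pdf_tables(
    pdf_infer_table_structure: bool = True,
    skip_infer_table_types: Optional[list[str]] = None,
) -> bool:
    # Tables are extracted from a PDF only when inference is on AND "pdf"
    # has not been explicitly skipped.
    skip = skip_infer_table_types if skip_infer_table_types is not None else []
    return pdf_infer_table_structure and "pdf" not in skip

assert should_extract_pdf_tables() is True
assert should_extract_pdf_tables(skip_infer_table_types=["pdf"]) is False
assert should_extract_pdf_tables(pdf_infer_table_structure=False) is False
```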
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
### Description
This PR resolves the following open issue:
[bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319).
To do so, the following changes were made:
* All AWS configs were added as input parameters to the CLI
* These were mapped to the bedrock embedder when an embedder is
generated via `get_embedder`
* An ingest test was added to call the AWS bedrock service
* Requirements for boto were bumped because the bedrock runtime, which
is required to hit the bedrock service, was first introduced in version
`1.34.63`, ahead of the previously pinned boto version.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Thanks to @mogith-pn from Clarifai we have a new destination connector!
This PR adds Clarifai as an ingest destination connector, including:
- Access via CLI and programmatically
- Documentation and examples
- An integration test script
**Summary**
The serialization and deserialization (serde) of
`metadata.orig_elements` will be located in `unstructured.staging.base`
alongside `elements_to_json()` and other existing serde functions.
Improve the typing, readability, and structure of that module before
adding the new serde functions for `metadata.orig_elements`.
**Reviewers:** The commits are well-groomed and are probably quicker to
review commit-by-commit than as all files-changed at once.
Fixes a OneDrive bug the same way Ryan fixed the SharePoint error (both
are Microsoft products):
https://github.com/Unstructured-IO/unstructured/pull/2591
https://github.com/Unstructured-IO/unstructured/pull/2592/files
We are seeing occurrences of inconsistency in the timestamps returned by
Onedrive when fetching created and modified dates. Furthermore, in
future versions of this library, a datetime object will be returned
rather than a string.
Changes:
- Adds logic to guarantee OneDrive dates will be properly formatted as
ISO, regardless of the format provided by the onedrive library
- Bumps timestamp format output to include timezone offset (as we do
with others)
- Adds unit tests for isoformat
- json_to_dict is already unit tested here:
https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py
Adds a small change for AstraDB to allow them to see which source called
their API.
**Summary**
Add `metadata.is_continuation = True` to the metadata of second-and-later
text-split chunks formed from an oversized non-table element. Previously
this metadata was only present on text-split `TableChunk` elements.
This enables downstream filtering of intentionally redundant metadata on
chunk elements that may not be desired for all purposes.
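For example, downstream code might use the flag like this (a sketch;
which metadata to drop is up to the consumer):
```python
def drop_redundant_metadata(chunks):
    # `is_continuation` marks second-and-later splits of an oversized element,
    # so redundant per-chunk metadata can be filtered from those chunks.
    for chunk in chunks:
        if getattr(chunk.metadata, "is_continuation", None):
            chunk.metadata.orig_elements = None  # one field a consumer might drop
    return chunks
```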
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
Adding `metadata.data_source.permissions_data` to
sharepoint-with-permissions.sh `--metadata-exclude` to prevent a
SharePoint deprecation warning from ruining the test.
Updating expected-structured-output.
As per Ahmet's comment, we do want to check SharePoint permissions
metadata at some point, but that will take a separate type of test; a
file diff test is too unstable. Permissions checking will come later
down the road.
The Google Drive service account key can be a dict or a file path (str).
We have successfully been using the path, but the dict can also end up
being stored as a string that needs to be deserialized. The
deserialization can have issues with single and double quotes.
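A minimal sketch of the kind of normalization this implies (the helper
name is illustrative):
```python
import ast
import json
from pathlib import Path

def normalize_service_account_key(key) -> dict:
    # Accept a dict directly, a path to a JSON key file, or a dict that was
    # stored as a string (possibly single-quoted, which json.loads rejects).
    if isinstance(key, dict):
        return key
    if Path(key).is_file():
        return json.loads(Path(key).read_text())
    try:
        return json.loads(key)
    except json.JSONDecodeError:
        return ast.literal_eval(key)  # tolerates single-quoted dict strings
```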
We are seeing occurrences of inconsistency in the timestamps returned by
office365.sharepoint when fetching created and modified dates.
Furthermore, in future versions of this library, a datetime object will
be returned rather than a string.
## Changes
- This adds logic to guarantee SharePoint dates will be properly
formatted as ISO, regardless of the format provided by the sharepoint
library.
- Bumps timestamp format output to include timezone offset (as we do
with others)
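A sketch of the normalization idea, assuming `python-dateutil` is
available (the real helper may differ):
```python
from datetime import datetime, timezone

from dateutil import parser  # python-dateutil, assumed available

def to_iso_with_offset(value) -> str:
    # Accept either the string formats office365.sharepoint has returned or a
    # datetime object (as newer versions of the library will return).
    dt = value if isinstance(value, datetime) else parser.parse(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assume UTC when no offset given
    return dt.isoformat()
```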
## Testing
Unit test added to validate this datetime handling across various
formats.
---------
Co-authored-by: David Potter <potterdavidm@gmail.com>
Thanks to Eric Hare @erichare at DataStax we have a new destination
connector.
This Pull Request implements an integration with [Astra
DB](https://datastax.com), which allows the Astra DB vector database to
be compatible with Unstructured's set of integrations.
To create your Astra account and authenticate with your
`ASTRA_DB_APPLICATION_TOKEN` and `ASTRA_DB_API_ENDPOINT`, follow these
steps:
1. Create an account at https://astra.datastax.com
2. Login and create a new database
3. From the database page, in the right hand panel, you will find your
API Endpoint
4. Beneath that, you can create a Token to be used
Some notes about Astra DB:
- Astra DB is a Vector Database which allows for high-performance
database transactions, and enables modern GenAI apps [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html)
- It supports similarity search via a number of methods [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics)
- It also supports non-vector tables / collections
### Summary
Detects headers and footers when using `partition_pdf` with the fast
strategy. Identifies elements that are positioned in the top or bottom
5% of the page as headers or footers. If no coordinate information is
available, an element won't be detected as a header or footer.
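Roughly, the positional test amounts to the following (coordinate
conventions simplified; the real implementation works on element
bounding boxes):
```python
from typing import Optional

def positional_label(y0: float, y1: float, page_height: float) -> Optional[str]:
    # y0/y1 are the element's bottom/top coordinates, origin at the page bottom.
    if y1 <= 0.05 * page_height:
        return "Footer"  # entirely within the bottom 5% of the page
    if y0 >= 0.95 * page_height:
        return "Header"  # entirely within the top 5% of the page
    return None  # mid-page, or no coordinates: leave the element type as-is
```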
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
This PR adds new table evaluation metrics prepared by @leah1985.
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows
TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`
The current `test-ingest-src.sh` and `evaluation-metrics` do not allow
passing the `EXPORT_DIR` (`OUTPUT_ROOT` in `evaluation-metrics`). Output
is currently saved in the current working directory
(`unstructured/test_unstructured_ingest`). When running the eval from
`core-product`, all outputs are saved at
`core-product/upstream-unstructured/test_unstructured_ingest`, which is
undesirable.
This PR modifies two scripts to accommodate this behavior:
1. `test-ingest-src.sh` - assigns `EVAL_OUTPUT_ROOT` to the value set in
the environment if it exists, or to the current working directory if
not, then calls `evaluation-metrics.sh`.
2. `evaluation-metrics.sh` - accepts a param from `test-ingest-src.sh`
if given, falling back to the value set in the environment, or to the
current directory if neither is set.
(Note: I also added a param to `evaluation-metrics.sh` because it makes
sense to allow a separate run to specify an export directory.)
This PR should work in sync with another PR under `core-product`; I will
add the link here later.
**To test:**
Run the script below, change `$SCRIPT_DIR` as needed to see the result.
```
export OVERWRITE_FIXTURES=true
./upstream-unstructured/test_unstructured_ingest/src/s3.sh
SCRIPT_DIR=$(dirname "$(realpath "$0")")
bash -x ./upstream-unstructured/test_unstructured_ingest/evaluation-metrics.sh text-extraction "$SCRIPT_DIR"
```
----
This PR also updates the requirements via `make pip-compile`, since the
`click` module was not found.
**Reviewers:** It may be easier to review each of the two commits
separately. The first adds the new `_SubtableParser` object with its
unit-tests and the second one uses that object to replace the flawed
existing subtable-parsing algorithm.
**Summary**
There are a cluster of bugs in `partition_xlsx()` that all derive from
flaws in the algorithm we use to detect "subtables". These are
encountered when the user wants to get multiple document-elements from
each worksheet, which is the default (argument `find_subtable = True`).
This PR replaces the flawed existing algorithm with a `_SubtableParser`
object that encapsulates all that logic and has thorough unit-tests.
**Additional Context**
This is a summary of the failure cases. There are a few other cases but
they're closely related and this was enough evidence and scope for my
purposes. This PR fixes all these bugs:
```python
#
# -- ✅ CASE 1: There are no leading or trailing single-cell rows.
# -> the subtable functions never get called; the subtable is emitted as the only element
#
# a b -> Table(a, b, c, d)
# c d
# -- ✅ CASE 2: There is exactly one leading single-cell row.
# -> Leading single-cell row emitted as `Title` element, core-table properly identified.
#
# a -> [ Title(a),
# b c Table(b, c, d, e) ]
# d e
# -- ❌ CASE 3: There are two-or-more leading single-cell rows.
# -> leading single-cell rows are included in subtable
#
# a -> [ Table(a, b, c, d, e, f) ]
# b
# c d
# e f
# -- ❌ CASE 4: There is exactly one trailing single-cell row.
# -> core table is dropped. trailing single-cell row is emitted as Title
# (this is the behavior in the reported bug)
#
# a b -> [ Title(e) ]
# c d
# e
# -- ❌ CASE 5: There are two-or-more trailing single-cell rows.
# -> core table is dropped. trailing single-cell rows are each emitted as a Title
#
# a b -> [ Title(e),
# c d Title(f) ]
# e
# f
# -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b c Table(b, c, d, e),
# d e Title(f) ]
# f
# -- ✅ CASE 7: There are two leading and one trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b Title(b),
# c d Table(c, d, e, f),
# e f Title(g) ]
# g
# -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b Title(b),
# c d Table(c, d, e, f),
# e f Title(g),
# g Title(h) ]
# h
# -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below.
# -> First cell is mistakenly emitted as title, remaining cells are dropped.
#
# a b c -> [ Title(a) ]
# -- ❌ CASE 10: Single-row subtable with one leading single-cell row.
# -> Leading single-row cell is correctly identified as title, core-table is mis-identified
# as a `Title` and truncated.
#
# a -> [ Title(a),
# b c d Title(b) ]
```
Change the OpenSearch port to see if it fixes CI. We think there may be
a conflict with the Elasticsearch docker port.
Also adds a simple retry to the vector query.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
This PR:
- Moves ingest dependencies into local scopes so that ingest connector
classes can be imported without installing their external dependencies.
This allows lightweight use of the classes (not the instances; to use
the instances as intended you'll still need the dependencies).
- Upgrades the embed module dependencies from `langchain` to the
`langchain-community` module (to pass CI rather than introducing a pin)
- Runs pip-compile
- Does minor refactors in other files to pass the `ruff 2.0` checks
introduced by pip-compile
A small improvement to Vectara requested by Ofer at Vectara.
In the "Document" construct, every document can have a title. If it's
there, the UI will show it above the document (otherwise you get
"Untitled").
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
Thanks to Ofer at Vectara, we now have a Vectara destination connector.
- There are no dependencies since it is all REST calls to the API
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
- there are multiple places setting the default `hi_res_model_name` in
both `unstructured` and `unstructured-inference`
- these lead to inconsistency and unexpected behaviors
- this fix removes a helper in `unstructured` that tries to set the
default hi_res layout detection model; instead we rely on
`unstructured-inference` to provide that default when no explicit model
name is passed in
## test
```bash
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true ipython
```
```python
from unstructured.partition.auto import partition
# find a pdf file
elements = partition("foo.pdf", strategy="hi_res")
assert elements[0].metadata.detection_origin == "yolox"
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Formatting of link_texts was breaking metadata storage. It turns out it
didn't need any conforming and came in correctly from JSON.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
Connector data source versions should always be string values; however,
we were using the integer checksum value as the version for fsspec
connectors. This casts that value to a string.
## Changes
* Cast the checksum value to a string when assigning the version value
for fsspec connectors.
* Adds test to validate that these connectors will assign a string value
when an integer checksum is fetched.
## Testing
Unit test added.
Adds OpenSearch as a source and destination.
Since OpenSearch is a fork of Elasticsearch, these connectors rely
heavily on inheriting the Elasticsearch connectors whenever possible.
- Adds OpenSearch source connector to be able to ingest documents from
OpenSearch.
- Adds OpenSearch destination connector to be able to ingest documents
from any supported source, embed them and write the embeddings /
documents into OpenSearch.
- Defines an example unstructured elements schema for users to be able
to setup their unstructured OpenSearch indexes easily.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
To test:
cd docs && make html
Changelogs:
- point the main readme to the correct connector HTML page
- point the Chroma docs to the correct sample code
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
There are several public interface points for chunking, and they all
provide defaults for arguments like `max_characters`. These defaults are
provided by literal values, and keeping them synchronized has become a
problem.
Declare constant values for chunking argument defaults and use those
wherever a non-trivial default is used in an end-user facing API
function.
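The pattern is simply module-level constants consumed by every public
signature; a hedged sketch (names and values illustrative):
```python
# Single source of truth for chunking argument defaults.
CHUNK_MAX_CHARACTERS = 500
CHUNK_MULTIPAGE_SECTIONS = True

def chunk_by_title(
    elements,
    max_characters: int = CHUNK_MAX_CHARACTERS,
    multipage_sections: bool = CHUNK_MULTIPAGE_SECTIONS,
):
    ...  # every public chunking entry point references the same constants
```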
Connectors now use a predictable result-file naming convention so that
consumers of the library can write code that abstracts over the
particular connector. This change introduces compatibility with said
convention: `_output_filename` now returns the filename including its
format.
### Summary
The goal of this PR is to keep all image elements when using "hi_res"
strategy. Previously, `Image` elements with small chunks of text were
ignored unless the image block extraction parameters
(`extract_images_in_pdf` or `extract_image_block_types`) were specified.
Now, all image elements are kept regardless of whether the image block
extraction parameters are specified.
### Testing
- on `main` branch,
```
from unstructured.documents.elements import ElementType
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
filename="example-docs/embedded-images.pdf",
strategy="hi_res",
)
image_elements = [el for el in elements if el.category == ElementType.IMAGE]
print("number of image elements: ", len(image_elements))
```
The above code will display `number of image elements: 0`.
- on this `feature` branch,
The same code will display `number of image elements: 3`
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Currently in the Elasticsearch Destination ingest test we are writing
the embeddings to a "float" type field. In order to leverage this field
for similarity search, it should be mapped as "dense_vector" with the
respective dimensions assigned.
This PR updates that mapping and adds a test query to validate that this
works as expected.
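For reference, a dense_vector mapping looks roughly like this (client
version and dimension value are assumptions):
```python
from elasticsearch import Elasticsearch  # 8.x client assumed

client = Elasticsearch("http://localhost:9200")
client.indices.create(
    index="ingest-test",
    mappings={
        "properties": {
            # "dense_vector" with explicit dims enables similarity search,
            # unlike the previous plain "float" field; 384 is illustrative.
            "embeddings": {"type": "dense_vector", "dims": 384},
        }
    },
)
```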
The new "basic" chunking strategy and overlap options need to be
available from the ingest CLI. An ingest test of those features is also
welcome, both to verify the ingest feature and to defend against
regressions in the chunking code.
Add a local ingest test exercising both the "basic" chunking strategy
and intra-chunk overlap. Since there is no new source connector
involved, use the local ingest source and destination. Update
documentation to suit, filling in some details that hadn't made it into
the docs yet.
This PR updates Pinecone index creation in the ingest test due to a
recent update in the Pinecone API.
Due to this change, it is no longer allowed to specify both the number
of replicas and the number of pods:
`Cannot specify both replicas and pods`
We solve this by removing the replica specification from the index
creation request.
```
Creating index ingest-test-28418
Index creation success: 201
```
Solution to issue
https://github.com/Unstructured-IO/unstructured/issues/2321.
The simple_salesforce API allows passing a private key path or value.
This PR introduces this support for the ingest connector.
The Salesforce parameter "private-key-file" has been renamed to
"private-key".
It can contain one of the following:
- a path to a PEM-encoded key file (as a string)
- the key contents (a PEM-encoded string)
If the provided value cannot be parsed as a PEM-encoded private key,
file existence is checked. This way, private key contents are not
exposed to unnecessary underlying function calls.
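The resolution order described above might look like this (a sketch, not
the connector code):
```python
from pathlib import Path

def resolve_private_key(private_key: str) -> tuple:
    # Returns (key_contents, key_path). Try to treat the value as PEM contents
    # first so the secret is never passed to filesystem calls unnecessarily.
    if "-----BEGIN" in private_key:  # cheap PEM heuristic for illustration
        return private_key, None
    if Path(private_key).is_file():
        return None, private_key
    raise ValueError("private-key is neither PEM contents nor an existing file")
```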
FSSpec serialization converted the JSON token to a string with single
quotes. GCS requires the JSON token in the form of a dict, so this
format is now assured. Other forms of auth are not modified, but there
is improved validation for all of the options.
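Validation along these lines would cover it (illustrative, not the exact
connector code):
```python
import ast
import json

def ensure_token_dict(token) -> dict:
    # GCS needs the service-account token as a dict; a round-trip through
    # fsspec serialization may have left it as a single-quoted string.
    if isinstance(token, dict):
        return token
    try:
        return json.loads(token)
    except json.JSONDecodeError:
        return ast.literal_eval(token)  # tolerates single-quoted dict strings
```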