308 Commits

Author SHA1 Message Date
Pluto
df1f7bcd0e
Save table prediction in cells format (#2892)
This pull request allows returning predictions in raw cell
representation from the table transformer. It will later be used to save
predictions in a cells format for simpler metrics calculation.

This PR has to be merged after
https://github.com/Unstructured-IO/unstructured-inference/pull/335
2024-04-25 11:14:48 +00:00
John
3843af666e
feat: Enable remote chunking via unstructured-ingest (#2905)
Update: the CLI shell script works when sending documents to the free
API, but the paid API is down, so testing against it is still pending.

- The first commit adds docstrings and fixes type hints.
- The second commit reorganizes `test_unstructured_ingest` so it matches
the structure of `unstructured/ingest`.
- The third commit contains the primary changes for this PR.
- The `.chunk()` method responsible for sending elements to the correct
method is moved from `ChunkingConfig` to `Chunker` so that
`ChunkingConfig` acts as a config object instead of containing
implementation logic. `Chunker.chunk()` also now takes a JSON file
instead of a list of elements. This is done to avoid redundant
serialization if the file is to be sent to the API for chunking.

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
2024-04-25 00:24:58 +00:00
Michał Martyniak
2d1923ac7e
Better element IDs - deterministic and document-unique hashes (#2673)
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:
* hash computation includes the element's sequence number on the page, the page
number, the document filename, and the element's text
* there are more tests for the deterministic behavior of IDs returned by
partitioning functions and for their uniqueness (guaranteed at the document
level, and highly probable across multiple documents)

This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
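
A rough sketch of the kind of hash the description implies (field order, separator, and digest truncation are assumptions, not the library's exact implementation):

```python
# hedged sketch: deterministic, document-unique element ID from the listed inputs
import hashlib

def element_id(filename: str, page_number: int, sequence_on_page: int, text: str) -> str:
    data = f"{filename}|{page_number}|{sequence_on_page}|{text}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:32]  # string ID, truncated hex digest
```
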
2024-04-24 00:05:20 -07:00
Michał Martyniak
001fa17c86
Preparing the foundation for better element IDs (#2842)
Part one of the issue described here:
https://github.com/Unstructured-IO/unstructured/issues/2461

It does not change how the hashing algorithm works; it just reworks how IDs
are assigned:
> Element ID Design Principles
> 
> 1. A partitioning function can assign only one of two available ID
types to a returned element: a hash or UUID.
> 2. All elements that are returned come with an ID, which is never
None.
> 3. No matter which type of ID is used, it will always be in string
format.
> 4. Partitioning a document returns elements with hashes as their
default IDs.

Big thanks to @scanny for explaining the current design and suggesting
ways to do it right, especially with chunking.


Here's the next PR in line:
https://github.com/Unstructured-IO/unstructured/pull/2673

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>
2024-04-16 21:14:53 +00:00
Christine Straub
ba3f374268
Fix: ingest test fixtures update pr (#2881)
This PR updates the "Ingest Test Fixtures Update PR" CI workflow to update the
ingest test fixtures only if the OVERWRITE_FIXTURES variable is not
`false` and the OUTPUT_DIR directory is not empty.
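
Roughly the guard being described, as a shell sketch (the actual workflow's variable handling may differ):

```bash
# hedged sketch of the fixture-update guard
if [ "$OVERWRITE_FIXTURES" != "false" ] && [ -n "$(ls -A "$OUTPUT_DIR" 2>/dev/null)" ]; then
  echo "Updating ingest test fixtures from $OUTPUT_DIR"
  # ...copy OUTPUT_DIR contents over the checked-in fixtures...
fi
```
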
2024-04-15 17:47:22 +00:00
MiXiBo
0506aff788
add support for start_index in html links extraction (#2600)
add support for start_index in html links extraction (closes #2625)

Testing
```
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json


html_text = """<html>
        <p>Hello there I am a <a href="/link">very important link!</a></p>
        <p>Here is a list of my favorite things</p>
        <ul>
            <li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li>
            <li>Dogs</li>
        </ul>
        <a href="/loner">A lone link!</a>
    </html>"""

elements = partition_html(text=html_text)
print(elements_to_json(elements))
```
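
To inspect the extracted link offsets directly, something like the following can be appended (assuming link metadata is exposed under `metadata.links`, with each entry carrying `text`, `url`, and the new `start_index`):

```python
for element in elements:
    for link in element.metadata.links or []:
        print(element.text, link)  # each link entry should now include "start_index"
```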

---------

Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-04-12 06:14:20 +00:00
Ahmet Melek
6fd29ea77c
fix: collection deletion for AstraDB test (#2869)
This PR:
- Fixes occasional collection deletion failures for AstraDB by putting the
collection deletion statements inside a trap statement. The deletion itself is
issued via click commands (see the sketch below).
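
A minimal sketch of the trap-based cleanup pattern (the helper invocation below is hypothetical; the real test issues the deletion through its own click commands):

```bash
# hedged sketch: delete the AstraDB collection even if the test exits early
function cleanup() {
  python scripts/astra-test-helpers.py delete-collection \
    --collection-name "$COLLECTION_NAME" || true  # hypothetical helper and flags
}
trap cleanup EXIT
```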

Testing:
- Run the AstraDB destination ingest test
2024-04-10 23:08:24 +00:00
Ahmet Melek
d46792214a
feat: add vertexai embeddings (#2693)
This PR:
- Adds VertexAI embeddings as an embedding provider

Testing
- Tested with the Pinecone destination connector on
[this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693)
job run.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2024-03-28 21:15:36 +00:00
David Potter
c8cf8f31ac
bug CORE-4225: mongodb url bug (#2662)
The MongoDB redact method was created because we wanted part of the URL
to be exposed to the user during logging, so it did not use the
dataclass `enhanced_field(sensitive=True)` solution.

This changes it to use our standard redaction solution, which also
minimizes the amount of work to be done in platform.
2024-03-28 18:38:50 +00:00
Steve Canny
9ae838e50a
feat: add --include-orig-elements option to Ingest CLI (#2687)
**Summary**
Add an `--include-orig-elements` option to the Ingest CLI to allow users
to specify the corresponding new chunking parameter.

**Reviewer:** A lot of this is cleanup; the second commit is where this
option is actually added. The first commit fixes a number of
inaccuracies in the documentation and does some other clean-up.
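
A hedged usage sketch (connector, paths, and chunking strategy are placeholders):

```bash
unstructured-ingest local \
  --input-path example-docs \
  --output-dir local-ingest-output \
  --chunking-strategy by_title \
  --include-orig-elements
```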

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-03-27 06:35:01 +00:00
Christine Straub
08fafc564f
Fix: embedded text not getting merged with inferred elements (#2679)
This PR is the second part of fixing "embedded text not getting merged
with inferred elements"; the first part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/331.

### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when removing pdfminer (embedded) elements that were merged with
inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
introduced in the [first
part](https://github.com/Unstructured-IO/unstructured-inference/pull/331)
when removing pdfminer (embedded) elements that were merged with
inferred elements
- bump `unstructured-inference` to 0.7.25

### Testing
PDF:
[pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)

```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```

```
elements = partition_pdf(
    filename="pwc-financial-statements-p114.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image"],
)

table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)
```

---------

Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-03-23 03:59:23 +00:00
Steve Canny
56fbaaed10
feat(chunking): add metadata.orig_elements serde (#2680)
**Summary**
This final PR in the "orig_elements" series adds what is needed so that
`.metadata.orig_elements`, when present on a chunk (element), is
serialized to JSON when the chunk is serialized, for instance to be
used in an HTTP response payload.

It also provides for deserializing such a JSON payload into chunks that
contain the `.orig_elements` metadata.

**Additional Context**
Note that `.metadata.orig_elements` is always `Optional[list[Element]]`
when in memory. However, those original elements are serialized as
Base64-encoded gzipped JSON and are in that form (str) when present as
JSON or as "element-dicts", which is an intermediate
serialization/deserialization format. That is, serialization is `Element
-> dict -> JSON`, deserialization is `JSON -> dict -> Element`, and
`.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms.
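
The encoding scheme itself can be illustrated with plain standard-library calls (a sketch of the idea, not the library's actual helper functions):

```python
# hedged sketch: round-trip element-dicts through Base64-encoded gzipped JSON
import base64, gzip, json

element_dicts = [{"type": "Title", "text": "Hello"}, {"type": "NarrativeText", "text": "World"}]

encoded = base64.b64encode(gzip.compress(json.dumps(element_dicts).encode("utf-8"))).decode("ascii")
decoded = json.loads(gzip.decompress(base64.b64decode(encoded)))

assert decoded == element_dicts
```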

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-03-22 21:53:26 +00:00
Filip Knefel
bdfd975115
chore: change table extraction defaults (#2588)
Change default values for table extraction; works in tandem with
[this](https://github.com/Unstructured-IO/unstructured-api/pull/370)
`unstructured-api` PR.

We want to move away from the `pdf_infer_table_structure` parameter. In this
PR:
- We change how it is treated with respect to the `skip_infer_table_types`
parameter. Whether to extract tables from a PDF now follows the rule:
`pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]` by default
- We remove it from the examples in the documentation
- We describe it in the documentation as deprecated in favor of
`skip_infer_table_types`

A more detailed description of how we want the parameters to interact
(sketched below):
- if `pdf_infer_table_structure` is `False`, tables will never be extracted
from PDFs
- if `pdf_infer_table_structure` is `True`, tables will be extracted from
PDFs unless skipped via `skip_infer_table_types`
- by default, `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`
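
The decision rule, restated as code (a sketch of the described behavior, not the library's actual implementation):

```python
def should_extract_pdf_tables(pdf_infer_table_structure: bool, skip_infer_table_types: list) -> bool:
    return pdf_infer_table_structure and "pdf" not in skip_infer_table_types

assert should_extract_pdf_tables(True, []) is True        # new defaults: tables extracted
assert should_extract_pdf_tables(True, ["pdf"]) is False   # explicitly skipped
assert should_extract_pdf_tables(False, []) is False       # feature disabled
```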

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-22 10:08:49 +00:00
Roman Isecke
4ff6a5b78e
Roman/bugfix support bedrock embeddings (#2650)
### Description
This PR resolves the following open issue:
[bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319).
To do so, the following changes were made:
* All AWS configs were added as input parameters to the CLI
* These were mapped to the Bedrock embedder when an embedder is
generated via `get_embedder`
* An ingest test was added to call the AWS Bedrock service
* Requirements for boto were bumped because the Bedrock runtime, which is
required to hit the Bedrock service, was first introduced in version
`1.34.63`, ahead of the pinned boto version.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-03-21 18:21:04 +00:00
David Potter
9177aa20a8
feature CORE-3985: add Clarifai destination connector (#2633)
Thanks to @mogith-pn from Clarifai, we have a new destination connector!

This PR adds Clarifai as an ingest destination connector, including:
- Access via CLI and programmatically
- Documentation and examples
- An integration test script
2024-03-21 16:36:21 +00:00
Steve Canny
31bef433ad
rfctr: prepare to add orig_elements serde (#2668)
**Summary**
The serialization and deserialization (serde) of
`metadata.orig_elements` will be located in `unstructured.staging.base`
alongside `elements_to_json()` and other existing serde functions.
Improve the typing, readability, and structure of that module before
adding the new serde functions for `metadata.orig_elements`.

**Reviewers:** The commits are well-groomed and are probably quicker to
review commit-by-commit than as all files-changed at once.
2024-03-20 21:27:59 +00:00
David Potter
5b92e0bb6b
bug CORE-4089: Onedrive partitioning fails - datetime formatting error (#2638)
Fixes a OneDrive bug the same way Ryan fixed the SharePoint error (both
are Microsoft products):
https://github.com/Unstructured-IO/unstructured/pull/2591
https://github.com/Unstructured-IO/unstructured/pull/2592/files

We are seeing inconsistencies in the timestamps returned by
OneDrive when fetching created and modified dates. Furthermore, in
future versions of this library, a datetime object will be returned
rather than a string.

Changes
- Adds logic to guarantee OneDrive dates will be properly formatted as ISO,
regardless of the format provided by the onedrive library (see the sketch below).
- Bumps the timestamp output format to include the timezone offset (as we do
with others).
- Adds unit tests for the isoformat handling.
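
A hedged sketch of the normalization idea (not the connector's actual helper):

```python
from datetime import datetime, timezone

from dateutil import parser  # python-dateutil

def to_iso_with_offset(value) -> str:
    """Normalize a date string or datetime to ISO 8601 with a timezone offset."""
    dt = value if isinstance(value, datetime) else parser.parse(str(value))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: treat naive timestamps as UTC
    return dt.isoformat()
```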

json_to_dict is already unit tested here:

https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py

Adds a small change for AstraDB to allow them to see which source called
their API.
2024-03-15 14:01:05 +00:00
Steve Canny
8ea203adf7
feat(chunking): composite text gets is_continuation (#2639)
**Summary**
Add `metadata.is_continuation = True` to the metadata of second-and-later
text-split chunks formed from an oversized non-table element. Previously
this metadata was only present on text-split `TableChunk` elements.

This enables downstream filtering of intentionally redundant metadata on
chunk elements that may not be desired for all purposes.
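
For example, a downstream consumer that only wants the lead chunk of each split can now filter on this flag (a minimal sketch; `chunks` is assumed to be the list returned by the chunker):

```python
def drop_continuation_chunks(chunks: list) -> list:
    """Keep only the first chunk of each text-split group."""
    return [c for c in chunks if not getattr(c.metadata, "is_continuation", False)]
```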

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-03-12 19:44:41 +00:00
David Potter
1ca90d209a
bug: update sharepoint-with-permissions test to fix CI (#2589)
Adding `metadata.data_source.permissions_data` to the
sharepoint-with-permissions.sh --metadata-exclude list to prevent a SharePoint
deprecation warning from breaking the test.

Updating expected-structured-output.

As per Ahmet's comment, we do want to check SharePoint permissions
metadata at some point, but that will require a separate type of test; a
file-diff test is too unstable. Permissions checking will come later down
the road.
2024-03-06 17:15:36 +00:00
David Potter
43250d5576
bug CORE-3971: fix deserialization in google-drive source connector key path (#2586)
The Google Drive service account key can be a dict or a file path (str).

We have successfully been using the path, but the dict can also end up
being stored as a string that needs to be deserialized. That
deserialization can have issues with single and double quotes.
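
A hedged sketch of the deserialization idea (not the connector's actual code):

```python
import ast
import json
import os

def resolve_service_account_key(key):
    """Return a dict if `key` holds serialized key contents; otherwise assume it is a file path."""
    if isinstance(key, dict):
        return key
    for loads in (json.loads, ast.literal_eval):  # tolerate double- and single-quoted dicts
        try:
            value = loads(key)
        except (ValueError, SyntaxError):
            continue
        if isinstance(value, dict):
            return value
    if os.path.isfile(key):
        return key  # path to the key file
    raise ValueError("key is neither a serialized dict nor an existing file path")
```
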
2024-03-03 15:30:35 +00:00
ryannikolaidis
71d5d513ef
fix: handling of varied SharePoint date formats (#2591)
We are seeing inconsistencies in the timestamps returned by
office365.sharepoint when fetching created and modified dates.
Furthermore, in future versions of this library, a datetime object will
be returned rather than a string.

## Changes

- This adds logic to guarantee SharePoint dates will be properly
formatted as ISO, regardless of the format provided by the sharepoint
library.
- Bumps the timestamp output format to include the timezone offset (as we do
with others).

## Testing

Unit test added to validate this datetime handling across various
formats.

---------

Co-authored-by: David Potter <potterdavidm@gmail.com>
2024-02-28 16:11:53 +00:00
David Potter
e8ec09c8b9
feat: astra dest connector (#2571)
Thanks to Eric Hare @erichare at DataStax, we have a new destination
connector.

This Pull Request implements an integration with [Astra
DB](https://datastax.com) which allows for the Astra DB Vector Database
to be compatible with Unstructured's set of integrations.

To create your Astra account and authenticate with your
`ASTRA_DB_APPLICATION_TOKEN` and `ASTRA_DB_API_ENDPOINT`, follow these
steps:

1. Create an account at https://astra.datastax.com
2. Log in and create a new database
3. From the database page, in the right-hand panel, you will find your
API Endpoint
4. Beneath that, you can create a token to be used

Some notes about Astra DB:

- Astra DB is a vector database that allows for high-performance
database transactions and enables modern GenAI apps ([see
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html))
- It supports similarity search via a number of methods ([see
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics))
- It also supports non-vector tables / collections
2024-02-23 20:50:50 +00:00
Matt Robinson
b4d9ad8130
enhancement: detect headers in partition_pdf with fast strategy (#2455)
### Summary

Detects headers and footers when using `partition_pdf` with the fast
strategy. Identifies elements that are positioned in the top or bottom
5% of the page as headers or footers. If no coordinate information is
available, an element won't be detected as a header or footer.
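
A minimal sketch of the positional rule described above (not the library's actual code; assumes y increases downward, as in image coordinates):

```python
def classify_header_footer(y_top: float, y_bottom: float, page_height: float):
    """Return "Header"/"Footer" for elements in the top/bottom 5% of the page, else None."""
    if y_bottom <= 0.05 * page_height:
        return "Header"
    if y_top >= 0.95 * page_height:
        return "Footer"
    return None
```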

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2024-02-23 16:56:09 +00:00
Pawel Kmiecik
ff9d46f9dc
feat(eval): table evaluation metrics (#2558)
This PR adds new table evaluation metrics prepared by @leah1985.
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows

TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`
2024-02-22 16:35:46 +00:00
Klaijan
d06936d35a
feat: modify test-ingest-src and evaluation-metrics to allow EXPORT_DIR (#2551)
The current `test-ingest-src.sh` and `evaluation-metrics` do not allow
passing the `EXPORT_DIR` (`OUTPUT_ROOT` in `evaluation-metrics`). Output is
currently saved in the current working directory
(`unstructured/test_unstructured_ingest`). When running the eval from
`core-product`, all outputs are saved at
`core-product/upstream-unstructured/test_unstructured_ingest`, which is
undesirable.

This PR modifies two scripts to accommodate such behavior:
1. `test-ingest-src.sh` - assigns `EVAL_OUTPUT_ROOT` to the value set
in the environment if it exists, or to the current working directory if
not, then calls `evaluation-metrics.sh`.
2. `evaluation-metrics.sh` - accepts a param from `test-ingest-src.sh`
if present, otherwise uses the value set in the environment if it exists,
or the current directory if not.

(Note: I also added a param to `evaluation-metrics.sh` because it makes
sense for a standalone run to be able to specify an export directory.)

This PR should work in sync with another PR under `core-product`; I will
add the link here later.

**To test:**

Run the script below, change `$SCRIPT_DIR` as needed to see the result.

```
export OVERWRITE_FIXTURES=true

./upstream-unstructured/test_unstructured_ingest/src/s3.sh

SCRIPT_DIR=$(dirname "$(realpath "$0")")
bash -x ./upstream-unstructured/test_unstructured_ingest/evaluation-metrics.sh text-extraction "$SCRIPT_DIR"
```

----

This PR also updates the requirements via `make pip-compile`, since the
`click` module was not found.
2024-02-17 05:21:15 +00:00
Steve Canny
d9f8467187
fix(xlsx): xlsx subtable algorithm (#2534)
**Reviewers:** It may be easier to review each of the two commits
separately. The first adds the new `_SubtableParser` object with its
unit-tests and the second one uses that object to replace the flawed
existing subtable-parsing algorithm.

**Summary**

There is a cluster of bugs in `partition_xlsx()` that all derive from
flaws in the algorithm we use to detect "subtables". These are
encountered when the user wants to get multiple document-elements from
each worksheet, which is the default (argument `find_subtable = True`).

This PR replaces the flawed existing algorithm with a `_SubtableParser`
object that encapsulates all that logic and has thorough unit-tests.

**Additional Context**

This is a summary of the failure cases. There are a few other cases but
they're closely related and this was enough evidence and scope for my
purposes. This PR fixes all these bugs:
```python
    #
    # --  CASE 1: There are no leading or trailing single-cell rows.
    #       -> these subtable functions never get called; the subtable is emitted as the only element
    #
    #    a b  -> Table(a, b, c, d)
    #    c d

    # --  CASE 2: There is exactly one leading single-cell row.
    #       -> Leading single-cell row emitted as `Title` element, core-table properly identified.
    #
    #    a    -> [ Title(a),
    #    b c       Table(b, c, d, e) ]
    #    d e

    # --  CASE 3: There are two-or-more leading single-cell rows.
    #       -> leading single-cell rows are included in subtable
    #
    #    a    -> [ Table(a, b, c, d, e, f) ]
    #    b
    #    c d
    #    e f

    # --  CASE 4: There is exactly one trailing single-cell row.
    #      -> core table is dropped. trailing single-cell row is emitted as Title
    #         (this is the behavior in the reported bug)
    #
    #    a b  -> [ Title(e) ]
    #    c d
    #      e

    # --  CASE 5: There are two-or-more trailing single-cell rows.
    #      -> core table is dropped. trailing single-cell rows are each emitted as a Title
    #
    #    a b  -> [ Title(e),
    #    c d       Title(f) ]
    #      e
    #      f

    # --  CASE 6: There are exactly one each leading and trailing single-cell rows.
    #      -> core table is correctly identified, leading and trailing single-cell rows are each
    #         emitted as a Title.
    #
    #      a  -> [ Title(a),
    #    b c       Table(b, c, d, e),
    #    d e       Title(f) ]
    #    f

    # --  CASE 7: There are two leading and one trailing single-cell rows.
    #      -> core table is correctly identified, leading and trailing single-cell rows are each
    #         emitted as a Title.
    #
    #    a    -> [ Title(a),
    #    b         Title(b),
    #    c d       Table(c, d, e, f),
    #    e f       Title(g) ]
    #      g

    # --  CASE 8: There are two-or-more leading and trailing single-cell rows.
    #      -> core table is correctly identified, leading and trailing single-cell rows are each
    #         emitted as a Title.
    #
    #      a  -> [ Title(a),
    #      b       Title(b),
    #    c d       Table(c, d, e, f),
    #    e f       Title(g),
    #    g         Title(h) ]
    #    h

    # --  CASE 9: Single-row subtable, no single-cell rows above or below.
    #      -> First cell is mistakenly emitted as title, remaining cells are dropped.
    #
    #    a b c  -> [ Title(a) ]

    # --  CASE 10: Single-row subtable with one leading single-cell row.
    #      -> Leading single-row cell is correctly identified as title, core-table is mis-identified
    #         as a `Title` and truncated.
    #
    #    a      -> [ Title(a),
    #    b c d       Title(b) ]
```
2024-02-13 20:29:17 -08:00
Ahmet Melek
f9f2cacb58
build(release): release commit for 0.12.4 (#2525)
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2024-02-08 21:18:29 +00:00
David Potter
0c834517d8
fix: change opensearch port (#2517)
Change the OpenSearch port to see if it fixes CI. We think there may be a
conflict with the Elasticsearch Docker port.

Also adds a simple retry to the vector query.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-02-07 21:25:04 +00:00
Ahmet Melek
be71633415
refactor: isolate ingest dependencies into local scopes (#2509)
This PR: 
- Moves ingest dependencies into local scopes so that ingest connector
classes can be imported without installing their external dependencies.
This allows lightweight use of the classes (not the instances; to use the
instances as intended you'll still need the dependencies).
- Upgrades the embed module dependencies from `langchain` to the
`langchain-community` module (to pass CI rather than introducing a pin).
- Runs pip-compile.
- Does minor refactors in other files to pass the `ruff 2.0` checks
introduced by pip-compile.
2024-02-06 21:28:55 +00:00
David Potter
138625438f
fix: add title to Vectara upload (#2511)
A small improvement to Vectara requested by Ofer at Vectara.

In the "Document" construct, every document can have a title. If it's
there, the UI will show it above the document (otherwise you get
"Untitled").

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-02-06 19:49:53 +00:00
David Potter
c100ce28a7
feat: add Vectara destination connector (#2357)
Thanks to Ofer at Vectara, we now have a Vectara destination connector.

- There are no dependencies, since it is all REST calls to the API.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-02-01 14:38:34 +00:00
Yao You
97fb10db4a
fix: default hi_res model rely on inference setting (#2441)
- there are multiple places setting the default `hi_res_model_name` in
both `unstructured` and `unstructured-inference`
- these lead to inconsistency and unexpected behaviors
- this fix removes a helper in `unstructured` that tries to set the
default hi_res layout detection model; instead we rely on
`unstructured-inference` to provide that default when no explicit model
name is passed in

## test

```bash
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true ipython
```

```python
from unstructured.partition.auto import partition

# find a pdf file
elements = partition("foo.pdf", strategy="hi_res")
assert elements[0].metadata.detection_origin == "yolox"
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2024-01-29 16:44:41 +00:00
David Potter
74dcca44ca
fix: link_texts was breaking postgres destination connector (#2460)
The formatting of link_texts was breaking metadata storage. It turns out it
didn't need any conforming and came in correctly from JSON.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-27 04:29:38 +00:00
ryannikolaidis
2e97494613
fix: fsspec connectors returning data source version as integer (#2427)
Connector data source versions should always be string values; however,
we were using the integer checksum value as the version for fsspec
connectors. This casts that value to a string.

## Changes

* Casts the checksum value to a string when assigning the version value
for fsspec connectors.
* Adds a test to validate that these connectors assign a string value
when an integer checksum is fetched.

## Testing

Unit test added.
2024-01-19 15:58:01 +00:00
David Potter
bc791d53f4
feat: add opensearch source and destination connector (#2349)
Adds OpenSearch as a source and destination.

Since OpenSearch is a fork of Elasticsearch, these connectors rely
heavily on inheriting the Elasticsearch connectors whenever possible.

- Adds OpenSearch source connector to be able to ingest documents from
OpenSearch.
- Adds OpenSearch destination connector to be able to ingest documents
from any supported source, embed them and write the embeddings /
documents into OpenSearch.
- Defines an example unstructured elements schema for users to be able
to setup their unstructured OpenSearch indexes easily.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-17 04:31:49 +00:00
David Potter
d7f4c24e21
fix documentation for chroma (#2403)
To test:

cd docs && make html

Changelogs:

- point the main README to the correct connector HTML page
- point the Chroma docs to the correct sample code

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-17 01:53:52 +00:00
Steve Canny
fcc919b9f5
rfctr(chunking): add chunking arg constants (#2408)
There are several public interface points for chunking and they all
provide a default for arguments like `max_characters`. These defaults are
provided by literal values. Keeping these synchronized has become a
problem.

Declare constant values for chunking argument defaults and use
those wherever a non-trivial default is used in an end-user-facing API
function.
2024-01-16 21:48:36 +00:00
David Potter
76e0d10e61
feat: add MongoDB source connector (#2393)
Adds MongoDB as a source (we already had it as a destination connector)

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-16 20:56:29 +00:00
jakub-sandomierz-deepsense-ai
ee0441efea
enhancement: normalize Salesforce artifact extensions (#2402)
Connectors use a predictable result-file naming convention so that consumers
of the library can write code abstracted from any particular connector.

This change introduces compatibility with that naming convention.

`_output_filename` now returns the filename with its format.
2024-01-16 10:36:00 +00:00
Christine Straub
ee06260987
feat: keep all image elements when using hi_res strategy. (#2382)
### Summary
The goal of this PR is to keep all image elements when using the "hi_res"
strategy. Previously, `Image` elements with small chunks of text were
ignored unless the image block extraction parameters
(`extract_images_in_pdf` or `extract_image_block_types`) were specified.
Now, all image elements are kept regardless of whether the image block
extraction parameters are specified.

### Testing
- on `main` branch,
```
elements = partition_pdf(
    filename="example-docs/embedded-images.pdf",
    strategy="hi_res",
)
image_elements = [el for el in elements if el.category == ElementType.IMAGE]
print("number of image elements: ", len(image_elements))
```
The above code will display `number of image elements: 0`. 

- on this `feature` branch,

The same code will display `number of image elements: 3`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-01-15 23:19:17 +00:00
ryannikolaidis
2ce829ddd0
test: update test Elasticsearch mappings to validate embedding search (#2397)
Currently, the Elasticsearch destination ingest test writes
the embeddings to a "float"-type field. To leverage this field
for similarity search, it should be mapped as "dense_vector" with the
respective dimensions assigned.

This PR updates that mapping and adds a test query to validate that this
works as expected.
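
The change amounts to a mapping along these lines (the field name and dimensionality here are assumptions for illustration):

```python
# hedged sketch of an Elasticsearch mapping using dense_vector for embeddings
index_mappings = {
    "properties": {
        "embeddings": {           # hypothetical field name used by the test index
            "type": "dense_vector",
            "dims": 384,          # must match the embedding model's output dimensionality
        }
    }
}
```
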
2024-01-14 19:27:56 +00:00
Steve Canny
2f2c48acd5
feat(ingest): add basic chunking to ingest (#2380)
The new "basic" chunking strategy and overlap options need to be
available from the ingest CLI. An ingest test of those features is also
welcome, both to verify the ingest feature and to defend against
regressions in the chunking code.

Add a local ingest test exercising both the "basic" chunking strategy
and intra-chunk overlap. Since there is no new source connector
involved, use the local ingest source and destination. Update
documentation to suit, filling in some details that hadn't made it into
the docs yet.
2024-01-12 20:27:34 +00:00
Ahmet Melek
50f142d4e0
chore(ingest): update pinecone index creation specifications (#2389)
This PR updates Pinecone index creation in the ingest test due to a
recent update to the Pinecone API.

Due to that change, it is no longer allowed to specify both the number of
replicas and the number of pods:
`Cannot specify both replicas and pods`

We solve this by removing the replica specification from the
index creation request.

```
Creating index ingest-test-28418
Index creation success: 201
```
2024-01-12 02:49:09 +00:00
jakub-sandomierz-deepsense-ai
411aa98bbf
feat: Salesforce connector accepts key path or value (#2321) (#2327)
Solution to issue
https://github.com/Unstructured-IO/unstructured/issues/2321.

The simple_salesforce API allows passing a private key path or value. This
PR introduces that support for the Ingest connector.

The Salesforce parameter "private-key-file" has been renamed to
"private-key".
It can contain one of the following:
- a path to a PEM-encoded key file (as a string)
- the key contents (a PEM-encoded string)

If the provided value cannot be parsed as a PEM-encoded private key, then
file existence is checked. This way, private key contents are not
exposed to unnecessary underlying function calls.
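
A hedged sketch of the described resolution order (not the connector's actual code):

```python
from pathlib import Path

def resolve_private_key(private_key: str) -> str:
    """Return PEM contents, reading from disk only when the value is not itself a key."""
    if "-----BEGIN" in private_key:      # heuristic: looks like PEM-encoded key contents
        return private_key
    path = Path(private_key)
    if path.is_file():                   # otherwise treat the value as a path to a PEM file
        return path.read_text()
    raise ValueError("private-key is neither PEM contents nor an existing file path")
```
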
2024-01-11 11:15:24 +00:00
jakub-sandomierz-deepsense-ai
5581e6a4c4
fix: Ingest GCS accepts JSON auth token (#2322) (#2371)
FSSpec serialization caused the JSON token to be converted to a string with
single quotes. GCS requires the JSON token in the form of a dict, so this
format is now ensured. Other forms of auth are not modified, but there is
improved validation for all of the options.
2024-01-11 09:03:47 +00:00
Roman Isecke
8dc130c920
fix: ensure consistency in method signatures across destination connectors (#2381)
### Description
* Make sure all destination connectors implement the base abstract
methods using the same signatures.
* Also leverage the conform-dict logic in the base methods to make sure it's
called in a consistent fashion.
* Additional updates to move the common code into the base destination
connector class.
2024-01-11 00:19:49 +00:00
Roman Isecke
22c0bad246
bug: weaviate serialization broken (#2378)
### Description
This PR handles two things:
* Fixes serialization of the Weaviate destination connector, since
the client content, when present, breaks serialization with `TypeError:
cannot pickle '_thread.lock' object`.
* Sets finer-grained auth control, rather than a generic dictionary, on the
CLI and access config.
2024-01-10 17:22:37 +00:00
Roman Isecke
b37b4689bc
drop python3.8 (#2372)
### Description
Remove all uses of python3.8

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-01-09 23:37:30 +00:00
jakub-sandomierz-deepsense-ai
0ca154a0f3
Fix: MongoDB connector URI password redaction, basic unit tests for Git connector (#2268)
MongoDB connector:
Issue:
[MongoDB
documentation](https://www.mongodb.com/docs/manual/reference/connection-string/)
states that the characters `$ : / ? # [ ] @` must be percent-encoded. A URI
whose password contains such a special character will not be redacted.

Fix:
This fix removes the usage of `unquote_plus` on the password, which allows
the detected password to match the one inside the URI and be successfully
replaced (see the sketch below).
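
A hedged illustration of why keeping the password percent-encoded makes redaction work (not the connector's actual code):

```python
from urllib.parse import urlsplit

def redact_mongodb_uri(uri: str) -> str:
    # urlsplit().password is returned still percent-encoded, so it matches the raw URI text
    password = urlsplit(uri).password
    return uri.replace(password, "***") if password else uri

print(redact_mongodb_uri("mongodb://user:p%40ssw0rd@host:27017/db"))
# mongodb://user:***@host:27017/db
```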

Git connector:
Added very basic unit tests for the repository filtering methods. Their
impact is rather minimal, but they showcase a current limitation of the
`is_file_type_supported` method.
2024-01-08 11:27:08 +00:00
Ahmet Melek
d6674ba27e
chore: update ingest azure cognitive search endpoint (#2353)
This PR:
- updates the ingest Azure Cognitive Search destination connector test to
move to a new service.
- changes the response parsing logic in the test.
2024-01-05 05:26:12 +00:00