334 Commits

Author SHA1 Message Date
Roman Isecke
54ec311c55
feat/migrate onedrive src (#3295)
### Description
Migrate the onedrive source connector to v2, adding in more rich content
pulled from the response of the SDK to add further metadata to the
FIleData produced by the indexer.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-26 23:59:51 +00:00
Matt Robinson
6939bff49e
build(deps): bump langchain-community version (#3305)
### Summary

Bumps to the latest `langchain-community` version to resolve
[CVE-2024-2965](https://nvd.nist.gov/vuln/detail/CVE-2024-2965).

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2024-06-26 22:42:32 +00:00
Roman Isecke
3f581e6b7d
feat/migrate gdrive source connector (#3239)
### Description
Migrate the google drive source connector over to the new v2 ingest
framework and include a variety of improvements as part of the refactor:
* The ID is no longer limited to a drive id but can also be the id of a
subfolder within a drive or a file directly and each case is handled
appropriately
* More metadata is pulled in from google drive to enrich the partitioned
elements downstream and now the modified date is being set to not
reprocess if the ingest pipeline already has the file cached
* timing information is set on the file created when downloaded based on
the last modified data retrieved from google drive

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-25 12:55:28 +00:00
Roman Isecke
e0f4374386
Roman/bugfix conflicting event loop ingest (#3264)
### Description
In use cases where an external system (such as code being run in a
jupyter notebook) already has a running event loop, run the async code
in a dedicated thread pool to not conflict with the existing event loop.

This also has a variety of fixes that were found when putting together a
demo leveraging the elasticsearch destination connector
2024-06-24 18:47:37 +00:00
David Potter
8610bd3ab9
feat: Kafka source and destination connector (#3176)
Thanks to @tullytim we have a new Kafka source and destination
connector. It also works with hosted Kafka via Confluent.

Documentation will be added to the Docs repo.
2024-06-22 23:26:23 +00:00
Christine Straub
f23d180d34
fix: docker image publishing error (#3238)
This PR aims to fix a docker image publishing error caused by user
changes when pulling the `amd64` image from the `unstructured`
`wolfi-base` image.
(https://github.com/Unstructured-IO/unstructured/pull/3213).

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-06-18 21:01:42 +00:00
Roman Isecke
fd98cf9ea5
Roman/migrate es dest (#3224)
### Description
Migrate elasticsearch destination connector to new v2 ingest framework
2024-06-18 14:20:49 +00:00
Roman Isecke
d876a386ed
Roman/fix ingest async connectors (#3210)
### Description
Choosing to use async needs to be very careful because if a connector is
set to use async, the pipeline will not fan out the inputs via
multiprocessing but instead it will be limited to run in a single
process under the assumption it has more benefit from async due to heavy
network traffic. This means the exact same code that is not optimized
for async and is blocking will force the pipeline to perform worse than
simply never marking the connector to use async since the pipeline will
fan that out using multiprocessing.

All connectors and processes in the pipeline we revisited to make sure
this criteria was met and updated accordingly:
* Currently the unstructured client does not support making requests
async, so this was moved over to use multiprocessing
* fsspec connector was updated to use the async client from the fsspec
library. This also required that the client be a `@property` fetched on
demand, otherwise the client would break the multiprocessing pool since
it maintains a thread lock and that can't be pickled when the fsspec
connector doesn't support async.
* elasticsearch was also updated to use the async client
* weaviate only recently came out with async support in their SDK at a
version that is higher than we can use in the open source repo, so a
TODO was left but otherwise moved to use multiprocessing
* all underlying embedders don't use async to embedder step must be
multiprocessing for now. TODO left to update underlying embedder classes
to optionally support async.
* Chunking parameters were not accurately being passed through from cli
to chunker params, this was fixed

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-17 16:55:19 +00:00
Steve Canny
9fae0111d9
rfctr(html): drop HTML-specific elements (#3207)
**Summary**
Remove HTML-specific element types and return "regular" elements like
`Title` and `NarrativeText` from `partition_html()`.

**Additional Context**
- An aspect of the legacy HTML partitioner was the use of HTML-specific
element types used to track metadata during partitioning.
- That role is no longer necessary or desireable.
- HTML-specific elements like `HTMLTitle` and `HTMLNarrativeText` were
returned from partitioning HTML but also the seven other file-formats
that broker partitioning to HTML (convert-to-HTML and partition_html()).
This does not cause immediate breakage because these are still `Text`
element subtypes, but it produces a confusing developer experience.
- Remove the prior metadata roles from HTML-specific elements and remove
those element types entirely.
2024-06-15 00:14:22 +00:00
Christine Straub
9552fbbfbf
chore: bump unstructured-inference 0.7.35 (#3205)
### Summary
- bump unstructured-inference to `0.7.35` which fixed syntax for
generated HTML tables
- update unit tests and ingest test fixtures to reflect changes in the
generated HTML tables
- cut a release for `0.14.6`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-06-14 18:11:38 +00:00
ryannikolaidis
da3492b529
fix: dropbox source connector file path bugs (#3189)
The Dropbox source connector currently raises exceptions when indexing
files due to two issues: a path formatting idiosyncrasy of the Dropbox
library and a divergence in the definition of the Dropbox libraries
fs.info method, expecting a 'url' parameter rather than 'path'.

## Changes

* add a `/` prefix to file path used by DropboxIndexer
* override the fsspec sterilize_info method in DropboxIndexer to call
`self.fs.info` with `url` rather than `path`; to accommodate for the
fact that `dropboxdrivefs` diverges with this signature
* remove `dropbox.sh` from ignored source tests
* update test fixtures (now that the dropbox connector has been fixed
and not skipped)

## Testing
`dropbox.sh` source ingest test now succeeds (and is no longer ignored)

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-06-13 18:06:41 +00:00
Roman Isecke
f7b0a37c86
Feat/migrate elasticsearch src connector (#3174)
### Description
Migrate elasticsearch connector with support for what used to be batch
ingest docs but not it support for the download step to generate
additional file data.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-13 17:57:59 +00:00
ryannikolaidis
17bc55e7be
fix: relative path / permissions issues with v2 fsspec connectors (#3186)
When the v2 fsspec connectors currently generate the relative path, they
may introduce a path with a leading slash (this happens in the case of
the Box connector, which is a subclass of fsspec). When this happens
this results in the paths unintentionally being treated as absolute
paths. As a result, the ingest pipeline attempts to write files to
directories at root level, which in turn raises permission issues.

Note: Box expected results needed to update now that it's no longer
failing.

Aside: found that our tests were unintentionally skipping `box.sh` tests
because we were intending to skip `dropbox.sh` and we use regex to match
if a given test is in skip tests. This adds changes to force an exact
match.

## Changes

* Strip leading slashes during the creating of relative paths in fsspec
connectors
* Add expected results for Box connector
* (bonus): `make tidy` altered an unrelated file by removing an
unnecessary call of `pass`
* (bonus): check exact match for skipped ingest tests which fixes Box
tests getting skipped

## Testing


[Tests](https://github.com/Unstructured-IO/unstructured/actions/runs/9461928289/job/26093475612#step:7:2085)
for the Box connector was failing. It was accidentally getting skipped
(see changes above). It is now no longer skipped and passing.
2024-06-12 03:39:35 +00:00
Roman Isecke
b777864296
feat: Migrate over fsspec connectors (#3066)
### Description
Move over all fsspec connectors to the new framework

Variety of bug fixes found and fixed in this PR as well:
* custom json mixin being used for the enhanced dataclass would break if
typing was quoted. That was fixed. A check was also added to the
enhanced dataclass to prevent `InitVar` from being used in the root
dataclass since this breaks serialization.
* hashing for partitioner was using the filename of the raw file being
partitioned rather than the file name of the file data generated from
indexing. This means that mutliple files could result in the same
partition hash when recursive flag is passed in. This was updated to use
the file data file name instead.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-05 19:12:06 +00:00
Matt Robinson
865ef496e6
ci: update pinecone test to use serverless (#3127)
### Summary

Closes #3068. Updates the Pinecone connector tests to use serverless
indexes, per the documentation
[here](https://docs.pinecone.io/reference/api/control-plane/create_index).
Also updates the CHANGELOG to mention serverless. Turns out we already
supported it with the client version bump, but it hadn't been tested
yet.

### Testing

See [this CI
job](https://github.com/Unstructured-IO/unstructured/actions/runs/9319836670/job/25655322433?pr=3127)
that passed, running only the Pinecone test.
2024-05-31 15:24:41 +00:00
ryannikolaidis
1f8768750c
chore: add auth to s3 destination test (#3122)
We should be validating the S3 Destination with authenticated requests,
with credentials from a limited test user.

## Changes

- Updates s3 destination test to point to a bucket that requires
authentication.
- Adds authentication to the s3 destination test request
- Bonus: fix deserialization of S3ConnectionConfig for s3 V2 destination
- Bonus: fix S3ConnectionConfig never registered for s3 V2 destination
- Bonus: repair version and changelog version for consistency with -dev
convention

## Testing
Validated by changes to S3 destination ingest test
2024-05-31 07:05:09 +00:00
ryannikolaidis
6b5d8a9785
fix: revert dropping of filename extension for some connectors (#3109)
V2 refactor of ingest code introduces the removal of original file
extensions. Since the upgrade of connectors is incomplete this means
that some connectors will remove the original file extension and some
will not. Still TBD whether this is actually something we want at all.

This PR reverts specifically that change in the V2 ingest code so that
original file extension is preserved downstream.

## Testing
CI is passing with filenames updated via `Ingest Test Fixtures Update`
workflow.

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2024-05-29 19:14:22 +00:00
Matt Robinson
3158169585
fix: uninstall bson for mongo connector (#3104)
### Summary

Closes #3049. Reenables the MongoDB connector test, which was disabled
previously in #3047 due to incompatibility between the `pymongo` and the
`bson` package from `pip`, which is a dependency for the Astra
connector. Per the `pymongo` docs below, `pymongo` ships with its own
version of `bson` and installing `bson` from `pip` breaks `pymongo`.

- https://pymongo.readthedocs.io/en/stable/installation.html

### Testing

Ingest tests ran successfully for the [source
connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512636315)
and the [destination
connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512635546).
2024-05-28 17:45:18 +00:00
Matt Robinson
6b400b46fe
feat: add VoyageAI embeddings (#3069) (#3099)
Original PR was #3069. Merged in to a feature branch to fix dependency
and linting issues. Application code changes from the original PR were
already reviewed and approved.

------------
Original PR description:
Adding VoyageAI embeddings 
Voyage AI’s embedding models and rerankers are state-of-the-art in
retrieval accuracy.

---------

Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>
2024-05-24 21:48:35 +00:00
Yao You
32df4ee1c6
fix: disable table_as_cells output by default (#3093)
This PR changes the output of table elements: now by default the table
elements' `metadata.table_as_cells` is `None`. The data will only be
populated when the env `EXTRACT_TABLE_AS_CELLS` is set to `true`.

The original design of the `table_as_cells` is for evaluate table
extraction performance. The format itself is not as readable as the
`table_as_html` metadata for human or RAG consumption. Therefore by
default this data is not needed.

Since this output is meant for evaluation use this PR choose to use an
environment variable to control if it should be present in the
partitioned results. This approach avoids adding parameters to the
`partition` function call. Adding a new parameter to the `partition`
interface increases the complexity of the interface and adds more
maintenance cost since there is a long chain of function calls to pass
down this parameter to where it is needed.

## test

running the following code snippet on main vs. this PR

```python
from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper-with-table.pdf", strategy="hi_res", skip_infer_table_types=[])
table_cells = [element.metadata.table_as_cells, None) for element in elements if element.category == "Table"]
```

on main branch `table_cells` contains cell structured data but on this
branch it is a list of `None`

However if we first set in terminal:

```bash
export EXTRACT_TABLE_AS_CELLS=true
```

then run the same code again with this PR the `table_cells` would
contain actual data, the same as on main branch.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2024-05-24 16:41:25 +00:00
Christine Straub
35ec21ecd0
fix: decide table extraction (#3090)
This PR aims to add backward compatibility for the deprecated
`pdf_infer_table_structure` parameter. A missing part of turning table
extraction for PDFs and Images off by default in
https://github.com/Unstructured-IO/unstructured/pull/3035, which was
turned on in https://github.com/Unstructured-IO/unstructured/pull/2588.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-05-23 20:37:15 +00:00
Roman Isecke
3eaf65a8c1
feat: refactor ingest (#3009)
### Description
This refactors the current ingest CLI process to support better
granularity in how the steps are ran
* Both multiprocessing and async now supported. Given that a lot of the
steps are IO-bound, such as downloading and uploading content, we can
achieve better parallelization by using async here
* Destination step broken up into a stager step and an upload step. This
will allow for steps that require manipulation of the data between
formats, such as converting the elements json into a csv format to
upload for tabular destinations, to be pulled out of the step that does
the actual upload.
* The process of writing the content to a local destination was now
pulled out as it's own dedicated destination connector, meaning you no
longer need to persist the content locally once the process is done if
the content was uploaded elsewhere.
* Quick update to the chunker/partition step to use the python client.
* Move the uncompress suppport as a pipeline step since this can
arbitrarily apply to any concrete files that have been downloaded,
regardless of where they came from.
* Leverage last modified date to mark files to be reprocessed, even if
the file already exists locally.

### Callouts
Retry configs haven't been moved over yet. This is an open question
because the intent was for it to wrap potential connection errors but
now any of the other steps that leverage an API might run into network
connection issues. Should those be isolated in each of the steps and
wrapped with the same retry configs? Or do we need to expose a unique
retry config for each step? This would bloat the input params even more.

### Testing
* If you want to run the new code as an SDK, there's an example file
that was added to highlight how to do that:
[example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py)
* If you want to run the new code as an isolated CLI:
```shell
PYTHONPATH=. python unstructured/ingest/v2/main.py --help
```
* If you want to see which commands have been migrated to the new
version, there's now a `v2` short help text next to those commands when
running the current cli:
```shell
PYTHONPATH=. python unstructured/ingest/main.py --help
Usage: main.py [OPTIONS] COMMAND [ARGS]...main.py --help   

Options:
  --help  Show this message and exit.

Commands:
  airtable
  azure
  biomed
  box
  confluence
  delta-table
  discord
  dropbox
  elasticsearch
  fsspec
  gcs
  github
  gitlab
  google-drive
  hubspot
  jira
  local          v2
  mongodb
  notion
  onedrive
  opensearch
  outlook
  reddit
  s3             v2
  salesforce
  sftp
  sharepoint
  slack
  wikipedia
```

You can run any of the local or s3 specific ingest tests and these
should now work.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-05-21 17:01:49 +00:00
Matt Robinson
d7608014c0
improve: add Python 3.12 support (#3033) (#3047)
### Summary

Closes #2959. Updates the dependency and CI to add support for Python
3.12.

The MongoDB ingest tests were disabled due to jobs like [this
one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333)
failing due to issues with the `bson` package. `bson` is a dependency
for the AstraDB connector, but `pymongo` does not work when `bson` is
installed from `pip`. This issue is documented by MongoDB
[here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun
off #3049 to resolve this. Issue seems unrelated to Python 3.12, though
unsure why this didn't surface previously.

Disables the `argilla` tests because `argilla` does not yet support
Python 3.12. We can add the `argilla` tests back in once the PR
references below is merged. You can still use the `stage_for_argilla`
function if you're on `python<3.12` and you install `argilla` yourself.
- https://github.com/argilla-io/argilla/pull/4837

---------

Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
2024-05-19 23:03:15 +00:00
Matt Robinson
ec987dcbb2
BREAKING CHANGE: revert table extraction off by default for PDFs and images (#3035)
### Summary

Closes #3021 . Turns table extraction for PDFs and images off by
default. The default behavior originally changed in #2588 . The reason
for reversion is that some users did not realize turning off table
extraction was an option and experience long processing times for PDFs
and images with the new default behavior.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2024-05-17 15:28:11 +00:00
David Potter
df8d39a4d4
fix: allow AstraDB to prevent indexing on metadata columns with long text (#3003)
Thanks to @erichare from AstraDB
Adds support for specifying the indexing options for various columns in
Astra DB, allowing users to avoid a situation where long text columns
are by-default indexed.

Changes to: test_unstructured_ingest/python/test-ingest-astra-output.py
are forward looking from AstraDB
2024-05-17 04:12:37 +00:00
Yuming Long
542d442699
chore CORE-4775: remove html page number metadata field (#2942)
### Summary

Rip off page_number metadata fields until we have page counting for all
kinds of html files (not just limited to news articles with multiple
`<article>` tag)

### Test
Unit tests
`test_add_chunking_strategy_on_partition_html_respects_multipage` and
`test_add_chunking_strategy_title_on_partition_auto_respects_multipage`
removed since they relay on the `page_number` fields from the SEC html
file - now test moved to mock test for chunk_by_title -> revisit those
tests when we find test file for this

Also changed the element ids from partition outputs for html files -
element id change due to page number change (in element id hashing) ->
todo ticket: update other deterministic element id tests per crag's
comment

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2024-04-30 15:20:26 +00:00
Pluto
df1f7bcd0e
Save table prediction in cells format (#2892)
This pull request allows to return predictions in raw cell
representation from table transformer. It will be later used to save
prediction in a cells format for simpler metrics calculation.

This PR has to be merged, after
https://github.com/Unstructured-IO/unstructured-inference/pull/335
2024-04-25 11:14:48 +00:00
John
3843af666e
feat: Enable remote chunking via unstructured-ingest (#2905)
Update: The cli shell script works when sending documents to the free
api, but the paid api is down, so waiting to test against it.

- The first commit adds docstrings and fixes type hints.
- The second commit reorganizes `test_unstructured_ingest` so it matches
the structure of `unstructured/ingest`.
- The third commit contains the primary changes for this PR.
- The `.chunk()` method responsible for sending elements to the correct
method is moved from `ChunkingConfig` to `Chunker` so that
`ChunkingConfig` acts as a config object instead of containing
implementation logic. `Chunker.chunk()` also now takes a json file
instead of a list of elements. This is done to avoid redundant
serialization if the file is to be sent to the api for chunking.

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
2024-04-25 00:24:58 +00:00
Michał Martyniak
2d1923ac7e
Better element IDs - deterministic and document-unique hashes (#2673)
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more test for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)

This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
2024-04-24 00:05:20 -07:00
Michał Martyniak
001fa17c86
Preparing the foundation for better element IDs (#2842)
Part one of the issue described here:
https://github.com/Unstructured-IO/unstructured/issues/2461

It does not change how hashing algorithm works, just reworks how ids are
assigned:
> Element ID Design Principles
> 
> 1. A partitioning function can assign only one of two available ID
types to a returned element: a hash or UUID.
> 2. All elements that are returned come with an ID, which is never
None.
> 3. No matter which type of ID is used, it will always be in string
format.
> 4. Partitioning a document returns elements with hashes as their
default IDs.

Big thanks to @scanny for explaining the current design and suggesting
ways to do it right, especially with chunking.


Here's the next PR in line:
https://github.com/Unstructured-IO/unstructured/pull/2673

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>
2024-04-16 21:14:53 +00:00
Christine Straub
ba3f374268
Fix: ingest test fixtures update pr (#2881)
This PR aims to update "Ingest Test Fixtures Update PR" CI to update the
ingest test fixtures only if the OVERWRITE_FIXTURES variable is not
`false` and the OUTPUT_DIR directory is not empty.
2024-04-15 17:47:22 +00:00
MiXiBo
0506aff788
add support for start_index in html links extraction (#2600)
add support for start_index in html links extraction (closes #2625)

Testing
```
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json


html_text = """<html>
        <p>Hello there I am a <a href="/link">very important link!</a></p>
        <p>Here is a list of my favorite things</p>
        <ul>
            <li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li>
            <li>Dogs</li>
        </ul>
        <a href="/loner">A lone link!</a>
    </html>"""

elements = partition_html(text=html_text)
print(elements_to_json(elements))
```

---------

Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-04-12 06:14:20 +00:00
Ahmet Melek
6fd29ea77c
fix: collection deletion for AstraDB test (#2869)
This PR:
- Fixes occasional collection deletion failures for AstraDB via putting
collection deletion statements inside a trap statement. It uses click
commands to do this.

Testing:
- Run ingest astradb destination test
2024-04-10 23:08:24 +00:00
Ahmet Melek
d46792214a
feat: add vertexai embeddings (#2693)
This PR:
- Adds VertexAI embeddings as an embedding provider

Testing
- Tested with pinecone destination connector on
[this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693)
job run.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2024-03-28 21:15:36 +00:00
David Potter
c8cf8f31ac
bug CORE-4225: mongodb url bug (#2662)
The mongodb redact method was created because we wanted part of the url
to be exposed to the user during logging. Thus it did not use the
dataclass `enhanced_field(sensitive=True)` solution.

This changes it to use our standard redacted solution. This also
minimizes the amount of work to be done in platform.
2024-03-28 18:38:50 +00:00
Steve Canny
9ae838e50a
feat: add --include-orig-elements option to Ingest CLI (#2687)
**Summary**
Add an `--include-orig-elements` option to the Ingest CLI to allow users
to specify that corresponding new chunking parameter.

**Reviewer** A lot of this is cleanup, the second commit is where the
actual adding of this option are. The first commit fixes a number of
inaccuracies in the documentation and does some other clean-up.

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-03-27 06:35:01 +00:00
Christine Straub
08fafc564f
Fix: embedded text not getting merged with inferred elements (#2679)
This PR is the second part of fixing "embedded text not getting merged
with inferred elements", the first part is done in
https://github.com/Unstructured-IO/unstructured-inference/pull/331.

### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when removing pdfminer (embedded) elements that were merged with
inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
introduced in the [first
part](https://github.com/Unstructured-IO/unstructured-inference/pull/331)
when removing pdfminer (embedded) elements that were merged with
inferred elements
- bump `unstructured-inference` to 0.7.25

### Testing
PDF:
[pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)

```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```

```
elements = partition_pdf(
    filename="pwc-financial-statements-p114.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image"],
)

table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)
```

---------

Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-03-23 03:59:23 +00:00
Steve Canny
56fbaaed10
feat(chunking): add metadata.orig_elements serde (#2680)
**Summary**
This final PR in the "orig_elements" series adds the needful such that
`.metadata.orig_elements`, when present on a chunk (element), is
serialized to JSON when the chunk is serialized, for instance, to be
used in an HTTP response payload.

It also provides for deserializing such a JSON payload into chunks that
contain the `.orig_elements` metadata.

**Additional Context**
Note that `.metadata.orig_elements` is always `Optional[list[Element]]`
when in memory. However, those original elements are serialized as
Base64-encoded gzipped JSON and are in that form (str) when present as
JSON or as "element-dicts" which is an intermediate
serialization/deserialization format. That is, serialization is `Element
-> dict -> JSON` and deserialization is `JSON -> dict -> Element` and
`.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms.

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-03-22 21:53:26 +00:00
Filip Knefel
bdfd975115
chore: change table extraction defaults (#2588)
Change default values for table extraction - works in pair with
[this](https://github.com/Unstructured-IO/unstructured-api/pull/370)
`unstructured-api` PR

We want to move away from `pdf_infer_table_structure` parameter, in this
PR:
- We change how it's treated wrt `skip_infer_table_types` parameter.
Whether to extract tables from pdf now follows from the rule:
`pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set it to `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]` by default
- We remove it from the examples in documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in
documentation

More detailed description of how we want parameters to interact
- if `pdf_infer_table_structure` is False tables will never extracted
from pdf
- if `pdf_infer_table_structure` is True tables will be extracted from
pdf unless it's skipped via `skip_infer_table_types`
- on default `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-22 10:08:49 +00:00
Roman Isecke
4ff6a5b78e
Roman/bugfix support bedrock embeddings (#2650)
### Description
This PR resolved the following open issue:
[bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319).
To do so, the following changes were made:
* All aws configs were added as input parameters to the CLI
* These were mapped to the bedrock embedder when an embedder is
generated via `get_embedder`
* An ingest test was added to call the aws bedrock service
* Requirements for boto were bumped because the first version to
introduce the bedrock runtime, which is required to hit the bedrock
service, was introduced in version `1.34.63`, which was ahead of the
version of boto pinned.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-03-21 18:21:04 +00:00
David Potter
9177aa20a8
feature CORE-3985: add Clarifai destination connector (#2633)
Thanks to @mogith-pn from Clarifai we have a new destination connector!

This PR intends to add Clarifai as a ingest destination connector.

Access via CLI and programmatic.
Documentation and Examples.
Integration test script.
2024-03-21 16:36:21 +00:00
Steve Canny
31bef433ad
rfctr: prepare to add orig_elements serde (#2668)
**Summary**
The serialization and deserialization (serde) of
`metadata.orig_elements` will be located in `unstructured.staging.base`
alongside `elements_to_json()` and other existing serde functions.
Improve the typing, readability, and structure of that module before
adding the new serde functions for `metadata.orig_elements`.

**Reviewers:** The commits are well-groomed and are probably quicker to
review commit-by-commit than as all files-changed at once.
2024-03-20 21:27:59 +00:00
David Potter
5b92e0bb6b
bug CORE-4089: Onedrive partitioning fails - datetime formatting error (#2638)
Fixes Onedrive bug the same way Ryan fixed the Sharepoint error. (both
are microsoft products)
https://github.com/Unstructured-IO/unstructured/pull/2591
https://github.com/Unstructured-IO/unstructured/pull/2592/files

We are seeing occurrences of inconsistency in the timestamps returned by
Onedrive when fetching created and modified dates. Furthermore, in
future versions of this library, a datetime object will be returned
rather than a string.

Changes
This adds logic to guarantee Onedrive dates will be properly formatted
as ISO, regardless of the format provided by the onedrive library.
Bumps timestamp format output to include timezone offset (as we do with
others)

Adds unit tests for isofomat.

json_to_dict already unit tested here:

https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py

Adds small change for AstraDB to allow them to see what source called
their api
2024-03-15 14:01:05 +00:00
Steve Canny
8ea203adf7
feat(chunking): composite text gets is_continuation (#2639)
**Summary**
Add `metadata.is_continuation = True` to metadata of second-and-later
text-split chunks formed from an oversized non-table element. Previously
this metadata was only present on text-split `TableChunk` elements.

This enables downstream filtering of intentionally redundant metadata on
chunk elements that may not be desired for all purposes.

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-03-12 19:44:41 +00:00
David Potter
1ca90d209a
bug: update sharepoint-with-permissions test to fix CI (#2589)
Adding `metadata.data_source.permissions_data` to
sharepoint-with-permissions.sh --metadata-exclude to prevent sharepoint
deprecation warning from ruining test.

Updating expected-structured-output

As per Ahmet's comment. We do want to check sharepoint permissions
metadata at some point. But that will take a separate type of test. A
file diff test is too unstable. Permissions checking will be later down
the road.
2024-03-06 17:15:36 +00:00
David Potter
43250d5576
bug CORE-3971: fix deserialization in google-drive source connector key path (#2586)
Google Drive Service account key can be a dict or a file path(str)

We have successfully been using the path. But the dict can also end up
being stored as a string that needs to be deserialized. The
deserialization can have issues with single and double quotes.
2024-03-03 15:30:35 +00:00
ryannikolaidis
71d5d513ef
fix: handling of varied SharePoint date formats (#2591)
We are seeing occurrences of inconsistency in the timestamps returned by
office365.sharepoint when fetching created and modified dates.
Furthermore, in future versions of this library, a datetime object will
be returned rather than a string.

## Changes

- This adds logic to guarantee SharePoint dates will be properly
formatted as ISO, regardless of the format provided by the sharepoint
library.
- Bumps timestamp format output to include timezone offset (as we do
with others)

## Testing

Unit test added to validate this datetime handling across various
formats.

---------

Co-authored-by: David Potter <potterdavidm@gmail.com>
2024-02-28 16:11:53 +00:00
David Potter
e8ec09c8b9
feat: astra dest connector (#2571)
Thanks to Eric Hare @erichare at DataStax we have a new destination
connector.

This Pull Request implements an integration with [Astra
DB](https://datastax.com) which allows for the Astra DB Vector Database
to be compatible with Unstructured's set of integrations.

To create your Astra account and authenticate with your
`ASTRA_DB_APPLICATION_TOKEN`, and `ASTRA_DB_API_ENDPOINT`, follow these
steps:

1. Create an account at https://astra.datastax.com
2. Login and create a new database
3. From the database page, in the right hand panel, you will find your
API Endpoint
4. Beneath that, you can create a Token to be used

Some notes about Astra DB:

- Astra DB is a Vector Database which allows for high-performance
database transactions, and enables modern GenAI apps [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html)
- It supports similarity search via a number of methods [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics)
- It also supports non-vector tables / collections
2024-02-23 20:50:50 +00:00
Matt Robinson
b4d9ad8130
enhancement: detect headers in partition_pdf with fast strategy (#2455)
### Summary

Detects headers and footers when using `partition_pdf` with the fast
strategy. Identifies elements that are positioned in the top or bottom
5% of the page as headers or footers. If no coordinate information is
available, an element won't be detected as a header or footer.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2024-02-23 16:56:09 +00:00
Pawel Kmiecik
ff9d46f9dc
feat(eval): table evaluation metrics (#2558)
This PR adds new table evaluation metrics prepared by @leah1985 
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows

TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`
2024-02-22 16:35:46 +00:00