365 Commits

Author SHA1 Message Date
Steve Canny
cd074bb32b
chore(file): remove dead code (#3645)
**Summary**
Remove dead code in `unstructured.file_utils`.

**Additional Context**
These modules were added in 12/2022 and 1/2023 and are not referenced by
any code. Removing to reduce unnecessary complexity. These can of course
be recovered from Git history if we decide we want them again in future.
2024-09-19 06:45:33 +00:00
Yao You
22998354db
add requirements files to ingest cache hash key (#3641)
This PR adds the requirement files for base and extras for the ingest
cache's hash key.

- The current workflow uses only the ingest requirements to generate the
hash key for the GitHub Actions cache.
- Sometimes only base or extra requirements (like extra-pdf.txt) are
updated but no ingest requirements -> the ingest test would then
fetch a cache with outdated non-ingest dependencies.
- When we generate a new ingest cache we first check the base and
extra requirements and generate a base env before layering the ingest
dependencies on top.
- This PR allows the ingest step to recognize changes to non-ingest
dependencies and trigger new cache generation when either ingest
or base/extra requirement files change.

This PR also bumps the setup python action version in cache actions; it
also adds installation of `virtualenv` for the ingest cache action to
avoid errors like
https://github.com/Unstructured-IO/unstructured/actions/runs/10905551870/job/30265057515?pr=3641#step:3:111
2024-09-18 18:39:14 -05:00
Christine Straub
87a88a3c87
feat: improve pdfminer element processing (#3618)
This PR implements splitting of `pdfminer` elements (`groups of text
chunks`) into smaller bounding boxes (`text lines`). This implementation
prevents loss of information from the object detection model and
facilitates more effective removal of duplicated `pdfminer` text. This
PR also addresses #3430.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-09-12 21:17:27 +00:00
Roman Isecke
ebf16055d8
feat/add deprecation warning to all embed code (#3614)
### Description
Related PR to move the code over:
https://github.com/Unstructured-IO/unstructured-ingest/pull/92

Also removed the console script that exposes ingest.
2024-09-10 23:48:39 +00:00
Christine Straub
acd070c5d5
feat: enhance pdfminer element cleanup (#3593)
This PR aims to expand removal of `pdfminer` elements to include those
inside all `non-pdfminer` elements, not just `tables`.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-09-04 12:02:50 +00:00
David Potter
ddba928344
Potter/mixedbread embedder (#3513)
Thanks to @huangrpablo and @juliuslipp we now have a mixedbread.ai
embedder!
2024-08-27 14:52:13 +00:00
Steve Canny
a861ed8fe7
feat(chunk): split tables on even row boundaries (#3504)
**Summary**
Use more sophisticated algorithm for splitting oversized `Table`
elements into `TableChunk` elements during chunking to ensure element
text and HTML are "synchronized" and HTML is always parseable.

**Additional Context**
Table splitting now has the following characteristics:
- `TableChunk.metadata.text_as_html` is always a parseable HTML
`<table>` subtree.
- `TableChunk.text` is always the text in the HTML version of the table
fragment in `.metadata.text_as_html`. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger
than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger
than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and
unneeded elements or attributes. This maximizes the semantic "density"
of each chunk.
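
The whole-row rule can be illustrated with a minimal sketch. This is not the library's implementation: `split_rows` and `max_chars` are hypothetical names, rows are modeled as plain lists of cell strings, and a single oversized row is emitted on its own rather than being split at cell or word boundaries.

```python
from typing import Iterator, List

Row = List[str]


def split_rows(rows: List[Row], max_chars: int) -> Iterator[List[Row]]:
    """Yield groups of whole rows whose combined text fits within max_chars."""
    chunk: List[Row] = []
    size = 0
    for row in rows:
        row_len = len(" ".join(row))
        # Count the space that joins this row's text to the previous row's.
        added = row_len + (1 if chunk else 0)
        if chunk and size + added > max_chars:
            # The next row would overflow the window: close the chunk here,
            # on a whole-row boundary, and start a new one with this row.
            yield chunk
            chunk, size, added = [], 0, row_len
        chunk.append(row)
        size += added
    if chunk:
        yield chunk
```

With rows `[["a", "b"], ["cc", "dd"], ["e", "f"]]` and `max_chars=9`, the first two rows fit together and the third starts a new chunk.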
2024-08-19 18:56:53 +00:00
John
6545f16e57
chore: remove cryptography pin and update test (#3482)
remove cryptography pin, pin tenacity, and update
test_unstructured_ingest/unit/connector/test_salesforce_connector.py
2024-08-07 15:25:23 +00:00
David Potter
59ec64235b
chore: rename astra to astradb (#3458)
DataStax wanted all references to be astradb instead of astra. As per
@erichare

We'll also have to do the same in unstructured-ingest :)
2024-08-05 20:41:02 +00:00
John
147514f6b5
feat: msg and email metadata (#3444)
Update partition_eml and partition_msg to capture cc, bcc, and message
id fields.

Docs PR: https://github.com/Unstructured-IO/docs/pull/135/files

Testing
```
from unstructured.partition.email import partition_email
from test_unstructured.unit_utils import example_doc_path

elements = partition_email(filename=example_doc_path("eml/fake-email-header.eml"), include_headers=True)
print(elements)
elements[0].metadata.to_dict()
```

Note to reviewers:
Tests in `test_unstructured/partition/test_email.py` were refactored and
rearranged to group similar tests together, so it will be easiest to
review those changes commit by commit.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
2024-08-01 19:24:17 +00:00
Roman Isecke
482f093afb
feat: Add deprecation warning on import of any ingest code (#3443)
### Description
Any time `unstructured.ingest` is imported, this deprecation warning gets
emitted:
```
DeprecationWarning: unstructured.ingest will be removed in a future version
```
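
A warning like this is usually raised from a package's `__init__.py` so that it fires on import. The sketch below assumes that approach; `warn_ingest_deprecated` is a hypothetical helper name and not necessarily how this PR is structured.

```python
import warnings


def warn_ingest_deprecated() -> None:
    """Emit the deprecation message quoted above."""
    warnings.warn(
        "unstructured.ingest will be removed in a future version",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )
```

Python's default filters hide `DeprecationWarning` unless it is triggered by code in `__main__`, so running with `python -W default` or calling `warnings.simplefilter("default")` may be needed to see it.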
2024-07-30 15:06:21 +00:00
David Potter
441b3393b1
bugfix [OSS-67]: update import of pinecone exception (#3432)
the pinecone python package moved their importing of
PineconeApiException

Chroma `sleep` added because even though there is a `wait`, there is
still some sort of timing issue.
2024-07-23 19:48:55 +00:00
Roman Isecke
1df7908f03
feat: save file id for all fsspec connectors if present (#3405)
### Description

If the id value exists in the stats response from fsspec, save it as a
`file_id` field in the metadata being persisted on each element.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-07-19 13:30:21 +00:00
Christine Straub
0eb461acc2
refactor: restructure PDF/Image example document organization (#3410)
This PR aims to improve the organization and readability of our example
documents used in unit tests, specifically focusing on PDF and image
files.

### Summary
- Created two new subdirectories in the `example-docs` folder:
  - `pdf/`: for all PDF example files
  - `img/`: for all image example files
- Moved relevant PDF files from `example-docs/` to `example-docs/pdf/`
- Moved relevant image files from `example-docs/` to `example-docs/img/`
- Updated file paths in affected unit & ingest tests to reflect the new
directory structure

### Testing
All unit & ingest tests should be updated and verified to work with the
new file structure.

## Notes
Other file types (e.g., office documents, HTML files) remain in the root
of `example-docs/` for now.

## Next Steps
Consider similar reorganization for other file types if this structure
proves to be beneficial.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-07-18 22:21:32 +00:00
Christine Straub
69cddf5f89
ci: disable sharepoint ingest test (#3393)
Disable sharepoint ingest test to unblock development. We need to
re-enable this test when the sharepoint credentials are updated.
2024-07-14 01:58:08 +00:00
Steve Canny
c27e0d0062
rfctr(html): replace html parser (#3218)
**Summary**
Replace legacy HTML parser with recursive version that captures all
content and provides flexibility to add new metadata. It's also
substantially faster although that's just a happy side-effect.

**Additional Context**
The prior HTML parsing algorithm that makes up the core of HTML
partitioning was buggy and very difficult to reason about because it did
not conform to the inherently recursive structure of HTML. The new
version retains `lxml` as the performant and reliable base library but
uses `lxml`'s custom element classes to efficiently classify HTML
elements by their behaviors (block-item and inline (phrasing) primarily)
and give those elements the desired partitioning behaviors.

This solves a host of existing problems with content being skipped and
elements (paragraphs) being divided improperly, but also provides a
clear domain model for reasoning about its behavior and reliably
adjusting it to suit our existing and future purposes.

The parser's operation is recursive, closely modeling the recursive
structure of HTML itself. Its behaviors are based on the HTML Standard
and reliably produce proper and explainable results even for novel
cases.

Fixes #2325 
Fixes #2562
Fixes #2675
Fixes #3168
Fixes #3227
Fixes #3228 
Fixes #3230 
Fixes #3237 
Fixes #3245 
Fixes #3247 
Fixes #3255
Fixes #3309 

### BEHAVIOR DIFFERENCES

#### `emphasized_text_tags` encoding is changed:
- `<strong>` is encoded as `"b"` rather than `"strong"`.
- `<em>` is encoded as `"i"` rather than `"em"`.
- `<span>` is no longer recorded in `emphasized_text_tags` (because
without the CSS we can't tell whether it's used for emphasis or if so
what kind).
- nested emphasis (e.g. bold+italic) is encoded as multiple characters
("bi").
- `emphasized_text_contents` is broken on emphasis-change boundaries,
like:
  ```html
  <p>foo <b>bar <i>baz</i> bada</b> bing</p>
  ```
  produces:
  ```json
  {
    "emphasized_text_contents": ["bar", "baz", "bada"],
    "emphasized_text_tags": ["b", "bi", "b"]
  }
  ```
   whereas previously it would have produced:
  ```json
  {
    "emphasized_text_contents": ["bar baz bada", "baz"],
    "emphasized_text_tags": ["b", "i"]
  }
  ```
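
The new encoding can be reproduced with a small standard-library sketch. It is an illustration only, not the `lxml`-based parser this PR adds; `EmphasisCollector` and `EMPHASIS_CODES` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List

# Tag names mapped to the single-character codes described above.
EMPHASIS_CODES = {"b": "b", "strong": "b", "i": "i", "em": "i"}


class EmphasisCollector(HTMLParser):
    """Record each emphasized text run with the codes active at that point."""

    def __init__(self) -> None:
        super().__init__()
        self.active: List[str] = []    # emphasis codes currently open
        self.contents: List[str] = []  # emphasized text runs
        self.tags: List[str] = []      # codes in effect for each run

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_CODES:
            self.active.append(EMPHASIS_CODES[tag])

    def handle_endtag(self, tag):
        code = EMPHASIS_CODES.get(tag)
        if code in self.active:
            self.active.remove(code)

    def handle_data(self, data):
        text = data.strip()
        # A new run starts at every emphasis-change boundary.
        if text and self.active:
            self.contents.append(text)
            self.tags.append("".join(sorted(set(self.active))))


collector = EmphasisCollector()
collector.feed("<p>foo <b>bar <i>baz</i> bada</b> bing</p>")
collector.close()
print(collector.contents, collector.tags)
```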

#### `<pre>` text is preserved as it appears in the html
Except that a leading newline is removed if present (has to be in
position 0 of text). Also, a trailing newline is stripped but only if it
appears in the very last position ([-1]) of the `<pre>` text. Old parser
stripped all leading and trailing whitespace.

Result is that:
```html
<pre>
foo
bar
baz
</pre>
```
parses to `"foo\nbar\nbaz"` which is the same result produced for:
```html
<pre>foo
bar
baz</pre>
```
This equivalence is the same behavior exhibited by a browser, which is
why we did the extra work to make it this way.
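
The trimming rule described above can be sketched as follows. `trim_pre_text` is a hypothetical helper name used for illustration, not a function from the parser.

```python
def trim_pre_text(text: str) -> str:
    """Drop one leading and one trailing newline; keep all other whitespace."""
    if text.startswith("\n"):
        text = text[1:]
    if text.endswith("\n"):
        text = text[:-1]
    return text
```

Both `<pre>` examples above reduce to the same string under this rule, while any further whitespace inside the block is left untouched.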

#### Whitespace normalization
Leading and trailing whitespace are removed from element text, just as
it is removed in the browser. Runs of whitespace within the element text
are reduced to a single space character (like in the browser). Note this
means that `\t`, `\n`, and `&nbsp;` are replaced with a regular space
character. All text derived from elements is whitespace normalized
except the text within a `<pre>` tag. Any leading or trailing newline is
trimmed from `<pre>` element text; all other whitespace is preserved
just as it appeared in the HTML source.
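
A minimal sketch of this normalization for non-`<pre>` text, assuming a regular expression is an acceptable stand-in for the parser's own logic (`normalize_whitespace` is a hypothetical name):

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse whitespace runs (including non-breaking spaces) to one space."""
    # "\xa0" is the character produced by "&nbsp;"; it is listed explicitly
    # for clarity even though "\s" already covers it for str patterns.
    return re.sub(r"[\s\xa0]+", " ", text).strip()
```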

#### `link_start_indexes` metadata is no longer captured. Rationale:
- It was frequently wrong, often `-1`.
- It was deprecated but then added back in a community PR.
- Maintaining it across any possible downstream transformations (e.g.
chunking) would be expensive and almost certainly lead to wrong values
as distant code evolves.
- It is complex to compute and recompute when whitespace is normalized,
adding substantial complexity to the code and reducing readability and
maintainability

#### `<br/>` element is replaced with a single newline (`"\n"`)
but that is usually replaced with a space in `Element.text` when it is
normalized. The newline is preserved within a `<pre>` element.
  - Related: _No paragraph-break on `<br/><br/>`_

#### Empty `h1..h6` elements are dropped.
HTML heading elements (`<h1..h6>`) are "skipped" (do not generate a
`Title` element) when they contain no text or contain only whitespace.

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-07-11 00:14:28 +00:00
Roman Isecke
76cccb3a5e
feat/persist metadata for fsspec connectors (#3371)
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-07-10 22:08:05 +00:00
David Potter
6c78677ebb
feat: add Astra source connector (#3304)
Thanks to @erichare we now have an AstraDB source connector.

updating constant names to be more aligned with AstraDB
2024-07-10 20:29:22 +00:00
Christine Straub
512583ed91
build(deps): bump unstructured.paddleocr 2.8.0 (#3374)
### Summary
Bump unstructured.paddleocr to `2.8.0`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-07-09 22:19:08 +00:00
David Potter
db1e6993a8
rfctr [P6M-398]: salesforce connector v2 (#3344)
Updates salesforce source connector to v2.
2024-07-09 16:46:58 +00:00
Ahmet Melek
3f96a5ae5c
rfctr: Implement Azure Cognitive Search V2 Destination Connector (#3311)
This PR 
- adds the V2 version of Azure Cognitive Search connector
- extends the ingest test to check for chunking and embedding capability
2024-07-09 12:19:15 +00:00
Roman Isecke
b556d6d575
rfctr: Implement Sharepoint V2 Source Connector (#3314)
### Description
Migrate over the sharepoint connector to v2 and in the process refactor
the majority of the connector. It now pulls in much more content from
the SDK at index time, including permissions data if the parameters are
passed in. HTML content generated from the SitePage is isolated to the
html content in the `CanvasContent1` and `LayoutWebpartsContent`
returned by the SDK.

Some TODOs were left in there for future iterations. Currently only
document and site page content is being pulled in from sharepoint, but
sharepoint has more types of content than just that, such as lists. Note
left in there to support other sharepoint types.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: vangheem <vangheem@gmail.com>
Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
2024-07-09 09:52:59 +00:00
Nathan Van Gheem
6e4d9ccd5b
refactor: implement databricks volumes v2 dest connector (#3334)
2024-07-03 19:01:16 +00:00
Roman Isecke
f1a28600d9
feat/singlestore dest connector (#3320)
### Description
Adds [SingleStore](https://www.singlestore.com/) database destination
connector with associated ingest test.
2024-07-03 15:15:39 +00:00
John
0046f58a4f
revert unstructured-client pin and make pip-compile (#3298)
Change unstructured-client pin to setting minimum version instead of max
version and `make pip-compile`.

Integration tests that were dependent on the old version of the client
are removed. These tests should be replicated in/moved to the SDK
repo(s).
2024-07-02 16:42:03 +00:00
Nathan Van Gheem
da29242dbd
rfctr: implement mongodb v2 destination connector (#3313)
This PR provides support for V2 mongodb destination connector.
2024-07-02 16:40:51 +00:00
Ahmet Melek
72f28d7a11
feat: add v2 pinecone destination connector (#3286)
This PR adds a V2 version of the Pinecone destination connector
2024-07-01 23:22:06 +00:00
David Potter
a18b21c06e
rfctr [P6M-397]: opensearch source connector v2 (#3302)
Updates opensearch source connector to v2. Leverages elasticsearch v2
heavily.

Expected tests renamed because that's how Elasticsearch names them.
2024-07-01 20:35:26 +00:00
Matt Robinson
db8617872b
build: image and dependency updates; fix tesseract files locations (#3310)
### Summary

Updates to the latest version of the `wolfi-base` image. Changes
include:
- Version bumps to address CVEs
- `libreoffice` is now included in the `arm64` image, so `.doc` files are
now supported on `arm64`. `.ppt` files do not work with the `libreoffice`
package currently available on `wolfi-os`. We have follow-on work to look
into that.
- Updates the location of the `tesseract` `tessdata` files on the
`arm64` build. Closes #3290.
- Closes #3319 and adds `psutil` to the base dependencies.

### Testing

- `test_dockerfile` should continue to pass with the updates.
2024-07-01 19:39:32 +00:00
David Potter
9eb4c96b94
fix: update slack test to point to new channel (#3328)
When we switched community Slack from Paid to Free we lost the CI test
bot. Also, if messages are deleted after 90 days, our expected test data
will disappear.

- created a new bot in our paid company slack
(test_unstructured_ingest_bot)
- added a new private channel (test-ingest)
- invited the bot to the channel
- adjusted the end datetime of the test to cover the first few messages
in the channel

Still to do:
- update the CI secrets with the new bot token
- update the LastPass with the new bot token (I don't have write
access.. :(.
2024-07-01 18:11:21 +00:00
David Potter
15f80c4ad6
rfct [P6M]-392: OpenSearch V2 Destination Connector (#3293)
Migrates OpenSearch destination connector to V2. Relies a lot on the
Elasticsearch connector where possible. (this is expected)
2024-06-28 20:51:23 +00:00
Roman Isecke
54ec311c55
feat/migrate onedrive src (#3295)
### Description
Migrate the onedrive source connector to v2, adding richer content
pulled from the response of the SDK to add further metadata to the
FileData produced by the indexer.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-26 23:59:51 +00:00
Matt Robinson
6939bff49e
build(deps): bump langchain-community version (#3305)
### Summary

Bumps to the latest `langchain-community` version to resolve
[CVE-2024-2965](https://nvd.nist.gov/vuln/detail/CVE-2024-2965).

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2024-06-26 22:42:32 +00:00
Roman Isecke
3f581e6b7d
feat/migrate gdrive source connector (#3239)
### Description
Migrate the google drive source connector over to the new v2 ingest
framework and include a variety of improvements as part of the refactor:
* The ID is no longer limited to a drive id but can also be the id of a
subfolder within a drive or a file directly and each case is handled
appropriately
* More metadata is pulled in from google drive to enrich the partitioned
elements downstream, and the modified date is now set so that a file is
not reprocessed if the ingest pipeline already has it cached
* timing information is set on the file created when downloaded based on
the last modified date retrieved from google drive

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-25 12:55:28 +00:00
Roman Isecke
e0f4374386
Roman/bugfix conflicting event loop ingest (#3264)
### Description
In use cases where an external system (such as code being run in a
jupyter notebook) already has a running event loop, run the async code
in a dedicated thread pool to not conflict with the existing event loop.

This also has a variety of fixes that were found when putting together a
demo leveraging the elasticsearch destination connector
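
The event-loop workaround can be sketched as below. This is an assumption about the general technique, not the PR's code: `run_coroutine` is a hypothetical helper that uses `asyncio.run` directly when no loop is running and otherwise hands the coroutine to a worker thread that owns its own loop.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def run_coroutine(coro_fn, *args):
    """Run coro_fn(*args) to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro_fn(*args))
    # A loop is already running (e.g. in a notebook): asyncio.run would
    # raise here, so run the coroutine in a separate thread with its own loop.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro_fn(*args)).result()
```

Note that the calling thread blocks until the worker finishes, so the outer event loop makes no progress in the meantime.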
2024-06-24 18:47:37 +00:00
David Potter
8610bd3ab9
feat: Kafka source and destination connector (#3176)
Thanks to @tullytim we have a new Kafka source and destination
connector. It also works with hosted Kafka via Confluent.

Documentation will be added to the Docs repo.
2024-06-22 23:26:23 +00:00
Christine Straub
f23d180d34
fix: docker image publishing error (#3238)
This PR aims to fix a docker image publishing error caused by user
changes when pulling the `amd64` image from the `unstructured`
`wolfi-base` image.
(https://github.com/Unstructured-IO/unstructured/pull/3213).

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-06-18 21:01:42 +00:00
Roman Isecke
fd98cf9ea5
Roman/migrate es dest (#3224)
### Description
Migrate elasticsearch destination connector to new v2 ingest framework
2024-06-18 14:20:49 +00:00
Roman Isecke
d876a386ed
Roman/fix ingest async connectors (#3210)
### Description
Choosing to use async must be done carefully: if a connector is set to
use async, the pipeline will not fan out the inputs via multiprocessing
but will instead run in a single process, under the assumption that it
benefits more from async due to heavy network traffic. This means that
code which is not optimized for async and is blocking will make the
pipeline perform worse than if the connector had never been marked to use
async, since the pipeline would otherwise fan it out using
multiprocessing.

All connectors and processes in the pipeline were revisited to make sure
this criterion was met and updated accordingly:
* Currently the unstructured client does not support making requests
async, so this was moved over to use multiprocessing
* fsspec connector was updated to use the async client from the fsspec
library. This also required that the client be a `@property` fetched on
demand, otherwise the client would break the multiprocessing pool since
it maintains a thread lock and that can't be pickled when the fsspec
connector doesn't support async.
* elasticsearch was also updated to use the async client
* weaviate only recently came out with async support in their SDK at a
version that is higher than we can use in the open source repo, so a
TODO was left but otherwise moved to use multiprocessing
* none of the underlying embedders use async, so the embedder step must
use multiprocessing for now. A TODO was left to update the underlying
embedder classes to optionally support async.
* Chunking parameters were not accurately being passed through from cli
to chunker params, this was fixed
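
The pickling problem behind the `@property` change can be shown with a small sketch. `FakeClient` and `Connector` are hypothetical stand-ins, not classes from the ingest code; the point is only that an object holding a thread lock cannot be pickled, while one that stores plain config and builds the client on demand can.

```python
import pickle
import threading


class FakeClient:
    """Stand-in for an SDK client that holds a thread lock (hypothetical)."""

    def __init__(self, url: str) -> None:
        self.url = url
        self._lock = threading.Lock()  # lock objects cannot be pickled


class Connector:
    """Stores only picklable config; builds the client when it is needed."""

    def __init__(self, url: str) -> None:
        self.url = url

    @property
    def client(self) -> FakeClient:
        # Created on access, so it is never part of the pickled state.
        return FakeClient(self.url)


# The connector's state is plain data and serializes without error.
pickle.dumps(vars(Connector("s3://bucket")))
```

Storing the client as an instance attribute instead would put the lock into the object's state, and sending the connector to a multiprocessing worker would then fail with a `TypeError`.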

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-17 16:55:19 +00:00
Steve Canny
9fae0111d9
rfctr(html): drop HTML-specific elements (#3207)
**Summary**
Remove HTML-specific element types and return "regular" elements like
`Title` and `NarrativeText` from `partition_html()`.

**Additional Context**
- An aspect of the legacy HTML partitioner was the use of HTML-specific
element types used to track metadata during partitioning.
- That role is no longer necessary or desirable.
- HTML-specific elements like `HTMLTitle` and `HTMLNarrativeText` were
returned not only from partitioning HTML but also from the seven other
file formats that broker partitioning to HTML (convert-to-HTML and
partition_html()).
This does not cause immediate breakage because these are still `Text`
element subtypes, but it produces a confusing developer experience.
- Remove the prior metadata roles from HTML-specific elements and remove
those element types entirely.
2024-06-15 00:14:22 +00:00
Christine Straub
9552fbbfbf
chore: bump unstructured-inference 0.7.35 (#3205)
### Summary
- bump unstructured-inference to `0.7.35` which fixed syntax for
generated HTML tables
- update unit tests and ingest test fixtures to reflect changes in the
generated HTML tables
- cut a release for `0.14.6`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-06-14 18:11:38 +00:00
ryannikolaidis
da3492b529
fix: dropbox source connector file path bugs (#3189)
The Dropbox source connector currently raises exceptions when indexing
files due to two issues: a path formatting idiosyncrasy of the Dropbox
library and a divergence in the signature of the Dropbox library's
`fs.info` method, which expects a 'url' parameter rather than 'path'.

## Changes

* add a `/` prefix to file path used by DropboxIndexer
* override the fsspec sterilize_info method in DropboxIndexer to call
`self.fs.info` with `url` rather than `path`; to accommodate for the
fact that `dropboxdrivefs` diverges with this signature
* remove `dropbox.sh` from ignored source tests
* update test fixtures (now that the dropbox connector has been fixed
and not skipped)

## Testing
`dropbox.sh` source ingest test now succeeds (and is no longer ignored)

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-06-13 18:06:41 +00:00
Roman Isecke
f7b0a37c86
Feat/migrate elasticsearch src connector (#3174)
### Description
Migrate the elasticsearch connector, with support for what used to be
batch ingest docs; this is now handled by having the download step
generate additional file data.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-13 17:57:59 +00:00
ryannikolaidis
17bc55e7be
fix: relative path / permissions issues with v2 fsspec connectors (#3186)
When the v2 fsspec connectors currently generate the relative path, they
may introduce a path with a leading slash (this happens in the case of
the Box connector, which is a subclass of fsspec). When this happens
this results in the paths unintentionally being treated as absolute
paths. As a result, the ingest pipeline attempts to write files to
directories at root level, which in turn raises permission issues.
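
The effect of a leading slash is easy to reproduce with `pathlib`. The sketch below is illustrative only; `relative_to_root` and the example paths are hypothetical and not taken from the connector code.

```python
from pathlib import PurePosixPath


def relative_to_root(path: str) -> str:
    """Strip leading slashes so the path stays relative when joined."""
    return path.lstrip("/")


download_dir = PurePosixPath("/tmp/downloads")

# A leading slash makes the right-hand side absolute, discarding download_dir.
unsafe = download_dir / "/folder/file.pdf"

# With the slash stripped, the file lands under download_dir as intended.
safe = download_dir / relative_to_root("/folder/file.pdf")

print(unsafe)
print(safe)
```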

Note: Box expected results needed to be updated now that it's no longer
failing.

Aside: found that our tests were unintentionally skipping `box.sh` tests
because we were intending to skip `dropbox.sh` and we use regex to match
if a given test is in skip tests. This adds changes to force an exact
match.

## Changes

* Strip leading slashes during the creation of relative paths in fsspec
connectors
* Add expected results for Box connector
* (bonus): `make tidy` altered an unrelated file by removing an
unnecessary call of `pass`
* (bonus): check exact match for skipped ingest tests which fixes Box
tests getting skipped

## Testing


[Tests](https://github.com/Unstructured-IO/unstructured/actions/runs/9461928289/job/26093475612#step:7:2085)
for the Box connector were failing. They were accidentally getting skipped
(see changes above). They are no longer skipped and now pass.
2024-06-12 03:39:35 +00:00
Roman Isecke
b777864296
feat: Migrate over fsspec connectors (#3066)
### Description
Move over all fsspec connectors to the new framework

Variety of bug fixes found and fixed in this PR as well:
* custom json mixin being used for the enhanced dataclass would break if
typing was quoted. That was fixed. A check was also added to the
enhanced dataclass to prevent `InitVar` from being used in the root
dataclass since this breaks serialization.
* hashing for partitioner was using the filename of the raw file being
partitioned rather than the file name of the file data generated from
indexing. This means that multiple files could result in the same
partition hash when recursive flag is passed in. This was updated to use
the file data file name instead.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-05 19:12:06 +00:00
Matt Robinson
865ef496e6
ci: update pinecone test to use serverless (#3127)
### Summary

Closes #3068. Updates the Pinecone connector tests to use serverless
indexes, per the documentation
[here](https://docs.pinecone.io/reference/api/control-plane/create_index).
Also updates the CHANGELOG to mention serverless. Turns out we already
supported it with the client version bump, but it hadn't been tested
yet.

### Testing

See [this CI
job](https://github.com/Unstructured-IO/unstructured/actions/runs/9319836670/job/25655322433?pr=3127)
that passed, running only the Pinecone test.
2024-05-31 15:24:41 +00:00
ryannikolaidis
1f8768750c
chore: add auth to s3 destination test (#3122)
We should be validating the S3 Destination with authenticated requests,
with credentials from a limited test user.

## Changes

- Updates s3 destination test to point to a bucket that requires
authentication.
- Adds authentication to the s3 destination test request
- Bonus: fix deserialization of S3ConnectionConfig for s3 V2 destination
- Bonus: fix S3ConnectionConfig never registered for s3 V2 destination
- Bonus: repair version and changelog version for consistency with -dev
convention

## Testing
Validated by changes to S3 destination ingest test
2024-05-31 07:05:09 +00:00
ryannikolaidis
6b5d8a9785
fix: revert dropping of filename extension for some connectors (#3109)
V2 refactor of ingest code introduces the removal of original file
extensions. Since the upgrade of connectors is incomplete this means
that some connectors will remove the original file extension and some
will not. Still TBD whether this is actually something we want at all.

This PR reverts specifically that change in the V2 ingest code so that
original file extension is preserved downstream.

## Testing
CI is passing with filenames updated via `Ingest Test Fixtures Update`
workflow.

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2024-05-29 19:14:22 +00:00
Matt Robinson
3158169585
fix: uninstall bson for mongo connector (#3104)
### Summary

Closes #3049. Reenables the MongoDB connector test, which was disabled
previously in #3047 due to incompatibility between the `pymongo` and the
`bson` package from `pip`, which is a dependency for the Astra
connector. Per the `pymongo` docs below, `pymongo` ships with its own
version of `bson` and installing `bson` from `pip` breaks `pymongo`.

- https://pymongo.readthedocs.io/en/stable/installation.html

### Testing

Ingest tests ran successfully for the [source
connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512636315)
and the [destination
connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512635546).
2024-05-28 17:45:18 +00:00
Matt Robinson
6b400b46fe
feat: add VoyageAI embeddings (#3069) (#3099)
Original PR was #3069. Merged in to a feature branch to fix dependency
and linting issues. Application code changes from the original PR were
already reviewed and approved.

------------
Original PR description:
Adding VoyageAI embeddings 
Voyage AI’s embedding models and rerankers are state-of-the-art in
retrieval accuracy.

---------

Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>
2024-05-24 21:48:35 +00:00