6 Commits

Author SHA1 Message Date
Steve Canny
086b8d6f8a
rfctr(part): prepare for pluggable auto-partitioners 2 (#3657)
**Summary**
Step 2 in prep for pluggable auto-partitioners, remove `regex_metadata`
field from `ElementMetadata`.

**Additional Context**
- "regex-metadata" was an experimental feature that didn't pan out.
- It's implemented by one of the post-partitioning metadata decorators,
so get rid of it as part of the cleanup before consolidating those
decorators.
2024-09-24 17:33:25 +00:00
Ahmet Melek
3f96a5ae5c
rfctr: Implement Azure Cognitive Search V2 Destination Connector (#3311)
This PR 
- adds the V2 version of Azure Cognitive Search connector
- extends the ingest test to check for chunking and embedding capability
2024-07-09 12:19:15 +00:00
Roman Isecke
d09c8c0cab
test: update ingest dest tests to follow set pattern (#1991)
### Description
Update all destination tests to match pattern:
* Don't omit any metadata to check full schema
* Move azure cognitive dest test from src to dest
* Split delta table test into seperate src and dest tests
* Fix azure cognitive search and add to dest tests being run (wasn't
being run originally)
2023-11-03 12:46:56 +00:00
Roman Isecke
9836235ead
Chunking support for SharePoint Connector (#1548)
### Description
Optionally adds in chunking to the CLI which adds a flag to trigger
chunking and exposes the parameters used by the `chunk_by_title` method.
Runs chunking before the embedding step.


Opened to replace original PR
https://github.com/Unstructured-IO/unstructured/pull/1531
2023-09-27 21:05:55 +00:00
Roman Isecke
5c7b4f586b
Roman/azure cognitive embeddings (#1524)
### Description
This PR is two-fold:  

**Embeddings:**
* Embeddings incorporated into the sharepoint source connector, which
will now call out to OpenAI and create embeddings if the flag is passed
in and the api key provided.

**Writing vector content (embeddings) to Azure cognitive search index:**
* The schema for the index expected to exist in Azure has been updated
to include the vector field type and a test script has been added to
test the new content being produced from the Sharepoint connector to
push the embedding content.

Some important notes about other changes in here:
* The embedding code had to be updated to patch the `to_dict` method on
elements to add `embeddings` to the dict output if that was added. While
the code originally added the embedding content, when `to_dict` was
called to save the content as json, this was lost.
2023-09-26 23:24:21 +00:00
Roman Isecke
bd49cfbab7
feat: adds Azure Cognitive Search (full text) destination connector (#1459)
### Description
New [Azure Cognitive
Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search)
destination connector added. Writes each json element from the created
json files via partition and writes that content to an index.

**Bonus bug fix:** Due to a recent change where the default version of
python used in the repo was bumped to `3.10` from `3.8`, this means
running `pip-compile` now runs it against that version rather than the
lowest we support which is still `3.8`. This breaks the setup for those
lower versions because some of the versions pulled in by `pip-compile`
exist for `3.10` but not `3.8`. `pip-compile` was updates to run as a
script that checks the version of python being used first, which helps
guarantee that all dependencies meet the minimum python version
requirement.

Closes out https://github.com/Unstructured-IO/unstructured/issues/1466
2023-09-25 10:27:42 -04:00