- Adds a destination connector to upload processed output into a
PostgreSQL/SQLite database instance.
- Users are responsible for providing their own instances. This PR includes a
couple of configuration examples.
- Defines the scripts required to set up a PostgreSQL instance with the
unstructured elements schema.
- Validates postgres/pgvector embedding storage and retrieval
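For context, a minimal sketch of the kind of pgvector round trip the validation exercises; the connection string, table name, and vector size here are placeholders, not the shipped schema or scripts.
```python
# Hedged illustration only: store and read back an embedding in a pgvector column.
# Connection string, table name, and vector size are assumptions for the example.
import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/unstructured")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS elements "
        "(id serial PRIMARY KEY, text text, embeddings vector(3));"
    )
    cur.execute(
        "INSERT INTO elements (text, embeddings) VALUES (%s, %s);",
        ("hello world", "[0.1,0.2,0.3]"),
    )
    cur.execute("SELECT text, embeddings FROM elements LIMIT 1;")
    print(cur.fetchone())
conn.close()
```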
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
To test:
> cd docs && make html
Sections:
- New User sign-up: (i) registration form, (ii) payment processing, and
(iii) use API key & URL
- API Account maintenance: (i) update billing, (ii) opt-in email, (iii)
rotate API key, and (iv) cancel plan
- Get Support
This PR intends to add [Qdrant](https://qdrant.tech/) as a supported
ingestion destination.
- Implements CLI and programmatic usage.
- Documentation update
- Integration test script
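For reference, a rough sketch of the upsert the destination performs; the collection name, vector size, and payload shape are placeholders, and the collection is assumed to already exist.
```python
# Illustrative only: upsert one point per element, with the embedding as the
# vector and the element JSON as the payload. Names and sizes are assumptions.
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")
points = [
    PointStruct(id=i, vector=[0.0] * 384, payload={"text": f"element {i}"})
    for i in range(3)
]
client.upsert(collection_name="unstructured-elements", points=points)
```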
---
Clone of #2315 to run with CI secrets
---------
Co-authored-by: Anush008 <anushshetty90@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Closes https://github.com/Unstructured-IO/unstructured/issues/1842
Closes https://github.com/Unstructured-IO/unstructured/issues/2202
Closes https://github.com/Unstructured-IO/unstructured/issues/2203
This PR:
- Adds an Elasticsearch destination connector so that documents from any
supported source can be ingested, embedded, and the embeddings / documents
written into Elasticsearch.
- Defines an example unstructured elements schema so users can set up their
unstructured Elasticsearch indexes easily.
- Includes parallelized upload and lazy processing for elasticsearch
destination connector.
- Rearranges elasticsearch test helpers to source, destination, and
common folders.
- Adds util functions to batch iterables in a lazy way for uploads (a sketch
of the idea follows this list).
- Fixes a bug where removing the optional parameter `--fields` broke the
connector due to an integer processing error.
- Fixes a bug where using an [elasticsearch
config](8fa5cbf036/unstructured/ingest/connector/elasticsearch.py (L26-L35))
for a destination connector resulted in a serialization issue when the
optional parameter `--fields` was not provided.
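For illustration, a minimal sketch of the lazy batching idea behind those util functions (not the connector's actual helper; `bulk_upload` in the usage comment is hypothetical):
```python
# Generator-based batching: nothing is materialized until a batch is consumed,
# so large element lists can be streamed to the bulk-upload call.
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def generator_batching(iterable: Iterable[T], batch_size: int = 100) -> Iterator[List[T]]:
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

# usage: for batch in generator_batching(element_dicts, 500): bulk_upload(batch)
```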
Adds Chroma (also known as ChromaDB) as a vector destination.
Currently Chroma is an in-memory, single-process-oriented library, with plans
for a hosted and/or more production-ready solution:
https://docs.trychroma.com/deployment
Though they now claim to support multiple clients hitting the database at
once, I found that to be inconsistent. Sometimes multiprocessing worked (maybe
1 out of 3 times), but the other times I would get different errors, so I kept
it single-process.
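For context, roughly the shape of the write the connector performs, kept single-process as noted above; the collection name, persistence path, and metadata fields are placeholders.
```python
# Hedged sketch: partition a document and add its elements to a local Chroma
# collection. Collection name, persistence path, and fields are assumptions.
import chromadb
from unstructured.partition.auto import partition

elements = partition(filename="example-docs/fake-text.txt")
client = chromadb.PersistentClient(path="chroma-db")
collection = client.get_or_create_collection("unstructured-elements")
collection.add(
    ids=[element.id for element in elements],
    documents=[element.text for element in elements],
    metadatas=[{"category": element.category} for element in elements],
)
print(collection.count())
```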
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
Adds a source connector for SFTP, which uses fsspec with Paramiko under the
hood. Paramiko is the standard SFTP package for Python, used by pysftp and
others.
```
--username foo \
--password bar \
--remote-url sftp://localhost:47474/upload/
```
It will only download a specifically requested file if it has an extension
(e.g. `--remote-url sftp://localhost:47474/upload/bob.zip`); any other
remote_url is treated as a folder path. This is intentional.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
### Summary
This PR is the second part of the "image extraction" refactor to move it
from the `unstructured-inference` repo to the `unstructured` repo; the first
part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/299. This
PR adds logic to support extracting images.
### Testing
`git clone -b refactor/remove_image_extraction_code --single-branch
https://github.com/Unstructured-IO/unstructured-inference.git && cd
unstructured-inference && pip install -e . && cd ../`
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/embedded-images.pdf",
    strategy="hi_res",
    extract_images_in_pdf=True,
)
print("\n\n".join([str(el) for el in elements]))
```
Closes #1781.
- Adds a Weaviate destination connector
- The connector receives a host for the weaviate instance and a weaviate
class name.
- Defines a weaviate schema for json elements.
- Defines the pre-processing to conform unstructured's schema to the
proposed weaviate schema.
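As a rough illustration of that pre-processing (the property names here are hypothetical, not the schema shipped in this PR), the element dict can be flattened so nested metadata becomes top-level Weaviate properties:
```python
# Hypothetical sketch: flatten an element's dict so it fits a flat Weaviate
# class schema. Property names are assumptions for illustration only.
from unstructured.partition.auto import partition

def element_to_weaviate_object(element) -> dict:
    data = element.to_dict()
    metadata = data.pop("metadata", {}) or {}
    # Weaviate properties are flat, so metadata fields are lifted to the top level.
    return {"text": data.get("text"), "category": data.get("type"), **metadata}

objects = [element_to_weaviate_object(el) for el in partition(filename="example-docs/fake-text.txt")]
print(objects[0].keys())
```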
### Summary
This PR is the second part of the `pdfminer` refactor to move it from the
`unstructured-inference` repo to the `unstructured` repo; the first part was
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/294. This
PR adds logic to merge the extracted layout with the inferred layout.
The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
`inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
document (as data/filename) to the `pdfminer_processing` module, which
first opens the document (create temp file/dir as needed), and splits
the document by pages
* if `is_image` is `True`, return the passed `inferred_layout` (DocumentLayout)
* if `is_image` is `False`:
  * get `extracted_layout` (TextRegions) from the passed document
(data/filename) by `pdfminer`
  * merge `extracted_layout` (TextRegions) with the passed
`inferred_layout` (DocumentLayout)
  * return the `inferred_layout` (DocumentLayout) with updated elements
(all merged LayoutElements) as `merged_layout` (DocumentLayout)
* pass merged_layout and the document (as data/filename) to the `OCR`
module, which first opens the document (create temp file/dir as needed),
and splits the document by pages (convert PDF pages to image pages for
PDF file)
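As a structural sketch of the merge step described above (toy classes and a trivial merge stand in for the real `DocumentLayout`/`TextRegions` types and the bbox-matching logic in `unstructured-inference`):
```python
# Toy stand-ins purely to illustrate the per-page flow; the real classes and
# merge logic live in unstructured-inference / pdfminer_processing and differ.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PageLayout:
    elements: List[str] = field(default_factory=list)

@dataclass
class DocumentLayout:
    pages: List[PageLayout] = field(default_factory=list)

def merge_with_extracted_layout(
    inferred_layout: DocumentLayout,
    extracted_pages: List[List[str]],
    is_image: bool,
) -> DocumentLayout:
    if is_image:
        # image inputs have no pdfminer text layer; pass the inferred layout through
        return inferred_layout
    for page, extracted in zip(inferred_layout.pages, extracted_pages):
        # the real merge matches regions to inferred elements by bounding box
        page.elements = page.elements + extracted
    return inferred_layout  # i.e. the merged_layout handed on to the OCR module
```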
### Note
This PR also fixes issue #2164 by using functionality similar to the one
implemented in the `fast` strategy workflow when extracting elements by
`pdfminer`.
### TODO
* image extraction refactor to move it from `unstructured-inference`
repo to `unstructured` repo
* improving natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`
### Summary
Closes #2033
Updates `partition_via_api` to use `UnstructuredClient` for api calls
instead of `requests`.
Updates associated tests.
Note: This PR does **not** update `partition_multiple_via_api` as
documentation in `unstructured-python-client` indicates it does not
support multiple files. A new issue should be opened to add that
functionality to `unstructured-python-client`.
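A quick way to exercise the change (the file path and API key are placeholders):
```python
# Calls the hosted API through partition_via_api, which now goes through
# UnstructuredClient under the hood. Filename and api_key are placeholders.
from unstructured.partition.api import partition_via_api

elements = partition_via_api(
    filename="example-docs/layout-parser-paper-fast.pdf",
    api_key="<your-api-key>",
)
print(elements[:3])
```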
---------
Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
To test:
> cd docs && make html
Change logs:
* Examples are reorganized onto their own page
* Removed two old examples, i.e. "file-utils" & "sentiment analysis".
* Added two examples: "RAG with Unstructured, LangChain, and ChromaDB" &
"Multi-Files Processing with S3 Connector and API"
* Reorganized and added detailed API documentation: (i) usage, (ii)
SDKs, (iii) Azure Marketplace, (iv) AWS Marketplace, (v) parameters and
validation errors
### Summary
Add a procedure to repair a PDF when the PDF structure is invalid for
`PDFminer` to process.
This PR handles two cases of `PSSyntaxError Invalid dictionary
construct: ...`:
* PDFminer opens the entire document and creates a pages generator on
`PDFPage.get_pages(fp)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655715023/?alert_rule_id=14681339&alert_type=issue&notification_uuid=d8db4cf4-686f-4504-8a22-74a79a8e966f&project=4505909127086080&referrer=slack)
* PDFminer's interpreter processes a single page on
`interpreter.process_page(page)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655898781/?referrer=slack&notification_uuid=0d929d48-f490-4db8-8dad-5d431c8460bc&alert_rule_id=14681339&alert_type=issue)
**Additional tech details:**
* Add new dependency `pikepdf` in `requirements/extra-pdf-image.in`,
which is used for repairing PDFs (a rough sketch of the idea follows this list).
* Add new dependency `pypdf` in `requirements/extra-pdf-image.in`,
which is used to find the error page in the entire document by reading the
PDF file again (couldn't find a way to split the PDF in PDFminer).
* Refactor the `is null` check for `get_uris_from_annots`: the
root cause is that `get_uris` passed a `None` `annots` to
`get_uris_from_annots`, so the null check should happen in `get_uris`.
* Add more type protection in `get_uris_from_annots` when using any
`PDFObjRef.resolve()` as `dict` (it could still be a `PDFObjRef`). This
should fix :
* https://github.com/Unstructured-IO/unstructured/issues/1922 where
`annotation_dict` is a `PDFObjRef`
* https://github.com/Unstructured-IO/unstructured/issues/1921 where
`rect` is a `PDFObjRef`
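A minimal sketch of the repair idea (not the exact code in this PR): pikepdf rewrites the file, which normalizes many invalid dictionary constructs, and the repaired copy is handed back to PDFminer.
```python
# Hedged sketch: write a repaired copy of a broken PDF with pikepdf.
import os
import tempfile

import pikepdf

def repair_pdf(input_path: str) -> str:
    fd, repaired_path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    with pikepdf.open(input_path) as pdf:
        pdf.save(repaired_path)  # rewriting the file fixes many structural errors
    return repaired_path
```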
### Test
Added three test files (each larger than 500 KB) for unittests to
test:
* Repair entire doc
* Repair one page
* Reprocess failure after repairing one page (just return the elements
before error page in this case).
* Also, it seems like splitting the document into smaller pages could fix
this problem, but I'm not sure why. For example, I saw an error from
reprocessing the whole
[cancer.pdf](https://github.com/Unstructured-IO/unstructured/files/13461616/cancer.pdf)
doc, but no error when I split the pdf at the error page.
* I tested whether I could repair the entire doc again in this case and saw a
different error, which means repairing isn't helping, in my opinion.
* PDFminer can process the whole doc after pikepdf repaired the entire doc up
front, but we can't repair by pages this way.
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
### Description
This adds the basic implementation of pushing the generated json output
of partition to MongoDB. None of this code provisions the MongoDB
instance, so things like adding a search index around the embedding
content must be done by the user. Any sort of schema validation would
also have to take place via user-specific configuration on the database.
This update makes no assumptions about the configuration of the database
itself.
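Conceptually, the write amounts to something like the following (connection string, database, and collection names are placeholders):
```python
# Hedged sketch: push partitioned element dicts into a user-provisioned MongoDB
# collection. URI, database, and collection names are assumptions.
from pymongo import MongoClient
from unstructured.partition.auto import partition

elements = partition(filename="example-docs/fake-text.txt")
client = MongoClient("mongodb://localhost:27017")
collection = client["unstructured"]["elements"]
collection.insert_many([element.to_dict() for element in elements])
print(collection.count_documents({}))
```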
Summary:
Close: https://github.com/Unstructured-IO/unstructured/issues/1920
* stop passing an empty string from `languages` to tesseract, which would
result in passing an empty string to the language config `-l` for the tesseract
CLI
* also stop passing duplicate language codes from `languages` to
tesseract OCR
* if we fail to convert any ISO languages from the `languages`
parameter, proceed with OCR using `eng` as the default
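For illustration only, a hypothetical helper showing the kind of cleanup described above (not the actual implementation in unstructured):
```python
# Drop empty and duplicate codes before building tesseract's "-l" argument,
# falling back to English when nothing usable remains.
from typing import List, Optional

def clean_tesseract_languages(languages: Optional[List[str]]) -> str:
    seen: List[str] = []
    for lang in languages or []:
        code = lang.strip()
        if code and code not in seen:
            seen.append(code)
    return "+".join(seen) if seen else "eng"  # tesseract joins codes with "+"

assert clean_tesseract_languages(["", "eng", "eng"]) == "eng"
assert clean_tesseract_languages(None) == "eng"
```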
### Test
* First confirm the tesseract error `Estimating resolution as X` before
this:
* on the `unstructured-api` repo with main branch, run `make
run-web-app`
* curl to test error from empty string, or just any wrong input like `-F
'languages="eng,de"'`:
```
curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'languages=""' \
-F 'strategy=hi_res' \
-F 'pdf_infer_table_structure=True' \
| jq -C . | less -R
```
* after this change:
  * in your unstructured API env, cd to the unstructured repo and install it locally with `pip install -e .`
  * check out this branch
  * run `make run-web-app` again in the api repo
  * the curl command returns output, and a warning appears in the log
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Per @tabossert we're now using a link shortener behind which we can
rotate the link to keep it current. That way we (🤞 ) never have to
update this here again.
#### Testing:
Links should work. No more links should exist in the documentation
except this one.
### Description
Update all destination tests to match pattern:
* Don't omit any metadata to check full schema
* Move azure cognitive dest test from src to dest
* Split delta table test into separate src and dest tests
* Fix azure cognitive search and add it to the dest tests being run (it wasn't
being run originally)
### Summary
We no longer use the "bricks" terminology for partitioning functions, etc.,
in the library. This PR updates various references to bricks within the
repo and the docs. This is just an initial pass to swap the terminology
out; it'll likely be helpful to reorganize the docs a bit as well.
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
### Description
Add in the fsspec configs needed for the fsspec-based connectors.
To match the behavior of the original CLI, the default used by the click
option was mirrored in the base config for the API endpoint.
Carries `skip_infer_table_types` through to `infer_table_structure` in the
partition flow. Now Table elements from PPT/X, DOC/X, etc. should not have a
`text_as_html` field.
Note: I've continued to exclude this var from partitioners that go
through the HTML flow. If we've already got the HTML, it doesn't make
sense to carry the infer variable along, since we're not 'infer-ing' the
HTML table in these cases.
TODO:
✅ add unit tests
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: amanda103 <amanda103@users.noreply.github.com>
Summary: Added support for AWS Bedrock embeddings. Leverages
"amazon.titan-tg1-large" for the embedding model.
Test
- find your AWS secret access key and key ID; make sure the account has
access to Bedrock's Titan embedding model
- follow the instructions in
d5e797cd44/docs/source/bricks/embedding.rst (bedrockembeddingencoder)
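If it helps, here is a hedged sketch of roughly what the encoder wires together via LangChain; the client service name, region, and constructor arguments are assumptions and may differ from the actual `BedrockEmbeddingEncoder` internals.
```python
# Assumed wiring, for illustration only: a boto3 Bedrock runtime client plus
# LangChain's BedrockEmbeddings with the Titan model named above.
import boto3
from langchain.embeddings import BedrockEmbeddings

client = boto3.client(
    "bedrock-runtime",
    region_name="us-west-2",
    aws_access_key_id="<key id>",
    aws_secret_access_key="<secret access key>",
)
embeddings = BedrockEmbeddings(client=client, model_id="amazon.titan-tg1-large")
vector = embeddings.embed_query("Unstructured makes documents LLM-ready.")
print(len(vector))
```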
---------
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>
### Description
* Update all existing connector docs to use new pipeline approach
### Additional changes:
* Some defaults were set for the runners to match those in the configs
to make those easy to handle, e.g. the biomed runner:
```python
max_retries: int = 5,
max_request_time: int = 45,
decay: float = 0.3,
```
### Description
Currently, linting only takes place over the base unstructured directory,
but we support Python files throughout the repo. It makes sense for all
those files to abide by the same linting rules, so the entire repo
was set to be inspected when the linters are run. Along with that,
autoflake was added as a linter; it has added benefits such
as removing unused imports that would otherwise break flake and
require manual intervention.
The only real relevant changes in this PR are in the `Makefile`,
`setup.cfg`, and `requirements/test.in`. The rest is the result of
running the linters.
This PR:
- defines rbac_data as a SourceMetadata field,
- manages connections to an external api for obtaining rbac data with
ConnectorRBAC class,
- serializes rbac data and saves it to disk,
- matches the rbac_data on disk to each IngestDoc, using a common
field,
- forwards rbac data to Elements, via the partition() function
To test the changes, run `examples/ingest/sharepoint/ingest.sh` with the
relevant rbac & connector credentials
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
This PR adds the `max_characters` (hard max) param to non-table element
chunking. Additionally updates the `num_characters` metadata to
`max_characters` to make it clearer which param we're referencing.
To test:
```
from unstructured.partition.html import partition_html
filename = "example-docs/example-10k-1p.html"
chunk_elements = partition_html(
    filename,
    chunking_strategy="by_title",
    combine_text_under_n_chars=0,
    new_after_n_chars=50,
    max_characters=100,
)

for chunk in chunk_elements:
    print(len(chunk.text))

# previously we were only respecting the "soft max" (default of 500) for elements other than tables;
# now we should see that all the elements have text fields under 100 chars.
```
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
### Description
As we add more and more steps to the pipeline (i.e. chunking, embedding,
table manipulation), it helps to separate the responsibility of each
of these into their own processes, running each in parallel and using json
files to share data across them. This will also help guarantee data is
serializable if this code is used in an actual pipeline. A flow diagram of
the proposed changes is included at the end of this description. As part of this change:
* A parent pipeline class will be responsible for running each `node`,
which can optionally be run via multiprocessing if it supports it.
Possible nodes at this moment:
* Doc factory: creates all the ingest docs via the source connector
* Source: reads/downloads all of the content to process to the local
filesystem to the location set by the `download_dir` parameter.
* Partition: runs partition on all of the downloaded content in json
format.
* Any number of reformat nodes that modify the partitioned content. This
can include chunking, embedding, etc.
* Write: push the final json into the destination via the destination
connector
* This pipeline relies on the ingest docs' information being available via
their serialization. An optimization was introduced with
the `IngestDocJsonMixin`, which adds all the `@property` fields to the
serialized json already being created via the `DataClassJsonMixin`.
* For all intermediate steps (partitioning, reformatting), the content
is saved to a dedicated location on the local filesystem. Right now it's
set to `$HOME/.cache/unstructured/ingest/pipeline/STEP_NAME/`.
* Minor changes: it made sense to move some of the config parameters
between the read and partition configs when I explicitly divided the
responsibility of downloading vs. partitioning the content in the pipeline.
* The pipeline class only makes the doc factory, source and partition
nodes required, keeping with the logic that has been supported so far.
All reformatting nodes and write node are optional.
* Long term, there should also be some changes to the base configs
supported by the CLI to support pipeline specific configs, but for now
what exists was used to minimize changes in this PR.
* Final step to copy the final output to the location designated by the
`_output_filename` value of the ingest doc.
* Hashing occurs at each step by hashing the parameters of that step
(i.e. partition configs) along with the previous step via the filename
used. This allows each step to be the same _if_ all the parameters for
it have not changed and the content so far is the same.
* The only data that is shared and written to across processes is the
dictionary of ingest json data. This dict is created using the
`multiprocessing.managers.DictProxy` to make sure any interaction with it
is behind a lock.
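A minimal sketch of that shared-state pattern (names are illustrative, not the pipeline's actual code):
```python
# A Manager-backed dict proxy keeps writes from worker processes behind a lock.
from multiprocessing import Manager, Pool

def record_result(args):
    shared, doc_id = args
    shared[doc_id] = {"status": "processed"}

if __name__ == "__main__":
    with Manager() as manager:
        shared = manager.dict()
        with Pool(processes=4) as pool:
            pool.map(record_result, [(shared, i) for i in range(10)])
        print(dict(shared))
```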
### Minor refactors included:
* Utility methods added to extract configs from the click options
* Utility method to add common options to click commands.
* All writers moved to using the class approach which extracts a lot of
the common code so there's less copy-paste when new runners are added.
* Use `@property` for source metadata on base ingest doc to add logic to
call `update_source_metadata` if it's still `None` at the time it's
fetched.
### Additional bug fixes included
* Fsspec connectors were not serializable due to the `ingest_doc_cls`.
This was removed from the fields captured by the `@dataclass` decorator
and added in a `__post_init__` method (see the sketch after this list).
* Various reddit connector params were missing. This doesn't have an
explicit ingest test at the moment so was never caught.
* Fsspec connector had the parent `update_source_metadata` misnamed as
`update_source_metadata_metadata` so it was never being called.
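The serialization fix mentioned in the first bullet follows this general pattern (class name and attribute contents simplified for illustration):
```python
# Keep the non-serializable class reference out of the dataclass fields and
# attach it as a plain attribute in __post_init__ instead.
from dataclasses import dataclass

@dataclass
class ExampleFsspecConnector:
    remote_url: str

    def __post_init__(self):
        # not a dataclass field, so it never shows up in the serialized output
        self.ingest_doc_cls = dict  # stand-in for the real ingest doc class
```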
### Flow Diagram

* Updated Metadata page: add common and additional metadata fields by
document types and connectors
* Updated specific installation extra by document types and connectors
* Added embedding brick page in Sphinx TOC
* Fixed Sphinx warnings in new pages
### Description
Updates the Python version of the example docs to show how to run the
same code that the CLI runs, but using Python. Rather than copying the
command that would be run via the terminal and running it with the subprocess
library, this updates the examples to use the supported code exposed in
the inference directory.
For now only the wikipedia one has been updated to get some opinions on
this before updating all other connector docs.
Would close out
https://github.com/Unstructured-IO/unstructured/issues/1445
### Description
This PR is two-fold:
**Embeddings:**
* Embeddings incorporated into the sharepoint source connector, which
will now call out to OpenAI and create embeddings if the flag is passed
in and the api key provided.
**Writing vector content (embeddings) to Azure cognitive search index:**
* The schema for the index expected to exist in Azure has been updated
to include the vector field type, and a test script has been added to
test the new content produced by the Sharepoint connector when pushing
the embedding content.
Some important notes about other changes in here:
* The embedding code had to be updated to patch the `to_dict` method on
elements to add `embeddings` to the dict output when it is present. While
the code originally added the embedding content, it was lost when `to_dict`
was called to save the content as json.
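The patch amounts to something along these lines (simplified; the real change lives on the element/embedding code rather than a free function):
```python
# Make sure embeddings survive the element -> dict conversion used when
# writing JSON output. Simplified illustration only.
def element_to_dict_with_embeddings(element) -> dict:
    data = element.to_dict()
    embeddings = getattr(element, "embeddings", None)
    if embeddings is not None:
        data["embeddings"] = embeddings
    return data
```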
This updates the docker image download URL to pass through the Scarf
gateway, which allows anonymous tracking of downloads.
Related to:
https://github.com/Unstructured-IO/unstructured#chart_with_upwards_trend-analytics
Testing:
`docker pull downloads.unstructured.io/unstructured-io/unstructured:latest`
Result:
The image should download.
### Description
A new [Azure Cognitive
Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search)
destination connector has been added. It takes each json element from the
json files created via partition and writes that content to an index.
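Conceptually the write step looks roughly like this (endpoint, key, index name, and output path are placeholders; the connector batches the documents rather than uploading them all at once):
```python
# Hedged sketch: upload partitioned JSON elements into an existing index.
import json

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="unstructured-elements",
    credential=AzureKeyCredential("<admin key>"),
)
with open("structured-output/example.json") as f:
    elements = json.load(f)
client.upload_documents(documents=elements)
```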
**Bonus bug fix:** Due to a recent change, the default version of
Python used in the repo was bumped from `3.8` to `3.10`, which means
running `pip-compile` now runs against that version rather than the
lowest we support, which is still `3.8`. This breaks the setup for those
lower versions because some of the versions pulled in by `pip-compile`
exist for `3.10` but not `3.8`. `pip-compile` was updated to run as a
script that checks the version of Python being used first, which helps
guarantee that all dependencies meet the minimum Python version
requirement.
Closes out https://github.com/Unstructured-IO/unstructured/issues/1466
Addresses
[#1332](https://github.com/Unstructured-IO/unstructured/issues/1332)
with `unstructured-inference` PR
[#208](https://github.com/Unstructured-IO/unstructured-inference/pull/208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if
`el.metadata.image_path` is `True`
### Testing
from unstructured.partition.pdf import partition_pdf
f_path = "example-docs/embedded-images.pdf"
# default image output directory
elements = partition_pdf(
f_path,
strategy=strategy,
extract_images_in_pdf=True,
)
# specific image output directory
elements = partition_pdf(
f_path,
strategy=strategy,
extract_images_in_pdf=True,
image_output_dir_path=<directory path>,
)
Closes https://github.com/Unstructured-IO/unstructured/issues/1319,
closes https://github.com/Unstructured-IO/unstructured/issues/1372
This module:
- implements EmbeddingEncoder classes which track embedding-related data
- implements an embed_documents method which receives a list of Elements,
obtains embeddings for the text within the Elements, updates the Elements
with an attribute named embeddings, and returns the updated Elements
- the module uses langchain to obtain the embeddings
-----
- The PR additionally fixes a JSON de-serialization issue on the
metadata fields.
To test the changes, run `examples/embed/example.py`
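If the example script is not at hand, a minimal sketch of the intended usage (the encoder's constructor arguments are assumptions and may differ by version):
```python
# Assumed usage, for illustration: embed a couple of Text elements with the
# OpenAI-backed encoder and inspect the attached embeddings.
from unstructured.documents.elements import Text
from unstructured.embed.openai import OpenAIEmbeddingEncoder

encoder = OpenAIEmbeddingEncoder(api_key="<your OpenAI API key>")
elements = encoder.embed_documents(elements=[Text("hello"), Text("unstructured")])
print(len(elements[0].embeddings))
```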