- Adds a destination connector to upload processed output into a
PostgreSQL/Sqlite database instance.
- Users are responsible to provide their instances. This PR includes a
couple of configuration examples.
- Defines the scripts required to setup a PostgreSQL instance with the
unstructured elements schema.
- Validates postgres/pgvector embedding storage and retrieval
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
This PR intends to add [Qdrant](https://qdrant.tech/) as a supported
ingestion destination.
- Implements CLI and programmatic usage.
- Documentation update
- Integration test script
---
Clone of #2315 to run with CI secrets
---------
Co-authored-by: Anush008 <anushshetty90@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Adds Chroma (also known as ChromaDB) as a vector destination.
Currently Chroma is an in-memory single-process oriented library with
plans of a hosted and/or more production ready solution
-https://docs.trychroma.com/deployment
Though they now claim to support multiple Clients hitting the database
at once, I found that it was inconsistent. Sometimes multiprocessing
worked (maybe 1 out of 3 times) But the other times I would get
different errors. So I kept it single process.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
Adds source connector for SFTP which uses fsspec and paramiko via
fsspec. Paramiko is the standard sftp package for python used in pysftp
etc...
```
--username foo \
--password bar \
--remote-url sftp://localhost:47474/upload/
```
Will only download a specifically requested file if it has an extension.
(i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will
treat any other remote_url as a folder path. This is intentional.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
### Summary
This PR is the second part of the "image extraction" refactor to move it
from unstructured-inference repo to unstructured repo, the first part is
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/299. This
PR adds logic to support extracting images.
### Testing
`git clone -b refactor/remove_image_extraction_code --single-branch
https://github.com/Unstructured-IO/unstructured-inference.git && cd
unstructured-inference && pip install -e . && cd ../`
```
elements = partition_pdf(
filename="example-docs/embedded-images.pdf",
strategy="hi_res",
extract_images_in_pdf=True,
)
print("\n\n".join([str(el) for el in elements]))
```
Closes#1781.
- Adds a Weaviate destination connector
- The connector receives a host for the weaviate instance and a weaviate
class name.
- Defines a weaviate schema for json elements.
- Defines the pre-processing to conform unstructured's schema to the
proposed weaviate schema.
### Summary
This PR is the second part of `pdfminer` refactor to move it from
`unstructured-inference` repo to `unstructured` repo, the first part is
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/294. This
PR adds logic to merge the extracted layout with the inferred layout.
The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
`inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
document (as data/filename) to the `pdfminer_processing` module, which
first opens the document (create temp file/dir as needed), and splits
the document by pages
* if is_image is `True`, return the passed
inferred_layout(DocumentLayout)
* if is_image is `False`:
* get extracted_layout (TextRegions) from the passed
document(data/filename) by pdfminer
* merge `extracted_layout` (TextRegions) with the passed
`inferred_layout` (DocumentLayout)
* return the `inferred_layout `(DocumentLayout) with updated elements
(all merged LayoutElements) as merged_layout (DocumentLayout)
* pass merged_layout and the document (as data/filename) to the `OCR`
module, which first opens the document (create temp file/dir as needed),
and splits the document by pages (convert PDF pages to image pages for
PDF file)
### Note
This PR also fixes issue #2164 by using functionality similar to the one
implemented in the `fast` strategy workflow when extracting elements by
`pdfminer`.
### TODO
* image extraction refactor to move it from `unstructured-inference`
repo to `unstructured` repo
* improving natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`
### Summary
Closes#2033
Updates `partition_via_api` to use `UnstructuredClient` for api calls
instead of `requests`.
Updates associated tests.
Note: This PR does **not** update `partition_multiple_via_api` as
documentation in `unstructured-python-client` indicates it does not
support multiple files. A new issue should be opened to add that
functionality to `unstructured-python-client`.
---------
Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Summary
Add a procedure to repair PDF when the PDF structure is invalid for
`PDFminer` to process.
This PR handles two cases of `PSSyntaxError Invalid dictionary
construct: ...`:
* PDFminer open entire document and create pages generator on
`PDFPage.get_pages(fp)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655715023/?alert_rule_id=14681339&alert_type=issue¬ification_uuid=d8db4cf4-686f-4504-8a22-74a79a8e966f&project=4505909127086080&referrer=slack)
* PDFminer's interpreter process a single page on
`interpreter.process_page(page)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655898781/?referrer=slack¬ification_uuid=0d929d48-f490-4db8-8dad-5d431c8460bc&alert_rule_id=14681339&alert_type=issue)
**Additional tech details:**
* Add new dependency `pikepdf` in `requirements/extra-pdf-image.in`,
which is used for repairing PDF.
* Add new denpendenct `pypdf` in `requirements/extra-pdf-image.in`,
which is used to find the error page from entire document by reading the
PDF file again (can't find a way to split pdf in PDFminer).
* Refactor the `is null` check for `get_uris_from_annots`, since the
root cause is that `get_uris` passed a None `annots` to
`get_uris_from_annots`, so the Null check should happen in `get_uris`.
* Add more type protection in `get_uris_from_annots` when using any
`PDFObjRef.resolve()` as `dict` (it could still be a `PDFObjRef`). This
should fix :
* https://github.com/Unstructured-IO/unstructured/issues/1922 where
`annotation_dict` is a `PDFObjRef`
* https://github.com/Unstructured-IO/unstructured/issues/1921 where
`rect` is a `PDFObjRef`
### Test
Added three test files (both are larger than 500 KB) for unittests to
test:
* Repair entire doc
* Repair one page
* Reprocess failure after repairing one page (just return the elements
before error page in this case).
* Also seems like splitting the document into smaller pages could fix
this problem, but not sure why. For example, I saw error from reprocess
in the whole
[cancer.pdf](https://github.com/Unstructured-IO/unstructured/files/13461616/cancer.pdf)
doc, but no error when i split the pdf by error page....
* tested if i can repair the entire doc again in this case, saw other
error which means repairing is not helping imo
* PDFminer can process the whole doc after pikepdf only repaired the
entire doc in the first place, but we can't repair by pages in this way
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Closes#1843
Ingest connector for HubSpot. Supports:
- Calls: Logs from calls related to contacts, companies and tickets
- Communications: Logs from SMS/Whatsapp related to contacts, companies
and tickets
- Notes: Notes related to CRM notes
- Products: CRM products
- Emails: Logs from emails sent to CRM objects.
- Tasks: CRM tasks
From each record, `body/`description`information is grabbed. When a
title property is available, this is registered at the beggining of the
output file. The CLI receives three params:
- `api-token`: [Private
app](https://developers.hubspot.com/docs/api/private-apps) token.
- `object-types: One of the noted supported objects in the form of a
comma separated list: `calls,products,tasks`
- `custom-properties`: Custom properties to grab information from. Must
be in the form
`<object_type>:<custom_property_id>,<object_type>:<custom_property_id>`
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rvztz <rvztz@users.noreply.github.com>
### Description
There are some source ingest connectors that would be more efficient to
read the content in batches rather than use an entire process per
document. For example, reading from ElasticSearch. Given an index with
possible hundreds of documents, reading each one individually is not as
optimal as reading in batches. To try and maintain as much of the ingest
doc paradigm already being supported, a new class `BaseIngestDocBatch`
was added to handle reading in batches. It produces a list of
`BaseSingleIngestDoc` which is what all current implementations were
renamed to. This list is generated after it runs its `get_files` method.
Past the source node, all other steps in the pipeline should not be
affected, this is just an optimization for the read step.
**Additional Changes:**
* Removed use of jq and instead converted this into a fields filter on
the content to let the database handle the filtering and limit the
amount of data being pulled in.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Description
This adds the basic implementation of pushing the generated json output
of partition to mongodb. None of this code provisions the mondo db
instance so things like adding a search index around the embedding
content must be done by the user. Any sort of schema validation would
also have to take place via user-specific configuration on the database.
This update makes no assumptions about the configuration of the database
itself.
In DOCX, like HTML, a table cell can itself contain a table. This is not
uncommon and is typically used for formatting purposes.
When a DOCX table is nested, create nested HTML tables to reflect that
structure and create a plain-text table with captures all the text in
nested tables, formatting it as a reasonable facsimile of a table.
This implements the solution described and spiked in PR #1952.
---------
Co-authored-by: Bruno Bornsztein <bruno.bornsztein@gmail.com>
Summary:
Close: https://github.com/Unstructured-IO/unstructured/issues/1920
* stop passing in empty string from `languages` to tesseract, which will
result in passing empty string to language config `-l` for the tesseract
CLI
* also stop passing in duplicate language code from `languages` to
tesseract OCR
* if we failed to convert any iso languages from the `languages`
parameter, proceed OCR with `eng` as default
### Test
* First confirm the tesseract error `Estimating resolution as X` before
this:
* on the `unstructured-api` repo with main branch, run `make
run-web-app`
* curl to test error from empty string, or just any wrong input like `-F
'languages="eng,de"'`:
```
curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'languages=""' \
-F 'strategy=hi_res' \
-F 'pdf_infer_table_structure=True' \
| jq -C . | less -R
```
* after this change:
* in your unstructured API env, cd to unstructured repo and install it locally with `pip install -e .`
* check out to this branch
* run `make run-web-app` again in api repo
* the curl command return output and see warning in log
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Closes#1782
This PR:
- Extends ingest pipeline so that it is possible to select an embedding
provider from a range of providers
- Modifies the ingest embedding test to be a diff test, since the
embedding vectors are reproducible after supporting multiple providers
Additional info on the chosen provider for the test:
- Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic
even when there's no seed set
- Took 6.84s to pass a unit test with the provider (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs in local, making it
zero cost
For all these reasons, testing embedding modules with the Huggingface
model seems to be making sense
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* **Removed `ebooklib` as a dependency** `ebooklib` is licensed under
AGPL3, which is incompatible with the Apache 2.0 license. Thus it is
being removed.
### Description
Currently the CI caches the CI dependencies but uses the hash of all
files in `requirements/`. This isn't completely accurate since the
ingest dependencies are installed in a later step and don't affect the
cached environment. As part of this PR:
* ingest dependencies were isolated into their own folder in
`requirements/ingest/`
* A new cache setup was introduced in the CI to restore the base cache
-> install ingest dependencies -> cache it with a new id
* new make target created to install all ingest dependencies via `pip
install -r ...`
* updates to Dockerfile to use `find ...` to install all dependencies,
avoiding the need to update this when new deps are added.
* update to pip-compile script to run over all `*.in` files in
`requirements/`