249 Commits

Author SHA1 Message Date
Ahmet Melek
d46792214a
feat: add vertexai embeddings (#2693)
This PR:
- Adds VertexAI embeddings as an embedding provider

Testing
- Tested with pinecone destination connector on
[this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693)
job run.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2024-03-28 21:15:36 +00:00
Christine Straub
08fafc564f
Fix: embedded text not getting merged with inferred elements (#2679)
This PR is the second part of fixing "embedded text not getting merged
with inferred elements", the first part is done in
https://github.com/Unstructured-IO/unstructured-inference/pull/331.

### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when removing pdfminer (embedded) elements that were merged with
inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
introduced in the [first
part](https://github.com/Unstructured-IO/unstructured-inference/pull/331)
when removing pdfminer (embedded) elements that were merged with
inferred elements
- bump `unstructured-inference` to 0.7.25

### Testing
PDF:
[pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)

```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```

```
elements = partition_pdf(
    filename="pwc-financial-statements-p114.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image"],
)

table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)
```

---------

Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-03-23 03:59:23 +00:00
Roman Isecke
4ff6a5b78e
Roman/bugfix support bedrock embeddings (#2650)
### Description
This PR resolved the following open issue:
[bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319).
To do so, the following changes were made:
* All aws configs were added as input parameters to the CLI
* These were mapped to the bedrock embedder when an embedder is
generated via `get_embedder`
* An ingest test was added to call the aws bedrock service
* Requirements for boto were bumped because the first version to
introduce the bedrock runtime, which is required to hit the bedrock
service, was introduced in version `1.34.63`, which was ahead of the
version of boto pinned.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-03-21 18:21:04 +00:00
David Potter
9177aa20a8
feature CORE-3985: add Clarifai destination connector (#2633)
Thanks to @mogith-pn from Clarifai we have a new destination connector!

This PR intends to add Clarifai as a ingest destination connector.

Access via CLI and programmatic.
Documentation and Examples.
Integration test script.
2024-03-21 16:36:21 +00:00
David Potter
5b92e0bb6b
bug CORE-4089: Onedrive partitioning fails - datetime formatting error (#2638)
Fixes Onedrive bug the same way Ryan fixed the Sharepoint error. (both
are microsoft products)
https://github.com/Unstructured-IO/unstructured/pull/2591
https://github.com/Unstructured-IO/unstructured/pull/2592/files

We are seeing occurrences of inconsistency in the timestamps returned by
Onedrive when fetching created and modified dates. Furthermore, in
future versions of this library, a datetime object will be returned
rather than a string.

Changes
This adds logic to guarantee Onedrive dates will be properly formatted
as ISO, regardless of the format provided by the onedrive library.
Bumps timestamp format output to include timezone offset (as we do with
others)

Adds unit tests for isofomat.

json_to_dict already unit tested here:

https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py

Adds small change for AstraDB to allow them to see what source called
their api
2024-03-15 14:01:05 +00:00
John
3783b44d0b
fix documentation html links example (#2608)
Closes  #2577

Testing:
```
from unstructured.partition.html import partition_html

cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)
links = []

for element in elements:
    if element.metadata.link_urls:
        relative_link = element.metadata.link_urls[0][1:]
        if relative_link.startswith("2024"):
            links.append(f"{cnn_lite_url}{relative_link}")
            
print(links)
```

---------

Co-authored-by: ron-unstructured <ronny@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-04 18:33:42 +00:00
David Potter
e8ec09c8b9
feat: astra dest connector (#2571)
Thanks to Eric Hare @erichare at DataStax we have a new destination
connector.

This Pull Request implements an integration with [Astra
DB](https://datastax.com) which allows for the Astra DB Vector Database
to be compatible with Unstructured's set of integrations.

To create your Astra account and authenticate with your
`ASTRA_DB_APPLICATION_TOKEN`, and `ASTRA_DB_API_ENDPOINT`, follow these
steps:

1. Create an account at https://astra.datastax.com
2. Login and create a new database
3. From the database page, in the right hand panel, you will find your
API Endpoint
4. Beneath that, you can create a Token to be used

Some notes about Astra DB:

- Astra DB is a Vector Database which allows for high-performance
database transactions, and enables modern GenAI apps [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html)
- It supports similarity search via a number of methods [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics)
- It also supports non-vector tables / collections
2024-02-23 20:50:50 +00:00
Klaijan
d06936d35a
feat: modify test-ingest-src and evaluation-metrics to allow EXPORT_DIR (#2551)
The current `test-ingest-src.sh` and `evaluation-metrics` do not allow
passing the `EXPORT_DIR` (`OUTPUT_ROOT` in `evaluation-metrics`). It is
currently saving at the current working directory
(`unstructured/test_unstructured_ingest`). When running the eval from
`core-product`, all outputs is now saved at
`core-product/upstream-unstructured/test_unstructured_ingest` which is
undesirable.

This PR modifies two scripts to accommodate such behavior:
1. `test-ingest-src.sh` - assign `EVAL_OUTPUT_ROOT` to the value set
within the environment if exist, or the current working directory if
not. Then calls to run `evaluation-metrics.sh`.
2. `evaluation-metrics.sh` - accepting param from `test-ingest-src.sh`
if exist, or to the value set within the environment if exist, or the
current directory if not.

(Note: I also add param to `evaluation-metrics.sh` because it makes
sense to allow a separate run to be able to specify an export directory)

This PR should work in sync with another PR under `core-product`, which
I will add the link here later.

**To test:**

Run the script below, change `$SCRIPT_DIR` as needed to see the result.

```
export OVERWRITE_FIXTURES=true

./upstream-unstructured/test_unstructured_ingest/src/s3.sh

SCRIPT_DIR=$(dirname "$(realpath "$0")")
bash -x ./upstream-unstructured/test_unstructured_ingest/evaluation-metrics.sh text-extraction "$SCRIPT_DIR"
```

----

This PR also updates the requirements by `make pip-compile` since the
`click` module was not found.
2024-02-17 05:21:15 +00:00
Ahmet Melek
be71633415
refactor: isolate ingest dependencies into local scopes (#2509)
This PR: 
- Moves ingest dependencies into local scopes to be able to import
ingest connector classes without the need of installing imported
external dependencies. This allows lightweight use of the classes (not
the instances. to use the instances as intended you'll still need the
dependencies).
- Upgrades the embed module dependencies from `langchain` to
`langchain-community` module (to pass CI [rather than introducing a
pin])
- Does pip-compile
- Does minor refactors in other files to pass `ruff 2.0` checks which
were introduced by pip-compile
2024-02-06 21:28:55 +00:00
qued
399dd60311
build(deps): unpin pillow (#2472)
Removed `pillow` pin and recompiled. I think it was originally there to
address a conflict, which, as far as I can tell, no longer exists. Also
a security vulnerability was discovered in the older version of
`pillow`.

#### Testing:

CI should pass.
2024-01-30 21:29:08 +00:00
qued
007fc45739
chore: new black changes (#2473)
Update `black` and apply changes to affected files. I separated this PR
so we can have a look at the changes and decide whether we want to:
1. Go forward with the new formatting
2. Change the black config to make the old formatting valid
3. Get rid of black entirely and just use `ruff`
4. Do something I haven't thought of
2024-01-30 17:12:35 +00:00
John
db67805ec6
feat: add support for partitioning .heic files (#2454)
.heic files are an image filetype we have not supported.

#### Testing
```
from unstructured.partition.image import partition_image

png_filename = "example-docs/DA-1p.png"
heic_filename = "example-docs/DA-1p.heic"

png_elements = partition_image(png_filename, strategy="hi_res")
heic_elements = partition_image(heic_filename, strategy="hi_res")

for i in range(len(heic_elements)):
	print(heic_elements[i].text == png_elements[i].text)
```

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-01-30 04:49:00 +00:00
David Potter
9fea85dc21
fix: remove none value keys from flattened dictionary (#2442)
When a partitioned or embedded document json has null values, those get
converted to a dictionary with None values.

This happens in the metadata. I have not see it in other keys.

Chroma and Pinecone do not like those None values. 

`flatten_dict` has been modified with a `remove_none` arg to remove keys
with None values.

Also, Pinecone has been pinned at 2.2.4 because at 3.0 and above it
breaks our code.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-23 21:52:11 +00:00
Roman Isecke
a8de52e94f
feat: databricks volumes dest added (#2391)
### Description
This adds in a destination connector to write content to the Databricks
Unity Catalog Volumes service. Currently there is an internal account
that can be used for testing manually but there is not dedicated account
to use for testing so this is not being added to the automated ingest
tests that get run in the CI.

To test locally: 
```shell
#!/usr/bin/env bash

path="testpath/$(uuidgen)"

PYTHONPATH=. python ./unstructured/ingest/main.py local \
    --num-processes 4 \
    --output-dir azure-test \
    --strategy fast \
    --verbose \
    --input-path example-docs/fake-memo.pdf \
    --recursive \
    databricks-volumes \
    --catalog "utic-dev-tech-fixtures" \
    --volume "small-pdf-set" \
    --volume-path "$path" \
    --username "$DATABRICKS_USERNAME" \
    --password "$DATABRICKS_PASSWORD" \
    --host "$DATABRICKS_HOST"
```
2024-01-23 01:25:51 +00:00
Matt Robinson
2d3a7f1c48
fix: fix table index error by bumping unstructured-inference (#2430)
### Summary

Closes #2417. Bumps `unstructured-inference` to pull in the fix
implemented in
https://github.com/Unstructured-IO/unstructured-inference/pull/317
2024-01-19 22:42:32 +00:00
David Potter
bc791d53f4
feat: add opensearch source and destination connector (#2349)
Adds OpenSearch as a source and destination.

Since OpenSearch is a fork of Elasticsearch, these connectors rely
heavily on inheriting the Elasticsearch connectors whenever possible.

- Adds OpenSearch source connector to be able to ingest documents from
OpenSearch.
- Adds OpenSearch destination connector to be able to ingest documents
from any supported source, embed them and write the embeddings /
documents into OpenSearch.
- Defines an example unstructured elements schema for users to be able
to setup their unstructured OpenSearch indexes easily.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-17 04:31:49 +00:00
John
1f0826ab0a
pin unstructured-client (#2392)
Replacement for #2311 since python 3.8 was dropped as a supported
version.

Unstructured-client added `api_key_auth` as a param to
`UnstructuredClient` in [version
0.9.0](8c93115c92).

This pins the version of `unstructured-client` so users do not receive
`TypeError: UnstructuredClient.__init__() got an unexpected keyword
argument 'api_key_auth'`
2024-01-15 17:26:38 +00:00
Roman Isecke
b37b4689bc
drop python3.8 (#2372)
### Description
Remove all uses of python3.8

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-01-09 23:37:30 +00:00
Christine Straub
e2f0de3c50
chore: bump unstructured-inference=0.7.21 (#2361) 2024-01-08 21:05:04 +00:00
rvztz
950e5d68f9
feat: adds postgresql/sqlite destination connector (#2005)
- Adds a destination connector to upload processed output into a
PostgreSQL/Sqlite database instance.
- Users are responsible to provide their instances. This PR includes a
couple of configuration examples.
- Defines the scripts required to setup a PostgreSQL instance with the
unstructured elements schema.
- Validates postgres/pgvector embedding storage and retrieval

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-04 19:33:16 +00:00
ryannikolaidis
dd1443ab6f
feat: add Qdrant ingest destination connector (#2338)
This PR intends to add [Qdrant](https://qdrant.tech/) as a supported
ingestion destination.

- Implements CLI and programmatic usage.
- Documentation update
- Integration test script

---
Clone of #2315 to run with CI secrets

---------

Co-authored-by: Anush008 <anushshetty90@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2024-01-02 22:08:20 +00:00
David Potter
4b8352e0f5
feat: add chroma destination connector (#2240)
Adds Chroma (also known as ChromaDB) as a vector destination.

Currently Chroma is an in-memory single-process oriented library with
plans of a hosted and/or more production ready solution
-https://docs.trychroma.com/deployment

Though they now claim to support multiple Clients hitting the database
at once, I found that it was inconsistent. Sometimes multiprocessing
worked (maybe 1 out of 3 times) But the other times I would get
different errors. So I kept it single process.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2023-12-19 16:58:23 +00:00
David Potter
cde11d1eb0
feat: Add sftp source connector (#2163)
Adds source connector for SFTP which uses fsspec and paramiko via
fsspec. Paramiko is the standard sftp package for python used in pysftp
etc...

```
--username foo \
--password bar \
--remote-url sftp://localhost:47474/upload/
```

Will only download a specifically requested file if it has an extension.
(i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will
treat any other remote_url as a folder path. This is intentional.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2023-12-07 19:33:19 +00:00
Christine Straub
ed76b11b1a
Refactor: support image extraction (#2201)
### Summary
This PR is the second part of the "image extraction" refactor to move it
from unstructured-inference repo to unstructured repo, the first part is
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/299. This
PR adds logic to support extracting images.

### Testing

`git clone -b refactor/remove_image_extraction_code --single-branch
https://github.com/Unstructured-IO/unstructured-inference.git && cd
unstructured-inference && pip install -e . && cd ../`

```
elements = partition_pdf(
        filename="example-docs/embedded-images.pdf",
        strategy="hi_res",
        extract_images_in_pdf=True,
    )

print("\n\n".join([str(el) for el in elements]))
```
2023-12-05 18:22:29 +00:00
Roman Isecke
c5cb216ac8
chore: lint for print statements in ingest code (#2215)
### Description
Given the filtering in the ingest logger, anything going to console
should go through that. This adds a linter that only checks for
`print()` statements in the ingest code and ignored it elsewhere for
now.
2023-12-05 16:42:23 +00:00
rvztz
ce905dd098
feat: Weaviate destination connector (#1963)
Closes #1781.
- Adds a Weaviate destination connector
- The connector receives a host for the weaviate instance and a weaviate
class name.
- Defines a weaviate schema for json elements.
- Defines the pre-processing to conform unstructured's schema to the
proposed weaviate schema.
2023-12-01 22:27:41 +00:00
Christine Straub
69d0ee1aea
Refactor: support merging extracted layout with inferred layout (#2158)
### Summary
This PR is the second part of `pdfminer` refactor to move it from
`unstructured-inference` repo to `unstructured` repo, the first part is
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/294. This
PR adds logic to merge the extracted layout with the inferred layout.

The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
`inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
document (as data/filename) to the `pdfminer_processing` module, which
first opens the document (create temp file/dir as needed), and splits
the document by pages
* if is_image is `True`, return the passed
inferred_layout(DocumentLayout)
  * if is_image is `False`:
* get extracted_layout (TextRegions) from the passed
document(data/filename) by pdfminer
* merge `extracted_layout` (TextRegions) with the passed
`inferred_layout` (DocumentLayout)
* return the `inferred_layout `(DocumentLayout) with updated elements
(all merged LayoutElements) as merged_layout (DocumentLayout)
* pass merged_layout and the document (as data/filename) to the `OCR`
module, which first opens the document (create temp file/dir as needed),
and splits the document by pages (convert PDF pages to image pages for
PDF file)

### Note
This PR also fixes issue #2164 by using functionality similar to the one
implemented in the `fast` strategy workflow when extracting elements by
`pdfminer`.

### TODO
* image extraction refactor to move it from `unstructured-inference`
repo to `unstructured` repo
* improving natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`
2023-12-01 20:56:31 +00:00
John
e5bdf7fb43
chore: unstructured python client (#2195)
### Summary
Closes #2033
Updates `partition_via_api` to use `UnstructuredClient` for api calls
instead of `requests`.
Updates associated tests.

Note: This PR does **not** update `partition_multiple_via_api` as
documentation in `unstructured-python-client` indicates it does not
support multiple files. A new issue should be opened to add that
functionality to `unstructured-python-client`.

---------

Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-12-01 18:49:59 +00:00
Ahmet Melek
ed08773de7
feat: add pinecone destination connector (#1774)
Closes https://github.com/Unstructured-IO/unstructured/issues/1414
Closes #2039 

This PR:
- Uses Pinecone python cli to implement a destination connector for
Pinecone and provides the ingest readme requirements
[(here)](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest#the-checklist)
for the connector
- Updates documentation for the s3 destination connector
- Alphabetically sorts setup.py contents
- Updates logs for the chunking node  in ingest pipeline
- Adds a baseline session handle implementation for destination
connectors, to be able to parallelize their operations
- For the
[bug](https://github.com/Unstructured-IO/unstructured/issues/1892)
related to persisting element data to ingest embedding nodes; this PR
tests the
[solution](https://github.com/Unstructured-IO/unstructured/pull/1893)
with its ingest test
- Solves a bug on ingest chunking params with [bugfix on chunking params
and implementing related
test](69e1949a6f)

---------

Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2023-11-29 22:37:32 +00:00
Yuming Long
92dae8cd1a
Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137)
### Summary

Add a procedure to repair PDF when the PDF structure is invalid for
`PDFminer` to process.

This PR handles two cases of `PSSyntaxError Invalid dictionary
construct: ...`:
* PDFminer open entire document and create pages generator on
`PDFPage.get_pages(fp)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655715023/?alert_rule_id=14681339&alert_type=issue&notification_uuid=d8db4cf4-686f-4504-8a22-74a79a8e966f&project=4505909127086080&referrer=slack)
* PDFminer's interpreter process a single page on
`interpreter.process_page(page)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655898781/?referrer=slack&notification_uuid=0d929d48-f490-4db8-8dad-5d431c8460bc&alert_rule_id=14681339&alert_type=issue)

**Additional tech details:**
* Add new dependency `pikepdf` in `requirements/extra-pdf-image.in`,
which is used for repairing PDF.
* Add new denpendenct `pypdf` in `requirements/extra-pdf-image.in`,
which is used to find the error page from entire document by reading the
PDF file again (can't find a way to split pdf in PDFminer).
* Refactor the `is null` check for `get_uris_from_annots`, since the
root cause is that `get_uris` passed a None `annots` to
`get_uris_from_annots`, so the Null check should happen in `get_uris`.
* Add more type protection in `get_uris_from_annots` when using any
`PDFObjRef.resolve()` as `dict` (it could still be a `PDFObjRef`). This
should fix :
* https://github.com/Unstructured-IO/unstructured/issues/1922 where
`annotation_dict` is a `PDFObjRef`
* https://github.com/Unstructured-IO/unstructured/issues/1921 where
`rect` is a `PDFObjRef`

### Test
Added three test files (both are larger than 500 KB) for unittests to
test:
* Repair entire doc
* Repair one page
* Reprocess failure after repairing one page (just return the elements
before error page in this case).
* Also seems like splitting the document into smaller pages could fix
this problem, but not sure why. For example, I saw error from reprocess
in the whole
[cancer.pdf](https://github.com/Unstructured-IO/unstructured/files/13461616/cancer.pdf)
doc, but no error when i split the pdf by error page....
* tested if i can repair the entire doc again in this case, saw other
error which means repairing is not helping imo
* PDFminer can process the whole doc after pikepdf only repaired the
entire doc in the first place, but we can't repair by pages in this way

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-11-29 19:00:15 +00:00
rvztz
50b1431c9e
rvztz/hubspot ingest connector (#1760)
Closes #1843 

Ingest connector for HubSpot. Supports:
- Calls: Logs from calls related to contacts, companies and tickets
- Communications: Logs from SMS/Whatsapp related to contacts, companies
and tickets
- Notes: Notes related to CRM notes
- Products: CRM products
- Emails: Logs from emails sent to CRM objects.
- Tasks: CRM tasks

From each record, `body/`description`information is grabbed. When a
title property is available, this is registered at the beggining of the
output file. The CLI receives three params:
- `api-token`: [Private
app](https://developers.hubspot.com/docs/api/private-apps) token.
- `object-types: One of the noted supported objects in the form of a
comma separated list: `calls,products,tasks`
- `custom-properties`: Custom properties to grab information from. Must
be in the form
`<object_type>:<custom_property_id>,<object_type>:<custom_property_id>`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rvztz <rvztz@users.noreply.github.com>
2023-11-28 23:07:57 +00:00
Roman Isecke
2bb463d006
feat: support both single and batch ingest docs (#2105)
### Description
There are some source ingest connectors that would be more efficient to
read the content in batches rather than use an entire process per
document. For example, reading from ElasticSearch. Given an index with
possible hundreds of documents, reading each one individually is not as
optimal as reading in batches. To try and maintain as much of the ingest
doc paradigm already being supported, a new class `BaseIngestDocBatch`
was added to handle reading in batches. It produces a list of
`BaseSingleIngestDoc` which is what all current implementations were
renamed to. This list is generated after it runs its `get_files` method.
Past the source node, all other steps in the pipeline should not be
affected, this is just an optimization for the read step.

**Additional Changes:**
* Removed use of jq and instead converted this into a fields filter on
the content to let the database handle the filtering and limit the
amount of data being pulled in.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-11-27 19:25:30 +00:00
Yuming Long
ccda93b0d1
chore: bump inference to 0.7.15 release unst 0.11.0 (#2110)
^^
2023-11-20 18:20:03 +00:00
Roman Isecke
b8af2f18bb
add mongo db destination connector (#2068)
### Description
This adds the basic implementation of pushing the generated json output
of partition to mongodb. None of this code provisions the mondo db
instance so things like adding a search index around the embedding
content must be done by the user. Any sort of schema validation would
also have to take place via user-specific configuration on the database.
This update makes no assumptions about the configuration of the database
itself.
2023-11-16 22:40:22 +00:00
Austin Walker
2931cb38e8
fix: handle KeyError: 'N' for certain pdfs (#2072)
Closes #2059.

We've found some pdfs that throw an error in pdfminer. These files use a
ICCBased color profile but do not include an expected value `N`. As a
workaround, we can wrap pdfminer and drop any colorspace info, since we
don't need to render the document.

To verify, try to partition the document in the linked issue.

```
elements = partition(filename="google-2023-environmental-report_condensed.pdf", strategy="fast")
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-11-15 01:59:05 +00:00
Steve Canny
80fe07b89f
fix: #1952 support nested docx tables (#2020)
In DOCX, like HTML, a table cell can itself contain a table. This is not
uncommon and is typically used for formatting purposes.

When a DOCX table is nested, create nested HTML tables to reflect that
structure and create a plain-text table with captures all the text in
nested tables, formatting it as a reasonable facsimile of a table.

This implements the solution described and spiked in PR #1952.

---------

Co-authored-by: Bruno Bornsztein <bruno.bornsztein@gmail.com>
2023-11-08 00:37:21 +00:00
Yuming Long
ad14321016
Chore: don't pass empty language code to tesseract CLI (#1996)
Summary:

Close: https://github.com/Unstructured-IO/unstructured/issues/1920

* stop passing in empty string from `languages` to tesseract, which will
result in passing empty string to language config `-l` for the tesseract
CLI
* also stop passing in duplicate language code from `languages` to
tesseract OCR
* if we failed to convert any iso languages from the `languages`
parameter, proceed OCR with `eng` as default
  
### Test
* First confirm the tesseract error `Estimating resolution as X` before
this:
* on the `unstructured-api` repo with main branch, run `make
run-web-app`
* curl to test error from empty string, or just any wrong input like `-F
'languages="eng,de"'`:
 ```
curl -X 'POST'  'http://0.0.0.0:8000/general/v0/general' \
  -H 'accept: application/json'   \
-H 'Content-Type: multipart/form-data' \
 -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'languages=""'  \
-F 'strategy=hi_res'  \
-F 'pdf_infer_table_structure=True' \
 | jq -C . | less -R
``` 

* after this change:
   * in your unstructured API env, cd to unstructured repo and install it locally with `pip install -e .`
   * check out to this branch
   * run `make run-web-app` again in api repo
   * the curl command return output and see warning in log

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-11-06 19:30:12 -06:00
Yao You
38ab35dcb6
fix: make pip compile (#2015)
- add missing make file in ingest folder
2023-11-06 16:26:12 -06:00
Ahmet Melek
ca78dc737a
feat: extend ingest options to support multiple embedding modules, add deterministic ingest test for embeddings (#1918)
Closes #1782 

This PR:
- Extends ingest pipeline so that it is possible to select an embedding
provider from a range of providers
- Modifies the ingest embedding test to be a diff test, since the
embedding vectors are reproducible after supporting multiple providers

Additional info on the chosen provider for the test:
- Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic
even when there's no seed set
- Took 6.84s to pass a unit test with the provider (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs in local, making it
zero cost

For all these reasons, testing embedding modules with the Huggingface
model seems to be making sense

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-11-06 12:26:12 +00:00
Yao You
69265685ea
build(deps): add makefile to requirements (#1295)
This PR resolves #1294 by adding a Makefile to compile requirements.
This makefile respects the dependencies between file and will compile
them in order. E.g., extra-*.txt will be compiled __after__ base.txt is
updated.

Test locally by simply running `make pip-compile` or `cd requirements &&
make clean && make all`

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-11-02 10:17:35 -05:00
qued
808b4ced7a
build(deps): remove ebooklib (#1878)
* **Removed `ebooklib` as a dependency** `ebooklib` is licensed under
AGPL3, which is incompatible with the Apache 2.0 license. Thus it is
being removed.
2023-10-26 12:22:40 -05:00
qued
d79f633ada
build(deps): add typing extensions dep (#1835)
Closes #1330.

Added `typing-extensions` as an explicit dependency (it was previously
an implicit dependency via `dataclasses-json`).

This dependency should be explicit, since we import from it directly in
`unstructured.documents.elements`. This has the added benefit that
`TypedDict` will be available for Python 3.7 users.

Other changes:
* Ran `pip-compile`
* Fixed a bug in `version-sync.sh` that caused an error when using the
sync functionality when syncing to a dev version from a release version.

#### Testing:

To test the Python 3.7 functionality, in a Python 3.7 environment
install the base requirements and run
```python
from unstructured.documents.elements import Element

```
This also works on `main` as `typing_extensions` is a requirement.
However if you `pip uninstall typing-extensions`, and run the above
code, it should fail. So this update makes sure `typing-extensions`
doesn't get lost if the other dependencies move around.

To reproduce the `version-sync.sh` bug that was fixed, in `main`,
increment the most recent version in `CHANGELOG.md` while leaving the
version in `__version__.py`. Then add the following lines to
`version-sync.sh` to simulate a particular set of circumstances,
starting on line 114:

```
MAIN_IS_RELEASE=true
CURRENT_BRANCH="something-not-main"
```

Then run `make version-sync`.

The expected behavior is that the version in `__version__.py` is changed
to the new version to match `CHANGELOG.md`, but instead it exits with an
error.

The fix was to only do the version incrementation check when the script
is running in `-c` or "check" mode.
2023-10-24 19:19:09 +00:00
Roman Isecke
4802332de0
Roman/optimize ingest ci (#1799)
### Description
Currently the CI caches the CI dependencies but uses the hash of all
files in `requirements/`. This isn't completely accurate since the
ingest dependencies are installed in a later step and don't affect the
cached environment. As part of this PR:
* ingest dependencies were isolated into their own folder in
`requirements/ingest/`
* A new cache setup was introduced in the CI to restore the base cache
-> install ingest dependencies -> cache it with a new id
* new make target created to install all ingest dependencies via `pip
install -r ...`
* updates to Dockerfile to use `find ...` to install all dependencies,
avoiding the need to update this when new deps are added.
* update to pip-compile script to run over all `*.in` files in
`requirements/`
2023-10-24 14:54:00 +00:00
qued
7fdddfbc1e
chore: improve kwarg handling (#1810)
Closes `unstructured-inference` issue
[#265](https://github.com/Unstructured-IO/unstructured-inference/issues/265).

Cleaned up the kwarg handling, taking opportunities to turn instances of
handling kwargs as dicts to just using them as normal in function
signatures.

#### Testing:

Should just pass CI.
2023-10-23 04:48:28 +00:00
Yuming Long
ce40cdc55f
Chore (refactor): support table extraction with pre-computed ocr data (#1801)
### Summary

Table OCR refactor, move the OCR part for table model in inference repo
to unst repo.
* Before this PR, table model extracts OCR tokens with texts and
bounding box and fills the tokens to the table structure in inference
repo. This means we need to do an additional OCR for tables.
* After this PR, we use the OCR data from entire page OCR and pass the
OCR tokens to inference repo, which means we only do one OCR for the
entire document.

**Tech details:**
* Combined env `ENTIRE_PAGE_OCR` and `TABLE_OCR` to `OCR_AGENT`, this
means we use the same OCR agent for entire page and tables since we only
do one OCR.
* Bump inference repo to `0.7.9`, which allow table model in inference
to use pre-computed OCR data from unst repo. Please check in
[PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256).
* All notebooks lint are made by `make tidy`
* This PR also fixes
[issue](https://github.com/Unstructured-IO/unstructured/issues/1564),
I've added test for the issue in
`test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages`
* Add same scaling logic to image [similar to previous Table
OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113),
but now scaling is applied to entire image

### Test
* Not much to manually testing expect table extraction still works
* But due to change on scaling and use pre-computed OCR data from entire
page, there are some slight (better) changes on table output, here is an
comparison on test outputs i found from the same test
`test_partition_image_with_table_extraction`:

screen shot for table in `layout-parser-paper-with-table.jpg`:
<img width="343" alt="expected"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c">
before refactor:
<img width="709" alt="before"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e">
after refactor:
<img width="705" alt="after"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d">

### TODO
(added as a ticket) Still have some clean up to do in inference repo
since now unst repo have duplicate logic, but can keep them as a fall
back plan. If we want to remove anything OCR related in inference, here
are items that is deprecated and can be removed:
*
[`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77)
(already noted in code)
* parameter `extract_tables` in inference
*
[`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88)
*
[`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197)
* env `TABLE_OCR` 

### Note
if we want to fallback for an additional table OCR (may need this for
using paddle for table), we need to:
* pass `infer_table_structure` to inference with `extract_tables`
parameter
* stop passing `infer_table_structure` to `ocr.py`

---------

Co-authored-by: Yao You <yao@unstructured.io>
2023-10-21 00:24:23 +00:00
Mallori Harrell
00635744ed
feat: Adds local embedding model (#1619)
This PR adds a local embedding model option as an alternative to using
our OpenAI embedding brick. This brick uses LangChain's
HuggingFacEmbeddings.
2023-10-19 11:51:36 -05:00
Jack Retterer
b8f24ba67e
Added AWS Bedrock embeddings (#1738)
Summary: Added support for AWS Bedrock embeddings. Leverages
"amazon.titan-tg1-large" for the embedding model.

Test

- find your aws secret access key and key id; make sure the account has
access to bedrock's tian embed model
- follow the instructions in
d5e797cd44/docs/source/bricks/embedding.rst (bedrockembeddingencoder)

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>
2023-10-18 19:36:51 -05:00
cragwolfe
9ea3734fd0
fix: memory issue resolved for chipper v2 (#1772)
Co-authored-by: Austin Walker <austin@unstructured.io>
Co-authored-by: Austin Walker <awalk89@gmail.com>
2023-10-17 14:37:25 +00:00
Roman Isecke
b265d8874b
refactoring linting (#1739)
### Description
Currently linting only takes place over the base unstructured directory
but we support python files throughout the repo. It makes sense for all
those files to also abide by the same linting rules so the entire repo
was set to be inspected when the linters are run. Along with that
autoflake was added as a linter which has a lot of added benefits such
as removing unused imports for you that would currently break flake and
require manual intervention.

The only real relevant changes in this PR are in the `Makefile`,
`setup.cfg`, and `requirements/test.in`. The rest is the result of
running the linters.
2023-10-17 12:45:12 +00:00
qued
cf31c9a2c4
fix: use nx to avoid recursion limit (#1761)
Fixes recursion limit error that was being raised when partitioning
Excel documents of a certain size.

Previously we used a recursive method to find subtables within an excel
sheet. However this would run afoul of Python's recursion depth limit
when there was a contiguous block of more than 1000 cells within a
sheet. This function has been updated to use the NetworkX library which
avoids Python recursion issues.

* Updated `_get_connected_components` to use `networkx` graph methods
rather than implementing our own algorithm for finding contiguous groups
of cells within a sheet.
* Added a test and example doc that replicates the `RecursionError`
prior to the change.
*  Added `networkx` to `extra_xlsx` dependencies and `pip-compile`d.

#### Testing:
The following run from a Python terminal should raise a `RecursionError`
on `main` and succeed on this branch:
```python
import sys
from unstructured.partition.xlsx import partition_xlsx
old_recursion_limit = sys.getrecursionlimit()
try:
    sys.setrecursionlimit(1000)
    filename = "example-docs/more-than-1k-cells.xlsx"
    partition_xlsx(filename=filename)
finally:
    sys.setrecursionlimit(old_recursion_limit)

```
Note: the recursion limit is different in different contexts. Checking
my own system, the default in a notebook seems to be 3000, but in a
terminal it's 1000. The documented Python default recursion limit is
1000.
2023-10-14 19:38:21 +00:00