1447 Commits

Author SHA1 Message Date
Ahmet Melek
a9ad8ac8d1
fix: update flatten dict to support flattening tuples (#2423)
This PR updates flatten_dict function to support flattening tuples. 

This is necessary for objects like Coordinates when the object has not
been written to disk and therefore has not been converted to a list
before being flattened.
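
A minimal sketch of the idea, not the library's actual `flatten_dict` code: the function name `flatten_with_tuples`, the `_` separator, and the index-based key scheme are assumptions made for illustration.

```python
from typing import Any, Dict


def flatten_with_tuples(
    data: Dict[str, Any], parent_key: str = "", sep: str = "_"
) -> Dict[str, Any]:
    """Flatten nested dicts, treating lists and tuples as index-keyed containers."""
    flat: Dict[str, Any] = {}
    for key, value in data.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten_with_tuples(value, new_key, sep))
        elif isinstance(value, (list, tuple)):
            # Tuples get the same treatment as lists: one key per index.
            indexed = {str(i): item for i, item in enumerate(value)}
            flat.update(flatten_with_tuples(indexed, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Hypothetical coordinates-like payload that still holds tuples.
points = {"coordinates": {"points": ((1.0, 2.0), (3.0, 4.0))}}
flattened = flatten_with_tuples(points)
```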
2024-01-19 00:21:22 +00:00
John
fa9f6ccc17
refactor: use _get_iso639_language_object (#2424)
This refactor removes `_convert_to_standard_langcode` and replaces it
with calling `_get_iso639_language_object` with a string slice.

Use of TESSERACT_LANGUAGES_AND_CODES, which was added to
`_convert_to_standard_langcode` previously, is moved to the relevant
part where `_convert_to_standard_langcode` was previously called.

If/else statements replace the list comprehension for readability and
`langdetect_langs.append("zho")` replaces
`_convert_to_standard_langcode("zh")` since that always returned
`"zho"`.
2024-01-19 00:14:45 +00:00
Austin Walker
cfee86f5de
chore: Update base image (#2426)
Propagating the openssl revert made in the base image:
https://github.com/Unstructured-IO/base-images/pull/13

Note that I messed up and wrote over the existing 9.2-9 image. Any
current PRs will need to rebase in order to get a working Dockerfile.
2024-01-18 22:34:43 +00:00
David Potter
4a34765fdf
chore: add postgres extra (#2422)
postgres missing in setup.py

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-18 16:32:53 +00:00
ryannikolaidis
d25e6081d8
chore: add opensearch extra (#2419) 2024-01-18 05:21:37 +00:00
Ronny H
96fe7dd5e5
Kapa.ai widget installation (#2418)
To test:
> cd docs && make html
> click "Ask AI" button on the bottom right-hand corner

Changelogs:
* Installed kapa.ai widget
* fixed sphinx errors in opensearch & elasticsearch documentation
2024-01-18 00:17:11 +00:00
Matt Robinson
4d5038d9fd
enhancement: add support for bitmap images (#2414)
### Summary

Adds support for bitmap images (`.bmp`) in both file detection and
partitioning. Bitmap images will be processed with `partition_image`
just like JPGs and PNGs.

### Testing

```python
import os

from PIL import Image

from unstructured.file_utils.filetype import detect_filetype
from unstructured.partition.auto import partition

filename = "example-docs/layout-parser-paper-with-table.jpg"
# Expand "~" explicitly; PIL does not expand it when saving.
bmp_filename = os.path.expanduser("~/tmp/layout-parser-paper-with-table.bmp")

img = Image.open(filename)
img.save(bmp_filename)

detect_filetype(filename=bmp_filename) # Should be FileType.BMP

elements = partition(filename=bmp_filename)
```
2024-01-17 22:50:36 +00:00
Ronny H
8e6bc10ba1
Docs various updates (#2386)
To test:
> cd docs && make html 

Changelogs:
* Added verbiage about the cap limit and data usage for the Freemium API
* Added deprecated warning on Staging bricks
* Added warning and code examples to use the SaaS API Endpoints using
CLI-vs-SDKs
* Fixed example page formatting
* Added deprecation warning on ``model_name`` param in favor of
``hi_res_model_name``
* Added ``extract_images_in_pdf`` usage and code example in
``partition_pdf`` section
* Reorganized and improved the documentation Intro section
2024-01-17 21:01:01 +00:00
ryannikolaidis
f23f20c1dc
fix: postgres destination connector serialization (#2411)
This fixes the serialization of the Elasticsearch destination connector.
Presence of the _client object breaks serialization due to TypeError:
cannot pickle '_thread.lock' object. This removes that object before
serialization.
2024-01-17 17:39:32 +00:00
Yao You
ae24136238
chore: update installation instructions for conda (#2409)
- bump the pytorch version for conda to match that in
requirements/extra-pdf-image.txt (to 2.1.2)
2024-01-17 17:27:37 +00:00
David Potter
bc791d53f4
feat: add opensearch source and destination connector (#2349)
Adds OpenSearch as a source and destination.

Since OpenSearch is a fork of Elasticsearch, these connectors rely
heavily on inheriting the Elasticsearch connectors whenever possible.

- Adds OpenSearch source connector to be able to ingest documents from
OpenSearch.
- Adds OpenSearch destination connector to be able to ingest documents
from any supported source, embed them and write the embeddings /
documents into OpenSearch.
- Defines an example unstructured elements schema for users to be able
to setup their unstructured OpenSearch indexes easily.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-17 04:31:49 +00:00
David Potter
d7f4c24e21
fix documentation for chroma (#2403)
To test:

cd docs && make html

changelogs:

point main readme to the correct connector html page
point chroma docs to correct sample code

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-17 01:53:52 +00:00
Austin Walker
aaf3fd982b
chore: bump base image (#2410)
Propagating the openssl fix from Unstructured-IO/base-images#12
2024-01-17 01:32:58 +00:00
Steve Canny
fcc919b9f5
rfctr(chunking): add chunking arg constants (#2408)
There are several public interface points for chunking and they all
provide a default for arguments like `max_characters`. These defaults are
provided by literal values. Keeping these synchronized has become a
problem.

Declare constant values for chunking argument default values and use
those wherever a non-trivial default is used in an end-user facing API
function.
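
As a rough illustration of the pattern (the constant names, their values, and both functions below are hypothetical, not the library's actual API), a single set of constants can back every public entry point so the defaults cannot drift apart:

```python
from typing import List

# Hypothetical constant names and values, for illustration only.
MAX_CHARACTERS_DEFAULT: int = 500
OVERLAP_DEFAULT: int = 0


def split_text(
    text: str,
    max_characters: int = MAX_CHARACTERS_DEFAULT,
    overlap: int = OVERLAP_DEFAULT,
) -> List[str]:
    """Split text into windows of at most max_characters, sharing `overlap` characters."""
    if overlap >= max_characters:
        raise ValueError("overlap must be smaller than max_characters")
    step = max_characters - overlap
    return [text[i : i + max_characters] for i in range(0, len(text), step)]


def chunk_documents(
    texts: List[str],
    max_characters: int = MAX_CHARACTERS_DEFAULT,
    overlap: int = OVERLAP_DEFAULT,
) -> List[str]:
    """Second entry point; it reuses the same constants instead of repeating literals."""
    return [chunk for text in texts for chunk in split_text(text, max_characters, overlap)]
```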
2024-01-16 21:48:36 +00:00
David Potter
76e0d10e61
feat: add MongoDB source connector (#2393)
Adds MongoDB as a source (we already had it as a destination connector)

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-16 20:56:29 +00:00
John
125b63cd7c
refactor: extract language helper functions (#2370)
This PR is one in a series of PRs for refactoring and fixing the
`languages` parameter so it can address incorrect input by users. #2293

Refactor `_convert_language_code_to_pytesseract_lang_code` and extract
`_get_iso639_language_object` to its own function


```
from unstructured.partition.lang import _convert_language_code_to_pytesseract_lang_code as convert
convert("English") # this will raise an error on both main and this branch
convert("en") # this will return "eng" on both branches
```
2024-01-16 17:51:03 +00:00
jakub-sandomierz-deepsense-ai
ee0441efea
enhancement: normalize Salesforce artifact extensions (#2402)
Connectors use a predictable naming convention for result files so that
consumers of the library can write code that is independent of any
particular connector.

This change introduces compatibility with said naming convention.

`_output_filename` now returns the filename including the format.
2024-01-16 10:36:00 +00:00
Christine Straub
ee06260987
feat: keep all image elements when using hi_res strategy. (#2382)
### Summary
The goal of this PR is to keep all image elements when using "hi_res"
strategy. Previously, `Image` elements with small chunks of text were
ignored unless the image block extraction parameters
(`extract_images_in_pdf` or `extract_image_block_types`) were specified.
Now, all image elements are kept regardless of whether the image block
extraction parameters are specified.

### Testing
- on `main` branch,
```
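from unstructured.documents.elements import ElementType
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(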
    filename="example-docs/embedded-images.pdf",
    strategy="hi_res",
)
image_elements = [el for el in elements if el.category == ElementType.IMAGE]
print("number of image elements: ", len(image_elements))
```
The above code will display `number of image elements: 0`. 

- on this `feature` branch,

The same code will display `number of image elements: 3`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-01-15 23:19:17 +00:00
John
1f0826ab0a
pin unstructured-client (#2392)
Replacement for #2311 since python 3.8 was dropped as a supported
version.

Unstructured-client added `api_key_auth` as a param to
`UnstructuredClient` in [version
0.9.0](8c93115c92).

This pins the version of `unstructured-client` so users do not receive
`TypeError: UnstructuredClient.__init__() got an unexpected keyword
argument 'api_key_auth'`
2024-01-15 17:26:38 +00:00
Matt Robinson
36faf677c0
enhancement: file detection for .wav files (#2387)
### Summary

Adds filetype detection for `.wav` audio files

### Testing

```python
from unstructured.file_utils.filetype import detect_filetype

filename = "example-docs/CantinaBand3.wav"
detect_filetype(filename=filename) # Should be FileType.WAV
```
2024-01-15 16:50:49 +00:00
ryannikolaidis
d7980b3665
fix: elasticsearch serialization issue (#2399)
This fixes the serialization of the Elasticsearch destination connector.
Presence of the _client object breaks serialization due to TypeError:
cannot pickle '_thread.lock' object. This removes that object before
serialization.
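
A minimal sketch of the general technique, assuming a connector that caches an SDK client on a `_client` attribute. `FakeClient` and `DestinationConnector` are hypothetical stand-ins, not the actual connector classes; the client is dropped in `__getstate__` and rebuilt lazily after deserialization.

```python
import pickle
import threading
from typing import Any, Dict, Optional


class FakeClient:
    """Stand-in for an SDK client; holding a lock makes it unpicklable."""

    def __init__(self) -> None:
        self._lock = threading.Lock()


class DestinationConnector:
    """Hypothetical connector that creates its client on first use."""

    def __init__(self, index_name: str) -> None:
        self.index_name = index_name
        self._client: Optional[FakeClient] = None

    @property
    def client(self) -> FakeClient:
        if self._client is None:
            self._client = FakeClient()
        return self._client

    def __getstate__(self) -> Dict[str, Any]:
        # Copy the instance state and drop the client before pickling.
        state = self.__dict__.copy()
        state["_client"] = None
        return state


connector = DestinationConnector("ingest-test")
_ = connector.client  # force client creation, as would happen during a run
restored = pickle.loads(pickle.dumps(connector))
```

Without `__getstate__`, the `pickle.dumps` call above fails with the `TypeError` quoted in the commit message.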
2024-01-14 23:07:37 +00:00
ryannikolaidis
f07fc6e03a
chore: make Elasticsearch Destination connector write settings optional (#2398)
* set required=False to all write config options
* update num_processes to default to 1 since that will always work
2024-01-14 22:31:05 +00:00
ryannikolaidis
2ce829ddd0
test: update test Elasticsearch mappings to validate embedding search (#2397)
Currently in the Elasticsearch Destination ingest test we are writing
the embeddings to a "float" type field. In order to leverage this field
for similarity search it should be mapped as "dense_vector" with the
respective dimensions assigned.

This PR updates that mapping and adds a test query to validate that this
works as expected.
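
For illustration, a mapping and test query of this kind might look as follows. The field names (`text`, `embeddings`), the 384 dimensions, and the helper functions are assumptions rather than the values used in the test; the request shapes follow the Elasticsearch 8.x `dense_vector` and top-level `knn` search options.

```python
from typing import Any, Dict, List

# Assumption: must match the output size of the embedding model in use.
EMBEDDING_DIMENSIONS = 384


def build_mappings(dims: int) -> Dict[str, Any]:
    """Index mappings with the embedding field typed as dense_vector."""
    return {
        "properties": {
            "text": {"type": "text"},
            "embeddings": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": "cosine",
            },
        }
    }


def build_knn_query(query_vector: List[float], k: int = 5) -> Dict[str, Any]:
    """Body for an approximate kNN search against the embeddings field."""
    return {
        "knn": {
            "field": "embeddings",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 10 * k,
        }
    }


mappings = build_mappings(EMBEDDING_DIMENSIONS)
```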
2024-01-14 19:27:56 +00:00
ryannikolaidis
018cd7f71b
fix: pinecone serialization issue (#2394)
This fixes the serialization of the Pinecone destination connector.
Presence of the PineconeIndex object breaks serialization due to
TypeError: cannot pickle '_thread.lock' object. This removes that object
before serialization.
2024-01-13 00:08:33 +00:00
Steve Canny
2f2c48acd5
feat(ingest): add basic chunking to ingest (#2380)
The new "basic" chunking strategy and overlap options need to be
available from the ingest CLI. An ingest test of those features is also
welcome, both to verify the ingest feature and to defend against
regressions in the chunking code.

Add a local ingest test exercising both the "basic" chunking strategy
and intra-chunk overlap. Since there is no new source connector
involved, use the local ingest source and destination. Update
documentation to suit, filling in some details that hadn't made it into
the docs yet.
2024-01-12 20:27:34 +00:00
Ahmet Melek
50f142d4e0
chore(ingest): update pinecone index creation specifications (#2389)
This PR updates Pinecone index creation in the ingest test due to a
recent update in Pinecone API.

Due to a change in the Pinecone API, it is no longer allowed to specify
both the number of replicas and the number of pods:
`Cannot specify both replicas and pods`

We solve it by removing the replica specification while sending the
index creation request.

```
Creating index ingest-test-28418
Index creation success: 201
```
2024-01-12 02:49:09 +00:00
jakub-sandomierz-deepsense-ai
411aa98bbf
feat: Salesforce connector accepts key path or value (#2321) (#2327)
Solution to issue
https://github.com/Unstructured-IO/unstructured/issues/2321.

simple_salesforce API allows for passing private key path or value. This
PR introduces this support for Ingest connector.

Salesforce parameter "private-key-file" has been renamed to
"private-key".
It can contain one of the following:
- path to PEM encoded key file (as string)
- key contents (PEM encoded string)

If the provided value cannot be parsed as PEM encoded private key, then
the file existence is checked. This way private key contents are not
exposed to unnecessary underlying function calls.
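
The order of checks described above can be sketched as follows. This is a simplified illustration: the helper name `classify_private_key` is hypothetical, and a regular expression on the PEM armor lines stands in for real PEM parsing.

```python
import re
from pathlib import Path
from typing import Tuple

# Simplification: recognise PEM by its BEGIN/END armor rather than parsing the key.
PEM_PATTERN = re.compile(
    r"-----BEGIN (?P<label>[A-Z ]*PRIVATE KEY)-----.+?-----END (?P=label)-----",
    re.DOTALL,
)


def classify_private_key(value: str) -> Tuple[str, str]:
    """Return ("contents", value) or ("path", value); raise ValueError if neither."""
    # Check for key contents first so they never reach filesystem calls.
    if PEM_PATTERN.search(value):
        return ("contents", value)
    try:
        if Path(value).is_file():
            return ("path", value)
    except OSError:
        # e.g. a string too long to be a valid file name
        pass
    raise ValueError("private key must be PEM contents or a path to an existing file")
```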
2024-01-11 11:15:24 +00:00
jakub-sandomierz-deepsense-ai
5581e6a4c4
fix: Ingest GCS accepts JSON auth token (#2322) (#2371)
FSSpec serialization converted the JSON token to a string with single
quotes. GCS requires the JSON token in the form of a dict, so this format
is now ensured. Other forms of auth are not modified, but validation is
improved for all of the options.
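
One way to recover a dict from either form, shown only as a sketch (the helper `parse_token` is hypothetical and is not the connector's code): try strict JSON first, then fall back to parsing the single-quoted Python-literal form.

```python
import ast
import json
from typing import Any, Dict


def parse_token(token: str) -> Dict[str, Any]:
    """Parse a token given as JSON or as a str()-serialized dict into a dict."""
    try:
        parsed = json.loads(token)
    except json.JSONDecodeError:
        # str(dict) uses single quotes, which is not valid JSON.
        try:
            parsed = ast.literal_eval(token)
        except (ValueError, SyntaxError) as exc:
            raise ValueError("token is neither JSON nor a dict literal") from exc
    if not isinstance(parsed, dict):
        raise ValueError("token must decode to a dict")
    return parsed
```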
2024-01-11 09:03:47 +00:00
John
bfd0258ba5
chore: refactor _convert_to_standard_langcode (#2369)
This PR is one in a series of PRs for refactoring and fixing the
`languages` parameter so it can address incorrect input by users. #2293

This PR adds a dictionary for helping map fully spelled out languages to
tesseract language codes

---------

Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2024-01-11 00:34:13 +00:00
Roman Isecke
8dc130c920
fix: ensure consistency in method signatures across destination connectors (#2381)
### Description
* Make sure all destination connectors implement the base abstract
methods using the same signatures.
* Also leverage conform dict in the base methods to make sure it's
called in a consistent fashion.
* Additional updates to move the common code into the base destination
connector class
2024-01-11 00:19:49 +00:00
Ronny H
98a0de30b4
Fix sphinx error (#2384)
To test:
> cd docs && make html

changelogs:
- remove unindented line in destination connector's sql.rst file 
- add elasticsearch page into destination_connector.rst file
2024-01-10 22:25:18 +00:00
Steve Canny
23edf2e911
feature(chunking): add basic strategy and overlap (#2367)
This PR culminates the restructuring of chunking over my prior
dozen-or-so commits by adding the new options to the API and
documentation.

Separately I'll be adding a new ingest test to defend against
regression, although the integration test included in this PR will do a
pretty good job of that too.
2024-01-10 22:19:24 +00:00
Roman Isecke
a8a103bc5c
bug: don't redact text during serialization if there is no value (#2379)
### Description
The current approach injects the redacted text for all sensitive fields
regardless of whether they have a value. This updates the code to only
replace the value with the redacted text if the value exists.
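
A small sketch of the intended behaviour, assuming configs are plain dicts; the `redact` helper and the `REDACTED_TEXT` placeholder value are hypothetical.

```python
from typing import Any, Dict, Iterable

REDACTED_TEXT = "***REDACTED***"  # hypothetical placeholder value


def redact(config: Dict[str, Any], sensitive_fields: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of config with sensitive values replaced only when they are set."""
    redacted = dict(config)
    for name in sensitive_fields:
        # Unset or empty fields stay as they are instead of looking "set but hidden".
        if redacted.get(name):
            redacted[name] = REDACTED_TEXT
    return redacted
```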
2024-01-10 18:52:43 +00:00
Roman Isecke
22c0bad246
bug: weaviate serialization broken (#2378)
### Description
This PR handles two things:
* Fixes the serialization of the weaviate destination connector since
the client content breaks serialization when present due to `TypeError:
cannot pickle '_thread.lock' object`.
* Set finer auth control rather than generic dictionary on the CLI and
access config.
2024-01-10 17:22:37 +00:00
Roman Isecke
b37b4689bc
drop python3.8 (#2372)
### Description
Remove all uses of python3.8

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
0.12.0
2024-01-09 23:37:30 +00:00
Christine Straub
e2f0de3c50
chore: bump unstructured-inference=0.7.21 (#2361) 2024-01-08 21:05:04 +00:00
Roman Isecke
7caf255316
bug: omit session handler from serialization to avoid mp issues (#2366)
### Description
The session handler variable can be anything, because it's specific to
the SDK being used for the connector. This can break the serialization
depending on what that is. To avoid this altogether, the session
handler itself is not serialized. Instead, it needs to be recreated if
an object is serialized and then deserialized.
2024-01-08 19:14:26 +00:00
jakub-sandomierz-deepsense-ai
0ca154a0f3
Fix: MongoDB connector URI password redaction, basic unit tests for Git connector (#2268)
MongoDB connector:
Issue:
[MongoDB
documentation](https://www.mongodb.com/docs/manual/reference/connection-string/)
states that characters `$ : / ? # [ ] @` must be percent encoded. A URI
whose password contains such a special character was not redacted.

Fix:
This fix removes usage of `unquote_plus` on password which allows
detected password to match with one inside URI and successfully replace
it.
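
The matching problem can be reproduced with the standard library. In this sketch (the helper name and placeholder are hypothetical), `urlparse` returns the password still percent-encoded, so it matches the text in the URI; decoding it first would not.

```python
from urllib.parse import urlparse


def redact_uri_password(uri: str, placeholder: str = "***REDACTED***") -> str:
    """Replace the password in a connection URI without decoding it first."""
    password = urlparse(uri).password  # still percent-encoded, exactly as in the URI
    if not password:
        return uri
    # Replace only the first ":<password>@" so the rest of the URI is untouched.
    return uri.replace(f":{password}@", f":{placeholder}@", 1)
```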

Git connector:
Added very basic unit tests for repository filtering methods. Their
impact is rather minimal, but they showcase a current limitation in the
`is_file_type_supported` method.
2024-01-08 11:27:08 +00:00
Klaijan
e65a44eabb
feat: update cct eval for text dir (#2299)
This change edits the `measure_text_extraction_accuracy` function to
allow a directory of txt files as well as json. The function also takes
an `output_type` input, which must be either "json" or "txt", and checks
whether the files under the given directory/list are all of the specified
file type.

To test this feature, run the following code:

```
PYTHONPATH=. python unstructured/ingest/evaluate.py measure-text-extraction-accuracy-command --output_dir <clean-text-path> --source_dir <cct-label-path> --output_type txt
```
2024-01-05 23:34:53 +00:00
Ahmet Melek
d6674ba27e
chore: update ingest azure cognitive search endpoint (#2353)
This PR:
- updates ingest azure cognitive search destination connector test to
move into a new service.
- changes response parsing logic in the test.
2024-01-05 05:26:12 +00:00
Steve Canny
7a1e732aa1
feat(chunking): add inter-chunk overlap (#2309)
Reviewer: This PR probably reviews faster commit-by-commit. Each of the
commits is groomed and focuses on a separate clear aspect of this
implementation.

This PR adds inter-chunk overlap capability to chunking. It does not yet
expose it via the API.

Inter-chunk overlap is overlap between whole pre-chunks, prior to any
text-splitting required for oversized chunks. Contrast with intra-chunk
overlap implemented in the prior PR which implements overlap on these
latter text-splitting boundaries.

Inter-chunk overlap is disabled by default since a pre-chunk already has
a "clean" semantic boundary (composed of whole elements) and adding
overlap there introduces noise from the adjacent context. If the user
wants inter-chunk overlap they must specify `overlap_all=True` in the
options. Inter-chunk overlap uses the same `overlap` length value used
by intra-chunk overlap and does not overlap when that value is 0.
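
To make the behaviour concrete, here is a simplified sketch that treats each pre-chunk as a plain string. The function name and the space used to join the overlap are assumptions; the real implementation works on pre-chunks of elements.

```python
from typing import List


def apply_inter_chunk_overlap(
    pre_chunks: List[str], overlap: int, overlap_all: bool = False
) -> List[str]:
    """Prefix each pre-chunk with the tail of the previous one when enabled."""
    # Disabled by default, and a zero-length overlap changes nothing.
    if not overlap_all or overlap <= 0:
        return list(pre_chunks)
    result: List[str] = []
    previous_tail = ""
    for text in pre_chunks:
        result.append(f"{previous_tail} {text}" if previous_tail else text)
        # The tail comes from the original pre-chunk, not the overlapped version.
        previous_tail = text[-overlap:]
    return result
```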
2024-01-05 01:24:12 +00:00
Steve Canny
22cbdce7ca
fix(html): unequal row lengths in HTMLTable.text_as_html (#2345)
Fixes #2339

Fixes to HTML partitioning introduced with v0.11.0 removed the use of
`tabulate` for forming the HTML placed in `HTMLTable.text_as_html`. This
had several benefits, but part of `tabulate`'s behavior was to make
row-length (cell-count) uniform across the rows of the table.

Lacking this prior uniformity produced a downstream problem reported in

On closer inspection, the method used to "harvest" cell-text was
producing more text-nodes than there were cells and was sensitive to
where whitespace was used to format the HTML. It also "moved" text to
different columns in certain rows.

Refine the cell-text gathering mechanism to get exactly one text string
for each row cell, eliminating whitespace formatting nodes and producing
strict correspondence between the number of cells in the original HTML
table row and that placed in `HTMLTable.text_as_html`.

HTML tables that are uniform (every row has the same number of cells)
will produce a uniform table in `.text_as_html`. Merged cells may still
produce a non-uniform table in `.text_as_html` (because the source table
is non-uniform).
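
The "one text string per cell" idea can be sketched with the standard library's `html.parser`. This is an illustration only: it is not the partitioner's code, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableCellTextParser(HTMLParser):
    """Collect exactly one text string per <td>/<th> cell, row by row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. indentation between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse whitespace so HTML formatting does not create extra strings.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html: str) -> List[List[str]]:
    """Return the cell texts of each table row; empty cells yield empty strings."""
    parser = TableCellTextParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```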
2024-01-04 21:53:19 +00:00
rvztz
950e5d68f9
feat: adds postgresql/sqlite destination connector (#2005)
- Adds a destination connector to upload processed output into a
PostgreSQL/Sqlite database instance.
- Users are responsible for providing their own instances. This PR includes a
couple of configuration examples.
- Defines the scripts required to setup a PostgreSQL instance with the
unstructured elements schema.
- Validates postgres/pgvector embedding storage and retrieval

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-04 19:33:16 +00:00
Christine Straub
5b0ae3fd8b
Refactor: rename image extraction kwargs (#2303)
Currently, we're using different kwarg names in partition() and
partition_pdf(), which has implications for the API since it goes
through partition().

### Summary
- rename `extract_element_types` -> `extract_image_block_types`
- rename `image_output_dir_path` to `extract_image_block_output_dir`
- rename `extract_to_payload` -> `extract_image_block_to_payload`
- rename `pdf_extract_images` -> `extract_images_in_pdf` in
`partition.auto`
- add unit tests to test element extraction for `pdf/image` via
`partition.auto`
### Testing
CI should pass.
2024-01-04 17:52:00 +00:00
Ronny H
8e2bfcab18
Unstructured SaaS API subscription guide (#2341)
To test:
> cd docs && make html

Sections:
- New User sign-up: (i) registration form, (ii) payment processing, and
(iii) use API key & URL
- API Account maintenance: (i) update billing, (ii) opt-in email, (iii)
rotate API key, and (iv) cancel plan
- Get Support
0.11.8
2024-01-03 14:38:03 -08:00
Austin Walker
91b892c79d
fix: Fix api_url param to partition_via_api (#2342)
Closes #2340 

We need to make sure the custom url is passed to our client. The client
constructor takes the base url, so for compatibility we can continue to
take the full url and strip off the path.
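
Stripping the path can be sketched with `urllib.parse`. The helper name `base_url` and the example URLs in the usage note are illustrative, not taken from the fix itself.

```python
from urllib.parse import urlsplit


def base_url(api_url: str) -> str:
    """Keep only the scheme and host[:port] of a full endpoint URL."""
    parts = urlsplit(api_url)
    return f"{parts.scheme}://{parts.netloc}"
```

For example, `base_url("http://localhost:8000/general/v0/general")` gives `http://localhost:8000`, and a URL that already has no path is returned unchanged.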

To verify, run the api locally and confirm you can make calls to it.

```
# In unstructured-api
make run-web-app

# In ipython in this repo
from unstructured.partition.api import partition_via_api
filename = "example-docs/layout-parser-paper.pdf"
partition_via_api(filename=filename, api_url="http://localhost:8000")
```
0.11.7
2024-01-03 20:08:48 +00:00
Yao You
1b70ea86b3
fix: update table structure eval to use new table inference interface (#2306)
Provide OCR tokens for the table eval script. Currently
`unstructured-inference` can compute OCR components when they are not
passed in, but a future release will require passing OCR results to the
table structure extraction model:
d3b2981313/CHANGELOG.md (0719)
This PR prepares for the upcoming change by passing OCR tokens into the
table structure extraction process.

## test

Create a new virtual env that follows the setup in the readme, then
upgrade `inference` with `pip install unstructured-inference --upgrade`.
Run the test with `PYTHONPATH=. pytest
test_unstructured/metrics/test_table_structure.py`; it fails on the main
branch but passes with this PR.

---------

Co-authored-by: Austin Walker <awalk89@gmail.com>
2024-01-03 19:41:51 +00:00
ryannikolaidis
dd1443ab6f
feat: add Qdrant ingest destination connector (#2338)
This PR intends to add [Qdrant](https://qdrant.tech/) as a supported
ingestion destination.

- Implements CLI and programmatic usage.
- Documentation update
- Integration test script

---
Clone of #2315 to run with CI secrets

---------

Co-authored-by: Anush008 <anushshetty90@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2024-01-02 22:08:20 +00:00
Christine Straub
9459af435d
Fix: element extraction not working when using "auto" strategy for pdf (#2324)
Closes #2323.

### Summary
- update logic to return "hi_res" if either `extract_images_in_pdf` or
`extract_element_types` is set
- refactor: remove unused `file` parameter from
`determine_pdf_or_image_strategy()`
### Testing
```
from unstructured.documents.elements import ElementType
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/embedded-images-tables.pdf",
    extract_element_types=["Image"],
    extract_to_payload=True,
)

image_elements = [el for el in elements if el.category == ElementType.IMAGE]
print(image_elements)
```
2023-12-28 22:25:30 +00:00
Christine Straub
dd144456de
Feat: return base64 encoded images for PDF's (#2310)
Closes #2302.
### Summary
- add functionality to get a Base64 encoded string from a PIL image
- store base64 encoded image data in two metadata fields: `image_base64`
and `image_mime_type`
- update the "image element filter" logic to keep all image elements in
the output if a user specifies image extraction
### Testing
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/embedded-images-tables.pdf",
    strategy="hi_res",
    extract_element_types=["Image", "Table"],
    extract_to_payload=True,
)
```
or
```
from unstructured.partition.auto import partition

elements = partition(
    filename="example-docs/embedded-images-tables.pdf",
    strategy="hi_res",
    pdf_extract_element_types=["Image", "Table"],
    pdf_extract_to_payload=True,
)
```
2023-12-27 05:39:01 +00:00