1418 Commits

Author SHA1 Message Date
John
9320311a19
fix: check languages args (#2435)
This PR is the last in a series of PRs for refactoring and fixing the
language parameters (`languages` and `ocr_languages` so we can address
incorrect input by users. See #2293

It is recommended to go though this PR commit-by-commit and note the
commit message. The most significant commit is "update
check_languages..."
2024-01-29 20:12:08 +00:00
Yao You
97fb10db4a
fix: default hi_res model rely on inference setting (#2441)
- there are multiple places setting the default `hi_res_model_name` in
both `unstructured` and `unstructured-inference`
- they lead to inconsistency and unexpected behaviors
- this fix removes a helper in `unstructured` that tries to set the
default hi_res layout detection model; instead we rely on the
`unstructured-inference` to provide that default when no explicit model
name is passed in

## test

```bash
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true ipython
```

```python
from unstructured.partition.auto import partition

# find a pdf file
elements = partition("foo.pdf", strategy="hi_res")
assert elements[0].metadata.detection_origin == "yolox"
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2024-01-29 16:44:41 +00:00
Ahmet Melek
4e226d818f
build(release): release commit for 0.12.3 (#2466) 0.12.3 2024-01-29 13:45:23 +00:00
David Potter
74dcca44ca
fix: link_texts was breaking postgres destination connector (#2460)
Formatting of link_texts was breaking metadata storage. Turns out it
didn't need any conforming and came in correctly from json.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-27 04:29:38 +00:00
Ronny H
d5a6f4b82c
Docs updates (#2458)
To test:
> cd docs && make html

Change logs:
* Updates the best practice for table extraction to use
`skip_infer_table_types` instead of `pdf_infer_table_structure`.
* Fixed CSS issue with a duplicate search box.
* Fixed RST warning message
* Fixed typo on the Intro page.
2024-01-25 20:31:28 +00:00
Antonio Jose Jimeno Yepes
d8b3bdb919
Check chipper version and prevent running pdfminer with chipper (#2347)
We have added a new version of chipper (Chipperv3), which needs to allow
unstructured to effective work with all the current Chipper versions.
This implies resizing images with the appropriate resolution and make
sure that Chipper elements are not sorted by unstructured.

In addition, it seems that PDFMiner is being called when calling
Chipper, which adds repeated elements from Chipper and PDFMiner.

To evaluate this PR, you can test the code below with the attached PDF.
The code writes a JSON file with the generated elements. The output can
be examined with `cat out.un.json | python -m json.tool`. There are
three things to check:

1. The size of the image passed to Chipper, which can be identiied in
the layout_height and layout_width attributes, which should have values
3301 and 2550 as shown in the example below:

```
[
    {
        "element_id": "c0493a7872f227e4172c4192c5f48a06",
        "metadata": {
            "coordinates": {
                "layout_height": 3301,
                "layout_width": 2550,

```

2. There should be no repeated elements. 
3. Order should be closer to reading order.

The script to run Chipper from unstructured is:

```
from unstructured import __version__
print(__version__.__version__)

import json
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json

elements = json.loads(elements_to_json(partition("Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf", strategy="hi_res", model_name="chipperv3")))

with open('out.un.json', 'w') as w:
    json.dump(elements, w)

```



[Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf](https://github.com/Unstructured-IO/unstructured/files/13817273/Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf)

---------

Co-authored-by: Antonio Jimeno Yepes <antonio@unstructured.io>
2024-01-25 02:33:32 +00:00
Matt Robinson
4613e52e11
fix: treat yaml files as plain text (#2446)
### Summary

Closes #2412. Adds support for YAML MIME types and treats them as plain
text. In response to `500` errors that the API currently returns if the
MIME type is `text/yaml`.
2024-01-24 17:48:36 +00:00
David Potter
9fea85dc21
fix: remove none value keys from flattened dictionary (#2442)
When a partitioned or embedded document json has null values, those get
converted to a dictionary with None values.

This happens in the metadata. I have not see it in other keys.

Chroma and Pinecone do not like those None values. 

`flatten_dict` has been modified with a `remove_none` arg to remove keys
with None values.

Also, Pinecone has been pinned at 2.2.4 because at 3.0 and above it
breaks our code.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-23 21:52:11 +00:00
Matt Robinson
a155e7a43b
enhancement: add mongodb driver (#2445)
### Summary

Adds a driver with `unstructured` version information to the MongoDB
driver.

### Testing,

Good to go as long as the MongoDB ingest test run successfully.
2024-01-23 17:45:31 +00:00
ryannikolaidis
44cb2ce645
fix: path to databricks-volumes requirements (#2443)
setup.py is currently pointing to the wrong location for the
databricks-volumes extra requirements. This PR updates to point to the
correct location.

## Testing
Tested by installing from local source with `pip install .`
2024-01-23 03:28:36 +00:00
Roman Isecke
a8de52e94f
feat: databricks volumes dest added (#2391)
### Description
This adds in a destination connector to write content to the Databricks
Unity Catalog Volumes service. Currently there is an internal account
that can be used for testing manually but there is not dedicated account
to use for testing so this is not being added to the automated ingest
tests that get run in the CI.

To test locally: 
```shell
#!/usr/bin/env bash

path="testpath/$(uuidgen)"

PYTHONPATH=. python ./unstructured/ingest/main.py local \
    --num-processes 4 \
    --output-dir azure-test \
    --strategy fast \
    --verbose \
    --input-path example-docs/fake-memo.pdf \
    --recursive \
    databricks-volumes \
    --catalog "utic-dev-tech-fixtures" \
    --volume "small-pdf-set" \
    --volume-path "$path" \
    --username "$DATABRICKS_USERNAME" \
    --password "$DATABRICKS_PASSWORD" \
    --host "$DATABRICKS_HOST"
```
2024-01-23 01:25:51 +00:00
jakub-sandomierz-deepsense-ai
283f796f58
fix: Fix and use FSSpec dest check_connection (#2420)
FSSpec destination connectors did not use `check_connection`. There was
an error when trying to `ls` destination directory - it may not exist at
the moment of creation of connector.

Now `check_connection` calls `ls` on bucket root and this method is
called on `initialize` of destination connector.
2024-01-22 09:31:02 +00:00
Ronny H
149f894d0a
Fixed sphinx-build error by pinning alabaster=-0.7.13 (#2436) 0.12.2 2024-01-20 14:36:48 -08:00
Ronny H
ea11118655
v0.12.2 changelog and version (#2434) 2024-01-20 01:02:00 +00:00
Ronny H
4c772f6ed7
Updated docs on API Params and Filetype Supports (#2433)
To test:
> cd docs && make html

Changelogs:
* Fixed sphinx error due to malformed rst table on partition page
* Updated API Params, ie. `extract_image_block_types` and
`extract_image_block_to_payload`
* Updated image filetype supports
2024-01-19 16:07:57 -08:00
Matt Robinson
2d3a7f1c48
fix: fix table index error by bumping unstructured-inference (#2430)
### Summary

Closes #2417. Bumps `unstructured-inference` to pull in the fix
implemented in
https://github.com/Unstructured-IO/unstructured-inference/pull/317
2024-01-19 22:42:32 +00:00
John
c34fac9c3a
enhancement: add _clean_ocr_languages_arg helper function (#2413)
This PR is one in a series of PRs for refactoring and fixing the
languages parameter so it can address incorrect input by users. #2293

This PR adds _clean_ocr_languages_arg. There are no calls to this
function yet, but it will be called in later PRs related to this series.
2024-01-19 19:59:08 +00:00
Ronny H
c81d4e34be
v0.12.1 release (#2429) 0.12.1 2024-01-19 17:56:06 +00:00
ryannikolaidis
2e97494613
fix: fsspec connectors returning data source version as integer (#2427)
Connector data source versions should always be string values, however
we were using the integer checksum value for the version for fsspec
connectors. This casts that value to a string.

## Changes

* Cast the checksum value to a string when assigning the version value
for fsspec connectors.
* Adds test to validate that these connectors will assign a string value
when an integer checksum is fetched.

## Testing

Unit test added.
2024-01-19 15:58:01 +00:00
Christine Straub
7378a378f6
enhancement: allow setting image block crop padding parameter (#2415)
Closes #2320 .

### Summary
In certain circumstances, adjusting the image block crop padding can
improve image block extraction by preventing extracted image blocks from
being clipped.

### Testing
- PDF:
[LM339-D_2-2.pdf](https://github.com/Unstructured-IO/unstructured/files/13968952/LM339-D_2-2.pdf)
- Set two environment variables
`EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and
`EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD`
(e.g. `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD = 40`,
`EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD = 20`

```
elements = partition_pdf(
    filename="LM339-D_2-2.pdf",
    extract_image_block_types=["image"],
)
```
2024-01-19 06:28:32 +00:00
Ahmet Melek
1a305866d1
fix: chroma destination connector serialization (#2425)
This fixes the serialization of the ChromaDB destination connector.
Presence of the _collection object breaks serialization due to
TypeError: cannot pickle 'module' object. This removes that object
before serialization.
2024-01-19 02:36:25 +00:00
Ahmet Melek
a9ad8ac8d1
fix: update flatten dict to support flattening tuples (#2423)
This PR updates flatten_dict function to support flattening tuples. 

This is necessary for objects like Coordinates, when the object is not
written to the disk, therefore not being converted to a list before
getting flattened.
2024-01-19 00:21:22 +00:00
John
fa9f6ccc17
refactor: use _get_iso639_language_object (#2424)
This refactor removes `_convert_to_standard_langcode` and replaces it
with calling `_get_iso639_language_object` with a string slice.

Use of TESSERACT_LANGUAGES_AND_CODES, which was added to
`_convert_to_standard_langcode` previously, is moved to the relevant
part where `_convert_to_standard_langcode` was previously called.

If/else statements replace the list comprehension for readability and
`langdetect_langs.append("zho")` replaces
`_convert_to_standard_langcode("zh")` since that always returned
`"zho"`.
2024-01-19 00:14:45 +00:00
Austin Walker
cfee86f5de
chore: Update base image (#2426)
Propagating the openssl revert made in the base image:
https://github.com/Unstructured-IO/base-images/pull/13

Note that I messed up and wrote over the existing 9.2-9 image. Any
current prs will need to rebase in order to get a working dockerfile.
2024-01-18 22:34:43 +00:00
David Potter
4a34765fdf
chore: add postgres extra (#2422)
postgres missing in setup.py

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-18 16:32:53 +00:00
ryannikolaidis
d25e6081d8
chore: add opensearch extra (#2419) 2024-01-18 05:21:37 +00:00
Ronny H
96fe7dd5e5
Kapa.ai widget installation (#2418)
To test:
> cd docs && make html
> click "Ask AI" button on the bottom right-hand corner

Changelogs:
* Installed kapa.ai widget
* fixed sphinx errors in opensearch & elasticsearch documentation
2024-01-18 00:17:11 +00:00
Matt Robinson
4d5038d9fd
enhancement: add support from bitmap images (#2414)
### Summary

Adds support for bitmap images (`.bmp`) in both file detection and
partitioning. Bitmap images will be processed with `partition_image`
just like JPGs and PNGs.

### Testing

```python
from unstructured.file_utils.filetype import detect_filetype
from unstructured.partition.auto import partition
from PIL import Image

filename = "example-docs/layout-parser-paper-with-table.jpg"
bmp_filename = "~/tmp/ayout-parser-paper-with-table.bmp"

img = Image.open(filename)
img.save(bmp_filename)

detect_filetype(filename=bmp_filename) # Should be FileType.BMP

elements = partition(filename=bmp_filename)
```
2024-01-17 22:50:36 +00:00
Ronny H
8e6bc10ba1
Docs various updates (#2386)
To test:
> cd docs && make html 

Changelogs:
* Added verbiage about the cap limit and data usage for the Freemium AP
* Added deprecated warning on Staging bricks
* Added warning and code examples to use the SaaS API Endpoints using
CLI-vs-SDKs
* Fixed example page formatting
* Added deprecation warning on ``model_name`` param in favor of
``hi_res_model_name``
* Added ``extract_images_in_pdf`` usage and code example in
``partition_pdf`` section
* Reorganized and improved the documentation Intro section
2024-01-17 21:01:01 +00:00
ryannikolaidis
f23f20c1dc
fix: postgres destination connector serialization (#2411)
This fixes the serialization of the Elasticsearch destination connector.
Presence of the _client object breaks serialization due to TypeError:
cannot pickle '_thread.lock' object. This removes that object before
serialization.
2024-01-17 17:39:32 +00:00
Yao You
ae24136238
chore: update installation instructions for conda (#2409)
- bump the pytorch version for conda to match that in
requirements/extra-pdf-image.txt (to 2.1.2)
2024-01-17 17:27:37 +00:00
David Potter
bc791d53f4
feat: add opensearch source and destination connector (#2349)
Adds OpenSearch as a source and destination.

Since OpenSearch is a fork of Elasticsearch, these connectors rely
heavily on inheriting the Elasticsearch connectors whenever possible.

- Adds OpenSearch source connector to be able to ingest documents from
OpenSearch.
- Adds OpenSearch destination connector to be able to ingest documents
from any supported source, embed them and write the embeddings /
documents into OpenSearch.
- Defines an example unstructured elements schema for users to be able
to setup their unstructured OpenSearch indexes easily.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-17 04:31:49 +00:00
David Potter
d7f4c24e21
fix documentation for chroma (#2403)
To test:

cd docs && make HTML

changelogs:

point main readme to the correct connector html page
point chroma docs to correct sample code

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-17 01:53:52 +00:00
Austin Walker
aaf3fd982b
chore: bump base image (#2410)
Propagating the openssl fix from Unstructured-IO/base-images#12
2024-01-17 01:32:58 +00:00
Steve Canny
fcc919b9f5
rfctr(chunking): add chunking arg constants (#2408)
There are several public interface points for chunking and they all
provide a default for arguments like `max_charactes`. These defaults are
provided by literal values. Keeping these synchronized has become a
problem.

Declare constant values for chunking argument default values and use
those wherever a non-trivial default is used in an end-user facing API
function.
2024-01-16 21:48:36 +00:00
David Potter
76e0d10e61
feat: add MongoDB source connector (#2393)
Adds MongoDB as a source (we already had it as a destination connector)

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-16 20:56:29 +00:00
John
125b63cd7c
refactor: extract language helper functions (#2370)
This PR is one in a series of PRs for refactoring and fixing the
`languages` parameter so it can address incorrect input by users. #2293

Refactor `_convert_language_code_to_pytesseract_lang_code` and extract
`_get_iso639_language_object` to its own function


```
from unstructured.partition.lang import _convert_language_code_to_pytesseract_lang_code as convert
convert("English") # this will raise an error on both main and this branch
convert("en") # this will return "eng" on both branches
```
2024-01-16 17:51:03 +00:00
jakub-sandomierz-deepsense-ai
ee0441efea
enhancement: normalize Salesforce artifact extensions (#2402)
Connectors use predictable result file naming convention so consumers of
library can write code in abstraction of particular connector.

This change introduces compatibility with said naming convention.

`_output_filename` returns now filename with format.
2024-01-16 10:36:00 +00:00
Christine Straub
ee06260987
feat: keep all image elements when using hi_res strategy. (#2382)
### Summary
The goal of this PR is to keep all image elements when using "hi_res"
strategy. Previously, `Image` elements with small chunks of text were
ignored unless the image block extraction parameters
(`extract_images_in_pdf` or `extract_image_block_types`) were specified.
Now, all image elements are kept regardless of whether the image block
extraction parameters are specified.

### Testing
- on `main` branch,
```
elements = partition_pdf(
    filename="example-docs/embedded-images.pdf",
    strategy="hi_res",
)
image_elements = [el for el in elements if el.category == ElementType.IMAGE]
print("number of image elements: ", len(image_elements))
```
The above code will display `number of image elements: 0`. 

- on this `feature` branch,

The same code will display `number of image elements: 3`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-01-15 23:19:17 +00:00
John
1f0826ab0a
pin unstructured-client (#2392)
Replacement for #2311 since python 3.8 was dropped as a supported
version.

Unstructured-client added `api_key_auth` as a param to
`UnstructuredClient` in [version
0.9.0](8c93115c92).

This pins the version of `unstructured-client` so users do not receive
`TypeError: UnstructuredClient.__init__() got an unexpected keyword
argument 'api_key_auth'`
2024-01-15 17:26:38 +00:00
Matt Robinson
36faf677c0
enhancement: file detection for .wav files (#2387)
### Summary

Adds filetype detection for `.wav` audio files

### Testing

```python
from unstructured.file_utils.filetype import detect_filetype

filename = "example-docs/CantinaBand3.wav"
detect_filetype(filename=filename) # Should be FileType.WAV
```
2024-01-15 16:50:49 +00:00
ryannikolaidis
d7980b3665
fix: elasticsearch serialization issue (#2399)
This fixes the serialization of the Elasticsearch destination connector.
Presence of the _client object breaks serialization due to TypeError:
cannot pickle '_thread.lock' object. This removes that object before
serialization.
2024-01-14 23:07:37 +00:00
ryannikolaidis
f07fc6e03a
chore: make Elasticsearch Destination connector write settings optional (#2398)
* set required=False to all write config options
* update num_processes to default to 1 since that will always work
2024-01-14 22:31:05 +00:00
ryannikolaidis
2ce829ddd0
test: update test Elasticsearch mappings to validate embedding search (#2397)
Currently in the Elasticsearch Destination ingest test we are writing
the embeddings to a "float" type field. In order to leverage this field
for similarity search it should be mapped as "dense_vector" with the
respective dimensions assigned.

This PR updates that mapping and adds a test query to validate that this
works as expected.
2024-01-14 19:27:56 +00:00
ryannikolaidis
018cd7f71b
fix: pinecone serialization issue (#2394)
This fixes the serialization of the Pinecone destination connector.
Presence of the PineconeIndex object breaks serialization due to
TypeError: cannot pickle '_thread.lock' object. This removes that object
before serialization.
2024-01-13 00:08:33 +00:00
Steve Canny
2f2c48acd5
feat(ingest): add basic chunking to ingest (#2380)
The new "basic" chunking strategy and overlap options need to be
available from the ingest CLI. An ingest test of those features is also
welcome, both to verify the ingest feature and to defend against
regressions in the chunking code.

Add a local ingest test exercising both the "basic" chunking strategy
and intra-chunk overlap. Since there is no new source connector
involved, use the local ingest source and destination. Update
documentation to suit, filling in some details that hadn't made it into
the docs yet.
2024-01-12 20:27:34 +00:00
Ahmet Melek
50f142d4e0
chore(ingest): update pinecone index creation specifications (#2389)
This PR updates Pinecone index creation in the ingest test due to a
recent update in Pinecone API.

Due to a change in Pinecone API, it is not allowed anymore to specify
both number of replicas and number of pods:
`Cannot specify both replicas and pods`

We solve it by removing the replica specification while sending the
index creation request.

```
Creating index ingest-test-28418
Index creation success: 201
```
2024-01-12 02:49:09 +00:00
jakub-sandomierz-deepsense-ai
411aa98bbf
feat: Salesforce connector accepts key path or value (#2321) (#2327)
Solution to issue
https://github.com/Unstructured-IO/unstructured/issues/2321.

simple_salesforce API allows for passing private key path or value. This
PR introduces this support for Ingest connector.

Salesforce parameter "private-key-file" has been renamed to
"private-key".
It can contain one of following:
- path to PEM encoded key file (as string)
- key contents (PEM encoded string)

If the provided value cannot be parsed as PEM encoded private key, then
the file existence is checked. This way private key contents are not
exposed to unnecessary underlying function calls.
2024-01-11 11:15:24 +00:00
jakub-sandomierz-deepsense-ai
5581e6a4c4
fix: Ingest GCS accepts JSON auth token (#2322) (#2371)
FSSpec serialization caused conversion of JSON token to string with
single quotes. GCS requires JSON token in form of dict so this format is
now assured. Other forms of auth are not modified but there is improved
validation for all of the options.
2024-01-11 09:03:47 +00:00
John
bfd0258ba5
chore: refactor _convert_to_standard_langcode (#2369)
This PR is one in a series of PRs for refactoring and fixing the
`languages` parameter so it can address incorrect input by users. #2293

This PR adds a dictionary for helping map fully spelled out languages to
tesseract language codes

---------

Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2024-01-11 00:34:13 +00:00