125 Commits

Author SHA1 Message Date
Steve Canny
ddbe90f6bb
rfctr(html): clean html tests in prep for PRs to follow (#3156)
**Summary**
Clean `tests_unstructured/partition/test_html.py` in preparation for
broader refactor of HTML partitioner to follow. That refactor will
address a cluster of bugs.

Temporarily remove blank lines in tests so reordering tests in following
commit is easier to follow. Those will go back in after that.
2024-06-05 23:11:58 +00:00
Steve Canny
e4158deaff
fix(msg): use python-oxmsg for MSG email parsing (#3142)
**Summary**
`partition_msg()` previously used the `msg_parser` library for parsing
Outlook MSG email files (.msg files). The `msg_parser` library is
unmaintained and has several major shortcomings such as not being able
to parse MSG files with 8-bit encoded strings and not reliably
extracting attachments.

Use the new and permissively licenced `python-oxmsg` library instead.

**Additional Context**
For reviewability purposes, this PR temporarily places the new
`partition_msg()` implementation in `new_msg.py` and references that
implementation from `msg.py`. `new_msg.py` will be renamed to `msg.py`
in a closely following PR. This avoids a very messy interleaving of
hunks in a diff between the old and re-written `partition_msg()`
implementation.

Fixes #2481 
Fixes #3006
2024-06-05 21:12:27 +00:00
Yao You
fdb27378cb
chore: use python3 consistently in makefile (#3152)
This PR changes two `python` commands in `Makefile` to use `python3` to
be consistent with other make commands. This makes it more explicit on
which python to use when the makefile is used outside of a controlled
virtualenv where only one python exists.
2024-06-05 00:05:57 +00:00
Steve Canny
b4ee019170
rfctr: flatten test_unstructured/partition (#3073)
**Summary**
Some partitioner test modules are placed in directories by themselves or
with one other test module. This unnecessarily obscures where to find
the test module corresponding to a partitiner.

Move partitioner test modules to mirror the directory structure of
`unstructured/partition`.
2024-05-23 00:51:08 +00:00
Matt Robinson
d7608014c0
improve: add Python 3.12 support (#3033) (#3047)
### Summary

Closes #2959. Updates the dependency and CI to add support for Python
3.12.

The MongoDB ingest tests were disabled due to jobs like [this
one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333)
failing due to issues with the `bson` package. `bson` is a dependency
for the AstraDB connector, but `pymongo` does not work when `bson` is
installed from `pip`. This issue is documented by MongoDB
[here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun
off #3049 to resolve this. Issue seems unrelated to Python 3.12, though
unsure why this didn't surface previously.

Disables the `argilla` tests because `argilla` does not yet support
Python 3.12. We can add the `argilla` tests back in once the PR
references below is merged. You can still use the `stage_for_argilla`
function if you're on `python<3.12` and you install `argilla` yourself.
- https://github.com/argilla-io/argilla/pull/4837

---------

Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
2024-05-19 23:03:15 +00:00
Matt Robinson
612905e311
build: wolfi base image for Dockerfile (#3016)
### Summary

Updates the `Dockerfile` to use the Chainguard `wolfi-base` image to
reduce CVEs. Also adds a step in the docker publish job that scans the
images and checks for CVEs before publishing. The job will fail if there
are high or critical vulnerabilities.

### Testing

Run `make docker-run-dev` and then `python3.11` once you're in. And that
point, you can try:

```python
from unstructured.partition.auto import partition
elements = partition(filename="example-docs/DA-1p.pdf", skip_infer_table_types=["pdf"])
elements
```

Stop the container once you're done.
2024-05-15 22:53:15 +00:00
Steve Canny
2c7e0289aa
rfctr(pptx): extract _PptxPartitionerOptions (#2853)
**Reviewers:** Likely quicker to review commit-by-commit.

**Summary**

In preparation for adding a PPTX `Picture` shape _sub-partitioner_,
extract management of PPTX partitioning-run options to a separate
`_PptxPartitioningOptions` object similar to those used in chunking and
XLSX partitioning. This provides several benefits:
- Extract code dealing with applying defaults and computing derived
values from the main partitioning code, leaving it less cluttered and
focused on the partitioning algorithm itself.
- Allow the options set to be passed to helper objects, prominently
including sub-partitioners, without requiring a long list of parameters
or requiring the caller to couple itself to the particular option values
the helper object requires.
- Allow options behaviors to be thoroughly and efficiently tested in
isolation.
2024-04-08 19:01:03 +00:00
Roman Isecke
d6f2841ff4
feat: update dependencies and remove constraint on pydantic (#2841)
### Description
* The `consistent-deps.sh` was fixed to take into account the ingest
dependencies, causing some errors to show up. New constriants were added
to make that script pass.
* Update all requirements without constraint on pydantic, allowing the
latest version to be pulled in.
* `pikepdf` is causing a conflict but there's a fix on their `main`
branch, just need for the next release to be published. Opened up a
question here to see if we can get that out any sooner: [Do releases
happen on a
schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now
added `lxml<5` to the constraints.

A couple optimizations: 
* `constraints.in` renamed to `constraints.txt` since the whole point is
all dependencies are already pinned and the file never gets compiled
* `constraints.txt` moved to a `requirements/deps` directory as this
never gets compiled by `pip-compile`
* Other dependency files updated to reference the new location of
`base.in` and `constraints.txt`
* make file updated since it was originally written to avoid the
`base.in` and `constraints.in` file
2024-04-04 19:58:23 +00:00
Roman Isecke
4ff6a5b78e
Roman/bugfix support bedrock embeddings (#2650)
### Description
This PR resolved the following open issue:
[bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319).
To do so, the following changes were made:
* All aws configs were added as input parameters to the CLI
* These were mapped to the bedrock embedder when an embedder is
generated via `get_embedder`
* An ingest test was added to call the aws bedrock service
* Requirements for boto were bumped because the first version to
introduce the bedrock runtime, which is required to hit the bedrock
service, was introduced in version `1.34.63`, which was ahead of the
version of boto pinned.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-03-21 18:21:04 +00:00
David Potter
9177aa20a8
feature CORE-3985: add Clarifai destination connector (#2633)
Thanks to @mogith-pn from Clarifai we have a new destination connector!

This PR intends to add Clarifai as a ingest destination connector.

Access via CLI and programmatic.
Documentation and Examples.
Integration test script.
2024-03-21 16:36:21 +00:00
Steve Canny
137ea67336
feat(chunking): add include_orig_elements chunking option (#2649)
**Summary**
Add `include_orig_elements: bool = True` as a new chunking option. This
PR does not implement _adding_ original elements to chunks, only
accepting this parameter as a chunking option and assigning `True` to it
as a default value when it is omitted as a keyword argument.

Note this will need to be added in other repositories as well in order
to fully support this new option by all access methods. In particular it
will need to be added in `unstructured-api` in order to become available
via the SDKs.
2024-03-15 18:48:07 +00:00
Steve Canny
94535e353c
rfctr: prepare for adding metadata.orig_elements field (#2647)
**Summary**
Some typing modernization in `elements.py` which will get changes to add
the `orig_elements` metadata field.

Also some additions to `unit_util.py` to enable simplified mocking that
will be required in the next PR.
2024-03-14 21:31:58 +00:00
Roman Isecke
9c1c41f493
BUGFIX: fix dependencies in setup.py (#2605)
### Description
Currently the requirements associated with an extra in the `setup.py` is
being dynamically generated using the `load_requirements()` method in
the same file. This is being passed in all the `.in` files which then
get read line by line to generate the requirements associated with an
extra. Unless the `.in` file itself has a version pin, this will never
respect the `.txt` files being generated by `pip-compile`. This fix
updates all the inputs to `load_requirements()` to use the `.txt` files
themselves.
2024-03-06 18:59:08 +00:00
Michał Martyniak
b9aa4b7452
fix: Install pandoc consistently, via Makefile recipe (version that supports .rtf files as input format) (#2593)
## Problem Description
In some cases you might find yourselves in a situation when pandoc won't
be able to process an `rtf` as input file format, because older versions
simply do not support that.

```
RuntimeError: Invalid input format! Got "rtf" but expected one of these: commonmark, creole, csv, docbook, docx, dokuwiki, epub, fb2, gfm, haddock, html, ipynb, jats, jira, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki, twiki, vimwiki
```

Basically, some user may install the wrong version. The `README.md` is
not be precise enough when mentioning RTF files support:

47b35ccdd6/README.md (L120-L122)

## Example
Installing `pandoc` from a [stable repository, like
Debian](https://packages.debian.org/source/bullseye/pandoc) will give
you `2.9` and the official documentation shows clearly that support for
rtf was introduced in `2.14`
https://pandoc.org/releases.html#pandoc-2.14.2-2021-08-21

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/3d5199f1-5e39-46ad-ac90-fff9cc5543a8)

### Note that `rtf` is not there

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/de90ebaf-86f2-4b21-83fb-085e27eeea38)

### More detail

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/59fbb91f-1650-4091-bdcb-15aa035416c8)

## Proposed Solution 
- [x] I've simply added/copied `make install-pandoc` calls, mimicking
other recipes in order to ensure that `3.1.2` will be installed in all
cases. **Side note**: `make install-pandoc` calls
`./scripts/install-pandoc.sh` under the hood.
- [x] Update README file - mention that `make install-pandoc` is
recommended (`>=2.14.2`)
- [x] Verify tests that cover `rtf` cases:
47b35ccdd6/test_unstructured/file_utils/test_file_conversion.py (L14)
- [x] Update `setup_ubuntu.sh` if needed?:
47b35ccdd6/scripts/setup_ubuntu.sh (L87)
-
2024-03-04 11:02:32 +00:00
David Potter
e8ec09c8b9
feat: astra dest connector (#2571)
Thanks to Eric Hare @erichare at DataStax we have a new destination
connector.

This Pull Request implements an integration with [Astra
DB](https://datastax.com) which allows for the Astra DB Vector Database
to be compatible with Unstructured's set of integrations.

To create your Astra account and authenticate with your
`ASTRA_DB_APPLICATION_TOKEN`, and `ASTRA_DB_API_ENDPOINT`, follow these
steps:

1. Create an account at https://astra.datastax.com
2. Login and create a new database
3. From the database page, in the right hand panel, you will find your
API Endpoint
4. Beneath that, you can create a Token to be used

Some notes about Astra DB:

- Astra DB is a Vector Database which allows for high-performance
database transactions, and enables modern GenAI apps [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html)
- It supports similarity search via a number of methods [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics)
- It also supports non-vector tables / collections
2024-02-23 20:50:50 +00:00
David Potter
0f0b58dfe7
bug: remove vectara requirements (#2491)
I accidentally added Vectara to setup and makefile. But there are no
dependencies for Vectara

This removes Vectara from those files.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-02-01 20:41:53 +00:00
David Potter
c100ce28a7
feat: add Vectara destination connector (#2357)
Thanks to Ofer at Vectara, we now have a Vectara destination connector.

- There are no dependencies since it is all REST calls to API
-

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-02-01 14:38:34 +00:00
John
5adc04ac27
bug: fix typo in makefile (#2474)
fix typo in makefile: `.PHONE` -> `.PHONY`
2024-01-30 18:12:35 +00:00
Roman Isecke
a8de52e94f
feat: databricks volumes dest added (#2391)
### Description
This adds in a destination connector to write content to the Databricks
Unity Catalog Volumes service. Currently there is an internal account
that can be used for testing manually but there is not dedicated account
to use for testing so this is not being added to the automated ingest
tests that get run in the CI.

To test locally: 
```shell
#!/usr/bin/env bash

path="testpath/$(uuidgen)"

PYTHONPATH=. python ./unstructured/ingest/main.py local \
    --num-processes 4 \
    --output-dir azure-test \
    --strategy fast \
    --verbose \
    --input-path example-docs/fake-memo.pdf \
    --recursive \
    databricks-volumes \
    --catalog "utic-dev-tech-fixtures" \
    --volume "small-pdf-set" \
    --volume-path "$path" \
    --username "$DATABRICKS_USERNAME" \
    --password "$DATABRICKS_PASSWORD" \
    --host "$DATABRICKS_HOST"
```
2024-01-23 01:25:51 +00:00
David Potter
bc791d53f4
feat: add opensearch source and destination connector (#2349)
Adds OpenSearch as a source and destination.

Since OpenSearch is a fork of Elasticsearch, these connectors rely
heavily on inheriting the Elasticsearch connectors whenever possible.

- Adds OpenSearch source connector to be able to ingest documents from
OpenSearch.
- Adds OpenSearch destination connector to be able to ingest documents
from any supported source, embed them and write the embeddings /
documents into OpenSearch.
- Defines an example unstructured elements schema for users to be able
to setup their unstructured OpenSearch indexes easily.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-17 04:31:49 +00:00
David Potter
76e0d10e61
feat: add MongoDB source connector (#2393)
Adds MongoDB as a source (we already had it as a destination connector)

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-16 20:56:29 +00:00
rvztz
950e5d68f9
feat: adds postgresql/sqlite destination connector (#2005)
- Adds a destination connector to upload processed output into a
PostgreSQL/Sqlite database instance.
- Users are responsible to provide their instances. This PR includes a
couple of configuration examples.
- Defines the scripts required to setup a PostgreSQL instance with the
unstructured elements schema.
- Validates postgres/pgvector embedding storage and retrieval

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-04 19:33:16 +00:00
ryannikolaidis
dd1443ab6f
feat: add Qdrant ingest destination connector (#2338)
This PR intends to add [Qdrant](https://qdrant.tech/) as a supported
ingestion destination.

- Implements CLI and programmatic usage.
- Documentation update
- Integration test script

---
Clone of #2315 to run with CI secrets

---------

Co-authored-by: Anush008 <anushshetty90@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2024-01-02 22:08:20 +00:00
David Potter
4b8352e0f5
feat: add chroma destination connector (#2240)
Adds Chroma (also known as ChromaDB) as a vector destination.

Currently Chroma is an in-memory single-process oriented library with
plans of a hosted and/or more production ready solution
-https://docs.trychroma.com/deployment

Though they now claim to support multiple Clients hitting the database
at once, I found that it was inconsistent. Sometimes multiprocessing
worked (maybe 1 out of 3 times) But the other times I would get
different errors. So I kept it single process.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2023-12-19 16:58:23 +00:00
cragwolfe
bd8a74d686
chore: shell scripts default indent of 2 instead of 4 (#2287)
Given the tendency for shell scripts to easily enter into a few levels
of indentation and long line lengths, update the default to 2 spaces.
2023-12-19 07:48:21 +00:00
Roman Isecke
76efcf4dd7
chore: add shfmt (#2246)
### Description
Given all the shell files that now exist in the repo, would be nice to
have linting/formatting around them (in addition to the existing
shellcheck which doesn't do anything to format the shell code). This PR
introduces `shfmt` to both check for changes and apply formatting when
the associated make targets are called.
2023-12-12 01:04:15 +00:00
David Potter
cde11d1eb0
feat: Add sftp source connector (#2163)
Adds source connector for SFTP which uses fsspec and paramiko via
fsspec. Paramiko is the standard sftp package for python used in pysftp
etc...

```
--username foo \
--password bar \
--remote-url sftp://localhost:47474/upload/
```

Will only download a specifically requested file if it has an extension.
(i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will
treat any other remote_url as a folder path. This is intentional.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2023-12-07 19:33:19 +00:00
Roman Isecke
c5cb216ac8
chore: lint for print statements in ingest code (#2215)
### Description
Given the filtering in the ingest logger, anything going to console
should go through that. This adds a linter that only checks for
`print()` statements in the ingest code and ignored it elsewhere for
now.
2023-12-05 16:42:23 +00:00
rvztz
ce905dd098
feat: Weaviate destination connector (#1963)
Closes #1781.
- Adds a Weaviate destination connector
- The connector receives a host for the weaviate instance and a weaviate
class name.
- Defines a weaviate schema for json elements.
- Defines the pre-processing to conform unstructured's schema to the
proposed weaviate schema.
2023-12-01 22:27:41 +00:00
Ahmet Melek
ed08773de7
feat: add pinecone destination connector (#1774)
Closes https://github.com/Unstructured-IO/unstructured/issues/1414
Closes #2039 

This PR:
- Uses Pinecone python cli to implement a destination connector for
Pinecone and provides the ingest readme requirements
[(here)](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest#the-checklist)
for the connector
- Updates documentation for the s3 destination connector
- Alphabetically sorts setup.py contents
- Updates logs for the chunking node  in ingest pipeline
- Adds a baseline session handle implementation for destination
connectors, to be able to parallelize their operations
- For the
[bug](https://github.com/Unstructured-IO/unstructured/issues/1892)
related to persisting element data to ingest embedding nodes; this PR
tests the
[solution](https://github.com/Unstructured-IO/unstructured/pull/1893)
with its ingest test
- Solves a bug on ingest chunking params with [bugfix on chunking params
and implementing related
test](69e1949a6f)

---------

Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2023-11-29 22:37:32 +00:00
rvztz
50b1431c9e
rvztz/hubspot ingest connector (#1760)
Closes #1843 

Ingest connector for HubSpot. Supports:
- Calls: Logs from calls related to contacts, companies and tickets
- Communications: Logs from SMS/Whatsapp related to contacts, companies
and tickets
- Notes: Notes related to CRM notes
- Products: CRM products
- Emails: Logs from emails sent to CRM objects.
- Tasks: CRM tasks

From each record, `body/`description`information is grabbed. When a
title property is available, this is registered at the beggining of the
output file. The CLI receives three params:
- `api-token`: [Private
app](https://developers.hubspot.com/docs/api/private-apps) token.
- `object-types: One of the noted supported objects in the form of a
comma separated list: `calls,products,tasks`
- `custom-properties`: Custom properties to grab information from. Must
be in the form
`<object_type>:<custom_property_id>,<object_type>:<custom_property_id>`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rvztz <rvztz@users.noreply.github.com>
2023-11-28 23:07:57 +00:00
cragwolfe
fa27408c4f
chore: fix Makefile ingest targets (#2051)
Fixes the Makefile `ingest-` targets were broken in
https://github.com/Unstructured-IO/unstructured/pull/1799/files.

**Test Instructions**

for maketarget in $(grep .PHONY Makefile | grep install-ingest | perl -p
-e 's/.PHONY://' | tr -d '\n'); do
      echo $maketarget; make $maketarget
    done
2023-11-09 21:55:27 -08:00
Yao You
69265685ea
build(deps): add makefile to requirements (#1295)
This PR resolves #1294 by adding a Makefile to compile requirements.
This makefile respects the dependencies between file and will compile
them in order. E.g., extra-*.txt will be compiled __after__ base.txt is
updated.

Test locally by simply running `make pip-compile` or `cd requirements &&
make clean && make all`

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-11-02 10:17:35 -05:00
qued
b08562ba1a
tests: separate chipper tests (#1939)
Separates chipper tests to speed up testing and CI.
2023-10-31 21:02:00 +00:00
John
670687bb67
update .pre-commit-config to match linting used by CI (#1906)
Closes #1905 
.pre-commit-config.yaml does not match pyproject.toml, which causes
unnecessary/undesirable formatting changes. These changes are not
required by CI, so they should not have to be made.

**To Reproduce**
Install pre-commit configuration as described
[here](https://github.com/Unstructured-IO/unstructured#installation-instructions-for-local-development).
Make a commit and something like the following will be logged:
```
check for added large files..............................................Passed
check toml...........................................(no files to check)Skipped
check yaml...........................................(no files to check)Skipped
check json...........................................(no files to check)Skipped
check xml............................................(no files to check)Skipped
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
mixed line ending........................................................Passed
black....................................................................Passed
ruff.....................................................................Failed
- hook id: ruff
- files were modified by this hook
```

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-10-27 13:24:55 -05:00
cragwolfe
8aceda97dd
test: print slowest unittests (#1911)
Show which tests are slowing things down when running `make test`:

E.g., from the CI run in this PR:

```
2023-10-27T05:51:05.6256039Z 105.12s setup    test_unstructured/partition/pdf_image/test_pdf.py::test_chipper_has_hierarchy
2023-10-27T05:51:05.6257784Z 93.47s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_hi_res_ocr_mode_with_table_extraction[entire_page]
2023-10-27T05:51:05.6259866Z 93.09s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_hi_res_ocr_mode_with_table_extraction[individual_blocks]
2023-10-27T05:51:05.6261818Z 31.70s call     test_unstructured/partition/epub/test_epub.py::test_add_chunking_strategy_on_partition_epub_non_default
2023-10-27T05:51:05.6263774Z 17.22s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf[hi_res-expected1-pdf-filename]
2023-10-27T05:51:05.6265658Z 17.13s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf[hi_res-expected1-pdf-spool]
2023-10-27T05:51:05.6273195Z 16.95s call     test_unstructured/partition/pdf_image/test_image.py::test_add_chunking_strategy_on_partition_image_hi_res
2023-10-27T05:51:05.6275118Z 16.77s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf[hi_res-expected1-pdf-rb]
2023-10-27T05:51:05.6276759Z 14.64s call     test_unstructured/partition/test_text.py::test_partition_text_detects_more_than_3_languages
2023-10-27T05:51:05.6278381Z 13.86s call     test_unstructured/partition/pdf_image/test_image.py::test_partition_image_with_multipage_tiff
2023-10-27T05:51:05.6280137Z 13.51s call     test_unstructured/partition/test_auto.py::test_auto_partition_pdf_from_filename[False-None]
2023-10-27T05:51:05.6281995Z 13.41s call     test_unstructured/partition/test_html_partition.py::test_add_chunking_strategy_on_partition_html
2023-10-27T05:51:05.6283640Z 12.80s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_with_copy_protection
2023-10-27T05:51:05.6285305Z 12.46s call     test_unstructured/partition/pdf_image/test_image.py::test_add_chunking_strategy_on_partition_image
2023-10-27T05:51:05.6287250Z 12.39s call     test_unstructured/partition/pdf_image/test_image.py::test_partition_image_hi_res_ocr_mode_with_table_extraction[individual_blocks]
2023-10-27T05:51:05.6289347Z 12.14s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_from_file_with_hi_res_strategy_custom_metadata_date
2023-10-27T05:51:05.6291329Z 12.12s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_with_hi_res_strategy_custom_metadata_date
2023-10-27T05:51:05.6293388Z 12.12s call     test_unstructured/partition/test_auto.py::test_auto_partition_pdf_from_file[True-application/pdf]
2023-10-27T05:51:05.6294869Z 12.08s call     test_unstructured/partition/test_auto.py::test_auto_with_page_breaks
2023-10-27T05:51:05.6296396Z 12.02s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_with_hi_res_strategy_metadata_date
2023-10-27T05:51:05.6298278Z 11.99s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_from_file_with_hi_res_strategy_metadata_date
```
2023-10-27 11:40:55 -05:00
Roman Isecke
4802332de0
Roman/optimize ingest ci (#1799)
### Description
Currently the CI caches the CI dependencies but uses the hash of all
files in `requirements/`. This isn't completely accurate since the
ingest dependencies are installed in a later step and don't affect the
cached environment. As part of this PR:
* ingest dependencies were isolated into their own folder in
`requirements/ingest/`
* A new cache setup was introduced in the CI to restore the base cache
-> install ingest dependencies -> cache it with a new id
* new make target created to install all ingest dependencies via `pip
install -r ...`
* updates to Dockerfile to use `find ...` to install all dependencies,
avoiding the need to update this when new deps are added.
* update to pip-compile script to run over all `*.in` files in
`requirements/`
2023-10-24 14:54:00 +00:00
Roman Isecke
63861f537e
Add check for duplicate click options (#1775)
### Description
Given that many of the options associated with the `Click` based cli
ingest commands are added dynamically from a number of configs, a check
was incorporated to make sure there were no duplicate entries to prevent
new configs from overwriting already added options.

### Issues that were found and fixes:
* duplicate api-key option set on Notion command conflicts with api key
used for unstructured api. Added notion prefix.
* retry logic configs had duplicates in biomed. Removed since this is
not handled by the pipeline.
2023-10-20 14:00:19 +00:00
Mallori Harrell
00635744ed
feat: Adds local embedding model (#1619)
This PR adds a local embedding model option as an alternative to using
our OpenAI embedding brick. This brick uses LangChain's
HuggingFacEmbeddings.
2023-10-19 11:51:36 -05:00
Roman Isecke
b265d8874b
refactoring linting (#1739)
### Description
Currently linting only takes place over the base unstructured directory
but we support python files throughout the repo. It makes sense for all
those files to also abide by the same linting rules so the entire repo
was set to be inspected when the linters are run. Along with that
autoflake was added as a linter which has a lot of added benefits such
as removing unused imports for you that would currently break flake and
require manual intervention.

The only real relevant changes in this PR are in the `Makefile`,
`setup.cfg`, and `requirements/test.in`. The rest is the result of
running the linters.
2023-10-17 12:45:12 +00:00
ryannikolaidis
d9a0bd741a
fix: build test failures (#1748)
* Fix missing HF_TOKEN when running containerized test for the build
process
* Fix pytest args when running specific test

## Testing
Example run of the HF_TOKEN assgned for the containerized test in the
build process:
https://github.com/Unstructured-IO/unstructured/actions/runs/6504556437/job/17666669155

Example run of the pytest args working for the arm test (ran in a new
workflow for testing on push):
https://github.com/Unstructured-IO/unstructured/actions/runs/6504213010
2023-10-13 01:08:27 -07:00
Steve Canny
d726963e42
serde tests round-trip through JSON (#1681)
Each partitioner has a test like `test_partition_x_with_json()`. What
these do is serialize the elements produced by the partitioner to JSON,
then read them back in from JSON and compare the before and after
elements.

Because our element equality (`Element.__eq__()`) is shallow, this
doesn't tell us a lot, but if we take it one more step, like
`List[Element] -> JSON -> List[Element] -> JSON` and then compare the
JSON, it gives us some confidence that the serialized elements can be
"re-hydrated" without losing any information.

This actually showed up a few problems, all in the
serialization/deserialization (serde) code that all elements share.
2023-10-12 19:47:55 +00:00
Trevor Bossert
6acd06987b
Remove extra index url from docs (#1711)
It’s no longer required to specify the extra index url as we utilize a
different method of gathering install anonymous analytics.
2023-10-11 19:34:49 +00:00
Trevor Bossert
ce206f1f85
add extra-index-url for scarf anonymous tracking (#1668)
This adds extra-index-url to our docs to allow for anonymous install
analytics to help us understand and improve our product.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-07 01:16:38 +00:00
Benjamin Torres
e0201e9a11
feat/add sources from unstructured inference (#1538)
This PR adds support for `source` property from
`unstructured_inference`, allowing the user to be able to see the origin
of the data under `detection_origin`field environment variable
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true

In order to try this feature you can use this code:
```
from unstructured.partition.pdf import partition_pdf_or_image

yolox_elements = partition_pdf_or_image(filename='example-docs/loremipsum-flat.pdf', strategy='hi_res', model_name='yolox')

sources = [e.detection_origin for e in yolox_elements]
print(sources)
```
And will print 'yolox' as source for all the elements
2023-10-05 20:26:47 +00:00
Roman Isecke
bd49cfbab7
feat: adds Azure Cognitive Search (full text) destination connector (#1459)
### Description
New [Azure Cognitive
Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search)
destination connector added. Writes each json element from the created
json files via partition and writes that content to an index.

**Bonus bug fix:** Due to a recent change where the default version of
python used in the repo was bumped to `3.10` from `3.8`, this means
running `pip-compile` now runs it against that version rather than the
lowest we support which is still `3.8`. This breaks the setup for those
lower versions because some of the versions pulled in by `pip-compile`
exist for `3.10` but not `3.8`. `pip-compile` was updates to run as a
script that checks the version of python being used first, which helps
guarantee that all dependencies meet the minimum python version
requirement.

Closes out https://github.com/Unstructured-IO/unstructured/issues/1466
2023-09-25 10:27:42 -04:00
Steve Canny
b54994ae95
rfctr: docx partitioning (#1422)
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.

This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain-knowledge.
2023-09-19 15:32:46 -07:00
Yuming Long
f962a1e57d
fix: fix ingest paddle hanging issue (#1441)
## Summary

Ingest tests are having paddle OOM issue which cause the tests to hang
forever. The fix here is to remove paddle from ci and set both OCR env
`TABLE_OCR` and `ENTIRE_PAGE_OCR` to `tesseract`. (will have follow up
PR to investigate why this is failing)

## Test
please check ingest tests in CI
2023-09-19 17:20:23 +00:00
Yao You
b534b2a6cd
Chore: bump inference package version to 0.5.28 and new release (#1355)
This bump removes the preprocessing before table structure extraction
and improves the OCR results for tables.

---------

Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2023-09-15 18:26:15 -07:00
Trevor Bossert
09a0958f90
Feat: CORE-1269 - Install paddlepaddle wheel dependent on arch, supporting aarch64 (#1350)
Testing instructions

on Apple silicon

```
make docker-build
docker run -it unstructured:dev bash
python3
```
Then run the test in this PR
https://unstructured-ai.atlassian.net/browse/CORE-1269

You should get output like shown in ticket

Run the same process on your local machine (not inside docker) with same
test to verify the non aarch64 paddlepaddle got installed correctly

---------

Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
2023-09-15 17:05:48 -07:00