56 Commits

Author SHA1 Message Date
qued
808b4ced7a
build(deps): remove ebooklib (#1878)
* **Removed `ebooklib` as a dependency.** `ebooklib` is licensed under
AGPL3, which is incompatible with the Apache 2.0 license, so it is
being removed.
2023-10-26 12:22:40 -05:00
qued
d79f633ada
build(deps): add typing extensions dep (#1835)
Closes #1330.

Added `typing-extensions` as an explicit dependency (it was previously
an implicit dependency via `dataclasses-json`).

This dependency should be explicit, since we import from it directly in
`unstructured.documents.elements`. This has the added benefit that
`TypedDict` will be available for Python 3.7 users.

Other changes:
* Ran `pip-compile`
* Fixed a bug in `version-sync.sh` that caused an error when syncing
from a release version to a dev version.

#### Testing:

To test the Python 3.7 functionality, in a Python 3.7 environment
install the base requirements and run
```python
from unstructured.documents.elements import Element

```
This also works on `main` as `typing_extensions` is a requirement.
However if you `pip uninstall typing-extensions`, and run the above
code, it should fail. So this update makes sure `typing-extensions`
doesn't get lost if the other dependencies move around.

To reproduce the `version-sync.sh` bug that was fixed, in `main`,
increment the most recent version in `CHANGELOG.md` while leaving the
version in `__version__.py` unchanged. Then add the following lines to
`version-sync.sh` to simulate a particular set of circumstances,
starting on line 114:

```
MAIN_IS_RELEASE=true
CURRENT_BRANCH="something-not-main"
```

Then run `make version-sync`.

The expected behavior is that the version in `__version__.py` is changed
to the new version to match `CHANGELOG.md`, but instead it exits with an
error.

The fix was to only do the version incrementation check when the script
is running in `-c` or "check" mode.
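
As a rough illustration of that fix, the sketch below gates the
incrementation check on check mode. It is written in Python rather than
the shell used by `version-sync.sh`, and the function name, parameters,
and exact condition are assumptions for illustration, not the script's
actual logic.

```python
def version_sync_exit_code(
    check_mode: bool,
    main_is_release: bool,
    current_branch: str,
    version_incremented: bool,
) -> int:
    """Return 0 on success and 1 when the incrementation check fails.

    Hypothetical model of the fix: every name here is illustrative.
    """
    # The incrementation check only applies in check mode. In sync mode the
    # script is expected to rewrite the version file instead of exiting.
    if (
        check_mode
        and main_is_release
        and current_branch != "main"
        and not version_incremented
    ):
        return 1
    return 0


# Sync mode under the simulated circumstances no longer exits with an error.
sync_result = version_sync_exit_code(False, True, "something-not-main", False)
# Check mode still reports the unchanged version.
check_result = version_sync_exit_code(True, True, "something-not-main", False)
```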
2023-10-24 19:19:09 +00:00
Yuming Long
01a0e003d9
Chore: stop passing extract_tables to inference and note table regression on entire doc OCR (#1850)
### Summary

A follow-up to
https://github.com/Unstructured-IO/unstructured/pull/1801. I forgot to
remove the lines that pass `extract_tables` to inference there. This PR
removes them and notes the table regression that occurs if we only run
OCR once for the entire doc.

**Tech details:**
* stop passing `extract_tables` parameter to inference
* added table extraction ingest test for image, which was skipped
before, and the "text_as_html" field contains the OCR output from the
table OCR refactor PR
* replaced `assert_called_once_with` with `call_args` so that the unit
tests don't need to test additional parameters
* added `error_margin` as an ENV variable for comparing the bounding boxes
of `ocr_region` with `table_element`
* added more tests for tables and noted the table regression in test for
partition pdf
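
The `call_args` change in the list above can be sketched with
`unittest.mock` as follows. The `partition` function and its `ocr_mode`
keyword are hypothetical stand-ins rather than the repo's real code; the
point is only that `call_args` lets a test inspect the arguments it
cares about without listing every parameter.

```python
from unittest import mock


def partition(process, filename):
    # Hypothetical stand-in: forwards the file name plus an extra keyword
    # that the test below does not want to spell out.
    return process(filename, ocr_mode="entire_page")


process = mock.Mock(return_value=[])
partition(process, "example.pdf")

# assert_called_once_with would need every argument, including ocr_mode.
# call_args unpacks into (args, kwargs), so only selected values are checked.
args, kwargs = process.call_args
assert args[0] == "example.pdf"
assert "extract_tables" not in kwargs
```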

### Test
* To verify that `extract_tables` is no longer passed to inference, run the
test `test_partition_pdf_hi_res_ocr_mode_with_table_extraction` before this
branch: you will see a warning like `Table OCR from get_tokens method
will be deprecated....`, which means the table OCR in the inference repo
was called. This branch removes the warning.
2023-10-24 17:13:28 +00:00
Roman Isecke
4802332de0
Roman/optimize ingest ci (#1799)
### Description
Currently the CI caches the CI dependencies but uses the hash of all
files in `requirements/`. This isn't completely accurate since the
ingest dependencies are installed in a later step and don't affect the
cached environment. As part of this PR:
* ingest dependencies were isolated into their own folder in
`requirements/ingest/`
* A new cache setup was introduced in the CI to restore the base cache
-> install ingest dependencies -> cache it with a new id
* new make target created to install all ingest dependencies via `pip
install -r ...`
* updates to Dockerfile to use `find ...` to install all dependencies,
avoiding the need to update this when new deps are added.
* update to pip-compile script to run over all `*.in` files in
`requirements/`
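
To illustrate why the ingest requirements should not feed the base cache
key, here is a minimal Python sketch of a hash over `requirements/` that
skips the `ingest/` subfolder. The CI itself computes its key in the
workflow configuration, so the function name and hashing details below
are assumptions for illustration only.

```python
import hashlib
from pathlib import Path


def requirements_hash(root: Path, exclude_dir: str = "ingest") -> str:
    """Hash every *.txt file under root, skipping one subfolder (hypothetical helper)."""
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*.txt")):
        relative = path.relative_to(root)
        if exclude_dir in relative.parts:
            # Files under the excluded folder do not affect the base cache key.
            continue
        digest.update(relative.as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

With a key like this, adding or changing an ingest requirement leaves
the base cache valid, while a change to a base requirement invalidates it.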
2023-10-24 14:54:00 +00:00
Jack Retterer
b8f24ba67e
Added AWS Bedrock embeddings (#1738)
Summary: Added support for AWS Bedrock embeddings. Leverages
"amazon.titan-tg1-large" for the embedding model.

Test

- find your AWS secret access key and key ID; make sure the account has
access to Bedrock's Titan embedding model
- follow the instructions in
d5e797cd44/docs/source/bricks/embedding.rst (bedrockembeddingencoder)

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>
2023-10-18 19:36:51 -05:00
Roman Isecke
8821689f36
Roman/s3 minio all cloud support (#1606)
### Description
Exposes the endpoint URL as an access kwarg when using the s3 filesystem
library via the fsspec abstraction. This allows any non-AWS data
provider that supports the S3 protocol (e.g. MinIO) to be used with the
s3 connector.

Closes out https://github.com/Unstructured-IO/unstructured/issues/950

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-03 14:31:28 -04:00
Roman Isecke
b2e997635f
roman/es ingest test fixes (#1610)
### Description
update elasticsearch docker setup to use docker-compose

Would close out
https://github.com/Unstructured-IO/unstructured/issues/1609
2023-10-03 10:39:33 -04:00
Austin Walker
0abebb5fe6
fix: fix benchmark script when DOCKER_TEST=true (#1515)
The home directory for our dockerfile changed and broke this script. To
verify, try running the benchmark script:

```
export DOCKER_TEST=true
./scripts/performance/benchmark.sh
```

I'll pull in the latest changelog before merging.
2023-10-02 16:08:26 +00:00
Yao You
ad59a879cc
chore: bump inference to 0.6.6 (#1563)
- bump `unstructured-inference` to `0.6.6`
- specify default model name for element detection to be
`detectron2_onnx` to keep current behavior
- NOTE: by default the updated inference package uses yolox as the
element detection model; this will be evaluated and enabled in a
separate PR

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2023-09-29 19:09:57 +00:00
Yao You
af7639e23f
ci: add retry to elastic search ingest test (#1581)
Occasionally the es test can fail because the index fails to be created
on the first try. Experiments show that adding a timeout doesn't help but
adding a retry mitigates the issue. See history of commits in branch:
yao/bump-inference-to-0.6.6
https://github.com/Unstructured-IO/unstructured/pull/1563
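
A retry of this kind can be sketched in Python as below. The actual
change lives in the ingest test setup, so the `retry` helper, its
parameters, and the default attempt count are assumptions for
illustration, not the code from this commit.

```python
import time


def retry(action, attempts=3, delay_seconds=0.0):
    """Call action until it succeeds or attempts run out (hypothetical helper)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                # Pause before trying again; no pause after the final failure.
                time.sleep(delay_seconds)
    raise last_error
```

Wrapping the index creation in such a helper lets a failure on the first
try be followed by another attempt instead of failing the test run.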

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2023-09-29 13:42:21 -05:00
Roman Isecke
bd49cfbab7
feat: adds Azure Cognitive Search (full text) destination connector (#1459)
### Description
New [Azure Cognitive
Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search)
destination connector added. It takes each JSON element from the JSON
files created via partition and writes that content to an index.

**Bonus bug fix:** The default version of Python used in the repo was
recently bumped from `3.8` to `3.10`, so running `pip-compile` now
resolves against `3.10` rather than the lowest version we support, which
is still `3.8`. This breaks the setup for those lower versions because
some of the versions pulled in by `pip-compile` exist for `3.10` but not
`3.8`. `pip-compile` was updated to run as a script that checks the
version of Python being used first, which helps guarantee that all
dependencies meet the minimum Python version requirement.
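
The version check described above might look like the following Python
sketch. The real change is a script around `pip-compile`, and how it
reacts to a mismatch is not stated here, so the constant and function
names are assumptions for illustration.

```python
import sys

# Lowest supported version named in the message above.
MINIMUM_SUPPORTED = (3, 8)


def is_minimum_python(version_info=sys.version_info, minimum=MINIMUM_SUPPORTED):
    """Return True when the interpreter's major.minor matches the minimum.

    Hypothetical helper: compiling against the minimum version keeps the
    pinned dependencies installable on every supported Python.
    """
    return tuple(version_info[:2]) == tuple(minimum)
```

A wrapper script could call `is_minimum_python()` before compiling and
stop early when it returns `False`.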

Closes out https://github.com/Unstructured-IO/unstructured/issues/1466
2023-09-25 10:27:42 -04:00
ryannikolaidis
ca01b30c07
ci: more reliable release version alerts (#1479)
2023-09-22 21:19:26 +00:00
Steve Canny
b54994ae95
rfctr: docx partitioning (#1422)
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.

This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain-knowledge.
2023-09-19 15:32:46 -07:00
Trevor Bossert
09a0958f90
Feat: CORE-1269 - Install paddlepaddle wheel dependent on arch, supporting aarch64 (#1350)
Testing instructions

on Apple silicon

```
make docker-build
docker run -it unstructured:dev bash
python3
```
Then run the test in this PR
https://unstructured-ai.atlassian.net/browse/CORE-1269

You should get output like that shown in the ticket.

Run the same process on your local machine (not inside docker) with the
same test to verify that the non-aarch64 paddlepaddle got installed correctly.

---------

Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
2023-09-15 17:05:48 -07:00
ryannikolaidis
ad69d93d53
ci: add new release version alert (#1413)
2023-09-15 07:05:00 +00:00
shreyanid
791adf459d
stop printing all commands in version-sync script (#1390)
### Summary

Remove -x in version-sync script to stop printing all commands and
arguments and improve readability.

### Test

`make check` and `make check-version` no longer print all the commands
and arguments.

(unstructured) shreyanid@Shreyas-MBP-2 unstructured % make check-version 
scripts/version-sync.sh -c \
                -f "unstructured/__version__.py" semver
From github.com:Unstructured-IO/unstructured
 * branch              main       -> FETCH_HEAD
version sync would make no changes to unstructured/__version__.py.
2023-09-12 15:05:26 -07:00
ryannikolaidis
95c3e17af0
fix: version-sync (#1266)
2023-09-01 06:50:05 +00:00
Yao You
b504a48e06
dev: add py-spy profiling (#1251)
This PR adds a new developer tool for profiling performance: `py-spy`.
Additionally it adds a new make command to start a docker with your
local `unstructured` repo mounted for quick testing code in a Rocky
Linux environment (see usage below for intent).

### py-spy

It is a sampling profiler (https://github.com/benfred/py-spy) and in
practice usually provides more readily usable information than the
commonly used `cProfile`. It also supports output in `speedscope` format,
[which](https://github.com/jlfwong/speedscope#usage) provides a rich
view of the profiling result.

### usage

The new tool is added to the existing `profile.sh` script and is readily
discoverable in the interactive interface. When you select to view the
new speedscope format profile, it shows up in your local browser,
provided you followed the readme to install speedscope locally via
`npm install -g speedscope`.

On macOS the profiling tool needs superuser privilege. If you are not
comfortable with that feel free to run the profiling inside a Linux
container if your local dev env is macOS.
2023-08-31 19:26:29 +00:00
cragwolfe
6ad497136d
build: docker image fix (#1245)
Moving to a non-root user in the docker image caused a failure in the
publication workflow.

This fix was used to publish the 0.10.9 unstructured image in this
workflow:

https://github.com/Unstructured-IO/unstructured/actions/runs/6020624226/job/16332230987
2023-08-29 23:27:52 -07:00
Trevor Bossert
e4535d29ca
Set user for container to same as api image. (#1239)
This is security best practice, a user can override this with their own
Dockerfile if required.
2023-08-30 01:01:44 +00:00
cragwolfe
d19183f442
build(lint): don't check version in main against self (#1123)
If on the main branch already, it does not make sense to check if the latest commit is the same non-dev version.

This fixes an annoyance where the CI Lint job would fail on release main commits, but besides that was not causing any other issues.
2023-08-15 17:57:59 +00:00
Ahmet Melek
627f78c16f
feat: airtable connector (#1012)
* add the first version of airtable connector

* change imports as inline to fail gracefully in case of lacking dependency

* parse tables as csv rather than plain text

* add relevant logic to be able to use --airtable-list-of-paths

* add script for creation of resources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings

* fix ingest test names

* add scripts for the large table test

* remove large table test from diff test

* make base and table ids explicit

* add and remove comments

* use -ne instead of !=

* update code based on the recent ingest refactor, update changelog and version

* shellcheck fix

* update comments

* update check-num-rows-and-columns-output error message

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* update help comments

* update help comments

* update help comments

* update workflows to set auth tokens and to run make install

* add comments on create_scale_test_components

* separate component ids from the test script, add comments to document test component creation

* add LARGE_BASE test, implement LARGE_BASE component creation, replace component id

* shellcheck fixes

* shellcheck fixes

* update docs

* update comment

* bump version

* add wrongly deleted file

* sort columns before saving to process

* Update ingest test fixtures (#1098)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-08-11 12:02:51 -07:00
shreyanid
463c498c78
Update make check-version script to fail if release version is unchanged (#1039)
* TEMP adding git current release check

* working, checks version file against current release

* clean up comments

* shellcheck
2023-08-07 21:21:11 -07:00
Ronny H
7a05ef2cd9
Python script to collect environment for debugging issues (#989)
* Tested on Mac, Windows & Rocky Linux OS
* Updated README to include bugs reporting script
2023-08-02 22:54:43 +00:00
Yuming Long
df1ba39905
Chore: add uns api repo unittests (#954)
* stage

* git clone

* ci ignore markdown file

* make install

* use env instead

* remove md

* add script

* wrong env value

* add note

* maybe don't rm

* no cd../

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-07-26 20:55:35 +00:00
Ahmet Melek
b7674fb97e
feat: confluence connector (cloud) (#906)
* Add confluence connector and an example script

* add test script, add dependency installations

* add authentication secret variables for ci tests and actions

* add dependency installation commands for workflows

* add dependency installation commands for workflows

* Update ingest test fixtures (#907)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* add add ingest test fixtures update workflow for python 3.10, update example script with dummy values

* change workflow name to avoid confusion

* change workflow name to avoid confusion

* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update

* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions

* Update ingest test fixtures (#911)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* revert back the test python version matrix

* recompile dependencies

* modifications for shellcheck

* update changelog and version

* changelog and version

* remove comments

* Update ingest test fixtures (#915)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* add the option to state the number of spaces to be fetched

* add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users

* add help message

* add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test

* change test names

* rename connector arg

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* change arg name for connector

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* add comment to example

* change arg names

* add new tests to ingest test

* shellcheck remove redundant statement

* Update ingest test fixtures (#932)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* Update ingest test fixtures (#936)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* linting

* change file extensions to parse as html

* Update ingest test fixtures (#943)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* remove old fixtures

* update version to 0.8.2-dev3

* change file to trigger CI

* change file to trigger CI

* change file to trigger CI

* change file to trigger CI

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-18 19:29:41 +01:00
Ahmet Melek
5ea216cf07
feat: elasticsearch connector (#817)
2023-07-01 17:45:28 +00:00
ryannikolaidis
e08936b6fb
chore: update all bash scripts to use shebang: /usr/bin/env bash (#779)
2023-06-20 16:00:55 -07:00
cragwolfe
2989f53358
chore: bump to python 3.8.17 (#766)
The images pushed to quay.io will now have python 3.8.17 rather than python 3.8.15.
2023-06-16 11:17:03 -07:00
Yuming Long
2fbb1ccd30
Chore(ingest) : add tests on PDFs with fast strategy (#614)
Summary
* Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted
* Added PDFs ingest tests with fast strategy with new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh
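
As an illustration of ordering by coordinates, the sketch below sorts
extracted items top-to-bottom and then left-to-right. The
`PositionedText` type, the sort key, and the assumption that `y` grows
downward are all hypothetical; the partitioning code may order elements
differently.

```python
from typing import List, NamedTuple


class PositionedText(NamedTuple):
    # Hypothetical container for a piece of text and its top-left corner.
    text: str
    x: float
    y: float


def sort_by_position(elements: List[PositionedText]) -> List[PositionedText]:
    # Sorting on an explicit (y, x) key gives the same order on every run,
    # independent of the order in which the elements were extracted.
    return sorted(elements, key=lambda element: (element.y, element.x))
```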

Updated ingest tests procedure:

* Processing files with hi_res strategy, and preserve downloads to repo files-ingest-download/<ingest_test_name>
* Reprocessing all PDFs with fast strategy from local file files-ingest-download, the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name>
Test
* Reproduce the tests with ./scripts/ingest-test-fixtures-update.sh; no fixture updates are expected. No secret tokens are needed either, since the relevant tests won't produce PDFs.
2023-06-12 19:02:48 +00:00
ryannikolaidis
dabda67c8f
fix: ingest-test-fixtures-update script to pass env vars (#697)
2023-06-08 04:48:49 +00:00
ryannikolaidis
7d157c1ede
test: add benchmark script (#638)
2023-06-05 09:14:43 -07:00
ryannikolaidis
bdef4fd398
test: adds profiling script (#661)
2023-06-01 21:26:05 +00:00
Yuming Long
fc59a043b7
Chore: Support epub tests in docker image (#630)
* docker works

* more epub tests

* changelog version

* support epub + odt + rtf

* update dockerfile

* revert..

* install pandoc on ci env

* pandoc docker grab based on arch

* move arch into image

* move back to base image
2023-05-26 15:38:48 -04:00
qued
c82bad1061
build(deps): avoid version conflicts (#636)
Addresses #631.

* Uses constraints to keep dependency versions more consistent.
* Moves all dependencies to .in files which are then ingested by setup.py.
* Adds script to check consistency of all extras.
* Adds consistency check to CI.

I should note that while it shouldn't be possible to cause a conflict between base.txt and any of the extras (because base.txt constrains all the extras) it is possible to get a conflict between two of the extras files. There are ways of trying to avoid that (like constraining each file by all the files that have already been processed before it in the order given in the make pip-compile target) but the ones I could think of seemed a little overwrought, and come with problems of their own. If a conflict arises, it should be flagged by CI or locally with make check-deps. When/if that happens, you can resolve the conflict by adding appropriate global constraints in requirements/constraints.txt.

Also note that if fileA.in is constrained by fileB.txt, then fileB.in should be compiled before fileA.in in the make pip-compile target. Otherwise fileA.in will be compiled with the old version of fileB.txt which can cause conflicts or keep dependencies from being updated properly.
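
The ordering rule above is a topological sort: each file must be
compiled after the files that constrain it. A minimal Python sketch
(using `graphlib`, available from Python 3.9) is shown below; the file
names and the `compile_order` helper are hypothetical and are not taken
from the `make pip-compile` target.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def compile_order(constrained_by: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every file follows the files that constrain it."""
    # TopologicalSorter takes a mapping of node -> predecessors, so the
    # constraining files come out before the files they constrain.
    return list(TopologicalSorter(constrained_by).static_order())


# Hypothetical layout: both extras are constrained by the compiled base file.
order = compile_order({"base": set(), "extra-a": {"base"}, "extra-b": {"base"}})
```

A cycle in the mapping raises `graphlib.CycleError`, which would signal
that two files constrain each other and cannot be ordered.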
2023-05-24 22:29:35 +00:00
Trevor Bossert
830d67f653
Feat: Discord connector (#515)
* Initial commit of discord connector

based off of initial work by @tnachen with modifications

https://github.com/tnachen/unstructured/tree/tnachen/discord_connector

* Add test file

change format of imports

* working version of the connector

More work to be done to tidy it up and add any additional options

* add to test fixtures update

* fix spacing

* tests working, switching to bot testing channel

* add additional channel

add reprocess to tests

* add try clause to allow for exit on error

Update changelog and bump version

* add updated expected output files

* add logic to check if --discord-period is an integer

Add more to option description

* fix lint error

* Update discord reqs

* PR feedback

* add newline

* another newline

---------

Co-authored-by: Justin Bossert <packerbacker21@hotmail.com>
2023-05-16 11:46:30 -07:00
cragwolfe
aaea6358f6
build(deps): bump pip (#558)
2023-05-08 23:08:10 -07:00
natygyoon
db2f70dbc4
sync version-sync.sh with other repos (#508)
2023-04-21 05:48:38 +09:00
Matt Robinson
4e1cc5ab3d
fix: add slack to fixture update script (#500)
2023-04-19 18:16:44 +00:00
cragwolfe
a11563fe63
fix: update ingest test fixtures, disable biomed test (#486)
* Update test fixtures that should have been updated in prior commit
* Disable biomed ingest tests for now, they fail more often than not
* Bonus: echo `tesseract --version` in the update script, since that is a key thing that influences fixture outputs.
2023-04-15 00:07:09 +00:00
cragwolfe
7b44bcd6e0
build: script to update all ingest fixtures, add azure ingest fixtures (#367)
- Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.).
- Adds azure expected output fixtures for more useful reference points and as a repro for "Some PDF's with scanned images return empty elements" (#346).
- Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details.
- Updates expected outputs with above script.
- Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.
2023-04-11 00:11:50 -07:00
ryannikolaidis
ee52a749c3
fix: docker smoke test on build (#457)
2023-04-06 10:03:42 -07:00
ryannikolaidis
ef9fb79ed4
chore: build with registry as cache (#454)
2023-04-06 00:34:07 -07:00
ryannikolaidis
59785e4332
chore: install all extras in Dockerfile (#419)
* Adds step to install all extras
* Adds smoke test of wikipedia ingest to validate in CI
2023-03-30 13:23:30 -07:00
Amanda Cameron
edb847ce0b
adding Dockerfile (#359)
2023-03-14 13:40:01 -07:00
Matt Robinson
e43cb0e6e0
feat: add partition_epub function (#364)
* add pypandoc dependency

* added epub partitioner and file conversion

* test for partition_epub

* tests for file conversion

* add epub to filetype detection

* added epub to auto partition

* update bricks docs

* updated installing docs

* changelog and version

* add pandoc to dependencies

* add pandoc to debian dependencies

* linting, linting, linting

* typo fix

* typo fix

* file conversion type hints

* more type hints

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-03-14 15:52:21 +00:00
Matt Robinson
d17a94f395
chore: add libreoffice to ubuntu install script (#363)
2023-03-13 10:46:23 -04:00
qued
e43e9178ae
feat: amazon linux 2 setup script (#350)
Added Amazon Linux 2 setup script. Also updated Ubuntu setup script to keep the scripts as aligned as possible.

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-03-09 14:52:24 +00:00
qued
ed074b5828
fix: set through env to avoid interpretation as command (#329)
When I took the changes to the Ubuntu setup script and propagated them to other scripts that run in slightly different contexts, the script failed at line 45 as DEBIAN_FRONTEND=noninteractive was interpreted as a command rather than a variable assignment.

Added the env command so there's no misinterpretation. Tested in docker as both root and user.
2023-03-01 12:56:37 -06:00
qued
d566f9b56a
Inject DEBIAN_FRONTEND into sudo env (#290)
Gets rid of the interactive prompt when tzdata gets installed.
2023-02-28 02:27:58 +00:00