unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-23 22:10:52 +00:00

Author	SHA1	Message	Date
qued	808b4ced7a	build(deps): remove ebooklib (#1878 ) * Removed `ebooklib` as a dependency `ebooklib` is licensed under AGPL3, which is incompatible with the Apache 2.0 license. Thus it is being removed.	2023-10-26 12:22:40 -05:00
qued	d79f633ada	build(deps): add typing extensions dep (#1835 ) Closes #1330. Added `typing-extensions` as an explicit dependency (it was previously an implicit dependency via `dataclasses-json`). This dependency should be explicit, since we import from it directly in `unstructured.documents.elements`. This has the added benefit that `TypedDict` will be available for Python 3.7 users. Other changes: * Ran `pip-compile` * Fixed a bug in `version-sync.sh` that caused an error when using the sync functionality when syncing to a dev version from a release version. #### Testing: To test the Python 3.7 functionality, in a Python 3.7 environment install the base requirements and run `from unstructured.documents.elements import Element`. This also works on `main` as `typing_extensions` is a requirement. However if you `pip uninstall typing-extensions`, and run the above code, it should fail. So this update makes sure `typing-extensions` doesn't get lost if the other dependencies move around. To reproduce the `version-sync.sh` bug that was fixed, in `main`, increment the most recent version in `CHANGELOG.md` while leaving the version in `__version__.py`. Then add the following lines to `version-sync.sh` to simulate a particular set of circumstances, starting on line 114: `MAIN_IS_RELEASE=true` and `CURRENT_BRANCH="something-not-main"`. Then run `make version-sync`. The expected behavior is that the version in `__version__.py` is changed to the new version to match `CHANGELOG.md`, but instead it exits with an error. The fix was to only do the version incrementation check when the script is running in `-c` or "check" mode.	2023-10-24 19:19:09 +00:00
Yuming Long	01a0e003d9	Chore: stop passing extract_tables to inference and note table regression on entire doc OCR (#1850 ) ### Summary A follow-up ticket on https://github.com/Unstructured-IO/unstructured/pull/1801: I forgot to remove the lines that pass extract_tables to inference, and noted the table regression if we only do one OCR for the entire doc. Tech details: * stop passing `extract_tables` parameter to inference * added table extraction ingest test for image, which was skipped before, and the "text_as_html" field contains the OCR output from the table OCR refactor PR * replaced `assert_called_once_with` with `call_args` so that the unit tests don't need to test additional parameters * added `error_margin` as ENV when comparing bounding boxes of `ocr_region` with `table_element` * added more tests for tables and noted the table regression in test for partition pdf ### Test * for stop passing `extract_tables` parameter to inference, run test `test_partition_pdf_hi_res_ocr_mode_with_table_extraction` before this branch and you will see warning like `Table OCR from get_tokens method will be deprecated....`, which means it called the table OCR in inference repo. This branch removed the warning.	2023-10-24 17:13:28 +00:00
Roman Isecke	4802332de0	Roman/optimize ingest ci (#1799 ) ### Description Currently the CI caches the CI dependencies but uses the hash of all files in `requirements/`. This isn't completely accurate since the ingest dependencies are installed in a later step and don't affect the cached environment. As part of this PR: * ingest dependencies were isolated into their own folder in `requirements/ingest/` * A new cache setup was introduced in the CI to restore the base cache -> install ingest dependencies -> cache it with a new id * new make target created to install all ingest dependencies via `pip install -r ...` * updates to Dockerfile to use `find ...` to install all dependencies, avoiding the need to update this when new deps are added. * update to pip-compile script to run over all `*.in` files in `requirements/`	2023-10-24 14:54:00 +00:00
Jack Retterer	b8f24ba67e	Added AWS Bedrock embeddings (#1738 ) Summary: Added support for AWS Bedrock embeddings. Leverages "amazon.titan-tg1-large" for the embedding model. Test - find your aws secret access key and key id; make sure the account has access to bedrock's titan embed model - follow the instructions in `d5e797cd44/docs/source/bricks/embedding.rst (bedrockembeddingencoder)` --------- Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com> Co-authored-by: Yao You <yao@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com> Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>	2023-10-18 19:36:51 -05:00
Roman Isecke	8821689f36	Roman/s3 minio all cloud support (#1606 ) ### Description Exposes the endpoint url as an access kwarg when using the s3 filesystem library via the fsspec abstraction. This allows for any non-aws data providers that support the s3 protocol to be used with the s3 connector (i.e. minio) Closes out https://github.com/Unstructured-IO/unstructured/issues/950 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-10-03 14:31:28 -04:00
Roman Isecke	b2e997635f	roman/es ingest test fixes (#1610 ) ### Description update elasticsearch docker setup to use docker-compose Would close out https://github.com/Unstructured-IO/unstructured/issues/1609	2023-10-03 10:39:33 -04:00
Austin Walker	0abebb5fe6	fix: fix benchmark script when DOCKER_TEST=true (#1515 ) The home directory for our dockerfile changed and broke this script. To verify, try running the benchmark script: `export DOCKER_TEST=true`, then `./scripts/performance/benchmark.sh`. I'll pull in the latest changelog before merging.	2023-10-02 16:08:26 +00:00
Yao You	ad59a879cc	chore: bump inference to 0.6.6 (#1563 ) - bump `unstructured-inference` to `0.6.6` - specify default model name for element detection to be `detectron2_onnx` to keep current behavior - NOTE: the updated inference package by default would use yolox as element detection model; this will be evaluated and enabled in a separated PR --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2023-09-29 19:09:57 +00:00
Yao You	af7639e23f	ci: add retry to elastic search ingest test (#1581 ) Occasionally the es test can fail because the index fails to be created on the first try. Experiments show that adding a timeout doesn't help but adding a retry mitigates the issue. See history of commits in branch: yao/bump-inference-to-0.6.6 https://github.com/Unstructured-IO/unstructured/pull/1563 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2023-09-29 13:42:21 -05:00
Roman Isecke	bd49cfbab7	feat: adds Azure Cognitive Search (full text) destination connector (#1459 ) ### Description New [Azure Cognitive Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search) destination connector added. Takes each json element from the json files created via partition and writes that content to an index. Bonus bug fix: Due to a recent change where the default version of python used in the repo was bumped to `3.10` from `3.8`, running `pip-compile` now runs it against that version rather than the lowest we support, which is still `3.8`. This breaks the setup for those lower versions because some of the versions pulled in by `pip-compile` exist for `3.10` but not `3.8`. `pip-compile` was updated to run as a script that checks the version of python being used first, which helps guarantee that all dependencies meet the minimum python version requirement. Closes out https://github.com/Unstructured-IO/unstructured/issues/1466	2023-09-25 10:27:42 -04:00
ryannikolaidis	ca01b30c07	ci: more reliable release version alerts (#1479 )	2023-09-22 21:19:26 +00:00
Steve Canny	b54994ae95	rfctr: docx partitioning (#1422 ) Reviewers: I recommend reviewing commit-by-commit or just looking at the final version of `partition/docx.py` as View File. This refactor solves a few problems but mostly lays the groundwork to allow us to refine further aspects such as page-break detection, list-item detection, and moving python-docx internals upstream to that library so our work doesn't depend on that domain-knowledge.	2023-09-19 15:32:46 -07:00
Trevor Bossert	09a0958f90	Feat: CORE-1269 - Install paddlepaddle wheel dependent on arch, supporting aarch64 (#1350 ) Testing instructions on Apple silicon ``` make docker-build docker run -it unstructured:dev bash python3 ``` Then run the test in this PR https://unstructured-ai.atlassian.net/browse/CORE-1269 You should get output like shown in ticket Run the same process on your local machine (not inside docker) with same test to verify the non aarch64 paddlepaddle got installed correctly --------- Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>	2023-09-15 17:05:48 -07:00
ryannikolaidis	ad69d93d53	ci: add new release version alert (#1413 )	2023-09-15 07:05:00 +00:00
shreyanid	791adf459d	stop printing all commands in version-sync script (#1390 ) ### Summary Remove -x in version-sync script to stop printing all commands and arguments and improve readability. ### Test `make check` and `make check-version` no longer print all the commands and arguments. (unstructured) shreyanid@Shreyas-MBP-2 unstructured % make check-version scripts/version-sync.sh -c \ -f "unstructured/__version__.py" semver From github.com:Unstructured-IO/unstructured * branch main -> FETCH_HEAD version sync would make no changes to unstructured/__version__.py.	2023-09-12 15:05:26 -07:00
ryannikolaidis	95c3e17af0	fix: version-sync (#1266 )	2023-09-01 06:50:05 +00:00
Yao You	b504a48e06	dev: add py-spy profiling (#1251 ) This PR adds a new developer tool for profiling performance: `py-spy`. Additionally it adds a new make command to start a docker container with your local `unstructured` repo mounted for quickly testing code in a Rocky Linux environment (see usage below for intent). ### py-spy It is a sampling profiler https://github.com/benfred/py-spy and in practice usually provides more readily usable information than the commonly used `cProfile`. It also supports output to `speedscope` format, [which](https://github.com/jlfwong/speedscope#usage) provides a rich view of the profiling result. ### usage The new tool is added to the existing `profile.sh` script and is readily discoverable in the interactive interface. When you select to view the new speedscope format profile, it shows up in your local browser if you followed the readme to install speedscope locally via `npm install -g speedscope`. On macOS the profiling tool needs superuser privilege. If you are not comfortable with that, feel free to run the profiling inside a Linux container if your local dev env is macOS.	2023-08-31 19:26:29 +00:00
cragwolfe	6ad497136d	build: docker image fix (#1245 ) Moving to a non-root user in the docker image caused a failure in the publication workflow. This fix was used to publish the 0.10.9 unstructured image in this workflow: https://github.com/Unstructured-IO/unstructured/actions/runs/6020624226/job/16332230987	2023-08-29 23:27:52 -07:00
Trevor Bossert	e4535d29ca	Set user for container to same as api image. (#1239 ) This is security best practice, a user can override this with their own Dockerfile if required.	2023-08-30 01:01:44 +00:00
cragwolfe	d19183f442	build(lint): don't check version in main against self (#1123 ) If on the main branch already, it does not make sense to check if the latest commit is the same non-dev version. This fixes an annoyance where the CI Lint job would fail on release main commits, but besides that was not causing any other issues.	2023-08-15 17:57:59 +00:00
Ahmet Melek	627f78c16f	feat: airtable connector (#1012 ) * add the first version of airtable connector * change imports as inline to fail gracefully in case of lacking dependency * parse tables as csv rather than plain text * add relevant logic to be able to use --airtable-list-of-paths * add script for creation of resources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings * fix ingest test names * add scripts for the large table test * remove large table test from diff test * make base and table ids explicit * add and remove comments * use -ne instead of != * update code based on the recent ingest refactor, update changelog and version * shellcheck fix * update comments * update check-num-rows-and-columns-output error message Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * update help comments * update help comments * update help comments * update workflows to set auth tokens and to run make install * add comments on create_scale_test_components * separate component ids from the test script, add comments to document test component creation * add LARGE_BASE test, implement LARGE_BASE component creation, replace component id * shellcheck fixes * shellcheck fixes * update docs * update comment * bump version * add wrongly deleted file * sort columns before saving to process * Update ingest test fixtures (#1098) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-08-11 12:02:51 -07:00
shreyanid	463c498c78	Update `make check-version` script to fail if release version is unchanged (#1039 ) * TEMP adding git current release check * working, checks version file against current release * clean up comments * shellcheck	2023-08-07 21:21:11 -07:00
Ronny H	7a05ef2cd9	Python script to collect environment for debugging issues (#989 ) * Tested on Mac, Windows & Rocky Linux OS * Updated README to include bugs reporting script	2023-08-02 22:54:43 +00:00
Yuming Long	df1ba39905	Chore: add uns api repo unittests (#954 ) * stage * git clone * ci ignore markdown file * make install * use env instead * remove md * add script * wrong env value * add note * maybe don't rm * no cd../ --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-07-26 20:55:35 +00:00
Ahmet Melek	b7674fb97e	feat: confluence connector (cloud) (#906 ) * Add confluence connector and an example script * add test script, add dependency installations * add authentication secret variables for ci tests and actions * add dependency installation commands for workflows * add dependency installation commands for workflows * Update ingest test fixtures (#907) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add ingest test fixtures update workflow for python 3.10, update example script with dummy values * change workflow name to avoid confusion * change workflow name to avoid confusion * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions * Update ingest test fixtures (#911) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * revert back the test python version matrix * recompile dependencies * modifications for shellcheck * update changelog and version * changelog and version * remove comments * Update ingest test fixtures (#915) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add the option to state the number of spaces to be fetched * add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users * add help message * add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test * change test names * rename connector arg Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * change arg name for connector Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * add comment to example * change arg names * add new tests to ingest test * shellcheck remove redundant statement * Update ingest test fixtures (#932) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * Update ingest test fixtures (#936) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * linting * change file extensions to parse as html * Update ingest test fixtures (#943) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * remove old fixtures * update version to 0.8.2-dev3 * change file to trigger CI * change file to trigger CI * change file to trigger CI * change file to trigger CI --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-07-18 19:29:41 +01:00
Ahmet Melek	5ea216cf07	feat: elasticsearch connector (#817 )	2023-07-01 17:45:28 +00:00
ryannikolaidis	e08936b6fb	chore: update all bash scripts to use shebang: /usr/bin/env bash (#779 )	2023-06-20 16:00:55 -07:00
cragwolfe	2989f53358	chore: bump to python 3.8.17 (#766 ) The images pushed quay.io will now have python 3.8.17 rather than python 3.8.15.	2023-06-16 11:17:03 -07:00
Yuming Long	2fbb1ccd30	Chore(ingest) : add tests on PDFs with fast strategy (#614 ) Summary * Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted * Added PDFs ingest tests with fast strategy with new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh Updated ingest tests procedure: * Processing files with hi_res strategy, and preserve downloads to repo files-ingest-download/<ingest_test_name> * Reprocessing all PDFs with fast strategy from local file files-ingest-download, the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name> Test * Reproduce tests with ./scripts/ingest-test-fixtures-update.sh , should expect no update. Also don't need any secret tokens since relevant tests won't produce PDFs.	2023-06-12 19:02:48 +00:00
ryannikolaidis	dabda67c8f	fix: ingest-test-fixtures-update script to pass env vars (#697 )	2023-06-08 04:48:49 +00:00
ryannikolaidis	7d157c1ede	test: add benchmark script (#638 )	2023-06-05 09:14:43 -07:00
ryannikolaidis	bdef4fd398	test: adds profiling script (#661 )	2023-06-01 21:26:05 +00:00
Yuming Long	fc59a043b7	Chore: Support epub tests in docker image (#630 ) * docker works * more epub tests * changelog version * support epub + odt + rtf * update dockerfile * revert.. * install pandoc on ci env * pandoc docker grab based on arch * move arch into image * move back to base image	2023-05-26 15:38:48 -04:00
qued	c82bad1061	build(deps): avoid version conflicts (#636 ) Addresses #631. * Uses constraints to keep dependency versions more consistent. * Moves all dependencies to .in files which are then ingested by setup.py. * Adds script to check consistency of all extras. * Adds consistency check to CI. I should note that while it shouldn't be possible to cause a conflict between base.txt and any of the extras (because base.txt constrains all the extras) it is possible to get a conflict between two of the extras files. There are ways of trying to avoid that (like constraining each file by all the files that have already been processed before it in the order given in the make pip-compile target) but the ones I could think of seemed a little overwrought, and come with problems of their own. If a conflict arises, it should be flagged by CI or locally with make check-deps. When/if that happens, you can resolve the conflict by adding appropriate global constraints in requirements/constraints.txt. Also note that if fileA.in is constrained by fileB.txt, then fileB.in should be compiled before fileA.in in the make pip-compile target. Otherwise fileA.in will be compiled with the old version of fileB.txt which can cause conflicts or keep dependencies from being updated properly.	2023-05-24 22:29:35 +00:00
Trevor Bossert	830d67f653	Feat: Discord connector (#515 ) * Initial commit of discord connector based off of initial work by @tnachen with modifications https://github.com/tnachen/unstructured/tree/tnachen/discord_connector * Add test file change format of imports * working version of the connector More work to be done to tidy it up and add any additional options * add to test fixtures update * fix spacing * tests working, switching to bot testing channel * add additional channel add reprocess to tests * add try clause to allow for exit on error Update changelog and bump version * add updated expected output files * add logic to check if --discord-period is an integer Add more to option description * fix lint error * Update discord reqs * PR feedback * add newline * another newline --------- Co-authored-by: Justin Bossert <packerbacker21@hotmail.com>	2023-05-16 11:46:30 -07:00
cragwolfe	aaea6358f6	build(deps): bump pip (#558 )	2023-05-08 23:08:10 -07:00
natygyoon	db2f70dbc4	sync version-sync.sh with other repos (#508 )	2023-04-21 05:48:38 +09:00
Matt Robinson	4e1cc5ab3d	fix: add slack to fixture update script (#500 )	2023-04-19 18:16:44 +00:00
cragwolfe	a11563fe63	fix: update ingest test fixtures, disable biomed test (#486 ) * Update test fixtures that should have been updated in prior commit * Disable biomed ingest tests for now, they fail more often than not * Bonus: echo `tesseract --version` in the update script, since that is a key thing that influences fixture outputs.	2023-04-15 00:07:09 +00:00
cragwolfe	7b44bcd6e0	build: script to update all ingest fixtures, add azure ingest fixtures (#367 ) - Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.). - Adds azure expected output fixtures for more useful reference points and as a repro for Some PDF's with scanned images return empty elements #346 . - Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details. - Updates expected outputs with above script. - Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.	2023-04-11 00:11:50 -07:00
ryannikolaidis	ee52a749c3	fix: docker smoke test on build (#457 )	2023-04-06 10:03:42 -07:00
ryannikolaidis	ef9fb79ed4	chore: build with registry as cache (#454 )	2023-04-06 00:34:07 -07:00
ryannikolaidis	59785e4332	chore: install all extras in Dockerfile (#419 ) * Adds step to install all extras * Adds smoke test of wikipedia ingest to validate in CI	2023-03-30 13:23:30 -07:00
Amanda Cameron	edb847ce0b	adding Dockerfile (#359 )	2023-03-14 13:40:01 -07:00
Matt Robinson	e43cb0e6e0	feat: add `partition_epub` function (#364 ) * add pypandoc dependency * added epub partitioner and file conversion * test for partition_epub * tests for file conversion * add epub to filetype detection * added epub to auto partition * update bricks docs * updated installing docs * changelog and version * add pandoc to dependencies * add pandoc to debian dependencies * linting, linting, linting * typo fix * typo fix * file conversion type hints * more type hints --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-03-14 15:52:21 +00:00
Matt Robinson	d17a94f395	chore: add libreoffice to ubuntu install script (#363 )	2023-03-13 10:46:23 -04:00
qued	e43e9178ae	feat: amazon linux 2 setup script (#350 ) Added Amazon Linux 2 setup script. Also updated Ubuntu setup script to keep the scripts as aligned as possible. Co-authored-by: cragwolfe <crag@unstructured.io>	2023-03-09 14:52:24 +00:00
qued	ed074b5828	fix: set through env to avoid interpretation as command (#329 ) When I took the changes to the Ubuntu setup script and propagated them to other scripts that run in slightly different contexts, the script failed at line 45 as DEBIAN_FRONTEND=noninteractive was interpreted as a command rather than a variable assignment. Added the env command so there's no misinterpretation. Tested in docker as both root and user.	2023-03-01 12:56:37 -06:00
qued	d566f9b56a	Inject DEBIAN_FRONTEND into sudo env (#290 ) Gets rid of the interactive prompt when tzdata gets installed.	2023-02-28 02:27:58 +00:00

56 Commits