unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-08 22:47:47 +00:00

Author	SHA1	Message	Date
Roman Isecke	d6f2841ff4	feat: update dependencies and remove constraint on pydantic (#2841 ) ### Description * The `consistent-deps.sh` script was fixed to take the ingest dependencies into account, which surfaced some errors. New constraints were added to make that script pass. * Update all requirements without a constraint on pydantic, allowing the latest version to be pulled in. * `pikepdf` is causing a conflict, but there's a fix on their `main` branch; we just need the next release to be published. Opened up a question here to see if we can get that out any sooner: [Do releases happen on a schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now added `lxml<5` to the constraints. A couple of optimizations: * `constraints.in` renamed to `constraints.txt` since the whole point is that all dependencies are already pinned and the file never gets compiled * `constraints.txt` moved to a `requirements/deps` directory as this never gets compiled by `pip-compile` * Other dependency files updated to reference the new location of `base.in` and `constraints.txt` * Makefile updated, since it was originally written to skip the `base.in` and `constraints.in` files	2024-04-04 19:58:23 +00:00
Klaijan	30b6a09bc3	fix: declare -i [SC2324 shellcheck] (#2624 ) Fix SC2324 shellcheck warning by adding -i to indicate var type of integer and tidy up the formatting.	2024-03-08 10:09:55 +00:00
Roman Isecke	9c1c41f493	BUGFIX: fix dependencies in setup.py (#2605 ) ### Description Currently the requirements associated with an extra in `setup.py` are dynamically generated using the `load_requirements()` method in the same file. That method is passed the `.in` files, which are then read line by line to generate the requirements associated with an extra. Unless the `.in` file itself has a version pin, this will never respect the `.txt` files generated by `pip-compile`. This fix updates all the inputs to `load_requirements()` to use the `.txt` files themselves.	2024-03-06 18:59:08 +00:00
David Potter	0c834517d8	fix: change opensearch port (#2517 ) change opensearch port to see if fixes CI. We think there may be a conflict with the elasticsearch docker port. Also adding simple retry to vector query. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-07 21:25:04 +00:00
David Potter	bc791d53f4	feat: add opensearch source and destination connector (#2349 ) Adds OpenSearch as a source and destination. Since OpenSearch is a fork of Elasticsearch, these connectors rely heavily on inheriting the Elasticsearch connectors whenever possible. - Adds OpenSearch source connector to be able to ingest documents from OpenSearch. - Adds OpenSearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into OpenSearch. - Defines an example unstructured elements schema for users to be able to setup their unstructured OpenSearch indexes easily. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-17 04:31:49 +00:00
ryannikolaidis	2ce829ddd0	test: update test Elasticsearch mappings to validate embedding search (#2397 ) Currently in the Elasticsearch Destination ingest test we are writing the embeddings to a "float" type field. In order to leverage this field for similarity search it should be mapped as "dense_vector" with the respective dimensions assigned. This PR updates that mapping and adds a test query to validate that this works as expected.	2024-01-14 19:27:56 +00:00
Roman Isecke	b37b4689bc	drop python3.8 (#2372 ) ### Description Remove all uses of python3.8 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-01-09 23:37:30 +00:00
rvztz	950e5d68f9	feat: adds postgresql/sqlite destination connector (#2005 ) - Adds a destination connector to upload processed output into a PostgreSQL/SQLite database instance. - Users are responsible for providing their own instances. This PR includes a couple of configuration examples. - Defines the scripts required to set up a PostgreSQL instance with the unstructured elements schema. - Validates postgres/pgvector embedding storage and retrieval --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-04 19:33:16 +00:00
Ahmet Melek	fd293b3e78	feat: add elasticsearch destination connector (#2152 ) Closes https://github.com/Unstructured-IO/unstructured/issues/1842 Closes https://github.com/Unstructured-IO/unstructured/issues/2202 Closes https://github.com/Unstructured-IO/unstructured/issues/2203 This PR: - Adds Elasticsearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into Elasticsearch. - Defines an example unstructured elements schema for users to be able to setup their unstructured elasticsearch indexes easily. - Includes parallelized upload and lazy processing for elasticsearch destination connector. - Rearranges elasticsearch test helpers to source, destination, and common folders. - Adds util functions to be able to batch iterables in a lazy way for uploads - Fixes a bug where removing the optional parameter `--fields` broke the connector due to an integer processing error. - Fixes a bug where using an [elasticsearch config](`8fa5cbf036/unstructured/ingest/connector/elasticsearch.py (L26-L35)`) for a destination connector resulted in a serialization issue when optional parameter `--fields` was not provided.	2023-12-20 01:26:58 +00:00
David Potter	4b8352e0f5	feat: add chroma destination connector (#2240 ) Adds Chroma (also known as ChromaDB) as a vector destination. Currently Chroma is an in-memory, single-process oriented library with plans for a hosted and/or more production-ready solution - https://docs.trychroma.com/deployment Though they now claim to support multiple clients hitting the database at once, I found that it was inconsistent. Sometimes multiprocessing worked (maybe 1 out of 3 times), but the other times I would get different errors. So I kept it single process. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2023-12-19 16:58:23 +00:00
cragwolfe	bd8a74d686	chore: shell scripts default indent of 2 instead of 4 (#2287 ) Given the tendency for shell scripts to easily enter into a few levels of indentation and long line lengths, update the default to 2 spaces.	2023-12-19 07:48:21 +00:00
Roman Isecke	76efcf4dd7	chore: add shfmt (#2246 ) ### Description Given all the shell files that now exist in the repo, would be nice to have linting/formatting around them (in addition to the existing shellcheck which doesn't do anything to format the shell code). This PR introduces `shfmt` to both check for changes and apply formatting when the associated make targets are called.	2023-12-12 01:04:15 +00:00
Roman Isecke	ac302689a0	chore: update sphinx ingest docs with new connectors (#2245 ) Replacing https://github.com/Unstructured-IO/unstructured/pull/2243	2023-12-11 21:29:41 +00:00
David Potter	cde11d1eb0	feat: Add sftp source connector (#2163 ) Adds source connector for SFTP which uses fsspec and paramiko via fsspec. Paramiko is the standard sftp package for python used in pysftp etc... ``` --username foo \ --password bar \ --remote-url sftp://localhost:47474/upload/ ``` Will only download a specifically requested file if it has an extension. (i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will treat any other remote_url as a folder path. This is intentional. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2023-12-07 19:33:19 +00:00
Roman Isecke	f193d3d43b	feat: improve sensitive data handling by fsspec connectors (#2194 ) ### Description Building off of PR https://github.com/Unstructured-IO/unstructured/pull/2179, updating fsspec based connectors to use better authentication field handling. This PR adds in the following changes: * Update the base classes to inherit from the enhanced json mixin * Add in a new access config dataclass that should be used as a nested dataclass in the connector configs * Update the code extracting configs out of the cli options dictionary to support the nested access config if it exists on the parent config * Update all fsspec connectors with explicit access configs given what each one's SDKs support * Update the json mixin and enhanced field to support a name override when serializing/deserializing from json/dicts. This allows a different name to be used for the CLI option than what the name of the field is on the dataclass. * Update all the writes to use a class-based approach and share the same structure as the runner classes * The above update allowed for better code to be used in the base source and destination CLI commands * Add in utility code around parsing a flat dictionary (coming from the click based options) into dataclass-based configs with potentially nested dataclasses. Slightly unrelated changes: * session handle removed from pinecone connector as this was breaking the serialization of the write config and didn't have any benefit as a connection was never being shared, the index used simply makes a new http call each time it's invoked. * Dedicated write configs were created for all destination connectors to better support serialization * Refactor of Elasticsearch connector included, with update to ingest test to use auth TODOs * Left a `#TODO` in the code but the way session handler is implemented right now, it breaks serialization since it adds a generic variable based on the library being used for a connector (i.e. `googleapiclient.discovery.Resource`) which is not serializable. This will need to be updated to omit that from serialization but still support the current workflow. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-05 20:55:19 +00:00
rvztz	ce905dd098	feat: Weaviate destination connector (#1963 ) Closes #1781. - Adds a Weaviate destination connector - The connector receives a host for the weaviate instance and a weaviate class name. - Defines a weaviate schema for json elements. - Defines the pre-processing to conform unstructured's schema to the proposed weaviate schema.	2023-12-01 22:27:41 +00:00
cragwolfe	d7456ab6d2	feat: convenience script to view tables (#2124 ) Executive Summary Eyeballing or saving html in a Table element (in the `metadata.text_as_html` field) takes some manual effort. This script provides a quick way to do so given an unstructured .json file that adheres to the usual schema (i.e., that's returned by the Unstructured API). Testing Instructions Get some unstructured output that includes a table. E.g. [124_PDFsam_Basel III - Finalising post-crisis reforms.pdf](https://github.com/Unstructured-IO/unstructured/files/13407404/124_PDFsam_Basel.III.-.Finalising.post-crisis.reforms.pdf) ``` ./unstructured-get-json.sh --tables --hi-res \ 124_PDFsam_Basel\ III\ -\ Finalising\ post-crisis\ reforms.pdf ``` Then use the following script to view the structure and content of the tables (note that the output file path was copied to the clipboard by the prior command): ``` ./u-tables-inspect.sh \ "<snip>/tmp/unst-outputs/124_PDFsam_Basel III - Finalising post-crisis reforms.pdf-hi-res.json" ```	2023-11-21 22:18:39 -08:00
cragwolfe	5fa40850f4	feat: convenience script to post files to the API (#2083 ) Usage: ./unstructured-get-json.sh [options] <file> Options: --api-key KEY Specify the API key for authentication. Set the env var $UNST_API_KEY to skip providing this option. --hi-res hi_res strategy: Enable high-resolution processing, with layout segmentation and OCR --fast fast strategy: No OCR, just extract embedded text --ocr-only ocr_only strategy: Perform OCR (Optical Character Recognition) only. No layout segmentation. --tables Enable table extraction: tables are represented as html in metadata --coordinates Include coordinates in the output --trace Enable trace logging for debugging, useful to cut and paste the executed curl call --verbose Enable verbose logging including printing first 8 elements to stdout --s3 Write the resulting output to s3 (like a pastebin) --help Display this help and exit. Arguments: <file> File to send to the API. The script requires a <file>, the document to post to the Unstructured API. The .json result is written to ~/tmp/unst-outputs/ -- this path is echoed and copied to your clipboard.	2023-11-15 22:58:28 -08:00
cragwolfe	69952f66ed	fix(build): update ingest script loc in Dockerfile (#2052 ) Fixes docker-smoke-test.sh to reference the new location for the wikipedia ingest script, which was moved in https://github.com/Unstructured-IO/unstructured/pull/1951 . This fix should allow the docker image build to complete on merges to main. Reference to recent failed job: https://github.com/Unstructured-IO/unstructured/actions/runs/6819416096/job/18546724401	2023-11-09 21:55:07 -08:00
Yao You	69265685ea	build(deps): add makefile to requirements (#1295 ) This PR resolves #1294 by adding a Makefile to compile requirements. This makefile respects the dependencies between files and will compile them in order. E.g., extra-*.txt will be compiled __after__ base.txt is updated. Test locally by simply running `make pip-compile` or `cd requirements && make clean && make all` --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-11-02 10:17:35 -05:00
qued	808b4ced7a	build(deps): remove ebooklib (#1878 ) * Removed `ebooklib` as a dependency `ebooklib` is licensed under AGPL3, which is incompatible with the Apache 2.0 license. Thus it is being removed.	2023-10-26 12:22:40 -05:00
qued	d79f633ada	build(deps): add typing extensions dep (#1835 ) Closes #1330. Added `typing-extensions` as an explicit dependency (it was previously an implicit dependency via `dataclasses-json`). This dependency should be explicit, since we import from it directly in `unstructured.documents.elements`. This has the added benefit that `TypedDict` will be available for Python 3.7 users. Other changes: * Ran `pip-compile` * Fixed a bug in `version-sync.sh` that caused an error when using the sync functionality when syncing to a dev version from a release version. #### Testing: To test the Python 3.7 functionality, in a Python 3.7 environment install the base requirements and run ```python from unstructured.documents.elements import Element ``` This also works on `main` as `typing_extensions` is a requirement. However if you `pip uninstall typing-extensions`, and run the above code, it should fail. So this update makes sure `typing-extensions` doesn't get lost if the other dependencies move around. To reproduce the `version-sync.sh` bug that was fixed, in `main`, increment the most recent version in `CHANGELOG.md` while leaving the version in `__version__.py`. Then add the following lines to `version-sync.sh` to simulate a particular set of circumstances, starting on line 114: ``` MAIN_IS_RELEASE=true CURRENT_BRANCH="something-not-main" ``` Then run `make version-sync`. The expected behavior is that the version in `__version__.py` is changed to the new version to match `CHANGELOG.md`, but instead it exits with an error. The fix was to only do the version incrementation check when the script is running in `-c` or "check" mode.	2023-10-24 19:19:09 +00:00
Yuming Long	01a0e003d9	Chore: stop passing extract_tables to inference and note table regression on entire doc OCR (#1850 ) ### Summary A follow up ticket on https://github.com/Unstructured-IO/unstructured/pull/1801, I forgot to remove the lines that pass extract_tables to inference, and noted the table regression if we only do one OCR for entire doc Tech details: * stop passing `extract_tables` parameter to inference * added table extraction ingest test for image, which was skipped before, and the "text_as_html" field contains the OCR output from the table OCR refactor PR * replaced `assert_called_once_with` with `call_args` so that the unit tests don't need to test additional parameters * added `error_margin` as ENV when comparing bounding boxes of`ocr_region` with `table_element` * added more tests for tables and noted the table regression in test for partition pdf ### Test * for stop passing `extract_tables` parameter to inference, run test `test_partition_pdf_hi_res_ocr_mode_with_table_extraction` before this branch and you will see warning like `Table OCR from get_tokens method will be deprecated....`, which means it called the table OCR in inference repo. This branch removed the warning.	2023-10-24 17:13:28 +00:00
Roman Isecke	4802332de0	Roman/optimize ingest ci (#1799 ) ### Description Currently the CI caches the CI dependencies but uses the hash of all files in `requirements/`. This isn't completely accurate since the ingest dependencies are installed in a later step and don't affect the cached environment. As part of this PR: * ingest dependencies were isolated into their own folder in `requirements/ingest/` * A new cache setup was introduced in the CI to restore the base cache -> install ingest dependencies -> cache it with a new id * new make target created to install all ingest dependencies via `pip install -r ...` * updates to Dockerfile to use `find ...` to install all dependencies, avoiding the need to update this when new deps are added. * update to pip-compile script to run over all `*.in` files in `requirements/`	2023-10-24 14:54:00 +00:00
Jack Retterer	b8f24ba67e	Added AWS Bedrock embeddings (#1738 ) Summary: Added support for AWS Bedrock embeddings. Leverages "amazon.titan-tg1-large" for the embedding model. Test - find your aws secret access key and key id; make sure the account has access to bedrock's titan embed model - follow the instructions in `d5e797cd44/docs/source/bricks/embedding.rst (bedrockembeddingencoder)` --------- Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com> Co-authored-by: Yao You <yao@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com> Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>	2023-10-18 19:36:51 -05:00
Roman Isecke	8821689f36	Roman/s3 minio all cloud support (#1606 ) ### Description Exposes the endpoint url as an access kwarg when using the s3 filesystem library via the fsspec abstraction. This allows for any non-aws data providers that support the s3 protocol to be used with the s3 connector (i.e. minio) Closes out https://github.com/Unstructured-IO/unstructured/issues/950 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-10-03 14:31:28 -04:00
Roman Isecke	b2e997635f	roman/es ingest test fixes (#1610 ) ### Description update elasticsearch docker setup to use docker-compose Would close out https://github.com/Unstructured-IO/unstructured/issues/1609	2023-10-03 10:39:33 -04:00
Austin Walker	0abebb5fe6	fix: fix benchmark script when DOCKER_TEST=true (#1515 ) The home directory for our dockerfile changed and broke this script. To verify, try running the benchmark script: ``` export DOCKER_TEST=true ./scripts/performance/benchmark.sh ``` I'll pull in the latest changelog before merging.	2023-10-02 16:08:26 +00:00
Yao You	ad59a879cc	chore: bump inference to 0.6.6 (#1563 ) - bump `unstructured-inference` to `0.6.6` - specify default model name for element detection to be `detectron2_onnx` to keep current behavior - NOTE: the updated inference package by default would use yolox as element detection model; this will be evaluated and enabled in a separated PR --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2023-09-29 19:09:57 +00:00
Yao You	af7639e23f	ci: add retry to elastic search ingest test (#1581 ) Occasionally the es test can fail because the index fails to be created on the first try. Experiments show that adding a timeout doesn't help, but adding a retry mitigates the issue. See history of commits in branch: yao/bump-inference-to-0.6.6 https://github.com/Unstructured-IO/unstructured/pull/1563 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2023-09-29 13:42:21 -05:00
Roman Isecke	bd49cfbab7	feat: adds Azure Cognitive Search (full text) destination connector (#1459 ) ### Description New [Azure Cognitive Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search) destination connector added. Takes each json element from the json files created via partition and writes that content to an index. Bonus bug fix: Due to a recent change where the default version of python used in the repo was bumped to `3.10` from `3.8`, running `pip-compile` now runs it against that version rather than the lowest we support, which is still `3.8`. This breaks the setup for those lower versions because some of the versions pulled in by `pip-compile` exist for `3.10` but not `3.8`. `pip-compile` was updated to run as a script that checks the version of python being used first, which helps guarantee that all dependencies meet the minimum python version requirement. Closes out https://github.com/Unstructured-IO/unstructured/issues/1466	2023-09-25 10:27:42 -04:00
ryannikolaidis	ca01b30c07	ci: more reliable release version alerts (#1479 )	2023-09-22 21:19:26 +00:00
Steve Canny	b54994ae95	rfctr: docx partitioning (#1422 ) Reviewers: I recommend reviewing commit-by-commit or just looking at the final version of `partition/docx.py` as View File. This refactor solves a few problems but mostly lays the groundwork to allow us to refine further aspects such as page-break detection, list-item detection, and moving python-docx internals upstream to that library so our work doesn't depend on that domain-knowledge.	2023-09-19 15:32:46 -07:00
Trevor Bossert	09a0958f90	Feat: CORE-1269 - Install paddlepaddle wheel dependent on arch, supporting aarch64 (#1350 ) Testing instructions on Apple silicon ``` make docker-build docker run -it unstructured:dev bash python3 ``` Then run the test in this PR https://unstructured-ai.atlassian.net/browse/CORE-1269 You should get output like shown in ticket Run the same process on your local machine (not inside docker) with same test to verify the non aarch64 paddlepaddle got installed correctly --------- Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>	2023-09-15 17:05:48 -07:00
ryannikolaidis	ad69d93d53	ci: add new release version alert (#1413 )	2023-09-15 07:05:00 +00:00
shreyanid	791adf459d	stop printing all commands in version-sync script (#1390 ) ### Summary Remove -x in version-sync script to stop printing all commands and arguments and improve readability. ### Test `make check` and `make check-version` no longer print all the commands and arguments. (unstructured) shreyanid@Shreyas-MBP-2 unstructured % make check-version scripts/version-sync.sh -c \ -f "unstructured/__version__.py" semver From github.com:Unstructured-IO/unstructured * branch main -> FETCH_HEAD version sync would make no changes to unstructured/__version__.py.	2023-09-12 15:05:26 -07:00
ryannikolaidis	95c3e17af0	fix: version-sync (#1266 )	2023-09-01 06:50:05 +00:00
Yao You	b504a48e06	dev: add py-spy profiling (#1251 ) This PR adds a new developer tool for profiling performance: `py-spy`. Additionally it adds a new make command to start a docker container with your local `unstructured` repo mounted for quickly testing code in a Rocky Linux environment (see usage below for intent). ### py-spy It is a sampling profiler https://github.com/benfred/py-spy and in practice usually provides more readily usable information than the commonly used `cProfile`. It also supports output to `speedscope` format, [which](https://github.com/jlfwong/speedscope#usage) provides a rich view of the profiling result. ### usage The new tool is added to the existing `profile.sh` script and is readily discoverable in the interactive interface. When you select to view the new speedscope format profile, it shows up in your local browser if you followed the readme to install speedscope locally via `npm install -g speedscope`. On macOS the profiling tool needs superuser privilege. If you are not comfortable with that, feel free to run the profiling inside a Linux container if your local dev env is macOS.	2023-08-31 19:26:29 +00:00
cragwolfe	6ad497136d	build: docker image fix (#1245 ) Moving to a non-root user in the docker image caused a failure in the publication workflow. This fix was used to publish the 0.10.9 unstructured image in this workflow: https://github.com/Unstructured-IO/unstructured/actions/runs/6020624226/job/16332230987	2023-08-29 23:27:52 -07:00
Trevor Bossert	e4535d29ca	Set user for container to same as api image. (#1239 ) This is security best practice, a user can override this with their own Dockerfile if required.	2023-08-30 01:01:44 +00:00
cragwolfe	d19183f442	build(lint): don't check version in main against self (#1123 ) If on the main branch already, it does not make sense to check if the latest commit is the same non-dev version. This fixes an annoyance where the CI Lint job would fail on release main commits, but besides that was not causing any other issues.	2023-08-15 17:57:59 +00:00
Ahmet Melek	627f78c16f	feat: airtable connector (#1012 ) * add the first version of airtable connector * change imports as inline to fail gracefully in case of lacking dependency * parse tables as csv rather than plain text * add relevant logic to be able to use --airtable-list-of-paths * add script for creation of resources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings * fix ingest test names * add scripts for the large table test * remove large table test from diff test * make base and table ids explicit * add and remove comments * use -ne instead of != * update code based on the recent ingest refactor, update changelog and version * shellcheck fix * update comments * update check-num-rows-and-columns-output error message Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * update help comments * update help comments * update help comments * update workflows to set auth tokens and to run make install * add comments on create_scale_test_components * separate component ids from the test script, add comments to document test component creation * add LARGE_BASE test, implement LARGE_BASE component creation, replace component id * shellcheck fixes * shellcheck fixes * update docs * update comment * bump version * add wrongly deleted file * sort columns before saving to process * Update ingest test fixtures (#1098) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-08-11 12:02:51 -07:00
shreyanid	463c498c78	Update `make check-version` script to fail if release version is unchanged (#1039 ) * TEMP adding git current release check * working, checks version file against current release * clean up comments * shellcheck	2023-08-07 21:21:11 -07:00
Ronny H	7a05ef2cd9	Python script to collect environment for debugging issues (#989 ) * Tested on Mac, Windows & Rocky Linux OS * Updated README to include bugs reporting script	2023-08-02 22:54:43 +00:00
Yuming Long	df1ba39905	Chore: add uns api repo unittests (#954 ) * stage * git clone * ci ignore markdown file * make install * use env instead * remove md * add script * wrong env value * add note * maybe don't rm * no cd../ --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-07-26 20:55:35 +00:00
Ahmet Melek	b7674fb97e	feat: confluence connector (cloud) (#906 ) * Add confluence connector and an example script * add test script, add dependency installations * add authentication secret variables for ci tests and actions * add dependency installation commands for workflows * add dependency installation commands for workflows * Update ingest test fixtures (#907) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add ingest test fixtures update workflow for python 3.10, update example script with dummy values * change workflow name to avoid confusion * change workflow name to avoid confusion * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions * Update ingest test fixtures (#911) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * revert back the test python version matrix * recompile dependencies * modifications for shellcheck * update changelog and version * changelog and version * remove comments * Update ingest test fixtures (#915) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add the option to state the number of spaces to be fetched * add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users * add help message * add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test * change test names * rename connector arg Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * change arg name for connector Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * add comment to example * change arg names * add new tests to ingest test * shellcheck remove redundant statement * Update ingest test fixtures (#932) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * Update ingest test fixtures (#936) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * linting * change file extensions to parse as html * Update ingest test fixtures (#943) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * remove old fixtures * update version to 0.8.2-dev3 * change file to trigger CI * change file to trigger CI * change file to trigger CI * change file to trigger CI --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-07-18 19:29:41 +01:00
Ahmet Melek	5ea216cf07	feat: elasticsearch connector (#817 )	2023-07-01 17:45:28 +00:00
ryannikolaidis	e08936b6fb	chore: update all bash scripts to use shebang: /usr/bin/env bash (#779 )	2023-06-20 16:00:55 -07:00
cragwolfe	2989f53358	chore: bump to python 3.8.17 (#766 ) The images pushed quay.io will now have python 3.8.17 rather than python 3.8.15.	2023-06-16 11:17:03 -07:00
Yuming Long	2fbb1ccd30	Chore(ingest) : add tests on PDFs with fast strategy (#614 ) Summary * Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted * Added PDFs ingest tests with fast strategy with new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh Updated ingest tests procedure: * Processing files with hi_res strategy, and preserve downloads to repo files-ingest-download/<ingest_test_name> * Reprocessing all PDFs with fast strategy from local file files-ingest-download, the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name> Test * Reproduce tests with ./scripts/ingest-test-fixtures-update.sh , should expect no update. Also don't need any secret tokens since relevant tests won't produce PDFs.	2023-06-12 19:02:48 +00:00
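
Commit fd293b3e78 in the log above mentions utility functions that batch iterables lazily for uploads. The following is only a minimal sketch of that idea, not the repository's code: the helper name `batch_lazily` is hypothetical, and it relies on the standard-library `itertools.islice` to pull one batch at a time so the full input never has to sit in memory.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_lazily(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without materializing the input.

    Hypothetical helper for illustration; the connector's real utility may differ.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice consumes at most batch_size items from the shared iterator.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because this is a generator, each batch is only built when the consumer asks for it, which suits an uploader that sends one bulk request per batch.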

76 Commits
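
Commit 9c1c41f493 describes switching the `setup.py` extras from reading `.in` files to reading the `.txt` files that `pip-compile` produces, so that version pins are respected. A rough sketch of what reading such a compiled file line by line could look like follows; the function name `read_pinned_requirements` is an assumption made for illustration and is not the repository's `load_requirements()`.

```python
from pathlib import Path
from typing import Iterable, List


def read_pinned_requirements(paths: Iterable[str]) -> List[str]:
    """Collect requirement lines from compiled requirements files.

    Hypothetical sketch: keeps lines such as "requests==2.31.0" and skips
    blank lines, comments (including indented "# via ..." notes), and
    option lines such as "-r base.txt" or "-c constraints.txt".
    """
    requirements: List[str] = []
    for path in paths:
        for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
            line = raw_line.strip()
            if not line or line.startswith("#") or line.startswith("-"):
                continue
            requirements.append(line)
    return requirements
```

Reading the compiled `.txt` output rather than the `.in` input means the pins chosen by `pip-compile` are what end up in the package metadata.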