This PR adds support for link extraction in the PDF `hi_res`
strategy. The `partition_pdf()` function now supports link extraction
when using the `hi_res` strategy, allowing users to extract hyperlinks
from PDF documents.
### Summary
- Added functionality to support link extraction in the `hi_res` flow
- Enhanced the word extraction functionality used for link extraction in
both the `fast` and `hi_res` flows, resulting in more accurate `start_index`
and `text` values in the `links` metadata.
- Updated ingest fixture update workflow to not skip Astra DB source
test
### Testing
```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf",
    strategy="hi_res",
)
assert len(elements[0].metadata.links) == 3
```
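Each entry in `elements[0].metadata.links` carries the `text` and `start_index` fields mentioned above, along with the target URL. A minimal inspection sketch, assuming the link entries behave as mappings (as in the current metadata schema):
```python
for link in elements[0].metadata.links:
    # anchor text, target URL, and the character offset of the anchor
    # text within the element's text
    print(link["text"], link["url"], link["start_index"])
```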
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
This PR bumps `unstructured-inference` to `0.8.0`, which introduces a
vectorized data structure for layout elements and text regions.
This PR also cleans up a few places in CI that had repeated definitions
of env variables or were missing installation of testing dependencies in
the cache.
A few document ingest results changed:
- two places for `biomed-api` (actually processed locally on the runner) are
due to very small changes in the numerical results of the bounding box
areas: one results in a duplicated page number/header and the other
results in deduplication of a word in a sentence that starts on a new
line (yes, the two cases go in opposite directions)
- the layout parser paper now outputs the code lines that have the page
number inside the code box as list items
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
### Description
Alternative to https://github.com/Unstructured-IO/unstructured/pull/3572,
but maintains all ingest tests, running them by pulling in the latest
version of unstructured-ingest.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
**Summary**
Remove dead code in `unstructured.file_utils`.
**Additional Context**
These modules were added in 12/2022 and 1/2023 and are not referenced by
any code. Removing them reduces unnecessary complexity; they can of course
be recovered from Git history if we decide we want them again in the future.
This PR adds the base and extra requirement files to the ingest
cache's hash key.
- The current workflow uses only the ingest requirements to generate
the hash key for the GitHub Actions cache.
- Sometimes only the base or extra requirements (like `extra-pdf.txt`) are
updated, but no ingest requirements -> the ingest test would then
fetch a cache with outdated non-ingest dependencies.
- When we generate a new ingest cache, we actually do check the base and
extra requirements first and generate a base env before layering the
ingest dependencies on top.
- This PR allows the ingest step to recognize changes to non-ingest
dependencies and trigger new cache generation when either the ingest
or the base/extra requirement files change.
This PR also bumps the `setup-python` action version in the cache actions; it
also adds installation of `virtualenv` for the ingest cache action to
avoid errors like
https://github.com/Unstructured-IO/unstructured/actions/runs/10905551870/job/30265057515?pr=3641#step:3:111
### Summary
Updates the file detection logic for OLE files to check the storage
content of the file to more reliably differentiate between DOC, PPT, XLS,
and MSG files. This corrects a bug that caused file type detection to be
incorrect in cases where the `filetype` library guessed an incorrect
MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file.
As part of this work, the `"msg"` extra was removed because the
`python-oxmsg` package is now a base dependency.
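For illustration, here is a minimal sketch of the storage-based idea (not the library's exact code): OLE containers of different types carry distinctive top-level stream names, which are a more reliable signal than a guessed MIME type.
```python
import olefile

def disambiguate_ole(filename: str) -> str:
    """Guess the concrete OLE format from well-known top-level stream names."""
    with olefile.OleFileIO(filename) as ole:
        top_level = {entry[0] for entry in ole.listdir()}
    if "WordDocument" in top_level:
        return "doc"
    if "PowerPoint Document" in top_level:
        return "ppt"
    if "Workbook" in top_level or "Book" in top_level:
        return "xls"
    if "__properties_version1.0" in top_level:  # MSG property stream
        return "msg"
    return "unknown"
```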
### Testing
Using a test `.msg` file that returns `'application/vnd.ms-excel'` from
`filetype.guess_mime`.
```python
from unstructured.file_utils.filetype import detect_filetype
filename = "test-file.msg"
detect_filetype(filename=filename) # result should be FileType.MSG
```
### Summary
Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by
updating to `nltk==3.8.2` and closes #3511. This CVE had previously been
mitigated in #3361.
---------
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
This PR removes the `unstructured.paddlepaddle` fork. Previously, we
used the `unstructured.paddlepaddle` fork to support
`unstructured.paddleocr` on the arm64 architecture. But currently,
`unstructured.paddleocr` with `unstructured.paddlepaddle` fails to work
on `arm64`, while `unstructured.paddleocr` with the latest version of the
original `paddlepaddle` works on both `amd64` and `arm64` architectures.
### Testing
```python
import os

os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename=<file_path>,  # placeholder: substitute a real PDF path
    strategy="hi_res",
    infer_table_structure=True,
)
```
### Summary
Adds a CI check to ensure that packages added as dependencies are
appropriately licensed. All of the `.txt` files in the `requirements`
directory are checked with the exception of:
- `constraints.txt`, since those are not installed and are instead
conditions on the other dependency files
- `dev.txt`, since those are for local development and not shipped as
part of the `unstructured` package
- `extra-pdf-image.txt`: only the `extra-pdf-image.in` file is checked,
since checking `extra-pdf-image.txt` pulls in NVIDIA GPU related packages
with an `Other/Proprietary` license type, and there's not a good way to
exclude those without adding `Other/Proprietary` to the allowed licenses list.
### Testing
The new `check-licenses` job should pass in CI.
### Summary
Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705), which
highlights the risk of remote code execution when running
`nltk.download`. Removes `nltk.download` in favor of a `.tgz` file with
the appropriate NLTK data files, with a SHA256 hash check to validate
the download. An error is now raised if `nltk.download` is invoked.
The logic for determining the NLTK download directory is borrowed from
`nltk`, so users can still set `NLTK_DATA` as they did previously.
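For reference, a minimal sketch of the validated-download approach; the URL, digest, and function name below are placeholders, not the actual values used by the package.
```python
import hashlib
import tarfile
import urllib.request

EXPECTED_SHA256 = "<known-good-digest>"  # placeholder, not the real digest

def download_nltk_data(url: str, dest_dir: str) -> None:
    """Download an NLTK data tarball and unpack it only if the hash matches."""
    tgz_path, _ = urllib.request.urlretrieve(url)
    with open(tgz_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != EXPECTED_SHA256:
        raise ValueError("SHA256 mismatch; refusing to unpack NLTK data")
    with tarfile.open(tgz_path, "r:gz") as tar:
        tar.extractall(dest_dir)
```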
### Testing
1. Create a directory called `~/tmp/nltk_test`. Set
`NLTK_DATA=${HOME}/tmp/nltk_test`.
2. From a python interactive session, run:
```python
from unstructured.nlp.tokenize import download_nltk_packages
download_nltk_packages()
```
3. Run `ls ~/tmp/nltk_test`. You should see the downloaded
data.
---------
Co-authored-by: Steve Canny <stcanny@gmail.com>
### Summary
Updates to the latest version of the `wolfi-base` image. Changes
include:
- Version bumps to address CVEs
- `libreoffice` is now included in the `arm64` image, so `.doc` files are
now supported on `arm64`. `.ppt` files do not work with the `libreoffice`
package currently available on `wolfi-os`; we have follow-on work to look
into that.
- Updates the location of the `tesseract` `tessdata` files in the
`arm64` build. Closes #3290.
- Closes #3319 and adds `psutil` to the base dependencies.
### Testing
- `test_dockerfile` should continue to pass with the updates.
When we switched the community Slack from paid to free we lost the CI test
bot. Also, if messages are deleted after 90 days, our expected test data
will disappear.
- created a new bot in our paid company Slack
(test_unstructured_ingest_bot)
- added a new private channel (test-ingest)
- invited the bot to the channel
- adjusted the end datetime of the test to cover the first few messages
in the channel
Still to do:
- update the CI secrets with the new bot token
- update LastPass with the new bot token (I don't have write
access)
### Description
In use cases where an external system (such as code being run in a
Jupyter notebook) already has a running event loop, run the async code
in a dedicated thread pool so it does not conflict with the existing
event loop.
This PR also includes a variety of fixes that were found when putting
together a demo leveraging the elasticsearch destination connector.
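The core pattern is roughly the following sketch (simplified, not the connector's actual code): detect a running loop and, if one exists, run the coroutine on a dedicated thread with its own loop.
```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def run_coroutine(coro):
    """Run a coroutine to completion whether or not a loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No running loop (plain script): asyncio.run is safe to use directly.
        return asyncio.run(coro)
    # A loop is already running (e.g., in a Jupyter notebook): run the
    # coroutine in a dedicated thread with its own event loop instead.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```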
### Summary
- bump unstructured-inference to `0.7.35`, which fixed a `ValueError` when
converting cells to HTML in the table processing subpipeline
- cut a release for `0.14.8`
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
### Summary
Version bumps for the week of 2024-06-17. There is now a pin on
`numpy` due to a breaking change in the latest version that we'll need
to investigate and remove in a subsequent PR.
We should be validating the S3 destination with authenticated requests,
using credentials from a limited test user.
## Changes
- Updates s3 destination test to point to a bucket that requires
authentication.
- Adds authentication to the s3 destination test request
- Bonus: fix deserialization of S3ConnectionConfig for s3 V2 destination
- Bonus: fix S3ConnectionConfig never registered for s3 V2 destination
- Bonus: repair version and changelog version for consistency with -dev
convention
## Testing
Validated by changes to S3 destination ingest test
### Summary
Explicitly replaces all old docs pages with a link to the new docs. This
was required because 404 redirects didn't work for pages that previously
existed, though they did work for non-existent paths.
### Summary
Switches to installing `libreoffice` from the Wolfi repository and
upgrades the `libreoffice` version to `libreoffice==24.x.x`. Resolves a
medium vulnerability in the old `libreoffice` version. Security scanning
with `anchore/grype` was also added to the `test_dockerfile` job.
Requirements were bumped to resolve a vulnerability in the `requests`
library.
### Testing
`test_dockerfile` passes with the updates.
### Summary
Closes #2959. Updates the dependencies and CI to add support for Python
3.12.
The MongoDB ingest tests were disabled due to jobs like [this
one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333)
failing due to issues with the `bson` package. `bson` is a dependency
for the AstraDB connector, but `pymongo` does not work when `bson` is
installed from `pip`. This issue is documented by MongoDB
[here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun
off #3049 to resolve this. Issue seems unrelated to Python 3.12, though
unsure why this didn't surface previously.
Disables the `argilla` tests because `argilla` does not yet support
Python 3.12. We can add the `argilla` tests back once the PR
referenced below is merged. You can still use the `stage_for_argilla`
function if you're on `python<3.12` and you install `argilla` yourself;
a usage sketch follows the link below.
- https://github.com/argilla-io/argilla/pull/4837
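For reference, a hedged usage sketch (assuming `python<3.12` with `argilla` installed separately; `"text_classification"` is one supported task name, and the exact signature may differ across versions):
```python
from unstructured.partition.auto import partition
from unstructured.staging.argilla import stage_for_argilla

elements = partition(filename="example-docs/DA-1p.pdf")
# stage the partitioned elements as an argilla dataset for the given task
dataset = stage_for_argilla(elements, "text_classification")
```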
---------
Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
### Summary
Updates the `Dockerfile` to use the Chainguard `wolfi-base` image to
reduce CVEs. Also adds a step in the docker publish job that scans the
images and checks for CVEs before publishing. The job will fail if there
are high or critical vulnerabilities.
### Testing
Run `make docker-run-dev` and then `python3.11` once you're in. At that
point, you can try:
```python
from unstructured.partition.auto import partition
elements = partition(filename="example-docs/DA-1p.pdf", skip_infer_table_types=["pdf"])
elements
```
Stop the container once you're done.
**Reviewers:** Likely quicker to review commit-by-commit.
**Summary**
In preparation for adding a PPTX `Picture` shape _sub-partitioner_,
extract management of PPTX partitioning-run options into a separate
`_PptxPartitioningOptions` object, similar to those used in chunking and
XLSX partitioning (a minimal sketch follows the list below). This
provides several benefits:
- Extract code dealing with applying defaults and computing derived
values from the main partitioning code, leaving it less cluttered and
focused on the partitioning algorithm itself.
- Allow the options set to be passed to helper objects, prominently
including sub-partitioners, without requiring a long list of parameters
or requiring the caller to couple itself to the particular option values
the helper object requires.
- Allow options behaviors to be thoroughly and efficiently tested in
isolation.
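A minimal sketch of the options-object pattern (illustrative only; the attribute names here are hypothetical, not the actual implementation):
```python
from typing import Optional

class _PptxPartitioningOptions:
    """Centralizes defaults and derived values for a single partitioning run."""

    def __init__(
        self, *, strategy: Optional[str] = None, infer_table_structure: bool = True
    ) -> None:
        self._strategy = strategy
        self._infer_table_structure = infer_table_structure

    @property
    def infer_table_structure(self) -> bool:
        return self._infer_table_structure

    @property
    def strategy(self) -> str:
        # derived value: the default is applied once here, so neither the
        # partitioner nor any sub-partitioner re-implements it
        return self._strategy or "auto"
```
A helper such as a picture sub-partitioner can then accept this single object instead of a long parameter list, and tests can exercise default and derived-value behavior on the options object in isolation.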
Thanks to @mogith-pn from Clarifai we have a new destination connector!
This PR adds Clarifai as an ingest destination connector, including:
- Access via CLI and programmatic use
- Documentation and examples
- An integration test script
The docker-publish GitHub Actions workflow builds amd64 and arm64 images of
the repository and tests them before publishing. These tests have been
failing since [this
commit](ee8b0f93dc)
with the error `UNS_API_KEY environment variable not set`.
The issue is that [this
line](b27ad9b6aa/.github/workflows/docker-publish.yml (L62))
in the workflow is actually blowing away the value assigned to the file
in the previous line.
## Changes
* Update the line that was overwriting the `UNS_API_KEY` assignment to the
uns_test_env_file in the docker-publish workflow to use the `>>` (append)
operator, so that the `UNSTRUCTURED_HF_TOKEN` assignment is only appended.
* [bonus]: use arithmetic expansion in version-sync.sh to keep shellcheck
happy
## Testing
To validate, I edited the docker-publish workflow to trigger on push
(and to run the tests but not publish the images) in [this
commit](0f04f5f0f7).
The successful test results can be reviewed
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/8199826803).
### Description
Currently the requirements associated with an extra in `setup.py` are
being dynamically generated using the `load_requirements()` method in
the same file. It is being passed all the `.in` files, which are then
read line by line to generate the requirements associated with each
extra. Unless the `.in` file itself has a version pin, this will never
respect the pinned `.txt` files generated by `pip-compile`. This fix
updates all the inputs to `load_requirements()` to use the `.txt` files
themselves.
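An illustrative sketch of the change (not the repo's exact code): read the pinned requirements from the `pip-compile`-generated `.txt` files, skipping comments and `pip-compile` artifacts.
```python
from pathlib import Path

def load_requirements(file: str) -> list[str]:
    """Return the pinned requirement lines from a pip-compile .txt file."""
    lines = Path(file).read_text().splitlines()
    # skip blanks, comments, and options such as "-r base.txt" or "-c constraints.txt"
    return [
        line.strip()
        for line in lines
        if line.strip() and not line.strip().startswith(("#", "-"))
    ]

# extras now resolve against the compiled, pinned .txt files:
extras_require = {"pdf": load_requirements("requirements/extra-pdf.txt")}
```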
Thanks to Eric Hare @erichare at DataStax we have a new destination
connector.
This pull request implements an integration with [Astra
DB](https://datastax.com), which allows the Astra DB vector database
to be used with Unstructured's set of integrations.
To create your Astra account and authenticate with your
`ASTRA_DB_APPLICATION_TOKEN` and `ASTRA_DB_API_ENDPOINT`, follow these
steps:
1. Create an account at https://astra.datastax.com
2. Log in and create a new database
3. From the database page, in the right-hand panel, you will find your
API Endpoint
4. Beneath that, you can create a token to be used
Some notes about Astra DB:
- Astra DB is a Vector Database which allows for high-performance
database transactions, and enables modern GenAI apps [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html)
- It supports similarity search via a number of methods [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics)
- It also supports non-vector tables / collections
Thanks to Ofer at Vectara, we now have a Vectara destination connector.
- There are no new dependencies, since the connector is implemented
entirely as REST calls to the Vectara API
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
It is nice to natively support both Tesseract and Paddle. However, one
might already use another OCR engine and might want to keep using it (for
quality reasons, for cost reasons, etc.).
This PR adds the ability for users to specify their own OCR agent
implementation, which is then called by unstructured.
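For example, selecting a custom agent follows the same mechanism as the built-in Paddle agent shown earlier: point `OCR_AGENT` at the fully qualified path of your class (the module and class names below are hypothetical):
```python
import os

# hypothetical module path to your own OCR agent implementation
os.environ["OCR_AGENT"] = "my_package.ocr.MyOCRAgent"

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf", strategy="hi_res"
)
```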
I am new to unstructured, so don't hesitate to let me know if you would
prefer this being done differently and I will rework the PR.
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
### Description
Given all the shell files that now exist in the repo, it would be nice to
have linting/formatting around them (in addition to the existing
shellcheck, which doesn't do anything to format the shell code). This PR
introduces `shfmt` to both check for changes and apply formatting when
the associated make targets are called.
### Description
Currently the ingest unit tests are running in both the source and
destination ingest steps. This PR moves them out into their own step that
can run on a more basic runner, since it doesn't require multiple
cores/CPUs for parallelization.
Further optimizations:
* Added the changelog step as a dependency for the ingest tests to avoid
running them (fail fast) if the pipeline has already failed due to the
changelog not being updated.
* The cache was never actually being saved if it needed to be recreated;
a save step was added at the end of each custom action.