unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-07 05:38:38 +00:00

Author	SHA1	Message	Date
David Potter	8610bd3ab9	feat: Kafka source and destination connector (#3176 ) Thanks to @tullytim we have a new Kafka source and destination connector. It also works with hosted Kafka via Confluent. Documentation will be added to the Docs repo.	2024-06-22 23:26:23 +00:00
Matt Robinson	9acf26ec2e	docs: explicitly replace all old pages with link to new docs (#3118 ) ### Summary Explicitly replaces all old docs pages with a link to the new docs. This was required because 404 redirects didn't work for pages that previously existed, though they worked for paths that never existed.	2024-05-30 13:01:33 +00:00
Matt Robinson	8415db5112	docs: make 404 pages same as index (#3114 ) ### Summary Makes a custom 404 page that's the same as `index.html`, so any path shows the URL for the new docs.	2024-05-30 07:46:38 -04:00
Matt Robinson	2ecaf5e38c	fix: remove 404 from docs (#3112 ) ### Summary Removes 404 from the docs build to avoid rate limiting behavior.	2024-05-29 20:41:32 +00:00
Matt Robinson	c9976760c5	fix: revert back to old requirements file for sphinx docs (#3077 ) ### Summary As seen in [this job](https://github.com/Unstructured-IO/unstructured/actions/runs/9182534479/job/25251583102), the build job for sphinx docs is failing, and has been failing for quite some time. This PR reverts the requirements file back to a [previous good commit](`91b892c79d`) for that job, and also moves the `build.in` file so the requirements file doesn't get updated on `make pip-compile`. This is fine since those requirements don't get installed as part of the package, and we've deprecated the `sphinx` docs in favor of https://docs.unstructured.io anyway. ### Testing Build was [successful](https://github.com/Unstructured-IO/unstructured/actions/runs/9198605026/job/25301670934?pr=3077) on the feature branch. --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com>	2024-05-23 03:32:06 +00:00
Christine Straub	18428f24ab	chore: bump unstructured-inference 0.7.33 (#3074 ) Summary: - bump unstructured-inference to `0.7.33` - cut a release for `0.14.2` - add some dependencies that previously came through from the layoutparser extras.	2024-05-22 22:35:00 +00:00
Matt Robinson	059fc64bd9	build: apk add libreoffice24 (#3065 ) ### Summary Switches to installing `libreoffice` from the Wolfi repository and upgrades the `libreoffice` version to `libreoffice==24.x.x`. Resolves a medium vulnerability in the old `libreoffice` version. Security scanning with `anchore/grype` was also added to the `test_dockerfile` job. Requirements were bumped to resolve a vulnerability in the `requests` library. ### Testing `test_dockerfile` passes with the updates.	2024-05-21 18:54:16 +00:00
Matt Robinson	73739b38cc	docs: redirect to docs.unstructured.io on github pages (#3054 ) ### Summary Updates GitHub pages to redirect to the new https://docs.unstructured.io page. This will appear on GitHub pages after the next tag. ### Testing 1. From the docs direction, run `make html`. You should not see any errors or warnings 2. Open `unstructured/docs/build/html/index.html`. It should look like the following: <img width="1512" alt="image" src="https://github.com/Unstructured-IO/unstructured/assets/1635179/077626a5-d88a-467e-9e37-273a92e75d30"> 3. Open `unstructured/docs/build/html/404.html`. It should redirect back to `index.html`. Per the [GitHub pages docs](https://docs.github.com/en/pages/getting-started-with-github-pages/creating-a-custom-404-page-for-your-github-pages-site), that page will get served for 404 errors, meaning any links to old docs pages will redirect to `index.html`, which points users to the new docs page.	2024-05-21 09:38:32 -04:00
Matt Robinson	d7608014c0	improve: add Python 3.12 support (#3033 ) (#3047 ) ### Summary Closes #2959. Updates the dependency and CI to add support for Python 3.12. The MongoDB ingest tests were disabled due to jobs like [this one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333) failing due to issues with the `bson` package. `bson` is a dependency for the AstraDB connector, but `pymongo` does not work when `bson` is installed from `pip`. This issue is documented by MongoDB [here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun off #3049 to resolve this. Issue seems unrelated to Python 3.12, though unsure why this didn't surface previously. Disables the `argilla` tests because `argilla` does not yet support Python 3.12. We can add the `argilla` tests back in once the PR referenced below is merged. You can still use the `stage_for_argilla` function if you're on `python<3.12` and you install `argilla` yourself. - https://github.com/argilla-io/argilla/pull/4837 --------- Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>	2024-05-19 23:03:15 +00:00
Matt Robinson	f4b01a4aad	build(deps): bump versions for security hygiene (#3008 ) ### Summary Version bumps to keep on top of security scans.	2024-05-13 15:30:09 +00:00
Christine Straub	b64a48440d	chore: bump unstructured-inference 0.7.31 (#2981 )	2024-05-08 16:26:58 +00:00
Steve Canny	eff84afe24	chore: update python-docx version dependency (#2952 ) Summary `unstructured` will use table features added in the most recent version of `python-docx`. Also update the `lxml` version constraint because `lxml>4.9.2` will not install on Apple Silicon (https://github.com/Unstructured-IO/unstructured/issues/1707). `python-docx` requires `lxml` although other file formats require it as well.	2024-05-01 21:36:31 +00:00
cragwolfe	9e46ed016c	fix: reqs arm64 friendly again. release 0.13.4 (#2935 ) Cut a release. Run pip-compile on mac to avoid `nvidia-*` requirements creeping into `requirements/extra-pdf-image.txt`. This should fix arm64 image builds that have been breaking on main.	2024-04-26 08:15:13 +00:00
David Potter	00f544f100	fix: improve doc code (#2920 ) Improves the documentation code. Standardizes unstructured api key Replaces misc hard coded values Replaces `azureunstructured1` with a generic value	2024-04-25 17:55:15 +00:00
John	3843af666e	feat: Enable remote chunking via unstructured-ingest (#2905 ) Update: The cli shell script works when sending documents to the free api, but the paid api is down, so waiting to test against it. - The first commit adds docstrings and fixes type hints. - The second commit reorganizes `test_unstructured_ingest` so it matches the structure of `unstructured/ingest`. - The third commit contains the primary changes for this PR. - The `.chunk()` method responsible for sending elements to the correct method is moved from `ChunkingConfig` to `Chunker` so that `ChunkingConfig` acts as a config object instead of containing implementation logic. `Chunker.chunk()` also now takes a json file instead of a list of elements. This is done to avoid redundant serialization if the file is to be sent to the api for chunking. --------- Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>	2024-04-25 00:24:58 +00:00
Michał Martyniak	2d1923ac7e	Better element IDs - deterministic and document-unique hashes (#2673 ) Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842 Main changes compared to part one: * hash computation includes element's sequence number on page, page number, document filename and its text * there are more test for deterministic behavior of IDs returned by partitioning functions + their uniqueness (guaranteed at the document level, and high probability across multiple documents) This PR addresses the following issue: https://github.com/Unstructured-IO/unstructured/issues/2461	2024-04-24 00:05:20 -07:00
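The hash inputs listed in #2673 (sequence number on page, page number, document filename, and text) can be illustrated with a minimal sketch. The function name `element_id`, the `|` separator, the use of SHA-256, and the 32-character truncation are all assumptions made for illustration, not the library's actual implementation.

```python
import hashlib


def element_id(text: str, sequence_number: int, page_number: int, filename: str) -> str:
    """Return a deterministic hex ID built from the four inputs the PR names.

    Hypothetical helper: the separator, digest, and truncation are assumptions.
    """
    # Combine all four fields so identical text at a different position,
    # page, or file yields a different ID.
    data = f"{filename}|{page_number}|{sequence_number}|{text}"
    return hashlib.sha256(data.encode("utf-8")).hexdigest()[:32]


first = element_id("Hello", sequence_number=0, page_number=1, filename="a.pdf")
again = element_id("Hello", sequence_number=0, page_number=1, filename="a.pdf")
moved = element_id("Hello", sequence_number=1, page_number=1, filename="a.pdf")
print(first == again, first == moved)
```

Because the ID depends only on its inputs, re-partitioning the same document reproduces the same IDs, while repeated text within a document still gets distinct IDs.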
Steve Canny	05ff975081	fix: remove unused `ElementMetadata.section` (#2921 ) Summary The `.section` field in `ElementMetadata` is dead code, possibly a remainder from a prior iteration of `partition_epub()`. In any case, it is not populated by any partitioner. Remove it and any code that uses it.	2024-04-22 23:58:17 +00:00
Roman Isecke	9ad2993fe3	bug: fix pip-compile (#2885 ) ### Description Currently wasn't compiling `base.in` first, which is required because others use the generated `.txt` file as a constraint.	2024-04-19 21:39:25 +00:00
Michał Martyniak	001fa17c86	Preparing the foundation for better element IDs (#2842 ) Part one of the issue described here: https://github.com/Unstructured-IO/unstructured/issues/2461 It does not change how hashing algorithm works, just reworks how ids are assigned: > Element ID Design Principles > > 1. A partitioning function can assign only one of two available ID types to a returned element: a hash or UUID. > 2. All elements that are returned come with an ID, which is never None. > 3. No matter which type of ID is used, it will always be in string format. > 4. Partitioning a document returns elements with hashes as their default IDs. Big thanks to @scanny for explaining the current design and suggesting ways to do it right, especially with chunking. Here's the next PR in line: https://github.com/Unstructured-IO/unstructured/pull/2673 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>	2024-04-16 21:14:53 +00:00
Roman Isecke	d6f2841ff4	feat: update dependencies and remove constraint on pydantic (#2841 ) ### Description * The `consistent-deps.sh` was fixed to take into account the ingest dependencies, causing some errors to show up. New constraints were added to make that script pass. * Update all requirements without constraint on pydantic, allowing the latest version to be pulled in. * `pikepdf` is causing a conflict but there's a fix on their `main` branch; we just need the next release to be published. Opened up a question here to see if we can get that out any sooner: [Do releases happen on a schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now added `lxml<5` to the constraints. A couple optimizations: * `constraints.in` renamed to `constraints.txt` since the whole point is all dependencies are already pinned and the file never gets compiled * `constraints.txt` moved to a `requirements/deps` directory as this never gets compiled by `pip-compile` * Other dependency files updated to reference the new location of `base.in` and `constraints.txt` * Makefile updated since it was originally written to avoid the `base.in` and `constraints.in` files	2024-04-04 19:58:23 +00:00
Ahmet Melek	d46792214a	feat: add vertexai embeddings (#2693 ) This PR: - Adds VertexAI embeddings as an embedding provider Testing - Tested with pinecone destination connector on [this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693) job run. --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2024-03-28 21:15:36 +00:00
Steve Canny	9ae838e50a	feat: add --include-orig-elements option to Ingest CLI (#2687 ) Summary Add an `--include-orig-elements` option to the Ingest CLI to allow users to specify the corresponding new chunking parameter. Reviewer A lot of this is cleanup; the second commit is where the option is actually added. The first commit fixes a number of inaccuracies in the documentation and does some other clean-up. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-27 06:35:01 +00:00
Steve Canny	56fbaaed10	feat(chunking): add metadata.orig_elements serde (#2680 ) Summary This final PR in the "orig_elements" series adds the needful such that `.metadata.orig_elements`, when present on a chunk (element), is serialized to JSON when the chunk is serialized, for instance, to be used in an HTTP response payload. It also provides for deserializing such a JSON payload into chunks that contain the `.orig_elements` metadata. Additional Context Note that `.metadata.orig_elements` is always `Optional[list[Element]]` when in memory. However, those original elements are serialized as Base64-encoded gzipped JSON and are in that form (str) when present as JSON or as "element-dicts" which is an intermediate serialization/deserialization format. That is, serialization is `Element -> dict -> JSON` and deserialization is `JSON -> dict -> Element` and `.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-22 21:53:26 +00:00
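The "Base64-encoded gzipped JSON" form described in #2680 can be sketched with the standard library alone. The function names below are hypothetical, plain dicts stand in for element-dicts, and the exact compression settings the library uses are not stated in the commit message, so treat this as an illustration of the round trip only.

```python
import base64
import gzip
import json


def encode_orig_elements(element_dicts: list[dict]) -> str:
    """Serialize element-dicts to a Base64 string of gzipped JSON (hypothetical helper)."""
    raw = json.dumps(element_dicts).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decode_orig_elements(payload: str) -> list[dict]:
    """Reverse encode_orig_elements: Base64 -> gunzip -> JSON -> element-dicts."""
    raw = gzip.decompress(base64.b64decode(payload))
    return json.loads(raw.decode("utf-8"))


# Stand-in element-dicts; real ones carry more fields.
originals = [{"type": "Title", "text": "Intro"}, {"type": "NarrativeText", "text": "Body"}]
payload = encode_orig_elements(originals)
restored = decode_orig_elements(payload)
print(restored == originals)
```

The encoded value is a plain `str`, which is why it can sit inside both the intermediate `dict` form and the final JSON without further escaping.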
Filip Knefel	bdfd975115	chore: change table extraction defaults (#2588 ) Change default values for table extraction - works in pair with [this](https://github.com/Unstructured-IO/unstructured-api/pull/370) `unstructured-api` PR We want to move away from `pdf_infer_table_structure` parameter, in this PR: - We change how it's treated wrt `skip_infer_table_types` parameter. Whether to extract tables from pdf now follows from the rule: `pdf_infer_table_structure && "pdf" not in skip_infer_table_types` - We set it to `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` by default - We remove it from the examples in documentation - We describe it as deprecated in favor of `skip_infer_table_types` in documentation More detailed description of how we want parameters to interact - if `pdf_infer_table_structure` is False tables will never be extracted from pdf - if `pdf_infer_table_structure` is True tables will be extracted from pdf unless it's skipped via `skip_infer_table_types` - by default `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` --------- Co-authored-by: Filip Knefel <filip@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-22 10:08:49 +00:00
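The interaction rule stated in #2588 reduces to a single boolean expression. The sketch below restates it in Python; the function name `should_extract_pdf_tables` is hypothetical, and only the two parameter names and their defaults come from the commit message.

```python
def should_extract_pdf_tables(pdf_infer_table_structure=True, skip_infer_table_types=None):
    """Apply the rule: pdf_infer_table_structure && "pdf" not in skip_infer_table_types."""
    # None stands in for the documented default of an empty list.
    skip = [] if skip_infer_table_types is None else skip_infer_table_types
    return pdf_infer_table_structure and "pdf" not in skip


print(should_extract_pdf_tables())                                   # defaults
print(should_extract_pdf_tables(skip_infer_table_types=["pdf"]))     # skipped
print(should_extract_pdf_tables(pdf_infer_table_structure=False))    # disabled
```

With the new defaults, tables are extracted from PDFs unless the caller opts out through either parameter.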
David Potter	9177aa20a8	feature CORE-3985: add Clarifai destination connector (#2633 ) Thanks to @mogith-pn from Clarifai we have a new destination connector! This PR intends to add Clarifai as an ingest destination connector. Access via CLI and programmatic. Documentation and Examples. Integration test script.	2024-03-21 16:36:21 +00:00
Mason Brothers	ea67be5665	Doc: Change Python comment string to JavaScript comment string. (#2596 ) JavaScript uses `//` for comments instead of `#` Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-15 09:56:35 -05:00
Ronny H	9cbede37bd	Update requirements for GCS IAM Role for Platform Source & Destination Connectors (#2637 ) To test: > cd docs && make html Changelogs: * added a note to have Storage Object Viewer IAM Role for the GCS source connector. * added a note to have Storage Object Creator IAM Role for the GCS destination connector.	2024-03-11 22:14:20 +00:00
Ronny H	2afd347e6b	Create Enterprise Platform Documentation (#2486 ) To test: > cd docs && make html Structures: * Getting Started with Platform (User Account Management) * Set Up workflow automation * Job Scheduling * Platform Source Connectors: * Azure Blob Storage, * Amazon S3 * Salesforce * Sharepoint * Google Cloud Storage * Google Drive * One Drive * Elasticsearch * SFTP Storage * Platform Destination Connectors: (i) * Amazon S3 * Azure Cognitive Search * Google Cloud Storage * Pinecone * Elasticsearch * Weaviate * MongoDB * AWS OpenSearch * Databricks --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2024-03-06 19:16:08 +00:00
John	3783b44d0b	fix documentation html links example (#2608 ) Closes #2577 Testing:
```
from unstructured.partition.html import partition_html

cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)
links = []

for element in elements:
    if element.metadata.link_urls:
        relative_link = element.metadata.link_urls[0][1:]
        if relative_link.startswith("2024"):
            links.append(f"{cnn_lite_url}{relative_link}")

print(links)
```
--------- Co-authored-by: ron-unstructured <ronny@unstructured.io> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-04 18:33:42 +00:00
Michał Martyniak	b9aa4b7452	fix: Install pandoc consistently, via Makefile recipe (version that supports .rtf files as input format) (#2593 ) ## Problem Description In some cases you might find yourselves in a situation when pandoc won't be able to process an `rtf` as input file format, because older versions simply do not support that. ``` RuntimeError: Invalid input format! Got "rtf" but expected one of these: commonmark, creole, csv, docbook, docx, dokuwiki, epub, fb2, gfm, haddock, html, ipynb, jats, jira, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki, twiki, vimwiki ``` Basically, some user may install the wrong version. The `README.md` is not be precise enough when mentioning RTF files support: `47b35ccdd6/README.md (L120-L122)` ## Example Installing `pandoc` from a [stable repository, like Debian](https://packages.debian.org/source/bullseye/pandoc) will give you `2.9` and the official documentation shows clearly that support for rtf was introduced in `2.14` https://pandoc.org/releases.html#pandoc-2.14.2-2021-08-21 ![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/3d5199f1-5e39-46ad-ac90-fff9cc5543a8) ### Note that `rtf` is not there ![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/de90ebaf-86f2-4b21-83fb-085e27eeea38) ### More detail ![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/59fbb91f-1650-4091-bdcb-15aa035416c8) ## Proposed Solution - [x] I've simply added/copied `make install-pandoc` calls, mimicking other recipes in order to ensure that `3.1.2` will be installed in all cases. Side note: `make install-pandoc` calls `./scripts/install-pandoc.sh` under the hood. 
- [x] Update README file - mention that `make install-pandoc` is recommended (`>=2.14.2`) - [x] Verify tests that cover `rtf` cases: `47b35ccdd6/test_unstructured/file_utils/test_file_conversion.py (L14)` - [x] Update `setup_ubuntu.sh` if needed?: `47b35ccdd6/scripts/setup_ubuntu.sh (L87)` -	2024-03-04 11:02:32 +00:00
David Potter	43250d5576	bug CORE-3971: fix deserialization in google-drive source connector key path (#2586 ) Google Drive Service account key can be a dict or a file path(str) We have successfully been using the path. But the dict can also end up being stored as a string that needs to be deserialized. The deserialization can have issues with single and double quotes.	2024-03-03 15:30:35 +00:00
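The quoting problem described in #2586 can be sketched as follows. This is an assumed approach, not the connector's actual code: the function name `parse_service_account_key` is hypothetical, and the fallback from `json.loads` to `ast.literal_eval` is one common way to accept a dict that was stringified with single quotes.

```python
import ast
import json
from typing import Union


def parse_service_account_key(value: Union[dict, str]) -> Union[dict, str]:
    """Return a dict for dict-like input, or the original string if it looks like a path."""
    if isinstance(value, dict):
        return value
    try:
        # Double-quoted form, e.g. '{"type": "service_account"}'
        parsed = json.loads(value)
    except json.JSONDecodeError:
        try:
            # Single-quoted form, e.g. "{'type': 'service_account'}" (a Python repr)
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # Neither JSON nor a Python literal: treat it as a file path.
            return value
    return parsed if isinstance(parsed, dict) else value


print(parse_service_account_key('{"type": "service_account"}'))
print(parse_service_account_key("{'type': 'service_account'}"))
print(parse_service_account_key("/secrets/key.json"))
```

`ast.literal_eval` only evaluates literals, so it is a safer fallback than `eval` for untrusted configuration strings.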
Christine Straub	5cb6504d5a	docs: update image block extraction docs (#2578 ) This PR removes `extract_image_block_to_payload` section from "API Parameters" page. The "unstructured" API does not support the `extract_image_block_to_payload` parameter, and it is always set to `True` internally on the API side when trying to extract image blocks via the API. Users only need to specify `extract_image_block_types` parameter when extracting image blocks via the API. NOTE: The `extract_image_block_to_payload` parameter is only used when calling `partition()`, `partition_pdf()`, and `partition_image()` functions directly. ### Testing CI should pass.	2024-02-24 04:36:58 +00:00
David Potter	e8ec09c8b9	feat: astra dest connector (#2571 ) Thanks to Eric Hare @erichare at DataStax we have a new destination connector. This Pull Request implements an integration with [Astra DB](https://datastax.com) which allows for the Astra DB Vector Database to be compatible with Unstructured's set of integrations. To create your Astra account and authenticate with your `ASTRA_DB_APPLICATION_TOKEN`, and `ASTRA_DB_API_ENDPOINT`, follow these steps: 1. Create an account at https://astra.datastax.com 2. Login and create a new database 3. From the database page, in the right hand panel, you will find your API Endpoint 4. Beneath that, you can create a Token to be used Some notes about Astra DB: - Astra DB is a Vector Database which allows for high-performance database transactions, and enables modern GenAI apps [See here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html) - It supports similarity search via a number of methods [See here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics) - It also supports non-vector tables / collections	2024-02-23 20:50:50 +00:00
Austin Walker	6d17b9a7e4	Fix a parameter name in the js-client example usage (#2560 ) `files` should be `fileName` as noted in https://github.com/Unstructured-IO/unstructured-js-client/issues/24 Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-02-18 17:28:15 +00:00
Ronny H	ad561b7939	Fixed broken links and improved readability in `key concepts` page (#2533 ) To test: > cd docs && make html	2024-02-14 18:19:47 +00:00
Ronny H	51427b3103	Renamed OpenAiEmbeddingConfig dataclass (#2546 )	2024-02-14 17:24:52 +00:00
David Potter	1a706771fa	feature: add octoai for embeddings (#2538 ) Thanks to Pedro at OctoAI we have a new embedding option. The following PR adds support for the use of OctoAI embeddings. Forked from the original OpenAI embeddings class. We removed the use of the LangChain adaptor, and use OpenAI's SDK directly instead. Also updated out-of-date example script. Including new test file for OctoAI. # Testing Get a token from our platform at: https://www.octoai.cloud/ For testing one can do the following:
```
export OCTOAI_TOKEN=<your octo token>
python3 examples/embed/example_octoai.py
```
## Testing done Validated running the above script from within a locally built container via `make docker-start-dev` --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-10 15:27:06 +00:00
Ronny H	67a8fd9809	Added example to use SaaS API URL in partition_via_api (#2512 ) To test: > cd docs && make html Changelog: added an example to use `SaaS API URL` in `partition_via_api` using `api_url` param. --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2024-02-07 19:25:42 +00:00
Filip Knefel	5defe79bf2	docs: add information about MIME type of extracted images (#2515 ) Include information about what mime type is expected when extracting images. Co-authored-by: Filip Knefel <filip@unstructured.io>	2024-02-07 08:40:24 +00:00
Ahmet Melek	be71633415	refactor: isolate ingest dependencies into local scopes (#2509 ) This PR: - Moves ingest dependencies into local scopes to be able to import ingest connector classes without the need of installing imported external dependencies. This allows lightweight use of the classes (not the instances. to use the instances as intended you'll still need the dependencies). - Upgrades the embed module dependencies from `langchain` to `langchain-community` module (to pass CI [rather than introducing a pin]) - Does pip-compile - Does minor refactors in other files to pass `ruff 2.0` checks which were introduced by pip-compile	2024-02-06 21:28:55 +00:00
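The local-scope import pattern described in #2509 can be illustrated with a minimal sketch. `ExampleConnector` and `hypothetical_vendor_sdk` are made-up names used only for illustration; they are not real `unstructured` classes or packages.

```python
class ExampleConnector:
    """Hypothetical connector whose external dependency is imported lazily."""

    def __init__(self, host: str):
        # No third-party import at module or class level, so the class can be
        # imported and configured without the dependency installed.
        self.host = host

    def connect(self):
        # The import runs only when the instance is actually used.
        import hypothetical_vendor_sdk  # assumption: stand-in for a real SDK

        return hypothetical_vendor_sdk.Client(self.host)


connector = ExampleConnector("localhost")  # works without the SDK
try:
    connector.connect()
    connected = True
except ImportError:
    # Raised only at use time when the dependency is missing.
    connected = False
print(connected)
```

The trade-off is that a missing dependency surfaces at call time instead of import time, which matches the commit's note that using instances as intended still requires the dependencies.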
David Potter	c100ce28a7	feat: add Vectara destination connector (#2357 ) Thanks to Ofer at Vectara, we now have a Vectara destination connector. - There are no dependencies since it is all REST calls to API - --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-01 14:38:34 +00:00
John	db67805ec6	feat: add support for partitioning .heic files (#2454 ) .heic files are an image filetype we have not supported. #### Testing
```
from unstructured.partition.image import partition_image

png_filename = "example-docs/DA-1p.png"
heic_filename = "example-docs/DA-1p.heic"

png_elements = partition_image(png_filename, strategy="hi_res")
heic_elements = partition_image(heic_filename, strategy="hi_res")

for i in range(len(heic_elements)):
    print(heic_elements[i].text == png_elements[i].text)
```
--------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-01-30 04:49:00 +00:00
John	9320311a19	fix: check languages args (#2435 ) This PR is the last in a series of PRs for refactoring and fixing the language parameters (`languages` and `ocr_languages`) so we can address incorrect input by users. See #2293 It is recommended to go through this PR commit-by-commit and note the commit message. The most significant commit is "update check_languages..."	2024-01-29 20:12:08 +00:00
Ronny H	d5a6f4b82c	Docs updates (#2458 ) To test: > cd docs && make html Change logs: * Updates the best practice for table extraction to use `skip_infer_table_types` instead of `pdf_infer_table_structure`. * Fixed CSS issue with a duplicate search box. * Fixed RST warning message * Fixed typo on the Intro page.	2024-01-25 20:31:28 +00:00
Roman Isecke	a8de52e94f	feat: databricks volumes dest added (#2391 ) ### Description This adds in a destination connector to write content to the Databricks Unity Catalog Volumes service. Currently there is an internal account that can be used for testing manually but there is no dedicated account to use for testing, so this is not being added to the automated ingest tests that get run in the CI. To test locally:
```shell
#!/usr/bin/env bash

path="testpath/$(uuidgen)"

PYTHONPATH=. python ./unstructured/ingest/main.py local \
  --num-processes 4 \
  --output-dir azure-test \
  --strategy fast \
  --verbose \
  --input-path example-docs/fake-memo.pdf \
  --recursive \
  databricks-volumes \
  --catalog "utic-dev-tech-fixtures" \
  --volume "small-pdf-set" \
  --volume-path "$path" \
  --username "$DATABRICKS_USERNAME" \
  --password "$DATABRICKS_PASSWORD" \
  --host "$DATABRICKS_HOST"
```
	2024-01-23 01:25:51 +00:00
Ronny H	149f894d0a	Fixed sphinx-build error by pinning alabaster=-0.7.13 (#2436 )	2024-01-20 14:36:48 -08:00
Ronny H	4c772f6ed7	Updated docs on API Params and Filetype Supports (#2433 ) To test: > cd docs && make html Changelogs: * Fixed sphinx error due to malformed rst table on partition page * Updated API Params, ie. `extract_image_block_types` and `extract_image_block_to_payload` * Updated image filetype supports	2024-01-19 16:07:57 -08:00
Matt Robinson	2d3a7f1c48	fix: fix table index error by bumping `unstructured-inference` (#2430 ) ### Summary Closes #2417. Bumps `unstructured-inference` to pull in the fix implemented in https://github.com/Unstructured-IO/unstructured-inference/pull/317	2024-01-19 22:42:32 +00:00
Christine Straub	7378a378f6	enhancement: allow setting image block crop padding parameter (#2415 ) Closes #2320 . ### Summary In certain circumstances, adjusting the image block crop padding can improve image block extraction by preventing extracted image blocks from being clipped. ### Testing - PDF: [LM339-D_2-2.pdf](https://github.com/Unstructured-IO/unstructured/files/13968952/LM339-D_2-2.pdf) - Set two environment variables `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD` (e.g. `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD = 40`, `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD = 20`)
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="LM339-D_2-2.pdf",
    extract_image_block_types=["image"],
)
```
	2024-01-19 06:28:32 +00:00
Ronny H	96fe7dd5e5	Kapa.ai widget installation (#2418 ) To test: > cd docs && make html > click "Ask AI" button on the bottom right-hand corner Changelogs: * Installed kapa.ai widget * fixed sphinx errors in opensearch & elasticsearch documentation	2024-01-18 00:17:11 +00:00

299 Commits