unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-06-27 02:30:08 +00:00

Author	SHA1	Message	Date
Roman Isecke	4ff6a5b78e	Roman/bugfix support bedrock embeddings (#2650) ### Description This PR resolves the following open issue: [bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319). To do so, the following changes were made: * All aws configs were added as input parameters to the CLI * These were mapped to the bedrock embedder when an embedder is generated via `get_embedder` * An ingest test was added to call the aws bedrock service * Requirements for boto were bumped because the bedrock runtime, which is required to hit the bedrock service, was first introduced in version `1.34.63`, which was ahead of the pinned version of boto. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-03-21 18:21:04 +00:00
Steve Canny	94535e353c	rfctr: prepare for adding metadata.orig_elements field (#2647) Summary: Some typing modernization in `elements.py`, which will be changed to add the `orig_elements` metadata field. Also some additions to `unit_util.py` to enable simplified mocking that will be required in the next PR.	2024-03-14 21:31:58 +00:00
Pawel Kmiecik	ff9d46f9dc	feat(eval): table evaluation metrics (#2558) This PR adds new table evaluation metrics prepared by @leah1985. The metrics include: - `table count` (check) - `table_level_acc` - accuracy of table detection - `element_col_level_index_acc` - accuracy of cell detection in columns - `element_row_level_index_acc` - accuracy of cell detection in rows - `element_col_level_content_acc` - accuracy of content detected in columns - `element_row_level_content_acc` - accuracy of content detected in rows. TODO in next steps: - create a minimal dataset and upload to s3 for ingest tests - generate and add metrics on the above dataset to `test_unstructured_ingest/metrics`	2024-02-22 16:35:46 +00:00
Roman Isecke	f8aea71f3a	chore: refactor ingest unit test to run in its own CI step (#2190) ### Description Currently the ingest unit tests are running in both the source and destination ingest steps. This PR moves them out into their own step, which can run on a more basic runner since it doesn't require multiple cores/cpus for parallelization. Further optimizations: * Added the changelog step as a dependency for the ingest tests to avoid running them (fail fast) if the pipeline has already failed due to the changelog not being updated. * The cache was never actually being saved if it needed to be recreated; a save job was added at the end of each custom action.	2023-11-30 20:57:00 +00:00
John	670687bb67	update .pre-commit-config to match linting used by CI (#1906) Closes #1905. `.pre-commit-config.yaml` does not match `pyproject.toml`, which causes unnecessary/undesirable formatting changes. These changes are not required by CI, so they should not have to be made. To Reproduce: Install pre-commit configuration as described [here](https://github.com/Unstructured-IO/unstructured#installation-instructions-for-local-development). Make a commit and something like the following will be logged: ``` check for added large files..............................................Passed check toml...........................................(no files to check)Skipped check yaml...........................................(no files to check)Skipped check json...........................................(no files to check)Skipped check xml............................................(no files to check)Skipped fix end of files.........................................................Passed trim trailing whitespace.................................................Passed mixed line ending........................................................Passed black....................................................................Passed ruff.....................................................................Failed - hook id: ruff - files were modified by this hook ``` --------- Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-10-27 13:24:55 -05:00
Ahmet Melek	4b827f0793	fix: local connector output filename when a single file is being processed (#879) * fix string processing error for _output_filename * Add docstring and type hint, update CHANGELOG, update version * update test fixture * simple code change commit to retrigger ci checks * update test fixture - after brew install tesseract-lang * Update ingest test fixtures (#882) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * correct CHANGELOG * correct CHANGELOG --------- Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-07-05 14:37:40 -07:00
Alvaro Bartolome	2979e17aa4	feat: add `.pre-commit-config.yaml` to let users enable `pre-commit` hooks (#320) Per the README, provides an optional `pre-commit` configuration file to ensure code matches the formatting and linting standards used in `unstructured`.	2023-03-05 20:23:39 +00:00

7 Commits