7 Commits

Author SHA1 Message Date
Roman Isecke
4ff6a5b78e
Roman/bugfix support bedrock embeddings (#2650)
### Description
This PR resolved the following open issue:
[bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319).
To do so, the following changes were made:
* All aws configs were added as input parameters to the CLI
* These were mapped to the bedrock embedder when an embedder is
generated via `get_embedder`
* An ingest test was added to call the aws bedrock service
* Requirements for boto were bumped because the first version to
introduce the bedrock runtime, which is required to hit the bedrock
service, was introduced in version `1.34.63`, which was ahead of the
version of boto pinned.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-03-21 18:21:04 +00:00
Steve Canny
94535e353c
rfctr: prepare for adding metadata.orig_elements field (#2647)
**Summary**
Some typing modernization in `elements.py` which will get changes to add
the `orig_elements` metadata field.

Also some additions to `unit_util.py` to enable simplified mocking that
will be required in the next PR.
2024-03-14 21:31:58 +00:00
Pawel Kmiecik
ff9d46f9dc
feat(eval): table evaluation metrics (#2558)
This PR adds new table evaluation metrics prepared by @leah1985 
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows

TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`
2024-02-22 16:35:46 +00:00
Roman Isecke
f8aea71f3a
chore: refactor ingest unit test to run in it's own CI step (#2190)
### Description
Currently the ingest unit tests are running in both the source and
destination ingest steps. This PR moved it out into it's own step that
can run on a more basic runner since it doesn't require multiple
cores/cpus for parallelization.

Further optimizations:
* Added the changelog step as a dependency for the ingest tests to avoid
running them (fail fast) of the pipeline has already failed due to the
changelog not being updated.
* The cache was never actually being saved if it needed to be recreated,
a save job was added at the end of each custom action
2023-11-30 20:57:00 +00:00
John
670687bb67
update .pre-commit-config to match linting used by CI (#1906)
Closes #1905 
.pre-commit-config.yaml does not match pyproject.toml, which causes
unnecessary/undesirable formatting changes. These changes are not
required by CI, so they should not have to be made.

**To Reproduce**
Install pre-commit configuration as described
[here](https://github.com/Unstructured-IO/unstructured#installation-instructions-for-local-development).
Make a commit and something like the following will be logged:
```
check for added large files..............................................Passed
check toml...........................................(no files to check)Skipped
check yaml...........................................(no files to check)Skipped
check json...........................................(no files to check)Skipped
check xml............................................(no files to check)Skipped
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
mixed line ending........................................................Passed
black....................................................................Passed
ruff.....................................................................Failed
- hook id: ruff
- files were modified by this hook
```

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-10-27 13:24:55 -05:00
Ahmet Melek
4b827f0793
fix: local connector output filename when a single file is being processed (#879)
* fix string processing error for _output_filename

* Add docstring and type hint, update CHANGELOG, update version

* update test fixture

* simple code change commit to retrigger ci checks

* update test fixture - after brew install tesseract-lang

* Update ingest test fixtures (#882)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* correct CHANGELOG

* correct CHANGELOG

---------

Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-05 14:37:40 -07:00
Alvaro Bartolome
2979e17aa4
feat: add .pre-commit-config.yaml to let users enable pre-commit hooks (#320)
Per the README, provides an optional `pre-commit` configuration
file to ensure code matches the formatting and linting standards used in `unstructured`.
2023-03-05 20:23:39 +00:00