### Description
This PR resolved the following open issue:
[bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319).
To do so, the following changes were made:
* All aws configs were added as input parameters to the CLI
* These were mapped to the bedrock embedder when an embedder is
generated via `get_embedder`
* An ingest test was added to call the aws bedrock service
* Requirements for boto were bumped because the first version to
introduce the bedrock runtime, which is required to hit the bedrock
service, was introduced in version `1.34.63`, which was ahead of the
version of boto pinned.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
**Summary**
Some typing modernization in `elements.py` which will get changes to add
the `orig_elements` metadata field.
Also some additions to `unit_util.py` to enable simplified mocking that
will be required in the next PR.
This PR adds new table evaluation metrics prepared by @leah1985
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows
TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`
### Description
Currently the ingest unit tests are running in both the source and
destination ingest steps. This PR moved it out into it's own step that
can run on a more basic runner since it doesn't require multiple
cores/cpus for parallelization.
Further optimizations:
* Added the changelog step as a dependency for the ingest tests to avoid
running them (fail fast) of the pipeline has already failed due to the
changelog not being updated.
* The cache was never actually being saved if it needed to be recreated,
a save job was added at the end of each custom action
Closes#1905
.pre-commit-config.yaml does not match pyproject.toml, which causes
unnecessary/undesirable formatting changes. These changes are not
required by CI, so they should not have to be made.
**To Reproduce**
Install pre-commit configuration as described
[here](https://github.com/Unstructured-IO/unstructured#installation-instructions-for-local-development).
Make a commit and something like the following will be logged:
```
check for added large files..............................................Passed
check toml...........................................(no files to check)Skipped
check yaml...........................................(no files to check)Skipped
check json...........................................(no files to check)Skipped
check xml............................................(no files to check)Skipped
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
mixed line ending........................................................Passed
black....................................................................Passed
ruff.....................................................................Failed
- hook id: ruff
- files were modified by this hook
```
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
Per the README, provides an optional `pre-commit` configuration
file to ensure code matches the formatting and linting standards used in `unstructured`.