Change the unstructured-client pin to set a minimum version instead of a
maximum version, and run `make pip-compile`.
Integration tests that were dependent on the old version of the client
are removed. These tests should be replicated in/moved to the SDK
repo(s).
In two places, the Ingest workflow and the `partition_via_api` function,
parameters with list values are converted to strings before being passed
to the Python client, e.g.
```python
extract_image_block_types=["image"] -> extract_image_block_types='["image"]'
```
As of now, these parameters are parsed incorrectly when given as strings
and correctly when given as lists.
This PR removes parsing from `PartitionConfig` and `partition_via_api`.
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
### Summary
A `partition_via_api` test that only runs on `main` was
[failing](https://github.com/Unstructured-IO/unstructured/actions/runs/9159429513/job/25181600959)
with the following output, likely due to the change in the default
behavior for `skip_infer_table_types`. This PR explicitly sets the
`skip_infer_table_types` param to avoid the failure.
```
=========================== short test summary info ============================
FAILED test_unstructured/partition/test_api.py::test_partition_via_api_with_no_strategy - AssertionError: assert 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' != 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®'
+ where 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb9069fc610>.text
+ and 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb90648ad90>.text
= 1 failed, 2299 passed, 9 skipped, 2 deselected, 2 xfailed, 9 xpassed, 14 warnings in 1241.64s (0:20:41) =
make: *** [Makefile:302: test] Error 1
```
### Testing
After temporarily removing the "skip if not on `main`" `pytest` mark,
the [unit tests
pass](https://github.com/Unstructured-IO/unstructured/actions/runs/9163268381/job/25192040902?pr=3057O)
on the feature branch.
The purpose of this PR is to support using the same type of parameters
as `partition_*()` when using `partition_via_api()`. This PR works
together with `unstructured-api` [PR
#368](https://github.com/Unstructured-IO/unstructured-api/pull/368).
**Note:** This PR adds support for extracting image blocks ("Image",
"Table") via `partition_via_api()`.
### Summary
- update `partition_via_api()` to convert all list type parameters to
JSON formatted strings before passing them to the unstructured client
SDK
- add a unit test function to test extracting image blocks via
`partition_via_api()`
- add a unit test function to test list type parameters passed to API
via unstructured client sdk
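The list-to-string conversion described in the summary above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `serialize_list_params` is hypothetical, and it assumes the conversion is a plain `json.dumps` on each list-valued parameter.

```python
import json
from typing import Any, Dict


def serialize_list_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of params with every list value JSON-encoded as a string.

    Hypothetical helper for illustration; non-list values pass through unchanged.
    """
    return {
        key: json.dumps(value) if isinstance(value, list) else value
        for key, value in params.items()
    }


request_params = serialize_list_params(
    {"strategy": "hi_res", "extract_image_block_types": ["image", "table"]}
)
print(request_params)
```

Only values that are actual `list` instances are touched, so scalar parameters such as `strategy` reach the client as they were given.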
### Testing
```
from unstructured.partition.api import partition_via_api
elements = partition_via_api(
filename="example-docs/embedded-images-tables.pdf",
api_key="YOUR-API-KEY",
strategy="hi_res",
extract_image_block_types=["image", "table"],
)
image_block_elements = [el for el in elements if el.category == "Image" or el.category == "Table"]
print("\n\n".join([el.metadata.image_mime_type for el in image_block_elements]))
print("\n\n".join([el.metadata.image_base64 for el in image_block_elements]))
```
Closes #2340
We need to make sure the custom url is passed to our client. The client
constructor takes the base url, so for compatibility we can continue to
take the full url and strip off the path.
To verify, run the api locally and confirm you can make calls to it.
```
# In unstructured-api
make run-web-app
# In ipython in this repo
from unstructured.partition.api import partition_via_api
filename = "example-docs/layout-parser-paper.pdf"
partition_via_api(filename=filename, api_url="http://localhost:8000")
```
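Stripping the path from a full URL to get the base URL for the client constructor could look something like the sketch below. The function name `base_url_from` and the example hostname are assumptions made for illustration; the PR may implement this differently.

```python
from urllib.parse import urlsplit, urlunsplit


def base_url_from(api_url: str) -> str:
    """Keep only the scheme and host (with port) of api_url.

    Hypothetical helper: drops the path, query string, and fragment.
    """
    parts = urlsplit(api_url)
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))


# A full endpoint URL and a bare base URL both reduce to the base URL.
print(base_url_from("https://api.example.com/general/v0/general"))
print(base_url_from("http://localhost:8000"))
```

Accepting either form keeps existing callers that pass the full endpoint URL working while still giving the client the base URL it expects.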
Follow-up PR to
[https://github.com/Unstructured-IO/unstructured/pull/2195](https://github.com/Unstructured-IO/unstructured/pull/2195).
Removes unnecessary calls to `get_api_key()`. That helper function is
supposed to be used only by tests decorated with
`@pytest.mark.skipif(skip_outside_ci, reason="Skipping test run outside of CI")`
(which are skipped because those tests partition pdf/jpg files).
These tests partition emails and rely on the `MockResponse` at the top
of the file, so they don't need to call `get_api_key()` and it can
simply be removed from them.
### Summary
Closes #2033
Updates `partition_via_api` to use `UnstructuredClient` for api calls
instead of `requests`.
Updates associated tests.
Note: This PR does **not** update `partition_multiple_via_api` as
documentation in `unstructured-python-client` indicates it does not
support multiple files. A new issue should be opened to add that
functionality to `unstructured-python-client`.
---------
Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Summary
This should fix the broken unit test on main CI
* change the strategy in
`test_partition_multiple_via_api_valid_request_data_kwargs` from `fast`
to `auto`, since the test was using `fast` for images, and we don't
support it.
### Summary
Closes unstructured-api issue
[188](https://github.com/Unstructured-IO/unstructured-api/issues/188)
The test and gist were using different versions of the same file
(jpg/pdf), creating what looked like a bug when there wasn't one. The
api is correctly using the `strategy` kwarg.
### Testing
#### Checkout to `main`
- Comment out the `@pytest.mark.skip` decorators for the
`test_partition_via_api_with_no_strategy` test
- Add an API key to your env:
  - Add `from dotenv import load_dotenv; load_dotenv()` to the top of the
    file and have `UNS_API_KEY` defined in `.env`
- Run `pytest test_unstructured/partition/test_api.py -k
"test_partition_via_api_with_no_strategy"`
The test will fail.
#### Checkout to this branch
- (make the same changes as above)
- Run `pytest test_unstructured/partition/test_api.py -k
"test_partition_via_api_with_no_strategy"`
### Other
`make tidy` and `make check` made linting changes to additional files
### Summary
Closes #1007. Adds a deprecation warning for the `file_filename` kwarg
to `partition`, `partition_via_api`, and `partition_multiple_via_api`.
Also catches a warning in `ebooklib` that we do not want to emit in
`unstructured`.
### Testing
```python
from unstructured.partition.auto import partition
filename = "example-docs/winter-sports.epub"
# Should not emit a warning
with open(filename, "rb") as f:
elements = partition(file=f, metadata_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename
# Should emit a warning
with open(filename, "rb") as f:
elements = partition(file=f, file_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename
# Should raise an error
with open(filename, "rb") as f:
elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub")
```
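The three cases exercised above (no warning, warning, error) can be sketched as a small resolver. This is an illustration only: the helper name `resolve_metadata_filename`, the use of `DeprecationWarning`, and the choice of `ValueError` are assumptions, not details confirmed by the PR.

```python
import warnings
from typing import Optional


def resolve_metadata_filename(
    metadata_filename: Optional[str] = None,
    file_filename: Optional[str] = None,
) -> Optional[str]:
    """Pick the filename to record in metadata, honoring the deprecated kwarg.

    Hypothetical helper; warning category and exception type are assumptions.
    """
    if metadata_filename and file_filename:
        # Both kwargs given: ambiguous, so refuse rather than guess.
        raise ValueError(
            "Only one of metadata_filename and file_filename may be specified."
        )
    if file_filename:
        # Deprecated kwarg given on its own: still honored, but warn.
        warnings.warn(
            "The file_filename kwarg is deprecated; use metadata_filename instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return file_filename
    return metadata_filename


name = resolve_metadata_filename(metadata_filename="test.epub")
print(name)
```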
The reason this test is failing is the API is returning "fast" results
when "hi_res" is requested, which is being tracked in this ticket:
https://github.com/Unstructured-IO/unstructured-api/issues/188 .
This failure was only showing up on the `main` branch, per the commented
out `pytest` skips.
* remove default strategy
* working on test
* fixed test, coordinates param needed to be included
* nits
* update changelog
* lint
* update requirements
* add include_metadata kwarg and tests to parsers
add exclude_metadata to docx
add test for doc to exclude metadata
add include_metadata kwarg to email
add include_metadata kwarg to epub
add include_metadata kwarg to json
add exclude_metadata tests to md
add include_metadata kwarg and tests for msg parse
add include_metadata kwarg and tests for odt parse
add include_metadata kwarg and tests for org parse
add include_metadata kwarg and tests for ppt and pptx parse
add include_metadata kwarg and tests for rst parse
add include_metadata kwarg and tests for rtf parse
add include_metadata tests for text parse
add include_metadata tests for tsv parse
add include_metadata tests for xlsx parse
add include_metadata tests for xml parse
* WIP add include_metadata to partition_pdf
* add include_metadata tests to partition_pdf
* make tidy/check
* update changelog and version
* change test asserts and move docstring logic to process_metadata
* make tidy
* fix tests asserts
* linting, linting, linting
* sync versions
* skip api call test not on main
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Avoid setting metadata in constructor signature for elements because that can lead to unexpected object reuse (and modification).
Bonus refactor for PageBreak to have text values of "".
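The object-reuse problem mentioned above is the usual Python pitfall of a mutable default argument: the default is created once, when the function is defined, and then shared by every call. The sketch below uses a plain `dict` as a stand-in for the metadata object, and both class names are hypothetical.

```python
from typing import Optional


class SharedDefaultElement:
    """Problematic pattern: the default dict is created once and shared."""

    def __init__(self, text: str, metadata: dict = {}):  # noqa: B006
        self.text = text
        self.metadata = metadata


class SafeElement:
    """Safer pattern: default to None and build a fresh object per instance."""

    def __init__(self, text: str, metadata: Optional[dict] = None):
        self.text = text
        self.metadata = metadata if metadata is not None else {}


a, b = SharedDefaultElement("a"), SharedDefaultElement("b")
a.metadata["filename"] = "a.txt"
print(b.metadata)  # b sees a's modification because both share one dict

c, d = SafeElement("c"), SafeElement("d")
c.metadata["filename"] = "c.txt"
print(d.metadata)  # d keeps its own, still-empty dict
```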
---------
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
* Adds functionality to extract charset info from eml files
* Adds missed file-like object handling in detect_file_encoding
* Adds functionality to replace the MIME encodings for eml files with one of the
common encodings if a unicode error occurs
* Organize the eml example files in the example-docs/eml directory
* added function for multiple files via api
* make multiple work with files
* updated docs strings
* changelog and version
* docs and contextlib for open files
* tests for partition multiple
* add tests for error conditions
* add output example
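For the eml charset handling listed earlier (reading the declared charset and falling back to a common encoding when a unicode error occurs), a rough sketch using only the standard library `email` package might look like this. The function name `decode_eml_body` and the particular fallback list are assumptions; the actual change may use different encodings and a different code path.

```python
from email import message_from_bytes

# Assumed fallback order for illustration; the real list may differ.
COMMON_ENCODINGS = ("utf-8", "latin-1")


def decode_eml_body(raw: bytes) -> str:
    """Decode a single-part eml body, preferring its declared charset."""
    msg = message_from_bytes(raw)
    payload = msg.get_payload(decode=True) or b""
    declared = msg.get_content_charset()
    candidates = ([declared] if declared else []) + list(COMMON_ENCODINGS)
    for encoding in candidates:
        try:
            return payload.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    return payload.decode("utf-8", errors="replace")


# The header claims us-ascii, but the body contains a latin-1 byte (0xE9).
raw_eml = b'Content-Type: text/plain; charset="us-ascii"\n\ncaf\xe9\n'
print(decode_eml_body(raw_eml))
```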