**Summary**
This final PR in the "orig_elements" series adds what is needed so that
`.metadata.orig_elements`, when present on a chunk (element), is
serialized to JSON when the chunk is serialized, for instance, to be
used in an HTTP response payload.
It also provides for deserializing such a JSON payload into chunks that
contain the `.orig_elements` metadata.
**Additional Context**
Note that `.metadata.orig_elements` is always `Optional[list[Element]]`
when in memory. However, those original elements are serialized as
Base64-encoded gzipped JSON and are in that form (str) when present as
JSON or as "element-dicts", an intermediate
serialization/deserialization format. That is, serialization is `Element
-> dict -> JSON` and deserialization is `JSON -> dict -> Element` and
`.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms.
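The round trip described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the helper names `encode_orig_elements` and `decode_orig_elements` are hypothetical, plain dicts stand in for `Element` objects, and the use of the `gzip` module for the compression step is an assumption.

```python
import base64
import gzip
import json


def encode_orig_elements(element_dicts: list[dict]) -> str:
    """Hypothetical helper: element-dicts -> Base64-encoded gzipped JSON (str)."""
    raw = json.dumps(element_dicts).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decode_orig_elements(payload: str) -> list[dict]:
    """Hypothetical helper: reverse of encode_orig_elements()."""
    raw = gzip.decompress(base64.b64decode(payload))
    return json.loads(raw.decode("utf-8"))


# Plain dicts standing in for the original elements of a chunk.
orig = [
    {"type": "Title", "text": "Intro"},
    {"type": "NarrativeText", "text": "Hello."},
]

# In the dict form, `orig_elements` is already the encoded string ...
chunk_dict = {
    "type": "CompositeElement",
    "text": "Intro\n\nHello.",
    "metadata": {"orig_elements": encode_orig_elements(orig)},
}

# ... so the dict -> JSON step needs no special handling for it.
chunk_json = json.dumps(chunk_dict)

# Deserialization: JSON -> dict, then decode the string back into elements.
restored = decode_orig_elements(json.loads(chunk_json)["metadata"]["orig_elements"])
```

Because the encoded value is an ordinary string, it passes through `json.dumps()` and `json.loads()` unchanged, which is why the same form can appear in both the `dict` and `JSON` representations.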
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
Change default values for table extraction. Works in tandem with
[this](https://github.com/Unstructured-IO/unstructured-api/pull/370)
`unstructured-api` PR
We want to move away from the `pdf_infer_table_structure` parameter. In
this PR:
- We change how it's treated wrt `skip_infer_table_types` parameter.
Whether to extract tables from pdf now follows from the rule:
`pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set the defaults to `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]` by default
- We remove it from the examples in documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in
documentation
More detailed description of how we want the parameters to interact:
- if `pdf_infer_table_structure` is False tables will never be extracted
from pdf
- if `pdf_infer_table_structure` is True tables will be extracted from
pdf unless it's skipped via `skip_infer_table_types`
- by default `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`
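The interaction rule above can be expressed as a small predicate. This is only a sketch of the stated rule; the function name `should_extract_pdf_tables` is hypothetical and is not part of the `unstructured` API.

```python
from typing import Optional, Sequence


def should_extract_pdf_tables(
    pdf_infer_table_structure: bool = True,
    skip_infer_table_types: Optional[Sequence[str]] = None,
) -> bool:
    """Hypothetical helper mirroring the rule:
    pdf_infer_table_structure && "pdf" not in skip_infer_table_types
    """
    # Treat a missing skip list as the new default, an empty list.
    skip = [] if skip_infer_table_types is None else skip_infer_table_types
    return pdf_infer_table_structure and "pdf" not in skip


# Defaults: tables are extracted from PDFs.
default_result = should_extract_pdf_tables()

# Deprecated flag set to False: tables are never extracted.
flag_off_result = should_extract_pdf_tables(pdf_infer_table_structure=False)

# Flag left on, but "pdf" is skipped: tables are not extracted.
skipped_result = should_extract_pdf_tables(skip_infer_table_types=["pdf"])
```

Under this rule, `skip_infer_table_types` alone is enough to control PDF table extraction once `pdf_infer_table_structure` is left at its default, which is consistent with deprecating the latter.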
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
This PR removes the `extract_image_block_to_payload` section from the "API
Parameters" page. The "unstructured" API does not support the
`extract_image_block_to_payload` parameter, and it is always set to
`True` internally on the API side when trying to extract image blocks
via the API. Users only need to specify the `extract_image_block_types`
parameter when extracting image blocks via the API.
**NOTE:** The `extract_image_block_to_payload` parameter is only used
when calling the `partition()`, `partition_pdf()`, and `partition_image()`
functions directly.
### Testing
CI should pass.
.heic files are an image filetype we have not previously supported.
#### Testing
```
from unstructured.partition.image import partition_image

png_filename = "example-docs/DA-1p.png"
heic_filename = "example-docs/DA-1p.heic"

png_elements = partition_image(png_filename, strategy="hi_res")
heic_elements = partition_image(heic_filename, strategy="hi_res")

# Each HEIC element's text should match its PNG counterpart.
for i in range(len(heic_elements)):
    print(heic_elements[i].text == png_elements[i].text)
```
---------
Co-authored-by: christinestraub <christinemstraub@gmail.com>
To test:
> cd docs && make html
Changelogs:
* Fixed sphinx error due to malformed rst table on partition page
* Updated API Params, i.e. `extract_image_block_types` and
`extract_image_block_to_payload`
* Updated the list of supported image filetypes
To test:
> cd docs && make html
Change logs:
* Examples are reorganized onto their own page
* Removed two old examples, i.e. "file-utils" & "sentiment analysis".
* Added two examples: "RAG with Unstructured, LangChain, and ChromaDB" &
"Multi-Files Processing with S3 Connector and API"
* Reorganized and added detailed API documentation: (i) usage, (ii)
SDKs, (iii) Azure Marketplace, (iv) AWS Marketplace, (v) parameters and
validation errors