**Summary**
The `.section` field in `ElementMetadata` is dead code, possibly a
remainder from a prior iteration of `partition_epub()`. In any case, it
is not populated by any partitioner. Remove it and any code that uses
it.
**Summary**
This final PR in the "orig_elements" series adds the needful such that
`.metadata.orig_elements`, when present on a chunk (element), is
serialized to JSON when the chunk is serialized, for instance, to be
used in an HTTP response payload.
It also provides for deserializing such a JSON payload into chunks that
contain the `.orig_elements` metadata.
**Additional Context**
Note that `.metadata.orig_elements` is always `Optional[list[Element]]`
when in memory. However, those original elements are serialized as
Base64-encoded gzipped JSON and are in that form (str) when present as
JSON or as "element-dicts" which is an intermediate
serialization/deserialization format. That is, serialization is `Element
-> dict -> JSON` and deserialization is `JSON -> dict -> Element` and
`.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms.
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
Change default values for table extraction - works in pair with
[this](https://github.com/Unstructured-IO/unstructured-api/pull/370)
`unstructured-api` PR
We want to move away from `pdf_infer_table_structure` parameter, in this
PR:
- We change how it's treated wrt `skip_infer_table_types` parameter.
Whether to extract tables from pdf now follows from the rule:
`pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set it to `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]` by default
- We remove it from the examples in documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in
documentation
More detailed description of how we want parameters to interact
- if `pdf_infer_table_structure` is False tables will never extracted
from pdf
- if `pdf_infer_table_structure` is True tables will be extracted from
pdf unless it's skipped via `skip_infer_table_types`
- on default `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Thanks to Pedro at OctoAI we have a new embedding option.
The following PR adds support for the use of OctoAI embeddings.
Forked from the original OpenAI embeddings class. We removed the use of
the LangChain adaptor, and use OpenAI's SDK directly instead.
Also updated out-of-date example script.
Including new test file for OctoAI.
# Testing
Get a token from our platform at: https://www.octoai.cloud/
For testing one can do the following:
```
export OCTOAI_TOKEN=<your octo token>
python3 examples/embed/example_octoai.py
```
## Testing done
Validated running the above script from within a locally built container
via `make docker-start-dev`
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
To test:
> cd docs && make html
Changelog: added an example to use `SaaS API URL` in `partition_via_api`
using `api_url` param.
---------
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
.heic files are an image filetype we have not supported.
#### Testing
```
from unstructured.partition.image import partition_image
png_filename = "example-docs/DA-1p.png"
heic_filename = "example-docs/DA-1p.heic"
png_elements = partition_image(png_filename, strategy="hi_res")
heic_elements = partition_image(heic_filename, strategy="hi_res")
for i in range(len(heic_elements)):
print(heic_elements[i].text == png_elements[i].text)
```
---------
Co-authored-by: christinestraub <christinemstraub@gmail.com>
This PR is the last in a series of PRs for refactoring and fixing the
language parameters (`languages` and `ocr_languages` so we can address
incorrect input by users. See #2293
It is recommended to go though this PR commit-by-commit and note the
commit message. The most significant commit is "update
check_languages..."
To test:
> cd docs && make html
Changelogs:
* Fixed sphinx error due to malformed rst table on partition page
* Updated API Params, ie. `extract_image_block_types` and
`extract_image_block_to_payload`
* Updated image filetype supports
Closes#2320 .
### Summary
In certain circumstances, adjusting the image block crop padding can
improve image block extraction by preventing extracted image blocks from
being clipped.
### Testing
- PDF:
[LM339-D_2-2.pdf](https://github.com/Unstructured-IO/unstructured/files/13968952/LM339-D_2-2.pdf)
- Set two environment variables
`EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and
`EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD`
(e.g. `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD = 40`,
`EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD = 20`
```
elements = partition_pdf(
filename="LM339-D_2-2.pdf",
extract_image_block_types=["image"],
)
```
### Summary
Adds support for bitmap images (`.bmp`) in both file detection and
partitioning. Bitmap images will be processed with `partition_image`
just like JPGs and PNGs.
### Testing
```python
from unstructured.file_utils.filetype import detect_filetype
from unstructured.partition.auto import partition
from PIL import Image
filename = "example-docs/layout-parser-paper-with-table.jpg"
bmp_filename = "~/tmp/ayout-parser-paper-with-table.bmp"
img = Image.open(filename)
img.save(bmp_filename)
detect_filetype(filename=bmp_filename) # Should be FileType.BMP
elements = partition(filename=bmp_filename)
```
To test:
> cd docs && make html
Changelogs:
* Added verbiage about the cap limit and data usage for the Freemium AP
* Added deprecated warning on Staging bricks
* Added warning and code examples to use the SaaS API Endpoints using
CLI-vs-SDKs
* Fixed example page formatting
* Added deprecation warning on ``model_name`` param in favor of
``hi_res_model_name``
* Added ``extract_images_in_pdf`` usage and code example in
``partition_pdf`` section
* Reorganized and improved the documentation Intro section
This PR culminates the restructuring of chunking over my prior
dozen-or-so commits by adding the new options to the API and
documentation.
Separately I'll be adding a new ingest test to defend against
regression, although the integration test included in this PR will do a
pretty good job of that too.
### Summary
We no longer use the "bricks" terminology for partioning functions, etc
in the library. This PR updates various references to bricks within the
repo and the docs. This is just an initial pass to swap the terminology
out, it'll likely be helpful to reorganize the docs a bit as well.
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>