12 Commits

Author SHA1 Message Date
Steve Canny
56fbaaed10
feat(chunking): add metadata.orig_elements serde (#2680)
**Summary**
This final PR in the "orig_elements" series adds the needful such that
`.metadata.orig_elements`, when present on a chunk (element), is
serialized to JSON when the chunk is serialized, for instance, to be
used in an HTTP response payload.

It also provides for deserializing such a JSON payload into chunks that
contain the `.orig_elements` metadata.

**Additional Context**
Note that `.metadata.orig_elements` is always `Optional[list[Element]]`
when in memory. However, those original elements are serialized as
Base64-encoded gzipped JSON and are in that form (str) when present as
JSON or as "element-dicts" which is an intermediate
serialization/deserialization format. That is, serialization is `Element
-> dict -> JSON` and deserialization is `JSON -> dict -> Element` and
`.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms.

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-03-22 21:53:26 +00:00
Filip Knefel
bdfd975115
chore: change table extraction defaults (#2588)
Change default values for table extraction - works in pair with
[this](https://github.com/Unstructured-IO/unstructured-api/pull/370)
`unstructured-api` PR

We want to move away from `pdf_infer_table_structure` parameter, in this
PR:
- We change how it's treated wrt `skip_infer_table_types` parameter.
Whether to extract tables from pdf now follows from the rule:
`pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set it to `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]` by default
- We remove it from the examples in documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in
documentation

More detailed description of how we want parameters to interact
- if `pdf_infer_table_structure` is False tables will never extracted
from pdf
- if `pdf_infer_table_structure` is True tables will be extracted from
pdf unless it's skipped via `skip_infer_table_types`
- on default `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-22 10:08:49 +00:00
Mason Brothers
ea67be5665
Doc: Change Python comment string to JavaScript comment string. (#2596)
JavaScript uses `//` for comments instead of `#`

Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-15 09:56:35 -05:00
Christine Straub
5cb6504d5a
docs: update image block extraction docs (#2578)
This PR removes `extract_image_block_to_payload` section from "API
Parameters" page. The "unstructured" API does not support the
`extract_image_block_to_payload` parameter, and it is always set to
`True` internally on the API side when trying to extract image blocks
via the API. Users only need to specify `extract_image_block_types`
parameter when extracting image blocks via the API.

**NOTE:** The `extract_image_block_to_payload` parameter is only used
when calling `partition()`, `partition_pdf()`, and `partition_image()`
functions directly.

### Testing
CI should pass.
2024-02-24 04:36:58 +00:00
Austin Walker
6d17b9a7e4
Fix a parameter name in the js-client example usage (#2560)
`files` should be `fileName` as noted in
https://github.com/Unstructured-IO/unstructured-js-client/issues/24

Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-02-18 17:28:15 +00:00
Ronny H
67a8fd9809
Added example to use SaaS API URL in partition_via_api (#2512)
To test:
> cd docs && make html

Changelog: added an example to use `SaaS API URL` in `partition_via_api`
using `api_url` param.

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2024-02-07 19:25:42 +00:00
John
db67805ec6
feat: add support for partitioning .heic files (#2454)
.heic files are an image filetype we have not supported.

#### Testing
```
from unstructured.partition.image import partition_image

png_filename = "example-docs/DA-1p.png"
heic_filename = "example-docs/DA-1p.heic"

png_elements = partition_image(png_filename, strategy="hi_res")
heic_elements = partition_image(heic_filename, strategy="hi_res")

for i in range(len(heic_elements)):
	print(heic_elements[i].text == png_elements[i].text)
```

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-01-30 04:49:00 +00:00
Ronny H
4c772f6ed7
Updated docs on API Params and Filetype Supports (#2433)
To test:
> cd docs && make html

Changelogs:
* Fixed sphinx error due to malformed rst table on partition page
* Updated API Params, ie. `extract_image_block_types` and
`extract_image_block_to_payload`
* Updated image filetype supports
2024-01-19 16:07:57 -08:00
Ronny H
8e6bc10ba1
Docs various updates (#2386)
To test:
> cd docs && make html 

Changelogs:
* Added verbiage about the cap limit and data usage for the Freemium AP
* Added deprecated warning on Staging bricks
* Added warning and code examples to use the SaaS API Endpoints using
CLI-vs-SDKs
* Fixed example page formatting
* Added deprecation warning on ``model_name`` param in favor of
``hi_res_model_name``
* Added ``extract_images_in_pdf`` usage and code example in
``partition_pdf`` section
* Reorganized and improved the documentation Intro section
2024-01-17 21:01:01 +00:00
Ronny H
8e2bfcab18
Unstructured SaaS API subscription guide (#2341)
To test:
> cd docs && make html

Sections:
- New User sign-up: (i) registration form, (ii) payment processing, and
(iii) use API key & URL
- API Account maintenance: (i) update billing, (ii) opt-in email, (iii)
rotate API key, and (iv) cancel plan
- Get Supports
2024-01-03 14:38:03 -08:00
Ronny H
ac380ce989
Added AWS Marketplace docs and improved Azure Marketplace docs (#2248)
To test:
> cd docs && make HTML

Change logs:
- Added AWS Marketplace documentation
- Improved Azure Marketplace documentation - Networking section
2023-12-20 20:13:47 +00:00
Ronny H
d80abf0714
Reorganized the Examples section in Documentation & add Databricks example (#1855)
To test:
> cd docs && make html

Change logs:
* Examples are reorganized to have its own page
* Removed two old examples, ie. "file-utils" & "sentiment analysis".
* Added two examples: "RAG with Unstructured, LangChain, and ChromaDB" &
"Multi-Files Processing with S3 Connector and API"
* Reorganized and added detailed API documentation: (i) usage, (ii)
SDKs, (iii) Azure Marketplace, (iv) AWS Marketplace, (v) parameters and
validation errors
2023-11-30 01:24:43 +00:00