To test:
> cd docs && make html
Change logs:
* Examples are reorganized to have their own page
* Removed two old examples, i.e., "file-utils" & "sentiment analysis".
* Added two examples: "RAG with Unstructured, LangChain, and ChromaDB" &
"Multi-Files Processing with S3 Connector and API"
* Reorganized and added detailed API documentation: (i) usage, (ii)
SDKs, (iii) Azure Marketplace, (iv) AWS Marketplace, (v) parameters and
validation errors
This PR adds the `max_characters` (hard max) param to non-table element
chunking. It also updates the `num_characters` metadata to
`max_characters` to make it clearer which param is being referenced.
To test:
```
from unstructured.partition.html import partition_html

filename = "example-docs/example-10k-1p.html"
chunk_elements = partition_html(
    filename,
    chunking_strategy="by_title",
    combine_text_under_n_chars=0,
    new_after_n_chars=50,
    max_characters=100,
)

for chunk in chunk_elements:
    print(len(chunk.text))

# Previously we were only respecting the "soft max" (default of 500) for elements other than tables.
# Now we should see that all the elements have text fields under 100 chars.
```
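
For reviewers who want the soft-max vs. hard-max distinction spelled out, here is a minimal standalone sketch. It is not the code in this PR and does not call `unstructured`; `chunk_sections` is a hypothetical helper, and the assumption that a chunk is closed once it reaches `new_after_n_chars` but is never allowed to exceed `max_characters` is only my reading of the two params.

```python
from typing import List


def chunk_sections(
    sections: List[str], new_after_n_chars: int, max_characters: int
) -> List[str]:
    """Hypothetical illustration of soft-max vs. hard-max chunking.

    NOT the unstructured implementation; names and behavior are assumptions.
    """
    if not 0 < new_after_n_chars <= max_characters:
        raise ValueError("expected 0 < new_after_n_chars <= max_characters")

    chunks: List[str] = []
    current = ""
    for section in sections:
        # Hard max: cut any oversized section into pieces of at most max_characters.
        pieces = [
            section[i : i + max_characters]
            for i in range(0, len(section), max_characters)
        ]
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) > max_characters:
                # Adding this piece would break the hard max, so close the chunk first.
                chunks.append(current)
                current = piece
            else:
                current = candidate
            # Soft max: once the chunk is "long enough", start a new one.
            if len(current) >= new_after_n_chars:
                chunks.append(current)
                current = ""
    if current:
        chunks.append(current)
    return chunks


# Made-up input: one short section, one oversized section, one short section.
sections = ["a" * 30, "b" * 250, "c" * 10]
chunks = chunk_sections(sections, new_after_n_chars=50, max_characters=100)
for chunk in chunks:
    print(len(chunk))
```

With the hard max in place, no printed length should exceed `max_characters`, which is the same property the test snippet above checks against the real partitioner.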
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
* Updated Metadata page: added common and additional metadata fields by
document type and connector
* Updated the specific installation extras by document type and connector
* Added embedding brick page in Sphinx TOC
* Fixed Sphinx warnings in new pages
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.
This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain knowledge.
Documentation Overhaul
- Added documentation hierarchy
- Added options for Bash vs Python for API & Upstream Connectors
- Added Introduction section (Overview, Key Concepts, Getting Started)
- Redid connectors section
- Installation is now broken up (needs further work)
* first pass on regex metadata
* fix typing for regex metadata
* add dataclass back in
* add decorators
* fix tests
* update docs
* add tests for regex metadata
* add process metadata to tsv
* changelog and version
* docs typos
* consolidate to using a single kwarg
* fix test