unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-13 12:05:54 +00:00

Author	SHA1	Message	Date
Ronny H	d80abf0714	Reorganized the Examples section in Documentation & add Databricks example (#1855) To test: > cd docs && make html Change logs: * Examples are reorganized to have their own page * Removed two old examples, i.e., "file-utils" & "sentiment analysis". * Added two examples: "RAG with Unstructured, LangChain, and ChromaDB" & "Multi-Files Processing with S3 Connector and API" * Reorganized and added detailed API documentation: (i) usage, (ii) SDKs, (iii) Azure Marketplace, (iv) AWS Marketplace, (v) parameters and validation errors	2023-11-30 01:24:43 +00:00
Amanda Cameron	f98d5e65ca	chore: adding max_characters to other element type chunking (#1673)
This PR adds the `max_characters` (hard max) param to non-table element chunking. Additionally updates the `num_characters` metadata to `max_characters` to make it clearer which param we're referencing.
To test:
```
from unstructured.partition.html import partition_html

filename = "example-docs/example-10k-1p.html"
chunk_elements = partition_html(
    filename,
    chunking_strategy="by_title",
    combine_text_under_n_chars=0,
    new_after_n_chars=50,
    max_characters=100,
)
for chunk in chunk_elements:
    print(len(chunk.text))
# previously we were only respecting the "soft max" (default of 500) for elements other than tables
# now we should see that all the elements have text fields under 100 chars.
```
---------
Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-09 19:42:36 +00:00
Ronny H	8564d920ac	Update Metadata and Installation Documentation (#1646) * Updated Metadata page: add common and additional metadata fields by document types and connectors * Updated specific installation extras by document types and connectors * Added embedding brick page in Sphinx TOC * Fixed Sphinx warnings in new pages	2023-10-05 01:25:41 +00:00
Steve Canny	b54994ae95	rfctr: docx partitioning (#1422) Reviewers: I recommend reviewing commit-by-commit or just looking at the final version of `partition/docx.py` as View File. This refactor solves a few problems but mostly lays the groundwork to allow us to refine further aspects such as page-break detection, list-item detection, and moving python-docx internals upstream to that library so our work doesn't depend on that domain-knowledge.	2023-09-19 15:32:46 -07:00
Jack Retterer	a35ff890e0	Update docs jack (#1157) Documentation Overhaul - Added documentation hierarchy - Added options for Bash vs Python for API & Upstream Connectors - Added Introduction section (Overview, Key Concepts, Getting Started) - Redid connectors section - Installation is now broken up (needs further work)	2023-08-21 10:27:32 -07:00
Emily Chen	24ebd0fa4e	chore: Move coordinate details from Element model to a metadata model (#827)	2023-07-05 11:25:11 -07:00
Matt Robinson	4ea716837d	feat: add ability to extract extra metadata with regex (#763) * first pass on regex metadata * fix typing for regex metadata * add dataclass back in * add decorators * fix tests * update docs * add tests for regex metadata * add process metadata to tsv * changelog and version * docs typos * consolidate to using a single kwarg * fix test	2023-06-16 10:10:56 -04:00

7 Commits