unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-12-20 19:54:49 +00:00

Author	SHA1	Message	Date
Mallori Harrell	00635744ed	feat: Adds local embedding model (#1619 ) This PR adds a local embedding model option as an alternative to using our OpenAI embedding brick. This brick uses LangChain's HuggingFaceEmbeddings.	2023-10-19 11:51:36 -05:00
Jack Retterer	b8f24ba67e	Added AWS Bedrock embeddings (#1738 ) Summary: Added support for AWS Bedrock embeddings. Leverages "amazon.titan-tg1-large" for the embedding model. Test - find your AWS secret access key and key ID; make sure the account has access to Bedrock's Titan embed model - follow the instructions in `d5e797cd44/docs/source/bricks/embedding.rst (bedrockembeddingencoder)` --------- Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com> Co-authored-by: Yao You <yao@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com> Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>	2023-10-18 19:36:51 -05:00
Amanda Cameron	d0c84d605c	chore: updating table docs with file extensions (#1702 ) gh issue: https://github.com/Unstructured-IO/unstructured/issues/1691 Adding filetype extensions from this [list](`f98d5e65ca/unstructured/file_utils/filetype.py (L154-L200)`) where applicable. --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Crag Wolfe <crag@unstructuredai.io>	2023-10-14 14:14:52 -07:00
Ronny H	8564d920ac	Update Metadata and Installation Documentation (#1646 ) * Updated Metadata page: add common and additional metadata fields by document types and connectors * Updated specific installation extra by document types and connectors * Added embedding brick page in Sphinx TOC * Fixed Sphinx warnings in new pages	2023-10-05 01:25:41 +00:00
Ronny H	868cac5bd5	Fixed Sphinx warning errors (#1438 ) Fixed issue #1437 - resolved the Warning errors when building sphinx with `make html`. test: 1. `cd docs` folder and `rm -rf build` 2. `pip install -r requirements.txt` 3. run `make html`	2023-09-26 04:20:16 +00:00
Ahmet Melek	9e88929a8c	feat: document embeddings (#1368 ) Closes https://github.com/Unstructured-IO/unstructured/issues/1319, closes https://github.com/Unstructured-IO/unstructured/issues/1372 This module: - implements EmbeddingEncoder classes which track embedding related data - implements embed_documents method which receives a list of Elements, obtains embeddings for the text within Elements, updates the Elements with an attribute named embeddings , and returns the updated Elements - the module uses langchain to obtain the embeddings ----- - The PR additionally fixes a JSON de-serialization issue on the metadata fields. To test the changes, run `examples/embed/example.py`	2023-09-20 19:55:30 +00:00
Jack Retterer	95b6295307	Jack/update documentation (#1190 ) Updated: - Added back support document types for partitioning - Added more tabs for python code in the API page - Added a RAG section in Key Concepts - Added a Common Use case section in overview	2023-09-04 16:15:50 +00:00
Matt Robinson	c49df62967	feat: `partition_xml` infers element type on each leaf node (#1249 ) ### Summary Closes #1229. Updates `partition_xml` so that the element type is inferred on each leaf node when `xml_keep_tags=False` instead of delegating splitting and partitioning to `partition_text`. If `xml_keep_tags=True`, the file is still treated like a text file and partitioning is still delegated to `partition_text`. Also adds the option to pass `text` as an input to `partition_xml`. ### Testing Create a `parrots.xml` file that looks like: ```xml <xml><parrot><name>Conure</name><description>A conure is a very friendly bird. Conures are feathery and like to dance.</description></parrot></xml> ``` Run: ```python from unstructured.partition.xml import partition_xml from unstructured.staging.base import convert_to_dict elements = partition_xml(filename="parrots.xml") convert_to_dict(elements) ``` On `main`, the output is the following. Notice how the `<name>` tag incorrectly gets merged into `<description>` in the first element. ```python [{'element_id': '7ae4074435df8dfcefcf24a4e6c52026', 'metadata': {'file_directory': '/home/matt/tmp', 'filename': 'parrots.xml', 'filetype': 'application/xml', 'last_modified': '2023-08-30T14:21:38'}, 'text': 'Conure A conure is a very friendly bird.', 'type': 'NarrativeText'}, {'element_id': '859ecb332da6961acd2fb6a0185d1549', 'metadata': {'file_directory': '/home/matt/tmp', 'filename': 'parrots.xml', 'filetype': 'application/xml', 'last_modified': '2023-08-30T14:21:38'}, 'text': 'Conures are feathery and like to dance.', 'type': 'NarrativeText'}] ``` On the feature branch, the output is the following, and the tags are correctly separated. 
```python [{'element_id': '5512218914e4eeacf71a9cd42c373710', 'metadata': {'file_directory': '/home/matt/tmp', 'filename': 'parrots.xml', 'filetype': 'application/xml', 'last_modified': '2023-08-30T14:21:38'}, 'text': 'Conure', 'type': 'Title'}, {'element_id': '113bf8d250c2b1a77c9c2caa4b812f85', 'metadata': {'file_directory': '/home/matt/tmp', 'filename': 'parrots.xml', 'filetype': 'application/xml', 'last_modified': '2023-08-30T14:21:38'}, 'text': 'A conure is a very friendly bird.\n' '\n' 'Conures are feathery and like to dance.', 'type': 'NarrativeText'}] ```	2023-08-30 17:07:10 -04:00
Matt Robinson	f6a745a74f	feat: chunk elements based on titles (#1222 ) ### Summary An initial pass on smart chunking for RAG applications. Breaks a document into sections based on the presence of `Title` elements. Also starts a new section under the following conditions: - If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to True. - If the length of the section exceeds `new_after_n_chars` characters. The default is `1500`. The chunking function does not split individual elements, so it's possible for a section to exceed that threshold if an individual element is over `new_after_n_chars` characters, which could occur with a long `NarrativeText` element. - Sections under `combine_under_n_chars` characters are combined. The default is `500`. ### Testing ```python from unstructured.partition.html import partition_html from unstructured.chunking.title import chunk_by_title url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0" elements = partition_html(url=url) chunks = chunk_by_title(elements) for chunk in chunks: print(chunk) print("\n\n" + "-"*80) input() ```	2023-08-29 16:04:57 +00:00
Matt Robinson	07f76275f1	feat: detect PGP encrypted content in `partition_email` and `partition_msg` (#1205 ) ### Summary Closes #1018. Enables `partition_email` and `partition_msg` to detect if an email has PGP encrypted content. Based on the specification in [RFC 2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based on the example email in the spec. If PGP encrypted content is detected, a warning is emitted and an empty list of elements is returned. ### Testing ```python from unstructured.partition.email import partition_email filename = "example-docs/eml/fake-encrypted.eml" partition_email(filename=filename) ``` ```python from unstructured.partition.msg import partition_msg filename = "example-docs/fake-encrypted.msg" partition_msg(filename=filename) ```	2023-08-25 17:09:25 -07:00
Matt Robinson	cdae53cc29	chore: deprecation warning for `file_filename` (#1191 ) ### Summary Closes #1007. Adds a deprecation warning for the `file_filename` kwarg to `partition`, `partition_via_api`, and `partition_multiple_via_api`. Also catches a warning in `ebooklib` that we do not want to emit in `unstructured`. ### Testing ```python from unstructured.partition.auto import partition filename = "example-docs/winter-sports.epub" # Should not emit a warning with open(filename, "rb") as f: elements = partition(file=f, metadata_filename="test.epub") # Should be test.epub elements[0].metadata.filename # Should emit a warning with open(filename, "rb") as f: elements = partition(file=f, file_filename="test.epub") # Should be test.epub elements[0].metadata.filename # Should raise an error with open(filename, "rb") as f: elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub") ```	2023-08-24 07:02:47 +00:00
Jack Retterer	a35ff890e0	Update docs jack (#1157 ) Documentation Overhaul - Added documentation hierarchy - Added options for Bash vs Python for API & Upstream Connectors - Added Introduction section (Overview, Key Concepts, Getting Started) - Redid connectors section - Installation is now broken up (needs further work)	2023-08-21 10:27:32 -07:00

12 Commits
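
The title-based chunking rules described in commit f6a745a74f (#1222) can be illustrated with a small standalone sketch. This is not the library's implementation: the `Element` dataclass and the helpers `split_into_sections` and `combine_small_sections` are hypothetical names introduced here, the metadata-change rule is omitted, and the merge policy (join a section to the previous one when both are under the threshold) is an assumption about what "combined" means.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element (not the library class)."""
    category: str  # e.g. "Title" or "NarrativeText"
    text: str


def section_length(section: List[Element]) -> int:
    """Total number of characters of text in a section."""
    return sum(len(el.text) for el in section)


def split_into_sections(
    elements: List[Element], new_after_n_chars: int = 1500
) -> List[List[Element]]:
    """Start a new section at each Title, or once the current section
    already exceeds new_after_n_chars. Elements are never split, so a
    section can end up longer than the threshold."""
    sections: List[List[Element]] = []
    current: List[Element] = []
    for el in elements:
        too_long = section_length(current) > new_after_n_chars
        if current and (el.category == "Title" or too_long):
            sections.append(current)
            current = []
        current.append(el)
    if current:
        sections.append(current)
    return sections


def combine_small_sections(
    sections: List[List[Element]], combine_under_n_chars: int = 500
) -> List[List[Element]]:
    """Assumed policy: merge a section into the previous one when both
    are shorter than combine_under_n_chars."""
    combined: List[List[Element]] = []
    for section in sections:
        if (
            combined
            and section_length(combined[-1]) < combine_under_n_chars
            and section_length(section) < combine_under_n_chars
        ):
            combined[-1] = combined[-1] + section
        else:
            combined.append(section)
    return combined
```

Calling `combine_small_sections(split_into_sections(elements))` applies both rules in order. Splitting first and merging second keeps each rule easy to reason about; the real `chunk_by_title` may interleave them differently.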