- Adds a destination connector to upload processed output into a PostgreSQL/SQLite database instance.
- Users are responsible for providing their own instances. This PR includes a couple of configuration examples.
- Defines the scripts required to set up a PostgreSQL instance with the unstructured elements schema.
- Validates Postgres/pgvector embedding storage and retrieval.
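As a rough illustration (not the connector's actual code), writing an element with its embedding into a Postgres/pgvector table could look like the sketch below; the table layout, vector size, and connection string are assumptions, and the `vector` extension is assumed to be installed already.

```python
import json
import uuid

import psycopg2  # assumes a reachable Postgres instance with pgvector enabled

conn = psycopg2.connect("postgresql://user:password@localhost:5432/elements")
with conn, conn.cursor() as cur:
    # illustrative table; the real schema comes from the setup scripts in this PR
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS elements (
            id UUID PRIMARY KEY,
            text TEXT,
            metadata JSONB,
            embeddings VECTOR(384)
        );
        """
    )
    cur.execute(
        "INSERT INTO elements (id, text, metadata, embeddings) VALUES (%s, %s, %s, %s)",
        (
            str(uuid.uuid4()),
            "Example element text",
            json.dumps({"filename": "example.pdf"}),
            "[" + ",".join(["0.1"] * 384) + "]",  # pgvector accepts the textual vector format
        ),
    )
```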
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
To test:
> cd docs && make html
Sections:
- New User sign-up: (i) registration form, (ii) payment processing, and
(iii) use API key & URL
- API Account maintenance: (i) update billing, (ii) opt-in email, (iii)
rotate API key, and (iv) cancel plan
- Get Support
This PR intends to add [Qdrant](https://qdrant.tech/) as a supported
ingestion destination.
- Implements CLI and programmatic usage.
- Documentation update
- Integration test script
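For context (a sketch only, not the connector's implementation), writing an embedded element into Qdrant with the official Python client looks roughly like this; the collection name and vector size are assumptions:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

# illustrative collection name and vector size
client.recreate_collection(
    collection_name="unstructured-elements",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

client.upsert(
    collection_name="unstructured-elements",
    points=[
        PointStruct(
            id=1,
            vector=[0.1] * 384,
            payload={"text": "Example element text", "metadata": {"filename": "example.pdf"}},
        )
    ],
)
```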
---
Clone of #2315 to run with CI secrets
---------
Co-authored-by: Anush008 <anushshetty90@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Closes https://github.com/Unstructured-IO/unstructured/issues/1842
Closes https://github.com/Unstructured-IO/unstructured/issues/2202
Closes https://github.com/Unstructured-IO/unstructured/issues/2203
This PR:
- Adds an Elasticsearch destination connector so that documents from any supported source can be ingested, embedded, and the embeddings/documents written into Elasticsearch.
- Defines an example unstructured elements schema so users can easily set up their Elasticsearch indexes.
- Includes parallelized upload and lazy processing for the Elasticsearch destination connector.
- Rearranges the Elasticsearch test helpers into source, destination, and common folders.
- Adds util functions to batch iterables lazily for uploads (see the sketch after this list).
- Fixes a bug where removing the optional parameter `--fields` broke the connector due to an integer processing error.
- Fixes a bug where using an [elasticsearch config](8fa5cbf036/unstructured/ingest/connector/elasticsearch.py (L26-L35)) for a destination connector resulted in a serialization issue when the optional parameter `--fields` was not provided.
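For illustration, the lazy batching helper mentioned above works along these lines (a sketch, not the exact util added in this PR; `upload_batch` is a hypothetical helper):

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_generator(iterable: Iterable[T], batch_size: int = 100) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without materializing the whole iterable."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# e.g. uploading element dicts to Elasticsearch one batch at a time
# for batch in batch_generator(element_dicts, batch_size=500):
#     upload_batch(batch)  # hypothetical upload helper
```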
Adds Chroma (also known as ChromaDB) as a vector destination.
Currently Chroma is an in-memory, single-process-oriented library, with plans for a hosted and/or more production-ready solution: https://docs.trychroma.com/deployment
Though they now claim to support multiple clients hitting the database at once, I found that to be inconsistent: sometimes multiprocessing worked (maybe 1 out of 3 times), but other times I would get different errors, so I kept it single-process.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
Adds a source connector for SFTP, which uses fsspec (and, through it, paramiko). Paramiko is the standard SFTP package for Python, used in pysftp etc.
```
--username foo \
--password bar \
--remote-url sftp://localhost:47474/upload/
```
The connector will only download a specifically requested file if it has an extension
(e.g. `--remote-url sftp://localhost:47474/upload/bob.zip`); any other remote_url is treated as a folder path. This is intentional.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
Closes #1781.
- Adds a Weaviate destination connector.
- The connector receives a host for the Weaviate instance and a Weaviate class name.
- Defines a Weaviate schema for JSON elements.
- Defines the pre-processing to conform unstructured's schema to the proposed Weaviate schema.
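A rough sketch of the kind of pre-processing this involves, i.e. mapping an unstructured element dict onto Weaviate class properties (the property names here are illustrative, not the schema defined in this PR):

```python
from typing import Any, Dict


def conform_element(element: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten an unstructured element dict into properties for a Weaviate class."""
    metadata = element.get("metadata", {})
    return {
        "text": element.get("text", ""),
        "element_type": element.get("type", ""),
        "filename": metadata.get("filename"),
        "page_number": metadata.get("page_number"),
    }


# the connector would then write these objects into the configured class, e.g. with
# weaviate.Client(host).batch.add_data_object(conform_element(el), class_name)
```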
To test:
> cd docs && make html
Change logs:
* Examples are reorganized onto their own page
* Removed two old examples, i.e. "file-utils" & "sentiment analysis".
* Added two examples: "RAG with Unstructured, LangChain, and ChromaDB" &
"Multi-Files Processing with S3 Connector and API"
* Reorganized and added detailed API documentation: (i) usage, (ii)
SDKs, (iii) Azure Marketplace, (iv) AWS Marketplace, (v) parameters and
validation errors
### Description
This adds the basic implementation of pushing the generated JSON output of partition to MongoDB. None of this code provisions the MongoDB instance, so things like adding a search index around the embedding content must be done by the user. Any schema validation would also have to take place via user-specific configuration on the database. This update makes no assumptions about the configuration of the database itself.
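In effect, the destination does something along these lines (a minimal sketch, assuming a local MongoDB instance and illustrative database/collection names):

```python
import json

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["unstructured"]["elements"]  # assumed database/collection names

# elements as produced by partition and written out as JSON
with open("example-output.json") as f:
    elements = json.load(f)

collection.insert_many(elements)
```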
Per @tabossert we're now using a link shortener behind which we can rotate the link to keep it current. That way we (🤞) never have to update it here again.
#### Testing:
Links should work. No more links should exist in the documentation
except this one.
### Description
Update all destination tests to match the pattern:
* Don't omit any metadata, to check the full schema
* Move the Azure Cognitive Search dest test from src to dest
* Split the delta table test into separate src and dest tests
* Fix Azure Cognitive Search and add it to the dest tests being run (it wasn't being run originally)
### Summary
We no longer use the "bricks" terminology for partitioning functions, etc. in the library. This PR updates various references to bricks within the repo and the docs. This is just an initial pass to swap the terminology out; it'll likely be helpful to reorganize the docs a bit as well.
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
### Description
Add the fsspec configs needed for the fsspec-based connectors.
To match the behavior of the original CLI, the default used by the click option was mirrored in the base config for the API endpoint.
Carries `skip_infer_table_types` through to `infer_table_structure` in the partition flow. Now Table elements from PPT/X, DOC/X, etc. should not have a `text_as_html` field.
Note: I've continued to exclude this var from partitioners that go through the HTML flow. If we've already got the HTML, it doesn't make sense to carry the infer variable along, since we're not inferring the HTML table in these cases.
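Conceptually the mapping is just this (a simplified sketch, not the actual partition code; the skip list contents below are hypothetical):

```python
from typing import List


def infer_table_structure_for(filetype: str, skip_infer_table_types: List[str]) -> bool:
    """Table structure (and therefore text_as_html) is only inferred for non-skipped types."""
    return filetype not in skip_infer_table_types


# with a hypothetical skip list that includes pptx/docx, their Table elements
# would no longer carry a text_as_html field
infer_table_structure_for("pptx", ["pptx", "docx"])  # False
infer_table_structure_for("pdf", ["pptx", "docx"])   # True
```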
TODO:
✅ add unit tests
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: amanda103 <amanda103@users.noreply.github.com>
Summary: Added support for AWS Bedrock embeddings, leveraging "amazon.titan-tg1-large" as the embedding model.
To test:
- find your AWS secret access key and key ID; make sure the account has access to Bedrock's Titan embed model
- follow the instructions in
d5e797cd44/docs/source/bricks/embedding.rst (bedrockembeddingencoder)
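Since the encoder goes through langchain, the call it wraps is roughly the following (a sketch under the assumption that AWS credentials with Bedrock access are configured; not the encoder's exact code, and class/parameter names may differ in newer langchain releases):

```python
from langchain.embeddings import BedrockEmbeddings

embeddings = BedrockEmbeddings(
    model_id="amazon.titan-tg1-large",
    region_name="us-east-1",  # assumed region; credentials come from the AWS environment
)
vector = embeddings.embed_query("Example element text")
print(len(vector))
```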
---------
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>
### Description
* Update all existing connector docs to use new pipeline approach
### Additional changes:
* Some defaults were set for the runners to match those in the configs to make them easy to handle, e.g. the biomed runner:
```python
max_retries: int = 5,
max_request_time: int = 45,
decay: float = 0.3,
```
### Description
Currently linting only runs over the base unstructured directory, but we have Python files throughout the repo. It makes sense for all of those files to abide by the same linting rules, so the entire repo is now inspected when the linters run. Along with that, autoflake was added as a linter; it has added benefits such as removing unused imports that would otherwise break flake8 and require manual intervention.
The only truly relevant changes in this PR are in the `Makefile`, `setup.cfg`, and `requirements/test.in`. The rest is the result of running the linters.
This PR:
- defines rbac_data as a SourceMetadata field,
- manages connections to an external API for obtaining RBAC data with the ConnectorRBAC class,
- serializes RBAC data and saves it to disk,
- matches the rbac_data on disk to each IngestDoc, using a common field,
- forwards RBAC data to Elements via the partition() function
To test the changes, run `examples/ingest/sharepoint/ingest.sh` with the relevant RBAC & connector credentials.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
This PR adds the `max_characters` (hard max) param to non-table element
chunking. Additionally updates the `num_characters` metadata to
`max_characters` to make it clearer which param we're referencing.
To test:
```python
from unstructured.partition.html import partition_html

filename = "example-docs/example-10k-1p.html"
chunk_elements = partition_html(
    filename,
    chunking_strategy="by_title",
    combine_text_under_n_chars=0,
    new_after_n_chars=50,
    max_characters=100,
)

for chunk in chunk_elements:
    print(len(chunk.text))

# previously we were only respecting the "soft max" (default of 500) for elements other than tables
# now we should see that all the elements have text fields under 100 chars.
```
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
* Updated the Metadata page: added common and additional metadata fields by document type and connector
* Updated specific installation extras by document type and connector
* Added embedding brick page in Sphinx TOC
* Fixed Sphinx warnings in new pages
### Description
Updates the Python version of the example docs to show how to run the same code that the CLI runs, but using Python. Rather than copying the command that would be run via the terminal and running it with the subprocess library, this updates the examples to use the supported code exposed in the inference directory.
For now only the wikipedia example has been updated, to get some opinions on this approach before updating all the other connector docs.
Would close out
https://github.com/Unstructured-IO/unstructured/issues/1445
### Description
This PR is two-fold:
**Embeddings:**
* Embeddings are incorporated into the SharePoint source connector, which will now call out to OpenAI and create embeddings if the flag is passed in and the API key is provided.
**Writing vector content (embeddings) to an Azure Cognitive Search index:**
* The schema for the index expected to exist in Azure has been updated to include the vector field type, and a test script has been added to exercise the new content produced by the SharePoint connector and push the embedding content.
Some important notes about other changes in here:
* The embedding code had to be updated to patch the `to_dict` method on elements so that `embeddings` is added to the dict output when present. While the code originally attached the embedding content, it was lost when `to_dict` was called to save the content as JSON.
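A simplified sketch of that `to_dict` patch (illustrative only; the real patching lives in the embedding code):

```python
import types


def patch_to_dict(element):
    """Wrap an element's to_dict so any attached embeddings survive JSON serialization."""
    original_to_dict = element.to_dict

    def to_dict_with_embeddings(self):
        data = original_to_dict()
        if getattr(self, "embeddings", None) is not None:
            data["embeddings"] = self.embeddings
        return data

    element.to_dict = types.MethodType(to_dict_with_embeddings, element)
    return element
```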
This updates the Docker image download URL to pass through the Scarf gateway, which allows anonymous tracking of downloads.
Related to:
https://github.com/Unstructured-IO/unstructured#chart_with_upwards_trend-analytics
Testing:
docker pull
downloads.unstructured.io/unstructured-io/unstructured:latest
Result:
Image should download
### Description
A new [Azure Cognitive Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search) destination connector has been added. It takes each JSON element from the JSON files created via partition and writes that content to an index.
**Bonus bug fix:** Due to a recent change where the default version of Python used in the repo was bumped from `3.8` to `3.10`, `pip-compile` now runs against that version rather than the lowest version we support, which is still `3.8`. This breaks the setup for those lower versions because some of the versions pulled in by `pip-compile` exist for `3.10` but not `3.8`. `pip-compile` was updated to run as a script that checks the version of Python being used first, which helps guarantee that all dependencies meet the minimum Python version requirement.
Closes out https://github.com/Unstructured-IO/unstructured/issues/1466
Closes https://github.com/Unstructured-IO/unstructured/issues/1319,
closes https://github.com/Unstructured-IO/unstructured/issues/1372
This module:
- implements EmbeddingEncoder classes which track embedding-related data
- implements an embed_documents method which receives a list of Elements, obtains embeddings for the text within the Elements, updates the Elements with an attribute named `embeddings`, and returns the updated Elements (see the sketch below)
- uses langchain to obtain the embeddings
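A minimal sketch of that flow (not the module's exact implementation; assumes the OpenAI-backed encoder and an API key in the environment):

```python
from typing import List

from langchain.embeddings import OpenAIEmbeddings

from unstructured.documents.elements import Element


def embed_documents(elements: List[Element]) -> List[Element]:
    """Attach an `embeddings` attribute to each element, derived from its text."""
    encoder = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
    vectors = encoder.embed_documents([str(element) for element in elements])
    for element, vector in zip(elements, vectors):
        element.embeddings = vector
    return elements
```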
-----
- The PR additionally fixes a JSON de-serialization issue on the
metadata fields.
To test the changes, run `examples/embed/example.py`
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.
This refactor solves a few problems, but mostly lays the groundwork to allow us to refine further aspects such as page-break detection and list-item detection, and to move python-docx internals upstream to that library so our work doesn't depend on that domain knowledge.
This PR adds documentation of models supported by the `Unstructured`
tool. The changes reflect the tool's capabilities, usage examples, and
the process for integrating custom models.
Sections:
- Detailed the basic usage of the `Unstructured` partition functions with a model name (see the sketch after this list).
- Provided a list of available models in the `Unstructured` partition.
- Added instructions on using non-default models via three distinct
methods.
- Explained leveraging models from the LayoutParser's model zoo with
`UnstructuredDetectronModel`.
- Guided users in integrating their custom object detection models using
the `UnstructuredObjectDetectionModel` class.
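For instance, the basic-usage section boils down to something like this sketch (the parameter and model names here are illustrative and may differ between partition functions and versions):

```python
from unstructured.partition.pdf import partition_pdf

# the hi_res strategy runs a layout detection model; model_name selects which one.
# "yolox" and the exact keyword are assumptions for illustration.
elements = partition_pdf(
    filename="example-docs/layout-parser-paper.pdf",
    strategy="hi_res",
    model_name="yolox",
)
```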
Tested the docs build with:
> cd docs
> pip install -r requirements.txt
> make html
This PR does two things:
1. Adds a test case (and alters the sample doc) for rtf and epub files with tables
2. Adds the `xls/x` file extensions to the `skip_infer_table_types` default list
---------
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
### Description
Update all other connectors to use the new downstream architecture that
was recently introduced for the s3 connector.
Closes #1313 and #1311
This connector:
- takes a Jira Cloud URL, user email, and API token to authenticate with Jira Cloud
- ingests:
  - either all issues in all projects in a Jira Cloud organization, or
  - issues in user-specified projects and boards, and
  - user-specified issues
- processes the following kinds of data:
  - text fields such as issue summary, description, and comments
  - dropdown fields such as issue type, status, priority, assignee, reporter, labels, and components
  - other data such as issue id, issue key, project id, and information on subtasks
- notes down attachment URLs, but does not process attachments
- stores each downloaded issue in a txt file, in a predefined template form (consisting of the data above)
- then processes each downloaded issue document into elements using the unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263
To test the changes, complete the necessary setup and run the relevant ingest test scripts.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>