unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-08 01:22:43 +00:00

Author	SHA1	Message	Date
Roman Isecke	54ec311c55	feat/migrate onedrive src (#3295 ) ### Description Migrate the onedrive source connector to v2, pulling richer content from the SDK response to add further metadata to the FileData produced by the indexer. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-06-26 23:59:51 +00:00
Matt Robinson	6939bff49e	build(deps): bump langchain-community version (#3305 ) ### Summary Bumps to the latest `langchain-community` version to resolve [CVE-2024-2965](https://nvd.nist.gov/vuln/detail/CVE-2024-2965). --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2024-06-26 22:42:32 +00:00
Roman Isecke	3f581e6b7d	feat/migrate gdrive source connector (#3239 ) ### Description Migrate the google drive source connector over to the new v2 ingest framework and include a variety of improvements as part of the refactor: * The ID is no longer limited to a drive id but can also be the id of a subfolder within a drive or a file directly and each case is handled appropriately * More metadata is pulled in from google drive to enrich the partitioned elements downstream and now the modified date is being set to not reprocess if the ingest pipeline already has the file cached * timing information is set on the file created when downloaded based on the last modified data retrieved from google drive --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-06-25 12:55:28 +00:00
Roman Isecke	e0f4374386	Roman/bugfix conflicting event loop ingest (#3264 ) ### Description In use cases where an external system (such as code being run in a jupyter notebook) already has a running event loop, run the async code in a dedicated thread pool to not conflict with the existing event loop. This also has a variety of fixes that were found when putting together a demo leveraging the elasticsearch destination connector	2024-06-24 18:47:37 +00:00
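The event-loop fix described in #3264 can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the PR: `run_coroutine_safely`, `fetch_value`, and `caller_with_running_loop` are hypothetical names introduced here.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch_value() -> int:
    # Hypothetical stand-in for an async pipeline step.
    await asyncio.sleep(0)
    return 42


def run_coroutine_safely(coro_fn):
    """Run an async callable to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(coro_fn())
    # A loop is already running (e.g. in a notebook): run the coroutine
    # in a dedicated worker thread, which gets its own event loop.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(lambda: asyncio.run(coro_fn())).result()


async def caller_with_running_loop() -> int:
    # Simulates being invoked from code that already owns an event loop.
    return run_coroutine_safely(fetch_value)
```

Note that the calling loop is blocked while the worker thread runs; the sketch only shows how to avoid the "event loop is already running" conflict.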
David Potter	8610bd3ab9	feat: Kafka source and destination connector (#3176 ) Thanks to @tullytim we have a new Kafka source and destination connector. It also works with hosted Kafka via Confluent. Documentation will be added to the Docs repo.	2024-06-22 23:26:23 +00:00
Christine Straub	f23d180d34	fix: docker image publishing error (#3238 ) This PR aims to fix a docker image publishing error caused by user changes when pulling the `amd64` image from the `unstructured` `wolfi-base` image. (https://github.com/Unstructured-IO/unstructured/pull/3213). --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-06-18 21:01:42 +00:00
Roman Isecke	fd98cf9ea5	Roman/migrate es dest (#3224 ) ### Description Migrate elasticsearch destination connector to new v2 ingest framework	2024-06-18 14:20:49 +00:00
Roman Isecke	d876a386ed	Roman/fix ingest async connectors (#3210 ) ### Description Choosing to use async must be done carefully: if a connector is set to use async, the pipeline will not fan out the inputs via multiprocessing but will instead run in a single process, under the assumption that it benefits more from async due to heavy network traffic. This means that code that is not optimized for async and is blocking will force the pipeline to perform worse than if the connector had never been marked as async, since the pipeline would otherwise fan that work out using multiprocessing. All connectors and processes in the pipeline were revisited to make sure this criterion was met and updated accordingly: * Currently the unstructured client does not support making requests async, so this was moved over to use multiprocessing * fsspec connector was updated to use the async client from the fsspec library. This also required that the client be a `@property` fetched on demand, otherwise the client would break the multiprocessing pool since it maintains a thread lock and that can't be pickled when the fsspec connector doesn't support async. * elasticsearch was also updated to use the async client * weaviate only recently came out with async support in their SDK at a version that is higher than we can use in the open source repo, so a TODO was left but otherwise moved to use multiprocessing * none of the underlying embedders use async, so the embedder step must use multiprocessing for now. TODO left to update underlying embedder classes to optionally support async. * Chunking parameters were not accurately being passed through from cli to chunker params, this was fixed --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-06-17 16:55:19 +00:00
Steve Canny	9fae0111d9	rfctr(html): drop HTML-specific elements (#3207 ) Summary Remove HTML-specific element types and return "regular" elements like `Title` and `NarrativeText` from `partition_html()`. Additional Context - An aspect of the legacy HTML partitioner was the use of HTML-specific element types used to track metadata during partitioning. - That role is no longer necessary or desirable. - HTML-specific elements like `HTMLTitle` and `HTMLNarrativeText` were returned from partitioning HTML but also the seven other file-formats that broker partitioning to HTML (convert-to-HTML and partition_html()). This does not cause immediate breakage because these are still `Text` element subtypes, but it produces a confusing developer experience. - Remove the prior metadata roles from HTML-specific elements and remove those element types entirely.	2024-06-15 00:14:22 +00:00
Christine Straub	9552fbbfbf	chore: bump unstructured-inference 0.7.35 (#3205 ) ### Summary - bump unstructured-inference to `0.7.35` which fixed syntax for generated HTML tables - update unit tests and ingest test fixtures to reflect changes in the generated HTML tables - cut a release for `0.14.6` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-06-14 18:11:38 +00:00
ryannikolaidis	da3492b529	fix: dropbox source connector file path bugs (#3189 ) The Dropbox source connector currently raises exceptions when indexing files due to two issues: a path formatting idiosyncrasy of the Dropbox library and a divergence in the definition of the Dropbox library's fs.info method, which expects a 'url' parameter rather than 'path'. ## Changes * add a `/` prefix to file path used by DropboxIndexer * override the fsspec sterilize_info method in DropboxIndexer to call `self.fs.info` with `url` rather than `path`, to accommodate the fact that `dropboxdrivefs` diverges with this signature * remove `dropbox.sh` from ignored source tests * update test fixtures (now that the dropbox connector has been fixed and not skipped) ## Testing `dropbox.sh` source ingest test now succeeds (and is no longer ignored) --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com> Co-authored-by: Christine Straub <christinemstraub@gmail.com>	2024-06-13 18:06:41 +00:00
Roman Isecke	f7b0a37c86	Feat/migrate elasticsearch src connector (#3174 ) ### Description Migrate elasticsearch connector with support for what used to be batch ingest docs, but now with support for the download step to generate additional file data. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-06-13 17:57:59 +00:00
ryannikolaidis	17bc55e7be	fix: relative path / permissions issues with v2 fsspec connectors (#3186 ) When the v2 fsspec connectors currently generate the relative path, they may introduce a path with a leading slash (this happens in the case of the Box connector, which is a subclass of fsspec). When this happens this results in the paths unintentionally being treated as absolute paths. As a result, the ingest pipeline attempts to write files to directories at root level, which in turn raises permission issues. Note: Box expected results needed to be updated now that it's no longer failing. Aside: found that our tests were unintentionally skipping `box.sh` tests because we were intending to skip `dropbox.sh` and we use regex to match if a given test is in skip tests. This adds changes to force an exact match. ## Changes * Strip leading slashes during the creation of relative paths in fsspec connectors * Add expected results for Box connector * (bonus): `make tidy` altered an unrelated file by removing an unnecessary call of `pass` * (bonus): check exact match for skipped ingest tests which fixes Box tests getting skipped ## Testing [Tests](https://github.com/Unstructured-IO/unstructured/actions/runs/9461928289/job/26093475612#step:7:2085) for the Box connector were failing. They were accidentally getting skipped (see changes above). They are now no longer skipped and are passing.	2024-06-12 03:39:35 +00:00
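Both problems described in #3186 can be reproduced with the standard library alone. The sketch below is illustrative rather than the connector code: `to_relative_path`, `join_under`, and `is_skipped` are hypothetical helper names.

```python
import re
from pathlib import PurePosixPath


def to_relative_path(remote_path: str) -> str:
    # Strip leading slashes so the path cannot be interpreted as absolute.
    return remote_path.lstrip("/")


def join_under(output_dir: str, remote_path: str) -> str:
    # Joining an absolute path discards output_dir entirely,
    # which is why the leading slash has to be removed first.
    return str(PurePosixPath(output_dir) / to_relative_path(remote_path))


def is_skipped(test_name: str, skip_tests: list) -> bool:
    # Exact membership test instead of a regex/substring match.
    return test_name in skip_tests


# The unpatched behaviours, for comparison:
unsafe_join = str(PurePosixPath("/tmp/out") / "/box/file.txt")
regex_hit = re.search("box.sh", "dropbox.sh") is not None
```

Here `unsafe_join` ends up outside the output directory, and `regex_hit` shows why skipping `dropbox.sh` also skipped `box.sh`.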
Roman Isecke	b777864296	feat: Migrate over fsspec connectors (#3066 ) ### Description Move over all fsspec connectors to the new framework Variety of bug fixes found and fixed in this PR as well: * custom json mixin being used for the enhanced dataclass would break if typing was quoted. That was fixed. A check was also added to the enhanced dataclass to prevent `InitVar` from being used in the root dataclass since this breaks serialization. * hashing for partitioner was using the filename of the raw file being partitioned rather than the file name of the file data generated from indexing. This means that multiple files could result in the same partition hash when recursive flag is passed in. This was updated to use the file data file name instead. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-06-05 19:12:06 +00:00
Matt Robinson	865ef496e6	ci: update `pinecone` test to use serverless (#3127 ) ### Summary Closes #3068. Updates the Pinecone connector tests to use serverless indexes, per the documentation [here](https://docs.pinecone.io/reference/api/control-plane/create_index). Also updates the CHANGELOG to mention serverless. Turns out we already supported it with the client version bump, but it hadn't been tested yet. ### Testing See [this CI job](https://github.com/Unstructured-IO/unstructured/actions/runs/9319836670/job/25655322433?pr=3127) that passed, running only the Pinecone test.	2024-05-31 15:24:41 +00:00
ryannikolaidis	1f8768750c	chore: add auth to s3 destination test (#3122 ) We should be validating the S3 Destination with authenticated requests, with credentials from a limited test user. ## Changes - Updates s3 destination test to point to a bucket that requires authentication. - Adds authentication to the s3 destination test request - Bonus: fix deserialization of S3ConnectionConfig for s3 V2 destination - Bonus: fix S3ConnectionConfig never registered for s3 V2 destination - Bonus: repair version and changelog version for consistency with -dev convention ## Testing Validated by changes to S3 destination ingest test	2024-05-31 07:05:09 +00:00
ryannikolaidis	6b5d8a9785	fix: revert dropping of filename extension for some connectors (#3109 ) V2 refactor of ingest code introduces the removal of original file extensions. Since the upgrade of connectors is incomplete this means that some connectors will remove the original file extension and some will not. Still TBD whether this is actually something we want at all. This PR reverts specifically that change in the V2 ingest code so that original file extension is preserved downstream. ## Testing CI is passing with filenames updated via `Ingest Test Fixtures Update` workflow. --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>	2024-05-29 19:14:22 +00:00
Matt Robinson	3158169585	fix: uninstall bson for mongo connector (#3104 ) ### Summary Closes #3049. Reenables the MongoDB connector test, which was disabled previously in #3047 due to incompatibility between the `pymongo` and the `bson` package from `pip`, which is a dependency for the Astra connector. Per the `pymongo` docs below, `pymongo` ships with its own version of `bson` and installing `bson` from `pip` breaks `pymongo`. - https://pymongo.readthedocs.io/en/stable/installation.html ### Testing Ingest tests ran successfully for the [source connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512636315) and the [destination connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512635546).	2024-05-28 17:45:18 +00:00
Matt Robinson	6b400b46fe	feat: add VoyageAI embeddings (#3069 ) (#3099 ) Original PR was #3069. Merged in to a feature branch to fix dependency and linting issues. Application code changes from the original PR were already reviewed and approved. ------------ Original PR description: Adding VoyageAI embeddings Voyage AI’s embedding models and rerankers are state-of-the-art in retrieval accuracy. --------- Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com> Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>	2024-05-24 21:48:35 +00:00
Yao You	32df4ee1c6	fix: disable table_as_cells output by default (#3093 ) This PR changes the output of table elements: now by default the table elements' `metadata.table_as_cells` is `None`. The data will only be populated when the env `EXTRACT_TABLE_AS_CELLS` is set to `true`. The original purpose of `table_as_cells` is evaluating table extraction performance. The format itself is not as readable as the `table_as_html` metadata for human or RAG consumption. Therefore by default this data is not needed. Since this output is meant for evaluation use, this PR chooses to use an environment variable to control if it should be present in the partitioned results. This approach avoids adding parameters to the `partition` function call. Adding a new parameter to the `partition` interface increases the complexity of the interface and adds more maintenance cost since there is a long chain of function calls to pass down this parameter to where it is needed. ## test running the following code snippet on main vs. this PR ```python from unstructured.partition.auto import partition elements = partition("example-docs/layout-parser-paper-with-table.pdf", strategy="hi_res", skip_infer_table_types=[]) table_cells = [getattr(element.metadata, "table_as_cells", None) for element in elements if element.category == "Table"] ``` on main branch `table_cells` contains cell structured data but on this branch it is a list of `None`. However, if we first set in terminal: ```bash export EXTRACT_TABLE_AS_CELLS=true ``` then run the same code again with this PR, `table_cells` would contain actual data, the same as on main branch. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2024-05-24 16:41:25 +00:00
Christine Straub	35ec21ecd0	fix: decide table extraction (#3090 ) This PR aims to add backward compatibility for the deprecated `pdf_infer_table_structure` parameter. A missing part of turning table extraction for PDFs and Images off by default in https://github.com/Unstructured-IO/unstructured/pull/3035, which was turned on in https://github.com/Unstructured-IO/unstructured/pull/2588. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-05-23 20:37:15 +00:00
Roman Isecke	3eaf65a8c1	feat: refactor ingest (#3009 ) ### Description This refactors the current ingest CLI process to support better granularity in how the steps are run * Both multiprocessing and async now supported. Given that a lot of the steps are IO-bound, such as downloading and uploading content, we can achieve better parallelization by using async here * Destination step broken up into a stager step and an upload step. This will allow for steps that require manipulation of the data between formats, such as converting the elements json into a csv format to upload for tabular destinations, to be pulled out of the step that does the actual upload. * The process of writing the content to a local destination has now been pulled out as its own dedicated destination connector, meaning you no longer need to persist the content locally once the process is done if the content was uploaded elsewhere. * Quick update to the chunker/partition step to use the python client. * Move the uncompress support to a pipeline step since this can arbitrarily apply to any concrete files that have been downloaded, regardless of where they came from. * Leverage last modified date to mark files to be reprocessed, even if the file already exists locally. ### Callouts Retry configs haven't been moved over yet. This is an open question because the intent was for it to wrap potential connection errors but now any of the other steps that leverage an API might run into network connection issues. Should those be isolated in each of the steps and wrapped with the same retry configs? Or do we need to expose a unique retry config for each step? This would bloat the input params even more. 
### Testing * If you want to run the new code as an SDK, there's an example file that was added to highlight how to do that: [example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py) * If you want to run the new code as an isolated CLI: ```shell PYTHONPATH=. python unstructured/ingest/v2/main.py --help ``` * If you want to see which commands have been migrated to the new version, there's now a `v2` short help text next to those commands when running the current cli: ```shell PYTHONPATH=. python unstructured/ingest/main.py --help Usage: main.py [OPTIONS] COMMAND [ARGS]...main.py --help Options: --help Show this message and exit. Commands: airtable azure biomed box confluence delta-table discord dropbox elasticsearch fsspec gcs github gitlab google-drive hubspot jira local v2 mongodb notion onedrive opensearch outlook reddit s3 v2 salesforce sftp sharepoint slack wikipedia ``` You can run any of the local or s3 specific ingest tests and these should now work. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-05-21 17:01:49 +00:00
Matt Robinson	d7608014c0	improve: add Python 3.12 support (#3033 ) (#3047 ) ### Summary Closes #2959. Updates the dependency and CI to add support for Python 3.12. The MongoDB ingest tests were disabled due to jobs like [this one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333) failing due to issues with the `bson` package. `bson` is a dependency for the AstraDB connector, but `pymongo` does not work when `bson` is installed from `pip`. This issue is documented by MongoDB [here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun off #3049 to resolve this. Issue seems unrelated to Python 3.12, though unsure why this didn't surface previously. Disables the `argilla` tests because `argilla` does not yet support Python 3.12. We can add the `argilla` tests back in once the PR referenced below is merged. You can still use the `stage_for_argilla` function if you're on `python<3.12` and you install `argilla` yourself. - https://github.com/argilla-io/argilla/pull/4837 --------- Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>	2024-05-19 23:03:15 +00:00
Matt Robinson	ec987dcbb2	BREAKING CHANGE: revert table extraction off by default for PDFs and images (#3035 ) ### Summary Closes #3021 . Turns table extraction for PDFs and images off by default. The default behavior originally changed in #2588 . The reason for reversion is that some users did not realize turning off table extraction was an option and experienced long processing times for PDFs and images with the new default behavior. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2024-05-17 15:28:11 +00:00
David Potter	df8d39a4d4	fix: allow AstraDB to prevent indexing on metadata columns with long text (#3003 ) Thanks to @erichare from AstraDB Adds support for specifying the indexing options for various columns in Astra DB, allowing users to avoid a situation where long text columns are by-default indexed. Changes to: test_unstructured_ingest/python/test-ingest-astra-output.py are forward looking from AstraDB	2024-05-17 04:12:37 +00:00
Yuming Long	542d442699	chore CORE-4775: remove html page number metadata field (#2942 ) ### Summary Remove the page_number metadata field until we have page counting for all kinds of html files (not just limited to news articles with multiple `<article>` tags) ### Test Unit tests `test_add_chunking_strategy_on_partition_html_respects_multipage` and `test_add_chunking_strategy_title_on_partition_auto_respects_multipage` removed since they rely on the `page_number` fields from the SEC html file - now test moved to mock test for chunk_by_title -> revisit those tests when we find test file for this Also changed the element ids from partition outputs for html files - element id change due to page number change (in element id hashing) -> todo ticket: update other deterministic element id tests per crag's comment --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2024-04-30 15:20:26 +00:00
Pluto	df1f7bcd0e	Save table prediction in cells format (#2892 ) This pull request allows returning predictions from the table transformer in raw cell representation. It will later be used to save predictions in a cells format for simpler metrics calculation. This PR has to be merged after https://github.com/Unstructured-IO/unstructured-inference/pull/335	2024-04-25 11:14:48 +00:00
John	3843af666e	feat: Enable remote chunking via unstructured-ingest (#2905 ) Update: The cli shell script works when sending documents to the free api, but the paid api is down, so waiting to test against it. - The first commit adds docstrings and fixes type hints. - The second commit reorganizes `test_unstructured_ingest` so it matches the structure of `unstructured/ingest`. - The third commit contains the primary changes for this PR. - The `.chunk()` method responsible for sending elements to the correct method is moved from `ChunkingConfig` to `Chunker` so that `ChunkingConfig` acts as a config object instead of containing implementation logic. `Chunker.chunk()` also now takes a json file instead of a list of elements. This is done to avoid redundant serialization if the file is to be sent to the api for chunking. --------- Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>	2024-04-25 00:24:58 +00:00
Michał Martyniak	2d1923ac7e	Better element IDs - deterministic and document-unique hashes (#2673 ) Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842 Main changes compared to part one: * hash computation includes element's sequence number on page, page number, document filename and its text * there are more tests for deterministic behavior of IDs returned by partitioning functions + their uniqueness (guaranteed at the document level, and high probability across multiple documents) This PR addresses the following issue: https://github.com/Unstructured-IO/unstructured/issues/2461	2024-04-24 00:05:20 -07:00
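The idea behind #2673 can be illustrated with a small sketch. The commit only states which inputs feed the hash; the function name `element_id`, the separator, the choice of SHA-256, and the 32-character truncation below are all assumptions made for illustration, not the library's implementation.

```python
import hashlib


def element_id(text: str, sequence_number: int, page_number: int, filename: str) -> str:
    # Combine the four inputs named in the commit message. The unit
    # separator keeps adjacent fields from running into each other.
    data = "\x1f".join([filename, str(page_number), str(sequence_number), text])
    # Assumption: SHA-256 truncated to 32 hex characters.
    return hashlib.sha256(data.encode("utf-8")).hexdigest()[:32]


# Same inputs give the same ID; changing any input changes it.
first = element_id("Hello", 0, 1, "report.pdf")
again = element_id("Hello", 0, 1, "report.pdf")
moved = element_id("Hello", 1, 1, "report.pdf")
```

Including the sequence number is what lets two elements with identical text on the same page receive different IDs.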
Michał Martyniak	001fa17c86	Preparing the foundation for better element IDs (#2842 ) Part one of the issue described here: https://github.com/Unstructured-IO/unstructured/issues/2461 It does not change how hashing algorithm works, just reworks how ids are assigned: > Element ID Design Principles > > 1. A partitioning function can assign only one of two available ID types to a returned element: a hash or UUID. > 2. All elements that are returned come with an ID, which is never None. > 3. No matter which type of ID is used, it will always be in string format. > 4. Partitioning a document returns elements with hashes as their default IDs. Big thanks to @scanny for explaining the current design and suggesting ways to do it right, especially with chunking. Here's the next PR in line: https://github.com/Unstructured-IO/unstructured/pull/2673 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>	2024-04-16 21:14:53 +00:00
Christine Straub	ba3f374268	Fix: ingest test fixtures update pr (#2881 ) This PR aims to update "Ingest Test Fixtures Update PR" CI to update the ingest test fixtures only if the OVERWRITE_FIXTURES variable is not `false` and the OUTPUT_DIR directory is not empty.	2024-04-15 17:47:22 +00:00
MiXiBo	0506aff788	add support for `start_index` in `html` links extraction (#2600 ) add support for start_index in html links extraction (closes #2625) Testing ``` from unstructured.partition.html import partition_html from unstructured.staging.base import elements_to_json html_text = """<html> <p>Hello there I am a <a href="/link">very important link!</a></p> <p>Here is a list of my favorite things</p> <ul> <li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li> <li>Dogs</li> </ul> <a href="/loner">A lone link!</a> </html>""" elements = partition_html(text=html_text) print(elements_to_json(elements)) ``` --------- Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com> Co-authored-by: christinestraub <christinemstraub@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-04-12 06:14:20 +00:00
Ahmet Melek	6fd29ea77c	fix: collection deletion for AstraDB test (#2869 ) This PR: - Fixes occasional collection deletion failures for AstraDB via putting collection deletion statements inside a trap statement. It uses click commands to do this. Testing: - Run ingest astradb destination test	2024-04-10 23:08:24 +00:00
Ahmet Melek	d46792214a	feat: add vertexai embeddings (#2693 ) This PR: - Adds VertexAI embeddings as an embedding provider Testing - Tested with pinecone destination connector on [this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693) job run. --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2024-03-28 21:15:36 +00:00
David Potter	c8cf8f31ac	bug CORE-4225: mongodb url bug (#2662 ) The mongodb redact method was created because we wanted part of the url to be exposed to the user during logging. Thus it did not use the dataclass `enhanced_field(sensitive=True)` solution. This changes it to use our standard redacted solution. This also minimizes the amount of work to be done in platform.	2024-03-28 18:38:50 +00:00
Steve Canny	9ae838e50a	feat: add --include-orig-elements option to Ingest CLI (#2687 ) Summary Add an `--include-orig-elements` option to the Ingest CLI to allow users to specify the corresponding new chunking parameter. Reviewer A lot of this is cleanup; the second commit is where the option is actually added. The first commit fixes a number of inaccuracies in the documentation and does some other clean-up. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-27 06:35:01 +00:00
Christine Straub	08fafc564f	Fix: embedded text not getting merged with inferred elements (#2679 ) This PR is the second part of fixing "embedded text not getting merged with inferred elements", the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/331. ### Summary - replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()` when removing pdfminer (embedded) elements that were merged with inferred elements - use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD` introduced in the [first part](https://github.com/Unstructured-IO/unstructured-inference/pull/331) when removing pdfminer (embedded) elements that were merged with inferred elements - bump `unstructured-inference` to 0.7.25 ### Testing PDF: [pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf) ``` $ pip uninstall unstructured-inference -y $ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference $ pip install -e . ``` ``` elements = partition_pdf( filename="pwc-financial-statements-p114.pdf", strategy="hi_res", infer_table_structure=True, extract_image_block_types=["Image"], ) table_elements = [el for el in elements if el.category == "Table"] print(table_elements[0].text) ``` --------- Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-03-23 03:59:23 +00:00
Steve Canny	56fbaaed10	feat(chunking): add metadata.orig_elements serde (#2680 ) Summary This final PR in the "orig_elements" series adds the needful such that `.metadata.orig_elements`, when present on a chunk (element), is serialized to JSON when the chunk is serialized, for instance, to be used in an HTTP response payload. It also provides for deserializing such a JSON payload into chunks that contain the `.orig_elements` metadata. Additional Context Note that `.metadata.orig_elements` is always `Optional[list[Element]]` when in memory. However, those original elements are serialized as Base64-encoded gzipped JSON and are in that form (str) when present as JSON or as "element-dicts" which is an intermediate serialization/deserialization format. That is, serialization is `Element -> dict -> JSON` and deserialization is `JSON -> dict -> Element` and `.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-22 21:53:26 +00:00
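The "Base64-encoded gzipped JSON" form described in #2680 can be sketched as a round trip over plain element dicts. The helper names `compress_dicts` and `decompress_dicts` are hypothetical and the dict contents are made up for the example; this is not the code in `unstructured.staging.base`.

```python
import base64
import gzip
import json


def compress_dicts(element_dicts: list) -> str:
    # dict list -> JSON bytes -> gzip -> Base64 text safe to embed in JSON.
    raw = json.dumps(element_dicts).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decompress_dicts(payload: str) -> list:
    # Reverse the steps: Base64 text -> gzip bytes -> JSON -> dict list.
    raw = gzip.decompress(base64.b64decode(payload))
    return json.loads(raw.decode("utf-8"))


# Illustrative element dicts; real ones carry more fields.
originals = [{"type": "Title", "text": "Intro"}, {"type": "NarrativeText", "text": "Body"}]
payload = compress_dicts(originals)
restored = decompress_dicts(payload)
```

Because the payload is a plain string, it can sit inside a chunk's metadata in both the dict and JSON forms without further escaping.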
Filip Knefel	bdfd975115	chore: change table extraction defaults (#2588 ) Change default values for table extraction - works in pair with [this](https://github.com/Unstructured-IO/unstructured-api/pull/370) `unstructured-api` PR We want to move away from `pdf_infer_table_structure` parameter, in this PR: - We change how it's treated wrt `skip_infer_table_types` parameter. Whether to extract tables from pdf now follows from the rule: `pdf_infer_table_structure && "pdf" not in skip_infer_table_types` - We set it to `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` by default - We remove it from the examples in documentation - We describe it as deprecated in favor of `skip_infer_table_types` in documentation More detailed description of how we want parameters to interact - if `pdf_infer_table_structure` is False tables will never be extracted from pdf - if `pdf_infer_table_structure` is True tables will be extracted from pdf unless it's skipped via `skip_infer_table_types` - by default `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` --------- Co-authored-by: Filip Knefel <filip@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-22 10:08:49 +00:00
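The parameter interaction stated in #2588 reduces to a single boolean rule, sketched below. `should_extract_pdf_tables` is a hypothetical helper, not a function in the library; its defaults mirror the ones this PR set, which #3035 (earlier in this log) later reverted.

```python
from typing import List, Optional


def should_extract_pdf_tables(
    pdf_infer_table_structure: bool = True,
    skip_infer_table_types: Optional[List[str]] = None,
) -> bool:
    # Treat a missing skip list as empty, matching the PR's default of [].
    skip = [] if skip_infer_table_types is None else skip_infer_table_types
    # Rule from the commit message:
    # pdf_infer_table_structure && "pdf" not in skip_infer_table_types
    return pdf_infer_table_structure and "pdf" not in skip
```

With these defaults, extraction is on unless the caller disables the flag or lists `"pdf"` among the skipped types.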
Roman Isecke	4ff6a5b78e	Roman/bugfix support bedrock embeddings (#2650 ) ### Description This PR resolved the following open issue: [bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319). To do so, the following changes were made: * All aws configs were added as input parameters to the CLI * These were mapped to the bedrock embedder when an embedder is generated via `get_embedder` * An ingest test was added to call the aws bedrock service * Requirements for boto were bumped because the bedrock runtime, which is required to hit the bedrock service, was first introduced in version `1.34.63`, which was ahead of the pinned version of boto. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-03-21 18:21:04 +00:00
David Potter	9177aa20a8	feature CORE-3985: add Clarifai destination connector (#2633 ) Thanks to @mogith-pn from Clarifai we have a new destination connector! This PR intends to add Clarifai as an ingest destination connector. Access via CLI and programmatically. Documentation and Examples. Integration test script.	2024-03-21 16:36:21 +00:00
Steve Canny	31bef433ad	rfctr: prepare to add orig_elements serde (#2668 ) Summary The serialization and deserialization (serde) of `metadata.orig_elements` will be located in `unstructured.staging.base` alongside `elements_to_json()` and other existing serde functions. Improve the typing, readability, and structure of that module before adding the new serde functions for `metadata.orig_elements`. Reviewers: The commits are well-groomed and are probably quicker to review commit-by-commit than as all files-changed at once.	2024-03-20 21:27:59 +00:00
David Potter	5b92e0bb6b	bug CORE-4089: Onedrive partitioning fails - datetime formatting error (#2638 ) Fixes Onedrive bug the same way Ryan fixed the Sharepoint error. (both are Microsoft products) https://github.com/Unstructured-IO/unstructured/pull/2591 https://github.com/Unstructured-IO/unstructured/pull/2592/files We are seeing occurrences of inconsistency in the timestamps returned by Onedrive when fetching created and modified dates. Furthermore, in future versions of this library, a datetime object will be returned rather than a string. Changes This adds logic to guarantee Onedrive dates will be properly formatted as ISO, regardless of the format provided by the onedrive library. Bumps timestamp format output to include timezone offset (as we do with others) Adds unit tests for isoformat. json_to_dict already unit tested here: https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py Adds small change for AstraDB to allow them to see what source called their api	2024-03-15 14:01:05 +00:00
Steve Canny	8ea203adf7	feat(chunking): composite text gets is_continuation (#2639 ) Summary Add `metadata.is_continuation = True` to metadata of second-and-later text-split chunks formed from an oversized non-table element. Previously this metadata was only present on text-split `TableChunk` elements. This enables downstream filtering of intentionally redundant metadata on chunk elements that may not be desired for all purposes. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-12 19:44:41 +00:00
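The flagging behavior described in this commit can be illustrated with a deliberately naive splitter. This is a sketch only: `split_with_continuation` is a hypothetical function that cuts at fixed character offsets and uses plain dicts, whereas the library's chunker works on element objects and chooses split points differently. The point is just that the first chunk carries no flag and every later chunk gets `is_continuation = True`.

```python
def split_with_continuation(text: str, max_characters: int) -> list[dict]:
    """Hypothetical sketch: split oversized text and flag second-and-later chunks."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    chunks = []
    for start in range(0, len(text), max_characters):
        metadata = {}
        if start > 0:
            # Only chunks after the first are marked as continuations.
            metadata["is_continuation"] = True
        chunks.append({"text": text[start : start + max_characters], "metadata": metadata})
    return chunks


def without_continuations(chunks: list[dict]) -> list[dict]:
    """Downstream filter: keep only chunks that start a new element."""
    return [c for c in chunks if not c["metadata"].get("is_continuation")]
```

A consumer that only wants one record per original element could apply a filter like `without_continuations` to drop the later pieces.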
David Potter	1ca90d209a	bug: update sharepoint-with-permissions test to fix CI (#2589 ) Adding `metadata.data_source.permissions_data` to sharepoint-with-permissions.sh --metadata-exclude to prevent sharepoint deprecation warning from ruining test. Updating expected-structured-output As per Ahmet's comment. We do want to check sharepoint permissions metadata at some point. But that will take a separate type of test. A file diff test is too unstable. Permissions checking will be later down the road.	2024-03-06 17:15:36 +00:00
David Potter	43250d5576	bug CORE-3971: fix deserialization in google-drive source connector key path (#2586 ) Google Drive Service account key can be a dict or a file path(str) We have successfully been using the path. But the dict can also end up being stored as a string that needs to be deserialized. The deserialization can have issues with single and double quotes.	2024-03-03 15:30:35 +00:00
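One way to handle the three input shapes named in this commit (a dict, a file path, or a dict stored as a string with either quote style) is sketched below. The helper name `load_service_account_key` is hypothetical, and falling back to `ast.literal_eval` for single-quoted strings is an assumption about a reasonable fix, not a description of what the connector actually does.

```python
import ast
import json
from pathlib import Path
from typing import Union


def load_service_account_key(key: Union[dict, str]) -> dict:
    """Hypothetical helper: accept a dict, a serialized dict, or a path to a JSON file."""
    if isinstance(key, dict):
        return key
    text = key.strip()
    if text.startswith("{"):
        try:
            # Proper JSON uses double quotes.
            return json.loads(text)
        except json.JSONDecodeError:
            # A Python dict repr uses single quotes; parse it without eval().
            parsed = ast.literal_eval(text)
            if not isinstance(parsed, dict):
                raise ValueError("service account key string is not a dict")
            return parsed
    # Anything else is treated as a path to a JSON key file.
    return json.loads(Path(text).read_text())
```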
ryannikolaidis	71d5d513ef	fix: handling of varied SharePoint date formats (#2591 ) We are seeing occurrences of inconsistency in the timestamps returned by office365.sharepoint when fetching created and modified dates. Furthermore, in future versions of this library, a datetime object will be returned rather than a string. ## Changes - This adds logic to guarantee SharePoint dates will be properly formatted as ISO, regardless of the format provided by the sharepoint library. - Bumps timestamp format output to include timezone offset (as we do with others) ## Testing Unit test added to validate this datetime handling across various formats. --------- Co-authored-by: David Potter <potterdavidm@gmail.com>	2024-02-28 16:11:53 +00:00
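The normalization described in this commit, accepting either a string or a `datetime` and always emitting ISO 8601 with a timezone offset, might look roughly like the following. The function name `to_iso_with_offset` is hypothetical. The sketch only parses ISO-like strings via `datetime.fromisoformat`, so it covers fewer formats than the commit implies, and treating naive timestamps as UTC is an assumption made for illustration.

```python
from datetime import datetime, timezone
from typing import Union


def to_iso_with_offset(timestamp: Union[datetime, str]) -> str:
    """Hypothetical helper: return an ISO 8601 string that includes a UTC offset."""
    if isinstance(timestamp, datetime):
        dt = timestamp
    else:
        text = timestamp.strip()
        # Older Python versions do not accept a trailing "Z" in fromisoformat().
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        dt = datetime.fromisoformat(text)
    if dt.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.isoformat()
```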
David Potter	e8ec09c8b9	feat: astra dest connector (#2571 ) Thanks to Eric Hare @erichare at DataStax we have a new destination connector. This Pull Request implements an integration with [Astra DB](https://datastax.com) which allows for the Astra DB Vector Database to be compatible with Unstructured's set of integrations. To create your Astra account and authenticate with your `ASTRA_DB_APPLICATION_TOKEN`, and `ASTRA_DB_API_ENDPOINT`, follow these steps: 1. Create an account at https://astra.datastax.com 2. Login and create a new database 3. From the database page, in the right hand panel, you will find your API Endpoint 4. Beneath that, you can create a Token to be used Some notes about Astra DB: - Astra DB is a Vector Database which allows for high-performance database transactions, and enables modern GenAI apps [See here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html) - It supports similarity search via a number of methods [See here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics) - It also supports non-vector tables / collections	2024-02-23 20:50:50 +00:00
Matt Robinson	b4d9ad8130	enhancement: detect headers in `partition_pdf` with fast strategy (#2455 ) ### Summary Detects headers and footers when using `partition_pdf` with the fast strategy. Identifies elements that are positioned in the top or bottom 5% of the page as headers or footers. If no coordinate information is available, an element won't be detected as a header or footer. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2024-02-23 16:56:09 +00:00
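The positional rule in this commit (top or bottom 5% of the page, nothing detected without coordinates) can be sketched as a small classifier. The function name `classify_by_position` is hypothetical, and two details are assumptions rather than facts from the commit: the y axis is taken to start at the top of the page and grow downward, and an element counts as a header or footer only when it lies entirely inside the margin band.

```python
from typing import Optional


def classify_by_position(
    y_top: Optional[float],
    y_bottom: Optional[float],
    page_height: float,
    margin: float = 0.05,
) -> Optional[str]:
    """Hypothetical sketch: return "Header", "Footer", or None for an element."""
    if y_top is None or y_bottom is None or page_height <= 0:
        # Without coordinate information nothing is detected.
        return None
    if y_bottom <= page_height * margin:
        return "Header"  # entirely within the top 5% of the page
    if y_top >= page_height * (1 - margin):
        return "Footer"  # entirely within the bottom 5% of the page
    return None
```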
Pawel Kmiecik	ff9d46f9dc	feat(eval): table evaluation metrics (#2558 ) This PR adds new table evaluation metrics prepared by @leah1985 The metrics include: - `table count` (check) - `table_level_acc` - accuracy of table detection - `element_col_level_index_acc` - accuracy of cell detection in columns - `element_row_level_index_acc` - accuracy of cell detection in rows - `element_col_level_content_acc` - accuracy of content detected in columns - `element_row_level_content_acc` - accuracy of content detected in rows TODO in next steps: - create a minimal dataset and upload to s3 for ingest tests - generate and add metrics on the above dataset to `test_unstructured_ingest/metrics`	2024-02-22 16:35:46 +00:00

334 Commits