unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-01 10:33:09 +00:00

Author	SHA1	Message	Date
Nathan Van Gheem	da29242dbd	rfctr: implement mongodb v2 destination connector (#3313 ) This PR provides support for V2 mongodb destination connector.	2024-07-02 16:40:51 +00:00
Roman Isecke	c28deffbc4	bugfix/isolate ingest v2 dependencies (#3327 ) ### Description This PR handles two things: * Exposing all the connectors via the connector registries by simply importing the connector module. This should be safe assuming all connector specific dependencies themselves are imported in the methods where they are used and wrapped in `@requires_dependencies` decorator * Remove any import that pulls from the v2 ingest.cli package	2024-07-02 14:07:54 +00:00
Pluto	5d89b41b1a	Fix not counting false negatives and false positives in table metrics (#3300 ) This pull request fixes the table metric calculation for three cases: - False Negatives: when a table exists in the ground truth but none of the predicted tables matches it, the table should count as 0 and the file should not be completely skipped (before it was np.NaN). - False Positives: when a predicted table doesn't match any ground truth table, it should be counted as 0; previously it was skipped in processing (matched_indices==-1). - The file should be completely skipped only if there are no tables in either the ground truth or the prediction. In short, the previous metric calculation didn't account for OD mistakes.	2024-07-02 10:07:24 +00:00
Ahmet Melek	72f28d7a11	feat: add v2 pinecone destination connector (#3286 ) This PR adds a V2 version of the Pinecone destination connector	2024-07-01 23:22:06 +00:00
David Potter	a18b21c06e	rfctr [P6M-397]: opensearch source connector v2 (#3302 ) Updates opensearch source connector to v2. Leverages elasticsearch v2 heavily. Expected tests renamed because that's how Elasticsearch names them.	2024-07-01 20:35:26 +00:00
Matt Robinson	db8617872b	build: image and dependency updates; fix tesseract files locations (#3310 ) ### Summary Updates to the latest version of the `wolfi-base` image. Changes include: - Version bumps to address CVEs - `libreoffice` is now included in the `arm64` image. `.doc` files are now supported for `arm64`. `.ppt` files do not work with the `libreoffice` package currently available on `wolfi-os`. We have follow-on work to look into that. - Updates the location of the `tesseract` `tessdata` files on the `arm64` build. Closes #3290. - Closes #3319 and adds `psutil` to the base dependencies. ### Testing - `test_dockerfile` should continue to pass with the updates.	2024-07-01 19:39:32 +00:00
David Potter	9eb4c96b94	fix: update slack test to point to new channel (#3328 ) When we switched community Slack from Paid to Free we lost the CI test bot. Also if messages delete after 90 days then our expected test data will disappear. - created a new bot in our paid company slack (test_unstructured_ingest_bot) - added a new private channel (test-ingest) - invited the bot to the channel - adjusted the end datetime of the test to cover the first few messages in the channel Still to do: - update the CI secrets with the new bot token - update the LastPass with the new bot token (I don't have write access.. :(.	2024-07-01 18:11:21 +00:00
Matt Robinson	116200559b	docs: add link to serverless api in readme (#3322 ) ### Summary Adds links to the serverless api.	2024-07-01 07:39:12 -04:00
David Potter	15f80c4ad6	rfct [P6M]-392: OpenSearch V2 Destination Connector (#3293 ) Migrates OpenSearch destination connector to V2. Relies a lot on the Elasticsearch connector where possible. (this is expected)	2024-06-28 20:51:23 +00:00
Matt Robinson	4a71bbb44c	release: version 0.14.9 (#3312 ) ### Summary Release for `0.14.9`. 0.14.9	2024-06-27 20:51:24 +00:00
Roman Isecke	137b149be8	Bugfix/ingest pipeline check (#3303 ) ### Description Using a `isinstance` on the destination registry mapping breaks when inheritance is used for the associated uploader types. This adds a connector type field to all uploaders so that the entry can be deterministically fetched when running check for associated stager in pipeline.	2024-06-27 16:35:37 +00:00
Steve Canny	087adb218f	feat(docx): differentiate no-file from not-ZIP (#3306 ) Summary The `python-docx` error `docx.opc.exceptions.PackageNotFoundError` arises both when no file exists at the given path and when the file exists but is not a ZIP archive (and so is not a DOCX file). This ambiguity is unwelcome when diagnosing the error as the two possible conditions generally indicate a different course of action to resolve the error. Add detailed validation to `DocxPartitionerOptions` to distinguish these two and provide more precise exception messages. Additional Context - `python-pptx` shares the same OPC-Package (file) loading code used by `python-docx`, so the same ambiguity will be present in `python-pptx`. - It would be preferable for this distinguished exception behavior to be upstream in `python-docx` and `python-pptx`. If we're willing to take the version bump it might be worth considering doing that instead.	2024-06-27 00:18:56 +00:00
Roman Isecke	54ec311c55	feat/migrate onedrive src (#3295 ) ### Description Migrate the onedrive source connector to v2, adding richer content pulled from the SDK response to add further metadata to the FileData produced by the indexer. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-06-26 23:59:51 +00:00
Matt Robinson	6939bff49e	build(deps): bump langchain-community version (#3305 ) ### Summary Bumps to the latest `langchain-community` version to resolve [CVE-2024-2965](https://nvd.nist.gov/vuln/detail/CVE-2024-2965). --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2024-06-26 22:42:32 +00:00
Roman Isecke	5179b739fa	feat/more conservative ingest logging (#3301 ) ### Description Isolate all log statements that happen per record and make them debug level to avoid bloating the console output.	2024-06-26 16:54:08 +00:00
Pawel Kmiecik	575957b2d2	feat: enhance analysis options with od model dump and better vis (#3234 ) This PR adds new capabilities for drawing bboxes for each layout (extracted, inferred, ocr and final) + OD model output dump as a json file for better analysis. --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com> Co-authored-by: Michal Martyniak <michal.martyniak@deepsense.ai>	2024-06-26 13:14:55 +00:00
Steve Canny	f2fee0c32f	fix(auto): partition() passes strategy to DOC,ODT (#3278 ) Summary Remedy gap where `strategy` argument passed to `partition()` was not forwarded to `partition_doc()` or `partition_odt()` and so was not making its way to `partition_docx()`.	2024-06-26 00:29:47 +00:00
qued	0665e94b96	build: move numpy pin to packaging (#3296 ) Moved numpy pin to `base.in` where it will be picked up by packaging. Side note: `constraints.txt` (formerly `constraints.in`) is a really useful pattern: you put a constraint there, add that file as a `-c` requirement in other files, and the constraint will be applied when pip-compiling only when needed because the library is required by something else. Neat! However, unfortunately, in my searches I've never found a similar pattern for packaging, so any pins we want to propagate to user installs need to be explicitly placed in the `.in` files. So what is `constraints.txt` really doing for us? Well in the past I think there have been instances where something is temporarily broken in an upstream dependency but we expect it to be patched soon, but in the meantime we want things to work in our CI builds and development installs, so it's not worth pinning everywhere it's used. Having said that, I'm coming to the conclusion that `constraints.txt` causes more harm than good in the confusion it causes WRT packaging -- maybe we should remove that pattern at some point.	2024-06-25 21:08:25 +00:00
Yao You	c32aeaac44	fix: wait to run soffice until there is no other soffice process running (#3287 ) ## Summary This PR addresses an issue where the code could attempt to run `soffice` in multiple processes and closes #3284 The fix is to add a wait mechanism when another `soffice` process is already running. ## Diagnosis of issue - `soffice` can only have one process running when using the command `soffice` as is. - on main branch the function `partition.common.convert_office_doc` simply spawns a subprocess to run the `soffice` command to convert a `doc` or `ppt` file into `docx` or `pptx` format. - if there are multiple partition calls to process `doc` or `ppt` files and they all want to spawn `soffice` subprocesses, only one will succeed while the other processes will simply fail and return 1 from the subprocess - downstream this will lead to errors like `PackageNotFoundError: Package not found at '/tmp/tmpac6lcu4w/document.docx'` ## solution While there are [ways](https://www.reddit.com/r/libreoffice/comments/agk3os/how_to_open_more_than_one_calc_instance_under/) to circumvent the limit of `soffice` by setting a tmp file as user installation env, these kinds of solutions rely on the internals of `soffice` and add the maintenance cost of tracking its changes. This PR solves this problem by adding a wait mechanism: - we first spawn a subprocess to run `soffice` - if the `stdout` is empty and we still have wait time budget left, the function first checks if there is another `soffice` running * if yes, the function waits for 0.01s before checking again; * if no, the function spawns a subprocess to run `soffice` and returns to the beginning of this step * we need to return to the beginning to check if `stdout` is empty because we could have another collision right after `soffice` becomes available. ## test This PR adds two unit tests. Additionally this can be tested by running partition of `.doc` files locally with multiprocessing.	2024-06-25 18:49:27 +00:00
Roman Isecke	a7a53f6fcb	feat/migrate astra db (#3294 ) ### Description Move astradb destination connector over to the new v2 ingest framework	2024-06-25 18:00:47 +00:00
Roman Isecke	3f581e6b7d	feat/migrate gdrive source connector (#3239 ) ### Description Migrate the google drive source connector over to the new v2 ingest framework and include a variety of improvements as part of the refactor: * The ID is no longer limited to a drive id but can also be the id of a subfolder within a drive or a file directly and each case is handled appropriately * More metadata is pulled in from google drive to enrich the partitioned elements downstream and now the modified date is being set to not reprocess if the ingest pipeline already has the file cached * timing information is set on the file created when downloaded based on the last modified date retrieved from google drive --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-06-25 12:55:28 +00:00
Roman Isecke	e0f4374386	Roman/bugfix conflicting event loop ingest (#3264 ) ### Description In use cases where an external system (such as code being run in a jupyter notebook) already has a running event loop, run the async code in a dedicated thread pool to not conflict with the existing event loop. This also has a variety of fixes that were found when putting together a demo leveraging the elasticsearch destination connector	2024-06-24 18:47:37 +00:00
Christine Straub	ab88e20575	chore: bump unstructured-inference 0.7.36 (#3275 ) ### Summary - bump unstructured-inference to `0.7.36` which fixed `ValueError` when converting cells to HTML in the table processing subpipeline - cut a release for `0.14.8` --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> 0.14.8	2024-06-24 13:07:22 +00:00
Nathan Van Gheem	ce591e2e5d	Fix missing sensitive fields for embedders (#3263 ) A few embedder types are missing sensitive field annotations.	2024-06-24 12:47:50 +00:00
David Potter	88b08a734d	rfctr: chroma destination migrated to V2 (#3214 ) Moving Chroma destination to the V2 version of connectors.	2024-06-23 00:19:29 +00:00
David Potter	8610bd3ab9	feat: Kafka source and destination connector (#3176 ) Thanks to @tullytim we have a new Kafka source and destination connector. It also works with hosted Kafka via Confluent. Documentation will be added to the Docs repo.	2024-06-22 23:26:23 +00:00
Matt Robinson	2d965fd65e	build: switch arm64 image to wolfi-base (#3268 ) ### Summary Updates the `arm64` build to use the same `Dockerfile` as `amd64`, since there are now upstream base images for `wolfi-base` for both architectures. The legacy `rockylinux-9.4` is now stashed in a subdirectory of the `docker` directory and is no longer built in CI, but is available if users would like to build it themselves. Additionally, this PR includes a fix to symlink `python3` to `python3.11`, which had caused a CI failure [here](https://github.com/Unstructured-IO/unstructured/actions/runs/9619486931/job/26535697755). BREAKING CHANGE: the `arm64` image no longer supports `.doc`, `.pptx`, or `.xls` because we do not yet have a `libreoffice` `apk` built for `wolfi-base`. We intend to address that as a follow-on. All other filetypes work. ### Testing Successfully docker builds, tests, and smoke tests for [amd64](https://github.com/Unstructured-IO/unstructured/actions/runs/9619458140/job/26535610735?pr=3268) and [arm64](https://github.com/Unstructured-IO/unstructured/actions/runs/9619458140/job/26535610341?pr=3268) on the feature branch (with publish disabled).	2024-06-22 05:10:29 +00:00
Yao You	edddf9f6ee	Feat/pass down strategy to partition ppt as well (#3274 ) Following the same pattern of https://github.com/Unstructured-IO/unstructured/pull/3273 and pass down `strategy` parameter to `partition_ppt` as well.	2024-06-22 02:23:58 +00:00
Steve Canny	16df6944dd	fix(auto): partition() passes strategy to PPTX,DOCX (#3273 ) Summary Remedy gap where `strategy` argument passed to `partition()` was not forwarded to `partition_pptx()` or `partition_docx()`.	2024-06-22 00:16:39 +00:00
Steve Canny	6fe1c9980e	rfctr(html): prepare for new html parser (#3257 ) Summary Extract as much mechanical refactoring from the HTML parser change-over into the PR as possible. This leaves the next PR focused on installing the new parser and the ingest-test impact. Reviewers: Commits are well groomed and reviewing commit-by-commit is probably easier. Additional Context This PR introduces the rewritten HTML parser. Its general design is recursive, consistent with the recursive structure of HTML (tree of elements). It also adds the unit tests for that parser but it does not _install_ the parser. So the behavior of `partition_html()` is unchanged by this PR. The next PR in this series will do that and handle the ingest and other unit test changes required to reflect the dozen or so bug-fixes the new parser provides.	2024-06-21 20:59:48 +00:00
Matt Robinson	e1b75539f7	build: fix amd64 image hash (#3272 ) ### Summary Sets to latest hash in quay.	2024-06-21 19:57:31 +00:00
Christine Straub	14f149d43c	fix: update base image SHA for amd64 wolfi (#3270 ) This PR aims to fix a `test_dockerfile` job [failure](https://github.com/Unstructured-IO/unstructured/actions/runs/9613636416/job/26517074221?pr=3234) in CI after `base-images` repo update.	2024-06-21 18:30:33 +00:00
Matt Robinson	80abbcd5a8	build: version bump for release 0.14.7 (#3259 ) ### Summary Release PR for `0.14.7`. 0.14.7	2024-06-20 14:10:33 -04:00
Austin Walker	0b73978b92	fix: fix `IndexError` when partitioning a pdf with `starting_page_number` (#3246 ) The Issue: When extracting images from pdfs, we use the metadata page number to index into a list of the images. However, the metadata page number can now be changed via `starting_page_number`. To get the true page index, we need to subtract this value. Testing: Run this snippet in a python shell. Before the fix, this throws an IndexError. On this branch, it will return the elements. ``` from unstructured.partition.auto import partition filename = "example-docs/layout-parser-paper-with-table.pdf" partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20) ``` --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-06-19 18:20:54 +00:00
Pawel Kmiecik	c3af03d5ac	feat: expose converters deckerd -> html and back (#3233 ) This PR exposes functions in evaluation module for easy conversion between tables in Deckerd and HTML formats, which are useful in evaluation experiments.	2024-06-19 07:03:38 +00:00
Christine Straub	f23d180d34	fix: docker image publishing error (#3238 ) This PR aims to fix a docker image publishing error caused by user changes when pulling the `amd64` image from the `unstructured` `wolfi-base` image. (https://github.com/Unstructured-IO/unstructured/pull/3213). --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-06-18 21:01:42 +00:00
Roman Isecke	fd98cf9ea5	Roman/migrate es dest (#3224 ) ### Description Migrate elasticsearch destination connector to new v2 ingest framework	2024-06-18 14:20:49 +00:00
Christine Straub	b47e6e9fdc	refactor: remove download packages step (#3225 ) This PR aims to remove the download packages step since all of that gets installed in the base images. This PR also updates the base `wolfi` image because the original base image can not be found anymore: https://github.com/Unstructured-IO/unstructured/actions/runs/9555654898/job/26339587945	2024-06-18 12:15:44 +00:00
Steve Canny	77a9e1b54d	rfctr(html): drop convert_and_partition_html() (#3215 ) Summary Remove `unstructured.partition.html.convert_and_partition_html()`. Move file-type conversion (to HTML) responsibility to each brokering partitioner that uses that strategy and let them call `partition_html()` for themselves with the result. Additional Context Rationale: - `partition_html()` does not want or need to know which partitioners might broker partitioning to it. - Different brokering partitioners have their own methods to convert their format to HTML and quirks that may be involved for their format. Avoid coupling them so they can evolve independently. - The core of the conversion work is already encapsulated in `unstructured.partition.common.convert_file_to_html_text_using_pandoc()`. - `convert_and_partition_html()` represents an additional brokering layer with the entailed complexities of an additional site for default parameter values to be (mis-)applied and/or dropped and is an additional location for new parameters to be added.	2024-06-17 19:43:18 +00:00
Roman Isecke	d876a386ed	Roman/fix ingest async connectors (#3210 ) ### Description Choosing async for a connector requires care: if a connector is set to use async, the pipeline will not fan out the inputs via multiprocessing but will instead be limited to a single process, under the assumption that it benefits more from async due to heavy network traffic. This means that code that is not optimized for async and is blocking will make the pipeline perform worse than if the connector had never been marked async, since the pipeline would otherwise fan that work out using multiprocessing. All connectors and processes in the pipeline were revisited to make sure this criterion was met and updated accordingly: * Currently the unstructured client does not support making requests async, so this was moved over to use multiprocessing * fsspec connector was updated to use the async client from the fsspec library. This also required that the client be a `@property` fetched on demand, otherwise the client would break the multiprocessing pool since it maintains a thread lock and that can't be pickled when the fsspec connector doesn't support async. * elasticsearch was also updated to use the async client * weaviate only recently came out with async support in their SDK at a version that is higher than we can use in the open source repo, so a TODO was left but otherwise moved to use multiprocessing * none of the underlying embedders use async, so the embedder step must use multiprocessing for now. TODO left to update underlying embedder classes to optionally support async. * Chunking parameters were not accurately being passed through from cli to chunker params, this was fixed --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-06-17 16:55:19 +00:00
Frederic Marvin Abraham	6220633d3f	enhancement: make tempfiles windows friendly (#3108 ) ### Summary Updates handling of tempfiles so that they work on Windows systems. --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2024-06-17 13:28:48 -04:00
Matt Robinson	2815226b54	build(deps): version bumps for 2024-06-17 (#3220 ) ### Summary Version bumps for the week of 2024-06-17. There is now a pin on `numpy` due to a breaking change in the latest version that we'll need to investigate and remove in a subsequent PR.	2024-06-17 14:04:29 +00:00
Steve Canny	9fae0111d9	rfctr(html): drop HTML-specific elements (#3207 ) Summary Remove HTML-specific element types and return "regular" elements like `Title` and `NarrativeText` from `partition_html()`. Additional Context - An aspect of the legacy HTML partitioner was the use of HTML-specific element types used to track metadata during partitioning. - That role is no longer necessary or desirable. - HTML-specific elements like `HTMLTitle` and `HTMLNarrativeText` were returned from partitioning HTML but also the seven other file-formats that broker partitioning to HTML (convert-to-HTML and partition_html()). This does not cause immediate breakage because these are still `Text` element subtypes, but it produces a confusing developer experience. - Remove the prior metadata roles from HTML-specific elements and remove those element types entirely.	2024-06-15 00:14:22 +00:00
Matt Robinson	08383a27de	build: pull from wolfi base image (#3213 ) ### Summary Updates the `wolfi` image to pull from the upstream `wolfi-base` base image to avoid maintaining the base layers in both locations. Closes #3105 by pulling in the fix from upstream. ### Testing `test_dockerfile` should continue to pass with the changes.	2024-06-14 20:41:27 +00:00
Christine Straub	9552fbbfbf	chore: bump unstructured-inference 0.7.35 (#3205 ) ### Summary - bump unstructured-inference to `0.7.35` which fixed syntax for generated HTML tables - update unit tests and ingest test fixtures to reflect changes in the generated HTML tables - cut a release for `0.14.6` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> 0.14.6	2024-06-14 18:11:38 +00:00
Roman Isecke	a6c09ec621	Roman/dry ingest pipeline step (#3203 ) ### Description The main goal of this was to reduce the duplicate code that was being written for each ingest pipeline step to support async and not async functionality. Additional bug fixes found and fixed: * each logger for ingest wasn't being instantiated correctly. This was fixed to instantiate in the beginning of a pipeline run as soon as the verbosity level can be determined. * The `requires_dependencies` wrapper wasn't wrapping async functions correctly. This was fixed so that `asyncio.iscoroutinefunction()` gets triggered correctly.	2024-06-14 13:46:44 +00:00
Pawel Kmiecik	29e64eb281	feat: table evaluations for fixed html table generation (#3196 ) Update to the evaluation script to handle correct HTML syntax for tables. See https://github.com/Unstructured-IO/unstructured-inference/pull/355 for details. This change: - modifies transforming HTML tables to evaluation internal `cells` format - fixes the indexing of the output (internal format cells) when HTML cells use spans	2024-06-14 09:03:27 +00:00
Roman Isecke	dadc9c6d0b	feat/tqdm ingest support (#3199 ) ### Description Add in tqdm support to show progress bar of status of each job when being run. Supported for each mode (serial, async, multiprocess). Also small timing wrapper around jobs to print out how long it took in total.	2024-06-13 18:41:54 +00:00
Steve Canny	f5ebb209a4	rfctr(html): drop page concept (#3184 ) Summary Pagination of HTML documents is currently unused. The `Page` class and concept were deeply embedded in the legacy organization of HTML partitioning code due to the legacy `Document` (= pages of elements) domain model. Remove this concept from the code such that elements are available directly from the partitioner. Additional Context - Pagination can be re-added later if we decide we want it again. A re-implementation would be much simpler and much lower impact to the structure of the code and introduce much less additional complexity, similar to the approach we take in `partition_docx()`.	2024-06-13 18:19:42 +00:00
ryannikolaidis	da3492b529	fix: dropbox source connector file path bugs (#3189 ) The Dropbox source connector currently raises exceptions when indexing files due to two issues: a path formatting idiosyncrasy of the Dropbox library and a divergence in the definition of the Dropbox library's fs.info method, expecting a 'url' parameter rather than 'path'. ## Changes * add a `/` prefix to file path used by DropboxIndexer * override the fsspec sterilize_info method in DropboxIndexer to call `self.fs.info` with `url` rather than `path`; to accommodate for the fact that `dropboxdrivefs` diverges with this signature * remove `dropbox.sh` from ignored source tests * update test fixtures (now that the dropbox connector has been fixed and not skipped) ## Testing `dropbox.sh` source ingest test now succeeds (and is no longer ignored) --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com> Co-authored-by: Christine Straub <christinemstraub@gmail.com>	2024-06-13 18:06:41 +00:00
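
Note on 087adb218f (#3306): that commit message describes splitting the ambiguous `PackageNotFoundError` into two conditions, "no file at the path" versus "file exists but is not a ZIP archive". The following is a minimal sketch of that distinction using only the standard library. The function name `check_docx_path` and the exception types chosen here are assumptions for illustration; this is not the validation code added to `DocxPartitionerOptions`.

```python
import os
import zipfile


def check_docx_path(path: str) -> None:
    """Hypothetical pre-check run before handing `path` to a DOCX loader.

    Raises FileNotFoundError when nothing exists at `path`, and ValueError
    when a file exists but is not a ZIP archive (so it cannot be a DOCX).
    """
    # Condition 1: there is no regular file at the given path.
    if not os.path.isfile(path):
        raise FileNotFoundError(f"no file at path {path!r}")
    # Condition 2: the file exists but lacks the ZIP structure DOCX requires.
    if not zipfile.is_zipfile(path):
        raise ValueError(f"file {path!r} is not a ZIP archive, so it is not a DOCX file")
```

A caller could run this check first and let a valid path fall through to the normal loading code, so each failure mode points to its own remedy (fix the path versus fix or convert the file).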


1605 Commits