Minor refactor after conversation with @scanny
Updates docstring and how chunking options are accessed.
`self._kwargs.get()` should only be used in the `lazyproperty`
definition of an instance's attribute. Other calls should use
`self.<attribute>`
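A minimal sketch of the convention, assuming a kwargs-backed options class like the one described later in this log (the class shape, attribute name, and default are illustrative):

```python
from unstructured.utils import lazyproperty


class ChunkingOptions:
    def __init__(self, **kwargs):
        self._kwargs = kwargs

    @lazyproperty
    def max_characters(self) -> int:
        """The only place `self._kwargs.get()` should appear for this option."""
        return self._kwargs.get("max_characters") or 500

    def _validate(self) -> None:
        # Everywhere else, read the option through the attribute.
        if self.max_characters <= 0:
            raise ValueError("'max_characters' must be > 0")
```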
Creates a composite metric to represent the table structure score. It is
an average of the existing row and column index and content accuracy scores.
This PR adds a new property to
`unstructured.metrics.table_eval.TableEvaluation`:
`composite_structure_acc`, which is computed from the element-level row
and column index and content accuracy scores. This new metric is meant
to offer a single number to represent the performance of table-structure
extraction models/algorithms.
This PR also refactors the eval computation logic so it uses a constant
`table_eval_metrics` instead of hard-coding the metric names in
multiple places in the code.
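A sketch of the arithmetic, assuming the four element-level scores are already computed (the function shape is illustrative, not the actual `TableEvaluation` code):

```python
import numpy as np


def composite_structure_acc(
    row_index_acc: float,
    col_index_acc: float,
    row_content_acc: float,
    col_content_acc: float,
) -> float:
    """Single-number table-structure score: the mean of the four scores."""
    return float(np.mean([row_index_acc, col_index_acc, row_content_acc, col_content_acc]))
```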
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
**Summary**
Add the actual behavior to populate `.metadata.orig_elements` during
chunking, when so instructed by the `include_orig_elements` option.
**Additional Context**
The underlying structures to support this, namely the
`.metadata.orig_elements` field and the `include_orig_elements` chunking
option, were added in closely prior PRs. This PR adds the behavior to
actually populate that metadata field during chunking when the option is
set.
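A sketch of what this enables for a caller, assuming the chunker accepts the option as described (the file name is hypothetical):

```python
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.docx import partition_docx

elements = partition_docx(filename="report.docx")
chunks = chunk_by_title(elements, include_orig_elements=True)

# Each chunk now carries the pre-chunking elements it was formed from.
for chunk in chunks:
    print([e.category for e in chunk.metadata.orig_elements])
```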
Introduce `date_from_file_object` to `partition*` functions, set to
`False` by default.
If set to `True` and a file is provided via the `file` parameter,
partition will attempt to infer the last-modified date from `file`'s
contents; otherwise the last-modified metadata will be set to `None`.
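A sketch of the opt-in behavior (the file name is hypothetical):

```python
import io

from unstructured.partition.docx import partition_docx

with open("report.docx", "rb") as f:
    file = io.BytesIO(f.read())

# Without date_from_file_object=True, metadata.last_modified stays None here.
elements = partition_docx(file=file, date_from_file_object=True)
print(elements[0].metadata.last_modified)
```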
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Add features to `get_mean_grouping` to allow input as a list of
filenames, provided either as a list of strings or as a `.txt` file, as
sketched below.
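A sketch of the two new input forms, assuming the `get_mean_grouping` call shape shown later in this log (file names and argument order are illustrative):

```python
from unstructured.metrics.evaluate import get_mean_grouping

# Filenames supplied as a list of strings...
get_mean_grouping("doctype", ["a.pdf.txt", "b.docx.txt"], "export_dir", "element_type")

# ...or as a path to a .txt file listing one filename per line.
get_mean_grouping("doctype", "filenames.txt", "export_dir", "element_type")
```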
---------
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
**Summary**
Add `include_orig_elements: bool = True` as a new chunking option. This
PR does not implement _adding_ original elements to chunks, only
accepting this parameter as a chunking option and assigning `True` to it
as a default value when it is omitted as a keyword argument.
Note that this option will also need to be added in other repositories
in order to be fully supported by all access methods. In particular, it
will need to be added in `unstructured-api` in order to become available
via the SDKs.
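A sketch of what this PR covers: the option is accepted and defaulted, nothing more (populating the field lands in a separate PR, described above):

```python
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.text import partition_text

elements = partition_text(text="A title\n\nSome body text.")

# Accepted as a chunking option; True is assumed when the argument is
# omitted. Actually adding original elements to chunks comes later.
chunks = chunk_by_title(elements, include_orig_elements=True)
```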
JavaScript uses `//` for comments instead of `#`
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Fixes a OneDrive bug the same way Ryan fixed the SharePoint error (both
are Microsoft products):
https://github.com/Unstructured-IO/unstructured/pull/2591
https://github.com/Unstructured-IO/unstructured/pull/2592/files
We are seeing occurrences of inconsistency in the timestamps returned by
OneDrive when fetching created and modified dates. Furthermore, in
future versions of this library, a datetime object will be returned
rather than a string.
## Changes
- This adds logic to guarantee OneDrive dates will be properly formatted
as ISO, regardless of the format provided by the OneDrive library.
- Bumps the timestamp format output to include the timezone offset (as
we do with others).
- Adds unit tests for isoformat handling.
`json_to_dict` is already unit tested here:
https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py
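A sketch of the kind of normalization described, assuming `dateutil` handles the inconsistent string formats (the function name is hypothetical):

```python
from datetime import datetime

from dateutil import parser


def ensure_isoformat(value) -> str:
    """Return an ISO 8601 string with timezone offset from either a string
    or a datetime, whichever the OneDrive client happens to return."""
    dt = value if isinstance(value, datetime) else parser.parse(value)
    return dt.astimezone().isoformat()
```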
Adds a small change for AstraDB to allow them to see what source called
their API.
**Summary**
Some typing modernization in `elements.py`, which will receive changes
to add the `orig_elements` metadata field.
Also some additions to `unit_util.py` to enable simplified mocking that
will be required in the next PR.
**Summary**
Use omnibus `kwargs` dict for `ChunkingOptions` state rather than
explicit option parameters.
**Additional Context**
While articulating explicit options for `ChunkingOptions` and its (now
several) sub-classes provides some type-safety, it induces a large
amount of redundancy which complicates updates to the base class and
especially patches to the base class from client code that adds custom
chunkers.
In particular, it makes custom chunkers brittle to any new attributes
added to `ChunkingOptions` (the base class).
Using a single omnibus `kwargs` argument for `ChunkingOptions` and its
subclasses allows each to pull out the options it is interested in and
happily ignore the rest.
The type-safety provided by explicit parameters and types is only
afforded at the single place the options object is constructed, which is
the custom chunker itself. Because this is "internal" code and not part
of the public interface, this is a manageably small loss in type-safety.
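A sketch of the pattern, reusing the hypothetical kwargs-backed base class sketched near the top of this log (names are illustrative):

```python
from unstructured.utils import lazyproperty


class CustomChunkingOptions(ChunkingOptions):  # base class as sketched earlier
    """A plug-in chunker's options: pull out only what you need; options
    added to the base class later pass through `kwargs` untouched."""

    @lazyproperty
    def merge_headers(self) -> bool:
        return bool(self._kwargs.get("merge_headers", False))
```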
Files were being created as a side effect from running tests in
`test_unstructured/metrics/test_evaluate.py`. The updated decorator
removes the created directory and its files after the tests run.
## Testing
On the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch, no files are created.
**Summary**
Add `metadata.is_continuation = True` to metadata of second-and-later
text-split chunks formed from an oversized non-table element. Previously
this metadata was only present on text-split `TableChunk` elements.
This enables downstream filtering of intentionally redundant metadata on
chunk elements that may not be desired for all purposes.
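A sketch of the downstream filtering this enables (the sample text is hypothetical; `is_continuation` is unset, not `False`, on ordinary chunks):

```python
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.text import partition_text

elements = partition_text(text="A heading\n\n" + "word " * 500)
chunks = chunk_by_title(elements, max_characters=200)

# `is_continuation` is True only on second-and-later splits of an oversized
# element -- now for plain text as well as tables.
lead_chunks = [c for c in chunks if not c.metadata.is_continuation]
```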
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
To test:
> cd docs && make html
Changelogs:
* added a note to have **Storage Object Viewer** IAM Role for the GCS
source connector.
* added a note to have **Storage Object Creator** IAM Role for the GCS
destination connector.
## 0.12.6
### Enhancements
* **Improve ability to capture embedded links in `partition_pdf()` for
`fast` strategy** Previously, a threshold value that affects the capture
of embedded links was fixed by default. This change allows users to
specify the threshold value for better capture of embedded links.
* **Refactor `add_chunking_strategy` decorator to dispatch by name.**
Add `chunk()` function to be used by the `add_chunking_strategy`
decorator to dispatch chunking call based on a chunking-strategy name
(that can be dynamic at runtime). This decouples chunking dispatch from
only those chunkers known at "compile" time and enables runtime
registration of custom chunkers.
### Features
* **Added Unstructured Platform Documentation** The Unstructured
Platform is currently in beta. The documentation provides how-to guides
for setting up workflow automation, job scheduling, and configuring
source and destination connectors.
### Fixes
* **Partitioning raises on file-like object with `.name` not a local
file path.** When partitioning a file using the `file=` argument, and
`file` is a file-like object (e.g. io.BytesIO) having a `.name`
attribute, and the value of `file.name` is not a valid path to a file
present on the local filesystem, `FileNotFoundError` is raised. This
prevents use of the `file.name` attribute for downstream purposes to,
for example, describe the source of a document retrieved from a network
location via HTTP.
* **Fix SharePoint dates with inconsistent formatting** Adds logic to
conditionally support dates returned by office365 that may vary in date
formatting or may be a datetime rather than a string.
* **Include warnings** about the potential risk of installing a version
of `pandoc` which does not support RTF files + instructions that will
help resolve that issue.
* **Incorporate the `install-pandoc` Makefile recipe** into relevant
stages of CI workflow, ensuring it is a version that supports RTF input
files.
* **Fix Google Drive source key** Allow passing string for source
connector key.
* **Fix table structure evaluation calculations** Replaced the special
value `-1.0` with `np.nan` and corrected the filtering of file-metric
rows based on that.
* **Fix Sharepoint-with-permissions test** Ignore permissions metadata,
update test.
* **Fix table structure evaluations for edge case** Fixes the issue
where the prediction contains no tables - no longer errors in such
a case.
This PR redefines the `table_level_acc` metric as follows:
- for each predicted table, use the sequence-matching ratio as its accuracy
- as a prerequisite for the sequence matching, we sort the table cells by
row then column for both predicted and ground-truth tables to ensure they
are ordered the same
- average the accuracy over all predicted tables
- any prediction without a matching ground truth (a false positive)
decreases the score
- a prediction that splits a ground-truth table into smaller tables also
scores low, with perfectly equal splits scoring lowest
This new definition makes the metric a value between 0 and 1 per
file. It replaces the existing definition, where the metric was the ratio
of (the number of predicted tables with a match to ground truth) to (the
number of ground-truth tables). That existing metric actually gives
higher values to predictions that split tables and can exceed 1. The new
definition prefers predictions that do not split ground-truth tables.
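A sketch of the per-table scoring, assuming cells carry row/column indices and content (the cell dict keys are hypothetical):

```python
from difflib import SequenceMatcher


def predicted_table_acc(predicted_cells: list, ground_truth_cells: list) -> float:
    """Sequence-matching ratio in [0, 1] for one predicted table against its
    matched ground truth, after sorting both by row then column."""

    def ordered_contents(cells):
        return [c["content"] for c in sorted(cells, key=lambda c: (c["row"], c["col"]))]

    return SequenceMatcher(
        None, ordered_contents(predicted_cells), ordered_contents(ground_truth_cells)
    ).ratio()
```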
The docker-publish github actions workflow builds amd and arm images of
the repository and tests them before publishing. These tests have been
failing since [this
commit](ee8b0f93dc)
with an error `UNS_API_KEY environment variable not set`.
The issue is that [this
line](b27ad9b6aa/.github/workflows/docker-publish.yml (L62))
in the workflow was blowing away the value assigned to the file
in the previous line.
## Changes
* Update the line that was overwriting the assignment of `UNS_API_KEY` to
the uns_test_env_file in the docker-publish workflow to use the `>>`
operator, so that the `UNSTRUCTURED_HF_TOKEN` assignment is only appended.
* [bonus]: arithmetic expansion in version-sync.sh to keep shellcheck
happy
## Testing
To validate, I edited the docker-publish workflow to trigger on push
(and to run the tests but not publish) in [this
commit](0f04f5f0f7).
The successful test results can be reviewed
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/8199826803).
**Summary**
Fixes: #2308
**Additional context**
Through a somewhat deep call-chain, partitioning a file-like object
(e.g. io.BytesIO) having its `.name` attribute set to a path not
pointing to an actual file on the local filesystem would raise
`FileNotFoundError` when the last-modified date was being computed for
the document.
This scenario is a legitimate partitioning call, where `file.name` is
used downstream to describe the source of, for example, a bytes payload
downloaded from the network.
**Fix**
- explicitly check for the existence of a file at the given path before
accessing it to get its modified date. Return `None` (already a
legitimate return value) when no such file exists.
- Generally clean up the implementations.
- Add unit tests that exercise all cases.
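A sketch of the fix's shape (the actual helper lives in the partitioning internals and may differ):

```python
import os
from datetime import datetime
from typing import Optional


def last_modified_date(filename: str) -> Optional[str]:
    """ISO-formatted last-modified date, or None when `filename` does not
    name a file on the local filesystem (e.g. it describes a network source)."""
    if not os.path.isfile(filename):
        return None
    return datetime.fromtimestamp(os.path.getmtime(filename)).isoformat()
```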
---------
Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
The current way table structure metrics are computed does not cover
cases where no table is found and all stats are empty.
This PR fixes this and adds some hardening tests for the table eval
processor.
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
### Description
The only real change here is adding this line:
```python
apply_name_overload=apply_name_overload,
```
Everything else is from running `make tidy`.
This change makes the annotation threshold for links configurable as an
optional parameter.
The reason for the change is that we ran into issues extracting simple
text links from PDF documents that were created with MS Word. The sample
PDF from unstructured worked with the default value of 0.9, while the
PDF generated with Word required a threshold of approximately 0.67.
We use unstructured together with LangChain within an automated
container deployment, and being able to set `annotation_threshold`
(refactored from `threshold`) is very helpful.
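A sketch of the intended usage (the file name is hypothetical; the threshold a given document needs will vary):

```python
from unstructured.partition.pdf import partition_pdf

# Word-generated PDFs may need a lower threshold than the 0.9 default
# for their link annotations to be captured.
elements = partition_pdf("word-generated.pdf", strategy="fast", annotation_threshold=0.6)
```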
---------
Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Linting and typing fixes, and add tests to improve test coverage in
utils.py
On the main branch, run `coverage run -m pytest
test_unstructured/test_utils.py` and then `coverage report -m
unstructured/utils.py` to see test coverage for `utils.py`. Check out
this branch and do the same. The percent coverage should increase to 88%.
---------
Co-authored-by: David Potter <potterdavidm@gmail.com>
### Description
Currently the requirements associated with an extra in `setup.py` are
dynamically generated using the `load_requirements()` method in
the same file. It is passed all the `.in` files, which are then
read line by line to generate the requirements associated with an
extra. Unless the `.in` file itself has a version pin, this will never
respect the `.txt` files generated by `pip-compile`. This fix
updates all the inputs to `load_requirements()` to use the `.txt` files
themselves.
Adding `metadata.data_source.permissions_data` to the
`--metadata-exclude` list in sharepoint-with-permissions.sh to prevent
the SharePoint deprecation warning from breaking the test.
Updating expected-structured-output.
As per Ahmet's comment, we do want to check SharePoint permissions
metadata at some point, but that will take a separate type of test; a
file-diff test is too unstable. Permissions checking will come later down
the road.
**Summary**
This is the final step in adding pluggable chunking-strategies. It
introduces the `chunk()` function to replace calls to strategy-specific
chunkers in the `@add_chunking_strategy` decorator. The `chunk()`
function then uses a mapping of chunking-strategy names (e.g.
"by_title", "basic") to chunking functions (chunkers) to dispatch the
chunking call. This allows other chunkers to be added at runtime rather
than requiring a code change, which is what makes chunkers "pluggable".
**Additional Information**
- Move the `@add_chunking_strategy` to the new `chunking.dispatch`
module since it coheres strongly with that operation, but publish it
from `chunking(.__init__)` (as it was before) so users don't couple to
the way we organize the chunking sub-package. Also remove the third
level of nesting as it's unrequired in this case.
- Add unit tests for the `@add_chunking_strategy` decorator which was
previously uncovered by any direct test.
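A self-contained sketch of the dispatch idea (not the library's actual registration API):

```python
from typing import Callable, Dict, List

# A name -> chunker mapping, standing in for the one chunk() consults.
_chunkers: Dict[str, Callable] = {}


def register_chunker(name: str, chunker: Callable) -> None:
    """Make a chunker dispatchable by strategy name, even from client code."""
    _chunkers[name] = chunker


def chunk(elements: List, chunking_strategy: str, **kwargs) -> List:
    """Dispatch the chunking call by name rather than by compile-time import."""
    try:
        chunker = _chunkers[chunking_strategy]
    except KeyError:
        raise ValueError(f"unrecognized chunking strategy {chunking_strategy!r}")
    return chunker(elements, **kwargs)
```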
Closes #2577
Testing:
```python
from unstructured.partition.html import partition_html

cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)

links = []
for element in elements:
    if element.metadata.link_urls:
        relative_link = element.metadata.link_urls[0][1:]
        if relative_link.startswith("2024"):
            links.append(f"{cnn_lite_url}{relative_link}")

print(links)
```
---------
Co-authored-by: ron-unstructured <ronny@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
The Google Drive Service account key can be a dict or a file path (str).
We have successfully been using the path, but the dict can also end up
being stored as a string that needs to be deserialized, and the
deserialization can have issues with single and double quotes.
This PR adds grouping functionality to `evaluate.py`.
To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py` or
call `get_mean_grouping(<doctype or connector>, <dataframe or path to
tsv file>, <export directory>, "element_type")`
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
We are seeing occurrences of inconsistency in the timestamps returned by
office365.sharepoint when fetching created and modified dates.
Furthermore, in future versions of this library, a datetime object will
be returned rather than a string.
## Changes
- This adds logic to guarantee SharePoint dates will be properly
formatted as ISO, regardless of the format provided by the sharepoint
library.
- Bumps timestamp format output to include timezone offset (as we do
with others)
## Testing
Unit test added to validate this datetime handling across various
formats.
---------
Co-authored-by: David Potter <potterdavidm@gmail.com>
The purpose of this PR is to support using the same type of parameters
as `partition_*()` when using `partition_via_api()`. This PR works
together with the `unstructured-api` [PR
#368](https://github.com/Unstructured-IO/unstructured-api/pull/368).
**Note:** This PR will support extracting image blocks ("Image", "Table")
via `partition_via_api()`.
### Summary
- update `partition_via_api()` to convert all list type parameters to
JSON formatted strings before passing them to the unstructured client
SDK
- add a unit test function to test extracting image blocks via
`partition_via_api()`
- add a unit test function to test list type parameters passed to API
via unstructured client sdk
### Testing
```python
from unstructured.partition.api import partition_via_api

elements = partition_via_api(
    filename="example-docs/embedded-images-tables.pdf",
    api_key="YOUR-API-KEY",
    strategy="hi_res",
    extract_image_block_types=["image", "table"],
)

image_block_elements = [el for el in elements if el.category in ("Image", "Table")]
print("\n\n".join(el.metadata.image_mime_type for el in image_block_elements))
print("\n\n".join(el.metadata.image_base64 for el in image_block_elements))
```
This PR removes `extract_image_block_to_payload` section from "API
Parameters" page. The "unstructured" API does not support the
`extract_image_block_to_payload` parameter, and it is always set to
`True` internally on the API side when trying to extract image blocks
via the API. Users only need to specify `extract_image_block_types`
parameter when extracting image blocks via the API.
**NOTE:** The `extract_image_block_to_payload` parameter is only used
when calling `partition()`, `partition_pdf()`, and `partition_image()`
functions directly.
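For contrast, a sketch of the local call where the flag does apply (file name as in the example earlier in this log):

```python
from unstructured.partition.pdf import partition_pdf

# Locally the payload flag must be set explicitly; the hosted API sets it
# internally, so API callers pass only extract_image_block_types.
elements = partition_pdf(
    "embedded-images-tables.pdf",
    strategy="hi_res",
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
)
```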
### Testing
CI should pass.
Thanks to Eric Hare @erichare at DataStax, we have a new destination
connector.
This Pull Request implements an integration with [Astra
DB](https://datastax.com), which allows the Astra DB Vector Database
to work with Unstructured's set of integrations.
To create your Astra account and authenticate with your
`ASTRA_DB_APPLICATION_TOKEN` and `ASTRA_DB_API_ENDPOINT`, follow these
steps:
1. Create an account at https://astra.datastax.com
2. Login and create a new database
3. From the database page, in the right-hand panel, you will find your
API Endpoint
4. Beneath that, you can create a Token to be used
Some notes about Astra DB:
- Astra DB is a Vector Database which allows for high-performance
database transactions, and enables modern GenAI apps [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html)
- It supports similarity search via a number of methods [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics)
- It also supports non-vector tables / collections
**Summary**
A pluggable chunking strategy needs its own local set of chunking
options that subclasses a base-class in `unstructured`.
Extract distinct `_ByTitleChunkingOptions` and `_BasicChunkingOptions`
for the existing two chunking strategies and move their
strategy-specific option setting and validation to the respective
subclass.
This was also a good opportunity for us to clean up a few odds and ends
we'd been meaning to.
Might be worth looking at the commits individually as they are cohesive
incremental steps toward the goal.
### Summary
Detects headers and footers when using `partition_pdf` with the fast
strategy. Identifies elements that are positioned in the top or bottom
5% of the page as headers or footers. If no coordinate information is
available, an element won't be detected as a header or footer.
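A sketch of the positional rule, assuming a top-left coordinate origin where y increases downward (the actual implementation may normalize coordinates differently):

```python
from typing import Optional


def header_or_footer(top: float, bottom: float, page_height: float) -> Optional[str]:
    """Classify an element by vertical position: entirely within the top 5%
    of the page -> Header; entirely within the bottom 5% -> Footer."""
    if bottom <= 0.05 * page_height:
        return "Header"
    if top >= 0.95 * page_height:
        return "Footer"
    return None  # also the result when coordinates are unavailable
```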
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Separate the aggregating functionality of `text_extraction_accuracy`
into a stand-alone function to avoid duplicated eval effort when the
granular-level eval is already available.
To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py`
locally
**Summary**
Refactoring as part of `partition_xlsx()` algorithm replacement that was
delayed by some CI challenges.
A separate PR because it is cohesive and relatively independent from the
prior PR.
This PR adds new table evaluation metrics prepared by @leah1985.
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows
TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`
**Summary**
In order to accommodate customized chunkers other than those directly
provided by `unstructured`, some further modularization is necessary
such that a new chunker can be added as a "plug-in" without modifying
the `unstructured` library code.
This PR contains the straightforward refactoring required for this
process, like typing changes. There are also some other small changes
we've been meaning to make, like having all chunking options accept
`None` to represent their default value, so the broad field of callers
(e.g. ingest, unstructured-api, SDK) doesn't need to determine and set
default values for chunking arguments, which leads to diverging defaults.
Isolating these "noisy" but easy to accept changes in this preparatory
PR reduces the noise in the more substantive changes to follow.
To provide more utility to the `catch_overlapping_and_nested_bboxes` and
`identify_overlapping_or_nesting_case` functions, this includes
parent_element as part of the output.
This allows users to:
- identify the parent element in the overlapping case: `nested {type*}
in {type*}`. Currently, if the element types are the same, an example
case output would be `nested Image in Image`, which is confusing.
- easily identify elements to keep or delete
**Summary**
For whatever reason, the `@add_chunking_strategy` decorator was not
present on `partition_json()`. This broke the only way to accomplish a
"chunking-only" workflow using the REST API. This PR remedies that
problem.