**Summary**
Improve element-type mapping for Chinese text. Fixes bug where Chinese
text would produce large numbers of false-positive `Title` elements.
Fixes #3084
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
**Summary**
Prepare auto-partitioning for pluggable partitioners.
Move toward a uniform partitioner call signature in `auto/partition()`
such that a custom or override partitioner can be registered without
requiring code changes.
**Additional Context**
The central job of `auto/partition()` is to detect the file-type of the
given file and use that to dispatch partitioning to the corresponding
partitioner function, e.g. `partition_pdf()` or `partition_docx()`.
In the existing code, each partitioner function is called with
parameters "hand-picked" from the available parameters passed to the
`partition()` function. This is unnecessary and couples those
partitioners tightly with the dispatch function. The desired state is
that all available arguments are passed as `kwargs` and the partitioner
function "self-selects" the arguments it will be sensitive to, applies
its own appropriate default values when the argument is omitted, and
simply ignores any arguments it doesn't use. Note that achieving this
requires no changes to partitioner functions because they already do
precisely this.
So the job is to pass all arguments (other than `filename` and `file`)
to the partitioner as `kwargs`. This will allow additional or alternate
partitioners to be registered at runtime and dispatched to, because as
long as they have the signature
`partition_x(filename=None, file=None, **kwargs) -> list[Element]`,
they can be dispatched to without customization.
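For illustration, a minimal sketch of the dispatch shape this enables; the registry and the `dispatch` helper here are hypothetical, not the actual `auto/partition()` internals:

```python
from typing import IO, Callable, Optional

from unstructured.documents.elements import Element

# Hypothetical registry mapping a detected file-type to its partitioner.
_PARTITIONERS: dict[str, Callable[..., list[Element]]] = {}


def register_partitioner(file_type: str, partitioner: Callable[..., list[Element]]) -> None:
    """Register a custom or override partitioner for `file_type` at runtime."""
    _PARTITIONERS[file_type] = partitioner


def dispatch(
    file_type: str,
    filename: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
    **kwargs,
) -> list[Element]:
    """Forward all remaining arguments untouched; the partitioner self-selects."""
    partitioner = _PARTITIONERS[file_type]
    return partitioner(filename=filename, file=file, **kwargs)
```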
- Per [ticket](https://unstructured-ai.atlassian.net/browse/ML-551),
there is a bug in the `unstructured` lib under metrics/evaluate.py that
incorrectly retrieves the file extension before the conversion to the
cct file from paths like '*.pdf.txt' (see the screenshot below).
- The current (incorrect) behavior is shown in the top example; the
correct version is in the bottom example of the screenshot.

- In addition, I also observed that the doctype values returned are not
aligned: some are returned with the leading dot ('.*') and some without.
- Therefore, I aligned them to output the same format, which is with the
leading dot ('.*'). A rough sketch of the extension handling follows
below.
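For reference, a sketch of recovering the underlying doctype (with the leading dot) from a CCT path; the helper name is hypothetical and it assumes the CCT conversion appends a `.txt` suffix:

```python
from pathlib import Path


def doctype_from_cct_path(path: str) -> str:
    """Return the original document extension (with leading dot) for a CCT path.

    A path like 'report.pdf.txt' should yield '.pdf', not '.txt'.
    """
    suffixes = Path(path).suffixes  # e.g. ['.pdf', '.txt']
    if len(suffixes) >= 2 and suffixes[-1] == ".txt":
        return suffixes[-2]
    return suffixes[-1] if suffixes else ""


assert doctype_from_cct_path("example-docs/report.pdf.txt") == ".pdf"
```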
This PR uses a weighted average (weighted by the number of ground-truth
tables) instead of an unweighted average for table metrics (see the
sketch after this list):
- for pages that have ground-truth tables, the weight is proportional to
the number of ground-truth tables on that page
- pages that have no ground-truth tables but do have predicted tables
(false positives) are assigned one table's worth of weight for the whole
page when calculating the mean value of `table_level_acc`
- pages with false-positive tables do not contribute to the table
structural or table content metrics
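A minimal sketch of that weighting rule, assuming per-page stats are available as dicts with hypothetical keys `acc`, `n_gt_tables`, and `n_pred_tables`:

```python
def weighted_table_level_acc(pages: list[dict]) -> float:
    """Weighted mean of per-page table accuracy.

    Pages with ground-truth tables are weighted by their ground-truth table
    count; pages with only predicted (false-positive) tables get weight 1.
    """
    total, weight_sum = 0.0, 0.0
    for page in pages:
        if page["n_gt_tables"] > 0:
            weight = page["n_gt_tables"]
        elif page["n_pred_tables"] > 0:
            weight = 1  # false positives: one table's worth of weight
        else:
            continue  # no tables at all: page does not contribute
        total += weight * page["acc"]
        weight_sum += weight
    return total / weight_sum if weight_sum else float("nan")
```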
## Test
This PR updates the existing tests for evaluating table metrics:
- adds a second file with just 1 table vs. the existing file with 2
tables
- tests that the weighted average is written to the report
This ticket ensures that the CCT metric is not sensitive to differences
in whitespace (including newlines).
All whitespace in the strings is collapsed to a single space `" "` in
both GT and PRED before the metric is computed.
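A small illustration of the normalization, assuming a simple regex-based collapse (not necessarily the exact implementation):

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) to a single space."""
    return re.sub(r"\s+", " ", text).strip()


assert normalize_whitespace("a\n b\t\tc") == "a b c"
```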
Additional changes in CHANGELOG due to auto-formatting.
**Summary**
Remove double-decoration from EML and MSG.
**Additional Context**
- These needed to wait until the end because `partition_email()` and
`partition_msg()` can use any other partitioner for one of their
attachments.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
**Summary**
In preparation for fixing a cluster of bugs with automatic file-type
detection and paving the way for some reliability improvements, refactor
`unstructured.file_utils.filetype` module and improve thoroughness of
tests.
**Additional Context**
Factor the type-recognition process into three distinct strategies that
are attempted in sequence, in order of preference; type-recognition
falls through to the next strategy when the one before it is not
applicable or cannot determine the file-type. This provides a clear
basis for organizing the code and tests at the top level.
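A minimal sketch of that fall-through shape; the `Strategy` alias and `detect_filetype` helper here are illustrative, not the module's actual internals:

```python
from typing import Callable, Optional

# A strategy returns a file-type string, or None when it is not applicable
# or cannot decide, in which case detection falls through to the next one.
Strategy = Callable[[bytes, Optional[str]], Optional[str]]


def detect_filetype(
    content: bytes, filename: Optional[str], strategies: list[Strategy]
) -> str:
    for strategy in strategies:
        file_type = strategy(content, filename)
        if file_type is not None:
            return file_type
    return "unknown"
```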
Consolidate the existing tests around these strategies, adding
additional cases to achieve better coverage.
Several bugs were uncovered in the process. Small ones were fixed here;
bigger ones will be remedied in follow-on PRs.
This PR aims to improve the organization and readability of our example
documents used in unit tests, specifically focusing on PDF and image
files.
### Summary
- Created two new subdirectories in the `example-docs` folder:
- `pdf/`: for all PDF example files
- `img/`: for all image example files
- Moved relevant PDF files from `example-docs/` to `example-docs/pdf/`
- Moved relevant image files from `example-docs/` to `example-docs/img/`
- Updated file paths in affected unit & ingest tests to reflect the new
directory structure
### Testing
All unit & ingest tests should be updated and verified to work with the
new file structure.
## Notes
Other file types (e.g., office documents, HTML files) remain in the root
of `example-docs/` for now.
## Next Steps
Consider similar reorganization for other file types if this structure
proves to be beneficial.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
The table metric that considers spans is not used and it interferes with
the output, so I have removed that code. However, I have left
`table_as_cells` in the source code; it may still be useful for users.
This pull request adds table detection metrics.
One case I considered:
Case: two predicted tables are matched with one table in the ground
truth.
Question: is this matching correct in both cases, or just for one table?
There are two sub-cases:
- the table was predicted by the OD model as two sub-tables (i.e., split
in half into two non-overlapping sub-tables) -> in my opinion both are
correct
- it is a false positive from the table-matching script in
`get_table_level_alignment` -> 1 good, 1 wrong
As we don't have bounding boxes, I followed the notebook calculation
script and assumed the pessimistic, second sub-case interpretation.
This pull request fixes the table-counting metric for three cases (a
sketch of the scoring rules follows below):
- False negatives: when a table exists in the ground truth but none of
the predicted tables matches it, that table should count as 0 and the
file should not be skipped entirely (before, it was np.NaN).
- False positives: when a predicted table does not match any
ground-truth table, it should be counted as 0; right now it is skipped
in processing (matched_indices == -1).
- The file should be skipped entirely only if there are no tables in
either the ground truth or the prediction.
In short, the previous metric calculation did not account for OD
mistakes.
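A hedged sketch of those rules; the function and its inputs are illustrative, and matched tables are scored 1.0 here only to keep the sketch simple, where the real metric would use their measured accuracy:

```python
def table_detection_scores(matched_indices: list[int], n_ground_truth: int) -> list[float]:
    """Per-table scores for one file following the rules above.

    `matched_indices[i]` is the ground-truth index matched to predicted table i,
    or -1 when the prediction matched nothing (false positive).
    """
    if n_ground_truth == 0 and not matched_indices:
        return []  # nothing predicted and nothing expected: skip the file

    matched = {idx for idx in matched_indices if idx != -1}
    scores = [0.0 for idx in matched_indices if idx == -1]  # false positives count as 0
    # Matched ground-truth tables contribute; unmatched ones (false negatives) count as 0.
    scores += [1.0 if gt in matched else 0.0 for gt in range(n_ground_truth)]
    return scores
```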
This PR exposes functions in the evaluation module for easy conversion
between tables in Deckerd and HTML formats, which are useful in
evaluation experiments.
Update to the evaluation script to handle correct HTML syntax for
tables.
See https://github.com/Unstructured-IO/unstructured-inference/pull/355
for details.
This change:
- modifies the transformation of HTML tables into the evaluation's
internal `cells` format
- fixes the indexing of the output (internal-format cells) when HTML
cells use spans (see the sketch below)
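For illustration, a sketch of span-aware indexing when converting an HTML table to cells, using a grid-occupancy walk; `lxml` is an assumed dependency here and this is not the exact evaluation code:

```python
from lxml import html


def html_table_to_cells(table_html: str) -> list[dict]:
    """Return cells with row/col indices that account for rowspan/colspan."""
    cells, occupied = [], set()
    for row_idx, tr in enumerate(html.fromstring(table_html).xpath(".//tr")):
        col_idx = 0
        for td in tr.xpath("./td|./th"):
            while (row_idx, col_idx) in occupied:  # skip slots covered by earlier spans
                col_idx += 1
            rowspan = int(td.get("rowspan", 1))
            colspan = int(td.get("colspan", 1))
            for r in range(row_idx, row_idx + rowspan):
                for c in range(col_idx, col_idx + colspan):
                    occupied.add((r, c))
            cells.append({"row": row_idx, "col": col_idx, "content": td.text_content().strip()})
            col_idx += colspan
    return cells
```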
Currently, CCT eval takes a long time for any of the test_metrics CI
runs. Documents in an eval set are evaluated sequentially, and it
appears that at most 1 CPU core is utilized. This implies there could be
a large speedup by running eval across multiple docs concurrently
(probably with multiprocessing).
Things done in this PR:
- [x] `concurrent.futures.ProcessPoolExecutor` instead of a sequential
for-loop (see the sketch below)
- [x] refactor/reorganization of redundant pieces of code without
changing the inner logic too much. Without that we'd have 3 places where
documents are being processed. Take a look at the
`BaseMetricsCalculator` class and the classes that inherit from it.
- [x] string path manipulation is now reworked and relies on
`pathlib.Path()`
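A minimal sketch of the concurrency change; the function names are illustrative, not the `BaseMetricsCalculator` API:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def evaluate_document(doc_path: Path) -> dict:
    """Placeholder for the per-document evaluation work."""
    return {"doc": doc_path.name}


def evaluate_all(output_dir: str) -> list[dict]:
    doc_paths = sorted(Path(output_dir).glob("*.txt"))
    # Evaluate documents concurrently instead of in a sequential for-loop.
    with ProcessPoolExecutor() as executor:
        return list(executor.map(evaluate_document, doc_paths))
```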
This pull request adds metrics that are calculated based on
`table_as_cells` instead of `text_as_html`. This change is required for
comprehensive metrics calculation, as previously every predicted colspan
or rowspan was considered an incorrect prediction (even when it was
correct).
This change has to be merged after
https://github.com/Unstructured-IO/unstructured/pull/2892, which
introduces the `table_as_cells` field.
This pull request allows returning predictions in raw cell
representation from the table transformer. It will later be used to save
predictions in a cells format for simpler metrics calculation.
This PR has to be merged after
https://github.com/Unstructured-IO/unstructured-inference/pull/335
**Summary**
The serialization and deserialization (serde) of
`metadata.orig_elements` will be located in `unstructured.staging.base`
alongside `elements_to_json()` and other existing serde functions.
Improve the typing, readability, and structure of that module before
adding the new serde functions for `metadata.orig_elements`.
**Reviewers:** The commits are well-groomed and are probably quicker to
review commit-by-commit than as all files-changed at once.
Creates a composite metric to represent the table structure score. It is
an average of the existing row and column index and content scores.
This PR adds a new property to
`unstructured.metrics.table_eval.TableEvaluation`:
`composite_structure_acc`, which is computed from the element level row
and column index and content accuracy scores. This new metric is meant
to offer a single number to represent the performance of table structure
extraction model/algorithms.
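Assuming the composite is an unweighted mean of the four element-level scores, the computation amounts to:

```python
def composite_structure_acc(
    row_index_acc: float,
    col_index_acc: float,
    row_content_acc: float,
    col_content_acc: float,
) -> float:
    """Single-number table-structure score: mean of the element-level
    row/column index and content accuracies, each assumed to be in [0, 1]."""
    return (row_index_acc + col_index_acc + row_content_acc + col_content_acc) / 4
```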
This PR also refactors the eval computation logic so it uses a constant
`table_eval_metrics` instead of hard coding the name of the metrics in
multiple places in the code.
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Add features to `get_mean_grouping` to allow input as a list of
filenames, in the format of either a list of strings or a txt file.
---------
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Files were being created as a side effect of running tests in
`test_unstructured/metrics/test_evaluate.py`. The updated decorator
removes the created directory and its files after the tests run.
Testing: on the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch, no files are created.
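For illustration, a minimal sketch of such a cleanup decorator; the directory name and decorator are hypothetical, not the exact ones used in the tests:

```python
import functools
import shutil
from pathlib import Path


def cleanup_after(dir_to_remove: str):
    """Remove `dir_to_remove` and its files once the decorated test finishes."""

    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            try:
                return test_fn(*args, **kwargs)
            finally:
                shutil.rmtree(dir_to_remove, ignore_errors=True)

        return wrapper

    return decorator


@cleanup_after("test_output_dir")  # hypothetical output directory
def test_evaluation_writes_files():
    Path("test_output_dir").mkdir(exist_ok=True)
```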
This PR redefines the `table_level_acc` metric as follows:
- for each predicted table, use the sequence-matching ratio as its
accuracy
- as a prerequisite for the sequence matching, we sort the table cells
by row then column for both predicted and ground-truth tables to ensure
they are ordered the same
- average the accuracy over all predicted tables
- any prediction without a matching ground truth (false positive)
decreases the score
- predictions that split a ground-truth table into smaller tables also
score low, with perfectly equal splits scoring lowest
This new definition makes the metric a value between 0 and 1 per file.
It replaces the existing definition, where the metric is defined as (the
number of predicted tables that have a match to the ground truth) over
(the number of ground-truth tables). That existing metric actually gives
higher values to predictions that split tables and can be greater than
1. The new definition prefers predictions that do not split ground-truth
tables.
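A hedged sketch of the per-table accuracy under this definition, using `difflib.SequenceMatcher` over cells sorted by row then column; illustrative, not the exact eval implementation:

```python
from difflib import SequenceMatcher


def predicted_table_acc(predicted_cells: list[dict], ground_truth_cells: list[dict]) -> float:
    """Sequence-matching ratio for one predicted table against its matched ground truth."""
    if not ground_truth_cells:
        return 0.0  # false positive: no matching ground truth
    # Sort both sides by (row, col) so the two sequences are ordered the same.
    pred = [c["content"] for c in sorted(predicted_cells, key=lambda c: (c["row"], c["col"]))]
    truth = [c["content"] for c in sorted(ground_truth_cells, key=lambda c: (c["row"], c["col"]))]
    return SequenceMatcher(None, pred, truth).ratio()
```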
The current way table structure metrics are computed does not cover
cases where no table is found and all stats are empty.
This PR fixes this and adds some hardening tests for the table eval
processor.
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
This PR allows grouping functionality on `evaluate.py`.
To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py` or
call `get_mean_grouping(<doctype or connector>, <dataframe or path to
tsv file>, <export directory>, "element_type")`
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
Separate the aggregating functionality of `text_extraction_accuracy`
into a stand-alone function to avoid duplicated eval effort when the
granular-level eval is already available.
To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py`
locally
This PR adds new table evaluation metrics prepared by @leah1985
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows
TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`
The code edits the `measure_text_extraction_accuracy` function to allow
a directory of txt files as well as json. The function also takes an
`output_type` input, which must be either "json" or "txt", and checks
whether the files under the given directory/list contain only the
specified file type.
To test this feature, run the following command:
```
PYTHONPATH=. python unstructured/ingest/evaluate.py measure-text-extraction-accuracy-command --output_dir <clean-text-path> --source_dir <cct-label-path> --output_type txt
```
Closes #2263
Files were being created as a side effect of running tests in
`test_unstructured/metrics/test_evaluate.py`. The added decorator
removes the created directory and its files after the tests run.
Testing: on the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch, no files are created.
### Summary
Click-decorated functions cannot (properly) be called outside of the
click interface. This makes it difficult to reuse the setup
functionality in `measure_text_edit_distance` or
`measure_element_type_accuracy`. This PR removes the click decoration
and separates it into a wrapper function purely to execute the command.
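A minimal sketch of the resulting split; the option names and module layout noted in the comments are assumptions based on the technical details below:

```python
import click


# Core function: plain Python, importable and callable from tests or other code
# (lives under unstructured/metrics/ per this PR).
def measure_text_edit_distance(output_dir: str, source_dir: str) -> None:
    """Compute CCT edit-distance metrics for every document in output_dir."""
    ...


# CLI wrapper: the click-decorated `_command` function stays in ingest/evaluate.py
# and does nothing but forward to the core function.
@click.command()
@click.option("--output_dir", required=True)
@click.option("--source_dir", required=True)
def measure_text_edit_distance_command(output_dir: str, source_dir: str) -> None:
    measure_text_edit_distance(output_dir=output_dir, source_dir=source_dir)
```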
### Technical Details
- Changed as suggested in the response to [this StackOverflow
post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command)
- The locations of these now distinct functions are separate: the
`_command` click-decorated functions stay in ingest/evaluate.py, and the
core functions `measure_text_edit_distance` and
`measure_element_type_accuracy` are moved into the unstructured/metrics/
folder (which is a more logical location for them).
- Initial test added for `measure_text_edit_distance`
### Test
Run `sh ./test_unstructured_ingest/evaluation-metrics.sh
text-extraction`; functionality is unchanged.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
- This PR adds a function to check whether a piece of text contains only
a bullet (no text), to prevent creating an empty element (sketched
below).
- Also fixed a test that had a typo.
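A rough sketch of such a check; the bullet pattern here is a simplified assumption, and the library keeps its own, larger set of bullet characters:

```python
import re

# Simplified set of bullet markers, for illustration only.
BULLET_ONLY_RE = re.compile(r"^\s*[•‣◦·\-*]+\s*$")


def contains_only_bullet(text: str) -> bool:
    """True when the text is just a bullet marker with no actual content."""
    return bool(BULLET_ONLY_RE.match(text))


assert contains_only_bullet("  • ")
assert not contains_only_bullet("• Introduction")
```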
- add a helper to run inference over an image or PDF of a table and
compare it against a ground-truth CSV file
- this metric generates a similarity score between 0 and 1, where 1 is a
perfect match and 0 is no match at all
- add example docs for testing
- NOTE: this metric is only relevant to table structure detection.
Therefore the input should be just the table area in an image/PDF file;
we are not evaluating table element detection in this metric
**Executive Summary**
Adds a function to calculate the percent match between two element-type
frequency outputs from the `get_element_type_frequency` function.
**Technical Detail**
- The function takes two `Dict` inputs, both of which should be outputs
of `get_element_type_frequency`
- Implementors can define the weight `category_depth_weight` they want
to give to the case where the `type` matches but the `category_depth`
differs
- The function first loops through the output item list to find exact
matches and count the total number of exact matches, collecting the
remaining values for both output and source in new lists (of `dict`
type). Then it loops through the remaining source items that were not an
exact match to find `type`-only matches, which are weighted by the
factor `category_depth_weight` defined earlier (default 0.5).
**Output**
output
```
{
    ("Title", 0): 2,
    ("Title", 1): 1,
    ("NarrativeText", None): 3,
    ("UncategorizedText", None): 1,
}
```
source
```
{
    ("Title", 0): 1,
    ("Title", 1): 2,
    ("NarrativeText", None): 5,
}
```
With this output and source, and a weight of 0.5, the percent match will
yield 5.5 / 8: 5 exact matches plus 1 partial match with 0.5 weight.
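A hedged sketch of that computation (illustrative; the actual function signature may differ):

```python
def element_type_percent_match(
    output: dict, source: dict, category_depth_weight: float = 0.5
) -> float:
    """Percent match between two frequency dicts keyed by (type, category_depth)."""
    output, source = dict(output), dict(source)
    total_source = sum(source.values())
    if total_source == 0:
        return 0.0

    # Pass 1: exact (type, depth) matches.
    matched = 0.0
    for key in source:
        if key in output:
            exact = min(output[key], source[key])
            matched += exact
            output[key] -= exact
            source[key] -= exact

    # Pass 2: remaining counts that agree on type but differ in depth,
    # weighted by category_depth_weight.
    remaining_output = {}
    for (el_type, _), count in output.items():
        remaining_output[el_type] = remaining_output.get(el_type, 0) + count
    for (el_type, _), count in source.items():
        partial = min(count, remaining_output.get(el_type, 0))
        matched += partial * category_depth_weight
        remaining_output[el_type] = remaining_output.get(el_type, 0) - partial

    return matched / total_source


# With the output/source above and weight 0.5: 5 exact + 0.5 partial = 5.5 / 8.
```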
---------
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
**Executive Summary**
Add a function that returns the frequency of given element types and
depth.
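A minimal sketch of what such a frequency count looks like, assuming the elements are already loaded as JSON-like dicts (the real function may accept serialized output instead):

```python
from collections import Counter


def get_element_type_frequency(elements: list[dict]) -> dict:
    """Count occurrences of each (element type, category_depth) pair."""
    return dict(
        Counter(
            (el["type"], el.get("metadata", {}).get("category_depth")) for el in elements
        )
    )


elements = [
    {"type": "Title", "metadata": {"category_depth": 0}},
    {"type": "NarrativeText", "metadata": {}},
]
assert get_element_type_frequency(elements) == {("Title", 0): 1, ("NarrativeText", None): 1}
```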
---------
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
### Summary
Missing text is a particularly important metric of quality for the
Unstructured library because information from the document is not being
captured and therefore not usable by downstream applications.
Add a function to calculate the percent of text missing relative to the
source transcription. The function takes two text strings (output and
source) as input and returns the percentage of missing text as a
decimal.
### Technical Details
- The 2 input strings are both assumed to already contain clean and
concatenated text (CCT)
- Implementation compares the bags of words (frequency counts for each
word present in the text) of each input text
- Duplicated/extra text is not penalized
- Value is limited to the range [0, 1] (a sketch follows below)
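For illustration, a sketch of the bag-of-words comparison under those assumptions (not necessarily the exact implementation):

```python
from collections import Counter


def percent_missing_text(output: str, source: str) -> float:
    """Fraction of source words (by frequency) that never appear in the output.

    Both inputs are assumed to be clean, concatenated text (CCT); duplicated
    or extra output text is not penalized, and the result stays in [0, 1].
    """
    source_bag = Counter(source.split())
    output_bag = Counter(output.split())
    total = sum(source_bag.values())
    if total == 0:
        return 0.0
    missing = sum(max(count - output_bag[word], 0) for word, count in source_bag.items())
    return missing / total
```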
### Test
- Several edge cases are covered in the test function (missing text,
duplicated text, spaced out words, etc).
- Can test other cases or text inputs by calling the function with 2 CCT
strings as "output" and "source"
This PR adds the `bag_of_words` function to count the frequency of words
for evaluation.
**Testing**
```Python
from unstructured.cleaners.core import bag_of_words

string = "The dog loved the cat, but the cat loved the cow."
print(bag_of_words(string))
```
---------
Co-authored-by: Mallori Harrell <mallori@Malloris-MacBook-Pro.local>
Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
**Executive Summary**
Adds a function to calculate the edit distance (Levenshtein distance)
between two strings. The function can return either: 1. a score
(similarity = 1 - distance/source_len) or 2. the distance (raw
Levenshtein distance).
**Technical details**
- The `weights` param defaults to (2, 1, 1) for (insertion, deletion,
substitution), meaning that we penalize the insertions we need to add to
the output (target) in comparison with the source (reference). In other
words, missing extractions are penalized more heavily (see the sketch
below).
- The function takes in two strings, with the assumption that both
strings are already clean and concatenated text (CCT).
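A hedged sketch of a weighted edit distance with those defaults, written as a plain dynamic-programming version; the real implementation may rely on an optimized library instead:

```python
def weighted_edit_distance(output: str, source: str, weights=(2, 1, 1)) -> float:
    """Levenshtein distance transforming `output` into `source`.

    `weights` is (insertion, deletion, substitution); the default (2, 1, 1)
    makes insertions (text the output is missing) cost the most.
    """
    ins, dele, sub = weights
    prev = [j * ins for j in range(len(source) + 1)]
    for i in range(1, len(output) + 1):
        curr = [i * dele]
        for j in range(1, len(source) + 1):
            cost = 0 if output[i - 1] == source[j - 1] else sub
            curr.append(min(prev[j] + dele, curr[j - 1] + ins, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_distance_score(output: str, source: str, weights=(2, 1, 1)) -> float:
    """Similarity score: 1 - distance / len(source), floored at 0."""
    if not source:
        return 0.0
    return max(1 - weighted_edit_distance(output, source, weights) / len(source), 0.0)
```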
**Important Note!**
The test case needs to be updated to use CCT once the function is ready.
For now it only tests the "functionality" of edit distance, not the edit
distance with CCT as intended.
---------
Co-authored-by: cragwolfe <crag@unstructured.io>