unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-22 21:29:51 +00:00

Author	SHA1	Message	Date
Klaijan	e65a44eabb	feat: update cct eval for text dir (#2299 ) The code makes edit to the `measure_text_extraction_accuracy` function to allows dir of txt as well as json. The function also takes input `output_type` to be either "json" or "txt" only, and checks if the files under given directory/list contains only specified file type or not. To test this feature, run the following code: ```PYTHONPATH=. python unstructured/ingest/evaluate.py measure-text-extraction-accuracy-command --output_dir <clean-text-path> --source_dir <cct-label-path> --output_type txt```	2024-01-05 23:34:53 +00:00
John	04f4c3ab16	create teardown fixture for tests (#2269 ) Closes #2263 Files were being created as a side effect from running tests in `test_unstructured/metrics/test_evaluate.py`. The added decorator removes the created directory and its files after the tests run. Testing on the main branch, run `make test` or `pytest test_unstructured/metrics/test_evaluate.py` and files will be created. On this branch no files are created	2023-12-20 17:50:12 +00:00
Klaijan	0aae1faa54	feat: add visualize param to command and add test (#2178 ) - Add `visualize` parameter to the click command -- now callable using `--visualize` flag to show the progress bar. - Refactor the name.	2023-11-29 01:05:55 +00:00
Klaijan	2c2d5b65ca	refactor: measure_text_edit_distance function for aggregation (#2108 ) - Refactor `metrics/evaluation.py` to accepts `grouping` as parameter. - Switch to `DataFrame` for easier analysis and aggregation.	2023-11-22 13:30:16 -08:00
shreyanid	6db663e7bb	refactor: separate click wrappers from core evaluation functionality (#1981 ) ### Summary Click decorated functions cannot (properly) be called outside of the click interface. This makes it difficult to reuse the setup functionality in measure_text_edit_distance or measure_element_type_accuracy. This PR removes the click decoration and separates it into a wrapper function purely to execute the command. ### Technical Details - Changed as suggested in [this StackOverflow post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command) response - The locations of these now distinct functions are separate: the `_command` click-decorated functions stay in ingest/evaluate.py, and the core functions measure_text_edit_distance and measure_element_type_accuracy are moved into the unstructured/metrics/ folder (which is a more logical location for them). - Initial test added for measure_text_edit_distance ### Test `sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction` functionality is unchanged. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: shreyanid <shreyanid@users.noreply.github.com> Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>	2023-11-07 19:54:22 +00:00
Mallori Harrell	d07baed4a1	bug: empty-elements (#1252 ) - This PR adds a function to check if a piece of text only contains a bullet (no text) to prevent creating an empty element. - Also fixed a test that had a typo.	2023-11-02 10:52:41 -05:00
Klaijan	1893d5a669	fix: avoid loop through None (#1975 ) Fix this issue https://unstructured-ai.atlassian.net/browse/CORE-2455. Adding logical check if the variable is not None.	2023-11-01 20:50:34 +00:00
Yao You	42f8cf1997	chore: add metric helper for table structure eval (#1877 ) - add helper to run inference over an image or pdf of table and compare it against a ground truth csv file - this metric generates a similarity score between 1 and 0, where 1 is perfect match and 0 is no match at all - add example docs for testing - NOTE: this metric is only relevant to table structure detection. Therefore the input should be just the table area in an image/pdf file; we are not evaluating table element detection in this metric	2023-10-27 13:23:44 -05:00
Klaijan	ba4c649cf0	feat: calculate element type percent match (#1723 ) Executive Summary Adds function to calculate the percent match between two element type frequency output from `get_element_type_frequency` function. Technical Detail - The function takes two `Dict` input which both should be output from `get_element_type_frequency` - Implementors can define weight `category_depth_weight` they want to give to the matching `type` but different in `category_depth` case - The function loops through output item list first to find exact match and count total exact match, and collect the remaining value for both output and source in new list (of `dict` type). Then it loops through existing source item list that has not been an exact match, to find `type` match which then weigh with the factor of `category_depth_weight` defined earlier, default at 0.5) Output output ``` { ("Title", 0): 2, ("Title", 1): 1, ("NarrativeText", None): 3, ("UncategorizedText", None): 1, } ``` source ``` { ("Title", 0): 1, ("Title", 1): 2, ("NarrativeText", None): 5, } ``` With this output and source, and weight of 0.5, the % match will yield 5.5 / 8 -- for 5 exact match, and 1 partial match with 0.5 weight. --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2023-10-16 17:57:28 +00:00
Klaijan	ee75ce25e2	feat: element type frequency (#1688 ) Executive Summary Add function that returns frequency of given element types and depth. --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2023-10-11 00:36:44 +00:00
shreyanid	9d228c7ecb	feat: calculate metric for percent of text missing (#1701 ) ### Summary Missing text is a particularly important metric of quality for the Unstructured library because information from the document is not being captured and therefore not usable by downstream applications. Add function to calculate the percent of text missing relative to the source transcription. Function takes 2 text strings (output and source) as input, and returns the percentage of text missing as a decimal. ### Technical Details - The 2 input strings are both assumed to already contain clean and concatenated text (CCT) - Implementation compares the bags of words (frequency counts for each word present in the text) of each input text - Duplicated/extra text is not penalized - Value is limited to the range [0, 1] ### Test - Several edge cases are covered in the test function (missing text, duplicated text, spaced out words, etc). - Can test other cases or text inputs by calling the function with 2 CCT strings as "output" and "source"	2023-10-10 20:54:49 +00:00
Mallori Harrell	a5d7ae4611	Feat: Bag of words for testing metric (#1650 ) This PR adds the `bag_of_words` function to count the frequency of words for evaluation. Testing ```Python from unstructured.cleaners.core import bag_of_words string = "The dog loved the cat, but the cat loved the cow." print(bag_of_words) --------- Co-authored-by: Mallori Harrell <mallori@Malloris-MacBook-Pro.local> Co-authored-by: Klaijan <klaijan@unstructured.io> Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com> Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2023-10-10 18:46:01 +00:00
Klaijan	33edbf84f5	feat: add calculate edit distance feature (#1656 ) Executive Summary Adds function to calculate edit distance (Levenshtein distance) between two strings. The function can return as: 1. score (similarity = 1 - distance/source_len) 2. distance (raw levenshtein distance) Technical details - The `weights` param is set to default at (2,1,1) for (insertion, deletion, substitution), meaning that we will penalize the insertion we need to add from output (target) in comparison with the source (reference). In other word, the missing extraction will be penalized higher. - The function takes in 2 strings in an assumption that both string are already clean and concatenated (CCT) Important Note! Test case needs to be updated to use CCT once the function is ready. It is now only tested the "functionality" of edit distance, not the edit distance with CCT as its intended to be. --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-07 01:21:14 +00:00

13 Commits