13 Commits

Author SHA1 Message Date
Klaijan
e65a44eabb
feat: update cct eval for text dir (#2299)
This change updates the `measure_text_extraction_accuracy` function
to accept a directory of txt files as well as json files. The function also takes an
`output_type` input, which must be either "json" or "txt", and checks whether the files
under the given directory/list are all of the specified file type.
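
A minimal sketch of the kind of file-type check described above. The helper name `check_output_type` and its error messages are hypothetical and are not taken from the library:

```python
from pathlib import Path
from typing import List

# The two values the commit message says `output_type` may take.
ALLOWED_OUTPUT_TYPES = ("json", "txt")


def check_output_type(directory: str, output_type: str) -> List[str]:
    """Hypothetical helper: return the sorted file names in `directory`,
    raising ValueError if any file does not have the expected extension."""
    if output_type not in ALLOWED_OUTPUT_TYPES:
        raise ValueError(
            f"output_type must be one of {ALLOWED_OUTPUT_TYPES}, got {output_type!r}"
        )
    names = sorted(p.name for p in Path(directory).iterdir() if p.is_file())
    wrong = [name for name in names if not name.endswith(f".{output_type}")]
    if wrong:
        raise ValueError(f"expected only .{output_type} files, found: {wrong}")
    return names
```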

To test this feature, run the following code:

```PYTHONPATH=. python unstructured/ingest/evaluate.py measure-text-extraction-accuracy-command --output_dir <clean-text-path> --source_dir <cct-label-path> --output_type txt```
2024-01-05 23:34:53 +00:00
John
04f4c3ab16
create teardown fixture for tests (#2269)
Closes #2263 
Files were being created as a side effect from running tests in
`test_unstructured/metrics/test_evaluate.py`. The added decorator
removes the created directory and its files after the tests run.

Testing
On the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch, no files are created.
2023-12-20 17:50:12 +00:00
Klaijan
0aae1faa54
feat: add visualize param to command and add test (#2178)
- Add a `visualize` parameter to the click command -- now callable using
the `--visualize` flag to show the progress bar.
- Refactor the name.
2023-11-29 01:05:55 +00:00
Klaijan
2c2d5b65ca
refactor: measure_text_edit_distance function for aggregation (#2108)
- Refactor `metrics/evaluation.py` to accept `grouping` as a parameter.
- Switch to `DataFrame` for easier analysis and aggregation.
2023-11-22 13:30:16 -08:00
shreyanid
6db663e7bb
refactor: separate click wrappers from core evaluation functionality (#1981)
### Summary
Click-decorated functions cannot (properly) be called outside of the
click interface. This makes it difficult to reuse the setup
functionality in measure_text_edit_distance or
measure_element_type_accuracy. This PR removes the click decoration and
separates it into a wrapper function purely to execute the command.

### Technical Details
- Changed as suggested in [this StackOverflow
post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command)
response
- The locations of these now distinct functions are separate: the
`_command` click-decorated functions stay in ingest/evaluate.py, and the
core functions measure_text_edit_distance and
measure_element_type_accuracy are moved into the unstructured/metrics/
folder (which is a more logical location for them).
- Initial test added for measure_text_edit_distance

### Test
`sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction`
functionality is unchanged.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
2023-11-07 19:54:22 +00:00
Mallori Harrell
d07baed4a1
bug: empty-elements (#1252)
- This PR adds a function to check if a piece of text only contains a
bullet (no text) to prevent creating an empty element.
- Also fixed a test that had a typo.
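
A minimal sketch of a bullet-only check, for illustration. The function name `is_bullet_only` and the set of bullet characters are assumptions; the library's actual pattern may differ:

```python
import re

# Assumed set of bullet characters; not taken from the library.
BULLET_ONLY_RE = re.compile(r"^\s*[•‣◦⁃*\-·]\s*$")


def is_bullet_only(text: str) -> bool:
    """Return True when `text` is a single bullet character with no other content."""
    return bool(BULLET_ONLY_RE.match(text))
```

A caller could skip creating an element whenever this returns `True`.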
2023-11-02 10:52:41 -05:00
Klaijan
1893d5a669
fix: avoid loop through None (#1975)
Fixes the issue tracked at https://unstructured-ai.atlassian.net/browse/CORE-2455
by adding a check that the variable is not None before looping over it.
2023-11-01 20:50:34 +00:00
Yao You
42f8cf1997
chore: add metric helper for table structure eval (#1877)
- add helper to run inference over an image or pdf of table and compare
it against a ground truth csv file
- this metric generates a similarity score between 1 and 0, where 1 is
perfect match and 0 is no match at all
- add example docs for testing
- NOTE: this metric is only relevant to table structure detection.
Therefore the input should be just the table area in an image/pdf file;
we are not evaluating table element detection in this metric
2023-10-27 13:23:44 -05:00
Klaijan
ba4c649cf0
feat: calculate element type percent match (#1723)
**Executive Summary**
Adds a function to calculate the percent match between two element type
frequency outputs from the `get_element_type_frequency` function.

**Technical Detail**
- The function takes two `Dict` inputs, both of which should be output from
`get_element_type_frequency`
- Implementors can define the weight `category_depth_weight` to give to
items that match in `type` but differ in `category_depth`
- The function first loops through the output item list to find and count
exact matches, and collects the remaining values for both
output and source in new lists (of `dict` type). It then loops through
the source items that were not exact matches to find
`type` matches, which are weighted by `category_depth_weight`
(default 0.5)

**Output**
output
```
{
  ("Title", 0): 2,
  ("Title", 1): 1,
  ("NarrativeText", None): 3,
  ("UncategorizedText", None): 1,
}
```

source
```
{
  ("Title", 0): 1,
  ("Title", 1): 2,
  ("NarrativeText", None): 5,
}
```

With this output and source and a weight of 0.5, the percent match is
5.5 / 8 -- 5 exact matches plus 1 partial match weighted at 0.5.
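
The arithmetic above can be reproduced with a short sketch. The function name `percent_match_sketch` is hypothetical, and dividing by the total source count is an assumption inferred from the 8 in the example. This is not the library's implementation:

```python
from typing import Dict, Optional, Tuple

Frequency = Dict[Tuple[str, Optional[int]], int]


def percent_match_sketch(
    output: Frequency, source: Frequency, category_depth_weight: float = 0.5
) -> float:
    """Hypothetical sketch: exact (type, depth) matches count fully,
    type-only matches count with `category_depth_weight`."""
    total_source = sum(source.values())
    if total_source == 0:
        return 0.0  # assumption: nothing to match against
    matched = 0.0
    out_left: Dict[str, int] = {}  # unmatched output counts, keyed by type
    src_left: Dict[str, int] = {}  # unmatched source counts, keyed by type
    for key in set(output) | set(source):
        out_n, src_n = output.get(key, 0), source.get(key, 0)
        exact = min(out_n, src_n)
        matched += exact
        elem_type = key[0]
        out_left[elem_type] = out_left.get(elem_type, 0) + out_n - exact
        src_left[elem_type] = src_left.get(elem_type, 0) + src_n - exact
    # Leftovers of the same type can only differ in depth: partial matches.
    for elem_type, src_n in src_left.items():
        matched += category_depth_weight * min(src_n, out_left.get(elem_type, 0))
    return matched / total_source


output = {("Title", 0): 2, ("Title", 1): 1, ("NarrativeText", None): 3,
          ("UncategorizedText", None): 1}
source = {("Title", 0): 1, ("Title", 1): 2, ("NarrativeText", None): 5}
print(percent_match_sketch(output, source))  # 5.5 / 8 = 0.6875
```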

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-16 17:57:28 +00:00
Klaijan
ee75ce25e2
feat: element type frequency (#1688)
**Executive Summary**

Add a function that returns the frequency of each element type and depth.
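
A minimal sketch of such a frequency count. The input shape (a list of dicts with a `type` key and an optional `metadata.category_depth`) and the function name are assumptions made for illustration; the real function's signature may differ:

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def element_type_frequency_sketch(
    elements: List[dict],
) -> Dict[Tuple[str, Optional[int]], int]:
    """Count elements by (type, category_depth); depth is None when absent."""
    counts: Counter = Counter()
    for element in elements:
        depth = element.get("metadata", {}).get("category_depth")
        counts[(element["type"], depth)] += 1
    return dict(counts)
```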

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-11 00:36:44 +00:00
shreyanid
9d228c7ecb
feat: calculate metric for percent of text missing (#1701)
### Summary
Missing text is a particularly important metric of quality for the
Unstructured library because information from the document is not being
captured and therefore not usable by downstream applications.

Add function to calculate the percent of text missing relative to the
source transcription. Function takes 2 text strings (output and source)
as input, and returns the percentage of text missing as a decimal.

### Technical Details
- The 2 input strings are both assumed to already contain clean and
concatenated text (CCT)
- Implementation compares the bags of words (frequency counts for each
word present in the text) of each input text
- Duplicated/extra text is not penalized
- Value is limited to the range [0, 1]
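
A minimal sketch of the bag-of-words comparison described above. The function name is hypothetical and whitespace tokenization is an assumption; the library's own word counting may clean the text differently:

```python
from collections import Counter


def percent_missing_sketch(output: str, source: str) -> float:
    """Fraction of source words (by count) that do not appear in the output."""
    out_bow = Counter(output.split())  # assumption: whitespace tokenization
    src_bow = Counter(source.split())
    total = sum(src_bow.values())
    if total == 0:
        return 0.0  # assumption: an empty source has nothing missing
    # Only count shortfalls; extra or duplicated output words are not penalized.
    missing = sum(max(n - out_bow.get(word, 0), 0) for word, n in src_bow.items())
    return min(max(missing / total, 0.0), 1.0)
```

For instance, `percent_missing_sketch("a a a b", "a b")` is `0.0`, since duplicated text in the output carries no penalty.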

### Test
- Several edge cases are covered in the test function (missing text,
duplicated text, spaced out words, etc).
- Can test other cases or text inputs by calling the function with 2 CCT
strings as "output" and "source"
2023-10-10 20:54:49 +00:00
Mallori Harrell
a5d7ae4611
Feat: Bag of words for testing metric (#1650)
This PR adds the `bag_of_words` function to count the frequency of words
for evaluation.

**Testing**
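```Python
from unstructured.cleaners.core import bag_of_words

string = "The dog loved the cat, but the cat loved the cow."

# Call the function on the string; printing the bare name only shows the function object.
print(bag_of_words(string))
```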

---------

Co-authored-by: Mallori Harrell <mallori@Malloris-MacBook-Pro.local>
Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-10 18:46:01 +00:00
Klaijan
33edbf84f5
feat: add calculate edit distance feature (#1656)
**Executive Summary**

Adds a function to calculate the edit distance (Levenshtein distance) between
two strings. The function can return either: 1. score (similarity = 1 -
distance/source_len) or 2. distance (raw Levenshtein distance)

**Technical details**
- The `weights` param defaults to (2,1,1) for (insertion,
deletion, substitution), meaning that the insertions needed to turn
the output (target) into the source (reference) are penalized more
heavily. In other words, missing extracted text is penalized
more.
- The function takes 2 strings on the assumption that both strings are
already clean and concatenated (CCT)
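
To illustrate how the weights act, here is a self-contained sketch of a weighted Levenshtein distance and the score formula from the summary. The function names are hypothetical, this is plain dynamic programming rather than the library's implementation, and the handling of an empty source is an assumption:

```python
from typing import Tuple


def weighted_levenshtein(
    output: str, source: str, weights: Tuple[int, int, int] = (2, 1, 1)
) -> int:
    """Cost of turning `output` into `source` with (insertion, deletion, substitution) weights."""
    ins, dele, sub = weights
    # prev[j]: cost of turning the empty prefix of output into source[:j] (j insertions).
    prev = [j * ins for j in range(len(source) + 1)]
    for i, out_char in enumerate(output, start=1):
        curr = [i * dele]  # turning output[:i] into "" takes i deletions
        for j, src_char in enumerate(source, start=1):
            curr.append(min(
                prev[j] + dele,       # delete out_char from the output
                curr[j - 1] + ins,    # insert src_char, which the output is missing
                prev[j - 1] + (0 if out_char == src_char else sub),
            ))
        prev = curr
    return prev[-1]


def edit_score_sketch(output: str, source: str) -> float:
    """similarity = 1 - distance / source_len, applied as written (no clamping)."""
    if not source:
        return 1.0 if not output else 0.0  # assumption: avoid dividing by zero
    return 1 - weighted_levenshtein(output, source) / len(source)
```

With the default weights, a character missing from the output costs 2 while an extra character costs 1: `weighted_levenshtein("abc", "abcd")` is 2 and `weighted_levenshtein("abcd", "abc")` is 1.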

**Important Note!**
The test case needs to be updated to use CCT once the function is ready. It
currently tests only the "functionality" of edit distance, not edit
distance with CCT as it is intended to be used.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-07 01:21:14 +00:00