4 Commits

Author SHA1 Message Date
Klaijan
1893d5a669
fix: avoid loop through None (#1975)
Fixes https://unstructured-ai.atlassian.net/browse/CORE-2455.
Adds a check that the variable is not None before looping over it.
2023-11-01 20:50:34 +00:00
shreyanid
9d228c7ecb
feat: calculate metric for percent of text missing (#1701)
### Summary
Missing text is a particularly important metric of quality for the
Unstructured library because information from the document is not being
captured and therefore not usable by downstream applications.

Add function to calculate the percent of text missing relative to the
source transcription. Function takes 2 text strings (output and source)
as input, and returns the percentage of text missing as a decimal.

### Technical Details
- The 2 input strings are both assumed to already contain clean and
concatenated text (CCT)
- Implementation compares the bags of words (frequency counts for each
word present in the text) of each input text
- Duplicated/extra text is not penalized
- Value is limited to the range [0, 1]
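
A minimal sketch of the calculation described above, not the code merged in this PR. The name `percent_missing_sketch` and the whitespace tokenization are assumptions made for illustration; the library's function may tokenize and normalize differently.

```python
from collections import Counter


def percent_missing_sketch(output: str, source: str) -> float:
    """Fraction of source words that are absent from output (hypothetical helper)."""
    # Bags of words: frequency count of each whitespace-separated word (assumed tokenization).
    output_bow = Counter(output.split())
    source_bow = Counter(source.split())

    total_source_words = sum(source_bow.values())
    if total_source_words == 0:
        # Assumption: nothing can be missing from an empty source.
        return 0.0

    # Count only shortfalls; extra or duplicated words in output are not penalized.
    missing = sum(
        max(count - output_bow[word], 0) for word, count in source_bow.items()
    )

    # Keep the value inside [0, 1].
    return min(max(missing / total_source_words, 0.0), 1.0)
```

With this sketch, `percent_missing_sketch("the cat", "the cat sat")` yields one third, and an output that repeats source words yields 0.0.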

### Test
- Several edge cases are covered in the test function (missing text,
duplicated text, spaced out words, etc).
- Can test other cases or text inputs by calling the function with 2 CCT
strings as "output" and "source"
2023-10-10 20:54:49 +00:00
Mallori Harrell
a5d7ae4611
Feat: Bag of words for testing metric (#1650)
This PR adds the `bag_of_words` function to count the frequency of words
for evaluation.
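
For readers unfamiliar with the idea, word-frequency counting can be sketched roughly as below. This is an illustration only: `bag_of_words_sketch` is a hypothetical name, and the lowercasing and punctuation handling are assumptions that may not match what `bag_of_words` in the library does.

```python
import re
from collections import Counter


def bag_of_words_sketch(text: str) -> Counter:
    """Count how often each word occurs in text (hypothetical helper)."""
    # Assumption: lowercase the text and treat runs of letters, digits,
    # and apostrophes as words, dropping other punctuation.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)
```

Under these assumptions, "The" and "the" are counted as the same word.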

**Testing**
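```Python
from unstructured.cleaners.core import bag_of_words

string = "The dog loved the cat, but the cat loved the cow."

# Call the function on the string; printing the bare name only shows the function object.
print(bag_of_words(string))
```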

---------

Co-authored-by: Mallori Harrell <mallori@Malloris-MacBook-Pro.local>
Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-10 18:46:01 +00:00
Klaijan
33edbf84f5
feat: add calculate edit distance feature (#1656)
**Executive Summary**

Adds a function to calculate the edit distance (Levenshtein distance)
between two strings. The function can return either: 1. a score
(similarity = 1 - distance/source_len), or 2. the distance (raw
Levenshtein distance).

**Technical details**
- The `weights` param defaults to (2, 1, 1) for (insertion, deletion,
substitution), so each insertion needed to turn the output (target) into
the source (reference) costs twice as much as a deletion or substitution.
In other words, text missing from the extraction is penalized more
heavily.
- The function takes 2 strings, on the assumption that both strings are
already clean and concatenated text (CCT)
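
The weighting described above can be sketched in plain Python as follows. This is an illustrative dynamic-programming version, not the code merged in this PR; the names `weighted_edit_distance` and `edit_score` are hypothetical, and the handling of an empty source is an assumption.

```python
def weighted_edit_distance(output: str, source: str, weights=(2, 1, 1)) -> int:
    """Cost of turning output into source with (insertion, deletion, substitution) weights."""
    insert_cost, delete_cost, substitute_cost = weights
    # previous[j] holds the cost of turning output[:i - 1] into source[:j].
    previous = [j * insert_cost for j in range(len(source) + 1)]
    for i in range(1, len(output) + 1):
        current = [i * delete_cost]
        for j in range(1, len(source) + 1):
            same = output[i - 1] == source[j - 1]
            current.append(
                min(
                    previous[j] + delete_cost,       # drop a character from output
                    current[j - 1] + insert_cost,    # add a character missing from output
                    previous[j - 1] + (0 if same else substitute_cost),
                )
            )
        previous = current
    return previous[-1]


def edit_score(output: str, source: str, weights=(2, 1, 1)) -> float:
    """Similarity = 1 - distance / source_len, as stated in the summary."""
    distance = weighted_edit_distance(output, source, weights)
    if not source:
        # Assumption: with an empty source, only an empty output is a perfect match.
        return 1.0 if distance == 0 else 0.0
    return 1 - distance / len(source)
```

In this sketch, an output of "abc" against a source of "abcd" costs 2 (one insertion), while "abcd" against "abc" costs 1 (one deletion). Because insertions weigh 2, the unclamped score can fall below 0 when much of the source is missing.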

**Important Note!**
The test case needs to be updated to use CCT once the function is ready.
For now it only tests the "functionality" of edit distance, not edit
distance on CCT as it's intended to be used.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-07 01:21:14 +00:00