Wahab Alshahin e4e25c9feb
Add clean_ligatures to core cleaners (#1326)
# Background


[Ligatures](https://en.wikipedia.org/wiki/Ligature_(writing)#Ligatures_in_Unicode_(Latin_alphabets))
can sometimes show up during the text extraction process when they
should not. Very common examples of this are with the Latin `f` related
ligatures which can be **very subtle** to spot by eye (see example
below), but can wreak havoc later.

```python
"ff": "ff",
"fi": "fi",
"fl": "fl",
"ffi": "ffi",
"ffl": "ffl",
```

Several libraries already do something like this. Most recently,
`pdfplumber` added this sort of capability as part of the text
extraction process, see https://github.com/jsvine/pdfplumber/issues/598

Instead of incorporating any sort of breaking change to the PDF text
processing in `unstructured`, it is best to add this as another cleaner
and allow users to opt in. In turn, the `clean_ligatures` method has
been added in this PR - with accompanying tests.

# Example

Here is an example PDF that causes the issue. For example: `Benefits`,
which should be `Benefits`.


[example.pdf](https://github.com/Unstructured-IO/unstructured/files/12544344/example.pdf)

```bash
curl -X 'POST' \
    'https://api.unstructured.io/general/v0/general' \
    -H 'accept: application/json' \
    -H 'Content-Type: multipart/form-data' \
    -H 'unstructured-api-key: ${UNSTRUCTURED_API_KEY}' \
    -F 'files=@example.pdf' \
    -s | jq -C .
```

# Notes

An initial list of mappings was added with the most common ligatures.
There is some subjectivity to this, but this should be a relatively safe
starting set. Can always be expanded as needed.
2023-09-07 21:30:18 +00:00
..