**Summary**
The serialization and deserialization (serde) of
`metadata.orig_elements` will be located in `unstructured.staging.base`
alongside `elements_to_json()` and other existing serde functions.
Improve the typing, readability, and structure of that module before
adding the new serde functions for `metadata.orig_elements`.
**Reviewers:** The commits are well-groomed and are probably quicker to
review commit-by-commit than as all files-changed at once.
# Background
[Ligatures](https://en.wikipedia.org/wiki/Ligature_(writing)#Ligatures_in_Unicode_(Latin_alphabets))
can sometimes show up during the text extraction process when they
should not. Very common examples of this are with the Latin `f` related
ligatures which can be **very subtle** to spot by eye (see example
below), but can wreak havoc later.
```python
"ff": "ff",
"fi": "fi",
"fl": "fl",
"ffi": "ffi",
"ffl": "ffl",
```
Several libraries already do something like this. Most recently,
`pdfplumber` added this sort of capability as part of the text
extraction process, see https://github.com/jsvine/pdfplumber/issues/598
Instead of incorporating any sort of breaking change to the PDF text
processing in `unstructured`, it is best to add this as another cleaner
and allow users to opt in. In turn, the `clean_ligatures` method has
been added in this PR - with accompanying tests.
# Example
Here is an example PDF that causes the issue. For example: `Benefits`,
which should be `Benefits`.
[example.pdf](https://github.com/Unstructured-IO/unstructured/files/12544344/example.pdf)
```bash
curl -X 'POST' \
'https://api.unstructured.io/general/v0/general' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H 'unstructured-api-key: ${UNSTRUCTURED_API_KEY}' \
-F 'files=@example.pdf' \
-s | jq -C .
```
# Notes
An initial list of mappings was added with the most common ligatures.
There is some subjectivity to this, but this should be a relatively safe
starting set. Can always be expanded as needed.
* add modified arabic and hebrew encodings
* added calls to format_encoding_str so encoding is checked before use
* added formatting to detect_filetype()
* explicitly provided default value for null encoding parameter
* fixed format of annotated encodings list
* adding hebrew base64 test file
* small lint fixes
* update changelog
* bump version to -dev2
* cleaning brick to group broken paragraphs
* docs for group_broken_paragraphs
* add docs for partition_text with grouper
* partition_text and auto with paragraph_grouper
* version and changelog
* typo in the docs
* linting, linting, linting
* switch to using regular expressions
* Apply import sorting
ruff . --select I --fix
* Remove unnecessary open mode parameter
ruff . --select UP015 --fix
* Use f-string formatting rather than .format
* Remove extraneous parentheses
Also use "" instead of str()
* Resolve missing trailing commas
ruff . --select COM --fix
* Rewrite list() and dict() calls using literals
ruff . --select C4 --fix
* Add () to pytest.fixture, use tuples for parametrize, etc.
ruff . --select PT --fix
* Simplify code: merge conditionals, context managers
ruff . --select SIM --fix
* Import without unnecessary alias
ruff . --select PLR0402 --fix
* Apply formatting via black
* Rewrite ValueError somewhat
Slightly unrelated to the rest of the PR
* Apply formatting to tests via black
* Update expected exception message to match
0d81564
* Satisfy E501 line too long in test
* Update changelog & version
* Add ruff to make tidy and test deps
* Run 'make tidy'
* Update changelog & version
* Update changelog & version
* Add ruff to 'check' target
Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
* feat: new cleaning brick for ordered bullets
* test: add test for cleaning ordered bullets
* feat: new brick for extracting ordered bullets
* test: add test for extracting ordered bullets
* docs: update CHANGELOG and bump new dev version
* chore: change extract ordered bullets return type to tuple
* chore: made tidy
* chore: regex to split on pattern instead of built-in
* chore: catch ValueError, made tidy and fix incompatible type
* chore: assertion statements in one line of code
* docs: add documentation for new clean and extract bricks to bricks.rst
* docs: refactor CHANGELOG 0.3.5.dev5 to dev6 with new bullets
* docs: update CHANGELOG 0.3.6-dev0 changes and bump version
Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
* added pattern for finding phone numbers
* added cleaning brick for extracting phone numbers
* add docs
* changelog and bump version
* switch to us phone numbers
* bump dev version
* fix for processing deeply embedded list elements
* fix types in mime encodings cleaner
* first pass on partition_email
* tests for email
* test for mime encodings
* changelog bump
* added note about \n=
* linting, linting, linting
* added email docs
* add partition_email to the readme
* add one more test
* initial implementation for translate brick
* more input validation
* tests for translate brick
* added docs
* bumped version
* chinese and arabic tests
* re-run pip-compile
* add torch to dependencies
* cleanup doc string
* fix long string
* fix typo in docs
* take out empty string check
* return string if string is empty
* added huggingface into make install
* brick to extract text before
* brick for extract text after
* tests for extract before and after
* updated docs
* changelog and bump version
* fix typo
* fix another typo
* positive -> non-negative
* added prefix and postfix cleaners
* added test for pre and postfix cleaners
* added docs for prefix and postfix bricks
* changelog and bump version
* add dev to version