### Summary
In order to convert between incompatible language codes from packages
used for OCR, this change adds a function to map between any standard
language codes and tesseract OCR specific codes. Users can input
language information to `languages` in any Tesseract-supported langcode
or any ISO 639 standard language code.
### Details
- Introduces the
[python-iso639](https://pypi.org/project/python-iso639/) package for
matching standard language codes. Recompiles all dependencies.
- If a language is not already supplied by the user as a Tesseract
specific langcode, supplies all possible script/orthography variants of
the language to the Tesseract OCR agent.
### Test
Added many unit tests for a variety of language combinations, special
cases, and variants. For general testing, call partition functions with
any lang codes in the languages parameter (Tesseract or standard).
for example,
```
from unstructured.partition.auto import partition
elements = partition(filename="example-docs/layout-parser-paper.pdf", strategy="hi_res", languages=["en", "chi"])
print("\n\n".join([str(el) for el in elements]))
```
should supply eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert to Tesseract
This bump removes the preprocessing before table structure extraction
and improves the OCR results for tables.
---------
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
### Summary
Duplicate PR of #1259 because of issues with checks
Closes#1227, which found that `nan` values were present in the
coordinates being generated for some elements.
This breaks logic out from `add_pytesseract_bbox_to_elements` to new
functions `_get_element_box` and
`convert_multiple_coordinates_to_new_system`. It also updates the logic
to check that the current bounding box matches the first character of
the element's text (as to avoid the `~` characters that
`pytesseract.image_to_boxes` includes, but are not present in
`pytesseract.image_to_string`.
### Testing
```
from unstructured.partition.image import partition_image
from PIL import Image, ImageDraw
filename="example-docs/layout-parser-paper-with-table.jpg"
elements = partition_image(filename=filename, strategy="ocr_only")
image = Image.open(filename)
draw = ImageDraw.Draw(image)
for i, element in enumerate(elements):
print(i, element.metadata.coordinates)
if element.metadata.coordinates:
draw.polygon(element.metadata.coordinates.points, outline="red", width=2)
output = "example-docs/box-layout-parser-paper-with-table.jpg"
image.save(output)
image.close()
```
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
`partition_pdf` allows for passing a `model_name` parameter. Given the
similarity between the image and PDF pipelines, the expected behavior is
that `partition_image` should support the same parameter, but
`partition_image` was unintentionally not passing along its `kwargs`.
This was corrected by adding the kwargs to the downstream call.
#### Testing:
```python
from unstructured.partition.image import partition_image
output1 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="detectron2_onnx")
output2 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="yolox")
# These shouldn't be the same, since they were produced using different models.
assert output1 != output2
```
The assertion should fail on `main`, but pass on this branch.
This PR adds an arg to the html partition flow called `source_format` if
anything other than "html" we will return non-HTML elements to conform
with the file type we received.
addresses: https://github.com/Unstructured-IO/unstructured/issues/726
Two changes:
1. Improved mapping of `chipper` element types `Headline` (to `Title`),
`Subheadline`(to `Title`) and `Abstract`( to `NarrativeText`.
2. New element metadata `category_depth`: `None` unless is `Headline`
(`category_depth=1`), or `Subheadline` (`category_depth=2`). The update
of `category_depth` happens during the transform
`normalize_layout_element`.
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: LaverdeS <LaverdeS@users.noreply.github.com>
Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
Co-authored-by: Benjamin Torres <benjamin@unstructured.io>
## **Summary**
By adding hierarchy to unstructured elements, users will have more
information for implementing vector db/LLM chunking strategies. For
example, text elements could be queried by their preceding title
element. The hierarchy is implemented by a parent_id tag in the
element's metadata.
### Features
- Introduces a parent_id to ElementMetadata (The id of the parent
element, not a pointer)
- Creates a rule set for assigning hierarchies. Sensible default is
assigned, with an optional override parameter
- Sets element parent ids if there isn't an existing parent id or
matches the ruleset
### How it works
Hierarchies are assigned via a parent id field in element metadata.
Elements are read sequentially and evaluated against a ruleset. For
example take the following elements:
1. Title, "This is the Title"
2. Text, "this is the text"
And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of
2 will be the id of 1. The algorithm for determining this is more
complex and resolves several edge cases, so please read the code for
further details.
### Schema Changes
```
@dataclass
class ElementMetadata:
coordinates: Optional[CoordinatesMetadata] = None
data_source: Optional[DataSourceMetadata] = None
filename: Optional[str] = None
file_directory: Optional[str] = None
last_modified: Optional[str] = None
filetype: Optional[str] = None
attached_to_filename: Optional[str] = None
+ parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None
+ category_depth: Optional[int] = None
...
```
### Testing
```
from unstructured.partition.auto import partition
from typing import List
elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto")
for element in elements:
print(
f"Category: {getattr(element, 'category', '')}\n"\
f"Text: {getattr(element, 'text', '')}\n"
f"ID: {element.id}\n" \
f"Parent ID: {element.metadata.parent_id}\n"\
f"Depth: {element.metadata.category_depth}\n" \
)
```
### Additional Notes
Implementing this feature revealed a possibly undesired side-effect in
how element metadata are processed. In
`unstructured/partition/common.py` the `_add_element_metadata` is
invoked as part of the `add_metadata_with_filetype` decorator for
filetype partitioning. This method is intended to add additional
information to the metadata generated with the element including
filename and filetype, however the existing metadata is merged into a
newly created metadata object rather than the other way around. Because
of the way it's structured, new metadata fields can easily be forgotten
and pose debugging challenges to developers. This likely warrants a new
issue.
I'm guessing that the implementation is done this way to avoid issues
with deserializing elements, but could be wrong.
---------
Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
**Summary**
Adds logic to combine broken numbered list for pdf fast strategy.
**Details**
Previously the document reads the numbered list items part of the
`layout-parser-paper-fast.pdf` file as:
```
'1. An off-the-shelf toolkit for applying DL models for layout detection, character'
'recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that'
'underlies the off-the-shelf usage'
'3. Comprehensive tools for efficient document image data annotation and model'
'tuning to support different levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```
Now it reads:
```
'1. An off-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the off-the-shelf usage'
'3. Comprehensive tools for efficient document image data annotation and model' tuning to support different levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```
The added logic leverages `ElementType` and `coordinates` to determine
whether the following lines is a part of the previously detected
`ListItem` or not.
**Test**
Add test that checks the element length less than original version with
broken numbered list. The test also checks whether the first detected
numbered list ends with previously broken line.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
### Summary
In order to support language functionality other than Tesseract OCR, we
want to represent languages provided for either partitioning accuracy or
OCR as a standard list of langcodes as strings. To identify element
types such as NarrativeText and Title, continue the refactor into
functions that use language checks to determine those potential
classifications.
### Details
Replaces `language` with `languages` (a list of strings) as a parameter
to `is_possible_narrative_text` and `is_possible_title`.
### Test
Call `is_possible_narrative_text` and `is_possible_title` with text in a
variety of languages and different inputs for `languages`. The resulting
element classifications should be no different from the current outputs.
ex: see `test_text_type_handles_multi_language_examples` in
`test_unstructured/partition/test_text_type.py`.
### Summary
In order to support language functionality other than Tesseract OCR, we
want to represent languages provided for either partitioning accuracy or
OCR as a standard list of langcodes as strings.
### Details
Follows the pattern established with PDFs in #1334. Adds languages (a
list of strings) as a parameter to partition in auto.py. Marks
ocr_languages for deprecation.
### Test
Call partition with a variety of filetypes (especially pdfs/images),
strategies, languages, or ocr_languages.
- inclusion of ocr_languages as a parameter should display a deprecation
warning and may proceed with partitioning if no other conflicts
- the other valid call outputs should be no different from the current
outputs
This PR does two things:
1. Adds test case (and alters sample doc) for rtf and epub files with
table
2. Adds `xls/x` file extension to `skip_infer_table_types` default list
---------
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Currently there are some cases when `partition_pdf` is run using the
`hi_res` strategy, in which elements can come back with category
`UncategorizedText`. This happens when the detection model fails to
detect an element, but we're able to find it anyway either because it
was embedded in the PDF, or we found it using OCR.
This commit is to allow for attempting to categorize these uncategorized
elements using our text-based classification function,
`element_from_text`.
### Summary
In order to support language functionality other than Tesseract OCR, we
want to represent languages provided for either partitioning accuracy or
OCR as a standard list of langcodes as strings.
### Details
Adds `languages` (a list of strings) as a parameter to pdf partitioning
functions. Marks `ocr_languages` for deprecation. Adds a new file
`lang.py` for language-related helper functions.
Coming up: langcode standardization, language detection
### Test
Call `partition_pdf` or `partition_pdf_or_image` with a variety of
strategies, languages, or `ocr_languages`.
- inclusion of `ocr_languages` as a parameter should display a
deprecation warning
- the other valid call outputs should be no different from the current
outputs.
ex:
```
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename="example-docs/DA-1p.pdf", strategy="hi_res", languages=["eng", "spa"])
print("\n\n".join([str(el) for el in elements]))
```
### Summary
Partial solution to #1185.
Related to #1222.
Creates decorator from `chunk_by_title` cleaning brick.
Breaks a document into sections based on the presence of Title elements.
Also starts a new section under the following conditions:
- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is 1500. The **chunking function does not split individual
elements**, so it's possible for a section to exceed that threshold if
an individual element if over `new_after_n_chars characters`, which
could occur with a long NarrativeText element.
Combines sections under these conditions
- Sections under `combine_under_n_chars` characters are combined. The
default is 500.
### Testing
from unstructured.partition.html import partition_html
url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")
for chunk in chunks:
print(chunk)
print("\n\n" + "-"*80)
input()
Adding table extraction to HTML partitioning.
This PR utilizes 'table' HTML elements to extract and parse HTML tables
and return them in partitioning.
```
# checkout this branch, go into ipython shell
In [1]: from unstructured.partition.html import partition_html
In [2]: path_to_html = "{html sample file with table}"
In [3]: elements = partition_html(path_to_html)
```
you should see the table in the elements list!
The default sorting algorithm for PDF's, "xycut," would cause an error
when partitioning a document if Y coordinate points were negative. This
change checks for that condition (or more broadly, any negative
coordinates) and falls back to the "basic" sort if that is the case.
This PR does not address the underlying issue of "bad points" which
still should be investigated. However, the sorting code should be less
brittle to unexpected bounding boxes in the first case.
Resolves: https://github.com/Unstructured-IO/unstructured/issues/1296
If a layout model is used from unstructured-inference, you get back
class probabilities in the element metadata from partition.
extra-pdf-image-in in requirements already has the newest version of
unstructured-inference in there without a pinned version. Is there any
place else that the unstructured-inference version needs to be updated
to the required release version, 0.5.22?
### Summary
Closes#1230. Updates `partition_html` to split on `<br>` tags that
appear within text elements.
### Testing
The following is code previously produced one giant element on `main`.
```python
from unstructured.partition.html import partition_html
filename = "example-docs/ideas-page.html"
elements = partition_html(filename=filename)
len(elements) # Should be 4
print("\n\n".join([str(el) for el in elements)])
```
The output should be:
```python
January 2023
(Someone fed my essays into GPT to make something that could answer
questions based on them, then asked it where good ideas come from. The
answer was ok, but not what I would have said. This is what I would have said.)
The way to get new ideas is to notice anomalies: what seems strange,
or missing, or broken? You can see anomalies in everyday life (much
of standup comedy is based on this), but the best place to look for
them is at the frontiers of knowledge.
Knowledge grows fractally.
From a distance its edges look smooth, but when you learn enough
to get close to one, you'll notice it's full of gaps. These gaps
will seem obvious; it will seem inexplicable that no one has tried
x or wondered about y. In the best case, exploring such gaps yields
whole new fractal buds.
```
- revert the layout parser fast pdf file to original with just two pages
- add a new file that has one empty page and one page says "this page is
intentionally left blank" for tests
This PR resolves#1247 by using the matching elements and bbox for
coordinate computation.
This PR also updates the example doc
`example-docs/layout-parser-paper-fast.pdf` so that it includes a true
blank page and a page with text "this page is intentionally left blank".
This change helps us testing:
- differences between fast and hi_res
- code handling empty pages in between pages with contents (which
triggers the bug found in #1247 )
Lastly, this PR updates the names of the variables inside
`_partition_pdf_or_image_with_ocr` so that matching inputs all starts
with `_` like `_elements`, `_text`, and `_bboxes` to improve
readability.
This change also improves partition performance for multi-page pdfs as
it reduces the amount of iterations inside
`add_pytesseract_bbox_to_elements`. Testing locally on m2 mac + Rocky
docker shows it reduces partition time for DA-619p.pdf file from around
1min to around 23s.
### Summary
Closes#1229. Updates `partition_xml` so that the element type is
inferred on each leaf node when `xml_keep_tags=False` instead of
delegating splitting and partitioning to `partition_xml`. If
`xml_keep_tags=True`, the file is treated like a text file still and
partitioning is still delegated to `partition_text`.
Also adds the option to pass `text` as an input to `partition_xml`.
### Testing
Create a `parrots.xml` file that looks like:
```xml
<xml><parrot><name>Conure</name><description>A conure is a very friendly bird.
Conures are feathery and like to dance.</description></parrot></xml>
```
Run:
```python
from unstructured.partition.xml import partition_xml
from unstructured.staging.base import convert_to_dict
elements = partition_xml(filename="parrots.xml")
convert_to_dict(elements)
```
One `main`, the output is the following. Notice how the `<name>` tag
incorrectly gets merged into `<description>` in the first element.
```python
[{'element_id': '7ae4074435df8dfcefcf24a4e6c52026',
'metadata': {'file_directory': '/home/matt/tmp',
'filename': 'parrots.xml',
'filetype': 'application/xml',
'last_modified': '2023-08-30T14:21:38'},
'text': 'Conure A conure is a very friendly bird.',
'type': 'NarrativeText'},
{'element_id': '859ecb332da6961acd2fb6a0185d1549',
'metadata': {'file_directory': '/home/matt/tmp',
'filename': 'parrots.xml',
'filetype': 'application/xml',
'last_modified': '2023-08-30T14:21:38'},
'text': 'Conures are feathery and like to dance.',
'type': 'NarrativeText'}]
```
One the feature branch, the output is the following, and the tags are
correctly separated.
```python
[{'element_id': '5512218914e4eeacf71a9cd42c373710',
'metadata': {'file_directory': '/home/matt/tmp',
'filename': 'parrots.xml',
'filetype': 'application/xml',
'last_modified': '2023-08-30T14:21:38'},
'text': 'Conure',
'type': 'Title'},
{'element_id': '113bf8d250c2b1a77c9c2caa4b812f85',
'metadata': {'file_directory': '/home/matt/tmp',
'filename': 'parrots.xml',
'filetype': 'application/xml',
'last_modified': '2023-08-30T14:21:38'},
'text': 'A conure is a very friendly bird.\n'
'\n'
'Conures are feathery and like to dance.',
'type': 'NarrativeText'}]
```
Update `test_json` to not use auto partition due to dependencies. Previously, to run `test_json` requires full requirements installation library to read file types, including but not limited to, docx, pptx, as well as others. Therefore the test will raise error with base installation. With the update, this fix also add to other test files to check its invariant with `elements_to_json`.
### Summary
Closes#1018. Enables `partition_email` and `partition_msg` to detect if
an email has PGP encrypted content. Based on the specification in [RFC
2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based
on the example email in the spec. If PGP detected content is detected, a
warning is emitted and an empty set of lists is returned.
### Testing
```python
from unstructured.partition_email import partition_email
filename = "example-docs/eml/fake-encrypted.eml"
partition_email(filename=filename)
```
```python
from unstructured.partition_msg import partition_msg
filename = "example-docs/fake-encrypted.msg"
partition_msgl(filename=filename)
```
### Summary
Closes#1184. Updates `partition_html` to respect the ordering of
`<pre>` tags in HTML documents.
### Testing
The elements in the following example should be in the correct order.
```python
from unstructured.partition.html import partition_html
html_text = """
<pre>The Big Brown Bear</pre>
<div>The big brown bear is growling.</div>
<pre>The big brown bear is sleeping.</pre>
<div>The Big Blue Bear</div>
"""
elements = partition_html(text=html_text)
print("\n\n".join([str(el) for el in elements]))
```
### Summary
Address
[#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for
`hi_res` and `fast` strategies. The `ocr_only` strategy does not include
coordinates.
- add functionality to switch sort mode between the current `basic`
sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the
`xy-cut` sorting approach
### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
Add test case test_partition_image_with_multipage_tiff that reads multipage TIFF file and
- confirms that the function reads all the pages in the TIFF.
- page number is added to the metadata
This PR is branched from and developed on top of 6d6be99 commit.
### Summary
Closes#1007. Adds a deprecation warning for the `file_filename` kwarg
to `partition`, `partition_via_api`, and `partition_multiple_via_api`.
Also catches a warning in `ebooklib` that we do not want to emit in
`unstructured`.
### Testing
```python
from unstructured.partition.auto import partition
filename = "example-docs/winter-sports.epub"
# Should not emit a warning
with open(filename, "rb") as f:
elements = partition(file=f, metadata_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename
# Should emit a warning
with open(filename, "rb") as f:
elements = partition(file=f, file_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename
# Should raise an error
with open(filename, "rb") as f:
elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub")
```
- fixes#1079 where partitioning is happening twice in the case of
`strategy="ocr_only"`
- only calls `extractable_elements` if we can predetermine that
`ocr_only` is not a possible strategy even if it was the intended
strategy.
- Adds additional assertion test that `_partition_pdf_or_image_with_ocr`
is not called when falling back to `fast` from `ocr_only`
* pip-compile in order to bump unstructured-inference
* Set the default `ocr_mode` back to `enitre_page` now that [this
error](https://github.com/Unstructured-IO/unstructured-inference/pull/183)
is addressed
* Explicitly add `sphinx-tabs` to `build.in`. This file provides
`docs/requirements.txt`.
* Remove a pinned `pydantic` version
* Fix a makefile command to `pip-compile` a missing ingest file.
### Summary
Updates `partition` to let users know to installs the appropriate extras
if they're missing. Prior to this PR, users would get an exception
stating `partition_pdf` (or whichever function that requires extras)
does not exist.
### Testing
First `pip uninstall ebooklib`. Then run
```python
from unstructured.partition.auto import partition
partition(filename="example-docs/winter-sports.epub")
```
The error should look like
```python
ImportError: partition_epub is not available. Install the epub dependencies with pip install "unstructured[epub]"
```
**Summary**
Closes#747
* Create CI Pipeline for running text, xml, email, and html doc tests
against the library installed without extras
* Create CI Pipeline for running each library extra against their
respective tests
### Summary
Closes#1027
The msg test in question was no longer failing after removing the
quick-fix and comment explaining the issue. However, the test was not
functioning as intended. Test was refactored to appropriately test
`metadata_last_modified` of attachments.
`partition_msg` was then updated to pass `metadata_last_modified` to
`attachment_partitioner`.
The same was done for email partitioning.
### Testing
```
from unstructured.partition.text import partition_text
from unstructured.partition.msg import partition_msg
from unstructured.partition.email import partition_email
filename="example-docs/fake-email-attachment.msg"
elements = partition_msg(filename=filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00")
# previously, these were different values because last_modified wasn't being updated in attachments
elements[1].metadata.last_modified
elements[-1].text
elements[-1].metadata.last_modified
email_filename="example-docs/eml/fake-email-attachment.eml"
email_elements = partition_email(filename=email_filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00")
email_elements[1].metadata.last_modified
email_elements[-1].text
email_elements[-1].metadata.last_modified
```
Set to individual_blocks for now to work around [this
bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179).
I verified by printing the current ocr_mode in inference. The
`entire_page` default is overridden.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: awalker4 <awalker4@users.noreply.github.com>
The reason this test is failing is the API is returning "fast" results
when "hi_res" is requested, which is being tracked in this ticket:
https://github.com/Unstructured-IO/unstructured-api/issues/188 .
This failure was only showing up on the `main` branch, per the commented
out `pytest` skips.
Handle Content-Disposition: inline and attachment without filename
* Add new email test example and test with Content-Disposition: inline.
* Move attachment_info above for loop so it is always defined
* Check if item is inline as well as attachment as these both lack an = character to split on
* Create filename if filename is not specified and write file.
* Update list_attachments with new filename
Fix attachments with = in filename
* Limit split to first match of = to prevent creating a list of more than two parts
* Add example email with attachment name and test for issue
* feat: add functionality to check if a string contains any emoji characters
* feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji
* chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use
* chore: update changelog & version
* chore: update changelog & version
* chore: update dependencies
* test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename`
* chore: update changelog & version
* feat: add functionality to switch html text parser based on whether the html text contains emoji
* chore: update changelog & version
* fix lint errors
* test: revert the `EXPECTED_XLS_TEXT_LEN` value back
* feat: always use `soupparser_fromstring` to parse `html text`
* fix lint error
* feat: add functionality to check if a string contains any emoji characters
* feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji
* chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use
* chore: update changelog & version
* chore: update changelog & version
* chore: update dependencies
* test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename`
* chore: update changelog & version