### Summary
Uses `langdetect` to detect all languages present in the input document.
### Details
- Converts all language codes (whether user inputted or detected using
`langdetect`) to a standard ISO 639-3 code.
- Adds `languages` field to the metadata
- Will revisit how to nonstandardly represent simplified vs traditional
Chinese scripts internally (separate PR).
- Update ingest test results to add `languages` field to documents. Some
other side effects are changes in order of some elements and changes in
element categorization
### Test
You can test the detect_languages function individually by importing the
function and inputting a text sample and optionally a language:
```
text = "My lubimy mleko i chleb."
doc_langs = detect_languages(text)
print(doc_langs)
```
-> ['ces', 'pol', 'slk']
---------
Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Adding table extraction to HTML partitioning.
This PR utilizes 'table' HTML elements to extract and parse HTML tables
and return them in partitioning.
```
# checkout this branch, go into ipython shell
In [1]: from unstructured.partition.html import partition_html
In [2]: path_to_html = "{html sample file with table}"
In [3]: elements = partition_html(path_to_html)
```
you should see the table in the elements list!
### Summary
Closes#1230. Updates `partition_html` to split on `<br>` tags that
appear within text elements.
### Testing
The following is code previously produced one giant element on `main`.
```python
from unstructured.partition.html import partition_html
filename = "example-docs/ideas-page.html"
elements = partition_html(filename=filename)
len(elements) # Should be 4
print("\n\n".join([str(el) for el in elements)])
```
The output should be:
```python
January 2023
(Someone fed my essays into GPT to make something that could answer
questions based on them, then asked it where good ideas come from. The
answer was ok, but not what I would have said. This is what I would have said.)
The way to get new ideas is to notice anomalies: what seems strange,
or missing, or broken? You can see anomalies in everyday life (much
of standup comedy is based on this), but the best place to look for
them is at the frontiers of knowledge.
Knowledge grows fractally.
From a distance its edges look smooth, but when you learn enough
to get close to one, you'll notice it's full of gaps. These gaps
will seem obvious; it will seem inexplicable that no one has tried
x or wondered about y. In the best case, exploring such gaps yields
whole new fractal buds.
```
* feat: add functionality to track emphasized text (`bold/italic` formatting) from paragraph
* chore: add docstring
* chore: fix lint errors
* feat: ignore spaces when extracting emphasized texts from a paragraph
* feat: add functionality to track emphasized text (`bold/italic` formatting) from table
* test: add test case for grabbing emphasized texts from element metadata
* chore: fix lint errors
* chore: update changelog & version
* Update ingest test fixtures (#1047)
* track tags in html
* pass through links as metadata
* add test for grabbing links
* one more link
* changelog and version
* update docs
* fix tests
* update empty link assertion
* ingest-test-fixtures-update
* Update ingest test fixtures (#961)