### Summary
Uses `langdetect` to detect all languages present in the input document.
### Details
- Converts all language codes (whether user inputted or detected using
`langdetect`) to a standard ISO 639-3 code.
- Adds `languages` field to the metadata
- Will revisit how to nonstandardly represent simplified vs traditional
Chinese scripts internally (separate PR).
- Update ingest test results to add `languages` field to documents. Some
other side effects are changes in order of some elements and changes in
element categorization
### Test
You can test the detect_languages function individually by importing the
function and inputting a text sample and optionally a language:
```
text = "My lubimy mleko i chleb."
doc_langs = detect_languages(text)
print(doc_langs)
```
-> ['ces', 'pol', 'slk']
---------
Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
## **Summary**
By adding hierarchy to unstructured elements, users will have more
information for implementing vector db/LLM chunking strategies. For
example, text elements could be queried by their preceding title
element. The hierarchy is implemented by a parent_id tag in the
element's metadata.
### Features
- Introduces a parent_id to ElementMetadata (The id of the parent
element, not a pointer)
- Creates a rule set for assigning hierarchies. Sensible default is
assigned, with an optional override parameter
- Sets element parent ids if there isn't an existing parent id or
matches the ruleset
### How it works
Hierarchies are assigned via a parent id field in element metadata.
Elements are read sequentially and evaluated against a ruleset. For
example take the following elements:
1. Title, "This is the Title"
2. Text, "this is the text"
And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of
2 will be the id of 1. The algorithm for determining this is more
complex and resolves several edge cases, so please read the code for
further details.
### Schema Changes
```
@dataclass
class ElementMetadata:
coordinates: Optional[CoordinatesMetadata] = None
data_source: Optional[DataSourceMetadata] = None
filename: Optional[str] = None
file_directory: Optional[str] = None
last_modified: Optional[str] = None
filetype: Optional[str] = None
attached_to_filename: Optional[str] = None
+ parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None
+ category_depth: Optional[int] = None
...
```
### Testing
```
from unstructured.partition.auto import partition
from typing import List
elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto")
for element in elements:
print(
f"Category: {getattr(element, 'category', '')}\n"\
f"Text: {getattr(element, 'text', '')}\n"
f"ID: {element.id}\n" \
f"Parent ID: {element.metadata.parent_id}\n"\
f"Depth: {element.metadata.category_depth}\n" \
)
```
### Additional Notes
Implementing this feature revealed a possibly undesired side-effect in
how element metadata are processed. In
`unstructured/partition/common.py` the `_add_element_metadata` is
invoked as part of the `add_metadata_with_filetype` decorator for
filetype partitioning. This method is intended to add additional
information to the metadata generated with the element including
filename and filetype, however the existing metadata is merged into a
newly created metadata object rather than the other way around. Because
of the way it's structured, new metadata fields can easily be forgotten
and pose debugging challenges to developers. This likely warrants a new
issue.
I'm guessing that the implementation is done this way to avoid issues
with deserializing elements, but could be wrong.
---------
Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
### Description
Update all other connectors to use the new downstream architecture that
was recently introduced for the s3 connector.
Closes#1313 and #1311
* split dependencies by document type
* make pip-compile with new requirements
* add extra requirements to setup.py
* add in all docs; re pip-compile
* extra for all docs
* add pandas to xlsx
* dependency requires for tsv and csv
* handling for doc, docx and odt
* dependency check for pypandoc
* required dependencies for pandoc files
* xml and html
* markdown
* msg
* add in pdf
* add in pptx
* add in excel
* add lxml as base req
* extra all docs for local inference
* local inference installs all
* pin pillow version
* fixes for plain text tests
* fixes for doc
* update make commands
* changelog and version
* add xlrd
* update pip-compile
* pin numpy for python 3.8 support
* more constraints
* contraint on scipy
* update install docs
* constrain ipython
* add outlook to pip-compile
* more ipython constraints
* add extras to dockerfile
* pin office365 client
* few doc tweaks
* types as strings
* last pip-compile
* re pip-comple
* make tidy
* make tidy
* add support for page numbers in docx when present
* version and changelog
* add comment on page numbers
* add header and footer to doc elements list
* update integrations docs
* include_page_breaks kwarg for doc and docx
* merge element metadata for pagebreaks
* fix typo
* fix changelog typo
* change page number default to None
* add initial_page_number kwarg
* make page number tests in pdf more explicit
* revert test file
* update ingest tests
* update test fixture outputs
* updates to IRS forms fixtures
* ingest-test-fixtures-update
* Update ingest test fixtures (#759)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
---------
Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>