* Updated Metadata page: add common and additional metadata fields by
document types and connectors
* Updated the installation extras listed for specific document types and connectors
* Added embedding brick page in Sphinx TOC
* Fixed Sphinx warnings in new pages
### Summary
An initial pass on smart chunking for RAG applications. Breaks a
document into sections based on the presence of `Title` elements. Also
starts a new section under the following conditions:
- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is `1500`. The chunking function does not split individual
elements, so a section can still exceed that threshold if a single
element is over `new_after_n_chars` characters, which could occur with a
long `NarrativeText` element.
- Sections under `combine_under_n_chars` characters are combined. The
default is `500`.
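The rules above can be illustrated with a small, self-contained sketch. This is not the code in `unstructured.chunking.title`; the `Element` dataclass and the helpers `section_length`, `split_into_sections`, and `combine_small_sections` are hypothetical names introduced only for illustration, and the merge rule (fold a section into the previous one while the previous one is still under the threshold) is an assumption about how combining might work.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""

    category: str  # e.g. "Title" or "NarrativeText"
    text: str
    page_number: Optional[int] = None


def section_length(section: List[Element]) -> int:
    """Total number of characters across all elements in a section."""
    return sum(len(el.text) for el in section)


def split_into_sections(
    elements: List[Element],
    multipage_sections: bool = True,
    new_after_n_chars: int = 1500,
) -> List[List[Element]]:
    """Start a new section on a Title, a page change, or an over-long section."""
    sections: List[List[Element]] = []
    current: List[Element] = []
    for el in elements:
        page_changed = (
            not multipage_sections
            and bool(current)
            and el.page_number != current[-1].page_number
        )
        too_long = section_length(current) > new_after_n_chars
        if current and (el.category == "Title" or page_changed or too_long):
            sections.append(current)
            current = []
        # Elements are never split, so one long element can exceed the limit.
        current.append(el)
    if current:
        sections.append(current)
    return sections


def combine_small_sections(
    sections: List[List[Element]],
    combine_under_n_chars: int = 500,
) -> List[List[Element]]:
    """Assumed merge rule: append to the previous section while it is small."""
    combined: List[List[Element]] = []
    for section in sections:
        if combined and section_length(combined[-1]) < combine_under_n_chars:
            combined[-1] = combined[-1] + section
        else:
            combined.append(list(section))
    return combined


if __name__ == "__main__":
    demo = [
        Element("Title", "Intro"),
        Element("NarrativeText", "a" * 600),
        Element("Title", "Methods"),
        Element("NarrativeText", "b" * 2000),
    ]
    for section in combine_small_sections(split_into_sections(demo)):
        print([el.category for el in section], section_length(section))
```

In this sketch the length check runs before an element is appended, which is why a single long `NarrativeText` element still lands in one section even though the section then exceeds `new_after_n_chars`.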
### Testing
```python
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"

# Partition the page into elements, then group them into title-based chunks.
elements = partition_html(url=url)
chunks = chunk_by_title(elements)

# Print one chunk at a time; press Enter to advance to the next.
for chunk in chunks:
    print(chunk)
    print("\n\n" + "-" * 80)
    input()
```
Documentation Overhaul
- Added documentation hierarchy
- Added options for Bash vs Python for API & Upstream Connectors
- Added Introduction section (Overview, Key Concepts, Getting Started)
- Redid connectors section
- Installation is now broken up (needs further work)
* don't push
* enhancement: improve json detection by detect_filetype (#971)
* update regex pattern
* improve json regex pattern checks and add test file
* update file name
* update tests and formatting
* update changelog and version
* refactor: simplifies JSON detection and add tests (#975)
* refactor json detection
* version and changelog
* fix mock in test
* feat: adds Outlook connector (#939)
* bonus: fixes issue with email partitioning where From field was being assigned the To field value.
* Roman/expose dpi param (#966)
* Bump inference version
* Pass through the dpi param if available
* Update CHANGELOG
* Check dpi param passed in via unit test
* Bump inference version
* Fix unit test around file info to work on mac as well
* chore: cleanup changelog for 0.8.2 (#976)
* Update `partition_via_api` to not post a strategy value if not user specified (#967)
* remove default strategy
* working on test
* fixed test, coordinates param needed to be included
* nits
* update changelog
* lint
* update requirements
* build(release): cut 0.8.4 release (#979)
* feat: add document date for remaining file types (#930) (#969)
* feat: add document date for remaining file types (#930)
* feat: add functions for getting modification date
* feat: add date field to metadata from csv file
* feat: add tests for csv partition
* feat: add date field to metadata from html file
* feat: add tests for html partition
* fix: return file name only if possible
* feat: add csv tests
* fix: renaming
* feat: add field metadata_date as date of last mod
* feat: add tests for partition_docx
* feat: add field metadata_date to .doc file
* feat: add tests for partition_doc
* feat: add metadata_date to .epub file
* feat: add tests for partition_epub
* fix: fix test mocking
* feat: add metadata_date for image partition
* feat: add test for image partition
* feat: add coordinate system argument
* feat: add date to element metadata
* feat: add metadata_date for JSON partition
* feat: add test for JSON partition
* fix: rename variable
* feat: add metadata_date for md partition
* feat: add test for md partition
* feat: update doc string
* feat: add metadata_date for .odt partition
* feat: update .odt string
* feat: add metadata_date for .org partition
* feat: add tests for .org partition
* feat: add metadata_date for .pdf partition
* feat: add tests for .pdf partition
* feat: add metadata_date for .pptx partition
* feat: add metadata_date for .ppt partition
* feat: add tests for .ppt partition
* feat: add tests for .pptx partition
* feat: add metadata_date for .rst partition
* feat: add tests for .rst partition
* fix: get modification date after file checking
* feat: add tests for .rtf partition
* feat: add tests for .rtf partition
* feat: add metadata_date for .txt partition
* fix: rename argument
* feat: add tests for .txt partition
* feat: update doc string rst partition function
* feat: add metadata_date for .tsv partition
* feat: add tests for .tsv partition
* feat: add metadata_date for .xlsx partition
* feat: add tests for .xlsx partition
* fix: clean up
* feat: add tests for .xml partition
* feat: add tests for .xml partition
* fix: use `or ` instead of `if`
* fix: fix epub tests
* fix: remove not used code
* fix: add try block for getting file name
* fix: applying linter changes
* fix: fix test_partition_file
* feat: add metadata_date for email
* feat: add test for email partition
* feat: add metadata_date for msg
* feat: add tests for msg partition
* feat: update CHANGELOG file
* fix: update partitions doc string
* don't push
* fix: clean up code
* linting, linting, linting
* remove unnecessary example doc
* update version and changelog
* ingest-test-fixtures-update
* set metadata date in test
---------
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* ingest-test-fixtures-update
* Update ingest test fixtures (#970)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* Revert "Update ingest test fixtures (#970)"
This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.
* remove date from metadata in outputs
* update docstring ordering
* remove print
* remove print
* remove print
* linting, linting, linting
* fix version and test
* fix changelog
* fix changelog
* update version
---------
Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* Chore: add uns api repo unittests (#954)
* stage
* git clone
* ci ignore markdown file
* make install
* use env instead
* remove md
* add script
* wrong env value
* add note
* maybe don't rm
* no cd../
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
* fix: handling for empty tables in word docs and powerpoints (#982)
* fix table index error
* changelog and version
* fix: only download nltk packages if necessary (#985)
* fix: only download nltk if necessary
* changelog and version
* Chore: Pass table support param to partition image (#973)
* add param and test in image table extraction
* version and changelog
* need to publish this one for api repo
* add new param skip_infer_table_types
* use warning
* clean up with mapping
* add test for tsv
* fix test fail
* weird change from merge
* doc nit
* don't use mapping
* correct conflict
* Update pip in makefile (#981)
* update pip in makefile
* merge and update requirements
* update version
* update outlook requirements
* chore: remove debug printing (#988)
* fix: correct nltk download arg order (#991)
* fix: correct download order to nltk args
* add smoke test for tokenizers
* Chore: put back function `split_by_paragraph` (#992)
* put back function
* not really fixes
* don't push
* fix: clean up code
* fix: clean up
* fix: clean up
* Roman/ingest refactor (#978)
* Pull out s3 code as subcommand
* Pull out dropbox code as subcommand
* Pull out azure code as subcommand
* Pull out fsspec code as subcommand
* Pull out github code as subcommand
* Pull out gitlab code as subcommand
* Pull out reddit code as subcommand
* Pull out slack code as subcommand
* Pull out discord code as subcommand
* Pull out wikipedia code as subcommand
* Pull out gdrive code as subcommand
* Pull out biomed code as subcommand
* rename parameters
* Pull out onedrive code as subcommand
* Pull out outlook code as subcommand
* Pull out local code as subcommand
* Pull out elasticsearch code as subcommand
* Pull out confluence code as subcommand
* Drop previous main file
* update changelog
* Add back in mp.Pool
* Fix mypy issues with click
* Make sure all tests run with verbose flag
* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data
* Pull out some more shared options
* Support running code via python as well as cli
* update ingest readme and move it to the ingest folder
* update usage in connector docs
* move local command arg in test
* Separate out cli code from logic running unstructured
* Make some cli fields required rather than optional
* rename process -> processor
* Improve logger to avoid duplicate handlers
---------
Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* feat: adds Box connector (#996)
* chore: rename Element's "date" field to "last_modified" (#997)
Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.
* don't push
* fix: remove prints
* remove unused file
* fix: apply linter
* feat: add post processing filter_element_types
* feat: add tests for filter_element_types
* feat: update changelog
* feat: add doc string for filter_element_types
* fix: change the version
* feat: update documentation
* bump dev version number
* cleanup changelog
* linting, linting, linting
---------
Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: David Potter <potterdavidm@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* track tags in html
* pass through links as metadata
* add test for grabbing links
* one more link
* changelog and version
* update docs
* fix tests
* update empty link assertion
* ingest-test-fixtures-update
* Update ingest test fixtures (#961)
* process attachments for email
* add attachment processing to msg
* fix up metadata for attachments
* add test for processing email attachments
* added test for processing msg attachments
* update docs
* tests for error conditions
* version and changelog
* add max partition size logic
* work splitting logic into split_by_paragraph
* pass through max_partition to other functions
* added test for splitting long document
* add type hint
* add documentation
* version and changelog
* ingest-test-fixtures-update
* Update ingest test fixtures (#819)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* retrigger ci
* ingest-test-fixtures-update
* ingest-test-fixtures-update
* Update ingest test fixtures (#821)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* update default for partition_xml
* update version for release
* update msg doc string
---------
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* optionally dont assemble articles
* add test for content outside of articles
* pass kwargs in partition
* changelog and version
* update default to False
* bump version for release
* back to dev version to get another fix in the release
* first pass at partition_tsv
* working tests
* create constants for tests and debug `make test` failure
* make check and tidy
* undo changes for testing locally
* update changelog and version
* fix bricks.rst
* refactor if statements
* make tidy
* fix README and change try/except to if/else
* update changelog and version
* fix docstring
* add support for page numbers in docx when present
* version and changelog
* add comment on page numbers
* add header and footer to doc elements list
* update integrations docs
* include_page_breaks kwarg for doc and docx
* merge element metadata for pagebreaks
* fix typo
* fix changelog typo
* change page number default to None
* add initial_page_number kwarg
* make page number tests in pdf more explicit
* revert test file
* update ingest tests
* update test fixture outputs
* updates to IRS forms fixtures
* ingest-test-fixtures-update
* Update ingest test fixtures (#759)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
---------
Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* first pass on partition_xml
* add option to keep xml tags
* added tests for xml
* fix filename
* update filenames
* remove outdated readme
* add xml to auto
* version and changelog
* update readme and docs
* pass through include_metadata
* update include_metadata description
* add README back in
* linting, linting, linting
* more linting
* spooled to bytes doesnt need to be a tuple
* Add tests for newly supported filetypes
* Correct metadata filetype
* doc typo
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* keep_xml_tags -> xml_keep_tags
---------
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* first pass on partition_xlsx
* add support for files
* add test for xlsx from filename
* added filetype metadata
* add xlsx to auto
* remove fake excel from unsupported
* version and changelog
* update docs
* update readme
* fix removed file reference
* fix some more tests
* pass in metadata filename
* add include_metadata flag
* added functions for determining auto strategy
* change default strategy to auto
* tests for auto strategy
* update docs
* changelog and version
* bump version
* remove ingest file in wrong location
* update jpg output
* typo fix
* add tests for validating strategy
* refactor into determine_pdf_strategy function
* refactor pdf strategies into strategies
* remove commented out code
* remove unreachable code
* add in handling for image types
* a little more refactoring
* import ocr partitioning for images
* catch warnings, partition type for valid strategies
* fallback to ocr_only from fast
* fallback logic for hi_res
* test for fallback to ocr only
* fallback logic for ocr_only
* more tests for fallback logic
* update doc strings
* version and changelog
* linting, linting, linting
* update docs to include notes about strategy
* fix typos
* change back patched filename
* spike for ocr-only strategy for images
* fix for file processing
* extra space
* add korean to ci
* added test for ocr_only strategy
* added docs for ocr_only
* changelog and version
* added test for bad strategy
* skip korean test if in docker
* bump version
* version bump
* document valid strategies
* bump version for release
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* added filetype detection for odt
* add function for partition odt documents
* add odt files to auto
* changelog and version
* docs and readme
* update installation docs
* skip tests if not supported or in docker
* import pytest
* fix docs typos
* added function for multiple files via api
* make multiple work with files
* updated docs strings
* changelog and version
* docs and contextlib for open files
* tests for partition multiple
* add tests for error conditions
* add output example
* function to check if pdf is extractable
* add fallback logic for unextractable pdfs
* tests for docs with copy protection
* add test for unprocessable pdf
* update docs
* changelog and version
* update logic for images; reset file before proceeding
* 3 files for api tests
* docs update
* pip-compile new reqs
* bump inference version
* add language to pdf and image calls
* tests for passing in language
* version bump and changelog
* update docs
* pass ocr_languages in auto
* updated test fixtures
* typo in doc string
* refactor epub; add rtf
* added test for rtf files
* filetype detection for rtf files
* add rtf to auto
* update docs for group_broken_paragraphs
* add rtf to docs
* update file list in readme
* update stage_for_transformers docs
* changelog and version bump
* skip rtf if in docker
* skip test if rtf not supported
* docs tweaks
* cleaning brick to group broken paragraphs
* docs for group_broken_paragraphs
* add docs for partition_text with grouper
* partition_text and auto with paragraph_grouper
* version and changelog
* typo in the docs
* linting, linting, linting
* switch to using regular expressions
* added msg-parser dependency
* pass through kwargs in convert_file_to_text
* added partition_msg for processing msft outlook files
* version bump and changelog
* added tests for partition_msg
* added test for msg with plain text
* add partition_msg docs; fix underlines in integration docs
* add .msg to file list
* finish tests for auto msg
* linting, linting, linting