kravetsmic 73eeae852e
feat: add filter element types as post processing function (#1014)
* don't push

* enhancement: improve json detection by detect_filetype (#971)

* update regex pattern

* improve json regex pattern checks and add test file

* update file name

* update tests and formatting

* update changelog and version

* refactor: simplifies JSON detection and add tests (#975)

* refactor json detection

* version and changelog

* fix mock in test

* feat: adds Outlook connector (#939)

* bonus: fixes issue with email partitioning where From field was being assigned the To field value.

* Roman/expose dpi param (#966)

* Bump inference version

* Pass through the dpi param if available

* Update CHANGELOG

* Check dpi param passed in via unit test

* Bump inference version

* Fix unit test around file info to work on mac as well

* chore: cleanup changelog for 0.8.2 (#976)

* Update `partition_via_api` to not post a strategy value if not user specified (#967)

* remove default strategy

* working on test

* fixed test, coordinates param needed to be included

* nits

* update changelog

* lint

* update requirements

* build(release): cut 0.8.4 release (#979)

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Chore: add uns api repo unittests (#954)

* stage

* git clone

* ci ignore markdown file

* make install

* use env instead

* remove md

* add script

* wrong env value

* add note

* maybe don't rm

* no cd../

---------

Co-authored-by: cragwolfe <crag@unstructured.io>

* fix: handling for empty tables in word docs and powerpoints (#982)

* fix table index error

* changelog and version

* fix: only download nltk packages if necessary (#985)

* fix: only download nltk if necessary

* changelog and version

* Chore: Pass table support  param to partition image (#973)

* add param and test in image table extraction

* version and changelog

* need to publish this one for api repo

* add new param skip_infer_table_types

* use warning

* clean up with mapping

* add test for tsv

* fix test fail

* weird change from merge

* doc nit

* don't use mapping

* correct conflict

* Update pip in makefile (#981)

* update pip in makefile

* merge and update requirements

* update version

* update outlook requirements

* chore: remove debug printing (#988)

* fix: correct nltk download arg order (#991)

* fix: correct download order to nltk args

* add smoke test for tokenizers

* Chore: put back function `split_by_paragraph` (#992)

* put back function

* not really fixes

* don't push

* fix: clean up code

* fix: clean up

* fix: clean up

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Roman/ingest refactor (#978)

* Pull out s3 code as subcommand

* Pull out dropbox code as subcommand

* Pull out azure code as subcommand

* Pull out fsspec code as subcommand

* Pull out github code as subcommand

* Pull out gitlab code as subcommand

* Pull out reddit code as subcommand

* Pull out slack code as subcommand

* Pull out discord code as subcommand

* Pull out wikipedia code as subcommand

* Pull out gdrive code as subcommand

* Pull out biomed code as subcommand

* rename parameters

* Pull out onedrive code as subcommand

* Pull out outlook code as subcommand

* Pull out local code as subcommand

* Pull out elasticsearch code as subcommand

* Pull out confluence code as subcommand

* Drop previous main file

* update changelog

* Add back in mp.Pool

* Fix mypy issues with click

* Make sure all tests run with verbose flag

* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data

* Pull out some more shared options

* Support running code via python as well as cli

* update ingest readme and move it to the ingest folder

* update usage in connector docs

* move local command arg in test

* Seperate out cli code from logic running unstructured

* Make some cli fields required rather than optional

* rename process -> processor

* Improve logger to avoid duplicate handlers

---------

Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* feat: adds Box connector (#996)

* chore: rename Element's "date" field to "last_modified" (#997)

Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.

* don't push

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* fix: removie prints

* remove unused file

* fix: apply linter

* feat: add post processing filter_element_types

* feat: add tests for filter_element_types

* feat: update changelog

* feat: add doc string for filter_element_types

* fix: change the version

* feat: update documentation

* bump dev version number

* cleanup changelog

* linting, linting, linting

---------

Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: David Potter <potterdavidm@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-08-03 10:50:35 -04:00
..