208 Commits

Author SHA1 Message Date
Noah Greer
fa0a5afb71
docs: correct spelling of partition in docs (#1104)
Fixes a typo in several places where the word `partition` is misspelled
as `partiton`
2023-08-12 14:57:27 -07:00
John
f63a66dbef
Capture section and chapter in the metadata for epubs under epub_section (#1005)
Capture section and chapter in the metadata for epubs under epub_section.
Closes GitHub issue #459
2023-08-12 21:02:06 +00:00
Ronny H
0d5b5a0e79
Revamp README & Bricks documentation (#1103)
Reorganize README.md
2023-08-12 19:58:51 +00:00
Ahmet Melek
627f78c16f
feat: airtable connector (#1012)
* add the first version of airtable connector

* change imports as inline to fail gracefully in case of lacking dependency

* parse tables as csv rather than plain text

* add relevant logic to be able to use --airtable-list-of-paths

* add script for creation of resources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings

* fix ingest test names

* add scripts for the large table test

* remove large table test from diff test

* make base and table ids explicit

* add and remove comments

* use -ne instead of !=

* update code based on the recent ingest refactor, update changelog and version

* shellcheck fix

* update comments

* update check-num-rows-and-columns-output error message

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* update help comments

* update help comments

* update help comments

* update workflows to set auth tokens and to run make install

* add comments on create_scale_test_components

* separate component ids from the test script, add comments to document test component creation

* add LARGE_BASE test, implement LARGE_BASE component creation, replace component id

* shellcheck fixes

* shellcheck fixes

* update docs

* update comment

* bump version

* add wrongly deleted file

* sort columns before saving to process

* Update ingest test fixtures (#1098)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-08-11 12:02:51 -07:00
Matt Robinson
fa5a3dbd81
feat: unique_element_ids kwarg for UUID elements (#1085)
* added kwarg for unique elements

* test for unique ids

* update docs

* changelog and version
2023-08-11 11:02:37 +00:00
cragwolfe
6779918406
build(release): bump unstructured-inference (#1074)
* build(release): bump unstructured-inference

Related to downstream issue:
Unstructured-IO/unstructured-api#182

And upstream PR:
Unstructured-IO/unstructured-inference#165

---------

Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
2023-08-10 20:57:46 +00:00
Yuming Long
112347aa0d
doc: update API doc to sync with new parameter in prod API (#1049)
* doc doc

* changelog and version

* sample docs -> example docs

* nit on compute cost doc

* pass empty dict not none

* note note

* cutting release
2023-08-09 11:09:37 -04:00
kravetsmic
25ca5744cf
feat: optionally ignore header and footer tags in partition html (#1013)
---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 21:56:33 +00:00
kravetsmic
bef93aef6e
fix: email addresses shouldn't be flagged as titles (#957)
* feat: add func for checking on EmailAddress type

* feat: add EmailAddress type

* feat: add check for email type

* feat: add test for checking EmailAddress type

* feat: update existing example files with email

* feat: add new example files with email in the text

* fix: apply linter

* feat: update changelog file

* feat: add test for is_email_address function

* don't push

* fix: clean up code

* apply linter

* fix: clean up

* fix: remove file changes

* fix: remove unused files for email address test

* fix: remove not necessary tests

* clean up

* fix: apply linter

* fix: update CHANGELOG

* fix: change version

* fix: fix msg test

* fix: apply linter for tests

* fix: remove spaces

* fix: apply linter with longer line

* feat: update documentation

* fix: remove duplicates

* Update getting_started.rst

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 11:28:36 -04:00
Hynek Kydlíček
47b20119c3
fix: extract emojis with partition_xlsx (#1009)
* 🐛 fixed emoji xlsx bug

* update version and changelog

* check if beautifulsoup exists

* update docs

* fix html parser call

* fix failing attachment test

* added emoji test, added requirement, fixed dependency

* 🐛 dependency

* 🐛 correct dependency

* linting, linting, linting

* check for bs4

* skip auto xls filename test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 10:14:08 -04:00
kravetsmic
73eeae852e
feat: add filter element types as post processing function (#1014)
* don't push

* enhancement: improve json detection by detect_filetype (#971)

* update regex pattern

* improve json regex pattern checks and add test file

* update file name

* update tests and formatting

* update changelog and version

* refactor: simplifies JSON detection and add tests (#975)

* refactor json detection

* version and changelog

* fix mock in test

* feat: adds Outlook connector (#939)

* bonus: fixes issue with email partitioning where From field was being assigned the To field value.

* Roman/expose dpi param (#966)

* Bump inference version

* Pass through the dpi param if available

* Update CHANGELOG

* Check dpi param passed in via unit test

* Bump inference version

* Fix unit test around file info to work on mac as well

* chore: cleanup changelog for 0.8.2 (#976)

* Update `partition_via_api` to not post a strategy value if not user specified (#967)

* remove default strategy

* working on test

* fixed test, coordinates param needed to be included

* nits

* update changelog

* lint

* update requirements

* build(release): cut 0.8.4 release (#979)

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv partition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name only if possible

* feat: add csv tests

* fix: renaming

* feat: add field metadata_date as date of last mod

* feat: add tests for partition_docx

* feat: add field metadata_date to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coordinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst partition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Chore: add uns api repo unittests (#954)

* stage

* git clone

* ci ignore markdown file

* make install

* use env instead

* remove md

* add script

* wrong env value

* add note

* maybe don't rm

* no cd../

---------

Co-authored-by: cragwolfe <crag@unstructured.io>

* fix: handling for empty tables in word docs and powerpoints (#982)

* fix table index error

* changelog and version

* fix: only download nltk packages if necessary (#985)

* fix: only download nltk if necessary

* changelog and version

* Chore: Pass table support param to partition image (#973)

* add param and test in image table extraction

* version and changelog

* need to publish this one for api repo

* add new param skip_infer_table_types

* use warning

* clean up with mapping

* add test for tsv

* fix test fail

* weird change from merge

* doc nit

* don't use mapping

* correct conflict

* Update pip in makefile (#981)

* update pip in makefile

* merge and update requirements

* update version

* update outlook requirements

* chore: remove debug printing (#988)

* fix: correct nltk download arg order (#991)

* fix: correct download order to nltk args

* add smoke test for tokenizers

* Chore: put back function `split_by_paragraph` (#992)

* put back function

* not really fixes

* don't push

* fix: clean up code

* fix: clean up

* fix: clean up

* Roman/ingest refactor (#978)

* Pull out s3 code as subcommand

* Pull out dropbox code as subcommand

* Pull out azure code as subcommand

* Pull out fsspec code as subcommand

* Pull out github code as subcommand

* Pull out gitlab code as subcommand

* Pull out reddit code as subcommand

* Pull out slack code as subcommand

* Pull out discord code as subcommand

* Pull out wikipedia code as subcommand

* Pull out gdrive code as subcommand

* Pull out biomed code as subcommand

* rename parameters

* Pull out onedrive code as subcommand

* Pull out outlook code as subcommand

* Pull out local code as subcommand

* Pull out elasticsearch code as subcommand

* Pull out confluence code as subcommand

* Drop previous main file

* update changelog

* Add back in mp.Pool

* Fix mypy issues with click

* Make sure all tests run with verbose flag

* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data

* Pull out some more shared options

* Support running code via python as well as cli

* update ingest readme and move it to the ingest folder

* update usage in connector docs

* move local command arg in test

* Separate out cli code from logic running unstructured

* Make some cli fields required rather than optional

* rename process -> processor

* Improve logger to avoid duplicate handlers

---------

Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* feat: adds Box connector (#996)

* chore: rename Element's "date" field to "last_modified" (#997)

Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.

* don't push

* fix: remove prints

* remove unused file

* fix: apply linter

* feat: add post processing filter_element_types

* feat: add tests for filter_element_types

* feat: update changelog

* feat: add doc string for filter_element_types

* fix: change the version

* feat: update documentation

* bump dev version number

* cleanup changelog

* linting, linting, linting

---------

Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: David Potter <potterdavidm@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-08-03 10:50:35 -04:00
Yuming Long
5af717f16f
doc: update API doc to sync with new parameter in prod API (#1020)
* sync readme
2023-08-01 19:47:32 -04:00
Matt Robinson
331c7faf38
build(deps): split up dependencies by document type (#986)
* split dependencies by document type

* make pip-compile with new requirements

* add extra requirements to setup.py

* add in all docs; re pip-compile

* extra for all docs

* add pandas to xlsx

* dependency requires for tsv and csv

* handling for doc, docx and odt

* dependency check for pypandoc

* required dependencies for pandoc files

* xml and html

* markdown

* msg

* add in pdf

* add in pptx

* add in excel

* add lxml as base req

* extra all docs for local inference

* local inference installs all

* pin pillow version

* fixes for plain text tests

* fixes for doc

* update make commands

* changelog and version

* add xlrd

* update pip-compile

* pin numpy for python 3.8 support

* more constraints

* constraint on scipy

* update install docs

* constrain ipython

* add outlook to pip-compile

* more ipython constraints

* add extras to dockerfile

* pin office365 client

* few doc tweaks

* types as strings

* last pip-compile

* re pip-compile

* make tidy

* make tidy
2023-08-01 11:31:13 -04:00
David Potter
1542607892
feat: adds Box connector (#996)
2023-08-01 01:10:10 +00:00
Roman Isecke
28214a6cc3
Roman/ingest refactor (#978)
* Pull out s3 code as subcommand

* Pull out dropbox code as subcommand

* Pull out azure code as subcommand

* Pull out fsspec code as subcommand

* Pull out github code as subcommand

* Pull out gitlab code as subcommand

* Pull out reddit code as subcommand

* Pull out slack code as subcommand

* Pull out discord code as subcommand

* Pull out wikipedia code as subcommand

* Pull out gdrive code as subcommand

* Pull out biomed code as subcommand

* rename parameters

* Pull out onedrive code as subcommand

* Pull out outlook code as subcommand

* Pull out local code as subcommand

* Pull out elasticsearch code as subcommand

* Pull out confluence code as subcommand

* Drop previous main file

* update changelog

* Add back in mp.Pool

* Fix mypy issues with click

* Make sure all tests run with verbose flag

* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data

* Pull out some more shared options

* Support running code via python as well as cli

* update ingest readme and move it to the ingest folder

* update usage in connector docs

* move local command arg in test

* Separate out cli code from logic running unstructured

* Make some cli fields required rather than optional

* rename process -> processor

* Improve logger to avoid duplicate handlers

---------

Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-07-31 13:20:10 -04:00
shreyanid
c3e92057f2
Update pip in makefile (#981)
* update pip in makefile

* merge and update requirements

* update version

* update outlook requirements
2023-07-27 21:38:51 +00:00
cragwolfe
1e2d531bb9
build(release): cut 0.8.4 release (#979)
2023-07-26 18:01:31 +00:00
shreyanid
71a24b2887
Update partition_via_api to not post a strategy value if not user specified (#967)
* remove default strategy

* working on test

* fixed test, coordinates param needed to be included

* nits

* update changelog

* lint

* update requirements
2023-07-26 09:56:39 -07:00
David Potter
f7e46af22f
feat: adds Outlook connector (#939)
* bonus: fixes issue with email partitioning where From field was being assigned the To field value.
2023-07-26 04:09:26 +00:00
Matt Robinson
6e852cbe70
feat: track links from anchor tags in partition_html (#959)
* track tags in html

* pass through links as metadata

* add test for grabbing links

* one more link

* changelog and version

* update docs

* fix tests

* update empty link assertion

* ingest-test-fixtures-update

* Update ingest test fixtures (#961)
2023-07-24 18:28:56 +00:00
Jack Retterer
708714dab5
docs: fixed typo in Installation guide (#945)
2023-07-21 13:33:44 +00:00
Ronny H
31511793cb
Update README and API doc for Chipper announcement (#940)
Update README and API doc for Chipper model beta version announcement
2023-07-19 13:00:37 -07:00
Emily Chen
4b1e5a8057
Publicly document OneDrive connector (#949)
2023-07-18 16:37:44 -07:00
Yuming Long
067eb5701f
Fix: docker build with missing dependency (#931)
* pip-compile

* test trigger

* Revert "test trigger"

This reverts commit 69d4c8cd9f285f6ef4bf445f5fb27b5c62e1391c.

* version conflict and pip compile
2023-07-14 22:20:11 +00:00
fran-unstructured
dd4bb752e2
docs: Add Unstructured API documentation (#928)
* fran-unstructured/Add unstructured API documentation

* fran-unstructured/add api docs to index.rst

* fran-unstructured/add api docs with changes requested
2023-07-14 18:28:57 +00:00
fran-unstructured
26da51c765
docs: Add source code links to bricks' docs (#923)
Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
2023-07-13 17:27:47 +00:00
fran-unstructured
f7b3c0f741
docs: adds connectors' documentation (#917)
* Add connectors documentation

* Add connectors documentation with corrections and index.rst update

* Add connectors documentation - add API information

---------

Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-07-12 14:56:09 -04:00
Matt Robinson
a583d47b84
docs: update table and API documentation (#919)
* more detailed api docs

* add table docs

* remove rtf/epubs comment

* remove confusing request_kwargs verbiage

* add missing a
2023-07-12 12:59:59 -04:00
Matt Robinson
b3936893b8
build: add python 3.11 to CI (#908)
* remove argilla; bump reqs

* enable py 3.11

* add 3.11 to setup.py

* make pip-compile

* ignore cli mypy errors

* install argilla

* fix constraints

* install argilla

* changelog and version

* skip argilla in docker

* dont import argilla in docker

* skip all of argilla if in container

* only import argilla if outside docker

* more docker skips

* remove weird pypi settings
2023-07-10 18:52:25 +00:00
Emily Chen
24ebd0fa4e
chore: Move coordinate details from Element model to a metadata model (#827)
2023-07-05 11:25:11 -07:00
Matt Robinson
c581a33c8a
feat: attachment processing for emails (#855)
* process attachments for email

* add attachment processing to msg

* fix up metadata for attachments

* add test for processing email attachments

* added test for processing msg attachments

* update docs

* tests for error conditions

* version and changelog
2023-06-29 18:01:12 -04:00
Matt Robinson
44411ecc59
enhancement: max_partition kwarg for limiting element size (#818)
* add max partition size logic

* work splitting logic into split_by_paragraph

* pass through max_partition to other functions

* added test for splitting long document

* add type hint

* add documentation

* version and changelog

* ingest-test-fixtures-update

* Update ingest test fixtures (#819)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* retrigger ci

* ingest-test-fixtures-update

* ingest-test-fixtures-update

* Update ingest test fixtures (#821)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* update default for partition_xml

* update version for release

* update msg doc string

---------

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-06-28 15:26:01 -04:00
Martin Mauch
752e78e803
feat: partition_org for Org Mode documents (#780)
* feat: partition_org for Org Mode documents

* update version
2023-06-23 18:45:31 +00:00
David Potter
3b472cb7df
feat: add google cloud storage connector (#746)
2023-06-21 15:14:50 -07:00
Matt Robinson
c53ce117bc
fix: enable partition_html to grab content outside of <article> tags (#772)
* optionally don't assemble articles

* add test for content outside of articles

* pass kwargs in partition

* changelog and version

* update default to False

* bump version for release

* back to dev version to get another fix in the release
2023-06-20 17:07:30 +00:00
qued
db4c5dfdf7
feat: coordinate systems (#774)
Added the CoordinateSystem class for tracking the system in which coordinates are represented, and changing the system if desired.
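
As a rough illustration of what changing coordinate systems involves, here is a minimal, self-contained sketch. It does not use or reproduce the CoordinateSystem class from this commit: the `Space` dataclass and `convert_point` function are hypothetical names invented for this example, and the assumption that a system is fully described by a width, a height, and a y-axis direction is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Space:
    """Hypothetical description of a 2D coordinate system (not the library's class)."""

    width: float
    height: float
    y_down: bool  # True when y grows downward, as in image pixel coordinates


def convert_point(x: float, y: float, src: Space, dst: Space) -> tuple[float, float]:
    """Map a point from src to dst by going through relative (0..1) coordinates."""
    # Express the point as fractions of the source extent.
    fx = x / src.width
    fy = y / src.height
    # Normalize so that fy is always measured upward from the bottom edge.
    if src.y_down:
        fy = 1 - fy
    # Flip again if the destination measures y downward from the top edge.
    if dst.y_down:
        fy = 1 - fy
    return fx * dst.width, fy * dst.height


# Illustrative values only: an 800x1000 pixel image and a 400x500 page with a bottom-left origin.
pixels = Space(width=800, height=1000, y_down=True)
page = Space(width=400, height=500, y_down=False)
converted = convert_point(200, 250, pixels, page)
```

Routing every conversion through relative coordinates keeps the function symmetric: converting back with the arguments swapped returns the original point.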
2023-06-20 11:19:55 -05:00
Matt Robinson
4ea716837d
feat: add ability to extract extra metadata with regex (#763)
* first pass on regex metadata

* fix typing for regex metadata

* add dataclass back in

* add decorators

* fix tests

* update docs

* add tests for regex metadata

* add process metadata to tsv

* changelog and version

* docs typos

* consolidate to using a single kwarg

* fix test
2023-06-16 10:10:56 -04:00
John
a9b9b873b1
feat: partition_tsv for tab separated value files (#758)
* first pass at partition_tsv

* working tests

* create constants for tests and debug `make test` failure

* make check and tidy

* undo changes for testing locally

* update changelog and version

* fix bricks.rst

* refactor if statements

* make tidy

* fix README and change try/except to if/else

* update changelog and version

* fix docstring
2023-06-15 18:50:53 +00:00
Matt Robinson
a800967478
enhancements: add page numbers for word docs when available (#750)
* add support for page numbers in docx when present

* version and changelog

* add comment on page numbers

* add header and footer to doc elements list

* update integrations docs

* include_page_breaks kwarg for doc and docx

* merge element metadata for pagebreaks

* fix typo

* fix changelog typo

* change page number default to None

* add initial_page_number kwarg

* make page number tests in pdf more explicit

* revert test file

* update ingest tests

* update test fixture outputs

* updates to IRS forms fixtures

* ingest-test-fixtures-update

* Update ingest test fixtures (#759)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

---------

Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-06-15 12:21:17 -04:00
Matt Robinson
e0c477de68
docs: update slack invite link (#749) 2023-06-14 10:06:45 -04:00
Matt Robinson
053a6c6e5c
enhancement: extract headers and footers in partition_docx (#742)
* added tests for headers and footers

* add docs on headers and footers; tweak to metadata

* version and changelog
2023-06-14 09:42:59 -04:00
fran-unstructured
a313c02f69
docs: sort functions in bricks.rst in alphabetical order v2 (#728)
Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
2023-06-12 18:22:23 -04:00
Matt Robinson
c82fdb6a89
feat: partition_rst for ReStructured Text documents (#725)
* add example rst file

* filetype detection for rst files

* add partition_rst function

* add partition_rst to auto

* update readme

* update docs

* changelog and version

* pandocs -> pandoc

* fix typo
2023-06-12 19:31:10 +00:00
John
fc53277826
fix: Enable MIME type detection if libmagic is not available (#714)
* fix: Add filetype check if libmagic unavailable

* make tidy

* make check

* fix: change mime_type error to warning

* Update changelog and __version__

* fix: Add filetype to requirements
2023-06-09 17:06:21 -04:00
ryannikolaidis
2094b976cf
feat: adds data_source metadata to ElementMetadata (#690) 2023-06-07 21:22:18 -07:00
qued
01f76888e0
build(deps): add tabulate dependency (#673)
tabulate is used by functions that extract tables from Microsoft documents, but there is nothing explicitly requiring the library. This was not caught by tests, because for some reason, tabulate is in base.txt.

This PR adds the dependency to base.in (which also puts it in setup.py), and recompiles the dependencies.
2023-06-01 16:56:24 -05:00
Matt Robinson
c35fff2972
feat: Add stage_for_weaviate and schema creation function (#672)
* add weaviate docker compose

* added staging brick and tests for weaviate

* initial notebook and requirements file

* add commentary to weaviate notebook

* weaviate readme

* update docs

* version and change log

* install weaviate client

* install weaviate; skip for docker

* linting, linting, linting

* install weaviate client with deps

* comments on weaviate client

* fix module not found error for docker container

* skipped wrong test in docker

* fix typos

* add in local-inference
2023-06-01 20:48:54 +00:00
qued
d3600dd5da
build(deps): update inference version (#662)
Updated to the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay!

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-31 13:50:15 -05:00
qued
c82bad1061
build(deps): avoid version conflicts (#636)
Addresses #631.

* Uses constraints to keep dependency versions more consistent.
* Moves all dependencies to .in files which are then ingested by setup.py.
* Adds script to check consistency of all extras.
* Adds consistency check to CI.

I should note that while it shouldn't be possible to cause a conflict between base.txt and any of the extras (because base.txt constrains all the extras), it is possible to get a conflict between two of the extras files. There are ways of trying to avoid that (like constraining each file by all the files that have already been processed before it in the order given in the make pip-compile target), but the ones I could think of seemed a little overwrought and come with problems of their own. If a conflict arises, it should be flagged by CI or locally with make check-deps. When/if that happens, you can resolve the conflict by adding appropriate global constraints in requirements/constraints.txt.

Also note that if fileA.in is constrained by fileB.txt, then fileB.in should be compiled before fileA.in in the make pip-compile target. Otherwise, fileA.in will be compiled with the old version of fileB.txt, which can cause conflicts or keep dependencies from being updated properly.
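
The ordering rule above amounts to a topological sort over the "is constrained by" relation. The sketch below is not part of this PR; it only illustrates how a valid compile order could be derived. The file names in `constrained_by` and the `compile_order` helper are hypothetical, and it assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter


def compile_order(constrained_by: dict[str, set[str]]) -> list[str]:
    """Return an order in which every file comes after the files that constrain it.

    Keys are requirement files; each value is the set of files whose compiled
    output constrains that key. graphlib raises CycleError on circular constraints.
    """
    return list(TopologicalSorter(constrained_by).static_order())


# Hypothetical layout: these names are illustrative, not the repository's actual files.
constrained_by = {
    "base": set(),
    "extra-a": {"base"},
    "extra-b": {"base"},
    "dev": {"base", "extra-a"},
}
order = compile_order(constrained_by)
```

Any order returned this way compiles `base` first, so no file is ever compiled against a stale version of a file that constrains it.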
2023-05-24 22:29:35 +00:00
Matt Robinson
21c821d651
feat: add partition_csv function (#619)
* add csv into filetype detection

* first pass on csv

* add tests for csv

* add csv to auto

* version bump

* update readme and docs

* fix doc strings
2023-05-19 15:57:42 -04:00