### Description
Convert s3 cli code to also support writing to s3. Writers are added as
optional subcommands to the parent command with their own arguments.
A custom `click.Group` is introduced to add custom formatting and text
to help messages.
To limit the scope of this PR, most existing files were not touched but
instead new files were added for the new flow. This allowed _only_ the
s3 connector to be updated without breaking any other ones.
Add test case `test_partition_image_with_multipage_tiff`, which reads a multipage TIFF file and confirms that
- the function reads all the pages in the TIFF.
- the page number is added to the metadata.
This PR is branched from and developed on top of commit 6d6be99.
* pip-compile in order to bump unstructured-inference
* Set the default `ocr_mode` back to `entire_page` now that [this
error](https://github.com/Unstructured-IO/unstructured-inference/pull/183)
is addressed
* Explicitly add `sphinx-tabs` to `build.in`. This file provides
`docs/requirements.txt`.
* Remove a pinned `pydantic` version
* Fix a makefile command to `pip-compile` a missing ingest file.
### Description
Add delta table connector and test against a delta table generated via
delta.io and uploaded to s3. Shows an example of how to use the
connection options to leverage s3.
I was able to get this to work with s3 when I pass in the access and
secret keys as storage options. Even though the s3 bucket being used is
public, it would not work without them.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Description
* Add ingest test for Notion docs
* Update default cache dir for connectors to include connector name.
Makes debugging the cached content easier.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
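
A minimal sketch of the cache directory change described above, assuming a hypothetical helper name (`default_cache_dir`) and a hypothetical base location; neither is taken from the PR. It only illustrates nesting the cache under the connector name so cached content from different connectors does not mix.

```python
from pathlib import Path
from typing import Optional


def default_cache_dir(connector_name: str, base: Optional[Path] = None) -> Path:
    """Return a per-connector cache directory (hypothetical layout)."""
    # Assumed base location; the real default is not stated in this description.
    root = base if base is not None else Path.home() / ".cache" / "unstructured" / "ingest"
    # Appending the connector name keeps each connector's downloads separate.
    return root / connector_name


print(default_cache_dir("notion", base=Path("/tmp/ingest-cache")))
```

With this layout, clearing or inspecting one connector's cache is a matter of looking at a single subdirectory.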
Bump to unstructured-inference==0.5.13, which includes:
Fix extracted image elements being included in layout merge, addresses the issue
where an entire-page image in a PDF was not passed to the layout model when using hi_res.
Pulls in fix from unstructured-inference==0.5.12:
When a pdf page doesn't have much data, it may get buffered in the write to a tempfile. If this happens, we'll hit an error reading the file back. Open to suggestions for a way to unit test this - I was creating some test files with pypdf but I couldn't trigger the error.
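
To illustrate the buffering problem described above, here is a small standard-library sketch. It is not the actual change in unstructured-inference; the function name is hypothetical, and it assumes a POSIX system where a named temporary file can be reopened by path while still open. Small writes can stay in the in-process buffer, so the writer flushes before a second reader opens the file.

```python
import tempfile


def write_then_read_by_name(data: bytes) -> bytes:
    """Write `data` to a named temp file and read it back through its path."""
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(data)
        # A small payload may still be sitting in the Python-level buffer.
        # Flushing pushes it to the OS so a reader opening by name sees it.
        tmp.flush()
        with open(tmp.name, "rb") as reader:
            return reader.read()


payload = b"%PDF-1.4 tiny page"
assert write_then_read_by_name(payload) == payload
```

Without the `flush()` call, the second `open` can return fewer bytes than were written, which is consistent with the kind of read-back error mentioned above.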
* add the first version of airtable connector
* change imports as inline to fail gracefully in case of lacking dependency
* parse tables as csv rather than plain text
* add relevant logic to be able to use --airtable-list-of-paths
* add script for creation of resources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings
* fix ingest test names
* add scripts for the large table test
* remove large table test from diff test
* make base and table ids explicit
* add and remove comments
* use -ne instead of !=
* update code based on the recent ingest refactor, update changelog and version
* shellcheck fix
* update comments
* update check-num-rows-and-columns-output error message
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* update help comments
* update workflows to set auth tokens and to run make install
* add comments on create_scale_test_components
* separate component ids from the test script, add comments to document test component creation
* add LARGE_BASE test, implement LARGE_BASE component creation, replace component id
* shellcheck fixes
* update docs
* update comment
* bump version
* add wrongly deleted file
* sort columns before saving to process
* Update ingest test fixtures (#1098)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
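
The "parse tables as csv" and "sort columns before saving" items above could look roughly like the following sketch. The record shape (a list of plain dicts) and the function name `records_to_csv` are assumptions for illustration, not the connector's actual code. Sorting the column names gives a stable output regardless of the order in which fields arrive.

```python
import csv
import io
from typing import Any, Dict, List


def records_to_csv(records: List[Dict[str, Any]]) -> str:
    """Serialize table rows to CSV with columns in sorted order."""
    # Rows may omit empty fields, so collect the union of all column names.
    columns = sorted({name for row in records for name in row})
    buffer = io.StringIO()
    # restval fills in cells for columns that a given row does not have.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


rows = [{"name": "a", "size": 1}, {"size": 2, "colour": "red"}]
print(records_to_csv(rows))
```

A deterministic column order also keeps fixture diffs in the ingest tests small when the upstream field order changes.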
* feat: add functionality to check if a string contains any emoji characters
* feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji
* chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use
* chore: update changelog & version
* chore: update dependencies
* test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename`
* chore: update changelog & version
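
A rough sketch of the emoji check and parser switch listed above. The PR adds the `emoji` package for this; the sketch swaps in a small, non-exhaustive set of Unicode ranges so it runs with the standard library only. The function names and the parser labels returned by `choose_html_parser` are hypothetical placeholders.

```python
import re

# Rough, non-exhaustive emoji code point ranges (an assumption for
# illustration; the `emoji` package maintains a far more complete table).
_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)


def contains_emoji(text: str) -> bool:
    """Return True if `text` has at least one character in the ranges above."""
    return _EMOJI_PATTERN.search(text) is not None


def choose_html_parser(html_text: str) -> str:
    """Pick a parser label based on emoji presence (labels are hypothetical)."""
    return "emoji_safe_parser" if contains_emoji(html_text) else "default_parser"


print(choose_html_parser("<p>Launch \U0001F680</p>"))
print(choose_html_parser("<p>plain text</p>"))
```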
* Add notion connector and supporting code
* minor fixes
* Don't ignore types that aren't recognized when mapping json
* Add support for recursively getting docs
* Add recursive search for databases
* fix logging
* fix linting
* Support extracting page content to html
* Support extracting database content to html
* update CHANGELOG
* fix linting
* Add notion connector and supporting code
* minor fixes
* Add notion deps to extras
* Use the same return type for both helper methods
* Don't ignore types that aren't recognized when mapping json
* Add support for recursively getting docs
* Add recursive search for databases
* fix logging
* fix linting
* remove debugging code
* split dependencies by document type
* make pip-compile with new requirements
* add extra requirements to setup.py
* add in all docs; re pip-compile
* extra for all docs
* add pandas to xlsx
* dependency requires for tsv and csv
* handling for doc, docx and odt
* dependency check for pypandoc
* required dependencies for pandoc files
* xml and html
* markdown
* msg
* add in pdf
* add in pptx
* add in excel
* add lxml as base req
* extra all docs for local inference
* local inference installs all
* pin pillow version
* fixes for plain text tests
* fixes for doc
* update make commands
* changelog and version
* add xlrd
* update pip-compile
* pin numpy for python 3.8 support
* more constraints
* constraint on scipy
* update install docs
* constrain ipython
* add outlook to pip-compile
* more ipython constraints
* add extras to dockerfile
* pin office365 client
* few doc tweaks
* types as strings
* last pip-compile
* re pip-compile
* make tidy
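
Once dependencies are split by document type, a partitioner needs to check that its extra is installed before it runs. The sketch below shows one way to do that with the standard library. The helper name `require_dependencies` and the install hint format are assumptions, not the project's actual API.

```python
import importlib.util
from typing import Sequence


def require_dependencies(modules: Sequence[str], extras: str) -> None:
    """Raise a helpful ImportError when any top-level module is not installed."""
    # find_spec returns None for a top-level module that cannot be found.
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    if missing:
        raise ImportError(
            f"Missing dependencies: {', '.join(missing)}. "
            # Hypothetical hint; the extras name is whatever setup.py defines.
            f'Install them with: pip install "unstructured[{extras}]"'
        )


# Passes silently because `json` ships with Python.
require_dependencies(["json"], extras="csv")
```

Checking up front turns a bare `ModuleNotFoundError` deep inside a partitioner into a message that names the extra to install.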
* Pull out s3 code as subcommand
* Pull out dropbox code as subcommand
* Pull out azure code as subcommand
* Pull out fsspec code as subcommand
* Pull out github code as subcommand
* Pull out gitlab code as subcommand
* Pull out reddit code as subcommand
* Pull out slack code as subcommand
* Pull out discord code as subcommand
* Pull out wikipedia code as subcommand
* Pull out gdrive code as subcommand
* Pull out biomed code as subcommand
* rename parameters
* Pull out onedrive code as subcommand
* Pull out outlook code as subcommand
* Pull out local code as subcommand
* Pull out elasticsearch code as subcommand
* Pull out confluence code as subcommand
* Drop previous main file
* update changelog
* Add back in mp.Pool
* Fix mypy issues with click
* Make sure all tests run with verbose flag
* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data
* Pull out some more shared options
* Support running code via python as well as cli
* update ingest readme and move it to the ingest folder
* update usage in connector docs
* move local command arg in test
* Separate out cli code from logic running unstructured
* Make some cli fields required rather than optional
* rename process -> processor
* Improve logger to avoid duplicate handlers
---------
Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
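
For the "avoid duplicate handlers" item above, a common pattern is to attach a handler only when the logger has none. This is a sketch of that pattern, not the project's logger; the logger name and function name are hypothetical.

```python
import logging


def get_ingest_logger(name: str = "ingest_sketch", verbose: bool = False) -> logging.Logger:
    """Return a configured logger without stacking handlers on repeat calls."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    # getLogger returns the same object each time, so only add a handler once.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    # Keep records from also reaching root handlers and printing twice.
    logger.propagate = False
    return logger


first = get_ingest_logger()
second = get_ingest_logger(verbose=True)
second.debug("printed once, even though the logger was configured twice")
```

This matters when each subcommand or worker process calls the setup function, since every extra handler would repeat each log line.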
* remove default strategy
* working on test
* fixed test, coordinates param needed to be included
* nits
* update changelog
* lint
* update requirements
* Bump inference version
* Pass through the dpi param if available
* Update CHANGELOG
* Check dpi param passed in via unit test
* Bump inference version
* Fix unit test around file info to work on mac as well
* Add confluence connector and an example script
* add test script, add dependency installations
* add authentication secret variables for ci tests and actions
* add dependency installation commands for workflows
* Update ingest test fixtures (#907)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* add ingest test fixtures update workflow for python 3.10, update example script with dummy values
* change workflow name to avoid confusion
* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update
* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions
* Update ingest test fixtures (#911)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* revert back the test python version matrix
* recompile dependencies
* modifications for shellcheck
* update changelog and version
* changelog and version
* remove comments
* Update ingest test fixtures (#915)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* add the option to state the number of spaces to be fetched
* add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users
* add help message
* add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test
* change test names
* rename connector arg
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* change arg name for connector
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* add comment to example
* change arg names
* add new tests to ingest test
* shellcheck remove redundant statement
* Update ingest test fixtures (#932)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* Update ingest test fixtures (#936)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* linting
* change file extensions to parse as html
* Update ingest test fixtures (#943)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* remove old fixtures
* update version to 0.8.2-dev3
* change file to trigger CI
* change file to trigger CI
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
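
The scroll functionality mentioned above amounts to requesting successive pages until a limit is reached or the results run out. The sketch below shows that loop against a fake in-memory backend. The `scroll` function, the `fetch_page(start, limit)` signature, and the sample data are all hypothetical and do not reflect the Confluence client's real API.

```python
from typing import Callable, Iterator, List


def scroll(
    fetch_page: Callable[[int, int], List[str]], page_size: int, max_items: int
) -> Iterator[str]:
    """Yield up to `max_items` results by requesting successive pages."""
    start = 0
    yielded = 0
    while yielded < max_items:
        page = fetch_page(start, page_size)
        if not page:
            break  # the backend has no more results
        for item in page:
            if yielded >= max_items:
                return
            yield item
            yielded += 1
        start += len(page)


# Fake backend standing in for a paginated API.
SPACES = [f"space-{i}" for i in range(7)]


def fake_fetch(start: int, limit: int) -> List[str]:
    return SPACES[start:start + limit]


print(list(scroll(fake_fetch, page_size=3, max_items=5)))
```

Here `max_items` plays the role of a user-facing cap such as `--confluence-num-of-spaces`, while `page_size` is whatever the backend allows per request.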
* remove argilla; bump reqs
* enable py 3.11
* add 3.11 to setup.py
* make pip-compile
* ignore cli mypy errors
* install argilla
* fix constraints
* install argilla
* changelog and version
* skip argilla in docker
* dont import argilla in docker
* skip all of argilla if in container
* only import argilla if outside docker
* more docker skips
* remove weird pypi settings
More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4)
Make large model available (from unstructured-inference bump to 0.5.3)
Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2)
---------
Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: Crag Wolfe <crag@unstructured.io>