* bump cryptography version
* re pip-compile for latest versions
* update argilla example requirements
* dependency updates
* bump versions
* pin unstructured-inference due to multithreading issue
* linting, linting, linting
* dependency on one line
* Apply import sorting
ruff . --select I --fix
* Remove unnecessary open mode parameter
ruff . --select UP015 --fix
* Use f-string formatting rather than .format
* Remove extraneous parentheses
Also use "" instead of str()
* Resolve missing trailing commas
ruff . --select COM --fix
* Rewrite list() and dict() calls using literals
ruff . --select C4 --fix
* Add () to pytest.fixture, use tuples for parametrize, etc.
ruff . --select PT --fix
* Simplify code: merge conditionals, context managers
ruff . --select SIM --fix
* Import without unnecessary alias
ruff . --select PLR0402 --fix
* Apply formatting via black
* Rewrite ValueError somewhat
Slightly unrelated to the rest of the PR
* Apply formatting to tests via black
* Update expected exception message to match
0d81564
* Satisfy E501 line too long in test
* Update changelog & version
* Add ruff to make tidy and test deps
* Run 'make tidy'
* Update changelog & version
* Update changelog & version
* Add ruff to 'check' target
Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
Add Reddit data connector for ingest.
* The connector can process a subreddit.
* Either via a search query,
* or via hot posts.
* The texts in the submissions are converted to markdown files including the post title and the text body, if any (i.e. no images or videos).
* The number of posts to fetch can be changed with the CLI.
* included the previous PR changes and verified black
* resolved the issues mentioned
* make tidy and add tests
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* added type to text element map
* add element_id and coordinates
* added test for serialization
* added serialization for check boxes
* add dict_to_elements and covert_to_dict aliases
* helpers for serializing and deserializing elements
* bump version; changelog
* add Text to tests
* aliases for isd functions
* remove test elements json
* changelog updates
* make indent a kwarg
* update expected structured output
* docs update
* use new function in ingest code
* pop coordinates due to floating point differences
* pop coordinates
* code for downloading nltk packages
* don't run nltk make command in ci
* test for model downloads
* remove nltk install from docs
* update changelog and bump version
This will allow partition_html to use a custom XMLParser or HTMLParser.
It can be useful if one needs to specify additional arguments to these parsers (not only built-in remove_comments=True).
---------
Co-authored-by: Viktor Zhemchuzhnikov <v.zhemchuzhnikov@xsolla.com>
- Creates ABC's for ingest connectors
- Updates the s3_connector classes to inherit from ABC's
- Moves s3 test script to it's own file to establish pattern for additional connectors
- Rewrites the Ingest.md doc, including instructions how how to add a connector
- Updates the example s3 ingest script to use the new location for main.py
Note that there were no logic changes, this is essentially a refactoring PR.
Test instructions:
Run ./test_unstructured_ingest/test-ingest.sh and ./examples/ingest/s3-small-batch/ingest.sh.
* added partition_ppt function and tests
* add ppt support to auto
* version bump
* update docs
* doc fixes
* update changelog
* `.docx` -> `.pptx`
* its -> their
* remove whitespace
* first pass on doc partitioning
* add libreoffice to deps
* update docs and readme
* add .doc to auto
* changelog bump
* value error with missing doc
* doc updates
* Many command line options added. The sample ingest project is now an easy to use CLI (no code editing
necessary), capable of processing large numbers of files from S3 in a re-entrant manner. See Ingest.md.
* Fixes issue where text fixtures had been truncated
* Adds a check to make sure this doesn't happen again
* Moves fixture outputs for the existing connector one subdir lower,
to make room for future connector outputs.
* add metadata field to elements
* metadata tracking for pdf/image
* metadata for html
* update expected outputs
* metadata for the rest of the document types
* take out file metadata for now
* add url to tables
* added metadata to test_auto
* bump version
* added coordinates to __init__
* fix coordinates in tests
* Update xml.py
remove comments while parsing
* change logged in CHANGLOG and editted version
* make tidy
* editted version
* new version 0.4.8-dev1
* editted version
* Update CHANGELOG.md
Co-authored-by: cragwolfe <crag@unstructuredai.io>
---------
Co-authored-by: cragwolfe <crag@unstructuredai.io>
* added to_dict to elements
* first training notebook
* bump changelog, rerun notebook
* remove coordinates and id
* rerun notebook
* has -> have
* partitioning -> partition
* various and sundry typos
* switch to using convert_to_isd
* added function for exploring a list of files
* file info from file contents
* added tests for file info from contents
* bump version and add tests
* add dev to version
* page breaks for pptx
* added page breaks for image/pdf
* tests for images with page breaks
* page breaks for html documents
* linting, linting, linting
* changelog and bump version
* update docs
* fix typo
* refactor reusable code to common.py
* add type back in