* feat: new cleaning brick for ordered bullets
* test: add test for cleaning ordered bullets
* feat: new brick for extracting ordered bullets
* test: add test for extracting ordered bullets
* docs: update CHANGELOG and bump new dev version
* chore: change extract ordered bullets return type to tuple
* chore: made tidy
* chore: regex to split on pattern instead of built-in
* chore: catch ValueError, made tidy and fix incompatible type
* chore: assertion statements in one line of code
* docs: add documentation for new clean and extract bricks to bricks.rst
* docs: refactor CHANGELOG 0.3.5.dev5 to dev6 with new bullets
* docs: update CHANGELOG 0.3.6-dev0 changes and bump version
Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
* added pattern for finding phone numbers
* added cleaning brick for extracting phone numbers
* add docs
* changelog and bump version
* switch to us phone numbers
* bump dev version
* fix for processing deeply embedded list elements
* fix types in mime encodings cleaner
* first pass on partition_email
* tests for email
* test for mime encodings
* changelog bump
* added note about \n=
* linting, linting, linting
* added email docs
* add partition_email to the readme
* add one more test
* add apply method to apply cleaners to elements
* bump version
* add check for string output
* documentations for the apply method
* change interface to *cleaners
* initial implementation for translate brick
* more input validation
* tests for translate brick
* added docs
* bumped version
* chinese and arabic tests
* re-run pip-compile
* add torch to dependencies
* cleanup doc string
* fix long string
* fix typo in docs
* take out empty string check
* return string if string is empty
* added huggingface into make install
* add support for text2text
* add support for token classification datasets
* bump versions
* updated docs
* remove extra comment
* fix wording in docs
* fix some more wording
* Add argilla to dependencies and run pip-compile
* Implement Argilla staging brick and add unit tests
* Update version and changelog
* Update docs with description and usage for Argilla staging brick
* Remove unused fixtures and fix typo in Argilla tests
* add missing quote in docs
* changelog tweak
* doc tweaks
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* docs: add new feature to the CHANGELOG.md, bump the version, update __version__.py
* feat: new partition to call the document image analysis API
* fix: remove duplicated dependency on partition.py
* fix: linting error due to line-length > 100
* test: add test to call partition_pdf brick
* chore: new short example-doc pdf for speed up in test X8
* fix: add missing return statement to _read to pass check
* feat: new partitioning brick to call doc parse API
* docs: version update fix in CHANGELOG
* refactor: no nested ifs
* docs: documentation for new brick partition_pdf
* refactor: made tidy
* docs: minor doc refactor
Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
* brick to extract text before
* brick for extract text after
* tests for extract before and after
* updated docs
* changelog and bump version
* fix typo
* fix another typo
* positive -> non-negative
* added prefix and postfix cleaners
* added test for pre and postfix cleaners
* added docs for prefix and postfix bricks
* changelog and bump version
* add dev to version
* feat: Add staging brick for Datasaur token-based tasks
* Added doc string and formatting with flake8, mypy and black
* docs: Added documentation for stage_for_datasaur
* fix: version sync correction
* fix: Corrections to docs for stage_for_datasaur
* fix: changes in naming of example variables
* Update docs/source/bricks.rst
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* Implement convert_to_isd_csv function
* Add unit tests for convert_to_isd_csv function
* Update docs with description and example of convert_to_isd_csv function
* Update changelog and version
* add huggingface dependencies and re pip-compile
* first pass on chunk by attention window
* test for chunking function
* completed tests for chunk_by_attention_window
* change default buffer size to 2
* wrapper function for staging
* added docs for transformers
* fix wording and typos
* updated change log and bumped the version
* added docs on huggingface dependencies
* fix typo
* re pip-compile
* Implement stage_for_label_box function
* Add unit tests for stage_for_label_box function
* Update docs with description and example for stage_for_label_box function
* Bump version and update CHANGELOG.md
* Fix linting issues and implement suggested changes
* Update stage_for_label_box docs with a note for uploading files to cloud providers
* Allow stage_for_label_studio to take a predictions input and implement prediction class
* Update unit tests for LabelStudioPrediction and stage_for_label_studio function
* Update stage_for_label_studio docs with example of loading predictions
* Bump version and update changelog
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* Implement save_as_jsonl and read_from_jsonl utility functions
* Add unit tests for save_as_jsonl and read_from_jsonl utility functions
* Add example of using save_as_jsonl with prodigy staging brick
* Bump version and update changelog
* remove accidentally added prodigy json file
* added "the" in jsonl description
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* added types for label studio annotations
* added method to cast as dicts
* added length check for annotations
* tweaks to get upload to work
* added validation for label types
* annotations is a list for each example
* little bit of refactoring
* test for staging with label studio
* tests for error conditions and reviewers
* added test for NER annotations
* updated changelog and bumped version
* added docs with annotation examples
* fix label studio link
* bump version in sphinx docs
* fulle -> full (typo fix)
* Refactor metadata validation and implement stage_csv_for_prodigy brick
* Refactor unit tests for metadata validation and add tests for Prodigy CSV brick
* Add stage_csv_for_prodigy description and example in docs
* Bump version and update changelog
* added _csv_ to function name
* update changelog line to 0.2.1-dev2
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* Implement unit tests for stage_for_prodigy brick
* Implement brick for converting data to Prodigy format
* Add stage_for_prodigy description and example to docs
* updated changelog
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Added text_field and id_field to the stage_for_label_studio signature so the user can specify the keys in the resulting JSON. Includes tests and an update to the example in the Sphinx docs.