* add env var for cap threshold; raise default threshold
* update docs and tests
* added check for ending in a comma
* update docs
* no caps check for all upper text
* capture Text in html and text
* check category in Text equality check
* lower case all caps before checking for verbs
* added check for us city/state/zip
* added address type
* add address to html
* add address to text
* fix for text tests; escape for large text segments
* refactor regex for readability
* update comment
* additional test for text with linebreaks
* update docs
* update changelog
* update elements docs
* remove old comment
* case -> cast
* type fix
* added python-pptx to requirements
* added filetype detection for powerpoint
* add more filetypes to detect
* more tests
* added tests for filetype
* reorder document types
* tests for get_directory_file_info
* added docs for get_directory_file_info
* bump version
* Word -> Office
* added test for filetype
* add group by filetype example
* add python-magic
* first pass on filetype detection
* tests for filetype detection
* more tests for file detection
* added tests for error conditions
* install libmagic dev in github
* libmagic install instructions
* pattern for checking email files
* support reading .eml in rb mode
* add auto partition function
* auto tests for emal
* auto tests for docx
* added tests for html
* add pdf and html tests
* linting, linting, linting
* added docs for auto partitioning
* update readme with generic partition brick
* bumped version
* added test for bad type
* detect .docx files from application/octet-stream
* linting, linting, linting
* identify xlsx from octet stream
* install poppler in ci
* fix mocks; test for unknown type
* install poppler utils
* install in one line
* only poppler-utils
* file extension logic from application/octet-stream
* install local inference for ci
* install detectron2
* removing unused dockerfile
* first pass on docx parsing
* linting, linting, linting
* test docx with filename
* added documentation
* more tests; version bump
* typo
* another typo
* another typo!
* it -> its
* save -> saved
* remove None since it's the default argument
* add environment.yml
* instructions on how to install base package and detectron2
* added instructions on paddleocr
* remove covers
* install -> to install
* specified the shell
* updated example snippets
* update environment.yml
* updated the repo reference
* no more ands!
* feat: new cleaning brick for ordered bullets
* test: add test for cleaning ordered bullets
* feat: new brick for extracting ordered bullets
* test: add test for extracting ordered bullets
* docs: update CHANGELOG and bump new dev version
* chore: change extract ordered bullets return type to tuple
* chore: made tidy
* chore: regex to split on pattern instead of built-in
* chore: catch ValueError, made tidy and fix incompatible type
* chore: assertion statements in one line of code
* docs: add documentation for new clean and extract bricks to bricks.rst
* docs: refactor CHANGELOG 0.3.5.dev5 to dev6 with new bullets
* docs: update CHANGELOG 0.3.6-dev0 changes and bump version
Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
* added pattern for finding phone numbers
* added cleaning brick for extracting phone numbers
* add docs
* changelog and bump version
* switch to us phone numbers
* bump dev version
* fix for processing deeply embedded list elements
* fix types in mime encodings cleaner
* first pass on partition_email
* tests for email
* test for mime encodings
* changelog bump
* added note about \n=
* linting, linting, linting
* added email docs
* add partition_email to the readme
* add one more test
* add apply method to apply cleaners to elements
* bump version
* add check for string output
* documentations for the apply method
* change interface to *cleaners
* initial implementation for translate brick
* more input validation
* tests for translate brick
* added docs
* bumped version
* chinese and arabic tests
* re-run pip-compile
* add torch to dependencies
* cleanup doc string
* fix long string
* fix typo in docs
* take out empty string check
* return string if string is empty
* added huggingface into make install
* add support for text2text
* add support for token classification datasets
* bump versions
* updated docs
* remove extra comment
* fix wording in docs
* fix some more wording
* Add argilla to dependencies and run pip-compile
* Implement Argilla staging brick and add unit tests
* Update version and changelog
* Update docs with description and usage for Argilla staging brick
* Remove unused fixtures and fix typo in Argilla tests
* add missing quote in docs
* changelog tweak
* doc tweaks
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* docs: add new feature to the CHANGELOG.md, bump the version, update __version__.py
* feat: new partition to call the document image analysis API
* fix: remove duplicated dependency on partition.py
* fix: linting error due to line-lenght > 100
* test: add test to call partition_pdf brick
* chore: new short example-doc pdf for speed up in test X8
* fix: add missing return statement to _read to pass check
* feat: new partitioning brick to call doc parse API
* docs: version update fix in CHANGELOG
* refactor: no nested ifs
* docs: documentation for new brick partition_pdf
* refactor: made tidy
* docs: minor doc refactor
Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
* brick to extract text before
* brick for extract text after
* tests for extract before and after
* updated docs
* changelog and bump version
* fix typo
* fix another typo
* positive -> non-negative
* added prefix and postfix cleaners
* added test for pre and postfix cleaners
* added docs for prefix and postfix bricks
* changelog and bump version
* add dev to version
* feat: Add staging brick for Datasaur token-based tasks
* Added doc string and formatting with flake8,mypy and black
* docs: Added documentation for stage_for_datasaur
* fix: version sync correction
* fix: Corrections to docs fror stage_for_datasaur
* fix: changes in naming of example variables
* Update docs/source/bricks.rst
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* Implement convert_to_isd_csv function
* Add unit tests for convert_to_isd_csv function
* Update docs with description and example of convert_to_isd_csv function
* Update changelog and version
* add huggingface dependencies and re pip-compile
* first pass on chunk by attention window
* test for chunking function
* completed tests for chunk_by_attention_window
* change default buffer size to 2
* wrapper function for staging
* added docs for transformers
* fix wording and typos
* updated change log and bumped the version
* added docs on huggingface dependencies
* fix typo
* re pip-compile
* Implement stage_for_label_box function
* Add unit tests for stage_for_label_box function
* Update docs with description and example for stage_for_label_box function
* Bump version and update CHANGELOG.md
* Fix linting issues and implement suggested changes
* Update stage_for_label_box docs with a note for uploading files to cloud providers
* Added script to check/sync versions using CHANGELOG.md as a source of truth.
* Script currently only syncs __version__.py but can easily be extended to cover other files by adding the files to an array in the script.
* Also updated sphinx conf.py to get version dynamically from __version__.py
* Allow stage_for_label_studio to take a predictions input and implement prediction class
* Update unit tests for LabelStudioPrediction and stage_for_label_studio function
* Update stage_for_label_studio docs with example of loading predictions
* Bump version and update changelog
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* Implement save_as_jsonl and read_from_jsonl utility functions
* Add unit tests for save_as_jsonl and read_from_jsonl utility functions
* Add example of using save_as_jsonl with prodigy staging brick
* Bump version and update changelog
* remove accidentally added prodigy json file
* added "the" in jsonl description
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* added types for label studio annotations
* added method to cast as dicts
* added length check for annotations
* tweaks to get upload to work
* added validation for label types
* annotations is a list for each example
* little bit of refactoring
* test for staging with label studio
* tests for error conditions and reviewers
* added test for NER annotations
* updated changelog and bumped version
* added docs with annotation examples
* fix label studio link
* bump version in sphinx docs
* fulle -> full (typo fix)
* Refactor metadata validation and implement stage_csv_for_prodigy brick
* Refactor unit tests for metadata validation and add tests for Prodigy CSV brick
* Add stage_csv_for_prodigy description and example in docs
* Bump version and update changelog
* added _csv_ to function name
* update changelog line to 0.2.1-dev2
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* Implement unit tests for stage_for_prodigy brick
* Implement brick for converting data to Prodigy format
* Add stage_for_prodigy description and example to docs
* updated changelog
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Added text_field and id_field to stage_for_label_studio signature, to allow user to specify the keys in the resulting JSON. Includes tests and update to example in sphinx docs.