* remove argilla; bump reqs
* enable py 3.11
* add 3.11 to setup.py
* make pip-compile
* ignore cli mypy errors
* install argilla
* fix constraints
* install argilla
* changelog and version
* skip argilla in docker
* don't import argilla in docker
* skip all of argilla if in container
* only import argilla if outside docker
* more docker skips
* remove weird pypi settings
Avoid setting a default metadata object in the constructor signature for elements: the default is created once and shared across instances, which can lead to unexpected object reuse (and modification).
Bonus refactor for PageBreak to have text values of "".
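
A minimal sketch of the pitfall described above, assuming a simplified metadata class. `Metadata`, `SharedDefaultElement`, and `SafeElement` are hypothetical names for illustration, not the library's actual classes:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for an element metadata object."""

    filename: Optional[str] = None


class SharedDefaultElement:
    """Anti-pattern: the default Metadata() is built once, when the function is defined."""

    def __init__(self, text: str, metadata: Metadata = Metadata()):
        self.text = text
        self.metadata = metadata


class SafeElement:
    """Safer pattern: default to None and build a fresh Metadata per instance."""

    def __init__(self, text: str, metadata: Optional[Metadata] = None):
        self.text = text
        self.metadata = metadata if metadata is not None else Metadata()


# Both instances silently share one Metadata object.
a = SharedDefaultElement("a")
b = SharedDefaultElement("b")
a.metadata.filename = "a.pdf"
print(b.metadata.filename)  # the change made through `a` is visible through `b`
```

With the `None` default, each element gets its own metadata object, so modifying one element no longer affects the others.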
---------
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
* add encoding to elements_to_json and elements_from_json
* version and changelog
* add new test
* fix version
* revert test file
* blank line to test
* no blank line
* update stage_for_transformers to return a list of elements
* bump changelog and version
* flag breaking change
* fix last word bug in chunk_by_attention_window
Closes #200. Fixes the failing test for label_studio_sdk>0.0.17 using the suggestion found in this comment. The vcr fixture on the test needed allow_playback_repeats=True. Unpinned label_studio_sdk and pip-compiled.
* Apply import sorting
ruff . --select I --fix
* Remove unnecessary open mode parameter
ruff . --select UP015 --fix
* Use f-string formatting rather than .format
* Remove extraneous parentheses
Also use "" instead of str()
* Resolve missing trailing commas
ruff . --select COM --fix
* Rewrite list() and dict() calls using literals
ruff . --select C4 --fix
* Add () to pytest.fixture, use tuples for parametrize, etc.
ruff . --select PT --fix
* Simplify code: merge conditionals, context managers
ruff . --select SIM --fix
* Import without unnecessary alias
ruff . --select PLR0402 --fix
* Apply formatting via black
* Rewrite ValueError somewhat
Slightly unrelated to the rest of the PR
* Apply formatting to tests via black
* Update expected exception message to match
0d81564
* Satisfy E501 line too long in test
* Update changelog & version
* Add ruff to make tidy and test deps
* Run 'make tidy'
* Update changelog & version
* Update changelog & version
* Add ruff to 'check' target
Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
* added type to text element map
* add element_id and coordinates
* added test for serialization
* added serialization for check boxes
* add dict_to_elements and convert_to_dict aliases
* helpers for serializing and deserializing elements
* bump version; changelog
* add Text to tests
* aliases for isd functions
* remove test elements json
* changelog updates
* make indent a kwarg
* update expected structured output
* docs update
* use new function in ingest code
* pop coordinates due to floating point differences
* pop coordinates
* add metadata field to elements
* metadata tracking for pdf/image
* metadata for html
* update expected outputs
* metadata for the rest of the document types
* take out file metadata for now
* add url to tables
* added metadata to test_auto
* bump version
* added coordinates to __init__
* fix coordinates in tests
* added to_dict to elements
* first training notebook
* bump changelog, rerun notebook
* remove coordinates and id
* rerun notebook
* has -> have
* partitioning -> partition
* various and sundry typos
* switch to using convert_to_isd
* add support for text2text
* add support for token classification datasets
* bump versions
* updated docs
* remove extra comment
* fix wording in docs
* fix some more wording
* Add argilla to dependencies and run pip-compile
* Implement Argilla staging brick and add unit tests
* Update version and changelog
* Update docs with description and usage for Argilla staging brick
* Remove unused fixtures and fix typo in Argilla tests
* add missing quote in docs
* changelog tweak
* doc tweaks
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* feat: Add staging brick for Datasaur token-based tasks
* Added docstring and formatting with flake8, mypy and black
* docs: Added documentation for stage_for_datasaur
* fix: version sync correction
* fix: Corrections to docs for stage_for_datasaur
* fix: changes in naming of example variables
* Update docs/source/bricks.rst
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* Implement convert_to_isd_csv function
* Add unit tests for convert_to_isd_csv function
* Update docs with description and example of convert_to_isd_csv function
* Update changelog and version
* add huggingface dependencies and re pip-compile
* first pass on chunk by attention window
* test for chunking function
* completed tests for chunk_by_attention_window
* change default buffer size to 2
* wrapper function for staging
* added docs for transformers
* fix wording and typos
* updated change log and bumped the version
* added docs on huggingface dependencies
* fix typo
* re pip-compile
* Implement stage_for_label_box function
* Add unit tests for stage_for_label_box function
* Update docs with description and example for stage_for_label_box function
* Bump version and update CHANGELOG.md
* Fix linting issues and implement suggested changes
* Update stage_for_label_box docs with a note for uploading files to cloud providers
* Allow stage_for_label_studio to take a predictions input and implement prediction class
* Update unit tests for LabelStudioPrediction and stage_for_label_studio function
* Update stage_for_label_studio docs with example of loading predictions
* Bump version and update changelog
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* added types for label studio annotations
* added method to cast as dicts
* added length check for annotations
* tweaks to get upload to work
* added validation for label types
* annotations is a list for each example
* little bit of refactoring
* test for staging with label studio
* tests for error conditions and reviewers
* added test for NER annotations
* updated changelog and bumped version
* added docs with annotation examples
* fix label studio link
* bump version in sphinx docs
* fulle -> full (typo fix)
* Refactor metadata validation and implement stage_csv_for_prodigy brick
* Refactor unit tests for metadata validation and add tests for Prodigy CSV brick
* Add stage_csv_for_prodigy description and example in docs
* Bump version and update changelog
* added _csv_ to function name
* update changelog line to 0.2.1-dev2
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* Implement unit tests for stage_for_prodigy brick
* Implement brick for converting data to Prodigy format
* Add stage_for_prodigy description and example to docs
* updated changelog
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Added text_field and id_field to stage_for_label_studio signature, to allow user to specify the keys in the resulting JSON. Includes tests and update to example in sphinx docs.
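
A rough sketch of what configurable output keys look like in practice. `SimpleElement` and `stage_sketch` are hypothetical simplifications, and the default key names and the `{"data": ...}` wrapper are assumptions rather than the actual stage_for_label_studio implementation:

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SimpleElement:
    """Hypothetical stand-in for a document element with text and an id."""

    text: str
    id: str


def stage_sketch(
    elements: List[SimpleElement],
    text_field: str = "text",  # assumed default key for the element text
    id_field: str = "ref_id",  # assumed default key for the element id
) -> List[Dict[str, Any]]:
    """Build one task dict per element, using caller-chosen key names."""
    return [
        {"data": {text_field: element.text, id_field: element.id}}
        for element in elements
    ]


tasks = stage_sketch(
    [SimpleElement(text="Hello", id="abc")],
    text_field="my_text",
    id_field="my_id",
)
print(tasks)
```

Letting the caller pick the keys means the resulting JSON can match whatever field names an existing labeling configuration already expects.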