24 Commits

Author SHA1 Message Date
Tom Aarsen
5eb1466acc
Resolve various style issues to improve overall code quality (#282)
* Apply import sorting

ruff . --select I --fix

* Remove unnecessary open mode parameter

ruff . --select UP015 --fix

* Use f-string formatting rather than .format

* Remove extraneous parentheses

Also use "" instead of str()

* Resolve missing trailing commas

ruff . --select COM --fix

* Rewrite list() and dict() calls using literals

ruff . --select C4 --fix

* Add () to pytest.fixture, use tuples for parametrize, etc.

ruff . --select PT --fix

* Simplify code: merge conditionals, context managers

ruff . --select SIM --fix

* Import without unnecessary alias

ruff . --select PLR0402 --fix

* Apply formatting via black

* Rewrite ValueError somewhat

Slightly unrelated to the rest of the PR

* Apply formatting to tests via black

* Update expected exception message to match 0d81564

* Satisfy E501 line too long in test

* Update changelog & version

* Add ruff to make tidy and test deps

* Run 'make tidy'

* Update changelog & version

* Update changelog & version

* Add ruff to 'check' target

Doing so also required fixing some issues that ruff cannot fix automatically. Two of them are suppressed with a noqa: SIM115; the one in __init__ in particular may need some attention, but that refactor is out of scope for this PR.
2023-02-27 11:30:54 -05:00
Tom Aarsen
e61ce2cc00
Skip posix_path test on Windows (#283)
2023-02-25 08:31:34 +00:00
Matt Robinson
0d229f0a5e
fix: preserve all elements when serialized; feat: helper functions for serialization (#273)
* added type to text element map

* add element_id and coordinates

* added test for serialization

* added serialization for check boxes

* add dict_to_elements and convert_to_dict aliases

* helpers for serializing and deserializing elements

* bump version; changelog

* add Text to tests

* aliases for isd functions

* remove test elements json

* changelog updates

* make indent a kwarg

* update expected structured output

* docs update

* use new function in ingest code

* pop coordinates due to floating point differences

* pop coordinates
2023-02-23 21:58:59 +00:00
Matt Robinson
f5ff140d7c
fix: ElementMetadata serializes when the filename is a Path object (#233)
2023-02-16 17:20:51 +00:00
Matt Robinson
74e6b84b41
feat: add metadata tracking to document elements (#225)
* add metadata field to elements

* metadata tracking for pdf/image

* metadata for html

* update expected outputs

* metadata for the rest of the document types

* take out file metadata for now

* add url to tables

* added metadata to test_auto

* bump version

* added coordinates to __init__

* fix coordinates in tests
2023-02-15 18:26:20 +00:00
Matt Robinson
f890972139
docs: add bricks training notebook (#211)
* added bricks notebook

* more unicode quotes; isd dataframe column fix

* fix remove_punctuation docs

* typo fixes

* put staging bricks in code
2023-02-10 14:39:14 +00:00
Matt Robinson
7fb3797165
docs: core concepts training notebook (#207)
* added to_dict to elements

* first training notebook

* bump changelog, rerun notebook

* remove coordinates and id

* rerun notebook

* has -> have

* partitioning -> partition

* various and sundry typos

* switch to using convert_to_isd
2023-02-09 14:34:34 +00:00
Matt Robinson
782b4352ec
build(deps): weekly dependency update; reduce dependabot frequency (#194)
* deps: pip-compile to update dependencies

* bump version

* linting, linting, linting

* typo
2023-02-06 16:39:29 +00:00
Matt Robinson
17045aed80
feat: add convert_to_dataframe staging brick (#127)
* add pandas to deps; pip-compile

* staging brick to convert elements to dataframe

* bump version

* add convert_to_dataframe docs

* bump wheel version

* typo fix

* typo fix 2!
2023-01-04 12:04:59 -05:00
Matt Robinson
77cd5cc01f
feat: text2text and token classification for argilla (#87)
* add support for text2text

* add support for token classification datasets

* bump versions

* updated docs

* remove extra comment

* fix wording in docs

* fix some more wording
2022-11-30 20:07:42 +00:00
asymness
2170a2aae2
feat: Implement Argilla staging brick (#81)
* Add argilla to dependencies and run pip-compile

* Implement Argilla staging brick and add unit tests

* Update version and changelog

* Update docs with description and usage for Argilla staging brick

* Remove unused fixtures and fix typo in Argilla tests

* add missing quote in docs

* changelog tweak

* doc tweaks

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-11-28 14:41:48 +00:00
Matt Robinson
b041b0197d
feat: Add entities kwarg to datasaur bricks (#77)
* added entities to datasaur

* add tests for datasaur with entities

* update docs

* fix missing imports

* bump version

* remove accidental file
2022-11-22 19:50:19 +00:00
benjats07
df16b5806b
feat: Add staging brick for Datasaur token-based tasks (#50)
* feat: Add staging brick for Datasaur token-based tasks

* Added docstring and formatting with flake8, mypy and black

* docs: Added documentation for stage_for_datasaur

* fix: version sync correction

* fix: Corrections to docs for stage_for_datasaur

* fix: changes in naming of example variables

* Update docs/source/bricks.rst

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-11-07 14:56:02 -06:00
Matt Robinson
de31df51a9
feat: Adds a helper function to convert ISD dicts to elements (#39)
* updated category name for ListItem

* added brick to convert isd to elements

* bump version

* added isd_to_elements to documentation
2022-10-21 18:43:10 +00:00
asymness
2d5dba0ddc
feat: Implement staging brick for ISD CSV format (#36)
* Implement convert_to_isd_csv function

* Add unit tests for convert_to_isd_csv function

* Update docs with description and example of convert_to_isd_csv function

* Update changelog and version
2022-10-13 11:35:46 -04:00
Matt Robinson
fb16847946
feat: Staging brick for attention window chunking (#34)
* add huggingface dependencies and re pip-compile

* first pass on chunk by attention window

* test for chunking function

* completed tests for chunk_by_attention_window

* change default buffer size to 2

* wrapper function for staging

* added docs for transformers

* fix wording and typos

* updated change log and bumped the version

* added docs on huggingface dependencies

* fix typo

* re pip-compile
2022-10-13 11:18:27 -04:00
asymness
ec5be8e8b0
feat: Implement LabelBox staging brick (#26)
* Implement stage_for_label_box function

* Add unit tests for stage_for_label_box function

* Update docs with description and example for stage_for_label_box function

* Bump version and update CHANGELOG.md

* Fix linting issues and implement suggested changes

* Update stage_for_label_box docs with a note for uploading files to cloud providers
2022-10-11 10:15:25 -04:00
asymness
baba641d03
feat: Allow option to specify predictions in LabelStudio staging brick (#23)
* Allow stage_for_label_studio to take a predictions input and implement prediction class

* Update unit tests for LabelStudioPrediction and stage_for_label_studio function

* Update stage_for_label_studio docs with example of loading predictions

* Bump version and update changelog

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-10-06 13:35:55 +00:00
Yuming Long
779e48bafe
chore: Integration test to show LabelStudio brick working with SDK (#21)
2022-10-05 14:38:44 -04:00
Matt Robinson
a950559b94
feat: Optionally include LabelStudio annotations in staging brick (#19)
* added types for label studio annotations

* added method to cast as dicts

* added length check for annotations

* tweaks to get upload to work

* added validation for label types

* annotations is a list for each example

* little bit of refactoring

* test for staging with label studio

* tests for error conditions and reviewers

* added test for NER annotations

* updated changelog and bumped version

* added docs with annotation examples

* fix label studio link

* bump version in sphinx docs

* fulle -> full (typo fix)
2022-10-04 13:25:05 +00:00
asymness
d429e9b305
feat: Implement stage_csv_for_prodigy brick (#13)
* Refactor metadata validation and implement stage_csv_for_prodigy brick

* Refactor unit tests for metadata validation and add tests for Prodigy CSV brick

* Add stage_csv_for_prodigy description and example in docs

* Bump version and update changelog

* added _csv_ to function name

* update changelog line to 0.2.1-dev2

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2022-10-03 09:30:30 -04:00
asymness
35d488a466
feat: Implement stage_for_prodigy brick (#11)
* Implement unit tests for stage_for_prodigy brick

* Implement brick for converting data to Prodigy format

* Add stage_for_prodigy description and example to docs

* updated changelog

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2022-09-30 12:41:37 -04:00
qued
64e1c725eb
feat: Add text_field and id_field to stage_for_label_studio signature (#9)
Added text_field and id_field to stage_for_label_studio signature, to allow user to specify the keys in the resulting JSON. Includes tests and update to example in sphinx docs.
2022-09-28 09:30:17 -05:00
Matt Robinson
5f40c78f25
Initial Release
2022-09-26 14:55:20 -07:00