unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-02 03:45:24 +00:00

Author	SHA1	Message	Date
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282) * Apply import sorting ruff . --select I --fix * Remove unnecessary open mode parameter ruff . --select UP015 --fix * Use f-string formatting rather than .format * Remove extraneous parentheses Also use "" instead of str() * Resolve missing trailing commas ruff . --select COM --fix * Rewrite list() and dict() calls using literals ruff . --select C4 --fix * Add () to pytest.fixture, use tuples for parametrize, etc. ruff . --select PT --fix * Simplify code: merge conditionals, context managers ruff . --select SIM --fix * Import without unnecessary alias ruff . --select PLR0402 --fix * Apply formatting via black * Rewrite ValueError somewhat Slightly unrelated to the rest of the PR * Apply formatting to tests via black * Update expected exception message to match 0d81564 * Satisfy E501 line too long in test * Update changelog & version * Add ruff to make tidy and test deps * Run 'make tidy' * Update changelog & version * Update changelog & version * Add ruff to 'check' target Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
Tom Aarsen	e61ce2cc00	Skip posix_path test on Windows (#283)	2023-02-25 08:31:34 +00:00
Matt Robinson	0d229f0a5e	fix: preserve all elements when serialized; feat: helper functions for serialization (#273) * added type to text element map * add element_id and coordinates * added test for serialization * added serialization for check boxes * add dict_to_elements and convert_to_dict aliases * helpers for serializing and deserializing elements * bump version; changelog * add Text to tests * aliases for isd functions * remove test elements json * changelog updates * make indent a kwarg * update expected structured output * docs update * use new function in ingest code * pop coordinates due to floating point differences * pop coordinates	2023-02-23 21:58:59 +00:00
Matt Robinson	f5ff140d7c	fix: `ElementMetadata` serializes when the filename is a `Path` object (#233)	2023-02-16 17:20:51 +00:00
Matt Robinson	74e6b84b41	feat: add metadata tracking to document elements (#225) * add metadata field to elements * metadata tracking for pdf/image * metadata for html * update expected outputs * metadata for the rest of the document types * take out file metadata for now * add url to tables * added metadata to test_auto * bump version * added coordinates to __init__ * fix coordinates in tests	2023-02-15 18:26:20 +00:00
Matt Robinson	f890972139	docs: add bricks training notebook (#211) * added bricks notebook * more unicode quotes; isd dataframe column fix * fix remove_punctuation docs * typo fixes * put staging bricks in code	2023-02-10 14:39:14 +00:00
Matt Robinson	7fb3797165	docs: core concepts training notebook (#207) * added to_dict to elements * first training notebook * bump changelog, rerun notebook * remove coordinates and id * rerun notebook * has -> have * partitioning -> partition * various and sundry typos * switch to using convert_to_isd	2023-02-09 14:34:34 +00:00
Matt Robinson	782b4352ec	build(deps): weekly dependency update; reduce dependabot frequency (#194) * deps: pip-compile to update dependencies * bump version * linting, linting, linting * typo	2023-02-06 16:39:29 +00:00
Matt Robinson	17045aed80	feat: add `convert_to_dataframe` staging brick (#127) * add pandas to deps; pip-compile * staging brick to convert elements to dataframe * bump version * add convert_to_dataframe docs * bump wheel version * typo fix * typo fix 2!	2023-01-04 12:04:59 -05:00
Matt Robinson	77cd5cc01f	feat: text2text and token classification for argilla (#87) * add support for text2text * add support for token classification datasets * bump versions * updated docs * remove extra comment * fix wording in docs * fix some more wording	2022-11-30 20:07:42 +00:00
asymness	2170a2aae2	feat: Implement Argilla staging brick (#81) * Add argilla to dependencies and run pip-compile * Implement Argilla staging brick and add unit tests * Update version and changelog * Update docs with description and usage for Argilla staging brick * Remove unused fixtures and fix typo in Argilla tests * add missing quote in docs * changelog tweak * doc tweaks Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2022-11-28 14:41:48 +00:00
Matt Robinson	b041b0197d	feat: Add entities kwarg to datasaur bricks (#77) * added entities to datasaur * add tests for datasaur with entities * update docs * fix missing imports * bump version * remove accidental file	2022-11-22 19:50:19 +00:00
benjats07	df16b5806b	feat: Add staging brick for Datasaur token-based tasks (#50) * feat: Add staging brick for Datasaur token-based tasks * Added doc string and formatting with flake8,mypy and black * docs: Added documentation for stage_for_datasaur * fix: version sync correction * fix: Corrections to docs for stage_for_datasaur * fix: changes in naming of example variables * Update docs/source/bricks.rst Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2022-11-07 14:56:02 -06:00
Matt Robinson	de31df51a9	feat: Adds a helper function to convert ISD dicts to elements (#39) * updated category name for ListItem * added brick to convert isd to elements * bump version * added isd_to_elements to documentation	2022-10-21 18:43:10 +00:00
asymness	2d5dba0ddc	feat: Implement staging brick for ISD CSV format (#36) * Implement convert_to_isd_csv function * Add unit tests for convert_to_isd_csv function * Update docs with description and example of convert_to_isd_csv function * Update changelog and version	2022-10-13 11:35:46 -04:00
Matt Robinson	fb16847946	feat: Staging brick for attention window chunking (#34) * add huggingface dependencies and re pip-compile * first pass on chunk by attention window * test for chunking function * completed tests for chunk_by_attention_window * change default buffer size to 2 * wrapper function for staging * added docs for transformers * fix wording and typos * updated change log and bumped the version * added docs on huggingface dependencies * fix typo * re pip-compile	2022-10-13 11:18:27 -04:00
asymness	ec5be8e8b0	feat: Implement LabelBox staging brick (#26) * Implement stage_for_label_box function * Add unit tests for stage_for_label_box function * Update docs with description and example for stage_for_label_box function * Bump version and update CHANGELOG.md * Fix linting issues and implement suggested changes * Update stage_for_label_box docs with a note for uploading files to cloud providers	2022-10-11 10:15:25 -04:00
asymness	baba641d03	feat: Allow option to specify predictions in LabelStudio staging brick (#23) * Allow stage_for_label_studio to take a predictions input and implement prediction class * Update unit tests for LabelStudioPrediction and stage_for_label_studio function * Update stage_for_label_studio docs with example of loading predictions * Bump version and update changelog Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2022-10-06 13:35:55 +00:00
Yuming Long	779e48bafe	chore: Integration test to show LabelStudio brick working with SDK (#21)	2022-10-05 14:38:44 -04:00
Matt Robinson	a950559b94	feat: Optionally include LabelStudio annotations in staging brick (#19) * added types for label studio annotations * added method to cast as dicts * added length check for annotations * tweaks to get upload to work * added validation for label types * annotations is a list for each example * little bit of refactoring * test for staging with label studio * tests for error conditions and reviewers * added test for NER annotations * updated changelog and bumped version * added docs with annotation examples * fix label studio link * bump version in sphinx docs * fulle -> full (typo fix)	2022-10-04 13:25:05 +00:00
asymness	d429e9b305	feat: Implement `stage_csv_for_prodigy` brick (#13) * Refactor metadata validation and implement stage_csv_for_prodigy brick * Refactor unit tests for metadata validation and add tests for Prodigy CSV brick * Add stage_csv_for_prodigy description and example in docs * Bump version and update changelog * added _csv_ to function name * update changelog line to 0.2.1-dev2 Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2022-10-03 09:30:30 -04:00
asymness	35d488a466	feat: Implement stage_for_prodigy brick (#11) * Implement unit tests for stage_for_prodigy brick * Implement brick for converting data to Prodigy format * Add stage_for_prodigy description and example to docs * updated changelog Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2022-09-30 12:41:37 -04:00
qued	64e1c725eb	feat: Add text_field and id_field to stage_for_label_studio signature (#9) Added text_field and id_field to stage_for_label_studio signature, to allow user to specify the keys in the resulting JSON. Includes tests and update to example in sphinx docs.	2022-09-28 09:30:17 -05:00
Matt Robinson	5f40c78f25	Initial Release	2022-09-26 14:55:20 -07:00

24 Commits