unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-01 03:13:20 +00:00

Author	SHA1	Message	Date
Matt Robinson	339c133326	fix: cleanup from live `.docx` tests (#177 ) * add env var for cap threshold; raise default threshold * update docs and tests * added check for ending in a comma * update docs * no caps check for all upper text * capture Text in html and text * check category in Text equality check * lower case all caps before checking for verbs * added check for us city/state/zip * added address type * add address to html * add address to text * fix for text tests; escape for large text segments * refactor regex for readability * update comment * additional test for text with linebreaks * update docs * update changelog * update elements docs * remove old comment * case -> cast * type fix	2023-01-26 15:52:25 +00:00
Matt Robinson	8b6c5fac9d	feat: basic PowerPoint parsing in `partition_pptx` (#166 ) * partition pptx and tests * add partition_pptx to auto * update doc types in readme * add pptx docs * bump version * remove extra whitespace * partition -> partitioning	2023-01-23 17:03:09 +00:00
Matt Robinson	f12240c5e7	feat: add support for `.txt` files in `partition` (#150 ) * added partition_text for auto * rename partition_text tests * bump version and update docs	2023-01-13 16:39:53 -05:00
Matt Robinson	eba4c80b1e	feat: `get_directory_file_info` for exploring a directory of files (#142 ) * added python-pptx to requirements * added filetype detection for powerpoint * add more filetypes to detect * more tests * added tests for filetype * reorder document types * tests for get_directory_file_info * added docs for get_directory_file_info * bump version * Word -> Office * added test for filetype * add group by filetype example	2023-01-11 12:40:50 -05:00
Matt Robinson	5376bc510f	feat: generic `partition` brick with filetype detection (#132 ) * add python-magic * first pass on filetype detection * tests for filetype detection * more tests for file detection * added tests for error conditions * install libmagic dev in github * libmagic install instructions * pattern for checking email files * support reading .eml in rb mode * add auto partition function * auto tests for email * auto tests for docx * added tests for html * add pdf and html tests * linting, linting, linting * added docs for auto partitioning * update readme with generic partition brick * bumped version * added test for bad type * detect .docx files from application/octet-stream * linting, linting, linting * identify xlsx from octet stream * install poppler in ci * fix mocks; test for unknown type * install poppler utils * install in one line * only poppler-utils * file extension logic from application/octet-stream * install local inference for ci * install detectron2 * removing unused dockerfile	2023-01-09 16:15:14 -05:00
Mallori Harrell	d7a00046a9	feat: Add new functionality to parse text and header of emails (#111 ) * partition_text function	2023-01-09 17:08:08 +00:00
Matt Robinson	fee95b643c	feat: add `partition_docx` for Word documents (#131 ) * first pass on docx parsing * linting, linting, linting * test docx with filename * added documentation * more tests; version bump * typo * another typo * another typo! * it -> its * save -> saved * remove None since it's the default argument	2023-01-05 20:13:39 +00:00
Matt Robinson	33b983fbf0	docs: instructions on how to install on Windows + `conda` (#129 ) * add environment.yml * instructions on how to install base package and detectron2 * added instructions on paddleocr * remove covers * install -> to install * specified the shell * updated example snippets * update environment.yml * updated the repo reference * no more ands!	2023-01-05 16:21:44 +00:00
Sebastian Laverde Alfonso	5a47eb06e9	feat: new bricks for removing and extracting ordered bullets (#128 ) * feat: new cleaning brick for ordered bullets * test: add test for cleaning ordered bullets * feat: new brick for extracting ordered bullets * test: add test for extracting ordered bullets * docs: update CHANGELOG and bump new dev version * chore: change extract ordered bullets return type to tuple * chore: made tidy * chore: regex to split on pattern instead of built-in * chore: catch ValueError, made tidy and fix incompatible type * chore: assertion statements in one line of code * docs: add documentation for new clean and extract bricks to bricks.rst * docs: refactor CHANGELOG 0.3.5.dev5 to dev6 with new bullets * docs: update CHANGELOG 0.3.6-dev0 changes and bump version Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>	2023-01-05 17:06:26 +01:00
Matt Robinson	17045aed80	feat: add `convert_to_dataframe` staging brick (#127 ) * add pandas to deps; pip-compile * staging brick to convert elements to dataframe * bump version * add convert_to_dataframe docs * bump wheel version * typo fix * typo fix 2!	2023-01-04 12:04:59 -05:00
Matt Robinson	445533745c	feat: helper functions to identify and extract phone numbers (#124 ) * added pattern for finding phone numbers * added cleaning brick for extracting phone numbers * add docs * changelog and bump version * switch to us phone numbers * bump dev version	2023-01-03 13:31:05 -05:00
Mallori Harrell	509ad4951c	feat: Add `extract_attachment_info` (#112 ) * Adds function to extract attachments and their metadata from eml files	2023-01-03 11:41:54 -06:00
Matt Robinson	b14f6ac9bd	feat: extract metadata from `.docx`, `.xlsx`, and `.jpg` (#113 ) * add python-docx dependency * added function for extracting metadata from word documents * add openpyxl * added get_jpg_metadata; fixed typing * bump changelog * added pillow to dependencies	2022-12-26 09:34:36 -05:00
Matt Robinson	7a74cdda86	feat: add `partition_email` cleaning brick (#104 ) * fix for processing deeply embedded list elements * fix types in mime encodings cleaner * first pass on partition_email * tests for email * test for mime encodings * changelog bump * added note about \n= * linting, linting, linting * added email docs * add partition_email to the readme * add one more test	2022-12-19 18:02:44 +00:00
Matt Robinson	1d68bb2482	feat: `apply` method to apply cleaning bricks to elements (#102 ) * add apply method to apply cleaners to elements * bump version * add check for string output * documentations for the apply method * change interface to *cleaners	2022-12-15 22:19:02 +00:00
Matt Robinson	b1cce16c16	feat: `translate_text` cleaning brick (#101 ) * initial implementation for translate brick * more input validation * tests for translate brick * added docs * bumped version * chinese and arabic tests * re-run pip-compile * add torch to dependencies * cleanup doc string * fix long string * fix typo in docs * take out empty string check * return string if string is empty * added huggingface into make install	2022-12-15 15:35:15 -05:00
Matt Robinson	3c19c7cd8a	feat: Add partition_html brick (#91 ) * update readme * updated sphinx docs * bump version; changelog * clear cache; retrigger ci * rename test file * switch default parameters to None * typo in the changelog * add in text output	2022-12-12 14:22:10 +00:00
Matt Robinson	77cd5cc01f	feat: text2text and token classification for argilla (#87 ) * add support for text2text * add support for token classification datasets * bump versions * updated docs * remove extra comment * fix wording in docs * fix some more wording	2022-11-30 20:07:42 +00:00
asymness	2170a2aae2	feat: Implement Argilla staging brick (#81 ) * Add argilla to dependencies and run pip-compile * Implement Argilla staging brick and add unit tests * Update version and changelog * Update docs with description and usage for Argilla staging brick * Remove unused fixtures and fix typo in Argilla tests * add missing quote in docs * changelog tweak * doc tweaks Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2022-11-28 14:41:48 +00:00
Matt Robinson	b041b0197d	feat: Add entities kwarg to datasaur bricks (#77 ) * added entities to datasaur * add tests for datasaur with entities * update docs * fix missing imports * bump version * remove accidental file	2022-11-22 19:50:19 +00:00
Matt Robinson	08e091c5a9	chore: Reorganize partition bricks under partition directory (#76 ) * move partition_pdf to partition folder * move partition.py * refactor partitioning bricks into partition directory * import to nlp for backward compatibility * update docs * update version and bump changelog * fix typo in changelog * update readme reference	2022-11-21 22:27:23 +00:00
Mallori Harrell	53fcf4e912	chore: Remove PDF parsing code and dependencies (#75 ) Remove PDF parsing code and dependencies.	2022-11-21 11:47:29 -06:00
Sebastian Laverde Alfonso	baa15d0098	feat: new partitioning brick that calls the document image analysis API (#68 ) * docs: add new feature to the CHANGELOG.md, bump the version, update __version__.py * feat: new partition to call the document image analysis API * fix: remove duplicated dependency on partition.py * fix: linting error due to line-length > 100 * test: add test to call partition_pdf brick * chore: new short example-doc pdf for speed up in test X8 * fix: add missing return statement to _read to pass check * feat: new partitioning brick to call doc parse API * docs: version update fix in CHANGELOG * refactor: no nested ifs * docs: documentation for new brick partition_pdf * refactor: made tidy * docs: minor doc refactor Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>	2022-11-16 17:48:30 +01:00
Matt Robinson	300c564c62	feat: Cleaning bricks to extract text before/after a pattern (#63 ) * brick to extract text before * brick for extract text after * tests for extract before and after * updated docs * changelog and bump version * fix typo * fix another typo * positive -> non-negative	2022-11-10 21:35:37 +00:00
Matt Robinson	f3756abc90	feat: Cleaning bricks for removing prefixes and postfixes (#62 ) * added prefix and postfix cleaners * added test for pre and postfix cleaners * added docs for prefix and postfix bricks * changelog and bump version * add dev to version	2022-11-10 12:24:58 -05:00
benjats07	df16b5806b	feat: Add staging brick for Datasaur token-based tasks (#50 ) * feat: Add staging brick for Datasaur token-based tasks * Added doc string and formatting with flake8,mypy and black * docs: Added documentation for stage_for_datasaur * fix: version sync correction * fix: Corrections to docs for stage_for_datasaur * fix: changes in naming of example variables * Update docs/source/bricks.rst Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2022-11-07 14:56:02 -06:00
Matt Robinson	de31df51a9	feat: Adds a helper function to convert ISD dicts to elements (#39 ) * updated category name for ListItem * added brick to convert isd to elements * bump version * added isd_to_elements to documentation	2022-10-21 18:43:10 +00:00
asymness	2d5dba0ddc	feat: Implement staging brick for ISD CSV format (#36 ) * Implement convert_to_isd_csv function * Add unit tests for convert_to_isd_csv function * Update docs with description and example of convert_to_isd_csv function * Update changelog and version	2022-10-13 11:35:46 -04:00
Matt Robinson	fb16847946	feat: Staging brick for attention window chunking (#34 ) * add huggingface dependencies and re pip-compile * first pass on chunk by attention window * test for chunking function * completed tests for chunk_by_attention_window * change default buffer size to 2 * wrapper function for staging * added docs for transformers * fix wording and typos * updated change log and bumped the version * added docs on huggingface dependencies * fix typo * re pip-compile	2022-10-13 11:18:27 -04:00
asymness	ec5be8e8b0	feat: Implement LabelBox staging brick (#26 ) * Implement stage_for_label_box function * Add unit tests for stage_for_label_box function * Update docs with description and example for stage_for_label_box function * Bump version and update CHANGELOG.md * Fix linting issues and implement suggested changes * Update stage_for_label_box docs with a note for uploading files to cloud providers	2022-10-11 10:15:25 -04:00
qued	1d3076a4b2	feat: keep version synchronized (#25 ) * Added script to check/sync versions using CHANGELOG.md as a source of truth. * Script currently only syncs __version__.py but can easily be extended to cover other files by adding the files to an array in the script. * Also updated sphinx conf.py to get version dynamically from __version__.py	2022-10-10 13:11:48 -05:00
Matt Robinson	836f156582	docs: Add example LabelStudio sentiment analysis example (#24 ) * added documentation on how to use unstructured with labelstudio * hard code risk narrative for docs * link to create project call	2022-10-10 08:27:01 -04:00
asymness	baba641d03	feat: Allow option to specify predictions in LabelStudio staging brick (#23 ) * Allow stage_for_label_studio to take a predictions input and implement prediction class * Update unit tests for LabelStudioPrediction and stage_for_label_studio function * Update stage_for_label_studio docs with example of loading predictions * Bump version and update changelog Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2022-10-06 13:35:55 +00:00
asymness	28a4ae985d	feat: Implement utility functions for reading and writing `.jsonl` files (#22 ) * Implement save_as_jsonl and read_from_jsonl utility functions * Add unit tests for save_as_jsonl and read_from_jsonl utility functions * Add example of using save_as_jsonl with prodigy staging brick * Bump version and update changelog * remove accidentally added prodigy json file * added "the" in jsonl description Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2022-10-04 09:51:11 -04:00
Matt Robinson	a950559b94	feat: Optionally include LabelStudio annotations in staging brick (#19 ) * added types for label studio annotations * added method to cast as dicts * added length check for annotations * tweaks to get upload to work * added validation for label types * annotations is a list for each example * little bit of refactoring * test for staging with label studio * tests for error conditions and reviewers * added test for NER annotations * updated changelog and bumped version * added docs with annotation examples * fix label studio link * bump version in sphinx docs * fulle -> full (typo fix)	2022-10-04 13:25:05 +00:00
asymness	d429e9b305	feat: Implement `stage_csv_for_prodigy` brick (#13 ) * Refactor metadata validation and implement stage_csv_for_prodigy brick * Refactor unit tests for metadata validation and add tests for Prodigy CSV brick * Add stage_csv_for_prodigy description and example in docs * Bump version and update changelog * added _csv_ to function name * update changelog line to 0.2.1-dev2 Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2022-10-03 09:30:30 -04:00
asymness	35d488a466	feat: Implement stage_for_prodigy brick (#11 ) * Implement unit tests for stage_for_prodigy brick * Implement brick for converting data to Prodigy format * Add stage_for_prodigy description and example to docs * updated changelog Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2022-09-30 12:41:37 -04:00
qued	64e1c725eb	feat: Add text_field and id_field to stage_for_label_studio signature (#9 ) Added text_field and id_field to stage_for_label_studio signature, to allow user to specify the keys in the resulting JSON. Includes tests and update to example in sphinx docs.	2022-09-28 09:30:17 -05:00
Matt Robinson	5f40c78f25	Initial Release	2022-09-26 14:55:20 -07:00

39 Commits