39 Commits

Author SHA1 Message Date
Matt Robinson
339c133326
fix: cleanup from live .docx tests (#177)
* add env var for cap threshold; raise default threshold

* update docs and tests

* added check for ending in a comma

* update docs

* no caps check for all upper text

* capture Text in html and text

* check category in Text equality check

* lower case all caps before checking for verbs

* added check for us city/state/zip

* added address type

* add address to html

* add address to text

* fix for text tests; escape for large text segments

* refactor regex for readability

* update comment

* additional test for text with linebreaks

* update docs

* update changelog

* update elements docs

* remove old comment

* case -> cast

* type fix
2023-01-26 15:52:25 +00:00
Matt Robinson
8b6c5fac9d
feat: basic PowerPoint parsing in partition_pptx (#166)
* partition pptx and tests

* add partition_pptx to auto

* update doc types in readme

* add pptx docs

* bump version

* remove extra whitespace

* partition -> partitioning
2023-01-23 17:03:09 +00:00
Matt Robinson
f12240c5e7
feat: add support for .txt files in partition (#150)
* added partition_text for auto

* rename partition_text tests

* bump version and update docs
2023-01-13 16:39:53 -05:00
Matt Robinson
eba4c80b1e
feat: get_directory_file_info for exploring a directory of files (#142)
* added python-pptx to requirements

* added filetype detection for powerpoint

* add more filetypes to detect

* more tests

* added tests for filetype

* reorder document types

* tests for get_directory_file_info

* added docs for get_directory_file_info

* bump version

* Word -> Office

* added test for filetype

* add group by filetype example
2023-01-11 12:40:50 -05:00
Matt Robinson
5376bc510f
feat: generic partition brick with filetype detection (#132)
* add python-magic

* first pass on filetype detection

* tests for filetype detection

* more tests for file detection

* added tests for error conditions

* install libmagic dev in github

* libmagic install instructions

* pattern for checking email files

* support reading .eml in rb mode

* add auto partition function

* auto tests for email

* auto tests for docx

* added tests for html

* add pdf and html tests

* linting, linting, linting

* added docs for auto partitioning

* update readme with generic partition brick

* bumped version

* added test for bad type

* detect .docx files from application/octet-stream

* linting, linting, linting

* identify xlsx from octet stream

* install poppler in ci

* fix mocks; test for unknown type

* install poppler utils

* install in one line

* only poppler-utils

* file extension logic from application/octet-stream

* install local inference for ci

* install detectron2

* removing unused dockerfile
2023-01-09 16:15:14 -05:00
Mallori Harrell
d7a00046a9
feat: Add new functionality to parse text and header of emails (#111)
* partition_text function
2023-01-09 17:08:08 +00:00
Matt Robinson
fee95b643c
feat: add partition_docx for Word documents (#131)
* first pass on docx parsing

* linting, linting, linting

* test docx with filename

* added documentation

* more tests; version bump

* typo

* another typo

* another typo!

* it -> its

* save -> saved

* remove None since it's the default argument
2023-01-05 20:13:39 +00:00
Matt Robinson
33b983fbf0
docs: instructions on how to install on Windows + conda (#129)
* add environment.yml

* instructions on how to install base package and detectron2

* added instructions on paddleocr

* remove covers

* install -> to install

* specified the shell

* updated example snippets

* update environment.yml

* updated the repo reference

* no more ands!
2023-01-05 16:21:44 +00:00
Sebastian Laverde Alfonso
5a47eb06e9
feat: new bricks for removing and extracting ordered bullets (#128)
* feat: new cleaning brick for ordered bullets

* test: add test for cleaning ordered bullets

* feat: new brick for extracting ordered bullets

* test: add test for extracting ordered bullets

* docs: update CHANGELOG and bump new dev version

* chore: change extract ordered bullets return type to tuple

* chore: made tidy

* chore: regex to split on pattern instead of built-in

* chore: catch ValueError, made tidy and fix incompatible type

* chore: assertion statements in one line of code

* docs: add documentation for new clean and extract bricks to bricks.rst

* docs: refactor CHANGELOG 0.3.5.dev5 to dev6 with new bullets

* docs: update CHANGELOG 0.3.6-dev0 changes and bump version

Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
2023-01-05 17:06:26 +01:00
Matt Robinson
17045aed80
feat: add convert_to_dataframe staging brick (#127)
* add pandas to deps; pip-compile

* staging brick to convert elements to dataframe

* bump version

* add convert_to_dataframe docs

* bump wheel version

* typo fix

* typo fix 2!
2023-01-04 12:04:59 -05:00
Matt Robinson
445533745c
feat: helper functions to identify and extract phone numbers (#124)
* added pattern for finding phone numbers

* added cleaning brick for extracting phone numbers

* add docs

* changelog and bump version

* switch to us phone numbers

* bump dev version
2023-01-03 13:31:05 -05:00
Mallori Harrell
509ad4951c
feat: Add extract_attachment_info (#112)
* Adds function to extract attachments and their metadata from eml files
2023-01-03 11:41:54 -06:00
Matt Robinson
b14f6ac9bd
feat: extract metadata from .docx, .xlsx, and .jpg (#113)
* add python-docx dependency

* added function for extracting metadata from word documents

* add openpyxl

* added get_jpg_metadata; fixed typing

* bump changelog

* added pillow to dependencies
2022-12-26 09:34:36 -05:00
Matt Robinson
7a74cdda86
feat: add partition_email cleaning brick (#104)
* fix for processing deeply embedded list elements

* fix types in mime encodings cleaner

* first pass on partition_email

* tests for email

* test for mime encodings

* changelog bump

* added note about \n=

* linting, linting, linting

* added email docs

* add partition_email to the readme

* add one more test
2022-12-19 18:02:44 +00:00
Matt Robinson
1d68bb2482
feat: apply method to apply cleaning bricks to elements (#102)
* add apply method to apply cleaners to elements

* bump version

* add check for string output

* documentations for the apply method

* change interface to *cleaners
2022-12-15 22:19:02 +00:00
Matt Robinson
b1cce16c16
feat: translate_text cleaning brick (#101)
* initial implementation for translate brick

* more input validation

* tests for translate brick

* added docs

* bumped version

* chinese and arabic tests

* re-run pip-compile

* add torch to dependencies

* cleanup doc string

* fix long string

* fix typo in docs

* take out empty string check

* return string if string is empty

* added huggingface into make install
2022-12-15 15:35:15 -05:00
Matt Robinson
3c19c7cd8a
feat: Add partition_html brick (#91)
* update readme

* updated sphinx docs

* bump version; changelog

* clear cache; retrigger ci

* rename test file

* switch default parameters to None

* typo in the changelog

* add in text output
2022-12-12 14:22:10 +00:00
Matt Robinson
77cd5cc01f
feat: text2text and token classification for argilla (#87)
* add support for text2text

* add support for token classification datasets

* bump versions

* updated docs

* remove extra comment

* fix wording in docs

* fix some more wording
2022-11-30 20:07:42 +00:00
asymness
2170a2aae2
feat: Implement Argilla staging brick (#81)
* Add argilla to dependencies and run pip-compile

* Implement Argilla staging brick and add unit tests

* Update version and changelog

* Update docs with description and usage for Argilla staging brick

* Remove unused fixtures and fix typo in Argilla tests

* add missing quote in docs

* changelog tweak

* doc tweaks

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-11-28 14:41:48 +00:00
Matt Robinson
b041b0197d
feat: Add entities kwarg to datasaur bricks (#77)
* added entities to datasaur

* add tests for datasaur with entities

* update docs

* fix missing imports

* bump version

* remove accidental file
2022-11-22 19:50:19 +00:00
Matt Robinson
08e091c5a9
chore: Reorganize partition bricks under partition directory (#76)
* move partition_pdf to partition folder

* move partition.py

* refactor partitioning bricks into partition directory

* import to nlp for backward compatibility

* update docs

* update version and bump changelog

* fix typo in changelog

* update readme reference
2022-11-21 22:27:23 +00:00
Mallori Harrell
53fcf4e912
chore: Remove PDF parsing code and dependencies (#75)
Remove PDF parsing code and dependencies.
2022-11-21 11:47:29 -06:00
Sebastian Laverde Alfonso
baa15d0098
feat: new partitioning brick that calls the document image analysis API (#68)
* docs: add new feature to the CHANGELOG.md, bump the version, update __version__.py

* feat: new partition to call the document image analysis API

* fix: remove duplicated dependency on partition.py

* fix: linting error due to line-length > 100

* test: add test to call partition_pdf brick

* chore: new short example-doc pdf for speed up in test X8

* fix: add missing return statement to _read to pass check

* feat: new partitioning brick to call doc parse API

* docs: version update fix in CHANGELOG

* refactor: no nested ifs

* docs: documentation for new brick partition_pdf

* refactor: made tidy

* docs: minor doc refactor

Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
2022-11-16 17:48:30 +01:00
Matt Robinson
300c564c62
feat: Cleaning bricks to extract text before/after a pattern (#63)
* brick to extract text before

* brick for extract text after

* tests for extract before and after

* updated docs

* changelog and bump version

* fix typo

* fix another typo

* positive -> non-negative
2022-11-10 21:35:37 +00:00
Matt Robinson
f3756abc90
feat: Cleaning bricks for removing prefixes and postfixes (#62)
* added prefix and postfix cleaners

* added test for pre and postfix cleaners

* added docs for prefix and postfix bricks

* changelog and bump version

* add dev to version
2022-11-10 12:24:58 -05:00
benjats07
df16b5806b
feat: Add staging brick for Datasaur token-based tasks (#50)
* feat: Add staging brick for Datasaur token-based tasks

* Added doc string and formatting with flake8,mypy and black

* docs: Added documentation for stage_for_datasaur

* fix: version sync correction

* fix: Corrections to docs for stage_for_datasaur

* fix: changes in naming of example variables

* Update docs/source/bricks.rst

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-11-07 14:56:02 -06:00
Matt Robinson
de31df51a9
feat: Adds a helper function to convert ISD dicts to elements (#39)
* updated category name for ListItem

* added brick to convert isd to elements

* bump version

* added isd_to_elements to documentation
2022-10-21 18:43:10 +00:00
asymness
2d5dba0ddc
feat: Implement staging brick for ISD CSV format (#36)
* Implement convert_to_isd_csv function

* Add unit tests for convert_to_isd_csv function

* Update docs with description and example of convert_to_isd_csv function

* Update changelog and version
2022-10-13 11:35:46 -04:00
Matt Robinson
fb16847946
feat: Staging brick for attention window chunking (#34)
* add huggingface dependencies and re pip-compile

* first pass on chunk by attention window

* test for chunking function

* completed tests for chunk_by_attention_window

* change default buffer size to 2

* wrapper function for staging

* added docs for transformers

* fix wording and typos

* updated change log and bumped the version

* added docs on huggingface dependencies

* fix typo

* re pip-compile
2022-10-13 11:18:27 -04:00
asymness
ec5be8e8b0
feat: Implement LabelBox staging brick (#26)
* Implement stage_for_label_box function

* Add unit tests for stage_for_label_box function

* Update docs with description and example for stage_for_label_box function

* Bump version and update CHANGELOG.md

* Fix linting issues and implement suggested changes

* Update stage_for_label_box docs with a note for uploading files to cloud providers
2022-10-11 10:15:25 -04:00
qued
1d3076a4b2
feat: keep version synchronized (#25)
* Added script to check/sync versions using CHANGELOG.md as a source of truth.
* Script currently only syncs __version__.py but can easily be extended to cover other files by adding the files to an array in the script.
* Also updated sphinx conf.py to get version dynamically from __version__.py
2022-10-10 13:11:48 -05:00
Matt Robinson
836f156582
docs: Add LabelStudio sentiment analysis example (#24)
* added documentation on how to use unstructured with labelstudio

* hard code risk narrative for docs

* link to create project call
2022-10-10 08:27:01 -04:00
asymness
baba641d03
feat: Allow option to specify predictions in LabelStudio staging brick (#23)
* Allow stage_for_label_studio to take a predictions input and implement prediction class

* Update unit tests for LabelStudioPrediction and stage_for_label_studio function

* Update stage_for_label_studio docs with example of loading predictions

* Bump version and update changelog

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-10-06 13:35:55 +00:00
asymness
28a4ae985d
feat: Implement utility functions for reading and writing .jsonl files (#22)
* Implement save_as_jsonl and read_from_jsonl utility functions

* Add unit tests for save_as_jsonl and read_from_jsonl utility functions

* Add example of using save_as_jsonl with prodigy staging brick

* Bump version and update changelog

* remove accidentally added prodigy json file

* added "the" in jsonl description

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2022-10-04 09:51:11 -04:00
Matt Robinson
a950559b94
feat: Optionally include LabelStudio annotations in staging brick (#19)
* added types for label studio annotations

* added method to cast as dicts

* added length check for annotations

* tweaks to get upload to work

* added validation for label types

* annotations is a list for each example

* little bit of refactoring

* test for staging with label studio

* tests for error conditions and reviewers

* added test for NER annotations

* updated changelog and bumped version

* added docs with annotation examples

* fix label studio link

* bump version in sphinx docs

* fulle -> full (typo fix)
2022-10-04 13:25:05 +00:00
asymness
d429e9b305
feat: Implement stage_csv_for_prodigy brick (#13)
* Refactor metadata validation and implement stage_csv_for_prodigy brick

* Refactor unit tests for metadata validation and add tests for Prodigy CSV brick

* Add stage_csv_for_prodigy description and example in docs

* Bump version and update changelog

* added _csv_ to function name

* update changelog line to 0.2.1-dev2

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2022-10-03 09:30:30 -04:00
asymness
35d488a466
feat: Implement stage_for_prodigy brick (#11)
* Implement unit tests for stage_for_prodigy brick

* Implement brick for converting data to Prodigy format

* Add stage_for_prodigy description and example to docs

* updated changelog

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2022-09-30 12:41:37 -04:00
qued
64e1c725eb
feat: Add text_field and id_field to stage_for_label_studio signature (#9)
Added text_field and id_field to stage_for_label_studio signature, to allow user to specify the keys in the resulting JSON. Includes tests and update to example in sphinx docs.
2022-09-28 09:30:17 -05:00
Matt Robinson
5f40c78f25
Initial Release
2022-09-26 14:55:20 -07:00