51 Commits

Author SHA1 Message Date
Matt Robinson
7c08450597
feat: add "fast" strategy for PDF parsing; fallback to "fast" if detectron2 is not available (#357)
Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res", the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to the "fast" strategy. The implementation uses pdfminer because it is already installed as a dependency with the local-inference extra. Other options exist, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.
2023-03-11 03:16:05 +00:00
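The fallback described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `choose_pdf_strategy` is hypothetical, and only the strategy names "hi_res" and "fast" and the detectron2 check come from the commit message.

```python
import importlib.util


def choose_pdf_strategy(requested: str = "hi_res") -> str:
    """Return the strategy to run (hypothetical helper, not the library's API)."""
    if requested not in ("hi_res", "fast"):
        raise ValueError(f"unknown strategy: {requested!r}")
    # find_spec returns None when the package cannot be found.
    has_detectron2 = importlib.util.find_spec("detectron2") is not None
    if requested == "hi_res" and not has_detectron2:
        # Fall back to the pdfminer-based path when detectron2 is missing.
        return "fast"
    return requested
```

Checking availability with `find_spec` avoids importing a heavy package just to learn whether it is installed.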
Alvaro Bartolome
c51adb21e3
feat: add FsspecConnector to easily integrate new connectors with a fsspec implementation available (#318)
This is a fairly large PR that adds an "adapter" for plugging in any connector that has an fsspec implementation available. It standardizes how remote filesystems are used within unstructured.

I've also renamed s3_connector.py to s3.py for readability and consistency, and tested that the current approach works as expected.
2023-03-10 06:15:19 +00:00
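The adapter idea in the commit above might look something like the sketch below. fsspec filesystem objects expose `ls` and `cat_file` methods, which is all this sketch relies on; the class names `RemoteConnector` and `InMemoryFS` are hypothetical stand-ins and are not taken from the PR. The in-memory class exists only so the example runs without any remote backend.

```python
from typing import Dict, Iterator, List, Tuple


class InMemoryFS:
    """Stand-in for a filesystem object offering ls() and cat_file()."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files

    def ls(self, path: str, detail: bool = False) -> List[str]:
        # Return the sorted paths that live under the given prefix.
        return sorted(name for name in self._files if name.startswith(path))

    def cat_file(self, path: str) -> bytes:
        return self._files[path]


class RemoteConnector:
    """Hypothetical adapter: one class serves any backend with ls/cat_file."""

    def __init__(self, fs, root: str) -> None:
        self.fs = fs
        self.root = root

    def iter_documents(self) -> Iterator[Tuple[str, bytes]]:
        # The connector never needs to know which backend it is talking to.
        for path in self.fs.ls(self.root, detail=False):
            yield path, self.fs.cat_file(path)
```

Under these assumptions, supporting a new storage backend means passing a different filesystem object rather than writing a new connector class.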
Matt Robinson
7c619f045b
feat: UNSTRUCTURED_LANGUAGE_CHECK env var to control (#351)
* environment variable to set language checks

* change log and version

* checks for if language checks are false

* update docs

* changelog type

* add assert to tests

* performance note in docstrings

* docstring tweaks
2023-03-09 17:33:48 +00:00
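An environment-variable switch like the one in the commit above is commonly read as in this sketch. Only the variable name `UNSTRUCTURED_LANGUAGE_CHECK` comes from the commit title; the helper name, the default of "true", and the set of values treated as false are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def language_checks_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: decide whether language checks should run."""
    env = os.environ if env is None else env
    # Assumed default: checks stay on unless explicitly disabled.
    value = env.get("UNSTRUCTURED_LANGUAGE_CHECK", "true")
    return value.strip().lower() not in ("false", "0", "no")
```

Accepting the environment as a parameter keeps the helper easy to test without mutating `os.environ`.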
Matt Robinson
1cd1bd8eba
docs: more detailed bricks writeup; reorganize docs (#304)
* add print statement in readme

* elements before bricks

* new preamble to bricks section

* add preamble to bricks section

* add preamble to cleaning section

* descriptions of each documentation page

* non-brick helper functions to the bottom

* fix codeblock

* includes some optional kwargs

* code blocks

* typo fix
2023-02-27 23:11:49 +00:00
Tom Aarsen
9062d25d0d
Resolve numerous typos (#280)
* Resolve numerous typos

* Resolve typo in mime type
2023-02-24 17:48:23 -08:00
Matt Robinson
0d229f0a5e
fix: preserve all elements when serialized; feat: helper functions for serialization (#273)
* added type to text element map

* add element_id and coordinates

* added test for serialization

* added serialization for check boxes

* add dict_to_elements and convert_to_dict aliases

* helpers for serializing and deserializing elements

* bump version; changelog

* add Text to tests

* aliases for isd functions

* remove test elements json

* changelog updates

* make indent a kwarg

* update expected structured output

* docs update

* use new function in ingest code

* pop coordinates due to floating point differences

* pop coordinates
2023-02-23 21:58:59 +00:00
Matt Robinson
601f250edc
feat: add partition_ppt for older power point docs (#238)
* added partition_ppt function and tests

* add ppt support to auto

* version bump

* update docs

* doc fixes

* update changelog

* `.docx` -> `.pptx`

* its -> their

* remove whitespace
2023-02-17 16:57:08 +00:00
Matt Robinson
6036af33e7
feat: add partition_doc for .doc files (#236)
* first pass on doc partitioning

* add libreoffice to deps

* update docs and readme

* add .doc to auto

* changelog bump

* value error with missing doc

* doc updates
2023-02-17 09:30:23 -05:00
Matt Robinson
558ee63e90
feat: ability to skip English language specific checks with env var (#224)
* add language env var

* update docs

* version and bump change log
2023-02-15 09:15:47 -05:00
Matt Robinson
a68dc35940
chore: default to local inference for partition_pdf and partition_image (#222)
* chore: default the url to None for pdf and images

* bump changelog and version
2023-02-14 16:16:33 -05:00
Matt Robinson
f890972139
docs: add bricks training notebook (#211)
* added bricks notebook

* more unicode quotes; isd dataframe column fix

* fix remove_punctuation docs

* typo fixes

* put staging bricks in code
2023-02-10 14:39:14 +00:00
Matt Robinson
d0c6d50962
note on local inference
2023-02-09 15:16:14 -05:00
Matt Robinson
7f9aefc549
update partition_pdf section; added partition_image
2023-02-09 15:13:26 -05:00
djacobs7
15b0dffdb0
docs: correct kwarg in bricks.rst (#206)
Changed whitespace to extra_whitespace in documentation, to match options text.
2023-02-08 18:21:58 +00:00
Matt Robinson
e73cf09977
feat: optional page breaks for .pptx, .pdf, .html and images (#205)
* page breaks for pptx

* added page breaks for image/pdf

* tests for images with page breaks

* page breaks for html documents

* linting, linting, linting

* changelog and bump version

* update docs

* fix typo

* refactor reusable code to common.py

* add type back in
2023-02-08 15:11:15 +00:00
Matt Robinson
a7ca58e0bc
fix: more english words; split on punctuation (#191)
* add a bigger list of english words

* update thresholds and add tests

* update docs; bump version

* fix version

* add additional english words back in

* linting, linting, linting

* add slashes

* work -> word
2023-02-02 17:25:47 +00:00
Matt Robinson
0589344ff7
fix: require a minimum proportion of alpha characters for titles and narrative text (#190)
* added alpha ratio check

* added tests for alpha ratio

* bump changelog and update docs

* update changelog/version; update docs

* ofr -> or
2023-02-02 14:59:04 +00:00
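The alpha ratio check named in the commit above can be illustrated with a short sketch. The function names and the 0.5 threshold are assumptions for the example; the commit message does not state the actual threshold or how whitespace is treated.

```python
def alpha_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are alphabetic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def passes_alpha_check(text: str, threshold: float = 0.5) -> bool:
    # Assumed threshold; text made mostly of digits or symbols is rejected.
    return alpha_ratio(text) >= threshold
```

A check of this kind keeps strings such as page numbers or separator rows from being labeled as titles or narrative text.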
Matt Robinson
1230a163fd
feat: set a user controlled max word length for titles (#189)
* update the docs

* add option for title max word length

* bump version; update changelog

* change max length to 12

* docs updates

* to -> too
2023-02-01 19:32:16 +00:00
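A user-controlled word limit for titles, as in the commit above, could be sketched like this. The default of 12 comes from the commit bullet "change max length to 12"; the function name and whitespace-based word counting are assumptions.

```python
def within_title_length(text: str, max_words: int = 12) -> bool:
    """Hypothetical check: a title candidate may have at most max_words words."""
    # Words are counted by splitting on whitespace (an assumption).
    return len(text.split()) <= max_words
```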
Matt Robinson
2d08fcbf83
fix: titles and narrative text need at least one english word (#188)
* added check for english words

* update docs

* at least one word needs to have multiple characters

* bump change log
2023-02-01 09:10:48 -05:00
Matt Robinson
339c133326
fix: cleanup from live .docx tests (#177)
* add env var for cap threshold; raise default threshold

* update docs and tests

* added check for ending in a comma

* update docs

* no caps check for all upper text

* capture Text in html and text

* check category in Text equality check

* lower case all caps before checking for verbs

* added check for us city/state/zip

* added address type

* add address to html

* add address to text

* fix for text tests; escape for large text segments

* refactor regex for readability

* update comment

* additional test for text with linebreaks

* update docs

* update changelog

* update elements docs

* remove old comment

* case -> cast

* type fix
2023-01-26 15:52:25 +00:00
Matt Robinson
8b6c5fac9d
feat: basic PowerPoint parsing in partition_pptx (#166)
* partition pptx and tests

* add partition_pptx to auto

* update doc types in readme

* add pptx docs

* bump version

* remove extra whitespace

* partition -> partitioning
2023-01-23 17:03:09 +00:00
Matt Robinson
f12240c5e7
feat: add support for .txt files in partition (#150)
* added partition_text for auto

* rename partition_text tests

* bump version and update docs
2023-01-13 16:39:53 -05:00
Matt Robinson
5376bc510f
feat: generic partition brick with filetype detection (#132)
* add python-magic

* first pass on filetype detection

* tests for filetype detection

* more tests for file detection

* added tests for error conditions

* install libmagic dev in github

* libmagic install instructions

* pattern for checking email files

* support reading .eml in rb mode

* add auto partition function

* auto tests for email

* auto tests for docx

* added tests for html

* add pdf and html tests

* linting, linting, linting

* added docs for auto partitioning

* update readme with generic partition brick

* bumped version

* added test for bad type

* detect .docx files from application/octet-stream

* linting, linting, linting

* identify xlsx from octet stream

* install poppler in ci

* fix mocks; test for unknown type

* install poppler utils

* install in one line

* only poppler-utils

* file extension logic from application/octet-stream

* install local inference for ci

* install detectron2

* removing unused dockerfile
2023-01-09 16:15:14 -05:00
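Several bullets in the commit above describe falling back to the file extension when the detected MIME type is `application/octet-stream`. A minimal sketch of that logic, with hypothetical names and deliberately small lookup tables, might be:

```python
import os
from typing import Optional

# Illustrative tables only; the real mappings are not given in the commit.
MIME_TO_TYPE = {
    "application/pdf": "pdf",
    "text/html": "html",
    "text/plain": "txt",
}
EXTENSION_TO_TYPE = {".docx": "docx", ".xlsx": "xlsx", ".eml": "eml"}


def resolve_filetype(mime_type: str, filename: Optional[str] = None) -> str:
    """Hypothetical resolver: trust the MIME type unless it is octet-stream."""
    if mime_type == "application/octet-stream" and filename:
        # A generic binary type says nothing useful, so use the extension.
        _, ext = os.path.splitext(filename)
        return EXTENSION_TO_TYPE.get(ext.lower(), "unknown")
    return MIME_TO_TYPE.get(mime_type, "unknown")
```

The MIME type itself would come from a detector such as python-magic, which the commit adds as a dependency; that step is left out here so the sketch has no third-party imports.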
Mallori Harrell
d7a00046a9
feat: Add new functionality to parse text and header of emails (#111)
* partition_text function
2023-01-09 17:08:08 +00:00
Matt Robinson
fee95b643c
feat: add partition_docx for Word documents (#131)
* first pass on docx parsing

* linting, linting, linting

* test docx with filename

* added documentation

* more tests; version bump

* typo

* another typo

* another typo!

* it -> its

* save -> saved

* remove None since it's the default argument
2023-01-05 20:13:39 +00:00
Sebastian Laverde Alfonso
5a47eb06e9
feat: new bricks for removing and extracting ordered bullets (#128)
* feat: new cleaning brick for ordered bullets

* test: add test for cleaning ordered bullets

* feat: new brick for extracting ordered bullets

* test: add test for extracting ordered bullets

* docs: update CHANGELOG and bump new dev version

* chore: change extract ordered bullets return type to tuple

* chore: made tidy

* chore: regex to split on pattern instead of built-in

* chore: catch ValueError, made tidy and fix incompatible type

* chore: assertion statements in one line of code

* docs: add documentation for new clean and extract bricks to bricks.rst

* docs: refactor CHANGELOG 0.3.5.dev5 to dev6 with new bullets

* docs: update CHANGELOG 0.3.6-dev0 changes and bump version

Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
2023-01-05 17:06:26 +01:00
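The two ordered-bullet bricks in the commit above, one that extracts the bullet and one that removes it, can be approximated with a regex as in the sketch below. The function names, the pattern, and the tuple layout are assumptions for illustration; the pattern is a loose heuristic and will also match short abbreviations that end in a period.

```python
import re
from typing import Tuple

# One or more short alphanumeric groups, each followed by a dot, then whitespace.
BULLET = re.compile(r"^\s*((?:[0-9a-zA-Z]{1,3}\.)+)\s+")


def extract_bullet_parts(text: str) -> Tuple[str, ...]:
    """Return the bullet components, e.g. "1.2. Title" gives ("1", "2")."""
    match = BULLET.match(text)
    if match is None:
        return ()
    return tuple(match.group(1).rstrip(".").split("."))


def remove_bullet(text: str) -> str:
    """Strip a leading ordered bullet, leaving other text unchanged."""
    return BULLET.sub("", text, count=1)
```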
Matt Robinson
17045aed80
feat: add convert_to_dataframe staging brick (#127)
* add pandas to deps; pip-compile

* staging brick to convert elements to dataframe

* bump version

* add convert_to_dataframe docs

* bump wheel version

* typo fix

* typo fix 2!
2023-01-04 12:04:59 -05:00
Matt Robinson
445533745c
feat: helper functions to identify and extract phone numbers (#124)
* added pattern for finding phone numbers

* added cleaning brick for extracting phone numbers

* add docs

* changelog and bump version

* switch to us phone numbers

* bump dev version
2023-01-03 13:31:05 -05:00
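A pattern for US phone numbers, as mentioned in the commit above, might look like the following sketch. The regex and the function name are assumptions and not the pattern from the PR; the sketch accepts an optional +1 country code, optional parentheses around the area code, and space, dot, or hyphen separators.

```python
import re

US_PHONE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def find_us_phone_number(text: str) -> str:
    """Return the first US-style phone number in text, or "" if none is found."""
    match = US_PHONE.search(text)
    return match.group(0) if match else ""
```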
Mallori Harrell
509ad4951c
feat: Add extract_attachment_info (#112)
* Adds function to extract attachments and their metadata from eml files
2023-01-03 11:41:54 -06:00
Matt Robinson
7a74cdda86
feat: add partition_email cleaning brick (#104)
* fix for processing deeply embedded list elements

* fix types in mime encodings cleaner

* first pass on partition_email

* tests for email

* test for mime encodings

* changelog bump

* added note about \n=

* linting, linting, linting

* added email docs

* add partition_email to the readme

* add one more test
2022-12-19 18:02:44 +00:00
Matt Robinson
b1cce16c16
feat: translate_text cleaning brick (#101)
* initial implementation for translate brick

* more input validation

* tests for translate brick

* added docs

* bumped version

* chinese and arabic tests

* re-run pip-compile

* add torch to dependencies

* cleanup doc string

* fix long string

* fix typo in docs

* take out empty string check

* return string if string is empty

* added huggingface into make install
2022-12-15 15:35:15 -05:00
Matt Robinson
3c19c7cd8a
feat: Add partition_html brick (#91)
* update readme

* updated sphinx docs

* bump version; changelog

* clear cache; retrigger ci

* rename test file

* switch default parameters to None

* typo in the changelog

* add in text output
2022-12-12 14:22:10 +00:00
Matt Robinson
77cd5cc01f
feat: text2text and token classification for argilla (#87)
* add support for text2text

* add support for token classification datasets

* bump versions

* updated docs

* remove extra comment

* fix wording in docs

* fix some more wording
2022-11-30 20:07:42 +00:00
asymness
2170a2aae2
feat: Implement Argilla staging brick (#81)
* Add argilla to dependencies and run pip-compile

* Implement Argilla staging brick and add unit tests

* Update version and changelog

* Update docs with description and usage for Argilla staging brick

* Remove unused fixtures and fix typo in Argilla tests

* add missing quote in docs

* changelog tweak

* doc tweaks

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-11-28 14:41:48 +00:00
Matt Robinson
b041b0197d
feat: Add entities kwarg to datasaur bricks (#77)
* added entities to datasaur

* add tests for datasaur with entities

* update docs

* fix missing imports

* bump version

* remove accidental file
2022-11-22 19:50:19 +00:00
Matt Robinson
08e091c5a9
chore: Reorganize partition bricks under partition directory (#76)
* move partition_pdf to partition folder

* move partition.py

* refactor partitioning bricks into partition directory

* import to nlp for backward compatibility

* update docs

* update version and bump changelog

* fix typo in changelog

* update readme reference
2022-11-21 22:27:23 +00:00
Sebastian Laverde Alfonso
baa15d0098
feat: new partitioning brick that calls the document image analysis API (#68)
* docs: add new feature to the CHANGELOG.md, bump the version, update __version__.py

* feat: new partition to call the document image analysis API

* fix: remove duplicated dependency on partition.py

* fix: linting error due to line-length > 100

* test: add test to call partition_pdf brick

* chore: new short example-doc pdf for speed up in test X8

* fix: add missing return statement to _read to pass check

* feat: new partitioning brick to call doc parse API

* docs: version update fix in CHANGELOG

* refactor: no nested ifs

* docs: documentation for new brick partition_pdf

* refactor: made tidy

* docs: minor doc refactor

Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
2022-11-16 17:48:30 +01:00
Matt Robinson
300c564c62
feat: Cleaning bricks to extract text before/after a pattern (#63)
* brick to extract text before

* brick for extract text after

* tests for extract before and after

* updated docs

* changelog and bump version

* fix typo

* fix another typo

* positive -> non-negative
2022-11-10 21:35:37 +00:00
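Extracting the text before or after a pattern, as in the commit above, can be sketched with `re.finditer`. The function names and the `index` parameter (which occurrence of the pattern to use) are assumptions; the bullet "positive -> non-negative" suggests an index of this kind but the commit does not spell it out.

```python
import re


def _nth_match(text: str, pattern: str, index: int) -> "re.Match[str]":
    if index < 0:
        raise ValueError("index must be non-negative")
    matches = list(re.finditer(pattern, text))
    if index >= len(matches):
        raise ValueError(f"pattern occurs {len(matches)} time(s); index {index} is out of range")
    return matches[index]


def text_after(text: str, pattern: str, index: int = 0) -> str:
    """Text following the index-th occurrence of pattern, left-stripped."""
    return text[_nth_match(text, pattern, index).end():].lstrip()


def text_before(text: str, pattern: str, index: int = 0) -> str:
    """Text preceding the index-th occurrence of pattern, right-stripped."""
    return text[:_nth_match(text, pattern, index).start()].rstrip()
```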
Matt Robinson
f3756abc90
feat: Cleaning bricks for removing prefixes and postfixes (#62)
* added prefix and postfix cleaners

* added test for pre and postfix cleaners

* added docs for prefix and postfix bricks

* changelog and bump version

* add dev to version
2022-11-10 12:24:58 -05:00
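The prefix and postfix cleaners from the commit above could be sketched as anchored regex substitutions. The function names and the `ignore_case` option are assumptions for illustration, not the signatures from the PR.

```python
import re


def remove_prefix_pattern(text: str, pattern: str, ignore_case: bool = False) -> str:
    """Drop a leading match of pattern, then strip leftover leading whitespace."""
    flags = re.IGNORECASE if ignore_case else 0
    # The non-capturing group keeps alternations inside pattern anchored.
    return re.sub(rf"^(?:{pattern})", "", text, flags=flags).lstrip()


def remove_postfix_pattern(text: str, pattern: str, ignore_case: bool = False) -> str:
    """Drop a trailing match of pattern, then strip leftover trailing whitespace."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.sub(rf"(?:{pattern})$", "", text, flags=flags).rstrip()
```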
benjats07
df16b5806b
feat: Add staging brick for Datasaur token-based tasks (#50)
* feat: Add staging brick for Datasaur token-based tasks

* Added doc string and formatting with flake8,mypy and black

* docs: Added documentation for stage_for_datasaur

* fix: version sync correction

* fix: Corrections to docs for stage_for_datasaur

* fix: changes in naming of example variables

* Update docs/source/bricks.rst

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-11-07 14:56:02 -06:00
Matt Robinson
de31df51a9
feat: Adds a helper function to convert ISD dicts to elements (#39)
* updated category name for ListItem

* added brick to convert isd to elements

* bump version

* added isd_to_elements to documentation
2022-10-21 18:43:10 +00:00
asymness
2d5dba0ddc
feat: Implement staging brick for ISD CSV format (#36)
* Implement convert_to_isd_csv function

* Add unit tests for convert_to_isd_csv function

* Update docs with description and example of convert_to_isd_csv function

* Update changelog and version
2022-10-13 11:35:46 -04:00
Matt Robinson
fb16847946
feat: Staging brick for attention window chunking (#34)
* add huggingface dependencies and re pip-compile

* first pass on chunk by attention window

* test for chunking function

* completed tests for chunk_by_attention_window

* change default buffer size to 2

* wrapper function for staging

* added docs for transformers

* fix wording and typos

* updated change log and bumped the version

* added docs on huggingface dependencies

* fix typo

* re pip-compile
2022-10-13 11:18:27 -04:00
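Chunking text to fit a model's attention window, as in the commit above, can be sketched as below. The default buffer of 2 comes from the commit bullet; everything else is an assumption. In particular, this sketch counts whitespace-separated words as a stand-in for tokens, whereas a real implementation would presumably count tokens with the model's own tokenizer.

```python
from typing import List


def chunk_by_window(text: str, max_tokens: int, buffer: int = 2) -> List[str]:
    """Split text into chunks of at most (max_tokens - buffer) words."""
    # The buffer leaves room for special tokens a model adds around the input.
    limit = max_tokens - buffer
    if limit <= 0:
        raise ValueError("max_tokens must be larger than buffer")
    words = text.split()
    return [" ".join(words[i:i + limit]) for i in range(0, len(words), limit)]
```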
asymness
ec5be8e8b0
feat: Implement LabelBox staging brick (#26)
* Implement stage_for_label_box function

* Add unit tests for stage_for_label_box function

* Update docs with description and example for stage_for_label_box function

* Bump version and update CHANGELOG.md

* Fix linting issues and implement suggested changes

* Update stage_for_label_box docs with a note for uploading files to cloud providers
2022-10-11 10:15:25 -04:00
asymness
baba641d03
feat: Allow option to specify predictions in LabelStudio staging brick (#23)
* Allow stage_for_label_studio to take a predictions input and implement prediction class

* Update unit tests for LabelStudioPrediction and stage_for_label_studio function

* Update stage_for_label_studio docs with example of loading predictions

* Bump version and update changelog

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-10-06 13:35:55 +00:00
asymness
28a4ae985d
feat: Implement utility functions for reading and writing .jsonl files (#22)
* Implement save_as_jsonl and read_from_jsonl utility functions

* Add unit tests for save_as_jsonl and read_from_jsonl utility functions

* Add example of using save_as_jsonl with prodigy staging brick

* Bump version and update changelog

* remove accidentally added prodigy json file

* added "the" in jsonl description

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2022-10-04 09:51:11 -04:00
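The two utilities named in the commit above, save_as_jsonl and read_from_jsonl, can be approximated with the standard library as follows. The function names come from the commit; the signatures shown here are assumptions.

```python
import json
from typing import Any, Dict, List


def save_as_jsonl(data: List[Dict[str, Any]], filename: str) -> None:
    """Write one JSON object per line."""
    with open(filename, "w", encoding="utf-8") as f:
        for record in data:
            f.write(json.dumps(record) + "\n")


def read_from_jsonl(filename: str) -> List[Dict[str, Any]]:
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(filename, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each line is an independent JSON document, files in this format can be appended to and streamed line by line.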
Matt Robinson
a950559b94
feat: Optionally include LabelStudio annotations in staging brick (#19)
* added types for label studio annotations

* added method to cast as dicts

* added length check for annotations

* tweaks to get upload to work

* added validation for label types

* annotations is a list for each example

* little bit of refactoring

* test for staging with label studio

* tests for error conditions and reviewers

* added test for NER annotations

* updated changelog and bumped version

* added docs with annotation examples

* fix label studio link

* bump version in sphinx docs

* fulle -> full (typo fix)
2022-10-04 13:25:05 +00:00
asymness
d429e9b305
feat: Implement stage_csv_for_prodigy brick (#13)
* Refactor metadata validation and implement stage_csv_for_prodigy brick

* Refactor unit tests for metadata validation and add tests for Prodigy CSV brick

* Add stage_csv_for_prodigy description and example in docs

* Bump version and update changelog

* added _csv_ to function name

* update changelog line to 0.2.1-dev2

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2022-10-03 09:30:30 -04:00
asymness
35d488a466
feat: Implement stage_for_prodigy brick (#11)
* Implement unit tests for stage_for_prodigy brick

* Implement brick for converting data to Prodigy format

* Add stage_for_prodigy description and example to docs

* updated changelog

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2022-09-30 12:41:37 -04:00
qued
64e1c725eb
feat: Add text_field and id_field to stage_for_label_studio signature (#9)
Added text_field and id_field to stage_for_label_studio signature, to allow user to specify the keys in the resulting JSON. Includes tests and update to example in sphinx docs.
2022-09-28 09:30:17 -05:00
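Letting the caller choose the output keys, as the commit above describes for text_field and id_field, can be illustrated with this sketch. The parameter names come from the commit; the function name `stage_for_labeling`, the default key names, the `{"data": ...}` wrapper, and the hash-based identifier are all assumptions for the example.

```python
import hashlib
from typing import Dict, List


def stage_for_labeling(
    texts: List[str], text_field: str = "text", id_field: str = "ref_id"
) -> List[Dict[str, Dict[str, str]]]:
    """Hypothetical staging helper with caller-controlled JSON keys."""
    staged = []
    for text in texts:
        # A short content hash serves as a stable identifier in this sketch.
        ref = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        staged.append({"data": {text_field: text, id_field: ref}})
    return staged
```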