316 Commits

Author SHA1 Message Date
Matt Robinson
749f9c6be8
fix: avoid divide by zero in exceeds_cap_ratio (#160)
2023-01-17 15:22:12 -05:00
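The commit message above does not show the change itself. As a minimal sketch of what such a guard might look like, assuming the ratio is capitalized words over alphabetic words (the function name `exceeds_cap_ratio_sketch`, the `threshold` default, and the tokenization are all illustrative assumptions, not the library's code):

```python
def exceeds_cap_ratio_sketch(text: str, threshold: float = 0.3) -> bool:
    """Return True if the share of capitalized words in text exceeds threshold.

    Hypothetical illustration only; not the library's implementation.
    """
    # Only alphabetic tokens count; numbers and punctuation are ignored.
    words = [word for word in text.split() if word.isalpha()]
    if not words:
        # Guard: with no alphabetic tokens the ratio would be 0 / 0,
        # so return early instead of dividing.
        return False
    capitalized = sum(1 for word in words if word[0].isupper())
    return capitalized / len(words) > threshold
```

Without the early return, an input such as an empty string or a run of digits would raise `ZeroDivisionError` at the division.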
Matt Robinson
9c3c14e94d
fix: resolves UnicodeDecodeError in partition_email for emails with attachments (#158)
* split emails by \n=

* added test for equivalence between html and plain text

* changelog and bump version

* add check for content disposition
2023-01-17 11:33:45 -05:00
qued
8abf1f119d
feat: partition image (#144)
Adds partition_image for partitioning image file types and integrates it into the partition brick. This relies on version 0.2.2 of unstructured-inference.
2023-01-13 22:24:13 -06:00
Matt Robinson
f12240c5e7
feat: add support for .txt files in partition (#150)
* added partition_text for auto

* rename partition_text tests

* bump version and update docs
2023-01-13 16:39:53 -05:00
Mallori Harrell
e0feba83f6
feat: Add Image element and find_embedded_image function (#130)
* add find_embedded_image
2023-01-09 19:49:19 -06:00
Matt Robinson
5376bc510f
feat: generic partition brick with filetype detection (#132)
* add python-magic

* first pass on filetype detection

* tests for filetype detection

* more tests for file detection

* added tests for error conditions

* install libmagic dev in github

* libmagic install instructions

* pattern for checking email files

* support reading .eml in rb mode

* add auto partition function

* auto tests for email

* auto tests for docx

* added tests for html

* add pdf and html tests

* linting, linting, linting

* added docs for auto partitioning

* update readme with generic partition brick

* bumped version

* added test for bad type

* detect .docx files from application/octet-stream

* linting, linting, linting

* identify xlsx from octet stream

* install poppler in ci

* fix mocks; test for unknown type

* install poppler utils

* install in one line

* only poppler-utils

* file extension logic from application/octet-stream

* install local inference for ci

* install detectron2

* removing unused dockerfile
2023-01-09 16:15:14 -05:00
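Several steps in the commit above mention falling back to the file extension when the detected type is application/octet-stream. A rough sketch of that idea follows, assuming the MIME type has already been detected (for example with python-magic) and is passed in as a string so the example runs without libmagic. The names `resolve_filetype`, `MIME_TO_FILETYPE`, and `EXTENSION_TO_FILETYPE` are hypothetical, and the tables are deliberately incomplete.

```python
import os
from typing import Optional

# Hypothetical lookup tables for illustration; not the library's own mappings.
MIME_TO_FILETYPE = {
    "application/pdf": "pdf",
    "text/html": "html",
    "text/plain": "txt",
    "message/rfc822": "eml",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
}

EXTENSION_TO_FILETYPE = {
    ".pdf": "pdf",
    ".html": "html",
    ".txt": "txt",
    ".eml": "eml",
    ".docx": "docx",
    ".xlsx": "xlsx",
}


def resolve_filetype(mime_type: str, filename: Optional[str] = None) -> Optional[str]:
    """Map a detected MIME type to a file type, using the extension as a fallback."""
    if mime_type in MIME_TO_FILETYPE:
        return MIME_TO_FILETYPE[mime_type]
    # Office files are sometimes reported as a generic binary stream,
    # so the extension is consulted only in that ambiguous case.
    if mime_type == "application/octet-stream" and filename:
        extension = os.path.splitext(filename)[1].lower()
        return EXTENSION_TO_FILETYPE.get(extension)
    # Unknown type: the caller decides how to report it.
    return None
```

Returning `None` for unknown types keeps the error handling in the caller, which is one reasonable design; raising an exception here would be another.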
Mallori Harrell
d7a00046a9
feat: Add new functionality to parse text and header of emails (#111)
* partition_text function
2023-01-09 17:08:08 +00:00
Matt Robinson
fee95b643c
feat: add partition_docx for Word documents (#131)
* first pass on docx parsing

* linting, linting, linting

* test docx with filename

* added documentation

* more tests; version bump

* typo

* another typo

* another typo!

* it -> its

* save -> saved

* remove None since it's the default argument
2023-01-05 20:13:39 +00:00
qued
a75499d465
feat: local inference (#125)
Splits partition_pdf into two paths: one for local inference when url is None, and another for inference via the API when url is a string.
2023-01-04 16:19:05 -06:00
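The split described in the commit above can be sketched as a simple dispatch on whether `url` is None. Everything here is an assumption for illustration: the name `partition_pdf_sketch`, the injected `local_fn` and `api_fn` callables, and the list-of-strings return type do not come from the library.

```python
from typing import Callable, List, Optional


def partition_pdf_sketch(
    filename: str,
    url: Optional[str],
    local_fn: Callable[[str], List[str]],
    api_fn: Callable[[str, str], List[str]],
) -> List[str]:
    """Route a PDF to local inference or to a remote API, depending on url.

    Hypothetical illustration; local_fn and api_fn stand in for the two paths.
    """
    if url is None:
        # No endpoint given: run the model locally.
        return local_fn(filename)
    # An endpoint was given: send the file to the inference API.
    return api_fn(filename, url)
```

Passing the two paths in as callables keeps the sketch runnable without a model or a network connection, and it mirrors how the API path can be mocked in tests.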
Matt Robinson
445533745c
feat: helper functions to identify and extract phone numbers (#124)
* added pattern for finding phone numbers

* added cleaning brick for extracting phone numbers

* add docs

* changelog and bump version

* switch to US phone numbers

* bump dev version
2023-01-03 13:31:05 -05:00
Mallori Harrell
509ad4951c
feat: Add extract_attachment_info (#112)
* Adds function to extract attachments and their metadata from eml files
2023-01-03 11:41:54 -06:00
Mallori Harrell
6f4d9ad06c
chore: add new pattern for dash bullet (#109)
* add new pattern for dash bullet
2022-12-21 10:23:51 -06:00
Matt Robinson
7a74cdda86
feat: add partition_email cleaning brick (#104)
* fix for processing deeply embedded list elements

* fix types in mime encodings cleaner

* first pass on partition_email

* tests for email

* test for mime encodings

* changelog bump

* added note about \n=

* linting, linting, linting

* added email docs

* add partition_email to the readme

* add one more test
2022-12-19 18:02:44 +00:00
Matt Robinson
3c19c7cd8a
feat: Add partition_html brick (#91)
* update readme

* updated sphinx docs

* bump version; changelog

* clear cache; retrigger ci

* rename test file

* switch default parameters to None

* typo in the changelog

* add in text output
2022-12-12 14:22:10 +00:00
Matt Robinson
0658744c38
test: mock model api calls; full coverage for partition_pdf (#88)
* test: mock model api calls; full coverage for partition_pdf

* bump version
2022-11-30 16:34:24 -05:00
Matt Robinson
08e091c5a9
chore: Reorganize partition bricks under partition directory (#76)
* move partition_pdf to partition folder

* move partition.py

* refactor partitioning bricks into partition directory

* import to nlp for backward compatibility

* update docs

* update version and bump changelog

* fix typo in changelog

* update readme reference
2022-11-21 22:27:23 +00:00