333 Commits

Author SHA1 Message Date
Matt Robinson
8da1ddc6ec
enhancement: add method for getting datetime; cleanup filename attribute (#575)
* added method for extracting datetime

* change filename metadata to the base filename

* fix filename metadata for msg

* changelog and bump version

* fix expected structured output

* newline back in file

* reset outpout file

* update filename output

* update test fixtures

* update fixture
2023-05-12 11:33:01 -04:00
Matt Robinson
38f7b652de
fix: add handling for non-standard rfc-2822 formats (#564)
* fix: add handling for non-standard rfc-2822 formats

* version and changelog

* linting, linting, linting
2023-05-11 14:36:25 +00:00
ryannikolaidis
b52638f8e3
chore: add support for SpooledTemporaryFiles (#569) 2023-05-09 21:39:07 -07:00
Matt Robinson
3d3f3df3ec
enhancement: add "ocr_only" strategy for PDFs (#553)
* add tests for validating strategy

* refactor into determine_pdf_strategy function

* refactor pdf strategies into strategies

* remove commented out code

* remove unreachable code

* add in handling for image types

* a little more refactoring

* import ocr partioning for images

* catch warnings, partition type for valid strategies

* fallback to ocr_only from fast

* fallback logic for hi_res

* test for fallback to ocr only

* fallback logic ofr ocr_only

* more tests for fallback logic

* update doc strings

* version and changelog

* linting, linting, linting

* update docs to include notes about strategy

* fix typos

* change back patched filename
2023-05-08 17:21:24 +00:00
Matt Robinson
392cccdbf7
enhancement: add ocr_only strategy for partition_image (#540)
* spike for ocr-only strategy for images

* fix for file processing

* extra space

* add korean to ci

* added test for ocr_only strategy

* added docs for ocr_only

* changelog and version

* added test for bad strategy

* skip korean test if in docker

* bump version

* version bump

* document valid strategies

* bump version for release

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-05-04 20:23:51 +00:00
Matt Robinson
fae5f8fdde
feat: add partition_odt for open office docs (#548)
* added filetype detection for odt

* add function for partition odt documents

* add odt files to auto

* changelog and version

* docs and readme

* update installation docs

* skip tests if not supported or in docker

* import pytest

* fix docs typos
2023-05-04 19:28:08 +00:00
Matt Robinson
aa01cdfc7a
fix: group together text from the same bounding box in partition_pdf with fast strategy (#542)
* switch to using PDF objects

* linting, linting, linting

* couple more tweaks

* added test for chevron-page

* version and changelog

* linting, linting, linting

* now processing 4 files
2023-05-03 18:33:24 -04:00
Matt Robinson
7e43a25f07
feat: add partition_multiple_via_api function (#539)
* added function for multiple files via api

* make multiple work with files

* updated docs strings

* changelog and version

* docs and contextlib for open files

* tests for partition multiple

* add tests for error conditions

* add output example
2023-05-03 15:06:06 -04:00
Matt Robinson
9fdc310358
fix: update detect_filetype for JSONs with text/plain MIME type (#520)
* check to see if text file is a json

* add json check into filetype detection

* added test for updated file detection logic

* bytes/strings handling

* changlog and version bump
2023-04-26 13:52:47 -04:00
Matt Robinson
4156cb12e0
feat: partition_via_api helper function (#518)
* added function for partitioning via api

* added tests for api function

* changelog and version

* add docs for partition_via_api
2023-04-26 09:05:35 -04:00
JaeyongLee
be8e6da884
fix: correct return types in exceeds_caps_ratio (#489)
* fix: fix text_type.py exceeds_cap_ratio() returns

There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected

* Update text_type.py exceeds_cap_ratio()

..

* Update text_type.py

..

* Update CHANGELOG.md

..

* linting, linting, linting ...

* update tests

* more test fixes

* Update text_type.py

..

* bump version and changelog

* add punctuation check

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-04-24 10:45:09 -04:00
Matt Robinson
894a190001
enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514)
* function to check if pdf is extractable

* add fallback logic for unextractable pdfs

* tests for docs with copy protection

* add test for unprocessable pdf

* update docs

* changelog and version

* update logic for images; reset file before proceeding

* 3 files for api tests

* docs update
2023-04-21 21:35:43 +00:00
qued
5b6640a55a
chore: change table param name (#513)
Updated parameter names that controls whether we try to infer table structure.
2023-04-21 13:48:19 -05:00
qued
dc4147d7df
feat: extract tables (#503)
Exposes table extraction through partition and partition_pdf.
2023-04-21 17:01:29 +00:00
Mallori Harrell
5d1e61cb3f
feat: add msg attachment support (#510)
* add msg function and fix bug in eml attachment function
2023-04-21 11:14:46 -05:00
Matt Robinson
6874df91ef
feat: allow users to pass OCR language into partition (#509)
* pip-compile new reqs

* bump inference version

* add language to pdf and image calls

* tests for passing in language

* version bump and changelog

* update docs

* pass ocr_languages in auto

* updated test fixtures

* typo in doc string
2023-04-21 13:41:26 +00:00
Matt Robinson
bd1e540af9
feat: parameter to turn off SSL verification (#506)
* add kwarg for ssl verification

* update docs

* update version and changelog

* add verify kwarg to test
2023-04-20 11:13:56 -04:00
Matt Robinson
39b261aee6
fix: group broken paragraphs when using the fast strategy for PDFs (#485)
* group broken paragraphs with fast strategy

* changelog and version

* fix broken tests for text.py

* formatting for paragraph pattern re

* fix test

* fix whitespace substitution

* one more test tweak

* blurb to account for short lines

* fix for shorter paragraphs

* update changelog

* remove extra line break from auto

* retrigger ci

* trying skipping azure

* skip azure (test)

* updated github and azure fixtures

* update slack fixture
2023-04-19 13:54:17 -04:00
cragwolfe
bfba2bb1eb
fix: workaround .json file detection with old libmagic installs (#493)
Fixes issue where .json files were recognized as "text/plain" rather than "application/json on
the Unstructured image (and other installs that may have an older libmagic).

Also adds missing json auto partition tests.

Including an xfail test for #492 .
2023-04-17 23:11:21 -07:00
JaeyongLee
8456676fad
fix: fix text_type.py exceeds_cap_ratio() returns (#478)
There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-04-14 11:53:10 -07:00
Matt Robinson
137b4b9a2e
feat: cleaning brick for normalizing bytes string output (#481)
* add cleaning brick for emojis

* changelog and versoin

* docs for bytes_string_to_string

* different test for bytes_string_to_string
2023-04-13 19:39:08 +00:00
Matt Robinson
9c1c6a13f6
fix: updates markdown code to process markdown with embedded html (#480)
* add carriage return to html if missing

* test on markdown with embedded html

* changelog and version

* check for html parser

* linting, linting, linting
2023-04-13 12:47:45 -04:00
Matt Robinson
ec02d9298e
fix: only warn about fallback to fast in partition_pdf if hi_res is used (#479)
* only warn if detectron2 not available and hi_res is used

* changelog and version
2023-04-13 11:46:35 -04:00
Matt Robinson
b628fa8048
feat: allow headers in partition (#473)
* feat: allow headers in `partition`

* warning if header is set and url is not

* update emoji test
2023-04-13 15:04:15 +00:00
Matt Robinson
e2e473dddd
feat: add url kwarg to partititon (#470)
* added url option to auto partition

* add test for partition from url

* version and changelog

* update docs

* add url to element metadata
2023-04-12 18:31:01 +00:00
Matt Robinson
7ec85272b7
feat: add partition_rtf for rich text files (#466)
* refactor epub; add rtf

* added test for rtf files

* filetype detection for rtf files

* add rtf to auto

* update docs for group_broken_paragraphs

* add rtf to docs

* update file list in readme

* update stage_for_transformers docs

* changelog and version bump

* skip rtf if in docker

* skip test if rtf not supported

* docs tweaks
2023-04-10 21:25:03 +00:00
Matt Robinson
c99c099158
feat: enable grouping broken paragraphs in partition_text (#456)
* cleaning brick to group broken paragraphs

* docs for group_broken_paragraphs

* add docs for partition_text with grouper

* partition_text and auto with paragraph_grouper

* version and changelog

* typo in the docs

* linting, linting, linting

* switch to using regular expressions
2023-04-06 18:35:22 +00:00
Matt Robinson
b855fd269f
fix: fix html encoding to support foreign characters (#452)
* fix: fix html encoding to support foreign characters

* version and changelog
2023-04-05 20:18:54 +00:00
cragwolfe
3972c80c51
build(deps): bump requirements (#414) 2023-04-05 02:59:06 +00:00
Matt Robinson
5ae895051a
feat: add sender and receive info to element metadata for emails (#439)
* add header metadata for .eml messages

* sent to and from are lists

* add metadata for outlook emails

* version and changelog
2023-04-04 14:23:41 -04:00
Amanda Cameron
555b95b8f7
Fixing test for unstructured-api (#425)
Ran into an error in tests for unstructured-api (see below for output). Somewhere along the lines we were reading a txt file into bytes and then the PARAGRAPH_PATTERN (a string) was not able to be compared to the bytes file.
2023-04-03 11:12:12 -07:00
Matt Robinson
414883455b
fix: correct order of kwargs in pandoc (#421)
* fix: correct order of kwargs in pandoc

* only skip epub tests in Docker

* changelog

---------

Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
Co-authored-by: cragwolfe <crag@unstructured.io>
2023-03-30 20:54:29 +00:00
cragwolfe
32c79caee3
chore: use only regex for contains_english_word. (#382)
Updates the characters to split when creating candidate english words. Now uses regex to parse out non-alphabetic characters for each word

Note: This was originally an attempt to speedup contains_english_word() but there is no measurable change in performance.
2023-03-30 16:57:43 +00:00
Matt Robinson
09b52b4fc4
fix: text kwargs no longer fail with empty string (#413)
* fix: text kwargs no longer fail with empty string

* linting
2023-03-28 21:03:51 +00:00
Matt Robinson
75cf233702
feat: add partition_msg for MSFT Outlook files (#412)
* added msg-parser dependency

* pass through kwargs in convert_file_to_text

* added partition_msg for processing msft outlook files

* version bump and changelog

* added tests for partition_msg

* added test for msg with plain text

* add partition_msg docs; fix underlines in integration docs

* add .msg to file list

* finish tests for auto msg

* linting, linting, linting
2023-03-28 20:15:22 +00:00
Amanda Cameron
71e035c34c
Adding content_type and file_filename to autopartition (#394)
Co-authored-by: cragwolfe <crag@unstructured.io>
2023-03-24 16:32:45 -07:00
cragwolfe
ce9fc26009
feat: add ability to pass headers in partition_html (#397)
Also adds pytest-mock requirement, those fixtures are nice to have!

Implements issue/feature #396 .
2023-03-23 20:14:57 -07:00
Amanda Cameron
a9da858fa3
chore: add tests for docker (#373) 2023-03-21 13:46:09 -07:00
Matt Robinson
e43cb0e6e0
feat: add partition_epub function (#364)
* add pypandoc dependency

* added epub partitioner and file conversion

* test for partition_epub

* tests for file conversion

* add epub to filetype detection

* added epub to auto partition

* update bricks docs

* updated installing docs

* changelot and version

* add pandoc to dependencies

* add pandoc to debian dependencies

* linting, linting, linting

* typo fix

* typo fix

* file conversion type hints

* more type hints

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-03-14 15:52:21 +00:00
ryannikolaidis
a4726cb197
fix: open xml files in read only mode (#362) 2023-03-13 13:06:45 -07:00
Matt Robinson
7c08450597
feat: add "fast" strategy for PDF parsing; fallback to "fast" if detectron2 is not available (#357)
Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res" and is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf fallsback to using the "fast" strategy. The implementation uses pdfminer because that's already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.
2023-03-11 03:16:05 +00:00
Matt Robinson
30b5a4da65
fix: parsing for files with message/rfc822 MIME type; dir for unsupported files (#358)
Adds the ability to process files with a message/rfc822 MIME type, which previously caused failures for example-docs/fake-email-header.eml.
2023-03-10 15:10:39 -08:00
Matt Robinson
7c619f045b
feat: UNSTRUCTURED_LANGUAGE_CHECK env var to control (#351)
* environment variable to set language checks

* change log and version

* checks for if language checks are false

* update docs

* changelog type

* add assert to tests

* performance note in docstrings

* docstring tweaks
2023-03-09 17:33:48 +00:00
natygyoon
6be07a5260
feat: update auto.partition() function to recognize Unstructured json (#337) 2023-03-08 10:36:01 -08:00
Amanda Cameron
64efcc0e50
Adding optional encoding arg, and text_partition tests (#339) 2023-03-06 15:07:33 -08:00
Matt Robinson
a5da3de43b
fix: ensure all text is maintained in html output (#335)
* fix: ensure all text is maintained in html pages

* add back in replace unicode quotes

* changelog and version bump

* apt-get update in ci

* white space differences in output
2023-03-02 14:03:13 -05:00
Matt Robinson
69661788cf
fix: track narrative text and figure captions in HTML documents (#309)
* fix for missing narrative text in partition_html

* fixes so existing tests pass

* tests for figure caption and narrative text

* bump version; changelog
2023-02-28 15:36:08 +00:00
Tom Aarsen
ded60afda9
feat: Add GitHub data connector; add Markdown partitioner (#284) 2023-02-27 14:36:44 -08:00
Tom Aarsen
5eb1466acc
Resolve various style issues to improve overall code quality (#282)
* Apply import sorting

ruff . --select I --fix

* Remove unnecessary open mode parameter

ruff . --select UP015 --fix

* Use f-string formatting rather than .format

* Remove extraneous parentheses

Also use "" instead of str()

* Resolve missing trailing commas

ruff . --select COM --fix

* Rewrite list() and dict() calls using literals

ruff . --select C4 --fix

* Add () to pytest.fixture, use tuples for parametrize, etc.

ruff . --select PT --fix

* Simplify code: merge conditionals, context managers

ruff . --select SIM --fix

* Import without unnecessary alias

ruff . --select PLR0402 --fix

* Apply formatting via black

* Rewrite ValueError somewhat

Slightly unrelated to the rest of the PR

* Apply formatting to tests via black

* Update expected exception message to match
0d81564

* Satisfy E501 line too long in test

* Update changelog & version

* Add ruff to make tidy and test deps

* Run 'make tidy'

* Update changelog & version

* Update changelog & version

* Add ruff to 'check' target

Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
2023-02-27 11:30:54 -05:00
Matt Robinson
601f250edc
feat: add partition_ppt for older power point docs (#238)
* added partition_ppt function and tests

* add ppt support to auto

* version bump

* update docs

* doc fixes

* update changelog

* `.docx` -> `.pptx`

* its -> their

* remove whitespace
2023-02-17 16:57:08 +00:00