97 Commits

Author SHA1 Message Date
John
a9b9b873b1
feat: partition_tsv for tab separated value files (#758)
* first pass at partition_tsv

* working tests

* create constants for tests and debug `make test` failure

* make check and tidy

* undo changes for testing locally

* update changelog and version

* fix bricks.rst

* refactor if statements

* make tidy

* fix README and change try/except to if/else

* update changelog and version

* fix docstring
2023-06-15 18:50:53 +00:00
Matt Robinson
c82fdb6a89
feat: partition_rst for ReStructured Text documents (#725)
* add example rst file

* filetype detection for rst files

* add partition_rst function

* add partition_rst to auto

* update readme

* update docs

* changelog and version

* pandocs -> pandoc

* fix typo
2023-06-12 19:31:10 +00:00
Matt Robinson
19ab6d960f
enhancement: handling for empty files in detect_filetype and partition (#710)
* add empty filetype

* add empty handling to partition

* changelog and version
2023-06-09 16:07:50 -04:00
Yuming Long
80f0b4a132
Fix: Pass strategy parameter down from partition for partition_image (#708)
* changelog and version

* passing param down

* test should be auto

* doc nit

* lint

* update image output
2023-06-09 13:54:18 -04:00
John
b2b92ea79d
fix: filetype detection if a CSV has a text/plain MIME type (#691)
* fix: Filetype detection if a CSV has a text/plain MIME type #621

* bug: fix csv detection and create _read_file_start_for_type_check func

* fix: Make call to _is_text_file_a_csv from detect_filetype
2023-06-08 16:21:07 -04:00
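The commit above (#691) concerns CSV files that are reported with a text/plain MIME type. One way such a check could work is sketched below. This is only an illustration under stated assumptions: `looks_like_csv` is a hypothetical name, and the consistency heuristic here is not taken from the project's `_is_text_file_a_csv`.

```python
import csv
import io


def looks_like_csv(text: str, max_lines: int = 10) -> bool:
    """Hypothetical heuristic: treat text as CSV when its first non-empty
    lines all parse into the same number (> 1) of comma-separated fields."""
    lines = [line for line in text.splitlines() if line.strip()][:max_lines]
    if len(lines) < 2:
        # A single line is not enough evidence either way.
        return False
    rows = list(csv.reader(io.StringIO("\n".join(lines))))
    field_counts = {len(row) for row in rows}
    # Exactly one distinct field count, and more than one field per row.
    return len(field_counts) == 1 and field_counts.pop() > 1
```

Ordinary prose that happens to contain a comma is rejected because its lines rarely split into the same number of fields.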
Christine Straub
547bb38d86
fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660)
Add functionality to try other common encodings for html, xml files if an error related to the encoding is raised and the user has not specified an encoding.

Change auto.py to have a None default for encoding

Remove the unused parameter encoding from partition_pdf

Add functionality to the read_txt_file utility function to handle file-like object from URL
2023-06-05 11:27:12 -07:00
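The encoding fallback described in the commit above (#660) might look roughly like the following. This is a minimal sketch, not the code from the PR: the function name `decode_with_fallback` and the candidate list `COMMON_ENCODINGS` are assumptions, and the commit does not say which encodings are actually tried.

```python
from typing import Optional, Tuple

# Hypothetical candidate list, tried in order when the user gives no encoding.
COMMON_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(data: bytes, encoding: Optional[str] = None) -> Tuple[str, str]:
    """Return (decoded_text, encoding_used)."""
    if encoding is not None:
        # The user chose an encoding: honor it and let any error propagate.
        return data.decode(encoding), encoding
    for candidate in COMMON_ENCODINGS:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    # Safety net only: latin-1 can decode any byte sequence.
    raise ValueError("no candidate encoding could decode the data")
```

The fallback only runs when `encoding` is `None`, which matches the commit's note that other encodings are tried only if the user has not specified one.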
qued
d3600dd5da
build(deps): update inference version (#662)
Updated to the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay!

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-31 13:50:15 -05:00
Yuming Long
fc59a043b7
Chore: Support epub tests in docker image (#630)
* docker works

* more epub tests

* changelog version

* support epub + odt + rtf

* update dockerfile

* revert..

* install pandoc on ci env

* pandoc docker grab based on arch

* move arch into image

* move back to base image
2023-05-26 15:38:48 -04:00
cragwolfe
c5d9469001
feat: add xls support (#632)
Add support for older .XLS files from the partition function in unstructured.partition.auto.

Note, this should also work on the centos7 unstructured image (with the requirements/*txt updates in this PR).
2023-05-26 01:55:32 -07:00
Matt Robinson
fda51d6ead
fix: add more mime types for csv (#620)
2023-05-19 16:40:26 -05:00
Matt Robinson
21c821d651
feat: add partition_csv function (#619)
* add csv into filetype detection

* first pass on csv

* add tests for csv

* add csv to auto

* version bump

* update readme and docs

* fix doc strings
2023-05-19 15:57:42 -04:00
Matt Robinson
23ff32cc42
feat: add partition_xml for XML files (#596)
* first pass on partition_xml

* add option to keep xml tags

* added tests for xml

* fix filename

* update filenames

* remove outdated readme

* add xml to auto

* version and changelog

* update readme and docs

* pass through include_metadata

* update include_metadata description

* add README back in

* linting, linting, linting

* more linting

* spooled to bytes doesnt need to be a tuple

* Add tests for newly supported filetypes

* Correct metadata filetype

* doc typo

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* keep_xml_tags -> xml_keep_tags

---------

Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-05-18 15:40:12 +00:00
Matt Robinson
b8037118c4
feat: add partition_xlsx for MSFT Excel files (#594)
* first pass on partition_xlsx

* add support for files

* add test for xlsx from filename

* added filetype metadata

* add xlsx to auto

* remove fake excel from unsupported

* version and changelog

* update docs

* update readme

* fix removed file reference

* fix some more tests

* pass in metadata filename

* add include_metadata flag
2023-05-16 19:40:40 +00:00
Matt Robinson
bd6a8a3a40
enhancement: add file_directory to element metadata (#585)
* enhancement: add `file_directory` to element metadata

* update msg test

* exclude file_directory

* update slack output

* added file directory tests on partition_x paths
2023-05-15 18:25:39 -04:00
Yuming Long
5b6f11bb88
Chore(ingest): Add --partition-strategy parameter in CLI (#582)
* change strategy arg default to auto in partition

* passing --partition-strategy down

* add strategy="hi_res" to test (default changed)

* made an error on param name, added note
2023-05-15 19:26:53 +00:00
qued
55272eeceb
enhancement: filetype in metadata (#583)
Adds filetype to metadata. I've created a decorator that adds metadata to a list of elements. This replaces some existing boilerplate, but also adds a nice layered approach to determining the filetype. Since in some cases several partition_ functions handle a file in various formats, the partition function that first touches a file will be the last one to alter its metadata, resulting in the correct filetype metadata.

Tests are added to make sure:

* When partition is used, any content type or auto file type detection will override file-specific partition function metadata
* Both auto and file-specific partitioning gives the desired filetype metadata

This won't work with image files currently. The plumbing is there to use the image format inferred by PIL, but we need to pull in the fix from this PR to unstructured-inference.
2023-05-15 13:23:19 -05:00
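The layered behavior described in the commit above (#583) can be illustrated with a small decorator. This is a sketch under assumptions, not the project's implementation: `Element`, `add_filetype`, and the two `*_sketch` functions are hypothetical stand-ins.

```python
import functools
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Element:
    """Hypothetical stand-in for a document element."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def add_filetype(filetype: str) -> Callable:
    """Decorator factory: stamp `filetype` on every element a function returns."""
    def decorator(func: Callable[..., List[Element]]) -> Callable[..., List[Element]]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> List[Element]:
            elements = func(*args, **kwargs)
            for element in elements:
                element.metadata["filetype"] = filetype
            return elements
        return wrapper
    return decorator


@add_filetype("docx")
def partition_docx_sketch(text: str) -> List[Element]:
    return [Element(line) for line in text.splitlines()]


@add_filetype("doc")
def partition_doc_sketch(text: str) -> List[Element]:
    # Pretend the .doc input was converted, then delegate to the docx function.
    return partition_docx_sketch(text)
```

Because the outer wrapper runs after the inner call returns, the function that first touches the file is the last to write the metadata, so `partition_doc_sketch` yields elements tagged `doc` even though the docx function did the work.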
Matt Robinson
727d366a94
enhancement: auto strategy for PDFs and images (#578)
* added functions for determining auto strategy

* change default strategy to auto

* tests for auto strategy

* update docs

* changelog and version

* bump version

* remove ingest file in wrong location

* update jpg output

* typo fix
2023-05-12 17:45:08 +00:00
Matt Robinson
8da1ddc6ec
enhancement: add method for getting datetime; cleanup filename attribute (#575)
* added method for extracting datetime

* change filename metadata to the base filename

* fix filename metadata for msg

* changelog and bump version

* fix expected structured output

* newline back in file

* reset output file

* update filename output

* update test fixtures

* update fixture
2023-05-12 11:33:01 -04:00
Matt Robinson
fae5f8fdde
feat: add partition_odt for open office docs (#548)
* added filetype detection for odt

* add function for partition odt documents

* add odt files to auto

* changelog and version

* docs and readme

* update installation docs

* skip tests if not supported or in docker

* import pytest

* fix docs typos
2023-05-04 19:28:08 +00:00
Matt Robinson
9fdc310358
fix: update detect_filetype for JSONs with text/plain MIME type (#520)
* check to see if text file is a json

* add json check into filetype detection

* added test for updated file detection logic

* bytes/strings handling

* changelog and version bump
2023-04-26 13:52:47 -04:00
qued
5b6640a55a
chore: change table param name (#513)
Updated the name of the parameter that controls whether we try to infer table structure.
2023-04-21 13:48:19 -05:00
qued
dc4147d7df
feat: extract tables (#503)
Exposes table extraction through partition and partition_pdf.
2023-04-21 17:01:29 +00:00
Matt Robinson
6874df91ef
feat: allow users to pass OCR language into partition (#509)
* pip-compile new reqs

* bump inference version

* add language to pdf and image calls

* tests for passing in language

* version bump and changelog

* update docs

* pass ocr_languages in auto

* updated test fixtures

* typo in doc string
2023-04-21 13:41:26 +00:00
cragwolfe
bfba2bb1eb
fix: workaround .json file detection with old libmagic installs (#493)
Fixes an issue where .json files were recognized as "text/plain" rather than "application/json" on the Unstructured image (and other installs that may have an older libmagic).

Also adds missing json auto partition tests.

Includes an xfail test for #492.
2023-04-17 23:11:21 -07:00
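A workaround of the kind described in the commit above (#493) could refine a "text/plain" result by inspecting the content. The sketch below is an assumption about the general approach, not the code from the PR; `refine_text_plain` is a hypothetical name.

```python
import json


def refine_text_plain(mime_type: str, text: str) -> str:
    """Hypothetical helper: upgrade text/plain to application/json when the
    content parses as a JSON object or array."""
    if mime_type != "text/plain":
        return mime_type
    try:
        parsed = json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return mime_type
    # Bare scalars such as "42" are valid JSON but are left as plain text here.
    if isinstance(parsed, (dict, list)):
        return "application/json"
    return mime_type
```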
Matt Robinson
9c1c6a13f6
fix: updates markdown code to process markdown with embedded html (#480)
* add carriage return to html if missing

* test on markdown with embedded html

* changelog and version

* check for html parser

* linting, linting, linting
2023-04-13 12:47:45 -04:00
Matt Robinson
b628fa8048
feat: allow headers in partition (#473)
* feat: allow headers in `partition`

* warning if header is set and url is not

* update emoji test
2023-04-13 15:04:15 +00:00
Matt Robinson
e2e473dddd
feat: add url kwarg to partition (#470)
* added url option to auto partition

* add test for partition from url

* version and changelog

* update docs

* add url to element metadata
2023-04-12 18:31:01 +00:00
Matt Robinson
7ec85272b7
feat: add partition_rtf for rich text files (#466)
* refactor epub; add rtf

* added test for rtf files

* filetype detection for rtf files

* add rtf to auto

* update docs for group_broken_paragraphs

* add rtf to docs

* update file list in readme

* update stage_for_transformers docs

* changelog and version bump

* skip rtf if in docker

* skip test if rtf not supported

* docs tweaks
2023-04-10 21:25:03 +00:00
cragwolfe
3972c80c51
build(deps): bump requirements (#414)
2023-04-05 02:59:06 +00:00
Matt Robinson
414883455b
fix: correct order of kwargs in pandoc (#421)
* fix: correct order of kwargs in pandoc

* only skip epub tests in Docker

* changelog

---------

Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
Co-authored-by: cragwolfe <crag@unstructured.io>
2023-03-30 20:54:29 +00:00
Matt Robinson
75cf233702
feat: add partition_msg for MSFT Outlook files (#412)
* added msg-parser dependency

* pass through kwargs in convert_file_to_text

* added partition_msg for processing msft outlook files

* version bump and changelog

* added tests for partition_msg

* added test for msg with plain text

* add partition_msg docs; fix underlines in integration docs

* add .msg to file list

* finish tests for auto msg

* linting, linting, linting
2023-03-28 20:15:22 +00:00
Amanda Cameron
71e035c34c
Adding content_type and file_filename to autopartition (#394)
Co-authored-by: cragwolfe <crag@unstructured.io>
2023-03-24 16:32:45 -07:00
Amanda Cameron
a9da858fa3
chore: add tests for docker (#373)
2023-03-21 13:46:09 -07:00
Matt Robinson
e43cb0e6e0
feat: add partition_epub function (#364)
* add pypandoc dependency

* added epub partitioner and file conversion

* test for partition_epub

* tests for file conversion

* add epub to filetype detection

* added epub to auto partition

* update bricks docs

* updated installing docs

* changelog and version

* add pandoc to dependencies

* add pandoc to debian dependencies

* linting, linting, linting

* typo fix

* typo fix

* file conversion type hints

* more type hints

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-03-14 15:52:21 +00:00
Matt Robinson
7c08450597
feat: add "fast" strategy for PDF parsing; fallback to "fast" if detectron2 is not available (#357)
Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res" and is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to using the "fast" strategy. The implementation uses pdfminer because that's already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.
2023-03-11 03:16:05 +00:00
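The fallback rule described in the commit above (#357) can be sketched as a small strategy resolver. This is an illustration under assumptions, not the logic from the PR: `resolve_pdf_strategy` and `detectron2_is_available` are hypothetical names, and only the two strategies named in the commit message are handled.

```python
import importlib.util
from typing import Optional


def detectron2_is_available() -> bool:
    # Checks whether the package can be found, without importing it.
    return importlib.util.find_spec("detectron2") is not None


def resolve_pdf_strategy(strategy: str = "hi_res",
                         hi_res_available: Optional[bool] = None) -> str:
    """Return the strategy to use, falling back to "fast" when "hi_res"
    is requested but its dependency is missing."""
    if strategy not in ("hi_res", "fast"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    if hi_res_available is None:
        hi_res_available = detectron2_is_available()
    if strategy == "hi_res" and not hi_res_available:
        return "fast"
    return strategy
```

Passing `hi_res_available` explicitly keeps the rule testable on machines with or without detectron2 installed.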
Tom Aarsen
5eb1466acc
Resolve various style issues to improve overall code quality (#282)
* Apply import sorting

ruff . --select I --fix

* Remove unnecessary open mode parameter

ruff . --select UP015 --fix

* Use f-string formatting rather than .format

* Remove extraneous parentheses

Also use "" instead of str()

* Resolve missing trailing commas

ruff . --select COM --fix

* Rewrite list() and dict() calls using literals

ruff . --select C4 --fix

* Add () to pytest.fixture, use tuples for parametrize, etc.

ruff . --select PT --fix

* Simplify code: merge conditionals, context managers

ruff . --select SIM --fix

* Import without unnecessary alias

ruff . --select PLR0402 --fix

* Apply formatting via black

* Rewrite ValueError somewhat

Slightly unrelated to the rest of the PR

* Apply formatting to tests via black

* Update expected exception message to match
0d81564

* Satisfy E501 line too long in test

* Update changelog & version

* Add ruff to make tidy and test deps

* Run 'make tidy'

* Update changelog & version

* Update changelog & version

* Add ruff to 'check' target

Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
2023-02-27 11:30:54 -05:00
Matt Robinson
601f250edc
feat: add partition_ppt for older power point docs (#238)
* added partition_ppt function and tests

* add ppt support to auto

* version bump

* update docs

* doc fixes

* update changelog

* `.docx` -> `.pptx`

* its -> their

* remove whitespace
2023-02-17 16:57:08 +00:00
Matt Robinson
6036af33e7
feat: add partition_doc for .doc files (#236)
* first pass on doc partitioning

* add libreoffice to deps

* update docs and readme

* add .doc to auto

* changelog bump

* value error with missing doc

* doc updates
2023-02-17 09:30:23 -05:00
Matt Robinson
74e6b84b41
feat: add metadata tracking to document elements (#225)
* add metadata field to elements

* metadata tracking for pdf/image

* metadata for html

* update expected outputs

* metadata for the rest of the document types

* take out file metadata for now

* add url to tables

* added metadata to test_auto

* bump version

* added coordinates to __init__

* fix coordinates in tests
2023-02-15 18:26:20 +00:00
Matt Robinson
e73cf09977
feat: optional page breaks for .pptx, .pdf, .html and images (#205)
* page breaks for pptx

* added page breaks for image/pdf

* tests for images with page breaks

* page breaks for html documents

* linting, linting, linting

* changelog and bump version

* update docs

* fix typo

* refactor reusable code to common.py

* add type back in
2023-02-08 15:11:15 +00:00
Matt Robinson
e6cfde5c4a
fix: no UserWarning when partition_pdf is called (#179)
2023-01-27 12:08:18 -05:00
Matt Robinson
339c133326
fix: cleanup from live .docx tests (#177)
* add env var for cap threshold; raise default threshold

* update docs and tests

* added check for ending in a comma

* update docs

* no caps check for all upper text

* capture Text in html and text

* check category in Text equality check

* lower case all caps before checking for verbs

* added check for us city/state/zip

* added address type

* add address to html

* add address to text

* fix for text tests; escape for large text segments

* refactor regex for readability

* update comment

* additional test for text with linebreaks

* update docs

* update changelog

* update elements docs

* remove old comment

* case -> cast

* type fix
2023-01-26 15:52:25 +00:00
Matt Robinson
8b6c5fac9d
feat: basic PowerPoint parsing in partition_pptx (#166)
* partition pptx and tests

* add partition_pptx to auto

* update doc types in readme

* add pptx docs

* bump version

* remove extra whitespace

* partition -> partitioning
2023-01-23 17:03:09 +00:00
Matt Robinson
c1822911a5
chore: return Element objects in partition_pdf and partition_image (#164)
* helper function to convert to element

* test for element types

* fix for healthcheck url

* version bump

* note on coordinates

* mention FigureCaption

* test_shared -> test_common

* add check boxes for checkbox template

* update changelog
2023-01-19 14:29:28 +00:00
qued
8abf1f119d
feat: partition image (#144)
Adds partition_image to partition image file types, which is integrated into the partition brick. This relies on the 0.2.2 version of unstructured-inference.
2023-01-13 22:24:13 -06:00
Matt Robinson
f12240c5e7
feat: add support for .txt files in partition (#150)
* added partition_text for auto

* rename partition_text tests

* bump version and update docs
2023-01-13 16:39:53 -05:00
Matt Robinson
5376bc510f
feat: generic partition brick with filetype detection (#132)
* add python-magic

* first pass on filetype detection

* tests for filetype detection

* more tests for file detection

* added tests for error conditions

* install libmagic dev in github

* libmagic install instructions

* pattern for checking email files

* support reading .eml in rb mode

* add auto partition function

* auto tests for email

* auto tests for docx

* added tests for html

* add pdf and html tests

* linting, linting, linting

* added docs for auto partitioning

* update readme with generic partition brick

* bumped version

* added test for bad type

* detect .docx files from application/octet-stream

* linting, linting, linting

* identify xlsx from octet stream

* install poppler in ci

* fix mocks; test for unknown type

* install poppler utils

* install in one line

* only poppler-utils

* file extension logic from application/octet-stream

* install local inference for ci

* install detectron2

* removing unused dockerfile
2023-01-09 16:15:14 -05:00