unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-12 19:45:56 +00:00

Author	SHA1	Message	Date
John	a9b9b873b1	feat: partition_tsv for tab separated value files (#758 ) * first pass at partition_tsv * working tests * create constants for tests and debug `make test` failure * make check and tidy * undo changes for testing locally * update changelog and version * fix bricks.rst * refactor if statements * make tidy * fix README and change try/except to if/else * update changelog and version * fix\ docstring	2023-06-15 18:50:53 +00:00
Matt Robinson	c82fdb6a89	feat: `partition_rst` for ReStructured Text documents (#725 ) * add example rst file * filetype detection for rst files * add partition_rst function * add partition_rst to auto * update readme * update docs * changelog and version * pandocs -> pandoc * fix typo	2023-06-12 19:31:10 +00:00
Matt Robinson	19ab6d960f	enhancement: handling for empty files in `detect_filetype` and `partition` (#710 ) * add empty filetype * add empty handling to partition * changelog and version	2023-06-09 16:07:50 -04:00
Yuming Long	80f0b4a132	Fix: Pass `strategy` parameter down from `partition` for `partition_image` (#708 ) * changelog and version * passing param down * test should be auto * doc nit * lint * update image output	2023-06-09 13:54:18 -04:00
John	b2b92ea79d	fix: filetype detection if a CSV has a text/plain MIME type (#691 ) * fix: Filetype detection if a CSV has a text/plain MIME type #621 * bug: fix csv detection and create _read_file_start_for_type_check func * fix: Make call to _is_text_file_a_csv from detect_filetype	2023-06-08 16:21:07 -04:00
Christine Straub	547bb38d86	fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 ) Add functionality to try other common encodings for html, xml files if an error related to the encoding is raised and the user has not specified an encoding. Change auto.py to have a None default for encoding Remove the unused parameter encoding from partition_pdf Add functionality to the read_txt_file utility function to handle file-like object from URL	2023-06-05 11:27:12 -07:00
qued	d3600dd5da	build(deps): update inference version (#662 ) Updated to the the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay! --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-31 13:50:15 -05:00
Yuming Long	fc59a043b7	Chore: Support epub tests in docker image (#630 ) * docker works * more epub tests * changelog version * support epub + odt + rtf * update dockerfile * revert.. * install pandoc on ci env * pandoc docker grab bashed on arch * move arch into image * move back to base image	2023-05-26 15:38:48 -04:00
cragwolfe	c5d9469001	feat: add xls support (#632 ) Add support for older .XLS files from the partition function in unstructured.partition.auto. Note, this should also work on the centos7 unstructured image (with the requirements/*txt updates in this PR).	2023-05-26 01:55:32 -07:00
Matt Robinson	fda51d6ead	fix: add more mime types for csv (#620 )	2023-05-19 16:40:26 -05:00
Matt Robinson	21c821d651	feat: add `partition_csv` function (#619 ) * add csv into filetype detection * first pass on csv * add tests for csv * add csv to auto * version bump * update readme and docs * fix doc strings	2023-05-19 15:57:42 -04:00
Matt Robinson	23ff32cc42	feat: add `partition_xml` for XML files (#596 ) * first pass on partition_xml * add option to keep xml tags * added tests for xml * fix filename * update filenames * remove outdated readme * add xml to auto * version and changelog * update readme and docs * pass through include_metadata * update include_metadata description * add README back in * linting, linting, linting * more linting * spooled to bytes doesnt need to be a tuple * Add tests for newly supported filetypes * Correct metadata filetype * doc typo Co-authored-by: qued <64741807+qued@users.noreply.github.com> * typo fix Co-authored-by: qued <64741807+qued@users.noreply.github.com> * typo fix Co-authored-by: qued <64741807+qued@users.noreply.github.com> * keep_xml_tags -> xml_keep_tags --------- Co-authored-by: Alan Bertl <alan@unstructured.io> Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-05-18 15:40:12 +00:00
Matt Robinson	b8037118c4	feat: add `partition_xlsx` for MSFT Excel files (#594 ) * first pass on partition_xlsx * add support for files * add test for xlsx from filename * added filetype metadata * add xlsx to auto * remove fake excel from unsupported * version and changelog * update docs * update readme * fix removed file reference * fix some more tests * pass in metadata filename * add include_metadata flag	2023-05-16 19:40:40 +00:00
Matt Robinson	bd6a8a3a40	enhancement: add `file_directory` to element metadata (#585 ) * enhancement: add `file_directory` to element metadata * update msg test * exclude file_directory * update slack output * added file directory tests on partition_x paths	2023-05-15 18:25:39 -04:00
Yuming Long	5b6f11bb88	Chore(ingest): Add `--partition-strategy` parameter in CLI (#582 ) * change strategy arg defalut to auto in partition * passing --partition-strategy down * add strategy="hi_res" to test (default changed) * made an error on param name, added note	2023-05-15 19:26:53 +00:00
qued	55272eeceb	enhancement: filetype in metadata (#583 ) Adds filetype to metadata. I've created a decorator that adds metadata to a list of elements. This replaces some existing boilerplate, but also adds a nice layered approach to determining the filetype. Since in some cases several partition_ functions handle a file in various formats, the partition function that first touches a file will be the last one to alter its metadata, resulting in the correct filetype metadata. Tests are added to make sure: * When partition is used, any content type or auto file type detection will override file-specific partition function metadata * Both auto and file-specific partitioning gives the desired filetype metadata Won't work with image files currently... the plumbing is there to use the image format inferred by PIL, but we need to pull in the fix from this PR to unstructured-inference .	2023-05-15 13:23:19 -05:00
Matt Robinson	727d366a94	enhancement: auto strategy for PDFs and images (#578 ) * added functions for determining auto stratgy * change default strategy to auto * tests for auto strategy * update docs * changelog and version * bump version * remove ingest file in wrong location * update jpg output * typo fix	2023-05-12 17:45:08 +00:00
Matt Robinson	8da1ddc6ec	enhancement: add method for getting datetime; cleanup filename attribute (#575 ) * added method for extracting datetime * change filename metadata to the base filename * fix filename metadata for msg * changelog and bump version * fix expected structured output * newline back in file * reset outpout file * update filename output * update test fixtures * update fixture	2023-05-12 11:33:01 -04:00
Matt Robinson	fae5f8fdde	feat: add `partition_odt` for open office docs (#548 ) * added filetype detection for odt * add function for partition odt documents * add odt files to auto * changelog and version * docs and readme * update installation docs * skip tests if not supported or in docker * import pytest * fix docs typos	2023-05-04 19:28:08 +00:00
Matt Robinson	9fdc310358	fix: update `detect_filetype` for JSONs with text/plain MIME type (#520 ) * check to see if text file is a json * add json check into filetype detection * added test for updated file detection logic * bytes/strings handling * changlog and version bump	2023-04-26 13:52:47 -04:00
qued	5b6640a55a	chore: change table param name (#513 ) Updated parameter names that controls whether we try to infer table structure.	2023-04-21 13:48:19 -05:00
qued	dc4147d7df	feat: extract tables (#503 ) Exposes table extraction through partition and partition_pdf.	2023-04-21 17:01:29 +00:00
Matt Robinson	6874df91ef	feat: allow users to pass OCR language into `partition` (#509 ) * pip-compile new reqs * bump inference version * add language to pdf and image calls * tests for passing in language * version bump and changelog * update docs * pass ocr_languages in auto * updated test fixtures * typo in doc string	2023-04-21 13:41:26 +00:00
cragwolfe	bfba2bb1eb	fix: workaround .json file detection with old libmagic installs (#493 ) Fixes issue where .json files were recognized as "text/plain" rather than "application/json on the Unstructured image (and other installs that may have an older libmagic). Also adds missing json auto partition tests. Including an xfail test for #492 .	2023-04-17 23:11:21 -07:00
Matt Robinson	9c1c6a13f6	fix: updates markdown code to process markdown with embedded html (#480 ) * add carriage return to html if missing * test on markdown with embedded html * changelog and version * check for html parser * linting, linting, linting	2023-04-13 12:47:45 -04:00
Matt Robinson	b628fa8048	feat: allow headers in `partition` (#473 ) * feat: allow headers in `partition` * warning if header is set and url is not * update emoji test	2023-04-13 15:04:15 +00:00
Matt Robinson	e2e473dddd	feat: add `url` kwarg to `partititon` (#470 ) * added url option to auto partition * add test for partition from url * version and changelog * update docs * add url to element metadata	2023-04-12 18:31:01 +00:00
Matt Robinson	7ec85272b7	feat: add `partition_rtf` for rich text files (#466 ) * refactor epub; add rtf * added test for rtf files * filetype detection for rtf files * add rtf to auto * update docs for group_broken_paragraphs * add rtf to docs * update file list in readme * update stage_for_transformers docs * changelog and version bump * skip rtf if in docker * skip test if rtf not supported * docs tweaks	2023-04-10 21:25:03 +00:00
cragwolfe	3972c80c51	build(deps): bump requirements (#414 )	2023-04-05 02:59:06 +00:00
Matt Robinson	414883455b	fix: correct order of kwargs in pandoc (#421 ) * fix: correct order of kwargs in pandoc * only skip epub tests in Docker * changelog --------- Co-authored-by: Crag Wolfe <crag@unstructuredai.io> Co-authored-by: cragwolfe <crag@unstructured.io>	2023-03-30 20:54:29 +00:00
Matt Robinson	75cf233702	feat: add `partition_msg` for MSFT Outlook files (#412 ) * added msg-parser dependency * pass through kwargs in convert_file_to_text * added partition_msg for processing msft outlook files * version bump and changelog * added tests for partition_msg * added test for msg with plain text * add partition_msg docs; fix underlines in integration docs * add .msg to file list * finish tests for auto msg * linting, linting, linting	2023-03-28 20:15:22 +00:00
Amanda Cameron	71e035c34c	Adding content_type and file_filename to autopartition (#394 ) Co-authored-by: cragwolfe <crag@unstructured.io>	2023-03-24 16:32:45 -07:00
Amanda Cameron	a9da858fa3	chore: add tests for docker (#373 )	2023-03-21 13:46:09 -07:00
Matt Robinson	e43cb0e6e0	feat: add `partition_epub` function (#364 ) * add pypandoc dependency * added epub partitioner and file conversion * test for partition_epub * tests for file conversion * add epub to filetype detection * added epub to auto partition * update bricks docs * updated installing docs * changelot and version * add pandoc to dependencies * add pandoc to debian dependencies * linting, linting, linting * typo fix * typo fix * file conversion type hints * more type hints --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-03-14 15:52:21 +00:00
Matt Robinson	7c08450597	feat: add `"fast"` strategy for PDF parsing; fallback to `"fast"` if `detectron2` is not available (#357 ) Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res" and is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf fallsback to using the "fast" strategy. The implementation uses pdfminer because that's already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.	2023-03-11 03:16:05 +00:00
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282 ) * Apply import sorting ruff . --select I --fix * Remove unnecessary open mode parameter ruff . --select UP015 --fix * Use f-string formatting rather than .format * Remove extraneous parentheses Also use "" instead of str() * Resolve missing trailing commas ruff . --select COM --fix * Rewrite list() and dict() calls using literals ruff . --select C4 --fix * Add () to pytest.fixture, use tuples for parametrize, etc. ruff . --select PT --fix * Simplify code: merge conditionals, context managers ruff . --select SIM --fix * Import without unnecessary alias ruff . --select PLR0402 --fix * Apply formatting via black * Rewrite ValueError somewhat Slightly unrelated to the rest of the PR * Apply formatting to tests via black * Update expected exception message to match 0d81564 * Satisfy E501 line too long in test * Update changelog & version * Add ruff to make tidy and test deps * Run 'make tidy' * Update changelog & version * Update changelog & version * Add ruff to 'check' target Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
Matt Robinson	601f250edc	feat: add `partition_ppt` for older power point docs (#238 ) * added partition_ppt function and tests * add ppt support to auto * version bump * update docs * doc fixes * update changelog * `.docx` -> `.pptx` * its -> their * remove whitespace	2023-02-17 16:57:08 +00:00
Matt Robinson	6036af33e7	feat: add `partition_doc` for `.doc` files (#236 ) * first pass on doc partitioning * add libreoffice to deps * update docs and readme * add .doc to auto * changelog bump * value error with missing doc * doc updates	2023-02-17 09:30:23 -05:00
Matt Robinson	74e6b84b41	feat: add metadata tracking to document elements (#225 ) * add metadata field to elements * metadata tracking for pdf/image * metadata for html * update expected outputs * metadata for the rest of the document types * take out file metadata for now * add url to tables * added metadata to test_auto * bump version * added coordinates to __init__ * fix coordinates in tests	2023-02-15 18:26:20 +00:00
Matt Robinson	e73cf09977	feat: optional page breaks for `.pptx`, `.pdf`, `.html` and images (#205 ) * page breaks for pptx * added page breaks for image/pdf * tests for images with page breaks * page breaks for html documents * linting, linting, linting * changelog and bump version * update docs * fix typo * refactor reusable code to common.py * add type back in	2023-02-08 15:11:15 +00:00
Matt Robinson	e6cfde5c4a	fix: no `UserWarning` when `partition_pdf` is called (#179 )	2023-01-27 12:08:18 -05:00
Matt Robinson	339c133326	fix: cleanup from live `.docx` tests (#177 ) * add env var for cap threshold; raise default threshold * update docs and tests * added check for ending in a comma * update docs * no caps check for all upper text * capture Text in html and text * check category in Text equality check * lower case all caps before checking for verbs * added check for us city/state/zip * added address type * add address to html * add address to text * fix for text tests; escape for large text segments * refactor regex for readability * update comment * additional test for text with linebreaks * update docs * update changelog * update elements docs * remove old comment * case -> cast * type fix	2023-01-26 15:52:25 +00:00
Matt Robinson	8b6c5fac9d	feat: basic PowerPoint parsing in `partition_pptx` (#166 ) * parition pptx and tests * add parition_pptx to auto * update doc types in readme * add pptx docs * bump version * remove extra whitespace * partition -> partitioning	2023-01-23 17:03:09 +00:00
Matt Robinson	c1822911a5	chore: return `Element` objects in `partition_pdf` and `partition_image` (#164 ) * helper function to convert to element * test for element types * fix for healthcheck url * version bump * note on coordinates * mention FigureCaption * test_shared -> test_common * add check boxes for checkbox template * update changelog	2023-01-19 14:29:28 +00:00
qued	8abf1f119d	feat: partition image (#144 ) Adds partition_image to partition image file types, which is integrated into the partition brick. This relies on the 0.2.2 version of unstructured-inference.	2023-01-13 22:24:13 -06:00
Matt Robinson	f12240c5e7	feat: add support for `.txt` files in `partition` (#150 ) * added partition_text for auto * rename partition_text tests * bump version and update docs	2023-01-13 16:39:53 -05:00
Matt Robinson	5376bc510f	feat: generic `partition` brick with filetype detection (#132 ) * add python-magic * first pass on filetype detection * tests for filetype detection * more tests for file detection * added tests for error conditions * install libmagic dev in github * libmagic install instructions * pattern for checking email files * support reading .eml in rb mode * add auto partition function * auto tests for emal * auto tests for docx * added tests for html * add pdf and html tests * linting, linting, linting * added docs for auto partitioning * update readme with generic partition brick * bumped version * added test for bad type * detect .docx files from application/octet-stream * linting, linting, linting * identify xlsx from octet stream * install poppler in ci * fix mocks; test for unknown type * install poppler utils * install in one line * only poppler-utils * file extension logic from application/octet-stream * install local inference for ci * install detectron2 * removing unused dockerfile	2023-01-09 16:15:14 -05:00

1 2

97 Commits