unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-05 21:33:42 +00:00

Author	SHA1	Message	Date
Matt Robinson	8da1ddc6ec	enhancement: add method for getting datetime; cleanup filename attribute (#575 ) * added method for extracting datetime * change filename metadata to the base filename * fix filename metadata for msg * changelog and bump version * fix expected structured output * newline back in file * reset output file * update filename output * update test fixtures * update fixture	2023-05-12 11:33:01 -04:00
Matt Robinson	38f7b652de	fix: add handling for non-standard rfc-2822 formats (#564 ) * fix: add handling for non-standard rfc-2822 formats * version and changelog * linting, linting, linting	2023-05-11 14:36:25 +00:00
ryannikolaidis	b52638f8e3	chore: add support for SpooledTemporaryFiles (#569 )	2023-05-09 21:39:07 -07:00
Matt Robinson	3d3f3df3ec	enhancement: add "ocr_only" strategy for PDFs (#553 ) * add tests for validating strategy * refactor into determine_pdf_strategy function * refactor pdf strategies into strategies * remove commented out code * remove unreachable code * add in handling for image types * a little more refactoring * import ocr partitioning for images * catch warnings, partition type for valid strategies * fallback to ocr_only from fast * fallback logic for hi_res * test for fallback to ocr only * fallback logic for ocr_only * more tests for fallback logic * update doc strings * version and changelog * linting, linting, linting * update docs to include notes about strategy * fix typos * change back patched filename	2023-05-08 17:21:24 +00:00
Matt Robinson	392cccdbf7	enhancement: add ocr_only strategy for `partition_image` (#540 ) * spike for ocr-only strategy for images * fix for file processing * extra space * add korean to ci * added test for ocr_only strategy * added docs for ocr_only * changelog and version * added test for bad strategy * skip korean test if in docker * bump version * version bump * document valid strategies * bump version for release --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-05-04 20:23:51 +00:00
Matt Robinson	fae5f8fdde	feat: add `partition_odt` for open office docs (#548 ) * added filetype detection for odt * add function for partition odt documents * add odt files to auto * changelog and version * docs and readme * update installation docs * skip tests if not supported or in docker * import pytest * fix docs typos	2023-05-04 19:28:08 +00:00
Matt Robinson	aa01cdfc7a	fix: group together text from the same bounding box in `partition_pdf` with fast strategy (#542 ) * switch to using PDF objects * linting, linting, linting * couple more tweaks * added test for chevron-page * version and changelog * linting, linting, linting * now processing 4 files	2023-05-03 18:33:24 -04:00
Matt Robinson	7e43a25f07	feat: add `partition_multiple_via_api` function (#539 ) * added function for multiple files via api * make multiple work with files * updated docs strings * changelog and version * docs and contextlib for open files * tests for partition multiple * add tests for error conditions * add output example	2023-05-03 15:06:06 -04:00
Matt Robinson	9fdc310358	fix: update `detect_filetype` for JSONs with text/plain MIME type (#520 ) * check to see if text file is a json * add json check into filetype detection * added test for updated file detection logic * bytes/strings handling * changelog and version bump	2023-04-26 13:52:47 -04:00
Matt Robinson	4156cb12e0	feat: `partition_via_api` helper function (#518 ) * added function for partitioning via api * added tests for api function * changelog and version * add docs for partition_via_api	2023-04-26 09:05:35 -04:00
JaeyongLee	be8e6da884	fix: correct return types in `exceeds_caps_ratio` (#489 ) * fix: fix text_type.py exceeds_cap_ratio() returns There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected * Update text_type.py exceeds_cap_ratio() .. * Update text_type.py .. * Update CHANGELOG.md .. * linting, linting, linting ... * update tests * more test fixes * Update text_type.py .. * bump version and changelog * add punctuation check --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2023-04-24 10:45:09 -04:00
Matt Robinson	894a190001	enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514 ) * function to check if pdf is extractable * add fallback logic for unextractable pdfs * tests for docs with copy protection * add test for unprocessable pdf * update docs * changelog and version * update logic for images; reset file before proceeding * 3 files for api tests * docs update	2023-04-21 21:35:43 +00:00
qued	5b6640a55a	chore: change table param name (#513 ) Updated parameter names that controls whether we try to infer table structure.	2023-04-21 13:48:19 -05:00
qued	dc4147d7df	feat: extract tables (#503 ) Exposes table extraction through partition and partition_pdf.	2023-04-21 17:01:29 +00:00
Mallori Harrell	5d1e61cb3f	feat: add msg attachment support (#510 ) * add msg function and fix bug in eml attachment function	2023-04-21 11:14:46 -05:00
Matt Robinson	6874df91ef	feat: allow users to pass OCR language into `partition` (#509 ) * pip-compile new reqs * bump inference version * add language to pdf and image calls * tests for passing in language * version bump and changelog * update docs * pass ocr_languages in auto * updated test fixtures * typo in doc string	2023-04-21 13:41:26 +00:00
Matt Robinson	bd1e540af9	feat: parameter to turn off SSL verification (#506 ) * add kwarg for ssl verification * update docs * update version and changelog * add verify kwarg to test	2023-04-20 11:13:56 -04:00
Matt Robinson	39b261aee6	fix: group broken paragraphs when using the fast strategy for PDFs (#485 ) * group broken paragraphs with fast strategy * changelog and version * fix broken tests for text.py * formatting for paragraph pattern re * fix test * fix whitespace substitution * one more test tweak * blurb to account for short lines * fix for shorter paragraphs * update changelog * remove extra line break from auto * retrigger ci * trying skipping azure * skip azure (test) * updated github and azure fixtures * update slack fixture	2023-04-19 13:54:17 -04:00
cragwolfe	bfba2bb1eb	fix: workaround .json file detection with old libmagic installs (#493 ) Fixes issue where .json files were recognized as "text/plain" rather than "application/json" on the Unstructured image (and other installs that may have an older libmagic). Also adds missing json auto partition tests. Including an xfail test for #492 .	2023-04-17 23:11:21 -07:00
JaeyongLee	8456676fad	fix: fix text_type.py exceeds_cap_ratio() returns (#478 ) There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected. --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-04-14 11:53:10 -07:00
Matt Robinson	137b4b9a2e	feat: cleaning brick for normalizing bytes string output (#481 ) * add cleaning brick for emojis * changelog and version * docs for bytes_string_to_string * different test for bytes_string_to_string	2023-04-13 19:39:08 +00:00
Matt Robinson	9c1c6a13f6	fix: updates markdown code to process markdown with embedded html (#480 ) * add carriage return to html if missing * test on markdown with embedded html * changelog and version * check for html parser * linting, linting, linting	2023-04-13 12:47:45 -04:00
Matt Robinson	ec02d9298e	fix: only warn about fallback to fast in `partition_pdf` if hi_res is used (#479 ) * only warn if detectron2 not available and hi_res is used * changelog and version	2023-04-13 11:46:35 -04:00
Matt Robinson	b628fa8048	feat: allow headers in `partition` (#473 ) * feat: allow headers in `partition` * warning if header is set and url is not * update emoji test	2023-04-13 15:04:15 +00:00
Matt Robinson	e2e473dddd	feat: add `url` kwarg to `partition` (#470 ) * added url option to auto partition * add test for partition from url * version and changelog * update docs * add url to element metadata	2023-04-12 18:31:01 +00:00
Matt Robinson	7ec85272b7	feat: add `partition_rtf` for rich text files (#466 ) * refactor epub; add rtf * added test for rtf files * filetype detection for rtf files * add rtf to auto * update docs for group_broken_paragraphs * add rtf to docs * update file list in readme * update stage_for_transformers docs * changelog and version bump * skip rtf if in docker * skip test if rtf not supported * docs tweaks	2023-04-10 21:25:03 +00:00
Matt Robinson	c99c099158	feat: enable grouping broken paragraphs in `partition_text` (#456 ) * cleaning brick to group broken paragraphs * docs for group_broken_paragraphs * add docs for partition_text with grouper * partition_text and auto with paragraph_grouper * version and changelog * typo in the docs * linting, linting, linting * switch to using regular expressions	2023-04-06 18:35:22 +00:00
Matt Robinson	b855fd269f	fix: fix html encoding to support foreign characters (#452 ) * fix: fix html encoding to support foreign characters * version and changelog	2023-04-05 20:18:54 +00:00
cragwolfe	3972c80c51	build(deps): bump requirements (#414 )	2023-04-05 02:59:06 +00:00
Matt Robinson	5ae895051a	feat: add sender and receive info to element metadata for emails (#439 ) * add header metadata for .eml messages * sent to and from are lists * add metadata for outlook emails * version and changelog	2023-04-04 14:23:41 -04:00
Amanda Cameron	555b95b8f7	Fixing test for unstructured-api (#425 ) Ran into an error in tests for unstructured-api (see below for output). Somewhere along the lines we were reading a txt file into bytes and then the PARAGRAPH_PATTERN (a string) was not able to be compared to the bytes file.	2023-04-03 11:12:12 -07:00
Matt Robinson	414883455b	fix: correct order of kwargs in pandoc (#421 ) * fix: correct order of kwargs in pandoc * only skip epub tests in Docker * changelog --------- Co-authored-by: Crag Wolfe <crag@unstructuredai.io> Co-authored-by: cragwolfe <crag@unstructured.io>	2023-03-30 20:54:29 +00:00
cragwolfe	32c79caee3	chore: use only regex for contains_english_word. (#382 ) Updates the characters to split when creating candidate English words. Now uses regex to parse out non-alphabetic characters for each word. Note: This was originally an attempt to speed up contains_english_word() but there is no measurable change in performance.	2023-03-30 16:57:43 +00:00
Matt Robinson	09b52b4fc4	fix: text kwargs no longer fail with empty string (#413 ) * fix: text kwargs no longer fail with empty string * linting	2023-03-28 21:03:51 +00:00
Matt Robinson	75cf233702	feat: add `partition_msg` for MSFT Outlook files (#412 ) * added msg-parser dependency * pass through kwargs in convert_file_to_text * added partition_msg for processing msft outlook files * version bump and changelog * added tests for partition_msg * added test for msg with plain text * add partition_msg docs; fix underlines in integration docs * add .msg to file list * finish tests for auto msg * linting, linting, linting	2023-03-28 20:15:22 +00:00
Amanda Cameron	71e035c34c	Adding content_type and file_filename to autopartition (#394 ) Co-authored-by: cragwolfe <crag@unstructured.io>	2023-03-24 16:32:45 -07:00
cragwolfe	ce9fc26009	feat: add ability to pass headers in partition_html (#397 ) Also adds pytest-mock requirement, those fixtures are nice to have! Implements issue/feature #396 .	2023-03-23 20:14:57 -07:00
Amanda Cameron	a9da858fa3	chore: add tests for docker (#373 )	2023-03-21 13:46:09 -07:00
Matt Robinson	e43cb0e6e0	feat: add `partition_epub` function (#364 ) * add pypandoc dependency * added epub partitioner and file conversion * test for partition_epub * tests for file conversion * add epub to filetype detection * added epub to auto partition * update bricks docs * updated installing docs * changelog and version * add pandoc to dependencies * add pandoc to debian dependencies * linting, linting, linting * typo fix * typo fix * file conversion type hints * more type hints --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-03-14 15:52:21 +00:00
ryannikolaidis	a4726cb197	fix: open xml files in read only mode (#362 )	2023-03-13 13:06:45 -07:00
Matt Robinson	7c08450597	feat: add `"fast"` strategy for PDF parsing; fallback to `"fast"` if `detectron2` is not available (#357 ) Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res" and is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to using the "fast" strategy. The implementation uses pdfminer because that's already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.	2023-03-11 03:16:05 +00:00
Matt Robinson	30b5a4da65	fix: parsing for files with `message/rfc822` MIME type; dir for unsupported files (#358 ) Adds the ability to process files with a message/rfc822 MIME type, which previously caused failures for example-docs/fake-email-header.eml.	2023-03-10 15:10:39 -08:00
Matt Robinson	7c619f045b	feat: `UNSTRUCTURED_LANGUAGE_CHECK` env var to control (#351 ) * environment variable to set language checks * change log and version * checks for if language checks are false * update docs * changelog type * add assert to tests * performance note in docstrings * docstring tweaks	2023-03-09 17:33:48 +00:00
natygyoon	6be07a5260	feat: update auto.partition() function to recognize Unstructured json (#337 )	2023-03-08 10:36:01 -08:00
Amanda Cameron	64efcc0e50	Adding optional encoding arg, and text_partition tests (#339 )	2023-03-06 15:07:33 -08:00
Matt Robinson	a5da3de43b	fix: ensure all text is maintained in html output (#335 ) * fix: ensure all text is maintained in html pages * add back in replace unicode quotes * changelog and version bump * apt-get update in ci * white space differences in output	2023-03-02 14:03:13 -05:00
Matt Robinson	69661788cf	fix: track narrative text and figure captions in HTML documents (#309 ) * fix for missing narrative text in partition_html * fixes so existing tests pass * tests for figure caption and narrative text * bump version; changelog	2023-02-28 15:36:08 +00:00
Tom Aarsen	ded60afda9	feat: Add GitHub data connector; add Markdown partitioner (#284 )	2023-02-27 14:36:44 -08:00
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282 ) * Apply import sorting ruff . --select I --fix * Remove unnecessary open mode parameter ruff . --select UP015 --fix * Use f-string formatting rather than .format * Remove extraneous parentheses Also use "" instead of str() * Resolve missing trailing commas ruff . --select COM --fix * Rewrite list() and dict() calls using literals ruff . --select C4 --fix * Add () to pytest.fixture, use tuples for parametrize, etc. ruff . --select PT --fix * Simplify code: merge conditionals, context managers ruff . --select SIM --fix * Import without unnecessary alias ruff . --select PLR0402 --fix * Apply formatting via black * Rewrite ValueError somewhat Slightly unrelated to the rest of the PR * Apply formatting to tests via black * Update expected exception message to match 0d81564 * Satisfy E501 line too long in test * Update changelog & version * Add ruff to make tidy and test deps * Run 'make tidy' * Update changelog & version * Update changelog & version * Add ruff to 'check' target Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
Matt Robinson	601f250edc	feat: add `partition_ppt` for older power point docs (#238 ) * added partition_ppt function and tests * add ppt support to auto * version bump * update docs * doc fixes * update changelog * `.docx` -> `.pptx` * its -> their * remove whitespace	2023-02-17 16:57:08 +00:00

233 Commits