unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-12 16:37:31 +00:00

Author	SHA1	Message	Date
Matt Robinson	bd1e540af9	feat: parameter to turn off SSL verification (#506 ) * add kwarg for ssl verification * update docs * update version and changelog * add verify kwarg to test	2023-04-20 11:13:56 -04:00
Matt Robinson	39b261aee6	fix: group broken paragraphs when using the fast strategy for PDFs (#485 ) * group broken paragraphs with fast strategy * changelog and version * fix broken tests for text.py * formatting for paragraph pattern re * fix test * fix whitespace substitution * one more test tweak * blurb to account for short lines * fix for shorter paragraphs * update changelog * remove extra line break from auto * retrigger ci * trying skipping azure * skip azure (test) * updated github and azure fixtures * update slack fixture	2023-04-19 13:54:17 -04:00
cragwolfe	bfba2bb1eb	fix: workaround .json file detection with old libmagic installs (#493 ) Fixes issue where .json files were recognized as "text/plain" rather than "application/json" on the Unstructured image (and other installs that may have an older libmagic). Also adds missing json auto partition tests. Including an xfail test for #492 .	2023-04-17 23:11:21 -07:00
JaeyongLee	8456676fad	fix: fix text_type.py exceeds_cap_ratio() returns (#478 ) There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected. --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-04-14 11:53:10 -07:00
Matt Robinson	137b4b9a2e	feat: cleaning brick for normalizing bytes string output (#481 ) * add cleaning brick for emojis * changelog and version * docs for bytes_string_to_string * different test for bytes_string_to_string	2023-04-13 19:39:08 +00:00
Matt Robinson	9c1c6a13f6	fix: updates markdown code to process markdown with embedded html (#480 ) * add carriage return to html if missing * test on markdown with embedded html * changelog and version * check for html parser * linting, linting, linting	2023-04-13 12:47:45 -04:00
Matt Robinson	ec02d9298e	fix: only warn about fallback to fast in `partition_pdf` if hi_res is used (#479 ) * only warn if detectron2 not available and hi_res is used * changelog and version	2023-04-13 11:46:35 -04:00
Matt Robinson	b628fa8048	feat: allow headers in `partition` (#473 ) * feat: allow headers in `partition` * warning if header is set and url is not * update emoji test	2023-04-13 15:04:15 +00:00
Matt Robinson	e2e473dddd	feat: add `url` kwarg to `partition` (#470 ) * added url option to auto partition * add test for partition from url * version and changelog * update docs * add url to element metadata	2023-04-12 18:31:01 +00:00
Matt Robinson	7ec85272b7	feat: add `partition_rtf` for rich text files (#466 ) * refactor epub; add rtf * added test for rtf files * filetype detection for rtf files * add rtf to auto * update docs for group_broken_paragraphs * add rtf to docs * update file list in readme * update stage_for_transformers docs * changelog and version bump * skip rtf if in docker * skip test if rtf not supported * docs tweaks	2023-04-10 21:25:03 +00:00
Matt Robinson	c99c099158	feat: enable grouping broken paragraphs in `partition_text` (#456 ) * cleaning brick to group broken paragraphs * docs for group_broken_paragraphs * add docs for partition_text with grouper * partition_text and auto with paragraph_grouper * version and changelog * typo in the docs * linting, linting, linting * switch to using regular expressions	2023-04-06 18:35:22 +00:00
Matt Robinson	b855fd269f	fix: fix html encoding to support foreign characters (#452 ) * fix: fix html encoding to support foreign characters * version and changelog	2023-04-05 20:18:54 +00:00
cragwolfe	3972c80c51	build(deps): bump requirements (#414 )	2023-04-05 02:59:06 +00:00
Matt Robinson	5ae895051a	feat: add sender and receive info to element metadata for emails (#439 ) * add header metadata for .eml messages * sent to and from are lists * add metadata for outlook emails * version and changelog	2023-04-04 14:23:41 -04:00
Amanda Cameron	555b95b8f7	Fixing test for unstructured-api (#425 ) Ran into an error in tests for unstructured-api (see below for output). Somewhere along the lines we were reading a txt file into bytes and then the PARAGRAPH_PATTERN (a string) was not able to be compared to the bytes file.	2023-04-03 11:12:12 -07:00
Matt Robinson	414883455b	fix: correct order of kwargs in pandoc (#421 ) * fix: correct order of kwargs in pandoc * only skip epub tests in Docker * changelog --------- Co-authored-by: Crag Wolfe <crag@unstructuredai.io> Co-authored-by: cragwolfe <crag@unstructured.io>	2023-03-30 20:54:29 +00:00
cragwolfe	32c79caee3	chore: use only regex for contains_english_word. (#382 ) Updates the characters to split when creating candidate english words. Now uses regex to parse out non-alphabetic characters for each word Note: This was originally an attempt to speedup contains_english_word() but there is no measurable change in performance.	2023-03-30 16:57:43 +00:00
Matt Robinson	09b52b4fc4	fix: text kwargs no longer fail with empty string (#413 ) * fix: text kwargs no longer fail with empty string * linting	2023-03-28 21:03:51 +00:00
Matt Robinson	75cf233702	feat: add `partition_msg` for MSFT Outlook files (#412 ) * added msg-parser dependency * pass through kwargs in convert_file_to_text * added partition_msg for processing msft outlook files * version bump and changelog * added tests for partition_msg * added test for msg with plain text * add partition_msg docs; fix underlines in integration docs * add .msg to file list * finish tests for auto msg * linting, linting, linting	2023-03-28 20:15:22 +00:00
Amanda Cameron	71e035c34c	Adding content_type and file_filename to autopartition (#394 ) Co-authored-by: cragwolfe <crag@unstructured.io>	2023-03-24 16:32:45 -07:00
cragwolfe	ce9fc26009	feat: add ability to pass headers in partition_html (#397 ) Also adds pytest-mock requirement, those fixtures are nice to have! Implements issue/feature #396 .	2023-03-23 20:14:57 -07:00
Amanda Cameron	a9da858fa3	chore: add tests for docker (#373 )	2023-03-21 13:46:09 -07:00
Matt Robinson	e43cb0e6e0	feat: add `partition_epub` function (#364 ) * add pypandoc dependency * added epub partitioner and file conversion * test for partition_epub * tests for file conversion * add epub to filetype detection * added epub to auto partition * update bricks docs * updated installing docs * changelog and version * add pandoc to dependencies * add pandoc to debian dependencies * linting, linting, linting * typo fix * typo fix * file conversion type hints * more type hints --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-03-14 15:52:21 +00:00
ryannikolaidis	a4726cb197	fix: open xml files in read only mode (#362 )	2023-03-13 13:06:45 -07:00
Matt Robinson	7c08450597	feat: add `"fast"` strategy for PDF parsing; fallback to `"fast"` if `detectron2` is not available (#357 ) Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res" and is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to using the "fast" strategy. The implementation uses pdfminer because that's already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.	2023-03-11 03:16:05 +00:00
Matt Robinson	30b5a4da65	fix: parsing for files with `message/rfc822` MIME type; dir for unsupported files (#358 ) Adds the ability to process files with a message/rfc822 MIME type, which previously caused failures for example-docs/fake-email-header.eml.	2023-03-10 15:10:39 -08:00
Matt Robinson	7c619f045b	feat: `UNSTRUCTURED_LANGUAGE_CHECK` env var to control (#351 ) * environment variable to set language checks * change log and version * checks for if language checks are false * update docs * changelog type * add assert to tests * performance note in docstrings * docstring tweaks	2023-03-09 17:33:48 +00:00
natygyoon	6be07a5260	feat: update auto.partition() function to recognize Unstructured json (#337 )	2023-03-08 10:36:01 -08:00
Amanda Cameron	64efcc0e50	Adding optional encoding arg, and text_partition tests (#339 )	2023-03-06 15:07:33 -08:00
Matt Robinson	a5da3de43b	fix: ensure all text is maintained in html output (#335 ) * fix: ensure all text is maintained in html pages * add back in replace unicode quotes * changelog and version bump * apt-get update in ci * white space differences in output	2023-03-02 14:03:13 -05:00
Matt Robinson	69661788cf	fix: track narrative text and figure captions in HTML documents (#309 ) * fix for missing narrative text in partition_html * fixes so existing tests pass * tests for figure caption and narrative text * bump version; changelog	2023-02-28 15:36:08 +00:00
Tom Aarsen	ded60afda9	feat: Add GitHub data connector; add Markdown partitioner (#284 )	2023-02-27 14:36:44 -08:00
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282 ) * Apply import sorting ruff . --select I --fix * Remove unnecessary open mode parameter ruff . --select UP015 --fix * Use f-string formatting rather than .format * Remove extraneous parentheses Also use "" instead of str() * Resolve missing trailing commas ruff . --select COM --fix * Rewrite list() and dict() calls using literals ruff . --select C4 --fix * Add () to pytest.fixture, use tuples for parametrize, etc. ruff . --select PT --fix * Simplify code: merge conditionals, context managers ruff . --select SIM --fix * Import without unnecessary alias ruff . --select PLR0402 --fix * Apply formatting via black * Rewrite ValueError somewhat Slightly unrelated to the rest of the PR * Apply formatting to tests via black * Update expected exception message to match 0d81564 * Satisfy E501 line too long in test * Update changelog & version * Add ruff to make tidy and test deps * Run 'make tidy' * Update changelog & version * Update changelog & version * Add ruff to 'check' target Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
Matt Robinson	601f250edc	feat: add `partition_ppt` for older power point docs (#238 ) * added partition_ppt function and tests * add ppt support to auto * version bump * update docs * doc fixes * update changelog * `.docx` -> `.pptx` * its -> their * remove whitespace	2023-02-17 16:57:08 +00:00
Matt Robinson	6036af33e7	feat: add `partition_doc` for `.doc` files (#236 ) * first pass on doc partitioning * add libreoffice to deps * update docs and readme * add .doc to auto * changelog bump * value error with missing doc * doc updates	2023-02-17 09:30:23 -05:00
Matt Robinson	74e6b84b41	feat: add metadata tracking to document elements (#225 ) * add metadata field to elements * metadata tracking for pdf/image * metadata for html * update expected outputs * metadata for the rest of the document types * take out file metadata for now * add url to tables * added metadata to test_auto * bump version * added coordinates to __init__ * fix coordinates in tests	2023-02-15 18:26:20 +00:00
Matt Robinson	558ee63e90	feat: ability to skip English language specific checks with env var (#224 ) * add language env var * update docs * version and bump change log	2023-02-15 09:15:47 -05:00
Matt Robinson	e73cf09977	feat: optional page breaks for `.pptx`, `.pdf`, `.html` and images (#205 ) * page breaks for pptx * added page breaks for image/pdf * tests for images with page breaks * page breaks for html documents * linting, linting, linting * changelog and bump version * update docs * fix typo * refactor reusable code to common.py * add type back in	2023-02-08 15:11:15 +00:00
Matt Robinson	ee9f15483f	feat: `partition_html` directly from a url (#202 ) * added tests for html from url * bump version * added types-requests * and -> an	2023-02-07 14:09:34 +00:00
Matt Robinson	014585e872	fix: preserve the order of shapes in `partition_pptx` output (#193 ) * order the shapes top to bottom and left to right * added tests for ordering * update change log and bump version * more tests * don't need enumerate * n -> on	2023-02-03 22:12:33 +00:00
Matt Robinson	a7ca58e0bc	fix: more english words; split on punctuation (#191 ) * add a bigger list of english words * update thresholds and add tests * update docs; bump version * fix version * add additional english words back in * linting, linting, linting * add slashes * work -> word	2023-02-02 17:25:47 +00:00
Matt Robinson	0589344ff7	fix: require a minimum prop of alpha characters for titles and narrative text (#190 ) * added alpha ratio check * added tests for alpha ratio * bump changelog and update docs * update changelog/version; update docs * ofr -> or	2023-02-02 14:59:04 +00:00
Matt Robinson	1230a163fd	feat: set a user controlled max word length for titles (#189 ) * update the docs * add option for title max word length * bump version; update changelog * change max length to 12 * docs updates * to -> too	2023-02-01 19:32:16 +00:00
Matt Robinson	2d08fcbf83	fix: titles and narrative text need at least one english word (#188 ) * added check for english words * update docs * at least one word needs to have multiple characters * bump change log	2023-02-01 09:10:48 -05:00
sparkbrains	243bf7ed5e	test: Increase coverage (#181 )	2023-01-30 22:47:09 -08:00
Matt Robinson	e6cfde5c4a	fix: no `UserWarning` when `partition_pdf` is called (#179 )	2023-01-27 12:08:18 -05:00
Matt Robinson	339c133326	fix: cleanup from live `.docx` tests (#177 ) * add env var for cap threshold; raise default threshold * update docs and tests * added check for ending in a comma * update docs * no caps check for all upper text * capture Text in html and text * check category in Text equality check * lower case all caps before checking for verbs * added check for us city/state/zip * added address type * add address to html * add address to text * fix for text tests; escape for large text segments * refactor regex for readability * update comment * additional test for text with linebreaks * update docs * update changelog * update elements docs * remove old comment * case -> cast * type fix	2023-01-26 15:52:25 +00:00
Matt Robinson	8b6c5fac9d	feat: basic PowerPoint parsing in `partition_pptx` (#166 ) * partition pptx and tests * add partition_pptx to auto * update doc types in readme * add pptx docs * bump version * remove extra whitespace * partition -> partitioning	2023-01-23 17:03:09 +00:00
Matt Robinson	8d3e616846	feat: add ability to parse `LayoutElement` lists (#165 ) * added ability to split list items * changelog and version bump * retrigger ci	2023-01-20 08:55:11 -05:00
Matt Robinson	c1822911a5	chore: return `Element` objects in `partition_pdf` and `partition_image` (#164 ) * helper function to convert to element * test for element types * fix for healthcheck url * version bump * note on coordinates * mention FigureCaption * test_shared -> test_common * add check boxes for checkbox template * update changelog	2023-01-19 14:29:28 +00:00


417 Commits