unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-09-19 05:23:43 +00:00

Author	SHA1	Message	Date
cshaddox	d23e0d6420	feat: table extraction for power points (#664 ) * Handling tables * updating changelog * Adding accidentally removed code * remove newline * reuse table extraction function; add test --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-31 18:26:32 +00:00
Matt Robinson	52e5a5ca8d	fix: raise `ValueError` in `partition_via_api` if filename not present (#663 ) * raise value error if filename not specified for api * version and changelog	2023-05-31 18:09:58 +00:00
John	c78c5b6adf	fix: `page_number` appears in `partition_html` metadata if `include_metadata=False` (#658 ) * fix: page_number appears in partition_html metadata if include_metadata=False * Update common.py * Update CHANGELOG --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-30 20:47:55 +00:00
Matt Robinson	f7cde5539a	fix: `page_number` should not always be 1 in the metadata (#657 ) * fix page number issue * add tests * changelog and version * update changelog	2023-05-30 15:10:14 -04:00
Christine Straub	5b5fb3e13b	Issue/encoding error eml (#639 ) This PR adds functionality to try other common encodings for email (.eml) files if an error related to the encoding is raised and the user has not specified an encoding.	2023-05-30 10:24:02 -07:00
Yuming Long	fc59a043b7	Chore: Support epub tests in docker image (#630 ) * docker works * more epub tests * changelog version * support epub + odt + rtf * update dockerfile * revert.. * install pandoc on ci env * pandoc docker grab based on arch * move arch into image * move back to base image	2023-05-26 15:38:48 -04:00
cragwolfe	c5d9469001	feat: add xls support (#632 ) Add support for older .XLS files from the partition function in unstructured.partition.auto. Note, this should also work on the centos7 unstructured image (with the requirements/*txt updates in this PR).	2023-05-26 01:55:32 -07:00
Christine Straub	a1fed6d4c6	Issue/unicode error (#608 ) This PR adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.	2023-05-23 13:35:38 -07:00
qued	55e5d8ea2f	enhancement: include coords in fast (#626 ) Makes the bounding box coordinates available when using fast strategy. * Refactored partition_text to make the workflow of categorizing an element purely from the text available without running the entirety of partition_text. * Transformed the coordinates from pdf space into pixel space to be consistent with hi_res. We will probably want to revisit the coordinate system soon.	2023-05-20 16:26:55 -05:00
Matt Robinson	fda51d6ead	fix: add more mime types for csv (#620 )	2023-05-19 16:40:26 -05:00
Matt Robinson	21c821d651	feat: add `partition_csv` function (#619 ) * add csv into filetype detection * first pass on csv * add tests for csv * add csv to auto * version bump * update readme and docs * fix doc strings	2023-05-19 15:57:42 -04:00
Matt Robinson	23ff32cc42	feat: add `partition_xml` for XML files (#596 ) * first pass on partition_xml * add option to keep xml tags * added tests for xml * fix filename * update filenames * remove outdated readme * add xml to auto * version and changelog * update readme and docs * pass through include_metadata * update include_metadata description * add README back in * linting, linting, linting * more linting * spooled to bytes doesnt need to be a tuple * Add tests for newly supported filetypes * Correct metadata filetype * doc typo Co-authored-by: qued <64741807+qued@users.noreply.github.com> * typo fix Co-authored-by: qued <64741807+qued@users.noreply.github.com> * typo fix Co-authored-by: qued <64741807+qued@users.noreply.github.com> * keep_xml_tags -> xml_keep_tags --------- Co-authored-by: Alan Bertl <alan@unstructured.io> Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-05-18 15:40:12 +00:00
Matt Robinson	b6bfbf9108	fix: track filename in metadata for docx tables (#597 ) * fix: track filename in metadata for docx tables * bump version * remove accidental commit	2023-05-18 10:20:38 -04:00
Meir	301cef27a4	feat: add page_name to metadata for Excel documents (#609 ) * Add page_name to metadata for Excel documents * Update changelog and version number * fix lint	2023-05-18 13:53:23 +00:00
Matt Robinson	b8037118c4	feat: add `partition_xlsx` for MSFT Excel files (#594 ) * first pass on partition_xlsx * add support for files * add test for xlsx from filename * added filetype metadata * add xlsx to auto * remove fake excel from unsupported * version and changelog * update docs * update readme * fix removed file reference * fix some more tests * pass in metadata filename * add include_metadata flag	2023-05-16 19:40:40 +00:00
Matt Robinson	bd6a8a3a40	enhancement: add `file_directory` to element metadata (#585 ) * enhancement: add `file_directory` to element metadata * update msg test * exclude file_directory * update slack output * added file directory tests on partition_x paths	2023-05-15 18:25:39 -04:00
Yuming Long	5b6f11bb88	Chore(ingest): Add `--partition-strategy` parameter in CLI (#582 ) * change strategy arg default to auto in partition * passing --partition-strategy down * add strategy="hi_res" to test (default changed) * made an error on param name, added note	2023-05-15 19:26:53 +00:00
qued	55272eeceb	enhancement: filetype in metadata (#583 ) Adds filetype to metadata. I've created a decorator that adds metadata to a list of elements. This replaces some existing boilerplate, but also adds a nice layered approach to determining the filetype. Since in some cases several partition_ functions handle a file in various formats, the partition function that first touches a file will be the last one to alter its metadata, resulting in the correct filetype metadata. Tests are added to make sure: * When partition is used, any content type or auto file type detection will override file-specific partition function metadata * Both auto and file-specific partitioning gives the desired filetype metadata Won't work with image files currently... the plumbing is there to use the image format inferred by PIL, but we need to pull in the fix from this PR to unstructured-inference .	2023-05-15 13:23:19 -05:00
Matt Robinson	727d366a94	enhancement: auto strategy for PDFs and images (#578 ) * added functions for determining auto strategy * change default strategy to auto * tests for auto strategy * update docs * changelog and version * bump version * remove ingest file in wrong location * update jpg output * typo fix	2023-05-12 17:45:08 +00:00
Matt Robinson	8da1ddc6ec	enhancement: add method for getting datetime; cleanup filename attribute (#575 ) * added method for extracting datetime * change filename metadata to the base filename * fix filename metadata for msg * changelog and bump version * fix expected structured output * newline back in file * reset output file * update filename output * update test fixtures * update fixture	2023-05-12 11:33:01 -04:00
Matt Robinson	38f7b652de	fix: add handling for non-standard rfc-2822 formats (#564 ) * fix: add handling for non-standard rfc-2822 formats * version and changelog * linting, linting, linting	2023-05-11 14:36:25 +00:00
ryannikolaidis	b52638f8e3	chore: add support for SpooledTemporaryFiles (#569 )	2023-05-09 21:39:07 -07:00
Matt Robinson	3d3f3df3ec	enhancement: add "ocr_only" strategy for PDFs (#553 ) * add tests for validating strategy * refactor into determine_pdf_strategy function * refactor pdf strategies into strategies * remove commented out code * remove unreachable code * add in handling for image types * a little more refactoring * import ocr partitioning for images * catch warnings, partition type for valid strategies * fallback to ocr_only from fast * fallback logic for hi_res * test for fallback to ocr only * fallback logic for ocr_only * more tests for fallback logic * update doc strings * version and changelog * linting, linting, linting * update docs to include notes about strategy * fix typos * change back patched filename	2023-05-08 17:21:24 +00:00
Matt Robinson	392cccdbf7	enhancement: add ocr_only strategy for `partition_image` (#540 ) * spike for ocr-only strategy for images * fix for file processing * extra space * add korean to ci * added test for ocr_only strategy * added docs for ocr_only * changelog and version * added test for bad strategy * skip korean test if in docker * bump version * version bump * document valid strategies * bump version for release --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-05-04 20:23:51 +00:00
Matt Robinson	fae5f8fdde	feat: add `partition_odt` for open office docs (#548 ) * added filetype detection for odt * add function for partition odt documents * add odt files to auto * changelog and version * docs and readme * update installation docs * skip tests if not supported or in docker * import pytest * fix docs typos	2023-05-04 19:28:08 +00:00
Matt Robinson	aa01cdfc7a	fix: group together text from the same bounding box in `partition_pdf` with fast strategy (#542 ) * switch to using PDF objects * linting, linting, linting * couple more tweaks * added test for chevron-page * version and changelog * linting, linting, linting * now processing 4 files	2023-05-03 18:33:24 -04:00
Matt Robinson	7e43a25f07	feat: add `partition_multiple_via_api` function (#539 ) * added function for multiple files via api * make multiple work with files * updated docs strings * changelog and version * docs and contextlib for open files * tests for partition multiple * add tests for error conditions * add output example	2023-05-03 15:06:06 -04:00
Matt Robinson	9fdc310358	fix: update `detect_filetype` for JSONs with text/plain MIME type (#520 ) * check to see if text file is a json * add json check into filetype detection * added test for updated file detection logic * bytes/strings handling * changelog and version bump	2023-04-26 13:52:47 -04:00
Matt Robinson	4156cb12e0	feat: `partition_via_api` helper function (#518 ) * added function for partitioning via api * added tests for api function * changelog and version * add docs for partition_via_api	2023-04-26 09:05:35 -04:00
JaeyongLee	be8e6da884	fix: correct return types in `exceeds_caps_ratio` (#489 ) * fix: fix text_type.py exceeds_cap_ratio() returns There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected * Update text_type.py exceeds_cap_ratio() .. * Update text_type.py .. * Update CHANGELOG.md .. * linting, linting, linting ... * update tests * more test fixes * Update text_type.py .. * bump version and changelog * add punctuation check --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2023-04-24 10:45:09 -04:00
Matt Robinson	894a190001	enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514 ) * function to check if pdf is extractable * add fallback logic for unextractable pdfs * tests for docs with copy protection * add test for unprocessable pdf * update docs * changelog and version * update logic for images; reset file before proceeding * 3 files for api tests * docs update	2023-04-21 21:35:43 +00:00
qued	5b6640a55a	chore: change table param name (#513 ) Updated parameter names that controls whether we try to infer table structure.	2023-04-21 13:48:19 -05:00
qued	dc4147d7df	feat: extract tables (#503 ) Exposes table extraction through partition and partition_pdf.	2023-04-21 17:01:29 +00:00
Mallori Harrell	5d1e61cb3f	feat: add msg attachment support (#510 ) * add msg function and fix bug in eml attachment function	2023-04-21 11:14:46 -05:00
Matt Robinson	6874df91ef	feat: allow users to pass OCR language into `partition` (#509 ) * pip-compile new reqs * bump inference version * add language to pdf and image calls * tests for passing in language * version bump and changelog * update docs * pass ocr_languages in auto * updated test fixtures * typo in doc string	2023-04-21 13:41:26 +00:00
Matt Robinson	bd1e540af9	feat: parameter to turn off SSL verification (#506 ) * add kwarg for ssl verification * update docs * update version and changelog * add verify kwarg to test	2023-04-20 11:13:56 -04:00
Matt Robinson	39b261aee6	fix: group broken paragraphs when using the fast strategy for PDFs (#485 ) * group broken paragraphs with fast strategy * changelog and version * fix broken tests for text.py * formatting for paragraph pattern re * fix test * fix whitespace substitution * one more test tweak * blurb to account for short lines * fix for shorter paragraphs * update changelog * remove extra line break from auto * retrigger ci * trying skipping azure * skip azure (test) * updated github and azure fixtures * update slack fixture	2023-04-19 13:54:17 -04:00
cragwolfe	bfba2bb1eb	fix: workaround .json file detection with old libmagic installs (#493 ) Fixes issue where .json files were recognized as "text/plain" rather than "application/json" on the Unstructured image (and other installs that may have an older libmagic). Also adds missing json auto partition tests. Including an xfail test for #492 .	2023-04-17 23:11:21 -07:00
JaeyongLee	8456676fad	fix: fix text_type.py exceeds_cap_ratio() returns (#478 ) There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected. --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-04-14 11:53:10 -07:00
Matt Robinson	137b4b9a2e	feat: cleaning brick for normalizing bytes string output (#481 ) * add cleaning brick for emojis * changelog and version * docs for bytes_string_to_string * different test for bytes_string_to_string	2023-04-13 19:39:08 +00:00
Matt Robinson	9c1c6a13f6	fix: updates markdown code to process markdown with embedded html (#480 ) * add carriage return to html if missing * test on markdown with embedded html * changelog and version * check for html parser * linting, linting, linting	2023-04-13 12:47:45 -04:00
Matt Robinson	ec02d9298e	fix: only warn about fallback to fast in `partition_pdf` if hi_res is used (#479 ) * only warn if detectron2 not available and hi_res is used * changelog and version	2023-04-13 11:46:35 -04:00
Matt Robinson	b628fa8048	feat: allow headers in `partition` (#473 ) * feat: allow headers in `partition` * warning if header is set and url is not * update emoji test	2023-04-13 15:04:15 +00:00
Matt Robinson	e2e473dddd	feat: add `url` kwarg to `partition` (#470 ) * added url option to auto partition * add test for partition from url * version and changelog * update docs * add url to element metadata	2023-04-12 18:31:01 +00:00
Matt Robinson	7ec85272b7	feat: add `partition_rtf` for rich text files (#466 ) * refactor epub; add rtf * added test for rtf files * filetype detection for rtf files * add rtf to auto * update docs for group_broken_paragraphs * add rtf to docs * update file list in readme * update stage_for_transformers docs * changelog and version bump * skip rtf if in docker * skip test if rtf not supported * docs tweaks	2023-04-10 21:25:03 +00:00
Matt Robinson	c99c099158	feat: enable grouping broken paragraphs in `partition_text` (#456 ) * cleaning brick to group broken paragraphs * docs for group_broken_paragraphs * add docs for partition_text with grouper * partition_text and auto with paragraph_grouper * version and changelog * typo in the docs * linting, linting, linting * switch to using regular expressions	2023-04-06 18:35:22 +00:00
Matt Robinson	b855fd269f	fix: fix html encoding to support foreign characters (#452 ) * fix: fix html encoding to support foreign characters * version and changelog	2023-04-05 20:18:54 +00:00
cragwolfe	3972c80c51	build(deps): bump requirements (#414 )	2023-04-05 02:59:06 +00:00
Matt Robinson	5ae895051a	feat: add sender and receive info to element metadata for emails (#439 ) * add header metadata for .eml messages * sent to and from are lists * add metadata for outlook emails * version and changelog	2023-04-04 14:23:41 -04:00
Amanda Cameron	555b95b8f7	Fixing test for unstructured-api (#425 ) Ran into an error in tests for unstructured-api (see below for output). Somewhere along the lines we were reading a txt file into bytes and then the PARAGRAPH_PATTERN (a string) was not able to be compared to the bytes file.	2023-04-03 11:12:12 -07:00

402 Commits