unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-10 07:27:34 +00:00

Author	SHA1	Message	Date
Steve Canny	31bef433ad	rfctr: prepare to add orig_elements serde (#2668 ) Summary The serialization and deserialization (serde) of `metadata.orig_elements` will be located in `unstructured.staging.base` alongside `elements_to_json()` and other existing serde functions. Improve the typing, readability, and structure of that module before adding the new serde functions for `metadata.orig_elements`. Reviewers: The commits are well-groomed and are probably quicker to review commit-by-commit than as all files-changed at once.	2024-03-20 21:27:59 +00:00
Matt Robinson	882370022e	fix: don't treat double quote enclosed text as JSON (#2544 ) ### Summary Closes #2444. Treats JSON serializable content that results in a string as plain text. Even though this is valid JSON per [RFC 4627](https://www.ietf.org/rfc/rfc4627.txt), in almost every case we really want to treat this as a text file. ### Testing 1. Put `"This is not a JSON"` in a text file `notajson.txt` 2. Run the following ```python from unstructured.file_utils.filetype import _is_text_file_a_json _is_text_file_a_json(filename="notajson.txt") # Should be False ```	2024-02-14 13:41:43 +00:00
Matt Robinson	4613e52e11	fix: treat yaml files as plain text (#2446 ) ### Summary Closes #2412. Adds support for YAML MIME types and treats them as plain text. In response to `500` errors that the API currently returns if the MIME type is `text/yaml`.	2024-01-24 17:48:36 +00:00
Matt Robinson	4d5038d9fd	enhancement: add support for bitmap images (#2414 ) ### Summary Adds support for bitmap images (`.bmp`) in both file detection and partitioning. Bitmap images will be processed with `partition_image` just like JPGs and PNGs. ### Testing ```python from unstructured.file_utils.filetype import detect_filetype from unstructured.partition.auto import partition from PIL import Image filename = "example-docs/layout-parser-paper-with-table.jpg" bmp_filename = "~/tmp/ayout-parser-paper-with-table.bmp" img = Image.open(filename) img.save(bmp_filename) detect_filetype(filename=bmp_filename) # Should be FileType.BMP elements = partition(filename=bmp_filename) ```	2024-01-17 22:50:36 +00:00
Matt Robinson	36faf677c0	enhancement: file detection for `.wav` files (#2387 ) ### Summary Adds filetype detection for `.wav` audio files ### Testing ```python from unstructured.file_utils.filetype import detect_filetype filename = "example-docs/CantinaBand3.wav" detect_filetype(filename=filename) # Should be FileType.WAV ```	2024-01-15 16:50:49 +00:00
suraj chauhan	f4bf1fa270	Chore: Libmagic detection for "application/octet-stream" when it is not a zip file. (#1347 ) Addressed the issue #494 . Updated the `_detect_filetype_from_octet_stream()` function to use libmagic to infer the content type of file when it is not a zip file.	2023-09-08 18:49:00 +00:00
Christine Straub	483b09b3c9	Feat/1136 elements ordering for pdf (#1161 ) ### Summary Address [#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for `hi_res` and `fast` strategies. The `ocr_only` strategy does not include coordinates. - add functionality to switch sort mode between the current `basic` sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies - add the script to evaluate the `xy-cut` sorting approach - add jupyter notebook to provide evaluation and visualization for the `xy-cut` sorting approach ### Evaluation ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy> ``` Here, the file should be under the project root directory. For example, ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast ```	2023-08-24 17:46:19 -07:00
Yuming Long	112347aa0d	doc: update API doc to sync with new parameter in prod API (#1049 ) * doc doc * changelog and version * sample docs -> example docs * nit on compute cost doc * pass empty dict not none * note note * cutting release	2023-08-09 11:09:37 -04:00
Matt Robinson	331c7faf38	build(deps): split up dependencies by document type (#986 ) * split dependencies by document type * make pip-compile with new requirements * add extra requirements to setup.py * add in all docs; re pip-compile * extra for all docs * add pandas to xlsx * dependency requires for tsv and csv * handling for doc, docx and odt * dependency check for pypandoc * required dependencies for pandoc files * xml and html * markdown * msg * add in pdf * add in pptx * add in excel * add lxml as base req * extra all docs for local inference * local inference installs all * pin pillow version * fixes for plain text tests * fixes for doc * update make commands * changelog and version * add xlrd * update pip-compile * pin numpy for python 3.8 support * more constraints * constraint on scipy * update install docs * constrain ipython * add outlook to pip-compile * more ipython constraints * add extras to dockerfile * pin office365 client * few doc tweaks * types as strings * last pip-compile * re pip-compile * make tidy * make tidy	2023-08-01 11:31:13 -04:00
Yuming Long	d46c1c2d83	Chore: Pass table support param to partition image (#973 ) * add param and test in image table extraction * version and changelog * need to publish this one for api repo * add new param skip_infer_table_types * use warning * clean up with mapping * add test for tsv * fix test fail * weird change from merge * doc nit * don't use mapping * correct conflict	2023-07-27 13:33:36 -04:00
Roman Isecke	b39e0d7354	Roman/expose dpi param (#966 ) * Bump inference version * Pass through the dpi param if available * Update CHANGELOG * Check dpi param passed in via unit test * Bump inference version * Fix unit test around file info to work on mac as well	2023-07-26 09:26:06 -04:00
Matt Robinson	d694cd53bf	refactor: simplifies JSON detection and add tests (#975 ) * refactor json detection * version and changelog * fix mock in test	2023-07-25 19:59:45 +00:00
John	f282a10715	enhancement: improve json detection by detect_filetype (#971 ) * update regex pattern * improve json regex pattern checks and add test file * update file name * update tests and formatting * update changelog and version	2023-07-25 12:47:39 -04:00
Christine Straub	f7def03d55	Fix/521 pdf2image memory error hi res (#948 ) This PR is to reflect changes in the unstructured-inference PR #152 * Update functionality to retrieve image metadata from a page for document_to_element_list	2023-07-24 19:22:56 +00:00
Emily Chen	2635b0be07	Don't instantiate an element with a coordinate system when there isn't a way to get its location (#913 )	2023-07-10 21:47:41 -07:00
Matt Robinson	b3936893b8	build: add python 3.11 to CI (#908 ) * remove argilla; bump reqs * enable py 3.11 * add 3.11 to setup.py * make pip-compile * ignore cli mypy errors * install argilla * fix constraints * install argilla * changelog and version * skip argilla in docker * dont import argilla in docker * skip all of argilla if in container * only import argilla if outside docker * more docker skips * remove weird pypi settings	2023-07-10 18:52:25 +00:00
Matt Robinson	38457777fa	fix: ignore escaped commas in CSV checks (#832 ) * fix file content checking bug * skip counting commas in quotes for csv detection * add test for comma count * change file content grab to -1 * version and changelog * add csv to extension check * add file to tests * ingest-test-fixtures-update * Update ingest test fixtures (#833) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * fix typo * fix changelog wording --------- Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-28 17:22:23 +00:00
Martin Mauch	752e78e803	feat: partition_org for Org Mode documents (#780 ) * feat: partition_org for Org Mode documents * update version	2023-06-23 18:45:31 +00:00
Christine Straub	743482b6d3	Bug/635 unicode decode error eml (#739 ) * Adds functionality to extract charset info from eml files * Adds missed file-like object handling in detect_file_encoding * Adds functionality to replace the MIME encodings for eml files with one of the common encodings if a unicode error occurs * Organize the eml example files in the example-docs/eml directory	2023-06-17 00:52:13 +00:00
John	a9b9b873b1	feat: partition_tsv for tab separated value files (#758 ) * first pass at partition_tsv * working tests * create constants for tests and debug `make test` failure * make check and tidy * undo changes for testing locally * update changelog and version * fix bricks.rst * refactor if statements * make tidy * fix README and change try/except to if/else * update changelog and version * fix docstring	2023-06-15 18:50:53 +00:00
Matt Robinson	c82fdb6a89	feat: `partition_rst` for ReStructured Text documents (#725 ) * add example rst file * filetype detection for rst files * add partition_rst function * add partition_rst to auto * update readme * update docs * changelog and version * pandocs -> pandoc * fix typo	2023-06-12 19:31:10 +00:00
John	fc53277826	fix: Enable MIME type detection if libmagic is not available (#714 ) * fix: Add filetype check if libmagic unavailable * make tidy * make check * fix: change mime_type error to warning * Update changelog and __version__ * fix: Add filetype to requirements	2023-06-09 17:06:21 -04:00
Matt Robinson	19ab6d960f	enhancement: handling for empty files in `detect_filetype` and `partition` (#710 ) * add empty filetype * add empty handling to partition * changelog and version	2023-06-09 16:07:50 -04:00
Matt Robinson	0289ca3ea7	fix: handle encoding for text file checks (#707 ) * fixed encoding issue for _is_text_file_a_json * changelog and version	2023-06-09 11:08:16 -04:00
John	b2b92ea79d	fix: filetype detection if a CSV has a text/plain MIME type (#691 ) * fix: Filetype detection if a CSV has a text/plain MIME type #621 * bug: fix csv detection and create _read_file_start_for_type_check func * fix: Make call to _is_text_file_a_csv from detect_filetype	2023-06-08 16:21:07 -04:00
Matt Robinson	cf0ff91e37	fix: recognize code files with auto (#677 ) * add check for code mime type * add file extensions * add new tests * version and changelog	2023-06-02 20:09:43 +00:00
Yuming Long	fc59a043b7	Chore: Support epub tests in docker image (#630 ) * docker works * more epub tests * changelog version * support epub + odt + rtf * update dockerfile * revert.. * install pandoc on ci env * pandoc docker grab based on arch * move arch into image * move back to base image	2023-05-26 15:38:48 -04:00
Matt Robinson	fda51d6ead	fix: add more mime types for csv (#620 )	2023-05-19 16:40:26 -05:00
Matt Robinson	21c821d651	feat: add `partition_csv` function (#619 ) * add csv into filetype detection * first pass on csv * add tests for csv * add csv to auto * version bump * update readme and docs * fix doc strings	2023-05-19 15:57:42 -04:00
Matt Robinson	23ff32cc42	feat: add `partition_xml` for XML files (#596 ) * first pass on partition_xml * add option to keep xml tags * added tests for xml * fix filename * update filenames * remove outdated readme * add xml to auto * version and changelog * update readme and docs * pass through include_metadata * update include_metadata description * add README back in * linting, linting, linting * more linting * spooled to bytes doesnt need to be a tuple * Add tests for newly supported filetypes * Correct metadata filetype * doc typo Co-authored-by: qued <64741807+qued@users.noreply.github.com> * typo fix Co-authored-by: qued <64741807+qued@users.noreply.github.com> * typo fix Co-authored-by: qued <64741807+qued@users.noreply.github.com> * keep_xml_tags -> xml_keep_tags --------- Co-authored-by: Alan Bertl <alan@unstructured.io> Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-05-18 15:40:12 +00:00
Eu Jin Marcus Yatim	7eac1f8ca7	refactor: update detect_filetype() to use hashmap for mime type return (#591 ) * Update detect_filetype() to use hashmap for mime type return * fix: text mime type and linting * fix: declare docx and xlsx mime types locally and also fix linting * Update CHANGELOG.md * tweaks for failing tests --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2023-05-17 13:48:52 +00:00
Matt Robinson	b8037118c4	feat: add `partition_xlsx` for MSFT Excel files (#594 ) * first pass on partition_xlsx * add support for files * add test for xlsx from filename * added filetype metadata * add xlsx to auto * remove fake excel from unsupported * version and changelog * update docs * update readme * fix removed file reference * fix some more tests * pass in metadata filename * add include_metadata flag	2023-05-16 19:40:40 +00:00
Yida Liu	f46eb06e2d	fix: check json and eml decode ignore error (#574 )	2023-05-10 22:00:11 -07:00
Matt Robinson	fae5f8fdde	feat: add `partition_odt` for open office docs (#548 ) * added filetype detection for odt * add function for partition odt documents * add odt files to auto * changelog and version * docs and readme * update installation docs * skip tests if not supported or in docker * import pytest * fix docs typos	2023-05-04 19:28:08 +00:00
Matt Robinson	9fdc310358	fix: update `detect_filetype` for JSONs with text/plain MIME type (#520 ) * check to see if text file is a json * add json check into filetype detection * added test for updated file detection logic * bytes/strings handling * changelog and version bump	2023-04-26 13:52:47 -04:00
Matt Robinson	7ec85272b7	feat: add `partition_rtf` for rich text files (#466 ) * refactor epub; add rtf * added test for rtf files * filetype detection for rtf files * add rtf to auto * update docs for group_broken_paragraphs * add rtf to docs * update file list in readme * update stage_for_transformers docs * changelog and version bump * skip rtf if in docker * skip test if rtf not supported * docs tweaks	2023-04-10 21:25:03 +00:00
ryannikolaidis	d298f57b8f	fix: issue when filename is provided but file is not on disk (#446 )	2023-04-05 17:54:11 +00:00
Amanda Cameron	a9da858fa3	chore: add tests for docker (#373 )	2023-03-21 13:46:09 -07:00
Matt Robinson	e43cb0e6e0	feat: add `partition_epub` function (#364 ) * add pypandoc dependency * added epub partitioner and file conversion * test for partition_epub * tests for file conversion * add epub to filetype detection * added epub to auto partition * update bricks docs * updated installing docs * changelog and version * add pandoc to dependencies * add pandoc to debian dependencies * linting, linting, linting * typo fix * typo fix * file conversion type hints * more type hints --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-03-14 15:52:21 +00:00
Matt Robinson	30b5a4da65	fix: parsing for files with `message/rfc822` MIME type; dir for unsupported files (#358 ) Adds the ability to process files with a message/rfc822 MIME type, which previously caused failures for example-docs/fake-email-header.eml.	2023-03-10 15:10:39 -08:00
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282 ) * Apply import sorting ruff . --select I --fix * Remove unnecessary open mode parameter ruff . --select UP015 --fix * Use f-string formatting rather than .format * Remove extraneous parentheses Also use "" instead of str() * Resolve missing trailing commas ruff . --select COM --fix * Rewrite list() and dict() calls using literals ruff . --select C4 --fix * Add () to pytest.fixture, use tuples for parametrize, etc. ruff . --select PT --fix * Simplify code: merge conditionals, context managers ruff . --select SIM --fix * Import without unnecessary alias ruff . --select PLR0402 --fix * Apply formatting via black * Rewrite ValueError somewhat Slightly unrelated to the rest of the PR * Apply formatting to tests via black * Update expected exception message to match 0d81564 * Satisfy E501 line too long in test * Update changelog & version * Add ruff to make tidy and test deps * Run 'make tidy' * Update changelog & version * Update changelog & version * Add ruff to 'check' target Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
grungyfeline998	956f04d770	feat: detect filetype with extension if libmagic is unavailable (#268 ) * included the previous PR changes and verified black * resolved the issues mentioned * make tidy and add tests --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2023-02-24 15:23:29 +00:00
Matt Robinson	47ab808e0f	feat: file info dataframe from filenames and file content (#204 ) * added function for exploring a list of files * file info from file contents * added tests for file info from contents * bump version and add tests * add dev to version	2023-02-08 20:48:39 +00:00
Matt Robinson	26a5546152	fix: handle xml filetype detection on amazon linux (#173 ) * fix: handle xml filetype detection on amazon linux * option for html or xml * fix typo * back to dev tag	2023-01-25 11:20:01 -05:00
Matt Robinson	74ce2ae6e5	fix: update `detect_filetype` to properly handle older office files (#161 )	2023-01-18 11:18:20 -05:00
Matt Robinson	eba4c80b1e	feat: `get_directory_file_info` for exploring a directory of files (#142 ) * added python-pptx to requirements * added filetype detection for powerpoint * add more filetypes to detect * more tests * added tests for filetype * reorder document types * tests for get_directory_file_info * added docs for get_directory_file_info * bump version * Word -> Office * added test for filetype * add group by filetype example	2023-01-11 12:40:50 -05:00
Matt Robinson	5376bc510f	feat: generic `partition` brick with filetype detection (#132 ) * add python-magic * first pass on filetype detection * tests for filetype detection * more tests for file detection * added tests for error conditions * install libmagic dev in github * libmagic install instructions * pattern for checking email files * support reading .eml in rb mode * add auto partition function * auto tests for email * auto tests for docx * added tests for html * add pdf and html tests * linting, linting, linting * added docs for auto partitioning * update readme with generic partition brick * bumped version * added test for bad type * detect .docx files from application/octet-stream * linting, linting, linting * identify xlsx from octet stream * install poppler in ci * fix mocks; test for unknown type * install poppler utils * install in one line * only poppler-utils * file extension logic from application/octet-stream * install local inference for ci * install detectron2 * removing unused dockerfile	2023-01-09 16:15:14 -05:00
Matt Robinson	b14f6ac9bd	feat: extract metadata from `.docx`, `.xlsx`, and `.jpg` (#113 ) * add python-docx dependency * added function for extracting metadata from word documents * add openpyxl * added get_jpg_metadata; fixed typing * bump changelog * added pillow to dependencies	2022-12-26 09:34:36 -05:00

48 Commits