unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-03 12:27:04 +00:00

Author	SHA1	Message	Date
Christine Straub	5b5fb3e13b	Issue/encoding error eml (#639 ) This PR adds functionality to try other common encodings for email (.eml) files if an error related to the encoding is raised and the user has not specified an encoding.	2023-05-30 10:24:02 -07:00
cragwolfe	c5d9469001	feat: add xls support (#632 ) Add support for older .XLS files from the partition function in unstructured.partition.auto. Note, this should also work on the centos7 unstructured image (with the requirements/*txt updates in this PR).	2023-05-26 01:55:32 -07:00
Christine Straub	a1fed6d4c6	Issue/unicode error (#608 ) This PR adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.	2023-05-23 13:35:38 -07:00
Matt Robinson	21c821d651	feat: add `partition_csv` function (#619 ) * add csv into filetype detection * first pass on csv * add tests for csv * add csv to auto * version bump * update readme and docs * fix doc strings	2023-05-19 15:57:42 -04:00
Matt Robinson	23ff32cc42	feat: add `partition_xml` for XML files (#596 ) * first pass on partition_xml * add option to keep xml tags * added tests for xml * fix filename * update filenames * remove outdated readme * add xml to auto * version and changelog * update readme and docs * pass through include_metadata * update include_metadata description * add README back in * linting, linting, linting * more linting * spooled to bytes doesnt need to be a tuple * Add tests for newly supported filetypes * Correct metadata filetype * doc typo Co-authored-by: qued <64741807+qued@users.noreply.github.com> * typo fix Co-authored-by: qued <64741807+qued@users.noreply.github.com> * typo fix Co-authored-by: qued <64741807+qued@users.noreply.github.com> * keep_xml_tags -> xml_keep_tags --------- Co-authored-by: Alan Bertl <alan@unstructured.io> Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-05-18 15:40:12 +00:00
Mallori Harrell	34d563c1fc	feat: Create spacy notebook example (#593 ) * add new notebook for spacy	2023-05-17 15:42:15 -05:00
Matt Robinson	b8037118c4	feat: add `partition_xlsx` for MSFT Excel files (#594 ) * first pass on partition_xlsx * add support for files * add test for xlsx from filename * added filetype metadata * add xlsx to auto * remove fake excel from unsupported * version and changelog * update docs * update readme * fix removed file reference * fix some more tests * pass in metadata filename * add include_metadata flag	2023-05-16 19:40:40 +00:00
Kevin Pan	7c07b3f690	feat: Read docx tables (#572 ) * add table parsing * import paragraph * update changelog * add example docx * revert changelog formatting * update function name for consistency * add both text and html metadata for table * update with metadata in docx table note --------- Co-authored-by: kevin pan <kevin.pan@strivr.com>	2023-05-11 18:31:38 +00:00
Matt Robinson	392cccdbf7	enhancement: add ocr_only strategy for `partition_image` (#540 ) * spike for ocr-only strategy for images * fix for file processing * extra space * add korean to ci * added test for ocr_only strategy * added docs for ocr_only * changelog and version * added test for bad strategy * skip korean test if in docker * bump version * version bump * document valid strategies * bump version for release --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-05-04 20:23:51 +00:00
Matt Robinson	fae5f8fdde	feat: add `partition_odt` for open office docs (#548 ) * added filetype detection for odt * add function for partition odt documents * add odt files to auto * changelog and version * docs and readme * update installation docs * skip tests if not supported or in docker * import pytest * fix docs typos	2023-05-04 19:28:08 +00:00
Matt Robinson	aa01cdfc7a	fix: group together text from the same bounding box in `partition_pdf` with fast strategy (#542 ) * switch to using PDF objects * linting, linting, linting * couple more tweaks * added test for chevron-page * version and changelog * linting, linting, linting * now processing 4 files	2023-05-03 18:33:24 -04:00
Matt Robinson	9fdc310358	fix: update `detect_filetype` for JSONs with text/plain MIME type (#520 ) * check to see if text file is a json * add json check into filetype detection * added test for updated file detection logic * bytes/strings handling * changlog and version bump	2023-04-26 13:52:47 -04:00
Matt Robinson	894a190001	enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514 ) * function to check if pdf is extractable * add fallback logic for unextractable pdfs * tests for docs with copy protection * add test for unprocessable pdf * update docs * changelog and version * update logic for images; reset file before proceeding * 3 files for api tests * docs update	2023-04-21 21:35:43 +00:00
Sebastian Laverde Alfonso	ba59ad6b3a	chore: add copy-protected pdf to sample-docs (#512 )	2023-04-21 18:02:38 +00:00
Mallori Harrell	5d1e61cb3f	feat: add msg attachment support (#510 ) * add msg function and fix bug in eml attachment function	2023-04-21 11:14:46 -05:00
Matt Robinson	7ec85272b7	feat: add `partition_rtf` for rich text files (#466 ) * refactor epub; add rtf * added test for rtf files * filetype detection for rtf files * add rtf to auto * update docs for group_broken_paragraphs * add rtf to docs * update file list in readme * update stage_for_transformers docs * changelog and version bump * skip rtf if in docker * skip test if rtf not supported * docs tweaks	2023-04-10 21:25:03 +00:00
Minty Mac	533241c274	Update README.md (#435 ) Spelling error	2023-04-02 09:52:14 -07:00
Matt Robinson	75cf233702	feat: add `partition_msg` for MSFT Outlook files (#412 ) * added msg-parser dependency * pass through kwargs in convert_file_to_text * added partition_msg for processing msft outlook files * version bump and changelog * added tests for partition_msg * added test for msg with plain text * add partition_msg docs; fix underlines in integration docs * add .msg to file list * finish tests for auto msg * linting, linting, linting	2023-03-28 20:15:22 +00:00
Matt Robinson	e43cb0e6e0	feat: add `partition_epub` function (#364 ) * add pypandoc dependency * added epub partitioner and file conversion * test for partition_epub * tests for file conversion * add epub to filetype detection * added epub to auto partition * update bricks docs * updated installing docs * changelot and version * add pandoc to dependencies * add pandoc to debian dependencies * linting, linting, linting * typo fix * typo fix * file conversion type hints * more type hints --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-03-14 15:52:21 +00:00
Matt Robinson	30b5a4da65	fix: parsing for files with `message/rfc822` MIME type; dir for unsupported files (#358 ) Adds the ability to process files with a message/rfc822 MIME type, which previously caused failures for example-docs/fake-email-header.eml.	2023-03-10 15:10:39 -08:00
Amanda Cameron	64efcc0e50	Adding optional encoding arg, and text_partition tests (#339 )	2023-03-06 15:07:33 -08:00
Matt Robinson	a5da3de43b	fix: ensure all text is maintained in html output (#335 ) * fix: ensure all text is maintained in html pages * add back in replace unicode quotes * changelog and version bump * apt-get update in ci * white space differences in output	2023-03-02 14:03:13 -05:00
Tom Aarsen	350c4230ee	fix: Remove JavaScript from HTML reader output (#313 ) * Fixes an error causing JavaScript to appear in the output of `partition_html` sometimes.	2023-02-28 14:24:24 -08:00
Matt Robinson	601f250edc	feat: add `partition_ppt` for older power point docs (#238 ) * added partition_ppt function and tests * add ppt support to auto * version bump * update docs * doc fixes * update changelog * `.docx` -> `.pptx` * its -> their * remove whitespace	2023-02-17 16:57:08 +00:00
Matt Robinson	6036af33e7	feat: add `partition_doc` for `.doc` files (#236 ) * first pass on doc partitioning * add libreoffice to deps * update docs and readme * add .doc to auto * changelog bump * value error with missing doc * doc updates	2023-02-17 09:30:23 -05:00
Matt Robinson	f890972139	docs: add bricks training notebook (#211 ) * added bricks notebook * more unicode quotes; isd dataframe column fix * fix remove_punctuation docs * typo fixes * put staging bricks in code	2023-02-10 14:39:14 +00:00
Matt Robinson	339c133326	fix: cleanup from live `.docx` tests (#177 ) * add env var for cap threshold; raise default threshold * update docs and tests * added check for ending in a comma * update docs * no caps check for all upper text * capture Text in html and text * check category in Text equality check * lower case all caps before checking for verbs * added check for us city/state/zip * added address type * add address to html * add address to text * fix for text tests; escape for large text segments * refactor regex for readability * update comment * additional test for text with linebreaks * update docs * update changelog * update elements docs * remove old comment * case -> cast * type fix	2023-01-26 15:52:25 +00:00
Matt Robinson	8b6c5fac9d	feat: basic PowerPoint parsing in `partition_pptx` (#166 ) * parition pptx and tests * add parition_pptx to auto * update doc types in readme * add pptx docs * bump version * remove extra whitespace * partition -> partitioning	2023-01-23 17:03:09 +00:00
Mallori Harrell	08ccee0acb	chore: Fix parse received data (#143 ) * fix parse_received data	2023-01-17 16:36:44 -06:00
Matt Robinson	eba4c80b1e	feat: `get_directory_file_info` for exploring a directory of files (#142 ) * added python-pptx to requirements * added filetype detection for powerpoint * add more filetypes to detect * more tests * added tests for filetype * reorder document types * tests for get_directory_file_info * added docs for get_directory_file_info * bump version * Word -> Office * added test for filetype * add group by filetype example	2023-01-11 12:40:50 -05:00
Mallori Harrell	e0feba83f6	feat: Add Image element and `find_embedded_image` function (#130 ) * add find_embedded_image	2023-01-09 19:49:19 -06:00
Matt Robinson	5376bc510f	feat: generic `partition` brick with filetype detection (#132 ) * add python-magic * first pass on filetype detection * tests for filetype detection * more tests for file detection * added tests for error conditions * install libmagic dev in github * libmagic install instructions * pattern for checking email files * support reading .eml in rb mode * add auto partition function * auto tests for emal * auto tests for docx * added tests for html * add pdf and html tests * linting, linting, linting * added docs for auto partitioning * update readme with generic partition brick * bumped version * added test for bad type * detect .docx files from application/octet-stream * linting, linting, linting * identify xlsx from octet stream * install poppler in ci * fix mocks; test for unknown type * install poppler utils * install in one line * only poppler-utils * file extension logic from application/octet-stream * install local inference for ci * install detectron2 * removing unused dockerfile	2023-01-09 16:15:14 -05:00
Mallori Harrell	d7a00046a9	feat: Add new functionality to parse text and header of emails (#111 ) * partition_text function	2023-01-09 17:08:08 +00:00
Mallori Harrell	509ad4951c	feat: Add `extract_attachment_info` (#112 ) * Adds function to extract attachments and their metadata from eml files	2023-01-03 11:41:54 -06:00
Matt Robinson	b14f6ac9bd	feat: extract metadata from `.docx`, `.xlsx`, and `.jpg` (#113 ) * add python-docx dependency * added function for extracting metadata from word documents * add openpyxl * added get_jpg_metadata; fixed typing * bump changelog * added pillow to dependencies	2022-12-26 09:34:36 -05:00
Matt Robinson	7a74cdda86	feat: add `partition_email` cleaning brick (#104 ) * fix for processing deeply embedded list elements * fix types in mime encodings cleaner * first pass on partition_email * tests for email * test for mime encodings * changelog bump * added note about \n= * linting, linting, linting * added email docs * add partition_email to the readme * add one more test	2022-12-19 18:02:44 +00:00
Sebastian Laverde Alfonso	baa15d0098	feat: new partitioning brick that calls the document image analysis API (#68 ) * docs: add new feature to the CHANGELOG.md, bump the version, update __version__.py * feat: new partition to call the document image analysis API * fix: remove duplicated dependency on partition.py * fix: linting error due to line-lenght > 100 * test: add test to call partition_pdf brick * chore: new short example-doc pdf for speed up in test X8 * fix: add missing return statement to _read to pass check * feat: new partitioning brick to call doc parse API * docs: version update fix in CHANGELOG * refactor: no nested ifs * docs: documentation for new brick partition_pdf * refactor: made tidy * docs: minor doc refactor Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>	2022-11-16 17:48:30 +01:00
Matt Robinson	5f40c78f25	Initial Release	2022-09-26 14:55:20 -07:00

38 Commits