unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-10 23:54:23 +00:00

Author	SHA1	Message	Date
Matt Robinson	30b5a4da65	fix: parsing for files with `message/rfc822` MIME type; dir for unsupported files (#358 ) Adds the ability to process files with a message/rfc822 MIME type, which previously caused failures for example-docs/fake-email-header.eml.	2023-03-10 15:10:39 -08:00
Amanda Cameron	64efcc0e50	Adding optional encoding arg, and text_partition tests (#339 )	2023-03-06 15:07:33 -08:00
Matt Robinson	a5da3de43b	fix: ensure all text is maintained in html output (#335 ) * fix: ensure all text is maintained in html pages * add back in replace unicode quotes * changelog and version bump * apt-get update in ci * white space differences in output	2023-03-02 14:03:13 -05:00
Tom Aarsen	350c4230ee	fix: Remove JavaScript from HTML reader output (#313 ) * Fixes an error causing JavaScript to appear in the output of `partition_html` sometimes.	2023-02-28 14:24:24 -08:00
Matt Robinson	601f250edc	feat: add `partition_ppt` for older power point docs (#238 ) * added partition_ppt function and tests * add ppt support to auto * version bump * update docs * doc fixes * update changelog * `.docx` -> `.pptx` * its -> their * remove whitespace	2023-02-17 16:57:08 +00:00
Matt Robinson	6036af33e7	feat: add `partition_doc` for `.doc` files (#236 ) * first pass on doc partitioning * add libreoffice to deps * update docs and readme * add .doc to auto * changelog bump * value error with missing doc * doc updates	2023-02-17 09:30:23 -05:00
Matt Robinson	f890972139	docs: add bricks training notebook (#211 ) * added bricks notebook * more unicode quotes; isd dataframe column fix * fix remove_punctuation docs * typo fixes * put staging bricks in code	2023-02-10 14:39:14 +00:00
Matt Robinson	339c133326	fix: cleanup from live `.docx` tests (#177 ) * add env var for cap threshold; raise default threshold * update docs and tests * added check for ending in a comma * update docs * no caps check for all upper text * capture Text in html and text * check category in Text equality check * lower case all caps before checking for verbs * added check for us city/state/zip * added address type * add address to html * add address to text * fix for text tests; escape for large text segments * refactor regex for readability * update comment * additional test for text with linebreaks * update docs * update changelog * update elements docs * remove old comment * case -> cast * type fix	2023-01-26 15:52:25 +00:00
Matt Robinson	8b6c5fac9d	feat: basic PowerPoint parsing in `partition_pptx` (#166 ) * parition pptx and tests * add parition_pptx to auto * update doc types in readme * add pptx docs * bump version * remove extra whitespace * partition -> partitioning	2023-01-23 17:03:09 +00:00
Mallori Harrell	08ccee0acb	chore: Fix parse received data (#143 ) * fix parse_received data	2023-01-17 16:36:44 -06:00
Matt Robinson	eba4c80b1e	feat: `get_directory_file_info` for exploring a directory of files (#142 ) * added python-pptx to requirements * added filetype detection for powerpoint * add more filetypes to detect * more tests * added tests for filetype * reorder document types * tests for get_directory_file_info * added docs for get_directory_file_info * bump version * Word -> Office * added test for filetype * add group by filetype example	2023-01-11 12:40:50 -05:00
Mallori Harrell	e0feba83f6	feat: Add Image element and `find_embedded_image` function (#130 ) * add find_embedded_image	2023-01-09 19:49:19 -06:00
Matt Robinson	5376bc510f	feat: generic `partition` brick with filetype detection (#132 ) * add python-magic * first pass on filetype detection * tests for filetype detection * more tests for file detection * added tests for error conditions * install libmagic dev in github * libmagic install instructions * pattern for checking email files * support reading .eml in rb mode * add auto partition function * auto tests for emal * auto tests for docx * added tests for html * add pdf and html tests * linting, linting, linting * added docs for auto partitioning * update readme with generic partition brick * bumped version * added test for bad type * detect .docx files from application/octet-stream * linting, linting, linting * identify xlsx from octet stream * install poppler in ci * fix mocks; test for unknown type * install poppler utils * install in one line * only poppler-utils * file extension logic from application/octet-stream * install local inference for ci * install detectron2 * removing unused dockerfile	2023-01-09 16:15:14 -05:00
Mallori Harrell	d7a00046a9	feat: Add new functionality to parse text and header of emails (#111 ) * partition_text function	2023-01-09 17:08:08 +00:00
Mallori Harrell	509ad4951c	feat: Add `extract_attachment_info` (#112 ) * Adds function to extract attachments and their metadata from eml files	2023-01-03 11:41:54 -06:00
Matt Robinson	b14f6ac9bd	feat: extract metadata from `.docx`, `.xlsx`, and `.jpg` (#113 ) * add python-docx dependency * added function for extracting metadata from word documents * add openpyxl * added get_jpg_metadata; fixed typing * bump changelog * added pillow to dependencies	2022-12-26 09:34:36 -05:00
Matt Robinson	7a74cdda86	feat: add `partition_email` cleaning brick (#104 ) * fix for processing deeply embedded list elements * fix types in mime encodings cleaner * first pass on partition_email * tests for email * test for mime encodings * changelog bump * added note about \n= * linting, linting, linting * added email docs * add partition_email to the readme * add one more test	2022-12-19 18:02:44 +00:00
Sebastian Laverde Alfonso	baa15d0098	feat: new partitioning brick that calls the document image analysis API (#68 ) * docs: add new feature to the CHANGELOG.md, bump the version, update __version__.py * feat: new partition to call the document image analysis API * fix: remove duplicated dependency on partition.py * fix: linting error due to line-lenght > 100 * test: add test to call partition_pdf brick * chore: new short example-doc pdf for speed up in test X8 * fix: add missing return statement to _read to pass check * feat: new partitioning brick to call doc parse API * docs: version update fix in CHANGELOG * refactor: no nested ifs * docs: documentation for new brick partition_pdf * refactor: made tidy * docs: minor doc refactor Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>	2022-11-16 17:48:30 +01:00
Matt Robinson	5f40c78f25	Initial Release	2022-09-26 14:55:20 -07:00

19 Commits