unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-01 19:37:27 +00:00

Author	SHA1	Message	Date
ryannikolaidis	a4726cb197	fix: open xml files in read only mode (#362)	2023-03-13 13:06:45 -07:00
Matt Robinson	7c08450597	feat: add `"fast"` strategy for PDF parsing; fallback to `"fast"` if `detectron2` is not available (#357) Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res" and is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to using the "fast" strategy. The implementation uses pdfminer because that's already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.	2023-03-11 03:16:05 +00:00
Matt Robinson	30b5a4da65	fix: parsing for files with `message/rfc822` MIME type; dir for unsupported files (#358) Adds the ability to process files with a message/rfc822 MIME type, which previously caused failures for example-docs/fake-email-header.eml.	2023-03-10 15:10:39 -08:00
Matt Robinson	7c619f045b	feat: `UNSTRUCTURED_LANGUAGE_CHECK` env var to control (#351) * environment variable to set language checks * change log and version * checks for if language checks are false * update docs * changelog type * add assert to tests * performance note in docstrings * docstring tweaks	2023-03-09 17:33:48 +00:00
natygyoon	6be07a5260	feat: update auto.partition() function to recognize Unstructured json (#337)	2023-03-08 10:36:01 -08:00
Amanda Cameron	64efcc0e50	Adding optional encoding arg, and text_partition tests (#339)	2023-03-06 15:07:33 -08:00
Matt Robinson	a5da3de43b	fix: ensure all text is maintained in html output (#335) * fix: ensure all text is maintained in html pages * add back in replace unicode quotes * changelog and version bump * apt-get update in ci * white space differences in output	2023-03-02 14:03:13 -05:00
Tom Aarsen	350c4230ee	fix: Remove JavaScript from HTML reader output (#313) * Fixes an error causing JavaScript to appear in the output of `partition_html` sometimes.	2023-02-28 14:24:24 -08:00
Matt Robinson	69661788cf	fix: track narrative text and figure captions in HTML documents (#309) * fix for missing narrative text in partition_html * fixes so existing tests pass * tests for figure caption and narrative text * bump version; changelog	2023-02-28 15:36:08 +00:00
Alvaro Bartolome	e52dd5c179	feat: add `requires_dependencies` decorator (#302) * Add `requires_dependencies` decorator * Use `requires_dependencies` on Reddit & S3 * Fix bug in `requires_dependencies` To use named args the decorator needs to be also wrapped * Add `requires_dependencies` integration tests * Add `requires_dependencies` in `Competition.md` * Update `CHANGELOG.md` * Bump version 0.4.16-dev5 * Ignore `F401` unused imports in `requires_dependencies` tests * Apply suggestions from code review * Add `functools.wraps` to keep docs, & annotations * Use `requires_dependencies` in `GitHubConnector`	2023-02-28 14:50:39 +00:00
Tom Aarsen	ded60afda9	feat: Add GitHub data connector; add Markdown partitioner (#284)	2023-02-27 14:36:44 -08:00
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282) * Apply import sorting ruff . --select I --fix * Remove unnecessary open mode parameter ruff . --select UP015 --fix * Use f-string formatting rather than .format * Remove extraneous parentheses Also use "" instead of str() * Resolve missing trailing commas ruff . --select COM --fix * Rewrite list() and dict() calls using literals ruff . --select C4 --fix * Add () to pytest.fixture, use tuples for parametrize, etc. ruff . --select PT --fix * Simplify code: merge conditionals, context managers ruff . --select SIM --fix * Import without unnecessary alias ruff . --select PLR0402 --fix * Apply formatting via black * Rewrite ValueError somewhat Slightly unrelated to the rest of the PR * Apply formatting to tests via black * Update expected exception message to match 0d81564 * Satisfy E501 line too long in test * Update changelog & version * Add ruff to make tidy and test deps * Run 'make tidy' * Update changelog & version * Update changelog & version * Add ruff to 'check' target Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
Tom Aarsen	e61ce2cc00	Skip posix_path test on Windows (#283)	2023-02-25 08:31:34 +00:00
grungyfeline998	956f04d770	feat: detect filetype with extension if libmagic is unavailable (#268) * included the previous PR changes and verified black * resolved the issues mentioned * make tidy and add tests --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2023-02-24 15:23:29 +00:00
Matt Robinson	0d229f0a5e	fix: preserve all elements when serialized; feat: helper functions for serialization (#273) * added type to text element map * add element_id and coordinates * added test for serialization * added serialization for check boxes * add dict_to_elements and convert_to_dict aliases * helpers for serializing and deserializing elements * bump version; changelog * add Text to tests * aliases for isd functions * remove test elements json * changelog updates * make indent a kwarg * update expected structured output * docs update * use new function in ingest code * pop coordinates due to floating point differences * pop coordinates	2023-02-23 21:58:59 +00:00
Matt Robinson	354eff1e2b	build(deps): automatically download `nltk` models when required (#246) * code for downloading nltk packages * don't run nltk make command in ci * test for model downloads * remove nltk install from docs * update changelog and bump version	2023-02-23 17:19:13 +00:00
Matt Robinson	601f250edc	feat: add `partition_ppt` for older power point docs (#238) * added partition_ppt function and tests * add ppt support to auto * version bump * update docs * doc fixes * update changelog * `.docx` -> `.pptx` * its -> their * remove whitespace	2023-02-17 16:57:08 +00:00
Matt Robinson	6036af33e7	feat: add `partition_doc` for `.doc` files (#236) * first pass on doc partitioning * add libreoffice to deps * update docs and readme * add .doc to auto * changelog bump * value error with missing doc * doc updates	2023-02-17 09:30:23 -05:00
Matt Robinson	f5ff140d7c	fix: `ElementMetadata` serializes when the filename is a `Path` object (#233)	2023-02-16 17:20:51 +00:00
Matt Robinson	74e6b84b41	feat: add metadata tracking to document elements (#225) * add metadata field to elements * metadata tracking for pdf/image * metadata for html * update expected outputs * metadata for the rest of the document types * take out file metadata for now * add url to tables * added metadata to test_auto * bump version * added coordinates to __init__ * fix coordinates in tests	2023-02-15 18:26:20 +00:00
Matt Robinson	558ee63e90	feat: ability to skip English language specific checks with env var (#224) * add language env var * update docs * version and bump change log	2023-02-15 09:15:47 -05:00
Matt Robinson	f890972139	docs: add bricks training notebook (#211) * added bricks notebook * more unicode quotes; isd dataframe column fix * fix remove_punctuation docs * typo fixes * put staging bricks in code	2023-02-10 14:39:14 +00:00
Matt Robinson	7fb3797165	docs: core concepts training notebook (#207) * added to_dict to elements * first training notebook * bump changelog, rerun notebook * remove coordinates and id * rerun notebook * has -> have * partitioning -> partition * various and sundry typos * switch to using convert_to_isd	2023-02-09 14:34:34 +00:00
Matt Robinson	47ab808e0f	feat: file info dataframe from filenames and file content (#204) * added function for exploring a list of files * file info from file contents * added tests for file info from contents * bump version and add tests * add dev to version	2023-02-08 20:48:39 +00:00
Matt Robinson	e73cf09977	feat: optional page breaks for `.pptx`, `.pdf`, `.html` and images (#205) * page breaks for pptx * added page breaks for image/pdf * tests for images with page breaks * page breaks for html documents * linting, linting, linting * changelog and bump version * update docs * fix typo * refactor reusable code to common.py * add type back in	2023-02-08 15:11:15 +00:00
Matt Robinson	ee9f15483f	feat: `partition_html` directly from a url (#202) * added tests for html from url * bump version * added types-requests * and -> an	2023-02-07 14:09:34 +00:00
Matt Robinson	782b4352ec	build(deps): weekly dependency update; reduce dependabot frequency (#194) * deps: pip-compile to update dependencies * bump version * linting, linting, linting * typo	2023-02-06 16:39:29 +00:00
Matt Robinson	014585e872	fix: preserve the order of shapes in `partition_pptx` output (#193) * order the shapes top to bottom and left to right * added tests for ordering * update change log and bump version * more tests * don't need enumerate * n -> on	2023-02-03 22:12:33 +00:00
Matt Robinson	a7ca58e0bc	fix: more english words; split on punctuation (#191) * add a bigger list of english words * update thresholds and add tests * update docs; bump version * fix version * add additional english words back in * linting, linting, linting * add slashes * work -> word	2023-02-02 17:25:47 +00:00
Matt Robinson	0589344ff7	fix: require a minimum prop of alpha characters for titles and narrative text (#190) * added alpha ratio check * added tests for alpha ratio * bump changelog and update docs * update changelog/version; update docs * ofr -> or	2023-02-02 14:59:04 +00:00
Matt Robinson	1230a163fd	feat: set a user controlled max word length for titles (#189) * update the docs * add option for title max word length * bump version; update changelog * change max length to 12 * docs updates * to -> too	2023-02-01 19:32:16 +00:00
Matt Robinson	2d08fcbf83	fix: titles and narrative text need at least one english word (#188) * added check for english words * update docs * at least one word needs to have multiple characters * bump change log	2023-02-01 09:10:48 -05:00
sparkbrains	243bf7ed5e	test: Increase coverage (#181)	2023-01-30 22:47:09 -08:00
Matt Robinson	e6cfde5c4a	fix: no `UserWarning` when `partition_pdf` is called (#179)	2023-01-27 12:08:18 -05:00
Matt Robinson	339c133326	fix: cleanup from live `.docx` tests (#177) * add env var for cap threshold; raise default threshold * update docs and tests * added check for ending in a comma * update docs * no caps check for all upper text * capture Text in html and text * check category in Text equality check * lower case all caps before checking for verbs * added check for us city/state/zip * added address type * add address to html * add address to text * fix for text tests; escape for large text segments * refactor regex for readability * update comment * additional test for text with linebreaks * update docs * update changelog * update elements docs * remove old comment * case -> cast * type fix	2023-01-26 15:52:25 +00:00
Matt Robinson	26a5546152	fix: handle xml filetype detection on amazon linux (#173) * fix: handle xml filetype detection on amazon linux * option for html or xml * fix typo * back to dev tag	2023-01-25 11:20:01 -05:00
Matt Robinson	8b6c5fac9d	feat: basic PowerPoint parsing in `partition_pptx` (#166) * partition pptx and tests * add partition_pptx to auto * update doc types in readme * add pptx docs * bump version * remove extra whitespace * partition -> partitioning	2023-01-23 17:03:09 +00:00
Matt Robinson	8d3e616846	feat: add ability to parse `LayoutElement` lists (#165) * added ability to split list items * changelog and version bump * retrigger ci	2023-01-20 08:55:11 -05:00
Matt Robinson	c1822911a5	chore: return `Element` objects in `partition_pdf` and `partition_image` (#164) * helper function to convert to element * test for element types * fix for healthcheck url * version bump * note on coordinates * mention FigureCaption * test_shared -> test_common * add check boxes for checkbox template * update changelog	2023-01-19 14:29:28 +00:00
Matt Robinson	74ce2ae6e5	fix: update `detect_filetype` to properly handle older office files (#161)	2023-01-18 11:18:20 -05:00
Mallori Harrell	08ccee0acb	chore: Fix parse received data (#143) * fix parse_received data	2023-01-17 16:36:44 -06:00
Matt Robinson	749f9c6be8	fix: avoid divide by zero in `exceeds_cap_ratio` (#160)	2023-01-17 15:22:12 -05:00
Matt Robinson	9c3c14e94d	fix: resolves `UnicodeDecodeError` in `partition_email` for emails with attachments (#158) * split emails by \n= * added test for equivalence between html and plain text * changelog and bump version * add check for content disposition	2023-01-17 11:33:45 -05:00
qued	8abf1f119d	feat: partition image (#144) Adds partition_image to partition image file types, which is integrated into the partition brick. This relies on the 0.2.2 version of unstructured-inference.	2023-01-13 22:24:13 -06:00
Matt Robinson	f12240c5e7	feat: add support for `.txt` files in `partition` (#150) * added partition_text for auto * rename partition_text tests * bump version and update docs	2023-01-13 16:39:53 -05:00
Matt Robinson	eba4c80b1e	feat: `get_directory_file_info` for exploring a directory of files (#142) * added python-pptx to requirements * added filetype detection for powerpoint * add more filetypes to detect * more tests * added tests for filetype * reorder document types * tests for get_directory_file_info * added docs for get_directory_file_info * bump version * Word -> Office * added test for filetype * add group by filetype example	2023-01-11 12:40:50 -05:00
Mallori Harrell	e0feba83f6	feat: Add Image element and `find_embedded_image` function (#130) * add find_embedded_image	2023-01-09 19:49:19 -06:00
Matt Robinson	5376bc510f	feat: generic `partition` brick with filetype detection (#132) * add python-magic * first pass on filetype detection * tests for filetype detection * more tests for file detection * added tests for error conditions * install libmagic dev in github * libmagic install instructions * pattern for checking email files * support reading .eml in rb mode * add auto partition function * auto tests for email * auto tests for docx * added tests for html * add pdf and html tests * linting, linting, linting * added docs for auto partitioning * update readme with generic partition brick * bumped version * added test for bad type * detect .docx files from application/octet-stream * linting, linting, linting * identify xlsx from octet stream * install poppler in ci * fix mocks; test for unknown type * install poppler utils * install in one line * only poppler-utils * file extension logic from application/octet-stream * install local inference for ci * install detectron2 * removing unused dockerfile	2023-01-09 16:15:14 -05:00
Mallori Harrell	d7a00046a9	feat: Add new functionality to parse text and header of emails (#111) * partition_text function	2023-01-09 17:08:08 +00:00
Matt Robinson	fee95b643c	feat: add `partition_docx` for Word documents (#131) * first pass on docx parsing * linting, linting, linting * test docx with filename * added documentation * more tests; version bump * typo * another typo * another typo! * it -> its * save -> saved * remove None since it's the default argument	2023-01-05 20:13:39 +00:00


89 Commits
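
The strategy fallback described in commit 7c08450597 (#357) can be illustrated with a small sketch. This is not the code from that commit: `choose_pdf_strategy` and its `detectron2_available` parameter are hypothetical names introduced here, and only the rule stated in the commit message is modeled (a request for "hi_res" becomes "fast" when `detectron2` cannot be imported).

```python
import importlib.util
from typing import Optional

# Strategy names taken from the commit message for #357.
VALID_STRATEGIES = ("hi_res", "fast")


def choose_pdf_strategy(
    requested: str = "hi_res",
    detectron2_available: Optional[bool] = None,
) -> str:
    """Return the strategy to use, falling back to "fast" if needed.

    Hypothetical helper for illustration only; the real logic lives
    inside the library's PDF partitioning code.
    """
    if requested not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {requested!r}")

    # Assumption: availability is probed by checking whether the
    # top-level `detectron2` module can be found, without importing it.
    if detectron2_available is None:
        detectron2_available = importlib.util.find_spec("detectron2") is not None

    # "hi_res" depends on detectron2; without it, use the pdfminer-based path.
    if requested == "hi_res" and not detectron2_available:
        return "fast"
    return requested


if __name__ == "__main__":
    print(choose_pdf_strategy("hi_res", detectron2_available=False))  # prints "fast"
```

Passing `detectron2_available` explicitly keeps the decision testable on machines with or without the optional dependency installed.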