unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-11 19:15:56 +00:00

Author	SHA1	Message	Date
John	c58b261feb	chunk_by_title decorator (#1304 ) ### Summary Partial solution to #1185. Related to #1222. Creates decorator from `chunk_by_title` cleaning brick. Breaks a document into sections based on the presence of Title elements. Also starts a new section under the following conditions: - If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to True. - If the length of the section exceeds `new_after_n_chars` characters. The default is 1500. The chunking function does not split individual elements, so it's possible for a section to exceed that threshold if an individual element is over `new_after_n_chars` characters, which could occur with a long NarrativeText element. Combines sections under these conditions: - Sections under `combine_under_n_chars` characters are combined. The default is 500. ### Testing from unstructured.partition.html import partition_html url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0" chunks = partition_html(url=url, chunking_strategy="by_title") for chunk in chunks: print(chunk) print("\n\n" + "-"*80) input()	2023-09-11 21:00:14 +00:00
Matt Robinson	c49df62967	feat: `partition_xml` infers element type on each leaf node (#1249 ) ### Summary Closes #1229. Updates `partition_xml` so that the element type is inferred on each leaf node when `xml_keep_tags=False` instead of delegating splitting and partitioning to `partition_text`. If `xml_keep_tags=True`, the file is still treated like a text file and partitioning is still delegated to `partition_text`. Also adds the option to pass `text` as an input to `partition_xml`. ### Testing Create a `parrots.xml` file that looks like: ```xml <xml><parrot><name>Conure</name><description>A conure is a very friendly bird. Conures are feathery and like to dance.</description></parrot></xml> ``` Run: ```python from unstructured.partition.xml import partition_xml from unstructured.staging.base import convert_to_dict elements = partition_xml(filename="parrots.xml") convert_to_dict(elements) ``` On `main`, the output is the following. Notice how the `<name>` tag incorrectly gets merged into `<description>` in the first element. ```python [{'element_id': '7ae4074435df8dfcefcf24a4e6c52026', 'metadata': {'file_directory': '/home/matt/tmp', 'filename': 'parrots.xml', 'filetype': 'application/xml', 'last_modified': '2023-08-30T14:21:38'}, 'text': 'Conure A conure is a very friendly bird.', 'type': 'NarrativeText'}, {'element_id': '859ecb332da6961acd2fb6a0185d1549', 'metadata': {'file_directory': '/home/matt/tmp', 'filename': 'parrots.xml', 'filetype': 'application/xml', 'last_modified': '2023-08-30T14:21:38'}, 'text': 'Conures are feathery and like to dance.', 'type': 'NarrativeText'}] ``` On the feature branch, the output is the following, and the tags are correctly separated. 
```python [{'element_id': '5512218914e4eeacf71a9cd42c373710', 'metadata': {'file_directory': '/home/matt/tmp', 'filename': 'parrots.xml', 'filetype': 'application/xml', 'last_modified': '2023-08-30T14:21:38'}, 'text': 'Conure', 'type': 'Title'}, {'element_id': '113bf8d250c2b1a77c9c2caa4b812f85', 'metadata': {'file_directory': '/home/matt/tmp', 'filename': 'parrots.xml', 'filetype': 'application/xml', 'last_modified': '2023-08-30T14:21:38'}, 'text': 'A conure is a very friendly bird.\n' '\n' 'Conures are feathery and like to dance.', 'type': 'NarrativeText'}] ```	2023-08-30 17:07:10 -04:00
Matt Robinson	c578b85699	fix: respect `<pre>` tag order in `partition_html` (#1197 ) ### Summary Closes #1184. Updates `partition_html` to respect the ordering of `<pre>` tags in HTML documents. ### Testing The elements in the following example should be in the correct order. ```python from unstructured.partition.html import partition_html html_text = """ <pre>The Big Brown Bear</pre> <div>The big brown bear is growling.</div> <pre>The big brown bear is sleeping.</pre> <div>The Big Blue Bear</div> """ elements = partition_html(text=html_text) print("\n\n".join([str(el) for el in elements])) ```	2023-08-25 04:14:48 +00:00
Klaijan	1524841cd9	feat: supports multipage tiff (#1131 ) Add test case test_partition_image_with_multipage_tiff that reads multipage TIFF file and - confirms that the function reads all the pages in the TIFF. - page number is added to the metadata This PR is branched from and developed on top of 6d6be99 commit.	2023-08-24 15:12:50 +00:00
Matt Robinson	cdae53cc29	chore: deprecation warning for `file_filename` (#1191 ) ### Summary Closes #1007. Adds a deprecation warning for the `file_filename` kwarg to `partition`, `partition_via_api`, and `partition_multiple_via_api`. Also catches a warning in `ebooklib` that we do not want to emit in `unstructured`. ### Testing ```python from unstructured.partition.auto import partition filename = "example-docs/winter-sports.epub" # Should not emit a warning with open(filename, "rb") as f: elements = partition(file=f, metadata_filename="test.epub") # Should be test.epub elements[0].metadata.filename # Should emit a warning with open(filename, "rb") as f: elements = partition(file=f, file_filename="test.epub") # Should be test.epub elements[0].metadata.filename # Should raise an error with open(filename, "rb") as f: elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub") ```	2023-08-24 07:02:47 +00:00
Matt Robinson	ad595d32f6	enhancement: tell users to install missing extras (#1167 ) ### Summary Updates `partition` to let users know to install the appropriate extras if they're missing. Prior to this PR, users would get an exception stating `partition_pdf` (or whichever function requires extras) does not exist. ### Testing First `pip uninstall ebooklib`. Then run ```python from unstructured.partition.auto import partition partition(filename="example-docs/winter-sports.epub") ``` The error should look like ```python ImportError: partition_epub is not available. Install the epub dependencies with pip install "unstructured[epub]" ```	2023-08-22 03:00:21 +00:00
John	9f7bd6127b	enhancement: Add `include_header` kwarg for xlsx, default True (#1125 ) Closes GitHub issue #1121. Adds the `include_header` kwarg to `partition_xlsx` and changes the default behavior to True.	2023-08-17 04:16:23 +00:00
cragwolfe	6779918406	build(release): bump unstructured-inference (#1074 ) * build(release): bump unstructured-inference Related to downstream issue: Unstructured-IO/unstructured-api#182 And upstream PR: Unstructured-IO/unstructured-inference#165 --------- Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>	2023-08-10 20:57:46 +00:00
Klaijan	ad386af8b5	Klaijan/auto paragraph grouper (#994 ) * add auto_paragraph_grouper. add line break pattern. * combine group_broken_paragraph and blank_line_grouper function * fix make check errors * fix make check errors * fix make check errors * fix make check errors * run make tidy to fix errors * tidy core.py and text.py * fix blank-line breaker to extends the result and replace new line with space * fix function name typo * call group_broken_paragraphs for blank_line_grouper * edit function name from one_line_grouper to new_line_grouper for consistency * edit threshold from 0.5 to 0.1 * edit threshold from 0.5 to 0.1 * Revert "call group_broken_paragraphs for blank_line_grouper" This reverts commit 8fb93b7aa7c4d7e0320ac1e09c77da44c9b6c7d9. * revert to commit 8fb93b7 and change threshold from 0.5 to 0.1 * edit test_text assertion. remove all BULLETS_PATTERN. * Update ingest test fixtures (#1052) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * edit test case in test_xml_partition * update assertion on test_auto --------- Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MacBook-Pro.local> Co-authored-by: Klaijan Sinteppadon <klaijan@klaijans-mbp.mynetworksettings.com> Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MBP.fios-router.home> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-08-07 18:37:18 -04:00
Hynek Kydlíček	47b20119c3	fix: extract emojis with `partition_xlsx` (#1009 ) * 🐛 fixed emoji xlsx bug * update version and changelog * check if beautifulsoup exists * update docs * fix html parser call * fix failing attachment test * ✅ added emoji test, added requirement, fixed dependency * 🐛 dependency * 🐛 correct dependency * linting, linting, linting * check for bs4 * skip auto xls filename test --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-08-04 10:14:08 -04:00
Chris Pappalardo	f8a7ae2953	fixed filename metadata bug when using file and file_filename (#1002 )	2023-08-02 18:14:15 -07:00
shreyanid	a23d75a292	Set default strategy for images to be "hi_res" (#968 ) Set default strategy for images (not PDFs) to be hi_res.	2023-08-02 09:22:20 -07:00
Yuming Long	d46c1c2d83	Chore: Pass table support param to partition image (#973 ) * add param and test in image table extraction * version and changelog * need to publish this one for api repo * add new param skip_infer_table_types * use warning * clean up with mapping * add test for tsv * fix test fail * weird change from merge * doc nit * don't use mapping * correct conflict	2023-07-27 13:33:36 -04:00
Matt Robinson	d694cd53bf	refactor: simplifies JSON detection and add tests (#975 ) * refactor json detection * version and changelog * fix mock in test	2023-07-25 19:59:45 +00:00
Emily Chen	24ebd0fa4e	chore: Move coordinate details from Element model to a metadata model (#827 )	2023-07-05 11:25:11 -07:00
qued	350bb1dad5	enhancement: clean pdf elements (bump unstructured-inference) (#790 ) More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4) Make large model available (from unstructured-inference bump to 0.5.3) Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2) --------- Co-authored-by: Roman Isecke <roman@unstructured.io> Co-authored-by: Crag Wolfe <crag@unstructured.io>	2023-06-29 18:35:06 -07:00
ryannikolaidis	62e20442df	chore: refactor ingest tests (#814 ) - Adds reusable validation scripts (check-x.sh) to minimize repeated (or near-repeated) code and create one source of truth - Restructures the location of download and output folders such that they are nested in the test_unstructured_ingest directory - Adds gitignore for output folders / files to avoid them accidentally getting checked into the repository - Construct paths as reusable variables declared at top of scripts - Sort order of flag for ingest calls, across all tests (this makes it easier to parse at a glance) - OVERWRITE_FIXTURES removes all old fixtures for path to guarantee no stale results are left behind - Bonus: don't check/exit on expected number of expected outputs when OVERWRITE_FIXTURES is true - Bonus: exclude file_directory from Slack and Discord test scripts (match convention in all others)	2023-06-29 23:13:41 +00:00
Roman Isecke	9882c2b83f	Avoid setting metadata in constructor signature for elements (#837 ) Avoid setting metadata in constructor signature for elements because that can lead to unexpected object reuse (and modification). Bonus refactor for PageBreak to have text values of "". --------- Co-authored-by: Alan Bertl <alan@unstructured.io> Co-authored-by: Crag Wolfe <crag@unstructuredai.io>	2023-06-29 03:14:05 +00:00
kravetsmic	58e988e110	feature(html partition): parse pre tag (#642 ) * feature(html partition): parse pre tag * chore: update CHANGELOG.md * style: black format xml.py * Added tests for html with pre tag * remove skip test, update parse pre tag * fix style * chore: spell check * chore: update changelog & version * chore: update ingest test fixtures * chore: add exception handling if `element.text` is `None` in `_read_xml` * test: add more sanity testing on the `.text` content of the element(s) * refactor: move the conditional logic for <pre> outside of the `try/except` block --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-06-27 18:52:39 +00:00
Martin Mauch	752e78e803	feat: partition_org for Org Mode documents (#780 ) * feat: partition_org for Org Mode documents * update version	2023-06-23 18:45:31 +00:00
qued	db4c5dfdf7	feat: coordinate systems (#774 ) Added the CoordinateSystem class for tracking the system in which coordinates are represented, and changing the system if desired.	2023-06-20 11:19:55 -05:00
Christine Straub	743482b6d3	Bug/635 unicode decode error eml (#739 ) * Adds functionality to extract charset info from eml files * Adds missed file-like object handling in detect_file_encoding * Adds functionality to replace the MIME encodings for eml files with one of the common encodings if a unicode error occurs * Organize the eml example files in the example-docs/eml directory	2023-06-17 00:52:13 +00:00
John	a9b9b873b1	feat: partition_tsv for tab separated value files (#758 ) * first pass at partition_tsv * working tests * create constants for tests and debug `make test` failure * make check and tidy * undo changes for testing locally * update changelog and version * fix bricks.rst * refactor if statements * make tidy * fix README and change try/except to if/else * update changelog and version * fix docstring	2023-06-15 18:50:53 +00:00
Matt Robinson	c82fdb6a89	feat: `partition_rst` for ReStructured Text documents (#725 ) * add example rst file * filetype detection for rst files * add partition_rst function * add partition_rst to auto * update readme * update docs * changelog and version * pandocs -> pandoc * fix typo	2023-06-12 19:31:10 +00:00
Matt Robinson	19ab6d960f	enhancement: handling for empty files in `detect_filetype` and `partition` (#710 ) * add empty filetype * add empty handling to partition * changelog and version	2023-06-09 16:07:50 -04:00
Yuming Long	80f0b4a132	Fix: Pass `strategy` parameter down from `partition` for `partition_image` (#708 ) * changelog and version * passing param down * test should be auto * doc nit * lint * update image output	2023-06-09 13:54:18 -04:00
John	b2b92ea79d	fix: filetype detection if a CSV has a text/plain MIME type (#691 ) * fix: Filetype detection if a CSV has a text/plain MIME type #621 * bug: fix csv detection and create _read_file_start_for_type_check func * fix: Make call to _is_text_file_a_csv from detect_filetype	2023-06-08 16:21:07 -04:00
Christine Straub	547bb38d86	fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 ) Add functionality to try other common encodings for html, xml files if an error related to the encoding is raised and the user has not specified an encoding. Change auto.py to have a None default for encoding Remove the unused parameter encoding from partition_pdf Add functionality to the read_txt_file utility function to handle file-like object from URL	2023-06-05 11:27:12 -07:00
qued	d3600dd5da	build(deps): update inference version (#662 ) Updated to the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay! --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-31 13:50:15 -05:00
Yuming Long	fc59a043b7	Chore: Support epub tests in docker image (#630 ) * docker works * more epub tests * changelog version * support epub + odt + rtf * update dockerfile * revert.. * install pandoc on ci env * pandoc docker grab based on arch * move arch into image * move back to base image	2023-05-26 15:38:48 -04:00
cragwolfe	c5d9469001	feat: add xls support (#632 ) Add support for older .XLS files from the partition function in unstructured.partition.auto. Note, this should also work on the centos7 unstructured image (with the requirements/*txt updates in this PR).	2023-05-26 01:55:32 -07:00
Matt Robinson	fda51d6ead	fix: add more mime types for csv (#620 )	2023-05-19 16:40:26 -05:00
Matt Robinson	21c821d651	feat: add `partition_csv` function (#619 ) * add csv into filetype detection * first pass on csv * add tests for csv * add csv to auto * version bump * update readme and docs * fix doc strings	2023-05-19 15:57:42 -04:00
Matt Robinson	23ff32cc42	feat: add `partition_xml` for XML files (#596 ) * first pass on partition_xml * add option to keep xml tags * added tests for xml * fix filename * update filenames * remove outdated readme * add xml to auto * version and changelog * update readme and docs * pass through include_metadata * update include_metadata description * add README back in * linting, linting, linting * more linting * spooled to bytes doesnt need to be a tuple * Add tests for newly supported filetypes * Correct metadata filetype * doc typo Co-authored-by: qued <64741807+qued@users.noreply.github.com> * typo fix Co-authored-by: qued <64741807+qued@users.noreply.github.com> * typo fix Co-authored-by: qued <64741807+qued@users.noreply.github.com> * keep_xml_tags -> xml_keep_tags --------- Co-authored-by: Alan Bertl <alan@unstructured.io> Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-05-18 15:40:12 +00:00
Matt Robinson	b8037118c4	feat: add `partition_xlsx` for MSFT Excel files (#594 ) * first pass on partition_xlsx * add support for files * add test for xlsx from filename * added filetype metadata * add xlsx to auto * remove fake excel from unsupported * version and changelog * update docs * update readme * fix removed file reference * fix some more tests * pass in metadata filename * add include_metadata flag	2023-05-16 19:40:40 +00:00
Matt Robinson	bd6a8a3a40	enhancement: add `file_directory` to element metadata (#585 ) * enhancement: add `file_directory` to element metadata * update msg test * exclude file_directory * update slack output * added file directory tests on partition_x paths	2023-05-15 18:25:39 -04:00
Yuming Long	5b6f11bb88	Chore(ingest): Add `--partition-strategy` parameter in CLI (#582 ) * change strategy arg default to auto in partition * passing --partition-strategy down * add strategy="hi_res" to test (default changed) * made an error on param name, added note	2023-05-15 19:26:53 +00:00
qued	55272eeceb	enhancement: filetype in metadata (#583 ) Adds filetype to metadata. I've created a decorator that adds metadata to a list of elements. This replaces some existing boilerplate, but also adds a nice layered approach to determining the filetype. Since in some cases several partition_ functions handle a file in various formats, the partition function that first touches a file will be the last one to alter its metadata, resulting in the correct filetype metadata. Tests are added to make sure: * When partition is used, any content type or auto file type detection will override file-specific partition function metadata * Both auto and file-specific partitioning gives the desired filetype metadata Won't work with image files currently... the plumbing is there to use the image format inferred by PIL, but we need to pull in the fix from this PR to unstructured-inference .	2023-05-15 13:23:19 -05:00
Matt Robinson	727d366a94	enhancement: auto strategy for PDFs and images (#578 ) * added functions for determining auto strategy * change default strategy to auto * tests for auto strategy * update docs * changelog and version * bump version * remove ingest file in wrong location * update jpg output * typo fix	2023-05-12 17:45:08 +00:00
Matt Robinson	8da1ddc6ec	enhancement: add method for getting datetime; cleanup filename attribute (#575 ) * added method for extracting datetime * change filename metadata to the base filename * fix filename metadata for msg * changelog and bump version * fix expected structured output * newline back in file * reset output file * update filename output * update test fixtures * update fixture	2023-05-12 11:33:01 -04:00
Matt Robinson	fae5f8fdde	feat: add `partition_odt` for open office docs (#548 ) * added filetype detection for odt * add function for partition odt documents * add odt files to auto * changelog and version * docs and readme * update installation docs * skip tests if not supported or in docker * import pytest * fix docs typos	2023-05-04 19:28:08 +00:00
Matt Robinson	9fdc310358	fix: update `detect_filetype` for JSONs with text/plain MIME type (#520 ) * check to see if text file is a json * add json check into filetype detection * added test for updated file detection logic * bytes/strings handling * changelog and version bump	2023-04-26 13:52:47 -04:00
qued	5b6640a55a	chore: change table param name (#513 ) Updated the parameter names that control whether we try to infer table structure.	2023-04-21 13:48:19 -05:00
qued	dc4147d7df	feat: extract tables (#503 ) Exposes table extraction through partition and partition_pdf.	2023-04-21 17:01:29 +00:00
Matt Robinson	6874df91ef	feat: allow users to pass OCR language into `partition` (#509 ) * pip-compile new reqs * bump inference version * add language to pdf and image calls * tests for passing in language * version bump and changelog * update docs * pass ocr_languages in auto * updated test fixtures * typo in doc string	2023-04-21 13:41:26 +00:00
cragwolfe	bfba2bb1eb	fix: workaround .json file detection with old libmagic installs (#493 ) Fixes issue where .json files were recognized as "text/plain" rather than "application/json" on the Unstructured image (and other installs that may have an older libmagic). Also adds missing json auto partition tests, including an xfail test for #492.	2023-04-17 23:11:21 -07:00
Matt Robinson	9c1c6a13f6	fix: updates markdown code to process markdown with embedded html (#480 ) * add carriage return to html if missing * test on markdown with embedded html * changelog and version * check for html parser * linting, linting, linting	2023-04-13 12:47:45 -04:00
Matt Robinson	b628fa8048	feat: allow headers in `partition` (#473 ) * feat: allow headers in `partition` * warning if header is set and url is not * update emoji test	2023-04-13 15:04:15 +00:00
Matt Robinson	e2e473dddd	feat: add `url` kwarg to `partition` (#470 ) * added url option to auto partition * add test for partition from url * version and changelog * update docs * add url to element metadata	2023-04-12 18:31:01 +00:00
Matt Robinson	7ec85272b7	feat: add `partition_rtf` for rich text files (#466 ) * refactor epub; add rtf * added test for rtf files * filetype detection for rtf files * add rtf to auto * update docs for group_broken_paragraphs * add rtf to docs * update file list in readme * update stage_for_transformers docs * changelog and version bump * skip rtf if in docker * skip test if rtf not supported * docs tweaks	2023-04-10 21:25:03 +00:00
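
The sectioning rules described in commit c58b261feb (#1304) can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the `Element` dataclass, `section_length`, and `split_into_sections` are hypothetical names, only a page change is modeled as a "metadata change", and the merge policy for short sections (fold a section into its predecessor while the predecessor is still under `combine_under_n_chars`) is an assumption. The parameter names and defaults are taken from the commit message.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element (not the library's class)."""

    category: str  # e.g. "Title" or "NarrativeText"
    text: str
    page_number: Optional[int] = None


def section_length(section: List[Element]) -> int:
    """Total number of characters of text in a section."""
    return sum(len(el.text) for el in section)


def split_into_sections(
    elements: List[Element],
    multipage_sections: bool = True,
    new_after_n_chars: int = 1500,
    combine_under_n_chars: int = 500,
) -> List[List[Element]]:
    """Group elements into sections, then merge short sections (illustrative only)."""
    sections: List[List[Element]] = []
    current: List[Element] = []

    for el in elements:
        starts_new = False
        if current:
            page_changed = el.page_number != current[-1].page_number
            starts_new = (
                el.category == "Title"  # a Title opens a new section
                or (page_changed and not multipage_sections)
                or section_length(current) > new_after_n_chars
            )
        if starts_new:
            sections.append(current)
            current = []
        # Elements are never split, so one long element can push a
        # section past new_after_n_chars.
        current.append(el)

    if current:
        sections.append(current)

    # Assumed merge policy: fold a section into the previous one while the
    # previous one is still shorter than combine_under_n_chars.
    combined: List[List[Element]] = []
    for section in sections:
        if combined and section_length(combined[-1]) < combine_under_n_chars:
            combined[-1] = combined[-1] + section
        else:
            combined.append(section)
    return combined
```

With the defaults, a long first section stays on its own while two short title-led sections that follow it are merged into one; setting `combine_under_n_chars=0` disables merging in this sketch.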


119 Commits