unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-12 08:27:46 +00:00

Author	SHA1	Message	Date
shreyanid	eb8ce89137	chore: function to map between standard and Tesseract language codes (#1421 ) ### Summary In order to convert between incompatible language codes from packages used for OCR, this change adds a function to map between any standard language codes and tesseract OCR specific codes. Users can input language information to `languages` in any Tesseract-supported langcode or any ISO 639 standard language code. ### Details - Introduces the [python-iso639](https://pypi.org/project/python-iso639/) package for matching standard language codes. Recompiles all dependencies. - If a language is not already supplied by the user as a Tesseract specific langcode, supplies all possible script/orthography variants of the language to the Tesseract OCR agent. ### Test Added many unit tests for a variety of language combinations, special cases, and variants. For general testing, call partition functions with any lang codes in the languages parameter (Tesseract or standard). for example, ``` from unstructured.partition.auto import partition elements = partition(filename="example-docs/layout-parser-paper.pdf", strategy="hi_res", languages=["en", "chi"]) print("\n\n".join([str(el) for el in elements])) ``` should supply eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert to Tesseract	2023-09-18 08:42:02 -07:00
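The mapping described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the function added in #1421: the two lookup tables and the name `to_tesseract_langcodes` are hypothetical stand-ins for what python-iso639 and the installed Tesseract language list would provide.

```python
# Hypothetical lookup tables, for illustration only. The actual change uses the
# python-iso639 package and the language codes supported by Tesseract.
ISO_TO_TESSERACT_PREFIX = {
    "en": "eng", "eng": "eng",
    "zh": "chi", "zho": "chi", "chi": "chi",
    "es": "spa", "spa": "spa",
}
TESSERACT_LANGCODES = [
    "chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert", "eng", "spa", "spa_old",
]


def to_tesseract_langcodes(languages):
    """Return a '+'-joined string of Tesseract codes for the requested languages."""
    resolved = []
    for lang in languages:
        if lang in TESSERACT_LANGCODES:
            # Already a Tesseract-specific code: pass it through unchanged.
            codes = [lang]
        else:
            prefix = ISO_TO_TESSERACT_PREFIX.get(lang.lower())
            if prefix is None:
                continue  # unknown language: skipped in this sketch
            # Supply every script/orthography variant of the language.
            codes = [
                code for code in TESSERACT_LANGCODES
                if code == prefix or code.startswith(prefix + "_")
            ]
        for code in codes:
            if code not in resolved:
                resolved.append(code)
    return "+".join(resolved)


print(to_tesseract_langcodes(["en", "chi"]))
```

With these assumed tables, `["en", "chi"]` resolves to the same string the commit message says is supplied to Tesseract.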
Amanda Cameron	a9f18eddb8	chore: adding test case for odt tables (#1434 ) ODT table extraction is happening! Just added to an existing example-doc and an accompanying test case.	2023-09-16 22:29:44 -07:00
Yao You	b534b2a6cd	Chore: bump inference package version to 0.5.28 and new release (#1355 ) This bump removes the preprocessing before table structure extraction and improves the OCR results for tables. --------- Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2023-09-15 18:26:15 -07:00
John	de4d496fcf	Fix bbox coordinates for ocr_only strategy (#1325 ) ### Summary Duplicate PR of #1259 because of issues with checks Closes #1227, which found that `nan` values were present in the coordinates being generated for some elements. This breaks logic out from `add_pytesseract_bbox_to_elements` to new functions `_get_element_box` and `convert_multiple_coordinates_to_new_system`. It also updates the logic to check that the current bounding box matches the first character of the element's text (so as to avoid the `~` characters that `pytesseract.image_to_boxes` includes but that are not present in `pytesseract.image_to_string`). ### Testing ``` from unstructured.partition.image import partition_image from PIL import Image, ImageDraw filename="example-docs/layout-parser-paper-with-table.jpg" elements = partition_image(filename=filename, strategy="ocr_only") image = Image.open(filename) draw = ImageDraw.Draw(image) for i, element in enumerate(elements): print(i, element.metadata.coordinates) if element.metadata.coordinates: draw.polygon(element.metadata.coordinates.points, outline="red", width=2) output = "example-docs/box-layout-parser-paper-with-table.jpg" image.save(output) image.close() ``` --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-09-15 15:11:16 -05:00
qued	0d61c98481	fix: Pass partition_image kwargs downstream (#1426 ) `partition_pdf` allows for passing a `model_name` parameter. Given the similarity between the image and PDF pipelines, the expected behavior is that `partition_image` should support the same parameter, but `partition_image` was unintentionally not passing along its `kwargs`. This was corrected by adding the kwargs to the downstream call. #### Testing: ```python from unstructured.partition.image import partition_image output1 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="detectron2_onnx") output2 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="yolox") # These shouldn't be the same, since they were produced using different models. assert output1 != output2 ``` The assertion should fail on `main`, but pass on this branch.	2023-09-15 15:09:58 -05:00
Amanda Cameron	50db2abd9f	fix: updating element types (#1394 ) This PR adds an arg to the html partition flow called `source_format` if anything other than "html" we will return non-HTML elements to conform with the file type we received. addresses: https://github.com/Unstructured-IO/unstructured/issues/726	2023-09-15 11:51:22 -05:00
Sebastian Laverde Alfonso	40b1d0d092	feat: improved chipper elements mapping and new category_depth metadata (#1308 ) Two changes: 1. Improved mapping of `chipper` element types `Headline` (to `Title`), `Subheadline` (to `Title`) and `Abstract` (to `NarrativeText`). 2. New element metadata `category_depth`: `None` unless the element is a `Headline` (`category_depth=1`) or a `Subheadline` (`category_depth=2`). The update of `category_depth` happens during the transform `normalize_layout_element`. --------- Co-authored-by: Yao You <theyaoyou@gmail.com> Co-authored-by: Yao You <yao@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: LaverdeS <LaverdeS@users.noreply.github.com> Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com> Co-authored-by: Benjamin Torres <benjamin@unstructured.io>	2023-09-15 14:43:17 +00:00
Newel H	cd704e873b	Feat: Create a naive hierarchy for elements (#1268 ) ## Summary By adding hierarchy to unstructured elements, users will have more information for implementing vector db/LLM chunking strategies. For example, text elements could be queried by their preceding title element. The hierarchy is implemented by a parent_id tag in the element's metadata. ### Features - Introduces a parent_id to ElementMetadata (The id of the parent element, not a pointer) - Creates a rule set for assigning hierarchies. Sensible default is assigned, with an optional override parameter - Sets element parent ids if there isn't an existing parent id or matches the ruleset ### How it works Hierarchies are assigned via a parent id field in element metadata. Elements are read sequentially and evaluated against a ruleset. For example take the following elements: 1. Title, "This is the Title" 2. Text, "this is the text" And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of 2 will be the id of 1. The algorithm for determining this is more complex and resolves several edge cases, so please read the code for further details. ### Schema Changes ``` @dataclass class ElementMetadata: coordinates: Optional[CoordinatesMetadata] = None data_source: Optional[DataSourceMetadata] = None filename: Optional[str] = None file_directory: Optional[str] = None last_modified: Optional[str] = None filetype: Optional[str] = None attached_to_filename: Optional[str] = None + parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None + category_depth: Optional[int] = None ... 
``` ### Testing ``` from unstructured.partition.auto import partition from typing import List elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto") for element in elements: print( f"Category: {getattr(element, 'category', '')}\n"\ f"Text: {getattr(element, 'text', '')}\n" f"ID: {element.id}\n" \ f"Parent ID: {element.metadata.parent_id}\n"\ f"Depth: {element.metadata.category_depth}\n" \ ) ``` ### Additional Notes Implementing this feature revealed a possibly undesired side-effect in how element metadata are processed. In `unstructured/partition/common.py` the `_add_element_metadata` is invoked as part of the `add_metadata_with_filetype` decorator for filetype partitioning. This method is intended to add additional information to the metadata generated with the element including filename and filetype, however the existing metadata is merged into a newly created metadata object rather than the other way around. Because of the way it's structured, new metadata fields can easily be forgotten and pose debugging challenges to developers. This likely warrants a new issue. I'm guessing that the implementation is done this way to avoid issues with deserializing elements, but could be wrong. --------- Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>	2023-09-14 11:23:16 -04:00
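The "How it works" example in the commit above (a Title followed by a Text element, with ruleset `{"title": ["text"]}`) can be sketched as below. This is a deliberately simplified illustration: the `Node` class and `assign_parents` function are hypothetical names, elements that already have a parent are left untouched, and the edge cases the commit message mentions are not handled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for an element with an id and a parent_id."""
    id: str
    category: str
    text: str
    parent_id: Optional[str] = None


def assign_parents(elements: List[Node], ruleset: Dict[str, List[str]]) -> None:
    """Set parent_id to the id of the nearest preceding element allowed to be a parent."""
    for index, element in enumerate(elements):
        if element.parent_id is not None:
            continue  # simplification: keep any existing parent id
        # Elements are read sequentially, so search backwards for the closest candidate.
        for candidate in reversed(elements[:index]):
            if element.category in ruleset.get(candidate.category, []):
                element.parent_id = candidate.id  # the parent's id, not a pointer
                break


elements = [
    Node(id="1", category="title", text="This is the Title"),
    Node(id="2", category="text", text="this is the text"),
]
assign_parents(elements, {"title": ["text"]})
print(elements[1].parent_id)
```

Here the second element ends up with the id of the first as its `parent_id`, matching the worked example in the commit message.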
Klaijan	00181b88df	feat: pdf auto strategy groups broken numbered and bullet list items(#1393 ) Summary Adds logic to combine broken numbered list for pdf fast strategy. Details Previously the document reads the numbered list items part of the `layout-parser-paper-fast.pdf` file as: ``` '1. An oﬀ-the-shelf toolkit for applying DL models for layout detection, character' 'recognition, and other DIA tasks (Section 3)' '2. A rich repository of pre-trained neural network models (Model Zoo) that' 'underlies the oﬀ-the-shelf usage' '3. Comprehensive tools for eﬃcient document image data annotation and model' 'tuning to support diﬀerent levels of customization' '4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)' ``` Now it reads: ``` '1. An oﬀ-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3)' '2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the oﬀ-the-shelf usage' '3. Comprehensive tools for eﬃcient document image data annotation and model tuning to support diﬀerent levels of customization' '4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)' ``` The added logic leverages `ElementType` and `coordinates` to determine whether the following line is part of the previously detected `ListItem` or not. Test Adds a test that checks that the number of elements is smaller than in the original version with the broken numbered list. The test also checks whether the first detected numbered list item ends with the previously broken line. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-09-13 21:30:06 +00:00
shreyanid	d87c83d7b6	chore: refactor languages parameter for text_type functions (#1399 ) ### Summary In order to support language functionality other than Tesseract OCR, we want to represent languages provided for either partitioning accuracy or OCR as a standard list of langcodes as strings. To identify element types such as NarrativeText and Title, continue the refactor into functions that use language checks to determine those potential classifications. ### Details Replaces `language` with `languages` (a list of strings) as a parameter to `is_possible_narrative_text` and `is_possible_title`. ### Test Call `is_possible_narrative_text` and `is_possible_title` with text in a variety of languages and different inputs for `languages`. The resulting element classifications should be no different from the current outputs. ex: see `test_text_type_handles_multi_language_examples` in `test_unstructured/partition/test_text_type.py`.	2023-09-13 19:46:36 +00:00
shreyanid	1b7c99d878	chore: refactor languages parameter for auto partition (#1400 ) ### Summary In order to support language functionality other than Tesseract OCR, we want to represent languages provided for either partitioning accuracy or OCR as a standard list of langcodes as strings. ### Details Follows the pattern established with PDFs in #1334. Adds languages (a list of strings) as a parameter to partition in auto.py. Marks ocr_languages for deprecation. ### Test Call partition with a variety of filetypes (especially pdfs/images), strategies, languages, or ocr_languages. - inclusion of ocr_languages as a parameter should display a deprecation warning and may proceed with partitioning if no other conflicts - the other valid call outputs should be no different from the current outputs	2023-09-13 13:07:28 -04:00
shreyanid	2b571eb9a3	chore: refactor languages parameter for image partition functions (#1395 ) Adds languages (a list of strings) as a parameter to `partition_image`. Marks ocr_languages for deprecation.	2023-09-13 04:11:58 +00:00
Amanda Cameron	7fd81dc7df	Table processing test for RTF (#1388 ) This PR does two things: 1. Adds test case (and alters sample doc) for rtf and epub files with table 2. Adds `xls/x` file extension to `skip_infer_table_types` default list --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2023-09-12 18:27:05 -07:00
qued	6595632a57	enhancement: backup text categorization (#1322 ) Currently there are some cases when `partition_pdf` is run using the `hi_res` strategy, in which elements can come back with category `UncategorizedText`. This happens when the detection model fails to detect an element, but we're able to find it anyway either because it was embedded in the PDF, or we found it using OCR. This commit is to allow for attempting to categorize these uncategorized elements using our text-based classification function, `element_from_text`.	2023-09-12 20:32:48 +00:00
shreyanid	c2853e4ac3	refactor languages parameter for pdf partition functions (#1334 ) ### Summary In order to support language functionality other than Tesseract OCR, we want to represent languages provided for either partitioning accuracy or OCR as a standard list of langcodes as strings. ### Details Adds `languages` (a list of strings) as a parameter to pdf partitioning functions. Marks `ocr_languages` for deprecation. Adds a new file `lang.py` for language-related helper functions. Coming up: langcode standardization, language detection ### Test Call `partition_pdf` or `partition_pdf_or_image` with a variety of strategies, languages, or `ocr_languages`. - inclusion of `ocr_languages` as a parameter should display a deprecation warning - the other valid call outputs should be no different from the current outputs. ex: ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf(filename="example-docs/DA-1p.pdf", strategy="hi_res", languages=["eng", "spa"]) print("\n\n".join([str(el) for el in elements])) ```	2023-09-12 16:15:26 +00:00
John	c58b261feb	chunk_by_title decorator (#1304 ) ### Summary Partial solution to #1185. Related to #1222. Creates decorator from `chunk_by_title` cleaning brick. Breaks a document into sections based on the presence of Title elements. Also starts a new section under the following conditions: - If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to True. - If the length of the section exceeds `new_after_n_chars` characters. The default is 1500. The chunking function does not split individual elements, so it's possible for a section to exceed that threshold if an individual element is over `new_after_n_chars` characters, which could occur with a long NarrativeText element. Combines sections under these conditions - Sections under `combine_under_n_chars` characters are combined. The default is 500. ### Testing from unstructured.partition.html import partition_html url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0" chunks = partition_html(url=url, chunking_strategy="by_title") for chunk in chunks: print(chunk) print("\n\n" + "-"*80) input()	2023-09-11 21:00:14 +00:00
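The sectioning rules in the commit above can be approximated with the sketch below. It is an illustration under stated assumptions, not the library's `chunk_by_title`: elements are plain `(category, text)` tuples, metadata changes and `multipage_sections` are ignored, and the combining rule (merge a small section into the previous one when both are under the threshold) is only one possible reading of "sections under `combine_under_n_chars` characters are combined". The name `chunk_by_title_sketch` is hypothetical.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (category, text); a stand-in for real element objects


def section_length(section: List[Element]) -> int:
    """Total number of characters of text in a section."""
    return sum(len(text) for _, text in section)


def chunk_by_title_sketch(
    elements: List[Element],
    new_after_n_chars: int = 1500,
    combine_under_n_chars: int = 500,
) -> List[List[Element]]:
    # Pass 1: start a new section at each Title, or once the current
    # section has already grown past new_after_n_chars.
    sections: List[List[Element]] = []
    current: List[Element] = []
    for category, text in elements:
        starts_new = category == "Title" or section_length(current) > new_after_n_chars
        if current and starts_new:
            sections.append(current)
            current = []
        # Individual elements are never split, so a section may exceed the threshold.
        current.append((category, text))
    if current:
        sections.append(current)

    # Pass 2 (assumed rule): merge a small section into the previous small section.
    combined: List[List[Element]] = []
    for section in sections:
        if (
            combined
            and section_length(combined[-1]) < combine_under_n_chars
            and section_length(section) < combine_under_n_chars
        ):
            combined[-1].extend(section)
        else:
            combined.append(section)
    return combined


sample = [
    ("Title", "A"), ("NarrativeText", "x" * 600),
    ("Title", "B"), ("NarrativeText", "short"),
    ("Title", "C"), ("NarrativeText", "tiny"),
]
for chunk in chunk_by_title_sketch(sample):
    print([category for category, _ in chunk])
```

In this sample the first section is long enough to stand alone, while the two short sections that follow are merged into one chunk.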
Amanda Cameron	a501d1d18f	Adding table extraction to partition_html (#1324 ) Adding table extraction to HTML partitioning. This PR utilizes 'table' HTML elements to extract and parse HTML tables and return them in partitioning. ``` # checkout this branch, go into ipython shell In [1]: from unstructured.partition.html import partition_html In [2]: path_to_html = "{html sample file with table}" In [3]: elements = partition_html(path_to_html) ``` you should see the table in the elements list!	2023-09-11 11:14:11 -07:00
cragwolfe	d0749d181f	fix: avoid PDF sorting error on negative coords (#1361 ) The default sorting algorithm for PDF's, "xycut," would cause an error when partitioning a document if Y coordinate points were negative. This change checks for that condition (or more broadly, any negative coordinates) and falls back to the "basic" sort if that is the case. This PR does not address the underlying issue of "bad points" which still should be investigated. However, the sorting code should be less brittle to unexpected bounding boxes in the first case. Resolves: https://github.com/Unstructured-IO/unstructured/issues/1296	2023-09-10 19:29:49 -07:00
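The fallback described in the commit above amounts to a guard before sorting. A minimal sketch, assuming bounding boxes are `(x1, y1, x2, y2)` tuples and using a hypothetical helper name `choose_sort_mode`; the sort mode strings are taken from the commit message:

```python
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def choose_sort_mode(bboxes: Iterable[BBox], requested: str = "xycut") -> str:
    """Fall back to the 'basic' sort if any coordinate is negative."""
    if requested == "xycut" and any(value < 0 for box in bboxes for value in box):
        return "basic"
    return requested


print(choose_sort_mode([(0, 0, 10, 10), (5, -3, 20, 8)]))
```

The guard does not repair the "bad points" themselves; it only keeps the sort from failing on them, as the commit message notes.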
pravin-unstructured	8641fe39dc	Add Model Probabilities to Hi-Res strategy MetaData for Images + PDFs. (#1323 ) If a layout model is used from unstructured-inference, you get back class probabilities in the element metadata from partition. extra-pdf-image-in in requirements already has the newest version of unstructured-inference in there without a pinned version. Is there any place else that the unstructured-inference version needs to be updated to the required release version, 0.5.22?	2023-09-07 22:56:43 -04:00
Matt Robinson	22974f61ce	fix: separate elements by `<br>` tag in `partition_html` (#1314 ) ### Summary Closes #1230. Updates `partition_html` to split on `<br>` tags that appear within text elements. ### Testing The following code previously produced one giant element on `main`. ```python from unstructured.partition.html import partition_html filename = "example-docs/ideas-page.html" elements = partition_html(filename=filename) len(elements) # Should be 4 print("\n\n".join([str(el) for el in elements])) ``` The output should be: ```python January 2023 (Someone fed my essays into GPT to make something that could answer questions based on them, then asked it where good ideas come from. The answer was ok, but not what I would have said. This is what I would have said.) The way to get new ideas is to notice anomalies: what seems strange, or missing, or broken? You can see anomalies in everyday life (much of standup comedy is based on this), but the best place to look for them is at the frontiers of knowledge. Knowledge grows fractally. From a distance its edges look smooth, but when you learn enough to get close to one, you'll notice it's full of gaps. These gaps will seem obvious; it will seem inexplicable that no one has tried x or wondered about y. In the best case, exploring such gaps yields whole new fractal buds. ```	2023-09-07 13:16:31 +00:00
Yao You	1a0b737e9c	revert pdf changes and add new pdf for empty page testing (#1255 ) - revert the layout parser fast pdf file to original with just two pages - add a new file that has one empty page and one page says "this page is intentionally left blank" for tests	2023-09-01 22:33:06 +00:00
Yao You	9191be7ae8	[issue 1237] fix empty coordinates break sorting bug (#1242 ) This PR resolves #1237 by checking if any coordinates are `None`; if yes do not attempt to sort with xy cut method and return the list as is.	2023-09-01 03:15:10 +00:00
Yao You	27773132b7	[issue 1247] fix element and bbox mismatch bug (#1250 ) This PR resolves #1247 by using the matching elements and bbox for coordinate computation. This PR also updates the example doc `example-docs/layout-parser-paper-fast.pdf` so that it includes a true blank page and a page with text "this page is intentionally left blank". This change helps us test: - differences between fast and hi_res - code handling empty pages in between pages with contents (which triggers the bug found in #1247 ) Lastly, this PR updates the names of the variables inside `_partition_pdf_or_image_with_ocr` so that matching inputs all start with `_` like `_elements`, `_text`, and `_bboxes` to improve readability. This change also improves partition performance for multi-page pdfs as it reduces the number of iterations inside `add_pytesseract_bbox_to_elements`. Testing locally on m2 mac + Rocky docker shows it reduces partition time for DA-619p.pdf file from around 1min to around 23s.	2023-08-30 23:34:55 +00:00
Matt Robinson	c49df62967	feat: `partition_xml` infers element type on each leaf node (#1249 ) ### Summary Closes #1229. Updates `partition_xml` so that the element type is inferred on each leaf node when `xml_keep_tags=False` instead of delegating splitting and partitioning to `partition_text`. If `xml_keep_tags=True`, the file is treated like a text file still and partitioning is still delegated to `partition_text`. Also adds the option to pass `text` as an input to `partition_xml`. ### Testing Create a `parrots.xml` file that looks like: ```xml <xml><parrot><name>Conure</name><description>A conure is a very friendly bird. Conures are feathery and like to dance.</description></parrot></xml> ``` Run: ```python from unstructured.partition.xml import partition_xml from unstructured.staging.base import convert_to_dict elements = partition_xml(filename="parrots.xml") convert_to_dict(elements) ``` On `main`, the output is the following. Notice how the `<name>` tag incorrectly gets merged into `<description>` in the first element. ```python [{'element_id': '7ae4074435df8dfcefcf24a4e6c52026', 'metadata': {'file_directory': '/home/matt/tmp', 'filename': 'parrots.xml', 'filetype': 'application/xml', 'last_modified': '2023-08-30T14:21:38'}, 'text': 'Conure A conure is a very friendly bird.', 'type': 'NarrativeText'}, {'element_id': '859ecb332da6961acd2fb6a0185d1549', 'metadata': {'file_directory': '/home/matt/tmp', 'filename': 'parrots.xml', 'filetype': 'application/xml', 'last_modified': '2023-08-30T14:21:38'}, 'text': 'Conures are feathery and like to dance.', 'type': 'NarrativeText'}] ``` On the feature branch, the output is the following, and the tags are correctly separated.
```python [{'element_id': '5512218914e4eeacf71a9cd42c373710', 'metadata': {'file_directory': '/home/matt/tmp', 'filename': 'parrots.xml', 'filetype': 'application/xml', 'last_modified': '2023-08-30T14:21:38'}, 'text': 'Conure', 'type': 'Title'}, {'element_id': '113bf8d250c2b1a77c9c2caa4b812f85', 'metadata': {'file_directory': '/home/matt/tmp', 'filename': 'parrots.xml', 'filetype': 'application/xml', 'last_modified': '2023-08-30T14:21:38'}, 'text': 'A conure is a very friendly bird.\n' '\n' 'Conures are feathery and like to dance.', 'type': 'NarrativeText'}] ```	2023-08-30 17:07:10 -04:00
Klaijan	675a10ea69	fix: update test_json to not use auto partition (#1187 ) Update `test_json` to not use auto partition due to dependencies. Previously, running `test_json` required a full installation of the requirements in order to read file types including, but not limited to, docx and pptx, so the test raised an error with the base installation. With the update, this fix also adds checks to other test files for invariance with `elements_to_json`.	2023-08-29 16:59:26 -04:00
Klaijan	4b830e3b05	fix: return ocr coordinates points as tuple (#1219 ) The `add_pytesseract_bbox_to_elements` function returned `metadata.coordinates.points` as a `List`, whereas other strategies returned a `Tuple`. This change makes it return a `Tuple` for consistency. Previously: ``` element.metadata.coordinates.points = [ (x1, y1), (x2, y2), (x3, y3), (x4, y4), ] ``` Currently: ``` element.metadata.coordinates.points = ( (x1, y1), (x2, y2), (x3, y3), (x4, y4), ) ```	2023-08-28 13:31:55 -04:00
Matt Robinson	07f76275f1	feat: detect PGP encrypted content in `partition_email` and `partition_msg` (#1205 ) ### Summary Closes #1018. Enables `partition_email` and `partition_msg` to detect if an email has PGP encrypted content. Based on the specification in [RFC 2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based on the example email in the spec. If PGP encrypted content is detected, a warning is emitted and an empty list of elements is returned. ### Testing ```python from unstructured.partition.email import partition_email filename = "example-docs/eml/fake-encrypted.eml" partition_email(filename=filename) ``` ```python from unstructured.partition.msg import partition_msg filename = "example-docs/fake-encrypted.msg" partition_msg(filename=filename) ```	2023-08-25 17:09:25 -07:00
John	5872fa23c3	Extract coordinates from PDFs and images when using OCR only strategy (#1163 ) ### Summary Closes #983 Creates new function `add_pytesseract_bbox_to_elements` Fixes typos in docstrings ### Testing ``` from unstructured.partition.image import partition_image from PIL import Image, ImageDraw png_filename="example-docs/english-and-korean.png" png_elements = partition_image(filename=png_filename, strategy="ocr_only") png_image = Image.open(png_filename) draw = ImageDraw.Draw(png_image) draw.polygon(png_elements[0].metadata.coordinates.points, outline="red", width=2) draw.polygon(png_elements[1].metadata.coordinates.points, outline="red", width=2) draw.polygon(png_elements[2].metadata.coordinates.points, outline="red", width=2) output = "example-docs/english-and-korean-box.png" png_image.save(output) png_image.close() ```	2023-08-25 05:32:12 +00:00
Matt Robinson	c578b85699	fix: respect `<pre>` tag order in `partition_html` (#1197 ) ### Summary Closes #1184. Updates `partition_html` to respect the ordering of `<pre>` tags in HTML documents. ### Testing The elements in the following example should be in the correct order. ```python from unstructured.partition.html import partition_html html_text = """ <pre>The Big Brown Bear</pre> <div>The big brown bear is growling.</div> <pre>The big brown bear is sleeping.</pre> <div>The Big Blue Bear</div> """ elements = partition_html(text=html_text) print("\n\n".join([str(el) for el in elements])) ```	2023-08-25 04:14:48 +00:00
Christine Straub	483b09b3c9	Feat/1136 elements ordering for pdf (#1161 ) ### Summary Address [#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for `hi_res` and `fast` strategies. The `ocr_only` strategy does not include coordinates. - add functionality to switch sort mode between the current `basic` sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies - add the script to evaluate the `xy-cut` sorting approach - add jupyter notebook to provide evaluation and visualization for the `xy-cut` sorting approach ### Evaluation ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy> ``` Here, the file should be under the project root directory. For example, ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast ```	2023-08-24 17:46:19 -07:00
Klaijan	1524841cd9	feat: supports multipage tiff (#1131 ) Add test case test_partition_image_with_multipage_tiff that reads multipage TIFF file and - confirms that the function reads all the pages in the TIFF. - page number is added to the metadata This PR is branched from and developed on top of 6d6be99 commit.	2023-08-24 15:12:50 +00:00
Matt Robinson	cdae53cc29	chore: deprecation warning for `file_filename` (#1191 ) ### Summary Closes #1007. Adds a deprecation warning for the `file_filename` kwarg to `partition`, `partition_via_api`, and `partition_multiple_via_api`. Also catches a warning in `ebooklib` that we do not want to emit in `unstructured`. ### Testing ```python from unstructured.partition.auto import partition filename = "example-docs/winter-sports.epub" # Should not emit a warning with open(filename, "rb") as f: elements = partition(file=f, metadata_filename="test.epub") # Should be test.epub elements[0].metadata.filename # Should emit a warning with open(filename, "rb") as f: elements = partition(file=f, file_filename="test.epub") # Should be test.epub elements[0].metadata.filename # Should raise an error with open(filename, "rb") as f: elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub") ```	2023-08-24 07:02:47 +00:00
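The three cases in the testing snippet above (no warning, warning, error) follow a common deprecation pattern, sketched below. The helper name `resolve_metadata_filename`, the warning category, and the exception type are assumptions for illustration; the commit message does not specify them.

```python
import warnings
from typing import Optional


def resolve_metadata_filename(
    metadata_filename: Optional[str] = None,
    file_filename: Optional[str] = None,
) -> Optional[str]:
    """Prefer metadata_filename; accept the deprecated file_filename with a warning."""
    if metadata_filename and file_filename:
        # Assumed error type: both kwargs at once is ambiguous.
        raise ValueError("Only one of metadata_filename and file_filename is allowed.")
    if file_filename:
        warnings.warn(
            "file_filename is deprecated; use metadata_filename instead.",
            DeprecationWarning,  # assumed category
            stacklevel=2,  # attribute the warning to the caller
        )
        return file_filename
    return metadata_filename


print(resolve_metadata_filename(metadata_filename="test.epub"))
```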
Charles	1ddf542e14	fix: Don't call extractable_elements if strategy is ocr_only (#1160 ) - fixes #1079 where partitioning is happening twice in the case of `strategy="ocr_only"` - only calls `extractable_elements` if we can predetermine that `ocr_only` is not a possible strategy even if it was the intended strategy. - Adds additional assertion test that `_partition_pdf_or_image_with_ocr` is not called when falling back to `fast` from `ocr_only`	2023-08-22 19:43:33 -07:00
Austin Walker	e7d189fcc8	chore: Bump inference and set default ocr_mode to entire_page (#1172 ) * pip-compile in order to bump unstructured-inference * Set the default `ocr_mode` back to `entire_page` now that [this error](https://github.com/Unstructured-IO/unstructured-inference/pull/183) is addressed * Explicitly add `sphinx-tabs` to `build.in`. This file provides `docs/requirements.txt`. * Remove a pinned `pydantic` version * Fix a makefile command to `pip-compile` a missing ingest file.	2023-08-22 16:05:02 -07:00
Matt Robinson	ad595d32f6	enhancement: tell users to install missing extras (#1167 ) ### Summary Updates `partition` to let users know to install the appropriate extras if they're missing. Prior to this PR, users would get an exception stating `partition_pdf` (or whichever function that requires extras) does not exist. ### Testing First `pip uninstall ebooklib`. Then run ```python from unstructured.partition.auto import partition partition(filename="example-docs/winter-sports.epub") ``` The error should look like ```python ImportError: partition_epub is not available. Install the epub dependencies with pip install "unstructured[epub]" ```	2023-08-22 03:00:21 +00:00
Newel H	e4aa7373e2	test: create CI pipelines for verifying base and extras pass respective tests (#1137 ) Summary Closes #747 * Create CI Pipeline for running text, xml, email, and html doc tests against the library installed without extras * Create CI Pipeline for running each library extra against their respective tests	2023-08-19 12:56:13 -04:00
John	69edffb0c0	bug: update partition_msg and partition_email so attachments also receive metadata_last_modified kwarg (#1134 ) ### Summary Closes #1027 The msg test in question was no longer failing after removing the quick-fix and comment explaining the issue. However, the test was not functioning as intended. Test was refactored to appropriately test `metadata_last_modified` of attachments. `partition_msg` was then updated to pass `metadata_last_modified` to `attachment_partitioner`. The same was done for email partitioning. ### Testing ``` from unstructured.partition.text import partition_text from unstructured.partition.msg import partition_msg from unstructured.partition.email import partition_email filename="example-docs/fake-email-attachment.msg" elements = partition_msg(filename=filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00") # previously, these were different values because last_modified wasn't being updated in attachments elements[1].metadata.last_modified elements[-1].text elements[-1].metadata.last_modified email_filename="example-docs/eml/fake-email-attachment.eml" email_elements = partition_email(filename=email_filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00") email_elements[1].metadata.last_modified email_elements[-1].text email_elements[-1].metadata.last_modified ```	2023-08-18 23:21:11 +00:00
Austin Walker	dd243b4fd9	chore: pass ocr_mode in partition_pdf_or_image (#1154 ) Set to individual_blocks for now to work around [this bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179). I verified by printing the current ocr_mode in inference. The `entire_page` default is overridden. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: awalker4 <awalker4@users.noreply.github.com>	2023-08-18 20:59:08 +00:00
cragwolfe	1456f06b2d	chore: skip consistently failing test in main (#1150) This test fails because the API returns "fast" results when "hi_res" is requested, which is tracked in this ticket: https://github.com/Unstructured-IO/unstructured-api/issues/188. The failure only showed up on the `main` branch, per the commented out `pytest` skips.	2023-08-18 10:06:17 -07:00
John	9f7bd6127b	enhancement: Add `include_header` kwarg for xlsx, default True (#1125) Closes GitHub issue #1121. Adds the `include_header` kwarg to `partition_xlsx` and changes the default behavior to True.	2023-08-17 04:16:23 +00:00
Christine Straub	0e887cc36b	Feat/1060 update metadata fields (#1099 ) Closes Github Issue #1060. * update the metadata field links * update the metadata field emphasized_texts	2023-08-16 04:33:06 +00:00
Mike Lay	79a1eb8683	Handle inline and lacking filename (#1109 ) Handle Content-Disposition: inline and attachment without filename * Add new email test example and test with Content-Disposition: inline. * Move attachment_info above for loop so it is always defined * Check if item is inline as well as attachment as these both lack an = character to split on * Create filename if filename is not specified and write file. * Update list_attachments with new filename	2023-08-14 18:38:53 +00:00
Mike Lay	2e0ab86c6a	Fix attachments with `=` in filename (#1110 ) Fix attachments with = in filename * Limit split to first match of = to prevent creating a list of more than two parts * Add example email with attachment name and test for issue	2023-08-13 20:35:18 -07:00
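The two attachment fixes above (handling `inline`/`attachment` tokens that have no `=`, supplying a filename when none is given, and splitting only on the first `=`) can be illustrated with a minimal sketch. This is not the library's code: the function name `parse_content_disposition` and the fallback name `"unknown"` are hypothetical and chosen only for illustration.

```python
from typing import Dict


def parse_content_disposition(header: str, default_filename: str = "unknown") -> Dict[str, str]:
    """Hypothetical helper: parse a Content-Disposition value into a dict."""
    info: Dict[str, str] = {}
    for item in header.split(";"):
        item = item.strip()
        if not item:
            continue
        if "=" not in item:
            # Bare tokens such as "attachment" or "inline" have nothing to split on.
            info["disposition"] = item.lower()
            continue
        # maxsplit=1 keeps any further "=" characters inside the value,
        # so a filename like "a=b.txt" is not cut into three parts.
        key, value = item.split("=", 1)
        info[key.strip().lower()] = value.strip().strip('"')
    # Assumed fallback: make up a filename when the header does not carry one.
    info.setdefault("filename", default_filename)
    return info
```

Calling it with `'attachment; filename="a=b.txt"'` keeps the full filename, while a bare `"inline"` header still yields a usable (placeholder) filename.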
Christine Straub	fc2699ff06	Fix/1057 etree parser error tsv (#1106 ) * feat: always use `soupparser_fromstring` to parse `html text` which gracefully handles emoji * chore: update changelog & version	2023-08-14 01:22:36 +00:00
Christine Straub	4a3176885f	Fix/1057 etree parser error xlsx (#1094 ) * feat: add functionality to check if a string contains any emoji characters * feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji * chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use * chore: update changelog & version * chore: update changelog & version * chore: update dependencies * test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename` * chore: update changelog & version * feat: add functionality to switch html text parser based on whether the html text contains emoji * chore: update changelog & version * fix lint errors * test: revert the `EXPECTED_XLS_TEXT_LEN` value back * feat: always use `soupparser_fromstring` to parse `html text` * fix lint error	2023-08-13 12:20:33 -07:00
cragwolfe	02af625b93	chore: fix fickle test to not be so time sensitive (#1105)	2023-08-13 10:58:46 -07:00
John	f63a66dbef	Capture section and chapter in the metadata for epubs under `epub_section` (#1005) Captures the section and chapter of each epub element in the metadata under `epub_section`. Closes GitHub issue #459.	2023-08-12 21:02:06 +00:00
Matt Robinson	fa5a3dbd81	feat: `unique_element_ids` kwarg for UUID elements (#1085 ) * added kwarg for unique elements * test for unique ids * update docs * changelog and version	2023-08-11 11:02:37 +00:00
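As a rough illustration of what a `unique_element_ids` switch changes, the sketch below contrasts a content-derived ID with a UUID. It assumes the default ID is derived from a hash of the element text, which the commit message does not state; the function `element_id` is hypothetical and is not the library's implementation.

```python
import hashlib
import uuid


def element_id(text: str, unique_element_ids: bool = False) -> str:
    """Hypothetical helper: return an ID for an element with the given text."""
    if unique_element_ids:
        # A random UUID differs on every call, even for identical text.
        return str(uuid.uuid4())
    # Assumed default: a deterministic ID derived from the text itself,
    # so two elements with the same text share the same ID.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]
```

With the flag off, repeated text (for example a page header) produces the same ID each time; with it on, every element gets a distinct ID.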
Christine Straub	d26ab1deac	fix: etree parser error (#1077 ) * feat: add functionality to check if a string contains any emoji characters * feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji * chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use * chore: update changelog & version * chore: update changelog & version * chore: update dependencies * test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename` * chore: update changelog & version	2023-08-10 23:28:57 +00:00
cragwolfe	6779918406	build(release): bump unstructured-inference (#1074 ) * build(release): bump unstructured-inference Related to downstream issue: Unstructured-IO/unstructured-api#182 And upstream PR: Unstructured-IO/unstructured-inference#165 --------- Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>	2023-08-10 20:57:46 +00:00

417 Commits