unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-13 20:15:54 +00:00

Author	SHA1	Message	Date
shreyanid	eb8ce89137	chore: function to map between standard and Tesseract language codes (#1421) ### Summary In order to convert between incompatible language codes from packages used for OCR, this change adds a function to map between any standard language codes and Tesseract OCR specific codes. Users can input language information to `languages` in any Tesseract-supported langcode or any ISO 639 standard language code. ### Details - Introduces the [python-iso639](https://pypi.org/project/python-iso639/) package for matching standard language codes. Recompiles all dependencies. - If a language is not already supplied by the user as a Tesseract specific langcode, supplies all possible script/orthography variants of the language to the Tesseract OCR agent. ### Test Added many unit tests for a variety of language combinations, special cases, and variants. For general testing, call partition functions with any lang codes in the `languages` parameter (Tesseract or standard). For example, ``` from unstructured.partition.auto import partition elements = partition(filename="example-docs/layout-parser-paper.pdf", strategy="hi_res", languages=["en", "chi"]) print("\n\n".join([str(el) for el in elements])) ``` should supply eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert to Tesseract	2023-09-18 08:42:02 -07:00
Yao You	b534b2a6cd	Chore: bump inference package version to 0.5.28 and new release (#1355) This bump removes the preprocessing before table structure extraction and improves the OCR results for tables. --------- Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2023-09-15 18:26:15 -07:00
John	de4d496fcf	Fix bbox coordinates for ocr_only strategy (#1325) ### Summary Duplicate PR of #1259 because of issues with checks. Closes #1227, which found that `nan` values were present in the coordinates being generated for some elements. This breaks logic out from `add_pytesseract_bbox_to_elements` to new functions `_get_element_box` and `convert_multiple_coordinates_to_new_system`. It also updates the logic to check that the current bounding box matches the first character of the element's text (so as to avoid the `~` characters that `pytesseract.image_to_boxes` includes but that are not present in `pytesseract.image_to_string`). ### Testing ``` from unstructured.partition.image import partition_image from PIL import Image, ImageDraw filename="example-docs/layout-parser-paper-with-table.jpg" elements = partition_image(filename=filename, strategy="ocr_only") image = Image.open(filename) draw = ImageDraw.Draw(image) for i, element in enumerate(elements): print(i, element.metadata.coordinates) if element.metadata.coordinates: draw.polygon(element.metadata.coordinates.points, outline="red", width=2) output = "example-docs/box-layout-parser-paper-with-table.jpg" image.save(output) image.close() ``` --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-09-15 15:11:16 -05:00
qued	0d61c98481	fix: Pass partition_image kwargs downstream (#1426) `partition_pdf` allows for passing a `model_name` parameter. Given the similarity between the image and PDF pipelines, the expected behavior is that `partition_image` should support the same parameter, but `partition_image` was unintentionally not passing along its `kwargs`. This was corrected by adding the kwargs to the downstream call. #### Testing: ```python from unstructured.partition.image import partition_image output1 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="detectron2_onnx") output2 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="yolox") # These shouldn't be the same, since they were produced using different models. assert output1 != output2 ``` The assertion should fail on `main`, but pass on this branch.	2023-09-15 15:09:58 -05:00
shreyanid	2b571eb9a3	chore: refactor languages parameter for image partition functions (#1395) Adds `languages` (a list of strings) as a parameter to `partition_image`. Marks `ocr_languages` for deprecation.	2023-09-13 04:11:58 +00:00
John	c58b261feb	chunk_by_title decorator (#1304) ### Summary Partial solution to #1185. Related to #1222. Creates decorator from `chunk_by_title` cleaning brick. Breaks a document into sections based on the presence of Title elements. Also starts a new section under the following conditions: - If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to True. - If the length of the section exceeds `new_after_n_chars` characters. The default is 1500. The chunking function does not split individual elements, so it's possible for a section to exceed that threshold if an individual element is over `new_after_n_chars` characters, which could occur with a long NarrativeText element. Combines sections under these conditions: - Sections under `combine_under_n_chars` characters are combined. The default is 500. ### Testing from unstructured.partition.html import partition_html url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0" chunks = partition_html(url=url, chunking_strategy="by_title") for chunk in chunks: print(chunk) print("\n\n" + "-"*80) input()	2023-09-11 21:00:14 +00:00
pravin-unstructured	8641fe39dc	Add Model Probabilities to Hi-Res strategy MetaData for Images + PDFs. (#1323) If a layout model is used from unstructured-inference, you get back class probabilities in the element metadata from partition. extra-pdf-image-in in requirements already has the newest version of unstructured-inference in there without a pinned version. Is there any place else that the unstructured-inference version needs to be updated to the required release version, 0.5.22?	2023-09-07 22:56:43 -04:00
Klaijan	675a10ea69	fix: update test_json to not use auto partition (#1187) Update `test_json` to not use auto partition due to dependencies. Previously, running `test_json` required a full installation of the requirements in order to read file types including, but not limited to, docx and pptx, so the test raised an error with the base installation. This fix also adds checks to other test files to verify their invariance with `elements_to_json`.	2023-08-29 16:59:26 -04:00
John	5872fa23c3	Extract coordinates from PDFs and images when using OCR only strategy (#1163) ### Summary Closes #983. Creates new function `add_pytesseract_bbox_to_elements`. Fixes typos in docstrings. ### Testing ``` from unstructured.partition.image import partition_image from PIL import Image, ImageDraw png_filename="example-docs/english-and-korean.png" png_elements = partition_image(filename=png_filename, strategy="ocr_only") png_image = Image.open(png_filename) draw = ImageDraw.Draw(png_image) draw.polygon(png_elements[0].metadata.coordinates.points, outline="red", width=2) draw.polygon(png_elements[1].metadata.coordinates.points, outline="red", width=2) draw.polygon(png_elements[2].metadata.coordinates.points, outline="red", width=2) output = "example-docs/english-and-korean-box.png" png_image.save(output) png_image.close() ```	2023-08-25 05:32:12 +00:00
Klaijan	1524841cd9	feat: supports multipage tiff (#1131) Add test case `test_partition_image_with_multipage_tiff` that reads a multipage TIFF file and - confirms that the function reads all the pages in the TIFF - confirms that the page number is added to the metadata. This PR is branched from and developed on top of commit 6d6be99.	2023-08-24 15:12:50 +00:00
Newel H	e4aa7373e2	test: create CI pipelines for verifying base and extras pass respective tests (#1137) ### Summary Closes #747 * Create CI Pipeline for running text, xml, email, and html doc tests against the library installed without extras * Create CI Pipeline for running each library extra against their respective tests	2023-08-19 12:56:13 -04:00
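
The language-code mapping described in commit eb8ce89137 can be illustrated with a small sketch. This is not the library's implementation: the names `TESSERACT_VARIANTS` and `to_tesseract_langs` are hypothetical, and the hand-written table below covers only the two languages from that commit's example, whereas the commit states that it relies on python-iso639 for matching standard codes.

```python
from typing import Dict, List

# Hypothetical lookup table (assumption): standard code -> Tesseract codes.
# Only the two languages from the commit's example are listed here.
TESSERACT_VARIANTS: Dict[str, List[str]] = {
    "en": ["eng"],
    "chi": ["chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert"],
}

# Codes that Tesseract already understands are passed through unchanged.
KNOWN_TESSERACT_CODES = {
    code for variants in TESSERACT_VARIANTS.values() for code in variants
}


def to_tesseract_langs(languages: List[str]) -> str:
    """Build a '+'-joined Tesseract language string from mixed input codes."""
    resolved: List[str] = []
    for lang in languages:
        if lang in KNOWN_TESSERACT_CODES:
            candidates = [lang]
        else:
            # Unknown codes contribute nothing in this sketch.
            candidates = TESSERACT_VARIANTS.get(lang, [])
        for code in candidates:
            if code not in resolved:  # keep order, drop duplicates
                resolved.append(code)
    return "+".join(resolved)


print(to_tesseract_langs(["en", "chi"]))
```

With the table above, the call prints `eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert`, which matches the string the commit message says should be supplied to Tesseract.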

11 Commits