# unstructured/test_unstructured/partition/pdf_image/test_chipper.py

import pytest

from unstructured.partition.pdf_image import pdf
from unstructured.partition.utils.constants import PartitionStrategy


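@pytest.fixture(scope="session")
def chipper_results():
    """Partition the sample PDF once per session with the chipper model."""
    # Session scope keeps the expensive hi_res partitioning to a single run.
    return pdf.partition_pdf(
        "example-docs/layout-parser-paper-fast.pdf",
        strategy=PartitionStrategy.HI_RES,
        model_name="chipper",
    )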


@pytest.fixture(scope="session")
def chipper_children(chipper_results):
    """Return only the elements that reference a parent element."""
    return [el for el in chipper_results if el.metadata.parent_id is not None]


@pytest.mark.chipper()
def test_chipper_has_hierarchy(chipper_children):
    assert chipper_children


@pytest.mark.chipper()
def test_chipper_not_losing_parents(chipper_results, chipper_children):
    # Every child's parent_id must match the id of some partitioned element.
    assert all(
        any(el.id == child.metadata.parent_id for el in chipper_results)
        for child in chipper_children
    )