# unstructured/test_unstructured/documents/test_base.py

import pytest

from unstructured.documents.base import Document, Page
from unstructured.documents.elements import Formula, NarrativeText, Title


class MockDocument(Document):
    """Single-page document holding one Title and two NarrativeText elements."""

    def __init__(self):
        super().__init__()
        elements = [
            Title(text="This is a narrative."),
            NarrativeText(text="This is a narrative."),
            NarrativeText(text="This is a narrative."),
        ]
        page = Page(number=0)
        page.elements = elements
        self._pages = [page]


class MockDocumentWithFormula(Document):
    """Single-page document holding one Title and one Formula element."""

    def __init__(self):
        super().__init__()
        elements = [
            Title(text="This is a narrative."),
            Formula(text="e=mc2"),
        ]
        page = Page(number=0)
        page.elements = elements
        self._pages = [page]


def test_get_narrative():
    document = MockDocument()
    narrative = document.get_narrative()
    # Guard against an empty result, which would skip the loop and pass vacuously.
    assert narrative
    for element in narrative:
        assert isinstance(element, NarrativeText)
    document.print_narrative()


def test_get_formula():
    document = MockDocumentWithFormula()
    formula = [e for e in document.elements if isinstance(e, Formula)]
    assert formula[0].text != ""


@pytest.mark.parametrize("index", [0, 1, 2])
def test_split(index):
    document = MockDocument()
    elements = document.pages[0].elements
    split_before_doc = document.before_element(elements[index])
    before_elements = split_before_doc.pages[0].elements if split_before_doc.pages else []
    split_after_doc = document.after_element(elements[index])
    after_elements = split_after_doc.pages[0].elements if split_after_doc.pages else []
    expected_before_elements = document.pages[0].elements[:index]
    next_index = index + 1
    expected_after_elements = document.pages[0].elements[next_index:]
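    # zip() stops at the shorter input, so compare lengths first; otherwise a
    # truncated or empty split would pass the id checks vacuously.
    assert len(before_elements) == len(expected_before_elements)
    assert len(after_elements) == len(expected_after_elements)
    assert all(a.id == b.id for a, b in zip(before_elements, expected_before_elements))
    assert all(a.id == b.id for a, b in zip(after_elements, expected_after_elements))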