unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-09-06 15:23:37 +00:00


fix: fix IndexError when partitioning a pdf with starting_page_number (#3246)

The Issue:

When extracting images from PDFs, we use the metadata page number to
index into a list of the page images. However, the metadata page number
can now be offset via `starting_page_number`, so it no longer matches
the position in that list. To get the true page index, we need to
subtract this value.
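
A minimal sketch of the index arithmetic described above. It does not reproduce the library code: `page_index` and `page_images` are hypothetical names used only for illustration, and it assumes pages are numbered consecutively starting at `starting_page_number`.

```python
def page_index(metadata_page_number: int, starting_page_number: int = 1) -> int:
    """Map a reported page number back to a zero-based position in the image list.

    Hypothetical helper: assumes pages are numbered consecutively
    from starting_page_number.
    """
    index = metadata_page_number - starting_page_number
    if index < 0:
        raise ValueError(
            f"page number {metadata_page_number} is below "
            f"starting_page_number {starting_page_number}"
        )
    return index


# Hypothetical list with one rendered image per page of a 3-page document.
page_images = ["page-0.jpg", "page-1.jpg", "page-2.jpg"]

# With starting_page_number=20 the pages are reported as 20, 21, 22.
# Indexing with the reported number directly would raise IndexError;
# subtracting the offset recovers the correct list position.
second_page_image = page_images[page_index(21, starting_page_number=20)]
```

The bounds check is a design choice for the sketch only; it turns an inconsistent offset into an explicit error rather than a silent negative index.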

Testing:

Run this snippet in a Python shell. Before the fix, it raises an
IndexError. On this branch, it returns the elements.
```
from unstructured.partition.auto import partition
filename = "example-docs/layout-parser-paper-with-table.pdf"
partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20)
```

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: christinestraub <christinemstraub@gmail.com>

2024-06-19 18:20:54 +00:00

chunking

feat/Move the category field to Element (#3056)

2024-05-23 10:43:26 +00:00

cleaners

rfctr: prepare to add orig_elements serde (#2668)

2024-03-20 21:27:59 +00:00

documents

rfctr(html): drop HTML-specific elements (#3207)

2024-06-15 00:14:22 +00:00

embed

feat: add VoyageAI embeddings (#3069) (#3099)