unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-12 08:27:46 +00:00

Christine Straub df156ebe5a

feat: support pdf link extraction in hi_res strategy (#3753)

This PR adds support for link extraction in the PDF `hi_res` strategy.
`partition_pdf()` can now extract hyperlinks from PDF documents when
called with `strategy="hi_res"`.

### Summary
- Added support for link extraction in the `hi_res` flow.
- Enhanced the word extraction used for link extraction in both the
`fast` and `hi_res` flows, resulting in more accurate `start_index`
and `text` values in the `links` metadata.
- Updated the ingest fixture update workflow so that it no longer skips
the Astra DB source test.
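
To illustrate what a more accurate `start_index` and `text` could mean in practice, here is a minimal sketch that checks whether each link's `text` actually appears at its `start_index` within the element text. It works on plain dicts and does not call the library. The `Link` shape (including the `url` field), the helper name `misaligned_links`, and the reading of `start_index` as a character offset into the element text are assumptions made for illustration, not confirmed by this PR.

```python
from typing import TypedDict


class Link(TypedDict):
    # Assumed shape of one entry in `metadata.links`. `text` and `start_index`
    # are named in the summary above; `url` is an assumption.
    text: str
    url: str
    start_index: int


def misaligned_links(element_text: str, links: list[Link]) -> list[Link]:
    """Return the links whose `text` is not found at `start_index`.

    Hypothetical helper: assumes `start_index` is a character offset
    into the element's text.
    """
    bad = []
    for link in links:
        start = link["start_index"]
        end = start + len(link["text"])
        if start < 0 or element_text[start:end] != link["text"]:
            bad.append(link)
    return bad


text = "See the docs and the repo."
links: list[Link] = [
    # "docs" starts at offset 8, so this entry is aligned.
    {"text": "docs", "url": "https://example.com/docs", "start_index": 8},
    # "repo" starts at offset 21, so an offset of 20 is off by one.
    {"text": "repo", "url": "https://example.com/repo", "start_index": 20},
]

for link in misaligned_links(text, links):
    print(f"misaligned: {link['text']!r} at {link['start_index']}")
```

A check like this could be run over each element's links to spot offsets that drift from the extracted text, under the same assumptions.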

### Testing
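```python
from unstructured.partition.pdf import partition_pdf

# Partition the sample PDF with the hi_res strategy.
elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf",
    strategy="hi_res",
)
# The first element is expected to carry three extracted links.
assert len(elements[0].metadata.links) == 3
```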

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>

2024-10-31 16:52:27 +00:00

| File | Last commit | Date |
| --- | --- | --- |
| 2023-Jan-economic-outlook.pdf.json | feat: improve pdfminer element processing (#3618) | 2024-09-12 21:17:27 +00:00 |
| page-with-formula.pdf.json | feat: support pdf link extraction in hi_res strategy (#3753) | 2024-10-31 16:52:27 +00:00 |
| recalibrating-risk-report.pdf.json | feat: improve pdfminer element processing (#3618) | 2024-09-12 21:17:27 +00:00 |
| Silent-Giant-(1).pdf.json | feat: improve pdfminer element processing (#3618) | 2024-09-12 21:17:27 +00:00 |