89 Commits

Author SHA1 Message Date
Yao You
a9ff1e70b2
Fix/fix ocr region to elements bug (#3891)
This PR fixes a bug in `build_layout_elements_from_ocr_regions` where
texts are joint in incorrect orders.

The bug is due to incorrect masking of the `ocr_regions` after some are
already selected as one of the final groups. The fix uses simpler method
to mask the indices by simply use the same indices that adds the regions
to the final groups to mask them so they are not considered again.

## Testing

This PR adds a unit test specifically aimed for this bug. Without the
fix the test would fail.
Additionally any PDF files with repeated texts has a potential to
trigger this bug. e.g., create a simple pdf use the test text

```python
"LayoutParser: \n\nA Unified Toolkit for Deep Learning Based Document Image\n\nLayoutParser for Deep Learning"
```
and partition with `ocr_only` mode on main branch would hit this bug and
output text where position of the second "LayoutParser" is incorrect.
```python
[
    'LayoutParser:', 
    'A Unified Toolkit for Deep Learning Based Document Image',
    'for Deep Learning LayoutParser',
]
```
2025-01-29 12:11:17 +00:00
David Huggins-Daines
9e5ff225f6
fix: Correctly patch pdfminer to avoid unnecessarily and unsuccessfully repairing PDFs with long content streams, causing needless and endless OCR (#3822)
Fixes: #3815 

Verified on my very large documents that it doesn't unnecessarily and
unsuccessfully "repair" them.

You may or may not wish to keep the version check in `patch_psparser`.
Since ~you're pinning the version of pdfminer.six and since it isn't
guaranteed that the bug in question will be fixed in the next
pdfminer.six release (but it is rather serious, so I should hope so),
then perhaps you just want to unconditionally patch it.~ it seems like
pinning of versions is only operative when running from Docker (good!)
so never mind! Keep that version check!

Also corrected an import so that if you do feel like using a newer
version of pdfminer.six, it won't break on you.

---------

Authored-by: David Huggins-Daines <dhdaines@logisphere.ca>
2025-01-24 14:27:25 -06:00
Yao You
8f2a719873
Feat/refactor layoutelement textregion to vectorized data structure (#3881)
This PR refactors the data structure for `list[LayoutElement]` and
`list[TextRegion]` used in partition pdf/image files.

- new data structure replaces a list of objects with one object with
`numpy` array to store data
- this only affects partition internal steps and it doesn't change input
or output signature of `partition` function itself, i.e., `partition`
still returns `list[Element]`
- internally `list[LayoutElement]` -> `LayoutElements`;
`list[TextRegion]` -> `TextRegions`
- current refactor stops before clean up pdfminer elements inside
inferred layout elements -> the algorithm of clean up needs to be
refactored before the data structure refactor can move forward. So
current refactor converts the array data structure into list data
structure with `element_array.as_list()` call. This is the last step
before turning `list[LayoutElement]` into `list[Element]` as return
- a future PR will update this last step so that we build
`list[Element]` from `LayoutElements` data structure instead.

The goal of this PR is to replace the data structure as much as possible
without changing underlying logic. There are a few places where the
slicing or filtering logic was simple enough to be converted into vector
data structure operations. Those are refactored to be vector based. As a
result there is some small improvements observed in ingest test. This is
likely because the vector operations cleaned up some previous
inconsistency in data types and operations.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2025-01-23 17:11:38 +00:00
Pluto
8685905bd1
Character confidence threshold (#3860)
This change adds the ability to filter out characters predicted by
Tesseract with low confidence scores.

Some notes:
- I intentionally disabled it by default; I think some low score(like
0.9-0.95 for Tesseract) could be a safe choice though
- I wanted to use character bboxes and combine them into word bbox
later. However, a bug in Tesseract in some specific scenarios returns
incorrect character bboxes (unit tests caught it 🥳 ). More in comment in
the code
2025-01-13 13:12:46 +00:00
Steve Canny
10f0d54ac2
build: remove ruff version upper bound (#3829)
**Summary**
Remove pin on `ruff` linter and fix the handful of lint errors a newer
version catches.
2024-12-16 23:01:22 +00:00
Christine Straub
df156ebe5a
feat: support pdf link extraction in hi_res strategy (#3753)
This PR aims to add support for link extraction in pdf `hi_res`
strategy. The `partition_pdf()` function now supports link extraction
when using the `hi_res` strategy, allowing users to extract hyperlinks
from PDF documents.

### Summary
- Added functionalities to support link extraction in hi_res flow
- Enhanced word extraction functionality used for link extraction in
both `fast` and `hi_res` flows, resulted in more correct `start_index`
and `text` in `links` metadata.
- Updated ingest fixture update workflow to not skip Astra DB source
test

### Testing
```
elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf",
    strategy="hi_res"
)
assert len(elements[0].metadata.links) == 3
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
2024-10-31 16:52:27 +00:00
Yao You
a11ad22609
bump unstructured-inference (#3711)
This PR bumps `unstructured-inference` to `0.8.0`, which introduces
vectorized data structure for layout elements and text regions.
This PR also cleans up a few places in CI that has repeated definition
of env variables or missing installation of testing dependencies in
cache.

A few document ingest results are changed:
- two places for `biomed-api` (actually processed locally on runner) are
due to very small changes in numerical results of the bounding box
areas: one results in a duplicated page number/header and another
results in a deduplication of a word of a sentence that starts in a new
line. (yes, two cases goes in opposite directions)
- the layout parser paper now outputs the code lines with page number
inside the code box as list items

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-10-21 21:55:08 +00:00
Nathan Van Gheem
b092d45816
Remove unsupported chipper model (#3728)
The chipper model is no longer supported.
2024-10-17 17:40:45 +00:00
David Blore
ecf0267b85
fix: add language to OCRAgentGoogleVision constructor (#3696)
This PR addresses issue #3659 by adding an optional `language` parameter
to the `OCRAgentGoogleVision` class constructor.

This parameter serves as a "language hint" for the
`document_text_detection` method in the `ImageAnnotatorClient`. For more
information on language hints, refer to the [Google Cloud Vision
documentation](https://cloud.google.com/vision/docs/languages).


**Default Behavior**: 
The language parameter defaults to None, allowing Google Cloud Vision to
auto-detect the language, as recommended in their documentation.

**Purpose**: 
This change is necessary because the `OCRAgent`'s `get_instance` method
expects all `OCRAgent`s to include a language parameter in their
constructors.

**Context on Issue:**
When trying to parse a PDF with
`OCR_AGENT=unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`,
an error occurs in the `get_instance` method. The method expects a
`language` parameter, which the current `OCRAgentGoogleVision`
constructor does not support, leading to a positional argument error.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-10-14 05:35:05 +00:00
Steve Canny
44bad216f3
rfctr(part): prepare for pluggable auto-partitioners 3 (#3661)
**Summary**
Remove unused `include_metadata` parameter.

**Additional Context**
- The `include_metadata` parameter was originally added circa v0.7.12 as
a mechanism for avoiding the "double-decorating" problem on delegating
partitioners.
- It turns out it doesn't fully address that problem, is now unused, and
is unnecessary for the solution we'll be adding as part of pluggable
partitioners.
- Remove the unnecessary complexity introduced by this unused parameter.
2024-09-25 18:17:48 +00:00
Steve Canny
3bab9d93e6
rfctr(part): prepare for pluggable auto-partitioners 1 (#3655)
**Summary**
In preparation for pluggable auto-partitioners simplify metadata as
discussed.

**Additional Context**
- Pluggable auto-partitioners requires partitioners to have a consistent
call signature. An arbitrary partitioner provided at runtime needs to
have a call signature that is known and consistent. Basically
`partition_x(filename, *, file, **kwargs)`.
- The current `auto.partition()` is highly coupled to each distinct
file-type partitioner, deciding which arguments to forward to each.
- This is driven by the existence of "delegating" partitioners, those
that convert their file-type and then call a second partitioner to do
the actual partitioning. Both the delegating and proxy partitioners are
decorated with metadata-post-processing decorators and those decorators
are not idempotent. We call the situation where those decorators would
run twice "double-decorating". For example, EPUB converts to HTML and
calls `partition_html()` and both `partition_epub()` and
`partition_html()` are decorated.
- The way double-decorating has been avoided in the past is to avoid
sending the arguments the metadata decorators are sensitive to to the
proxy partitioner. This is very obscure, complex to reason about,
error-prone, and just overall not a viable strategy. The better solution
is to not decorate delegating partitioners and let the proxy partitioner
handle all the metadata.
- This first step in preparation for that is part of simplifying the
metadata processing by removing unused or unwanted legacy parameters.
- `date_from_file_object` is a misnomer because a file-object never
contains last-modified data.
- It can never produce useful results in the API where last-modified
information must be provided by `metadata_last_modified`.
- It is an undocumented parameter so not in use.
- Using it can produce incorrect metadata.
2024-09-23 22:23:10 +00:00
Christine Straub
0ed69a1ac3
refactor: pdfminer image cleanup (#3648)
This PR aims to remove `clean_pdfminer_duplicate_image_elements()`
function, as its functionality has already been integrated into the
`remove_duplicate_elements()` function in [PR
#3630](https://github.com/Unstructured-IO/unstructured/pull/3630).
2024-09-19 18:57:02 +00:00
Christine Straub
be88eef06f
perf: optimize pdfminer image cleanup process for improved performance (#3630)
This PR enhances `pdfminer` image cleanup process by repositioning the
duplicate image removal step. It optimizes the removal of duplicated
pdfminer images by performing the cleanup before merging elements,
rather than after. This improvement reduces execution time and enhances
the overall processing speed of PDF documents.

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2024-09-19 14:05:05 +00:00
Christine Straub
87a88a3c87
feat: improve pdfminer element processing (#3618)
This PR implements splitting of `pdfminer` elements (`groups of text
chunks`) into smaller bounding boxes (`text lines`). This implementation
prevents loss of information from the object detection model and
facilitates more effective removal of duplicated `pdfminer` text. This
PR also addresses #3430.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-09-12 21:17:27 +00:00
Yao You
d51fb134e6
Feat/improve iou speed (#3582)
This PR vectorizes the computation of element overlap to speed up
deduplication process of extracted elements.

## test

This PR adds unit test to the new vectorized IOU and subregion
computation functions.

In addition, running partition on large files with many elements like
this slide:

[002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf)

shows a reduction of runtime from around 15min on the main branch to
less than 4min with this branch.

Profiling results show that the new implementation greatly reduces the
time cost of computation and now most of the time is spend on getting
the coordinates from a list of bboxes.

![Screenshot 2024-08-30 at 9 29
27 PM](https://github.com/user-attachments/assets/6c186838-54c7-483b-ac3e-7342c23ff3a6)
2024-09-03 00:06:18 +00:00
Pawel Kmiecik
404f780bbb
feat: make analysis drawing more flexible (#3574)
This PR changes the way the analysis tools can be used:
- by default if `analysis` is set to `True` in `partition_pdf` and the
strategy is resolved to `hi_res`:
- for each file 4 layout dumps are produced and saved as JSON files
(`object_detection`, `extracted`, `ocr`, `final`) - similar way to the
current `object_detection` dump
- the drawing functions/classes now accept these dumps accordingly
instead of the internal classes instances (like `TextRegion`,
`DocumentLayout`
- it makes it possible to use the lightweight JSON files to render the
bboxes of a given file after the partition is done
- `_partition_pdf_or_image_local` has been refactored and most of the
analysis code is now encapsulated in `save_analysis_artifiacts` function
- to do this, helper function `render_bboxes_for_file` is added
<img width="338" alt="Screenshot 2024-08-28 at 14 37 56"
src="https://github.com/user-attachments/assets/10b6fbbd-7824-448d-8c11-52fc1b1b0dd0">
2024-09-02 11:06:11 +00:00
Christine Straub
fc26426310
feat: replace pytesseract with unstructured.pytesseract fork (#3528)
This PR reverts `pytesseract` dependency to `unstructured.pytesseract`
fork due to the unavailability of some recent release versions of
`pytesseract` on PyPI.

This PR also addresses an issue encountered during the publication of
`unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI
does not allow direct dependencies from Version Control System URLs like
GitHub in the `install_requires` or `extras_require` sections of the
`setup.py` file.
2024-08-16 10:34:22 -04:00
Jake Zerrer
051be5aead
Remove unstructured.pytesseract fork (#3454)
A second attempt at
https://github.com/Unstructured-IO/unstructured/pull/3360, this PR
removes unstructured's dependency on its own fork of `pytesseract`. (The
original reason for the fork, the addition of
`run_and_get_multiple_output`, was removed
[here](https://github.com/madmaze/pytesseract/releases/tag/v0.3.12).)

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-08-09 04:28:48 +00:00
Maciej Kurzawa
b749b891a7
fix: disabled checking max pages for images (#3473)
Added fix related to
https://github.com/Unstructured-IO/unstructured/pull/3431, which
disables checking max pages for images
2024-08-02 14:25:08 +00:00
Maciej Kurzawa
8fd216cc9f
feat/pdf-page-limit-in-hi-res (#3431)
# Description:
Passing `max_pages` argument allows rejecting pdf files which exceeds
this page number limit while `high_res` strategy is chosen. By default
it will allow parsing pdf files with unlimited number of pages.

# Testing:
```python
from unstructured.partition.auto import partition

elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res')  # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=4)  # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=2)  # should raise PdfMaxPagesExceededError
```
2024-07-30 16:52:17 +00:00
Christine Straub
0eb461acc2
refactor: restructure PDF/Image example document organization (#3410)
This PR aims to improve the organization and readability of our example
documents used in unit tests, specifically focusing on PDF and image
files.

### Summary
- Created two new subdirectories in the `example-docs` folder:
  - `pdf/`: for all PDF example files
  - `img/`: for all image example files
- Moved relevant PDF files from `example-docs/` to `example-docs/pdf/`
- Moved relevant image files from `example-docs/` to `example-docs/img/`
- Updated file paths in affected unit & ingest tests to reflect the new
directory structure

### Testing
All unit & ingest tests should be updated and verified to work with the
new file structure.

## Notes
Other file types (e.g., office documents, HTML files) remain in the root
of `example-docs/` for now.

## Next Steps
Consider similar reorganization for other file types if this structure
proves to be beneficial.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-07-18 22:21:32 +00:00
Christine Straub
48bdf94656
feat: partition_pdf() support language specification for PaddleOCR (#3400)
Closes #3159.

This PR extends language specification capability to `PaddleOCR` in
addition to `TesseractOCR`. Users can now specify OCR languages for both
OCR engines when using `partition_pdf()`.

### Testing

```
os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"

elements = partition_pdf(
    filename=<file_path>,
    strategy=strategy,
    languages=["chi_sim"], # chinese - simplified
    infer_table_structure=True,
)
```
2024-07-16 22:19:25 +00:00
Steve Canny
d48fa3b163
rfctr(auto): improve typing and organize auto tests (#3355)
**Summary**
In preparation for further work on auto-partitioning (`partition()`),
improve typing and organize `test_auto.py` by introducing categories.
2024-07-08 21:25:17 +00:00
Pawel Kmiecik
575957b2d2
feat: enhance analysis options with od model dump and better vis (#3234)
This PR adds new capabilities for drawing bboxes for each layout
(extracted, inferred, ocr and final) + OD model output dump as a json
file for better analysis.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: Michal Martyniak <michal.martyniak@deepsense.ai>
2024-06-26 13:14:55 +00:00
Austin Walker
0b73978b92
fix: fix IndexError when partioning a pdf with starting_page_number (#3246)
The Issue:

When extracting images from pdfs, we use the metadata page number to
index into a list of the images. However, the metadata page number can
now be changed via `starting_page_number`. To get the true page index,
we need to subtract this value.

Testing:

Run this snippet in a python shell. Before the fix, this throws an
IndexError. On this branch, it will return the elements.
```
from unstructured.partition.auto import partition
filename = "example-docs/layout-parser-paper-with-table.pdf"
partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20)
```

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-06-19 18:20:54 +00:00
Christine Straub
9552fbbfbf
chore: bump unstructured-inference 0.7.35 (#3205)
### Summary
- bump unstructured-inference to `0.7.35` which fixed syntax for
generated HTML tables
- update unit tests and ingest test fixtures to reflect changes in the
generated HTML tables
- cut a release for `0.14.6`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-06-14 18:11:38 +00:00
Christine Straub
f4457249a7
fix: partition_pdf() removes spaces from the text (#3106)
Closes #2896.

This PR aims to fix `partition_pdf()` to keep spaces in text. The
control character `\t` is now replaced with a space instead of being
removed when merging inferred and embedded elements.

### Testing
PDF:
[rok_20230930_1-1.pdf](https://github.com/Unstructured-IO/unstructured/files/15001636/rok_20230930_1-1.pdf)
```
elements = partition_pdf(
    filename="rok_20230930_1-1.pdf",
    strategy="hi_res",
)

print(str(elements[20]))
```
**Results:**
- PR
```
Name of each exchange on which registered New York Stock Exchange
```
- main branch
```
Nameofeachexchangeonwhichregistered NewYorkStockExchange
```
2024-05-29 04:53:17 +00:00
Christine Straub
b0d8a779da
feat: partiton_pdf() set inferred elements text (#3061)
This PR adds the ability to fill inferred elements text from embedded
text (`pdfminer`) without depending on `unstructured-inference` library.
This PR is the second part of moving embedded text related code from
`unstructured-inference` to `unstructured` and works together with
https://github.com/Unstructured-IO/unstructured-inference/pull/349.
2024-05-21 19:43:38 +00:00
Christine Straub
76831f154b
refactor: partition_pdf() pass kwargs through fast strategy pipeline (#3040)
This PR aims to pass `kwargs` through `fast` strategy pipeline, which
was missing as part of the previous PR -
https://github.com/Unstructured-IO/unstructured/pull/3030.
I also did some code refactoring in this PR, so I recommend reviewing
this PR commit by commit.

### Summary
- pass `kwargs` through `fast` strategy pipeline, which will allow users
to specify additional params like `sort_mode`
- refactor: code reorganization
- cut a release for `0.14.0`
### Testing
CI should pass
2024-05-17 20:55:11 +00:00
amadeusz-ds
1c8b2b23eb
feat: add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR config parameteres (#3014)
This PR introduces GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR
controlling where temporary files are stored during partition flow, via
tempfile.tempdir.

#### Edit:
Renamed prefixes from STORAGE_ to UNSTRUCTURED_CACHE_

#### Edit 2:
Renamed prefixes from UNSTRUCTURED_CACHE to GLOBAL_WORKING_DIR_
2024-05-17 19:16:10 +00:00
Christine Straub
1fb0fe5cf5
enhancement: partitoin_pdf() skip unnecessary element sorting (#3030)
This PR aims to skip element sorting when determining whether embedded
text can be extracted. The extracted elements in this step are returned
as final elements only for the `fast` strategy pipeline and are never
used for other strategy pipelines (`hi_res`, `ocr`).
Removing element sorting in this step and adding it to the `fast`
strategy pipeline later will improve performance and reduce execution
time.

### Summary
- skip element sorting when determining whether embedded text can be
extracted.
- add `_partition_pdf_with_pdfparser()` function for fast` strategy
pipeline

### Testing
CI should pass.
2024-05-16 06:02:56 +00:00
John
d829b669e6
Add starting_page_num param to partition_image (#2987)
Add missing starting_page_num param to partition_image

Closes #2985
2024-05-09 21:31:35 +00:00
Christine Straub
0cd07d78f9
feat: parition_pdf() add ability to get cid ratio (#2970)
This PR adds the ability to get the ratio of `cid` characters in
embedded text extracted by `pdfminer`. This PR is the second part of
moving `cid` related code from `unstructured-inference` to
`unstructured` and works together with
https://github.com/Unstructured-IO/unstructured-inference/pull/342.
2024-05-04 05:21:27 +00:00
Michał Martyniak
7720e72424
Fix: avoid elements sharing the same memory address (#2940)
This PR attempts to fix a memory issue, which resulted in errors like
this: https://github.com/Unstructured-IO/unstructured/issues/2931
The root cause seems to be in how ListItems are being combined, not in
how hashes or parent IDs are updated.

When `assign_and_map_hash_ids()` is called and elements (or elements'
metadata) do not have unique memory addresses, then updating the
parent_id of one element will also overwrite the parent_id of some other
element.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2024-04-28 19:15:17 -07:00
Michał Martyniak
2d1923ac7e
Better element IDs - deterministic and document-unique hashes (#2673)
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more test for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)

This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
2024-04-24 00:05:20 -07:00
Dimitri Lozeve
abb0174181
Integration with the Google Cloud Vision API (#2902)
This PR adds a third OCR provider, alongside Tesseract and Paddle: the
[Google Cloud Vision API](https://cloud.google.com/vision).

It can be used similarly to other OCR methods: set the `OCR_AGENT`
environment variable to the path to the OCR module
(`unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`).
You also need to set the credentials to use Google APIs, for instance by
setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-04-23 21:11:39 +00:00
Christine Straub
ac5048bf30
enhancement: remove duplicate embedded images (#2897)
This PR aims to remove duplicate embedded images taken by `PDFminer`.

### Summary
- add `clean_pdfminer_duplicate_image_elements()` to remove embedded
images with similar `bboxes` and the same `text`
- add env_config `EMBEDDED_IMAGE_SAME_REGION_THRESHOLD` to consider the
bounding boxes of two embedded images as the same region
- refactor: reorganzie `clean_pdfminer_inner_elements()`
2024-04-18 23:07:47 +00:00
Michał Martyniak
cb1e91058e
Introduce start_page argument to partitioning functions that assign element.metadata.page_number (#2884)
This small change will be useful for users who partition only fragments
of their PDF documents.
It's a small step towards addressing this issue:
https://github.com/Unstructured-IO/unstructured/issues/2461

Related PRs:
* https://github.com/Unstructured-IO/unstructured/pull/2842
* https://github.com/Unstructured-IO/unstructured/pull/2673
2024-04-15 21:03:42 +00:00
Christine Straub
887e6c9094
refactor: use env_config instead of SUBREGION_THRESHOLD_FOR_OCR constant (#2697)
The purpose of this PR is to introduce a new env_config for the
subregion threshold for OCR.

### Testing
CI should pass.
2024-03-28 20:28:35 +00:00
Filip Knefel
6af6604057
feat: introduce date_from_file_object parameter to partitions (#2563)
Introduce `date_from_file_object` to `partition*` functions, by default
set to `False`.
If set to `True` and file is provided via `file` parameter, partition
will attempt to infer last modified date from `file`'s contents
otherwise last modified metadata will be set to `None`.

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-18 01:09:44 +00:00
Matt Robinson
b4d9ad8130
enhancement: detect headers in partition_pdf with fast strategy (#2455)
### Summary

Detects headers and footers when using `partition_pdf` with the fast
strategy. Identifies elements that are positioned in the top or bottom
5% of the page as headers or footers. If no coordinate information is
available, an element won't be detected as a header or footer.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2024-02-23 16:56:09 +00:00
Christine Straub
d11a83ce65
refactor: embedded text processing modules (#2535)
This PR is similar to ocr module refactoring PR -
https://github.com/Unstructured-IO/unstructured/pull/2492.

### Summary
- refactor "embedded text extraction" related modules to use decorator -
`@requires_dependencies` on functions that require external libraries
and import those libraries inside those functions instead of on module
level.
- add missing test cases for `pdf_image_utils.py` module to improve
average test coverage

### Testing
CI should pass.
2024-02-13 21:19:07 -08:00
Christine Straub
29b9ea7ba6
refactor: ocr modules (#2492)
The purpose of this PR is to refactor OCR-related modules to reduce
unnecessary module imports to avoid potential issues (most likely due to
a "circular import").

### Summary
- add `inference_utils` module
(unstructured/partition/pdf_image/inference_utils.py) to define
unstructured-inference library related utility functions, which will
reduce importing unstructured-inference library functions in other files
- add `conftest.py` in `test_unstructured/partition/pdf_image/`
directory to define fixtures that are available to all tests in the same
directory and its subdirectories

### Testing
CI should pass
2024-02-06 17:11:55 +00:00
Christine Straub
94001a208d
feat: improve table cell data (#2457)
The purpose of this PR is to pass embedded text through table processing
sub-pipeline later later use.
2024-02-01 05:29:19 +00:00
Christophe Jolif
ccc2302b33
feat: add the ability to specify a custom OCR besides the ones natively supported (#2462)
This is nice to natively support both Tesseract and Paddle. However, one
might already use another OCR and might want to keep using it (for
quality reasons, for cost reasons etc...).
This PR adds the ability for the user to specify its own OCR agent
implementation that is then called by unstructured.

I am new to unstructured so don't hesitate to let me know if you would
prefer this being done differently and I will rework the PR.

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
2024-01-31 16:38:14 -06:00
Christine Straub
8b1de4c2b8
fix: partition_pdf() not working when using chipper model with file (#2479)
Closes #2480.
 
### Summary
- fixed an error introduced by PR
[#2347](https://github.com/Unstructured-IO/unstructured/pull/2347) -
https://github.com/Unstructured-IO/unstructured/pull/2347/files#diff-cefa2d296ae7ffcf5c28b5734d5c7d506fbdb225c05a0bc27c6b755d5424ffdaL373
- updated `test_partition_pdf_with_model_name()` to test more model
names

### Testing
The updated test function `test_partition_pdf_with_model_name()` should
work on this branch, but fails on the `main` branch.
2024-01-31 17:36:59 +00:00
John
db67805ec6
feat: add support for partitioning .heic files (#2454)
.heic files are an image filetype we have not supported.

#### Testing
```
from unstructured.partition.image import partition_image

png_filename = "example-docs/DA-1p.png"
heic_filename = "example-docs/DA-1p.heic"

png_elements = partition_image(png_filename, strategy="hi_res")
heic_elements = partition_image(heic_filename, strategy="hi_res")

for i in range(len(heic_elements)):
	print(heic_elements[i].text == png_elements[i].text)
```

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-01-30 04:49:00 +00:00
John
9320311a19
fix: check languages args (#2435)
This PR is the last in a series of PRs for refactoring and fixing the
language parameters (`languages` and `ocr_languages` so we can address
incorrect input by users. See #2293

It is recommended to go though this PR commit-by-commit and note the
commit message. The most significant commit is "update
check_languages..."
2024-01-29 20:12:08 +00:00
Yao You
97fb10db4a
fix: default hi_res model rely on inference setting (#2441)
- there are multiple places setting the default `hi_res_model_name` in
both `unstructured` and `unstructured-inference`
- they lead to inconsistency and unexpected behaviors
- this fix removes a helper in `unstructured` that tries to set the
default hi_res layout detection model; instead we rely on the
`unstructured-inference` to provide that default when no explicit model
name is passed in

## test

```bash
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true ipython
```

```python
from unstructured.partition.auto import partition

# find a pdf file
elements = partition("foo.pdf", strategy="hi_res")
assert elements[0].metadata.detection_origin == "yolox"
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2024-01-29 16:44:41 +00:00
Antonio Jose Jimeno Yepes
d8b3bdb919
Check chipper version and prevent running pdfminer with chipper (#2347)
We have added a new version of chipper (Chipperv3), which needs to allow
unstructured to effective work with all the current Chipper versions.
This implies resizing images with the appropriate resolution and make
sure that Chipper elements are not sorted by unstructured.

In addition, it seems that PDFMiner is being called when calling
Chipper, which adds repeated elements from Chipper and PDFMiner.

To evaluate this PR, you can test the code below with the attached PDF.
The code writes a JSON file with the generated elements. The output can
be examined with `cat out.un.json | python -m json.tool`. There are
three things to check:

1. The size of the image passed to Chipper, which can be identiied in
the layout_height and layout_width attributes, which should have values
3301 and 2550 as shown in the example below:

```
[
    {
        "element_id": "c0493a7872f227e4172c4192c5f48a06",
        "metadata": {
            "coordinates": {
                "layout_height": 3301,
                "layout_width": 2550,

```

2. There should be no repeated elements. 
3. Order should be closer to reading order.

The script to run Chipper from unstructured is:

```
from unstructured import __version__
print(__version__.__version__)

import json
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json

elements = json.loads(elements_to_json(partition("Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf", strategy="hi_res", model_name="chipperv3")))

with open('out.un.json', 'w') as w:
    json.dump(elements, w)

```



[Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf](https://github.com/Unstructured-IO/unstructured/files/13817273/Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf)

---------

Co-authored-by: Antonio Jimeno Yepes <antonio@unstructured.io>
2024-01-25 02:33:32 +00:00