100 Commits

Author SHA1 Message Date
jiajun-unstructured
b0dbd71aff
Parallelize tests (#4024) 2025-06-16 23:29:35 +00:00
Austin Walker
e3417d7e98
fix: Fix for Pillow error when extracting PNG images (#3998)
When I tried to partition a PNG file and extract images, I got an error
from Pillow:

```
WARNING  unstructured:pdf_image_utils.py:230 Image Extraction Error: Skipping the failed image
Traceback (most recent call last):
  File "/Users/austin/.pyenv/versions/unstructured/lib/python3.10/site-packages/PIL/JpegImagePlugin.py", line 666, in _save
    rawmode = RAWMODE[im.mode]
KeyError: 'RGBA'
```

The issue is that a PNG has an additional layer that cannot be saved off
in jpeg format. We can fix this with a quick conversion. I added a png
test case that is now passing with this fix.
2025-05-08 21:57:05 +00:00
Philippe PRADOS
d570f4624b
Fix sort_page_element. ensures that sorting is stable and not random. (#3978)
The sort_page_element() use the element id to sort the elements.
Two executions of the same code, on the same file, produce different
results. The order of the elements is random.
This makes it impossible to write stable unit tests, for example, or to
obtain reproducible results.
2025-04-07 15:57:20 +00:00
cragwolfe
dfa17bd3a0
fix: hi_res PDF parsing: only uncategorized text for extracted elements (#3975) 2025-04-04 14:38:23 -07:00
Yao You
4e424efd22
feat: use lxml instead of bs4 to parse hOCR data (#3960)
- `lxml` is a much faster library than `bs4` when the input data is
regular
- since the hOCR data is guaranteed to be regular (programmatically
generated) we don't need `bs4` here to parse the data
- `lxml` improves parsing speed by about 10x

Example runtime profiling locally using the same `hocr` data from 1 page
pdf, where `agent.hocr_to_dataframe_bs4` is the current method on main
and `agent.hocr_to_dataframe` is the PR's method.

![Screenshot 2025-03-17 at 12 14
59 PM](https://github.com/user-attachments/assets/7c483857-8711-4d72-8954-e83510fef783)
2025-03-18 00:36:19 +00:00
Yao You
2dceac34b5
Feat/remove reference of PageLayout.elements (#3943)
This PR removes usage of `PageLayout.elements` from partition function,
except for when `analysis=True`. This PR updates the partition logic so
that `PageLayout.elements_array` is used everywhere to save memory and
cpu cost.
Since the analysis function is intended for investigation and not for
general document processing purposes, this part of the code is left for
a future refactor.

`PageLayout.elements` uses a list to store layout elements' data while
`elements_array` uses `numpy` array to store the data, which has much
lower memory requirements. Using `memory_profiler` to test the
differences is usually around 10x.
2025-03-12 15:21:21 +00:00
Yao You
8759b0aac9
feat: allow passing down of ocr agent and table agent (#3954)
This PR allows passing down both `ocr_agent` and `table_ocr_agent` as
parameters to specify the `OCRAgent` class for the page and tables, if
any, respectively. Both are default to using `tesseract`, consistent
with the present default behavior.

We used to rely on env variables to specify the agents but os env can be
changed during runtime outside of the caller's control. This method of
passing down the variables ensures that specification is independent of
env changes.

## testing

Using `example-docs/img/layout-parser-paper-with-table.jpg` and run
partition with two different settings. Note that this test requires
`paddleocr` extra.

```python
from unstructured.partition.auto import partition
from unstructured.partition.utils.constants import OCR_AGENT_TESSERACT, OCR_AGENT_PADDLE
elements = partition(f, strategy="hi_res", skip_infer_table_types=[], ocr_agent=OCR_AGENT_TESSERACT, table_ocr_agent=OCR_AGENT_PADDLE)
elements_alt = partition(f, strategy="hi_res", skip_infer_table_types=[], ocr_agent=OCR_AGENT_PADDLE, table_ocr_agent=OCR_AGENT_TESSERACT)
```

we should see both finish and slight differences in the table element's
text attribute.
2025-03-11 16:36:31 +00:00
Yao You
43b682ad3f
feat: allow extraction of camel cased element type names (#3938)
This PR allows element types with CamelCase names to be extractable
using `extract_image_block_types` variable.

Before: specify `extract_image_block_types=["NarrativeText"]` (or any
casing for `NarrativeText`) would raise a warning that it doesn't match
any available types and not image would be extracted for this element
type

Now: specify `extract_image_block_types=["NarrativeText"]` would extract
images for this element type

## testing

```python
from unstructured.partition.auto import partition
f = "example-docs/pdf/embedded-images-tables.pdf"
elements = partition(f, strategy="hi_res", extract_image_block_types=["narrativetext"])
```

Without this PR no figures would be extracted. With this PR a local
folder would be created to contain images of the narrative text elements
in path like `./figures/figure-1-1.jpg`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2025-03-04 01:33:05 +00:00
Pluto
3973a30b8c
Feat: Add pdfminer parameters configuration (#3918)
This pull request adds the ability to configure multiple pdfminer
parameters (with the simple possibility to extend for the additional
parameters). One of the parameters overwrites the default from LA Params
config class.

Example:
```python3
partition(
    filename=example_doc_path("pdf/layout-parser-paper-fast.pdf"),
    pdfminer_line_margin=1.123,
    pdfminer_char_margin=None,
    pdfminer_line_overlap=0.0123,
    pdfminer_word_margin=3.21,
)
assert pdfminer_mock.call_args.kwargs == {
    "line_margin": 1.123,
    "line_overlap": 0.0123,
    "word_margin": 3.21,
}
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: plutasnyy <plutasnyy@users.noreply.github.com>
2025-02-17 11:41:20 +00:00
Philippe PRADOS
b521bce9c6
Add password with PDF files (#3721)
Add password with PDF files
Must be combined with [PR 392 in
unstructured-inference](https://github.com/Unstructured-IO/unstructured-inference/pull/392)

---------

Co-authored-by: John J <43506685+Coniferish@users.noreply.github.com>
2025-02-11 17:39:16 +00:00
Yao You
9d58b34ab4
Fix/fix table id checking logic (#3898)
- there is a bug in deciding if a page has tables before performing
table extraction. This logic checks if the id associated with Table type
element is True
- however, it should be checking if the id is `None` because sometimes
the id can be 0 (the first type of element in the page)
- the fix updates the logic
- adds a unit test for this specific case
2025-01-31 10:19:14 -08:00
Yao You
a9ff1e70b2
Fix/fix ocr region to elements bug (#3891)
This PR fixes a bug in `build_layout_elements_from_ocr_regions` where
texts are joint in incorrect orders.

The bug is due to incorrect masking of the `ocr_regions` after some are
already selected as one of the final groups. The fix uses simpler method
to mask the indices by simply use the same indices that adds the regions
to the final groups to mask them so they are not considered again.

## Testing

This PR adds a unit test specifically aimed for this bug. Without the
fix the test would fail.
Additionally any PDF files with repeated texts has a potential to
trigger this bug. e.g., create a simple pdf use the test text

```python
"LayoutParser: \n\nA Unified Toolkit for Deep Learning Based Document Image\n\nLayoutParser for Deep Learning"
```
and partition with `ocr_only` mode on main branch would hit this bug and
output text where position of the second "LayoutParser" is incorrect.
```python
[
    'LayoutParser:', 
    'A Unified Toolkit for Deep Learning Based Document Image',
    'for Deep Learning LayoutParser',
]
```
2025-01-29 12:11:17 +00:00
David Huggins-Daines
9e5ff225f6
fix: Correctly patch pdfminer to avoid unnecessarily and unsuccessfully repairing PDFs with long content streams, causing needless and endless OCR (#3822)
Fixes: #3815 

Verified on my very large documents that it doesn't unnecessarily and
unsuccessfully "repair" them.

You may or may not wish to keep the version check in `patch_psparser`.
Since ~you're pinning the version of pdfminer.six and since it isn't
guaranteed that the bug in question will be fixed in the next
pdfminer.six release (but it is rather serious, so I should hope so),
then perhaps you just want to unconditionally patch it.~ it seems like
pinning of versions is only operative when running from Docker (good!)
so never mind! Keep that version check!

Also corrected an import so that if you do feel like using a newer
version of pdfminer.six, it won't break on you.

---------

Authored-by: David Huggins-Daines <dhdaines@logisphere.ca>
2025-01-24 14:27:25 -06:00
Yao You
8f2a719873
Feat/refactor layoutelement textregion to vectorized data structure (#3881)
This PR refactors the data structure for `list[LayoutElement]` and
`list[TextRegion]` used in partition pdf/image files.

- new data structure replaces a list of objects with one object with
`numpy` array to store data
- this only affects partition internal steps and it doesn't change input
or output signature of `partition` function itself, i.e., `partition`
still returns `list[Element]`
- internally `list[LayoutElement]` -> `LayoutElements`;
`list[TextRegion]` -> `TextRegions`
- current refactor stops before clean up pdfminer elements inside
inferred layout elements -> the algorithm of clean up needs to be
refactored before the data structure refactor can move forward. So
current refactor converts the array data structure into list data
structure with `element_array.as_list()` call. This is the last step
before turning `list[LayoutElement]` into `list[Element]` as return
- a future PR will update this last step so that we build
`list[Element]` from `LayoutElements` data structure instead.

The goal of this PR is to replace the data structure as much as possible
without changing underlying logic. There are a few places where the
slicing or filtering logic was simple enough to be converted into vector
data structure operations. Those are refactored to be vector based. As a
result there is some small improvements observed in ingest test. This is
likely because the vector operations cleaned up some previous
inconsistency in data types and operations.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2025-01-23 17:11:38 +00:00
Pluto
8685905bd1
Character confidence threshold (#3860)
This change adds the ability to filter out characters predicted by
Tesseract with low confidence scores.

Some notes:
- I intentionally disabled it by default; I think some low score(like
0.9-0.95 for Tesseract) could be a safe choice though
- I wanted to use character bboxes and combine them into word bbox
later. However, a bug in Tesseract in some specific scenarios returns
incorrect character bboxes (unit tests caught it 🥳 ). More in comment in
the code
2025-01-13 13:12:46 +00:00
Steve Canny
10f0d54ac2
build: remove ruff version upper bound (#3829)
**Summary**
Remove pin on `ruff` linter and fix the handful of lint errors a newer
version catches.
2024-12-16 23:01:22 +00:00
Christine Straub
df156ebe5a
feat: support pdf link extraction in hi_res strategy (#3753)
This PR aims to add support for link extraction in pdf `hi_res`
strategy. The `partition_pdf()` function now supports link extraction
when using the `hi_res` strategy, allowing users to extract hyperlinks
from PDF documents.

### Summary
- Added functionalities to support link extraction in hi_res flow
- Enhanced word extraction functionality used for link extraction in
both `fast` and `hi_res` flows, resulted in more correct `start_index`
and `text` in `links` metadata.
- Updated ingest fixture update workflow to not skip Astra DB source
test

### Testing
```
elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf",
    strategy="hi_res"
)
assert len(elements[0].metadata.links) == 3
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
2024-10-31 16:52:27 +00:00
Yao You
a11ad22609
bump unstructured-inference (#3711)
This PR bumps `unstructured-inference` to `0.8.0`, which introduces
vectorized data structure for layout elements and text regions.
This PR also cleans up a few places in CI that has repeated definition
of env variables or missing installation of testing dependencies in
cache.

A few document ingest results are changed:
- two places for `biomed-api` (actually processed locally on runner) are
due to very small changes in numerical results of the bounding box
areas: one results in a duplicated page number/header and another
results in a deduplication of a word of a sentence that starts in a new
line. (yes, two cases goes in opposite directions)
- the layout parser paper now outputs the code lines with page number
inside the code box as list items

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-10-21 21:55:08 +00:00
Nathan Van Gheem
b092d45816
Remove unsupported chipper model (#3728)
The chipper model is no longer supported.
2024-10-17 17:40:45 +00:00
David Blore
ecf0267b85
fix: add language to OCRAgentGoogleVision constructor (#3696)
This PR addresses issue #3659 by adding an optional `language` parameter
to the `OCRAgentGoogleVision` class constructor.

This parameter serves as a "language hint" for the
`document_text_detection` method in the `ImageAnnotatorClient`. For more
information on language hints, refer to the [Google Cloud Vision
documentation](https://cloud.google.com/vision/docs/languages).


**Default Behavior**: 
The language parameter defaults to None, allowing Google Cloud Vision to
auto-detect the language, as recommended in their documentation.

**Purpose**: 
This change is necessary because the `OCRAgent`'s `get_instance` method
expects all `OCRAgent`s to include a language parameter in their
constructors.

**Context on Issue:**
When trying to parse a PDF with
`OCR_AGENT=unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`,
an error occurs in the `get_instance` method. The method expects a
`language` parameter, which the current `OCRAgentGoogleVision`
constructor does not support, leading to a positional argument error.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-10-14 05:35:05 +00:00
Steve Canny
44bad216f3
rfctr(part): prepare for pluggable auto-partitioners 3 (#3661)
**Summary**
Remove unused `include_metadata` parameter.

**Additional Context**
- The `include_metadata` parameter was originally added circa v0.7.12 as
a mechanism for avoiding the "double-decorating" problem on delegating
partitioners.
- It turns out it doesn't fully address that problem, is now unused, and
is unnecessary for the solution we'll be adding as part of pluggable
partitioners.
- Remove the unnecessary complexity introduced by this unused parameter.
2024-09-25 18:17:48 +00:00
Steve Canny
3bab9d93e6
rfctr(part): prepare for pluggable auto-partitioners 1 (#3655)
**Summary**
In preparation for pluggable auto-partitioners simplify metadata as
discussed.

**Additional Context**
- Pluggable auto-partitioners requires partitioners to have a consistent
call signature. An arbitrary partitioner provided at runtime needs to
have a call signature that is known and consistent. Basically
`partition_x(filename, *, file, **kwargs)`.
- The current `auto.partition()` is highly coupled to each distinct
file-type partitioner, deciding which arguments to forward to each.
- This is driven by the existence of "delegating" partitioners, those
that convert their file-type and then call a second partitioner to do
the actual partitioning. Both the delegating and proxy partitioners are
decorated with metadata-post-processing decorators and those decorators
are not idempotent. We call the situation where those decorators would
run twice "double-decorating". For example, EPUB converts to HTML and
calls `partition_html()` and both `partition_epub()` and
`partition_html()` are decorated.
- The way double-decorating has been avoided in the past is to avoid
sending the arguments the metadata decorators are sensitive to to the
proxy partitioner. This is very obscure, complex to reason about,
error-prone, and just overall not a viable strategy. The better solution
is to not decorate delegating partitioners and let the proxy partitioner
handle all the metadata.
- This first step in preparation for that is part of simplifying the
metadata processing by removing unused or unwanted legacy parameters.
- `date_from_file_object` is a misnomer because a file-object never
contains last-modified data.
- It can never produce useful results in the API where last-modified
information must be provided by `metadata_last_modified`.
- It is an undocumented parameter so not in use.
- Using it can produce incorrect metadata.
2024-09-23 22:23:10 +00:00
Christine Straub
0ed69a1ac3
refactor: pdfminer image cleanup (#3648)
This PR aims to remove `clean_pdfminer_duplicate_image_elements()`
function, as its functionality has already been integrated into the
`remove_duplicate_elements()` function in [PR
#3630](https://github.com/Unstructured-IO/unstructured/pull/3630).
2024-09-19 18:57:02 +00:00
Christine Straub
be88eef06f
perf: optimize pdfminer image cleanup process for improved performance (#3630)
This PR enhances `pdfminer` image cleanup process by repositioning the
duplicate image removal step. It optimizes the removal of duplicated
pdfminer images by performing the cleanup before merging elements,
rather than after. This improvement reduces execution time and enhances
the overall processing speed of PDF documents.

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2024-09-19 14:05:05 +00:00
Christine Straub
87a88a3c87
feat: improve pdfminer element processing (#3618)
This PR implements splitting of `pdfminer` elements (`groups of text
chunks`) into smaller bounding boxes (`text lines`). This implementation
prevents loss of information from the object detection model and
facilitates more effective removal of duplicated `pdfminer` text. This
PR also addresses #3430.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-09-12 21:17:27 +00:00
Yao You
d51fb134e6
Feat/improve iou speed (#3582)
This PR vectorizes the computation of element overlap to speed up
deduplication process of extracted elements.

## test

This PR adds unit test to the new vectorized IOU and subregion
computation functions.

In addition, running partition on large files with many elements like
this slide:

[002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf)

shows a reduction of runtime from around 15min on the main branch to
less than 4min with this branch.

Profiling results show that the new implementation greatly reduces the
time cost of computation and now most of the time is spend on getting
the coordinates from a list of bboxes.

![Screenshot 2024-08-30 at 9 29
27 PM](https://github.com/user-attachments/assets/6c186838-54c7-483b-ac3e-7342c23ff3a6)
2024-09-03 00:06:18 +00:00
Pawel Kmiecik
404f780bbb
feat: make analysis drawing more flexible (#3574)
This PR changes the way the analysis tools can be used:
- by default if `analysis` is set to `True` in `partition_pdf` and the
strategy is resolved to `hi_res`:
- for each file 4 layout dumps are produced and saved as JSON files
(`object_detection`, `extracted`, `ocr`, `final`) - similar way to the
current `object_detection` dump
- the drawing functions/classes now accept these dumps accordingly
instead of the internal classes instances (like `TextRegion`,
`DocumentLayout`
- it makes it possible to use the lightweight JSON files to render the
bboxes of a given file after the partition is done
- `_partition_pdf_or_image_local` has been refactored and most of the
analysis code is now encapsulated in `save_analysis_artifiacts` function
- to do this, helper function `render_bboxes_for_file` is added
<img width="338" alt="Screenshot 2024-08-28 at 14 37 56"
src="https://github.com/user-attachments/assets/10b6fbbd-7824-448d-8c11-52fc1b1b0dd0">
2024-09-02 11:06:11 +00:00
Christine Straub
fc26426310
feat: replace pytesseract with unstructured.pytesseract fork (#3528)
This PR reverts `pytesseract` dependency to `unstructured.pytesseract`
fork due to the unavailability of some recent release versions of
`pytesseract` on PyPI.

This PR also addresses an issue encountered during the publication of
`unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI
does not allow direct dependencies from Version Control System URLs like
GitHub in the `install_requires` or `extras_require` sections of the
`setup.py` file.
2024-08-16 10:34:22 -04:00
Jake Zerrer
051be5aead
Remove unstructured.pytesseract fork (#3454)
A second attempt at
https://github.com/Unstructured-IO/unstructured/pull/3360, this PR
removes unstructured's dependency on its own fork of `pytesseract`. (The
original reason for the fork, the addition of
`run_and_get_multiple_output`, was removed
[here](https://github.com/madmaze/pytesseract/releases/tag/v0.3.12).)

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-08-09 04:28:48 +00:00
Maciej Kurzawa
b749b891a7
fix: disabled checking max pages for images (#3473)
Added fix related to
https://github.com/Unstructured-IO/unstructured/pull/3431, which
disables checking max pages for images
2024-08-02 14:25:08 +00:00
Maciej Kurzawa
8fd216cc9f
feat/pdf-page-limit-in-hi-res (#3431)
# Description:
Passing `max_pages` argument allows rejecting pdf files which exceeds
this page number limit while `high_res` strategy is chosen. By default
it will allow parsing pdf files with unlimited number of pages.

# Testing:
```python
from unstructured.partition.auto import partition

elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res')  # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=4)  # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=2)  # should raise PdfMaxPagesExceededError
```
2024-07-30 16:52:17 +00:00
Christine Straub
0eb461acc2
refactor: restructure PDF/Image example document organization (#3410)
This PR aims to improve the organization and readability of our example
documents used in unit tests, specifically focusing on PDF and image
files.

### Summary
- Created two new subdirectories in the `example-docs` folder:
  - `pdf/`: for all PDF example files
  - `img/`: for all image example files
- Moved relevant PDF files from `example-docs/` to `example-docs/pdf/`
- Moved relevant image files from `example-docs/` to `example-docs/img/`
- Updated file paths in affected unit & ingest tests to reflect the new
directory structure

### Testing
All unit & ingest tests should be updated and verified to work with the
new file structure.

## Notes
Other file types (e.g., office documents, HTML files) remain in the root
of `example-docs/` for now.

## Next Steps
Consider similar reorganization for other file types if this structure
proves to be beneficial.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-07-18 22:21:32 +00:00
Christine Straub
48bdf94656
feat: partition_pdf() support language specification for PaddleOCR (#3400)
Closes #3159.

This PR extends language specification capability to `PaddleOCR` in
addition to `TesseractOCR`. Users can now specify OCR languages for both
OCR engines when using `partition_pdf()`.

### Testing

```
os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"

elements = partition_pdf(
    filename=<file_path>,
    strategy=strategy,
    languages=["chi_sim"], # chinese - simplified
    infer_table_structure=True,
)
```
2024-07-16 22:19:25 +00:00
Steve Canny
d48fa3b163
rfctr(auto): improve typing and organize auto tests (#3355)
**Summary**
In preparation for further work on auto-partitioning (`partition()`),
improve typing and organize `test_auto.py` by introducing categories.
2024-07-08 21:25:17 +00:00
Pawel Kmiecik
575957b2d2
feat: enhance analysis options with od model dump and better vis (#3234)
This PR adds new capabilities for drawing bboxes for each layout
(extracted, inferred, ocr and final) + OD model output dump as a json
file for better analysis.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: Michal Martyniak <michal.martyniak@deepsense.ai>
2024-06-26 13:14:55 +00:00
Austin Walker
0b73978b92
fix: fix IndexError when partioning a pdf with starting_page_number (#3246)
The Issue:

When extracting images from pdfs, we use the metadata page number to
index into a list of the images. However, the metadata page number can
now be changed via `starting_page_number`. To get the true page index,
we need to subtract this value.

Testing:

Run this snippet in a python shell. Before the fix, this throws an
IndexError. On this branch, it will return the elements.
```
from unstructured.partition.auto import partition
filename = "example-docs/layout-parser-paper-with-table.pdf"
partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20)
```

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-06-19 18:20:54 +00:00
Christine Straub
9552fbbfbf
chore: bump unstructured-inference 0.7.35 (#3205)
### Summary
- bump unstructured-inference to `0.7.35` which fixed syntax for
generated HTML tables
- update unit tests and ingest test fixtures to reflect changes in the
generated HTML tables
- cut a release for `0.14.6`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-06-14 18:11:38 +00:00
Christine Straub
f4457249a7
fix: partition_pdf() removes spaces from the text (#3106)
Closes #2896.

This PR aims to fix `partition_pdf()` to keep spaces in text. The
control character `\t` is now replaced with a space instead of being
removed when merging inferred and embedded elements.

### Testing
PDF:
[rok_20230930_1-1.pdf](https://github.com/Unstructured-IO/unstructured/files/15001636/rok_20230930_1-1.pdf)
```
elements = partition_pdf(
    filename="rok_20230930_1-1.pdf",
    strategy="hi_res",
)

print(str(elements[20]))
```
**Results:**
- PR
```
Name of each exchange on which registered New York Stock Exchange
```
- main branch
```
Nameofeachexchangeonwhichregistered NewYorkStockExchange
```
2024-05-29 04:53:17 +00:00
Christine Straub
b0d8a779da
feat: partiton_pdf() set inferred elements text (#3061)
This PR adds the ability to fill inferred elements text from embedded
text (`pdfminer`) without depending on `unstructured-inference` library.
This PR is the second part of moving embedded text related code from
`unstructured-inference` to `unstructured` and works together with
https://github.com/Unstructured-IO/unstructured-inference/pull/349.
2024-05-21 19:43:38 +00:00
Christine Straub
76831f154b
refactor: partition_pdf() pass kwargs through fast strategy pipeline (#3040)
This PR aims to pass `kwargs` through `fast` strategy pipeline, which
was missing as part of the previous PR -
https://github.com/Unstructured-IO/unstructured/pull/3030.
I also did some code refactoring in this PR, so I recommend reviewing
this PR commit by commit.

### Summary
- pass `kwargs` through `fast` strategy pipeline, which will allow users
to specify additional params like `sort_mode`
- refactor: code reorganization
- cut a release for `0.14.0`
### Testing
CI should pass
2024-05-17 20:55:11 +00:00
amadeusz-ds
1c8b2b23eb
feat: add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR config parameteres (#3014)
This PR introduces GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR
controlling where temporary files are stored during partition flow, via
tempfile.tempdir.

#### Edit:
Renamed prefixes from STORAGE_ to UNSTRUCTURED_CACHE_

#### Edit 2:
Renamed prefixes from UNSTRUCTURED_CACHE to GLOBAL_WORKING_DIR_
2024-05-17 19:16:10 +00:00
Christine Straub
1fb0fe5cf5
enhancement: partitoin_pdf() skip unnecessary element sorting (#3030)
This PR aims to skip element sorting when determining whether embedded
text can be extracted. The extracted elements in this step are returned
as final elements only for the `fast` strategy pipeline and are never
used for other strategy pipelines (`hi_res`, `ocr`).
Removing element sorting in this step and adding it to the `fast`
strategy pipeline later will improve performance and reduce execution
time.

### Summary
- skip element sorting when determining whether embedded text can be
extracted.
- add `_partition_pdf_with_pdfparser()` function for fast` strategy
pipeline

### Testing
CI should pass.
2024-05-16 06:02:56 +00:00
John
d829b669e6
Add starting_page_num param to partition_image (#2987)
Add missing starting_page_num param to partition_image

Closes #2985
2024-05-09 21:31:35 +00:00
Christine Straub
0cd07d78f9
feat: parition_pdf() add ability to get cid ratio (#2970)
This PR adds the ability to get the ratio of `cid` characters in
embedded text extracted by `pdfminer`. This PR is the second part of
moving `cid` related code from `unstructured-inference` to
`unstructured` and works together with
https://github.com/Unstructured-IO/unstructured-inference/pull/342.
2024-05-04 05:21:27 +00:00
Michał Martyniak
7720e72424
Fix: avoid elements sharing the same memory address (#2940)
This PR attempts to fix a memory issue, which resulted in errors like
this: https://github.com/Unstructured-IO/unstructured/issues/2931
The root cause seems to be in how ListItems are being combined, not in
how hashes or parent IDs are updated.

When `assign_and_map_hash_ids()` is called and elements (or elements'
metadata) do not have unique memory addresses, then updating the
parent_id of one element will also overwrite the parent_id of some other
element.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2024-04-28 19:15:17 -07:00
Michał Martyniak
2d1923ac7e
Better element IDs - deterministic and document-unique hashes (#2673)
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more test for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)

This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
2024-04-24 00:05:20 -07:00
Dimitri Lozeve
abb0174181
Integration with the Google Cloud Vision API (#2902)
This PR adds a third OCR provider, alongside Tesseract and Paddle: the
[Google Cloud Vision API](https://cloud.google.com/vision).

It can be used similarly to other OCR methods: set the `OCR_AGENT`
environment variable to the path to the OCR module
(`unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`).
You also need to set the credentials to use Google APIs, for instance by
setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-04-23 21:11:39 +00:00
Christine Straub
ac5048bf30
enhancement: remove duplicate embedded images (#2897)
This PR aims to remove duplicate embedded images taken by `PDFminer`.

### Summary
- add `clean_pdfminer_duplicate_image_elements()` to remove embedded
images with similar `bboxes` and the same `text`
- add env_config `EMBEDDED_IMAGE_SAME_REGION_THRESHOLD` to consider the
bounding boxes of two embedded images as the same region
- refactor: reorganzie `clean_pdfminer_inner_elements()`
2024-04-18 23:07:47 +00:00
Michał Martyniak
cb1e91058e
Introduce start_page argument to partitioning functions that assign element.metadata.page_number (#2884)
This small change will be useful for users who partition only fragments
of their PDF documents.
It's a small step towards addressing this issue:
https://github.com/Unstructured-IO/unstructured/issues/2461

Related PRs:
* https://github.com/Unstructured-IO/unstructured/pull/2842
* https://github.com/Unstructured-IO/unstructured/pull/2673
2024-04-15 21:03:42 +00:00
Christine Straub
887e6c9094
refactor: use env_config instead of SUBREGION_THRESHOLD_FOR_OCR constant (#2697)
The purpose of this PR is to introduce a new env_config for the
subregion threshold for OCR.

### Testing
CI should pass.
2024-03-28 20:28:35 +00:00