Mirror of https://github.com/Unstructured-IO/unstructured.git (synced 2025-12-12 15:42:19 +00:00)
fix: Pass partition_image kwargs downstream (#1426)
`partition_pdf` allows for passing a `model_name` parameter. Given the
similarity between the image and PDF pipelines, the expected behavior is
that `partition_image` should support the same parameter, but
`partition_image` was unintentionally not passing along its `kwargs`.
This was corrected by adding the kwargs to the downstream call.
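The bug pattern described above can be illustrated with a minimal sketch. The three functions below are hypothetical stand-ins written for this example, not the actual `unstructured` functions; they only show how a wrapper that accepts `**kwargs` but does not forward them silently drops `model_name`.

```python
# Hypothetical stand-ins for illustration only; not the real unstructured API.
def _partition_local(filename, strategy="hi_res", model_name=None, **kwargs):
    """Stand-in for the shared downstream call; reports which model it would use."""
    return {
        "filename": filename,
        "strategy": strategy,
        "model_name": model_name or "default",
    }


def partition_image_before(filename, strategy="hi_res", **kwargs):
    # Bug pattern: kwargs are accepted but never forwarded, so model_name is lost.
    return _partition_local(filename, strategy=strategy)


def partition_image_after(filename, strategy="hi_res", **kwargs):
    # Fix pattern: forward the remaining keyword arguments downstream.
    return _partition_local(filename, strategy=strategy, **kwargs)


before = partition_image_before("page.jpg", model_name="yolox")
after = partition_image_after("page.jpg", model_name="yolox")
print(before["model_name"])  # the requested model was ignored
print(after["model_name"])   # the requested model was passed through
```

In the sketch, the "before" wrapper raises no error when given `model_name`, which is why this kind of omission can go unnoticed until outputs from different models are compared.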
#### Testing:
```python
from unstructured.partition.image import partition_image
output1 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="detectron2_onnx")
output2 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="yolox")
# These shouldn't be the same, since they were produced using different models.
assert output1 != output2
```
The assertion should fail on `main`, but pass on this branch.
parent fe11ab4235
commit 0d61c98481
@@ -34,6 +34,7 @@
 
 ### Fixes
 
+* **Selecting a different model wasn't being respected when calling `partition_image`.** Problem: `partition_pdf` allows for passing a `model_name` parameter. Given the similarity between the image and PDF pipelines, the expected behavior is that `partition_image` should support the same parameter, but `partition_image` was unintentionally not passing along its `kwargs`. This was corrected by adding the kwargs to the downstream call.
 * **Fixes a chunking issue via dropping the field "coordinates".** Problem: the `chunk_by_title` function was placing each element in its own chunk when it should have grouped elements into fewer chunks. This happens because of the metadata-matching logic in `chunk_by_title`: elements with different metadata can't be put into the same chunk, and any element with "coordinates" effectively has different metadata from every other element, since each element is located in a different place and has different coordinates. Fix: the key "coordinates" is now included in the list of excluded metadata keys when doing the "metadata_matches" comparison. Importance: this change is crucial for chunking by title in documents whose elements include "coordinates" metadata.
 
 ## 0.10.14
@@ -426,3 +426,14 @@ def test_add_chunking_strategy_on_partition_image(
     chunks = chunk_by_title(elements)
     assert chunk_elements != elements
     assert chunk_elements == chunks
+
+
+def test_partition_image_uses_model_name():
+    with mock.patch.object(
+        pdf,
+        "_partition_pdf_or_image_local",
+    ) as mockpartition:
+        image.partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="test")
+        print(mockpartition.call_args)
+        assert "model_name" in mockpartition.call_args.kwargs
+        assert mockpartition.call_args.kwargs["model_name"]
@@ -851,3 +851,19 @@ def test_combine_numbered_list(filename):
             break
     assert len(elements) < 28
     assert first_list_element.text.endswith("(Section 3)")
+
+
+def test_partition_pdf_uses_model_name():
+    with mock.patch.object(
+        pdf,
+        "_partition_pdf_or_image_local",
+    ) as mockpartition:
+        pdf.partition_pdf(
+            "example-docs/layout-parser-paper-fast.pdf",
+            model_name="test",
+            strategy="hi_res",
+        )
+
+        mockpartition.assert_called_once()
+        assert "model_name" in mockpartition.call_args.kwargs
+        assert mockpartition.call_args.kwargs["model_name"]
@@ -81,4 +81,5 @@ def partition_image(
         languages=languages,
         strategy=strategy,
         metadata_last_modified=metadata_last_modified,
+        **kwargs,
     )
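The forwarding check used by the tests in this commit can be tried without running any model. The sketch below uses only the standard library; `backend` and `partition` are hypothetical names introduced for illustration and do not exist in `unstructured`. It patches the downstream callable with `mock.patch.object` and inspects `call_args.kwargs` (available since Python 3.8) to confirm the keyword argument arrived with the expected value.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical downstream backend; stands in for the patched local partition call.
backend = SimpleNamespace(run=lambda filename, **kwargs: [])


def partition(filename, **kwargs):
    # Hypothetical wrapper that forwards its keyword arguments downstream.
    return backend.run(filename, **kwargs)


with mock.patch.object(backend, "run") as mock_run:
    partition("page.jpg", model_name="test")

# The mock keeps its call record after the patch is undone.
mock_run.assert_called_once()
forwarded = mock_run.call_args.kwargs
assert forwarded["model_name"] == "test"
```

Comparing against the exact value is slightly stricter than a truthiness check, since it would also catch a wrapper that substitutes some other non-empty model name.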