mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-03 07:05:20 +00:00
feat: include text from shapes in docx (#2510)
Reported bug: Text from docx shapes is not included in the `partition` output.

Fix: Extend docx partition to search for text tags nested inside structures responsible for creating the shape.

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
This commit is contained in:
parent
51427b3103
commit
f048695a55
@@ -28,6 +28,7 @@
 * **Add .heic file partitioning** .heic image files were previously unsupported and are now supported though partition_image()
 * **Add the ability to specify an alternate OCR** implementation by implementing an `OCRAgent` interface and specify it using `OCR_AGENT` environment variable.
 * **Add Vectara destination connector** Adds support for writing partitioned documents into a Vectara index.
+* **Add ability to detect text in .docx inline shapes** extensions of docx partition, extracts text from inline shapes and includes them in paragraph's text
 
 ### Fixes
 
@@ -41,6 +42,7 @@
 * **Add title to Vectara upload - was not separated out from initial connector **
 * **Fix change OpenSearch port to fix potential conflict with Elasticsearch in ingest test **
+
 
 ## 0.12.3
 
 ### Enhancements
BIN
example-docs/docx-shapes.docx
Normal file
Binary file not shown.
@@ -764,6 +764,20 @@ def test_partition_docx_includes_hyperlink_metadata():
     assert metadata.link_urls is None
 
 
+# -- shape behaviors -----------------------------------------------------------------------------
+
+
+def test_it_considers_text_inside_shapes():
+    # -- <bracketed> text is written inside inline shapes --
+    partitioned_doc = partition_docx(example_doc_path("docx-shapes.docx"))
+    assert [element.text for element in partitioned_doc] == [
+        "Paragraph with single <inline-image> within.",
+        "Paragraph with <inline-image1> and <inline-image2> within.",
+        # -- text "<floating-shape>" in floating shape is ignored --
+        "Paragraph with floating shape attached.",
+    ]
+
+
 # -- module-level fixtures -----------------------------------------------------------------------
 
 
@@ -330,7 +330,12 @@ class _DocxPartitioner:
         does not contribute to the document-element stream and will not cause an element to be
         emitted.
         """
-        text = paragraph.text
+        text = "".join(
+            e.text
+            for e in paragraph._p.xpath(
+                "w:r | w:hyperlink | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"
+            )
+        )
 
         # NOTE(scanny) - blank paragraphs are commonly used for spacing between paragraphs and
         # do not contribute to the document-element stream.
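To illustrate what the new XPath expression in `_DocxPartitioner` selects, here is a minimal standard-library sketch. It is an approximation, not the code from this commit: `xml.etree.ElementTree` does not support the `|` union or the `ancestor::` axis, so the selection is written out as a loop. The names `run_text`, `paragraph_text`, and `make_paragraph` are hypothetical, and the sample XML is heavily simplified (a real document nests `a:graphic`, `wps:txbx`, and similar elements between `wp:inline` and `w:txbxContent`).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
RUN, TEXT, HYPERLINK, DRAWING = (f"{{{W}}}{tag}" for tag in ("r", "t", "hyperlink", "drawing"))
INLINE = f"{{{WP}}}inline"


def run_text(run):
    """Join the text of the w:t elements that are direct children of one run."""
    return "".join(t.text or "" for t in run.findall(TEXT))


def paragraph_text(p):
    """Hypothetical helper: paragraph text, including runs nested in inline shapes."""
    parts = []
    for child in p:  # direct children of w:p, in document order
        if child.tag == HYPERLINK:
            parts.extend(run_text(r) for r in child.iter(RUN))
        elif child.tag == RUN:
            parts.append(run_text(child))
            # First wp:inline below a w:drawing in this run. A floating shape uses
            # wp:anchor instead of wp:inline, so it never matches and is skipped.
            inline = next((i for d in child.iter(DRAWING) for i in d.iter(INLINE)), None)
            if inline is not None:
                parts.extend(run_text(r) for r in inline.iter(RUN))
    return "".join(parts)


def make_paragraph(shape_tag, shape_text, before, after):
    """Build a simplified w:p whose middle run holds a shape containing text."""
    xml = (
        f'<w:p xmlns:w="{W}" xmlns:wp="{WP}">'
        f"<w:r><w:t>{escape(before)}</w:t></w:r>"
        f"<w:r><w:drawing><wp:{shape_tag}><w:txbxContent><w:p><w:r>"
        f"<w:t>{escape(shape_text)}</w:t>"
        f"</w:r></w:p></w:txbxContent></wp:{shape_tag}></w:drawing></w:r>"
        f"<w:r><w:t>{escape(after)}</w:t></w:r>"
        "</w:p>"
    )
    return ET.fromstring(xml)


inline_p = make_paragraph("inline", "<inline-image>", "Paragraph with single ", " within.")
floating_p = make_paragraph("anchor", "<floating-shape>", "Paragraph with floating shape", " attached.")

print(paragraph_text(inline_p))
print(paragraph_text(floating_p))
```

The two sample paragraphs mirror the cases in `test_it_considers_text_inside_shapes`: text inside an inline shape is spliced into the paragraph at the position of the run that holds the drawing, while text inside a floating shape is left out.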