mirror of https://github.com/Unstructured-IO/unstructured.git (synced 2025-10-31 01:54:25 +00:00)

	feat: include text from shapes in docx (#2510)
Reported bug: Text from docx shapes is not included in the `partition` output.

Fix: Extend docx partition to search for text tags nested inside structures responsible for creating the shape.

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
parent 51427b3103
commit f048695a55
@@ -28,6 +28,7 @@
 * **Add .heic file partitioning** .heic image files were previously unsupported and are now supported though partition_image()
 * **Add the ability to specify an alternate OCR** implementation by implementing an `OCRAgent` interface and specify it using `OCR_AGENT` environment variable.
 * **Add Vectara destination connector** Adds support for writing partitioned documents into a Vectara index.
+* **Add ability to detect text in .docx inline shapes** extensions of docx partition, extracts text from inline shapes and includes them in paragraph's text
 
 ### Fixes
 
@@ -41,6 +42,7 @@
 * **Add title to Vectara upload - was not separated out from initial connector **
 * **Fix change OpenSearch port to fix potential conflict with Elasticsearch in ingest test **
 
+
 ## 0.12.3
 
 ### Enhancements
							
								
								
									
										
BIN  example-docs/docx-shapes.docx  (Normal file)
Binary file not shown.
@@ -764,6 +764,20 @@ def test_partition_docx_includes_hyperlink_metadata():
     assert metadata.link_urls is None
 
 
+# -- shape behaviors -----------------------------------------------------------------------------
+
+
+def test_it_considers_text_inside_shapes():
+    # -- <bracketed> text is written inside inline shapes --
+    partitioned_doc = partition_docx(example_doc_path("docx-shapes.docx"))
+    assert [element.text for element in partitioned_doc] == [
+        "Paragraph with single <inline-image> within.",
+        "Paragraph with <inline-image1> and <inline-image2> within.",
+        # -- text "<floating-shape>" in floating shape is ignored --
+        "Paragraph with floating shape attached.",
+    ]
+
+
 # -- module-level fixtures -----------------------------------------------------------------------
 
 
@@ -330,7 +330,12 @@ class _DocxPartitioner:
         does not contribute to the document-element stream and will not cause an element to be
         emitted.
         """
-        text = paragraph.text
+        text = "".join(
+            e.text
+            for e in paragraph._p.xpath(
+                "w:r | w:hyperlink | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"
+            )
+        )
 
         # NOTE(scanny) - blank paragraphs are commonly used for spacing between paragraphs and
         # do not contribute to the document-element stream.
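
Illustrative sketch (not part of the commit): the XPath in the hunk above appears to collect the paragraph's own runs and hyperlinks, plus the runs nested in the first `wp:inline` shape inside each run's `w:drawing`. The snippet below approximates that traversal with the standard library only. The helper names (`run_text`, `paragraph_text`, `sample_paragraph`) and the heavily simplified sample XML are assumptions made for illustration. The commit itself evaluates the XPath through python-docx, whereas ElementTree has no `ancestor::` axis or `|` union, so the logic is spelled out as loops.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
NS = {"w": W_NS, "wp": WP_NS}


def run_text(run):
    """Join the text of the w:t elements that are direct children of a run."""
    return "".join(t.text or "" for t in run.findall("w:t", NS))


def paragraph_text(p):
    """Approximate the commit's XPath: own runs, hyperlinks, inline-shape runs."""
    parts = []
    for child in p:
        if child.tag == f"{{{W_NS}}}r":
            parts.append(run_text(child))
            # Only the first inline shape under a drawing is considered,
            # mirroring the [1] predicate in the XPath.
            inline = child.find(".//w:drawing//wp:inline", NS)
            if inline is not None:
                parts.extend(run_text(r) for r in inline.iterfind(".//w:r", NS))
        elif child.tag == f"{{{W_NS}}}hyperlink":
            parts.extend(run_text(r) for r in child.iterfind("w:r", NS))
    return "".join(parts)


def sample_paragraph(shape_tag, before, shape_text, after):
    """Build a simplified paragraph (hypothetical; real shapes nest deeper)."""
    xml = (
        f'<w:p xmlns:w="{W_NS}" xmlns:wp="{WP_NS}">'
        f"<w:r><w:t>{before}</w:t></w:r>"
        f"<w:r><w:drawing><{shape_tag}><w:txbxContent><w:p><w:r>"
        f"<w:t>{shape_text}</w:t>"
        f"</w:r></w:p></w:txbxContent></{shape_tag}></w:drawing></w:r>"
        f"<w:r><w:t>{after}</w:t></w:r>"
        "</w:p>"
    )
    return ET.fromstring(xml)


inline_p = sample_paragraph(
    "wp:inline", "Paragraph with single ", "&lt;inline-image&gt;", " within."
)
floating_p = sample_paragraph(
    "wp:anchor", "Paragraph with floating shape attached.", "&lt;floating-shape&gt;", ""
)

print(paragraph_text(inline_p))    # inline shape text is included
print(paragraph_text(floating_p))  # floating (wp:anchor) shape text is skipped
```

Floating shapes are stored under `wp:anchor` rather than `wp:inline`, which is consistent with the test above expecting the `<floating-shape>` text to be ignored.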