2 Commits

Author SHA1 Message Date
Sri Sudarsan
349728162e
Matches prefix to verify presence of DOCX,PPTX,XLSX files instead of standard file names (#3959)
Instead of looking for presence of `word/document.xml` ,
`ppt/presentation.xml` and `xl/workbook.xml` to identify DOCX,PPTX and
XLSX files, we look for prefix `word/document*.xml`,
`ppt/presentation*.xml` and `xl/workbook*.xml` as certain files
generated from office365 has files with different names.
Fixes https://github.com/Unstructured-IO/unstructured/issues/3937

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2025-03-21 16:27:13 +00:00
Steve Canny
4379d883a3
chunk: relax table segregation during chunking (#3812)
**Summary**
Relax table-segregation rule applied during chunking such that a `Table`
and `Text`-subtype elements can be combined into a single chunk when the
chunking window allows.

**Additional Context**
Until now, `Table` elements have always been segregated during chunking,
i.e. a chunk that contained a table would never contain any other
element. In certain scenarios, especially when a large chunking window
of say 2000 characters is used, this behavior can reduce retrieval
effectiveness by isolating the table from surrounding context.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-12-09 18:57:22 +00:00