mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-12-31 01:03:30 +00:00
**Summary**
Relax table-segregation rule applied during chunking such that a `Table` and `Text`-subtype elements can be combined into a single chunk when the chunking window allows.

**Additional Context**
Until now, `Table` elements have always been segregated during chunking, i.e. a chunk that contained a table would never contain any other element. In certain scenarios, especially when a large chunking window of say 2000 characters is used, this behavior can reduce retrieval effectiveness by isolating the table from surrounding context.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
33 lines
1.3 KiB
JSON
[
  {
    "type": "Table",
    "element_id": "ca96108263324e9d865a98f19cf7c940",
    "text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
    "metadata": {
      "category_depth": 1,
      "page_number": 1,
      "parent_id": "747587de72444235a68c768d544ff5f3",
      "text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
      "languages": [
        "eng"
      ],
      "filetype": "text/html"
    }
  },
  {
    "type": "Text",
    "element_id": "0163a58539934b3aaca402c9e961b0d6",
    "text": "REQUEST FOR PROPOSALS",
    "metadata": {
      "category_depth": 1,
      "page_number": 1,
      "parent_id": "747587de72444235a68c768d544ff5f3",
      "text_as_html": "<h2 class=\"Subtitle\" id=\"0163a58539934b3aaca402c9e961b0d6\">REQUEST FOR PROPOSALS </h2>",
      "languages": [
        "eng"
      ],
      "filetype": "text/html"
    }
  }
]
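The relaxed rule in the summary can be illustrated with a toy greedy packer run over the two elements in this fixture. This is only a sketch under stated assumptions, not the library's chunking implementation: `combine_elements` is a hypothetical helper, elements are plain dicts rather than element objects, the separator is assumed to be a blank line, and an element longer than the window is kept whole rather than split.

```python
# Hypothetical helper for illustration only; not part of the unstructured API.
def combine_elements(elements, max_characters=2000, separator="\n\n"):
    """Greedily pack element texts into chunks of at most max_characters.

    Tables get no special treatment here: a Table and a neighboring Text
    element share a chunk whenever their joined text fits in the window.
    """
    chunks = []
    current = []      # texts collected for the chunk being built
    current_len = 0   # length of those texts once joined by separator
    for element in elements:
        text = element["text"]
        added = len(text) + (len(separator) if current else 0)
        if current and current_len + added > max_characters:
            # Window would overflow: close the current chunk, start a new one.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            added = len(text)
        current.append(text)
        current_len += added
    if current:
        chunks.append(separator.join(current))
    return chunks


# The two elements from the fixture above, reduced to type and text.
elements = [
    {
        "type": "Table",
        "text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP "
                "RFP Due Date and Time: Number of Pages: #189 "
                "05/30/2024 by 5:00pm Central Time",
    },
    {"type": "Text", "text": "REQUEST FOR PROPOSALS"},
]

large_window = combine_elements(elements, max_characters=2000)
small_window = combine_elements(elements, max_characters=140)
```

With the 2000-character window both elements land in one chunk, so the table keeps its surrounding context. With the 140-character window the joined text no longer fits, and the table ends up alone in its own chunk.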