Steve Canny 4379d883a3
chunk: relax table segregation during chunking (#3812)
**Summary**
Relax the table-segregation rule applied during chunking so that a `Table`
element and `Text`-subtype elements can be combined into a single chunk when
the chunking window allows.

**Additional Context**
Until now, `Table` elements have always been segregated during chunking,
i.e. a chunk that contained a table would never contain any other
element. In certain scenarios, especially when a large chunking window
of, say, 2000 characters is used, this behavior can reduce retrieval
effectiveness by isolating the table from its surrounding context.
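
To make the difference concrete, here is a minimal, self-contained sketch of greedy chunking with and without table segregation. It is not the library's implementation: the `Element` class, the `chunk` function, and the `segregate_tables` flag are hypothetical names introduced only for illustration, and the one-character separator between elements is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""

    type: str
    text: str


def chunk(elements, max_characters=2000, segregate_tables=False):
    """Greedily pack elements into chunks of at most `max_characters`.

    An element longer than the window is placed alone in its own chunk;
    this sketch does not split oversized elements.
    """
    chunks, current, size = [], [], 0
    for el in elements:
        # Cost of adding this element, assuming a 1-character separator.
        added = len(el.text) + (1 if current else 0)
        # Old rule: a table never shares a chunk with any other element.
        isolate = segregate_tables and (
            el.type == "Table" or any(e.type == "Table" for e in current)
        )
        if current and (isolate or size + added > max_characters):
            chunks.append(current)
            current, size = [], 0
            added = len(el.text)
        current.append(el)
        size += added
    if current:
        chunks.append(current)
    return chunks


elements = [
    Element("Table", "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP"),
    Element("Text", "REQUEST FOR PROPOSALS"),
]

strict = chunk(elements, segregate_tables=True)    # table isolated
relaxed = chunk(elements, segregate_tables=False)  # table and text combined
```

Under the strict rule the two elements land in separate chunks; under the relaxed rule they share one chunk as long as their combined length fits the window.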

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-12-09 18:57:22 +00:00

[
  {
    "type": "Table",
    "element_id": "ca96108263324e9d865a98f19cf7c940",
    "text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
    "metadata": {
      "category_depth": 1,
      "page_number": 1,
      "parent_id": "747587de72444235a68c768d544ff5f3",
      "text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
      "languages": [
        "eng"
      ],
      "filetype": "text/html"
    }
  },
  {
    "type": "Text",
    "element_id": "0163a58539934b3aaca402c9e961b0d6",
    "text": "REQUEST FOR PROPOSALS",
    "metadata": {
      "category_depth": 1,
      "page_number": 1,
      "parent_id": "747587de72444235a68c768d544ff5f3",
      "text_as_html": "<h2 class=\"Subtitle\" id=\"0163a58539934b3aaca402c9e961b0d6\">REQUEST FOR PROPOSALS </h2>",
      "languages": [
        "eng"
      ],
      "filetype": "text/html"
    }
  }
]