Michał Martyniak 2d1923ac7e
Better element IDs - deterministic and document-unique hashes (#2673)
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:
* the hash computation now includes the element's sequence number on
the page, the page number, the document filename, and the element's text
* there are more tests for the deterministic behavior of IDs returned by
partitioning functions and for their uniqueness (guaranteed at the
document level, and highly probable across multiple documents)
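
The hashing idea in the first bullet can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the function name `sketch_element_id`, the choice of SHA-256, the JSON encoding of the inputs, and the truncation to 32 hex characters are all assumptions, and the sketch is not expected to reproduce the IDs in the fixture below.

```python
import hashlib
import json
from typing import Optional


def sketch_element_id(
    text: str,
    sequence_number: int,
    page_number: Optional[int],
    filename: Optional[str],
) -> str:
    """Hypothetical helper: derive a deterministic ID from the four inputs
    named in the commit message (sequence number, page number, filename, text)."""
    # Encode the inputs as a JSON array so that field boundaries are unambiguous
    # (an assumption of this sketch, not taken from the source).
    payload = json.dumps([filename, page_number, sequence_number, text])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # Truncate to 32 hex characters to match the ID length seen in the fixture.
    return digest[:32]


first = sketch_element_id("Call with Testing Ipsum", 0, None, "41286477879.txt")
second = sketch_element_id("Call with Testing Ipsum", 1, None, "41286477879.txt")
print(first == second)  # identical text, different position on the page
```

Because the sequence number is part of the hashed payload, two elements with identical text in the same document still receive different IDs, which is what makes document-level uniqueness possible.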

This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
2024-04-24 00:05:20 -07:00

[
  {
    "element_id": "7997025526d4d565f2442e6c10be4c3d",
    "metadata": {
      "data_source": {
        "date_created": "2023-10-16T22:37:02.481000+00:00",
        "date_modified": "2023-10-16T22:37:07.918000+00:00",
        "record_locator": {
          "hubspot_id": "41286477879"
        }
      },
      "filename": "41286477879.txt",
      "filetype": "text/plain",
      "languages": [
        "eng"
      ]
    },
    "text": "Call with Testing Ipsum",
    "type": "Title"
  },
  {
    "element_id": "1111265d062cb14df259b8a212466554",
    "metadata": {
      "data_source": {
        "date_created": "2023-10-16T22:37:02.481000+00:00",
        "date_modified": "2023-10-16T22:37:07.918000+00:00",
        "record_locator": {
          "hubspot_id": "41286477879"
        }
      },
      "filename": "41286477879.txt",
      "filetype": "text/plain",
      "languages": [
        "eng"
      ]
    },
    "text": "<div style=\"\" dir=\"auto\" data-top-level=\"true\"><p style=\"margin:0;\">Log discussing details on call done with Testing Ipsum contact at 5:00pm.</p></div>",
    "type": "NarrativeText"
  }
]
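
A fixture like the one above could be checked for document-level ID uniqueness roughly as follows. This is a hedged sketch: `find_duplicate_ids` is a hypothetical helper, not a function from the repository, and the inline sample keeps only the `element_id`, `text`, and `type` fields of the two elements.

```python
import json
from collections import Counter
from typing import Dict, List

# Trimmed copy of the two fixture elements (metadata omitted for brevity).
SAMPLE = """
[
  {"element_id": "7997025526d4d565f2442e6c10be4c3d",
   "text": "Call with Testing Ipsum", "type": "Title"},
  {"element_id": "1111265d062cb14df259b8a212466554",
   "text": "Log discussing details on call", "type": "NarrativeText"}
]
"""


def find_duplicate_ids(elements: List[Dict]) -> List[str]:
    """Hypothetical helper: return every element_id that occurs more than once."""
    counts = Counter(element["element_id"] for element in elements)
    return [element_id for element_id, n in counts.items() if n > 1]


elements = json.loads(SAMPLE)
duplicates = find_duplicate_ids(elements)
print(duplicates)  # an empty list means all IDs in this document are unique
```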