Michał Martyniak 2d1923ac7e
Better element IDs - deterministic and document-unique hashes (#2673)
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:
* the hash computation now includes the element's sequence number on the
page, the page number, the document filename, and the element's text
* there are more tests for the deterministic behavior of IDs returned by
partitioning functions, and for their uniqueness (guaranteed at the
document level, and highly probable across multiple documents)
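
As a rough illustration of the first point, the sketch below derives an ID from the four inputs listed above. It is not the library's implementation: the helper name `sketch_element_id`, the field separator, and the choice of SHA-256 truncated to 32 hex characters are all assumptions made for this example.

```python
import hashlib


def sketch_element_id(filename: str, page_number: int, sequence_on_page: int, text: str) -> str:
    """Hypothetical helper: hash the four inputs named in the commit message."""
    # Join the fields with a separator (an assumption) so that, for example,
    # page 1 / sequence 12 cannot produce the same payload as page 11 / sequence 2.
    payload = "\x1f".join([filename, str(page_number), str(sequence_on_page), text])
    # Truncating to 32 hex characters is an assumption that matches the
    # length of the IDs in the fixture, not a confirmed detail.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


# Same inputs give the same ID (deterministic).
first = sketch_element_id("example.html", 1, 0, "Test text")
again = sketch_element_id("example.html", 1, 0, "Test text")

# Identical text at a different position on the page gives a different ID,
# which is what makes IDs unique within a document.
repeated_text = sketch_element_id("example.html", 1, 1, "Test text")

# The same element in a different file also gets a different ID.
other_file = sketch_element_id("other.html", 1, 0, "Test text")
```

Including the position and filename in the payload is what separates two elements with identical text. Across documents, uniqueness is only probabilistic, because it rests on the hash not colliding.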

This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
2024-04-24 00:05:20 -07:00

[
  {
    "element_id": "f273c4bc5102c3e9b7463be8210ad7ab",
    "metadata": {
      "data_source": {
        "date_created": "2023-07-13T14:28:06.310000",
        "date_modified": "2023-07-14T22:16:58.907000",
        "record_locator": {
          "page_id": "2589704",
          "url": "https://unstructured-ingest-test.atlassian.net"
        },
        "url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/2589704",
        "version": "3"
      },
      "filetype": "text/html",
      "languages": [
        "eng"
      ],
      "page_number": 1
    },
    "text": "Test text",
    "type": "Title"
  }
]