docling/docs/examples/dpk-ingest-chunck-tokenize.ipynb
Maroun Touma e76298c40d
docs: DPK pipeline example using docling library (#2112)
* Notebook showing example on how to use docling transforms in DPK

Signed-off-by: Maroun Touma <touma@us.ibm.com>

* fix HF Token name

Signed-off-by: Maroun Touma <touma@us.ibm.com>

* use %pip instead of pip install jupyter lab

Signed-off-by: Maroun Touma <touma@us.ibm.com>

* run formatter

Signed-off-by: Maroun Touma <touma@us.ibm.com>

* add example to mkdocs and fix typo

Signed-off-by: Maroun Touma <touma@us.ibm.com>

---------

Signed-off-by: Maroun Touma <touma@us.ibm.com>
2025-08-21 10:14:36 +02:00


{
"cells": [
{
"cell_type": "markdown",
"id": "3f312845",
"metadata": {},
"source": [
"# 🛡️ Chunking and tokenizing HTML documents using Data Prep Kit and the Docling Transforms\n",
"\n",
"This notebook demonstrates how to build a sequence of <a href=https://github.com/data-prep-kit/data-prep-kit><b>DPK transforms</b></a> that ingests HTML documents with the Docling2Parquet transform and chunks them with the Doc_Chunk transform. Both transforms are based on the <a href=https://docling-project.github.io/docling/>Docling library</a>.\n",
"\n",
"In this example, we will use the <i>Wikimedia API</i> to retrieve the HTML articles that will be used as a seed for our LLM application. Once the articles are saved to a local cache, we will construct and invoke the sequence of transforms to ingest the content, chunk it, and tokenize the chunks.\n"
]
},
{
"cell_type": "markdown",
"id": "bdd83586",
"metadata": {},
"source": [
"## 🔍 Why DPK Pipelines\n",
"\n",
"DPK transform pipelines are intended to simplify running any number of transforms in sequence to ingest, annotate, filter, and create embeddings for LLM post-training and RAG applications.\n"
]
},
{
"cell_type": "markdown",
"id": "2bbf8525",
"metadata": {},
"source": [
"## 🧰 Key Transforms in This Recipe\n",
"\n",
"We will use the following transforms from DPK:\n",
"\n",
"- `Docling2Parquet`: Ingests one or more HTML documents and converts them into Parquet files.\n",
"- `Doc_Chunk`: Creates chunks from one or more documents.\n",
"- `Tokenization`: Converts the document chunks into token IDs.\n"
]
},
{
"cell_type": "markdown",
"id": "d19b07eb-af15-48d9-973d-f1a5a0654331",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"1- This notebook uses the Wikimedia API to retrieve the initial HTML documents and the llama-tokenizer from Hugging Face.\n",
"\n",
"2- To use the notebook, users must provide a <b>.env</b> file with a valid access token for the Wikimedia endpoint (<a href=https://enterprise.wikimedia.com/docs/>instructions can be found here</a>) and a Hugging Face token for loading the model (<a href=https://huggingface.co/docs/hub/en/security-tokens>instructions can be found here</a>). The .env file will look something like this:\n",
"```\n",
"WIKI_ACCESS_TOKEN='eyxxx'\n",
"HF_READ_ACCESS_TOKEN='hf_xxx'\n",
"```\n",
"\n",
"3- Install the DPK library in your environment:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8e34691-3b14-4796-847d-f07132bbca00",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"%pip install \"data-prep-toolkit-transforms[docling2parquet,doc_chunk,tokenization]\"\n",
"%pip install pandas\n",
"%pip install \"numpy<2.0\"\n",
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv(\".env\", override=True)"
]
},
{
"cell_type": "markdown",
"id": "31f3226b-3b3f-4354-bb5c-6f40ce9a511d",
"metadata": {},
"source": [
"We will define and use a utility function for downloading the articles and saving them to the local disk:\n",
"\n",
"<b>load_corpus</b>: Sends an HTTP request, authenticated with the Wikimedia API token, to a Wikimedia endpoint and retrieves the HTML articles that will be used as a seed for our LLM application. Each article is then saved to a local cache folder for further processing.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec0f6f63-9c4b-400d-af10-0ce9f805d6f9",
"metadata": {},
"outputs": [],
"source": [
"def load_corpus(articles: list, folder: str) -> int:\n",
"    import os\n",
"    import re\n",
"\n",
"    import requests\n",
"\n",
"    headers = {\"Authorization\": f\"Bearer {os.getenv('WIKI_ACCESS_TOKEN')}\"}\n",
"    count = 0\n",
"    for article in articles:\n",
"        try:\n",
"            endpoint = f\"https://api.enterprise.wikimedia.com/v2/articles/{article}\"\n",
"            response = requests.get(endpoint, headers=headers)\n",
"            response.raise_for_status()\n",
"            doc = response.json()\n",
"            # Each element of the JSON response is one article record.\n",
"            for record in doc:\n",
"                # Replace characters that are unsafe in file names with underscores.\n",
"                filename = re.sub(r\"[^a-zA-Z0-9_]\", \"_\", record[\"name\"])\n",
"                with open(f\"{folder}/{filename}.html\", \"w\", encoding=\"utf-8\") as f:\n",
"                    f.write(record[\"article_body\"][\"html\"])\n",
"                count = count + 1\n",
"        except Exception as e:\n",
"            print(f\"Failed to retrieve content: {e}\")\n",
"    return count"
]
},
{
"cell_type": "markdown",
"id": "671a16d7",
"metadata": {},
"source": [
"## 🔗 Setup the experiment\n",
"\n",
"DPK requires that we define a source/input folder from which the transform sequence will ingest the documents and a destination/output folder where the results will be stored. We will also initialize the list of articles we want to use in our application.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "712befc1-7b3b-4949-9c5b-84e434c1ff60",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import tempfile\n",
"\n",
"datafolder = tempfile.mkdtemp(dir=os.getcwd())\n",
"articles = [\"Science,_technology,_engineering,_and_mathematics\"]\n",
"assert load_corpus(articles, datafolder) > 0, \"Failed to download any documents\""
]
},
{
"cell_type": "markdown",
"id": "c6148ba9-b0ba-4c24-9422-52bf27cb812b",
"metadata": {},
"source": [
"### 🔗 Ingest\n",
"\n",
"Invoke the Docling2Parquet transform, which parses the HTML documents and converts their content to Markdown."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fd453793-8f68-440c-a08a-8a6a16f3de04",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%%capture\n",
"from dpk_docling2parquet import Docling2Parquet, docling2parquet_contents_types\n",
"\n",
"result = Docling2Parquet(\n",
"    input_folder=datafolder,\n",
"    output_folder=f\"{datafolder}/docling2parquet\",\n",
"    data_files_to_use=[\".html\"],\n",
"    docling2parquet_contents_type=docling2parquet_contents_types.MARKDOWN,  # markdown\n",
").transform()"
]
},
{
"cell_type": "markdown",
"id": "a4670351-a439-4212-9392-e890fc943a3c",
"metadata": {},
"source": [
"### 🔗 Chunk\n",
"\n",
"Invoke the DocChunk transform to break the ingested documents into chunks."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8db5da67-4797-4b85-b83c-3347e8852611",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%%capture\n",
"from dpk_doc_chunk import DocChunk\n",
"\n",
"result = DocChunk(\n",
"    input_folder=f\"{datafolder}/docling2parquet\",\n",
"    output_folder=f\"{datafolder}/doc_chunk\",\n",
"    doc_chunk_chunking_type=\"li_markdown\",\n",
"    doc_chunk_chunk_size_tokens=128,  # default 128\n",
"    doc_chunk_chunk_overlap_tokens=30,  # default 30\n",
").transform()"
]
},
{
"cell_type": "markdown",
"id": "a9dbe452-fc9a-490d-a28a-961df528f812",
"metadata": {},
"source": [
"### 🔗 Tokenization\n",
"\n",
"Invoke the Tokenization transform to convert each chunk into token IDs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93be7a31-d13c-4345-861c-6654aea4f680",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"from dpk_tokenization import Tokenization\n",
"\n",
"Tokenization(\n",
"    input_folder=f\"{datafolder}/doc_chunk\",\n",
"    output_folder=f\"{datafolder}/tkn\",\n",
"    tkn_tokenizer=\"hf-internal-testing/llama-tokenizer\",\n",
"    tkn_chunk_size=20_000,\n",
").transform()"
]
},
{
"cell_type": "markdown",
"id": "d36d136a",
"metadata": {},
"source": [
"## ✅ Summary\n",
"\n",
"This notebook demonstrated how to run a DPK pipeline using IBM's Data Prep Kit and the Docling library. Each transform creates one or more Parquet files that users can explore to better understand what each stage of the pipeline produces.\n",
"To see the output of the final stage, we will use pandas to read the final Parquet files and display their content."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "997003fc-ba1a-4ebf-a37d-4419bcaf5389",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>tokens</th>\n",
" <th>document_id</th>\n",
" <th>document_length</th>\n",
" <th>token_count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>[1, 444, 11814, 262, 3002]</td>\n",
" <td>f1f5b56a78829ab2165b3bbeb94b1167e4c5583c437f1d...</td>\n",
" <td>14</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>[1, 835, 5298, 13, 13, 797, 278, 4688, 29871, ...</td>\n",
" <td>402e82a9e81cc3d2494fac36bebf8bf1a2662800e5a00c...</td>\n",
" <td>2100</td>\n",
" <td>655</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>[1, 835, 5901, 21833, 13, 13, 29899, 321, 1254...</td>\n",
" <td>4fb389d0f0e999c2496f137b4a7c0671e79c09cf9477e9...</td>\n",
" <td>2833</td>\n",
" <td>968</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>[1, 444, 26304, 4978, 13, 13, 14136, 1967, 666...</td>\n",
" <td>3709997548d84224361a6835760b5ae48a1637e78d54a0...</td>\n",
" <td>1496</td>\n",
" <td>483</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>[1, 444, 2648, 4234]</td>\n",
" <td>1e1a58ad5664d963bc207dc791825258c33337c2559f6a...</td>\n",
" <td>13</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>[1, 835, 8314, 13, 13, 1576, 9870, 315, 1038, ...</td>\n",
" <td>83a63864e5ddfdd41ef0f813fb7aa3c95e04c029c32ab3...</td>\n",
" <td>1340</td>\n",
" <td>442</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>[1, 835, 7400, 13, 13, 6028, 1114, 27871, 2987...</td>\n",
" <td>5e29fb4e4cf37ed4c49994620e4a00da9693bc061e82c1...</td>\n",
" <td>1800</td>\n",
" <td>548</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>[1, 835, 7551, 13, 13, 25411, 3762, 8950, 6020...</td>\n",
" <td>3fc34013d93391a7504e84069190479fbc85ba7e7072cb...</td>\n",
" <td>1784</td>\n",
" <td>511</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>[1, 835, 4092, 13, 13, 13393, 884, 29901, 518,...</td>\n",
" <td>e8b28e20e3fc3da40b6b368e30f9c953f5218370ec2f7a...</td>\n",
" <td>774</td>\n",
" <td>229</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>[1, 3191, 18312, 13, 13, 1576, 365, 29965, 152...</td>\n",
" <td>94b54fbda274536622f70442b18126f554610e8915b235...</td>\n",
" <td>1076</td>\n",
" <td>263</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>[1, 3191, 3444, 13, 13, 1576, 1024, 310, 317, ...</td>\n",
" <td>fef9b66567944df131851834e2fdfb42b5c668e4b08031...</td>\n",
" <td>238</td>\n",
" <td>60</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>[1, 835, 12798, 12026, 13, 13, 1254, 12665, 97...</td>\n",
" <td>eeb74ae3490539aa07f25987b6b2666dc907b39147e810...</td>\n",
" <td>366</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>[1, 835, 7513, 13, 13, 19302, 284, 2879, 515, ...</td>\n",
" <td>cc2ccd2e9f4d0a8224716109f7a6e7b30f33ff1f8c7adf...</td>\n",
" <td>1395</td>\n",
" <td>402</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>[1, 835, 20537, 423, 13, 13, 797, 20537, 423, ...</td>\n",
" <td>baf13788a018da24d86b630a9032eaeee54913bbbdd0d4...</td>\n",
" <td>511</td>\n",
" <td>137</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>[1, 835, 21215, 13, 13, 1254, 12665, 17800, 52...</td>\n",
" <td>a5b3973ab3a98d10f4ae07a004d70c6cdcfacb41fda8d7...</td>\n",
" <td>1949</td>\n",
" <td>536</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>[1, 835, 26260, 13, 13, 797, 278, 518, 4819, 2...</td>\n",
" <td>dfa35b16704a4dd549701a7821b6aa856f2dd5e5b69daf...</td>\n",
" <td>1042</td>\n",
" <td>291</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>[1, 835, 660, 14873, 13, 13, 797, 518, 29984, ...</td>\n",
" <td>a0809b265e4a011407d38cd06c7b3ce5932683a2f9c6af...</td>\n",
" <td>852</td>\n",
" <td>282</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>[1, 835, 25960, 13, 13, 1254, 12665, 338, 760,...</td>\n",
" <td>85e8f3b2af3268d49e60451d3ac87b3bd281a70cf6c4b7...</td>\n",
" <td>1165</td>\n",
" <td>285</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>[1, 835, 498, 26517, 13, 13, 797, 29871, 29906...</td>\n",
" <td>15c924efdbf0135de91a095237cbe831275bab67ee1371...</td>\n",
" <td>1612</td>\n",
" <td>397</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>[1, 835, 26459, 13, 13, 29911, 29641, 728, 317...</td>\n",
" <td>b473b50753dd07f08da05bbf776c57747ab85ba79cb081...</td>\n",
" <td>435</td>\n",
" <td>145</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>[1, 835, 3303, 3900, 13, 13, 797, 278, 3303, 3...</td>\n",
" <td>841cefc910bd5d1920187b23554ee67e0e65563373e6de...</td>\n",
" <td>1212</td>\n",
" <td>344</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>[1, 3191, 3086, 9327, 10606, 13, 13, 14804, 25...</td>\n",
" <td>63924939eab38ad6636495f1c5c13760014efe42b330a6...</td>\n",
" <td>1592</td>\n",
" <td>416</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>[1, 3191, 1954, 29885, 16783, 8898, 13, 13, 24...</td>\n",
" <td>44288e766c343592a44f3da59ad3b57a9f26096ac13412...</td>\n",
" <td>1653</td>\n",
" <td>465</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>[1, 3191, 13151, 13, 13, 13393, 884, 29901, 51...</td>\n",
" <td>40a0f6e213901d92f1a158c3e2a55ad2558eb1deaa973f...</td>\n",
" <td>4418</td>\n",
" <td>1285</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>[1, 3191, 6981, 1455, 17261, 297, 317, 4330, 2...</td>\n",
" <td>5cc92a05d39ee56e9c65cdb00f55bc9dcbe8bc1647a442...</td>\n",
" <td>1289</td>\n",
" <td>375</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>[1, 3191, 402, 1581, 330, 2547, 297, 317, 4330...</td>\n",
" <td>37c88bed7898d9a7406b5b0e4b1ccfaca65a732dff0c03...</td>\n",
" <td>821</td>\n",
" <td>280</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>[1, 3191, 4124, 2042, 2877, 297, 317, 4330, 29...</td>\n",
" <td>f144b97af462b2ab8aba5cb6d9cba0cf5f383cc710aba0...</td>\n",
" <td>1093</td>\n",
" <td>297</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>[1, 3191, 3082, 24620, 277, 20193, 512, 4812, ...</td>\n",
" <td>16525e2054a7bb7543308ad4e6642bf60e66dc475a0e0a...</td>\n",
" <td>2203</td>\n",
" <td>538</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>[1, 3191, 317, 4330, 29924, 13151, 3189, 284, ...</td>\n",
" <td>ebb319391e1bda81edd5ec214887150044c15cfc04f42f...</td>\n",
" <td>514</td>\n",
" <td>149</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>[1, 3191, 2522, 449, 292, 13, 13, 797, 29871, ...</td>\n",
" <td>882582d1f6202a4e495f67952d3a27929177745b1f575e...</td>\n",
" <td>850</td>\n",
" <td>261</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>[1, 3191, 10317, 310, 5282, 1947, 11104, 13, 1...</td>\n",
" <td>311aa5c91354b6bf575682be701981ccc6569eb35fd726...</td>\n",
" <td>1561</td>\n",
" <td>416</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>[1, 3191, 24206, 13, 13, 1254, 12665, 23992, 2...</td>\n",
" <td>abaa73aba997ea267d9b556679c5d680810ee5baa231fa...</td>\n",
" <td>384</td>\n",
" <td>139</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>[1, 3191, 18991, 362, 13, 13, 1576, 518, 29048...</td>\n",
" <td>00f85d6dffd914d89eb44dbb4caa3a1c6b2af47f5c4c96...</td>\n",
" <td>878</td>\n",
" <td>247</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>[1, 3191, 17163, 29879, 13, 13, 797, 3979, 298...</td>\n",
" <td>f8d901fca6dcac6c266cf2799da814c5f5b5644c3b9476...</td>\n",
" <td>2321</td>\n",
" <td>682</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>[1, 3191, 3599, 296, 6728, 13, 13, 7504, 3278,...</td>\n",
" <td>8347c4988e3acde4723696fbf63a0f2c13d61e92c8fbac...</td>\n",
" <td>2960</td>\n",
" <td>841</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35</th>\n",
" <td>[1, 3191, 28488, 322, 11104, 304, 1371, 2693, ...</td>\n",
" <td>c3d0c80c861ffcd422f60b78d693bb953b69dfc3c3d55f...</td>\n",
" <td>222</td>\n",
" <td>81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>[1, 835, 18444, 13, 13, 797, 18444, 29892, 676...</td>\n",
" <td>9c41677100393c4e5e3bc4bc36caee5561cb5c93546aaf...</td>\n",
" <td>1143</td>\n",
" <td>288</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37</th>\n",
" <td>[1, 444, 10152, 13, 13, 6330, 7456, 29901, 518...</td>\n",
" <td>83f0f668bac5736d5f23f750f86ebbe173c0a56e3c51b8...</td>\n",
" <td>2777</td>\n",
" <td>833</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>[1, 444, 365, 7210, 29911, 29984, 29974, 13, 1...</td>\n",
" <td>24bbfff971979686cd41132b491060bdaaf357bd3bc7cf...</td>\n",
" <td>2579</td>\n",
" <td>847</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39</th>\n",
" <td>[1, 444, 15976, 293, 1608, 13, 13, 1576, 8569,...</td>\n",
" <td>1b8c147d642e4d53152e1be73223ed58e0788700d82c73...</td>\n",
" <td>4700</td>\n",
" <td>1299</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40</th>\n",
" <td>[1, 444, 2823, 884, 13, 13, 29899, 518, 29907,...</td>\n",
" <td>ac3fb4073323718ea3e32e006ed67c298af9801c4a03dd...</td>\n",
" <td>1310</td>\n",
" <td>443</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41</th>\n",
" <td>[1, 444, 28318, 13, 13, 29896, 29889, 518, 298...</td>\n",
" <td>2dad03b0e2b81c47012f94be0ab730e9c8341f0311c59e...</td>\n",
" <td>59373</td>\n",
" <td>26470</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42</th>\n",
" <td>[1, 444, 8725, 5183, 13, 13, 29899, 4699, 1522...</td>\n",
" <td>07dabd1b5cfa6f8c70f97eb33c3a19189a866eae1203c7...</td>\n",
" <td>2648</td>\n",
" <td>1075</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43</th>\n",
" <td>[1, 444, 3985, 2988, 13, 13, 29899, 8213, 4475...</td>\n",
" <td>ef8cc66ae18d7238680d07372859c5be061d57b955cf7d...</td>\n",
" <td>5025</td>\n",
" <td>705</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" tokens \\\n",
"0 [1, 444, 11814, 262, 3002] \n",
"1 [1, 835, 5298, 13, 13, 797, 278, 4688, 29871, ... \n",
"2 [1, 835, 5901, 21833, 13, 13, 29899, 321, 1254... \n",
"3 [1, 444, 26304, 4978, 13, 13, 14136, 1967, 666... \n",
"4 [1, 444, 2648, 4234] \n",
"5 [1, 835, 8314, 13, 13, 1576, 9870, 315, 1038, ... \n",
"6 [1, 835, 7400, 13, 13, 6028, 1114, 27871, 2987... \n",
"7 [1, 835, 7551, 13, 13, 25411, 3762, 8950, 6020... \n",
"8 [1, 835, 4092, 13, 13, 13393, 884, 29901, 518,... \n",
"9 [1, 3191, 18312, 13, 13, 1576, 365, 29965, 152... \n",
"10 [1, 3191, 3444, 13, 13, 1576, 1024, 310, 317, ... \n",
"11 [1, 835, 12798, 12026, 13, 13, 1254, 12665, 97... \n",
"12 [1, 835, 7513, 13, 13, 19302, 284, 2879, 515, ... \n",
"13 [1, 835, 20537, 423, 13, 13, 797, 20537, 423, ... \n",
"14 [1, 835, 21215, 13, 13, 1254, 12665, 17800, 52... \n",
"15 [1, 835, 26260, 13, 13, 797, 278, 518, 4819, 2... \n",
"16 [1, 835, 660, 14873, 13, 13, 797, 518, 29984, ... \n",
"17 [1, 835, 25960, 13, 13, 1254, 12665, 338, 760,... \n",
"18 [1, 835, 498, 26517, 13, 13, 797, 29871, 29906... \n",
"19 [1, 835, 26459, 13, 13, 29911, 29641, 728, 317... \n",
"20 [1, 835, 3303, 3900, 13, 13, 797, 278, 3303, 3... \n",
"21 [1, 3191, 3086, 9327, 10606, 13, 13, 14804, 25... \n",
"22 [1, 3191, 1954, 29885, 16783, 8898, 13, 13, 24... \n",
"23 [1, 3191, 13151, 13, 13, 13393, 884, 29901, 51... \n",
"24 [1, 3191, 6981, 1455, 17261, 297, 317, 4330, 2... \n",
"25 [1, 3191, 402, 1581, 330, 2547, 297, 317, 4330... \n",
"26 [1, 3191, 4124, 2042, 2877, 297, 317, 4330, 29... \n",
"27 [1, 3191, 3082, 24620, 277, 20193, 512, 4812, ... \n",
"28 [1, 3191, 317, 4330, 29924, 13151, 3189, 284, ... \n",
"29 [1, 3191, 2522, 449, 292, 13, 13, 797, 29871, ... \n",
"30 [1, 3191, 10317, 310, 5282, 1947, 11104, 13, 1... \n",
"31 [1, 3191, 24206, 13, 13, 1254, 12665, 23992, 2... \n",
"32 [1, 3191, 18991, 362, 13, 13, 1576, 518, 29048... \n",
"33 [1, 3191, 17163, 29879, 13, 13, 797, 3979, 298... \n",
"34 [1, 3191, 3599, 296, 6728, 13, 13, 7504, 3278,... \n",
"35 [1, 3191, 28488, 322, 11104, 304, 1371, 2693, ... \n",
"36 [1, 835, 18444, 13, 13, 797, 18444, 29892, 676... \n",
"37 [1, 444, 10152, 13, 13, 6330, 7456, 29901, 518... \n",
"38 [1, 444, 365, 7210, 29911, 29984, 29974, 13, 1... \n",
"39 [1, 444, 15976, 293, 1608, 13, 13, 1576, 8569,... \n",
"40 [1, 444, 2823, 884, 13, 13, 29899, 518, 29907,... \n",
"41 [1, 444, 28318, 13, 13, 29896, 29889, 518, 298... \n",
"42 [1, 444, 8725, 5183, 13, 13, 29899, 4699, 1522... \n",
"43 [1, 444, 3985, 2988, 13, 13, 29899, 8213, 4475... \n",
"\n",
" document_id document_length \\\n",
"0 f1f5b56a78829ab2165b3bbeb94b1167e4c5583c437f1d... 14 \n",
"1 402e82a9e81cc3d2494fac36bebf8bf1a2662800e5a00c... 2100 \n",
"2 4fb389d0f0e999c2496f137b4a7c0671e79c09cf9477e9... 2833 \n",
"3 3709997548d84224361a6835760b5ae48a1637e78d54a0... 1496 \n",
"4 1e1a58ad5664d963bc207dc791825258c33337c2559f6a... 13 \n",
"5 83a63864e5ddfdd41ef0f813fb7aa3c95e04c029c32ab3... 1340 \n",
"6 5e29fb4e4cf37ed4c49994620e4a00da9693bc061e82c1... 1800 \n",
"7 3fc34013d93391a7504e84069190479fbc85ba7e7072cb... 1784 \n",
"8 e8b28e20e3fc3da40b6b368e30f9c953f5218370ec2f7a... 774 \n",
"9 94b54fbda274536622f70442b18126f554610e8915b235... 1076 \n",
"10 fef9b66567944df131851834e2fdfb42b5c668e4b08031... 238 \n",
"11 eeb74ae3490539aa07f25987b6b2666dc907b39147e810... 366 \n",
"12 cc2ccd2e9f4d0a8224716109f7a6e7b30f33ff1f8c7adf... 1395 \n",
"13 baf13788a018da24d86b630a9032eaeee54913bbbdd0d4... 511 \n",
"14 a5b3973ab3a98d10f4ae07a004d70c6cdcfacb41fda8d7... 1949 \n",
"15 dfa35b16704a4dd549701a7821b6aa856f2dd5e5b69daf... 1042 \n",
"16 a0809b265e4a011407d38cd06c7b3ce5932683a2f9c6af... 852 \n",
"17 85e8f3b2af3268d49e60451d3ac87b3bd281a70cf6c4b7... 1165 \n",
"18 15c924efdbf0135de91a095237cbe831275bab67ee1371... 1612 \n",
"19 b473b50753dd07f08da05bbf776c57747ab85ba79cb081... 435 \n",
"20 841cefc910bd5d1920187b23554ee67e0e65563373e6de... 1212 \n",
"21 63924939eab38ad6636495f1c5c13760014efe42b330a6... 1592 \n",
"22 44288e766c343592a44f3da59ad3b57a9f26096ac13412... 1653 \n",
"23 40a0f6e213901d92f1a158c3e2a55ad2558eb1deaa973f... 4418 \n",
"24 5cc92a05d39ee56e9c65cdb00f55bc9dcbe8bc1647a442... 1289 \n",
"25 37c88bed7898d9a7406b5b0e4b1ccfaca65a732dff0c03... 821 \n",
"26 f144b97af462b2ab8aba5cb6d9cba0cf5f383cc710aba0... 1093 \n",
"27 16525e2054a7bb7543308ad4e6642bf60e66dc475a0e0a... 2203 \n",
"28 ebb319391e1bda81edd5ec214887150044c15cfc04f42f... 514 \n",
"29 882582d1f6202a4e495f67952d3a27929177745b1f575e... 850 \n",
"30 311aa5c91354b6bf575682be701981ccc6569eb35fd726... 1561 \n",
"31 abaa73aba997ea267d9b556679c5d680810ee5baa231fa... 384 \n",
"32 00f85d6dffd914d89eb44dbb4caa3a1c6b2af47f5c4c96... 878 \n",
"33 f8d901fca6dcac6c266cf2799da814c5f5b5644c3b9476... 2321 \n",
"34 8347c4988e3acde4723696fbf63a0f2c13d61e92c8fbac... 2960 \n",
"35 c3d0c80c861ffcd422f60b78d693bb953b69dfc3c3d55f... 222 \n",
"36 9c41677100393c4e5e3bc4bc36caee5561cb5c93546aaf... 1143 \n",
"37 83f0f668bac5736d5f23f750f86ebbe173c0a56e3c51b8... 2777 \n",
"38 24bbfff971979686cd41132b491060bdaaf357bd3bc7cf... 2579 \n",
"39 1b8c147d642e4d53152e1be73223ed58e0788700d82c73... 4700 \n",
"40 ac3fb4073323718ea3e32e006ed67c298af9801c4a03dd... 1310 \n",
"41 2dad03b0e2b81c47012f94be0ab730e9c8341f0311c59e... 59373 \n",
"42 07dabd1b5cfa6f8c70f97eb33c3a19189a866eae1203c7... 2648 \n",
"43 ef8cc66ae18d7238680d07372859c5be061d57b955cf7d... 5025 \n",
"\n",
" token_count \n",
"0 5 \n",
"1 655 \n",
"2 968 \n",
"3 483 \n",
"4 4 \n",
"5 442 \n",
"6 548 \n",
"7 511 \n",
"8 229 \n",
"9 263 \n",
"10 60 \n",
"11 97 \n",
"12 402 \n",
"13 137 \n",
"14 536 \n",
"15 291 \n",
"16 282 \n",
"17 285 \n",
"18 397 \n",
"19 145 \n",
"20 344 \n",
"21 416 \n",
"22 465 \n",
"23 1285 \n",
"24 375 \n",
"25 280 \n",
"26 297 \n",
"27 538 \n",
"28 149 \n",
"29 261 \n",
"30 416 \n",
"31 139 \n",
"32 247 \n",
"33 682 \n",
"34 841 \n",
"35 81 \n",
"36 288 \n",
"37 833 \n",
"38 847 \n",
"39 1299 \n",
"40 443 \n",
"41 26470 \n",
"42 1075 \n",
"43 705 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from pathlib import Path\n",
"\n",
"import pandas as pd\n",
"\n",
"parquet_files = list(Path(f\"{datafolder}/tkn/\").glob(\"*.parquet\"))\n",
"pd.concat(pd.read_parquet(file) for file in parquet_files)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cdfc4719-4b38-4e88-93cb-0a8e76f01412",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
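
The Chunk step in the notebook above sets `doc_chunk_chunk_size_tokens=128` and `doc_chunk_chunk_overlap_tokens=30`. As a rough illustration of what a chunk size and an overlap mean, the sketch below splits a list of token IDs into fixed-size windows, where each window repeats the last tokens of the previous one. This is a simplified sliding window written only for illustration; it is not the splitter that DPK's `li_markdown` chunking uses. The helper name `sliding_window_chunks` and the integer stand-in tokens are assumptions made for the example.

```python
from typing import List, Sequence


def sliding_window_chunks(
    tokens: Sequence[int], chunk_size: int = 128, overlap: int = 30
) -> List[List[int]]:
    """Split tokens into windows of at most chunk_size items.

    Hypothetical helper: each window shares `overlap` items with the
    previous one. Not part of DPK or Docling.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window start advances each time
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the input
        start += step
    return chunks


if __name__ == "__main__":
    demo = list(range(300))  # stand-in for 300 token IDs
    for index, chunk in enumerate(sliding_window_chunks(demo)):
        # Print the chunk index, its first and last item, and its length.
        print(index, chunk[0], chunk[-1], len(chunk))
```

With the notebook's settings, consecutive windows in this sketch start 98 items apart (128 minus 30), so the last 30 items of one chunk reappear at the start of the next.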