{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conversion of custom XML"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Step | Tech | Execution | \n",
"| --- | --- | --- |\n",
"| Embedding | Hugging Face / Sentence Transformers | ๐ป Local |\n",
"| Vector store | Milvus | ๐ป Local |\n",
"| Gen AI | Hugging Face Inference API | ๐ Remote | "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is an example of using [Docling](https://ds4sd.github.io/docling/) for converting structured data (XML) into a unified document\n",
"representation format, `DoclingDocument`, and leverage its riched structured content for RAG applications.\n",
"\n",
"Data used in this example consist of patents from the [United States Patent and Trademark Office (USPTO)](https://www.uspto.gov/) and medical\n",
"articles from [PubMed Centralยฎ (PMC)](https://pmc.ncbi.nlm.nih.gov/).\n",
"\n",
"In this notebook, we accomplish the following:\n",
"- [Simple conversion](#simple-conversion) of supported XML files in a nutshell\n",
"- An [end-to-end application](#end-to-end-application) using public collections of XML files supported by Docling\n",
" - [Setup](#setup) the API access for generative AI\n",
" - [Fetch the data](#fetch-the-data) from USPTO and PubMed Centralยฎ sites, using Docling custom backends\n",
" - [Parse, chunk, and index](#parse-chunk-and-index) the documents in a vector database\n",
" - [Perform RAG](#question-answering-with-rag) using [LlamaIndex Docling extension](../../integrations/llamaindex/)\n",
"\n",
"For more details on document chunking with Docling, refer to the [Chunking](../../concepts/chunking/) documentation. For RAG with Docling and LlamaIndex, also check the example [RAG with LlamaIndex](../rag_llamaindex/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Simple conversion\n",
"\n",
"The XML file format defines and stores data in a format that is both human-readable and machine-readable.\n",
"Because of this flexibility, Docling requires custom backend processors to interpret XML definitions and convert them into `DoclingDocument` objects.\n",
"\n",
"Some public data collections in XML format are already supported by Docling (USTPO patents and PMC articles). In these cases, the document conversion is straightforward and the same as with any other supported format, such as PDF or HTML. The execution example in [Simple Conversion](../minimal/) is the recommended usage of Docling for a single file:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ConversionStatus.SUCCESS\n"
]
}
],
"source": [
"from docling.document_converter import DocumentConverter\n",
"\n",
"# a sample PMC article:\n",
"source = \"../../tests/data/pubmed/elife-56337.nxml\"\n",
"converter = DocumentConverter()\n",
"result = converter.convert(source)\n",
"print(result.status)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the document is converted, it can be exported to any format supported by Docling. For instance, to markdown (showing here the first lines only):"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# KRAB-zinc finger protein gene expansion in response to active retrotransposons in the murine lineage\n",
"\n",
"Wolf Gernot; 1: The Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health: Bethesda: United States; de Iaco Alberto; 2: School of Life Sciences, รcole Polytechnique Fรฉdรฉrale de Lausanne (EPFL): Lausanne: Switzerland; Sun Ming-An; 1: The Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health: Bethesda: United States; Bruno Melania; 1: The Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health: Bethesda: United States; Tinkham Matthew; 1: The Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health: Bethesda: United States; Hoang Don; 1: The Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health: Bethesda: United States; Mitra Apratim; 1: The Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health: Bethesda: United States; Ralls Sherry; 1: The Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health: Bethesda: United States; Trono Didier; 2: School of Life Sciences, รcole Polytechnique Fรฉdรฉrale de Lausanne (EPFL): Lausanne: Switzerland; Macfarlan Todd S; 1: The Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health: Bethesda: United States\n",
"\n",
"## Abstract\n",
"\n",
"The Krรผppel-associated box zinc finger protein (KRAB-ZFP) family diversified in mammals. The majority of human KRAB-ZFPs bind transposable elements (TEs), however, since most TEs are inactive in humans it is unclear whether KRAB-ZFPs emerged to suppress TEs. We demonstrate that many recently emerged murine KRAB-ZFPs also bind to TEs, including the active ETn, IAP, and L1 families. Using a CRISPR/Cas9-based engineering approach, we genetically deleted five large clusters of KRAB-ZFPs and demonstrate that target TEs are de-repressed, unleashing TE-encoded enhancers. Homozygous knockout mice lacking one of two KRAB-ZFP gene clusters on chromosome 2 and chromosome 4 were nonetheless viable. In pedigrees of chromosome 4 cluster KRAB-ZFP mutants, we identified numerous novel ETn insertions with a modest increase in mutants. Our data strongly support the current model that recent waves of retrotransposon activity drove the expansion of KRAB-ZFP genes in mice and that many KRAB-ZFPs play a redundant role restricting TE activity.\n",
"\n"
]
}
],
"source": [
"md_doc = result.document.export_to_markdown()\n",
"\n",
"delim = \"\\n\"\n",
"print(delim.join(md_doc.split(delim)[:8]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the XML file is not supported, a `ConversionError` message will be raised."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Input document docling_test.xml does not match any allowed format.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"File format not allowed: docling_test.xml\n"
]
}
],
"source": [
"from io import BytesIO\n",
"\n",
"from docling.datamodel.base_models import DocumentStream\n",
"from docling.exceptions import ConversionError\n",
"\n",
"xml_content = (\n",
" b'
╭──────────────────────────────────────────────────── Prompt ─────────────────────────────────────────────────────╮\n", "│ Do mosquitoes in high altitude expand viruses over large distances? │\n", "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n", "\n" ], "text/plain": [ "\u001b[1;31m╭─\u001b[0m\u001b[1;31m───────────────────────────────────────────────────\u001b[0m\u001b[1;31m Prompt \u001b[0m\u001b[1;31m────────────────────────────────────────────────────\u001b[0m\u001b[1;31m─╮\u001b[0m\n", "\u001b[1;31m│\u001b[0m Do mosquitoes in high altitude expand viruses over large distances? \u001b[1;31m│\u001b[0m\n", "\u001b[1;31m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
╭─────────────────────────────────────────────── Generated Content ───────────────────────────────────────────────╮\n", "│ Mosquitoes can be found at high altitudes, but their ability to transmit viruses over long distances is not │\n", "│ primarily dependent on altitude. Mosquitoes are vectors for various diseases, such as malaria, dengue fever, │\n", "│ and Zika virus, and their transmission range is more closely related to their movement, the presence of a host, │\n", "│ and environmental conditions that support their survival and reproduction. │\n", "│ │\n", "│ At high altitudes, the environment can be less suitable for mosquitoes due to factors such as colder │\n", "│ temperatures, lower humidity, and stronger winds, which can limit their population size and distribution. │\n", "│ However, some species of mosquitoes have adapted to high-altitude environments and can still transmit diseases │\n", "│ in these areas. │\n", "│ │\n", "│ It is possible for mosquitoes to be transported by wind or human activities to higher altitudes, but this is │\n", "│ not a significant factor in their ability to transmit viruses over long distances. Instead, long-distance │\n", "│ transmission of viruses is more often associated with human travel and transportation, which can rapidly spread │\n", "│ infected mosquitoes or humans to new areas, leading to the spread of disease. │\n", "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n", "\n" ], "text/plain": [ "\u001b[1;32m╭─\u001b[0m\u001b[1;32m──────────────────────────────────────────────\u001b[0m\u001b[1;32m Generated Content \u001b[0m\u001b[1;32m──────────────────────────────────────────────\u001b[0m\u001b[1;32m─╮\u001b[0m\n", "\u001b[1;32m│\u001b[0m Mosquitoes can be found at high altitudes, but their ability to transmit viruses over long distances is not \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m primarily dependent on altitude. 
Mosquitoes are vectors for various diseases, such as malaria, dengue fever, \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m and Zika virus, and their transmission range is more closely related to their movement, the presence of a host, \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m and environmental conditions that support their survival and reproduction. \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m At high altitudes, the environment can be less suitable for mosquitoes due to factors such as colder \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m temperatures, lower humidity, and stronger winds, which can limit their population size and distribution. \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m However, some species of mosquitoes have adapted to high-altitude environments and can still transmit diseases \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m in these areas. \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m It is possible for mosquitoes to be transported by wind or human activities to higher altitudes, but this is \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m not a significant factor in their ability to transmit viruses over long distances. Instead, long-distance \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m transmission of viruses is more often associated with human travel and transportation, which can rapidly spread \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m infected mosquitoes or humans to new areas, leading to the spread of disease. 
\u001b[1;32m│\u001b[0m\n", "\u001b[1;32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from llama_index.core.base.llms.types import ChatMessage, MessageRole\n", "from rich.console import Console\n", "from rich.panel import Panel\n", "\n", "console = Console()\n", "query = \"Do mosquitoes in high altitude expand viruses over large distances?\"\n", "\n", "usr_msg = ChatMessage(role=MessageRole.USER, content=query)\n", "response = GEN_MODEL.chat(messages=[usr_msg])\n", "\n", "console.print(Panel(query, title=\"Prompt\", border_style=\"bold red\"))\n", "console.print(\n", "    Panel(\n", "        response.message.content.strip(),\n", "        title=\"Generated Content\",\n", "        border_style=\"bold green\",\n", "    )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can compare the response when the model is prompted with the indexed PMC article as supporting context:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
╭────────────────────────────────────────── Generated Content with RAG ───────────────────────────────────────────╮\n", "│ Yes, mosquitoes in high altitude can expand viruses over large distances. A study intercepted 1,017 female │\n", "│ mosquitoes at altitudes of 120-290 m above ground over Mali and Ghana and screened them for infection with │\n", "│ arboviruses, plasmodia, and filariae. The study found that 3.5% of the mosquitoes were infected with │\n", "│ flaviviruses, and 1.1% were infectious. Additionally, the study identified 19 mosquito-borne pathogens, │\n", "│ including three arboviruses that affect humans (dengue, West Nile, and M’Poko viruses). The study provides │\n", "│ compelling evidence that mosquito-borne pathogens are often spread by windborne mosquitoes at altitude. │\n", "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n", "\n" ], "text/plain": [ "\u001b[1;32m╭─\u001b[0m\u001b[1;32m─────────────────────────────────────────\u001b[0m\u001b[1;32m Generated Content with RAG \u001b[0m\u001b[1;32m──────────────────────────────────────────\u001b[0m\u001b[1;32m─╮\u001b[0m\n", "\u001b[1;32m│\u001b[0m Yes, mosquitoes in high altitude can expand viruses over large distances. A study intercepted 1,017 female \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m mosquitoes at altitudes of 120-290 m above ground over Mali and Ghana and screened them for infection with \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m arboviruses, plasmodia, and filariae. The study found that 3.5% of the mosquitoes were infected with \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m flaviviruses, and 1.1% were infectious. Additionally, the study identified 19 mosquito-borne pathogens, \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m including three arboviruses that affect humans (dengue, West Nile, and M’Poko viruses). 
The study provides \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m compelling evidence that mosquito-borne pathogens are often spread by windborne mosquitoes at altitude. \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters\n", "\n", "filters = MetadataFilters(\n", "    filters=[\n", "        ExactMatchFilter(key=\"filename\", value=\"nihpp-2024.12.26.630351v1.nxml\"),\n", "    ]\n", ")\n", "\n", "query_engine = index.as_query_engine(llm=GEN_MODEL, filter=filters, similarity_top_k=3)\n", "result = query_engine.query(query)\n", "\n", "console.print(\n", "    Panel(\n", "        result.response.strip(),\n", "        title=\"Generated Content with RAG\",\n", "        border_style=\"bold green\",\n", "    )\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 2 }