{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Conversion of custom XML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Step | Tech | Execution |\n", "| --- | --- | --- |\n", "| Embedding | Hugging Face / Sentence Transformers | 💻 Local |\n", "| Vector store | Milvus | 💻 Local |\n", "| Gen AI | Hugging Face Inference API | 🌐 Remote |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an example of using [Docling](https://docling-project.github.io/docling/) to convert structured data (XML) into a unified document\n", "representation format, `DoclingDocument`, and of leveraging its rich structured content for RAG applications.\n", "\n", "The data used in this example consists of patents from the [United States Patent and Trademark Office (USPTO)](https://www.uspto.gov/) and medical\n", "articles from [PubMed Central® (PMC)](https://pmc.ncbi.nlm.nih.gov/).\n", "\n", "In this notebook, we accomplish the following:\n", "- [Simple conversion](#simple-conversion) of supported XML files in a nutshell\n", "- An [end-to-end application](#end-to-end-application) using public collections of XML files supported by Docling\n", "    - [Set up](#setup) the API access for generative AI\n", "    - [Fetch the data](#fetch-the-data) from USPTO and PubMed Central® sites, using Docling custom backends\n", "    - [Parse, chunk, and index](#parse-chunk-and-index) the documents in a vector database\n", "    - [Perform RAG](#question-answering-with-rag) using [LlamaIndex Docling extension](../../integrations/llamaindex/)\n", "\n", "For more details on document chunking with Docling, refer to the [Chunking](../../concepts/chunking/) documentation. For RAG with Docling and LlamaIndex, also check the example [RAG with LlamaIndex](../rag_llamaindex/)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simple conversion\n", "\n", "The XML file format defines and stores data in a format that is both human-readable and machine-readable.\n", "Because of this flexibility, Docling requires custom backend processors to interpret XML definitions and convert them into `DoclingDocument` objects.\n", "\n", "Some public data collections in XML format are already supported by Docling (USPTO patents and PMC articles). In these cases, the document conversion is straightforward and works the same as with any other supported format, such as PDF or HTML. The execution example in [Simple Conversion](../minimal/) is the recommended usage of Docling for a single file:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ConversionStatus.SUCCESS\n" ] } ], "source": [ "from docling.document_converter import DocumentConverter\n", "\n", "# a sample PMC article:\n", "source = \"../../tests/data/jats/elife-56337.nxml\"\n", "converter = DocumentConverter()\n", "result = converter.convert(source)\n", "print(result.status)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the document is converted, it can be exported to any format supported by Docling. 
For instance, to Markdown (showing only the first lines here):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# KRAB-zinc finger protein gene expansion in response to active retrotransposons in the murine lineage\n", "\n", "Gernot Wolf, Alberto de Iaco, Ming-An Sun, Melania Bruno, Matthew Tinkham, Don Hoang, Apratim Mitra, Sherry Ralls, Didier Trono, Todd S Macfarlan\n", "\n", "The Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health, Bethesda, United States; School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland\n", "\n", "## Abstract\n", "\n" ] } ], "source": [ "md_doc = result.document.export_to_markdown()\n", "\n", "delim = \"\\n\"\n", "print(delim.join(md_doc.split(delim)[:8]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the XML file is not supported, a `ConversionError` will be raised." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Input document docling_test.xml does not match any allowed format.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "File format not allowed: docling_test.xml\n" ] } ], "source": [ "from io import BytesIO\n", "\n", "from docling.datamodel.base_models import DocumentStream\n", "from docling.exceptions import ConversionError\n", "\n", "xml_content = (\n", "    b'Random content'\n", ")\n", "stream = DocumentStream(name=\"docling_test.xml\", stream=BytesIO(xml_content))\n", "try:\n", "    result = converter.convert(stream)\n", "except ConversionError as ce:\n", "    print(ce)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can always refer to the [Usage](../../usage/#supported-formats) documentation page for a list of supported formats." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## End-to-end application\n", "\n", "This section describes a step-by-step application for processing XML files from supported public collections and using them for question answering." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Requirements can be installed as shown below. The `--no-warn-conflicts` argument is meant for Colab's pre-populated Python environment; feel free to remove it for stricter usage." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook uses Hugging Face's Inference API. For an increased LLM quota, a token can be provided via the environment variable `HF_TOKEN`.\n", "\n", "If you're running this notebook in Google Colab, make sure you [add](https://medium.com/@parthdasawant/how-to-use-secrets-in-google-colab-450c38e3ec75) your API key as a secret." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import os\n", "from warnings import filterwarnings\n", "\n", "from dotenv import load_dotenv\n", "\n", "\n", "def _get_env_from_colab_or_os(key):\n", "    try:\n", "        from google.colab import userdata\n", "\n", "        try:\n", "            return userdata.get(key)\n", "        except userdata.SecretNotFoundError:\n", "            pass\n", "    except ImportError:\n", "        pass\n", "    return os.getenv(key)\n", "\n", "\n", "load_dotenv()\n", "\n", "filterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now define the main parameters:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "from tempfile import mkdtemp\n", "\n", "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n", "from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n", "\n", "EMBED_MODEL_ID = \"BAAI/bge-small-en-v1.5\"\n", "EMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID)\n", "TEMP_DIR = Path(mkdtemp())\n", "MILVUS_URI = str(TEMP_DIR / \"docling.db\")\n", "GEN_MODEL = HuggingFaceInferenceAPI(\n", "    token=_get_env_from_colab_or_os(\"HF_TOKEN\"),\n", "    model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n", ")\n", "embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))\n", "# https://github.com/huggingface/transformers/issues/5486:\n", "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fetch the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we will use XML data from collections supported by Docling:\n", "- Medical articles from the [PubMed Central® (PMC)](https://pmc.ncbi.nlm.nih.gov/). They are available on an [FTP server](https://ftp.ncbi.nlm.nih.gov/pub/pmc/) as `.tar.gz` files. 
Each file contains the full article data in XML format, along with supplementary files such as images or spreadsheets.\n", "- Patents from the [United States Patent and Trademark Office](https://www.uspto.gov/). They are available in the [Bulk Data Storage System (BDSS)](https://bulkdata.uspto.gov/) as zip files. Each zip file may contain several patents in XML format.\n", "\n", "The raw files will be downloaded from the source and saved in a temporary directory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### PMC articles\n", "\n", "The [OA file](https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.csv) is a manifest file of all the PMC articles, including the URL path to download the source files. As an example, this notebook uses the article [Pathogens spread by high-altitude windborne mosquitoes](https://pmc.ncbi.nlm.nih.gov/articles/PMC11703268/), which is available in the archive file [PMC11703268.tar.gz](https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz)." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz...\n", "Extracting and storing the XML file containing the article text...\n", "Stored XML file nihpp-2024.12.26.630351v1.nxml\n" ] } ], "source": [ "import tarfile\n", "from io import BytesIO\n", "\n", "import requests\n", "\n", "# PMC article PMC11703268\n", "url: str = \"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz\"\n", "\n", "print(f\"Downloading {url}...\")\n", "buf = BytesIO(requests.get(url).content)\n", "print(\"Extracting and storing the XML file containing the article text...\")\n", "with tarfile.open(fileobj=buf, mode=\"r:gz\") as tar_file:\n", "    for tarinfo in tar_file:\n", "        if tarinfo.isreg():\n", "            file_path = Path(tarinfo.name)\n", "            if file_path.suffix == \".nxml\":\n", "                with open(TEMP_DIR / file_path.name, \"wb\") as file_obj:\n", "                    file_obj.write(tar_file.extractfile(tarinfo).read())\n", "                print(f\"Stored XML file {file_path.name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### USPTO patents\n", "\n", "Since each USPTO file is a concatenation of several patents, we need to split its content into valid XML pieces. The following code downloads a sample zip file, splits its content into sections, and dumps each section as an XML file. For simplicity, this pipeline is shown here in a sequential manner, but it could be parallelized." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip...\n", "Parsing zip file, splitting into XML sections, and exporting to files...\n" ] } ], "source": [ "import zipfile\n", "\n", "# Patent grants from December 17-23, 2024\n", "url: str = (\n", " \"https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip\"\n", ")\n", "XML_SPLITTER: str = ' 0\n", " ): # cases like 0 and is_patent:\n", " doc_num += 1\n", " patent_id = f\"ipg241217-{doc_num}\"\n", " with open(TEMP_DIR / f\"{patent_id}.xml\", \"wb\") as file_obj:\n", " file_obj.write(patent_buffer.getbuffer())\n", " is_patent = False\n", " patent_buffer = BytesIO()\n", " elif decoded_line.startswith(\"" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index.from_documents(\n", " documents=reader.load_data(TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"),\n", " transformations=[node_parser],\n", " storage_context=StorageContext.from_defaults(vector_store=vector_store),\n", " embed_model=EMBED_MODEL,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question-answering with RAG" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The retriever can be used to identify highly relevant documents:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Node ID: 5afd36c0-a739-4a88-a51c-6d0f75358db5\n", "Text: The portable fitness monitoring device 102 may be a device such\n", "as, for example, a mobile phone, a personal digital assistant, a music\n", "file player (e.g. and MP3 player), an intelligent article for wearing\n", "(e.g. a fitness monitoring garment, wrist band, or watch), a dongle\n", "(e.g. 
a small hardware device that protects software) that includes a\n", "fitn...\n", "Score: 0.772\n", "\n", "Node ID: f294b5fd-9089-43cb-8c4e-d1095a634ff1\n", "Text: US Patent Application US 20120071306 entitled “Portable\n", "Multipurpose Whole Body Exercise Device” discloses a portable\n", "multipurpose whole body exercise device which can be used for general\n", "fitness, Pilates-type, core strengthening, therapeutic, and\n", "rehabilitative exercises as well as stretching and physical therapy\n", "and which includes storable acc...\n", "Score: 0.749\n", "\n", "Node ID: 8251c7ef-1165-42e1-8c91-c99c8a711bf7\n", "Text: Program products, methods, and systems for providing fitness\n", "monitoring services of the present invention can include any software\n", "application executed by one or more computing devices. A computing\n", "device can be any type of computing device having one or more\n", "processors. For example, a computing device can be a workstation,\n", "mobile device (e.g., ...\n", "Score: 0.744\n", "\n" ] } ], "source": [ "retriever = index.as_retriever(similarity_top_k=3)\n", "results = retriever.retrieve(\"What patents are related to fitness devices?\")\n", "\n", "for item in results:\n", "    print(item)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the query engine, we can run the question-answering with the RAG pattern on the set of indexed documents.\n", "\n", "First, we can prompt the LLM directly:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
╭──────────────────────────────────────────────────── Prompt ─────────────────────────────────────────────────────╮\n",
       "│ Do mosquitoes in high altitude expand viruses over large distances?                                             │\n",
       "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
       "
\n" ], "text/plain": [ "\u001b[1;31m╭─\u001b[0m\u001b[1;31m───────────────────────────────────────────────────\u001b[0m\u001b[1;31m Prompt \u001b[0m\u001b[1;31m────────────────────────────────────────────────────\u001b[0m\u001b[1;31m─╮\u001b[0m\n", "\u001b[1;31m│\u001b[0m Do mosquitoes in high altitude expand viruses over large distances? \u001b[1;31m│\u001b[0m\n", "\u001b[1;31m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
╭─────────────────────────────────────────────── Generated Content ───────────────────────────────────────────────╮\n",
       "│ Mosquitoes can be found at high altitudes, but their ability to transmit viruses over long distances is not     │\n",
       "│ primarily dependent on altitude. Mosquitoes are vectors for various diseases, such as malaria, dengue fever,    │\n",
       "│ and Zika virus, and their transmission range is more closely related to their movement, the presence of a host, │\n",
       "│ and environmental conditions that support their survival and reproduction.                                      │\n",
       "│                                                                                                                 │\n",
       "│ At high altitudes, the environment can be less suitable for mosquitoes due to factors such as colder            │\n",
       "│ temperatures, lower humidity, and stronger winds, which can limit their population size and distribution.       │\n",
       "│ However, some species of mosquitoes have adapted to high-altitude environments and can still transmit diseases  │\n",
       "│ in these areas.                                                                                                 │\n",
       "│                                                                                                                 │\n",
       "│ It is possible for mosquitoes to be transported by wind or human activities to higher altitudes, but this is    │\n",
       "│ not a significant factor in their ability to transmit viruses over long distances. Instead, long-distance       │\n",
       "│ transmission of viruses is more often associated with human travel and transportation, which can rapidly spread │\n",
       "│ infected mosquitoes or humans to new areas, leading to the spread of disease.                                   │\n",
       "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
       "
\n" ], "text/plain": [ "\u001b[1;32m╭─\u001b[0m\u001b[1;32m──────────────────────────────────────────────\u001b[0m\u001b[1;32m Generated Content \u001b[0m\u001b[1;32m──────────────────────────────────────────────\u001b[0m\u001b[1;32m─╮\u001b[0m\n", "\u001b[1;32m│\u001b[0m Mosquitoes can be found at high altitudes, but their ability to transmit viruses over long distances is not \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m primarily dependent on altitude. Mosquitoes are vectors for various diseases, such as malaria, dengue fever, \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m and Zika virus, and their transmission range is more closely related to their movement, the presence of a host, \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m and environmental conditions that support their survival and reproduction. \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m At high altitudes, the environment can be less suitable for mosquitoes due to factors such as colder \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m temperatures, lower humidity, and stronger winds, which can limit their population size and distribution. \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m However, some species of mosquitoes have adapted to high-altitude environments and can still transmit diseases \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m in these areas. 
\u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m It is possible for mosquitoes to be transported by wind or human activities to higher altitudes, but this is \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m not a significant factor in their ability to transmit viruses over long distances. Instead, long-distance \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m transmission of viruses is more often associated with human travel and transportation, which can rapidly spread \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m infected mosquitoes or humans to new areas, leading to the spread of disease. \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from llama_index.core.base.llms.types import ChatMessage, MessageRole\n", "from rich.console import Console\n", "from rich.panel import Panel\n", "\n", "console = Console()\n", "query = \"Do mosquitoes in high altitude expand viruses over large distances?\"\n", "\n", "usr_msg = ChatMessage(role=MessageRole.USER, content=query)\n", "response = GEN_MODEL.chat(messages=[usr_msg])\n", "\n", "console.print(Panel(query, title=\"Prompt\", border_style=\"bold red\"))\n", "console.print(\n", "    Panel(\n", "        response.message.content.strip(),\n", "        title=\"Generated Content\",\n", "        border_style=\"bold green\",\n", "    )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can compare the response when the model is prompted with the indexed PMC article as supporting context:" ] }, { "cell_type": "code", "execution_count": 17, 
"metadata": {}, "outputs": [ { "data": { "text/html": [ "
╭────────────────────────────────────────── Generated Content with RAG ───────────────────────────────────────────╮\n",
       "│ Yes, mosquitoes in high altitude can expand viruses over large distances. A study intercepted 1,017 female      │\n",
       "│ mosquitoes at altitudes of 120-290 m above ground over Mali and Ghana and screened them for infection with      │\n",
       "│ arboviruses, plasmodia, and filariae. The study found that 3.5% of the mosquitoes were infected with            │\n",
       "│ flaviviruses, and 1.1% were infectious. Additionally, the study identified 19 mosquito-borne pathogens,         │\n",
       "│ including three arboviruses that affect humans (dengue, West Nile, and M’Poko viruses). The study provides      │\n",
       "│ compelling evidence that mosquito-borne pathogens are often spread by windborne mosquitoes at altitude.         │\n",
       "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
       "
\n" ], "text/plain": [ "\u001b[1;32m╭─\u001b[0m\u001b[1;32m─────────────────────────────────────────\u001b[0m\u001b[1;32m Generated Content with RAG \u001b[0m\u001b[1;32m──────────────────────────────────────────\u001b[0m\u001b[1;32m─╮\u001b[0m\n", "\u001b[1;32m│\u001b[0m Yes, mosquitoes in high altitude can expand viruses over large distances. A study intercepted 1,017 female \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m mosquitoes at altitudes of 120-290 m above ground over Mali and Ghana and screened them for infection with \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m arboviruses, plasmodia, and filariae. The study found that 3.5% of the mosquitoes were infected with \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m flaviviruses, and 1.1% were infectious. Additionally, the study identified 19 mosquito-borne pathogens, \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m including three arboviruses that affect humans (dengue, West Nile, and M’Poko viruses). The study provides \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m compelling evidence that mosquito-borne pathogens are often spread by windborne mosquitoes at altitude. 
\u001b[1;32m│\u001b[0m\n", "\u001b[1;32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters\n", "\n", "filters = MetadataFilters(\n", "    filters=[\n", "        ExactMatchFilter(key=\"filename\", value=\"nihpp-2024.12.26.630351v1.nxml\"),\n", "    ]\n", ")\n", "\n", "query_engine = index.as_query_engine(llm=GEN_MODEL, filters=filters, similarity_top_k=3)\n", "result = query_engine.query(query)\n", "\n", "console.print(\n", "    Panel(\n", "        result.response.strip(),\n", "        title=\"Generated Content with RAG\",\n", "        border_style=\"bold green\",\n", "    )\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 2 }