docling/docs/examples/rag_llamaindex.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_llamaindex.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RAG with LlamaIndex 🦙"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example leverages the official [LlamaIndex Docling extension](../../integrations/llamaindex/).\n",
    "\n",
    "Presented extensions `DoclingReader` and `DoclingNodeParser` enable you to:\n",
    "- use various document types in your LLM applications with ease and speed, and\n",
    "- leverage Docling's rich format for advanced, document-native grounding."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 👉 For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use GPU-enabled runtime.\n",
    "- Notebook uses HuggingFace's Inference API; for increased LLM quota, token can be provided via env var `HF_TOKEN`.\n",
    "- Requirements can be installed as shown below (`--no-warn-conflicts` meant for Colab's pre-populated Python env; feel free to remove for stricter usage):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "from tempfile import mkdtemp\n",
    "from warnings import filterwarnings\n",
    "\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "\n",
    "def _get_env_from_colab_or_os(key):\n",
    "    try:\n",
    "        from google.colab import userdata\n",
    "\n",
    "        try:\n",
    "            return userdata.get(key)\n",
    "        except userdata.SecretNotFoundError:\n",
    "            pass\n",
    "    except ImportError:\n",
    "        pass\n",
    "    return os.getenv(key)\n",
    "\n",
    "\n",
    "load_dotenv()\n",
    "\n",
    "filterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")\n",
    "filterwarnings(action=\"ignore\", category=FutureWarning, module=\"easyocr\")\n",
    "# https://github.com/huggingface/transformers/issues/5486:\n",
    "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now define the main parameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
    "from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n",
    "\n",
    "EMBED_MODEL = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")\n",
    "MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\n",
    "GEN_MODEL = HuggingFaceInferenceAPI(\n",
    "    token=_get_env_from_colab_or_os(\"HF_TOKEN\"),\n",
    "    model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n",
    ")\n",
    "SOURCE = \"https://arxiv.org/pdf/2408.09869\"  # Docling Technical Report\n",
    "QUERY = \"Which are the main AI models in Docling?\"\n",
    "\n",
    "embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Markdown export"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To create a simple RAG pipeline, we can:\n",
    "- define a `DoclingReader`, which by default exports to Markdown, and\n",
    "- use a standard node parser for these Markdown-based docs, e.g. a `MarkdownNodeParser`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Q: Which are the main AI models in Docling?\n",
      "A: The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model.\n",
      "\n",
      "Sources:\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('3.2 AI models\\n\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n",
       "  {'Header_2': '3.2 AI models'}),\n",
       " (\"5 Applications\\n\\nThanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. With its powerful table structure recognition, it provides significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build large-scale multi-modal training datasets.\",\n",
       "  {'Header_2': '5 Applications'})]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from llama_index.core import StorageContext, VectorStoreIndex\n",
    "from llama_index.core.node_parser import MarkdownNodeParser\n",
    "from llama_index.readers.docling import DoclingReader\n",
    "from llama_index.vector_stores.milvus import MilvusVectorStore\n",
    "\n",
    "reader = DoclingReader()\n",
    "node_parser = MarkdownNodeParser()\n",
    "\n",
    "vector_store = MilvusVectorStore(\n",
    "    uri=str(Path(mkdtemp()) / \"docling.db\"),  # or set as needed\n",
    "    dim=embed_dim,\n",
    "    overwrite=True,\n",
    ")\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents=reader.load_data(SOURCE),\n",
    "    transformations=[node_parser],\n",
    "    storage_context=StorageContext.from_defaults(vector_store=vector_store),\n",
    "    embed_model=EMBED_MODEL,\n",
    ")\n",
    "result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\n",
    "print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\n",
    "display([(n.text, n.metadata) for n in result.source_nodes])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Docling format"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To leverage Docling's rich native format, we:\n",
    "- create a `DoclingReader` with JSON export type, and\n",
    "- employ a `DoclingNodeParser` in order to appropriately parse that Docling format.\n",
    "\n",
    "Notice how the sources now also contain document-level grounding (e.g. page number or bounding box information):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Q: Which are the main AI models in Docling?\n",
      "A: The main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, and TableFormer is a state-of-the-art table structure recognition model.\n",
      "\n",
      "Sources:\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n",
       "  {'schema_name': 'docling_core.transforms.chunker.DocMeta',\n",
       "   'version': '1.0.0',\n",
       "   'doc_items': [{'self_ref': '#/texts/34',\n",
       "     'parent': {'$ref': '#/body'},\n",
       "     'children': [],\n",
       "     'label': 'text',\n",
       "     'prov': [{'page_no': 3,\n",
       "       'bbox': {'l': 107.07593536376953,\n",
       "        't': 406.1695251464844,\n",
       "        'r': 504.1148681640625,\n",
       "        'b': 330.2677307128906,\n",
       "        'coord_origin': 'BOTTOMLEFT'},\n",
       "       'charspan': [0, 608]}]}],\n",
       "   'headings': ['3.2 AI models'],\n",
       "   'origin': {'mimetype': 'application/pdf',\n",
       "    'binary_hash': 14981478401387673002,\n",
       "    'filename': '2408.09869v3.pdf'}}),\n",
       " ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.',\n",
       "  {'schema_name': 'docling_core.transforms.chunker.DocMeta',\n",
       "   'version': '1.0.0',\n",
       "   'doc_items': [{'self_ref': '#/texts/9',\n",
       "     'parent': {'$ref': '#/body'},\n",
       "     'children': [],\n",
       "     'label': 'text',\n",
       "     'prov': [{'page_no': 1,\n",
       "       'bbox': {'l': 107.0031967163086,\n",
       "        't': 136.7283935546875,\n",
       "        'r': 504.04998779296875,\n",
       "        'b': 83.30133056640625,\n",
       "        'coord_origin': 'BOTTOMLEFT'},\n",
       "       'charspan': [0, 488]}]}],\n",
       "   'headings': ['1 Introduction'],\n",
       "   'origin': {'mimetype': 'application/pdf',\n",
       "    'binary_hash': 14981478401387673002,\n",
       "    'filename': '2408.09869v3.pdf'}})]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from llama_index.node_parser.docling import DoclingNodeParser\n",
    "\n",
    "reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)\n",
    "node_parser = DoclingNodeParser()\n",
    "\n",
    "vector_store = MilvusVectorStore(\n",
    "    uri=str(Path(mkdtemp()) / \"docling.db\"),  # or set as needed\n",
    "    dim=embed_dim,\n",
    "    overwrite=True,\n",
    ")\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents=reader.load_data(SOURCE),\n",
    "    transformations=[node_parser],\n",
    "    storage_context=StorageContext.from_defaults(vector_store=vector_store),\n",
    "    embed_model=EMBED_MODEL,\n",
    ")\n",
    "result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\n",
    "print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\n",
    "display([(n.text, n.metadata) for n in result.source_nodes])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## With Simple Directory Reader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To demonstrate this usage pattern, we first set up a test document directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "from tempfile import mkdtemp\n",
    "\n",
    "import requests\n",
    "\n",
    "tmp_dir_path = Path(mkdtemp())\n",
    "r = requests.get(SOURCE)\n",
    "with open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file:\n",
    "    out_file.write(r.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the `reader` and `node_parser` definitions from any of the above variants, usage with `SimpleDirectoryReader` then looks as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading files: 100%|██████████| 1/1 [00:11<00:00, 11.27s/file]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Q: Which are the main AI models in Docling?\n",
      "A: 1. A layout analysis model, an accurate object-detector for page elements. 2. TableFormer, a state-of-the-art table structure recognition model.\n",
      "\n",
      "Sources:\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n",
       "  {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp2ooyusg5/2408.09869.pdf',\n",
       "   'file_name': '2408.09869.pdf',\n",
       "   'file_type': 'application/pdf',\n",
       "   'file_size': 5566574,\n",
       "   'creation_date': '2024-10-28',\n",
       "   'last_modified_date': '2024-10-28',\n",
       "   'schema_name': 'docling_core.transforms.chunker.DocMeta',\n",
       "   'version': '1.0.0',\n",
       "   'doc_items': [{'self_ref': '#/texts/34',\n",
       "     'parent': {'$ref': '#/body'},\n",
       "     'children': [],\n",
       "     'label': 'text',\n",
       "     'prov': [{'page_no': 3,\n",
       "       'bbox': {'l': 107.07593536376953,\n",
       "        't': 406.1695251464844,\n",
       "        'r': 504.1148681640625,\n",
       "        'b': 330.2677307128906,\n",
       "        'coord_origin': 'BOTTOMLEFT'},\n",
       "       'charspan': [0, 608]}]}],\n",
       "   'headings': ['3.2 AI models'],\n",
       "   'origin': {'mimetype': 'application/pdf',\n",
       "    'binary_hash': 14981478401387673002,\n",
       "    'filename': '2408.09869.pdf'}}),\n",
       " ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.',\n",
       "  {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp2ooyusg5/2408.09869.pdf',\n",
       "   'file_name': '2408.09869.pdf',\n",
       "   'file_type': 'application/pdf',\n",
       "   'file_size': 5566574,\n",
       "   'creation_date': '2024-10-28',\n",
       "   'last_modified_date': '2024-10-28',\n",
       "   'schema_name': 'docling_core.transforms.chunker.DocMeta',\n",
       "   'version': '1.0.0',\n",
       "   'doc_items': [{'self_ref': '#/texts/9',\n",
       "     'parent': {'$ref': '#/body'},\n",
       "     'children': [],\n",
       "     'label': 'text',\n",
       "     'prov': [{'page_no': 1,\n",
       "       'bbox': {'l': 107.0031967163086,\n",
       "        't': 136.7283935546875,\n",
       "        'r': 504.04998779296875,\n",
       "        'b': 83.30133056640625,\n",
       "        'coord_origin': 'BOTTOMLEFT'},\n",
       "       'charspan': [0, 488]}]}],\n",
       "   'headings': ['1 Introduction'],\n",
       "   'origin': {'mimetype': 'application/pdf',\n",
       "    'binary_hash': 14981478401387673002,\n",
       "    'filename': '2408.09869.pdf'}})]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from llama_index.core import SimpleDirectoryReader\n",
    "\n",
    "dir_reader = SimpleDirectoryReader(\n",
    "    input_dir=tmp_dir_path,\n",
    "    file_extractor={\".pdf\": reader},\n",
    ")\n",
    "\n",
    "vector_store = MilvusVectorStore(\n",
    "    uri=str(Path(mkdtemp()) / \"docling.db\"),  # or set as needed\n",
    "    dim=embed_dim,\n",
    "    overwrite=True,\n",
    ")\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents=dir_reader.load_data(SOURCE),\n",
    "    transformations=[node_parser],\n",
    "    storage_context=StorageContext.from_defaults(vector_store=vector_store),\n",
    "    embed_model=EMBED_MODEL,\n",
    ")\n",
    "result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\n",
    "print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\n",
    "display([(n.text, n.metadata) for n in result.source_nodes])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}