{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# RAG with OpenSearch\n",
"\n",
"| Step | Tech | Execution |\n",
"| --- | --- | --- |\n",
"| Embedding | HuggingFace (IBM Granite Embedding 30M) | 💻 Local |\n",
"| Vector store | OpenSearch 3.0.0 | 💻 Local |\n",
"| Gen AI | Ollama (IBM Granite 4.0 Tiny) | 💻 Local |\n",
"\n",
"\n",
"This is a code recipe that uses [OpenSearch](https://opensearch.org/), an open-source search and analytics tool,\n",
"and the [LlamaIndex](https://github.com/run-llama/llama_index) framework to perform RAG over documents parsed by [Docling](https://docling-project.github.io/docling/).\n",
"\n",
"In this notebook, we accomplish the following:\n",
"* 📚 Parse documents using Docling's document conversion capabilities\n",
"* 🧩 Perform hierarchical chunking of the documents using Docling\n",
"* 🔢 Generate text embeddings on document chunks\n",
"* 🤖 Perform RAG using OpenSearch and the LlamaIndex framework\n",
"* 🛠️ Leverage the transformation and structure capabilities of Docling documents for RAG\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparation\n",
"\n",
"### Running the notebook\n",
"\n",
"For running this notebook on your machine, you can use applications like [Jupyter Notebook](https://jupyter.org/install) or [Visual Studio Code](https://code.visualstudio.com/docs/datascience/jupyter-notebooks).\n",
"\n",
"💡 For best results, please use **GPU acceleration** to run this notebook.\n",
"\n",
"### Virtual environment\n",
"\n",
"Before installing dependencies and to avoid conflicts in your environment, it is advisable to use a [virtual environment (venv)](https://docs.python.org/3/library/venv.html).\n",
"For instance, [uv](https://docs.astral.sh/uv/) is a popular tool to manage virtual environments and dependencies. You can install it with:\n",
"\n",
"\n",
"```shell\n",
"curl -LsSf https://astral.sh/uv/install.sh | sh\n",
"```\n",
"\n",
"Then create the virtual environment and activate it:\n",
"\n",
"```shell\n",
" uv venv\n",
" source .venv/bin/activate\n",
" ```\n",
"\n",
"Refer to [Installing uv](https://docs.astral.sh/uv/getting-started/installation/) for more details.\n",
"\n",
"### Dependencies\n",
"\n",
"To start, install the required dependencies by running the following command:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n",
"\n",
"! uv pip install -q --no-progress notebook ipywidgets docling llama-index-readers-file llama-index-readers-docling llama-index-readers-elasticsearch llama-index-node-parser-docling llama-index-vector-stores-opensearch llama-index-embeddings-huggingface llama-index-llms-ollama"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now import all the necessary modules for this notebook:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/ceb/git/docling/.venv/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'validate_default' attribute with value True was provided to the `Field()` function, which has no effect in the context it was used. 'validate_default' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.\n",
" warnings.warn(\n"
]
}
],
"source": [
"import logging\n",
"from pathlib import Path\n",
"from tempfile import mkdtemp\n",
"\n",
"import requests\n",
"import torch\n",
"from docling_core.transforms.chunker import HierarchicalChunker\n",
"from docling_core.transforms.chunker.hierarchical_chunker import (\n",
" ChunkingDocSerializer,\n",
" ChunkingSerializerProvider,\n",
")\n",
"from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\n",
"from docling_core.transforms.serializer.markdown import MarkdownTableSerializer\n",
"from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex\n",
"from llama_index.core.data_structs import Node\n",
"from llama_index.core.response_synthesizers import get_response_synthesizer\n",
"from llama_index.core.schema import NodeWithScore, TransformComponent\n",
"from llama_index.core.vector_stores import MetadataFilter, MetadataFilters\n",
"from llama_index.core.vector_stores.types import VectorStoreQueryMode\n",
"from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
"from llama_index.llms.ollama import Ollama\n",
"from llama_index.node_parser.docling import DoclingNodeParser\n",
"from llama_index.readers.docling import DoclingReader\n",
"from llama_index.readers.elasticsearch import ElasticsearchReader\n",
"from llama_index.vector_stores.opensearch import (\n",
" OpensearchVectorClient,\n",
" OpensearchVectorStore,\n",
")\n",
"from rich.console import Console\n",
"from rich.pretty import pprint\n",
"from transformers import AutoTokenizer\n",
"\n",
"from docling.chunking import HybridChunker\n",
"\n",
"logging.getLogger().setLevel(logging.WARNING)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### GPU Checking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Part of what makes Docling so remarkable is the fact that it can run on commodity hardware. This means that this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with a silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, seamlessly integrating with PyTorch and TensorFlow, offering energy-efficient performance on Apple Silicon, and broad compatibility with all Metal-supported GPUs.\n",
"\n",
"The code below checks if a GPU is available, either via CUDA or MPS."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MPS GPU is enabled.\n"
]
}
],
"source": [
"# Check if GPU or MPS is available\n",
"if torch.cuda.is_available():\n",
" device = torch.device(\"cuda\")\n",
" print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\n",
"elif torch.backends.mps.is_available():\n",
" device = torch.device(\"mps\")\n",
" print(\"MPS GPU is enabled.\")\n",
"else:\n",
" raise OSError(\n",
" \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Local OpenSearch instance\n",
"\n",
"To run the notebook locally, we can pull an OpenSearch image and run a single node for local development.\n",
"You can use a container tool like [Podman](https://podman.io/) or [Docker](https://www.docker.com/).\n",
"In the interest of simplicity, we disable the SSL option for this example.\n",
"\n",
"💡 The version of the OpenSearch instance needs to be compatible with the version of the [OpenSearch Python Client](https://github.com/opensearch-project/opensearch-py) library,\n",
"since this library is used by the LlamaIndex framework, which we leverage in this notebook.\n",
"\n",
"On your computer terminal run:\n",
"\n",
"\n",
"```shell\n",
"podman run \\\n",
" -it \\\n",
" --pull always \\\n",
" -p 9200:9200 \\\n",
" -p 9600:9600 \\\n",
" -e \"discovery.type=single-node\" \\\n",
" -e DISABLE_INSTALL_DEMO_CONFIG=true \\\n",
" -e DISABLE_SECURITY_PLUGIN=true \\\n",
" --name opensearch-node \\\n",
" -d opensearchproject/opensearch:3.0.0\n",
"```\n",
"\n",
"Once the instance is running, verify that you can connect to OpenSearch:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"name\" : \"b20d8368e745\",\n",
" \"cluster_name\" : \"docker-cluster\",\n",
" \"cluster_uuid\" : \"0gEZCJQwRHabS_E-n_3i9g\",\n",
" \"version\" : {\n",
" \"distribution\" : \"opensearch\",\n",
" \"number\" : \"3.0.0\",\n",
" \"build_type\" : \"tar\",\n",
" \"build_hash\" : \"dc4efa821904cc2d7ea7ef61c0f577d3fc0d8be9\",\n",
" \"build_date\" : \"2025-05-03T06:23:50.311109522Z\",\n",
" \"build_snapshot\" : false,\n",
" \"lucene_version\" : \"10.1.0\",\n",
" \"minimum_wire_compatibility_version\" : \"2.19.0\",\n",
" \"minimum_index_compatibility_version\" : \"2.0.0\"\n",
" },\n",
" \"tagline\" : \"The OpenSearch Project: https://opensearch.org/\"\n",
"}\n",
"\n"
]
}
],
"source": [
"response = requests.get(\"http://localhost:9200\")\n",
"print(response.text)"
]
},
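{
"cell_type": "markdown",
"metadata": {},
"source": [
"Right after the container starts, the HTTP endpoint may refuse connections for a few seconds. As a minimal sketch that uses only the Python standard library, the helper below polls the endpoint until it answers. The name `wait_for_http` and its `retries`, `delay`, and `opener` parameters are illustrative assumptions; they are not part of OpenSearch or LlamaIndex.\n",
"\n",
"```python\n",
"import time\n",
"import urllib.request\n",
"\n",
"\n",
"def wait_for_http(url, retries=30, delay=2.0, opener=urllib.request.urlopen):\n",
"    # Hypothetical helper: poll `url` until it answers with HTTP 200.\n",
"    for _ in range(retries):\n",
"        try:\n",
"            with opener(url, timeout=5) as resp:\n",
"                if resp.status == 200:\n",
"                    return True\n",
"        except OSError:\n",
"            # Connection refused, timeout, or HTTP error: retry after a pause.\n",
"            pass\n",
"        time.sleep(delay)\n",
"    return False\n",
"```\n",
"\n",
"For example, `wait_for_http('http://localhost:9200')` returns `True` once the node responds and `False` if it never does within the retry budget."
]
},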
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Language models\n",
"\n",
"We will use [HuggingFace](https://huggingface.co/) and [Ollama](https://ollama.com/) to run language models on your local computer, rather than relying on cloud services.\n",
"\n",
"In this example, the following models are considered:\n",
"- [IBM Granite Embedding 30M English](https://huggingface.co/ibm-granite/granite-embedding-30m-english) with HuggingFace for text embeddings\n",
"- [IBM Granite 4.0 Tiny](https://ollama.com/library/granite4:tiny-h) with Ollama for model inference\n",
"\n",
"Once Ollama is installed on your computer, you can pull the model above from your terminal:\n",
"\n",
"```shell\n",
"ollama pull granite4:tiny-h\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup\n",
"\n",
"We setup the main variables for OpenSearch and the embedding and generation models."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The embedding dimension is 384.\n"
]
}
],
"source": [
"# http endpoint for your cluster\n",
"OPENSEARCH_ENDPOINT = \"http://localhost:9200\"\n",
"# index to store the Docling document vectors\n",
"OPENSEARCH_INDEX = \"docling-index\"\n",
"# the embedding model\n",
"EMBED_MODEL = HuggingFaceEmbedding(\n",
" model_name=\"ibm-granite/granite-embedding-30m-english\"\n",
")\n",
"# maximum chunk size in tokens\n",
"EMBED_MAX_TOKENS = 200\n",
"# the generation model\n",
"GEN_MODEL = Ollama(\n",
" model=\"granite4:tiny-h\",\n",
" request_timeout=120.0,\n",
" # Manually set the context window to limit memory usage\n",
" context_window=8000,\n",
" # Set temperature to 0 for reproducibility of the results\n",
" temperature=0.0,\n",
")\n",
"# a sample document\n",
"SOURCE = \"https://arxiv.org/pdf/2408.09869\"\n",
"\n",
"embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))\n",
"print(f\"The embedding dimension is {embed_dim}.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Process Data Using Docling\n",
"\n",
"Docling can parse various document formats into a unified representation ([DoclingDocument](https://docling-project.github.io/docling/concepts/docling_document/)), which can then be exported to different output formats. For a full list of supported input and output formats, please refer to [Supported formats](https://docling-project.github.io/docling/usage/supported_formats/) section of Docling's documentation.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this recipe, we will use a single PDF file, the [Docling Technical Report](https://arxiv.org/pdf/2408.09869). We will process it using the [Hybrid Chunker](https://docling-project.github.io/docling/concepts/chunking/#hybrid-chunker) provided by Docling to generate structured, hierarchical chunks suitable for downstream RAG tasks."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run the document conversion pipeline\n",
"\n",
"We will convert the original PDF file into a `DoclingDocument` format using a `DoclingReader` object. We specify the JSON export type to retain the document hierarchical structure as an input for the next step (chunking the document)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"tmp_dir_path = Path(mkdtemp())\n",
"req = requests.get(SOURCE)\n",
"with open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file:\n",
" out_file.write(req.content)\n",
"\n",
"reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)\n",
"dir_reader = SimpleDirectoryReader(\n",
" input_dir=tmp_dir_path,\n",
" file_extractor={\".pdf\": reader},\n",
")\n",
"\n",
"# load the PDF files\n",
"documents = dir_reader.load_data()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Data into OpenSearch\n",
"\n",
"#### Define the Transformations\n",
"\n",
"Before the actual ingestion of data, we need to define the data transformations to apply on the `DoclingDocument`:\n",
"\n",
"- `DoclingNodeParser` executes the document-based chunking with the hybrid chunker, which leverages the tokenizer of the embedding model to ensure that the resulting chunks fit within the model input text limit.\n",
"- `MetadataTransform` is a custom transformation to ensure that generated chunk metadata is best formatted for indexing with OpenSearch\n",
"\n",
"\n",
"💡 For demonstration purposes, we configure the hybrid chunker to produce chunks capped at 200 tokens. The optimal limit will vary according to the specific requirements of the AI application in question.\n",
"If this value is omitted, the chunker automatically derives the maximum size from the tokenizer. This safeguard guarantees that each chunk remains within the bounds supported by the underlying embedding model."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# create the hybrid chunker\n",
"tokenizer = HuggingFaceTokenizer(\n",
" tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL.model_name),\n",
" max_tokens=EMBED_MAX_TOKENS,\n",
")\n",
"chunker = HybridChunker(tokenizer=tokenizer)\n",
"\n",
"# create a Docling node parser\n",
"node_parser = DoclingNodeParser(chunker=chunker)\n",
"\n",
"\n",
"# create a custom transformation to avoid out-of-range integers\n",
"class MetadataTransform(TransformComponent):\n",
" def __call__(self, nodes, **kwargs):\n",
" for node in nodes:\n",
" binary_hash = node.metadata.get(\"origin\", {}).get(\"binary_hash\", None)\n",
" if binary_hash is not None:\n",
" node.metadata[\"origin\"][\"binary_hash\"] = str(binary_hash)\n",
" return nodes"
]
},
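{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cast in `MetadataTransform` above can be illustrated on a plain dictionary. This is a minimal sketch under two assumptions that the notebook does not state: that `binary_hash` is an unsigned 64-bit integer, and that the index would otherwise map it to a signed 64-bit `long` field. The helper name `stringify_binary_hash` and the sample value are hypothetical.\n",
"\n",
"```python\n",
"LONG_MAX = 2**63 - 1  # largest signed 64-bit integer (assumed field limit)\n",
"\n",
"\n",
"def stringify_binary_hash(metadata):\n",
"    # Same idea as MetadataTransform, applied to a plain metadata dict.\n",
"    origin = metadata.get('origin', {})\n",
"    if origin.get('binary_hash') is not None:\n",
"        origin['binary_hash'] = str(origin['binary_hash'])\n",
"    return metadata\n",
"\n",
"\n",
"sample = {'origin': {'binary_hash': 2**64 - 1}}  # hypothetical worst case\n",
"# Check whether the raw integer exceeds the assumed signed 64-bit limit.\n",
"print(sample['origin']['binary_hash'] > LONG_MAX)\n",
"# After the cast, the value is a string and no longer subject to that limit.\n",
"print(stringify_binary_hash(sample)['origin']['binary_hash'])\n",
"```\n",
"\n",
"Storing the hash as a string keeps the value intact, at the cost of losing numeric range queries on that field, which are rarely needed for a hash."
]
},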
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Embed and Insert the Data\n",
"\n",
"In this step, we create an `OpenSearchVectorClient`, which encapsulates the logic for a single OpenSearch index with vector search enabled.\n",
"\n",
"We then initialize the index using our sample data (a single PDF file), the Docling node parser, and the OpenSearch client that we just created.\n",
"\n",
"💡 You may get a warning message like:\n",
"> Token indices sequence length is longer than the specified maximum sequence length for this model\n",
"\n",
"This is a _false alarm_ and you may get more background explanation in [Docling's FAQ](https://docling-project.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model) page."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-10-24 15:05:49,841 - WARNING - GET http://localhost:9200/docling-index [status:404 request:0.006s]\n"
]
}
],
"source": [
"# OpensearchVectorClient stores text in this field by default\n",
"text_field = \"content\"\n",
"# OpensearchVectorClient stores embeddings in this field by default\n",
"embed_field = \"embedding\"\n",
"\n",
"client = OpensearchVectorClient(\n",
" endpoint=OPENSEARCH_ENDPOINT,\n",
" index=OPENSEARCH_INDEX,\n",
" dim=embed_dim,\n",
" engine=\"faiss\",\n",
" embedding_field=embed_field,\n",
" text_field=text_field,\n",
")\n",
"\n",
"vector_store = OpensearchVectorStore(client)\n",
"storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
"\n",
"index = VectorStoreIndex.from_documents(\n",
" documents=documents,\n",
" transformations=[node_parser, MetadataTransform()],\n",
" storage_context=storage_context,\n",
" embed_model=EMBED_MODEL,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build RAG\n",
"\n",
"In this section, we will see how to assemble a RAG system, execute a query, and get a generated response.\n",
"\n",
"We will also describe how to leverage Docling capabilities to improve RAG results.\n",
"\n",
"\n",
"### Run a query\n",
"\n",
"With LlamaIndex's query engine, we can simply run a RAG system as follows:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">👤: Which are the main AI models in Docling?\n",
"🤖: The two main AI models used in Docling are:\n",
"\n",
"<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>. A layout analysis model, an accurate object-detector for page elements \n",
"<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span>. TableFormer, a state-of-the-art table structure recognition model\n",
"\n",
"These models were initially released as part of the open-source Docling package to help \n",
"with document understanding tasks.\n",
"</pre>\n"
],
"text/plain": [
"👤: Which are the main AI models in Docling?\n",
"🤖: The two main AI models used in Docling are:\n",
"\n",
"\u001b[1;36m1\u001b[0m. A layout analysis model, an accurate object-detector for page elements \n",
"\u001b[1;36m2\u001b[0m. TableFormer, a state-of-the-art table structure recognition model\n",
"\n",
"These models were initially released as part of the open-source Docling package to help \n",
"with document understanding tasks.\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"console = Console(width=88)\n",
"\n",
"QUERY = \"Which are the main AI models in Docling?\"\n",
"query_engine = index.as_query_engine(llm=GEN_MODEL)\n",
"res = query_engine.query(QUERY)\n",
"\n",
"console.print(f\"👤: {QUERY}\\n🤖: {res.response.strip()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Custom serializers\n",
"\n",
"Docling can extract the table content and process it for chunking, like other text elements.\n",
"\n",
"In the following example, the response is generated from a retrieved chunk containing a table."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">👤: What is the time to solution with the native backend on Intel?\n",
"🤖: The time to solution <span style=\"font-weight: bold\">(</span>TTS<span style=\"font-weight: bold\">)</span> for the native backend on Intel is:\n",
"- For Apple M3 Max <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">16</span> cores<span style=\"font-weight: bold\">)</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">375</span> seconds \n",
"- For <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">Intel</span><span style=\"font-weight: bold\">(</span>R<span style=\"font-weight: bold\">)</span> Xeon E5-<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2690</span>, native backend: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">244</span> seconds\n",
"\n",
"So the TTS with the native backend on Intel ranges from approximately <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">244</span> to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">375</span> seconds\n",
"depending on the specific configuration.\n",
"</pre>\n"
],
"text/plain": [
"👤: What is the time to solution with the native backend on Intel?\n",
"🤖: The time to solution \u001b[1m(\u001b[0mTTS\u001b[1m)\u001b[0m for the native backend on Intel is:\n",
"- For Apple M3 Max \u001b[1m(\u001b[0m\u001b[1;36m16\u001b[0m cores\u001b[1m)\u001b[0m: \u001b[1;36m375\u001b[0m seconds \n",
"- For \u001b[1;35mIntel\u001b[0m\u001b[1m(\u001b[0mR\u001b[1m)\u001b[0m Xeon E5-\u001b[1;36m2690\u001b[0m, native backend: \u001b[1;36m244\u001b[0m seconds\n",
"\n",
"So the TTS with the native backend on Intel ranges from approximately \u001b[1;36m244\u001b[0m to \u001b[1;36m375\u001b[0m seconds\n",
"depending on the specific configuration.\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"QUERY = \"What is the time to solution with the native backend on Intel?\"\n",
"query_engine = index.as_query_engine(llm=GEN_MODEL)\n",
"res = query_engine.query(QUERY)\n",
"console.print(f\"👤: {QUERY}\\n🤖: {res.response.strip()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result above was generated with the table serialized in a triplet format.\n",
"Language models may perform better on complex tables if the structure is represented in a format that is widely adopted,\n",
"like [markdown](https://en.wikipedia.org/wiki/Markdown).\n",
"\n",
"For this purpose, we can leverage a custom serializer that transforms tables in markdown format:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Token indices sequence length is longer than the specified maximum sequence length for this model (538 > 512). Running this sequence through the model will result in indexing errors\n"
]
}
],
"source": [
"class MDTableSerializerProvider(ChunkingSerializerProvider):\n",
" def get_serializer(self, doc):\n",
" return ChunkingDocSerializer(\n",
" doc=doc,\n",
" # configuring a different table serializer\n",
" table_serializer=MarkdownTableSerializer(),\n",
" )\n",
"\n",
"\n",
"# clear the database from the previous chunks\n",
"client.clear()\n",
"vector_store.clear()\n",
"\n",
"chunker = HybridChunker(\n",
" tokenizer=tokenizer,\n",
" max_tokens=EMBED_MAX_TOKENS,\n",
" serializer_provider=MDTableSerializerProvider(),\n",
")\n",
"node_parser = DoclingNodeParser(chunker=chunker)\n",
"index = VectorStoreIndex.from_documents(\n",
" documents=documents,\n",
" transformations=[node_parser, MetadataTransform()],\n",
" storage_context=storage_context,\n",
" embed_model=EMBED_MODEL,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">👤: What is the time to solution with the native backend on Intel?\n",
"🤖: The table shows that for the native backend on Intel systems, the time-to-solution \n",
"<span style=\"font-weight: bold\">(</span>TTS<span style=\"font-weight: bold\">)</span> ranges from <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">239</span> seconds to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">375</span> seconds. Specifically:\n",
"- With <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span> threads, the TTS is <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">239</span> seconds.\n",
"- With <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">16</span> threads, the TTS is <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">244</span> seconds.\n",
"\n",
"So the time to solution with the native backend on Intel varies between approximately \n",
"<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">239</span> and <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">375</span> seconds depending on the thread budget used.\n",
"</pre>\n"
],
"text/plain": [
"👤: What is the time to solution with the native backend on Intel?\n",
"🤖: The table shows that for the native backend on Intel systems, the time-to-solution \n",
"\u001b[1m(\u001b[0mTTS\u001b[1m)\u001b[0m ranges from \u001b[1;36m239\u001b[0m seconds to \u001b[1;36m375\u001b[0m seconds. Specifically:\n",
"- With \u001b[1;36m4\u001b[0m threads, the TTS is \u001b[1;36m239\u001b[0m seconds.\n",
"- With \u001b[1;36m16\u001b[0m threads, the TTS is \u001b[1;36m244\u001b[0m seconds.\n",
"\n",
"So the time to solution with the native backend on Intel varies between approximately \n",
"\u001b[1;36m239\u001b[0m and \u001b[1;36m375\u001b[0m seconds depending on the thread budget used.\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"query_engine = index.as_query_engine(llm=GEN_MODEL)\n",
"res = query_engine.query(QUERY)\n",
"console.print(f\"👤: {QUERY}\\n🤖: {res.response.strip()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Observe that the generated response is now more accurate. Refer to the [Advanced chunking & serialization](https://docling-project.github.io/docling/examples/advanced_chunking_and_serialization/) example for more details on serialization strategies."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filter-context Query\n",
"\n",
"By default, the `DoclingNodeParser` will keep the hierarchical information of items when creating the chunks.\n",
"That information will be stored as metadata in the OpenSearch index. Leveraging the document structure is a powerful\n",
"feature of Docling for improving RAG systems, both for retrieval and for answer generation.\n",
"\n",
"For example, we can use chunk metadata with layout information to run queries in a filter context, for high retrieval accuracy.\n",
"\n",
"Using the previous setup, we can see that the most similar chunk corresponds to a paragraph without enough grounding for the question:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"def display_nodes(nodes):\n",
" res = []\n",
" for idx, item in enumerate(nodes):\n",
" doc_res = {\"k\": idx + 1, \"score\": item.score, \"text\": item.text, \"items\": []}\n",
" doc_items = item.metadata[\"doc_items\"]\n",
" for doc in doc_items:\n",
" doc_res[\"items\"].append({\"ref\": doc[\"self_ref\"], \"label\": doc[\"label\"]})\n",
" res.append(doc_res)\n",
" pprint(res, max_string=200)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"How does pypdfium perform?\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ </span><span style=\"font-weight: bold\">{</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'k'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'score'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.694972</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'text'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'- [13] B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar. Doclaynet: a large humanannotated dataset for document-layout segmentation. pages 3743-3751, 2022.\\n- [14] pypdf Maintainers. pypdf: '</span>+<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">314</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'items'</span>: <span style=\"font-weight: bold\">[</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ │ </span><span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/texts/93'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'list_item'</span><span style=\"font-weight: bold\">}</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ │ </span><span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/texts/94'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'list_item'</span><span style=\"font-weight: bold\">}</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ │ </span><span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/texts/95'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'list_item'</span><span style=\"font-weight: bold\">}</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ │ </span><span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/texts/96'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'list_item'</span><span style=\"font-weight: bold\">}</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"font-weight: bold\">]</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ </span><span style=\"font-weight: bold\">}</span>\n",
"<span style=\"font-weight: bold\">]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m[\u001b[0m\n",
"\u001b[2;32m│ \u001b[0m\u001b[1m{\u001b[0m\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'k'\u001b[0m: \u001b[1;36m1\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.694972\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'text'\u001b[0m: \u001b[32m'- \u001b[0m\u001b[32m[\u001b[0m\u001b[32m13\u001b[0m\u001b[32m]\u001b[0m\u001b[32m B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar. Doclaynet: a large humanannotated dataset for document-layout segmentation. pages 3743-3751, 2022.\\n- \u001b[0m\u001b[32m[\u001b[0m\u001b[32m14\u001b[0m\u001b[32m]\u001b[0m\u001b[32m pypdf Maintainers. pypdf: '\u001b[0m+\u001b[1;36m314\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'items'\u001b[0m: \u001b[1m[\u001b[0m\n",
"\u001b[2;32m│ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/93'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'list_item'\u001b[0m\u001b[1m}\u001b[0m,\n",
"\u001b[2;32m│ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/94'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'list_item'\u001b[0m\u001b[1m}\u001b[0m,\n",
"\u001b[2;32m│ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/95'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'list_item'\u001b[0m\u001b[1m}\u001b[0m,\n",
"\u001b[2;32m│ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/96'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'list_item'\u001b[0m\u001b[1m}\u001b[0m\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[1m]\u001b[0m\n",
"\u001b[2;32m│ \u001b[0m\u001b[1m}\u001b[0m\n",
"\u001b[1m]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"retriever = index.as_retriever(similarity_top_k=1)\n",
"\n",
"QUERY = \"How does pypdfium perform?\"\n",
"nodes = retriever.retrieve(QUERY)\n",
"\n",
"print(QUERY)\n",
"display_nodes(nodes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We may want to restrict the retrieval to only those chunks containing tabular data, expecting to retrieve more quantitative information for our type of question:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"How does pypdfium perform?\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ </span><span style=\"font-weight: bold\">{</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'k'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'score'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.6238112</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'text'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TT'</span>+<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">515</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'items'</span>: <span style=\"font-weight: bold\">[{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/tables/0'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'table'</span><span style=\"font-weight: bold\">}</span>, <span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/tables/0'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'table'</span><span style=\"font-weight: bold\">}]</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ </span><span style=\"font-weight: bold\">}</span>\n",
"<span style=\"font-weight: bold\">]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m[\u001b[0m\n",
"\u001b[2;32m│ \u001b[0m\u001b[1m{\u001b[0m\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'k'\u001b[0m: \u001b[1;36m1\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6238112\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'text'\u001b[0m: \u001b[32m'Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution \u001b[0m\u001b[32m(\u001b[0m\u001b[32mTT'\u001b[0m+\u001b[1;36m515\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'items'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/tables/0'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'table'\u001b[0m\u001b[1m}\u001b[0m, \u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/tables/0'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'table'\u001b[0m\u001b[1m}\u001b[0m\u001b[1m]\u001b[0m\n",
"\u001b[2;32m│ \u001b[0m\u001b[1m}\u001b[0m\n",
"\u001b[1m]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"filters = MetadataFilters(\n",
" filters=[MetadataFilter(key=\"doc_items.label\", value=\"table\")]\n",
")\n",
"\n",
"table_retriever = index.as_retriever(filters=filters, similarity_top_k=1)\n",
"nodes = table_retriever.retrieve(QUERY)\n",
"\n",
"print(QUERY)\n",
"display_nodes(nodes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Hybrid Search Retrieval with RRF\n",
"\n",
"Hybrid search combines keyword and semantic search to improve search relevance. To avoid relying on traditional score normalization techniques, the reciprocal rank fusion (RRF) feature on hybrid search can significantly improve the relevance of the retrieved chunks in our RAG system.\n",
"\n",
"First, create a search pipeline and specify RRF as technique:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\"acknowledged\":true}\n"
]
}
],
"source": [
"url = f\"{OPENSEARCH_ENDPOINT}/_search/pipeline/rrf-pipeline\"\n",
"headers = {\"Content-Type\": \"application/json\"}\n",
"body = {\n",
" \"description\": \"Post processor for hybrid RRF search\",\n",
" \"phase_results_processors\": [\n",
" {\"score-ranker-processor\": {\"combination\": {\"technique\": \"rrf\"}}}\n",
" ],\n",
"}\n",
"\n",
"response = requests.put(url, json=body, headers=headers)\n",
"print(response.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then repeat the previous steps to get a `VectorStoreIndex` object, leveraging the search pipeline that we just created:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-10-24 15:06:05,175 - WARNING - GET http://localhost:9200/docling-index-rrf [status:404 request:0.001s]\n"
]
}
],
"source": [
"client_rrf = OpensearchVectorClient(\n",
" endpoint=OPENSEARCH_ENDPOINT,\n",
" index=f\"{OPENSEARCH_INDEX}-rrf\",\n",
" dim=embed_dim,\n",
" engine=\"faiss\",\n",
" embedding_field=embed_field,\n",
" text_field=text_field,\n",
" search_pipeline=\"rrf-pipeline\",\n",
")\n",
"\n",
"vector_store_rrf = OpensearchVectorStore(client_rrf)\n",
"storage_context_rrf = StorageContext.from_defaults(vector_store=vector_store_rrf)\n",
"index_hybrid = VectorStoreIndex.from_documents(\n",
" documents=documents,\n",
" transformations=[node_parser, MetadataTransform()],\n",
" storage_context=storage_context_rrf,\n",
" embed_model=EMBED_MODEL,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first retriever, which entirely relies on semantic (vector) search, fails to catch the supporting chunk for the given question in the top 1 position.\n",
"Note that we highlight few expected keywords for illustration purposes.\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span> ***\n",
"Docling is designed to allow easy extension of the model library and pipelines. In the \n",
"future, we plan to extend Docling with several more models, such as a figure-classifier \n",
"model, an equationrecognition model, a code-recognition model and more. This will help \n",
"improve the quality of conversion for specific types of content, as well as augment \n",
"extracted document metadata with additional information. Further investment into testing\n",
"and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n",
"on our roadmap, too.\n",
"We encourage everyone to propose or implement additional features and models, and will \n",
"gladly take your inputs and contributions under review . The codebase of Docling is open\n",
"for use and contribution, under the MIT license agreement and in alignment with our \n",
"contributing guidelines included in the Docling repository. If you use Docling in your \n",
"projects, please consider citing this technical report.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m1\u001b[0m ***\n",
"Docling is designed to allow easy extension of the model library and pipelines. In the \n",
"future, we plan to extend Docling with several more models, such as a figure-classifier \n",
"model, an equationrecognition model, a code-recognition model and more. This will help \n",
"improve the quality of conversion for specific types of content, as well as augment \n",
"extracted document metadata with additional information. Further investment into testing\n",
"and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n",
"on our roadmap, too.\n",
"We encourage everyone to propose or implement additional features and models, and will \n",
"gladly take your inputs and contributions under review . The codebase of Docling is open\n",
"for use and contribution, under the MIT license agreement and in alignment with our \n",
"contributing guidelines included in the Docling repository. If you use Docling in your \n",
"projects, please consider citing this technical report.\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span> ***\n",
"In the final pipeline stage, Docling assembles all prediction results produced on each \n",
"page into a well-defined datatype that encapsulates a converted document, as defined in \n",
"the auxiliary package docling-core . The generated document object is passed through a \n",
"post-processing model which leverages several algorithms to augment features, such as \n",
"detection of the document language, correcting the reading order, matching figures with \n",
"captions and labelling metadata such as title, authors and references. The final output \n",
"can then be serialized to JSON or transformed into a Markdown representation at the \n",
"users request.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m2\u001b[0m ***\n",
"In the final pipeline stage, Docling assembles all prediction results produced on each \n",
"page into a well-defined datatype that encapsulates a converted document, as defined in \n",
"the auxiliary package docling-core . The generated document object is passed through a \n",
"post-processing model which leverages several algorithms to augment features, such as \n",
"detection of the document language, correcting the reading order, matching figures with \n",
"captions and labelling metadata such as title, authors and references. The final output \n",
"can then be serialized to JSON or transformed into a Markdown representation at the \n",
"users request.\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span> ***\n",
"```\n",
"source = <span style=\"color: #008000; text-decoration-color: #008000\">\"https://arxiv.org/pdf/2206.01062\"</span> # PDF path or URL converter = \n",
"<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">DocumentConverter</span><span style=\"font-weight: bold\">()</span> result = <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">converter.convert_single</span><span style=\"font-weight: bold\">(</span>source<span style=\"font-weight: bold\">)</span> \n",
"<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">print</span><span style=\"font-weight: bold\">(</span><span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">result.render_as_markdown</span><span style=\"font-weight: bold\">())</span> # output: <span style=\"color: #008000; text-decoration-color: #008000\">\"## DocLayNet: A Large Human -Annotated </span>\n",
"<span style=\"color: #008000; text-decoration-color: #008000\">Dataset for Document -Layout Analysis [...]\"</span>\n",
"```\n",
"Optionally, you can configure custom pipeline features and runtime options, such as \n",
"turning on or off features <span style=\"font-weight: bold\">(</span>e.g. OCR, table structure recognition<span style=\"font-weight: bold\">)</span>, enforcing limits on \n",
"the input document size, and defining the budget of CPU threads. Advanced usage examples\n",
"and options are documented in the README file. <span style=\"color: #808000; text-decoration-color: #808000; font-weight: bold\">Docling also provides a Dockerfile</span> to \n",
"demonstrate how to install and run it inside a container.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m3\u001b[0m ***\n",
"```\n",
"source = \u001b[32m\"https://arxiv.org/pdf/2206.01062\"\u001b[0m # PDF path or URL converter = \n",
"\u001b[1;35mDocumentConverter\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m result = \u001b[1;35mconverter.convert_single\u001b[0m\u001b[1m(\u001b[0msource\u001b[1m)\u001b[0m \n",
"\u001b[1;35mprint\u001b[0m\u001b[1m(\u001b[0m\u001b[1;35mresult.render_as_markdown\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m # output: \u001b[32m\"## DocLayNet: A Large Human -Annotated \u001b[0m\n",
"\u001b[32mDataset for Document -Layout Analysis \u001b[0m\u001b[32m[\u001b[0m\u001b[32m...\u001b[0m\u001b[32m]\u001b[0m\u001b[32m\"\u001b[0m\n",
"```\n",
"Optionally, you can configure custom pipeline features and runtime options, such as \n",
"turning on or off features \u001b[1m(\u001b[0me.g. OCR, table structure recognition\u001b[1m)\u001b[0m, enforcing limits on \n",
"the input document size, and defining the budget of CPU threads. Advanced usage examples\n",
"and options are documented in the README file. \u001b[1;33mDocling also provides a Dockerfile\u001b[0m to \n",
"demonstrate how to install and run it inside a container.\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"QUERY = \"Does Docling project provide a Dockerfile?\"\n",
"retriever = index.as_retriever(similarity_top_k=3)\n",
"nodes = retriever.retrieve(QUERY)\n",
"exp = \"Docling also provides a Dockerfile\"\n",
"start = \"[bold yellow]\"\n",
"end = \"[/]\"\n",
"for idx, item in enumerate(nodes):\n",
" console.print(\n",
" f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, the retriever with the hybrid search pipeline effectively recognizes the key paragraph in the first position:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span> ***\n",
"```\n",
"source = <span style=\"color: #008000; text-decoration-color: #008000\">\"https://arxiv.org/pdf/2206.01062\"</span> # PDF path or URL converter = \n",
"<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">DocumentConverter</span><span style=\"font-weight: bold\">()</span> result = <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">converter.convert_single</span><span style=\"font-weight: bold\">(</span>source<span style=\"font-weight: bold\">)</span> \n",
"<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">print</span><span style=\"font-weight: bold\">(</span><span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">result.render_as_markdown</span><span style=\"font-weight: bold\">())</span> # output: <span style=\"color: #008000; text-decoration-color: #008000\">\"## DocLayNet: A Large Human -Annotated </span>\n",
"<span style=\"color: #008000; text-decoration-color: #008000\">Dataset for Document -Layout Analysis [...]\"</span>\n",
"```\n",
"Optionally, you can configure custom pipeline features and runtime options, such as \n",
"turning on or off features <span style=\"font-weight: bold\">(</span>e.g. OCR, table structure recognition<span style=\"font-weight: bold\">)</span>, enforcing limits on \n",
"the input document size, and defining the budget of CPU threads. Advanced usage examples\n",
"and options are documented in the README file. <span style=\"color: #808000; text-decoration-color: #808000; font-weight: bold\">Docling also provides a Dockerfile</span> to \n",
"demonstrate how to install and run it inside a container.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m1\u001b[0m ***\n",
"```\n",
"source = \u001b[32m\"https://arxiv.org/pdf/2206.01062\"\u001b[0m # PDF path or URL converter = \n",
"\u001b[1;35mDocumentConverter\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m result = \u001b[1;35mconverter.convert_single\u001b[0m\u001b[1m(\u001b[0msource\u001b[1m)\u001b[0m \n",
"\u001b[1;35mprint\u001b[0m\u001b[1m(\u001b[0m\u001b[1;35mresult.render_as_markdown\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m # output: \u001b[32m\"## DocLayNet: A Large Human -Annotated \u001b[0m\n",
"\u001b[32mDataset for Document -Layout Analysis \u001b[0m\u001b[32m[\u001b[0m\u001b[32m...\u001b[0m\u001b[32m]\u001b[0m\u001b[32m\"\u001b[0m\n",
"```\n",
"Optionally, you can configure custom pipeline features and runtime options, such as \n",
"turning on or off features \u001b[1m(\u001b[0me.g. OCR, table structure recognition\u001b[1m)\u001b[0m, enforcing limits on \n",
"the input document size, and defining the budget of CPU threads. Advanced usage examples\n",
"and options are documented in the README file. \u001b[1;33mDocling also provides a Dockerfile\u001b[0m to \n",
"demonstrate how to install and run it inside a container.\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span> ***\n",
"Docling is designed to allow easy extension of the model library and pipelines. In the \n",
"future, we plan to extend Docling with several more models, such as a figure-classifier \n",
"model, an equationrecognition model, a code-recognition model and more. This will help \n",
"improve the quality of conversion for specific types of content, as well as augment \n",
"extracted document metadata with additional information. Further investment into testing\n",
"and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n",
"on our roadmap, too.\n",
"We encourage everyone to propose or implement additional features and models, and will \n",
"gladly take your inputs and contributions under review . The codebase of Docling is open\n",
"for use and contribution, under the MIT license agreement and in alignment with our \n",
"contributing guidelines included in the Docling repository. If you use Docling in your \n",
"projects, please consider citing this technical report.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m2\u001b[0m ***\n",
"Docling is designed to allow easy extension of the model library and pipelines. In the \n",
"future, we plan to extend Docling with several more models, such as a figure-classifier \n",
"model, an equationrecognition model, a code-recognition model and more. This will help \n",
"improve the quality of conversion for specific types of content, as well as augment \n",
"extracted document metadata with additional information. Further investment into testing\n",
"and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n",
"on our roadmap, too.\n",
"We encourage everyone to propose or implement additional features and models, and will \n",
"gladly take your inputs and contributions under review . The codebase of Docling is open\n",
"for use and contribution, under the MIT license agreement and in alignment with our \n",
"contributing guidelines included in the Docling repository. If you use Docling in your \n",
"projects, please consider citing this technical report.\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span> ***\n",
"We therefore decided to provide multiple backend choices, and additionally open-source a\n",
"custombuilt PDF parser, which is based on the low-level qpdf <span style=\"font-weight: bold\">[</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span><span style=\"font-weight: bold\">]</span> library. It is made \n",
"available in a separate package named docling-parse and powers the default PDF backend \n",
"in Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may \n",
"be a safe backup choice in certain cases, e.g. if issues are seen with particular font \n",
"encodings.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m3\u001b[0m ***\n",
"We therefore decided to provide multiple backend choices, and additionally open-source a\n",
"custombuilt PDF parser, which is based on the low-level qpdf \u001b[1m[\u001b[0m\u001b[1;36m4\u001b[0m\u001b[1m]\u001b[0m library. It is made \n",
"available in a separate package named docling-parse and powers the default PDF backend \n",
"in Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may \n",
"be a safe backup choice in certain cases, e.g. if issues are seen with particular font \n",
"encodings.\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"retriever_rrf = index_hybrid.as_retriever(\n",
" vector_store_query_mode=VectorStoreQueryMode.HYBRID, similarity_top_k=3\n",
")\n",
"nodes = retriever_rrf.retrieve(QUERY)\n",
"for idx, item in enumerate(nodes):\n",
" console.print(\n",
" f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Context expansion\n",
"\n",
"Using small chunks can offer several benefits: it increases retrieval precision and it keeps the answer generation tightly focused, which improves accuracy, reduces hallucination, and speeds up inferece.\n",
"However, your RAG system may overlook contextual information necessary for producing a fully grounded response.\n",
"\n",
"Docling's preservation of document structure enables you to employ various strategies for enriching the context available during answer generation within the RAG pipeline.\n",
"For example, after identifying the most relevant chunk, you might include adjacent chunks from the same section as additional groudning material before generating the final answer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the following example, the generated response is wrong, since the top retrieved chunks do not contain all the information that is required to answer the question."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">👤: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \n",
"have limited resources and complex tables?\n",
"🤖: According to the tests in this section using both the MacBook Pro M3 Max and \n",
"bare-metal server running Ubuntu <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">20.04</span> LTS on an Intel Xeon E5-<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2690</span> CPU with a fixed \n",
"thread budget of <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span>, Docling achieved faster processing speeds when using the \n",
"custom-built PDF backend based on the low-level qpdf library <span style=\"font-weight: bold\">(</span>docling-parse<span style=\"font-weight: bold\">)</span> compared to\n",
"the alternative PDF backend relying on pypdfium.\n",
"\n",
"Furthermore, the context mentions that Docling provides a separate package named \n",
"docling-ibm-models which includes pre-trained weights and inference code for \n",
"TableFormer, a state-of-the-art table structure recognition model. This suggests that if\n",
"you have complex tables in your documents, using this specialized table recognition \n",
"model could be beneficial.\n",
"\n",
"Therefore, based on the tests with arXiv papers and IBM Redbooks, if you have limited \n",
"resources <span style=\"font-weight: bold\">(</span>likely referring to computational power<span style=\"font-weight: bold\">)</span> and need to process documents \n",
"containing complex tables, it would be recommended to use the docling-parse PDF backend \n",
"along with the TableFormer AI model from docling-ibm-models. This combination should \n",
"provide a good balance of performance and table recognition capabilities for your \n",
"specific needs.\n",
"</pre>\n"
],
"text/plain": [
"👤: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \n",
"have limited resources and complex tables?\n",
"🤖: According to the tests in this section using both the MacBook Pro M3 Max and \n",
"bare-metal server running Ubuntu \u001b[1;36m20.04\u001b[0m LTS on an Intel Xeon E5-\u001b[1;36m2690\u001b[0m CPU with a fixed \n",
"thread budget of \u001b[1;36m4\u001b[0m, Docling achieved faster processing speeds when using the \n",
"custom-built PDF backend based on the low-level qpdf library \u001b[1m(\u001b[0mdocling-parse\u001b[1m)\u001b[0m compared to\n",
"the alternative PDF backend relying on pypdfium.\n",
"\n",
"Furthermore, the context mentions that Docling provides a separate package named \n",
"docling-ibm-models which includes pre-trained weights and inference code for \n",
"TableFormer, a state-of-the-art table structure recognition model. This suggests that if\n",
"you have complex tables in your documents, using this specialized table recognition \n",
"model could be beneficial.\n",
"\n",
"Therefore, based on the tests with arXiv papers and IBM Redbooks, if you have limited \n",
"resources \u001b[1m(\u001b[0mlikely referring to computational power\u001b[1m)\u001b[0m and need to process documents \n",
"containing complex tables, it would be recommended to use the docling-parse PDF backend \n",
"along with the TableFormer AI model from docling-ibm-models. This combination should \n",
"provide a good balance of performance and table recognition capabilities for your \n",
"specific needs.\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"QUERY = \"According to the tests with arXiv and IBM Redbooks, which backend should I use if I have limited resources and complex tables?\"\n",
"query_rrf = index_hybrid.as_query_engine(\n",
" vector_store_query_mode=VectorStoreQueryMode.HYBRID,\n",
" llm=GEN_MODEL,\n",
" similarity_top_k=3,\n",
")\n",
"res = query_rrf.query(QUERY)\n",
"console.print(f\"👤: {QUERY}\\n🤖: {res.response.strip()}\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span> ***\n",
"In this section, we establish some reference numbers for the processing speed of Docling\n",
"and the resource budget it requires. All tests in this section are run with default \n",
"options on our standard test set distributed with Docling, which consists of three \n",
"papers from arXiv and two IBM Redbooks, with a total of <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">225</span> pages. Measurements were \n",
"taken using both available PDF backends on two different hardware systems: one MacBook \n",
"Pro M3 Max, and one bare-metal server running Ubuntu <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">20.04</span> LTS on an Intel Xeon E5-<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2690</span> \n",
"CPU. For reproducibility, we fixed the thread budget <span style=\"font-weight: bold\">(</span>through setting OMP NUM THREADS \n",
"environment variable <span style=\"font-weight: bold\">)</span> once to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span> <span style=\"font-weight: bold\">(</span>Docling default<span style=\"font-weight: bold\">)</span> and once to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">16</span> <span style=\"font-weight: bold\">(</span>equal to full core \n",
"count on the test hardware<span style=\"font-weight: bold\">)</span>. All results are shown in Table <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m1\u001b[0m ***\n",
"In this section, we establish some reference numbers for the processing speed of Docling\n",
"and the resource budget it requires. All tests in this section are run with default \n",
"options on our standard test set distributed with Docling, which consists of three \n",
"papers from arXiv and two IBM Redbooks, with a total of \u001b[1;36m225\u001b[0m pages. Measurements were \n",
"taken using both available PDF backends on two different hardware systems: one MacBook \n",
"Pro M3 Max, and one bare-metal server running Ubuntu \u001b[1;36m20.04\u001b[0m LTS on an Intel Xeon E5-\u001b[1;36m2690\u001b[0m \n",
"CPU. For reproducibility, we fixed the thread budget \u001b[1m(\u001b[0mthrough setting OMP NUM THREADS \n",
"environment variable \u001b[1m)\u001b[0m once to \u001b[1;36m4\u001b[0m \u001b[1m(\u001b[0mDocling default\u001b[1m)\u001b[0m and once to \u001b[1;36m16\u001b[0m \u001b[1m(\u001b[0mequal to full core \n",
"count on the test hardware\u001b[1m)\u001b[0m. All results are shown in Table \u001b[1;36m1\u001b[0m.\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span> ***\n",
"We therefore decided to provide multiple backend choices, and additionally open-source a\n",
"custombuilt PDF parser, which is based on the low-level qpdf <span style=\"font-weight: bold\">[</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span><span style=\"font-weight: bold\">]</span> library. It is made \n",
"available in a separate package named docling-parse and powers the default PDF backend \n",
"in Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may \n",
"be a safe backup choice in certain cases, e.g. if issues are seen with particular font \n",
"encodings.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m2\u001b[0m ***\n",
"We therefore decided to provide multiple backend choices, and additionally open-source a\n",
"custombuilt PDF parser, which is based on the low-level qpdf \u001b[1m[\u001b[0m\u001b[1;36m4\u001b[0m\u001b[1m]\u001b[0m library. It is made \n",
"available in a separate package named docling-parse and powers the default PDF backend \n",
"in Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may \n",
"be a safe backup choice in certain cases, e.g. if issues are seen with particular font \n",
"encodings.\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span> ***\n",
"As part of Docling, we initially release two highly capable AI models to the open-source\n",
"community, which have been developed and published recently by our team. The first model\n",
"is a layout analysis model, an accurate object-detector for page elements <span style=\"font-weight: bold\">[</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">13</span><span style=\"font-weight: bold\">]</span>. The \n",
"second model is TableFormer <span style=\"font-weight: bold\">[</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">12</span>, <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">9</span><span style=\"font-weight: bold\">]</span>, a state-of-the-art table structure recognition \n",
"model. We provide the pre-trained weights <span style=\"font-weight: bold\">(</span>hosted on huggingface<span style=\"font-weight: bold\">)</span> and a separate package\n",
"for the inference code as docling-ibm-models . Both models are also powering the \n",
"open-access deepsearch-experience, our cloud-native service for knowledge exploration \n",
"tasks.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m3\u001b[0m ***\n",
"As part of Docling, we initially release two highly capable AI models to the open-source\n",
"community, which have been developed and published recently by our team. The first model\n",
"is a layout analysis model, an accurate object-detector for page elements \u001b[1m[\u001b[0m\u001b[1;36m13\u001b[0m\u001b[1m]\u001b[0m. The \n",
"second model is TableFormer \u001b[1m[\u001b[0m\u001b[1;36m12\u001b[0m, \u001b[1;36m9\u001b[0m\u001b[1m]\u001b[0m, a state-of-the-art table structure recognition \n",
"model. We provide the pre-trained weights \u001b[1m(\u001b[0mhosted on huggingface\u001b[1m)\u001b[0m and a separate package\n",
"for the inference code as docling-ibm-models . Both models are also powering the \n",
"open-access deepsearch-experience, our cloud-native service for knowledge exploration \n",
"tasks.\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"nodes = retriever_rrf.retrieve(QUERY)\n",
"for idx, item in enumerate(nodes):\n",
" console.print(\n",
" f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Even though the top retrieved chunks are relevant for the question, the key information lays in the paragraph after the first chunk:\n",
"\n",
"> If you need to run Docling in very low-resource environments, please consider configuring the pypdfium backend. While it is faster and more memory efficient than the default docling-parse backend, it will come at the expense of worse quality results, especially in table structure recovery."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We next examine the fragments that immediately precede and follow the topretrieved chunk, so long as those neighbors remain within the same section, to preserve the semantic integrity of the context.\n",
"The generated answer is now accurate because it has been grounded in the necessary contextual information.\n",
"\n",
"💡 In a production setting, it may be preferable to persist the parsed documents (i.e., `DoclingDocument` objects) as JSON in an object store or database and then fetch them when you need to traverse the document for contextexpansion scenarios. In this simplified example, however, we will query the OpenSearch index directly to obtain the required chunks."
]
},
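{
"cell_type": "markdown",
"metadata": {},
"source": [
"The expansion logic itself is independent of the storage backend. As a minimal, library-free sketch, assume the chunks of a document are already available in document order as a list of a hypothetical `Chunk` type that carries the chunk text and its heading path. Neither `Chunk` nor `expand_with_neighbors` is part of Docling, LlamaIndex, or OpenSearch; both names are illustrative only.\n",
"\n",
"```python\n",
"from dataclasses import dataclass\n",
"\n",
"\n",
"@dataclass\n",
"class Chunk:\n",
"    # Hypothetical container: chunk text plus its section heading path.\n",
"    text: str\n",
"    headings: tuple\n",
"\n",
"\n",
"def expand_with_neighbors(chunks, top_text, window=1):\n",
"    # Return the top chunk and up to `window` neighbors on each side,\n",
"    # in document order, keeping only neighbors from the same section.\n",
"    top_idx = next((i for i, c in enumerate(chunks) if c.text == top_text), None)\n",
"    if top_idx is None:\n",
"        return []\n",
"    section = chunks[top_idx].headings\n",
"    lo = max(0, top_idx - window)\n",
"    hi = min(len(chunks), top_idx + window + 1)\n",
"    return [c for c in chunks[lo:hi] if c.headings == section]\n",
"\n",
"\n",
"chunks = [\n",
"    Chunk('Intro text.', ('Introduction',)),\n",
"    Chunk('Test setup.', ('Performance',)),\n",
"    Chunk('Backend advice.', ('Performance',)),\n",
"    Chunk('Closing remarks.', ('Conclusion',)),\n",
"]\n",
"expanded = expand_with_neighbors(chunks, 'Backend advice.')\n",
"print([c.text for c in expanded])\n",
"```\n",
"\n",
"In this toy input, the chunk under the `Conclusion` heading is dropped even though it is adjacent to the top chunk, because it belongs to a different section. The cell below applies the same idea to the chunks stored in OpenSearch."
]
},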
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">👤: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \n",
"have limited resources and complex tables?\n",
"🤖: According to the tests described in the provided context, if you need to run Docling\n",
"in a very low-resource environment and are dealing with complex tables that require \n",
"high-quality table structure recovery, you should consider configuring the pypdfium \n",
"backend. The context mentions that while it is faster and more memory efficient than the\n",
"default docling-parse backend, it may come at the expense of worse quality results, \n",
"especially in table structure recovery. Therefore, for limited resources and complex \n",
"tables where quality is crucial, pypdfium would be a suitable choice despite its \n",
"potential drawbacks compared to the default backend.\n",
"</pre>\n"
],
"text/plain": [
"👤: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \n",
"have limited resources and complex tables?\n",
"🤖: According to the tests described in the provided context, if you need to run Docling\n",
"in a very low-resource environment and are dealing with complex tables that require \n",
"high-quality table structure recovery, you should consider configuring the pypdfium \n",
"backend. The context mentions that while it is faster and more memory efficient than the\n",
"default docling-parse backend, it may come at the expense of worse quality results, \n",
"especially in table structure recovery. Therefore, for limited resources and complex \n",
"tables where quality is crucial, pypdfium would be a suitable choice despite its \n",
"potential drawbacks compared to the default backend.\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"top_headings = nodes[0].metadata[\"headings\"]\n",
"top_text = nodes[0].text\n",
"\n",
"rdr = ElasticsearchReader(endpoint=OPENSEARCH_ENDPOINT, index=OPENSEARCH_INDEX)\n",
"docs = rdr.load_data(\n",
" field=text_field,\n",
" query={\n",
" \"query\": {\n",
" \"terms_set\": {\n",
" \"metadata.headings.keyword\": {\n",
" \"terms\": top_headings,\n",
" \"minimum_should_match_script\": {\"source\": \"params.num_terms\"},\n",
" }\n",
" }\n",
" }\n",
" },\n",
")\n",
"ext_nodes = []\n",
"for idx, item in enumerate(docs):\n",
" if item.text == top_text:\n",
" ext_nodes.append(NodeWithScore(node=Node(text=item.text), score=1.0))\n",
" if idx > 0:\n",
" ext_nodes.append(\n",
" NodeWithScore(node=Node(text=docs[idx - 1].text), score=1.0)\n",
" )\n",
" if idx < len(docs) - 1:\n",
" ext_nodes.append(\n",
" NodeWithScore(node=Node(text=docs[idx + 1].text), score=1.0)\n",
" )\n",
" break\n",
"\n",
"synthesizer = get_response_synthesizer(llm=GEN_MODEL)\n",
"res = synthesizer.synthesize(query=QUERY, nodes=ext_nodes)\n",
"console.print(f\"👤: {QUERY}\\n🤖: {res.response.strip()}\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}