{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_milvus.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# RAG with Milvus\n",
"\n",
"| Step | Tech | Execution |\n",
"| --- | --- | --- |\n",
"| Embedding | OpenAI (text-embedding-3-small) | 🌐 Remote |\n",
"| Vector store | Milvus | 💻 Local |\n",
"| Gen AI | OpenAI (gpt-4o) | 🌐 Remote |\n",
"\n",
"\n",
"## A recipe 🧑‍🍳 🐥 💚\n",
"\n",
"This is a code recipe that uses [Milvus](https://milvus.io/), the world's most advanced open-source vector database, to perform RAG over documents parsed by [Docling](https://docling-project.github.io/docling/).\n",
"\n",
"In this notebook, we accomplish the following:\n",
"* Parse documents using Docling's document conversion capabilities\n",
"* Perform hierarchical chunking of the documents using Docling\n",
"* Generate text embeddings with OpenAI\n",
"* Perform RAG using Milvus, the world's most advanced open-source vector database\n",
"\n",
"Note: For best results, please use **GPU acceleration** to run this notebook. Here are two options for running this notebook:\n",
"1. **Locally on a MacBook with an Apple Silicon chip.** Converting all documents in the notebook takes ~2 minutes on a MacBook M2 due to Docling's usage of MPS accelerators.\n",
"2. **Run this notebook on Google Colab.** Converting all documents in the notebook takes ~8 minutes on a Google Colab T4 GPU.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparation\n",
"\n",
"### Dependencies and Environment\n",
"\n",
"To start, install the required dependencies by running the following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install --upgrade pymilvus docling openai torch"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> If you are using Google Colab, to enable dependencies just installed, you may need to **restart the runtime** (click on the \"Runtime\" menu at the top of the screen, and select \"Restart session\" from the dropdown menu)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### GPU Checking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Part of what makes Docling so remarkable is the fact that it can run on commodity hardware. This means that this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with a silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, seamlessly integrating with PyTorch and TensorFlow, offering energy-efficient performance on Apple Silicon, and broad compatibility with all Metal-supported GPUs.\n",
"\n",
"The code below checks to see if a GPU is available, either via CUDA or MPS."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MPS GPU is enabled.\n"
]
}
],
"source": [
"import torch\n",
"\n",
"# Check if GPU or MPS is available\n",
"if torch.cuda.is_available():\n",
" device = torch.device(\"cuda\")\n",
" print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\n",
"elif torch.backends.mps.is_available():\n",
" device = torch.device(\"mps\")\n",
" print(\"MPS GPU is enabled.\")\n",
"else:\n",
" raise OSError(\n",
" \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setting Up API Keys\n",
"\n",
"We will use OpenAI as the LLM in this example. You should prepare the [OPENAI_API_KEY](https://platform.openai.com/docs/quickstart) as an environment variable."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = \"sk-***********\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare the LLM and Embedding Model\n",
"\n",
"We initialize the OpenAI client to prepare the embedding model.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"openai_client = OpenAI()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a function to generate text embeddings using OpenAI client. We use the [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings) model as an example."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def emb_text(text):\n",
" return (\n",
" openai_client.embeddings.create(input=text, model=\"text-embedding-3-small\")\n",
" .data[0]\n",
" .embedding\n",
" )"
]
},
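{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the OpenAI embeddings endpoint also accepts a list of inputs, here is an optional batched variant (a sketch, not used in the rest of this notebook) that can reduce API round trips when embedding many chunks at once."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def emb_texts(texts):\n",
"    # Batch several inputs into one API call; embeddings come back in input order\n",
"    resp = openai_client.embeddings.create(\n",
"        input=texts, model=\"text-embedding-3-small\"\n",
"    )\n",
"    return [d.embedding for d in resp.data]"
]
},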
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate a test embedding and print its dimension and first few elements."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1536\n",
"[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]\n"
]
}
],
"source": [
"test_embedding = emb_text(\"This is a test\")\n",
"embedding_dim = len(test_embedding)\n",
"print(embedding_dim)\n",
"print(test_embedding[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Process Data Using Docling\n",
"\n",
"Docling can parse various document formats into a unified representation (Docling Document), which can then be exported to different output formats. For a full list of supported input and output formats, please refer to [the official documentation](https://docling-project.github.io/docling/usage/supported_formats/).\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, we will use a Markdown file ([source](https://milvus.io/docs/overview.md)) as the input. We will process the document using a **HierarchicalChunker** provided by Docling to generate structured, hierarchical chunks suitable for downstream RAG tasks."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from docling_core.transforms.chunker import HierarchicalChunker\n",
"\n",
"from docling.document_converter import DocumentConverter\n",
"\n",
"converter = DocumentConverter()\n",
"chunker = HierarchicalChunker()\n",
"\n",
"# Convert the input file to Docling Document\n",
"source = \"https://milvus.io/docs/overview.md\"\n",
"doc = converter.convert(source).document\n",
"\n",
"# Perform hierarchical chunking\n",
"texts = [chunk.text for chunk in chunker.chunk(doc)]"
]
},
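{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, we can preview the Markdown export of the converted Docling Document and the first couple of chunks before embedding them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preview the Docling Document exported as Markdown (first 300 characters)\n",
"print(doc.export_to_markdown()[:300])\n",
"\n",
"# Preview the first two chunks produced by the HierarchicalChunker\n",
"for i, chunk_text in enumerate(texts[:2]):\n",
"    print(f\"\\n--- Chunk {i} ({len(chunk_text)} chars) ---\\n{chunk_text[:200]}\")"
]
},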
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data into Milvus\n",
"\n",
"### Create the collection\n",
"\n",
"With data in hand, we can create a `MilvusClient` instance and insert the data into a Milvus collection. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from pymilvus import MilvusClient\n",
"\n",
"milvus_client = MilvusClient(uri=\"./milvus_demo.db\")\n",
"collection_name = \"my_rag_collection\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> As for the argument of `MilvusClient`:\n",
"> - Setting the `uri` as a local file, e.g.`./milvus.db`, is the most convenient method, as it automatically utilizes [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.\n",
"> - If you have large scale of data, you can set up a more performant Milvus server on [docker or kubernetes](https://milvus.io/docs/quickstart.md). In this setup, please use the server uri, e.g.`http://localhost:19530`, as your `uri`.\n",
"> - If you want to use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the [Public Endpoint and Api key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) in Zilliz Cloud."
]
},
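{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a sketch of the alternative connection settings described above (the endpoint and token values are placeholders, not real credentials)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Self-hosted Milvus server (placeholder URI):\n",
"# milvus_client = MilvusClient(uri=\"http://localhost:19530\")\n",
"\n",
"# Zilliz Cloud (placeholder endpoint and API key):\n",
"# milvus_client = MilvusClient(\n",
"#     uri=\"https://<your-cluster-endpoint>\",\n",
"#     token=\"<your-api-key>\",\n",
"# )"
]
},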
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check if the collection already exists and drop it if it does."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"if milvus_client.has_collection(collection_name):\n",
" milvus_client.drop_collection(collection_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a new collection with specified parameters.\n",
"\n",
"If we dont specify any field information, Milvus will automatically create a default `id` field for primary key, and a `vector` field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"milvus_client.create_collection(\n",
" collection_name=collection_name,\n",
" dimension=embedding_dim,\n",
" metric_type=\"IP\", # Inner product distance\n",
" consistency_level=\"Strong\", # Supported values are (`\"Strong\"`, `\"Session\"`, `\"Bounded\"`, `\"Eventually\"`). See https://milvus.io/docs/consistency.md#Consistency-Level for more details.\n",
")"
]
},
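{
"cell_type": "markdown",
"metadata": {},
"source": [
"To confirm the schema Milvus generated for us, we can optionally describe the collection and check the auto-created `id` and `vector` fields."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect the auto-generated schema of the new collection\n",
"milvus_client.describe_collection(collection_name)"
]
},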
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Insert data"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Processing chunks: 100%|██████████| 38/38 [00:14<00:00, 2.59it/s]\n"
]
},
{
"data": {
"text/plain": [
"{'insert_count': 38, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37], 'cost': 0}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tqdm import tqdm\n",
"\n",
"data = []\n",
"\n",
"for i, chunk in enumerate(tqdm(texts, desc=\"Processing chunks\")):\n",
" embedding = emb_text(chunk)\n",
" data.append({\"id\": i, \"vector\": embedding, \"text\": chunk})\n",
"\n",
"milvus_client.insert(collection_name=collection_name, data=data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build RAG"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve data for a query\n",
"\n",
"Lets specify a query question about the website we just scraped."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"question = (\n",
" \"What are the three deployment modes of Milvus, and what are their differences?\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Search for the question in the collection and retrieve the semantic top-3 matches."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"search_res = milvus_client.search(\n",
" collection_name=collection_name,\n",
" data=[emb_text(question)],\n",
" limit=3,\n",
" search_params={\"metric_type\": \"IP\", \"params\": {}},\n",
" output_fields=[\"text\"],\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets take a look at the search results of the query\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[\n",
" [\n",
" \"Milvus offers three deployment modes, covering a wide range of data scales\\u2014from local prototyping in Jupyter Notebooks to massive Kubernetes clusters managing tens of billions of vectors:\",\n",
" 0.6503315567970276\n",
" ],\n",
" [\n",
" \"Milvus Lite is a Python library that can be easily integrated into your applications. As a lightweight version of Milvus, it\\u2019s ideal for quick prototyping in Jupyter Notebooks or running on edge devices with limited resources. Learn more.\\nMilvus Standalone is a single-machine server deployment, with all components bundled into a single Docker image for convenient deployment. Learn more.\\nMilvus Distributed can be deployed on Kubernetes clusters, featuring a cloud-native architecture designed for billion-scale or even larger scenarios. This architecture ensures redundancy in critical components. Learn more.\",\n",
" 0.6281915903091431\n",
" ],\n",
" [\n",
" \"What is Milvus?\\nUnstructured Data, Embeddings, and Milvus\\nWhat Makes Milvus so Fast\\uff1f\\nWhat Makes Milvus so Scalable\\nTypes of Searches Supported by Milvus\\nComprehensive Feature Set\",\n",
" 0.6117826700210571\n",
" ]\n",
"]\n"
]
}
],
"source": [
"import json\n",
"\n",
"retrieved_lines_with_distances = [\n",
" (res[\"entity\"][\"text\"], res[\"distance\"]) for res in search_res[0]\n",
"]\n",
"print(json.dumps(retrieved_lines_with_distances, indent=4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use LLM to get a RAG response\n",
"\n",
"Convert the retrieved documents into a string format.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"context = \"\\n\".join(\n",
" [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define system and user prompts for the Lanage Model. This prompt is assembled with the retrieved documents from Milvus.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"SYSTEM_PROMPT = \"\"\"\n",
"Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.\n",
"\"\"\"\n",
"USER_PROMPT = f\"\"\"\n",
"Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.\n",
"<context>\n",
"{context}\n",
"</context>\n",
"<question>\n",
"{question}\n",
"</question>\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use OpenAI ChatGPT to generate a response based on the prompts."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The three deployment modes of Milvus are:\n",
"\n",
"1. **Milvus Lite**: This is a Python library that integrates easily into your applications. It's a lightweight version ideal for quick prototyping in Jupyter Notebooks or for running on edge devices with limited resources.\n",
"\n",
"2. **Milvus Standalone**: This mode is a single-machine server deployment where all components are bundled into a single Docker image, making it convenient to deploy.\n",
"\n",
"3. **Milvus Distributed**: This mode is designed for deployment on Kubernetes clusters. It features a cloud-native architecture suited for managing scenarios at a billion-scale or larger, ensuring redundancy in critical components.\n"
]
}
],
"source": [
"response = openai_client.chat.completions.create(\n",
" model=\"gpt-4o\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
" {\"role\": \"user\", \"content\": USER_PROMPT},\n",
" ],\n",
")\n",
"print(response.choices[0].message.content)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}