{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_milvus.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# RAG with Milvus\n",
"\n",
"| Step | Tech | Execution |\n",
"| --- | --- | --- |\n",
"| Embedding | OpenAI (text-embedding-3-small) | 🌐 Remote |\n",
"| Vector store | Milvus | 💻 Local |\n",
"| Gen AI | OpenAI (gpt-4o) | 🌐 Remote |\n",
"\n",
"\n",
"## A recipe 🧑🍳 🐥 💚\n",
|
||
"\n",
|
||
"This is a code recipe that uses [Milvus](https://milvus.io/), the world's most advanced open-source vector database, to perform RAG over documents parsed by [Docling](https://docling-project.github.io/docling/).\n",
|
||
"\n",
|
||
"In this notebook, we accomplish the following:\n",
|
||
"* Parse documents using Docling's document conversion capabilities\n",
|
||
"* Perform hierarchical chunking of the documents using Docling\n",
|
||
"* Generate text embeddings with OpenAI\n",
|
||
"* Perform RAG using Milvus, the world's most advanced open-source vector database\n",
|
||
"\n",
|
||
"Note: For best results, please use **GPU acceleration** to run this notebook. Here are two options for running this notebook:\n",
|
||
"1. **Locally on a MacBook with an Apple Silicon chip.** Converting all documents in the notebook takes ~2 minutes on a MacBook M2 due to Docling's usage of MPS accelerators.\n",
|
||
"2. **Run this notebook on Google Colab.** Converting all documents in the notebook takes ~8 minutes on a Google Colab T4 GPU.\n"
|
||
]
|
||
},
|
||
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparation\n",
"\n",
"### Dependencies and Environment\n",
"\n",
"To start, install the required dependencies by running the following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install --upgrade pymilvus docling openai torch"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> If you are using Google Colab, you may need to **restart the runtime** to enable the newly installed dependencies (click the \"Runtime\" menu at the top of the screen and select \"Restart session\" from the dropdown menu)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### GPU Checking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Part of what makes Docling so remarkable is that it can run on commodity hardware, which means this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with an Apple Silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, integrating with PyTorch and TensorFlow, offering energy-efficient performance on Apple Silicon, and broad compatibility with all Metal-supported GPUs.\n",
"\n",
"The code below checks whether a GPU is available, either via CUDA or MPS."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MPS GPU is enabled.\n"
]
}
],
"source": [
"import torch\n",
"\n",
"# Check if GPU or MPS is available\n",
"if torch.cuda.is_available():\n",
"    device = torch.device(\"cuda\")\n",
"    print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\n",
"elif torch.backends.mps.is_available():\n",
"    device = torch.device(\"mps\")\n",
"    print(\"MPS GPU is enabled.\")\n",
"else:\n",
"    raise OSError(\n",
"        \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n",
"    )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setting Up API Keys\n",
"\n",
"We will use OpenAI as the LLM in this example. Prepare your [OPENAI_API_KEY](https://platform.openai.com/docs/quickstart) and set it as an environment variable."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = \"sk-***********\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare the LLM and Embedding Model\n",
"\n",
"We initialize the OpenAI client to prepare the embedding model.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"openai_client = OpenAI()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a function to generate text embeddings using the OpenAI client. We use the [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings) model as an example."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def emb_text(text):\n",
"    return (\n",
"        openai_client.embeddings.create(input=text, model=\"text-embedding-3-small\")\n",
"        .data[0]\n",
"        .embedding\n",
"    )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate a test embedding and print its dimension and first few elements."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1536\n",
"[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]\n"
]
}
],
"source": [
"test_embedding = emb_text(\"This is a test\")\n",
"embedding_dim = len(test_embedding)\n",
"print(embedding_dim)\n",
"print(test_embedding[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Process Data Using Docling\n",
"\n",
"Docling can parse various document formats into a unified representation (Docling Document), which can then be exported to different output formats. For a full list of supported input and output formats, please refer to [the official documentation](https://docling-project.github.io/docling/usage/supported_formats/).\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, we will use a Markdown file ([source](https://milvus.io/docs/overview.md)) as the input. We will process the document using a **HierarchicalChunker** provided by Docling to generate structured, hierarchical chunks suitable for downstream RAG tasks."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from docling_core.transforms.chunker import HierarchicalChunker\n",
"\n",
"from docling.document_converter import DocumentConverter\n",
"\n",
"converter = DocumentConverter()\n",
"chunker = HierarchicalChunker()\n",
"\n",
"# Convert the input file to Docling Document\n",
"source = \"https://milvus.io/docs/overview.md\"\n",
"doc = converter.convert(source).document\n",
"\n",
"# Perform hierarchical chunking\n",
"texts = [chunk.text for chunk in chunker.chunk(doc)]"
]
},
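{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, we can peek at the first few chunks before embedding them. This optional cell only assumes the `texts` list defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect the first few hierarchical chunks\n",
"for i, chunk_text in enumerate(texts[:3]):\n",
"    print(f\"--- chunk {i} ({len(chunk_text)} chars) ---\")\n",
"    print(chunk_text[:200])"
]
},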
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data into Milvus\n",
"\n",
"### Create the collection\n",
"\n",
"With data in hand, we can create a `MilvusClient` instance and insert the data into a Milvus collection. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from pymilvus import MilvusClient\n",
"\n",
"milvus_client = MilvusClient(uri=\"./milvus_demo.db\")\n",
"collection_name = \"my_rag_collection\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> Regarding the argument of `MilvusClient` (see the sketch after this note):\n",
"> - Setting the `uri` as a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically utilizes [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.\n",
"> - If you have a large amount of data, you can set up a more performant Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md). In this setup, please use the server URI, e.g. `http://localhost:19530`, as your `uri`.\n",
"> - If you want to use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) in Zilliz Cloud."
]
},
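{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a minimal sketch of the two alternative connection setups; the endpoint and token values are placeholders to substitute with your own."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: alternative `MilvusClient` connections (placeholder values)\n",
"\n",
"# 1. Milvus server deployed via Docker or Kubernetes:\n",
"# milvus_client = MilvusClient(uri=\"http://localhost:19530\")\n",
"\n",
"# 2. Zilliz Cloud, using the Public Endpoint and API key from the console:\n",
"# milvus_client = MilvusClient(\n",
"#     uri=\"https://<your-public-endpoint>\",\n",
"#     token=\"<your-api-key>\",\n",
"# )"
]
},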
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check if the collection already exists and drop it if it does."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"if milvus_client.has_collection(collection_name):\n",
"    milvus_client.drop_collection(collection_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a new collection with the specified parameters.\n",
"\n",
"If we don’t specify any field information, Milvus will automatically create a default `id` field for the primary key and a `vector` field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"milvus_client.create_collection(\n",
"    collection_name=collection_name,\n",
"    dimension=embedding_dim,\n",
"    metric_type=\"IP\",  # Inner product distance\n",
"    consistency_level=\"Strong\",  # Supported values are (`\"Strong\"`, `\"Session\"`, `\"Bounded\"`, `\"Eventually\"`). See https://milvus.io/docs/consistency.md#Consistency-Level for more details.\n",
")"
]
},
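{
"cell_type": "markdown",
"metadata": {},
"source": [
"The quick-start call above lets Milvus infer the schema. If you prefer explicit control over the fields and the index, a sketch along the following lines should be roughly equivalent; the field names mirror the defaults used in this notebook, and the final call is commented out since the collection already exists."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pymilvus import DataType\n",
"\n",
"# Sketch: an explicit schema mirroring the quick-start defaults\n",
"schema = MilvusClient.create_schema(auto_id=False, enable_dynamic_field=False)\n",
"schema.add_field(field_name=\"id\", datatype=DataType.INT64, is_primary=True)\n",
"schema.add_field(field_name=\"vector\", datatype=DataType.FLOAT_VECTOR, dim=embedding_dim)\n",
"schema.add_field(field_name=\"text\", datatype=DataType.VARCHAR, max_length=65535)\n",
"\n",
"index_params = milvus_client.prepare_index_params()\n",
"index_params.add_index(field_name=\"vector\", index_type=\"AUTOINDEX\", metric_type=\"IP\")\n",
"\n",
"# milvus_client.create_collection(\n",
"#     collection_name=collection_name, schema=schema, index_params=index_params\n",
"# )"
]
},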
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Insert data"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Processing chunks: 100%|██████████| 38/38 [00:14<00:00, 2.59it/s]\n"
]
},
{
"data": {
"text/plain": [
"{'insert_count': 38, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37], 'cost': 0}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tqdm import tqdm\n",
"\n",
"data = []\n",
"\n",
"for i, chunk in enumerate(tqdm(texts, desc=\"Processing chunks\")):\n",
"    embedding = emb_text(chunk)\n",
"    data.append({\"id\": i, \"vector\": embedding, \"text\": chunk})\n",
"\n",
"milvus_client.insert(collection_name=collection_name, data=data)"
]
},
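{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above makes one embedding API call per chunk. Since the OpenAI embeddings endpoint also accepts a list of inputs, a batched variant like the sketch below can reduce the number of round trips; it reuses the `texts`, `openai_client`, and `milvus_client` objects defined earlier, with the final insert commented out to avoid duplicating the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: batch the embedding requests instead of one call per chunk\n",
"BATCH_SIZE = 32\n",
"\n",
"batched_data = []\n",
"for start in range(0, len(texts), BATCH_SIZE):\n",
"    batch = texts[start : start + BATCH_SIZE]\n",
"    resp = openai_client.embeddings.create(\n",
"        input=batch, model=\"text-embedding-3-small\"\n",
"    )\n",
"    for offset, item in enumerate(resp.data):\n",
"        batched_data.append(\n",
"            {\"id\": start + offset, \"vector\": item.embedding, \"text\": batch[offset]}\n",
"        )\n",
"\n",
"# milvus_client.insert(collection_name=collection_name, data=batched_data)"
]
},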
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build RAG"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve data for a query\n",
"\n",
"Let’s specify a query question about the document we just processed."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"question = (\n",
"    \"What are the three deployment modes of Milvus, and what are their differences?\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Search for the question in the collection and retrieve the top-3 semantic matches."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"search_res = milvus_client.search(\n",
"    collection_name=collection_name,\n",
"    data=[emb_text(question)],\n",
"    limit=3,\n",
"    search_params={\"metric_type\": \"IP\", \"params\": {}},\n",
"    output_fields=[\"text\"],\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let’s take a look at the search results for the query.\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[\n",
"    [\n",
"        \"Milvus offers three deployment modes, covering a wide range of data scales\\u2014from local prototyping in Jupyter Notebooks to massive Kubernetes clusters managing tens of billions of vectors:\",\n",
"        0.6503315567970276\n",
"    ],\n",
"    [\n",
"        \"Milvus Lite is a Python library that can be easily integrated into your applications. As a lightweight version of Milvus, it\\u2019s ideal for quick prototyping in Jupyter Notebooks or running on edge devices with limited resources. Learn more.\\nMilvus Standalone is a single-machine server deployment, with all components bundled into a single Docker image for convenient deployment. Learn more.\\nMilvus Distributed can be deployed on Kubernetes clusters, featuring a cloud-native architecture designed for billion-scale or even larger scenarios. This architecture ensures redundancy in critical components. Learn more.\",\n",
"        0.6281915903091431\n",
"    ],\n",
"    [\n",
"        \"What is Milvus?\\nUnstructured Data, Embeddings, and Milvus\\nWhat Makes Milvus so Fast\\uff1f\\nWhat Makes Milvus so Scalable\\nTypes of Searches Supported by Milvus\\nComprehensive Feature Set\",\n",
"        0.6117826700210571\n",
"    ]\n",
"]\n"
]
}
],
"source": [
"import json\n",
"\n",
"retrieved_lines_with_distances = [\n",
"    (res[\"entity\"][\"text\"], res[\"distance\"]) for res in search_res[0]\n",
"]\n",
"print(json.dumps(retrieved_lines_with_distances, indent=4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use LLM to get a RAG response\n",
"\n",
"Convert the retrieved documents into a string format.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"context = \"\\n\".join(\n",
"    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define the system and user prompts for the language model. The user prompt is assembled from the documents retrieved from Milvus.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"SYSTEM_PROMPT = \"\"\"\n",
"Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.\n",
"\"\"\"\n",
"USER_PROMPT = f\"\"\"\n",
"Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.\n",
"<context>\n",
"{context}\n",
"</context>\n",
"<question>\n",
"{question}\n",
"</question>\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the OpenAI `gpt-4o` model to generate a response based on the prompts."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The three deployment modes of Milvus are:\n",
"\n",
"1. **Milvus Lite**: This is a Python library that integrates easily into your applications. It's a lightweight version ideal for quick prototyping in Jupyter Notebooks or for running on edge devices with limited resources.\n",
"\n",
"2. **Milvus Standalone**: This mode is a single-machine server deployment where all components are bundled into a single Docker image, making it convenient to deploy.\n",
"\n",
"3. **Milvus Distributed**: This mode is designed for deployment on Kubernetes clusters. It features a cloud-native architecture suited for managing scenarios at a billion-scale or larger, ensuring redundancy in critical components.\n"
]
}
],
"source": [
"response = openai_client.chat.completions.create(\n",
"    model=\"gpt-4o\",\n",
"    messages=[\n",
"        {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
"        {\"role\": \"user\", \"content\": USER_PROMPT},\n",
"    ],\n",
")\n",
"print(response.choices[0].message.content)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}