{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# RAG with Haystack" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Step | Tech | Execution | \n", "| --- | --- | --- |\n", "| Embedding | Hugging Face / Sentence Transformers | 💻 Local |\n", "| Vector store | Milvus | 💻 Local |\n", "| Gen AI | Hugging Face Inference API | 🌐 Remote | " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example leverages the\n", "[Haystack Docling extension](../../integrations/haystack/), along with\n", "Milvus-based document store and retriever instances, as well as sentence-transformers\n", "embeddings.\n", "\n", "The presented `DoclingConverter` component enables you to:\n", "- use various document types in your LLM applications with ease and speed, and\n", "- leverage Docling's rich format for advanced, document-native grounding.\n", "\n", "`DoclingConverter` supports two different export modes:\n", "- `ExportType.MARKDOWN`: if you want to capture each input document as a separate\n", " Haystack document, or\n", "- `ExportType.DOC_CHUNKS` (default): if you want to have each input document chunked and\n", " to then capture each individual chunk as a separate Haystack document downstream.\n", "\n", "The example allows to explore both modes via parameter `EXPORT_TYPE`; depending on the\n", "value set, the ingestion and RAG pipelines are then set up accordingly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- 👉 For best conversion speed, use GPU acceleration whenever available; e.g. 
if running on Colab, use a GPU-enabled runtime.\n", "- The notebook uses Hugging Face's Inference API; for an increased LLM quota, a token can be provided via the env var `HF_TOKEN`.\n", "- Requirements can be installed as shown below (`--no-warn-conflicts` is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q --progress-bar off --no-warn-conflicts docling-haystack haystack-ai docling pymilvus milvus-haystack sentence-transformers python-dotenv" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "from pathlib import Path\n", "from tempfile import mkdtemp\n", "\n", "from docling_haystack.converter import ExportType\n", "from dotenv import load_dotenv\n", "\n", "\n", "def _get_env_from_colab_or_os(key):\n", " try:\n", " from google.colab import userdata\n", "\n", " try:\n", " return userdata.get(key)\n", " except userdata.SecretNotFoundError:\n", " pass\n", " except ImportError:\n", " pass\n", " return os.getenv(key)\n", "\n", "\n", "load_dotenv()\n", "HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\n", "PATHS = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\n", "EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n", "GENERATION_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\n", "EXPORT_TYPE = ExportType.DOC_CHUNKS\n", "QUESTION = \"Which are the main AI models in Docling?\"\n", "TOP_K = 3\n", "MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing pipeline" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Token indices sequence length is longer than the specified maximum 
sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "80beca8762c34095a21467fb7f056059", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Batches: 0%| | 0/2 [00:00