{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Copyright (c) 2024 Microsoft Corporation.\n", "# Licensed under the MIT License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bring-Your-Own Vector Store\n", "\n", "This notebook demonstrates how to implement a custom vector store and register for usage with GraphRAG.\n", "\n", "## Overview\n", "\n", "GraphRAG uses a plug-and-play architecture that allow for easy integration of custom vector stores (outside of what is natively supported) by following a factory design pattern. This allows you to:\n", "\n", "- **Extend functionality**: Add support for new vector database backends\n", "- **Customize behavior**: Implement specialized search logic or data structures\n", "- **Integrate existing systems**: Connect GraphRAG to your existing vector database infrastructure\n", "\n", "### What You'll Learn\n", "\n", "1. Understanding the `BaseVectorStore` interface\n", "2. Implementing a custom vector store class\n", "3. Registering your vector store with the `VectorStoreFactory`\n", "4. Testing and validating your implementation\n", "5. Configuring GraphRAG to use your custom vector store\n", "\n", "Let's get started!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Import Required Dependencies\n", "\n", "First, let's import the necessary GraphRAG components and other dependencies we'll need.\n", "\n", "```bash\n", "pip install graphrag\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from typing import Any\n", "\n", "import numpy as np\n", "import yaml\n", "\n", "from graphrag.data_model.types import TextEmbedder\n", "\n", "# GraphRAG vector store components\n", "from graphrag.vector_stores.base import (\n", " BaseVectorStore,\n", " VectorStoreDocument,\n", " VectorStoreSearchResult,\n", ")\n", "from graphrag.vector_stores.factory import VectorStoreFactory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Understand the BaseVectorStore Interface\n", "\n", "Before using a custom vector store, let's examine the `BaseVectorStore` interface to understand what methods need to be implemented." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's inspect the BaseVectorStore class to understand the required methods\n", "import inspect\n", "\n", "print(\"BaseVectorStore Abstract Methods:\")\n", "print(\"=\" * 40)\n", "\n", "abstract_methods = []\n", "for name, method in inspect.getmembers(BaseVectorStore, predicate=inspect.isfunction):\n", " if getattr(method, \"__isabstractmethod__\", False):\n", " signature = inspect.signature(method)\n", " abstract_methods.append(f\"โ€ข {name}{signature}\")\n", " print(f\"โ€ข {name}{signature}\")\n", "\n", "print(f\"\\nTotal abstract methods to implement: {len(abstract_methods)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Implement a Custom Vector Store\n", "\n", "Now let's implement a simple in-memory vector store as an example. This vector store will:\n", "\n", "- Store documents and vectors in memory using Python data structures\n", "- Support all required BaseVectorStore methods\n", "\n", "**Note**: This is a simplified example for demonstration. Production vector stores would typically use optimized libraries like FAISS, more sophisticated indexing, and persistent storage." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class SimpleInMemoryVectorStore(BaseVectorStore):\n", " \"\"\"A simple in-memory vector store implementation for demonstration purposes.\n", "\n", " This vector store stores documents and their embeddings in memory and provides\n", " basic similarity search functionality using cosine similarity.\n", "\n", " WARNING: This is for demonstration only - not suitable for production use.\n", " For production, consider using optimized vector databases like LanceDB,\n", " Azure AI Search, or other specialized vector stores.\n", " \"\"\"\n", "\n", " # Internal storage for documents and vectors\n", " documents: dict[str, VectorStoreDocument]\n", " vectors: dict[str, np.ndarray]\n", " connected: bool\n", "\n", " def __init__(self, **kwargs: Any):\n", " \"\"\"Initialize the in-memory vector store.\"\"\"\n", " super().__init__(**kwargs)\n", "\n", " self.documents: dict[str, VectorStoreDocument] = {}\n", " self.vectors: dict[str, np.ndarray] = {}\n", " self.connected = False\n", "\n", " print(\n", " f\"๐Ÿš€ SimpleInMemoryVectorStore initialized for collection: {self.collection_name}\"\n", " )\n", "\n", " def connect(self, **kwargs: Any) -> None:\n", " \"\"\"Connect to the vector storage (no-op for in-memory store).\"\"\"\n", " self.connected = True\n", " print(f\"โœ… Connected to in-memory vector store: {self.collection_name}\")\n", "\n", " def load_documents(\n", " self, documents: list[VectorStoreDocument], overwrite: bool = True\n", " ) -> None:\n", " \"\"\"Load documents into the vector store.\"\"\"\n", " if not self.connected:\n", " msg = \"Vector store not connected. Call connect() first.\"\n", " raise RuntimeError(msg)\n", "\n", " if overwrite:\n", " self.documents.clear()\n", " self.vectors.clear()\n", "\n", " loaded_count = 0\n", " for doc in documents:\n", " if doc.vector is not None:\n", " doc_id = str(doc.id)\n", " self.documents[doc_id] = doc\n", " self.vectors[doc_id] = np.array(doc.vector, dtype=np.float32)\n", " loaded_count += 1\n", "\n", " print(f\"๐Ÿ“š Loaded {loaded_count} documents into vector store\")\n", "\n", " def _cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float:\n", " \"\"\"Calculate cosine similarity between two vectors.\"\"\"\n", " # Normalize vectors\n", " norm1 = np.linalg.norm(vec1)\n", " norm2 = np.linalg.norm(vec2)\n", "\n", " if norm1 == 0 or norm2 == 0:\n", " return 0.0\n", "\n", " return float(np.dot(vec1, vec2) / (norm1 * norm2))\n", "\n", " def similarity_search_by_vector(\n", " self, query_embedding: list[float], k: int = 10, **kwargs: Any\n", " ) -> list[VectorStoreSearchResult]:\n", " \"\"\"Perform similarity search using a query vector.\"\"\"\n", " if not self.connected:\n", " msg = \"Vector store not connected. 
Call connect() first.\"\n", " raise RuntimeError(msg)\n", "\n", " if not self.vectors:\n", " return []\n", "\n", " query_vec = np.array(query_embedding, dtype=np.float32)\n", " similarities = []\n", "\n", " # Calculate similarity with all stored vectors\n", " for doc_id, stored_vec in self.vectors.items():\n", " similarity = self._cosine_similarity(query_vec, stored_vec)\n", " similarities.append((doc_id, similarity))\n", "\n", " # Sort by similarity (descending) and take top k\n", " similarities.sort(key=lambda x: x[1], reverse=True)\n", " top_k = similarities[:k]\n", "\n", " # Create search results\n", " results = []\n", " for doc_id, score in top_k:\n", " document = self.documents[doc_id]\n", " result = VectorStoreSearchResult(document=document, score=score)\n", " results.append(result)\n", "\n", " return results\n", "\n", " def similarity_search_by_text(\n", " self, text: str, text_embedder: TextEmbedder, k: int = 10, **kwargs: Any\n", " ) -> list[VectorStoreSearchResult]:\n", " \"\"\"Perform similarity search using text (which gets embedded first).\"\"\"\n", " # Embed the text first\n", " query_embedding = text_embedder(text)\n", "\n", " # Use vector search with the embedding\n", " return self.similarity_search_by_vector(query_embedding, k, **kwargs)\n", "\n", " def filter_by_id(self, include_ids: list[str] | list[int]) -> Any:\n", " \"\"\"Build a query filter to filter documents by id.\n", "\n", " For this simple implementation, we return the list of IDs as the filter.\n", " \"\"\"\n", " return [str(id_) for id_ in include_ids]\n", "\n", " def search_by_id(self, id: str) -> VectorStoreDocument:\n", " \"\"\"Search for a document by id.\"\"\"\n", " doc_id = str(id)\n", " if doc_id not in self.documents:\n", " msg = f\"Document with id '{id}' not found\"\n", " raise KeyError(msg)\n", "\n", " return self.documents[doc_id]\n", "\n", " def get_stats(self) -> dict[str, Any]:\n", " \"\"\"Get statistics about the vector store (custom method).\"\"\"\n", " return {\n", " \"collection_name\": self.collection_name,\n", " \"document_count\": len(self.documents),\n", " \"vector_count\": len(self.vectors),\n", " \"connected\": self.connected,\n", " \"vector_dimension\": len(next(iter(self.vectors.values())))\n", " if self.vectors\n", " else 0,\n", " }\n", "\n", "\n", "print(\"โœ… SimpleInMemoryVectorStore class defined!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Register the Custom Vector Store\n", "\n", "Now let's register our custom vector store with the `VectorStoreFactory` so it can be used throughout GraphRAG." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Register our custom vector store with a unique identifier\n", "CUSTOM_VECTOR_STORE_TYPE = \"simple_memory\"\n", "\n", "# Register the vector store class\n", "VectorStoreFactory.register(CUSTOM_VECTOR_STORE_TYPE, SimpleInMemoryVectorStore)\n", "\n", "print(f\"โœ… Registered custom vector store with type: '{CUSTOM_VECTOR_STORE_TYPE}'\")\n", "\n", "# Verify registration\n", "available_types = VectorStoreFactory.get_vector_store_types()\n", "print(f\"\\n๐Ÿ“‹ Available vector store types: {available_types}\")\n", "print(\n", " f\"๐Ÿ” Is our custom type supported? {VectorStoreFactory.is_supported_type(CUSTOM_VECTOR_STORE_TYPE)}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Test the Custom Vector Store\n", "\n", "Let's create some sample data and test our custom vector store implementation." 
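] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cells below exercise `similarity_search_by_vector` with random query vectors. If you also want to try `similarity_search_by_text`, you need something matching GraphRAG's `TextEmbedder` type; because this in-memory store simply calls `text_embedder(text)`, any callable that maps text to a vector will do. A minimal, hash-seeded stand-in (an illustration only, not a real embedding model) could look like this:\n", "\n", "```python\n", "def mock_text_embedder(text: str) -> list[float]:\n", "    # Deterministic pseudo-embedding seeded from the text -- demo only.\n", "    rng = np.random.default_rng(abs(hash(text)) % (2**32))\n", "    return rng.normal(0, 1, 384).tolist()\n", "\n", "# After connecting and loading documents further below:\n", "# results = vector_store.similarity_search_by_text(\"knowledge graphs\", mock_text_embedder, k=3)\n", "```"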
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create sample documents with mock embeddings\n", "def create_mock_embedding(dimension: int = 384) -> list[float]:\n", " \"\"\"Create a random embedding vector for testing.\"\"\"\n", " return np.random.normal(0, 1, dimension).tolist()\n", "\n", "\n", "# Sample documents\n", "sample_documents = [\n", " VectorStoreDocument(\n", " id=\"doc_1\",\n", " text=\"GraphRAG is a powerful knowledge graph extraction and reasoning framework.\",\n", " vector=create_mock_embedding(),\n", " attributes={\"category\": \"technology\", \"source\": \"documentation\"},\n", " ),\n", " VectorStoreDocument(\n", " id=\"doc_2\",\n", " text=\"Vector stores enable efficient similarity search over high-dimensional data.\",\n", " vector=create_mock_embedding(),\n", " attributes={\"category\": \"technology\", \"source\": \"research\"},\n", " ),\n", " VectorStoreDocument(\n", " id=\"doc_3\",\n", " text=\"Machine learning models can process and understand natural language text.\",\n", " vector=create_mock_embedding(),\n", " attributes={\"category\": \"AI\", \"source\": \"article\"},\n", " ),\n", " VectorStoreDocument(\n", " id=\"doc_4\",\n", " text=\"Custom implementations allow for specialized behavior and integration.\",\n", " vector=create_mock_embedding(),\n", " attributes={\"category\": \"development\", \"source\": \"tutorial\"},\n", " ),\n", "]\n", "\n", "print(f\"๐Ÿ“ Created {len(sample_documents)} sample documents\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Test creating vector store using the factory\n", "vector_store_config = {\"collection_name\": \"test_collection\"}\n", "\n", "# Create vector store instance using factory\n", "vector_store = VectorStoreFactory.create_vector_store(\n", " CUSTOM_VECTOR_STORE_TYPE, vector_store_config\n", ")\n", "\n", "print(f\"โœ… Created vector store instance: {type(vector_store).__name__}\")\n", "print(f\"๐Ÿ“Š Initial stats: {vector_store.get_stats()}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Connect and load documents\n", "vector_store.connect()\n", "vector_store.load_documents(sample_documents)\n", "\n", "print(f\"๐Ÿ“Š Updated stats: {vector_store.get_stats()}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Test similarity search\n", "query_vector = create_mock_embedding() # Random query vector for testing\n", "\n", "search_results = vector_store.similarity_search_by_vector(\n", " query_vector,\n", " k=3, # Get top 3 similar documents\n", ")\n", "\n", "print(f\"๐Ÿ” Found {len(search_results)} similar documents:\\n\")\n", "\n", "for i, result in enumerate(search_results, 1):\n", " doc = result.document\n", " print(f\"{i}. 
ID: {doc.id}\")\n", " print(f\" Text: {doc.text[:60]}...\")\n", " print(f\" Similarity Score: {result.score:.4f}\")\n", " print(f\" Category: {doc.attributes.get('category', 'N/A')}\")\n", " print()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Test search by ID\n", "try:\n", " found_doc = vector_store.search_by_id(\"doc_2\")\n", " print(\"โœ… Found document by ID:\")\n", " print(f\" ID: {found_doc.id}\")\n", " print(f\" Text: {found_doc.text}\")\n", " print(f\" Attributes: {found_doc.attributes}\")\n", "except KeyError as e:\n", " print(f\"โŒ Error: {e}\")\n", "\n", "# Test filter by ID\n", "id_filter = vector_store.filter_by_id([\"doc_1\", \"doc_3\"])\n", "print(f\"\\n๐Ÿ”ง ID filter result: {id_filter}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 6: Configuration for GraphRAG\n", "\n", "Now let's see how you would configure GraphRAG to use your custom vector store in a settings file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Example GraphRAG yaml settings\n", "example_settings = {\n", " \"vector_store\": {\n", " \"default_vector_store\": {\n", " \"type\": CUSTOM_VECTOR_STORE_TYPE, # \"simple_memory\"\n", " \"collection_name\": \"graphrag_entities\",\n", " # Add any custom parameters your vector store needs\n", " \"custom_parameter\": \"custom_value\",\n", " }\n", " },\n", " # Other GraphRAG configuration...\n", " \"models\": {\n", " \"default_embedding_model\": {\n", " \"type\": \"openai_embedding\",\n", " \"model\": \"text-embedding-3-small\",\n", " }\n", " },\n", "}\n", "\n", "# Convert to YAML format for settings.yml\n", "yaml_config = yaml.dump(example_settings, default_flow_style=False, indent=2)\n", "\n", "print(\"๐Ÿ“„ Example settings.yml configuration:\")\n", "print(\"=\" * 40)\n", "print(yaml_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 7: Integration with GraphRAG Pipeline\n", "\n", "Here's how your custom vector store would be used in a typical GraphRAG pipeline." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Example of how GraphRAG would use your custom vector store\n", "def simulate_graphrag_pipeline():\n", " \"\"\"Simulate how GraphRAG would use the custom vector store.\"\"\"\n", " print(\"๐Ÿš€ Simulating GraphRAG pipeline with custom vector store...\\n\")\n", "\n", " # 1. GraphRAG creates vector store using factory\n", " config = {\"collection_name\": \"graphrag_entities\", \"similarity_threshold\": 0.3}\n", "\n", " store = VectorStoreFactory.create_vector_store(CUSTOM_VECTOR_STORE_TYPE, config)\n", " store.connect()\n", "\n", " print(\"โœ… Step 1: Vector store created and connected\")\n", "\n", " # 2. During indexing, GraphRAG loads extracted entities\n", " entity_documents = [\n", " VectorStoreDocument(\n", " id=f\"entity_{i}\",\n", " text=f\"Entity {i} description: Important concept in the knowledge graph\",\n", " vector=create_mock_embedding(),\n", " attributes={\"type\": \"entity\", \"importance\": i % 3 + 1},\n", " )\n", " for i in range(10)\n", " ]\n", "\n", " store.load_documents(entity_documents)\n", " print(f\"โœ… Step 2: Loaded {len(entity_documents)} entity documents\")\n", "\n", " # 3. 
During query time, GraphRAG searches for relevant entities\n", " query_embedding = create_mock_embedding()\n", " relevant_entities = store.similarity_search_by_vector(query_embedding, k=5)\n", "\n", " print(f\"โœ… Step 3: Found {len(relevant_entities)} relevant entities for query\")\n", "\n", " # 4. GraphRAG uses these entities for context building\n", " context_entities = [result.document for result in relevant_entities]\n", "\n", " print(\"โœ… Step 4: Context built using retrieved entities\")\n", " print(f\"๐Ÿ“Š Final stats: {store.get_stats()}\")\n", "\n", " return context_entities\n", "\n", "\n", "# Run the simulation\n", "context = simulate_graphrag_pipeline()\n", "print(f\"\\n๐ŸŽฏ Retrieved {len(context)} entities for context building\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 8: Testing and Validation\n", "\n", "Let's create a comprehensive test suite to ensure our vector store works correctly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def test_custom_vector_store():\n", " \"\"\"Comprehensive test suite for the custom vector store.\"\"\"\n", " print(\"๐Ÿงช Running comprehensive vector store tests...\\n\")\n", "\n", " # Test 1: Basic functionality\n", " print(\"Test 1: Basic functionality\")\n", " store = VectorStoreFactory.create_vector_store(\n", " CUSTOM_VECTOR_STORE_TYPE, {\"collection_name\": \"test\"}\n", " )\n", " store.connect()\n", "\n", " # Load test documents\n", " test_docs = sample_documents[:2]\n", " store.load_documents(test_docs)\n", "\n", " assert len(store.documents) == 2, \"Should have 2 documents\"\n", " assert len(store.vectors) == 2, \"Should have 2 vectors\"\n", " print(\"โœ… Basic functionality test passed\")\n", "\n", " # Test 2: Search functionality\n", " print(\"\\nTest 2: Search functionality\")\n", " query_vec = create_mock_embedding()\n", " results = store.similarity_search_by_vector(query_vec, k=5)\n", "\n", " assert len(results) <= 2, \"Should not return more results than documents\"\n", " assert all(isinstance(r, VectorStoreSearchResult) for r in results), (\n", " \"Should return VectorStoreSearchResult objects\"\n", " )\n", " assert all(-1 <= r.score <= 1 for r in results), (\n", " \"Similarity scores should be between -1 and 1\"\n", " )\n", " print(\"โœ… Search functionality test passed\")\n", "\n", " # Test 3: Search by ID\n", " print(\"\\nTest 3: Search by ID\")\n", " found_doc = store.search_by_id(\"doc_1\")\n", " assert found_doc.id == \"doc_1\", \"Should find correct document\"\n", "\n", " try:\n", " store.search_by_id(\"nonexistent\")\n", " assert False, \"Should raise KeyError for nonexistent ID\"\n", " except KeyError:\n", " pass # Expected\n", "\n", " print(\"โœ… Search by ID test passed\")\n", "\n", " # Test 4: Filter functionality\n", " print(\"\\nTest 4: Filter functionality\")\n", " filter_result = store.filter_by_id([\"doc_1\", \"doc_2\"])\n", " assert filter_result == [\"doc_1\", \"doc_2\"], \"Should return filtered IDs\"\n", " print(\"โœ… Filter functionality test passed\")\n", "\n", " # Test 5: Error handling\n", " print(\"\\nTest 5: Error handling\")\n", " disconnected_store = VectorStoreFactory.create_vector_store(\n", " CUSTOM_VECTOR_STORE_TYPE, {\"collection_name\": \"test2\"}\n", " )\n", "\n", " try:\n", " disconnected_store.load_documents(test_docs)\n", " assert False, \"Should raise error when not connected\"\n", " except RuntimeError:\n", " pass # Expected\n", "\n", " try:\n", " disconnected_store.similarity_search_by_vector(query_vec)\n", 
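"        # The call above should raise RuntimeError because the store is not connected; reaching the assert below means it did not\n",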
" assert False, \"Should raise error when not connected\"\n", " except RuntimeError:\n", " pass # Expected\n", "\n", " print(\"โœ… Error handling test passed\")\n", "\n", " print(\"\\n๐ŸŽ‰ All tests passed! Your custom vector store is working correctly.\")\n", "\n", "\n", "# Run the tests\n", "test_custom_vector_store()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary and Next Steps\n", "\n", "Congratulations! You've successfully learned how to implement and register a custom vector store with GraphRAG. Here's what you accomplished:\n", "\n", "### What You Built\n", "- โœ… **Custom Vector Store Class**: Implemented `SimpleInMemoryVectorStore` with all required methods\n", "- โœ… **Factory Integration**: Registered your vector store with `VectorStoreFactory`\n", "- โœ… **Comprehensive Testing**: Validated functionality with a full test suite\n", "- โœ… **Configuration Examples**: Learned how to configure GraphRAG to use your vector store\n", "\n", "### Key Takeaways\n", "1. **Interface Compliance**: Always implement all methods from `BaseVectorStore`\n", "2. **Factory Pattern**: Use `VectorStoreFactory.register()` to make your vector store available\n", "3. **Configuration**: Vector stores are configured in GraphRAG settings files\n", "4. **Testing**: Thoroughly test all functionality before deploying\n", "\n", "### Next Steps\n", "Check out the API Overview notebook to learn how to index and query data via the graphrag API.\n", "\n", "### Resources\n", "- [GraphRAG Documentation](https://microsoft.github.io/graphrag/)\n", "\n", "Happy building! ๐Ÿš€" ] } ], "metadata": { "kernelspec": { "display_name": "graphrag-venv (3.10.18)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.18" } }, "nbformat": 4, "nbformat_minor": 4 }