mirror of
https://github.com/microsoft/graphrag.git
synced 2025-11-15 09:33:52 +00:00
* Initial plan * Refactor VectorStoreFactory to use registration functionality like StorageFactory Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Fix linting issues in VectorStoreFactory refactoring Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Remove backward compatibility support from VectorStoreFactory and StorageFactory Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Run ruff check --fix and ruff format, add semversioner file Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * ruff formatting fixes * Fix pytest errors in storage factory tests by updating PipelineStorage interface implementation Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * ruff formatting fixes * update storage factory design * Refactor CacheFactory to use registration functionality like StorageFactory Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * revert copilot changes * fix copilot changes * update comments * Fix failing pytest compatibility for factory tests Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * update class instantiation issue * ruff fixes * fix pytest * add default value * ruff formatting changes * ruff fixes * revert minor changes * cleanup cache factory * Update CacheFactory tests to match consistent factory pattern Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * update pytest thresholds * adjust threshold levels * Add custom vector store implementation notebook Create comprehensive notebook demonstrating how to implement and register custom vector stores with GraphRAG as a plug-and-play framework. 
Includes: - Complete implementation of SimpleInMemoryVectorStore - Registration with VectorStoreFactory - Testing and validation examples - Configuration examples for GraphRAG settings - Advanced features and best practices - Production considerations checklist The notebook provides a complete walkthrough for developers to understand and implement their own vector store backends. Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * remove sample notebook for now * update tests * fix cache pytests * add pandas-stub to dev dependencies * disable warning check for well known key * skip tests when running on ubuntu * add documentation for custom vector store implementations * ignore ruff findings in notebooks * fix merge breakages * speedup CLI import statements * remove unnecessary import statements in init file * Add str type option on storage/cache type * Fix store name * Add LoggerFactory * Fix up logging setup across CLI/API * Add LoggerFactory test * Fix err message * Semver * Remove enums from factory methods --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com> Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
676 lines
25 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Copyright (c) 2024 Microsoft Corporation.\n",
|
|
"# Licensed under the MIT License."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Bring-Your-Own Vector Store\n",
|
|
"\n",
|
"This notebook demonstrates how to implement a custom vector store and register it for use with GraphRAG.\n",
|
|
"\n",
|
|
"## Overview\n",
|
|
"\n",
|
"GraphRAG uses a plug-and-play architecture that allows easy integration of custom vector stores (beyond those natively supported) through a factory design pattern. This lets you:\n",
|
|
"\n",
|
|
"- **Extend functionality**: Add support for new vector database backends\n",
|
|
"- **Customize behavior**: Implement specialized search logic or data structures\n",
|
|
"- **Integrate existing systems**: Connect GraphRAG to your existing vector database infrastructure\n",
|
|
"\n",
|
|
"### What You'll Learn\n",
|
|
"\n",
|
|
"1. Understanding the `BaseVectorStore` interface\n",
|
|
"2. Implementing a custom vector store class\n",
|
|
"3. Registering your vector store with the `VectorStoreFactory`\n",
|
|
"4. Testing and validating your implementation\n",
|
|
"5. Configuring GraphRAG to use your custom vector store\n",
|
|
"\n",
|
|
"Let's get started!"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Step 1: Import Required Dependencies\n",
|
|
"\n",
|
|
"First, let's import the necessary GraphRAG components and other dependencies we'll need.\n",
|
|
"\n",
|
|
"```bash\n",
|
|
"pip install graphrag\n",
|
|
"```"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from typing import Any\n",
|
|
"\n",
|
|
"import numpy as np\n",
|
|
"import yaml\n",
|
|
"\n",
|
|
"from graphrag.data_model.types import TextEmbedder\n",
|
|
"\n",
|
|
"# GraphRAG vector store components\n",
|
"from graphrag.vector_stores.base import (\n",
"    BaseVectorStore,\n",
"    VectorStoreDocument,\n",
"    VectorStoreSearchResult,\n",
")\n",
|
|
"from graphrag.vector_stores.factory import VectorStoreFactory"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Step 2: Understand the BaseVectorStore Interface\n",
|
|
"\n",
|
"Before implementing a custom vector store, let's examine the `BaseVectorStore` interface to see which methods must be implemented."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
"# Let's inspect the BaseVectorStore class to understand the required methods\n",
"import inspect\n",
"\n",
"print(\"BaseVectorStore Abstract Methods:\")\n",
"print(\"=\" * 40)\n",
"\n",
"abstract_methods = []\n",
"for name, method in inspect.getmembers(BaseVectorStore, predicate=inspect.isfunction):\n",
"    if getattr(method, \"__isabstractmethod__\", False):\n",
"        signature = inspect.signature(method)\n",
"        abstract_methods.append(f\"• {name}{signature}\")\n",
"        print(f\"• {name}{signature}\")\n",
"\n",
"print(f\"\\nTotal abstract methods to implement: {len(abstract_methods)}\")"
|
|
]
|
|
},
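The inspection cell above checks each function's `__isabstractmethod__` flag. As a minimal sketch of the same idea on a stand-in class (the `DemoStore` and `PartialStore` classes and the `missing_methods` helper are hypothetical and not part of GraphRAG), Python's `abc` machinery also records the names of still-unimplemented abstract methods in the class attribute `__abstractmethods__`. This gives a quick way to check whether a subclass is complete before instantiating it:

```python
from abc import ABC, abstractmethod


class DemoStore(ABC):
    """Hypothetical stand-in for an abstract vector store interface."""

    @abstractmethod
    def connect(self) -> None: ...

    @abstractmethod
    def search_by_id(self, id: str) -> str: ...


class PartialStore(DemoStore):
    """Implements only one of the two abstract methods."""

    def connect(self) -> None:
        pass


def missing_methods(cls: type) -> list[str]:
    """Return the sorted names of abstract methods a class still lacks."""
    return sorted(getattr(cls, "__abstractmethods__", frozenset()))


print(missing_methods(DemoStore))
print(missing_methods(PartialStore))
```

A class whose `missing_methods` list is non-empty cannot be instantiated; Python raises `TypeError` when you try.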
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Step 3: Implement a Custom Vector Store\n",
|
|
"\n",
|
|
"Now let's implement a simple in-memory vector store as an example. This vector store will:\n",
|
|
"\n",
|
|
"- Store documents and vectors in memory using Python data structures\n",
|
|
"- Support all required BaseVectorStore methods\n",
|
|
"\n",
|
|
"**Note**: This is a simplified example for demonstration. Production vector stores would typically use optimized libraries like FAISS, more sophisticated indexing, and persistent storage."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
"class SimpleInMemoryVectorStore(BaseVectorStore):\n",
"    \"\"\"A simple in-memory vector store implementation for demonstration purposes.\n",
"\n",
"    This vector store stores documents and their embeddings in memory and provides\n",
"    basic similarity search functionality using cosine similarity.\n",
"\n",
"    WARNING: This is for demonstration only - not suitable for production use.\n",
"    For production, consider using optimized vector databases like LanceDB,\n",
"    Azure AI Search, or other specialized vector stores.\n",
"    \"\"\"\n",
"\n",
"    # Internal storage for documents and vectors\n",
"    documents: dict[str, VectorStoreDocument]\n",
"    vectors: dict[str, np.ndarray]\n",
"    connected: bool\n",
"\n",
"    def __init__(self, **kwargs: Any):\n",
"        \"\"\"Initialize the in-memory vector store.\"\"\"\n",
"        super().__init__(**kwargs)\n",
"\n",
"        self.documents: dict[str, VectorStoreDocument] = {}\n",
"        self.vectors: dict[str, np.ndarray] = {}\n",
"        self.connected = False\n",
"\n",
"        print(\n",
"            f\"🚀 SimpleInMemoryVectorStore initialized for collection: {self.collection_name}\"\n",
"        )\n",
"\n",
"    def connect(self, **kwargs: Any) -> None:\n",
"        \"\"\"Connect to the vector storage (no-op for in-memory store).\"\"\"\n",
"        self.connected = True\n",
"        print(f\"✅ Connected to in-memory vector store: {self.collection_name}\")\n",
"\n",
"    def load_documents(\n",
"        self, documents: list[VectorStoreDocument], overwrite: bool = True\n",
"    ) -> None:\n",
"        \"\"\"Load documents into the vector store.\"\"\"\n",
"        if not self.connected:\n",
"            msg = \"Vector store not connected. Call connect() first.\"\n",
"            raise RuntimeError(msg)\n",
"\n",
"        if overwrite:\n",
"            self.documents.clear()\n",
"            self.vectors.clear()\n",
"\n",
"        # Documents without a vector are skipped\n",
"        loaded_count = 0\n",
"        for doc in documents:\n",
"            if doc.vector is not None:\n",
"                doc_id = str(doc.id)\n",
"                self.documents[doc_id] = doc\n",
"                self.vectors[doc_id] = np.array(doc.vector, dtype=np.float32)\n",
"                loaded_count += 1\n",
"\n",
"        print(f\"📚 Loaded {loaded_count} documents into vector store\")\n",
"\n",
"    def _cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float:\n",
"        \"\"\"Calculate cosine similarity between two vectors.\"\"\"\n",
"        # Compute the vector norms (magnitudes)\n",
"        norm1 = np.linalg.norm(vec1)\n",
"        norm2 = np.linalg.norm(vec2)\n",
"\n",
"        # A zero vector has no direction, so treat its similarity as 0\n",
"        if norm1 == 0 or norm2 == 0:\n",
"            return 0.0\n",
"\n",
"        return float(np.dot(vec1, vec2) / (norm1 * norm2))\n",
"\n",
"    def similarity_search_by_vector(\n",
"        self, query_embedding: list[float], k: int = 10, **kwargs: Any\n",
"    ) -> list[VectorStoreSearchResult]:\n",
"        \"\"\"Perform similarity search using a query vector.\"\"\"\n",
"        if not self.connected:\n",
"            msg = \"Vector store not connected. Call connect() first.\"\n",
"            raise RuntimeError(msg)\n",
"\n",
"        if not self.vectors:\n",
"            return []\n",
"\n",
"        query_vec = np.array(query_embedding, dtype=np.float32)\n",
"        similarities = []\n",
"\n",
"        # Calculate similarity with all stored vectors\n",
"        for doc_id, stored_vec in self.vectors.items():\n",
"            similarity = self._cosine_similarity(query_vec, stored_vec)\n",
"            similarities.append((doc_id, similarity))\n",
"\n",
"        # Sort by similarity (descending) and take top k\n",
"        similarities.sort(key=lambda x: x[1], reverse=True)\n",
"        top_k = similarities[:k]\n",
"\n",
"        # Create search results\n",
"        results = []\n",
"        for doc_id, score in top_k:\n",
"            document = self.documents[doc_id]\n",
"            result = VectorStoreSearchResult(document=document, score=score)\n",
"            results.append(result)\n",
"\n",
"        return results\n",
"\n",
"    def similarity_search_by_text(\n",
"        self, text: str, text_embedder: TextEmbedder, k: int = 10, **kwargs: Any\n",
"    ) -> list[VectorStoreSearchResult]:\n",
"        \"\"\"Perform similarity search using text (which gets embedded first).\"\"\"\n",
"        # Embed the text first\n",
"        query_embedding = text_embedder(text)\n",
"\n",
"        # Use vector search with the embedding\n",
"        return self.similarity_search_by_vector(query_embedding, k, **kwargs)\n",
"\n",
"    def filter_by_id(self, include_ids: list[str] | list[int]) -> Any:\n",
"        \"\"\"Build a query filter to filter documents by id.\n",
"\n",
"        For this simple implementation, we return the list of IDs as the filter.\n",
"        \"\"\"\n",
"        return [str(id_) for id_ in include_ids]\n",
"\n",
"    def search_by_id(self, id: str) -> VectorStoreDocument:\n",
"        \"\"\"Search for a document by id.\"\"\"\n",
"        doc_id = str(id)\n",
"        if doc_id not in self.documents:\n",
"            msg = f\"Document with id '{id}' not found\"\n",
"            raise KeyError(msg)\n",
"\n",
"        return self.documents[doc_id]\n",
"\n",
"    def get_stats(self) -> dict[str, Any]:\n",
"        \"\"\"Get statistics about the vector store (custom method).\"\"\"\n",
"        return {\n",
"            \"collection_name\": self.collection_name,\n",
"            \"document_count\": len(self.documents),\n",
"            \"vector_count\": len(self.vectors),\n",
"            \"connected\": self.connected,\n",
"            \"vector_dimension\": len(next(iter(self.vectors.values())))\n",
"            if self.vectors\n",
"            else 0,\n",
"        }\n",
"\n",
"\n",
"print(\"✅ SimpleInMemoryVectorStore class defined!\")"
|
|
]
|
|
},
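The search method above scores every stored vector, sorts the full list, and slices the first `k` entries. As a hedged alternative sketch (pure standard library, independent of GraphRAG; the `cosine_similarity` and `top_k` helpers and the toy vectors are illustrative assumptions), `heapq.nlargest` selects the top `k` scores without sorting the whole collection, which can help when `k` is much smaller than the number of documents:

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def top_k(
    query: list[float], vectors: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = (
        (doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()
    )
    # nlargest returns results in descending score order
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-dimensional vectors, chosen only for illustration
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "zero": [0.0, 0.0]}
print(top_k([1.0, 0.0], vectors, k=2))
```

For the handful of documents in this notebook the difference is negligible; the loop-and-sort version in the class is easier to read and behaves the same way.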
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Step 4: Register the Custom Vector Store\n",
|
|
"\n",
|
|
"Now let's register our custom vector store with the `VectorStoreFactory` so it can be used throughout GraphRAG."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Register our custom vector store with a unique identifier\n",
|
|
"CUSTOM_VECTOR_STORE_TYPE = \"simple_memory\"\n",
|
|
"\n",
|
|
"# Register the vector store class\n",
|
|
"VectorStoreFactory.register(CUSTOM_VECTOR_STORE_TYPE, SimpleInMemoryVectorStore)\n",
|
|
"\n",
|
|
"print(f\"✅ Registered custom vector store with type: '{CUSTOM_VECTOR_STORE_TYPE}'\")\n",
|
|
"\n",
|
|
"# Verify registration\n",
|
|
"available_types = VectorStoreFactory.get_vector_store_types()\n",
|
|
"print(f\"\\n📋 Available vector store types: {available_types}\")\n",
|
"print(\n",
"    f\"🔍 Is our custom type supported? {VectorStoreFactory.is_supported_type(CUSTOM_VECTOR_STORE_TYPE)}\"\n",
")"
|
|
]
|
|
},
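To make the registration step less opaque, here is a minimal sketch of how a registry-based factory can work in general. `SimpleFactory` and `DemoStore` are hypothetical names invented for this illustration; this is not GraphRAG's `VectorStoreFactory` implementation, only the general pattern of mapping a type string to a class and instantiating it from a config dictionary:

```python
from typing import Any, Callable


class SimpleFactory:
    """Hypothetical registry-based factory (not GraphRAG's implementation)."""

    _registry: dict[str, Callable[..., Any]] = {}

    @classmethod
    def register(cls, type_name: str, creator: Callable[..., Any]) -> None:
        """Associate a type name with a class or other callable."""
        cls._registry[type_name] = creator

    @classmethod
    def is_supported_type(cls, type_name: str) -> bool:
        """Report whether a type name has been registered."""
        return type_name in cls._registry

    @classmethod
    def create(cls, type_name: str, kwargs: dict[str, Any]) -> Any:
        """Instantiate the registered creator with the given keyword arguments."""
        if type_name not in cls._registry:
            msg = f"Unknown type: {type_name}"
            raise ValueError(msg)
        return cls._registry[type_name](**kwargs)


class DemoStore:
    """Hypothetical store that only records its collection name."""

    def __init__(self, collection_name: str, **kwargs: Any):
        self.collection_name = collection_name


SimpleFactory.register("demo", DemoStore)
store = SimpleFactory.create("demo", {"collection_name": "test_collection"})
print(type(store).__name__, store.collection_name)
```

Because callers only refer to the type string, a new backend can be swapped in through configuration without changing the code that requests a store.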
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Step 5: Test the Custom Vector Store\n",
|
|
"\n",
|
|
"Let's create some sample data and test our custom vector store implementation."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
"# Create sample documents with mock embeddings\n",
"def create_mock_embedding(dimension: int = 384) -> list[float]:\n",
"    \"\"\"Create a random embedding vector for testing.\"\"\"\n",
"    return np.random.normal(0, 1, dimension).tolist()\n",
"\n",
"\n",
"# Sample documents\n",
"sample_documents = [\n",
"    VectorStoreDocument(\n",
"        id=\"doc_1\",\n",
"        text=\"GraphRAG is a powerful knowledge graph extraction and reasoning framework.\",\n",
"        vector=create_mock_embedding(),\n",
"        attributes={\"category\": \"technology\", \"source\": \"documentation\"},\n",
"    ),\n",
"    VectorStoreDocument(\n",
"        id=\"doc_2\",\n",
"        text=\"Vector stores enable efficient similarity search over high-dimensional data.\",\n",
"        vector=create_mock_embedding(),\n",
"        attributes={\"category\": \"technology\", \"source\": \"research\"},\n",
"    ),\n",
"    VectorStoreDocument(\n",
"        id=\"doc_3\",\n",
"        text=\"Machine learning models can process and understand natural language text.\",\n",
"        vector=create_mock_embedding(),\n",
"        attributes={\"category\": \"AI\", \"source\": \"article\"},\n",
"    ),\n",
"    VectorStoreDocument(\n",
"        id=\"doc_4\",\n",
"        text=\"Custom implementations allow for specialized behavior and integration.\",\n",
"        vector=create_mock_embedding(),\n",
"        attributes={\"category\": \"development\", \"source\": \"tutorial\"},\n",
"    ),\n",
"]\n",
"\n",
"print(f\"📝 Created {len(sample_documents)} sample documents\")"
|
|
]
|
|
},
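The mock embeddings above are drawn from an unseeded random generator, so search rankings change on every run. If repeatable output is useful while debugging, one option is to seed the generator. The sketch below uses only the standard library; `create_seeded_embedding` is a hypothetical helper, not part of the notebook or of GraphRAG. With NumPy, `np.random.default_rng(seed)` serves the same purpose.

```python
import random


def create_seeded_embedding(seed: int, dimension: int = 384) -> list[float]:
    """Create a reproducible pseudo-random embedding from a fixed seed."""
    rng = random.Random(seed)  # Independent generator; does not touch global state
    return [rng.gauss(0.0, 1.0) for _ in range(dimension)]


first = create_seeded_embedding(seed=42, dimension=8)
second = create_seeded_embedding(seed=42, dimension=8)
other = create_seeded_embedding(seed=7, dimension=8)

# Same seed gives the same vector; a different seed gives a different one
print(first == second, first == other)
```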
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Test creating vector store using the factory\n",
|
|
"vector_store_config = {\"collection_name\": \"test_collection\"}\n",
|
|
"\n",
|
|
"# Create vector store instance using factory\n",
|
"vector_store = VectorStoreFactory.create_vector_store(\n",
"    CUSTOM_VECTOR_STORE_TYPE, vector_store_config\n",
")\n",
|
|
"\n",
|
|
"print(f\"✅ Created vector store instance: {type(vector_store).__name__}\")\n",
|
|
"print(f\"📊 Initial stats: {vector_store.get_stats()}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Connect and load documents\n",
|
|
"vector_store.connect()\n",
|
|
"vector_store.load_documents(sample_documents)\n",
|
|
"\n",
|
|
"print(f\"📊 Updated stats: {vector_store.get_stats()}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
"# Test similarity search\n",
"query_vector = create_mock_embedding()  # Random query vector for testing\n",
"\n",
"search_results = vector_store.similarity_search_by_vector(\n",
"    query_vector,\n",
"    k=3,  # Get top 3 similar documents\n",
")\n",
"\n",
"print(f\"🔍 Found {len(search_results)} similar documents:\\n\")\n",
"\n",
"for i, result in enumerate(search_results, 1):\n",
"    doc = result.document\n",
"    print(f\"{i}. ID: {doc.id}\")\n",
"    print(f\"   Text: {doc.text[:60]}...\")\n",
"    print(f\"   Similarity Score: {result.score:.4f}\")\n",
"    print(f\"   Category: {doc.attributes.get('category', 'N/A')}\")\n",
"    print()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
"# Test search by ID\n",
"try:\n",
"    found_doc = vector_store.search_by_id(\"doc_2\")\n",
"    print(\"✅ Found document by ID:\")\n",
"    print(f\"   ID: {found_doc.id}\")\n",
"    print(f\"   Text: {found_doc.text}\")\n",
"    print(f\"   Attributes: {found_doc.attributes}\")\n",
"except KeyError as e:\n",
"    print(f\"❌ Error: {e}\")\n",
"\n",
"# Test filter by ID\n",
"id_filter = vector_store.filter_by_id([\"doc_1\", \"doc_3\"])\n",
"print(f\"\\n🔧 ID filter result: {id_filter}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Step 6: Configuration for GraphRAG\n",
|
|
"\n",
|
|
"Now let's see how you would configure GraphRAG to use your custom vector store in a settings file."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
"# Example GraphRAG yaml settings\n",
"example_settings = {\n",
"    \"vector_store\": {\n",
"        \"default_vector_store\": {\n",
"            \"type\": CUSTOM_VECTOR_STORE_TYPE,  # \"simple_memory\"\n",
"            \"collection_name\": \"graphrag_entities\",\n",
"            # Add any custom parameters your vector store needs\n",
"            \"custom_parameter\": \"custom_value\",\n",
"        }\n",
"    },\n",
"    # Other GraphRAG configuration...\n",
"    \"models\": {\n",
"        \"default_embedding_model\": {\n",
"            \"type\": \"openai_embedding\",\n",
"            \"model\": \"text-embedding-3-small\",\n",
"        }\n",
"    },\n",
"}\n",
"\n",
"# Convert to YAML format for settings.yml\n",
"yaml_config = yaml.dump(example_settings, default_flow_style=False, indent=2)\n",
"\n",
"print(\"📄 Example settings.yml configuration:\")\n",
"print(\"=\" * 40)\n",
"print(yaml_config)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Step 7: Integration with GraphRAG Pipeline\n",
|
|
"\n",
|
|
"Here's how your custom vector store would be used in a typical GraphRAG pipeline."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
"# Example of how GraphRAG would use your custom vector store\n",
"def simulate_graphrag_pipeline():\n",
"    \"\"\"Simulate how GraphRAG would use the custom vector store.\"\"\"\n",
"    print(\"🚀 Simulating GraphRAG pipeline with custom vector store...\\n\")\n",
"\n",
"    # 1. GraphRAG creates vector store using factory\n",
"    config = {\"collection_name\": \"graphrag_entities\", \"similarity_threshold\": 0.3}\n",
"\n",
"    store = VectorStoreFactory.create_vector_store(CUSTOM_VECTOR_STORE_TYPE, config)\n",
"    store.connect()\n",
"\n",
"    print(\"✅ Step 1: Vector store created and connected\")\n",
"\n",
"    # 2. During indexing, GraphRAG loads extracted entities\n",
"    entity_documents = [\n",
"        VectorStoreDocument(\n",
"            id=f\"entity_{i}\",\n",
"            text=f\"Entity {i} description: Important concept in the knowledge graph\",\n",
"            vector=create_mock_embedding(),\n",
"            attributes={\"type\": \"entity\", \"importance\": i % 3 + 1},\n",
"        )\n",
"        for i in range(10)\n",
"    ]\n",
"\n",
"    store.load_documents(entity_documents)\n",
"    print(f\"✅ Step 2: Loaded {len(entity_documents)} entity documents\")\n",
"\n",
"    # 3. During query time, GraphRAG searches for relevant entities\n",
"    query_embedding = create_mock_embedding()\n",
"    relevant_entities = store.similarity_search_by_vector(query_embedding, k=5)\n",
"\n",
"    print(f\"✅ Step 3: Found {len(relevant_entities)} relevant entities for query\")\n",
"\n",
"    # 4. GraphRAG uses these entities for context building\n",
"    context_entities = [result.document for result in relevant_entities]\n",
"\n",
"    print(\"✅ Step 4: Context built using retrieved entities\")\n",
"    print(f\"📊 Final stats: {store.get_stats()}\")\n",
"\n",
"    return context_entities\n",
"\n",
"\n",
"# Run the simulation\n",
"context = simulate_graphrag_pipeline()\n",
"print(f\"\\n🎯 Retrieved {len(context)} entities for context building\")"
|
|
]
|
|
},
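The simulated config above includes a `similarity_threshold` entry, but `SimpleInMemoryVectorStore` as defined in this notebook never reads it, so every top-`k` hit is returned regardless of score. If you want such a parameter to have an effect, a custom store could filter scored results before returning them. The sketch below shows only that filtering step on plain `(doc_id, score)` pairs; `apply_threshold` and the sample scores are hypothetical and not part of GraphRAG:

```python
def apply_threshold(
    scored: list[tuple[str, float]], threshold: float
) -> list[tuple[str, float]]:
    """Keep only (doc_id, score) pairs whose score meets the threshold."""
    return [(doc_id, score) for doc_id, score in scored if score >= threshold]


# Illustrative scores, already sorted in descending order
scored = [("entity_0", 0.82), ("entity_1", 0.31), ("entity_2", 0.12), ("entity_3", -0.4)]
print(apply_threshold(scored, threshold=0.3))
```

Inside a store, a filter like this would typically run after scoring and before building the `VectorStoreSearchResult` list, so that `k` limits the number of results that pass the threshold.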
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Step 8: Testing and Validation\n",
|
|
"\n",
|
|
"Let's create a comprehensive test suite to ensure our vector store works correctly."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
"def test_custom_vector_store():\n",
"    \"\"\"Comprehensive test suite for the custom vector store.\"\"\"\n",
"    print(\"🧪 Running comprehensive vector store tests...\\n\")\n",
"\n",
"    # Test 1: Basic functionality\n",
"    print(\"Test 1: Basic functionality\")\n",
"    store = VectorStoreFactory.create_vector_store(\n",
"        CUSTOM_VECTOR_STORE_TYPE, {\"collection_name\": \"test\"}\n",
"    )\n",
"    store.connect()\n",
"\n",
"    # Load test documents\n",
"    test_docs = sample_documents[:2]\n",
"    store.load_documents(test_docs)\n",
"\n",
"    assert len(store.documents) == 2, \"Should have 2 documents\"\n",
"    assert len(store.vectors) == 2, \"Should have 2 vectors\"\n",
"    print(\"✅ Basic functionality test passed\")\n",
"\n",
"    # Test 2: Search functionality\n",
"    print(\"\\nTest 2: Search functionality\")\n",
"    query_vec = create_mock_embedding()\n",
"    results = store.similarity_search_by_vector(query_vec, k=5)\n",
"\n",
"    assert len(results) <= 2, \"Should not return more results than documents\"\n",
"    assert all(isinstance(r, VectorStoreSearchResult) for r in results), (\n",
"        \"Should return VectorStoreSearchResult objects\"\n",
"    )\n",
"    assert all(-1 <= r.score <= 1 for r in results), (\n",
"        \"Similarity scores should be between -1 and 1\"\n",
"    )\n",
"    print(\"✅ Search functionality test passed\")\n",
"\n",
"    # Test 3: Search by ID\n",
"    print(\"\\nTest 3: Search by ID\")\n",
"    found_doc = store.search_by_id(\"doc_1\")\n",
"    assert found_doc.id == \"doc_1\", \"Should find correct document\"\n",
"\n",
"    try:\n",
"        store.search_by_id(\"nonexistent\")\n",
"        assert False, \"Should raise KeyError for nonexistent ID\"\n",
"    except KeyError:\n",
"        pass  # Expected\n",
"\n",
"    print(\"✅ Search by ID test passed\")\n",
"\n",
"    # Test 4: Filter functionality\n",
"    print(\"\\nTest 4: Filter functionality\")\n",
"    filter_result = store.filter_by_id([\"doc_1\", \"doc_2\"])\n",
"    assert filter_result == [\"doc_1\", \"doc_2\"], \"Should return filtered IDs\"\n",
"    print(\"✅ Filter functionality test passed\")\n",
"\n",
"    # Test 5: Error handling\n",
"    print(\"\\nTest 5: Error handling\")\n",
"    disconnected_store = VectorStoreFactory.create_vector_store(\n",
"        CUSTOM_VECTOR_STORE_TYPE, {\"collection_name\": \"test2\"}\n",
"    )\n",
"\n",
"    try:\n",
"        disconnected_store.load_documents(test_docs)\n",
"        assert False, \"Should raise error when not connected\"\n",
"    except RuntimeError:\n",
"        pass  # Expected\n",
"\n",
"    try:\n",
"        disconnected_store.similarity_search_by_vector(query_vec)\n",
"        assert False, \"Should raise error when not connected\"\n",
"    except RuntimeError:\n",
"        pass  # Expected\n",
"\n",
"    print(\"✅ Error handling test passed\")\n",
"\n",
"    print(\"\\n🎉 All tests passed! Your custom vector store is working correctly.\")\n",
"\n",
"\n",
"# Run the tests\n",
"test_custom_vector_store()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Summary and Next Steps\n",
|
|
"\n",
|
|
"Congratulations! You've successfully learned how to implement and register a custom vector store with GraphRAG. Here's what you accomplished:\n",
|
|
"\n",
|
|
"### What You Built\n",
|
|
"- ✅ **Custom Vector Store Class**: Implemented `SimpleInMemoryVectorStore` with all required methods\n",
|
|
"- ✅ **Factory Integration**: Registered your vector store with `VectorStoreFactory`\n",
|
|
"- ✅ **Comprehensive Testing**: Validated functionality with a full test suite\n",
|
|
"- ✅ **Configuration Examples**: Learned how to configure GraphRAG to use your vector store\n",
|
|
"\n",
|
|
"### Key Takeaways\n",
|
|
"1. **Interface Compliance**: Always implement all methods from `BaseVectorStore`\n",
|
|
"2. **Factory Pattern**: Use `VectorStoreFactory.register()` to make your vector store available\n",
|
|
"3. **Configuration**: Vector stores are configured in GraphRAG settings files\n",
|
|
"4. **Testing**: Thoroughly test all functionality before deploying\n",
|
|
"\n",
|
|
"### Next Steps\n",
|
"Check out the API Overview notebook to learn how to index and query data via the GraphRAG API.\n",
|
|
"\n",
|
|
"### Resources\n",
|
|
"- [GraphRAG Documentation](https://microsoft.github.io/graphrag/)\n",
|
|
"\n",
|
|
"Happy building! 🚀"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "graphrag-venv (3.10.18)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.18"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
}
|