2024-11-08 15:36:04 +08:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Indexing Using Faiss"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In practice, datasets contain thousands or millions of rows. Looping through the whole corpus to find the best match for a query costs a lot of time and memory. In this tutorial, we'll show how to use an index to make retrieval fast and tidy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 0: Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Install the dependencies in the environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -U FlagEmbedding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### faiss-gpu on Linux (x86_64)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Faiss maintains its latest releases on conda. If you have GPUs on Linux x86_64, create a conda virtual environment and run:\n",
"\n",
"```conda install -c pytorch -c nvidia faiss-gpu=1.8.0```\n",
"\n",
"and make sure you select that conda env as the kernel for this notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### faiss-cpu\n",
"\n",
"Otherwise, just run the following cell to install `faiss-cpu`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -U faiss-cpu"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a tiny corpus of only 10 sentences, which will be the dataset we use.\n",
"\n",
"Each sentence is a concise description of a famous person in a specific domain."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"corpus = [\n",
" \"Michael Jackson was a legendary pop icon known for his record-breaking music and dance innovations.\",\n",
" \"Fei-Fei Li is a professor in Stanford University, revolutionized computer vision with the ImageNet project.\",\n",
" \"Brad Pitt is a versatile actor and producer known for his roles in films like 'Fight Club' and 'Once Upon a Time in Hollywood.'\",\n",
" \"Geoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.\",\n",
" \"Eminem is a renowned rapper and one of the best-selling music artists of all time.\",\n",
" \"Taylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.\",\n",
" \"Sam Altman leads OpenAI as its CEO, with astonishing works of GPT series and pursuing safe and beneficial AI.\",\n",
" \"Morgan Freeman is an acclaimed actor famous for his distinctive voice and diverse roles.\",\n",
" \"Andrew Ng spread AI knowledge globally via public courses on Coursera and Stanford University.\",\n",
" \"Robert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.\",\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And a few queries (add your own queries and check the result!): "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"queries = [\n",
" \"Who is Robert Downey Jr.?\",\n",
" \"An expert of neural network\",\n",
" \"A famous female singer\",\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Text Embedding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The corpus is small, so we embed all 10 sentences with a BGE embedding model."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"shape of the corpus embeddings: (10, 768)\n",
"data type of the embeddings: float32\n"
]
}
],
"source": [
"from FlagEmbedding import FlagModel\n",
"\n",
"# get the BGE embedding model\n",
"model = FlagModel('BAAI/bge-base-en-v1.5',\n",
" query_instruction_for_retrieval=\"Represent this sentence for searching relevant passages:\",\n",
" use_fp16=True)\n",
"\n",
"# get the embedding of the corpus\n",
"corpus_embeddings = model.encode(corpus)\n",
"\n",
"print(\"shape of the corpus embeddings:\", corpus_embeddings.shape)\n",
"print(\"data type of the embeddings: \", corpus_embeddings.dtype)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Faiss only accepts float32 inputs.\n",
"\n",
"So make sure the dtype of corpus_embeddings is float32 before adding them to the index."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"corpus_embeddings = corpus_embeddings.astype(np.float32)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Indexing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this step, we build an index and add the embedding vectors to it."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"import faiss\n",
"\n",
"# get the length of our embedding vectors, vectors by bge-base-en-v1.5 have length 768\n",
"dim = corpus_embeddings.shape[-1]\n",
"\n",
"# create an empty flat (exhaustive search) Faiss index that uses inner product as the metric\n",
"index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)\n",
"\n",
"# if you installed faiss-gpu, uncomment the following lines to move the index onto your GPUs.\n",
"\n",
"# co = faiss.GpuMultipleClonerOptions()\n",
"# index = faiss.index_cpu_to_all_gpus(index, co)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A \"Flat\" index needs no training, because it stores the vectors as they are and compares the query against every one of them. Other index types that rely on quantization or clustering may need to be trained before vectors are added."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n",
"total number of vectors: 10\n"
]
}
],
"source": [
"# check if the index is trained\n",
"print(index.is_trained) \n",
"# index.train(corpus_embeddings)\n",
"\n",
"# add all the vectors to the index\n",
"index.add(corpus_embeddings)\n",
"\n",
"print(f\"total number of vectors: {index.ntotal}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3.5 (Optional): Saving Faiss index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the index holds the embedding vectors, you can save it locally for future use."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# change the path to where you want to save the index\n",
"path = \"./index.bin\"\n",
"faiss.write_index(index, path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you already have an index stored in a local directory, you can load it with:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"index = faiss.read_index(\"./index.bin\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Find answers to the query"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, get the embeddings of all the queries:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"query_embeddings = model.encode_queries(queries)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, use the Faiss index to run a k-nearest-neighbor (kNN) search in the vector space, retrieving the top 3 matches for each query:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0.6686779 0.37858668 0.3767978 ]\n",
" [0.6062041 0.59364545 0.527691 ]\n",
" [0.5409331 0.5097007 0.42427146]]\n",
"[[9 7 2]\n",
" [3 1 8]\n",
" [5 0 4]]\n"
]
}
],
"source": [
"dists, ids = index.search(query_embeddings, k=3)\n",
"print(dists)\n",
"print(ids)"
]
},
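{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each row of `ids` holds the corpus indices of the top 3 matches for one query, ordered by decreasing inner product score in `dists`. Now let's see the top match for each query:"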
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query:\tWho is Robert Downey Jr.?\n",
"answer:\tRobert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.\n",
"\n",
"query:\tAn expert of neural network\n",
"answer:\tGeoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.\n",
"\n",
"query:\tA famous female singer\n",
"answer:\tTaylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.\n",
"\n"
]
}
],
"source": [
"for i, q in enumerate(queries):\n",
" print(f\"query:\\t{q}\\nanswer:\\t{corpus[ids[i][0]]}\\n\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}