mirror of https://github.com/FlagOpen/FlagEmbedding.git
synced 2025-07-14 12:35:51 +00:00
412 lines · 10 KiB
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Indexing Using Faiss"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In practical applications, datasets contain thousands or millions of rows. Looping through the whole corpus to find the best answer to a query is extremely time- and space-consuming. In this tutorial, we'll introduce how to use indexing to make our retrieval fast and neat."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 0: Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Install the dependencies in the environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -U FlagEmbedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### faiss-gpu on Linux (x86_64)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Faiss maintains its latest releases on conda. So if you have GPUs on Linux x86_64, create a conda virtual environment and run:\n",
    "\n",
    "```conda install -c pytorch -c nvidia faiss-gpu=1.8.0```\n",
    "\n",
    "and make sure you select that conda env as the kernel for this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### faiss-cpu\n",
    "\n",
    "Otherwise, simply run the following cell to install `faiss-cpu`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -U faiss-cpu"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is a tiny corpus with only 10 sentences, which will be the dataset we use.\n",
    "\n",
    "Each sentence is a concise description of a famous person in a specific domain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "corpus = [\n",
    "    \"Michael Jackson was a legendary pop icon known for his record-breaking music and dance innovations.\",\n",
    "    \"Fei-Fei Li is a professor in Stanford University, revolutionized computer vision with the ImageNet project.\",\n",
    "    \"Brad Pitt is a versatile actor and producer known for his roles in films like 'Fight Club' and 'Once Upon a Time in Hollywood.'\",\n",
    "    \"Geoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.\",\n",
    "    \"Eminem is a renowned rapper and one of the best-selling music artists of all time.\",\n",
    "    \"Taylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.\",\n",
    "    \"Sam Altman leads OpenAI as its CEO, with astonishing works of GPT series and pursuing safe and beneficial AI.\",\n",
    "    \"Morgan Freeman is an acclaimed actor famous for his distinctive voice and diverse roles.\",\n",
    "    \"Andrew Ng spread AI knowledge globally via public courses on Coursera and Stanford University.\",\n",
    "    \"Robert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And a few queries (add your own queries and check the result!):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "queries = [\n",
    "    \"Who is Robert Downey Jr.?\",\n",
    "    \"An expert of neural network\",\n",
    "    \"A famous female singer\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Text Embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we use a BGE embedding model to embed the whole corpus. Since our corpus contains only 10 documents, this is fast."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "shape of the corpus embeddings: (10, 768)\n",
      "data type of the embeddings: float32\n"
     ]
    }
   ],
   "source": [
    "from FlagEmbedding import FlagModel\n",
    "\n",
    "# load the BGE embedding model\n",
    "model = FlagModel('BAAI/bge-base-en-v1.5',\n",
    "                  query_instruction_for_retrieval=\"Represent this sentence for searching relevant passages:\",\n",
    "                  use_fp16=True)\n",
    "\n",
    "# get the embeddings of the corpus\n",
    "corpus_embeddings = model.encode(corpus)\n",
    "\n",
    "print(\"shape of the corpus embeddings:\", corpus_embeddings.shape)\n",
    "print(\"data type of the embeddings: \", corpus_embeddings.dtype)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Faiss only accepts float32 inputs.\n",
    "\n",
    "So make sure the dtype of `corpus_embeddings` is float32 before adding them to the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "corpus_embeddings = corpus_embeddings.astype(np.float32)"
   ]
  },
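  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: we will build the index with `METRIC_INNER_PRODUCT`, and the inner product only equals cosine similarity when the vectors are L2-normalized. `FlagModel` normalizes BGE embeddings by default, but if you work with raw embeddings from another source, you can normalize them in place first. This is a minimal sketch, assuming the embeddings are a contiguous float32 NumPy array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import faiss\n",
    "\n",
    "# L2-normalize the embeddings in place so that inner product == cosine similarity\n",
    "faiss.normalize_L2(corpus_embeddings)"
   ]
  },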
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this step, we build an index and add the embedding vectors to it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "import faiss\n",
    "\n",
    "# get the dimension of our embedding vectors; bge-base-en-v1.5 produces vectors of dimension 768\n",
    "dim = corpus_embeddings.shape[-1]\n",
    "\n",
    "# create the faiss index that will store the corpus embeddings in the vector space\n",
    "index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)\n",
    "\n",
    "# if you installed faiss-gpu, uncomment the following lines to move the index to your GPUs\n",
    "\n",
    "# co = faiss.GpuMultipleClonerOptions()\n",
    "# index = faiss.index_cpu_to_all_gpus(index, co)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is no need to train the index when we use the \"Flat\" type, which searches exhaustively. Other index types that use quantization need to be trained before vectors can be added."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True\n",
      "total number of vectors: 10\n"
     ]
    }
   ],
   "source": [
    "# check that the index is trained\n",
    "print(index.is_trained)\n",
    "# index.train(corpus_embeddings)\n",
    "\n",
    "# add all the vectors to the index\n",
    "index.add(corpus_embeddings)\n",
    "\n",
    "print(f\"total number of vectors: {index.ntotal}\")"
   ]
  },
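  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, here is a sketch of an index that does require training: an IVF index partitions the vector space into clusters (4 here, since the corpus is tiny) and must learn the cluster centroids from data before vectors can be added. This cell is purely illustrative and does not replace the flat index built above; with so few vectors, Faiss may also warn that there are too few training points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# an IVF index needs training to learn its cluster centroids\n",
    "ivf_index = faiss.index_factory(dim, 'IVF4,Flat', faiss.METRIC_INNER_PRODUCT)\n",
    "print(ivf_index.is_trained)  # False before training\n",
    "\n",
    "ivf_index.train(corpus_embeddings)\n",
    "ivf_index.add(corpus_embeddings)\n",
    "print(ivf_index.is_trained, ivf_index.ntotal)"
   ]
  },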
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3.5 (Optional): Saving Faiss index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you have your index with the embedding vectors, you can save it locally for future use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# change the path to where you want to save the index\n",
    "path = \"./index.bin\"\n",
    "faiss.write_index(index, path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you already have an index stored in your local directory, you can load it by:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "index = faiss.read_index(\"./index.bin\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Find answers to the query"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, get the embeddings of all the queries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_embeddings = model.encode_queries(queries)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, use the Faiss index to perform a k-nearest-neighbor search in the vector space:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.6686779 0.37858668 0.3767978 ]\n",
      " [0.6062041 0.59364545 0.527691 ]\n",
      " [0.5409331 0.5097007 0.42427146]]\n",
      "[[9 7 2]\n",
      " [3 1 8]\n",
      " [5 0 4]]\n"
     ]
    }
   ],
   "source": [
    "dists, ids = index.search(query_embeddings, k=3)\n",
    "print(dists)\n",
    "print(ids)"
   ]
  },
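  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`search` returns two arrays, each of shape `(num_queries, k)`: the similarity scores (inner products here, so higher is better) and the corpus ids of the top k matches for each query. As a quick sketch, we can print every top-3 match alongside its score:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# list every top-k hit with its similarity score\n",
    "for i, q in enumerate(queries):\n",
    "    print(f\"query: {q}\")\n",
    "    for score, idx in zip(dists[i], ids[i]):\n",
    "        print(f\"  {score:.4f}\\t{corpus[idx]}\")\n",
    "    print()"
   ]
  },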
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's see the result:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "query:\tWho is Robert Downey Jr.?\n",
      "answer:\tRobert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.\n",
      "\n",
      "query:\tAn expert of neural network\n",
      "answer:\tGeoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.\n",
      "\n",
      "query:\tA famous female singer\n",
      "answer:\tTaylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for i, q in enumerate(queries):\n",
    "    print(f\"query:\\t{q}\\nanswer:\\t{corpus[ids[i][0]]}\\n\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}