update tutorial

2026-01-08 13:11:35 +00:00 · 2024-09-14 18:07:40 +08:00
commit 9b6e521bcb
parent 2b9720f8a4
3 changed files with 901 additions and 21 deletions
--- a/Tutorials/1_Embedding/1.2.3_BGE-M3.ipynb
+++ b/Tutorials/1_Embedding/1.2.3_BGE-M3.ipynb
@ -0,0 +1,414 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# BGE-M3"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 0. Installation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Install the required packages in your environment."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "%pip install -U transformers FlagEmbedding accelerate"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. BGE-M3 structure"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import AutoTokenizer, AutoModel\n",
+    "import torch, os\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"BAAI/bge-m3\")\n",
+    "raw_model = AutoModel.from_pretrained(\"BAAI/bge-m3\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The base model of BGE-M3 is [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large), which is a multilingual version of RoBERTa."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "XLMRobertaModel(\n",
+       "  (embeddings): XLMRobertaEmbeddings(\n",
+       "    (word_embeddings): Embedding(250002, 1024, padding_idx=1)\n",
+       "    (position_embeddings): Embedding(8194, 1024, padding_idx=1)\n",
+       "    (token_type_embeddings): Embedding(1, 1024)\n",
+       "    (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)\n",
+       "    (dropout): Dropout(p=0.1, inplace=False)\n",
+       "  )\n",
+       "  (encoder): XLMRobertaEncoder(\n",
+       "    (layer): ModuleList(\n",
+       "      (0-23): 24 x XLMRobertaLayer(\n",
+       "        (attention): XLMRobertaAttention(\n",
+       "          (self): XLMRobertaSelfAttention(\n",
+       "            (query): Linear(in_features=1024, out_features=1024, bias=True)\n",
+       "            (key): Linear(in_features=1024, out_features=1024, bias=True)\n",
+       "            (value): Linear(in_features=1024, out_features=1024, bias=True)\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "          )\n",
+       "          (output): XLMRobertaSelfOutput(\n",
+       "            (dense): Linear(in_features=1024, out_features=1024, bias=True)\n",
+       "            (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "          )\n",
+       "        )\n",
+       "        (intermediate): XLMRobertaIntermediate(\n",
+       "          (dense): Linear(in_features=1024, out_features=4096, bias=True)\n",
+       "          (intermediate_act_fn): GELUActivation()\n",
+       "        )\n",
+       "        (output): XLMRobertaOutput(\n",
+       "          (dense): Linear(in_features=4096, out_features=1024, bias=True)\n",
+       "          (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)\n",
+       "          (dropout): Dropout(p=0.1, inplace=False)\n",
+       "        )\n",
+       "      )\n",
+       "    )\n",
+       "  )\n",
+       "  (pooler): XLMRobertaPooler(\n",
+       "    (dense): Linear(in_features=1024, out_features=1024, bias=True)\n",
+       "    (activation): Tanh()\n",
+       "  )\n",
+       ")"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "raw_model.eval()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Multi-Functionality"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Fetching 30 files: 100%|██████████| 30/30 [00:00<00:00, 240131.91it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "from FlagEmbedding import BGEM3FlagModel\n",
+    "\n",
+    "model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)\n",
+    "\n",
+    "sentences_1 = [\"What is BGE M3?\", \"Defination of BM25\"]\n",
+    "sentences_2 = [\"BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.\", \n",
+    "               \"BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 2.1 Dense Retrieval"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Using BGE-M3 for dense embedding follows similar steps to the BGE or BGE v1.5 models.\n",
+    "\n",
+    "Use the normalized hidden state of the special token [CLS] as the embedding:\n",
+    "\n",
+    "$$e_q = norm(H_q[0])$$\n",
+    "\n",
+    "Then compute the relevance score between the query and passage:\n",
+    "\n",
+    "$$s_{dense}=f_{sim}(e_p, e_q)$$\n",
+    "\n",
+    "where $e_p, e_q$ are the embedding vectors of passage and query, respectively.\n",
+    "\n",
+    "$f_{sim}$ is the score function (such as inner product or L2 distance) for computing the similarity between two embeddings."
+   ]
+  },
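+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a small worked illustration (the two-dimensional vectors below are made up for this example and are not model outputs): because the embeddings are normalized to unit length, their inner product equals their cosine similarity, and the squared L2 distance is fully determined by it.\n",
+    "\n",
+    "$$e_q = (0.6, 0.8),\\quad e_p = (1, 0),\\quad s_{dense} = e_q\\cdot e_p = 0.6\\cdot 1 + 0.8\\cdot 0 = 0.6$$\n",
+    "\n",
+    "$$\\lVert e_q - e_p\\rVert^2 = \\lVert e_q\\rVert^2 + \\lVert e_p\\rVert^2 - 2\\, e_q\\cdot e_p = 2 - 2\\cdot 0.6 = 0.8$$\n",
+    "\n",
+    "So for normalized embeddings, ranking by larger inner product and ranking by smaller L2 distance give the same order."
+   ]
+  },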
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[[0.6259035  0.34749585]\n",
+      " [0.349868   0.6782462 ]]\n"
+     ]
+    }
+   ],
+   "source": [
+    "# If you don't need such a long length of 8192 input tokens, you can set max_length to a smaller value to speed up encoding.\n",
+    "embeddings_1 = model.encode(sentences_1, max_length=10)['dense_vecs']\n",
+    "embeddings_2 = model.encode(sentences_2, max_length=100)['dense_vecs']\n",
+    "\n",
+    "# compute the similarity scores\n",
+    "s_dense = embeddings_1 @ embeddings_2.T\n",
+    "print(s_dense)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 2.2 Sparse Retrieval"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Set `return_sparse` to `True` to make the model return sparse vectors. If a token appears multiple times in a sentence, only its maximum weight is kept.\n",
+    "\n",
+    "BGE-M3 generates sparse embeddings by adding a linear layer and a ReLU activation function on top of the hidden states:\n",
+    "\n",
+    "$$w_{qt} = \\text{ReLU}(W_{lex}^T H_q [i])$$\n",
+    "\n",
+    "where $W_{lex}$ represents the weights of the linear layer and $H_q[i]$ is the encoder's output for the $i^{th}$ token, whose term is $t$."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[{'What': 0.08362077, 'is': 0.081469566, 'B': 0.12964639, 'GE': 0.25186998, 'M': 0.17001738, '3': 0.26957875, '?': 0.040755156}, {'De': 0.050144322, 'fin': 0.13689369, 'ation': 0.045134712, 'of': 0.06342201, 'BM': 0.25167602, '25': 0.33353207}]\n"
+     ]
+    }
+   ],
+   "source": [
+    "output_1 = model.encode(sentences_1, return_sparse=True)\n",
+    "output_2 = model.encode(sentences_2, return_sparse=True)\n",
+    "\n",
+    "# you can see the weight for each token:\n",
+    "print(model.convert_id_to_token(output_1['lexical_weights']))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Based on the token weights of the query and the passage, the relevance score between them is computed as the joint importance of the terms that appear in both:\n",
+    "\n",
+    "$$s_{lex} = \\sum_{t\\in q\\cap p}(w_{qt} * w_{pt})$$\n",
+    "\n",
+    "where $w_{qt}, w_{pt}$ are the importance weights of each shared term $t$ in the query and the passage, respectively."
+   ]
+  },
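+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A small worked illustration with made-up weights (hypothetical values, not model outputs): suppose the query weights are `{BM: 0.25, 25: 0.33, of: 0.06}` and the passage weights are `{BM: 0.30, 25: 0.40, is: 0.05}`. Only `BM` and `25` appear on both sides, so:\n",
+    "\n",
+    "$$s_{lex} = 0.25\\cdot 0.30 + 0.33\\cdot 0.40 = 0.075 + 0.132 = 0.207$$\n",
+    "\n",
+    "Terms that occur on only one side (`of`, `is`) contribute nothing to the score."
+   ]
+  },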
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0.19554448500275612\n",
+      "0.00880391988903284\n"
+     ]
+    }
+   ],
+   "source": [
+    "# compute the scores via lexical matching\n",
+    "s_lex_10_20 = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])\n",
+    "s_lex_10_21 = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][1])\n",
+    "\n",
+    "print(s_lex_10_20)\n",
+    "print(s_lex_10_21)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 2.3 Multi-Vector"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The multi-vector method utilizes the entire output embeddings for the representation of query $E_q$ and passage $E_p$.\n",
+    "\n",
+    "$$E_q = norm(W_{mul}^T H_q)$$\n",
+    "$$E_p = norm(W_{mul}^T H_p)$$\n",
+    "\n",
+    "where $W_{mul}$ is the learnable projection matrix."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "(8, 1024)\n",
+      "(30, 1024)\n"
+     ]
+    }
+   ],
+   "source": [
+    "output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)\n",
+    "output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)\n",
+    "\n",
+    "print(f\"({len(output_1['colbert_vecs'][0])}, {len(output_1['colbert_vecs'][0][0])})\")\n",
+    "print(f\"({len(output_2['colbert_vecs'][0])}, {len(output_2['colbert_vecs'][0][0])})\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Following ColBERT, we use late interaction to compute the fine-grained relevance score:\n",
+    "\n",
+    "$$s_{mul}=\\frac{1}{N}\\sum_{i=1}^N\\max_{j=1}^M E_q[i]\\cdot E_p^T[j]$$\n",
+    "\n",
+    "where $E_q, E_p$ are the entire output embeddings of query and passage, respectively.\n",
+    "\n",
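+    "Here $N$ and $M$ are the numbers of token vectors in $E_q$ and $E_p$. For each vector $v\\in E_q$ we take its maximum similarity with the vectors in $E_p$, then average these maxima over the $N$ query vectors."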
+   ]
+  },
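+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To make the formula concrete, here is a toy case with $N=2$ query vectors and $M=2$ passage vectors (two-dimensional, made-up values chosen for easy arithmetic, not model outputs):\n",
+    "\n",
+    "$$E_q = \\{(1, 0),\\ (0, 1)\\},\\quad E_p = \\{(1, 0),\\ (0.6, 0.8)\\}$$\n",
+    "\n",
+    "$$\\max_j E_q[1]\\cdot E_p[j] = \\max(1,\\ 0.6) = 1,\\quad \\max_j E_q[2]\\cdot E_p[j] = \\max(0,\\ 0.8) = 0.8$$\n",
+    "\n",
+    "$$s_{mul} = \\frac{1}{2}(1 + 0.8) = 0.9$$\n",
+    "\n",
+    "Each query vector is matched only with its best passage vector, so a passage scores highly when every query token finds a close counterpart."
+   ]
+  },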
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0.7796662449836731\n",
+      "0.4621177911758423\n"
+     ]
+    }
+   ],
+   "source": [
+    "s_mul_10_20 = model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]).item()\n",
+    "s_mul_10_21 = model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]).item()\n",
+    "\n",
+    "print(s_mul_10_20)\n",
+    "print(s_mul_10_21)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 2.4 Hybrid Ranking"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "BGE-M3's multi-functionality makes hybrid ranking possible, which can improve retrieval. Because the multi-vector method is expensive, we can first retrieve candidate results with either the dense or the sparse method. Then, to get the final result, we can rerank the candidates based on the integrated relevance score:\n",
+    "\n",
+    "$$s_{rank} = w_1\\cdot s_{dense}+w_2\\cdot s_{lex} + w_3\\cdot s_{mul}$$\n",
+    "\n",
+    "where the values chosen for $w_1, w_2$ and $w_3$ vary depending on the downstream scenario (the equal weights of 1/3 used here are just for demonstration)."
+   ]
+  },
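+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a worked check with equal weights, we can plug in the scores printed in the previous sections for the first query (\"What is BGE M3?\") against each of the two passages, rounded to four decimals:\n",
+    "\n",
+    "$$s_{rank}^{(1)} \\approx \\frac{1}{3}(0.6259 + 0.1955 + 0.7797) = \\frac{1.6011}{3} \\approx 0.5337$$\n",
+    "\n",
+    "$$s_{rank}^{(2)} \\approx \\frac{1}{3}(0.3475 + 0.0088 + 0.4621) = \\frac{0.8184}{3} \\approx 0.2728$$\n",
+    "\n",
+    "The first passage, which actually describes BGE M3, gets the higher combined score."
+   ]
+  },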
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0.5337047390639782\n",
+      "0.27280585498859483\n"
+     ]
+    }
+   ],
+   "source": [
+    "s_rank_10_20 = 1/3 * s_dense[0][0] + 1/3 * s_lex_10_20 + 1/3 * s_mul_10_20\n",
+    "s_rank_10_21 = 1/3 * s_dense[0][1] + 1/3 * s_lex_10_21 + 1/3 * s_mul_10_21\n",
+    "\n",
+    "print(s_rank_10_20)\n",
+    "print(s_rank_10_21)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "base",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/Tutorials/4_Evaluation/4.1.1_Evaluation_MSMARCO.ipynb
+++ b/Tutorials/4_Evaluation/4.1.1_Evaluation_MSMARCO.ipynb
@ -58,6 +58,7 @@
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
+    "import numpy as np\n",
    "\n",
    "data = load_dataset(\"namespace-Pt/msmarco\", split=\"dev\")"
   ]
@ -92,25 +93,24 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "data_queries = load_dataset(\"namespace-Pt/msmarco\", split=\"dev\")\n",
-    "data_corpus = load_dataset(\"namespace-PT/msmarco-corpus\", split=\"train\")\n",
+    "# data = load_dataset(\"namespace-Pt/msmarco\", split=\"dev\")\n",
+    "# queries = np.array(data[\"query\"])\n",
    "\n",
-    "queries = np.array(data_queries[\"query\"])\n",
-    "corpus = data_corpus"
+    "# corpus = load_dataset(\"namespace-PT/msmarco-corpus\", split=\"train\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Step 2: Text Embedding"
+    "## Step 2: Embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Choose the embedding model that we would like to evaluate."
+    "Choose the embedding model that we would like to evaluate, and encode the corpus into embeddings."
   ]
  },
  {
@ -163,6 +163,19 @@
    "## Step 3: Indexing"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We use the `index_factory()` function to create the Faiss index we want:\n",
+    "\n",
+    "- The first argument `dim` is the dimension of the vector space, which is 768 if you're using bge-base-en-v1.5.\n",
+    "\n",
+    "- The second argument `'Flat'` makes the index do exhaustive search.\n",
+    "\n",
+    "- The third argument `faiss.METRIC_INNER_PRODUCT` tells the index to use inner product as the similarity metric."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 7,
@ -185,6 +198,7 @@
    "# create the faiss index and store the corpus embeddings into the vector space\n",
    "index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)\n",
    "corpus_embeddings = corpus_embeddings.astype(np.float32)\n",
+    "# train and add the embeddings to the index\n",
    "index.train(corpus_embeddings)\n",
    "index.add(corpus_embeddings)\n",
    "\n",
@ -195,7 +209,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "As the evaluation process could take quite long time, it's a good choice to save the index."
+    "Since the embedding process is time-consuming, it's a good idea to save the index for reproduction or other experiments.\n",
+    "\n",
+    "Uncomment the following lines to save the index."
   ]
  },
  {
@ -204,13 +220,20 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "path = \"./index.bin\"\n",
-    "faiss.write_index(index, path)"
+    "# path = \"./index.bin\"\n",
+    "# faiss.write_index(index, path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you already have an index stored in your local directory, you can load it with:"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
@ -224,6 +247,13 @@
    "## Step 4: Retrieval"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Get the embeddings of all the queries, and get their corresponding ground truth answers for evaluation."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 10,
@ -235,6 +265,13 @@
    "corpus = np.asarray(corpus)"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Use the Faiss index to search for the top $k$ answers for each query."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 11,
@ -250,15 +287,17 @@
   ],
   "source": [
    "from tqdm import tqdm\n",
-    "import numpy as np\n",
    "\n",
    "res_scores, res_ids, res_text = [], [], []\n",
    "query_size = len(query_embeddings)\n",
    "batch_size = 256\n",
-    "k = 10\n",
+    "# the cutoffs used during evaluation; k is set to the largest cutoff\n",
+    "cut_offs = [1, 10]\n",
+    "k = max(cut_offs)\n",
    "\n",
    "for i in tqdm(range(0, query_size, batch_size), desc=\"Searching\"):\n",
    "    q_embedding = query_embeddings[i: min(i+batch_size, query_size)].astype(np.float32)\n",
+    "    # search the top k answers for each of the queries\n",
    "    score, idx = index.search(q_embedding, k=k)\n",
    "    res_scores += list(score)\n",
    "    res_ids += list(idx)\n",
@ -272,15 +311,6 @@
    "## Step 5: Evaluate"
   ]
  },
-  {
-   "cell_type": "code",
-   "execution_count": 12,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "cut_offs = [1, 10]"
-   ]
-  },
  {
   "cell_type": "markdown",
   "metadata": {},
--- a/Tutorials/4_Evaluation/4.2.1_MTEB_Intro.ipynb
+++ b/Tutorials/4_Evaluation/4.2.1_MTEB_Intro.ipynb
@ -0,0 +1,436 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# MTEB"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For evaluating embedding models, MTEB is one of the most well-known benchmarks. In this tutorial, we'll introduce MTEB and its basic usage, and see how to evaluate your model on MTEB tasks."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 0. Installation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Install the packages we will use in your environment:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "%pip install sentence_transformers mteb"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Intro"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The [Massive Text Embedding Benchmark (MTEB)](https://github.com/embeddings-benchmark/mteb) is a large-scale evaluation framework designed to assess the performance of text embedding models across a wide variety of natural language processing (NLP) tasks. Introduced to standardize and improve the evaluation of text embeddings, MTEB is crucial for assessing how well these models generalize across various real-world applications. It contains a wide range of datasets in eight main NLP tasks and different languages, and provides an easy pipeline for evaluation.\n",
+    "\n",
+    "MTEB is also well known for the MTEB leaderboard, which ranks the latest top-performing embedding models. We'll cover that in the next tutorial."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import mteb\n",
+    "from sentence_transformers import SentenceTransformer"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now let's take a look at how to use MTEB to do a quick evaluation.\n",
+    "\n",
+    "First, we load the model that we would like to evaluate:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model_name = \"BAAI/bge-base-en-v1.5\"\n",
+    "model = SentenceTransformer(model_name)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Below is the list of retrieval datasets used by MTEB's English leaderboard.\n",
+    "\n",
+    "MTEB directly uses the open-source benchmark BEIR for its retrieval part, which contains 15 datasets (note that CQADupstack is split into 12 subsets)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "retrieval_tasks = [\n",
+    "    \"ArguAna\",\n",
+    "    \"ClimateFEVER\",\n",
+    "    \"CQADupstackAndroidRetrieval\",\n",
+    "    \"CQADupstackEnglishRetrieval\",\n",
+    "    \"CQADupstackGamingRetrieval\",\n",
+    "    \"CQADupstackGisRetrieval\",\n",
+    "    \"CQADupstackMathematicaRetrieval\",\n",
+    "    \"CQADupstackPhysicsRetrieval\",\n",
+    "    \"CQADupstackProgrammersRetrieval\",\n",
+    "    \"CQADupstackStatsRetrieval\",\n",
+    "    \"CQADupstackTexRetrieval\",\n",
+    "    \"CQADupstackUnixRetrieval\",\n",
+    "    \"CQADupstackWebmastersRetrieval\",\n",
+    "    \"CQADupstackWordpressRetrieval\",\n",
+    "    \"DBPedia\",\n",
+    "    \"FEVER\",\n",
+    "    \"FiQA2018\",\n",
+    "    \"HotpotQA\",\n",
+    "    \"MSMARCO\",\n",
+    "    \"NFCorpus\",\n",
+    "    \"NQ\",\n",
+    "    \"QuoraRetrieval\",\n",
+    "    \"SCIDOCS\",\n",
+    "    \"SciFact\",\n",
+    "    \"Touche2020\",\n",
+    "    \"TRECCOVID\",\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For demonstration, let's just run the first one, \"ArguAna\".\n",
+    "\n",
+    "For a full list of tasks and languages that MTEB supports, check [this page](https://github.com/embeddings-benchmark/mteb/blob/18662380f0f476db3d170d0926892045aa9f74ee/docs/tasks.md)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tasks = mteb.get_tasks(tasks=retrieval_tasks[:1])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then, create and initialize an MTEB instance with our chosen tasks, and run the evaluation process."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #262626; text-decoration-color: #262626\">───────────────────────────────────────────────── </span><span style=\"font-weight: bold\">Selected tasks </span><span style=\"color: #262626; text-decoration-color: #262626\"> ─────────────────────────────────────────────────</span>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[38;5;235m───────────────────────────────────────────────── \u001b[0m\u001b[1mSelected tasks \u001b[0m\u001b[38;5;235m ─────────────────────────────────────────────────\u001b[0m\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">Retrieval</span>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\u001b[1mRetrieval\u001b[0m\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">    - ArguAna, <span style=\"color: #626262; text-decoration-color: #626262; font-style: italic\">s2p</span>\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "    - ArguAna, \u001b[3;38;5;241ms2p\u001b[0m\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
+       "\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\n",
+       "\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Batches: 100%|██████████| 44/44 [00:41<00:00,  1.06it/s]\n",
+      "Batches: 100%|██████████| 272/272 [03:36<00:00,  1.26it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "# use the tasks we chose to initialize the MTEB instance\n",
+    "evaluation = mteb.MTEB(tasks=tasks)\n",
+    "\n",
+    "# call run() with the model and output_folder\n",
+    "results = evaluation.run(model, output_folder=\"results\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The results should be stored in `{output_folder}/{model_name}/{model_revision}/{task_name}.json`.\n",
+    "\n",
+    "Opening the JSON file, you should see content like the following: the evaluation results on \"ArguAna\" for different metrics at cutoffs from 1 to 1000."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "```json\n",
+    "{\n",
+    "  \"dataset_revision\": \"c22ab2a51041ffd869aaddef7af8d8215647e41a\",\n",
+    "  \"evaluation_time\": 260.14976954460144,\n",
+    "  \"kg_co2_emissions\": null,\n",
+    "  \"mteb_version\": \"1.14.17\",\n",
+    "  \"scores\": {\n",
+    "    \"test\": [\n",
+    "      {\n",
+    "        \"hf_subset\": \"default\",\n",
+    "        \"languages\": [\n",
+    "          \"eng-Latn\"\n",
+    "        ],\n",
+    "        \"main_score\": 0.63616,\n",
+    "        \"map_at_1\": 0.40754,\n",
+    "        \"map_at_10\": 0.55773,\n",
+    "        \"map_at_100\": 0.56344,\n",
+    "        \"map_at_1000\": 0.56347,\n",
+    "        \"map_at_20\": 0.56202,\n",
+    "        \"map_at_3\": 0.51932,\n",
+    "        \"map_at_5\": 0.54023,\n",
+    "        \"mrr_at_1\": 0.4139402560455192,\n",
+    "        \"mrr_at_10\": 0.5603739077423295,\n",
+    "        \"mrr_at_100\": 0.5660817425350153,\n",
+    "        \"mrr_at_1000\": 0.5661121884705748,\n",
+    "        \"mrr_at_20\": 0.564661930998293,\n",
+    "        \"mrr_at_3\": 0.5208629682313899,\n",
+    "        \"mrr_at_5\": 0.5429113323850182,\n",
+    "        \"nauc_map_at_1000_diff1\": 0.15930478114759905,\n",
+    "        \"nauc_map_at_1000_max\": -0.06396189194646361,\n",
+    "        \"nauc_map_at_1000_std\": -0.13168797291549253,\n",
+    "        \"nauc_map_at_100_diff1\": 0.15934819555197366,\n",
+    "        \"nauc_map_at_100_max\": -0.06389635013430676,\n",
+    "        \"nauc_map_at_100_std\": -0.13164524259533786,\n",
+    "        \"nauc_map_at_10_diff1\": 0.16057318234658585,\n",
+    "        \"nauc_map_at_10_max\": -0.060962623117325254,\n",
+    "        \"nauc_map_at_10_std\": -0.1300413865104607,\n",
+    "        \"nauc_map_at_1_diff1\": 0.17346152653542332,\n",
+    "        \"nauc_map_at_1_max\": -0.09705499215630589,\n",
+    "        \"nauc_map_at_1_std\": -0.14726476953035533,\n",
+    "        \"nauc_map_at_20_diff1\": 0.15956349246366208,\n",
+    "        \"nauc_map_at_20_max\": -0.06259296677860492,\n",
+    "        \"nauc_map_at_20_std\": -0.13097093150054095,\n",
+    "        \"nauc_map_at_3_diff1\": 0.15620049317363813,\n",
+    "        \"nauc_map_at_3_max\": -0.06690213479396273,\n",
+    "        \"nauc_map_at_3_std\": -0.13440904793529648,\n",
+    "        \"nauc_map_at_5_diff1\": 0.1557795701081579,\n",
+    "        \"nauc_map_at_5_max\": -0.06255283252590663,\n",
+    "        \"nauc_map_at_5_std\": -0.1355361594910923,\n",
+    "        \"nauc_mrr_at_1000_diff1\": 0.1378988612808882,\n",
+    "        \"nauc_mrr_at_1000_max\": -0.07507962333910836,\n",
+    "        \"nauc_mrr_at_1000_std\": -0.12969109830101241,\n",
+    "        \"nauc_mrr_at_100_diff1\": 0.13794450668758515,\n",
+    "        \"nauc_mrr_at_100_max\": -0.07501290390362861,\n",
+    "        \"nauc_mrr_at_100_std\": -0.12964855554504057,\n",
+    "        \"nauc_mrr_at_10_diff1\": 0.1396047981645623,\n",
+    "        \"nauc_mrr_at_10_max\": -0.07185174301688693,\n",
+    "        \"nauc_mrr_at_10_std\": -0.12807325096717753,\n",
+    "        \"nauc_mrr_at_1_diff1\": 0.15610387932529113,\n",
+    "        \"nauc_mrr_at_1_max\": -0.09824591983546396,\n",
+    "        \"nauc_mrr_at_1_std\": -0.13914318784294258,\n",
+    "        \"nauc_mrr_at_20_diff1\": 0.1382786098284509,\n",
+    "        \"nauc_mrr_at_20_max\": -0.07364476417961506,\n",
+    "        \"nauc_mrr_at_20_std\": -0.12898192060943495,\n",
+    "        \"nauc_mrr_at_3_diff1\": 0.13118224861025093,\n",
+    "        \"nauc_mrr_at_3_max\": -0.08164985279853691,\n",
+    "        \"nauc_mrr_at_3_std\": -0.13241573571401533,\n",
+    "        \"nauc_mrr_at_5_diff1\": 0.1346130730317385,\n",
+    "        \"nauc_mrr_at_5_max\": -0.07404093236468848,\n",
+    "        \"nauc_mrr_at_5_std\": -0.1340775377068567,\n",
+    "        \"nauc_ndcg_at_1000_diff1\": 0.15919987960292029,\n",
+    "        \"nauc_ndcg_at_1000_max\": -0.05457945565481172,\n",
+    "        \"nauc_ndcg_at_1000_std\": -0.12457339152558143,\n",
+    "        \"nauc_ndcg_at_100_diff1\": 0.1604091882521101,\n",
+    "        \"nauc_ndcg_at_100_max\": -0.05281549383775287,\n",
+    "        \"nauc_ndcg_at_100_std\": -0.12347288098914058,\n",
+    "        \"nauc_ndcg_at_10_diff1\": 0.1657018523692905,\n",
+    "        \"nauc_ndcg_at_10_max\": -0.036222943297402846,\n",
+    "        \"nauc_ndcg_at_10_std\": -0.11284619565817842,\n",
+    "        \"nauc_ndcg_at_1_diff1\": 0.17346152653542332,\n",
+    "        \"nauc_ndcg_at_1_max\": -0.09705499215630589,\n",
+    "        \"nauc_ndcg_at_1_std\": -0.14726476953035533,\n",
+    "        \"nauc_ndcg_at_20_diff1\": 0.16231721725673165,\n",
+    "        \"nauc_ndcg_at_20_max\": -0.04147115653921931,\n",
+    "        \"nauc_ndcg_at_20_std\": -0.11598700704312062,\n",
+    "        \"nauc_ndcg_at_3_diff1\": 0.15256475371124711,\n",
+    "        \"nauc_ndcg_at_3_max\": -0.05432154580979357,\n",
+    "        \"nauc_ndcg_at_3_std\": -0.12841084787822227,\n",
+    "        \"nauc_ndcg_at_5_diff1\": 0.15236205846534961,\n",
+    "        \"nauc_ndcg_at_5_max\": -0.04356123278888682,\n",
+    "        \"nauc_ndcg_at_5_std\": -0.12942556865700913,\n",
+    "        \"nauc_precision_at_1000_diff1\": -0.038790629929866066,\n",
+    "        \"nauc_precision_at_1000_max\": 0.3630826341915611,\n",
+    "        \"nauc_precision_at_1000_std\": 0.4772189839676386,\n",
+    "        \"nauc_precision_at_100_diff1\": 0.32118609204433185,\n",
+    "        \"nauc_precision_at_100_max\": 0.4740132817600036,\n",
+    "        \"nauc_precision_at_100_std\": 0.3456396169952022,\n",
+    "        \"nauc_precision_at_10_diff1\": 0.22279659689895104,\n",
+    "        \"nauc_precision_at_10_max\": 0.16823918613191954,\n",
+    "        \"nauc_precision_at_10_std\": 0.0377209694331257,\n",
+    "        \"nauc_precision_at_1_diff1\": 0.17346152653542332,\n",
+    "        \"nauc_precision_at_1_max\": -0.09705499215630589,\n",
+    "        \"nauc_precision_at_1_std\": -0.14726476953035533,\n",
+    "        \"nauc_precision_at_20_diff1\": 0.23025740175221762,\n",
+    "        \"nauc_precision_at_20_max\": 0.2892313928157665,\n",
+    "        \"nauc_precision_at_20_std\": 0.13522755012490692,\n",
+    "        \"nauc_precision_at_3_diff1\": 0.1410889527057097,\n",
+    "        \"nauc_precision_at_3_max\": -0.010771302313530132,\n",
+    "        \"nauc_precision_at_3_std\": -0.10744937823276193,\n",
+    "        \"nauc_precision_at_5_diff1\": 0.14012953903010988,\n",
+    "        \"nauc_precision_at_5_max\": 0.03977485677045894,\n",
+    "        \"nauc_precision_at_5_std\": -0.10292184602358977,\n",
+    "        \"nauc_recall_at_1000_diff1\": -0.03879062992990034,\n",
+    "        \"nauc_recall_at_1000_max\": 0.36308263419153386,\n",
+    "        \"nauc_recall_at_1000_std\": 0.47721898396760526,\n",
+    "        \"nauc_recall_at_100_diff1\": 0.3211860920443005,\n",
+    "        \"nauc_recall_at_100_max\": 0.4740132817599919,\n",
+    "        \"nauc_recall_at_100_std\": 0.345639616995194,\n",
+    "        \"nauc_recall_at_10_diff1\": 0.22279659689895054,\n",
+    "        \"nauc_recall_at_10_max\": 0.16823918613192046,\n",
+    "        \"nauc_recall_at_10_std\": 0.037720969433127145,\n",
+    "        \"nauc_recall_at_1_diff1\": 0.17346152653542332,\n",
+    "        \"nauc_recall_at_1_max\": -0.09705499215630589,\n",
+    "        \"nauc_recall_at_1_std\": -0.14726476953035533,\n",
+    "        \"nauc_recall_at_20_diff1\": 0.23025740175221865,\n",
+    "        \"nauc_recall_at_20_max\": 0.2892313928157675,\n",
+    "        \"nauc_recall_at_20_std\": 0.13522755012490456,\n",
+    "        \"nauc_recall_at_3_diff1\": 0.14108895270570979,\n",
+    "        \"nauc_recall_at_3_max\": -0.010771302313529425,\n",
+    "        \"nauc_recall_at_3_std\": -0.10744937823276134,\n",
+    "        \"nauc_recall_at_5_diff1\": 0.14012953903010958,\n",
+    "        \"nauc_recall_at_5_max\": 0.039774856770459645,\n",
+    "        \"nauc_recall_at_5_std\": -0.10292184602358935,\n",
+    "        \"ndcg_at_1\": 0.40754,\n",
+    "        \"ndcg_at_10\": 0.63616,\n",
+    "        \"ndcg_at_100\": 0.66063,\n",
+    "        \"ndcg_at_1000\": 0.6613,\n",
+    "        \"ndcg_at_20\": 0.65131,\n",
+    "        \"ndcg_at_3\": 0.55717,\n",
+    "        \"ndcg_at_5\": 0.59461,\n",
+    "        \"precision_at_1\": 0.40754,\n",
+    "        \"precision_at_10\": 0.08841,\n",
+    "        \"precision_at_100\": 0.00991,\n",
+    "        \"precision_at_1000\": 0.001,\n",
+    "        \"precision_at_20\": 0.04716,\n",
+    "        \"precision_at_3\": 0.22238,\n",
+    "        \"precision_at_5\": 0.15149,\n",
+    "        \"recall_at_1\": 0.40754,\n",
+    "        \"recall_at_10\": 0.88407,\n",
+    "        \"recall_at_100\": 0.99147,\n",
+    "        \"recall_at_1000\": 0.99644,\n",
+    "        \"recall_at_20\": 0.9431,\n",
+    "        \"recall_at_3\": 0.66714,\n",
+    "        \"recall_at_5\": 0.75747\n",
+    "      }\n",
+    "    ]\n",
+    "  },\n",
+    "  \"task_name\": \"ArguAna\"\n",
+    "}\n",
+    "```"
+   ]
+  },
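+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One way to sanity-check these numbers: `main_score` equals `ndcg_at_10` (both 0.63616), and each `precision_at_k` is close to `recall_at_k` divided by $k$. For example:\n",
+    "\n",
+    "$$\\frac{\\text{recall@10}}{10} = \\frac{0.88407}{10} \\approx 0.08841 = \\text{precision@10},\\qquad \\frac{\\text{recall@5}}{5} = \\frac{0.75747}{5} \\approx 0.15149 = \\text{precision@5}$$\n",
+    "\n",
+    "This pattern is what one would expect if each query has exactly one relevant document. That is an inference from the reported values, not something stated in the output file."
+   ]
+  },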
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we've successfully run the evaluation using mteb! In the next tutorial, we'll show how to evaluate your model on all 56 tasks of English MTEB and compare it with models on the leaderboard."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "base",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}