update tutorials

2026-01-07 12:43:12 +00:00 · 2024-11-15 09:05:04 +00:00 · 2024-11-15 09:05:04 +00:00 · 2aa9204f82
commit 2aa9204f82
parent c9cfa7c04e
4 changed files with 85 additions and 15 deletions
--- a/Tutorials/1_Embedding/1.2.2_Auto_Embedder.ipynb
+++ b/Tutorials/1_Embedding/1.2.2_Auto_Embedder.ipynb
--- a/Tutorials/1_Embedding/1.2.2_BGE_Explanation.ipynb
+++ b/Tutorials/1_Embedding/1.2.2_BGE_Explanation.ipynb
--- a/Tutorials/1_Embedding/1.2.4_BGE-M3.ipynb
+++ b/Tutorials/1_Embedding/1.2.4_BGE-M3.ipynb
--- a/Tutorials/4_Evaluation/4.4.2_BEIR.ipynb
+++ b/Tutorials/4_Evaluation/4.4.2_BEIR.ipynb
@ -42,7 +42,40 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## 1. Use BEIR"
+    "## 1. Evaluate using BEIR"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "BEIR contains 18 datasets. Most can be downloaded from this [link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/), while 4 of them are private datasets that require appropriate licences. To access those 4 datasets, see the BEIR [wiki](https://github.com/beir-cellar/beir/wiki/Datasets-available) for more information."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "| Dataset Name | Type     |  Queries  | Documents | Avg. Docs/Q | Public | \n",
+    "| ---------| :-----------: | ---------| --------- | ------| :------------:| \n",
+    "| ``msmarco`` | `Train` `Dev` `Test` | 6,980   |  8.84M     |    1.1 | Yes |  \n",
+    "| ``trec-covid``| `Test` | 50|  171K| 493.5 | Yes | \n",
+    "| ``nfcorpus``  | `Train` `Dev` `Test` |  323     |  3.6K     |  38.2 | Yes |\n",
+    "| ``bioasq``| `Train` `Test` |    500    |  14.91M    |  8.05 | No | \n",
+    "| ``nq``| `Train` `Test`   |  3,452   |  2.68M  |  1.2 | Yes | \n",
+    "| ``hotpotqa``| `Train` `Dev` `Test`   |  7,405   |  5.23M  |  2.0 | Yes |\n",
+    "| ``fiqa``    | `Train` `Dev` `Test`     |  648     |  57K    |  2.6 | Yes | \n",
+    "| ``signal1m`` | `Test`     |   97   |  2.86M  |  19.6 | No |\n",
+    "| ``trec-news``    | `Test`     |   57    |  595K    |  19.6 | No |\n",
+    "| ``arguana`` | `Test`       |  1,406     |  8.67K    |  1.0 | Yes |\n",
+    "| ``webis-touche2020``| `Test` |   49     |  382K    |  49.2 |  Yes |\n",
+    "| ``cqadupstack``| `Test`      |   13,145 |  457K  |  1.4 |  Yes |\n",
+    "| ``quora``| `Dev` `Test`  |   10,000     |  523K    |  1.6 |  Yes | \n",
+    "| ``dbpedia-entity``| `Dev` `Test` |   400    |  4.63M    |  38.2 |  Yes | \n",
+    "| ``scidocs``| `Test` |    1,000     |  25K    |  4.9 |  Yes | \n",
+    "| ``fever``| `Train` `Dev` `Test`     |   6,666     |  5.42M    |  1.2|  Yes | \n",
+    "| ``climate-fever``| `Test` |  1,535     |  5.42M |  3.0 |  Yes |\n",
+    "| ``scifact``| `Train` `Test` |  300     |  5K    |  1.1 |  Yes |"
   ]
  },
  {
@ -52,6 +85,13 @@
    "### 1.1 Load Dataset"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "First, set up logging."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 12,
@ -66,6 +106,13 @@
    "                    handlers=[LoggingHandler()])"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We use the `arguana` dataset for a quick demonstration."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@ -140,6 +187,13 @@
    "### 1.2 Evaluation"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then we load `bge-base-en-v1.5` from Hugging Face and evaluate its performance on `arguana`."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@ -248,7 +302,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Evaluate using FlagEmbedding"
+    "## 2. Evaluate using FlagEmbedding"
   ]
  },
  {
@ -267,7 +321,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
@ -290,7 +344,8 @@
    "    --eval_metrics ndcg_at_10 recall_at_100 \n",
    "    --ignore_identical_ids True \n",
    "    --embedder_name_or_path BAAI/bge-base-en-v1.5 \n",
-    "    --devices cuda:7\n",
+    "    --embedder_batch_size 1024\n",
+    "    --devices cuda:4\n",
    "\"\"\".replace('\\n','')\n",
    "\n",
    "sys.argv = arguments.split()"
@ -305,9 +360,24 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 4,
   "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Split 'dev' not found in the dataset. Removing it from the list.\n",
+      "ignore_identical_ids is set to True. This means that the search results will not contain identical ids. Note: Dataset such as MIRACL should NOT set this to True.\n",
+      "pre tokenize: 100%|██████████| 9/9 [00:00<00:00, 16.19it/s]\n",
+      "You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n",
+      "Inference Embeddings: 100%|██████████| 9/9 [00:11<00:00,  1.27s/it]\n",
+      "pre tokenize: 100%|██████████| 2/2 [00:00<00:00, 19.54it/s]\n",
+      "Inference Embeddings: 100%|██████████| 2/2 [00:02<00:00,  1.29s/it]\n",
+      "Searching: 100%|██████████| 44/44 [00:00<00:00, 208.73it/s]\n"
+     ]
+    }
+   ],
   "source": [
    "from transformers import HfArgumentParser\n",
    "\n",
@ -343,7 +413,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
@ -352,16 +422,16 @@
     "text": [
      "{\n",
      "    \"arguana-test\": {\n",
-      "        \"ndcg_at_10\": 0.6361,\n",
-      "        \"ndcg_at_100\": 0.66057,\n",
-      "        \"map_at_10\": 0.55766,\n",
-      "        \"map_at_100\": 0.56337,\n",
-      "        \"recall_at_10\": 0.88407,\n",
+      "        \"ndcg_at_10\": 0.63668,\n",
+      "        \"ndcg_at_100\": 0.66075,\n",
+      "        \"map_at_10\": 0.55801,\n",
+      "        \"map_at_100\": 0.56358,\n",
+      "        \"recall_at_10\": 0.88549,\n",
      "        \"recall_at_100\": 0.99147,\n",
-      "        \"precision_at_10\": 0.08841,\n",
+      "        \"precision_at_10\": 0.08855,\n",
      "        \"precision_at_100\": 0.00991,\n",
-      "        \"mrr_at_10\": 0.55766,\n",
-      "        \"mrr_at_100\": 0.56337\n",
+      "        \"mrr_at_10\": 0.55809,\n",
+      "        \"mrr_at_100\": 0.56366\n",
      "    }\n",
      "}\n"
     ]