extend equation description

2025-11-01 18:30:00 +00:00 · 2024-08-06 19:46:50 -05:00 · 2024-08-06 19:46:50 -05:00 · e810f9f004
commit e810f9f004
parent c8090f30ef
1 changed files with 4 additions and 3 deletions
--- a/ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb
+++ b/ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb
@@ -1811,10 +1811,11 @@
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/dpo/3.webp?123\" width=800px>\n",
    "\n",
    "- In the equation above,\n",
-    " - \"expected value\" $\\mathbb{E}$ is statistics jargon and stands for the average or mean value of the random variable (the expression inside the brackets)\n",
-    " - The $\\pi_{\\theta}$ variable is the so-called policy (a term borrowed from reinforcement learning) and represents the LLM we want to optimize; $\\pi_{ref}$ is a reference LLM, which is typically the original LLM before optimization (at the beginning of the training, $\\pi_{\\theta}$ and $\\pi_{ref}$ are typically the same)\n",
-    " - $\\beta$ is a hyperparameter to control the divergence between the $\\pi_{\\theta}$ and the reference model; increasing $\\beta$ increases the impact of the difference between\n",
+    "  - \"expected value\" $\\mathbb{E}$ is statistics jargon and stands for the average or mean value of the random variable (the expression inside the brackets); optimizing $-\\mathbb{E}$ aligns the model better with user preferences\n",
+    "  - The $\\pi_{\\theta}$ variable is the so-called policy (a term borrowed from reinforcement learning) and represents the LLM we want to optimize; $\\pi_{ref}$ is a reference LLM, which is typically the original LLM before optimization (at the beginning of the training, $\\pi_{\\theta}$ and $\\pi_{ref}$ are typically the same)\n",
+    "  - $\\beta$ is a hyperparameter to control the divergence between the $\\pi_{\\theta}$ and the reference model; increasing $\\beta$ increases the impact of the difference between\n",
    "$\\pi_{\\theta}$ and $\\pi_{ref}$ in terms of their log probabilities on the overall loss function, thereby increasing the divergence between the two models\n",
+    "  - $\\log \\sigma(\\centerdot)$, the logarithm of the logistic sigmoid function, transforms the log-odds of the preferred and rejected responses (the terms inside the logistic sigmoid function) into a log-probability score\n",
    "- In code, we can implement the DPO loss as follows:"
   ]
  },
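
Note on the cell changed above: the bullets describe the loss as the negative expected value of $\log \sigma(\cdot)$ applied to a $\beta$-scaled difference of log-ratios between the policy and the reference model. The plain-Python sketch below only illustrates that description. It is not the notebook's implementation (which this diff does not show), and the names `log_sigmoid` and `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions made for the example. Inputs are per-response log-probabilities given as plain floats.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)); both branches avoid overflow in exp().
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Mean DPO loss over a batch (hypothetical helper, not the notebook's code).

    Each argument is a list of log-probabilities, one entry per preference pair.
    """
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen_logps, policy_rejected_logps,
                              ref_chosen_logps, ref_rejected_logps):
        policy_logratio = pc - pr   # log pi_theta(chosen) - log pi_theta(rejected)
        ref_logratio = rc - rr      # log pi_ref(chosen)   - log pi_ref(rejected)
        logits = policy_logratio - ref_logratio
        # Negative log-sigmoid of the beta-scaled difference
        losses.append(-log_sigmoid(beta * logits))
    # The expectation E[...] is approximated by the batch mean
    return sum(losses) / len(losses)


# When the policy and the reference assign identical log-probabilities
# (the typical state at the start of training), the loss equals log(2).
start_loss = dpo_loss([-10.0], [-12.0], [-10.0], [-12.0])
print(abs(start_loss - math.log(2)) < 1e-12)  # prints True
```

If the policy prefers the chosen response more strongly than the reference does, the term inside the sigmoid becomes positive and the loss drops below $\log 2$, which is the sense in which minimizing $-\mathbb{E}$ moves the model toward the preferred responses.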