Mirror of https://github.com/rasbt/LLMs-from-scratch.git (synced 2025-08-09 01:02:50 +00:00)
note about logistic sigmoid
This commit is contained in:
parent 26df0c474c
commit 06151a809e
@@ -11,7 +11,7 @@ import numpy as np
 import tiktoken
 import torch
 import torch.nn as nn
 from torch.utils.data import Dataset, DataLoader

 #####################################
 # Chapter 2
@@ -158,7 +158,7 @@
     " - The $\\pi_{\\theta}$ variable is the so-called policy (a term borrowed from reinforcement learning) and represents the LLM we want to optimize; $\\pi_{ref}$ is a reference LLM, which is typically the original LLM before optimization (at the beginning of the training, $\\pi_{\\theta}$ and $\\pi_{ref}$ are typically the same)\n",
     " - $\\beta$ is a hyperparameter to control the divergence between the $\\pi_{\\theta}$ and the reference model; increasing $\\beta$ increases the impact of the difference between\n",
     "$\\pi_{\\theta}$ and $\\pi_{ref}$ in terms of their log probabilities on the overall loss function, thereby increasing the divergence between the two models\n",
-    " - the logistic sigmoid function, $\\log \\sigma(\\centerdot)$ transforms the log-odds of the preferred and rejected responses (the terms inside the logistic sigmoid function) into a log-probability score \n",
+    " - the logistic sigmoid function, $\\sigma(\\centerdot)$ transforms the log-odds of the preferred and rejected responses (the terms inside the logistic sigmoid function) into a probability score \n",
     "- To avoid bloating the code notebook with a more detailed discussion, I may write a separate standalone article with more details on these concepts in the future\n",
     "- In the meantime, if you are interested in comparing RLHF and DPO, please see the section [2.2. RLHF vs Direct Preference Optimization (DPO)](https://magazine.sebastianraschka.com/i/142924793/rlhf-vs-direct-preference-optimization-dpo) in my article [Tips for LLM Pretraining and Evaluating Reward Models](https://magazine.sebastianraschka.com/p/tips-for-llm-pretraining-and-evaluating-rms)"
    ]
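For context, the quantities annotated in the cell above ($\pi_{\theta}$, $\pi_{ref}$, $\beta$, and the logistic sigmoid $\sigma$) are the ingredients of the standard DPO objective (Rafailov et al., 2023), which can be written as

$$\mathcal{L}_{\text{DPO}}(\pi_{\theta}; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right]$$

where $y_w$ and $y_l$ are the preferred and rejected responses to a prompt $x$. The change in this hunk only adjusts how the $\sigma(\centerdot)$ term is described in the note; the loss itself is unchanged.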
@@ -1815,7 +1815,7 @@
     " - The $\\pi_{\\theta}$ variable is the so-called policy (a term borrowed from reinforcement learning) and represents the LLM we want to optimize; $\\pi_{ref}$ is a reference LLM, which is typically the original LLM before optimization (at the beginning of the training, $\\pi_{\\theta}$ and $\\pi_{ref}$ are typically the same)\n",
     " - $\\beta$ is a hyperparameter to control the divergence between the $\\pi_{\\theta}$ and the reference model; increasing $\\beta$ increases the impact of the difference between\n",
     "$\\pi_{\\theta}$ and $\\pi_{ref}$ in terms of their log probabilities on the overall loss function, thereby increasing the divergence between the two models\n",
-    " - the logistic sigmoid function, $\\log \\sigma(\\centerdot)$ transforms the log-odds of the preferred and rejected responses (the terms inside the logistic sigmoid function) into a log-probability score \n",
+    " - the logistic sigmoid function, $\\sigma(\\centerdot)$ transforms the log-odds of the preferred and rejected responses (the terms inside the logistic sigmoid function) into a probability score \n",
     "- In code, we can implement the DPO loss as follows:"
    ]
   },
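The cell above ends by introducing the notebook's code implementation, which is not part of this diff. As a rough, self-contained sketch of what such a DPO loss looks like in PyTorch, assuming per-response log-probabilities have already been computed for the policy and the reference model (the function and argument names below are illustrative, not taken from the repository):

import torch
import torch.nn.functional as F

def dpo_loss_sketch(
    policy_chosen_logps,       # log pi_theta(y_w | x), shape (batch_size,)
    policy_rejected_logps,     # log pi_theta(y_l | x), shape (batch_size,)
    reference_chosen_logps,    # log pi_ref(y_w | x), shape (batch_size,)
    reference_rejected_logps,  # log pi_ref(y_l | x), shape (batch_size,)
    beta=0.1,                  # illustrative default; scales the divergence from the reference model
):
    # Log-ratios between policy and reference model for the chosen and rejected responses
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps

    # The term inside sigma(.): scaled difference of the two log-ratios (the log-odds)
    logits = beta * (chosen_logratios - rejected_logratios)

    # -log sigma(.) of the log-odds, averaged over the batch
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up (negative) per-sequence log-probabilities
torch.manual_seed(123)
batch_size = 4
policy_chosen = -torch.rand(batch_size)
policy_rejected = -torch.rand(batch_size) - 0.5
ref_chosen = -torch.rand(batch_size)
ref_rejected = -torch.rand(batch_size)
print(dpo_loss_sketch(policy_chosen, policy_rejected, ref_chosen, ref_rejected))

Using torch.nn.functional.logsigmoid rather than composing log() with sigmoid() keeps the computation numerically stable for large negative arguments.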
|