diff --git a/ch06/02_bonus_additional-experiments/README.md b/ch06/02_bonus_additional-experiments/README.md index 320b056..7e011f6 100644 --- a/ch06/02_bonus_additional-experiments/README.md +++ b/ch06/02_bonus_additional-experiments/README.md @@ -24,6 +24,7 @@ For example, | 11 | gpt2-small (124M) | pretrained | last | last_block | variable: no padding (batch size 1) | 100.00% | 98.66% | 98.00% | 1.75 min | A100 | | 12 | gpt2-small (124M) | pretrained | last | last_block | variable: no padding (batch size 8) | 99.33% | 98.66% | 98.33% | 1.70 min | A100 | | 13 | gpt2-small (124M) | pretrained | last | last_block | longest train ex. (120); but no causal mask | 99.23% | 98.66% | 95.33% | 0.29 min | A100 | +| 14 | gpt2-small (124M) | pretrained | last | last_block | longest train ex. (120) and `ignore_index` for padding | 96.63% | 99.33% | 95.00% | 0.28 min | A100 |   @@ -45,27 +46,21 @@ You can use the following code to reproduce the experiments: - Row 11: `python additional-experiments.py --no_padding --batch_size 1` - Row 12: `python additional-experiments.py --no_padding --batch_size 1 --accumulation_steps 8` - Row 13: `python additional-experiments.py --disable_causal_mask` +- Row 14: `python additional-experiments.py --ignore_index 50256` -I've kept the LLM and dataset small on purpose, so you can run the training on a regular laptop like a MacBook Air M3 in about 15 minutes in case you don't have access to a GPU. +I've kept the LLM and dataset small on purpose, so you can run the training on a regular laptop like a MacBook Air M3 in about 15 minutes (for the default setting) in case you don't have access to a GPU.   ### Interpretation 1. **Training the Last vs. First Output Token (Row 1 vs. 2)**: Training the last output token results in substantially better performance compared to the first. This improvement is expected due to the causal self-attention mask. - 2. **Training the Last Transformer Block vs. Last Layer (Row 1 vs. 
3)**: Training the entire last transformer block is also results in substantially better results than training only the last layer. - 3. **Training All Layers vs. Last Transformer Block (Row 1 vs. 4)**: Training all layers shows a modest improvement of ~2% over just training the last transformer block, but it requires almost three times longer in terms of training duration. - 4. **Using Larger Pretrained Models (Row 1 vs 5, and Row 1 vs. 6 and 7)**: Employing a 3x larger pretrained model leads to worse results. However, using a 5x larger model improves performance compared to the initial model, as was anticipated. Similarly, the 12x larger model improves the predictive performance even further. (The medium model was perhaps not well pretrained or the particular finetuning configuration works not as well for this model.) - 5. **Using a Model with Random Weights vs. Pretrained Weights (Row 1 vs. 8)**: Utilizing a model with random weights yields results that are only slightly worse by 1.3% compared to using pretrained weights. - 6. **Using LoRA (Low-Rank Adaptation) vs Training All Layers (Row 9 vs. 4)**: Keeping the model frozen and adding trainable LoRA layers (see [Appendix E](../../appendix-E/01_main-chapter-code/appendix-E.ipynb) for details) is a viable alternative to training all model parameters and even improves the performance by 1% point. As it can be seen by the 1% lower gap between the training and validation accuracy when using LoRA, this is likely due to less overfitting. Moreover, using LoRA is also slightly faster because fewer parameters have to be updated. - 7. **Padding Input to Full Context Length vs. Longest Training Example (Row 1 vs. 10)**: Padding the input to the full supported context length results is significantly worse. - 8. **Padding vs no padding (Row 1 vs. 11 and 12)**: The `--no_padding` option disables the padding in the dataset, which requires training the model with a batch size of 1 since the inputs have variable lengths. 
This results in a better test accuracy but takes longer to train. In row 12, we additionally enable gradient accumulation with 8 steps to achieve the same batch size as in the other experiments, which helps reduce overfitting and slightly boost the test set accuracy. - 9. **Disabling the causal attention mask (Row 1 vs. 13)**: Disables the causal attention mask used in the multi-head attention module. This means all tokens can attend all other tokens. The model accuracy is slightly improved compared to the GPT model with causal mask. +10. **Ignoring the padding indices in the loss and backpropagation (Row 1 vs. 14)**: Setting `--ignore_index 50256` excludes the `<|endoftext|>` padding tokens from the `cross_entropy` loss function in PyTorch. In this case, it does not have any effect because we replaced the output layer, so the target labels are either 0 or 1 for the binary classification example and never match the ignored index. However, this setting is useful when instruction finetuning models in chapter 7. diff --git a/ch06/02_bonus_additional-experiments/additional-experiments.py b/ch06/02_bonus_additional-experiments/additional-experiments.py index f8f2df3..a3dd719 100644 --- a/ch06/02_bonus_additional-experiments/additional-experiments.py +++ b/ch06/02_bonus_additional-experiments/additional-experiments.py @@ -164,14 +164,16 @@ def instantiate_model(choose_model, load_weights): return model -def calc_loss_batch(input_batch, target_batch, model, device, trainable_token=-1): +def calc_loss_batch(input_batch, target_batch, model, device, + trainable_token=-1, ignore_index=-100): input_batch, target_batch = input_batch.to(device), target_batch.to(device) logits = model(input_batch)[:, trainable_token, :] # Logits of last output token - loss = torch.nn.functional.cross_entropy(logits, target_batch) + loss = torch.nn.functional.cross_entropy(logits, target_batch, ignore_index=ignore_index) return loss -def calc_loss_loader(data_loader, model, device, num_batches=None, trainable_token=-1): +def
calc_loss_loader(data_loader, model, device, + num_batches=None, trainable_token=-1, ignore_index=-100): total_loss = 0. if len(data_loader) == 0: return float("nan") @@ -183,7 +185,10 @@ def calc_loss_loader(data_loader, model, device, num_batches=None, trainable_tok num_batches = min(num_batches, len(data_loader)) for i, (input_batch, target_batch) in enumerate(data_loader): if i < num_batches: - loss = calc_loss_batch(input_batch, target_batch, model, device, trainable_token=trainable_token) + loss = calc_loss_batch( + input_batch, target_batch, model, device, + trainable_token=trainable_token, ignore_index=ignore_index + ) total_loss += loss.item() else: break @@ -212,18 +217,25 @@ def calc_accuracy_loader(data_loader, model, device, num_batches=None, trainable return correct_predictions / num_examples -def evaluate_model(model, train_loader, val_loader, device, eval_iter, trainable_token=-1): +def evaluate_model(model, train_loader, val_loader, device, + eval_iter, trainable_token=-1, ignore_index=-100): model.eval() with torch.no_grad(): - train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter, trainable_token=trainable_token) - val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter, trainable_token=trainable_token) + train_loss = calc_loss_loader( + train_loader, model, device, num_batches=eval_iter, + trainable_token=trainable_token, ignore_index=ignore_index + ) + val_loss = calc_loss_loader( + val_loader, model, device, num_batches=eval_iter, + trainable_token=trainable_token, ignore_index=ignore_index + ) model.train() return train_loss, val_loss def train_classifier_simple(model, train_loader, val_loader, optimizer, device, num_epochs, eval_freq, eval_iter, tokenizer, max_steps=None, trainable_token=-1, - accumulation_steps=1): + accumulation_steps=1, ignore_index=-100): # Initialize lists to track losses and tokens seen train_losses, val_losses, train_accs, val_accs = [], [], [], [] examples_seen, 
global_step = 0, -1 @@ -233,7 +245,10 @@ def train_classifier_simple(model, train_loader, val_loader, optimizer, device, model.train() # Set model to training mode for batch_idx, (input_batch, target_batch) in enumerate(train_loader): - loss = calc_loss_batch(input_batch, target_batch, model, device, trainable_token=trainable_token) + loss = calc_loss_batch( + input_batch, target_batch, model, device, + trainable_token=trainable_token, ignore_index=ignore_index + ) # Use gradient accumulation if accumulation_steps > 1 # See https://sebastianraschka.com/blog/2023/llm-grad-accumulation.html @@ -253,7 +268,9 @@ def train_classifier_simple(model, train_loader, val_loader, optimizer, device, # Optional evaluation step if global_step % eval_freq == 0: train_loss, val_loss = evaluate_model( - model, train_loader, val_loader, device, eval_iter, trainable_token=trainable_token) + model, train_loader, val_loader, device, eval_iter, + trainable_token=trainable_token, ignore_index=ignore_index + ) train_losses.append(train_loss) val_losses.append(val_loss) print(f"Ep {epoch+1} (Step {global_step:06d}): " @@ -395,6 +412,15 @@ if __name__ == "__main__": ) ) + parser.add_argument( + "--ignore_index", + type=int, + default=-100, + help=( + "Sets the `ignore_index` in the cross entropy loss." + ) + ) + args = parser.parse_args() if args.trainable_token == "first":
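
Note on the `ignore_index` argument threaded through the functions above: the following is a minimal plain-Python sketch of the semantics, not PyTorch's implementation. The helper `cross_entropy_sketch`, the toy logits, and the stand-in padding ID `PAD_ID = 4` (used in place of 50256 so the toy vocabulary stays small) are all hypothetical names introduced for illustration. It is meant to show why the flag leaves the loss unchanged when the targets are class labels 0 or 1, and how it would drop padded positions when the targets are token IDs.

```python
import math


def cross_entropy_sketch(logits, targets, ignore_index=-100):
    """Mean cross-entropy over all targets that differ from ignore_index.

    Illustrative re-implementation of the idea only; returning NaN when
    every target is ignored is a choice made for this sketch.
    """
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # this position contributes nothing to the loss
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses) if losses else float("nan")


# Case 1: binary classification, targets are class labels 0 or 1.
# No label ever equals 50256, so the flag cannot change the loss.
cls_logits = [[2.0, -1.0], [0.5, 1.5], [-0.3, 0.3]]
cls_labels = [0, 1, 1]
loss_default = cross_entropy_sketch(cls_logits, cls_labels)
loss_flagged = cross_entropy_sketch(cls_logits, cls_labels, ignore_index=50256)

# Case 2: token-level targets over a toy 5-token vocabulary, where the
# hypothetical PAD_ID marks padded positions that should be skipped.
PAD_ID = 4
tok_logits = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 3.0],  # padded position
]
tok_targets = [0, 1, PAD_ID]
loss_with_padding = cross_entropy_sketch(tok_logits, tok_targets)
loss_without_padding = cross_entropy_sketch(tok_logits, tok_targets, ignore_index=PAD_ID)
loss_first_two_only = cross_entropy_sketch(tok_logits[:2], tok_targets[:2])

print(loss_default == loss_flagged)
print(math.isclose(loss_without_padding, loss_first_two_only))
```

In the first case the two losses are identical, which matches the README's remark that the flag has no effect for the classification head. In the second case the loss with `ignore_index=PAD_ID` equals the loss computed on the unpadded positions alone.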