
Additional Experiments

The table below adds experiments to answer additional questions about various design choices. The first row uses the same settings as the main chapter and is used as a reference. For example,

  • comparing rows 1 and 2 answers the question: "What is the performance difference when we train on the last versus the first output token?";
  • comparing rows 1 and 3 answers the question: "What is the performance difference when we train only the last layer instead of the last block?";
  • and so forth.

 

|   | Model              | Weights    | Trainable token | Trainable layers | Context length           | CPU/GPU | Training time | Training acc | Validation acc | Test acc |
|---|--------------------|------------|-----------------|------------------|--------------------------|---------|---------------|--------------|----------------|----------|
| 1 | gpt2-small (124M)  | pretrained | last            | last_block       | longest train ex. (120)  | V100    | 0.39 min      | 96.63%       | 97.99%         | 94.33%   |
| 2 | gpt2-small (124M)  | pretrained | first           | last_block       | longest train ex. (120)  | V100    | 0.37 min      | 78.46%       | 80.54%         | 75.00%   |
| 3 | gpt2-small (124M)  | pretrained | last            | last_layer       | longest train ex. (120)  | V100    | 0.33 min      | 78.65%       | 87.25%         | 78.33%   |
| 4 | gpt2-small (124M)  | pretrained | last            | all              | longest train ex. (120)  | V100    | 0.94 min      | 99.62%       | 96.64%         | 96.33%   |
| 5 | gpt2-medium (355M) | pretrained | last            | last_block       | longest train ex. (120)  | V100    | 0.91 min      | 87.50%       | 51.01%         | 56.67%   |
| 6 | gpt2-large (774M)  | pretrained | last            | last_block       | longest train ex. (120)  | V100    | 1.91 min      | 99.52%       | 98.66%         | 96.67%   |
| 7 | gpt2-small (124M)  | random     | last            | all              | longest train ex. (120)  | V100    | 0.93 min      | 100.00%      | 97.32%         | 93.00%   |
| 8 | gpt2-small (124M)  | pretrained | last            | last_block       | context length (1024)    | V100    | 3.24 min      | 83.08%       | 87.92%         | 78.33%   |

 

Usage

You can use the following commands to reproduce the experiments:

  • Row 1: python additional-experiments.py
  • Row 2: python additional-experiments.py --trainable_token first
  • Row 3: python additional-experiments.py --trainable_layers last_layer
  • Row 4: python additional-experiments.py --trainable_layers all
  • Row 5: python additional-experiments.py --model_size "gpt2-medium (355M)"
  • Row 6: python additional-experiments.py --model_size "gpt2-large (774M)"
  • Row 7: python additional-experiments.py --weights random --trainable_layers all
  • Row 8: python additional-experiments.py --context_length "model_context_length"

I've kept the LLM and dataset small on purpose, so you can run the training on a regular laptop (such as a MacBook Air M3) in about 15 minutes if you don't have access to a GPU.
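
In case it helps to see what these options control, below is a minimal sketch of how the --trainable_layers and --trainable_token settings could be implemented. The attribute names (trf_blocks, final_norm, out_head) follow the GPTModel class from the earlier chapters; treat this as an illustrative sketch rather than the exact code in additional-experiments.py.

```python
# Minimal sketch (not the exact script code): freeze everything first,
# then unfreeze the parts selected via --trainable_layers.
def configure_trainable_layers(model, trainable_layers="last_block"):
    for param in model.parameters():
        param.requires_grad = False  # freeze the whole model

    if trainable_layers == "last_layer":
        # train only the classification output layer
        for param in model.out_head.parameters():
            param.requires_grad = True
    elif trainable_layers == "last_block":
        # train the classification head, the final transformer block,
        # and the final LayerNorm
        for module in (model.out_head, model.trf_blocks[-1], model.final_norm):
            for param in module.parameters():
                param.requires_grad = True
    elif trainable_layers == "all":
        for param in model.parameters():
            param.requires_grad = True
    else:
        raise ValueError(f"Unknown --trainable_layers option: {trainable_layers}")


# Minimal sketch of --trainable_token: pick which output position
# feeds the classification loss.
def select_logits(logits, trainable_token="last"):
    # logits has shape (batch_size, num_tokens, num_classes)
    if trainable_token == "last":
        return logits[:, -1, :]  # the last token attends to the full sequence
    return logits[:, 0, :]       # the first token attends only to itself
```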

 

Interpretation

  1. Training the Last vs. First Output Token (Row 1 vs. 2): Training the last output token results in significantly better performance than training the first. This improvement is expected due to the causal self-attention mask: the last token is the only position that can attend to all other tokens in the input, whereas the first token can only attend to itself.

  2. Training the Last Transformer Block vs. Last Layer (Row 1 vs. 3): Training the entire last transformer block is much more effective than training only the last layer.

  3. Training All Layers vs. Last Transformer Block (Row 1 vs. 4): Training all layers shows a modest improvement of 2 percentage points in test accuracy over training only the last transformer block, but it takes almost three times as long to train.

  4. Using Larger Pretrained Models (Row 1 vs. 5, and Row 1 vs. 6): Employing the roughly 3x larger gpt2-medium model leads to worse results. However, the roughly 6x larger gpt2-large model improves performance compared to the initial model, as anticipated.

  5. Using a Model with Random Weights vs. Pretrained Weights (Row 1 vs. 7): Utilizing a model with random weights yields results that are only slightly worse (by 1.3 percentage points in test accuracy) than using pretrained weights.

  6. Padding Input to Full Context Length vs. Longest Training Example (Row 1 vs. 8): Padding the input to the model's full supported context length (1,024 tokens) instead of to the longest training example results in significantly worse performance and a much longer training time; a padding sketch follows this list.
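
As a rough illustration of point 6, the sketch below pads each encoded text either to the longest training example or to the model's full 1,024-token context length. The dataset class shown here is a simplified, hypothetical stand-in for the chapter's dataset code; the pad token ID 50256 corresponds to the GPT-2 <|endoftext|> token.

```python
import torch
from torch.utils.data import Dataset


class PaddedTextDataset(Dataset):
    """Hypothetical sketch of the padding logic, not the exact chapter code."""

    def __init__(self, encoded_texts, labels, pad_token_id=50256, max_length=None):
        self.labels = labels
        if max_length is None:
            # Rows 1-7: pad only up to the longest training example (120 tokens)
            max_length = max(len(t) for t in encoded_texts)
        # Row 8 (--context_length "model_context_length"): max_length=1024
        self.encoded = [
            t[:max_length] + [pad_token_id] * (max_length - len(t[:max_length]))
            for t in encoded_texts
        ]

    def __getitem__(self, index):
        return (
            torch.tensor(self.encoded[index]),
            torch.tensor(self.labels[index]),
        )

    def __len__(self):
        return len(self.labels)
```

Padding to 1,024 tokens makes each example roughly 8.5x longer than padding to 120 tokens, which is consistent with the longer training time in row 8.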