* typo fix

Experiment 5 is the same size as experiment 1 (both are 124M parameters), and experiment 6 is 355M, ~3x the 124M model.

* typo fix

All-layer FT vs. all-layer FT: the wording "slightly worse by 1.3%" indicates it is exp 5 vs. exp 9.

* exp 5 vs 9

* Update ch06/02_bonus_additional-experiments/README.md

---------

Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
TITC 2024-08-15 21:45:34 +08:00 committed by GitHub
parent e4cfcc5b66
commit e1072bbcd6


@@ -59,8 +59,8 @@ I've kept the LLM and dataset small on purpose, so you can run the training on a
2. **Training the Last Transformer Block vs. Last Layer (Row 1 vs. 3)**: Training the entire last transformer block also results in substantially better performance than training only the last layer.
3. **Training the Last vs. the Last Two Transformer Blocks (Row 1 vs. 4)**: Training the last two transformer blocks instead of only the last block results in a noticeable 3.33% accuracy boost.
4. **Training the Last Transformer Block vs. All Layers (Row 1 vs. 5)**: Training all layers shows a modest improvement of ~2% over training just the last transformer block, but it requires almost three times as long in terms of training duration. Also, it does not perform as well as training only the last two out of 12 transformer blocks.
-5. **Using Larger Pretrained Models (Row 1 vs 5, and Row 1 vs. 7 and 8)**: Employing a 3x larger pretrained model leads to worse results. However, using a 5x larger model improves performance compared to the initial model, as was anticipated. Similarly, the 12x larger model improves the predictive performance even further. (The medium model was perhaps not well pretrained, or the particular finetuning configuration does not work as well for this model.)
-6. **Using a Model with Random Weights vs. Pretrained Weights (Row 1 vs. 9)**: Utilizing a model with random weights yields results that are only slightly worse by 1.3% compared to using pretrained weights.
+5. **Using Larger Pretrained Models (Row 1 vs 6, and Row 1 vs. 7 and 8)**: Employing a 3x larger pretrained model leads to worse results. However, using a 5x larger model improves performance compared to the initial model, as was anticipated. Similarly, the 12x larger model improves the predictive performance even further. (The medium model was perhaps not well pretrained, or the particular finetuning configuration does not work as well for this model.)
+6. **Using a Model with Random Weights vs. Pretrained Weights (Row 1 and 5 vs. 9)**: Utilizing a model with random weights yields results that are only slightly worse (by 3% and 1.3%) compared to using pretrained weights.
7. **Using LoRA (Low-Rank Adaptation) vs. Training All Layers (Row 10 vs. 5)**: Keeping the model frozen and adding trainable LoRA layers (see [Appendix E](../../appendix-E/01_main-chapter-code/appendix-E.ipynb) for details) is a viable alternative to training all model parameters and even improves the performance by 1 percentage point. As can be seen from the ~1% smaller gap between the training and validation accuracy when using LoRA, this is likely due to less overfitting. Moreover, using LoRA is also more memory-efficient because fewer parameters have to be updated.
8. **Padding Input to Full Context Length vs. Longest Training Example (Row 1 vs. 11)**: Padding the input to the full supported context length results in significantly worse performance.
9. **Padding vs. No Padding (Row 1 vs. 12 and 13)**: The `--no_padding` option disables the padding in the dataset, which requires training the model with a batch size of 1 since the inputs have variable lengths. This results in a better test accuracy but takes longer to train. In row 12, we additionally enable gradient accumulation with 8 steps to achieve the same batch size as in the other experiments, which helps reduce overfitting and slightly boost the test set accuracy.
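Rows 1 through 5 in the README compare finetuning only the last layer, the last one or two transformer blocks, or all layers. The following is a minimal sketch of how such selective freezing is typically set up in PyTorch; it assumes a GPT-2-style model with `trf_blocks`, `final_norm`, and `out_head` attributes as in the chapter's `GPTModel`, and the helper name `configure_trainable_layers` is hypothetical, not part of the repository's code.

```python
def configure_trainable_layers(model, num_last_blocks=1, train_all=False):
    """Freeze everything, then unfreeze selected parts for finetuning (sketch)."""
    for param in model.parameters():
        param.requires_grad = False

    if train_all:
        # Row 5: finetune every layer
        for param in model.parameters():
            param.requires_grad = True
    else:
        # Rows 1 and 4: unfreeze only the last one or two transformer blocks
        for block in list(model.trf_blocks)[-num_last_blocks:]:
            for param in block.parameters():
                param.requires_grad = True

    # The final LayerNorm and the (replaced) classification head stay trainable
    for param in model.final_norm.parameters():
        param.requires_grad = True
    for param in model.out_head.parameters():
        param.requires_grad = True
    return model
```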
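Row 10 keeps the pretrained weights frozen and trains only added low-rank adapter matrices. Below is a minimal LoRA layer sketch for illustration, not the exact Appendix E implementation; the class name, rank, and alpha values are placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: y = W x + (alpha / r) * (x A) B, with W frozen."""

    def __init__(self, linear: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        for param in self.linear.parameters():
            param.requires_grad = False  # keep the pretrained weight frozen

        in_dim, out_dim = linear.in_features, linear.out_features
        # Low-rank update: A projects down to `rank`, B projects back up;
        # B starts at zero so the wrapped layer initially matches the original.
        self.A = nn.Parameter(torch.randn(in_dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.linear(x) + self.scaling * (x @ self.A @ self.B)


# Usage sketch: wrap a linear layer; only self.A and self.B receive gradients
layer = LoRALinear(nn.Linear(768, 768), rank=16, alpha=16.0)
out = layer(torch.randn(2, 768))  # shape: (2, 768)
```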
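Rows 1 and 11 differ only in how far the inputs are padded: to the longest training example versus to GPT-2's full 1,024-token context length. A hedged sketch of such a padding helper is shown below; the function name is hypothetical, and the `<|endoftext|>` pad token ID of 50256 follows the chapter's convention but is an assumption here.

```python
import torch

def pad_sequences(token_id_lists, max_length, pad_token_id=50256):
    """Pad (and truncate) every sequence of token IDs to `max_length`."""
    padded = []
    for ids in token_id_lists:
        ids = ids[:max_length]  # truncate sequences that are too long
        padded.append(ids + [pad_token_id] * (max_length - len(ids)))
    return torch.tensor(padded)

# Row 1:  max_length = length of the longest training example
# Row 11: max_length = 1024 (GPT-2's full supported context length)
```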
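Rows 12 and 13 drop the padding entirely, which forces a batch size of 1; one of the two no-padding runs additionally accumulates gradients over 8 steps to emulate the effective batch size of the other experiments. The loop below is a sketch of that idea under the assumption that the classifier uses the last token's logits, as in the chapter; `model`, `train_loader`, `optimizer`, and `device` are placeholders.

```python
import torch
import torch.nn.functional as F

ACCUMULATION_STEPS = 8  # emulate an effective batch size of 8 with batch size 1

def train_one_epoch(model, train_loader, optimizer, device):
    model.train()
    optimizer.zero_grad()
    for step, (input_ids, label) in enumerate(train_loader):  # batch size 1
        input_ids, label = input_ids.to(device), label.to(device)
        logits = model(input_ids)[:, -1, :]            # last-token logits for classification
        loss = F.cross_entropy(logits, label)
        (loss / ACCUMULATION_STEPS).backward()          # scale so accumulated gradients average
        if (step + 1) % ACCUMULATION_STEPS == 0:
            optimizer.step()                            # update only every 8 micro-batches
            optimizer.zero_grad()
```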