# Additional Experiments
The table below lists additional experiments that address further questions about various design choices. The first row uses the same settings as the main chapter and serves as the reference.
For example,
- comparing rows 1 and 2 answers the question: "What is the performance difference when we fine-tune based on the last token versus the first token?";
- comparing rows 1 and 3 answers the question: "What is the performance difference when we train only the last layer instead of the last block?";
- and so forth (a code sketch of these configuration options follows the table).
| | Model | Weights | Trainable token | Trainable layers | Context length | CPU/GPU | Training time | Training acc | Validation acc | Test acc |
|---|--------------------|------------|-----------------|------------------|-------------------------|---------|---------------|--------------|----------------|----------|
| 1 | gpt2-small (124M) | pretrained | last | last_block | longest train ex. (120) | V100 | 0.39 min | 96.63% | 97.99% | 94.33% |
| 2 | gpt2-small (124M) | pretrained | first | last_block | longest train ex. (120) | V100 | 0.37 min | 78.46% | 80.54% | 75.00% |
| 3 | gpt2-small (124M) | pretrained | last | last_layer | longest train ex. (120) | V100 | 0.33 min | 78.65% | 87.25% | 78.33% |
| 4 | gpt2-small (124M) | pretrained | last | all | longest train ex. (120) | V100 | 0.94 min | 99.62% | 96.64% | 96.33% |
| 5 | gpt2-medium (355M) | pretrained | last | last_block | longest train ex. (120) | V100 | 0.91 min | 87.50% | 51.01% | 56.67% |
| 6 | gpt2-large (774M) | pretrained | last | last_block | longest train ex. (120) | V100 | 1.91 min | 99.52% | 98.66% | 96.67% |
| 7 | gpt2-small (124M)  | random     | last            | all              | longest train ex. (120) | V100    | 0.93 min      | 100.00%      | 97.32%         | 93.00%   |
| 8 | gpt2-small (124M) | pretrained | last | last_block | context length (1024) | V100 | 3.24 min | 83.08% | 87.92% | 78.33% |
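As a rough illustration of what the "Trainable token" and "Trainable layers" settings correspond to, here is a minimal PyTorch sketch. It assumes a GPT-2-style classification model with a `trf_blocks` module list, a `final_norm` layer, and a classification `out_head`; these attribute names are placeholders chosen for illustration and may differ from the actual model code used for the experiments.

```python
import torch.nn as nn


def configure_trainable_layers(model: nn.Module, trainable_layers: str = "last_block"):
    """Freeze all parameters, then unfreeze the requested subset.

    Assumes a GPT-2-style model with `trf_blocks`, `final_norm`, and
    `out_head` attributes (placeholder names used for illustration).
    """
    # Freeze everything first, then selectively unfreeze below
    for param in model.parameters():
        param.requires_grad = False

    if trainable_layers == "all":
        for param in model.parameters():
            param.requires_grad = True
    elif trainable_layers == "last_block":
        # Last transformer block plus the final layer norm and classification head
        for module in (model.trf_blocks[-1], model.final_norm, model.out_head):
            for param in module.parameters():
                param.requires_grad = True
    elif trainable_layers == "last_layer":
        # Only the classification head
        for param in model.out_head.parameters():
            param.requires_grad = True
    else:
        raise ValueError(f"Unknown trainable_layers option: {trainable_layers}")


def classification_logits(model: nn.Module, input_batch, trainable_token: str = "last"):
    """Select which token position's logits feed the classification loss.

    Assumes the model output has shape (batch_size, num_tokens, num_classes).
    """
    logits = model(input_batch)
    if trainable_token == "last":
        return logits[:, -1, :]   # last-token setting (rows 1 and 3-8)
    elif trainable_token == "first":
        return logits[:, 0, :]    # first-token setting (row 2)
    else:
        raise ValueError(f"Unknown trainable_token option: {trainable_token}")
```

Under these assumptions, row 3 would correspond to `configure_trainable_layers(model, "last_layer")` combined with `trainable_token="last"`, and row 4 to `configure_trainable_layers(model, "all")`.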