LLMs-from-scratch/ch06/02_bonus_additional-experiments

# Additional Experiments

The table below lists experiments that answer additional questions about various design choices. The first row uses the same settings as the main chapter and serves as a reference. For example,

- comparing rows 1 and 2 answers the question: "What is the performance difference when we train on the last versus the first output token?";
- comparing rows 1 and 3 answers the question: "What is the performance difference when we train only the last layer instead of the last transformer block?";
- and so forth.
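The "trainable token" setting from rows 1 and 2 boils down to which position of the model's output we feed into the classification head. A minimal sketch (the tensor shapes here are illustrative, not taken from the chapter code):

```python
import torch

torch.manual_seed(123)

# Hypothetical model output: one logits vector per input token
# shape: (batch_size, num_tokens, num_classes)
logits = torch.randn(2, 120, 2)

# Row 1 setting: train/classify on the LAST token's output.
# Due to the causal attention mask, only the last token has
# attended to every other token in the sequence.
last_token_logits = logits[:, -1, :]    # shape: (2, 2)

# Row 2 setting: train/classify on the FIRST token's output
first_token_logits = logits[:, 0, :]    # shape: (2, 2)
```

The much lower accuracy in row 2 is consistent with the causal mask: the first token's output can only encode information about the first token itself.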
|   | Model              | Weights    | Trainable token | Trainable layers | Context length           | CPU/GPU | Training time | Training acc | Validation acc | Test acc |
|---|--------------------|------------|-----------------|------------------|--------------------------|---------|---------------|--------------|----------------|----------|
| 1 | gpt2-small (124M)  | pretrained | last            | last_block       | longest train ex. (120)  | V100    | 0.39 min      | 96.63%       | 97.99%         | 94.33%   |
| 2 | gpt2-small (124M)  | pretrained | first           | last_block       | longest train ex. (120)  | V100    | 0.37 min      | 78.46%       | 80.54%         | 75.00%   |
| 3 | gpt2-small (124M)  | pretrained | last            | last_layer       | longest train ex. (120)  | V100    | 0.33 min      | 78.65%       | 87.25%         | 78.33%   |
| 4 | gpt2-small (124M)  | pretrained | last            | all              | longest train ex. (120)  | V100    | 0.94 min      | 99.62%       | 96.64%         | 96.33%   |
| 5 | gpt2-medium (355M) | pretrained | last            | last_block       | longest train ex. (120)  | V100    | 0.91 min      | 87.50%       | 51.01%         | 56.67%   |
| 6 | gpt2-large (774M)  | pretrained | last            | last_block       | longest train ex. (120)  | V100    | 1.91 min      | 99.52%       | 98.66%         | 96.67%   |
| 7 | gpt2-small (124M)  | random     | last            | all              | longest train ex. (120)  | V100    | 0.93 min      | 100%         | 97.32%         | 93.00%   |
| 8 | gpt2-small (124M)  | pretrained | last            | last_block       | context length (1024)    | V100    | 3.24 min      | 83.08%       | 87.92%         | 78.33%   |
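The "trainable layers" column (`last_layer`, `last_block`, `all`) corresponds to which parameter groups have gradients enabled during fine-tuning. A minimal sketch of the pattern, using a hypothetical stand-in model rather than the chapter's actual GPT architecture:

```python
import torch.nn as nn

# Hypothetical stand-in for the GPT classifier (the real model has
# 12 transformer blocks plus a classification head)
model = nn.Sequential(
    nn.Embedding(100, 16),   # token embedding
    nn.Linear(16, 16),       # stand-in for the earlier transformer blocks
    nn.Linear(16, 16),       # stand-in for the last transformer block
    nn.Linear(16, 2),        # classification head ("last layer")
)

# "last_layer": freeze everything, then unfreeze only the output head
for param in model.parameters():
    param.requires_grad = False
for param in model[-1].parameters():
    param.requires_grad = True

# "last_block": additionally unfreeze the final transformer block
for param in model[-2].parameters():
    param.requires_grad = True

# "all": make every parameter trainable again
# for param in model.parameters():
#     param.requires_grad = True

num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Only parameters with `requires_grad=True` are updated by the optimizer, which is why `last_layer` trains fastest (row 3) while `all` costs the most time per epoch (row 4) but reaches the best test accuracy among the gpt2-small runs.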