"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
"- FLOPs (Floating Point Operations Per Second) measure the computational complexity of neural network models by counting the number of floating-point operations executed\n",
"# Benchmark with automatic batch size finding and Model FLOP Utilization (MFU)"
]
},
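{
"cell_type": "markdown",
"metadata": {},
"source": [
"- As a quick self-contained illustration (not part of the book code), the next cell counts the FLOPs of a single matrix multiplication analytically; the matrix dimensions are arbitrary example values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"# A matmul of an (m, k) matrix with a (k, n) matrix performs roughly\n",
"# 2 * m * k * n floating-point operations (one multiply and one add per term)\n",
"m, k, n = 4, 256, 512 # arbitrary example dimensions\n",
"a = torch.randn(m, k)\n",
"b = torch.randn(k, n)\n",
"c = a @ b # the matmul whose cost we are counting\n",
"\n",
"matmul_flops = 2 * m * k * n # analytical FLOP count for this single matmul\n",
"print(f\"FLOPs for one ({m}x{k}) @ ({k}x{n}) matmul: {matmul_flops:,}\")"
]
},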
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Model FLOPs Utilization (MFU) explanation from the [PaLM paper](https://arxiv.org/abs/2204.02311)\n",
"\n",
"> We propose a new metric for efficiency that is implementation-independent and permits a cleaner comparison of system efficiency, called model FLOPs utilization (MFU). This is the ratio of the observed throughput (tokens-per-second) relative to the theoretical maximum throughput of a system operating at peak FLOPs. Crucially, the “theoretical maximum” throughput only accounts for the required operations to compute the forward+backward passes, and not rematerialization.\n",
"\n",
"\n",
"$$\\text{MFU} = \\frac{\\text{Observed Tokens per Second}}{\\text{Theoretical Max Tokens per Second}}$$\n",
"\n",
"where \n",
"\n",
"$$\\text{Theoretical Max Tokens per Second} = \\frac{\\text{Max FLOPs per Second}}{\\text{Total FLOPs per Token}}$$\n",
"\n",
"and\n",
"\n",
"$$\\text{Tokens per Second} = \\frac{\\text{Batch Size} \\times \\text{Sequence Length}}{\\text{Total Time}}$$"
]
},
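{
"cell_type": "markdown",
"metadata": {},
"source": [
"- The next cell is a minimal worked example of the MFU formula above (it is not part of the benchmark code that follows); the throughput, model size, and peak-FLOPs numbers are assumed placeholder values chosen only to illustrate the arithmetic"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example values -- substitute measurements from your own benchmark run\n",
"observed_tokens_per_second = 10_000.0 # assumed measured training throughput\n",
"\n",
"# Rough rule of thumb: ~6 FLOPs per parameter per token for forward + backward;\n",
"# a 124M-parameter GPT-2-sized model is assumed here\n",
"flops_per_token = 6 * 124e6\n",
"\n",
"max_flops_per_second = 19.5e12 # e.g., the FP32 peak of an A100 from the table in the next cell\n",
"\n",
"theoretical_max_tokens_per_second = max_flops_per_second / flops_per_token\n",
"mfu = observed_tokens_per_second / theoretical_max_tokens_per_second\n",
"print(f\"MFU: {mfu:.3f}\")"
]
},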
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Max flops per second provided by the GPU manufacturer\n",
"\n",
"flops_per_second = {\n",
" \"H100\": {\n",
" torch.float32: 60e12, # 60 TFLOPs for FP32 on NVIDIA H100\n",
" torch.float16: 1.979e15, # 1979 TFLOPs for FP16 on NVIDIA H100\n",
" torch.bfloat16: 1.979e15\n",
" },\n",
" \"L4\": {\n",
" torch.float32: 15e12, # 15 TFLOPs for FP32 on NVIDIA L4\n",
" torch.float16: 30e12, # 30 TFLOPs for FP16 on NVIDIA L4\n",
" torch.bfloat16: 30e12 \n",
" },\n",
" \"T4\": {\n",
" torch.float32: 8.1e12, # 8.1 TFLOPs for FP32 on NVIDIA T4\n",
" torch.float16: 130e12, # 130 TFLOPs for FP16 on NVIDIA T4\n",
" torch.bfloat16: 130e12\n",
" },\n",
" \"A10G\": {\n",
" torch.float32: 15.6e12, # 15.6 TFLOPs for FP32 on NVIDIA A10G\n",
" torch.float16: 78e12, # 78 TFLOPs for FP16 on NVIDIA A10G\n",
" torch.bfloat16: 78e12\n",
" },\n",
" \"A100\": {\n",
" torch.float32: 19.5e12, # 19.5 TFLOPs for FP32 on NVIDIA A100\n",
" torch.float16: 1.248e15, # 1248 TFLOPs for FP16 on NVIDIA A100\n",
" torch.bfloat16: 1.248e15\n",
" },\n",
" \"H200\": {\n",
" torch.float32: 70e12, # 70 TFLOPs for FP32 on NVIDIA H200\n",
" torch.float16: 1.2e15, # Assuming 1200 TFLOPs for FP16 on NVIDIA H200\n",
" torch.bfloat16: 1.2e15\n",
" },\n",
" \"RTX_3080\": {\n",
" torch.float32: 29.8e12, # 29.8 TFLOPs for FP32 on NVIDIA RTX 3080\n",
" torch.float16: 59.6e12, # 59.6 TFLOPs for FP16 on NVIDIA RTX 3080\n",
" torch.bfloat16: 59.6e12\n",
" },\n",
" \"RTX_3090\": {\n",
" torch.float32: 35.6e12, # 35.6 TFLOPs for FP32 on NVIDIA RTX 3090\n",
" torch.float16: 71.2e12, # 71.2 TFLOPs for FP16 on NVIDIA RTX 3090\n",
" torch.bfloat16: 71.2e12\n",
" },\n",
" \"GTX_1080\": {\n",
" torch.float32: 8.9e12, # 8.9 TFLOPs for FP32 on NVIDIA GTX 1080\n",
" torch.float16: 8.9e12, # No dedicated FP16 performance; using FP32 value\n",
" torch.bfloat16: 8.9e12\n",
" },\n",
" \"GTX_1080Ti\": {\n",
" torch.float32: 11.3e12, # 11.3 TFLOPs for FP32 on NVIDIA GTX 1080Ti\n",
" torch.float16: 11.3e12, # No dedicated FP16 performance; using FP32 value\n",
" torch.bfloat16: 11.3e12\n",
" },\n",
" \"GTX_1660\": {\n",
" torch.float32: 5e12, # 5 TFLOPs for FP32 on NVIDIA GTX 1660\n",
" torch.float16: 5e12, # No dedicated FP16 performance; using FP32 value\n",
" torch.bfloat16: 5e12\n",
" },\n",
" \"GTX_1660Ti\": {\n",
" torch.float32: 5.5e12, # 5.5 TFLOPs for FP32 on NVIDIA GTX 1660Ti\n",
" torch.float16: 5.5e12, # No dedicated FP16 performance; using FP32 value\n",