LLMs-from-scratch/ch05/07_gpt_to_llama/README.md

# Converting GPT to Llama


This folder contains code for converting the GPT implementation from chapter 4 and 5 to Meta AI's Llama architecture in the following recommended reading order:

- [converting-gpt-to-llama2.ipynb](converting-gpt-to-llama2.ipynb): contains code to convert GPT to Llama 2 7B step by step and loads pretrained weights from Meta AI
- [converting-llama2-to-llama3.ipynb](converting-llama2-to-llama3.ipynb): contains code to convert the Llama 2 model to Llama 3, Llama 3.1, and Llama 3.2
- [standalone-llama32.ipynb](standalone-llama32.ipynb): a standalone notebook implementing Llama 3.2

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/gpt-and-all-llamas.webp">


&nbsp;
### Using Llama 3.2 via the `llms-from-scratch` package

For an easy way to use the Llama 3.2 1B and 3B models, you can also use the `llms-from-scratch` PyPI package based on the source code in this repository at [pkg/llms_from_scratch](../../pkg/llms_from_scratch).

&nbsp;
#### 1) Installation

```bash
pip install llms_from_scratch blobfile
```

(Note that `blobfile` is needed to load the tokenizer.)

&nbsp;
#### 2) Model and text generation settings

Specify which model to use:

```python
MODEL_FILE = "llama3.2-1B-instruct.pth"
# MODEL_FILE = "llama3.2-1B-base.pth"
# MODEL_FILE = "llama3.2-3B-instruct.pth"
# MODEL_FILE = "llama3.2-3B-base.pth"
```

Basic text generation settings that can be defined by the user. Note that the recommended 8192-token context size requires approximately 3 GB of VRAM for the text generation example.

```python
# Text generation settings
if "instruct" in MODEL_FILE:
    PROMPT = "What do llamas eat?"
else:
    PROMPT = "Llamas eat"

MAX_NEW_TOKENS = 150
TEMPERATURE = 0.
TOP_K = 1
```

&nbsp;
#### 3) Weight download and loading

This automatically downloads the weight file based on the model choice above:

```python
import os
import urllib.request

url = f"https://huggingface.co/rasbt/llama-3.2-from-scratch/resolve/main/{MODEL_FILE}"

if not os.path.exists(MODEL_FILE):
    urllib.request.urlretrieve(url, MODEL_FILE)
    print(f"Downloaded to {MODEL_FILE}")
```

The model weights are then loaded as follows:

```python
import torch
from llms_from_scratch.llama3 import Llama3Model

if "1B" in MODEL_FILE:
    from llms_from_scratch.llama3 import LLAMA32_CONFIG_1B as LLAMA32_CONFIG
elif "3B" in MODEL_FILE:
    from llms_from_scratch.llama3 import LLAMA32_CONFIG_3B as LLAMA32_CONFIG
else:
    raise ValueError("Incorrect model file name")

model = Llama3Model(LLAMA32_CONFIG)
model.load_state_dict(torch.load(MODEL_FILE, weights_only=True, map_location="cpu"))

device = (
    torch.device("cuda") if torch.cuda.is_available() else
    torch.device("mps") if torch.backends.mps.is_available() else
    torch.device("cpu")
)
model.to(device)
```

&nbsp;
#### 4) Initialize tokenizer

The following code downloads and initializes the tokenizer:

```python
from llms_from_scratch.llama3 import Llama3Tokenizer, ChatFormat, clean_text

TOKENIZER_FILE = "tokenizer.model"

url = f"https://huggingface.co/rasbt/llama-3.2-from-scratch/resolve/main/{TOKENIZER_FILE}"

if not os.path.exists(TOKENIZER_FILE):
    urllib.request.urlretrieve(url, TOKENIZER_FILE)
    print(f"Downloaded to {TOKENIZER_FILE}")
    
tokenizer = Llama3Tokenizer("tokenizer.model")

if "instruct" in MODEL_FILE:
    tokenizer = ChatFormat(tokenizer)
```

&nbsp;
#### 5) Generating text

Lastly, we can generate text via the following code:

```python
import time

from llms_from_scratch.ch05 import (
    generate,
    text_to_token_ids,
    token_ids_to_text
)

torch.manual_seed(123)

start = time.time()

token_ids = generate(
    model=model,
    idx=text_to_token_ids(PROMPT, tokenizer).to(device),
    max_new_tokens=MAX_NEW_TOKENS,
    context_size=LLAMA32_CONFIG["context_length"],
    top_k=TOP_K,
    temperature=TEMPERATURE
)

total_time = time.time() - start
print(f"Time: {total_time:.2f} sec")
print(f"{int(len(token_ids[0])/total_time)} tokens/sec")

if torch.cuda.is_available():
    max_mem_bytes = torch.cuda.max_memory_allocated()
    max_mem_gb = max_mem_bytes / (1024 ** 3)
    print(f"Max memory allocated: {max_mem_gb:.2f} GB")

output_text = token_ids_to_text(token_ids, tokenizer)

if "instruct" in MODEL_FILE:
    output_text = clean_text(output_text)

print("\n\nOutput text:\n\n", output_text)
```

When using the Llama 3.2 1B Instruct model, the output should look similar to the one shown below:

```
Time: 3.17 sec
50 tokens/sec
Max memory allocated: 2.91 GB


Output text:

 Llamas are herbivores, which means they primarily eat plants. Their diet consists mainly of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses and grassy meadows.
2. Hay: Llamas also eat hay, which is a dry, compressed form of grass or other plants.
3. Alfalfa: Alfalfa is a legume that is commonly used as a hay substitute in llama feed.
4. Other plants: Llamas will also eat other plants, such as clover, dandelions, and wild grasses.

It's worth noting that the specific diet of llamas can vary depending on factors such as the breed,
```

&nbsp;
#### Pro tip 1: speed up inference with FlashAttention

Instead of using `Llama3Model`, you can use `Llama3ModelFast` as a drop-in replacement. For more information, I encourage you to inspect the [pkg/llms_from_scratch/llama3.py](../../pkg/llms_from_scratch/llama3.py) code.

The `Llama3ModelFast` replaces my from-scratch scaled dot-product code in the `GroupedQueryAttention` module with PyTorch's `scaled_dot_product` function, which uses `FlashAttention` on Ampere GPUs or newer.

The following table shows a performance comparison on an A100:

|                 | Tokens/sec | Memory  |
| --------------- | ---------- | ------- |
| Llama3Model     | 42         | 2.91 GB |
| Llama3ModelFast | 54         | 2.91 GB |

&nbsp;
#### Pro tip 2: speed up inference with compilation


For up to a 4× speed-up, replace

```python
model.to(device)
```

with

```python
model = torch.compile(model)
model.to(device)
```

Note: There is a significant multi-minute upfront cost when compiling, and the speed-up takes effect after the first `generate` call. 

The following table shows a performance comparison on an A100 for consequent `generate` calls:

|                 | Tokens/sec | Memory  |
| --------------- | ---------- | ------- |
| Llama3Model     | 170        | 3.12 GB |
| Llama3ModelFast | 177        | 3.61 GB |

&nbsp;
#### Pro tip 3: speed up inference with compilation

You can significantly boost inference performance using the KV cache `Llama3Model` drop-in replacement when running the model on a CPU. (See my [Understanding and Coding the KV Cache in LLMs from Scratch](https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms) article to learn more about KV caches.)

```python
from llms_from_scratch.kv_cache.llama3 import Llama3Model
from llms_from_scratch.kv_cache.generate import generate_text_simple

model = Llama3Model(LLAMA32_CONFIG)
# ...
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(PROMPT, tokenizer).to(device),
    max_new_tokens=MAX_NEW_TOKENS,
    context_size=LLAMA32_CONFIG["context_length"],
)
```

Note that the peak memory usage is only listed for Nvidia CUDA devices, as it is easier to calculate. However, the memory usage on other devices is likely similar as it uses a similar precision format, and the KV cache storage dominates here for the generated 150-token text (however, different devices may implement matrix multiplication differently and may result in different peak memory requirements).

| Model       | Mode              | Hardware        | Tokens/sec | GPU Memory (VRAM) |
|-------------|-------------------|-----------------|------------|-------------------|
| Llama3Model | Regular           | Mac Mini M4 CPU | 1          | -                 |
| Llama3Model | Regular compiled  | Mac Mini M4 CPU | -          | -                 |
| Llama3Model | KV cache          | Mac Mini M4 CPU | 62         | -                 |
| Llama3Model | KV cache compiled | Mac Mini M4 CPU | -          | -                 |
|             |                   |                 |            |                   |
| Llama3Model | Regular           | Mac Mini M4 GPU | 15         | -                 |
| Llama3Model | Regular compiled  | Mac Mini M4 GPU | -          | -                 |
| Llama3Model | KV cache          | Mac Mini M4 GPU | 62         | -                 |
| Llama3Model | KV cache compiled | Mac Mini M4 GPU | -          | -                 |
|             |                   |                 |            |                   |
| Llama3Model | Regular           | Nvidia A100 GPU | 42         | 2.91 GB           |
| Llama3Model | Regular compiled  | Nvidia A100 GPU | 170        | 3.12 GB           |
| Llama3Model | KV cache          | Nvidia A100 GPU | 60         | 18.87 GB          |
| Llama3Model | KV cache compiled | Nvidia A100 GPU | 59         | 19.12 GB          |

Note that all settings above have been tested to produce the same text outputs.
-												GPT to Llama (#368)

* GPT to Llama

* fix urls
											
										
										
											2024-09-23 05:34:06 -07:00
+								# Converting GPT to Llama
-												Implement Llama 3.2 (#383)


											
										
										
											2024-10-05 07:30:47 -05:00
+								This folder contains code for converting the GPT implementation from chapter 4 and 5 to Meta AI's Llama architecture in the following recommended reading order:
-												GPT to Llama (#368)

* GPT to Llama

* fix urls
											
										
										
											2024-09-23 05:34:06 -07:00
-												Implement Llama 3.2 (#383)


											
										
										
											2024-10-05 07:30:47 -05:00
+								- [converting-gpt-to-llama2.ipynb](converting-gpt-to-llama2.ipynb): contains code to convert GPT to Llama 2 7B step by step and loads pretrained weights from Meta AI
 								- [converting-llama2-to-llama3.ipynb](converting-llama2-to-llama3.ipynb): contains code to convert the Llama 2 model to Llama 3, Llama 3.1, and Llama 3.2
 								- [standalone-llama32.ipynb](standalone-llama32.ipynb): a standalone notebook implementing Llama 3.2
-												Add Llama 3.2 to pkg (#591)

* Add Llama 3.2 to pkg

* remove redundant attributes

* update tests

* updates

* updates

* updates

* fix link

* fix link
											
										
										
											2025-03-31 18:59:47 -05:00
+								<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/gpt-and-all-llamas.webp">
 								&nbsp;
 								### Using Llama 3.2 via the `llms-from-scratch` package
 								For an easy way to use the Llama 3.2 1B and 3B models, you can also use the `llms-from-scratch` PyPI package based on the source code in this repository at [pkg/llms_from_scratch](../../pkg/llms_from_scratch).
 								&nbsp;
-												Llama3 from scratch improvements (#621)

* Llama3 from scratch improvements

* restore
											
										
										
											2025-04-16 18:08:26 -05:00
+								#### 1) Installation
-												Add Llama 3.2 to pkg (#591)

* Add Llama 3.2 to pkg

* remove redundant attributes

* update tests

* updates

* updates

* updates

* fix link

* fix link
											
										
										
											2025-03-31 18:59:47 -05:00
 								```bash
 								pip install llms_from_scratch blobfile
 								```
-												Llama3 from scratch improvements (#621)

* Llama3 from scratch improvements

* restore
											
										
										
											2025-04-16 18:08:26 -05:00
 								(Note that `blobfile` is needed to load the tokenizer.)
-												Add Llama 3.2 to pkg (#591)

* Add Llama 3.2 to pkg

* remove redundant attributes

* update tests

* updates

* updates

* updates

* fix link

* fix link
											
										
										
											2025-03-31 18:59:47 -05:00
+								&nbsp;
-												Llama3 from scratch improvements (#621)

* Llama3 from scratch improvements

* restore
											
										
										
											2025-04-16 18:08:26 -05:00
+								#### 2) Model and text generation settings
-												Add Llama 3.2 to pkg (#591)

* Add Llama 3.2 to pkg

* remove redundant attributes

* update tests

* updates

* updates

* updates

* fix link

* fix link
											
										
										
											2025-03-31 18:59:47 -05:00
 								Specify which model to use:
 								```python
 								MODEL_FILE = "llama3.2-1B-instruct.pth"
 								# MODEL_FILE = "llama3.2-1B-base.pth"
 								# MODEL_FILE = "llama3.2-3B-instruct.pth"
 								# MODEL_FILE = "llama3.2-3B-base.pth"
 								```
 								Basic text generation settings that can be defined by the user. Note that the recommended 8192-token context size requires approximately 3 GB of VRAM for the text generation example.
 								```python
 								# Text generation settings
 								if "instruct" in MODEL_FILE:
 								    PROMPT = "What do llamas eat?"
 								else:
 								    PROMPT = "Llamas eat"
 								MAX_NEW_TOKENS = 150
 								TEMPERATURE = 0.
 								TOP_K = 1
 								```
 								&nbsp;
-												Llama3 from scratch improvements (#621)

* Llama3 from scratch improvements

* restore
											
										
										
											2025-04-16 18:08:26 -05:00
+								#### 3) Weight download and loading
-												Add Llama 3.2 to pkg (#591)

* Add Llama 3.2 to pkg

* remove redundant attributes

* update tests

* updates

* updates

* updates

* fix link

* fix link
											
										
										
											2025-03-31 18:59:47 -05:00
 								This automatically downloads the weight file based on the model choice above:
 								```python
 								import os
 								import urllib.request
 								url = f"https://huggingface.co/rasbt/llama-3.2-from-scratch/resolve/main/{MODEL_FILE}"
 								if not os.path.exists(MODEL_FILE):
 								    urllib.request.urlretrieve(url, MODEL_FILE)
 								    print(f"Downloaded to {MODEL_FILE}")
 								```
 								The model weights are then loaded as follows:
 								```python
 								import torch
 								from llms_from_scratch.llama3 import Llama3Model
 								if "1B" in MODEL_FILE:
 								    from llms_from_scratch.llama3 import LLAMA32_CONFIG_1B as LLAMA32_CONFIG
 								elif "3B" in MODEL_FILE:
 								    from llms_from_scratch.llama3 import LLAMA32_CONFIG_3B as LLAMA32_CONFIG
 								else:
 								    raise ValueError("Incorrect model file name")
 								model = Llama3Model(LLAMA32_CONFIG)
-												Llama3 from scratch improvements (#621)

* Llama3 from scratch improvements

* restore
											
										
										
											2025-04-16 18:08:26 -05:00
+								model.load_state_dict(torch.load(MODEL_FILE, weights_only=True, map_location="cpu"))
-												Add Llama 3.2 to pkg (#591)

* Add Llama 3.2 to pkg

* remove redundant attributes

* update tests

* updates

* updates

* updates

* fix link

* fix link
											
										
										
											2025-03-31 18:59:47 -05:00
 								device = (
 								    torch.device("cuda") if torch.cuda.is_available() else
 								    torch.device("mps") if torch.backends.mps.is_available() else
 								    torch.device("cpu")
 								)
 								model.to(device)
 								```
 								&nbsp;
-												Llama3 from scratch improvements (#621)

* Llama3 from scratch improvements

* restore
											
										
										
											2025-04-16 18:08:26 -05:00
+								#### 4) Initialize tokenizer
-												Add Llama 3.2 to pkg (#591)

* Add Llama 3.2 to pkg

* remove redundant attributes

* update tests

* updates

* updates

* updates

* fix link

* fix link
											
										
										
											2025-03-31 18:59:47 -05:00
 								The following code downloads and initializes the tokenizer:
 								```python
 								from llms_from_scratch.llama3 import Llama3Tokenizer, ChatFormat, clean_text
 								TOKENIZER_FILE = "tokenizer.model"
 								url = f"https://huggingface.co/rasbt/llama-3.2-from-scratch/resolve/main/{TOKENIZER_FILE}"
 								if not os.path.exists(TOKENIZER_FILE):
 								    urllib.request.urlretrieve(url, TOKENIZER_FILE)
 								    print(f"Downloaded to {TOKENIZER_FILE}")
 								tokenizer = Llama3Tokenizer("tokenizer.model")
 								if "instruct" in MODEL_FILE:
 								    tokenizer = ChatFormat(tokenizer)
 								```
 								&nbsp;
-												Llama3 from scratch improvements (#621)

* Llama3 from scratch improvements

* restore
											
										
										
											2025-04-16 18:08:26 -05:00
+								#### 5) Generating text
-												Add Llama 3.2 to pkg (#591)

* Add Llama 3.2 to pkg

* remove redundant attributes

* update tests

* updates

* updates

* updates

* fix link

* fix link
											
										
										
											2025-03-31 18:59:47 -05:00
 								Lastly, we can generate text via the following code:
 								```python
 								import time
-												Reduce Llama 3 RoPE memory requirements (#658)

* Llama3 from scratch improvements

* Fix Llama 3 expensive RoPE memory issue

* updates

* update package

* benchmark

* remove unused rescale_theta
											
										
										
											2025-06-12 11:08:02 -05:00
+								from llms_from_scratch.ch05 import (
-												Add Llama 3.2 to pkg (#591)

* Add Llama 3.2 to pkg

* remove redundant attributes

* update tests

* updates

* updates

* updates

* fix link

* fix link
											
										
										
											2025-03-31 18:59:47 -05:00
+								    generate,
 								    text_to_token_ids,
 								    token_ids_to_text
 								)
 								torch.manual_seed(123)
 								start = time.time()
 								token_ids = generate(
 								    model=model,
 								    idx=text_to_token_ids(PROMPT, tokenizer).to(device),
 								    max_new_tokens=MAX_NEW_TOKENS,
 								    context_size=LLAMA32_CONFIG["context_length"],
 								    top_k=TOP_K,
 								    temperature=TEMPERATURE
 								)
-												Llama3 from scratch improvements (#621)

* Llama3 from scratch improvements

* restore
											
										
										
											2025-04-16 18:08:26 -05:00
+								total_time = time.time() - start
 								print(f"Time: {total_time:.2f} sec")
 								print(f"{int(len(token_ids[0])/total_time)} tokens/sec")
-												Add Llama 3.2 to pkg (#591)

* Add Llama 3.2 to pkg

* remove redundant attributes

* update tests

* updates

* updates

* updates

* fix link

* fix link
											
										
										
											2025-03-31 18:59:47 -05:00
 								if torch.cuda.is_available():
 								    max_mem_bytes = torch.cuda.max_memory_allocated()
 								    max_mem_gb = max_mem_bytes / (1024 ** 3)
 								    print(f"Max memory allocated: {max_mem_gb:.2f} GB")
 								output_text = token_ids_to_text(token_ids, tokenizer)
 								if "instruct" in MODEL_FILE:
 								    output_text = clean_text(output_text)
 								print("\n\nOutput text:\n\n", output_text)
 								```
 								When using the Llama 3.2 1B Instruct model, the output should look similar to the one shown below:
 								```
-												Llama3 from scratch improvements (#621)

* Llama3 from scratch improvements

* restore
											
										
										
											2025-04-16 18:08:26 -05:00
+								Time: 3.17 sec
 tokens/sec
-												Add Llama 3.2 to pkg (#591)

* Add Llama 3.2 to pkg

* remove redundant attributes

* update tests

* updates

* updates

* updates

* fix link

* fix link
											
										
										
											2025-03-31 18:59:47 -05:00
+								Max memory allocated: 2.91 GB
 								Output text:
 								 Llamas are herbivores, which means they primarily eat plants. Their diet consists mainly of:
 . Grasses: Llamas love to graze on various types of grasses, including tall grasses and grassy meadows.
 . Hay: Llamas also eat hay, which is a dry, compressed form of grass or other plants.
 . Alfalfa: Alfalfa is a legume that is commonly used as a hay substitute in llama feed.
 . Other plants: Llamas will also eat other plants, such as clover, dandelions, and wild grasses.
 								It's worth noting that the specific diet of llamas can vary depending on factors such as the breed,
 								```
 								&nbsp;
-												Llama3 from scratch improvements (#621)

* Llama3 from scratch improvements

* restore
											
										
										
											2025-04-16 18:08:26 -05:00
+								#### Pro tip 1: speed up inference with FlashAttention
 								Instead of using `Llama3Model`, you can use `Llama3ModelFast` as a drop-in replacement. For more information, I encourage you to inspect the [pkg/llms_from_scratch/llama3.py](../../pkg/llms_from_scratch/llama3.py) code.
 								The `Llama3ModelFast` replaces my from-scratch scaled dot-product code in the `GroupedQueryAttention` module with PyTorch's `scaled_dot_product` function, which uses `FlashAttention` on Ampere GPUs or newer.
 								The following table shows a performance comparison on an A100:
 								|                 | Tokens/sec | Memory  |
 								| --------------- | ---------- | ------- |
-												Reduce Llama 3 RoPE memory requirements (#658)

* Llama3 from scratch improvements

* Fix Llama 3 expensive RoPE memory issue

* updates

* update package

* benchmark

* remove unused rescale_theta
											
										
										
											2025-06-12 11:08:02 -05:00
+								| Llama3Model     | 42         | 2.91 GB |
 								| Llama3ModelFast | 54         | 2.91 GB |
-												Llama3 from scratch improvements (#621)

* Llama3 from scratch improvements

* restore
											
										
										
											2025-04-16 18:08:26 -05:00
 								&nbsp;
 								#### Pro tip 2: speed up inference with compilation
-												Add Llama 3.2 to pkg (#591)

* Add Llama 3.2 to pkg

* remove redundant attributes

* update tests

* updates

* updates

* updates

* fix link

* fix link
											
										
										
											2025-03-31 18:59:47 -05:00
 								For up to a 4× speed-up, replace
 								```python
 								model.to(device)
 								```
 								with
 								```python
 								model = torch.compile(model)
 								model.to(device)
 								```
-												Llama3 from scratch improvements (#621)

* Llama3 from scratch improvements

* restore
											
										
										
											2025-04-16 18:08:26 -05:00
+								Note: There is a significant multi-minute upfront cost when compiling, and the speed-up takes effect after the first `generate` call.
 								The following table shows a performance comparison on an A100 for consequent `generate` calls:
-												Add Llama 3.2 to pkg (#591)

* Add Llama 3.2 to pkg

* remove redundant attributes

* update tests

* updates

* updates

* updates

* fix link

* fix link
											
										
										
											2025-03-31 18:59:47 -05:00
-												Llama3 from scratch improvements (#621)

* Llama3 from scratch improvements

* restore
											
										
										
											2025-04-16 18:08:26 -05:00
+								|                 | Tokens/sec | Memory  |
 								| --------------- | ---------- | ------- |
-												Reduce Llama 3 RoPE memory requirements (#658)

* Llama3 from scratch improvements

* Fix Llama 3 expensive RoPE memory issue

* updates

* update package

* benchmark

* remove unused rescale_theta
											
										
										
											2025-06-12 11:08:02 -05:00
+								| Llama3Model     | 170        | 3.12 GB |
 								| Llama3ModelFast | 177        | 3.61 GB |
-												Llama 3 KV Cache (#685)

* Llama 3 KV Cache

* skip expensive tests on Gh actions

* Update __init__.py
											
										
										
											2025-06-21 10:55:20 -05:00
 								&nbsp;
 								#### Pro tip 3: speed up inference with compilation
 								You can significantly boost inference performance using the KV cache `Llama3Model` drop-in replacement when running the model on a CPU. (See my [Understanding and Coding the KV Cache in LLMs from Scratch](https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms) article to learn more about KV caches.)
 								```python
 								from llms_from_scratch.kv_cache.llama3 import Llama3Model
 								from llms_from_scratch.kv_cache.generate import generate_text_simple
 								model = Llama3Model(LLAMA32_CONFIG)
 								# ...
 								token_ids = generate_text_simple(
 								    model=model,
 								    idx=text_to_token_ids(PROMPT, tokenizer).to(device),
 								    max_new_tokens=MAX_NEW_TOKENS,
 								    context_size=LLAMA32_CONFIG["context_length"],
 								)
 								```
 								Note that the peak memory usage is only listed for Nvidia CUDA devices, as it is easier to calculate. However, the memory usage on other devices is likely similar as it uses a similar precision format, and the KV cache storage dominates here for the generated 150-token text (however, different devices may implement matrix multiplication differently and may result in different peak memory requirements).
 								| Model       | Mode              | Hardware        | Tokens/sec | GPU Memory (VRAM) |
 								|-------------|-------------------|-----------------|------------|-------------------|
 								| Llama3Model | Regular           | Mac Mini M4 CPU | 1          | -                 |
 								| Llama3Model | Regular compiled  | Mac Mini M4 CPU | -          | -                 |
 								| Llama3Model | KV cache          | Mac Mini M4 CPU | 62         | -                 |
 								| Llama3Model | KV cache compiled | Mac Mini M4 CPU | -          | -                 |
 								|             |                   |                 |            |                   |
 								| Llama3Model | Regular           | Mac Mini M4 GPU | 15         | -                 |
 								| Llama3Model | Regular compiled  | Mac Mini M4 GPU | -          | -                 |
 								| Llama3Model | KV cache          | Mac Mini M4 GPU | 62         | -                 |
 								| Llama3Model | KV cache compiled | Mac Mini M4 GPU | -          | -                 |
 								|             |                   |                 |            |                   |
 								| Llama3Model | Regular           | Nvidia A100 GPU | 42         | 2.91 GB           |
 								| Llama3Model | Regular compiled  | Nvidia A100 GPU | 170        | 3.12 GB           |
 								| Llama3Model | KV cache          | Nvidia A100 GPU | 60         | 18.87 GB          |
-												Qwen3 KV cache (#688)


											
										
										
											2025-06-21 17:34:39 -05:00
+								| Llama3Model | KV cache compiled | Nvidia A100 GPU | 59         | 19.12 GB          |
 								Note that all settings above have been tested to produce the same text outputs.