---
title: "Llama.cpp"
id: integrations-llama-cpp
description: "Llama.cpp integration for Haystack"
slug: "/integrations-llama-cpp"
---

<a id="haystack_integrations.components.generators.llama_cpp.generator"></a>

## Module haystack\_integrations.components.generators.llama\_cpp.generator

<a id="haystack_integrations.components.generators.llama_cpp.generator.LlamaCppGenerator"></a>

### LlamaCppGenerator

Provides an interface to generate text using an LLM via llama.cpp.

[llama.cpp](https://github.com/ggml-org/llama.cpp) is a project written in C/C++ for efficient inference of LLMs.
It employs the quantized GGUF format, suitable for running these models on standard machines (even without GPUs).

Usage example:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(model="zephyr-7b-beta.Q4_0.gguf", n_ctx=2048, n_batch=512)

print(generator.run("Who is the best American actor?", generation_kwargs={"max_tokens": 128}))
# {'replies': ['John Cusack'], 'meta': [{"object": "text_completion", ...}]}
```
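
The generator also drops into a Haystack pipeline. Below is a minimal sketch, assuming Haystack's `PromptBuilder` and a local `zephyr-7b-beta.Q4_0.gguf` file; the template and question are illustrative:

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder

from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

# Jinja2 template; "question" becomes a pipeline input (illustrative).
template = "Answer the question briefly.\nQuestion: {{ question }}\nAnswer:"

pipeline = Pipeline()
pipeline.add_component("prompt_builder", PromptBuilder(template=template))
pipeline.add_component("llm", LlamaCppGenerator(model="zephyr-7b-beta.Q4_0.gguf", n_ctx=2048))
# PromptBuilder's "prompt" output feeds the generator's "prompt" input.
pipeline.connect("prompt_builder.prompt", "llm.prompt")

result = pipeline.run({"prompt_builder": {"question": "Who was Ada Lovelace?"}})
print(result["llm"]["replies"])
```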

<a id="haystack_integrations.components.generators.llama_cpp.generator.LlamaCppGenerator.__init__"></a>

#### LlamaCppGenerator.\_\_init\_\_

```python
def __init__(model: str,
             n_ctx: Optional[int] = 0,
             n_batch: Optional[int] = 512,
             model_kwargs: Optional[Dict[str, Any]] = None,
             generation_kwargs: Optional[Dict[str, Any]] = None)
```

**Arguments**:

- `model`: The path of a quantized model for text generation, for example, "zephyr-7b-beta.Q4_0.gguf".
If the model path is also specified in `model_kwargs`, this parameter is ignored.
- `n_ctx`: The number of tokens in the context. When set to 0, the context is taken from the model.
- `n_batch`: The maximum batch size for prompt processing.
- `model_kwargs`: Dictionary containing keyword arguments used to initialize the LLM for text generation.
These keyword arguments provide fine-grained control over model loading.
In case of duplication, these kwargs override the `model`, `n_ctx`, and `n_batch` init parameters.
For more information on the available kwargs, see the
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__).
- `generation_kwargs`: A dictionary containing keyword arguments to customize text generation.
For more information on the available kwargs, see the
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion).
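
For example, a sketch of fine-grained loading and sampling control. Here `n_gpu_layers` and `seed` are `llama_cpp.Llama.__init__` kwargs, while `max_tokens` and `temperature` belong to `llama_cpp.Llama.create_completion`; the values are illustrative, not recommendations:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="zephyr-7b-beta.Q4_0.gguf",
    n_ctx=2048,
    # Forwarded to llama_cpp.Llama.__init__: offload all layers to the GPU
    # and fix the sampling seed (illustrative values).
    model_kwargs={"n_gpu_layers": -1, "seed": 42},
    # Defaults forwarded to llama_cpp.Llama.create_completion on each run.
    generation_kwargs={"max_tokens": 128, "temperature": 0.7},
)
```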

<a id="haystack_integrations.components.generators.llama_cpp.generator.LlamaCppGenerator.run"></a>

#### LlamaCppGenerator.run

```python
@component.output_types(replies=List[str], meta=List[Dict[str, Any]])
def run(
    prompt: str,
    generation_kwargs: Optional[Dict[str, Any]] = None
) -> Dict[str, Union[List[str], List[Dict[str, Any]]]]
```

Run the text generation model on the given prompt.

**Arguments**:

- `prompt`: The prompt to be sent to the generative model.
- `generation_kwargs`: A dictionary containing keyword arguments to customize text generation.
For more information on the available kwargs, see the
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion).

**Returns**:

A dictionary with the following keys:
- `replies`: The list of replies generated by the model.
- `meta`: Metadata about the request.
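
Per-call `generation_kwargs` make it possible to vary sampling between runs without reloading the model. A short sketch, assuming the per-call values take precedence over the init-time defaults:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="zephyr-7b-beta.Q4_0.gguf",
    generation_kwargs={"max_tokens": 128},  # default for every run
)

# Greedy, shorter completion for this call only (illustrative values).
result = generator.run(
    "Briefly explain GGUF quantization.",
    generation_kwargs={"max_tokens": 64, "temperature": 0.0},
)
print(result["replies"][0])
print(result["meta"][0])
```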