---
title: "LlamaCppChatGenerator"
id: llamacppchatgenerator
slug: "/llamacppchatgenerator"
description: "`LlamaCppChatGenerator` enables chat completion using an LLM running on Llama.cpp."
---

# LlamaCppChatGenerator

`LlamaCppChatGenerator` enables chat completion using an LLM running on Llama.cpp.

| | |
| :------------------------------------- | :------------------------------------------------------------------------------------------------------------------------ |
| **Most common position in a pipeline** | After a [`ChatPromptBuilder`](../builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | "model": The path of the model to use |
| **Mandatory run variables** | "messages": A list of [`ChatMessage`](/docs/data-classes#chatmessage) instances representing the input messages |
| **Output variables** | "replies": A list of [`ChatMessage`](/docs/data-classes#chatmessage) instances with all the replies generated by the LLM |
| **API reference** | [Llama.cpp](/reference/integrations-llama-cpp) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp |

## Overview

[Llama.cpp](https://github.com/ggerganov/llama.cpp) is a library written in C/C++ for efficient inference of Large Language Models. It leverages the quantized GGUF format, which dramatically reduces memory requirements and accelerates inference. This makes it possible to run LLMs efficiently on standard machines, even without GPUs.

`Llama.cpp` uses the quantized binary file of the LLM in GGUF format, which can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf). `LlamaCppChatGenerator` supports models running on `Llama.cpp` by taking the path to the locally saved GGUF file as the `model` parameter at initialization.

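As an illustration, here is a minimal sketch of fetching a GGUF file with the `huggingface_hub` library (installed with `pip install huggingface_hub`). The repository ID and filename below are examples only; substitute the model you actually want to use:

```python
from huggingface_hub import hf_hub_download

# Download a quantized GGUF file to the current directory.
# Repo and filename are illustrative; pick any GGUF model from Hugging Face.
model_path = hf_hub_download(
    repo_id="TheBloke/openchat-3.5-1210-GGUF",
    filename="openchat-3.5-1210.Q3_K_S.gguf",
    local_dir=".",
)
print(model_path)  # pass this path as the `model` parameter of LlamaCppChatGenerator
```
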
## Installation

Install the `llama-cpp-haystack` package to use this integration:

```shell
pip install llama-cpp-haystack
```

### Using a different compute backend

By default, the installation builds `llama.cpp` for CPU on Linux and Windows and uses Metal on macOS. To use another compute backend:

1. Follow the instructions on the [llama.cpp installation page](https://github.com/abetlen/llama-cpp-python#installation) to install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for your preferred compute backend.
2. Install [llama-cpp-haystack](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp) using the command above.

For example, to use `llama-cpp-haystack` with the **cuBLAS (CUDA) backend**, run the following commands:

```shell
export GGML_CUDA=1
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
pip install llama-cpp-haystack
```

## Usage

1. Download the GGUF version of the desired LLM. The GGUF versions of popular models can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf).
2. Initialize `LlamaCppChatGenerator` with the path to the GGUF file and specify the required model and text generation parameters:

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()

messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
```

### Passing additional model parameters

The `model`, `n_ctx`, and `n_batch` arguments are exposed for convenience and can be passed directly to the Generator as keyword arguments at initialization. Note that `model` translates to `llama.cpp`'s `model_path` parameter.

The `model_kwargs` parameter can be used to pass additional arguments when initializing the model. In case of duplication, these kwargs override the `model`, `n_ctx`, and `n_batch` initialization parameters.

See [Llama.cpp's LLM documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for more information on the available model arguments.

**Note**: Llama.cpp automatically extracts the `chat_template` from the model metadata and uses it to format the ChatMessages. You can override this by passing a custom `chat_handler` or `chat_format` as a model parameter.

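For instance, here is a minimal sketch of switching the chat format through `model_kwargs`. `"chatml"` is one of the chat formats built into llama-cpp-python and is used here purely as an illustration; check that the format matches how your model was trained:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Override the chat template extracted from the model metadata
# with llama-cpp-python's built-in "chatml" format (illustrative choice).
generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"chat_format": "chatml"},
)
```
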
For example, to offload the model to the GPU during initialization:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
)
generator.warm_up()

messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages, generation_kwargs={"max_tokens": 128})
generated_reply = result["replies"][0].content
print(generated_reply)
```

### Passing text generation parameters

The `generation_kwargs` parameter can be used to pass additional generation arguments, such as `max_tokens`, `temperature`, `top_k`, and `top_p`, to the model during inference.

See [Llama.cpp's Chat Completion API documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) for more information on the available generation arguments.

**Note**: JSON mode, Function Calling, and Tools are all supported through `generation_kwargs`. See the [llama-cpp-python GitHub README](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#json-and-json-schema-mode) for more information on how to use them.

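As an illustration, here is a minimal sketch of JSON mode, assuming the `response_format` argument is forwarded unchanged to llama-cpp-python's `create_chat_completion`:

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()

messages = [ChatMessage.from_user("List three famous American actors as a JSON object with the key 'actors'.")]
# JSON mode constrains the model to emit valid JSON.
result = generator.run(
    messages,
    generation_kwargs={"response_format": {"type": "json_object"}, "max_tokens": 128},
)
print(result["replies"][0].content)
```
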
For example, to set the `max_tokens` and `temperature`:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()

messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
```

`generation_kwargs` can also be passed directly to the generator's `run` method:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()

messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(
    messages,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
```

### In a pipeline

We use `LlamaCppChatGenerator` in a Retrieval-Augmented Generation pipeline on the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) dataset from Hugging Face and generate answers using the [OpenChat-3.5](https://huggingface.co/openchat/openchat-3.5-1210) LLM.

Load the dataset:

```python
# Install Hugging Face Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders import ChatPromptBuilder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import ChatMessage

# Import LlamaCppChatGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Load the first 100 rows of the Simple Wikipedia dataset from Hugging Face
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]
```

Index the documents into the `InMemoryDocumentStore` using the `SentenceTransformersDocumentEmbedder` and `DocumentWriter`:

```python
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
# Install Sentence Transformers using "pip install sentence-transformers"
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

# Indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=doc_store), name="DocWriter")
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
```

Create the RAG pipeline and add the `LlamaCppChatGenerator` to it:

```python
system_message = ChatMessage.from_system(
    """
    Answer the question using the provided context.
    Context:
    {% for doc in documents %}
    {{ doc.content }}
    {% endfor %}
    """
)
user_message = ChatMessage.from_user("Question: {{question}}")
assistant_message = ChatMessage.from_assistant("Answer: ")

chat_template = [system_message, user_message, assistant_message]

rag_pipeline = Pipeline()

text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

# Load the LLM using LlamaCppChatGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppChatGenerator(model=model_path, n_ctx=4096, n_batch=128)

rag_pipeline.add_component(
    instance=text_embedder,
    name="text_embedder",
)
rag_pipeline.add_component(instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3), name="retriever")
rag_pipeline.add_component(instance=ChatPromptBuilder(template=chat_template), name="prompt_builder")
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm", "answer_builder")
rag_pipeline.connect("retriever", "answer_builder.documents")
```

Run the pipeline:

```python
question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
        "answer_builder": {"query": question},
    }
)

generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
# The Joker movie was released on October 4, 2019.
```