
---
title: "Generators"
id: generators-api
description: "Enables text generation using LLMs."
slug: "/generators-api"
---
<a id="azure"></a>
# Module azure
<a id="azure.AzureOpenAIGenerator"></a>
## AzureOpenAIGenerator
Generates text using OpenAI's large language models (LLMs).
It works with gpt-4-type models and supports streaming responses
from the OpenAI API.
You can customize how the text is generated by passing parameters to the
OpenAI API. Use the `**generation_kwargs` argument when you initialize
the component or when you run it. Any parameter that works with
`openai.ChatCompletion.create` will work here too.
For details on OpenAI API parameters, see
[OpenAI documentation](https://platform.openai.com/docs/api-reference/chat).
### Usage example
```python
from haystack.components.generators import AzureOpenAIGenerator
from haystack.utils import Secret
client = AzureOpenAIGenerator(
azure_endpoint="<Your Azure endpoint e.g. `https://your-company.azure.openai.com/>",
api_key=Secret.from_token("<your-api-key>"),
azure_deployment="<this a model name, e.g. gpt-4o-mini>")
response = client.run("What's Natural Language Processing? Be brief.")
print(response)
```
```
>> {'replies': ['Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
>> the interaction between computers and human language. It involves enabling computers to understand, interpret,
>> and respond to natural human language in a way that is both meaningful and useful.'], 'meta': [{'model':
>> 'gpt-4o-mini', 'index': 0, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 16,
>> 'completion_tokens': 49, 'total_tokens': 65}}]}
```
<a id="azure.AzureOpenAIGenerator.__init__"></a>
#### AzureOpenAIGenerator.\_\_init\_\_
```python
def __init__(azure_endpoint: Optional[str] = None,
api_version: Optional[str] = "2023-05-15",
azure_deployment: Optional[str] = "gpt-4o-mini",
api_key: Optional[Secret] = Secret.from_env_var(
"AZURE_OPENAI_API_KEY", strict=False),
azure_ad_token: Optional[Secret] = Secret.from_env_var(
"AZURE_OPENAI_AD_TOKEN", strict=False),
organization: Optional[str] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
system_prompt: Optional[str] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None,
http_client_kwargs: Optional[dict[str, Any]] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
default_headers: Optional[dict[str, str]] = None,
*,
azure_ad_token_provider: Optional[AzureADTokenProvider] = None)
```
Initialize the Azure OpenAI Generator.
**Arguments**:
- `azure_endpoint`: The endpoint of the deployed model, for example `https://example-resource.azure.openai.com/`.
- `api_version`: The version of the API to use. Defaults to 2023-05-15.
- `azure_deployment`: The deployment of the model, usually the model name.
- `api_key`: The API key to use for authentication.
- `azure_ad_token`: [Azure Active Directory token](https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id).
- `organization`: Your organization ID, defaults to `None`. For help, see
[Setting up your organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization).
- `streaming_callback`: A callback function called when a new token is received from the stream.
It accepts [StreamingChunk](https://docs.haystack.deepset.ai/docs/data-classes#streamingchunk)
as an argument.
- `system_prompt`: The system prompt to use for text generation. If not provided, the system prompt is
omitted, and the default system prompt of the model is used.
- `timeout`: Timeout for AzureOpenAI client. If not set, it is inferred from the
`OPENAI_TIMEOUT` environment variable or set to 30.
- `max_retries`: Maximum retries to establish contact with AzureOpenAI if it returns an internal error.
If not set, it is inferred from the `OPENAI_MAX_RETRIES` environment variable or set to 5.
- `http_client_kwargs`: A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
- `generation_kwargs`: Other parameters to use for the model, sent directly to
the OpenAI endpoint. See [OpenAI documentation](https://platform.openai.com/docs/api-reference/chat) for
more details.
Some of the supported parameters:
- `max_completion_tokens`: An upper bound for the number of tokens that can be generated for a completion,
including visible output tokens and reasoning tokens.
- `temperature`: The sampling temperature to use. Higher values mean the model takes more risks.
Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
- `top_p`: An alternative to sampling with temperature, called nucleus sampling, where the model
considers the results of the tokens with top_p probability mass. For example, 0.1 means only the tokens
comprising the top 10% probability mass are considered.
- `n`: The number of completions to generate for each prompt. For example, with 3 prompts and n=2,
the LLM will generate two completions per prompt, resulting in 6 completions total.
- `stop`: One or more sequences after which the LLM should stop generating tokens.
- `presence_penalty`: The penalty applied if a token is already present.
Higher values make the model less likely to repeat the token.
- `frequency_penalty`: Penalty applied if a token has already been generated.
Higher values make the model less likely to repeat the token.
- `logit_bias`: Adds a logit bias to specific tokens. The keys of the dictionary are tokens, and the
values are the bias to add to that token.
- `default_headers`: Default headers to use for the AzureOpenAI client.
- `azure_ad_token_provider`: A function that returns an Azure Active Directory token. It is invoked on
every request.
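A minimal streaming sketch, assuming the `StreamingChunk` passed to the callback exposes the generated text via its `content` attribute and that the endpoint value below is a placeholder:
```python
from haystack.components.generators import AzureOpenAIGenerator

# Print each chunk of generated text as it arrives.
def print_chunk(chunk):
    print(chunk.content, end="", flush=True)

client = AzureOpenAIGenerator(
    azure_endpoint="https://example-resource.azure.openai.com/",  # placeholder endpoint
    azure_deployment="gpt-4o-mini",
    streaming_callback=print_chunk,
)
client.run("Explain what a transformer model is in two sentences.")
```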
<a id="azure.AzureOpenAIGenerator.to_dict"></a>
#### AzureOpenAIGenerator.to\_dict
```python
def to_dict() -> dict[str, Any]
```
Serialize this component to a dictionary.
**Returns**:
The serialized component as a dictionary.
<a id="azure.AzureOpenAIGenerator.from_dict"></a>
#### AzureOpenAIGenerator.from\_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AzureOpenAIGenerator"
```
Deserialize this component from a dictionary.
**Arguments**:
- `data`: The dictionary representation of this component.
**Returns**:
The deserialized component instance.
<a id="azure.AzureOpenAIGenerator.run"></a>
#### AzureOpenAIGenerator.run
```python
@component.output_types(replies=list[str], meta=list[dict[str, Any]])
def run(prompt: str,
system_prompt: Optional[str] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None)
```
Invoke the text generation inference based on the provided messages and generation parameters.
**Arguments**:
- `prompt`: The string prompt to use for text generation.
- `system_prompt`: The system prompt to use for text generation. If omitted, the system prompt defined
at initialization time is used, if any.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
- `generation_kwargs`: Additional keyword arguments for text generation. These parameters will potentially override the parameters
passed in the `__init__` method. For more details on the parameters supported by the OpenAI API, refer to
the OpenAI [documentation](https://platform.openai.com/docs/api-reference/chat/create).
**Returns**:
A list of strings containing the generated responses and a list of dictionaries containing the metadata
for each response.
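A short sketch of run-time overrides, assuming the endpoint and API key are available through the environment variables described in `__init__`:
```python
from haystack.components.generators import AzureOpenAIGenerator

client = AzureOpenAIGenerator()  # assumes endpoint and key are read from the environment

# Arguments passed to run() override the values set at initialization.
result = client.run(
    prompt="Summarize what a vector database is in one sentence.",
    system_prompt="You answer concisely.",
    generation_kwargs={"temperature": 0.2, "max_completion_tokens": 60},
)
print(result["replies"][0])
```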
<a id="hugging_face_local"></a>
# Module hugging\_face\_local
<a id="hugging_face_local.HuggingFaceLocalGenerator"></a>
## HuggingFaceLocalGenerator
Generates text using models from Hugging Face that run locally.
LLMs running locally may need powerful hardware.
### Usage example
```python
from haystack.components.generators import HuggingFaceLocalGenerator
generator = HuggingFaceLocalGenerator(
model="google/flan-t5-large",
task="text2text-generation",
generation_kwargs={"max_new_tokens": 100, "temperature": 0.9})
generator.warm_up()
print(generator.run("Who is the best American actor?"))
# {'replies': ['John Cusack']}
```
<a id="hugging_face_local.HuggingFaceLocalGenerator.__init__"></a>
#### HuggingFaceLocalGenerator.\_\_init\_\_
```python
def __init__(model: str = "google/flan-t5-base",
task: Optional[Literal["text-generation",
"text2text-generation"]] = None,
device: Optional[ComponentDevice] = None,
token: Optional[Secret] = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False),
generation_kwargs: Optional[dict[str, Any]] = None,
huggingface_pipeline_kwargs: Optional[dict[str, Any]] = None,
stop_words: Optional[list[str]] = None,
streaming_callback: Optional[StreamingCallbackT] = None)
```
Creates an instance of a HuggingFaceLocalGenerator.
**Arguments**:
- `model`: The Hugging Face text generation model name or path.
- `task`: The task for the Hugging Face pipeline. Possible options:
- `text-generation`: Supported by decoder models, like GPT.
- `text2text-generation`: Supported by encoder-decoder models, like T5.
If the task is specified in `huggingface_pipeline_kwargs`, this parameter is ignored.
If not specified, the component calls the Hugging Face API to infer the task from the model name.
- `device`: The device for loading the model. If `None`, automatically selects the default device.
If a device or device map is specified in `huggingface_pipeline_kwargs`, it overrides this parameter.
- `token`: The token to use as HTTP bearer authorization for remote files.
If the token is specified in `huggingface_pipeline_kwargs`, this parameter is ignored.
- `generation_kwargs`: A dictionary with keyword arguments to customize text generation.
Some examples: `max_length`, `max_new_tokens`, `temperature`, `top_k`, `top_p`.
See Hugging Face's documentation for more information:
- [customize-text-generation](https://huggingface.co/docs/transformers/main/en/generation_strategies#customize-text-generation)
- [transformers.GenerationConfig](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig)
- `huggingface_pipeline_kwargs`: Dictionary with keyword arguments to initialize the
Hugging Face pipeline for text generation.
These keyword arguments provide fine-grained control over the Hugging Face pipeline.
In case of duplication, these kwargs override `model`, `task`, `device`, and `token` init parameters.
For available kwargs, see [Hugging Face documentation](https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.pipeline.task).
In this dictionary, you can also include `model_kwargs` to specify the kwargs for model initialization:
[transformers.PreTrainedModel.from_pretrained](https://huggingface.co/docs/transformers/en/main_classes/model#transformers.PreTrainedModel.from_pretrained)
- `stop_words`: If the model generates a stop word, the generation stops.
If you provide this parameter, don't specify the `stopping_criteria` in `generation_kwargs`.
For some chat models, the output includes both the new text and the original prompt.
In these cases, make sure your prompt has no stop words.
- `streaming_callback`: An optional callable for handling streaming responses.
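A configuration sketch combining `stop_words` and `generation_kwargs`; the stop word and sampling values below are illustrative only:
```python
from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="google/flan-t5-base",
    task="text2text-generation",
    stop_words=["###"],  # illustrative stop word: generation halts when it is produced
    generation_kwargs={"max_new_tokens": 64, "temperature": 0.7},
)
generator.warm_up()  # loads the model before the first call to run()
print(generator.run("Translate to German: Where is the train station?"))
```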
<a id="hugging_face_local.HuggingFaceLocalGenerator.warm_up"></a>
#### HuggingFaceLocalGenerator.warm\_up
```python
def warm_up()
```
Initializes the component.
<a id="hugging_face_local.HuggingFaceLocalGenerator.to_dict"></a>
#### HuggingFaceLocalGenerator.to\_dict
```python
def to_dict() -> dict[str, Any]
```
Serializes the component to a dictionary.
**Returns**:
Dictionary with serialized data.
<a id="hugging_face_local.HuggingFaceLocalGenerator.from_dict"></a>
#### HuggingFaceLocalGenerator.from\_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HuggingFaceLocalGenerator"
```
Deserializes the component from a dictionary.
**Arguments**:
- `data`: The dictionary to deserialize from.
**Returns**:
The deserialized component.
<a id="hugging_face_local.HuggingFaceLocalGenerator.run"></a>
#### HuggingFaceLocalGenerator.run
```python
@component.output_types(replies=list[str])
def run(prompt: str,
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None)
```
Run the text generation model on the given prompt.
**Arguments**:
- `prompt`: A string representing the prompt.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
- `generation_kwargs`: Additional keyword arguments for text generation.
**Returns**:
A dictionary containing the generated replies.
- replies: A list of strings representing the generated replies.
<a id="hugging_face_api"></a>
# Module hugging\_face\_api
<a id="hugging_face_api.HuggingFaceAPIGenerator"></a>
## HuggingFaceAPIGenerator
Generates text using Hugging Face APIs.
Use it with the following Hugging Face APIs:
- [Paid Inference Endpoints](https://huggingface.co/inference-endpoints)
- [Self-hosted Text Generation Inference](https://github.com/huggingface/text-generation-inference)
**Note:** As of July 2025, the Hugging Face Inference API no longer offers generative models through the
`text_generation` endpoint. Generative models are now only available through providers supporting the
`chat_completion` endpoint. As a result, this component might no longer work with the Hugging Face Inference API.
Use the `HuggingFaceAPIChatGenerator` component, which supports the `chat_completion` endpoint.
### Usage examples
#### With Hugging Face Inference Endpoints
```python
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret
generator = HuggingFaceAPIGenerator(api_type="inference_endpoints",
api_params={"url": "<your-inference-endpoint-url>"},
token=Secret.from_token("<your-api-key>"))
result = generator.run(prompt="What's Natural Language Processing?")
print(result)
```
#### With self-hosted text generation inference
```python
from haystack.components.generators import HuggingFaceAPIGenerator
generator = HuggingFaceAPIGenerator(api_type="text_generation_inference",
api_params={"url": "http://localhost:8080"})
result = generator.run(prompt="What's Natural Language Processing?")
print(result)
```
#### With the free serverless inference API
Be aware that this example might not work as the Hugging Face Inference API no longer offers models that support the
`text_generation` endpoint. Use the `HuggingFaceAPIChatGenerator` for generative models through the
`chat_completion` endpoint.
```python
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret
generator = HuggingFaceAPIGenerator(api_type="serverless_inference_api",
api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
token=Secret.from_token("<your-api-key>"))
result = generator.run(prompt="What's Natural Language Processing?")
print(result)
```
<a id="hugging_face_api.HuggingFaceAPIGenerator.__init__"></a>
#### HuggingFaceAPIGenerator.\_\_init\_\_
```python
def __init__(api_type: Union[HFGenerationAPIType, str],
api_params: dict[str, str],
token: Optional[Secret] = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False),
generation_kwargs: Optional[dict[str, Any]] = None,
stop_words: Optional[list[str]] = None,
streaming_callback: Optional[StreamingCallbackT] = None)
```
Initialize the HuggingFaceAPIGenerator instance.
**Arguments**:
- `api_type`: The type of Hugging Face API to use. Available types:
- `text_generation_inference`: See [TGI](https://github.com/huggingface/text-generation-inference).
- `inference_endpoints`: See [Inference Endpoints](https://huggingface.co/inference-endpoints).
- `serverless_inference_api`: See [Serverless Inference API](https://huggingface.co/inference-api).
This might no longer work due to changes in the models offered in the Hugging Face Inference API.
Please use the `HuggingFaceAPIChatGenerator` component instead.
- `api_params`: A dictionary with the following keys:
- `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.
- `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or
`TEXT_GENERATION_INFERENCE`.
- Other parameters specific to the chosen API type, such as `timeout`, `headers`, `provider` etc.
- `token`: The Hugging Face token to use as HTTP bearer authorization.
Check your HF token in your [account settings](https://huggingface.co/settings/tokens).
- `generation_kwargs`: A dictionary with keyword arguments to customize text generation. Some examples: `max_new_tokens`,
`temperature`, `top_k`, `top_p`.
For details, see the [Hugging Face documentation](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client#huggingface_hub.InferenceClient.text_generation).
- `stop_words`: An optional list of strings representing the stop words.
- `streaming_callback`: An optional callable for handling streaming responses.
<a id="hugging_face_api.HuggingFaceAPIGenerator.to_dict"></a>
#### HuggingFaceAPIGenerator.to\_dict
```python
def to_dict() -> dict[str, Any]
```
Serialize this component to a dictionary.
**Returns**:
A dictionary containing the serialized component.
<a id="hugging_face_api.HuggingFaceAPIGenerator.from_dict"></a>
#### HuggingFaceAPIGenerator.from\_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HuggingFaceAPIGenerator"
```
Deserialize this component from a dictionary.
<a id="hugging_face_api.HuggingFaceAPIGenerator.run"></a>
#### HuggingFaceAPIGenerator.run
```python
@component.output_types(replies=list[str], meta=list[dict[str, Any]])
def run(prompt: str,
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None)
```
Invoke the text generation inference for the given prompt and generation parameters.
**Arguments**:
- `prompt`: A string representing the prompt.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
- `generation_kwargs`: Additional keyword arguments for text generation.
**Returns**:
A dictionary with the generated replies and metadata. Both are lists of length n.
- replies: A list of strings representing the generated replies.
- meta: A list of dictionaries containing the metadata for each reply.
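A streaming sketch against a self-hosted TGI server, assuming a server is reachable at the placeholder URL below and that `StreamingChunk.content` carries the newly generated text:
```python
from haystack.components.generators import HuggingFaceAPIGenerator

generator = HuggingFaceAPIGenerator(
    api_type="text_generation_inference",
    api_params={"url": "http://localhost:8080"},  # assumes a locally running TGI server
)

# Print each token as it arrives instead of waiting for the full reply.
result = generator.run(
    prompt="What's Natural Language Processing?",
    streaming_callback=lambda chunk: print(chunk.content, end="", flush=True),
    generation_kwargs={"max_new_tokens": 80},
)
```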
<a id="openai"></a>
# Module openai
<a id="openai.OpenAIGenerator"></a>
## OpenAIGenerator
Generates text using OpenAI's large language models (LLMs).
It works with the gpt-4 and o-series models and supports streaming responses
from the OpenAI API. It uses strings as input and output.
You can customize how the text is generated by passing parameters to the
OpenAI API. Use the `**generation_kwargs` argument when you initialize
the component or when you run it. Any parameter that works with
`openai.ChatCompletion.create` will work here too.
For details on OpenAI API parameters, see
[OpenAI documentation](https://platform.openai.com/docs/api-reference/chat).
### Usage example
```python
from haystack.components.generators import OpenAIGenerator
client = OpenAIGenerator()
response = client.run("What's Natural Language Processing? Be brief.")
print(response)
>> {'replies': ['Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
>> the interaction between computers and human language. It involves enabling computers to understand, interpret,
>> and respond to natural human language in a way that is both meaningful and useful.'], 'meta': [{'model':
>> 'gpt-4o-mini', 'index': 0, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 16,
>> 'completion_tokens': 49, 'total_tokens': 65}}]}
```
<a id="openai.OpenAIGenerator.__init__"></a>
#### OpenAIGenerator.\_\_init\_\_
```python
def __init__(api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
model: str = "gpt-4o-mini",
streaming_callback: Optional[StreamingCallbackT] = None,
api_base_url: Optional[str] = None,
organization: Optional[str] = None,
system_prompt: Optional[str] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None,
http_client_kwargs: Optional[dict[str, Any]] = None)
```
Creates an instance of OpenAIGenerator. Unless specified otherwise in `model`, uses OpenAI's gpt-4o-mini.
By setting the `OPENAI_TIMEOUT` and `OPENAI_MAX_RETRIES` environment variables, you can change the timeout
and max_retries parameters in the OpenAI client.
**Arguments**:
- `api_key`: The OpenAI API key to connect to OpenAI.
- `model`: The name of the model to use.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
The callback function accepts StreamingChunk as an argument.
- `api_base_url`: An optional base URL.
- `organization`: The Organization ID, defaults to `None`.
- `system_prompt`: The system prompt to use for text generation. If not provided, the system prompt is
omitted, and the default system prompt of the model is used.
- `generation_kwargs`: Other parameters to use for the model. These parameters are all sent directly to
the OpenAI endpoint. See OpenAI [documentation](https://platform.openai.com/docs/api-reference/chat) for
more details.
Some of the supported parameters:
- `max_completion_tokens`: An upper bound for the number of tokens that can be generated for a completion,
including visible output tokens and reasoning tokens.
- `temperature`: What sampling temperature to use. Higher values mean the model will take more risks.
Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
- `top_p`: An alternative to sampling with temperature, called nucleus sampling, where the model
considers the results of the tokens with top_p probability mass. So, 0.1 means only the tokens
comprising the top 10% probability mass are considered.
- `n`: How many completions to generate for each prompt. For example, if the LLM gets 3 prompts and n is 2,
it will generate two completions for each of the three prompts, ending up with 6 completions in total.
- `stop`: One or more sequences after which the LLM should stop generating tokens.
- `presence_penalty`: The penalty applied if a token is already present.
Higher values make the model less likely to repeat the token.
- `frequency_penalty`: The penalty applied if a token has already been generated.
Higher values make the model less likely to repeat the token.
- `logit_bias`: Add a logit bias to specific tokens. The keys of the dictionary are tokens, and the
values are the bias to add to that token.
- `timeout`: Timeout for OpenAI client calls. If not set, it is inferred from the `OPENAI_TIMEOUT` environment variable
or set to 30.
- `max_retries`: Maximum retries to establish contact with OpenAI if it returns an internal error. If not set, it is
inferred from the `OPENAI_MAX_RETRIES` environment variable or set to 5.
- `http_client_kwargs`: A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
<a id="openai.OpenAIGenerator.to_dict"></a>
#### OpenAIGenerator.to\_dict
```python
def to_dict() -> dict[str, Any]
```
Serialize this component to a dictionary.
**Returns**:
The serialized component as a dictionary.
<a id="openai.OpenAIGenerator.from_dict"></a>
#### OpenAIGenerator.from\_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "OpenAIGenerator"
```
Deserialize this component from a dictionary.
**Arguments**:
- `data`: The dictionary representation of this component.
**Returns**:
The deserialized component instance.
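A serialization round-trip sketch; the configuration values are placeholders:
```python
from haystack.components.generators import OpenAIGenerator

generator = OpenAIGenerator(model="gpt-4o-mini", generation_kwargs={"temperature": 0.2})

# to_dict() captures the init parameters, and from_dict() rebuilds an equivalent component,
# e.g. when storing and reloading pipeline configurations.
data = generator.to_dict()
restored = OpenAIGenerator.from_dict(data)
assert restored.to_dict() == data
```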
<a id="openai.OpenAIGenerator.run"></a>
#### OpenAIGenerator.run
```python
@component.output_types(replies=list[str], meta=list[dict[str, Any]])
def run(prompt: str,
system_prompt: Optional[str] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None)
```
Invoke the text generation inference based on the provided messages and generation parameters.
**Arguments**:
- `prompt`: The string prompt to use for text generation.
- `system_prompt`: The system prompt to use for text generation. If omitted, the system prompt defined
at initialization time is used, if any.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
- `generation_kwargs`: Additional keyword arguments for text generation. These parameters will potentially override the parameters
passed in the `__init__` method. For more details on the parameters supported by the OpenAI API, refer to
the OpenAI [documentation](https://platform.openai.com/docs/api-reference/chat/create).
**Returns**:
A list of strings containing the generated responses and a list of dictionaries containing the metadata
for each response.
<a id="openai_dalle"></a>
# Module openai\_dalle
<a id="openai_dalle.DALLEImageGenerator"></a>
## DALLEImageGenerator
Generates images using OpenAI's DALL-E model.
For details on OpenAI API parameters, see
[OpenAI documentation](https://platform.openai.com/docs/api-reference/images/create).
### Usage example
```python
from haystack.components.generators import DALLEImageGenerator
image_generator = DALLEImageGenerator()
response = image_generator.run("Show me a picture of a black cat.")
print(response)
```
<a id="openai_dalle.DALLEImageGenerator.__init__"></a>
#### DALLEImageGenerator.\_\_init\_\_
```python
def __init__(model: str = "dall-e-3",
quality: Literal["standard", "hd"] = "standard",
size: Literal["256x256", "512x512", "1024x1024", "1792x1024",
"1024x1792"] = "1024x1024",
response_format: Literal["url", "b64_json"] = "url",
api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
api_base_url: Optional[str] = None,
organization: Optional[str] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None,
http_client_kwargs: Optional[dict[str, Any]] = None)
```
Creates an instance of DALLEImageGenerator. Unless specified otherwise in `model`, uses OpenAI's dall-e-3.
**Arguments**:
- `model`: The model to use for image generation. Can be "dall-e-2" or "dall-e-3".
- `quality`: The quality of the generated image. Can be "standard" or "hd".
- `size`: The size of the generated images.
Must be one of 256x256, 512x512, or 1024x1024 for dall-e-2.
Must be one of 1024x1024, 1792x1024, or 1024x1792 for dall-e-3 models.
- `response_format`: The format of the response. Can be "url" or "b64_json".
- `api_key`: The OpenAI API key to connect to OpenAI.
- `api_base_url`: An optional base URL.
- `organization`: The Organization ID, defaults to `None`.
- `timeout`: Timeout for OpenAI Client calls. If not set, it is inferred from the `OPENAI_TIMEOUT` environment variable
or set to 30.
- `max_retries`: Maximum retries to establish contact with OpenAI if it returns an internal error. If not set, it is inferred
from the `OPENAI_MAX_RETRIES` environment variable or set to 5.
- `http_client_kwargs`: A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
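A configuration sketch using the parameters above; the prompt and settings are illustrative, and the API key is assumed to be set in the `OPENAI_API_KEY` environment variable:
```python
from haystack.components.generators import DALLEImageGenerator

image_generator = DALLEImageGenerator(
    model="dall-e-3",
    size="1792x1024",
    quality="hd",
    response_format="b64_json",  # return base64-encoded image data instead of URLs
)
image_generator.warm_up()
result = image_generator.run("A watercolor painting of a lighthouse at dawn.")
print(result["revised_prompt"])
```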
<a id="openai_dalle.DALLEImageGenerator.warm_up"></a>
#### DALLEImageGenerator.warm\_up
```python
def warm_up() -> None
```
Warm up the OpenAI client.
<a id="openai_dalle.DALLEImageGenerator.run"></a>
#### DALLEImageGenerator.run
```python
@component.output_types(images=list[str], revised_prompt=str)
def run(prompt: str,
size: Optional[Literal["256x256", "512x512", "1024x1024", "1792x1024",
"1024x1792"]] = None,
quality: Optional[Literal["standard", "hd"]] = None,
response_format: Optional[Optional[Literal["url",
"b64_json"]]] = None)
```
Invokes the image generation inference based on the provided prompt and generation parameters.
**Arguments**:
- `prompt`: The prompt to generate the image.
- `size`: If provided, overrides the size provided during initialization.
- `quality`: If provided, overrides the quality provided during initialization.
- `response_format`: If provided, overrides the response format provided during initialization.
**Returns**:
A dictionary containing the generated list of images and the revised prompt.
Depending on the `response_format` parameter, the list of images can be URLs or base64 encoded JSON strings.
The revised prompt is the prompt that was used to generate the image, if there was any revision
to the prompt made by OpenAI.
<a id="openai_dalle.DALLEImageGenerator.to_dict"></a>
#### DALLEImageGenerator.to\_dict
```python
def to_dict() -> dict[str, Any]
```
Serialize this component to a dictionary.
**Returns**:
The serialized component as a dictionary.
<a id="openai_dalle.DALLEImageGenerator.from_dict"></a>
#### DALLEImageGenerator.from\_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DALLEImageGenerator"
```
Deserialize this component from a dictionary.
**Arguments**:
- `data`: The dictionary representation of this component.
**Returns**:
The deserialized component instance.
<a id="chat/azure"></a>
# Module chat/azure
<a id="chat/azure.AzureOpenAIChatGenerator"></a>
## AzureOpenAIChatGenerator
Generates text using OpenAI's models on Azure.
It works with gpt-4-type models and supports streaming responses
from the OpenAI API. It uses the [ChatMessage](https://docs.haystack.deepset.ai/docs/chatmessage)
format for input and output.
You can customize how the text is generated by passing parameters to the
OpenAI API. Use the `**generation_kwargs` argument when you initialize
the component or when you run it. Any parameter that works with
`openai.ChatCompletion.create` will work here too.
For details on OpenAI API parameters, see
[OpenAI documentation](https://platform.openai.com/docs/api-reference/chat).
### Usage example
```python
from haystack.components.generators.chat import AzureOpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
messages = [ChatMessage.from_user("What's Natural Language Processing?")]
client = AzureOpenAIChatGenerator(
azure_endpoint="<Your Azure endpoint e.g. `https://your-company.azure.openai.com/>",
api_key=Secret.from_token("<your-api-key>"),
azure_deployment="<this a model name, e.g. gpt-4o-mini>")
response = client.run(messages)
print(response)
```
```
{'replies':
[ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text=
"Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
enabling computers to understand, interpret, and generate human language in a way that is useful.")],
_name=None,
_meta={'model': 'gpt-4o-mini', 'index': 0, 'finish_reason': 'stop',
'usage': {'prompt_tokens': 15, 'completion_tokens': 36, 'total_tokens': 51}})]
}
```
<a id="chat/azure.AzureOpenAIChatGenerator.__init__"></a>
#### AzureOpenAIChatGenerator.\_\_init\_\_
```python
def __init__(azure_endpoint: Optional[str] = None,
api_version: Optional[str] = "2023-05-15",
azure_deployment: Optional[str] = "gpt-4o-mini",
api_key: Optional[Secret] = Secret.from_env_var(
"AZURE_OPENAI_API_KEY", strict=False),
azure_ad_token: Optional[Secret] = Secret.from_env_var(
"AZURE_OPENAI_AD_TOKEN", strict=False),
organization: Optional[str] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
default_headers: Optional[dict[str, str]] = None,
tools: Optional[ToolsType] = None,
tools_strict: bool = False,
*,
azure_ad_token_provider: Optional[Union[
AzureADTokenProvider, AsyncAzureADTokenProvider]] = None,
http_client_kwargs: Optional[dict[str, Any]] = None)
```
Initialize the Azure OpenAI Chat Generator component.
**Arguments**:
- `azure_endpoint`: The endpoint of the deployed model, for example `"https://example-resource.azure.openai.com/"`.
- `api_version`: The version of the API to use. Defaults to 2023-05-15.
- `azure_deployment`: The deployment of the model, usually the model name.
- `api_key`: The API key to use for authentication.
- `azure_ad_token`: [Azure Active Directory token](https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id).
- `organization`: Your organization ID, defaults to `None`. For help, see
[Setting up your organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization).
- `streaming_callback`: A callback function called when a new token is received from the stream.
It accepts [StreamingChunk](https://docs.haystack.deepset.ai/docs/data-classes#streamingchunk)
as an argument.
- `timeout`: Timeout for OpenAI client calls. If not set, it defaults to the
`OPENAI_TIMEOUT` environment variable, or 30 seconds.
- `max_retries`: Maximum number of retries to contact OpenAI after an internal error.
If not set, it defaults to the `OPENAI_MAX_RETRIES` environment variable, or 5.
- `generation_kwargs`: Other parameters to use for the model. These parameters are sent directly to
the OpenAI endpoint. For details, see [OpenAI documentation](https://platform.openai.com/docs/api-reference/chat).
Some of the supported parameters:
- `max_completion_tokens`: An upper bound for the number of tokens that can be generated for a completion,
including visible output tokens and reasoning tokens.
- `temperature`: The sampling temperature to use. Higher values mean the model takes more risks.
Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
- `top_p`: Nucleus sampling is an alternative to sampling with temperature, where the model considers
tokens with a top_p probability mass. For example, 0.1 means only the tokens comprising
the top 10% probability mass are considered.
- `n`: The number of completions to generate for each prompt. For example, with 3 prompts and n=2,
the LLM will generate two completions per prompt, resulting in 6 completions total.
- `stop`: One or more sequences after which the LLM should stop generating tokens.
- `presence_penalty`: The penalty applied if a token is already present.
Higher values make the model less likely to repeat the token.
- `frequency_penalty`: Penalty applied if a token has already been generated.
Higher values make the model less likely to repeat the token.
- `logit_bias`: Adds a logit bias to specific tokens. The keys of the dictionary are tokens, and the
values are the bias to add to that token.
- `response_format`: A JSON schema or a Pydantic model that enforces the structure of the model's response.
If provided, the output will always be validated against this
format (unless the model returns a tool call).
For details, see the [OpenAI Structured Outputs documentation](https://platform.openai.com/docs/guides/structured-outputs).
Notes:
- This parameter accepts Pydantic models and JSON schemas for the latest models, starting from GPT-4o.
Older models only support a basic version of structured outputs through `{"type": "json_object"}`.
For detailed information on JSON mode, see the [OpenAI Structured Outputs documentation](https://platform.openai.com/docs/guides/structured-outputs#json-mode).
- For structured outputs with streaming,
the `response_format` must be a JSON schema and not a Pydantic model.
- `default_headers`: Default headers to use for the AzureOpenAI client.
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls.
- `tools_strict`: Whether to enable strict schema adherence for tool calls. If set to `True`, the model will follow exactly
the schema provided in the `parameters` field of the tool definition, but this may increase latency.
- `azure_ad_token_provider`: A function that returns an Azure Active Directory token. It is invoked on
every request.
- `http_client_kwargs`: A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
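A structured-output sketch: the Pydantic model below is hypothetical and only illustrates how `response_format` can be passed through `generation_kwargs`; credentials are assumed to come from the environment variables above:
```python
from pydantic import BaseModel

from haystack.components.generators.chat import AzureOpenAIChatGenerator
from haystack.dataclasses import ChatMessage

class CityInfo(BaseModel):  # hypothetical schema for illustration
    name: str
    country: str

client = AzureOpenAIChatGenerator(
    azure_deployment="gpt-4o-mini",
    generation_kwargs={"response_format": CityInfo},
)
result = client.run([ChatMessage.from_user("Which city hosted the 2008 Summer Olympics?")])
print(result["replies"][0].text)
```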
<a id="chat/azure.AzureOpenAIChatGenerator.to_dict"></a>
#### AzureOpenAIChatGenerator.to\_dict
```python
def to_dict() -> dict[str, Any]
```
Serialize this component to a dictionary.
**Returns**:
The serialized component as a dictionary.
<a id="chat/azure.AzureOpenAIChatGenerator.from_dict"></a>
#### AzureOpenAIChatGenerator.from\_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AzureOpenAIChatGenerator"
```
Deserialize this component from a dictionary.
**Arguments**:
- `data`: The dictionary representation of this component.
**Returns**:
The deserialized component instance.
<a id="chat/azure.AzureOpenAIChatGenerator.run"></a>
#### AzureOpenAIChatGenerator.run
```python
@component.output_types(replies=list[ChatMessage])
def run(messages: list[ChatMessage],
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
*,
tools: Optional[ToolsType] = None,
tools_strict: Optional[bool] = None)
```
Invokes chat completion based on the provided messages and generation parameters.
**Arguments**:
- `messages`: A list of ChatMessage instances representing the input messages.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
- `generation_kwargs`: Additional keyword arguments for text generation. These parameters will
override the parameters passed during component initialization.
For details on OpenAI API parameters, see [OpenAI documentation](https://platform.openai.com/docs/api-reference/chat/create).
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls.
If set, it will override the `tools` parameter provided during initialization.
- `tools_strict`: Whether to enable strict schema adherence for tool calls. If set to `True`, the model will follow exactly
the schema provided in the `parameters` field of the tool definition, but this may increase latency.
If set, it will override the `tools_strict` parameter set during component initialization.
**Returns**:
A dictionary with the following key:
- `replies`: A list containing the generated responses as ChatMessage instances.
<a id="chat/azure.AzureOpenAIChatGenerator.run_async"></a>
#### AzureOpenAIChatGenerator.run\_async
```python
@component.output_types(replies=list[ChatMessage])
async def run_async(messages: list[ChatMessage],
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
*,
tools: Optional[ToolsType] = None,
tools_strict: Optional[bool] = None)
```
Asynchronously invokes chat completion based on the provided messages and generation parameters.
This is the asynchronous version of the `run` method. It has the same parameters and return values
but can be used with `await` in async code.
**Arguments**:
- `messages`: A list of ChatMessage instances representing the input messages.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
Must be a coroutine.
- `generation_kwargs`: Additional keyword arguments for text generation. These parameters will
override the parameters passed during component initialization.
For details on OpenAI API parameters, see [OpenAI documentation](https://platform.openai.com/docs/api-reference/chat/create).
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls.
If set, it will override the `tools` parameter provided during initialization.
- `tools_strict`: Whether to enable strict schema adherence for tool calls. If set to `True`, the model will follow exactly
the schema provided in the `parameters` field of the tool definition, but this may increase latency.
If set, it will override the `tools_strict` parameter set during component initialization.
**Returns**:
A dictionary with the following key:
- `replies`: A list containing the generated responses as ChatMessage instances.
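A tool-calling sketch: the weather tool below is hypothetical and only shows how `tools` can be passed to `run()`; the endpoint and key are assumed to come from the environment:
```python
from haystack.components.generators.chat import AzureOpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool

def get_weather(city: str) -> str:  # hypothetical tool function
    return f"Sunny in {city}"

weather_tool = Tool(
    name="get_weather",
    description="Get the current weather for a city.",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
    function=get_weather,
)

client = AzureOpenAIChatGenerator(azure_deployment="gpt-4o-mini")
result = client.run(
    [ChatMessage.from_user("What's the weather in Paris?")],
    tools=[weather_tool],
)
# If the model decides to call the tool, the reply carries tool calls instead of plain text.
print(result["replies"][0])
```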
<a id="chat/hugging_face_local"></a>
# Module chat/hugging\_face\_local
<a id="chat/hugging_face_local.default_tool_parser"></a>
#### default\_tool\_parser
```python
def default_tool_parser(text: str) -> Optional[list[ToolCall]]
```
Default implementation for parsing tool calls from model output text.
Uses DEFAULT_TOOL_PATTERN to extract tool calls.
**Arguments**:
- `text`: The text to parse for tool calls.
**Returns**:
A list containing a single ToolCall if a valid tool call is found, None otherwise.
<a id="chat/hugging_face_local.HuggingFaceLocalChatGenerator"></a>
## HuggingFaceLocalChatGenerator
Generates chat responses using models from Hugging Face that run locally.
Use this component with chat-based models,
such as `HuggingFaceH4/zephyr-7b-beta` or `meta-llama/Llama-2-7b-chat-hf`.
LLMs running locally may need powerful hardware.
### Usage example
```python
from haystack.components.generators.chat import HuggingFaceLocalChatGenerator
from haystack.dataclasses import ChatMessage
generator = HuggingFaceLocalChatGenerator(model="HuggingFaceH4/zephyr-7b-beta")
generator.warm_up()
messages = [ChatMessage.from_user("What's Natural Language Processing? Be brief.")]
print(generator.run(messages))
```
```
{'replies':
[ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text=
"Natural Language Processing (NLP) is a subfield of artificial intelligence that deals
with the interaction between computers and human language. It enables computers to understand, interpret, and
generate human language in a valuable way. NLP involves various techniques such as speech recognition, text
analysis, sentiment analysis, and machine translation. The ultimate goal is to make it easier for computers to
process and derive meaning from human language, improving communication between humans and machines.")],
_name=None,
_meta={'finish_reason': 'stop', 'index': 0, 'model':
'mistralai/Mistral-7B-Instruct-v0.2',
'usage': {'completion_tokens': 90, 'prompt_tokens': 19, 'total_tokens': 109}})
]
}
```
<a id="chat/hugging_face_local.HuggingFaceLocalChatGenerator.__init__"></a>
#### HuggingFaceLocalChatGenerator.\_\_init\_\_
```python
def __init__(model: str = "HuggingFaceH4/zephyr-7b-beta",
task: Optional[Literal["text-generation",
"text2text-generation"]] = None,
device: Optional[ComponentDevice] = None,
token: Optional[Secret] = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False),
chat_template: Optional[str] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
huggingface_pipeline_kwargs: Optional[dict[str, Any]] = None,
stop_words: Optional[list[str]] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
tools: Optional[ToolsType] = None,
tool_parsing_function: Optional[Callable[
[str], Optional[list[ToolCall]]]] = None,
async_executor: Optional[ThreadPoolExecutor] = None) -> None
```
Initializes the HuggingFaceLocalChatGenerator component.
**Arguments**:
- `model`: The Hugging Face text generation model name or path,
for example, `mistralai/Mistral-7B-Instruct-v0.2` or `TheBloke/OpenHermes-2.5-Mistral-7B-16k-AWQ`.
The model must be a chat model supporting the ChatML messaging
format.
If the model is specified in `huggingface_pipeline_kwargs`, this parameter is ignored.
- `task`: The task for the Hugging Face pipeline. Possible options:
- `text-generation`: Supported by decoder models, like GPT.
- `text2text-generation`: Supported by encoder-decoder models, like T5.
If the task is specified in `huggingface_pipeline_kwargs`, this parameter is ignored.
If not specified, the component calls the Hugging Face API to infer the task from the model name.
- `device`: The device for loading the model. If `None`, automatically selects the default device.
If a device or device map is specified in `huggingface_pipeline_kwargs`, it overrides this parameter.
- `token`: The token to use as HTTP bearer authorization for remote files.
If the token is specified in `huggingface_pipeline_kwargs`, this parameter is ignored.
- `chat_template`: Specifies an optional Jinja template for formatting chat
messages. Most high-quality chat models have their own templates, but for models without this
feature or if you prefer a custom template, use this parameter.
- `generation_kwargs`: A dictionary with keyword arguments to customize text generation.
Some examples: `max_length`, `max_new_tokens`, `temperature`, `top_k`, `top_p`.
See Hugging Face's documentation for more information:
- [customize-text-generation](https://huggingface.co/docs/transformers/main/en/generation_strategies#customize-text-generation)
- [GenerationConfig](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig)
The only `generation_kwargs` set by default is `max_new_tokens`, which is set to 512 tokens.
- `huggingface_pipeline_kwargs`: Dictionary with keyword arguments to initialize the
Hugging Face pipeline for text generation.
These keyword arguments provide fine-grained control over the Hugging Face pipeline.
In case of duplication, these kwargs override `model`, `task`, `device`, and `token` init parameters.
For kwargs, see [Hugging Face documentation](https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.pipeline.task).
In this dictionary, you can also include `model_kwargs` to specify the kwargs for [model initialization](https://huggingface.co/docs/transformers/en/main_classes/model#transformers.PreTrainedModel.from_pretrained)
- `stop_words`: A list of stop words. If the model generates a stop word, the generation stops.
If you provide this parameter, don't specify the `stopping_criteria` in `generation_kwargs`.
For some chat models, the output includes both the new text and the original prompt.
In these cases, make sure your prompt has no stop words.
- `streaming_callback`: An optional callable for handling streaming responses.
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls.
- `tool_parsing_function`: A callable that takes a string and returns a list of ToolCall objects or None.
If None, the `default_tool_parser` is used, which extracts tool calls using a predefined pattern.
- `async_executor`: Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded
executor is initialized and used.
<a id="chat/hugging_face_local.HuggingFaceLocalChatGenerator.__del__"></a>
#### HuggingFaceLocalChatGenerator.\_\_del\_\_
```python
def __del__() -> None
```
Cleanup when the instance is being destroyed.
<a id="chat/hugging_face_local.HuggingFaceLocalChatGenerator.shutdown"></a>
#### HuggingFaceLocalChatGenerator.shutdown
```python
def shutdown() -> None
```
Explicitly shutdown the executor if we own it.
<a id="chat/hugging_face_local.HuggingFaceLocalChatGenerator.warm_up"></a>
#### HuggingFaceLocalChatGenerator.warm\_up
```python
def warm_up() -> None
```
Initializes the component.
<a id="chat/hugging_face_local.HuggingFaceLocalChatGenerator.to_dict"></a>
#### HuggingFaceLocalChatGenerator.to\_dict
```python
def to_dict() -> dict[str, Any]
```
Serializes the component to a dictionary.
**Returns**:
Dictionary with serialized data.
<a id="chat/hugging_face_local.HuggingFaceLocalChatGenerator.from_dict"></a>
#### HuggingFaceLocalChatGenerator.from\_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HuggingFaceLocalChatGenerator"
```
Deserializes the component from a dictionary.
**Arguments**:
- `data`: The dictionary to deserialize from.
**Returns**:
The deserialized component.
<a id="chat/hugging_face_local.HuggingFaceLocalChatGenerator.run"></a>
#### HuggingFaceLocalChatGenerator.run
```python
@component.output_types(replies=list[ChatMessage])
def run(messages: list[ChatMessage],
generation_kwargs: Optional[dict[str, Any]] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
tools: Optional[ToolsType] = None) -> dict[str, list[ChatMessage]]
```
Invoke text generation inference based on the provided messages and generation parameters.
**Arguments**:
- `messages`: A list of ChatMessage objects representing the input messages.
- `generation_kwargs`: Additional keyword arguments for text generation.
- `streaming_callback`: An optional callable for handling streaming responses.
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls.
If set, it will override the `tools` parameter provided during initialization.
**Returns**:
A dictionary with the following keys:
- `replies`: A list containing the generated responses as ChatMessage instances.
<a id="chat/hugging_face_local.HuggingFaceLocalChatGenerator.create_message"></a>
#### HuggingFaceLocalChatGenerator.create\_message
```python
def create_message(text: str,
index: int,
tokenizer: Union["PreTrainedTokenizer",
"PreTrainedTokenizerFast"],
prompt: str,
generation_kwargs: dict[str, Any],
parse_tool_calls: bool = False) -> ChatMessage
```
Create a ChatMessage instance from the provided text, populated with metadata.
**Arguments**:
- `text`: The generated text.
- `index`: The index of the generated text.
- `tokenizer`: The tokenizer used for generation.
- `prompt`: The prompt used for generation.
- `generation_kwargs`: The generation parameters.
- `parse_tool_calls`: Whether to attempt parsing tool calls from the text.
**Returns**:
A ChatMessage instance.
<a id="chat/hugging_face_local.HuggingFaceLocalChatGenerator.run_async"></a>
#### HuggingFaceLocalChatGenerator.run\_async
```python
@component.output_types(replies=list[ChatMessage])
async def run_async(
messages: list[ChatMessage],
generation_kwargs: Optional[dict[str, Any]] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
tools: Optional[ToolsType] = None) -> dict[str, list[ChatMessage]]
```
Asynchronously invokes text generation inference based on the provided messages and generation parameters.
This is the asynchronous version of the `run` method. It has the same parameters
and return values but can be used with `await` in async code.
**Arguments**:
- `messages`: A list of ChatMessage objects representing the input messages.
- `generation_kwargs`: Additional keyword arguments for text generation.
- `streaming_callback`: An optional callable for handling streaming responses.
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls.
If set, it will override the `tools` parameter provided during initialization.
**Returns**:
A dictionary with the following keys:
- `replies`: A list containing the generated responses as ChatMessage instances.
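A minimal async sketch for `run_async`; inputs and outputs mirror `run`:
```python
import asyncio

from haystack.components.generators.chat import HuggingFaceLocalChatGenerator
from haystack.dataclasses import ChatMessage

async def main():
    generator = HuggingFaceLocalChatGenerator(model="HuggingFaceH4/zephyr-7b-beta")
    generator.warm_up()  # loads the model before generation
    result = await generator.run_async(
        [ChatMessage.from_user("What's Natural Language Processing? Be brief.")]
    )
    print(result["replies"][0].text)

asyncio.run(main())
```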
<a id="chat/hugging_face_api"></a>
# Module chat/hugging\_face\_api
<a id="chat/hugging_face_api.HuggingFaceAPIChatGenerator"></a>
## HuggingFaceAPIChatGenerator
Completes chats using Hugging Face APIs.
HuggingFaceAPIChatGenerator uses the [ChatMessage](https://docs.haystack.deepset.ai/docs/chatmessage)
format for input and output. Use it to generate text with Hugging Face APIs:
- [Serverless Inference API (Inference Providers)](https://huggingface.co/docs/inference-providers)
- [Paid Inference Endpoints](https://huggingface.co/inference-endpoints)
- [Self-hosted Text Generation Inference](https://github.com/huggingface/text-generation-inference)
### Usage examples
#### With the serverless inference API (Inference Providers) - free tier available
```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType
messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
ChatMessage.from_user("What's Natural Language Processing?")]
# the api_type can be expressed using the HFGenerationAPIType enum or as a string
api_type = HFGenerationAPIType.SERVERLESS_INFERENCE_API
api_type = "serverless_inference_api" # this is equivalent to the above
generator = HuggingFaceAPIChatGenerator(api_type=api_type,
api_params={"model": "Qwen/Qwen2.5-7B-Instruct",
"provider": "together"},
token=Secret.from_token("<your-api-key>"))
result = generator.run(messages)
print(result)
```
#### With the serverless inference API (Inference Providers) and text+image input
```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage, ImageContent
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType
# Create an image from file path, URL, or base64
image = ImageContent.from_file_path("path/to/your/image.jpg")
# Create a multimodal message with both text and image
messages = [ChatMessage.from_user(content_parts=["Describe this image in detail", image])]
generator = HuggingFaceAPIChatGenerator(
api_type=HFGenerationAPIType.SERVERLESS_INFERENCE_API,
api_params={
"model": "Qwen/Qwen2.5-VL-7B-Instruct", # Vision Language Model
"provider": "hyperbolic"
},
token=Secret.from_token("<your-api-key>")
)
result = generator.run(messages)
print(result)
```
#### With paid inference endpoints
```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
ChatMessage.from_user("What's Natural Language Processing?")]
generator = HuggingFaceAPIChatGenerator(api_type="inference_endpoints",
api_params={"url": "<your-inference-endpoint-url>"},
token=Secret.from_token("<your-api-key>"))
result = generator.run(messages)
print(result)
```
#### With self-hosted text generation inference
```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
ChatMessage.from_user("What's Natural Language Processing?")]
generator = HuggingFaceAPIChatGenerator(api_type="text_generation_inference",
api_params={"url": "http://localhost:8080"})
result = generator.run(messages)
print(result)
```
<a id="chat/hugging_face_api.HuggingFaceAPIChatGenerator.__init__"></a>
#### HuggingFaceAPIChatGenerator.\_\_init\_\_
```python
def __init__(api_type: Union[HFGenerationAPIType, str],
api_params: dict[str, str],
token: Optional[Secret] = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False),
generation_kwargs: Optional[dict[str, Any]] = None,
stop_words: Optional[list[str]] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
tools: Optional[ToolsType] = None)
```
Initialize the HuggingFaceAPIChatGenerator instance.
**Arguments**:
- `api_type`: The type of Hugging Face API to use. Available types:
- `text_generation_inference`: See [TGI](https://github.com/huggingface/text-generation-inference).
- `inference_endpoints`: See [Inference Endpoints](https://huggingface.co/inference-endpoints).
- `serverless_inference_api`: See
[Serverless Inference API - Inference Providers](https://huggingface.co/docs/inference-providers).
- `api_params`: A dictionary with the following keys:
- `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.
- `provider`: Provider name. Recommended when `api_type` is `SERVERLESS_INFERENCE_API`.
- `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or
`TEXT_GENERATION_INFERENCE`.
- Other parameters specific to the chosen API type, such as `timeout`, `headers`, etc.
- `token`: The Hugging Face token to use as HTTP bearer authorization.
Check your HF token in your [account settings](https://huggingface.co/settings/tokens).
- `generation_kwargs`: A dictionary with keyword arguments to customize text generation.
Some examples: `max_tokens`, `temperature`, `top_p`.
For details, see [Hugging Face chat_completion documentation](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.InferenceClient.chat_completion).
- `stop_words`: An optional list of strings representing the stop words.
- `streaming_callback`: An optional callable for handling streaming responses.
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls.
The chosen model should support tool/function calling, according to the model card.
Support for tools in the Hugging Face API and TGI is not yet fully refined and you may experience
unexpected behavior.
<a id="chat/hugging_face_api.HuggingFaceAPIChatGenerator.to_dict"></a>
#### HuggingFaceAPIChatGenerator.to\_dict
```python
def to_dict() -> dict[str, Any]
```
Serialize this component to a dictionary.
**Returns**:
A dictionary containing the serialized component.
<a id="chat/hugging_face_api.HuggingFaceAPIChatGenerator.from_dict"></a>
#### HuggingFaceAPIChatGenerator.from\_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HuggingFaceAPIChatGenerator"
```
Deserialize this component from a dictionary.
<a id="chat/hugging_face_api.HuggingFaceAPIChatGenerator.run"></a>
#### HuggingFaceAPIChatGenerator.run
```python
@component.output_types(replies=list[ChatMessage])
def run(messages: list[ChatMessage],
generation_kwargs: Optional[dict[str, Any]] = None,
tools: Optional[ToolsType] = None,
streaming_callback: Optional[StreamingCallbackT] = None)
```
Invoke the text generation inference based on the provided messages and generation parameters.
**Arguments**:
- `messages`: A list of ChatMessage objects representing the input messages.
- `generation_kwargs`: Additional keyword arguments for text generation.
- `tools`: A list of tools or a Toolset for which the model can prepare calls. If set, it will override
the `tools` parameter set during component initialization. This parameter can accept either a
list of `Tool` objects or a `Toolset` instance.
- `streaming_callback`: An optional callable for handling streaming responses. If set, it will override the `streaming_callback`
parameter set during component initialization.
**Returns**:
A dictionary with the following keys:
- `replies`: A list containing the generated responses as ChatMessage objects.
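For illustration, a call that overrides the generation settings for a single invocation (the TGI URL is a local placeholder):
```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage

generator = HuggingFaceAPIChatGenerator(
    api_type="text_generation_inference",
    api_params={"url": "http://localhost:8080"},  # placeholder TGI endpoint
)
result = generator.run(
    messages=[ChatMessage.from_user("What's Natural Language Processing? Be brief.")],
    generation_kwargs={"max_tokens": 128},  # overrides init-time settings for this call only
)
print(result["replies"][0].text)
```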
<a id="chat/hugging_face_api.HuggingFaceAPIChatGenerator.run_async"></a>
#### HuggingFaceAPIChatGenerator.run\_async
```python
@component.output_types(replies=list[ChatMessage])
async def run_async(messages: list[ChatMessage],
generation_kwargs: Optional[dict[str, Any]] = None,
tools: Optional[ToolsType] = None,
streaming_callback: Optional[StreamingCallbackT] = None)
```
Asynchronously invokes the text generation inference based on the provided messages and generation parameters.
This is the asynchronous version of the `run` method. It has the same parameters
and return values but can be used with `await` in async code.
**Arguments**:
- `messages`: A list of ChatMessage objects representing the input messages.
- `generation_kwargs`: Additional keyword arguments for text generation.
- `tools`: A list of tools or a Toolset for which the model can prepare calls. If set, it will override the `tools`
parameter set during component initialization. This parameter can accept either a list of `Tool` objects
or a `Toolset` instance.
- `streaming_callback`: An optional callable for handling streaming responses. If set, it will override the `streaming_callback`
parameter set during component initialization.
**Returns**:
A dictionary with the following keys:
- `replies`: A list containing the generated responses as ChatMessage objects.
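A brief sketch of the asynchronous variant, using the same placeholder TGI endpoint:
```python
import asyncio

from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage

generator = HuggingFaceAPIChatGenerator(
    api_type="text_generation_inference",
    api_params={"url": "http://localhost:8080"},  # placeholder TGI endpoint
)

async def main():
    result = await generator.run_async(
        messages=[ChatMessage.from_user("What's Natural Language Processing?")]
    )
    print(result["replies"][0].text)

asyncio.run(main())
```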
<a id="chat/openai"></a>
# Module chat/openai
<a id="chat/openai.OpenAIChatGenerator"></a>
## OpenAIChatGenerator
Completes chats using OpenAI's large language models (LLMs).
It works with the gpt-4 and o-series models and supports streaming responses
from OpenAI API. It uses [ChatMessage](https://docs.haystack.deepset.ai/docs/chatmessage)
format in input and output.
You can customize how the text is generated by passing parameters to the
OpenAI API. Use the `**generation_kwargs` argument when you initialize
the component or when you run it. Any parameter that works with
`openai.ChatCompletion.create` will work here too.
For details on OpenAI API parameters, see
[OpenAI documentation](https://platform.openai.com/docs/api-reference/chat).
### Usage example
```python
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
messages = [ChatMessage.from_user("What's Natural Language Processing?")]
client = OpenAIChatGenerator()
response = client.run(messages)
print(response)
```
Output:
```
{'replies':
[ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=
[TextContent(text="Natural Language Processing (NLP) is a branch of artificial intelligence
that focuses on enabling computers to understand, interpret, and generate human language in
a way that is meaningful and useful.")],
_name=None,
_meta={'model': 'gpt-4o-mini', 'index': 0, 'finish_reason': 'stop',
'usage': {'prompt_tokens': 15, 'completion_tokens': 36, 'total_tokens': 51}})
]
}
```
<a id="chat/openai.OpenAIChatGenerator.__init__"></a>
#### OpenAIChatGenerator.\_\_init\_\_
```python
def __init__(api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
model: str = "gpt-4o-mini",
streaming_callback: Optional[StreamingCallbackT] = None,
api_base_url: Optional[str] = None,
organization: Optional[str] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None,
tools: Optional[ToolsType] = None,
tools_strict: bool = False,
http_client_kwargs: Optional[dict[str, Any]] = None)
```
Creates an instance of OpenAIChatGenerator. Unless specified otherwise in `model`, uses OpenAI's gpt-4o-mini.
Before initializing the component, you can set the 'OPENAI_TIMEOUT' and 'OPENAI_MAX_RETRIES'
environment variables to override the `timeout` and `max_retries` parameters respectively
in the OpenAI client.
**Arguments**:
- `api_key`: The OpenAI API key.
You can set it with an environment variable `OPENAI_API_KEY`, or pass with this parameter
during initialization.
- `model`: The name of the model to use.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
The callback function accepts [StreamingChunk](https://docs.haystack.deepset.ai/docs/data-classes#streamingchunk)
as an argument.
- `api_base_url`: An optional base URL.
- `organization`: Your organization ID, defaults to `None`. See
[production best practices](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization).
- `generation_kwargs`: Other parameters to use for the model. These parameters are sent directly to
the OpenAI endpoint. See OpenAI [documentation](https://platform.openai.com/docs/api-reference/chat) for
more details.
Some of the supported parameters:
- `max_completion_tokens`: An upper bound for the number of tokens that can be generated for a completion,
including visible output tokens and reasoning tokens.
- `temperature`: What sampling temperature to use. Higher values mean the model will take more risks.
Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
- `top_p`: An alternative to sampling with temperature, called nucleus sampling, where the model
considers the results of the tokens with top_p probability mass. For example, 0.1 means only the tokens
comprising the top 10% probability mass are considered.
- `n`: How many completions to generate for each prompt. For example, if the LLM gets 3 prompts and n is 2,
it will generate two completions for each of the three prompts, ending up with 6 completions in total.
- `stop`: One or more sequences after which the LLM should stop generating tokens.
- `presence_penalty`: What penalty to apply if a token is already present in the text at all, regardless of how often. Bigger values mean
the model will be less likely to repeat the same token in the text.
- `frequency_penalty`: What penalty to apply if a token has already been generated in the text.
Bigger values mean the model will be less likely to repeat the same token in the text.
- `logit_bias`: Add a logit bias to specific tokens. The keys of the dictionary are tokens, and the
values are the bias to add to that token.
- `response_format`: A JSON schema or a Pydantic model that enforces the structure of the model's response.
If provided, the output will always be validated against this
format (unless the model returns a tool call).
For details, see the [OpenAI Structured Outputs documentation](https://platform.openai.com/docs/guides/structured-outputs).
Notes:
- This parameter accepts Pydantic models and JSON schemas for the latest models, starting from GPT-4o.
Older models only support a basic version of structured outputs through `{"type": "json_object"}`.
For detailed information on JSON mode, see the [OpenAI Structured Outputs documentation](https://platform.openai.com/docs/guides/structured-outputs#json-mode).
- For structured outputs with streaming,
the `response_format` must be a JSON schema and not a Pydantic model.
- `timeout`: Timeout for OpenAI client calls. If not set, it defaults to either the
`OPENAI_TIMEOUT` environment variable, or 30 seconds.
- `max_retries`: Maximum number of retries to contact OpenAI after an internal error.
If not set, it defaults to either the `OPENAI_MAX_RETRIES` environment variable or 5.
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls.
- `tools_strict`: Whether to enable strict schema adherence for tool calls. If set to `True`, the model will follow exactly
the schema provided in the `parameters` field of the tool definition, but this may increase latency.
- `http_client_kwargs`: A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
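A sketch of how several of these options combine; the Pydantic schema is an illustrative example of `response_format`, not part of the component's API:
```python
from pydantic import BaseModel

from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

# Illustrative schema used to enforce structured output via `response_format`.
class CityInfo(BaseModel):
    city: str
    country: str

client = OpenAIChatGenerator(
    model="gpt-4o-mini",
    generation_kwargs={
        "temperature": 0.2,
        "max_completion_tokens": 200,
        "response_format": CityInfo,
    },
    timeout=30,
    max_retries=3,
)
response = client.run([ChatMessage.from_user("Which country is Paris in?")])
print(response["replies"][0].text)  # JSON string matching the CityInfo schema
```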
<a id="chat/openai.OpenAIChatGenerator.to_dict"></a>
#### OpenAIChatGenerator.to\_dict
```python
def to_dict() -> dict[str, Any]
```
Serialize this component to a dictionary.
**Returns**:
The serialized component as a dictionary.
<a id="chat/openai.OpenAIChatGenerator.from_dict"></a>
#### OpenAIChatGenerator.from\_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "OpenAIChatGenerator"
```
Deserialize this component from a dictionary.
**Arguments**:
- `data`: The dictionary representation of this component.
**Returns**:
The deserialized component instance.
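As a quick illustration, `to_dict` and `from_dict` round-trip an instance through a plain dictionary, which is how components are persisted inside serialized pipelines:
```python
from haystack.components.generators.chat import OpenAIChatGenerator

client = OpenAIChatGenerator(model="gpt-4o-mini", generation_kwargs={"temperature": 0.2})
data = client.to_dict()                         # plain dictionary representation
restored = OpenAIChatGenerator.from_dict(data)  # equivalent, freshly constructed component
```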
<a id="chat/openai.OpenAIChatGenerator.run"></a>
#### OpenAIChatGenerator.run
```python
@component.output_types(replies=list[ChatMessage])
def run(messages: list[ChatMessage],
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
*,
tools: Optional[ToolsType] = None,
tools_strict: Optional[bool] = None)
```
Invokes chat completion based on the provided messages and generation parameters.
**Arguments**:
- `messages`: A list of ChatMessage instances representing the input messages.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
- `generation_kwargs`: Additional keyword arguments for text generation. These parameters will
override the parameters passed during component initialization.
For details on OpenAI API parameters, see [OpenAI documentation](https://platform.openai.com/docs/api-reference/chat/create).
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls.
If set, it will override the `tools` parameter provided during initialization.
- `tools_strict`: Whether to enable strict schema adherence for tool calls. If set to `True`, the model will follow exactly
the schema provided in the `parameters` field of the tool definition, but this may increase latency.
If set, it will override the `tools_strict` parameter set during component initialization.
**Returns**:
A dictionary with the following key:
- `replies`: A list containing the generated responses as ChatMessage instances.
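For illustration, a call that streams tokens to stdout through a per-call `streaming_callback` (the callback itself is a hypothetical helper):
```python
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage, StreamingChunk

client = OpenAIChatGenerator()

def print_chunk(chunk: StreamingChunk) -> None:
    # Print each streamed token as it arrives.
    print(chunk.content, end="", flush=True)

result = client.run(
    messages=[ChatMessage.from_user("Explain tokenization in one sentence.")],
    streaming_callback=print_chunk,
    generation_kwargs={"temperature": 0.5},
)
print()
print(result["replies"][0].meta.get("finish_reason"))
```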
<a id="chat/openai.OpenAIChatGenerator.run_async"></a>
#### OpenAIChatGenerator.run\_async
```python
@component.output_types(replies=list[ChatMessage])
async def run_async(messages: list[ChatMessage],
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
*,
tools: Optional[ToolsType] = None,
tools_strict: Optional[bool] = None)
```
Asynchronously invokes chat completion based on the provided messages and generation parameters.
This is the asynchronous version of the `run` method. It has the same parameters and return values
but can be used with `await` in async code.
**Arguments**:
- `messages`: A list of ChatMessage instances representing the input messages.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
Must be a coroutine.
- `generation_kwargs`: Additional keyword arguments for text generation. These parameters will
override the parameters passed during component initialization.
For details on OpenAI API parameters, see [OpenAI documentation](https://platform.openai.com/docs/api-reference/chat/create).
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls.
If set, it will override the `tools` parameter provided during initialization.
- `tools_strict`: Whether to enable strict schema adherence for tool calls. If set to `True`, the model will follow exactly
the schema provided in the `parameters` field of the tool definition, but this may increase latency.
If set, it will override the `tools_strict` parameter set during component initialization.
**Returns**:
A dictionary with the following key:
- `replies`: A list containing the generated responses as ChatMessage instances.
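And the asynchronous counterpart, awaiting `run_async` inside an event loop:
```python
import asyncio

from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

client = OpenAIChatGenerator()

async def main():
    result = await client.run_async(
        messages=[ChatMessage.from_user("What's Natural Language Processing?")],
        generation_kwargs={"max_completion_tokens": 100},
    )
    print(result["replies"][0].text)

asyncio.run(main())
```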
<a id="chat/fallback"></a>
# Module chat/fallback
<a id="chat/fallback.FallbackChatGenerator"></a>
## FallbackChatGenerator
A chat generator wrapper that tries multiple chat generators sequentially.
It forwards all parameters transparently to the underlying chat generators and returns the first successful result.
Calls chat generators sequentially until one succeeds. Falls back on any exception raised by a generator.
If all chat generators fail, it raises a RuntimeError with details.
Timeout enforcement is fully delegated to the underlying chat generators. The fallback mechanism will only
work correctly if the underlying chat generators implement proper timeout handling and raise exceptions
when timeouts occur. For predictable latency guarantees, ensure your chat generators:
- Support a `timeout` parameter in their initialization
- Implement timeout as total wall-clock time (shared deadline for both streaming and non-streaming)
- Raise timeout exceptions (e.g., TimeoutError, asyncio.TimeoutError, httpx.TimeoutException) when exceeded
Note: Most well-implemented chat generators (OpenAI, Anthropic, Cohere, etc.) support timeout parameters
with consistent semantics. For HTTP-based LLM providers, a single timeout value (e.g., `timeout=30`)
typically applies to all connection phases: connection setup, read, write, and pool. For streaming
responses, read timeout is the maximum gap between chunks. For non-streaming, it's the time limit for
receiving the complete response.
Failover is automatically triggered when a generator raises any exception, including:
- Timeout errors (if the generator implements and raises them)
- Rate limit errors (429)
- Authentication errors (401)
- Context length errors (400)
- Server errors (500+)
- Any other exception
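A minimal sketch of the failover behavior described above, assuming `FallbackChatGenerator` is importable from `haystack.components.generators.chat` and using two OpenAI models with explicit timeouts (the model choices are illustrative):
```python
from haystack.components.generators.chat import FallbackChatGenerator, OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

# Generators are tried in order; the backup only runs if the primary raises an exception.
primary = OpenAIChatGenerator(model="gpt-4o", timeout=10)
backup = OpenAIChatGenerator(model="gpt-4o-mini", timeout=30)

generator = FallbackChatGenerator(chat_generators=[primary, backup])
result = generator.run(messages=[ChatMessage.from_user("What's Natural Language Processing?")])
print(result["replies"][0].text)
print(result["meta"]["successful_chat_generator_class"])
```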
<a id="chat/fallback.FallbackChatGenerator.__init__"></a>
#### FallbackChatGenerator.\_\_init\_\_
```python
def __init__(chat_generators: list[ChatGenerator])
```
Creates an instance of FallbackChatGenerator.
**Arguments**:
- `chat_generators`: A non-empty list of chat generator components to try in order.
<a id="chat/fallback.FallbackChatGenerator.to_dict"></a>
#### FallbackChatGenerator.to\_dict
```python
def to_dict() -> dict[str, Any]
```
Serialize the component, including nested chat generators when they support serialization.
<a id="chat/fallback.FallbackChatGenerator.from_dict"></a>
#### FallbackChatGenerator.from\_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> FallbackChatGenerator
```
Rebuild the component from a serialized representation, restoring nested chat generators.
<a id="chat/fallback.FallbackChatGenerator.run"></a>
#### FallbackChatGenerator.run
```python
@component.output_types(replies=list[ChatMessage], meta=dict[str, Any])
def run(
messages: list[ChatMessage],
generation_kwargs: Union[dict[str, Any], None] = None,
tools: Optional[ToolsType] = None,
streaming_callback: Union[StreamingCallbackT,
None] = None) -> dict[str, Any]
```
Execute chat generators sequentially until one succeeds.
**Arguments**:
- `messages`: The conversation history as a list of ChatMessage instances.
- `generation_kwargs`: Optional parameters for the chat generator (e.g., temperature, max_tokens).
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for function calling capabilities.
- `streaming_callback`: Optional callable for handling streaming responses.
**Raises**:
- `RuntimeError`: If all chat generators fail.
**Returns**:
A dictionary with:
- "replies": Generated ChatMessage instances from the first successful generator.
- "meta": Execution metadata including successful_chat_generator_index, successful_chat_generator_class,
total_attempts, failed_chat_generators, plus any metadata from the successful generator.
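For illustration, inspecting the documented metadata keys and handling the case where every generator fails (import path and model choice as in the sketch above):
```python
from haystack.components.generators.chat import FallbackChatGenerator, OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

generator = FallbackChatGenerator(
    chat_generators=[OpenAIChatGenerator(model="gpt-4o-mini", timeout=10)]
)

try:
    result = generator.run(messages=[ChatMessage.from_user("Hello!")])
    meta = result["meta"]
    print(meta["successful_chat_generator_index"], meta["total_attempts"])
    print(meta["failed_chat_generators"])
except RuntimeError as err:
    # Raised only when all configured generators have failed.
    print(f"All chat generators failed: {err}")
```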
<a id="chat/fallback.FallbackChatGenerator.run_async"></a>
#### FallbackChatGenerator.run\_async
```python
@component.output_types(replies=list[ChatMessage], meta=dict[str, Any])
async def run_async(
messages: list[ChatMessage],
generation_kwargs: Union[dict[str, Any], None] = None,
tools: Optional[ToolsType] = None,
streaming_callback: Union[StreamingCallbackT,
None] = None) -> dict[str, Any]
```
Asynchronously execute chat generators sequentially until one succeeds.
**Arguments**:
- `messages`: The conversation history as a list of ChatMessage instances.
- `generation_kwargs`: Optional parameters for the chat generator (e.g., temperature, max_tokens).
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for function calling capabilities.
- `streaming_callback`: Optional callable for handling streaming responses.
**Raises**:
- `RuntimeError`: If all chat generators fail.
**Returns**:
A dictionary with:
- "replies": Generated ChatMessage instances from the first successful generator.
- "meta": Execution metadata including successful_chat_generator_index, successful_chat_generator_class,
total_attempts, failed_chat_generators, plus any metadata from the successful generator.