docs: simplify paths on main version (#9954)

Stefano Fiorucci 2025-10-27 17:26:17 +01:00 committed by GitHub
parent 4ce5b683db
commit 00a644a00a
93 changed files with 395 additions and 397 deletions

View File

@ -188,8 +188,8 @@ If you need help migrating a 1.x node without a 2.x counterpart, open an [issue]
| | | |
| --- | --- | --- |
| Haystack 1.x | Description | Haystack 2.x |
| QueryClassifier | Categorizes queries. **Example usage:** Distinguishing between keyword queries and natural language questions and routing them to the Retrievers that can handle them best. | [TransformersZeroShotTextRouter](../../docs/pipeline-components/routers/transformerszeroshottextrouter.mdx) <br />[TransformersTextRouter](../../docs/pipeline-components/routers/transformerstextrouter.mdx) |
| RouteDocuments | Routes documents to different branches of your pipeline based on their content type or metadata field. **Example usage:** Routing table data to `TableReader` and text data to `TransformersReader` for better handling. | [Routers](../../docs/pipeline-components/routers.mdx) |
| QueryClassifier | Categorizes queries. **Example usage:** Distinguishing between keyword queries and natural language questions and routing them to the Retrievers that can handle them best. | [TransformersZeroShotTextRouter](/docs/pipeline-components/routers/transformerszeroshottextrouter.mdx) <br />[TransformersTextRouter](/docs/pipeline-components/routers/transformerstextrouter.mdx) |
| RouteDocuments | Routes documents to different branches of your pipeline based on their content type or metadata field. **Example usage:** Routing table data to `TableReader` and text data to `TransformersReader` for better handling. | [Routers](/docs/pipeline-components/routers.mdx) |
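To make the `QueryClassifier` replacement in the table above concrete, here is a minimal sketch of `TransformersZeroShotTextRouter`; the label names are illustrative assumptions, not fixed values:

```python
from haystack.components.routers import TransformersZeroShotTextRouter

# Routes the input text to an output edge named after the predicted label.
# The labels below are example categories chosen for this sketch.
router = TransformersZeroShotTextRouter(labels=["keyword query", "natural language question"])
router.warm_up()

result = router.run(text="Who lives in Berlin?")
# Expected shape (assumption): {"natural language question": "Who lives in Berlin?"}
print(result)
```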
### Utility Components

View File

@ -11,15 +11,15 @@ Use this component in pipelines that contain a Generator to parse its replies.
| | |
| --- | --- |
| **Most common position in a pipeline** | Use in pipelines (such as a RAG pipeline) after a [Generator](../../../docs/pipeline-components/generators.mdx) component to create [`GeneratedAnswer`](../../../docs/concepts/data-classes.mdx#generatedanswer) objects from its replies. |
| **Mandatory run variables** | “query”: A query string <br /> <br />“replies”: A list of strings, or a list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) objects that are replies from a Generator |
| **Most common position in a pipeline** | Use in pipelines (such as a RAG pipeline) after a [Generator](/docs/pipeline-components/generators.mdx) component to create [`GeneratedAnswer`](/docs/concepts/data-classes.mdx#generatedanswer) objects from its replies. |
| **Mandatory run variables** | “query”: A query string <br /> <br />“replies”: A list of strings, or a list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) objects that are replies from a Generator |
| **Output variables** | “answers”: A list of `GeneratedAnswer` objects |
| **API reference** | [Builders](/reference/builders-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/builders/answer_builder.py |
## Overview
`AnswerBuilder` takes a query and the replies a Generator returns as input and parses them into `GeneratedAnswer` objects. Optionally, it also takes documents and metadata from the Generator as inputs to enrich the `GeneratedAnswer` objects.
`AnswerBuilder` takes a query and the replies a Generator returns as input and parses them into `GeneratedAnswer` objects. Optionally, it also takes documents and metadata from the Generator as inputs to enrich the `GeneratedAnswer` objects.
The `AnswerBuilder` works with both Chat and non-Chat Generators.
@ -83,4 +83,4 @@ result = p.run(
)
print(result)
```
```
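As a quick standalone illustration of the behavior described above, a minimal sketch; the query and reply strings are made up for the example:

```python
from haystack.components.builders import AnswerBuilder

builder = AnswerBuilder()

# Parses the Generator replies into GeneratedAnswer objects tied to the query.
result = builder.run(query="What is the capital of France?", replies=["Paris"])

answer = result["answers"][0]
print(answer.data)   # "Paris"
print(answer.query)  # "What is the capital of France?"
```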

View File

@ -11,7 +11,7 @@ Classifies the documents based on the provided labels and adds them to their met
| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [MetadataRouter](../../../docs/pipeline-components/routers/metadatarouter.mdx) |
| **Most common position in a pipeline** | Before a [MetadataRouter](/docs/pipeline-components/routers/metadatarouter.mdx) |
| **Mandatory init variables** | “model”: The name or path of a Hugging Face model for zero-shot document classification <br /> <br />“labels”: The set of possible class labels to classify each document into, for example, ["positive", "negative"]. The labels depend on the selected model. |
| **Mandatory run variables** | “documents”: A list of documents to classify |
| **Output variables** | “documents”: A list of processed documents with an added “classification” metadata field |
@ -22,17 +22,17 @@ Classifies the documents based on the provided labels and adds them to their met
The `TransformersZeroShotDocumentClassifier` component performs zero-shot classification of documents based on the labels that you set and adds the predicted label to their metadata.
The component uses a Hugging Face pipeline for zero-shot classification.
To initialize the component, provide the model and the set of labels to be used for categorization.
The component uses a Hugging Face pipeline for zero-shot classification.
To initialize the component, provide the model and the set of labels to be used for categorization.
You can additionally configure the component to allow multiple labels to be true by setting the `multi_label` boolean to `True`.
Classification is run on the document's content field by default. If you want it to run on another field, set the `classification_field` to one of the document's metadata fields.
The classification results are stored in the `classification` dictionary within each document's metadata. If `multi_label` is set to `True`, you will find the scores for each label under the `details` key within the `classification` dictionary.
Available models for the task of zero-shot-classification are:
- `valhalla/distilbart-mnli-12-3`
- `cross-encoder/nli-distilroberta-base`
Available models for the task of zero-shot-classification are:
- `valhalla/distilbart-mnli-12-3`
- `cross-encoder/nli-distilroberta-base`
- `cross-encoder/nli-deberta-v3-xsmall`
## Usage
@ -45,19 +45,19 @@ from haystack.components.classifiers import TransformersZeroShotDocumentClassifi
documents = [Document(id="0", content="Cats don't get teeth cavities."),
Document(id="1", content="Cucumbers can be grown in water.")]
document_classifier = TransformersZeroShotDocumentClassifier(
model="cross-encoder/nli-deberta-v3-xsmall",
labels=["animals", "food"],
)
document_classifier.warm_up()
document_classifier.run(documents = documents)
document_classifier.run(documents = documents)
```
### In a pipeline
The following is a pipeline that classifies documents based on predefined classification labels
The following is a pipeline that classifies documents based on predefined classification labels
retrieved from a search pipeline:
```python
@ -92,4 +92,4 @@ for idx, query in enumerate(queries):
assert result["document_classifier"]["documents"][0].to_dict()["id"] == str(idx)
assert (result["document_classifier"]["documents"][0].to_dict()["classification"]["label"]
== expected_predictions[idx])
```
```

View File

@ -11,7 +11,7 @@ This component creates pull requests from a fork back to the original repository
| | |
| --- | --- |
| **Most common position in a pipeline** | At the end of a pipeline, after [GitHubRepoForker](../../../docs/pipeline-components/connectors/githubrepoforker.mdx), [GitHubFileEditor](../../../docs/pipeline-components/connectors/githubfileeditor.mdx) and other components that prepare changes for submission |
| **Most common position in a pipeline** | At the end of a pipeline, after [GitHubRepoForker](/docs/pipeline-components/connectors/githubrepoforker.mdx), [GitHubFileEditor](/docs/pipeline-components/connectors/githubfileeditor.mdx) and other components that prepare changes for submission |
| **Mandatory init variables** | "github_token": GitHub personal access token. Can be set with `GITHUB_TOKEN` env var. |
| **Mandatory run variables** | "issue_url": GitHub issue URL <br /> <br />"title": PR title <br /> <br />"branch": Source branch <br /> <br />"base": Target branch |
| **Output variables** | "result": String indicating the pull request creation result |
@ -26,7 +26,7 @@ Key features:
- **Cross-repository PRs**: Creates pull requests from your fork to the original repository
- **Issue linking**: Automatically links the PR to the specified GitHub issue
- **Draft support**: Option to create draft pull requests
- **Draft support**: Option to create draft pull requests
- **Fork validation**: Checks that the required fork exists before creating the PR
As optional parameters, you can set `body` to provide a pull request description and the boolean parameter `draft` to open a draft pull request.
@ -72,4 +72,4 @@ print(result)
```bash
{'result': 'Pull request #456 created successfully and linked to issue #123'}
```
```
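A minimal standalone sketch matching the run variables in the table above; the import path is an assumption for the GitHub integration package, and the issue URL and branch names are placeholders:

```python
from haystack.utils import Secret
# Import path assumed for the GitHub integration package.
from haystack_integrations.components.connectors.github import GitHubPRCreator

pr_creator = GitHubPRCreator(github_token=Secret.from_env_var("GITHUB_TOKEN"))

# Placeholder issue URL and branch names.
result = pr_creator.run(
    issue_url="https://github.com/owner/repo/issues/123",
    title="Fix: resolve issue #123",
    branch="fix-123",
    base="main",
)
print(result["result"])
```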

View File

@ -11,7 +11,7 @@ This component forks a GitHub repository from an issue URL through the GitHub AP
| | |
| --- | --- |
| **Most common position in a pipeline** | Right at the beginning of a pipeline and before an [Agent](../../../docs/pipeline-components/agents-1/agent.mdx) component that expects the name of a GitHub branch as input |
| **Most common position in a pipeline** | Right at the beginning of a pipeline and before an [Agent](/docs/pipeline-components/agents-1/agent.mdx) component that expects the name of a GitHub branch as input |
| **Mandatory init variables** | "github_token": GitHub personal access token. Can be set with `GITHUB_TOKEN` env var. |
| **Mandatory run variables** | "url": The URL of a GitHub issue in the repository that should be forked |
| **Output variables** | "repo": Fork repository path <br /> <br />"issue_branch": Issue-specific branch name (if created) |
@ -64,4 +64,4 @@ print(result)
```bash
{'repo': 'owner/repo', 'issue_branch': 'fix-123'}
```
```
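A minimal standalone sketch matching the run variable in the table above; the import path is an assumption for the GitHub integration package, and the issue URL is a placeholder:

```python
from haystack.utils import Secret
# Import path assumed for the GitHub integration package.
from haystack_integrations.components.connectors.github import GitHubRepoForker

forker = GitHubRepoForker(github_token=Secret.from_env_var("GITHUB_TOKEN"))

# Placeholder issue URL; the component forks the repository the issue belongs to.
result = forker.run(url="https://github.com/owner/repo/issues/123")
print(result["repo"])               # fork repository path, e.g. "your-user/repo"
print(result.get("issue_branch"))   # issue-specific branch name, if one was created
```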

View File

@ -11,7 +11,7 @@ This component navigates and fetches content from GitHub repositories through th
| | |
| --- | --- |
| **Most common position in a pipeline** | Right at the beginning of a pipeline and before a [ChatPromptBuilder](../../../docs/pipeline-components/builders/chatpromptbuilder.mdx) that expects the content of GitHub files as input |
| **Most common position in a pipeline** | Right at the beginning of a pipeline and before a [ChatPromptBuilder](/docs/pipeline-components/builders/chatpromptbuilder.mdx) that expects the content of GitHub files as input |
| **Mandatory run variables** | "path": Repository path to view <br /> <br />"repo": Repository in owner/repo format |
| **Output variables** | "documents": A list of documents containing repository contents |
| **API reference** | [GitHub](/reference/integrations-github) |
@ -85,4 +85,4 @@ print(result)
```bash
{'documents': [Document(id=..., content: '<div align="center">
<a href="https://haystack.deepset.ai/"><img src="https://raw.githubuserconten...', meta: {'path': 'README.md', 'type': 'file_content', 'size': 11979, 'url': 'https://github.com/deepset-ai/haystack/blob/main/README.md'})]}
```
```
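A minimal standalone sketch of the same call shown above; the import path is an assumption for the GitHub integration package:

```python
# Import path assumed for the GitHub integration package.
from haystack_integrations.components.connectors.github import GitHubRepoViewer

viewer = GitHubRepoViewer()

# Fetches the given path from the repository and returns its contents as documents.
result = viewer.run(repo="deepset-ai/haystack", path="README.md")
for doc in result["documents"]:
    print(doc.meta.get("path"), doc.meta.get("type"))
```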

View File

@ -12,8 +12,8 @@ description: "`OpenAPIServiceConnector` is a component that acts as an interface
| | |
| --- | --- |
| **Most common position in a pipeline** | Flexible |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) objects where the last message is expected to carry the parameter invocation payload. <br /> <br />“service_openapi_spec”: OpenAPI specification of the service being invoked. It can be YAML/JSON, and all ref values must be resolved. <br /> <br />“service_credentials”: Authentication credentials for the service. We currently support two OpenAPI spec v3 security schemes: <br /> <br />1. http for Basic, Bearer, and other HTTP authentication schemes; <br />2. apiKey for API keys and cookie authentication. |
| **Output variables** | “service_response”: A dictionary that is a list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) objects where each message corresponds to a function invocation. <br />If a user specifies multiple function calling requests, there will be multiple responses. |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) objects where the last message is expected to carry the parameter invocation payload. <br /> <br />“service_openapi_spec”: OpenAPI specification of the service being invoked. It can be YAML/JSON, and all ref values must be resolved. <br /> <br />“service_credentials”: Authentication credentials for the service. We currently support two OpenAPI spec v3 security schemes: <br /> <br />1. http for Basic, Bearer, and other HTTP authentication schemes; <br />2. apiKey for API keys and cookie authentication. |
| **Output variables** | “service_response”: A dictionary that is a list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) objects where each message corresponds to a function invocation. <br />If a user specifies multiple function calling requests, there will be multiple responses. |
| **API reference** | [Connectors](/reference/connectors-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/connectors/openapi_service.py |
@ -95,13 +95,13 @@ pipe.connect("a3", "llm.messages")
user_prompt = "Why was Sam Altman ousted from OpenAI?"
result = pipe.run(data={"functions_llm": {"messages":[ChatMessage.from_system("Only do function calling"), ChatMessage.from_user(user_prompt)]},
result = pipe.run(data={"functions_llm": {"messages":[ChatMessage.from_system("Only do function calling"), ChatMessage.from_user(user_prompt)]},
"openapi_container": {"service_credentials": serper_dev_key},
"spec_to_functions": {"sources": [ByteStream.from_string(serper_spec)]},
"a3": {"system_message": [ChatMessage.from_system(system_prompt)]}})
>Sam Altman was ousted from OpenAI on November 17, 2023, following
>a "deliberative review process" by the board of directors. The board concluded
>that he was not "consistently candid in his communications". However, he
>Sam Altman was ousted from OpenAI on November 17, 2023, following
>a "deliberative review process" by the board of directors. The board concluded
>that he was not "consistently candid in his communications". However, he
>returned as CEO just days after his ouster.
```
```

View File

@ -11,7 +11,7 @@ description: "`AzureOCRDocumentConverter` converts files to documents using Azur
| | |
| --- | --- |
| **Most common position in a pipeline** | Before [PreProcessors](../../../docs/pipeline-components/preprocessors.mdx), or right at the beginning of an indexing pipeline |
| **Most common position in a pipeline** | Before [PreProcessors](/docs/pipeline-components/preprocessors.mdx), or right at the beginning of an indexing pipeline |
| **Mandatory init variables** | "endpoint": The endpoint of your Azure resource <br /> <br />"api_key": The API key of your Azure resource. Can be set with `AZURE_AI_API_KEY` environment variable. |
| **Mandatory run variables** | "sources": A list of file paths |
| **Output variables** | "documents": A list of documents <br /> <br />"raw_azure_response": A list of raw responses from Azure |
@ -76,4 +76,4 @@ pipeline.connect("splitter", "writer")
file_names = ["my_file.pdf"]
pipeline.run({"converter": {"sources": file_names}})
```
```
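Beyond the pipeline snippet above, a minimal standalone sketch of the converter on its own; the endpoint and file name are placeholders, and the API key is read from the environment variable listed in the table:

```python
from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

# Placeholder endpoint; the API key is read from AZURE_AI_API_KEY.
converter = AzureOCRDocumentConverter(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    api_key=Secret.from_env_var("AZURE_AI_API_KEY"),
)

result = converter.run(sources=["my_file.pdf"])
print(result["documents"])           # extracted documents
print(result["raw_azure_response"])  # raw responses from Azure
```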

View File

@ -11,9 +11,9 @@ Converts JSON files to text documents.
| | |
| --- | --- |
| **Most common position in a pipeline** | Before [PreProcessors](../../../docs/pipeline-components/preprocessors.mdx), or right at the beginning of an indexing pipeline |
| **Most common position in a pipeline** | Before [PreProcessors](/docs/pipeline-components/preprocessors.mdx), or right at the beginning of an indexing pipeline |
| **Mandatory init variables** | ONE OF, OR BOTH: <br /> <br />"jq_schema": A jq filter string to extract content <br /> <br />"content_key": A key string to extract document content |
| **Mandatory run variables** | "sources": A list of file paths or [ByteStream](../../../docs/concepts/data-classes.mdx#bytestresm) objects |
| **Mandatory run variables** | "sources": A list of file paths or [ByteStream](/docs/concepts/data-classes.mdx#bytestresm) objects |
| **Output variables** | "documents": A list of documents |
| **API reference** | [Converters](/reference/converters-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/json.py |
@ -107,4 +107,4 @@ print(documents[1].content)
print(documents[1].meta)
## {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
```
```
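A small self-contained sketch using `content_key`, with made-up JSON content passed in as a `ByteStream`:

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

# Extract the value under "text" as the document content.
converter = JSONConverter(content_key="text")

source = ByteStream.from_string(json.dumps({"text": "Haystack is an LLM framework.", "author": "example"}))
result = converter.run(sources=[source])
print(result["documents"][0].content)
```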

View File

@ -11,8 +11,8 @@ Converts Microsoft Outlook .msg files to documents.
| | |
| --- | --- |
| **Most common position in a pipeline** | Before [PreProcessors](../../../docs/pipeline-components/preprocessors.mdx), or right at the beginning of an indexing pipeline |
| **Mandatory run variables** | "sources": A list of .msg file paths or [ByteStream](../../../docs/concepts/data-classes.mdx#bytestream) objects |
| **Most common position in a pipeline** | Before [PreProcessors](/docs/pipeline-components/preprocessors.mdx), or right at the beginning of an indexing pipeline |
| **Mandatory run variables** | "sources": A list of .msg file paths or [ByteStream](/docs/concepts/data-classes.mdx#bytestream) objects |
| **Output variables** | "documents": A list of documents <br /> <br />"attachments": A list of ByteStream objects representing file attachments |
| **API reference** | [Converters](/reference/converters-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/msg.py |
@ -67,4 +67,4 @@ pipeline.connect("converter.documents", "writer.documents")
file_names = ["email1.msg", "email2.msg"]
pipeline.run({"converter": {"sources": file_names}})
```
```
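A minimal standalone sketch; the class name `MSGToDocument` is an assumption based on the msg converter module linked above, and the file name is a placeholder:

```python
# Class name assumed from the msg converter module linked above.
from haystack.components.converters import MSGToDocument

converter = MSGToDocument()

# Placeholder file name.
result = converter.run(sources=["email1.msg"])
print(result["documents"])    # the e-mail content as documents
print(result["attachments"])  # file attachments as ByteStream objects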

View File

@ -12,7 +12,7 @@ description: "`OpenAPIServiceToFunctions` is a component that transforms OpenAPI
| | |
| --- | --- |
| **Most common position in a pipeline** | Flexible |
| **Mandatory run variables** | “sources”: A list of OpenAPI specification sources, which can be file paths or [`ByteStream`](../../../docs/concepts/data-classes.mdx#bytestream) objects |
| **Mandatory run variables** | “sources”: A list of OpenAPI specification sources, which can be file paths or [`ByteStream`](/docs/concepts/data-classes.mdx#bytestream) objects |
| **Output variables** | “functions”: A list of JSON OpenAI function calling definition objects. For each path definition in the OpenAPI specification, a corresponding OpenAI function calling definition is generated. <br /> <br />“openapi_specs”: A list of JSON/YAML objects with references resolved. Such an OpenAPI spec (with references resolved) can, in turn, be used as input to OpenAPIServiceConnector. |
| **API reference** | [Converters](/reference/converters-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/openapi_functions.py |
@ -93,13 +93,13 @@ pipe.connect("a3", "llm.messages")
user_prompt = "Why was Sam Altman ousted from OpenAI?"
result = pipe.run(data={"functions_llm": {"messages":[ChatMessage.from_system("Only do function calling"), ChatMessage.from_user(user_prompt)]},
result = pipe.run(data={"functions_llm": {"messages":[ChatMessage.from_system("Only do function calling"), ChatMessage.from_user(user_prompt)]},
"openapi_container": {"service_credentials": serper_dev_key},
"spec_to_functions": {"sources": [ByteStream.from_string(serper_spec)]},
"a3": {"system_message": [ChatMessage.from_system(system_prompt)]}})
>Sam Altman was ousted from OpenAI on November 17, 2023, following
>a "deliberative review process" by the board of directors. The board concluded
>that he was not "consistently candid in his communications". However, he
>Sam Altman was ousted from OpenAI on November 17, 2023, following
>a "deliberative review process" by the board of directors. The board concluded
>that he was not "consistently candid in his communications". However, he
>returned as CEO just days after his ouster.
```
```

View File

@ -11,7 +11,7 @@ This component computes embeddings for documents using models through Amazon Bed
| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../../../docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](/docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory init variables** | "model": The embedding model to use <br /> <br />"aws_access_key_id": AWS access key ID. Can be set with `AWS_ACCESS_KEY_ID` env var. <br /> <br />"aws_secret_access_key": AWS secret access key. Can be set with `AWS_SECRET_ACCESS_KEY` env var. <br /> <br />"aws_region_name": AWS region name. Can be set with `AWS_DEFAULT_REGION` env var. |
| **Mandatory run variables** | “documents”: A list of documents to be embedded |
| **Output variables** | “documents”: A list of documents (enriched with embeddings) |
@ -34,7 +34,7 @@ This component should be used to embed a list of documents. To embed a string, y
### Authentication
`AmazonBedrockDocumentEmbedder` uses AWS for authentication. You can either provide credentials as parameters directly to the component or use the AWS CLI and authenticate through your IAM. For more information on how to set up an IAM identity-based policy, see the [official documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/security_iam_id-based-policy-examples.html).
`AmazonBedrockDocumentEmbedder` uses AWS for authentication. You can either provide credentials as parameters directly to the component or use the AWS CLI and authenticate through your IAM. For more information on how to set up an IAM identity-based policy, see the [official documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/security_iam_id-based-policy-examples.html).
To initialize `AmazonBedrockDocumentEmbedder` and authenticate by providing credentials, provide the `model`, as well as `aws_access_key_id`, `aws_secret_access_key`, and `aws_region_name`. Other parameters are optional. You can check them out in our [API reference](/reference/integrations-amazon-bedrock#amazonbedrockdocumentembedder).
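Following on from that, a minimal standalone sketch; the import path and the model ID are assumptions (any Bedrock embedding model you have access to would work), and AWS credentials are read from the environment variables listed in the table above:

```python
from haystack import Document
# Import path assumed for the Amazon Bedrock integration package.
from haystack_integrations.components.embedders.amazon_bedrock import AmazonBedrockDocumentEmbedder

# The model ID is an example; credentials come from the AWS env vars listed above.
embedder = AmazonBedrockDocumentEmbedder(model="amazon.titan-embed-text-v1")

docs = [Document(content="Berlin is the capital of Germany.")]
result = embedder.run(documents=docs)
print(result["documents"][0].embedding[:5])
```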
### Model-specific parameters
@ -147,4 +147,4 @@ print(result['retriever']['documents'][0])
## Additional References
:cook: Cookbook: [PDF-Based Question Answering with Amazon Bedrock and Haystack](https://haystack.deepset.ai/cookbook/amazon_bedrock_for_documentation_qa)
:cook: Cookbook: [PDF-Based Question Answering with Amazon Bedrock and Haystack](https://haystack.deepset.ai/cookbook/amazon_bedrock_for_documentation_qa)

View File

@ -11,7 +11,7 @@ description: "`AmazonBedrockDocumentImageEmbedder` computes image embeddings for
| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](https://docs.haystack.deepset.ai../../../docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](https://docs.haystack.deepset.ai/docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory init variables** | "model": The multimodal embedding model to use. <br /> <br />"aws_access_key_id": AWS access key ID. Can be set with `AWS_ACCESS_KEY_ID` env var. <br /> <br />"aws_secret_access_key": AWS secret access key. Can be set with `AWS_SECRET_ACCESS_KEY` env var. <br /> <br />"aws_region_name": AWS region name. Can be set with `AWS_DEFAULT_REGION` env var. |
| **Mandatory run variables** | "documents": A list of documents, with a meta field containing an image file path |
| **Output variables** | "documents": A list of documents (enriched with embeddings) |
@ -138,4 +138,4 @@ res = query.run({"text_embedder": {"text": "Which document shows a horse?"}})
## Additional References
:notebook: Tutorial: [Creating Vision+Text RAG Pipelines](https://haystack.deepset.ai/tutorials/46_multimodal_rag)
:notebook: Tutorial: [Creating Vision+Text RAG Pipelines](https://haystack.deepset.ai/tutorials/46_multimodal_rag)

View File

@ -11,7 +11,7 @@ This component computes embeddings for text (such as a query) using models throu
| | |
| --- | --- |
| **Most common position in a pipeline** | Before an embedding [Retriever](../../../docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Most common position in a pipeline** | Before an embedding [Retriever](/docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Mandatory init variables** | "model": The embedding model to use <br /> <br />"aws_access_key_id": AWS access key ID. Can be set with `AWS_ACCESS_KEY_ID` env var. <br /> <br />"aws_secret_access_key": AWS secret access key. Can be set with `AWS_SECRET_ACCESS_KEY` env var. <br /> <br />"aws_region_name": AWS region name. Can be set with `AWS_DEFAULT_REGION` env var. |
| **Mandatory run variables** | “text”: A string |
| **Output variables** | “embedding”: A list of float numbers (vector) |
@ -28,7 +28,7 @@ Use `AmazonBedrockTextEmbedder` to embed a simple string (such as a query) int
### Authentication
`AmazonBedrockTextEmbedder` uses AWS for authentication. You can either provide credentials as parameters directly to the component or use the AWS CLI and authenticate through your IAM. For more information on how to set up an IAM identity-based policy, see the [official documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/security_iam_id-based-policy-examples.html).
`AmazonBedrockTextEmbedder` uses AWS for authentication. You can either provide credentials as parameters directly to the component or use the AWS CLI and authenticate through your IAM. For more information on how to set up an IAM identity-based policy, see the [official documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/security_iam_id-based-policy-examples.html).
To initialize `AmazonBedrockTextEmbedder` and authenticate by providing credentials, provide the `model` name, as well as `aws_access_key_id`, `aws_secret_access_key`, and `aws_region_name`. Other parameters are optional, you can check them out in our [API reference](/reference/integrations-amazon-bedrock#amazonbedrocktextembedder).
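Following on from that, a minimal standalone sketch; the import path and the model ID are assumptions, and AWS credentials are read from the environment variables listed in the table above:

```python
# Import path assumed for the Amazon Bedrock integration package.
from haystack_integrations.components.embedders.amazon_bedrock import AmazonBedrockTextEmbedder

# The model ID is an example; credentials come from the AWS env vars listed above.
embedder = AmazonBedrockTextEmbedder(model="amazon.titan-embed-text-v1")

result = embedder.run(text="Who lives in Berlin?")
print(len(result["embedding"]))
```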
### Model-specific parameters
@ -116,4 +116,4 @@ print(result['retriever']['documents'][0])
## Additional References
:cook: Cookbook: [PDF-Based Question Answering with Amazon Bedrock and Haystack](https://haystack.deepset.ai/cookbook/amazon_bedrock_for_documentation_qa)
:cook: Cookbook: [PDF-Based Question Answering with Amazon Bedrock and Haystack](https://haystack.deepset.ai/cookbook/amazon_bedrock_for_documentation_qa)

View File

@ -11,7 +11,7 @@ This component computes the embeddings of a list of documents and stores the obt
| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../../../docs/pipeline-components/writers/documentwriter.mdx) |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](/docs/pipeline-components/writers/documentwriter.mdx) |
| **Mandatory init variables** | "api_key": The Azure OpenAI API key. Can be set with `AZURE_OPENAI_API_KEY` env var. <br />"azure_endpoint": The endpoint of the model deployed on Azure. |
| **Mandatory run variables** | "documents": A list of documents |
| **Output variables** | "documents": A list of documents (enriched with embeddings) <br /> <br />"meta": A dictionary of metadata |
@ -111,4 +111,4 @@ print(result['retriever']['documents'][0])
## Document(id=..., mimetype: 'text/plain',
## text: 'My name is Wolfgang and I live in Berlin')
```
```
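For a standalone view of the component described above, a minimal sketch; the endpoint and deployment name are placeholders, and the API key is read from the environment variable listed in the table:

```python
from haystack import Document
from haystack.components.embedders import AzureOpenAIDocumentEmbedder

# Endpoint and deployment name are placeholders; the API key is read
# from AZURE_OPENAI_API_KEY by default.
embedder = AzureOpenAIDocumentEmbedder(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    azure_deployment="text-embedding-3-small",
)

docs = [Document(content="My name is Wolfgang and I live in Berlin")]
result = embedder.run(documents=docs)
print(result["documents"][0].embedding[:5])
print(result["meta"])
```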

View File

@ -11,7 +11,7 @@ When you perform embedding retrieval, you use this component to transform your q
| | |
| --- | --- |
| **Most common position in a pipeline** | Before an embedding [Retriever](../../../docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Most common position in a pipeline** | Before an embedding [Retriever](/docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Mandatory init variables** | "api_key": The Azure OpenAI API key. Can be set with `AZURE_OPENAI_API_KEY` env var. <br />"azure_endpoint": The endpoint of the model deployed on Azure. |
| **Mandatory run variables** | "text": A string |
| **Output variables** | "embedding": A list of float numbers <br /> <br />"meta": A dictionary of metadata |
@ -92,4 +92,4 @@ print(result['retriever']['documents'][0])
## Document(id=..., mimetype: 'text/plain',
## text: 'My name is Wolfgang and I live in Berlin')
```
```
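For a standalone view of the component described above, a minimal sketch; the endpoint and deployment name are placeholders, and the API key is read from the environment variable listed in the table:

```python
from haystack.components.embedders import AzureOpenAITextEmbedder

# Endpoint and deployment name are placeholders; the API key is read
# from AZURE_OPENAI_API_KEY by default.
embedder = AzureOpenAITextEmbedder(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    azure_deployment="text-embedding-3-small",
)

result = embedder.run(text="Who lives in Berlin?")
print(result["embedding"][:5])
print(result["meta"])
```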

View File

@ -13,7 +13,7 @@ The vectors computed by this component are necessary to perform embedding retrie
| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../../../docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](/docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory init variables** | "api_key": The Cohere API key. Can be set with `COHERE_API_KEY` or `CO_API_KEY` env var. |
| **Mandatory run variables** | “documents”: A list of documents to be embedded |
| **Output variables** | “documents”: A list of documents (enriched with embeddings) <br /> <br />“meta”: A dictionary of metadata strings |
@ -24,9 +24,9 @@ The vectors computed by this component are necessary to perform embedding retrie
`CohereDocumentEmbedder` enriches the metadata of documents with an embedding of their content. To embed a string, you should use the [`CohereTextEmbedder`](https://docs.haystack.deepset.ai/v2.0/docs/coheretextembedder).
The component supports the following Cohere models:
`"embed-english-v3.0"`, `"embed-english-light-v3.0"`, `"embed-multilingual-v3.0"`,
`"embed-multilingual-light-v3.0"`, `"embed-english-v2.0"`, `"embed-english-light-v2.0"`,
The component supports the following Cohere models:
`"embed-english-v3.0"`, `"embed-english-light-v3.0"`, `"embed-multilingual-v3.0"`,
`"embed-multilingual-light-v3.0"`, `"embed-english-v2.0"`, `"embed-english-light-v2.0"`,
`"embed-multilingual-v2.0"`. The default model is `embed-english-v2.0`. This list of all supported models can be found in Coheres [model documentation](https://docs.cohere.com/docs/models#representation).
To start using this integration with Haystack, install it with:
@ -117,4 +117,4 @@ result = query_pipeline.run({"text_embedder":{"text": query}})
print(result['retriever']['documents'][0])
## Document(id=..., text: 'My name is Wolfgang and I live in Berlin')
```
```
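For a standalone view of the component described above, a minimal sketch; the import path is an assumption for the Cohere integration package, the model name is one of the supported models listed above, and the API key is read from the environment variables in the table:

```python
from haystack import Document
# Import path assumed for the Cohere integration package.
from haystack_integrations.components.embedders.cohere import CohereDocumentEmbedder

# Example model; the API key is read from COHERE_API_KEY or CO_API_KEY.
embedder = CohereDocumentEmbedder(model="embed-english-v3.0")

docs = [Document(content="My name is Wolfgang and I live in Berlin")]
result = embedder.run(documents=docs)
print(result["documents"][0].embedding[:5])
```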

View File

@ -11,7 +11,7 @@ This component transforms a string into a vector that captures its semantics usi
| | |
| --- | --- |
| **Most common position in a pipeline** | Before an embedding [Retriever](../../../docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Most common position in a pipeline** | Before an embedding [Retriever](/docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Mandatory init variables** | "api_key": The Cohere API key. Can be set with `COHERE_API_KEY` or `CO_API_KEY` env var. |
| **Mandatory run variables** | “text”: A string |
| **Output variables** | “embedding”: A list of float numbers (vectors) <br /> <br />“meta”: A dictionary of metadata strings |
@ -22,9 +22,9 @@ This component transforms a string into a vector that captures its semantics usi
`CohereTextEmbedder` embeds a simple string (such as a query) into a vector. For embedding lists of documents, use the [`CohereDocumentEmbedder`](https://docs.haystack.deepset.ai/v2.0/docs/coheredocumentembedder), which enriches each document with the computed embedding, also known as a vector.
The component supports the following Cohere models:
`"embed-english-v3.0"`, `"embed-english-light-v3.0"`, `"embed-multilingual-v3.0"`,
`"embed-multilingual-light-v3.0"`, `"embed-english-v2.0"`, `"embed-english-light-v2.0"`,
The component supports the following Cohere models:
`"embed-english-v3.0"`, `"embed-english-light-v3.0"`, `"embed-multilingual-v3.0"`,
`"embed-multilingual-light-v3.0"`, `"embed-english-v2.0"`, `"embed-english-light-v2.0"`,
`"embed-multilingual-v2.0"`. The default model is `embed-english-v2.0`. This list of all supported models can be found in Coheres [model documentation](https://docs.cohere.com/docs/models#representation).
To start using this integration with Haystack, install it with:
@ -91,4 +91,4 @@ result = query_pipeline.run({"text_embedder":{"text": query}})
print(result['retriever']['documents'][0])
## Document(id=..., content: 'My name is Wolfgang and I live in Berlin')
```
```

View File

@ -11,7 +11,7 @@ The vectors computed by this component are necessary to perform embedding retrie
| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [DocumentWriter](../../../docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Most common position in a pipeline** | Before a [DocumentWriter](/docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory init variables** | "api_key": The Google API key. Can be set with `GOOGLE_API_KEY` or `GEMINI_API_KEY` env var. |
| **Mandatory run variables** | "documents": A list of documents to be embedded |
| **Output variables** | "documents": A list of documents (enriched with embeddings) <br /> <br />"meta": A dictionary of metadata |
@ -37,7 +37,7 @@ pip install google-genai-haystack
Google Gen AI is compatible with both the Gemini Developer API and the Vertex AI API.
To use this component with the Gemini Developer API and get an API key, visit [Google AI Studio](https://aistudio.google.com/).
To use this component with the Gemini Developer API and get an API key, visit [Google AI Studio](https://aistudio.google.com/).
To use this component with the Vertex AI API, visit [Google Cloud > Vertex AI](https://cloud.google.com/vertex-ai).
The component uses a `GOOGLE_API_KEY` or `GEMINI_API_KEY` environment variable by default. Otherwise, you can pass an API key at initialization with a [Secret](../../concepts/secret-management.mdx) and `Secret.from_token` static method:
@ -153,4 +153,4 @@ result = query_pipeline.run({"text_embedder":{"text": query}})
print(result['retriever']['documents'][0])
## Document(id=..., content: 'My name is Wolfgang and I live in Berlin')
```
```
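For a standalone view of the component described above, a minimal sketch; the import path, class name, and model name are assumptions for the google-genai-haystack package, and the API key is read from the environment variables listed in the table:

```python
from haystack import Document
# Import path and class name assumed for the google-genai-haystack package.
from haystack_integrations.components.embedders.google_genai import GoogleGenAIDocumentEmbedder

# Example model name; the API key is read from GOOGLE_API_KEY or GEMINI_API_KEY.
embedder = GoogleGenAIDocumentEmbedder(model="text-embedding-004")

docs = [Document(content="My name is Wolfgang and I live in Berlin")]
result = embedder.run(documents=docs)
print(result["documents"][0].embedding[:5])
```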

View File

@ -11,7 +11,7 @@ This component transforms a string into a vector that captures its semantics usi
| | |
| --- | --- |
| **Most common position in a pipeline** | Before an embedding [Retriever](../../../docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Most common position in a pipeline** | Before an embedding [Retriever](/docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Mandatory init variables** | "api_key": The Google API key. Can be set with `GOOGLE_API_KEY` or `GEMINI_API_KEY` env var. |
| **Mandatory run variables** | "text": A string |
| **Output variables** | "embedding": A list of float numbers <br /> <br />"meta": A dictionary of metadata |
@ -37,7 +37,7 @@ pip install google-genai-haystack
Google Gen AI is compatible with both the Gemini Developer API and the Vertex AI API.
To use this component with the Gemini Developer API and get an API key, visit [Google AI Studio](https://aistudio.google.com/).
To use this component with the Gemini Developer API and get an API key, visit [Google AI Studio](https://aistudio.google.com/).
To use this component with the Vertex AI API, visit [Google Cloud > Vertex AI](https://cloud.google.com/vertex-ai).
The component uses a `GOOGLE_API_KEY` or `GEMINI_API_KEY` environment variable by default. Otherwise, you can pass an API key at initialization with a [Secret](../../concepts/secret-management.mdx) and `Secret.from_token` static method:
@ -130,4 +130,4 @@ result = query_pipeline.run({"text_embedder":{"text": query}})
print(result['retriever']['documents'][0])
## Document(id=..., content: 'My name is Wolfgang and I live in Berlin')
```
```

View File

@ -11,7 +11,7 @@ Use this component to compute document embeddings using various Hugging Face API
| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../../../docs/pipeline-components/writers/documentwriter.mdx)  in an indexing pipeline |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](/docs/pipeline-components/writers/documentwriter.mdx)  in an indexing pipeline |
| **Mandatory init variables** | "api_type": The type of Hugging Face API to use <br /> <br />"api_params": A dictionary with one of the following keys: <br /> <br />- `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.**OR** - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or `TEXT_EMBEDDINGS_INFERENCE`. <br /> <br />"token": The Hugging Face API token. Can be set with `HF_API_TOKEN` or `HF_TOKEN` env var. |
| **Mandatory run variables** | “documents”: A list of documents to be embedded |
| **Output variables** | “documents”: A list of documents to be embedded (enriched with embeddings) |
@ -30,7 +30,7 @@ Use this component to compute document embeddings using various Hugging Face API
This component should be used to embed a list of documents. To embed a string, use [`HuggingFaceAPITextEmbedder`](huggingfaceapitextembedder.mdx).
:::
The component uses a `HF_API_TOKEN` environment variable by default. Otherwise, you can pass a Hugging Face API token at initialization with `token`; see the code examples below.
The component uses a `HF_API_TOKEN` environment variable by default. Otherwise, you can pass a Hugging Face API token at initialization with `token`; see the code examples below.
The token is needed:
- If you use the Serverless Inference API, or
@ -38,7 +38,7 @@ The token is needed:
## Usage
Similarly to other Document Embedders, this component allows adding prefixes (and postfixes) to include instruction and embedding metadata.
Similarly to other Document Embedders, this component allows adding prefixes (and postfixes) to include instruction and embedding metadata.
For more fine-grained details, refer to the component's [API reference](/reference/embedders-api#huggingfaceapidocumentembedder).
### On its own
@ -47,8 +47,8 @@ For more fine-grained details, refer to the components [API reference](/refer
Formerly known as the (free) Hugging Face Inference API, this API allows you to quickly experiment with many models hosted on the Hugging Face Hub, offloading the inference to Hugging Face servers. It's rate-limited and not meant for production.
To use this API, you need a [free Hugging Face token](https://huggingface.co/settings/tokens).
The Embedder expects the `model` in `api_params`.
To use this API, you need a [free Hugging Face token](https://huggingface.co/settings/tokens).
The Embedder expects the `model` in `api_params`.
```python
from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
@ -73,8 +73,8 @@ In this case, a private instance of the model is deployed by Hugging Face, and y
To understand how to spin up an Inference Endpoint, visit [Hugging Face documentation](https://huggingface.co/inference-endpoints/dedicated).
Additionally, in this case, you need to provide your Hugging Face token.
The Embedder expects the `url` of your endpoint in `api_params`.
Additionally, in this case, you need to provide your Hugging Face token.
The Embedder expects the `url` of your endpoint in `api_params`.
```python
from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
@ -111,7 +111,7 @@ docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingf
For more information, refer to the [official TEI repository](https://github.com/huggingface/text-embeddings-inference).
The Embedder expects the `url` of your TEI instance in `api_params`.
The Embedder expects the `url` of your TEI instance in `api_params`.
```python
from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
@ -168,4 +168,4 @@ result = query_pipeline.run({"text_embedder":{"text": query}})
print(result['retriever']['documents'][0])
## Document(id=..., content: 'My name is Wolfgang and I live in Berlin', ...)
```
```
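Pulling the serverless example sketched in the hunks above into one complete, runnable snippet; the model ID is an example, and passing the API type as a string is an assumption (the enum form can also be used):

```python
from haystack import Document
from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
from haystack.utils import Secret

# Serverless Inference API: pass the model ID in api_params.
# The model ID is an example; the token can also come from HF_API_TOKEN.
embedder = HuggingFaceAPIDocumentEmbedder(
    api_type="serverless_inference_api",
    api_params={"model": "BAAI/bge-small-en-v1.5"},
    token=Secret.from_env_var("HF_API_TOKEN"),
)

docs = [Document(content="My name is Wolfgang and I live in Berlin")]
result = embedder.run(documents=docs)
print(result["documents"][0].embedding[:5])
```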

View File

@ -11,7 +11,7 @@ Use this component to embed strings using various Hugging Face APIs.
| | |
| --- | --- |
| **Most common position in a pipeline** | Before an embedding [Retriever](../../../docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Most common position in a pipeline** | Before an embedding [Retriever](/docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Mandatory init variables** | "api_type": The type of Hugging Face API to use <br /> <br />"api_params": A dictionary with one of the following keys: <br /> <br />- `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.**OR** - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or `TEXT_EMBEDDINGS_INFERENCE`. <br /> <br />"token": The Hugging Face API token. Can be set with `HF_API_TOKEN` or `HF_TOKEN` env var. |
| **Mandatory run variables** | “text”: A string |
| **Output variables** | “embedding”: A list of float numbers |
@ -30,7 +30,7 @@ Use this component to embed strings using various Hugging Face APIs.
This component should be used to embed plain text. To embed a list of documents, use [`HuggingFaceAPIDocumentEmbedder`](huggingfaceapidocumentembedder.mdx).
:::
The component uses a `HF_API_TOKEN` environment variable by default. Otherwise, you can pass a Hugging Face API token at initialization with `token`; see the code examples below.
The component uses a `HF_API_TOKEN` environment variable by default. Otherwise, you can pass a Hugging Face API token at initialization with `token`; see the code examples below.
The token is needed:
- If you use the Serverless Inference API, or
@ -38,7 +38,7 @@ The token is needed:
## Usage
Similarly to other text Embedders, this component allows adding prefixes (and postfixes) to include instructions.
Similarly to other text Embedders, this component allows adding prefixes (and postfixes) to include instructions.
For more fine-grained details, refer to the component's [API reference](/reference/embedders-api#huggingfaceapitextembedder).
### On its own
@ -47,8 +47,8 @@ For more fine-grained details, refer to the components [API reference](/refer
Formerly known as the (free) Hugging Face Inference API, this API allows you to quickly experiment with many models hosted on the Hugging Face Hub, offloading the inference to Hugging Face servers. It's rate-limited and not meant for production.
To use this API, you need a [free Hugging Face token](https://huggingface.co/settings/tokens).
The Embedder expects the `model` in `api_params`.
To use this API, you need a [free Hugging Face token](https://huggingface.co/settings/tokens).
The Embedder expects the `model` in `api_params`.
```python
from haystack.components.embedders import HuggingFaceAPITextEmbedder
@ -69,8 +69,8 @@ In this case, a private instance of the model is deployed by Hugging Face, and y
To understand how to spin up an Inference Endpoint, visit [Hugging Face documentation](https://huggingface.co/inference-endpoints/dedicated).
Additionally, in this case, you need to provide your Hugging Face token.
The Embedder expects the `url` of your endpoint in `api_params`.
Additionally, in this case, you need to provide your Hugging Face token.
The Embedder expects the `url` of your endpoint in `api_params`.
```python
from haystack.components.embedders import HuggingFaceAPITextEmbedder
@ -102,7 +102,7 @@ docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingf
For more information, refer to the [official TEI repository](https://github.com/huggingface/text-embeddings-inference).
The Embedder expects the `url` of your TEI instance in `api_params`.
The Embedder expects the `url` of your TEI instance in `api_params`.
```python
from haystack.components.embedders import HuggingFaceAPITextEmbedder
@ -151,4 +151,4 @@ result = query_pipeline.run({"text_embedder":{"text": query}})
print(result['retriever']['documents'][0])
## Document(id=..., content: 'My name is Wolfgang and I live in Berlin', ...)
```
```

View File

@ -11,7 +11,7 @@ This component computes the embeddings of a list of documents and stores the obt
| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../../../docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](/docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory init variables** | "api_key": The Jina API key. Can be set with `JINA_API_KEY` env var. |
| **Mandatory run variables** | “documents”: A list of documents |
| **Output variables** | “documents”: A list of documents (enriched with embeddings) <br /> <br />“meta”: A dictionary of metadata |
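Based on the table above, a minimal standalone sketch; the import path is an assumption for the Jina integration package, the model name is an example, and the API key is read from the environment variable listed in the table:

```python
from haystack import Document
# Import path assumed for the Jina integration package.
from haystack_integrations.components.embedders.jina import JinaDocumentEmbedder

# Example model name; the API key is read from JINA_API_KEY by default.
embedder = JinaDocumentEmbedder(model="jina-embeddings-v2-base-en")

docs = [Document(content="My name is Wolfgang and I live in Berlin")]
result = embedder.run(documents=docs)
print(result["documents"][0].embedding[:5])
```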
@ -121,4 +121,4 @@ print(result['retriever']['documents'][0])
## Additional References
:cook: Cookbook: [Using the Jina-embeddings-v2-base-en model in a Haystack RAG pipeline for legal document analysis](https://haystack.deepset.ai/cookbook/jina-embeddings-v2-legal-analysis-rag)
:cook: Cookbook: [Using the Jina-embeddings-v2-base-en model in a Haystack RAG pipeline for legal document analysis](https://haystack.deepset.ai/cookbook/jina-embeddings-v2-legal-analysis-rag)

View File

@ -11,7 +11,7 @@ This component transforms a string into a vector that captures its semantics usi
| | |
| --- | --- |
| **Most common position in a pipeline** | Before an embedding [Retriever](../../../docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Most common position in a pipeline** | Before an embedding [Retriever](/docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Mandatory init variables** | "api_key": The Jina API key. Can be set with `JINA_API_KEY` env var. |
| **Mandatory run variables** | “text”: A string |
| **Output variables** | “embedding”: A list of float numbers <br /> <br />“meta”: A dictionary of metadata |
@ -99,4 +99,4 @@ print(result['retriever']['documents'][0])
## Additional References
:cook: Cookbook: [Using the Jina-embeddings-v2-base-en model in a Haystack RAG pipeline for legal document analysis](https://haystack.deepset.ai/cookbook/jina-embeddings-v2-legal-analysis-rag)
:cook: Cookbook: [Using the Jina-embeddings-v2-base-en model in a Haystack RAG pipeline for legal document analysis](https://haystack.deepset.ai/cookbook/jina-embeddings-v2-legal-analysis-rag)

View File

@ -11,7 +11,7 @@ This component computes the embeddings of a list of documents using the Mistral
| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../../../docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](/docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory init variables** | "api_key": The Mistral API key. Can be set with `MISTRAL_API_KEY` env var. |
| **Mandatory run variables** | “documents”: A list of documents to be embedded |
| **Output variables** | “documents”: A list of documents (enriched with embeddings) <br /> <br />“meta”: A dictionary of metadata strings |
@ -93,4 +93,4 @@ indexing.connect("chunker", "embedder")
indexing.connect("embedder", "writer")
indexing.run(data={"fetcher": {"urls": ["https://mistral.ai/news/la-plateforme/"]}})
```
```
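Alongside the indexing pipeline above, a minimal standalone sketch; the import path is an assumption for the Mistral integration package, the model name is used as an example, and the API key is read from the environment variable listed in the table:

```python
from haystack import Document
# Import path assumed for the Mistral integration package.
from haystack_integrations.components.embedders.mistral import MistralDocumentEmbedder

# "mistral-embed" is the usual embedding model, named here as an example;
# the API key is read from MISTRAL_API_KEY by default.
embedder = MistralDocumentEmbedder(model="mistral-embed")

docs = [Document(content="My name is Wolfgang and I live in Berlin")]
result = embedder.run(documents=docs)
print(result["documents"][0].embedding[:5])
```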

View File

@ -11,7 +11,7 @@ This component transforms a string into a vector using the Mistral API and model
| | |
| --- | --- |
| **Most common position in a pipeline** | Before an embedding [Retriever](../../../docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Most common position in a pipeline** | Before an embedding [Retriever](/docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Mandatory init variables** | "api_key": The Mistral API key. Can be set with `MISTRAL_API_KEY` env var. |
| **Mandatory run variables** | “text”: A string |
| **Output variables** | “embedding”: A list of float numbers (vectors) <br /> <br />“meta”: A dictionary of metadata strings |
@ -95,7 +95,7 @@ indexing.connect("fetcher", "converter")
indexing.connect("converter", "embedder")
indexing.connect("embedder", "writer")
indexing.run(data={"fetcher": {"urls": ["https://docs.mistral.ai/self-deployment/cloudflare/",
indexing.run(data={"fetcher": {"urls": ["https://docs.mistral.ai/self-deployment/cloudflare/",
"https://docs.mistral.ai/platform/endpoints/"]}})
## Retrieval components
@ -136,4 +136,4 @@ result = doc_search.run(
)
print(result["llm"]["replies"])
```
```

View File

@ -11,7 +11,7 @@ This component computes the embeddings of a list of documents and stores the obt
| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../../../docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](/docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory init variables** | "api_key": API key for the NVIDIA NIM. Can be set with `NVIDIA_API_KEY` env var. |
| **Mandatory run variables** | “documents”: A list of documents |
| **Output variables** | “documents”: A list of documents (enriched with embeddings) <br /> <br />“meta”: A dictionary of metadata |
@ -120,4 +120,4 @@ print(result['retriever']['documents'][0])
## Additional References
:cook: Cookbook: [Haystack RAG Pipeline with Self-Deployed AI models using NVIDIA NIMs](https://haystack.deepset.ai/cookbook/rag-with-nims)
:cook: Cookbook: [Haystack RAG Pipeline with Self-Deployed AI models using NVIDIA NIMs](https://haystack.deepset.ai/cookbook/rag-with-nims)

View File

@ -11,7 +11,7 @@ This component transforms a string into a vector that captures its semantics usi
| | |
| --- | --- |
| **Most common position in a pipeline** | Before an embedding [Retriever](../../../docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Most common position in a pipeline** | Before an embedding [Retriever](/docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Mandatory init variables** | "api_key": API key for the NVIDIA NIM. Can be set with `NVIDIA_API_KEY` env var. |
| **Mandatory run variables** | “text”: A string |
| **Output variables** | “embedding”: A list of float numbers (vectors) <br /> <br />“meta”: A dictionary of metadata strings |
@ -24,7 +24,7 @@ This component transforms a string into a vector that captures its semantics usi
It can be used with self-hosted models with NVIDIA NIM or models hosted on the [NVIDIA API catalog](https://build.nvidia.com/explore/discover).
To embed a list of documents, use the [`NvidiaDocumentEmbedder`](nvidiadocumentembedder.mdx), which enriches each document with the computed embedding, also known as a vector.
To embed a list of documents, use the [`NvidiaDocumentEmbedder`](nvidiadocumentembedder.mdx), which enriches each document with the computed embedding, also known as a vector.
## Usage
@ -120,4 +120,4 @@ print(result['retriever']['documents'][0])
## Additional References
:cook: Cookbook: [Haystack RAG Pipeline with Self-Deployed AI models using NVIDIA NIMs](https://haystack.deepset.ai/cookbook/rag-with-nims)
:cook: Cookbook: [Haystack RAG Pipeline with Self-Deployed AI models using NVIDIA NIMs](https://haystack.deepset.ai/cookbook/rag-with-nims)

View File

@ -11,7 +11,7 @@ This component computes the embeddings of a list of documents using embedding mo
| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../../../docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](/docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory run variables** | “documents”: A list of documents to be embedded |
| **Output variables** | “documents”: A list of documents (enriched with embeddings) <br /> <br />“meta”: A dictionary of metadata strings |
| **API reference** | [Ollama](/reference/integrations-ollama) |
@ -113,4 +113,4 @@ indexing_pipeline.run({"converter": {"sources": ["files/test_pdf_data.pdf"]}})
## Calculating embeddings: 100%|██████████| 115/115
## {'embedder': {'meta': {'model': 'nomic-embed-text'}}, 'writer': {'documents_written': 115}}
```
```
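Alongside the indexing pipeline above, a minimal standalone sketch; the import path is an assumption for the Ollama integration package, and it assumes a local Ollama server with the example model already pulled:

```python
from haystack import Document
# Import path assumed for the Ollama integration package.
from haystack_integrations.components.embedders.ollama import OllamaDocumentEmbedder

# Assumes a local Ollama server and that the example model has been pulled.
embedder = OllamaDocumentEmbedder(model="nomic-embed-text")

docs = [Document(content="My name is Wolfgang and I live in Berlin")]
result = embedder.run(documents=docs)
print(result["documents"][0].embedding[:5])
```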

View File

@ -11,7 +11,7 @@ This component computes the embeddings of a string using embedding models compat
| | |
| --- | --- |
| **Most common position in a pipeline** | Before an embedding [Retriever](../../../docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Most common position in a pipeline** | Before an embedding [Retriever](/docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Mandatory run variables** | “text”: A string |
| **Output variables** | “embedding”: A list of float numbers (vectors) <br /> <br />“meta”: A dictionary of metadata strings |
| **API reference** | [Ollama](/reference/integrations-ollama) |
@ -95,4 +95,4 @@ query = "Who lives in Berlin?"
result = query_pipeline.run({"text_embedder":{"text": query}})
print(result['retriever']['documents'][0])
```
```

View File

@ -13,7 +13,7 @@ The vectors computed by this component are necessary to perform embedding retrie
| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../../../docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](/docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory init variables** | "api_key": An OpenAI API key. Can be set with `OPENAI_API_KEY` env var. |
| **Mandatory run variables** | "documents": A list of documents |
| **Output variables** | "documents": A list of documents (enriched with embeddings) <br /> <br />"meta": A dictionary of metadata |
@ -107,4 +107,4 @@ print(result['retriever']['documents'][0])
## Document(id=..., mimetype: 'text/plain',
## text: 'My name is Wolfgang and I live in Berlin')
```
```
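For a standalone view of the component described above, a minimal sketch; the model name is an example, and the API key is read from the environment variable listed in the table:

```python
from haystack import Document
from haystack.components.embedders import OpenAIDocumentEmbedder

# Example model name; the API key is read from OPENAI_API_KEY by default.
embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")

docs = [Document(content="My name is Wolfgang and I live in Berlin")]
result = embedder.run(documents=docs)
print(result["documents"][0].embedding[:5])
print(result["meta"])
```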

View File

@ -13,7 +13,7 @@ When you perform embedding retrieval, you use this component to transform your q
| | |
| --- | --- |
| **Most common position in a pipeline** | Before an embedding [Retriever](../../../docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Most common position in a pipeline** | Before an embedding [Retriever](/docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Mandatory init variables** | "api_key": An OpenAI API key. Can be set with `OPENAI_API_KEY` env var. |
| **Mandatory run variables** | "text": A string |
| **Output variables** | "embedding": A list of float numbers <br /> <br />"meta": A dictionary of metadata |
@ -88,4 +88,4 @@ print(result['retriever']['documents'][0])
## Document(id=..., mimetype: 'text/plain',
## text: 'My name is Wolfgang and I live in Berlin')
```
```
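For a standalone view of the component described above, a minimal sketch; the model name is an example, and the API key is read from the environment variable listed in the table:

```python
from haystack.components.embedders import OpenAITextEmbedder

# Example model name; the API key is read from OPENAI_API_KEY by default.
embedder = OpenAITextEmbedder(model="text-embedding-3-small")

result = embedder.run(text="Who lives in Berlin?")
print(result["embedding"][:5])
print(result["meta"])
```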

View File

@ -11,7 +11,7 @@ The vectors computed by this component are necessary to perform embedding retrie
| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../../../docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](/docs/pipeline-components/writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory init variables** | "api_key": The IBM Cloud API key. Can be set with `WATSONX_API_KEY` env var. <br /> <br />"project_id": The IBM Cloud project ID. Can be set with `WATSONX_PROJECT_ID` env var. |
| **Mandatory run variables** | "documents": A list of documents to be embedded |
| **Output variables** | "documents": A list of documents (enriched with embeddings) <br /> <br />"meta": A dictionary of metadata strings |
@ -126,4 +126,4 @@ result = query_pipeline.run({"text_embedder":{"text": query}})
print(result['retriever']['documents'][0])
## Document(id=..., text: 'My name is Wolfgang and I live in Berlin')
```
```

View File

@ -11,7 +11,7 @@ When you perform embedding retrieval, you use this component to transform your q
| | |
| --- | --- |
| **Most common position in a pipeline** | Before an embedding [Retriever](../../../docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Most common position in a pipeline** | Before an embedding [Retriever](/docs/pipeline-components/retrievers.mdx) in a query/RAG pipeline |
| **Mandatory init variables** | "api_key": An IBM Cloud API key. Can be set with `WATSONX_API_KEY` env var. <br /> <br />"project_id": An IBM Cloud project ID. Can be set with `WATSONX_PROJECT_ID` env var. |
| **Mandatory run variables** | "text": A string |
| **Output variables** | "embedding": A list of float numbers <br /> <br />"meta": A dictionary of metadata |
@ -101,4 +101,4 @@ print(result['retriever']['documents'][0])
## Document(id=..., mimetype: 'text/plain',
## text: 'My name is Wolfgang and I live in Berlin')
```
```

View File

@ -10,10 +10,10 @@ slug: "/evaluators"
| --- | --- |
| [AnswerExactMatchEvaluator](evaluators/answerexactmatchevaluator.mdx) | Evaluates answers predicted by Haystack pipelines using ground truth labels. It checks character by character whether a predicted answer exactly matches the ground truth answer. |
| [ContextRelevanceEvaluator](evaluators/contextrelevanceevaluator.mdx) | Uses an LLM to evaluate whether the provided contexts are relevant to the question. |
| [DeepEvalEvaluator](https://docs.haystack.deepset.ai/v2.0../../docs/pipeline-components/evaluators/deepevalevaluator.mdx) | Use DeepEval to evaluate generative pipelines. |
| [DeepEvalEvaluator](https://docs.haystack.deepset.ai/v2.0/docs/pipeline-components/evaluators/deepevalevaluator.mdx) | Use DeepEval to evaluate generative pipelines. |
| [DocumentMAPEvaluator](evaluators/documentmapevaluator.mdx) | Evaluates documents retrieved by Haystack pipelines using ground truth labels. It checks to what extent the list of retrieved documents contains only relevant documents as specified in the ground truth labels or also non-relevant documents. |
| [DocumentMRREvaluator](evaluators/documentmrrevaluator.mdx) | Evaluates documents retrieved by Haystack pipelines using ground truth labels. It checks at what rank ground truth documents appear in the list of retrieved documents. |
| [DocumentNDCGEvaluator](https://docs.haystack.deepset.ai/v2.7-unstable../../docs/pipeline-components/evaluators/documentndcgevaluator.mdx) | Evaluates documents retrieved by Haystack pipelines using ground truth labels. It checks at what rank ground truth documents appear in the list of retrieved documents. This metric is called normalized discounted cumulative gain (NDCG). |
| [DocumentNDCGEvaluator](https://docs.haystack.deepset.ai/v2.7-unstable/docs/pipeline-components/evaluators/documentndcgevaluator.mdx) | Evaluates documents retrieved by Haystack pipelines using ground truth labels. It checks at what rank ground truth documents appear in the list of retrieved documents. This metric is called normalized discounted cumulative gain (NDCG). |
| [DocumentRecallEvaluator](evaluators/documentrecallevaluator.mdx) | Evaluates documents retrieved by Haystack pipelines using ground truth labels. It checks how many of the ground truth documents were retrieved. |
| [FaithfulnessEvaluator](evaluators/faithfulnessevaluator.mdx) | Uses an LLM to evaluate whether a generated answer can be inferred from the provided contexts. Does not require ground truth labels. |
| [LLMEvaluator](evaluators/llmevaluator.mdx) | Uses an LLM to evaluate inputs based on a prompt containing user-defined instructions and examples. |

View File

@ -11,7 +11,7 @@ Extracts textual content from image-based documents using a vision-enabled Large
| | |
| --- | --- |
| **Most common position in a pipeline** | After [Converters](../../../docs/pipeline-components/converters.mdx) in an indexing pipeline to extract text from image-based documents |
| **Most common position in a pipeline** | After [Converters](/docs/pipeline-components/converters.mdx) in an indexing pipeline to extract text from image-based documents |
| **Mandatory init variables** | "chat_generator": A ChatGenerator instance that supports vision-based input <br /> <br />"prompt": Instructional text for the LLM on how to extract content (no Jinja variables allowed) |
| **Mandatory run variables** | "documents": A list of documents with file paths in metadata |
| **Output variables** | "documents": Successfully processed documents with extracted content <br /> <br />"failed_documents": Documents that failed processing with error metadata |
@ -174,4 +174,4 @@ print(f"Failed documents: {len(result['content_extractor']['failed_documents'])}
## Access documents in the store
stored_docs = document_store.filter_documents()
print(f"Documents in store: {len(stored_docs)}")
```
```

View File

@ -11,7 +11,7 @@ Extracts metadata from documents using a Large Language Model. The metadata is e
| | |
| --- | --- |
| **Most common position in a pipeline** | After [PreProcessors](../../../docs/pipeline-components/preprocessors.mdx) in an indexing pipeline |
| **Most common position in a pipeline** | After [PreProcessors](/docs/pipeline-components/preprocessors.mdx) in an indexing pipeline |
| **Mandatory init variables** | "prompt": The prompt to instruct the LLM on how to extract metadata from the document <br /> <br />"chat_generator": A Chat Generator instance which represents the LLM configured to return a JSON object |
| **Mandatory run variables** | “documents”: A list of documents |
| **Output variables** | “documents”: A list of documents |
@ -20,13 +20,13 @@ Extracts metadata from documents using a Large Language Model. The metadata is e
## Overview
`LLMMetadataExtractor` relies on an LLM and a prompt to perform metadata extraction. At initialization time, it expects an LLM (a Haystack Generator) and a prompt describing the metadata extraction process.
`LLMMetadataExtractor` relies on an LLM and a prompt to perform metadata extraction. At initialization time, it expects an LLM (a Haystack Generator) and a prompt describing the metadata extraction process.
The prompt should have a variable called `document` that will point to a single document in the list of documents. So, to access the content of the document, you can use `{{ document.content }}` in the prompt.
At runtime, it expects a list of documents and will run the LLM on each document in the list, extracting metadata from the document. The metadata will be added to the document's metadata field.
If the LLM fails to extract metadata from a document, that document is added to the `failed_documents` list. The failed documents' metadata will contain the keys `metadata_extraction_error` and `metadata_extraction_response`.
If the LLM fails to extract metadata from a document, that document is added to the `failed_documents` list. The failed documents' metadata will contain the keys `metadata_extraction_error` and `metadata_extraction_response`.
These documents can be re-run with another extractor to extract metadata using the `metadata_extraction_response` and `metadata_extraction_error` in the prompt.
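For illustration, here is a minimal sketch of such a prompt and extractor setup. The prompt text, the choice of `OpenAIChatGenerator`, and the `expected_keys` value are assumptions made for this sketch, so check the API reference before relying on them:

```python
from haystack import Document
from haystack.components.extractors import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

# The prompt reaches each document through the `document` variable.
prompt = """Extract the main topic of the text and return JSON like {"topic": "sports"}.
Text: {{ document.content }}"""

extractor = LLMMetadataExtractor(
    prompt=prompt,
    chat_generator=OpenAIChatGenerator(),
    expected_keys=["topic"],  # assumed parameter for validating the LLM's JSON reply
)

docs = [Document(content="Haystack is an open source framework for building LLM applications.")]
result = extractor.run(documents=docs)
# Successfully processed documents carry the extracted keys in their metadata.
print(result["documents"][0].meta)
```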
@ -140,4 +140,4 @@ extractor.run(documents=docs)
'failed_documents': []
}
>>
```
```

View File

@ -11,7 +11,7 @@ This component extracts predefined entities out of a piece of text and writes th
| | |
| --- | --- |
| **Most common position in a pipeline** | After the [PreProcessor](../../../docs/pipeline-components/preprocessors.mdx) in an indexing pipeline or after a [Retriever](../../../docs/pipeline-components/retrievers.mdx) in a query pipeline |
| **Most common position in a pipeline** | After the [PreProcessor](/docs/pipeline-components/preprocessors.mdx) in an indexing pipeline or after a [Retriever](/docs/pipeline-components/retrievers.mdx) in a query pipeline |
| **Mandatory init variables** | "backend": The backend to use for NER <br /> <br />"model": Name or path of the model to use |
| **Mandatory run variables** | “documents”: A list of documents |
| **Output variables** | “documents”: A list of documents |
@ -60,8 +60,8 @@ print(documents)
Here is the example result:
```python
[Document(id=aec840d1b6c85609f4f16c3e222a5a25fd8c4c53bd981a40c1268ab9c72cee10, content: 'My name is Clara and I live in Berkeley, California.', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=11, end=16, score=0.99641764), NamedEntityAnnotation(entity='LOC', start=31, end=39, score=0.996198), NamedEntityAnnotation(entity='LOC', start=41, end=51, score=0.9990196)]}),
Document(id=98f1dc5d0ccd9d9950cd191d1076db0f7af40c401dd7608f11c90cb3fc38c0c2, content: 'I'm Merlin, the happy pig!', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=4, end=10, score=0.99054915)]}),
[Document(id=aec840d1b6c85609f4f16c3e222a5a25fd8c4c53bd981a40c1268ab9c72cee10, content: 'My name is Clara and I live in Berkeley, California.', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=11, end=16, score=0.99641764), NamedEntityAnnotation(entity='LOC', start=31, end=39, score=0.996198), NamedEntityAnnotation(entity='LOC', start=41, end=51, score=0.9990196)]}),
Document(id=98f1dc5d0ccd9d9950cd191d1076db0f7af40c401dd7608f11c90cb3fc38c0c2, content: 'I'm Merlin, the happy pig!', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=4, end=10, score=0.99054915)]}),
Document(id=44948ea0eec018b33aceaaedde4616eb9e93ce075e0090ec1613fc145f84b4a9, content: 'New York State is home to the Empire State Building.', meta: {'named_entities': [NamedEntityAnnotation(entity='LOC', start=0, end=14, score=0.9989541), NamedEntityAnnotation(entity='LOC', start=30, end=51, score=0.95746297)]})]
```
@ -88,4 +88,4 @@ print(annotations)
## If a Document doesn't contain any annotations, this returns None.
new_doc = Document(content="In one of many possible worlds...")
assert NamedEntityExtractor.get_stored_annotations(new_doc) is None
```
```

View File

@ -11,10 +11,10 @@ This component enables chat completion using models through Amazon Bedrock servi
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](../../../docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](/docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | "model": The model to use <br /> <br />"aws_access_key_id": AWS access key ID. Can be set with `AWS_ACCESS_KEY_ID` env var. <br /> <br />"aws_secret_access_key": AWS secret access key. Can be set with `AWS_SECRET_ACCESS_KEY` env var. <br /> <br />"aws_region_name": AWS region name. Can be set with `AWS_DEFAULT_REGION` env var. |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) instances |
| **Output variables** | "replies": A list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) objects <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and so on |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) instances |
| **Output variables** | "replies": A list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) objects <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and so on |
| **API reference** | [Amazon Bedrock](/reference/integrations-amazon-bedrock) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/amazon_bedrock |
@ -58,7 +58,7 @@ from haystack.dataclasses import ChatMessage
generator = AmazonBedrockChatGenerator(model="meta.llama2-70b-chat-v1")
messages = [ChatMessage.from_system("You are a helpful assistant that answers question in Spanish only"), ChatMessage.from_user("What's Natural Language Processing? Be brief.")]
response = generator.run(messages)
print(response)
```
@ -84,4 +84,4 @@ messages = [system_message, ChatMessage.from_user("What's the official language
res = pipe.run(data={"prompt_builder": {"template_variables": {"country": country}, "template": messages}})
print(res)
```
```

View File

@ -11,7 +11,7 @@ This component enables text generation using models through Amazon Bedrock servi
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) |
| **Most common position in a pipeline** | After a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) |
| **Mandatory init variables** | "model": The model to use <br /> <br />"aws_access_key_id": AWS access key ID. Can be set with `AWS_ACCESS_KEY_ID` env var. <br /> <br />"aws_secret_access_key": AWS secret access key. Can be set with `AWS_SECRET_ACCESS_KEY` env var. <br /> <br />"aws_region_name": AWS region name. Can be set with `AWS_DEFAULT_REGION` env var. |
| **Mandatory run variables** | “prompt”: The instructions for the Generator |
| **Output variables** | “replies”: A list of strings with all the replies generated by the model |
@ -82,7 +82,7 @@ from haystack_integrations.components.generators.amazon_bedrock import AmazonBed
template = """
Given the following information, answer the question.
Context:
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
@ -115,4 +115,4 @@ pipe.run({
## Additional References
:cook: Cookbook: [PDF-Based Question Answering with Amazon Bedrock and Haystack](https://haystack.deepset.ai/cookbook/amazon_bedrock_for_documentation_qa)
:cook: Cookbook: [PDF-Based Question Answering with Amazon Bedrock and Haystack](https://haystack.deepset.ai/cookbook/amazon_bedrock_for_documentation_qa)

View File

@ -11,10 +11,10 @@ This component enables chat completions using Anthropic large language models (L
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](../../../docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](/docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | "api_key": An Anthropic API key. Can be set with `ANTHROPIC_API_KEY` env var. |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) objects |
| **Output variables** | "replies": A list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) objects <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and so on |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) objects |
| **Output variables** | "replies": A list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) objects <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and so on |
| **API reference** | [Anthropic](/reference/integrations-anthropic) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/anthropic |
@ -33,7 +33,7 @@ Set your preferred Anthropic model with the `model` parameter when initializing
`AnthropicChatGenerator` requires a prompt to generate text, but you can pass any text generation parameters available in the Anthropic [Messaging API](https://docs.anthropic.com/en/api/messages) method directly to this component using the `generation_kwargs` parameter, both at initialization and when running the component. For more details on the parameters supported by the Anthropic API, see the [Anthropic documentation](https://docs.anthropic.com).
Finally, the component needs a list of `ChatMessage` objects to operate. `ChatMessage` is a data class that contains a message, a role (who generated the message, such as `user`, `assistant`, `system`, `function`), and optional metadata.
Finally, the component needs a list of `ChatMessage` objects to operate. `ChatMessage` is a data class that contains a message, a role (who generated the message, such as `user`, `assistant`, `system`, `function`), and optional metadata.
Only text input modality is supported at this time.
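As a quick illustration, here is a minimal sketch of initializing the component with `generation_kwargs` and running it on a list of `ChatMessage` objects. The model name and parameter values are placeholders, not recommendations:

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator

# generation_kwargs set at init apply to every run() call unless overridden per call.
generator = AnthropicChatGenerator(
    model="claude-3-5-sonnet-20240620",
    generation_kwargs={"max_tokens": 512, "temperature": 0.2},
)

messages = [ChatMessage.from_user("What's Natural Language Processing? Be brief.")]
result = generator.run(messages=messages)
print(result["replies"][0].text)
```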
@ -65,7 +65,7 @@ Give preference to `print_streaming_chunk` by default. Write a custom callback o
### Prompt caching
Prompt caching is a feature for Anthropic LLMs that stores large text inputs for reuse. It allows you to send a large text block once and then refer to it in later requests without resending the entire text.
Prompt caching is a feature for Anthropic LLMs that stores large text inputs for reuse. It allows you to send a large text block once and then refer to it in later requests without resending the entire text.
This feature is particularly useful for coding assistants that need full codebase context and for processing large documents. It can help reduce costs and improve response times.
Here's an example of an instance of `AnthropicChatGenerator` being initialized with prompt caching and tagging a message to be cached:
@ -142,4 +142,4 @@ print(res)
## Additional References
:cook: Cookbook: [Advanced Prompt Customization for Anthropic](https://haystack.deepset.ai/cookbook/prompt_customization_for_anthropic)
:cook: Cookbook: [Advanced Prompt Customization for Anthropic](https://haystack.deepset.ai/cookbook/prompt_customization_for_anthropic)

View File

@ -11,7 +11,7 @@ This component enables text completions using Anthropic large language models (L
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [PromptBuilder](../../../docs/pipeline-components/builders/promptbuilder.mdx) |
| **Most common position in a pipeline** | After a [PromptBuilder](/docs/pipeline-components/builders/promptbuilder.mdx) |
| **Mandatory init variables** | "api_key": An Anthropic API key. Can be set with `ANTHROPIC_API_KEY` env var. |
| **Mandatory run variables** | “prompt”: A string containing the prompt for the LLM |
| **Output variables** | “replies”: A list of strings with all the replies generated by the LLM <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and so on |
@ -82,4 +82,4 @@ query = "What language is spoken in Germany?"
res = pipe.run(data={"prompt_builder": {"query": {query}}})
print(res)
```
```

View File

@ -11,16 +11,16 @@ This component enables chat completions using AnthropicVertex API.
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`ChatPromptBuilder`](../../../docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Most common position in a pipeline** | After a [`ChatPromptBuilder`](/docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | "region": The region where the Anthropic model is deployed <br /> <br />”project_id”: GCP project ID where the Anthropic model is deployed |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](https://docs.haystack.deepset.ai../../../docs/concepts/data-classes.mdx#chatmessage)   objects |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](https://docs.haystack.deepset.ai/docs/concepts/data-classes.mdx#chatmessage)   objects |
| **Output variables** | “replies”: A list of strings with all the replies generated by the LLM <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and others |
| **API reference** | [Anthropic](/reference/integrations-anthropic) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/anthropic |
## Overview
`AnthropicVertexChatGenerator` enables text generation using state-of-the-art Claude 3 LLMs using the Anthropic Vertex AI API.
`AnthropicVertexChatGenerator` enables text generation using state-of-the-art Claude 3 LLMs using the Anthropic Vertex AI API.
It supports `Claude 3.5 Sonnet`, `Claude 3 Opus`, `Claude 3 Sonnet`, and `Claude 3 Haiku` models, which are accessible through the Vertex AI API endpoint. For more details about the models, refer to [Anthropic Vertex AI documentation](https://docs.anthropic.com/en/api/claude-on-vertex-ai).
### Parameters
@ -34,7 +34,7 @@ You can provide these keys in the following ways:
Before making requests, you may need to authenticate with GCP using `gcloud auth login`.
Set your preferred supported Anthropic model with the `model` parameter when initializing the component. Additionally, ensure that the desired Anthropic model is activated in the Vertex AI Model Garden.
Set your preferred supported Anthropic model with the `model` parameter when initializing the component. Additionally, ensure that the desired Anthropic model is activated in the Vertex AI Model Garden.
`AnthropicVertexChatGenerator` requires a prompt to generate text, but you can pass any text generation parameters available in the Anthropic [Messaging API](https://docs.anthropic.com/en/api/messages) method directly to this component using the `generation_kwargs` parameter, both at initialization and when running the component. For more details on the parameters supported by the Anthropic API, see the [Anthropic documentation](https://docs.anthropic.com/).
@ -149,4 +149,4 @@ messages = [system_message, ChatMessage.from_user("What's the official language
res = pipe.run(data={"prompt_builder": {"template_variables": {"country": country}, "template": messages}})
print(res)
```
```

View File

@ -11,9 +11,9 @@ This component enables chat completion using OpenAI's large language models (L
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](../../../docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](/docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | "api_key": The Azure OpenAI API key. Can be set with `AZURE_OPENAI_API_KEY` env var. <br /> <br />"azure_ad_token": Microsoft Entra ID token. Can be set with `AZURE_OPENAI_AD_TOKEN` env var. |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) objects representing the chat |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) objects representing the chat |
| **Output variables** | “replies”: A list of alternative replies of the LLM to the input chat |
| **API reference** | [Generators](/reference/generators-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/generators/chat/azure.py |
@ -76,10 +76,10 @@ response = client.run(messages=[
print(response["replies"][0].text)
>> {"recipient_name":"David Julius","award_year":2021,"category":"Physiology or Medicine",
>> "achievement_description":"David Julius was awarded for his transformative findings
>> regarding the molecular mechanisms underlying the human body's sense of temperature
>> and touch. Through innovative experiments, he identified specific receptors responsible
>> for detecting heat and mechanical stimuli, ranging from gentle touch to pain-inducing
>> "achievement_description":"David Julius was awarded for his transformative findings
>> regarding the molecular mechanisms underlying the human body's sense of temperature
>> and touch. Through innovative experiments, he identified specific receptors responsible
>> for detecting heat and mechanical stimuli, ranging from gentle touch to pain-inducing
>> pressure.","nationality":"American"}
```
@ -166,4 +166,4 @@ location = "Berlin"
messages = [ChatMessage.from_system("Always respond in German even if some input data is in other languages."),
ChatMessage.from_user("Tell me about {{location}}")]
pipe.run(data={"prompt_builder": {"template_variables":{"location": location}, "template": messages}})
```
```

View File

@ -11,7 +11,7 @@ This component enables text generation using OpenAI's large language models (LLM
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) |
| **Most common position in a pipeline** | After a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) |
| **Mandatory init variables** | "api_key": The Azure OpenAI API key. Can be set with `AZURE_OPENAI_API_KEY` env var. <br /> <br />"azure_ad_token": Microsoft Entra ID token. Can be set with `AZURE_OPENAI_AD_TOKEN` env var. |
| **Mandatory run variables** | “prompt”: A string containing the prompt for the LLM |
| **Output variables** | “replies”: A list of strings with all the replies generated by the LLM <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and so on |
@ -109,7 +109,7 @@ query = "What is the capital of France?"
template = """
Given the following information, answer the question.
Context:
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
@ -134,4 +134,4 @@ res=pipe.run({
})
print(res)
```
```

View File

@ -11,10 +11,10 @@ CohereChatGenerator enables chat completions using Cohere's large language model
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](../../../docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](/docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | "api_key": The Cohere API key. Can be set with `COHERE_API_KEY` or `CO_API_KEY` env var. |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) objects |
| **Output variables** | "replies": A list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) objects <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and so on |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) objects |
| **Output variables** | "replies": A list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) objects <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and so on |
| **API reference** | [Cohere](/reference/integrations-cohere) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/cohere |
@ -77,4 +77,4 @@ messages = [system_message, ChatMessage.from_user("What's the official language
res = pipe.run(data={"prompt_builder": {"template_variables": {"country": country}, "template": messages}})
print(res)
```
```

View File

@ -11,7 +11,7 @@ description: "`CohereGenerator` enables text generation using Cohere's large lan
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) |
| **Most common position in a pipeline** | After a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) |
| **Mandatory init variables** | "api_key": The Cohere API key. Can be set with `COHERE_API_KEY` or `CO_API_KEY` env var. |
| **Mandatory run variables** | “prompt”: A string containing the prompt for the LLM |
| **Output variables** | “replies”: A list of strings with all the replies generated by the LLM <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and so on |
@ -65,7 +65,7 @@ client = CohereGenerator(streaming_callback=lambda chunk: print(chunk.content, e
response = client.run("Briefly explain what NLP is in one sentence.")
print(response)
>>> Natural Language Processing (NLP) is the study of natural language and how it can be used to solve problems through computational methods, enabling machines to understand, interpret, and generate human language.
>>> Natural Language Processing (NLP) is the study of natural language and how it can be used to solve problems through computational methods, enabling machines to understand, interpret, and generate human language.
>>>{'replies': [' Natural Language Processing (NLP) is the study of natural language and how it can be used to solve problems through computational methods, enabling machines to understand, interpret, and generate human language.'], 'meta': [{'index': 0, 'finish_reason': 'COMPLETE'}]}
@ -91,7 +91,7 @@ query = "What is the capital of France?"
template = """
Given the following information, answer the question.
Context:
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
@ -116,4 +116,4 @@ res=pipe.run({
})
print(res)
```
```

View File

@ -11,7 +11,7 @@ Generate images using OpenAI's DALL-E model.
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx), flexible |
| **Most common position in a pipeline** | After a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx), flexible |
| **Mandatory init variables** | "api_key": An OpenAI API key. Can be set with `OPENAI_API_KEY` env var. |
| **Mandatory run variables** | “prompt”: A string containing the prompt for the model |
| **Output variables** | “images”: A list of generated images <br /> <br />”revised_prompt”: A string containing the prompt that was used to generate the image, if there was any revision to the prompt made by OpenAI |
@ -90,4 +90,4 @@ revised_prompt = results["image_generator"]["revised_prompt"]
print(f"Generated image URL: {generated_images[0]}")
print(f"Revised prompt: {revised_prompt}")
```
```

View File

@ -11,9 +11,9 @@ A ChatGenerator wrapper that tries multiple Chat Generators sequentially until o
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](../../../docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](/docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | "chat_generators": A non-empty list of Chat Generator components to try in order |
| **Mandatory run variables** | "messages": A list of [`ChatMessage`](../../../docs/concepts/data-classes/chatmessage.mdx) objects representing the chat |
| **Mandatory run variables** | "messages": A list of [`ChatMessage`](/docs/concepts/data-classes/chatmessage.mdx) objects representing the chat |
| **Output variables** | "replies": Generated ChatMessage instances from the first successful generator <br /> <br />"meta": Execution metadata including successful generator details |
| **API reference** | [Generators](/reference/generators-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/generators/chat/fallback.py |
@ -87,7 +87,7 @@ print(result["replies"][0].text)
print(f"Successful generator: {result['meta']['successful_chat_generator_class']}")
print(f"Total attempts: {result['meta']['total_attempts']}")
>> Natural Language Processing (NLP) is a field of artificial intelligence that
>> Natural Language Processing (NLP) is a field of artificial intelligence that
>> focuses on the interaction between computers and humans through natural language...
>> Successful generator: OpenAIChatGenerator
>> Total attempts: 1
@ -211,4 +211,4 @@ try:
except RuntimeError as e:
print(f"All generators failed: {e}")
# Output: All 2 chat generators failed. Last error: ... Failed chat generators: [OpenAIChatGenerator, OpenAIChatGenerator]
```
```

View File

@ -11,9 +11,9 @@ This generator enables chat completion using various Hugging Face APIs.
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`ChatPromptBuilder`](../../../docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Most common position in a pipeline** | After a [`ChatPromptBuilder`](/docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | "api_type": The type of Hugging Face API to use <br /> <br />"api_params": A dictionary with one of the following keys: <br /> <br />- `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.**OR** - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or `TEXT_EMBEDDINGS_INFERENCE`."token": The Hugging Face API token. Can be set with `HF_API_TOKEN` or `HF_TOKEN` env var. |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) objects representing the chat |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) objects representing the chat |
| **Output variables** | “replies”: A list of replies of the LLM to the input chat |
| **API reference** | [Generators](/reference/generators-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/generators/chat/hugging_face_api.py |
@ -32,7 +32,7 @@ This component's main input is a list of `ChatMessage` objects. `ChatMessage` is
This component is designed for chat completion, so it expects a list of messages, not a single string. If you want to use Hugging Face APIs for simple text generation (such as translation or summarization tasks) or don't want to use the `ChatMessage` object, use [`HuggingFaceAPIGenerator`](huggingfaceapigenerator.mdx) instead.
:::
The component uses the `HF_API_TOKEN` environment variable by default. Otherwise, you can pass a Hugging Face API token at initialization with `token`; see the code examples below.
The component uses the `HF_API_TOKEN` environment variable by default. Otherwise, you can pass a Hugging Face API token at initialization with `token`; see the code examples below.
The token is needed:
- If you use the Serverless Inference API, or
@ -50,7 +50,7 @@ This Generator supports [streaming](guides-to-generators/choosing-the-right-gene
This API allows you to quickly experiment with many models hosted on the Hugging Face Hub, offloading the inference to Hugging Face servers. It's rate-limited and not meant for production.
To use this API, you need a [free Hugging Face token](https://huggingface.co/settings/tokens).
To use this API, you need a [free Hugging Face token](https://huggingface.co/settings/tokens).
The Generator expects the `model` in `api_params`. It's also recommended to specify a `provider` for better performance and reliability.
```python
@ -81,8 +81,8 @@ In this case, a private instance of the model is deployed by Hugging Face, and y
To understand how to spin up an Inference Endpoint, visit [Hugging Face documentation](https://huggingface.co/inference-endpoints/dedicated).
Additionally, in this case, you need to provide your Hugging Face token.
The Generator expects the `url` of your endpoint in `api_params`.
Additionally, in this case, you need to provide your Hugging Face token.
The Generator expects the `url` of your endpoint in `api_params`.
```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
@ -146,7 +146,7 @@ docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingf
For more information, refer to the [official TGI repository](https://github.com/huggingface/text-generation-inference).
The Generator expects the `url` of your TGI instance in `api_params`.
The Generator expects the `url` of your TGI instance in `api_params`.
```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
@ -178,7 +178,7 @@ llm = HuggingFaceAPIChatGenerator(api_type=HFGenerationAPIType.SERVERLESS_INFERE
api_params={"model": "Qwen/Qwen2.5-7B-Instruct",
"provider": "together"},
token=Secret.from_env_var("HF_API_TOKEN"))
pipe = Pipeline()
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", llm)
@ -193,4 +193,4 @@ print(result)
## Additional References
:cook: Cookbook: [Build with Google Gemma: chat and RAG](https://haystack.deepset.ai/cookbook/gemma_chat_rag)
:cook: Cookbook: [Build with Google Gemma: chat and RAG](https://haystack.deepset.ai/cookbook/gemma_chat_rag)

View File

@ -11,7 +11,7 @@ This generator enables text generation using various Hugging Face APIs.
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) |
| **Most common position in a pipeline** | After a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) |
| **Mandatory init variables** | "api_type": The type of Hugging Face API to use <br /> <br />"api_params": A dictionary with one of the following keys: <br /> <br />- `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.**OR** - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or `TEXT_EMBEDDINGS_INFERENCE`."token": The Hugging Face API token. Can be set with `HF_API_TOKEN` or `HF_TOKEN` env var. |
| **Mandatory run variables** | “prompt”: A string containing the prompt for the LLM |
| **Output variables** | “replies”: A list of strings with all the replies generated by the LLM <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and others |
@ -37,7 +37,7 @@ Use the [`HuggingFaceAPIChatGenerator`](huggingfaceapichatgenerator.mdx) compone
This component is designed for text generation, not for chat. If you want to use these LLMs for chat, use [`HuggingFaceAPIChatGenerator`](huggingfaceapichatgenerator.mdx) instead.
:::
The component uses the `HF_API_TOKEN` environment variable by default. Otherwise, you can pass a Hugging Face API token at initialization with `token`; see the code examples below.
The component uses the `HF_API_TOKEN` environment variable by default. Otherwise, you can pass a Hugging Face API token at initialization with `token`; see the code examples below.
The token is needed when you use the Inference Endpoints.
### Streaming
@ -54,8 +54,8 @@ In this case, a private instance of the model is deployed by Hugging Face, and y
To understand how to spin up an Inference Endpoint, visit [Hugging Face documentation](https://huggingface.co/inference-endpoints/dedicated).
Additionally, in this case, you need to provide your Hugging Face token.
The Generator expects the `url` of your endpoint in `api_params`.
Additionally, in this case, you need to provide your Hugging Face token.
The Generator expects the `url` of your endpoint in `api_params`.
```python
from haystack.components.generators import HuggingFaceAPIGenerator
@ -86,7 +86,7 @@ docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingf
For more information, refer to the [official TGI repository](https://github.com/huggingface/text-generation-inference).
The Generator expects the `url` of your TGI instance in `api_params`.
The Generator expects the `url` of your TGI instance in `api_params`.
```python
from haystack.components.generators import HuggingFaceAPIGenerator
@ -104,8 +104,8 @@ print(result)
Formerly known as (free) Hugging Face Inference API, this API allows you to quickly experiment with many models hosted on the Hugging Face Hub, offloading the inference to Hugging Face servers. It's rate-limited and not meant for production.
To use this API, you need a [free Hugging Face token](https://huggingface.co/settings/tokens).
The Generator expects the `model` in `api_params`.
To use this API, you need a [free Hugging Face token](https://huggingface.co/settings/tokens).
The Generator expects the `model` in `api_params`.
```python
from haystack.components.generators import HuggingFaceAPIGenerator
@ -137,7 +137,7 @@ query = "What is the capital of France?"
template = """
Given the following information, answer the question.
Context:
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
@ -175,4 +175,4 @@ print(res)
- [Multilingual RAG from a podcast with Whisper, Qdrant and Mistral](https://haystack.deepset.ai/cookbook/multilingual_rag_podcast)
- [Information Extraction with Raven](https://haystack.deepset.ai/cookbook/information_extraction_raven)
- [Web QA with Mixtral-8x7B-Instruct-v0.1](https://haystack.deepset.ai/cookbook/mixtral-8x7b-for-web-qa)
- [Web QA with Mixtral-8x7B-Instruct-v0.1](https://haystack.deepset.ai/cookbook/mixtral-8x7b-for-web-qa)

View File

@ -11,7 +11,7 @@ description: "`LlamaCppGenerator` provides an interface to generate text using a
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) |
| **Most common position in a pipeline** | After a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) |
| **Mandatory init variables** | "model": The path of the model to use |
| **Mandatory run variables** | “prompt”: A string containing the prompt for the LLM |
| **Output variables** | “replies”: A list of strings with all the replies generated by the LLM <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count and others |
@ -56,7 +56,7 @@ pip install llama-cpp-haystack
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator
generator = LlamaCppGenerator(
model="/content/openchat-3.5-1210.Q3_K_S.gguf",
model="/content/openchat-3.5-1210.Q3_K_S.gguf",
n_ctx=512,
n_batch=128,
model_kwargs={"n_gpu_layers": -1},
@ -71,7 +71,7 @@ result = generator.run(prompt)
The `model`, `n_ctx`, `n_batch` arguments have been exposed for convenience and can be directly passed to the Generator during initialization as keyword arguments. Note that `model` translates to `llama.cpp`'s `model_path` parameter.
The `model_kwargs` parameter can pass additional arguments when initializing the model. In case of duplication, these parameters override the `model`, `n_ctx`, and `n_batch` initialization parameters.
The `model_kwargs` parameter can pass additional arguments when initializing the model. In case of duplication, these parameters override the `model`, `n_ctx`, and `n_batch` initialization parameters.
See [Llama.cpp's LLM documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for more information on the available model arguments.
@ -95,7 +95,7 @@ print(generated_text)
### Passing text generation parameters
The `generation_kwargs` parameter can pass additional generation arguments like `max_tokens`, `temperature`, `top_k`, `top_p`, and others to the model during inference.
The `generation_kwargs` parameter can pass additional generation arguments like `max_tokens`, `temperature`, `top_k`, `top_p`, and others to the model during inference.
See [Llama.cpp's Completion API documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion) for more information on the available generation arguments.
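For instance, here is a minimal sketch of overriding generation parameters per call; the model path and parameter values are placeholders:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(model="/content/openchat-3.5-1210.Q3_K_S.gguf", n_ctx=512, n_batch=128)
generator.warm_up()  # loads the GGUF model into memory

# Parameters passed to run() take effect for this call only.
result = generator.run(
    "Who is the best American actor?",
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
print(result["replies"][0])
```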
@ -135,7 +135,7 @@ result = generator.run(
### Using in a Pipeline
We use the `LlamaCppGenerator` in a Retrieval Augmented Generation pipeline on the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) Dataset from HuggingFace and generate answers using the [OpenChat-3.5](https://huggingface.co/openchat/openchat-3.5-1210) LLM.
We use the `LlamaCppGenerator` in a Retrieval Augmented Generation pipeline on the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) Dataset from HuggingFace and generate answers using the [OpenChat-3.5](https://huggingface.co/openchat/openchat-3.5-1210) LLM.
Load the dataset:
@ -237,4 +237,4 @@ result = rag_pipeline.run(
generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
## The Joker movie was released on October 4, 2019.
```
```

View File

@ -11,9 +11,9 @@ This component enables chat completions using any model made available by infere
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](../../../docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](/docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | "model": The name of the model to use for chat completion. <br />This depends on the inference provider used for the Llama Stack Server. |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](../../../docs/concepts/data-classes/chatmessage.mdx) objects representing the chat |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](/docs/concepts/data-classes/chatmessage.mdx) objects representing the chat |
| **Output variables** | “replies”: A list of alternative replies of the model to the input chat |
| **API reference** | [Llama Stack](/reference/integrations-llama-stack) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/llama_stack |
@ -63,7 +63,7 @@ import os
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator
client = LlamaStackChatGenerator(model="ollama/llama3.2:3b")
client = LlamaStackChatGenerator(model="ollama/llama3.2:3b")
response = client.run(
[ChatMessage.from_user("What are Agentic Pipelines? Be brief.")]
)
@ -112,4 +112,4 @@ response = pipe.run(
"template_variables": {"city": "Berlin"}}}
)
print(response)
```
```

View File

@ -11,10 +11,10 @@ This component enables chat completion using Mistral's text generation models.
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](../../../docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](/docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | "api_key": The Mistral API key. Can be set with `MISTRAL_API_KEY` env var. |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) objects |
| **Output variables** | "replies": A list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) objects <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and so on |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) objects |
| **Output variables** | "replies": A list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) objects <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and so on |
| **API reference** | [Mistral](/reference/integrations-mistral) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/mistral |
@ -85,7 +85,7 @@ prompt_builder = ChatPromptBuilder(variables=["documents"])
llm = MistralChatGenerator(streaming_callback=print_streaming_chunk, model='mistral-small')
message_template = """Answer the following question based on the contents of the article: {{query}}\n
Article: {{documents[0].content}} \n
Article: {{documents[0].content}} \n
"""
messages = [ChatMessage.from_user(message_template)]
@ -105,7 +105,7 @@ result = rag_pipeline.run(
{
"fetcher": {"urls": ["https://mistral.ai/news/mixtral-of-experts"]},
"prompt_builder": {"template_variables": {"query": question}, "template": messages},
"llm": {"generation_kwargs": {"max_tokens": 165}},
},
)
@ -113,4 +113,4 @@ result = rag_pipeline.run(
## Additional References
:cook: Cookbook: [Web QA with Mixtral-8x7B-Instruct-v0.1](https://haystack.deepset.ai/cookbook/mixtral-8x7b-for-web-qa)
:cook: Cookbook: [Web QA with Mixtral-8x7B-Instruct-v0.1](https://haystack.deepset.ai/cookbook/mixtral-8x7b-for-web-qa)

View File

@ -11,7 +11,7 @@ This Generator enables text generation using Nvidia-hosted models.
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) |
| **Most common position in a pipeline** | After a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) |
| **Mandatory init variables** | "api_key": API key for the NVIDIA NIM. Can be set with `NVIDIA_API_KEY` env var. |
| **Mandatory run variables** | “prompt”: A string containing the prompt for the LLM |
| **Output variables** | “replies”: A list of strings with all the replies generated by the LLM <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count and others |
@ -99,7 +99,7 @@ query = "What is the capital of France?"
template = """
Given the following information, answer the question.
Context:
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
@ -137,4 +137,4 @@ print(res)
## Additional References
:cook: Cookbook: [Haystack RAG Pipeline with Self-Deployed AI models using NVIDIA NIMs](https://haystack.deepset.ai/cookbook/rag-with-nims)
:cook: Cookbook: [Haystack RAG Pipeline with Self-Deployed AI models using NVIDIA NIMs](https://haystack.deepset.ai/cookbook/rag-with-nims)

View File

@ -11,7 +11,7 @@ A component that provides an interface to generate text using an LLM running on
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) |
| **Most common position in a pipeline** | After a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) |
| **Mandatory run variables** | “prompt”: A string containing the prompt for the LLM |
| **Output variables** | “replies”: A list of strings with all the replies generated by the LLM <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count and others |
| **API reference** | [Ollama](/reference/integrations-ollama) |
@ -31,14 +31,14 @@ This Generator supports [streaming](guides-to-generators/choosing-the-right-gene
## Usage
1. You need a running instance of Ollama. You can find the installation instructions [here](https://github.com/jmorganca/ollama).
1. You need a running instance of Ollama. You can find the installation instructions [here](https://github.com/jmorganca/ollama).
A fast way to run Ollama is using Docker:
```shell
docker run -d -p 11434:11434 --name ollama ollama/ollama:latest
```
2. You need to download or pull the desired LLM. The model library is available on the [Ollama website](https://ollama.ai/library).
2. You need to download or pull the desired LLM. The model library is available on the [Ollama website](https://ollama.ai/library).
If you are using Docker, you can, for example, pull the Zephyr model:
```shell
@ -104,7 +104,7 @@ from haystack.document_stores.in_memory import InMemoryDocumentStore
template = """
Given the following information, answer the question.
Context:
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
@ -141,4 +141,4 @@ print(result)
## soccer and summer. Unfortunately, there is no direct information given about
## what else you enjoy...'],
## 'meta': [{'model': 'zephyr', ...]}}
```
```

View File

@ -11,7 +11,7 @@ description: "`OpenAIGenerator` enables text generation using OpenAI's large lan
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) |
| **Most common position in a pipeline** | After a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) |
| **Mandatory init variables** | "api_key": An OpenAI API key. Can be set with `OPENAI_API_KEY` env var. |
| **Mandatory run variables** | “prompt”: A string containing the prompt for the LLM |
| **Output variables** | “replies”: A list of strings with all the replies generated by the LLM <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and so on |
@ -54,12 +54,12 @@ client = OpenAIGenerator(model="gpt-4", api_key=Secret.from_token("<your-api-key
response = client.run("What's Natural Language Processing? Be brief.")
print(response)
>>> {'replies': ['Natural Language Processing, often abbreviated as NLP, is a field
of artificial intelligence that focuses on the interaction between computers
and humans through natural language. The primary aim of NLP is to enable
computers to understand, interpret, and generate human language in a valuable way.'],
'meta': [{'model': 'gpt-4-0613', 'index': 0, 'finish_reason':
'stop', 'usage': {'prompt_tokens': 16, 'completion_tokens': 53,
>>> {'replies': ['Natural Language Processing, often abbreviated as NLP, is a field
of artificial intelligence that focuses on the interaction between computers
and humans through natural language. The primary aim of NLP is to enable
computers to understand, interpret, and generate human language in a valuable way.'],
'meta': [{'model': 'gpt-4-0613', 'index': 0, 'finish_reason':
'stop', 'usage': {'prompt_tokens': 16, 'completion_tokens': 53,
'total_tokens': 69}}]}
```
@ -73,16 +73,16 @@ client = OpenAIGenerator(streaming_callback=lambda chunk: print(chunk.content, e
response = client.run("What's Natural Language Processing? Be brief.")
print(response)
>>> Natural Language Processing (NLP) is a branch of artificial
intelligence that focuses on the interaction between computers and human
language. It involves enabling computers to understand, interpret,and respond
>>> Natural Language Processing (NLP) is a branch of artificial
intelligence that focuses on the interaction between computers and human
language. It involves enabling computers to understand, interpret,and respond
to natural human language in a way that is both meaningful and useful.
>>> {'replies': ['Natural Language Processing (NLP) is a branch of artificial
intelligence that focuses on the interaction between computers and human
language. It involves enabling computers to understand, interpret,and respond
to natural human language in a way that is both meaningful and useful.'],
'meta': [{'model': 'gpt-4o-mini', 'index': 0, 'finish_reason':
'stop', 'usage': {'prompt_tokens': 16, 'completion_tokens': 49,
>>> {'replies': ['Natural Language Processing (NLP) is a branch of artificial
intelligence that focuses on the interaction between computers and human
language. It involves enabling computers to understand, interpret,and respond
to natural human language in a way that is both meaningful and useful.'],
'meta': [{'model': 'gpt-4o-mini', 'index': 0, 'finish_reason':
'stop', 'usage': {'prompt_tokens': 16, 'completion_tokens': 49,
'total_tokens': 65}}]}
```
@ -107,7 +107,7 @@ query = "What is the capital of France?"
template = """
Given the following information, answer the question.
Context:
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
@ -132,4 +132,4 @@ res=pipe.run({
})
print(res)
```
```

View File

@ -11,7 +11,7 @@ This component enables text generation using LLMs deployed on Amazon Sagemaker.
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) |
| **Most common position in a pipeline** | After a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) |
| **Mandatory init variables** | "model": The model to use <br /> <br />"aws_access_key_id": AWS access key ID. Can be set with `AWS_ACCESS_KEY_ID` env var. <br /> <br />"aws_secret_access_key": AWS secret access key. Can be set with `AWS_SECRET_ACCESS_KEY` env var. |
| **Mandatory run variables** | “prompt”: A string containing the prompt for the LLM |
| **Output variables** | “replies”: A list of strings with all the replies generated by the LLM <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and so on |
@ -30,7 +30,7 @@ You also need to specify your Sagemaker endpoint at initialization time for the
generator = SagemakerGenerator(model="jumpstart-dft-hf-llm-falcon-7b-instruct-bf16")
```
Additionally, you can pass any text generation parameters valid for your specific model directly to `SagemakerGenerator` using the `generation_kwargs` parameter, both at initialization and in the `run()` method.
Additionally, you can pass any text generation parameters valid for your specific model directly to `SagemakerGenerator` using the `generation_kwargs` parameter, both at initialization and in the `run()` method.
If your model also needs custom attributes, pass those as a dictionary at initialization time by setting the `aws_custom_attributes` parameter.
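As an illustration, here is a minimal sketch that combines both parameters; the endpoint name and attribute values are placeholders for your own deployment:

```python
from haystack_integrations.components.generators.amazon_sagemaker import SagemakerGenerator

# Placeholder endpoint; Llama 2 JumpStart endpoints typically require accepting the EULA.
generator = SagemakerGenerator(
    model="jumpstart-dft-meta-textgeneration-llama-2-7b",
    aws_custom_attributes={"accept_eula": True},
    generation_kwargs={"max_new_tokens": 100},
)

result = generator.run("Briefly explain what NLP is.")
print(result["replies"][0])
```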
@ -80,7 +80,7 @@ from haystack.components.builders import PromptBuilder
template = """
Given the following information, answer the question.
Context:
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
@ -100,4 +100,4 @@ pipe.run({
"country": "France"
}
})
```
```

View File

@ -11,20 +11,20 @@ This component enables chat completions using the STACKIT API.
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`ChatPromptBuilder`](../../../docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Most common position in a pipeline** | After a [`ChatPromptBuilder`](/docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | "model": The model used through the STACKIT API |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](../../../docs/concepts/data-classes/chatmessage.mdx)  objects |
| **Output variables** | "replies": A list of [`ChatMessage`](../../../docs/concepts/data-classes/chatmessage.mdx) objects <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply (such as token count, finish reason, and so on) |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](/docs/concepts/data-classes/chatmessage.mdx)  objects |
| **Output variables** | "replies": A list of [`ChatMessage`](/docs/concepts/data-classes/chatmessage.mdx) objects <br /> <br />”meta”: A list of dictionaries with the metadata associated with each reply (such as token count, finish reason, and so on) |
| **API reference** | [STACKIT](/reference/integrations-stackit) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/stackit |
## Overview
`STACKITChatGenerator` enables text generation models served by STACKIT through their API.
`STACKITChatGenerator` enables text generation with models served by STACKIT through their API.
### Parameters
To use the `STACKITChatGenerator`, ensure you have set a `STACKIT_API_KEY` as an environment variable. Alternatively, provide the API key as another environment variable or a token by setting
To use the `STACKITChatGenerator`, ensure you have set a `STACKIT_API_KEY` as an environment variable. Alternatively, provide the API key as another environment variable or a token by setting
`api_key` and using Haystack's [secret management](../../concepts/secret-management.mdx).
Set your preferred supported model with the `model` parameter when initializing the component. See the full list of all supported models on the [STACKIT website](https://docs.stackit.cloud/stackit/en/models-licenses-319914532.html).
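A minimal sketch, assuming `STACKIT_API_KEY` is set and using a placeholder import path and model name:

```python
# Sketch only: import path and model name are placeholders.
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.stackit import STACKITChatGenerator

generator = STACKITChatGenerator(model="neural-chat-7b-v3-1")  # reads STACKIT_API_KEY from the environment

result = generator.run(messages=[ChatMessage.from_user("What is STACKIT?")])
print(result["replies"][0].text)
```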
@ -86,4 +86,4 @@ result = pipeline.run({"prompt_builder": {"template_variables": {"question": "Te
print(result)
```
For an example of streaming in a pipeline, refer to the examples in the STACKIT integration [repository](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/stackit/examples) and on its dedicated [integration page](https://haystack.deepset.ai/integrations/stackit).
For an example of streaming in a pipeline, refer to the examples in the STACKIT integration [repository](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/stackit/examples) and on its dedicated [integration page](https://haystack.deepset.ai/integrations/stackit).

View File

@ -11,7 +11,7 @@ This component enables text generation (image captioning) using Google Vertex AI
| | |
| --- | --- |
| **Mandatory run variables** | “image”: A [`ByteStream`](../../../docs/concepts/data-classes.mdx#bytestresm) containing an image data <br /> <br />”question”: A string of a question about the image |
| **Mandatory run variables** | “image”: A [`ByteStream`](/docs/concepts/data-classes.mdx#bytestream) containing the image data <br /> <br />”question”: A string of a question about the image |
| **Output variables** | “replies”: A list of strings containing answers generated by the model |
| **API reference** | [Google Vertex](/reference/integrations-google-vertex) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/google_vertex |
@ -72,4 +72,4 @@ for answer in res["replies"]:
>>> pomeranian
>>> white
>>> pomeranian puppy
```
```

View File

@ -11,10 +11,10 @@ Use this component with IBM watsonx models like `granite-3-2b-instruct` for chat
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](../../../docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Most common position in a pipeline** | After a [ChatPromptBuilder](/docs/pipeline-components/builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | "api_key": The IBM Cloud API key. Can be set with `WATSONX_API_KEY` env var. <br /> <br />"project_id": The IBM Cloud project ID. Can be set with `WATSONX_PROJECT_ID` env var. |
| **Mandatory run variables** | "messages" A list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) objects |
| **Output variables** | "replies": A list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) objects |
| **Mandatory run variables** | "messages" A list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) objects |
| **Output variables** | "replies": A list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) objects |
| **API reference** | [Watsonx](/reference/integrations-watsonx) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/watsonx |
@ -86,4 +86,4 @@ messages = [system_message, ChatMessage.from_user("What's the official language
res = pipe.run(data={"prompt_builder": {"template_variables": {"country": country}, "template": messages}})
print(res)
```
```

View File

@ -11,7 +11,7 @@ Use this component with IBM watsonx models like `granite-3-2b-instruct` for simp
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [PromptBuilder](../../../docs/pipeline-components/builders/promptbuilder.mdx) |
| **Most common position in a pipeline** | After a [PromptBuilder](/docs/pipeline-components/builders/promptbuilder.mdx) |
| **Mandatory init variables** | "api_key": An IBM Cloud API key. Can be set with `WATSONX_API_KEY` env var. <br /> <br />"project_id": An IBM Cloud project ID. Can be set with `WATSONX_PROJECT_ID` env var. |
| **Mandatory run variables** | "prompt": A string containing the prompt for the LLM |
| **Output variables** | "replies": A list of strings with all the replies generated by the LLM <br /> <br />"meta": A list of dictionaries with the metadata associated with each reply, such as token count, finish reason, and so on |
@ -93,4 +93,4 @@ query = "What language is spoken in Germany?"
res = pipe.run(data={"prompt_builder": {"query": query}})
print(res)
```
```

View File

@ -2,22 +2,21 @@
title: "PreProcessors"
id: preprocessors
slug: "/preprocessors"
description: "Use the PreProcessors to preprare your data normalize white spaces, remove headers and footers, clean empty lines in your Documents, or split them into smaller pieces. PreProcessors are useful in an indexing pipeline to prepare your files for search."
description: "Use the PreProcessors to prepare your data normalize white spaces, remove headers and footers, clean empty lines in your Documents, or split them into smaller pieces. PreProcessors are useful in an indexing pipeline to prepare your files for search."
---
# PreProcessors
Use the PreProcessors to preprare your data normalize white spaces, remove headers and footers, clean empty lines in your Documents, or split them into smaller pieces. PreProcessors are useful in an indexing pipeline to prepare your files for search.
Use the PreProcessors to prepare your data: normalize white spaces, remove headers and footers, clean empty lines in your Documents, or split them into smaller pieces. PreProcessors are useful in an indexing pipeline to prepare your files for search.
| PreProcessor | Description |
| --- | --- |
| [ChineseDocumentSplitter](../../docs/pipeline-components/preprocessors/chinesedocumentsplitter.mdx) | Divides Chinese text documents into smaller chunks using advanced Chinese language processing capabilities, using HanLP for accurate Chinese word segmentation and sentence tokenization. |
| [CSVDocumentCleaner](../../docs/pipeline-components/preprocessors/csvdocumentcleaner.mdx) | Cleans CSV documents by removing empty rows and columns while preserving specific ignored rows and columns. |
| [CSVDocumentSplitter](../../docs/pipeline-components/preprocessors/csvdocumentsplitter.mdx) | Divides CSV documents into smaller sub-tables based on empty rows and columns. |
| [DocumentCleaner](../../docs/pipeline-components/preprocessors/documentcleaner.mdx) | Removes extra whitespaces, empty lines, specified substrings, regexes, page headers, and footers from documents. |
| [DocumentPreprocessor](../../docs/pipeline-components/preprocessors/documentpreprocessor.mdx) | Divides a list of text documents into a list of shorter text documents and then makes them more readable by cleaning. |
| [DocumentSplitter](../../docs/pipeline-components/preprocessors/documentsplitter.mdx) | Splits a list of text documents into a list of text documents with shorter texts. |
| [HierarchicalDocumentSplitter](../../docs/pipeline-components/preprocessors/hierarchicaldocumentsplitter.mdx) | Creates a multi-level document structure based on parent-children relationships between text segments. |
| [RecursiveSplitter](../../docs/pipeline-components/preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks, it does so by recursively applying a list of separators <br />to the text, applied in the order they are provided. |
| [TextCleaner](../../docs/pipeline-components/preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. |
| [ChineseDocumentSplitter](/docs/pipeline-components/preprocessors/chinesedocumentsplitter.mdx) | Divides Chinese text documents into smaller chunks using advanced Chinese language processing capabilities, using HanLP for accurate Chinese word segmentation and sentence tokenization. |
| [CSVDocumentCleaner](/docs/pipeline-components/preprocessors/csvdocumentcleaner.mdx) | Cleans CSV documents by removing empty rows and columns while preserving specific ignored rows and columns. |
| [CSVDocumentSplitter](/docs/pipeline-components/preprocessors/csvdocumentsplitter.mdx) | Divides CSV documents into smaller sub-tables based on empty rows and columns. |
| [DocumentCleaner](/docs/pipeline-components/preprocessors/documentcleaner.mdx) | Removes extra whitespaces, empty lines, specified substrings, regexes, page headers, and footers from documents. |
| [DocumentPreprocessor](/docs/pipeline-components/preprocessors/documentpreprocessor.mdx) | Divides a list of text documents into a list of shorter text documents and then makes them more readable by cleaning. |
| [DocumentSplitter](/docs/pipeline-components/preprocessors/documentsplitter.mdx) | Splits a list of text documents into a list of text documents with shorter texts. |
| [HierarchicalDocumentSplitter](/docs/pipeline-components/preprocessors/hierarchicaldocumentsplitter.mdx) | Creates a multi-level document structure based on parent-children relationships between text segments. |
| [RecursiveSplitter](/docs/pipeline-components/preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks by recursively applying a list of separators <br />to the text, in the order they are provided. |
| [TextCleaner](/docs/pipeline-components/preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. |
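A short sketch of chaining two of these components outside of a pipeline, with illustrative split settings:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

cleaner = DocumentCleaner(remove_empty_lines=True, remove_extra_whitespaces=True)
splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=20)

docs = [Document(content="A long   text with  extra whitespace...\n\n\n...and empty lines.")]
cleaned = cleaner.run(documents=docs)["documents"]
chunks = splitter.run(documents=cleaned)["documents"]
print(len(chunks))
```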

View File

@ -11,7 +11,7 @@ description: "`ChineseDocumentSplitter` divides Chinese text documents into smal
| | |
| --- | --- |
| **Most common position in a pipeline** | In indexing pipelines after [Converters](https://docs.haystack.deepset.ai../../../docs/pipeline-components/converters.mdx) and [DocumentCleaner](documentcleaner.mdx), before [Classifiers](https://docs.haystack.deepset.ai../../../docs/pipeline-components/classifiers.mdx) |
| **Most common position in a pipeline** | In indexing pipelines after [Converters](https://docs.haystack.deepset.ai/docs/pipeline-components/converters.mdx) and [DocumentCleaner](documentcleaner.mdx), before [Classifiers](https://docs.haystack.deepset.ai/docs/pipeline-components/classifiers.mdx) |
| **Mandatory run variables** | "documents": A list of documents with Chinese text content |
| **Output variables** | "documents": A list of documents, each containing a chunk of the original Chinese text |
| **API reference** | [PreProcessors](/reference/preprocessors-api) |
@ -57,9 +57,9 @@ from haystack_integrations.components.preprocessors.hanlp import ChineseDocument
## Initialize the splitter with word-based splitting
splitter = ChineseDocumentSplitter(
split_by="word",
split_length=10,
split_overlap=3,
split_by="word",
split_length=10,
split_overlap=3,
granularity="coarse"
)
@ -89,9 +89,9 @@ doc = Document(content=
)
splitter = ChineseDocumentSplitter(
split_by="word",
split_length=10,
split_overlap=3,
split_by="word",
split_length=10,
split_overlap=3,
respect_sentence_boundary=True,
granularity="coarse"
)
@ -115,9 +115,9 @@ from haystack_integrations.components.preprocessors.hanlp import ChineseDocument
doc = Document(content="人工智能技术正在快速发展,改变着我们的生活方式。")
splitter = ChineseDocumentSplitter(
split_by="word",
split_length=5,
split_overlap=0,
split_by="word",
split_length=5,
split_overlap=0,
granularity="fine" # More detailed segmentation
)
splitter.warm_up()
@ -140,7 +140,7 @@ def custom_split(text: str) -> list[str]:
doc = Document(content="第一段,第二段,第三段,第四段")
splitter = ChineseDocumentSplitter(
split_by="function",
split_by="function",
splitting_function=custom_split
)
splitter.warm_up()
@ -166,9 +166,9 @@ p = Pipeline()
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
p.add_component(instance=DocumentCleaner(), name="cleaner")
p.add_component(instance=ChineseDocumentSplitter(
split_by="word",
split_length=100,
split_overlap=20,
split_by="word",
split_length=100,
split_overlap=20,
respect_sentence_boundary=True,
granularity="coarse"
), name="chinese_splitter")
@ -183,4 +183,4 @@ p.connect("chinese_splitter.documents", "writer.documents")
p.run({"text_file_converter": {"sources": ["path/to/your/chinese/files.txt"]}})
```
This pipeline processes Chinese text files by converting them to documents, cleaning the text, splitting them into linguistically-aware chunks using Chinese word segmentation, and storing the results in the Document Store for further retrieval and processing.
This pipeline processes Chinese text files by converting them to documents, cleaning the text, splitting them into linguistically-aware chunks using Chinese word segmentation, and storing the results in the Document Store for further retrieval and processing.

View File

@ -11,7 +11,7 @@ description: "`CSVDocumentSplitter` divides CSV documents into smaller sub-table
| | |
| --- | --- |
| **Most common position in a pipeline** | In indexing pipelines after [Converters](https://docs.haystack.deepset.ai../../../docs/pipeline-components/converters.mdx) , before [CSVDocumentCleaner](csvdocumentcleaner.mdx) |
| **Most common position in a pipeline** | In indexing pipelines after [Converters](https://docs.haystack.deepset.ai/docs/pipeline-components/converters.mdx) , before [CSVDocumentCleaner](csvdocumentcleaner.mdx) |
| **Mandatory run variables** | "documents": A list of documents with CSV-formatted content |
| **Output variables** | "documents": A list of documents, each containing a sub-table extracted from the original CSV file |
| **API reference** | [PreProcessors](/reference/preprocessors-api) |
@ -19,7 +19,7 @@ description: "`CSVDocumentSplitter` divides CSV documents into smaller sub-table
## Overview
`CSVDocumentSplitter` expects a list of documents containing CSV-formatted content and returns a list of new `Document` objects, each representing a sub-table extracted from the original document.
`CSVDocumentSplitter` expects a list of documents containing CSV-formatted content and returns a list of new `Document` objects, each representing a sub-table extracted from the original document.
There are two modes of operation for the splitter:
@ -110,4 +110,4 @@ p.connect("cleaner.documents", "writer.documents")
p.run({"csv_file_converter": {"sources": ["path/to/your/file.csv"]}})
```
This pipeline extracts CSV content, splits it into structured sub-tables, cleans the CSV documents by removing empty rows and columns, and stores the resulting documents in the Document Store for further retrieval and processing.
This pipeline extracts CSV content, splits it into structured sub-tables, cleans the CSV documents by removing empty rows and columns, and stores the resulting documents in the Document Store for further retrieval and processing.
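A minimal standalone sketch, assuming the splitter's default settings (the number of empty separator rows it needs depends on its configuration):

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentSplitter

# Two small tables separated by empty rows; how many empty rows trigger a split
# depends on the splitter's configuration.
csv_content = "name,age\nAlice,30\nBob,25\n,\n,\ncity,country\nParis,France\nBerlin,Germany\n"

splitter = CSVDocumentSplitter()
result = splitter.run(documents=[Document(content=csv_content)])
for doc in result["documents"]:
    print(doc.content)
    print("---")
```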

View File

@ -11,16 +11,15 @@ Rankers are a group of components that order documents by given criteria. Their
| Ranker | Description |
| --- | --- |
| [AmazonBedrockRanker](../../docs/pipeline-components/rankers/amazonbedrockranker.mdx) | Ranks documents based on their similarity to the query using Amazon Bedrock models. |
| [CohereRanker](../../docs/pipeline-components/rankers/cohereranker.mdx) | Ranks documents based on their similarity to the query using Cohere rerank models. |
| [FastembedRanker](../../docs/pipeline-components/rankers/fastembedranker.mdx) | Ranks documents based on their similarity to the query using cross-encoder models supported by FastEmbed. |
| [HuggingFaceTEIRanker](../../docs/pipeline-components/rankers/huggingfaceteiranker.mdx) | Ranks documents based on their similarity to the query using a Text Embeddings Inference (TEI) API endpoint. |
| [JinaRanker](../../docs/pipeline-components/rankers/jinaranker.mdx) | Ranks documents based on their similarity to the query using Jina AI models. |
| [LostInTheMiddleRanker](../../docs/pipeline-components/rankers/lostinthemiddleranker.mdx) | Positions the most relevant documents at the beginning and at the end of the resulting list while placing the least relevant documents in the middle, based on a [research paper](https://arxiv.org/abs/2307.03172). |
| [MetaFieldRanker](../../docs/pipeline-components/rankers/metafieldranker.mdx) | A lightweight Ranker that orders documents based on a specific metadata field value. |
| [MetaFieldGroupingRanker](../../docs/pipeline-components/rankers/metafieldgroupingranker.mdx) | Reorders the documents by grouping them based on metadata keys. |
| [NvidiaRanker](../../docs/pipeline-components/rankers/nvidiaranker.mdx) | Ranks documents using large-language models from [NVIDIA NIMs](https://ai.nvidia.com) . |
| [TransformersSimilarityRanker](../../docs/pipeline-components/rankers/transformerssimilarityranker.mdx) | A legacy version of [SentenceTransformersSimilarityRanker](../../docs/pipeline-components/rankers/sentencetransformerssimilarityranker.mdx). |
| [SentenceTransformersDiversityRanker](../../docs/pipeline-components/rankers/sentencetransformersdiversityranker.mdx) | A Diversity Ranker based on Sentence Transformers. |
| [SentenceTransformersSimilarityRanker](../../docs/pipeline-components/rankers/sentencetransformerssimilarityranker.mdx) | A model-based Ranker that orders documents based on their relevance to the query. It uses a cross-encoder model to produce query and document embeddings. It then compares the similarity of the query embedding to the document embeddings to produce a ranking with the most similar documents appearing first. <br /> <br />It's a powerful Ranker that takes word order and syntax into account. You can use it to improve the initial ranking done by a weaker Retriever, but it's also more expensive computationally than the Rankers that don't use models. |
| [AmazonBedrockRanker](/docs/pipeline-components/rankers/amazonbedrockranker.mdx) | Ranks documents based on their similarity to the query using Amazon Bedrock models. |
| [CohereRanker](/docs/pipeline-components/rankers/cohereranker.mdx) | Ranks documents based on their similarity to the query using Cohere rerank models. |
| [FastembedRanker](/docs/pipeline-components/rankers/fastembedranker.mdx) | Ranks documents based on their similarity to the query using cross-encoder models supported by FastEmbed. |
| [HuggingFaceTEIRanker](/docs/pipeline-components/rankers/huggingfaceteiranker.mdx) | Ranks documents based on their similarity to the query using a Text Embeddings Inference (TEI) API endpoint. |
| [JinaRanker](/docs/pipeline-components/rankers/jinaranker.mdx) | Ranks documents based on their similarity to the query using Jina AI models. |
| [LostInTheMiddleRanker](/docs/pipeline-components/rankers/lostinthemiddleranker.mdx) | Positions the most relevant documents at the beginning and at the end of the resulting list while placing the least relevant documents in the middle, based on a [research paper](https://arxiv.org/abs/2307.03172). |
| [MetaFieldRanker](/docs/pipeline-components/rankers/metafieldranker.mdx) | A lightweight Ranker that orders documents based on a specific metadata field value. |
| [MetaFieldGroupingRanker](/docs/pipeline-components/rankers/metafieldgroupingranker.mdx) | Reorders the documents by grouping them based on metadata keys. |
| [NvidiaRanker](/docs/pipeline-components/rankers/nvidiaranker.mdx) | Ranks documents using large language models from [NVIDIA NIMs](https://ai.nvidia.com). |
| [TransformersSimilarityRanker](/docs/pipeline-components/rankers/transformerssimilarityranker.mdx) | A legacy version of [SentenceTransformersSimilarityRanker](/docs/pipeline-components/rankers/sentencetransformerssimilarityranker.mdx). |
| [SentenceTransformersDiversityRanker](/docs/pipeline-components/rankers/sentencetransformersdiversityranker.mdx) | A Diversity Ranker based on Sentence Transformers. |
| [SentenceTransformersSimilarityRanker](/docs/pipeline-components/rankers/sentencetransformerssimilarityranker.mdx) | A model-based Ranker that orders documents based on their relevance to the query. It uses a cross-encoder model to produce query and document embeddings. It then compares the similarity of the query embedding to the document embeddings to produce a ranking with the most similar documents appearing first. <br /> <br />It's a powerful Ranker that takes word order and syntax into account. You can use it to improve the initial ranking done by a weaker Retriever, but it's also more expensive computationally than the Rankers that don't use models. |

View File

@ -11,7 +11,7 @@ Use this component to rank documents based on their similarity to the query usin
| | |
| --- | --- |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents such as a [Retriever](../../../docs/pipeline-components/retrievers.mdx) |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents such as a [Retriever](/docs/pipeline-components/retrievers.mdx) |
| **Mandatory init variables** | "aws_access_key_id": AWS access key ID. Can be set with AWS_ACCESS_KEY_ID env var. <br /> <br />"aws_secret_access_key": AWS secret access key. Can be set with AWS_SECRET_ACCESS_KEY env var. <br /> <br />"aws_region_name": AWS region name. Can be set with AWS_DEFAULT_REGION env var. |
| **Mandatory run variables** | “documents”: A list of document objects <br /> <br />”query”: A query string |
| **Output variables** | “documents”: A list of document objects |
@ -90,4 +90,4 @@ document_ranker_pipeline.connect("retriever.documents", "ranker.documents")
query = "Cities in France"
res = document_ranker_pipeline.run(data={"retriever": {"query": query, "top_k": 3}, "ranker": {"query": query, "top_k": 2}})
```
```

View File

@ -11,7 +11,7 @@ Use this component to rank documents based on their similarity to the query usin
| | |
| --- | --- |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents such as a [Retriever](../../../docs/pipeline-components/retrievers.mdx) |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents such as a [Retriever](/docs/pipeline-components/retrievers.mdx) |
| **Mandatory init variables** | "api_key": The Cohere API key. Can be set with `COHERE_API_KEY` or `CO_API_KEY` env var. |
| **Mandatory run variables** | “documents”: A list of document objects <br /> <br />”query”: A query string <br /> <br />”top_k”: The maximum number of documents to return |
| **Output variables** | “documents”: A list of document objects |
@ -92,4 +92,4 @@ In the example above, the `top_k` values for the Retriever and the Ranker are di
You can set the same or a smaller `top_k` value for the Ranker. The Ranker's `top_k` is the number of documents it returns (if it's the last component in the pipeline) or forwards to the next component. In the pipeline example above, the Ranker is the last component, so the output you get when you run the pipeline is the top two documents, as per the Ranker's `top_k`.
Adjusting the `top_k` values can help you optimize performance. In this case, a smaller `top_k` value of the Retriever means fewer documents to process for the Ranker, which can speed up the pipeline.
:::
:::

View File

@ -11,7 +11,7 @@ Use this component to rank documents based on their similarity to the query usin
| | |
| --- | --- |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents such as a [Retriever](../../../docs/pipeline-components/retrievers.mdx) |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents such as a [Retriever](/docs/pipeline-components/retrievers.mdx) |
| **Mandatory run variables** | “documents”: A list of documents <br /> <br />”query”: A query string |
| **Output variables** | “documents”: A list of documents |
| **API reference** | [FastEmbed](/reference/fastembed-embedders) |
@ -19,7 +19,7 @@ Use this component to rank documents based on their similarity to the query usin
## Overview
`FastembedRanker` ranks the documents based on how similar they are to the query. It uses [cross-encoder models supported by FastEmbed](https://qdrant.github.io/fastembed/examples/Supported_Models/).
`FastembedRanker` ranks the documents based on how similar they are to the query. It uses [cross-encoder models supported by FastEmbed](https://qdrant.github.io/fastembed/examples/Supported_Models/).
Based on ONNX Runtime, FastEmbed provides a fast experience on standard CPU machines.
`FastembedRanker` is most useful in query pipelines such as a retrieval-augmented generation (RAG) pipeline or a document search pipeline to ensure the retrieved documents are ordered by relevance. You can use it after a Retriever (such as the [`InMemoryEmbeddingRetriever`](../retrievers/inmemoryembeddingretriever.mdx)) to improve the search results. When using `FastembedRanker` with a Retriever, consider setting the Retriever's `top_k` to a small number. This way, the Ranker will have fewer documents to process, which can help make your pipeline faster.
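A quick standalone sketch, assuming the component's default model and a `warm_up()` step:

```python
from haystack import Document
from haystack_integrations.components.rankers.fastembed import FastembedRanker

ranker = FastembedRanker()  # assumes the default FastEmbed cross-encoder model
ranker.warm_up()

docs = [Document(content="Paris is in France"), Document(content="Berlin is in Germany")]
result = ranker.run(query="Cities in France", documents=docs, top_k=1)
print(result["documents"][0].content)
```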
@ -104,4 +104,4 @@ document_ranker_pipeline.connect("retriever.documents", "ranker.documents")
query = "Cities in France"
res = document_ranker_pipeline.run(data={"retriever": {"query": query, "top_k": 3}, "ranker": {"query": query, "top_k": 2}})
```
```

View File

@ -11,7 +11,7 @@ Use this component to rank documents based on their similarity to the query usin
| | |
| --- | --- |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents, such as a [Retriever](../../../docs/pipeline-components/retrievers.mdx) |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents, such as a [Retriever](/docs/pipeline-components/retrievers.mdx) |
| **Mandatory init variables** | “url”: Base URL of the TEI reranking service (for example, "https://api.example.com"). |
| **Mandatory run variables** | “query”: A query string <br /> <br />“documents”: A list of document objects |
| **Output variables** | “documents”: A grouped list of documents |
@ -73,7 +73,7 @@ from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.rankers import HuggingFaceTEIRanker
docs = [Document(content="Paris is in France"),
docs = [Document(content="Paris is in France"),
Document(content="Berlin is in Germany"),
Document(content="Lyon is in France")]
document_store = InMemoryDocumentStore()
@ -90,6 +90,6 @@ document_ranker_pipeline.add_component(instance=ranker, name="ranker")
document_ranker_pipeline.connect("retriever.documents", "ranker.documents")
query = "Cities in France"
document_ranker_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
document_ranker_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
"ranker": {"query": query, "top_k": 2}})
```
```

View File

@ -11,7 +11,7 @@ Use this component to rank documents based on their similarity to the query usin
| | |
| --- | --- |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents (such as a [Retriever](../../../docs/pipeline-components/retrievers.mdx) ) |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents (such as a [Retriever](/docs/pipeline-components/retrievers.mdx) ) |
| **Mandatory init variables** | "api_key": The Jina API key. Can be set with `JINA_API_KEY` env var. |
| **Mandatory run variables** | “query”: A query string <br /> <br />”documents”: A list of documents |
| **Output variables** | “documents”: A list of documents |
@ -74,7 +74,7 @@ from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack_integrations.components.rankers.jina import JinaRanker
docs = [Document(content="Paris is in France"),
docs = [Document(content="Paris is in France"),
Document(content="Berlin is in Germany"),
Document(content="Lyon is in France")]
document_store = InMemoryDocumentStore()
@ -90,6 +90,6 @@ ranker_pipeline.add_component(instance=ranker, name="ranker")
ranker_pipeline.connect("retriever.documents", "ranker.documents")
query = "Cities in France"
ranker_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
ranker_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
"ranker": {"query": query, "top_k": 2}})
```
```

View File

@ -11,7 +11,7 @@ description: "`MetaFieldRanker` ranks Documents based on the value of their meta
| | |
| --- | --- |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents, such as a [Retriever](../../../docs/pipeline-components/retrievers.mdx) |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents, such as a [Retriever](/docs/pipeline-components/retrievers.mdx) |
| **Mandatory init variables** | "meta_field": The name of the meta field to rank by |
| **Mandatory run variables** | “documents”: A list of documents <br /> <br />”top_k”: The maximum number of documents to return. If not provided, returns all documents it received. |
| **Output variables** | “documents”: A list of documents |
@ -32,7 +32,7 @@ By default, `MetaFieldRanker` sorts documents only based on the meta field. You
### On its own
You can use this Ranker outside of a pipeline to sort documents.
You can use this Ranker outside of a pipeline to sort documents.
This example uses the `MetaFieldRanker` to rank two simple documents. When running the Ranker, you pass the `query`, provide the `documents` and set the number of documents to rank using the `top_k` parameter.
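A minimal sketch of such a call, using an assumed `rating` meta field:

```python
from haystack import Document
from haystack.components.rankers import MetaFieldRanker

ranker = MetaFieldRanker(meta_field="rating")
docs = [
    Document(content="Berlin is in Germany", meta={"rating": 0.7}),
    Document(content="Paris is in France", meta={"rating": 1.3}),
]
result = ranker.run(query="Cities in Europe", documents=docs, top_k=2)
print([doc.meta["rating"] for doc in result["documents"]])  # highest rating first by default
```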
@ -73,6 +73,6 @@ document_ranker_pipeline.add_component(instance=ranker, name="ranker")
document_ranker_pipeline.connect("retriever.documents", "ranker.documents")
query = "Cities in France"
document_ranker_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
document_ranker_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
"ranker": {"query": query, "top_k": 2}})
```
```

View File

@ -11,7 +11,7 @@ Use this component to rank documents based on their similarity to the query usin
| | |
| --- | --- |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents such as a [Retriever](../../../docs/pipeline-components/retrievers.mdx) |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents such as a [Retriever](/docs/pipeline-components/retrievers.mdx) |
| **Mandatory init variables** | "api_key": API key for the NVIDIA NIM. Can be set with `NVIDIA_API_KEY` env var. |
| **Mandatory run variables** | ”query”: A query string <br /> <br />“documents”: A list of document objects |
| **Output variables** | “documents”: A list of document objects |
@ -48,20 +48,20 @@ This example uses `NvidiaRanker` to rank two simple documents. To run the Ranker
from haystack_integrations.components.rankers.nvidia import NvidiaRanker
from haystack import Document
from haystack.utils import Secret
ranker = NvidiaRanker(
model="nvidia/nv-rerankqa-mistral-4b-v3",
api_key=Secret.from_env_var("NVIDIA_API_KEY"),
)
ranker.warm_up()
query = "What is the capital of Germany?"
documents = [
Document(content="Berlin is the capital of Germany."),
Document(content="The capital of Germany is Berlin."),
Document(content="Germany's capital is Berlin."),
]
result = ranker.run(query, documents, top_k=2)
print(result["documents"])
```
@ -105,4 +105,4 @@ In the example above, the `top_k` values for the Retriever and the Ranker are di
You can set the same or a smaller `top_k` value for the Ranker. The Ranker's `top_k` is the number of documents it returns (if it's the last component in the pipeline) or forwards to the next component. In the pipeline example above, the Ranker is the last component, so the output you get when you run the pipeline is the top two documents, as per the Ranker's `top_k`.
Adjusting the `top_k` values can help you optimize performance. In this case, a smaller `top_k` value of the Retriever means fewer documents to process for the Ranker, which can speed up the pipeline.
:::
:::

View File

@ -11,7 +11,7 @@ This is a Diversity Ranker based on Sentence Transformers.
| | |
| --- | --- |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents such as a [Retriever](../../../docs/pipeline-components/retrievers.mdx) |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents such as a [Retriever](/docs/pipeline-components/retrievers.mdx) |
| **Mandatory init variables** | "token": The Hugging Face API token. Can be set with `HF_API_TOKEN` or `HF_TOKEN` env var. |
| **Mandatory run variables** | “documents”: A list of documents <br /> <br />”query”: A query string |
| **Output variables** | “documents”: A list of documents |
@ -39,9 +39,9 @@ from haystack.components.rankers import SentenceTransformersDiversityRanker
ranker = SentenceTransformersDiversityRanker(model="sentence-transformers/all-MiniLM-L6-v2", similarity="cosine")
ranker.warm_up()
docs = [Document(content="Regular Exercise"), Document(content="Balanced Nutrition"), Document(content="Positive Mindset"),
docs = [Document(content="Regular Exercise"), Document(content="Balanced Nutrition"), Document(content="Positive Mindset"),
Document(content="Eating Well"), Document(content="Doing physical activities"), Document(content="Thinking positively")]
query = "How can I maintain physical fitness?"
output = ranker.run(query=query, documents=docs)
docs = output["documents"]
@ -73,6 +73,6 @@ document_ranker_pipeline.add_component(instance=ranker, name="ranker")
document_ranker_pipeline.connect("retriever.documents", "ranker.documents")
query = "Most famous iconic sight in Paris"
document_ranker_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
document_ranker_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
"ranker": {"query": query, "top_k": 2}})
```
```

View File

@ -11,7 +11,7 @@ Use this component to rank documents based on their similarity to the query. The
| | |
| --- | --- |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents such as a [Retriever](../../../docs/pipeline-components/retrievers.mdx) |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents such as a [Retriever](/docs/pipeline-components/retrievers.mdx) |
| **Mandatory init variables** | "token" (only for private models): The Hugging Face API token. Can be set with `HF_API_TOKEN` or `HF_TOKEN` env var. |
| **Mandatory run variables** | “documents”: A list of documents <br /> <br />”query”: A query string |
| **Output variables** | “documents”: A list of documents |
@ -101,4 +101,4 @@ In the example above, the `top_k` values for the Retriever and the Ranker are di
You can set the same or a smaller `top_k` value for the Ranker. The Ranker's `top_k` is the number of documents it returns (if it's the last component in the pipeline) or forwards to the next component. In the pipeline example above, the Ranker is the last component, so the output you get when you run the pipeline is the top two documents, as per the Ranker's `top_k`.
Adjusting the `top_k` values can help you optimize performance. In this case, a smaller `top_k` value of the Retriever means fewer documents to process for the Ranker, which can speed up the pipeline.
:::
:::

View File

@ -10,13 +10,13 @@ description: "Use this component to rank documents based on their similarity to
Use this component to rank documents based on their similarity to the query. The `TransformersSimilarityRanker` is a powerful, model-based Ranker that uses a cross-encoder model to produce document and query embeddings.
> 🚧 Legacy Component
>
> This component is considered legacy and will no longer receive updates. It may be deprecated in a future release, followed by removal after a deprecation period.
>
> This component is considered legacy and will no longer receive updates. It may be deprecated in a future release, followed by removal after a deprecation period.
> Consider using SentenceTransformersSimilarityRanker instead, as it provides the same functionality and additional features.
| | |
| --- | --- |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents such as a [Retriever](../../../docs/pipeline-components/retrievers.mdx) |
| **Most common position in a pipeline** | In a query pipeline, after a component that returns a list of documents such as a [Retriever](/docs/pipeline-components/retrievers.mdx) |
| **Mandatory init variables** | "token" (only for private models): The Hugging Face API token. Can be set with `HF_API_TOKEN` or `HF_TOKEN` env var. |
| **Mandatory run variables** | “documents”: A list of documents <br /> <br />”query”: A query string |
| **Output variables** | “documents”: A list of documents |
@ -63,7 +63,7 @@ ranker.run(query="City in France", documents=docs, top_k=1)
### In a pipeline
`TransformersSimilarityRanker` is most efficient in query pipelines when used after a Retriever.
`TransformersSimilarityRanker` is most efficient in query pipelines when used after a Retriever.
Below is an example of a pipeline that retrieves documents from an `InMemoryDocumentStore` based on keyword search (using `InMemoryBM25Retriever`). It then uses the `TransformersSimilarityRanker` to rank the retrieved documents according to their similarity to the query. The pipeline uses the default settings of the Ranker.
@ -73,7 +73,7 @@ from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.rankers import TransformersSimilarityRanker
docs = [Document(content="Paris is in France"),
docs = [Document(content="Paris is in France"),
Document(content="Berlin is in Germany"),
Document(content="Lyon is in France")]
document_store = InMemoryDocumentStore()
@ -90,7 +90,7 @@ document_ranker_pipeline.add_component(instance=ranker, name="ranker")
document_ranker_pipeline.connect("retriever.documents", "ranker.documents")
query = "Cities in France"
document_ranker_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
document_ranker_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
"ranker": {"query": query, "top_k": 2}})
```
@ -102,4 +102,4 @@ In the example above, the `top_k` values for the Retriever and the Ranker are di
You can set the same or a smaller `top_k` value for the Ranker. The Ranker's `top_k` is the number of documents it returns (if it's the last component in the pipeline) or forwards to the next component. In the pipeline example above, the Ranker is the last component, so the output you get when you run the pipeline is the top two documents, as per the Ranker's `top_k`.
Adjusting the `top_k` values can help you optimize performance. In this case, a smaller `top_k` value of the Retriever means fewer documents to process for the Ranker, which can speed up the pipeline.
:::
:::

View File

@ -11,16 +11,16 @@ Use this component in extractive question answering pipelines based on a query a
| | |
| --- | --- |
| **Most common position in a pipeline** | In query pipelines, after a component that returns a list of documents, such as a [Retriever](../../../docs/pipeline-components/retrievers.mdx) |
| **Most common position in a pipeline** | In query pipelines, after a component that returns a list of documents, such as a [Retriever](/docs/pipeline-components/retrievers.mdx) |
| **Mandatory init variables** | "token": The Hugging Face API token. Can be set with `HF_API_TOKEN` or `HF_TOKEN` env var. |
| **Mandatory run variables** | "documents": A list of documents <br /> <br />"query": A query string |
| **Output variables** | "answers": A list of [`ExtractedAnswer`](../../../docs/concepts/data-classes.mdx#extractedanswer) objects |
| **Output variables** | "answers": A list of [`ExtractedAnswer`](/docs/concepts/data-classes.mdx#extractedanswer) objects |
| **API reference** | [Readers](/reference/readers-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/readers/extractive.py |
## Overview
`ExtractiveReader` locates and extracts answers to a given query from the document text. It's used in extractive QA systems where you want to know exactly where the answer is located within the document. It's usually coupled with a Retriever that precedes it, but you can also use it with other components that fetch documents.
`ExtractiveReader` locates and extracts answers to a given query from the document text. It's used in extractive QA systems where you want to know exactly where the answer is located within the document. It's usually coupled with a Retriever that precedes it, but you can also use it with other components that fetch documents.
Readers assign a _probability_ to answers. This score ranges from 0 to 1, indicating how well the results the Reader returned match the query. A probability closer to 1 means the model has high confidence in the answer's relevance. The Reader sorts the answers based on their probability scores, with the highest probability listed first. You can limit the number of answers the Reader returns in the optional `top_k` parameter.
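A short standalone sketch, using made-up documents and the Reader's default model:

```python
from haystack import Document
from haystack.components.readers import ExtractiveReader

reader = ExtractiveReader()  # default extractive QA model
reader.warm_up()

docs = [
    Document(content="Paris is the capital of France."),
    Document(content="Berlin is the capital of Germany."),
]
result = reader.run(query="What is the capital of France?", documents=docs, top_k=2)
for answer in result["answers"]:
    print(answer.data, answer.score)
```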
@ -90,6 +90,6 @@ extractive_qa_pipeline.add_component(instance=reader, name="reader")
extractive_qa_pipeline.connect("retriever.documents", "reader.documents")
query = "What is the capital of France?"
extractive_qa_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
extractive_qa_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
"reader": {"query": query, "top_k": 2}})
```
```

View File

@ -13,8 +13,8 @@ This Retriever combines embedding-based retrieval and BM25 text search search to
| | |
| --- | --- |
| **Most common position in a pipeline** | 1. After a TextEmbedder and before a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) in a RAG pipeline 2. The last component in a hybrid search pipeline 3. After a TextEmbedder and before an [`ExtractiveReader`](../../../docs/pipeline-components/readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | "document_store": An instance of [`AzureAISearchDocumentStore`](../../../docs/document-stores/azureaisearchdocumentstore.mdx) |
| **Most common position in a pipeline** | 1. After a TextEmbedder and before a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) in a RAG pipeline 2. The last component in a hybrid search pipeline 3. After a TextEmbedder and before an [`ExtractiveReader`](/docs/pipeline-components/readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | "document_store": An instance of [`AzureAISearchDocumentStore`](/docs/document-stores/azureaisearchdocumentstore.mdx) |
| **Mandatory run variables** | "query": A string <br /> <br />”query_embedding”: A list of floats |
| **Output variables** | “documents”: A list of documents (matching the query) |
| **API reference** | [Azure AI Search](/reference/integrations-azure_ai_search) |
@ -24,7 +24,7 @@ This Retriever combines embedding-based retrieval and BM25 text search search to
The `AzureAISearchHybridRetriever` combines vector retrieval and BM25 text search to fetch relevant documents from the `AzureAISearchDocumentStore`. It processes both textual (keyword) queries and query embeddings in a single request, executing all subqueries in parallel. The results are merged and reordered using [Reciprocal Rank Fusion (RRF)](https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking) to create a unified result set.
Besides the `query` and `query_embedding`, the `AzureAISearchHybridRetriever` accepts optional parameters such as `top_k` (the maximum number of documents to retrieve) and `filters` to refine the search. Additional keyword arguments can also be passed during initialization for further customization.
Besides the `query` and `query_embedding`, the `AzureAISearchHybridRetriever` accepts optional parameters such as `top_k` (the maximum number of documents to retrieve) and `filters` to refine the search. Additional keyword arguments can also be passed during initialization for further customization.
If your search index includes a [semantic configuration](https://learn.microsoft.com/en-us/azure/search/semantic-how-to-query-request), you can enable semantic ranking to apply it to the Retriever's results. For more details, refer to the [Azure AI documentation](https://learn.microsoft.com/en-us/azure/search/hybrid-search-how-to-query#semantic-hybrid-search).
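A rough sketch of a standalone call with these optional parameters, assuming the import paths, index name, filter field, and embedding dimension, with credentials read from your Azure environment variables:

```python
# Sketch only: import paths, index name, and filter field are assumptions.
from haystack_integrations.components.retrievers.azure_ai_search import AzureAISearchHybridRetriever
from haystack_integrations.document_stores.azure_ai_search import AzureAISearchDocumentStore

document_store = AzureAISearchDocumentStore(index_name="haystack-docs")  # endpoint and key from env vars
retriever = AzureAISearchHybridRetriever(document_store=document_store)

result = retriever.run(
    query="hybrid search",
    query_embedding=[0.1] * 384,  # normally produced by a Text Embedder with a matching dimension
    top_k=5,
    filters={"field": "meta.version", "operator": "==", "value": "2.x"},
)
print(result["documents"])
```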
@ -113,4 +113,4 @@ result = query_pipeline.run({"text_embedder": {"text": query}, "retriever": {"qu
print(result["retriever"]["documents"][0])
```
```

View File

@ -11,8 +11,8 @@ A keyword-based Retriever compatible with InMemoryDocumentStore.
| | |
| --- | --- |
| **Most common position in a pipeline** | In query pipelines: <br />In a RAG pipeline, before a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) <br />In a semantic search pipeline, as the last component <br />In an extractive QA pipeline, before an [`ExtractiveReader`](../../../docs/pipeline-components/readers/extractivereader.mdx) |
| **Mandatory init variables** | "document_store": An instance of [InMemoryDocumentStore](../../../docs/document-stores/inmemorydocumentstore.mdx) |
| **Most common position in a pipeline** | In query pipelines: <br />In a RAG pipeline, before a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) <br />In a semantic search pipeline, as the last component <br />In an extractive QA pipeline, before an [`ExtractiveReader`](/docs/pipeline-components/readers/extractivereader.mdx) |
| **Mandatory init variables** | "document_store": An instance of [InMemoryDocumentStore](/docs/document-stores/inmemorydocumentstore.mdx) |
| **Mandatory run variables** | "query": A query string |
| **Output variables** | "documents": A list of documents (matching the query) |
| **API reference** | [Retrievers](/reference/retrievers-api) |
@ -20,11 +20,11 @@ A keyword-based Retriever compatible with InMemoryDocumentStore.
## Overview
`InMemoryBM25Retriever` is a keyword-based Retriever that fetches Documents matching a query from a temporary in-memory database. It determines the similarity between Documents and the query based on the BM25 algorithm, which computes a weighted word overlap between the two strings.
`InMemoryBM25Retriever` is a keyword-based Retriever that fetches Documents matching a query from a temporary in-memory database. It determines the similarity between Documents and the query based on the BM25 algorithm, which computes a weighted word overlap between the two strings.
Since the `InMemoryBM25Retriever` matches strings based on word overlap, its often used to find exact matches to names of persons or products, IDs, or well-defined error messages. The BM25 algorithm is very lightweight and simple. Nevertheless, it can be hard to beat with more complex embedding-based approaches on out-of-domain data.
In addition to the `query`, the `InMemoryBM25Retriever` accepts other optional parameters, including `top_k` (the maximum number of Documents to retrieve) and `filters` to narrow down the search space.
In addition to the `query`, the `InMemoryBM25Retriever` accepts other optional parameters, including `top_k` (the maximum number of Documents to retrieve) and `filters` to narrow down the search space.
Some relevant parameters that impact the BM25 retrieval must be defined when the corresponding `InMemoryDocumentStore` is initialized: these include the specific BM25 algorithm and its parameters.
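A small sketch, with an illustrative BM25 variant, parameters, and filter field:

```python
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# The BM25 algorithm and its parameters are set on the Document Store, not on the Retriever.
document_store = InMemoryDocumentStore(bm25_algorithm="BM25Plus", bm25_parameters={"b": 0.75})
document_store.write_documents([
    Document(content="Error 1234: connection timed out", meta={"source": "logs"}),
    Document(content="Error 5678: disk full", meta={"source": "logs"}),
])

retriever = InMemoryBM25Retriever(document_store=document_store)
result = retriever.run(
    query="Error 1234",
    top_k=1,
    filters={"field": "meta.source", "operator": "==", "value": "logs"},
)
print(result["documents"][0].content)
```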
## Usage
@ -134,4 +134,4 @@ document_store.write_documents(documents)
result = pipeline.run(data={"retriever": {"query":"How many languages are there?"}})
print(result['retriever']['documents'][0])
```
```

View File

@ -11,8 +11,8 @@ Use this Retriever with the InMemoryDocumentStore if you're looking for embeddin
| | |
| --- | --- |
| **Most common position in a pipeline** | In query pipelines: <br />In a RAG pipeline, before a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) <br />In a semantic search pipeline, as the last component <br />In an extractive QA pipeline, after a Tex tEmbedder and before an [`ExtractiveReader`](../../../docs/pipeline-components/readers/extractivereader.mdx) |
| **Mandatory init variables** | "document_store": An instance of [InMemoryDocumentStore](../../../docs/document-stores/inmemorydocumentstore.mdx) |
| **Most common position in a pipeline** | In query pipelines: <br />In a RAG pipeline, before a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) <br />In a semantic search pipeline, as the last component <br />In an extractive QA pipeline, after a Text Embedder and before an [`ExtractiveReader`](/docs/pipeline-components/readers/extractivereader.mdx) |
| **Mandatory init variables** | "document_store": An instance of [InMemoryDocumentStore](/docs/document-stores/inmemorydocumentstore.mdx) |
| **Mandatory run variables** | "query_embedding": A list of floating point numbers |
| **Output variables** | "documents": A list of documents |
| **API reference** | [Retrievers](/reference/retrievers-api) |
@ -20,7 +20,7 @@ Use this Retriever with the InMemoryDocumentStore if you're looking for embeddin
## Overview
The `InMemoryEmbeddingRetriever` is an embedding-based Retriever compatible with the `InMemoryDocumentStore`. It compares the query and Document embeddings and fetches the Documents most relevant to the query from the `InMemoryDocumentStore` based on the outcome.
The `InMemoryEmbeddingRetriever` is an embedding-based Retriever compatible with the `InMemoryDocumentStore`. It compares the query and Document embeddings and fetches the Documents most relevant to the query from the `InMemoryDocumentStore` based on the outcome.
When using the `InMemoryEmbeddingRetriever` in your NLP system, make sure it has the query and Document embeddings available. You can do so by adding a DocumentEmbedder to your indexing pipeline and a Text Embedder to your query pipeline. For details, see [Embedders](../embedders.mdx).
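A brief sketch of the indexing side that prepares the Document embeddings, with an illustrative model choice:

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()

docs = [Document(content="There are over 7,000 languages spoken around the world today.")]
docs_with_embeddings = doc_embedder.run(documents=docs)["documents"]
document_store.write_documents(docs_with_embeddings)
```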
@ -62,4 +62,4 @@ query = "How many languages are there?"
result = query_pipeline.run({"text_embedder": {"text": query}})
print(result['retriever']['documents'][0])
```
```

View File

@ -13,8 +13,8 @@ A Hybrid Retriever uses both traditional keyword-based search (such as BM25) and
| | |
| --- | --- |
| Most common position in a pipeline | After an [OpenSearchDocumentStore](../../../docs/document-stores/opensearch-document-store.mdx) |
| Mandatory init variables | "document_store:: An instance of `OpenSearchDocumentStore` to use for retrieval <br /> <br />"embedder": Any [Embedder](../../../docs/pipeline-components/embedders.mdx) implementing the `TextEmbedder` protocol |
| Most common position in a pipeline | After an [OpenSearchDocumentStore](/docs/document-stores/opensearch-document-store.mdx) |
| Mandatory init variables | "document_store:: An instance of `OpenSearchDocumentStore` to use for retrieval <br /> <br />"embedder": Any [Embedder](/docs/pipeline-components/embedders.mdx) implementing the `TextEmbedder` protocol |
| Mandatory run variables | "query": A query string |
| Output variables | "documents": A list of documents matching the query |
| API reference | [OpenSearch](/reference/integrations-opensearch) |

View File

@ -11,8 +11,8 @@ An embedding-based Retriever compatible with the Qdrant Document Store.
| | |
| --- | --- |
| **Most common position in a pipeline** | 1\. After a Text Embedder and before a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) in a RAG Pipeline <br /> <br />2. The last component in the semantic search pipeline <br />3. After a Text Embedder and before an [`ExtractiveReader`](../../../docs/pipeline-components/readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | "document_store": An instance of a [QdrantDocumentStore](../../../docs/document-stores/qdrant-document-store.mdx) |
| **Most common position in a pipeline** | 1\. After a Text Embedder and before a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) in a RAG Pipeline <br /> <br />2. The last component in the semantic search pipeline <br />3. After a Text Embedder and before an [`ExtractiveReader`](/docs/pipeline-components/readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | "document_store": An instance of a [QdrantDocumentStore](/docs/document-stores/qdrant-document-store.mdx) |
| **Mandatory run variables** | “query_embedding”: A vector representing the query (a list of floats) |
| **Output variables** | “documents”: A list of documents |
| **API reference** | [Qdrant](/reference/integrations-qdrant) |
@ -80,7 +80,7 @@ documents = [Document(content="There are over 7,000 languages spoken around the
Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.")]
document_embedder = SentenceTransformersDocumentEmbedder()
document_embedder = SentenceTransformersDocumentEmbedder()
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)
@ -96,4 +96,4 @@ query = "How many languages are there?"
result = query_pipeline.run({"text_embedder": {"text": query}})
print(result['retriever']['documents'][0])
```
```

View File

@ -11,9 +11,9 @@ A Retriever based both on dense and sparse embeddings, compatible with the Qdran
| | |
| --- | --- |
| **Most common position in a pipeline** | 1\. After a Text Embedder and before a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) in a RAG pipeline <br /> <br />2. The last component in a hybrid search pipeline <br /> 3. After a Text Embedder and before an [`ExtractiveReader`](../../../docs/pipeline-components/readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | "document_store": An instance of a [QdrantDocumentStore](../../../docs/document-stores/qdrant-document-store.mdx) |
| **Mandatory run variables** | “query_embedding”: A dense vector representing the query (a list of floats) <br /> <br />“query_sparse_embedding”: A [`SparseEmbedding`](../../../docs/concepts/data-classes.mdx#sparseembedding) object containing a vectorial representation of the query |
| **Most common position in a pipeline** | 1\. After a Text Embedder and before a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) in a RAG pipeline <br /> <br />2. The last component in a hybrid search pipeline <br /> 3. After a Text Embedder and before an [`ExtractiveReader`](/docs/pipeline-components/readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | "document_store": An instance of a [QdrantDocumentStore](/docs/document-stores/qdrant-document-store.mdx) |
| **Mandatory run variables** | “query_embedding”: A dense vector representing the query (a list of floats) <br /> <br />“query_sparse_embedding”: A [`SparseEmbedding`](/docs/concepts/data-classes.mdx#sparseembedding) object containing a vectorial representation of the query |
| **Output variables** | “documents”: A list of documents |
| **API reference** | [Qdrant](/reference/integrations-qdrant) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/qdrant |
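
As a rough sketch of the setup this table describes, the snippet below wires a dense and a sparse Fastembed Text Embedder into the hybrid retriever. The in-memory store settings and the default Fastembed models are assumptions, the store is assumed to already hold documents indexed with both embedding types, and the Fastembed integration package from the install step further below is required.

```python
from haystack import Pipeline
from haystack_integrations.components.embedders.fastembed import (
    FastembedSparseTextEmbedder,
    FastembedTextEmbedder,
)
from haystack_integrations.components.retrievers.qdrant import QdrantHybridRetriever
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Hybrid retrieval needs a store created with use_sparse_embeddings=True;
# embedding_dim matches the default Fastembed dense model used here.
document_store = QdrantDocumentStore(
    ":memory:",
    use_sparse_embeddings=True,
    embedding_dim=384,
    recreate_index=True,
)

query_pipeline = Pipeline()
query_pipeline.add_component("dense_embedder", FastembedTextEmbedder())
query_pipeline.add_component("sparse_embedder", FastembedSparseTextEmbedder())
query_pipeline.add_component("retriever", QdrantHybridRetriever(document_store=document_store))
query_pipeline.connect("dense_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("sparse_embedder.sparse_embedding", "retriever.query_sparse_embedding")

query = "How many languages are there?"
result = query_pipeline.run(
    {"dense_embedder": {"text": query}, "sparse_embedder": {"text": query}}
)
print(result["retriever"]["documents"])
```
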
@ -86,7 +86,7 @@ retriever.run(query_embedding=embedding, query_sparse_embedding=sparse_embedding
### In a pipeline
Currently, you can compute sparse embeddings using Fastembed Sparse Embedders.
Currently, you can compute sparse embeddings using Fastembed Sparse Embedders.
First, install the package with:
```shell
@ -106,8 +106,8 @@ from haystack_integrations.components.embedders.fastembed import (
FastembedDocumentEmbedder,
FastembedSparseTextEmbedder,
FastembedSparseDocumentEmbedder
)
)
document_store = QdrantDocumentStore(
":memory:",
recreate_index=True,
@ -159,4 +159,4 @@ print(result["retriever"]["documents"][0])
:notebook: Tutorial: [Creating a Hybrid Retrieval Pipeline](https://haystack.deepset.ai/tutorials/33_hybrid_retrieval)
:cook: Cookbook: [Sparse Embedding Retrieval with Qdrant and FastEmbed](https://haystack.deepset.ai/cookbook/sparse_embedding_retrieval)
:cook: Cookbook: [Sparse Embedding Retrieval with Qdrant and FastEmbed](https://haystack.deepset.ai/cookbook/sparse_embedding_retrieval)

View File

@ -11,9 +11,9 @@ A Retriever based on sparse embeddings, compatible with the Qdrant Document Stor
| | |
| --- | --- |
| **Most common position in a pipeline** | 1\. After a Text Embedder and before a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) in a RAG pipeline <br /> <br />2. The last component in the semantic search pipeline <br /> 3. After a Text Embedder and before an [`ExtractiveReader`](../../../docs/pipeline-components/readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | "document_store": An instance of a [QdrantDocumentStore](../../../docs/document-stores/qdrant-document-store.mdx) |
| **Mandatory run variables** | “query_sparse_embedding”: A [`SparseEmbedding`](../../../docs/concepts/data-classes.mdx#sparseembedding) object containing a vectorial representation of the query |
| **Most common position in a pipeline** | 1\. After a Text Embedder and before a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) in a RAG pipeline <br /> <br />2. The last component in the semantic search pipeline <br /> 3. After a Text Embedder and before an [`ExtractiveReader`](/docs/pipeline-components/readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | "document_store": An instance of a [QdrantDocumentStore](/docs/document-stores/qdrant-document-store.mdx) |
| **Mandatory run variables** | “query_sparse_embedding”: A [`SparseEmbedding`](/docs/concepts/data-classes.mdx#sparseembedding) object containing a vectorial representation of the query |
| **Output variables** | “documents”: A list of documents |
| **API reference** | [Qdrant](/reference/integrations-qdrant) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/qdrant |
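
A minimal standalone sketch of the usage this table describes, assuming a store created with sparse embeddings enabled; the hand-built `SparseEmbedding` only stands in for the output of a sparse Text Embedder such as `FastembedSparseTextEmbedder`.

```python
from haystack.dataclasses import SparseEmbedding
from haystack_integrations.components.retrievers.qdrant import QdrantSparseEmbeddingRetriever
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Sparse retrieval requires a store created with use_sparse_embeddings=True.
document_store = QdrantDocumentStore(":memory:", use_sparse_embeddings=True, recreate_index=True)
retriever = QdrantSparseEmbeddingRetriever(document_store=document_store)

# In a real pipeline the sparse query embedding comes from a sparse Text Embedder;
# the hand-built values below are for illustration only.
sparse_query = SparseEmbedding(indices=[0, 7, 42], values=[0.4, 0.9, 0.1])
result = retriever.run(query_sparse_embedding=sparse_query)
print(result["documents"])
```
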
@ -131,4 +131,4 @@ print(result["sparse_retriever"]["documents"][0]) # noqa: T201
## Additional References
:cook: Cookbook: [Sparse Embedding Retrieval with Qdrant and FastEmbed](https://haystack.deepset.ai/cookbook/sparse_embedding_retrieval)
:cook: Cookbook: [Sparse Embedding Retrieval with Qdrant and FastEmbed](https://haystack.deepset.ai/cookbook/sparse_embedding_retrieval)

View File

@ -11,7 +11,7 @@ Connects to a Snowflake database to execute an SQL query.
| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) |
| **Most common position in a pipeline** | Before a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) |
| **Mandatory init variables** | “user”: User's login <br /> <br />”account”: Snowflake account identifier <br /> <br />”api_key”: Snowflake account password. Can be set with `SNOWFLAKE_API_KEY` env var |
| **Mandatory run variables** | “query”: An SQL query to execute |
| **Output variables** | “dataframe”: The resulting Pandas dataframe version of the table |
@ -73,4 +73,4 @@ pipeline.connect("builder", "llm")
pipeline.run(data={"query": "select employee, salary from table limit 10;"})
```
```
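
For orientation, a minimal standalone sketch of the component described in the table above; the user, account, and fully qualified table name are placeholders, and the password is read from the `SNOWFLAKE_API_KEY` environment variable.

```python
from haystack.utils import Secret
from haystack_integrations.components.retrievers.snowflake import SnowflakeTableRetriever

# Placeholder credentials; the account identifier depends on your Snowflake setup.
retriever = SnowflakeTableRetriever(
    user="john_doe",
    account="ORG-ACCOUNT",
    api_key=Secret.from_env_var("SNOWFLAKE_API_KEY"),
)

# The fully qualified table name is hypothetical.
result = retriever.run(query="SELECT employee, salary FROM SALES_DB.PUBLIC.PAYROLL LIMIT 10")
print(result["dataframe"].head())
```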

View File

@ -11,8 +11,8 @@ A Retriever that combines BM25 keyword search and vector similarity to fetch doc
| | |
| --- | --- |
| **Most common position in a pipeline** | 1\. After a Text Embedder and before a [`PromptBuilder`](../../../docs/pipeline-components/builders/promptbuilder.mdx) in a RAG pipeline <br /> <br />2. The last component in a hybrid search pipeline <br />3. After a Text Embedder and before an [`ExtractiveReader`](../../../docs/pipeline-components/readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | "document_store": An instance of a [WeaviateDocumentStore](../../../docs/document-stores/weaviatedocumentstore.mdx) |
| **Most common position in a pipeline** | 1\. After a Text Embedder and before a [`PromptBuilder`](/docs/pipeline-components/builders/promptbuilder.mdx) in a RAG pipeline <br /> <br />2. The last component in a hybrid search pipeline <br />3. After a Text Embedder and before an [`ExtractiveReader`](/docs/pipeline-components/readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | "document_store": An instance of a [WeaviateDocumentStore](/docs/document-stores/weaviatedocumentstore.mdx) |
| **Mandatory run variables** | "query": A string <br /> <br />"query_embedding": A list of floats |
| **Output variables** | "documents": A list of documents (matching the query) |
| **API reference** | [Weaviate](/reference/integrations-weaviate) |
@ -154,4 +154,4 @@ result = retriever_balanced.run(
query_embedding=embedding,
alpha=0.8
)
```
```

View File

@ -11,10 +11,10 @@ Use this Router in pipelines to route documents based on their MIME types to dif
| | |
| --- | --- |
| **Most common position in a pipeline** | As a preprocessing component to route documents by type before sending them to specific [Converters](../../../docs/pipeline-components/converters.mdx) or [Preprocessors](../../../docs/pipeline-components/preprocessors.mdx) |
| **Most common position in a pipeline** | As a preprocessing component to route documents by type before sending them to specific [Converters](/docs/pipeline-components/converters.mdx) or [Preprocessors](/docs/pipeline-components/preprocessors.mdx) |
| **Mandatory init variables** | "mime_types": A list of MIME types or regex patterns for classification |
| **Mandatory run variables** | "documents": A list of [Documents](../../../docs/concepts/data-classes.mdx#document) to categorize |
| **Output variables** | "unclassified": A list of uncategorized [Documents](../../../docs/concepts/data-classes.mdx#document) <br /> <br />"mime_types": For example "text/plain", "application/pdf", "image/jpeg": List of categorized [Documents](../../../docs/concepts/data-classes.mdx#document) |
| **Mandatory run variables** | "documents": A list of [Documents](/docs/concepts/data-classes.mdx#document) to categorize |
| **Output variables** | "unclassified": A list of uncategorized [Documents](/docs/concepts/data-classes.mdx#document) <br /> <br />"mime_types": For example "text/plain", "application/pdf", "image/jpeg": List of categorized [Documents](/docs/concepts/data-classes.mdx#document) |
| **API reference** | [Routers](/reference/routers-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/routers/document_type_router.py |
@ -160,4 +160,4 @@ result = p.run({"document_type_router": {"documents": docs}})
## - Text documents (text/plain) → DocumentSplitter → DocumentWriter
## - PDF documents (application/pdf) → DocumentWriter (direct)
## - Other documents → unclassified output
```
```

View File

@ -11,10 +11,10 @@ Use this Router in indexing pipelines to route file paths or byte streams based
| | |
| --- | --- |
| **Most common position in a pipeline** | As the first component preprocessing data followed by [Converters](../../../docs/pipeline-components/converters.mdx) |
| **Most common position in a pipeline** | As the first component preprocessing data followed by [Converters](/docs/pipeline-components/converters.mdx) |
| **Mandatory init variables** | "mime_types": A list of MIME types or regex patterns for classification |
| **Mandatory run variables** | "sources": A list of file paths or byte streams to categorize |
| **Output variables** | "unclassified": A list of uncategorized file paths or [byte streams](../../../docs/concepts/data-classes.mdx#bytestream) <br /> <br />”mime_types”: For example “"text/plain", "text/html", "application/pdf", "text/markdown", "audio/x-wav", "image/jpeg”: List of categorized file paths or byte streams |
| **Output variables** | "unclassified": A list of uncategorized file paths or [byte streams](/docs/concepts/data-classes.mdx#bytestream) <br /> <br />”mime_types”: For example “"text/plain", "text/html", "application/pdf", "text/markdown", "audio/x-wav", "image/jpeg”: List of categorized file paths or byte streams |
| **API reference** | [Routers](/reference/routers-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/routers/file_type_router.py |
@ -28,7 +28,7 @@ When initializing the component, you specify the set of MIME types to route to s
### On its own
Below is an example that uses the `FileTypeRouter` to route two simple documents:
Below is an example that uses the `FileTypeRouter` to route two simple documents:
```python
from haystack import Document
@ -60,4 +60,4 @@ p.connect("file_type_router.text/plain", "text_file_converter.sources")
p.connect("text_file_converter.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")
p.run({"file_type_router": {"sources":["text-file-will-be-added.txt", "pdf-will-not-be-added.pdf"]}})
```
```
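
A compact standalone sketch of the routing behaviour described above; the file names are placeholders and the paths are classified by the MIME type inferred from their extension.

```python
from haystack.components.routers import FileTypeRouter

# Only these two types get dedicated outputs; everything else lands in "unclassified".
router = FileTypeRouter(mime_types=["text/plain", "application/pdf"])

# Placeholder paths for illustration.
result = router.run(sources=["notes.txt", "report.pdf", "photo.jpg"])

# "notes.txt" appears under result["text/plain"], "report.pdf" under
# result["application/pdf"], and "photo.jpg" under result["unclassified"]
# because image/jpeg was not listed at init time.
print(result)
```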

View File

@ -11,7 +11,7 @@ Use this component to route documents or byte streams to different output connec
| | |
| --- | --- |
| **Most common position in a pipeline** | After components that classify documents, such as [`DocumentLanguageClassifier`](../../../docs/pipeline-components/classifiers/documentlanguageclassifier.mdx) |
| **Most common position in a pipeline** | After components that classify documents, such as [`DocumentLanguageClassifier`](/docs/pipeline-components/classifiers/documentlanguageclassifier.mdx) |
| **Mandatory init variables** | "rules": A dictionary with metadata routing rules (see our API Reference for examples) |
| **Mandatory run variables** | “documents”: A list of documents or byte streams |
| **Output variables** | “unmatched”: A list of documents or byte streams not matching any rule <br /> <br />“_name_of_the_rule_”: A list of documents or byte streams matching custom rules. There's one output per rule you define. Each of these outputs is a list of documents or byte streams. |
@ -87,4 +87,4 @@ p.connect("text_file_converter.documents", "language_classifier.documents")
p.connect("language_classifier.documents", "router.documents")
p.connect("router.en", "writer.documents")
p.run({"text_file_converter": {"sources": ["english-file-will-be-added.txt", "german-file-will-not-be-added.txt"]}})
```
```
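
A minimal standalone sketch of the rule format described above, with hypothetical language metadata; each rule is a document filter, and documents matching no rule come back under "unmatched".

```python
from haystack import Document
from haystack.components.routers import MetadataRouter

# One output per rule; the rule values are standard document filters.
router = MetadataRouter(
    rules={
        "en": {"field": "meta.language", "operator": "==", "value": "en"},
        "de": {"field": "meta.language", "operator": "==", "value": "de"},
    }
)

docs = [
    Document(content="Hello world", meta={"language": "en"}),
    Document(content="Hallo Welt", meta={"language": "de"}),
    Document(content="Bonjour le monde", meta={"language": "fr"}),
]

result = router.run(documents=docs)
# The English and German documents land under result["en"] and result["de"];
# the French one is returned under result["unmatched"].
print(result)
```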

View File

@ -11,7 +11,7 @@ Use this component in pipelines to route a query based on its language.
| | |
| --- | --- |
| **Most common position in a pipeline** | As the first component to route a query to different [Retrievers](../../../docs/pipeline-components/retrievers.mdx), based on its language |
| **Most common position in a pipeline** | As the first component to route a query to different [Retrievers](/docs/pipeline-components/retrievers.mdx), based on its language |
| **Mandatory init variables** | "languages": A list of ISO language codes |
| **Mandatory run variables** | “text”: A string |
| **Output variables** | “unmatched”: A string <br /> <br />“_language defined during initialization_”: A string. For example: "fr": French language string. |
@ -55,4 +55,4 @@ p.add_component(instance=TextLanguageRouter(), name="text_language_router")
p.add_component(instance=InMemoryBM25Retriever(document_store=document_store), name="retriever")
p.connect("text_language_router.en", "retriever.query")
p.run({"text_language_router": {"text": "What's your query?"}})
```
```
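
A minimal standalone sketch of the routing behaviour described above, using a German query as the example input.

```python
from haystack.components.routers import TextLanguageRouter

# Only queries detected as English or German get dedicated outputs.
router = TextLanguageRouter(languages=["en", "de"])

result = router.run(text="Wie ist das Wetter heute?")
# The query should be detected as German, so it is returned under the "de" key:
# {"de": "Wie ist das Wetter heute?"}
# A query in a language outside the configured list comes back under "unmatched".
print(result)
```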

View File

@ -11,8 +11,8 @@ Use this component to ensure that an LLM-generated chat message JSON adheres to
| | |
| --- | --- |
| **Most common position in a pipeline** | After a [Generator](../../../docs/pipeline-components/generators.mdx) |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](../../../docs/concepts/data-classes.mdx#chatmessage) instances to be validated; the last message in this list is the one that is validated |
| **Most common position in a pipeline** | After a [Generator](/docs/pipeline-components/generators.mdx) |
| **Mandatory run variables** | “messages”: A list of [`ChatMessage`](/docs/concepts/data-classes.mdx#chatmessage) instances to be validated; the last message in this list is the one that is validated |
| **Output variables** | “validated”: A list of messages if the last message is valid <br /> <br />”validation_error”: A list of messages if the last message is invalid |
| **API reference** | [Validators](/reference/validators-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/validators/json_schema.py |
@ -64,11 +64,11 @@ result = p.run(
"age": {"type": "integer"}}}}})
print(result)
>> {'schema_validator': {'validated': [ChatMessage(_role=<ChatRole.ASSISTANT:
>> 'assistant'>, _content=[TextContent(text='\n{\n "name": "John",\n "age": 30\n}')],
>> _name=None, _meta={'model': 'gpt-4-1106-preview', 'index': 0, 'finish_reason': 'stop',
>> 'usage': {'completion_tokens': 17, 'prompt_tokens': 20, 'total_tokens': 37,
>> 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0,
>> 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details':
>> {'schema_validator': {'validated': [ChatMessage(_role=<ChatRole.ASSISTANT:
>> 'assistant'>, _content=[TextContent(text='\n{\n "name": "John",\n "age": 30\n}')],
>> _name=None, _meta={'model': 'gpt-4-1106-preview', 'index': 0, 'finish_reason': 'stop',
>> 'usage': {'completion_tokens': 17, 'prompt_tokens': 20, 'total_tokens': 37,
>> 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0,
>> 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details':
>> {'audio_tokens': 0, 'cached_tokens': 0}}})]}}
```
```
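
A compact standalone sketch of the validation flow shown above, without the Generator; the schema and the assistant reply are illustrative.

```python
from haystack.components.validators import JsonSchemaValidator
from haystack.dataclasses import ChatMessage

# Hypothetical schema for the expected LLM output.
person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

validator = JsonSchemaValidator(json_schema=person_schema)
result = validator.run(messages=[ChatMessage.from_assistant('{"name": "John", "age": 30}')])

# The last message parses and matches the schema, so it is returned under
# result["validated"]; an invalid message would be returned under
# result["validation_error"] instead.
print(result)
```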

View File

@ -11,7 +11,7 @@ Search engine using Search API.
| | |
| --- | --- |
| **Most common position in a pipeline** | Before [`LinkContentFetcher`](../../../docs/pipeline-components/fetchers/linkcontentfetcher.mdx) or [Converters](../../../docs/pipeline-components/converters.mdx) |
| **Most common position in a pipeline** | Before [`LinkContentFetcher`](/docs/pipeline-components/fetchers/linkcontentfetcher.mdx) or [Converters](/docs/pipeline-components/converters.mdx) |
| **Mandatory init variables** | "api_key": The SearchAPI API key. Can be set with `SEARCHAPI_API_KEY` env var. |
| **Mandatory run variables** | “query”: A string with your query |
| **Output variables** | “documents”: A list of documents <br /> <br />”links”: A list of strings of resulting links |
@ -92,4 +92,4 @@ pipe.connect("prompt_builder.messages", "llm.messages")
query = "What is the most famous landmark in Berlin?"
pipe.run(data={"search": {"query": query}, "prompt_builder": {"query": query}})
```
```
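
A minimal standalone sketch of the component described above; it assumes a SearchAPI account and the `SEARCHAPI_API_KEY` environment variable being set.

```python
from haystack.components.websearch import SearchApiWebSearch
from haystack.utils import Secret

# The API key is read from the SEARCHAPI_API_KEY environment variable.
websearch = SearchApiWebSearch(api_key=Secret.from_env_var("SEARCHAPI_API_KEY"))

result = websearch.run(query="What is the most famous landmark in Berlin?")
print(result["documents"][0].content)
print(result["links"][:3])
```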

View File

@ -11,7 +11,7 @@ Search engine using SerperDev API.
| | |
| --- | --- |
| **Most common position in a pipeline** | Before [`LinkContentFetcher`](../../../docs/pipeline-components/fetchers/linkcontentfetcher.mdx) or [Converters](../../../docs/pipeline-components/converters.mdx) |
| **Most common position in a pipeline** | Before [`LinkContentFetcher`](/docs/pipeline-components/fetchers/linkcontentfetcher.mdx) or [Converters](/docs/pipeline-components/converters.mdx) |
| **Mandatory init variables** | "api_key": The SearchAPI API key. Can be set with `SERPERDEV_API_KEY` env var. |
| **Mandatory run variables** | “query”: A string with your query |
| **Output variables** | “documents”: A list of documents <br /> <br />”links”: A list of strings of resulting links |
@ -98,4 +98,4 @@ pipe.run(data={"search": {"query": query}, "prompt_builder": {"query": query}})
## Additional References
:notebook: Tutorial: [Building Fallbacks to Websearch with Conditional Routing](https://haystack.deepset.ai/tutorials/36_building_fallbacks_with_conditional_routing)
:notebook: Tutorial: [Building Fallbacks to Websearch with Conditional Routing](https://haystack.deepset.ai/tutorials/36_building_fallbacks_with_conditional_routing)