mirror of
https://github.com/deepset-ai/haystack.git
synced 2026-01-07 20:46:31 +00:00
---
title: "WatsonxDocumentEmbedder"
id: watsonxdocumentembedder
slug: "/watsonxdocumentembedder"
description: "The vectors computed by this component are necessary to perform embedding retrieval on a collection of documents. At retrieval time, the vector that represents the query is compared with those of the documents to find the most similar or relevant documents."
---

# WatsonxDocumentEmbedder

This component computes embeddings for a list of documents and stores each vector in the document's `embedding` field. These vectors are necessary to perform embedding retrieval on a collection of documents. At retrieval time, the vector that represents the query is compared with those of the documents to find the most similar or relevant documents.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory init variables** | `api_key`: The IBM Cloud API key. Can be set with `WATSONX_API_KEY` env var. <br /> <br />`project_id`: The IBM Cloud project ID. Can be set with `WATSONX_PROJECT_ID` env var. |
| **Mandatory run variables** | `documents`: A list of documents to be embedded |
| **Output variables** | `documents`: A list of documents (enriched with embeddings) <br /> <br />`meta`: A dictionary of metadata strings |
| **API reference** | [Watsonx](/reference/integrations-watsonx) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/watsonx |

</div>

## Overview

`WatsonxDocumentEmbedder` enriches each document with an embedding of its content, stored in the document's `embedding` field. To embed a string, such as a query, use the [`WatsonxTextEmbedder`](watsonxtextembedder.mdx) instead.

The component supports IBM watsonx.ai embedding models. The default model is `ibm/slate-30m-english-rtrvr`. The list of all supported models can be found in IBM's [model documentation](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models-embed.html?context=wx).

To start using this integration with Haystack, install it with:

```shell
pip install watsonx-haystack
```

By default, the component reads its credentials from the `WATSONX_API_KEY` and `WATSONX_PROJECT_ID` environment variables. Alternatively, you can pass them at initialization with `api_key` and `project_id`:

```python
from haystack.utils import Secret
from haystack_integrations.components.embedders.watsonx.document_embedder import WatsonxDocumentEmbedder

# Replace the placeholders with your own credentials
embedder = WatsonxDocumentEmbedder(
    api_key=Secret.from_token("<your-api-key>"),
    project_id=Secret.from_token("<your-project-id>"),
)
```

To get IBM Cloud credentials, head over to https://cloud.ibm.com/.

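Since the component falls back to environment variables, it can help to check that both are set before building the embedder. The following is a minimal sketch using only the standard library. The helper name `missing_watsonx_env` is hypothetical and not part of the integration; only the two variable names come from this page.

```python
import os

# Variable names as documented on this page
REQUIRED_ENV_VARS = ("WATSONX_API_KEY", "WATSONX_PROJECT_ID")


def missing_watsonx_env(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; not part of watsonx-haystack.
    """
    # Default to the real process environment; a dict can be passed for testing
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

Calling `missing_watsonx_env()` returns an empty list when both variables are present, and otherwise the names that still need to be set.
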
### Embedding Metadata

Text documents often come with a set of metadata. If the metadata is distinctive and semantically meaningful, you can embed it along with the text of the document to improve retrieval.

You can do this by listing the relevant fields in the `meta_fields_to_embed` parameter of the Document Embedder:

```python
from haystack import Document
from haystack_integrations.components.embedders.watsonx.document_embedder import WatsonxDocumentEmbedder
from haystack.utils import Secret

doc = Document(content="some text", meta={"title": "relevant title", "page number": 18})

embedder = WatsonxDocumentEmbedder(
    api_key=Secret.from_env_var("WATSONX_API_KEY"),
    project_id=Secret.from_env_var("WATSONX_PROJECT_ID"),
    meta_fields_to_embed=["title"],  # embed the title together with the content
)

docs_w_embeddings = embedder.run(documents=[doc])["documents"]
```

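To make the effect of `meta_fields_to_embed` more concrete, the sketch below shows one way the text sent to the embedding model could be assembled: the values of the selected metadata fields are placed before the document content and joined with a separator. This is a plain-Python illustration, not the integration's code. The helper `build_text_to_embed` and the newline separator are assumptions made for this example.

```python
def build_text_to_embed(content, meta, meta_fields_to_embed, separator="\n"):
    """Join selected metadata values and the content into one string.

    Hypothetical helper: the function name and the default separator are
    assumptions made for this illustration.
    """
    # Keep only the requested fields that are present and not None
    meta_values = [
        str(meta[field])
        for field in meta_fields_to_embed
        if meta.get(field) is not None
    ]
    return separator.join(meta_values + [content or ""])


text = build_text_to_embed(
    content="some text",
    meta={"title": "relevant title", "page number": 18},
    meta_fields_to_embed=["title"],
)
print(repr(text))  # 'relevant title\nsome text'
```

Here only `title` is combined with the content. The `page number` field is skipped because it is not listed.
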
## Usage

Install the `watsonx-haystack` package to use the `WatsonxDocumentEmbedder`:

```shell
pip install watsonx-haystack
```

### On its own

Remember to set `WATSONX_API_KEY` and `WATSONX_PROJECT_ID` as environment variables first, or pass the credentials directly at initialization.

Here is how you can use the component on its own:

```python
from haystack import Document
from haystack_integrations.components.embedders.watsonx.document_embedder import WatsonxDocumentEmbedder

doc = Document(content="I love pizza!")

# Reads WATSONX_API_KEY and WATSONX_PROJECT_ID from the environment
embedder = WatsonxDocumentEmbedder()

result = embedder.run([doc])
print(result["documents"][0].embedding)
# [-0.453125, 1.2236328, 2.0058594, 0.67871094...]
```

### In a pipeline

```python
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

from haystack_integrations.components.embedders.watsonx.document_embedder import WatsonxDocumentEmbedder
from haystack_integrations.components.embedders.watsonx.text_embedder import WatsonxTextEmbedder

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

documents = [
    Document(content="My name is Wolfgang and I live in Berlin"),
    Document(content="I saw a black horse running"),
    Document(content="Germany has many big cities"),
]

# Indexing: embed the documents, then write them to the document store
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("embedder", WatsonxDocumentEmbedder())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("embedder", "writer")

indexing_pipeline.run({"embedder": {"documents": documents}})

# Querying: embed the query, then retrieve the most similar documents
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", WatsonxTextEmbedder())
query_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "Who lives in Berlin?"

result = query_pipeline.run({"text_embedder": {"text": query}})

print(result["retriever"]["documents"][0])
# Document(id=..., text: 'My name is Wolfgang and I live in Berlin')
```

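The document store in the example above is configured with `embedding_similarity_function="cosine"`. As a rough sketch of what that comparison means, the following plain-Python code ranks documents by cosine similarity to a query vector. The three-dimensional vectors are made up for illustration, since real embeddings have far more dimensions, and the code is not the retriever's actual implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real document and query embeddings
doc_embeddings = {
    "My name is Wolfgang and I live in Berlin": [0.9, 0.1, 0.2],
    "I saw a black horse running": [0.1, 0.9, 0.1],
    "Germany has many big cities": [0.6, 0.2, 0.7],
}
query_embedding = [0.8, 0.0, 0.3]

# Sort document texts from most to least similar to the query
ranked = sorted(
    doc_embeddings,
    key=lambda text: cosine_similarity(query_embedding, doc_embeddings[text]),
    reverse=True,
)
print(ranked[0])  # My name is Wolfgang and I live in Berlin
```

With these toy values, the first document has the highest similarity to the query and the second the lowest, which mirrors the ranking idea behind embedding retrieval.
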