mirror of
https://github.com/deepset-ai/haystack.git
synced 2026-01-31 12:04:17 +00:00
157 lines
6.2 KiB
Plaintext
---
title: "GoogleGenAIDocumentEmbedder"
id: googlegenaidocumentembedder
slug: "/googlegenaidocumentembedder"
description: "The vectors computed by this component are necessary to perform embedding retrieval on a collection of documents. At retrieval time, the vector representing the query is compared with those of the documents to find the most similar or relevant documents."
---

# GoogleGenAIDocumentEmbedder

The vectors computed by this component are necessary to perform embedding retrieval on a collection of documents. At retrieval time, the vector representing the query is compared with those of the documents to find the most similar or relevant documents.

| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [DocumentWriter](../writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory init variables** | "api_key": The Google API key. Can be set with `GOOGLE_API_KEY` or `GEMINI_API_KEY` env var. |
| **Mandatory run variables** | "documents": A list of documents to be embedded |
| **Output variables** | "documents": A list of documents (enriched with embeddings) <br /> <br />"meta": A dictionary of metadata |
| **API reference** | [Google AI](/reference/integrations-google-genai) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/google_genai |
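
To make the retrieval-time comparison described in the introduction more concrete, here is a minimal sketch of ranking documents by cosine similarity. The vectors are tiny made-up values, not real model output, and `cosine_similarity` is an illustrative helper rather than part of Haystack. In a real pipeline, a retriever such as `InMemoryEmbeddingRetriever` performs this comparison for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real embeddings.
document_embeddings = {
    "doc_about_berlin": [0.9, 0.1, 0.0],
    "doc_about_horses": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]

# Rank document names from most to least similar to the query.
ranked = sorted(
    document_embeddings,
    key=lambda name: cosine_similarity(query_embedding, document_embeddings[name]),
    reverse=True,
)
print(ranked[0])
```

With these toy values the query vector points in nearly the same direction as the first document vector, so that document is ranked first.
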

## Overview

`GoogleGenAIDocumentEmbedder` computes an embedding of each document's content and stores it in the document's `embedding` field. To embed a string, use the [`GoogleGenAITextEmbedder`](/docs/googlegenaitextembedder) instead.

The component supports the following Google AI models:

- `text-embedding-004` (default)
- `text-embedding-004-v2`

To start using this integration with Haystack, install it with:

```shell
pip install google-genai-haystack
```

### Authentication

Google Gen AI is compatible with both the Gemini Developer API and the Vertex AI API.

To use this component with the Gemini Developer API and get an API key, visit [Google AI Studio](https://aistudio.google.com/).
To use this component with the Vertex AI API, visit [Google Cloud > Vertex AI](https://cloud.google.com/vertex-ai).

By default, the component reads the API key from the `GOOGLE_API_KEY` or `GEMINI_API_KEY` environment variable. Alternatively, you can pass an API key at initialization as a [Secret](/docs/secret-management), using the `Secret.from_token` static method:

```python
embedder = GoogleGenAIDocumentEmbedder(api_key=Secret.from_token("<your-api-key>"))
```
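
As a rough illustration of the environment-variable fallback described above, the sketch below checks the two variable names in turn. `resolve_api_key` is a hypothetical helper written for this page, and the lookup order (`GOOGLE_API_KEY` first) is an assumption, not a statement about the component's internals.

```python
import os


def resolve_api_key(env=None):
    """Return the first non-empty key found, or None if neither variable is set.

    Hypothetical helper; the lookup order is an assumption for illustration.
    """
    env = os.environ if env is None else env
    for name in ("GOOGLE_API_KEY", "GEMINI_API_KEY"):
        value = env.get(name)
        if value:
            return value
    return None


# Passing a plain dict keeps the example independent of the real environment.
print(resolve_api_key({"GEMINI_API_KEY": "<your-api-key>"}))
```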

The following examples show how to use the component with the Gemini Developer API and the Vertex AI API.

#### Gemini Developer API (API Key Authentication)

```python
from haystack_integrations.components.embedders.google_genai import GoogleGenAIDocumentEmbedder

# Set the GOOGLE_API_KEY or GEMINI_API_KEY environment variable before running
document_embedder = GoogleGenAIDocumentEmbedder()
```

#### Vertex AI (Application Default Credentials)

```python
from haystack_integrations.components.embedders.google_genai import GoogleGenAIDocumentEmbedder

# Uses Application Default Credentials (requires gcloud auth setup)
document_embedder = GoogleGenAIDocumentEmbedder(
    api="vertex",
    vertex_ai_project="my-project",
    vertex_ai_location="us-central1",
)
```

#### Vertex AI (API Key Authentication)

```python
from haystack_integrations.components.embedders.google_genai import GoogleGenAIDocumentEmbedder

# Set the GOOGLE_API_KEY or GEMINI_API_KEY environment variable before running
document_embedder = GoogleGenAIDocumentEmbedder(api="vertex")
```

## Usage

### Embedding Metadata

Text documents often come with metadata. If the metadata is distinctive and semantically meaningful, you can embed it along with the document's text to improve retrieval.

To do this, list the metadata fields to embed in the Document Embedder's `meta_fields_to_embed` parameter:

```python
from haystack import Document
from haystack.utils import Secret
from haystack_integrations.components.embedders.google_genai import GoogleGenAIDocumentEmbedder

doc = Document(content="some text", meta={"title": "relevant title", "page number": 18})

embedder = GoogleGenAIDocumentEmbedder(api_key=Secret.from_token("<your-api-key>"), meta_fields_to_embed=["title"])

docs_w_embeddings = embedder.run(documents=[doc])["documents"]
```
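
To see roughly what `meta_fields_to_embed` does, the sketch below builds the text that would be sent for embedding by joining the selected metadata values with the document content. This is a simplified illustration, not the component's actual code: `build_text_to_embed` is a hypothetical helper, and the newline separator is an assumption.

```python
def build_text_to_embed(content, meta, meta_fields_to_embed, separator="\n"):
    """Join selected metadata values and the content into one string.

    Hypothetical helper; the separator default is an assumption.
    """
    meta_values = [
        str(meta[field])
        for field in meta_fields_to_embed
        if meta.get(field) is not None
    ]
    return separator.join(meta_values + [content])


text = build_text_to_embed(
    content="some text",
    meta={"title": "relevant title", "page number": 18},
    meta_fields_to_embed=["title"],
)
print(repr(text))
```

Only the `title` value is prepended here; `page number` is left out because it is not listed in `meta_fields_to_embed`.
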

### On its own

Here is how you can use the component on its own. You'll need to pass your Google API key as a Secret or set it as an environment variable called `GOOGLE_API_KEY` or `GEMINI_API_KEY`. The examples below assume you've set the environment variable.

```python
from haystack import Document
from haystack_integrations.components.embedders.google_genai import GoogleGenAIDocumentEmbedder

doc = Document(content="I love pizza!")

document_embedder = GoogleGenAIDocumentEmbedder()

result = document_embedder.run([doc])
print(result['documents'][0].embedding)
# [0.017020374536514282, -0.023255806416273117, ...]
```

### In a pipeline

```python
from haystack import Document
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.embedders.google_genai import GoogleGenAITextEmbedder
from haystack_integrations.components.embedders.google_genai import GoogleGenAIDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

documents = [Document(content="My name is Wolfgang and I live in Berlin"),
             Document(content="I saw a black horse running"),
             Document(content="Germany has many big cities")]

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("embedder", GoogleGenAIDocumentEmbedder())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("embedder", "writer")

indexing_pipeline.run({"embedder": {"documents": documents}})

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", GoogleGenAITextEmbedder())
query_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "Who lives in Berlin?"

result = query_pipeline.run({"text_embedder":{"text": query}})

print(result['retriever']['documents'][0])

# Document(id=..., content: 'My name is Wolfgang and I live in Berlin')
```