--- title: "Pgvector" id: integrations-pgvector description: "Pgvector integration for Haystack" slug: "/integrations-pgvector" --- ## Module haystack\_integrations.components.retrievers.pgvector.embedding\_retriever ### PgvectorEmbeddingRetriever Retrieves documents from the `PgvectorDocumentStore`, based on their dense embeddings. Example usage: ```python from haystack.document_stores import DuplicatePolicy from haystack import Document, Pipeline from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever # Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database. # e.g., "postgresql://USER:PASSWORD@HOST:PORT/DB_NAME" document_store = PgvectorDocumentStore( embedding_dimension=768, vector_function="cosine_similarity", recreate_table=True, ) documents = [Document(content="There are over 7,000 languages spoken around the world today."), Document(content="Elephants have been observed to behave in a way that indicates..."), Document(content="In certain places, you can witness the phenomenon of bioluminescent waves.")] document_embedder = SentenceTransformersDocumentEmbedder() document_embedder.warm_up() documents_with_embeddings = document_embedder.run(documents) document_store.write_documents(documents_with_embeddings.get("documents"), policy=DuplicatePolicy.OVERWRITE) query_pipeline = Pipeline() query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder()) query_pipeline.add_component("retriever", PgvectorEmbeddingRetriever(document_store=document_store)) query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding") query = "How many languages are there?" 
res = query_pipeline.run({"text_embedder": {"text": query}})

assert res['retriever']['documents'][0].content == "There are over 7,000 languages spoken around the world today."
```

#### PgvectorEmbeddingRetriever.\_\_init\_\_

```python
def __init__(*, document_store: PgvectorDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             vector_function: Optional[Literal["cosine_similarity", "inner_product",
                                               "l2_distance"]] = None,
             filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE)
```

**Arguments**:

- `document_store`: An instance of `PgvectorDocumentStore`.
- `filters`: Filters applied to the retrieved Documents.
- `top_k`: Maximum number of Documents to return.
- `vector_function`: The similarity function to use when searching for similar embeddings.
  Defaults to the one set in the `document_store` instance.
  `"cosine_similarity"` and `"inner_product"` are similarity functions: higher scores indicate greater similarity between the documents.
  `"l2_distance"` returns the straight-line distance between vectors, so the most similar documents are the ones with the smallest score.
  **Important**: if the document store uses the `"hnsw"` search strategy, the vector function
  should match the one used during index creation to take advantage of the index.
- `filter_policy`: Policy to determine how filters are applied.

**Raises**:

- `ValueError`: If `document_store` is not an instance of `PgvectorDocumentStore` or if `vector_function` is not one of the valid options.

#### PgvectorEmbeddingRetriever.to\_dict

```python
def to_dict() -> Dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

#### PgvectorEmbeddingRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PgvectorEmbeddingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.
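To make the `vector_function` score semantics concrete, here is a small plain-Python sketch of the three functions (pgvector computes these inside PostgreSQL; this is just an illustration of how the scores order results):

```python
import math

def cosine_similarity(a, b):
    # Higher score = more similar (1.0 for parallel vectors).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def inner_product(a, b):
    # Higher score = more similar (sensitive to vector magnitude).
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    # Lower score = more similar (0.0 for identical vectors).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0]
close = [0.9, 0.1]  # nearly parallel to the query
far = [0.0, 1.0]    # orthogonal to the query

# Similarity functions rank `close` above `far` with a *higher* score...
assert cosine_similarity(query, close) > cosine_similarity(query, far)
# ...while the distance function ranks it above `far` with a *lower* score.
assert l2_distance(query, close) < l2_distance(query, far)
```

This is why, with `"l2_distance"`, the most similar documents are the ones with the smallest score.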
#### PgvectorEmbeddingRetriever.run

```python
@component.output_types(documents=List[Document])
def run(
    query_embedding: List[float],
    filters: Optional[Dict[str, Any]] = None,
    top_k: Optional[int] = None,
    vector_function: Optional[Literal["cosine_similarity", "inner_product",
                                      "l2_distance"]] = None
) -> Dict[str, List[Document]]
```

Retrieve documents from the `PgvectorDocumentStore`, based on their embeddings.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the `filter_policy` chosen at retriever initialization. See init method docstring for more details.
- `top_k`: Maximum number of Documents to return.
- `vector_function`: The similarity function to use when searching for similar embeddings.

**Returns**:

A dictionary with the following keys:
- `documents`: List of `Document`s that are similar to `query_embedding`.

#### PgvectorEmbeddingRetriever.run\_async

```python
@component.output_types(documents=List[Document])
async def run_async(
    query_embedding: List[float],
    filters: Optional[Dict[str, Any]] = None,
    top_k: Optional[int] = None,
    vector_function: Optional[Literal["cosine_similarity", "inner_product",
                                      "l2_distance"]] = None
) -> Dict[str, List[Document]]
```

Asynchronously retrieve documents from the `PgvectorDocumentStore`, based on their embeddings.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the `filter_policy` chosen at retriever initialization. See init method docstring for more details.
- `top_k`: Maximum number of Documents to return.
- `vector_function`: The similarity function to use when searching for similar embeddings.

**Returns**:

A dictionary with the following keys:
- `documents`: List of `Document`s that are similar to `query_embedding`.
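The `filter_policy` mentioned above controls how runtime filters interact with the filters set at initialization. The following is a rough plain-Python sketch of the two policies' semantics; it is illustrative only, not Haystack's actual implementation:

```python
# Illustrative sketch of FilterPolicy semantics; not Haystack's implementation.
init_filters = {"field": "meta.genre", "operator": "==", "value": "news"}
runtime_filters = {"field": "meta.year", "operator": ">=", "value": 2020}

def effective_filters(policy, init_f, runtime_f):
    if runtime_f is None:
        return init_f          # no runtime filters: init filters always apply
    if policy == "replace":
        return runtime_f       # REPLACE: runtime filters win outright
    # MERGE: both filter sets must hold, combined under a logical AND
    return {"operator": "AND", "conditions": [init_f, runtime_f]}

assert effective_filters("replace", init_filters, runtime_filters) == runtime_filters
merged = effective_filters("merge", init_filters, runtime_filters)
assert merged["operator"] == "AND" and init_filters in merged["conditions"]
```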
## Module haystack\_integrations.components.retrievers.pgvector.keyword\_retriever

### PgvectorKeywordRetriever

Retrieve documents from the `PgvectorDocumentStore`, based on keywords.

To rank the documents, the `ts_rank_cd` function of PostgreSQL is used. It considers how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document where they occur is. For more details, see the [Postgres documentation](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING).

Usage example:

```python
from haystack.document_stores.types import DuplicatePolicy
from haystack import Document
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorKeywordRetriever

# Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
# e.g., "postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"

document_store = PgvectorDocumentStore(language="english", recreate_table=True)

documents = [Document(content="There are over 7,000 languages spoken around the world today."),
             Document(content="Elephants have been observed to behave in a way that indicates..."),
             Document(content="In certain places, you can witness the phenomenon of bioluminescent waves.")]

document_store.write_documents(documents, policy=DuplicatePolicy.OVERWRITE)

retriever = PgvectorKeywordRetriever(document_store=document_store)

result = retriever.run(query="languages")

assert result['documents'][0].content == "There are over 7,000 languages spoken around the world today."
```
#### PgvectorKeywordRetriever.\_\_init\_\_

```python
def __init__(*, document_store: PgvectorDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE)
```

**Arguments**:

- `document_store`: An instance of `PgvectorDocumentStore`.
- `filters`: Filters applied to the retrieved Documents.
- `top_k`: Maximum number of Documents to return.
- `filter_policy`: Policy to determine how filters are applied.

**Raises**:

- `ValueError`: If `document_store` is not an instance of `PgvectorDocumentStore`.

#### PgvectorKeywordRetriever.to\_dict

```python
def to_dict() -> Dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

#### PgvectorKeywordRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PgvectorKeywordRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

#### PgvectorKeywordRetriever.run

```python
@component.output_types(documents=List[Document])
def run(query: str,
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None) -> Dict[str, List[Document]]
```

Retrieve documents from the `PgvectorDocumentStore`, based on keywords.

**Arguments**:

- `query`: String to search in `Document`s' content.
- `filters`: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the `filter_policy` chosen at retriever initialization. See init method docstring for more details.
- `top_k`: Maximum number of Documents to return.

**Returns**:

A dictionary with the following keys:
- `documents`: List of `Document`s that match the query.
#### PgvectorKeywordRetriever.run\_async

```python
@component.output_types(documents=List[Document])
async def run_async(query: str,
                    filters: Optional[Dict[str, Any]] = None,
                    top_k: Optional[int] = None) -> Dict[str, List[Document]]
```

Asynchronously retrieve documents from the `PgvectorDocumentStore`, based on keywords.

**Arguments**:

- `query`: String to search in `Document`s' content.
- `filters`: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the `filter_policy` chosen at retriever initialization. See init method docstring for more details.
- `top_k`: Maximum number of Documents to return.

**Returns**:

A dictionary with the following keys:
- `documents`: List of `Document`s that match the query.

## Module haystack\_integrations.document\_stores.pgvector.document\_store

### PgvectorDocumentStore

A Document Store using PostgreSQL with the [pgvector extension](https://github.com/pgvector/pgvector) installed.

#### PgvectorDocumentStore.\_\_init\_\_

```python
def __init__(*, connection_string: Secret = Secret.from_env_var("PG_CONN_STR"),
             create_extension: bool = True,
             schema_name: str = "public",
             table_name: str = "haystack_documents",
             language: str = "english",
             embedding_dimension: int = 768,
             vector_type: Literal["vector", "halfvec"] = "vector",
             vector_function: Literal["cosine_similarity", "inner_product",
                                      "l2_distance"] = "cosine_similarity",
             recreate_table: bool = False,
             search_strategy: Literal["exact_nearest_neighbor",
                                      "hnsw"] = "exact_nearest_neighbor",
             hnsw_recreate_index_if_exists: bool = False,
             hnsw_index_creation_kwargs: Optional[Dict[str, int]] = None,
             hnsw_index_name: str = "haystack_hnsw_index",
             hnsw_ef_search: Optional[int] = None,
             keyword_index_name: str = "haystack_keyword_index")
```

Creates a new PgvectorDocumentStore instance. It is meant to be connected to a PostgreSQL database with the pgvector extension installed. A specific table to store Haystack documents will be created if it doesn't exist yet.
**Arguments**:

- `connection_string`: The connection string to use to connect to the PostgreSQL database, defined as an environment variable. It can be provided in either URI format, e.g. `PG_CONN_STR="postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"`, or keyword/value format, e.g. `PG_CONN_STR="host=HOST port=PORT dbname=DBNAME user=USER password=PASSWORD"`. See the [PostgreSQL Documentation](https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING) for more details.
- `create_extension`: Whether to create the pgvector extension if it doesn't exist. Set this to `True` (default) to automatically create the extension if it is missing. Creating the extension may require superuser privileges. If set to `False`, ensure the extension is already installed; otherwise, an error will be raised.
- `schema_name`: The name of the schema the table is created in. The schema must already exist.
- `table_name`: The name of the table to use to store Haystack documents.
- `language`: The language to be used to parse query and document content in keyword retrieval. To see the list of available languages, you can run the following SQL query in your PostgreSQL database: `SELECT cfgname FROM pg_ts_config;`. More information can be found in this [StackOverflow answer](https://stackoverflow.com/a/39752553).
- `embedding_dimension`: The dimension of the embedding.
- `vector_type`: The type of vector used for embedding storage. `"vector"` is the default. `"halfvec"` stores embeddings in half-precision, which is particularly useful for high-dimensional embeddings (dimension greater than 2,000 and up to 4,000). Requires pgvector version 0.7.0 or later. For more information, see the [pgvector documentation](https://github.com/pgvector/pgvector?tab=readme-ov-file).
- `vector_function`: The similarity function to use when searching for similar embeddings.
`"cosine_similarity"` and `"inner_product"` are similarity functions and higher scores indicate greater similarity between the documents. `"l2_distance"` returns the straight-line distance between vectors, and the most similar documents are the ones with the smallest score. **Important**: when using the `"hnsw"` search strategy, an index will be created that depends on the `vector_function` passed here. Make sure subsequent queries will keep using the same vector similarity function in order to take advantage of the index. - `recreate_table`: Whether to recreate the table if it already exists. - `search_strategy`: The search strategy to use when searching for similar embeddings. `"exact_nearest_neighbor"` provides perfect recall but can be slow for large numbers of documents. `"hnsw"` is an approximate nearest neighbor search strategy, which trades off some accuracy for speed; it is recommended for large numbers of documents. **Important**: when using the `"hnsw"` search strategy, an index will be created that depends on the `vector_function` passed here. Make sure subsequent queries will keep using the same vector similarity function in order to take advantage of the index. - `hnsw_recreate_index_if_exists`: Whether to recreate the HNSW index if it already exists. Only used if search_strategy is set to `"hnsw"`. - `hnsw_index_creation_kwargs`: Additional keyword arguments to pass to the HNSW index creation. Only used if search_strategy is set to `"hnsw"`. You can find the list of valid arguments in the [pgvector documentation](https://github.com/pgvector/pgvector?tab=readme-ov-file#hnsw) - `hnsw_index_name`: Index name for the HNSW index. - `hnsw_ef_search`: The `ef_search` parameter to use at query time. Only used if search_strategy is set to `"hnsw"`. You can find more information about this parameter in the [pgvector documentation](https://github.com/pgvector/pgvector?tab=readme-ov-file#hnsw). - `keyword_index_name`: Index name for the Keyword index. 
#### PgvectorDocumentStore.to\_dict

```python
def to_dict() -> Dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

#### PgvectorDocumentStore.from\_dict

```python
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PgvectorDocumentStore"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

#### PgvectorDocumentStore.delete\_table

```python
def delete_table()
```

Deletes the table used to store Haystack documents. The name of the schema (`schema_name`) and the name of the table (`table_name`) are defined when initializing the `PgvectorDocumentStore`.

#### PgvectorDocumentStore.delete\_table\_async

```python
async def delete_table_async()
```

Async method to delete the table used to store Haystack documents.

#### PgvectorDocumentStore.count\_documents

```python
def count_documents() -> int
```

Returns how many documents are present in the document store.

**Returns**:

Number of documents in the document store.

#### PgvectorDocumentStore.count\_documents\_async

```python
async def count_documents_async() -> int
```

Returns how many documents are present in the document store.

**Returns**:

Number of documents in the document store.

#### PgvectorDocumentStore.filter\_documents

```python
def filter_documents(
    filters: Optional[Dict[str, Any]] = None) -> List[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the [documentation](https://docs.haystack.deepset.ai/v2.0/docs/metadata-filtering).

**Arguments**:

- `filters`: The filters to apply to the document list.

**Raises**:

- `TypeError`: If `filters` is not a dictionary.
- `ValueError`: If `filters` syntax is invalid.

**Returns**:

A list of Documents that match the given filters.
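As a sketch of the filter syntax described in the linked metadata-filtering documentation (the field names and values here are made up for illustration), a filter is either a single comparison condition or a logical operator combining several conditions:

```python
# Hypothetical filter: two comparison conditions combined under a logical AND.
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.date", "operator": ">=", "value": "2023-01-01"},
    ],
}
# With a live store, this would be passed as:
# document_store.filter_documents(filters=filters)
```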
#### PgvectorDocumentStore.filter\_documents\_async

```python
async def filter_documents_async(
    filters: Optional[Dict[str, Any]] = None) -> List[Document]
```

Asynchronously returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the [documentation](https://docs.haystack.deepset.ai/v2.0/docs/metadata-filtering).

**Arguments**:

- `filters`: The filters to apply to the document list.

**Raises**:

- `TypeError`: If `filters` is not a dictionary.
- `ValueError`: If `filters` syntax is invalid.

**Returns**:

A list of Documents that match the given filters.

#### PgvectorDocumentStore.write\_documents

```python
def write_documents(documents: List[Document],
                    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int
```

Writes documents to the document store.

**Arguments**:

- `documents`: A list of Documents to write to the document store.
- `policy`: The duplicate policy to use when writing documents.

**Raises**:

- `ValueError`: If `documents` contains objects that are not of type `Document`.
- `DuplicateDocumentError`: If a document with the same id already exists in the document store and the policy is set to `DuplicatePolicy.FAIL` (or not specified).
- `DocumentStoreError`: If the write operation fails for any other reason.

**Returns**:

The number of documents written to the document store.

#### PgvectorDocumentStore.write\_documents\_async

```python
async def write_documents_async(
    documents: List[Document],
    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int
```

Asynchronously writes documents to the document store.

**Arguments**:

- `documents`: A list of Documents to write to the document store.
- `policy`: The duplicate policy to use when writing documents.

**Raises**:

- `ValueError`: If `documents` contains objects that are not of type `Document`.
- `DuplicateDocumentError`: If a document with the same id already exists in the document store and the policy is set to `DuplicatePolicy.FAIL` (or not specified).
- `DocumentStoreError`: If the write operation fails for any other reason.

**Returns**:

The number of documents written to the document store.

#### PgvectorDocumentStore.delete\_documents

```python
def delete_documents(document_ids: List[str]) -> None
```

Deletes documents that match the provided `document_ids` from the document store.

**Arguments**:

- `document_ids`: the document ids to delete

#### PgvectorDocumentStore.delete\_documents\_async

```python
async def delete_documents_async(document_ids: List[str]) -> None
```

Asynchronously deletes documents that match the provided `document_ids` from the document store.

**Arguments**:

- `document_ids`: the document ids to delete

#### PgvectorDocumentStore.delete\_all\_documents

```python
def delete_all_documents() -> None
```

Deletes all documents in the document store.

#### PgvectorDocumentStore.delete\_all\_documents\_async

```python
async def delete_all_documents_async() -> None
```

Asynchronously deletes all documents in the document store.
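To summarize the `DuplicatePolicy` semantics described under `write_documents` above, here is a toy in-memory illustration using a plain dict keyed by document id. It is not the store's implementation; note that, per the docstring, an unspecified policy (`NONE`) behaves like `FAIL` for this store:

```python
# Toy illustration of DuplicatePolicy semantics; not the store implementation.
class DuplicateDocumentError(Exception):
    pass

def write(store, docs, policy="none"):
    """Write (id, content) pairs into `store`, honoring the duplicate policy."""
    written = 0
    for doc_id, content in docs:
        if doc_id in store:
            if policy in ("none", "fail"):   # NONE behaves like FAIL here
                raise DuplicateDocumentError(doc_id)
            if policy == "skip":
                continue
            # "overwrite" falls through and replaces the stored document
        store[doc_id] = content
        written += 1
    return written

store = {}
assert write(store, [("1", "a"), ("2", "b")]) == 2
assert write(store, [("1", "a2")], policy="overwrite") == 1 and store["1"] == "a2"
assert write(store, [("2", "x")], policy="skip") == 0 and store["2"] == "b"
```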