"The goal of this notebook is to show how to load embeddings from `unstructured` outputs into a Postgres database with the `pgvector` extension installed.\n",
"The [Postgres documentation](https://www.postgresql.org/docs/15/tutorial-install.html) has instructions on how to install the Postgres database.\n",
"See [the `pgvector` repo](https://github.com/pgvector/pgvector) for information on how to install `pgvector`.\n",
"\n",
"Postgres with `pgvector` is helpful because it combines the capabilities of a vector database with the structured information available in a traditional RDBMS. In this example, we'll show how to:\n",
"\n",
"- Load `unstructured` outputs into `pgvector`.\n",
"- Conduct a similarity search conditioned on a metadata field.\n",
"- Conduct a similarity search, with a decayed score that biases more recent information."
]
},
{
"cell_type": "markdown",
"id": "ffbe19a7",
"metadata": {},
"source": [
"## Set Up the Postgres Database\n",
"\n",
"First, we'll get everything set up for the Postgres database. We'll use `sqlalchemy` as\n",
"an ORM for defining the table and performing queries.\n",
"\n",
"Next, we'll preprocess the data (in this case, emails) using the `partition_email` function from `unstructured`. We'll also use the `OpenAIEmbeddings` class from `langchain` to create embeddings from the text. The embeddings will be used for similarity search after we've loaded the documents into the database.\n",
"\n",
"Now that we've preprocessed the documents, we're ready to load the results into the database. We'll do this by creating objects with `sqlalchemy` using the schema we defined earlier and then running an insert command."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "a47c99d3",
"metadata": {},
"outputs": [],
"source": [
"items_to_add = []\n",
"for element in elements:\n",
" items_to_add.append(\n",
" Element(\n",
" text=element.text,\n",
" category=element.category,\n",
" embedding=element.embedding,\n",
" filename=element.metadata.filename,\n",
" date=element.metadata.get_date(),\n",
" sent_to=element.metadata.sent_to,\n",
" sent_from=element.metadata.sent_from,\n",
" subject=element.metadata.subject,\n",
" )\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "5d6bbf43",
"metadata": {},
"outputs": [],
"source": [
"session.add_all(items_to_add)\n",
"session.commit()"
]
},
{
"cell_type": "markdown",
"id": "8d013b64",
"metadata": {},
"source": [
"## Query the Database\n",
"\n",
"Finally, we're ready to query the database. The results from similarity search can be used for retrieval augmented generation, as described in the `langchain` docs [here](https://docs.langchain.com/docs/use-cases/qa-docs). First, we'll run a query conditioned on metadata. In this case, we'll look for similar items, but restrict the search to narrative text elements. You can also perform this operation using the [`pgvector` vectorstore](https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/pgvector.py) in `langchain`.\n",
"\n",
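The metadata-conditioned search can also be expressed as raw SQL using `pgvector`'s distance operator. This is a sketch under assumptions: the `elements` table and column names mirror the schema used when inserting, and the query embedding is bound as a parameter at execution time.

```python
from sqlalchemy import text

# Hypothetical raw-SQL form of the metadata-conditioned similarity search.
# `<->` is pgvector's L2 distance operator; `<=>` gives cosine distance.
stmt = text(
    """
    SELECT text, date
    FROM elements
    WHERE category = 'NarrativeText'
    ORDER BY embedding <-> :query_embedding
    LIMIT 5
    """
)
# Run against the session from earlier, e.g.:
# results = session.execute(stmt, {"query_embedding": str(query_embedding)})
```

Because the `WHERE` clause filters before the nearest-neighbor ordering, only narrative text elements are ranked by distance.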
"Next, we'll run a similarity search, but add a decay function that biases the results toward the most recent documents. This can be helpful if you want to run retrieval augmented generation, but are concerned about passing outdated information into the LLM. In this case, we multiply the distance metric by a decay function with an exponential decay rate."
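The decay-weighted ranking can be sketched in plain Python. The decay rate and the multiplicative form here are illustrative assumptions: with ascending-distance ordering (smaller distance means more similar), scaling the raw distance by `exp(rate * age_days)` pushes older documents down the ranking.

```python
import math
from datetime import datetime

def decayed_distance(
    distance: float, doc_date: datetime, now: datetime, rate: float = 0.01
) -> float:
    """Scale a raw vector distance by an exponential function of document age.

    Older documents get a larger (worse) effective distance. The rate
    value is an illustrative assumption; tune it to your data.
    """
    age_days = (now - doc_date).total_seconds() / 86400
    return distance * math.exp(rate * age_days)

now = datetime(2023, 6, 1)
# Two documents with the same raw distance: the recent one ranks higher
# (i.e., keeps the smaller effective distance).
recent = decayed_distance(0.5, datetime(2023, 5, 31), now)
old = decayed_distance(0.5, datetime(2023, 1, 1), now)
```

The same expression can be pushed into the SQL `ORDER BY` clause so that Postgres applies the decay while ranking, rather than re-sorting results in Python.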