{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3d77a209",
   "metadata": {},
   "source": [
    "# Loading Data into PGVector\n",
    "\n",
    "The goal of this notebook is to show how to load embeddings from `unstructured` outputs into a Postgres database with the `pgvector` extension installed.\n",
    "The [Postgres documentation](https://www.postgresql.org/docs/15/tutorial-install.html) has instructions on how to install the Postgres database.\n",
    "See [the `pgvector` repo](https://github.com/pgvector/pgvector) for information on how to install `pgvector`.\n",
    "\n",
    "Postgres with `pgvector` is helpful because it combines the capabilities of a vector database with the structured information available in a traditional RDBMS. In this example, we'll show how to:\n",
    "\n",
    "- Load `unstructured` outputs into `pgvector`.\n",
    "- Conduct a similarity search conditioned on a metadata field.\n",
    "- Conduct a similarity search with a decayed score that biases toward more recent information."
   ]
  },
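  {
   "cell_type": "markdown",
   "id": "0a1b2c3d",
   "metadata": {},
   "source": [
    "Beyond the database itself, this notebook assumes the Python dependencies are installed. A minimal set (package names as published on PyPI; exact requirements may vary by version) is: `pip install unstructured pgvector sqlalchemy psycopg2-binary langchain openai`."
   ]
  },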
  {
   "cell_type": "markdown",
   "id": "ffbe19a7",
   "metadata": {},
   "source": [
    "## Set Up the Postgres Database\n",
    "\n",
    "First, we'll get everything set up for the Postgres database. We'll use `sqlalchemy` as\n",
    "an ORM for defining the table and performing queries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8a538b14",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sqlalchemy import (\n",
    "    create_engine,\n",
    "    ARRAY,\n",
    "    Column,\n",
    "    Integer,\n",
    "    String,\n",
    "    Float,\n",
    "    DateTime,\n",
    "    func,\n",
    "    text,\n",
    ")\n",
    "from pgvector.sqlalchemy import Vector\n",
    "from sqlalchemy.orm import declarative_base, sessionmaker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "91893826",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Dimensionality of OpenAI's text-embedding-ada-002 embeddings\n",
    "ADA_TOKEN_COUNT = 1536"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6614726b",
   "metadata": {},
   "outputs": [],
   "source": [
    "connection_string = \"postgresql://localhost:5432/postgres\"\n",
    "engine = create_engine(connection_string)\n",
    "\n",
    "Base = declarative_base()"
   ]
  },
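  {
   "cell_type": "markdown",
   "id": "1b2c3d4e",
   "metadata": {},
   "source": [
    "This notebook assumes the `vector` extension is already enabled in the target database. If it isn't, a sketch like the following enables it (this uses the SQLAlchemy 2.0-style connection API and requires sufficient database privileges; `text` was imported from `sqlalchemy` above):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c3d4e5f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Enable the pgvector extension if it isn't already enabled.\n",
    "with engine.connect() as conn:\n",
    "    conn.execute(text(\"CREATE EXTENSION IF NOT EXISTS vector\"))\n",
    "    conn.commit()"
   ]
  },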
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "bb2ffda8",
   "metadata": {},
   "outputs": [],
   "source": [
    "class Element(Base):\n",
    "    __tablename__ = \"unstructured_elements\"\n",
    "\n",
    "    id = Column(Integer, primary_key=True)\n",
    "    embedding = Column(Vector(ADA_TOKEN_COUNT))\n",
    "    text = Column(String)\n",
    "    category = Column(String)\n",
    "    filename = Column(String)\n",
    "    date = Column(DateTime)\n",
    "    sent_to = Column(ARRAY(String))\n",
    "    sent_from = Column(ARRAY(String))\n",
    "    subject = Column(String)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "e9130393",
   "metadata": {},
   "outputs": [],
   "source": [
    "Base.metadata.create_all(engine)"
   ]
  },
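  {
   "cell_type": "markdown",
   "id": "3d4e5f6a",
   "metadata": {},
   "source": [
    "For larger tables, `pgvector` supports approximate indexes to speed up similarity search. The DDL below follows the `pgvector` README (an `ivfflat` index over L2 distance); the index name and the `lists` value are arbitrary choices here, and for the handful of rows in this example the step is optional:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e5f6a7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: approximate-nearest-neighbor index for larger tables.\n",
    "with engine.connect() as conn:\n",
    "    conn.execute(text(\n",
    "        \"CREATE INDEX IF NOT EXISTS element_embedding_idx \"\n",
    "        \"ON unstructured_elements \"\n",
    "        \"USING ivfflat (embedding vector_l2_ops) WITH (lists = 100)\"\n",
    "    ))\n",
    "    conn.commit()"
   ]
  },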
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "d59aa418",
   "metadata": {},
   "outputs": [],
   "source": [
    "Session = sessionmaker(bind=engine)\n",
    "session = Session()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24fcb4fa",
   "metadata": {},
   "source": [
    "## Preprocess Documents with Unstructured\n",
    "\n",
    "Next, we'll preprocess data (in this case emails) using the `partition_email` function from `unstructured`. We'll also use the `OpenAIEmbeddings` class from `langchain` to create embeddings from the text. The embeddings will be used for similarity search after we've loaded the documents into the database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "b08244dc",
   "metadata": {},
   "outputs": [],
   "source": [
    "import datetime\n",
    "import os\n",
    "\n",
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "from unstructured.partition.email import partition_email"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "de97a526",
   "metadata": {},
   "outputs": [],
   "source": [
    "EXAMPLE_DOCS_DIRECTORY = \"../../example-docs\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "5a18fd82",
   "metadata": {},
   "outputs": [],
   "source": [
    "elements = []\n",
    "for f in os.listdir(EXAMPLE_DOCS_DIRECTORY):\n",
    "    if not f.endswith(\".eml\"):\n",
    "        continue\n",
    "\n",
    "    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, f)\n",
    "    elements.extend(partition_email(filename=filename))"
   ]
  },
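  {
   "cell_type": "markdown",
   "id": "5f6a7b8c",
   "metadata": {},
   "source": [
    "A quick sanity check before embedding: print the category and a snippet of text for a few of the partitioned elements (plain Python over the `elements` list built above):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a7b8c9d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Peek at the first few elements to confirm partitioning worked.\n",
    "for element in elements[:5]:\n",
    "    print(f\"{element.category}: {element.text[:60]}\")"
   ]
  },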
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b69915f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "embedding_function = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "2e6537d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "for element in elements:\n",
    "    element.embedding = embedding_function.embed_query(element.text)"
   ]
  },
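  {
   "cell_type": "markdown",
   "id": "7b8c9d0e",
   "metadata": {},
   "source": [
    "Note that `embed_query` makes one API call per element. For larger document sets, `OpenAIEmbeddings.embed_documents` accepts a list of texts and batches the calls; a sketch of the equivalent batched version:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c9d0e1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Alternative: embed all texts in one batched call rather than one call per element.\n",
    "embeddings = embedding_function.embed_documents([e.text for e in elements])\n",
    "for element, embedding in zip(elements, embeddings):\n",
    "    element.embedding = embedding"
   ]
  },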
  {
   "cell_type": "markdown",
   "id": "af262aa9",
   "metadata": {},
   "source": [
    "## Load the Documents into Postgres\n",
    "\n",
    "Now that we've preprocessed the documents, we're ready to load the results into the database. We'll do this by creating objects with `sqlalchemy` using the schema we defined earlier and then running an insert command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "a47c99d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "items_to_add = []\n",
    "for element in elements:\n",
    "    items_to_add.append(\n",
    "        Element(\n",
    "            text=element.text,\n",
    "            category=element.category,\n",
    "            embedding=element.embedding,\n",
    "            filename=element.metadata.filename,\n",
    "            date=element.metadata.get_date(),\n",
    "            sent_to=element.metadata.sent_to,\n",
    "            sent_from=element.metadata.sent_from,\n",
    "            subject=element.metadata.subject,\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "5d6bbf43",
   "metadata": {},
   "outputs": [],
   "source": [
    "session.add_all(items_to_add)\n",
    "session.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d013b64",
   "metadata": {},
   "source": [
    "## Query the Database\n",
    "\n",
    "Finally, we're ready to query the database. The results from similarity search can be used for retrieval augmented generation, as described in the `langchain` docs [here](https://docs.langchain.com/docs/use-cases/qa-docs). First, we'll run a query conditioned on metadata. In this case, we'll look for similar items, but only among narrative text elements. You can also perform this operation using the [`pgvector` vectorstore](https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/pgvector.py) in `langchain`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "7ba10d65",
   "metadata": {},
   "outputs": [],
   "source": [
    "vector = embedding_function.embed_query(\"email\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "25ed06a2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 This is a test email to use for unit tests.\n",
      "13 This is a test email to use for unit tests.\n",
      "5 This is a test email to use for unit tests.\n",
      "9 The unstructured logo is attached to this email.\n",
      "19 It includes:\n"
     ]
    }
   ],
   "source": [
    "query = (\n",
    "    session.query(Element)\n",
    "    .filter(Element.category == \"NarrativeText\")\n",
    "    .order_by(Element.embedding.l2_distance(vector))\n",
    "    .limit(5)\n",
    ")\n",
    "\n",
    "for element in query:\n",
    "    print(element.id, element.text)"
   ]
  },
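  {
   "cell_type": "markdown",
   "id": "9d0e1f2a",
   "metadata": {},
   "source": [
    "Because the metadata lives in ordinary columns, filters compose like any other `sqlalchemy` query. For example, restricting the same search by sender is just another `.filter()` clause; the address below is a hypothetical placeholder:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e1f2a3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Same similarity search, additionally restricted by sender metadata.\n",
    "query = (\n",
    "    session.query(Element)\n",
    "    .filter(Element.category == \"NarrativeText\")\n",
    "    .filter(Element.sent_from.any(\"sender@example.com\"))  # hypothetical address\n",
    "    .order_by(Element.embedding.l2_distance(vector))\n",
    "    .limit(5)\n",
    ")\n",
    "\n",
    "for element in query:\n",
    "    print(element.id, element.text)"
   ]
  },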
  {
   "cell_type": "markdown",
   "id": "10b0b984",
   "metadata": {},
   "source": [
    "Next, we'll run a similarity search, but add a decay function that biases the results toward more recent documents. This can be helpful if you want to run retrieval augmented generation but are concerned about passing outdated information into the LLM. In this case, we multiply the distance metric by a decay function with an exponential decay rate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "532cb832",
   "metadata": {},
   "outputs": [],
   "source": [
    "vector = embedding_function.embed_query(\"violets\")\n",
    "decay_rate = 0.10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "2ebff5da",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.0050977532596662945 - Violets are blue\n",
      "0.001773595479626926 - Violets are blue\n",
      "0.001773595479626926 - Violets are blue\n",
      "0.0011421532895244265 - Roses are red\n",
      "0.00029501066142995373 - Roses are red\n"
     ]
    }
   ],
   "source": [
    "query = (\n",
    "    session.query(\n",
    "        Element,\n",
    "        Element.text,\n",
    "        func.exp(\n",
    "            text(f\"-{decay_rate} * EXTRACT(DAY FROM (NOW() - date))\")\n",
    "            * Element.embedding.l2_distance(vector)\n",
    "        ).label(\"decay_score\"),\n",
    "    )\n",
    "    .order_by(text(\"decay_score DESC\"))\n",
    "    .limit(5)\n",
    ")\n",
    "\n",
    "for element in query:\n",
    "    print(f\"{element.decay_score} - {element.text}\")"
   ]
  },
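  {
   "cell_type": "markdown",
   "id": "1f2a3b4c",
   "metadata": {},
   "source": [
    "The `Vector` comparator from `pgvector.sqlalchemy` also exposes `cosine_distance` and `max_inner_product` alongside `l2_distance`, so the same decayed ranking can be sketched with another distance metric, e.g. cosine distance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a3b4c5d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Same decayed ranking, using cosine distance instead of L2 distance.\n",
    "query = (\n",
    "    session.query(\n",
    "        Element.text,\n",
    "        func.exp(\n",
    "            text(f\"-{decay_rate} * EXTRACT(DAY FROM (NOW() - date))\")\n",
    "            * Element.embedding.cosine_distance(vector)\n",
    "        ).label(\"decay_score\"),\n",
    "    )\n",
    "    .order_by(text(\"decay_score DESC\"))\n",
    "    .limit(5)\n",
    ")\n",
    "\n",
    "for element in query:\n",
    "    print(f\"{element.decay_score} - {element.text}\")"
   ]
  },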
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "edd8ba5f",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}