mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-03 23:20:35 +00:00

### Summary

Table OCR refactor: move the OCR part of the table model from the inference repo to the unst repo.

* Before this PR, the table model extracts OCR tokens with text and bounding boxes and fills the tokens into the table structure in the inference repo. This means we need to run an additional OCR pass for tables.
* After this PR, we use the OCR data from the entire-page OCR and pass the OCR tokens to the inference repo, which means we run OCR only once for the entire document.

**Tech details:**

* Combined env `ENTIRE_PAGE_OCR` and `TABLE_OCR` into `OCR_AGENT`. This means we use the same OCR agent for the entire page and for tables, since we run OCR only once.
* Bump inference repo to `0.7.9`, which allows the table model in inference to use pre-computed OCR data from the unst repo. Please check the [PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256).
* All notebook lint changes are made by `make tidy`.
* This PR also fixes this [issue](https://github.com/Unstructured-IO/unstructured/issues/1564). I've added a test for the issue in `test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages`.
* Add the same scaling logic to the image, [similar to previous Table OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113), but now scaling is applied to the entire image.

### Test

* Not much to test manually, except that table extraction still works.
* Due to the change in scaling and the use of pre-computed OCR data from the entire page, there are some slight (better) changes in table output. Here is a comparison of test outputs I found from the same test, `test_partition_image_with_table_extraction`:

Screenshot of the table in `layout-parser-paper-with-table.jpg`:
<img width="343" alt="expected" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c">

Before refactor:
<img width="709" alt="before" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e">

After refactor:
<img width="705" alt="after" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d">

### TODO

(added as a ticket) There is still some cleanup to do in the inference repo, since the unst repo now has duplicate logic, but we can keep it as a fallback plan. If we want to remove anything OCR related in inference, here are the items that are deprecated and can be removed:

* [`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77) (already noted in code)
* parameter `extract_tables` in inference
* [`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88)
* [`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197)
* env `TABLE_OCR`

### Note

If we want to fall back to an additional table OCR (we may need this for using paddle for tables), we need to:

* pass `infer_table_structure` to inference with the `extract_tables` parameter
* stop passing `infer_table_structure` to `ocr.py`

---------

Co-authored-by: Yao You <yao@unstructured.io>
497 lines
14 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "57eeca7e",
   "metadata": {},
   "source": [
    "# Loading Data into MySQL\n",
    "\n",
    "The goal of this notebook is to show you how to load `unstructured` outputs into MySQL. This allows you to retrieve pre-processed text based on metadata fields that `unstructured` extracts.\n",
    "\n",
    "If you don't have MySQL installed on your system yet, you can follow the instructions [here](https://dev.mysql.com/doc/refman/5.7/en/installing.html) to get it installed. If you haven't already, run `pip install -r requirements.txt` in the base directory of the example folder to install the Python dependencies."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "566328b8",
   "metadata": {},
   "source": [
    "## Preprocess Documents with Unstructured\n",
    "\n",
    "First, we'll pre-process a few documents using the `unstructured` library. The example documents are available under the `example-docs` directory in the `unstructured` repo. At the end of this section, we'll have a list of `Element` objects that we can pass into an `unstructured` staging brick."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "98122cd4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from unstructured.partition.auto import partition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ece16580",
   "metadata": {},
   "outputs": [],
   "source": [
    "# NOTE: Update this directory if you are running the notebook\n",
    "# from somewhere other than the examples/mysql folder in the\n",
    "# unstructured repo\n",
    "EXAMPLE_DOCS_FOLDER = \"../../example-docs/\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c9d970f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "documents_to_process = [\n",
    "    \"fake-email.eml\",\n",
    "    \"fake.docx\",\n",
    "    \"layout-parser-paper-fast.pdf\",\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "570a70bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "elements = []\n",
    "for document in documents_to_process:\n",
    "    filename = os.path.join(EXAMPLE_DOCS_FOLDER, document)\n",
    "    elements.extend(partition(filename=filename, strategy=\"fast\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "73e2a698",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'This is a test email to use for unit tests.'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "elements[0].text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "4e47b525",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'filename': '../../example-docs/fake-email.eml',\n",
       " 'date': '2022-12-16T17:04:16-05:00',\n",
       " 'sent_from': ['Matthew Robinson <mrobinson@unstructured.io>'],\n",
       " 'sent_to': ['Matthew Robinson <mrobinson@unstructured.io>'],\n",
       " 'subject': 'Test Email'}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "elements[0].metadata.to_dict()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d68f22d",
   "metadata": {},
   "source": [
    "## Convert the Unstructured Outputs to a Dataframe\n",
    "\n",
    "Now that we have the document outputs as a list of `Element` objects, we can convert the list to a dataframe using the `convert_to_dataframe` staging brick. With the elements in dataframe format, we can now see the text and type alongside various document metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "805e967f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from unstructured.staging.base import convert_to_dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "a3b76a17",
   "metadata": {},
   "outputs": [],
   "source": [
    "elements_df = convert_to_dataframe(elements)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "89e4125f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>type</th>\n",
       "      <th>text</th>\n",
       "      <th>element_id</th>\n",
       "      <th>coordinates</th>\n",
       "      <th>filename</th>\n",
       "      <th>page_number</th>\n",
       "      <th>url</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NarrativeText</td>\n",
       "      <td>This is a test email to use for unit tests.</td>\n",
       "      <td>f49fbd614ddf5b72e06f59e554e6ae2b</td>\n",
       "      <td>NaN</td>\n",
       "      <td>../../example-docs/fake-email.eml</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Title</td>\n",
       "      <td>Important points:</td>\n",
       "      <td>9c218520320f238595f1fde74bdd137d</td>\n",
       "      <td>NaN</td>\n",
       "      <td>../../example-docs/fake-email.eml</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ListItem</td>\n",
       "      <td>Roses are red</td>\n",
       "      <td>8522061b991b1db70453502d328fe07e</td>\n",
       "      <td>NaN</td>\n",
       "      <td>../../example-docs/fake-email.eml</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ListItem</td>\n",
       "      <td>Violets are blue</td>\n",
       "      <td>c3c4527761d4e4b8d0a4c4a0d46954c8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>../../example-docs/fake-email.eml</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Title</td>\n",
       "      <td>Lorem ipsum dolor sit amet.</td>\n",
       "      <td>dd14cbbf0e74909aac7f248a85d190af</td>\n",
       "      <td>NaN</td>\n",
       "      <td>../../example-docs/fake.docx</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            type                                         text  \\\n",
       "0  NarrativeText  This is a test email to use for unit tests.   \n",
       "1          Title                            Important points:   \n",
       "2       ListItem                                Roses are red   \n",
       "3       ListItem                             Violets are blue   \n",
       "4          Title                  Lorem ipsum dolor sit amet.   \n",
       "\n",
       "                         element_id  coordinates  \\\n",
       "0  f49fbd614ddf5b72e06f59e554e6ae2b          NaN   \n",
       "1  9c218520320f238595f1fde74bdd137d          NaN   \n",
       "2  8522061b991b1db70453502d328fe07e          NaN   \n",
       "3  c3c4527761d4e4b8d0a4c4a0d46954c8          NaN   \n",
       "4  dd14cbbf0e74909aac7f248a85d190af          NaN   \n",
       "\n",
       "                            filename  page_number  url  \n",
       "0  ../../example-docs/fake-email.eml          NaN  NaN  \n",
       "1  ../../example-docs/fake-email.eml          NaN  NaN  \n",
       "2  ../../example-docs/fake-email.eml          NaN  NaN  \n",
       "3  ../../example-docs/fake-email.eml          NaN  NaN  \n",
       "4       ../../example-docs/fake.docx          NaN  NaN  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "elements_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a881fff4",
   "metadata": {},
   "source": [
    "## Load the Documents into MySQL\n",
    "\n",
    "Once the `unstructured` elements are converted to a dataframe, we can easily upload them to MySQL using built-in `pandas` utilities. In this case, we'll upload the documents using a connection created with the `sqlalchemy` library.\n",
    "\n",
    "Run `export MYSQL_PWD=<my-password>` to store your MySQL password as an environment variable. You can accomplish this using other MySQL clients as well. In the `elements_df.to_sql` block, you can change `if_exists` to `\"append\"` if you would like to add to a table instead of replacing it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "dd05592a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import pandas as pd\n",
    "from sqlalchemy import create_engine, text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "0181db92",
   "metadata": {},
   "outputs": [],
   "source": [
    "# NOTE: update these values to reflect the username/password/database\n",
    "# name that you created in MySQL\n",
    "user = \"matt\"\n",
    "pwd = os.environ.get(\"MYSQL_PWD\")\n",
    "host = \"localhost\"\n",
    "db = \"unstructured_example\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "d03c50a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "engine = create_engine(\n",
    "    f\"mysql+mysqlconnector://{user}:{pwd}@{host}/{db}\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "ff49d2f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "table_name = \"processed_documents\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "01fc4043",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-1"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "elements_df.to_sql(name=table_name, con=engine, if_exists=\"replace\", index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b621bd38",
   "metadata": {},
   "source": [
    "## Read the Documents from MySQL\n",
    "\n",
    "Now that the documents are loaded into MySQL, you can run queries that retrieve document snippets based on metadata that `unstructured` has extracted. In this case, we show an example of how to retrieve all of the narrative text from a specific document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "5b03d965",
   "metadata": {},
   "outputs": [],
   "source": [
    "sql = \"\"\"\n",
    "SELECT *\n",
    "FROM unstructured_example.processed_documents\n",
    "WHERE type = 'NarrativeText'\n",
    "AND filename LIKE '%fake-email.eml%'\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "049c45fb",
   "metadata": {},
   "outputs": [],
   "source": [
    "with engine.begin() as conn:\n",
    "    elements_read_df = pd.read_sql_query(sql=text(sql), con=conn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "92bd2fb1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>type</th>\n",
       "      <th>text</th>\n",
       "      <th>element_id</th>\n",
       "      <th>coordinates</th>\n",
       "      <th>filename</th>\n",
       "      <th>page_number</th>\n",
       "      <th>url</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NarrativeText</td>\n",
       "      <td>This is a test email to use for unit tests.</td>\n",
       "      <td>f49fbd614ddf5b72e06f59e554e6ae2b</td>\n",
       "      <td>None</td>\n",
       "      <td>../../example-docs/fake-email.eml</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            type                                         text  \\\n",
       "0  NarrativeText  This is a test email to use for unit tests.   \n",
       "\n",
       "                         element_id  coordinates  \\\n",
       "0  f49fbd614ddf5b72e06f59e554e6ae2b         None   \n",
       "\n",
       "                            filename  page_number   url  \n",
       "0  ../../example-docs/fake-email.eml         None  None  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "elements_read_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df3ea81c",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}