unstructured/examples/mysql/load-into-mysql.ipynb
Matt Robinson d9c035edb1
docs: no more bricks (#1967)
### Summary

We no longer use the "bricks" terminology for partioning functions, etc
in the library. This PR updates various references to bricks within the
repo and the docs. This is just an initial pass to swap the terminology
out, it'll likely be helpful to reorganize the docs a bit as well.

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-11-02 09:43:26 -05:00

497 lines
14 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "57eeca7e",
"metadata": {},
"source": [
"# Loading Data into MySQL\n",
"\n",
"The goal of this notebook is to show you how to load `unstructured` outputs into MySQL. This allows you to retrieve pre-processed text based on metadata fields that `unstructured` extracts.\n",
"\n",
"If you don't have MySQL installed on your system yet, you can follow the instructions [here](https://dev.mysql.com/doc/refman/5.7/en/installing.html) to get it installed. If you haven't already, run `pip install -r requirements.txt` in the base directory of the example folder to install the Python dependencies."
]
},
{
"cell_type": "markdown",
"id": "566328b8",
"metadata": {},
"source": [
"# Preprocess Documents with Unstructured\n",
"\n",
"First, we'll pre-process a few documents using the the `unstructured` libraries. The example documents are available under the `example-docs` directory in the `unstructured` repo. At the end of this section, we'll wind up with a list of `Element` objects that we can pass into an `unstructured` staging function."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "98122cd4",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"from unstructured.partition.auto import partition"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "ece16580",
"metadata": {},
"outputs": [],
"source": [
"# NOTE: Update this directory if you are running the notebook\n",
"# from somewhere other than the examples/mysql folder in the\n",
"# unstructured repo\n",
"EXAMPLE_DOCS_FOLDER = \"../../example-docs/\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c9d970f4",
"metadata": {},
"outputs": [],
"source": [
"documents_to_process = [\n",
" \"fake-email.eml\",\n",
" \"fake.docx\",\n",
" \"layout-parser-paper-fast.pdf\",\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "570a70bb",
"metadata": {},
"outputs": [],
"source": [
"elements = []\n",
"for document in documents_to_process:\n",
" filename = os.path.join(EXAMPLE_DOCS_FOLDER, document)\n",
" elements.extend(partition(filename=filename, strategy=\"fast\"))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "73e2a698",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'This is a test email to use for unit tests.'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"elements[0].text"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "4e47b525",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'filename': '../../example-docs/fake-email.eml',\n",
" 'date': '2022-12-16T17:04:16-05:00',\n",
" 'sent_from': ['Matthew Robinson <mrobinson@unstructured.io>'],\n",
" 'sent_to': ['Matthew Robinson <mrobinson@unstructured.io>'],\n",
" 'subject': 'Test Email'}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"elements[0].metadata.to_dict()"
]
},
{
"cell_type": "markdown",
"id": "1d68f22d",
"metadata": {},
"source": [
"## Convert the Unstructured Outputs to a Dataframe\n",
"\n",
"Now that we have the document outputs as a list of `Element` objects, we can convert the list to a dataframe using the `convert_to_dataframe` staging function. With the elements in dataframe format, we can now see the text and type along side various document metadata."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "805e967f",
"metadata": {},
"outputs": [],
"source": [
"from unstructured.staging.base import convert_to_dataframe"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a3b76a17",
"metadata": {},
"outputs": [],
"source": [
"elements_df = convert_to_dataframe(elements)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "89e4125f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>type</th>\n",
" <th>text</th>\n",
" <th>element_id</th>\n",
" <th>coordinates</th>\n",
" <th>filename</th>\n",
" <th>page_number</th>\n",
" <th>url</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NarrativeText</td>\n",
" <td>This is a test email to use for unit tests.</td>\n",
" <td>f49fbd614ddf5b72e06f59e554e6ae2b</td>\n",
" <td>NaN</td>\n",
" <td>../../example-docs/fake-email.eml</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Title</td>\n",
" <td>Important points:</td>\n",
" <td>9c218520320f238595f1fde74bdd137d</td>\n",
" <td>NaN</td>\n",
" <td>../../example-docs/fake-email.eml</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ListItem</td>\n",
" <td>Roses are red</td>\n",
" <td>8522061b991b1db70453502d328fe07e</td>\n",
" <td>NaN</td>\n",
" <td>../../example-docs/fake-email.eml</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ListItem</td>\n",
" <td>Violets are blue</td>\n",
" <td>c3c4527761d4e4b8d0a4c4a0d46954c8</td>\n",
" <td>NaN</td>\n",
" <td>../../example-docs/fake-email.eml</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Title</td>\n",
" <td>Lorem ipsum dolor sit amet.</td>\n",
" <td>dd14cbbf0e74909aac7f248a85d190af</td>\n",
" <td>NaN</td>\n",
" <td>../../example-docs/fake.docx</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" type text \\\n",
"0 NarrativeText This is a test email to use for unit tests. \n",
"1 Title Important points: \n",
"2 ListItem Roses are red \n",
"3 ListItem Violets are blue \n",
"4 Title Lorem ipsum dolor sit amet. \n",
"\n",
" element_id coordinates \\\n",
"0 f49fbd614ddf5b72e06f59e554e6ae2b NaN \n",
"1 9c218520320f238595f1fde74bdd137d NaN \n",
"2 8522061b991b1db70453502d328fe07e NaN \n",
"3 c3c4527761d4e4b8d0a4c4a0d46954c8 NaN \n",
"4 dd14cbbf0e74909aac7f248a85d190af NaN \n",
"\n",
" filename page_number url \n",
"0 ../../example-docs/fake-email.eml NaN NaN \n",
"1 ../../example-docs/fake-email.eml NaN NaN \n",
"2 ../../example-docs/fake-email.eml NaN NaN \n",
"3 ../../example-docs/fake-email.eml NaN NaN \n",
"4 ../../example-docs/fake.docx NaN NaN "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"elements_df.head()"
]
},
{
"cell_type": "markdown",
"id": "a881fff4",
"metadata": {},
"source": [
"## Load the Documents into MySQL\n",
"\n",
"Once the `unstructured` elements are converted to a dataframe, we can easily upload them to MySQL using built-in `pandas` utilities. In this case, we'll upload the documents using a connection created with the `sqlalchemy` libary. \n",
"\n",
"Run `export MYSQL_PWD=<my-password>` to store your MySQL password in as an environment variable. You can accomplish this using other MySQL clients as well. In the `elements_df.to_sql` block, you can change `if_exists` to `\"append\"` if you would like to add to a table instead of replacing it."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "dd05592a",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import pandas as pd\n",
"from sqlalchemy import create_engine, text"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "0181db92",
"metadata": {},
"outputs": [],
"source": [
"# NOTE: update these values to reflect the username/password/database\n",
"# name that you created in MySQL\n",
"user = \"matt\"\n",
"pwd = os.environ.get(\"MYSQL_PWD\")\n",
"host = \"localhost\"\n",
"db = \"unstructured_example\""
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "d03c50a8",
"metadata": {},
"outputs": [],
"source": [
"engine = create_engine(\n",
" f\"mysql+mysqlconnector://{user}:{pwd}@{host}/{db}\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "ff49d2f4",
"metadata": {},
"outputs": [],
"source": [
"table_name = \"processed_documents\""
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "01fc4043",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-1"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"elements_df.to_sql(name=table_name, con=engine, if_exists=\"replace\", index=False)"
]
},
{
"cell_type": "markdown",
"id": "b621bd38",
"metadata": {},
"source": [
"## Read the Documents from MySQL\n",
"\n",
"Now that the documents are loaded into MySQL, you can run queries that retrieve document snippets based on metadata that `unstructured` has extracted. In this case, we show an example of how to retrieve all of the narrative text from a specific document."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "5b03d965",
"metadata": {},
"outputs": [],
"source": [
"sql = \"\"\"\n",
"SELECT *\n",
"FROM unstructured_example.processed_documents\n",
"WHERE type = \"NarrativeText\"\n",
"AND filename LIKE '%fake-email.eml%'\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "049c45fb",
"metadata": {},
"outputs": [],
"source": [
"with engine.begin() as conn:\n",
" elements_read_df = pd.read_sql_query(sql=text(sql), con=conn)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "92bd2fb1",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>type</th>\n",
" <th>text</th>\n",
" <th>element_id</th>\n",
" <th>coordinates</th>\n",
" <th>filename</th>\n",
" <th>page_number</th>\n",
" <th>url</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NarrativeText</td>\n",
" <td>This is a test email to use for unit tests.</td>\n",
" <td>f49fbd614ddf5b72e06f59e554e6ae2b</td>\n",
" <td>None</td>\n",
" <td>../../example-docs/fake-email.eml</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" type text \\\n",
"0 NarrativeText This is a test email to use for unit tests. \n",
"\n",
" element_id coordinates \\\n",
"0 f49fbd614ddf5b72e06f59e554e6ae2b None \n",
"\n",
" filename page_number url \n",
"0 ../../example-docs/fake-email.eml None None "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"elements_read_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "df3ea81c",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}