docs: unstructured -> MySQL example (#557)

* added requirements for mysql
* first bit of mysql notebook
* update requirements file
* wrap with mysql example
* update readme with install instructions
2025-12-25 14:14:30 +00:00 · 2023-05-09 09:26:49 -04:00 · 2023-05-09 09:26:49 -04:00 · 19beb24e03
commit 19beb24e03
parent aaea6358f6
3 changed files with 530 additions and 0 deletions
--- a/examples/mysql/README.md
+++ b/examples/mysql/README.md
@@ -0,0 +1,25 @@
+# Loading `unstructured` outputs into MySQL
+
+The following example shows how to load `unstructured` output into MySQL.
+This allows you to run queries based on metadata that the `unstructured`
+library has extracted.
+Follow the instructions [here](https://dev.mysql.com/doc/refman/5.7/en/installing.html)
+to install MySQL on your system. If you're using homebrew on Mac, you can
+install MySQL with `brew install mysql`.
+
+
+Once you have installed MySQL, you can connect to MySQL with the command `mysql -u root`.
+You can create a non-root user and an `unstructured_example` database using the following
+commands:
+
+```sql
+CREATE USER '<username>'@'localhost' IDENTIFIED BY '<password>';
+CREATE DATABASE unstructured_example;
+GRANT ALL PRIVILEGES ON unstructured_example.* TO '<username>'@'localhost';
+```
+
+## Running the example
+
+1. Run `pip install -r requirements.txt` to install the Python dependencies.
+1. Run `jupyter-notebook` to start Jupyter.
+1. Run the `load-into-mysql.ipynb` notebook.
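
The password set in the `CREATE USER` statement is later read by the notebook from the `MYSQL_PWD` environment variable and interpolated into a `mysql+mysqlconnector://` URL. As a rough sketch of a more defensive variant (not part of this commit), the helper below fails early when the variable is unset and percent-encodes the credentials so that characters such as `@` or `/` do not break the URL. The name `build_mysql_url`, its parameters, and the sample user and password are hypothetical.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(user, host, db, env_var="MYSQL_PWD", environ=None):
    """Build a mysql+mysqlconnector URL, reading the password from the environment.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    environ = os.environ if environ is None else environ
    password = environ.get(env_var)
    if not password:
        # Without this check, a missing variable would be interpolated as "None".
        raise RuntimeError(f"Set {env_var} before running the notebook")
    # Percent-encode credentials so special characters stay inside their URL field.
    return (
        f"mysql+mysqlconnector://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{db}"
    )


# Sample values only; a fake environment is passed in so nothing real is read.
example_url = build_mysql_url(
    user="example_user",
    host="localhost",
    db="unstructured_example",
    environ={"MYSQL_PWD": "p@ss/word"},
)
print(example_url)
```

The resulting string could be passed to `create_engine` in place of the f-string used in the notebook.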
--- a/examples/mysql/load-into-mysql.ipynb
+++ b/examples/mysql/load-into-mysql.ipynb
@@ -0,0 +1,501 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "57eeca7e",
+   "metadata": {},
+   "source": [
+    "# Loading Data into MySQL\n",
+    "\n",
+    "The goal of this notebook is to show you how to load `unstructured` outputs into MySQL. This allows you to retrieve pre-processed text based on metadata fields that `unstructured` extracts.\n",
+    "\n",
+    "If you don't have MySQL installed on your system yet, you can follow the instructions [here](https://dev.mysql.com/doc/refman/5.7/en/installing.html) to get it installed. If you haven't already, run `pip install -r requirements.txt` in the base directory of the example folder to install the Python dependencies."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "566328b8",
+   "metadata": {},
+   "source": [
+    "# Preprocess Documents with Unstructured\n",
+    "\n",
+    "First, we'll pre-process a few documents using the `unstructured` library. The example documents are available under the `example-docs` directory in the `unstructured` repo. At the end of this section, we'll wind up with a list of `Element` objects that we can pass into an `unstructured` staging brick."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "98122cd4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "from unstructured.partition.auto import partition"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "ece16580",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# NOTE: Update this directory if you are running the notebook\n",
+    "# from somewhere other than the examples/mysql folder in the\n",
+    "# unstructured repo\n",
+    "EXAMPLE_DOCS_FOLDER = \"../../example-docs/\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "c9d970f4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "documents_to_process = [\n",
+    "    \"fake-email.eml\",\n",
+    "    \"fake.docx\",\n",
+    "    \"layout-parser-paper-fast.pdf\",\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "570a70bb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "elements = []\n",
+    "for document in documents_to_process:\n",
+    "    filename = os.path.join(EXAMPLE_DOCS_FOLDER, document)\n",
+    "    elements.extend(partition(filename=filename, strategy=\"fast\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "73e2a698",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'This is a test email to use for unit tests.'"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "elements[0].text"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "4e47b525",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'filename': '../../example-docs/fake-email.eml',\n",
+       " 'date': '2022-12-16T17:04:16-05:00',\n",
+       " 'sent_from': ['Matthew Robinson <mrobinson@unstructured.io>'],\n",
+       " 'sent_to': ['Matthew Robinson <mrobinson@unstructured.io>'],\n",
+       " 'subject': 'Test Email'}"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "elements[0].metadata.to_dict()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1d68f22d",
+   "metadata": {},
+   "source": [
+    "## Convert the Unstructured Outputs to a Dataframe\n",
+    "\n",
+    "Now that we have the document outputs as a list of `Element` objects, we can convert the list to a dataframe using the `convert_to_dataframe` staging brick. With the elements in dataframe format, we can now see the text and type alongside various document metadata."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "805e967f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from unstructured.staging.base import convert_to_dataframe"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "a3b76a17",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "elements_df = convert_to_dataframe(elements)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "89e4125f",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>type</th>\n",
+       "      <th>text</th>\n",
+       "      <th>element_id</th>\n",
+       "      <th>coordinates</th>\n",
+       "      <th>filename</th>\n",
+       "      <th>page_number</th>\n",
+       "      <th>url</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>NarrativeText</td>\n",
+       "      <td>This is a test email to use for unit tests.</td>\n",
+       "      <td>f49fbd614ddf5b72e06f59e554e6ae2b</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>../../example-docs/fake-email.eml</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Title</td>\n",
+       "      <td>Important points:</td>\n",
+       "      <td>9c218520320f238595f1fde74bdd137d</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>../../example-docs/fake-email.eml</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>ListItem</td>\n",
+       "      <td>Roses are red</td>\n",
+       "      <td>8522061b991b1db70453502d328fe07e</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>../../example-docs/fake-email.eml</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>ListItem</td>\n",
+       "      <td>Violets are blue</td>\n",
+       "      <td>c3c4527761d4e4b8d0a4c4a0d46954c8</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>../../example-docs/fake-email.eml</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>Title</td>\n",
+       "      <td>Lorem ipsum dolor sit amet.</td>\n",
+       "      <td>dd14cbbf0e74909aac7f248a85d190af</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>../../example-docs/fake.docx</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "            type                                         text  \\\n",
+       "0  NarrativeText  This is a test email to use for unit tests.   \n",
+       "1          Title                            Important points:   \n",
+       "2       ListItem                                Roses are red   \n",
+       "3       ListItem                             Violets are blue   \n",
+       "4          Title                  Lorem ipsum dolor sit amet.   \n",
+       "\n",
+       "                         element_id  coordinates  \\\n",
+       "0  f49fbd614ddf5b72e06f59e554e6ae2b          NaN   \n",
+       "1  9c218520320f238595f1fde74bdd137d          NaN   \n",
+       "2  8522061b991b1db70453502d328fe07e          NaN   \n",
+       "3  c3c4527761d4e4b8d0a4c4a0d46954c8          NaN   \n",
+       "4  dd14cbbf0e74909aac7f248a85d190af          NaN   \n",
+       "\n",
+       "                            filename  page_number  url  \n",
+       "0  ../../example-docs/fake-email.eml          NaN  NaN  \n",
+       "1  ../../example-docs/fake-email.eml          NaN  NaN  \n",
+       "2  ../../example-docs/fake-email.eml          NaN  NaN  \n",
+       "3  ../../example-docs/fake-email.eml          NaN  NaN  \n",
+       "4       ../../example-docs/fake.docx          NaN  NaN  "
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "elements_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a881fff4",
+   "metadata": {},
+   "source": [
+    "## Load the Documents into MySQL\n",
+    "\n",
+    "Once the `unstructured` elements are converted to a dataframe, we can easily upload them to MySQL using built-in `pandas` utilities. In this case, we'll upload the documents using a connection created with the `sqlalchemy` library.\n",
+    "\n",
+    "Run `export MYSQL_PWD=<my-password>` to store your MySQL password as an environment variable. You can accomplish this using other MySQL clients as well. In the `elements_df.to_sql` block, you can change `if_exists` to `\"append\"` if you would like to add to a table instead of replacing it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "dd05592a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "import pandas as pd\n",
+    "from sqlalchemy import create_engine, text"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "0181db92",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# NOTE: update these values to reflect the username/password/database\n",
+    "# name that you created in MySQL\n",
+    "user = \"matt\"\n",
+    "pwd = os.environ.get(\"MYSQL_PWD\")\n",
+    "host = \"localhost\"\n",
+    "db = \"unstructured_example\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "d03c50a8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "engine = create_engine(\n",
+    "    f\"mysql+mysqlconnector://{user}:{pwd}@{host}/{db}\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "ff49d2f4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "table_name = \"processed_documents\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "01fc4043",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "-1"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "elements_df.to_sql(\n",
+    "    name=table_name,\n",
+    "    con=engine,\n",
+    "    if_exists=\"replace\",\n",
+    "    index=False\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b621bd38",
+   "metadata": {},
+   "source": [
+    "## Read the Documents from MySQL\n",
+    "\n",
+    "Now that the documents are loaded into MySQL, you can run queries that retrieve document snippets based on metadata that `unstructured` has extracted. In this case, we show an example of how to retrieve all of the narrative text from a specific document."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "5b03d965",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = \"\"\"\n",
+    "SELECT *\n",
+    "FROM unstructured_example.processed_documents\n",
+    "WHERE type = 'NarrativeText'\n",
+    "AND filename LIKE '%fake-email.eml%'\n",
+    "\"\"\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "049c45fb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with engine.begin() as conn:\n",
+    "    elements_read_df = pd.read_sql_query(sql=text(sql), con=conn)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "id": "92bd2fb1",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>type</th>\n",
+       "      <th>text</th>\n",
+       "      <th>element_id</th>\n",
+       "      <th>coordinates</th>\n",
+       "      <th>filename</th>\n",
+       "      <th>page_number</th>\n",
+       "      <th>url</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>NarrativeText</td>\n",
+       "      <td>This is a test email to use for unit tests.</td>\n",
+       "      <td>f49fbd614ddf5b72e06f59e554e6ae2b</td>\n",
+       "      <td>None</td>\n",
+       "      <td>../../example-docs/fake-email.eml</td>\n",
+       "      <td>None</td>\n",
+       "      <td>None</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "            type                                         text  \\\n",
+       "0  NarrativeText  This is a test email to use for unit tests.   \n",
+       "\n",
+       "                         element_id coordinates  \\\n",
+       "0  f49fbd614ddf5b72e06f59e554e6ae2b        None   \n",
+       "\n",
+       "                            filename page_number   url  \n",
+       "0  ../../example-docs/fake-email.eml        None  None  "
+      ]
+     },
+     "execution_count": 17,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "elements_read_df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "df3ea81c",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
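
The final query in the notebook hard-codes the element type and filename into the SQL string. A minimal sketch of the same metadata lookup with bound parameters is shown below. To keep it runnable without a MySQL server, it swaps in Python's built-in `sqlite3` module and an in-memory table holding a few rows copied from the notebook output; the function name `fetch_by_type_and_file` and the reduced three-column table are assumptions made for illustration.

```python
import sqlite3

# A few rows copied from the dataframe output in the notebook (columns reduced).
rows = [
    ("NarrativeText", "This is a test email to use for unit tests.",
     "../../example-docs/fake-email.eml"),
    ("Title", "Important points:", "../../example-docs/fake-email.eml"),
    ("ListItem", "Roses are red", "../../example-docs/fake-email.eml"),
    ("Title", "Lorem ipsum dolor sit amet.", "../../example-docs/fake.docx"),
]

# In-memory SQLite database used as a stand-in for the MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_documents (type TEXT, text TEXT, filename TEXT)"
)
conn.executemany("INSERT INTO processed_documents VALUES (?, ?, ?)", rows)


def fetch_by_type_and_file(connection, element_type, filename_fragment):
    """Return (type, text, filename) rows for one element type and filename match.

    Hypothetical helper; the notebook itself uses pandas.read_sql_query.
    """
    query = (
        "SELECT type, text, filename FROM processed_documents "
        "WHERE type = ? AND filename LIKE ?"
    )
    # Values are sent as bound parameters instead of being pasted into the SQL.
    params = (element_type, f"%{filename_fragment}%")
    return connection.execute(query, params).fetchall()


narrative = fetch_by_type_and_file(conn, "NarrativeText", "fake-email.eml")
print(narrative)
```

`sqlite3` uses `?` placeholders. With SQLAlchemy's `text()`, as used in the notebook, the equivalent would likely use named `:name` placeholders, with the values supplied through the `params` argument of `pd.read_sql_query`.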
--- a/examples/mysql/requirements.txt
+++ b/examples/mysql/requirements.txt
@ -0,0 +1,4 @@
+unstructured[local-inference]
+pandas
+sqlalchemy
+jupyter
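
The notebook's `mysql+mysqlconnector://` URL appears to rely on the `mysql.connector` module (distributed as `mysql-connector-python`), which is not listed in this requirements file. As a hedged pre-flight sketch, the snippet below reports which of the modules imported by the notebook cannot be found in the current environment. The `REQUIRED_MODULES` list and the `missing_modules` helper are assumptions for illustration, not part of the commit.

```python
import importlib.util

# Modules the notebook imports directly or through the connection URL (assumed list).
REQUIRED_MODULES = ["unstructured", "pandas", "sqlalchemy", "mysql.connector"]


def missing_modules(names):
    """Return the subset of module names that cannot be located for import."""
    missing = []
    for name in names:
        try:
            # find_spec returns None when a module cannot be found.
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised, for example, when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

If `mysql.connector` shows up in the output, running `pip install mysql-connector-python` should make the driver available.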