"The goal of this notebook is to show you how to load `unstructured` outputs into MySQL. This allows you to retrieve pre-processed text based on metadata fields that `unstructured` extracts.\n",
"\n",
"If you don't have MySQL installed on your system yet, follow the instructions [here](https://dev.mysql.com/doc/refman/5.7/en/installing.html) to install it. If you haven't already, run `pip install -r requirements.txt` in the base directory of the example folder to install the Python dependencies."
]
},
{
"cell_type": "markdown",
"id": "566328b8",
"metadata": {},
"source": [
"# Preprocess Documents with Unstructured\n",
"\n",
"First, we'll pre-process a few documents using the `unstructured` library. The example documents are available under the `example-docs` directory in the `unstructured` repo. At the end of this section, we'll wind up with a list of `Element` objects that we can pass into an `unstructured` staging brick."
"## Convert the Unstructured Outputs to a Dataframe\n",
"\n",
"Now that we have the document outputs as a list of `Element` objects, we can convert the list to a dataframe using the `convert_to_dataframe` staging brick. With the elements in dataframe format, we can see each element's text and type alongside its document metadata."
" <td>This is a test email to use for unit tests.</td>\n",
" <td>f49fbd614ddf5b72e06f59e554e6ae2b</td>\n",
" <td>NaN</td>\n",
" <td>../../example-docs/fake-email.eml</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Title</td>\n",
" <td>Important points:</td>\n",
" <td>9c218520320f238595f1fde74bdd137d</td>\n",
" <td>NaN</td>\n",
" <td>../../example-docs/fake-email.eml</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ListItem</td>\n",
" <td>Roses are red</td>\n",
" <td>8522061b991b1db70453502d328fe07e</td>\n",
" <td>NaN</td>\n",
" <td>../../example-docs/fake-email.eml</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ListItem</td>\n",
" <td>Violets are blue</td>\n",
" <td>c3c4527761d4e4b8d0a4c4a0d46954c8</td>\n",
" <td>NaN</td>\n",
" <td>../../example-docs/fake-email.eml</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Title</td>\n",
" <td>Lorem ipsum dolor sit amet.</td>\n",
" <td>dd14cbbf0e74909aac7f248a85d190af</td>\n",
" <td>NaN</td>\n",
" <td>../../example-docs/fake.docx</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" type text \\\n",
"0 NarrativeText This is a test email to use for unit tests. \n",
"1 Title Important points: \n",
"2 ListItem Roses are red \n",
"3 ListItem Violets are blue \n",
"4 Title Lorem ipsum dolor sit amet. \n",
"\n",
" element_id coordinates \\\n",
"0 f49fbd614ddf5b72e06f59e554e6ae2b NaN \n",
"1 9c218520320f238595f1fde74bdd137d NaN \n",
"2 8522061b991b1db70453502d328fe07e NaN \n",
"3 c3c4527761d4e4b8d0a4c4a0d46954c8 NaN \n",
"4 dd14cbbf0e74909aac7f248a85d190af NaN \n",
"\n",
" filename page_number url \n",
"0 ../../example-docs/fake-email.eml NaN NaN \n",
"1 ../../example-docs/fake-email.eml NaN NaN \n",
"2 ../../example-docs/fake-email.eml NaN NaN \n",
"3 ../../example-docs/fake-email.eml NaN NaN \n",
"4 ../../example-docs/fake.docx NaN NaN "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"elements_df.head()"
]
},
{
"cell_type": "markdown",
"id": "a881fff4",
"metadata": {},
"source": [
"## Load the Documents into MySQL\n",
"\n",
"Once the `unstructured` elements are converted to a dataframe, we can easily upload them to MySQL using built-in `pandas` utilities. In this case, we'll upload the documents using a connection created with the `sqlalchemy` library.\n",
"\n",
"Run `export MYSQL_PWD=<my-password>` to store your MySQL password as an environment variable. You can also perform the upload with other MySQL clients. In the `elements_df.to_sql` block, you can change `if_exists` to `\"append\"` if you would like to add to an existing table instead of replacing it."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "dd05592a",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import pandas as pd\n",
"from sqlalchemy import create_engine, text"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "0181db92",
"metadata": {},
"outputs": [],
"source": [
"# NOTE: update these values to reflect the username/password/database\n",
"Now that the documents are loaded into MySQL, you can run queries that retrieve document snippets based on metadata that `unstructured` has extracted. In this case, we show an example of how to retrieve all of the narrative text from a specific document."