mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-12-25 14:14:30 +00:00
docs: unstructured -> MySQL example (#557)
* added requirements for mysql * first bit of mysql notebook * update requirements file * wrap with mysql example * update readme with install instructions
This commit is contained in:
parent
aaea6358f6
commit
19beb24e03
25
examples/mysql/README.md
Normal file
25
examples/mysql/README.md
Normal file
@ -0,0 +1,25 @@
|
||||
# Loading `unstructured` outputs into MySQL
|
||||
|
||||
The following example shows how to load `unstructured` output into MySQL.
|
||||
This allows you to run queries based on metadata that the `unstructured`
|
||||
library has extracted.
|
||||
Follow the instructions [here](https://dev.mysql.com/doc/refman/5.7/en/installing.html)
|
||||
to install MySQL on your system. If you're using homebrew on Mac, you can
|
||||
install MySQL with `brew install mysql`.
|
||||
|
||||
|
||||
Once you have installed MySQL, you can connect to MySQL with the command `mysql -u root`.
|
||||
You can create a non-root user and an `unstructured_example` database using the following
|
||||
commands:
|
||||
|
||||
```sql
|
||||
CREATE USER '<username>'@'localhost' IDENTIFIED BY '<password>';
|
||||
CREATE DATABASE unstructured_example;
|
||||
GRANT ALL PRIVILEGES ON unstructured_example.* TO '<username>'@'localhost';
|
||||
```
|
||||
|
||||
## Running the example
|
||||
|
||||
1. Run `pip install -r requirements.txt` to install the Python dependencies.
|
||||
1. Run `jupyter-notebook to start.
|
||||
1. Run the `load-into-mysql.ipynb` notebook.
|
||||
501
examples/mysql/load-into-mysql.ipynb
Normal file
501
examples/mysql/load-into-mysql.ipynb
Normal file
@ -0,0 +1,501 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "57eeca7e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Loading Data into MySQL\n",
|
||||
"\n",
|
||||
"The goal of this notebook is to show you how to load `unstructured` outputs into MySQL. This allows you to retrieve pre-processed text based on metadata fields that `unstructured` extracts.\n",
|
||||
"\n",
|
||||
"If you don't have MySQL installed on your system yet, you can follow the instructions [here](https://dev.mysql.com/doc/refman/5.7/en/installing.html) to get it installed. If you haven't already, run `pip install -r requirements.txt` in the base directory of the example folder to install the Python dependencies."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "566328b8",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Preprocess Documents with Unstructured\n",
|
||||
"\n",
|
||||
"First, we'll pre-process a few documents using the the `unstructured` libraries. The example documents are available under the `example-docs` directory in the `unstructured` repo. At the end of this section, we'll wind up with a list of `Element` objects that we can pass into an `unstructured` staging brick."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "98122cd4",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"\n",
|
||||
"from unstructured.partition.auto import partition"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "ece16580",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# NOTE: Update this directory if you are running the notebook\n",
|
||||
"# from somewhere other than the examples/mysql folder in the\n",
|
||||
"# unstructured repo\n",
|
||||
"EXAMPLE_DOCS_FOLDER = \"../../example-docs/\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "c9d970f4",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"documents_to_process = [\n",
|
||||
" \"fake-email.eml\",\n",
|
||||
" \"fake.docx\",\n",
|
||||
" \"layout-parser-paper-fast.pdf\",\n",
|
||||
"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "570a70bb",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"elements = []\n",
|
||||
"for document in documents_to_process:\n",
|
||||
" filename = os.path.join(EXAMPLE_DOCS_FOLDER, document)\n",
|
||||
" elements.extend(partition(filename=filename, strategy=\"fast\"))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "73e2a698",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'This is a test email to use for unit tests.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"elements[0].text"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "4e47b525",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'filename': '../../example-docs/fake-email.eml',\n",
|
||||
" 'date': '2022-12-16T17:04:16-05:00',\n",
|
||||
" 'sent_from': ['Matthew Robinson <mrobinson@unstructured.io>'],\n",
|
||||
" 'sent_to': ['Matthew Robinson <mrobinson@unstructured.io>'],\n",
|
||||
" 'subject': 'Test Email'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"elements[0].metadata.to_dict()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1d68f22d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Convert the Unstructured Outputs to a Dataframe\n",
|
||||
"\n",
|
||||
"Now that we have the document outputs as a list of `Element` objects, we can convert the list to a dataframe using the `convert_to_dataframe` staging brick. With the elements in dataframe format, we can now see the text and type along side various document metadata."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "805e967f",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from unstructured.staging.base import convert_to_dataframe"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "a3b76a17",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"elements_df = convert_to_dataframe(elements)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "89e4125f",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>type</th>\n",
|
||||
" <th>text</th>\n",
|
||||
" <th>element_id</th>\n",
|
||||
" <th>coordinates</th>\n",
|
||||
" <th>filename</th>\n",
|
||||
" <th>page_number</th>\n",
|
||||
" <th>url</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>NarrativeText</td>\n",
|
||||
" <td>This is a test email to use for unit tests.</td>\n",
|
||||
" <td>f49fbd614ddf5b72e06f59e554e6ae2b</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>../../example-docs/fake-email.eml</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>Title</td>\n",
|
||||
" <td>Important points:</td>\n",
|
||||
" <td>9c218520320f238595f1fde74bdd137d</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>../../example-docs/fake-email.eml</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>ListItem</td>\n",
|
||||
" <td>Roses are red</td>\n",
|
||||
" <td>8522061b991b1db70453502d328fe07e</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>../../example-docs/fake-email.eml</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>ListItem</td>\n",
|
||||
" <td>Violets are blue</td>\n",
|
||||
" <td>c3c4527761d4e4b8d0a4c4a0d46954c8</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>../../example-docs/fake-email.eml</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>Title</td>\n",
|
||||
" <td>Lorem ipsum dolor sit amet.</td>\n",
|
||||
" <td>dd14cbbf0e74909aac7f248a85d190af</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>../../example-docs/fake.docx</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" type text \\\n",
|
||||
"0 NarrativeText This is a test email to use for unit tests. \n",
|
||||
"1 Title Important points: \n",
|
||||
"2 ListItem Roses are red \n",
|
||||
"3 ListItem Violets are blue \n",
|
||||
"4 Title Lorem ipsum dolor sit amet. \n",
|
||||
"\n",
|
||||
" element_id coordinates \\\n",
|
||||
"0 f49fbd614ddf5b72e06f59e554e6ae2b NaN \n",
|
||||
"1 9c218520320f238595f1fde74bdd137d NaN \n",
|
||||
"2 8522061b991b1db70453502d328fe07e NaN \n",
|
||||
"3 c3c4527761d4e4b8d0a4c4a0d46954c8 NaN \n",
|
||||
"4 dd14cbbf0e74909aac7f248a85d190af NaN \n",
|
||||
"\n",
|
||||
" filename page_number url \n",
|
||||
"0 ../../example-docs/fake-email.eml NaN NaN \n",
|
||||
"1 ../../example-docs/fake-email.eml NaN NaN \n",
|
||||
"2 ../../example-docs/fake-email.eml NaN NaN \n",
|
||||
"3 ../../example-docs/fake-email.eml NaN NaN \n",
|
||||
"4 ../../example-docs/fake.docx NaN NaN "
|
||||
]
|
||||
},
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"elements_df.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a881fff4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Load the Documents into MySQL\n",
|
||||
"\n",
|
||||
"Once the `unstructured` elements are converted to a dataframe, we can easily upload them to MySQL using built-in `pandas` utilities. In this case, we'll upload the documents using a connection created with the `sqlalchemy` libary. \n",
|
||||
"\n",
|
||||
"Run `export MYSQL_PWD=<my-password>` to store your MySQL password in as an environment variable. You can accomplish this using other MySQL clients as well. In the `elements_df.to_sql` block, you can change `if_exists` to `\"append\"` if you would like to add to a table instead of replacing it."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "dd05592a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"\n",
|
||||
"import pandas as pd\n",
|
||||
"from sqlalchemy import create_engine, text"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "0181db92",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# NOTE: update these values to reflect the username/password/database\n",
|
||||
"# name that you created in MySQL\n",
|
||||
"user = \"matt\"\n",
|
||||
"pwd = os.environ.get(\"MYSQL_PWD\")\n",
|
||||
"host = \"localhost\"\n",
|
||||
"db = \"unstructured_example\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "d03c50a8",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"engine = create_engine(\n",
|
||||
" f\"mysql+mysqlconnector://{user}:{pwd}@{host}/{db}\",\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"id": "ff49d2f4",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"table_name = \"processed_documents\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"id": "01fc4043",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"-1"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"elements_df.to_sql(\n",
|
||||
" name=table_name,\n",
|
||||
" con=engine,\n",
|
||||
" if_exists=\"replace\",\n",
|
||||
" index=False\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b621bd38",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Read the Documents from MySQL\n",
|
||||
"\n",
|
||||
"Now that the documents are loaded into MySQL, you can run queries that retrieve document snippets based on metadata that `unstructured` has extracted. In this case, we show an example of how to retrieve all of the narrative text from a specific document."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"id": "5b03d965",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sql = \"\"\"\n",
|
||||
"SELECT *\n",
|
||||
"FROM unstructured_example.processed_documents\n",
|
||||
"WHERE type = \"NarrativeText\"\n",
|
||||
"AND filename LIKE '%fake-email.eml%'\n",
|
||||
"\"\"\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"id": "049c45fb",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"with engine.begin() as conn:\n",
|
||||
" elements_read_df = pd.read_sql_query(sql=text(sql), con=conn)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"id": "92bd2fb1",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>type</th>\n",
|
||||
" <th>text</th>\n",
|
||||
" <th>element_id</th>\n",
|
||||
" <th>coordinates</th>\n",
|
||||
" <th>filename</th>\n",
|
||||
" <th>page_number</th>\n",
|
||||
" <th>url</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>NarrativeText</td>\n",
|
||||
" <td>This is a test email to use for unit tests.</td>\n",
|
||||
" <td>f49fbd614ddf5b72e06f59e554e6ae2b</td>\n",
|
||||
" <td>None</td>\n",
|
||||
" <td>../../example-docs/fake-email.eml</td>\n",
|
||||
" <td>None</td>\n",
|
||||
" <td>None</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" type text \\\n",
|
||||
"0 NarrativeText This is a test email to use for unit tests. \n",
|
||||
"\n",
|
||||
" element_id coordinates \\\n",
|
||||
"0 f49fbd614ddf5b72e06f59e554e6ae2b None \n",
|
||||
"\n",
|
||||
" filename page_number url \n",
|
||||
"0 ../../example-docs/fake-email.eml None None "
|
||||
]
|
||||
},
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"elements_read_df.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "df3ea81c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.13"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
4
examples/mysql/requirements.txt
Normal file
4
examples/mysql/requirements.txt
Normal file
@ -0,0 +1,4 @@
|
||||
unstructured[local-inference]
|
||||
pandas
|
||||
sqlalchemy
|
||||
jupyter
|
||||
Loading…
x
Reference in New Issue
Block a user