{ "cells": [ { "cell_type": "markdown", "id": "57eeca7e", "metadata": {}, "source": [ "# Loading Data into MySQL\n", "\n", "This notebook shows how to load `unstructured` outputs into MySQL, so that you can retrieve pre-processed text based on the metadata fields that `unstructured` extracts.\n", "\n", "If MySQL is not installed on your system yet, follow the instructions [here](https://dev.mysql.com/doc/refman/5.7/en/installing.html) to install it. If you haven't already, run `pip install -r requirements.txt` in the base directory of the example folder to install the Python dependencies." ] }, { "cell_type": "markdown", "id": "566328b8", "metadata": {}, "source": [ "## Preprocess Documents with Unstructured\n", "\n", "First, we'll pre-process a few documents using the `unstructured` library. The example documents are available under the `example-docs` directory in the `unstructured` repo. At the end of this section, we'll end up with a list of `Element` objects that we can pass into an `unstructured` staging brick." 
] }, { "cell_type": "code", "execution_count": 1, "id": "98122cd4", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "from unstructured.partition.auto import partition" ] }, { "cell_type": "code", "execution_count": 2, "id": "ece16580", "metadata": {}, "outputs": [], "source": [ "# NOTE: Update this directory if you are running the notebook\n", "# from somewhere other than the examples/mysql folder in the\n", "# unstructured repo\n", "EXAMPLE_DOCS_FOLDER = \"../../example-docs/\"" ] }, { "cell_type": "code", "execution_count": 3, "id": "c9d970f4", "metadata": {}, "outputs": [], "source": [ "documents_to_process = [\n", "    \"fake-email.eml\",\n", "    \"fake.docx\",\n", "    \"layout-parser-paper-fast.pdf\",\n", "]" ] }, { "cell_type": "code", "execution_count": 4, "id": "570a70bb", "metadata": {}, "outputs": [], "source": [ "elements = []\n", "for document in documents_to_process:\n", "    filename = os.path.join(EXAMPLE_DOCS_FOLDER, document)\n", "    elements.extend(partition(filename=filename, strategy=\"fast\"))" ] }, { "cell_type": "code", "execution_count": 5, "id": "73e2a698", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This is a test email to use for unit tests.'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elements[0].text" ] }, { "cell_type": "code", "execution_count": 6, "id": "4e47b525", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'filename': '../../example-docs/fake-email.eml',\n", " 'date': '2022-12-16T17:04:16-05:00',\n", " 'sent_from': ['Matthew Robinson '],\n", " 'sent_to': ['Matthew Robinson '],\n", " 'subject': 'Test Email'}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elements[0].metadata.to_dict()" ] }, { "cell_type": "markdown", "id": "1d68f22d", "metadata": {}, "source": [ "## Convert the Unstructured Outputs to a Dataframe\n", "\n", "Now that we have the document outputs as a list of `Element` objects, we can 
convert the list to a dataframe using the `convert_to_dataframe` staging brick. With the elements in dataframe format, each row shows an element's text and type alongside its document metadata." ] }, { "cell_type": "code", "execution_count": 7, "id": "805e967f", "metadata": {}, "outputs": [], "source": [ "from unstructured.staging.base import convert_to_dataframe" ] }, { "cell_type": "code", "execution_count": 8, "id": "a3b76a17", "metadata": {}, "outputs": [], "source": [ "elements_df = convert_to_dataframe(elements)" ] }, { "cell_type": "code", "execution_count": 9, "id": "89e4125f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>type</th><th>text</th><th>element_id</th><th>coordinates</th><th>filename</th><th>page_number</th><th>url</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>NarrativeText</td><td>This is a test email to use for unit tests.</td><td>f49fbd614ddf5b72e06f59e554e6ae2b</td><td>NaN</td><td>../../example-docs/fake-email.eml</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>1</th><td>Title</td><td>Important points:</td><td>9c218520320f238595f1fde74bdd137d</td><td>NaN</td><td>../../example-docs/fake-email.eml</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>2</th><td>ListItem</td><td>Roses are red</td><td>8522061b991b1db70453502d328fe07e</td><td>NaN</td><td>../../example-docs/fake-email.eml</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>3</th><td>ListItem</td><td>Violets are blue</td><td>c3c4527761d4e4b8d0a4c4a0d46954c8</td><td>NaN</td><td>../../example-docs/fake-email.eml</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>4</th><td>Title</td><td>Lorem ipsum dolor sit amet.</td><td>dd14cbbf0e74909aac7f248a85d190af</td><td>NaN</td><td>../../example-docs/fake.docx</td><td>NaN</td><td>NaN</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " type text \\\n", "0 NarrativeText This is a test email to use for unit tests. \n", "1 Title Important points: \n", "2 ListItem Roses are red \n", "3 ListItem Violets are blue \n", "4 Title Lorem ipsum dolor sit amet. \n", "\n", " element_id coordinates \\\n", "0 f49fbd614ddf5b72e06f59e554e6ae2b NaN \n", "1 9c218520320f238595f1fde74bdd137d NaN \n", "2 8522061b991b1db70453502d328fe07e NaN \n", "3 c3c4527761d4e4b8d0a4c4a0d46954c8 NaN \n", "4 dd14cbbf0e74909aac7f248a85d190af NaN \n", "\n", " filename page_number url \n", "0 ../../example-docs/fake-email.eml NaN NaN \n", "1 ../../example-docs/fake-email.eml NaN NaN \n", "2 ../../example-docs/fake-email.eml NaN NaN \n", "3 ../../example-docs/fake-email.eml NaN NaN \n", "4 ../../example-docs/fake.docx NaN NaN " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elements_df.head()" ] }, { "cell_type": "markdown", "id": "a881fff4", "metadata": {}, "source": [ "## Load the Documents into MySQL\n", "\n", "Once the `unstructured` elements are converted to a dataframe, we can easily upload them to MySQL using built-in `pandas` utilities. In this case, we'll upload the documents using a connection created with the `sqlalchemy` libary. \n", "\n", "Run `export MYSQL_PWD=` to store your MySQL password in as an environment variable. You can accomplish this using other MySQL clients as well. In the `elements_df.to_sql` block, you can change `if_exists` to `\"append\"` if you would like to add to a table instead of replacing it." 
] }, { "cell_type": "code", "execution_count": 10, "id": "dd05592a", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import pandas as pd\n", "from sqlalchemy import create_engine, text" ] }, { "cell_type": "code", "execution_count": 11, "id": "0181db92", "metadata": {}, "outputs": [], "source": [ "# NOTE: update these values to reflect the username/password/database\n", "# name that you created in MySQL\n", "user = \"matt\"\n", "pwd = os.environ.get(\"MYSQL_PWD\")\n", "host = \"localhost\"\n", "db = \"unstructured_example\"" ] }, { "cell_type": "code", "execution_count": 12, "id": "d03c50a8", "metadata": {}, "outputs": [], "source": [ "engine = create_engine(\n", "    f\"mysql+mysqlconnector://{user}:{pwd}@{host}/{db}\",\n", ")" ] }, { "cell_type": "code", "execution_count": 13, "id": "ff49d2f4", "metadata": {}, "outputs": [], "source": [ "table_name = \"processed_documents\"" ] }, { "cell_type": "code", "execution_count": 14, "id": "01fc4043", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-1" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elements_df.to_sql(\n", "    name=table_name,\n", "    con=engine,\n", "    if_exists=\"replace\",\n", "    index=False,\n", ")" ] }, { "cell_type": "markdown", "id": "b621bd38", "metadata": {}, "source": [ "## Read the Documents from MySQL\n", "\n", "Now that the documents are loaded into MySQL, you can run queries that retrieve document snippets based on metadata that `unstructured` has extracted. The example here retrieves all of the narrative text from a specific document." 
] }, { "cell_type": "code", "execution_count": 15, "id": "5b03d965", "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"\n", "SELECT *\n", "FROM unstructured_example.processed_documents\n", "WHERE type = 'NarrativeText'\n", "AND filename LIKE '%fake-email.eml%'\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 16, "id": "049c45fb", "metadata": {}, "outputs": [], "source": [ "with engine.begin() as conn:\n", "    elements_read_df = pd.read_sql_query(sql=text(sql), con=conn)" ] }, { "cell_type": "code", "execution_count": 17, "id": "92bd2fb1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>type</th><th>text</th><th>element_id</th><th>coordinates</th><th>filename</th><th>page_number</th><th>url</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>NarrativeText</td><td>This is a test email to use for unit tests.</td><td>f49fbd614ddf5b72e06f59e554e6ae2b</td><td>None</td><td>../../example-docs/fake-email.eml</td><td>None</td><td>None</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " type text \\\n", "0 NarrativeText This is a test email to use for unit tests. \n", "\n", " element_id coordinates \\\n", "0 f49fbd614ddf5b72e06f59e554e6ae2b None \n", "\n", " filename page_number url \n", "0 ../../example-docs/fake-email.eml None None " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elements_read_df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "df3ea81c", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 5 }