unstructured/examples/elastic-search/load-into-es.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "176b9504-8338-42d5-ab3f-ef2bd9ac8fe7",
   "metadata": {},
   "source": [
    "## Indexing Sentiment Analysis Data from Unstructured elements to Elasticsearch"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30e7d198-974d-4ded-a26e-936b3ce704e8",
   "metadata": {},
   "source": [
    "The goal of this notebook is to show how to load `Unstructured` output [Elements](https://unstructured-io.github.io/unstructured/getting_started.html#document-elements), together with basic sentiment analysis information, into an `Elasticsearch` index. Check the official\n",
    "[Elasticsearch Python client documentation](https://elasticsearch-py.readthedocs.io/en/v8.8.0/) to learn more about working with indexes in Python.\n",
    "\n",
    "In this example, we'll show how to:\n",
    "\n",
    "- Load `Unstructured` output `Element` objects, together with basic sentiment scores, into an `Elasticsearch` index.\n",
    "- Retrieve the stored documents from `Elasticsearch` using a [Search DSL](https://elasticsearch-dsl.readthedocs.io/en/latest/search_dsl.html) query to get the *top 5* most polarized and subjective `Text` elements in an HTML file entitled *\"Russian Offensive Campaign\"*.\n",
    "\n",
    "The sentiment analysis itself is handled by a third-party library, [TextBlob](https://textblob.readthedocs.io/en/dev/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "28eb74e4-5560-4013-8236-ac159d87eff0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Dependencies\n",
    "\n",
    "import configparser\n",
    "import json\n",
    "\n",
    "from unstructured.staging.base import convert_to_dict\n",
    "from unstructured.cleaners.core import clean_extra_whitespace\n",
    "from unstructured.partition.auto import partition\n",
    "\n",
    "from elasticsearch import Elasticsearch\n",
    "from elasticsearch_dsl import Search\n",
    "\n",
    "from textblob import TextBlob\n",
    "from tqdm import tqdm"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d5acda6-b41d-4bed-9dc3-d6e8419fed72",
   "metadata": {
    "tags": []
   },
   "source": [
    "The HTML file to be partitioned lives in the [example-docs](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs) directory. You can render the HTML inside the notebook by running the following snippet in a new cell:\n",
    "\n",
    "```python\n",
    "from IPython.display import display, HTML\n",
    "\n",
    "with open(\"../../example-docs/example-with-scripts.html\", 'r') as file:\n",
    "    html_content = file.read()\n",
    "\n",
    "display(HTML(html_content))\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1f168a9-5c80-47f5-b417-cd44303b5324",
   "metadata": {},
   "source": [
    "Let's start by calling the standard `partition` function from `unstructured.partition.auto` to obtain a list of document `Element` objects from the target HTML file. These element objects represent the different components of the source document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "8b0dd307-9a85-41a1-89cb-1b5c173fec36",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total number of elements: 159\n",
      "\n",
      "first element, some fields:\n",
      "\n",
      "Title\n",
      "Skip to main content\n",
      "ElementMetadata(filename='../../example-docs/example-with-scripts.html', page_number=1, url=None)\n"
     ]
    }
   ],
   "source": [
    "elements = partition(\"../../example-docs/example-with-scripts.html\")\n",
    "\n",
    "print(f\"total number of elements: {len(elements)}\")\n",
    "\n",
    "# first element\n",
    "print(\"\\nfirst element, some fields:\\n\")\n",
    "print(elements[0].category)\n",
    "print(elements[0].text)\n",
    "print(elements[0].metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2714431-0fc4-40a2-a013-91a5de943e8b",
   "metadata": {},
   "source": [
    "For this example we will focus only on the HTML article's body (paragraphs), so let's filter the list of `Element` objects to keep only `Text`-type elements (`NarrativeText` and `UncategorizedText`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "839a27e4-1f9e-417a-8539-8555b38dbb3a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total number of \"Text\" elements: 88\n",
      "\n",
      "first \"Text\" element, some fields:\n",
      "\n",
      "UncategorizedText\n",
      "Dec 13, 2022 - Press\n",
      "                                        ISW\n",
      "ElementMetadata(filename='../../example-docs/example-with-scripts.html', page_number=1, url=None)\n"
     ]
    }
   ],
   "source": [
    "text_elements = [element for element in elements if \"Text\" in element.category]\n",
    "\n",
    "print(f'total number of \"Text\" elements: {len(text_elements)}')\n",
    "\n",
    "# first Text element\n",
    "\n",
    "print('\\nfirst \"Text\" element, some fields:\\n')\n",
    "print(text_elements[0].category)\n",
    "print(text_elements[0].text)\n",
    "print(text_elements[0].metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d23140f2-0ab6-4aa2-a37e-a16af8f3429b",
   "metadata": {},
   "source": [
    "Now, one of the simplest ways to upload data to an `Elasticsearch` index is to call the API with Python dictionaries as the payload. To get an element's data as a Python dictionary, call the `to_dict()` method on the `Element` object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "ad47c9f0-6d31-4c3f-a004-39a947ee85b3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'element_id': 'fd853487ab296eece56a863ed64cafdb',\n",
       " 'coordinates': None,\n",
       " 'text': 'Dec 13, 2022 - Press\\n                                        ISW',\n",
       " 'type': 'UncategorizedText',\n",
       " 'metadata': {'filename': '../../example-docs/example-with-scripts.html',\n",
       "  'page_number': 1}}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_elements[0].to_dict()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a560cc89-200a-4f86-a665-ea0115c7d297",
   "metadata": {},
   "source": [
    "But making this transformation by hand for each of the **88** elements is impractical. The `convert_to_dict` function from our [staging functions](https://unstructured-io.github.io/unstructured/functions.html#convert-to-dict) converts a list of `Element` objects to a list of dictionaries. This is the default format for representing documents in `Unstructured`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2d7033f9-15db-4b49-9ba7-51168f2ece9e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"element_id\": \"218e47afd026feae22d7ca6a1745706e\",\n",
      "  \"coordinates\": null,\n",
      "  \"text\": \"Belarusian forces remain unlikely to\\n                                                    attack Ukraine despite a snap Belarusian military readiness check on\\n                                                    December 13. Belarusian President Alexander Lukashenko\\n                                                ordered a snap comprehensive readiness check of the Belarusian military\\n                                                on December 13. The exercise does not appear to be cover for\\n                                                concentrating Belarusian and/or Russian forces near jumping-off\\n                                                positions for an invasion of Ukraine. It involves Belarusian elements\\n                                                deploying to training grounds across Belarus, conducting engineering\\n                                                tasks, and practicing crossing the Neman and Berezina rivers (which are\\n                                                over 170 km and 70 km away from the Belarusian-Ukrainian border,\\n                                                respectively).[1] Social media footage posted on December 13 showed a\\n                                                column of likely Belarusian infantry fighting vehicles and trucks\\n                                                reportedly moving from Kolodishchi (just east of Minsk) toward Hatava\\n                                                (6km south of Minsk).[2] Belarusian forces reportedly deployed 25\\n                                                BTR-80s and 30 trucks with personnel toward Malaryta, Brest (about 15 km\\n                                                from Ukraine) on December 13.[3] Russian T-80 tanks reportedly deployed\\n                                                from the Obuz-Lesnovsky Training Ground in Brest, Belarus, to the Brest\\n                                              
  Training Ground also in Brest (about 30 km from the Belarusian-Ukrainian\\n                                                Border) around December 12.[4] Russia reportedly deployed three MiG-31K\\n                                                interceptors to the Belarusian airfield in Machulishchy on December\\n                                                13.[5] These deployments are likely part of ongoing Russian information\\n                                                operations suggesting that Belarusian conventional ground forces might\\n                                                join Russia\\u2019s invasion of Ukraine.[6] ISW has written at length about\\n                                                why Belarus is extraordinarily unlikely to invade Ukraine in the\\n                                                foreseeable future.[7]\",\n",
      "  \"type\": \"NarrativeText\",\n",
      "  \"metadata\": {\n",
      "    \"filename\": \"../../example-docs/example-with-scripts.html\",\n",
      "    \"page_number\": 1\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "text_elements_dict = convert_to_dict(text_elements)\n",
    "\n",
    "# display one arbitrary Text element from text_elements_dict\n",
    "\n",
    "print(json.dumps(text_elements_dict[4], indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4b0e7af-04fc-4132-9228-f14cb51acfd1",
   "metadata": {},
   "source": [
    "The `text` field in the element dictionaries has been parsed but is not *clean*: it still carries the line breaks and indentation of the source HTML. Let's apply one of our basic [cleaning functions](https://unstructured-io.github.io/unstructured/functions.html#clean-extra-whitespace), `clean_extra_whitespace`, to improve the output:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "482aa51b-e1e9-4290-9f08-32ec8b7f146d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"element_id\": \"218e47afd026feae22d7ca6a1745706e\",\n",
      "  \"coordinates\": null,\n",
      "  \"text\": \"Belarusian forces remain unlikely to attack Ukraine despite a snap Belarusian military readiness check on December 13. Belarusian President Alexander Lukashenko ordered a snap comprehensive readiness check of the Belarusian military on December 13. The exercise does not appear to be cover for concentrating Belarusian and/or Russian forces near jumping-off positions for an invasion of Ukraine. It involves Belarusian elements deploying to training grounds across Belarus, conducting engineering tasks, and practicing crossing the Neman and Berezina rivers (which are over 170 km and 70 km away from the Belarusian-Ukrainian border, respectively).[1] Social media footage posted on December 13 showed a column of likely Belarusian infantry fighting vehicles and trucks reportedly moving from Kolodishchi (just east of Minsk) toward Hatava (6km south of Minsk).[2] Belarusian forces reportedly deployed 25 BTR-80s and 30 trucks with personnel toward Malaryta, Brest (about 15 km from Ukraine) on December 13.[3] Russian T-80 tanks reportedly deployed from the Obuz-Lesnovsky Training Ground in Brest, Belarus, to the Brest Training Ground also in Brest (about 30 km from the Belarusian-Ukrainian Border) around December 12.[4] Russia reportedly deployed three MiG-31K interceptors to the Belarusian airfield in Machulishchy on December 13.[5] These deployments are likely part of ongoing Russian information operations suggesting that Belarusian conventional ground forces might join Russia\\u2019s invasion of Ukraine.[6] ISW has written at length about why Belarus is extraordinarily unlikely to invade Ukraine in the foreseeable future.[7]\",\n",
      "  \"type\": \"NarrativeText\",\n",
      "  \"metadata\": {\n",
      "    \"filename\": \"../../example-docs/example-with-scripts.html\",\n",
      "    \"page_number\": 1\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "clean_text_elements_dict = []\n",
    "\n",
    "for element_dict in text_elements_dict:\n",
    "    element_dict[\"text\"] = clean_extra_whitespace(element_dict[\"text\"])\n",
    "    clean_text_elements_dict.append(element_dict)\n",
    "\n",
    "# display the same Text element after cleaning whitespace\n",
    "\n",
    "print(json.dumps(clean_text_elements_dict[4], indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04e212b2-bf98-4c55-8a4b-8bab200832a0",
   "metadata": {},
   "source": [
    "Now that the data is pre-processed, we can upload it to an `Elasticsearch` index. Let's start the client connection, authenticating via an `es-credentials.ini` file containing the `cloud_id`, `user`, and `password` information. Before running the following steps, replace the `CLOUD_ID`, `USER`, and `PASSWORD` tokens in the credentials file and make sure you have already created an index."
   ]
  },
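  {
   "cell_type": "markdown",
   "id": "es-credentials-ini-example",
   "metadata": {},
   "source": [
    "As a minimal sketch, assuming the credentials file follows the layout that the next cell reads (an `ELASTIC` section with `cloud_id`, `user`, and `password` keys), `es-credentials.ini` could look like this, with the `CLOUD_ID`, `USER`, and `PASSWORD` placeholders replaced by your own values:\n",
    "\n",
    "```ini\n",
    "[ELASTIC]\n",
    "cloud_id = CLOUD_ID\n",
    "user = USER\n",
    "password = PASSWORD\n",
    "```"
   ]
  },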
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "f904e5fe-cf1d-4625-8553-04e1121b5a25",
   "metadata": {},
   "outputs": [],
   "source": [
    "# read credentials\n",
    "\n",
    "config = configparser.ConfigParser()\n",
    "config.read(\"es-credentials.ini\")  # path to credentials file\n",
    "\n",
    "# Instantiate the Elasticsearch connection\n",
    "# (in elasticsearch-py 8.x, `http_auth` is deprecated in favor of `basic_auth`)\n",
    "\n",
    "es_client = Elasticsearch(\n",
    "    cloud_id=config[\"ELASTIC\"][\"cloud_id\"],\n",
    "    http_auth=(config[\"ELASTIC\"][\"user\"], config[\"ELASTIC\"][\"password\"]),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "983415aa-cabe-4bdc-ad51-544846115a99",
   "metadata": {},
   "source": [
    "The following command can be executed to display the client information on the notebook:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f6c69dc-43c6-4872-92be-e9a19d6047aa",
   "metadata": {},
   "source": [
    "```python\n",
    "print(json.dumps(es_client.info(), indent=2))\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "102894a8-b76f-4341-966a-acc0f67a1c3c",
   "metadata": {},
   "source": [
    "We can now iterate through the list of pre-processed `Text` elements and analyse their `polarity` and `subjectivity` with the `TextBlob` library, deriving a `sentiment` label from the sign of the polarity. In the same step we upload each element dictionary to an existing, empty `Elasticsearch` index called `search-unstructured-elements`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "b584aa56-c13c-4391-b262-e44238af9d58",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 88/88 [00:16<00:00,  5.47it/s]\n"
     ]
    }
   ],
   "source": [
    "for element in tqdm(clean_text_elements_dict):\n",
    "    element_blob = TextBlob(element[\"text\"])\n",
    "    element[\"polarity\"] = round(element_blob.sentiment.polarity, 4)\n",
    "    element[\"subjectivity\"] = round(element_blob.sentiment.subjectivity, 4)\n",
    "\n",
    "    if element[\"polarity\"] < 0:\n",
    "        element[\"sentiment\"] = \"negative\"\n",
    "    elif element[\"polarity\"] == 0:\n",
    "        element[\"sentiment\"] = \"neutral\"\n",
    "    else:\n",
    "        element[\"sentiment\"] = \"positive\"\n",
    "\n",
    "    es_client.index(index=\"search-unstructured-elements\", document=element)  # your index name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63b5e180-20a4-4036-adac-5649fe6464b9",
   "metadata": {},
   "source": [
    "🚀 Your data is now ready in your `Elasticsearch` index!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aabf3754-93f1-45a2-b64b-400f5c1a2cff",
   "metadata": {},
   "source": [
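    "Finally, let's retrieve only the `Text` elements with a non-neutral sentiment (`polarity != 0.0`) with the help of `elasticsearch_dsl`, and then order them by `subjectivity` score, using the `polarity` score as the tie-breaker. Python's sort is stable, so the second `sorted` call below sets the primary order and the first one only decides ties:"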
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "d5ad3659-7378-4670-822e-efbd7c9a68bb",
   "metadata": {},
   "outputs": [],
   "source": [
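    "# Restrict both searches to the index populated above; without `index`, every index is searched.\n",
    "# Note: `execute()` returns only the first 10 hits of each query by default.\n",
    "index_name = \"search-unstructured-elements\"  # your index name\n",
    "\n",
    "s_pos = Search(using=es_client, index=index_name).query(\"match\", sentiment=\"positive\")\n",
    "response_pos = list(s_pos.execute())\n",
    "s = Search(using=es_client, index=index_name).query(\"match\", sentiment=\"negative\")\n",
    "response = list(s.execute())\n",
    "response.extend(response_pos)\n",
    "\n",
    "# Stable sort: the last `sorted` call (subjectivity) is the primary key; polarity breaks ties.\n",
    "sorted_elements = sorted(response, key=lambda d: d[\"polarity\"], reverse=True)\n",
    "sorted_elements = sorted(sorted_elements, key=lambda d: d[\"subjectivity\"], reverse=True)"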
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05a6d3c5-28a2-4a13-954b-c79cd546508c",
   "metadata": {},
   "source": [
    "And the most polarized and subjective Text elements in the article are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "7a6e6e48-2767-43aa-8a9f-e5537923aa71",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TOP 5 MOST POLARIZED & SUBJECTIVE TEXT ELEMENTS IN THE HTML FILE: \n",
      "\n",
      "1: Eastern Ukraine: (Eastern Kharkiv Oblast-Western Luhansk Oblast)\n",
      "sentiment: negative\n",
      "polarity: -0.75\n",
      "subjectivity: 1.0\n",
      "\n",
      "2: US officials stated on December 13 that the Pentagon is finalizing plans to send Patriot missile defense systems to Ukraine. The US officials expect to receive the necessary approvals from Defense Secretary Lloyd Austin and President Joe Biden, and the Pentagon could make a formal announcement as early as December 15.[18] CNN reported that it is unclear how many Patriot missile systems the Pentagon plan would provide Ukraine, but that a typical Patriot battery includes up to eight launchers with a capacity of four ready-to-fire missiles each, radar targeting systems, computers, power generators, and an engagement control station.[19]\n",
      "sentiment: positive\n",
      "polarity: 0.1083\n",
      "subjectivity: 0.575\n",
      "\n",
      "3: Ukrainian officials continue to assess\n",
      "                                                    that Belarus is unlikely to attack Ukraine as of December 13.\n",
      "                                                The Ukrainian General Staff reiterated on December 13 that the\n",
      "                                                situation in northern Ukraine near Belarus has not significantly changed\n",
      "                                                and that Ukrainian authorities still have not detected Russian forces\n",
      "                                                forming strike groups in Belarus.[8] The Ukrainian State Border Guard\n",
      "                                                Service reported that the situation on the border with Belarus is under\n",
      "                                                control despite recent Belarusian readiness checks.[9]\n",
      "sentiment: negative\n",
      "polarity: -0.0896\n",
      "subjectivity: 0.4208\n",
      "\n",
      "4: Ukrainian officials continue to assess that Belarus is unlikely to attack Ukraine as of December 13. The Ukrainian General Staff reiterated on December 13 that the situation in northern Ukraine near Belarus has not significantly changed and that Ukrainian authorities still have not detected Russian forces forming strike groups in Belarus.[8] The Ukrainian State Border Guard Service reported that the situation on the border with Belarus is under control despite recent Belarusian readiness checks.[9]\n",
      "sentiment: negative\n",
      "polarity: -0.0896\n",
      "subjectivity: 0.4208\n",
      "\n",
      "5: Ukrainian intelligence reported that\n",
      "                                                    Russian forces are striking Ukraine with missiles that Ukraine\n",
      "                                                    transferred to Russian in the 1990s as part of an international\n",
      "                                                    agreement that Russia explicitly violated by invading Ukraine in\n",
      "                                                    2014 and 2022. In a comment to The New York Times\n",
      "                                                Ukrainian Main Intelligence Directorate (GUR) representative Vadym\n",
      "                                                Skibitsky said that Russian forces are using ballistic missiles and\n",
      "                                                Tu-160 and Tu-95 strategic bombers that Ukraine transferred to Russia as\n",
      "                                                part of the Budapest Memorandum, whereby Ukraine transferred its nuclear\n",
      "                                                arsenal to Russia for decommissioning.[16] Russia, the United States,\n",
      "                                                and the United Kingdom committed in return to \"respect the independence\n",
      "                                                and sovereignty and existing borders of Ukraine.\" This agreement has\n",
      "                                                generated some debate about whether or not it committed the United\n",
      "                                                States and the United Kingdom to defend Ukraine, which it did not do.\n",
      "                                                There can be no debate, however, that by this agreement Russia\n",
      "                                                explicitly recognized that Crimea and areas of Donetsk and Luhansk\n",
      "                                                Oblasts it occupied in 2014 were parts of Ukraine. By that agreement\n",
      "                                                Russia also committed \"to refrain from the threat or use of force\n",
      "                                                against the territorial integrity or political independence of Ukraine,\"\n",
      "                                                among many other provisions that Russia has violated. Skibitsky noted\n",
      "                                                that Russia has removed the nuclear warhead from these decommissioned\n",
      "                                                Kh-55 subsonic cruise missiles, which are now being used to launch\n",
      "                                                massive missile strikes on Ukraine.[17]\n",
      "sentiment: positive\n",
      "polarity: 0.1071\n",
      "subjectivity: 0.3421\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(\"TOP 5 MOST POLARIZED & SUBJECTIVE TEXT ELEMENTS IN THE HTML FILE: \\n\")\n",
    "\n",
    "for ix, hit in enumerate(sorted_elements, start=1):\n",
    "    print(\n",
    "        f\"{ix}: {hit.text}\\nsentiment: {hit.sentiment}\\npolarity: {hit.polarity}\\nsubjectivity: {hit.subjectivity}\\n\"\n",
    "    )\n",
    "    if ix == 5:\n",
    "        break"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}