## Indexing Sentiment Analysis Data from Unstructured elements to Elasticsearch

The goal of this notebook is to show how to load `Unstructured` output [Elements](https://unstructured-io.github.io/unstructured/getting_started.html#document-elements) together with basic sentiment analysis information into an `Elasticsearch` index. Check the official
[Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/v8.8.0/) to learn more about working with indexes in Python.

In this example, we'll show how to:

- Load `Unstructured` `Element` objects, enriched with a fast sentiment analysis, into an `Elasticsearch` index.
- Retrieve the stored documents from `Elasticsearch` using a [Search DSL](https://elasticsearch-dsl.readthedocs.io/en/latest/search_dsl.html) query to get the top 5 most polarized and subjective `Text` elements in an HTML file entitled *"Russian Offensive Campaign"*.

The sentiment analysis itself is handled by a third-party library, [TextBlob](https://textblob.readthedocs.io/en/dev/).

In [1]:
# Dependencies

import configparser
import json

from unstructured.staging.base import convert_to_dict
from unstructured.cleaners.core import clean_extra_whitespace
from unstructured.partition.auto import partition

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

from textblob import TextBlob
from tqdm import tqdm

The html file that is going to be partitioned exists inside the [example-docs](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs) directory. You can render the html inside the notebook by executing the following snippet in a new cell:

```python
from IPython.display import display, HTML

with open("../../example-docs/example-with-scripts.html", 'r') as file:
    html_content = file.read()

display(HTML(html_content))
```

Let's start by calling our standard `partition` method from `partition.auto` to obtain a list of document `Element` objects out of the target html file content. These element objects represent different components of the source document.

In [2]:
elements = partition("../../example-docs/example-with-scripts.html")

print(f"total number of elements: {len(elements)}")

# first element
print("\nfirst element, some fields:\n")
print(elements[0].category)
print(elements[0].text)
print(elements[0].metadata)

total number of elements: 159

first element, some fields:

Title
Skip to main content
ElementMetadata(filename='../../example-docs/example-with-scripts.html', page_number=1, url=None)


For this example we will focus only on the HTML article's body (paragraphs), so let's filter the list of `Element` objects to keep only the `Text` type element objects (`NarrativeText` and `UncategorizedText`).

In [3]:
text_elements = [element for element in elements if "Text" in element.category]

print(f'total number of "Text" elements: {len(text_elements)}')

# first Text element

print('\nfirst "Text" element, some fields:\n')
print(text_elements[0].category)
print(text_elements[0].text)
print(text_elements[0].metadata)

total number of "Text" elements: 88

first "Text" element, some fields:

UncategorizedText
Dec 13, 2022 - Press
 ISW
ElementMetadata(filename='../../example-docs/example-with-scripts.html', page_number=1, url=None)
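The substring test `"Text" in element.category` keeps every category whose name contains `Text`. If you would rather name the categories explicitly, a membership test against a set is one alternative. The sketch below uses hypothetical stand-in objects (`SimpleNamespace` with a `category` attribute) rather than real `Element` objects so that it runs without `unstructured` installed; `TEXT_CATEGORIES` and the sample data are illustrative names, not part of the library.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-ins for Element objects; only `category` and `text` are modelled.
sample_elements = [
    SimpleNamespace(category="Title", text="Skip to main content"),
    SimpleNamespace(category="UncategorizedText", text="Dec 13, 2022 - Press ISW"),
    SimpleNamespace(category="NarrativeText", text="Belarusian forces remain unlikely to attack."),
    SimpleNamespace(category="ListItem", text="An item"),
]

# Illustrative name: the two categories this notebook keeps.
TEXT_CATEGORIES = {"NarrativeText", "UncategorizedText"}

# Filter used in the notebook (substring match) vs. explicit membership.
by_substring = [e for e in sample_elements if "Text" in e.category]
by_membership = [e for e in sample_elements if e.category in TEXT_CATEGORIES]

# Count how many elements of each category survived the filter.
category_counts = Counter(e.category for e in by_membership)
print(category_counts)
```

On this sample both filters keep the same two elements; the explicit set simply makes the intent visible and avoids picking up any other category that happens to contain `Text` in its name.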


Now, one of the simplest ways to upload data to an `Elasticsearch` index is to call the API with Python dictionaries as the payload. To get an element's data as a Python dictionary, call the `to_dict()` method on the `Element` object:

In [4]:
text_elements[0].to_dict()

{'element_id': 'fd853487ab296eece56a863ed64cafdb',
 'coordinates': None,
 'text': 'Dec 13, 2022 - Press\n ISW',
 'type': 'UncategorizedText',
 'metadata': {'filename': '../../example-docs/example-with-scripts.html',
 'page_number': 1}}

But making this transformation one element at a time for all **88** elements is impractical. The function `convert_to_dict` from our [staging functions](https://unstructured-io.github.io/unstructured/functions.html#convert-to-dict) converts a list of `Element` objects to a list of dictionaries in a single call. This is the default format for representing documents in `Unstructured`.

In [5]:
text_elements_dict = convert_to_dict(text_elements)

# display one arbitrary Text element from text_elements_dict

print(json.dumps(text_elements_dict[4], indent=2))

{
 "element_id": "218e47afd026feae22d7ca6a1745706e",
 "coordinates": null,
 "text": "Belarusian forces remain unlikely to\n attack Ukraine despite a snap Belarusian military readiness check on\n December 13. Belarusian President Alexander Lukashenko\n ordered a snap comprehensive readiness check of the Belarusian military\n on December 13. The exercise does not appear to be cover for\n concentrating Belarusian and/or Russian forces near jumping-off\n positions for an invasion of Ukraine. It involves Belarusian elements\n deploying to training grounds across Belarus, conducting engineering\n tasks, and practicing crossing the Neman and Berezina rivers (which are\n over 170 km and 70 km away from the Belarusian-Ukrainian border,\n respectively).[1] Social media footage posted on December 13 showed a\n column of likely Belarusian infantry fighting vehicles and trucks\n reportedly moving from Kolodishchi (just east of Minsk) toward Hatava\n (6km south of Minsk).[2] Belarusian forces report

The `text` field in the element dictionaries has been parsed but is not *clean*. Let's apply one of our basic [cleaning functions](https://unstructured-io.github.io/unstructured/functions.html#clean-extra-whitespace) `clean_extra_whitespace` to improve the output:
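To give a rough idea of what this kind of cleaning involves, here is a minimal standard-library approximation. It is an assumption about the behaviour (newlines and non-breaking spaces become spaces, runs of spaces collapse, ends are trimmed), not the library's implementation, and `approx_clean_extra_whitespace` is a hypothetical name used only for this sketch.

```python
import re


def approx_clean_extra_whitespace(text: str) -> str:
    """Rough stand-in: replace newlines and non-breaking spaces with spaces,
    collapse runs of spaces into one, and strip leading/trailing whitespace."""
    text = re.sub(r"[\xa0\n]", " ", text)
    text = re.sub(r" {2,}", " ", text)
    return text.strip()


# Same shape as the first Text element shown earlier: a newline followed by a space.
raw = "Dec 13, 2022 - Press\n ISW"
cleaned = approx_clean_extra_whitespace(raw)
print(cleaned)
```

The notebook itself uses the real `clean_extra_whitespace` function in the next cell.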

In [6]:
clean_text_elements_dict = []

for element_dict in text_elements_dict:
    element_dict["text"] = clean_extra_whitespace(element_dict["text"])
    clean_text_elements_dict.append(element_dict)

# display one arbitrary Text element after cleaning whitespace

print(json.dumps(clean_text_elements_dict[4], indent=2))

{
 "element_id": "218e47afd026feae22d7ca6a1745706e",
 "coordinates": null,
 "text": "Belarusian forces remain unlikely to attack Ukraine despite a snap Belarusian military readiness check on December 13. Belarusian President Alexander Lukashenko ordered a snap comprehensive readiness check of the Belarusian military on December 13. The exercise does not appear to be cover for concentrating Belarusian and/or Russian forces near jumping-off positions for an invasion of Ukraine. It involves Belarusian elements deploying to training grounds across Belarus, conducting engineering tasks, and practicing crossing the Neman and Berezina rivers (which are over 170 km and 70 km away from the Belarusian-Ukrainian border, respectively).[1] Social media footage posted on December 13 showed a column of likely Belarusian infantry fighting vehicles and trucks reportedly moving from Kolodishchi (just east of Minsk) toward Hatava (6km south of Minsk).[2] Belarusian forces reportedly deployed 25 BTR-80s a

Now that the data is pre-processed, we can proceed to upload it to an `Elasticsearch` index. Let's start the client connection, authenticating via an `es-credentials.ini` file containing the `cloud_id`, `user`, and `password` information. For the following steps, replace the `CLOUD_ID`, `USER`, and `PASSWORD` tokens in the credentials file with your own values, and make sure you have already created an index.
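The credentials file is not shown in this notebook. Judging from the keys read in the next cell, it is assumed to have an `[ELASTIC]` section with three entries. The sketch below parses such a layout from a string (so it runs without a file on disk) and fails early if a key is missing; `EXAMPLE_CREDENTIALS`, `example_config`, and the placeholder values are illustrative.

```python
import configparser

# Assumed layout of es-credentials.ini; the values are placeholders to replace.
EXAMPLE_CREDENTIALS = """
[ELASTIC]
cloud_id = CLOUD_ID
user = USER
password = PASSWORD
"""

example_config = configparser.ConfigParser()
example_config.read_string(EXAMPLE_CREDENTIALS)

# Fail early with a clear message if any expected key is absent.
required_keys = ("cloud_id", "user", "password")
missing = [key for key in required_keys if key not in example_config["ELASTIC"]]
if missing:
    raise KeyError(f"missing keys in [ELASTIC]: {missing}")

print(sorted(example_config["ELASTIC"].keys()))
```

Keeping credentials in a separate file like this also makes it easy to exclude them from version control.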

In [7]:
# read credentials

config = configparser.ConfigParser()
config.read("es-credentials.ini") # path to credentials file

# Instantiate the Elasticsearch connection

es_client = Elasticsearch(
    cloud_id=config["ELASTIC"]["cloud_id"],
    # `basic_auth` replaces the deprecated `http_auth` argument in elasticsearch-py 8.x
    basic_auth=(config["ELASTIC"]["user"], config["ELASTIC"]["password"]),
)

The following command can be executed to display the client information on the notebook:

```python
print(json.dumps(es_client.info(), indent=2))
```

We can now iterate through the list of pre-processed `Text` elements and analyse their `polarity`, `subjectivity`, and `sentiment` using the `TextBlob` library. In the same step we can upload each of the element dictionaries to an existing empty `Elasticsearch` index called `search-unstructured-elements`:

In [8]:
for element in tqdm(clean_text_elements_dict):
    element_blob = TextBlob(element["text"])
    element["polarity"] = round(element_blob.sentiment.polarity, 4)
    element["subjectivity"] = round(element_blob.sentiment.subjectivity, 4)

    if element["polarity"] < 0:
        element["sentiment"] = "negative"
    elif element["polarity"] == 0:
        element["sentiment"] = "neutral"
    else:
        element["sentiment"] = "positive"

    es_client.index(index="search-unstructured-elements", document=element)  # your index name

100%|██████████| 88/88 [00:16<00:00, 5.47it/s]


🚀 Your data is now ready in your `Elasticsearch` index!
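Indexing one document per request is fine for 88 elements, but for larger collections a bulk request is usually faster. The sketch below only builds the action dictionaries in the shape expected by `elasticsearch.helpers.bulk`; the call itself is left as a comment because it needs a live client. Using `element_id` as the document `_id` is an assumption made here so that re-running the upload overwrites documents instead of duplicating them. `build_bulk_actions` and `sample_docs` are hypothetical names.

```python
from typing import Any, Dict, Iterable, Iterator


def build_bulk_actions(
    docs: Iterable[Dict[str, Any]], index_name: str
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per document, keyed by its element_id."""
    for doc in docs:
        yield {"_index": index_name, "_id": doc["element_id"], "_source": doc}


# Made-up documents with the same fields the notebook produces.
sample_docs = [
    {"element_id": "a1", "text": "first", "polarity": 0.2, "subjectivity": 0.4, "sentiment": "positive"},
    {"element_id": "b2", "text": "second", "polarity": 0.0, "subjectivity": 0.0, "sentiment": "neutral"},
]

actions = list(build_bulk_actions(sample_docs, "search-unstructured-elements"))
print(actions[0]["_id"], actions[0]["_index"])

# With a live client, the actions could be sent in one call (not executed here):
#   from elasticsearch.helpers import bulk
#   bulk(es_client, actions)
```

Note that elements with identical ids would overwrite each other under this choice of `_id`; drop the `_id` key to let `Elasticsearch` generate ids instead.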

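Finally, let's retrieve only the `Text` elements with a non-neutral sentiment (`polarity != 0.0`) with the help of `elasticsearch_dsl`, and then sort them in descending order, first by `polarity` and then by `subjectivity`. Because Python's sort is stable, the second sort makes `subjectivity` the primary key and leaves `polarity` as the tie-breaker. Note that `execute()` returns only the first 10 hits of each query by default: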

In [9]:
s_pos = Search().using(es_client).query("match", sentiment="positive")
response_pos = list(s_pos.execute())
s = Search().using(es_client).query("match", sentiment="negative")
response = list(s.execute())
response.extend(response_pos)

sorted_elements = sorted(response, key=lambda d: d["polarity"], reverse=True)
sorted_elements = sorted(sorted_elements, key=lambda d: d["subjectivity"], reverse=True)
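The two successive `sorted` calls can be collapsed into a single call with a tuple key. The sketch below checks that equivalence on made-up records and also shows a variation, not used in this notebook, that breaks ties by absolute polarity so that strongly negative and strongly positive texts are treated alike. The records and variable names are illustrative.

```python
# Made-up records standing in for the search hits.
records = [
    {"text": "a", "polarity": 0.1, "subjectivity": 0.5},
    {"text": "b", "polarity": -0.75, "subjectivity": 1.0},
    {"text": "c", "polarity": 0.6, "subjectivity": 0.5},
    {"text": "d", "polarity": -0.9, "subjectivity": 0.5},
]

# Two-pass sort as in the notebook: polarity first, then subjectivity.
two_pass = sorted(records, key=lambda d: d["polarity"], reverse=True)
two_pass = sorted(two_pass, key=lambda d: d["subjectivity"], reverse=True)

# Equivalent single pass: subjectivity is the primary key, polarity the tie-breaker.
single_pass = sorted(
    records, key=lambda d: (d["subjectivity"], d["polarity"]), reverse=True
)

# Variation (an assumption, not the notebook's behaviour): tie-break on |polarity|.
by_strength = sorted(
    records, key=lambda d: (d["subjectivity"], abs(d["polarity"])), reverse=True
)

print([r["text"] for r in two_pass])
print([r["text"] for r in by_strength])
```

With signed polarity, a strongly negative text such as `d` ranks last among records with equal subjectivity; with the absolute value it ranks first among them.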

And the most polarized and subjective Text elements in the article are:

In [10]:
print("TOP 5 MOST POLARIZED & SUBJECTIVE TEXT ELEMENTS IN THE HTML FILE: \n")

for ix, hit in enumerate(sorted_elements, start=1):
    print(
        f"{ix}: {hit.text}\nsentiment: {hit.sentiment}\npolarity: {hit.polarity}\nsubjectivity: {hit.subjectivity}\n"
    )
    if ix == 5:
        break

TOP 5 MOST POLARIZED & SUBJECTIVE TEXT ELEMENTS IN THE HTML FILE: 

1: Eastern Ukraine: (Eastern Kharkiv Oblast-Western Luhansk Oblast)
sentiment: negative
polarity: -0.75
subjectivity: 1.0

2: US officials stated on December 13 that the Pentagon is finalizing plans to send Patriot missile defense systems to Ukraine. The US officials expect to receive the necessary approvals from Defense Secretary Lloyd Austin and President Joe Biden, and the Pentagon could make a formal announcement as early as December 15.[18] CNN reported that it is unclear how many Patriot missile systems the Pentagon plan would provide Ukraine, but that a typical Patriot battery includes up to eight launchers with a capacity of four ready-to-fire missiles each, radar targeting systems, computers, power generators, and an engagement control station.[19]
sentiment: positive
polarity: 0.1083
subjectivity: 0.575

3: Ukrainian officials continue to assess
 that Belarus is unlikely to attack Ukraine as of December 13.
 