2023-06-05 21:05:08 +02:00
{
"cells": [
{
"cell_type": "markdown",
"id": "176b9504-8338-42d5-ab3f-ef2bd9ac8fe7",
"metadata": {},
"source": [
"## Indexing Sentiment Analysis Data from Unstructured elements to Elasticsearch"
]
},
{
"cell_type": "markdown",
"id": "30e7d198-974d-4ded-a26e-936b3ce704e8",
"metadata": {},
"source": [
"The goal of this notebook is to show how to load `Unstructured` output [Elements](https://unstructured-io.github.io/unstructured/getting_started.html#document-elements) together with basic sentiment analysis information into an `Elasticsearch` index. Check the official\n",
"[Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/v8.8.0/) to learn more about working with indexes in Python.\n",
"\n",
"In this example, we'll show how to:\n",
"\n",
"- Load `Unstructured` output `Element` objects, together with a fast sentiment analysis, into an `Elasticsearch` index.\n",
"- Retrieve the stored documents from `Elasticsearch` using a [Search DSL](https://elasticsearch-dsl.readthedocs.io/en/latest/search_dsl.html) query to get the *top 5* most polarized and subjective `Text` elements in an HTML file entitled *\"Russian Offensive Campaign\"*.\n",
"\n",
"The sentiment analysis itself is handled by a third-party library, [TextBlob](https://textblob.readthedocs.io/en/dev/)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "28eb74e4-5560-4013-8236-ac159d87eff0",
"metadata": {},
"outputs": [],
"source": [
"# Dependencies\n",
"\n",
"import configparser\n",
"import json\n",
"\n",
"from unstructured.staging.base import convert_to_dict\n",
"from unstructured.cleaners.core import clean_extra_whitespace\n",
"from unstructured.partition.auto import partition\n",
"\n",
"from elasticsearch import Elasticsearch\n",
"from elasticsearch_dsl import Search\n",
"\n",
"from textblob import TextBlob\n",
"from tqdm import tqdm"
]
},
{
"cell_type": "markdown",
"id": "3d5acda6-b41d-4bed-9dc3-d6e8419fed72",
"metadata": {
"tags": []
},
"source": [
"The HTML file to be partitioned is located in the [example-docs](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs) directory. You can render the HTML inside the notebook by executing the following snippet in a new cell:\n",
"\n",
"```python\n",
"from IPython.display import display, HTML\n",
"\n",
"with open(\"../../example-docs/example-with-scripts.html\", 'r') as file:\n",
" html_content = file.read()\n",
"\n",
"display(HTML(html_content))\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "b1f168a9-5c80-47f5-b417-cd44303b5324",
"metadata": {},
"source": [
"Let's start by calling the standard `partition` function from `partition.auto` to obtain a list of document `Element` objects from the target HTML file. These element objects represent the different components of the source document."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8b0dd307-9a85-41a1-89cb-1b5c173fec36",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total number of elements: 159\n",
"\n",
"first element, some fields:\n",
"\n",
"Title\n",
"Skip to main content\n",
"ElementMetadata(filename='../../example-docs/example-with-scripts.html', page_number=1, url=None)\n"
]
}
],
"source": [
"elements = partition(\"../../example-docs/example-with-scripts.html\")\n",
"\n",
"print(f\"total number of elements: {len(elements)}\")\n",
"\n",
"# first element\n",
"print(\"\\nfirst element, some fields:\\n\")\n",
"print(elements[0].category)\n",
"print(elements[0].text)\n",
"print(elements[0].metadata)"
]
},
{
"cell_type": "markdown",
"id": "b2714431-0fc4-40a2-a013-91a5de943e8b",
"metadata": {},
"source": [
"For this example we will focus only on the body of the HTML article (its paragraphs), so let's filter the list of `Element` objects to keep only the `Text` type elements (`NarrativeText` and `UncategorizedText` element objects)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "839a27e4-1f9e-417a-8539-8555b38dbb3a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total number of \"Text\" elements: 88\n",
"\n",
"first \"Text\" element, some fields:\n",
"\n",
"UncategorizedText\n",
"Dec 13, 2022 - Press\n",
" ISW\n",
"ElementMetadata(filename='../../example-docs/example-with-scripts.html', page_number=1, url=None)\n"
]
}
],
"source": [
"text_elements = [element for element in elements if \"Text\" in element.category]\n",
"\n",
"print(f'total number of \"Text\" elements: {len(text_elements)}')\n",
"\n",
"# first Text element\n",
"\n",
"print('\\nfirst \"Text\" element, some fields:\\n')\n",
"print(text_elements[0].category)\n",
"print(text_elements[0].text)\n",
"print(text_elements[0].metadata)"
]
},
{
"cell_type": "markdown",
"id": "d23140f2-0ab6-4aa2-a37e-a16af8f3429b",
"metadata": {},
"source": [
"One of the simplest ways to upload data to an `Elasticsearch` index is to call the API with Python dictionaries as the payload. To get an element's data as a Python dictionary, the `Element` object can be transformed with its `to_dict()` method:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ad47c9f0-6d31-4c3f-a004-39a947ee85b3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'element_id': 'fd853487ab296eece56a863ed64cafdb',\n",
" 'coordinates': None,\n",
" 'text': 'Dec 13, 2022 - Press\\n ISW',\n",
" 'type': 'UncategorizedText',\n",
" 'metadata': {'filename': '../../example-docs/example-with-scripts.html',\n",
" 'page_number': 1}}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_elements[0].to_dict()"
]
},
{
"cell_type": "markdown",
"id": "a560cc89-200a-4f86-a665-ea0115c7d297",
"metadata": {},
"source": [
"But applying this transformation by hand to each of the **88** elements is impractical. The function `convert_to_dict` from our [staging functions](https://unstructured-io.github.io/unstructured/functions.html#convert-to-dict) converts a list of `Element` objects to a list of dictionaries. This is the default format for representing documents in `Unstructured`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "2d7033f9-15db-4b49-9ba7-51168f2ece9e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"element_id\": \"218e47afd026feae22d7ca6a1745706e\",\n",
" \"coordinates\": null,\n",
" \"text\": \"Belarusian forces remain unlikely to\\n attack Ukraine despite a snap Belarusian military readiness check on\\n December 13. Belarusian President Alexander Lukashenko\\n ordered a snap comprehensive readiness check of the Belarusian military\\n on December 13. The exercise does not appear to be cover for\\n concentrating Belarusian and/or Russian forces near jumping-off\\n positions for an invasion of Ukraine. It involves Belarusian elements\\n deploying to training grounds across Belarus, conducting engineering\\n tasks, and practicing crossing the Neman and Berezina rivers (which are\\n over 170 km and 70 km away from the Belarusian-Ukrainian border,\\n respectively).[1] Social media footage posted on December 13 showed a\\n column of likely Belarusian infantry fighting vehicles and trucks\\n reportedly moving from Kolodishchi (just east of Minsk) toward Hatava\\n (6km south of Minsk).[2] Belarusian forces reportedly deployed 25\\n BTR-80s and 30 trucks with personnel toward Malaryta, Brest (about 15 km\\n from Ukraine) on December 13.[3] Russian T-80 tanks reportedly deployed\\n from the Obuz-Lesnovsky Training Ground in Brest, Belarus, to the Brest\\n Training Ground also in Brest (about 30 km from the Belarusian-Ukrainian\\n Border) around December 12.[4] Russia reportedly deployed three MiG-31K\\n interceptors to the Belarusian airfield in Machulishchy on December\\n 13.[5] These deployments are likely part of ongoing Russian information\\n operations suggesting that Belarusian conventional ground forces might\\n join Russia\\u2019s invasion of Ukraine.[6] ISW has written at length about\\n why Belarus is extraordinarily unlikely to invade Ukraine in the\\n foreseeable future.[7]\",\n",
" \"type\": \"NarrativeText\",\n",
" \"metadata\": {\n",
" \"filename\": \"../../example-docs/example-with-scripts.html\",\n",
" \"page_number\": 1\n",
" }\n",
"}\n"
]
}
],
"source": [
"text_elements_dict = convert_to_dict(text_elements)\n",
"\n",
"# display one arbitrary Text element from text_elements_dict\n",
"\n",
"print(json.dumps(text_elements_dict[4], indent=2))"
]
},
{
"cell_type": "markdown",
"id": "d4b0e7af-04fc-4132-9228-f14cb51acfd1",
"metadata": {},
"source": [
"The `text` field in the element dictionaries has been parsed but is not *clean*: it still contains line breaks and runs of extra spaces. Let's apply one of our basic [cleaning functions](https://unstructured-io.github.io/unstructured/functions.html#clean-extra-whitespace), `clean_extra_whitespace`, to improve the output:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "482aa51b-e1e9-4290-9f08-32ec8b7f146d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"element_id\": \"218e47afd026feae22d7ca6a1745706e\",\n",
" \"coordinates\": null,\n",
" \"text\": \"Belarusian forces remain unlikely to attack Ukraine despite a snap Belarusian military readiness check on December 13. Belarusian President Alexander Lukashenko ordered a snap comprehensive readiness check of the Belarusian military on December 13. The exercise does not appear to be cover for concentrating Belarusian and/or Russian forces near jumping-off positions for an invasion of Ukraine. It involves Belarusian elements deploying to training grounds across Belarus, conducting engineering tasks, and practicing crossing the Neman and Berezina rivers (which are over 170 km and 70 km away from the Belarusian-Ukrainian border, respectively).[1] Social media footage posted on December 13 showed a column of likely Belarusian infantry fighting vehicles and trucks reportedly moving from Kolodishchi (just east of Minsk) toward Hatava (6km south of Minsk).[2] Belarusian forces reportedly deployed 25 BTR-80s and 30 trucks with personnel toward Malaryta, Brest (about 15 km from Ukraine) on December 13.[3] Russian T-80 tanks reportedly deployed from the Obuz-Lesnovsky Training Ground in Brest, Belarus, to the Brest Training Ground also in Brest (about 30 km from the Belarusian-Ukrainian Border) around December 12.[4] Russia reportedly deployed three MiG-31K interceptors to the Belarusian airfield in Machulishchy on December 13.[5] These deployments are likely part of ongoing Russian information operations suggesting that Belarusian conventional ground forces might join Russia\\u2019s invasion of Ukraine.[6] ISW has written at length about why Belarus is extraordinarily unlikely to invade Ukraine in the foreseeable future.[7]\",\n",
" \"type\": \"NarrativeText\",\n",
" \"metadata\": {\n",
" \"filename\": \"../../example-docs/example-with-scripts.html\",\n",
" \"page_number\": 1\n",
" }\n",
"}\n"
]
}
],
"source": [
"clean_text_elements_dict = []\n",
"\n",
"for element_dict in text_elements_dict:\n",
" element_dict[\"text\"] = clean_extra_whitespace(element_dict[\"text\"])\n",
" clean_text_elements_dict.append(element_dict)\n",
"\n",
"# display the same arbitrary Text element after cleaning whitespace\n",
"\n",
"print(json.dumps(clean_text_elements_dict[4], indent=2))"
]
},
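{
"cell_type": "markdown",
"id": "sketch-whitespace-cleaning",
"metadata": {},
"source": [
"To see what this kind of cleaning does in isolation, here is a minimal sketch built on Python's `re` module. The helper `collapse_whitespace` is a hypothetical stand-in written for illustration; it approximates the effect and is not the implementation of `clean_extra_whitespace`:\n",
"\n",
"```python\n",
"import re\n",
"\n",
"\n",
"def collapse_whitespace(text):\n",
"    # hypothetical stand-in: replace each run of whitespace with one space\n",
"    return re.sub(r'\\s+', ' ', text).strip()\n",
"\n",
"\n",
"collapse_whitespace('Dec 13, 2022 - Press\\n ISW')\n",
"# 'Dec 13, 2022 - Press ISW'\n",
"```"
]
},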
{
"cell_type": "markdown",
"id": "04e212b2-bf98-4c55-8a4b-8bab200832a0",
"metadata": {},
"source": [
"Now that the data is pre-processed, we can upload it to an `Elasticsearch` index. Let's start the client connection, authenticating via an `es-credentials.ini` file containing the `cloud_id`, `user`, and `password` information. For the following steps, you should replace the `CLOUD_ID`, `USER`, and `PASSWORD` tokens in the credentials file with your own values and have an index already created."
]
},
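{
"cell_type": "markdown",
"id": "sketch-credentials-file",
"metadata": {},
"source": [
"If you do not have a credentials file yet, a template can be generated with `configparser`. This is a minimal sketch that assumes the section name `ELASTIC` and the keys `cloud_id`, `user`, and `password` read by the next cell; the values are placeholder tokens that you must replace with your own deployment's details:\n",
"\n",
"```python\n",
"import configparser\n",
"from pathlib import Path\n",
"\n",
"credentials_path = Path('es-credentials.ini')\n",
"\n",
"# placeholder tokens, to be replaced with real values\n",
"template = configparser.ConfigParser()\n",
"template['ELASTIC'] = {\n",
"    'cloud_id': 'CLOUD_ID',\n",
"    'user': 'USER',\n",
"    'password': 'PASSWORD',\n",
"}\n",
"\n",
"# do not overwrite an existing credentials file\n",
"if not credentials_path.exists():\n",
"    with credentials_path.open('w') as file:\n",
"        template.write(file)\n",
"```"
]
},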
{
"cell_type": "code",
"execution_count": 7,
"id": "f904e5fe-cf1d-4625-8553-04e1121b5a25",
"metadata": {},
"outputs": [],
"source": [
"# read credentials\n",
"\n",
"config = configparser.ConfigParser()\n",
"config.read(\"es-credentials.ini\") # path to credentials file\n",
"\n",
"# Instantiate the Elasticsearch connection\n",
"\n",
"es_client = Elasticsearch(\n",
" cloud_id=config[\"ELASTIC\"][\"cloud_id\"],\n",
" http_auth=(config[\"ELASTIC\"][\"user\"], config[\"ELASTIC\"][\"password\"]),\n",
")"
]
},
{
"cell_type": "markdown",
"id": "983415aa-cabe-4bdc-ad51-544846115a99",
"metadata": {},
"source": [
"The following command can be executed to display the client information on the notebook:"
]
},
{
"cell_type": "markdown",
"id": "8f6c69dc-43c6-4872-92be-e9a19d6047aa",
"metadata": {},
"source": [
"```python\n",
"print(json.dumps(es_client.info(), indent=2))\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "102894a8-b76f-4341-966a-acc0f67a1c3c",
"metadata": {},
"source": [
"We can now iterate through the list of pre-processed `Text` elements and analyse their `polarity`, `subjectivity`, and `sentiment` using the `TextBlob` library. In the same step, we can upload each of the element dictionaries to an existing empty `Elasticsearch` index called `search-unstructured-elements`:"
]
},
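{
"cell_type": "markdown",
"id": "sketch-sentiment-labeling",
"metadata": {},
"source": [
"The labeling rule can be tried on a single string before indexing anything. The sketch below uses a hypothetical helper, `label_sentiment`, which is not part of `TextBlob` or `Unstructured`, and assumes that any polarity above zero is labeled `positive`:\n",
"\n",
"```python\n",
"from textblob import TextBlob\n",
"\n",
"\n",
"def label_sentiment(text):\n",
"    # hypothetical helper, for illustration only\n",
"    sentiment = TextBlob(text).sentiment\n",
"    polarity = round(sentiment.polarity, 4)\n",
"    subjectivity = round(sentiment.subjectivity, 4)\n",
"\n",
"    if polarity < 0:\n",
"        label = 'negative'\n",
"    elif polarity == 0:\n",
"        label = 'neutral'\n",
"    else:\n",
"        label = 'positive'  # assumed label for polarity above zero\n",
"\n",
"    return {'polarity': polarity, 'subjectivity': subjectivity, 'sentiment': label}\n",
"\n",
"\n",
"label_sentiment('The food was great, but the service was terrible.')\n",
"```\n",
"\n",
"With `TextBlob`'s default analyzer, `polarity` ranges from -1 to 1 and `subjectivity` from 0 to 1."
]
},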
{
"cell_type": "code",
"execution_count": 8,
"id": "b584aa56-c13c-4391-b262-e44238af9d58",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 88/88 [00:16<00:00, 5.47it/s]\n"
]
}
],
"source": [
"for element in tqdm(clean_text_elements_dict):\n",
" element_blob = TextBlob(element[\"text\"])\n",
" element[\"polarity\"] = round(element_blob.sentiment.polarity, 4)\n",
" element[\"subjectivity\"] = round(element_blob.sentiment.subjectivity, 4)\n",
"\n",
" if element[\"polarity\"] < 0:\n",
" element[\"sentiment\"] = \"negative\"\n",
" elif element[\"polarity\"] == 0:\n",
" element[\"sentiment\"] = \"neutral\"\n",
" else:\n",
" element[\"sentiment\"] = \"positive\"\n",
"\n",
" es_client.index(index=\"search-unstructured-elements\", document=element) # your index name"
]
},
{
"cell_type": "markdown",
"id": "63b5e180-20a4-4036-adac-5649fe6464b9",
"metadata": {},
"source": [
"🚀 Your data now is ready in your `Elasticsearch` index!"
]
},
{
"cell_type": "markdown",
"id": "aabf3754-93f1-45a2-b64b-400f5c1a2cff",
"metadata": {},
"source": [
"Finally, let's retrieve only `Text` elements with a non-neutral sentiment (`polarity`!=0.0) with the help of `elasticsearch_dsl` and then re-order them by their `polarity` score (1) and `subjectivity` score (2):"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "d5ad3659-7378-4670-822e-efbd7c9a68bb",
"metadata": {},
"outputs": [],
"source": [
"s_pos = Search().using(es_client).query(\"match\", sentiment=\"positive\")\n",
"response_pos = list(s_pos.execute())\n",
"s = Search().using(es_client).query(\"match\", sentiment=\"negative\")\n",
"response = list(s.execute())\n",
"response.extend(response_pos)\n",
"\n",
"sorted_elements = sorted(response, key=lambda d: d[\"polarity\"], reverse=True)\n",
"sorted_elements = sorted(sorted_elements, key=lambda d: d[\"subjectivity\"], reverse=True)"
]
},
{
"cell_type": "markdown",
"id": "05a6d3c5-28a2-4a13-954b-c79cd546508c",
"metadata": {},
"source": [
"And the most polarized and subjective Text elements in the article are:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "7a6e6e48-2767-43aa-8a9f-e5537923aa71",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TOP 5 MOST POLARIZED & SUBJECTIVE TEXT ELEMENTS IN THE HTML FILE: \n",
"\n",
"1: Eastern Ukraine: (Eastern Kharkiv Oblast-Western Luhansk Oblast)\n",
"sentiment: negative\n",
"polarity: -0.75\n",
"subjectivity: 1.0\n",
"\n",
"2: US officials stated on December 13 that the Pentagon is finalizing plans to send Patriot missile defense systems to Ukraine. The US officials expect to receive the necessary approvals from Defense Secretary Lloyd Austin and President Joe Biden, and the Pentagon could make a formal announcement as early as December 15.[18] CNN reported that it is unclear how many Patriot missile systems the Pentagon plan would provide Ukraine, but that a typical Patriot battery includes up to eight launchers with a capacity of four ready-to-fire missiles each, radar targeting systems, computers, power generators, and an engagement control station.[19]\n",
"sentiment: positive\n",
"polarity: 0.1083\n",
"subjectivity: 0.575\n",
"\n",
"3: Ukrainian officials continue to assess\n",
" that Belarus is unlikely to attack Ukraine as of December 13.\n",
" The Ukrainian General Staff reiterated on December 13 that the\n",
" situation in northern Ukraine near Belarus has not significantly changed\n",
" and that Ukrainian authorities still have not detected Russian forces\n",
" forming strike groups in Belarus.[8] The Ukrainian State Border Guard\n",
" Service reported that the situation on the border with Belarus is under\n",
" control despite recent Belarusian readiness checks.[9]\n",
"sentiment: negative\n",
"polarity: -0.0896\n",
"subjectivity: 0.4208\n",
"\n",
"4: Ukrainian officials continue to assess that Belarus is unlikely to attack Ukraine as of December 13. The Ukrainian General Staff reiterated on December 13 that the situation in northern Ukraine near Belarus has not significantly changed and that Ukrainian authorities still have not detected Russian forces forming strike groups in Belarus.[8] The Ukrainian State Border Guard Service reported that the situation on the border with Belarus is under control despite recent Belarusian readiness checks.[9]\n",
"sentiment: negative\n",
"polarity: -0.0896\n",
"subjectivity: 0.4208\n",
"\n",
"5: Ukrainian intelligence reported that\n",
" Russian forces are striking Ukraine with missiles that Ukraine\n",
" transferred to Russian in the 1990s as part of an international\n",
" agreement that Russia explicitly violated by invading Ukraine in\n",
" 2014 and 2022. In a comment to The New York Times\n",
" Ukrainian Main Intelligence Directorate (GUR) representative Vadym\n",
" Skibitsky said that Russian forces are using ballistic missiles and\n",
" Tu-160 and Tu-95 strategic bombers that Ukraine transferred to Russia as\n",
" part of the Budapest Memorandum, whereby Ukraine transferred its nuclear\n",
" arsenal to Russia for decommissioning.[16] Russia, the United States,\n",
" and the United Kingdom committed in return to \"respect the independence\n",
" and sovereignty and existing borders of Ukraine.\" This agreement has\n",
" generated some debate about whether or not it committed the United\n",
" States and the United Kingdom to defend Ukraine, which it did not do.\n",
" There can be no debate, however, that by this agreement Russia\n",
" explicitly recognized that Crimea and areas of Donetsk and Luhansk\n",
" Oblasts it occupied in 2014 were parts of Ukraine. By that agreement\n",
" Russia also committed \"to refrain from the threat or use of force\n",
" against the territorial integrity or political independence of Ukraine,\"\n",
" among many other provisions that Russia has violated. Skibitsky noted\n",
" that Russia has removed the nuclear warhead from these decommissioned\n",
" Kh-55 subsonic cruise missiles, which are now being used to launch\n",
" massive missile strikes on Ukraine.[17]\n",
"sentiment: positive\n",
"polarity: 0.1071\n",
"subjectivity: 0.3421\n",
"\n"
]
}
],
"source": [
"print(\"TOP 5 MOST POLARIZED & SUBJECTIVE TEXT ELEMENTS IN THE HTML FILE: \\n\")\n",
"\n",
"for ix, hit in enumerate(sorted_elements, start=1):\n",
" print(\n",
" f\"{ix}: {hit.text}\\nsentiment: {hit.sentiment}\\npolarity: {hit.polarity}\\nsubjectivity: {hit.subjectivity}\\n\"\n",
" )\n",
" if ix == 5:\n",
" break"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}