
{
"cells": [
{
"cell_type": "markdown",
"id": "6bb44a8e",
"metadata": {},
"source": [
"# Training a Summarization Model with Unstructured + Argilla + Huggingface\n",
"\n",
"In this notebook, we'll show you how you can use [`unstructured`](https://github.com/Unstructured-IO/unstructured), [`argilla`](https://github.com/argilla-io/argilla), and HuggingFace [`transformers`](https://github.com/huggingface/transformers) to train a custom summarization model. In this case, we're going to build a summarization model targeted at summarizing the [Institute for the Study of War's](https://www.understandingwar.org/) daily reports on the war in Ukraine. You can see an example of one of the reports [here](https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-december-12), and a screen shot appears below.\n",
"\n",
"Combining the `unstructured`, `argilla`, and `transformers` libraries, we're able to complete a data science project that previously could have taken a week or more in just a few hours!\n",
"\n",
"#### Table of Contents\n",
"\n",
"- [Section 1: Data Collection and Staging with `unstructured`](#collection)\n",
"- [Section 2: Label Verification with `argilla`](#verification)\n",
"- [Section 3: Model Training with `transformers`](#training)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "02819301",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAACjoAAAZsCAYAAAC6E8x8AAAMbmlDQ1BJQ0MgUHJvZmlsZQAASImVVwdYU8kWnluSkJAQIICAlNCbIFIDSAmhBZBeBBshCSSUGBOCir0sKrh2EQEbuiqi2FZA7NiVRbH3xYKKsi7qYkPlTUhA133le+f75t4/Z878p9yZ3HsAoH/gSaV5qDYA+ZICWUJ4MHN0WjqT9BQQAR2ggAx8eXy5lB0XFw2gDNz/Lu9uAER5v+qs5Prn/H8VXYFQzgcAGQtxpkDOz4f4OAB4FV8qKwCAqNRbTS6QKvFsiPVkMECIVylxtgpvV+JMFT7cb5OUwIH4MgAaVB5Plg2A1j2oZxbysyGP1meIXSUCsQQA+jCIA/gingBiZezD8vMnKnE5xPbQXgoxjAewMr/jzP4bf+YgP4+XPYhVefWLRohYLs3jTf0/S/O/JT9PMeDDFg6qSBaRoMwf1vBW7sQoJaZC3CXJjIlV1hriD2KBqu4AoBSRIiJZZY+a8OUcWD9gALGrgBcSBbEJxGGSvJhotT4zSxzGhRjuFnSKuICbBLEhxAuF8tBEtc1G2cQEtS+0PkvGYav153iyfr9KXw8UuclsNf8bkZCr5se0ikRJqRBTILYuFKfEQKwFsYs8NzFKbTOySMSJGbCRKRKU8VtDnCCUhAer+LHCLFlYgtq+JF8+kC+2USTmxqjxvgJRUoSqPtgpPq8/fpgLdlkoYScP8Ajlo6MHchEIQ0JVuWPPhZLkRDXPB2lBcIJqLU6R5sWp7XFLYV64Um8JsYe8MFG9Fk8pgJtTxY9nSQviklRx4kU5vMg4VTz4MhANOCAEMIECjkwwEeQAcWtXQxf8pZoJAzwgA9lACJzVmoEVqf0zEnhNBEXgD4iEQD64Lrh/VggKof7LoFZ1dQZZ/bOF/StywVOI80EUyIO/Ff2rJIPeUsATqBH/wzsPDj6MNw8O5fy/1w9ov2nYUBOt1igGPDLpA5bEUGIIMYIYRnTAjfEA3A+PhtcgONxwFu4zkMc3e8JTQhvhEeE6oZ1we4J4ruyHKEeBdsgfpq5F5ve1wG0hpycejPtDdsiMG+DGwBn3gH7YeCD07Am1HHXcyqowf+D+WwbfPQ21HdmVjJKHkIPI9j+u1HLU8hxkUdb6+/qoYs0crDdncOZH/5zvqi+A96gfLbGF2H7sLHYCO48dxhoAEzuGNWIt2BElHtxdT/p314C3hP54ciGP+B/+Bp6sspJy11rXTtfPqrkC4ZQC5cHjTJROlYmzRQVMNnw7CJlcCd9lGNPN1c0NAOW7RvX39Ta+/x2CGLR80837HQD/Y319fYe+6SKPAbDXGx7/g9909iwAdDQBOHeQr5AVqnS48kKA/xJ0eNKMgBmwAvYwHzfgBfxAEAgFkSAWJIE0MB5GL4L7XAYmg+lgDigGpWAZWA0qwAawGWwHu8A+0AAOgxPgDLgILoPr4C7cPR3gJegG70AvgiAkhIYwECPEHLFBnBA3hIUEIKFINJKApCEZSDYiQRTIdGQeUoqsQCqQTUgNshc5iJxAziNtyG3kIdKJvEE+oRhKRfVQU9QWHY6yUDYahSah49BsdBJahM5Hl6DlaDW6E61HT6AX0etoO/oS7cEApokZYBaYM8bCOFgslo5lYTJsJlaClWHVWB3WBJ/zVawd68I+4kScgTNxZ7iDI/BknI9Pwmfii/EKfDtej5/Cr+IP8W78K4FGMCE4EXwJXMJoQjZhMqGYUEbYSjhAOA3PUgfhHZFINCDaEb3hWUwj5hCnERcT1xF3E48T24iPiT0kEsmI5ETyJ8WSeKQCUjFpLWkn6RjpCqmD9EFDU8Ncw00jTCNdQ6IxV6NMY4fGUY0rGs80esnaZBuyLzmWLCBPJS8lbyE3kS+RO8i9FB2KHcWfkkTJocyhlFPqKKcp9yhvNTU1LTV9NOM1xZqzNcs192ie03yo+ZGqS3WkcqhjqQrqEuo26nHqbepbGo1mSwuipdMKaEtoNbSTtAe0D1oMLRctrpZAa5ZWpVa91hWtV3Qy3YbOpo+nF9HL6Pvpl+hd2mRtW22ONk97pnal9kHtm9o9OgydETqxOvk6i3V26JzXea5L0rXVDdUV6M7X3ax7UvcxA2NYMTgMPmMeYwvjNKNDj6hnp8fVy9Er1dul16rXra+r76Gfoj9Fv1L/iH67AWZga8A1yDNYarDP4IbBpyGmQ9hDhEMWDakbcmXIe8OhhkGGQsMSw92G1w0/GTGNQo1yjZYbNRjdN8aNHY3jjScbrzc+bdw1VG+o31D+0JKh+4beMUFNHE0STKaZbDZpMekxNTMNN5WarjU9adplZmAWZJZjtsrsqFmnOcM8wFxsvsr8mPkLpj6TzcxjljNPMbstTCwiLBQWmyxaLXot7SyTLeda7ra8b0WxYlllWa2yarbqtja3HmU93brW+o4N2YZlI7JZY3PW5r2tnW2q7QLbBtvndoZ2XLsiu1q7e/Y0+0D7SfbV9tcciA4sh1yHdQ6XHVFHT0eRY6XjJSfUyctJ7LTOqW0YYZjPMMmw6mE3nanObOdC51rnhy4GLtEuc10aXF4Ntx6ePnz58LPDv7p6uua5bnG9O0J3ROSIuSOaRrxxc3Tju1W6XXOnuYe5z3JvdH/t4eQh9FjvccuT4TnKc4Fns+cXL28vmVedV6e3tXeGd5X3TZYeK461mHXOh+AT7DPL57DPR18v3wLffb5/+jn75frt8Hs+0m6kcOSWkY/9Lf15/pv82wOYARkBGwPaAy0CeYHVgY+CrIIEQVuDnrEd2DnsnexXwa7BsuADwe85vpwZnOMhWEh4SElIa6huaHJoReiDMMuw7LDasO5wz/Bp4ccjCBFREcsjbnJNuXxuDbc70jtyRuSpKGpUYlRF1KNox2hZdNModFTkqJWj7sXYxEhiGmJBLDd2Zez9OLu4SXGH4onxcfGV8U8TRiRMTzibyEickLgj8V1ScNLSpLvJ9smK5OYUesrYlJqU96khqStS20cPHz1j9MU04zRxWmM6KT0lfWt6z5jQMavHdIz1HFs89sY4u3FTxp0fbzw+b/yRCfQJvAn7MwgZqRk7Mj7zYnnVvJ5MbmZVZjefw1/DfykIEqwSdAr9hSuEz7L8s1ZkPc/2z16Z3SkKFJWJusQccYX4dU5Ezoac97mxudty+/JS83bna+Rn5B+U6EpyJacmmk2cMrFN6iQtlrZP8p20elK3LEq2VY7Ix8kbC/TgR32Lwl7xk+JhYUBhZeGHySmT90/RmSKZ0jLVceqiqc+Kwop+mYZP409rnm4xfc70hzPYMzbNRGZmzmyeZTVr/qyO2eGzt8+hzMmd89tc17kr5v41L3Ve03zT+bPnP/4p/KfaYq1iWfHNBX4LNizEF4oXti5yX7R20dcSQcmFUtfSstLPi/mLL/w84ufyn/uWZC1pXeq1dP0y4jLJshvLA5dvX6GzomjF45WjVtavYq4qWfXX6gmrz5d5lG1YQ1mjWNNeHl3euNZ67bK1nytEFdcrgyt3V5lULap6v06w7
sr6oPV1G0w3lG74tFG88dam8E311bbVZZuJmws3P92SsuXsL6xfarYaby3d+mWbZFv79oTtp2q8a2p2mOxYWovWKmo7d47deXlXyK7GOue6TbsNdpfuAXsUe17szdh7Y1/Uvub9rP11v9r8WnWAcaCkHqmfWt/dIGpob0xrbDsYebC5ya/pwCGXQ9sOWxyuPKJ/ZOlRytH5R/uOFR3rOS493nUi+8Tj5gnNd0+OPnntVPyp1tNRp8+dCTtz8iz77LFz/ucOn/c9f/AC60LDRa+L9S2eLQd+8/ztQKtXa/0l70uNl30uN7WNbDt6JfDKiashV89c4167eD3metuN5Bu3bo692X5LcOv57bzbr+8U3um9O/se4V7Jfe37ZQ9MHlT/7vD77nav9iMPQx62PEp8dPcx//HLJ/InnzvmP6U9LXtm/qzmudvzw51hnZdfjHnR8VL6srer+A+dP6pe2b/69c+gP1u6R3d3vJa97nuz+K3R221/efzV3BPX8+Bd/rve9yUfjD5s/8j6ePZT6qd
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"execution_count": 1,
"metadata": {
"image/png": {
"width": 800
}
},
"output_type": "execute_result"
}
],
"source": [
"from IPython.display import Image\n",
"\n",
"Image(filename=\"img/isw.png\", width=800)"
]
},
{
"cell_type": "markdown",
"id": "902631a0",
"metadata": {},
"source": [
"## Section 1: Data Collection and Staging with `unstructured` <a id=\"collection\"></a>"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "9c1a7c2c",
"metadata": {},
"outputs": [],
"source": [
"import calendar\n",
"from datetime import datetime\n",
"import re\n",
"import time\n",
"\n",
"import requests\n",
"from transformers import pipeline\n",
"import tqdm\n",
"\n",
"import argilla as rg\n",
"\n",
"from unstructured.partition.html import partition_html\n",
"from unstructured.documents.elements import NarrativeText, ListItem\n",
"from unstructured.staging.argilla import stage_for_argilla"
]
},
{
"cell_type": "markdown",
"id": "14ef78ea",
"metadata": {},
"source": [
"First, we'll pull our documents from the ISW website. We'll use the built-in Python `datetime` and `calendar` libraries to iterate over the dates for the reports we want to pull and fine the associated URLs."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "4fb80d42",
"metadata": {},
"outputs": [],
"source": [
"ISW_BASE_URL = \"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment\"\n",
"\n",
"\n",
"def datetime_to_url(dt):\n",
" month = dt.strftime(\"%B\").lower()\n",
" return f\"{ISW_BASE_URL}-{month}-{dt.day}\""
]
},
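{
"cell_type": "markdown",
"id": "0a1b2c3d",
"metadata": {},
"source": [
"As a quick sanity check, the helper should reproduce the URL for the December 12 report linked in the introduction:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b2c3d4e",
"metadata": {},
"outputs": [],
"source": [
"# Should print the URL of the December 12 report shown in the introduction\n",
"print(datetime_to_url(datetime(2022, 12, 12)))"
]
},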
{
"cell_type": "code",
"execution_count": 4,
"id": "245bdfda",
"metadata": {},
"outputs": [],
"source": [
"urls = []\n",
"year = 2022\n",
"for month in range(3, 13):\n",
" _, last_day = calendar.monthrange(year, month)\n",
" for day in range(1, last_day + 1):\n",
" dt = datetime(year, month, day)\n",
" urls.append(datetime_to_url(dt))"
]
},
{
"cell_type": "markdown",
"id": "4a0415f6",
"metadata": {},
"source": [
"Once we have the URLs, we can pull the HTML document for each report from the web using the `requests` library. Normally, you'd need to write custom HTML parsing code using a library like `lxml` or `beautifulsoup` to extract the narrative text from the webpage for model training. With the `unstructured` library, you can simply call the `partition_html` function to extract the content of interest."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7d7d84d1",
"metadata": {},
"outputs": [],
"source": [
"def url_to_elements(url):\n",
" r = requests.get(url)\n",
" if r.status_code != 200:\n",
" return None\n",
"\n",
" elements = partition_html(text=r.text)\n",
" return elements"
]
},
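{
"cell_type": "markdown",
"id": "2c3d4e5f",
"metadata": {},
"source": [
"To get a feel for what `partition_html` returns, we can peek at the first few elements of a report and their categories. This is just a quick illustrative sketch; the exact elements (and whether the request succeeds) depend on the report that is fetched."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d4e5f6a",
"metadata": {},
"outputs": [],
"source": [
"# Show the element category and a snippet of text for the first few elements.\n",
"# url_to_elements returns None if the request fails, so guard against that.\n",
"sample_elements = url_to_elements(urls[200])\n",
"if sample_elements is not None:\n",
"    for element in sample_elements[:10]:\n",
"        print(f\"{type(element).__name__}: {element.text[:80]}\")"
]
},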
{
"cell_type": "markdown",
"id": "d2cd324b",
"metadata": {},
"source": [
"After partitioning the document, we'll extract the `Key Takeaways` section from the ISW reports, which is shown in the screenshot below. The `Key Takeaways` section will serve as the target text for our summarization model. While it would be time consuming the write HTML parsing code to find this content, with the `unstructured` library it is easy. Since the `partition_html` function breaks down the elements of the document into different categories like `Title`, `NarrativeText`, and `ListItem`, all we need to do is find the `Key Takeaways` title and then grab `ListItem` elements until the list ends. This logic is implemented in the `get_key_takeaways` function."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "1faf761d",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABMIAAAXeCAYAAACNFIP0AAAMbmlDQ1BJQ0MgUHJvZmlsZQAASImVVwdYU8kWnluSkJAQIICAlNCbIFIDSAmhBZBeBBshCSSUGBOCir0sKrh2EQEbuiqi2FZA7NiVRbH3xYKKsi7qYkPlTUhA133le+f75t4/Z878p9yZ3HsAoH/gSaV5qDYA+ZICWUJ4MHN0WjqT9BQQAR2ggAx8eXy5lB0XFw2gDNz/Lu9uAER5v+qs5Prn/H8VXYFQzgcAGQtxpkDOz4f4OAB4FV8qKwCAqNRbTS6QKvFsiPVkMECIVylxtgpvV+JMFT7cb5OUwIH4MgAaVB5Plg2A1j2oZxbysyGP1meIXSUCsQQA+jCIA/gingBiZezD8vMnKnE5xPbQXgoxjAewMr/jzP4bf+YgP4+XPYhVefWLRohYLs3jTf0/S/O/JT9PMeDDFg6qSBaRoMwf1vBW7sQoJaZC3CXJjIlV1hriD2KBqu4AoBSRIiJZZY+a8OUcWD9gALGrgBcSBbEJxGGSvJhotT4zSxzGhRjuFnSKuICbBLEhxAuF8tBEtc1G2cQEtS+0PkvGYav153iyfr9KXw8UuclsNf8bkZCr5se0ikRJqRBTILYuFKfEQKwFsYs8NzFKbTOySMSJGbCRKRKU8VtDnCCUhAer+LHCLFlYgtq+JF8+kC+2USTmxqjxvgJRUoSqPtgpPq8/fpgLdlkoYScP8Ajlo6MHchEIQ0JVuWPPhZLkRDXPB2lBcIJqLU6R5sWp7XFLYV64Um8JsYe8MFG9Fk8pgJtTxY9nSQviklRx4kU5vMg4VTz4MhANOCAEMIECjkwwEeQAcWtXQxf8pZoJAzwgA9lACJzVmoEVqf0zEnhNBEXgD4iEQD64Lrh/VggKof7LoFZ1dQZZ/bOF/StywVOI80EUyIO/Ff2rJIPeUsATqBH/wzsPDj6MNw8O5fy/1w9ov2nYUBOt1igGPDLpA5bEUGIIMYIYRnTAjfEA3A+PhtcgONxwFu4zkMc3e8JTQhvhEeE6oZ1we4J4ruyHKEeBdsgfpq5F5ve1wG0hpycejPtDdsiMG+DGwBn3gH7YeCD07Am1HHXcyqowf+D+WwbfPQ21HdmVjJKHkIPI9j+u1HLU8hxkUdb6+/qoYs0crDdncOZH/5zvqi+A96gfLbGF2H7sLHYCO48dxhoAEzuGNWIt2BElHtxdT/p314C3hP54ciGP+B/+Bp6sspJy11rXTtfPqrkC4ZQC5cHjTJROlYmzRQVMNnw7CJlcCd9lGNPN1c0NAOW7RvX39Ta+/x2CGLR80837HQD/Y319fYe+6SKPAbDXGx7/g9909iwAdDQBOHeQr5AVqnS48kKA/xJ0eNKMgBmwAvYwHzfgBfxAEAgFkSAWJIE0MB5GL4L7XAYmg+lgDigGpWAZWA0qwAawGWwHu8A+0AAOgxPgDLgILoPr4C7cPR3gJegG70AvgiAkhIYwECPEHLFBnBA3hIUEIKFINJKApCEZSDYiQRTIdGQeUoqsQCqQTUgNshc5iJxAziNtyG3kIdKJvEE+oRhKRfVQU9QWHY6yUDYahSah49BsdBJahM5Hl6DlaDW6E61HT6AX0etoO/oS7cEApokZYBaYM8bCOFgslo5lYTJsJlaClWHVWB3WBJ/zVawd68I+4kScgTNxZ7iDI/BknI9Pwmfii/EKfDtej5/Cr+IP8W78K4FGMCE4EXwJXMJoQjZhMqGYUEbYSjhAOA3PUgfhHZFINCDaEb3hWUwj5hCnERcT1xF3E48T24iPiT0kEsmI5ETyJ8WSeKQCUjFpLWkn6RjpCqmD9EFDU8Ncw00jTCNdQ6IxV6NMY4fGUY0rGs80esnaZBuyLzmWLCBPJS8lbyE3kS+RO8i9FB2KHcWfkkTJocyhlFPqKKcp9yhvNTU1LTV9NOM1xZqzNcs192ie03yo+ZGqS3WkcqhjqQrqEuo26nHqbepbGo1mSwuipdMKaEtoNbSTtAe0D1oMLRctrpZAa5ZWpVa91hWtV3Qy3YbOpo+nF9HL6Pvpl+hd2mRtW22ONk97pnal9kHtm9o9OgydETqxOvk6i3V26JzXea5L0rXVDdUV6M7X3ax7UvcxA2NYMTgMPmMeYwvjNKNDj6hnp8fVy9Er1dul16rXra+r76Gfoj9Fv1L/iH67AWZga8A1yDNYarDP4IbBpyGmQ9hDhEMWDakbcmXIe8OhhkGGQsMSw92G1w0/GTGNQo1yjZYbNRjdN8aNHY3jjScbrzc+bdw1VG+o31D+0JKh+4beMUFNHE0STKaZbDZpMekxNTMNN5WarjU9adplZmAWZJZjtsrsqFmnOcM8wFxsvsr8mPkLpj6TzcxjljNPMbstTCwiLBQWmyxaLXot7SyTLeda7ra8b0WxYlllWa2yarbqtja3HmU93brW+o4N2YZlI7JZY3PW5r2tnW2q7QLbBtvndoZ2XLsiu1q7e/Y0+0D7SfbV9tcciA4sh1yHdQ6XHVFHT0eRY6XjJSfUyctJ7LTOqW0YYZjPMMmw6mE3nanObOdC51rnhy4GLtEuc10aXF4Ntx6ePnz58LPDv7p6uua5bnG9O0J3ROSIuSOaRrxxc3Tju1W6XXOnuYe5z3JvdH/t4eQh9FjvccuT4TnKc4Fns+cXL28vmVedV6e3tXeGd5X3TZYeK461mHXOh+AT7DPL57DPR18v3wLffb5/+jn75frt8Hs+0m6kcOSWkY/9Lf15/pv82wOYARkBGwPaAy0CeYHVgY+CrIIEQVuDnrEd2DnsnexXwa7BsuADwe85vpwZnOMhWEh4SElIa6huaHJoReiDMMuw7LDasO5wz/Bp4ccjCBFREcsjbnJNuXxuDbc70jtyRuSpKGpUYlRF1KNox2hZdNModFTkqJWj7sXYxEhiGmJBLDd2Zez9OLu4SXGH4onxcfGV8U8TRiRMTzibyEickLgj8V1ScNLSpLvJ9smK5OYUesrYlJqU96khqStS20cPHz1j9MU04zRxWmM6KT0lfWt6z5jQMavHdIz1HFs89sY4u3FTxp0fbzw+b/yRCfQJvAn7MwgZqRk7Mj7zYnnVvJ5MbmZVZjefw1/DfykIEqwSdAr9hSuEz7L8s1ZkPc/2z16Z3SkKFJWJusQccYX4dU5Ezoac97mxudty+/JS83bna+Rn5B+U6EpyJacmmk2cMrFN6iQtlrZP8p20elK3LEq2VY7Ix8kbC/TgR32Lwl7xk+JhYUBhZeGHySmT90/RmSKZ0jLVceqiqc+Kwop+mYZP409rnm4xfc70hzPYMzbNRGZmzmyeZTVr/qyO2eGzt8+hzMmd89tc17kr5v41L3Ve03zT+bPnP/4p/KfaYq1iWfHNBX4LNizEF4oXti5yX7R20dcSQcmFUtfSstLPi/mLL/w84ufyn/uWZC1pXeq1dP0y4jLJshvLA5dvX6GzomjF45WjVtavYq4qWfXX6gmrz5d5lG1YQ1mjWNNeHl3euNZ67bK1nytEFdcrgyt3V5lULap6v06w7
sr6oPV1G0w3lG74tFG88dam8E311bbVZZuJmws3P92SsuXsL6xfarYaby3d+mWbZFv79oTtp2q8a2p2mOxYWovWKmo7d47deXlXyK7GOue6TbsNdpfuAXsUe17szdh7Y1/Uvub9rP11v9r8WnWAcaCkHqmfWt/dIGpob0xrbDsYebC5ya/pwCGXQ9sOWxyuPKJ/ZOlRytH5R/uOFR3rOS493nUi+8Tj5gnNd0+OPnntVPyp1tNRp8+dCTtz8iz77LFz/ucOn/c9f/AC60LDRa+L9S2eLQd+8/ztQKtXa/0l70uNl30uN7WNbDt6JfDKiashV89c4167eD3metuN5Bu3bo692X5LcOv57bzbr+8U3um9O/se4V7Jfe37ZQ9MHlT/7vD77nav9iMPQx62PEp8dPcx//HLJ/InnzvmP6U9LXtm/qzmudvzw51hnZdfjHnR8VL6srer+A+dP6pe2b/69c+gP1u6R3d3vJa97nuz+K3R221/efzV3BPX8+Bd/rve9yUfjD5s/8j6ePZT6qd
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"execution_count": 6,
"metadata": {
"image/png": {
"width": 500
}
},
"output_type": "execute_result"
}
],
"source": [
"Image(filename=\"img/isw-key-takeaways.png\", width=500)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "82ce0492",
"metadata": {},
"outputs": [],
"source": [
"def _find_key_takeaways_idx(elements):\n",
" for idx, element in enumerate(elements):\n",
" if element.text == \"Key Takeaways\":\n",
" return idx\n",
"\n",
"\n",
"def get_key_takeaways(elements):\n",
" key_takeaways_idx = _find_key_takeaways_idx(elements)\n",
" if not key_takeaways_idx:\n",
" return None\n",
"\n",
" takeaways = []\n",
" for element in elements[key_takeaways_idx + 1 :]:\n",
" if not isinstance(element, ListItem):\n",
" break\n",
" takeaways.append(element)\n",
"\n",
" takeaway_text = \" \".join([el.text for el in takeaways])\n",
" return NarrativeText(text=takeaway_text)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a90b939a",
"metadata": {},
"outputs": [],
"source": [
"elements = url_to_elements(urls[200])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "95a1b3c7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Russian forces continue to prioritize strategically meaningless offensive operations around Donetsk City and Bakhmut over defending against continued Ukrainian counter-offensive operations in Kharkiv Oblast. Ukrainian forces liberated a settlement southwest of Lyman and are likely continuing to expand their positions in the area. Ukrainian forces continued to conduct an interdiction campaign in Kherson Oblast. Russian forces continued to conduct unsuccessful assaults around Bakhmut and Avdiivka. Ukrainian sources reported extensive partisan attacks on Russian military assets and logistics in southern Zaporizhia Oblast. Russian officials continued to undertake crypto-mobilization measures to generate forces for war Russian war efforts. Russian authorities are working to place 125 “orphan” Ukrainian children from occupied Donetsk Oblast with Russian families.\n"
]
}
],
"source": [
"print(get_key_takeaways(elements))"
]
},
{
"cell_type": "markdown",
"id": "17aa2396",
"metadata": {},
"source": [
"Next we'll grab the narrative text from the document as input for our model. Again, this is easy with `unstructured` because the `partition_html` function already splits out the text. We'll just grab all of the `NarrativeText` elements that exceed a minimum length threshold. While we're in there, we'll also clean out the raw text for citations within the document, which isn't natural language and could impact the quality of our summarization model."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "cefabbb7",
"metadata": {},
"outputs": [],
"source": [
"def get_narrative(elements):\n",
" narrative_text = \"\"\n",
" for element in elements:\n",
" if isinstance(element, NarrativeText) and len(element.text) > 500:\n",
" # NOTE: Removes citations like [3] from the text\n",
" element_text = re.sub(\"\\[\\d{1,3}\\]\", \"\", element.text)\n",
" narrative_text += f\"\\n\\n{element_text}\"\n",
"\n",
" return NarrativeText(text=narrative_text.strip())"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "e55a22fb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Russian forces continue to conduct meaningless offensive operations around Donetsk City and Bakhmut instead of focusing on defending against Ukrainian counteroffensives that continue to advance. Russian troops continue to attack Bakhmut and various villages near Donetsk City of emotional significance to pro-war residents of the Donetsk Peoples Republic (DNR) but little other importance. The Russians are apparently directing some of the very limited reserves available in Ukraine to these efforts rather than to the vulnerable Russian defensive lines hastily thrown up along the Oskil River in eastern Kharkiv Oblast. The Russians cannot hope to make gains around Bakhmut or Donetsk City on a large enough scale to derail Ukrainian counteroffensives and appear to be continuing an almost robotic effort to gain ground in Donetsk Oblast that seems increasingly divorced from the overall realities of the theater.\n",
"\n",
"Russian failures to rush large-scale reinforcements to eastern Kharkiv and to Luhansk Oblasts leave most of Russian-occupied northeastern Ukraine highly vulnerable to continuing Ukrainian counter-offensives. The Russians may have decided not to defend this area, despite Russian President Vladimir Putins repeated declarations that the purpose of the “special military operation” is to “liberate” Donetsk and Luhansk oblasts. Prioritizing the defense of Russian gains in southern Ukraine over holding northeastern Ukraine makes strategic sense since Kherson and Zaporizhia Oblasts are critical terrain for both Russia and Ukraine whereas the sparsely-populated agricultural areas in the northeast are much less so. But the continued Russian offensive operations around Bakhmut and Donetsk City, which are using some of Russias very limited effective combat power at the expense of defending against Ukrainian counteroffensives, might indicate that Russian theater decision-making remains questionable.\n",
"\n",
"Ukrainian forces appear to be expanding positions east of the Oskil River and north of the Siverskyi Donets River that could allow them to envelop Russian troops holding around Lyman. Further Ukrainian advances east along the north bank of the Siverskyi Donets River could make Russian positions around Lyman untenable and open the approaches to Lysychansk and ultimately Severodonetsk. The Russian defenders in Lyman still appear to consist in large part of BARS (Russian Combat Army Reserve) reservists and the remnants of units badly damaged in the Kharkiv Oblast counteroffensive, and the Russians do not appear to be directing reinforcements from elsewhere in the theater to these areas.\n",
"\n",
"We do not report in detail on Russian war crimes because those activities are well-covered in Western media and do not directly affect the military operations we are assessing and forecasting. We will continue to evaluate and report on the effects of these criminal activities on the Ukrainian military and population and specifically on combat in Ukrainian urban areas. We utterly condemn these Russian violations of the laws of armed conflict, Geneva Conventions, and humanity even though we do not describe them in these reports.\n",
"\n",
"Ukrainian and Russian sources indicated that Ukrainian forces are continuing to establish positions northwest and southwest of Lyman on September 17, while Russian forces have maintained their positions in Lyman and Yampil. Geolocated footage showed Ukrainian forces raising a flag over Shchurove, situated on the eastern bank of the Siverskyi Donets River six kilometers southwest of Lyman. Russian milbloggers claimed that Ukrainian forces crossed the Siverskyi Donets River and reached Studenok (approximately 25km northwest of Lyman) after Russian forces withdrew from the settlement on September 15. The Ukrainian General Staff reported that Russian forces shelled Oleksandrivka, which could indicate that Ukrainian forces advanced eight kilometers from Studenok. Russian milbloggers also noted heavy fighting in Oleksandrivka and settlements northwest of Oleksandrivka. The Ukrainian General Staff also noted that Russian forces fired artillery at Yarova (10km southeast of Studenok), while milbloggers noted fighting in the settlement, likely indicating a Ukrainian advance in the area. Russian sources also claimed active combat in Dobrysheve, between liberated Shchurove and contested Yarova.\n",
"\n",
"Ukrainian and Russian sources also reported kinetic activity on the northern segment of the Oskil River in Kharkiv Oblast. The Ukrainian General Staff and the Russian Defense Ministry reported that Russian forces are shelling Dvorichna (about 17km northeast of Kupyansk), while milbloggers speculated that Ukrainian forces are preparing for an eastward counterattack from the settlement. Geolocated footage showed Ukrainian artillery fire on Russian military equipment operating on the eastern bank of the Oskil River, approximately 38 northeast of Izyum.\n",
"\n",
"Ukrainian military officials maintained their operational silence regarding the progress of the counteroffensive on September 17 but noted the continuation of the Ukrainian interdiction campaign in Kherson Oblast. The Ukrainian Southern Operational Command reported that Ukrainian forces struck an alternative Russian pontoon crossing near Sadove (approximately 17km east of Kherson City), an electronic warfare (EW) station in Nova Kakhovka, and a Russian concentration area in Stara Zburyivka (about 23km southwest of Kherson City). The Ukrainian General Staff reported that Russian forces are preparing retreat routes, including a new crossing in the area of the Kakhovka Hydroelectric Power Plant, due to Ukrainian strikes on Russian crossings over the Dnipro River. Ukrainian military officials noted that the Ukrainian strike on Kherson City on September 10 resulted in the deaths of over 180 Russian servicemen. Social media footage corroborates Ukrainian official statements about the continuation of the interdiction campaign in Kherson Oblast. Residents reported smoke and explosions in Antonivka (on the left bank of the Dnipro River) and in Nova Kakhovka.\n",
"\n",
"Ukrainian and Russian sources reported kinetic activity in three main areas: northwest of Kherson City, near the Ukrainian bridgehead over the Inhulets River, and south of the Kherson-Dnipropetrovsk Oblast border near Vysokopillya. The Russian Defense Ministry and Russian milbloogers claimed that Russian forces repelled a Ukrainian “large-scale” attack on Pravdyne (approximately 28km northwest of Kherson City) on September 16. Some milbloggers specified that Ukrainian forces advanced through Russian defenses Pravdyne with up to two reinforced companies (likely less than a battalion in strength), which is hardly a large-scale attack. The Ukrainian Southern Operational Command also reported that Russian forces unsuccessfully attempted to attack Stepova Dolyna (the next settlement north of Pravdyne) from Pravdyne. A milblogger claimed that Ukrainian forces are using helicopters to transfer troops to Sukhyi Stavok (about 12km southeast of the bridgehead), which if true, likely indicates the reduced capacity of Russian air defenses in the area. Ukrainian and Russian forces noted that Russian forces continued to shell and launch airstrikes on Sukhyi Stavok. Geolocated footage also showed Ukrainian forces firing at Russian positions in Davydiv Brid. The Ukrainian Southern Operational Command reported that a Russian reconnaissance and sabotage group attempted a failed advance on Ukrainian-controlled Novovoznesenske (8km southeast of Vysokopillya) and conducted unsuccessful offensive operations in the direction of Arhanhelske-Ivanivka along the Inhulets RIver.\n",
"\n",
"Russian forces are intensifying filtration and social control measures in Kherson Oblast as a result of the Ukrainian counteroffensive in the region. A local Kherson Oblast Telegram channel reported that Russian forces are conducting filtration measures on Chaykovskiy Street in Kherson City. Russian Telegram channels published footage of Russian servicemen firing at unspecified targets near the Kherson City railway terminal, claiming that Russian forces were conducting a “counterterrorist operation.”\n",
"\n",
"Russian forces continued operations and allocating reinforcements to offensive actions aimed at taking relatively small settlements in Donetsk Oblast rather than dedicating these forces to defending against ongoing Ukrainian counteroffensives. Russian forces conducted limited ground attacks and continued routine fire throughout Donetsk Oblast on September 17. The Ukrainian General Staff reported that Ukrainian forces repelled Russian ground assaults on and south of Bakhmut, on and west of Avdiivka, and southwest of Donetsk City. Russian sources claimed that Russian forces made incremental advances into the eastern and southern outskirts of Bakhmut. Russian sources claimed that Russian forces repelled a Ukrainian ground attack on Berestove, 15km northeast of Soledar on the T1302. Mariupol Mayoral Advisor Petro Andryushchenko reported on September 17 that Russian forces transported a column of 15 Russian tanks marked with the 3rd Army Corps symbol from Mariupol towards Donetsk City, likely to reinforce Russian positions along the Bakhmut-Donetsk City front line.\n",
"\n",
"Russian forces did not conduct any ground attacks and continued routine fire in Zaporizhia Oblast west of Hulyaipole on September 17. Russian sources stated that Russian forces struck unspecified infrastructure in Zaporizhzhia City, likely as part of a continued effort to target Ukrainian infrastructure. Ukrainian authorities reported that Russian forces shelled Ochakiv, Mykolaiv Oblast (less than 10km from the Kinburn Spit in Kherson Oblast), throughout the night on September 16-17 and morning on September 17, and conducted air or missile strikes on the settlement during the day on September 17.\n",
"\n",
"Ukrainian sources reported extensive partisan attacks on Russian military assets and logistics in western Zaporizhia Oblast on September 17. Ukraines Resistance Center reported that (likely partisans) detonated explosives at the Nyzyany rail station (40km east of Tokmak), damaging rail lines on which Russian forces frequently transport military equipment and supplies from occupied Crimea. Russian sources claimed that Ukrainian forces struck the Nyzyany rail station with artillery, rockets, or HIMARS, but the high level of documented partisan activity and the inconsistent Russian narrative suggests that Ukrainian partisans likely conducted the attack. Ukrainian Melitopol Mayor Ivan Fedorov reported explosions (from likely partisan activity) in Bohatyr and Radivonivka (on the southwestern outskirts of Melitopol), where Fedorov reported that Russian forces have established a military base and are storing military equipment. Russian occupation authorities claimed that “terrorists” (likely Ukrainian partisans) blew up power lines in southern Melitopol, damaging concrete supports on the M18/E105 highway connecting Melitopol to Crimea.\n",
"\n",
"The International Atomic Energy Agency (IAEA) announced on September 17 that Ukrainian authorities reconnected the Zaporizhzhia Nuclear Power Plant (ZNPP) to the Ukrainian power grid following repairs to the main external power lines on September 16. Ukrainian state nuclear agency Energoatom announced on September 16 that a large convoy containing spare parts, chemical reagents, and diesel fuel traveled through Russian checkpoints and arrived at the ZNPP on September 16 that enabled Energoatom engineers to conduct the repairs necessary to reconnect the ZNPP to the Ukrainian power grid. The Russian Ministry of Defense claimed that Ukrainian forces “resumed provocations” by shelling the area around the ZNPP on September 17 but provided no evidence of the claimed shelling. Russian forces continued routine strikes against areas on the north bank of the Kakhovka Reservoir opposite Enerhodar.\n",
"\n",
"Russian authorities are seeking warm bodies to confront Ukrainian counteroffensives in the absence of trained soldiers and are taking extreme measures to speed recruitment efforts. A recruitment poster in Sevastopol advertised a mere 10 days of training for recruits prior to deployment as a part of the 810th Guards Naval Infantry Brigade of the Black Sea fleet, which Ukrainian sources report has lost over 85% of its personnel. The report noted that locals spotted similar posters in Bakhchysarai, Simferopol, Kerch, and Yalta, Crimea. Ten days is not remotely enough time to provide even basic levels of military training. The commitment of such “troops” will more likely further degrade Russian forces capability to defend against Ukrainian forces and conduct their own offensive operations than add to Russian combat power.\n",
"\n",
"Russian authorities continue to support major recruitment drives in prisons through private military companies (PMCs). BBC reported that the father of a prisoner in penal colony IK-6 stated that Wagner Group leadership is actively promoting military service with Wagner Group in exchange for pardons, including of narcotics and sexual crimes that previously disqualified individuals from Wagner Group employment. Russian humanitarian group “Rus Sidyashiy” head Olga Romanova stated that Russian-led forces have recruited at least 7,000 prisoners to fight in Ukraine, visited roughly 35 penal colonies, and recruited an average of 200 new prisoners per visit.\n",
"\n",
"Conditions for Russian soldiers continue to vary depending on the soldiers contract status and Russian sources reported a systematic preference for “traditional” contract soldiers over reservists. The Russian Union of Paratroopers and a Russian milblogger posted a public call to action on September 14 that details the poor treatment of BARS personnel in receiving promised benefits, recording their contracts, and in the documentation and quality of their medical care. The post claimed that the Russian milblogger has gathered nearly two dozen reports of such treatment from a single unit from Rostov-on-Don and that some BARS personnel were thrown on the streets with no money or supplies to get home and some returned home with untreated injuries. The post appealed to the Russian Ministry of Defense to protect the rights of military personnel and prosecute the worst perpetrators of unequal treatment.\n",
"\n",
"Russian authorities reported that Ukrainian children forcibly deported to Russia for adoption have received Russian citizenship and may be separated from their siblings. Russian Presidential Commissioner for Childrens Rights Maria Lvova-Belova stated that Russian authorities are working to place 125 “orphan” Ukrainian children from occupied Donetsk Oblast with Russian families but may have to separate siblings from families with over seven children. Lvova-Belova stated that Russian authorities have already granted these children Russian citizenship and are conducting “psychological testing” to determine appropriate placement with Russian families. As ISW has previously reported, the forcible transfer of children from one group to another “with intent to destroy, in whole or in part, a national, ethnical, racial or religious group” is a violation of the Convention on the Prevention and Punishment of the Crime of Genocide.\n",
"\n",
"Russian authorities are intensifying measures to identify and detain Ukrainians who oppose the occupation regime. The Ukrainian General Staff reported that Zaporizhia Oblast occupation authorities recently announced the strengthening of “sanctions” against patriotic Ukrainians and are threatening Ukrainian activists with forced deportation to occupied Donetsk and Luhansk Oblasts where occupation authorities have deemed providing support to members of the Ukrainian resistance movement a crime punishable by death. The Ukrainian General Staff also reported that occupation authorities in Kherson Oblast are conducting weekly inspections of Ukrainian businesses and are threatening to “nationalize” the businesses if they do not cooperate with the occupation regime. Ukraines Resistance Center reported that occupation authorities are searching for patriotic Ukrainians in Kherson City by engaging in dialogues to fish for personal information, setting up fake fundraisers for Ukrainian forces, or asking about the deployment of Russian forces, after which occupation authorities detain the Ukrainians for filtration. The Rosgvardia Press Service announced that Rosgvardia forces detained over 50 alleged “accomplices of the Ukrainian Armed Forces” in occupied Zaporizhia and Kherson Oblasts within the past week.\n",
"\n",
"Ukrainian officials stated on September 16-17 that Ukrainian partisans did not assassinate Luhansk Peoples Republic (LNR) Prosecutor General Sergey Gorenko and Deputy Prosecutor General Yekaterina Steglenko. Ukrainian Luhansk Oblast Head Serhiy Haidai claimed that LNR internal divisions, specifically the rift between Gorenko and LNR Head Leonid Pasechnik, caused Gorenkos death. Ukrainian Presidential Advisor Mikhail Podolyak suggested that local organized criminal groups could have assassinated Gorenko or that Russian authorities may be purging witnesses of Russian war crimes. The Ukrainian government has offered an official response to the assassination as of September 17. Various proxy officials claimed on September 16-17 that Ukrainian “terrorists” or “gangs” assassinated Gorenko and Steglenko.\n"
]
}
],
"source": [
"print(get_narrative(elements))"
]
},
{
"cell_type": "markdown",
"id": "cf47ebc3",
"metadata": {},
"source": [
"Now the we have everything set up, let's collect all of the reports! This step could take a while, we added a sleep call to the loop to avoid overwhelming ISW's webpage."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "29ea100e",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|████████████████████████████████████████████████████████████████████| 306/306 [06:05<00:00, 1.20s/it]\n"
]
}
],
"source": [
"inputs = []\n",
"annotations = []\n",
"for url in tqdm.tqdm(urls):\n",
" elements = url_to_elements(url)\n",
" if url is None or not elements:\n",
" continue\n",
Chore (refactor): support table extraction with pre-computed ocr data (#1801) ### Summary Table OCR refactor, move the OCR part for table model in inference repo to unst repo. * Before this PR, table model extracts OCR tokens with texts and bounding box and fills the tokens to the table structure in inference repo. This means we need to do an additional OCR for tables. * After this PR, we use the OCR data from entire page OCR and pass the OCR tokens to inference repo, which means we only do one OCR for the entire document. **Tech details:** * Combined env `ENTIRE_PAGE_OCR` and `TABLE_OCR` to `OCR_AGENT`, this means we use the same OCR agent for entire page and tables since we only do one OCR. * Bump inference repo to `0.7.9`, which allow table model in inference to use pre-computed OCR data from unst repo. Please check in [PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256). * All notebooks lint are made by `make tidy` * This PR also fixes [issue](https://github.com/Unstructured-IO/unstructured/issues/1564), I've added test for the issue in `test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages` * Add same scaling logic to image [similar to previous Table OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113), but now scaling is applied to entire image ### Test * Not much to manually testing expect table extraction still works * But due to change on scaling and use pre-computed OCR data from entire page, there are some slight (better) changes on table output, here is an comparison on test outputs i found from the same test `test_partition_image_with_table_extraction`: screen shot for table in `layout-parser-paper-with-table.jpg`: <img width="343" alt="expected" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c"> before refactor: <img width="709" alt="before" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e"> after refactor: <img width="705" alt="after" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d"> ### TODO (added as a ticket) Still have some clean up to do in inference repo since now unst repo have duplicate logic, but can keep them as a fall back plan. If we want to remove anything OCR related in inference, here are items that is deprecated and can be removed: * [`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77) (already noted in code) * parameter `extract_tables` in inference * [`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88) * [`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197) * env `TABLE_OCR` ### Note if we want to fallback for an additional table OCR (may need this for using paddle for table), we need to: * pass `infer_table_structure` to inference with `extract_tables` parameter * stop passing `infer_table_structure` to `ocr.py` --------- Co-authored-by: Yao You <yao@unstructured.io>
2023-10-20 20:24:23 -04:00
"\n",
" text = get_narrative(elements)\n",
" annotation = get_key_takeaways(elements)\n",
"\n",
" if text and annotation:\n",
" inputs.append(text)\n",
" annotations.append(annotation.text)\n",
" # NOTE: Sleeping to reduce the volume of requests to ISW\n",
" time.sleep(1)"
]
},
{
"cell_type": "markdown",
"id": "28d84bbf",
"metadata": {},
"source": [
"## Label Verification with `argilla` <a id=\"verification\"></a>\n",
"\n",
"Now that we've collected the data and prepared it with `unstructured`, we're ready to work on our data labels in `argilla`. First, we'll use the `stage_for_argilla` staging function from the `unstructured` library. This will automatically convert our dataset to a `DatasetForText2Text` object, which we can then import into Argilla."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "286b29c8",
"metadata": {},
"outputs": [],
"source": [
"dataset = stage_for_argilla(inputs, \"text2text\", annotation=annotations)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "f202c5a2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>prediction</th>\n",
" <th>prediction_agent</th>\n",
" <th>annotation</th>\n",
" <th>annotation_agent</th>\n",
" <th>id</th>\n",
" <th>metadata</th>\n",
" <th>status</th>\n",
" <th>event_timestamp</th>\n",
" <th>metrics</th>\n",
" <th>search_keywords</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Russian forces are completing the reinforcemen...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>Russian forces are setting conditions to envel...</td>\n",
" <td>None</td>\n",
" <td>1c728c08b07bf47c5ec573bf78350c50</td>\n",
" <td>{}</td>\n",
" <td>Validated</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Russian forces resumed offensive operations in...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>Russian forces resumed offensive operations ag...</td>\n",
" <td>None</td>\n",
" <td>e03b12744a53d8393620c617b5d82f27</td>\n",
" <td>{}</td>\n",
" <td>Validated</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>The Russian military has continued its unsucce...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>Russian forces opened a new line of advance fr...</td>\n",
" <td>None</td>\n",
" <td>1852425c2dc32a35274b2ac112b43221</td>\n",
" <td>{}</td>\n",
" <td>Validated</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Russian forces continue their focus on encircl...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>Russian forces have advanced rapidly on the ea...</td>\n",
" <td>None</td>\n",
" <td>9f094b6a9d30b9529aa630d818d143ae</td>\n",
" <td>{}</td>\n",
" <td>Validated</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Russian forces remain deployed in the position...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>Russian forces conducted no major offensive op...</td>\n",
" <td>None</td>\n",
" <td>d4c88cb002d4fa75d7273c3206cbde93</td>\n",
" <td>{}</td>\n",
" <td>Validated</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text prediction \\\n",
"0 Russian forces are completing the reinforcemen... None \n",
"1 Russian forces resumed offensive operations in... None \n",
"2 The Russian military has continued its unsucce... None \n",
"3 Russian forces continue their focus on encircl... None \n",
"4 Russian forces remain deployed in the position... None \n",
"\n",
" prediction_agent annotation \\\n",
"0 None Russian forces are setting conditions to envel... \n",
"1 None Russian forces resumed offensive operations ag... \n",
"2 None Russian forces opened a new line of advance fr... \n",
"3 None Russian forces have advanced rapidly on the ea... \n",
"4 None Russian forces conducted no major offensive op... \n",
"\n",
" annotation_agent id metadata status \\\n",
"0 None 1c728c08b07bf47c5ec573bf78350c50 {} Validated \n",
"1 None e03b12744a53d8393620c617b5d82f27 {} Validated \n",
"2 None 1852425c2dc32a35274b2ac112b43221 {} Validated \n",
"3 None 9f094b6a9d30b9529aa630d818d143ae {} Validated \n",
"4 None d4c88cb002d4fa75d7273c3206cbde93 {} Validated \n",
"\n",
" event_timestamp metrics search_keywords \n",
"0 None None None \n",
"1 None None None \n",
"2 None None None \n",
"3 None None None \n",
"4 None None None "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.to_pandas().head()"
]
},
{
"cell_type": "markdown",
"id": "5c76982c",
"metadata": {},
"source": [
"After staging the data for argilla, we can call the `rg.log` function from the `argilla` Python library to upload the data to the Argilla UI. Before running this step, ensure that you have Argilla running in the background. You can do that by running the following commands in your terminal:\n",
"\n",
"- `docker run -d --name elasticsearch-for-argilla -p 9200:9200 -p 9300:9300 -e \"ES_JAVA_OPTS=-Xms512m -Xmx512m\" -e \"discovery.type=single-node\" docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2`\n",
"- `python -m argilla`\n",
"\n",
"The first command starts the ElasticSearch backend for Argilla and the second command launches the webapp. Once it's running, navigate to `http://0.0.0.0:6900/` and enter `argilla` as the username and `1234` as the password. See the [Quickstart](https://docs.argilla.io/en/latest/getting_started/quickstart.html) instructions from the Argilla docs if you need help getting up and running. After logging the data to Argilla, your UI should look like the screenshot below."
]
},
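{
"cell_type": "markdown",
"id": "a3f91c2e",
"metadata": {},
"source": [
"If your Argilla instance isn’t running at the default `http://localhost:6900`, you can point the client at it explicitly before logging. This is a minimal sketch; the `api_url` and `api_key` values below are the quickstart defaults and may differ for your deployment."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b4e82d3f",
"metadata": {},
"outputs": [],
"source": [
"# Optional: configure the Argilla client explicitly. These values are\n",
"# the quickstart defaults -- adjust them for your own deployment.\n",
"rg.init(api_url=\"http://localhost:6900\", api_key=\"argilla.apikey\")"
]
},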
{
"cell_type": "code",
"execution_count": 15,
"id": "e5ee6ab8",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "f191e356e79d476788bd243b9f00d800",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/285 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"285 records logged to http://localhost:6900/datasets/argilla/isw-summarization\n"
]
},
{
"data": {
"text/plain": [
"BulkResponse(dataset='isw-summarization', processed=285, failed=0)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rg.log(dataset, name=\"isw-summarization\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "fd56ed8c",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAC8oAAAWoCAYAAADQQNkaAAAMbmlDQ1BJQ0MgUHJvZmlsZQAASImVVwdYU8kWnluSkJAQIICAlNCbIFIDSAmhBZBeBBshCSSUGBOCir0sKrh2EQEbuiqi2FZA7NiVRbH3xYKKsi7qYkPlTUhA133le+f75t4/Z878p9yZ3HsAoH/gSaV5qDYA+ZICWUJ4MHN0WjqT9BQQAR2ggAx8eXy5lB0XFw2gDNz/Lu9uAER5v+qs5Prn/H8VXYFQzgcAGQtxpkDOz4f4OAB4FV8qKwCAqNRbTS6QKvFsiPVkMECIVylxtgpvV+JMFT7cb5OUwIH4MgAaVB5Plg2A1j2oZxbysyGP1meIXSUCsQQA+jCIA/gingBiZezD8vMnKnE5xPbQXgoxjAewMr/jzP4bf+YgP4+XPYhVefWLRohYLs3jTf0/S/O/JT9PMeDDFg6qSBaRoMwf1vBW7sQoJaZC3CXJjIlV1hriD2KBqu4AoBSRIiJZZY+a8OUcWD9gALGrgBcSBbEJxGGSvJhotT4zSxzGhRjuFnSKuICbBLEhxAuF8tBEtc1G2cQEtS+0PkvGYav153iyfr9KXw8UuclsNf8bkZCr5se0ikRJqRBTILYuFKfEQKwFsYs8NzFKbTOySMSJGbCRKRKU8VtDnCCUhAer+LHCLFlYgtq+JF8+kC+2USTmxqjxvgJRUoSqPtgpPq8/fpgLdlkoYScP8Ajlo6MHchEIQ0JVuWPPhZLkRDXPB2lBcIJqLU6R5sWp7XFLYV64Um8JsYe8MFG9Fk8pgJtTxY9nSQviklRx4kU5vMg4VTz4MhANOCAEMIECjkwwEeQAcWtXQxf8pZoJAzwgA9lACJzVmoEVqf0zEnhNBEXgD4iEQD64Lrh/VggKof7LoFZ1dQZZ/bOF/StywVOI80EUyIO/Ff2rJIPeUsATqBH/wzsPDj6MNw8O5fy/1w9ov2nYUBOt1igGPDLpA5bEUGIIMYIYRnTAjfEA3A+PhtcgONxwFu4zkMc3e8JTQhvhEeE6oZ1we4J4ruyHKEeBdsgfpq5F5ve1wG0hpycejPtDdsiMG+DGwBn3gH7YeCD07Am1HHXcyqowf+D+WwbfPQ21HdmVjJKHkIPI9j+u1HLU8hxkUdb6+/qoYs0crDdncOZH/5zvqi+A96gfLbGF2H7sLHYCO48dxhoAEzuGNWIt2BElHtxdT/p314C3hP54ciGP+B/+Bp6sspJy11rXTtfPqrkC4ZQC5cHjTJROlYmzRQVMNnw7CJlcCd9lGNPN1c0NAOW7RvX39Ta+/x2CGLR80837HQD/Y319fYe+6SKPAbDXGx7/g9909iwAdDQBOHeQr5AVqnS48kKA/xJ0eNKMgBmwAvYwHzfgBfxAEAgFkSAWJIE0MB5GL4L7XAYmg+lgDigGpWAZWA0qwAawGWwHu8A+0AAOgxPgDLgILoPr4C7cPR3gJegG70AvgiAkhIYwECPEHLFBnBA3hIUEIKFINJKApCEZSDYiQRTIdGQeUoqsQCqQTUgNshc5iJxAziNtyG3kIdKJvEE+oRhKRfVQU9QWHY6yUDYahSah49BsdBJahM5Hl6DlaDW6E61HT6AX0etoO/oS7cEApokZYBaYM8bCOFgslo5lYTJsJlaClWHVWB3WBJ/zVawd68I+4kScgTNxZ7iDI/BknI9Pwmfii/EKfDtej5/Cr+IP8W78K4FGMCE4EXwJXMJoQjZhMqGYUEbYSjhAOA3PUgfhHZFINCDaEb3hWUwj5hCnERcT1xF3E48T24iPiT0kEsmI5ETyJ8WSeKQCUjFpLWkn6RjpCqmD9EFDU8Ncw00jTCNdQ6IxV6NMY4fGUY0rGs80esnaZBuyLzmWLCBPJS8lbyE3kS+RO8i9FB2KHcWfkkTJocyhlFPqKKcp9yhvNTU1LTV9NOM1xZqzNcs192ie03yo+ZGqS3WkcqhjqQrqEuo26nHqbepbGo1mSwuipdMKaEtoNbSTtAe0D1oMLRctrpZAa5ZWpVa91hWtV3Qy3YbOpo+nF9HL6Pvpl+hd2mRtW22ONk97pnal9kHtm9o9OgydETqxOvk6i3V26JzXea5L0rXVDdUV6M7X3ax7UvcxA2NYMTgMPmMeYwvjNKNDj6hnp8fVy9Er1dul16rXra+r76Gfoj9Fv1L/iH67AWZga8A1yDNYarDP4IbBpyGmQ9hDhEMWDakbcmXIe8OhhkGGQsMSw92G1w0/GTGNQo1yjZYbNRjdN8aNHY3jjScbrzc+bdw1VG+o31D+0JKh+4beMUFNHE0STKaZbDZpMekxNTMNN5WarjU9adplZmAWZJZjtsrsqFmnOcM8wFxsvsr8mPkLpj6TzcxjljNPMbstTCwiLBQWmyxaLXot7SyTLeda7ra8b0WxYlllWa2yarbqtja3HmU93brW+o4N2YZlI7JZY3PW5r2tnW2q7QLbBtvndoZ2XLsiu1q7e/Y0+0D7SfbV9tcciA4sh1yHdQ6XHVFHT0eRY6XjJSfUyctJ7LTOqW0YYZjPMMmw6mE3nanObOdC51rnhy4GLtEuc10aXF4Ntx6ePnz58LPDv7p6uua5bnG9O0J3ROSIuSOaRrxxc3Tju1W6XXOnuYe5z3JvdH/t4eQh9FjvccuT4TnKc4Fns+cXL28vmVedV6e3tXeGd5X3TZYeK461mHXOh+AT7DPL57DPR18v3wLffb5/+jn75frt8Hs+0m6kcOSWkY/9Lf15/pv82wOYARkBGwPaAy0CeYHVgY+CrIIEQVuDnrEd2DnsnexXwa7BsuADwe85vpwZnOMhWEh4SElIa6huaHJoReiDMMuw7LDasO5wz/Bp4ccjCBFREcsjbnJNuXxuDbc70jtyRuSpKGpUYlRF1KNox2hZdNModFTkqJWj7sXYxEhiGmJBLDd2Zez9OLu4SXGH4onxcfGV8U8TRiRMTzibyEickLgj8V1ScNLSpLvJ9smK5OYUesrYlJqU96khqStS20cPHz1j9MU04zRxWmM6KT0lfWt6z5jQMavHdIz1HFs89sY4u3FTxp0fbzw+b/yRCfQJvAn7MwgZqRk7Mj7zYnnVvJ5MbmZVZjefw1/DfykIEqwSdAr9hSuEz7L8s1ZkPc/2z16Z3SkKFJWJusQccYX4dU5Ezoac97mxudty+/JS83bna+Rn5B+U6EpyJacmmk2cMrFN6iQtlrZP8p20elK3LEq2VY7Ix8kbC/TgR32Lwl7xk+JhYUBhZeGHySmT90/RmSKZ0jLVceqiqc+Kwop+mYZP409rnm4xfc70hzPYMzbNRGZmzmyeZTVr/qyO2eGzt8+hzMmd89tc17kr5v41L3Ve03zT+bPnP/4p/KfaYq1iWfHNBX4LNizEF4oXti5yX7R20dcSQcmFUtfSstLPi/mLL/w84ufyn/uWZC1pXeq1dP0y4jLJshvLA5dvX6GzomjF45WjVtavYq4qWfXX6gmrz5d5lG1YQ1mjWNNeHl3euNZ67bK1nytEFdcrgyt3V5lULap6v06w7
sr6oPV1G0w3lG74tFG88dam8E311bbVZZuJmws3P92SsuXsL6xfarYaby3d+mWbZFv79oTtp2q8a2p2mOxYWovWKmo7d47deXlXyK7GOue6TbsNdpfuAXsUe17szdh7Y1/Uvub9rP11v9r8WnWAcaCkHqmfWt/dIGpob0xrbDsYebC5ya/pwCGXQ9sOWxyuPKJ/ZOlRytH5R/uOFR3rOS493nUi+8Tj5gnNd0+OPnntVPyp1tNRp8+dCTtz8iz77LFz/ucOn/c9f/AC60LDRa+L9S2eLQd+8/ztQKtXa/0l70uNl30uN7WNbDt6JfDKiashV89c4167eD3metuN5Bu3bo692X5LcOv57bzbr+8U3um9O/se4V7Jfe37ZQ9MHlT/7vD77nav9iMPQx62PEp8dPcx//HLJ/InnzvmP6U9LXtm/qzmudvzw51hnZdfjHnR8VL6srer+A+dP6pe2b/69c+gP1u6R3d3vJa97nuz+K3R221/efzV3BPX8+Bd/rve9yUfjD5s/8j6ePZT6qd
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"execution_count": 16,
"metadata": {
"image/png": {
"width": 800
}
},
"output_type": "execute_result"
}
],
"source": [
"Image(filename=\"img/argilla-dataset.png\", width=800)"
]
},
{
"cell_type": "markdown",
"id": "311cf33b",
"metadata": {},
"source": [
"After uploading the dataset, head over to the Argilla UI and validate and/or adjust the summaries we pulled from the ISW site. You can also check out the [`argilla` docs](https://docs.argilla.io/en/latest/reference/python/index.html) for more information on all of the exciting tools Argilla provides to help you evaluate and refine your training data!"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "10ad87d3",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAC84AAAZyCAYAAABb5ly0AAAMbmlDQ1BJQ0MgUHJvZmlsZQAASImVVwdYU8kWnluSkJAQIICAlNCbIFIDSAmhBZBeBBshCSSUGBOCir0sKrh2EQEbuiqi2FZA7NiVRbH3xYKKsi7qYkPlTUhA133le+f75t4/Z878p9yZ3HsAoH/gSaV5qDYA+ZICWUJ4MHN0WjqT9BQQAR2ggAx8eXy5lB0XFw2gDNz/Lu9uAER5v+qs5Prn/H8VXYFQzgcAGQtxpkDOz4f4OAB4FV8qKwCAqNRbTS6QKvFsiPVkMECIVylxtgpvV+JMFT7cb5OUwIH4MgAaVB5Plg2A1j2oZxbysyGP1meIXSUCsQQA+jCIA/gingBiZezD8vMnKnE5xPbQXgoxjAewMr/jzP4bf+YgP4+XPYhVefWLRohYLs3jTf0/S/O/JT9PMeDDFg6qSBaRoMwf1vBW7sQoJaZC3CXJjIlV1hriD2KBqu4AoBSRIiJZZY+a8OUcWD9gALGrgBcSBbEJxGGSvJhotT4zSxzGhRjuFnSKuICbBLEhxAuF8tBEtc1G2cQEtS+0PkvGYav153iyfr9KXw8UuclsNf8bkZCr5se0ikRJqRBTILYuFKfEQKwFsYs8NzFKbTOySMSJGbCRKRKU8VtDnCCUhAer+LHCLFlYgtq+JF8+kC+2USTmxqjxvgJRUoSqPtgpPq8/fpgLdlkoYScP8Ajlo6MHchEIQ0JVuWPPhZLkRDXPB2lBcIJqLU6R5sWp7XFLYV64Um8JsYe8MFG9Fk8pgJtTxY9nSQviklRx4kU5vMg4VTz4MhANOCAEMIECjkwwEeQAcWtXQxf8pZoJAzwgA9lACJzVmoEVqf0zEnhNBEXgD4iEQD64Lrh/VggKof7LoFZ1dQZZ/bOF/StywVOI80EUyIO/Ff2rJIPeUsATqBH/wzsPDj6MNw8O5fy/1w9ov2nYUBOt1igGPDLpA5bEUGIIMYIYRnTAjfEA3A+PhtcgONxwFu4zkMc3e8JTQhvhEeE6oZ1we4J4ruyHKEeBdsgfpq5F5ve1wG0hpycejPtDdsiMG+DGwBn3gH7YeCD07Am1HHXcyqowf+D+WwbfPQ21HdmVjJKHkIPI9j+u1HLU8hxkUdb6+/qoYs0crDdncOZH/5zvqi+A96gfLbGF2H7sLHYCO48dxhoAEzuGNWIt2BElHtxdT/p314C3hP54ciGP+B/+Bp6sspJy11rXTtfPqrkC4ZQC5cHjTJROlYmzRQVMNnw7CJlcCd9lGNPN1c0NAOW7RvX39Ta+/x2CGLR80837HQD/Y319fYe+6SKPAbDXGx7/g9909iwAdDQBOHeQr5AVqnS48kKA/xJ0eNKMgBmwAvYwHzfgBfxAEAgFkSAWJIE0MB5GL4L7XAYmg+lgDigGpWAZWA0qwAawGWwHu8A+0AAOgxPgDLgILoPr4C7cPR3gJegG70AvgiAkhIYwECPEHLFBnBA3hIUEIKFINJKApCEZSDYiQRTIdGQeUoqsQCqQTUgNshc5iJxAziNtyG3kIdKJvEE+oRhKRfVQU9QWHY6yUDYahSah49BsdBJahM5Hl6DlaDW6E61HT6AX0etoO/oS7cEApokZYBaYM8bCOFgslo5lYTJsJlaClWHVWB3WBJ/zVawd68I+4kScgTNxZ7iDI/BknI9Pwmfii/EKfDtej5/Cr+IP8W78K4FGMCE4EXwJXMJoQjZhMqGYUEbYSjhAOA3PUgfhHZFINCDaEb3hWUwj5hCnERcT1xF3E48T24iPiT0kEsmI5ETyJ8WSeKQCUjFpLWkn6RjpCqmD9EFDU8Ncw00jTCNdQ6IxV6NMY4fGUY0rGs80esnaZBuyLzmWLCBPJS8lbyE3kS+RO8i9FB2KHcWfkkTJocyhlFPqKKcp9yhvNTU1LTV9NOM1xZqzNcs192ie03yo+ZGqS3WkcqhjqQrqEuo26nHqbepbGo1mSwuipdMKaEtoNbSTtAe0D1oMLRctrpZAa5ZWpVa91hWtV3Qy3YbOpo+nF9HL6Pvpl+hd2mRtW22ONk97pnal9kHtm9o9OgydETqxOvk6i3V26JzXea5L0rXVDdUV6M7X3ax7UvcxA2NYMTgMPmMeYwvjNKNDj6hnp8fVy9Er1dul16rXra+r76Gfoj9Fv1L/iH67AWZga8A1yDNYarDP4IbBpyGmQ9hDhEMWDakbcmXIe8OhhkGGQsMSw92G1w0/GTGNQo1yjZYbNRjdN8aNHY3jjScbrzc+bdw1VG+o31D+0JKh+4beMUFNHE0STKaZbDZpMekxNTMNN5WarjU9adplZmAWZJZjtsrsqFmnOcM8wFxsvsr8mPkLpj6TzcxjljNPMbstTCwiLBQWmyxaLXot7SyTLeda7ra8b0WxYlllWa2yarbqtja3HmU93brW+o4N2YZlI7JZY3PW5r2tnW2q7QLbBtvndoZ2XLsiu1q7e/Y0+0D7SfbV9tcciA4sh1yHdQ6XHVFHT0eRY6XjJSfUyctJ7LTOqW0YYZjPMMmw6mE3nanObOdC51rnhy4GLtEuc10aXF4Ntx6ePnz58LPDv7p6uua5bnG9O0J3ROSIuSOaRrxxc3Tju1W6XXOnuYe5z3JvdH/t4eQh9FjvccuT4TnKc4Fns+cXL28vmVedV6e3tXeGd5X3TZYeK461mHXOh+AT7DPL57DPR18v3wLffb5/+jn75frt8Hs+0m6kcOSWkY/9Lf15/pv82wOYARkBGwPaAy0CeYHVgY+CrIIEQVuDnrEd2DnsnexXwa7BsuADwe85vpwZnOMhWEh4SElIa6huaHJoReiDMMuw7LDasO5wz/Bp4ccjCBFREcsjbnJNuXxuDbc70jtyRuSpKGpUYlRF1KNox2hZdNModFTkqJWj7sXYxEhiGmJBLDd2Zez9OLu4SXGH4onxcfGV8U8TRiRMTzibyEickLgj8V1ScNLSpLvJ9smK5OYUesrYlJqU96khqStS20cPHz1j9MU04zRxWmM6KT0lfWt6z5jQMavHdIz1HFs89sY4u3FTxp0fbzw+b/yRCfQJvAn7MwgZqRk7Mj7zYnnVvJ5MbmZVZjefw1/DfykIEqwSdAr9hSuEz7L8s1ZkPc/2z16Z3SkKFJWJusQccYX4dU5Ezoac97mxudty+/JS83bna+Rn5B+U6EpyJacmmk2cMrFN6iQtlrZP8p20elK3LEq2VY7Ix8kbC/TgR32Lwl7xk+JhYUBhZeGHySmT90/RmSKZ0jLVceqiqc+Kwop+mYZP409rnm4xfc70hzPYMzbNRGZmzmyeZTVr/qyO2eGzt8+hzMmd89tc17kr5v41L3Ve03zT+bPnP/4p/KfaYq1iWfHNBX4LNizEF4oXti5yX7R20dcSQcmFUtfSstLPi/mLL/w84ufyn/uWZC1pXeq1dP0y4jLJshvLA5dvX6GzomjF45WjVtavYq4qWfXX6gmrz5d5lG1YQ1mjWNNeHl3euNZ67bK1nytEFdcrgyt3V5lULap6v06w7
sr6oPV1G0w3lG74tFG88dam8E311bbVZZuJmws3P92SsuXsL6xfarYaby3d+mWbZFv79oTtp2q8a2p2mOxYWovWKmo7d47deXlXyK7GOue6TbsNdpfuAXsUe17szdh7Y1/Uvub9rP11v9r8WnWAcaCkHqmfWt/dIGpob0xrbDsYebC5ya/pwCGXQ9sOWxyuPKJ/ZOlRytH5R/uOFR3rOS493nUi+8Tj5gnNd0+OPnntVPyp1tNRp8+dCTtz8iz77LFz/ucOn/c9f/AC60LDRa+L9S2eLQd+8/ztQKtXa/0l70uNl30uN7WNbDt6JfDKiashV89c4167eD3metuN5Bu3bo692X5LcOv57bzbr+8U3um9O/se4V7Jfe37ZQ9MHlT/7vD77nav9iMPQx62PEp8dPcx//HLJ/InnzvmP6U9LXtm/qzmudvzw51hnZdfjHnR8VL6srer+A+dP6pe2b/69c+gP1u6R3d3vJa97nuz+K3R221/efzV3BPX8+Bd/rve9yUfjD5s/8j6ePZT6qd
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"execution_count": 17,
"metadata": {
"image/png": {
"width": 800
}
},
"output_type": "execute_result"
}
],
"source": [
"Image(filename=\"img/argilla-annotation.png\", width=800)"
]
},
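{
"cell_type": "markdown",
"id": "c5d93e4a",
"metadata": {},
"source": [
"If you want to train only on records you’ve validated in the UI, one option is to load the dataset back from Argilla and filter on the `status` column shown in the DataFrame preview above. A minimal sketch (the training step below simply loads the full dataset):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d6ea4f5b",
"metadata": {},
"outputs": [],
"source": [
"# Pull the dataset back from Argilla and keep only validated records.\n",
"validated_df = rg.load(\"isw-summarization\").to_pandas()\n",
"validated_df = validated_df[validated_df[\"status\"] == \"Validated\"]\n",
"validated_df.head()"
]
},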
{
"cell_type": "markdown",
"id": "2083edaf",
"metadata": {},
"source": [
"## Section 3: Model Training with `transformers` <a id=\"training\"></a>\n",
"\n",
"After refining our traning data in Argilla, we're ready to fine-tune our model using the `transformers` library. Luckily, `argilla` has a utility for converting datasets to a `dataset.Dataset`, which is the format required by the `transformers` `Trainer` object. In this example, we'll train a `t5-small` model to keep the runtime for the notebook reasonable. You can play around with larger models to get higher quality results."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "b6db0bdd",
"metadata": {},
"outputs": [],
"source": [
"training_data = rg.load(\"isw-summarization\").to_datasets()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "3ebe5d04",
"metadata": {},
"outputs": [],
"source": [
"model_checkpoint = \"t5-small\""
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "c21113d4",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/mrobinson/.pyenv/versions/3.8.13/envs/argilla/lib/python3.8/site-packages/transformers/models/t5/tokenization_t5_fast.py:155: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.\n",
"For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.\n",
"- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.\n",
"- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.\n",
"- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.\n",
" warnings.warn(\n"
]
}
],
"source": [
"from transformers import AutoTokenizer\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "7919d660",
"metadata": {},
"outputs": [],
"source": [
"max_input_length = 1024\n",
"max_target_length = 128\n",
"\n",
"\n",
"def preprocess_function(examples):\n",
" inputs = [doc for doc in examples[\"text\"]]\n",
" model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)\n",
"\n",
" # Setup the tokenizer for targets\n",
" with tokenizer.as_target_tokenizer():\n",
" labels = tokenizer(examples[\"annotation\"], max_length=max_target_length, truncation=True)\n",
"\n",
" model_inputs[\"labels\"] = labels[\"input_ids\"]\n",
" return model_inputs"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "ea639902",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "5bb99903cdfb498f820c95eaa2b6114f",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/1 [00:00<?, ?ba/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/mrobinson/.pyenv/versions/3.8.13/envs/argilla/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:3578: UserWarning: `as_target_tokenizer` is deprecated and will be removed in v5 of Transformers. You can tokenize your labels by using the argument `text_target` of the regular `__call__` method (either in the same call as your input texts if you use the same keyword arguments, or in a separate call.\n",
" warnings.warn(\n"
]
}
],
"source": [
"tokenized_datasets = training_data.map(preprocess_function, batched=True)"
]
},
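{
"cell_type": "markdown",
"id": "e7fb5a6c",
"metadata": {},
"source": [
"The `UserWarning` above comes from the deprecated `as_target_tokenizer` context manager. On newer `transformers` releases you can tokenize the summaries directly with the `text_target` argument instead; here’s a sketch of equivalent preprocessing (same tokenizer and length limits; the `preprocess_function_v2` name is just for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f8ac6b7d",
"metadata": {},
"outputs": [],
"source": [
"def preprocess_function_v2(examples):\n",
"    # Tokenize the report bodies as model inputs.\n",
"    model_inputs = tokenizer(examples[\"text\"], max_length=max_input_length, truncation=True)\n",
"\n",
"    # Tokenize the summaries as targets via `text_target`, which replaces\n",
"    # the deprecated `as_target_tokenizer` context manager.\n",
"    labels = tokenizer(\n",
"        text_target=examples[\"annotation\"], max_length=max_target_length, truncation=True\n",
"    )\n",
"\n",
"    model_inputs[\"labels\"] = labels[\"input_ids\"]\n",
"    return model_inputs"
]
},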
{
"cell_type": "code",
"execution_count": 23,
"id": "b7839a1d",
"metadata": {},
"outputs": [],
"source": [
"from transformers import (\n",
" AutoModelForSeq2SeqLM,\n",
" DataCollatorForSeq2Seq,\n",
" Seq2SeqTrainingArguments,\n",
" Seq2SeqTrainer,\n",
")\n",
"\n",
"model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "7d8c62b2",
"metadata": {},
"outputs": [],
"source": [
"batch_size = 16\n",
"model_name = model_checkpoint.split(\"/\")[-1]\n",
"args = Seq2SeqTrainingArguments(\n",
" \"t5-small-isw-summaries\",\n",
" evaluation_strategy=\"epoch\",\n",
" learning_rate=2e-5,\n",
" per_device_train_batch_size=batch_size,\n",
" per_device_eval_batch_size=batch_size,\n",
" weight_decay=0.01,\n",
" save_total_limit=3,\n",
" num_train_epochs=1,\n",
" predict_with_generate=True,\n",
" fp16=False,\n",
" push_to_hub=False,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "a1717994",
"metadata": {},
"outputs": [],
"source": [
"data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "555b18d7",
"metadata": {},
"outputs": [],
"source": [
"trainer = Seq2SeqTrainer(\n",
" model,\n",
" args,\n",
" train_dataset=tokenized_datasets,\n",
" eval_dataset=tokenized_datasets,\n",
" data_collator=data_collator,\n",
" tokenizer=tokenizer,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "4b147430",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: metadata, status, metrics, prediction_agent, prediction, annotation_agent, id, event_timestamp, annotation, text. If metadata, status, metrics, prediction_agent, prediction, annotation_agent, id, event_timestamp, annotation, text are not expected by `T5ForConditionalGeneration.forward`, you can safely ignore this message.\n",
"/Users/mrobinson/.pyenv/versions/3.8.13/envs/argilla/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
" warnings.warn(\n",
"***** Running training *****\n",
" Num examples = 285\n",
" Num Epochs = 1\n",
" Instantaneous batch size per device = 16\n",
" Total train batch size (w. parallel, distributed & accumulation) = 16\n",
" Gradient Accumulation steps = 1\n",
" Total optimization steps = 18\n",
" Number of trainable parameters = 60506624\n",
"You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n"
]
},
{
"data": {
"text/html": [
"\n",
" <div>\n",
" \n",
" <progress value='18' max='18' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
" [18/18 06:24, Epoch 1/1]\n",
" </div>\n",
" <table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>Epoch</th>\n",
" <th>Training Loss</th>\n",
" <th>Validation Loss</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>No log</td>\n",
" <td>3.969428</td>\n",
" </tr>\n",
" </tbody>\n",
"</table><p>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: metadata, status, metrics, prediction_agent, prediction, annotation_agent, id, event_timestamp, annotation, text. If metadata, status, metrics, prediction_agent, prediction, annotation_agent, id, event_timestamp, annotation, text are not expected by `T5ForConditionalGeneration.forward`, you can safely ignore this message.\n",
"***** Running Evaluation *****\n",
" Num examples = 285\n",
" Batch size = 16\n",
"\n",
"\n",
"Training completed. Do not forget to share your model on huggingface.co/models =)\n",
"\n",
"\n"
]
},
{
"data": {
"text/plain": [
"TrainOutput(global_step=18, training_loss=4.286832173665364, metrics={'train_runtime': 405.4928, 'train_samples_per_second': 0.703, 'train_steps_per_second': 0.044, 'total_flos': 77144826839040.0, 'train_loss': 4.286832173665364, 'epoch': 1.0})"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.train()"
]
},
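{
"cell_type": "markdown",
"id": "9e0c4f3a",
"metadata": {},
"source": [
"The run above only reports the loss. For summarization it is common to track ROUGE as well; a sketch using the `evaluate` package (an assumption — it is not used elsewhere in this notebook and needs `pip install evaluate rouge_score`):\n",
"\n",
"```python\n",
"import evaluate\n",
"\n",
"rouge = evaluate.load(\"rouge\")\n",
"\n",
"def compute_metrics(eval_pred):\n",
"    predictions, labels = eval_pred\n",
"    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)\n",
"    # Swap the -100s the collator wrote into the labels back to pad tokens\n",
"    labels = [[t if t != -100 else tokenizer.pad_token_id for t in seq] for seq in labels]\n",
"    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)\n",
"    return rouge.compute(predictions=decoded_preds, references=decoded_labels)\n",
"```\n",
"\n",
"Passing `compute_metrics=compute_metrics` to the `Seq2SeqTrainer` (together with `predict_with_generate=True`, already set above) would add ROUGE columns to the evaluation table."
]
},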
{
"cell_type": "code",
"execution_count": 28,
"id": "2e7eab18",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Saving model checkpoint to t5-small-isw-summaries\n",
"Configuration saved in t5-small-isw-summaries/config.json\n",
"Model weights saved in t5-small-isw-summaries/pytorch_model.bin\n",
"tokenizer config file saved in t5-small-isw-summaries/tokenizer_config.json\n",
"Special tokens file saved in t5-small-isw-summaries/special_tokens_map.json\n"
]
}
],
"source": [
"trainer.save_model(\"t5-small-isw-summaries\")"
]
},
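{
"cell_type": "markdown",
"id": "7d2b8c1e",
"metadata": {},
"source": [
"`save_model` writes the weights, config, and tokenizer files to the local directory shown above. If you wanted to share the checkpoint, you could also push it to the Hugging Face Hub; a sketch, assuming you have a Hub account and the `huggingface_hub` package installed (not run here):\n",
"\n",
"```python\n",
"from huggingface_hub import notebook_login\n",
"\n",
"notebook_login()  # prompts for a Hub access token\n",
"model.push_to_hub(\"t5-small-isw-summaries\")\n",
"tokenizer.push_to_hub(\"t5-small-isw-summaries\")\n",
"```"
]
},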
{
"cell_type": "code",
"execution_count": 29,
"id": "6a42a7bd",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"loading configuration file ./t5-small-isw-summaries/config.json\n",
"Model config T5Config {\n",
" \"_name_or_path\": \"./t5-small-isw-summaries\",\n",
" \"architectures\": [\n",
" \"T5ForConditionalGeneration\"\n",
" ],\n",
" \"d_ff\": 2048,\n",
" \"d_kv\": 64,\n",
" \"d_model\": 512,\n",
" \"decoder_start_token_id\": 0,\n",
" \"dense_act_fn\": \"relu\",\n",
" \"dropout_rate\": 0.1,\n",
" \"eos_token_id\": 1,\n",
" \"feed_forward_proj\": \"relu\",\n",
" \"initializer_factor\": 1.0,\n",
" \"is_encoder_decoder\": true,\n",
" \"is_gated_act\": false,\n",
" \"layer_norm_epsilon\": 1e-06,\n",
" \"model_type\": \"t5\",\n",
" \"n_positions\": 512,\n",
" \"num_decoder_layers\": 6,\n",
" \"num_heads\": 8,\n",
" \"num_layers\": 6,\n",
" \"output_past\": true,\n",
" \"pad_token_id\": 0,\n",
" \"relative_attention_max_distance\": 128,\n",
" \"relative_attention_num_buckets\": 32,\n",
" \"task_specific_params\": {\n",
" \"summarization\": {\n",
" \"early_stopping\": true,\n",
" \"length_penalty\": 2.0,\n",
" \"max_length\": 200,\n",
" \"min_length\": 30,\n",
" \"no_repeat_ngram_size\": 3,\n",
" \"num_beams\": 4,\n",
" \"prefix\": \"summarize: \"\n",
" },\n",
" \"translation_en_to_de\": {\n",
" \"early_stopping\": true,\n",
" \"max_length\": 300,\n",
" \"num_beams\": 4,\n",
" \"prefix\": \"translate English to German: \"\n",
" },\n",
" \"translation_en_to_fr\": {\n",
" \"early_stopping\": true,\n",
" \"max_length\": 300,\n",
" \"num_beams\": 4,\n",
" \"prefix\": \"translate English to French: \"\n",
" },\n",
" \"translation_en_to_ro\": {\n",
" \"early_stopping\": true,\n",
" \"max_length\": 300,\n",
" \"num_beams\": 4,\n",
" \"prefix\": \"translate English to Romanian: \"\n",
" }\n",
" },\n",
" \"torch_dtype\": \"float32\",\n",
" \"transformers_version\": \"4.25.1\",\n",
" \"use_cache\": true,\n",
" \"vocab_size\": 32128\n",
"}\n",
"\n",
"loading configuration file ./t5-small-isw-summaries/config.json\n",
"Model config T5Config {\n",
" \"_name_or_path\": \"./t5-small-isw-summaries\",\n",
" \"architectures\": [\n",
" \"T5ForConditionalGeneration\"\n",
" ],\n",
" \"d_ff\": 2048,\n",
" \"d_kv\": 64,\n",
" \"d_model\": 512,\n",
" \"decoder_start_token_id\": 0,\n",
" \"dense_act_fn\": \"relu\",\n",
" \"dropout_rate\": 0.1,\n",
" \"eos_token_id\": 1,\n",
" \"feed_forward_proj\": \"relu\",\n",
" \"initializer_factor\": 1.0,\n",
" \"is_encoder_decoder\": true,\n",
" \"is_gated_act\": false,\n",
" \"layer_norm_epsilon\": 1e-06,\n",
" \"model_type\": \"t5\",\n",
" \"n_positions\": 512,\n",
" \"num_decoder_layers\": 6,\n",
" \"num_heads\": 8,\n",
" \"num_layers\": 6,\n",
" \"output_past\": true,\n",
" \"pad_token_id\": 0,\n",
" \"relative_attention_max_distance\": 128,\n",
" \"relative_attention_num_buckets\": 32,\n",
" \"task_specific_params\": {\n",
" \"summarization\": {\n",
" \"early_stopping\": true,\n",
" \"length_penalty\": 2.0,\n",
" \"max_length\": 200,\n",
" \"min_length\": 30,\n",
" \"no_repeat_ngram_size\": 3,\n",
" \"num_beams\": 4,\n",
" \"prefix\": \"summarize: \"\n",
" },\n",
" \"translation_en_to_de\": {\n",
" \"early_stopping\": true,\n",
" \"max_length\": 300,\n",
" \"num_beams\": 4,\n",
" \"prefix\": \"translate English to German: \"\n",
" },\n",
" \"translation_en_to_fr\": {\n",
" \"early_stopping\": true,\n",
" \"max_length\": 300,\n",
" \"num_beams\": 4,\n",
" \"prefix\": \"translate English to French: \"\n",
" },\n",
" \"translation_en_to_ro\": {\n",
" \"early_stopping\": true,\n",
" \"max_length\": 300,\n",
" \"num_beams\": 4,\n",
" \"prefix\": \"translate English to Romanian: \"\n",
" }\n",
" },\n",
" \"torch_dtype\": \"float32\",\n",
" \"transformers_version\": \"4.25.1\",\n",
" \"use_cache\": true,\n",
" \"vocab_size\": 32128\n",
"}\n",
"\n",
"loading weights file ./t5-small-isw-summaries/pytorch_model.bin\n",
"All model checkpoint weights were used when initializing T5ForConditionalGeneration.\n",
"\n",
"All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at ./t5-small-isw-summaries.\n",
"If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.\n",
"loading file spiece.model\n",
"loading file tokenizer.json\n",
"loading file added_tokens.json\n",
"loading file special_tokens_map.json\n",
"loading file tokenizer_config.json\n"
]
}
],
"source": [
"summarization_model = pipeline(\n",
" task=\"summarization\",\n",
" model=\"./t5-small-isw-summaries\",\n",
")"
]
},
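{
"cell_type": "markdown",
"id": "3f6a9d0b",
"metadata": {},
"source": [
"The long configuration dump above is `transformers` logging the checkpoint files as it loads them. If you prefer quieter output, you can lower the library's log level before constructing the pipeline; a small sketch:\n",
"\n",
"```python\n",
"from transformers.utils import logging\n",
"\n",
"logging.set_verbosity_error()  # only surface errors while loading models\n",
"```"
]
},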
{
"cell_type": "markdown",
"id": "ea363821",
"metadata": {},
"source": [
"Now that our model is trained, we can save it locally and use our `unstructured` helper functions to grab future reports for inference!"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "6d6843b6",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Token indices sequence length is longer than the specified maximum sequence length for this model (3726 > 512). Running this sequence through the model will result in indexing errors\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Russian forces continue to attack Bakhmut and various villages near Donetsk City . the Russians are apparently directing some of the very limited reserves available in Ukraine to these efforts rather than to the vulnerable Russian defensive lines hastily thrown up . Russian sources claimed that Russian forces are repelled a Ukrainian ground attack on Pravdyne .\n"
]
}
],
"source": [
"elements = url_to_elements(urls[200])\n",
"narrative_text = get_narrative(elements)\n",
"results = summarization_model(str(narrative_text), max_length=100)\n",
"print(results[0][\"summary_text\"])"
]
},
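{
"cell_type": "markdown",
"id": "8a4e2c7d",
"metadata": {},
"source": [
"Note the warning above: this report is 3726 tokens long, well past the nominal 512-token window of `t5-small`. T5 uses relative position embeddings, so the sequence still runs, but quality can degrade on inputs far beyond the training length. One workaround is to summarize the report in chunks and stitch the pieces together; a rough sketch (the character-based chunk size is an illustrative stand-in for proper token counting):\n",
"\n",
"```python\n",
"def summarize_long_text(text, chunk_chars=2000, max_length=100):\n",
"    chunks = [text[i : i + chunk_chars] for i in range(0, len(text), chunk_chars)]\n",
"    summaries = summarization_model(chunks, max_length=max_length, truncation=True)\n",
"    return \" \".join(s[\"summary_text\"] for s in summaries)\n",
"\n",
"print(summarize_long_text(str(narrative_text)))\n",
"```"
]
},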
{
"cell_type": "code",
"execution_count": null,
"id": "530a80ec",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}