unstructured/examples/chipper-local-inference/chipper-local-inference.ipynb
Yuming Long ce40cdc55f
Chore (refactor): support table extraction with pre-computed ocr data (#1801)
### Summary

Table OCR refactor: move the OCR step for the table model from the
inference repo to the unst repo.
* Before this PR, the table model extracted OCR tokens (text and
bounding boxes) and filled them into the table structure inside the
inference repo. This meant running an additional OCR pass for tables.
* After this PR, we reuse the OCR data from the entire-page OCR and pass
the OCR tokens to the inference repo, so OCR runs only once for the
entire document.
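
The single-OCR flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from either repo: the `OcrToken` type and the `tokens_for_table` helper are hypothetical names, and the token layout (text plus an `(x1, y1, x2, y2)` box in page coordinates) is an assumption. It shows page-level OCR tokens being filtered down to a table region and shifted into the table crop's coordinate frame, so that no second OCR pass over the table is needed.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class OcrToken:
    """Hypothetical container for one OCR word and its page-level box."""

    text: str
    bbox: BBox


def tokens_for_table(page_tokens: List[OcrToken], table_bbox: BBox) -> List[OcrToken]:
    """Keep tokens whose center lies inside the table box and shift them
    into the coordinate frame of the table crop (hypothetical helper)."""
    tx1, ty1, tx2, ty2 = table_bbox
    selected = []
    for token in page_tokens:
        x1, y1, x2, y2 = token.bbox
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        if tx1 <= cx <= tx2 and ty1 <= cy <= ty2:
            shifted = (x1 - tx1, y1 - ty1, x2 - tx1, y2 - ty1)
            selected.append(OcrToken(token.text, shifted))
    return selected


# Tokens as they might come out of one OCR pass over the whole page.
page_tokens = [
    OcrToken("Title", (50, 10, 120, 30)),
    OcrToken("Cell", (110, 210, 150, 230)),
]
table_tokens = tokens_for_table(page_tokens, table_bbox=(100, 200, 400, 500))
print(table_tokens)
```

Selecting by token center is one simple choice; an overlap-ratio threshold would be an equally plausible rule.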

**Tech details:**
* Combined the env variables `ENTIRE_PAGE_OCR` and `TABLE_OCR` into
`OCR_AGENT`. The same OCR agent is now used for the entire page and for
tables, since we only run OCR once.
* Bumped the inference repo to `0.7.9`, which allows the table model in
inference to use pre-computed OCR data from the unst repo. See the
[PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256).
* All notebook lint changes were made by `make tidy`.
* This PR also fixes this
[issue](https://github.com/Unstructured-IO/unstructured/issues/1564);
I've added a test for it in
`test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages`.
* Added the same scaling logic for the image, [similar to the previous Table
OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113),
but the scaling is now applied to the entire image.
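
As a rough sketch of what a single `OCR_AGENT` setting implies, the snippet below reads one value and uses it for both the page and the table. The `resolve_ocr_agent` helper is hypothetical, and the `"tesseract"` default is an assumption for illustration, not something stated in this PR.

```python
import os
from typing import Mapping, Optional


def resolve_ocr_agent(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: read the single OCR_AGENT setting.

    The "tesseract" default is an assumption made for this sketch.
    """
    source = os.environ if env is None else env
    return source.get("OCR_AGENT", "tesseract").strip().lower()


# One setting drives both uses, because OCR runs only once per page.
agent = resolve_ocr_agent({"OCR_AGENT": "Paddle"})
page_ocr_agent = table_ocr_agent = agent
print(page_ocr_agent, table_ocr_agent)
```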

### Test
* Not much to test manually, except that table extraction still works.
* Because of the scaling change and the use of pre-computed OCR data from
the entire page, the table output changes slightly (for the better). Here
is a comparison of the outputs I found from the same test,
`test_partition_image_with_table_extraction`:

screenshot of the table in `layout-parser-paper-with-table.jpg`:
<img width="343" alt="expected"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c">
before refactor:
<img width="709" alt="before"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e">
after refactor:
<img width="705" alt="after"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d">

### TODO
(added as a ticket) There is still some cleanup to do in the inference
repo, since the unst repo now duplicates some of its logic, but we can keep
that logic as a fallback plan. If we want to remove everything OCR-related
from inference, these items are deprecated and can be removed:
*
[`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77)
(already noted in code)
* parameter `extract_tables` in inference
*
[`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88)
*
[`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197)
* env `TABLE_OCR` 

### Note
If we want to fall back to an additional table OCR pass (we may need this
to use paddle for tables), we need to:
* pass `infer_table_structure` to inference through the `extract_tables`
parameter
* stop passing `infer_table_structure` to `ocr.py`

---------

Co-authored-by: Yao You <yao@unstructured.io>
2023-10-21 00:24:23 +00:00


{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "55u3MolNzXCs"
},
"source": [
"# Chipper Local Inference\n",
"This notebook demonstrates and explains how to parse a PDF file locally with the `chipper` model through our main libraries. If you run this notebook in Google Colab, keep in mind that inference with `chipper` on CPU can take a while; switching the runtime from \"CPU\" to a GPU such as `T4` (or any other available one) reduces the runtime and is strongly recommended. You can use the following commands to install the required libraries:\n",
"\n",
"```{python}\n",
"!pip install unstructured --quiet\n",
"!pip install unstructured-inference --quiet\n",
"!apt-get update -y && apt-get install -y poppler-utils --quiet\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GiSKK9xDLxPm"
},
"source": [
"Initialize the variable `filename` with your file path and `model_name` with the model you want to use, chosen from the `MODEL_TYPES` available in each of the [models](https://github.com/Unstructured-IO/unstructured-inference/tree/203f7ab75b1644b938f6bae1e81c8365d274f35d/unstructured_inference/models) scripts (in this case `chipper`). For this notebook we will use `DA-1p.pdf` from our [example-docs](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs):"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "6Y_Rcuvq0Ga5"
},
"outputs": [],
"source": [
"filename = \"../../example-docs/DA-1p.pdf\"\n",
"model_name = \"chipper\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qel-q8I_NhfW"
},
"source": [
"Most users will work through our main Unstructured library, so the highest-level call for local inference with `chipper` is `unstructured.partition.auto.partition`. This function needs `strategy=\"hi_res\"` and `model_name=model_name` to call `chipper`; the additional kwarg `pdf_image_dpi=300` is **necessary for better performance** of `chipper`. Users should see a `WARNING` saying that `chipper` is in beta (*as of 14.08.2023*)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AD4xv3vh17wD"
},
"source": [
"#### [unstructured.partition.auto.partition](https://github.com/Unstructured-IO/unstructured/blob/612f9da6e8e27cffc3e6912928a16daad47903dc/unstructured/partition/auto.py#L73)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "mEsEJ1o1MriM"
},
"outputs": [],
"source": [
"from unstructured.partition.auto import partition"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3tvPQJ07dIa1",
"outputId": "70f60fb3-d22b-4272-d736-b1c177422970"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING:unstructured_inference:The Chipper model is currently in Beta and is not yet ready for production use. You can reach out to the Unstructured engineering team in the Unstructured community Slack if you have any feedback on the Chipper model. You can join the community Slack here: https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-23kciff0y-bvzXxJkgtbXe5POo_rxMkw\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 6min 3s, sys: 17.1 s, total: 6min 20s\n",
"Wall time: 6min 35s\n"
]
}
],
"source": [
"%%time\n",
"elements = partition(filename=filename, strategy=\"hi_res\", model_name=model_name, pdf_image_dpi=300)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our `chipper` model processes an image input and returns the textual content, structured by the categories (document element types) it was fine-tuned on. Therefore, during a call to `partition`, the PDF document is converted to an image and the output element types are standardized to Unstructured [elements](https://unstructured-io.github.io/unstructured/getting_started.html#document-elements).\n",
"\n",
"*Disclaimer:* The `UncategorizedText` elements returned by the `partition` method will soon instead reflect the *category/type* identified by `chipper` (e.g. `Headline`, `Subheadline`, ...)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "RRLI3vyfIHTz",
"outputId": "5443a73b-d2f2-46e1-8157-d040b9facd29"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number of elements: 13\n",
"UncategorizedText\n",
"UncategorizedText\n",
"UncategorizedText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"UncategorizedText\n",
"NarrativeText\n"
]
}
],
"source": [
"# Printing all categories/types\n",
"print(\"number of elements: \", len(elements))\n",
"for element in elements:\n",
"    print(element.category)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "SANC43G5H2Ms",
"outputId": "734f1493-3e3b-4aae-a073-91a37628c82f"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAIN GAME\n",
"CREATURES\n",
"Abomination\n",
"\"We arrived in the dead of night. We had been tracking the maleficar for days, and finally had him cornered... or so we thought.\n",
"As we approached, a home on the edge of the town exploded, sending splinters of wood and fist-sized chunks of rocks into our ranks. We had but moments to regroup before fire rained from the sky, the sounds of destruction wrapped in a hideous laughter from the center of the village.\n",
"There, perched atop the spire of the village concurty, stood the mage. But he was human no longer.\n",
"We shooted prayers to the Maker and deflcted what magic we could, but as we fought, the creature fought harder. I saw my comrades fall, burned by the flaming sky or crushed by debris. The tomorrows creature, looking as if a demon were wearing a man like a twisted suit of skin, spotted me and grinded. We had forced it to this, I realized; the mage had made this pact, given himself over to the demon to survive our assault.\"\n",
"—Transscribed from a tale told by a former exemplar in Cumberland, 8:84 Blessed.\n",
"It is known that images are able to walk the Fade while completely aware of their surroundings, unlike most others who may only enter the realm as dreamers and leave it scarcely aware of their experience. Demons are drawn to images, though whether it is because of this awareness or simply by virtue of their magical power in our world is unknown.\n",
"Regardless of the reason, a demon always attempts to possess a mage when it encounters one—by force or by making some kind of deal depending on the strength of the mage. Should the demon get the upper hand, the result is an unholy union known as an abomination. Abominations have been responsible for some of the worst cataclyms in history, and the notion that some mage in a remote tower could turn into such a creature unbeknownst to any was the driving force behind the creation of the Circle of Magi.\n",
"Thankfully, abominations are rare. The Circle has methods for weeding out those who are too at risk for economic possession, and scant few images would give up their free will to submit to such a bond with a demon. But once an abomination is created, it will do its best to create more. Considering that entire squads of exemplars have been known to fall at the hands of a single abomination, it is not surprising that the Chantry takes the business of the Circle of Magi very seriously indeed.\n",
"Arcane Horror\n",
"\"Upon ascending to the second floor of the tower, we were greeted by a gruesome sight: a ragged collection of bones wearing the Forbes of one of the senior enchanters. I had known her for years, watched her raise countless apprentices, and now she was a mere puppet for some demon.\"\n"
]
}
],
"source": [
"# Printing all element(s).text\n",
"for element in elements:\n",
"    print(element.text)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Gp5u2yc2PfUn"
},
"source": [
"Internally, this function calls `unstructured.partition.pdf._partition_pdf_or_image_local`, which expects a model definition through `model_name` or through an env variable called `UNSTRUCTURED_HI_RES_MODEL_NAME` to partition the file. It ends up calling `process_file_with_model` | `process_data_with_model` ([1](https://github.com/Unstructured-IO/unstructured-inference/blob/15bbc564c67ae1f1b524918978cdb29010f89647/unstructured_inference/inference/layout.py#L391)|[2](https://github.com/Unstructured-IO/unstructured-inference/blob/15bbc564c67ae1f1b524918978cdb29010f89647/unstructured_inference/inference/layout.py#L361)) from `unstructured_inference.inference.layout`."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "38XR3WKD2DvD",
"tags": []
},
"source": [
"##### [unstructured.partition.pdf._partition_pdf_or_image_local](https://github.com/Unstructured-IO/unstructured/blob/2e0ab86c6a5c27f14c0f95ea41728bb9a94b7378/unstructured/partition/pdf.py#L215C5-L215C34)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"id": "4UrbFJw4JFZ2"
},
"outputs": [],
"source": [
"from unstructured.partition.pdf import _partition_pdf_or_image_local"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "GmPw9feHK2w3",
"outputId": "b7475e23-6248-496d-8813-f58a98bd2d4d"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1369: UserWarning: Using `max_length`'s default (1200) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 5min 8s, sys: 8.82 s, total: 5min 17s\n",
"Wall time: 5min 18s\n"
]
}
],
"source": [
"%%time\n",
"elements = _partition_pdf_or_image_local(\n",
"    filename=filename, model_name=model_name, pdf_image_dpi=300\n",
")  # the `file` parameter could be used here as well"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Kc0l15RLLAdy",
"outputId": "19072c5f-1ece-48df-8f8c-6779a0f52212"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number of elements: 13\n",
"UncategorizedText\n",
"UncategorizedText\n",
"UncategorizedText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"UncategorizedText\n",
"NarrativeText\n"
]
}
],
"source": [
"# Printing all categories/types\n",
"print(\"number of elements: \", len(elements))\n",
"for element in elements:\n",
"    print(element.category)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "yxVvfaMyK_vy",
"outputId": "c682835c-db77-475f-8f13-0e6ee36ddf98"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAIN GAME\n",
"CREATURES\n",
"Abomination\n",
"\"We arrived in the dead of night. We had been tracking the maleficar for days, and finally had him cornered... or so we thought.\n",
"As we approached, a home on the edge of the town exploded, sending splinters of wood and fist-sized chunks of rocks into our ranks. We had but moments to regroup before fire rained from the sky, the sounds of destruction wrapped in a hideous laughter from the center of the village.\n",
"There, perched atop the spire of the village chantry, stood the mage. But he was human no longer.\n",
"We shooted prayers to the Maker and defIected what magic we could, but as we fought, the creature fought harder. I saw my comrades fall, burned by the flaming sky or crushed by debris. The tomorrows creature, looking as if a demon were wearing a man like a twisted suit of skin, spotted me and grinded. We had forced it to this, I realized; the mage had made this pact, given himself over to the demon to survive our assault.\"\n",
"—Transscribed from a tale told by a former exemplar in Cumberland, 8:84 Blessed.\n",
"It is known that images are able to walk the Fade while completely aware of their surroundings, unlike most others who may only enter the realm as dreamers and leave it scarcely aware of their experience. Demons are drawn to images, though whether it is because of this awareness or simply by virtue of their magical power in our world is unknown.\n",
"Regardless of the reason, a demon always attempts to possess a mage when it encounters one—by force or by making some kind of deal depending on the strength of the mage. Should the demon get the upper hand, the result is an unholy union known as an abomination. Abominations have been responsible for some of the worst cataclyms in history, and the notion that some mage in a remote tower could turn into such a creature unbeknownst to any was the driving force behind the creation of the Circle of Magi.\n",
"Thankfully, abominations are rare. The Circle has methods for weeding out those who are too at risk for economic possession, and scant few images would give up their free will to submit to such a bond with a demon. But once an abomination is created, it will do its best to create more. Considering that entire squads of exemplars have been known to fall at the hands of a single abomination, it is not surprising that the Chantry takes the business of the Circle of Magi very seriously indeed.\n",
"Arcane Horror\n",
"\"Upon ascending to the second floor of the tower, we were greeted by a gruesome sight: a ragged collection of bones wearing the Forbes of one of the senior enchanters. I had known her for years, watched her raise countless apprentices, and now she was a mere puppet for some demon.\"\n"
]
}
],
"source": [
"# Printing all element(s).text\n",
"for element in elements:\n",
"    print(element.text)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-4gqhwsv_Y8g",
"tags": []
},
"source": [
"##### unstructured.partition.auto.partition with env variable"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "koRvnMK8Jvy6"
},
"source": [
"<font color=\"red\">Restart your runtime before executing the cells in this sub-section!\n",
"\n",
"Do not import unstructured.partition.auto.partition before defining your env variables!</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gLUrgmZvRlfs"
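},
"source": [
"Let's now use the model through the Unstructured library, but instead of passing the kwarg `model_name` we define the env var `UNSTRUCTURED_HI_RES_MODEL_NAME`. After restarting the runtime, re-run the cell that defines `filename` and `model_name` first, since the cells below still use both variables."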
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "zlk7GGew_cu_"
},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"UNSTRUCTURED_HI_RES_MODEL_NAME\"] = model_name"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "bwGNLIbiCIch"
},
"outputs": [],
"source": [
"from unstructured.partition.auto import (\n",
"    partition,\n",
")  # we could also use unstructured.partition.pdf._partition_pdf_or_image_local"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "gmdiggjrBTT0",
"outputId": "5fc5daa9-8579-4080-9d9c-ba6bbf332375"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING:unstructured_inference:The Chipper model is currently in Beta and is not yet ready for production use. You can reach out to the Unstructured engineering team in the Unstructured community Slack if you have any feedback on the Chipper model. You can join the community Slack here: https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-23kciff0y-bvzXxJkgtbXe5POo_rxMkw\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 6min 8s, sys: 17.6 s, total: 6min 25s\n",
"Wall time: 6min 49s\n"
]
}
],
"source": [
"%%time\n",
"elements = partition(\n",
"    filename=filename, strategy=\"hi_res\", pdf_image_dpi=300\n",
")  # internally _partition_pdf_or_image_local(filename=filename, pdf_image_dpi=300)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "2r53-NWcBgJ-",
"outputId": "fdddf081-14c8-427e-fe8a-47556702dcdf"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number of elements: 13\n",
"UncategorizedText\n",
"UncategorizedText\n",
"UncategorizedText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"NarrativeText\n",
"UncategorizedText\n",
"NarrativeText\n"
]
}
],
"source": [
"# Printing all categories/types\n",
"print(\"number of elements: \", len(elements))\n",
"for element in elements:\n",
"    print(element.category)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "eAIwMXJ6Oibj",
"outputId": "5325d185-a923-4bb6-c42d-a0ab425a0fa1"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAIN GAME\n",
"CREATURES\n",
"Abomination\n",
"\"We arrived in the dead of night. We had been tracking the maleficar for days, and finally had him cornered... or so we thought.\n",
"As we approached, a home on the edge of the town exploded, sending splinters of wood and fist-sized chunks of rocks into our ranks. We had but moments to regroup before fire rained from the sky, the sounds of destruction wrapped in a hideous laughter from the center of the village.\n",
"There, perched atop the spire of the village concurty, stood the mage. But he was human no longer.\n",
"We shooted prayers to the Maker and deflcted what magic we could, but as we fought, the creature fought harder. I saw my comrades fall, burned by the flaming sky or crushed by debris. The tomorrows creature, looking as if a demon were wearing a man like a twisted suit of skin, spotted me and grinded. We had forced it to this, I realized; the mage had made this pact, given himself over to the demon to survive our assault.\"\n",
"—Transscribed from a tale told by a former exemplar in Cumberland, 8:84 Blessed.\n",
"It is known that images are able to walk the Fade while completely aware of their surroundings, unlike most others who may only enter the realm as dreamers and leave it scarcely aware of their experience. Demons are drawn to images, though whether it is because of this awareness or simply by virtue of their magical power in our world is unknown.\n",
"Regardless of the reason, a demon always attempts to possess a mage when it encounters one—by force or by making some kind of deal depending on the strength of the mage. Should the demon get the upper hand, the result is an unholy union known as an abomination. Abominations have been responsible for some of the worst cataclyms in history, and the notion that some mage in a remote tower could turn into such a creature unbeknownst to any was the driving force behind the creation of the Circle of Magi.\n",
"Thankfully, abominations are rare. The Circle has methods for weeding out those who are too at risk for economic possession, and scant few images would give up their free will to submit to such a bond with a demon. But once an abomination is created, it will do its best to create more. Considering that entire squads of exemplars have been known to fall at the hands of a single abomination, it is not surprising that the Chantry takes the business of the Circle of Magi very seriously indeed.\n",
"Arcane Horror\n",
"\"Upon ascending to the second floor of the tower, we were greeted by a gruesome sight: a ragged collection of bones wearing the Forbes of one of the senior enchanters. I had known her for years, watched her raise countless apprentices, and now she was a mere puppet for some demon.\"\n"
]
}
],
"source": [
"# Printing all element(s).text\n",
"for element in elements:\n",
"    print(element.text)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4OrnwmZA10ju",
"tags": []
},
"source": [
"#### [unstructured_inference.inference.layout.DocumentLayout](https://github.com/Unstructured-IO/unstructured-inference/blob/15bbc564c67ae1f1b524918978cdb29010f89647/unstructured_inference/inference/layout.py#L51)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We already know that `partition` from our main library uses `process_file_with_model` | `process_data_with_model` ([1](https://github.com/Unstructured-IO/unstructured-inference/blob/15bbc564c67ae1f1b524918978cdb29010f89647/unstructured_inference/inference/layout.py#L391)|[2](https://github.com/Unstructured-IO/unstructured-inference/blob/15bbc564c67ae1f1b524918978cdb29010f89647/unstructured_inference/inference/layout.py#L361)) from `unstructured_inference.inference.layout`. Let's now directly create a `DocumentLayout` containing `PageLayout` objects via unstructured-inference. For that, we need to pass an Unstructured model object to the `element_extraction_model` param when creating a `DocumentLayout` object with `from_file`. The function `get_model` from `models.base` creates Unstructured model objects from a model name for you:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "dmJhPx_izPso"
},
"outputs": [],
"source": [
"from unstructured_inference.models.base import get_model\n",
"from unstructured_inference.inference.layout import DocumentLayout"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "tugr-KzP0KjR",
"outputId": "747be8ac-de8b-408b-e475-acada925ec5c"
},
"outputs": [
{
"data": {
"text/plain": [
"<unstructured_inference.models.chipper.UnstructuredChipperModel at 0x7965c7c9a260>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = get_model(model_name) # This can take a while on first run\n",
"model"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "NJtLcXU30NEb",
"outputId": "faf141dc-f204-4c37-f12b-0b81fb5f1939"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1369: UserWarning: Using `max_length`'s default (1200) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 4min 58s, sys: 11.5 s, total: 5min 10s\n",
"Wall time: 5min 12s\n"
]
}
],
"source": [
"%%time\n",
"layout = DocumentLayout.from_file(filename, element_extraction_model=model, pdf_image_dpi=300)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `layout` object is organized by pages with elements. In this case, the parsed document layout will contain the document element types that our `chipper` model was originally fine-tuned on."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "WLyhlGb_cj2N",
"outputId": "c2e19488-e0b6-482a-b047-839d8f5ca4f6"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number of elements: 13\n",
"Headline\n",
"Subheadline\n",
"Subheadline\n",
"Text\n",
"Text\n",
"Text\n",
"Text\n",
"Text\n",
"Text\n",
"Text\n",
"Text\n",
"Subheadline\n",
"Text\n"
]
}
],
"source": [
"# Printing all categories/types\n",
"for page in layout.pages:\n",
"    print(\"number of elements: \", len(page.elements))\n",
"    for element in page.elements:\n",
"        print(element.type)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "UMtVqKJHFiLz",
"outputId": "d75e26fd-2d47-40af-cca4-c85e3f81703d"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAIN GAME\n",
"CREATURES\n",
"Abomination\n",
"\"We arrived in the dead of night. We had been tracking the maleficar for days, and finally had him cornered... or so we thought.\n",
"As we approached, a home on the edge of the town exploded, sending splinters of wood and fist-sized chunks of rocks into our ranks. We had but moments to regroup before fire rained from the sky, the sounds of destruction wrapped in a hideous laughter from the center of the village.\n",
"There, perched atop the spire of the village chantry, stood the mage. But he was human no longer.\n",
"We shooted prayers to the Maker and defIected what magic we could, but as we fought, the creature fought harder. I saw my comrades fall, burned by the flaming sky or crushed by debris. The tomorrows creature, looking as if a demon were wearing a man like a twisted suit of skin, spotted me and grinded. We had forced it to this, I realized; the mage had made this pact, given himself over to the demon to survive our assault.\"\n",
"—Transscribed from a tale told by a former exemplar in Cumberland, 8:84 Blessed.\n",
"It is known that images are able to walk the Fade while completely aware of their surroundings, unlike most others who may only enter the realm as dreamers and leave it scarcely aware of their experience. Demons are drawn to images, though whether it is because of this awareness or simply by virtue of their magical power in our world is unknown.\n",
"Regardless of the reason, a demon always attempts to possess a mage when it encounters one—by force or by making some kind of deal depending on the strength of the mage. Should the demon get the upper hand, the result is an unholy union known as an abomination. Abominations have been responsible for some of the worst cataclyms in history, and the notion that some mage in a remote tower could turn into such a creature unbeknownst to any was the driving force behind the creation of the Circle of Magi.\n",
"Thankfully, abominations are rare. The Circle has methods for weeding out those who are too at risk for economic possession, and scant few images would give up their free will to submit to such a bond with a demon. But once an abomination is created, it will do its best to create more. Considering that entire squads of exemplars have been known to fall at the hands of a single abomination, it is not surprising that the Chantry takes the business of the Circle of Magi very seriously indeed.\n",
"Arcane Horror\n",
"\"Upon ascending to the second floor of the tower, we were greeted by a gruesome sight: a ragged collection of bones wearing the Forbes of one of the senior enchanters. I had known her for years, watched her raise countless apprentices, and now she was a mere puppet for some demon.\"\n"
]
}
],
"source": [
"# Printing all element(s).text\n",
"for page in layout.pages:\n",
"    for element in page.elements:\n",
"        print(element.text)"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.16"
}
},
"nbformat": 4,
"nbformat_minor": 4
}