"This notebook demonstrates and explains how to parse a PDF file using `chipper` model locally through our main libraries. If you want to run this notebook in Google Colab taking into account that making an inference using `chipper` in CPU can take while; switching the runtime from \"CPU\" to GPU `T4` (or any other available) will reduce the runtime and is strongly recommended. You can use the following commands to install the required libraries:\n",
"Initialize the variables `filename` with your file path, and `model_name` with the model you want to use from the available `MODEL_TYPES` in each of the [models](https://github.com/Unstructured-IO/unstructured-inference/tree/203f7ab75b1644b938f6bae1e81c8365d274f35d/unstructured_inference/models) scripts, in this case `chipper`. For this notebook we will use `DA-1p.pdf` from our [example-docs](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs):"
"Most of the user experience is going to be through our main Unstructured lib, so the highest level call for local inference using `chipper` is through `unstructured.partition.auto.partition`. This method will need the `strategy`='hi_res' and `model_name`=model_name to call `chipper`, the additional kwarg `pdf_image_dpi`=300 is is **necessary for better performance** of `chipper`. Users should be prompted a `WARNING` saying `chipper` is in beta (*up to 14.08.2023*)."
"WARNING:unstructured_inference:The Chipper model is currently in Beta and is not yet ready for production use. You can reach out to the Unstructured engineering team in the Unstructured community Slack if you have any feedback on the Chipper model. You can join the community Slack here: https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-23kciff0y-bvzXxJkgtbXe5POo_rxMkw\n"
"Our `chipper` model process an image input and returns the textual content with a structure defined by some categories (document element types) it was fine-tuned on. Thereafter during a call of `partition` the PDF document is transformed to an image and the output element types are standardise to Unstructured [elements](https://unstructured-io.github.io/unstructured/getting_started.html#document-elements).\n",
"\n",
"*Disclaimer:* The `UncategorizedText` elements being returned by the `partition` method will soon instead reflect the *category/type* identified by `chipper` (e.g. `Headline`, `Subheadline`, ..)."
"\"We arrived in the dead of night. We had been tracking the maleficar for days, and finally had him cornered... or so we thought.\n",
"As we approached, a home on the edge of the town exploded, sending splinters of wood and fist-sized chunks of rocks into our ranks. We had but moments to regroup before fire rained from the sky, the sounds of destruction wrapped in a hideous laughter from the center of the village.\n",
"There, perched atop the spire of the village concurty, stood the mage. But he was human no longer.\n",
"We shooted prayers to the Maker and deflcted what magic we could, but as we fought, the creature fought harder. I saw my comrades fall, burned by the flaming sky or crushed by debris. The tomorrows creature, looking as if a demon were wearing a man like a twisted suit of skin, spotted me and grinded. We had forced it to this, I realized; the mage had made this pact, given himself over to the demon to survive our assault.\"\n",
"—Transscribed from a tale told by a former exemplar in Cumberland, 8:84 Blessed.\n",
"It is known that images are able to walk the Fade while completely aware of their surroundings, unlike most others who may only enter the realm as dreamers and leave it scarcely aware of their experience. Demons are drawn to images, though whether it is because of this awareness or simply by virtue of their magical power in our world is unknown.\n",
"Regardless of the reason, a demon always attempts to possess a mage when it encounters one—by force or by making some kind of deal depending on the strength of the mage. Should the demon get the upper hand, the result is an unholy union known as an abomination. Abominations have been responsible for some of the worst cataclyms in history, and the notion that some mage in a remote tower could turn into such a creature unbeknownst to any was the driving force behind the creation of the Circle of Magi.\n",
"Thankfully, abominations are rare. The Circle has methods for weeding out those who are too at risk for economic possession, and scant few images would give up their free will to submit to such a bond with a demon. But once an abomination is created, it will do its best to create more. Considering that entire squads of exemplars have been known to fall at the hands of a single abomination, it is not surprising that the Chantry takes the business of the Circle of Magi very seriously indeed.\n",
"Arcane Horror\n",
"\"Upon ascending to the second floor of the tower, we were greeted by a gruesome sight: a ragged collection of bones wearing the Forbes of one of the senior enchanters. I had known her for years, watched her raise countless apprentices, and now she was a mere puppet for some demon.\"\n"
]
}
],
"source": [
"# Printing all element(s).text\n",
"for element in elements:\n",
" print(element.text)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Gp5u2yc2PfUn"
},
"source": [
"Internally this method calls `unstructured.partition.pdf._partition_pdf_or_image_local` which expects a model definition through `model_name` or an env variable called `UNSTRUCTURED_HI_RES_MODEL_NAME` to partition the file, and which ends up calling `process_file_with_model` | `process_data_with_model` ([1](https://github.com/Unstructured-IO/unstructured-inference/blob/15bbc564c67ae1f1b524918978cdb29010f89647/unstructured_inference/inference/layout.py#L391)|[2](https://github.com/Unstructured-IO/unstructured-inference/blob/15bbc564c67ae1f1b524918978cdb29010f89647/unstructured_inference/inference/layout.py#L361)) from `unstructured_inference.inference.layout.PageLayout`"
"/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1369: UserWarning: Using `max_length`'s default (1200) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.\n",
"\"We arrived in the dead of night. We had been tracking the maleficar for days, and finally had him cornered... or so we thought.\n",
"As we approached, a home on the edge of the town exploded, sending splinters of wood and fist-sized chunks of rocks into our ranks. We had but moments to regroup before fire rained from the sky, the sounds of destruction wrapped in a hideous laughter from the center of the village.\n",
"There, perched atop the spire of the village chantry, stood the mage. But he was human no longer.\n",
"We shooted prayers to the Maker and defIected what magic we could, but as we fought, the creature fought harder. I saw my comrades fall, burned by the flaming sky or crushed by debris. The tomorrows creature, looking as if a demon were wearing a man like a twisted suit of skin, spotted me and grinded. We had forced it to this, I realized; the mage had made this pact, given himself over to the demon to survive our assault.\"\n",
"—Transscribed from a tale told by a former exemplar in Cumberland, 8:84 Blessed.\n",
"It is known that images are able to walk the Fade while completely aware of their surroundings, unlike most others who may only enter the realm as dreamers and leave it scarcely aware of their experience. Demons are drawn to images, though whether it is because of this awareness or simply by virtue of their magical power in our world is unknown.\n",
"Regardless of the reason, a demon always attempts to possess a mage when it encounters one—by force or by making some kind of deal depending on the strength of the mage. Should the demon get the upper hand, the result is an unholy union known as an abomination. Abominations have been responsible for some of the worst cataclyms in history, and the notion that some mage in a remote tower could turn into such a creature unbeknownst to any was the driving force behind the creation of the Circle of Magi.\n",
"Thankfully, abominations are rare. The Circle has methods for weeding out those who are too at risk for economic possession, and scant few images would give up their free will to submit to such a bond with a demon. But once an abomination is created, it will do its best to create more. Considering that entire squads of exemplars have been known to fall at the hands of a single abomination, it is not surprising that the Chantry takes the business of the Circle of Magi very seriously indeed.\n",
"Arcane Horror\n",
"\"Upon ascending to the second floor of the tower, we were greeted by a gruesome sight: a ragged collection of bones wearing the Forbes of one of the senior enchanters. I had known her for years, watched her raise countless apprentices, and now she was a mere puppet for some demon.\"\n"
]
}
],
"source": [
"# Printing all element(s).text\n",
"for element in elements:\n",
" print(element.text)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-4gqhwsv_Y8g",
"tags": []
},
"source": [
"##### unstructured.partition.auto.partition with env variable"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "koRvnMK8Jvy6"
},
"source": [
"<font color=\"red\">Restart your runtime before executing the cells in this sub-section!\n",
"\n",
"Do not import unstructured.partition.auto.partition before defining your env variables!</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gLUrgmZvRlfs"
},
"source": [
"Let's now use the model through Unstructured lib but instead of using the kwarg `model_name` we can define the env var `UNSTRUCTURED_HI_RES_MODEL_NAME`."
"WARNING:unstructured_inference:The Chipper model is currently in Beta and is not yet ready for production use. You can reach out to the Unstructured engineering team in the Unstructured community Slack if you have any feedback on the Chipper model. You can join the community Slack here: https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-23kciff0y-bvzXxJkgtbXe5POo_rxMkw\n"
"\"We arrived in the dead of night. We had been tracking the maleficar for days, and finally had him cornered... or so we thought.\n",
"As we approached, a home on the edge of the town exploded, sending splinters of wood and fist-sized chunks of rocks into our ranks. We had but moments to regroup before fire rained from the sky, the sounds of destruction wrapped in a hideous laughter from the center of the village.\n",
"There, perched atop the spire of the village concurty, stood the mage. But he was human no longer.\n",
"We shooted prayers to the Maker and deflcted what magic we could, but as we fought, the creature fought harder. I saw my comrades fall, burned by the flaming sky or crushed by debris. The tomorrows creature, looking as if a demon were wearing a man like a twisted suit of skin, spotted me and grinded. We had forced it to this, I realized; the mage had made this pact, given himself over to the demon to survive our assault.\"\n",
"—Transscribed from a tale told by a former exemplar in Cumberland, 8:84 Blessed.\n",
"It is known that images are able to walk the Fade while completely aware of their surroundings, unlike most others who may only enter the realm as dreamers and leave it scarcely aware of their experience. Demons are drawn to images, though whether it is because of this awareness or simply by virtue of their magical power in our world is unknown.\n",
"Regardless of the reason, a demon always attempts to possess a mage when it encounters one—by force or by making some kind of deal depending on the strength of the mage. Should the demon get the upper hand, the result is an unholy union known as an abomination. Abominations have been responsible for some of the worst cataclyms in history, and the notion that some mage in a remote tower could turn into such a creature unbeknownst to any was the driving force behind the creation of the Circle of Magi.\n",
"Thankfully, abominations are rare. The Circle has methods for weeding out those who are too at risk for economic possession, and scant few images would give up their free will to submit to such a bond with a demon. But once an abomination is created, it will do its best to create more. Considering that entire squads of exemplars have been known to fall at the hands of a single abomination, it is not surprising that the Chantry takes the business of the Circle of Magi very seriously indeed.\n",
"Arcane Horror\n",
"\"Upon ascending to the second floor of the tower, we were greeted by a gruesome sight: a ragged collection of bones wearing the Forbes of one of the senior enchanters. I had known her for years, watched her raise countless apprentices, and now she was a mere puppet for some demon.\"\n"
"We know already that `partition` from our main library uses `process_file_with_model` | `process_data_with_model` ([1](https://github.com/Unstructured-IO/unstructured-inference/blob/15bbc564c67ae1f1b524918978cdb29010f89647/unstructured_inference/inference/layout.py#L391)|[2](https://github.com/Unstructured-IO/unstructured-inference/blob/15bbc564c67ae1f1b524918978cdb29010f89647/unstructured_inference/inference/layout.py#L361)) from `unstructured_inference.inference.layout.PageLayout`. Let's now directly create a `DocumentLayout` containing `PageLayout` objects via unstructured-inference. For that, we nned to pass an Unstructured model object to the `element_extraction_model` param when creating a `DocumentLayout` object `from_file`. The method `get_model` from `models.base` creates Unstructured model objects from a model name for you:"
"/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1369: UserWarning: Using `max_length`'s default (1200) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.\n",
"The `layout` object is organized by pages with elements. In this case, the parsed document layout will contain the document element types that our `chipper` model was originally fine-tuned on."
"\"We arrived in the dead of night. We had been tracking the maleficar for days, and finally had him cornered... or so we thought.\n",
"As we approached, a home on the edge of the town exploded, sending splinters of wood and fist-sized chunks of rocks into our ranks. We had but moments to regroup before fire rained from the sky, the sounds of destruction wrapped in a hideous laughter from the center of the village.\n",
"There, perched atop the spire of the village chantry, stood the mage. But he was human no longer.\n",
"We shooted prayers to the Maker and defIected what magic we could, but as we fought, the creature fought harder. I saw my comrades fall, burned by the flaming sky or crushed by debris. The tomorrows creature, looking as if a demon were wearing a man like a twisted suit of skin, spotted me and grinded. We had forced it to this, I realized; the mage had made this pact, given himself over to the demon to survive our assault.\"\n",
"—Transscribed from a tale told by a former exemplar in Cumberland, 8:84 Blessed.\n",
"It is known that images are able to walk the Fade while completely aware of their surroundings, unlike most others who may only enter the realm as dreamers and leave it scarcely aware of their experience. Demons are drawn to images, though whether it is because of this awareness or simply by virtue of their magical power in our world is unknown.\n",
"Regardless of the reason, a demon always attempts to possess a mage when it encounters one—by force or by making some kind of deal depending on the strength of the mage. Should the demon get the upper hand, the result is an unholy union known as an abomination. Abominations have been responsible for some of the worst cataclyms in history, and the notion that some mage in a remote tower could turn into such a creature unbeknownst to any was the driving force behind the creation of the Circle of Magi.\n",
"Thankfully, abominations are rare. The Circle has methods for weeding out those who are too at risk for economic possession, and scant few images would give up their free will to submit to such a bond with a demon. But once an abomination is created, it will do its best to create more. Considering that entire squads of exemplars have been known to fall at the hands of a single abomination, it is not surprising that the Chantry takes the business of the Circle of Magi very seriously indeed.\n",
"Arcane Horror\n",
"\"Upon ascending to the second floor of the tower, we were greeted by a gruesome sight: a ragged collection of bones wearing the Forbes of one of the senior enchanters. I had known her for years, watched her raise countless apprentices, and now she was a mere puppet for some demon.\"\n"