mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-06 16:42:42 +00:00
250 lines
6.2 KiB
Plaintext
250 lines
6.2 KiB
Plaintext
![]() |
{
|
||
|
"cells": [
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "2fac3543",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"# Loading Data into Spacy"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "30bc0a1b",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"The goal of this notebook is to show you how to start a spacy project with Unstructured's Elements. This allows you to create your NLP projects.\n",
|
||
|
"\n",
|
||
|
"Make sure you have Spacy installed on your local computer before running this notebook. If not, you can find the instructions for installation [here](https://spacy.io/usage)."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "ac83c096",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"# Preprocess Documents with Unstructured"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "a29ef57d",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"First, we'll pre-process a few documents using the the `unstructured` libraries. The example documents are available under the `example-docs` directory in the `unstructured` repo. At the end of this section, we'll wind up with a list of `Element` objects that we can pass into an `unstructured` staging brick."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 3,
|
||
|
"id": "adb6b8f7",
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"import os\n",
|
||
|
"\n",
|
||
|
"from unstructured.partition.auto import partition"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 8,
|
||
|
"id": "8464299b",
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"# NOTE: Update this directory if you are running the notebook\n",
|
||
|
"# from somewhere other than the examples/spacy folder in the\n",
|
||
|
"# unstructured repo\n",
|
||
|
"EXAMPLE_DOCS_FOLDER = \"../../example-docs/\""
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 9,
|
||
|
"id": "2fd24424",
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"document_to_process = \"fake-memo.pdf\"\n",
|
||
|
"filename = os.path.join(EXAMPLE_DOCS_FOLDER, document_to_process)\n",
|
||
|
"elements = partition(filename=filename, strategy=\"fast\")"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 10,
|
||
|
"id": "0aa45e81",
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'May 5, 2023'"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 10,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"elements[0].text"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 11,
|
||
|
"id": "2429f8a5",
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"{'filename': 'fake-memo.pdf',\n",
|
||
|
" 'file_directory': '../../example-docs',\n",
|
||
|
" 'filetype': 'application/pdf',\n",
|
||
|
" 'page_number': 1}"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 11,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"elements[0].metadata.to_dict()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "1fd556ff",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"# Extract Numbers Using Spacy\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "bdf2cefe",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Now let's import `spacy` and create a function to extract noun phrases with numbers. First we'll use a simple example then we'll use the text extracted by `unstructured`.\n",
|
||
|
"\n",
|
||
|
"The function first creates a spacy object with the text, then iterates through the spacy object to find the noun phrases with numbers. It then formats the phrases and appends to a list."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 1,
|
||
|
"id": "bfd20f75",
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"Number: 10, Noun: apples, Context: 10 apples\n",
|
||
|
"Number: 5, Noun: oranges, Context: 5 oranges\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"import spacy\n",
|
||
|
"\n",
|
||
|
"nlp = spacy.load(\"en_core_web_sm\")\n",
|
||
|
"\n",
|
||
|
"def extract_numbers_with_context(text):\n",
|
||
|
" doc = nlp(text)\n",
|
||
|
" numbers = []\n",
|
||
|
" \n",
|
||
|
" for token in doc:\n",
|
||
|
" if token.like_num and token.dep_ == 'nummod' and token.head.pos_ == 'NOUN':\n",
|
||
|
" number = token.text\n",
|
||
|
" noun = token.head.text\n",
|
||
|
" context = ' '.join([number, noun])\n",
|
||
|
" numbers.append((number, noun, context))\n",
|
||
|
" \n",
|
||
|
" return numbers\n",
|
||
|
"\n",
|
||
|
"# Example usage\n",
|
||
|
"text = \"I bought 10 apples and 5 oranges yesterday.\"\n",
|
||
|
"numbers_with_context = extract_numbers_with_context(text)\n",
|
||
|
"\n",
|
||
|
"for number, noun, context in numbers_with_context:\n",
|
||
|
" print(f\"Number: {number}, Noun: {noun}, Context: {context}\")"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "7eae9735",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"### Using the Data Extracted with Unstructured's Library"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 28,
|
||
|
"id": "7c738f91",
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"numbers_with_context = extract_numbers_with_context(elements[2].text)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 29,
|
||
|
"id": "3459555b",
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"Number: 20,000, Noun: bottles, Context: 20,000 bottles\n",
|
||
|
"Number: 10,000, Noun: blankets, Context: 10,000 blankets\n",
|
||
|
"Number: 200, Noun: laptops, Context: 200 laptops\n",
|
||
|
"Number: 3, Noun: trucks, Context: 3 trucks\n",
|
||
|
"Number: 15, Noun: hours, Context: 15 hours\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"for number, noun, context in numbers_with_context:\n",
|
||
|
" print(f\"Number: {number}, Noun: {noun}, Context: {context}\")"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"id": "dadd055a",
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": []
|
||
|
}
|
||
|
],
|
||
|
"metadata": {
|
||
|
"kernelspec": {
|
||
|
"display_name": "Python 3 (ipykernel)",
|
||
|
"language": "python",
|
||
|
"name": "python3"
|
||
|
},
|
||
|
"language_info": {
|
||
|
"codemirror_mode": {
|
||
|
"name": "ipython",
|
||
|
"version": 3
|
||
|
},
|
||
|
"file_extension": ".py",
|
||
|
"mimetype": "text/x-python",
|
||
|
"name": "python",
|
||
|
"nbconvert_exporter": "python",
|
||
|
"pygments_lexer": "ipython3",
|
||
|
"version": "3.8.15"
|
||
|
}
|
||
|
},
|
||
|
"nbformat": 4,
|
||
|
"nbformat_minor": 5
|
||
|
}
|