Mirror of https://github.com/Unstructured-IO/unstructured.git (synced 2026-01-07)
Chore (refactor): support table extraction with pre-computed ocr data (#1801)
### Summary

Table OCR refactor: moves the OCR step for the table model from the inference repo to the unstructured repo.

* Before this PR, the table model extracted OCR tokens (text plus bounding boxes) and filled them into the table structure inside the inference repo. This meant running an additional OCR pass just for tables.
* After this PR, we reuse the OCR data from the full-page OCR pass and hand those tokens to the inference repo, so we only run OCR once per document.

**Tech details:**

* Combined the env variables `ENTIRE_PAGE_OCR` and `TABLE_OCR` into `OCR_AGENT`: since we only run OCR once, the same OCR agent serves both the entire page and its tables.
* Bumped the inference repo to `0.7.9`, which allows the table model in inference to use pre-computed OCR data from the unstructured repo. See the companion [PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256).
* All notebook lint changes were made by `make tidy`.
* This PR also fixes [issue #1564](https://github.com/Unstructured-IO/unstructured/issues/1564); a test for it was added in `test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages`.
* Added the same image-scaling logic [as the previous table OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113), but scaling is now applied to the entire page image.

### Test

* Not much to test manually except that table extraction still works.
* Because of the scaling change and the reuse of pre-computed full-page OCR data, there are slight (better) changes in table output. Here is a comparison of outputs from the same test, `test_partition_image_with_table_extraction`:

Screenshot of the table in `layout-parser-paper-with-table.jpg`:

<img width="343" alt="expected" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c">

Before the refactor:

<img width="709" alt="before" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e">

After the refactor:

<img width="705" alt="after" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d">

### TODO (added as a ticket)

Some cleanup remains in the inference repo, since the unstructured repo now duplicates this logic; the duplicates can be kept as a fallback plan. If we want to remove everything OCR-related from inference, these items are deprecated and can be removed:

* [`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77) (already noted in code)
* the `extract_tables` parameter in inference
* [`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88)
* [`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197)
* the env variable `TABLE_OCR`

### Note

If we want to fall back to an additional table OCR pass (we may need this to use Paddle for tables), we need to:

* pass `infer_table_structure` to inference via the `extract_tables` parameter
* stop passing `infer_table_structure` to `ocr.py`

---------

Co-authored-by: Yao You <yao@unstructured.io>
Parent: 3437a23c91
Commit: ce40cdc55f
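Two hedged sketches follow to make the refactor concrete. First, the table-token hand-off: the values below mirror the new `test_get_table_tokens_per_element` test later in this diff, but the helper here is a simplified, hypothetical stand-in, not the exact implementation in `unstructured/partition/ocr.py`.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Stand-in for unstructured_inference's TextRegion; only the fields this sketch needs."""

    x1: float
    y1: float
    x2: float
    y2: float
    text: str


def table_tokens_from_page_ocr(table, ocr_regions):
    """Shift page-level OCR tokens into table-relative coordinates, in the
    token-dict format the table model consumes (hypothetical helper name)."""
    tokens = []
    for i, region in enumerate(ocr_regions):
        tokens.append(
            {
                # translate page coordinates into table-local coordinates
                "bbox": [
                    region.x1 - table.x1,
                    region.y1 - table.y1,
                    region.x2 - table.x1,
                    region.y2 - table.y1,
                ],
                "text": region.text,
                "span_num": i,  # index of the token within the table
                "line_num": 0,
                "block_num": 0,
            }
        )
    return tokens


# Mirrors the fixtures in test_get_table_tokens_per_element below.
table = Region(10, 20, 50, 70, "I am a table")
page_ocr = [Region(15, 25, 35, 45, "Token1"), Region(40, 30, 45, 50, "Token2")]
assert table_tokens_from_page_ocr(table, page_ocr)[0]["bbox"] == [5, 5, 25, 25]
```

Second, the page-level scaling: a minimal PIL-based sketch matching the contract checked by the new `test_zoom_image` (non-positive zoom factors mean "no scaling"); the production code may differ in interpolation details.

```python
from PIL import Image


def zoom_image(image: Image.Image, zoom: float = 1.0) -> Image.Image:
    # Per test_zoom_image, a non-positive zoom factor falls back to no scaling.
    if zoom <= 0:
        zoom = 1.0
    new_w = int(round(image.size[0] * zoom))
    new_h = int(round(image.size[1] * zoom))
    return image.resize((new_w, new_h))
```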
@@ -1,5 +1,5 @@
 [run]
 omit =
     unstructured/ingest/*
-    # TODO(yuming): please remove this line after adding tests for paddle (CORE-1886)
+    # TODO(yuming): please remove this line after adding tests for paddle
     unstructured/partition/utils/ocr_models/paddle_ocr.py
.github/workflows/ci.yml (vendored)
@@ -305,7 +305,7 @@ jobs:
 AZURE_SEARCH_API_KEY: ${{ secrets.AZURE_SEARCH_API_KEY }}
 OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
-TABLE_OCR: "tesseract"
-ENTIRE_PAGE_OCR: "tesseract"
+OCR_AGENT: "tesseract"
 CI: "true"
 run: |
   source .venv/bin/activate

@@ -97,7 +97,7 @@ jobs:
 AZURE_SEARCH_API_KEY: ${{ secrets.AZURE_SEARCH_API_KEY }}
 OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
-TABLE_OCR: "tesseract"
-ENTIRE_PAGE_OCR: "tesseract"
+OCR_AGENT: "tesseract"
 OVERWRITE_FIXTURES: "true"
 CI: "true"
 run: |
@@ -1,4 +1,4 @@
-## 0.10.25-dev8
+## 0.10.25-dev9

 ### Enhancements

@@ -6,6 +6,8 @@

 ### Features

+* **Table OCR refactor** Support table OCR with pre-computed OCR data to ensure we only do one OCR pass for the entire document. Users can specify the OCR agent (tesseract or paddle) in the environment variable `OCR_AGENT` for OCRing the entire document (see the usage sketch after this list).
 * **Adds accuracy function** The accuracy scoring was originally an option under `calculate_edit_distance`. For an easy function call, it is now a wrapper around the original function that calls edit_distance and returns the result as "score".
 * **Adds HuggingFaceEmbeddingEncoder** The HuggingFace Embedding Encoder uses a local embedding model as opposed to using an API.
 * **Add AWS bedrock embedding connector** `unstructured.embed.bedrock` now provides a connector to use AWS Bedrock's `titan-embed-text` model to generate embeddings for elements. This feature requires a valid AWS Bedrock setup and an internet connection to run.
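A minimal usage sketch for the new `OCR_AGENT` variable (assuming Tesseract is installed and the repo's `example-docs` fixture is available locally):

```python
import os

# One OCR agent now serves both the full-page pass and table extraction.
os.environ["OCR_AGENT"] = "tesseract"  # or "paddle"

from unstructured.partition.image import partition_image

elements = partition_image(
    filename="example-docs/layout-parser-paper-with-table.jpg",
    strategy="hi_res",
    infer_table_structure=True,  # table tokens reuse the single page-level OCR pass
)
tables = [el.metadata.text_as_html for el in elements if el.metadata.text_as_html]
```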
BIN example-docs/korean-text-with-tables.pdf (new file)
Binary file not shown.
@@ -43,7 +43,7 @@
 "source": [
 "from IPython.display import Image\n",
 "\n",
-"Image(filename=\"img/isw.png\", width=800) "
+"Image(filename=\"img/isw.png\", width=800)"
 ]
 },
 {

@@ -94,6 +94,7 @@
 "source": [
 "ISW_BASE_URL = \"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment\"\n",
 "\n",
+"\n",
 "def datetime_to_url(dt):\n",
 "    month = dt.strftime(\"%B\").lower()\n",
 "    return f\"{ISW_BASE_URL}-{month}-{dt.day}\""

@@ -134,8 +135,8 @@
 "    r = requests.get(url)\n",
 "    if r.status_code != 200:\n",
 "        return None\n",
-"    \n",
-"    elements = partition_html(text=r.text) \n",
+"\n",
+"    elements = partition_html(text=r.text)\n",
 "    return elements"
 ]
 },

@@ -170,7 +171,7 @@
 }
 ],
 "source": [
-"Image(filename=\"img/isw-key-takeaways.png\", width=500) "
+"Image(filename=\"img/isw-key-takeaways.png\", width=500)"
 ]
 },
 {

@@ -185,13 +186,14 @@
 "    if element.text == \"Key Takeaways\":\n",
 "        return idx\n",
 "\n",
+"\n",
 "def get_key_takeaways(elements):\n",
 "    key_takeaways_idx = _find_key_takeaways_idx(elements)\n",
 "    if not key_takeaways_idx:\n",
 "        return None\n",
-"    \n",
+"\n",
 "    takeaways = []\n",
-"    for element in elements[key_takeaways_idx + 1:]:\n",
+"    for element in elements[key_takeaways_idx + 1 :]:\n",
 "        if not isinstance(element, ListItem):\n",
 "            break\n",
 "        takeaways.append(element)\n",

@@ -245,12 +247,12 @@
 "source": [
 "def get_narrative(elements):\n",
 "    narrative_text = \"\"\n",
-"    for element in elements: \n",
+"    for element in elements:\n",
 "        if isinstance(element, NarrativeText) and len(element.text) > 500:\n",
 "            # NOTE: Removes citations like [3] from the text\n",
 "            element_text = re.sub(\"\\[\\d{1,3}\\]\", \"\", element.text)\n",
 "            narrative_text += f\"\\n\\n{element_text}\"\n",
-"    \n",
+"\n",
 "    return NarrativeText(text=narrative_text.strip())"
 ]
 },

@@ -337,10 +339,10 @@
 "    elements = url_to_elements(url)\n",
 "    if url is None or not elements:\n",
 "        continue\n",
-"    \n",
+"\n",
 "    text = get_narrative(elements)\n",
 "    annotation = get_key_takeaways(elements)\n",
-"    \n",
+"\n",
 "    if text and annotation:\n",
 "        inputs.append(text)\n",
 "        annotations.append(annotation.text)\n",

@@ -600,7 +602,7 @@
 }
 ],
 "source": [
-"Image(filename=\"img/argilla-dataset.png\", width=800) "
+"Image(filename=\"img/argilla-dataset.png\", width=800)"
 ]
 },
 {

@@ -634,7 +636,7 @@
 }
 ],
 "source": [
-"Image(filename=\"img/argilla-annotation.png\", width=800) "
+"Image(filename=\"img/argilla-annotation.png\", width=800)"
 ]
 },
 {

@@ -688,7 +690,7 @@
 ],
 "source": [
 "from transformers import AutoTokenizer\n",
-" \n",
+"\n",
 "tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)"
 ]
 },

@@ -702,6 +704,7 @@
 "max_input_length = 1024\n",
 "max_target_length = 128\n",
 "\n",
+"\n",
 "def preprocess_function(examples):\n",
 "    inputs = [doc for doc in examples[\"text\"]]\n",
 "    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)\n",

@@ -754,7 +757,12 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer\n",
+"from transformers import (\n",
+"    AutoModelForSeq2SeqLM,\n",
+"    DataCollatorForSeq2Seq,\n",
+"    Seq2SeqTrainingArguments,\n",
+"    Seq2SeqTrainer,\n",
+")\n",
 "\n",
 "model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)"
 ]

@@ -770,7 +778,7 @@
 "model_name = model_checkpoint.split(\"/\")[-1]\n",
 "args = Seq2SeqTrainingArguments(\n",
 "    \"t5-small-isw-summaries\",\n",
-"    evaluation_strategy = \"epoch\",\n",
+"    evaluation_strategy=\"epoch\",\n",
 "    learning_rate=2e-5,\n",
 "    per_device_train_batch_size=batch_size,\n",
 "    per_device_eval_batch_size=batch_size,\n",

@@ -1068,8 +1076,8 @@
 ],
 "source": [
 "summarization_model = pipeline(\n",
-"task=\"summarization\",\n",
-"model=\"./t5-small-isw-summaries\",\n",
+"    task=\"summarization\",\n",
+"    model=\"./t5-small-isw-summaries\",\n",
 ")"
 ]
 },
@@ -20,13 +20,15 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"import arxiv # Interact with arXiv api to scrape papers\n",
-"from sentence_transformers import SentenceTransformer # Use Hugging Face Embedding for Topic Modelling\n",
-"from bertopic import BERTopic # Package for Topic Modelling\n",
-"from tqdm import tqdm #Progress Bar When Iterating\n",
-"import glob #Identify Files in Directory\n",
-"import os #Delete Files in Directory\n",
-"import pandas as pd #Dataframe Manipulation"
+"import arxiv  # Interact with arXiv api to scrape papers\n",
+"from sentence_transformers import (\n",
+"    SentenceTransformer,\n",
+")  # Use Hugging Face Embedding for Topic Modelling\n",
+"from bertopic import BERTopic  # Package for Topic Modelling\n",
+"from tqdm import tqdm  # Progress Bar When Iterating\n",
+"import glob  # Identify Files in Directory\n",
+"import os  # Delete Files in Directory\n",
+"import pandas as pd  # Dataframe Manipulation"
 ]
 },
 {

@@ -42,13 +44,19 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from unstructured.partition.auto import partition #Base Function to Partition PDF\n",
-"from unstructured.staging.base import convert_to_dict #Convert List Unstructured Elements Into List of Dicts for Easy Parsing\n",
-"from unstructured.cleaners.core import clean, remove_punctuation, clean_non_ascii_chars #Cleaning Bricks\n",
-"import re #Create Custom Cleaning Brick\n",
-"import nltk #Toolkit for more advanced pre-processing\n",
-"from nltk.corpus import stopwords #list of stopwords to remove\n",
-"from typing import List #Type Hinting"
+"from unstructured.partition.auto import partition  # Base Function to Partition PDF\n",
+"from unstructured.staging.base import (\n",
+"    convert_to_dict,\n",
+")  # Convert List Unstructured Elements Into List of Dicts for Easy Parsing\n",
+"from unstructured.cleaners.core import (\n",
+"    clean,\n",
+"    remove_punctuation,\n",
+"    clean_non_ascii_chars,\n",
+")  # Cleaning Bricks\n",
+"import re  # Create Custom Cleaning Brick\n",
+"import nltk  # Toolkit for more advanced pre-processing\n",
+"from nltk.corpus import stopwords  # list of stopwords to remove\n",
+"from typing import List  # Type Hinting"
 ]
 },
 {

@@ -84,7 +92,7 @@
 }
 ],
 "source": [
-"nltk.download('stopwords')"
+"nltk.download(\"stopwords\")"
 ]
 },
 {

@@ -110,28 +118,29 @@
 "    Returns:\n",
 "        paper_texts (list[str]): Return list of narrative texts for each paper\n",
 "    \"\"\"\n",
-"    #Get List of Arxiv Papers Matching Our Query\n",
+"    # Get List of Arxiv Papers Matching Our Query\n",
 "    arxiv_papers = list(\n",
 "        arxiv.Search(\n",
-"            query = query,\n",
-"            max_results = max_results,\n",
-"            sort_by = arxiv.SortCriterion.Relevance,\n",
-"            sort_order = arxiv.SortOrder.Descending\n",
-"        )\n",
-"        .results()\n",
+"            query=query,\n",
+"            max_results=max_results,\n",
+"            sort_by=arxiv.SortCriterion.Relevance,\n",
+"            sort_order=arxiv.SortOrder.Descending,\n",
+"        ).results()\n",
 "    )\n",
 "\n",
-"    #Loop Through PDFs, Download and Pre-Process and Then Delete\n",
+"    # Loop Through PDFs, Download and Pre-Process and Then Delete\n",
 "    paper_texts = []\n",
 "    for paper in tqdm(arxiv_papers):\n",
 "        paper.download_pdf()\n",
-"        pdf_file = glob.glob('*.pdf')[0]\n",
-"        elements = partition(pdf_file) #Partition PDF Using Unstructured\n",
-"        isd = convert_to_dict(elements) #Convert List of Elements to List of Dictionaries\n",
-"        narrative_texts = [element['text'] for element in isd if element['type'] == 'NarrativeText'] #Only Keep Narrative Text and Combine Into One String\n",
-"        os.remove(pdf_file) #Delete PDF\n",
+"        pdf_file = glob.glob(\"*.pdf\")[0]\n",
+"        elements = partition(pdf_file)  # Partition PDF Using Unstructured\n",
+"        isd = convert_to_dict(elements)  # Convert List of Elements to List of Dictionaries\n",
+"        narrative_texts = [\n",
+"            element[\"text\"] for element in isd if element[\"type\"] == \"NarrativeText\"\n",
+"        ]  # Only Keep Narrative Text and Combine Into One String\n",
+"        os.remove(pdf_file)  # Delete PDF\n",
 "        paper_texts += narrative_texts\n",
-"    return paper_texts\n"
+"    return paper_texts"
 ]
 },
 {

@@ -155,7 +164,7 @@
 }
 ],
 "source": [
-"paper_texts = get_arxiv_paper_texts(query='natural language processing', max_results=10)"
+"paper_texts = get_arxiv_paper_texts(query=\"natural language processing\", max_results=10)"
 ]
 },
 {

@@ -179,10 +188,11 @@
 }
 ],
 "source": [
-"#Stopwords to Remove\n",
-"stop_words = set(stopwords.words('english'))\n",
+"# Stopwords to Remove\n",
+"stop_words = set(stopwords.words(\"english\"))\n",
 "\n",
-"#Function to Apply Whatever Cleaning Brick Functionality to Each Narrative Text Element\n",
+"\n",
+"# Function to Apply Whatever Cleaning Brick Functionality to Each Narrative Text Element\n",
 "def custom_clean_brick(narrative_text: str) -> str:\n",
 "    \"\"\"Apply Mix of Unstructured Cleaning Bricks With Some Custom Functionality to Pre-Process Narrative Text\n",
 "\n",

@@ -192,18 +202,32 @@
 "    Returns:\n",
 "        cleaned_text (str): Text after going through all the cleaning procedures\n",
 "    \"\"\"\n",
-"    remove_numbers = lambda text: re.sub(r'\\d+', \"\", text) #lambda function to remove all punctuation\n",
-"    cleaned_text = remove_numbers(narrative_text) #Apply Custom Lambda\n",
-"    cleaned_text = clean(cleaned_text, extra_whitespace=True, dashes=True, bullets=True, trailing_punctuation=True, lowercase=True) #Apply Basic Clean Brick With All the Options\n",
-"    cleaned_text = remove_punctuation(cleaned_text) #Remove all punctuation\n",
-"    cleaned_text = ' '.join([word for word in cleaned_text.split() if word not in stop_words]) #remove stop words\n",
+"    remove_numbers = lambda text: re.sub(\n",
+"        r\"\\d+\", \"\", text\n",
+"    )  # lambda function to remove all punctuation\n",
+"    cleaned_text = remove_numbers(narrative_text)  # Apply Custom Lambda\n",
+"    cleaned_text = clean(\n",
+"        cleaned_text,\n",
+"        extra_whitespace=True,\n",
+"        dashes=True,\n",
+"        bullets=True,\n",
+"        trailing_punctuation=True,\n",
+"        lowercase=True,\n",
+"    )  # Apply Basic Clean Brick With All the Options\n",
+"    cleaned_text = remove_punctuation(cleaned_text)  # Remove all punctuation\n",
+"    cleaned_text = \" \".join(\n",
+"        [word for word in cleaned_text.split() if word not in stop_words]\n",
+"    )  # remove stop words\n",
 "    return cleaned_text\n",
 "\n",
-"#Apply Function to Paper Texts\n",
+"\n",
+"# Apply Function to Paper Texts\n",
 "cleaned_paper_texts = [custom_clean_brick(text) for text in paper_texts]\n",
 "\n",
-"#Count Narratve Texts\n",
-"print(\"Number of Narrative Texts to Run Through Topic Modelling: {}\".format(len(cleaned_paper_texts)))"
+"# Count Narratve Texts\n",
+"print(\n",
+"    \"Number of Narrative Texts to Run Through Topic Modelling: {}\".format(len(cleaned_paper_texts))\n",
+")"
 ]
 },
 {

@@ -219,10 +243,10 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"#Choose Which Hugging Face Model You Want to Use\n",
+"# Choose Which Hugging Face Model You Want to Use\n",
 "sentence_model = SentenceTransformer(\"all-MiniLM-L6-v2\")\n",
 "\n",
-"#Initialize Model\n",
+"# Initialize Model\n",
 "topic_model = BERTopic(embedding_model=sentence_model, top_n_words=10, nr_topics=10, verbose=True)"
 ]
 },

@@ -264,16 +288,16 @@
 }
 ],
 "source": [
-"#Fit Topic Model and Transform List of Paper Narrative Texts Into Topic and Probabilities\n",
+"# Fit Topic Model and Transform List of Paper Narrative Texts Into Topic and Probabilities\n",
 "topic_model.fit(cleaned_paper_texts)\n",
 "\n",
-"#Store Document-Topic Info\n",
+"# Store Document-Topic Info\n",
 "doc_topic_info = topic_model.get_document_info(cleaned_paper_texts)\n",
 "\n",
-"#Store Topic Info\n",
+"# Store Topic Info\n",
 "topic_info = pd.DataFrame(topic_model.get_topics())\n",
 "topic_info = topic_info.applymap(lambda x: x[0])\n",
-"topic_info.columns = ['topic_{}'.format(col+1) for col in topic_info.columns]"
+"topic_info.columns = [\"topic_{}\".format(col + 1) for col in topic_info.columns]"
 ]
 },
 {
@@ -33,7 +33,7 @@
 },
 "outputs": [],
 "source": [
-"filename = '../../example-docs/DA-1p.pdf' # \n",
+"filename = \"../../example-docs/DA-1p.pdf\" #\n",
 "model_name = \"chipper\""
 ]
 },

@@ -95,7 +95,7 @@
 ],
 "source": [
 "%%time\n",
-"elements = partition(filename=filename, strategy='hi_res', model_name=model_name, pdf_image_dpi=300)"
+"elements = partition(filename=filename, strategy=\"hi_res\", model_name=model_name, pdf_image_dpi=300)"
 ]
 },
 {

@@ -243,7 +243,9 @@
 ],
 "source": [
 "%%time\n",
-"elements = _partition_pdf_or_image_local(filename=filename, model_name=model_name, pdf_image_dpi=300) # file parameter could be use here as well"
+"elements = _partition_pdf_or_image_local(\n",
+"    filename=filename, model_name=model_name, pdf_image_dpi=300\n",
+")  # file parameter could be use here as well"
 ]
 },
 {

@@ -362,7 +364,7 @@
 "source": [
 "import os\n",
 "\n",
-"os.environ['UNSTRUCTURED_HI_RES_MODEL_NAME'] = model_name"
+"os.environ[\"UNSTRUCTURED_HI_RES_MODEL_NAME\"] = model_name"
 ]
 },
 {

@@ -373,7 +375,9 @@
 },
 "outputs": [],
 "source": [
-"from unstructured.partition.auto import partition # we could also use unstructured.partition.pdf._partition_pdf_or_image_local"
+"from unstructured.partition.auto import (\n",
+"    partition,\n",
+")  # we could also use unstructured.partition.pdf._partition_pdf_or_image_local"
 ]
 },
 {

@@ -405,7 +409,9 @@
 ],
 "source": [
 "%%time\n",
-"elements = partition(filename=filename, strategy='hi_res', pdf_image_dpi=300) # internally _partition_pdf_or_image_local(filename=filename, pdf_image_dpi=300)"
+"elements = partition(\n",
+"    filename=filename, strategy=\"hi_res\", pdf_image_dpi=300\n",
+")  # internally _partition_pdf_or_image_local(filename=filename, pdf_image_dpi=300)"
 ]
 },
 {

@@ -536,7 +542,7 @@
 }
 ],
 "source": [
-"model = get_model(model_name) # This can take a while on first run\n",
+"model = get_model(model_name)  # This can take a while on first run\n",
 "model"
 ]
 },
@@ -39,7 +39,9 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"def plot_image_with_bounding_boxes_coloured(image_path, bounding_boxes, text_labels=None, desired_width=None):\n",
+"def plot_image_with_bounding_boxes_coloured(\n",
+"    image_path, bounding_boxes, text_labels=None, desired_width=None\n",
+"):\n",
 "    # Load the image\n",
 "    image = Image.open(image_path)\n",
 "\n",

@@ -72,44 +74,76 @@
 "        height = y_max - y_min\n",
 "\n",
 "        # Create a rectangle patch\n",
-"        rect = patches.Rectangle((x_min, y_min), width, height, linewidth=1, edgecolor='black', facecolor='none')\n",
+"        rect = patches.Rectangle(\n",
+"            (x_min, y_min), width, height, linewidth=1, edgecolor=\"black\", facecolor=\"none\"\n",
+"        )\n",
 "\n",
 "        # Determine the color based on the position of the bounding box\n",
 "        if x_min < vertical_line_x and x_max < vertical_line_x:\n",
 "            # Left side: color is red\n",
-"            rect.set_edgecolor('red')\n",
+"            rect.set_edgecolor(\"red\")\n",
 "            rect.set_facecolor((1.0, 0.0, 0.0, 0.05))\n",
 "\n",
 "            # Add the rectangle to the plot\n",
 "            ax.add_patch(rect)\n",
-"            \n",
+"\n",
 "            # Add custom text label above the bounding box\n",
-"            ax.text(x_min, y_min - 5, label, fontsize=12, weight='bold', color='red', bbox=dict(facecolor=(1.0, 1.0, 1.0, 0.8), edgecolor=(0.95, 0.95, 0.95, 0.0), pad=0.5))\n",
-"            \n",
+"            ax.text(\n",
+"                x_min,\n",
+"                y_min - 5,\n",
+"                label,\n",
+"                fontsize=12,\n",
+"                weight=\"bold\",\n",
+"                color=\"red\",\n",
+"                bbox=dict(\n",
+"                    facecolor=(1.0, 1.0, 1.0, 0.8), edgecolor=(0.95, 0.95, 0.95, 0.0), pad=0.5\n",
+"                ),\n",
+"            )\n",
+"\n",
 "        elif x_min > vertical_line_x and x_max > vertical_line_x:\n",
 "            # Right side: color is blue\n",
-"            rect.set_edgecolor('blue')\n",
+"            rect.set_edgecolor(\"blue\")\n",
 "            rect.set_facecolor((0.0, 0.0, 1.0, 0.05))\n",
 "\n",
 "            # Add the rectangle to the plot\n",
 "            ax.add_patch(rect)\n",
-"            \n",
+"\n",
 "            # Add custom text label above the bounding box\n",
-"            ax.text(x_min, y_min - 5, label, fontsize=12, weight='bold', color='blue', bbox=dict(facecolor=(1.0, 1.0, 1.0, 0.8), edgecolor=(0.95, 0.95, 0.95, 0.0), pad=0.5))\n",
-"            \n",
+"            ax.text(\n",
+"                x_min,\n",
+"                y_min - 5,\n",
+"                label,\n",
+"                fontsize=12,\n",
+"                weight=\"bold\",\n",
+"                color=\"blue\",\n",
+"                bbox=dict(\n",
+"                    facecolor=(1.0, 1.0, 1.0, 0.8), edgecolor=(0.95, 0.95, 0.95, 0.0), pad=0.5\n",
+"                ),\n",
+"            )\n",
+"\n",
 "        else:\n",
 "            # Spanning both sides: color is green\n",
-"            rect.set_edgecolor('green')\n",
+"            rect.set_edgecolor(\"green\")\n",
 "            rect.set_facecolor((0.0, 1.0, 0.0, 0.05))\n",
 "\n",
 "            # Add the rectangle to the plot\n",
 "            ax.add_patch(rect)\n",
 "\n",
 "            # Add custom text label above the bounding box\n",
-"            ax.text(x_min, y_min - 5, label, fontsize=12, weight='bold', color='green', bbox=dict(facecolor=(1.0, 1.0, 1.0, 0.8), edgecolor=(0.95, 0.95, 0.95, 0.0), pad=0.5))\n",
+"            ax.text(\n",
+"                x_min,\n",
+"                y_min - 5,\n",
+"                label,\n",
+"                fontsize=12,\n",
+"                weight=\"bold\",\n",
+"                color=\"green\",\n",
+"                bbox=dict(\n",
+"                    facecolor=(1.0, 1.0, 1.0, 0.8), edgecolor=(0.95, 0.95, 0.95, 0.0), pad=0.5\n",
+"                ),\n",
+"            )\n",
 "\n",
 "    # Draw the vertical line to split the image\n",
-"    ax.axvline(x=vertical_line_x, color='black', linestyle='--', linewidth=1)\n",
+"    ax.axvline(x=vertical_line_x, color=\"black\", linestyle=\"--\", linewidth=1)\n",
 "\n",
 "    # Show the plot\n",
 "    plt.show()"

@@ -123,8 +157,8 @@
 "outputs": [],
 "source": [
 "def reorder_elements_in_double_columns(image_path, bounding_boxes):\n",
-"    # todo: if first element of left and \n",
-"    \n",
+"    # todo: if first element of left and\n",
+"\n",
 "    # Load the image\n",
 "    image = Image.open(image_path)\n",
 "\n",

@@ -183,7 +217,7 @@
 "source": [
 "image_path = \"../../example-docs/double-column-A.jpg\"\n",
 "image = Image.open(image_path)\n",
-"layout = DocumentLayout.from_image_file(image_path) # from_file for pdfs\n",
+"layout = DocumentLayout.from_image_file(image_path)  # from_file for pdfs\n",
 "width, height = image.size\n",
 "print(\"Width:\", width)\n",
 "print(\"Height:\", height)"

@@ -206,7 +240,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"elements_coordinates =[e.to_dict()['coordinates'] for e in elements]\n",
+"elements_coordinates = [e.to_dict()[\"coordinates\"] for e in elements]\n",
 "elements_types = [f\"{ix}: {e.to_dict()['type']}\" for ix, e in enumerate(elements, start=1)]"
 ]
 },

@@ -230,7 +264,9 @@
 }
 ],
 "source": [
-"plot_image_with_bounding_boxes_coloured(image_path, elements_coordinates, elements_types, desired_width=20)"
+"plot_image_with_bounding_boxes_coloured(\n",
+"    image_path, elements_coordinates, elements_types, desired_width=20\n",
+")"
 ]
 },
 {

@@ -961,8 +997,10 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"elements_coordinates_fix =[e.to_dict()['coordinates'] for e in elements_reord]\n",
-"elements_types_fix = [f\"{ix}: {e.to_dict()['type']}\" for ix, e in enumerate(elements_reord, start=1)]"
+"elements_coordinates_fix = [e.to_dict()[\"coordinates\"] for e in elements_reord]\n",
+"elements_types_fix = [\n",
+"    f\"{ix}: {e.to_dict()['type']}\" for ix, e in enumerate(elements_reord, start=1)\n",
+"]"
 ]
 },
 {

@@ -985,7 +1023,9 @@
 }
 ],
 "source": [
-"plot_image_with_bounding_boxes_coloured(image_path, elements_coordinates_fix, elements_types_fix, desired_width=20)"
+"plot_image_with_bounding_boxes_coloured(\n",
+"    image_path, elements_coordinates_fix, elements_types_fix, desired_width=20\n",
+")"
 ]
 },
 {

@@ -1010,12 +1050,14 @@
 "source": [
 "image_path = \"../../example-docs/double-column-B.jpg\"\n",
 "image = Image.open(image_path)\n",
-"layout = DocumentLayout.from_image_file(image_path) # from_file for pdfs\n",
+"layout = DocumentLayout.from_image_file(image_path)  # from_file for pdfs\n",
 "width, height = image.size\n",
 "elements = layout.pages[0].elements\n",
-"elements_coordinates =[e.to_dict()['coordinates'] for e in elements]\n",
+"elements_coordinates = [e.to_dict()[\"coordinates\"] for e in elements]\n",
 "elements_types = [f\"{ix}: {e.to_dict()['type']}\" for ix, e in enumerate(elements, start=1)]\n",
-"plot_image_with_bounding_boxes_coloured(image_path, elements_coordinates, elements_types, desired_width=20)"
+"plot_image_with_bounding_boxes_coloured(\n",
+"    image_path, elements_coordinates, elements_types, desired_width=20\n",
+")"
 ]
 },
 {

@@ -1040,9 +1082,13 @@
 "source": [
 "new_ixs = reorder_elements_in_double_columns(image_path, elements_coordinates)\n",
 "elements_reord = [elements[i] for i in new_ixs]\n",
-"elements_coordinates_fix =[e.to_dict()['coordinates'] for e in elements_reord]\n",
-"elements_types_fix = [f\"{ix}: {e.to_dict()['type']}\" for ix, e in enumerate(elements_reord, start=1)]\n",
-"plot_image_with_bounding_boxes_coloured(image_path, elements_coordinates_fix, elements_types_fix, desired_width=20)"
+"elements_coordinates_fix = [e.to_dict()[\"coordinates\"] for e in elements_reord]\n",
+"elements_types_fix = [\n",
+"    f\"{ix}: {e.to_dict()['type']}\" for ix, e in enumerate(elements_reord, start=1)\n",
+"]\n",
+"plot_image_with_bounding_boxes_coloured(\n",
+"    image_path, elements_coordinates_fix, elements_types_fix, desired_width=20\n",
+")"
 ]
 }
 ],
@@ -138,7 +138,7 @@
 }
 ],
 "source": [
-"text_elements = [element for element in elements if 'Text' in element.category]\n",
+"text_elements = [element for element in elements if \"Text\" in element.category]\n",
 "\n",
 "print(f'total number of \"Text\" elements: {len(text_elements)}')\n",
 "\n",

@@ -258,7 +258,7 @@
 "clean_text_elements_dict = []\n",
 "\n",
 "for element_dict in text_elements_dict:\n",
-"    element_dict['text'] = clean_extra_whitespace(element_dict['text'])\n",
+"    element_dict[\"text\"] = clean_extra_whitespace(element_dict[\"text\"])\n",
 "    clean_text_elements_dict.append(element_dict)\n",
 "\n",
 "# text_elements_dict display of 2 arbitrary Text elements after cleaning withespace\n",

@@ -284,13 +284,13 @@
 "# read credentials\n",
 "\n",
 "config = configparser.ConfigParser()\n",
-"config.read('es-credentials.ini') # path to credentials file\n",
+"config.read(\"es-credentials.ini\")  # path to credentials file\n",
 "\n",
 "# Instantiate the Elasticsearch connection\n",
 "\n",
 "es_client = Elasticsearch(\n",
-"    cloud_id=config['ELASTIC']['cloud_id'],\n",
-"    http_auth=(config['ELASTIC']['user'], config['ELASTIC']['password'])\n",
+"    cloud_id=config[\"ELASTIC\"][\"cloud_id\"],\n",
+"    http_auth=(config[\"ELASTIC\"][\"user\"], config[\"ELASTIC\"][\"password\"]),\n",
 ")"
 ]
 },

@@ -338,21 +338,18 @@
 ],
 "source": [
 "for element in tqdm(clean_text_elements_dict):\n",
-"    element_blob = TextBlob(element['text'])\n",
-"    element['polarity'] = round(element_blob.sentiment.polarity, 4)\n",
-"    element['subjectivity'] = round(element_blob.sentiment.subjectivity, 4)\n",
-"    \n",
-"    if element['polarity'] < 0:\n",
-"        element['sentiment'] = \"negative\"\n",
-"    elif element['polarity'] == 0:\n",
-"        element['sentiment'] = \"neutral\"\n",
-"    else:\n",
-"        element['sentiment'] = \"positive\"\n",
+"    element_blob = TextBlob(element[\"text\"])\n",
+"    element[\"polarity\"] = round(element_blob.sentiment.polarity, 4)\n",
+"    element[\"subjectivity\"] = round(element_blob.sentiment.subjectivity, 4)\n",
+"\n",
-"    es_client.index(\n",
-"        index='search-unstructured-elements', # your index name\n",
-"        document=element\n",
-"    )"
+"    if element[\"polarity\"] < 0:\n",
+"        element[\"sentiment\"] = \"negative\"\n",
+"    elif element[\"polarity\"] == 0:\n",
+"        element[\"sentiment\"] = \"neutral\"\n",
+"    else:\n",
+"        element[\"sentiment\"] = \"positive\"\n",
+"\n",
+"    es_client.index(index=\"search-unstructured-elements\", document=element)  # your index name"
 ]
 },
 {

@@ -384,8 +381,8 @@
 "response = list(s.execute())\n",
 "response.extend(response_pos)\n",
 "\n",
-"sorted_elements = sorted(response, key=lambda d: d['polarity'], reverse=True)\n",
-"sorted_elements = sorted(sorted_elements, key=lambda d: d['subjectivity'], reverse=True)"
+"sorted_elements = sorted(response, key=lambda d: d[\"polarity\"], reverse=True)\n",
+"sorted_elements = sorted(sorted_elements, key=lambda d: d[\"subjectivity\"], reverse=True)"
 ]
 },
 {

@@ -469,7 +466,9 @@
 "print(\"TOP 5 MOST POLARIZED & SUBJECTIVE TEXT ELEMENTS IN THE HTML FILE: \\n\")\n",
 "\n",
 "for ix, hit in enumerate(sorted_elements, start=1):\n",
-"    print(f\"{ix}: {hit.text}\\nsentiment: {hit.sentiment}\\npolarity: {hit.polarity}\\nsubjectivity: {hit.subjectivity}\\n\")\n",
+"    print(\n",
+"        f\"{ix}: {hit.text}\\nsentiment: {hit.sentiment}\\npolarity: {hit.polarity}\\nsubjectivity: {hit.subjectivity}\\n\"\n",
+"    )\n",
 "    if ix == 5:\n",
 "        break"
 ]
@@ -353,12 +353,7 @@
 }
 ],
 "source": [
-"elements_df.to_sql(\n",
-"    name=table_name,\n",
-"    con=engine,\n",
-"    if_exists=\"replace\",\n",
-"    index=False\n",
-")"
+"elements_df.to_sql(name=table_name, con=engine, if_exists=\"replace\", index=False)"
 ]
 },
 {

@@ -394,7 +389,7 @@
 "outputs": [],
 "source": [
 "with engine.begin() as conn:\n",
-"  elements_read_df = pd.read_sql_query(sql=text(sql), con=conn)"
+"    elements_read_df = pd.read_sql_query(sql=text(sql), con=conn)"
 ]
 },
 {
@@ -274,7 +274,8 @@
 "query = (\n",
 "    session.query(Element)\n",
 "    .filter(Element.category == \"NarrativeText\")\n",
-"    .order_by(Element.embedding.l2_distance(vector)).limit(5)\n",
+"    .order_by(Element.embedding.l2_distance(vector))\n",
+"    .limit(5)\n",
 ")\n",
 "\n",
 "for element in query:\n",
@@ -44,9 +44,31 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"tickers = ['ehc', 'mrk','nke', 'msex', 'v', 'cvs', 'doc', 'smtc', 'cl', \n",
-"'ava', 'bc', 'f', 'lmt', 'cri', 'aig', 'rgld', 'apld', 'omcl', \n",
-"'mmm', 'bgs', 'dis','wetg', 'bj']"
+"tickers = [\n",
+"    \"ehc\",\n",
+"    \"mrk\",\n",
+"    \"nke\",\n",
+"    \"msex\",\n",
+"    \"v\",\n",
+"    \"cvs\",\n",
+"    \"doc\",\n",
+"    \"smtc\",\n",
+"    \"cl\",\n",
+"    \"ava\",\n",
+"    \"bc\",\n",
+"    \"f\",\n",
+"    \"lmt\",\n",
+"    \"cri\",\n",
+"    \"aig\",\n",
+"    \"rgld\",\n",
+"    \"apld\",\n",
+"    \"omcl\",\n",
+"    \"mmm\",\n",
+"    \"bgs\",\n",
+"    \"dis\",\n",
+"    \"wetg\",\n",
+"    \"bj\",\n",
+"]"
 ]
 },
 {

@@ -86,14 +108,14 @@
 ],
 "source": [
 "forms = []\n",
-"for ticker in tickers: \n",
+"for ticker in tickers:\n",
 "    form_text = get_form_by_ticker(\n",
 "        ticker=ticker,\n",
 "        form_type=\"10-K\",\n",
 "        company=\"Unstructured Technologies\",\n",
-"        email=\"support@unstructured.io\"\n",
+"        email=\"support@unstructured.io\",\n",
 "    )\n",
-"    \n",
+"\n",
 "    filename = os.path.join(data_directory, f\"{ticker}-10k.xbrl\")\n",
 "    with open(filename, \"w\") as f:\n",
 "        f.write(form_text)\n",

@@ -278,12 +300,14 @@
 "annotations = []\n",
 "for i, element in enumerate(elements):\n",
 "    inference = sentiment_pipeline(element.text, truncation=True)\n",
-"    result = [LabelStudioResult(\n",
-"        type=\"choices\",\n",
-"        value={\"choices\": [inference[0][\"label\"].title()]},\n",
-"        from_name=\"sentiment\",\n",
-"        to_name=\"text\",\n",
-"    )]\n",
+"    result = [\n",
+"        LabelStudioResult(\n",
+"            type=\"choices\",\n",
+"            value={\"choices\": [inference[0][\"label\"].title()]},\n",
+"            from_name=\"sentiment\",\n",
+"            to_name=\"text\",\n",
+"        )\n",
+"    ]\n",
 "    annotations.append([LabelStudioAnnotation(result=result)])\n",
 "    print(\".\", end=\"\") if i % 40 == 1 else None"
 ]

@@ -380,11 +404,12 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"datasets_data = Dataset.from_dict({\n",
-"    \"text\": [item[\"text\"] for item in training_data],\n",
-"    \"label\": [0 if item[\"sentiment\"] == \"Negative\" else 1 \n",
-"              for item in training_data]\n",
-"})"
+"datasets_data = Dataset.from_dict(\n",
+"    {\n",
+"        \"text\": [item[\"text\"] for item in training_data],\n",
+"        \"label\": [0 if item[\"sentiment\"] == \"Negative\" else 1 for item in training_data],\n",
+"    }\n",
+")"
 ]
 },
 {

@@ -426,10 +451,7 @@
 "source": [
 "from transformers import AutoModelForSequenceClassification\n",
 "\n",
-"model = AutoModelForSequenceClassification.from_pretrained(\n",
-"    model_name, \n",
-"    num_labels=2\n",
-")\n"
+"model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)"
 ]
 },
 {

@@ -440,6 +462,7 @@
 "outputs": [],
 "source": [
 "from transformers import AutoTokenizer\n",
+"\n",
 "tokenizer = AutoTokenizer.from_pretrained(model_name)"
 ]
 },

@@ -451,7 +474,7 @@
 "outputs": [],
 "source": [
 "def preprocess_function(examples):\n",
-"  return tokenizer(examples[\"text\"], truncation=True)"
+"    return tokenizer(examples[\"text\"], truncation=True)"
 ]
 },
 {

@@ -487,6 +510,7 @@
 "outputs": [],
 "source": [
 "from transformers import DataCollatorWithPadding\n",
+"\n",
 "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)"
 ]
 },

@@ -508,10 +532,10 @@
 "outputs": [],
 "source": [
 "trainer = Trainer(\n",
-"  model=model,\n",
-"  train_dataset=tokenized_train,\n",
-"  tokenizer=tokenizer,\n",
-"  data_collator=data_collator,\n",
+"    model=model,\n",
+"    train_dataset=tokenized_train,\n",
+"    tokenizer=tokenizer,\n",
+"    data_collator=data_collator,\n",
 ")"
 ]
 },

@@ -709,8 +733,8 @@
 ],
 "source": [
 "sec_sentiment_model = pipeline(\n",
-"task=\"sentiment-analysis\",\n",
-"model=\"./sec-sentiment-model\",\n",
+"    task=\"sentiment-analysis\",\n",
+"    model=\"./sec-sentiment-model\",\n",
 ")"
 ]
 },
@@ -154,19 +154,21 @@
 "\n",
 "nlp = spacy.load(\"en_core_web_sm\")\n",
 "\n",
+"\n",
 "def extract_numbers_with_context(text):\n",
 "    doc = nlp(text)\n",
 "    numbers = []\n",
-"    \n",
+"\n",
 "    for token in doc:\n",
-"        if token.like_num and token.dep_ == 'nummod' and token.head.pos_ == 'NOUN':\n",
+"        if token.like_num and token.dep_ == \"nummod\" and token.head.pos_ == \"NOUN\":\n",
 "            number = token.text\n",
 "            noun = token.head.text\n",
-"            context = ' '.join([number, noun])\n",
+"            context = \" \".join([number, noun])\n",
 "            numbers.append((number, noun, context))\n",
-"    \n",
+"\n",
 "    return numbers\n",
 "\n",
+"\n",
 "# Example usage\n",
 "text = \"I bought 10 apples and 5 oranges yesterday.\"\n",
 "numbers_with_context = extract_numbers_with_context(text)\n",
@@ -302,6 +302,7 @@
 ],
 "source": [
 "from unstructured.documents.elements import Text\n",
+"\n",
 "element = Text(\"Philadelphia Eaglesâ\\x80\\x99 victory\")\n",
 "element.apply(replace_unicode_quotes)\n",
 "print(element)"
@@ -91,7 +91,7 @@
 "outputs": [],
 "source": [
 "unstructured_class = create_unstructured_weaviate_class(unstructured_class_name)\n",
-"schema = {\"classes\": [unstructured_class]} "
+"schema = {\"classes\": [unstructured_class]}"
 ]
 },
 {

@@ -225,11 +225,8 @@
 ],
 "source": [
 "response = (\n",
-"    client.query\n",
-"    .get(\"UnstructuredDocument\", [\"text\", \"_additional {score}\"])\n",
-"    .with_bm25(\n",
-"        query=\"document understanding\"\n",
-"    )\n",
+"    client.query.get(\"UnstructuredDocument\", [\"text\", \"_additional {score}\"])\n",
+"    .with_bm25(query=\"document understanding\")\n",
 "    .with_limit(2)\n",
 "    .do()\n",
 ")\n",
@@ -76,7 +76,7 @@ jsonpatch==1.33
     # via langchain
 jsonpointer==2.4
     # via jsonpatch
-langchain==0.0.317
+langchain==0.0.318
     # via -r requirements/embed-huggingface.in
 langsmith==0.0.46
     # via langchain
@@ -6,7 +6,7 @@ pdf2image
 pdfminer.six
 # Do not move to contsraints.in, otherwise unstructured-inference will not be upgraded
 # when unstructured library is.
-unstructured-inference==0.7.7
+unstructured-inference==0.7.9
 # unstructured fork of pytesseract that provides an interface to allow for multiple output formats
 # from one tesseract call
 unstructured.pytesseract>=0.3.12
@@ -92,7 +92,9 @@ numpy==1.24.4
 omegaconf==2.3.0
     # via effdet
 onnx==1.14.1
-    # via -r requirements/extra-pdf-image.in
+    # via
+    #   -r requirements/extra-pdf-image.in
+    #   unstructured-inference
 onnxruntime==1.15.1
     # via
     #   -c requirements/constraints.in

@@ -151,7 +153,7 @@ pyparsing==3.0.9
     # via
     #   -c requirements/constraints.in
     #   matplotlib
-pypdfium2==4.21.0
+pypdfium2==4.22.0
     # via pdfplumber
 pytesseract==0.3.10
     # via layoutparser

@@ -234,7 +236,7 @@ typing-extensions==4.8.0
     #   torch
 tzdata==2023.3
     # via pandas
-unstructured-inference==0.7.7
+unstructured-inference==0.7.9
     # via -r requirements/extra-pdf-image.in
 unstructured-pytesseract==0.3.12
     # via
@@ -61,7 +61,7 @@ jsonpatch==1.33
     # via langchain
 jsonpointer==2.4
     # via jsonpatch
-langchain==0.0.317
+langchain==0.0.318
     # via -r requirements/ingest-bedrock.in
 langsmith==0.0.46
     # via langchain
@@ -50,7 +50,7 @@ jsonpatch==1.33
     # via langchain
 jsonpointer==2.4
     # via jsonpatch
-langchain==0.0.317
+langchain==0.0.318
     # via -r requirements/ingest-openai.in
 langsmith==0.0.46
     # via langchain
@@ -93,7 +93,7 @@ pytest==7.4.2
     #   pytest-mock
 pytest-cov==4.1.0
     # via -r requirements/test.in
-pytest-mock==3.11.1
+pytest-mock==3.12.0
     # via -r requirements/test.in
 python-dateutil==2.8.2
     # via freezegun
@@ -77,7 +77,7 @@ def test_chunk_by_title():
         Text(
             "Today is a bad day.",
             metadata=ElementMetadata(
-                regex_metadata={"a": [RegexMetadata(text="A", start=0, end=1)]}
+                regex_metadata={"a": [RegexMetadata(text="A", start=0, end=1)]},
             ),
         ),
         Text("It is storming outside."),

@@ -99,7 +99,7 @@ def test_chunk_by_title():

     assert chunks[0].metadata == ElementMetadata(emphasized_text_contents=["Day", "day"])
     assert chunks[3].metadata == ElementMetadata(
-        regex_metadata={"a": [RegexMetadata(text="A", start=11, end=12)]}
+        regex_metadata={"a": [RegexMetadata(text="A", start=11, end=12)]},
     )

@@ -116,7 +116,7 @@ def test_chunk_by_title_respects_section_change():
         Text(
             "Today is a bad day.",
             metadata=ElementMetadata(
-                regex_metadata={"a": [RegexMetadata(text="A", start=0, end=1)]}
+                regex_metadata={"a": [RegexMetadata(text="A", start=0, end=1)]},
             ),
         ),
         Text("It is storming outside."),

@@ -153,7 +153,7 @@ def test_chunk_by_title_separates_by_page_number():
         Text(
             "Today is a bad day.",
             metadata=ElementMetadata(
-                regex_metadata={"a": [RegexMetadata(text="A", start=0, end=1)]}
+                regex_metadata={"a": [RegexMetadata(text="A", start=0, end=1)]},
             ),
         ),
         Text("It is storming outside."),

@@ -187,19 +187,19 @@ def test_chunk_by_title_does_not_break_on_regex_metadata_change():
         Title(
             "Lorem Ipsum",
             metadata=ElementMetadata(
-                regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
+                regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]},
             ),
         ),
         Text(
             "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
             metadata=ElementMetadata(
-                regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
+                regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]},
             ),
         ),
         Text(
             "In rhoncus ipsum sed lectus porta volutpat.",
             metadata=ElementMetadata(
-                regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
+                regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]},
             ),
         ),
     ]

@@ -209,8 +209,8 @@ def test_chunk_by_title_does_not_break_on_regex_metadata_change():
     assert chunks == [
         CompositeElement(
             "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
-            " ipsum sed lectus porta volutpat."
-        )
+            " ipsum sed lectus porta volutpat.",
+        ),
     ]

@@ -224,7 +224,7 @@ def test_chunk_by_title_consolidates_and_adjusts_offsets_of_regex_metadata():
         Title(
             "Lorem Ipsum",
             metadata=ElementMetadata(
-                regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
+                regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]},
             ),
         ),
         Text(

@@ -233,13 +233,13 @@ def test_chunk_by_title_consolidates_and_adjusts_offsets_of_regex_metadata():
                 regex_metadata={
                     "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                     "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
-                }
+                },
             ),
         ),
         Text(
             "In rhoncus ipsum sed lectus porta volutpat.",
             metadata=ElementMetadata(
-                regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
+                regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]},
             ),
         ),
     ]

@@ -249,7 +249,7 @@ def test_chunk_by_title_consolidates_and_adjusts_offsets_of_regex_metadata():
     chunk = chunks[0]
     assert chunk == CompositeElement(
         "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
-        " ipsum sed lectus porta volutpat."
+        " ipsum sed lectus porta volutpat.",
     )
     assert chunk.metadata.regex_metadata == {
         "dolor": [RegexMetadata(text="dolor", start=25, end=30)],

@@ -274,7 +274,7 @@ def test_chunk_by_title_groups_across_pages():
         Text(
             "Today is a bad day.",
             metadata=ElementMetadata(
-                regex_metadata={"a": [RegexMetadata(text="A", start=0, end=1)]}
+                regex_metadata={"a": [RegexMetadata(text="A", start=0, end=1)]},
             ),
         ),
         Text("It is storming outside."),
@@ -130,7 +130,7 @@ def test_partition_image_with_auto_strategy(
     elements = image.partition_image(filename=filename, strategy="auto")
     titles = [el for el in elements if el.category == "Title" and len(el.text.split(" ")) > 10]
     title = "LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis"
-    idx = 2
+    idx = 3
     assert titles[0].text == title
     assert elements[idx].metadata.detection_class_prob is not None
     assert isinstance(elements[idx].metadata.detection_class_prob, float)

@@ -255,7 +255,7 @@ def test_partition_image_default_strategy_hi_res():
         elements = image.partition_image(file=f)

     title = "LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis"
-    idx = 2
+    idx = 3
     assert elements[idx].text == title
     assert elements[idx].metadata.coordinates is not None
     assert elements[idx].metadata.detection_class_prob is not None

@@ -503,7 +503,7 @@ def test_partition_image_uses_model_name():
 @pytest.mark.parametrize(
     ("ocr_mode", "idx_title_element"),
     [
-        ("entire_page", 2),
+        ("entire_page", 3),
         ("individual_blocks", 1),
     ],
 )

@@ -521,6 +521,27 @@ def test_partition_image_hi_res_invalid_ocr_mode():
         _ = image.partition_image(filename=filename, ocr_mode="invalid_ocr_mode", strategy="hi_res")


+@pytest.mark.parametrize(
+    ("ocr_mode"),
+    [
+        ("entire_page"),
+        ("individual_blocks"),
+    ],
+)
+def test_partition_image_hi_res_ocr_mode_with_table_extraction(ocr_mode):
+    filename = "example-docs/layout-parser-paper-with-table.jpg"
+    elements = image.partition_image(
+        filename=filename,
+        ocr_mode=ocr_mode,
+        strategy="hi_res",
+        infer_table_structure=True,
+    )
+    table = [el.metadata.text_as_html for el in elements if el.metadata.text_as_html]
+    assert len(table) == 1
+    assert "<table><thead><th>" in table[0]
+    assert "Layouts of history Japanese documents" in table[0]
+
+
 def test_partition_image_raises_TypeError_for_invalid_languages():
     filename = "example-docs/layout-parser-paper-fast.jpg"
     with pytest.raises(TypeError):
@@ -1,3 +1,5 @@
+import numpy as np
+import pandas as pd
 import pytest
 import unstructured_pytesseract
 from pdf2image.exceptions import PDFPageCountError

@@ -47,9 +49,8 @@ def test_process_file_with_ocr_invalid_filename(is_image):
         )


-# TODO(yuming): Add this for test coverage, please update/move it in CORE-1886
 def test_supplement_page_layout_with_ocr_invalid_ocr(monkeypatch):
-    monkeypatch.setenv("ENTIRE_PAGE_OCR", "invalid_ocr")
+    monkeypatch.setenv("OCR_AGENT", "invalid_ocr")
     with pytest.raises(ValueError):
         _ = ocr.supplement_page_layout_with_ocr(
             page_layout=None,

@@ -61,14 +62,15 @@ def test_get_ocr_layout_from_image_tesseract(monkeypatch):
     monkeypatch.setattr(
         unstructured_pytesseract,
         "image_to_data",
-        lambda *args, **kwargs: {
-            "level": ["line", "line", "word"],
-            "left": [10, 20, 30],
-            "top": [5, 15, 25],
-            "width": [15, 25, 35],
-            "height": [10, 20, 30],
-            "text": ["Hello", "World", "!"],
-        },
+        lambda *args, **kwargs: pd.DataFrame(
+            {
+                "left": [10, 20, 30, 0],
+                "top": [5, 15, 25, 0],
+                "width": [15, 25, 35, 0],
+                "height": [10, 20, 30, 0],
+                "text": ["Hello", "World", "!", ""],
+            },
+        ),
     )

     image = Image.new("RGB", (100, 100))

@@ -76,7 +78,7 @@ def test_get_ocr_layout_from_image_tesseract(monkeypatch):
     ocr_layout = ocr.get_ocr_layout_from_image(
         image,
         ocr_languages="eng",
-        entire_page_ocr="tesseract",
+        ocr_agent="tesseract",
     )

     expected_layout = [

@@ -108,6 +110,12 @@ def mock_ocr(*args, **kwargs):
             ["!"],
         ),
     ],
+    [
+        (
+            [(0, 0), (0, 0), (0, 0), (0, 0)],
+            [""],
+        ),
+    ],
 ]


@@ -128,7 +136,7 @@ def test_get_ocr_layout_from_image_paddle(monkeypatch):

     image = Image.new("RGB", (100, 100))

-    ocr_layout = ocr.get_ocr_layout_from_image(image, ocr_languages="eng", entire_page_ocr="paddle")
+    ocr_layout = ocr.get_ocr_layout_from_image(image, ocr_languages="eng", ocr_agent="paddle")

     expected_layout = [
         TextRegion.from_coords(10, 5, 25, 15, "Hello", source="OCR-paddle"),

@@ -147,7 +155,7 @@ def test_get_ocr_text_from_image_tesseract(monkeypatch):
     )
     image = Image.new("RGB", (100, 100))

-    ocr_text = ocr.get_ocr_text_from_image(image, ocr_languages="eng", entire_page_ocr="tesseract")
+    ocr_text = ocr.get_ocr_text_from_image(image, ocr_languages="eng", ocr_agent="tesseract")

     assert ocr_text == "Hello World"

@@ -161,7 +169,7 @@ def test_get_ocr_text_from_image_paddle(monkeypatch):

     image = Image.new("RGB", (100, 100))

-    ocr_text = ocr.get_ocr_text_from_image(image, ocr_languages="eng", entire_page_ocr="paddle")
+    ocr_text = ocr.get_ocr_text_from_image(image, ocr_languages="eng", ocr_agent="paddle")

     assert ocr_text == "HelloWorld!"

@@ -231,6 +239,18 @@ def test_get_elements_from_ocr_regions(mock_embedded_text_regions):
     assert elements == expected


+@pytest.mark.parametrize("zoom", [1, 0.1, 5, -1, 0])
+def test_zoom_image(zoom):
+    image = Image.new("RGB", (100, 100))
+    width, height = image.size
+    new_image = ocr.zoom_image(image, zoom)
+    new_w, new_h = new_image.size
+    if zoom <= 0:
+        zoom = 1
+    assert new_w == np.round(width * zoom, 0)
+    assert new_h == np.round(height * zoom, 0)
+
+
 @pytest.fixture()
 def mock_layout(mock_embedded_text_regions):
     return [

@@ -390,3 +410,40 @@ def test_pad_element_bboxes(padding, expected_bbox):
     # make sure the original element has not changed
     original_element_bbox = (element.bbox.x1, element.bbox.y1, element.bbox.x2, element.bbox.y2)
     assert original_element_bbox == expected_original_element_bbox
+
+
+@pytest.fixture()
+def table_element():
+    table = LayoutElement.from_coords(x1=10, y1=20, x2=50, y2=70, text="I am a table", type="Table")
+    return table
+
+
+@pytest.fixture()
+def ocr_layout():
+    ocr_regions = [
+        TextRegion.from_coords(x1=15, y1=25, x2=35, y2=45, text="Token1"),
+        TextRegion.from_coords(x1=40, y1=30, x2=45, y2=50, text="Token2"),
+    ]
+    return ocr_regions
+
+
+def test_get_table_tokens_per_element(table_element, ocr_layout):
+    table_tokens = ocr.get_table_tokens_per_element(table_element, ocr_layout)
+    expected_tokens = [
+        {
+            "bbox": [5, 5, 25, 25],
+            "text": "Token1",
+            "span_num": 0,
+            "line_num": 0,
+            "block_num": 0,
+        },
+        {
+            "bbox": [30, 10, 35, 30],
+            "text": "Token2",
+            "span_num": 1,
+            "line_num": 0,
+            "block_num": 0,
+        },
+    ]
+
+    assert table_tokens == expected_tokens
@ -181,7 +181,6 @@ def test_partition_pdf_with_model_name_env_var(
|
||||
filename,
|
||||
is_image=False,
|
||||
pdf_image_dpi=200,
|
||||
extract_tables=False,
|
||||
model_name="checkbox",
|
||||
)
|
||||
|
||||
@ -201,7 +200,6 @@ def test_partition_pdf_with_model_name(
|
||||
filename,
|
||||
is_image=False,
|
||||
pdf_image_dpi=200,
|
||||
extract_tables=False,
|
||||
model_name="checkbox",
|
||||
)
|
||||
|
||||
@ -394,10 +392,34 @@ def test_partition_pdf_falls_back_to_ocr_only(
|
||||
def test_partition_pdf_uses_table_extraction():
|
||||
filename = "example-docs/layout-parser-paper-fast.pdf"
|
||||
with mock.patch(
|
||||
"unstructured_inference.inference.layout.process_file_with_model",
|
||||
"unstructured.partition.ocr.process_file_with_ocr",
|
||||
) as mock_process_file_with_model:
|
||||
pdf.partition_pdf(filename, infer_table_structure=True)
|
||||
assert mock_process_file_with_model.call_args[1]["extract_tables"]
|
||||
assert mock_process_file_with_model.call_args[1]["infer_table_structure"]
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
("ocr_mode"),
|
||||
[
|
||||
("entire_page"),
|
||||
("individual_blocks"),
|
||||
],
|
||||
)
|
||||
def test_partition_pdf_hi_table_extraction_with_languages(ocr_mode):
|
||||
filename = "example-docs/korean-text-with-tables.pdf"
|
||||
elements = pdf.partition_pdf(
|
||||
filename=filename,
|
||||
ocr_mode=ocr_mode,
|
||||
languages=["kor"],
|
||||
strategy="hi_res",
|
||||
infer_table_structure=True,
|
||||
)
|
||||
table = [el.metadata.text_as_html for el in elements if el.metadata.text_as_html]
|
||||
assert len(table) == 2
|
||||
assert "<table><thead><th>" in table[0]
|
||||
# FIXME(yuming): didn't test full sentence here since unit test and docker test have
|
||||
# some differences on spaces between characters
|
||||
assert "업" in table[0]


def test_partition_pdf_with_copy_protection():

@ -418,7 +440,6 @@ def test_partition_pdf_with_dpi():
        mock_process.assert_called_once_with(
            filename,
            is_image=False,
            extract_tables=False,
            model_name=pdf.default_hi_res_model(),
            pdf_image_dpi=100,
        )

@ -854,15 +875,9 @@ def test_add_chunking_strategy_by_title_on_partition_pdf(

def test_partition_pdf_formats_languages_for_tesseract():
    filename = "example-docs/DA-1p.pdf"
    with mock.patch.object(layout, "process_file_with_model", mock.MagicMock()) as mock_process:
    with mock.patch.object(ocr, "process_file_with_ocr", mock.MagicMock()) as mock_process:
        pdf.partition_pdf(filename=filename, strategy="hi_res", languages=["en"])
        mock_process.assert_called_once_with(
            filename,
            is_image=False,
            pdf_image_dpi=200,
            extract_tables=False,
            model_name=pdf.default_hi_res_model(),
        )
        assert mock_process.call_args[1]["ocr_languages"] == "eng"


def test_partition_pdf_warns_with_ocr_languages(caplog):

@ -325,10 +325,10 @@ def test_auto_partition_pdf_from_filename(pass_metadata_filename, content_type,
def test_auto_partition_pdf_uses_table_extraction():
    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.pdf")
    with patch(
        "unstructured_inference.inference.layout.process_file_with_model",
        "unstructured.partition.ocr.process_file_with_ocr",
    ) as mock_process_file_with_model:
        partition(filename, pdf_infer_table_structure=True, strategy="hi_res")
        assert mock_process_file_with_model.call_args[1]["extract_tables"]
        assert mock_process_file_with_model.call_args[1]["infer_table_structure"]


def test_auto_partition_pdf_with_fast_strategy(monkeypatch):

@ -430,7 +430,7 @@ def test_auto_partition_image_default_strategy_hi_res(pass_metadata_filename, co

    # should be same result as test_partition_image_default_strategy_hi_res() in test_image.py
    title = "LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis"
    idx = 2
    idx = 3
    assert elements[idx].text == title
    assert elements[idx].metadata.coordinates is not None


test_unstructured/partition/utils/test_config.py (new file, 11 lines)
@ -0,0 +1,11 @@
def test_default_config():
    from unstructured.partition.utils.config import env_config

    assert env_config.IMAGE_CROP_PAD == 0


def test_env_override(monkeypatch):
    monkeypatch.setenv("IMAGE_CROP_PAD", "1")
    from unstructured.partition.utils.config import env_config

    assert env_config.IMAGE_CROP_PAD == 1
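
Editor's note: together these two tests pin down the contract of `env_config`: each setting has a hard default (`IMAGE_CROP_PAD` defaults to 0), an identically named environment variable overrides it, the value is coerced to the setting's type, and the override must be visible even if the module was imported earlier in the session. A minimal sketch meeting that contract, as an assumption for illustration (the actual `config.py` added in this PR may be implemented differently):

import os

class ENVConfig:
    """Settings read from the environment on access, with typed defaults."""

    @property
    def IMAGE_CROP_PAD(self) -> int:
        # Reading lazily via a property, instead of capturing the value at
        # import time, is what lets monkeypatch.setenv take effect even after
        # the module has already been imported.
        return int(os.environ.get("IMAGE_CROP_PAD", 0))

env_config = ENVConfig()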

@ -1,7 +1,7 @@
[
  {
    "type": "Title",
    "element_id": "898ffdb987b556da4dc7ccd1cf11f1b3",
    "element_id": "b936b2ee403883f6bff0295edb71fae1",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -16,7 +16,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "rh Department of the Treasury Internal Revenue Service Instructions for Form 3115 (Rev. November 1987) Application for Change in Accoun ig Method"
    "text": "FM) Department of the Treasury Internal Revenue Service Instructions for Form 3115 (Rev. November 1987) Application for Change in Accounting Method"
  },
  {
    "type": "NarrativeText",
@ -77,7 +77,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "c9bc33e913a25aaffa8367aa11bc8ed9",
    "element_id": "1dc07a295891aaf9dfd2559efee66ab2",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -92,7 +92,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Internal Revenue laws of the United States. We need it to ensure that taxpayers are complying with these laws an¢ to allow us to figure and collect the nght amount of tax. You are required to this information."
    "text": "Internal Revenue laws of the United States. We need it to ensure that taxpayers are complying with these laws and to allow us to figure and collect the right amount of tax. You are required to this information."
  },
  {
    "type": "NarrativeText",
@ -153,7 +153,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "f9b8e17da7a31507773f78959378e09c",
    "element_id": "fdb8017fc73bdc12f7200dece8b76c99",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -168,26 +168,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "File this form to request a change in your accounting method, including the accounting treatment of any item. if you are requesting 2 change in accounting period, use Form 1128, Application for Change in Accounting Period. For more information, see Publication 538, Accounting Periods and Methods,"
  },
  {
    "type": "UncategorizedText",
    "element_id": "2127f2ab4fc4feb4d32460c8317bf02f",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
        "version": 328871203465633719836776597535876541325,
        "record_locator": {
          "protocol": "abfs",
          "remote_file_path": "container1/IRS-form-1987.png"
        },
        "date_created": "2023-03-10T09:44:55+00:00",
        "date_modified": "2023-03-10T09:44:55+00:00"
      },
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Form 3115,"
    "text": "File this form to request a change in your accounting method, including the accounting treatment of any item. If you are requesting a change in accounting period, use Form 1128, Application for Change in Accounting Period. For more information, see Publication 538, Accounting Periods and Methods."
  },
  {
    "type": "Title",
@ -209,8 +190,8 @@
    "text": "When"
  },
  {
    "type": "NarrativeText",
    "element_id": "06658399dddcd1d4d4fda8f9fa90fd53",
    "type": "UncategorizedText",
    "element_id": "2127f2ab4fc4feb4d32460c8317bf02f",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -225,11 +206,11 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "filing taxpayers are reminded to determine if IRS has published a ruling or procedure dealing with the specific type of change since November 1987 (the current. revision date of Form 3115)"
    "text": "Form 3115,"
  },
  {
    "type": "NarrativeText",
    "element_id": "03c4a83e399f2f669047b3fcfeae5867",
    "element_id": "d3c76e8e037d3aec863db1d768b81f6d",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -244,7 +225,26 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Long-term contracts.—If you are required to"
    "text": "filing taxpayers are reminded to determine if IRS has published a ruling or procedure dealing with the specific type of change since November 1987 (the current revision date of Form 3115)."
  },
  {
    "type": "NarrativeText",
    "element_id": "4e7936ce3e93c2ae3ca21444e573c208",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
        "version": 328871203465633719836776597535876541325,
        "record_locator": {
          "protocol": "abfs",
          "remote_file_path": "container1/IRS-form-1987.png"
        },
        "date_created": "2023-03-10T09:44:55+00:00",
        "date_modified": "2023-03-10T09:44:55+00:00"
      },
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Long-term contracts. —If you are required to"
  },
  {
    "type": "NarrativeText",
@ -267,7 +267,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "463ce4107785bb9854ad10b81d93dc7f",
    "element_id": "c259777a4c46933ffda4f8b06922fcfd",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -282,7 +282,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Other methods. —Unless the Service has published a regulation or procedure to the contrary, all other changes in accounting methods required by the Act are automatically considered to be approved by the Commissioner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method for bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving credit plan (Act section 812); (3) the Inclusion of mcome attributable to the sale or furnishing of utility services no later than the year in which the services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not file Form 3115 for these"
    "text": "Other methods.—Unless the Service has published a regulation or procedure to the contrary, all other changes 1n accounting methods required by the Act are automatically considered to be approved by the Commissioner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method for bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving credit plan (Act section 812); (3) the inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not file Form 3115 for these"
  },
  {
    "type": "Title",
@ -305,7 +305,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "ca23ec8966835bea3ca9b191b53ada9d",
    "element_id": "790998c70f51276d6fae7db8df6e8351",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -320,26 +320,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Generally, applicants must complete Section In addition, complete the appropriate sections (B:1 through H) for which a change is desired."
  },
  {
    "type": "UncategorizedText",
    "element_id": "e16bce609163ec96985ae522ca81502a",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
        "version": 328871203465633719836776597535876541325,
        "record_locator": {
          "protocol": "abfs",
          "remote_file_path": "container1/IRS-form-1987.png"
        },
        "date_created": "2023-03-10T09:44:55+00:00",
        "date_modified": "2023-03-10T09:44:55+00:00"
      },
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "‘A."
    "text": "Generally, applicants must complete Section A. In addition, complete the appropriate sections (B-1 through H) for which a change ts desired."
  },
  {
    "type": "NarrativeText",
@ -362,7 +343,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "25f830e7c39c115c9937eb9d11cfb1f2",
    "element_id": "505160cf4f5ef1cf128734840a8c98d1",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -377,7 +358,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "State whether you desire a conference in the National Office if the Service proposes to disapprove your application"
    "text": "State whether you desire a conference in the National Office if the Service proposes to disapprove your application."
  },
  {
    "type": "Title",
@ -400,7 +381,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "c10c0c63b05172dff854d1d0e570c588",
    "element_id": "6fef08d976462f03567612cbea207d0b",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -415,11 +396,11 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Uniform capitalization rules and limitation on cash method.—If you are required to change your method of accounting under section, 263A (relating to the capitalization and inclusion in inventory costs of certain expenses) or 448 (imiting the use of the cash method of accounting by certain taxpayers) as added by the Tax Reform Act of 1986 (\"Act\"), the change is treated as initiated by the taxpayer, approved by the Commissioner, and the period for taking the adjustments under section 481(a) into account will not exceed 4 years. (Hospitals required to cchange from the cash method under section 448 have 10 years to take the adjustments into account.) Complete Section A and the appropriate sections (B-1 or C and D) for which the change is required"
    "text": "Uniform capitalization rules and limitation on cash method.—If you are required to change your method of accounting under sectior,263A (relating to the capitalization and inclusion in inventory costs of certain expenses) or 448 (limiting the use of the cash method of accounting by certain taxpayers) as added by the Tax Reform Act of 1986 (“Act”), the change is treated as initiated by the taxpayer, approved by the Commissioner, and the period for taking the adjustments under section 481(a) into account will not exceed 4 years. (Hospitals required to change from the cash method under section 448 have 10 years to take the adjustments into account.) Complete Section A and the appropriate sections (B-1 or C and D) for which the change is required."
  },
  {
    "type": "NarrativeText",
    "element_id": "fc2252774c86adc22225761fc0bee985",
    "element_id": "207be7b7dc9e8cd420376c5aef291cfc",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -434,7 +415,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Disregard the instructions under Time and Place for Filing and Late Applications. instead, attach Form 3115 to your income tax return for the year of change; do not file it separately. Also include on a separate statement accompanying the Form 3115 the period over which the section 481(2) adjustment will be taken into account and the basis for that conclusion. Identify the automatic change being made at the top of page 1 of Form 3118 eg. “Automatic Change to Accrual Method—Section 448”). See Temporary Regulations sections 1.263A-1T and 1.448-1T for additional information"
    "text": "Disregard the instructions under Time and Place for Filing and Late Applications. Instead, attach Form 3115 to your income tax return for the year of change; do not file it separately. Also include on a separate statement accompanying the Form 3115 the period over which the section 481(a) adjustment will be taken into account and the basis for that conclusion. Identify the automatic change being made at the top of page 1 of Form 3115 (e.g., “Automatic Change to Accrual Method Section 448\"). See Temporary Regulations sections 1.263A-1T and 1.448-1T for additional information."
  },
  {
    "type": "Title",
@ -457,7 +438,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "a720c5a62597e77c686cbc5df1c682ce",
    "element_id": "af8bdf713f162b09567c8d1a3a2d4de7",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -472,11 +453,11 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Generally, applicants must file this form within the first 180 days of the tax year in which itis desired to make the change."
    "text": "Generally, applicants must file this form within the first 180 days of the tax year in which it is desired to make the change."
  },
  {
    "type": "NarrativeText",
    "element_id": "9dda11db48254f5e0d0000afb5d1dd9b",
    "element_id": "1bf38b8fd54e59f8ec042e22cc5c5a89",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -491,7 +472,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Taxpayers, other than exempt organizations, should file Form 3115 with the Commissioner of Internal Revenue, Attention: CC:C:4, 1111 Constitution Avenue, NW, Washington, DC 20224, Exempt organizations should file with the Assistant Commissioner (Employee Plans and Exempt Organizations), 1111 Constitution Avenue, NW, Washington, DC 20224."
    "text": "Taxpayers, other than exempt organizations, should file Form 3115 with the Commissioner of Internal Revenue, Attention: CC:C:4, 1111 Constitution Avenue, NW, Washington, DC 20224. Exempt organizations should file with the Assistant Commissioner (Employee Plans and Exempt Organizations), 1111 Constitution Avenue, NW, Washington, DC 20224."
  },
  {
    "type": "NarrativeText",
@ -514,7 +495,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "e3e2ccf4f0d1524d4f5ce42e8f2d1efa",
    "element_id": "c56ebb2883fe0c95b8564fa3969f7010",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -529,7 +510,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "See section 5.03 of Rev. Proc. 84-74 for filing early application,"
    "text": "See section 5.03 of Rev. Proc. 84-74 for filing early application."
  },
  {
    "type": "Title",
@ -552,7 +533,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "11cb901986e9621aadbd76e6f7400809",
    "element_id": "12f877f0bd47f9b761ed7e74be1afacd",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -567,7 +548,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Note: If this form is being filed in accordance with Rey. Proc. 74-11, see Section G below."
    "text": "Note: /f this form is being filed in accordance with Rev. Proc. 74-11, see Section G below."
  },
  {
    "type": "Title",
@ -590,7 +571,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "8474975a0cd563b9feee81d0e540ffd3",
    "element_id": "d6fca66a91f1ac21a81f0123a2641176",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -605,7 +586,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "If your application is filed after the 180-day period, itis late. The application will be considered for processing only upon a showing of “good cause” and if it can be shown to the satisfaction of the Commissioner that granting you an extension will not jeopardize the Government's interests. For further information, see Rev. Proc. 79-63."
    "text": "If your application is filed after the 180-day period, it is late. The application will be considered for processing only upon a showing of “good cause\" and if it can be shown to the satisfaction of the Commissioner that granting you an extension will not jeopardize the Government's interests. For further information, see Rev. Proc. 79-63."
  },
  {
    "type": "Title",
@ -628,7 +609,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "ec3c2d03b846d2a186fc9a8f318f688b",
    "element_id": "8605ee209656c311cec7ce4b001caab2",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -643,11 +624,11 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Individuals. —An individual should enter his or her social security number in this block. If the application is made on behalf of a husband and wife who file their income tax return jointly, enter the social security numbers of both."
    "text": "Individuals.—An individual should enter his or her social security number in this block. If the application is made on behalf of a husband and wife who file their income tax return jointly, enter the social security numbers of both."
  },
  {
    "type": "NarrativeText",
    "element_id": "cf50ba29fb549766117d8ff1d099da44",
    "element_id": "7d82c5876c5c1a3596338ae8cfbd1a50",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -662,7 +643,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Others.-—The employer identification number applicant other than an individual should be entered in this block,"
    "text": "Others.-—The employer identification number applicant other than an individual should be entered in this block."
  },
  {
    "type": "Title",
@ -703,8 +684,8 @@
    "text": "of"
  },
  {
    "type": "NarrativeText",
    "element_id": "48cd565f152ff17bab8eba19eb23db34",
    "type": "Title",
    "element_id": "f1a73e2204a114077f988c9da98d7f8b",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -719,11 +700,11 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Individuals.—An individual desiring the change should sign the application. Ifthe application pertains to a husband and wife filing a joint Income tax return, the names of both should appear in the heading and both should"
    "text": "Signature"
  },
  {
    "type": "Title",
    "element_id": "0b6f395ca14ac202374d5cff678b7115",
    "type": "NarrativeText",
    "element_id": "dc1531183c8e3f45a78f110ec1efe15f",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -738,7 +719,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "sign"
    "text": "Individuals. —An individual desiring the change should sign the application. If the application pertains to a husband and wife filing a joint income tax return, the names of both should appear in the heading and both should sign."
  },
  {
    "type": "NarrativeText",
@ -761,7 +742,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "ee6a9bcef7e5e33bc26f419812e2c77a",
    "element_id": "9de285e8e3b042aa9ac86edde98a21a9",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -776,7 +757,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Corporations, cooperatives, and insurance companies.—The form should show the name of the corporation, cooperative, or insurance Company and the signature of the president, vice president, treasurer, assistant treasurer, or chief accounting officer (such as tax officer) authorized tosign, and his or her official title. Receivers, trustees, or assignees must sign any application they are required to file, For a subsidiary corporation filing a consolidated return with its parent, the form should be signed by an officer of the parent corporation,"
    "text": "Corporations, cooperatives, and insurance companies.—The form should show the name of the corporation, cooperative, or insurance company and the signature of the president, vice president, treasurer, assistant treasurer, or chief accounting officer (such as tax officer) authorized to sign, and his or her official title. Receivers, trustees, or assignees must sign any application they are required to file. For a subsidiary corporation filing a consolidated return with its parent, the form should be signed by an officer of the parent corporation."
  },
  {
    "type": "NarrativeText",
@ -799,7 +780,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "e3c8d21cabd10cc36b53107e58a5be8d",
    "element_id": "15388163cb5265b432737b06d98790e5",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -814,7 +795,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "name of the estate or trust and be signed by the fiduciary, personal representative, executor, executrix, administrator, administratrx, etc’, having legal authority to'sign, and his or her ttle."
    "text": "name of the estate or trust and be signed by the fiduciary, personal representative, executor, executrix, administrator, administratrix, etc., having legal authority to sign, and his or her title."
  },
  {
    "type": "NarrativeText",
@ -835,25 +816,6 @@
    },
    "text": "Preparer other than partner, officer, etc.—The signature of the individual preparing the application should appear in the space provided on page 6."
  },
  {
    "type": "NarrativeText",
    "element_id": "8200352b4e91b1be4f14e9248d50380a",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
        "version": 328871203465633719836776597535876541325,
        "record_locator": {
          "protocol": "abfs",
          "remote_file_path": "container1/IRS-form-1987.png"
        },
        "date_created": "2023-03-10T09:44:55+00:00",
        "date_modified": "2023-03-10T09:44:55+00:00"
      },
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Ifthe individual or firm is also authorized to represent the applicant before the IRS, receive copy of the requested ruling, or perform any other act(s), the power of attorney must reflect such authorization(s)."
  },
  {
    "type": "Title",
    "element_id": "ca978112ca1bbdcafac231b39a23dc4d",
@ -873,6 +835,25 @@
    },
    "text": "a"
  },
  {
    "type": "NarrativeText",
    "element_id": "12a24aabbcef2cabc07babe12d9c82c5",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
        "version": 328871203465633719836776597535876541325,
        "record_locator": {
          "protocol": "abfs",
          "remote_file_path": "container1/IRS-form-1987.png"
        },
        "date_created": "2023-03-10T09:44:55+00:00",
        "date_modified": "2023-03-10T09:44:55+00:00"
      },
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "If the individual or firm is also authorized to represent the applicant before the IRS, receive copy of the requested ruling, or perform any other act(s), the power of attorney must reflect such authorization(s)."
  },
  {
    "type": "Title",
    "element_id": "8b06cd6e2bf7fc15130d5d9ed7e66283",
@ -894,7 +875,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "762e2a39ed1a3ef5d3d4c83dd5dcc0e8",
    "element_id": "e5a2893bc6d1da570465f39fa3a8da15",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -909,7 +890,7 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Taxpayers that are members of an affiliated group filing a consolidated return that seeks to Change to the same accounting method for more than one member of the group must file a separate Form 3115 for each such member."
    "text": "Taxpayers that are members of an affiliated group filing a consolidated return that seeks to change to the same accounting method for more than one member of the group must file a separate Form 3115 for each such member."
  },
  {
    "type": "Title",
@ -951,7 +932,7 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "a6c53a8898025076b8c0397178f95fa3",
    "element_id": "b57b7502430c59194bb865cfa1bcfab5",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -966,11 +947,11 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Item 5a, page 1.—“Taxable income or (loss) from operations” is to be entered before application of any net operating loss deduction under section 172(a)"
    "text": "Item 5a, page 1.—“Taxable income or (loss) from operations” is to be entered before application of any net operating loss deduction under section 172(a)."
  },
  {
    "type": "NarrativeText",
    "element_id": "e9278d083996ccb1f39236b8064b28cd",
    "element_id": "1c43ecf3a4844d9537d924d892d7c546",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -985,11 +966,11 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Item 6, page 2.—The term “gross receipts” Includes total sales (net of returns and allowances) and all amounts received for services. in addition, gross receipts include any income from investments and from incidental or outside sources (e.g., interest, dividends, rents, royalties, and annuities). However, if you area resaler of personal property, exclude from gross receipts any amounts not derived in the ordinary course of a trade or business. Gross receipts do not include amounts received for sales taxes if, tunder the applicable state or local law, the taxis legally imposed on the purchaser of the good or service, and the taxpayer merely collects and remits the tax to the taxing authority."
    "text": "Item 6, page 2.—The term “gross receipts” includes total sales (net of returns and allowances) and all amounts received for services. In addition, gross receipts include any income from investments and from incidental or outside sources (e.g., interest, dividends, rents, royalties, and annuities). However, if you area resaler of personal property, exclude from gross receipts any amounts not derived in the ordinary course of a trade or business. Gross receipts do not include amounts received for sales taxes if, under the applicable state or local law, the tax is legatly imposed on the purchaser of the good or service, and the taxpayer merely collects and remits the tax to the taxing authority."
  },
  {
    "type": "NarrativeText",
    "element_id": "4b4424f821633ea87deab36702d4c113",
    "element_id": "372359d8718b28cc34e7a5f1fdd05213",
    "metadata": {
      "data_source": {
        "url": "abfs://container1/IRS-form-1987.png",
@ -1004,6 +985,6 @@
      "filetype": "image/png",
      "page_number": 1
    },
    "text": "Item 7b, page 2.—If item 7b 1s \"Yes,\" indicate ona separate sheet the following for each separate trade or business: Nature of business"
    "text": "Item 7b, page 2.—If item 7b 1s “Yes,” indicate ona separate sheet the following for each separate trade or business: Nature of business"
  }
]

@ -1,53 +1,63 @@
[
  {
    "type": "Title",
    "element_id": "5fc3b3d02c954fce8bdb8742665da14d",
    "element_id": "8e5839c2fb9b4d6b78cd2b1c1f5bed02",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "LayoutParser: A Unified Toolkit for DL-Based DIA 5,"
    "text": "LayoutParser: A Unified Toolkit for DL-Based DIA 5"
  },
  {
    "type": "FigureCaption",
    "element_id": "53522497f48d7f32acd862a28dee0253",
    "element_id": "f2c0641f368a9449a58ec35931e4ae81",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "‘Table 1: Current layout detection models in the LayoutParser model z00"
    "text": "Table 1: Current layout detection models in the LayoutParser model zoo"
  },
  {
    "type": "Table",
    "element_id": "5abf71b206592971d65a1e67d1767283",
    "element_id": "cbbf7b2ca7e0cda98760d1a4dfb657a8",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "Dataset | Bare Model! Large Mode! | Noter PablagNee [5] P/M M_|Tayous of moder scientie documents Rima [) « = | tagout of scanned modern magains and cee reports Newspaper [IT]| FP {Layout of ranned US nemepapers fom the 2h entry ‘Thbtebenk (i) | ‘able epon cn modern aciente and business document apace 1) |_P/M =| tagout of itary Inpanere documents"
    "text": "Dataset | Base Model\" Large Model | Notes PubLayNet [38] P/M M Layouts of modern scientific documents PRImA [3) M - Layouts of scanned modern magazines and scientific reports Newspaper [17] P - Layouts of scanned US newspapers from the 20th century ‘TableBank (18) P P Table region on modern scientific and business document HJDataset (31) | F/M - Layouts of history Japanese documents"
  },
  {
    "type": "UncategorizedText",
    "element_id": "d4735e3a265e16eee03f59718b9b5d03",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "2"
  },
  {
    "type": "FigureCaption",
    "element_id": "5ea2a1fafef2b04f62fc83deab5c6a5d",
    "element_id": "2ad24f5a5e3078b568fa6e9525e64f77",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "Sek ca pe ogee ile Se riot (ade ines ores Tsbooe 09, pect, One ca nin moose cet acct ie Ptr LOND Bi (7) and Mask ‘hay tn Hee tens Te platen etd sete it bem ee me"
    "text": "For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Faster R-CNN [28] (P) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has m Faster R-CNN model trained using the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months."
  },
  {
    "type": "NarrativeText",
    "element_id": "62253f3b9ad80b81d9fe3656d597ba21",
    "element_id": "10d504d8b13f993462c1fb2b259a637c",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "layout data structures, which are optimized for efficiency and versatility. 3) When necessary, users can employ existing or customized OCR models via the unified API provided in the OCR module. 4) LayoutParser comes with a set of utility fanctions for the visualization and storage of the layout data. 5) LayoutParser is also highly customizable, via its integration with functions for layout data annotation and model training We now provide detailed descriptions for each component."
    "text": "layout data structures, which are optimized for efficiency and versatility. 3) When necessary, users can employ existing or customized OCR models via the unified API provided in the OCR module. 4) LayoutParser comes with a set of utility functions for the visualization and stomge of the layout data. 5) LayoutParser is also highly customizable, via its integration with functions for layout data annotation and model training. We now provide detailed descriptions for each component."
  },
  {
    "type": "Title",
@ -61,102 +71,92 @@
  },
  {
    "type": "NarrativeText",
    "element_id": "f72c039d55d8062b540b8f075bf697fb",
    "element_id": "6308564d51ecc38117d56ca4d886e7b7",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "In LayoutParser, a layout model takes a document image as an input and generates a list of rectangular boxes for the target content regions. Different from traditional methods, it relies on deep convolutional neural networks rather than manually curated rules to identify content regions. It is formulated as an object detection problem and state-of-the-art’ models like Faster R-CNNY [28] and Mask R-CNN [12] are used. ‘This yields prediction results of high accuracy and makes it possible to build a concise, generalized interface for layout detection. LayoutParser, built upon Detectron? [38], provides a minimal API that can perform layout detection with only four lines of code in Python:"
    "text": "In LayoutParser, a layout model takes a document image as an input and generates a list of rectangular boxes for the target content regions. Different from traditional methods, it relies on deep convolutional neural networks rather than manually curated rules to identify content regions. It is formulated as an object detection problem and state-of-the-art models like Faster R-CNN [28] and Mask R-CNN [12] are used. This yields prediction results of high accuracy and makes it possible to build a concise, generalized interface for layout detection. LayoutParser, built upon Detectron2 [35], provides a minimal API that can perform layout detection with only four lines of code in Python:"
  },
  {
    "type": "ListItem",
    "element_id": "742f93af10c235d2612a2b85c7ce9294",
    "element_id": "ef6eafe86d89bc2e0af554aeab05a317",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "\\ import layoutparser as 1p"
  },
  {
    "type": "ListItem",
    "element_id": "79e0a3c1350f942562679b971915d272",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "2 image = cv2.imread(\"image_file\") # load images"
  },
  {
    "type": "UncategorizedText",
    "element_id": "32ebb1abcc1c601ceb9c4e3c4faba0ca",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "("
  },
  {
    "type": "ListItem",
    "element_id": "cd84964c612152f5362ee38fab9cad62",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "» model = 1p. Detectron2LayoutModel"
  },
  {
    "type": "ListItem",
    "element_id": "252ce669ee21ceee8521053e783bf50a",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "\"1p: //PubLayllet /faster_renn_t |-50_FPI_3x/config\")"
  },
  {
    "type": "UncategorizedText",
    "element_id": "4b227777d4dd1fc61c6f884f48641d02",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "4"
    "text": "import layoutparser as lp"
  },
  {
    "type": "Title",
    "element_id": "cfacfd3ec33b9608b59a343d05da204c",
    "element_id": "3aaabff05e0eb86a34fc7d72e421f2e0",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "detect"
    "text": "ea wwe"
  },
  {
    "type": "ListItem",
    "element_id": "ec23428744214fb4e7dd4d5d25939ae9",
    "element_id": "cbb879049df7bd737ffc487a61b05f70",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "layout = model. (image)"
    "text": "image = cv2.imread(\"image_file\") # load images"
  },
  {
    "type": "ListItem",
    "element_id": "61869c4b92bc306c9fa617681ce746f6",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "model = lp. Detectron2LayoutModel ("
  },
  {
    "type": "Title",
    "element_id": "ec4c180151f7ee4a36de276d970797e3",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "//PubLayNet/faster_rcnn_R_50_FPN_3x/config\")"
  },
  {
    "type": "ListItem",
    "element_id": "2941b46b1dc3845b1d5dd2856df4bb67",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "\"lp:"
  },
  {
    "type": "ListItem",
    "element_id": "d327c74e28b98f9a40394148e2ed8be7",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "layout = model.detect (image)"
  },
  {
    "type": "NarrativeText",
    "element_id": "9e7beafe373dc2fbff761d7997defec9",
    "element_id": "26b798107369c9c8a5e68a19f68cf7ec",
    "metadata": {
      "data_source": {},
      "filetype": "image/jpeg",
      "page_number": 1
    },
    "text": "LayoutParser provides a wealth of pre-trained model weights using various datasets covering different languages, time periods, and document types. Due to domain shift [7], the prediction performance can notably drop when models are ap- plied to target samples that are significantly different from the training dataset. As document structures and layouts vary greatly in different domains, itis important to select models trained on adataset similar to the test samples. A semantic syntax is used for initializing the model weights in LayoutParser, using both the dataset name and model name 1p://<dataset-nane>/<nodel-archi tecture-nane>."
    "text": "LayoutParser provides a wealth of pre-trained model weights using various datasets covering different languages, time periods, and document types. Due to domain shift [7], the prediction performance can notably drop when models are ap- plied to target samples that are significantly different from the training dataset. As document structures and layouts vary greatly in different domains, it is important to select models trained on a dataset similar to the test samples. A semantic syntax is used for initializing the model weights in Layout Parser, using both the dataset name and model name 1p://<dataset-name>/<model-architecture-name>."
  }
]

@ -396,7 +396,7 @@
      "data_source": {},
      "filetype": "application/pdf",
      "page_number": 5,
      "text_as_html": "<table><thead><th>Dataset</th><th>| Base Mod</th><th>el'| Large Model</th><th>| Notes</th></thead><tr><td>PubLayNet B8]|</td><td>F/M</td><td>M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA</td><td>M</td><td>-</td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper</td><td>F</td><td>-</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset</td><td>F/M</td><td>-</td><td>Layouts of history Japanese documents</td></tr></table>"
      "text_as_html": "<table><thead><th>Dataset</th><th>| Base Model'|</th><th>Notes</th></thead><tr><td>PubLayNet B8]|</td><td>F/M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA</td><td>M</td><td>nned modern magazines and scientific reports</td></tr><tr><td>Newspapei</td><td>F</td><td>canned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset</td><td>F/M</td><td>Layouts of history Japanese documents</td></tr></table>"
    },
    "text": "Base Model1 Large Model Notes Dataset PubLayNet [38] PRImA [3] Newspaper [17] TableBank [18] HJDataset [31] F / M M F F F / M M - - F - Layouts of modern scientific documents Layouts of scanned modern magazines and scientific reports Layouts of scanned US newspapers from the 20th century Table region on modern scientific and business document Layouts of history Japanese documents"
  },
@ -717,7 +717,7 @@
      "data_source": {},
      "filetype": "application/pdf",
      "page_number": 8,
      "text_as_html": "<table><thead><th>block.pad(top, bottom,</th><th>right,</th><th>left)</th><th>Enlarge the current block according to the input</th></thead><tr><td>block.scale(fx, fy)</td><td></td><td></td><td>Scale the current block given the ratio in x and y direction</td></tr><tr><td>block.shift(dx, dy)</td><td></td><td></td><td>Move the current block with the shift distances in x and y direction</td></tr><tr><td>block1.is_in(block2)</td><td></td><td></td><td>Whether block] is inside of block2</td></tr><tr><td>block1. intersect (block2)</td><td></td><td></td><td>Return the intersection region of blockl and block2. Coordinate type to be determined based on the inputs.</td></tr><tr><td>block1.union(block2)</td><td></td><td></td><td>Return the union region of blockl and block2. Coordinate type to be determined based on the inputs.</td></tr><tr><td>block1.relative_to(block2)</td><td></td><td></td><td>Convert the absolute coordinates of block to relative coordinates to block2</td></tr><tr><td>block1.condition_on(block2) block. crop_image (image)</td><td></td><td></td><td>Calculate the absolute coordinates of blockl given the canvas block2’s absolute coordinates Obtain the image segments in the block region</td></tr></table>"
      "text_as_html": "<table><thead><th>block.pad(top,</th><th>bottom,</th><th>right,</th><th>left)</th><th>Enlarge the current block according to the input</th></thead><tr><td>block.scale(fx, fy)</td><td></td><td></td><td></td><td>Scale the current block given the ratio ion in x and y di</td></tr><tr><td>block.shift(dx, dy)</td><td></td><td></td><td></td><td>Move the current block with the shift distances in x and y direction</td></tr><tr><td>block1.is_in(block2)</td><td></td><td></td><td></td><td>Whether block] is inside of block2</td></tr><tr><td>; block1. intersect</td><td>(block2)</td><td></td><td></td><td>Return the intersection region of block and block2. . . . Coordinate type to be determined based on the</td></tr><tr><td>; block1.union(block2)</td><td></td><td></td><td></td><td>Return the union region of block1 and block2. . . . Coordinate type to be determined based on the</td></tr><tr><td>block1.relative_to(block2)</td><td></td><td></td><td></td><td>Convert the absolute coordinates of block to ' ' relative coordinates to block2</td></tr><tr><td>. block1.condition_on(block2)</td><td></td><td></td><td></td><td>Calculate the absolute coordinates of block1 given . the canvas block2’s absolute coordinates</td></tr></table>"
    },
    "text": "block.pad(top, bottom, right, left) Enlarge the current block according to the input Scale the current block given the ratio in x and y direction block.scale(fx, fy) Move the current block with the shift distances in x and y direction block.shift(dx, dy) Whether block1 is inside of block2 block1.is in(block2) block1.intersect(block2) block1.union(block2) Convert the absolute coordinates of block1 to relative coordinates to block2 block1.relative to(block2) Calculate the absolute coordinates of block1 given the canvas block2’s absolute coordinates block1.condition on(block2)"
  },

@ -1 +1 @@
__version__ = "0.10.25-dev8" # pragma: no cover
__version__ = "0.10.25-dev9" # pragma: no cover

@ -1,9 +1,11 @@
import os
import tempfile
from copy import deepcopy
from typing import BinaryIO, List, Optional, Union, cast
from typing import BinaryIO, Dict, List, Optional, Union, cast

import cv2
import numpy as np
import pandas as pd
import pdf2image
import unstructured_pytesseract

@ -17,21 +19,31 @@ from unstructured_inference.inference.layoutelement import (
    LayoutElement,
    partition_groups_from_regions,
)
from unstructured_inference.models.tables import UnstructuredTableTransformerModel
from unstructured_pytesseract import Output

from unstructured.logger import logger
from unstructured.partition.utils.constants import SUBREGION_THRESHOLD_FOR_OCR, OCRMode
from unstructured.partition.utils.config import env_config
from unstructured.partition.utils.constants import (
    SUBREGION_THRESHOLD_FOR_OCR,
    TESSERACT_TEXT_HEIGHT,
    OCRMode,
)

# Force tesseract to be single threaded,
# otherwise we see major performance problems
if "OMP_THREAD_LIMIT" not in os.environ:
    os.environ["OMP_THREAD_LIMIT"] = "1"

# Define table_agent as a global variable
table_agent = None


def process_data_with_ocr(
    data: Union[bytes, BinaryIO],
    out_layout: "DocumentLayout",
    is_image: bool = False,
    infer_table_structure: bool = False,
    ocr_languages: str = "eng",
    ocr_mode: str = OCRMode.FULL_PAGE.value,
    pdf_image_dpi: int = 200,
@ -44,11 +56,13 @@ def process_data_with_ocr(
    - data (Union[bytes, BinaryIO]): The input file data,
      which can be either bytes or a BinaryIO object.

    - out_layout (DocumentLayout): The output layout from unstructured-inference.

    - is_image (bool, optional): Indicates if the input data is an image (True) or not (False).
      Defaults to False.

    - infer_table_structure (bool, optional): If true, extract the table content.

    - ocr_languages (str, optional): The languages for OCR processing. Defaults to "eng" (English).

    - ocr_mode (str, optional): The OCR processing mode, e.g., "entire_page" or "individual_blocks".
@ -68,6 +82,7 @@ def process_data_with_ocr(
            filename=tmp_file.name,
            out_layout=out_layout,
            is_image=is_image,
            infer_table_structure=infer_table_structure,
            ocr_languages=ocr_languages,
            ocr_mode=ocr_mode,
            pdf_image_dpi=pdf_image_dpi,
@ -79,13 +94,14 @@ def process_file_with_ocr(
    filename: str,
    out_layout: "DocumentLayout",
    is_image: bool = False,
    infer_table_structure: bool = False,
    ocr_languages: str = "eng",
    ocr_mode: str = OCRMode.FULL_PAGE.value,
    pdf_image_dpi: int = 200,
) -> "DocumentLayout":
    """
    Process OCR data from a given file and supplement the output DocumentLayout
    from unsturcutured0inference with ocr.
    from unstructured-inference with OCR.

    Parameters:
    - filename (str): The path to the input file, which can be an image or a PDF.
@ -95,6 +111,8 @@ def process_file_with_ocr(
    - is_image (bool, optional): Indicates if the input data is an image (True) or not (False).
      Defaults to False.

    - infer_table_structure (bool, optional): If true, extract the table content.

    - ocr_languages (str, optional): The languages for OCR processing. Defaults to "eng" (English).

    - ocr_mode (str, optional): The OCR processing mode, e.g., "entire_page" or "individual_blocks".
@ -118,6 +136,7 @@ def process_file_with_ocr(
            merged_page_layout = supplement_page_layout_with_ocr(
                out_layout.pages[i],
                image,
                infer_table_structure=infer_table_structure,
                ocr_languages=ocr_languages,
                ocr_mode=ocr_mode,
            )
@ -137,6 +156,7 @@ def process_file_with_ocr(
            merged_page_layout = supplement_page_layout_with_ocr(
                out_layout.pages[i],
                image,
                infer_table_structure=infer_table_structure,
                ocr_languages=ocr_languages,
                ocr_mode=ocr_mode,
            )
@ -152,6 +172,7 @@ def process_file_with_ocr(
def supplement_page_layout_with_ocr(
    page_layout: "PageLayout",
    image: PILImage,
    infer_table_structure: bool = False,
    ocr_languages: str = "eng",
    ocr_mode: str = OCRMode.FULL_PAGE.value,
) -> "PageLayout":
@ -162,30 +183,26 @@ def supplement_page_layout_with_ocr(
    If mode is "individual_blocks", we find the elements from PageLayout
    with no text and add text from OCR to each element.
    """
    entire_page_ocr = os.getenv("ENTIRE_PAGE_OCR", "tesseract").lower()
    # TODO(yuming): add tests for paddle with ENTIRE_PAGE_OCR env
    # see CORE-1886
    if entire_page_ocr not in ["paddle", "tesseract"]:
    ocr_agent = os.getenv("OCR_AGENT", "tesseract").lower()
    if ocr_agent not in ["paddle", "tesseract"]:
        raise ValueError(
            "Environment variable ENTIRE_PAGE_OCR",
            "Environment variable OCR_AGENT",
            " must be set to 'tesseract' or 'paddle'.",
        )

    elements = page_layout.elements
    ocr_layout = None
    if ocr_mode == OCRMode.FULL_PAGE.value:
        ocr_layout = get_ocr_layout_from_image(
            image,
            ocr_languages=ocr_languages,
            entire_page_ocr=entire_page_ocr,
            ocr_agent=ocr_agent,
        )
        merged_page_layout_elements = merge_out_layout_with_ocr_layout(
            elements,
            ocr_layout,
        page_layout.elements[:] = merge_out_layout_with_ocr_layout(
            out_layout=page_layout.elements,
            ocr_layout=ocr_layout,
        )
        elements[:] = merged_page_layout_elements
        return page_layout
    elif ocr_mode == OCRMode.INDIVIDUAL_BLOCKS.value:
        for element in elements:
        for element in page_layout.elements:
            if element.text == "":
                padded_element = pad_element_bboxes(element, padding=12)
                cropped_image = image.crop(
@ -196,19 +213,139 @@ def supplement_page_layout_with_ocr(
                        padded_element.bbox.y2,
                    ),
                )
                # Note(yuming): instead of getting OCR layout, we just need
                # the text extracted from OCR for individual elements
                text_from_ocr = get_ocr_text_from_image(
                    cropped_image,
                    ocr_languages=ocr_languages,
                    entire_page_ocr=entire_page_ocr,
                    ocr_agent=ocr_agent,
                )
                element.text = text_from_ocr
        return page_layout
    else:
        raise ValueError(
            "Invalid OCR mode. Parameter `ocr_mode` "
            "must be set to `entire_page` or `individual_blocks`.",
        )

    # Note(yuming): use the OCR data from entire page OCR for table extraction
    if infer_table_structure:
        table_agent = init_table_agent()
        if ocr_layout is None:
            # Note(yuming): ocr_layout is None for individual_blocks ocr_mode
            ocr_layout = get_ocr_layout_from_image(
                image,
                ocr_languages=ocr_languages,
                ocr_agent=ocr_agent,
            )
        page_layout.elements[:] = supplement_element_with_table_extraction(
            elements=page_layout.elements,
            ocr_layout=ocr_layout,
            image=image,
            table_agent=table_agent,
        )

    return page_layout
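
Editor's note: with `ENTIRE_PAGE_OCR` and `TABLE_OCR` folded into the single `OCR_AGENT` variable read above, switching the whole pipeline, tables included, to a different agent is a one-variable change. A usage sketch (hedged: it assumes `OCR_AGENT` is read at partition time, as the `os.getenv` call above suggests; any other value raises the `ValueError` above):

import os

os.environ["OCR_AGENT"] = "paddle"  # or "tesseract", the default

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    "example-docs/layout-parser-paper-fast.pdf",
    strategy="hi_res",
    infer_table_structure=True,  # table tokens come from the same full-page OCR pass
)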
|
||||
+
+
+def supplement_element_with_table_extraction(
+    elements: List[LayoutElement],
+    ocr_layout: List[TextRegion],
+    image: PILImage,
+    table_agent: "UnstructuredTableTransformerModel",
+) -> List[LayoutElement]:
+    """Supplement the existing layout with table extraction. Any Table elements
+    that are extracted will have a metadata field "text_as_html" where
+    the table's text content is rendered into an html string.
+    """
+    for element in elements:
+        if element.type == "Table":
+            padding = env_config.IMAGE_CROP_PAD
+            padded_element = pad_element_bboxes(element, padding=padding)
+            cropped_image = image.crop(
+                (
+                    padded_element.bbox.x1,
+                    padded_element.bbox.y1,
+                    padded_element.bbox.x2,
+                    padded_element.bbox.y2,
+                ),
+            )
+            table_tokens = get_table_tokens_per_element(
+                padded_element,
+                ocr_layout,
+            )
+            element.text_as_html = table_agent.predict(cropped_image, ocr_tokens=table_tokens)
+    return elements
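
`IMAGE_CROP_PAD` (defined in the new unstructured/partition/utils/config.py later in this diff) widens the crop handed to the table model. Toy numbers, assuming pad_element_bboxes pads all four sides symmetrically:

    # a padded table bbox gains 2 * pad per axis before cropping
    pad = 6  # e.g., IMAGE_CROP_PAD=6
    x1, y1, x2, y2 = 100, 200, 400, 260
    padded = (x1 - pad, y1 - pad, x2 + pad, y2 + pad)
    print(padded)                                # (94, 194, 406, 266)
    print((padded[2] - padded[0]) - (x2 - x1))   # 12 == 2 * pad
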
+
+
+def get_table_tokens_per_element(
+    table_element: LayoutElement,
+    ocr_layout: List[TextRegion],
+) -> List[Dict]:
+    """
+    Extract and prepare table tokens within the specified table element
+    based on the OCR layout of an entire image.
+
+    Parameters:
+    - table_element (LayoutElement): The table element for which table tokens
+      should be extracted. It typically represents the bounding box of the table.
+    - ocr_layout (List[TextRegion]): A list of TextRegion objects representing
+      the OCR layout of the entire image.
+
+    Returns:
+    - List[Dict]: A list of dictionaries, each containing information about a table
+      token within the specified table element. Each dictionary includes the
+      following fields:
+        - 'bbox': A list of four coordinates [x1, y1, x2, y2]
+          relative to the table element's bounding box.
+        - 'text': The text content of the table token.
+        - 'span_num': (Optional) The span number of the table token.
+        - 'line_num': (Optional) The line number of the table token.
+        - 'block_num': (Optional) The block number of the table token.
+    """
+    # TODO(yuming): update table_tokens from List[Dict] to List[TABLE_TOKEN]
+    # where TABLE_TOKEN will be a data class defined in unstructured-inference
+    table_tokens = []
+    for ocr_region in ocr_layout:
+        if ocr_region.bbox.is_in(table_element.bbox):
+            table_tokens.append(
+                {
+                    "bbox": [
+                        # token bounding box is relative to the table element
+                        ocr_region.bbox.x1 - table_element.bbox.x1,
+                        ocr_region.bbox.y1 - table_element.bbox.y1,
+                        ocr_region.bbox.x2 - table_element.bbox.x1,
+                        ocr_region.bbox.y2 - table_element.bbox.y1,
+                    ],
+                    "text": ocr_region.text,
+                },
+            )
+
+    # 'table_tokens' is a list of tokens that the table model expects in reading
+    # order; if no order is provided, fall back to the current token order
+    for idx, token in enumerate(table_tokens):
+        if "span_num" not in token:
+            token["span_num"] = idx
+        if "line_num" not in token:
+            token["line_num"] = 0
+        if "block_num" not in token:
+            token["block_num"] = 0
+
+    return table_tokens
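
A self-contained toy run of the filtering and re-basing above; `is_in` is approximated here as full containment, which is an assumption about its semantics:

    # one OCR word falls inside the table bbox, so it is kept and re-expressed
    # relative to the table's origin; the other word is dropped
    table_bbox = (50, 50, 350, 150)  # x1, y1, x2, y2 of the Table element
    ocr_words = [((60, 60, 120, 80), "Cats"), ((400, 60, 450, 80), "outside")]

    tokens = []
    for (x1, y1, x2, y2), text in ocr_words:
        inside = (x1 >= table_bbox[0] and y1 >= table_bbox[1]
                  and x2 <= table_bbox[2] and y2 <= table_bbox[3])
        if inside:
            tokens.append({
                "bbox": [x1 - table_bbox[0], y1 - table_bbox[1],
                         x2 - table_bbox[0], y2 - table_bbox[1]],
                "text": text,
            })
    for idx, token in enumerate(tokens):
        # setdefault mirrors the "if key not in token" fallbacks above
        token.setdefault("span_num", idx)
        token.setdefault("line_num", 0)
        token.setdefault("block_num", 0)
    print(tokens)  # [{'bbox': [10, 10, 70, 30], 'text': 'Cats', 'span_num': 0, ...}]
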
+
+
+def init_table_agent():
+    """Initialize a table agent from unstructured_inference as
+    a global variable to ensure that we only load it once."""
+
+    global table_agent
+
+    if table_agent is None:
+        table_agent = UnstructuredTableTransformerModel()
+        table_agent.initialize(model="microsoft/table-transformer-structure-recognition")
+
+    return table_agent
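
`init_table_agent` is a lazy module-level singleton: the transformer weights load on first use and every later call reuses the same instance. The same pattern, reduced to a runnable sketch with a placeholder loader:

    from typing import Optional

    _agent: Optional[object] = None

    def load_model() -> object:
        return object()  # stands in for the expensive model initialization

    def init_agent() -> object:
        global _agent
        if _agent is None:
            _agent = load_model()
        return _agent

    assert init_agent() is init_agent()  # repeated calls reuse the one instance
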
 
 
 def pad_element_bboxes(
     element: "LayoutElement",
@ -225,48 +362,93 @@ def pad_element_bboxes(
     return out_element
 
 
+def zoom_image(image: PILImage, zoom: float = 1) -> PILImage:
+    """scale an image based on the zoom factor using cv2; the scaled image is post-processed by
+    dilation then erosion to improve edge sharpness for OCR tasks"""
+    if zoom <= 0:
+        # no zoom but still does dilation and erosion
+        zoom = 1
+    new_image = cv2.resize(
+        cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR),
+        None,
+        fx=zoom,
+        fy=zoom,
+        interpolation=cv2.INTER_CUBIC,
+    )
+
+    kernel = np.ones((1, 1), np.uint8)
+    new_image = cv2.dilate(new_image, kernel, iterations=1)
+    new_image = cv2.erode(new_image, kernel, iterations=1)
+
+    return PILImage.fromarray(new_image)
 
 
 def get_ocr_layout_from_image(
     image: PILImage,
     ocr_languages: str = "eng",
-    entire_page_ocr: str = "tesseract",
+    ocr_agent: str = "tesseract",
 ) -> List[TextRegion]:
     """
     Get the OCR layout from image as a list of text regions with paddle or tesseract.
     """
-    if entire_page_ocr == "paddle":
-        logger.info("Processing entrie page OCR with paddle...")
+    if ocr_agent == "paddle":
+        logger.info("Processing OCR with paddle...")
         from unstructured.partition.utils.ocr_models import paddle_ocr
 
         # TODO(yuming): pass in language parameter once we
        # have the mapping for paddle lang code
         # see CORE-2034
         ocr_data = paddle_ocr.load_agent().ocr(np.array(image), cls=True)
         ocr_layout = parse_ocr_data_paddle(ocr_data)
     else:
-        ocr_data = unstructured_pytesseract.image_to_data(
+        logger.info("Processing OCR with tesseract...")
+        zoom = 1
+        ocr_df: pd.DataFrame = unstructured_pytesseract.image_to_data(
             np.array(image),
             lang=ocr_languages,
-            output_type=Output.DICT,
+            output_type=Output.DATAFRAME,
         )
-        ocr_layout = parse_ocr_data_tesseract(ocr_data)
+        ocr_df = ocr_df.dropna()
+
+        # tesseract performance degrades when the text height is out of the preferred zone so we
+        # zoom the image (in or out depending on estimated text height) for optimum OCR results
+        # but this needs to be evaluated based on actual use case as the optimum scaling also
+        # depends on the type of characters (font, language, etc); be careful about this
+        # functionality
+        text_height = ocr_df[TESSERACT_TEXT_HEIGHT].quantile(
+            env_config.TESSERACT_TEXT_HEIGHT_QUANTILE,
+        )
+        if (
+            text_height < env_config.TESSERACT_MIN_TEXT_HEIGHT
+            or text_height > env_config.TESSERACT_MAX_TEXT_HEIGHT
+        ):
+            # rounding avoids unnecessary precision and potential numerical issues associated
+            # with numbers very close to 1 inside cv2 image processing
+            zoom = np.round(env_config.TESSERACT_OPTIMUM_TEXT_HEIGHT / text_height, 1)
+            ocr_df = unstructured_pytesseract.image_to_data(
+                np.array(zoom_image(image, zoom)),
+                lang=ocr_languages,
+                output_type=Output.DATAFRAME,
+            )
+            ocr_df = ocr_df.dropna()
+
+        ocr_layout = parse_ocr_data_tesseract(ocr_df, zoom=zoom)
     return ocr_layout
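
Worked numbers for the adaptive zoom, using the defaults from the new config.py later in this diff (quantile 0.5, min 12, max 100, optimum 20):

    import numpy as np

    text_height = 8  # median word height (px) from the first tesseract pass
    if text_height < 12 or text_height > 100:
        zoom = np.round(20 / text_height, 1)  # -> 2.5, page upscaled before re-OCR
    else:
        zoom = 1
    print(zoom)
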
 
 
 def get_ocr_text_from_image(
     image: PILImage,
     ocr_languages: str = "eng",
-    entire_page_ocr: str = "tesseract",
+    ocr_agent: str = "tesseract",
 ) -> str:
     """
     Get the OCR text from image as a string with paddle or tesseract.
     """
-    if entire_page_ocr == "paddle":
+    if ocr_agent == "paddle":
         logger.info("Processing entire page OCR with paddle...")
         from unstructured.partition.utils.ocr_models import paddle_ocr
 
         # TODO(yuming): pass in language parameter once we
         # have the mapping for paddle lang code
         # see CORE-2034
         ocr_data = paddle_ocr.load_agent().ocr(np.array(image), cls=True)
         ocr_layout = parse_ocr_data_paddle(ocr_data)
         text_from_ocr = ""
@ -281,44 +463,45 @@ def get_ocr_text_from_image(
     return text_from_ocr
 
 
-def parse_ocr_data_tesseract(ocr_data: dict) -> List[TextRegion]:
+def parse_ocr_data_tesseract(ocr_data: pd.DataFrame, zoom: float = 1) -> List[TextRegion]:
     """
     Parse the OCR result data to extract a list of TextRegion objects from
     tesseract.
 
-    The function processes the OCR result dictionary, looking for bounding
+    The function processes the OCR result data frame, looking for bounding
     box information and associated text to create instances of the TextRegion
     class, which are then appended to a list.
 
     Parameters:
-    - ocr_data (dict): A dictionary containing the OCR result data, expected
-                       to have keys like "level", "left", "top", "width",
-                       "height", and "text".
+    - ocr_data (pd.DataFrame):
+        A Pandas DataFrame containing the OCR result data.
+        It should have columns like 'text', 'left', 'top', 'width', and 'height'.
+
+    - zoom (float, optional):
+        A zoom factor used to scale the bounding box coordinates back to the
+        original (unscaled) image. Default is 1.
 
     Returns:
-    - List[TextRegion]: A list of TextRegion objects, each representing a
-                        detected text region within the OCR-ed image.
+    - List[TextRegion]:
+        A list of TextRegion objects, each representing a detected text region
+        within the OCR-ed image.
 
     Note:
     - An empty string or a None value for the 'text' key in the input
-      dictionary will result in its associated bounding box being ignored.
+      data frame will result in its associated bounding box being ignored.
     """
-
-    levels = ocr_data["level"]
     text_regions = []
-    for i, level in enumerate(levels):
-        (l, t, w, h) = (
-            ocr_data["left"][i],
-            ocr_data["top"][i],
-            ocr_data["width"][i],
-            ocr_data["height"][i],
-        )
-        (x1, y1, x2, y2) = l, t, l + w, t + h
-        text = ocr_data["text"][i]
+    for idtx in ocr_data.itertuples():
+        text = idtx.text
         if not text:
             continue
         cleaned_text = text.strip()
         if cleaned_text:
+            x1 = idtx.left / zoom
+            y1 = idtx.top / zoom
+            x2 = (idtx.left + idtx.width) / zoom
+            y2 = (idtx.top + idtx.height) / zoom
             text_region = TextRegion.from_coords(x1, y1, x2, y2, text=text, source="OCR-tesseract")
             text_regions.append(text_region)
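
A toy frame in the shape of tesseract's DATAFRAME output, pushed through the same per-row arithmetic; with zoom=2.0 the pixel coordinates halve back to the original image's scale:

    import pandas as pd

    zoom = 2.0
    ocr_df = pd.DataFrame(
        {"left": [40], "top": [20], "width": [100], "height": [30], "text": ["hello"]}
    )
    for row in ocr_df.itertuples():
        x1, y1 = row.left / zoom, row.top / zoom
        x2, y2 = (row.left + row.width) / zoom, (row.top + row.height) / zoom
        print((x1, y1, x2, y2), row.text)  # (20.0, 10.0, 70.0, 25.0) hello
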
@ -377,7 +377,6 @@ def _partition_pdf_or_image_local(
         out_layout = process_file_with_model(
             filename,
             is_image=is_image,
-            extract_tables=infer_table_structure,
             model_name=model_name,
             pdf_image_dpi=pdf_image_dpi,
             **process_with_model_kwargs,
@ -390,6 +389,7 @@ def _partition_pdf_or_image_local(
             filename,
             out_layout,
             is_image=is_image,
+            infer_table_structure=infer_table_structure,
            ocr_languages=ocr_languages,
             ocr_mode=ocr_mode,
             pdf_image_dpi=pdf_image_dpi,
@ -398,7 +398,6 @@ def _partition_pdf_or_image_local(
         out_layout = process_data_with_model(
             file,
             is_image=is_image,
-            extract_tables=infer_table_structure,
             model_name=model_name,
             pdf_image_dpi=pdf_image_dpi,
             **process_with_model_kwargs,
@ -413,6 +412,7 @@ def _partition_pdf_or_image_local(
             file,
             out_layout,
             is_image=is_image,
+            infer_table_structure=infer_table_structure,
             ocr_languages=ocr_languages,
             ocr_mode=ocr_mode,
             pdf_image_dpi=pdf_image_dpi,
unstructured/partition/utils/config.py (new file, +65)
@ -0,0 +1,65 @@
"""
|
||||
This module contains variables that can permitted to be tweaked by the system environment. For
|
||||
example, model parameters that changes the output of an inference call. Constants do NOT belong in
|
||||
this module. Constants are values that are usually names for common options (e.g., color names) or
|
||||
settings that should not be altered without making a code change (e.g., definition of 1Gb of memory
|
||||
in bytes). Constants should go into `./constants.py`
|
||||
"""
|
||||
import os
|
||||
from dataclasses import dataclass
|
||||
|
||||
|
||||
@dataclass
|
||||
class ENVConfig:
|
||||
"""class for configuring enviorment parameters"""
|
||||
|
||||
def _get_string(self, var: str, default_value: str = "") -> str:
|
||||
"""attempt to get the value of var from the os environment; if not present return the
|
||||
default_value"""
|
||||
return os.environ.get(var, default_value)
|
||||
|
||||
def _get_int(self, var: str, default_value: int) -> int:
|
||||
if value := self._get_string(var):
|
||||
return int(value)
|
||||
return default_value
|
||||
|
||||
def _get_float(self, var: str, default_value: float) -> float:
|
||||
if value := self._get_string(var):
|
||||
return float(value)
|
||||
return default_value
|
||||
|
||||
@property
|
||||
def IMAGE_CROP_PAD(self) -> int:
|
||||
"""extra image content to add around an identified element region; measured in pixels"""
|
||||
return self._get_int("IMAGE_CROP_PAD", 0)
|
||||
|
||||
@property
|
||||
def TESSERACT_TEXT_HEIGHT_QUANTILE(self) -> float:
|
||||
"""the quantile to check for text height"""
|
||||
return self._get_float("TESSERACT_TEXT_HEIGHT_QUANTILE", 0.5)
|
||||
|
||||
@property
|
||||
def TESSERACT_MIN_TEXT_HEIGHT(self) -> int:
|
||||
"""minimum text height acceptable from tesseract OCR results
|
||||
|
||||
if estimated text height from tesseract OCR results is lower than this value the image is
|
||||
scaled up to be processed again
|
||||
"""
|
||||
return self._get_int("TESSERACT_MIN_TEXT_HEIGHT", 12)
|
||||
|
||||
@property
|
||||
def TESSERACT_MAX_TEXT_HEIGHT(self) -> int:
|
||||
"""maximum text height acceptable from tesseract OCR results
|
||||
|
||||
if estimated text height from tesseract OCR results is higher than this value the image is
|
||||
scaled down to be processed again
|
||||
"""
|
||||
return self._get_int("TESSERACT_MAX_TEXT_HEIGHT", 100)
|
||||
|
||||
@property
|
||||
def TESSERACT_OPTIMUM_TEXT_HEIGHT(self) -> int:
|
||||
"""optimum text height for tesseract OCR"""
|
||||
return self._get_int("TESSERACT_OPTIMUM_TEXT_HEIGHT", 20)
|
||||
|
||||
|
||||
env_config = ENVConfig()
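
A minimal sketch of reading these knobs; each property re-reads its environment variable on every access, so changes take effect without reloading the module:

    import os

    from unstructured.partition.utils.config import env_config

    print(env_config.TESSERACT_OPTIMUM_TEXT_HEIGHT)  # 20 (default)
    os.environ["TESSERACT_OPTIMUM_TEXT_HEIGHT"] = "24"
    print(env_config.TESSERACT_OPTIMUM_TEXT_HEIGHT)  # 24, re-read on this access
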
@ -13,3 +13,7 @@ SORT_MODE_DONT = "dont"
 
 SUBREGION_THRESHOLD_FOR_OCR = 0.5
 UNSTRUCTURED_INCLUDE_DEBUG_METADATA = os.getenv("UNSTRUCTURED_INCLUDE_DEBUG_METADATA", False)
+
+
+# this field is defined by pytesseract/unstructured.pytesseract
+TESSERACT_TEXT_HEIGHT = "height"