From dfa17bd3a0c476dce571b8b493dd2ff80ddaebc1 Mon Sep 17 00:00:00 2001
From: cragwolfe
Date: Fri, 4 Apr 2025 14:38:23 -0700
Subject: [PATCH] fix: hi_res PDF parsing: only uncategorized text for
extracted elements (#3975)
---
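Note: the behavior change described in the CHANGELOG entry can be illustrated with a small, self-contained sketch. This is a toy model under stated assumptions, not the code in `unstructured/partition/pdf.py`: `Region`, `looks_like_title`, `categorize`, and the `use_nlp_for_extracted` flag are hypothetical names, and the title heuristic is a stand-in for the NLP-based check. It shows only the rule that extracted text stays uncategorized while inferred regions keep the label assigned by object detection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    """Hypothetical stand-in for a piece of page content before it becomes an Element."""
    text: str
    source: str                  # "inferred" (object detection) or "extracted" (embedded PDF text)
    label: Optional[str] = None  # layout label from the detection model, if any


def looks_like_title(text: str) -> bool:
    """Toy replacement for an NLP title check: short, capitalized, no trailing punctuation."""
    words = text.split()
    return (
        0 < len(words) <= 8
        and text[:1].isupper()
        and not text.rstrip().endswith((".", ",", ":"))
    )


def categorize(region: Region, use_nlp_for_extracted: bool) -> str:
    """Return a category name for a region.

    Inferred regions keep their detection label. Extracted regions are
    run through the title heuristic only when use_nlp_for_extracted is True
    (the assumed old behavior); otherwise they stay uncategorized.
    """
    if region.source == "inferred" and region.label:
        return region.label
    if use_nlp_for_extracted and looks_like_title(region.text):
        return "Title"
    return "UncategorizedText"


regions = [
    Region("Data in Brief", "extracted"),
    Region("1. Introduction", "inferred", "Title"),
    Region("The dataset contains 60 problem instances.", "extracted"),
]

# Assumed old behavior: short extracted strings can be promoted to Title.
before = [categorize(r, use_nlp_for_extracted=True) for r in regions]
# Assumed new behavior: extracted text is never promoted.
after = [categorize(r, use_nlp_for_extracted=False) for r in regions]
```

In this sketch only the first region changes category between `before` and `after`, which mirrors the kind of Title-to-Text changes seen in the updated tests and fixtures.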
CHANGELOG.md | 3 +-
.../partition/pdf_image/test_pdf.py | 4 +-
test_unstructured/partition/test_msg.py | 2 +-
.../biomed-api/65/11/main.PMC6312790.pdf.html | 30 +++----
.../biomed-api/75/29/main.PMC6312793.pdf.html | 28 +++---
.../07/07/sbaa031.073.PMC7234218.pdf.html | 4 +-
.../recalibrating-risk-report.pdf.html | 86 +++++++++----------
.../layout-parser-paper-with-table.jpg.html | 4 +-
.../layout-parser-paper.pdf.html | 54 ++++++------
.../biomed-api/65/11/main.PMC6312790.pdf.json | 20 ++---
.../biomed-api/75/29/main.PMC6312793.pdf.json | 18 ++--
.../07/07/sbaa031.073.PMC7234218.pdf.json | 2 +-
.../recalibrating-risk-report.pdf.json | 44 +++++-----
.../layout-parser-paper-with-table.jpg.json | 2 +-
.../layout-parser-paper.pdf.json | 30 +++----
unstructured/__version__.py | 2 +-
unstructured/partition/pdf.py | 5 +-
17 files changed, 171 insertions(+), 167 deletions(-)
diff --git a/CHANGELOG.md b/CHANGELOG.md
index ad3afdfc3..baa69aae9 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,10 +1,11 @@
-## 0.17.6-dev0
+## 0.17.6-dev1
### Enhancements
### Features
### Fixes
+- **Do not use NLP to determine element types for extracted elements with hi_res.** This avoids extraneous Title elements in hi_res outputs. The change applies only to *extracted* elements, i.e. text objects found outside the regions identified by object detection (detected regions are mapped to *inferred* elements). *Extracted* and *inferred* elements are merged to form the list of `Element`s returned by `partition_pdf()`.
## 0.17.5
diff --git a/test_unstructured/partition/pdf_image/test_pdf.py b/test_unstructured/partition/pdf_image/test_pdf.py
index 6d1145eb8..7a0c8ff29 100644
--- a/test_unstructured/partition/pdf_image/test_pdf.py
+++ b/test_unstructured/partition/pdf_image/test_pdf.py
@@ -823,8 +823,8 @@ def test_partition_categorization_backup():
example_doc_path("pdf/layout-parser-paper-fast.pdf"),
strategy=PartitionStrategy.HI_RES,
)
- # Should have changed the element class from Text to Title
- assert isinstance(elements[0], Title)
+ # Should NOT have changed the element class from Text to Title
+ assert isinstance(elements[0], Text)
assert elements[0].text == text
diff --git a/test_unstructured/partition/test_msg.py b/test_unstructured/partition/test_msg.py
index d1d66876e..94b12d557 100644
--- a/test_unstructured/partition/test_msg.py
+++ b/test_unstructured/partition/test_msg.py
@@ -141,7 +141,7 @@ def test_partition_msg_can_process_attachments():
"Text",
"Text",
"Image",
- "Title",
+ "Text",
"Text",
"Title",
"Title",
diff --git a/test_unstructured_ingest/expected-structured-output-html/biomed-api/65/11/main.PMC6312790.pdf.html b/test_unstructured_ingest/expected-structured-output-html/biomed-api/65/11/main.PMC6312790.pdf.html
index a55cccdbb..210109c06 100644
--- a/test_unstructured_ingest/expected-structured-output-html/biomed-api/65/11/main.PMC6312790.pdf.html
+++ b/test_unstructured_ingest/expected-structured-output-html/biomed-api/65/11/main.PMC6312790.pdf.html
@@ -14,9 +14,9 @@
Contents lists available at ScienceDirect
-
+
Data in Brief
-
+
journal homepage: www.elsevier.com/locate/dib
@@ -28,19 +28,19 @@
Data on environmental sustainable corrosion inhibitor for stainless steel in aggressive environment
-
+
(Jee
-
+
Omotayo Sanni n, Abimbola Patricia I. Popoola
Department of Chemical, Metallurgical and Materials Engineering, Tshwane University of Technology, Pretoria, South Africa
-
+
a r t i c l e i n f o
-
+
a b s t r a c t
@@ -88,19 +88,19 @@
Value of the data
-
+
© Data presented here provide optimum conditions of waste material as inhibitor for stainless steel
Type 316 in 0.5 M H2SO4 medium. The given data describe the inhibitive performance of eco-friendly egg shell powder on austenitic stainless steel Type 316 corrosion in sulphuric acid environment.
-
+
© The data obtained for the inhibition of waste product (egg shell powder) on stainless steel Type 316
can be used as basis in determining the inhibitive performance of the same inhibitor in other environments.
-
+
© The data can be used to examine the relationship between the process variable as it affect the
@@ -152,9 +152,9 @@
Inhibitor be (V/dec) ba (V/dec) Ecorr (V) icorr (A/cm?) Polarization Corrosion concentration (g) resistance (Q) rate (mm/year) oO 0.0335 0.0409 —0.9393 0.0003 24.0910 2.8163 2 1.9460 0.0596 —0.8276 0.0002 121.440 1.5054 4 0.0163 0.2369 —0.8825 0.0001 42.121 0.9476 6 0.3233 0.0540 —0.8027 5.39E-05 373.180 0.4318 8 0.1240 0.0556 —0.5896 5.46E-05 305.650 0.3772 10 0.0382 0.0086 —0.5356 1.24E-05 246.080 0.0919
-
+
rate (mm/year)
-
+
The plot of inhibitor concentration over degree of surface coverage versus inhibitor concentration gives a straight line as shown in Fig. 5. The strong correlation reveals that egg shell adsorption on stainless surface in 0.5 M H2SO4 follow Langmuir adsorption isotherm. Figs. 6–8 show the SEM/EDX surface morphology analysis of stainless steel. Figs. 7 and 8 are the SEM/EDX images of the stainless steel specimens without and with inhibitor after weight loss experiment in sulphuric acid medium. The stainless steel surface corrosion product layer in the absence of inhibitor was porous and as a result gives no corrosion protection. With the presence of ES, corrosion damage was minimized, with an evidence of ES present on the metal surface as shown in Fig. 8.
@@ -232,12 +232,12 @@
The potentiodynamic polarization method was performed on the prepared test samples immersed in 0.5 M H2SO4 solution in the presence and absence of different ES concentrations. A three electrode system was used; stainless steel Type 316 plate as working electrode with an exposed area of 1.0 cm2, platinum rod as counter electrode and silver chloride electrode as reference electrode. The electrode was polished, degreased in acetone and thoroughly rinsed with distilled water before the experiment. Current density against applied potential was plotted. The slope of the linear part in anodic and cathodic plots gives anodic and cathodic constants according to the Stern–Geary equation, and the
-
+
ð2Þ
-
-
+
+
ð3Þ
-
+
diff --git a/test_unstructured_ingest/expected-structured-output-html/biomed-api/75/29/main.PMC6312793.pdf.html b/test_unstructured_ingest/expected-structured-output-html/biomed-api/75/29/main.PMC6312793.pdf.html
index bb95afd2b..aabc7233c 100644
--- a/test_unstructured_ingest/expected-structured-output-html/biomed-api/75/29/main.PMC6312793.pdf.html
+++ b/test_unstructured_ingest/expected-structured-output-html/biomed-api/75/29/main.PMC6312793.pdf.html
@@ -14,9 +14,9 @@
Contents lists available at ScienceDirect
-
+
Data in Brief
-
+
journal homepage: www.elsevier.com/locate/dib
@@ -28,9 +28,9 @@
A benchmark dataset for the multiple depot vehicle scheduling problem
-
+
(eee
-
+
Sarang Kulkarni a,b,c,n, Mohan Krishnamoorthy d,e, Abhiram Ranade f, Andreas T. Ernst c, Rahul Patil b
@@ -52,16 +52,16 @@
e School of Information Technology and Electrical Engineering, The University of Queensland, QLD 4072,
-
+
Australia
-
+
f Department of Computer Science and Engineering, IIT Bombay, Powai, Mumbai 400076, India
-
+
a r t i c l e i n f o
-
+
a b s t r a c t
@@ -106,13 +106,13 @@
© The data provide all the information that is required to model the MDVSP by using the existing mathematical formulations.
-
+
e All the problem instances are available for use without any restrictions.
e The benchmark solutions and solution time for the problem instances are presented in [3] and can be used for the comparison.
-
+
© The dataset includes a program that can generate similar problem instances of different sizes.
@@ -121,9 +121,9 @@
The dataset contains 60 different problem instances of the multiple depot vehicle scheduling pro- blem (MDVSP). Each problem instance is provided in a separate file. Each file is named as ‘RN-m-n-k.dat’, where ‘m’, ‘n’, and ‘k’ denote the number of depots, the number of trips, and the instance number for the size, ‘ðm;nÞ’, respectively. For example, the problem instance, ‘RN-8–1500-01.dat’, is the first problem instance with 8 depots and 1500 trips. For the number of depots, m, we used three values, 8,12, and 16. The four values for the number of trips, n, are 1500, 2000, 2500, and 3000. For each size, ðm;nÞ, five instances are provided. The dataset can be downloaded from https://orlib.uqcloud.net. For each problem instance, the following information is provided:
-
+
The number of depots mð
-
+
Þ,
@@ -187,9 +187,9 @@
Instance size (m, n) Average number of Locations Times Vehicles (8, 1500) 568.40 975.20 652.20 668,279.40 (8, 2000) 672.80 1048.00 857.20 1,195,844.80 (8, 2500) 923.40 1078.00 1082.40 1,866,175.20 (8, 3000) 977.00 1113.20 1272.80 2,705,617.00 (12, 1500) 566.00 994.00 642.00 674,191.00 (12, 2000) 732.60 1040.60 861.20 1,199,659.80 (12, 2500) 875.00 1081.00 1096.00 1,878,745.20 (12, 3000) 1119.60 1107.40 1286.20 2,711,180.40 (16, 1500) 581.80 985.40 667.80 673,585.80 (16, 2000) 778.00 1040.60 872.40 1,200,560.80 (16, 2500) 879.00 1083.20 1076.40 1,879,387.00 (16, 3000) 1087.20 1101.60 1284.60 2,684,983.60
-
+
Possible empty travels
-
+
diff --git a/test_unstructured_ingest/expected-structured-output-html/biomed-path/07/07/sbaa031.073.PMC7234218.pdf.html b/test_unstructured_ingest/expected-structured-output-html/biomed-path/07/07/sbaa031.073.PMC7234218.pdf.html
index 0862a71a2..eabce53c2 100644
--- a/test_unstructured_ingest/expected-structured-output-html/biomed-path/07/07/sbaa031.073.PMC7234218.pdf.html
+++ b/test_unstructured_ingest/expected-structured-output-html/biomed-path/07/07/sbaa031.073.PMC7234218.pdf.html
@@ -76,8 +76,8 @@
Camila Loureiro*1, Corsi-Zuelli Fabiana1, Fachim Helene Aparecida1, Shuhama Rosana1, Menezes Paulo Rossi1, Dalton Caroline F2,
-
+
AQ3
-
+