fix: correct PDF list item parsing (#1693)

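When combining list items, the current implementation pops the first
element of the page's element list instead of the last. This removes
unrelated elements from the beginning of the list and duplicates the
list items.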

---------

Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: yuming <305248291@qq.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
Inscore 2023-10-11 13:38:36 -07:00 committed by GitHub
parent 6acd06987b
commit 8ab40c20c1
6 changed files with 116 additions and 112 deletions


@@ -1,4 +1,4 @@
## 0.10.22-dev0
## 0.10.22-dev1
### Enhancements
@@ -6,6 +6,9 @@
### Fixes
* **Fixes PDF list parsing creating duplicate list items** Previously, a bug in PDF list item parsing removed other elements from the start of the page and duplicated the list items.
## 0.10.21
* **Adds Scarf analytics**.


@@ -910,6 +910,7 @@ def test_combine_numbered_list(filename):
first_list_element = element
break
assert len(elements) < 28
assert len([element for element in elements if isinstance(element, ListItem)]) == 4
assert first_list_element.text.endswith(
"character recognition, and other DIA tasks (Section 3)",
)


@@ -562,6 +562,17 @@
},
"text": "All times are in minutes and integers. The planning duration is from 5 a.m. to around midnight. Each instance has two classes of trips, short trips and long trips, with 40% short trips and 60% long trips. The duration of a short trip is less than a total of 45 min and the travel time between the start"
},
{
"type": "UncategorizedText",
"element_id": "86b700fab5db37977a73700b53a0654b",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 3,
"links": []
},
"text": "486"
},
{
"type": "NarrativeText",
"element_id": "0a1b09ff562f4d063703cbf021ee297f",
@@ -683,17 +694,6 @@
},
"text": "1. Each schedule should start and end at the same depot. 2. Each trip should be covered by only one vehicle. 3. The number of schedules that start from a depot should not exceed the number of vehicles at the depot."
},
{
"type": "ListItem",
"element_id": "2d6b506bd58a7dd7bbf1c8599ef630c8",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 3,
"links": []
},
"text": "1. Each schedule should start and end at the same depot. 2. Each trip should be covered by only one vehicle. 3. The number of schedules that start from a depot should not exceed the number of vehicles at the depot."
},
{
"type": "NarrativeText",
"element_id": "4fa30384f002f9a1d85b03ebdb0c8143",


@@ -33,15 +33,26 @@
"text": "2023 JAN"
},
{
"type": "ListItem",
"element_id": "c4e0168ffab999611a92e8ebd8fe48a9",
"type": "Title",
"element_id": "85e4ff3addb38328ecc08ec49759def7",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 2,
"links": []
},
"text": "The balance of risks remains tilted to the downside, but adverse risks have moderated since the October 2022 WEO. On the upside, a stronger boost from pent-up demand in numerous economies or a faster fall in inflation are plausible. On the downside, severe health outcomes in China could hold back the recovery, Russias war in Ukraine could escalate, and tighter global financing conditions could worsen debt distress. Financial markets could also suddenly reprice in response to adverse inflation news, while further geopolitical fragmentation could hamper economic progress."
"text": "Inflation Peaking amid Low Growth"
},
{
"type": "ListItem",
"element_id": "f1d5f4ed63a14db581e985bf15416cdd",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 2,
"links": []
},
"text": "Global growth is projected to fall from an estimated 3.4 percent in 2022 to 2.9 percent in 2023, then rise to 3.1 percent in 2024. The forecast for 2023 is 0.2 percentage point higher than predicted in the October 2022 World Economic Outlook (WEO) but below the historical (200019) average of 3.8 percent. The rise in central bank rates to fight inflation and Russias war in Ukraine continue to weigh on economic activity. The rapid spread of COVID-19 in China dampened growth in 2022, but the recent reopening has paved the way for a faster-than-expected recovery. Global inflation is expected to fall from 8.8 percent in 2022 to 6.6 percent in 2023 and 4.3 percent in 2024, still above pre-pandemic (201719) levels of about 3.5 percent."
},
{
"type": "ListItem",
@@ -65,17 +76,6 @@
},
"text": "In most economies, amid the cost-of-living crisis, the priority remains achieving sustained disinflation. With tighter monetary conditions and lower growth potentially affecting financial and debt stability, it is necessary to deploy macroprudential tools and strengthen debt restructuring frameworks. Accelerating COVID-19 vaccinations in China would safeguard the recovery, with positive cross-border spillovers. Fiscal support should be better targeted at those most affected by elevated food and energy prices, and broad-based fiscal relief measures should be withdrawn. Stronger multilateral cooperation is essential to preserve the gains from the rules-based multilateral system and to mitigate climate change by limiting emissions and raising green investment."
},
{
"type": "ListItem",
"element_id": "5e9b501fc056965a744f6598d022f31d",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 2,
"links": []
},
"text": "In most economies, amid the cost-of-living crisis, the priority remains achieving sustained disinflation. With tighter monetary conditions and lower growth potentially affecting financial and debt stability, it is necessary to deploy macroprudential tools and strengthen debt restructuring frameworks. Accelerating COVID-19 vaccinations in China would safeguard the recovery, with positive cross-border spillovers. Fiscal support should be better targeted at those most affected by elevated food and energy prices, and broad-based fiscal relief measures should be withdrawn. Stronger multilateral cooperation is essential to preserve the gains from the rules-based multilateral system and to mitigate climate change by limiting emissions and raising green investment."
},
{
"type": "NarrativeText",
"element_id": "968162aa6cdc3927ef2b11bb03cdeb45",
@@ -527,6 +527,28 @@
},
"text": "In the United States, growth is projected to fall from 2.0 percent in 2022 to 1.4 percent in 2023 and 1.0 percent in 2024. With growth rebounding in the second half of 2024, growth in 2024 will be faster than in 2023 on a fourth-quarter-over-fourth-quarter basis, as in most advanced"
},
{
"type": "NarrativeText",
"element_id": "70f05b9620aa1b7236058898e7e59192",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 5,
"links": []
},
"text": "economies. There is a 0.4 percentage point upward revision for annual growth in 2023, reflecting carryover effects from domestic demand resilience in 2022, but a 0.2 percentage point downward revision of growth in 2024 due to the steeper path of Federal Reserve rate hikes, to a peak of about 5.1 percent in 2023."
},
{
"type": "ListItem",
"element_id": "fd6c549473e196512c076844988f465c",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 5,
"links": []
},
"text": "Growth in the euro area is projected to bottom out at 0.7 percent in 2023 before rising to 1.6 percent in 2024. The 0.2 percentage point upward revision to the forecast for 2023 reflects the effects of faster rate hikes by the European Central Bank and eroding real incomes, offset by the carryover from the 2022 outturn, lower wholesale energy prices, and additional announcements of fiscal purchasing power support in the form of energy price controls and cash transfers."
},
{
"type": "ListItem",
"element_id": "3be6554964c172468cceaee89294f59d",
@@ -549,17 +571,6 @@
},
"text": "Growth in Japan is projected to rise to 1.8 percent in 2023, with continued monetary and fiscal policy support. High corporate profits from a depreciated yen and earlier delays in implementing previous projects will support business investment. In 2024, growth is expected to decline to 0.9 percent as the effects of past stimulus dissipate."
},
{
"type": "ListItem",
"element_id": "b24771387a5318eeda21adaa49629186",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 5,
"links": []
},
"text": "Growth in Japan is projected to rise to 1.8 percent in 2023, with continued monetary and fiscal policy support. High corporate profits from a depreciated yen and earlier delays in implementing previous projects will support business investment. In 2024, growth is expected to decline to 0.9 percent as the effects of past stimulus dissipate."
},
{
"type": "NarrativeText",
"element_id": "ad7ee60befc68a0200bb75d81828b2d2",
@@ -582,17 +593,6 @@
},
"text": "Growth in emerging and developing Asia is expected to rise in 2023 and 2024 to 5.3 percent and 5.2 percent, respectively, after the deeper-than-expected slowdown in 2022 to 4.3 percent attributable to Chinas economy. Chinas real GDP slowdown in the fourth quarter of 2022 implies a 0.2 percentage point downgrade for 2022 growth to 3.0 percent—the first time in more than 40 years with Chinas growth below the global average. Growth in China is projected to rise to 5.2 percent in 2023, reflecting rapidly improving mobility, and to fall to 4.5 percent in 2024 before settling at below 4 percent over the medium term amid declining business dynamism and slow progress on structural reforms. Growth in India is set to decline from 6.8 percent in 2022 to 6.1 percent in 2023 before picking up to 6.8 percent in 2024, with resilient domestic demand despite external headwinds. Growth in the ASEAN-5 countries (Indonesia, Malaysia, Philippines, Singapore, Thailand) is similarly projected to slow to 4.3 percent in 2023 and then pick up to 4.7 percent in 2024."
},
{
"type": "ListItem",
"element_id": "2ba41350ae3c684802f0e2b785c2d11b",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 5,
"links": []
},
"text": "Growth in emerging and developing Asia is expected to rise in 2023 and 2024 to 5.3 percent and 5.2 percent, respectively, after the deeper-than-expected slowdown in 2022 to 4.3 percent attributable to Chinas economy. Chinas real GDP slowdown in the fourth quarter of 2022 implies a 0.2 percentage point downgrade for 2022 growth to 3.0 percent—the first time in more than 40 years with Chinas growth below the global average. Growth in China is projected to rise to 5.2 percent in 2023, reflecting rapidly improving mobility, and to fall to 4.5 percent in 2024 before settling at below 4 percent over the medium term amid declining business dynamism and slow progress on structural reforms. Growth in India is set to decline from 6.8 percent in 2022 to 6.1 percent in 2023 before picking up to 6.8 percent in 2024, with resilient domestic demand despite external headwinds. Growth in the ASEAN-5 countries (Indonesia, Malaysia, Philippines, Singapore, Thailand) is similarly projected to slow to 4.3 percent in 2023 and then pick up to 4.7 percent in 2024."
},
{
"type": "ListItem",
"element_id": "afde979c99a73646915fe253c85c5a9c",
@@ -681,6 +681,17 @@
},
"text": "The balance of risks to the global outlook remains tilted to the downside, with scope for lower growth and higher inflation, but adverse risks have moderated since the October 2022 World Economic Outlook."
},
{
"type": "UncategorizedText",
"element_id": "8f81c653cbf1334344d3063cb9f4de04",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 7,
"links": []
},
"text": "Table 1. Overview of the World Economic Outlook Projections (Percent change, unless noted otherwise)"
},
{
"type": "Title",
"element_id": "d11a1c04bd3a9891350b4bd94104df58",
@@ -1738,15 +1749,37 @@
"text": "Pent-up demand boost: Fueled by the stock of excess private savings from the pandemic fiscal support and, in many cases, still-tight labor markets and solid wage growth, pent-up demand remains an upside risk to the growth outlook. In some advanced economies, recent data show that households are still on net adding to their stock of excess savings (as in some euro area countries and the United Kingdom) or have ample savings left (as in the United States). This leaves scope for a further boost to consumption—particularly of services, including tourism."
},
{
"type": "ListItem",
"element_id": "cf20f95904c591b6ac4ccd5d43fa8a98",
"type": "NarrativeText",
"element_id": "d379a79a55cecddeed62b21eb6a0ff00",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 7,
"page_number": 8,
"links": []
},
"text": "Pent-up demand boost: Fueled by the stock of excess private savings from the pandemic fiscal support and, in many cases, still-tight labor markets and solid wage growth, pent-up demand remains an upside risk to the growth outlook. In some advanced economies, recent data show that households are still on net adding to their stock of excess savings (as in some euro area countries and the United Kingdom) or have ample savings left (as in the United States). This leaves scope for a further boost to consumption—particularly of services, including tourism."
"text": "However, the boost to demand could stoke core inflation, leading to even tighter monetary policies and a stronger-than-expected slowdown later on. Pent-up demand could also fuel a stronger rebound in China."
},
{
"type": "ListItem",
"element_id": "2bbe57e6c291db638d3fcddca9e0199a",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 8,
"links": []
},
"text": "Faster disinflation: An easing in labor market pressures in some advanced economies due to falling vacancies could cool wage inflation without necessarily increasing unemployment. A sharp fall in the prices of goods, as consumers shift back to services, could further push down inflation. Such developments could imply a “softer” landing with less monetary tightening."
},
{
"type": "NarrativeText",
"element_id": "a2f806b25a06969405637298b4c85139",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 8,
"links": []
},
"text": "Downside risks—Numerous downside risks continue to weigh on the global outlook, lowering growth while, in a number of cases, adding further to inflation:"
},
{
"type": "ListItem",
@@ -1759,28 +1792,6 @@
},
"text": "Chinas recovery stalling: Amid still-low population immunity levels and insufficient hospital capacity, especially outside the major urban areas, significant health consequences could hamper the recovery. A deepening crisis in the real estate market remains a major source of vulnerability, with risks of widespread defaults by developers and resulting financial sector instability. Spillovers to the rest of the world would operate primarily through lower demand and potentially renewed supply chain problems."
},
{
"type": "ListItem",
"element_id": "90a90e12a4c6b8b74d3c8d20a76f22dc",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 8,
"links": []
},
"text": "Chinas recovery stalling: Amid still-low population immunity levels and insufficient hospital capacity, especially outside the major urban areas, significant health consequences could hamper the recovery. A deepening crisis in the real estate market remains a major source of vulnerability, with risks of widespread defaults by developers and resulting financial sector instability. Spillovers to the rest of the world would operate primarily through lower demand and potentially renewed supply chain problems."
},
{
"type": "ListItem",
"element_id": "42ac57e394bf7c98d908745cefce0b80",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 8,
"links": []
},
"text": "War in Ukraine escalating: An escalation of the war in Ukraine remains a major source of vulnerability, particularly for Europe and lower-income countries. Europe is facing lower-than- anticipated gas prices, having stored enough gas to make shortages unlikely this winter. However, refilling storage with much-diminished Russian flows will be challenging ahead of next winter, particularly if it is a very cold one and Chinas energy demand picks up, causing price spikes. A possible increase in food prices from a failed extension of the Black Sea grain initiative would put further pressure on lower-income countries that are experiencing food insecurity and have limited budgetary room to cushion the impact on households and businesses. With elevated food and fuel prices, social unrest may increase."
},
{
"type": "ListItem",
"element_id": "42ac57e394bf7c98d908745cefce0b80",
@@ -1836,17 +1847,6 @@
},
"text": "Geopolitical fragmentation: The war in Ukraine and the related international sanctions aimed at  pressuring Russia to end hostilities are splitting the world economy into blocs and reinforcing earlier geopolitical tensions, such as those associated with the US-China trade dispute."
},
{
"type": "ListItem",
"element_id": "75bd22ee0ba778cc3a616ed0a9b42292",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 8,
"links": []
},
"text": "Geopolitical fragmentation: The war in Ukraine and the related international sanctions aimed at  pressuring Russia to end hostilities are splitting the world economy into blocs and reinforcing earlier geopolitical tensions, such as those associated with the US-China trade dispute."
},
{
"type": "NarrativeText",
"element_id": "35514c59b45fbe18e13b3072e41ec0d4",
@@ -1925,15 +1925,37 @@
"text": "1 See “Geo-Economic Fragmentation and the Future of Multilateralism,” IMF Staff Discussion Note 2023/001."
},
{
"type": "ListItem",
"element_id": "bd7674df887463bc9f05c8030a151dea",
"type": "NarrativeText",
"element_id": "1344e770221822b381fb428d9390a446",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 10,
"links": []
},
"text": "Restraining the pandemic: Global coordination is needed to resolve bottlenecks in the global distribution of vaccines and treatments. Public support for the development of new vaccine technologies and the design of systematic responses to future epidemics also remains essential.  Addressing debt distress: Progress has been made for countries that requested debt treatment under the Group of Twentys Common Framework initiative, and more will be needed to strengthen it. It is also necessary to agree on mechanisms to resolve debt in a broader set of economies, including middle-income countries that are not eligible under the Common Framework. Non Paris Club and private creditors have a crucial role to play in ensuring coordinated, effective, and timely debt resolution processes."
"text": "controls. The temporary and broad-based measures are becoming increasingly costly and should be withdrawn and replaced by targeted approaches. Preserving the energy price signal will encourage a reduction in energy consumption and limit the risks of shortages. Targeting can be achieved through social safety nets such as cash transfers to eligible households based on income or demographics or by transfers through electricity companies based on past energy consumption. Subsidies should be temporary and offset by revenue-generating measures, including one-time solidarity taxes on high- income households and companies, where appropriate."
},
{
"type": "NarrativeText",
"element_id": "5f63f2b3388c5c9f2ab22f4136d4196d",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 10,
"links": []
},
"text": "Reinforcing supply: Supply-side policies could address the key structural factors impeding growth— including market power, rent seeking, rigid regulation and planning, and inefficient education—and could help build resilience, reduce bottlenecks, and alleviate price pressures. A concerted push for investment along the supply chain of green energy technologies would bolster energy security and help advance progress on the green transition."
},
{
"type": "NarrativeText",
"element_id": "c64f29a38dae74989484539db014364f",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 10,
"links": []
},
"text": "Strengthening multilateral cooperation—Urgent action is needed to limit the risks stemming from geopolitical fragmentation and to ensure cooperation on fundamental areas of common interest:"
},
{
"type": "ListItem",
@@ -1957,17 +1979,6 @@
},
"text": "Strengthening global trade: Strengthening the global trading system would address risks associated with trade fragmentation. This can be achieved by rolling back restrictions on food exports and other essential items such as medicine, upgrading World Trade Organization (WTO) rules in critical areas such as agricultural and industrial subsidies, concluding and implementing new WTO-based agreements, and fully restoring the WTO dispute settlement system."
},
{
"type": "ListItem",
"element_id": "af6eef18ec41f4980c1a4cbb5b7d4fec",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 10,
"links": []
},
"text": "Strengthening global trade: Strengthening the global trading system would address risks associated with trade fragmentation. This can be achieved by rolling back restrictions on food exports and other essential items such as medicine, upgrading World Trade Organization (WTO) rules in critical areas such as agricultural and industrial subsidies, concluding and implementing new WTO-based agreements, and fully restoring the WTO dispute settlement system."
},
{
"type": "ListItem",
"element_id": "d6f6afcf055ed3084a0fac1093458c88",
@@ -1990,17 +2001,6 @@
},
"text": "Speeding the green transition: To meet governments climate change goals, it is necessary to swiftly implement credible mitigation policies. International coordination on carbon pricing or equivalent policies would facilitate faster decarbonization. Global cooperation is needed to build resilience to climate shocks, including through aid to vulnerable countries."
},
{
"type": "ListItem",
"element_id": "089c5759e7030e34a3b537d9e20bcd13",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 10,
"links": []
},
"text": "Speeding the green transition: To meet governments climate change goals, it is necessary to swiftly implement credible mitigation policies. International coordination on carbon pricing or equivalent policies would facilitate faster decarbonization. Global cooperation is needed to build resilience to climate shocks, including through aid to vulnerable countries."
},
{
"type": "NarrativeText",
"element_id": "14187a5be9e3a125267bfe10e6c67fae",
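
Most of the duplicated entries removed from the expected-output fixtures above repeat the `text` of the element right before them. A rough check for that pattern might look like the sketch below. `repeated_text_indexes` and the sample records are hypothetical; the records only mimic the `type` and `text` fields shown in the fixtures and are not taken from the test suite.

```python
from typing import Dict, List


def repeated_text_indexes(elements: List[Dict[str, str]]) -> List[int]:
    """Return indexes of elements whose text equals the previous element's text."""
    return [
        index
        for index in range(1, len(elements))
        if elements[index]["text"] == elements[index - 1]["text"]
    ]


# Hypothetical records shaped like the fixture entries (only "type" and "text").
sample = [
    {"type": "NarrativeText", "text": "1. Each schedule should start and end at the same depot."},
    {"type": "ListItem", "text": "1. Each schedule should start and end at the same depot."},
    {"type": "NarrativeText", "text": "Unrelated paragraph."},
]

print(repeated_text_indexes(sample))
```

The comparison deliberately ignores `type` and `element_id`, since the removed duplicates in the fixtures sometimes differ from their neighbor in both.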


@@ -1 +1 @@
__version__ = "0.10.22-dev0" # pragma: no cover
__version__ = "0.10.22-dev1" # pragma: no cover


@@ -633,7 +633,7 @@ def _process_pdfminer_pages(
system=coordinate_system,
)
page_element = list_page_element
updated_page_elements.pop(0)
updated_page_elements.pop()
updated_page_elements.append(page_element)
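
The effect of swapping `pop(0)` for `pop()` can be illustrated with a minimal, self-contained sketch. This is not the library's code: `Element`, `make_page`, and `combine` are hypothetical stand-ins, and the sketch assumes that a continuation fragment is merged in place into the list item right before it, after which that item is re-appended.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for a parsed PDF element."""

    kind: str
    text: str


def make_page() -> List[Element]:
    """Build a fresh toy page; combine() mutates elements in place."""
    return [
        Element("Title", "Header"),
        Element("ListItem", "1. Start"),
        Element("continuation", "of item"),
    ]


def combine(elements: List[Element], pop_from_front: bool) -> List[Element]:
    """Merge each continuation fragment into the list item right before it."""
    updated: List[Element] = []
    for element in elements:
        if element.kind == "continuation" and updated and updated[-1].kind == "ListItem":
            list_item = updated[-1]
            list_item.text += " " + element.text  # merge in place (assumed behavior)
            if pop_from_front:
                updated.pop(0)  # old call: drops the page's first element
            else:
                updated.pop()  # fixed call: drops the stale list item entry
            updated.append(list_item)
        else:
            updated.append(element)
    return updated


buggy = combine(make_page(), pop_from_front=True)
fixed = combine(make_page(), pop_from_front=False)

print([(e.kind, e.text) for e in buggy])
print([(e.kind, e.text) for e in fixed])
```

With `pop_from_front=True` the title disappears and the merged list item appears twice, which matches the symptoms described in the commit message. With `pop_from_front=False` the title survives and the list item appears once.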