fix chunking text: 'NoneType' object has no attribute 'strip' (#4018)

### Summary
This fixes the error
`Error in chunk: 512: {"detail":"'NoneType' object has no attribute 'strip'"}`.
I found the logs under the same org, so it is reasonable to assume they
come from the same job.

screenshot:
![Screenshot 2025-06-11 at 10 15 57 AM](https://github.com/user-attachments/assets/c50ada55-eef1-43f7-9e27-9b9ae339a6fb)

stack trace from the `utic-api` ES log doc:
![Screenshot 2025-06-11 at 2 01 01 PM](https://github.com/user-attachments/assets/7e84fa24-4eb6-45e8-b195-a11d3d124bfa)
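
A minimal sketch of the failure mode and one defensive guard, for illustration only. `FakeElement` and `joined_text` are hypothetical names, not the library's classes or the actual patch; the sketch only assumes that an element's text can be `None` and that the chunker calls `.strip()` on it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeElement:
    """Hypothetical stand-in for a document element (not the library's class)."""

    text: Optional[str]


def joined_text(elements: List[FakeElement], separator: str = "\n\n") -> str:
    """Join element text, treating None as empty so .strip() never runs on None."""
    parts = []
    for element in elements:
        # `element.text or ""` maps None to "", so .strip() is always called on a str.
        stripped = (element.text or "").strip()
        if stripped:
            parts.append(stripped)
    return separator.join(parts)
```

Without the `or ""` guard, `FakeElement(None).text.strip()` raises the same `AttributeError` reported in the summary.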



### Notes
Longer term, we should make the partitioner (vlm + utic-api) stop
returning elements whose text is null.
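
One way the longer-term idea could look, as a sketch under stated assumptions: the partitioner response is assumed to be a list of JSON-like dicts with a `"text"` key, and `normalize_text_fields` is a hypothetical helper, not an existing function in either service.

```python
from typing import Any, Dict, List


def normalize_text_fields(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the element dicts with a missing or None "text" set to ""."""
    normalized = []
    for element in elements:
        fixed = dict(element)  # shallow copy so the caller's payload is not mutated
        if fixed.get("text") is None:
            fixed["text"] = ""
        normalized.append(fixed)
    return normalized
```

Normalizing at the service boundary would let downstream code such as the chunker assume that text is always a string.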

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
Yuming Long 2025-06-12 11:28:46 -07:00 committed by GitHub
parent ec209c6b5f
commit 55ad5fd637
GPG Key ID: B5690EEEBB952194
13 changed files with 132 additions and 108 deletions

View File

@ -133,7 +133,7 @@ jobs:
- name: Test
env:
UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
TESSERACT_VERSION : "5.4.1"
TESSERACT_VERSION : "5.5.1"
run: |
source .venv/bin/activate
sudo apt-get update

View File

@ -1,10 +1,11 @@
## 0.17.11-dev0
## 0.17.11-dev1
### Enhancements
### Features
### Fixes
- Fix chunking for elements whose text is None, which previously raised AttributeError: 'NoneType' object has no attribute 'strip'.
- Invalid element IDs are not visible in VLM output. The parent-child hierarchy is now retrieved based on the unstructured element ID instead of the id injected into the element's HTML code.
## 0.17.10

View File

@ -416,6 +416,20 @@ class DescribePreChunk:
)
assert pre_chunk._text == "hello"
    def it_can_chunk_elements_with_none_text_without_error(self):
        """Regression test for AttributeError when Image elements have None text."""
        pre_chunk = PreChunk(
            [Image(None), Text("hello world"), Image(None)],
            overlap_prefix="",
            opts=ChunkingOptions(),
        )
        # Should not raise AttributeError when generating chunks
        chunks = list(pre_chunk.iter_chunks())
        assert len(chunks) == 1
        assert chunks[0].text == "hello world"
@pytest.mark.parametrize(
("max_characters", "combine_text_under_n_chars", "expected_value"),
[
@ -1026,6 +1040,15 @@ class Describe_TableChunker:
# -- computation is only on first call, all chunks get exactly the same orig-elements --
assert table_chunker._orig_elements is orig_elements
    def it_handles_table_with_none_text_without_error(self):
        """Regression test for AttributeError when Table elements have None text."""
        table = Table(None)  # Table with None text
        # Should not raise AttributeError and should produce no chunks
        chunks = list(_TableChunker.iter_chunks(table, "", ChunkingOptions()))
        assert len(chunks) == 0
# ================================================================================================
# HTML SPLITTERS

View File

@ -7,8 +7,8 @@
</title>
</head>
<body>
<h1 class="Title" id="33d8fd813310ae3e74efd7e17fef99df">
a Department of the Treasury Internal Revenue Service Instructions for Form 3115 (Rev. November 1987) Application for Change in Accounting Method
<h1 class="Title" id="9c3a63df0fa9649fd2065ebcc4922e18">
gai) Department of the Treasury Internal Revenue Service Instructions for Form 3115 (Rev. November 1987) Application for Change in Accounting Method
</h1>
<p class="NarrativeText" id="5801c515b515aadfb7717e4c36a4cea4">
(Section references are to the Internal Revenue Code unless otherwise noted.)
@ -28,8 +28,8 @@
<p class="NarrativeText" id="8753b1907d0b40b882489a68baf3fe2c">
File this form to request a change in your accounting method, including the accounting treatment of any item. If you are requesting a change in accounting period, use Form 1128, Application for Change in Accounting Period. For more information, see Publication 538, Accounting Periods and Methods.
</p>
<p class="NarrativeText" id="0cf9161971e9ea8feec111ff7d24f403">
When filing Form 3115, taxpayers are reminded to determine if IRS has published a ruling or procedure dealing with the specific type of change since November 1987 (the current revision date of Form 3115),
<p class="NarrativeText" id="7b5365f4534832bac87e1df792cf5b16">
When filing Form 3115, taxpayers are reminded to determine if IRS has published a ruling or procedure dealing with the specific type of change since November 1987 (the current revision date of Form 3115).
</p>
<p class="NarrativeText" id="0fb8eb24db1b27f6f8b69213e3dd9b41">
Long-term contracts. —If you are required to change your method of accounting for long-term contracts under section 460, see Notice 87-61 (9/21/87), 1987-38 IRB 40, for the notification procedures that must be followed.
@ -37,20 +37,20 @@
<p class="NarrativeText" id="7282f497b067ed1e34176cc85d46ea8e">
Other methods.—Unless the Service has published a regulation or procedure to the contrary, all other changes !n accounting methods required by the Act are automatically considered to be approved by the Commissioner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method for bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving credit plan (Act section 812); (3) the Inclusion of income attributable to the sale or furnishing of utility services no later than the year In which the services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not file Form 3115 for these changes.
</p>
<p class="NarrativeText" id="61f76478266283c91988a108081fc02e">
Generally, applicants must complete Section A. In addition, complete the appropriate sections (B-1 through H) for which a change Is desired.
<p class="NarrativeText" id="9218e8a34790d23be418f5c4ffaaf54c">
Generally, applicants must complete Section A. \n addition, complete the appropriate sections (B-1 through H) for which a change Is desired.
</p>
<p class="NarrativeText" id="b8f9f1fdeffadd34472959092459fba9">
You must give all relevant facts, including a detailed description of your present and proposed methods. You must also state the reason(s) you believe approval to make the requested change should be granted. Attach additional pages if more space is needed for explanations. Each page should show your name, address, and identifying number.
</p>
<p class="NarrativeText" id="6055008a5485b687b614551c78a89c6e">
State whether you desire a conference in the National Office if the Service proposes to disapprove your application.
<p class="NarrativeText" id="b7ac9f40a0b010ca0f9a6dedba12a95c">
State whether you desire a conference In the National Office if the Service proposes to disapprove your application.
</p>
<h1 class="Title" id="45da2e5561453f7cdfcf31c1ace13cf0">
Changes to Accounting Methods Required Under the Tax Reform Act of 1986
</h1>
<p class="NarrativeText" id="9256e7591256b6799035172da259b839">
Uniform capitalization rules and limitation on cash method.—If you are required to change your method of accounting under section,263A (relating to the capitalization and inclusion in inventory costs of certain expenses) or 448 (limiting the use of the cash method of accounting by certain taxpayers) as added by the Tax Reform Act of 1986 (“Act”), the change 1s treated as initiated by the taxpayer, approved by the Commissioner, and the period for taking the adjustments under section 481(a) into account will not exceed 4 years. (Hospitals required to change from the cash method under section 448 have 10 years to take the adjustrnents into account.) Complete Section A and the appropriate sections (B-1 or C and D) for which the change is required.
<p class="NarrativeText" id="0476fb3d546e315ae90c733259812973">
Uniform capitalization rules and limitation on cash method.—If you are required to change your method of accounting under section,263A (relating to the capitalization and inclusion in inventory costs of certain expenses) or 448 (limiting the use of the cash method of accounting by certain taxpayers) as added by the Tax Reform Act of 1986 (“Act”), the change is treated as initiated by the taxpayer, approved by the Commissioner, and the period for taking the adjustments under section 481(a) into account will not exceed 4 years. (Hospitals required to change from the cash method under section 448 have 10 years to take the adjustrnents into account.) Complete Section A and the appropriate sections (B-1 or C and D) for which the change is required.
</p>
<p class="NarrativeText" id="9951e8eac8f909df08655f3bc100a586">
Disregard the instructions under Time and Place for Filing and Late Applications. Instead, attach Form 3115 to your income tax return for the year of change; do not file it separately. Also include on a separate statement accompanying the Form 3115 the period over which the section 481(a) adjustment will be taken into account and the basis for that conclusion. Identify the automatic change being made at the top of page 1 of Form 3115 (e.g., “Automatic Change to Accrual Method—Section 448"). See Temporary Regulations sections 1.263A-1T and 1.448-1T for additional information.
@ -76,8 +76,8 @@
<h1 class="Title" id="9bac1c8a91f637da3c6114d95239ceee">
Late Applications
</h1>
<p class="NarrativeText" id="c92c7f4def0263141b370bf307d6bcc0">
If your application is filed after the 180-day period, it is late. The application will be considered for processing only upon a showing of “good cause” and if it can be shown to the satisfaction of the Commissioner that granting you an extension will not jeopardize the Government's interests. For further information, see Rev, Proc. 79-63.
<p class="NarrativeText" id="adad72fa6ed1f3d66351440221c1ad23">
If your application is filed after the 180-day period, it 1s late. The application will be considered for processing only upon a showing of “good cause” and if it can be shown to the satisfaction of the Commissioner that granting you an extension will not jeopardize the Government's interests. For further information, see Rev, Proc. 79-63.
</p>
<h1 class="Title" id="569b780f1a01b3fe19031adfd2ff6567">
Identifying Number
@ -118,8 +118,8 @@
<h1 class="Title" id="5a646ca8e56ece623a47079b32e62fc6">
Specific Instructions
</h1>
<h1 class="Title" id="e0e692b1f478333e3950f8cb2483a484">
Section A
<h1 class="Title" id="1505240fbe441adc4acdbc867689af29">
SectionA
</h1>
<p class="NarrativeText" id="43c45bb43eaf69131bf2392df1239ef2">
Item 5a, page 1.—“Taxable income or (loss) from operations” is to be entered before application of any net operating loss deduction under section 172(a).
@ -166,8 +166,8 @@
<p class="NarrativeText" id="454de5bfbdcba4385a21dd6261c57d53">
The limitation on the use of the cash method (except for tax shelters) does not apply to—
</p>
<p class="NarrativeText" id="fc1f0d4d56acd27a18ba80ab0acfb9e9">
(1) Farming businesses.—F or this purpose, the term “farming business” 1s defined in section 263A(e)(4), but it also includes the raising, harvesting, or growing of trees to which section 263A(c)(5) applies. Notwithstanding this exception, section 447 requires certain C corporations and partnerships with a C corporation as a partner to use the accrual method.
<p class="NarrativeText" id="d268b0c2840319e1b229673523368cae">
(1) Farming businesses.—For this purpose, the term “farming business” 1s defined in section 263A(e)(4), but it also includes the raising, harvesting, or growing of trees to which section 263A(c)(5) applies. Notwithstanding this exception, section 447 requires certain C corporations and partnerships with a C corporation as a partner to use the accrual method.
</p>
<p class="NarrativeText" id="51dcb59cd362d0003f609fdb43fbdfdc">
(2) Qualified personal service corporations. — A “qualified personal service corporation” is any corporation: (a) substantially all of the activities of which involve the performance of services in the fields of health, law, engineering, architecture, accounting, actuarial science, performing arts, or consulting, and (b)
@ -178,8 +178,8 @@
<p class="NarrativeText" id="e4776aaec9edf7383c95941623c47ff6">
substantially all of the stock of which is owned by employees performing the services, retired employees who had performed the services, any estate of any individual who had performed the services listed above, or any person who acquired stock of the corporation as a result of the death of an employee or retiree described above if the acquisition occurred within 2 years of death.
</p>
<p class="NarrativeText" id="5f5c402f9ebefef3ba8eabf1b5f628b2">
(3) Entities with gross receipts of $5,000,000 or less. —To qualify for this exception, the C corporation's or partnerships annual average gross receipts for the three years ending with the prior tax year may not exceed $5,000,000. If the corporation or partnership was not in existence for the entire 3-year period, the period of existence is used to determine whether the corporation or partnership qualifies. If any tax year in the 3-year period is a short tax year, the corporation or partnership must annualize the gross receipts by multiplying the gross receipts by 12 and dividing the result by the number of months in the short period.
<p class="NarrativeText" id="02eb85f4c80a008b9e03744e68528aff">
(3) Entities with gross receipts of $5,000,000 or less. —To qualify for this exception, the C corporation's or partnerships annual average gross receipts for the three years ending with the prior tax year may not exceed $5,000,000. If the corporation or partnership was not in existence for the entire 3-year period, the period of existence is used to determine whether the corporation or partnership qualifies. If any tax year in the 3-year period is a short tax year, the corporation or partnership must annualize the gross receipts by multiplying the gross receipts by 12 and dividing the result by the number of months tn the short period.
</p>
<p class="NarrativeText" id="427e5fe33c8c181ccb93c7de11946c13">
For more information, see section 448 and Temporary Regulations section 1.448-1T.

View File

@ -7,8 +7,8 @@
</title>
</head>
<body>
<h1 class="Title" id="9de8f65e7c38a2a2e1b0d0d8c3526808">
FM) Department of the Treasury Internal Revenue Service Instructions for Form 3115 (Rev. November 1987) Application for Change in Accounting Method
<h1 class="Title" id="6b126b0255ab0d12659889c3d523a5e8">
¥i9) Department of the Treasury Internal Revenue Service Instructions for Form 3115 (Rev. November 1987) Application for Change in Accounting Method
</h1>
<p class="NarrativeText" id="53c329edad597af665506d646581db18">
(Section references are to the Internal Revenue Code unless otherwise noted.)
@ -16,8 +16,8 @@
<h1 class="Title" id="dc71cf9b39b3b58cf3960cad7a5f390c">
Paperwork Reduction Act Notice
</h1>
<p class="NarrativeText" id="a77ac8b6adb5acf78d2eac45a4de69e5">
We ask for this information to carry out the Internal Revenue laws of the United States. We need it to ensure that taxpayers are complying with these laws and to allow us to figure and collect the right amount of tax. You are required to give us this information.
<p class="NarrativeText" id="90beb9a0f4b6984d9cfdf68096e114e4">
We ask for this information to carry out the Internal Revenue laws of the United States. We need it to ensure that taxpayers are complying with these laws ang to allow us to figure and collect the right amount of tax. You are required to give us this information.
</p>
<h1 class="Title" id="95f3b1224a44ca83b5aefea67a9fdde4">
General Instructions
@ -34,11 +34,11 @@
<p class="NarrativeText" id="4af565181db0676202636585f9abb438">
Long-term contracts. —If you are required to change your method of accounting for long-term contracts under section 460, see Notice 87-61 (9/21/87), 1987-38 IRB 40, for the notification procedures that must be followed.
</p>
<p class="NarrativeText" id="8dc3e4d18b3936db176790654f8823e1">
Other methods.—Unless the Service has published a regulation or procedure to the contrary, all other changes 1n accounting methods required by the Act are automatically considered to be approved by the Commissioner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method for bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving credit plan (Act section 812); (3) the inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not file Form 3115 for these changes.
<p class="NarrativeText" id="746483e119190b6ce718ce4715bee6e6">
Other methods.—Unless the Service has published a regulation or procedure to the contrary, all other changes in accounting methods required by the Act are automatically considered to be approved by the Commissioner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method for bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving credit plan (Act section 812); (3) the inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not file Form 3115 for these changes.
</p>
<p class="NarrativeText" id="85f1dcb7770743b979ee143d2e2aff19">
Generally, applicants must complete Section A. In addition, complete the appropriate sections (B-1 through H) for which a change ts desired.
<p class="NarrativeText" id="fdaf06392be067a41cac854c07a66033">
Generally, applicants must complete Section A. In addition, complete the appropriate sections (B-1 through H) for which a change is desired.
</p>
<p class="NarrativeText" id="6fbfaf2f668ea8e5a161f2f08ec5c002">
You must give all relevant facts, including a detailed description of your present and proposed methods. You must also state the reason(s) you believe approval to make the requested change should be granted. Attach additional pages if more space is needed for explanations. Each page should show your name, address, and identifying number.
@ -49,8 +49,8 @@
<h1 class="Title" id="4d2011ddb75aecb442fab45c276032ef">
Changes to Accounting Methods Required Under the Tax Reform Act of 1986
</h1>
<p class="NarrativeText" id="5b2139cd0640cd4eceddbce416a17f6f">
Uniform capitalization rules and limitation on cash method.—If you are required to change your method of accounting under sectior,263A (relating to the capitalization and inclusion in inventory costs of certain expenses) or 448 (limiting the use of the cash method of accounting by certain taxpayers) as added by the Tax Reform Act of 1986 (“Act”), the change is treated as initiated by the taxpayer, approved by the Commissioner, and the period for taking the adjustments under section 481(a) into account will not exceed 4 years. (Hospitals required to change from the cash method under section 448 have 10 years to take the adjustments into account.) Complete Section A and the appropriate sections (B-1 or C and D) for which the change is required.
<p class="NarrativeText" id="1d7c9cb0ba025f28eb4d035bb6447d52">
Uniform capitalization rules and limitation on cash method.—f you are required to change your method of accounting under sectior,263A (relating to the capitalization and inclusion in inventory costs of certain expenses) or 448 (limiting the use of the cash method of accounting by certain taxpayers) as added by the Tax Reform Act of 1986 (“Act”), the change is treated as initiated by the taxpayer, approved by the Commissioner, and the period for taking the adjustments under section 481(a) into account will not exceed 4 years. (Hospitals required to change from the cash method under section 448 have 10 years to take the adjustments into account.) Complete Section A and the appropriate sections (B-1 or C and D) for which the change is required.
</p>
<p class="NarrativeText" id="525b9d3bf3ae575f8e86f62af6068ebd">
Disregard the instructions under Time and Place for Filing and Late Applications. Instead, attach Form 3115 to your income tax return for the year of change; do not file it separately. Also include on a separate statement accompanying the Form 3115 the period over which the section 481(a) adjustment will be taken into account and the basis for that conclusion. Identify the automatic change being made at the top of page 1 of Form 3115 (e.g., “Automatic Change to Accrual Method Section 448"). See Temporary Regulations sections 1.263A-1T and 1.448-1T for additional information.
@ -61,8 +61,8 @@
<p class="NarrativeText" id="ae8e74a1d77625ba73dd01fe4dc0cdea">
Generally, applicants must file this form within the first 180 days of the tax year in which it is desired to make the change.
</p>
<p class="NarrativeText" id="05e444867dea72fa51c18a796551305f">
Taxpayers, other than exempt organizations, should file Form 3115 with the Commissioner of Internal Revenue, Attention: CC:C:4, 1111 Constitution Avenue, NW, Washington, DC 20224. Exempt organizations should file with the Assistant Commissioner (Employee Plans and Exempt Organizations), 1111 Constitution Avenue, NW, Washington, DC 20224.
<p class="NarrativeText" id="fd3e2689051b08dfefd978b6fe03a251">
Taxpayers, other than exempt organizations, should file Form 3115 with the Commissioner of Internal Revenue, Attention: CC:C:4, 1111 Constitution Avenue, NW, Washington, OC 20224. Exempt organizations should file with the Assistant Commissioner (Employee Plans and Exempt Organizations), 1111 Constitution Avenue, NW, Washington, DC 20224.
</p>
<p class="NarrativeText" id="09f4d2c426aaa217278d83c17a4bf21e">
You should normally receive an acknowledgment of receipt of your application within 30 days. If you do not hear from IRS within 30 days of submitting your completed Form 3115, you may inquire as to the receipt of your application by writing to: Control Clerk, CC:C:4, Internal Revenue Service, Room 5040, 1111 Constitution Avenue, NW, Washington, DC 20224.
@ -76,14 +76,14 @@
<h1 class="Title" id="ceb4948527ce520e2ac219097e279559">
Late Applications
</h1>
<p class="NarrativeText" id="53204b2c819131895da7dba7fe978047">
If your application is filed after the 180-day period, it is late. The application will be considered for processing only upon a showing of “good cause" and if it can be shown to the satisfaction of the Commissioner that granting you an extension will not jeopardize the Government's interests. For further information, see Rev. Proc. 79-63.
<p class="NarrativeText" id="adda4424f1f7ffb84390b6b8c60ac3bd">
If your application is filed after the 180-day period, it 1s late. The application will be considered for processing only upon a showing of “good cause” and if it can be shown to the satisfaction of the Commissioner that granting you an extension will not jeopardize the Government's interests. For further information, see Rev. Proc. 79-63.
</p>
<h1 class="Title" id="2c598855a28fb70d3812979066df72c1">
Identifying Number
</h1>
<p class="NarrativeText" id="a41365af6ab3185637e8f3891b27fcba">
Individuals.—An individual should enter his or her social security number in this block. If the application is made on behalf of a husband and wife who file their income tax return jointly, enter the social security numbers of both.
<p class="NarrativeText" id="923c71f94a011e5def8896cc2aa7120e">
Individuals. —An individual should enter his or her social security number in this block. If the application is made on behalf of a husband and wife who file their income tax return jointly, enter the social security numbers of both.
</p>
<p class="NarrativeText" id="803549fa9207cd4111ed9e5d7389a027">
Others.-—The employer identification number of an applicant other than an individual should be entered in this block.

View File

@ -86,7 +86,7 @@
</tr>
<tr style="border: 1px solid black;">
<td style="border: 1px solid black;">
HIDataset [31]
HIDataset (31)
</td>
<td style="border: 1px solid black;">
P/M
@ -99,8 +99,8 @@
</tr>
</tbody>
</table>
<p class="FigureCaption" id="a0c3c6b7e1e8c95016b989ef43c5ea2e">
2 For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Faster R-CNN [28] (P) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has m Faster R-CNN model trained using the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months.
<p class="FigureCaption" id="3d1c1bf1eb6a87a874d21d8f11b226b1">
2 For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Fuster R-CNN [28] (P) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has m Faster R-CNN model trained using the ResNet 10] backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months.
</p>
<p class="NarrativeText" id="b68ca269882f83b03827b5edf0fec979">
layout data structures, which are optimized for efficiency and versatility. 3) When necessary, users can employ existing or customized OCR models via the unified API provided in the OCR module. 4) LayoutParser comes with a set of utility functions for the visualization and stomge of the layout data. 5) LayoutParser is also highly customizable, via its integration with functions for layout data annotation and model training. We now provide detailed descriptions for each component.

View File

@ -43,8 +43,8 @@
<h1 class="Title" id="d3be9e3d661e2a79f37257caa5b54d8c">
LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis
</h1>
<p class="NarrativeText" id="7cf062c1ba64938cc68c4fae61506d84">
Zejiang Shen! (4), Ruochen Zhang”, Melissa Dell?, Benjamin Charles Germain Lee*, Jacob Carlson, and Weining Li&gt;
<p class="NarrativeText" id="97c951d2dd3a1b5452d0c55e62e8ea78">
Zejiang Shen! (4), Ruochen Zhang”, Melissa Dell?, Benjamin Charles Germain Lee*, Jacob Carlson, and Weining Li®
</p>
<p class="NarrativeText" id="23b8def20ce16f929d4f558b2a19f200">
1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca
@ -139,7 +139,7 @@
<p class="UncategorizedText" id="92c4289ad4af7c0793e40d5662707e0a">
Z. Shen et al.
</p>
<img alt="Efficient Data Annotation Model Customization Document Images Community Platform a &gt;) ¥ DIA Model Hub i .) Customized Model Training] == | Layout Detection Models | ——= DIA Pipeline Sharing ~ OCR Module = { Layout Data stuctue ) = (storage Visualization VY" class="Image" id="642416e5d6c99219b16dbba6f72392c5"/>
<img alt="Efficient Data Annotation Model Customization Document Images Community Platform A &gt;) ¥ DIA Model Hub a Customized Model Training] == | Layout Detection Models | ——= DIA Pipeline Sharing ~ OCR Module = { Layout Data stuctue ) = (store Visualization LY" class="Image" id="285d83f3098b26302329b33637fd265f"/>
<p class="NarrativeText" id="466f0bc21599ccf0fa27c021cb023f90">
Fig.1: The overall architecture of LayoutParser. For an input document image, the core LayoutParser library provides a set of off-the-shelf tools for layout detection, OCR, visualization, and storage, backed by a carefully designed layout data structure. LayoutParser also supports high level customization via efficient layout annotation and model training functions. These improve model accuracy on the target samples. The community platform enables the easy sharing of DIA models and whole digitization pipelines to promote reusability and reproducibility. A collection of detailed documentation, tutorials and exemplar projects make LayoutParser easy to learn and use.
</p>
@ -266,7 +266,7 @@
<p class="UncategorizedText" id="710ac103981c6363195774b02ee582d4">
Z. Shen et al.
</p>
<img alt='- ° . 3 a a 4 a 3 oo er 2 § 8 a 8 3 3 £ 4 A g a 9 3 ¥ Coordinate g 4 5 3 + § 3 H Extra Features [O=") [Bo] eaing i Text | | Type | | ower ° &amp; a ¢ o [ coordinatel textblock1, 3 3 g Q 3 , textblock2 , layoutl ] 4 q ® A list of the layout elements Ff' class="Image" id="6eb2bb6ca50b3be177565f9ff546bce8"/>
<img alt="3 a a 4 a 3 Rectangle vada 4 8 4 iS v 2 [S) af : fa &amp; o a 6 g 4 Coordinate g 2 8 3 + 4 * v 8 Extra features =| 9%) | Hock) Reading é ret | | Type | | order 2 &amp; a ¢ @ [ coordinatel textblock1 , 8 » , ee 3 , textblock2 , layoutl ] 8 q ® A list of the layout elements sf" class="Image" id="fd2288e4e3cf90f109d1c1198cea4ca0"/>
<p class="FigureCaption" id="9f11aa6b22dea1bba7eb0d122c0c5562">
Fig.2: The relationship between the three types of layout data structures. Coordinate supports three kinds of variation; TextBlock consists of the co- ordinate information and extra features like block text, types, and reading orders; a Layout object is a list of all possible layout elements, including other Layout objects. They all support the same set of transformation and operation APIs for maximum flexibility.
</p>
@ -399,7 +399,7 @@
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
Convert the absolute coordinates of block to relative coordinates to block2
Convert the absolute coordinates of blockl to relative coordinates to block2
</td>
</tr>
<tr style="border: 1px solid black;">
@ -449,7 +449,7 @@
<li class="ListItem" id="c069937e6c2bfc0f856835f3af4d6181">
LayoutParser: A Unified Toolkit for DL-Based DIA
</li>
<img alt="x09 Burpunog uayor Aeydsiq 1 vondo 10g Guypunog usyoy apir:z uondo Mode I: Showing Layout on the Original Image Mode Il: Drawing OCR'd Text at the Correspoding Position" class="Image" id="f5450580bd9ae07f4cdf7c23a6ccaf41"/>
<img alt="0g Burpunog uayor Aeydsiq:1 vondo 10g Guypunog usyou ap:z uondo Mode I: Showing Layout on the Original Image Mode Il: Drawing OCR Text at the Correspoding Position" class="Image" id="02a078081972f7bdb26f06a787773a30"/>
<p class="NarrativeText" id="4d1b9566e792683b9559b778be4f4046">
Fig.3: Layout detection and OCR results visualization generated by the LayoutParser APIs. Mode I directly overlays the layout region bounding boxes and categories over the original image. Mode II recreates the original document via drawing the OCRd texts at their corresponding positions on the image canvas. In this figure, tokens in textual regions are filtered using the API and then displayed.
</p>
@ -465,7 +465,7 @@
<li class="ListItem" id="59c95b02b488f297417af4125e4ac316">
10 Z. Shen et al.
</li>
<img alt="Intra-column reading order Token Categories tie (Adress 2) tee (NE sumber Variable Column reading order HEE company type Column Categories (J tite Adress _] ree [7] Section Header Maximum Allowed Height (b) Illustration of the recreated document with dense text structure for better OCR performance" class="Image" id="6eb34afad9d568fbccde8ac8854dc24d"/>
<img alt="Intra-column reading order Token Categories tie (Adress tee Ewumber Variable Column reading order HEE company ype Column Categories (J tite Adress 1] ree [7] Section Header Maximum Allowed Height (b) Illustration of the recreated document with dense text structure for better OCR performance" class="Image" id="747f46c43a88768fd543e10bac84203b"/>
<p class="NarrativeText" id="9667b0e42f9d28607c7c13bffb760906">
Fig.4: Illustration of (a) the original historical Japanese document with layout detection results and (b) a recreated version of the document image that achieves much better character recognition recall. The reorganization algorithm rearranges the tokens based on the their detected bounding boxes given a maximum allowed height.
</p>
@ -502,7 +502,7 @@
<p class="NarrativeText" id="42551c9b40827dcdc52055b4d25c6fc3">
As shown in Figure 4 (a), the document contains columns of text written vertically 15, a common style in Japanese. Due to scanning noise and archaic printing technology, the columns can be skewed or have vari- able widths, and hence cannot be eas- ily identified via rule-based methods. Within each column, words are sepa- rated by white spaces of variable size, and the vertical positions of objects can be an indicator of their layout type.
</p>
<img alt="(spe peepee, Active Learning Layout Annotate Layout Dataset | + Annotation Toolkit ¥ a Deep Leaming Layout Model Training &amp; Inference, ¥ ; Handy Data Structures &amp; Post-processing El Apis for Layout Det a LAR ror tye eats) 4 Text Recognition | &lt;—— Default ane Customized ¥ ee Layout Structure Visualization &amp; Export | &lt;—— | visualization &amp; Storage The Japanese Document Helpful LayoutParser Digitization Pipeline Modules" class="Image" id="f48a844114951222f6c96331efc683fb"/>
<img alt="———————_+ (| Active Learning Layout Annotate Layout Dataset | + Annotation Toolkit ¥ alae Deep Leaming Layout Model Training &amp; Inference, ¥ ; Handy Data Structures &amp; Post-processing Ee apis for Layout Dat a Ae ror yon Oats 4 Text Recognition | &lt;—— Default ane Customized ¥ ee Layout Structure Visualization &amp; Export | &lt;—— | visualization &amp; Storage The Japanese Document Helpful LayoutParser Digitization Pipeline Modules" class="Image" id="2b90153124fb6f9e9f5539b9db75d240"/>
<p class="NarrativeText" id="80291b42f1785935496188bb52788288">
Fig.5: Illustration of how LayoutParser helps with the historical document digi- tization pipeline.
</p>
@ -536,7 +536,7 @@
<li class="ListItem" id="2b7101f39954d5301166b82906202ea9">
LayoutParser: A Unified Toolkit for DL-Based DIA
</li>
<img alt="(@) Partial table at the bottom (&amp;) Full page table (6) Partial table at the top (d) Mis-detected tet line" class="Image" id="d5c954ff619e348d36d5180feedabc6c"/>
<img alt="(@) Partial table at the bottom (6) Full page table (©) Partial table at the top (@) Mis-detected text line" class="Image" id="1359eaa601a24c083e143b8bf5114127"/>
<p class="FigureCaption" id="d35d253341e8b8d837f384ecd6ac410a">
Fig.6: This lightweight table detector can identify tables (outlined in red) and cells (shaded in blue) in different locations on a page. In very few cases (d), it might generate minor error predictions, e.g, failing to capture the top text line of a table.
</p>

@ -1,8 +1,8 @@
[
{
"type": "Title",
"element_id": "33d8fd813310ae3e74efd7e17fef99df",
"text": "a Department of the Treasury Internal Revenue Service Instructions for Form 3115 (Rev. November 1987) Application for Change in Accounting Method",
"element_id": "9c3a63df0fa9649fd2065ebcc4922e18",
"text": "gai) Department of the Treasury Internal Revenue Service Instructions for Form 3115 (Rev. November 1987) Application for Change in Accounting Method",
"metadata": {
"filetype": "application/pdf",
"languages": [
@ -155,8 +155,8 @@
},
{
"type": "NarrativeText",
"element_id": "0cf9161971e9ea8feec111ff7d24f403",
"text": "When filing Form 3115, taxpayers are reminded to determine if IRS has published a ruling or procedure dealing with the specific type of change since November 1987 (the current revision date of Form 3115),",
"element_id": "7b5365f4534832bac87e1df792cf5b16",
"text": "When filing Form 3115, taxpayers are reminded to determine if IRS has published a ruling or procedure dealing with the specific type of change since November 1987 (the current revision date of Form 3115).",
"metadata": {
"filetype": "application/pdf",
"languages": [
@ -221,8 +221,8 @@
},
{
"type": "NarrativeText",
"element_id": "61f76478266283c91988a108081fc02e",
"text": "Generally, applicants must complete Section A. In addition, complete the appropriate sections (B-1 through H) for which a change Is desired.",
"element_id": "9218e8a34790d23be418f5c4ffaaf54c",
"text": "Generally, applicants must complete Section A. \\n addition, complete the appropriate sections (B-1 through H) for which a change Is desired.",
"metadata": {
"filetype": "application/pdf",
"languages": [
@ -265,8 +265,8 @@
},
{
"type": "NarrativeText",
"element_id": "6055008a5485b687b614551c78a89c6e",
"text": "State whether you desire a conference in the National Office if the Service proposes to disapprove your application.",
"element_id": "b7ac9f40a0b010ca0f9a6dedba12a95c",
"text": "State whether you desire a conference In the National Office if the Service proposes to disapprove your application.",
"metadata": {
"filetype": "application/pdf",
"languages": [
@ -309,8 +309,8 @@
},
{
"type": "NarrativeText",
"element_id": "9256e7591256b6799035172da259b839",
"text": "Uniform capitalization rules and limitation on cash method.—If you are required to change your method of accounting under section,263A (relating to the capitalization and inclusion in inventory costs of certain expenses) or 448 (limiting the use of the cash method of accounting by certain taxpayers) as added by the Tax Reform Act of 1986 (“Act”), the change 1s treated as initiated by the taxpayer, approved by the Commissioner, and the period for taking the adjustments under section 481(a) into account will not exceed 4 years. (Hospitals required to change from the cash method under section 448 have 10 years to take the adjustrnents into account.) Complete Section A and the appropriate sections (B-1 or C and D) for which the change is required.",
"element_id": "0476fb3d546e315ae90c733259812973",
"text": "Uniform capitalization rules and limitation on cash method.—If you are required to change your method of accounting under section,263A (relating to the capitalization and inclusion in inventory costs of certain expenses) or 448 (limiting the use of the cash method of accounting by certain taxpayers) as added by the Tax Reform Act of 1986 (“Act”), the change is treated as initiated by the taxpayer, approved by the Commissioner, and the period for taking the adjustments under section 481(a) into account will not exceed 4 years. (Hospitals required to change from the cash method under section 448 have 10 years to take the adjustrnents into account.) Complete Section A and the appropriate sections (B-1 or C and D) for which the change is required.",
"metadata": {
"filetype": "application/pdf",
"languages": [
@ -507,8 +507,8 @@
},
{
"type": "NarrativeText",
"element_id": "c92c7f4def0263141b370bf307d6bcc0",
"text": "If your application is filed after the 180-day period, it is late. The application will be considered for processing only upon a showing of “good cause” and if it can be shown to the satisfaction of the Commissioner that granting you an extension will not jeopardize the Government's interests. For further information, see Rev, Proc. 79-63.",
"element_id": "adad72fa6ed1f3d66351440221c1ad23",
"text": "If your application is filed after the 180-day period, it 1s late. The application will be considered for processing only upon a showing of “good cause” and if it can be shown to the satisfaction of the Commissioner that granting you an extension will not jeopardize the Government's interests. For further information, see Rev, Proc. 79-63.",
"metadata": {
"filetype": "application/pdf",
"languages": [
@ -815,8 +815,8 @@
},
{
"type": "Title",
"element_id": "e0e692b1f478333e3950f8cb2483a484",
"text": "Section A",
"element_id": "1505240fbe441adc4acdbc867689af29",
"text": "SectionA",
"metadata": {
"filetype": "application/pdf",
"languages": [
@ -1167,8 +1167,8 @@
},
{
"type": "NarrativeText",
"element_id": "fc1f0d4d56acd27a18ba80ab0acfb9e9",
"text": "(1) Farming businesses.—F or this purpose, the term “farming business” 1s defined in section 263A(e)(4), but it also includes the raising, harvesting, or growing of trees to which section 263A(c)(5) applies. Notwithstanding this exception, section 447 requires certain C corporations and partnerships with a C corporation as a partner to use the accrual method.",
"element_id": "d268b0c2840319e1b229673523368cae",
"text": "(1) Farming businesses.—For this purpose, the term “farming business” 1s defined in section 263A(e)(4), but it also includes the raising, harvesting, or growing of trees to which section 263A(c)(5) applies. Notwithstanding this exception, section 447 requires certain C corporations and partnerships with a C corporation as a partner to use the accrual method.",
"metadata": {
"filetype": "application/pdf",
"languages": [
@ -1255,8 +1255,8 @@
},
{
"type": "NarrativeText",
"element_id": "5f5c402f9ebefef3ba8eabf1b5f628b2",
"text": "(3) Entities with gross receipts of $5,000,000 or less. —To qualify for this exception, the C corporation's or partnerships annual average gross receipts for the three years ending with the prior tax year may not exceed $5,000,000. If the corporation or partnership was not in existence for the entire 3-year period, the period of existence is used to determine whether the corporation or partnership qualifies. If any tax year in the 3-year period is a short tax year, the corporation or partnership must annualize the gross receipts by multiplying the gross receipts by 12 and dividing the result by the number of months in the short period.",
"element_id": "02eb85f4c80a008b9e03744e68528aff",
"text": "(3) Entities with gross receipts of $5,000,000 or less. —To qualify for this exception, the C corporation's or partnerships annual average gross receipts for the three years ending with the prior tax year may not exceed $5,000,000. If the corporation or partnership was not in existence for the entire 3-year period, the period of existence is used to determine whether the corporation or partnership qualifies. If any tax year in the 3-year period is a short tax year, the corporation or partnership must annualize the gross receipts by multiplying the gross receipts by 12 and dividing the result by the number of months tn the short period.",
"metadata": {
"filetype": "application/pdf",
"languages": [

@ -1,8 +1,8 @@
[
{
"type": "Title",
"element_id": "9de8f65e7c38a2a2e1b0d0d8c3526808",
"text": "FM) Department of the Treasury Internal Revenue Service Instructions for Form 3115 (Rev. November 1987) Application for Change in Accounting Method",
"element_id": "6b126b0255ab0d12659889c3d523a5e8",
"text": "¥i9) Department of the Treasury Internal Revenue Service Instructions for Form 3115 (Rev. November 1987) Application for Change in Accounting Method",
"metadata": {
"filetype": "image/png",
"languages": [
@ -67,8 +67,8 @@
},
{
"type": "NarrativeText",
"element_id": "a77ac8b6adb5acf78d2eac45a4de69e5",
"text": "We ask for this information to carry out the Internal Revenue laws of the United States. We need it to ensure that taxpayers are complying with these laws and to allow us to figure and collect the right amount of tax. You are required to give us this information.",
"element_id": "90beb9a0f4b6984d9cfdf68096e114e4",
"text": "We ask for this information to carry out the Internal Revenue laws of the United States. We need it to ensure that taxpayers are complying with these laws ang to allow us to figure and collect the right amount of tax. You are required to give us this information.",
"metadata": {
"filetype": "image/png",
"languages": [
@ -199,8 +199,8 @@
},
{
"type": "NarrativeText",
"element_id": "8dc3e4d18b3936db176790654f8823e1",
"text": "Other methods.—Unless the Service has published a regulation or procedure to the contrary, all other changes 1n accounting methods required by the Act are automatically considered to be approved by the Commissioner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method for bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving credit plan (Act section 812); (3) the inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not file Form 3115 for these changes.",
"element_id": "746483e119190b6ce718ce4715bee6e6",
"text": "Other methods.—Unless the Service has published a regulation or procedure to the contrary, all other changes in accounting methods required by the Act are automatically considered to be approved by the Commissioner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method for bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving credit plan (Act section 812); (3) the inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not file Form 3115 for these changes.",
"metadata": {
"filetype": "image/png",
"languages": [
@ -221,8 +221,8 @@
},
{
"type": "NarrativeText",
"element_id": "85f1dcb7770743b979ee143d2e2aff19",
"text": "Generally, applicants must complete Section A. In addition, complete the appropriate sections (B-1 through H) for which a change ts desired.",
"element_id": "fdaf06392be067a41cac854c07a66033",
"text": "Generally, applicants must complete Section A. In addition, complete the appropriate sections (B-1 through H) for which a change is desired.",
"metadata": {
"filetype": "image/png",
"languages": [
@ -309,8 +309,8 @@
},
{
"type": "NarrativeText",
"element_id": "5b2139cd0640cd4eceddbce416a17f6f",
"text": "Uniform capitalization rules and limitation on cash method.—If you are required to change your method of accounting under sectior,263A (relating to the capitalization and inclusion in inventory costs of certain expenses) or 448 (limiting the use of the cash method of accounting by certain taxpayers) as added by the Tax Reform Act of 1986 (“Act”), the change is treated as initiated by the taxpayer, approved by the Commissioner, and the period for taking the adjustments under section 481(a) into account will not exceed 4 years. (Hospitals required to change from the cash method under section 448 have 10 years to take the adjustments into account.) Complete Section A and the appropriate sections (B-1 or C and D) for which the change is required.",
"element_id": "1d7c9cb0ba025f28eb4d035bb6447d52",
"text": "Uniform capitalization rules and limitation on cash method.—f you are required to change your method of accounting under sectior,263A (relating to the capitalization and inclusion in inventory costs of certain expenses) or 448 (limiting the use of the cash method of accounting by certain taxpayers) as added by the Tax Reform Act of 1986 (“Act”), the change is treated as initiated by the taxpayer, approved by the Commissioner, and the period for taking the adjustments under section 481(a) into account will not exceed 4 years. (Hospitals required to change from the cash method under section 448 have 10 years to take the adjustments into account.) Complete Section A and the appropriate sections (B-1 or C and D) for which the change is required.",
"metadata": {
"filetype": "image/png",
"languages": [
@ -397,8 +397,8 @@
},
{
"type": "NarrativeText",
"element_id": "05e444867dea72fa51c18a796551305f",
"text": "Taxpayers, other than exempt organizations, should file Form 3115 with the Commissioner of Internal Revenue, Attention: CC:C:4, 1111 Constitution Avenue, NW, Washington, DC 20224. Exempt organizations should file with the Assistant Commissioner (Employee Plans and Exempt Organizations), 1111 Constitution Avenue, NW, Washington, DC 20224.",
"element_id": "fd3e2689051b08dfefd978b6fe03a251",
"text": "Taxpayers, other than exempt organizations, should file Form 3115 with the Commissioner of Internal Revenue, Attention: CC:C:4, 1111 Constitution Avenue, NW, Washington, OC 20224. Exempt organizations should file with the Assistant Commissioner (Employee Plans and Exempt Organizations), 1111 Constitution Avenue, NW, Washington, DC 20224.",
"metadata": {
"filetype": "image/png",
"languages": [
@ -507,8 +507,8 @@
},
{
"type": "NarrativeText",
"element_id": "53204b2c819131895da7dba7fe978047",
"text": "If your application is filed after the 180-day period, it is late. The application will be considered for processing only upon a showing of “good cause\" and if it can be shown to the satisfaction of the Commissioner that granting you an extension will not jeopardize the Government's interests. For further information, see Rev. Proc. 79-63.",
"element_id": "adda4424f1f7ffb84390b6b8c60ac3bd",
"text": "If your application is filed after the 180-day period, it 1s late. The application will be considered for processing only upon a showing of “good cause” and if it can be shown to the satisfaction of the Commissioner that granting you an extension will not jeopardize the Government's interests. For further information, see Rev. Proc. 79-63.",
"metadata": {
"filetype": "image/png",
"languages": [
@ -551,8 +551,8 @@
},
{
"type": "NarrativeText",
"element_id": "a41365af6ab3185637e8f3891b27fcba",
"text": "Individuals.—An individual should enter his or her social security number in this block. If the application is made on behalf of a husband and wife who file their income tax return jointly, enter the social security numbers of both.",
"element_id": "923c71f94a011e5def8896cc2aa7120e",
"text": "Individuals. —An individual should enter his or her social security number in this block. If the application is made on behalf of a husband and wife who file their income tax return jointly, enter the social security numbers of both.",
"metadata": {
"filetype": "image/png",
"languages": [

@ -48,7 +48,7 @@
"element_id": "dddac446da6c93dc1449ecb5d997c423",
"text": "Dataset | Base Model\" Large Model | Notes PubLayNet [38] P/M M Layouts of modern scientific documents PRImA [3) M - Layouts of scanned modern magazines and scientific reports Newspaper [17] P - Layouts of scanned US newspapers from the 20th century TableBank (18) P P Table region on modern scientific and business document HJDataset (31) | F/M - Layouts of history Japanese documents",
"metadata": {
"text_as_html": "<table><thead><tr><th>Dataset</th><th>| Base Model!|</th><th>Large Model</th><th>| Notes</th></tr></thead><tbody><tr><td>PubLayNet [33]</td><td>P/M</td><td>M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td></td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper [17]</td><td>P</td><td></td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank [18]</td><td>P</td><td></td><td>Table region on modern scientific and business document</td></tr><tr><td>HIDataset [31]</td><td>P/M</td><td></td><td>Layouts of history Japanese documents</td></tr></tbody></table>",
"text_as_html": "<table><thead><tr><th>Dataset</th><th>| Base Model!|</th><th>Large Model</th><th>| Notes</th></tr></thead><tbody><tr><td>PubLayNet [33]</td><td>P/M</td><td>M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td></td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper [17]</td><td>P</td><td></td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank [18]</td><td>P</td><td></td><td>Table region on modern scientific and business document</td></tr><tr><td>HIDataset (31)</td><td>P/M</td><td></td><td>Layouts of history Japanese documents</td></tr></tbody></table>",
"filetype": "image/jpeg",
"languages": [
"eng"
@ -68,8 +68,8 @@
},
{
"type": "FigureCaption",
"element_id": "a0c3c6b7e1e8c95016b989ef43c5ea2e",
"text": "2 For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Faster R-CNN [28] (P) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has m Faster R-CNN model trained using the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months.",
"element_id": "3d1c1bf1eb6a87a874d21d8f11b226b1",
"text": "2 For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Fuster R-CNN [28] (P) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has m Faster R-CNN model trained using the ResNet 10] backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months.",
"metadata": {
"filetype": "image/jpeg",
"languages": [

@ -265,8 +265,8 @@
},
{
"type": "NarrativeText",
"element_id": "7cf062c1ba64938cc68c4fae61506d84",
"text": "Zejiang Shen! (4), Ruochen Zhang”, Melissa Dell?, Benjamin Charles Germain Lee*, Jacob Carlson, and Weining Li>",
"element_id": "97c951d2dd3a1b5452d0c55e62e8ea78",
"text": "Zejiang Shen! (4), Ruochen Zhang”, Melissa Dell?, Benjamin Charles Germain Lee*, Jacob Carlson, and Weining Li®",
"metadata": {
"filetype": "application/pdf",
"languages": [
@ -1199,8 +1199,8 @@
},
{
"type": "Image",
"element_id": "642416e5d6c99219b16dbba6f72392c5",
"text": "Efficient Data Annotation Model Customization Document Images Community Platform a >) ¥ DIA Model Hub i .) Customized Model Training] == | Layout Detection Models | ——= DIA Pipeline Sharing ~ OCR Module = { Layout Data stuctue ) = (storage Visualization VY",
"element_id": "285d83f3098b26302329b33637fd265f",
"text": "Efficient Data Annotation Model Customization Document Images Community Platform A >) ¥ DIA Model Hub a Customized Model Training] == | Layout Detection Models | ——= DIA Pipeline Sharing ~ OCR Module = { Layout Data stuctue ) = (store Visualization LY",
"metadata": {
"filetype": "application/pdf",
"languages": [
@ -1762,8 +1762,8 @@
},
{
"type": "Image",
"element_id": "6eb2bb6ca50b3be177565f9ff546bce8",
"text": "- ° . 3 a a 4 a 3 oo er 2 § 8 a 8 3 3 £ 4 A g a 9 3 ¥ Coordinate g 4 5 3 + § 3 H Extra Features [O=\") [Bo] eaing i Text | | Type | | ower ° & a ¢ o [ coordinatel textblock1, 3 3 g Q 3 , textblock2 , layoutl ] 4 q ® A list of the layout elements Ff",
"element_id": "fd2288e4e3cf90f109d1c1198cea4ca0",
"text": "3 a a 4 a 3 Rectangle vada 4 8 4 iS v 2 [S) af : fa & o a 6 g 4 Coordinate g 2 8 3 + 4 * v 8 Extra features =| 9%) | Hock) Reading é ret | | Type | | order 2 & a ¢ @ [ coordinatel textblock1 , 8 » , ee 3 , textblock2 , layoutl ] 8 q ® A list of the layout elements sf",
"metadata": {
"filetype": "application/pdf",
"languages": [
@ -2153,7 +2153,7 @@
"element_id": "64bc79d1132a89c71837f420d6e4e2dc",
"text": "Operation Name Description block.pad(top, bottom, right, left) Enlarge the current block according to the input block.scale(fx, fy) Scale the current block given the ratio in x and y direction block.shift(dx, dy) Move the current block with the shift distances in x and y direction block1.is in(block2) Whether block1 is inside of block2 block1.intersect(block2) Return the intersection region of block1 and block2. Coordinate type to be determined based on the inputs. block1.union(block2) Return the union region of block1 and block2. Coordinate type to be determined based on the inputs. block1.relative to(block2) Convert the absolute coordinates of block1 to relative coordinates to block2 block1.condition on(block2) Calculate the absolute coordinates of block1 given the canvas block2s absolute coordinates block.crop image(image) Obtain the image segments in the block region",
"metadata": {
"text_as_html": "<table><thead><tr><th>block.pad(top, bottom,</th><th>right,</th><th>left)</th><th>Enlarge the current block according to the input</th></tr></thead><tbody><tr><td>block.scale(fx, fy)</td><td></td><td></td><td>Scale the current block given the ratio in x and y direction</td></tr><tr><td>block.shift(dx, dy)</td><td></td><td></td><td>Move the current block with the shift distances in x and y direction</td></tr><tr><td>block1.is_in(block2)</td><td></td><td></td><td>Whether block] is inside of block2</td></tr><tr><td>block1. intersect (block2)</td><td></td><td></td><td>Return the intersection region of blockl and block2. Coordinate type to be determined based on the inputs</td></tr><tr><td>block1.union(block2)</td><td></td><td></td><td>Return the union region of blockl and block2. Coordinate type to be determined based on the inputs</td></tr><tr><td>block1.relative_to(block2)</td><td></td><td></td><td>Convert the absolute coordinates of block to relative coordinates to block2</td></tr><tr><td>block1.condition_on(block2)</td><td></td><td></td><td>Calculate the absolute coordinates of blockl given the canvas block2s absolute coordinates</td></tr><tr><td>block. crop_image (image)</td><td></td><td></td><td>Obtain the image segments in the block region</td></tr></tbody></table>",
"text_as_html": "<table><thead><tr><th>block.pad(top, bottom,</th><th>right,</th><th>left)</th><th>Enlarge the current block according to the input</th></tr></thead><tbody><tr><td>block.scale(fx, fy)</td><td></td><td></td><td>Scale the current block given the ratio in x and y direction</td></tr><tr><td>block.shift(dx, dy)</td><td></td><td></td><td>Move the current block with the shift distances in x and y direction</td></tr><tr><td>block1.is_in(block2)</td><td></td><td></td><td>Whether block] is inside of block2</td></tr><tr><td>block1. intersect (block2)</td><td></td><td></td><td>Return the intersection region of blockl and block2. Coordinate type to be determined based on the inputs</td></tr><tr><td>block1.union(block2)</td><td></td><td></td><td>Return the union region of blockl and block2. Coordinate type to be determined based on the inputs</td></tr><tr><td>block1.relative_to(block2)</td><td></td><td></td><td>Convert the absolute coordinates of blockl to relative coordinates to block2</td></tr><tr><td>block1.condition_on(block2)</td><td></td><td></td><td>Calculate the absolute coordinates of blockl given the canvas block2s absolute coordinates</td></tr><tr><td>block. crop_image (image)</td><td></td><td></td><td>Obtain the image segments in the block region</td></tr></tbody></table>",
"filetype": "application/pdf",
"languages": [
"eng"
@ -2356,8 +2356,8 @@
},
{
"type": "Image",
"element_id": "f5450580bd9ae07f4cdf7c23a6ccaf41",
"text": "x09 Burpunog uayor Aeydsiq 1 vondo 10g Guypunog usyoy apir:z uondo Mode I: Showing Layout on the Original Image Mode Il: Drawing OCR'd Text at the Correspoding Position",
"element_id": "02a078081972f7bdb26f06a787773a30",
"text": "0g Burpunog uayor Aeydsiq:1 vondo 10g Guypunog usyou ap:z uondo Mode I: Showing Layout on the Original Image Mode Il: Drawing OCR Text at the Correspoding Position",
"metadata": {
"filetype": "application/pdf",
"languages": [
@ -2507,8 +2507,8 @@
},
{
"type": "Image",
"element_id": "6eb34afad9d568fbccde8ac8854dc24d",
"text": "Intra-column reading order Token Categories tie (Adress 2) tee (NE sumber Variable Column reading order HEE company type Column Categories (J tite Adress _] ree [7] Section Header Maximum Allowed Height (b) Illustration of the recreated document with dense text structure for better OCR performance",
"element_id": "747f46c43a88768fd543e10bac84203b",
"text": "Intra-column reading order Token Categories tie (Adress tee Ewumber Variable Column reading order HEE company ype Column Categories (J tite Adress 1] ree [7] Section Header Maximum Allowed Height (b) Illustration of the recreated document with dense text structure for better OCR performance",
"metadata": {
"filetype": "application/pdf",
"languages": [
@ -2819,8 +2819,8 @@
},
{
"type": "Image",
"element_id": "f48a844114951222f6c96331efc683fb",
"text": "(spe peepee, Active Learning Layout Annotate Layout Dataset | + Annotation Toolkit ¥ a Deep Leaming Layout Model Training & Inference, ¥ ; Handy Data Structures & Post-processing El Apis for Layout Det a LAR ror tye eats) 4 Text Recognition | <—— Default ane Customized ¥ ee Layout Structure Visualization & Export | <—— | visualization & Storage The Japanese Document Helpful LayoutParser Digitization Pipeline Modules",
"element_id": "2b90153124fb6f9e9f5539b9db75d240",
"text": "———————_+ (| Active Learning Layout Annotate Layout Dataset | + Annotation Toolkit ¥ alae Deep Leaming Layout Model Training & Inference, ¥ ; Handy Data Structures & Post-processing Ee apis for Layout Dat a Ae ror yon Oats 4 Text Recognition | <—— Default ane Customized ¥ ee Layout Structure Visualization & Export | <—— | visualization & Storage The Japanese Document Helpful LayoutParser Digitization Pipeline Modules",
"metadata": {
"filetype": "application/pdf",
"languages": [
@ -3119,8 +3119,8 @@
},
{
"type": "Image",
"element_id": "d5c954ff619e348d36d5180feedabc6c",
"text": "(@) Partial table at the bottom (&) Full page table (6) Partial table at the top (d) Mis-detected tet line",
"element_id": "1359eaa601a24c083e143b8bf5114127",
"text": "(@) Partial table at the bottom (6) Full page table (©) Partial table at the top (@) Mis-detected text line",
"metadata": {
"filetype": "application/pdf",
"languages": [

@ -1 +1 @@
-__version__ = "0.17.11-dev0" # pragma: no cover
+__version__ = "0.17.11-dev1" # pragma: no cover

@ -503,12 +503,10 @@ class PreChunk:
         if self._overlap_prefix:
             yield self._overlap_prefix
         for e in self._elements:
-            if e.text is None:
-                continue
-            text = " ".join(e.text.strip().split())
-            if not text:
-                continue
-            yield text
+            if e.text and len(e.text):
+                text = " ".join(e.text.strip().split())
+                if text:
+                    yield text
     @lazyproperty
     def _text(self) -> str:
@ -848,13 +846,15 @@ class _TableChunker:
     @lazyproperty
     def _table_text(self) -> str:
         """The text in this table, not including any overlap-prefix or extra whitespace."""
+        if not self._table.text:
+            return ""
         return " ".join(self._table.text.split())
     @lazyproperty
     def _text_with_overlap(self) -> str:
         """The text for this chunk, including the overlap-prefix when present."""
         overlap_prefix = self._overlap_prefix
-        table_text = self._table.text.strip()
+        table_text = "" if not self._table.text else self._table.text.strip()
         # -- use row-separator between overlap and table-text --
         return overlap_prefix + "\n" + table_text if overlap_prefix else table_text
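
The guards in the two hunks above follow one pattern: treat a `None` or empty `text` as an empty string before calling `.strip()` or `.split()`. Below is a minimal standalone sketch of that pattern. It does not import `unstructured`; the helper names `normalize_text` and `iter_text_segments` are hypothetical and only illustrate the idea, they are not functions from the library.

```python
from typing import Iterable, Iterator, Optional


def normalize_text(text: Optional[str]) -> str:
    """Collapse whitespace runs to single spaces; treat None as empty text."""
    # `not text` is True for both None and "", so neither reaches .split()
    if not text:
        return ""
    return " ".join(text.split())


def iter_text_segments(
    texts: Iterable[Optional[str]], overlap_prefix: str = ""
) -> Iterator[str]:
    """Yield the overlap prefix (when present), then each non-empty normalized text."""
    if overlap_prefix:
        yield overlap_prefix
    for raw in texts:
        text = normalize_text(raw)
        # whitespace-only input normalizes to "" and is skipped as well
        if text:
            yield text


# Elements whose text is None (e.g. an image without OCR text) are skipped.
print(list(iter_text_segments([None, "  hello \n world ", "", None])))
```

Putting the `None` check in a single helper keeps every call site safe, which may be simpler than repeating the guard in each property as the diff does.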