chunk: relax table segregation during chunking (#3812)

**Summary**
Relax the table-segregation rule applied during chunking so that a `Table`
element and `Text`-subtype elements can be combined into a single chunk when
the chunking window allows.

**Additional Context**
Until now, `Table` elements have always been segregated during chunking,
i.e. a chunk that contained a table would never contain any other
element. In certain scenarios, especially when a large chunking window
of, say, 2000 characters is used, this behavior can reduce retrieval
effectiveness by isolating the table from its surrounding context.
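
The difference between the old and the relaxed rule can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the `Element` dataclass, the `pre_chunk` function, and the `segregate_tables` flag are hypothetical names introduced only for illustration, and the sketch assumes elements in a chunk are joined with a two-character separator and that chunk length is measured on element text alone. Splitting of oversized elements is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a document element (not the library's class)."""

    kind: str  # "Table" or a Text subtype such as "Title" / "NarrativeText"
    text: str


def pre_chunk(elements, max_characters=500, segregate_tables=False):
    """Greedily group elements into pre-chunks of at most `max_characters`.

    With `segregate_tables=True` a table never shares a chunk with another
    element (the old rule). With `False` a table is treated like any other
    element and is combined with its neighbors when space allows.
    """
    separator_len = 2  # assumption: elements are joined with "\n\n"
    chunks, current, current_len = [], [], 0

    for element in elements:
        added_len = len(element.text) + (separator_len if current else 0)
        # -- old rule: force a boundary on either side of a table --
        table_boundary = bool(current) and segregate_tables and (
            element.kind == "Table" or current[-1].kind == "Table"
        )
        if current and (table_boundary or current_len + added_len > max_characters):
            chunks.append(current)
            current, current_len = [], 0
            added_len = len(element.text)
        current.append(element)
        current_len += added_len

    if current:
        chunks.append(current)
    return chunks


elements = [
    Element("Title", "A Great Day"),
    Element("NarrativeText", "Today is a great day."),
    Element("NarrativeText", "It is sunny outside."),
    Element("Table", "Heading Cell text"),
]

# Old behavior: the table is isolated in its own chunk.
print([[e.kind for e in c] for c in pre_chunk(elements, segregate_tables=True)])
# Relaxed behavior: the table joins the preceding text when it fits.
print([[e.kind for e in c] for c in pre_chunk(elements)])
```

With the relaxed rule, the only thing that separates a table from adjacent text in this sketch is the size of the chunking window.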

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
Steve Canny 2024-12-09 10:57:22 -08:00 committed by GitHub
parent 18d6c81c47
commit 4379d883a3
15 changed files with 1077 additions and 935 deletions


@ -1,8 +1,10 @@
## 0.16.11-dev0 ## 0.16.11-dev1
### Enhancements ### Enhancements
- **Enhance quote standardization tests** with additional Unicode scenarios - **Enhance quote standardization tests** with additional Unicode scenarios
- **Relax table segregation rule in chunking.** Previously a `Table` element was always segregated into its own pre-chunk such that the `Table` appeared alone in a chunk or was split into multiple `TableChunk` elements, but never combined with `Text`-subtype elements. Allow table elements to be combined with other elements in the same chunk when space allows.
- **Compute chunk length based solely on `element.text`.** Previously `.metadata.text_as_html` was also considered and since it is always longer than the text (due to HTML tag overhead) it was the effective length criterion. Remove text-as-html from the length calculation such that text-length is the sole criterion for sizing a chunk.
### Features ### Features


@ -25,31 +25,31 @@ def test_it_chunks_a_document_when_basic_chunking_strategy_is_specified_on_parti
assert chunks == [ assert chunks == [
CompositeElement( CompositeElement(
"US Trustee Handbook\n\nCHAPTER 1\n\nINTRODUCTION\n\nCHAPTER 1 INTRODUCTION" "US Trustee Handbook\n\nCHAPTER 1\n\nINTRODUCTION\n\nCHAPTER 1 INTRODUCTION"
"\n\nA.\tPURPOSE" "\n\nA. PURPOSE"
), ),
CompositeElement( CompositeElement(
"The United States Trustee appoints and supervises standing trustees and monitors and" "The United States Trustee appoints and supervises standing trustees and monitors and"
" supervises cases under chapter 13 of title 11 of the United States Code. 28 U.S.C." " supervises cases under chapter 13 of title 11 of the United States Code. 28 U.S.C."
" § 586(b). The Handbook, issued as part of our duties under 28 U.S.C. § 586," " § 586(b). The Handbook, issued as part of our duties under 28 U.S.C. § 586,"
" establishes or clarifies the position of the United States Trustee Program (Program)" " establishes or clarifies the position of the United States Trustee Program (Program)"
" on the duties owed by a standing trustee to the debtors, creditors, other parties in" " on the duties owed by a standing trustee to the debtors, creditors, other parties in"
" interest, and the United States Trustee. The Handbook does not present a full and" " interest, and the United States Trustee. The Handbook does not present a full and"
), ),
CompositeElement( CompositeElement(
"complete statement of the law; it should not be used as a substitute for legal" "complete statement of the law; it should not be used as a substitute for legal"
" research and analysis. The standing trustee must be familiar with relevant" " research and analysis. The standing trustee must be familiar with relevant"
" provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules)," " provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules),"
" any local bankruptcy rules, and case law. 11 U.S.C. § 321, 28 U.S.C. § 586," " any local bankruptcy rules, and case law. 11 U.S.C. § 321, 28 U.S.C. § 586,"
" 28 C.F.R. § 58.6(a)(3). Standing trustees are encouraged to follow Practice Tips" " 28 C.F.R. § 58.6(a)(3). Standing trustees are encouraged to follow Practice Tips"
" identified in this Handbook but these are not considered mandatory." " identified in this Handbook but these are not considered mandatory."
), ),
CompositeElement( CompositeElement(
"Nothing in this Handbook should be construed to excuse the standing trustee from" "Nothing in this Handbook should be construed to excuse the standing trustee from"
" complying with all duties imposed by the Bankruptcy Code and Rules, local rules, and" " complying with all duties imposed by the Bankruptcy Code and Rules, local rules, and"
" orders of the court. The standing trustee should notify the United States Trustee" " orders of the court. The standing trustee should notify the United States Trustee"
" whenever the provision of the Handbook conflicts with the local rules or orders of" " whenever the provision of the Handbook conflicts with the local rules or orders of"
" the court. The standing trustee is accountable for all duties set forth in this" " the court. The standing trustee is accountable for all duties set forth in this"
" Handbook, but need not personally perform any duty unless otherwise indicated. All" " Handbook, but need not personally perform any duty unless otherwise indicated. All"
), ),
CompositeElement( CompositeElement(
"statutory references in this Handbook refer to the Bankruptcy Code, 11 U.S.C. § 101" "statutory references in this Handbook refer to the Bankruptcy Code, 11 U.S.C. § 101"
@ -57,12 +57,12 @@ def test_it_chunks_a_document_when_basic_chunking_strategy_is_specified_on_parti
), ),
CompositeElement( CompositeElement(
"This Handbook does not create additional rights against the standing trustee or" "This Handbook does not create additional rights against the standing trustee or"
" United States Trustee in favor of other parties.\n\nB.\tROLE OF THE UNITED STATES" " United States Trustee in favor of other parties.\n\nB. ROLE OF THE UNITED STATES"
" TRUSTEE" " TRUSTEE"
), ),
CompositeElement( CompositeElement(
"The Bankruptcy Reform Act of 1978 removed the bankruptcy judge from the" "The Bankruptcy Reform Act of 1978 removed the bankruptcy judge from the"
" responsibilities for daytoday administration of cases. Debtors, creditors, and" " responsibilities for daytoday administration of cases. Debtors, creditors, and"
" third parties with adverse interests to the trustee were concerned that the court," " third parties with adverse interests to the trustee were concerned that the court,"
" which previously appointed and supervised the trustee, would not impartially" " which previously appointed and supervised the trustee, would not impartially"
" adjudicate their rights as adversaries of that trustee. To address these concerns," " adjudicate their rights as adversaries of that trustee. To address these concerns,"
@ -70,24 +70,24 @@ def test_it_chunks_a_document_when_basic_chunking_strategy_is_specified_on_parti
), ),
CompositeElement( CompositeElement(
"Many administrative functions formerly performed by the court were placed within the" "Many administrative functions formerly performed by the court were placed within the"
" Department of Justice through the creation of the Program. Among the administrative" " Department of Justice through the creation of the Program. Among the administrative"
" functions assigned to the United States Trustee were the appointment and supervision" " functions assigned to the United States Trustee were the appointment and supervision"
" of chapter 13 trustees./ This Handbook is issued under the authority of the" " of chapter 13 trustees./ This Handbook is issued under the authority of the"
" Programs enabling statutes. \n\nC.\tSTATUTORY DUTIES OF A STANDING TRUSTEE\t" " Programs enabling statutes.\n\nC. STATUTORY DUTIES OF A STANDING TRUSTEE"
), ),
CompositeElement( CompositeElement(
"The standing trustee has a fiduciary responsibility to the bankruptcy estate. The" "The standing trustee has a fiduciary responsibility to the bankruptcy estate. The"
" standing trustee is more than a mere disbursing agent. The standing trustee must" " standing trustee is more than a mere disbursing agent. The standing trustee must"
" be personally involved in the trustee operation. If the standing trustee is or" " be personally involved in the trustee operation. If the standing trustee is or"
" becomes unable to perform the duties and responsibilities of a standing trustee," " becomes unable to perform the duties and responsibilities of a standing trustee,"
" the standing trustee must immediately advise the United States Trustee." " the standing trustee must immediately advise the United States Trustee."
" 28 U.S.C. § 586(b), 28 C.F.R. § 58.4(b) referencing 28 C.F.R. § 58.3(b)." " 28 U.S.C. § 586(b), 28 C.F.R. § 58.4(b) referencing 28 C.F.R. § 58.3(b)."
), ),
CompositeElement( CompositeElement(
"Although this Handbook is not intended to be a complete statutory reference, the" "Although this Handbook is not intended to be a complete statutory reference, the"
" standing trustees primary statutory duties are set forth in 11 U.S.C. § 1302, which" " standing trustees primary statutory duties are set forth in 11 U.S.C. § 1302, which"
" incorporates by reference some of the duties of chapter 7 trustees found in" " incorporates by reference some of the duties of chapter 7 trustees found in"
" 11 U.S.C. § 704. These duties include, but are not limited to, the" " 11 U.S.C. § 704. These duties include, but are not limited to, the"
" following:\n\nCopyright" " following:\n\nCopyright"
), ),
] ]


@ -8,7 +8,7 @@ from typing import Any, Optional
import pytest import pytest
from test_unstructured.unit_utils import FixtureRequest, Mock, function_mock from test_unstructured.unit_utils import FixtureRequest, Mock, function_mock, input_path
from unstructured.chunking.base import CHUNK_MULTI_PAGE_DEFAULT from unstructured.chunking.base import CHUNK_MULTI_PAGE_DEFAULT
from unstructured.chunking.title import _ByTitleChunkingOptions, chunk_by_title from unstructured.chunking.title import _ByTitleChunkingOptions, chunk_by_title
from unstructured.documents.coordinates import CoordinateSystem from unstructured.documents.coordinates import CoordinateSystem
@ -20,10 +20,12 @@ from unstructured.documents.elements import (
ElementMetadata, ElementMetadata,
ListItem, ListItem,
Table, Table,
TableChunk,
Text, Text,
Title, Title,
) )
from unstructured.partition.html import partition_html from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_from_json
# ================================================================================================ # ================================================================================================
# INTEGRATION-TESTS # INTEGRATION-TESTS
@ -33,7 +35,53 @@ from unstructured.partition.html import partition_html
# ================================================================================================ # ================================================================================================
def test_it_splits_a_large_element_into_multiple_chunks(): def test_it_chunks_text_followed_by_table_together_when_both_fit():
elements = elements_from_json(input_path("chunking/title_table_200.json"))
chunks = chunk_by_title(elements, combine_text_under_n_chars=0)
assert len(chunks) == 1
assert isinstance(chunks[0], CompositeElement)
def test_it_chunks_table_followed_by_text_together_when_both_fit():
elements = elements_from_json(input_path("chunking/table_text_200.json"))
# -- disable chunk combining so we test pre-chunking behavior, not chunk-combining --
chunks = chunk_by_title(elements, combine_text_under_n_chars=0)
assert len(chunks) == 1
assert isinstance(chunks[0], CompositeElement)
def test_it_splits_oversized_table():
elements = elements_from_json(input_path("chunking/table_2000.json"))
chunks = chunk_by_title(elements)
assert len(chunks) == 5
assert all(isinstance(chunk, TableChunk) for chunk in chunks)
def test_it_starts_new_chunk_for_table_after_full_text_chunk():
elements = elements_from_json(input_path("chunking/long_text_table_200.json"))
chunks = chunk_by_title(elements, max_characters=250)
assert len(chunks) == 2
assert [type(chunk) for chunk in chunks] == [CompositeElement, Table]
def test_it_starts_new_chunk_for_text_after_full_table_chunk():
elements = elements_from_json(input_path("chunking/full_table_long_text_250.json"))
chunks = chunk_by_title(elements, max_characters=250)
assert len(chunks) == 2
assert [type(chunk) for chunk in chunks] == [Table, CompositeElement]
def test_it_splits_a_large_text_element_into_multiple_chunks():
elements: list[Element] = [ elements: list[Element] = [
Title("Introduction"), Title("Introduction"),
Text( Text(
@ -68,7 +116,7 @@ def test_it_splits_elements_by_title_and_table():
chunks = chunk_by_title(elements, combine_text_under_n_chars=0, include_orig_elements=True) chunks = chunk_by_title(elements, combine_text_under_n_chars=0, include_orig_elements=True)
assert len(chunks) == 4 assert len(chunks) == 3
# -- # --
chunk = chunks[0] chunk = chunks[0]
assert isinstance(chunk, CompositeElement) assert isinstance(chunk, CompositeElement)
@ -76,13 +124,10 @@ def test_it_splits_elements_by_title_and_table():
Title("A Great Day"), Title("A Great Day"),
Text("Today is a great day."), Text("Today is a great day."),
Text("It is sunny outside."), Text("It is sunny outside."),
Table("Heading\nCell text"),
] ]
# -- # --
chunk = chunks[1] chunk = chunks[1]
assert isinstance(chunk, Table)
assert chunk.metadata.orig_elements == [Table("Heading\nCell text")]
# ==
chunk = chunks[2]
assert isinstance(chunk, CompositeElement) assert isinstance(chunk, CompositeElement)
assert chunk.metadata.orig_elements == [ assert chunk.metadata.orig_elements == [
Title("An Okay Day"), Title("An Okay Day"),
@ -90,7 +135,7 @@ def test_it_splits_elements_by_title_and_table():
Text("It is rainy outside."), Text("It is rainy outside."),
] ]
# -- # --
chunk = chunks[3] chunk = chunks[2]
assert isinstance(chunk, CompositeElement) assert isinstance(chunk, CompositeElement)
assert chunk.metadata.orig_elements == [ assert chunk.metadata.orig_elements == [
Title("A Bad Day"), Title("A Bad Day"),
@ -119,9 +164,8 @@ def test_chunk_by_title():
assert chunks == [ assert chunks == [
CompositeElement( CompositeElement(
"A Great Day\n\nToday is a great day.\n\nIt is sunny outside.", "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"
), ),
Table("Heading\nCell text"),
CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."), CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
CompositeElement( CompositeElement(
"A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.", "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
@ -150,10 +194,7 @@ def test_chunk_by_title_separates_by_page_number():
CompositeElement( CompositeElement(
"A Great Day", "A Great Day",
), ),
CompositeElement( CompositeElement("Today is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"),
"Today is a great day.\n\nIt is sunny outside.",
),
Table("Heading\nCell text"),
CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."), CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
CompositeElement( CompositeElement(
"A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.", "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
@ -178,9 +219,8 @@ def test_chuck_by_title_respects_multipage():
chunks = chunk_by_title(elements, multipage_sections=True, combine_text_under_n_chars=0) chunks = chunk_by_title(elements, multipage_sections=True, combine_text_under_n_chars=0)
assert chunks == [ assert chunks == [
CompositeElement( CompositeElement(
"A Great Day\n\nToday is a great day.\n\nIt is sunny outside.", "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"
), ),
Table("Heading\nCell text"),
CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."), CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
CompositeElement( CompositeElement(
"A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.", "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
@ -206,9 +246,8 @@ def test_chunk_by_title_groups_across_pages():
assert chunks == [ assert chunks == [
CompositeElement( CompositeElement(
"A Great Day\n\nToday is a great day.\n\nIt is sunny outside.", "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"
), ),
Table("Heading\nCell text"),
CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."), CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
CompositeElement( CompositeElement(
"A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.", "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",


@ -37,7 +37,7 @@ def test_it_chunks_elements_when_a_chunking_strategy_is_specified():
"example-docs/spring-weather.html.json", chunking_strategy="basic", max_characters=1500 "example-docs/spring-weather.html.json", chunking_strategy="basic", max_characters=1500
) )
assert len(chunks) == 10 assert len(chunks) == 9
assert all(isinstance(ch, CompositeElement) for ch in chunks) assert all(isinstance(ch, CompositeElement) for ch in chunks)


@ -0,0 +1,32 @@
[
{
"type": "Table",
"element_id": "ca96108263324e9d865a98f19cf7c940",
"text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
"metadata": {
"category_depth": 1,
"page_number": 1,
"parent_id": "747587de72444235a68c768d544ff5f3",
"text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
"languages": [
"eng"
],
"filetype": "text/html"
}
},
{
"type": "NarrativeText",
"element_id": "5bc93ad5828445f98cac824c750cacfd",
"text": "Format: CSV file for Export and Download Contact: Charles Stringham cstringham@alsde.edu to arrange secure data transfer OR with technical questions nickey.johnson@alsde.edu for other questions",
"metadata": {
"category_depth": 2,
"page_number": 1,
"parent_id": "d8fa364bbfdf42d7b37c7a1dcb90ecf5",
"text_as_html": "<p class=\"NarrativeText\" id=\"5bc93ad5828445f98cac824c750cacfd\">Format: CSV file for Export and Download </p> <p class=\"NarrativeText\" id=\"875c1820b6cd4736a7e699571896b568\">Contact: Charles Stringham cstringham@alsde.edu to arrange secure data transfer OR with technical questions </p> <p class=\"NarrativeText\" id=\"ac41c15812e64e918cbb07c2bc68b5d2\">nickey.johnson@alsde.edu for other questions </p>",
"languages": [
"eng"
],
"filetype": "text/html"
}
}
]


@ -0,0 +1,32 @@
[
{
"type": "NarrativeText",
"element_id": "5bc93ad5828445f98cac824c750cacfd",
"text": "Format: CSV file for Export and Download Contact: Charles Stringham cstringham@alsde.edu to arrange secure data transfer OR with technical questions nickey.johnson@alsde.edu for other questions",
"metadata": {
"category_depth": 2,
"page_number": 1,
"parent_id": "d8fa364bbfdf42d7b37c7a1dcb90ecf5",
"text_as_html": "<p class=\"NarrativeText\" id=\"5bc93ad5828445f98cac824c750cacfd\">Format: CSV file for Export and Download </p> <p class=\"NarrativeText\" id=\"875c1820b6cd4736a7e699571896b568\">Contact: Charles Stringham cstringham@alsde.edu to arrange secure data transfer OR with technical questions </p> <p class=\"NarrativeText\" id=\"ac41c15812e64e918cbb07c2bc68b5d2\">nickey.johnson@alsde.edu for other questions </p>",
"languages": [
"eng"
],
"filetype": "text/html"
}
},
{
"type": "Table",
"element_id": "ca96108263324e9d865a98f19cf7c940",
"text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
"metadata": {
"category_depth": 1,
"page_number": 1,
"parent_id": "747587de72444235a68c768d544ff5f3",
"text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
"languages": [
"eng"
],
"filetype": "text/html"
}
}
]


@ -0,0 +1,17 @@
[
{
"type": "Table",
"element_id": "e6278883f688428c98cec628a00b0102",
"text": "Field Name Size Type Description Example School_Year 9 VARCHAR School year the assessment was given 2019-2020 LEA_Name VARCHAR Official Name of the School System Happy City Schools LEA_Code 3 VARCHAR 3-digit ALSDE-assigned system code 010 or 298 School_Code 6 VARCHAR 4-digit ALSDE-assigned school code 0100 or 9203 Student_Identifier 10 VARCHAR Student's ALSDE ID number -SSID ***must be 10 digits and start with \"19\" or \"20\"*** 9999999999 Student_Last_Name 35 VARCHAR Student's last name Smith Student_First_Name 35 VARCHAR Student's first name Jane Student_Date_of_Birth_Month 2 VARCHAR Student birth date month. MM 05, 11 Student_Date_of_Birth_Day 2 VARCHAR Student birth date day. DD 03, 25 Student_Date_of_Birth_Year 4 VARCHAR Student birth date Year. YYYY 2015 Reading_Teacher_Identifier 13 VARCHAR Reading Teacher's ALSDE ID/TCHNumber. The teacher who is primarily responsible for Reading instruction of the student. (These are two names for the same number). ***must be in this format 3 letters, dash, 4 numbers, dash, 4 numbers*** XXX-9999-9999, NOJ-1234-5678 Reading_Assessment_Name 15 VARCHAR Unique identifier for Reading assessment. Vendor's name for overall assessment. XXXX Reading_Administration_Mode 8 VARCHAR This field indicates if the assessment was administered in an in-person (face-to-face) or a remote learning environment. The options are: InPerson or Remote Reading_Benchmark_Period 3 VARCHAR Benchmark period during the term the assessment was administered. Summer School will be SSS. BOY, MOY or EOY (SSS for summer school) Reading_Date_Completed 10 VARCHAR This is the date on which the assessment is completed MM/DD/YYYY 43962 Reading_Extended_Time 2 VARCHAR The field will contain a \"Y\" if the student was given more than the allotted time to finish the assessment or any subtest of the assessment as defined by the vendor in a standard administration. Y",
"metadata": {
"category_depth": 1,
"page_number": 1,
"parent_id": "3ddff8c2b6c44a16be24baf72bdd78a2",
"text_as_html": "<table class=\"Table\" id=\"e6278883f688428c98cec628a00b0102\"> <thead> <tr> <th>Field Name</th><th>Size</th><th>Type</th><th>Description</th><th>Example</th></tr></thead><tbody> <tr> <td>School_Year</td><td>9</td><td>VARCHAR</td><td>School year the assessment was given</td><td>2019-2020</td></tr><tr> <td>LEA_Name</td><td></td><td>VARCHAR</td><td>Official Name of the School System</td><td>Happy City Schools</td></tr><tr> <td>LEA_Code</td><td>3</td><td>VARCHAR</td><td>3-digit ALSDE-assigned system code</td><td>010 or 298</td></tr><tr> <td>School_Code</td><td>6</td><td>VARCHAR</td><td>4-digit ALSDE-assigned school code</td><td>0100 or 9203</td></tr><tr> <td>Student_Identifier</td><td>10</td><td>VARCHAR</td><td>Student's ALSDE ID number -SSID ***must be 10 digits and start with \"19\" or \"20\"***</td><td>9999999999</td></tr><tr> <td>Student_Last_Name</td><td>35</td><td>VARCHAR</td><td>Student's last name</td><td>Smith</td></tr><tr> <td>Student_First_Name</td><td>35</td><td>VARCHAR</td><td>Student's first name</td><td>Jane</td></tr><tr> <td>Student_Date_of_Birth_Month</td><td>2</td><td>VARCHAR</td><td>Student birth date month. MM</td><td>05, 11</td></tr><tr> <td>Student_Date_of_Birth_Day</td><td>2</td><td>VARCHAR</td><td>Student birth date day. DD</td><td>03, 25</td></tr><tr> <td>Student_Date_of_Birth_Year</td><td>4</td><td>VARCHAR</td><td>Student birth date Year. YYYY</td><td>2015</td></tr><tr> <td>Reading_Teacher_Identifier</td><td>13</td><td>VARCHAR</td><td>Reading Teacher's ALSDE ID/TCHNumber. The teacher who is primarily responsible for Reading instruction of the student. (These are two names for the same number). ***must be in this format 3 letters, dash, 4 numbers, dash, 4 numbers***</td><td>XXX-9999-9999, NOJ-1234-5678</td></tr><tr> <td>Reading_Assessment_Name</td><td>15</td><td>VARCHAR</td><td>Unique identifier for Reading assessment. 
Vendor's name for overall assessment.</td><td>XXXX</td></tr><tr> <td>Reading_Administration_Mode</td><td>8</td><td>VARCHAR</td><td>This field indicates if the assessment was administered in an in-person (face-to-face) or a remote learning environment. The options are:</td><td>InPerson or Remote</td></tr><tr> <td>Reading_Benchmark_Period</td><td>3</td><td>VARCHAR</td><td>Benchmark period during the term the assessment was administered. Summer School will be SSS.</td><td>BOY, MOY or EOY (SSS for summer school)</td></tr><tr> <td>Reading_Date_Completed</td><td>10</td><td>VARCHAR</td><td>This is the date on which the assessment is completed MM/DD/YYYY</td><td>43962</td></tr><tr> <td>Reading_Extended_Time</td><td>2</td><td>VARCHAR</td><td>The field will contain a \"Y\" if the student was given more than the allotted time to finish the assessment or any subtest of the assessment as defined by the vendor in a standard administration.</td><td>Y</td></tr></tbody></table>",
"languages": [
"eng"
],
"filetype": "text/html"
}
}
]


@ -0,0 +1,32 @@
[
{
"type": "Table",
"element_id": "ca96108263324e9d865a98f19cf7c940",
"text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
"metadata": {
"category_depth": 1,
"page_number": 1,
"parent_id": "747587de72444235a68c768d544ff5f3",
"text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
"languages": [
"eng"
],
"filetype": "text/html"
}
},
{
"type": "Text",
"element_id": "0163a58539934b3aaca402c9e961b0d6",
"text": "REQUEST FOR PROPOSALS",
"metadata": {
"category_depth": 1,
"page_number": 1,
"parent_id": "747587de72444235a68c768d544ff5f3",
"text_as_html": "<h2 class=\"Subtitle\" id=\"0163a58539934b3aaca402c9e961b0d6\">REQUEST FOR PROPOSALS </h2>",
"languages": [
"eng"
],
"filetype": "text/html"
}
}
]


@ -0,0 +1,32 @@
[
{
"type": "Title",
"element_id": "0163a58539934b3aaca402c9e961b0d6",
"text": "REQUEST FOR PROPOSALS",
"metadata": {
"category_depth": 1,
"page_number": 1,
"parent_id": "747587de72444235a68c768d544ff5f3",
"text_as_html": "<h2 class=\"Subtitle\" id=\"0163a58539934b3aaca402c9e961b0d6\">REQUEST FOR PROPOSALS </h2>",
"languages": [
"eng"
],
"filetype": "text/html"
}
},
{
"type": "Table",
"element_id": "ca96108263324e9d865a98f19cf7c940",
"text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
"metadata": {
"category_depth": 1,
"page_number": 1,
"parent_id": "747587de72444235a68c768d544ff5f3",
"text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
"languages": [
"eng"
],
"filetype": "text/html"
}
}
]


@ -101,6 +101,13 @@ def parse_optional_datetime(datetime_str: Optional[str]) -> Optional[dt.datetime
return dt.datetime.fromisoformat(datetime_str) if datetime_str else None return dt.datetime.fromisoformat(datetime_str) if datetime_str else None
def input_path(rel_path: str) -> str:
"""Resolve the absolute-path to `rel_path` in the testfiles directory."""
testfiles_dir = pathlib.Path(__file__).parent / "testfiles"
file_path = testfiles_dir / rel_path
return str(file_path.resolve())
# ------------------------------------------------------------------------------------------------ # ------------------------------------------------------------------------------------------------
# MOCKING FIXTURES # MOCKING FIXTURES
# ------------------------------------------------------------------------------------------------ # ------------------------------------------------------------------------------------------------


@ -1,8 +1,8 @@
[ [
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "36385872440a208d3521a8a885d5f873", "element_id": "85002882dd396da0b1b82c925b002be5",
"text": "US Trustee Handbook\n\nCHAPTER 1\n\nINTRODUCTION\n\nCHAPTER 1 \u2013 INTRODUCTION\n\nA.\tPURPOSE", "text": "US Trustee Handbook\n\nCHAPTER 1\n\nINTRODUCTION\n\nCHAPTER 1 \u2013 INTRODUCTION\n\nA. PURPOSE",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -55,8 +55,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "91d26c5ec7f727ece12679cf6b80f90d", "element_id": "1abe685eb8dfed0f2266d6cf793d7e6b",
"text": "le 11 of the United States Code. 28 U.S.C. \u00a7 586(b). The Handbook, issued as part of our duties under 28 U.S.C. \u00a7 586, establishes or clarifies the", "text": "le 11 of the United States Code. 28 U.S.C. \u00a7 586(b). The Handbook, issued as part of our duties under 28 U.S.C. \u00a7 586, establishes or clarifies the",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -103,8 +103,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "20447c8f42ed2b919bd0e5707e7899ae", "element_id": "40588c4c1489058c4fec885f4696ebcc",
"text": "s, creditors, other parties in interest, and the United States Trustee. The Handbook does not present a full and complete statement of the law; it", "text": "s, creditors, other parties in interest, and the United States Trustee. The Handbook does not present a full and complete statement of the law; it",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -127,8 +127,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "e34c56af21b43f4179f996ddea901bc4", "element_id": "9ddf0b109cf940de5f575acc9d9758c8",
"text": "ment of the law; it should not be used as a substitute for legal research and analysis. The standing trustee must be familiar with relevant", "text": "ment of the law; it should not be used as a substitute for legal research and analysis. The standing trustee must be familiar with relevant provisions",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -151,8 +151,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "55e660e5b0d0ec6ee5476621e556d6c8", "element_id": "b7d1b42646393ca0f41af0e8ec48f9a9",
"text": "iliar with relevant provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules), any local bankruptcy rules, and case law. 11", "text": "relevant provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules), any local bankruptcy rules, and case law. 11 U.S.C. \u00a7 321,",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -175,8 +175,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "a9335be161a6a7a080ff78e4e07cbadb", "element_id": "9ee33f4141eca1f98ca4299d0fdfba31",
"text": ", and case law. 11 U.S.C. \u00a7 321, 28 U.S.C. \u00a7 586, 28 C.F.R. \u00a7 58.6(a)(3). Standing trustees are encouraged to follow Practice Tips identified in", "text": "w. 11 U.S.C. \u00a7 321, 28 U.S.C. \u00a7 586, 28 C.F.R. \u00a7 58.6(a)(3). Standing trustees are encouraged to follow Practice Tips identified in this Handbook but",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -199,8 +199,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "5f2d61a46e9d16ce346eacc25321a250", "element_id": "6da3b5e2a833fa5ab6685f0fa46d2d6f",
"text": "Tips identified in this Handbook but these are not considered mandatory.", "text": "n this Handbook but these are not considered mandatory.",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -246,8 +246,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "2ff156994a8c58d8a5c91918a543ec28", "element_id": "685600ed24c5b0e3b34e7d639d3b1959",
"text": "tcy Code and Rules, local rules, and orders of the court. The standing trustee should notify the United States Trustee whenever the provision of the", "text": "tcy Code and Rules, local rules, and orders of the court. The standing trustee should notify the United States Trustee whenever the provision of the",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -270,8 +270,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "7c43851f864b7ccc35150c93d06abe80", "element_id": "c998f5c10c9dac92e4d3624896a603c7",
"text": "he provision of the Handbook conflicts with the local rules or orders of the court. The standing trustee is accountable for all duties set forth in", "text": "he provision of the Handbook conflicts with the local rules or orders of the court. The standing trustee is accountable for all duties set forth in",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -294,8 +294,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "7caf69b806daa033d686fae6100f4d7c", "element_id": "d4b750e9af7167156f369b310a8cebb8",
"text": "duties set forth in this Handbook, but need not personally perform any duty unless otherwise indicated. All statutory references in this Handbook", "text": "duties set forth in this Handbook, but need not personally perform any duty unless otherwise indicated. All statutory references in this Handbook",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -365,8 +365,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "66ff9b9385d511ca7e71f1e6852d3221", "element_id": "8f411358790d6ee5b0d24f919206d3fd",
"text": "B.\tROLE OF THE UNITED STATES TRUSTEE", "text": "B. ROLE OF THE UNITED STATES TRUSTEE",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -388,8 +388,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "1876c502fcbb25fd7b978417aea8dded", "element_id": "6044d58375609c8802cfae16cef5cee9",
"text": "The Bankruptcy Reform Act of 1978 removed the bankruptcy judge from the responsibilities for daytoday administration of cases. Debtors, creditors,", "text": "The Bankruptcy Reform Act of 1978 removed the bankruptcy judge from the responsibilities for daytoday administration of cases. Debtors, creditors, and",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -411,8 +411,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "5f89702a93c3df34a62905e5dff5c54d", "element_id": "a4030396eaf54570462ed74f86e45bc8",
"text": "Debtors, creditors, and third parties with adverse interests to the trustee were concerned that the court, which previously appointed and supervised", "text": "ors, creditors, and third parties with adverse interests to the trustee were concerned that the court, which previously appointed and supervised the",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -435,8 +435,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "c916e417ed924c556baed9616c3f81ae", "element_id": "80e3b20fead224c85652bbdce327a28d",
"text": "nted and supervised the trustee, would not impartially adjudicate their rights as adversaries of that trustee. To address these concerns, judicial and", "text": "and supervised the trustee, would not impartially adjudicate their rights as adversaries of that trustee. To address these concerns, judicial and",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -483,8 +483,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "709927b67286cccaf8fb25d63667c277", "element_id": "39a3f1465d06269d2544ded43dc3a7df",
"text": "Many administrative functions formerly performed by the court were placed within the Department of Justice through the creation of the Program. Among", "text": "Many administrative functions formerly performed by the court were placed within the Department of Justice through the creation of the Program. Among",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -506,8 +506,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "509676fb8d4f77b5f270629dee7a2664", "element_id": "2872e5d0bea6ec1523eb9ae2c1c64add",
"text": "the Program. Among the administrative functions assigned to the United States Trustee were the appointment and supervision of chapter 13 trustees./", "text": "the Program. Among the administrative functions assigned to the United States Trustee were the appointment and supervision of chapter 13 trustees./",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -530,8 +530,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "7ced6d1ee6cc9478adfd8e2a613be42a", "element_id": "24e1076110b431b248b43b1fdaae5282",
"text": "apter 13 trustees./ This Handbook is issued under the authority of the Program\u2019s enabling statutes. ", "text": "apter 13 trustees./ This Handbook is issued under the authority of the Program\u2019s enabling statutes.",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -554,8 +554,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "2c82d3fa4252275d5309a640eb25cd68", "element_id": "158a80e29cfe6aa83a4931d955a8fa4f",
"text": "C.\tSTATUTORY DUTIES OF A STANDING TRUSTEE\t", "text": "C. STATUTORY DUTIES OF A STANDING TRUSTEE",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -577,8 +577,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "a819e32a65d1f545cb404fe3f6273357", "element_id": "e5fdcc6a007017354a9d708dc04fee02",
"text": "The standing trustee has a fiduciary responsibility to the bankruptcy estate. The standing trustee is more than a mere disbursing agent. The", "text": "The standing trustee has a fiduciary responsibility to the bankruptcy estate. The standing trustee is more than a mere disbursing agent. The standing",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -600,8 +600,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "9e98089003e3b42ed7f1c263335dee3c", "element_id": "0bf52e064da3ef4fb8b0a92d4b9fa694",
"text": "bursing agent. The standing trustee must be personally involved in the trustee operation. If the standing trustee is or becomes unable to perform", "text": "agent. The standing trustee must be personally involved in the trustee operation. If the standing trustee is or becomes unable to perform the duties",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -624,8 +624,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "d476b15e5336342b1da22d100849b23c", "element_id": "db297530e558410b89acd93c6b452b84",
"text": "s unable to perform the duties and responsibilities of a standing trustee, the standing trustee must immediately advise the United States Trustee. 28", "text": "perform the duties and responsibilities of a standing trustee, the standing trustee must immediately advise the United States Trustee. 28 U.S.C. \u00a7",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -648,8 +648,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "8f8c9c0919f7502bd2fabad0b12ad664", "element_id": "201bfacc211f0eb640e2830b8c29ae41",
"text": "States Trustee. 28 U.S.C. \u00a7 586(b), 28 C.F.R. \u00a7 58.4(b) referencing 28 C.F.R. \u00a7 58.3(b).", "text": "rustee. 28 U.S.C. \u00a7 586(b), 28 C.F.R. \u00a7 58.4(b) referencing 28 C.F.R. \u00a7 58.3(b).",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -695,8 +695,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "9864d90bf9febdd104e7eac4c56689ba", "element_id": "fd4c45036e8f17c27271f75944389724",
"text": "are set forth in 11 U.S.C. \u00a7 1302, which incorporates by reference some of the duties of chapter 7 trustees found in 11 U.S.C. \u00a7 704. These duties", "text": "are set forth in 11 U.S.C. \u00a7 1302, which incorporates by reference some of the duties of chapter 7 trustees found in 11 U.S.C. \u00a7 704. These duties",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {
@ -719,8 +719,8 @@
}, },
{ {
"type": "CompositeElement", "type": "CompositeElement",
"element_id": "a91f963bcd1c092bffb844453aafa499", "element_id": "a968d741409111b777fc123ef01f5407",
"text": "704. These duties include, but are not limited to, the following:", "text": "\u00a7 704. These duties include, but are not limited to, the following:",
"metadata": { "metadata": {
"data_source": { "data_source": {
"record_locator": { "record_locator": {


@ -1 +1 @@
__version__ = "0.16.11-dev0" # pragma: no cover __version__ = "0.16.11-dev1" # pragma: no cover


@ -43,9 +43,6 @@ Only operative for "by_title" chunking strategy.
BoundaryPredicate: TypeAlias = Callable[[Element], bool] BoundaryPredicate: TypeAlias = Callable[[Element], bool]
"""Detects when element represents crossing a semantic boundary like section or page.""" """Detects when element represents crossing a semantic boundary like section or page."""
PreChunk: TypeAlias = "TablePreChunk | TextPreChunk"
"""The kind of object produced by a pre-chunker."""
TextAndHtml: TypeAlias = tuple[str, str] TextAndHtml: TypeAlias = tuple[str, str]
@ -288,8 +285,13 @@ class PreChunker:
pre_chunk_builder = PreChunkBuilder(self._opts) pre_chunk_builder = PreChunkBuilder(self._opts)
for element in self._elements: for element in self._elements:
# -- start new pre-chunk when necessary -- # -- start new pre-chunk when necessary to uphold segregation guarantees --
if self._is_in_new_semantic_unit(element) or not pre_chunk_builder.will_fit(element): if (
# -- when the element begins a new semantic unit --
self._is_in_new_semantic_unit(element)
# -- or when next element won't fit --
or not pre_chunk_builder.will_fit(element)
):
yield from pre_chunk_builder.flush() yield from pre_chunk_builder.flush()
# -- add this element to the work-in-progress (WIP) pre-chunk -- # -- add this element to the work-in-progress (WIP) pre-chunk --
@ -320,8 +322,7 @@ class PreChunkBuilder:
the next element in the element stream. the next element in the element stream.
`.flush()` is used to build a PreChunk object from the accumulated elements. This method `.flush()` is used to build a PreChunk object from the accumulated elements. This method
returns an iterator that generates zero-or-one `TextPreChunk` or `TablePreChunk` object and is returns an iterator that generates zero-or-one `PreChunk` object and is used like so:
used like so:
yield from builder.flush() yield from builder.flush()
@ -355,15 +356,13 @@ class PreChunkBuilder:
boundary has been reached. Also to clear out a terminal pre-chunk at the end of an element boundary has been reached. Also to clear out a terminal pre-chunk at the end of an element
stream. stream.
""" """
if not self._elements: elements = self._elements
if not elements:
return return
pre_chunk = ( # -- copy element list, don't use original or it may change contents as builder proceeds --
TablePreChunk(self._elements[0], self._overlap_prefix, self._opts) pre_chunk = PreChunk(elements, self._overlap_prefix, self._opts)
if isinstance(self._elements[0], Table)
# -- copy list, don't use original or it may change contents as builder proceeds --
else TextPreChunk(list(self._elements), self._overlap_prefix, self._opts)
)
# -- clear builder before yield so we're not sensitive to the timing of how/when this # -- clear builder before yield so we're not sensitive to the timing of how/when this
# -- iterator is exhausted and can add elements for the next pre-chunk immediately. # -- iterator is exhausted and can add elements for the next pre-chunk immediately.
self._reset_state(pre_chunk.overlap_tail) self._reset_state(pre_chunk.overlap_tail)
@ -384,12 +383,6 @@ class PreChunkBuilder:
# -- an empty pre-chunk will accept any element (including an oversized-element) -- # -- an empty pre-chunk will accept any element (including an oversized-element) --
if len(self._elements) == 0: if len(self._elements) == 0:
return True return True
# -- a `Table` will not fit in a non-empty pre-chunk --
if isinstance(element, Table):
return False
# -- no element will fit in a pre-chunk that already contains a `Table` element --
if isinstance(self._elements[0], Table):
return False
# -- a pre-chunk that already exceeds the soft-max is considered "full" -- # -- a pre-chunk that already exceeds the soft-max is considered "full" --
if self._text_length > self._opts.soft_max: if self._text_length > self._opts.soft_max:
return False return False
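Reviewer sketch: with the two `Table` checks removed, the visible part of `will_fit()` depends only on size. The snippet below is a minimal illustration under stated assumptions, not the library code: `fits()` and its parameters are hypothetical names, and the final remaining-space check stands in for the part of the method this hunk does not show.

```python
from __future__ import annotations


def fits(
    current_lengths: list[int],
    next_length: int,
    soft_max: int,
    hard_max: int,
    sep_len: int = 2,
) -> bool:
    """Hypothetical stand-in for the size rules of `will_fit()` after this change."""
    # -- an empty pre-chunk accepts any element, including an oversized one --
    if not current_lengths:
        return True
    # -- length of the text accumulated so far, counting separators between elements --
    current = sum(current_lengths) + sep_len * (len(current_lengths) - 1)
    # -- a pre-chunk that already exceeds the soft-max is considered "full" --
    if current > soft_max:
        return False
    # -- assumption: otherwise the element fits when the total stays within the hard-max --
    return current + sep_len + next_length <= hard_max
```

Note that the element type never appears in this sketch: a table's text length is weighed like that of any other element, which is the point of the relaxed rule.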
@ -429,19 +422,291 @@ class PreChunkBuilder:
# ================================================================================================ # ================================================================================================
# PRE-CHUNK SUB-TYPES # PRE-CHUNK
# ================================================================================================ # ================================================================================================
class TablePreChunk: class PreChunk:
"""A pre-chunk composed of a single Table element.""" """Sequence of elements staged to form a single chunk.
This object is purposely immutable.
"""
def __init__(
self, elements: Iterable[Element], overlap_prefix: str, opts: ChunkingOptions
) -> None:
self._elements = list(elements)
self._overlap_prefix = overlap_prefix
self._opts = opts
def __eq__(self, other: Any) -> bool:
if not isinstance(other, PreChunk):
return False
return self._overlap_prefix == other._overlap_prefix and self._elements == other._elements
def can_combine(self, pre_chunk: PreChunk) -> bool:
"""True when `pre_chunk` can be combined with this one without exceeding size limits."""
if len(self._text) >= self._opts.combine_text_under_n_chars:
return False
# -- avoid duplicating length computations by doing a trial-combine which is just as
# -- efficient and definitely more robust than hoping two different computations of combined
# -- length continue to get the same answer as the code evolves. Only possible because
# -- `.combine()` is non-mutating.
combined_len = len(self.combine(pre_chunk)._text)
return combined_len <= self._opts.hard_max
def combine(self, other_pre_chunk: PreChunk) -> PreChunk:
"""Return new `PreChunk` that combines this and `other_pre_chunk`."""
# -- combined pre-chunk gets the overlap-prefix of the first pre-chunk. The second overlap
# -- is automatically incorporated at the end of the first chunk, where it originated.
return PreChunk(
self._elements + other_pre_chunk._elements,
overlap_prefix=self._overlap_prefix,
opts=self._opts,
)
def iter_chunks(self) -> Iterator[CompositeElement | Table | TableChunk]:
"""Form this pre-chunk into one or more chunk elements maxlen or smaller.
When the total size of the pre-chunk will fit in the chunking window, a single chunk is
emitted. When this pre-chunk contains an oversized element (always isolated), it is split
into two or more chunks that each fit the chunking window.
"""
# -- a one-table-only pre-chunk is handled specially, by `_TableChunker`, mainly because
# -- it may need to be split into multiple `TableChunk` elements and that operation is
# -- quite specialized.
if len(self._elements) == 1 and isinstance(self._elements[0], Table):
yield from _TableChunker.iter_chunks(
self._elements[0], self._overlap_prefix, self._opts
)
else:
yield from _Chunker.iter_chunks(self._elements, self._text, self._opts)
@lazyproperty
def overlap_tail(self) -> str:
"""The portion of this chunk's text to be repeated as a prefix in the next chunk.
This value is the empty-string ("") when either the `.overlap` length option is `0` or
`.overlap_all` is `False`. When there is a text value, it is stripped of both leading and
trailing whitespace.
"""
overlap = self._opts.inter_chunk_overlap
return self._text[-overlap:].strip() if overlap else ""
def _iter_text_segments(self) -> Iterator[str]:
"""Generate overlap text and each element text segment in order.
Empty text segments are not included.
"""
if self._overlap_prefix:
yield self._overlap_prefix
for e in self._elements:
text = " ".join(e.text.strip().split())
if not text:
continue
yield text
@lazyproperty
def _text(self) -> str:
"""The concatenated text of all elements in this pre-chunk, including any overlap.
Whitespace is normalized to a single space. The text of each element is separated from
that of the next by a blank line ("\n\n").
"""
return self._opts.text_separator.join(self._iter_text_segments())
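Reviewer sketch: the text assembly and overlap logic of `PreChunk` can be exercised in isolation. The functions below are hypothetical stand-ins that take plain strings instead of elements and options, and the `"\n\n"` default separator is an assumption taken from the `_text` docstring.

```python
from __future__ import annotations

from typing import Iterator


def iter_text_segments(overlap_prefix: str, element_texts: list[str]) -> Iterator[str]:
    """Generate the overlap prefix, then each non-empty, whitespace-normalized text."""
    if overlap_prefix:
        yield overlap_prefix
    for raw in element_texts:
        # -- collapse any run of whitespace (tabs, newlines) to a single space --
        text = " ".join(raw.strip().split())
        if text:
            yield text


def pre_chunk_text(
    overlap_prefix: str, element_texts: list[str], separator: str = "\n\n"
) -> str:
    """Join the segments with `separator`, as `PreChunk._text` does."""
    return separator.join(iter_text_segments(overlap_prefix, element_texts))


def overlap_tail(text: str, overlap: int) -> str:
    """Trailing `overlap` characters of `text`, stripped, or "" when overlap is 0."""
    return text[-overlap:].strip() if overlap else ""
```

Because a `Table` contributes its `.text` like any other element, the same joining applies whether or not the pre-chunk contains a table.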
# ================================================================================================
# CHUNKING HELPER/SPLITTERS
# ================================================================================================
class _Chunker:
"""Forms chunks from a pre-chunk other than one containing only a `Table`.
Produces zero-or-more `CompositeElement` objects.
"""
def __init__(self, elements: Iterable[Element], text: str, opts: ChunkingOptions) -> None:
self._elements = list(elements)
self._text = text
self._opts = opts
@classmethod
def iter_chunks(
cls, elements: Iterable[Element], text: str, opts: ChunkingOptions
) -> Iterator[CompositeElement]:
"""Form zero or more chunks from `elements`.
One `CompositeElement` is produced when all `elements` will fit. Otherwise there is a
single `Text`-subtype element and chunks are formed by splitting.
"""
return cls(elements, text, opts)._iter_chunks()
def _iter_chunks(self) -> Iterator[CompositeElement]:
"""Form zero or more chunks from `elements`."""
# -- a pre-chunk containing no text (maybe only a PageBreak element for example) does not
# -- generate any chunks.
if not self._text:
return
# -- `split()` is the text-splitting function used to split an oversized element --
split = self._opts.split
# -- emit first chunk --
s, remainder = split(self._text)
yield CompositeElement(text=s, metadata=self._consolidated_metadata)
# -- an oversized pre-chunk will have a remainder, split that up into additional chunks.
# -- Note these get continuation_metadata which includes is_continuation=True.
while remainder:
s, remainder = split(remainder)
yield CompositeElement(text=s, metadata=self._continuation_metadata)
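Reviewer sketch: the emit-first-then-split-remainder loop above can be tried with a toy splitter. `make_split()` is a hypothetical, much simplified substitute for `opts.split` (it breaks on the last space inside the window and ignores overlap), and the boolean in each yielded pair plays the role of `is_continuation`.

```python
from __future__ import annotations

from typing import Callable, Iterator


def make_split(maxlen: int) -> Callable[[str], tuple[str, str]]:
    """Return a toy `split(text) -> (chunk, remainder)` function (assumed behavior)."""

    def split(text: str) -> tuple[str, str]:
        if len(text) <= maxlen:
            return text, ""
        # -- prefer to break on the last space inside the window, else cut hard --
        cut = text.rfind(" ", 0, maxlen + 1)
        if cut <= 0:
            cut = maxlen
        return text[:cut].rstrip(), text[cut:].lstrip()

    return split


def iter_chunk_texts(
    text: str, split: Callable[[str], tuple[str, str]]
) -> Iterator[tuple[str, bool]]:
    """Generate (chunk_text, is_continuation) pairs, mirroring `_iter_chunks()`."""
    # -- a pre-chunk containing no text produces no chunks --
    if not text:
        return
    s, remainder = split(text)
    yield s, False
    # -- only an oversized pre-chunk has a remainder; later pieces are continuations --
    while remainder:
        s, remainder = split(remainder)
        yield s, True
```

For example, `list(iter_chunk_texts("aaa bbb ccc", make_split(7)))` gives `[("aaa bbb", False), ("ccc", True)]`.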
@lazyproperty
def _all_metadata_values(self) -> dict[str, list[Any]]:
"""Collection of all populated metadata values across elements.
The resulting dict has one key for each `ElementMetadata` field that had a non-None value in
at least one of the elements in this pre-chunk. The value of that key is a list of all those
populated values, in element order, for example:
{
"filename": ["sample.docx", "sample.docx"],
"languages": [["lat"], ["lat", "eng"]]
...
}
This preprocessing step provides the input for a specified consolidation strategy that will
resolve the list of values for each field to a single consolidated value.
"""
def iter_populated_fields(metadata: ElementMetadata) -> Iterator[tuple[str, Any]]:
"""(field_name, value) pair for each non-None field in single `ElementMetadata`."""
return (
(field_name, value)
for field_name, value in metadata.known_fields.items()
if value is not None
)
field_values: DefaultDict[str, list[Any]] = collections.defaultdict(list)
# -- collect all non-None field values in a list for each field, in element-order --
for e in self._elements:
for field_name, value in iter_populated_fields(e.metadata):
field_values[field_name].append(value)
return dict(field_values)
@lazyproperty
def _consolidated_metadata(self) -> ElementMetadata:
"""Metadata applicable to this pre-chunk as a single chunk.
Formed by applying consolidation rules to all metadata fields across the elements of this
pre-chunk.
For the sake of consistency, the same rules are applied (for example, for dropping values)
to a single-element pre-chunk too, even though metadata for such a pre-chunk is already
"consolidated".
"""
consolidated_metadata = ElementMetadata(**self._meta_kwargs)
if self._opts.include_orig_elements:
consolidated_metadata.orig_elements = self._orig_elements
return consolidated_metadata
@lazyproperty
def _continuation_metadata(self) -> ElementMetadata:
"""Metadata applicable to the second and later text-split chunks of the pre-chunk.
The same metadata as the first text-split chunk but includes `.is_continuation = True`.
Unused for non-oversized pre-chunks since those are not subject to text-splitting.
"""
# -- we need to make a copy, otherwise adding a field would also change metadata value
# -- already assigned to another chunk (e.g. the first text-split chunk). Deep-copy is not
# -- required though since we're not changing any collection fields.
continuation_metadata = copy.copy(self._consolidated_metadata)
continuation_metadata.is_continuation = True
return continuation_metadata
@lazyproperty
def _meta_kwargs(self) -> dict[str, Any]:
"""The consolidated metadata values as a dict suitable for constructing ElementMetadata.
This is where consolidation strategies are actually applied. The output is suitable for use
in constructing an `ElementMetadata` object like `ElementMetadata(**self._meta_kwargs)`.
"""
CS = ConsolidationStrategy
field_consolidation_strategies = ConsolidationStrategy.field_consolidation_strategies()
def iter_kwarg_pairs() -> Iterator[tuple[str, Any]]:
"""Generate (field-name, value) pairs for each field in consolidated metadata."""
for field_name, values in self._all_metadata_values.items():
strategy = field_consolidation_strategies.get(field_name)
if strategy is CS.FIRST:
yield field_name, values[0]
# -- concatenate lists from each element that had one, in order --
elif strategy is CS.LIST_CONCATENATE:
yield field_name, sum(values, cast("list[Any]", []))
# -- union lists from each element, preserving order of appearance --
elif strategy is CS.LIST_UNIQUE:
# -- Python 3.7+ maintains dict insertion order --
ordered_unique_keys = {key: None for val_list in values for key in val_list}
yield field_name, list(ordered_unique_keys.keys())
elif strategy is CS.STRING_CONCATENATE:
yield field_name, " ".join(val.strip() for val in values)
elif strategy is CS.DROP:
continue
else: # pragma: no cover
# -- not likely to hit this since we have a test in `text_elements.py` that
# -- ensures every `ElementMetadata` field has an assigned strategy.
raise NotImplementedError(
f"metadata field {repr(field_name)} has no defined consolidation strategy"
)
return dict(iter_kwarg_pairs())
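Reviewer sketch: the consolidation branches above can be checked against plain dicts. The `Strategy` enum and `consolidate()` below are hypothetical stand-ins for `ConsolidationStrategy` and `_meta_kwargs`, and the field-to-strategy mapping in the usage example is invented for illustration; it does not claim to match the library's actual assignments.

```python
from __future__ import annotations

import enum
from typing import Any


class Strategy(enum.Enum):
    FIRST = "first"
    LIST_CONCATENATE = "list_concatenate"
    LIST_UNIQUE = "list_unique"
    STRING_CONCATENATE = "string_concatenate"
    DROP = "drop"


def consolidate(
    values_by_field: dict[str, list[Any]], strategies: dict[str, Strategy]
) -> dict[str, Any]:
    """Resolve each field's list of per-element values to one consolidated value."""
    result: dict[str, Any] = {}
    for name, values in values_by_field.items():
        strategy = strategies.get(name)
        if strategy is Strategy.FIRST:
            result[name] = values[0]
        elif strategy is Strategy.LIST_CONCATENATE:
            # -- concatenate lists from each element that had one, in order --
            result[name] = [item for v in values for item in v]
        elif strategy is Strategy.LIST_UNIQUE:
            # -- dict keys keep insertion order, so this is an order-preserving union --
            result[name] = list(dict.fromkeys(item for v in values for item in v))
        elif strategy is Strategy.STRING_CONCATENATE:
            result[name] = " ".join(v.strip() for v in values)
        elif strategy is Strategy.DROP:
            continue
        else:
            raise NotImplementedError(f"field {name!r} has no consolidation strategy")
    return result


# -- usage with an invented mapping, for illustration only --
consolidated = consolidate(
    {
        "filename": ["sample.docx", "sample.docx"],
        "languages": [["lat"], ["lat", "eng"]],
        "scratch": ["x"],
    },
    {
        "filename": Strategy.FIRST,
        "languages": Strategy.LIST_UNIQUE,
        "scratch": Strategy.DROP,
    },
)
```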
@lazyproperty
def _orig_elements(self) -> list[Element]:
"""The `.metadata.orig_elements` value for chunks formed from this pre-chunk."""
def iter_orig_elements():
for e in self._elements:
if e.metadata.orig_elements is None:
yield e
continue
# -- make copy of any element we're going to mutate because these elements don't
# -- belong to us (the user may have downstream purposes for them).
orig_element = copy.copy(e)
# -- prevent recursive .orig_elements when element is a chunk (has orig-elements of
# -- its own)
orig_element.metadata.orig_elements = None
yield orig_element
return list(iter_orig_elements())
class _TableChunker:
"""Responsible for forming chunks, especially splits, from a single-table pre-chunk.
Table splitting is specialized because we recursively split on an even row, cell, text
boundary. This object encapsulates those details.
"""
def __init__(self, table: Table, overlap_prefix: str, opts: ChunkingOptions) -> None: def __init__(self, table: Table, overlap_prefix: str, opts: ChunkingOptions) -> None:
self._table = table self._table = table
self._overlap_prefix = overlap_prefix self._overlap_prefix = overlap_prefix
self._opts = opts self._opts = opts
def iter_chunks(self) -> Iterator[Table | TableChunk]: @classmethod
def iter_chunks(
cls, table: Table, overlap_prefix: str, opts: ChunkingOptions
) -> Iterator[Table | TableChunk]:
"""Split this pre-chunk into `Table` or `TableChunk` objects maxlen or smaller."""
return cls(table, overlap_prefix, opts)._iter_chunks()
def _iter_chunks(self) -> Iterator[Table | TableChunk]:
"""Split this pre-chunk into `Table` or `TableChunk` objects maxlen or smaller.""" """Split this pre-chunk into `Table` or `TableChunk` objects maxlen or smaller."""
# -- A table with no non-whitespace text produces no chunks -- # -- A table with no non-whitespace text produces no chunks --
if not self._table_text: if not self._table_text:
@ -459,7 +724,7 @@ class TablePreChunk:
# -- When there's no HTML, split it like a normal element. Also fall back to text-only # -- When there's no HTML, split it like a normal element. Also fall back to text-only
# -- chunks when `max_characters` is less than 50. `.text_as_html` metadata is impractical # -- chunks when `max_characters` is less than 50. `.text_as_html` metadata is impractical
# -- for a chunking window that small because the 33 characterss of HTML overhead for each # -- for a chunking window that small because the 33 characters of HTML overhead for each
# -- chunk (`<table><tr><td>...</td></tr></table>`) would produce a very large number of # -- chunk (`<table><tr><td>...</td></tr></table>`) would produce a very large number of
# -- very small chunks. # -- very small chunks.
if not self._html or self._opts.hard_max < 50: if not self._html or self._opts.hard_max < 50:
@ -469,17 +734,6 @@ class TablePreChunk:
# -- otherwise, form splits with "synchronized" text and html -- # -- otherwise, form splits with "synchronized" text and html --
yield from self._iter_text_and_html_table_chunks() yield from self._iter_text_and_html_table_chunks()
@lazyproperty
def overlap_tail(self) -> str:
"""The portion of this chunk's text to be repeated as a prefix in the next chunk.
This value is the empty-string ("") when either the `.overlap` length option is `0` or
`.overlap_all` is `False`. When there is a text value, it is stripped of both leading and
trailing whitespace.
"""
overlap = self._opts.inter_chunk_overlap
return self._text_with_overlap[-overlap:].strip() if overlap else ""
@lazyproperty @lazyproperty
def _html(self) -> str: def _html(self) -> str:
"""The compactified HTML for this table when it has text-as-HTML. """The compactified HTML for this table when it has text-as-HTML.
@ -517,7 +771,7 @@ class TablePreChunk:
is_continuation = False is_continuation = False
for text, html in _TableSplitter.iter_subtables(html_table, self._opts): for text, html in _HtmlTableSplitter.iter_subtables(html_table, self._opts):
metadata = self._metadata metadata = self._metadata
metadata.text_as_html = html metadata.text_as_html = html
# -- second and later chunks get `.metadata.is_continuation = True` -- # -- second and later chunks get `.metadata.is_continuation = True` --
@ -527,7 +781,11 @@ class TablePreChunk:
yield TableChunk(text=text, metadata=metadata) yield TableChunk(text=text, metadata=metadata)
def _iter_text_only_table_chunks(self) -> Iterator[TableChunk]: def _iter_text_only_table_chunks(self) -> Iterator[TableChunk]:
"""Split oversized text-only table (no text-as-html) into chunks.""" """Split oversized text-only table (no text-as-html) into chunks.
`.metadata.text_as_html` is optional, not included when `infer_table_structure` is
`False`.
"""
text_remainder = self._text_with_overlap text_remainder = self._text_with_overlap
split = self._opts.split split = self._opts.split
is_continuation = False is_continuation = False
@ -599,229 +857,12 @@ class TablePreChunk:
return overlap_prefix + "\n" + table_text if overlap_prefix else table_text return overlap_prefix + "\n" + table_text if overlap_prefix else table_text
class TextPreChunk:
"""A sequence of elements that belong to the same semantic unit within a document.
The name "section" derives from the idea of a document-section, a heading followed by the
paragraphs "under" that heading. That structure is not found in all documents and actual section
content can vary, but that's the concept.
This object is purposely immutable.
"""
def __init__(
self, elements: Iterable[Element], overlap_prefix: str, opts: ChunkingOptions
) -> None:
self._elements = list(elements)
self._overlap_prefix = overlap_prefix
self._opts = opts
def __eq__(self, other: Any) -> bool:
if not isinstance(other, TextPreChunk):
return False
return self._overlap_prefix == other._overlap_prefix and self._elements == other._elements
def can_combine(self, pre_chunk: TextPreChunk) -> bool:
"""True when `pre_chunk` can be combined with this one without exceeding size limits."""
if len(self._text) >= self._opts.combine_text_under_n_chars:
return False
# -- avoid duplicating length computations by doing a trial-combine which is just as
# -- efficient and definitely more robust than hoping two different computations of combined
# -- length continue to get the same answer as the code evolves. Only possible because
# -- `.combine()` is non-mutating.
combined_len = len(self.combine(pre_chunk)._text)
return combined_len <= self._opts.hard_max
def combine(self, other_pre_chunk: TextPreChunk) -> TextPreChunk:
"""Return new `TextPreChunk` that combines this and `other_pre_chunk`."""
# -- combined pre-chunk gets the overlap-prefix of the first pre-chunk. The second overlap
# -- is automatically incorporated at the end of the first chunk, where it originated.
return TextPreChunk(
self._elements + other_pre_chunk._elements,
overlap_prefix=self._overlap_prefix,
opts=self._opts,
)
def iter_chunks(self) -> Iterator[CompositeElement]:
"""Split this pre-chunk into one or more `CompositeElement` objects maxlen or smaller."""
# -- a pre-chunk containing no text (maybe only a PageBreak element for example) does not
# -- generate any chunks.
if not self._text:
return
split = self._opts.split
# -- emit first chunk --
s, remainder = split(self._text)
yield CompositeElement(text=s, metadata=self._consolidated_metadata)
# -- an oversized pre-chunk will have a remainder, split that up into additional chunks.
# -- Note these get continuation_metadata which includes is_continuation=True.
while remainder:
s, remainder = split(remainder)
yield CompositeElement(text=s, metadata=self._continuation_metadata)
@lazyproperty
def overlap_tail(self) -> str:
"""The portion of this chunk's text to be repeated as a prefix in the next chunk.
This value is the empty-string ("") when either the `.overlap` length option is `0` or
`.overlap_all` is `False`. When there is a text value, it is stripped of both leading and
trailing whitespace.
"""
overlap = self._opts.inter_chunk_overlap
return self._text[-overlap:].strip() if overlap else ""
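
The `if overlap else ""` guard in `overlap_tail` matters because a slice of `[-0:]` returns the whole string, not an empty one. A small standalone sketch follows; the free function `overlap_tail` here is a hypothetical illustration, not the property itself.

```python
def overlap_tail(text: str, overlap: int) -> str:
    """Last `overlap` characters of `text`, stripped; empty when `overlap` is 0."""
    # -- without the guard, text[-0:] would be text[0:], i.e. the entire string --
    return text[-overlap:].strip() if overlap else ""
```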
@lazyproperty
def _all_metadata_values(self) -> dict[str, list[Any]]:
"""Collection of all populated metadata values across elements.
The resulting dict has one key for each `ElementMetadata` field that had a non-None value in
at least one of the elements in this pre-chunk. The value of that key is a list of all those
populated values, in element order, for example:
{
"filename": ["sample.docx", "sample.docx"],
"languages": [["lat"], ["lat", "eng"]]
...
}
This preprocessing step provides the input for a specified consolidation strategy that will
resolve the list of values for each field to a single consolidated value.
"""
def iter_populated_fields(metadata: ElementMetadata) -> Iterator[tuple[str, Any]]:
"""(field_name, value) pair for each non-None field in single `ElementMetadata`."""
return (
(field_name, value)
for field_name, value in metadata.known_fields.items()
if value is not None
)
field_values: DefaultDict[str, list[Any]] = collections.defaultdict(list)
# -- collect all non-None field values in a list for each field, in element-order --
for e in self._elements:
for field_name, value in iter_populated_fields(e.metadata):
field_values[field_name].append(value)
return dict(field_values)
@lazyproperty
def _consolidated_metadata(self) -> ElementMetadata:
"""Metadata applicable to this pre-chunk as a single chunk.
Formed by applying consolidation rules to all metadata fields across the elements of this
pre-chunk.
For the sake of consistency, the same rules are applied (for example, for dropping values)
to a single-element pre-chunk too, even though metadata for such a pre-chunk is already
"consolidated".
"""
consolidated_metadata = ElementMetadata(**self._meta_kwargs)
if self._opts.include_orig_elements:
consolidated_metadata.orig_elements = self._orig_elements
return consolidated_metadata
@lazyproperty
def _continuation_metadata(self) -> ElementMetadata:
"""Metadata applicable to the second and later text-split chunks of the pre-chunk.
The same metadata as the first text-split chunk but includes `.is_continuation = True`.
Unused for non-oversized pre-chunks since those are not subject to text-splitting.
"""
# -- we need to make a copy, otherwise adding a field would also change metadata value
# -- already assigned to another chunk (e.g. the first text-split chunk). Deep-copy is not
# -- required though since we're not changing any collection fields.
continuation_metadata = copy.copy(self._consolidated_metadata)
continuation_metadata.is_continuation = True
return continuation_metadata
def _iter_text_segments(self) -> Iterator[str]:
"""Generate overlap text and each element text segment in order.
Empty text segments are not included.
"""
if self._overlap_prefix:
yield self._overlap_prefix
for e in self._elements:
if not e.text:
continue
yield e.text
@lazyproperty
def _meta_kwargs(self) -> dict[str, Any]:
"""The consolidated metadata values as a dict suitable for constructing ElementMetadata.
This is where consolidation strategies are actually applied. The output is suitable for use
in constructing an `ElementMetadata` object like `ElementMetadata(**self._meta_kwargs)`.
"""
CS = ConsolidationStrategy
field_consolidation_strategies = ConsolidationStrategy.field_consolidation_strategies()
def iter_kwarg_pairs() -> Iterator[tuple[str, Any]]:
"""Generate (field-name, value) pairs for each field in consolidated metadata."""
for field_name, values in self._all_metadata_values.items():
strategy = field_consolidation_strategies.get(field_name)
if strategy is CS.FIRST:
yield field_name, values[0]
# -- concatenate lists from each element that had one, in order --
elif strategy is CS.LIST_CONCATENATE:
yield field_name, sum(values, cast("list[Any]", []))
# -- union lists from each element, preserving order of appearance --
elif strategy is CS.LIST_UNIQUE:
# -- Python 3.7+ maintains dict insertion order --
ordered_unique_keys = {key: None for val_list in values for key in val_list}
yield field_name, list(ordered_unique_keys.keys())
elif strategy is CS.STRING_CONCATENATE:
yield field_name, " ".join(val.strip() for val in values)
elif strategy is CS.DROP:
continue
else: # pragma: no cover
# -- not likely to hit this since we have a test in `text_elements.py` that
# -- ensures every ElementMetadata field has an assigned strategy.
raise NotImplementedError(
f"metadata field {repr(field_name)} has no defined consolidation strategy"
)
return dict(iter_kwarg_pairs())
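
The list and string consolidation branches above can be tried in isolation. This is a sketch using hypothetical free functions (`concat_lists`, `unique_in_order`, `join_strings`) that mirror the expressions in `iter_kwarg_pairs()`; it is not part of the module.

```python
from typing import Any, List


def concat_lists(values: List[List[Any]]) -> List[Any]:
    """LIST_CONCATENATE-style: append each element's list, in element order."""
    return sum(values, [])


def unique_in_order(values: List[List[Any]]) -> List[Any]:
    """LIST_UNIQUE-style: union of the lists, keeping first-appearance order."""
    # -- a dict preserves insertion order (Python 3.7+), so it acts as an ordered set --
    return list({item: None for val_list in values for item in val_list})


def join_strings(values: List[str]) -> str:
    """STRING_CONCATENATE-style: strip each value, then join with a single space."""
    return " ".join(val.strip() for val in values)
```

Items passed to `unique_in_order` need to be hashable, since they are used as dict keys.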
@lazyproperty
def _orig_elements(self) -> list[Element]:
"""The `.metadata.orig_elements` value for chunks formed from this pre-chunk."""
def iter_orig_elements():
for e in self._elements:
if e.metadata.orig_elements is None:
yield e
continue
# -- make copy of any element we're going to mutate because these elements don't
# -- belong to us (the user may have downstream purposes for them).
orig_element = copy.copy(e)
# -- prevent recursive .orig_elements when element is a chunk (has orig-elements of
# -- its own)
orig_element.metadata.orig_elements = None
yield orig_element
return list(iter_orig_elements())
@lazyproperty
def _text(self) -> str:
"""The concatenated text of all elements in this pre-chunk.
Each element-text is separated from the next by a blank line ("\n\n").
"""
text_separator = self._opts.text_separator
return text_separator.join(self._iter_text_segments())
# ================================================================================================ # ================================================================================================
# PRE-CHUNK SPLITTERS # HTML SPLITTERS
# ================================================================================================ # ================================================================================================
class _TableSplitter: class _HtmlTableSplitter:
"""Produces (text, html) pairs for a `<table>` HtmlElement. """Produces (text, html) pairs for a `<table>` HtmlElement.
Each chunk contains a whole number of rows whenever possible. An oversized row is split on an Each chunk contains a whole number of rows whenever possible. An oversized row is split on an
@ -1040,7 +1081,7 @@ class _CellAccumulator:
def will_fit(self, cell: HtmlCell) -> bool: def will_fit(self, cell: HtmlCell) -> bool:
"""True when `cell` will fit within remaining space left by accumulated cells.""" """True when `cell` will fit within remaining space left by accumulated cells."""
return self._remaining_space >= len(cell.html) return self._remaining_space >= len(cell.text)
def _iter_cell_texts(self) -> Iterator[str]: def _iter_cell_texts(self) -> Iterator[str]:
"""Generate contents of each accumulated cell as a separate string. """Generate contents of each accumulated cell as a separate string.
@ -1054,10 +1095,11 @@ class _CellAccumulator:
@property @property
def _remaining_space(self) -> int: def _remaining_space(self) -> int:
"""Number of characters remaining when accumulated cells are formed into HTML.""" """Number of characters remaining when text of accumulated cells is joined."""
# -- 24 is `len("<table><tr></tr></table>")`, the overhead in addition to `<td>` # -- separators are one space (" ") at the end of each cell's text, including last one to
# -- HTML fragments # -- account for space before prospective next cell.
return self._maxlen - 24 - sum(len(c.html) for c in self._cells) separators_len = len(self._cells)
return self._maxlen - separators_len - sum(len(c.text) for c in self._cells)
class _RowAccumulator: class _RowAccumulator:
@ -1087,7 +1129,7 @@ class _RowAccumulator:
def will_fit(self, row: HtmlRow) -> bool: def will_fit(self, row: HtmlRow) -> bool:
"""True when `row` will fit within remaining space left by accumulated rows.""" """True when `row` will fit within remaining space left by accumulated rows."""
return self._remaining_space >= len(row.html) return self._remaining_space >= row.text_len
def _iter_cell_texts(self) -> Iterator[str]: def _iter_cell_texts(self) -> Iterator[str]:
"""Generate contents of each row cell as a separate string. """Generate contents of each row cell as a separate string.
@ -1100,8 +1142,10 @@ class _RowAccumulator:
@property @property
def _remaining_space(self) -> int: def _remaining_space(self) -> int:
"""Number of characters remaining when accumulated rows are formed into HTML.""" """Number of characters remaining when accumulated rows are formed into HTML."""
# -- 15 is `len("<table></table>")`, the overhead in addition to `<tr>` HTML fragments -- # -- separators are one space (" ") at the end of each row's text, including last one to
return self._maxlen - 15 - sum(len(r.html) for r in self._rows) # -- account for space before prospective next row.
separators_len = len(self._rows)
return self._maxlen - separators_len - sum(r.text_len for r in self._rows)
# ================================================================================================ # ================================================================================================
@ -1117,16 +1161,10 @@ class PreChunkCombiner:
self._opts = opts self._opts = opts
def iter_combined_pre_chunks(self) -> Iterator[PreChunk]: def iter_combined_pre_chunks(self) -> Iterator[PreChunk]:
"""Generate pre-chunk objects, combining TextPreChunk objects when they'll fit in window.""" """Generate pre-chunk objects, combining `PreChunk` objects when they'll fit in window."""
accum = TextPreChunkAccumulator(self._opts) accum = _PreChunkAccumulator(self._opts)
for pre_chunk in self._pre_chunks: for pre_chunk in self._pre_chunks:
# -- a table pre-chunk is never combined --
if isinstance(pre_chunk, TablePreChunk):
yield from accum.flush()
yield pre_chunk
continue
# -- finish accumulating pre-chunk when it's full -- # -- finish accumulating pre-chunk when it's full --
if not accum.will_fit(pre_chunk): if not accum.will_fit(pre_chunk):
yield from accum.flush() yield from accum.flush()
@ -1136,39 +1174,37 @@ class PreChunkCombiner:
yield from accum.flush() yield from accum.flush()
class TextPreChunkAccumulator: class _PreChunkAccumulator:
"""Accumulates, measures, and combines text pre-chunks. """Accumulates, measures, and combines pre-chunks.
Used for combining pre-chunks for chunking strategies like "by-title" that can potentially Used for combining pre-chunks for chunking strategies like "by-title" that can potentially
produce undersized chunks and offer the `combine_text_under_n_chars` option. Note that only produce undersized chunks and offer the `combine_text_under_n_chars` option.
sequential `TextPreChunk` objects can be combined. A `TablePreChunk` is never combined with
another pre-chunk.
Provides `.add_pre_chunk()` allowing a pre-chunk to be added to the chunk and provides Provides `.add_pre_chunk()` allowing a pre-chunk to be added to the chunk and provides
monitoring properties `.remaining_space` and `.text_length` suitable for deciding whether to add monitoring properties `.remaining_space` and `.text_length` suitable for deciding whether to add
another pre-chunk. another pre-chunk.
`.flush()` is used to combine the accumulated pre-chunks into a single `TextPreChunk` object. `.flush()` is used to combine the accumulated pre-chunks into a single `PreChunk` object.
This method returns an iterator that generates zero-or-one `TextPreChunk` objects and is used This method returns an iterator that generates zero-or-one `PreChunk` objects and is used
like so: like so:
yield from accum.flush() yield from accum.flush()
If no pre-chunks have been accumulated, no `TextPreChunk` is generated. Flushing the builder If no pre-chunks have been accumulated, no `PreChunk` is generated. Flushing the builder
clears the pre-chunks it contains so it is ready to accept the next text-pre-chunk. clears the pre-chunks it contains so it is ready to accept the next pre-chunk.
""" """
def __init__(self, opts: ChunkingOptions) -> None: def __init__(self, opts: ChunkingOptions) -> None:
self._opts = opts self._opts = opts
self._pre_chunk: TextPreChunk | None = None self._pre_chunk: PreChunk | None = None
def add_pre_chunk(self, pre_chunk: TextPreChunk) -> None: def add_pre_chunk(self, pre_chunk: PreChunk) -> None:
"""Add a pre-chunk to the accumulator for possible combination with next pre-chunk.""" """Add a pre-chunk to the accumulator for possible combination with next pre-chunk."""
self._pre_chunk = ( self._pre_chunk = (
pre_chunk if self._pre_chunk is None else self._pre_chunk.combine(pre_chunk) pre_chunk if self._pre_chunk is None else self._pre_chunk.combine(pre_chunk)
) )
def flush(self) -> Iterator[TextPreChunk]: def flush(self) -> Iterator[PreChunk]:
"""Generate accumulated pre-chunk as a single combined pre-chunk. """Generate accumulated pre-chunk as a single combined pre-chunk.
Does not generate a pre-chunk when none has been accumulated. Does not generate a pre-chunk when none has been accumulated.
@ -1181,7 +1217,7 @@ class TextPreChunkAccumulator:
# -- and reset the accumulator (to empty) -- # -- and reset the accumulator (to empty) --
self._pre_chunk = None self._pre_chunk = None
def will_fit(self, pre_chunk: TextPreChunk) -> bool: def will_fit(self, pre_chunk: PreChunk) -> bool:
"""True when there is room for `pre_chunk` in accumulator. """True when there is room for `pre_chunk` in accumulator.
An empty accumulator always has room. Otherwise there is only room when `pre_chunk` can be An empty accumulator always has room. Otherwise there is only room when `pre_chunk` can be
@ -1206,7 +1242,7 @@ class TextPreChunkAccumulator:
# predicate. # predicate.
# #
# These can be mixed and matched to produce different chunking behaviors like "by_title" or left # These can be mixed and matched to produce different chunking behaviors like "by_title" or left
# out altogether to produce "by_element" behavior. # out altogether to produce "basic-chunking" behavior.
# #
# The effective lifetime of the functions that produce a predicate (rather than directly being one) # The effective lifetime of the functions that produce a predicate (rather than directly being one)
# is limited to a single element-stream because these retain state (e.g. current page number) to # is limited to a single element-stream because these retain state (e.g. current page number) to


@ -136,11 +136,15 @@ class HtmlRow:
for td in self._tr: for td in self._tr:
if (text := td.text) is None: if (text := td.text) is None:
continue continue
text = text.strip()
if not text: if not text:
continue continue
yield text yield text
@lazyproperty
def text_len(self) -> int:
"""Length of the normalized text, as it would appear in `element.text`."""
return len(" ".join(self.iter_cell_texts()))
class HtmlCell: class HtmlCell:
"""A `<td>` element.""" """A `<td>` element."""
@ -158,4 +162,4 @@ class HtmlCell:
"""Text inside `<td>` element, empty string when no text.""" """Text inside `<td>` element, empty string when no text."""
if (text := self._td.text) is None: if (text := self._td.text) is None:
return "" return ""
return text.strip() return " ".join(text.strip().split())
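
The new-side expression `" ".join(text.strip().split())` collapses every run of whitespace inside a cell to a single space, so cell length is measured on normalized text. A standalone sketch follows; `normalize_cell_text` is a hypothetical name used only for illustration.

```python
from typing import Optional


def normalize_cell_text(text: Optional[str]) -> str:
    """Collapse internal whitespace runs to single spaces; "" when there is no text."""
    if text is None:
        return ""
    # -- str.split() with no arguments splits on any whitespace run and drops empties --
    return " ".join(text.strip().split())
```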