mirror of https://github.com/Unstructured-IO/unstructured.git
synced 2025-12-12 07:34:09 +00:00
chunk: relax table segregation during chunking (#3812)
**Summary**
Relax table-segregation rule applied during chunking such that a `Table` and `Text`-subtype elements can be combined into a single chunk when the chunking window allows.

**Additional Context**
Until now, `Table` elements have always been segregated during chunking, i.e. a chunk that contained a table would never contain any other element. In certain scenarios, especially when a large chunking window of say 2000 characters is used, this behavior can reduce retrieval effectiveness by isolating the table from surrounding context.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
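The difference between the old and the relaxed rule can be sketched with a toy pre-chunker. This is a hedged illustration only, not the library's implementation: the `Element` dataclass, the `pre_chunk` function and its `segregate_tables` flag are hypothetical names invented for this example, and the sketch ignores title boundaries and the splitting of oversized elements.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Toy stand-in for a document element (hypothetical, not the library class)."""

    kind: str  # e.g. "Title", "Text", "Table"
    text: str


def pre_chunk(elements, max_characters=500, segregate_tables=False, separator="\n\n"):
    """Group consecutive elements into pre-chunks of at most `max_characters`.

    segregate_tables=True models the old rule: a table always sits alone.
    segregate_tables=False models the relaxed rule: a table may share a
    pre-chunk with neighboring text when the combined text fits.
    An element longer than the window simply gets its own group here;
    splitting it is out of scope for this sketch.
    """
    groups = []
    current = []
    current_len = 0

    def flush():
        # Close the group being built, if any, and start a fresh one.
        nonlocal current, current_len
        if current:
            groups.append(current)
        current, current_len = [], 0

    for el in elements:
        if segregate_tables and el.kind == "Table":
            flush()
            groups.append([el])
            continue
        # Length is measured on the element text only, plus a separator
        # when the element joins a non-empty group.
        added = len(el.text) + (len(separator) if current else 0)
        if current and current_len + added > max_characters:
            flush()
            added = len(el.text)
        current.append(el)
        current_len += added
    flush()
    return groups


elements = [
    Element("Text", "Today is a great day."),
    Element("Table", "Heading Cell text"),
    Element("Text", "It is sunny outside."),
]
old_groups = pre_chunk(elements, segregate_tables=True)
new_groups = pre_chunk(elements)
print(len(old_groups), len(new_groups))
```

Under these assumptions the old rule isolates the table between two text groups, while the relaxed rule keeps it with its surrounding context as long as the window allows.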
parent 18d6c81c47
commit 4379d883a3
@ -1,8 +1,10 @@
|
|||||||
## 0.16.11-dev0
|
## 0.16.11-dev1
|
||||||
|
|
||||||
### Enhancements
|
### Enhancements
|
||||||
|
|
||||||
- **Enhance quote standardization tests** with additional Unicode scenarios
|
- **Enhance quote standardization tests** with additional Unicode scenarios
|
||||||
|
- **Relax table segregation rule in chunking.** Previously a `Table` element was always segregated into its own pre-chunk such that the `Table` appeared alone in a chunk or was split into multiple `TableChunk` elements, but never combined with `Text`-subtype elements. Allow table elements to be combined with other elements in the same chunk when space allows.
|
||||||
|
- **Compute chunk length based solely on `element.text`.** Previously `.metadata.text_as_html` was also considered and since it is always longer than the text (due to HTML tag overhead) it was the effective length criterion. Remove text-as-html from the length calculation such that text-length is the sole criterion for sizing a chunk.
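
The effect of the length rule described in the entry above can be illustrated with a small sketch. It is an assumption-laden illustration rather than the library's code: `length_with_html` and `length_text_only` are hypothetical helpers, modelling the earlier behavior as "the longer of text and HTML wins" is a guess based on the changelog wording, and the 50-character window is arbitrary.

```python
from typing import Optional


def length_with_html(text: str, text_as_html: Optional[str] = None) -> int:
    # ASSUMPTION: models the earlier behavior as "the longer of the two wins",
    # which makes the HTML the effective criterion whenever it is present.
    return max(len(text), len(text_as_html or ""))


def length_text_only(text: str, text_as_html: Optional[str] = None) -> int:
    # Behavior described by the changelog entry: only the text is measured.
    return len(text)


table_text = "Heading Cell text"
table_html = "<table><tr><td>Heading</td></tr><tr><td>Cell text</td></tr></table>"
max_characters = 50  # hypothetical chunking window

fits_before = length_with_html(table_text, table_html) <= max_characters
fits_now = length_text_only(table_text, table_html) <= max_characters
print(fits_before, fits_now)
```

In this sketch the same small table is judged too large when its HTML is measured and fits comfortably when only its text is, which is why dropping the HTML from the calculation lets more elements share a chunk.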
|
||||||
|
|
||||||
### Features
|
### Features
|
||||||
|
|
||||||
|
|||||||
File diff suppressed because it is too large
@ -25,31 +25,31 @@ def test_it_chunks_a_document_when_basic_chunking_strategy_is_specified_on_parti
|
|||||||
assert chunks == [
|
assert chunks == [
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"US Trustee Handbook\n\nCHAPTER 1\n\nINTRODUCTION\n\nCHAPTER 1 – INTRODUCTION"
|
"US Trustee Handbook\n\nCHAPTER 1\n\nINTRODUCTION\n\nCHAPTER 1 – INTRODUCTION"
|
||||||
"\n\nA.\tPURPOSE"
|
"\n\nA. PURPOSE"
|
||||||
),
|
),
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"The United States Trustee appoints and supervises standing trustees and monitors and"
|
"The United States Trustee appoints and supervises standing trustees and monitors and"
|
||||||
" supervises cases under chapter 13 of title 11 of the United States Code. 28 U.S.C."
|
" supervises cases under chapter 13 of title 11 of the United States Code. 28 U.S.C."
|
||||||
" § 586(b). The Handbook, issued as part of our duties under 28 U.S.C. § 586,"
|
" § 586(b). The Handbook, issued as part of our duties under 28 U.S.C. § 586,"
|
||||||
" establishes or clarifies the position of the United States Trustee Program (Program)"
|
" establishes or clarifies the position of the United States Trustee Program (Program)"
|
||||||
" on the duties owed by a standing trustee to the debtors, creditors, other parties in"
|
" on the duties owed by a standing trustee to the debtors, creditors, other parties in"
|
||||||
" interest, and the United States Trustee. The Handbook does not present a full and"
|
" interest, and the United States Trustee. The Handbook does not present a full and"
|
||||||
),
|
),
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"complete statement of the law; it should not be used as a substitute for legal"
|
"complete statement of the law; it should not be used as a substitute for legal"
|
||||||
" research and analysis. The standing trustee must be familiar with relevant"
|
" research and analysis. The standing trustee must be familiar with relevant"
|
||||||
" provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules),"
|
" provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules),"
|
||||||
" any local bankruptcy rules, and case law. 11 U.S.C. § 321, 28 U.S.C. § 586,"
|
" any local bankruptcy rules, and case law. 11 U.S.C. § 321, 28 U.S.C. § 586,"
|
||||||
" 28 C.F.R. § 58.6(a)(3). Standing trustees are encouraged to follow Practice Tips"
|
" 28 C.F.R. § 58.6(a)(3). Standing trustees are encouraged to follow Practice Tips"
|
||||||
" identified in this Handbook but these are not considered mandatory."
|
" identified in this Handbook but these are not considered mandatory."
|
||||||
),
|
),
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"Nothing in this Handbook should be construed to excuse the standing trustee from"
|
"Nothing in this Handbook should be construed to excuse the standing trustee from"
|
||||||
" complying with all duties imposed by the Bankruptcy Code and Rules, local rules, and"
|
" complying with all duties imposed by the Bankruptcy Code and Rules, local rules, and"
|
||||||
" orders of the court. The standing trustee should notify the United States Trustee"
|
" orders of the court. The standing trustee should notify the United States Trustee"
|
||||||
" whenever the provision of the Handbook conflicts with the local rules or orders of"
|
" whenever the provision of the Handbook conflicts with the local rules or orders of"
|
||||||
" the court. The standing trustee is accountable for all duties set forth in this"
|
" the court. The standing trustee is accountable for all duties set forth in this"
|
||||||
" Handbook, but need not personally perform any duty unless otherwise indicated. All"
|
" Handbook, but need not personally perform any duty unless otherwise indicated. All"
|
||||||
),
|
),
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"statutory references in this Handbook refer to the Bankruptcy Code, 11 U.S.C. § 101"
|
"statutory references in this Handbook refer to the Bankruptcy Code, 11 U.S.C. § 101"
|
||||||
@ -57,12 +57,12 @@ def test_it_chunks_a_document_when_basic_chunking_strategy_is_specified_on_parti
|
|||||||
),
|
),
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"This Handbook does not create additional rights against the standing trustee or"
|
"This Handbook does not create additional rights against the standing trustee or"
|
||||||
" United States Trustee in favor of other parties.\n\nB.\tROLE OF THE UNITED STATES"
|
" United States Trustee in favor of other parties.\n\nB. ROLE OF THE UNITED STATES"
|
||||||
" TRUSTEE"
|
" TRUSTEE"
|
||||||
),
|
),
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"The Bankruptcy Reform Act of 1978 removed the bankruptcy judge from the"
|
"The Bankruptcy Reform Act of 1978 removed the bankruptcy judge from the"
|
||||||
" responsibilities for daytoday administration of cases. Debtors, creditors, and"
|
" responsibilities for daytoday administration of cases. Debtors, creditors, and"
|
||||||
" third parties with adverse interests to the trustee were concerned that the court,"
|
" third parties with adverse interests to the trustee were concerned that the court,"
|
||||||
" which previously appointed and supervised the trustee, would not impartially"
|
" which previously appointed and supervised the trustee, would not impartially"
|
||||||
" adjudicate their rights as adversaries of that trustee. To address these concerns,"
|
" adjudicate their rights as adversaries of that trustee. To address these concerns,"
|
||||||
@ -70,24 +70,24 @@ def test_it_chunks_a_document_when_basic_chunking_strategy_is_specified_on_parti
|
|||||||
),
|
),
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"Many administrative functions formerly performed by the court were placed within the"
|
"Many administrative functions formerly performed by the court were placed within the"
|
||||||
" Department of Justice through the creation of the Program. Among the administrative"
|
" Department of Justice through the creation of the Program. Among the administrative"
|
||||||
" functions assigned to the United States Trustee were the appointment and supervision"
|
" functions assigned to the United States Trustee were the appointment and supervision"
|
||||||
" of chapter 13 trustees./ This Handbook is issued under the authority of the"
|
" of chapter 13 trustees./ This Handbook is issued under the authority of the"
|
||||||
" Program’s enabling statutes. \n\nC.\tSTATUTORY DUTIES OF A STANDING TRUSTEE\t"
|
" Program’s enabling statutes.\n\nC. STATUTORY DUTIES OF A STANDING TRUSTEE"
|
||||||
),
|
),
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"The standing trustee has a fiduciary responsibility to the bankruptcy estate. The"
|
"The standing trustee has a fiduciary responsibility to the bankruptcy estate. The"
|
||||||
" standing trustee is more than a mere disbursing agent. The standing trustee must"
|
" standing trustee is more than a mere disbursing agent. The standing trustee must"
|
||||||
" be personally involved in the trustee operation. If the standing trustee is or"
|
" be personally involved in the trustee operation. If the standing trustee is or"
|
||||||
" becomes unable to perform the duties and responsibilities of a standing trustee,"
|
" becomes unable to perform the duties and responsibilities of a standing trustee,"
|
||||||
" the standing trustee must immediately advise the United States Trustee."
|
" the standing trustee must immediately advise the United States Trustee."
|
||||||
" 28 U.S.C. § 586(b), 28 C.F.R. § 58.4(b) referencing 28 C.F.R. § 58.3(b)."
|
" 28 U.S.C. § 586(b), 28 C.F.R. § 58.4(b) referencing 28 C.F.R. § 58.3(b)."
|
||||||
),
|
),
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"Although this Handbook is not intended to be a complete statutory reference, the"
|
"Although this Handbook is not intended to be a complete statutory reference, the"
|
||||||
" standing trustee’s primary statutory duties are set forth in 11 U.S.C. § 1302, which"
|
" standing trustee’s primary statutory duties are set forth in 11 U.S.C. § 1302, which"
|
||||||
" incorporates by reference some of the duties of chapter 7 trustees found in"
|
" incorporates by reference some of the duties of chapter 7 trustees found in"
|
||||||
" 11 U.S.C. § 704. These duties include, but are not limited to, the"
|
" 11 U.S.C. § 704. These duties include, but are not limited to, the"
|
||||||
" following:\n\nCopyright"
|
" following:\n\nCopyright"
|
||||||
),
|
),
|
||||||
]
|
]
|
||||||
|
|||||||
@ -8,7 +8,7 @@ from typing import Any, Optional
|
|||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
from test_unstructured.unit_utils import FixtureRequest, Mock, function_mock
|
from test_unstructured.unit_utils import FixtureRequest, Mock, function_mock, input_path
|
||||||
from unstructured.chunking.base import CHUNK_MULTI_PAGE_DEFAULT
|
from unstructured.chunking.base import CHUNK_MULTI_PAGE_DEFAULT
|
||||||
from unstructured.chunking.title import _ByTitleChunkingOptions, chunk_by_title
|
from unstructured.chunking.title import _ByTitleChunkingOptions, chunk_by_title
|
||||||
from unstructured.documents.coordinates import CoordinateSystem
|
from unstructured.documents.coordinates import CoordinateSystem
|
||||||
@ -20,10 +20,12 @@ from unstructured.documents.elements import (
|
|||||||
ElementMetadata,
|
ElementMetadata,
|
||||||
ListItem,
|
ListItem,
|
||||||
Table,
|
Table,
|
||||||
|
TableChunk,
|
||||||
Text,
|
Text,
|
||||||
Title,
|
Title,
|
||||||
)
|
)
|
||||||
from unstructured.partition.html import partition_html
|
from unstructured.partition.html import partition_html
|
||||||
|
from unstructured.staging.base import elements_from_json
|
||||||
|
|
||||||
# ================================================================================================
|
# ================================================================================================
|
||||||
# INTEGRATION-TESTS
|
# INTEGRATION-TESTS
|
||||||
@ -33,7 +35,53 @@ from unstructured.partition.html import partition_html
|
|||||||
# ================================================================================================
|
# ================================================================================================
|
||||||
|
|
||||||
|
|
||||||
def test_it_splits_a_large_element_into_multiple_chunks():
|
def test_it_chunks_text_followed_by_table_together_when_both_fit():
|
||||||
|
elements = elements_from_json(input_path("chunking/title_table_200.json"))
|
||||||
|
|
||||||
|
chunks = chunk_by_title(elements, combine_text_under_n_chars=0)
|
||||||
|
|
||||||
|
assert len(chunks) == 1
|
||||||
|
assert isinstance(chunks[0], CompositeElement)
|
||||||
|
|
||||||
|
|
||||||
|
def test_it_chunks_table_followed_by_text_together_when_both_fit():
|
||||||
|
elements = elements_from_json(input_path("chunking/table_text_200.json"))
|
||||||
|
|
||||||
|
# -- disable chunk combining so we test pre-chunking behavior, not chunk-combining --
|
||||||
|
chunks = chunk_by_title(elements, combine_text_under_n_chars=0)
|
||||||
|
|
||||||
|
assert len(chunks) == 1
|
||||||
|
assert isinstance(chunks[0], CompositeElement)
|
||||||
|
|
||||||
|
|
||||||
|
def test_it_splits_oversized_table():
|
||||||
|
elements = elements_from_json(input_path("chunking/table_2000.json"))
|
||||||
|
|
||||||
|
chunks = chunk_by_title(elements)
|
||||||
|
|
||||||
|
assert len(chunks) == 5
|
||||||
|
assert all(isinstance(chunk, TableChunk) for chunk in chunks)
|
||||||
|
|
||||||
|
|
||||||
|
def test_it_starts_new_chunk_for_table_after_full_text_chunk():
|
||||||
|
elements = elements_from_json(input_path("chunking/long_text_table_200.json"))
|
||||||
|
|
||||||
|
chunks = chunk_by_title(elements, max_characters=250)
|
||||||
|
|
||||||
|
assert len(chunks) == 2
|
||||||
|
assert [type(chunk) for chunk in chunks] == [CompositeElement, Table]
|
||||||
|
|
||||||
|
|
||||||
|
def test_it_starts_new_chunk_for_text_after_full_table_chunk():
|
||||||
|
elements = elements_from_json(input_path("chunking/full_table_long_text_250.json"))
|
||||||
|
|
||||||
|
chunks = chunk_by_title(elements, max_characters=250)
|
||||||
|
|
||||||
|
assert len(chunks) == 2
|
||||||
|
assert [type(chunk) for chunk in chunks] == [Table, CompositeElement]
|
||||||
|
|
||||||
|
|
||||||
|
def test_it_splits_a_large_text_element_into_multiple_chunks():
|
||||||
elements: list[Element] = [
|
elements: list[Element] = [
|
||||||
Title("Introduction"),
|
Title("Introduction"),
|
||||||
Text(
|
Text(
|
||||||
@ -68,7 +116,7 @@ def test_it_splits_elements_by_title_and_table():
|
|||||||
|
|
||||||
chunks = chunk_by_title(elements, combine_text_under_n_chars=0, include_orig_elements=True)
|
chunks = chunk_by_title(elements, combine_text_under_n_chars=0, include_orig_elements=True)
|
||||||
|
|
||||||
assert len(chunks) == 4
|
assert len(chunks) == 3
|
||||||
# --
|
# --
|
||||||
chunk = chunks[0]
|
chunk = chunks[0]
|
||||||
assert isinstance(chunk, CompositeElement)
|
assert isinstance(chunk, CompositeElement)
|
||||||
@ -76,13 +124,10 @@ def test_it_splits_elements_by_title_and_table():
|
|||||||
Title("A Great Day"),
|
Title("A Great Day"),
|
||||||
Text("Today is a great day."),
|
Text("Today is a great day."),
|
||||||
Text("It is sunny outside."),
|
Text("It is sunny outside."),
|
||||||
|
Table("Heading\nCell text"),
|
||||||
]
|
]
|
||||||
# --
|
# --
|
||||||
chunk = chunks[1]
|
chunk = chunks[1]
|
||||||
assert isinstance(chunk, Table)
|
|
||||||
assert chunk.metadata.orig_elements == [Table("Heading\nCell text")]
|
|
||||||
# ==
|
|
||||||
chunk = chunks[2]
|
|
||||||
assert isinstance(chunk, CompositeElement)
|
assert isinstance(chunk, CompositeElement)
|
||||||
assert chunk.metadata.orig_elements == [
|
assert chunk.metadata.orig_elements == [
|
||||||
Title("An Okay Day"),
|
Title("An Okay Day"),
|
||||||
@ -90,7 +135,7 @@ def test_it_splits_elements_by_title_and_table():
|
|||||||
Text("It is rainy outside."),
|
Text("It is rainy outside."),
|
||||||
]
|
]
|
||||||
# --
|
# --
|
||||||
chunk = chunks[3]
|
chunk = chunks[2]
|
||||||
assert isinstance(chunk, CompositeElement)
|
assert isinstance(chunk, CompositeElement)
|
||||||
assert chunk.metadata.orig_elements == [
|
assert chunk.metadata.orig_elements == [
|
||||||
Title("A Bad Day"),
|
Title("A Bad Day"),
|
||||||
@ -119,9 +164,8 @@ def test_chunk_by_title():
|
|||||||
|
|
||||||
assert chunks == [
|
assert chunks == [
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"A Great Day\n\nToday is a great day.\n\nIt is sunny outside.",
|
"A Great Day\n\nToday is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"
|
||||||
),
|
),
|
||||||
Table("Heading\nCell text"),
|
|
||||||
CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
|
CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
|
"A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
|
||||||
@ -150,10 +194,7 @@ def test_chunk_by_title_separates_by_page_number():
|
|||||||
CompositeElement(
|
CompositeElement(
|
||||||
"A Great Day",
|
"A Great Day",
|
||||||
),
|
),
|
||||||
CompositeElement(
|
CompositeElement("Today is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"),
|
||||||
"Today is a great day.\n\nIt is sunny outside.",
|
|
||||||
),
|
|
||||||
Table("Heading\nCell text"),
|
|
||||||
CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
|
CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
|
"A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
|
||||||
@ -178,9 +219,8 @@ def test_chuck_by_title_respects_multipage():
|
|||||||
chunks = chunk_by_title(elements, multipage_sections=True, combine_text_under_n_chars=0)
|
chunks = chunk_by_title(elements, multipage_sections=True, combine_text_under_n_chars=0)
|
||||||
assert chunks == [
|
assert chunks == [
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"A Great Day\n\nToday is a great day.\n\nIt is sunny outside.",
|
"A Great Day\n\nToday is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"
|
||||||
),
|
),
|
||||||
Table("Heading\nCell text"),
|
|
||||||
CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
|
CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
|
"A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
|
||||||
@ -206,9 +246,8 @@ def test_chunk_by_title_groups_across_pages():
|
|||||||
|
|
||||||
assert chunks == [
|
assert chunks == [
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"A Great Day\n\nToday is a great day.\n\nIt is sunny outside.",
|
"A Great Day\n\nToday is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"
|
||||||
),
|
),
|
||||||
Table("Heading\nCell text"),
|
|
||||||
CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
|
CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
|
||||||
CompositeElement(
|
CompositeElement(
|
||||||
"A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
|
"A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
|
||||||
|
|||||||
@ -37,7 +37,7 @@ def test_it_chunks_elements_when_a_chunking_strategy_is_specified():
|
|||||||
"example-docs/spring-weather.html.json", chunking_strategy="basic", max_characters=1500
|
"example-docs/spring-weather.html.json", chunking_strategy="basic", max_characters=1500
|
||||||
)
|
)
|
||||||
|
|
||||||
assert len(chunks) == 10
|
assert len(chunks) == 9
|
||||||
assert all(isinstance(ch, CompositeElement) for ch in chunks)
|
assert all(isinstance(ch, CompositeElement) for ch in chunks)
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@ -0,0 +1,32 @@
|
|||||||
|
[
|
||||||
|
{
|
||||||
|
"type": "Table",
|
||||||
|
"element_id": "ca96108263324e9d865a98f19cf7c940",
|
||||||
|
"text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
|
||||||
|
"metadata": {
|
||||||
|
"category_depth": 1,
|
||||||
|
"page_number": 1,
|
||||||
|
"parent_id": "747587de72444235a68c768d544ff5f3",
|
||||||
|
"text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
|
||||||
|
"languages": [
|
||||||
|
"eng"
|
||||||
|
],
|
||||||
|
"filetype": "text/html"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"type": "NarrativeText",
|
||||||
|
"element_id": "5bc93ad5828445f98cac824c750cacfd",
|
||||||
|
"text": "Format: CSV file for Export and Download Contact: Charles Stringham cstringham@alsde.edu to arrange secure data transfer OR with technical questions nickey.johnson@alsde.edu for other questions",
|
||||||
|
"metadata": {
|
||||||
|
"category_depth": 2,
|
||||||
|
"page_number": 1,
|
||||||
|
"parent_id": "d8fa364bbfdf42d7b37c7a1dcb90ecf5",
|
||||||
|
"text_as_html": "<p class=\"NarrativeText\" id=\"5bc93ad5828445f98cac824c750cacfd\">Format: CSV file for Export and Download </p> <p class=\"NarrativeText\" id=\"875c1820b6cd4736a7e699571896b568\">Contact: Charles Stringham cstringham@alsde.edu to arrange secure data transfer OR with technical questions </p> <p class=\"NarrativeText\" id=\"ac41c15812e64e918cbb07c2bc68b5d2\">nickey.johnson@alsde.edu for other questions </p>",
|
||||||
|
"languages": [
|
||||||
|
"eng"
|
||||||
|
],
|
||||||
|
"filetype": "text/html"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
]
|
||||||
@ -0,0 +1,32 @@
|
|||||||
|
[
|
||||||
|
{
|
||||||
|
"type": "NarrativeText",
|
||||||
|
"element_id": "5bc93ad5828445f98cac824c750cacfd",
|
||||||
|
"text": "Format: CSV file for Export and Download Contact: Charles Stringham cstringham@alsde.edu to arrange secure data transfer OR with technical questions nickey.johnson@alsde.edu for other questions",
|
||||||
|
"metadata": {
|
||||||
|
"category_depth": 2,
|
||||||
|
"page_number": 1,
|
||||||
|
"parent_id": "d8fa364bbfdf42d7b37c7a1dcb90ecf5",
|
||||||
|
"text_as_html": "<p class=\"NarrativeText\" id=\"5bc93ad5828445f98cac824c750cacfd\">Format: CSV file for Export and Download </p> <p class=\"NarrativeText\" id=\"875c1820b6cd4736a7e699571896b568\">Contact: Charles Stringham cstringham@alsde.edu to arrange secure data transfer OR with technical questions </p> <p class=\"NarrativeText\" id=\"ac41c15812e64e918cbb07c2bc68b5d2\">nickey.johnson@alsde.edu for other questions </p>",
|
||||||
|
"languages": [
|
||||||
|
"eng"
|
||||||
|
],
|
||||||
|
"filetype": "text/html"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"type": "Table",
|
||||||
|
"element_id": "ca96108263324e9d865a98f19cf7c940",
|
||||||
|
"text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
|
||||||
|
"metadata": {
|
||||||
|
"category_depth": 1,
|
||||||
|
"page_number": 1,
|
||||||
|
"parent_id": "747587de72444235a68c768d544ff5f3",
|
||||||
|
"text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
|
||||||
|
"languages": [
|
||||||
|
"eng"
|
||||||
|
],
|
||||||
|
"filetype": "text/html"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
]
|
||||||
17
test_unstructured/testfiles/chunking/table_2000.json
Normal file
@ -0,0 +1,17 @@
|
|||||||
|
[
|
||||||
|
{
|
||||||
|
"type": "Table",
|
||||||
|
"element_id": "e6278883f688428c98cec628a00b0102",
|
||||||
|
"text": "Field Name Size Type Description Example School_Year 9 VARCHAR School year the assessment was given 2019-2020 LEA_Name VARCHAR Official Name of the School System Happy City Schools LEA_Code 3 VARCHAR 3-digit ALSDE-assigned system code 010 or 298 School_Code 6 VARCHAR 4-digit ALSDE-assigned school code 0100 or 9203 Student_Identifier 10 VARCHAR Student's ALSDE ID number -SSID ***must be 10 digits and start with \"19\" or \"20\"*** 9999999999 Student_Last_Name 35 VARCHAR Student's last name Smith Student_First_Name 35 VARCHAR Student's first name Jane Student_Date_of_Birth_Month 2 VARCHAR Student birth date month. MM 05, 11 Student_Date_of_Birth_Day 2 VARCHAR Student birth date day. DD 03, 25 Student_Date_of_Birth_Year 4 VARCHAR Student birth date Year. YYYY 2015 Reading_Teacher_Identifier 13 VARCHAR Reading Teacher's ALSDE ID/TCHNumber. The teacher who is primarily responsible for Reading instruction of the student. (These are two names for the same number). ***must be in this format 3 letters, dash, 4 numbers, dash, 4 numbers*** XXX-9999-9999, NOJ-1234-5678 Reading_Assessment_Name 15 VARCHAR Unique identifier for Reading assessment. Vendor's name for overall assessment. XXXX Reading_Administration_Mode 8 VARCHAR This field indicates if the assessment was administered in an in-person (face-to-face) or a remote learning environment. The options are: InPerson or Remote Reading_Benchmark_Period 3 VARCHAR Benchmark period during the term the assessment was administered. Summer School will be SSS. BOY, MOY or EOY (SSS for summer school) Reading_Date_Completed 10 VARCHAR This is the date on which the assessment is completed MM/DD/YYYY 43962 Reading_Extended_Time 2 VARCHAR The field will contain a \"Y\" if the student was given more than the allotted time to finish the assessment or any subtest of the assessment as defined by the vendor in a standard administration. Y",
|
||||||
|
"metadata": {
|
||||||
|
"category_depth": 1,
|
||||||
|
"page_number": 1,
|
||||||
|
"parent_id": "3ddff8c2b6c44a16be24baf72bdd78a2",
|
||||||
|
"text_as_html": "<table class=\"Table\" id=\"e6278883f688428c98cec628a00b0102\"> <thead> <tr> <th>Field Name</th><th>Size</th><th>Type</th><th>Description</th><th>Example</th></tr></thead><tbody> <tr> <td>School_Year</td><td>9</td><td>VARCHAR</td><td>School year the assessment was given</td><td>2019-2020</td></tr><tr> <td>LEA_Name</td><td></td><td>VARCHAR</td><td>Official Name of the School System</td><td>Happy City Schools</td></tr><tr> <td>LEA_Code</td><td>3</td><td>VARCHAR</td><td>3-digit ALSDE-assigned system code</td><td>010 or 298</td></tr><tr> <td>School_Code</td><td>6</td><td>VARCHAR</td><td>4-digit ALSDE-assigned school code</td><td>0100 or 9203</td></tr><tr> <td>Student_Identifier</td><td>10</td><td>VARCHAR</td><td>Student's ALSDE ID number -SSID ***must be 10 digits and start with \"19\" or \"20\"***</td><td>9999999999</td></tr><tr> <td>Student_Last_Name</td><td>35</td><td>VARCHAR</td><td>Student's last name</td><td>Smith</td></tr><tr> <td>Student_First_Name</td><td>35</td><td>VARCHAR</td><td>Student's first name</td><td>Jane</td></tr><tr> <td>Student_Date_of_Birth_Month</td><td>2</td><td>VARCHAR</td><td>Student birth date month. MM</td><td>05, 11</td></tr><tr> <td>Student_Date_of_Birth_Day</td><td>2</td><td>VARCHAR</td><td>Student birth date day. DD</td><td>03, 25</td></tr><tr> <td>Student_Date_of_Birth_Year</td><td>4</td><td>VARCHAR</td><td>Student birth date Year. YYYY</td><td>2015</td></tr><tr> <td>Reading_Teacher_Identifier</td><td>13</td><td>VARCHAR</td><td>Reading Teacher's ALSDE ID/TCHNumber. The teacher who is primarily responsible for Reading instruction of the student. (These are two names for the same number). ***must be in this format 3 letters, dash, 4 numbers, dash, 4 numbers***</td><td>XXX-9999-9999, NOJ-1234-5678</td></tr><tr> <td>Reading_Assessment_Name</td><td>15</td><td>VARCHAR</td><td>Unique identifier for Reading assessment. 
Vendor's name for overall assessment.</td><td>XXXX</td></tr><tr> <td>Reading_Administration_Mode</td><td>8</td><td>VARCHAR</td><td>This field indicates if the assessment was administered in an in-person (face-to-face) or a remote learning environment. The options are:</td><td>InPerson or Remote</td></tr><tr> <td>Reading_Benchmark_Period</td><td>3</td><td>VARCHAR</td><td>Benchmark period during the term the assessment was administered. Summer School will be SSS.</td><td>BOY, MOY or EOY (SSS for summer school)</td></tr><tr> <td>Reading_Date_Completed</td><td>10</td><td>VARCHAR</td><td>This is the date on which the assessment is completed MM/DD/YYYY</td><td>43962</td></tr><tr> <td>Reading_Extended_Time</td><td>2</td><td>VARCHAR</td><td>The field will contain a \"Y\" if the student was given more than the allotted time to finish the assessment or any subtest of the assessment as defined by the vendor in a standard administration.</td><td>Y</td></tr></tbody></table>",
|
||||||
|
"languages": [
|
||||||
|
"eng"
|
||||||
|
],
|
||||||
|
"filetype": "text/html"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
]
|
||||||
32
test_unstructured/testfiles/chunking/table_text_200.json
Normal file
@ -0,0 +1,32 @@
|
|||||||
|
[
|
||||||
|
{
|
||||||
|
"type": "Table",
|
||||||
|
"element_id": "ca96108263324e9d865a98f19cf7c940",
|
||||||
|
"text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
|
||||||
|
"metadata": {
|
||||||
|
"category_depth": 1,
|
||||||
|
"page_number": 1,
|
||||||
|
"parent_id": "747587de72444235a68c768d544ff5f3",
|
||||||
|
"text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
|
||||||
|
"languages": [
|
||||||
|
"eng"
|
||||||
|
],
|
||||||
|
"filetype": "text/html"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"type": "Text",
|
||||||
|
"element_id": "0163a58539934b3aaca402c9e961b0d6",
|
||||||
|
"text": "REQUEST FOR PROPOSALS",
|
||||||
|
"metadata": {
|
||||||
|
"category_depth": 1,
|
||||||
|
"page_number": 1,
|
||||||
|
"parent_id": "747587de72444235a68c768d544ff5f3",
|
||||||
|
"text_as_html": "<h2 class=\"Subtitle\" id=\"0163a58539934b3aaca402c9e961b0d6\">REQUEST FOR PROPOSALS </h2>",
|
||||||
|
"languages": [
|
||||||
|
"eng"
|
||||||
|
],
|
||||||
|
"filetype": "text/html"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
]
|
||||||
32
test_unstructured/testfiles/chunking/title_table_200.json
Normal file
@ -0,0 +1,32 @@
|
|||||||
|
[
|
||||||
|
{
|
||||||
|
"type": "Title",
|
||||||
|
"element_id": "0163a58539934b3aaca402c9e961b0d6",
|
||||||
|
"text": "REQUEST FOR PROPOSALS",
|
||||||
|
"metadata": {
|
||||||
|
"category_depth": 1,
|
||||||
|
"page_number": 1,
|
||||||
|
"parent_id": "747587de72444235a68c768d544ff5f3",
|
||||||
|
"text_as_html": "<h2 class=\"Subtitle\" id=\"0163a58539934b3aaca402c9e961b0d6\">REQUEST FOR PROPOSALS </h2>",
|
||||||
|
"languages": [
|
||||||
|
"eng"
|
||||||
|
],
|
||||||
|
"filetype": "text/html"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"type": "Table",
|
||||||
|
"element_id": "ca96108263324e9d865a98f19cf7c940",
|
||||||
|
"text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
|
||||||
|
"metadata": {
|
||||||
|
"category_depth": 1,
|
||||||
|
"page_number": 1,
|
||||||
|
"parent_id": "747587de72444235a68c768d544ff5f3",
|
||||||
|
"text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
|
||||||
|
"languages": [
|
||||||
|
"eng"
|
||||||
|
],
|
||||||
|
"filetype": "text/html"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
]
|
||||||
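A rough check of the fixture above shows why sizing on text alone matters. This is only a sketch: the 200-character window is inferred from the file name, and the `"\n\n"` separator is taken from the `_text` docstring later in this diff. The combined title and table text fits in the window, while the table's `text_as_html` by itself does not.

```python
# Values copied from the title_table_200.json fixture above.
title_text = "REQUEST FOR PROPOSALS"
table_text = (
    "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: "
    "Number of Pages: #189 05/30/2024 by 5:00pm Central Time"
)
table_html = (
    '<table class="Table" id="ca96108263324e9d865a98f19cf7c940"> <tbody> <tr> '
    "<td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> "
    "<td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> "
    "<td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>"
)

WINDOW = 200  # assumption: chunking window implied by the fixture's file name
SEPARATOR = "\n\n"  # assumption: default separator between element texts

# -- length of the title and table joined the way a single chunk would join them --
combined_len = len(SEPARATOR.join([title_text, table_text]))

fits_by_text = combined_len <= WINDOW
fits_by_html = len(table_html) <= WINDOW

print(f"text fits: {fits_by_text}, html fits: {fits_by_html}")
```

Under the previous length criterion the HTML length would have dominated, so a table like this one could not have shared a 200-character chunk with its title.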
@ -101,6 +101,13 @@ def parse_optional_datetime(datetime_str: Optional[str]) -> Optional[dt.datetime
|
|||||||
return dt.datetime.fromisoformat(datetime_str) if datetime_str else None
|
return dt.datetime.fromisoformat(datetime_str) if datetime_str else None
|
||||||
|
|
||||||
|
|
||||||
|
def input_path(rel_path: str) -> str:
|
||||||
|
"""Resolve the absolute-path to `rel_path` in the testfiles directory."""
|
||||||
|
testfiles_dir = pathlib.Path(__file__).parent / "testfiles"
|
||||||
|
file_path = testfiles_dir / rel_path
|
||||||
|
return str(file_path.resolve())
|
||||||
|
|
||||||
|
|
||||||
# ------------------------------------------------------------------------------------------------
|
# ------------------------------------------------------------------------------------------------
|
||||||
# MOCKING FIXTURES
|
# MOCKING FIXTURES
|
||||||
# ------------------------------------------------------------------------------------------------
|
# ------------------------------------------------------------------------------------------------
|
||||||
|
|||||||
@ -1,8 +1,8 @@
|
|||||||
[
|
[
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "36385872440a208d3521a8a885d5f873",
|
"element_id": "85002882dd396da0b1b82c925b002be5",
|
||||||
"text": "US Trustee Handbook\n\nCHAPTER 1\n\nINTRODUCTION\n\nCHAPTER 1 \u2013 INTRODUCTION\n\nA.\tPURPOSE",
|
"text": "US Trustee Handbook\n\nCHAPTER 1\n\nINTRODUCTION\n\nCHAPTER 1 \u2013 INTRODUCTION\n\nA. PURPOSE",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -55,8 +55,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "91d26c5ec7f727ece12679cf6b80f90d",
|
"element_id": "1abe685eb8dfed0f2266d6cf793d7e6b",
|
||||||
"text": "le 11 of the United States Code. 28 U.S.C. \u00a7 586(b). The Handbook, issued as part of our duties under 28 U.S.C. \u00a7 586, establishes or clarifies the",
|
"text": "le 11 of the United States Code. 28 U.S.C. \u00a7 586(b). The Handbook, issued as part of our duties under 28 U.S.C. \u00a7 586, establishes or clarifies the",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -103,8 +103,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "20447c8f42ed2b919bd0e5707e7899ae",
|
"element_id": "40588c4c1489058c4fec885f4696ebcc",
|
||||||
"text": "s, creditors, other parties in interest, and the United States Trustee. The Handbook does not present a full and complete statement of the law; it",
|
"text": "s, creditors, other parties in interest, and the United States Trustee. The Handbook does not present a full and complete statement of the law; it",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -127,8 +127,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "e34c56af21b43f4179f996ddea901bc4",
|
"element_id": "9ddf0b109cf940de5f575acc9d9758c8",
|
||||||
"text": "ment of the law; it should not be used as a substitute for legal research and analysis. The standing trustee must be familiar with relevant",
|
"text": "ment of the law; it should not be used as a substitute for legal research and analysis. The standing trustee must be familiar with relevant provisions",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -151,8 +151,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "55e660e5b0d0ec6ee5476621e556d6c8",
|
"element_id": "b7d1b42646393ca0f41af0e8ec48f9a9",
|
||||||
"text": "iliar with relevant provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules), any local bankruptcy rules, and case law. 11",
|
"text": "relevant provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules), any local bankruptcy rules, and case law. 11 U.S.C. \u00a7 321,",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -175,8 +175,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "a9335be161a6a7a080ff78e4e07cbadb",
|
"element_id": "9ee33f4141eca1f98ca4299d0fdfba31",
|
||||||
"text": ", and case law. 11 U.S.C. \u00a7 321, 28 U.S.C. \u00a7 586, 28 C.F.R. \u00a7 58.6(a)(3). Standing trustees are encouraged to follow Practice Tips identified in",
|
"text": "w. 11 U.S.C. \u00a7 321, 28 U.S.C. \u00a7 586, 28 C.F.R. \u00a7 58.6(a)(3). Standing trustees are encouraged to follow Practice Tips identified in this Handbook but",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -199,8 +199,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "5f2d61a46e9d16ce346eacc25321a250",
|
"element_id": "6da3b5e2a833fa5ab6685f0fa46d2d6f",
|
||||||
"text": "Tips identified in this Handbook but these are not considered mandatory.",
|
"text": "n this Handbook but these are not considered mandatory.",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -246,8 +246,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "2ff156994a8c58d8a5c91918a543ec28",
|
"element_id": "685600ed24c5b0e3b34e7d639d3b1959",
|
||||||
"text": "tcy Code and Rules, local rules, and orders of the court. The standing trustee should notify the United States Trustee whenever the provision of the",
|
"text": "tcy Code and Rules, local rules, and orders of the court. The standing trustee should notify the United States Trustee whenever the provision of the",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -270,8 +270,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "7c43851f864b7ccc35150c93d06abe80",
|
"element_id": "c998f5c10c9dac92e4d3624896a603c7",
|
||||||
"text": "he provision of the Handbook conflicts with the local rules or orders of the court. The standing trustee is accountable for all duties set forth in",
|
"text": "he provision of the Handbook conflicts with the local rules or orders of the court. The standing trustee is accountable for all duties set forth in",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -294,8 +294,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "7caf69b806daa033d686fae6100f4d7c",
|
"element_id": "d4b750e9af7167156f369b310a8cebb8",
|
||||||
"text": "duties set forth in this Handbook, but need not personally perform any duty unless otherwise indicated. All statutory references in this Handbook",
|
"text": "duties set forth in this Handbook, but need not personally perform any duty unless otherwise indicated. All statutory references in this Handbook",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -365,8 +365,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "66ff9b9385d511ca7e71f1e6852d3221",
|
"element_id": "8f411358790d6ee5b0d24f919206d3fd",
|
||||||
"text": "B.\tROLE OF THE UNITED STATES TRUSTEE",
|
"text": "B. ROLE OF THE UNITED STATES TRUSTEE",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -388,8 +388,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "1876c502fcbb25fd7b978417aea8dded",
|
"element_id": "6044d58375609c8802cfae16cef5cee9",
|
||||||
"text": "The Bankruptcy Reform Act of 1978 removed the bankruptcy judge from the responsibilities for daytoday administration of cases. Debtors, creditors,",
|
"text": "The Bankruptcy Reform Act of 1978 removed the bankruptcy judge from the responsibilities for daytoday administration of cases. Debtors, creditors, and",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -411,8 +411,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "5f89702a93c3df34a62905e5dff5c54d",
|
"element_id": "a4030396eaf54570462ed74f86e45bc8",
|
||||||
"text": "Debtors, creditors, and third parties with adverse interests to the trustee were concerned that the court, which previously appointed and supervised",
|
"text": "ors, creditors, and third parties with adverse interests to the trustee were concerned that the court, which previously appointed and supervised the",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -435,8 +435,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "c916e417ed924c556baed9616c3f81ae",
|
"element_id": "80e3b20fead224c85652bbdce327a28d",
|
||||||
"text": "nted and supervised the trustee, would not impartially adjudicate their rights as adversaries of that trustee. To address these concerns, judicial and",
|
"text": "and supervised the trustee, would not impartially adjudicate their rights as adversaries of that trustee. To address these concerns, judicial and",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -483,8 +483,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "709927b67286cccaf8fb25d63667c277",
|
"element_id": "39a3f1465d06269d2544ded43dc3a7df",
|
||||||
"text": "Many administrative functions formerly performed by the court were placed within the Department of Justice through the creation of the Program. Among",
|
"text": "Many administrative functions formerly performed by the court were placed within the Department of Justice through the creation of the Program. Among",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -506,8 +506,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "509676fb8d4f77b5f270629dee7a2664",
|
"element_id": "2872e5d0bea6ec1523eb9ae2c1c64add",
|
||||||
"text": "the Program. Among the administrative functions assigned to the United States Trustee were the appointment and supervision of chapter 13 trustees./",
|
"text": "the Program. Among the administrative functions assigned to the United States Trustee were the appointment and supervision of chapter 13 trustees./",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -530,8 +530,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "7ced6d1ee6cc9478adfd8e2a613be42a",
|
"element_id": "24e1076110b431b248b43b1fdaae5282",
|
||||||
"text": "apter 13 trustees./ This Handbook is issued under the authority of the Program\u2019s enabling statutes. ",
|
"text": "apter 13 trustees./ This Handbook is issued under the authority of the Program\u2019s enabling statutes.",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -554,8 +554,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "2c82d3fa4252275d5309a640eb25cd68",
|
"element_id": "158a80e29cfe6aa83a4931d955a8fa4f",
|
||||||
"text": "C.\tSTATUTORY DUTIES OF A STANDING TRUSTEE\t",
|
"text": "C. STATUTORY DUTIES OF A STANDING TRUSTEE",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -577,8 +577,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "a819e32a65d1f545cb404fe3f6273357",
|
"element_id": "e5fdcc6a007017354a9d708dc04fee02",
|
||||||
"text": "The standing trustee has a fiduciary responsibility to the bankruptcy estate. The standing trustee is more than a mere disbursing agent. The",
|
"text": "The standing trustee has a fiduciary responsibility to the bankruptcy estate. The standing trustee is more than a mere disbursing agent. The standing",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -600,8 +600,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "9e98089003e3b42ed7f1c263335dee3c",
|
"element_id": "0bf52e064da3ef4fb8b0a92d4b9fa694",
|
||||||
"text": "bursing agent. The standing trustee must be personally involved in the trustee operation. If the standing trustee is or becomes unable to perform",
|
"text": "agent. The standing trustee must be personally involved in the trustee operation. If the standing trustee is or becomes unable to perform the duties",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -624,8 +624,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "d476b15e5336342b1da22d100849b23c",
|
"element_id": "db297530e558410b89acd93c6b452b84",
|
||||||
"text": "s unable to perform the duties and responsibilities of a standing trustee, the standing trustee must immediately advise the United States Trustee. 28",
|
"text": "perform the duties and responsibilities of a standing trustee, the standing trustee must immediately advise the United States Trustee. 28 U.S.C. \u00a7",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -648,8 +648,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "8f8c9c0919f7502bd2fabad0b12ad664",
|
"element_id": "201bfacc211f0eb640e2830b8c29ae41",
|
||||||
"text": "States Trustee. 28 U.S.C. \u00a7 586(b), 28 C.F.R. \u00a7 58.4(b) referencing 28 C.F.R. \u00a7 58.3(b).",
|
"text": "rustee. 28 U.S.C. \u00a7 586(b), 28 C.F.R. \u00a7 58.4(b) referencing 28 C.F.R. \u00a7 58.3(b).",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -695,8 +695,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "9864d90bf9febdd104e7eac4c56689ba",
|
"element_id": "fd4c45036e8f17c27271f75944389724",
|
||||||
"text": "are set forth in 11 U.S.C. \u00a7 1302, which incorporates by reference some of the duties of chapter 7 trustees found in 11 U.S.C. \u00a7 704. These duties",
|
"text": "are set forth in 11 U.S.C. \u00a7 1302, which incorporates by reference some of the duties of chapter 7 trustees found in 11 U.S.C. \u00a7 704. These duties",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
@ -719,8 +719,8 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"type": "CompositeElement",
|
"type": "CompositeElement",
|
||||||
"element_id": "a91f963bcd1c092bffb844453aafa499",
|
"element_id": "a968d741409111b777fc123ef01f5407",
|
||||||
"text": "704. These duties include, but are not limited to, the following:",
|
"text": "\u00a7 704. These duties include, but are not limited to, the following:",
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"data_source": {
|
"data_source": {
|
||||||
"record_locator": {
|
"record_locator": {
|
||||||
|
|||||||
@ -1 +1 @@
|
|||||||
__version__ = "0.16.11-dev0" # pragma: no cover
|
__version__ = "0.16.11-dev1" # pragma: no cover
|
||||||
|
|||||||
@ -43,9 +43,6 @@ Only operative for "by_title" chunking strategy.
|
|||||||
BoundaryPredicate: TypeAlias = Callable[[Element], bool]
|
BoundaryPredicate: TypeAlias = Callable[[Element], bool]
|
||||||
"""Detects when element represents crossing a semantic boundary like section or page."""
|
"""Detects when element represents crossing a semantic boundary like section or page."""
|
||||||
|
|
||||||
PreChunk: TypeAlias = "TablePreChunk | TextPreChunk"
|
|
||||||
"""The kind of object produced by a pre-chunker."""
|
|
||||||
|
|
||||||
TextAndHtml: TypeAlias = tuple[str, str]
|
TextAndHtml: TypeAlias = tuple[str, str]
|
||||||
|
|
||||||
|
|
||||||
@ -288,8 +285,13 @@ class PreChunker:
|
|||||||
pre_chunk_builder = PreChunkBuilder(self._opts)
|
pre_chunk_builder = PreChunkBuilder(self._opts)
|
||||||
|
|
||||||
for element in self._elements:
|
for element in self._elements:
|
||||||
# -- start new pre-chunk when necessary --
|
# -- start new pre-chunk when necessary to uphold segregation guarantees --
|
||||||
if self._is_in_new_semantic_unit(element) or not pre_chunk_builder.will_fit(element):
|
if (
|
||||||
|
# -- start new pre-chunk when necessary to uphold segregation guarantees --
|
||||||
|
self._is_in_new_semantic_unit(element)
|
||||||
|
# -- or when next element won't fit --
|
||||||
|
or not pre_chunk_builder.will_fit(element)
|
||||||
|
):
|
||||||
yield from pre_chunk_builder.flush()
|
yield from pre_chunk_builder.flush()
|
||||||
|
|
||||||
# -- add this element to the work-in-progress (WIP) pre-chunk --
|
# -- add this element to the work-in-progress (WIP) pre-chunk --
|
||||||
@ -320,8 +322,7 @@ class PreChunkBuilder:
|
|||||||
the next element in the element stream.
|
the next element in the element stream.
|
||||||
|
|
||||||
`.flush()` is used to build a PreChunk object from the accumulated elements. This method
|
`.flush()` is used to build a PreChunk object from the accumulated elements. This method
|
||||||
returns an iterator that generates zero-or-one `TextPreChunk` or `TablePreChunk` object and is
|
returns an iterator that generates zero-or-one `PreChunk` object and is used like so:
|
||||||
used like so:
|
|
||||||
|
|
||||||
yield from builder.flush()
|
yield from builder.flush()
|
||||||
|
|
||||||
@ -355,15 +356,13 @@ class PreChunkBuilder:
|
|||||||
boundary has been reached. Also to clear out a terminal pre-chunk at the end of an element
|
boundary has been reached. Also to clear out a terminal pre-chunk at the end of an element
|
||||||
stream.
|
stream.
|
||||||
"""
|
"""
|
||||||
if not self._elements:
|
elements = self._elements
|
||||||
|
|
||||||
|
if not elements:
|
||||||
return
|
return
|
||||||
|
|
||||||
pre_chunk = (
|
# -- copy element list, don't use original or it may change contents as builder proceeds --
|
||||||
TablePreChunk(self._elements[0], self._overlap_prefix, self._opts)
|
pre_chunk = PreChunk(elements, self._overlap_prefix, self._opts)
|
||||||
if isinstance(self._elements[0], Table)
|
|
||||||
# -- copy list, don't use original or it may change contents as builder proceeds --
|
|
||||||
else TextPreChunk(list(self._elements), self._overlap_prefix, self._opts)
|
|
||||||
)
|
|
||||||
# -- clear builder before yield so we're not sensitive to the timing of how/when this
|
# -- clear builder before yield so we're not sensitive to the timing of how/when this
|
||||||
# -- iterator is exhausted and can add elements for the next pre-chunk immediately.
|
# -- iterator is exhausted and can add elements for the next pre-chunk immediately.
|
||||||
self._reset_state(pre_chunk.overlap_tail)
|
self._reset_state(pre_chunk.overlap_tail)
|
||||||
@ -384,12 +383,6 @@ class PreChunkBuilder:
|
|||||||
# -- an empty pre-chunk will accept any element (including an oversized-element) --
|
# -- an empty pre-chunk will accept any element (including an oversized-element) --
|
||||||
if len(self._elements) == 0:
|
if len(self._elements) == 0:
|
||||||
return True
|
return True
|
||||||
# -- a `Table` will not fit in a non-empty pre-chunk --
|
|
||||||
if isinstance(element, Table):
|
|
||||||
return False
|
|
||||||
# -- no element will fit in a pre-chunk that already contains a `Table` element --
|
|
||||||
if isinstance(self._elements[0], Table):
|
|
||||||
return False
|
|
||||||
# -- a pre-chunk that already exceeds the soft-max is considered "full" --
|
# -- a pre-chunk that already exceeds the soft-max is considered "full" --
|
||||||
if self._text_length > self._opts.soft_max:
|
if self._text_length > self._opts.soft_max:
|
||||||
return False
|
return False
|
||||||
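With the two `Table` checks removed, the fit decision depends on length only. The sketch below illustrates that idea with plain strings. `SketchBuilder` is a hypothetical stand-in, not the library's `PreChunkBuilder`, and the final hard-max comparison is an assumption because the rest of `will_fit()` is not shown in this hunk.

```python
from dataclasses import dataclass, field


@dataclass
class SketchBuilder:
    """Hypothetical, string-only stand-in for a pre-chunk builder."""

    soft_max: int
    hard_max: int
    separator: str = "\n\n"
    texts: list = field(default_factory=list)

    @property
    def text_length(self) -> int:
        return len(self.separator.join(self.texts))

    def will_fit(self, text: str) -> bool:
        # -- an empty pre-chunk accepts any element, including an oversized one --
        if not self.texts:
            return True
        # -- a pre-chunk that already exceeds the soft-max is considered full --
        if self.text_length > self.soft_max:
            return False
        # -- assumed final check: the addition must stay within the hard-max --
        return self.text_length + len(self.separator) + len(text) <= self.hard_max


builder = SketchBuilder(soft_max=200, hard_max=200)
builder.texts.append("REQUEST FOR PROPOSALS")

# -- a table-sized text now joins the title instead of forcing a new pre-chunk --
table_like_text = "x" * 130
print(builder.will_fit(table_like_text))
```

Nothing in this sketch inspects the element type, which is the point of the change: a table is treated like any other text when deciding whether it fits.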
@ -429,19 +422,291 @@ class PreChunkBuilder:
|
|||||||
|
|
||||||
|
|
||||||
# ================================================================================================
|
# ================================================================================================
|
||||||
# PRE-CHUNK SUB-TYPES
|
# PRE-CHUNK
|
||||||
# ================================================================================================
|
# ================================================================================================
|
||||||
|
|
||||||
|
|
||||||
class TablePreChunk:
|
class PreChunk:
|
||||||
"""A pre-chunk composed of a single Table element."""
|
"""Sequence of elements staged to form a single chunk.
|
||||||
|
|
||||||
|
This object is purposely immutable.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self, elements: Iterable[Element], overlap_prefix: str, opts: ChunkingOptions
|
||||||
|
) -> None:
|
||||||
|
self._elements = list(elements)
|
||||||
|
self._overlap_prefix = overlap_prefix
|
||||||
|
self._opts = opts
|
||||||
|
|
||||||
|
def __eq__(self, other: Any) -> bool:
|
||||||
|
if not isinstance(other, PreChunk):
|
||||||
|
return False
|
||||||
|
return self._overlap_prefix == other._overlap_prefix and self._elements == other._elements
|
||||||
|
|
||||||
|
def can_combine(self, pre_chunk: PreChunk) -> bool:
|
||||||
|
"""True when `pre_chunk` can be combined with this one without exceeding size limits."""
|
||||||
|
if len(self._text) >= self._opts.combine_text_under_n_chars:
|
||||||
|
return False
|
||||||
|
# -- avoid duplicating length computations by doing a trial-combine which is just as
|
||||||
|
# -- efficient and definitely more robust than hoping two different computations of combined
|
||||||
|
# -- length continue to get the same answer as the code evolves. Only possible because
|
||||||
|
# -- `.combine()` is non-mutating.
|
||||||
|
combined_len = len(self.combine(pre_chunk)._text)
|
||||||
|
|
||||||
|
return combined_len <= self._opts.hard_max
|
||||||
|
|
||||||
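The trial-combine approach in `can_combine()` can be illustrated with a small immutable type. This is a sketch under stated assumptions: `SketchPreChunk` and its fields are hypothetical names, and only the shape of the logic follows the method above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchPreChunk:
    """Hypothetical immutable pre-chunk that holds plain text segments."""

    texts: tuple
    hard_max: int
    combine_under: int
    separator: str = "\n\n"

    @property
    def text(self) -> str:
        return self.separator.join(self.texts)

    def combine(self, other: "SketchPreChunk") -> "SketchPreChunk":
        # -- non-mutating: builds a new object, so a trial combine has no side effects --
        return SketchPreChunk(
            self.texts + other.texts, self.hard_max, self.combine_under, self.separator
        )

    def can_combine(self, other: "SketchPreChunk") -> bool:
        # -- a pre-chunk that is already large enough is left alone --
        if len(self.text) >= self.combine_under:
            return False
        # -- measure the real combined text rather than re-deriving its length --
        return len(self.combine(other).text) <= self.hard_max


small = SketchPreChunk(("Intro",), hard_max=20, combine_under=10)
short = SketchPreChunk(("Short",), hard_max=20, combine_under=10)
long_one = SketchPreChunk(("x" * 30,), hard_max=20, combine_under=10)

print(small.can_combine(short), small.can_combine(long_one))
```

Because the combined length comes from the same code path that later produces the chunk text, the check and the result cannot drift apart.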
|
def combine(self, other_pre_chunk: PreChunk) -> PreChunk:
|
||||||
|
"""Return new `PreChunk` that combines this and `other_pre_chunk`."""
|
||||||
|
# -- combined pre-chunk gets the overlap-prefix of the first pre-chunk. The second overlap
|
||||||
|
# -- is automatically incorporated at the end of the first chunk, where it originated.
|
||||||
|
return PreChunk(
|
||||||
|
self._elements + other_pre_chunk._elements,
|
||||||
|
overlap_prefix=self._overlap_prefix,
|
||||||
|
opts=self._opts,
|
||||||
|
)
|
||||||
|
|
||||||
|
def iter_chunks(self) -> Iterator[CompositeElement | Table | TableChunk]:
|
||||||
|
"""Form this pre-chunk into one or more chunk elements maxlen or smaller.
|
||||||
|
|
||||||
|
When the total size of the pre-chunk will fit in the chunking window, a single chunk is
|
||||||
|
emitted. When this pre-chunk contains an oversized element (always isolated), it is split
|
||||||
|
into two or more chunks that each fit the chunking window.
|
||||||
|
"""
|
||||||
|
|
||||||
|
# -- a one-table-only pre-chunk is handled specially, by `_TableChunker`, mainly because
|
||||||
|
# -- it may need to be split into multiple `TableChunk` elements and that operation is
|
||||||
|
# -- quite specialized.
|
||||||
|
if len(self._elements) == 1 and isinstance(self._elements[0], Table):
|
||||||
|
yield from _TableChunker.iter_chunks(
|
||||||
|
self._elements[0], self._overlap_prefix, self._opts
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
yield from _Chunker.iter_chunks(self._elements, self._text, self._opts)
|
||||||
|
|
||||||
|
@lazyproperty
|
||||||
|
def overlap_tail(self) -> str:
|
||||||
|
"""The portion of this chunk's text to be repeated as a prefix in the next chunk.
|
||||||
|
|
||||||
|
This value is the empty-string ("") when either the `.overlap` length option is `0` or
|
||||||
|
`.overlap_all` is `False`. When there is a text value, it is stripped of both leading and
|
||||||
|
trailing whitespace.
|
||||||
|
"""
|
||||||
|
overlap = self._opts.inter_chunk_overlap
|
||||||
|
return self._text[-overlap:].strip() if overlap else ""
|
||||||
|
|
||||||
|
def _iter_text_segments(self) -> Iterator[str]:
|
||||||
|
"""Generate overlap text and each element text segment in order.
|
||||||
|
|
||||||
|
Empty text segments are not included.
|
||||||
|
"""
|
||||||
|
if self._overlap_prefix:
|
||||||
|
yield self._overlap_prefix
|
||||||
|
for e in self._elements:
|
||||||
|
text = " ".join(e.text.strip().split())
|
||||||
|
if not text:
|
||||||
|
continue
|
||||||
|
yield text
|
||||||
|
|
||||||
|
@lazyproperty
|
||||||
|
def _text(self) -> str:
|
||||||
|
"""The concatenated text of all elements in this pre-chunk, including any overlap.
|
||||||
|
|
||||||
|
Whitespace is normalized to a single space. The text of each element is separated from
|
||||||
|
that of the next by a blank line ("\n\n").
|
||||||
|
"""
|
||||||
|
return self._opts.text_separator.join(self._iter_text_segments())
|
||||||
|
|
||||||
|
|
||||||
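The text assembly in `_iter_text_segments()`, `_text`, and `overlap_tail` can be tried in isolation with plain strings. The function names below are illustrative, and the `"\n\n"` default stands in for `opts.text_separator`. The whitespace normalization is consistent with the expected-output change earlier in this diff, where `A.\tPURPOSE` becomes `A. PURPOSE`.

```python
from typing import Iterable, Iterator


def iter_text_segments(overlap_prefix: str, element_texts: Iterable[str]) -> Iterator[str]:
    """Yield the overlap prefix (if any), then each non-empty normalized element text."""
    if overlap_prefix:
        yield overlap_prefix
    for raw in element_texts:
        # -- collapse every run of whitespace (tabs, newlines) to a single space --
        text = " ".join(raw.strip().split())
        if text:
            yield text


def pre_chunk_text(
    overlap_prefix: str, element_texts: Iterable[str], separator: str = "\n\n"
) -> str:
    """Join the segments with the separator to form the pre-chunk text."""
    return separator.join(iter_text_segments(overlap_prefix, element_texts))


def overlap_tail(text: str, overlap: int) -> str:
    """Last `overlap` characters of `text`, stripped, or "" when overlap is 0."""
    return text[-overlap:].strip() if overlap else ""


joined = pre_chunk_text("", ["A.\tPURPOSE", "   ", "two  words\n"])
tail = overlap_tail("alpha beta gamma", 6)
print(repr(joined), repr(tail))
```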
|
# ================================================================================================
|
||||||
|
# CHUNKING HELPER/SPLITTERS
|
||||||
|
# ================================================================================================
|
||||||
|
|
||||||
|
|
||||||
|
class _Chunker:
|
||||||
|
"""Forms chunks from a pre-chunk other than one containing only a `Table`.
|
||||||
|
|
||||||
|
Produces zero-or-more `CompositeElement` objects.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, elements: Iterable[Element], text: str, opts: ChunkingOptions) -> None:
|
||||||
|
self._elements = list(elements)
|
||||||
|
self._text = text
|
||||||
|
self._opts = opts
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def iter_chunks(
|
||||||
|
cls, elements: Iterable[Element], text: str, opts: ChunkingOptions
|
||||||
|
) -> Iterator[CompositeElement]:
|
||||||
|
"""Form zero or more chunks from `elements`.
|
||||||
|
|
||||||
|
One `CompositeElement` is produced when all `elements` will fit. Otherwise there is a
|
||||||
|
single `Text`-subtype element and chunks are formed by splitting.
|
||||||
|
"""
|
||||||
|
return cls(elements, text, opts)._iter_chunks()
|
||||||
|
|
||||||
|
def _iter_chunks(self) -> Iterator[CompositeElement]:
|
||||||
|
"""Form zero or more chunks from `elements`."""
|
||||||
|
# -- a pre-chunk containing no text (maybe only a PageBreak element for example) does not
|
||||||
|
# -- generate any chunks.
|
||||||
|
if not self._text:
|
||||||
|
return
|
||||||
|
|
||||||
|
# -- `split()` is the text-splitting function used to split an oversized element --
|
||||||
|
split = self._opts.split
|
||||||
|
|
||||||
|
# -- emit first chunk --
|
||||||
|
s, remainder = split(self._text)
|
||||||
|
yield CompositeElement(text=s, metadata=self._consolidated_metadata)
|
||||||
|
|
||||||
|
# -- an oversized pre-chunk will have a remainder, split that up into additional chunks.
|
||||||
|
# -- Note these get continuation_metadata which includes is_continuation=True.
|
||||||
|
while remainder:
|
||||||
|
s, remainder = split(remainder)
|
||||||
|
yield CompositeElement(text=s, metadata=self._continuation_metadata)
|
||||||
|
|
||||||
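The emit-then-drain loop in `_iter_chunks()` can be exercised with a toy splitter. The real `opts.split` is not shown in this diff, so `naive_split` below is a hypothetical fixed-width replacement, and each yielded pair carries a flag in place of the continuation metadata.

```python
from typing import Callable, Iterator, Tuple


def naive_split(maxlen: int) -> Callable[[str], Tuple[str, str]]:
    """Hypothetical splitter: cut at a fixed width and return (head, remainder)."""

    def split(text: str) -> Tuple[str, str]:
        return text[:maxlen].rstrip(), text[maxlen:].lstrip()

    return split


def iter_chunk_texts(
    text: str, split: Callable[[str], Tuple[str, str]]
) -> Iterator[Tuple[str, bool]]:
    """Yield (chunk_text, is_continuation) pairs for `text`."""
    # -- a pre-chunk with no text produces no chunks --
    if not text:
        return
    # -- first chunk is never a continuation --
    s, remainder = split(text)
    yield s, False
    # -- an oversized text leaves a remainder; keep splitting until it is used up --
    while remainder:
        s, remainder = split(remainder)
        yield s, True


chunks = list(iter_chunk_texts("abcdefghij", naive_split(4)))
print(chunks)
```

Only the second and later pieces are marked as continuations, which mirrors how the method above assigns `_continuation_metadata`.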
|
@lazyproperty
|
||||||
|
def _all_metadata_values(self) -> dict[str, list[Any]]:
|
||||||
|
"""Collection of all populated metadata values across elements.
|
||||||
|
|
||||||
|
The resulting dict has one key for each `ElementMetadata` field that had a non-None value in
|
||||||
|
at least one of the elements in this pre-chunk. The value of that key is a list of all those
|
||||||
|
populated values, in element order, for example:
|
||||||
|
|
||||||
|
{
|
||||||
|
"filename": ["sample.docx", "sample.docx"],
|
||||||
|
"languages": [["lat"], ["lat", "eng"]]
|
||||||
|
...
|
||||||
|
}
|
||||||
|
|
||||||
|
This preprocessing step provides the input for a specified consolidation strategy that will
|
||||||
|
resolve the list of values for each field to a single consolidated value.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def iter_populated_fields(metadata: ElementMetadata) -> Iterator[tuple[str, Any]]:
|
||||||
|
"""(field_name, value) pair for each non-None field in single `ElementMetadata`."""
|
||||||
|
return (
|
||||||
|
(field_name, value)
|
||||||
|
for field_name, value in metadata.known_fields.items()
|
||||||
|
if value is not None
|
||||||
|
)
|
||||||
|
|
||||||
|
field_values: DefaultDict[str, list[Any]] = collections.defaultdict(list)
|
||||||
|
|
||||||
|
# -- collect all non-None field values in a list for each field, in element-order --
|
||||||
|
for e in self._elements:
|
||||||
|
for field_name, value in iter_populated_fields(e.metadata):
|
||||||
|
field_values[field_name].append(value)
|
||||||
|
|
||||||
|
return dict(field_values)
|
||||||
|
|
||||||
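The collection step in `_all_metadata_values` can be reproduced with plain dicts. In this sketch each dict is a simplified stand-in for `ElementMetadata.known_fields`, and the function name is illustrative.

```python
import collections
from typing import Any, Dict, List


def all_metadata_values(metadatas: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Collect the non-None values of each field across elements, in element order."""
    field_values: Dict[str, List[Any]] = collections.defaultdict(list)
    for metadata in metadatas:
        for field_name, value in metadata.items():
            # -- unpopulated fields contribute nothing --
            if value is not None:
                field_values[field_name].append(value)
    return dict(field_values)


collected = all_metadata_values(
    [
        {"filename": "sample.docx", "languages": ["lat"], "page_number": None},
        {"filename": "sample.docx", "languages": ["lat", "eng"]},
    ]
)
print(collected)
```

A field that is `None` on every element never appears as a key, so a later consolidation step only sees fields that have at least one value.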
|
@lazyproperty
|
||||||
|
def _consolidated_metadata(self) -> ElementMetadata:
|
||||||
|
"""Metadata applicable to this pre-chunk as a single chunk.
|
||||||
|
|
||||||
|
Formed by applying consolidation rules to all metadata fields across the elements of this
|
||||||
|
pre-chunk.
|
||||||
|
|
||||||
|
For the sake of consistency, the same rules are applied (for example, for dropping values)
|
||||||
|
to a single-element pre-chunk too, even though metadata for such a pre-chunk is already
|
||||||
|
"consolidated".
|
||||||
|
"""
|
||||||
|
consolidated_metadata = ElementMetadata(**self._meta_kwargs)
|
||||||
|
if self._opts.include_orig_elements:
|
||||||
|
consolidated_metadata.orig_elements = self._orig_elements
|
||||||
|
return consolidated_metadata
|
||||||
|
|
||||||
|
@lazyproperty
|
||||||
|
def _continuation_metadata(self) -> ElementMetadata:
|
||||||
|
"""Metadata applicable to the second and later text-split chunks of the pre-chunk.
|
||||||
|
|
||||||
|
The same metadata as the first text-split chunk but includes `.is_continuation = True`.
|
||||||
|
Unused for non-oversized pre-chunks since those are not subject to text-splitting.
|
||||||
|
"""
|
||||||
|
# -- we need to make a copy, otherwise adding a field would also change metadata value
|
||||||
|
# -- already assigned to another chunk (e.g. the first text-split chunk). Deep-copy is not
|
||||||
|
# -- required though since we're not changing any collection fields.
|
||||||
|
continuation_metadata = copy.copy(self._consolidated_metadata)
|
||||||
|
continuation_metadata.is_continuation = True
|
||||||
|
return continuation_metadata
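The comment in `_continuation_metadata` argues that a shallow copy is enough because only a scalar field is assigned. A minimal sketch of that reasoning follows; the `Meta` dataclass is a hypothetical stand-in for `ElementMetadata`, not the library class.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Meta:
    """Hypothetical stand-in for `ElementMetadata` with one collection and one scalar field."""

    languages: list = field(default_factory=list)
    is_continuation: Optional[bool] = None


first = Meta(languages=["eng"])

# -- shallow copy: a new object whose fields reference the same values --
continuation = copy.copy(first)
continuation.is_continuation = True

# -- assigning a scalar on the copy leaves the original untouched --
print(first.is_continuation)
# -- the list is shared, which is harmless as long as nobody mutates it --
print(continuation.languages is first.languages)
```

If a later change started mutating a collection field on the copy, the shared list would make that change visible on the first chunk too, and a deep copy would be needed instead.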
|
||||||
|
|
||||||
|
@lazyproperty
|
||||||
|
def _meta_kwargs(self) -> dict[str, Any]:
|
||||||
|
"""The consolidated metadata values as a dict suitable for constructing ElementMetadata.
|
||||||
|
|
||||||
|
This is where consolidation strategies are actually applied. The output is suitable for use
|
||||||
|
in constructing an `ElementMetadata` object like `ElementMetadata(**self._meta_kwargs)`.
|
||||||
|
"""
|
||||||
|
CS = ConsolidationStrategy
|
||||||
|
field_consolidation_strategies = ConsolidationStrategy.field_consolidation_strategies()
|
||||||
|
|
||||||
|
def iter_kwarg_pairs() -> Iterator[tuple[str, Any]]:
|
||||||
|
"""Generate (field-name, value) pairs for each field in consolidated metadata."""
|
||||||
|
for field_name, values in self._all_metadata_values.items():
|
||||||
|
strategy = field_consolidation_strategies.get(field_name)
|
||||||
|
if strategy is CS.FIRST:
|
||||||
|
yield field_name, values[0]
|
||||||
|
# -- concatenate lists from each element that had one, in order --
|
||||||
|
elif strategy is CS.LIST_CONCATENATE:
|
||||||
|
yield field_name, sum(values, cast("list[Any]", []))
|
||||||
|
# -- union lists from each element, preserving order of appearance --
|
||||||
|
elif strategy is CS.LIST_UNIQUE:
|
||||||
|
# -- Python 3.7+ maintains dict insertion order --
|
||||||
|
ordered_unique_keys = {key: None for val_list in values for key in val_list}
|
||||||
|
yield field_name, list(ordered_unique_keys.keys())
|
||||||
|
elif strategy is CS.STRING_CONCATENATE:
|
||||||
|
yield field_name, " ".join(val.strip() for val in values)
|
||||||
|
elif strategy is CS.DROP:
|
||||||
|
continue
|
||||||
|
else: # pragma: no cover
|
||||||
|
# -- not likely to hit this since we have a test in `text_elements.py` that
|
||||||
|
# -- ensures every ElementMetadata field has an assigned strategy.
|
||||||
|
raise NotImplementedError(
|
||||||
|
f"metadata field {repr(field_name)} has no defined consolidation strategy"
|
||||||
|
)
|
||||||
|
|
||||||
|
return dict(iter_kwarg_pairs())
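The consolidation step in `_meta_kwargs` can be sketched in isolation as follows. This is a simplified illustration, not the library code: the `Strategy` enum and the `consolidate()` function are hypothetical stand-ins for `ConsolidationStrategy` and the nested `iter_kwarg_pairs()` generator, and the field-to-strategy mapping is invented for the example.

```python
from enum import Enum
from typing import Any


class Strategy(Enum):
    """Hypothetical stand-in for `ConsolidationStrategy`."""

    FIRST = "first"
    LIST_CONCATENATE = "list_concatenate"
    LIST_UNIQUE = "list_unique"
    STRING_CONCATENATE = "string_concatenate"
    DROP = "drop"


def consolidate(all_values: dict, strategies: dict) -> dict:
    """Resolve the list of per-element values for each field to a single value."""
    result: dict = {}
    for field_name, values in all_values.items():
        strategy = strategies[field_name]
        if strategy is Strategy.DROP:
            continue  # -- field is omitted from the consolidated metadata --
        if strategy is Strategy.FIRST:
            result[field_name] = values[0]
        elif strategy is Strategy.LIST_CONCATENATE:
            result[field_name] = [item for val_list in values for item in val_list]
        elif strategy is Strategy.LIST_UNIQUE:
            # -- dict keys keep insertion order, giving an order-preserving union --
            result[field_name] = list({item: None for val_list in values for item in val_list})
        elif strategy is Strategy.STRING_CONCATENATE:
            result[field_name] = " ".join(v.strip() for v in values)
    return result


consolidated: Any = consolidate(
    {
        "filename": ["sample.docx", "sample.docx"],
        "languages": [["lat"], ["lat", "eng"]],
        "coordinates": ["c1", "c2"],
    },
    {
        "filename": Strategy.FIRST,
        "languages": Strategy.LIST_UNIQUE,
        "coordinates": Strategy.DROP,
    },
)
print(consolidated)
```

With these inputs, `filename` resolves to the first value, `languages` to the ordered union `["lat", "eng"]`, and `coordinates` is dropped.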
|
||||||
|
|
||||||
|
@lazyproperty
|
||||||
|
def _orig_elements(self) -> list[Element]:
|
||||||
|
"""The `.metadata.orig_elements` value for chunks formed from this pre-chunk."""
|
||||||
|
|
||||||
|
def iter_orig_elements():
|
||||||
|
for e in self._elements:
|
||||||
|
if e.metadata.orig_elements is None:
|
||||||
|
yield e
|
||||||
|
continue
|
||||||
|
# -- make copy of any element we're going to mutate because these elements don't
|
||||||
|
# -- belong to us (the user may have downstream purposes for them).
|
||||||
|
orig_element = copy.copy(e)
|
||||||
|
# -- prevent recursive .orig_elements when element is a chunk (has orig-elements of
|
||||||
|
# -- its own)
|
||||||
|
orig_element.metadata.orig_elements = None
|
||||||
|
yield orig_element
|
||||||
|
|
||||||
|
return list(iter_orig_elements())
|
||||||
|
|
||||||
|
|
||||||
|
class _TableChunker:
|
||||||
|
"""Responsible for forming chunks, especially splits, from a single-table pre-chunk.
|
||||||
|
|
||||||
|
Table splitting is specialized because we recursively split on an even row, cell, text
|
||||||
|
boundary. This object encapsulates those details.
|
||||||
|
"""
|
||||||
|
|
||||||
def __init__(self, table: Table, overlap_prefix: str, opts: ChunkingOptions) -> None:
|
def __init__(self, table: Table, overlap_prefix: str, opts: ChunkingOptions) -> None:
|
||||||
self._table = table
|
self._table = table
|
||||||
self._overlap_prefix = overlap_prefix
|
self._overlap_prefix = overlap_prefix
|
||||||
self._opts = opts
|
self._opts = opts
|
||||||
|
|
||||||
def iter_chunks(self) -> Iterator[Table | TableChunk]:
|
@classmethod
|
||||||
|
def iter_chunks(
|
||||||
|
cls, table: Table, overlap_prefix: str, opts: ChunkingOptions
|
||||||
|
) -> Iterator[Table | TableChunk]:
|
||||||
|
"""Split this pre-chunk into `Table` or `TableChunk` objects maxlen or smaller."""
|
||||||
|
return cls(table, overlap_prefix, opts)._iter_chunks()
|
||||||
|
|
||||||
|
def _iter_chunks(self) -> Iterator[Table | TableChunk]:
|
||||||
"""Split this pre-chunk into `Table` or `TableChunk` objects maxlen or smaller."""
|
"""Split this pre-chunk into `Table` or `TableChunk` objects maxlen or smaller."""
|
||||||
# -- A table with no non-whitespace text produces no chunks --
|
# -- A table with no non-whitespace text produces no chunks --
|
||||||
if not self._table_text:
|
if not self._table_text:
|
||||||
@ -459,7 +724,7 @@ class TablePreChunk:
|
|||||||
|
|
||||||
# -- When there's no HTML, split it like a normal element. Also fall back to text-only
|
# -- When there's no HTML, split it like a normal element. Also fall back to text-only
|
||||||
# -- chunks when `max_characters` is less than 50. `.text_as_html` metadata is impractical
|
# -- chunks when `max_characters` is less than 50. `.text_as_html` metadata is impractical
|
||||||
# -- for a chunking window that small because the 33 characterss of HTML overhead for each
|
# -- for a chunking window that small because the 33 characters of HTML overhead for each
|
||||||
# -- chunk (`<table><tr><td>...</td></tr></table>`) would produce a very large number of
|
# -- chunk (`<table><tr><td>...</td></tr></table>`) would produce a very large number of
|
||||||
# -- very small chunks.
|
# -- very small chunks.
|
||||||
if not self._html or self._opts.hard_max < 50:
|
if not self._html or self._opts.hard_max < 50:
|
||||||
@ -469,17 +734,6 @@ class TablePreChunk:
|
|||||||
# -- otherwise, form splits with "synchronized" text and html --
|
# -- otherwise, form splits with "synchronized" text and html --
|
||||||
yield from self._iter_text_and_html_table_chunks()
|
yield from self._iter_text_and_html_table_chunks()
|
||||||
|
|
||||||
@lazyproperty
|
|
||||||
def overlap_tail(self) -> str:
|
|
||||||
"""The portion of this chunk's text to be repeated as a prefix in the next chunk.
|
|
||||||
|
|
||||||
This value is the empty-string ("") when either the `.overlap` length option is `0` or
|
|
||||||
`.overlap_all` is `False`. When there is a text value, it is stripped of both leading and
|
|
||||||
trailing whitespace.
|
|
||||||
"""
|
|
||||||
overlap = self._opts.inter_chunk_overlap
|
|
||||||
return self._text_with_overlap[-overlap:].strip() if overlap else ""
|
|
||||||
|
|
||||||
@lazyproperty
|
@lazyproperty
|
||||||
def _html(self) -> str:
|
def _html(self) -> str:
|
||||||
"""The compactified HTML for this table when it has text-as-HTML.
|
"""The compactified HTML for this table when it has text-as-HTML.
|
||||||
@ -517,7 +771,7 @@ class TablePreChunk:
|
|||||||
|
|
||||||
is_continuation = False
|
is_continuation = False
|
||||||
|
|
||||||
for text, html in _TableSplitter.iter_subtables(html_table, self._opts):
|
for text, html in _HtmlTableSplitter.iter_subtables(html_table, self._opts):
|
||||||
metadata = self._metadata
|
metadata = self._metadata
|
||||||
metadata.text_as_html = html
|
metadata.text_as_html = html
|
||||||
# -- second and later chunks get `.metadata.is_continuation = True` --
|
# -- second and later chunks get `.metadata.is_continuation = True` --
|
||||||
@ -527,7 +781,11 @@ class TablePreChunk:
|
|||||||
yield TableChunk(text=text, metadata=metadata)
|
yield TableChunk(text=text, metadata=metadata)
|
||||||
|
|
||||||
def _iter_text_only_table_chunks(self) -> Iterator[TableChunk]:
|
def _iter_text_only_table_chunks(self) -> Iterator[TableChunk]:
|
||||||
"""Split oversized text-only table (no text-as-html) into chunks."""
|
"""Split oversized text-only table (no text-as-html) into chunks.
|
||||||
|
|
||||||
|
`.metadata.text_as_html` is optional, not included when `infer_table_structure` is
|
||||||
|
`False`.
|
||||||
|
"""
|
||||||
text_remainder = self._text_with_overlap
|
text_remainder = self._text_with_overlap
|
||||||
split = self._opts.split
|
split = self._opts.split
|
||||||
is_continuation = False
|
is_continuation = False
|
||||||
@ -599,229 +857,12 @@ class TablePreChunk:
|
|||||||
return overlap_prefix + "\n" + table_text if overlap_prefix else table_text
|
return overlap_prefix + "\n" + table_text if overlap_prefix else table_text
|
||||||
|
|
||||||
|
|
||||||
class TextPreChunk:
|
|
||||||
"""A sequence of elements that belong to the same semantic unit within a document.
|
|
||||||
|
|
||||||
The name "section" derives from the idea of a document-section, a heading followed by the
|
|
||||||
paragraphs "under" that heading. That structure is not found in all documents and actual section
|
|
||||||
content can vary, but that's the concept.
|
|
||||||
|
|
||||||
This object is purposely immutable.
|
|
||||||
"""
|
|
||||||
|
|
||||||
def __init__(
|
|
||||||
self, elements: Iterable[Element], overlap_prefix: str, opts: ChunkingOptions
|
|
||||||
) -> None:
|
|
||||||
self._elements = list(elements)
|
|
||||||
self._overlap_prefix = overlap_prefix
|
|
||||||
self._opts = opts
|
|
||||||
|
|
||||||
def __eq__(self, other: Any) -> bool:
|
|
||||||
if not isinstance(other, TextPreChunk):
|
|
||||||
return False
|
|
||||||
return self._overlap_prefix == other._overlap_prefix and self._elements == other._elements
|
|
||||||
|
|
||||||
def can_combine(self, pre_chunk: TextPreChunk) -> bool:
|
|
||||||
"""True when `pre_chunk` can be combined with this one without exceeding size limits."""
|
|
||||||
if len(self._text) >= self._opts.combine_text_under_n_chars:
|
|
||||||
return False
|
|
||||||
# -- avoid duplicating length computations by doing a trial-combine which is just as
|
|
||||||
# -- efficient and definitely more robust than hoping two different computations of combined
|
|
||||||
# -- length continue to get the same answer as the code evolves. Only possible because
|
|
||||||
# -- `.combine()` is non-mutating.
|
|
||||||
combined_len = len(self.combine(pre_chunk)._text)
|
|
||||||
|
|
||||||
return combined_len <= self._opts.hard_max
|
|
||||||
|
|
||||||
def combine(self, other_pre_chunk: TextPreChunk) -> TextPreChunk:
|
|
||||||
"""Return new `TextPreChunk` that combines this and `other_pre_chunk`."""
|
|
||||||
# -- combined pre-chunk gets the overlap-prefix of the first pre-chunk. The second overlap
|
|
||||||
# -- is automatically incorporated at the end of the first chunk, where it originated.
|
|
||||||
return TextPreChunk(
|
|
||||||
self._elements + other_pre_chunk._elements,
|
|
||||||
overlap_prefix=self._overlap_prefix,
|
|
||||||
opts=self._opts,
|
|
||||||
)
|
|
||||||
|
|
||||||
def iter_chunks(self) -> Iterator[CompositeElement]:
|
|
||||||
"""Split this pre-chunk into one or more `CompositeElement` objects maxlen or smaller."""
|
|
||||||
# -- a pre-chunk containing no text (maybe only a PageBreak element for example) does not
|
|
||||||
# -- generate any chunks.
|
|
||||||
if not self._text:
|
|
||||||
return
|
|
||||||
|
|
||||||
split = self._opts.split
|
|
||||||
|
|
||||||
# -- emit first chunk --
|
|
||||||
s, remainder = split(self._text)
|
|
||||||
yield CompositeElement(text=s, metadata=self._consolidated_metadata)
|
|
||||||
|
|
||||||
# -- an oversized pre-chunk will have a remainder, split that up into additional chunks.
|
|
||||||
# -- Note these get continuation_metadata which includes is_continuation=True.
|
|
||||||
while remainder:
|
|
||||||
s, remainder = split(remainder)
|
|
||||||
yield CompositeElement(text=s, metadata=self._continuation_metadata)
|
|
||||||
|
|
||||||
@lazyproperty
|
|
||||||
def overlap_tail(self) -> str:
|
|
||||||
"""The portion of this chunk's text to be repeated as a prefix in the next chunk.
|
|
||||||
|
|
||||||
This value is the empty-string ("") when either the `.overlap` length option is `0` or
|
|
||||||
`.overlap_all` is `False`. When there is a text value, it is stripped of both leading and
|
|
||||||
trailing whitespace.
|
|
||||||
"""
|
|
||||||
overlap = self._opts.inter_chunk_overlap
|
|
||||||
return self._text[-overlap:].strip() if overlap else ""
|
|
||||||
|
|
||||||
@lazyproperty
|
|
||||||
def _all_metadata_values(self) -> dict[str, list[Any]]:
|
|
||||||
"""Collection of all populated metadata values across elements.
|
|
||||||
|
|
||||||
The resulting dict has one key for each `ElementMetadata` field that had a non-None value in
|
|
||||||
at least one of the elements in this pre-chunk. The value of that key is a list of all those
|
|
||||||
populated values, in element order, for example:
|
|
||||||
|
|
||||||
{
|
|
||||||
"filename": ["sample.docx", "sample.docx"],
|
|
||||||
"languages": [["lat"], ["lat", "eng"]]
|
|
||||||
...
|
|
||||||
}
|
|
||||||
|
|
||||||
This preprocessing step provides the input for a specified consolidation strategy that will
|
|
||||||
resolve the list of values for each field to a single consolidated value.
|
|
||||||
"""
|
|
||||||
|
|
||||||
def iter_populated_fields(metadata: ElementMetadata) -> Iterator[tuple[str, Any]]:
|
|
||||||
"""(field_name, value) pair for each non-None field in single `ElementMetadata`."""
|
|
||||||
return (
|
|
||||||
(field_name, value)
|
|
||||||
for field_name, value in metadata.known_fields.items()
|
|
||||||
if value is not None
|
|
||||||
)
|
|
||||||
|
|
||||||
field_values: DefaultDict[str, list[Any]] = collections.defaultdict(list)
|
|
||||||
|
|
||||||
# -- collect all non-None field values in a list for each field, in element-order --
|
|
||||||
for e in self._elements:
|
|
||||||
for field_name, value in iter_populated_fields(e.metadata):
|
|
||||||
field_values[field_name].append(value)
|
|
||||||
|
|
||||||
return dict(field_values)
|
|
||||||
|
|
||||||
@lazyproperty
|
|
||||||
def _consolidated_metadata(self) -> ElementMetadata:
|
|
||||||
"""Metadata applicable to this pre-chunk as a single chunk.
|
|
||||||
|
|
||||||
Formed by applying consolidation rules to all metadata fields across the elements of this
|
|
||||||
pre-chunk.
|
|
||||||
|
|
||||||
For the sake of consistency, the same rules are applied (for example, for dropping values)
|
|
||||||
to a single-element pre-chunk too, even though metadata for such a pre-chunk is already
|
|
||||||
"consolidated".
|
|
||||||
"""
|
|
||||||
consolidated_metadata = ElementMetadata(**self._meta_kwargs)
|
|
||||||
if self._opts.include_orig_elements:
|
|
||||||
consolidated_metadata.orig_elements = self._orig_elements
|
|
||||||
return consolidated_metadata
|
|
||||||
|
|
||||||
@lazyproperty
|
|
||||||
def _continuation_metadata(self) -> ElementMetadata:
|
|
||||||
"""Metadata applicable to the second and later text-split chunks of the pre-chunk.
|
|
||||||
|
|
||||||
The same metadata as the first text-split chunk but includes `.is_continuation = True`.
|
|
||||||
Unused for non-oversized pre-chunks since those are not subject to text-splitting.
|
|
||||||
"""
|
|
||||||
# -- we need to make a copy, otherwise adding a field would also change metadata value
|
|
||||||
# -- already assigned to another chunk (e.g. the first text-split chunk). Deep-copy is not
|
|
||||||
# -- required though since we're not changing any collection fields.
|
|
||||||
continuation_metadata = copy.copy(self._consolidated_metadata)
|
|
||||||
continuation_metadata.is_continuation = True
|
|
||||||
return continuation_metadata
|
|
||||||
|
|
||||||
def _iter_text_segments(self) -> Iterator[str]:
|
|
||||||
"""Generate overlap text and each element text segment in order.
|
|
||||||
|
|
||||||
Empty text segments are not included.
|
|
||||||
"""
|
|
||||||
if self._overlap_prefix:
|
|
||||||
yield self._overlap_prefix
|
|
||||||
for e in self._elements:
|
|
||||||
if not e.text:
|
|
||||||
continue
|
|
||||||
yield e.text
|
|
||||||
|
|
||||||
@lazyproperty
|
|
||||||
def _meta_kwargs(self) -> dict[str, Any]:
|
|
||||||
"""The consolidated metadata values as a dict suitable for constructing ElementMetadata.
|
|
||||||
|
|
||||||
This is where consolidation strategies are actually applied. The output is suitable for use
|
|
||||||
in constructing an `ElementMetadata` object like `ElementMetadata(**self._meta_kwargs)`.
|
|
||||||
"""
|
|
||||||
CS = ConsolidationStrategy
|
|
||||||
field_consolidation_strategies = ConsolidationStrategy.field_consolidation_strategies()
|
|
||||||
|
|
||||||
def iter_kwarg_pairs() -> Iterator[tuple[str, Any]]:
|
|
||||||
"""Generate (field-name, value) pairs for each field in consolidated metadata."""
|
|
||||||
for field_name, values in self._all_metadata_values.items():
|
|
||||||
strategy = field_consolidation_strategies.get(field_name)
|
|
||||||
if strategy is CS.FIRST:
|
|
||||||
yield field_name, values[0]
|
|
||||||
# -- concatenate lists from each element that had one, in order --
|
|
||||||
elif strategy is CS.LIST_CONCATENATE:
|
|
||||||
yield field_name, sum(values, cast("list[Any]", []))
|
|
||||||
# -- union lists from each element, preserving order of appearance --
|
|
||||||
elif strategy is CS.LIST_UNIQUE:
|
|
||||||
# -- Python 3.7+ maintains dict insertion order --
|
|
||||||
ordered_unique_keys = {key: None for val_list in values for key in val_list}
|
|
||||||
yield field_name, list(ordered_unique_keys.keys())
|
|
||||||
elif strategy is CS.STRING_CONCATENATE:
|
|
||||||
yield field_name, " ".join(val.strip() for val in values)
|
|
||||||
elif strategy is CS.DROP:
|
|
||||||
continue
|
|
||||||
else: # pragma: no cover
|
|
||||||
# -- not likely to hit this since we have a test in `text_elements.py` that
|
|
||||||
# -- ensures every ElementMetadata fields has an assigned strategy.
|
|
||||||
raise NotImplementedError(
|
|
||||||
f"metadata field {repr(field_name)} has no defined consolidation strategy"
|
|
||||||
)
|
|
||||||
|
|
||||||
return dict(iter_kwarg_pairs())
|
|
||||||
|
|
||||||
@lazyproperty
|
|
||||||
def _orig_elements(self) -> list[Element]:
|
|
||||||
"""The `.metadata.orig_elements` value for chunks formed from this pre-chunk."""
|
|
||||||
|
|
||||||
def iter_orig_elements():
|
|
||||||
for e in self._elements:
|
|
||||||
if e.metadata.orig_elements is None:
|
|
||||||
yield e
|
|
||||||
continue
|
|
||||||
# -- make copy of any element we're going to mutate because these elements don't
|
|
||||||
# -- belong to us (the user may have downstream purposes for them).
|
|
||||||
orig_element = copy.copy(e)
|
|
||||||
# -- prevent recursive .orig_elements when element is a chunk (has orig-elements of
|
|
||||||
# -- its own)
|
|
||||||
orig_element.metadata.orig_elements = None
|
|
||||||
yield orig_element
|
|
||||||
|
|
||||||
return list(iter_orig_elements())
|
|
||||||
|
|
||||||
@lazyproperty
|
|
||||||
def _text(self) -> str:
|
|
||||||
"""The concatenated text of all elements in this pre-chunk.
|
|
||||||
|
|
||||||
Each element-text is separated from the next by a blank line ("\n\n").
|
|
||||||
"""
|
|
||||||
text_separator = self._opts.text_separator
|
|
||||||
return text_separator.join(self._iter_text_segments())
|
|
||||||
|
|
||||||
|
|
||||||
# ================================================================================================
|
# ================================================================================================
|
||||||
# PRE-CHUNK SPLITTERS
|
# HTML SPLITTERS
|
||||||
# ================================================================================================
|
# ================================================================================================
|
||||||
|
|
||||||
|
|
||||||
class _TableSplitter:
|
class _HtmlTableSplitter:
|
||||||
"""Produces (text, html) pairs for a `<table>` HtmlElement.
|
"""Produces (text, html) pairs for a `<table>` HtmlElement.
|
||||||
|
|
||||||
Each chunk contains a whole number of rows whenever possible. An oversized row is split on an
|
Each chunk contains a whole number of rows whenever possible. An oversized row is split on an
|
||||||
@ -1040,7 +1081,7 @@ class _CellAccumulator:
|
|||||||
|
|
||||||
def will_fit(self, cell: HtmlCell) -> bool:
|
def will_fit(self, cell: HtmlCell) -> bool:
|
||||||
"""True when `cell` will fit within remaining space left by accumulated cells."""
|
"""True when `cell` will fit within remaining space left by accumulated cells."""
|
||||||
return self._remaining_space >= len(cell.html)
|
return self._remaining_space >= len(cell.text)
|
||||||
|
|
||||||
def _iter_cell_texts(self) -> Iterator[str]:
|
def _iter_cell_texts(self) -> Iterator[str]:
|
||||||
"""Generate contents of each accumulated cell as a separate string.
|
"""Generate contents of each accumulated cell as a separate string.
|
||||||
@ -1054,10 +1095,11 @@ class _CellAccumulator:
|
|||||||
|
|
||||||
@property
|
@property
|
||||||
def _remaining_space(self) -> int:
|
def _remaining_space(self) -> int:
|
||||||
"""Number of characters remaining when accumulated cells are formed into HTML."""
|
"""Number of characters remaining when text of accumulated cells is joined."""
|
||||||
# -- 24 is `len("<table><tr></tr></table>")`, the overhead in addition to `<td>`
|
# -- separators are one space (" ") at the end of each cell's text, including last one to
|
||||||
# -- HTML fragments
|
# -- account for space before prospective next cell.
|
||||||
return self._maxlen - 24 - sum(len(c.html) for c in self._cells)
|
separators_len = len(self._cells)
|
||||||
|
return self._maxlen - separators_len - sum(len(c.text) for c in self._cells)
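The text-based space accounting that replaces the HTML-based one can be sketched with plain strings. This is a minimal illustration under the assumption that each accumulated cell contributes its text plus one separator space; `remaining_space()` and `will_fit()` here are hypothetical free functions, not the methods of `_CellAccumulator`.

```python
def remaining_space(maxlen: int, cell_texts: list) -> int:
    """Characters left in the window after joining `cell_texts` with single spaces."""
    # -- one separator per accumulated cell, including the last one, to reserve the
    # -- space that would precede a prospective next cell.
    separators_len = len(cell_texts)
    return maxlen - separators_len - sum(len(t) for t in cell_texts)


def will_fit(maxlen: int, cell_texts: list, next_text: str) -> bool:
    """True when `next_text` fits in the space left by the accumulated cell texts."""
    return remaining_space(maxlen, cell_texts) >= len(next_text)


# -- "abc de " occupies 7 of 20 characters, leaving 13 --
print(remaining_space(20, ["abc", "de"]))
print(will_fit(20, ["abc", "de"], "x" * 13))
print(will_fit(20, ["abc", "de"], "x" * 14))
```

Because HTML tag overhead no longer counts toward the limit, the same window admits more cells than it did under the previous `len(cell.html)` criterion.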
|
||||||
|
|
||||||
|
|
||||||
class _RowAccumulator:
|
class _RowAccumulator:
|
||||||
@ -1087,7 +1129,7 @@ class _RowAccumulator:
|
|||||||
|
|
||||||
def will_fit(self, row: HtmlRow) -> bool:
|
def will_fit(self, row: HtmlRow) -> bool:
|
||||||
"""True when `row` will fit within remaining space left by accumulated rows."""
|
"""True when `row` will fit within remaining space left by accumulated rows."""
|
||||||
return self._remaining_space >= len(row.html)
|
return self._remaining_space >= row.text_len
|
||||||
|
|
||||||
def _iter_cell_texts(self) -> Iterator[str]:
|
def _iter_cell_texts(self) -> Iterator[str]:
|
||||||
"""Generate contents of each row cell as a separate string.
|
"""Generate contents of each row cell as a separate string.
|
||||||
@ -1100,8 +1142,10 @@ class _RowAccumulator:
|
|||||||
@property
|
@property
|
||||||
def _remaining_space(self) -> int:
|
def _remaining_space(self) -> int:
|
||||||
"""Number of characters remaining when accumulated rows are formed into HTML."""
|
"""Number of characters remaining when text of accumulated rows is joined."""
|
||||||
# -- 15 is `len("<table></table>")`, the overhead in addition to `<tr>` HTML fragments --
|
# -- separators are one space (" ") at the end of each row's text, including last one to
|
||||||
return self._maxlen - 15 - sum(len(r.html) for r in self._rows)
|
# -- account for space before prospective next row.
|
||||||
|
separators_len = len(self._rows)
|
||||||
|
return self._maxlen - separators_len - sum(r.text_len for r in self._rows)
|
||||||
|
|
||||||
|
|
||||||
# ================================================================================================
|
# ================================================================================================
|
||||||
@ -1117,16 +1161,10 @@ class PreChunkCombiner:
|
|||||||
self._opts = opts
|
self._opts = opts
|
||||||
|
|
||||||
def iter_combined_pre_chunks(self) -> Iterator[PreChunk]:
|
def iter_combined_pre_chunks(self) -> Iterator[PreChunk]:
|
||||||
"""Generate pre-chunk objects, combining TextPreChunk objects when they'll fit in window."""
|
"""Generate pre-chunk objects, combining `PreChunk` objects when they'll fit in window."""
|
||||||
accum = TextPreChunkAccumulator(self._opts)
|
accum = _PreChunkAccumulator(self._opts)
|
||||||
|
|
||||||
for pre_chunk in self._pre_chunks:
|
for pre_chunk in self._pre_chunks:
|
||||||
# -- a table pre-chunk is never combined --
|
|
||||||
if isinstance(pre_chunk, TablePreChunk):
|
|
||||||
yield from accum.flush()
|
|
||||||
yield pre_chunk
|
|
||||||
continue
|
|
||||||
|
|
||||||
# -- finish accumulating pre-chunk when it's full --
|
# -- finish accumulating pre-chunk when it's full --
|
||||||
if not accum.will_fit(pre_chunk):
|
if not accum.will_fit(pre_chunk):
|
||||||
yield from accum.flush()
|
yield from accum.flush()
|
||||||
@ -1136,39 +1174,37 @@ class PreChunkCombiner:
|
|||||||
yield from accum.flush()
|
yield from accum.flush()
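The accumulate-then-flush loop in `iter_combined_pre_chunks()` can be sketched with plain strings standing in for pre-chunks. This is a simplified illustration: `combine_when_fits()` is a hypothetical function, it only checks the hard maximum, and it leaves out the `combine_text_under_n_chars` threshold that the real `can_combine()` also applies.

```python
from typing import Iterable, Iterator


def combine_when_fits(
    texts: Iterable[str], maxlen: int, separator: str = "\n\n"
) -> Iterator[str]:
    """Generate combined texts, joining neighbors while the result fits in `maxlen`."""
    accum: list = []
    for text in texts:
        # -- trial-combine to measure, flush first when the candidate would be too big --
        candidate = separator.join(accum + [text])
        if accum and len(candidate) > maxlen:
            yield separator.join(accum)
            accum = []
        accum.append(text)
    # -- flush whatever remains after the last item --
    if accum:
        yield separator.join(accum)


combined = list(combine_when_fits(["aaa", "bbb", "cccc"], maxlen=8))
print(combined)
```

After this commit a table pre-chunk goes through the same loop as any other pre-chunk rather than forcing a flush, which is what allows a `Table` to share a chunk with neighboring text.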
|
||||||
|
|
||||||
|
|
||||||
class TextPreChunkAccumulator:
|
class _PreChunkAccumulator:
|
||||||
"""Accumulates, measures, and combines text pre-chunks.
|
"""Accumulates, measures, and combines pre-chunks.
|
||||||
|
|
||||||
Used for combining pre-chunks for chunking strategies like "by-title" that can potentially
|
Used for combining pre-chunks for chunking strategies like "by-title" that can potentially
|
||||||
produce undersized chunks and offer the `combine_text_under_n_chars` option. Note that only
|
produce undersized chunks and offer the `combine_text_under_n_chars` option.
|
||||||
sequential `TextPreChunk` objects can be combined. A `TablePreChunk` is never combined with
|
|
||||||
another pre-chunk.
|
|
||||||
|
|
||||||
Provides `.add_pre_chunk()` allowing a pre-chunk to be added to the chunk and provides
|
Provides `.add_pre_chunk()` allowing a pre-chunk to be added to the chunk and provides
|
||||||
monitoring properties `.remaining_space` and `.text_length` suitable for deciding whether to add
|
monitoring properties `.remaining_space` and `.text_length` suitable for deciding whether to add
|
||||||
another pre-chunk.
|
another pre-chunk.
|
||||||
|
|
||||||
`.flush()` is used to combine the accumulated pre-chunks into a single `TextPreChunk` object.
|
`.flush()` is used to combine the accumulated pre-chunks into a single `PreChunk` object.
|
||||||
This method returns an iterator that generates zero-or-one `TextPreChunk` objects and is used
|
This method returns an iterator that generates zero-or-one `PreChunk` objects and is used
|
||||||
like so:
|
like so:
|
||||||
|
|
||||||
yield from accum.flush()
|
yield from accum.flush()
|
||||||
|
|
||||||
If no pre-chunks have been accumulated, no `TextPreChunk` is generated. Flushing the builder
|
If no pre-chunks have been accumulated, no `PreChunk` is generated. Flushing the builder
|
||||||
clears the pre-chunks it contains so it is ready to accept the next text-pre-chunk.
|
clears the pre-chunks it contains so it is ready to accept the next pre-chunk.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
def __init__(self, opts: ChunkingOptions) -> None:
|
def __init__(self, opts: ChunkingOptions) -> None:
|
||||||
self._opts = opts
|
self._opts = opts
|
||||||
self._pre_chunk: TextPreChunk | None = None
|
self._pre_chunk: PreChunk | None = None
|
||||||
|
|
||||||
def add_pre_chunk(self, pre_chunk: TextPreChunk) -> None:
|
def add_pre_chunk(self, pre_chunk: PreChunk) -> None:
|
||||||
"""Add a pre-chunk to the accumulator for possible combination with next pre-chunk."""
|
"""Add a pre-chunk to the accumulator for possible combination with next pre-chunk."""
|
||||||
self._pre_chunk = (
|
self._pre_chunk = (
|
||||||
pre_chunk if self._pre_chunk is None else self._pre_chunk.combine(pre_chunk)
|
pre_chunk if self._pre_chunk is None else self._pre_chunk.combine(pre_chunk)
|
||||||
)
|
)
|
||||||
|
|
||||||
def flush(self) -> Iterator[TextPreChunk]:
|
def flush(self) -> Iterator[PreChunk]:
|
||||||
"""Generate accumulated pre-chunk as a single combined pre-chunk.
|
"""Generate accumulated pre-chunk as a single combined pre-chunk.
|
||||||
|
|
||||||
Does not generate a pre-chunk when none has been accumulated.
|
Does not generate a pre-chunk when none has been accumulated.
|
||||||
@ -1181,7 +1217,7 @@ class TextPreChunkAccumulator:
|
|||||||
# -- and reset the accumulator (to empty) --
|
# -- and reset the accumulator (to empty) --
|
||||||
self._pre_chunk = None
|
self._pre_chunk = None
|
||||||
|
|
||||||
def will_fit(self, pre_chunk: TextPreChunk) -> bool:
|
def will_fit(self, pre_chunk: PreChunk) -> bool:
|
||||||
"""True when there is room for `pre_chunk` in accumulator.
|
"""True when there is room for `pre_chunk` in accumulator.
|
||||||
|
|
||||||
An empty accumulator always has room. Otherwise there is only room when `pre_chunk` can be
|
An empty accumulator always has room. Otherwise there is only room when `pre_chunk` can be
|
||||||
@ -1206,7 +1242,7 @@ class TextPreChunkAccumulator:
|
|||||||
# predicate.
|
# predicate.
|
||||||
#
|
#
|
||||||
# These can be mixed and matched to produce different chunking behaviors like "by_title" or left
|
# These can be mixed and matched to produce different chunking behaviors like "by_title" or left
|
||||||
# out altogether to produce "by_element" behavior.
|
# out altogether to produce "basic-chunking" behavior.
|
||||||
#
|
#
|
||||||
# The effective lifetime of the functions that produce a predicate (rather than directly being one)
|
# The effective lifetime of the functions that produce a predicate (rather than directly being one)
|
||||||
# is limited to a single element-stream because these retain state (e.g. current page number) to
|
# is limited to a single element-stream because these retain state (e.g. current page number) to
|
||||||
|
|||||||
@ -136,11 +136,15 @@ class HtmlRow:
|
|||||||
for td in self._tr:
|
for td in self._tr:
|
||||||
if (text := td.text) is None:
|
if (text := td.text) is None:
|
||||||
continue
|
continue
|
||||||
text = text.strip()
|
|
||||||
if not text:
|
if not text:
|
||||||
continue
|
continue
|
||||||
yield text
|
yield text
|
||||||
|
|
||||||
|
@lazyproperty
|
||||||
|
def text_len(self) -> int:
|
||||||
|
"""Length of the normalized text, as it would appear in `element.text`."""
|
||||||
|
return len(" ".join(self.iter_cell_texts()))
|
||||||
|
|
||||||
|
|
||||||
class HtmlCell:
|
class HtmlCell:
|
||||||
"""A `<td>` element."""
|
"""A `<td>` element."""
|
||||||
@ -158,4 +162,4 @@ class HtmlCell:
|
|||||||
"""Text inside `<td>` element, empty string when no text."""
|
"""Text inside `<td>` element, empty string when no text."""
|
||||||
if (text := self._td.text) is None:
|
if (text := self._td.text) is None:
|
||||||
return ""
|
return ""
|
||||||
return text.strip()
|
return " ".join(text.strip().split())
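The change to `HtmlCell.text` collapses internal whitespace runs in addition to trimming the ends, so the measured length matches what would appear in `element.text`. A minimal sketch of that normalization, using a hypothetical free function in place of the property:

```python
from typing import Optional


def normalize_cell_text(text: Optional[str]) -> str:
    """Trim `text` and collapse each run of whitespace to a single space."""
    if text is None:
        return ""
    # -- `str.split()` with no argument splits on any whitespace run and drops empties --
    return " ".join(text.strip().split())


print(repr(normalize_cell_text("  Lorem \n   ipsum\tdolor  ")))
print(repr(normalize_cell_text(None)))
```

The earlier `text.strip()` would have left the embedded newline and tab in place, making the cell text longer than its normalized form.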
|
||||||
|
|||||||