15 Commits

Author SHA1 Message Date
Benjamin Torres
05c3cd1be2
feat: clean pdfminer elements inside tables (#1808)
This PR introduces `clean_pdfminer_inner_elements`, which deletes
pdfminer elements nested inside elements from other detection origins
such as YoloX or detectron. The function returns the cleaned document.

Also, the ingest-test fixtures were updated to reflect the new standard
output.

The best way to check that this function works properly is to check the
new test `test_clean_pdfminer_inner_elements` in
`test_unstructured/partition/utils/test_processing_elements.py`.
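
For intuition, a minimal sketch of the cleaning idea, assuming elements
carry a detection source and a bounding box (`document.pages`,
`e.source`, and `e.bbox` are illustrative names, not the actual
implementation):

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def _contains(outer: Box, inner: Box) -> bool:
    """True when `inner` lies entirely within `outer`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )

def clean_pdfminer_inner_elements(document):
    """Drop pdfminer-origin elements nested inside non-pdfminer detections."""
    for page in document.pages:
        model_regions = [e for e in page.elements if e.source != "pdfminer"]
        page.elements = [
            e
            for e in page.elements
            if e.source != "pdfminer"
            or not any(_contains(r.bbox, e.bbox) for r in model_regions)
        ]
    return document
```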

---------

Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2023-10-30 07:10:51 +00:00
Steve Canny
7373391aa4
fix: sectioner dissociated titles from their chunk (#1861)
### disassociated-titles

**Executive Summary**. Section titles are often combined with the prior
section and then missing from the section they belong to.

_Chunk combination_ is a behavior in which two successive small chunks
are combined into a single chunk that better fills the chunk window.
Chunking can be, and by default is, configured to combine sequential
small chunks that will together fit within the full chunk window
(default 500 chars).

Combination is only valid for "whole" chunks. The current implementation
attempts to combine at the element level (in the sectioner), meaning a
small initial element (such as a `Title`) is combined with the prior
section without considering the remaining length of the section that
title belongs to. This frequently causes a title element to be removed
from the chunk it belongs to and added to the prior, otherwise
unrelated, chunk.

Example:
```python
elements: List[Element] = [
    Title("Lorem Ipsum"),  # 11
    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
    Title("Rhoncus"),  # 7
    Text("In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."),  # 57
]

chunks = chunk_by_title(elements, max_characters=80, combine_text_under_n_chars=80)

# -- want --------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('Rhoncus\n\nIn rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')

# -- got ---------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nRhoncus')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')
```

**Technical Summary.** Combination cannot be effectively performed at
the element level, at least not without complicating things with
arbitrary look-ahead into future elements. Much more straightforward is
to combine sections once they have been formed from the element stream.

**Fix.** Introduce an intermediate stream processor that accepts a
stream of sections and emits a stream of sometimes-combined sections.

The solution implemented in this PR builds upon introducing `_Section`
objects to replace the `List[Element]` primitive used previously:

- `_TextSection` gets the `.combine()` method and `.text_length`
property which allows a combining client to produce a combined section
(only text-sections are ever combined).
- `_SectionCombiner` is introduced to encapsulate the logic of
combination, acting as a "filter", accepting a stream of sections and
emitting the same type, just with some resulting from two or more
combined input sections: `(Iterable[_Section]) -> Iterator[_Section]`.
- `_TextSectionAccumulator` is a helper to `_SectionCombiner` that takes
responsibility for repeatedly accumulating sections, characterizing
their length and doing the actual combining (calling
`_Section.combine(other_section)`) when instructed. Very similar in
concept to `_TextSectionBuilder`, just at the section level instead of
element level.
- Remove attempts to combine sections at the element level from
`_split_elements_by_title_and_table()` and install `_SectionCombiner` as
filter between sectioner and chunker.
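
A simplified sketch of the combiner filter described above, assuming the
`_TextSection` type from the PR (separator-length bookkeeping and the
full `_Section` interfaces are omitted):

```python
from typing import Iterable, Iterator, Optional

def _combine_sections(
    sections: Iterable["_Section"],
    combine_text_under_n_chars: int,
) -> Iterator["_Section"]:
    """Sketch of `_SectionCombiner`: accept a stream of sections, emit the
    same type, merging successive small text-sections that fit the window."""
    pending: Optional["_TextSection"] = None
    for section in sections:
        # -- only text-sections are ever combined --
        if not isinstance(section, _TextSection):
            if pending is not None:
                yield pending
                pending = None
            yield section
            continue
        if pending is None:
            pending = section
        elif pending.text_length + section.text_length < combine_text_under_n_chars:
            # -- whole sections combine, so a title stays with its section --
            pending = pending.combine(section)
        else:
            yield pending
            pending = section
    if pending is not None:
        yield pending
```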
2023-10-30 04:20:27 +00:00
Steve Canny
f273a7cb83
fix: sectioner does not consider separator length (#1858)
### sectioner-does-not-consider-separator-length

**Executive Summary.** A primary responsibility of the sectioner is to
minimize the number of chunks that need to be split mid-text. It does
this by computing text-length of the section being formed and
"finishing" the section when adding another element would extend its
text beyond the window size.

When element-text is consolidated into a chunk, the text of each element
is joined, separated by a "blank-line" (`"\n\n"`). The sectioner does
not currently consider the added length of separators (2-chars each) and
so forms sections that need to be split mid-text when chunked.

Chunk-splitting should only be necessary when the text of a single
element is longer than the chunking window.

**Example**

  ```python
    elements: List[Element] = [
        Title("Chunking Priorities"),  # 19 chars
        ListItem("Divide text into manageable chunks"),  # 34 chars
        ListItem("Preserve semantic boundaries"),  # 28 chars
        ListItem("Minimize mid-text chunk-splitting"),  # 33 chars
    ]  # 114 chars total but 120 chars with separators

    chunks = chunk_by_title(elements, max_characters=115)
  ```

  Want:

  ```python
[
    CompositeElement(
        "Chunking Priorities"
        "\n\nDivide text into manageable chunks"
        "\n\nPreserve semantic boundaries"
    ),
    CompositeElement("Minimize mid-text chunk-splitting"),
]
  ```

  Got:

  ```python
[
    CompositeElement(
        "Chunking Priorities"
        "\n\nDivide text into manageable chunks"
        "\n\nPreserve semantic boundaries"
        "\n\nMinimize mid-text chunk-spli"
    ),
    CompositeElement("tting"),
]
  ```

### Technical Summary

Because the sectioner does not consider separator (`"\n\n"`) length when
it computes the space remaining in the section, it over-populates the
section. When the chunker then concatenates the element text (each
separated by the separator), the text exceeds the window length and the
chunk must be split mid-text, even though there was an even element
boundary it could have been split on.

### Fix

Consider separator length in the space-remaining computation.

The solution here extracts both the `section.text_length` and
`section.space_remaining` computations to a `_TextSectionBuilder` object
which removes the need for the sectioner
(`_split_elements_by_title_and_table()`) to deal with primitives
(`List[Element]`, running text length, separator length, etc.) and
allows it to focus on the rules of when to start a new section.

This solution may seem like overkill at the moment, and indeed it would
be, except that it forms the foundation for adding section-level chunk
combination (fix: dissociated title elements) in the next PR. The
objects introduced here will gain several additional responsibilities in
the next few chunking PRs in the pipeline and will earn their place.
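
A sketch of the separator-aware computation (simplified; the real
`_TextSectionBuilder` carries more responsibilities than shown here):

```python
from typing import List

class _TextSectionBuilder:
    """Accumulates elements for a section and reports the space remaining,
    accounting for the "\n\n" separator each additional element adds."""

    SEPARATOR_LEN = len("\n\n")  # 2 chars between consecutive element texts

    def __init__(self, maxlen: int) -> None:
        self._maxlen = maxlen
        self._texts: List[str] = []

    @property
    def text_length(self) -> int:
        """Length the consolidated text will have, separators included."""
        separators_len = self.SEPARATOR_LEN * max(len(self._texts) - 1, 0)
        return sum(len(t) for t in self._texts) + separators_len

    @property
    def space_remaining(self) -> int:
        """Chars available for the next element's text, separator included."""
        if not self._texts:
            return self._maxlen
        return self._maxlen - self.text_length - self.SEPARATOR_LEN

    def add_text(self, text: str) -> None:
        self._texts.append(text)
```

On the example above (`max_characters=115`), after adding the 19, 34,
and 28 char elements the builder reports 85 chars used and only 28
remaining, so the 33-char list item correctly starts a new section.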
2023-10-26 21:34:15 +00:00
qued
808b4ced7a
build(deps): remove ebooklib (#1878)
* **Removed `ebooklib` as a dependency** `ebooklib` is licensed under
AGPL3, which is incompatible with the Apache 2.0 license. Thus it is
being removed.
2023-10-26 12:22:40 -05:00
Steve Canny
40a265d027
fix: chunk_by_title() interface is rude (#1844)
### `chunk_by_title()` interface is "rude"

**Executive Summary.** Perhaps the most commonly specified option for
`chunk_by_title()` is `max_characters` (default: 500), which specifies
the chunk window size.

When a user specifies this value, they get an error message:
```python
  >>> chunks = chunk_by_title(elements, max_characters=100)
  ValueError: Invalid values for combine_text_under_n_chars, new_after_n_chars, and/or max_characters.
```
A few of the things that might reasonably pass through a user's mind at
such a moment are:
* "Is `110` not a valid value for `max_characters`? Why would that be?"
* "I didn't specify a value for `combine_text_under_n_chars` or
`new_after_n_chars`, in fact I don't know what they are because I
haven't studied the documentation and would prefer not to; I just want
smaller chunks! How could I supply an invalid value when I haven't
supplied any value at all for these?"
* "Which of these values is the problem? Why are you making me figure
that out for myself? I'm sure the code knows which one is not valid, why
doesn't it share that information with me? I'm busy here!"

In this particular case, the problem is that
`combine_text_under_n_chars` (defaults to 500) is greater than
`max_characters`, which means it would never take effect (which is
actually not a problem in itself).

To fix this, once figuring out that was the problem, probably after
opening an issue and maybe reading the source code, the user would need
to specify:
  ```python
  >>> chunks = chunk_by_title(
  ...     elements, max_characters=100, combine_text_under_n_chars=100
  ... )
  ```

This and other stressful user scenarios can be remedied by:
* Using "active" defaults for the `combine_text_under_n_chars` and
`new_after_n_chars` options.
* Providing a specific error message for each way a constraint may be
violated, such that direction to remedy the problem is immediately clear
to the user.

An *active default* is for example:
* Make the default for `combine_text_under_n_chars: int | None = None`
such that the code can detect when it has not been specified.
* When not specified, set its value to `max_characters`, the same as its
current (static) default.
This particular change would avoid the behavior in the motivating
example above.

Another alternative for this argument is simply:
```python
combine_text_under_n_chars = min(max_characters, combine_text_under_n_chars)
```

### Fix

1. Add constraint-specific error messages.
2. Use "active" defaults for `combine_text_under_n_ chars` and
`new_after_n_chars`.
3. Improve docstring to describe active defaults, and explain other
argument behaviors, in particular identifying suppression options like
`combine_text_under_n_chars = 0` to disable chunk combining.
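
A sketch of the active-default pattern together with constraint-specific
messages (the exact message wording and validation order here are
illustrative):

```python
from typing import List, Optional

def chunk_by_title(
    elements: List["Element"],
    max_characters: int = 500,
    combine_text_under_n_chars: Optional[int] = None,
    new_after_n_chars: Optional[int] = None,
) -> List["CompositeElement"]:
    # -- active defaults: None means "not specified", so the default can
    # -- track max_characters instead of being frozen at 500 --
    if combine_text_under_n_chars is None:
        combine_text_under_n_chars = max_characters
    if new_after_n_chars is None:
        new_after_n_chars = max_characters

    # -- constraint-specific messages name the offending argument --
    if combine_text_under_n_chars > max_characters:
        raise ValueError(
            f"'combine_text_under_n_chars' argument must not exceed"
            f" 'max_characters', got {combine_text_under_n_chars} > {max_characters}"
        )
    ...
```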
2023-10-24 23:22:38 +00:00
Amanda Cameron
0584e1d031
chore: fix infer_table bug (#1833)
Carrying `skip_infer_table_types` through to `infer_table_structure` in
the partition flow. Now `Table` elements from PPT/X, DOC/X, etc. should
not have a `text_as_html` field.

Note: I've continued to exclude this var from partitioners that go
through the HTML flow; if we've already got the HTML it doesn't make
sense to carry the infer variable along, since we're not 'infer-ing' the
HTML table in these cases.


TODO:
  add unit tests

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: amanda103 <amanda103@users.noreply.github.com>
2023-10-24 00:11:53 +00:00
Steve Canny
82c8adba3f
fix: split-chunks appear out-of-order (#1824)
**Executive Summary.** Code inspection in preparation for adding the
chunk-overlap feature revealed a bug causing split-chunks to be inserted
out-of-order. For example, elements like this:

```
Text("One" + 400 chars)
Text("Two" + 400 chars)
Text("Three" + 600 chars)
Text("Four" + 400 chars)
Text("Five" + 600 chars)
```

Should produce chunks:
```
CompositeElement("One ...")           # (400 chars)
CompositeElement("Two ...")           # (400 chars)
CompositeElement("Three ...")         # (500 chars)
CompositeElement("rest of Three ...") # (100 chars)
CompositeElement("Four")              # (400 chars)
CompositeElement("Five ...")          # (500 chars)
CompositeElement("rest of Five ...")  # (100 chars)
```

but produced this instead:
```
CompositeElement("Five ...")          # (500 chars)
CompositeElement("rest of Five ...")  # (100 chars)
CompositeElement("Three ...")         # (500 chars)
CompositeElement("rest of Three ...") # (100 chars)
CompositeElement("One ...")           # (400 chars)
CompositeElement("Two ...")           # (400 chars)
CompositeElement("Four")              # (400 chars)
```

This PR fixes that behavior, which was introduced on Oct 9 this year in
commit f98d5e65 when chunk splitting was added.


**Technical Summary**

The essential transformation of chunking is:

```
elements          sections              chunks
List[Element] -> List[List[Element]] -> List[CompositeElement]
```

1. The _sectioner_ (`_split_elements_by_title_and_table()`) _groups_
semantically-related elements into _sections_ (`List[Element]`), in the
best case, that would be a title (heading) and the text that follows it
(until the next title). A heading and its text is often referred to as a
_section_ in publishing parlance, hence the name.

2. The _chunker_ (`chunk_by_title()` currently) does two things:
   1. First it _consolidates_ the elements of each section into a single
`CompositeElement` object (a "chunk"). This includes both joining the
element text into a single string and consolidating the metadata of the
section elements.
   2. Then, if necessary, it _splits_ the chunk into two or more
`CompositeElement` objects when the consolidated text is too long to fit
in the specified window (`max_characters`).

Chunk splitting is only required when a single element (like a big
paragraph) has text longer than the specified window. Otherwise a
section and the chunk that derives from it reflects an even element
boundary.

`chunk_by_title()` was elaborated in commit f98d5e65 to add this
"chunk-splitting" behavior.

At the time there was some notion of wanting to "split from the end
backward" such that any small remainder chunk would appear first, and
could possibly be combined with a small prior chunk. To accomplish this,
split chunks were _inserted_ at the beginning of the list instead of
_appended_ to the end.

The `chunked_elements` variable (`List[CompositeElement]`) holds the
sequence of chunks that result from the chunking operation and is the
return value of `chunk_by_title()`. "Split-from-the-end" chunks were
inserted at the beginning of this list, which unfortunately produces the
out-of-order behavior: the insertion was at the beginning of the
"all-chunks-in-document" list, not a sublist just for the current chunk.

Further, the "split-from-the-end" behavior can produce no benefit
because chunks are never combined, only _elements_ are combined (across
semantic boundaries into a single section when a section is small) and
sectioning occurs _prior_ to chunking.

The fix is to rework the chunk-splitting passage into a straightforward
iterative algorithm that works both when a chunk must be split and when
it doesn't. This algorithm is also easily extended to implement
split-chunk overlap, which is coming in an immediately following PR.

```python
# -- split chunk into CompositeElement objects maxlen or smaller --
text_len = len(text)
start = 0
remaining = text_len

while remaining > 0:
    end = min(start + max_characters, text_len)
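    # -- append (never insert) so split chunks stay in document order --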
    chunked_elements.append(CompositeElement(text=text[start:end], metadata=chunk_meta))
    start = end - overlap
    remaining = text_len - end
```

*Forensic analysis*
The out-of-order-chunks behavior was introduced in commit f98d5e65 on
10/09/2023, in the same PR in which chunk-splitting was introduced.

---------

Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-21 01:37:34 +00:00
Yuming Long
ce40cdc55f
Chore (refactor): support table extraction with pre-computed ocr data (#1801)
### Summary

Table OCR refactor: move the OCR part for the table model from the
inference repo to the unst repo.
* Before this PR, the table model extracted OCR tokens (text plus
bounding boxes) and filled the tokens into the table structure in the
inference repo. This meant we needed an additional OCR pass for tables.
* After this PR, we use the OCR data from the entire-page OCR and pass
the tokens to the inference repo, which means we only do one OCR pass
for the entire document.

**Tech details:**
* Combined the `ENTIRE_PAGE_OCR` and `TABLE_OCR` env vars into
`OCR_AGENT`; this means we use the same OCR agent for the entire page
and for tables, since we only do one OCR pass.
* Bump the inference repo to `0.7.9`, which allows the table model in
inference to use pre-computed OCR data from the unst repo. Please check
the companion
[PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256).
* All notebook lint changes were made by `make tidy`.
* This PR also fixes this
[issue](https://github.com/Unstructured-IO/unstructured/issues/1564);
I've added a test for it in
`test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages`.
* Added the same scaling logic to the image [as the previous Table
OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113),
but now scaling is applied to the entire image.
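
Purely for illustration, the consolidated env-var selection might look
something like this (the variable handling and default shown are
assumptions, not the actual code):

```python
import os

# Hypothetical sketch: a single env var selects the OCR agent used for both
# entire-page OCR and table OCR, since only one OCR pass is performed.
OCR_AGENT = os.environ.get("OCR_AGENT", "tesseract")

def get_ocr_agent() -> str:
    """Return the OCR agent name shared by page-level and table OCR."""
    return OCR_AGENT
```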

### Test
* Not much to manually test except that table extraction still works.
* But due to the change in scaling and the use of pre-computed OCR data
from the entire page, there are some slight (better) changes in table
output. Here is a comparison of test outputs I found from the same test,
`test_partition_image_with_table_extraction`:

Screenshot of the table in `layout-parser-paper-with-table.jpg`:
<img width="343" alt="expected"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c">
before refactor:
<img width="709" alt="before"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e">
after refactor:
<img width="705" alt="after"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d">

### TODO
(added as a ticket) Still have some cleanup to do in the inference repo,
since the unst repo now has duplicate logic, but we can keep it as a
fallback plan. If we want to remove anything OCR-related in inference,
here are the items that are deprecated and can be removed:
* [`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77)
(already noted in code)
* parameter `extract_tables` in inference
* [`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88)
* [`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197)
* env `TABLE_OCR`

### Note
If we want to fall back to an additional table OCR (we may need this for
using Paddle for tables), we need to:
* pass `infer_table_structure` to inference with `extract_tables`
parameter
* stop passing `infer_table_structure` to `ocr.py`

---------

Co-authored-by: Yao You <yao@unstructured.io>
2023-10-21 00:24:23 +00:00
Steve Canny
d9c2516364
fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted (#1779)
**Executive Summary.** Introducing strict type-checking as preparation
for adding the chunk-overlap feature revealed a type mismatch for
regex-metadata between chunking tests and the (authoritative)
ElementMetadata definition. The implementation of regex-metadata aspects
of chunking passed the tests but did not produce the appropriate
behaviors in production where the actual data-structure was different.
This PR fixes these two bugs.

1. **Over-chunking.** The presence of `regex-metadata` in an element was
incorrectly being interpreted as a semantic boundary, leading to such
elements being isolated in their own chunks.

2. **Discarded regex-metadata.** regex-metadata present on the second or
later elements in a section (chunk) was discarded.


**Technical Summary**

The type of `ElementMetadata.regex_metadata` is `Dict[str,
List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text":
"this matched", "start": 7, "end": 19}`.

Multiple regexes can be specified, each with a name like "mail-stop",
"version", etc. Each of those may produce its own set of matches, like:

```python
>>> element.regex_metadata
{
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}
```

*Forensic analysis*
* The regex-metadata feature was added by Matt Robinson on 06/16/2023
commit: 4ea71683. The regex_metadata data structure is the same as when
it was added.

* The chunk-by-title feature was added by Matt Robinson on 08/29/2023
commit: f6a745a7. The mistaken regex-metadata data structure in the
tests is present in that commit.

Looks to me like a mis-remembering of the regex-metadata data-structure
and insufficient type-checking rigor (type-checker strictness level set
too low) to warn of the mistake.


**Over-chunking Behavior**

The over-chunking looked like this:

Chunking three elements with regex metadata should combine them into a
single chunk (`CompositeElement` object), subject to maximum size rules
(default 500 chars).

```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]

chunks = chunk_by_title(elements)

assert chunks == [
    CompositeElement(
        "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
        " ipsum sed lectus porta volutpat."
    )
]
```

Observed behavior looked like this:

```python
chunks => [
    CompositeElement('Lorem Ipsum')
    CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
    CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]

```

The fix changed the approach from breaking on any metadata field not in
a specified group (`regex_metadata` was missing from this group) to only
breaking on specified fields (whitelisting instead of blacklisting).
This avoids over-chunking every time we add a new metadata field, and it
is also simpler and easier to understand. This change in approach is
discussed in more detail in #1790.
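
A sketch of the whitelisting idea (the field set shown is illustrative,
not the actual list):

```python
# Hypothetical sketch: only whitelisted metadata fields mark a semantic
# boundary; any other field (e.g. regex_metadata) is ignored when deciding
# whether an element starts a new chunk.
SEMANTIC_BOUNDARY_FIELDS = {"filename", "page_number", "section"}

def _crosses_semantic_boundary(prior_meta, meta) -> bool:
    """True when any whitelisted field differs between successive elements."""
    return any(
        getattr(prior_meta, field, None) != getattr(meta, field, None)
        for field in SEMANTIC_BOUNDARY_FIELDS
    )
```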


**Dropping regex-metadata Behavior**

Chunking this section:

```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={
                "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
            }
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]
```

...should produce this regex_metadata on the single produced chunk:

```python
assert chunk == CompositeElement(
    "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
    " ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
    "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
    "ipsum": [
        RegexMetadata(text="Ipsum", start=6, end=11),
        RegexMetadata(text="ipsum", start=19, end=24),
        RegexMetadata(text="ipsum", start=81, end=86),
    ],
}
```

but instead produced this:

```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```

Which is the regex-metadata from the first element only.

The fix was to remove the consolidation+adjustment process from inside
the "list-attribute-processing" loop (because regex-metadata is not a
list) and process regex metadata separately.
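
A sketch of that separate consolidation, with start/end adjusted by each
element's offset in the concatenated chunk text (`SEPARATOR` and the
plumbing here are simplified):

```python
from typing import Dict, List

SEPARATOR = "\n\n"

def _consolidate_regex_metadata(
    elements: List["Element"],
) -> Dict[str, List["RegexMetadata"]]:
    """Merge regex-metadata from all elements of a chunk, shifting each
    match's start/end by the element's offset in the joined text."""
    consolidated: Dict[str, List["RegexMetadata"]] = {}
    offset = 0
    for element in elements:
        for name, matches in (element.metadata.regex_metadata or {}).items():
            for match in matches:
                consolidated.setdefault(name, []).append(
                    {
                        "text": match["text"],
                        "start": match["start"] + offset,
                        "end": match["end"] + offset,
                    }
                )
        offset += len(element.text) + len(SEPARATOR)
    return consolidated
```

On the example above this yields the adjusted offsets shown in the
expected output (e.g. the second element's `ipsum` at 6-11 shifts to
19-24).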
2023-10-19 22:16:02 -05:00
Amanda Cameron
f98d5e65ca
chore: adding max_characters to other element type chunking (#1673)
This PR adds the `max_characters` (hard max) param to non-table element
chunking. Additionally updates the `num_characters` metadata to
`max_characters` to make it clearer which param we're referencing.

To test:

```python
from unstructured.partition.html import partition_html

filename = "example-docs/example-10k-1p.html"
chunk_elements = partition_html(
    filename,
    chunking_strategy="by_title",
    combine_text_under_n_chars=0,
    new_after_n_chars=50,
    max_characters=100,
)

for chunk in chunk_elements:
    print(len(chunk.text))

# previously we were only respecting the "soft max" (default of 500) for elements other than tables
# now we should see that all the elements have text fields under 100 chars.
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-09 19:42:36 +00:00
ryannikolaidis
9960ce5f00
fix: chunking fails with detection_class_prob in metadata (#1637) 2023-10-04 22:14:21 +00:00
Amanda Cameron
1fb464235a
chore: Table chunking (#1540)
This change adds to our `add_chunking_strategy` logic so that we are
able to chunk Table elements' `text` and `text_as_html` params. In order
to keep the functionality under the same `by_title` chunking strategy,
we have renamed `combine_under_n_chars` to `max_characters`. It
functions the same way for combining elements under Titles, and it also
specifies a chunk size (in chars) for TableChunk elements.

*Renaming the variable to `max_characters` will also reflect the 'hard
max' we will implement for large elements in follow-up PRs.

Additionally, some lint changes snuck in when I ran `make tidy`, hence
the minor changes in unrelated files :)

TODO:
- add unit tests
--> note: added where I could! For some unit tests I just clarified that
the chunking strategy was now 'by_title', because we don't have an
example file with Table elements to test the 'by_num_characters'
chunking strategy.
- update changelog

To manually test:
```
In [1]: filename="example-docs/example-10k.html"

In [2]: from unstructured.chunking.title import chunk_table_element

In [3]: from unstructured.partition.auto import partition

In [4]: elements = partition(filename)

# element at -2 happens to be a Table, and we'll get chunks of char size 4 here
In [5]: chunks = chunk_table_element(elements[-2], 4)

# examine text and text_as_html params
In [6]: for c in chunks:
   ...:     print(c.text)
   ...:     print(c.metadata.text_as_html)
```

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-10-03 09:40:34 -07:00
Ahmet Melek
bd33a52ee0
fix: coordinates metadata hinders chunking (#1374)
Closes https://github.com/Unstructured-IO/unstructured/issues/1373

This PR: 
- drops the `coordinates` metadata field in `chunk_by_title` to fix
https://github.com/Unstructured-IO/unstructured/issues/1373 (read issue
for the details)
- adds a relevant test that checks the particular case
2023-09-14 10:10:03 +00:00
John
c58b261feb
chunk_by_title decorator (#1304)
### Summary

Partial solution to #1185.
Related to #1222.
Creates decorator from `chunk_by_title` cleaning brick.
Breaks a document into sections based on the presence of Title elements.
Also starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is 1500. The **chunking function does not split individual
elements**, so it's possible for a section to exceed that threshold if
an individual element is over `new_after_n_chars` characters, which
could occur with a long NarrativeText element.

Combines sections under these conditions
- Sections under `combine_under_n_chars` characters are combined. The
default is 500.

### Testing

```python
from unstructured.partition.html import partition_html

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
```
2023-09-11 21:00:14 +00:00
Matt Robinson
f6a745a74f
feat: chunk elements based on titles (#1222)
### Summary

An initial pass on smart chunking for RAG applications. Breaks a
document into sections based on the presence of `Title` elements. Also
starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is `1500`. The chunking function does not split individual
elements, so it's possible for a section to exceed that threshold if an
individual element is over `new_after_n_chars` characters, which could
occur with a long `NarrativeText` element.
- Sections under `combine_under_n_chars` characters are combined. The
default is `500`.
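
A sketch of the sectioning rules just described (the predicate name,
parameters, and metadata handling are illustrative, not the actual
implementation):

```python
from unstructured.documents.elements import Element, Title

def _starts_new_section(
    element: Element,
    section_length: int,
    metadata_changed: bool,  # e.g. switch to processing attachments
    page_changed: bool,
    new_after_n_chars: int = 1500,
    multipage_sections: bool = True,
) -> bool:
    """Hypothetical predicate: should `element` begin a new section?"""
    if isinstance(element, Title):
        return True
    if metadata_changed:
        return True
    # -- a page change only breaks the section when multipage_sections=False --
    if page_changed and not multipage_sections:
        return True
    # -- soft max: the chunking function never splits an individual element,
    # -- so a section may still exceed this threshold --
    return section_length >= new_after_n_chars
```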

### Testing

```python
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_by_title(elements)

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
```
2023-08-29 16:04:57 +00:00