This change is adding to our `add_chunking_strategy` logic so that we
are able to chunk Table elements' `text` and `text_as_html` params. In
order to keep the functionality under the same `by_title` chunking
strategy we have renamed the `combine_under_n_chars` to
`max_characters`. It functions the same way for the combining elements
under Title's, as well as specifying a chunk size (in chars) for
TableChunk elements.
*renaming the variable to `max_characters` will also reflect the 'hard
max' we will implement for large elements in followup PRs
Additionally -> some lint changes snuck in when I ran `make tidy` hence
the minor changes in unrelated files :)
TODO:
✅ add unit tests
--> note: added where I could to unit tests! Some unit tests I just
clarified that the chunking strategy was now 'by_title' because we don't
have a file example that has Table elements to test the
'by_num_characters' chunking strategy
✅ update changelog
To manually test:
```
In [1]: filename="example-docs/example-10k.html"
In [2]: from unstructured.chunking.title import chunk_table_element
In [3]: from unstructured.partition.auto import partition
In [4]: elements = partition(filename)
# element at -2 happens to be a Table, and we'll get chunks of char size 4 here
In [5]: chunks = chunk_table_element(elements[-2], 4)
# examine text and text_as_html params
ln [6]: for c in chunks:
print(c.text)
print(c.metadata.text_as_html)
```
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
This PR adds an arg to the html partition flow called `source_format` if
anything other than "html" we will return non-HTML elements to conform
with the file type we received.
addresses: https://github.com/Unstructured-IO/unstructured/issues/726
This PR does two things:
1. Adds test case (and alters sample doc) for rtf and epub files with
table
2. Adds `xls/x` file extension to `skip_infer_table_types` default list
---------
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
### Summary
Partial solution to #1185.
Related to #1222.
Creates decorator from `chunk_by_title` cleaning brick.
Breaks a document into sections based on the presence of Title elements.
Also starts a new section under the following conditions:
- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is 1500. The **chunking function does not split individual
elements**, so it's possible for a section to exceed that threshold if
an individual element if over `new_after_n_chars characters`, which
could occur with a long NarrativeText element.
Combines sections under these conditions
- Sections under `combine_under_n_chars` characters are combined. The
default is 500.
### Testing
from unstructured.partition.html import partition_html
url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")
for chunk in chunks:
print(chunk)
print("\n\n" + "-"*80)
input()
Update `test_json` to not use auto partition due to dependencies. Previously, to run `test_json` requires full requirements installation library to read file types, including but not limited to, docx, pptx, as well as others. Therefore the test will raise error with base installation. With the update, this fix also add to other test files to check its invariant with `elements_to_json`.
**Summary**
Closes#747
* Create CI Pipeline for running text, xml, email, and html doc tests
against the library installed without extras
* Create CI Pipeline for running each library extra against their
respective tests