This change is adding to our `add_chunking_strategy` logic so that we
are able to chunk Table elements' `text` and `text_as_html` params. In
order to keep the functionality under the same `by_title` chunking
strategy we have renamed the `combine_under_n_chars` to
`max_characters`. It functions the same way for the combining elements
under Title's, as well as specifying a chunk size (in chars) for
TableChunk elements.
*renaming the variable to `max_characters` will also reflect the 'hard
max' we will implement for large elements in followup PRs
Additionally -> some lint changes snuck in when I ran `make tidy` hence
the minor changes in unrelated files :)
TODO:
✅ add unit tests
--> note: added where I could to unit tests! Some unit tests I just
clarified that the chunking strategy was now 'by_title' because we don't
have a file example that has Table elements to test the
'by_num_characters' chunking strategy
✅ update changelog
To manually test:
```
In [1]: filename="example-docs/example-10k.html"
In [2]: from unstructured.chunking.title import chunk_table_element
In [3]: from unstructured.partition.auto import partition
In [4]: elements = partition(filename)
# element at -2 happens to be a Table, and we'll get chunks of char size 4 here
In [5]: chunks = chunk_table_element(elements[-2], 4)
# examine text and text_as_html params
ln [6]: for c in chunks:
print(c.text)
print(c.metadata.text_as_html)
```
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
Improves hierarchy from docx files by leveraging natural hierarchies
built into docx documents. Hierarchy can now be detected from an
indentation level for list bullets/numbers and by style name (e.g.
Heading 1, List Bullet 2, List Number).
Hierarchy detection is improved by determining category depth via the
following:
1. Check if the paragraph item has an indentation level (ilvl) xpath -
these are typically on list bullet/numbers. Return the indentation level
if it exists
2. Check the name of the paragraph style if it contains any category
depth information (e.g. Heading 1 vs Heading 2 or List Bullet vs List
Bullet 2). Return the category depth if found, else default to depth of
0.
3. Check the paragraph ilvl via the paragraph's style name. Outside of
the paragraph's metadata, docx stores default ilvls for various style
names, which requires a complex lookup. This check is yet to be
implemented, as the above methods cover most usecases but the
implementation is stubbed out.
---
Co-authored-by: Steve Canny <stcanny@gmail.com>
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.
This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain-knowledge.
### Summary
Duplicate PR of #1259 because of issues with checks
Closes#1227, which found that `nan` values were present in the
coordinates being generated for some elements.
This breaks logic out from `add_pytesseract_bbox_to_elements` to new
functions `_get_element_box` and
`convert_multiple_coordinates_to_new_system`. It also updates the logic
to check that the current bounding box matches the first character of
the element's text (as to avoid the `~` characters that
`pytesseract.image_to_boxes` includes, but are not present in
`pytesseract.image_to_string`.
### Testing
```
from unstructured.partition.image import partition_image
from PIL import Image, ImageDraw
filename="example-docs/layout-parser-paper-with-table.jpg"
elements = partition_image(filename=filename, strategy="ocr_only")
image = Image.open(filename)
draw = ImageDraw.Draw(image)
for i, element in enumerate(elements):
print(i, element.metadata.coordinates)
if element.metadata.coordinates:
draw.polygon(element.metadata.coordinates.points, outline="red", width=2)
output = "example-docs/box-layout-parser-paper-with-table.jpg"
image.save(output)
image.close()
```
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
### Summary
Partial solution to #1185.
Related to #1222.
Creates decorator from `chunk_by_title` cleaning brick.
Breaks a document into sections based on the presence of Title elements.
Also starts a new section under the following conditions:
- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is 1500. The **chunking function does not split individual
elements**, so it's possible for a section to exceed that threshold if
an individual element if over `new_after_n_chars characters`, which
could occur with a long NarrativeText element.
Combines sections under these conditions
- Sections under `combine_under_n_chars` characters are combined. The
default is 500.
### Testing
from unstructured.partition.html import partition_html
url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")
for chunk in chunks:
print(chunk)
print("\n\n" + "-"*80)
input()
Update `test_json` to not use auto partition due to dependencies. Previously, to run `test_json` requires full requirements installation library to read file types, including but not limited to, docx, pptx, as well as others. Therefore the test will raise error with base installation. With the update, this fix also add to other test files to check its invariant with `elements_to_json`.
**Summary**
Closes#747
* Create CI Pipeline for running text, xml, email, and html doc tests
against the library installed without extras
* Create CI Pipeline for running each library extra against their
respective tests