32 Commits

Author SHA1 Message Date
Steve Canny
601594d373
fix(docx): fix short-row DOCX table (#2943)
**Summary**
The DOCX format allows a table row to start late and/or end early,
meaning cells at the beginning or end of a row can be omitted. While
there are legitimate uses for this capability, using it in practice is
relatively rare. However, it can happen unintentionally when adjusting
cell borders with the mouse. Accommodate this case and generate accurate
`.text` and `.metadata.text_as_html` for these tables.
2024-05-02 00:45:52 +00:00
Michał Martyniak
2d1923ac7e
Better element IDs - deterministic and document-unique hashes (#2673)
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more test for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)

This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
2024-04-24 00:05:20 -07:00
Michał Martyniak
cb1e91058e
Introduce start_page argument to partitioning functions that assign element.metadata.page_number (#2884)
This small change will be useful for users who partition only fragments
of their PDF documents.
It's a small step towards addressing this issue:
https://github.com/Unstructured-IO/unstructured/issues/2461

Related PRs:
* https://github.com/Unstructured-IO/unstructured/pull/2842
* https://github.com/Unstructured-IO/unstructured/pull/2673
2024-04-15 21:03:42 +00:00
Filip Knefel
6af6604057
feat: introduce date_from_file_object parameter to partitions (#2563)
Introduce `date_from_file_object` to `partition*` functions, by default
set to `False`.
If set to `True` and file is provided via `file` parameter, partition
will attempt to infer last modified date from `file`'s contents
otherwise last modified metadata will be set to `None`.

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-18 01:09:44 +00:00
Filip Knefel
f048695a55
feat: include text from shapes in docx (#2510)
Reported bug: Text from docx shapes is not included in the `partition`
output.
Fix: Extend docx partition to search for text tags nested inside
structures responsible for creating the shape.

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
2024-02-14 17:48:38 +00:00
Steve Canny
74d089d942
rfctr: skip CheckBox elements during chunking (#2253)
`CheckBox` elements get special treatment during chunking. `CheckBox`
does not derive from `Text` and can contribute no text to a chunk. It is
considered "non-combinable" and so is emitted as-is as a chunk of its
own. A consequence of this is it breaks an otherwise contiguous chunk
into two wherever it occurs.

This is problematic, but becomes much more so when overlap is
introduced. Each chunk accepts a "tail" text fragment from its preceding
element and contributes its own tail fragment to the next chunk. These
tails represent the "overlap" between chunks. However, a non-text chunk
can neither accept nor provide a tail-fragment and so interrupts the
overlap. None of the possible solutions are terrific.

Give `Element` a `.text` attribute such that _all_ elements have a
`.text` attribute, even though its value is the empty-string for
element-types such as CheckBox and PageBreak which inherently have no
text. As a consequence, several `cast()` wrappers are no longer required
to satisfy strict type-checking.

This also allows a `CheckBox` element to be combined with `Text`
subtypes during chunking, essentially the same way `PageBreak` is,
contributing no text to the chunk.

Also, remove the `_NonTextSection` object which previously wrapped a
`CheckBox` element during pre-chunking as it is no longer required.
2023-12-13 20:22:25 +00:00
Steve Canny
02e8c962aa
fix(docx): tables in header/footer dropped (#2135)
A DOCX header or footer is a so-called "story part" meaning like the
document body (which is also a story part) it can contain both
paragraphs and tables. The implementation of `Header.text` and
`Footer.text` gather only the paragraphs.

Add a new method to extract all content from a header or footer,
including table content, suitable for use as the `.text` attribute of
that element.

Fixes #2126.
2023-11-22 15:39:25 -08:00
Steve Canny
e6637592d1
fix(docx): Table.text duplicates merged cell text (#2134)
**Summary.** The `python-docx` table API is designed for _uniform_
tables (no merged cells, no nested tables). Naive processing of DOCX
tables using this API produces duplicate text when the table has merged
cells. Add a more sophisticated parsing method that reads only "root"
cells (those with an actual `<tc>` element) and skip cells spanned by a
merge.

In the process, abandon use of the `tabulate` package for this job
(which is also designed for uniform tables) and remove the whitespace
padding it adds for visual alignment of columns. Separate the text for
each cell with a single newline ("\n").

Since it's little extra trouble, add support for nested tables such that
their text also contributes to the `Table.text` string.

The new `._iter_table_texts()` method will also be used for parsing
tables in headers and footers (where they are frequently used for layout
purposes) in a closely following PR.

Fixes #2106.
2023-11-21 22:22:40 +00:00
Steve Canny
a589a494f6
docx: improve page break fidelity (#1631)
Page breaks can and often do occur within a paragraph. The full text of
the paragraph is attributed to the page (number) the paragraph starts
on.

Improve page-break fidelity such that a paragraph containing a
page-break is split into two elements, one containing the text before
the page-break and the other the text after. Emit the `PageBreak`
element between these two and assign the correct page-number (n and n+1
respectively) to the two textual elements.

This functionality is largely provided upstream by the new `python-docx`
v1.0.0 release (1.0.0 from 0.8.11 because it drops Python 2 support).
That version also makes obsolete the "include hyperlink text in
`Paragraph.text` monkey patch that we had maintained up to now. Remove
that monkey-patch.
2023-11-17 00:09:14 +00:00
Steve Canny
7a741c9ae6
fix(chunk): #1985 mis-splits of Table chunks (#2076)
Closes #1985 

**Summary.** Due to an interaction of coding errors, HTML text in
`TableChunk` splits of a `Table` element were repeating the entire HTML
for the table in each chunk.

**Technical Summary.** This behavior was fixed but not published in the
last chunking PR of a series. Finish up that PR and submit it all here.
This PR extracts chunking to the particular Section type (each has their
own distinct chunking behavior).
2023-11-16 16:22:50 +00:00
Steve Canny
41fc55bc12
fix(docx): tabulate output is non-deterministic (#2090)
The test for nested tables added a few PRs ago indirectly relies on the
padding added to table-HTML by `tabulate`. The length of that padding
turns out to be non-deterministic, perhaps related to M1 vs. Intel
hardware.

Remove padding from tabulate output in the test so only actual content
is compared.
2023-11-16 07:52:16 +00:00
Steve Canny
d06bcc41bb
fix(docx): improve page-break detection (#2036)
Page breaks are reliably indicated by `w:lastRenderedPageBreak` elements
present in the document XML. Page breaks are NOT reliably indicated by
"hard" page-breaks inserted by the author and when present are redundant
to a `w:lastRenderedPageBreak` element so cause over-counting if used.

Use rendered page-breaks only.
2023-11-09 20:34:30 +00:00
Christine Straub
bb58c1bb0b
Refactor: element type (#2035)
### Summary
- add constants for element type
- replace the `TYPE_TO_TEXT_ELEMENT_MAP` dictionary using the
`ElementType` constants
- replace element type strings using the constants

### Testing
CI should pass.
2023-11-08 21:52:55 -08:00
Steve Canny
0e2c21e5a2
fix: handle sectionless-docx in the general case (#1829)
A DOCX document that has no sections can still contain one or more
tables. Such files are never created by Word but Word can open them just
fine. These can be and are generated by other applications.

Use the newly-added `Document.iter_inner_content()` method added
upstream in `python-docx` to capture both paragraphs and tables from a
section-less DOCX document.

This generalizes the fix for MS Teams chat-transcripts (an example of
sectionless-docx) implemented in #1825.
2023-11-08 19:05:19 +00:00
Steve Canny
80fe07b89f
fix: #1952 support nested docx tables (#2020)
In DOCX, like HTML, a table cell can itself contain a table. This is not
uncommon and is typically used for formatting purposes.

When a DOCX table is nested, create nested HTML tables to reflect that
structure and create a plain-text table with captures all the text in
nested tables, formatting it as a reasonable facsimile of a table.

This implements the solution described and spiked in PR #1952.

---------

Co-authored-by: Bruno Bornsztein <bruno.bornsztein@gmail.com>
2023-11-08 00:37:21 +00:00
Steve Canny
4e40999070
rfctr: prepare docx partitioner and tests for nested tables PR to follow (#1978)
*Reviewer:* May be quicker to review commit by commit as they are quite
distinct and well-groomed to each focus on a single clean-up task.

Clean up odds-and-ends in the docx partitioner in preparation for adding
nested-tables support in a closely following PR.

1. Remove obsolete TODOs now in GitHub issues, which is probably where
they belong in future anyway.
2. Remove local DOCX "workaround" code that has been implemented
upstream and is now obsolete.
3. "Clean" the docx tests, introducing strict typing, extracting a
fixture or two, and generally tightening things up.
4. Extract docx-local versions of
`unstructured.partition.common.convert_ms_office_table_to_text()` which
will be the base for adding nested-table support. More information on
why this is required in that commit.
2023-11-02 05:22:17 +00:00
Yao You
b530e0a2be
fix: partition docx from teams output (#1825)
This PR resolves #1816 
- current docx partition assumes all contents are in sections
- this is not true for MS Teams chat transcript exported to docx
- now the code checks if there are sections or not; if not then iterate
through the paragraphs and partition contents in the paragraphs
2023-10-24 15:17:02 +00:00
Amanda Cameron
0584e1d031
chore: fix infer_table bug (#1833)
Carrying `skip_infer_table_types` to `infer_table_structure` in
partition flow. Now PPT/X, DOC/X, etc. Table elements should not have a
`text_as_html` field.

Note: I've continued to exclude this var from partitioners that go
through html flow, I think if we've already got the html it doesn't make
sense to carry the infer variable along, since we're not 'infer-ing' the
html table in these cases.


TODO:
  add unit tests

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: amanda103 <amanda103@users.noreply.github.com>
2023-10-24 00:11:53 +00:00
Roman Isecke
b265d8874b
refactoring linting (#1739)
### Description
Currently linting only takes place over the base unstructured directory
but we support python files throughout the repo. It makes sense for all
those files to also abide by the same linting rules so the entire repo
was set to be inspected when the linters are run. Along with that
autoflake was added as a linter which has a lot of added benefits such
as removing unused imports for you that would currently break flake and
require manual intervention.

The only real relevant changes in this PR are in the `Makefile`,
`setup.cfg`, and `requirements/test.in`. The rest is the result of
running the linters.
2023-10-17 12:45:12 +00:00
Steve Canny
4b84d596c2
docx: add hyperlink metadata (#1746) 2023-10-13 06:26:14 +00:00
Steve Canny
d726963e42
serde tests round-trip through JSON (#1681)
Each partitioner has a test like `test_partition_x_with_json()`. What
these do is serialize the elements produced by the partitioner to JSON,
then read them back in from JSON and compare the before and after
elements.

Because our element equality (`Element.__eq__()`) is shallow, this
doesn't tell us a lot, but if we take it one more step, like
`List[Element] -> JSON -> List[Element] -> JSON` and then compare the
JSON, it gives us some confidence that the serialized elements can be
"re-hydrated" without losing any information.

This actually showed up a few problems, all in the
serialization/deserialization (serde) code that all elements share.
2023-10-12 19:47:55 +00:00
John
9500d04791
detect document language across all partitioners (#1627)
### Summary
Closes #1534 and #1535
Detects document language using `langdetect` package. 
Creates new kwargs for user to set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`)

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-11 01:47:56 +00:00
Amanda Cameron
f98d5e65ca
chore: adding max_characters to other element type chunking (#1673)
This PR adds the `max_characters` (hard max) param to non-table element
chunking. Additionally updates the `num_characters` metadata to
`max_characters` to make it clearer which param we're referencing.

To test:

```
from unstructured.partition.html import partition_html

filename = "example-docs/example-10k-1p.html"
chunk_elements = partition_html(
        filename,
        chunking_strategy="by_title",
        combine_text_under_n_chars=0,
        new_after_n_chars=50,
        max_characters=100,
    )

for chunk in chunk_elements:
     print(len(chunk.text))

# previously we were only respecting the "soft max" (default of 500) for elements other than tables
# now we should see that all the elements have text fields under 100 chars.
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-09 19:42:36 +00:00
Benjamin Torres
e0201e9a11
feat/add sources from unstructured inference (#1538)
This PR adds support for `source` property from
`unstructured_inference`, allowing the user to be able to see the origin
of the data under `detection_origin`field environment variable
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true

In order to try this feature you can use this code:
```
from unstructured.partition.pdf import partition_pdf_or_image

yolox_elements = partition_pdf_or_image(filename='example-docs/loremipsum-flat.pdf', strategy='hi_res', model_name='yolox')

sources = [e.detection_origin for e in yolox_elements]
print(sources)
```
And will print 'yolox' as source for all the elements
2023-10-05 20:26:47 +00:00
Amanda Cameron
1fb464235a
chore: Table chunking (#1540)
This change is adding to our `add_chunking_strategy` logic so that we
are able to chunk Table elements' `text` and `text_as_html` params. In
order to keep the functionality under the same `by_title` chunking
strategy we have renamed the `combine_under_n_chars` to
`max_characters`. It functions the same way for the combining elements
under Title's, as well as specifying a chunk size (in chars) for
TableChunk elements.

*renaming the variable to `max_characters` will also reflect the 'hard
max' we will implement for large elements in followup PRs


Additionally -> some lint changes snuck in when I ran `make tidy` hence
the minor changes in unrelated files :)

TODO:
 add unit tests
--> note: added where I could to unit tests! Some unit tests I just
clarified that the chunking strategy was now 'by_title' because we don't
have a file example that has Table elements to test the
'by_num_characters' chunking strategy
  update changelog

To manually test:
```
In [1]: filename="example-docs/example-10k.html"

In [2]: from unstructured.chunking.title import chunk_table_element

In [3]: from unstructured.partition.auto import partition

In [4]: elements = partition(filename)

# element at -2 happens to be a Table, and we'll get chunks of char size 4 here
In [5]: chunks = chunk_table_element(elements[-2], 4)

# examine text and text_as_html params
ln [6]: for c in chunks:
                    print(c.text)
                    print(c.metadata.text_as_html)
```

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-10-03 09:40:34 -07:00
Newel H
55315cf645
Feat: Native hierarchies for docx element types (#1505)
Improves hierarchy from docx files by leveraging natural hierarchies
built into docx documents. Hierarchy can now be detected from an
indentation level for list bullets/numbers and by style name (e.g.
Heading 1, List Bullet 2, List Number).

Hierarchy detection is improved by determining category depth via the
following:
1. Check if the paragraph item has an indentation level (ilvl) xpath -
these are typically on list bullet/numbers. Return the indentation level
if it exists
2. Check the name of the paragraph style if it contains any category
depth information (e.g. Heading 1 vs Heading 2 or List Bullet vs List
Bullet 2). Return the category depth if found, else default to depth of
0.
3. Check the paragraph ilvl via the paragraph's style name. Outside of
the paragraph's metadata, docx stores default ilvls for various style
names, which requires a complex lookup. This check is yet to be
implemented, as the above methods cover most usecases but the
implementation is stubbed out.
---
Co-authored-by: Steve Canny <stcanny@gmail.com>
2023-09-27 11:32:46 -04:00
rvztz
2f52df180f
Adds data source properties to onedrive, reddit and slack (#1281) 2023-09-20 04:26:36 +00:00
Steve Canny
b54994ae95
rfctr: docx partitioning (#1422)
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.

This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain-knowledge.
2023-09-19 15:32:46 -07:00
John
de4d496fcf
Fix bbox coordinates for ocr_only strategy (#1325)
### Summary
Duplicate PR of #1259 because of issues with checks
Closes #1227, which found that `nan` values were present in the
coordinates being generated for some elements.
This breaks logic out from `add_pytesseract_bbox_to_elements` to new
functions `_get_element_box` and
`convert_multiple_coordinates_to_new_system`. It also updates the logic
to check that the current bounding box matches the first character of
the element's text (as to avoid the `~` characters that
`pytesseract.image_to_boxes` includes, but are not present in
`pytesseract.image_to_string`.

### Testing
```
from unstructured.partition.image import partition_image
from PIL import Image, ImageDraw

filename="example-docs/layout-parser-paper-with-table.jpg"
elements = partition_image(filename=filename, strategy="ocr_only")
image = Image.open(filename)
draw = ImageDraw.Draw(image)
for i, element in enumerate(elements):
    print(i, element.metadata.coordinates)
    if element.metadata.coordinates:
        draw.polygon(element.metadata.coordinates.points, outline="red", width=2)
output = "example-docs/box-layout-parser-paper-with-table.jpg"
image.save(output)
image.close()
```

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-09-15 15:11:16 -05:00
John
c58b261feb
chunk_by_title decorator (#1304)
### Summary

Partial solution to #1185.
Related to #1222.
Creates decorator from `chunk_by_title` cleaning brick.
Breaks a document into sections based on the presence of Title elements.
Also starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is 1500. The **chunking function does not split individual
elements**, so it's possible for a section to exceed that threshold if
an individual element if over `new_after_n_chars characters`, which
could occur with a long NarrativeText element.

Combines sections under these conditions
- Sections under `combine_under_n_chars` characters are combined. The
default is 500.

### Testing

from unstructured.partition.html import partition_html

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
2023-09-11 21:00:14 +00:00
Klaijan
675a10ea69
fix: update test_json to not use auto partition (#1187)
Update `test_json` to not use auto partition due to dependencies. Previously, to run `test_json` requires full requirements installation library to read file types, including but not limited to, docx, pptx, as well as others. Therefore the test will raise error with base installation. With the update, this fix also add to other test files to check its invariant with `elements_to_json`.
2023-08-29 16:59:26 -04:00
Newel H
e4aa7373e2
test: create CI pipelines for verifying base and extras pass respective tests (#1137)
**Summary**
Closes #747
* Create CI Pipeline for running text, xml, email, and html doc tests
against the library installed without extras
* Create CI Pipeline for running each library extra against their
respective tests
2023-08-19 12:56:13 -04:00