38 Commits

Author SHA1 Message Date
Steve Canny
22cbdce7ca
fix(html): unequal row lengths in HTMLTable.text_as_html (#2345)
Fixes #2339

Fixes to HTML partitioning introduced with v0.11.0 removed the use of
`tabulate` for forming the HTML placed in `HTMLTable.text_as_html`. This
had several benefits, but part of `tabulate`'s behavior was to make
row-length (cell-count) uniform across the rows of the table.

Lacking this prior uniformity produced a downstream problem reported in

On closer inspection, the method used to "harvest" cell-text was
producing more text-nodes than there were cells and was sensitive to
where whitespace was used to format the HTML. It also "moved" text to
different columns in certain rows.

Refine the cell-text gathering mechanism to get exactly one text string
for each row cell, eliminating whitespace formatting nodes and producing
strict correspondence between the number of cells in the original HTML
table row and that placed in HTML.text_as_html.

HTML tables that are uniform (every row has the same number of cells)
will produce a uniform table in `.text_as_html`. Merged cells may still
produce a non-uniform table in `.text_as_html` (because the source table
is non-uniform).
2024-01-04 21:53:19 +00:00
Steve Canny
ee9be2a3b2
fix: assorted partition_html() bugs (#2113)
Addresses a cluster of HTML-related bugs:
- empty table is identified as bulleted-table
- `partition_html()` emits empty (no text) tables (#1928)
- `.text_as_html` contains inappropriate `<br>` elements in invalid
locations.
- cells enclosed in `<thead>` and `<tfoot>` elements are dropped (#1928)
- `.text_as_html` contains whitespace padding

Each of these is addressed in a separate commit below.

Fixes #1928.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
2023-11-20 16:29:32 +00:00
Yao You
3437a23c91
fix: partition html fail with table without tbody (#1817)
This PR resolves #1807 
- fix a bug where when a table tagged content does not contain `tbody`
tag but `thead` tag for the rows the code fails
- now when there is no `tbody` in a table section we try to look for
`thead` isntead
- when both are not found return empty table
2023-10-20 23:21:59 +00:00
Steve Canny
d726963e42
serde tests round-trip through JSON (#1681)
Each partitioner has a test like `test_partition_x_with_json()`. What
these do is serialize the elements produced by the partitioner to JSON,
then read them back in from JSON and compare the before and after
elements.

Because our element equality (`Element.__eq__()`) is shallow, this
doesn't tell us a lot, but if we take it one more step, like
`List[Element] -> JSON -> List[Element] -> JSON` and then compare the
JSON, it gives us some confidence that the serialized elements can be
"re-hydrated" without losing any information.

This actually showed up a few problems, all in the
serialization/deserialization (serde) code that all elements share.
2023-10-12 19:47:55 +00:00
John
9500d04791
detect document language across all partitioners (#1627)
### Summary
Closes #1534 and #1535
Detects document language using `langdetect` package. 
Creates new kwargs for user to set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`)

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-11 01:47:56 +00:00
Newel H
bcd0eee753
Feat: Detect all text in HTML Heading tags as titles (#1556)
## Summary
This will increase the accuracy of hierarchies in HTML documents and
provide more accurate element categorization. If text is in an HTML
heading tag and is not a list item, address categorize it as a title.

## Testing
```
from unstructured.partition.html import partition_html
elements = partition_html(url="https://www.eda.gov/grants/2015")
```
Before, the date headers at the given url would not be correctly parsed
as titles, after this change they are now correctly identified.

A unit test to verify the functionality has been added:
`test_html_partition::test_html_heading_title_detection` that includes
values that were previously detected as narrative text and uncategorized
text
2023-10-03 11:54:36 -04:00
rvztz
2f52df180f
Adds data source properties to onedrive, reddit and slack (#1281) 2023-09-20 04:26:36 +00:00
Amanda Cameron
50db2abd9f
fix: updating element types (#1394)
This PR adds an arg to the html partition flow called `source_format` if
anything other than "html" we will return non-HTML elements to conform
with the file type we received.

addresses: https://github.com/Unstructured-IO/unstructured/issues/726
2023-09-15 11:51:22 -05:00
John
c58b261feb
chunk_by_title decorator (#1304)
### Summary

Partial solution to #1185.
Related to #1222.
Creates decorator from `chunk_by_title` cleaning brick.
Breaks a document into sections based on the presence of Title elements.
Also starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is 1500. The **chunking function does not split individual
elements**, so it's possible for a section to exceed that threshold if
an individual element if over `new_after_n_chars characters`, which
could occur with a long NarrativeText element.

Combines sections under these conditions
- Sections under `combine_under_n_chars` characters are combined. The
default is 500.

### Testing

from unstructured.partition.html import partition_html

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
2023-09-11 21:00:14 +00:00
Amanda Cameron
a501d1d18f
Adding table extraction to partition_html (#1324)
Adding table extraction to HTML partitioning.

This PR utilizes 'table' HTML elements to extract and parse HTML tables
and return them in partitioning.

```
# checkout this branch, go into ipython shell
In [1]: from unstructured.partition.html import partition_html
In [2]: path_to_html = "{html sample file with table}"
In [3]: elements = partition_html(path_to_html)
```
you should see the table in the elements list!
2023-09-11 11:14:11 -07:00
Matt Robinson
22974f61ce
fix: separate elements by <br> tag in partition_html (#1314)
### Summary

Closes #1230. Updates `partition_html` to split on `<br>` tags that
appear within text elements.


### Testing

The following is code previously produced one giant element on `main`.

```python
from unstructured.partition.html import partition_html

filename = "example-docs/ideas-page.html"
elements = partition_html(filename=filename)

len(elements) # Should be 4
print("\n\n".join([str(el) for el in elements)])
```

The output should be:

```python
January 2023

(Someone fed my essays into GPT to make something that could answer
questions based on them, then asked it where good ideas come from.  The
answer was ok, but not what I would have said. This is what I would have said.)

The way to get new ideas is to notice anomalies: what seems strange,
or missing, or broken? You can see anomalies in everyday life (much
of standup comedy is based on this), but the best place to look for
them is at the frontiers of knowledge.

Knowledge grows fractally.
From a distance its edges look smooth, but when you learn enough
to get close to one, you'll notice it's full of gaps. These gaps
will seem obvious; it will seem inexplicable that no one has tried
x or wondered about y. In the best case, exploring such gaps yields
whole new fractal buds.
```
2023-09-07 13:16:31 +00:00
Klaijan
675a10ea69
fix: update test_json to not use auto partition (#1187)
Update `test_json` to not use auto partition due to dependencies. Previously, to run `test_json` requires full requirements installation library to read file types, including but not limited to, docx, pptx, as well as others. Therefore the test will raise error with base installation. With the update, this fix also add to other test files to check its invariant with `elements_to_json`.
2023-08-29 16:59:26 -04:00
Matt Robinson
c578b85699
fix: respect <pre> tag order in partition_html (#1197)
### Summary

Closes #1184. Updates `partition_html` to respect the ordering of
`<pre>` tags in HTML documents.

### Testing

The elements in the following example should be in the correct order.

```python
    from unstructured.partition.html import partition_html

    html_text = """
    <pre>The Big Brown Bear</pre>
    <div>The big brown bear is growling.</div>
    <pre>The big brown bear is sleeping.</pre>
    <div>The Big Blue Bear</div>
    """
    elements = partition_html(text=html_text)
    print("\n\n".join([str(el) for el in elements]))
```
2023-08-25 04:14:48 +00:00
Christine Straub
0e887cc36b
Feat/1060 update metadata fields (#1099)
Closes Github Issue #1060.

* update the metadata field links
* update the metadata field emphasized_texts
2023-08-16 04:33:06 +00:00
kravetsmic
25ca5744cf
feat: optionally ignore header and footer tags in partition html (#1013)
---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 21:56:33 +00:00
Matt Robinson
f4ddf53590
feat: track emphasized text in partition_html (#1034)
* Feat/965 track emphasized text html (#1021)

* feat: add functionality to track emphasized text (<strong>, <em>, <span>, <b>, <i> tags) in HTML

* feat: add `include_tail_text` parameter to `_construct_text`

* test: add test case for `_get_emphasized_texts_from_tag`

* test: add `emphasized_texts` to metadata

* chore: update changelog & version

* fix tests

* fix lint errors

* chore: update changelog

* chore: small comment updates

* feat: update `XMLDocument._read_xml` to create `<p>` tag element for the text enclosed in the `<pre>` tag

* chore: update changelog

* Update ingest test fixtures (#1026)

Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#1035)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-08-03 16:24:25 +00:00
cragwolfe
13d3559fa4
chore: rename Element's "date" field to "last_modified" (#997)
Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.
2023-08-01 02:55:43 +00:00
Matt Robinson
d9aed66b65
feat: add document date for remaining file types (#930) (#969)
* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-07-26 15:10:14 -04:00
Matt Robinson
6e852cbe70
feat: track links from anchor tags in partition_html (#959)
* track tags in html

* pass through links as metadata

* add test for grabbing links

* one more link

* changelog and version

* update docs

* fix tests

* update empty link assertion

* ingest-test-fixtures-update

* Update ingest test fixtures (#961)
2023-07-24 18:28:56 +00:00
Christine Straub
47bc4009a8
fix: adjust threshold for encoding detection (#894)
* chore: add example doc

* fix: adjust encoding recognition threshold value in `detect_file_encoding`

* test: add test cases for German characters

* chore: update changelog & version
2023-07-07 09:25:03 -04:00
John
dc6d7d7268
feat: add metadata_filename parameter across all partition functions (#811)
* fix conflicts

* add tests and clean metadata_filename in partitions

* fix test_email and remove comments

* make tidy/check

* update changelog and version

* fix tests

* make tidy again
2023-07-05 16:02:22 -04:00
Roman Isecke
9882c2b83f
Avoid setting metadata in constructor signature for elements (#837)
Avoid setting metadata in constructor signature for elements because that can lead to unexpected object reuse (and modification).

Bonus refactor for PageBreak to have text values of "".

---------

Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
2023-06-29 03:14:05 +00:00
kravetsmic
58e988e110
feature(html partition): parse pre tag (#642)
* feature(html partition): parse pre tag

* chore: update CHANGELOG.md

* style: black format xml.py

* Added tests dor html with pre tag

* remove skip test, update parse pre tag

* fix style

* chore: spell check

* chore: update changelog & version

* chore: update ingest test fixtures

* chore: add exception handling if `element.text` is `None` in `_read_xml`

* test: add more sanity testing on the `.text` content of the element(s)

* refactor: move the conditional logic for <pre> outside of the `try/except` block

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-06-27 18:52:39 +00:00
Matt Robinson
c53ce117bc
fix: enable partition_html to grab content outside of <article> tags (#772)
* optionally dont assemble articles

* add test for content outside of articles

* pass kwargs in partition

* changelog and version

* update default to False

* bump version for release

* back to dev version to get another fix in the release
2023-06-20 17:07:30 +00:00
Christine Straub
547bb38d86
fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660)
Add functionality to try other common encodings for html, xml files if an error related to the encoding is raised and the user has not specified an encoding.

Change auto.py to have a None default for encoding

Remove the unused parameter encoding from partition_pdf

Add functionality to the read_txt_file utility function to handle file-like object from URL
2023-06-05 11:27:12 -07:00
John
c78c5b6adf
fix: page_number appears in partition_html metadata if include_metadata=False (#658)
* fix: page_number appears in partition_html metadata if include_metadata=False

* Update common.py

* Update CHANGELOG

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-30 20:47:55 +00:00
Matt Robinson
bd6a8a3a40
enhancement: add file_directory to element metadata (#585)
* enhancement: add `file_directory` to element metadata

* update msg test

* exclude file_directory

* update slack output

* added file directory tests on partition_x paths
2023-05-15 18:25:39 -04:00
Matt Robinson
bd1e540af9
feat: parameter to turn off SSL verification (#506)
* add kwarg for ssl verification

* update docs

* update version and changelog

* add verify kwarg to test
2023-04-20 11:13:56 -04:00
Matt Robinson
137b4b9a2e
feat: cleaning brick for normalizing bytes string output (#481)
* add cleaning brick for emojis

* changelog and versoin

* docs for bytes_string_to_string

* different test for bytes_string_to_string
2023-04-13 19:39:08 +00:00
Matt Robinson
b855fd269f
fix: fix html encoding to support foreign characters (#452)
* fix: fix html encoding to support foreign characters

* version and changelog
2023-04-05 20:18:54 +00:00
Matt Robinson
09b52b4fc4
fix: text kwargs no longer fail with empty string (#413)
* fix: text kwargs no longer fail with empty string

* linting
2023-03-28 21:03:51 +00:00
cragwolfe
ce9fc26009
feat: add ability to pass headers in partition_html (#397)
Also adds pytest-mock requirement, those fixtures are nice to have!

Implements issue/feature #396 .
2023-03-23 20:14:57 -07:00
ryannikolaidis
a4726cb197
fix: open xml files in read only mode (#362) 2023-03-13 13:06:45 -07:00
Matt Robinson
a5da3de43b
fix: ensure all text is maintained in html output (#335)
* fix: ensure all text is maintained in html pages

* add back in replace unicode quotes

* changelog and version bump

* apt-get update in ci

* white space differences in output
2023-03-02 14:03:13 -05:00
Tom Aarsen
5eb1466acc
Resolve various style issues to improve overall code quality (#282)
* Apply import sorting

ruff . --select I --fix

* Remove unnecessary open mode parameter

ruff . --select UP015 --fix

* Use f-string formatting rather than .format

* Remove extraneous parentheses

Also use "" instead of str()

* Resolve missing trailing commas

ruff . --select COM --fix

* Rewrite list() and dict() calls using literals

ruff . --select C4 --fix

* Add () to pytest.fixture, use tuples for parametrize, etc.

ruff . --select PT --fix

* Simplify code: merge conditionals, context managers

ruff . --select SIM --fix

* Import without unnecessary alias

ruff . --select PLR0402 --fix

* Apply formatting via black

* Rewrite ValueError somewhat

Slightly unrelated to the rest of the PR

* Apply formatting to tests via black

* Update expected exception message to match
0d81564

* Satisfy E501 line too long in test

* Update changelog & version

* Add ruff to make tidy and test deps

* Run 'make tidy'

* Update changelog & version

* Update changelog & version

* Add ruff to 'check' target

Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
2023-02-27 11:30:54 -05:00
Matt Robinson
e73cf09977
feat: optional page breaks for .pptx, .pdf, .html and images (#205)
* page breaks for pptx

* added page breaks for image/pdf

* tests for images with page breaks

* page breaks for html documents

* linting, linting, linting

* changelog and bump version

* update docs

* fix typo

* refactor reusable code to common.py

* add type back in
2023-02-08 15:11:15 +00:00
Matt Robinson
ee9f15483f
feat: partition_html directly from a url (#202)
* added tests for html from url

* bump version

* added types-requests

* and -> an
2023-02-07 14:09:34 +00:00
Matt Robinson
3c19c7cd8a
feat: Add partition_html brick (#91)
* update readme

* updated sphinx docs

* bump version; changelog

* clear cache; retrigger ci

* rename test file

* switch default parameters to None

* typo in the changelog

* add in text output
2022-12-12 14:22:10 +00:00