13 Commits

Author SHA1 Message Date
ryannikolaidis
66bf4b0198
feat: support extracting image url in html (#3955)
also removes mimetype when base64 is not included in image metadata

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2025-03-13 22:41:10 +00:00
ryannikolaidis
c0457c1cc3
feat: include images when partitioning html (#3945)
Currently we [filter img
tags](2addb19473/unstructured/partition/html/partition.py (L226-L229))
before tags are converted to Elements by the html partitioner. More
importantly, we also don’t currently have a defined “block” / mapping to
support these. This adds those mappings and the logic to process them.

It also respects `extract_image_block_types` and
`extract_image_block_to_payload` (as we do with pdfs) to determine
whether base64 is included in the metadata.

The partitioned Image Elements set their text to the img tag’s alt text
if available.

The partitioned Image Elements include the [url in the
metadata](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py#L209)
(rather than image_base64) if the img tag src is a url.
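
The img handling described above can be sketched with the standard library alone. This is an illustration only, not the partitioner's implementation: the `ImgCollector` class, the `include_base64` flag (standing in for the `extract_image_block_*` options), and the dictionary keys are assumptions made for the example.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Hypothetical helper: collects one record per <img> tag."""

    def __init__(self, include_base64: bool = True):
        super().__init__()
        self.include_base64 = include_base64
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src") or ""
        # The alt text, when present, becomes the element text.
        record = {"text": attr.get("alt") or "", "metadata": {}}
        if src.startswith("data:"):
            header, _, payload = src.partition(",")
            # The mime type is only recorded together with the base64 payload.
            if self.include_base64 and header.endswith(";base64"):
                record["metadata"]["image_base64"] = payload
                record["metadata"]["image_mime_type"] = header[len("data:"):-len(";base64")]
        elif src:
            # A src that is not a data URI is treated as a URL.
            record["metadata"]["image_url"] = src
        self.images.append(record)


def collect_images(html: str, include_base64: bool = True) -> list:
    collector = ImgCollector(include_base64=include_base64)
    collector.feed(html)
    return collector.images
```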

## Testing

Unit tests have been added for explicit coverage.
Existing integration tests and other unit test fixtures have been
updated to account for the `Image` elements now present.

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2025-03-08 01:25:21 +00:00
Steve Canny
1eceac26c8
rfctr(email): eml partitioner rewrite (#3694)
**Summary**
Initial attempts to incrementally refactor `partition_email()` into
shape to allow pluggable partitioning quickly became too complex for
ready code-review. Prepare separate rewritten module and tests and swap
them out whole.

**Additional Context**
- Uses the modern stdlib `email` module to reliably accomplish several
manual decoding steps in the legacy code.
- Remove obsolete email-specific element-types which were replaced 18
months or so ago with email-specific metadata fields for things like Cc:
addresses, subject, etc.
- Remove accepting an email as `text: str` because MIME-email is
inherently a binary format which can and often does contain multiple and
contradictory character-encodings.
- Remove the `encoding` parameter as it is now unused. An email file is not
a text file and as such does not have a single overall encoding.
Character encoding is specified individually for each MIME-part within
the message and often varies from one part to another in the same
message.
- Remove the need for a caller to specify `attachment_partitioner`.
There is only one reasonable choice for this which is
`auto.partition()`, consistent with the same interface and operation in
`partition_msg()`.
- Fixes #3671 along the way by silently skipping attachments with a
file-type for which there is no partitioner.
- Substantially extend the test-suite to cover multiple
transport-encoding/charset combinations.
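
The point about per-part character encodings can be seen with the stdlib `email` module on its own. The sketch below is not taken from the rewritten partitioner; `build_sample_email` and `decode_text_parts` are hypothetical helpers that build a message whose two parts use different charsets and transfer-encodings, then decode each part using its own declared charset.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample_email() -> bytes:
    """Build a multipart/alternative message with two differently encoded parts."""
    msg = EmailMessage()
    msg["Subject"] = "Charset demo"
    msg["From"] = "a@example.com"
    msg["To"] = "b@example.com"
    # Plain-text part: Latin-1 charset, quoted-printable transfer-encoding.
    msg.set_content("café", charset="iso-8859-1", cte="quoted-printable")
    # HTML part: UTF-8 charset, base64 transfer-encoding.
    msg.add_alternative("<p>日本語</p>", subtype="html", charset="utf-8", cte="base64")
    return msg.as_bytes()


def decode_text_parts(raw: bytes) -> list[tuple[str, str, str]]:
    """Return (content-type, charset, text) for each text part of the message."""
    msg = message_from_bytes(raw, policy=policy.default)
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # get_content() undoes the transfer-encoding and decodes the
            # bytes using the charset declared on this part only.
            parts.append(
                (part.get_content_type(), part.get_content_charset(), part.get_content().strip())
            )
    return parts
```

No single `encoding` argument could decode both parts correctly, which is why the message has to be handled as bytes.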

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-10-16 02:02:33 +00:00
Steve Canny
718891a447
rfctr(part): remove double-decoration 5 (#3692)
**Summary**
Remove double-decoration from EML and MSG.

**Additional Context**
- These needed to wait until the end because `partition_email()` and
`partition_msg()` can use any other partitioner for one of their
attachments.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-10-04 21:01:32 +00:00
John
147514f6b5
feat: msg and email metadata (#3444)
Update partition_eml and partition_msg to capture cc, bcc, and message
id fields.

Docs PR: https://github.com/Unstructured-IO/docs/pull/135/files

Testing
```
from unstructured.partition.email import partition_email
from test_unstructured.unit_utils import example_doc_path

elements = partition_email(filename=example_doc_path("eml/fake-email-header.eml"), include_headers=True)
print(elements)
elements[0].metadata.to_dict()
```

Note to reviewers:
Tests in `test_unstructured/partition/test_email.py` were refactored and
rearranged to group similar tests together, so it will be easiest to
review those changes commit by commit.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
2024-08-01 19:24:17 +00:00
Steve Canny
c27e0d0062
rfctr(html): replace html parser (#3218)
**Summary**
Replace legacy HTML parser with recursive version that captures all
content and provides flexibility to add new metadata. It's also
substantially faster although that's just a happy side-effect.

**Additional Context**
The prior HTML parsing algorithm that makes up the core of HTML
partitioning was buggy and very difficult to reason about because it did
not conform to the inherently recursive structure of HTML. The new
version retains `lxml` as the performant and reliable base library but
uses `lxml`'s custom element classes to efficiently classify HTML
elements by their behaviors (block-item and inline (phrasing) primarily)
and give those elements the desired partitioning behaviors.

This solves a host of existing problems with content being skipped and
elements (paragraphs) being divided improperly, but also provides a
clear domain model for reasoning about its behavior and reliably
adjusting it to suit our existing and future purposes.

The parser's operation is recursive, closely modeling the recursive
structure of HTML itself. Its behaviors are based on the HTML Standard
and reliably produce proper and explainable results even for novel
cases.

Fixes #2325 
Fixes #2562
Fixes #2675
Fixes #3168
Fixes #3227
Fixes #3228 
Fixes #3230 
Fixes #3237 
Fixes #3245 
Fixes #3247 
Fixes #3255
Fixes #3309 

### BEHAVIOR DIFFERENCES

#### `emphasized_text_tags` encoding is changed:
- `<strong>` is encoded as `"b"` rather than `"strong"`.
- `<em>` is encoded as `"i"` rather than `"em"`.
- `<span>` is no longer recorded in `emphasized_text_tags` (because
without the CSS we can't tell whether it's used for emphasis or if so
what kind).
- nested emphasis (e.g. bold+italic) is encoded as multiple characters
("bi").
- `emphasized_text_contents` is broken on emphasis-change boundaries,
like:
  ```html
  <p>foo <b>bar <i>baz</i> bada</b> bing</p>
  ```
  produces:
  ```json
  {
    "emphasized_text_contents": ["bar", "baz", "bada"],
    "emphasized_text_tags": ["b", "bi", "b"]
  }
  ```
   whereas previously it would have produced:
  ```json
  {
    "emphasized_text_contents": ["bar baz bada", "baz"],
    "emphasized_text_tags": ["b", "i"]
  }
  ```
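
The new encoding can be reproduced with a small recursive walk. This is a sketch of the rule only, not the parser's code: it uses the stdlib `xml.etree.ElementTree` rather than `lxml`, assumes well-formed input, and `emphasis_runs` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# <strong> is recorded as "b" and <em> as "i", as described above.
EMPHASIS = {"b": "b", "strong": "b", "i": "i", "em": "i"}


def emphasis_runs(html: str):
    """Return (contents, tags), split on every emphasis-change boundary."""
    root = ET.fromstring(html)
    contents, tags = [], []

    def record(text, active):
        text = " ".join((text or "").split())
        if text and active:
            contents.append(text)
            tags.append("".join(sorted(active)))  # nested bold+italic -> "bi"

    def walk(elem, active):
        inner = active | {EMPHASIS[elem.tag]} if elem.tag in EMPHASIS else active
        record(elem.text, inner)
        for child in elem:
            walk(child, inner)
            # Text after a child belongs to the enclosing element's emphasis.
            record(child.tail, inner)

    walk(root, frozenset())
    return contents, tags
```

Each run of text is recorded with the full set of emphasis tags active at that point, which is why nested emphasis appears as a multi-character tag.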

#### `<pre>` text is preserved as it appears in the html
Except that a leading newline is removed if present (has to be in
position 0 of text). Also, a trailing newline is stripped but only if it
appears in the very last position ([-1]) of the `<pre>` text. Old parser
stripped all leading and trailing whitespace.

Result is that:
```html
<pre>
foo
bar
baz
</pre>
```
parses to `"foo\nbar\nbaz"` which is the same result produced for:
```html
<pre>foo
bar
baz</pre>
```
This equivalence is the same behavior exhibited by a browser, which is
why we did the extra work to make it this way.
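
A minimal sketch of that trimming rule, assuming the raw `<pre>` text is already available as a string (`pre_text` is a hypothetical helper, not a function from the parser):

```python
def pre_text(raw: str) -> str:
    """Trim at most one leading and one trailing newline; keep everything else."""
    if raw.startswith("\n"):  # only a newline in position 0 is removed
        raw = raw[1:]
    if raw.endswith("\n"):  # only a newline in the very last position is removed
        raw = raw[:-1]
    return raw
```

Indentation and interior blank lines are left untouched, unlike the old behavior of stripping all leading and trailing whitespace.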

#### Whitespace normalization
Leading and trailing whitespace are removed from element text, just as
it is removed in the browser. Runs of whitespace within the element text
are reduced to a single space character (like in the browser). Note this
means that `\t`, `\n`, and `&nbsp;` are replaced with a regular space
character. All text derived from elements is whitespace normalized
except the text within a `<pre>` tag. Any leading or trailing newline is
trimmed from `<pre>` element text; all other whitespace is preserved
just as it appeared in the HTML source.
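
For text outside `<pre>`, the normalization described here amounts to something like the following sketch (`normalize_whitespace` is a hypothetical name; the no-break space is listed explicitly for clarity):

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse whitespace runs to one space and trim both ends."""
    # \s covers tab and newline; \u00a0 is the character produced by &nbsp;.
    return re.sub(r"[\s\u00a0]+", " ", text).strip()
```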

#### `link_start_indexes` metadata is no longer captured. Rationale:
- It was frequently wrong, often `-1`.
- It was deprecated but then added back in a community PR.
- Maintaining it across any possible downstream transformations (e.g.
chunking) would be expensive and almost certainly lead to wrong values
as distant code evolves.
- It is complex to compute and recompute when whitespace is normalized,
adding substantial complexity to the code and reducing readability and
maintainability

#### `<br/>` element is replaced with a single newline (`"\n"`)
but that is usually replaced with a space in `Element.text` when it is
normalized. The newline is preserved within a `<pre>` element.
  - Related: _No paragraph-break on `<br/><br/>`_

#### Empty `h1..h6` elements are dropped.
HTML heading elements (`<h1..h6>`) are "skipped" (do not generate a
`Title` element) when they contain no text or contain only whitespace.

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-07-11 00:14:28 +00:00
David Potter
db1e6993a8
rfctr [P6M-398]: salesforce connector v2 (#3344)
Updates salesforce source connector to v2.
2024-07-09 16:46:58 +00:00
Michał Martyniak
2d1923ac7e
Better element IDs - deterministic and document-unique hashes (#2673)
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more tests for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)
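
As a rough illustration of the idea, a deterministic ID can be derived by hashing the four inputs listed above. The exact recipe used by the library is not given here, so the separator, the choice of SHA-256, the truncation length, and the name `element_id` are all assumptions.

```python
import hashlib


def element_id(filename: str, page_number: int, index_on_page: int, text: str) -> str:
    """Hypothetical ID: the same inputs always produce the same hash."""
    # A separator keeps ("ab", "c") and ("a", "bc") from hashing identically.
    data = "\x1f".join([filename, str(page_number), str(index_on_page), text])
    return hashlib.sha256(data.encode("utf-8")).hexdigest()[:32]
```

Because the sequence number is part of the input, two elements with identical text on the same page still receive different IDs.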

This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
2024-04-24 00:05:20 -07:00
jakub-sandomierz-deepsense-ai
ee0441efea
enhancement: normalize Salesforce artifact extensions (#2402)
Connectors use a predictable naming convention for result files so that
consumers of the library can write code that is independent of any
particular connector.

This change introduces compatibility with said naming convention.

`_output_filename` now returns a filename that includes the format.
2024-01-16 10:36:00 +00:00
Steve Canny
b8a8de33f4
fix(ingest): canonicalize ingest JSON (#2080)
Canonicalize JSON produced for ingest tests such that incidental changes
in the _form_ of a JSON object (keys moving around) that do not change
the _content_ of that JSON object do not trigger an ingest-test
failure.
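
One common way to get this property, shown here as a sketch rather than as the code in this commit, is to serialize with sorted keys so that key order never affects the output (`canonical_json` is a hypothetical helper):

```python
import json


def canonical_json(obj) -> str:
    """Serialize with sorted keys so that key order cannot change the text."""
    return json.dumps(obj, sort_keys=True, indent=2, ensure_ascii=False)
```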
2023-11-15 00:52:58 -08:00
John
9500d04791
detect document language across all partitioners (#1627)
### Summary
Closes #1534 and #1535
Detects document language using `langdetect` package. 
Creates new kwargs for user to set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`)

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-11 01:47:56 +00:00
David Potter
01a147eb1d
feat: improved salesforce partitioning (#1475)
* Partitions Salesforce data as xml instead of text for improved detail and flexibility
* Partitions htmlbody instead of textbody for Salesforce emails
2023-09-25 11:44:28 -07:00
David Potter
b710bafa89
feat: add salesforce connector (#1168)
2023-09-02 08:50:31 -07:00