13 Commits

Author SHA1 Message Date
ryannikolaidis
66bf4b0198
feat: support extracting image url in html (#3955)
Also removes the MIME type when base64 is not included in the image metadata.

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2025-03-13 22:41:10 +00:00
ryannikolaidis
c0457c1cc3
feat: include images when partitioning html (#3945)
Currently we [filter img
tags](2addb19473/unstructured/partition/html/partition.py (L226-L229))
before tags are converted to Elements by the html partitioner. More
importantly, we don’t have a defined “block” mapping to support them.
This adds those mappings and the logic to process img tags.

It also respects `extract_image_block_types` and
`extract_image_block_to_payload` (as we do with pdfs) to determine
whether base64 is included in the metadata.

Each partitioned Image element sets its text to the img tag’s alt text,
if available.

The partitioned Image Elements include the [url in the
metadata](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py#L209)
(rather than image_base64) if the img tag src is a url.
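
The behaviour described above can be sketched with the standard library
alone. This is a minimal illustration, not the partitioner's code: the
`ImgCollector` class, the `include_base64` flag, and the dictionary keys
(`url`, `mime_type`, `image_base64`) are hypothetical names chosen for the
example.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect one record per <img> tag (hypothetical helper, for illustration)."""

    def __init__(self, include_base64: bool = False):
        super().__init__()
        self.include_base64 = include_base64
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src") or ""
        # The element text is the alt text when present, else empty.
        record = {"text": attr.get("alt") or "", "metadata": {}}
        if src.startswith("data:"):
            # Inline image: keep payload and MIME type only when requested.
            header, _, payload = src.partition(",")
            if self.include_base64 and header.endswith(";base64"):
                mime = header[len("data:"):-len(";base64")]
                record["metadata"]["mime_type"] = mime
                record["metadata"]["image_base64"] = payload
        elif src:
            # Remote image: record the URL instead of base64 data.
            record["metadata"]["url"] = src
        self.images.append(record)


collector = ImgCollector()
collector.feed('<p><img src="https://example.com/a.png" alt="A cat"></p>')
print(collector.images)
```

Note that the MIME type is only recorded together with the base64 payload,
mirroring the follow-up change in #3955.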

## Testing

Unit tests have been added for explicit coverage.
Existing integration tests and other unit test fixtures have been
updated to account for the `Image` elements now present.

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2025-03-08 01:25:21 +00:00
Steve Canny
b3a2dd4755
fix: html incorrectly categorizing text (#3841)
Fixes #3666

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-12-18 18:46:54 +00:00
Steve Canny
208c7edc52
rfctr(csv): minify HTML and table text is cct (#3733)
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html`
HTML introduced by `partition_csv()`. Produce minified `.text_as_html`
consistent with that formed by chunking.

**Additional Context**
- CSV `.metadata.text_as_html` is minified (no extra whitespace or
thead, tbody, tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.
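
A rough sketch of what "minified" and "CCT" could look like, using only the
standard library. The function name `csv_to_table` is hypothetical, and the
sketch assumes CCT means the stripped cell texts joined by single spaces; it
is not the code in `partition_csv()`.

```python
import csv
import io
from html import escape
from typing import Tuple


def csv_to_table(csv_text: str) -> Tuple[str, str]:
    """Return (minified_html, clean_concatenated_text) for CSV input."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    # Minified HTML: no whitespace between tags, no thead/tbody/tfoot.
    html = "<table>" + "".join(
        "<tr>" + "".join(f"<td>{escape(cell.strip())}</td>" for cell in row) + "</tr>"
        for row in rows
    ) + "</table>"
    # Assumed CCT: non-empty cell texts, stripped, joined by single spaces.
    text = " ".join(cell.strip() for row in rows for cell in row if cell.strip())
    return html, text


html, text = csv_to_table("a,b\n1, 2\n")
print(html)
print(text)
```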

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-10-19 06:49:09 +00:00
MiXiBo
0506aff788
add support for start_index in html links extraction (#2600)
add support for start_index in html links extraction (closes #2625)

Testing
```
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json


html_text = """<html>
        <p>Hello there I am a <a href="/link">very important link!</a></p>
        <p>Here is a list of my favorite things</p>
        <ul>
            <li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li>
            <li>Dogs</li>
        </ul>
        <a href="/loner">A lone link!</a>
    </html>"""

elements = partition_html(text=html_text)
print(elements_to_json(elements))
```
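
To illustrate what a `start_index` means for a link, here is a minimal
standard-library sketch. The `LinkIndexer` class is hypothetical, and it
assumes `start_index` is the offset of the link text within the accumulated
text; the library's own extraction may compute it differently.

```python
from html.parser import HTMLParser


class LinkIndexer(HTMLParser):
    """Record each <a> tag with the offset of its text (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.text = ""
        self.links = []
        self._open = None  # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = {
                "url": dict(attrs).get("href"),
                "text": "",
                "start_index": len(self.text),  # offset where the link text begins
            }

    def handle_data(self, data):
        self.text += data
        if self._open is not None:
            self._open["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            self.links.append(self._open)
            self._open = None


indexer = LinkIndexer()
indexer.feed('<p>Hello there I am a <a href="/link">very important link!</a></p>')
print(indexer.links)
```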

---------

Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-04-12 06:14:20 +00:00
Matt Robinson
b4d9ad8130
enhancement: detect headers in partition_pdf with fast strategy (#2455)
### Summary

Detects headers and footers when using `partition_pdf` with the fast
strategy. Identifies elements that are positioned in the top or bottom
5% of the page as headers or footers. If no coordinate information is
available, an element won't be detected as a header or footer.
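
The positional rule can be sketched as follows. This is an illustration
only: `classify_by_position` is a hypothetical helper, it assumes a top-left
coordinate origin, and it assumes an element must lie entirely within the
5% band, none of which is confirmed by the commit message.

```python
from typing import Optional, Tuple

# Threshold from the description above: top or bottom 5% of the page.
EDGE_FRACTION = 0.05


def classify_by_position(
    bbox: Optional[Tuple[float, float, float, float]],
    page_height: float,
    edge_fraction: float = EDGE_FRACTION,
) -> Optional[str]:
    """Return "Header", "Footer", or None.

    bbox is (x0, y0, x1, y1) with y growing downward from the top of the
    page (an assumption; PDF tooling often uses a bottom-left origin).
    """
    if bbox is None or page_height <= 0:
        # No coordinate information: nothing can be detected.
        return None
    _, y0, _, y1 = bbox
    if y1 <= page_height * edge_fraction:
        return "Header"
    if y0 >= page_height * (1 - edge_fraction):
        return "Footer"
    return None


print(classify_by_position((0, 10, 100, 30), page_height=1000))
```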

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2024-02-23 16:56:09 +00:00
Roman Isecke
b37b4689bc
drop python3.8 (#2372)
### Description
Remove all uses of python3.8

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-01-09 23:37:30 +00:00
Roman Isecke
cc05e948ff
chore: sensitive info connector audit (#2227)
### Description
All other connectors that were not included in
https://github.com/Unstructured-IO/unstructured/pull/2194 are now
updated to follow the new pattern and mark any variables as sensitive
where it makes sense.
Core changes:
* All connectors now support an `AccessConfig` that holds data needed
for auth (e.g. username, password); values that are sensitive are
designated as such using the new enhanced field.
* All cli configs on the cli definition now inherit from the base config
in the connector file to reuse the variables set on that dataclass
* The base writer class was updated to better generalize the new
approach given better use of dataclasses
* The base cli classes were refactored to also take into account the
need for a connector and write config when creating the respective
runner/writer classes.
* Any mismatches between the cli field name and the dataclass field name
were resolved on the dataclass side, so users are not affected and
naming stays consistent
* Add custom redaction logic for mongodb URIs since the password is
expected to be a part of it. Now this:
`"mongodb+srv://ingest-test-user:r4hK3BD07b@ingest-test.hgaig.mongodb.net/"`
->
`"mongodb+srv://ingest-test-user:***REDACTED***@ingest-test.hgaig.mongodb.net/"`
in the logs
* Bundle all fsspec based files into their own packages. 
* Refactor custom `_decode_dataclass` used for enhanced json mixin by
using a monkey-patch approach. The original approach was breaking on
optional nested dataclasses when serializing since the other methods in
`dataclasses_json_core` weren't using the new method. By monkey-patching
the original method with a new one, all other methods in that library
would use the new one.
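
The mongodb URI redaction mentioned above could look roughly like this. It
is a sketch under assumptions: the `redact_mongo_uri` function and its regex
are hypothetical and are not the connector's actual implementation.

```python
import re

# Captures "scheme://user:", the password, and the "@" that follows it.
# Hypothetical pattern for illustration only.
_MONGO_PASSWORD = re.compile(r"(mongodb(?:\+srv)?://[^:/@\s]+:)([^@/\s]+)(@)")


def redact_mongo_uri(uri: str, placeholder: str = "***REDACTED***") -> str:
    """Replace the password in a mongodb URI, leaving everything else intact."""
    return _MONGO_PASSWORD.sub(
        lambda m: m.group(1) + placeholder + m.group(3), uri
    )


print(redact_mongo_uri("mongodb+srv://user:s3cret@cluster.example.net/"))
```

A URI without credentials, such as `mongodb://localhost:27017/`, is returned
unchanged because the pattern requires an `@` directly after the password.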

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-12-11 17:37:49 +00:00
John
9500d04791
detect document language across all partitioners (#1627)
### Summary
Closes #1534 and #1535
Detects document language using the `langdetect` package.
Adds new kwargs that let the user set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`).

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-11 01:47:56 +00:00
Yao You
19d8bff275
feat: change default hi_res model to yolox quantized (#1607)
2023-10-04 03:28:47 +00:00
Klaijan
d26d591d6a
feat: get embedded url, associate text and start index for pdf (#1539)
**Executive Summary**

Adds the ability to capture hyperlinks (external or internal), along
with their associated text, for the pdf fast strategy.

**Technical Details**

- `pdfminer` associates an `annotation` (links and URIs) with a bounding
box rather than with text. Link-to-text matching is therefore not an
exact pairing; it is inferred from how the bounding boxes overlap.
- There is no word-level bounding box, only character-level boxes
(accessed using `LTChar`). To get to word level, a window slides through
the text. Alphanumeric and non-alphanumeric characters are captured
separately, so a word that contains both is split at the first
non-alphanumeric character.
- The bounding-box calculation uses the start and stop coordinates of
the corresponding word from the step above. The calculation simply uses
the distance between two points.
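
The two ideas above, splitting text into alphanumeric and non-alphanumeric
runs and matching a word to an annotation by bounding box, can be sketched
as below. The helper names are hypothetical, and the overlap-area ratio is a
stand-in for the distance-based calculation the PR describes, not a copy of
it.

```python
from itertools import groupby
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _kind(ch: str) -> str:
    if ch.isspace():
        return "space"
    return "alnum" if ch.isalnum() else "other"


def split_words(text: str) -> List[Tuple[str, int]]:
    """Split text into alphanumeric / non-alphanumeric runs with start indices."""
    words, pos = [], 0
    for kind, run in groupby(text, key=_kind):
        chunk = "".join(run)
        if kind != "space":  # whitespace only separates words
            words.append((chunk, pos))
        pos += len(chunk)
    return words


def overlap_ratio(word: BBox, annot: BBox) -> float:
    """Fraction of the word's box area covered by the annotation's box."""
    width = min(word[2], annot[2]) - max(word[0], annot[0])
    height = min(word[3], annot[3]) - max(word[1], annot[1])
    inter = max(0.0, width) * max(0.0, height)
    area = (word[2] - word[0]) * (word[3] - word[1])
    return inter / area if area > 0 else 0.0


print(split_words("see link-1 here"))
print(overlap_ratio((0, 0, 10, 10), (5, 0, 15, 10)))
```

A word would then be attributed to a link when its ratio exceeds some
threshold; the threshold value is left open here because the source does not
state one.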

The result now contains `links` in `metadata` as shown below:

```
            "links": [
                {
                    "text": "link",
                    "url": "https://github.com/Unstructured-IO/unstructured",
                    "start_index": 12
                },
                {
                    "text": "email",
                    "url": "mailto:unstructuredai@earlygrowth.com",
                    "start_index": 30
                },
                {
                    "text": "phone number",
                    "url": "tel:6505124019",
                    "start_index": 49
                }
            ]
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-09-27 13:43:32 -04:00
Amanda Cameron
a501d1d18f
Adding table extraction to partition_html (#1324)
Adding table extraction to HTML partitioning.

This PR utilizes 'table' HTML elements to extract and parse HTML tables
and return them in partitioning.

```
# checkout this branch, go into ipython shell
In [1]: from unstructured.partition.html import partition_html
In [2]: path_to_html = "{html sample file with table}"
In [3]: elements = partition_html(path_to_html)
```
You should see the table in the elements list.
2023-09-11 11:14:11 -07:00
Roman Isecke
db8af4f5de
Roman/notion tests (#1072)
### Description
* Add ingest test for Notion docs
* Update default cache dir for connectors to include connector name.
Makes debugging the cached content easier.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-08-21 15:16:50 -04:00