mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-09 01:55:55 +00:00

add support for start_index in html links extraction (closes #2625)

Testing

```
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json

html_text = """<html>
<p>Hello there I am a <a href="/link">very important link!</a></p>
<p>Here is a list of my favorite things</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li>
<li>Dogs</li>
</ul>
<a href="/loner">A lone link!</a>
</html>"""

elements = partition_html(text=html_text)
print(elements_to_json(elements))
```

---------

Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
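
To illustrate what a link `start_index` could look like, here is a minimal standard-library sketch. It assumes `start_index` means the character offset of the link text within the text of its enclosing element; that reading is an assumption, not something the commit message states. `LinkOffsetParser` and `link_offsets` are hypothetical names for this illustration and are not part of `unstructured`, whose real implementation may normalize whitespace or differ in other ways.

```python
from html.parser import HTMLParser


class LinkOffsetParser(HTMLParser):
    """Collect the plain text of an HTML fragment and the offset of each link in it."""

    def __init__(self):
        super().__init__()
        self.text_parts = []    # text chunks seen so far, in document order
        self.links = []         # completed link records
        self._open_link = None  # link currently being read, if any

    def _position(self):
        # Number of text characters collected so far.
        return sum(len(part) for part in self.text_parts)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open_link = {
                "text": "",
                "url": dict(attrs).get("href"),
                # Assumed meaning: offset of the link text in the element text.
                "start_index": self._position(),
            }

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._open_link is not None:
            self._open_link["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open_link is not None:
            self.links.append(self._open_link)
            self._open_link = None


def link_offsets(fragment):
    """Return (plain_text, links) for one HTML fragment (hypothetical helper)."""
    parser = LinkOffsetParser()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.text_parts), parser.links


text, links = link_offsets(
    '<p>Hello there I am a <a href="/link">very important link!</a></p>'
)
print(text)
print(links)
```

Under this assumption, slicing the element text at `start_index` for the length of the link text returns the link text itself, which gives a simple consistency check for any extracted link.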
977 B
| filename | doctype | connector | cct-accuracy | cct-%missing |
|---|---|---|---|---|
| fake-text.txt | txt | Sharepoint | 1.0 | 0.0 |
| ideas-page.html | html | Sharepoint | 0.93 | 0.033 |
| stanley-cups.xlsx | xlsx | Sharepoint | 0.778 | 0.0 |
| Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf | pdf | azure | 0.981 | 0.005 |
| IRS-form-1987.pdf | pdf | azure | 0.794 | 0.135 |
| spring-weather.html | html | azure | 0.0 | 0.018 |
| example-10k.html | html | local | 0.754 | 0.027 |
| fake-html-cp1252.html | html | local | 0.659 | 0.0 |
| ideas-page.html | html | local | 0.93 | 0.033 |
| UDHR_first_article_all.txt | txt | local-single-file | 0.995 | 0.0 |
| handbook-1p.docx | docx | local-single-file-basic-chunking | 0.858 | 0.029 |
| fake-html-cp1252.html | html | local-single-file-with-encoding | 0.659 | 0.0 |
| layout-parser-paper-with-table.jpg | jpg | local-single-file-with-pdf-infer-table-structure | 0.716 | 0.032 |
| layout-parser-paper.pdf | pdf | local-single-file-with-pdf-infer-table-structure | 0.95 | 0.029 |
| 2023-Jan-economic-outlook.pdf | pdf | s3 | 0.84 | 0.044 |
| page-with-formula.pdf | pdf | s3 | 0.971 | 0.021 |
| recalibrating-risk-report.pdf | pdf | s3 | 0.968 | 0.008 |