MiXiBo 0506aff788
add support for start_index in html links extraction (#2600)
Closes #2625.

Testing
```
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json


# Sample document with links in a paragraph, in a list item, and on their own.
html_text = """<html>
        <p>Hello there I am a <a href="/link">very important link!</a></p>
        <p>Here is a list of my favorite things</p>
        <ul>
            <li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li>
            <li>Dogs</li>
        </ul>
        <a href="/loner">A lone link!</a>
    </html>"""

# Partition the HTML, then dump the elements as JSON to inspect the link metadata.
elements = partition_html(text=html_text)
print(elements_to_json(elements))
```
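
To make the idea of a link `start_index` concrete, here is a minimal standard-library sketch. It assumes `start_index` means the character offset at which the link text begins within the element's text. The `LinkIndexParser` class and `extract_links` helper are hypothetical names used for illustration only; this is not the implementation used by `partition_html`.

```python
from html.parser import HTMLParser


class LinkIndexParser(HTMLParser):
    """Collect visible text and the offset at which each link's text begins."""

    def __init__(self):
        super().__init__()
        self.text = ""          # visible text accumulated so far
        self.links = []         # one dict per completed <a> element
        self._open_link = None  # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Assumption: start_index is the offset of the link text in self.text.
            self._open_link = {
                "text": "",
                "url": dict(attrs).get("href"),
                "start_index": len(self.text),
            }

    def handle_data(self, data):
        self.text += data
        if self._open_link is not None:
            self._open_link["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open_link is not None:
            self.links.append(self._open_link)
            self._open_link = None


def extract_links(fragment):
    """Return (text, links) for a single HTML fragment (hypothetical helper)."""
    parser = LinkIndexParser()
    parser.feed(fragment)
    parser.close()
    return parser.text, parser.links


text, links = extract_links(
    'Hello there I am a <a href="/link">very important link!</a>'
)
print(text)
print(links)
```

With an offset recorded this way, slicing the element text at `start_index` recovers the link text, which is what makes the index useful for mapping links back to their position in the surrounding sentence.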

---------

Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-04-12 06:14:20 +00:00

977 B

| filename | doctype | connector | cct-accuracy | cct-%missing |
| --- | --- | --- | --- | --- |
| fake-text.txt | txt | Sharepoint | 1.0 | 0.0 |
| ideas-page.html | html | Sharepoint | 0.93 | 0.033 |
| stanley-cups.xlsx | xlsx | Sharepoint | 0.778 | 0.0 |
| Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf | pdf | azure | 0.981 | 0.005 |
| IRS-form-1987.pdf | pdf | azure | 0.794 | 0.135 |
| spring-weather.html | html | azure | 0.0 | 0.018 |
| example-10k.html | html | local | 0.754 | 0.027 |
| fake-html-cp1252.html | html | local | 0.659 | 0.0 |
| ideas-page.html | html | local | 0.93 | 0.033 |
| UDHR_first_article_all.txt | txt | local-single-file | 0.995 | 0.0 |
| handbook-1p.docx | docx | local-single-file-basic-chunking | 0.858 | 0.029 |
| fake-html-cp1252.html | html | local-single-file-with-encoding | 0.659 | 0.0 |
| layout-parser-paper-with-table.jpg | jpg | local-single-file-with-pdf-infer-table-structure | 0.716 | 0.032 |
| layout-parser-paper.pdf | pdf | local-single-file-with-pdf-infer-table-structure | 0.95 | 0.029 |
| 2023-Jan-economic-outlook.pdf | pdf | s3 | 0.84 | 0.044 |
| page-with-formula.pdf | pdf | s3 | 0.971 | 0.021 |
| recalibrating-risk-report.pdf | pdf | s3 | 0.968 | 0.008 |