fix documentation html links example (#2608)

Closes #2577

Testing:
```
from unstructured.partition.html import partition_html

cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)
links = []

for element in elements:
    if element.metadata.link_urls:
        relative_link = element.metadata.link_urls[0][1:]
        if relative_link.startswith("2024"):
            links.append(f"{cnn_lite_url}{relative_link}")
            
print(links)
```
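
For checking the link-building step without hitting the network, the same logic can be sketched as a small helper. This is only an illustration: `build_article_links` and the sample data are hypothetical, and the helper takes plain lists standing in for `element.metadata.link_urls`. It uses `urljoin` and `lstrip("/")` rather than the `[1:]` slice and f-string from the docs example.

```python
from urllib.parse import urljoin


def build_article_links(base_url, link_urls_per_element, year_prefix="2024"):
    """Hypothetical helper: build absolute URLs from the first link of each element.

    `link_urls_per_element` stands in for the `element.metadata.link_urls`
    values; each entry is a list of relative URLs, an empty list, or None.
    """
    links = []
    for link_urls in link_urls_per_element:
        if not link_urls:  # skips both None and an empty list
            continue
        # Drop the leading slash so the year prefix can be matched directly.
        relative_link = link_urls[0].lstrip("/")
        if relative_link.startswith(year_prefix):
            links.append(urljoin(base_url, relative_link))
    return links


# Made-up sample data; real values come from partition_html(url=...).
sample = [["/2024/03/04/world/example-story"], None, [], ["/about"]]
print(build_article_links("https://lite.cnn.com/", sample))
```

The truthiness check mirrors `if element.metadata.link_urls:` in the fixed example, so elements with no links are skipped whether the attribute is `None` or empty.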

---------

Co-authored-by: ron-unstructured <ronny@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
John 2024-03-04 12:33:42 -06:00 committed by GitHub
parent b9aa4b7452
commit 3783b44d0b
GPG Key ID: B5690EEEBB952194
4 changed files with 6 additions and 6 deletions


@@ -5,7 +5,7 @@
### Features
### Fixes
* **Fix SharePoint dates with inconsistent formatting** Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string.
* **Include warnings** about the potential risk of installing a version of `pandoc` which does not support RTF files + instructions that will help resolve that issue.
* **Incorporate the `install-pandoc` Makefile recipe** into relevant stages of CI workflow, ensuring it is a version that supports RTF input files.


@@ -4,7 +4,7 @@
#
# pip-compile --output-file=build.txt build.in
#
-alabaster==0.7.16
+alabaster==0.7.13
# via sphinx
babel==2.14.0
# via sphinx


@@ -20,9 +20,9 @@ First, we gather links from the CNN Lite homepage using the `partition_html` fun
 links = []
 for element in elements:
-    if element.metadata.links is not None:
-        relative_link = element.metadata.links[0]["url"][1:]
-        if relative_link.startswith("2023"):
+    if element.metadata.link_urls:
+        relative_link = element.metadata.link_urls[0][1:]
+        if relative_link.startswith("2024"):
             links.append(f"{cnn_lite_url}{relative_link}")
Ingest Individual Articles with UnstructuredURLLoader


@@ -4,7 +4,7 @@
#
# pip-compile --output-file=build.txt build.in
#
-alabaster==0.7.16
+alabaster==0.7.13
# via sphinx
babel==2.14.0
# via sphinx