mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-09-03 13:51:07 +00:00
fix documentation html links example (#2608)
Closes #2577

Testing:
```
from unstructured.partition.html import partition_html

cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)
links = []

for element in elements:
    if element.metadata.link_urls:
        relative_link = element.metadata.link_urls[0][1:]
        if relative_link.startswith("2024"):
            links.append(f"{cnn_lite_url}{relative_link}")

print(links)
```

---------

Co-authored-by: ron-unstructured <ronny@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
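The filtering step in the testing snippet can also be exercised without network access. The following is a minimal sketch, not part of the commit: `collect_article_links` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `metadata.link_urls` attribute used above; they are not real `unstructured` elements, and the article path is made up for illustration.

```python
from types import SimpleNamespace


def collect_article_links(elements, base_url, year_prefix):
    """Hypothetical helper: build absolute URLs from each element's first link.

    Mirrors the loop in the testing snippet: skip elements without links,
    drop the leading "/" from the first link, and keep it only if it starts
    with the given year prefix.
    """
    links = []
    for element in elements:
        link_urls = element.metadata.link_urls
        if link_urls:
            relative_link = link_urls[0][1:]
            if relative_link.startswith(year_prefix):
                links.append(f"{base_url}{relative_link}")
    return links


# Stand-in elements (assumption): only the attribute path used above is modeled.
fake_elements = [
    SimpleNamespace(metadata=SimpleNamespace(link_urls=["/2024/03/05/world/example-story"])),
    SimpleNamespace(metadata=SimpleNamespace(link_urls=None)),
    SimpleNamespace(metadata=SimpleNamespace(link_urls=["/terms"])),
]

links = collect_article_links(fake_elements, "https://lite.cnn.com/", "2024")
print(links)
```

Checking truthiness of `link_urls` rather than comparing against `None` also skips elements whose link list is empty, which avoids an `IndexError` on `link_urls[0]`.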
parent b9aa4b7452
commit 3783b44d0b
@@ -5,7 +5,7 @@
 ### Features

 ### Fixes

 * **Fix SharePoint dates with inconsistent formatting** Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string.
 * **Include warnings** about the potential risk of installing a version of `pandoc` which does not support RTF files + instructions that will help resolve that issue.
 * **Incorporate the `install-pandoc` Makefile recipe** into relevant stages of CI workflow, ensuring it is a version that supports RTF input files.
@@ -4,7 +4,7 @@
 #
 # pip-compile --output-file=build.txt build.in
 #
-alabaster==0.7.16
+alabaster==0.7.13
     # via sphinx
 babel==2.14.0
     # via sphinx
@@ -20,9 +20,9 @@ First, we gather links from the CNN Lite homepage using the `partition_html` fun
 links = []

 for element in elements:
-    if element.metadata.links is not None:
-        relative_link = element.metadata.links[0]["url"][1:]
-        if relative_link.startswith("2023"):
+    if element.metadata.link_urls:
+        relative_link = element.metadata.link_urls[0][1:]
+        if relative_link.startswith("2024"):
             links.append(f"{cnn_lite_url}{relative_link}")

 Ingest Individual Articles with UnstructuredURLLoader
@@ -4,7 +4,7 @@
 #
 # pip-compile --output-file=build.txt build.in
 #
-alabaster==0.7.16
+alabaster==0.7.13
     # via sphinx
 babel==2.14.0
     # via sphinx