5 Commits

Author SHA1 Message Date
Filip Knefel
bdfd975115
chore: change table extraction defaults (#2588)
Change default values for table extraction. Works in tandem with
[this](https://github.com/Unstructured-IO/unstructured-api/pull/370)
`unstructured-api` PR

We want to move away from `pdf_infer_table_structure` parameter, in this
PR:
- We change how it's treated wrt `skip_infer_table_types` parameter.
Whether to extract tables from pdf now follows from the rule:
`pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set it to `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]` by default
- We remove it from the examples in documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in
documentation

More detailed description of how we want the parameters to interact:
- if `pdf_infer_table_structure` is False, tables will never be extracted
from pdf
- if `pdf_infer_table_structure` is True, tables will be extracted from
pdf unless pdf is skipped via `skip_infer_table_types`
- by default, `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`
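
The interaction described above can be sketched as a small predicate. This is an illustration only, not the library's implementation: the helper name `should_extract_pdf_tables` is hypothetical, and only the two parameter names and their defaults are taken from this description.

```python
from typing import List, Optional


def should_extract_pdf_tables(
    pdf_infer_table_structure: bool = True,
    skip_infer_table_types: Optional[List[str]] = None,
) -> bool:
    """Hypothetical helper mirroring the rule stated in this commit message."""
    # None stands in for the new default of an empty skip list.
    skipped = skip_infer_table_types if skip_infer_table_types is not None else []
    # Tables are extracted from pdf only if inference is on and "pdf" is not skipped.
    return pdf_infer_table_structure and "pdf" not in skipped
```

With the new defaults the predicate is true; passing `pdf_infer_table_structure=False` or including `"pdf"` in `skip_infer_table_types` turns it off.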

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-22 10:08:49 +00:00
John
3783b44d0b
fix documentation html links example (#2608)
Closes #2577

Testing:
```
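from unstructured.partition.html import partition_html

cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)
links = []

for element in elements:
    # Only elements that contain hyperlinks have `link_urls` metadata set.
    if element.metadata.link_urls:
        # Drop the first character (the leading "/" of a relative URL).
        relative_link = element.metadata.link_urls[0][1:]
        # Keep only paths that start with "2024" (article links in this example).
        if relative_link.startswith("2024"):
            links.append(f"{cnn_lite_url}{relative_link}")

print(links)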
```

---------

Co-authored-by: ron-unstructured <ronny@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-04 18:33:42 +00:00
Ronny H
8e6bc10ba1
Docs various updates (#2386)
To test:
> cd docs && make html 

Changelogs:
* Added verbiage about the cap limit and data usage for the Freemium API
* Added deprecation warning on Staging bricks
* Added a warning and code examples for using the SaaS API endpoints via
the CLI and the SDKs
* Fixed example page formatting
* Added deprecation warning on ``model_name`` param in favor of
``hi_res_model_name``
* Added ``extract_images_in_pdf`` usage and code example in
``partition_pdf`` section
* Reorganized and improved the documentation Intro section
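
The ``model_name`` deprecation noted above could be handled along these lines. This is a minimal sketch under assumptions, not the library's code: the helper `resolve_hi_res_model_name` is hypothetical, and letting the new parameter win when both are supplied is a choice made for this sketch.

```python
import warnings
from typing import Optional


def resolve_hi_res_model_name(
    hi_res_model_name: Optional[str] = None,
    model_name: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical helper: prefer the new parameter, warn when the old one is used."""
    if model_name is not None:
        warnings.warn(
            "`model_name` is deprecated; use `hi_res_model_name` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    # Assumption of this sketch: the new parameter wins when both are supplied.
    return hi_res_model_name if hi_res_model_name is not None else model_name
```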
2024-01-17 21:01:01 +00:00
Ronny H
ac380ce989
Added AWS Marketplace docs and improved Azure Marketplace docs (#2248)
To test:
> cd docs && make html

Change logs:
- Added AWS Marketplace documentation
- Improved Azure Marketplace documentation - Networking section
2023-12-20 20:13:47 +00:00
Ronny H
d80abf0714
Reorganized the Examples section in Documentation & add Databricks example (#1855)
To test:
> cd docs && make html

Change logs:
* Examples are reorganized to have their own page
* Removed two old examples, i.e. "file-utils" & "sentiment analysis".
* Added two examples: "RAG with Unstructured, LangChain, and ChromaDB" &
"Multi-Files Processing with S3 Connector and API"
* Reorganized and added detailed API documentation: (i) usage, (ii)
SDKs, (iii) Azure Marketplace, (iv) AWS Marketplace, (v) parameters and
validation errors
2023-11-30 01:24:43 +00:00