Docs updates (#2458)

To test:
> cd docs && make html

Changelog:
* Updated the best practice for table extraction to use
`skip_infer_table_types` instead of `pdf_infer_table_structure`.
* Fixed a CSS issue with a duplicate search box.
* Fixed an RST warning message.
* Fixed a typo on the Intro page.
Ronny H 2024-01-25 12:31:28 -08:00 committed by GitHub
parent d8b3bdb919
commit d5a6f4b82c
GPG Key ID: B5690EEEBB952194
6 changed files with 23 additions and 25 deletions

@ -1,4 +1,4 @@
## 0.12.3-dev5
## 0.12.3-dev6
### Enhancements
@ -16,6 +16,8 @@
* **Fix FSSpec destination connectors check_connection.** FSSpec destination connectors did not use `check_connection`. There was an error when trying to `ls` destination directory - it may not exist at the moment of connector creation. Now `check_connection` calls `ls` on bucket root and this method is called on `initialize` of destination connector.
* **Fix databricks-volumes extra location.** `setup.py` is currently pointing to the wrong location for the databricks-volumes extra requirements. This results in errors when trying to build the wheel for unstructured. This change updates to point to the correct path.
* **Fix uploading None values to Chroma and Pinecone.** Removes keys with None values with Pinecone and Chroma destinations. Pins Pinecone dependency
* **Update documentation.** (i) Updated the best practice for table extraction to use the 'skip_infer_table_types' param instead of 'pdf_infer_table_structure', and (ii) fixed CSS and RST issues and a typo in the documentation.
## 0.12.2

@ -33,11 +33,8 @@ To extract the tables from PDF files using the `partition_pdf <https://unstructu
Method 2: Using Auto Partition or Unstructured API
--------------------------------------------------
For extracting tables from PDFs using `auto partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__ , set the ``pdf_infer_table_structure`` parameter to **True** and ``strategy`` parameter to ``hi_res``.
By default, table extraction from ``pdf``, ``jpg``, ``png``, ``xls``, and ``xlsx`` file types is disabled. To enable table extraction from PDFs and other file types using `Auto Partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__ , you can set the ``skip_infer_table_types`` parameter to ``'[]'`` and ``strategy`` parameter to ``hi_res``.
.. warning::
You may get a warning when the ``pdf_infer_table_structure`` parameter is set to **True** AND **pdf** is included in the list of ``skip_infer_table_types`` parameter. However, this function will still extract the tables from PDF despite the conflict.
**Usage: Auto Partition**
@ -48,8 +45,8 @@ For extracting tables from PDFs using `auto partition <https://unstructured-io.g
filename = "example-docs/layout-parser-paper.pdf"
elements = partition(filename=filename,
pdf_infer_table_structure=True,
strategy='hi_res',
skip_infer_table_types='[]', # keep the quotes around the square brackets
)
tables = [el for el in elements if el.category == "Table"]
@ -62,11 +59,15 @@ For extracting tables from PDFs using `auto partition <https://unstructured-io.g
.. code-block:: bash
curl -X 'POST' \
'https://api.unstructured.io/general/v0/general' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'files=@sample-docs/layout-parser-paper.pdf' \
-F 'strategy=hi_res' \
-F 'pdf_infer_table_structure=true' \
| jq -C . | less -R
curl -X 'POST' \
'https://api.unstructured.io/general/v0/general' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'strategy=hi_res' \
-F 'skip_infer_table_types=[]' \
| jq -C . | less -R
.. warning::
You may get a warning when the ``pdf_infer_table_structure`` parameter is set to **True** AND **pdf** is included in the list of ``skip_infer_table_types`` parameter. However, this function will still extract the tables from PDF despite the conflict.
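
As a minimal sketch of the parameter change in this file (not part of the commit), the Python below builds the two settings the updated text recommends and checks for the conflict that the warning describes. The helper names ``table_extraction_fields`` and ``has_pdf_conflict`` are hypothetical and do not exist in ``unstructured``; the sketch also assumes the skip list is passed as a JSON-style string such as ``'[]'``, as in the examples above.

```python
import json


def table_extraction_fields(skip_types=()):
    """Hypothetical helper: settings that enable table extraction.

    An empty skip list means no file type is excluded from table inference.
    The list is serialized to a string such as '[]' (an assumption based on
    the examples in this diff).
    """
    return {
        "strategy": "hi_res",
        "skip_infer_table_types": json.dumps(list(skip_types)),
    }


def has_pdf_conflict(pdf_infer_table_structure, skip_types):
    """Hypothetical check for the situation the warning describes:
    table inference requested for PDFs while 'pdf' is also in the skip list.
    """
    return bool(pdf_infer_table_structure) and "pdf" in skip_types


fields = table_extraction_fields()
print(fields)
print(has_pdf_conflict(True, ["pdf", "jpg"]))
```

The resulting dictionary could be passed as keyword arguments to ``partition`` or sent as form fields to the API, under the assumptions stated above.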

@ -55,9 +55,10 @@ html_theme = "furo"
html_static_path = ["_static"]
# Adding a custom css file in order to add custom css file and can change the necessary elements.
# custom css and js for kapa.ai integration
html_favicon = "_static/images/unstructured_small.png"
html_css_files = ["unstructured.css"]
html_js_files = ["js/githubStargazers.js", "js/sidebarScrollPosition.js"]
html_js_files = ["js/githubStargazers.js", "js/sidebarScrollPosition.js", "custom.js"]
html_css_files = ["unstructured.css", "custom.css"]
html_theme_options = {
"sidebar_hide_name": True,
@ -135,8 +136,3 @@ html_theme_options = {
"sidebar-caption-space-above": "0",
},
}
# kapa.ai integration
html_static_path = ["_static"]
html_js_files = ["custom.js"]
html_css_files = ["custom.css"]
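
A short sketch of why merging the two blocks matters, using the values from this diff: a Sphinx ``conf.py`` is ordinary Python, so a later assignment to the same name replaces the earlier one. Whether this override caused the duplicate search box is not stated in the commit; the names ``before`` and ``after`` are illustrative only.

```python
# Before: two separate assignments in the same file; only the last one survives.
html_js_files = ["js/githubStargazers.js", "js/sidebarScrollPosition.js"]
html_js_files = ["custom.js"]
before = list(html_js_files)

# After: a single combined assignment, as in the updated configuration.
html_js_files = [
    "js/githubStargazers.js",
    "js/sidebarScrollPosition.js",
    "custom.js",
]
after = list(html_js_files)

print(before)
print(after)
```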

@ -1,8 +1,7 @@
Unstructured Core Library
=========================
The ``unstructured`` library is designed to help preprocess structure unstructured text documents
for use in downstream machine learning tasks. Examples of documents that can be processed
The ``unstructured`` library is designed to help preprocess and structure unstructured text documents for use in downstream machine learning tasks. Examples of documents that can be processed
using the ``unstructured`` library include PDFs, XML and HTML documents.
Library Documentation

@ -1,5 +1,5 @@
Databricks Volumes
===========
==================
Batch process all your records using ``unstructured-ingest`` to store structured outputs locally on your filesystem and upload those local files to a Databricks Volume.

@ -1 +1 @@
__version__ = "0.12.3-dev5" # pragma: no cover
__version__ = "0.12.3-dev6" # pragma: no cover