Docs updates (#2458)

To test:
> cd docs && make html

Change logs:
* Updates the best practice for table extraction to use `skip_infer_table_types` instead of `pdf_infer_table_structure`.
* Fixed CSS issue with a duplicate search box.
* Fixed RST warning message.
* Fixed typo on the Intro page.
2025-12-27 23:24:27 +00:00 · 2024-01-25 12:31:28 -08:00 · 2024-01-25 12:31:28 -08:00 · d5a6f4b82c
commit d5a6f4b82c
parent d8b3bdb919
6 changed files with 23 additions and 25 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -1,4 +1,4 @@
-## 0.12.3-dev5
+## 0.12.3-dev6

 ### Enhancements

@ -16,6 +16,8 @@
 * **Fix FSSpec destination connectors check_connection.** FSSpec destination connectors did not use `check_connection`. There was an error when trying to `ls` destination directory - it may not exist at the moment of connector creation. Now `check_connection` calls `ls` on bucket root and this method is called on `initialize` of destination connector.
 * **Fix databricks-volumes extra location.** `setup.py` is currently pointing to the wrong location for the databricks-volumes extra requirements. This results in errors when trying to build the wheel for unstructured. This change updates to point to the correct path.
 * **Fix uploading None values to Chroma and Pinecone.** Removes keys with None values with Pinecone and Chroma destinations. Pins Pinecone dependency
+* **Update documentation.** (i) Updated the best practice for table extraction to use the `skip_infer_table_types` parameter instead of `pdf_infer_table_structure`, and (ii) fixed CSS and RST issues and a typo in the documentation.
+

 ## 0.12.2

--- a/docs/source/best_practices/table_extraction_pdf.rst
+++ b/docs/source/best_practices/table_extraction_pdf.rst
@ -33,11 +33,8 @@ To extract the tables from PDF files using the `partition_pdf <https://unstructu
 Method 2: Using Auto Partition or Unstructured API
 --------------------------------------------------

-For extracting tables from PDFs using `auto partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__ , set the ``pdf_infer_table_structure`` parameter to **True** and ``strategy`` parameter to ``hi_res``.
+By default, table extraction from ``pdf``, ``jpg``, ``png``, ``xls``, and ``xlsx`` file types is disabled. To enable table extraction from PDFs and other file types using `Auto Partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__, set the ``skip_infer_table_types`` parameter to ``'[]'`` and the ``strategy`` parameter to ``hi_res``.

-.. warning::
-
-    You may get a warning when the ``pdf_infer_table_structure`` parameter is set to **True** AND **pdf** is included in the list of ``skip_infer_table_types`` parameter. However, this function will still extract the tables from PDF despite the conflict.

 **Usage: Auto Partition**

@ -48,8 +45,8 @@ For extracting tables from PDFs using `auto partition <https://unstructured-io.g
    filename = "example-docs/layout-parser-paper.pdf"

    elements = partition(filename=filename,
-                         pdf_infer_table_structure=True,
                         strategy='hi_res',
+                         skip_infer_table_types='[]', # keep the quotes around the square brackets
               )

    tables = [el for el in elements if el.category == "Table"]
@ -62,11 +59,15 @@ For extracting tables from PDFs using `auto partition <https://unstructured-io.g

 .. code-block:: bash

-     curl -X 'POST' \
-      'https://api.unstructured.io/general/v0/general' \
-      -H 'accept: application/json' \
-      -H 'Content-Type: multipart/form-data' \
-      -F 'files=@sample-docs/layout-parser-paper.pdf' \
-      -F 'strategy=hi_res' \
-      -F 'pdf_infer_table_structure=true' \
-      | jq -C . | less -R
+      curl -X 'POST' \
+          'https://api.unstructured.io/general/v0/general' \
+          -H 'accept: application/json' \
+          -H 'Content-Type: multipart/form-data' \
+          -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
+          -F 'strategy=hi_res' \
+          -F 'skip_infer_table_types=[]' \
+          | jq -C . | less -R
+
+.. warning::
+
+    You may get a warning when the ``pdf_infer_table_structure`` parameter is set to **True** and **pdf** is included in the ``skip_infer_table_types`` list. However, tables will still be extracted from the PDF despite the conflict.
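
For readers who call the API from Python rather than curl, the request fields described in the updated docs can be sketched roughly as follows. `table_extraction_fields` and `select_tables` are hypothetical helpers, not part of `unstructured`. Only the empty-list value (`[]`) comes from the docs in this diff; serialising a non-empty list as JSON and the shape of the response (a list of dicts with a `type` key) are assumptions.

```python
import json


def table_extraction_fields(skip_types=()):
    """Build form fields that enable table extraction (hypothetical helper).

    Field names are taken from the docs in this diff; serialising a
    non-empty list as JSON is an assumption, not confirmed behaviour.
    """
    return {
        "strategy": "hi_res",
        # An empty skip list serialises to the string "[]", as in the docs.
        "skip_infer_table_types": json.dumps(list(skip_types)),
    }


def select_tables(elements):
    """Keep only elements whose type is "Table" (assumed response shape)."""
    return [el for el in elements if el.get("type") == "Table"]


fields = table_extraction_fields()
sample = [{"type": "Title", "text": "Results"}, {"type": "Table", "text": "a b"}]
tables = select_tables(sample)
```

The `fields` dict could then be sent as form data, alongside the uploaded file, with whichever HTTP client is already in use.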
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@ -55,9 +55,10 @@ html_theme = "furo"
 html_static_path = ["_static"]

 # Adding a custom css file in order to add custom css file and can change the necessary elements.
+# custom css and js for kapa.ai integration
 html_favicon = "_static/images/unstructured_small.png"
-html_css_files = ["unstructured.css"]
-html_js_files = ["js/githubStargazers.js", "js/sidebarScrollPosition.js"]
+html_js_files = ["js/githubStargazers.js", "js/sidebarScrollPosition.js", "custom.js"]
+html_css_files = ["unstructured.css", "custom.css"]

 html_theme_options = {
    "sidebar_hide_name": True,
@ -135,8 +136,3 @@ html_theme_options = {
        "sidebar-caption-space-above": "0",
    },
 }
-
-# kapa.ai integration
-html_static_path = ["_static"]
-html_js_files = ["custom.js"]
-html_css_files = ["custom.css"]
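
One plausible motivation for folding the trailing kapa.ai block into the earlier assignments can be shown with plain Python: `conf.py` is an ordinary module, so assigning to the same name a second time rebinds it rather than appending. This is a minimal sketch using the values from the diff above; it does not load Sphinx, and the commit does not state whether this alone caused the duplicate search box.

```python
# Old layout: the same names are assigned twice in one module.
html_css_files = ["unstructured.css"]
html_js_files = ["js/githubStargazers.js", "js/sidebarScrollPosition.js"]

# The later kapa.ai block rebinds both names, discarding the earlier lists.
html_js_files = ["custom.js"]
html_css_files = ["custom.css"]
old_effective = (list(html_css_files), list(html_js_files))

# New layout: one assignment per name, so every file stays listed.
html_js_files = ["js/githubStargazers.js", "js/sidebarScrollPosition.js", "custom.js"]
html_css_files = ["unstructured.css", "custom.css"]
new_effective = (list(html_css_files), list(html_js_files))
```

With the old layout only the kapa.ai files remain in effect, while the merged lists keep both the project files and the kapa.ai ones.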
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@ -1,8 +1,7 @@
 Unstructured Core Library
 =========================

-The ``unstructured`` library is designed to help preprocess structure unstructured text documents
-for use in downstream machine learning tasks. Examples of documents that can be processed
+The ``unstructured`` library is designed to help preprocess and structure unstructured text documents for use in downstream machine learning tasks. Examples of documents that can be processed
 using the ``unstructured`` library include PDFs, XML and HTML documents.

 Library Documentation
--- a/docs/source/ingest/destination_connectors/databricks_volumes.rst
+++ b/docs/source/ingest/destination_connectors/databricks_volumes.rst
@ -1,5 +1,5 @@
 Databricks Volumes
-===========
+==================

 Batch process all your records using ``unstructured-ingest`` to store structured outputs locally on your filesystem and upload those local files to a Databricks Volume.

--- a/unstructured/version.py
+++ b/unstructured/version.py
@ -1 +1 @@
-__version__ = "0.12.3-dev5"  # pragma: no cover
+__version__ = "0.12.3-dev6"  # pragma: no cover