mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-12-27 23:24:27 +00:00
Docs updates (#2458)
To test:

> cd docs && make html

Change logs:

* Updates the best practice for table extraction to use `skip_infer_table_types` instead of `pdf_infer_table_structure`.
* Fixed CSS issue with a duplicate search box.
* Fixed RST warning message
* Fixed typo on the Intro page.
parent d8b3bdb919
commit d5a6f4b82c
@@ -1,4 +1,4 @@
-## 0.12.3-dev5
+## 0.12.3-dev6

 ### Enhancements

@@ -16,6 +16,8 @@
 * **Fix FSSpec destination connectors check_connection.** FSSpec destination connectors did not use `check_connection`. There was an error when trying to `ls` destination directory - it may not exist at the moment of connector creation. Now `check_connection` calls `ls` on bucket root and this method is called on `initialize` of destination connector.
 * **Fix databricks-volumes extra location.** `setup.py` is currently pointing to the wrong location for the databricks-volumes extra requirements. This results in errors when trying to build the wheel for unstructured. This change updates to point to the correct path.
 * **Fix uploading None values to Chroma and Pinecone.** Removes keys with None values with Pinecone and Chroma destinations. Pins Pinecone dependency
+* **Update documentation.** (i) best practice for table extration by using 'skip_infer_table_types' param, instead of 'pdf_infer_table_structure', and (ii) fixed CSS, RST issues and typo in the documentation.
+

 ## 0.12.2

@@ -33,11 +33,8 @@ To extract the tables from PDF files using the `partition_pdf <https://unstructu
 Method 2: Using Auto Partition or Unstructured API
 --------------------------------------------------

-For extracting tables from PDFs using `auto partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__ , set the ``pdf_infer_table_structure`` parameter to **True** and ``strategy`` parameter to ``hi_res``.
+By default, table extraction from ``pdf``, ``jpg``, ``png``, ``xls``, and ``xlsx`` file types is disabled. To enable table extraction from PDFs and other file types using `Auto Partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__ , you can set the ``skip_infer_table_types`` parameter to ``'[]'`` and ``strategy`` parameter to ``hi_res``.

-.. warning::
-
-    You may get a warning when the ``pdf_infer_table_structure`` parameter is set to **True** AND **pdf** is included in the list of ``skip_infer_table_types`` parameter. However, this function will still extract the tables from PDF despite the conflict.

 **Usage: Auto Partition**

@@ -48,8 +45,8 @@ For extracting tables from PDFs using `auto partition <https://unstructured-io.g
     filename = "example-docs/layout-parser-paper.pdf"

     elements = partition(filename=filename,
-                         pdf_infer_table_structure=True,
                          strategy='hi_res',
+                         skip_infer_table_types='[]', # don't forget to include apostrophe around the square bracket
                          )

     tables = [el for el in elements if el.category == "Table"]
@@ -62,11 +59,15 @@ For extracting tables from PDFs using `auto partition <https://unstructured-io.g

 .. code-block:: bash

-    curl -X 'POST' \
-      'https://api.unstructured.io/general/v0/general' \
-      -H 'accept: application/json' \
-      -H 'Content-Type: multipart/form-data' \
-      -F 'files=@sample-docs/layout-parser-paper.pdf' \
-      -F 'strategy=hi_res' \
-      -F 'pdf_infer_table_structure=true' \
-      | jq -C . | less -R
+    curl -X 'POST' \
+      'https://api.unstructured.io/general/v0/general' \
+      -H 'accept: application/json' \
+      -H 'Content-Type: multipart/form-data' \
+      -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
+      -F 'strategy=hi_res' \
+      -F 'skip_infer_table_types=[]' \
+      | jq -C . | less -R

+.. warning::
+
+    You may get a warning when the ``pdf_infer_table_structure`` parameter is set to **True** AND **pdf** is included in the list of ``skip_infer_table_types`` parameter. However, this function will still extract the tables from PDF despite the conflict.
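
The interaction between the two parameters described in the documentation text above can be modeled with a small sketch. This is only an illustration of the documented behavior, not code from the `unstructured` library: the helper `should_infer_tables` and the constant `DEFAULT_SKIP_TYPES` are hypothetical names, and the skip list is passed as a Python list here for simplicity rather than the quoted `'[]'` string used in the usage example.

```python
import warnings

# Assumption taken from the doc text: table extraction is disabled
# by default for these file types.
DEFAULT_SKIP_TYPES = ["pdf", "jpg", "png", "xls", "xlsx"]


def should_infer_tables(file_type, skip_infer_table_types=None,
                        pdf_infer_table_structure=False):
    """Hypothetical model of the documented behavior; not library code."""
    if skip_infer_table_types is None:
        skip = DEFAULT_SKIP_TYPES
    else:
        skip = skip_infer_table_types
    if file_type == "pdf" and pdf_infer_table_structure:
        if "pdf" in skip:
            # The docs describe a warning in this case,
            # with tables still being extracted from the PDF.
            warnings.warn(
                "pdf_infer_table_structure=True conflicts with "
                "'pdf' in skip_infer_table_types"
            )
        return True
    return file_type not in skip


print(should_infer_tables("pdf"))                             # False
print(should_infer_tables("pdf", skip_infer_table_types=[]))  # True
print(should_infer_tables("jpg", skip_infer_table_types=[]))  # True
```

Under this reading, passing an empty skip list is the single switch that enables table extraction for every file type, which is why the updated docs recommend it over the PDF-only flag.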
@@ -55,9 +55,10 @@ html_theme = "furo"
 html_static_path = ["_static"]

 # Adding a custom css file in order to add custom css file and can change the necessary elements.
+# custom css and js for kapa.ai integration
 html_favicon = "_static/images/unstructured_small.png"
-html_css_files = ["unstructured.css"]
-html_js_files = ["js/githubStargazers.js", "js/sidebarScrollPosition.js"]
+html_js_files = ["js/githubStargazers.js", "js/sidebarScrollPosition.js", "custom.js"]
+html_css_files = ["unstructured.css", "custom.css"]

 html_theme_options = {
     "sidebar_hide_name": True,
@@ -135,8 +136,3 @@ html_theme_options = {
         "sidebar-caption-space-above": "0",
     },
 }
-
-# kapa.ai integration
-html_static_path = ["_static"]
-html_js_files = ["custom.js"]
-html_css_files = ["custom.css"]
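
A Sphinx `conf.py` is an ordinary Python module, so when the same setting name is assigned twice, only the last assignment survives. The sketch below illustrates that with the values from the two hunks above; the `load_settings` helper and the inlined configuration strings are illustrative assumptions, and the sketch does not claim to reproduce the exact rendering issue the commit message mentions.

```python
# Settings as they appeared before the change: each list is assigned twice.
OLD_CONF = '''
html_css_files = ["unstructured.css"]
html_js_files = ["js/githubStargazers.js", "js/sidebarScrollPosition.js"]
# kapa.ai integration
html_js_files = ["custom.js"]
html_css_files = ["custom.css"]
'''

# Settings after the change: one assignment per name, lists merged.
NEW_CONF = '''
html_js_files = ["js/githubStargazers.js", "js/sidebarScrollPosition.js", "custom.js"]
html_css_files = ["unstructured.css", "custom.css"]
'''


def load_settings(source):
    """Execute config source and return only the html_* settings."""
    namespace = {}
    exec(source, namespace)
    return {k: v for k, v in namespace.items() if k.startswith("html_")}


old = load_settings(OLD_CONF)
new = load_settings(NEW_CONF)

print(old["html_css_files"])  # ['custom.css']
print(new["html_css_files"])  # ['unstructured.css', 'custom.css']
```

With the old layout, `unstructured.css` and the two sidebar scripts are silently dropped from the final settings; the merged lists keep every file.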
@@ -1,8 +1,7 @@
 Unstructured Core Library
 =========================

-The ``unstructured`` library is designed to help preprocess structure unstructured text documents
-for use in downstream machine learning tasks. Examples of documents that can be processed
+The ``unstructured`` library is designed to help preprocess and structure unstructured text documents for use in downstream machine learning tasks. Examples of documents that can be processed
 using the ``unstructured`` library include PDFs, XML and HTML documents.

 Library Documentation
@@ -1,5 +1,5 @@
 Databricks Volumes
-===========
+==================

 Batch process all your records using ``unstructured-ingest`` to store structured outputs locally on your filesystem and upload those local files to a Databricks Volume.

@@ -1 +1 @@
-__version__ = "0.12.3-dev5" # pragma: no cover
+__version__ = "0.12.3-dev6" # pragma: no cover