Add strategy documentation (#1353)

2025-12-18 10:44:23 +00:00 · 2023-09-09 18:54:01 -07:00 · edc45013dc
commit edc45013dc
parent 915e4adcbb
8 changed files with 49 additions and 2 deletions
--- a/docs/source/best_practices.rst
+++ b/docs/source/best_practices.rst
@@ -0,0 +1,15 @@
+Best Practices
+==============
+
+Unstructured offers several strategies and models for extracting information from documents. These best practices provide guidelines for configuring the ``strategy`` and ``model`` settings to optimize document information extraction.
+
+High-level overview of the strategies and models available in the ``unstructured`` library:
+
+.. image:: imgs/strategy.png
+  :width: 1000
+  :alt: Overview of the strategies and models available in the unstructured library
+
+.. toctree::
+   :maxdepth: 1
+
+   best_practices/strategies
--- a/docs/source/best_practices/strategies.rst
+++ b/docs/source/best_practices/strategies.rst
@@ -0,0 +1,27 @@
+Strategies
+==========
+
+The Unstructured library offers several ways to preprocess documents, which can be selected with the ``strategy`` parameter.
+
+**Basic usage:**
+
+.. code:: python
+
+    elements = partition(filename=filename, strategy='hi_res')
+
+**Available options:**
+
+* ``auto`` (default strategy): The ``auto`` strategy chooses the partitioning strategy based on document characteristics and the function kwargs.
+* ``fast``: The ``fast`` strategy uses traditional NLP extraction techniques to quickly pull all the text elements. It is not recommended for image-based file types.
+* ``hi_res``: The ``hi_res`` strategy identifies the layout of the document using ``detectron2``. Its advantage is that it uses the document layout to gain additional information about document elements. We recommend this strategy if your use case is highly sensitive to correct classification of document elements.
+* ``ocr_only``: The ``ocr_only`` strategy uses Optical Character Recognition (OCR) to extract text from image-based files.
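
The guidance in the list above can be sketched as a small selection helper. This is only an illustration: ``pick_strategy`` and its three flags are hypothetical names introduced here, not part of the ``unstructured`` API, and the precedence between the rules is an assumption rather than something the documentation specifies.

```python
def pick_strategy(
    image_based: bool,
    classification_sensitive: bool = False,
    prefer_speed: bool = False,
) -> str:
    """Hypothetical helper that maps the documented guidance to a strategy name.

    The ordering of the checks below is an assumption made for this sketch.
    """
    # Layout-aware partitioning is recommended when element classification matters.
    if classification_sensitive:
        return "hi_res"
    # Image-based files need OCR; "fast" is not suited to them.
    if image_based:
        return "ocr_only"
    # Text-based files can use the quick extraction path when speed is the priority.
    if prefer_speed:
        return "fast"
    # Otherwise defer to the library's default strategy.
    return "auto"


strategy = pick_strategy(image_based=True)
print(strategy)  # prints "ocr_only"

# With the library installed, the result would be passed on like this:
# from unstructured.partition.auto import partition
# elements = partition(filename=filename, strategy=strategy)
```

The helper returns plain strings so that its result can be handed straight to the ``strategy`` parameter shown under basic usage.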
+
+**These strategies are available on the following partition bricks:**
+
++-------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
+| Document Type                             | Partition Function             | Strategies                             | Table Support  | Options                                                                                                          |
++-------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
+| Images (`.png`/`.jpg`)                    | `partition_image`              | "auto", "hi_res", "ocr_only"           | Yes            | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages; Strategy                                    |
++-------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
+| PDFs (`.pdf`)                             | `partition_pdf`                | "auto", "fast", "hi_res", "ocr_only"   | Yes            | Encoding; Include Page Breaks; Infer Table Structure; Max Partition; OCR Languages; Strategy                     |
++-------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
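
The table above can also be read as a lookup from partition function to supported strategies. The following is a minimal sketch of that reading; ``SUPPORTED_STRATEGIES`` and ``check_strategy`` are hypothetical names used for illustration only, and the mapping contains nothing beyond the two rows of the table.

```python
# Strategies per partition function, transcribed from the table above.
SUPPORTED_STRATEGIES = {
    "partition_image": ("auto", "hi_res", "ocr_only"),
    "partition_pdf": ("auto", "fast", "hi_res", "ocr_only"),
}


def check_strategy(partition_function: str, strategy: str) -> str:
    """Return the strategy unchanged if the table lists it for the function."""
    if partition_function not in SUPPORTED_STRATEGIES:
        raise ValueError(f"no strategy information for {partition_function!r}")
    allowed = SUPPORTED_STRATEGIES[partition_function]
    if strategy not in allowed:
        raise ValueError(
            f"{strategy!r} is not listed for {partition_function}; "
            f"expected one of {allowed}"
        )
    return strategy


print(check_strategy("partition_pdf", "fast"))  # prints "fast"
```

A check like this would reject ``"fast"`` for ``partition_image`` before any partitioning work starts, which matches the note that the fast strategy is not suited to image-based files.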
--- a/docs/source/downstream_connectors/pgvector.rst
+++ b/docs/source/downstream_connectors/pgvector.rst
--- a/docs/source/imgs/strategy.png
+++ b/docs/source/imgs/strategy.png
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -29,6 +29,9 @@ Library Documentation
 :doc:`integrations`
  We make it easy for you to connect your output with other popular ML services.

+:doc:`best_practices`
+  Learn best practices to optimize document information extraction using the ``unstructured`` library.
+
 .. Hidden TOCs

 .. toctree::
@@ -43,5 +46,5 @@ Library Documentation
   upstream_connectors
   metadata
   examples
-   strategies
   integrations
+   best_practices
--- a/docs/source/introductions_all.rst
+++ b/docs/source/introductions_all.rst
--- a/docs/source/upstream_connectors.rst
+++ b/docs/source/upstream_connectors.rst
@@ -21,6 +21,7 @@ in our community `Slack. <https://join.slack.com/t/unstructuredw-kbe4326/shared_
   upstream_connectors/gitlab
   upstream_connectors/google_cloud_storage
   upstream_connectors/google_drive
+   upstream_connectors/jira
   upstream_connectors/local_connector
   upstream_connectors/notion
   upstream_connectors/onedrive
--- a/docs/source/upstream_connectors/jira.rst
+++ b/docs/source/upstream_connectors/jira.rst
@@ -1,5 +1,6 @@
 Jira
-==========
+=====
+
 Connect Jira to your preprocessing pipeline, and batch process all your documents using ``unstructured-ingest`` to store structured outputs locally on your filesystem.

 First you'll need to install the Jira dependencies as shown here.