Add strategy documentation (#1353)

Ronny H 2023-09-09 18:54:01 -07:00 committed by GitHub
parent 915e4adcbb
commit edc45013dc
8 changed files with 49 additions and 2 deletions

@ -0,0 +1,15 @@
Best Practices
==============
Unstructured offers several strategies and models for extracting document information. These best practices provide guidelines for configuring the ``strategy`` and ``model`` options to optimize document information extraction.
High-level overview of the strategies and models available in the ``unstructured`` library:
.. image:: imgs/strategy.png
:width: 1000
:alt: Overview of available strategies and models
.. toctree::
:maxdepth: 1
best_practices/strategies

@ -0,0 +1,27 @@
Strategies
==========
The Unstructured library offers several ways to preprocess documents, selected with the ``strategy`` parameter.
**Basic usage:**
.. code:: python

  from unstructured.partition.auto import partition

  elements = partition(filename=filename, strategy='hi_res')
**Available options:**
* ``auto`` (default strategy): The ``auto`` strategy chooses the partitioning strategy based on document characteristics and the function kwargs.
* ``fast``: The ``fast`` strategy uses traditional NLP extraction techniques to quickly pull all the text elements. It is not recommended for image-based file types.
* ``hi_res``: The ``hi_res`` strategy identifies the layout of the document using ``detectron2``. Its advantage is that it uses the document layout to gain additional information about document elements. We recommend this strategy if your use case is highly sensitive to correct classification of document elements.
* ``ocr_only``: The ``ocr_only`` strategy uses Optical Character Recognition (OCR) to extract text from image-based files.
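The guidance in the list above can be sketched as a small selection helper. This is only an illustration: ``choose_strategy``, ``partition_with_choice``, and the ``layout_sensitive`` flag are hypothetical names and not part of the library, and the extension check is an assumed heuristic for "image-based" files.

```python
from pathlib import Path

# Hypothetical helper (not part of unstructured): maps a file to one of the
# documented strategy names using the guidance in the list above.
IMAGE_SUFFIXES = {".png", ".jpg"}


def choose_strategy(filename: str, layout_sensitive: bool = False) -> str:
    """Return a value for the ``strategy`` parameter."""
    if layout_sensitive:
        # hi_res uses the document layout to classify elements.
        return "hi_res"
    if Path(filename).suffix.lower() in IMAGE_SUFFIXES:
        # fast is not suited to image-based files, so use OCR instead.
        return "ocr_only"
    # Otherwise let the library decide from document characteristics.
    return "auto"


def partition_with_choice(filename: str, layout_sensitive: bool = False):
    # Imported lazily so choose_strategy works without unstructured installed.
    from unstructured.partition.auto import partition

    return partition(
        filename=filename,
        strategy=choose_strategy(filename, layout_sensitive),
    )
```

Keeping the choice in a plain function makes it easy to adjust the heuristic (for example, to prefer ``fast`` for text-based PDFs) without touching the call sites.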
**These strategies are available on the following partition bricks:**
+----------------------------+---------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| Document Type              | Partition Function  | Strategies                           | Table Support | Options                                                                                      |
+----------------------------+---------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| Images (``.png``/``.jpg``) | ``partition_image`` | "auto", "hi_res", "ocr_only"         | Yes           | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages; Strategy                |
+----------------------------+---------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| PDFs (``.pdf``)            | ``partition_pdf``   | "auto", "fast", "hi_res", "ocr_only" | Yes           | Encoding; Include Page Breaks; Infer Table Structure; Max Partition; OCR Languages; Strategy |
+----------------------------+---------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
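As a rough sketch of how the table might be used in practice, the snippet below transcribes the supported strategies into a lookup and checks a requested strategy before calling ``partition_pdf``. ``SUPPORTED_STRATEGIES``, ``check_strategy``, and ``partition_pdf_with_tables`` are hypothetical names introduced for illustration; the ``infer_table_structure`` keyword is assumed to correspond to the "Infer Table Structure" option listed above.

```python
# Strategy support per partition function, transcribed from the table above.
SUPPORTED_STRATEGIES = {
    "partition_image": {"auto", "hi_res", "ocr_only"},
    "partition_pdf": {"auto", "fast", "hi_res", "ocr_only"},
}


def check_strategy(function_name: str, strategy: str) -> str:
    """Hypothetical pre-flight check; raises ValueError on unsupported pairs."""
    try:
        supported = SUPPORTED_STRATEGIES[function_name]
    except KeyError:
        raise ValueError(f"unknown partition function: {function_name}") from None
    if strategy not in supported:
        raise ValueError(
            f"{function_name} does not support strategy {strategy!r}; "
            f"choose one of {sorted(supported)}"
        )
    return strategy


def partition_pdf_with_tables(filename: str, strategy: str = "hi_res"):
    # Lazy import so check_strategy stays usable without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(
        filename=filename,
        strategy=check_strategy("partition_pdf", strategy),
        infer_table_structure=True,  # assumed to map to "Infer Table Structure"
    )
```

Failing early with a clear message is a design choice here; the library may perform its own validation or fallback, which this sketch does not attempt to reproduce.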

Binary file not shown. Size: 217 KiB

@ -29,6 +29,9 @@ Library Documentation
:doc:`integrations`
We make it easy for you to connect your output with other popular ML services.
:doc:`best_practices`
Learn best practices to optimize document information extraction using the ``unstructured`` library.
.. Hidden TOCs
.. toctree::
@ -43,5 +46,5 @@ Library Documentation
upstream_connectors
metadata
examples
strategies
integrations
best_practices

@ -21,6 +21,7 @@ in our community `Slack. <https://join.slack.com/t/unstructuredw-kbe4326/shared_
upstream_connectors/gitlab
upstream_connectors/google_cloud_storage
upstream_connectors/google_drive
upstream_connectors/jira
upstream_connectors/local_connector
upstream_connectors/notion
upstream_connectors/onedrive

@ -1,5 +1,6 @@
Jira
==========
=====
Connect Jira to your preprocessing pipeline, and batch process all your documents using ``unstructured-ingest`` to store structured outputs locally on your filesystem.
First you'll need to install the Jira dependencies as shown here.