mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-08-18 05:37:53 +00:00
Add strategy documentation (#1353)
parent
915e4adcbb
commit
edc45013dc
15
docs/source/best_practices.rst
Normal file
@ -0,0 +1,15 @@
Best Practices
==============

Unstructured offers several strategies and models for extracting information from documents. These best practices provide guidelines for configuring the ``strategy`` and ``model`` parameters to optimize document information extraction.

High-level overview of the strategies and models available in the ``unstructured`` library:

.. image:: imgs/strategy.png
   :width: 1000
   :alt: Overview of the strategies and models available in the unstructured library

.. toctree::
   :maxdepth: 1

   best_practices/strategies
27
docs/source/best_practices/strategies.rst
Normal file
@ -0,0 +1,27 @@
Strategies
==========

The Unstructured library offers several ways to preprocess documents, selected with the ``strategy`` parameter.

**Basic usage:**

.. code:: python

    from unstructured.partition.auto import partition

    elements = partition(filename=filename, strategy="hi_res")

**Available options:**

* ``auto`` (default): chooses the partitioning strategy based on the document's characteristics and the function kwargs.
* ``fast``: uses traditional NLP extraction techniques to quickly pull all text elements. It is not well suited to image-based file types.
* ``hi_res``: identifies the document layout using ``detectron2`` and uses that layout to gain additional information about document elements. We recommend this strategy if your use case is highly sensitive to correct classification of document elements.
* ``ocr_only``: uses Optical Character Recognition (OCR) to extract text from image-based files.
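
The guidance in the list above can be condensed into a small selection helper. This is only a sketch: ``pick_strategy`` and its parameters are hypothetical names, not part of the ``unstructured`` API, and the rules are a simplified reading of the descriptions above.

```python
def pick_strategy(is_image_based, needs_precise_elements, prefer_speed=False):
    """Return a strategy name following the guidance in the list above.

    Hypothetical helper for illustration; not part of unstructured.
    """
    if needs_precise_elements:
        # "hi_res" uses the document layout to classify elements.
        return "hi_res"
    if is_image_based:
        # "fast" is not suited to image-based files, so rely on OCR.
        return "ocr_only"
    if prefer_speed:
        # Text-based document where speed matters most.
        return "fast"
    # Otherwise let the library choose from the document and kwargs.
    return "auto"


strategy = pick_strategy(is_image_based=True, needs_precise_elements=False)
print(strategy)
```

The returned string is meant to be passed as the ``strategy`` argument of ``partition``, as in the basic usage example.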

**These strategies are available on the following partition bricks:**

+----------------------------+---------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| Document Type              | Partition Function  | Strategies                           | Table Support | Options                                                                                      |
+============================+=====================+======================================+===============+==============================================================================================+
| Images (``.png``/``.jpg``) | ``partition_image`` | "auto", "hi_res", "ocr_only"         | Yes           | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages; Strategy                |
+----------------------------+---------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| PDFs (``.pdf``)            | ``partition_pdf``   | "auto", "fast", "hi_res", "ocr_only" | Yes           | Encoding; Include Page Breaks; Infer Table Structure; Max Partition; OCR Languages; Strategy |
+----------------------------+---------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
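
The table above can also be encoded as a lookup, so that an unsupported combination is caught before a document is processed. This is a minimal sketch: ``SUPPORTED_STRATEGIES`` and ``check_strategy`` are hypothetical names introduced here, and the values are copied from the table rather than read from the library.

```python
# Strategies listed for each partition function in the table above.
SUPPORTED_STRATEGIES = {
    "partition_image": ("auto", "hi_res", "ocr_only"),
    "partition_pdf": ("auto", "fast", "hi_res", "ocr_only"),
}


def check_strategy(partition_function, strategy):
    """Return `strategy` if the table lists it for `partition_function`.

    Raises ValueError for an unknown function or an unlisted strategy.
    Hypothetical helper for illustration; not part of unstructured.
    """
    if partition_function not in SUPPORTED_STRATEGIES:
        raise ValueError(f"no strategy information for {partition_function!r}")
    allowed = SUPPORTED_STRATEGIES[partition_function]
    if strategy not in allowed:
        raise ValueError(
            f"{strategy!r} is not listed for {partition_function}; "
            f"choose one of {', '.join(allowed)}"
        )
    return strategy
```

For example, ``check_strategy("partition_image", "fast")`` raises ``ValueError``, because the table does not list "fast" for images.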
|
BIN
docs/source/imgs/strategy.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 217 KiB |
@ -29,6 +29,9 @@ Library Documentation
:doc:`integrations`
  We make it easy for you to connect your output with other popular ML services.

:doc:`best_practices`
  Learn best practices to optimize document information extraction using the ``unstructured`` library.

.. Hidden TOCs

.. toctree::
@ -43,5 +46,5 @@ Library Documentation
   upstream_connectors
   metadata
   examples
   strategies
   integrations
   best_practices
@ -21,6 +21,7 @@ in our community `Slack. <https://join.slack.com/t/unstructuredw-kbe4326/shared_
   upstream_connectors/gitlab
   upstream_connectors/google_cloud_storage
   upstream_connectors/google_drive
   upstream_connectors/jira
   upstream_connectors/local_connector
   upstream_connectors/notion
   upstream_connectors/onedrive
@ -1,5 +1,6 @@
Jira
==========
=====

Connect Jira to your preprocessing pipeline, and batch process all your documents using ``unstructured-ingest`` to store structured outputs locally on your filesystem.

First you'll need to install the Jira dependencies as shown here.