diff --git a/docs/source/best_practices.rst b/docs/source/best_practices.rst
new file mode 100644
index 000000000..85dfd55ff
--- /dev/null
+++ b/docs/source/best_practices.rst
@@ -0,0 +1,15 @@
+Best Practices
+==============
+
+Unstructured offers a few strategies and models to extract document information. These best practices are intended to provide guidelines to configure the ``strategy`` and ``model`` configurations to optimize document information extraction.
+
+High-level overview of available strategies and models in ``Unstructured`` library:
+
+.. image:: imgs/strategy.png
+   :width: 1000
+   :alt: Alternative text
+
+.. toctree::
+   :maxdepth: 1
+
+   best_practices/strategies
diff --git a/docs/source/best_practices/strategies.rst b/docs/source/best_practices/strategies.rst
new file mode 100644
index 000000000..312da9d26
--- /dev/null
+++ b/docs/source/best_practices/strategies.rst
@@ -0,0 +1,27 @@
+Strategies
+==========
+
+The Unstructured library offers a variety of different ways to preprocess documents which can be specified by the “strategy” parameter.
+
+**Basic usage:**
+
+.. code:: python
+
+   elements = partition(filename=filename, strategy='hi_res')
+
+**Available options:**
+
+* ``auto`` (default strategy): The "auto" strategy will choose the partitioning strategy based on document characteristics and the function kwargs.
+* ``fast``: The “fast” strategy will leverage traditional NLP extraction techniques to quickly pull all the text elements. "Fast" strategy is not good for image based file types.
+* ``hi_res``: The "hi_res" strategy will identify the layout of the document using ``detectron2``. The advantage of “hi_res” is that it uses the document layout to gain additional information about document elements. We recommend using this strategy if your use case is highly sensitive to correct classifications for document elements.
+* ``ocr_only``: Leverage Optical Character Recognition to extract text from the image based files.
+
+**These strategies are available on the following partition bricks:**
+
++-------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
+| Document Type                             | Partition Function             | Strategies                             | Table Support  | Options                                                                                                          |
++-------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
+| Images (`.png`/`.jpg`)                    | `partition_image`              | "auto", "hi_res", "ocr_only"           | Yes            | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages, Strategy                                    |
++-------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
+| PDFs (`.pdf`)                             | `partition_pdf`                | "auto", "fast", "hi_res", "ocr_only"   | Yes            | Encoding; Include Page Breaks; Infer Table Structure; Max Partition; OCR Languages, Strategy                     |
++-------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
\ No newline at end of file
diff --git a/docs/source/downstream_connectors/pgvector.rst b/docs/source/downstream_connectors/pgvector.rst
deleted file mode 100644
index e69de29bb..000000000
diff --git a/docs/source/imgs/strategy.png b/docs/source/imgs/strategy.png
new file mode 100644
index 000000000..5c02df1d4
Binary files /dev/null and b/docs/source/imgs/strategy.png differ
diff --git a/docs/source/index.rst b/docs/source/index.rst
index dd5359820..dc4f7bfa0 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -29,6 +29,9 @@ Library Documentation
 :doc:`integrations`
   We make it easy for you to connect your output with other popular ML services.
 
+:doc:`best_practices`
+  Learn best practices to optimize document information extraction using ``unstructured`` library.
+
 .. Hidden TOCs
 
 .. toctree::
@@ -43,5 +46,5 @@ Library Documentation
    upstream_connectors
    metadata
    examples
-   strategies
    integrations
+   best_practices
\ No newline at end of file
diff --git a/docs/source/introductions_all.rst b/docs/source/introductions_all.rst
deleted file mode 100644
index e69de29bb..000000000
diff --git a/docs/source/upstream_connectors.rst b/docs/source/upstream_connectors.rst
index b7af1ae20..d78138e0e 100644
--- a/docs/source/upstream_connectors.rst
+++ b/docs/source/upstream_connectors.rst
@@ -21,6 +21,7 @@ in our community `Slack.