
Table Extraction from PDF
=========================

This section describes two methods for extracting tables from PDF files.

.. note::

    To extract tables from any document, set the ``strategy`` parameter to ``hi_res`` for both methods below.

Method 1: Using ``partition_pdf``
---------------------------------

To extract tables from PDF files using `partition_pdf <https://unstructured-io.github.io/unstructured/core/partition.html#partition-pdf>`__, set the ``infer_table_structure`` parameter to ``True`` and the ``strategy`` parameter to ``hi_res``.

**Usage**

.. code-block:: python

    from unstructured.partition.pdf import partition_pdf

    fname = "example-docs/layout-parser-paper.pdf"

    elements = partition_pdf(
        filename=fname,
        infer_table_structure=True,
        strategy="hi_res",
    )

    tables = [el for el in elements if el.category == "Table"]

    print(tables[0].text)
    print(tables[0].metadata.text_as_html)
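
The ``text_as_html`` value printed above is an HTML string, so a common follow-up step is turning it into rows of cell text. The sketch below is one possible way to do that with only the Python standard library. The names ``TableRowParser`` and ``html_table_to_rows`` are hypothetical helpers, not part of ``unstructured``, and ``sample_html`` is made-up input standing in for ``tables[0].metadata.text_as_html``.

.. code-block:: python

    from html.parser import HTMLParser


    class TableRowParser(HTMLParser):
        """Collect the cell text of each <tr> in an HTML table (hypothetical helper)."""

        def __init__(self):
            super().__init__()
            self.rows = []
            self._row = None   # cells of the row currently being read
            self._cell = None  # text fragments of the cell currently being read

        def handle_starttag(self, tag, attrs):
            if tag == "tr":
                self._row = []
            elif tag in ("td", "th") and self._row is not None:
                self._cell = []

        def handle_data(self, data):
            if self._cell is not None:
                self._cell.append(data)

        def handle_endtag(self, tag):
            if tag in ("td", "th") and self._cell is not None:
                self._row.append("".join(self._cell).strip())
                self._cell = None
            elif tag == "tr" and self._row is not None:
                self.rows.append(self._row)
                self._row = None
                self._cell = None


    def html_table_to_rows(html):
        """Return the table as a list of rows, each a list of cell strings."""
        parser = TableRowParser()
        parser.feed(html)
        parser.close()
        return parser.rows


    # Made-up sample; in practice pass tables[0].metadata.text_as_html instead.
    sample_html = (
        "<table><tr><th>Model</th><th>Score</th></tr>"
        "<tr><td>A</td><td>0.9</td></tr></table>"
    )
    rows = html_table_to_rows(sample_html)
    for row in rows:
        print(row)

This sketch flattens nested tables into the enclosing row and ignores ``rowspan`` and ``colspan``, so treat it as a starting point for simple tables only.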

Method 2: Using Auto Partition or Unstructured API
--------------------------------------------------

Table extraction is enabled by default for all file types. To extract tables from PDFs and images using `Auto Partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or the `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__, set the ``strategy`` parameter to ``hi_res``.


**Usage: Auto Partition**

.. code-block:: python

    from unstructured.partition.auto import partition

    filename = "example-docs/layout-parser-paper.pdf"

    elements = partition(
        filename=filename,
        strategy="hi_res",
    )

    tables = [el for el in elements if el.category == "Table"]

    print(tables[0].text)
    print(tables[0].metadata.text_as_html)
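
Indexing ``tables[0]`` raises ``IndexError`` when no table is detected in the document. A more defensive pattern, sketched below, loops over every table element and tolerates a missing ``text_as_html`` value. ``summarize_tables`` is a hypothetical helper name, and the ``SimpleNamespace`` objects are stand-ins used only so the example runs without a PDF; real elements come from ``partition``.

.. code-block:: python

    from types import SimpleNamespace


    def summarize_tables(elements):
        """Return (text, html) pairs for every element whose category is "Table"."""
        summaries = []
        for el in elements:
            if getattr(el, "category", None) != "Table":
                continue
            metadata = getattr(el, "metadata", None)
            # html is None when the element carries no HTML representation.
            html = getattr(metadata, "text_as_html", None)
            summaries.append((el.text, html))
        return summaries


    # Stand-in objects for illustration only; real elements come from partition().
    fake_elements = [
        SimpleNamespace(category="Title", text="Results", metadata=SimpleNamespace()),
        SimpleNamespace(
            category="Table",
            text="a b",
            metadata=SimpleNamespace(text_as_html="<table></table>"),
        ),
    ]

    for text, html in summarize_tables(fake_elements):
        print(text)
        print(html if html is not None else "(no HTML available)")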


**Usage: API Parameters**

.. code-block:: bash

      curl -X 'POST' \
          'https://api.unstructured.io/general/v0/general' \
          -H 'accept: application/json' \
          -H 'Content-Type: multipart/form-data' \
          -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
          -F 'strategy=hi_res' \
          | jq -C . | less -R
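
The JSON printed by the command above can also be filtered in Python instead of ``jq``. The sketch below assumes the response body is a JSON list of element objects, each with a ``type`` key and an optional ``metadata.text_as_html`` value; check the actual response of your API version before relying on that shape. ``tables_from_response`` is a hypothetical helper and ``sample_response`` is made-up data, not real API output.

.. code-block:: python

    import json


    def tables_from_response(response_text):
        """Return the element dicts whose "type" is "Table" (assumed response shape)."""
        elements = json.loads(response_text)
        return [el for el in elements if el.get("type") == "Table"]


    # Made-up payload for illustration; a real one would be the saved curl output.
    sample_response = json.dumps([
        {"type": "Title", "text": "Results", "metadata": {}},
        {"type": "Table", "text": "a b", "metadata": {"text_as_html": "<table></table>"}},
    ])

    for table in tables_from_response(sample_response):
        print(table["text"])
        print(table.get("metadata", {}).get("text_as_html"))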