unstructured/docs/source/best_practices/table_extraction_pdf.rst
Filip Knefel bdfd975115
chore: change table extraction defaults (#2588)
Change default values for table extraction - works in pair with
[this](https://github.com/Unstructured-IO/unstructured-api/pull/370)
`unstructured-api` PR

We want to move away from `pdf_infer_table_structure` parameter, in this
PR:
- We change how it's treated wrt `skip_infer_table_types` parameter.
Whether to extract tables from pdf now follows from the rule:
`pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set it to `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]` by default
- We remove it from the examples in documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in
documentation

More detailed description of how we want parameters to interact
- if `pdf_infer_table_structure` is False tables will never extracted
from pdf
- if `pdf_infer_table_structure` is True tables will be extracted from
pdf unless it's skipped via `skip_infer_table_types`
- on default `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-22 10:08:49 +00:00

68 lines
2.2 KiB
ReStructuredText

Table Extraction from PDF
=========================
This section describes two methods for extracting tables from PDF files.
.. note::
To extract tables from any documents, set the ``strategy`` parameter to ``hi_res`` for both methods below.
Method 1: Using `partition_pdf`
-------------------------------
To extract the tables from PDF files using the `partition_pdf <https://unstructured-io.github.io/unstructured/core/partition.html#partition-pdf>`__, set the ``infer_table_structure`` parameter to ``True`` and ``strategy`` parameter to ``hi_res``.
**Usage**
.. code-block:: python
from unstructured.partition.pdf import partition_pdf
fname = "example-docs/layout-parser-paper.pdf"
elements = partition_pdf(filename=fname,
infer_table_structure=True,
strategy='hi_res',
)
tables = [el for el in elements if el.category == "Table"]
print(tables[0].text)
print(tables[0].metadata.text_as_html)
Method 2: Using Auto Partition or Unstructured API
--------------------------------------------------
By default, table extraction from all file types is enabled. To extract tables from PDFs and images using `Auto Partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__ simply set ``strategy`` parameter to ``hi_res``.
**Usage: Auto Partition**
.. code-block:: python
from unstructured.partition.auto import partition
filename = "example-docs/layout-parser-paper.pdf"
elements = partition(filename=filename,
strategy='hi_res',
)
tables = [el for el in elements if el.category == "Table"]
print(tables[0].text)
print(tables[0].metadata.text_as_html)
**Usage: API Parameters**
.. code-block:: bash
curl -X 'POST' \
'https://api.unstructured.io/general/v0/general' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'strategy=hi_res' \
| jq -C . | less -R