Table Extraction from PDF
=========================

This section describes two methods for extracting tables from PDF files.

.. note::

    To extract tables from any document with either method below, set the ``strategy`` parameter to ``hi_res``.

Method 1: Using `partition_pdf`
-------------------------------

To extract tables from PDF files using `partition_pdf <https://unstructured-io.github.io/unstructured/core/partition.html#partition-pdf>`__, set the ``infer_table_structure`` parameter to ``True`` and the ``strategy`` parameter to ``hi_res``.

**Usage**

.. code-block:: python

    from unstructured.partition.pdf import partition_pdf

    fname = "example-docs/layout-parser-paper.pdf"

    # Both settings are needed for table extraction, as described above
    elements = partition_pdf(
        filename=fname,
        infer_table_structure=True,
        strategy="hi_res",
    )

    # Keep only the elements classified as tables
    tables = [el for el in elements if el.category == "Table"]

    # Plain-text and HTML renderings of the first table
    print(tables[0].text)
    print(tables[0].metadata.text_as_html)

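
The ``text_as_html`` value printed above is an HTML string. As a minimal sketch, assuming it holds a single flat ``<table>`` made of ``<tr>`` rows with ``<td>`` or ``<th>`` cells, it can be turned into a list of rows with the standard library alone. The ``TableRowParser`` class, the ``html_table_to_rows`` function, and the sample string are hypothetical names and data used for illustration; they are not part of ``unstructured``.

.. code-block:: python

    from html.parser import HTMLParser


    class TableRowParser(HTMLParser):
        """Collect the cell text of each <tr> (hypothetical helper)."""

        def __init__(self):
            super().__init__()
            self.rows = []     # completed rows
            self._row = None   # cells of the row being parsed
            self._cell = None  # text fragments of the cell being parsed

        def handle_starttag(self, tag, attrs):
            if tag == "tr":
                self._row = []
            elif tag in ("td", "th") and self._row is not None:
                self._cell = []

        def handle_data(self, data):
            if self._cell is not None:
                self._cell.append(data)

        def handle_endtag(self, tag):
            if tag in ("td", "th") and self._cell is not None:
                self._row.append("".join(self._cell).strip())
                self._cell = None
            elif tag == "tr" and self._row is not None:
                self.rows.append(self._row)
                self._row = None


    def html_table_to_rows(html):
        """Return a list of rows, each a list of cell strings."""
        parser = TableRowParser()
        parser.feed(html)
        parser.close()
        return parser.rows


    # Made-up sample standing in for tables[0].metadata.text_as_html
    sample_html = (
        "<table><tr><th>Name</th><th>Value</th></tr>"
        "<tr><td>a</td><td>1</td></tr></table>"
    )
    rows = html_table_to_rows(sample_html)
    print(rows)  # [['Name', 'Value'], ['a', '1']]

Nested tables and ``colspan``/``rowspan`` attributes are not handled by this sketch.
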

Method 2: Using Auto Partition or Unstructured API
--------------------------------------------------

Table extraction is enabled by default for all file types. To extract tables from PDFs and images using `Auto Partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or the `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__, set the ``strategy`` parameter to ``hi_res``.

**Usage: Auto Partition**

.. code-block:: python

    from unstructured.partition.auto import partition

    filename = "example-docs/layout-parser-paper.pdf"

    # Table extraction is on by default; only the strategy needs to be set
    elements = partition(
        filename=filename,
        strategy="hi_res",
    )

    # Keep only the elements classified as tables
    tables = [el for el in elements if el.category == "Table"]

    # Plain-text and HTML renderings of the first table
    print(tables[0].text)
    print(tables[0].metadata.text_as_html)

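
Indexing ``tables[0]`` raises ``IndexError`` when no table is detected in the document. The following is a minimal sketch of a more defensive loop. It assumes only the ``category``, ``text``, and ``metadata.text_as_html`` attributes used above; the ``summarize_tables`` function is a hypothetical helper, and the ``SimpleNamespace`` objects are made-up stand-ins for the elements that ``partition`` returns.

.. code-block:: python

    from types import SimpleNamespace


    def summarize_tables(elements):
        """Return (text, html) pairs for every element whose category is "Table"."""
        summaries = []
        for el in elements:
            if el.category != "Table":
                continue
            # getattr guards against the HTML rendering being absent
            html = getattr(el.metadata, "text_as_html", None)
            summaries.append((el.text, html))
        return summaries


    # Hypothetical stand-ins for the elements returned by partition()
    elements = [
        SimpleNamespace(category="Title", text="Results", metadata=SimpleNamespace()),
        SimpleNamespace(
            category="Table",
            text="a 1",
            metadata=SimpleNamespace(
                text_as_html="<table><tr><td>a</td><td>1</td></tr></table>"
            ),
        ),
    ]

    found = summarize_tables(elements)
    if not found:
        print("No tables found; check that strategy='hi_res' was set.")
    for text, html in found:
        print(text)
        print(html if html is not None else "(no HTML rendering)")
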

**Usage: API Parameters**

.. code-block:: bash

    curl -X 'POST' \
        'https://api.unstructured.io/general/v0/general' \
        -H 'accept: application/json' \
        -H 'Content-Type: multipart/form-data' \
        -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
        -F 'strategy=hi_res' \
        | jq -C . | less -R

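
The request above returns JSON. As a rough sketch, assuming the response body is a JSON array of element objects with ``type``, ``text``, and ``metadata`` keys, the table elements can be picked out as shown below. The ``tables_from_response`` function is a hypothetical helper, and the response body is a made-up, abbreviated sample rather than real API output.

.. code-block:: python

    import json

    # Made-up, abbreviated response body; real responses carry more fields
    response_body = """
    [
      {"type": "Title", "text": "Results", "metadata": {}},
      {"type": "Table", "text": "a 1",
       "metadata": {"text_as_html": "<table><tr><td>a</td><td>1</td></tr></table>"}}
    ]
    """


    def tables_from_response(body):
        """Return the element dicts whose type is "Table"."""
        elements = json.loads(body)
        return [el for el in elements if el.get("type") == "Table"]


    for table in tables_from_response(response_body):
        print(table["text"])
        # .get() returns None when no HTML rendering is present
        print(table["metadata"].get("text_as_html"))
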