unstructured/docs/source/apis/api_parameters.rst

API Parameters
==============

The API endpoint accepts several parameters that customize how documents are processed. Each parameter is described below:

files
-----
- **Type**: string (binary format)
- **Description**: The file to extract.
- **Required**: true
- **Example**: File to be partitioned. `Example File <https://github.com/Unstructured-IO/unstructured/blob/98d3541909f64290b5efb65a226fc3ee8a7cc5ee/example-docs/layout-parser-paper.pdf>`_

strategy
--------
- **Type**: string
- **Description**: The strategy to use for partitioning PDF and image files. Options are fast, hi_res, and auto. Default: auto.
- **Example**: hi_res

gz_uncompressed_content_type
-----------------------------
- **Type**: string
- **Description**: If file is gzipped, use this content type after unzipping.
- **Example**: application/pdf
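
As a minimal sketch of preparing a gzipped upload, the snippet below compresses a file's bytes with Python's standard gzip module and records the original content type in the gz_uncompressed_content_type field. The helper name gzip_upload and the placeholder bytes are hypothetical; how the two values are attached to an actual request depends on the HTTP client you use.

```python
import gzip
from typing import Dict, Tuple


def gzip_upload(raw: bytes, content_type: str) -> Tuple[bytes, Dict[str, str]]:
    """Hypothetical helper: compress the file and build the matching form field."""
    compressed = gzip.compress(raw)
    # Tell the API which content type to assume once the payload is unzipped.
    fields = {"gz_uncompressed_content_type": content_type}
    return compressed, fields


# Placeholder bytes stand in for a real PDF read from disk.
compressed, fields = gzip_upload(b"%PDF-1.4 placeholder bytes", "application/pdf")
```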

output_format
-------------
- **Type**: string
- **Description**: The format of the response. Supported formats are application/json and text/csv. Default: application/json.
- **Example**: application/json

coordinates
-----------
- **Type**: boolean
- **Description**: If true, return coordinates for each element. Default: false.

encoding
--------
- **Type**: string
- **Description**: The encoding method used to decode the text input. Default: utf-8.
- **Example**: utf-8

hi_res_model_name
-----------------
- **Type**: string
- **Description**: The name of the inference model used when strategy is hi_res.
- **Example**: yolox

include_page_breaks
-------------------
- **Type**: boolean
- **Description**: If true, the output includes page breaks when the file type supports them. Default: false.

languages
---------
- **Type**: array
- **Description**: The languages present in the document, for use in partitioning and/or OCR.
- **Default**: []
- **Example**: [eng]

pdf_infer_table_structure
-------------------------
- **Type**: boolean
- **Description**: Deprecated. Use skip_infer_table_types to opt out of table extraction for any file type. If false and strategy=hi_res, no Table elements are extracted from PDF files, regardless of the contents of skip_infer_table_types.

skip_infer_table_types
----------------------
- **Type**: array
- **Description**: The document types for which table extraction is skipped. Default: ['pdf', 'jpg', 'png', 'heic'].
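
The interaction between pdf_infer_table_structure and skip_infer_table_types for PDF files can be sketched as a small predicate. This is an illustration of the rule as described above, assuming strategy=hi_res; the function name pdf_tables_enabled is hypothetical and is not part of the API.

```python
from typing import Sequence


def pdf_tables_enabled(
    pdf_infer_table_structure: bool,
    skip_infer_table_types: Sequence[str],
) -> bool:
    """Hypothetical predicate: are tables extracted from a PDF under hi_res?"""
    # The deprecated flag wins: if it is false, tables are never extracted.
    if not pdf_infer_table_structure:
        return False
    # Otherwise extraction happens unless "pdf" is explicitly skipped.
    return "pdf" not in skip_infer_table_types


# Opting in to PDF tables by removing "pdf" from the skip list.
enabled = pdf_tables_enabled(True, ["jpg", "png", "heic"])
```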

xml_keep_tags
-------------
- **Type**: boolean
- **Description**: If true, retains the XML tags in the output. Otherwise, extracts only the text within the tags. Only applies to partition_xml.

chunking_strategy
-----------------
- **Type**: string
- **Description**: Use one of the supported strategies to chunk the returned elements. Currently supports: by_title.
- **Example**: by_title

multipage_sections
------------------
- **Type**: boolean
- **Description**: If chunking strategy is set, determines whether sections can span multiple pages. Default: true.

combine_under_n_chars
---------------------
- **Type**: integer
- **Description**: If chunking strategy is set, combine elements until a section reaches a length of n chars. Default: 500.
- **Example**: 500

new_after_n_chars
-----------------
- **Type**: integer
- **Description**: If chunking strategy is set, cut off new sections after reaching a length of n chars (soft max). Default: 1500.
- **Example**: 1500

max_characters
--------------
- **Type**: integer
- **Description**: If chunking strategy is set, cut off new sections after reaching a length of n chars (hard max). Default: 1500.
- **Example**: 1500
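
To illustrate the difference between the soft max (new_after_n_chars) and the hard max (max_characters), here is a toy splitter over plain strings. It is a simplified sketch and not the algorithm the API uses: it ignores titles and combine_under_n_chars, and the function name split_section is hypothetical.

```python
from typing import List


def split_section(
    elements: List[str],
    new_after_n_chars: int = 1500,
    max_characters: int = 1500,
) -> List[str]:
    """Toy illustration: pack element texts into chunks under both limits."""
    chunks: List[str] = []
    current = ""
    for text in elements:
        # Hard max: an oversized element is cut into pieces that each fit.
        for start in range(0, len(text), max_characters):
            piece = text[start:start + max_characters]
            joined = f"{current} {piece}" if current else piece
            if current and len(joined) > max_characters:
                # Adding the piece would break the hard max, so close the chunk.
                chunks.append(current)
                current = piece
            else:
                current = joined
            if len(current) >= new_after_n_chars:
                # Soft max reached: close the chunk and start a new one.
                chunks.append(current)
                current = ""
    if current:
        chunks.append(current)
    return chunks


elements = ["a" * 40, "b" * 40, "c" * 40]
soft = split_section(elements, new_after_n_chars=30, max_characters=100)
hard = split_section(elements, new_after_n_chars=100, max_characters=100)
```

With a low soft max, every element closes its own chunk. With the soft max equal to the hard max, elements are packed together until the next one would push the chunk past the hard limit.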

extract_image_block_types
-------------------------
- **Type**: array
- **Description**: The types of image blocks to extract from the document. Supports various Element types.
- **Example**: ['Image', 'Table']
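
The types and options listed on this page can be collected into a small client-side check that runs before a request is sent. The sketch below is an assumption-laden illustration, not part of the API or its SDKs: the names PARAM_TYPES, ALLOWED_VALUES, and validate_params are hypothetical, the files parameter is omitted because it is the binary upload itself, and the server's own validation may differ.

```python
from typing import Any, Dict, List

# Python types corresponding to the "Type" field of each parameter above.
PARAM_TYPES = {
    "strategy": str,
    "gz_uncompressed_content_type": str,
    "output_format": str,
    "coordinates": bool,
    "encoding": str,
    "hi_res_model_name": str,
    "include_page_breaks": bool,
    "languages": list,
    "pdf_infer_table_structure": bool,
    "skip_infer_table_types": list,
    "xml_keep_tags": bool,
    "chunking_strategy": str,
    "multipage_sections": bool,
    "combine_under_n_chars": int,
    "new_after_n_chars": int,
    "max_characters": int,
    "extract_image_block_types": list,
}

# Parameters whose descriptions list a closed set of options.
ALLOWED_VALUES = {
    "strategy": {"fast", "hi_res", "auto"},
    "output_format": {"application/json", "text/csv"},
    "chunking_strategy": {"by_title"},
}


def validate_params(params: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means no issues."""
    problems: List[str] = []
    for name, value in params.items():
        expected = PARAM_TYPES.get(name)
        if expected is None:
            problems.append(f"{name}: unknown parameter")
            continue
        # bool is a subclass of int in Python, so reject it for integer fields.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(f"{name}: expected {expected.__name__}")
        elif name in ALLOWED_VALUES and value not in ALLOWED_VALUES[name]:
            problems.append(f"{name}: unsupported value {value!r}")
    return problems


problems = validate_params(
    {"strategy": "hi_res", "coordinates": True, "languages": ["eng"]}
)
```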