API Parameters
==============

The API endpoint accepts several parameters that customize how documents are processed. The parameters are described below.

files
-----

- **Type**: string (binary format)
- **Description**: The file to extract.
- **Required**: true
- **Example**: File to be partitioned. `Example File `_

strategy
--------

- **Type**: string
- **Description**: The strategy to use for partitioning PDF and image files. Options are fast, hi_res, and auto. Default: auto.
- **Example**: hi_res

gz_uncompressed_content_type
----------------------------

- **Type**: string
- **Description**: If the file is gzipped, use this content type after unzipping.
- **Example**: application/pdf

output_format
-------------

- **Type**: string
- **Description**: The format of the response. Supported formats are application/json and text/csv. Default: application/json.
- **Example**: application/json

coordinates
-----------

- **Type**: boolean
- **Description**: If true, return coordinates for each element. Default: false.

encoding
--------

- **Type**: string
- **Description**: The encoding method used to decode the text input. Default: utf-8.
- **Example**: utf-8

extract_image_block_types
-------------------------

- **Type**: array
- **Description**: The types of image blocks to extract from the document. Supports various Element types.
- **Example**: ['Image', 'Table']

hi_res_model_name
-----------------

- **Type**: string
- **Description**: The name of the inference model used when strategy is hi_res.
- **Example**: yolox

include_page_breaks
-------------------

- **Type**: boolean
- **Description**: When true, the output includes page break elements when the filetype supports it. Default: false.

languages
---------

- **Type**: array
- **Description**: The languages present in the document, for use in partitioning and/or OCR.
- **Default**: []
- **Example**: [eng]

pdf_infer_table_structure
-------------------------

- **Type**: boolean
- **Description**: Deprecated!
  Use skip_infer_table_types to opt out of table extraction for any file type. If False and strategy=hi_res, no Table Elements will be extracted from PDF files, regardless of the contents of skip_infer_table_types.

skip_infer_table_types
----------------------

- **Type**: array
- **Description**: The document types for which table extraction is skipped. Default: ['pdf', 'jpg', 'png', 'heic'].

xml_keep_tags
-------------

- **Type**: boolean
- **Description**: If True, the XML tags are retained in the output. Otherwise, only the text within the tags is extracted. Only applies to partition_xml.

Chunking Parameters
-------------------

The following parameters control chunking behavior. Chunking is performed automatically after partitioning when a value is provided for the ``chunking_strategy`` argument. The remaining chunking parameters take effect only when a chunking strategy is specified. Not all chunking parameters apply to all chunking strategies; any chunking arguments not supported by the selected chunker are ignored.

chunking_strategy
-----------------

- **Type**: string
- **Description**: Use one of the supported strategies to chunk the returned elements. When omitted, no chunking is performed and any other chunking parameters provided are ignored.
- **Valid values**: ``"basic"``, ``"by_title"``

combine_under_n_chars
---------------------

- **Type**: integer
- **Applicable Chunkers**: "by_title" only
- **Description**: When the chunking strategy is set to "by_title", combine small chunks until the combined chunk reaches a length of n chars. This can mitigate the appearance of small chunks that arise in certain documents when short paragraphs, not intended as section headings, are identified as ``Title`` elements.
- **Default**: the same value as ``max_characters``
- **Example**: 500

include_orig_elements
---------------------

- **Type**: boolean
- **Applicable Chunkers**: All
- **Description**: Add the elements used to form each chunk to ``.metadata.orig_elements`` for that chunk. These can be used to recover the original text and metadata for individual elements when that is required, for example to identify the page numbers or coordinates spanned by a chunk. When an element larger than ``max_characters`` is divided into two or more chunks via text-splitting, each of those chunks will contain the entire original element as the only item in its ``.metadata.orig_elements`` list.
- **Default**: true

max_characters
--------------

- **Type**: integer
- **Applicable Chunkers**: All
- **Description**: When a chunking strategy is set, cut off new chunks after reaching a length of n chars (hard max).
- **Default**: 500

multipage_sections
------------------

- **Type**: boolean
- **Applicable Chunkers**: "by_title" only
- **Description**: When true and the chunking strategy is set to "by_title", allows a chunk to include elements from more than one page. Otherwise, chunks are broken on page boundaries.
- **Default**: true

new_after_n_chars
-----------------

- **Type**: integer
- **Applicable Chunkers**: "basic", "by_title"
- **Description**: When a chunking strategy is set, cut off a new chunk after reaching a length of n chars (soft max).
- **Default**: 1500
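
The applicability rules for the chunking parameters can be easier to follow as code. The following is a minimal client-side sketch, not part of any official client: the helper name ``effective_params`` and the constant names are hypothetical, and the sets are transcribed from the "Applicable Chunkers" entries in this section. The API applies these rules on its own; the sketch only shows which of the supplied parameters would take effect.

```python
# Hypothetical helper illustrating the chunking-parameter rules described above.
# The names below are assumptions for illustration, not part of the API.

# All parameters documented under "Chunking Parameters" except chunking_strategy.
CHUNKING_PARAMS = {
    "combine_under_n_chars",
    "include_orig_elements",
    "max_characters",
    "multipage_sections",
    "new_after_n_chars",
}

# Parameters documented as applicable to the "by_title" chunker only.
BY_TITLE_ONLY = {"combine_under_n_chars", "multipage_sections"}

# Valid values documented for chunking_strategy.
VALID_STRATEGIES = {"basic", "by_title"}


def effective_params(params):
    """Return the subset of ``params`` that takes effect under the documented rules."""
    strategy = params.get("chunking_strategy")

    if strategy is None:
        # No strategy: no chunking is performed, so all other chunking
        # parameters are ignored.
        return {k: v for k, v in params.items() if k not in CHUNKING_PARAMS}

    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unsupported chunking_strategy: {strategy!r}")

    if strategy != "by_title":
        # Arguments not supported by the selected chunker are ignored.
        return {k: v for k, v in params.items() if k not in BY_TITLE_ONLY}

    return dict(params)


request_params = {
    "strategy": "hi_res",
    "chunking_strategy": "basic",
    "max_characters": 500,
    "multipage_sections": False,  # "by_title" only, so ignored for "basic"
}
print(effective_params(request_params))
# {'strategy': 'hi_res', 'chunking_strategy': 'basic', 'max_characters': 500}
```

Filtering on the client is optional, since unsupported chunking arguments are ignored anyway, but it makes it explicit which settings influence the response.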