Haystack Bot cb8082e196
Sync Core Integrations API reference (paddleocr) on Docusaurus (#10216)
Co-authored-by: anakin87 <44616784+anakin87@users.noreply.github.com>
2025-12-10 11:13:04 +01:00

---
title: "PaddleOCR"
id: integrations-paddleocr
description: "PaddleOCR integration for Haystack"
slug: "/integrations-paddleocr"
---
<a id="haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter"></a>
## Module haystack\_integrations.components.converters.paddleocr.paddleocr\_vl\_document\_converter
<a id="haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter.PaddleOCRVLDocumentConverter"></a>
### PaddleOCRVLDocumentConverter
This component extracts text from documents using PaddleOCR's large-model
document parsing API, which runs PaddleOCR-VL behind the scenes.
For more information, see the
[PaddleOCR-VL documentation](https://www.paddleocr.ai/latest/en/version3.x/algorithm/PaddleOCR-VL/PaddleOCR-VL.html).
**Usage Example:**
```python
from haystack.utils import Secret
from haystack_integrations.components.converters.paddleocr import (
    PaddleOCRVLDocumentConverter,
)
converter = PaddleOCRVLDocumentConverter(
    api_url="http://xxxxx.aistudio-app.com/layout-parsing",
    access_token=Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
)
result = converter.run(sources=["sample.pdf"])
documents = result["documents"]
raw_responses = result["raw_paddleocr_responses"]
```
<a id="haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter.PaddleOCRVLDocumentConverter.__init__"></a>
#### PaddleOCRVLDocumentConverter.\_\_init\_\_
```python
def __init__(
    *,
    api_url: str,
    access_token: Secret = Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
    file_type: Optional[FileTypeInput] = None,
    use_doc_orientation_classify: Optional[bool] = None,
    use_doc_unwarping: Optional[bool] = None,
    use_layout_detection: Optional[bool] = None,
    use_chart_recognition: Optional[bool] = None,
    layout_threshold: Optional[Union[float, dict]] = None,
    layout_nms: Optional[bool] = None,
    layout_unclip_ratio: Optional[Union[float, tuple[float, float], dict]] = None,
    layout_merge_bboxes_mode: Optional[Union[str, dict]] = None,
    prompt_label: Optional[str] = None,
    format_block_content: Optional[bool] = None,
    repetition_penalty: Optional[float] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    min_pixels: Optional[int] = None,
    max_pixels: Optional[int] = None,
    prettify_markdown: Optional[bool] = None,
    show_formula_number: Optional[bool] = None,
    visualize: Optional[bool] = None,
    additional_params: Optional[dict[str, Any]] = None)
```
Create a `PaddleOCRVLDocumentConverter` component.
**Arguments**:
- `api_url`: API URL. To obtain the API URL, visit the [PaddleOCR official
website](https://aistudio.baidu.com/paddleocr/task), click the
**API** button in the upper-left corner, choose the example code
for **Large Model document parsing(PaddleOCR-VL)**, and copy the
`API_URL`.
- `access_token`: AI Studio access token. You can obtain it from [this
page](https://aistudio.baidu.com/account/accessToken).
- `file_type`: File type of the input. Use "pdf" for PDF files or
"image" for image files. If `None` (the default), the file type is
inferred from the file extension.
- `use_doc_orientation_classify`: Whether to enable the document orientation classification
function. Enabling this feature allows the input image to be
automatically rotated to the correct orientation.
- `use_doc_unwarping`: Whether to enable the text image unwarping function. Enabling
this feature allows automatic correction of distorted text images.
- `use_layout_detection`: Whether to enable the layout detection function.
- `use_chart_recognition`: Whether to enable the chart recognition function.
- `layout_threshold`: Layout detection threshold. Can be a float or a dict with
page-specific thresholds.
- `layout_nms`: Whether to perform NMS (Non-Maximum Suppression) on layout
detection results.
- `layout_unclip_ratio`: Layout unclip ratio. Can be a float, a tuple of (min, max), or a
dict with page-specific values.
- `layout_merge_bboxes_mode`: Layout merge bounding boxes mode. Can be a string or a dict.
- `prompt_label`: Prompt type for the VLM. Possible values are "ocr", "formula",
"table", and "chart".
- `format_block_content`: Whether to format block content.
- `repetition_penalty`: Repetition penalty parameter used in VLM sampling.
- `temperature`: Temperature parameter used in VLM sampling.
- `top_p`: Top-p parameter used in VLM sampling.
- `min_pixels`: Minimum number of pixels allowed during VLM preprocessing.
- `max_pixels`: Maximum number of pixels allowed during VLM preprocessing.
- `prettify_markdown`: Whether to prettify the output Markdown text.
- `show_formula_number`: Whether to include formula numbers in the output Markdown text.
- `visualize`: Whether to return visualization results.
- `additional_params`: Additional parameters for calling the PaddleOCR API.
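The `file_type` fallback described above can be sketched in plain Python. This is only an illustration of extension-based inference, not the integration's actual code: the helper name `infer_file_type` and the extension sets are assumptions made for this example.

```python
from pathlib import Path
from typing import Optional

# Assumed extension sets, for illustration only; the integration's real list may differ.
PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def infer_file_type(path: str, file_type: Optional[str] = None) -> str:
    """Return "pdf" or "image", preferring an explicit file_type over the extension."""
    if file_type is not None:
        # An explicit value wins, but it must be one of the two documented options.
        if file_type not in ("pdf", "image"):
            raise ValueError(f"Unsupported file_type: {file_type!r}")
        return file_type
    # No explicit value: fall back to the (case-insensitive) file extension.
    suffix = Path(path).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    raise ValueError(f"Cannot infer file type from extension: {suffix!r}")
```

Passing `file_type` explicitly is the safer choice when a source has no meaningful extension, for example an in-memory ByteStream.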
<a id="haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter.PaddleOCRVLDocumentConverter.to_dict"></a>
#### PaddleOCRVLDocumentConverter.to\_dict
```python
def to_dict() -> dict[str, Any]
```
Serialize the component to a dictionary.
**Returns**:
Dictionary with serialized data.
<a id="haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter.PaddleOCRVLDocumentConverter.from_dict"></a>
#### PaddleOCRVLDocumentConverter.from\_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "PaddleOCRVLDocumentConverter"
```
Deserialize the component from a dictionary.
**Arguments**:
- `data`: Dictionary to deserialize from.
**Returns**:
Deserialized component.
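The round trip between `to_dict` and `from_dict` can be pictured with a small stand-in class. This is a minimal sketch, assuming the serialized dictionary uses a `type` string plus an `init_parameters` mapping; `SketchConverter` is hypothetical and is not the real `PaddleOCRVLDocumentConverter`, which also has to serialize its `access_token` secret.

```python
from typing import Any, Optional


class SketchConverter:
    """Hypothetical stand-in used only to illustrate the round trip."""

    def __init__(self, *, api_url: str, file_type: Optional[str] = None):
        self.api_url = api_url
        self.file_type = file_type

    def to_dict(self) -> dict[str, Any]:
        # Assumed layout: fully qualified class name plus the init arguments.
        return {
            "type": f"{type(self).__module__}.{type(self).__qualname__}",
            "init_parameters": {
                "api_url": self.api_url,
                "file_type": self.file_type,
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "SketchConverter":
        # Rebuild the instance from the stored init arguments.
        return cls(**data["init_parameters"])


# Placeholder URL, not a real endpoint.
original = SketchConverter(api_url="http://example.invalid/layout-parsing", file_type="pdf")
restored = SketchConverter.from_dict(original.to_dict())
```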
<a id="haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter.PaddleOCRVLDocumentConverter.run"></a>
#### PaddleOCRVLDocumentConverter.run
```python
@component.output_types(documents=list[Document],
                        raw_paddleocr_responses=list[dict[str, Any]])
def run(
    sources: list[Union[str, Path, ByteStream]],
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None
) -> dict[str, Any]
```
Convert image or PDF files to Documents.
**Arguments**:
- `sources`: List of image or PDF file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single
dictionary. If it's a single dictionary, its content is added to
the metadata of all produced Documents. If it's a list, the length
of the list must match the number of sources, because the two
lists will be zipped. If `sources` contains ByteStream objects,
their `meta` will be added to the output Documents.
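The two accepted shapes of `meta` can be illustrated with a short plain-Python sketch. The helper name `expand_meta` is hypothetical and this is not the component's own code; it only mirrors the rule stated above, that a single dictionary is applied to every source and a list must match the number of sources.

```python
from typing import Any, Optional, Union


def expand_meta(
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]],
    sources_count: int,
) -> list[dict[str, Any]]:
    """Return one metadata dict per source, following the rule described above."""
    if meta is None:
        # No metadata supplied: every source gets an empty dict.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied so that each source gets its own instance.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        # A list is zipped with the sources, so the lengths must agree.
        raise ValueError(
            f"Got {len(meta)} meta entries for {sources_count} sources."
        )
    return [dict(entry) for entry in meta]
```

For example, `expand_meta({"lang": "en"}, 2)` yields two separate `{"lang": "en"}` dictionaries, while a three-item list paired with two sources raises a `ValueError`.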
**Returns**:
A dictionary with the following keys:
- `documents`: A list of created Documents.
- `raw_paddleocr_responses`: A list of raw PaddleOCR API responses.