mirror of
https://github.com/deepset-ai/haystack.git
synced 2026-01-19 02:26:49 +00:00
---
title: "PaddleOCR"
id: integrations-paddleocr
description: "PaddleOCR integration for Haystack"
slug: "/integrations-paddleocr"
---

<a id="haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter"></a>

## Module haystack\_integrations.components.converters.paddleocr.paddleocr\_vl\_document\_converter

<a id="haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter.PaddleOCRVLDocumentConverter"></a>

### PaddleOCRVLDocumentConverter

This component extracts text from documents using PaddleOCR's large model
document parsing API.

PaddleOCR-VL is used behind the scenes. For more information, please
refer to:
https://www.paddleocr.ai/latest/en/version3.x/algorithm/PaddleOCR-VL/PaddleOCR-VL.html

**Usage Example:**

```python
from haystack.utils import Secret
from haystack_integrations.components.converters.paddleocr import (
    PaddleOCRVLDocumentConverter,
)

converter = PaddleOCRVLDocumentConverter(
    api_url="http://xxxxx.aistudio-app.com/layout-parsing",
    access_token=Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
)

result = converter.run(sources=["sample.pdf"])

documents = result["documents"]
raw_responses = result["raw_paddleocr_responses"]
```

<a id="haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter.PaddleOCRVLDocumentConverter.__init__"></a>

#### PaddleOCRVLDocumentConverter.\_\_init\_\_

```python
def __init__(
    *,
    api_url: str,
    access_token: Secret = Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
    file_type: Optional[FileTypeInput] = None,
    use_doc_orientation_classify: Optional[bool] = None,
    use_doc_unwarping: Optional[bool] = None,
    use_layout_detection: Optional[bool] = None,
    use_chart_recognition: Optional[bool] = None,
    layout_threshold: Optional[Union[float, dict]] = None,
    layout_nms: Optional[bool] = None,
    layout_unclip_ratio: Optional[Union[float, tuple[float, float],
                                        dict]] = None,
    layout_merge_bboxes_mode: Optional[Union[str, dict]] = None,
    prompt_label: Optional[str] = None,
    format_block_content: Optional[bool] = None,
    repetition_penalty: Optional[float] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    min_pixels: Optional[int] = None,
    max_pixels: Optional[int] = None,
    prettify_markdown: Optional[bool] = None,
    show_formula_number: Optional[bool] = None,
    visualize: Optional[bool] = None,
    additional_params: Optional[dict[str, Any]] = None)
```

Create a `PaddleOCRVLDocumentConverter` component.

**Arguments**:

- `api_url`: API URL. To obtain the API URL, visit the
  [PaddleOCR official website](https://aistudio.baidu.com/paddleocr/task), click the
  **API** button in the upper-left corner, choose the example code
  for **Large Model document parsing(PaddleOCR-VL)**, and copy the
  `API_URL`.
- `access_token`: AI Studio access token. You can obtain it from
  [this page](https://aistudio.baidu.com/account/accessToken).
- `file_type`: File type. Can be "pdf" for PDF files, "image" for
  image files, or `None` for auto-detection. If not specified, the
  file type will be inferred from the file extension.
- `use_doc_orientation_classify`: Whether to enable the document orientation classification
  function. Enabling this feature allows the input image to be
  automatically rotated to the correct orientation.
- `use_doc_unwarping`: Whether to enable the text image unwarping function. Enabling
  this feature allows automatic correction of distorted text images.
- `use_layout_detection`: Whether to enable the layout detection function.
- `use_chart_recognition`: Whether to enable the chart recognition function.
- `layout_threshold`: Layout detection threshold. Can be a float or a dict with
  page-specific thresholds.
- `layout_nms`: Whether to perform NMS (Non-Maximum Suppression) on layout
  detection results.
- `layout_unclip_ratio`: Layout unclip ratio. Can be a float, a tuple of (min, max), or a
  dict with page-specific values.
- `layout_merge_bboxes_mode`: Layout merge bounding boxes mode. Can be a string or a dict.
- `prompt_label`: Prompt type for the VLM. Possible values are "ocr", "formula",
  "table", and "chart".
- `format_block_content`: Whether to format block content.
- `repetition_penalty`: Repetition penalty parameter used in VLM sampling.
- `temperature`: Temperature parameter used in VLM sampling.
- `top_p`: Top-p parameter used in VLM sampling.
- `min_pixels`: Minimum number of pixels allowed during VLM preprocessing.
- `max_pixels`: Maximum number of pixels allowed during VLM preprocessing.
- `prettify_markdown`: Whether to prettify the output Markdown text.
- `show_formula_number`: Whether to include formula numbers in the output markdown text.
- `visualize`: Whether to return visualization results.
- `additional_params`: Additional parameters for calling the PaddleOCR API.

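
The `file_type` fallback (an explicit value wins, otherwise the file extension decides) can be pictured with a small standalone sketch. The helper name `infer_file_type` and the set of image extensions are assumptions made for illustration; the component's actual detection logic may differ.

```python
from pathlib import Path
from typing import Optional

# Assumed set of image extensions, chosen for illustration only.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".webp"}


def infer_file_type(path: str, file_type: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: return "pdf" or "image" for a given path.

    An explicit `file_type` takes precedence over the extension.
    """
    if file_type is not None:
        return file_type
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    # Unknown extension: leave the decision to the caller.
    return None


print(infer_file_type("sample.pdf"))
print(infer_file_type("scan.PNG"))
print(infer_file_type("upload.bin", file_type="pdf"))
```

Passing `file_type` explicitly is likely the safer choice when sources have missing or misleading extensions.
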

<a id="haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter.PaddleOCRVLDocumentConverter.to_dict"></a>

#### PaddleOCRVLDocumentConverter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serialize the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter.PaddleOCRVLDocumentConverter.from_dict"></a>

#### PaddleOCRVLDocumentConverter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "PaddleOCRVLDocumentConverter"
```

Deserialize the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

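
A minimal sketch of the round trip that a `to_dict`/`from_dict` pair enables, using a stand-in class so it runs without credentials or network access. The `ToyConverter` class and the `type`/`init_parameters` layout are assumptions for illustration; inspect the output of `to_dict()` on the real component to see its actual structure.

```python
from typing import Any, Optional


class ToyConverter:
    """Hypothetical stand-in that mimics a to_dict/from_dict pair."""

    def __init__(self, *, api_url: str, temperature: Optional[float] = None):
        self.api_url = api_url
        self.temperature = temperature

    def to_dict(self) -> dict[str, Any]:
        # Store the class path and the constructor arguments (assumed layout).
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "api_url": self.api_url,
                "temperature": self.temperature,
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ToyConverter":
        # Rebuild the instance from the stored constructor arguments.
        return cls(**data["init_parameters"])


original = ToyConverter(api_url="http://example.invalid/layout-parsing", temperature=0.1)
restored = ToyConverter.from_dict(original.to_dict())
print(restored.api_url, restored.temperature)
```
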

<a id="haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter.PaddleOCRVLDocumentConverter.run"></a>

#### PaddleOCRVLDocumentConverter.run

```python
@component.output_types(documents=list[Document],
                        raw_paddleocr_responses=list[dict[str, Any]])
def run(
    sources: list[Union[str, Path, ByteStream]],
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None
) -> dict[str, Any]
```

Convert image or PDF files to Documents.

**Arguments**:

- `sources`: List of image or PDF file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single
  dictionary. If it's a single dictionary, its content is added to
  the metadata of all produced Documents. If it's a list, the length
  of the list must match the number of sources, because the two
  lists will be zipped. If `sources` contains ByteStream objects,
  their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: A list of created Documents.
- `raw_paddleocr_responses`: A list of raw PaddleOCR API responses.

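
The `meta` rules for `run` (a single dictionary is applied to every source, a list must match the sources one-to-one) can be sketched as a small standalone function. The name `normalize_meta` is hypothetical and this is not the component's own code; it only illustrates the behaviour described in the arguments.

```python
from typing import Any, Optional, Union


def normalize_meta(
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]],
    sources_count: int,
) -> list[dict[str, Any]]:
    """Hypothetical helper: expand `meta` into one dictionary per source."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary applies to every source; copy it so that
        # later per-document changes do not leak between entries.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The length of `meta` must match the number of sources.")
    return list(meta)


sources = ["invoice.pdf", "receipt.png"]
per_source_meta = normalize_meta({"batch": "demo"}, len(sources))
# Zip sources with their metadata, as the description of `meta` implies.
for source, source_meta in zip(sources, per_source_meta):
    print(source, source_meta)
```

Checking the lengths up front matters because `zip` would otherwise silently drop the unmatched sources or metadata entries.
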