Docs: parser behavior change (#11176)

### What problem does this PR solve?


### Type of change


- [x] Documentation Update
This commit is contained in:
writinwaters 2025-11-11 21:10:06 +08:00 committed by GitHub
parent a15f522dc9
commit 2c727a4a9c
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
5 changed files with 185 additions and 45 deletions

View File

@ -24,14 +24,6 @@ A guide explaining how to build a RAGFlow Docker image from its source code. By
## Build a Docker image
<Tabs
defaultValue="without"
values={[
{label: 'Build a Docker image without embedding models', value: 'without'},
{label: 'Build a Docker image including embedding models', value: 'including'}
]}>
<TabItem value="without">
This image is approximately 2 GB in size and relies on external LLM and embedding services.
:::danger IMPORTANT
@ -47,10 +39,6 @@ docker build -f Dockerfile.deps -t infiniflow/ragflow_deps .
docker build -f Dockerfile -t infiniflow/ragflow:nightly .
```
</TabItem>
</Tabs>
## Launch a RAGFlow Service from Docker for MacOS
After building the infiniflow/ragflow:nightly image, you are ready to launch a fully-functional RAGFlow service with all the required components, such as Elasticsearch, MySQL, MinIO, Redis, and more.

View File

@ -514,11 +514,11 @@ See [here](./guides/agent/best_practices/accelerate_agent_question_answering.md)
### How to use MinerU to parse PDF documents?
MinerU PDF document parsing is available starting from v0.21.1. RAGFlow supports MinerU (>= 2.6.3) as an optional PDF parser with multiple backends. RAGFlow itself only acts as a client: it calls MinerU to parse documents, reads the output files, and ingests the parsed content into RAGFlow. To use this feature, follow these steps:
MinerU PDF document parsing is available starting from v0.22.0. RAGFlow supports MinerU (>= 2.6.3) as an optional PDF parser with multiple backends. RAGFlow acts only as a client for MinerU, calling it to parse documents, reading the output files, and ingesting the parsed content. To use this feature, follow these steps:
1. **Prepare MinerU**
1. Prepare MinerU
- **If you run RAGFlow from source**, install MinerU into an isolated virtual environment (recommended path: `$HOME/uv_tools`):
- **If you deploy RAGFlow from source**, install MinerU into an isolated virtual environment (recommended path: `$HOME/uv_tools`):
```bash
mkdir -p "$HOME/uv_tools"
@ -530,7 +530,7 @@ MinerU PDF document parsing is available starting from v0.21.1. RAGFlow supports
# uv pip install -U "mineru[all]" -i https://mirrors.aliyun.com/pypi/simple
```
- **If you run RAGFlow with Docker**, you usually only need to turn on MinerU support in `docker/.env`:
- **If you deploy RAGFlow with Docker**, you usually only need to turn on MinerU support in `docker/.env`:
```bash
# docker/.env
@ -541,7 +541,7 @@ MinerU PDF document parsing is available starting from v0.21.1. RAGFlow supports
Enabling `USE_MINERU=true` will internally perform the same setup as the manual configuration (including setting the MinerU executable path and related environment variables). You only need the manual installation above if you are running from source or want full control over the MinerU installation.
2. **Start RAGFlow with MinerU enabled**
2. Start RAGFlow with MinerU enabled:
- **Source deployment** in the RAGFlow repo, export the key MinerU-related variables and start the backend service:
@ -570,7 +570,7 @@ MinerU PDF document parsing is available starting from v0.21.1. RAGFlow supports
### How to configure MinerU-specific settings?
The table below summarizes the most commonly used MinerU-related environment variables:
The table below summarizes the most frequently used MinerU environment variables:
| Environment variable | Description | Default | Example |
| ---------------------- | ---------------------------------- | ----------------------------------- | ----------------------------------------------------------------------------------------------- |
@ -583,14 +583,14 @@ The table below summarizes the most commonly used MinerU-related environment var
1. Set `MINERU_EXECUTABLE` to the path to the MinerU executable if the default `mineru` is not on `PATH`.
2. Set `MINERU_DELETE_OUTPUT` to `0` to keep MinerU's output. (Default: `1`, which deletes temporary output.)
3. Set `MINERU_OUTPUT_DIR` to specify the output directory for MinerU (otherwise a system temp directory is used).
3. Set `MINERU_OUTPUT_DIR` to specify the output directory for MinerU; otherwise, a system temp directory is used.
4. Set `MINERU_BACKEND` to specify a parsing backend:
- `"pipeline"` (default): The traditional multimodel pipeline.
- `"vlm-transformers"`: A vision-language model using HuggingFace Transformers.
- `"vlm-vllm-engine"`: A vision-language model using a local vLLM engine (requires a local GPU).
- `"vlm-http-client"`: A vision-language model via HTTP client to a remote vLLM server (RAGFlow only requires CPU).
5. If using the `"vlm-http-client"` backend, you must also set `MINERU_SERVER_URL` to the URL of your vLLM server.
6. If you want RAGFlow to call a **remote MinerU service** (instead of a MinerU process running locally with RAGFlow), set `MINERU_APISERVER` to the URL of the remote MinerU server.
5. If using the `"vlm-http-client"` backend, you must also set `MINERU_SERVER_URL` to your vLLM server's URL.
6. If configuring RAGFlow to call a *remote* MinerU service, set `MINERU_APISERVER` to the MinerU server's URL.
:::tip NOTE
For information about other environment variables natively supported by MinerU, see [here](https://opendatalab.github.io/MinerU/usage/cli_tools/#environment-variables-description).

View File

@ -9,9 +9,132 @@ A component that sets the parsing rules for your dataset.
---
A **Parser** component defines how various file types should be parsed, including parsing methods for PDFs , fields to parse for Emails, and OCR methods for images.
A **Parser** component is auto-populated on the ingestion pipeline canvas and required in all ingestion pipeline workflows. Just like the **Extract** stage in the traditional ETL process, a **Parser** component in an ingestion pipeline defines how various file types are parsed into structured data. Click the component to display its configuration panel. In this configuration panel, you set the parsing rules for various file types.
## Configurations
## Scenario
Within the configuration panel, you can add multiple parsers and set the corresponding parsing rules or remove unwanted parsers. Please ensure your set of parsers covers all required file types; otherwise, an error would occur when you select this ingestion pipeline on your dataset's **Files** page.
A **Parser** component is auto-populated on the ingestion pipeline canvas and required in all ingestion pipeline workflows.
The **Parser** component supports parsing the following file types:
| File type | File format |
| ------------- | ------------------------ |
| PDF | PDF |
| Spreadsheet | XLSX, XLS, CSV |
| Image | PNG, JPG, JPEG, GIF, TIF |
| Email | EML |
| Text & Markup | TXT, MD, MDX, HTML, JSON |
| Word | DOCX |
| PowerPoint | PPTX, PPT |
| Audio | MP3, WAV |
| Video | MP4, AVI, MKV |
### PDF parser
The output of a PDF parser is `json`. In the PDF parser, you select the parsing method that works best with your PDFs.
- DeepDoc: (Default) The default visual model performing OCR, TSR, and DLR tasks on complex PDFs, but can be time-consuming.
- Naive: Skip OCR, TSR, and DLR tasks if *all* your PDFs are plain text.
- [MinerU](https://github.com/opendatalab/MinerU): (Experimental) An open-source tool that converts PDF into machine-readable formats.
- [Docling](https://github.com/docling-project/docling): (Experimental) An open-source document processing tool for gen AI.
- A third-party visual model from a specific model provider.
:::danger IMPORTANG
MinerU PDF document parsing is available starting from v0.22.0. RAGFlow supports MinerU (>= 2.6.3) as an optional PDF parser with multiple backends. RAGFlow acts only as a client for MinerU, calling it to parse documents, reading the output files, and ingesting the parsed content. To use this feature, follow these steps:
1. Prepare MinerU:
- **If you deploy RAGFlow from source**, install MinerU into an isolated virtual environment (recommended path: `$HOME/uv_tools`):
```bash
mkdir -p "$HOME/uv_tools"
cd "$HOME/uv_tools"
uv venv .venv
source .venv/bin/activate
uv pip install -U "mineru[core]" -i https://mirrors.aliyun.com/pypi/simple
# or
# uv pip install -U "mineru[all]" -i https://mirrors.aliyun.com/pypi/simple
```
- **If you deploy RAGFlow with Docker**, you usually only need to turn on MinerU support in `docker/.env`:
```bash
# docker/.env
...
USE_MINERU=true
...
```
Enabling `USE_MINERU=true` will internally perform the same setup as the manual configuration (including setting the MinerU executable path and related environment variables). You only need the manual installation above if you are running from source or want full control over the MinerU installation.
2. Start RAGFlow with MinerU enabled:
- **Source deployment** in the RAGFlow repo, export the key MinerU-related variables and start the backend service:
```bash
# in RAGFlow repo
export MINERU_EXECUTABLE="$HOME/uv_tools/.venv/bin/mineru"
export MINERU_DELETE_OUTPUT=0 # keep output directory
export MINERU_BACKEND=pipeline # or another backend you prefer
source .venv/bin/activate
export PYTHONPATH=$(pwd)
bash docker/launch_backend_service.sh
```
- **Docker deployment** after setting `USE_MINERU=true`, restart the containers so that the new settings take effect:
```bash
# in RAGFlow repo
docker compose -f docker/docker-compose.yml restart
```
3. Restart the ragflow-server.
:::
:::caution WARNING
Third-party visual models are marked **Experimental**, because we have not fully tested these models for the aforementioned data extraction tasks.
:::
### Spreadsheet parser
A spreadsheet parser outputs `html`, preserving the original layout and table structure. You may remove this parser if your dataset contains no spreadsheets.
### Image parser
An Image parser uses a native OCR model for text extraction by default. You may select an alternative VLM model, provided that you have properly configured it on the **Model provider** page.
### Email parser
With the Email parser, you select the fields to parse from Emails, such as **subject** and **body**. The parser will then extract text from these specified fields.
### Text&Markup parser
A Text&Markup parser automatically removes all formatting tags (e.g., those from HTML and Markdown files) to output clean, plain text only.
### Word parser
A Word parser outputs `json`, preserving the original document structure information, including titles, paragraphs, tables, headers, and footers.
### PowerPoint (PPT) parser
A PowerPoint parser extracts content from PowerPoint files into `json`, processing each slide individually and distinguishing between its title, body text, and notes.
### Audio parser
An Audio parser transcribes audio files to text. To use this parser, you must first configure an ASR model on the **Model provider** page.
### Video parser
A Video parser transcribes video files to text. To use this parser, you must first configure a VLM model on the **Model provider** page.
## Output
The global variable names for the output of the **Parser** component, which can be referenced by subsequent components in the ingestion pipeline.
| Variable name | Type |
| ------------- | ------------------------ |
| `markdown` | `string` |
| `text` | `string` |
| `html` | `string` |
| `json` | `Array<Object>` |

View File

@ -76,13 +76,8 @@ You can also change a file's chunking method on the **Files** page.
An embedding model converts chunks into embeddings. It cannot be changed once the dataset has chunks. To switch to a different embedding model, you must delete all existing chunks in the dataset. The obvious reason is that we *must* ensure that files in a specific dataset are converted to embeddings using the *same* embedding model (ensure that they are compared in the same embedding space).
The following embedding models can be deployed locally:
- BAAI/bge-large-zh-v1.5
- maidalun1020/bce-embedding-base_v1
:::danger IMPORTANT
These two embedding models are optimized specifically for English and Chinese, so performance may be compromised if you use them to embed documents in other languages.
Some embedding models are optimized for specific languages, so performance may be compromised if you use them to embed documents in other languages.
:::
### Upload file

View File

@ -33,27 +33,61 @@ RAGFlow isn't one-size-fits-all. It is built for flexibility and supports deeper
2. Select the option that works best with your scenario:
- DeepDoc: (Default) The default visual model performing OCR, TSR, and DLR tasks on PDFs, which can be time-consuming.
- DeepDoc: (Default) The default visual model performing OCR, TSR, and DLR tasks on PDFs, but can be time-consuming.
- Naive: Skip OCR, TSR, and DLR tasks if *all* your PDFs are plain text.
- MinerU: An experimental feature.
- A third-party visual model provided by a specific model provider.
- [MinerU](https://github.com/opendatalab/MinerU): (Experimental) An open-source tool that converts PDF into machine-readable formats.
- [Docling](https://github.com/docling-project/docling): (Experimental) An open-source document processing tool for gen AI.
- A third-party visual model from a specific model provider.
:::danger IMPORTANG
MinerU PDF document parsing is available starting from v0.21.1. To use this feature, follow these steps:
MinerU PDF document parsing is available starting from v0.22.0. RAGFlow supports MinerU (>= 2.6.3) as an optional PDF parser with multiple backends. RAGFlow acts only as a client for MinerU, calling it to parse documents, reading the output files, and ingesting the parsed content. To use this feature, follow these steps:
1. Before deploying ragflow-server, update your **docker/.env** file:
- Enable `HF_ENDPOINT=https://hf-mirror.com`
- Add a MinerU entry: `MINERU_EXECUTABLE=/ragflow/uv_tools/.venv/bin/mineru`
1. Prepare MinerU:
2. Start the ragflow-server and run the following commands inside the container:
- **If you deploy RAGFlow from source**, install MinerU into an isolated virtual environment (recommended path: `$HOME/uv_tools`):
```bash
mkdir uv_tools
cd uv_tools
uv venv .venv
source .venv/bin/activate
uv pip install -U "mineru[core]" -i https://mirrors.aliyun.com/pypi/simple
```
```bash
mkdir -p "$HOME/uv_tools"
cd "$HOME/uv_tools"
uv venv .venv
source .venv/bin/activate
uv pip install -U "mineru[core]" -i https://mirrors.aliyun.com/pypi/simple
# or
# uv pip install -U "mineru[all]" -i https://mirrors.aliyun.com/pypi/simple
```
- **If you deploy RAGFlow with Docker**, you usually only need to turn on MinerU support in `docker/.env`:
```bash
# docker/.env
...
USE_MINERU=true
...
```
Enabling `USE_MINERU=true` will internally perform the same setup as the manual configuration (including setting the MinerU executable path and related environment variables). You only need the manual installation above if you are running from source or want full control over the MinerU installation.
2. Start RAGFlow with MinerU enabled:
- **Source deployment** in the RAGFlow repo, export the key MinerU-related variables and start the backend service:
```bash
# in RAGFlow repo
export MINERU_EXECUTABLE="$HOME/uv_tools/.venv/bin/mineru"
export MINERU_DELETE_OUTPUT=0 # keep output directory
export MINERU_BACKEND=pipeline # or another backend you prefer
source .venv/bin/activate
export PYTHONPATH=$(pwd)
bash docker/launch_backend_service.sh
```
- **Docker deployment** after setting `USE_MINERU=true`, restart the containers so that the new settings take effect:
```bash
# in RAGFlow repo
docker compose -f docker/docker-compose.yml restart
```
3. Restart the ragflow-server.
4. In the web UI, navigate to the **Configuration** page of your dataset. Click **Built-in** in the **Ingestion pipeline** section, select a chunking method from the **Built-in** dropdown, which supports PDF parsing, and slect **MinerU** in **PDF parser**.