MinerU is a powerful open-source tool for extracting high-quality structured data from PDF, image, and Office documents. It provides the following features:
- Text extraction while preserving document structure (headings, paragraphs, lists, etc.)
- Handling complex layouts including multi-column formats
- Automatic formula recognition and conversion to LaTeX format
- Image, table, and footnote extraction
- Automatic scanned document detection and OCR application
- Support for multiple output formats (Markdown, JSON)
### Installation
#### Installing MinerU Dependencies
If you have already installed LightRAG without MinerU support, you can add it by installing the `magic-pdf` package directly, which pulls in the MinerU-related dependencies required by LightRAG.
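To confirm that the installation worked, you can check whether the package is importable. The following is a minimal sketch; it assumes the `magic-pdf` distribution installs a top-level module named `magic_pdf`, and the helper name `has_module` is purely illustrative.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# Assumption: the magic-pdf distribution exposes the import name "magic_pdf".
print("MinerU support available:", has_module("magic_pdf"))
```

Using `find_spec` avoids actually importing the package, so the check stays fast even when heavy dependencies are installed.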
#### MinerU Model Weights
MinerU requires model weight files to function properly. After installation, you need to download the required model weights. You can use either Hugging Face or ModelScope to download the models.
Both methods will automatically download the model files and configure the model directory in the configuration file. The configuration file is located in your user directory and named `magic-pdf.json`.
> **Note for Windows users**: The user directory is `C:\Users\username`
>
> **Note for Linux users**: The user directory is `/home/username`
>
> **Note for macOS users**: The user directory is `/Users/username`
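Since the location differs per platform, it can be convenient to resolve it programmatically. The sketch below locates `magic-pdf.json` in the current user's directory and loads it if present. The function name `load_mineru_config` is hypothetical, and no particular key names are assumed, because they depend on the installed MinerU version.

```python
import json
from pathlib import Path


def load_mineru_config(home=None):
    """Return the parsed magic-pdf.json from the user directory, or None if it is missing."""
    base_dir = Path(home) if home is not None else Path.home()
    config_path = base_dir / "magic-pdf.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as fh:
        return json.load(fh)


config = load_mineru_config()
if config is None:
    print("magic-pdf.json not found; download the model weights first")
else:
    # Key names vary between MinerU versions, so only list what is there.
    print("Top-level settings:", sorted(config))
```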
#### Optional: LibreOffice Installation
To process Office documents (DOC, DOCX, PPT, PPTX), you need to install LibreOffice:
**Linux/macOS:**
```bash
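sudo apt-get install libreoffice   # Debian/Ubuntu
sudo yum install libreoffice       # RHEL/CentOS
brew install --cask libreoffice    # macOS (Homebrew)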
```
**Windows:**
1. Install LibreOffice
2. Add the installation directory to your PATH: `install_dir\LibreOffice\program`
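After installing, you may want to verify that LibreOffice is reachable from your PATH. This is a rough sketch: it assumes the executable is named `soffice` or `libreoffice`, which is typical for LibreOffice installs, and it does not reflect how MinerU itself locates or invokes LibreOffice. The helper `find_libreoffice` is hypothetical.

```python
import shutil


def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the path of the first LibreOffice executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


exe = find_libreoffice()
if exe:
    print("LibreOffice found at:", exe)
else:
    print("LibreOffice not found on PATH")
```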
In RAGAnything, you can directly use file paths as input to the `process_document_complete` method to process documents. Here's a complete configuration example:
```python
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
```
7. If you encounter a "filename too long" error, the latest version of `MineruParser` includes logic to handle it automatically.
#### Updating Existing Models
If you have previously downloaded models and need to update them, you can simply run the download script again. The script will update the model directory to the latest version.
### Advanced Configuration
The MinerU configuration file `magic-pdf.json` supports various customization options.
For the complete list of configuration options, refer to the [MinerU official documentation](https://mineru.readthedocs.io/).
### Using Modal Processors Directly
You can also use LightRAG's modal processors directly without going through MinerU. This is useful when you want to process specific types of content or have more control over the processing pipeline.
Each modal processor returns a tuple containing:
1. A description of the processed content
2. Entity information that can be used for further processing or storage
The processors support different types of content:
- `ImageModalProcessor`: Processes images with captions and footnotes
- `TableModalProcessor`: Processes tables with captions and footnotes
- `EquationModalProcessor`: Processes mathematical equations in LaTeX format
- `GenericModalProcessor`: A base processor that can be extended for custom content types
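The two-element return value described above can be consumed with ordinary tuple unpacking. The sketch below uses a hypothetical stand-in function, `fake_processor`, rather than LightRAG's real processor classes; the input dictionary shape and the `entity_name` / `entity_type` keys are assumptions made for illustration only.

```python
from typing import Any, Dict, Tuple


def fake_processor(content: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical stand-in for a modal processor; not LightRAG's real API."""
    caption = content.get("caption", "n/a")
    description = f"{content['type']} with caption: {caption}"
    entity_info = {"entity_name": caption, "entity_type": content["type"]}
    return description, entity_info


# Unpack the (description, entity information) pair in one step.
description, entity_info = fake_processor(
    {"type": "table", "caption": "Quarterly revenue"}
)
print(description)
print(entity_info["entity_type"])
```

Keeping the description and the entity information separate lets you store or post-process each part independently.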
> **Note**: A complete working example can be found in `examples/modalprocessors_example.py`. You can run it using: