---
title: "DocumentPreprocessor"
id: documentpreprocessor
slug: "/documentpreprocessor"
description: "Divides a list of text documents into a list of shorter text documents and then makes them more readable by cleaning."
---
# DocumentPreprocessor
Divides a list of text documents into a list of shorter text documents and then makes them more readable by cleaning.
| | |
| --- | --- |
| **Most common position in a pipeline** | In indexing pipelines after [Converters](../converters.mdx)  |
| **Mandatory run variables** | "documents": A list of documents |
| **Output variables** | "documents": A list of split and cleaned documents |
| **API reference** | [PreProcessors](/reference/preprocessors-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/document_preprocessor.py |
## Overview
`DocumentPreprocessor` first splits and then cleans documents.
It is a SuperComponent that combines a `DocumentSplitter` and a `DocumentCleaner` into a single component.
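Conceptually, running a `DocumentPreprocessor` with default settings behaves like a small pipeline that splits first and cleans second. The sketch below illustrates that equivalent wiring with the two underlying components; it is only an illustration of what the SuperComponent bundles, not its internal implementation.

```python
from haystack import Document, Pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

# Rough equivalent of what DocumentPreprocessor bundles:
# split the documents first, then clean the resulting chunks.
pipeline = Pipeline()
pipeline.add_component("splitter", DocumentSplitter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.connect("splitter", "cleaner")

result = pipeline.run(data={"splitter": {"documents": [Document(content="I love pizza!")]}})
print(result["cleaner"]["documents"])
```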
### Parameters
The `DocumentPreprocessor` exposes all initialization parameters of the underlying `DocumentSplitter` and `DocumentCleaner`, all of which are optional. For a detailed description of each parameter, see the respective documentation pages; a short configuration sketch follows the list below:
- [DocumentSplitter](documentsplitter.mdx)
- [DocumentCleaner](documentcleaner.mdx)
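For example, you can set splitting and cleaning behavior directly on the combined component. This is a minimal sketch: the parameter names used here (`split_by`, `split_length`, `split_overlap`, `remove_empty_lines`, `remove_extra_whitespaces`) are assumed to be inherited from the underlying `DocumentSplitter` and `DocumentCleaner`, so check their pages for the exact names and defaults.

```python
from haystack.components.preprocessors import DocumentPreprocessor

# Configure both halves at once (assumed parameter names from the
# underlying DocumentSplitter and DocumentCleaner):
preprocessor = DocumentPreprocessor(
    split_by="word",                # splitter: split documents by word count
    split_length=150,               # splitter: target chunk size
    split_overlap=20,               # splitter: overlap between consecutive chunks
    remove_empty_lines=True,        # cleaner: drop empty lines
    remove_extra_whitespaces=True,  # cleaner: collapse repeated whitespace
)
```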
## Usage
### On its own
```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")

preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])

# "documents" contains the split and cleaned Document objects
print(result["documents"])
```
### In a pipeline
You can use the `DocumentPreprocessor` in your indexing pipeline. The example below requires installing additional dependencies for the `MultiFileConverter`:
```shell
pip install pypdf markdown-it-py mdit_plain trafilatura python-pptx python-docx jq openpyxl tabulate pandas
```
```python
from haystack import Pipeline
from haystack.components.converters import MultiFileConverter
from haystack.components.preprocessors import DocumentPreprocessor
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

# Convert the source files, split and clean them, then write them to the store
pipeline = Pipeline()
pipeline.add_component("converter", MultiFileConverter())
pipeline.add_component("preprocessor", DocumentPreprocessor())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "preprocessor")
pipeline.connect("preprocessor", "writer")

result = pipeline.run(data={"sources": ["test.txt", "test.pdf"]})
print(result)
## {'writer': {'documents_written': 3}}
```