---
title: "DocumentPreprocessor"
id: documentpreprocessor
slug: "/documentpreprocessor"
description: "Divides a list of text documents into a list of shorter text documents and then makes them more readable by cleaning."
---

# DocumentPreprocessor

Divides a list of text documents into a list of shorter text documents and then makes them more readable by cleaning.

|                                        |                                                                                                             |
| -------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| **Most common position in a pipeline** | In indexing pipelines after [Converters](../converters.mdx)                                                 |
| **Mandatory run variables**            | `documents`: A list of documents                                                                            |
| **Output variables**                   | `documents`: A list of split and cleaned documents                                                          |
| **API reference**                      | [PreProcessors](/reference/preprocessors-api)                                                               |
| **GitHub link**                        | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/document_preprocessor.py |

## Overview

`DocumentPreprocessor` first splits and then cleans documents. It is a SuperComponent that combines a `DocumentSplitter` and a `DocumentCleaner` into a single component.

### Parameters

The `DocumentPreprocessor` exposes all initialization parameters of the underlying `DocumentSplitter` and `DocumentCleaner`, and all of them are optional. For a detailed description of each parameter, see the respective documentation pages:

- [DocumentSplitter](documentsplitter.mdx)
- [DocumentCleaner](documentcleaner.mdx)

## Usage

### On its own

```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")

# With no arguments, the splitter and cleaner use their default settings
preprocessor = DocumentPreprocessor()

result = preprocessor.run(documents=[doc])
print(result["documents"])
```

### In a pipeline

You can use the `DocumentPreprocessor` in your indexing pipeline. The example below requires installing additional dependencies for the `MultiFileConverter`:

```shell
pip install pypdf markdown-it-py mdit_plain trafilatura python-pptx python-docx jq openpyxl tabulate pandas
```

```python
from haystack import Pipeline
from haystack.components.converters import MultiFileConverter
from haystack.components.preprocessors import DocumentPreprocessor
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

# Convert files to documents, split and clean them, then write them to the store
pipeline = Pipeline()
pipeline.add_component("converter", MultiFileConverter())
pipeline.add_component("preprocessor", DocumentPreprocessor())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "preprocessor")
pipeline.connect("preprocessor", "writer")

result = pipeline.run(data={"sources": ["test.txt", "test.pdf"]})

print(result)
## {'writer': {'documents_written': 3}}
```
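
As a rough mental model of the split-then-clean order described in the Overview, the sketch below reproduces the idea in plain Python, without Haystack. It is not the library's implementation: `split_by_word`, `clean`, and `preprocess` are hypothetical helpers written for illustration, and the real `DocumentSplitter` and `DocumentCleaner` support many more options. The parameter names `split_length` and `split_overlap` are borrowed here only as labels for chunk size and overlap.

```python
import re


def split_by_word(text: str, split_length: int, split_overlap: int = 0) -> list[str]:
    """Hypothetical helper: cut text into chunks of at most `split_length`
    space-separated words, repeating `split_overlap` words between chunks."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split(" ")
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current chunk already reaches the end of the text
        if start + split_length >= len(words):
            break
    return chunks


def clean(text: str) -> str:
    """Hypothetical helper: collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def preprocess(texts: list[str], split_length: int, split_overlap: int = 0) -> list[str]:
    """Split first, then clean each chunk, dropping chunks that end up empty."""
    cleaned = (
        clean(chunk)
        for text in texts
        for chunk in split_by_word(text, split_length, split_overlap)
    )
    return [chunk for chunk in cleaned if chunk]


print(preprocess(["I love  pizza and   pasta"], split_length=3))
# ['I love', 'pizza and', 'pasta']
```

Because splitting runs before cleaning in this sketch, the extra spaces count toward the chunk length, so the chunks come out shorter than three words once they are cleaned.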