---
title: "DocumentPreprocessor"
id: documentpreprocessor
slug: "/documentpreprocessor"
description: "Divides a list of text documents into a list of shorter text documents and then makes them more readable by cleaning."
---

# DocumentPreprocessor

Divides a list of text documents into a list of shorter text documents and then makes them more readable by cleaning.

|   |   |
| --- | --- |
| **Most common position in a pipeline** | In indexing pipelines after [Converters](../converters.mdx) |
| **Mandatory run variables** | "documents": A list of documents |
| **Output variables** | "documents": A list of split and cleaned documents |
| **API reference** | [PreProcessors](/reference/preprocessors-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/document_preprocessor.py |

## Overview

`DocumentPreprocessor` first splits and then cleans documents.

It is a SuperComponent that combines a `DocumentSplitter` and a `DocumentCleaner` into a single component.
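
For illustration, the snippet below sketches roughly what this SuperComponent wraps: a two-component pipeline that connects a `DocumentSplitter` to a `DocumentCleaner`. This is a minimal sketch using default parameters, not the component's exact internals.

```python
from haystack import Document, Pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

# Wire the two underlying components explicitly.
# DocumentPreprocessor bundles this splitter -> cleaner sequence into one component.
manual_preprocessing = Pipeline()
manual_preprocessing.add_component("splitter", DocumentSplitter())
manual_preprocessing.add_component("cleaner", DocumentCleaner())
manual_preprocessing.connect("splitter", "cleaner")

result = manual_preprocessing.run(
    data={"splitter": {"documents": [Document(content="I love pizza!")]}}
)
print(result["cleaner"]["documents"])
```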

### Parameters

The `DocumentPreprocessor` exposes all initialization parameters of the underlying `DocumentSplitter` and `DocumentCleaner`, and they are all optional. Detailed descriptions of these parameters are available on the respective documentation pages, and a short initialization example follows the list below:

- [DocumentSplitter](documentsplitter.mdx)
- [DocumentCleaner](documentcleaner.mdx)
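
For example, you can pass splitting and cleaning options directly when initializing the component. The parameter names below are taken from `DocumentSplitter` and `DocumentCleaner`; check the pages linked above for the full lists and defaults.

```python
from haystack.components.preprocessors import DocumentPreprocessor

# Splitting options are forwarded to the underlying DocumentSplitter,
# cleaning options to the underlying DocumentCleaner.
preprocessor = DocumentPreprocessor(
    split_by="word",
    split_length=150,
    split_overlap=20,
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
)
```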

## Usage

### On its own

```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")
preprocessor = DocumentPreprocessor()

result = preprocessor.run(documents=[doc])
print(result["documents"])
```

### In a pipeline

You can use the `DocumentPreprocessor` in your indexing pipeline. The example below requires installing additional dependencies for the `MultiFileConverter`:

```shell
pip install pypdf markdown-it-py mdit_plain trafilatura python-pptx python-docx jq openpyxl tabulate pandas
```

```python
from haystack import Pipeline
from haystack.components.converters import MultiFileConverter
from haystack.components.preprocessors import DocumentPreprocessor
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", MultiFileConverter())
pipeline.add_component("preprocessor", DocumentPreprocessor())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "preprocessor")
pipeline.connect("preprocessor", "writer")

result = pipeline.run(data={"sources": ["test.txt", "test.pdf"]})
print(result)
# {'writer': {'documents_written': 3}}
```