mirror of
https://github.com/deepset-ai/haystack.git
synced 2026-02-12 10:26:19 +00:00
---
title: "PreProcessors"
id: preprocessors
slug: "/preprocessors"
description: "Use PreProcessors to prepare your data: normalize whitespace, remove headers and footers, clean empty lines in your Documents, or split them into smaller pieces. PreProcessors are useful in an indexing pipeline to prepare your files for search."
---

# PreProcessors

Use PreProcessors to prepare your data: normalize whitespace, remove headers and footers, clean empty lines in your Documents, or split them into smaller pieces. PreProcessors are useful in an indexing pipeline to prepare your files for search.

| PreProcessor | Description |
| --- | --- |
| [ChineseDocumentSplitter](../../docs/pipeline-components/preprocessors/chinesedocumentsplitter.mdx) | Divides Chinese text documents into smaller chunks, using HanLP for accurate Chinese word segmentation and sentence tokenization. |
| [CSVDocumentCleaner](../../docs/pipeline-components/preprocessors/csvdocumentcleaner.mdx) | Cleans CSV documents by removing empty rows and columns while preserving specific ignored rows and columns. |
| [CSVDocumentSplitter](../../docs/pipeline-components/preprocessors/csvdocumentsplitter.mdx) | Divides CSV documents into smaller sub-tables based on empty rows and columns. |
| [DocumentCleaner](../../docs/pipeline-components/preprocessors/documentcleaner.mdx) | Removes extra whitespace, empty lines, specified substrings, regex matches, and page headers and footers from documents. |
| [DocumentPreprocessor](../../docs/pipeline-components/preprocessors/documentpreprocessor.mdx) | Splits a list of text documents into a list of shorter text documents and then cleans them to make them more readable. |
| [DocumentSplitter](../../docs/pipeline-components/preprocessors/documentsplitter.mdx) | Splits a list of text documents into a list of text documents with shorter texts. |
| [HierarchicalDocumentSplitter](../../docs/pipeline-components/preprocessors/hierarchicaldocumentsplitter.mdx) | Creates a multi-level document structure based on parent-children relationships between text segments. |
| [RecursiveSplitter](../../docs/pipeline-components/preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks by recursively applying a list of separators to the text, in the order they are provided. |
| [TextCleaner](../../docs/pipeline-components/preprocessors/textcleaner.mdx) | Removes regex matches, punctuation, and numbers, and converts text to lowercase. Useful for cleaning up text data before evaluation. |
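
To make the cleaning and splitting steps described above more concrete, here is a minimal plain-Python sketch of what such preprocessing does conceptually. It does not use Haystack and is not how the components are implemented. The function names `clean_text` and `split_by_words`, and the parameters `split_length` and `split_overlap` as used here, are assumptions chosen for illustration only.

```python
import re


def clean_text(text: str) -> str:
    """Collapse runs of spaces and tabs, and drop empty lines (illustrative only)."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def split_by_words(text: str, split_length: int, split_overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `split_length` words.

    Consecutive chunks share `split_overlap` words. Hypothetical helper,
    not part of any library.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current chunk has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


raw = "PreProcessors   prepare\n\n\nyour  files \t for search in an indexing pipeline"
cleaned = clean_text(raw)
chunks = split_by_words(cleaned, split_length=4, split_overlap=1)
for chunk in chunks:
    print(chunk)
```

In an actual indexing pipeline, you would use the components listed in the table rather than hand-written helpers like these. See each component's page for its real parameters and behavior.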