Daria Fokina 90894491cf
docs: add v2.20 docs pages and plugin for relative links (#9926)
* Update documentation and remove unused assets. Enhanced the 'agents' and 'components' sections with clearer descriptions and examples. Removed obsolete images and updated links for better navigation. Adjusted formatting for consistency across various documentation pages.

* remove dependency

* address comments

* delete more empty pages

* broken link

* unduplicate headings

* alphabetical components nav
2025-10-24 09:52:57 +02:00

93 lines
4.5 KiB
Plaintext

---
title: "TextCleaner"
id: textcleaner
slug: "/textcleaner"
description: "Use `TextCleaner` to make text data more readable. It removes regexes, punctuation, and numbers, as well as converts text to lowercase. This is especially useful to clean up text data before evaluation."
---
# TextCleaner
Use `TextCleaner` to make text data more readable. It removes regexes, punctuation, and numbers, as well as converts text to lowercase. This is especially useful to clean up text data before evaluation.
| | |
| --- | --- |
| **Most common position in a pipeline** | Between a [Generator](../generators.mdx) and an [Evaluator](../evaluators.mdx) |
| **Mandatory run variables** | "texts": A list of strings to be cleaned |
| **Output variables** | "texts": A list of cleaned texts |
| **API reference** | [PreProcessors](/reference/preprocessors-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/text_cleaner.py |
## Overview
`TextCleaner` expects a list of strings as input and returns a list of strings with cleaned texts. Selectable cleaning steps are to `convert_to_lowercase`, `remove_punctuation`, and to `remove_numbers`. These three parameters are booleans that need to be set when the component is initialized.
- `convert_to_lowercase` converts all characters in texts to lowercase.
- `remove_punctuation` removes all punctuation from the text.
- `remove_numbers` removes all numerical digits from the text.
In addition, you can specify a regular expression with the parameter `remove_regexps`, and any matches will be removed.
## Usage
### On its own
You can use it outside of a pipeline to clean up any texts:
```python
from haystack.components.preprocessors import TextCleaner
text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."
cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
```
### In a pipeline
In this example, we are using `TextCleaner` after an `ExtractiveReader` and an `OutputAdapter` to remove the punctuation in texts. Then, our custom-made `ExactMatchEvaluator` component compares the retrieved answer to the ground truth answer.
```python
from typing import List
from haystack import component, Document, Pipeline
from haystack.components.converters import OutputAdapter
from haystack.components.preprocessors import TextCleaner
from haystack.components.readers import ExtractiveReader
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
documents = [Document(content="There are over 7,000 languages spoken around the world today."),
Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.")]
document_store.write_documents(documents=documents)
@component
class ExactMatchEvaluator:
@component.output_types(score=int)
def run(self, expected: str, provided: List[str]):
return {"score": int(expected in provided)}
adapter = OutputAdapter(
template="{{answers | extract_data}}",
output_type=List[str],
custom_filters={"extract_data": lambda data: [answer.data for answer in data if answer.data]}
)
p = Pipeline()
p.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
p.add_component("reader", ExtractiveReader())
p.add_component("adapter", adapter)
p.add_component("cleaner", TextCleaner(remove_punctuation=True))
p.add_component("evaluator", ExactMatchEvaluator())
p.connect("retriever", "reader")
p.connect("reader", "adapter")
p.connect("adapter", "cleaner.texts")
p.connect("cleaner", "evaluator.provided")
question = "What behavior indicates a high level of self-awareness of elephants?"
ground_truth_answer = "recognizing themselves in mirrors"
result = p.run({"retriever": {"query": question}, "reader": {"query": question}, "evaluator": {"expected": ground_truth_answer}})
print(result)
```