Daria Fokina 3e81ec75dc
docs: add 2.18 and 2.19 actual documentation pages (#9946)
* versioned-docs

* external-documentstores
2025-10-27 13:03:22 +01:00


---
title: "TextCleaner"
id: textcleaner
slug: "/textcleaner"
description: "Use `TextCleaner` to make text data more readable. It removes regex matches, punctuation, and numbers, and converts text to lowercase. This is especially useful for cleaning up text data before evaluation."
---
# TextCleaner
Use `TextCleaner` to make text data more readable. It removes regex matches, punctuation, and numbers, and converts text to lowercase. This is especially useful for cleaning up text data before evaluation.
| | |
| :------------------------------------- | :--------------------------------------------------------------------------------------------------- |
| **Most common position in a pipeline** | Between a [Generator](../generators.mdx) and an [Evaluator](../evaluators.mdx) |
| **Mandatory run variables** | "texts": A list of strings to be cleaned |
| **Output variables** | "texts": A list of cleaned texts |
| **API reference** | [PreProcessors](/reference/preprocessors-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/text_cleaner.py |
## Overview
`TextCleaner` expects a list of strings as input and returns a list of cleaned strings. The available cleaning steps are `convert_to_lowercase`, `remove_punctuation`, and `remove_numbers`. These three boolean parameters are set when the component is initialized.
- `convert_to_lowercase` converts all characters in texts to lowercase.
- `remove_punctuation` removes all punctuation from the text.
- `remove_numbers` removes all numerical digits from the text.
In addition, you can pass a list of regular expressions with the `remove_regexps` parameter, and all matches are removed from the text.
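
To make the effect of each option concrete, here is a minimal standard-library sketch of roughly equivalent transformations. The function name `clean_texts` and the order in which the steps are applied are assumptions made for illustration; this is not the `TextCleaner` implementation.

```python
import re
import string
from typing import List, Optional


def clean_texts(
    texts: List[str],
    remove_regexps: Optional[List[str]] = None,
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
) -> List[str]:
    """Hypothetical helper that mirrors the options described above."""
    cleaned = []
    for text in texts:
        # Remove every match of each supplied regular expression.
        for pattern in remove_regexps or []:
            text = re.sub(pattern, "", text)
        if convert_to_lowercase:
            text = text.lower()
        if remove_punctuation:
            # string.punctuation covers ASCII punctuation only (an assumption of this sketch).
            text = text.translate(str.maketrans("", "", string.punctuation))
        if remove_numbers:
            text = text.translate(str.maketrans("", "", string.digits))
        cleaned.append(text)
    return cleaned


print(
    clean_texts(
        ["Hello, World 42!"],
        convert_to_lowercase=True,
        remove_punctuation=True,
        remove_numbers=True,
    )
)  # ['hello world ']
```

In this sketch, removing characters does not collapse the whitespace around them, which is why the sample output ends with a trailing space.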
## Usage
### On its own
You can use it outside of a pipeline to clean up any texts:
```python
from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."
# Lowercase the text and strip digits; punctuation is kept
cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
print(result["texts"])
```
### In a pipeline
In this example, `TextCleaner` comes after an `ExtractiveReader` and an `OutputAdapter` and removes punctuation from the extracted answers. A custom `ExactMatchEvaluator` component then checks whether the ground-truth answer is among the cleaned answers.
```python
from typing import List

from haystack import component, Document, Pipeline
from haystack.components.converters import OutputAdapter
from haystack.components.preprocessors import TextCleaner
from haystack.components.readers import ExtractiveReader
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents=documents)


@component
class ExactMatchEvaluator:
    @component.output_types(score=int)
    def run(self, expected: str, provided: List[str]):
        # Score 1 if the expected answer is among the provided texts, else 0
        return {"score": int(expected in provided)}


# Extract the answer strings from the reader output, skipping empty answers
adapter = OutputAdapter(
    template="{{answers | extract_data}}",
    output_type=List[str],
    custom_filters={"extract_data": lambda data: [answer.data for answer in data if answer.data]},
)

p = Pipeline()
p.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
p.add_component("reader", ExtractiveReader())
p.add_component("adapter", adapter)
p.add_component("cleaner", TextCleaner(remove_punctuation=True))
p.add_component("evaluator", ExactMatchEvaluator())

p.connect("retriever", "reader")
p.connect("reader", "adapter")
p.connect("adapter", "cleaner.texts")
p.connect("cleaner", "evaluator.provided")

question = "What behavior indicates a high level of self-awareness of elephants?"
ground_truth_answer = "recognizing themselves in mirrors"

result = p.run(
    {
        "retriever": {"query": question},
        "reader": {"query": question},
        "evaluator": {"expected": ground_truth_answer},
    }
)
print(result)
```
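
The value of cleaning before an exact-match comparison can be seen without running the pipeline. The sketch below uses only the standard library; the `strip_punctuation` and `exact_match` helpers and the sample answer string are hypothetical, chosen to mirror the `expected in provided` check in `ExactMatchEvaluator` above.

```python
import string
from typing import List


def strip_punctuation(texts: List[str]) -> List[str]:
    # Hypothetical stand-in for a cleaner configured to remove punctuation.
    table = str.maketrans("", "", string.punctuation)
    return [text.translate(table) for text in texts]


def exact_match(expected: str, provided: List[str]) -> int:
    # Same membership check as the custom evaluator in the pipeline example.
    return int(expected in provided)


ground_truth = "recognizing themselves in mirrors"
answers = ["recognizing themselves in mirrors."]  # made-up answer with a trailing period

print(exact_match(ground_truth, answers))                     # 0
print(exact_match(ground_truth, strip_punctuation(answers)))  # 1
```

A single trailing period is enough to make the uncleaned comparison fail, which is the kind of mismatch that a cleaning step between the reader and the evaluator is meant to avoid.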