---
title: "TextCleaner"
id: textcleaner
slug: "/textcleaner"
description: "Use `TextCleaner` to make text data more readable. It removes regex matches, punctuation, and numbers, and converts text to lowercase. This is especially useful for cleaning up text data before evaluation."
---

# TextCleaner

Use `TextCleaner` to make text data more readable. It removes regex matches, punctuation, and numbers, and converts text to lowercase. This is especially useful for cleaning up text data before evaluation.

|  |  |
| --- | --- |
| **Most common position in a pipeline** | Between a [Generator](../generators.mdx) and an [Evaluator](../evaluators.mdx) |
| **Mandatory run variables** | "texts": A list of strings to be cleaned |
| **Output variables** | "texts": A list of cleaned texts |
| **API reference** | [PreProcessors](/reference/preprocessors-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/text_cleaner.py |

## Overview

`TextCleaner` expects a list of strings as input and returns a list of cleaned strings. The selectable cleaning steps are `convert_to_lowercase`, `remove_punctuation`, and `remove_numbers`. These three parameters are booleans that you set when initializing the component.

- `convert_to_lowercase` converts all characters in texts to lowercase.
- `remove_punctuation` removes all punctuation from the text.
- `remove_numbers` removes all numerical digits from the text.

In addition, you can pass a list of regular expressions with the parameter `remove_regexps`, and any matches are removed.

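To make the effect of these options concrete, here is a minimal standard-library sketch of the same kinds of cleaning steps. It is not Haystack's implementation: the helper name `clean_texts` is hypothetical, and the order in which the steps run is an assumption made for illustration.

```python
import re
import string
from typing import List, Optional


def clean_texts(
    texts: List[str],
    remove_regexps: Optional[List[str]] = None,
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
) -> List[str]:
    """Hypothetical helper that mimics the cleaning options described above."""
    # Collect every character that should be deleted into one translation table.
    to_delete = ""
    if remove_punctuation:
        to_delete += string.punctuation
    if remove_numbers:
        to_delete += string.digits
    table = str.maketrans("", "", to_delete)

    cleaned = []
    for text in texts:
        # Assumed order: regex removal first, then lowercasing, then character removal.
        for pattern in remove_regexps or []:
            text = re.sub(pattern, "", text)
        if convert_to_lowercase:
            text = text.lower()
        cleaned.append(text.translate(table))
    return cleaned


print(clean_texts(["Hello, World 123!"], convert_to_lowercase=True, remove_punctuation=True, remove_numbers=True))
print(clean_texts(["Call 555-1234 now"], remove_regexps=[r"\d{3}-\d{4}"]))
```

In this sketch, removed characters are simply dropped and the surrounding whitespace is left untouched, so the first call returns `"hello world "` with a trailing space and the second returns `"Call  now"` with two spaces.
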
## Usage

### On its own

You can use it outside of a pipeline to clean up any texts:

```python
from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."

# Lowercase the text and strip digits, but keep punctuation.
cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])

# The cleaned strings are returned under the "texts" key.
print(result["texts"])
```

### In a pipeline

In this example, we use `TextCleaner` after an `ExtractiveReader` and an `OutputAdapter` to remove punctuation from the extracted answers. Then, our custom-made `ExactMatchEvaluator` component compares the cleaned answers to the ground truth answer.

```python
from typing import List
from haystack import component, Document, Pipeline
from haystack.components.converters import OutputAdapter
from haystack.components.preprocessors import TextCleaner
from haystack.components.readers import ExtractiveReader
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
documents = [Document(content="There are over 7,000 languages spoken around the world today."),
             Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
             Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.")]
document_store.write_documents(documents=documents)

# Custom component: scores 1 if the expected answer is among the provided texts, else 0.
@component
class ExactMatchEvaluator:
    @component.output_types(score=int)
    def run(self, expected: str, provided: List[str]):
        return {"score": int(expected in provided)}

# Turn the reader's answers into a plain list of strings for the cleaner.
adapter = OutputAdapter(
    template="{{answers | extract_data}}",
    output_type=List[str],
    custom_filters={"extract_data": lambda data: [answer.data for answer in data if answer.data]}
)

p = Pipeline()
p.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
p.add_component("reader", ExtractiveReader())
p.add_component("adapter", adapter)
p.add_component("cleaner", TextCleaner(remove_punctuation=True))
p.add_component("evaluator", ExactMatchEvaluator())

p.connect("retriever", "reader")
p.connect("reader", "adapter")
p.connect("adapter", "cleaner.texts")
p.connect("cleaner", "evaluator.provided")

question = "What behavior indicates a high level of self-awareness of elephants?"
ground_truth_answer = "recognizing themselves in mirrors"

result = p.run({"retriever": {"query": question}, "reader": {"query": question}, "evaluator": {"expected": ground_truth_answer}})
print(result)
```
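
The reason for cleaning before an exact-match comparison can be seen in a small standard-library sketch. The `exact_match` function uses the same membership test as `ExactMatchEvaluator` above, while `strip_punctuation` is a hypothetical stand-in for `TextCleaner(remove_punctuation=True)`, and the answer string is an assumed reader output chosen for illustration rather than a real result.

```python
import string
from typing import List


def exact_match(expected: str, provided: List[str]) -> int:
    # Same membership test as ExactMatchEvaluator.run in the pipeline above.
    return int(expected in provided)


def strip_punctuation(texts: List[str]) -> List[str]:
    # Hypothetical stand-in for TextCleaner(remove_punctuation=True).
    table = str.maketrans("", "", string.punctuation)
    return [text.translate(table) for text in texts]


ground_truth = "recognizing themselves in mirrors"
# Assumed reader output for illustration; a real ExtractiveReader may return a different span.
answers = ["recognizing themselves in mirrors."]

score_raw = exact_match(ground_truth, answers)
score_cleaned = exact_match(ground_truth, strip_punctuation(answers))
print(score_raw, score_cleaned)
```

Here the trailing period alone makes the raw comparison fail, while the cleaned comparison succeeds. The ground truth needs to be free of punctuation too, since a hyphen or comma in it would no longer match the cleaned answers.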