---
title: PreProcessors
id: preprocessors-api
description: Preprocess your Documents and texts. Clean, split, and more.
---

<a id="csv_document_cleaner"></a>

# Module csv\_document\_cleaner

<a id="csv_document_cleaner.CSVDocumentCleaner"></a>

## CSVDocumentCleaner

A component for cleaning CSV documents by removing empty rows and columns.

This component processes CSV content stored in Documents. You can optionally ignore
a specified number of rows and columns before the cleaning operation, keep the original
document IDs, and control whether empty rows and columns are removed.

<a id="csv_document_cleaner.CSVDocumentCleaner.__init__"></a>

#### CSVDocumentCleaner.\_\_init\_\_

```python
def __init__(*,
             ignore_rows: int = 0,
             ignore_columns: int = 0,
             remove_empty_rows: bool = True,
             remove_empty_columns: bool = True,
             keep_id: bool = False) -> None
```

Initializes the CSVDocumentCleaner component.

**Arguments**:

- `ignore_rows`: Number of rows to ignore from the top of the CSV table before processing.
- `ignore_columns`: Number of columns to ignore from the left of the CSV table before processing.
- `remove_empty_rows`: Whether to remove rows that are entirely empty.
- `remove_empty_columns`: Whether to remove columns that are entirely empty.
- `keep_id`: Whether to retain the original document ID in the output document.

Rows and columns ignored using these parameters are preserved in the final output, meaning
they are not considered when removing empty rows and columns.

<a id="csv_document_cleaner.CSVDocumentCleaner.run"></a>

#### CSVDocumentCleaner.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Cleans CSV documents by removing empty rows and columns while preserving specified ignored rows and columns.

**Arguments**:

- `documents`: List of Documents containing CSV-formatted content.

**Returns**:

A dictionary with a list of cleaned Documents under the key "documents".

Processing steps:
1. Reads each document's content as a CSV table.
2. Retains the specified number of `ignore_rows` from the top and `ignore_columns` from the left.
3. Drops any rows and columns that are entirely empty (if enabled by `remove_empty_rows` and
   `remove_empty_columns`).
4. Reattaches the ignored rows and columns to maintain their original positions.
5. Returns the cleaned CSV content as a new `Document` object, with an option to retain the original
   document ID.
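
For orientation, here is a minimal usage sketch. The CSV content is invented for illustration; it has one fully empty column and one fully empty row:

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentCleaner

# Invented sample: the middle column and the third row are entirely empty
csv_content = "name,,age\nAlice,,30\n,,\nBob,,25\n"
doc = Document(content=csv_content)

cleaner = CSVDocumentCleaner(remove_empty_rows=True, remove_empty_columns=True)
result = cleaner.run(documents=[doc])

# The cleaned document keeps only the non-empty rows and columns
print(result["documents"][0].content)
```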

<a id="csv_document_splitter"></a>

# Module csv\_document\_splitter

<a id="csv_document_splitter.CSVDocumentSplitter"></a>

## CSVDocumentSplitter

A component for splitting CSV documents into sub-tables based on split arguments.

The splitter supports two modes of operation:
- identifying consecutive empty rows or columns that exceed a given threshold
  and using them as delimiters to segment the document into smaller tables.
- splitting each row into a separate sub-table, represented as a Document.

<a id="csv_document_splitter.CSVDocumentSplitter.__init__"></a>

#### CSVDocumentSplitter.\_\_init\_\_

```python
def __init__(row_split_threshold: Optional[int] = 2,
             column_split_threshold: Optional[int] = 2,
             read_csv_kwargs: Optional[dict[str, Any]] = None,
             split_mode: SplitMode = "threshold") -> None
```

Initializes the CSVDocumentSplitter component.

**Arguments**:

- `row_split_threshold`: The minimum number of consecutive empty rows required to trigger a split.
- `column_split_threshold`: The minimum number of consecutive empty columns required to trigger a split.
- `read_csv_kwargs`: Additional keyword arguments to pass to `pandas.read_csv`.
  By default, the component calls `pandas.read_csv` with these options:
  - `header=None`
  - `skip_blank_lines=False` to preserve blank lines
  - `dtype=object` to prevent type inference (e.g., converting numbers to floats).
  See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information.
- `split_mode`: If `threshold`, the component will split the document based on the number of
  consecutive empty rows or columns that exceed the `row_split_threshold` or `column_split_threshold`.
  If `row-wise`, the component will split each row into a separate sub-table.

<a id="csv_document_splitter.CSVDocumentSplitter.run"></a>

#### CSVDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Processes and splits a list of CSV documents into multiple sub-tables.

**Splitting Process:**
1. Applies a row-based split if `row_split_threshold` is provided.
2. Applies a column-based split if `column_split_threshold` is provided.
3. If both thresholds are specified, performs a recursive split by rows first, then columns, ensuring
   further fragmentation of any sub-tables that still contain empty sections.
4. Sorts the resulting sub-tables based on their original positions within the document.

**Arguments**:

- `documents`: A list of Documents containing CSV-formatted content.
  Each document is assumed to contain one or more tables separated by empty rows or columns.

**Returns**:

A dictionary with a key `"documents"`, mapping to a list of new `Document` objects,
each representing an extracted sub-table from the original CSV.
The metadata of each document includes:
- A field `source_id` to track the original document.
- A field `row_idx_start` to indicate the starting row index of the sub-table in the original table.
- A field `col_idx_start` to indicate the starting column index of the sub-table in the original table.
- A field `split_id` to indicate the order of the split in the original document.
- All other metadata copied from the original document.

- If a document cannot be processed, it is returned unchanged.
- The `meta` field from the original document is preserved in the split documents.
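
As a sketch of the default `threshold` mode: the table contents below are invented, and two consecutive empty rows mark the boundary between sub-tables:

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentSplitter

# Invented sample: two tables stacked in one CSV, separated by two empty rows
csv_content = (
    "A,B\n"
    "1,2\n"
    ",\n"
    ",\n"
    "X,Y\n"
    "7,8\n"
)
splitter = CSVDocumentSplitter(row_split_threshold=2)
result = splitter.run(documents=[Document(content=csv_content)])

for doc in result["documents"]:
    # Each sub-table carries source_id, row_idx_start, col_idx_start, and split_id in meta
    print(doc.meta["split_id"], repr(doc.content))
```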

<a id="document_cleaner"></a>

# Module document\_cleaner

<a id="document_cleaner.DocumentCleaner"></a>

## DocumentCleaner

Cleans the text in the documents.

It removes extra whitespaces, empty lines, specified substrings, regexes,
page headers and footers (in this order).

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This is a document to clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

assert result["documents"][0].content == "This is a document to clean "
```

<a id="document_cleaner.DocumentCleaner.__init__"></a>

#### DocumentCleaner.\_\_init\_\_

```python
def __init__(remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: Optional[list[str]] = None,
             remove_regex: Optional[str] = None,
             unicode_normalization: Optional[Literal["NFC", "NFKC", "NFD",
                                                     "NFKD"]] = None,
             ascii_only: bool = False)
```

Initialize DocumentCleaner.

**Arguments**:

- `remove_empty_lines`: If `True`, removes empty lines.
- `remove_extra_whitespaces`: If `True`, removes extra whitespaces.
- `remove_repeated_substrings`: If `True`, removes repeated substrings (headers and footers) from pages.
  Pages must be separated by a form feed character "\f",
  which is supported by `TextFileToDocument` and `AzureOCRDocumentConverter`.
- `remove_substrings`: List of substrings to remove from the text.
- `remove_regex`: Regex to match and replace substrings with "".
- `keep_id`: If `True`, keeps the IDs of the original documents.
- `unicode_normalization`: Unicode normalization form to apply to the text.
  Note: This will run before any other steps.
- `ascii_only`: Whether to convert the text to ASCII only.
  Will remove accents from characters and replace them with ASCII characters.
  Other non-ASCII characters will be removed.
  Note: This will run before any pattern matching or removal.
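
A short sketch combining several of these options. The sample text and the page-marker pattern are invented; note that `ascii_only` runs before the regex removal:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="Café  menu\n\n\nPage 1 of 2")

cleaner = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    remove_regex=r"Page \d+ of \d+",  # strips the invented page marker
    ascii_only=True,                  # "Café" becomes "Cafe" before the regex runs
)
result = cleaner.run(documents=[doc])
```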

<a id="document_cleaner.DocumentCleaner.run"></a>

#### DocumentCleaner.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Cleans up the documents.

**Arguments**:

- `documents`: List of Documents to clean.

**Raises**:

- `TypeError`: if documents is not a list of Documents.

**Returns**:

A dictionary with the following key:
- `documents`: List of cleaned Documents.

<a id="document_preprocessor"></a>

# Module document\_preprocessor

<a id="document_preprocessor.DocumentPreprocessor"></a>

## DocumentPreprocessor

A SuperComponent that first splits and then cleans documents.

This component consists of a DocumentSplitter followed by a DocumentCleaner in a single pipeline.
It takes a list of documents as input and returns a processed list of documents.

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")
preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])
print(result["documents"])
```

<a id="document_preprocessor.DocumentPreprocessor.__init__"></a>

#### DocumentPreprocessor.\_\_init\_\_

```python
def __init__(*,
             split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 250,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Optional[Callable[[str], list[str]]] = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: Optional[list[str]] = None,
             remove_regex: Optional[str] = None,
             unicode_normalization: Optional[Literal["NFC", "NFKC", "NFD",
                                                     "NFKD"]] = None,
             ascii_only: bool = False) -> None
```

Initialize a DocumentPreprocessor that first splits and then cleans documents.

**Splitter Parameters**:

**Arguments**:

- `split_by`: The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
- `split_length`: The maximum number of units (words, lines, pages, and so on) in each split.
- `split_overlap`: The number of overlapping units between consecutive splits.
- `split_threshold`: The minimum number of units per split. If a split is smaller than this, it's merged
  with the previous split.
- `splitting_function`: A custom function for splitting if `split_by="function"`.
- `respect_sentence_boundary`: If `True`, splits by words but tries not to break inside a sentence.
- `language`: Language used by the sentence tokenizer if `split_by="sentence"` or
  `respect_sentence_boundary=True`.
- `use_split_rules`: Whether to apply additional splitting heuristics for the sentence splitter.
- `extend_abbreviations`: Whether to extend the sentence splitter with curated abbreviations for certain
  languages.

**Cleaner Parameters**:

- `remove_empty_lines`: If `True`, removes empty lines.
- `remove_extra_whitespaces`: If `True`, removes extra whitespaces.
- `remove_repeated_substrings`: If `True`, removes repeated substrings like headers/footers across pages.
- `keep_id`: If `True`, keeps the original document IDs.
- `remove_substrings`: A list of strings to remove from the document content.
- `remove_regex`: A regex pattern whose matches will be removed from the document content.
- `unicode_normalization`: Unicode normalization form to apply to the text, for example `"NFC"`.
- `ascii_only`: If `True`, converts text to ASCII only.

<a id="document_preprocessor.DocumentPreprocessor.to_dict"></a>

#### DocumentPreprocessor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the SuperComponent to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="document_preprocessor.DocumentPreprocessor.from_dict"></a>

#### DocumentPreprocessor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentPreprocessor"
```

Deserializes the SuperComponent from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized SuperComponent.
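
A minimal round-trip sketch, assuming only that `to_dict` and `from_dict` are inverses as documented:

```python
from haystack.components.preprocessors import DocumentPreprocessor

preprocessor = DocumentPreprocessor(split_by="sentence", split_length=5)
data = preprocessor.to_dict()  # plain dict, suitable for storing as YAML/JSON
restored = DocumentPreprocessor.from_dict(data)
```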

<a id="document_splitter"></a>

# Module document\_splitter

<a id="document_splitter.DocumentSplitter"></a>

## DocumentSplitter

Splits long documents into smaller chunks.

This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations
and prevents exceeding language model context limits.

The DocumentSplitter is compatible with the following DocumentStores:
- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Chroma](https://docs.haystack.deepset.ai/docs/chromadocumentstore) (limited support: overlapping information is
  not stored)
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store) (limited support: overlapping
  information is not stored)
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)
- [Weaviate](https://docs.haystack.deepset.ai/docs/weaviatedocumentstore)

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
```

<a id="document_splitter.DocumentSplitter.__init__"></a>

#### DocumentSplitter.\_\_init\_\_

```python
def __init__(split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 200,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Optional[Callable[[str], list[str]]] = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             *,
             skip_empty_documents: bool = True)
```

Initialize DocumentSplitter.

**Arguments**:

- `split_by`: The unit for splitting your documents. Choose from:
  - `word` for splitting by spaces (" ")
  - `period` for splitting by periods (".")
  - `page` for splitting by form feed ("\f")
  - `passage` for splitting by double line breaks ("\n\n")
  - `line` for splitting each line ("\n")
  - `sentence` for splitting by the NLTK sentence tokenizer
- `split_length`: The maximum number of units in each split.
- `split_overlap`: The number of overlapping units for each split.
- `split_threshold`: The minimum number of units per split. If a split has fewer units
  than the threshold, it's attached to the previous split.
- `splitting_function`: Necessary when `split_by` is set to "function".
  This is a function which must accept a single `str` as input and return a `list` of `str` as output,
  representing the chunks after splitting.
- `respect_sentence_boundary`: Choose whether to respect sentence boundaries when splitting by "word".
  If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences.
- `language`: Choose the language for the NLTK tokenizer. The default is English ("en").
- `use_split_rules`: Choose whether to use additional split rules when splitting by `sentence`.
- `extend_abbreviations`: Choose whether to extend NLTK's PunktTokenizer abbreviations with a list
  of curated abbreviations, if available. This is currently supported for English ("en") and German ("de").
- `skip_empty_documents`: Choose whether to skip documents with empty content. Default is True.
  Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
  from non-textual documents.
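
For example, a word-based splitter that avoids breaking sentences might be configured like this (a sketch; the sample text is invented):

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

splitter = DocumentSplitter(
    split_by="word",
    split_length=20,
    split_overlap=4,
    respect_sentence_boundary=True,  # uses the NLTK sentence tokenizer
)
splitter.warm_up()  # loads the sentence tokenizer before the first run

doc = Document(content="Moonlight shimmered softly. Wolves howled nearby. Night enveloped everything.")
result = splitter.run(documents=[doc])
```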

<a id="document_splitter.DocumentSplitter.warm_up"></a>

#### DocumentSplitter.warm\_up

```python
def warm_up()
```

Warm up the DocumentSplitter by loading the sentence tokenizer.

<a id="document_splitter.DocumentSplitter.run"></a>

#### DocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Split documents into smaller parts.

Splits documents by the unit expressed in `split_by`, with a length of `split_length`
and an overlap of `split_overlap`.

**Arguments**:

- `documents`: The documents to split.

**Raises**:

- `TypeError`: if the input is not a list of Documents.
- `ValueError`: if the content of a document is None.

**Returns**:

A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.

<a id="document_splitter.DocumentSplitter.to_dict"></a>

#### DocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

<a id="document_splitter.DocumentSplitter.from_dict"></a>

#### DocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentSplitter"
```

Deserializes the component from a dictionary.

<a id="hierarchical_document_splitter"></a>

# Module hierarchical\_document\_splitter

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter"></a>

## HierarchicalDocumentSplitter

Splits a document into blocks of different sizes, building a hierarchical tree structure.

The root node of the tree is the original document and the leaf nodes are the smallest blocks. The blocks
in between are connected such that the smaller blocks are children of the larger parent blocks.

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter

doc = Document(content="This is a simple test document")
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])

>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
```

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.__init__"></a>

#### HierarchicalDocumentSplitter.\_\_init\_\_

```python
def __init__(block_sizes: set[int],
             split_overlap: int = 0,
             split_by: Literal["word", "sentence", "page",
                               "passage"] = "word")
```

Initialize HierarchicalDocumentSplitter.

**Arguments**:

- `block_sizes`: Set of block sizes to split the document into. The blocks are split in descending order.
- `split_overlap`: The number of overlapping units for each split.
- `split_by`: The unit for splitting your documents.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.run"></a>

#### HierarchicalDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Builds a hierarchical document structure for each document in a list of documents.

**Arguments**:

- `documents`: List of Documents to split into hierarchical blocks.

**Returns**:

List of HierarchicalDocument

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.build_hierarchy_from_doc"></a>

#### HierarchicalDocumentSplitter.build\_hierarchy\_from\_doc

```python
def build_hierarchy_from_doc(document: Document) -> list[Document]
```

Build a hierarchical tree document structure from a single document.

Given a document, this function splits the document into hierarchical blocks of different sizes represented
as HierarchicalDocument objects.

**Arguments**:

- `document`: Document to split into hierarchical blocks.

**Returns**:

List of HierarchicalDocument

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.to_dict"></a>

#### HierarchicalDocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Returns a dictionary representation of the component.

**Returns**:

Serialized dictionary representation of the component.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.from_dict"></a>

#### HierarchicalDocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HierarchicalDocumentSplitter"
```

Deserialize this component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize and create the component.

**Returns**:

The deserialized component.

<a id="recursive_splitter"></a>

# Module recursive\_splitter

<a id="recursive_splitter.RecursiveDocumentSplitter"></a>

## RecursiveDocumentSplitter

Recursively splits text into smaller chunks.

This component splits text into smaller chunks by recursively applying a list of separators
to the text.

The separators are applied in the order they are provided, with the last separator being the
most specific one.

Each separator is applied to the text in turn. Chunks that fit within `split_length` are kept;
chunks that are still larger than `split_length` are split further using the next separator in
the list.

This continues until all chunks are smaller than the `split_length` parameter.

**Example**:

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science. Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
chunker.warm_up()
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
```

<a id="recursive_splitter.RecursiveDocumentSplitter.__init__"></a>

#### RecursiveDocumentSplitter.\_\_init\_\_

```python
def __init__(*,
             split_length: int = 200,
             split_overlap: int = 0,
             split_unit: Literal["word", "char", "token"] = "word",
             separators: Optional[list[str]] = None,
             sentence_splitter_params: Optional[dict[str, Any]] = None)
```

Initializes a RecursiveDocumentSplitter.

**Arguments**:

- `split_length`: The maximum length of each chunk, by default in words, but it can also be in
  characters or tokens. See the `split_unit` parameter.
- `split_overlap`: The number of characters to overlap between consecutive chunks.
- `split_unit`: The unit of the split_length parameter. It can be either "word", "char", or "token".
  If "token" is selected, the text will be split into tokens using the tiktoken tokenizer (o200k_base).
- `separators`: An optional list of separator strings to use for splitting the text. The string
  separators will be treated as regular expressions unless the separator is "sentence", in which case the
  text will be split into sentences using a custom sentence tokenizer based on NLTK.
  See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter.
  If no separators are provided, the default separators ["\n\n", "sentence", "\n", " "] are used.
- `sentence_splitter_params`: Optional parameters to pass to the sentence tokenizer.
  See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter for more information.

**Raises**:

- `ValueError`: If the overlap is greater than or equal to the chunk size, if the overlap is negative, or
  if any separator is not a string.
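
A sketch of token-based splitting, assuming `tiktoken` is installed (the sample text is invented):

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

splitter = RecursiveDocumentSplitter(
    split_length=120,
    split_unit="token",                          # counts tiktoken (o200k_base) tokens
    separators=["\n\n", "sentence", "\n", " "],  # the documented defaults
)
splitter.warm_up()  # loads the sentence and tiktoken tokenizers

doc = Document(content="Moonlight shimmered softly. Wolves howled nearby. Night enveloped everything.")
result = splitter.run([doc])
```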

<a id="recursive_splitter.RecursiveDocumentSplitter.warm_up"></a>

#### RecursiveDocumentSplitter.warm\_up

```python
def warm_up() -> None
```

Warm up the sentence tokenizer and tiktoken tokenizer if needed.

<a id="recursive_splitter.RecursiveDocumentSplitter.run"></a>

#### RecursiveDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Split a list of documents into documents with smaller chunks of text.

**Arguments**:

- `documents`: List of Documents to split.

**Raises**:

- `RuntimeError`: If the component wasn't warmed up but requires it for sentence splitting or tokenization.

**Returns**:

A dictionary containing a key "documents" with a list of Documents holding smaller chunks of text
that correspond to the input documents.

<a id="text_cleaner"></a>

# Module text\_cleaner

<a id="text_cleaner.TextCleaner"></a>

## TextCleaner

Cleans text strings.

It can remove substrings matching a list of regular expressions, convert text to lowercase,
remove punctuation, and remove numbers.
Use it to clean up text data before evaluation.

### Usage example

```python
from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."

cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
```

<a id="text_cleaner.TextCleaner.__init__"></a>

#### TextCleaner.\_\_init\_\_

```python
def __init__(remove_regexps: Optional[list[str]] = None,
             convert_to_lowercase: bool = False,
             remove_punctuation: bool = False,
             remove_numbers: bool = False)
```

Initializes the TextCleaner component.

**Arguments**:

- `remove_regexps`: A list of regex patterns to remove matching substrings from the text.
- `convert_to_lowercase`: If `True`, converts all characters to lowercase.
- `remove_punctuation`: If `True`, removes punctuation from the text.
- `remove_numbers`: If `True`, removes numerical digits from the text.
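
A sketch of `remove_regexps` in use; the bracketed reference markers in the sample text are invented:

```python
from haystack.components.preprocessors import TextCleaner

# Remove invented "[REF<n>]" markers and lowercase the text
cleaner = TextCleaner(remove_regexps=[r"\[REF\d+\]"], convert_to_lowercase=True)
result = cleaner.run(texts=["Wolves howled nearby.[REF1] Night enveloped everything.[REF2]"])
print(result["texts"])  # markers removed, text lowercased
```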

<a id="text_cleaner.TextCleaner.run"></a>

#### TextCleaner.run

```python
@component.output_types(texts=list[str])
def run(texts: list[str]) -> dict[str, Any]
```

Cleans up the given list of strings.

**Arguments**:

- `texts`: List of strings to clean.

**Returns**:

A dictionary with the following key:
- `texts`: the cleaned list of strings.