---
title: "DocumentLengthRouter"
id: documentlengthrouter
slug: "/documentlengthrouter"
description: "Routes documents to different output connections based on the length of their `content` field."
---

# DocumentLengthRouter

Routes documents to different output connections based on the length of their `content` field.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | Flexible |
| **Mandatory run variables** | `documents`: A list of documents |
| **Output variables** | `short_documents`: A list of documents where `content` is `None` or the length of `content` is less than or equal to the threshold <br /><br /> `long_documents`: A list of documents where the length of `content` is greater than the threshold |
| **API reference** | [Routers](/reference/routers-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/routers/document_length_router.py |

</div>

## Overview

`DocumentLengthRouter` routes documents to different output connections based on the length of their `content` field.

The `threshold` init parameter controls the routing: documents whose `content` is `None`, or whose `content` length is less than or equal to the threshold, are routed to the `"short_documents"` output. All other documents are routed to `"long_documents"`.

A common use case for `DocumentLengthRouter` is handling documents obtained from PDFs that contain non-text content, such as scanned pages or images. This component can detect empty or low-content documents and route them to components that perform OCR, generate captions, or compute image embeddings.

## Usage

### On its own

```python
from haystack.components.routers import DocumentLengthRouter
from haystack.dataclasses import Document

docs = [
    Document(content="Short"),
    Document(content="Long document " * 20),
]

router = DocumentLengthRouter(threshold=10)

result = router.run(documents=docs)
print(result)

# {
#   "short_documents": [Document(content="Short", ...)],
#   "long_documents": [Document(content="Long document ...", ...)],
# }
```

### In a pipeline

In the following indexing pipeline, the `PyPDFToDocument` converter extracts text from PDF files. Documents are then split by page using a `DocumentSplitter`. Next, the `DocumentLengthRouter` routes short documents to `LLMDocumentContentExtractor` to extract text, which is particularly useful for non-textual, image-based pages. Finally, all documents are collected with a `DocumentJoiner` and written to the Document Store.

```python
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.extractors.image import LLMDocumentContentExtractor
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.joiners import DocumentJoiner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.routers import DocumentLengthRouter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

indexing_pipe = Pipeline()
indexing_pipe.add_component(
    "pdf_converter",
    PyPDFToDocument(store_full_path=True)
)
# Setting skip_empty_documents=False is important here because the
# LLMDocumentContentExtractor can extract text from non-textual documents
# that would otherwise be skipped
indexing_pipe.add_component(
    "pdf_splitter",
    DocumentSplitter(
        split_by="page",
        split_length=1,
        skip_empty_documents=False
    )
)
indexing_pipe.add_component(
    "doc_length_router",
    DocumentLengthRouter(threshold=10)
)
indexing_pipe.add_component(
    "content_extractor",
    LLMDocumentContentExtractor(
        chat_generator=OpenAIChatGenerator(model="gpt-4.1-mini")
    )
)
indexing_pipe.add_component(
    "doc_joiner",
    DocumentJoiner(sort_by_score=False)
)
indexing_pipe.add_component(
    "document_writer",
    DocumentWriter(document_store=document_store)
)

indexing_pipe.connect("pdf_converter.documents", "pdf_splitter.documents")
indexing_pipe.connect("pdf_splitter.documents", "doc_length_router.documents")
# The short PDF pages will be enriched/captioned
indexing_pipe.connect(
    "doc_length_router.short_documents",
    "content_extractor.documents"
)
indexing_pipe.connect(
    "doc_length_router.long_documents",
    "doc_joiner.documents"
)
indexing_pipe.connect(
    "content_extractor.documents",
    "doc_joiner.documents"
)
indexing_pipe.connect("doc_joiner.documents", "document_writer.documents")

# Run the indexing pipeline with sources
indexing_result = indexing_pipe.run(
    data={"sources": ["textual_pdf.pdf", "non_textual_pdf.pdf"]}
)

# Inspect the documents
indexed_documents = document_store.filter_documents()
print(f"Indexed {len(indexed_documents)} documents:\n")
for doc in indexed_documents:
    print("file_path: ", doc.meta["file_path"])
    print("page_number: ", doc.meta["page_number"])
    print("content: ", doc.content)
    print("-" * 100 + "\n")

# Indexed 3 documents:
#
# file_path: textual_pdf.pdf
# page_number: 1
# content: A sample PDF file...
# ----------------------------------------------------------------------------------------------------
#
# file_path: textual_pdf.pdf
# page_number: 2
# content: Page 2 of Sample PDF...
# ----------------------------------------------------------------------------------------------------
#
# file_path: non_textual_pdf.pdf
# page_number: 1
# content: Content extracted from non-textual PDF using a LLM...
# ----------------------------------------------------------------------------------------------------
```