---
title: "CacheChecker"
id: cachechecker
slug: "/cachechecker"
description: "This component checks for the presence of documents in a Document Store based on a specified cache field."
---

# CacheChecker

This component checks for the presence of documents in a Document Store based on a specified cache field.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | Flexible |
| **Mandatory init variables** | `document_store`: A Document Store instance <br /> <br />`cache_field`: Name of the document metadata field to check against |
| **Mandatory run variables** | `items`: A list of values to look up in the `cache_field` of documents |
| **Output variables** | `hits`: A list of documents found in the cache with one of the specified values <br /> <br />`misses`: A list of values that could not be found in the cache |
| **API reference** | [Caching](/reference/caching-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/caching/cache_checker.py |

</div>

## Overview

`CacheChecker` checks whether a Document Store contains any documents whose `cache_field` value matches one of the values passed in the `items` input. It returns a dictionary with two keys: `"hits"`, a list of the documents found in the cache, and `"misses"`, a list of the values that were not found.
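
As a minimal sketch of that return shape (the store contents and URLs here are illustrative), pre-populating the store with one matching document yields one hit and one miss:

```python
from haystack import Document
from haystack.components.caching import CacheChecker
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Illustrative setup: one document whose "url" metadata matches a lookup value
doc_store = InMemoryDocumentStore()
doc_store.write_documents([Document(content="Cached page", meta={"url": "https://example.com/a"})])

cache_checker = CacheChecker(document_store=doc_store, cache_field="url")
results = cache_checker.run(items=["https://example.com/a", "https://example.com/b"])

print(results["hits"])    # [Document(...)] with 'url': 'https://example.com/a' in its metadata
print(results["misses"])  # ['https://example.com/b']
```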

## Usage

### On its own

```python
from haystack.components.caching import CacheChecker
from haystack.document_stores.in_memory import InMemoryDocumentStore

my_doc_store = InMemoryDocumentStore()

# For URL-based caching
cache_checker = CacheChecker(document_store=my_doc_store, cache_field="url")
cache_check_results = cache_checker.run(items=["https://example.com/resource", "https://another_example.com/other_resources"])
print(cache_check_results["hits"])    # Documents found in the cache: each has 'url': <one of the above> in its metadata
print(cache_check_results["misses"])  # URLs not found in the cache, for example: ["https://example.com/resource"]

# For caching based on a custom identifier
cache_checker = CacheChecker(document_store=my_doc_store, cache_field="metadata_field")
cache_check_results = cache_checker.run(items=["12345", "ABCDE"])
print(cache_check_results["hits"])    # Documents found in the cache: each has 'metadata_field': <one of the above> in its metadata
print(cache_check_results["misses"])  # Values not found in the cache, for example: ["ABCDE"]
```
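
Note that `my_doc_store` is created empty here, so both runs return every item in `misses`. Items appear in `hits` only once documents whose `cache_field` metadata matches them have been written to the store, as in the Overview sketch above.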

### In a pipeline

```python
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.components.caching import CacheChecker
from haystack.document_stores.in_memory import InMemoryDocumentStore

pipeline = Pipeline()
document_store = InMemoryDocumentStore()
pipeline.add_component(instance=CacheChecker(document_store, cache_field="meta.file_path"), name="cache_checker")
pipeline.add_component(instance=TextFileToDocument(), name="text_file_converter")
pipeline.add_component(instance=DocumentCleaner(), name="cleaner")
pipeline.add_component(instance=DocumentSplitter(split_by="sentence", split_length=250, split_overlap=30), name="splitter")
pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
pipeline.connect("cache_checker.misses", "text_file_converter.sources")
pipeline.connect("text_file_converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")
pipeline.connect("splitter.documents", "writer.documents")

pipeline.draw("pipeline.png")

# Pass a file path from the current directory as input and run the pipeline
result = pipeline.run({"cache_checker": {"items": ["code_of_conduct_1.txt"]}})
print(result)

# The second execution skips the files that were already processed
result = pipeline.run({"cache_checker": {"items": ["code_of_conduct_1.txt"]}})
print(result)
```
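
Only the `misses` output of `cache_checker` is connected downstream, so files already present in the Document Store bypass conversion, cleaning, splitting, and writing on subsequent runs. The `hits` output is left unconnected and is therefore returned as part of the pipeline result.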