Daria Fokina 3e81ec75dc
docs: add 2.18 and 2.19 actual documentation pages (#9946)
* versioned-docs

* external-documentstores
2025-10-27 13:03:22 +01:00

---
title: "TopPSampler"
id: toppsampler
slug: "/toppsampler"
description: "Uses nucleus sampling to filter documents."
---
# TopPSampler
Uses nucleus sampling to filter documents.

| | |
| :------------------------------------- | :-------------------------------------------------------------------------------------------------------- |
| **Most common position in a pipeline** | After a [Ranker](../rankers.mdx) |
| **Mandatory init variables** | "top_p": A float between 0 and 1 representing the cumulative probability threshold for document selection |
| **Mandatory run variables** | "documents": A list of documents |
| **Output variables** | "documents": A list of documents |
| **API reference** | [Samplers](/reference/samplers-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/samplers/top_p.py |
## Overview
Top-P (nucleus) sampling selects a subset of documents based on their cumulative probabilities. Instead of choosing a fixed number of documents, it keeps the highest-scoring documents that together account for a specified share of the total probability. To put it simply, `TopPSampler` provides a way to efficiently select the most relevant documents based on their similarity to a given query.
The practical goal of the `TopPSampler` is to return the highest-scoring documents whose cumulative probability stays within the `top_p` threshold. For example, when `top_p` is set to a high value, more documents are returned, which can result in more varied outputs. The value must be between 0 and 1. By default, the component reads the similarity score from each document's `score` field.
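The component's `run()` method takes in a list of documents, reads the similarity score already stored on each document, converts these scores into probabilities, and then filters the documents based on the cumulative probability of these scores. It does not take the query as input, which is why the component usually follows a Ranker that has already scored the documents.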
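
The filtering step described above can be sketched in plain Python. This is an illustration rather than the component's actual code: the helper name `top_p_filter` is hypothetical, and the sketch assumes that scores are normalized with a softmax, that a document is kept while the running total stays at or below `top_p`, and that the single best document is returned when nothing qualifies.

```python
import math
from itertools import accumulate


def top_p_filter(scores, top_p):
    """Return the indices kept by a top-p (nucleus) filter, best first.

    Assumptions (not confirmed by this page): softmax normalization,
    a "cumulative probability <= top_p" rule, and a fallback to the
    single best item when no item satisfies that rule.
    """
    if not scores:
        return []

    # Softmax, shifted by the maximum score for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank items from most to least probable and accumulate their mass.
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    cumulative = list(accumulate(probs[i] for i in order))

    # Keep items while the running total stays within top_p
    # (a small tolerance absorbs floating-point rounding).
    kept = [i for i, c in zip(order, cumulative) if c <= top_p + 1e-6]
    return kept or order[:1]


scores = [-10.6, -8.9, -4.6]  # illustrative similarity scores

print(top_p_filter(scores, 0.99))   # [2]
print(top_p_filter(scores, 0.999))  # [2, 1]
print(top_p_filter(scores, 1.0))    # [2, 1, 0]
```

With these scores, the best item alone carries about 98.4% of the probability, so `top_p=0.99` keeps only that item, while raising `top_p` admits more items.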
## Usage
### On its own
```python
from haystack import Document
from haystack.components.samplers import TopPSampler

# Read scores from the "similarity_score" meta field instead of the default `score` field.
sampler = TopPSampler(top_p=0.99, score_field="similarity_score")

docs = [
    Document(content="Berlin", meta={"similarity_score": -10.6}),
    Document(content="Belgrade", meta={"similarity_score": -8.9}),
    Document(content="Sarajevo", meta={"similarity_score": -4.6}),
]

# Filter the documents by cumulative probability and print the ones that remain.
output = sampler.run(documents=docs)
docs = output["documents"]
print(docs)
```
### In a pipeline
To best understand how you can use a `TopPSampler` and which components to pair it with, have a look at this recipe:
RECIPE MISSING