<!---
title: "Optimization"
metaTitle: "Optimization"
metaDescription: ""
slug: "/docs/optimization"
date: "2020-11-05"
id: "optimizationmd"
--->

# Optimization

## Document Length

Document length has a very direct impact on the speed of the Reader,
which is why we recommend using the `PreProcessor` class to clean and split your documents.
**If you halve the length of your documents, you will halve the workload placed onto your Reader.**

For **sparse retrievers**, very long documents pose a challenge since the signal of the relevant section of text
can get washed out by the rest of the document.
We recommend ensuring that **documents are no longer than 10,000 words**.

**Dense retrievers** are limited in the length of text that they can read in one pass.
As such, it is important that documents are not longer than the dense retriever's maximum input length.
By default, Haystack's `DensePassageRetriever` model has a maximum input length of 256 tokens.
We therefore recommend that documents contain significantly fewer words.
We have found decent performance with **documents around 100 words long**.

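The effect of a word-count limit can be sketched with a naive splitter (a minimal illustration only, not Haystack's actual `PreProcessor` implementation; in Haystack you would instead pass parameters such as `split_by="word"` and `split_length=100` to the `PreProcessor`):

``` python
def split_by_word_count(text, split_length=100):
    """Split a document into chunks of at most `split_length` words."""
    words = text.split()
    return [
        " ".join(words[i:i + split_length])
        for i in range(0, len(words), split_length)
    ]

doc = ("lorem " * 250).strip()  # a 250-word toy document
chunks = split_by_word_count(doc, split_length=100)
print(len(chunks))             # 3 chunks: 100 + 100 + 50 words
print(len(chunks[0].split()))  # 100
```

Each chunk now fits comfortably within a dense retriever's input window, and the Reader processes shorter passages per query.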
## Respecting Sentence Boundaries

When splitting documents, it is generally not a good idea to let document boundaries fall in the middle of sentences.
Doing so means that each document will contain incomplete sentence fragments,
which may be hard for both retriever and reader to interpret.
It is therefore recommended to set `split_respect_sentence_boundary=True` when initializing your `PreProcessor`.

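What boundary-respecting splitting does can be sketched as follows (Haystack's `PreProcessor` handles this for you; the simple regex-based sentence splitter below is an assumption for illustration only):

``` python
import re

def split_respecting_sentences(text, max_words=100):
    """Group whole sentences into chunks of at most `max_words` words,
    so no chunk starts or ends mid-sentence."""
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "Haystack is great. " * 20  # twenty short sentences
for chunk in split_respecting_sentences(text, max_words=10):
    assert chunk.endswith(".")  # every chunk ends at a sentence boundary
```

Because sentences are kept whole, no chunk contains a dangling fragment that the retriever or reader would have to interpret out of context.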
## Choosing the Right top-k Values

The `top_k` parameter in both the `Retriever` and the `Reader` determines how many results they return.
More specifically, the `Retriever`'s `top_k` dictates how many retrieved documents are passed on to the next stage,
while the `Reader`'s `top_k` determines how many answer candidates to show.

In our experiments, we have found that **`Retriever` `top_k=10`
gives decent overall performance**, and so we have set this as the default in Haystack.

The choice of `Retriever` `top_k` is a trade-off between speed and accuracy,
especially when there is a `Reader` in the pipeline.
Setting it higher means passing more documents to the `Reader`,
thus reducing the chance that the answer-containing passage is missed.
However, passing more documents to the `Reader` also creates a larger workload for that component.

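The trade-off can be made concrete with a toy recall@k calculation (the ranks below are invented example data, not measurements):

``` python
# Hypothetical ranks at which the answer-containing document was retrieved
# for eight queries (invented data for illustration).
answer_ranks = [1, 4, 12, 7, 2, 25, 3, 9]

def recall_at_k(ranks, k):
    """Fraction of queries whose answer document is within the top k results."""
    return sum(rank <= k for rank in ranks) / len(ranks)

for k in (5, 10, 20):
    print(k, recall_at_k(answer_ranks, k))
# Raising k catches more answers, but the Reader must then
# process k documents per query instead of 5.
```

Here moving from `top_k=5` to `top_k=10` improves recall from 0.5 to 0.75, at the cost of doubling the number of documents the Reader must process.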
|
These parameters can easily be tweaked as follows if using a `Finder`:
|
|
``` python
|
|
answers = finder.get_answers(retriever_top_k=10,
|
|
reader_top_k=5)
|
|
```
|
|
or like this if directly calling the `Retriever`:
|
|
``` python
|
|
retrieved_docs = retriever.retrieve(top_k=10)
|
|
```

<div class="recommendation">

**Tip:** The `Finder` class is being deprecated and has been replaced by a more powerful [Pipelines class](/docs/latest/pipelinesmd).

</div>