Documentation update (#1162)

* Add content

* Add German BERT references

* Mention preprocessor language

* Fix mypy CI

* Add document length recommendation

* Add more languages
Branden Chan 2021-06-11 11:06:57 +02:00 committed by GitHub
parent 41b537affe
commit 13edff109d
4 changed files with 219 additions and 8 deletions

View File

@ -29,7 +29,7 @@ Initialising a new DocumentStore within Haystack is straightforward.
[Install](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html)
Elasticsearch and then [start](https://www.elastic.co/guide/en/elasticsearch/reference/current/starting-elasticsearch.html)
an instance.
If you have Docker set up, we recommend pulling the Docker image and running it.
@ -37,6 +37,8 @@
```bash
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.9.2
docker run -d -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.9.2
```
Note that we also have a utility function `haystack.utils.launch_es` that can start up an Elasticsearch instance.
Next you can initialize the Haystack object that will connect to this instance.
```python
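# A minimal sketch (import path as of Haystack 0.x; assumes Elasticsearch is
# reachable on the default host and port from the Docker setup above)
from haystack.document_store import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", port=9200)
```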
@ -60,7 +62,8 @@ Use e.g. [aws-requests-auth](https://github.com/davidmuller/aws-requests-auth) t
<label class="labelouter" for="tab-1-2">Milvus</label>
<div class="tabcontent">
Follow the [official documentation](https://www.milvus.io/docs/v1.0.0/milvus_docker-cpu.md) to start a Milvus instance via Docker.
Note that we also have a utility function `haystack.utils.launch_milvus` that can start up a Milvus instance.
You can initialize the Haystack object that will connect to this instance as follows:
```python
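# A minimal sketch (import path as of Haystack 0.x; assumes Milvus is running
# on its default port from the Docker setup above)
from haystack.document_store import MilvusDocumentStore

document_store = MilvusDocumentStore()
```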

View File

@ -13,6 +13,20 @@ Haystack is well suited to open-domain QA on languages other than English.
While our defaults are tuned for English,
you will find some tips and tricks here for using Haystack in your language.
## PreProcessor
<div class="recommendation">
**Note**
This feature will be implemented by [this PR](https://github.com/deepset-ai/haystack/pull/1160).
</div>
The PreProcessor's sentence tokenization is language specific.
If you are using the PreProcessor on a language other than English,
make sure to set the `language` argument when initializing it.
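As a minimal sketch (assuming the `language` argument lands as described in the PR above; other arguments as in Haystack 0.x):
```python
from haystack.preprocessor import PreProcessor

# `language` is the argument added by the PR linked above;
# "de" selects German sentence tokenization rules
preprocessor = PreProcessor(split_by="sentence", split_length=3, language="de")
```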
## Retrievers
The sparse retriever methods themselves (BM25, TF-IDF) are language-agnostic.
@ -36,7 +50,8 @@ document_store = ElasticsearchDocumentStore(analyzer="thai")
The models used in dense retrievers are language specific.
Be sure to check the language of the model used in your EmbeddingRetriever.
The default model that is loaded in the DensePassageRetriever is for English.
We have created a [German DensePassageRetriever model](https://deepset.ai/germanquad) and know other teams who work on further languages.
If you have a language model and a question answering dataset in your own language, you can also train a DPR model using Haystack!
Below is a simplified example.
See [our tutorial](/docs/latest/tutorial9md) and also the [API reference](/docs/latest/apiretrievermd#train) for `DensePassageRetriever.train()` for more details.
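Here is a sketch of what that training can look like (model names, paths, and hyperparameters are illustrative placeholders; argument names as in Haystack 0.x):
```python
from haystack.retriever.dense import DensePassageRetriever

# Placeholder models: substitute a language model in your target language;
# `document_store` is assumed to be an already initialized DocumentStore
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="bert-base-german-cased",
    passage_embedding_model="bert-base-german-cased",
)

# The dataset must be in the DPR training format described in the tutorial
retriever.train(
    data_dir="data/dpr_training",
    train_filename="train.json",
    dev_filename="dev.json",
    n_epochs=3,
    save_dir="trained_dpr",
)
```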
@ -71,6 +86,20 @@ there are a couple QA models that are directly usable in Haystack.
<div class="tabs innertabslanguage">
<div class="tabinner">
<input type="radio" id="tab-5-1" name="tab-group-5" checked>
<label class="labelinner" for="tab-5-1">German</label>
<div class="tabcontentinner">
```python
from haystack.reader import FARMReader
reader = FARMReader("deepset/gelectra-large-germanquad")
```
</div>
</div>
<div class="tabinner">
<input type="radio" id="tab-5-2" name="tab-group-5">
<label class="labelinner" for="tab-5-2">French</label>
@ -99,6 +128,56 @@ reader = FARMReader("mrm8488/bert-italian-finedtuned-squadv1-it-alfa")
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-3" name="tab-group-6">
<label class="labelinner" for="tab-6-3">Chinese</label>
<div class="tabcontentinner">
```python
from haystack.reader import FARMReader
reader = FARMReader("uer/roberta-base-chinese-extractive-qa")
# or
reader = FARMReader("wptoux/albert-chinese-large-qa")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-4" name="tab-group-6">
<label class="labelinner" for="tab-6-4">Spanish</label>
<div class="tabcontentinner">
```python
from haystack.reader import FARMReader
reader = FARMReader("mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es")
# or
reader = FARMReader("mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-5" name="tab-group-6">
<label class="labelinner" for="tab-6-5">Portuguese</label>
<div class="tabcontentinner">
```python
from haystack.reader import FARMReader
reader = FARMReader("pierreguillou/bert-base-cased-squad-v1.1-portuguese")
# or
reader = FARMReader("pucpr/bioBERTpt-squad-v1.1-portuguese")
```
</div>
</div>
<div class="tabinner">
<input type="radio" id="tab-5-3" name="tab-group-5">
<label class="labelinner" for="tab-5-3">Zero-shot</label>
@ -125,6 +204,21 @@ reader = FARMReader("deepset/xlm-roberta-large-squad2")
<div class="tabs innertabslanguage">
<div class="tabinner">
<input type="radio" id="tab-5-1" name="tab-group-5" checked>
<label class="labelinner" for="tab-5-1">German</label>
<div class="tabcontentinner">
```python
from haystack.reader import TransformersReader
reader = TransformersReader("deepset/gelectra-large-germanquad")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-1" name="tab-group-6" checked>
<label class="labelinner" for="tab-6-1">French</label>
@ -153,6 +247,55 @@ reader = TransformersReader("mrm8488/bert-italian-finedtuned-squadv1-it-alfa")
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-3" name="tab-group-6">
<label class="labelinner" for="tab-6-3">Chinese</label>
<div class="tabcontentinner">
```python
from haystack.reader import TransformersReader
reader = TransformersReader("uer/roberta-base-chinese-extractive-qa")
# or
reader = TransformersReader("wptoux/albert-chinese-large-qa")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-4" name="tab-group-6">
<label class="labelinner" for="tab-6-4">Spanish</label>
<div class="tabcontentinner">
```python
from haystack.reader import TransformersReader
reader = TransformersReader("mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es")
# or
reader = TransformersReader("mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-5" name="tab-group-6">
<label class="labelinner" for="tab-6-5">Portuguese</label>
<div class="tabcontentinner">
```python
from haystack.reader import TransformersReader
reader = TransformersReader("pierreguillou/bert-base-cased-squad-v1.1-portuguese")
# or
reader = TransformersReader("pucpr/bioBERTpt-squad-v1.1-portuguese")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-6" name="tab-group-6">
<label class="labelinner" for="tab-6-6">Zero-shot</label>
@ -174,9 +317,11 @@ reader = TransformersReader("deepset/xlm-roberta-large-squad2")
</div>
We are the creators of the **German** model and you can find out more about it [here](https://deepset.ai/germanquad).
The **French**, **Italian**, **Spanish**, **Portuguese** and **Chinese** models are monolingual language models trained on versions of the SQuAD dataset in their respective languages
and their authors report decent results in their model cards
(e.g. [here](https://huggingface.co/illuin/camembert-base-fquad) and [here](https://huggingface.co/mrm8488/bert-italian-finedtuned-squadv1-it-alfa)).
There are also Korean QA models on the Model Hub, but their performance is not reported.
The **zero-shot model** that is shown above is a **multilingual XLM-RoBERTa Large** that is trained on English SQuAD.
@ -186,5 +331,3 @@ but still its performance lags behind that of the monolingual models.
Nonetheless, if there is not yet a monolingual model for your language and it is one of the 100 supported by XLM-RoBERTa,
this zero-shot model may serve as a decent first baseline.
[//]: # (Add link to Reader training, create section in reader.md on training Reader)

View File

@ -9,6 +9,31 @@ id: "optimizationmd"
# Optimization
## Speeding up Reader
In most pipelines, the Reader will be the most computationally expensive component.
If this is a step that you would like to speed up, you can opt for a smaller Reader model
that can process more passages in the same amount of time.
On our [benchmarks page](https://haystack.deepset.ai/bm/benchmarks), you will find a comparison of
many of the common model architectures. While our default recommendation is RoBERTa,
MiniLM offers much faster processing for only a minimal drop in accuracy.
You can find the models that we've trained on [the HuggingFace Model Hub](https://huggingface.co/deepset).
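For example, swapping in our MiniLM model is a one-line change (a sketch; the model name is taken from the Model Hub link above):
```python
from haystack.reader import FARMReader

# A smaller, faster reader model with only a minimal drop in accuracy
reader = FARMReader("deepset/minilm-uncased-squad2")
```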
## GPU acceleration
The transformer-based models used in Haystack are designed to be run on a GPU-enabled machine.
The design of these models means that they greatly benefit from the parallel processing capabilities of graphics cards.
If Haystack has successfully detected a graphics card, you should see these lines in your console output.
```
INFO - farm.utils - Using device: CUDA
INFO - farm.utils - Number of GPUs: 1
```
You can track the workload on your CUDA-enabled NVIDIA GPU by monitoring the output of `nvidia-smi -l` on the command line
while your Haystack program is running.
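GPU usage is controlled via the `use_gpu` argument of the Reader and Retriever classes (a sketch; `use_gpu` defaults to `True` in FARMReader as of Haystack 0.x):
```python
from haystack.reader import FARMReader

# use_gpu=True places the model on a detected CUDA device,
# falling back to CPU if no graphics card is found
reader = FARMReader("deepset/roberta-base-squad2", use_gpu=True)
```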
## Document Length
Document length has a very direct impact on the speed of the Reader
@ -17,7 +42,8 @@ which is why we recommend using the `PreProcessor` class to clean and split your
For **sparse retrievers**, very long documents pose a challenge since the signal of the relevant section of text
can get washed out by the rest of the document.
To get a good balance between Reader speed and Retriever performance, we recommend splitting documents to a maximum of 500 words.
If there is no Reader in the pipeline following the Retriever, we recommend that **documents be no longer than 10,000 words**.
**Dense retrievers** are limited in the length of text that they can read in one pass.
As such, it is important that documents are not longer than the dense retriever's maximum input length.
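To perform the 500-word split recommended above, you can use the `PreProcessor` (a sketch; argument names as in Haystack 0.x):
```python
from haystack.preprocessor import PreProcessor

# Split documents into chunks of at most 500 words without cutting sentences apart
preprocessor = PreProcessor(
    split_by="word",
    split_length=500,
    split_respect_sentence_boundary=True,
)
docs = preprocessor.process({"text": "<your long document text>"})
```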
@ -55,3 +81,27 @@ or like this if directly calling the `Retriever`:
```python
retrieved_docs = retriever.retrieve(query="Why did the revenue increase?", top_k=10)
```
## Metadata Filtering
Metadata can be attached to the documents which you index into your DocumentStore (see the input data format [here](/docs/latest/retrievermd)).
At query time, you can apply filters based on this metadata to limit the scope of your search and ensure your answers
come from a specific slice of your data.
For example, if you have a set of annual reports from various companies,
you may want to perform a search on just a specific year, or on a small selection of companies.
This can reduce the workload of the retriever and also ensure that you get more relevant results.
Filters are applied via the `filters` argument of the `Retriever` class. In practice, this argument will probably
be passed into the `Pipeline.run()` call, which will then route it to the `Retriever` class
(see the Arguments section on the [Pipelines page](/docs/latest/pipelinesmd) for an explanation).
```python
pipeline.run(
    query="Why did the revenue increase?",
    filters={
        "years": ["2019"],
        "companies": ["BMW", "Mercedes"]
    }
)
```

View File

@ -30,6 +30,21 @@ p.draw(path="custom_pipe.png")
```
![image](https://user-images.githubusercontent.com/1563902/102451716-54813700-4039-11eb-881e-f3c01b47ca15.png)
### Arguments
Whatever keyword arguments are passed into the `Pipeline.run()` method will be passed on to each node in the pipeline.
For example, in the code snippet below, all nodes will receive `query`, `top_k_retriever` and `top_k_reader` as arguments,
even if they don't use them. It is therefore very important when defining custom nodes that their
keyword argument names do not clash with those of the other nodes in your pipeline.
```python
res = pipeline.run(
    query="What did Einstein work on?",
    top_k_retriever=1,
    top_k_reader=5
)
```
### YAML File Definitions
For your convenience, there is also the option of defining and loading pipelines in YAML files.
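For instance, a pipeline defined in a YAML file can be loaded like this (a sketch; the file path is a placeholder):
```python
from pathlib import Path
from haystack import Pipeline

# "pipelines.yaml" is a placeholder for your own pipeline definition file
pipeline = Pipeline.load_from_yaml(Path("pipelines.yaml"))
```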