mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-12-24 13:38:53 +00:00
Documentation update (#1162)
* Add content
* Add German BERT references
* Mention preprocessor language
* Fix mypy CI
* Add document length recommendation
* Add more languages
This commit is contained in:
parent 41b537affe
commit 13edff109d
@@ -29,7 +29,7 @@ Initialising a new DocumentStore within Haystack is straightforward.

[Install](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html)
Elasticsearch and then [start](https://www.elastic.co/guide/en/elasticsearch/reference/current/starting-elasticsearch.html)
an instance.

If you have Docker set up, we recommend pulling the Docker image and running it.

```bash
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.9.2
docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.9.2
```

Note that we also have a utility function `haystack.utils.launch_es` that can start up an Elasticsearch instance.
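Before initializing the DocumentStore, you can check that the instance is actually reachable. A minimal standard-library sketch (the host and port are assumptions matching the Docker command above):

```python
from urllib.request import urlopen

def es_is_up(host="localhost", port=9200, timeout=2):
    """Return True if an Elasticsearch instance responds on host:port."""
    try:
        with urlopen(f"http://{host}:{port}", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused or timed out: no instance reachable.
        return False

print("Elasticsearch is up!" if es_is_up() else "No instance found on port 9200.")
```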
Next you can initialize the Haystack object that will connect to this instance.

```python

@@ -60,7 +62,8 @@ Use e.g. [aws-requests-auth](https://github.com/davidmuller/aws-requests-auth) t

<label class="labelouter" for="tab-1-2">Milvus</label>
<div class="tabcontent">

Follow the [official documentation](https://www.milvus.io/docs/v1.0.0/milvus_docker-cpu.md) to start a Milvus instance via Docker.
Note that we also have a utility function `haystack.utils.launch_milvus` that can start up a Milvus instance.

You can initialize the Haystack object that will connect to this instance as follows:

```python
@@ -13,6 +13,20 @@ Haystack is well suited to open-domain QA on languages other than English.

While our defaults are tuned for English,
you will find some tips and tricks here for using Haystack in your language.

## PreProcessor

<div class="recommendation">

**Note**

This feature will be implemented by [this PR](https://github.com/deepset-ai/haystack/pull/1160).

</div>

The PreProcessor's sentence tokenization is language specific.
If you are using the PreProcessor on a language other than English,
make sure to set the `language` argument when initializing it.
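To see why the language matters, consider a naive splitter that breaks on any period followed by whitespace: abbreviations common in other languages (here German "z.B.", i.e. "e.g.") create spurious sentence boundaries. A toy illustration, not the PreProcessor's actual tokenizer:

```python
import re

def naive_sentence_split(text):
    # Split after sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

german = "Wir nutzen z.B. Transformers. Das Modell ist gut."
sentences = naive_sentence_split(german)
# The abbreviation "z.B." incorrectly produces a third "sentence"
# where a German-aware tokenizer would find only two.
```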
## Retrievers

The sparse retriever methods themselves (BM25, TF-IDF) are language agnostic.
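"Language agnostic" here just means these scores are computed from exact token overlap, so they work on any tokenizable text. A toy term-frequency scorer (not Haystack's BM25 implementation) makes the point:

```python
from collections import Counter

def tf_score(query, doc):
    # Count how often each query token appears in the document;
    # no language-specific resources are involved.
    counts = Counter(doc.lower().split())
    return sum(counts[token] for token in query.lower().split())

docs = ["die Hauptstadt von Deutschland ist Berlin", "Paris ist eine schöne Stadt"]
best = max(docs, key=lambda d: tf_score("Hauptstadt von Deutschland", d))
```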
@@ -36,7 +50,8 @@ document_store = ElasticsearchDocumentStore(analyzer="thai")

The models used in dense retrievers are language specific.
Be sure to check the language of the model used in your EmbeddingRetriever.
The default model that is loaded in the DensePassageRetriever is for English.
We have created a [German DensePassageRetriever model](https://deepset.ai/germanquad) and know other teams who work on further languages.
If you have a language model and a question answering dataset in your own language, you can also train a DPR model using Haystack!
Below is a simplified example.
See [our tutorial](/docs/latest/tutorial9md) and also the [API reference](/docs/latest/apiretrievermd#train) for `DensePassageRetriever.train()` for more details.
@@ -71,6 +86,20 @@ there are a couple QA models that are directly usable in Haystack.

<div class="tabs innertabslanguage">

<div class="tabinner">
<input type="radio" id="tab-5-1" name="tab-group-5" checked>
<label class="labelinner" for="tab-5-1">German</label>
<div class="tabcontentinner">

```python
from haystack.reader import FARMReader

reader = FARMReader("deepset/gelectra-large-germanquad")
```

</div>
</div>

<div class="tabinner">
<input type="radio" id="tab-5-2" name="tab-group-5">
<label class="labelinner" for="tab-5-2">French</label>
@@ -99,6 +128,56 @@ reader = FARMReader("mrm8488/bert-italian-finedtuned-squadv1-it-alfa")

</div>
</div>

<div class="tabinner2">
<input type="radio" id="tab-6-3" name="tab-group-6">
<label class="labelinner" for="tab-6-3">Chinese</label>
<div class="tabcontentinner">

```python
from haystack.reader import FARMReader

reader = FARMReader("uer/roberta-base-chinese-extractive-qa")
# or
reader = FARMReader("wptoux/albert-chinese-large-qa")
```

</div>
</div>

<div class="tabinner2">
<input type="radio" id="tab-6-4" name="tab-group-6">
<label class="labelinner" for="tab-6-4">Spanish</label>
<div class="tabcontentinner">

```python
from haystack.reader import FARMReader

reader = FARMReader("mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es")
# or
reader = FARMReader("mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es")
```

</div>
</div>

<div class="tabinner2">
<input type="radio" id="tab-6-5" name="tab-group-6">
<label class="labelinner" for="tab-6-5">Portuguese</label>
<div class="tabcontentinner">

```python
from haystack.reader import FARMReader

reader = FARMReader("pierreguillou/bert-base-cased-squad-v1.1-portuguese")
# or
reader = FARMReader("pucpr/bioBERTpt-squad-v1.1-portuguese")
```

</div>
</div>

<div class="tabinner">
<input type="radio" id="tab-5-3" name="tab-group-5">
<label class="labelinner" for="tab-5-3">Zero-shot</label>
@@ -125,6 +204,21 @@ reader = FARMReader("deepset/xlm-roberta-large-squad2")

<div class="tabs innertabslanguage">

<div class="tabinner">
<input type="radio" id="tab-5-1" name="tab-group-5" checked>
<label class="labelinner" for="tab-5-1">German</label>
<div class="tabcontentinner">

```python
from haystack.reader import TransformersReader

reader = TransformersReader("deepset/gelectra-large-germanquad")
```

</div>
</div>

<div class="tabinner2">
<input type="radio" id="tab-6-1" name="tab-group-6" checked>
<label class="labelinner" for="tab-6-1">French</label>
@@ -153,6 +247,55 @@ reader = TransformersReader("mrm8488/bert-italian-finedtuned-squadv1-it-alfa")

</div>
</div>

<div class="tabinner2">
<input type="radio" id="tab-6-3" name="tab-group-6">
<label class="labelinner" for="tab-6-3">Chinese</label>
<div class="tabcontentinner">

```python
from haystack.reader import TransformersReader

reader = TransformersReader("uer/roberta-base-chinese-extractive-qa")
# or
reader = TransformersReader("wptoux/albert-chinese-large-qa")
```

</div>
</div>

<div class="tabinner2">
<input type="radio" id="tab-6-4" name="tab-group-6">
<label class="labelinner" for="tab-6-4">Spanish</label>
<div class="tabcontentinner">

```python
from haystack.reader import TransformersReader

reader = TransformersReader("mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es")
# or
reader = TransformersReader("mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es")
```

</div>
</div>

<div class="tabinner2">
<input type="radio" id="tab-6-5" name="tab-group-6">
<label class="labelinner" for="tab-6-5">Portuguese</label>
<div class="tabcontentinner">

```python
from haystack.reader import TransformersReader

reader = TransformersReader("pierreguillou/bert-base-cased-squad-v1.1-portuguese")
# or
reader = TransformersReader("pucpr/bioBERTpt-squad-v1.1-portuguese")
```

</div>
</div>

<div class="tabinner2">
<input type="radio" id="tab-6-6" name="tab-group-6">
<label class="labelinner" for="tab-6-6">Zero-shot</label>
@@ -174,9 +317,11 @@ reader = TransformersReader("deepset/xlm-roberta-large-squad2")

</div>

We are the creators of the **German** model and you can find out more about it [here](https://deepset.ai/germanquad).
The **French**, **Italian**, **Spanish**, **Portuguese** and **Chinese** models are monolingual language models trained on versions of the SQuAD dataset in their respective languages,
and their authors report decent results in their model cards
(e.g. [here](https://huggingface.co/illuin/camembert-base-fquad) and [here](https://huggingface.co/mrm8488/bert-italian-finedtuned-squadv1-it-alfa)).
There also exist Korean QA models on the model hub but their performance is not reported.

The **zero-shot model** that is shown above is a **multilingual XLM-RoBERTa Large** that is trained on English SQuAD.

@@ -186,5 +331,3 @@ but still its performance lags behind that of the monolingual models.

Nonetheless, if there is not yet a monolingual model for your language and it is one of the 100 supported by XLM-RoBERTa,
this zero-shot model may serve as a decent first baseline.

[//]: # (Add link to Reader training, create section in reader.md on training Reader)
@@ -9,6 +9,31 @@ id: "optimizationmd"

# Optimization

## Speeding up Reader

In most pipelines, the Reader will be the most computationally expensive component.
If this is a step that you would like to speed up, you can opt for a smaller Reader model
that can process more passages in the same amount of time.

On our [benchmarks page](https://haystack.deepset.ai/bm/benchmarks), you will find a comparison of
many of the common model architectures. While our default recommendation is RoBERTa,
MiniLM offers much faster processing for only a minimal drop in accuracy.
You can find the models that we've trained on [the HuggingFace Model Hub](https://huggingface.co/deepset).
## GPU acceleration

The transformer-based models used in Haystack are designed to be run on a GPU-enabled machine.
The design of these models means that they greatly benefit from the parallel processing capabilities of graphics cards.
If Haystack has successfully detected a graphics card, you should see these lines in your console output.

```
INFO - farm.utils - Using device: CUDA
INFO - farm.utils - Number of GPUs: 1
```

You can track the workload on your CUDA-enabled Nvidia GPU by tracking the output of `nvidia-smi -l` on the command line
while your Haystack program is running.
## Document Length

Document length has a very direct impact on the speed of the Reader

@@ -17,7 +42,8 @@ which is why we recommend using the `PreProcessor` class to clean and split your

For **sparse retrievers**, very long documents pose a challenge since the signal of the relevant section of text
can get washed out by the rest of the document.
To get a good balance between Reader speed and Retriever performance, we split documents to a maximum of 500 words.
If there is no Reader in the pipeline following the Retriever, we recommend that **documents be no longer than 10,000 words**.
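The 500-word recommendation can be sketched as a plain word-based splitter. This is a toy version of what the `PreProcessor`'s split settings do; real preprocessing also respects sentence boundaries:

```python
def split_document(text, max_words=500):
    # Break a long document into chunks of at most max_words words.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

chunks = split_document("lorem " * 1200)
# 1200 words are split into chunks of 500, 500 and 200 words.
```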
**Dense retrievers** are limited in the length of text that they can read in one pass.
As such, it is important that documents are not longer than the dense retriever's maximum input length.
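A simple guard is to check document length before embedding. Note that real models count subword tokens rather than words, so the limit below is an assumed stand-in for illustration:

```python
MAX_INPUT_WORDS = 512  # assumed stand-in for the model's real token limit

def fits_in_one_pass(text, limit=MAX_INPUT_WORDS):
    # Word count is only a rough proxy for the subword token count.
    return len(text.split()) <= limit

short_ok = fits_in_one_pass("a short passage")
long_ok = fits_in_one_pass("word " * 600)
```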
@@ -55,3 +81,27 @@ or like this if directly calling the `Retriever`:

```python
retrieved_docs = retriever.retrieve(top_k=10)
```
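Whichever way it is set, `top_k` simply caps the ranked candidate list. Conceptually (a toy sketch, not the Retriever's internals):

```python
scored = [("doc_a", 0.91), ("doc_b", 0.72), ("doc_c", 0.40), ("doc_d", 0.18)]

def top_k_docs(scored, top_k):
    # Sort by score (descending) and keep only the best top_k entries.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

best_two = top_k_docs(scored, top_k=2)
```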
## Metadata Filtering

Metadata can be attached to the documents which you index into your DocumentStore (see the input data format [here](/docs/latest/retrievermd)).
At query time, you can apply filters based on this metadata to limit the scope of your search and ensure your answers
come from a specific slice of your data.

For example, if you have a set of annual reports from various companies,
you may want to perform a search on just a specific year, or on a small selection of companies.
This can reduce the workload of the retriever and also ensure that you get more relevant results.

Filters are applied via the `filters` argument of the `Retriever` class. In practice, this argument will probably
be passed into the `Pipeline.run()` call, which will then route it on to the `Retriever` class
(see the Arguments section on the [Pipelines page](/docs/latest/pipelinesmd) for an explanation).

```python
pipeline.run(
    query="Why did the revenue increase?",
    filters={
        "years": ["2019"],
        "companies": ["BMW", "Mercedes"]
    }
)
```
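Conceptually, each filter key restricts matching documents to those whose metadata value is in the allowed list. Over plain dicts the behaviour looks like this (a toy illustration of the semantics, not the DocumentStore's query implementation; the example documents are made up):

```python
documents = [
    {"text": "Revenue increased due to EV sales.", "meta": {"years": "2019", "companies": "BMW"}},
    {"text": "Revenue was flat year over year.", "meta": {"years": "2018", "companies": "BMW"}},
    {"text": "Revenue increased on strong exports.", "meta": {"years": "2019", "companies": "Audi"}},
]

def apply_filters(docs, filters):
    # Keep documents whose metadata satisfies every filter clause.
    return [
        d for d in docs
        if all(d["meta"].get(key) in allowed for key, allowed in filters.items())
    ]

hits = apply_filters(documents, {"years": ["2019"], "companies": ["BMW", "Mercedes"]})
# Only the first document matches both the year and the company filter.
```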
@@ -30,6 +30,21 @@ p.draw(path="custom_pipe.png")

```

![image](https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/img/custom_pipe.png)

### Arguments

Whatever keyword arguments are passed into the `Pipeline.run()` method will be passed on to each node in the pipeline.
For example, in the code snippet below, all nodes will receive `query`, `top_k_retriever` and `top_k_reader` as arguments,
even if they don't use those arguments. It is therefore very important when defining custom nodes that their
keyword argument names do not clash with the other nodes in your pipeline.

```python
res = pipeline.run(
    query="What did Einstein work on?",
    top_k_retriever=1,
    top_k_reader=5
)
```
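The broadcast behaviour, and why clashing keyword names are dangerous, can be sketched with stand-in nodes (hypothetical classes for illustration, not Haystack's node API):

```python
class ToyNode:
    def __init__(self, name, accepts):
        self.name = name
        self.accepts = accepts

    def run(self, **kwargs):
        # Every node receives all keyword arguments and keeps
        # only the ones it actually declares.
        return {k: v for k, v in kwargs.items() if k in self.accepts}

nodes = [
    ToyNode("retriever", accepts={"query", "top_k_retriever"}),
    ToyNode("reader", accepts={"query", "top_k_reader"}),
]
received = {
    n.name: n.run(query="What did Einstein work on?", top_k_retriever=1, top_k_reader=5)
    for n in nodes
}
# If both nodes declared an argument named `top_k`, a single value would
# silently apply to both, which is why names must not clash.
```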
### YAML File Definitions

For your convenience, there is also the option of defining and loading pipelines in YAML files.