Markus Paff d3fd888a76
Release Docs 0.10.0 (#1460)
* updated tutorials and docstrings and new version

* update to correct directory structure
2021-09-23 16:22:14 +02:00

334 lines
9.3 KiB
Markdown

<!---
title: "Languages Other Than English"
metaTitle: "Languages Other Than English"
metaDescription: ""
slug: "/docs/languages"
date: "2020-11-05"
id: "languagesmd"
--->
# Languages Other Than English
Haystack is well suited to open-domain QA on languages other than English.
While our defaults are tuned for English,
you will find some tips and tricks here for using Haystack in your language.
##Preprocessor
<div class="recommendation">
**Note**
This feature will be implemented by [this PR.](https://github.com/deepset-ai/haystack/pull/1160)
</div>
The PreProcessor's sentence tokenization is language specific.
If you are using the PreProcessor on a language other than English,
make sure to set the `language` argument when initializing it.
##Retrievers
The sparse retriever methods themselves(BM25, TF-IDF) are language agnostic.
Their only requirement is that the text be split into words.
The ElasticsearchDocumentStore relies on an analyzer to impose word boundaries,
but also to handle punctuation, casing and stop words.
The default analyzer is an English analyzer.
While it can still work decently for a large range of languages,
you will want to set it to your language's analyzer for optimal performance.
In some cases, such as with Thai, the default analyzer is completely incompatible.
See [this page](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html)
for the full list of language specific analyzers.
```python
from haystack.document_store import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(analyzer="thai")
```
The models used in dense retrievers are language specific.
Be sure to check language of the model used in your EmbeddingRetriever.
The default model that is loaded in the DensePassageRetriever is for English.
We have created a [German DensePassageRetriever model](https://deepset.ai/germanquad) and know other teams who work on further languages.
If you have a language model and a question answering dataset in your own language, you can also train a DPR model using Haystack!
Below is a simplified example.
See [our tutorial](/docs/latest/tutorial9md) and also the [API reference](/docs/latest/apiretrievermd#train) for `DensePassageRetriever.train()` for more details.
```python
from haystack.retriever import DensePassageRetriever
dense_passage_retriever = DensePassageRetriever(document_store)
dense_passage_retriever.train(self,
data_dir: str,
train_filename: str,
dev_filename: str = None,
test_filename: str = None,
batch_size: int = 16,
embed_title: bool = True,
num_hard_negatives: int = 1,
n_epochs: int = 3)
```
##Readers
While models are comparatively more performant on English,
thanks to a wealth of available English training data,
there are a couple QA models that are directly usable in Haystack.
<div class="tabs tabsreaderlanguage">
<div class="tab">
<input type="radio" id="tab-4-1" name="tab-group-4" checked>
<label class="labelouter" for="tab-4-1">FARM</label>
<div class="tabcontent">
<div class="tabs innertabslanguage">
<div class="tabinner">
<input type="radio" id="tab-5-1" name="tab-group-5" checked>
<label class="labelinner" for="tab-5-1">German</label>
<div class="tabcontentinner">
```python
from haystack.reader import FARMReader
reader = FARMReader("deepset/gelectra-large-germanquad")
```
</div>
</div>
<div class="tabinner">
<input type="radio" id="tab-5-1" name="tab-group-5" checked>
<label class="labelinner" for="tab-5-1">French</label>
<div class="tabcontentinner">
```python
from haystack.reader import FARMReader
reader = FARMReader("illuin/camembert-base-fquad")
```
</div>
</div>
<div class="tabinner">
<input type="radio" id="tab-5-2" name="tab-group-5">
<label class="labelinner" for="tab-5-2">Italian</label>
<div class="tabcontentinner">
```python
from haystack.reader import FARMReader
reader = FARMReader("mrm8488/bert-italian-finedtuned-squadv1-it-alfa")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-3" name="tab-group-6">
<label class="labelinner" for="tab-6-3">Chinese</label>
<div class="tabcontentinner">
```python
from haystack.reader import FARMReader
reader = FARMReader("uer/roberta-base-chinese-extractive-qa")
# or
reader = FARMReader("wptoux/albert-chinese-large-qa")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-3" name="tab-group-6">
<label class="labelinner" for="tab-6-3">Spanish</label>
<div class="tabcontentinner">
```python
from haystack.reader import FARMReader
reader = FARMReader("mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es")
# or
reader = FARMReader("mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-3" name="tab-group-6">
<label class="labelinner" for="tab-6-3">Portuguese</label>
<div class="tabcontentinner">
```python
from haystack.reader import FARMReader
reader = FARMReader("pierreguillou/bert-base-cased-squad-v1.1-portuguese")
# or
reader = FARMReader("pucpr/bioBERTpt-squad-v1.1-portuguese")
```
</div>
</div>
<div class="tabinner">
<input type="radio" id="tab-5-3" name="tab-group-5">
<label class="labelinner" for="tab-5-3">Zero-shot</label>
<div class="tabcontentinner">
```python
from haystack.reader import FARMReader
reader = FARMReader("deepset/xlm-roberta-large-squad2")
```
</div>
</div>
</div>
</div>
</div>
<div class="tab">
<input type="radio" id="tab-4-2" name="tab-group-4">
<label class="labelouter" for="tab-4-2">Transformers</label>
<div class="tabcontent">
<div class="tabs innertabslanguage">
<div class="tabinner">
<input type="radio" id="tab-5-1" name="tab-group-5" checked>
<label class="labelinner" for="tab-5-1">German</label>
<div class="tabcontentinner">
```python
from haystack.reader import TransformersReader
reader = TransformersReader("deepset/gelectra-large-germanquad")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-1" name="tab-group-6" checked>
<label class="labelinner" for="tab-6-1">French</label>
<div class="tabcontentinner">
```python
from haystack.reader import TransformersReader
reader = TransformersReader("illuin/camembert-base-fquad")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-2" name="tab-group-6">
<label class="labelinner" for="tab-6-2">Italian</label>
<div class="tabcontentinner">
```python
from haystack.reader import TransformersReader
reader = TransformersReader("mrm8488/bert-italian-finedtuned-squadv1-it-alfa")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-3" name="tab-group-6">
<label class="labelinner" for="tab-6-3">Chinese</label>
<div class="tabcontentinner">
```python
from haystack.reader import TransformersReader
reader = TransformersReader("uer/roberta-base-chinese-extractive-qa")
# or
reader = TransformersReader("wptoux/albert-chinese-large-qa")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-3" name="tab-group-6">
<label class="labelinner" for="tab-6-3">Spanish</label>
<div class="tabcontentinner">
```python
from haystack.reader import TransformersReader
reader = TransformersReader("mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es")
# or
reader = TransformersReader("mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-3" name="tab-group-6">
<label class="labelinner" for="tab-6-3">Portuguese</label>
<div class="tabcontentinner">
```python
from haystack.reader import TransformersReader
reader = TransformersReader("pierreguillou/bert-base-cased-squad-v1.1-portuguese")
# or
reader = TransformersReader("pucpr/bioBERTpt-squad-v1.1-portuguese")
```
</div>
</div>
<div class="tabinner2">
<input type="radio" id="tab-6-3" name="tab-group-6">
<label class="labelinner" for="tab-6-3">Zero-shot</label>
<div class="tabcontentinner">
```python
from haystack.reader import TransformersReader
reader = TransformersReader("deepset/xlm-roberta-large-squad2")
```
</div>
</div>
</div>
</div>
</div>
</div>
We are the creators of the **German** model and you can find out more about it [here](https://deepset.ai/germanquad)
The **French**, **Italian**, **Spanish**, **Portuguese** and **Chinese** models are monolingual language models trained on versions of the SQuAD dataset in their respective languages
and their authors report decent results in their model cards
(e.g. [here](https://huggingface.co/illuin/camembert-base-fquad) and [here](https://huggingface.co/mrm8488/bert-italian-finedtuned-squadv1-it-alfa)).
There also exist Korean QA models on the model hub but their performance is not reported.
The **zero-shot model** that is shown above is a **multilingual XLM-RoBERTa Large** that is trained on English SQuAD.
It is clear, from our [evaluations](https://huggingface.co/deepset/xlm-roberta-large-squad2#model_card),
that the model has been able to transfer some of its English QA capabilities to other languages,
but still its performance lags behind that of the monolingual models.
Nonetheless, if there is not yet a monolingual model for your language and it is one of the 100 supported by XLM-RoBERTa,
this zero-shot model may serve as a decent first baseline.