Mirror of https://github.com/deepset-ai/haystack.git, synced 2026-01-06 20:17:14 +00:00
Regenerate API and Tutorial md files (#1480)
* Change punctuation
* Add latest docstring and tutorial changes
* Change punctuation
* Add documentation for Docs2Answer
* Add latest docstring and tutorial changes
* Generate new API docs
* Replace Finder with Pipeline
* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This commit is contained in:
parent 05da7f71dd
commit 2c4baa7f4e

docs/_src/api/api/classifier.md (normal file, 199 additions)
@@ -0,0 +1,199 @@

<a name="base"></a>
# Module base

<a name="base.BaseClassifier"></a>
## BaseClassifier Objects

```python
class BaseClassifier(BaseComponent)
```

<a name="base.BaseClassifier.timing"></a>
#### timing

```python
 | timing(fn, attr_name)
```

Wrapper method used to time functions.

<a name="farm"></a>
# Module farm

<a name="farm.FARMClassifier"></a>
## FARMClassifier Objects

```python
class FARMClassifier(BaseClassifier)
```

This node classifies documents and adds the output from the classification step to the document's meta data.
The meta field of the document is a dictionary with the following format:
'meta': {'name': '450_Baelor.txt', 'classification': {'label': 'neutral', 'probability': 0.9997646, ...} }

|  With a FARMClassifier, you can:
 - directly get predictions via predict()
 - fine-tune the model on text classification training data via train()

Usage example:
...
retriever = ElasticsearchRetriever(document_store=document_store)
classifier = FARMClassifier(model_name_or_path="deepset/bert-base-german-cased-sentiment-Germeval17")
p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=classifier, name="Classifier", inputs=["Retriever"])

res = p.run(
    query="Who is the father of Arya Stark?",
    params={"Retriever": {"top_k": 10}, "Classifier": {"top_k": 5}}
)

print(res["documents"][0].to_dict()["meta"]["classification"]["label"])
__Note that print_documents() does not output the content of the classification field in the meta data__

__document_dicts = [doc.to_dict() for doc in res["documents"]]__
__res["documents"] = document_dicts__
__print_documents(res, max_text_len=100)__

<a name="farm.FARMClassifier.__init__"></a>
#### \_\_init\_\_

```python
 | __init__(model_name_or_path: Union[str, Path], model_version: Optional[str] = None, batch_size: int = 50, use_gpu: bool = True, top_k: int = 10, num_processes: Optional[int] = None, max_seq_len: int = 256, progress_bar: bool = True)
```

**Arguments**:

- `model_name_or_path`: Directory of a saved model or the name of a public model, e.g. 'deepset/bert-base-german-cased-sentiment-Germeval17'.
See https://huggingface.co/models for a full list of available models.
- `model_version`: The version of the model to use from the HuggingFace model hub. Can be a tag name, branch name, or commit hash.
- `batch_size`: Number of samples the model receives in one batch for inference.
Memory consumption is much lower in inference mode. Recommendation: increase the batch size
to a value so that only a single batch is used.
- `use_gpu`: Whether to use GPU (if available)
- `top_k`: The maximum number of documents to return
- `num_processes`: The number of processes for `multiprocessing.Pool`. Set to a value of 0 to disable
multiprocessing. Set to None to let the Inferencer determine the optimum number. If you
want to debug the Language Model, you might need to disable multiprocessing!
- `max_seq_len`: Max sequence length of one input text for the model
- `progress_bar`: Whether to show a tqdm progress bar or not.
Can be helpful to disable in production deployments to keep the logs clean.
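As a quick orientation, the parameters above map onto the constructor roughly as follows. This is a minimal sketch: the values are illustrative and the `from haystack.classifier import FARMClassifier` import path is an assumption, not taken from this page.

```python
from haystack.classifier import FARMClassifier  # import path assumed

# Illustrative values; the model name is the one used in the usage example above
classifier = FARMClassifier(
    model_name_or_path="deepset/bert-base-german-cased-sentiment-Germeval17",
    batch_size=50,        # samples per inference batch
    use_gpu=True,         # falls back to CPU if no GPU is available
    top_k=10,             # maximum number of documents to return
    max_seq_len=256,      # longer inputs are truncated
    progress_bar=True,
)
```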
<a name="farm.FARMClassifier.train"></a>
|
||||
#### train
|
||||
|
||||
```python
|
||||
| train(data_dir: str, train_filename: str, label_list: List[str], delimiter: str, metric: str, dev_filename: Optional[str] = None, test_filename: Optional[str] = None, use_gpu: Optional[bool] = None, batch_size: int = 10, n_epochs: int = 2, learning_rate: float = 1e-5, max_seq_len: Optional[int] = None, warmup_proportion: float = 0.2, dev_split: float = 0, evaluate_every: int = 300, save_dir: Optional[str] = None, num_processes: Optional[int] = None, use_amp: str = None)
|
||||
```
|
||||
|
||||
Fine-tune a model on a TextClassification dataset.
|
||||
The dataset needs to be in tabular format (CSV, TSV, etc.), with columns called "label" and "text" in no specific order.
|
||||
Options:
|
||||
|
||||
- Take a plain language model (e.g. `bert-base-cased`) and train it for TextClassification
|
||||
- Take a TextClassification model and fine-tune it for your domain
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `data_dir`: Path to directory containing your training data
|
||||
- `label_list`: list of labels in the training dataset, e.g., ["0", "1"]
|
||||
- `delimiter`: delimiter that separates columns in the training dataset, e.g., "\t"
|
||||
- `metric`: evaluation metric to be used while training, e.g., "f1_macro"
|
||||
- `train_filename`: Filename of training data
|
||||
- `dev_filename`: Filename of dev / eval data
|
||||
- `test_filename`: Filename of test data
|
||||
- `dev_split`: Instead of specifying a dev_filename, you can also specify a ratio (e.g. 0.1) here
|
||||
that gets split off from training data for eval.
|
||||
- `use_gpu`: Whether to use GPU (if available)
|
||||
- `batch_size`: Number of samples the model receives in one batch for training
|
||||
- `n_epochs`: Number of iterations on the whole training data set
|
||||
- `learning_rate`: Learning rate of the optimizer
|
||||
- `max_seq_len`: Maximum text length (in tokens). Everything longer gets cut down.
|
||||
- `warmup_proportion`: Proportion of training steps until maximum learning rate is reached.
|
||||
Until that point LR is increasing linearly. After that it's decreasing again linearly.
|
||||
Options for different schedules are available in FARM.
|
||||
- `evaluate_every`: Evaluate the model every X steps on the hold-out eval dataset
|
||||
- `save_dir`: Path to store the final model
|
||||
- `num_processes`: The number of processes for `multiprocessing.Pool` during preprocessing.
|
||||
Set to value of 1 to disable multiprocessing. When set to 1, you cannot split away a dev set from train set.
|
||||
Set to None to use all CPU cores minus one.
|
||||
- `use_amp`: Optimization level of NVIDIA's automatic mixed precision (AMP). The higher the level, the faster the model.
|
||||
Available options:
|
||||
None (Don't use AMP)
|
||||
"O0" (Normal FP32 training)
|
||||
"O1" (Mixed Precision => Recommended)
|
||||
"O2" (Almost FP16)
|
||||
"O3" (Pure FP16).
|
||||
See details on: https://nvidia.github.io/apex/amp.html
|
||||
|
||||
**Returns**:
|
||||
|
||||
None
|
||||
|
||||
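For concreteness, here is a hedged sketch of a fine-tuning call based on the arguments above. The directory layout, file names, and label set are made up for the example, and the import path is assumed.

```python
from haystack.classifier import FARMClassifier  # import path assumed

classifier = FARMClassifier(model_name_or_path="bert-base-german-cased")

# Hypothetical TSV files with "text" and "label" columns
classifier.train(
    data_dir="data/sentiment",
    train_filename="train.tsv",
    dev_filename="dev.tsv",
    label_list=["negative", "neutral", "positive"],
    delimiter="\t",
    metric="f1_macro",
    n_epochs=2,
    batch_size=10,
    save_dir="saved_models/sentiment_classifier",
)
```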
<a name="farm.FARMClassifier.update_parameters"></a>
|
||||
#### update\_parameters
|
||||
|
||||
```python
|
||||
| update_parameters(max_seq_len: Optional[int] = None)
|
||||
```
|
||||
|
||||
Hot update parameters of a loaded FARMClassifier. It may not to be safe when processing concurrent requests.
|
||||
|
||||
<a name="farm.FARMClassifier.save"></a>
|
||||
#### save
|
||||
|
||||
```python
|
||||
| save(directory: Path)
|
||||
```
|
||||
|
||||
Saves the FARMClassifier model so that it can be reused at a later point in time.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `directory`: Directory where the FARMClassifier model should be saved
|
||||
|
||||
<a name="farm.FARMClassifier.predict_batch"></a>
|
||||
#### predict\_batch
|
||||
|
||||
```python
|
||||
| predict_batch(query_doc_list: List[dict], top_k: int = None, batch_size: int = None)
|
||||
```
|
||||
|
||||
Use loaded FARMClassifier model to, for a list of queries, classify each query's supplied list of Document.
|
||||
|
||||
Returns list of dictionary of query and list of document sorted by (desc.) similarity with query
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `query_doc_list`: List of dictionaries containing queries with their retrieved documents
|
||||
- `top_k`: The maximum number of answers to return for each query
|
||||
- `batch_size`: Number of samples the model receives in one batch for inference
|
||||
|
||||
**Returns**:
|
||||
|
||||
List of dictionaries containing query and list of Document with class probabilities in meta field
|
||||
|
||||
<a name="farm.FARMClassifier.predict"></a>
|
||||
#### predict
|
||||
|
||||
```python
|
||||
| predict(query: str, documents: List[Document], top_k: Optional[int] = None) -> List[Document]
|
||||
```
|
||||
|
||||
Use loaded classification model to classify the supplied list of Document.
|
||||
|
||||
Returns list of Document enriched with class label and probability, which are stored in Document.meta["classification"]
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `query`: Query string (is not used at the moment)
|
||||
- `documents`: List of Document to be classified
|
||||
- `top_k`: The maximum number of documents to return
|
||||
|
||||
**Returns**:
|
||||
|
||||
List of Document with class probabilities in meta field
|
||||
|
||||
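To round off the class documentation, a small sketch of calling `predict()` directly on a couple of Documents, outside of a pipeline. The example texts are invented, and the import paths and the `text` field name of `Document` are assumptions for this Haystack version.

```python
from haystack import Document  # import path and `text` field assumed
from haystack.classifier import FARMClassifier  # import path assumed

classifier = FARMClassifier(model_name_or_path="deepset/bert-base-german-cased-sentiment-Germeval17")

docs = [
    Document(text="Das Essen war hervorragend und der Service sehr freundlich."),
    Document(text="Leider war das Hotelzimmer laut und schmutzig."),
]

# The query is not used by the classifier at the moment, so an empty string is fine
classified_docs = classifier.predict(query="", documents=docs)

for doc in classified_docs:
    # The class label and probability are stored in Document.meta["classification"]
    print(doc.meta["classification"]["label"])
```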
@@ -985,7 +985,7 @@ the vector embeddings are indexed in a FAISS Index.

#### \_\_init\_\_

```python
 | __init__(sql_url: str = "sqlite:///", vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional["faiss.swigfaiss.Index"] = None, return_embedding: bool = False, index: str = "document", similarity: str = "dot_product", embedding_field: str = "embedding", progress_bar: bool = True, duplicate_documents: str = 'overwrite', **kwargs)
 | __init__(sql_url: str = "sqlite:///faiss_document_store.db", vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional["faiss.swigfaiss.Index"] = None, return_embedding: bool = False, index: str = "document", similarity: str = "dot_product", embedding_field: str = "embedding", progress_bar: bool = True, duplicate_documents: str = 'overwrite', **kwargs)
```

**Arguments**:

@@ -1012,8 +1012,11 @@ the vector embeddings are indexed in a FAISS Index.

or one with docs that you used in Haystack before and want to load again.
- `return_embedding`: To return document embedding
- `index`: Name of index in document store to use.
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default sine it is
more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence BERT model.
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default since it is
more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence-Transformer model.
In both cases, the returned values in Document.score are normalized to be in range [0,1]:
For `dot_product`: expit(np.asarray(raw_score / 100))
For `cosine`: (raw_score + 1) / 2
- `embedding_field`: Name of field containing an embedding vector.
- `progress_bar`: Whether to show a tqdm progress bar or not.
Can be helpful to disable in production deployments to keep the logs clean.
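For reference, the score normalization described for `similarity` above can be written out as a short helper. This is a sketch only; the function name is not part of the Haystack API.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid


def normalize_score(raw_score: float, similarity: str = "dot_product") -> float:
    """Map a raw similarity score into the [0, 1] range as described above."""
    if similarity == "dot_product":
        return float(expit(np.asarray(raw_score / 100)))
    # cosine similarities lie in [-1, 1], so shift and rescale
    return (raw_score + 1) / 2
```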
@@ -1174,14 +1177,19 @@ Find the document that is most similar to the provided `query_emb` by using a ve

#### save

```python
 | save(file_path: Union[str, Path])
 | save(index_path: Union[str, Path], config_path: Optional[Union[str, Path]] = None)
```

Save FAISS Index to the specified file.

**Arguments**:

- `file_path`: Path to save to.
- `index_path`: Path to save the FAISS index to.
- `config_path`: Path to save the initial configuration parameters to.
Defaults to the same as the file path, save the extension (.json).
This file contains all the parameters passed to FAISSDocumentStore()
at creation time (for example the SQL path, vector_dim, etc), and will be
used by the `load` method to restore the index with the appropriate configuration.

**Returns**:

@@ -1192,7 +1200,7 @@ None

```python
 | @classmethod
 | load(cls, faiss_file_path: Union[str, Path], sql_url: str, index: str)
 | load(cls, index_path: Union[str, Path], config_path: Optional[Union[str, Path]] = None)
```

Load a saved FAISS index from a file and connect to the SQL database.

@@ -1201,14 +1209,18 @@ Note: In order to have a correct mapping from FAISS to SQL,

**Arguments**:

- `faiss_file_path`: Stored FAISS index file. Can be created via calling `save()`
- `index_path`: Stored FAISS index file. Can be created via calling `save()`
- `config_path`: Stored FAISS initial configuration parameters.
Can be created via calling `save()`
- `sql_url`: Connection string to the SQL database that contains your docs and metadata.
Overrides the value defined in the `faiss_init_params_path` file, if present
- `index`: Index name to load the FAISS index as. It must match the index name used for
when creating the FAISS index.
when creating the FAISS index. Overrides the value defined in the
`faiss_init_params_path` file, if present

**Returns**:

the DocumentStore
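Taken together, the updated `save()` and `load()` signatures above allow the index and its configuration to be persisted and restored along these lines. The paths are invented for the example and the import path is an assumption for this Haystack version.

```python
from haystack.document_store import FAISSDocumentStore  # import path assumed

document_store = FAISSDocumentStore(sql_url="sqlite:///faiss_document_store.db")
# ... write documents and update embeddings ...

# Persist the index; the configuration is stored alongside it (my_index.json by default)
document_store.save(index_path="my_index.faiss")

# Later, restore the document store with the same configuration
document_store = FAISSDocumentStore.load(
    index_path="my_index.faiss",
    config_path="my_index.json",  # optional; derived from index_path if omitted
)
```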
<a name="milvus"></a>
|
||||
# Module milvus
|
||||
|
||||
@ -16,3 +16,6 @@ pydoc-markdown pydoc-markdown-knowledge-graph.yml
|
||||
pydoc-markdown pydoc-markdown-graph-retriever.yml
|
||||
pydoc-markdown pydoc-markdown-evaluation.yml
|
||||
pydoc-markdown pydoc-markdown-ranker.yml
|
||||
pydoc-markdown pydoc-markdown-question-generator.yml
|
||||
pydoc-markdown pydoc-markdown-classifier.yml
|
||||
|
||||
|
||||
@@ -824,6 +824,17 @@ Create an instance of Component.

Ray calls this method which is then re-directed to the corresponding component's run().

<a name="pipeline.Docs2Answers"></a>
## Docs2Answers Objects

```python
class Docs2Answers(BaseComponent)
```

This Node is used to convert retrieved documents into predicted answers format.
It is useful for situations where you are calling a Retriever only pipeline via REST API.
This ensures that your output is in a compatible format.
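A rough sketch of how this node could be wired into a document-search pipeline. The component names are illustrative, the retriever import path is an assumption for this Haystack version, and `document_store` is presumed to exist already.

```python
from haystack.pipeline import Pipeline, Docs2Answers
from haystack.retriever.sparse import ElasticsearchRetriever  # import path assumed

retriever = ElasticsearchRetriever(document_store=document_store)
doc2answers = Docs2Answers()

p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=doc2answers, name="Docs2Answers", inputs=["Retriever"])

# The retrieved documents are now wrapped in the answer format, so the
# response of a retriever-only pipeline stays compatible with reader-based ones
res = p.run(query="Who is the father of Arya Stark?", params={"Retriever": {"top_k": 10}})
```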
<a name="pipeline.MostSimilarDocumentsPipeline"></a>
|
||||
## MostSimilarDocumentsPipeline Objects
|
||||
|
||||
|
||||
docs/_src/api/api/pydoc-markdown-classifier.yml (normal file, 18 additions)
@@ -0,0 +1,18 @@

loaders:
  - type: python
    search_path: [../../../../haystack/classifier]
    modules: ['base', 'farm']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: classifier.md

docs/_src/api/api/pydoc-markdown-question-generator.yml (normal file, 18 additions)
@@ -0,0 +1,18 @@

loaders:
  - type: python
    search_path: [../../../../haystack/question_generator]
    modules: ['question_generator']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: question_generator.md

docs/_src/api/api/question_generator.md (normal file, 30 additions)
@@ -0,0 +1,30 @@

<a name="question_generator"></a>
# Module question\_generator

<a name="question_generator.QuestionGenerator"></a>
## QuestionGenerator Objects

```python
class QuestionGenerator(BaseComponent)
```

The Question Generator takes only a document as input and outputs questions that it thinks can be
answered by this document. In our current implementation, input texts are split into chunks of 50 words
with a 10 word overlap. This is because the default model `valhalla/t5-base-e2e-qg` seems to generate only
about 3 questions per passage regardless of length. Our approach prioritizes the creation of more questions
over processing efficiency (T5 is able to digest much more than 50 words at once). The returned questions
generally come in an order dictated by the order of their answers, i.e. early questions in the list generally
come from earlier in the document.

<a name="question_generator.QuestionGenerator.__init__"></a>
#### \_\_init\_\_

```python
 | __init__(model_name_or_path="valhalla/t5-base-e2e-qg", model_version=None, num_beams=4, max_length=256, no_repeat_ngram_size=3, length_penalty=1.5, early_stopping=True, split_length=50, split_overlap=10, prompt="generate questions:")
```

Uses the valhalla/t5-base-e2e-qg model by default. This class supports any question generation model that is
implemented as a Seq2SeqLM in HuggingFace Transformers. Note that this style of question generation (where the only input
is a document) is sometimes referred to as end-to-end question generation. Answer-supervised question
generation is not currently supported.
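A minimal usage sketch of the class described above. The `generate()` call and the import path are assumptions based on the class description, and the sample text is invented.

```python
from haystack.question_generator import QuestionGenerator  # import path assumed

question_generator = QuestionGenerator()  # defaults to valhalla/t5-base-e2e-qg

text = (
    "Arya Stark is the third child and younger daughter of Eddard and Catelyn Stark. "
    "She serves as a prominent point-of-view character in the novels."
)

# Assumed to return the generated questions for the given text as a list of strings
questions = question_generator.generate(text)
print(questions)
```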
@@ -19,7 +19,7 @@ A "knowledge base" could for example be your website, an internal wiki or a coll

In this tutorial we will work on a slightly different domain: "Game of Thrones".

Let's see how we can use a bunch of Wikipedia articles to answer a variety of questions about the
marvellous seven kingdoms...
marvellous seven kingdoms.


### Prepare environment

@@ -67,7 +67,7 @@ Haystack finds answers to queries within the documents stored in a `DocumentStor

**Hint**: This tutorial creates a new document store instance with Wikipedia articles on Game of Thrones. However, you can configure Haystack to work with your existing document stores.

### Start an Elasticsearch server
You can start Elasticsearch on your local machine instance using Docker. If Docker is not readily available in your environment (eg., in Colab notebooks), then you can manually download and execute Elasticsearch from source.
You can start Elasticsearch on your local machine instance using Docker. If Docker is not readily available in your environment (e.g. in Colab notebooks), then you can manually download and execute Elasticsearch from source.


```python

@@ -224,16 +224,14 @@ pipe = ExtractiveQAPipeline(reader, retriever)

```python
# You can configure how many candidates the reader and retriever shall return
# The higher the top_k, the better (but also the slower) your answers.
prediction = pipe.run(
    query="Who is the father of Arya Stark?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)
# The higher top_k_retriever, the better (but also the slower) your answers.
prediction = pipe.run(query="Who is the father of Arya Stark?", top_k_retriever=10, top_k_reader=5)
```


```python
# prediction = pipe.run(query="Who created the Dothraki vocabulary?", params={"Reader": {"top_k": 5}})
# prediction = pipe.run(query="Who is the sister of Sansa?", params={"Reader": {"top_k": 5}})
# prediction = pipe.run(query="Who created the Dothraki vocabulary?", top_k_reader=5)
# prediction = pipe.run(query="Who is the sister of Sansa?", top_k_reader=5)
```
@@ -1320,6 +1320,12 @@ class _RayDeploymentWrapper:


class Docs2Answers(BaseComponent):
    """
    This Node is used to convert retrieved documents into predicted answers format.
    It is useful for situations where you are calling a Retriever only pipeline via REST API.
    This ensures that your output is in a compatible format.
    """

    outgoing_edges = 1

    def __init__(self):
@@ -16,7 +16,7 @@

"In this tutorial we will work on a slightly different domain: \"Game of Thrones\". \n",
"\n",
"Let's see how we can use a bunch of Wikipedia articles to answer a variety of questions about the \n",
"marvellous seven kingdoms... \n"
"marvellous seven kingdoms.\n"
]
},
{

@@ -65,7 +65,7 @@

},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [

@@ -91,22 +91,14 @@

"**Hint**: This tutorial creates a new document store instance with Wikipedia articles on Game of Thrones. However, you can configure Haystack to work with your existing document stores.\n",
"\n",
"### Start an Elasticsearch server\n",
"You can start Elasticsearch on your local machine instance using Docker. If Docker is not readily available in your environment (eg., in Colab notebooks), then you can manually download and execute Elasticsearch from source."
"You can start Elasticsearch on your local machine instance using Docker. If Docker is not readily available in your environment (e.g. in Colab notebooks), then you can manually download and execute Elasticsearch from source."
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0ae423cd9c30d6f02ca2073e430d4e1f4403d88b8ec316411ec4c198bad3d416\r\n"
]
}
],
"outputs": [],
"source": [
"# Recommended: Start Elasticsearch using Docker via the Haystack utility function\n",
"from haystack.utils import launch_es\n",

@@ -137,21 +129,13 @@

},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"07/07/2020 10:41:47 - INFO - elasticsearch - PUT http://localhost:9200/document [status:200 request:0.364s]\n"
]
}
],
"outputs": [],
"source": [
"# Connect to Elasticsearch\n",
"\n",

@@ -180,34 +164,13 @@

},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"07/07/2020 10:41:48 - INFO - haystack.indexing.utils - Found data stored in `data/article_txt_got`. Delete this first if you really want to fetch new data.\n",
"07/07/2020 10:41:48 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:0.461s]\n",
"07/07/2020 10:41:49 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:0.259s]\n",
"07/07/2020 10:41:49 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:0.205s]\n",
"07/07/2020 10:41:49 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:0.158s]\n",
"07/07/2020 10:41:49 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:0.126s]\n",
"07/07/2020 10:41:49 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:0.095s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'name': '384_Maelor_Targaryen.txt', 'text': '#REDIRECT The Princess and the Queen'}, {'name': '314_Pypar.txt', 'text': \"#REDIRECT List of Game of Thrones characters#Night's Watch\"}, {'name': '73_A_Man_Without_Honor.txt', 'text': '\"\\'\\'\\'A Man Without Honor\\'\\'\\'\" is the seventh episode of the second season of HBO\\'s medieval fantasy television series \\'\\'Game of Thrones\\'\\'.\\nThe episode is written by series co-creators David Benioff and D. B. Weiss and directed, for the second time in this season, by David Nutter. It premiered on May 13, 2012.\\nThe name of the episode comes from Catelyn Stark\\'s assessment of Ser Jaime Lannister: \"You are a man without honor,\" after he kills a member of his own family to attempt escape.'}]\n"
]
}
],
"outputs": [],
"source": [
"# Let's first fetch some documents that we want to query\n",
"# Here: 517 Wikipedia articles for Game of Thrones\n",

@@ -260,7 +223,7 @@

},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [

@@ -270,7 +233,7 @@

},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false,

@@ -310,27 +273,13 @@

},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"04/28/2020 12:29:45 - INFO - farm.utils - device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None\n",
"04/28/2020 12:29:45 - INFO - farm.infer - Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...\n",
"04/28/2020 12:29:49 - WARNING - farm.modeling.language_model - Could not automatically detect from language model name what language it is. \n",
"\t We guess it's an *ENGLISH* model ... \n",
"\t If not: Init the language model by supplying the 'language' param.\n",
"04/28/2020 12:29:54 - WARNING - farm.modeling.prediction_head - Some unused parameters are passed to the QuestionAnsweringHead. Might not be a problem. Params: {\"loss_ignore_index\": -1}\n",
"04/28/2020 12:29:58 - INFO - farm.utils - device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None\n"
]
}
],
"outputs": [],
"source": [
"# Load a local model or any of the QA models on\n",
"# Hugging Face's model hub (https://huggingface.co/models)\n",

@@ -369,7 +318,7 @@

},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false

@@ -390,24 +339,17 @@

},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"ename": "SyntaxError",
"evalue": "invalid syntax (<ipython-input-1-da5c75822ce3>, line 3)",
"output_type": "error",
"traceback": [
"\u001B[0;36m File \u001B[0;32m\"<ipython-input-1-da5c75822ce3>\"\u001B[0;36m, line \u001B[0;32m3\u001B[0m\n\u001B[0;31m prediction = pipe.run(query=\"Who is the father of Arya Stark?\", params={top_k_retriever=10, top_k_reader=5)\u001B[0m\n\u001B[0m ^\u001B[0m\n\u001B[0;31mSyntaxError\u001B[0m\u001B[0;31m:\u001B[0m invalid syntax\n"
]
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
],
},
"outputs": [],
"source": [
"# You can configure how many candidates the reader and retriever shall return\n",
"# The higher the top_k, the better (but also the slower) your answers.\n",
"prediction = pipe.run(\n",
" query=\"Who is the father of Arya Stark?\", params={\"Retriever\": {\"top_k\": 10}, \"Reader\": {\"top_k\": 5}}\n",
")"
"# The higher top_k_retriever, the better (but also the slower) your answers. \n",
"prediction = pipe.run(query=\"Who is the father of Arya Stark?\", top_k_retriever=10, top_k_reader=5)"
]
},
{

@@ -416,50 +358,20 @@

"metadata": {},
"outputs": [],
"source": [
"# prediction = pipe.run(query=\"Who created the Dothraki vocabulary?\", params={\"Reader\": {\"top_k\": 5}})\n",
"# prediction = pipe.run(query=\"Who is the sister of Sansa?\", params={\"Reader\": {\"top_k\": 5}})"
"# prediction = pipe.run(query=\"Who created the Dothraki vocabulary?\", top_k_reader=5)\n",
"# prediction = pipe.run(query=\"Who is the sister of Sansa?\", top_k_reader=5)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ { 'answer': 'Eddard',\n",
" 'context': 's Nymeria after a legendary warrior queen. She travels '\n",
" \"with her father, Eddard, to King's Landing when he is made \"\n",
" 'Hand of the King. Before she leaves,'},\n",
" { 'answer': 'Ned',\n",
" 'context': 'girl disguised as a boy all along and is surprised to '\n",
" \"learn she is Arya, Ned Stark's daughter. After the \"\n",
" 'Goldcloaks get help from Ser Amory Lorch and '},\n",
" { 'answer': 'Ned',\n",
" 'context': 'in the television series.\\n'\n",
" '\\n'\n",
" '\\n'\n",
" '====Season 1====\\n'\n",
" 'Arya accompanies her father Ned and her sister Sansa to '\n",
" \"King's Landing. Before their departure, Arya's ha\"},\n",
" { 'answer': 'Balon Greyjoy',\n",
" 'context': 'He sends Theon to the Iron Islands hoping to broker an '\n",
" \"alliance with Balon Greyjoy, Theon's father. In exchange \"\n",
" 'for Greyjoy support, Robb as the King '},\n",
" { 'answer': 'Brynden Tully',\n",
" 'context': 'o the weather. Sandor decides to instead take her to her '\n",
" 'great-uncle Brynden Tully. On their way to Riverrun, they '\n",
" \"encounter two men on Arya's death l\"}]\n"
]
}
],
"outputs": [],
"source": [
"print_answers(prediction, details=\"minimal\")"
]
@@ -7,7 +7,7 @@

# In this tutorial we will work on a slightly different domain: "Game of Thrones".
#
# Let's see how we can use a bunch of Wikipedia articles to answer a variety of questions about the
# marvellous seven kingdoms...
# marvellous seven kingdoms.

import logging
import subprocess

@@ -40,7 +40,7 @@ def tutorial1_basic_qa_pipeline():

    #
    # Start an Elasticsearch server
    # You can start Elasticsearch on your local machine instance using Docker. If Docker is not readily available in
    # your environment (eg., in Colab notebooks), then you can manually download and execute Elasticsearch from source.
    # your environment (e.g. in Colab notebooks), then you can manually download and execute Elasticsearch from source.

    launch_es()