update api markdown files and add markdown file for ranker (#1198)

* update api markdown files and add markdown file for ranker

* added docstrings for weaviate

* The new version of pydoc-markdown does not render arguments correctly, so we used pydoc-markdown==3.11.0
Markus Paff 2021-06-15 17:50:08 +02:00 committed by GitHub
parent 215c45eb8a
commit 6cd49105e7
6 changed files with 480 additions and 7 deletions


@@ -1466,3 +1466,254 @@ List[np.array]: List of vectors.
Return the count of embeddings in the document store.
<a name="weaviate"></a>
# Module weaviate
<a name="weaviate.WeaviateDocumentStore"></a>
## WeaviateDocumentStore Objects
```python
class WeaviateDocumentStore(BaseDocumentStore)
```
Weaviate is a cloud-native, modular, real-time vector search engine built to scale your machine learning models.
(See https://www.semi.technology/developers/weaviate/current/index.html#what-is-weaviate)
Some of the key differences in contrast to FAISS & Milvus:
1. Stores everything in one place: documents, metadata and vectors - so less network overhead when scaling this up
2. Allows combination of vector search and scalar filtering, i.e. you can filter for a certain tag and do dense retrieval on that subset
3. Offers fewer ANN algorithms; as of now, only HNSW is supported.
The Weaviate Python client is used to connect to the server; more details are here:
https://weaviate-python-client.readthedocs.io/en/docs/weaviate.html
Usage:
1. Start a Weaviate server (see https://www.semi.technology/developers/weaviate/current/getting-started/installation.html)
2. Init a WeaviateDocumentStore in Haystack
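A minimal sketch of step 2, assuming a Weaviate instance is already running locally on the default port (the import path below may differ between Haystack versions):
```python
from haystack.document_store.weaviate import WeaviateDocumentStore  # import path may differ per Haystack version

# Connect to a locally running Weaviate instance (default host/port shown)
document_store = WeaviateDocumentStore(
    host="http://localhost",
    port=8080,
    index="Document",
    embedding_dim=768,
    similarity="dot_product",
)
```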
<a name="weaviate.WeaviateDocumentStore.__init__"></a>
#### \_\_init\_\_
```python
| __init__(host: Union[str, List[str]] = "http://localhost", port: Union[int, List[int]] = 8080, timeout_config: tuple = (5, 15), username: str = None, password: str = None, index: str = "Document", embedding_dim: int = 768, text_field: str = "text", name_field: str = "name", faq_question_field="question", similarity: str = "dot_product", index_type: str = "hnsw", custom_schema: Optional[dict] = None, return_embedding: bool = False, embedding_field: str = "embedding", progress_bar: bool = True, duplicate_documents: str = 'overwrite', **kwargs)
```
**Arguments**:
- `host`: Weaviate server connection URL for storing and processing documents and vectors.
For more details, refer to https://www.semi.technology/developers/weaviate/current/getting-started/installation.html
- `port`: port of Weaviate instance
- `timeout_config`: Weaviate timeout config as a tuple of (retries, timeout seconds).
- `username`: username (standard authentication via http_auth)
- `password`: password (standard authentication via http_auth)
- `index`: Index name for document text, embedding and metadata (in Weaviate terminology, this is a "Class" in Weaviate schema).
- `embedding_dim`: The embedding vector size. Default: 768.
- `text_field`: Name of field that might contain the answer and will therefore be passed to the Reader Model (e.g. "full_text").
If no Reader is used (e.g. in FAQ-Style QA) the plain content of this field will just be returned.
- `name_field`: Name of field that contains the title of the doc
- `faq_question_field`: Name of field containing the question in case of FAQ-Style QA
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default.
- `index_type`: Index type of any vector object defined in the Weaviate schema. The vector index type is pluggable.
Currently, only HNSW is supported.
See: https://www.semi.technology/developers/weaviate/current/more-resources/performance.html
- `custom_schema`: Allows creating a custom schema in Weaviate. For more details,
see https://www.semi.technology/developers/weaviate/current/data-schema/schema-configuration.html
- `module_name`: Vectorization module to convert data into vectors. Default is "text2vec-transformers".
For more details, see https://www.semi.technology/developers/weaviate/current/modules/
- `return_embedding`: Whether to return document embeddings.
- `embedding_field`: Name of field containing an embedding vector.
- `progress_bar`: Whether to show a tqdm progress bar or not.
Can be helpful to disable in production deployments to keep the logs clean.
- `duplicate_documents`: Handle duplicate documents based on parameter options.
Parameter options: ('skip', 'overwrite', 'fail')
skip: Ignore duplicate documents.
overwrite: Update any existing documents with the same ID when adding documents.
fail: Raise an error if the document ID of the document being added already exists.
<a name="weaviate.WeaviateDocumentStore.get_document_by_id"></a>
#### get\_document\_by\_id
```python
| get_document_by_id(id: str, index: Optional[str] = None) -> Optional[Document]
```
Fetch a document by specifying its text id string
<a name="weaviate.WeaviateDocumentStore.get_documents_by_id"></a>
#### get\_documents\_by\_id
```python
| get_documents_by_id(ids: List[str], index: Optional[str] = None, batch_size: int = 10_000) -> List[Document]
```
Fetch documents by specifying a list of text id strings
<a name="weaviate.WeaviateDocumentStore.write_documents"></a>
#### write\_documents
```python
| write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None, batch_size: int = 10_000, duplicate_documents: Optional[str] = None)
```
Add new documents to the DocumentStore.
**Arguments**:
- `documents`: List of `Dicts` or List of `Documents`. Passing an Embedding/Vector is mandatory in case Weaviate is not
configured with a module. If a module is configured, the embedding is automatically generated by Weaviate.
- `index`: index name for storing the docs and metadata
- `batch_size`: When working with a large number of documents, batching can help reduce memory footprint.
- `duplicate_documents`: Handle duplicate documents based on parameter options.
Parameter options: ('skip', 'overwrite', 'fail')
skip: Ignore duplicate documents.
overwrite: Update any existing documents with the same ID when adding documents.
fail: Raise an error if the document ID of the document being added already exists.
**Raises**:
- `DuplicateDocumentError`: Exception triggered on duplicate documents
**Returns**:
None
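As an illustration, a hedged sketch of writing dictionaries that carry their own embeddings (assuming the `document_store` from the initialization sketch above and a Weaviate instance without a vectorization module):
```python
import numpy as np

# Each dict becomes a Document; the embedding must be supplied here because
# this instance is assumed to have no Weaviate vectorization module configured.
docs = [
    {"text": "Weaviate is a cloud-native vector search engine.",
     "name": "weaviate_intro",
     "embedding": np.random.rand(768).astype(np.float32)},
    {"text": "Haystack lets you build search pipelines.",
     "name": "haystack_intro",
     "embedding": np.random.rand(768).astype(np.float32)},
]
document_store.write_documents(docs, duplicate_documents="overwrite")
```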
<a name="weaviate.WeaviateDocumentStore.update_document_meta"></a>
#### update\_document\_meta
```python
| update_document_meta(id: str, meta: Dict[str, str])
```
Update the metadata dictionary of a document by specifying its string id
<a name="weaviate.WeaviateDocumentStore.get_document_count"></a>
#### get\_document\_count
```python
| get_document_count(filters: Optional[Dict[str, List[str]]] = None, index: Optional[str] = None) -> int
```
Return the number of documents in the document store.
<a name="weaviate.WeaviateDocumentStore.get_all_documents"></a>
#### get\_all\_documents
```python
| get_all_documents(index: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None, return_embedding: Optional[bool] = None, batch_size: int = 10_000) -> List[Document]
```
Get documents from the document store.
**Arguments**:
- `index`: Name of the index to get the documents from. If None, the
DocumentStore's default index (self.index) will be used.
- `filters`: Optional filters to narrow down the documents to return.
Example: {"name": ["some", "more"], "category": ["only_one"]}
- `return_embedding`: Whether to return the document embeddings.
- `batch_size`: When working with a large number of documents, batching can help reduce memory footprint.
<a name="weaviate.WeaviateDocumentStore.get_all_documents_generator"></a>
#### get\_all\_documents\_generator
```python
| get_all_documents_generator(index: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None, return_embedding: Optional[bool] = None, batch_size: int = 10_000) -> Generator[Document, None, None]
```
Get documents from the document store. Under-the-hood, documents are fetched in batches from the
document store and yielded as individual documents. This method can be used to iteratively process
a large number of documents without having to load all documents in memory.
**Arguments**:
- `index`: Name of the index to get the documents from. If None, the
DocumentStore's default index (self.index) will be used.
- `filters`: Optional filters to narrow down the documents to return.
Example: {"name": ["some", "more"], "category": ["only_one"]}
- `return_embedding`: Whether to return the document embeddings.
- `batch_size`: When working with a large number of documents, batching can help reduce memory footprint.
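A short sketch of streaming documents in batches (the filter values are purely illustrative):
```python
# Iterate over all documents matching the filter, fetched in batches of 1,000,
# without loading the whole index into memory.
for doc in document_store.get_all_documents_generator(
    filters={"name": ["weaviate_intro"]},  # illustrative filter
    return_embedding=False,
    batch_size=1_000,
):
    print(doc.id, doc.text[:60])
```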
<a name="weaviate.WeaviateDocumentStore.query"></a>
#### query
```python
| query(query: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None, top_k: int = 10, custom_query: Optional[str] = None, index: Optional[str] = None) -> List[Document]
```
Scan through documents in DocumentStore and return a small number of documents
that are most relevant to the query as defined by Weaviate semantic search.
**Arguments**:
- `query`: The query
- `filters`: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field
- `top_k`: How many documents to return per query.
- `custom_query`: Custom query that will be executed using the query.raw method; for more details, refer to
https://www.semi.technology/developers/weaviate/current/graphql-references/filters.html
- `index`: The name of the index in the DocumentStore from which to retrieve documents
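A hedged sketch of a filtered semantic query (this assumes the Weaviate instance is configured with a vectorization module, so a plain-text query can be embedded server-side):
```python
# Semantic search restricted to documents whose "name" matches the filter
results = document_store.query(
    query="What is a vector search engine?",
    filters={"name": ["weaviate_intro"]},  # illustrative filter
    top_k=5,
)
for doc in results:
    print(doc.text)
```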
<a name="weaviate.WeaviateDocumentStore.query_by_embedding"></a>
#### query\_by\_embedding
```python
| query_by_embedding(query_emb: np.ndarray, filters: Optional[dict] = None, top_k: int = 10, index: Optional[str] = None, return_embedding: Optional[bool] = None) -> List[Document]
```
Find the documents that are most similar to the provided `query_emb` by using a vector similarity metric.
**Arguments**:
- `query_emb`: Embedding of the query (e.g. gathered from DPR)
- `filters`: Optional filters to narrow down the search space.
Example: {"name": ["some", "more"], "category": ["only_one"]}
- `top_k`: How many documents to return
- `index`: index name for storing the docs and metadata
- `return_embedding`: Whether to return the document embedding.
**Returns**:
List of Documents that are most similar to `query_emb`
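Sketch of a direct embedding query; in practice `query_emb` would come from a retriever such as DPR, and a random vector is used here only to show the call:
```python
import numpy as np

# Stand-in for an embedding produced by a retriever (e.g. DPR)
query_emb = np.random.rand(768).astype(np.float32)

similar_docs = document_store.query_by_embedding(
    query_emb=query_emb,
    top_k=3,
    return_embedding=False,
)
```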
<a name="weaviate.WeaviateDocumentStore.update_embeddings"></a>
#### update\_embeddings
```python
| update_embeddings(retriever, index: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None, update_existing_embeddings: bool = True, batch_size: int = 10_000)
```
Updates the embeddings in the document store using the encoding model specified in the retriever.
This can be useful if you want to change the embeddings for your documents (e.g. after changing the retriever config).
**Arguments**:
- `retriever`: Retriever to use to update the embeddings.
- `index`: Index name to update
- `update_existing_embeddings`: Weaviate mandates an embedding while creating the document itself.
This option must always be true for Weaviate, and it will update the embeddings for all documents.
- `filters`: Optional filters to narrow down the documents for which embeddings are to be updated.
Example: {"name": ["some", "more"], "category": ["only_one"]}
- `batch_size`: When working with a large number of documents, batching can help reduce memory footprint.
**Returns**:
None
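Typical usage together with a dense retriever, as a sketch (the model names are the public DPR defaults; the import path may differ between Haystack versions):
```python
from haystack.retriever.dense import DensePassageRetriever  # import path may differ per Haystack version

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)

# Re-embed all documents with the retriever's passage encoder
document_store.update_embeddings(retriever, update_existing_embeddings=True)
```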
<a name="weaviate.WeaviateDocumentStore.delete_all_documents"></a>
#### delete\_all\_documents
```python
| delete_all_documents(index: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None)
```
Delete documents in an index. All documents are deleted if no filters are passed.
**Arguments**:
- `index`: Index name to delete the document from.
- `filters`: Optional filters to narrow down the documents to be deleted.
**Returns**:
None


@@ -15,4 +15,4 @@ pydoc-markdown pydoc-markdown-pipelines.yml
pydoc-markdown pydoc-markdown-knowledge-graph.yml
pydoc-markdown pydoc-markdown-graph-retriever.yml
pydoc-markdown pydoc-markdown-evaluation.yml
pydoc-markdown pydoc-markdown-ranker.yml


@@ -1,7 +1,7 @@
loaders:
  - type: python
    search_path: [../../../../haystack/document_store]
    modules: ['base', 'elasticsearch', 'memory', 'sql', 'faiss', 'milvus']
    modules: ['base', 'elasticsearch', 'memory', 'sql', 'faiss', 'milvus', 'weaviate']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter


@@ -0,0 +1,19 @@
loaders:
  - type: python
    search_path: [../../../../haystack/ranker]
    modules: ['base', 'farm']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: ranker.md

docs/_src/api/api/ranker.md (new file)

@@ -0,0 +1,206 @@
<a name="base"></a>
# Module base
<a name="base.BaseRanker"></a>
## BaseRanker Objects
```python
class BaseRanker(BaseComponent)
```
<a name="base.BaseRanker.timing"></a>
#### timing
```python
| timing(fn, attr_name)
```
Wrapper method used to time functions.
<a name="base.BaseRanker.eval"></a>
#### eval
```python
| eval(label_index: str = "label", doc_index: str = "eval_document", label_origin: str = "gold_label", top_k: int = 10, open_domain: bool = False, return_preds: bool = False) -> dict
```
Performs evaluation of the Ranker.
The Ranker is evaluated in the same way as a Retriever: based on whether it finds the correct document given the query string and at which
position in the ranking of documents the correct document is.
Returns a dict containing the following metrics:
- "recall": Proportion of questions for which the correct document is among the retrieved documents
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
average precision is normalized by the number of retrieved relevant documents per query.
If ``open_domain=False``, average precision is normalized by the number of all relevant documents
per query.
**Arguments**:
- `label_index`: Index/Table in DocumentStore where labeled questions are stored
- `doc_index`: Index/Table in DocumentStore where documents that are used for evaluation are stored
- `top_k`: How many documents to return per query
- `open_domain`: If ``True``, retrieval will be evaluated by checking if the answer string to a question is
contained in the retrieved docs (common approach in open-domain QA).
If ``False``, retrieval uses a stricter evaluation that checks if the retrieved document ids
are within ids explicitly stated in the labels.
- `return_preds`: Whether to add predictions in the returned dictionary. If True, the returned dictionary
contains the keys "predictions" and "metrics".
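The metrics themselves are standard ranking measures; the toy computation below only illustrates their definitions and is not Haystack's implementation:
```python
# For three queries: the 1-based rank of the first relevant document,
# or None if no relevant document was retrieved.
first_relevant_ranks = [1, 3, None]

recall = sum(r is not None for r in first_relevant_ranks) / len(first_relevant_ranks)
mrr = sum(1.0 / r for r in first_relevant_ranks if r is not None) / len(first_relevant_ranks)
print(f"recall={recall:.2f}, mrr={mrr:.2f}")  # recall=0.67, mrr=0.44
```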
<a name="farm"></a>
# Module farm
<a name="farm.FARMRanker"></a>
## FARMRanker Objects
```python
class FARMRanker(BaseRanker)
```
Transformer-based model for document re-ranking using the TextPairClassifier of the FARM framework (https://github.com/deepset-ai/FARM).
While the underlying model can vary (BERT, RoBERTa, DistilBERT, ...), the interface remains the same.
With a FARMRanker, you can:
- directly get predictions via predict()
- fine-tune the model on TextPair data via train()
<a name="farm.FARMRanker.__init__"></a>
#### \_\_init\_\_
```python
| __init__(model_name_or_path: Union[str, Path], model_version: Optional[str] = None, batch_size: int = 50, use_gpu: bool = True, top_k: int = 10, num_processes: Optional[int] = None, max_seq_len: int = 256, progress_bar: bool = True)
```
**Arguments**:
- `model_name_or_path`: Directory of a saved model or the name of a public model e.g. 'bert-base-cased',
'deepset/bert-base-cased-squad2', 'distilbert-base-uncased-distilled-squad'.
See https://huggingface.co/models for full list of available models.
- `model_version`: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.
- `batch_size`: Number of samples the model receives in one batch for inference.
Memory consumption is much lower in inference mode. Recommendation: Increase the batch size
to a value so only a single batch is used.
- `use_gpu`: Whether to use GPU (if available)
- `top_k`: The maximum number of documents to return
- `num_processes`: The number of processes for `multiprocessing.Pool`. Set to value of 0 to disable
multiprocessing. Set to None to let Inferencer determine optimum number. If you
want to debug the Language Model, you might need to disable multiprocessing!
- `max_seq_len`: Max sequence length of one input text for the model
- `progress_bar`: Whether to show a tqdm progress bar or not.
Can be helpful to disable in production deployments to keep the logs clean.
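A minimal construction sketch (the model name is a placeholder, not a real checkpoint; use any local directory or model-hub model suitable for text-pair classification):
```python
from haystack.ranker import FARMRanker  # import path may differ per Haystack version

# "my-org/my-text-pair-model" is a placeholder model name
ranker = FARMRanker(
    model_name_or_path="my-org/my-text-pair-model",
    batch_size=50,
    top_k=10,
    use_gpu=True,
)
```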
<a name="farm.FARMRanker.train"></a>
#### train
```python
| train(data_dir: str, train_filename: str, dev_filename: Optional[str] = None, test_filename: Optional[str] = None, use_gpu: Optional[bool] = None, batch_size: int = 10, n_epochs: int = 2, learning_rate: float = 1e-5, max_seq_len: Optional[int] = None, warmup_proportion: float = 0.2, dev_split: float = 0, evaluate_every: int = 300, save_dir: Optional[str] = None, num_processes: Optional[int] = None, use_amp: str = None)
```
Fine-tune a model on a TextPairClassification dataset. Options:
- Take a plain language model (e.g. `bert-base-cased`) and train it for TextPairClassification
- Take a TextPairClassification model and fine-tune it for your domain
**Arguments**:
- `data_dir`: Path to directory containing your training data in SQuAD style
- `train_filename`: Filename of training data
- `dev_filename`: Filename of dev / eval data
- `test_filename`: Filename of test data
- `dev_split`: Instead of specifying a dev_filename, you can also specify a ratio (e.g. 0.1) here
that gets split off from training data for eval.
- `use_gpu`: Whether to use GPU (if available)
- `batch_size`: Number of samples the model receives in one batch for training
- `n_epochs`: Number of iterations on the whole training data set
- `learning_rate`: Learning rate of the optimizer
- `max_seq_len`: Maximum text length (in tokens). Everything longer gets cut down.
- `warmup_proportion`: Proportion of training steps until maximum learning rate is reached.
Until that point LR is increasing linearly. After that it's decreasing again linearly.
Options for different schedules are available in FARM.
- `evaluate_every`: Evaluate the model every X steps on the hold-out eval dataset
- `save_dir`: Path to store the final model
- `num_processes`: The number of processes for `multiprocessing.Pool` during preprocessing.
Set to value of 1 to disable multiprocessing. When set to 1, you cannot split away a dev set from train set.
Set to None to use all CPU cores minus one.
- `use_amp`: Optimization level of NVIDIA's automatic mixed precision (AMP). The higher the level, the faster the model.
Available options:
None (Don't use AMP)
"O0" (Normal FP32 training)
"O1" (Mixed Precision => Recommended)
"O2" (Almost FP16)
"O3" (Pure FP16).
See details on: https://nvidia.github.io/apex/amp.html
**Returns**:
None
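A fine-tuning sketch (directory and file names are placeholders; adapt them to your dataset):
```python
# Fine-tune the ranker on a TextPairClassification dataset; all paths are placeholders
ranker.train(
    data_dir="data/reranking",
    train_filename="train.json",
    dev_filename="dev.json",
    n_epochs=2,
    batch_size=10,
    learning_rate=1e-5,
    save_dir="saved_models/my_ranker",
)
```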
<a name="farm.FARMRanker.update_parameters"></a>
#### update\_parameters
```python
| update_parameters(max_seq_len: Optional[int] = None)
```
Hot update parameters of a loaded Ranker. It may not be safe when processing concurrent requests.
<a name="farm.FARMRanker.save"></a>
#### save
```python
| save(directory: Path)
```
Saves the Ranker model so that it can be reused at a later point in time.
**Arguments**:
- `directory`: Directory where the Ranker model should be saved
<a name="farm.FARMRanker.predict_batch"></a>
#### predict\_batch
```python
| predict_batch(query_doc_list: List[dict], top_k: int = None, batch_size: int = None)
```
Use the loaded Ranker model to rank, for each query in a list of queries, that query's supplied list of Documents.
Returns a list of dictionaries, each containing a query and its list of Documents sorted by (descending) similarity to that query.
**Arguments**:
- `query_doc_list`: List of dictionaries containing queries with their retrieved documents
- `top_k`: The maximum number of documents to return for each query
- `batch_size`: Number of samples the model receives in one batch for inference
**Returns**:
List of dictionaries containing query and ranked list of Document
<a name="farm.FARMRanker.predict"></a>
#### predict
```python
| predict(query: str, documents: List[Document], top_k: Optional[int] = None)
```
Use the loaded Ranker model to re-rank the supplied list of Documents.
Returns a list of Documents sorted by (descending) TextPairClassification similarity to the query.
**Arguments**:
- `query`: Query string
- `documents`: List of Document to be re-ranked
- `top_k`: The maximum number of documents to return
**Returns**:
List of Document
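A re-ranking sketch on top of candidate documents, e.g. as returned by a retriever (the `Document` import path may differ between Haystack versions):
```python
from haystack import Document  # import path may differ per Haystack version

candidate_docs = [
    Document(text="Weaviate is a cloud-native vector search engine."),
    Document(text="Paris is the capital of France."),
]

# Keep only the document most similar to the query
reranked = ranker.predict(
    query="What is a vector search engine?",
    documents=candidate_docs,
    top_k=1,
)
print(reranked[0].text)
```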


@@ -238,7 +238,7 @@ Karpukhin, Vladimir, et al. (2020): "Dense Passage Retrieval for Open-Domain Que
#### \_\_init\_\_
```python
| __init__(document_store: BaseDocumentStore, query_embedding_model: Union[Path, str] = "facebook/dpr-question_encoder-single-nq-base", passage_embedding_model: Union[Path, str] = "facebook/dpr-ctx_encoder-single-nq-base", single_model_path: Optional[Union[Path, str]] = None, model_version: Optional[str] = None, max_seq_len_query: int = 64, max_seq_len_passage: int = 256, top_k: int = 10, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, use_fast_tokenizers: bool = True, infer_tokenizer_classes: bool = False, similarity_function: str = "dot_product", progress_bar: bool = True)
| __init__(document_store: BaseDocumentStore, query_embedding_model: Union[Path, str] = "facebook/dpr-question_encoder-single-nq-base", passage_embedding_model: Union[Path, str] = "facebook/dpr-ctx_encoder-single-nq-base", model_version: Optional[str] = None, max_seq_len_query: int = 64, max_seq_len_passage: int = 256, top_k: int = 10, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, use_fast_tokenizers: bool = True, infer_tokenizer_classes: bool = False, similarity_function: str = "dot_product", progress_bar: bool = True)
```
Init the Retriever incl. the two encoder models from a local or remote model checkpoint.
@@ -266,9 +266,6 @@ The checkpoint format matches huggingface transformers' model format
- `passage_embedding_model`: Local path or remote name of passage encoder checkpoint. The format equals the
one used by hugging-face transformers' modelhub models
Currently available remote names: ``"facebook/dpr-ctx_encoder-single-nq-base"``
- `single_model_path`: Local path or remote name of a query and passage embedder in one single model. Those
models are typically trained within FARM.
Currently available remote names: TODO add FARM DPR model to HF modelhub
- `model_version`: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.
- `max_seq_len_query`: Longest length of each query sequence. Maximum number of tokens for the query text. Longer ones will be cut down.
- `max_seq_len_passage`: Longest length of each passage/context sequence. Maximum number of tokens for the passage text. Longer ones will be cut down.
@@ -407,7 +404,7 @@ None
```python
| @classmethod
| load(cls, load_dir: Union[Path, str], document_store: BaseDocumentStore, max_seq_len_query: int = 64, max_seq_len_passage: int = 256, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, use_fast_tokenizers: bool = True, similarity_function: str = "dot_product", query_encoder_dir: str = "query_encoder", passage_encoder_dir: str = "passage_encoder")
| load(cls, load_dir: Union[Path, str], document_store: BaseDocumentStore, max_seq_len_query: int = 64, max_seq_len_passage: int = 256, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, use_fast_tokenizers: bool = True, similarity_function: str = "dot_product", query_encoder_dir: str = "query_encoder", passage_encoder_dir: str = "passage_encoder", infer_tokenizer_classes: bool = False)
```
Load DensePassageRetriever from the specified directory.
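A usage sketch for the updated signature (the directory is a placeholder for encoders previously saved with the retriever, and `document_store` is assumed to exist, e.g. from the Weaviate sketch above):
```python
# Load query and passage encoders previously saved by the retriever
retriever = DensePassageRetriever.load(
    load_dir="saved_models/dpr",        # placeholder directory
    document_store=document_store,
    infer_tokenizer_classes=True,       # new parameter shown in the signature above
)
```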