mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-12-27 06:58:35 +00:00
update api markdown files and add markdown file for ranker (#1198)
* update api markdown files and add markdown file for ranker
* added docstrings for weaviate
* new version of pydoc-markdown does not render arguments correctly. We used pydoc-markdown==3.11.0
This commit is contained in:
parent
215c45eb8a
commit
6cd49105e7
@@ -1466,3 +1466,254 @@ List[np.array]: List of vectors.
Return the count of embeddings in the document store.

<a name="weaviate"></a>
# Module weaviate

<a name="weaviate.WeaviateDocumentStore"></a>
## WeaviateDocumentStore Objects

```python
class WeaviateDocumentStore(BaseDocumentStore)
```

Weaviate is a cloud-native, modular, real-time vector search engine built to scale your machine learning models.
(See https://www.semi.technology/developers/weaviate/current/index.html#what-is-weaviate)

Some of the key differences in contrast to FAISS & Milvus:
1. Stores everything in one place: documents, metadata and vectors - so less network overhead when scaling this up
2. Allows combination of vector search and scalar filtering, i.e. you can filter for a certain tag and do dense retrieval on that subset
3. Offers fewer ANN algorithms; as of now only HNSW is supported.

The Weaviate Python client is used to connect to the server; more details are here:
https://weaviate-python-client.readthedocs.io/en/docs/weaviate.html

Usage:
1. Start a Weaviate server (see https://www.semi.technology/developers/weaviate/current/getting-started/installation.html)
2. Init a WeaviateDocumentStore in Haystack, as sketched below
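
A minimal sketch of step 2, assuming a local Weaviate server from step 1 and the import path of the Haystack version documented here:

```python
from haystack.document_store.weaviate import WeaviateDocumentStore

# Values below are the defaults from __init__, shown explicitly for clarity.
document_store = WeaviateDocumentStore(
    host="http://localhost",
    port=8080,
    index="Document",
    embedding_dim=768,
)
```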

<a name="weaviate.WeaviateDocumentStore.__init__"></a>
#### \_\_init\_\_

```python
 | __init__(host: Union[str, List[str]] = "http://localhost", port: Union[int, List[int]] = 8080, timeout_config: tuple = (5, 15), username: str = None, password: str = None, index: str = "Document", embedding_dim: int = 768, text_field: str = "text", name_field: str = "name", faq_question_field="question", similarity: str = "dot_product", index_type: str = "hnsw", custom_schema: Optional[dict] = None, return_embedding: bool = False, embedding_field: str = "embedding", progress_bar: bool = True, duplicate_documents: str = 'overwrite', **kwargs)
```

**Arguments**:

- `host`: Weaviate server connection URL for storing and processing documents and vectors.
For more details, refer to "https://www.semi.technology/developers/weaviate/current/getting-started/installation.html"
- `port`: port of the Weaviate instance
- `timeout_config`: Weaviate timeout config as a tuple of (retries, timeout seconds).
- `username`: username (standard authentication via http_auth)
- `password`: password (standard authentication via http_auth)
- `index`: Index name for document text, embedding and metadata (in Weaviate terminology, this is a "Class" in the Weaviate schema).
- `embedding_dim`: The embedding vector size. Default: 768.
- `text_field`: Name of the field that might contain the answer and will therefore be passed to the Reader model (e.g. "full_text").
If no Reader is used (e.g. in FAQ-Style QA), the plain content of this field will just be returned.
- `name_field`: Name of the field that contains the title of the doc
- `faq_question_field`: Name of the field containing the question in case of FAQ-Style QA
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default.
- `index_type`: Index type of any vector object defined in the Weaviate schema. The vector index type is pluggable.
Currently, only HNSW is supported.
See: https://www.semi.technology/developers/weaviate/current/more-resources/performance.html
- `custom_schema`: Allows creating a custom schema in Weaviate. For more details,
see https://www.semi.technology/developers/weaviate/current/data-schema/schema-configuration.html
- `module_name`: Vectorization module to convert data into vectors. Default is "text2vec-transformers".
For more details, see https://www.semi.technology/developers/weaviate/current/modules/
- `return_embedding`: Whether to return document embeddings.
- `embedding_field`: Name of the field containing an embedding vector.
- `progress_bar`: Whether to show a tqdm progress bar or not.
Can be helpful to disable in production deployments to keep the logs clean.
- `duplicate_documents`: How to handle duplicate documents. Parameter options: ('skip', 'overwrite', 'fail')
skip: Ignore duplicate documents
overwrite: Update any existing documents with the same ID when adding documents.
fail: An error is raised if the document ID of the document being added already exists.

<a name="weaviate.WeaviateDocumentStore.get_document_by_id"></a>
#### get\_document\_by\_id

```python
 | get_document_by_id(id: str, index: Optional[str] = None) -> Optional[Document]
```

Fetch a document by specifying its text id string

<a name="weaviate.WeaviateDocumentStore.get_documents_by_id"></a>
#### get\_documents\_by\_id

```python
 | get_documents_by_id(ids: List[str], index: Optional[str] = None, batch_size: int = 10_000) -> List[Document]
```

Fetch documents by specifying a list of text id strings

<a name="weaviate.WeaviateDocumentStore.write_documents"></a>
#### write\_documents

```python
 | write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None, batch_size: int = 10_000, duplicate_documents: Optional[str] = None)
```

Add new documents to the DocumentStore.

**Arguments**:

- `documents`: List of `Dicts` or List of `Documents`. Passing an embedding/vector is mandatory in case Weaviate is not
configured with a module. If a module is configured, the embedding is automatically generated by Weaviate.
- `index`: index name for storing the docs and metadata
- `batch_size`: When working with a large number of documents, batching can help reduce the memory footprint.
- `duplicate_documents`: How to handle duplicate documents. Parameter options: ('skip', 'overwrite', 'fail')
skip: Ignore duplicate documents
overwrite: Update any existing documents with the same ID when adding documents.
fail: An error is raised if the document ID of the document being added already exists.

**Raises**:

- `DuplicateDocumentError`: Exception raised on duplicate documents

**Returns**:

None
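
A usage sketch, assuming the store is configured with a vectorization module (otherwise each document must carry an embedding) and that the `Document` class is exposed at the package root in this version:

```python
from haystack import Document

docs = [
    Document(text="Weaviate is a vector search engine.", meta={"name": "intro"}),
    Document(text="Haystack supports several document stores.", meta={"name": "stores"}),
]
# Overwrite any documents that already exist under the same ID.
document_store.write_documents(docs, duplicate_documents="overwrite")
```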

<a name="weaviate.WeaviateDocumentStore.update_document_meta"></a>
#### update\_document\_meta

```python
 | update_document_meta(id: str, meta: Dict[str, str])
```

Update the metadata dictionary of a document by specifying its string id

<a name="weaviate.WeaviateDocumentStore.get_document_count"></a>
#### get\_document\_count

```python
 | get_document_count(filters: Optional[Dict[str, List[str]]] = None, index: Optional[str] = None) -> int
```

Return the number of documents in the document store.

<a name="weaviate.WeaviateDocumentStore.get_all_documents"></a>
#### get\_all\_documents

```python
 | get_all_documents(index: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None, return_embedding: Optional[bool] = None, batch_size: int = 10_000) -> List[Document]
```

Get documents from the document store.

**Arguments**:

- `index`: Name of the index to get the documents from. If None, the
DocumentStore's default index (self.index) will be used.
- `filters`: Optional filters to narrow down the documents to return.
Example: {"name": ["some", "more"], "category": ["only_one"]}
- `return_embedding`: Whether to return the document embeddings.
- `batch_size`: When working with a large number of documents, batching can help reduce the memory footprint.

<a name="weaviate.WeaviateDocumentStore.get_all_documents_generator"></a>
#### get\_all\_documents\_generator

```python
 | get_all_documents_generator(index: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None, return_embedding: Optional[bool] = None, batch_size: int = 10_000) -> Generator[Document, None, None]
```

Get documents from the document store. Under the hood, documents are fetched in batches from the
document store and yielded as individual documents. This method can be used to iteratively process
a large number of documents without having to load all documents into memory.

**Arguments**:

- `index`: Name of the index to get the documents from. If None, the
DocumentStore's default index (self.index) will be used.
- `filters`: Optional filters to narrow down the documents to return.
Example: {"name": ["some", "more"], "category": ["only_one"]}
- `return_embedding`: Whether to return the document embeddings.
- `batch_size`: When working with a large number of documents, batching can help reduce the memory footprint.
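
A streaming sketch: iterate over documents without materializing the full result list (the filter field and value are illustrative):

```python
# Documents are fetched in batches of 1,000 and yielded one at a time.
for doc in document_store.get_all_documents_generator(
    filters={"name": ["intro"]},
    batch_size=1_000,
):
    print(doc.id, doc.meta)
```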

<a name="weaviate.WeaviateDocumentStore.query"></a>
#### query

```python
 | query(query: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None, top_k: int = 10, custom_query: Optional[str] = None, index: Optional[str] = None) -> List[Document]
```

Scan through the documents in the DocumentStore and return a small number of documents
that are most relevant to the query, as defined by Weaviate semantic search.

**Arguments**:

- `query`: The query
- `filters`: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field
- `top_k`: How many documents to return per query.
- `custom_query`: Custom query that will be executed using the query.raw method. For more details, refer to
https://www.semi.technology/developers/weaviate/current/graphql-references/filters.html
- `index`: The name of the index in the DocumentStore from which to retrieve documents
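
A sketch combining semantic search with a scalar filter (the "category" field and its value are illustrative):

```python
results = document_store.query(
    query="What is a vector search engine?",
    filters={"category": ["only_one"]},  # restrict dense retrieval to this subset
    top_k=5,
)
for doc in results:
    print(doc.text)
```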

<a name="weaviate.WeaviateDocumentStore.query_by_embedding"></a>
#### query\_by\_embedding

```python
 | query_by_embedding(query_emb: np.ndarray, filters: Optional[dict] = None, top_k: int = 10, index: Optional[str] = None, return_embedding: Optional[bool] = None) -> List[Document]
```

Find the documents that are most similar to the provided `query_emb` by using a vector similarity metric.

**Arguments**:

- `query_emb`: Embedding of the query (e.g. gathered from DPR)
- `filters`: Optional filters to narrow down the search space.
Example: {"name": ["some", "more"], "category": ["only_one"]}
- `top_k`: How many documents to return
- `index`: index name for storing the docs and metadata
- `return_embedding`: Whether to return document embeddings

**Returns**:

<a name="weaviate.WeaviateDocumentStore.update_embeddings"></a>
#### update\_embeddings

```python
 | update_embeddings(retriever, index: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None, update_existing_embeddings: bool = True, batch_size: int = 10_000)
```

Updates the embeddings in the document store using the encoding model specified in the retriever.
This can be useful if you want to change the embeddings for your documents (e.g. after changing the retriever config).

**Arguments**:

- `retriever`: Retriever to use to update the embeddings.
- `index`: Index name to update
- `update_existing_embeddings`: Weaviate mandates an embedding when creating the document itself.
This option must always be true for Weaviate, and it will update the embeddings for all the documents.
- `filters`: Optional filters to narrow down the documents for which embeddings are to be updated.
Example: {"name": ["some", "more"], "category": ["only_one"]}
- `batch_size`: When working with a large number of documents, batching can help reduce the memory footprint.

**Returns**:

None
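
A re-embedding sketch using the DPR retriever documented later on this page (import path and defaults assumed for this Haystack version):

```python
from haystack.retriever.dense import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)
# Recompute embeddings for all stored documents with the retriever's encoder.
document_store.update_embeddings(retriever, update_existing_embeddings=True)
```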

<a name="weaviate.WeaviateDocumentStore.delete_all_documents"></a>
#### delete\_all\_documents

```python
 | delete_all_documents(index: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None)
```

Delete documents in an index. All documents are deleted if no filters are passed.

**Arguments**:

- `index`: Index name to delete the document from.
- `filters`: Optional filters to narrow down the documents to be deleted.

**Returns**:

None

@@ -15,4 +15,4 @@ pydoc-markdown pydoc-markdown-pipelines.yml
pydoc-markdown pydoc-markdown-knowledge-graph.yml
pydoc-markdown pydoc-markdown-graph-retriever.yml
pydoc-markdown pydoc-markdown-evaluation.yml
pydoc-markdown pydoc-markdown-ranker.yml

@@ -1,7 +1,7 @@
loaders:
- type: python
  search_path: [../../../../haystack/document_store]
  modules: ['base', 'elasticsearch', 'memory', 'sql', 'faiss', 'milvus']
  modules: ['base', 'elasticsearch', 'memory', 'sql', 'faiss', 'milvus', 'weaviate']
  ignore_when_discovered: ['__init__']
processor:
- type: filter

19 docs/_src/api/api/pydoc-markdown-ranker.yml Normal file
@@ -0,0 +1,19 @@
loaders:
- type: python
  search_path: [../../../../haystack/ranker]
  modules: ['base', 'farm']
  ignore_when_discovered: ['__init__']
processor:
- type: filter
  expression: not name.startswith('_') and default()
- documented_only: true
- do_not_filter_modules: false
- skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: ranker.md

206 docs/_src/api/api/ranker.md Normal file
@@ -0,0 +1,206 @@

<a name="base"></a>
# Module base

<a name="base.BaseRanker"></a>
## BaseRanker Objects

```python
class BaseRanker(BaseComponent)
```

<a name="base.BaseRanker.timing"></a>
#### timing

```python
 | timing(fn, attr_name)
```

Wrapper method used to time functions.

<a name="base.BaseRanker.eval"></a>
#### eval

```python
 | eval(label_index: str = "label", doc_index: str = "eval_document", label_origin: str = "gold_label", top_k: int = 10, open_domain: bool = False, return_preds: bool = False) -> dict
```

Performs evaluation of the Ranker.
The Ranker is evaluated in the same way as a Retriever: based on whether it finds the correct document given the query string and at which
position in the ranking of documents the correct document is.

Returns a dict containing the following metrics:

- "recall": Proportion of questions for which the correct document is among the retrieved documents
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
average precision is normalized by the number of retrieved relevant documents per query.
If ``open_domain=False``, average precision is normalized by the number of all relevant documents
per query.

**Arguments**:

- `label_index`: Index/Table in the DocumentStore where labeled questions are stored
- `doc_index`: Index/Table in the DocumentStore where documents that are used for evaluation are stored
- `top_k`: How many documents to return per query
- `open_domain`: If ``True``, retrieval will be evaluated by checking if the answer string to a question is
contained in the retrieved docs (common approach in open-domain QA).
If ``False``, retrieval uses a stricter evaluation that checks if the retrieved document ids
are within ids explicitly stated in the labels.
- `return_preds`: Whether to add predictions in the returned dictionary. If True, the returned dictionary
contains the keys "predictions" and "metrics".
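
A call sketch, assuming `ranker` is an initialized Ranker and that labels and evaluation documents are already indexed under the default index names (the metric keys are the ones listed above):

```python
metrics = ranker.eval(label_index="label", doc_index="eval_document", top_k=10)
print(metrics["recall"], metrics["mrr"], metrics["map"])
```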

<a name="farm"></a>
# Module farm

<a name="farm.FARMRanker"></a>
## FARMRanker Objects

```python
class FARMRanker(BaseRanker)
```

Transformer-based model for document re-ranking using the TextPairClassifier of the FARM framework (https://github.com/deepset-ai/FARM).
While the underlying model can vary (BERT, Roberta, DistilBERT, ...), the interface remains the same.

With a FARMRanker, you can:

- directly get predictions via predict() (see the sketch after this list)
- fine-tune the model on TextPair data via train()
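
A minimal sketch of the first use, assuming `from haystack.ranker import FARMRanker` is the import path in this version and `candidate_docs` is a list of `Document` objects from a retriever:

```python
from haystack.ranker import FARMRanker

ranker = FARMRanker(model_name_or_path="bert-base-cased", top_k=5)
# Re-rank retrieved candidates by text-pair similarity with the query.
reranked = ranker.predict(query="Who wrote Faust?", documents=candidate_docs)
```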

<a name="farm.FARMRanker.__init__"></a>
#### \_\_init\_\_

```python
 | __init__(model_name_or_path: Union[str, Path], model_version: Optional[str] = None, batch_size: int = 50, use_gpu: bool = True, top_k: int = 10, num_processes: Optional[int] = None, max_seq_len: int = 256, progress_bar: bool = True)
```

**Arguments**:

- `model_name_or_path`: Directory of a saved model or the name of a public model, e.g. 'bert-base-cased',
'deepset/bert-base-cased-squad2', 'distilbert-base-uncased-distilled-squad'.
See https://huggingface.co/models for a full list of available models.
- `model_version`: The version of the model to use from the HuggingFace model hub. Can be a tag name, branch name, or commit hash.
- `batch_size`: Number of samples the model receives in one batch for inference.
Memory consumption is much lower in inference mode. Recommendation: Increase the batch size
to a value so only a single batch is used.
- `use_gpu`: Whether to use GPU (if available)
- `top_k`: The maximum number of documents to return
- `num_processes`: The number of processes for `multiprocessing.Pool`. Set to a value of 0 to disable
multiprocessing. Set to None to let the Inferencer determine the optimum number. If you
want to debug the Language Model, you might need to disable multiprocessing!
- `max_seq_len`: Max sequence length of one input text for the model
- `progress_bar`: Whether to show a tqdm progress bar or not.
Can be helpful to disable in production deployments to keep the logs clean.

<a name="farm.FARMRanker.train"></a>
#### train

```python
 | train(data_dir: str, train_filename: str, dev_filename: Optional[str] = None, test_filename: Optional[str] = None, use_gpu: Optional[bool] = None, batch_size: int = 10, n_epochs: int = 2, learning_rate: float = 1e-5, max_seq_len: Optional[int] = None, warmup_proportion: float = 0.2, dev_split: float = 0, evaluate_every: int = 300, save_dir: Optional[str] = None, num_processes: Optional[int] = None, use_amp: str = None)
```

Fine-tune a model on a TextPairClassification dataset. Options:

- Take a plain language model (e.g. `bert-base-cased`) and train it for TextPairClassification
- Take a TextPairClassification model and fine-tune it for your domain

**Arguments**:

- `data_dir`: Path to directory containing your training data in SQuAD style
- `train_filename`: Filename of training data
- `dev_filename`: Filename of dev / eval data
- `test_filename`: Filename of test data
- `dev_split`: Instead of specifying a dev_filename, you can also specify a ratio (e.g. 0.1) here
that gets split off from training data for eval.
- `use_gpu`: Whether to use GPU (if available)
- `batch_size`: Number of samples the model receives in one batch for training
- `n_epochs`: Number of iterations on the whole training data set
- `learning_rate`: Learning rate of the optimizer
- `max_seq_len`: Maximum text length (in tokens). Everything longer gets cut down.
- `warmup_proportion`: Proportion of training steps until the maximum learning rate is reached.
Until that point the LR is increasing linearly. After that it's decreasing again linearly.
Options for different schedules are available in FARM.
- `evaluate_every`: Evaluate the model every X steps on the hold-out eval dataset
- `save_dir`: Path to store the final model
- `num_processes`: The number of processes for `multiprocessing.Pool` during preprocessing.
Set to a value of 1 to disable multiprocessing. When set to 1, you cannot split away a dev set from the train set.
Set to None to use all CPU cores minus one.
- `use_amp`: Optimization level of NVIDIA's automatic mixed precision (AMP). The higher the level, the faster the model.
Available options:
None (Don't use AMP)
"O0" (Normal FP32 training)
"O1" (Mixed Precision => Recommended)
"O2" (Almost FP16)
"O3" (Pure FP16).
See details on: https://nvidia.github.io/apex/amp.html

**Returns**:

None
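
A fine-tuning sketch; the data directory and file names are illustrative:

```python
ranker.train(
    data_dir="data/ranking",       # directory with SQuAD-style training data
    train_filename="train.json",
    dev_filename="dev.json",
    n_epochs=2,
    batch_size=10,
    save_dir="saved_models/farm_ranker",
)
```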

<a name="farm.FARMRanker.update_parameters"></a>
#### update\_parameters

```python
 | update_parameters(max_seq_len: Optional[int] = None)
```

Hot update parameters of a loaded Ranker. It may not be safe when processing concurrent requests.

<a name="farm.FARMRanker.save"></a>
#### save

```python
 | save(directory: Path)
```

Saves the Ranker model so that it can be reused at a later point in time.

**Arguments**:

- `directory`: Directory where the Ranker model should be saved

<a name="farm.FARMRanker.predict_batch"></a>
#### predict\_batch

```python
 | predict_batch(query_doc_list: List[dict], top_k: int = None, batch_size: int = None)
```

Use the loaded Ranker model to rank, for a list of queries, each query's supplied list of Documents.

Returns a list of dictionaries of queries and their lists of Documents, sorted by (descending) similarity with the query.

**Arguments**:

- `query_doc_list`: List of dictionaries containing queries with their retrieved documents
- `top_k`: The maximum number of answers to return for each query
- `batch_size`: Number of samples the model receives in one batch for inference

**Returns**:

List of dictionaries containing the query and a ranked list of Documents

<a name="farm.FARMRanker.predict"></a>
#### predict

```python
 | predict(query: str, documents: List[Document], top_k: Optional[int] = None)
```

Use the loaded Ranker model to re-rank the supplied list of Documents.

Returns a list of Documents sorted by (descending) TextPairClassification similarity with the query.

**Arguments**:

- `query`: Query string
- `documents`: List of Documents to be re-ranked
- `top_k`: The maximum number of documents to return

**Returns**:

List of Documents

@@ -238,7 +238,7 @@ Karpukhin, Vladimir, et al. (2020): "Dense Passage Retrieval for Open-Domain Que
#### \_\_init\_\_

```python
 | __init__(document_store: BaseDocumentStore, query_embedding_model: Union[Path, str] = "facebook/dpr-question_encoder-single-nq-base", passage_embedding_model: Union[Path, str] = "facebook/dpr-ctx_encoder-single-nq-base", single_model_path: Optional[Union[Path, str]] = None, model_version: Optional[str] = None, max_seq_len_query: int = 64, max_seq_len_passage: int = 256, top_k: int = 10, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, use_fast_tokenizers: bool = True, infer_tokenizer_classes: bool = False, similarity_function: str = "dot_product", progress_bar: bool = True)
 | __init__(document_store: BaseDocumentStore, query_embedding_model: Union[Path, str] = "facebook/dpr-question_encoder-single-nq-base", passage_embedding_model: Union[Path, str] = "facebook/dpr-ctx_encoder-single-nq-base", model_version: Optional[str] = None, max_seq_len_query: int = 64, max_seq_len_passage: int = 256, top_k: int = 10, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, use_fast_tokenizers: bool = True, infer_tokenizer_classes: bool = False, similarity_function: str = "dot_product", progress_bar: bool = True)
```

Init the Retriever incl. the two encoder models from a local or remote model checkpoint.

@@ -266,9 +266,6 @@ The checkpoint format matches huggingface transformers' model format
- `passage_embedding_model`: Local path or remote name of passage encoder checkpoint. The format equals the
one used by hugging-face transformers' modelhub models
Currently available remote names: ``"facebook/dpr-ctx_encoder-single-nq-base"``
- `single_model_path`: Local path or remote name of a query and passage embedder in one single model. Those
models are typically trained within FARM.
Currently available remote names: TODO add FARM DPR model to HF modelhub
- `model_version`: The version of the model to use from the HuggingFace model hub. Can be a tag name, branch name, or commit hash.
- `max_seq_len_query`: Longest length of each query sequence. Maximum number of tokens for the query text. Longer ones will be cut down.
- `max_seq_len_passage`: Longest length of each passage/context sequence. Maximum number of tokens for the passage text. Longer ones will be cut down.

@@ -407,7 +404,7 @@ None

```python
 | @classmethod
 | load(cls, load_dir: Union[Path, str], document_store: BaseDocumentStore, max_seq_len_query: int = 64, max_seq_len_passage: int = 256, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, use_fast_tokenizers: bool = True, similarity_function: str = "dot_product", query_encoder_dir: str = "query_encoder", passage_encoder_dir: str = "passage_encoder")
 | load(cls, load_dir: Union[Path, str], document_store: BaseDocumentStore, max_seq_len_query: int = 64, max_seq_len_passage: int = 256, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, use_fast_tokenizers: bool = True, similarity_function: str = "dot_product", query_encoder_dir: str = "query_encoder", passage_encoder_dir: str = "passage_encoder", infer_tokenizer_classes: bool = False)
```

Load DensePassageRetriever from the specified directory.