Mirror of https://github.com/deepset-ai/haystack.git, synced 2025-10-17 02:48:30 +00:00
Api pages (#2248)
* Update Readme WIP
* Update Documentation & Code Style
* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This commit is contained in:
parent bb107e5027
commit b563b6622c
@@ -3600,3 +3600,131 @@ exists.

None

<a id="utils"></a>

# Module utils

<a id="utils.eval_data_from_json"></a>

#### eval\_data\_from\_json

```python
def eval_data_from_json(filename: str, max_docs: Union[int, bool] = None, preprocessor: PreProcessor = None, open_domain: bool = False) -> Tuple[List[Document], List[Label]]
```

Read Documents + Labels from a SQuAD-style file.

Documents and Labels can then be indexed to the DocumentStore and be used for evaluation.

**Arguments**:

- `filename`: Path to file in SQuAD format
- `max_docs`: This sets the number of documents that will be loaded. By default, this is set to None, thus reading in all available eval documents.
- `open_domain`: Set this to True if your file is an open domain dataset where two different answers to the same question might be found in different contexts.
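A minimal usage sketch (not part of the original docs): the file path is a placeholder, and the import path follows the `document_stores` module layout shown elsewhere in this diff.

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.document_stores.utils import eval_data_from_json

# Placeholder SQuAD-style annotation file; point this at your own data.
docs, labels = eval_data_from_json("data/squad_dev.json", max_docs=10)

# Index both the documents and their labels so they can be used for evaluation.
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)
document_store.write_labels(labels)
```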
<a id="utils.eval_data_from_jsonl"></a>
|
||||
|
||||
#### eval\_data\_from\_jsonl
|
||||
|
||||
```python
|
||||
def eval_data_from_jsonl(filename: str, batch_size: Optional[int] = None, max_docs: Union[int, bool] = None, preprocessor: PreProcessor = None, open_domain: bool = False) -> Generator[Tuple[List[Document], List[Label]], None, None]
|
||||
```
|
||||
|
||||
Read Documents + Labels from a SQuAD-style file in jsonl format, i.e. one document per line.
|
||||
|
||||
Document and Labels can then be indexed to the DocumentStore and be used for evaluation.
|
||||
|
||||
This is a generator which will yield one tuple per iteration containing a list
|
||||
of batch_size documents and a list with the documents' labels.
|
||||
If batch_size is set to None, this method will yield all documents and labels.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `filename`: Path to file in SQuAD format
|
||||
- `max_docs`: This sets the number of documents that will be loaded. By default, this is set to None, thus reading in all available eval documents.
|
||||
- `open_domain`: Set this to True if your file is an open domain dataset where two different answers to the same question might be found in different contexts.
|
||||
|
||||
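A sketch of streaming indexing with the generator (placeholder file path; the batch size is arbitrary and the import path is the one suggested by this diff):

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.document_stores.utils import eval_data_from_jsonl

document_store = InMemoryDocumentStore()

# Index the eval data in batches of 1,000 documents instead of loading the whole file at once.
for docs, labels in eval_data_from_jsonl("data/squad_dev.jsonl", batch_size=1000):
    document_store.write_documents(docs)
    document_store.write_labels(labels)
```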
<a id="utils.squad_json_to_jsonl"></a>
|
||||
|
||||
#### squad\_json\_to\_jsonl
|
||||
|
||||
```python
|
||||
def squad_json_to_jsonl(squad_file: str, output_file: str)
|
||||
```
|
||||
|
||||
Converts a SQuAD-json-file into jsonl format with one document per line.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `squad_file`: SQuAD-file in json format.
|
||||
- `output_file`: Name of output file (SQuAD in jsonl format)
|
||||
|
||||
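For illustration, a hedged sketch converting a placeholder SQuAD file so that it can then be streamed with `eval_data_from_jsonl`:

```python
from haystack.document_stores.utils import squad_json_to_jsonl

# One-off conversion; afterwards the jsonl file can be read batch-wise with eval_data_from_jsonl.
squad_json_to_jsonl(squad_file="data/squad_dev.json", output_file="data/squad_dev.jsonl")
```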
<a id="utils.convert_date_to_rfc3339"></a>
|
||||
|
||||
#### convert\_date\_to\_rfc3339
|
||||
|
||||
```python
|
||||
def convert_date_to_rfc3339(date: str) -> str
|
||||
```
|
||||
|
||||
Converts a date to RFC3339 format, as Weaviate requires dates to be in RFC3339 format including the time and
|
||||
timezone.
|
||||
|
||||
If the provided date string does not contain a time and/or timezone, we use 00:00 as default time
|
||||
and UTC as default time zone.
|
||||
|
||||
This method cannot be part of WeaviateDocumentStore, as this would result in a circular import between weaviate.py
|
||||
and filter_utils.py.
|
||||
|
||||
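A small, hedged example; the exact output string is an assumption based on the defaults described above (00:00 as time, UTC as timezone):

```python
from haystack.document_stores.utils import convert_date_to_rfc3339

# A date without time or timezone should be padded with the documented defaults,
# i.e. something along the lines of "2022-03-01T00:00:00Z" (exact formatting not verified here).
print(convert_date_to_rfc3339("2022-03-01"))
```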
<a id="utils.es_index_to_document_store"></a>
|
||||
|
||||
#### es\_index\_to\_document\_store
|
||||
|
||||
```python
|
||||
def es_index_to_document_store(document_store: "BaseDocumentStore", original_index_name: str, original_content_field: str, original_name_field: Optional[str] = None, included_metadata_fields: Optional[List[str]] = None, excluded_metadata_fields: Optional[List[str]] = None, store_original_ids: bool = True, index: Optional[str] = None, preprocessor: Optional[PreProcessor] = None, batch_size: int = 10_000, host: Union[str, List[str]] = "localhost", port: Union[int, List[int]] = 9200, username: str = "", password: str = "", api_key_id: Optional[str] = None, api_key: Optional[str] = None, aws4auth=None, scheme: str = "http", ca_certs: Optional[str] = None, verify_certs: bool = True, timeout: int = 30, use_system_proxy: bool = False) -> "BaseDocumentStore"
```

This function provides brownfield support of existing Elasticsearch indexes by converting each of the records in the provided index to haystack `Document` objects and writing them to the specified `DocumentStore`. It can be used on a regular basis in order to add new records of the Elasticsearch index to the `DocumentStore`.

**Arguments**:

- `document_store`: The haystack `DocumentStore` to write the converted `Document` objects to.
- `original_index_name`: Elasticsearch index containing the records to be converted.
- `original_content_field`: Elasticsearch field containing the text to be put in the `content` field of the resulting haystack `Document` objects.
- `original_name_field`: Optional Elasticsearch field containing the title of the Document.
- `included_metadata_fields`: List of Elasticsearch fields that shall be stored in the `meta` field of the resulting haystack `Document` objects. If `included_metadata_fields` and `excluded_metadata_fields` are `None`, all the fields found in the Elasticsearch records will be kept as metadata. You can specify only one of the `included_metadata_fields` and `excluded_metadata_fields` parameters.
- `excluded_metadata_fields`: List of Elasticsearch fields that shall be excluded from the `meta` field of the resulting haystack `Document` objects. If `included_metadata_fields` and `excluded_metadata_fields` are `None`, all the fields found in the Elasticsearch records will be kept as metadata. You can specify only one of the `included_metadata_fields` and `excluded_metadata_fields` parameters.
- `store_original_ids`: Whether to store the ID a record had in the original Elasticsearch index at the `"_original_es_id"` metadata field of the resulting haystack `Document` objects. This should be set to `True` if you want to continuously update the `DocumentStore` with new records inside your Elasticsearch index. If this parameter was set to `False` on the first call of `es_index_to_document_store`, all the indexed Documents in the `DocumentStore` will be overwritten in the second call.
- `index`: Name of index in `document_store` to use to store the resulting haystack `Document` objects.
- `preprocessor`: Optional PreProcessor that will be applied on the content field of the original Elasticsearch record.
- `batch_size`: Number of records to process at once.
- `host`: URL(s) of Elasticsearch nodes.
- `port`: Port(s) of Elasticsearch nodes.
- `username`: Username (standard authentication via http_auth).
- `password`: Password (standard authentication via http_auth).
- `api_key_id`: ID of the API key (alternative authentication mode to the above http_auth).
- `api_key`: Secret value of the API key (alternative authentication mode to the above http_auth).
- `aws4auth`: Authentication for usage with AWS Elasticsearch (can be generated with the requests-aws4auth package).
- `scheme`: `"https"` or `"http"`, protocol used to connect to your Elasticsearch instance.
- `ca_certs`: Root certificates for SSL: it is a path to certificate authority (CA) certs on disk. You can use the certifi package with `certifi.where()` to find where the CA certs file is located on your machine.
- `verify_certs`: Whether to be strict about CA certificates.
- `timeout`: Number of seconds after which an Elasticsearch request times out.
- `use_system_proxy`: Whether to use system proxy.
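A hedged sketch of a typical call; the index and field names below are placeholders, not part of the original docs:

```python
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.document_stores.utils import es_index_to_document_store

# Target store that the converted Documents will be written to.
document_store = ElasticsearchDocumentStore(index="haystack_docs")

# "products" is an existing (hypothetical) Elasticsearch index; "description" holds the text,
# "name" holds the title, and "internal_notes" is a field we do not want kept as metadata.
document_store = es_index_to_document_store(
    document_store=document_store,
    original_index_name="products",
    original_content_field="description",
    original_name_field="name",
    excluded_metadata_fields=["internal_notes"],
    index="haystack_docs",
    host="localhost",
    port=9200,
)
```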
@@ -1,7 +1,7 @@
loaders:
  - type: python
    search_path: [../../../../haystack/document_stores]
    modules: ['base', 'elasticsearch', 'memory', 'sql', 'faiss', 'milvus1', 'milvus2', 'weaviate', 'graphdb', 'deepsetcloud']
    modules: ['base', 'elasticsearch', 'memory', 'sql', 'faiss', 'milvus1', 'milvus2', 'weaviate', 'graphdb', 'deepsetcloud', 'utils']
    ignore_when_discovered: ['__init__']
processors:
  - type: filter
@@ -38,13 +38,6 @@ Make sure you enable the GPU runtime to experience decent speed in this tutorial
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]
```

```python
from haystack.modeling.utils import initialize_device_settings

devices, n_gpu = initialize_device_settings(use_cuda=True)
```

## Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), then you can manually download and execute Elasticsearch from source.
@@ -137,6 +130,7 @@ document_store.add_eval_data(
from haystack.nodes import ElasticsearchRetriever

retriever = ElasticsearchRetriever(document_store=document_store)

# Alternative: Evaluate dense retrievers (DensePassageRetriever or EmbeddingRetriever)
# DensePassageRetriever uses two separate transformer based encoders for query and document.
# In contrast, EmbeddingRetriever uses a single encoder for both.
@@ -145,6 +139,7 @@ retriever = ElasticsearchRetriever(document_store=document_store)
# the max_seq_len limitations of Transformers
# The SentenceTransformer model "all-mpnet-base-v2" generally works well with the EmbeddingRetriever on any kind of English text.
# For more information check out the documentation at: https://www.sbert.net/docs/pretrained_models.html

# from haystack.retriever import DensePassageRetriever, EmbeddingRetriever
# retriever = DensePassageRetriever(document_store=document_store,
#                                   query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
@@ -171,6 +166,7 @@ pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

# The evaluation also works with any other pipeline.
# For example you could use a DocumentSearchPipeline as an alternative:

# from haystack.pipelines import DocumentSearchPipeline
# pipeline = DocumentSearchPipeline(retriever=retriever)
```
@@ -188,21 +184,34 @@ The generation of predictions is separated from the calculation of metrics. This
from haystack.schema import EvaluationResult, MultiLabel

# We can load evaluation labels from the document store
# We are also opting to filter out no_answer samples
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=False)
eval_labels = [label for label in eval_labels if not label.no_answer]  # filter out no_answer cases

## Alternative: Define queries and labels directly

# Alternative: Define queries and labels directly
# from haystack.schema import Answer, Document, Label, Span
# eval_labels = [
#     MultiLabel(labels=[Label(query="who is written in the book of life",
#                              answer=Answer(answer="every person who is destined for Heaven or the World to Come",
#                                            offsets_in_context=[Span(374, 434)]),
#                              document=Document(id='1b090aec7dbd1af6739c4c80f8995877-0',
#                                                content_type="text",
# content='Book of Life - wikipedia Book of Life Jump to: navigation, search This article is about the book mentioned in Christian and Jewish religious teachings. For other uses, see The Book of Life. In Christianity and Judaism, the Book of Life (Hebrew: ספר החיים, transliterated Sefer HaChaim; Greek: βιβλίον τῆς ζωῆς Biblíon tēs Zōēs) is the book in which God records the names of every person who is destined for Heaven or the World to Come. According to the Talmud it is open on Rosh Hashanah, as is its analog for the wicked, the Book of the Dead. For this reason extra mention is made for the Book of Life during Amidah recitations during the Days of Awe, the ten days between Rosh Hashanah, the Jewish new year, and Yom Kippur, the day of atonement (the two High Holidays, particularly in the prayer Unetaneh Tokef). Contents (hide) 1 In the Hebrew Bible 2 Book of Jubilees 3 References in the New Testament 4 The eschatological or annual roll-call 5 Fundraising 6 See also 7 Notes 8 References In the Hebrew Bible(edit) In the Hebrew Bible the Book of Life - the book or muster-roll of God - records forever all people considered righteous before God'),
#                              is_correct_answer=True,
#                              is_correct_document=True,
#                              origin="gold-label")])
# ]
#     MultiLabel(
#         labels=[
#             Label(
#                 query="who is written in the book of life",
#                 answer=Answer(
#                     answer="every person who is destined for Heaven or the World to Come",
#                     offsets_in_context=[Span(374, 434)]
#                 ),
#                 document=Document(
#                     id='1b090aec7dbd1af6739c4c80f8995877-0',
#                     content_type="text",
#                     content='Book of Life - wikipedia Book of Life Jump to: navigation, search This article is
#                     about the book mentioned in Christian and Jewish religious teachings...'
#                 ),
#                 is_correct_answer=True,
#                 is_correct_document=True,
#                 origin="gold-label"
#             )
#         ]
#     )
# ]

# Similar to pipeline.run() we can execute pipeline.eval()
eval_result = pipeline.eval(labels=eval_labels, params={"Retriever": {"top_k": 5}})
@@ -226,13 +235,14 @@ reader_result.head()

```python
# We can filter for all documents retrieved for a given query
retriever_book_of_life = retriever_result[retriever_result["query"] == "who is written in the book of life"]
query = "who is written in the book of life"
retriever_book_of_life = retriever_result[retriever_result["query"] == query]
```

```python
# We can also filter for all answers predicted for a given query
reader_book_of_life = reader_result[reader_result["query"] == "who is written in the book of life"]
reader_book_of_life = reader_result[reader_result["query"] == query]
```
@@ -242,7 +252,9 @@ eval_result.save("../")
```

## Calculating Evaluation Metrics
Load an EvaluationResult to quickly calculate standard evaluation metrics for all predictions, such as F1-score of each individual prediction of the Reader node or recall of the retriever.
Load an EvaluationResult to quickly calculate standard evaluation metrics for all predictions,
such as F1-score of each individual prediction of the Reader node or recall of the retriever.
To learn more about the metrics, see [Evaluation Metrics](https://haystack.deepset.ai/guides/evaluation#metrics-retrieval)

```python
@@ -281,14 +293,15 @@ metrics = advanced_eval_result.calculate_metrics()
print(metrics["Reader"]["sas"])
```

## Isolated Evaluation Mode to Understand Upper Bounds of the Reader's Performance
The isolated node evaluation uses labels as input to the reader node instead of the output of the preceding retriever node.
Thereby, we can additionally calculate the upper bounds of the evaluation metrics of the reader.
## Isolated Evaluation Mode
The isolated node evaluation uses labels as input to the Reader node instead of the output of the preceding Retriever node.
Thereby, we can additionally calculate the upper bounds of the evaluation metrics of the Reader. Note that even with isolated evaluation enabled, integrated evaluation will still be running.

```python
eval_result_with_upper_bounds = pipeline.eval(
    labels=eval_labels, params={"Retriever": {"top_k": 1}}, add_isolated_node_eval=True
    labels=eval_labels, params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 5}}, add_isolated_node_eval=True
)
```
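To read off the isolated (upper-bound) numbers next to the integrated ones, a sketch along these lines should work, assuming `EvaluationResult.calculate_metrics` accepts an `eval_mode` switch as in Haystack 1.x:

```python
# Integrated metrics: the Reader is evaluated on the documents the Retriever actually returned.
integrated_metrics = eval_result_with_upper_bounds.calculate_metrics()

# Isolated metrics: the Reader is evaluated on the gold documents, giving an upper bound
# for its contribution (the eval_mode argument is an assumption, not taken from this diff).
isolated_metrics = eval_result_with_upper_bounds.calculate_metrics(eval_mode="isolated")

print("Reader F1 (integrated):", integrated_metrics["Reader"]["f1"])
print("Reader F1 (isolated):  ", isolated_metrics["Reader"]["f1"])
```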
@@ -304,6 +317,7 @@ Here we evaluate only the retriever, based on whether the gold_label document is

```python
## Evaluate Retriever on its own
# Note that no_answer samples are omitted when evaluation is performed with this method
retriever_eval_results = retriever.eval(top_k=5, label_index=label_index, doc_index=doc_index)
# Retriever Recall is the proportion of questions for which the correct document containing the answer is
# among the correct documents
@@ -312,6 +326,16 @@ print("Retriever Recall:", retriever_eval_results["recall"])
print("Retriever Mean Avg Precision:", retriever_eval_results["map"])
```

Just as a sanity check, we can compare the recall from `retriever.eval()` with the multi hit recall from `pipeline.eval(add_isolated_node_eval=True)`.
These two recall metrics are only comparable since we chose to filter out no_answer samples when generating eval_labels.

```python
metrics = eval_result_with_upper_bounds.calculate_metrics()
print(metrics["Retriever"]["recall_multi_hit"])
```

## Evaluation of Individual Components: Reader
Here we evaluate only the reader in a closed domain fashion, i.e. the reader is given one query
and its corresponding relevant document and metrics are calculated on whether the right position in this text is selected by
@@ -320,9 +344,7 @@ the model as the answer span (i.e. SQuAD style)

```python
# Evaluate Reader on its own
reader_eval_results = reader.eval(
    document_store=document_store, device=devices[0], label_index=label_index, doc_index=doc_index
)
reader_eval_results = reader.eval(document_store=document_store, label_index=label_index, doc_index=doc_index)
# Evaluation of Reader can also be done directly on a SQuAD-formatted file without passing the data to Elasticsearch
# reader_eval_results = reader.eval_on_file("../data/nq", "nq_dev_subset_v2.json", device=device)
@@ -314,7 +314,7 @@ def es_index_to_document_store(
    :param original_index_name: Elasticsearch index containing the records to be converted.
    :param original_content_field: Elasticsearch field containing the text to be put in the `content` field of the
        resulting haystack `Document` objects.
    :param original_name_field: Optional Elasticsearch field containing the title title of the Document.
    :param original_name_field: Optional Elasticsearch field containing the title of the Document.
    :param included_metadata_fields: List of Elasticsearch fields that shall be stored in the `meta` field of the
        resulting haystack `Document` objects. If `included_metadata_fields` and `excluded_metadata_fields` are `None`,
        all the fields found in the Elasticsearch records will be kept as metadata. You can specify only one of the
@@ -402,11 +402,11 @@
"# We can load evaluation labels from the document store\n",
"# We are also opting to filter out no_answer samples\n",
"eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=False)\n",
"eval_labels = [label for label in eval_labels if not label.no_answer] # filter out no_answer cases\n",
"eval_labels = [label for label in eval_labels if not label.no_answer]  # filter out no_answer cases\n",
"\n",
"## Alternative: Define queries and labels directly\n",
"\n",
"#eval_labels = [\n",
"# eval_labels = [\n",
"# MultiLabel(\n",
"# labels=[\n",
"# Label(\n",
@@ -427,13 +427,10 @@
"# )\n",
"# ]\n",
"# )\n",
"#]\n",
"# ]\n",
"\n",
"# Similar to pipeline.run() we can execute pipeline.eval()\n",
"eval_result = pipeline.eval(\n",
"    labels=eval_labels,\n",
"    params={\"Retriever\": {\"top_k\": 5}}\n",
")"
"eval_result = pipeline.eval(labels=eval_labels, params={\"Retriever\": {\"top_k\": 5}})"
]
},
{
@@ -987,12 +984,7 @@
"outputs": [],
"source": [
"eval_result_with_upper_bounds = pipeline.eval(\n",
"    labels=eval_labels,\n",
"    params={\n",
"        \"Retriever\": {\"top_k\": 5},\n",
"        \"Reader\": {\"top_k\": 5}\n",
"    }, \n",
"    add_isolated_node_eval=True\n",
"    labels=eval_labels, params={\"Retriever\": {\"top_k\": 5}, \"Reader\": {\"top_k\": 5}}, add_isolated_node_eval=True\n",
")"
]
},
@@ -1030,11 +1022,7 @@
"source": [
"## Evaluate Retriever on its own\n",
"# Note that no_answer samples are omitted when evaluation is performed with this method\n",
"retriever_eval_results = retriever.eval(\n",
"    top_k=5,\n",
"    label_index=label_index,\n",
"    doc_index=doc_index\n",
")\n",
"retriever_eval_results = retriever.eval(top_k=5, label_index=label_index, doc_index=doc_index)\n",
"# Retriever Recall is the proportion of questions for which the correct document containing the answer is\n",
"# among the correct documents\n",
"print(\"Retriever Recall:\", retriever_eval_results[\"recall\"])\n",
@@ -1081,11 +1069,7 @@
"outputs": [],
"source": [
"# Evaluate Reader on its own\n",
"reader_eval_results = reader.eval(\n",
"    document_store=document_store,\n",
"    label_index=label_index,\n",
"    doc_index=doc_index\n",
")\n",
"reader_eval_results = reader.eval(document_store=document_store, label_index=label_index, doc_index=doc_index)\n",
"# Evaluation of Reader can also be done directly on a SQuAD-formatted file without passing the data to Elasticsearch\n",
"# reader_eval_results = reader.eval_on_file(\"../data/nq\", \"nq_dev_subset_v2.json\", device=device)\n",
"\n",
@@ -129,10 +129,7 @@ def tutorial5_evaluation():
    # ]

    # Similar to pipeline.run() we can execute pipeline.eval()
    eval_result = pipeline.eval(
        labels=eval_labels,
        params={"Retriever": {"top_k": 5}}
    )
    eval_result = pipeline.eval(labels=eval_labels, params={"Retriever": {"top_k": 5}})

    # The EvaluationResult contains a pandas dataframe for each pipeline node.
    # That's why there are two dataframes in the EvaluationResult of an ExtractiveQAPipeline.
@@ -199,9 +196,7 @@ def tutorial5_evaluation():
    # Thereby, we can additionally calculate the upper bounds of the evaluation metrics of the Reader.
    # Note that even with isolated evaluation enabled, integrated evaluation will still be running.
    eval_result_with_upper_bounds = pipeline.eval(
        labels=eval_labels,
        params={"Retriever": {"top_k": 5}},
        add_isolated_node_eval=True
        labels=eval_labels, params={"Retriever": {"top_k": 5}}, add_isolated_node_eval=True
    )
    pipeline.print_eval_report(eval_result_with_upper_bounds)
@@ -212,11 +207,7 @@ def tutorial5_evaluation():
    # Evaluate Retriever on its own
    # Here we evaluate only the retriever, based on whether the gold_label document is retrieved.
    # Note that no_answer samples are omitted when evaluation is performed with this method
    retriever_eval_results = retriever.eval(
        top_k=5,
        label_index=label_index,
        doc_index=doc_index
    )
    retriever_eval_results = retriever.eval(top_k=5, label_index=label_index, doc_index=doc_index)

    ## Retriever Recall is the proportion of questions for which the correct document containing the answer is
    ## among the correct documents
@@ -234,11 +225,7 @@ def tutorial5_evaluation():
    # Here we evaluate only the reader in a closed domain fashion i.e. the reader is given one query
    # and its corresponding relevant document and metrics are calculated on whether the right position in this text is selected by
    # the model as the answer span (i.e. SQuAD style)
    reader_eval_results = reader.eval(
        document_store=document_store,
        label_index=label_index,
        doc_index=doc_index
    )
    reader_eval_results = reader.eval(document_store=document_store, label_index=label_index, doc_index=doc_index)
    # Evaluation of Reader can also be done directly on a SQuAD-formatted file without passing the data to Elasticsearch
    # reader_eval_results = reader.eval_on_file("../data/nq", "nq_dev_subset_v2.json")