cleaning the api docs (#616)
parent e192387e65 · commit 3dee284f20
@@ -1,269 +1,8 @@

<a name="memory"></a>
# memory

<a name="memory.InMemoryDocumentStore"></a>
## InMemoryDocumentStore

```python
class InMemoryDocumentStore(BaseDocumentStore)
```

In-memory document store

<a name="memory.InMemoryDocumentStore.write_documents"></a>
#### write\_documents

```python
| write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None)
```

Indexes documents for later queries.

**Arguments**:

- `documents`: a list of Python dictionaries or a list of Haystack Document objects.
For documents as dictionaries, the format is {"text": "<the-actual-text>"}.
Optionally: Include meta data via {"text": "<the-actual-text>",
"meta": {"name": "<some-document-name>, "author": "somebody", ...}}
It can be used for filtering and is accessible in the responses of the Finder.
- `index`: write documents to a custom namespace. For instance, documents for evaluation can be indexed in a
separate index than the documents for search.

**Returns**:

None

<a name="memory.InMemoryDocumentStore.update_embeddings"></a>
#### update\_embeddings

```python
| update_embeddings(retriever: BaseRetriever, index: Optional[str] = None)
```

Updates the embeddings in the document store using the encoding model specified in the retriever.
This can be useful if you want to add or change the embeddings for your documents (e.g. after changing the retriever config).

**Arguments**:

- `retriever`: Retriever
- `index`: Index name to update

**Returns**:

None
<a name="memory.InMemoryDocumentStore.add_eval_data"></a>
|
||||
#### add\_eval\_data
|
||||
|
||||
```python
|
||||
| add_eval_data(filename: str, doc_index: Optional[str] = None, label_index: Optional[str] = None)
|
||||
```
|
||||
|
||||
Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `filename`: Name of the file containing evaluation data
|
||||
:type filename: str
|
||||
- `doc_index`: Elasticsearch index where evaluation documents should be stored
|
||||
:type doc_index: str
|
||||
- `label_index`: Elasticsearch index where labeled questions should be stored
|
||||
:type label_index: str
|
||||
|
||||
<a name="memory.InMemoryDocumentStore.delete_all_documents"></a>
|
||||
#### delete\_all\_documents
|
||||
|
||||
```python
|
||||
| delete_all_documents(index: Optional[str] = None)
|
||||
```
|
||||
|
||||
Delete all documents in a index.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `index`: index name
|
||||
|
||||
**Returns**:
|
||||
|
||||
None
|
||||
|
||||
<a name="faiss"></a>
|
||||
# faiss
|
||||
|
||||
<a name="faiss.FAISSDocumentStore"></a>
|
||||
## FAISSDocumentStore
|
||||
|
||||
```python
|
||||
class FAISSDocumentStore(SQLDocumentStore)
|
||||
```
|
||||
|
||||
Document store for very large scale embedding based dense retrievers like the DPR.
|
||||
|
||||
It implements the FAISS library(https://github.com/facebookresearch/faiss)
|
||||
to perform similarity search on vectors.
|
||||
|
||||
The document text and meta-data (for filtering) are stored using the SQLDocumentStore, while
|
||||
the vector embeddings are indexed in a FAISS Index.
|
||||
|
||||
<a name="faiss.FAISSDocumentStore.__init__"></a>
|
||||
#### \_\_init\_\_
|
||||
|
||||
```python
|
||||
| __init__(sql_url: str = "sqlite:///", index_buffer_size: int = 10_000, vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional[faiss.swigfaiss.Index] = None, return_embedding: Optional[bool] = True, **kwargs, ,)
|
||||
```
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `sql_url`: SQL connection URL for database. It defaults to local file based SQLite DB. For large scale
|
||||
deployment, Postgres is recommended.
|
||||
- `index_buffer_size`: When working with large datasets, the ingestion process(FAISS + SQL) can be buffered in
|
||||
smaller chunks to reduce memory footprint.
|
||||
- `vector_dim`: the embedding vector size.
|
||||
- `faiss_index_factory_str`: Create a new FAISS index of the specified type.
|
||||
The type is determined from the given string following the conventions
|
||||
of the original FAISS index factory.
|
||||
Recommended options:
|
||||
- "Flat" (default): Best accuracy (= exact). Becomes slow and RAM intense for > 1 Mio docs.
|
||||
- "HNSW": Graph-based heuristic. If not further specified,
|
||||
we use a RAM intense, but more accurate config:
|
||||
HNSW256, efConstruction=256 and efSearch=256
|
||||
- "IVFx,Flat": Inverted Index. Replace x with the number of centroids aka nlist.
|
||||
Rule of thumb: nlist = 10 * sqrt (num_docs) is a good starting point.
|
||||
For more details see:
|
||||
- Overview of indices https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
|
||||
- Guideline for choosing an index https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
|
||||
- FAISS Index factory https://github.com/facebookresearch/faiss/wiki/The-index-factory
|
||||
Benchmarks: XXX
|
||||
- `faiss_index`: Pass an existing FAISS Index, i.e. an empty one that you configured manually
|
||||
or one with docs that you used in Haystack before and want to load again.
|
||||
- `return_embedding`: To return document embedding
|
||||
|
||||
<a name="faiss.FAISSDocumentStore.write_documents"></a>
|
||||
#### write\_documents
|
||||
|
||||
```python
|
||||
| write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None)
|
||||
```
|
||||
|
||||
Add new documents to the DocumentStore.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `documents`: List of `Dicts` or List of `Documents`. If they already contain the embeddings, we'll index
|
||||
them right away in FAISS. If not, you can later call update_embeddings() to create & index them.
|
||||
- `index`: (SQL) index name for storing the docs and metadata
|
||||
|
||||
**Returns**:
|
||||
|
||||
|
||||
|
||||
<a name="faiss.FAISSDocumentStore.update_embeddings"></a>
|
||||
#### update\_embeddings
|
||||
|
||||
```python
|
||||
| update_embeddings(retriever: BaseRetriever, index: Optional[str] = None)
|
||||
```
|
||||
|
||||
Updates the embeddings in the the document store using the encoding model specified in the retriever.
|
||||
This can be useful if want to add or change the embeddings for your documents (e.g. after changing the retriever config).
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `retriever`: Retriever to use to get embeddings for text
|
||||
- `index`: (SQL) index name for storing the docs and metadata
|
||||
|
||||
**Returns**:
|
||||
|
||||
None
|
||||
|
||||
<a name="faiss.FAISSDocumentStore.train_index"></a>
|
||||
#### train\_index
|
||||
|
||||
```python
|
||||
| train_index(documents: Optional[Union[List[dict], List[Document]]], embeddings: Optional[np.array] = None)
|
||||
```
|
||||
|
||||
Some FAISS indices (e.g. IVF) require initial "training" on a sample of vectors before you can add your final vectors.
|
||||
The train vectors should come from the same distribution as your final ones.
|
||||
You can pass either documents (incl. embeddings) or just the plain embeddings that the index shall be trained on.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `documents`: Documents (incl. the embeddings)
|
||||
- `embeddings`: Plain embeddings
|
||||
|
||||
**Returns**:
|
||||
|
||||
None
|
||||
|
||||
<a name="faiss.FAISSDocumentStore.query_by_embedding"></a>
|
||||
#### query\_by\_embedding
|
||||
|
||||
```python
|
||||
| query_by_embedding(query_emb: np.array, filters: Optional[dict] = None, top_k: int = 10, index: Optional[str] = None, return_embedding: Optional[bool] = None) -> List[Document]
|
||||
```
|
||||
|
||||
Find the document that is most similar to the provided `query_emb` by using a vector similarity metric.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `query_emb`: Embedding of the query (e.g. gathered from DPR)
|
||||
- `filters`: Optional filters to narrow down the search space.
|
||||
Example: {"name": ["some", "more"], "category": ["only_one"]}
|
||||
- `top_k`: How many documents to return
|
||||
- `index`: (SQL) index name for storing the docs and metadata
|
||||
- `return_embedding`: To return document embedding
|
||||
|
||||
**Returns**:
|
||||
|
||||
|
||||
|
||||
<a name="faiss.FAISSDocumentStore.save"></a>
|
||||
#### save
|
||||
|
||||
```python
|
||||
| save(file_path: Union[str, Path])
|
||||
```
|
||||
|
||||
Save FAISS Index to the specified file.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `file_path`: Path to save to.
|
||||
|
||||
**Returns**:
|
||||
|
||||
None
|
||||
|
||||
<a name="faiss.FAISSDocumentStore.load"></a>
|
||||
#### load
|
||||
|
||||
```python
|
||||
| @classmethod
|
||||
| load(cls, faiss_file_path: Union[str, Path], sql_url: str, index_buffer_size: int = 10_000)
|
||||
```
|
||||
|
||||
Load a saved FAISS index from a file and connect to the SQL database.
|
||||
Note: In order to have a correct mapping from FAISS to SQL,
|
||||
make sure to use the same SQL DB that you used when calling `save()`.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `faiss_file_path`: Stored FAISS index file. Can be created via calling `save()`
|
||||
- `sql_url`: Connection string to the SQL database that contains your docs and metadata.
|
||||
- `index_buffer_size`: When working with large datasets, the ingestion process(FAISS + SQL) can be buffered in
|
||||
smaller chunks to reduce memory footprint.
|
||||
|
||||
**Returns**:
|
||||
|
||||
|
||||
|
||||
<a name="elasticsearch"></a>
|
||||
# elasticsearch
|
||||
# Module elasticsearch
|
||||
|
||||
<a name="elasticsearch.ElasticsearchDocumentStore"></a>
|
||||
## ElasticsearchDocumentStore
|
||||
## ElasticsearchDocumentStore Objects
|
||||
|
||||
```python
|
||||
class ElasticsearchDocumentStore(BaseDocumentStore)
|
||||
@ -391,29 +130,139 @@ Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform
|
||||
#### delete\_all\_documents

```python
| delete_all_documents(index: str)
| delete_all_documents(index: str, filters: Optional[Dict[str, List[str]]] = None)
```

Delete all documents in an index.
Delete documents in an index. All documents are deleted if no filters are passed.

**Arguments**:

- `index`: index name
- `index`: Index name to delete the document from.
- `filters`: Optional filters to narrow down the documents to be deleted.

**Returns**:

None
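For illustration, a hedged usage sketch of the new `filters` parameter (assumes an existing `ElasticsearchDocumentStore` instance named `document_store`; the index and meta field names are placeholders):

```python
# Delete only documents whose meta field "category" is "eval" in the "document" index;
# calling the method without filters would delete every document in that index.
document_store.delete_all_documents(index="document", filters={"category": ["eval"]})
```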
<a name="memory"></a>
|
||||
# Module memory
|
||||
|
||||
<a name="memory.InMemoryDocumentStore"></a>
|
||||
## InMemoryDocumentStore Objects
|
||||
|
||||
```python
|
||||
class InMemoryDocumentStore(BaseDocumentStore)
|
||||
```
|
||||
|
||||
In-memory document store
|
||||
|
||||
<a name="memory.InMemoryDocumentStore.write_documents"></a>
|
||||
#### write\_documents
|
||||
|
||||
```python
|
||||
| write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None)
|
||||
```
|
||||
|
||||
Indexes documents for later queries.
|
||||
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `documents`: a list of Python dictionaries or a list of Haystack Document objects.
|
||||
For documents as dictionaries, the format is {"text": "<the-actual-text>"}.
|
||||
Optionally: Include meta data via {"text": "<the-actual-text>",
|
||||
"meta": {"name": "<some-document-name>, "author": "somebody", ...}}
|
||||
It can be used for filtering and is accessible in the responses of the Finder.
|
||||
- `index`: write documents to a custom namespace. For instance, documents for evaluation can be indexed in a
|
||||
separate index than the documents for search.
|
||||
|
||||
**Returns**:
|
||||
|
||||
None
|
||||
|
||||
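A hedged sketch of the dictionary format described above (the import path may differ slightly between Haystack versions):

```python
from haystack.document_store.memory import InMemoryDocumentStore  # assumed import path

document_store = InMemoryDocumentStore()
# Each dict needs a "text" field; "meta" is optional and can later be used for filtering.
document_store.write_documents([
    {"text": "Haystack is a framework for building search systems.",
     "meta": {"name": "intro.txt", "author": "somebody"}},
])
```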
<a name="memory.InMemoryDocumentStore.update_embeddings"></a>
|
||||
#### update\_embeddings
|
||||
|
||||
```python
|
||||
| update_embeddings(retriever: BaseRetriever, index: Optional[str] = None)
|
||||
```
|
||||
|
||||
Updates the embeddings in the the document store using the encoding model specified in the retriever.
|
||||
This can be useful if want to add or change the embeddings for your documents (e.g. after changing the retriever config).
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `retriever`: Retriever
|
||||
- `index`: Index name to update
|
||||
|
||||
**Returns**:
|
||||
|
||||
None
|
||||
|
||||
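A minimal sketch of recomputing embeddings after changing the retriever (assumes `document_store` from the previous example and an already-initialized dense retriever named `retriever`):

```python
# Re-encode all documents in the default index with the retriever's current model
document_store.update_embeddings(retriever=retriever)

# Or restrict the update to a custom index that was used when writing the documents
document_store.update_embeddings(retriever=retriever, index="eval_document")
```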
<a name="memory.InMemoryDocumentStore.add_eval_data"></a>
|
||||
#### add\_eval\_data
|
||||
|
||||
```python
|
||||
| add_eval_data(filename: str, doc_index: Optional[str] = None, label_index: Optional[str] = None)
|
||||
```
|
||||
|
||||
Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `filename`: Name of the file containing evaluation data
|
||||
:type filename: str
|
||||
- `doc_index`: Elasticsearch index where evaluation documents should be stored
|
||||
:type doc_index: str
|
||||
- `label_index`: Elasticsearch index where labeled questions should be stored
|
||||
:type label_index: str
|
||||
|
||||
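A hedged usage sketch (the file path and index names below are placeholders, not defaults taken from the code):

```python
# Load SQuAD-style annotations so that evaluation can run against this store
document_store.add_eval_data(
    filename="data/squad_dev.json",
    doc_index="eval_document",
    label_index="label",
)
```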
<a name="memory.InMemoryDocumentStore.delete_all_documents"></a>
|
||||
#### delete\_all\_documents
|
||||
|
||||
```python
|
||||
| delete_all_documents(index: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None)
|
||||
```
|
||||
|
||||
Delete documents in an index. All documents are deleted if no filters are passed.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `index`: Index name to delete the document from.
|
||||
- `filters`: Optional filters to narrow down the documents to be deleted.
|
||||
|
||||
**Returns**:
|
||||
|
||||
None
|
||||
|
||||
<a name="sql"></a>
|
||||
# sql
|
||||
# Module sql
|
||||
|
||||
<a name="sql.SQLDocumentStore"></a>
|
||||
## SQLDocumentStore
|
||||
## SQLDocumentStore Objects
|
||||
|
||||
```python
|
||||
class SQLDocumentStore(BaseDocumentStore)
|
||||
```
|
||||
|
||||
<a name="sql.SQLDocumentStore.__init__"></a>
|
||||
#### \_\_init\_\_
|
||||
|
||||
```python
|
||||
| __init__(url: str = "sqlite://", index: str = "document", label_index: str = "label", update_existing_documents: bool = False)
|
||||
```
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `url`: URL for SQL database as expected by SQLAlchemy. More info here: https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls
|
||||
- `index`: The documents are scoped to an index attribute that can be used when writing, querying, or deleting documents.
|
||||
This parameter sets the default value for document index.
|
||||
- `label_index`: The default value of index attribute for the labels.
|
||||
- `update_existing_documents`: Whether to update any existing documents with the same ID when adding
|
||||
documents. When set as True, any document with an existing ID gets updated.
|
||||
If set to False, an error is raised if the document ID of the document being
|
||||
added already exists. Using this parameter coud cause performance degradation for document insertion.
|
||||
|
||||
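A hedged construction sketch (the SQLite file name and Postgres connection string are placeholders; the import path may differ by Haystack version):

```python
from haystack.document_store.sql import SQLDocumentStore  # assumed import path

# Local file-based SQLite store; overwrite documents that are written again with the same ID
document_store = SQLDocumentStore(url="sqlite:///haystack.db", update_existing_documents=True)

# For larger deployments, point the same class at Postgres instead
# document_store = SQLDocumentStore(url="postgresql://user:password@localhost:5432/haystack")
```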
<a name="sql.SQLDocumentStore.write_documents"></a>
|
||||
#### write\_documents
|
||||
|
||||
@ -473,24 +322,25 @@ Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform
|
||||
#### delete\_all\_documents
|
||||
|
||||
```python
|
||||
| delete_all_documents(index=None)
|
||||
| delete_all_documents(index: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None)
|
||||
```
|
||||
|
||||
Delete all documents in a index.
|
||||
Delete documents in an index. All documents are deleted if no filters are passed.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `index`: index name
|
||||
- `index`: Index name to delete the document from.
|
||||
- `filters`: Optional filters to narrow down the documents to be deleted.
|
||||
|
||||
**Returns**:
|
||||
|
||||
None
|
||||
|
||||
<a name="base"></a>
|
||||
# base
|
||||
# Module base
|
||||
|
||||
<a name="base.BaseDocumentStore"></a>
|
||||
## BaseDocumentStore
|
||||
## BaseDocumentStore Objects
|
||||
|
||||
```python
|
||||
class BaseDocumentStore(ABC)
|
||||
@ -522,3 +372,179 @@ If None, the DocumentStore's default index (self.index) will be used.
|
||||
|
||||
None
|
||||
|
||||
<a name="faiss"></a>
|
||||
# Module faiss
|
||||
|
||||
<a name="faiss.FAISSDocumentStore"></a>
|
||||
## FAISSDocumentStore Objects
|
||||
|
||||
```python
|
||||
class FAISSDocumentStore(SQLDocumentStore)
|
||||
```
|
||||
|
||||
Document store for very large scale embedding based dense retrievers like the DPR.
|
||||
|
||||
It implements the FAISS library(https://github.com/facebookresearch/faiss)
|
||||
to perform similarity search on vectors.
|
||||
|
||||
The document text and meta-data (for filtering) are stored using the SQLDocumentStore, while
|
||||
the vector embeddings are indexed in a FAISS Index.
|
||||
|
||||
<a name="faiss.FAISSDocumentStore.__init__"></a>
|
||||
#### \_\_init\_\_
|
||||
|
||||
```python
|
||||
| __init__(sql_url: str = "sqlite:///", index_buffer_size: int = 10_000, vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional[faiss.swigfaiss.Index] = None, return_embedding: Optional[bool] = True, update_existing_documents: bool = False, index: str = "document", **kwargs, ,)
|
||||
```
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `sql_url`: SQL connection URL for database. It defaults to local file based SQLite DB. For large scale
|
||||
deployment, Postgres is recommended.
|
||||
- `index_buffer_size`: When working with large datasets, the ingestion process(FAISS + SQL) can be buffered in
|
||||
smaller chunks to reduce memory footprint.
|
||||
- `vector_dim`: the embedding vector size.
|
||||
- `faiss_index_factory_str`: Create a new FAISS index of the specified type.
|
||||
The type is determined from the given string following the conventions
|
||||
of the original FAISS index factory.
|
||||
Recommended options:
|
||||
- "Flat" (default): Best accuracy (= exact). Becomes slow and RAM intense for > 1 Mio docs.
|
||||
- "HNSW": Graph-based heuristic. If not further specified,
|
||||
we use a RAM intense, but more accurate config:
|
||||
HNSW256, efConstruction=256 and efSearch=256
|
||||
- "IVFx,Flat": Inverted Index. Replace x with the number of centroids aka nlist.
|
||||
Rule of thumb: nlist = 10 * sqrt (num_docs) is a good starting point.
|
||||
For more details see:
|
||||
- Overview of indices https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
|
||||
- Guideline for choosing an index https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
|
||||
- FAISS Index factory https://github.com/facebookresearch/faiss/wiki/The-index-factory
|
||||
Benchmarks: XXX
|
||||
- `faiss_index`: Pass an existing FAISS Index, i.e. an empty one that you configured manually
|
||||
or one with docs that you used in Haystack before and want to load again.
|
||||
- `return_embedding`: To return document embedding
|
||||
- `update_existing_documents`: Whether to update any existing documents with the same ID when adding
|
||||
documents. When set as True, any document with an existing ID gets updated.
|
||||
If set to False, an error is raised if the document ID of the document being
|
||||
added already exists.
|
||||
- `index`: Name of index in document store to use.
|
||||
|
||||
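A hedged construction sketch for the index types recommended above (file paths are placeholders; "Flat" is exact but slow for large corpora, while HNSW trades some accuracy for speed):

```python
from haystack.document_store.faiss import FAISSDocumentStore  # assumed import path

document_store = FAISSDocumentStore(
    sql_url="sqlite:///faiss_doc_store.db",
    vector_dim=768,                    # must match the embedding size of your retriever
    faiss_index_factory_str="HNSW",    # or "Flat" (default), or e.g. "IVF256,Flat"
)
```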
<a name="faiss.FAISSDocumentStore.write_documents"></a>
|
||||
#### write\_documents
|
||||
|
||||
```python
|
||||
| write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None)
|
||||
```
|
||||
|
||||
Add new documents to the DocumentStore.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `documents`: List of `Dicts` or List of `Documents`. If they already contain the embeddings, we'll index
|
||||
them right away in FAISS. If not, you can later call update_embeddings() to create & index them.
|
||||
- `index`: (SQL) index name for storing the docs and metadata
|
||||
|
||||
**Returns**:
|
||||
|
||||
|
||||
|
||||
<a name="faiss.FAISSDocumentStore.update_embeddings"></a>
|
||||
#### update\_embeddings
|
||||
|
||||
```python
|
||||
| update_embeddings(retriever: BaseRetriever, index: Optional[str] = None)
|
||||
```
|
||||
|
||||
Updates the embeddings in the the document store using the encoding model specified in the retriever.
|
||||
This can be useful if want to add or change the embeddings for your documents (e.g. after changing the retriever config).
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `retriever`: Retriever to use to get embeddings for text
|
||||
- `index`: (SQL) index name for storing the docs and metadata
|
||||
|
||||
**Returns**:
|
||||
|
||||
None
|
||||
|
||||
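The typical FAISS flow described above, as a hedged sketch (assumes `document_store` from the construction example and a dense retriever, e.g. a DensePassageRetriever, named `retriever` built on top of this store):

```python
# Write documents without embeddings first, then encode and index them in FAISS
document_store.write_documents([{"text": "FAISS indexes dense vectors for similarity search."}])
document_store.update_embeddings(retriever=retriever)
```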
<a name="faiss.FAISSDocumentStore.train_index"></a>
|
||||
#### train\_index
|
||||
|
||||
```python
|
||||
| train_index(documents: Optional[Union[List[dict], List[Document]]], embeddings: Optional[np.array] = None)
|
||||
```
|
||||
|
||||
Some FAISS indices (e.g. IVF) require initial "training" on a sample of vectors before you can add your final vectors.
|
||||
The train vectors should come from the same distribution as your final ones.
|
||||
You can pass either documents (incl. embeddings) or just the plain embeddings that the index shall be trained on.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `documents`: Documents (incl. the embeddings)
|
||||
- `embeddings`: Plain embeddings
|
||||
|
||||
**Returns**:
|
||||
|
||||
None
|
||||
|
||||
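A hedged sketch for an IVF-style index (the random vectors stand in for real training data, which should come from the same distribution as your document embeddings):

```python
import numpy as np

# e.g. for a store created with faiss_index_factory_str="IVF256,Flat" and vector_dim=768
train_vectors = np.random.rand(10_000, 768).astype("float32")
document_store.train_index(documents=None, embeddings=train_vectors)
```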
<a name="faiss.FAISSDocumentStore.query_by_embedding"></a>
|
||||
#### query\_by\_embedding
|
||||
|
||||
```python
|
||||
| query_by_embedding(query_emb: np.array, filters: Optional[dict] = None, top_k: int = 10, index: Optional[str] = None, return_embedding: Optional[bool] = None) -> List[Document]
|
||||
```
|
||||
|
||||
Find the document that is most similar to the provided `query_emb` by using a vector similarity metric.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `query_emb`: Embedding of the query (e.g. gathered from DPR)
|
||||
- `filters`: Optional filters to narrow down the search space.
|
||||
Example: {"name": ["some", "more"], "category": ["only_one"]}
|
||||
- `top_k`: How many documents to return
|
||||
- `index`: (SQL) index name for storing the docs and metadata
|
||||
- `return_embedding`: To return document embedding
|
||||
|
||||
**Returns**:
|
||||
|
||||
|
||||
|
||||
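A minimal query sketch (assumes `query_emb` is a numpy vector with the same dimensionality as `vector_dim`, produced by the same retriever that created the stored embeddings):

```python
similar_docs = document_store.query_by_embedding(query_emb=query_emb, top_k=5)
for doc in similar_docs:
    print(doc.text[:80])
```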
<a name="faiss.FAISSDocumentStore.save"></a>
|
||||
#### save
|
||||
|
||||
```python
|
||||
| save(file_path: Union[str, Path])
|
||||
```
|
||||
|
||||
Save FAISS Index to the specified file.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `file_path`: Path to save to.
|
||||
|
||||
**Returns**:
|
||||
|
||||
None
|
||||
|
||||
<a name="faiss.FAISSDocumentStore.load"></a>
|
||||
#### load
|
||||
|
||||
```python
|
||||
| @classmethod
|
||||
| load(cls, faiss_file_path: Union[str, Path], sql_url: str, index_buffer_size: int = 10_000)
|
||||
```
|
||||
|
||||
Load a saved FAISS index from a file and connect to the SQL database.
|
||||
Note: In order to have a correct mapping from FAISS to SQL,
|
||||
make sure to use the same SQL DB that you used when calling `save()`.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `faiss_file_path`: Stored FAISS index file. Can be created via calling `save()`
|
||||
- `sql_url`: Connection string to the SQL database that contains your docs and metadata.
|
||||
- `index_buffer_size`: When working with large datasets, the ingestion process(FAISS + SQL) can be buffered in
|
||||
smaller chunks to reduce memory footprint.
|
||||
|
||||
**Returns**:
|
||||
|
||||
|
||||
|
||||
|
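A hedged save/load round trip (file name and SQL URL are placeholders; the SQL DB must be the same one that was used before saving):

```python
document_store.save("faiss_index.bin")

# Later, or in another process: restore the FAISS index and reconnect to the matching SQL DB
document_store = FAISSDocumentStore.load(
    faiss_file_path="faiss_index.bin",
    sql_url="sqlite:///faiss_doc_store.db",
)
```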
@@ -1,38 +1,8 @@

<a name="pdf"></a>
# pdf

<a name="pdf.PDFToTextConverter"></a>
## PDFToTextConverter

```python
class PDFToTextConverter(BaseConverter)
```

<a name="pdf.PDFToTextConverter.__init__"></a>
#### \_\_init\_\_

```python
| __init__(remove_numeric_tables: Optional[bool] = False, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could be possible candidates for searching answers.
The rows containing strings are thus retained in this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not one of the valid languages, then it is most likely an encoding error resulting
in garbled text.
<a name="txt"></a>
|
||||
# txt
|
||||
# Module txt
|
||||
|
||||
<a name="txt.TextConverter"></a>
|
||||
## TextConverter
|
||||
## TextConverter Objects
|
||||
|
||||
```python
|
||||
class TextConverter(BaseConverter)
|
||||
@ -77,11 +47,36 @@ Reads text from a txt file and executes optional preprocessing steps.
|
||||
|
||||
Dict of format {"text": "The text from file", "meta": meta}}
|
||||
|
||||
<a name="docx"></a>
|
||||
# Module docx
|
||||
|
||||
<a name="docx.DocxToTextConverter"></a>
|
||||
## DocxToTextConverter Objects
|
||||
|
||||
```python
|
||||
class DocxToTextConverter(BaseConverter)
|
||||
```
|
||||
|
||||
<a name="docx.DocxToTextConverter.convert"></a>
|
||||
#### convert
|
||||
|
||||
```python
|
||||
| convert(file_path: Path, meta: Optional[Dict[str, str]] = None) -> Dict[str, Any]
|
||||
```
|
||||
|
||||
Extract text from a .docx file.
|
||||
Note: As docx doesn't contain "page" information, we actually extract and return a list of paragraphs here.
|
||||
For compliance with other converters we nevertheless opted for keeping the methods name.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `file_path`: Path to the .docx file you want to convert
|
||||
|
||||
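A hedged usage sketch (the file name is a placeholder and the import path may differ by Haystack version; as noted above, the extracted content is paragraph-based rather than page-based):

```python
from pathlib import Path
from haystack.file_converter.docx import DocxToTextConverter  # assumed import path

converter = DocxToTextConverter()
result = converter.convert(file_path=Path("report.docx"), meta={"name": "report.docx"})
```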
<a name="tika"></a>
|
||||
# tika
|
||||
# Module tika
|
||||
|
||||
<a name="tika.TikaConverter"></a>
|
||||
## TikaConverter
|
||||
## TikaConverter Objects
|
||||
|
||||
```python
|
||||
class TikaConverter(BaseConverter)
|
||||
@ -123,36 +118,11 @@ in garbled text.
|
||||
|
||||
a list of pages and the extracted meta data of the file.
|
||||
|
||||
<a name="docx"></a>
|
||||
# docx
|
||||
|
||||
<a name="docx.DocxToTextConverter"></a>
|
||||
## DocxToTextConverter
|
||||
|
||||
```python
|
||||
class DocxToTextConverter(BaseConverter)
|
||||
```
|
||||
|
||||
<a name="docx.DocxToTextConverter.convert"></a>
|
||||
#### convert
|
||||
|
||||
```python
|
||||
| convert(file_path: Path, meta: Optional[Dict[str, str]] = None) -> Dict[str, Any]
|
||||
```
|
||||
|
||||
Extract text from a .docx file.
|
||||
Note: As docx doesn't contain "page" information, we actually extract and return a list of paragraphs here.
|
||||
For compliance with other converters we nevertheless opted for keeping the methods name.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `file_path`: Path to the .docx file you want to convert
|
||||
|
||||
<a name="base"></a>
|
||||
# base
|
||||
# Module base
|
||||
|
||||
<a name="base.BaseConverter"></a>
|
||||
## BaseConverter
|
||||
## BaseConverter Objects
|
||||
|
||||
```python
|
||||
class BaseConverter()
|
||||
@ -207,3 +177,33 @@ supplied meta data like author, url, external IDs can be supplied as a dictionar
|
||||
|
||||
Validate if the language of the text is one of valid languages.
|
||||
|
||||
<a name="pdf"></a>
|
||||
# Module pdf
|
||||
|
||||
<a name="pdf.PDFToTextConverter"></a>
|
||||
## PDFToTextConverter Objects
|
||||
|
||||
```python
|
||||
class PDFToTextConverter(BaseConverter)
|
||||
```
|
||||
|
||||
<a name="pdf.PDFToTextConverter.__init__"></a>
|
||||
#### \_\_init\_\_
|
||||
|
||||
```python
|
||||
| __init__(remove_numeric_tables: Optional[bool] = False, valid_languages: Optional[List[str]] = None)
|
||||
```
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
|
||||
The tabular structures in documents might be noise for the reader model if it
|
||||
does not have table parsing capability for finding answers. However, tables
|
||||
may also have long strings that could possible candidate for searching answers.
|
||||
The rows containing strings are thus retained in this option.
|
||||
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
|
||||
(https://en.wikipedia.org/wiki/ISO_639-1) format.
|
||||
This option can be used to add test for encoding errors. If the extracted text is
|
||||
not one of the valid languages, then it might likely be encoding error resulting
|
||||
in garbled text.
|
||||
|
||||
|
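A hedged usage sketch for the PDF converter (the file name is a placeholder, the import path may differ by version, and a working `pdftotext` installation is assumed to be available on the system):

```python
from pathlib import Path
from haystack.file_converter.pdf import PDFToTextConverter  # assumed import path

# Drop mostly-numeric table rows and flag probable encoding problems for non-English output
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
document = converter.convert(file_path=Path("sample.pdf"))
```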
docs/_src/api/api/generator.md (new file, 137 lines)

@@ -0,0 +1,137 @@
<a name="transformers"></a>
# Module transformers

<a name="transformers.RAGenerator"></a>
## RAGenerator Objects

```python
class RAGenerator(BaseGenerator)
```

Implementation of Facebook's Retrieval-Augmented Generator (https://arxiv.org/abs/2005.11401) based on
HuggingFace's transformers (https://huggingface.co/transformers/model_doc/rag.html).

Instead of "finding" the answer within a document, these models **generate** the answer.
In that sense, RAG follows a similar approach as GPT-3 but it comes with two huge advantages
for real-world applications:
a) it has a manageable model size
b) the answer generation is conditioned on retrieved documents,
i.e. the model can easily adjust to domain documents even after training has finished
(in contrast: GPT-3 relies on the web data seen during training)

**Example**

```python
> question = "who got the first nobel prize in physics?"

> # Retrieve related documents from retriever
> retrieved_docs = retriever.retrieve(query=question)

> # Now generate answer from question and retrieved documents
> generator.predict(
>    question=question,
>    documents=retrieved_docs,
>    top_k=1
> )
{'question': 'who got the first nobel prize in physics',
 'answers':
     [{'question': 'who got the first nobel prize in physics',
       'answer': ' albert einstein',
       'meta': { 'doc_ids': [...],
                 'doc_scores': [80.42758 ...],
                 'doc_probabilities': [40.71379089355469, ...
                 'texts': ['Albert Einstein was a ...]
                 'titles': ['"Albert Einstein"', ...]
}}]}
```

<a name="transformers.RAGenerator.__init__"></a>
#### \_\_init\_\_

```python
| __init__(model_name_or_path: str = "facebook/rag-token-nq", retriever: Optional[DensePassageRetriever] = None, generator_type: RAGeneratorType = RAGeneratorType.TOKEN, top_k_answers: int = 2, max_length: int = 200, min_length: int = 2, num_beams: int = 2, embed_title: bool = True, prefix: Optional[str] = None, use_gpu: bool = True)
```

Load a RAG model from Transformers along with passage_embedding_model.
See https://huggingface.co/transformers/model_doc/rag.html for more details

**Arguments**:

- `model_name_or_path`: Directory of a saved model or the name of a public model e.g.
'facebook/rag-token-nq', 'facebook/rag-sequence-nq'.
See https://huggingface.co/models for full list of available models.
- `retriever`: `DensePassageRetriever` used to embed passages
- `generator_type`: Which RAG generator implementation to use (RAG-TOKEN or RAG-SEQUENCE)
- `top_k_answers`: Number of independently generated texts to return
- `max_length`: Maximum length of generated text
- `min_length`: Minimum length of generated text
- `num_beams`: Number of beams for beam search. 1 means no beam search.
- `embed_title`: Whether to embed the title of the passage while generating its embedding
- `prefix`: The prefix used by the generator's tokenizer.
- `use_gpu`: Whether to use GPU (if available)
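A hedged initialization sketch (the import path is assumed from this module's name; `retriever` is a DensePassageRetriever built on your document store, as described above):

```python
from haystack.generator.transformers import RAGenerator, RAGeneratorType  # assumed import path

generator = RAGenerator(
    model_name_or_path="facebook/rag-token-nq",
    retriever=retriever,
    generator_type=RAGeneratorType.TOKEN,
    top_k_answers=1,
)
```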
<a name="transformers.RAGenerator.predict"></a>
|
||||
#### predict
|
||||
|
||||
```python
|
||||
| predict(question: str, documents: List[Document], top_k: Optional[int] = None) -> Dict
|
||||
```
|
||||
|
||||
Generate the answer to the input question. The generation will be conditioned on the supplied documents.
|
||||
These document can for example be retrieved via the Retriever.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `question`: Question
|
||||
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
|
||||
- `top_k`: Number of returned answers
|
||||
|
||||
**Returns**:
|
||||
|
||||
Generated answers plus additional infos in a dict like this:
|
||||
|
||||
```python
|
||||
> {'question': 'who got the first nobel prize in physics',
|
||||
> 'answers':
|
||||
> [{'question': 'who got the first nobel prize in physics',
|
||||
> 'answer': ' albert einstein',
|
||||
> 'meta': { 'doc_ids': [...],
|
||||
> 'doc_scores': [80.42758 ...],
|
||||
> 'doc_probabilities': [40.71379089355469, ...
|
||||
> 'texts': ['Albert Einstein was a ...]
|
||||
> 'titles': ['"Albert Einstein"', ...]
|
||||
> }}]}
|
||||
```
|
||||
|
||||
<a name="base"></a>
|
||||
# Module base
|
||||
|
||||
<a name="base.BaseGenerator"></a>
|
||||
## BaseGenerator Objects
|
||||
|
||||
```python
|
||||
class BaseGenerator(ABC)
|
||||
```
|
||||
|
||||
Abstract class for Generators
|
||||
|
||||
<a name="base.BaseGenerator.predict"></a>
|
||||
#### predict
|
||||
|
||||
```python
|
||||
| @abstractmethod
|
||||
| predict(question: str, documents: List[Document], top_k: Optional[int]) -> Dict
|
||||
```
|
||||
|
||||
Abstract method to generate answers.
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `question`: Question
|
||||
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
|
||||
- `top_k`: Number of returned answers
|
||||
|
||||
**Returns**:
|
||||
|
||||
Generated answers plus additional infos in a dict
|
||||
|
@@ -1,5 +1,44 @@

<a name="preprocessor"></a>
# Module preprocessor

<a name="preprocessor.PreProcessor"></a>
## PreProcessor Objects

```python
class PreProcessor(BasePreProcessor)
```

<a name="preprocessor.PreProcessor.__init__"></a>
#### \_\_init\_\_

```python
| __init__(clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, split_by: Optional[str] = "word", split_length: Optional[int] = 1000, split_stride: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = True)
```

**Arguments**:

- `clean_header_footer`: Use heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `clean_whitespace`: Strip whitespaces before or after each line in the text.
- `clean_empty_lines`: Remove more than two empty lines in the text.
- `split_by`: Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting.
- `split_length`: Max. number of the above split unit (e.g. words) that are allowed in one document. For instance, if n -> 10 & split_by ->
"sentence", then each output document will have 10 sentences.
- `split_stride`: Length of striding window over the splits. For example, if split_by -> `word`,
split_length -> 5 & split_stride -> 2, then the splits would be like:
[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11].
Set the value to None to disable striding behaviour.
- `split_respect_sentence_boundary`: Whether to split in partial sentences if split_by -> `word`. If set
to True, the individual split will always have complete sentences &
the number of words will be <= split_length.
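A hedged construction sketch of the splitting behaviour described above (the import path is assumed; the resulting processor is typically applied to converted documents before they are written to a DocumentStore):

```python
from haystack.preprocessor.preprocessor import PreProcessor  # assumed import path

processor = PreProcessor(
    clean_whitespace=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=200,     # at most 200 words per resulting document
    split_stride=50,      # consecutive splits overlap by 50 words
    split_respect_sentence_boundary=True,
)
```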
<a name="cleaning"></a>
|
||||
# Module cleaning
|
||||
|
||||
<a name="utils"></a>
|
||||
# utils
|
||||
# Module utils
|
||||
|
||||
<a name="utils.eval_data_from_file"></a>
|
||||
#### eval\_data\_from\_file
|
||||
@ -84,45 +123,6 @@ Fetch an archive (zip or tar.gz) from a url via http and extract content to an o
|
||||
|
||||
bool if anything got fetched
|
||||
|
||||
<a name="preprocessor"></a>
|
||||
# preprocessor
|
||||
|
||||
<a name="preprocessor.PreProcessor"></a>
|
||||
## PreProcessor
|
||||
|
||||
```python
|
||||
class PreProcessor(BasePreProcessor)
|
||||
```
|
||||
|
||||
<a name="preprocessor.PreProcessor.__init__"></a>
|
||||
#### \_\_init\_\_
|
||||
|
||||
```python
|
||||
| __init__(clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, split_by: Optional[str] = "word", split_length: Optional[int] = 1000, split_stride: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = True)
|
||||
```
|
||||
|
||||
**Arguments**:
|
||||
|
||||
- `clean_header_footer`: Use heuristic to remove footers and headers across different pages by searching
|
||||
for the longest common string. This heuristic uses exact matches and therefore
|
||||
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
|
||||
or similar.
|
||||
- `clean_whitespace`: Strip whitespaces before or after each line in the text.
|
||||
- `clean_empty_lines`: Remove more than two empty lines in the text.
|
||||
- `split_by`: Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting.
|
||||
- `split_length`: Max. number of the above split unit (e.g. words) that are allowed in one document. For instance, if n -> 10 & split_by ->
|
||||
"sentence", then each output document will have 10 sentences.
|
||||
- `split_stride`: Length of striding window over the splits. For example, if split_by -> `word`,
|
||||
split_length -> 5 & split_stride -> 2, then the splits would be like:
|
||||
[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w10 w11 w12].
|
||||
Set the value to None to disable striding behaviour.
|
||||
- `split_respect_sentence_boundary`: Whether to split in partial sentences if split_by -> `word`. If set
|
||||
to True, the individual split will always have complete sentences &
|
||||
the number of words will be <= split_length.
|
||||
|
||||
<a name="base"></a>
|
||||
# base
|
||||
|
||||
<a name="cleaning"></a>
|
||||
# cleaning
|
||||
# Module base
|
||||
|
||||
|
@@ -10,5 +10,8 @@ processor:
- skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: false
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: document_store.md

@@ -10,5 +10,8 @@ processor:
- skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: false
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: file_converter.md

@@ -10,5 +10,8 @@ processor:
- skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: false
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: generator.md

@@ -10,5 +10,8 @@ processor:
- skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: false
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: preprocessor.md

@@ -10,5 +10,8 @@ processor:
- skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: false
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: reader.md

@@ -10,5 +10,8 @@ processor:
- skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: false
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: retriever.md
@@ -1,8 +1,8 @@

<a name="farm"></a>
# farm
# Module farm

<a name="farm.FARMReader"></a>
## FARMReader
## FARMReader Objects

```python
class FARMReader(BaseReader)

@@ -279,10 +279,10 @@ float32 could still be more performant.

- `opset_version`: ONNX opset version

<a name="transformers"></a>
# transformers
# Module transformers

<a name="transformers.TransformersReader"></a>
## TransformersReader
## TransformersReader Objects

```python
class TransformersReader(BaseReader)

@@ -368,5 +368,5 @@ Example:

Dict containing question and answers

<a name="base"></a>
# base
# Module base
@@ -1,8 +1,8 @@

<a name="sparse"></a>
# sparse
# Module sparse

<a name="sparse.ElasticsearchRetriever"></a>
## ElasticsearchRetriever
## ElasticsearchRetriever Objects

```python
class ElasticsearchRetriever(BaseRetriever)

@@ -52,7 +52,7 @@ self.retrieve(query="Why did the revenue increase?",

```

<a name="sparse.ElasticsearchFilterOnlyRetriever"></a>
## ElasticsearchFilterOnlyRetriever
## ElasticsearchFilterOnlyRetriever Objects

```python
class ElasticsearchFilterOnlyRetriever(ElasticsearchRetriever)

@@ -62,7 +62,7 @@ Naive "Retriever" that returns all documents that match the given filters. No im

Helpful for benchmarking, testing and if you want to do QA on small documents without an "active" retriever.

<a name="sparse.TfidfRetriever"></a>
## TfidfRetriever
## TfidfRetriever Objects

```python
class TfidfRetriever(BaseRetriever)

@@ -76,10 +76,10 @@ computations when text is passed on to a Reader for QA.

It uses sklearn's TfidfVectorizer to compute a tf-idf matrix.

<a name="dense"></a>
# dense
# Module dense

<a name="dense.DensePassageRetriever"></a>
## DensePassageRetriever
## DensePassageRetriever Objects

```python
class DensePassageRetriever(BaseRetriever)

@@ -201,7 +201,7 @@ train a DensePassageRetrieval model

- `passage_encoder_save_dir`: directory inside save_dir where passage_encoder model files are saved

<a name="dense.EmbeddingRetriever"></a>
## EmbeddingRetriever
## EmbeddingRetriever Objects

```python
class EmbeddingRetriever(BaseRetriever)

@@ -286,10 +286,10 @@ Create embeddings for a list of passages. For this Retriever type: The same as c

Embeddings, one per input passage

<a name="base"></a>
# base
# Module base

<a name="base.BaseRetriever"></a>
## BaseRetriever
## BaseRetriever Objects

```python
class BaseRetriever(ABC)

@@ -330,7 +330,10 @@ position in the ranking of documents the correct document is.

- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
documents a higher rank. Considers all retrieved relevant documents. (only with ``open_domain=False``)
documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
average precision is normalized by the number of retrieved relevant documents per query.
If ``open_domain=False``, average precision is normalized by the number of all relevant documents
per query.
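To make the two normalizations concrete, a small worked illustration (plain Python arithmetic, not Haystack code): suppose one query has 3 relevant documents in total and the retriever returns relevant documents at ranks 1 and 3.

```python
# precision at each rank where a relevant document was retrieved
precisions = [1 / 1, 2 / 3]

ap_open_domain = sum(precisions) / 2    # normalize by retrieved relevant docs -> ~0.83
ap_closed_domain = sum(precisions) / 3  # normalize by all relevant docs       -> ~0.56
```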
**Arguments**: