mirror of https://github.com/deepset-ai/haystack.git
synced 2025-10-08 22:46:00 +00:00

cleaning the api docs (#616)

commit 3dee284f20 (parent e192387e65)
@@ -1,269 +1,8 @@
<a name="elasticsearch"></a>
# Module elasticsearch

<a name="elasticsearch.ElasticsearchDocumentStore"></a>
## ElasticsearchDocumentStore Objects

```python
class ElasticsearchDocumentStore(BaseDocumentStore)
```
@@ -391,29 +130,139 @@ Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform
#### delete\_all\_documents

```python
| delete_all_documents(index: str, filters: Optional[Dict[str, List[str]]] = None)
```

Delete documents in an index. All documents are deleted if no filters are passed.

**Arguments**:

- `index`: Index name to delete the documents from.
- `filters`: Optional filters to narrow down the documents to be deleted.

**Returns**:

None
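A quick usage sketch for the updated signature; the import path, host, and meta fields are illustrative assumptions for Haystack 0.x rather than part of the generated reference:

```python
# Illustrative only: import path, host and meta fields are assumptions
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", index="document")

# Delete only documents whose meta field "category" matches one of the given values
document_store.delete_all_documents(index="document", filters={"category": ["draft", "outdated"]})

# With no filters, every document in the index is deleted
document_store.delete_all_documents(index="document")
```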
<a name="memory"></a>
# Module memory

<a name="memory.InMemoryDocumentStore"></a>
## InMemoryDocumentStore Objects

```python
class InMemoryDocumentStore(BaseDocumentStore)
```

In-memory document store

<a name="memory.InMemoryDocumentStore.write_documents"></a>
#### write\_documents

```python
| write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None)
```

Indexes documents for later queries.

**Arguments**:

- `documents`: a list of Python dictionaries or a list of Haystack Document objects.
For documents as dictionaries, the format is {"text": "<the-actual-text>"}.
Optionally: Include meta data via {"text": "<the-actual-text>",
"meta": {"name": "<some-document-name>", "author": "somebody", ...}}
It can be used for filtering and is accessible in the responses of the Finder.
- `index`: write documents to a custom namespace. For instance, documents for evaluation can be indexed in a
separate index from the documents for search.

**Returns**:

None
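A minimal sketch of the dict format described above; the import path is an assumption for Haystack 0.x:

```python
# Illustrative only: the import path is an assumption
from haystack.document_store.memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
document_store.write_documents([
    {"text": "Haystack is a framework for building search systems.",
     "meta": {"name": "intro.txt", "author": "somebody"}},
    {"text": "FAISS is a library for efficient similarity search."},
])
```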
<a name="memory.InMemoryDocumentStore.update_embeddings"></a>
#### update\_embeddings

```python
| update_embeddings(retriever: BaseRetriever, index: Optional[str] = None)
```

Updates the embeddings in the document store using the encoding model specified in the retriever.
This can be useful if you want to add or change the embeddings for your documents (e.g. after changing the retriever config).

**Arguments**:

- `retriever`: Retriever
- `index`: Index name to update

**Returns**:

None
<a name="memory.InMemoryDocumentStore.add_eval_data"></a>
#### add\_eval\_data

```python
| add_eval_data(filename: str, doc_index: Optional[str] = None, label_index: Optional[str] = None)
```

Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it.

**Arguments**:

- `filename`: Name of the file containing evaluation data
:type filename: str
- `doc_index`: Elasticsearch index where evaluation documents should be stored
:type doc_index: str
- `label_index`: Elasticsearch index where labeled questions should be stored
:type label_index: str
<a name="memory.InMemoryDocumentStore.delete_all_documents"></a>
#### delete\_all\_documents

```python
| delete_all_documents(index: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None)
```

Delete documents in an index. All documents are deleted if no filters are passed.

**Arguments**:

- `index`: Index name to delete the documents from.
- `filters`: Optional filters to narrow down the documents to be deleted.

**Returns**:

None
<a name="sql"></a>
# Module sql

<a name="sql.SQLDocumentStore"></a>
## SQLDocumentStore Objects

```python
class SQLDocumentStore(BaseDocumentStore)
```

<a name="sql.SQLDocumentStore.__init__"></a>
#### \_\_init\_\_

```python
| __init__(url: str = "sqlite://", index: str = "document", label_index: str = "label", update_existing_documents: bool = False)
```

**Arguments**:

- `url`: URL for SQL database as expected by SQLAlchemy. More info here: https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls
- `index`: The documents are scoped to an index attribute that can be used when writing, querying, or deleting documents.
This parameter sets the default value for the document index.
- `label_index`: The default value of the index attribute for the labels.
- `update_existing_documents`: Whether to update any existing documents with the same ID when adding
documents. When set to True, any document with an existing ID gets updated.
If set to False, an error is raised if the document ID of the document being
added already exists. Using this parameter could cause performance degradation for document insertion.
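A construction sketch for a local store that overwrites duplicates; the import path and database URL are assumptions:

```python
# Illustrative only: import path and DB URL are assumptions
from haystack.document_store.sql import SQLDocumentStore

document_store = SQLDocumentStore(
    url="sqlite:///qa.db",            # any SQLAlchemy connection URL works here
    index="document",
    label_index="label",
    update_existing_documents=True,   # re-writing an existing ID updates it instead of raising
)
```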
<a name="sql.SQLDocumentStore.write_documents"></a>
#### write\_documents

@@ -473,24 +322,25 @@ Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform

#### delete\_all\_documents

```python
| delete_all_documents(index: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None)
```

Delete documents in an index. All documents are deleted if no filters are passed.

**Arguments**:

- `index`: Index name to delete the documents from.
- `filters`: Optional filters to narrow down the documents to be deleted.

**Returns**:

None
<a name="base"></a>
# Module base

<a name="base.BaseDocumentStore"></a>
## BaseDocumentStore Objects

```python
class BaseDocumentStore(ABC)
```
@@ -522,3 +372,179 @@ If None, the DocumentStore's default index (self.index) will be used.

None

<a name="faiss"></a>
# Module faiss

<a name="faiss.FAISSDocumentStore"></a>
## FAISSDocumentStore Objects

```python
class FAISSDocumentStore(SQLDocumentStore)
```

Document store for very large scale embedding based dense retrievers like the DPR.

It implements the FAISS library (https://github.com/facebookresearch/faiss)
to perform similarity search on vectors.

The document text and meta-data (for filtering) are stored using the SQLDocumentStore, while
the vector embeddings are indexed in a FAISS Index.

<a name="faiss.FAISSDocumentStore.__init__"></a>
#### \_\_init\_\_

```python
| __init__(sql_url: str = "sqlite:///", index_buffer_size: int = 10_000, vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional[faiss.swigfaiss.Index] = None, return_embedding: Optional[bool] = True, update_existing_documents: bool = False, index: str = "document", **kwargs)
```

**Arguments**:

- `sql_url`: SQL connection URL for the database. It defaults to a local, file-based SQLite DB. For large scale
deployment, Postgres is recommended.
- `index_buffer_size`: When working with large datasets, the ingestion process (FAISS + SQL) can be buffered in
smaller chunks to reduce the memory footprint.
- `vector_dim`: the embedding vector size.
- `faiss_index_factory_str`: Create a new FAISS index of the specified type.
The type is determined from the given string following the conventions
of the original FAISS index factory.
Recommended options:
- "Flat" (default): Best accuracy (= exact). Becomes slow and RAM-intensive for > 1 million docs.
- "HNSW": Graph-based heuristic. If not further specified,
we use a RAM-intensive, but more accurate config:
HNSW256, efConstruction=256 and efSearch=256
- "IVFx,Flat": Inverted index. Replace x with the number of centroids aka nlist.
Rule of thumb: nlist = 10 * sqrt(num_docs) is a good starting point.
For more details see:
- Overview of indices https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
- Guideline for choosing an index https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
- FAISS index factory https://github.com/facebookresearch/faiss/wiki/The-index-factory
Benchmarks: XXX
- `faiss_index`: Pass an existing FAISS index, i.e. an empty one that you configured manually
or one with docs that you used in Haystack before and want to load again.
- `return_embedding`: Whether to return the document embedding.
- `update_existing_documents`: Whether to update any existing documents with the same ID when adding
documents. When set to True, any document with an existing ID gets updated.
If set to False, an error is raised if the document ID of the document being
added already exists.
- `index`: Name of the index in the document store to use.
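A construction sketch contrasting two of the recommended index types; the import path and the Postgres URL are assumptions:

```python
# Illustrative only: import path and Postgres URL are assumptions
from haystack.document_store.faiss import FAISSDocumentStore

# Exact search; simplest choice for smaller corpora
flat_store = FAISSDocumentStore(sql_url="sqlite:///", faiss_index_factory_str="Flat")

# Approximate search for larger corpora; HNSW trades RAM for speed at high accuracy
hnsw_store = FAISSDocumentStore(
    sql_url="postgresql://user:password@localhost:5432/haystack",
    vector_dim=768,
    faiss_index_factory_str="HNSW",
)
```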
<a name="faiss.FAISSDocumentStore.write_documents"></a>
#### write\_documents

```python
| write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None)
```

Add new documents to the DocumentStore.

**Arguments**:

- `documents`: List of `Dicts` or List of `Documents`. If they already contain the embeddings, we'll index
them right away in FAISS. If not, you can later call update_embeddings() to create & index them.
- `index`: (SQL) index name for storing the docs and metadata

**Returns**:

<a name="faiss.FAISSDocumentStore.update_embeddings"></a>
#### update\_embeddings

```python
| update_embeddings(retriever: BaseRetriever, index: Optional[str] = None)
```

Updates the embeddings in the document store using the encoding model specified in the retriever.
This can be useful if you want to add or change the embeddings for your documents (e.g. after changing the retriever config).

**Arguments**:

- `retriever`: Retriever to use to get embeddings for text
- `index`: (SQL) index name for storing the docs and metadata

**Returns**:

None
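A sketch of re-encoding all stored documents after switching the retriever model; the import paths, parameter names, and model names are assumptions, mirroring the DPR models used elsewhere in these docs:

```python
# Illustrative only: import paths, parameter names and model names are assumptions
from haystack.document_store.faiss import FAISSDocumentStore
from haystack.retriever.dense import DensePassageRetriever

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)

# Re-encode every stored document with the retriever's passage encoder
document_store.update_embeddings(retriever=retriever)
```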
<a name="faiss.FAISSDocumentStore.train_index"></a>
#### train\_index

```python
| train_index(documents: Optional[Union[List[dict], List[Document]]], embeddings: Optional[np.array] = None)
```

Some FAISS indices (e.g. IVF) require initial "training" on a sample of vectors before you can add your final vectors.
The train vectors should come from the same distribution as your final ones.
You can pass either documents (incl. embeddings) or just the plain embeddings that the index shall be trained on.

**Arguments**:

- `documents`: Documents (incl. the embeddings)
- `embeddings`: Plain embeddings

**Returns**:

None
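A sketch of training an IVF index on sample vectors before ingestion; the nlist value and the random vectors are purely illustrative:

```python
# Illustrative only: random vectors stand in for real sample embeddings
import numpy as np
from haystack.document_store.faiss import FAISSDocumentStore

document_store = FAISSDocumentStore(faiss_index_factory_str="IVF256,Flat", vector_dim=768)

# Train on vectors drawn from the same distribution as the final document embeddings
sample_embeddings = np.random.rand(10_000, 768).astype("float32")
document_store.train_index(documents=None, embeddings=sample_embeddings)
```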
<a name="faiss.FAISSDocumentStore.query_by_embedding"></a>
#### query\_by\_embedding

```python
| query_by_embedding(query_emb: np.array, filters: Optional[dict] = None, top_k: int = 10, index: Optional[str] = None, return_embedding: Optional[bool] = None) -> List[Document]
```

Find the documents that are most similar to the provided `query_emb` by using a vector similarity metric.

**Arguments**:

- `query_emb`: Embedding of the query (e.g. gathered from DPR)
- `filters`: Optional filters to narrow down the search space.
Example: {"name": ["some", "more"], "category": ["only_one"]}
- `top_k`: How many documents to return
- `index`: (SQL) index name for storing the docs and metadata
- `return_embedding`: Whether to return the document embedding.

**Returns**:

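A query sketch that reuses the store from the previous snippets; the random vector stands in for a real query embedding (e.g. from a DPR query encoder):

```python
# Illustrative only: a random vector replaces a real DPR query embedding
import numpy as np

query_emb = np.random.rand(768).astype("float32")
results = document_store.query_by_embedding(
    query_emb=query_emb,
    filters={"category": ["finance"]},  # optional meta filter
    top_k=5,
)
for doc in results:
    print(doc.text[:80])
```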
<a name="faiss.FAISSDocumentStore.save"></a>
#### save

```python
| save(file_path: Union[str, Path])
```

Save FAISS Index to the specified file.

**Arguments**:

- `file_path`: Path to save to.

**Returns**:

None
<a name="faiss.FAISSDocumentStore.load"></a>
#### load

```python
| @classmethod
| load(cls, faiss_file_path: Union[str, Path], sql_url: str, index_buffer_size: int = 10_000)
```

Load a saved FAISS index from a file and connect to the SQL database.
Note: In order to have a correct mapping from FAISS to SQL,
make sure to use the same SQL DB that you used when calling `save()`.

**Arguments**:

- `faiss_file_path`: Stored FAISS index file. Can be created via calling `save()`
- `sql_url`: Connection string to the SQL database that contains your docs and metadata.
- `index_buffer_size`: When working with large datasets, the ingestion process (FAISS + SQL) can be buffered in
smaller chunks to reduce the memory footprint.

**Returns**:

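A save/load round-trip sketch; the file name and SQL URL are assumptions, and the URL passed to `load()` must point at the same database that was in use when `save()` was called:

```python
# Illustrative only: file name and SQL URL are assumptions
document_store.save("faiss_index.bin")

from haystack.document_store.faiss import FAISSDocumentStore
document_store = FAISSDocumentStore.load(
    faiss_file_path="faiss_index.bin",
    sql_url="sqlite:///haystack.db",   # the DB that holds the matching docs and metadata
    index_buffer_size=10_000,
)
```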
@@ -1,38 +1,8 @@
<a name="txt"></a>
# Module txt

<a name="txt.TextConverter"></a>
## TextConverter Objects

```python
class TextConverter(BaseConverter)
```
@@ -77,11 +47,36 @@ Reads text from a txt file and executes optional preprocessing steps.

Dict of format {"text": "The text from file", "meta": meta}
<a name="docx"></a>
# Module docx

<a name="docx.DocxToTextConverter"></a>
## DocxToTextConverter Objects

```python
class DocxToTextConverter(BaseConverter)
```

<a name="docx.DocxToTextConverter.convert"></a>
#### convert

```python
| convert(file_path: Path, meta: Optional[Dict[str, str]] = None) -> Dict[str, Any]
```

Extract text from a .docx file.
Note: As docx doesn't contain "page" information, we actually extract and return a list of paragraphs here.
For compliance with other converters we nevertheless opted for keeping the method's name.

**Arguments**:

- `file_path`: Path to the .docx file you want to convert
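A conversion sketch; the import path and file name are assumptions:

```python
# Illustrative only: import path and file name are assumptions
from pathlib import Path
from haystack.file_converter.docx import DocxToTextConverter

converter = DocxToTextConverter()
result = converter.convert(file_path=Path("annual_report.docx"))
print(result.keys())   # typically the extracted text plus the meta dict
```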
<a name="tika"></a>
# Module tika

<a name="tika.TikaConverter"></a>
## TikaConverter Objects

```python
class TikaConverter(BaseConverter)
```
@@ -123,36 +118,11 @@ in garbled text.

a list of pages and the extracted meta data of the file.
<a name="base"></a>
# Module base

<a name="base.BaseConverter"></a>
## BaseConverter Objects

```python
class BaseConverter()
```
@@ -207,3 +177,33 @@ supplied meta data like author, url, external IDs can be supplied as a dictionar

Validate if the language of the text is one of valid languages.
<a name="pdf"></a>
# Module pdf

<a name="pdf.PDFToTextConverter"></a>
## PDFToTextConverter Objects

```python
class PDFToTextConverter(BaseConverter)
```

<a name="pdf.PDFToTextConverter.__init__"></a>
#### \_\_init\_\_

```python
| __init__(remove_numeric_tables: Optional[bool] = False, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could be possible candidates for searching answers.
The rows containing strings are thus retained in this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not one of the valid languages, then it is likely an encoding error resulting
in garbled text.
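A conversion sketch combining both options; the import path and file name are assumptions:

```python
# Illustrative only: import path and file name are assumptions
from pathlib import Path
from haystack.file_converter.pdf import PDFToTextConverter

converter = PDFToTextConverter(
    remove_numeric_tables=True,     # drop table rows that are purely numeric
    valid_languages=["en", "de"],   # flag likely encoding errors in the extracted text
)
document = converter.convert(file_path=Path("whitepaper.pdf"))
```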
docs/_src/api/api/generator.md (new file, 137 lines)

@@ -0,0 +1,137 @@
<a name="transformers"></a>
# Module transformers

<a name="transformers.RAGenerator"></a>
## RAGenerator Objects

```python
class RAGenerator(BaseGenerator)
```

Implementation of Facebook's Retrieval-Augmented Generator (https://arxiv.org/abs/2005.11401) based on
HuggingFace's transformers (https://huggingface.co/transformers/model_doc/rag.html).

Instead of "finding" the answer within a document, these models **generate** the answer.
In that sense, RAG follows a similar approach to GPT-3, but it comes with two huge advantages
for real-world applications:
a) it has a manageable model size
b) the answer generation is conditioned on retrieved documents,
i.e. the model can easily adjust to domain documents even after training has finished
(in contrast: GPT-3 relies on the web data seen during training)

**Example**

```python
> question = "who got the first nobel prize in physics?"

# Retrieve related documents from retriever
> retrieved_docs = retriever.retrieve(query=question)

> # Now generate answer from question and retrieved documents
> generator.predict(
>    question=question,
>    documents=retrieved_docs,
>    top_k=1
> )
{'question': 'who got the first nobel prize in physics',
 'answers':
     [{'question': 'who got the first nobel prize in physics',
       'answer': ' albert einstein',
       'meta': { 'doc_ids': [...],
                 'doc_scores': [80.42758 ...],
                 'doc_probabilities': [40.71379089355469, ...
                 'texts': ['Albert Einstein was a ...]
                 'titles': ['"Albert Einstein"', ...]
       }}]}
```
<a name="transformers.RAGenerator.__init__"></a>
#### \_\_init\_\_

```python
| __init__(model_name_or_path: str = "facebook/rag-token-nq", retriever: Optional[DensePassageRetriever] = None, generator_type: RAGeneratorType = RAGeneratorType.TOKEN, top_k_answers: int = 2, max_length: int = 200, min_length: int = 2, num_beams: int = 2, embed_title: bool = True, prefix: Optional[str] = None, use_gpu: bool = True)
```

Load a RAG model from Transformers along with passage_embedding_model.
See https://huggingface.co/transformers/model_doc/rag.html for more details

**Arguments**:

- `model_name_or_path`: Directory of a saved model or the name of a public model e.g.
'facebook/rag-token-nq', 'facebook/rag-sequence-nq'.
See https://huggingface.co/models for a full list of available models.
- `retriever`: `DensePassageRetriever` used to embed passages
- `generator_type`: Which RAG generator implementation to use? RAG-TOKEN or RAG-SEQUENCE
- `top_k_answers`: Number of independently generated texts to return
- `max_length`: Maximum length of generated text
- `min_length`: Minimum length of generated text
- `num_beams`: Number of beams for beam search. 1 means no beam search.
- `embed_title`: Embed the title of the passage while generating the embedding
- `prefix`: The prefix used by the generator's tokenizer.
- `use_gpu`: Whether to use GPU (if available)
<a name="transformers.RAGenerator.predict"></a>
#### predict

```python
| predict(question: str, documents: List[Document], top_k: Optional[int] = None) -> Dict
```

Generate the answer to the input question. The generation will be conditioned on the supplied documents.
These documents can for example be retrieved via the Retriever.

**Arguments**:

- `question`: Question
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
- `top_k`: Number of returned answers

**Returns**:

Generated answers plus additional info in a dict like this:

```python
> {'question': 'who got the first nobel prize in physics',
>  'answers':
>      [{'question': 'who got the first nobel prize in physics',
>        'answer': ' albert einstein',
>        'meta': { 'doc_ids': [...],
>                  'doc_scores': [80.42758 ...],
>                  'doc_probabilities': [40.71379089355469, ...
>                  'texts': ['Albert Einstein was a ...]
>                  'titles': ['"Albert Einstein"', ...]
>        }}]}
```
<a name="base"></a>
# Module base

<a name="base.BaseGenerator"></a>
## BaseGenerator Objects

```python
class BaseGenerator(ABC)
```

Abstract class for Generators

<a name="base.BaseGenerator.predict"></a>
#### predict

```python
| @abstractmethod
| predict(question: str, documents: List[Document], top_k: Optional[int]) -> Dict
```

Abstract method to generate answers.

**Arguments**:

- `question`: Question
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
- `top_k`: Number of returned answers

**Returns**:

Generated answers plus additional info in a dict
@@ -1,5 +1,44 @@
<a name="preprocessor"></a>
# Module preprocessor

<a name="preprocessor.PreProcessor"></a>
## PreProcessor Objects

```python
class PreProcessor(BasePreProcessor)
```

<a name="preprocessor.PreProcessor.__init__"></a>
#### \_\_init\_\_

```python
| __init__(clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, split_by: Optional[str] = "word", split_length: Optional[int] = 1000, split_stride: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = True)
```

**Arguments**:

- `clean_header_footer`: Use heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `clean_whitespace`: Strip whitespaces before or after each line in the text.
- `clean_empty_lines`: Remove more than two empty lines in the text.
- `split_by`: Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting.
- `split_length`: Max. number of the above split unit (e.g. words) that are allowed in one document. For instance, if split_length -> 10 & split_by ->
"sentence", then each output document will have 10 sentences.
- `split_stride`: Length of striding window over the splits. For example, if split_by -> `word`,
split_length -> 5 & split_stride -> 2, then the splits would be like:
[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11].
Set the value to None to disable striding behaviour.
- `split_respect_sentence_boundary`: Whether to split in partial sentences if split_by -> `word`. If set
to True, the individual split will always have complete sentences &
the number of words will be <= split_length.
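A construction sketch for a typical cleaning-and-splitting setup; the import path and the `clean`/`split` calls are assumptions about the PreProcessor API and may differ in your version:

```python
# Illustrative only: import path and method names are assumptions
from haystack.preprocessor.preprocessor import PreProcessor

processor = PreProcessor(
    clean_whitespace=True,
    clean_empty_lines=True,
    clean_header_footer=True,
    split_by="word",
    split_length=200,
    split_respect_sentence_boundary=True,  # never cut a sentence in half
)

doc = {"text": "Some long raw text ...", "meta": {"name": "raw.txt"}}
cleaned = processor.clean(doc)     # assumed method name
splits = processor.split(cleaned)  # assumed method name
```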
<a name="cleaning"></a>
# Module cleaning

<a name="utils"></a>
# Module utils

<a name="utils.eval_data_from_file"></a>
#### eval\_data\_from\_file

@@ -84,45 +123,6 @@ Fetch an archive (zip or tar.gz) from a url via http and extract content to an o

bool if anything got fetched

<a name="base"></a>
# Module base
@@ -10,5 +10,8 @@ processor:
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: document_store.md

@@ -10,5 +10,8 @@ processor:
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: file_converter.md

@@ -10,5 +10,8 @@ processor:
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: generator.md

@@ -10,5 +10,8 @@ processor:
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: preprocessor.md

@@ -10,5 +10,8 @@ processor:
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: reader.md

@@ -10,5 +10,8 @@ processor:
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: retriever.md
@@ -1,8 +1,8 @@
<a name="farm"></a>
# Module farm

<a name="farm.FARMReader"></a>
## FARMReader Objects

```python
class FARMReader(BaseReader)
```
@@ -279,10 +279,10 @@ float32 could still be more performant.
- `opset_version`: ONNX opset version

<a name="transformers"></a>
# Module transformers

<a name="transformers.TransformersReader"></a>
## TransformersReader Objects

```python
class TransformersReader(BaseReader)
```
@@ -368,5 +368,5 @@ Example:
Dict containing question and answers

<a name="base"></a>
# Module base
@@ -1,8 +1,8 @@
<a name="sparse"></a>
# Module sparse

<a name="sparse.ElasticsearchRetriever"></a>
## ElasticsearchRetriever Objects

```python
class ElasticsearchRetriever(BaseRetriever)
```
@@ -52,7 +52,7 @@ self.retrieve(query="Why did the revenue increase?",
```

<a name="sparse.ElasticsearchFilterOnlyRetriever"></a>
## ElasticsearchFilterOnlyRetriever Objects

```python
class ElasticsearchFilterOnlyRetriever(ElasticsearchRetriever)
```
@@ -62,7 +62,7 @@ Naive "Retriever" that returns all documents that match the given filters. No im
Helpful for benchmarking, testing and if you want to do QA on small documents without an "active" retriever.

<a name="sparse.TfidfRetriever"></a>
## TfidfRetriever Objects

```python
class TfidfRetriever(BaseRetriever)
```
@@ -76,10 +76,10 @@ computations when text is passed on to a Reader for QA.
It uses sklearn's TfidfVectorizer to compute a tf-idf matrix.

<a name="dense"></a>
# Module dense

<a name="dense.DensePassageRetriever"></a>
## DensePassageRetriever Objects

```python
class DensePassageRetriever(BaseRetriever)
```
@@ -201,7 +201,7 @@ train a DensePassageRetrieval model
- `passage_encoder_save_dir`: directory inside save_dir where passage_encoder model files are saved

<a name="dense.EmbeddingRetriever"></a>
## EmbeddingRetriever Objects

```python
class EmbeddingRetriever(BaseRetriever)
```
@@ -286,10 +286,10 @@ Create embeddings for a list of passages. For this Retriever type: The same as c
Embeddings, one per input passage

<a name="base"></a>
# Module base

<a name="base.BaseRetriever"></a>
## BaseRetriever Objects

```python
class BaseRetriever(ABC)
```
@@ -330,7 +330,10 @@ position in the ranking of documents the correct document is.
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
average precision is normalized by the number of retrieved relevant documents per query.
If ``open_domain=False``, average precision is normalized by the number of all relevant documents
per query.

**Arguments**: