Generate docstrings and deploy branches to Staging (Website) (#731)

* test pre commit hook

* test status

* test on this branch

* push generated docstrings and tutorials to branch

* fixed syntax error

* Add latest docstring and tutorial changes

* add files before commit

* catch commit error

* separate generation from deployment

* add deployment process for staging

* add current branch to payload

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Markus Paff 2021-01-21 11:01:09 +01:00 committed by GitHub
parent 0f62e0b2ee
commit 0b583b8972
8 changed files with 285 additions and 187 deletions


@@ -13,28 +13,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.7
        uses: actions/setup-python@v2
        with:
          python-version: 3.7
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install 'pydoc-markdown>=3.0.0,<4.0.0'
          pip install mkdocs
          pip install jupytercontrib
      # Generates the docstrings and tutorials so that we have the latest for the deployment
      - name: Generate Docstrings and Tutorials
        run: |
          cd docs/_src/api/api/
          ./generate_docstrings.sh
          cd ../../tutorials/tutorials/
          python3 convert_ipynb.py
      # Creates dispatch event for haystack-website repo
      - name: Repository Dispatch
        uses: peter-evans/repository-dispatch@v1


@@ -0,0 +1,26 @@
name: Deploy website

# Controls when the action will run. Triggers the workflow on push
# events, but only for branches other than master and benchmarks.
on:
  push:
    branches-ignore:
      - master
      - benchmarks

jobs:
  # This workflow contains a single job called "build"
  build:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest
    steps:
      # Creates dispatch event for haystack-website repo
      - name: Repository Dispatch
        uses: peter-evans/repository-dispatch@v1
        with:
          token: ${{ secrets.PUBLIC_REPO_ACCESS_TOKEN }}
          repository: deepset-ai/haystack-website
          event-type: deploy-website-staging
          client-payload: '{"ref": "${{ github.ref }}"}'
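
For context, the `peter-evans/repository-dispatch` action ultimately issues a single call to GitHub's repository-dispatch REST endpoint. The sketch below is not part of this PR; it is a minimal illustration (token and branch name are placeholders) of the event that `deepset-ai/haystack-website` receives, including the `ref` carried in the client payload.

```python
import os

import requests

# Minimal sketch of what the repository-dispatch action sends.
# The token and branch name below are placeholders, not values from this PR.
token = os.environ["GITHUB_TOKEN"]             # token with access to the target repo
current_ref = "refs/heads/my-feature-branch"   # ${{ github.ref }} in the workflow

response = requests.post(
    "https://api.github.com/repos/deepset-ai/haystack-website/dispatches",
    headers={
        "Accept": "application/vnd.github.v3+json",
        "Authorization": f"token {token}",
    },
    json={
        "event_type": "deploy-website-staging",
        "client_payload": {"ref": current_ref},
    },
)
response.raise_for_status()  # the dispatch endpoint returns 204 No Content on success
```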

.github/workflows/update_docs.yml (new file)

@@ -0,0 +1,55 @@
name: Update Docstrings and Tutorials

# Controls when the action will run. Triggers the workflow on push
# events, but only for branches other than master and benchmarks.
on:
  push:
    branches-ignore:
      - master
      - benchmarks

jobs:
  # This workflow contains a single job called "build"
  build:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          persist-credentials: false  # otherwise, the token used is the GITHUB_TOKEN instead of your personal token
          fetch-depth: 0              # otherwise, pushing refs to the destination repo will fail
      - name: Set up Python 3.7
        uses: actions/setup-python@v2
        with:
          python-version: 3.7
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install 'pydoc-markdown>=3.0.0,<4.0.0'
          pip install mkdocs
          pip install jupytercontrib
      # Generates the docstrings and tutorials so that we have the latest for the deployment
      - name: Generate Docstrings and Tutorials
        run: |
          cd docs/_src/api/api/
          ./generate_docstrings.sh
          cd ../../tutorials/tutorials/
          python3 convert_ipynb.py
          cd ../../../../
          git status
      - name: Commit files
        run: |
          git config --local user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git config --local user.name "github-actions[bot]"
          git add .
          git commit -m "Add latest docstring and tutorial changes" -a || echo "No changes to commit"
      - name: Push changes
        uses: ad-m/github-push-action@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          branch: ${{ github.ref }}
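
The scripts `generate_docstrings.sh` and `convert_ipynb.py` are not shown in this diff. Purely as a rough sketch of what a notebook-to-Markdown conversion step can look like, the following hypothetical snippet uses `nbconvert`; the directory layout, the file pattern, and the use of `nbconvert` itself are assumptions, not taken from this PR.

```python
from pathlib import Path

from nbconvert import MarkdownExporter

# Hypothetical sketch: convert every tutorial notebook in the current directory to Markdown.
# The "Tutorial*.ipynb" pattern and the output naming are assumptions.
exporter = MarkdownExporter()
for notebook in sorted(Path(".").glob("Tutorial*.ipynb")):
    body, _resources = exporter.from_filename(str(notebook))
    notebook.with_suffix(".md").write_text(body, encoding="utf-8")
    print(f"Converted {notebook.name}")
```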


@@ -1,3 +1,62 @@
<a name="base"></a>
# Module base
<a name="base.BaseConverter"></a>
## BaseConverter Objects
```python
class BaseConverter()
```
Base class for implementing file converters that transform input documents into text format for ingestion in DocumentStore.
<a name="base.BaseConverter.__init__"></a>
#### \_\_init\_\_
```python
| __init__(remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None)
```
**Arguments**:
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained when this option is enabled.
- `valid_languages`: Validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, it is likely an encoding error resulting
in garbled text.
<a name="base.BaseConverter.convert"></a>
#### convert
```python
| @abstractmethod
| convert(file_path: Path, meta: Optional[Dict[str, str]]) -> Dict[str, Any]
```
Convert a file to a dictionary containing the text and any associated metadata.
File converters may extract file metadata such as name or size. In addition,
user-supplied metadata such as author, URL, or external IDs can be supplied as a dictionary.
**Arguments**:
- `file_path`: path of the file to convert
- `meta`: dictionary of metadata key-value pairs to append to the returned document.
<a name="base.BaseConverter.validate_language"></a>
#### validate\_language
```python
| validate_language(text: str) -> bool
```
Validate whether the language of the text is one of the valid languages.
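
To make the interface above concrete, here is a minimal, hypothetical subclass. The import path, the `valid_languages` attribute on the base class, and the exact shape of the returned dictionary are assumptions based on this docstring, not code from this PR.

```python
from pathlib import Path
from typing import Any, Dict, Optional

from haystack.file_converter.base import BaseConverter  # import path assumed


class PlainTextConverter(BaseConverter):
    """Hypothetical converter that reads a UTF-8 text file as-is."""

    def convert(self, file_path: Path, meta: Optional[Dict[str, str]]) -> Dict[str, Any]:
        text = Path(file_path).read_text(encoding="utf-8")
        # Assumes the base class stores the valid_languages passed to __init__.
        if self.valid_languages and not self.validate_language(text):
            raise ValueError(f"{file_path} does not match any of {self.valid_languages}")
        return {"text": text, "meta": meta or {}}
```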
<a name="txt"></a>
# Module txt
@@ -118,65 +177,6 @@ in garbled text.
a list of pages and the extracted meta data of the file.
<a name="base"></a>
# Module base
<a name="base.BaseConverter"></a>
## BaseConverter Objects
```python
class BaseConverter()
```
Base class for implementing file converts to transform input documents to text format for ingestion in DocumentStore.
<a name="base.BaseConverter.__init__"></a>
#### \_\_init\_\_
```python
| __init__(remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None)
```
**Arguments**:
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could possible candidate for searching answers.
The rows containing strings are thus retained in this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add test for encoding errors. If the extracted text is
not one of the valid languages, then it might likely be encoding error resulting
in garbled text.
<a name="base.BaseConverter.convert"></a>
#### convert
```python
| @abstractmethod
| convert(file_path: Path, meta: Optional[Dict[str, str]]) -> Dict[str, Any]
```
Convert a file to a dictionary containing the text and any associated meta data.
File converters may extract file meta like name or size. In addition to it, user
supplied meta data like author, url, external IDs can be supplied as a dictionary.
**Arguments**:
- `file_path`: path of the file to convert
- `meta`: dictionary of meta data key-value pairs to append in the returned document.
<a name="base.BaseConverter.validate_language"></a>
#### validate\_language
```python
| validate_language(text: str) -> bool
```
Validate if the language of the text is one of valid languages.
<a name="pdf"></a>
# Module pdf


@@ -1,3 +1,35 @@
<a name="base"></a>
# Module base
<a name="base.BaseGenerator"></a>
## BaseGenerator Objects
```python
class BaseGenerator(ABC)
```
Abstract class for Generators
<a name="base.BaseGenerator.predict"></a>
#### predict
```python
| @abstractmethod
| predict(query: str, documents: List[Document], top_k: Optional[int]) -> Dict
```
Abstract method to generate answers.
**Arguments**:
- `query`: Query
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
- `top_k`: Number of returned answers
**Returns**:
Generated answers plus additional info in a dict
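
As an illustration of the contract described above, a minimal, hypothetical generator might look like the sketch below. The import paths, the `Document.text` and `Document.meta` attributes, and the layout of the returned dict are assumptions, not part of this PR.

```python
from typing import Dict, List, Optional

from haystack import Document                       # import path assumed
from haystack.generator.base import BaseGenerator   # import path assumed


class EchoGenerator(BaseGenerator):
    """Hypothetical generator that echoes the top documents instead of generating new text."""

    def predict(self, query: str, documents: List[Document], top_k: Optional[int]) -> Dict:
        top_k = top_k or 1
        answers = [{"answer": doc.text, "meta": doc.meta} for doc in documents[:top_k]]
        return {"query": query, "answers": answers}
```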
<a name="transformers"></a>
# Module transformers
@@ -106,35 +138,3 @@ Generated answers plus additional infos in a dict like this:
| }}]}
```
<a name="base"></a>
# Module base
<a name="base.BaseGenerator"></a>
## BaseGenerator Objects
```python
class BaseGenerator(ABC)
```
Abstract class for Generators
<a name="base.BaseGenerator.predict"></a>
#### predict
```python
| @abstractmethod
| predict(query: str, documents: List[Document], top_k: Optional[int]) -> Dict
```
Abstract method to generate answers.
**Arguments**:
- `query`: Query
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
- `top_k`: Number of returned answers
**Returns**:
Generated answers plus additional infos in a dict


@@ -207,6 +207,44 @@ Initialize a Pipeline for Generative Question Answering.
- `generator`: Generator instance
- `retriever`: Retriever instance
<a name="pipeline.SearchSummarizationPipeline"></a>
## SearchSummarizationPipeline Objects
```python
class SearchSummarizationPipeline(BaseStandardPipeline)
```
<a name="pipeline.SearchSummarizationPipeline.__init__"></a>
#### \_\_init\_\_
```python
| __init__(summarizer: BaseSummarizer, retriever: BaseRetriever)
```
Initialize a Pipeline that retrieves documents for a query and then summarizes those documents.
**Arguments**:
- `summarizer`: Summarizer instance
- `retriever`: Retriever instance
<a name="pipeline.SearchSummarizationPipeline.run"></a>
#### run
```python
| run(query: str, filters: Optional[Dict] = None, top_k_retriever: int = 10, generate_single_summary: bool = False, return_in_answer_format=False)
```
**Arguments**:
- `query`: Your search query
- `filters`: Optional filters to narrow down the documents the retriever fetches, as a dictionary mapping metadata field names to lists of accepted values
- `top_k_retriever`: Number of top docs the retriever should pass to the summarizer.
The higher this value, the slower your pipeline.
- `generate_single_summary`: Whether to generate single summary from all retrieved docs (True) or one per doc (False).
- `return_in_answer_format`: Whether the results should be returned as documents (False) or in the answer format used in other QA pipelines (True).
With the latter, you can use this pipeline as a "drop-in replacement" for other QA pipelines.
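
A hedged usage sketch for this pipeline follows. The import paths and the concrete document store, retriever, and summarizer classes are assumptions; any retriever and summarizer implementing the corresponding base interfaces should work.

```python
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore  # import paths assumed
from haystack.pipeline import SearchSummarizationPipeline
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.summarizer.transformers import TransformersSummarizer

# Hypothetical setup: an Elasticsearch-backed store that already contains indexed documents.
document_store = ElasticsearchDocumentStore(host="localhost", index="document")
retriever = ElasticsearchRetriever(document_store=document_store)
summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")

pipeline = SearchSummarizationPipeline(summarizer=summarizer, retriever=retriever)
result = pipeline.run(
    query="What did Einstein work on?",
    top_k_retriever=10,
    generate_single_summary=True,      # one summary over all retrieved docs
    return_in_answer_format=True,      # mimic the answer format of the QA pipelines
)
```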
<a name="pipeline.FAQPipeline"></a>
## FAQPipeline Objects


@@ -1,3 +1,6 @@
<a name="base"></a>
# Module base
<a name="farm"></a>
# Module farm
@@ -378,6 +381,3 @@ Example:
Dict containing query and answers
<a name="base"></a>
# Module base


@@ -1,3 +1,74 @@
<a name="base"></a>
# Module base
<a name="base.BaseRetriever"></a>
## BaseRetriever Objects
```python
class BaseRetriever(ABC)
```
<a name="base.BaseRetriever.retrieve"></a>
#### retrieve
```python
| @abstractmethod
| retrieve(query: str, filters: dict = None, top_k: int = 10, index: str = None) -> List[Document]
```
Scan through documents in DocumentStore and return a small number of documents
that are most relevant to the query.
**Arguments**:
- `query`: The query
- `filters`: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field
- `top_k`: How many documents to return per query.
- `index`: The name of the index in the DocumentStore from which to retrieve documents
<a name="base.BaseRetriever.timing"></a>
#### timing
```python
| timing(fn)
```
Wrapper method used to time functions.
<a name="base.BaseRetriever.eval"></a>
#### eval
```python
| eval(label_index: str = "label", doc_index: str = "eval_document", label_origin: str = "gold_label", top_k: int = 10, open_domain: bool = False, return_preds: bool = False) -> dict
```
Performs evaluation on the Retriever.
The Retriever is evaluated based on whether it finds the correct document given the query string and at which
position in the ranking of documents the correct document appears.
Returns a dict containing the following metrics:
- "recall": Proportion of questions for which the correct document is among the retrieved documents
- "mrr": Mean reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
Only considers the highest-ranked relevant document.
- "map": Mean average precision over all questions. Rewards retrievers that give relevant
documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
average precision is normalized by the number of retrieved relevant documents per query.
If ``open_domain=False``, average precision is normalized by the number of all relevant documents
per query.
**Arguments**:
- `label_index`: Index/Table in DocumentStore where labeled questions are stored
- `doc_index`: Index/Table in DocumentStore where documents that are used for evaluation are stored
- `top_k`: How many documents to return per query
- `open_domain`: If ``True``, retrieval will be evaluated by checking if the answer string to a question is
contained in the retrieved docs (common approach in open-domain QA).
If ``False``, retrieval uses a stricter evaluation that checks whether the retrieved document ids
are among the ids explicitly stated in the labels.
- `return_preds`: Whether to add predictions in the returned dictionary. If True, the returned dictionary
contains the keys "predictions" and "metrics".
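
A brief, hypothetical usage sketch for the `retrieve()` and `eval()` methods described above; the concrete classes, import paths, index names, and filter fields are assumptions, not part of this PR.

```python
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore  # import paths assumed
from haystack.retriever.sparse import ElasticsearchRetriever

document_store = ElasticsearchDocumentStore(host="localhost", index="document")
retriever = ElasticsearchRetriever(document_store=document_store)

# Plain retrieval: top 5 documents, restricted by a (hypothetical) metadata field.
docs = retriever.retrieve(
    query="Who is the father of Arya Stark?",
    filters={"category": ["wiki"]},
    top_k=5,
)

# Evaluation against labeled questions stored in the document store.
metrics = retriever.eval(label_index="label", doc_index="eval_document", top_k=10)
print(metrics["recall"], metrics["mrr"], metrics["map"])
```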
<a name="sparse"></a>
# Module sparse
@@ -408,74 +479,3 @@ Create embeddings for a list of passages. For this Retriever type: The same as c
Embeddings, one per input passage
<a name="base"></a>
# Module base
<a name="base.BaseRetriever"></a>
## BaseRetriever Objects
```python
class BaseRetriever(ABC)
```
<a name="base.BaseRetriever.retrieve"></a>
#### retrieve
```python
| @abstractmethod
| retrieve(query: str, filters: dict = None, top_k: int = 10, index: str = None) -> List[Document]
```
Scan through documents in DocumentStore and return a small number documents
that are most relevant to the query.
**Arguments**:
- `query`: The query
- `filters`: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field
- `top_k`: How many documents to return per query.
- `index`: The name of the index in the DocumentStore from which to retrieve documents
<a name="base.BaseRetriever.timing"></a>
#### timing
```python
| timing(fn)
```
Wrapper method used to time functions.
<a name="base.BaseRetriever.eval"></a>
#### eval
```python
| eval(label_index: str = "label", doc_index: str = "eval_document", label_origin: str = "gold_label", top_k: int = 10, open_domain: bool = False, return_preds: bool = False) -> dict
```
Performs evaluation on the Retriever.
Retriever is evaluated based on whether it finds the correct document given the query string and at which
position in the ranking of documents the correct document is.
| Returns a dict containing the following metrics:
- "recall": Proportion of questions for which correct document is among retrieved documents
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
average precision is normalized by the number of retrieved relevant documents per query.
If ``open_domain=False``, average precision is normalized by the number of all relevant documents
per query.
**Arguments**:
- `label_index`: Index/Table in DocumentStore where labeled questions are stored
- `doc_index`: Index/Table in DocumentStore where documents that are used for evaluation are stored
- `top_k`: How many documents to return per query
- `open_domain`: If ``True``, retrieval will be evaluated by checking if the answer string to a question is
contained in the retrieved docs (common approach in open-domain QA).
If ``False``, retrieval uses a stricter evaluation that checks if the retrieved document ids
are within ids explicitly stated in the labels.
- `return_preds`: Whether to add predictions in the returned dictionary. If True, the returned dictionary
contains the keys "predictions" and "metrics".