<img alt="Checked with MyPy" src="https://camo.githubusercontent.com/34b3a249cd6502d0a521ab2f42c8830b7cfd03fa/687474703a2f2f7777772e6d7970792d6c616e672e6f72672f7374617469632f6d7970795f62616467652e737667">
and state-of-the-art dense methods (e.g. sentence-transformers and Dense Passage Retrieval)
5. **Reader**: Neural network (e.g. BERT or RoBERTa) that reads through texts in detail
to find an answer. The Reader takes multiple passages of text as input and returns top-n answers. Models are trained via [FARM](https://github.com/deepset-ai/FARM) or [Transformers](https://github.com/huggingface/transformers) on SQuAD-like tasks. You can just load a pretrained model from [Hugging Face's model hub](https://huggingface.co/models) or fine-tune it on your own domain data.
6. **Generator**: Neural network (e.g. RAG) that *generates* an answer for a given question conditioned on the retrieved documents from the retriever.
7. **Finder**: Glues together a Retriever + Reader/Generator as a pipeline to provide an easy-to-use question answering interface (see the sketch after this list).
8. **REST API**: Exposes a simple API based on FastAPI for running QA search, uploading files and collecting user feedback for continuous learning.
9. **Haystack Annotate**: Create custom QA labels to improve performance of your domain-specific models. [Hosted version](https://annotate.deepset.ai/login) or [Docker images](https://github.com/deepset-ai/haystack/tree/master/annotation_tool).
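Below is a minimal sketch of how these components fit together. It follows the usage shown in Haystack's tutorials, but the exact module paths, class names and parameters (e.g. `ElasticsearchDocumentStore`, `ElasticsearchRetriever`, `FARMReader`, `Finder.get_answers`) are assumptions here and may differ between versions, so treat it as an illustration rather than a definitive recipe:

```python
from haystack import Finder
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.reader.farm import FARMReader
from haystack.retriever.sparse import ElasticsearchRetriever

# DocumentStore holding your indexed texts (assumes a local Elasticsearch instance is running)
document_store = ElasticsearchDocumentStore(host="localhost", index="document")

# Sparse Retriever (BM25) + extractive Reader loaded from Hugging Face's model hub
retriever = ElasticsearchRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

# The Finder glues Retriever and Reader together into a single QA interface
finder = Finder(reader, retriever)
prediction = finder.get_answers(question="Who is the father of Arya Stark?",
                                top_k_retriever=10, top_k_reader=5)
```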
- Tutorial 2 - Fine-tuning a model on your own data: [Jupyter notebook](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial2_Finetune_a_model_on_your_data.ipynb)
Different converters to extract text from your original files (pdf, docx, txt, html).
While it's almost impossible to cover all types, layouts and special cases (especially in PDFs), we cover the most common formats (incl. multi-column) and extract meta information (e.g. page splits). The converters are easily extendable, so that you can customize them for your files if needed.
**Available options**
- Txt
- PDF
- Docx
- Apache Tika (Supports > 340 file formats)
**Example**
```python
# PDF (constructor and convert() arguments may vary between Haystack versions)
from haystack.file_converter.pdf import PDFToTextConverter
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc = converter.convert(file_path="my_file.pdf", meta=None)
```
Cleaning and splitting of your texts are crucial steps that will directly impact the speed and accuracy of your search.
The splitting of larger texts is especially important for achieving fast query speed. The longer the texts that the retriever passes to the reader, the slower your queries.
**Available Options**
We provide a basic `PreProcessor` class that allows:
- clean whitespace, headers, footer and empty lines
- split by words, sentences or passages
- option for "overlapping" splits
- option to never split within a sentence
You can easily extend this class to your own custom requirements (see the sketch below).
-> See [docs](https://haystack.deepset.ai/docs/latest/databasemd) for details
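As a rough illustration of these options, here is a sketch of a `PreProcessor` configuration. The import path, the parameter names (e.g. `split_by`, `split_length`, `split_respect_sentence_boundary`) and the document dict format are assumptions based on the options listed above and on Haystack's tutorials, and may vary between versions:

```python
from haystack.preprocessor.preprocessor import PreProcessor

processor = PreProcessor(
    clean_empty_lines=True,                 # remove empty lines
    clean_whitespace=True,                  # strip leading/trailing whitespace
    clean_header_footer=True,               # drop repeating headers/footers
    split_by="word",                        # alternatives: "sentence", "passage"
    split_length=200,                       # max size of each resulting document
    split_overlap=20,                       # create "overlapping" splits
    split_respect_sentence_boundary=True,   # never split within a sentence
)

# Raw document as produced by one of the converters above (e.g. PDFToTextConverter)
doc = {"text": "Some long raw text extracted from a file ...", "meta": {"name": "example.pdf"}}
docs = processor.process(doc)
```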
### 4) Retrievers
**What**
The Retriever is a fast "filter" that can quickly go through the full document store and pass a set of candidate documents to the Reader. It is a tool for sifting out the obvious negative cases, saving the Reader from doing more work than it needs to and speeding up the querying process. There are two fundamentally different categories of retrievers: sparse (e.g. TF-IDF, BM25) and dense (e.g. DPR, sentence-transformers).
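To make the sparse/dense distinction concrete, here is a hedged sketch of instantiating one retriever of each category. The module paths, constructor arguments and DPR model names follow Haystack's documented examples but are assumptions here and may differ in your installed version:

```python
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.dense import DensePassageRetriever
from haystack.retriever.sparse import ElasticsearchRetriever

document_store = ElasticsearchDocumentStore(host="localhost", index="document")

# Sparse: scores candidates with the BM25 index of the underlying document store
sparse_retriever = ElasticsearchRetriever(document_store=document_store)

# Dense: encodes questions and passages with two DPR encoders from the model hub
dense_retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)

# Either retriever returns a ranked list of candidate documents for the Reader
candidates = sparse_retriever.retrieve(query="Why did the revenue increase?", top_k=10)
```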
- Use the [hosted version](https://annotate.deepset.ai/login) (Beta) or deploy it yourself with the [Docker Images](https://github.com/deepset-ai/haystack/blob/master/annotation_tool).
- Create labels with different techniques: Come up with questions (+ answers) while reading passages (SQuAD style) or have a set of predefined questions and look for answers in the document (~ Natural Questions).
- Structure your work via organizations, projects, users
- Upload your documents or import labels from an existing SQuAD-style dataset
We are very open to contributions from the community - be it the fix of a small typo or a completely new feature! You don't need to be a
Haystack expert to provide meaningful improvements. To avoid any extra work on either side, please check our [Contributor Guidelines](https://github.com/deepset-ai/haystack/blob/master/CONTRIBUTING.md) first.
Tests will automatically run for every commit you push to your PR. You can also run them locally by executing [pytest](https://docs.pytest.org/en/stable/) in your terminal from the root folder of this repository:
All tests:
```bash
cd test
pytest
```
You can also only run a subset of tests by specifying a marker and the optional "not" keyword:
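For example (the marker name below is illustrative; check the markers registered in the test suite's pytest configuration for the current list):

```bash
cd test
pytest -m "not elasticsearch"    # skip tests that require a running Elasticsearch instance
pytest -m elasticsearch          # run only those tests
```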