mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-12-05 03:17:31 +00:00
New readme (#534)
* WIP readme to md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * delete rst readme
parent
50709a3f9d
commit
c363fefc6e
344
README.md
Normal file
@ -0,0 +1,344 @@
|
||||
<p align="center">
|
||||
<br>
|
||||
<img src="https://github.com/deepset-ai/haystack/blob/master/docs/_src/img/haystack_logo_blue_banner.png?raw=true" />
|
||||
<br>
|
||||
</p>
|
||||
<p>
|
||||
<a href="https://github.com/deepset-ai/haystack/actions">
|
||||
<img alt="Build" src="https://github.com/deepset-ai/haystack/workflows/Build/badge.svg?branch=master">
|
||||
</a>
|
||||
<a href="http://mypy-lang.org/">
|
||||
<img alt="Checked with MyPy" src="https://camo.githubusercontent.com/34b3a249cd6502d0a521ab2f42c8830b7cfd03fa/687474703a2f2f7777772e6d7970792d6c616e672e6f72672f7374617469632f6d7970795f62616467652e737667">
|
||||
</a>
|
||||
<a href="https://haystack.deepset.ai/docs/intromd">
|
||||
<img alt="Documentation" src="https://img.shields.io/website/http/haystack.deepset.ai/docs/intromd.svg?down_color=red&down_message=offline&up_message=online">
|
||||
</a>
|
||||
<a href="https://github.com/deepset-ai/haystack/releases">
|
||||
<img alt="Release" src="https://img.shields.io/github/release/deepset-ai/haystack">
|
||||
</a>
|
||||
<a href="https://github.com/deepset-ai/haystack/blob/master/LICENSE">
|
||||
<img alt="License" src="https://img.shields.io/github/license/deepset-ai/haystack.svg?color=blue">
|
||||
</a>
|
||||
|
||||
<a href="https://github.com/deepset-ai/haystack/commits/master">
|
||||
<img alt="Last commit" src="https://img.shields.io/github/last-commit/deepset-ai/haystack">
|
||||
</a>
|
||||
</p>
|
||||
|
||||
Haystack is an end-to-end framework for Question Answering & Neural search that enables you to ...
|
||||
|
||||
... **ask questions** in natural language and find granular answers in your own documents.
|
||||
... **do semantic document search** and retrieve more relevant documents for your search queries.
|
||||
... **search at scale** through millions of documents.
|
||||
... **use off-the-shelf models or fine-tune** them to your own domain.
|
||||
... **evaluate, benchmark and continuously improve** your models via user feedback.
|
||||
... **improve chat bots** by leveraging existing knowledge bases for the long tail of queries.
|
||||
... **automate processes** by automatically applying a list of questions to new documents and using the extracted answers.
|
||||
|
||||
| | |
|
||||
|-|-|
|
||||
| :ledger: [Docs](https://haystack.deepset.ai/docs/intromd) | Usage, Guides, API documentation ...|
|
||||
| :computer: [Installation](https://github.com/deepset-ai/haystack/#installation) | How to install |
|
||||
| :art: [Key components](https://github.com/deepset-ai/haystack/#key-components) | Overview of core concepts |
|
||||
| :eyes: [Quick Tour](https://github.com/deepset-ai/haystack/#quick-tour) | Basic explanation of concepts, options and usage |
|
||||
| :mortar_board: [Tutorials](https://github.com/deepset-ai/haystack/#tutorials) | Jupyter/Colab Notebooks & Scripts |
|
||||
| :bar_chart: [Benchmarks](https://haystack.deepset.ai/bm/benchmarks) | Speed & Accuracy of Retriever, Readers and DocumentStores |
|
||||
| :telescope: [Roadmap](https://haystack.deepset.ai/en/docs/roadmapmd) | Public roadmap of Haystack |
|
||||
| :heart: [Contributing](https://github.com/deepset-ai/haystack/#heart-contributing) | We welcome all contributions! |
|
||||
|
||||
## Core Features
|
||||
|
||||
- **Latest models**: Use the latest transformer-based models (e.g. BERT, RoBERTa, MiniLM) for extractive QA, generative QA and document retrieval.
- **Modular**: Multiple choices to fit your tech stack and use case. Pick your favorite database, file converter or modeling framework.
- **Open**: 100% compatible with HuggingFace's model hub. Tight interfaces to other frameworks (e.g. Transformers, FARM, sentence-transformers).
- **Scalable**: Scale to millions of docs via retrievers, production-ready backends like Elasticsearch / FAISS and a FastAPI REST API.
- **End-to-End**: All tooling in one place: file conversion, cleaning, splitting, training, eval, inference, labeling ...
- **Developer friendly**: Easy to debug, extend and modify.
- **Customizable**: Fine-tune models to your own domain or implement your custom DocumentStore.
- **Continuous Learning**: Collect new training data via user feedback in production & improve your models continuously.
|
||||
|
||||
|
||||
## Installation
|
||||
|
||||
PyPI:

    pip install farm-haystack

Master branch (if you want to try the latest features):

    git clone https://github.com/deepset-ai/haystack.git
    cd haystack
    pip install --editable .

To update your installation, just do a git pull. Because of the --editable flag,
the pulled changes take effect immediately.
|
||||
|
||||
On Windows you might need:
|
||||
```
|
||||
pip install farm-haystack -f https://download.pytorch.org/whl/torch_stable.html
|
||||
```
|
||||
|
||||
## Key Components
|
||||
|
||||

|
||||
|
||||
1. **FileConverter**: Extracts pure text from files (pdf, docx, pptx, html and many more).
2. **PreProcessor**: Cleans and splits texts into smaller chunks.
3. **DocumentStore**: Database storing the documents, metadata and vectors for our search.
   We recommend Elasticsearch or FAISS, but also offer more lightweight options for fast prototyping (SQL or In-Memory).
4. **Retriever**: Fast algorithms that identify candidate documents for a given query from a large collection of documents.
   Retrievers narrow down the search space significantly and are therefore key for scalable QA.
   Haystack supports sparse methods (TF-IDF, BM25, custom Elasticsearch queries)
   and state-of-the-art dense methods (e.g. sentence-transformers and Dense Passage Retrieval).
5. **Reader**: Neural network (e.g. BERT or RoBERTa) that reads through texts in detail
   to find an answer. The Reader takes multiple passages of text as input and returns top-n answers. Models are trained via [FARM](https://github.com/deepset-ai/FARM) or [Transformers](https://github.com/huggingface/transformers) on SQuAD-like tasks. You can just load a pretrained model from [Hugging Face's model hub](https://huggingface.co/models) or fine-tune it on your own domain data.
6. **Generator**: Neural network (e.g. RAG) that *generates* an answer for a given question conditioned on the retrieved documents from the retriever.
7. **Finder**: Glues together a Retriever + Reader/Generator as a pipeline to provide an easy-to-use question answering interface.
8. **REST API**: Exposes a simple API based on FastAPI for running QA search, uploading files and collecting user feedback for continuous learning.
9. **Haystack Annotate**: Create custom QA labels to improve performance of your domain-specific models. [Hosted version](https://annotate.deepset.ai/login) or [Docker images](https://github.com/deepset-ai/haystack/tree/master/annotation_tool).
|
||||
|
||||
|
||||
|
||||
## Usage
|
||||
|
||||

|
||||
|
||||
## Tutorials
|
||||
|
||||
- Tutorial 1 - Basic QA Pipeline: [Jupyter notebook](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial1_Basic_QA_Pipeline.ipynb)
|
||||
or
|
||||
[Colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial1_Basic_QA_Pipeline.ipynb)
|
||||
- Tutorial 2 - Fine-tuning a model on own data: [Jupyter notebook](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial2_Finetune_a_model_on_your_data.ipynb)
|
||||
or
|
||||
[Colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial2_Finetune_a_model_on_your_data.ipynb)
|
||||
- Tutorial 3 - Basic QA Pipeline without Elasticsearch: [Jupyter notebook](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.ipynb)
|
||||
or
|
||||
[Colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.ipynb)
|
||||
- Tutorial 4 - FAQ-style QA: [Jupyter notebook](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial4_FAQ_style_QA.ipynb)
|
||||
or
|
||||
[Colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial4_FAQ_style_QA.ipynb)
|
||||
- Tutorial 5 - Evaluation of the whole QA-Pipeline: [Jupyter notebook](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial5_Evaluation.ipynb)
|
||||
or
|
||||
[Colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial5_Evaluation.ipynb)
|
||||
- Tutorial 6 - Better Retrievers via "Dense Passage Retrieval":
|
||||
[Jupyter notebook](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb)
|
||||
or
|
||||
[Colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb)
|
||||
|
||||
## Quick Tour
|
||||
[File Conversion](https://github.com/deepset-ai/haystack/blob/master/README.md#1-file-conversion) | [Preprocessing](https://github.com/deepset-ai/haystack/blob/master/README.md#2-preprocessing) | [DocumentStores](https://github.com/deepset-ai/haystack/blob/master/README.md#3-documentstores) | [Retrievers](https://github.com/deepset-ai/haystack/blob/master/README.md#4-retrievers) | [Readers](https://github.com/deepset-ai/haystack/blob/master/README.md#5-readers) | [REST API](https://github.com/deepset-ai/haystack/blob/master/README.md#6-rest-api) | [Labeling Tool](https://github.com/deepset-ai/haystack/blob/master/README.md#7-labeling-tool)
|
||||
|
||||
### 1) File Conversion
|
||||
**What**
|
||||
Different converters to extract text from your original files (pdf, docx, txt, html).
|
||||
While it's almost impossible to cover all types, layouts and special cases (especially in PDFs), we cover the most common formats (incl. multi-column) and extract meta information (e.g. page splits). The converters are easily extendable, so that you can customize them for your files if needed.
|
||||
|
||||
**Available options**
|
||||
- Txt
|
||||
- PDF
|
||||
- Docx
|
||||
- Apache Tika (Supports > 340 file formats)
|
||||
|
||||
**Example**
|
||||
|
||||
```python
# `file` is the path to the document you want to convert

# PDF
from haystack.file_converter.pdf import PDFToTextConverter

converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["de", "en"])
doc = converter.convert(file_path=file, meta=None)
# => {"text": "text first page \f text second page ...", "meta": None}

# DOCX
from haystack.file_converter.docx import DocxToTextConverter

converter = DocxToTextConverter(remove_numeric_tables=True, valid_languages=["de", "en"])
doc = converter.convert(file_path=file, meta=None)
# => {"text": "some text", "meta": None}
```
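The PDF output above separates pages with a form feed (`\f`). If you need the pages individually, a minimal sketch could look like the following. The `split_pages` helper is hypothetical (it is not part of Haystack), and the dictionary is a hand-written stand-in for real converter output.

```python
# Hand-written stand-in for a converter result, shaped like the output shown above
doc = {"text": "text first page \f text second page", "meta": None}


def split_pages(text):
    """Split converted text on form feeds and trim whitespace around each page."""
    return [page.strip() for page in text.split("\f")]


pages = split_pages(doc["text"])
for number, page in enumerate(pages, start=1):
    print(number, page)
```

Keeping the page number next to each page makes it easy to store it as meta data later on.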
|
||||
|
||||
|
||||
### 2) Preprocessing
|
||||
**What**
|
||||
Cleaning and splitting of your texts are crucial steps that will directly impact the speed and accuracy of your search.
|
||||
The splitting of larger texts is especially important for achieving fast query speed. The longer the texts that the retriever passes to the reader, the slower your queries.
|
||||
|
||||
**Available Options**
|
||||
We provide a basic `PreProcessor` class that lets you:
|
||||
- clean whitespace, headers, footer and empty lines
|
||||
- split by words, sentences or passages
|
||||
- option for "overlapping" splits
|
||||
- option to never split within a sentence
|
||||
|
||||
You can easily extend this class to your own custom requirements.
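To illustrate what word-based splitting with "overlapping" splits means, here is a small stand-alone sketch. It is not the `PreProcessor` implementation: the function `split_by_words` and its `overlap` parameter are made up for this example, and it ignores sentence boundaries.

```python
def split_by_words(text, split_length=200, overlap=0):
    """Split text into chunks of at most `split_length` words.

    Consecutive chunks share `overlap` words, so an answer that sits on a
    chunk border still appears completely in one of the chunks.
    """
    if not 0 <= overlap < split_length:
        raise ValueError("overlap must be >= 0 and smaller than split_length")
    words = text.split()
    step = split_length - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = split_by_words("one two three four five six seven", split_length=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Larger overlaps reduce the risk of cutting an answer in half, but they also increase the number of chunks the retriever and reader have to process.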
|
||||
|
||||
**Example**
|
||||
```python
# Import path of PreProcessor is an assumption; adjust it to your installed version
from haystack.file_converter.pdf import PDFToTextConverter
from haystack.preprocessor.preprocessor import PreProcessor

converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])

processor = PreProcessor(clean_empty_lines=True,
                         clean_whitespace=True,
                         clean_header_footer=True,
                         split_by="word",
                         split_length=200,
                         split_respect_sentence_boundary=True)

# `filenames` and `filepaths` are your own lists of names and paths of the files to index
docs = []
for f_name, f_path in zip(filenames, filepaths):
    # Optional: Supply any meta data here
    # the "name" field will be used by DPR if embed_title=True, rest is custom and can be named arbitrarily
    cur_meta = {"name": f_name, "category": "a"}  # ... plus any further custom fields

    # Run the conversion on each file (PDF -> 1x doc)
    d = converter.convert(f_path, meta=cur_meta)

    # clean and split each dict (1x doc -> multiple docs)
    d = processor.process(d)
    docs.extend(d)

# at this point docs will be [{"text": "some", "meta":{"name": "myfilename", "category":"a"}},...]
# `document_store` is an initialized DocumentStore (see next section)
document_store.write_documents(docs)
```
|
||||
|
||||
### 3) DocumentStores
|
||||
|
||||
**What**
|
||||
- Store your texts, meta data and optionally embeddings
|
||||
- Documents should be chunked into smaller units (e.g. paragraphs)
|
||||
before indexing to make the results returned by the Retriever more
|
||||
granular and accurate.
|
||||
|
||||
**Available Options**
|
||||
|
||||
- Elasticsearch
|
||||
- FAISS
|
||||
- SQL
|
||||
- InMemory
|
||||
|
||||
**Example**
|
||||
|
||||
```python
# Run elasticsearch, e.g. via docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2

# Connect
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

# Get all documents
document_store.get_all_documents()

# Query
document_store.query(query="What is the meaning of life?", filters=None, top_k=5)
# `query_emb` is the embedding vector of your query
document_store.query_by_embedding(query_emb, filters=None, top_k=5)
```
|
||||
-> See [docs](https://haystack.deepset.ai/docs/latest/databasemd) for details
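The `filters` argument restricts a query to documents with matching meta data. The sketch below shows the idea with a toy in-memory store. `TinyDocumentStore` is hypothetical and not Haystack's `InMemoryDocumentStore`; it assumes that filters map a meta field to a list of accepted values.

```python
class TinyDocumentStore:
    """Toy in-memory store for illustration only (not part of Haystack)."""

    def __init__(self):
        self.docs = []

    def write_documents(self, docs):
        # Documents are plain dicts with "text" and "meta" keys
        self.docs.extend(docs)

    def get_all_documents(self, filters=None):
        # Assumed filter format: {"meta_field": ["accepted", "values"]}
        if not filters:
            return list(self.docs)
        return [
            d for d in self.docs
            if all((d.get("meta") or {}).get(field) in accepted
                   for field, accepted in filters.items())
        ]


store = TinyDocumentStore()
store.write_documents([
    {"text": "Revenue grew in Q1.", "meta": {"name": "report_a", "category": "a"}},
    {"text": "Costs fell in Q2.", "meta": {"name": "report_b", "category": "b"}},
])
matches = store.get_all_documents(filters={"category": ["a"]})
print([d["meta"]["name"] for d in matches])
```

A production DocumentStore pushes such filters down to the database instead of scanning all documents in Python.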
|
||||
|
||||
|
||||
### 4) Retrievers
|
||||
|
||||
**What**
|
||||
The Retriever is a fast "filter" that can quickly go through the full document store and pass a set of candidate documents to the Reader. It is a tool for sifting out the obvious negative cases, saving the Reader from doing more work than it needs to and speeding up the querying process. There are two fundamentally different categories of retrievers: sparse (e.g. TF-IDF, BM25) and dense (e.g. DPR, sentence-transformers).
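To give an intuition for sparse retrieval, here is a small pure-Python sketch of BM25 scoring. It is an illustration under simplifying assumptions (whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`), not the scoring code used by Elasticsearch or by Haystack's retrievers.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # no token overlap, no contribution
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "revenue increased due to strong sales",
    "the weather was sunny",
    "sales revenue fell last year",
]
scores = bm25_scores("revenue sales", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Documents without any token overlap get a score of zero, which is exactly the case where dense retrievers can help.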
|
||||
|
||||
**Available Options**
|
||||
- DensePassageRetriever
|
||||
- ElasticsearchRetriever
|
||||
- EmbeddingRetriever
|
||||
- TfidfRetriever
|
||||
|
||||
**Example**
|
||||
|
||||
```python
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  use_gpu=True,
                                  batch_size=16,
                                  embed_title=True)
retriever.retrieve(query="Why did the revenue increase?")
# returns: [Document, Document]
```
|
||||
|
||||
-> See [docs](https://haystack.deepset.ai/docs/latest/retrievermd) for details
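Dense retrievers compare a query embedding with passage embeddings instead of counting shared tokens. The following sketch ranks passages by dot product. The three-dimensional vectors are made up for illustration (real embeddings come from an encoder model and have hundreds of dimensions), and `retrieve_dense` is a hypothetical helper, not a Haystack function.

```python
def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_dense(query_emb, passage_embs, top_k=2):
    """Return the indices of the top_k passages with the highest dot product."""
    ranked = sorted(
        range(len(passage_embs)),
        key=lambda i: dot(query_emb, passage_embs[i]),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up embeddings, one per passage
passage_embs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.7, 0.3, 0.1],
]
query_emb = [1.0, 0.0, 0.0]

top = retrieve_dense(query_emb, passage_embs, top_k=2)
print(top)
```

For millions of passages, a brute-force scan like this is too slow, which is where vector-optimized backends such as FAISS come in.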
|
||||
|
||||
|
||||
### 5) Readers
|
||||
|
||||
**What**
|
||||
Neural networks (mostly Transformer-based) that read through texts in detail to find an answer. Use diverse models like BERT, RoBERTa or XLNet trained via [FARM](https://github.com/deepset-ai/FARM) or [Transformers](https://github.com/huggingface/transformers) on SQuAD-like datasets. The Reader takes multiple passages of text as input and returns top-n answers with corresponding confidence scores. Both readers can load either a local model or any public model from [Hugging Face's model hub](https://huggingface.co/models).
|
||||
|
||||
**Available Options**
|
||||
- FARMReader: Reader based on [FARM](https://github.com/deepset-ai/FARM) incl. extensive configuration options and speed optimizations
|
||||
- TransformersReader: Reader based on the `pipeline` class of HuggingFace's [Transformers](https://github.com/huggingface/transformers).
|
||||
**Both** Readers can load models directly from HuggingFace's model hub.
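Extractive readers typically score every token as a possible answer start and answer end and then pick the best-scoring spans. The sketch below shows this selection step with hand-written scores. `top_spans` is a hypothetical helper; real readers additionally handle sub-word tokens, "no answer" predictions and score normalization.

```python
def top_spans(tokens, start_scores, end_scores, top_k=3, max_len=5):
    """Return the top_k (score, answer) pairs over all spans of up to max_len tokens."""
    candidates = []
    for start, start_score in enumerate(start_scores):
        for end in range(start, min(start + max_len, len(tokens))):
            score = start_score + end_scores[end]
            candidates.append((score, " ".join(tokens[start:end + 1])))
    candidates.sort(key=lambda candidate: candidate[0], reverse=True)
    return candidates[:top_k]


# Hand-written scores for illustration; a model would predict these
tokens = ["Arya", "is", "the", "daughter", "of", "Ned", "Stark"]
start_scores = [0.1, 0.0, 0.0, 0.2, 0.0, 3.0, 0.5]
end_scores = [0.0, 0.0, 0.0, 0.1, 0.0, 1.0, 2.5]

best = top_spans(tokens, start_scores, end_scores, top_k=3)
for score, answer in best:
    print(score, answer)
```

The span score used here is simply the sum of the start and end score, which is one common choice.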
|
||||
|
||||
**Example**
|
||||
|
||||
```python
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
                    use_gpu=False, no_ans_boost=-10, context_window_size=500,
                    top_k_per_candidate=3, top_k_per_sample=1,
                    num_processes=8, max_seq_len=256, doc_stride=128)

# Optional: Training & eval
reader.train(...)
reader.eval(...)

# Predict
# `documents` are the candidate documents, e.g. as returned by a Retriever
reader.predict(question="Who is the father of Arya Stark?", documents=documents, top_k=3)
```
|
||||
-> See [docs](https://haystack.deepset.ai/docs/latest/readermd) for details
|
||||
|
||||
### 6) REST API
|
||||
**What**
|
||||
A simple REST API based on [FastAPI](https://fastapi.tiangolo.com/) is provided to:
|
||||
|
||||
- search answers in texts ([extractive QA](https://github.com/deepset-ai/haystack/blob/master/rest_api/controller/search.py))
|
||||
- search answers by comparing user question to existing questions
|
||||
([FAQ-style QA](https://github.com/deepset-ai/haystack/blob/master/rest_api/controller/search.py))
|
||||
- collect & export user feedback on answers to gain domain-specific
|
||||
training data
|
||||
([feedback](https://github.com/deepset-ai/haystack/blob/master/rest_api/controller/feedback.py))
|
||||
- allow basic monitoring of requests (currently via APM in Kibana)
|
||||
|
||||
**Example**
|
||||
To serve the API, adjust the values in `rest_api/config.py` and run:
|
||||
|
||||
    gunicorn rest_api.application:app -b 0.0.0.0:8000 -k uvicorn.workers.UvicornWorker -t 300
|
||||
|
||||
You will find the Swagger API documentation at
|
||||
<http://127.0.0.1:8000/docs>
|
||||
|
||||
### 7) Labeling Tool
|
||||
|
||||
- Use the [hosted version](https://annotate.deepset.ai/login) (Beta) or deploy it yourself with the [Docker Images](https://github.com/deepset-ai/haystack/blob/master/annotation_tool).
|
||||
- Create labels with different techniques: Come up with questions (+ answers) while reading passages (SQuAD style) or have a set of predefined questions and look for answers in the document (~ Natural Questions).
|
||||
- Structure your work via organizations, projects, users
|
||||
- Upload your documents or import labels from an existing SQuAD-style dataset
|
||||
|
||||

|
||||
|
||||
|
||||
## :heart: Contributing
|
||||
|
||||
We are very open to contributions from the community - be it the fix of a small typo or a completely new feature! You don't need to be a Haystack expert to provide meaningful improvements. To avoid any extra work on either side, please check our [Contributor Guidelines](https://github.com/deepset-ai/haystack/blob/master/CONTRIBUTING.md) first.
|
||||
|
||||
Tests will automatically run for every commit you push to your PR. You can also run them locally by executing [pytest](https://docs.pytest.org/en/stable/) in your terminal from the root folder of this repository:
|
||||
|
||||
All tests:
|
||||
``` bash
|
||||
cd test
|
||||
pytest
|
||||
```
|
||||
|
||||
You can also only run a subset of tests by specifying a marker and the optional "not" keyword:
|
||||
``` bash
cd test
pytest -m "not elasticsearch"
pytest -m elasticsearch
pytest -m generator
pytest -m tika
pytest -m "not slow"
...
```
|
||||
331
README.rst
@ -1,331 +0,0 @@
|
||||
.. image:: https://github.com/deepset-ai/haystack/blob/master/docs/_src/img/haystack_logo_blue_banner.png?raw=true
|
||||
:align: center
|
||||
:alt: Haystack Logo
|
||||
|
||||
.. image:: https://github.com/deepset-ai/haystack/workflows/Build/badge.svg?branch=master
|
||||
:target: https://github.com/deepset-ai/haystack/actions
|
||||
:alt: Build
|
||||
|
||||
.. image:: https://camo.githubusercontent.com/34b3a249cd6502d0a521ab2f42c8830b7cfd03fa/687474703a2f2f7777772e6d7970792d6c616e672e6f72672f7374617469632f6d7970795f62616467652e737667
|
||||
:target: http://mypy-lang.org/
|
||||
:alt: Checked with mypy
|
||||
|
||||
.. image:: https://img.shields.io/github/release/deepset-ai/haystack
|
||||
:target: https://github.com/deepset-ai/haystack/releases
|
||||
:alt: Release
|
||||
|
||||
.. image:: https://img.shields.io/github/license/deepset-ai/haystack
|
||||
:target: https://github.com/deepset-ai/haystack/blob/master/LICENSE
|
||||
:alt: License
|
||||
|
||||
.. image:: https://img.shields.io/github/last-commit/deepset-ai/haystack
|
||||
:target: https://github.com/deepset-ai/haystack/commits/master
|
||||
:alt: Last Commit
|
||||
|
||||
|
||||
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
The performance of **modern Question Answering Models** (BERT, ALBERT ...) has seen drastic improvements within the last year enabling many new opportunities for accessing information more efficiently. However, those models are designed to find answers within rather small text passages. **Haystack lets you scale QA models** to large collections of documents!
|
||||
While QA is the focussed use case for Haystack, we will address further options around neural search in the future (re-ranking, most-similar search ...).
|
||||
|
||||
Haystack is designed in a modular way and lets you use any models trained with `FARM <https://github.com/deepset-ai/FARM>`_ or `Transformers <https://github.com/huggingface/transformers>`_.
|
||||
|
||||
|
||||
|
||||
Core Features
|
||||
=============
|
||||
- **Powerful ML models**: Utilize all latest transformer based models (BERT, ALBERT, RoBERTa ...)
|
||||
- **Modular & future-proof**: Easily switch to newer models once they get published.
|
||||
- **Developer friendly**: Easy to debug, extend and modify.
|
||||
- **Scalable**: Production-ready deployments via Elasticsearch backend & REST API
|
||||
- **Customizable**: Fine-tune models to your own domain & improve them continuously via user feedback
|
||||
|
||||
|
||||
Components
|
||||
==========
|
||||
|
||||
.. image:: https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/sketched_concepts_white.png
|
||||
|
||||
|
||||
1. **DocumentStore**: Database storing the documents for our search. We recommend Elasticsearch, but have also more light-weight options for fast prototyping (SQL or In-Memory).
|
||||
|
||||
2. **Retriever**: Fast, simple algorithm that identifies candidate passages from a large collection of documents. Algorithms include TF-IDF or BM25, custom Elasticsearch queries, and embedding-based approaches. The Retriever helps to narrow down the scope for Reader to smaller units of text where a given question could be answered.
|
||||
|
||||
3. **Reader**: Powerful neural model that reads through texts in detail to find an answer. Use diverse models like BERT, RoBERTa or XLNet trained via `FARM <https://github.com/deepset-ai/FARM>`_ or `Transformers <https://github.com/huggingface/transformers>`_ on SQuAD like tasks. The Reader takes multiple passages of text as input and returns top-n answers with corresponding confidence scores. You can just load a pretrained model from `Hugging Face's model hub <https://huggingface.co/models>`_ or fine-tune it to your own domain data.
|
||||
|
||||
4. **Finder**: Glues together a Reader and a Retriever as a pipeline to provide an easy-to-use question answering interface.
|
||||
|
||||
5. **REST API**: Exposes a simple API for running QA search, collecting feedback and monitoring requests
|
||||
|
||||
6. **Haystack Annotate**: Create custom QA labels, `Hosted version <https://annotate.deepset.ai/login>`_ (Beta), Docker images (coming soon)
|
||||
|
||||
|
||||
Resources
|
||||
=========
|
||||
|
||||
**Documentation:** https://haystack.deepset.ai
|
||||
|
||||
**Roadmap**: https://haystack.deepset.ai/en/docs/roadmapmd
|
||||
|
||||
**Tutorials**
|
||||
|
||||
- Tutorial 1 - Basic QA Pipeline: `Jupyter notebook <https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial1_Basic_QA_Pipeline.ipynb>`__ or `Colab <https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial1_Basic_QA_Pipeline.ipynb>`_
|
||||
- Tutorial 2 - Fine-tuning a model on own data: `Jupyter notebook <https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial2_Finetune_a_model_on_your_data.ipynb>`__ or `Colab <https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial2_Finetune_a_model_on_your_data.ipynb>`__
|
||||
- Tutorial 3 - Basic QA Pipeline without Elasticsearch: `Jupyter notebook <https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.ipynb>`__ or `Colab <https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.ipynb>`__
|
||||
- Tutorial 4 - FAQ-style QA: `Jupyter notebook <https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial4_FAQ_style_QA.ipynb>`__ or `Colab <https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial4_FAQ_style_QA.ipynb>`__
|
||||
- Tutorial 5 - Evaluation of the whole QA-Pipeline: `Jupyter notebook <https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial5_Evaluation.ipynb>`__ or `Colab <https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial5_Evaluation.ipynb>`__
|
||||
- Tutorial 6 - Better Retrievers via "Dense Passage Retrieval": `Jupyter notebook <https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb>`__ or `Colab <https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb>`__
|
||||
|
||||
|
||||
Quick Start
|
||||
===========
|
||||
|
||||
Installation
|
||||
------------
|
||||
|
||||
PyPi::
|
||||
|
||||
pip install farm-haystack
|
||||
|
||||
Master branch (if you wanna try the latest features)::
|
||||
|
||||
git clone https://github.com/deepset-ai/haystack.git
|
||||
cd haystack
|
||||
pip install --editable .
|
||||
|
||||
To update your installation, just do a git pull. The --editable flag will update changes immediately.
|
||||
|
||||
Note: On Windows you might need :code:`pip install farm-haystack -f https://download.pytorch.org/whl/torch_stable.html` to install PyTorch correctly
|
||||
|
||||
Usage
|
||||
-----
|
||||
.. image:: https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/code_snippet_usage.png
|
||||
|
||||
|
||||
Quick Tour
|
||||
==========
|
||||
|
||||
|
||||
1) DocumentStores
|
||||
---------------------
|
||||
|
||||
Haystack offers different options for storing your documents for search. We recommend Elasticsearch, but have also light-weight options for fast prototyping and will soon add DocumentStores that are optimized for embeddings (FAISS & Co).
|
||||
|
||||
Elasticsearch (Recommended)
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
:code:`haystack.database.elasticsearch.ElasticsearchDocumentStore`
|
||||
|
||||
* Keeps all the logic to store and query documents from Elastic, incl. mapping of fields, adding filters or boosts to your queries, and storing embeddings
|
||||
* You can either use an existing Elasticsearch index or create a new one via haystack
|
||||
* Retrievers operate on top of this DocumentStore to find the relevant documents for a query
|
||||
* Documents should be chunked into smaller units (e.g. paragraphs) before indexing to make the results returned by the Retriever more granular and accurate.
|
||||
|
||||
You can get started by running a single Elasticsearch node using docker::
|
||||
|
||||
docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2
|
||||
|
||||
Or if docker is not possible for you::
|
||||
|
||||
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
|
||||
tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
|
||||
chown -R daemon:daemon elasticsearch-7.6.2
|
||||
elasticsearch-7.6.2/bin/elasticsearch
|
||||
|
||||
See Tutorial 1 on how to go on with indexing your docs.
|
||||
|
||||
|
||||
SQL / InMemory (Alternative)
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
:code:`haystack.database.sql.SQLDocumentStore` & :code:`haystack.database.memory.InMemoryDocumentStore`
|
||||
|
||||
These DocumentStores are mainly intended to simplify the first development steps or test a prototype on an existing SQL Database containing your texts. The SQLDocumentStore initializes by default a local file-based SQLite database.
|
||||
However, you can easily configure it for PostgreSQL or MySQL since our implementation is based on SQLAlchemy.
|
||||
Limitations: Retrieval (e.g. via TfidfRetriever) happens in-memory here and will therefore only work efficiently on small datasets
|
||||
|
||||
2) Retrievers
|
||||
---------------------
|
||||
|
||||
DensePassageRetriever
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
Using dense embeddings (i.e. vector representations) of texts is a powerful alternative to score similarity of texts.
|
||||
This retriever uses two BERT models - one to embed your query, one to embed your passage. It's based on the work of
|
||||
`Karpukhin et al <https://arxiv.org/abs/2004.04906>`_ and is an especially powerful alternative if there's no direct overlap between tokens in your queries and your texts.
|
||||
|
||||
Example
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
retriever = DensePassageRetriever(document_store=document_store,
|
||||
query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
|
||||
passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
|
||||
use_gpu=True,
|
||||
batch_size=16,
|
||||
embed_title=True)
|
||||
retriever.retrieve(query="Why did the revenue increase?")
|
||||
# returns: [Document, Document]
|
||||
|
||||
ElasticsearchRetriever
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
Scoring text similarity via sparse bag-of-words representations is a strong and well-established baseline in Information Retrieval.
|
||||
The default :code:`ElasticsearchRetriever` uses Elasticsearch's native scoring (BM25), but can be extended easily with custom queries or filtering.
|
||||
|
||||
Example
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
retriever = ElasticsearchRetriever(document_store=document_store, custom_query=None)
|
||||
retriever.retrieve(query="Why did the revenue increase?", filters={"years": ["2019"], "company": ["Q1", "Q2"]})
|
||||
# returns: [Document, Document]
|
||||
|
||||
|
||||
EmbeddingRetriever
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
This retriever uses a single model to embed your query and passage (e.g. Sentence-BERT) and finds similar texts by using cosine similarity. This works well if your query and passage are a similar type of text, e.g. you want to find the most similar question in your FAQ given a user question.
|
||||
|
||||
Example
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
retriever = EmbeddingRetriever(document_store=document_store,
|
||||
embedding_model="deepset/sentence_bert",
|
||||
model_format="farm")
|
||||
retriever.retrieve(query="Why did the revenue increase?", filters={"years": ["2019"], "company": ["Q1", "Q2"]})
|
||||
# returns: [Document, Document]
|
||||
|
||||
TfidfRetriever
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
Basic in-memory retriever getting texts from the DocumentStore, creating TF-IDF representations in-memory and allowing to query them.
|
||||
Simple baseline for quick prototypes. Not recommended for production.
|
||||
|
||||
3) Readers
---------------------
Neural networks (mostly Transformer-based) that read through texts in detail to find an answer. Use diverse models like BERT, RoBERTa or XLNet trained via `FARM <https://github.com/deepset-ai/FARM>`_ or on SQuAD-like datasets. The Reader takes multiple passages of text as input and returns the top-n answers with corresponding confidence scores.
Both readers can load either a local model or any public model from `Hugging Face's model hub <https://huggingface.co/models>`_.

FARMReader
^^^^^^^^^^
Implements various QA models via the `FARM <https://github.com/deepset-ai/FARM>`_ framework.

Example

.. code-block:: python

    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
                        use_gpu=False, no_ans_boost=-10, context_window_size=500,
                        top_k_per_candidate=3, top_k_per_sample=1,
                        num_processes=8, max_seq_len=256, doc_stride=128)

    # Optional: Training & eval
    reader.train(...)
    reader.eval(...)

    # Predict
    reader.predict(question="Who is the father of Arya Stark?", documents=documents, top_k=3)

This Reader comes with:

* extensive configuration options (no answer boost, aggregation options ...)
* multiprocessing to speed up preprocessing
* option to train
* option to evaluate
* option to load all QA models directly from Hugging Face's model hub

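The :code:`max_seq_len` and :code:`doc_stride` parameters in the example hint at how long passages are handled: they are cut into overlapping windows that each fit into the model. The sketch below illustrates that idea only. It assumes :code:`doc_stride` is the step between the starts of consecutive windows, ignores the tokens a real reader reserves for the question, and is not taken from FARM's code; the library's exact semantics may differ.

```python
def sliding_windows(tokens, max_seq_len=256, doc_stride=128):
    """Split `tokens` into overlapping windows of at most `max_seq_len` tokens."""
    if max_seq_len <= 0 or doc_stride <= 0:
        raise ValueError("max_seq_len and doc_stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_seq_len])
        # Stop as soon as the current window reaches the end of the passage.
        if start + max_seq_len >= len(tokens):
            break
        start += doc_stride
    return windows


tokens = list(range(600))  # stand-in for 600 token ids
windows = sliding_windows(tokens)
print(len(windows), [len(w) for w in windows])
```

With a stride smaller than the window length, an answer that would be cut at one window's border is fully contained in a neighbouring window.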
TransformersReader
^^^^^^^^^^^^^^^^^^
Implements various QA models via the :code:`pipeline` class of the `Transformers <https://github.com/huggingface/transformers>`_ framework.

Example

.. code-block:: python

    reader = TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad",
                                tokenizer="distilbert-base-uncased",
                                context_window_size=500,
                                use_gpu=-1)

    reader.predict(question="Who is the father of Arya Stark?", documents=documents, top_k=3)

5. REST API
---------------------
A simple REST API based on `FastAPI <https://fastapi.tiangolo.com/>`_ is provided to:

* search answers in texts (`extractive QA <https://github.com/deepset-ai/haystack/blob/master/rest_api/controller/search.py>`_)
* search answers by comparing user question to existing questions (`FAQ-style QA <https://github.com/deepset-ai/haystack/blob/master/rest_api/controller/search.py>`_)
* collect & export user feedback on answers to gain domain-specific training data (`feedback <https://github.com/deepset-ai/haystack/blob/master/rest_api/controller/feedback.py>`_)
* allow basic monitoring of requests (currently via APM in Kibana)

To serve the API, adjust the values in :code:`rest_api/config.py` and run::

    gunicorn rest_api.application:app -b 0.0.0.0:8000 -k uvicorn.workers.UvicornWorker -t 300

You will find the Swagger API documentation at http://127.0.0.1:8000/docs

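Once the server is started, a quick way to verify it is up is to request the Swagger page mentioned above. The helper below is a small sketch using only the standard library; the function name :code:`api_docs_reachable` is hypothetical, and only the :code:`/docs` path is taken from this README.

```python
import urllib.request


def api_docs_reachable(base_url="http://127.0.0.1:8000", timeout=2.0):
    """Return True if `<base_url>/docs` answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/docs", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts and HTTP error codes
        # (urllib's URLError and HTTPError are subclasses of OSError).
        return False


if __name__ == "__main__":
    print("API reachable:", api_docs_reachable())
```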
6. Labeling Tool
---------------------
* Use the `hosted version <https://annotate.deepset.ai/login>`_ (Beta) or deploy it yourself with the `Docker Images <https://github.com/deepset-ai/haystack/blob/master/annotation_tool>`_.
* Create labels with different techniques: Come up with questions (+ answers) while reading passages (SQuAD style) or have a set of predefined questions and look for answers in the document (~ Natural Questions).
* Structure your work via organizations, projects, users
* Upload your documents or import labels from an existing SQuAD-style dataset
* Coming soon: more file formats for document upload, metrics for label quality ...

.. image:: https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/annotation_tool.png

7. Indexing PDF / Docx files
-----------------------------

Haystack has basic converters to extract text from PDF and Docx files. While it's almost impossible to cover all types, layouts and special cases in PDFs, the implementation covers the most common formats and provides basic cleaning functions to remove headers, footers, and tables. Multi-column text layouts are also supported.
The converters are easily extendable, so that you can customize them for your files if needed.

Example:

.. code-block:: python

    # PDF (`file` is the path to the document you want to convert)
    from haystack.file_converter.pdf import PDFToTextConverter
    converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["de","en"])
    doc = converter.convert(file_path=file, meta=None)
    # => {"text": "text first page \f text second page ...", "meta": None}

    # DOCX
    from haystack.file_converter.docx import DocxToTextConverter
    converter = DocxToTextConverter(remove_numeric_tables=True, valid_languages=["de","en"])
    doc = converter.convert(file_path=file, meta=None)
    # => {"text": "some text", "meta": None}

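As the PDF example output shows, pages are separated by a form feed character (:code:`\f`) in the extracted text. If you need page-level units, a small post-processing step like the following sketch can split the result. The helper :code:`split_into_pages` and the :code:`"page"` meta key are illustrative assumptions, not part of the converters.

```python
def split_into_pages(doc):
    """Split a converter result into one dict per non-empty page."""
    meta = doc.get("meta") or {}
    pages = []
    for number, page_text in enumerate(doc["text"].split("\f"), start=1):
        page_text = page_text.strip()
        if page_text:  # skip empty pages
            # Copy the document-level meta and add a hypothetical "page" field.
            pages.append({"text": page_text, "meta": {**meta, "page": number}})
    return pages


doc = {"text": "text first page \f text second page", "meta": None}
pages = split_into_pages(doc)
print(pages)
```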
Advanced document conversion is enabled by leveraging the mature text extraction library `Apache Tika <https://tika.apache.org/>`_, which is mostly written in Java. Although it's possible to call the Tika API from Python, the current :code:`TikaConverter` only supports RESTful calls to a Tika server running at localhost. You can either run Tika as a REST service at port 9998 (default) or start a `docker container for Tika <https://hub.docker.com/r/apache/tika/tags>`_. The latter is recommended, as it's easily scalable and does not require setting up a Java runtime environment. What's more, future updates are also taken care of by Docker.
Either way, :code:`TikaConverter` makes RESTful calls to convert any document format supported by Tika. Example code can be found in the :code:`tika_convert_files_to_dicts` function in :code:`indexing/file_converters/utils.py`.

:code:`TikaConverter` supports 341 file formats, including

* most common text file formats, e.g. HTML, XML, Microsoft Office OLE2/XML/OOXML, OpenOffice ODF, iWorks, PDF, ePub, RTF, TXT, RSS, CHM...
* text embedded in media files, e.g. WAV, MP3, Vorbis, Flac, PNG, GIF, JPG, BMP, TIF, PSD, WebP, WMF, EMF, MP4, Quicktime, 3GPP, Ogg, FLV...
* mail and database files, e.g. Unix mailboxes, Outlook PST/MSG/TNEF, SQLite3, Microsoft Access, dBase...
* and many more other formats...
* and all those file formats in archive files, e.g. TAR, ZIP, BZip2, GZip, 7Zip, RAR!

Check out the complete list of file formats supported by the most recent `Apache Tika 1.24.1 <https://tika.apache.org/1.24.1/formats.html>`_.
If you feel adventurous, Tika even supports some image OCR with Tesseract, or object recognition for image and video files (not implemented yet).

:code:`TikaConverter` also makes a document's metadata available, including typical fields like file name, file dates and a lot more (e.g. author and keywords for PDFs if they're available in the files), which may save you some time in data labeling or other downstream tasks.

.. code-block:: python

    converter = TikaConverter(tika_url="http://localhost:9998/tika")
    doc = converter.convert(file_path=path)
    # => {"text": "text first page \f text second page ...", "meta": {"Content-Type": 'application/pdf', "Last-Modified":...}}

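To illustrate what a RESTful call to a Tika server can look like, here is a standard-library sketch that sends a file to the :code:`/tika` endpoint via HTTP PUT and asks for plain text. This reflects the Tika server interface as generally documented, not necessarily what :code:`TikaConverter` does internally; :code:`build_tika_request` and :code:`extract_text` are hypothetical helper names.

```python
import urllib.request


def build_tika_request(data, tika_url="http://localhost:9998/tika"):
    """Build a PUT request that asks the Tika server for plain-text output."""
    return urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(file_path, tika_url="http://localhost:9998/tika"):
    """Send the file to a running Tika server and return the extracted text."""
    with open(file_path, "rb") as f:
        request = build_tika_request(f.read(), tika_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Building the request needs no running server; sending it does.
request = build_tika_request(b"%PDF-1.4 ...")
print(request.get_method(), request.full_url)
```

Calling :code:`extract_text("my_file.pdf")` requires a Tika server listening at the given URL.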
Contributing
=============
We are very open to contributions from the community - be it the fix of a small typo or a completely new feature! You don't need to be a Haystack expert to provide meaningful improvements.
To avoid any extra work on either side, please check our `Contributor Guidelines <https://github.com/deepset-ai/haystack/blob/master/CONTRIBUTING.md>`_ first.

Tests will automatically run for every commit you push to your PR. You can also run them locally by executing `pytest <https://docs.pytest.org/en/stable/>`_ in your terminal from the root folder of this repository:

.. code-block:: bash

    pytest test/
