
Get Started
Installation
The most straightforward way to install Haystack is through pip.
$ pip install farm-haystack
If you’d like to run a specific, unreleased version of Haystack, or make edits to the way Haystack runs, you’ll want to install it using git and pip --editable.
This clones a copy of the repo to a local directory and runs Haystack from there.
$ git clone https://github.com/deepset-ai/haystack.git
$ cd haystack
$ pip install --editable .
By default, this will give you the latest version of the master branch. Use regular git commands to switch between different branches and commits.
Note: On Windows, add the argument -f https://download.pytorch.org/whl/torch_stable.html to the pip install command so that PyTorch is installed correctly.
The Building Blocks of Haystack
Here’s a sample of some Haystack code showing the most important components. For a working code example, check out our starter tutorial.
# DocumentStore: holds all your data
document_store = ElasticsearchDocumentStore()
# Clean & load your documents into the DocumentStore
dicts = convert_files_to_dicts(doc_dir, clean_func=clean_wiki_text)
document_store.write_documents(dicts)
# Retriever: A fast and simple algorithm to identify the most promising candidate documents
retriever = ElasticsearchRetriever(document_store)
# Reader: Powerful but slower neural network trained for QA
model_name = "deepset/roberta-base-squad2"
reader = FARMReader(model_name)
# Finder: Combines Reader and Retriever
finder = Finder(reader, retriever)
# Voilà! Ask a question!
question = "Who is the father of Sansa Stark?"
prediction = finder.get_answers(question)
print_answers(prediction)
Loading Documents into the DocumentStore
In Haystack, DocumentStores expect Documents in a dictionary format. They are loaded as follows:
document_store = ElasticsearchDocumentStore()
dicts = [
{
'text': DOCUMENT_TEXT_HERE,
'meta': {'name': DOCUMENT_NAME, ...}
}, ...
]
document_store.write_documents(dicts)
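To illustrate the dictionary format shown above, here is a minimal sketch that builds such a list from plain-text files using only the Python standard library. The helper name `text_files_to_dicts`, the choice of the file name as the `name` value, and the demo files are assumptions made for this example; they are not part of Haystack. The resulting list is what you would pass to `document_store.write_documents`.

```python
from pathlib import Path
import tempfile


def text_files_to_dicts(dir_path):
    """Hypothetical helper: read every .txt file in dir_path into the
    {'text': ..., 'meta': {'name': ...}} format."""
    dicts = []
    for path in sorted(Path(dir_path).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if not text:
            continue  # skip empty files, they carry nothing to search
        dicts.append({"text": text, "meta": {"name": path.name}})
    return dicts


# Demo on a temporary directory with two made-up files.
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "arya.txt").write_text(
        "Arya Stark is the daughter of Eddard Stark.", encoding="utf-8"
    )
    Path(tmp_dir, "empty.txt").write_text("", encoding="utf-8")
    demo_dicts = text_files_to_dicts(tmp_dir)

print(demo_dicts)
```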
When we talk about Documents in Haystack, we are referring specifically to the individual blocks of text that are being held in the DocumentStore. You might want to use all the text in one file as a Document, or split it into multiple Documents. This splitting can have a big impact on speed and performance.
General Guide: If Haystack is running very slowly, try splitting your text into smaller Documents. If you want better performance (that is, better answers rather than faster ones), try concatenating text to make larger Documents.
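One simple way to experiment with Document size is to split a long text into fixed-size word chunks before writing it to the DocumentStore. The sketch below is plain Python; the helper `split_into_documents`, the `max_words` parameter, and the numbered `name` suffix are assumptions for illustration, not a Haystack API.

```python
def split_into_documents(text, name, max_words=100):
    """Hypothetical helper: split text into chunks of at most max_words words,
    each returned in the dictionary format expected by a DocumentStore."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    dicts = []
    for start in range(0, len(words), max_words):
        chunk = " ".join(words[start:start + max_words])
        # Number the chunks so each Document keeps a distinct name.
        dicts.append({"text": chunk, "meta": {"name": f"{name}_{start // max_words}"}})
    return dicts


long_text = "word " * 250  # placeholder text with 250 words
small_docs = split_into_documents(long_text, name="example", max_words=100)
print(len(small_docs))  # 3 chunks: 100, 100 and 50 words
```

Raising `max_words` produces fewer, larger Documents; lowering it produces more, smaller ones, so the same helper can be used to try both directions of the trade-off.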
Running Queries
Querying involves searching for an answer to a given question within the full document store. This process will:
- make the Retriever filter for a small set of relevant candidate documents
- get the Reader to process this set of candidate documents
- return potential answers to the given question
Usually, there are tight time constraints on querying and so it needs to be a lightweight operation. When documents are loaded, Haystack will precompute any of the results that might be useful at query time.
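The retrieve-then-read flow can be pictured with a toy sketch in plain Python. Everything here is a stand-in: `ToyStore`, `toy_retrieve`, and `toy_read` are hypothetical names, word overlap replaces the real retrieval algorithm, and a sentence picker replaces the neural Reader. It is not how Haystack is implemented; it only shows the shape of the process, including precomputing token sets when documents are loaded so that the query itself stays cheap.

```python
import re


def tokenize(text):
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


class ToyStore:
    """Stand-in for a DocumentStore; token sets are precomputed at load time."""

    def __init__(self):
        self.docs = []

    def write_documents(self, dicts):
        for d in dicts:
            self.docs.append({"text": d["text"], "tokens": tokenize(d["text"])})


def toy_retrieve(store, question, top_k=2):
    """Step 1: keep the top_k documents sharing the most words with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(store.docs, key=lambda d: len(q_tokens & d["tokens"]), reverse=True)
    return ranked[:top_k]


def toy_read(candidates, question):
    """Step 2: per candidate, pick the sentence overlapping most with the question."""
    q_tokens = tokenize(question)
    answers = []
    for doc in candidates:
        sentences = [s.strip() for s in doc["text"].split(".") if s.strip()]
        if not sentences:
            continue
        best = max(sentences, key=lambda s: len(q_tokens & tokenize(s)))
        answers.append({"answer": best, "context": doc["text"]})
    return answers  # Step 3: potential answers to the question


# Made-up toy data.
store = ToyStore()
store.write_documents([
    {"text": "Sansa Stark is the daughter of Eddard Stark. She grew up in Winterfell."},
    {"text": "Winter is coming. The wall is made of ice."},
    {"text": "Dragons are large flying creatures."},
])
question = "Who is the father of Sansa Stark?"
candidates = toy_retrieve(store, question, top_k=2)
answers = toy_read(candidates, question)
print(answers[0]["answer"])
```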
In Haystack, querying is performed with a Finder object, which connects the reader to the retriever.
# The Finder sticks together reader and retriever in a pipeline to answer our questions.
finder = Finder(reader, retriever)
# Voilà! Ask a question!
question = "Who is the father of Sansa Stark?"
prediction = finder.get_answers(question)
When the query is complete, you can expect to see results that look something like this:
[
{ 'answer': 'Eddard',
'context': 's Nymeria after a legendary warrior queen. She travels '
"with her father, Eddard, to King's Landing when he is made "
'Hand of the King. Before she leaves,'
}, ...
]
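If you want to inspect such results yourself, a small sketch like the following marks each answer inside its context. It assumes only the shape shown above, a list of dictionaries with `answer` and `context` keys; the helper name `show_answers` is hypothetical, and the object actually returned by your Haystack version may carry additional fields or nesting.

```python
def show_answers(results):
    """Hypothetical helper: return one line per result, with the answer
    wrapped in square brackets inside its context."""
    lines = []
    for rank, result in enumerate(results, start=1):
        answer, context = result["answer"], result["context"]
        # Mark only the first occurrence of the answer in the context.
        marked = context.replace(answer, f"[{answer}]", 1)
        lines.append(f"{rank}. {answer}: ...{marked}...")
    return lines


# Sample result copied from the output shown above.
results = [
    {
        "answer": "Eddard",
        "context": "s Nymeria after a legendary warrior queen. She travels "
                   "with her father, Eddard, to King's Landing when he is made "
                   "Hand of the King. Before she leaves,",
    },
]
for line in show_answers(results):
    print(line)
```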