96 Commits

Author SHA1 Message Date
Branden Chan
da97d81305
Change variable names (#1286) 2021-07-14 14:03:34 +02:00
Julian Risch
2a90471c73
Encapsulate tutorial code in method (#1266) 2021-07-09 17:08:19 +02:00
Branden Chan
efc03f72db
Make PreProcessor.process() work on lists of documents (#1163)
* Add process_batch method

* Rename methods

* Fix doc string, satisfy mypy

* Fix mypy CI

* Fix typo

* Update tutorial

* Fix argument name

* Change arg name

* Incorporate reviewer feedback
2021-06-23 18:13:51 +02:00
Branden Chan
7dbd58f6be
Add about sections (#1195) 2021-06-14 18:37:00 +02:00
vblagoje
2a5882578a
Add Longform-QA (LFQA), Seq2SeqGenerator for generative QA and Retribert Retriever (#1086)
* Integrate LFQA with Haystack

* Integrate LFQA with Haystack - unit tests

* Properly initialize conftest default value for vector_dim

* Update PR after initial feedback

* Fix conftest.py import

* Seq2SeqGenerator uses Callables instead of subclasses for custom model input

* Update docstring

* Fix Callable use

* Add LFQA tutorials

* Improve type error reporting for invalid input converter Callable

* Generate docstrings

* Format comments in tutorial script

* Generate tutorial md

* Add usage page

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: brandenchan <brandenchan@icloud.com>
2021-06-14 17:53:43 +02:00
Branden Chan
783893c3d2
Tutorial update (#1166)
* Add header / footer

* Add Milvus example

* Generate md files

* Fix mypy CI
2021-06-11 11:09:15 +02:00
Branden Chan
aa6f768efa
Prevent merge of same questions on different documents during evaluation (#1119)
* Fix duplicate question in Reader.eval()

* Add duplicate question support in document store

* Support duplicate questions in retriever eval

* Update tutorial

* Rename key_tuple

* Change error message

* Add warning when more than 6 labels

* Allow for label grouping options

* Add support for aggregating by label meta

* Satisfy mypy

* Make label field flexible, add docstrings

* Satisfy mypy

* Fix failing tests

* Adjust docstring

* Fix tutorial

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-06-02 12:09:03 +02:00
Julian Risch
a7ba146246
Removed comma from last item in json list (#1114) 2021-06-01 12:32:21 +02:00
Julian Risch
40ceaf418a
Fixing grpcio-tools to version of colab's pre-installed grpcio (#1113) 2021-05-31 19:09:10 +02:00
Julian Risch
84c34295a1
Re-ranking component for document search without QA (#1025)
* Adding ranker similar to retriever and reader

* Sort documents according to query-document similarity scores

* Reranking and model training runs for small example

* Added EvalRanker node

* Calculate recall@k in EvalRetriever and EvalRanker nodes

* Renaming EvalRetriever to EvalDocuments and EvalReader to EvalAnswers

* Added mean reciprocal rank as metric for EvalDocuments

* Fix bug that appeared when ranking documents with same score

* Remove commented code for unimplemented eval() of Ranker node

* Add documentation of k parameter in EvalDocuments

* Add Ranker docu and renaming top_k param
2021-05-31 15:31:36 +02:00
Branden Chan
9827b3652e
Pipelines tutorial (#991)
* Start Pipelines tutorial

* Make Tutorial 11 run locally

* Add colab compatibility

* Fix pip install

* Add ES install from source

* Add pygraphviz installation

* Incorporate reviewer feedback

* Ensure print_answers() works for Generator output

* Fix typo
2021-04-29 17:31:28 +02:00
Branden Chan
9626c0d65e
Update Documentation (#976)
* Add api pages

* Add latest docstring and tutorial changes

* First sweep of usage docs

* Add link to conversion script

* Add import statements

* Add summarization page

* Add web crawler documentation

* Add confidence scores usage

* Add crawler api docs

* Regenerate api docs

* Update summarizer and translator api

* Add indentation (pydoc-markdown 3.10.1)

* Comment out metadata

* Remove Finder deprecation message

* Remove Finder in FAQ

* Update tutorial link

* Incorporate reviewer feedback

* Regen api docs

* Add type annotations

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-04-22 16:45:29 +02:00
Julian Risch
d38c07e0ee
knowledge graph example (#934)
* Add knowledge graph module

* Fix type hint

* Add graph retriever module

* Change type annotations, change return format

* Add graph retriever that executes questions as sparql queries

* Linking only those entities that are in the knowledge graph

* Added logging and using relations extracted from Knowledge graph for linking

* Preventing entity linking from linking the same token to multiple entities

* Pruning triples that have no variables for select and count queries

* Support knowledge graphs with Pipelines

* Add text2sparql

* Entity linking and relation linking consider more special cases now based on evaluation on labelled data

* Separating example code from KGQA implementation

* Add eval on combined extractive and kg questions

* Remove references to hp-test

* Add fields sparql_query and long_answer_list to metadata

* Removing modular Question2SPARQL approach

* Removing additional classes used for modular kgqa approach

* preparing lcquad data

* change graph db

* Translating namespaces in knowledge graph queries

* Creating graphdb index and loading triples from .ttl file

* Fetching graph config files, triples and model from S3

* Fix incompatibility issues with BaseGraphRetriever and BaseComponent

* Removing unused utility functions

* Adding doc strings and tutorial header

* Adding sparqlwrapper dependency

* Moving tutorial header

* Sorting tutorials by number within name of notebook

* Add latest docstring and tutorial changes

* Creating test cases for knowledge graph

* Changing knowledge graph example to harry potter

* Add latest docstring and tutorial changes

* Adapting the tutorial notebook to harry potter example

* Add GraphDB fixture for tests

* Add latest docstring and tutorial changes

* Added GraphDB docker launch to CI

* Use correct GraphDB fixture

* Check if GraphDB instance is already running

* Renaming question/query and incorporating other feedback from Timo and Tanay

* Removed type annotation

* Add latest docstring and tutorial changes

Co-authored-by: oryx1729 <oryx1729@protonmail.com>
Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-04-08 14:05:33 +02:00
Branden Chan
d77152c469
WIP: Add evaluation nodes for Pipelines (#904)
* Add main eval fns

* WIP: make pipeline_eval.py run

* Fix typo

* Add support for no_answers

* Add latest docstring and tutorial changes

* Working pipeline eval

* Add timing of nodes

* Add latest docstring and tutorial changes

* Refactor and clean

* Update tutorial script

* Set default params

* Update tutorials

* Fix indent

* Add latest docstring and tutorial changes

* Address mypy issues

* Add test

* Fix mypy error

* Clear outputs

* Add doc strings

* Incorporate reviewer feedback

* Add latest docstring and tutorial changes

* Revert query counting

* Fix typo

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-04-01 17:35:18 +02:00
Timo Moeller
f954f0db38
Fix top_k param in RAG tutorials (#906)
* Fix top_k param

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-03-18 18:00:21 +01:00
Branden Chan
24d0c4d42d
Fix DPR training batch size (#898)
* Adjust batch size

* Add latest docstring and tutorial changes

* Update training results

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-03-17 18:33:59 +01:00
brandenchan
03cda26d85
Fix link in Tutorial 8 2021-02-15 10:45:27 +01:00
Malte Pietsch
e91518ee00
Update tutorials (torch versions, ES version, replace Finder with Pipeline) (#814)
* remove manual torch install on colab

* update elasticsearch version everywhere to 7.9.2

* fix FAQPipeline

* update tutorials with new pipelines

* Add latest docstring and tutorial changes

* revert faqpipeline change. fix field names in tutorial 4

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-09 14:56:54 +01:00
Branden Chan
8d47a71b00
Fix Tutorial 9 (#734)
* Add package download

* Change dev to train file
2021-01-14 10:56:58 +01:00
Julian Risch
3331608e03
Adding a guard that prevents the tutorial code from being executed in every subprocess when using multiprocessing on windows (#729) 2021-01-13 18:17:54 +01:00
Branden Chan
7376185b65
Create DPR training tutorial (#708)
* WIP: Start DPR training tutorial

* Create basics of DPR Train tutorial

* Update documentation

* Allow DPR to be initialized without document store

* WIP: Add param descriptions to DPR notebook

* Clean tutorial

* Improve loading

* Make doc store optional when loading DPR

* Satisfy mypy type check

* Add links

* Add tutorial header

* Add colab badge

* Clear outputs

* Incorporate reviewer feedback

* Add readme links

* Regenerate tutorials

* Add excitement

* Fix typo

* Fix hard negatives comment

* Wrap tutorial for windows users

* Fix mypy issue
2021-01-13 10:33:55 +01:00
Branden Chan
bb8aba18e0
Create Preprocessing Tutorial (#706)
* WIP: First version of preprocessing tutorial

* stride renamed overlap, ipynb and py files created

* rename split_stride in test

* Update preprocessor api documentation

* define order for markdown files

* define order of modules in api docs

* Add colab links

* Incorporate review feedback

Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
2021-01-06 15:54:05 +01:00
Malte Pietsch
94b7345505
Make use_gpu=True the default in tutorials (#692)
* enable gpu args in tutorials

* add info box for gpu runtime on colab
2020-12-22 07:58:12 +01:00
Branden Chan
d8154939fc
Scale dot product into probabilities (#667)
* scale dot product

* Add tip in documentation

* Add recommendation boxes

* WIP: Use similarity attribute in all doc stores

* Implement similarity for InMemoryDS

* Add FAISS support

* Clean printout

* Update documentation

* Implement document field map
2020-12-11 12:10:24 +01:00
Branden Chan
8c904d79d6
Fix links (#663) 2020-12-08 10:28:31 +01:00
Tanay Soni
8e52b48e1d
Add pipelines for GenerativeQA & FAQs (#645) 2020-12-03 10:27:06 +01:00
Branden Chan
79555148ac
Add link to FAISS Info in documentation (#643)
* Add link to FAISS info

* Clean link
2020-12-02 15:24:22 +01:00
Branden Chan
1e8af84ecc
Make more changes to documentation (#578)
* First batch of changes

* Add RAG tutorial links

* Prettify RAG tutorial

* draft of generator doc

* Add text

* Complete generator page

* Create optimization section

* Split intro

* Fix formatting tutorial 7
2020-11-19 14:58:27 +01:00
Branden Chan
e72f4f4299
Update Colab Torch Version (#576)
* Update torch version
2020-11-11 13:55:10 +01:00
kolk
72b637ae6d
DensePassageRetriever: Add Training, Refactor Inference to FARM modules (#527)
* dpr training and inference code refactored with FARM modules

* dpr test cases modified

* docstring and default arguments updated

* dpr training docstring updated

* bugfix in dense retriever inference, DPR tutorials modified

* Bump FARM to 0.5.0

* update README for DPR

* mypy errors fix

* DPR instantiation bugfix

* Fix DPR init in RAG Tutorial

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-30 19:22:06 +01:00
Lalit Pagaria
f13443054a
[RAG] Integrate "Retrieval-Augmented Generation" with Haystack (#484)
* Adding dummy generator implementation

* Adding tutorial to try the model

* Committing current non working code

* Committing current update where we need to call generate function directly and need to convert embedding to tensor way

* Addressing review comments.

* Refactoring finder, and implementing rag_generator class.

* Refined the implementation of RAGGenerator and now it is in clean shape

* Renaming RAGGenerator to RAGenerator

* Reverting change from finder.py and addressing review comments

* Remove support for RagSequenceForGeneration

* Utilizing embed_passage function from DensePassageRetriever

* Adding sample test data to verify generator output

* Updating testing script

* Fixing bug related to top_k

* Updating latest farm dependency

* Comment out farm dependency

* Reverting changes from TransformersReader

* Adding transformers dataset to compare transformers and haystack generator implementation

* Using generator_encoder instead of question_encoder to generate context_input_ids

* Adding workaround to install FARM dependency from master branch

* Removing unnecessary changes

* Fixing generator test

* Removing transformers datasets

* Fixing generator test

* Some cleanup and updating TODO comments

* Adding tutorial notebook

* Updating tutorials with comments

* Explicitly passing token model in RAG test

* Addressing review comments

* Fixing notebook

* Refactoring tests to reduce memory footprint

* Split generator tests in separate ci step and before running it reclaim memory by terminating containers

* Moving tika dependent test to separate dir

* Remove unwanted code

* Bringing reader under session scope

* Farm is now session object hence restoring changes from default value

* Updating assert for pdf converter

* Dummy commit to trigger CI flow

* Reducing memory footprint required for generator tests

* Fixing mypy issues

* Marking test with tika and elasticsearch markers. Reverting changes in CI and pytest splits

* reducing changes

* Fixing CI

* changing elastic search ci

* Fixing test error

* Disabling return of embedding

* Marking generator test as well

* Refactoring tutorials

* Increasing ES memory to 750M

* Trying another fix for ES CI

* Reverting CI changes

* Splitting tests in CI

* Generator and non-generator markers split

* Adding pytest.ini to add markers and enable strict-markers option

* Reducing elastic search container memory

* Simplifying generator test by using documents with embedding directly

* Bump up farm to 0.5.0
2020-10-30 18:06:02 +01:00
Tanay Soni
db4151bbc0
Fix scoring in Elasticsearch for dot product (#517) 2020-10-23 17:50:49 +02:00
bogdankostic
f62117c232
Add urllib version requirement to colab notebooks (#509) 2020-10-23 10:43:58 +02:00
Lalit Pagaria
63c12371b9
Change arg "model" to "model_name_or_path" in TransformersReader (#510)
* Consistent parameter naming for TransformersReader along with removing unused imports as well.

* Addressing review comments
2020-10-21 17:15:35 +02:00
Malte Pietsch
bdbd1b323b
Add create_index and similarity metric to api config (#493)
* make creation of label index optional

* add params for rest api

* reset tutorial flag
2020-10-15 18:41:36 +02:00
Guillim
fb5db59590
Remove useless line from Tutorial4_FAQ_style_QA (#416)
* Update Tutorial4_FAQ_style_QA.py

Used to be useful when `.apply()` was necessary, but not any longer

* Update Tutorial4_FAQ_style_QA.ipynb
2020-09-22 09:01:04 +02:00
Malte Pietsch
747e0c0046
Bump FARM to 0.4.9. Remove custom torch installation from colab tutorials (#404) 2020-09-21 10:26:12 +02:00
Malte Pietsch
271ff30262
fix type casting of embeddings for tutorial 4 (#402) 2020-09-18 18:10:50 +02:00
Branden Chan
7fdb85d63a
Create documentation website (#272)
* Skeleton of doc website

* Flesh out documentation pages

* Split concepts into their own rst files

* add tutorial rsts

* Consistent level 1 markdown headers in tutorials

* Change theme to readthedocs

* Turn bullet points into prose

* Populate sections

* Add more text

* Add more sphinx files

* Add more retriever documentation

* combined all documentation in one structure

* rename of src to _src as it was ignored by git

* Incorporate MP2's changes

* add benchmark bar charts

* Adapt docstrings in Readers

* Improvements to intro, creation of glossary

* Adapt docstrings in Retrievers

* Adapt docstrings in Finder

* Adapt Docstrings of Finder

* Updates to text

* Edit text

* update doc strings

* proof read tutorials

* Edit text

* Add stacked chart

* populate graph with data

* Switch Documentation to markdown (#386)

* add way to generate markdown files to sphinx

* changed from rst to markdown and extended sphinx for it

* fix spelling

* Clean titles

* delete file

* change spelling

* add sections to document store usage

* add basic rest api docs

* fix readme in setup.py

* Update Tutorials

* Change section names

* add windows note to pip install

* update intro

* new renderer for markdown files

* Fix typos

* delete dpr_utils.py

* fix windows note in get started

* Fix docstrings

* deleted rest api docs in api

* fixed typo

* Fix docstring

* revert readme to rst

* Fix readme

* Update setup.py

Co-authored-by: deepset <deepset@Crenolape.localdomain>
Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
Co-authored-by: Bogdan Kostić <bogdankostic@web.de>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-09-18 12:57:32 +02:00
Malte Pietsch
9727829cc6
Rename and restructure modules (database, indexing, schemas) (#379)
* rename database to documentstore

* move document, label, multilabel to haystack/schema.py

* rename documentstore -> document_store

* split indexing modules -> file_converter + preprocessor

* fix order of imports

* Update tutorial notebooks

* fix torch version in tutorial 4
2020-09-16 18:33:23 +02:00
Malte Pietsch
bde33ddaaa
Bump FARM version to 0.4.8 and PyTorch >=1.5.1, <= 1.6.0 (#376)
* bump farm version to 0.4.8

* move back to original transformers pipeline

* remove dpr_utils and use transformers implementation

* update tutorial notebooks
2020-09-16 17:24:40 +02:00
brandenchan
b44b1ac6ec
Set top_k_per_candidate 2020-08-26 12:03:56 +02:00
kolk
f2b6cc761b
Refactor DPR from FB to Transformers codebase (#308)
* change_HFBertEncoder to transformers DPREncoder

* Removed BertTensorizer

* model download relative path

* Refactor model load

* Tutorial5 DPR updated

* fix print_eval_results typo

* copy transformers DPR modules in dpr_utils and test

* transformer v3.0.2 import errors fixed

* remove dependency of DPRConfig on attribute use_return_tuple

* Adjust transformers 302 locally to work with dpr

* projection layer removed from DPR encoders

* fixed mypy errors

* transformers DPR compatible code added

* transformers DPR compatibility added

* bug fix in tutorial 6 notebook

* Docstring update and variable naming issues fix

* tutorial modified to reflect DPR variable naming change

* title addition to passage use-cases handled

* modified handling untitled batch

* resolved mypy errors

* typos in docstrings and comments fixed

* cleaned DPR code and added new test cases

* warnings added for non-bert model [SEP] token removal

* changed warning to logger warning

* title mask creation refactored

* bug fix on cuda issues

* tutorial 6 instantiates modified DPR

* tutorial 5 modified

* tutorial 5 ipython notebook modified: DPR instantiation

* batch_size added to DPR instantiation

* tutorial 5 jupyter notebook typos fixed

* improved docstrings, fixed typos

* Update docstring

Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-08-25 20:16:00 +05:30
Branden Chan
a54d6a5bd7
Make Tutorials Work on Colab GPUs (#322)
* Add pip install torch+cu
2020-08-19 14:52:50 +02:00
bogdankostic
72b1013560
Restructure update embeddings (#304)
* Restructure update embeddings

* Adapt FAISSDocStore

* Adapt test and tutorial

Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>
2020-08-18 14:04:31 +02:00
brandenchan
8a3eca05c3
Change to retriever eval top_k to match notebook 2020-08-18 11:39:49 +02:00
Tanay Soni
200bb4bafd
Refactor the DPR tutorial to use FAISS (#317) 2020-08-17 13:30:02 +02:00
Timo Moeller
72e6867278
Aggregate label objects for same questions (#292)
* Add aggregate labels obj, use in retriever eval function

* Change launch ES param

* Move aggregation from ES document store to base class

* Fix type annotations
2020-08-07 11:24:41 +02:00
Malte Pietsch
29a15c0d59
Add eval for Dense Passage Retriever & Refactor handling of labels/feedback (#243) 2020-07-31 11:34:06 +02:00
Malte Pietsch
5b1be233d0
Update Tutorial 4 2020-07-17 19:31:00 +02:00