haystack/test/test_embedding_retriever.py

import pytest
from haystack import Finder


@pytest.mark.slow
@pytest.mark.elasticsearch
@pytest.mark.parametrize("document_store", ["elasticsearch", "faiss", "memory"], indirect=True)
@pytest.mark.parametrize("retriever", ["embedding"], indirect=True)
def test_embedding_retriever(retriever, document_store):
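    """Embedding retrieval smoke test: index FAQ-style documents with question
    embeddings, then check that Finder.get_answers_via_similar_questions()
    returns exactly one answer for a similar query.

    The `document_store` and `retriever` arguments are supplied indirectly by
    the parametrized fixtures (typically defined in conftest.py).
    """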
    documents = [
        {'text': 'By running tox in the command line!', 'meta': {'name': 'How to test this library?', 'question': 'How to test this library?'}},
        {'text': 'By running tox in the command line!', 'meta': {'name': 'blah blah blah', 'question': 'blah blah blah'}},
        {'text': 'By running tox in the command line!', 'meta': {'name': 'blah blah blah', 'question': 'blah blah blah'}},
        {'text': 'By running tox in the command line!', 'meta': {'name': 'blah blah blah', 'question': 'blah blah blah'}},
        {'text': 'By running tox in the command line!', 'meta': {'name': 'blah blah blah', 'question': 'blah blah blah'}},
        {'text': 'By running tox in the command line!', 'meta': {'name': 'blah blah blah', 'question': 'blah blah blah'}},
        {'text': 'By running tox in the command line!', 'meta': {'name': 'blah blah blah', 'question': 'blah blah blah'}},
        {'text': 'By running tox in the command line!', 'meta': {'name': 'blah blah blah', 'question': 'blah blah blah'}},
        {'text': 'By running tox in the command line!', 'meta': {'name': 'blah blah blah', 'question': 'blah blah blah'}},
        {'text': 'By running tox in the command line!', 'meta': {'name': 'blah blah blah', 'question': 'blah blah blah'}},
        {'text': 'By running tox in the command line!', 'meta': {'name': 'blah blah blah', 'question': 'blah blah blah'}},
    ]
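    # Embed each document's question so that retrieval can match an incoming
    # query against the stored question embeddings.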
    embedded = []
    for doc in documents:
        doc['embedding'] = retriever.embed([doc['meta']['question']])[0]
        embedded.append(doc)
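    # Write the embedded documents, then query via similar questions; with
    # top_k_retriever=1 exactly one answer is expected back.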
    document_store.write_documents(embedded)
    finder = Finder(reader=None, retriever=retriever)
    prediction = finder.get_answers_via_similar_questions(question="How to test this?", top_k_retriever=1)
    assert len(prediction.get('answers', [])) == 1