import json

from haystack.preview import Pipeline, Document
from haystack.preview.document_stores import InMemoryDocumentStore
from haystack.preview.components.retrievers import InMemoryBM25Retriever
from haystack.preview.components.readers import ExtractiveReader


def test_extractive_qa_pipeline(tmp_path):
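    """End-to-end check of an extractive QA pipeline: build it, draw it, round-trip it
    through JSON serialization, index a few documents, then query it and validate the
    extracted answers."""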
    # Create the pipeline
    qa_pipeline = Pipeline()
    qa_pipeline.add_component(instance=InMemoryBM25Retriever(document_store=InMemoryDocumentStore()), name="retriever")
    qa_pipeline.add_component(instance=ExtractiveReader(model_name_or_path="deepset/tinyroberta-squad2"), name="reader")
    qa_pipeline.connect("retriever", "reader")

    # Draw the pipeline
    qa_pipeline.draw(tmp_path / "test_extractive_qa_pipeline.png")

    # Serialize the pipeline to JSON
    with open(tmp_path / "test_extractive_qa_pipeline.json", "w") as f:
        print(json.dumps(qa_pipeline.to_dict(), indent=4))
        json.dump(qa_pipeline.to_dict(), f)

    # Load the pipeline back
    with open(tmp_path / "test_extractive_qa_pipeline.json", "r") as f:
        qa_pipeline = Pipeline.from_dict(json.load(f))
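    # The reloaded pipeline is rebuilt from its JSON definition only, so its
    # in-memory document store starts out empty and is re-populated below.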

    # Populate the document store
    documents = [
        Document(content="My name is Jean and I live in Paris."),
        Document(content="My name is Mark and I live in Berlin."),
        Document(content="My name is Giorgio and I live in Rome."),
    ]
    qa_pipeline.get_component("retriever").document_store.write_documents(documents)

    # Query and assert
    questions = ["Who lives in Paris?", "Who lives in Berlin?", "Who lives in Rome?"]
    answers_spywords = ["Jean", "Mark", "Giorgio"]

    for question, spyword, doc in zip(questions, answers_spywords, documents):
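        # The same query is routed to both components: the retriever needs it for BM25
        # scoring, the reader for span extraction over the retrieved documents.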
        result = qa_pipeline.run({"retriever": {"query": question}, "reader": {"query": question}})

        extracted_answers = result["reader"]["answers"]

        # we expect at least one real answer and no_answer
        assert len(extracted_answers) > 1
        # the best answer should contain the spyword
        assert spyword in extracted_answers[0].data
        # no_answer
        assert extracted_answers[-1].data is None
        # since these questions are easily answerable, the best answer should have higher probability than no_answer
        assert extracted_answers[0].probability >= extracted_answers[-1].probability

        for answer in extracted_answers:
            assert answer.query == question
            assert hasattr(answer, "probability")
            assert hasattr(answer, "start")
            assert hasattr(answer, "end")
            assert hasattr(answer, "document")
            # the answer is extracted from the correct document
            if answer.document is not None:
                assert answer.document == doc