haystack/test/test_file_converter.py

from pathlib import Path

import pytest

from haystack.file_converter.docx import DocxToTextConverter
from haystack.file_converter.pdf import PDFToTextConverter
from haystack.file_converter.tika import TikaConverter


@pytest.mark.tika
@pytest.mark.parametrize("Converter", [PDFToTextConverter, TikaConverter])
def test_convert(Converter, xpdf_fixture):
    converter = Converter()
    document = converter.convert(file_path=Path("samples/pdf/sample_pdf_1.pdf"))
    pages = document["text"].split("\f")
    assert len(pages) == 4  # The sample PDF has four pages.
    assert pages[0] != ""  # Page 1 contains text.
    assert pages[2] == ""  # Page 3 is empty.
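

# Illustrative sketch (not part of the original test suite): the assertions in
# test_convert assume that the converters separate pages with a form feed
# ("\f"), so an empty page shows up as an empty string between two separators.
# `_split_pages` and `_example_pages` are hypothetical names introduced here
# only to make that convention explicit.
def _split_pages(text: str) -> list:
    """Split converter output into pages on the form-feed character."""
    return text.split("\f")


# Four pages, the third one empty: two consecutive "\f" yield an empty string.
_example_pages = _split_pages("page one\fpage two\f\fpage four")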


@pytest.mark.tika
@pytest.mark.parametrize("Converter", [PDFToTextConverter, TikaConverter])
def test_table_removal(Converter, xpdf_fixture):
    converter = Converter(remove_numeric_tables=True)
    document = converter.convert(file_path=Path("samples/pdf/sample_pdf_1.pdf"))
    pages = document["text"].split("\f")
    # Numeric rows should have been removed from the table.
    assert "324" not in pages[0]
    assert "54x growth" not in pages[0]

    # Regular text should be retained.
    # Whitespace can differ (\n, " ", etc.), so collapse every run of whitespace into a single space.
    page_standard_whitespace = " ".join(pages[0].split())
    assert "Adobe Systems made the PDF specification available free of charge in 1993." in page_standard_whitespace
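

# Illustrative sketch (not part of the original test suite): why the test joins
# on single spaces before the substring check. `_normalize_whitespace` and the
# two sample strings are hypothetical; the expression itself is the one used in
# test_table_removal.
def _normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())


# The same sentence, wrapped differently, compares equal after normalization.
_wrapped_a = "Adobe Systems made the PDF\nspecification available"
_wrapped_b = "Adobe Systems  made the PDF specification\n available"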


@pytest.mark.tika
@pytest.mark.parametrize("Converter", [PDFToTextConverter, TikaConverter])
def test_language_validation(Converter, xpdf_fixture, caplog):
    # English is an allowed language, so no language warning should be logged.
    converter = Converter(valid_languages=["en"])
    converter.convert(file_path=Path("samples/pdf/sample_pdf_1.pdf"))
    assert "The language for samples/pdf/sample_pdf_1.pdf is not one of ['en']." not in caplog.text

    # Only German is allowed, so converting the same sample should log a warning.
    converter = Converter(valid_languages=["de"])
    converter.convert(file_path=Path("samples/pdf/sample_pdf_1.pdf"))
    assert "The language for samples/pdf/sample_pdf_1.pdf is not one of ['de']." in caplog.text
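

# Illustrative sketch (not part of the original test suite, and not haystack's
# implementation): test_language_validation relies on a warning being logged
# when the document language is not in `valid_languages`, and on pytest's
# `caplog` fixture collecting that output. The stand-in below mimics the flow
# with the standard library only. `_ListHandler`, `_warn_if_language_invalid`,
# `_demo_language_warning`, the logger name, the file name and the `detected`
# argument are all hypothetical.
import logging
from typing import List


class _ListHandler(logging.Handler):
    """Collect log messages in a list, similar in spirit to caplog."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def _warn_if_language_invalid(logger, file_path, valid_languages, detected):
    """Log a warning when the detected language is not in the allowed list."""
    if detected not in valid_languages:
        logger.warning(f"The language for {file_path} is not one of {valid_languages}.")


def _demo_language_warning() -> List[str]:
    """Check an English document against ["en"] and then against ["de"]."""
    logger = logging.getLogger("language_validation_sketch")
    logger.setLevel(logging.WARNING)
    logger.propagate = False  # keep the demo output away from other handlers
    handler = _ListHandler()
    logger.addHandler(handler)
    try:
        _warn_if_language_invalid(logger, "sample.pdf", ["en"], detected="en")
        _warn_if_language_invalid(logger, "sample.pdf", ["de"], detected="en")
    finally:
        logger.removeHandler(handler)
    return handler.messages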


def test_docx_converter():
    converter = DocxToTextConverter()
    document = converter.convert(file_path=Path("samples/docx/sample_docx.docx"))
    assert document["text"].startswith("Sample Docx File")