* fix encoding of pdftotext. fix version in download instructions
* fix test
* Add latest docstring and tutorial changes
* make latin-1 default encoding again
* Add latest docstring and tutorial changes
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Make batchwise adding of evaluation data possible
* Fix typos in docstrings
* Merge add_eval_data and add_eval_data_batchwise
* Improve import statements
* Move add_eval_data to BaseDocumentStore
* Add batch_size param to write_documents and write_labels in EsDocStore
* Adjust docstring
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
* Integration of SummarizationQAPipeline with Haystack.
* Moving summarizer tests because of OOM issue
* Fixing typo
* Splitting summarizer test in separate ci step
* Removing sysctl configuration, as we are already running Elasticsearch in a Docker container
* fixing mypy issue
* update parameter names and docstrings
* update parameter names in BaseSummarizer
* rename pipeline
* change return type of summarizer from answer to document
* change scope of doc store fixture
* revert scope
* temp. disable test_faiss_index_save_and_load()
* fix mypy. change order for mypy in CI
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
* Using column names instead of ORM objects for the get-all-documents call
* Separating meta search from documents. This way it optimizes memory by not duplicating document.text
* Fixing mypy issue
* SQLite has a limit on the number of host variables, hence using batching to fetch meta information
* Query meta only if meta field is not Null in DocOrm
* Add batch_size to other functions except label
* meta can be None, so fix that case
* Dummy commit to trigger CI
* Using chunked dictionary
* Upgrading faiss
* reverting change related to faiss upgrade
* Changing DB name in the test_faiss_retrieving test as it might interfere with existing files by corrupting the DB file
* Updating doc string related to batch_size
* Update docstring for batch_size
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
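The batching mentioned above (working around SQLite's host-variable limit) can be sketched as follows. This is an illustrative helper, not the actual SQLDocumentStore code; the name `chunked` and the batch size of 999 (SQLite's historical default `SQLITE_MAX_VARIABLE_NUMBER`) are assumptions for the example.

```python
def chunked(values, batch_size=999):
    """Yield successive batches so each SQL `IN (...)` clause stays under
    SQLite's host-variable limit (999 by default in older SQLite builds)."""
    for i in range(0, len(values), batch_size):
        yield values[i:i + batch_size]

# Usage sketch: fetch meta rows batch by batch instead of one huge IN clause.
ids = [f"doc-{n}" for n in range(2500)]
batches = list(chunked(ids))
```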
* WIP: First version of preprocessing tutorial
* stride renamed overlap, ipynb and py files created
* rename split_stride in test
* Update preprocessor api documentation
* define order for markdown files
* define order of modules in api docs
* Add colab links
* Incorporate review feedback
Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
* Update preprocessor.py
Concatenation of sentences done correctly. Stride functionality enabled for splitting by words while respecting sentence boundaries.
* Simplify code, add test
Co-authored-by: Krak91 <45461739+Krak91@users.noreply.github.com>
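The word-count splitting with overlap while respecting sentence boundaries can be sketched as below. This is a minimal illustration of the technique, not the actual preprocessor implementation; the function name and parameters (`split_length`, `split_overlap`, pre-split sentences as input) are assumptions.

```python
def split_by_word_count(sentences, split_length=100, split_overlap=10):
    """Group whole sentences into chunks of at most ~`split_length` words;
    consecutive chunks share trailing sentences covering >= `split_overlap` words."""
    chunks, current, word_count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and word_count + words > split_length:
            chunks.append(" ".join(current))
            # carry over trailing sentences until the overlap budget is met
            overlap, kept = 0, []
            for s in reversed(current):
                kept.insert(0, s)
                overlap += len(s.split())
                if overlap >= split_overlap:
                    break
            current, word_count = kept, overlap
        current.append(sentence)
        word_count += words
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = ["one two three four.", "five six seven.",
             "eight nine ten eleven twelve."]
chunks = split_by_word_count(sentences, split_length=7, split_overlap=3)
```

Note that sentences are never cut mid-way: a chunk is flushed only at a sentence boundary, and the overlap is made of whole sentences as well.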
* initial test cml
* Update cml.yaml
* WIP test workflow
* switch to general ubuntu ami
* switch to general ubuntu ami
* disable gpu for tests
* rm gpu infos
* rm gpu infos
* update token env
* switch github token
* add postgres
* test db connection
* fix typo
* remove tty
* add sleep for db
* debug runner
* debug removal postgres
* debug: reset to working commit
* debug: change github token
* switch to new bot token
* debug token
* add back postgres
* adjust network runner docker
* add elastic
* fix typo
* adjust working dir
* fix benchmark execution
* enable s3 downloads
* add query benchmark. fix path
* add saving of markdown files
* cat md files. add faiss+dpr. increase n_queries
* switch to GPU instance
* switch availability zone
* switch to public aws DL ami
* increase volume size
* rm faiss. fix error logging
* save markdown files
* add reader benchmarks
* add download of squad data
* correct reader metric normalization
* fix newlines between reports
* fix max_docs for reader eval data. remove max_docs from ci run config
* fix mypy. switch workflow trigger
* try trigger for label
* try trigger for label
* change trigger syntax
* debug machine shutdown with test workflow
* add es and postgres to test workflow
* Revert "add es and postgres to test workflow"
This reverts commit 6f038d3d7f12eea924b54529e61b192858eaa9d5.
* Revert "debug machine shutdown with test workflow"
This reverts commit db70eabae8850b88e1d61fd79b04d4f49d54990a.
* fix typo in action. set benchmark config back to original
* Add languages and preprocessing pages
* add content
* address review comments
* make link relative
* update api ref with latest docstrings
* move doc readme and update
* add generator API docs
* fix example code
* design and link fix
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
* Add save and load method for DPR
* lower memory footprint for test. change names to load() and save()
* add test cases
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
* Adding dummy generator implementation
* Adding tutorial to try the model
* Committing current non-working code
* Committing current update where we need to call the generate function directly and convert embeddings to tensors
* Addressing review comments.
* Refactoring finder, and implementing rag_generator class.
* Refined the implementation of RAGGenerator and now it is in clean shape
* Renaming RAGGenerator to RAGenerator
* Reverting change from finder.py and addressing review comments
* Remove support for RagSequenceForGeneration
* Utilizing embed_passage function from DensePassageRetriever
* Adding sample test data to verify generator output
* Updating testing script
* Updating testing script
* Fixing bug related to top_k
* Updating latest farm dependency
* Comment out farm dependency
* Reverting changes from TransformersReader
* Adding transformers dataset to compare transformers and haystack generator implementation
* Using generator_encoder instead of question_encoder to generate context_input_ids
* Adding workaround to install FARM dependency from master branch
* Removing unnecessary changes
* Fixing generator test
* Removing transformers datasets
* Fixing generator test
* Some cleanup and updating TODO comments
* Adding tutorial notebook
* Updating tutorials with comments
* Explicitly passing token model in RAG test
* Addressing review comments
* Fixing notebook
* Refactoring tests to reduce memory footprint
* Split generator tests into a separate CI step and, before running it, reclaim memory by terminating containers
* Moving tika dependent test to separate dir
* Remove unwanted code
* Bringing reader under session scope
* FARM is now a session object, hence restoring changes from the default value
* Updating assert for pdf converter
* Dummy commit to trigger CI flow
* Reducing memory footprint required for generator tests
* Fixing mypy issues
* Marking test with tika and elasticsearch markers. Reverting changes in CI and pytest splits
* reducing changes
* Fixing CI
* changing Elasticsearch CI
* Fixing test error
* Disabling return of embedding
* Marking generator test as well
* Refactoring tutorials
* Increasing ES memory to 750M
* Trying another fix for ES CI
* Reverting CI changes
* Splitting tests in CI
* Generator and non-generator markers split
* Adding pytest.ini to add markers and enable strict-markers option
* Reducing Elasticsearch container memory
* Simplifying generator test by using documents with embedding directly
* Bump up farm to 0.5.0
* Adding support to return embeddings along with other results via the query_by_embedding function
* Adding test case to check return embedding
* By default for all tests but DPR tests: disable return_embedding flag
* Reducing None test cases and fixing query_by_embedding of ElasticsearchDocumentStore, which was updating self.excluded_meta_data directly
* Fixing mypy reported issue
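The self.excluded_meta_data fix above addresses a general pitfall: mutating a shared list attribute in place leaks the change into every later call. A minimal sketch of the pattern (a hypothetical class, not the actual ElasticsearchDocumentStore code):

```python
class DocumentStore:
    """Illustrative stand-in for a store with a shared exclusion list."""

    def __init__(self, excluded_meta_data=None):
        self.excluded_meta_data = excluded_meta_data or []

    def query_by_embedding(self, return_embedding=False):
        # Copy the shared list so a per-call exclusion does not
        # accumulate on the instance across calls.
        excluded = list(self.excluded_meta_data)
        if not return_embedding and "embedding" not in excluded:
            excluded.append("embedding")
        return excluded

store = DocumentStore(excluded_meta_data=["vector_id"])
first = store.query_by_embedding()                       # embedding excluded
second = store.query_by_embedding(return_embedding=True)  # embedding kept
```

The key point is the `list(...)` copy: without it, the first call would append "embedding" to the instance attribute itself, and later calls with return_embedding=True would still exclude it.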