haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-11-07 05:14:08 +00:00

Author	SHA1	Message	Date
oryx1729	c4607cbd98	Revamp CI (#825 )	2021-02-12 13:38:54 +01:00
Tanay Soni	fd5c5dd23c	Introduce incremental updates for embeddings in document stores (#812 )	2021-02-09 21:25:01 +01:00
Malte Pietsch	ac9f92466f	Allow custom encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions (#813 ) * fix encoding of pdftotext. fix version in download instructions * fix test * Add latest docstring and tutorial changes * make latin-1 default encoding again * Add latest docstring and tutorial changes Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-02-09 13:42:43 +01:00
Tanay Soni	f95b70df38	Fix file upload API (#808 )	2021-02-05 12:17:38 +01:00
Branden Chan	f3a3b73d9b	Choose correct similarity fns during benchmark runs & re-run benchmarks (#773 ) * Adapt to new dataset_from_dicts return signature * rename fn * Align similarity fn in benchmark doc store * Better choice of similarity fn * Increase postgres wait time * Add more expected returned variables * update benchmark results * Fix typo * update all benchmark runs * multiply stats by 100 * Specify similarity fns for website Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-02-03 11:45:18 +01:00
Tanay Soni	8a5dc8f826	Load Pipeline with YAML config file (#785 )	2021-02-02 17:32:17 +01:00
Timo Moeller	f3ccd59045	Improve preprocessing and adding of eval data (#780 ) * Remove empty document when splitting text * Move error message of problematic ids to a higher level	2021-02-01 17:08:27 +01:00
Tanay Soni	b87dd244c1	Get metadata values for a key from Elasticsearch (#776 )	2021-02-01 16:13:26 +01:00
brandenchan	5665d55ab4	Remove duplicate file	2021-02-01 15:43:53 +01:00
Pavel Soriano	16b8291091	SQuAD to DPR dataset converter (#765 ) * Create squad_to_dpr.py First commit of the squad2dpr script. * adding review corrections/improvements * Merge master 5bf351e * Move script, add docstring * Add type hints Co-authored-by: brandenchan <brandenchan@icloud.com>	2021-02-01 15:40:43 +01:00
Lalit Pagaria	9f7f95221f	Milvus integration (#771 ) * Initial commit for Milvus integration * Add latest docstring and tutorial changes * Updating implementation of Milvus document store * Add latest docstring and tutorial changes * Adding tests and updating doc string * Add latest docstring and tutorial changes * Fixing issue caught by tests * Addressing review comments * Fixing mypy detected issue * Fixing issue caught in test about sorting of vector ids * fixing test * Fixing generator test failure * update docstrings * Addressing review comments about multiple network call while fetching embedding from milvus server * Add latest docstring and tutorial changes * Ignoring mypy issue while converting vector_id to int Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-01-29 13:29:12 +01:00
Tanay Soni	d9f011da9a	Add flag for use of window queries in SQLDocumentStore (#768 )	2021-01-25 12:54:34 +01:00
Tanay Soni	46307d1571	Remove quotes around placeholders in Elasticsearch custom query (#762 )	2021-01-25 12:46:43 +01:00
Tanay Soni	f0aa879a1c	Fix delete_all_documents for the SQLDocumentStore (#761 )	2021-01-22 14:39:24 +01:00
Tanay Soni	337376c81d	Add `batch_size` and generators to document stores. (#733 ) * Add batch update of embeddings in document stores * Resolve merge conflict * Remove document ordering dependency in tests * Adjust index buffer size for tests * Adjust ES Scroll Slice * Use generator for document store pagination * Add pagination for InMemoryDocumentStore * Fix missing index parameter in FAISS update_embeddings() * Fix FAISS update_embeddings() * Update FAISS tests * Update eval tests * Revert code formatting change * Fix document count in FAISS update embeddings * Fix vector_ids reset in SQLDocumentStore * Update docstrings * Update docstring	2021-01-21 16:00:08 +01:00
Timo Moeller	7522d2d1b0	Increase FARM to Version 0.6.2 (#755 ) * Increase farm version * Fix test	2021-01-21 10:15:41 +01:00
Timo Moeller	4803da009a	Using PreProcessor functions on eval data (#751 ) * Add eval data splitting * Adjust for split by passage, add test and test data, adjust docstrings, add max_docs to higher level fct	2021-01-20 14:40:10 +01:00
Tanay Soni	aa8a3666c3	Support filters for DensePassageRetriever + InMemoryDocumentStore (#754 )	2021-01-20 12:52:52 +01:00
bogdankostic	7709b6cee0	Make batchwise adding of evaluation data possible (#717 ) * Make batchwise adding of evaluation data possible * Fix typos in docstrings * Merge add_eval_data and add_eval_data_batchwise * Improve import statements * Move add_eval_data to BaseDocumentStore * Add batch_size param to write_documents and write_labels in EsDocStore * Adjust docstring Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-01-12 17:54:43 +01:00
Tanay Soni	281f9ff970	Fix SQLite errors in tests (#723 )	2021-01-11 13:24:38 +01:00
Lalit Pagaria	75d0ebd076	Add Summarizer (standalone + node in custom pipelines + SearchSummarizationPipeline) (#698 ) * Integration of SummarizationQAPipeline with Haystack. * Moving summarizer tests because of OOM issue * Fixing typo * Splitting summarizer test in separate ci step * Removing sysctl configuration as we already running elastic search in docker container * fixing mypy issue * update parameter names and docstrings * update parameter names in BaseSummarizer * rename pipeline * change return type of summarizer from answer to document * change scope of doc store fixture * revert scope * temp. disable test_faiss_index_save_and_load() * fix mypy. change order for mypy in CI Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-01-08 14:29:46 +01:00
Lalit Pagaria	3a9a756810	Using Columns names instead of ORM to get all documents (#620 ) * Using Columns name instead of ORM object for get all documents call * Separating meta search from documents. This way it will optimize the memory not duplicating document.text * Fixing mypy issue * SQLite have limit on number of host variable hence using batching to fetch meta information * Query meta only if meta field is not Null in DocOrm * Add batch_size to other functions except label * meta can be none so fix that issue * Dummy commit to trigger CI * Using chunked dictionary * Upgrading faiss * reverting change related to faiss upgrade * Changing DB name in test_faiss_retrieving test as it might interfere with exiting files by corrupting DB file * Updating doc string related to batch_size * Update docstring for batch_size Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-01-06 15:56:19 +01:00
Branden Chan	bb8aba18e0	Create Preprocessing Tutorial (#706 ) * WIP: First version of preprocessing tutorial * stride renamed overlap, ipynb and py files created * rename split_stride in test * Update preprocessor api documentation * define order for markdown files * define order of modules in api docs * Add colab links * Incorporate review feedback Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>	2021-01-06 15:54:05 +01:00
Tanay Soni	0e4eec9499	Add tests for custom embedding field (#640 )	2020-12-17 09:18:57 +01:00
Tanay Soni	4c2804e38e	Add support for aggregating scores in JoinDocuments node (#683 )	2020-12-16 15:54:58 +01:00
Tanay Soni	33fe597949	Cleanup Pytest Fixtures (#639 )	2020-12-14 18:15:44 +01:00
Malte Pietsch	149d98a0fd	Add latest benchmark run (#652 ) * add latest benchmark run * update templates and fix small json errors * Change scale Co-authored-by: brandenchan <brandenchan@icloud.com>	2020-12-10 16:25:51 +01:00
Timo Moeller	efc754b166	Redone: Fix concatenation of sentences in PreProcessor. Add stride for word-based splits with sentence boundaries (#641 ) * Update preprocessor.py Concatenation of sentences done correctly. Stride functionality enabled for splitting by words while respecting sentence boundaries. * Simplify code, add test Co-authored-by: Krak91 <45461739+Krak91@users.noreply.github.com>	2020-12-09 16:12:36 +01:00
Tanay Soni	4152ad8426	Enable dynamic parameter updates for the FARMReader (#650 )	2020-12-07 14:07:20 +01:00
Tanay Soni	8e52b48e1d	Add pipelines for GenerativeQA & FAQs (#645 )	2020-12-03 10:27:06 +01:00
Malte Pietsch	216787ed34	Fix benchmarks (#648 ) * disable fasttokenizer, increase ES timeout for delete requests * add session.close() * fix deletion of docs	2020-12-02 16:59:42 +01:00
Tanay Soni	5e62e54875	Rename question parameter to query (#614 )	2020-11-30 17:50:04 +01:00
Tanay Soni	ea976ba5b5	Add return_embedding parameter for get_all_documents() (#615 )	2020-11-26 10:32:30 +01:00
Tanay Soni	e3a68aedaf	Add support for building custom Search Pipelines (#596 )	2020-11-20 17:41:08 +01:00
Malte Pietsch	0acafc403a	Automate benchmarks via CML (#518 ) * initial test cml * Update cml.yaml * WIP test workflow * switch to general ubuntu ami * switch to general ubuntu ami * disable gpu for tests * rm gpu infos * rm gpu infos * update token env * switch github token * add postgres * test db connection * fix typo * remove tty * add sleep for db * debug runner * debug removal postgres * debug: reset to working commit * debug: change github token * switch to new bot token * debug token * add back postgres * adjust network runner docker * add elastic * fix typo * adjust working dir * fix benchmark execution * enable s3 downloads * add query benchmark. fix path * add saving of markdown files * cat md files. add faiss+dpr. increase n_queries * switch to GPU instance * switch availability zone * switch to public aws DL ami * increase volume size * rm faiss. fix error logging * save markdown files * add reader benchmarks * add download of squad data * correct reader metric normalization * fix newlines between reports * fix max_docs for reader eval data. remove max_docs from ci run config * fix mypy. switch workflow trigger * try trigger for label * try trigger for label * change trigger syntax * debug machine shutdown with test workflow * add es and postgres to test workflow * Revert "add es and postgres to test workflow" This reverts commit 6f038d3d7f12eea924b54529e61b192858eaa9d5. * Revert "debug machine shutdown with test workflow" This reverts commit db70eabae8850b88e1d61fd79b04d4f49d54990a. * fix typo in action. set benchmark config back to original	2020-11-18 18:28:17 +01:00
Lalit Pagaria	3f81c93f36	Add document update for SQL and FAISS Document Store (#584 )	2020-11-16 16:08:13 +01:00
Tanay Soni	3e095ddd7d	Add filters for delete_all_documents() (#591 )	2020-11-16 14:15:32 +01:00
Timo Moeller	f118e4b738	Add needed whitespace before sentence start (#582 )	2020-11-13 14:14:24 +01:00
Branden Chan	44230fca45	Fix CI bug due to new Elasticsearch release and new model release (#579 ) * Cast generator to list * Restrict ES version range * Loosen ES requirement * Change no_answer_test value	2020-11-13 10:35:53 +01:00
Tanay Soni	acd088808b	Allow list of filter values in REST API (#568 )	2020-11-09 20:41:53 +01:00
Branden Chan	99e924aede	Update Documentation for Haystack 0.5.0 (#557 ) * Add languages and preprocessing pages * add content * address review comments * make link relative * update api ref with latest docstrings * move doc readme and update * add generator API docs * fix example code * design and link fix Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai> Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>	2020-11-06 10:53:22 +01:00
bogdankostic	ffaa0249f7	Fix retriever evaluation metrics (#547 ) * Add mean reciprocal rank and fix mean average precision * Add mrr metric to docstring * Fix mypy error	2020-11-05 13:34:47 +01:00
bogdankostic	53be92c155	Add save and load method for DPR (#550 ) * Add save and load method for DPR * lower memory footprint for test. change names to load() and save() * add test cases Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2020-11-05 13:29:23 +01:00
kolk	72b637ae6d	DensePassageRetriever: Add Training, Refactor Inference to FARM modules (#527 ) * dpr training and inference code refactored with FARM modules * dpr test cases modified * docstring and default arguments updated * dpr training docstring updated * bugfix in dense retriever inference, DPR tutorials modified * Bump FARM to 0.5.0 * update README for DPR * dpr training and inference code refactored with FARM modules * dpr test cases modified * docstring and default arguments updated * dpr training docstring updated * bugfix in dense retriever inference, DPR tutorials modified * Bump FARM to 0.5.0 * update README for DPR * mypy errors fix * DPR instantiation bugfix * Fix DPR init in RAG Tutorial Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2020-10-30 19:22:06 +01:00
Lalit Pagaria	f13443054a	[RAG] Integrate "Retrieval-Augmented Generation" with Haystack (#484 ) * Adding dummy generator implementation * Adding tutorial to try the model * Committing current non working code * Committing current update where we need to call generate function directly and need to convert embedding to tensor way * Addressing review comments. * Refactoring finder, and implementing rag_generator class. * Refined the implementation of RAGGenerator and now it is in clean shape * Renaming RAGGenerator to RAGenerator * Reverting change from finder.py and addressing review comments * Remove support for RagSequenceForGeneration * Utilizing embed_passage function from DensePassageRetriever * Adding sample test data to verify generator output * Updating testing script * Updating testing script * Fixing bug related to top_k * Updating latest farm dependency * Comment out farm dependency * Reverting changes from TransformersReader * Adding transformers dataset to compare transformers and haystack generator implementation * Using generator_encoder instead of question_encoder to generate context_input_ids * Adding workaround to install FARM dependency from master branch * Removing unnecessary changes * Fixing generator test * Removing transformers datasets * Fixing generator test * Some cleanup and updating TODO comments * Adding tutorial notebook * Updating tutorials with comments * Explicitly passing token model in RAG test * Addressing review comments * Fixing notebook * Refactoring tests to reduce memory footprint * Split generator tests in separate ci step and before running it reclaim memory by terminating containers * Moving tika dependent test to separate dir * Remove unwanted code * Bringing reader under session scope * Farm is now session object hence restoring changes from default value * Updating assert for pdf converter * Dummy commit to trigger CI flow * Reducing memory footprint required for generator tests * Fixing mypy issues * Marking test with tika and elasticsearch markers. Reverting changes in CI and pytest splits * reducing changes * Fixing CI * changing elastic search ci * Fixing test error * Disabling return of embedding * Marking generator test as well * Refactoring tutorials * Increasing ES memory to 750M * Trying another fix for ES CI * Reverting CI changes * Splitting tests in CI * Generator and non-generator markers split * Adding pytest.ini to add markers and enable strict-markers option * Reducing elastic search container memory * Simplifying generator test by using documents with embedding directly * Bump up farm to 0.5.0	2020-10-30 18:06:02 +01:00
Branden Chan	7a9f32f264	Fix template	2020-10-29 10:30:03 +01:00
Branden Chan	7c81dfdc3a	Address reviewer comments	2020-10-27 12:41:11 +01:00
Branden Chan	d5cb227909	Merge branch 'master' into automate_benchmarks	2020-10-27 11:50:49 +01:00
Lalit Pagaria	9521e180b3	Standardize behavior of DocumentStores to return embeddings (#514 ) * Adding support to return embedding along with other result via query_by_embedding function * Adding test case to check return embedding * By default for all tests but DPR tests: disable return_embedding flag * Reducing None test case and fixing query_by_embedding of ElasticsearchDocumentStore when it updating self.excluded_meta_data directly * Fixing mypy reported issue	2020-10-27 08:33:39 +01:00
Lalit Pagaria	abda994116	Pytest fix memory leak and put pytest marker on slow tests (#520 ) * Clear faiss_index during teardown * Marking slow test with pytest markers. So In future these test can be optimized. Also command line option can be added to skip them refer https://pytest.org/en/stable/example/simple.html#control-skipping-of-tests-according-to-command-line-option * Fixing test	2020-10-26 19:19:10 +01:00

1524 Commits