haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-08-10 09:39:22 +00:00

Author	SHA1	Message	Date
Stefano Fiorucci	f43bc562d3	refactor: replace `torch.no_grad` with `torch.inference_mode` (where possible) (#3601 ) * try to replace torch.no_grad * revert erroneous change * revert other module breaking * revert training/base	2022-11-23 09:26:11 +01:00
Stefano Fiorucci	3040e59c63	feat: add support for `BM25Retriever` in `InMemoryDocumentStore` (#3561 ) * very first draft * implement query and query_batch * add more bm25 parameters * add rank_bm25 dependency * fix mypy * remove tokenizer callable parameter * remove unused import * only json serializable attributes * try to fix: pylint too-many-public-methods / R0904 * bm25 attribute always present * convert errors into warnings to make the tutorial 1 work * add docstrings; tests * try to make tests run * better docstrings; revert not running tests * some suggestions from review * rename elasticsearch retriever as bm25 in tests; try to test memory_bm25 * exclude tests with filters * change elasticsearch to bm25 retriever in test_summarizer * add tests * try to improve tests * better type hint * adapt test_table_text_retriever_embedding * handle non-textual docs * query only textual documents	2022-11-22 09:24:52 +01:00
tstadel	0d45cbce56	convert eval metrics to python float (#3612 )	2022-11-22 09:05:10 +01:00
Espoir Murhabazi	d114a994f1	refactor: update Squad data (#3513 ) * refactor the to_squad data class * fix the validation label * refactor the to_squad data class * fix the validation label * add the test for the to_label object function * fix the tests for to_label_objects * move all the test related to squad data to one file * remove unused imports * revert tiny_augmented.json Co-authored-by: ZanSara <sarazanzo94@gmail.com>	2022-11-21 11:06:14 +01:00
Massimiliano Pippi	ea75e2aab5	feat: store metadata using JSON in SQLDocumentStore (#3547 ) * add warnings * make the field cachable * review comment	2022-11-18 08:26:19 +01:00
Massimiliano Pippi	1399681c81	move milvus tests to their own module (#3596 )	2022-11-17 16:22:02 +01:00
Massimiliano Pippi	6cd0e337d0	refactor: Generate JSON schema when missing (#3533 ) * removed unused script * print info logs when generating openapi schema * create json schema only when needed * fix tests * Remove leftover Co-authored-by: ZanSara <sarazanzo94@gmail.com>	2022-11-17 11:09:27 +01:00
Julian Risch	8052632b64	test: add test to check id_hash_keys is not ignored (#3577 )	2022-11-17 09:25:02 +01:00
Stefano Fiorucci	dc26e6d43e	fix: Flatten `DocumentClassifier` output in `SQLDocumentStore`; remove `_sql_session_rollback` hack in tests (#3273 ) * first draft * fix * fix * move test to test_sql	2022-11-16 12:20:57 +01:00
Massimiliano Pippi	ba75d39029	fix: discard metadata fields if not set in Weaviate (#3578 ) * fix weaviate bug in returning embeddings and setting empty meta fields * review comment	2022-11-15 22:02:53 +01:00
tstadel	6ce2d296f4	fix: Elasticsearch / OpenSearch brownfield function does not incorporate meta (#3572 ) * fix meta bug * adjust brownfield test	2022-11-15 12:13:21 +01:00
Stefano Fiorucci	9de56b0283	fix: write metadata to SQL Document Store when duplicate_documents!="overwrite" (#3548 ) * add_all fixes the bug * improved test	2022-11-15 10:04:04 +01:00
Massimiliano Pippi	6a48ace9b9	BREAKING CHANGE: remove Milvus1DocumentStore along with support for Milvus < 2.x (#3552 ) * remove milvus1 * leftover * revert deprecation process	2022-11-15 09:54:55 +01:00
Massimiliano Pippi	057a8c0b4f	refactor: Pinecone tests (#3555 ) * add pytest option to unmock pinecone * first try * handle missing answer * fix labels metadata * more tests * adapt workflow * typo * address review comments	2022-11-14 15:19:15 +01:00
Massimiliano Pippi	4dfddf0d10	refactor: Refactor Weaviate tests (#3541 ) * refactor tests * fix job * revert * revert * revert * use latest weaviate * fix abstract methods signatures * pass class_name to all the CRUD methods * finish moving all the tests * bump weaviate version * raise, don't pass	2022-11-14 09:57:30 +01:00
Massimiliano Pippi	3319ef6d1c	refactor: refactor FAISS tests (#3537 ) * fix write docs behaviour * refactor FAISS tests * do not remove the sqlite db * try * remove extra slash * Apply suggestions from code review Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> * review comments * Update test/document_stores/test_faiss.py Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> * review comments Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>	2022-11-08 16:37:01 +01:00
Sara Zan	43b24fd1a7	fix: strip whitespaces safely from `FARMReader`'s answers (#3526 ) * remove .strip() * check for right-side offset * return the whitespace-cleaned answer * lstrip, not rstrip :D * remove int * left_offset * slightly refactor reader fixture * extend test_output	2022-11-08 09:26:47 +01:00
Mayank Jobanputra	794fe5ffa4	bug: didn't clean up model files after running pytest for test_table_text_retriever_training (#3534 ) * Added tmp path to avoid clean up of model files later	2022-11-07 15:07:04 +05:30
Massimiliano Pippi	255072d8d5	refactor: move dC tests to their own module and job (#3529 ) * move dC tests to their own module and job * restore global var * revert	2022-11-04 17:05:10 +01:00
Massimiliano Pippi	2bb81331b7	feat: add SQLDocumentStore tests (#3517 ) * port SQL tests * cleanup document_store_tests.py from sql tests * leftover * Update .github/workflows/tests.yml Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> * review comments * Update test/document_stores/test_base.py Co-authored-by: bogdankostic <bogdankostic@web.de> Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> Co-authored-by: bogdankostic <bogdankostic@web.de>	2022-11-04 09:24:19 +01:00
Stefano Fiorucci	1a60e21137	refactor: simplify Summarizer, add Document Merger (#3452 ) * remove generate_single_summary * update schemas * remove unused import * fix mypy * fix mypy * test: summarizer doesnt change content * other test correction * move test_summarizer_translation to test_extractor_translation * fix test * first try for doc merger * reintroduce and deprecate generate_single_summary * progress in document merger * document merger! * mypy, pylint fixes * use generator * added test that will fail in 1.12 * adapt to review * extended deprecation docstring * Update test/nodes/test_extractor_translation.py * Update test/nodes/test_summarizer.py * Update test/nodes/test_summarizer.py * black * documents fixture Co-authored-by: Sara Zan <sarazanzo94@gmail.com>	2022-11-03 16:04:53 +01:00
Sara Zan	f0be78c6a6	bug: remove useless import in conftest.py (#3362 ) * Remove useless milvus import in conftest * schemas * schemas	2022-11-02 19:22:24 +05:30
Stefano Fiorucci	4b0894f4c2	fix: support long texts for labels in `ElasticsearchDocumentStore` (#3346 )	2022-11-02 11:16:36 +01:00
Sara Zan	bb1d9983b0	refactor: remove YAML save/load methods for subclasses of `BaseStandardPipeline` (#3443 ) * remove methods & update docstring * remove irrelevant test	2022-11-02 10:14:33 +01:00
bogdankostic	60224412bc	feat: Add headline extraction to `ParsrConverter` (#3488 ) * Add headline extraction to ParsrConverter * Add sample PDF file * Add test * Use extract_headlines if set in convert method * Integrate PR feedback	2022-10-31 19:00:02 +01:00
Massimiliano Pippi	b694c7b5cb	Document Store test refactoring (#3449 ) * add new marker * start using test hierarchies * move ES tests into their own class * refactor test workflow * job steps * add more tests * move more tests * more tests * test labels * add more tests * Update tests.yml * Update tests.yml * fix * typo * fix es image tag * map es ports * try * fix * default port * remove opensearch from the markers sorcery * revert * skip new tests in old jobs * skip opensearch_faiss	2022-10-31 15:30:14 +01:00
Sebastian	384663981d	Fixed bug in onnx converter for XLMRoberta architecture (#3470 )	2022-10-28 15:35:53 +02:00
Sebastian	8db7dfb884	refactor: TableReader (#3456 ) * Refactoring table reader	2022-10-26 20:57:28 +02:00
Sebastian	59857cb492	feat: Speed up reader tests (#3476 ) * Use a smaller reader where possible * Change scope to module of reader to get faster load times	2022-10-26 19:04:18 +02:00
Sara Zan	05c68b6624	feat: add `document_store` to all `BaseRetriever.retrieve()` and `BaseRetriever.retrieve_batch()` implementations (#3379 ) * add document_store to retrieve()] * mypy & pylint * pass docstore to embedding encoders * schemas * mypy and pylint * fix tfidfretriever * pylint * mypy * pylint * fix tfidf * mypy * pylint * schemas * another fix for tfidf * fix question generation tests * remove docstore from embedding encoder signature * pylint * revert accidental test changes * Apply suggestions from code review * check for docstore similarity function only if the docstore is present * check for docstore similarity function only if the docstore is present	2022-10-26 15:47:06 +02:00
Julian Risch	d0691a4bd5	bug: replace decorator with counter attribute for pipeline event (#3462 )	2022-10-26 12:09:04 +02:00
bogdankostic	4fbe80c098	feat: Extraction of headlines in markdown files (#3445 ) * Extract headings from markdown files + adapt PreProcessor * Add tests * Fix mypy * Generate JSON schema * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/nodes/file_converter/markdown.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply black * Add PR feedback Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>	2022-10-26 11:57:55 +02:00
Vladimir Blagojevic	5ca96357ff	feat: Add CohereEmbeddingEncoder to EmbeddingRetriever (#3453 )	2022-10-25 17:52:29 +02:00
Stefano Fiorucci	54ec13eaf7	refactor: Change `no_answer` attribute (#3411 ) * always run validation * update schemas * no_answer as a property. break things! * forgotten schema * fix * update openapi * removed my unnecessary test * fix sql document store Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>	2022-10-25 13:07:00 +02:00
Mayank Jobanputra	d48577b4e7	bug: removed duplicated meta "name" field addition to content before embedding in `update_embeddings` workflow (#3368 ) * Removed explicit passage formatting by name field * passing correct input type for embedding the docs * Updated test, updated similarity scores and added results * changed expected input to embed method	2022-10-25 14:52:05 +05:30
Sara Zan	cbf44413d8	feat: add `__contains__` to `Span` (#3446 ) * add __contains__ * add tests	2022-10-21 13:58:17 +02:00
Vladimir Blagojevic	8f31228211	feat: Add exponential backoff decorator; apply it to OpenAI requests (#3398 )	2022-10-19 17:47:38 +02:00
Ursin Brunner	5fedfb03b0	fix: Fix the error of wrong page numbers when documents contain empty pages. (#3330 ) * Fix the error of wrong page numbers when documents contain empty pages. * Reformat using git hooks. * Use a more descriptive placeholder	2022-10-18 17:51:02 +02:00
Sebastian	93817f63b4	feat: Speed up integration tests (nodes) (#3408 ) * Changed summarizer model to a smaller one (2GB to 500MB) to save on space and speed up the tests. * Removed google pegasus from cache	2022-10-18 16:23:57 +02:00
Sebastian	15a59fd040	feat: Updated EntityExtractor to handle long texts and added better postprocessing (#3154 ) * Remove dependence on HuggingFace TokenClassificationPipeline and group all postprocessing functions under one class * Added copyright notice for HF and deepset to entity file to acknowledge that a lot of the postprocessing parts came from the transformers library. * Fixed text squishing problem. Added additional unit test for it. Co-authored-by: ju-gu <julian.gutsch@deepset.ai>	2022-10-17 21:26:44 +02:00
Unai Garay Maestre	3a2c8ae3c5	bug: Adds better way of checking `query` in BaseRetriever and Pipeline.run() (#3304 ) * changes how query and queries are checked if they have been passed in BaseRetriever * Fixes checking query properly in Pipeline run * Fixes checking query properly in Pipeline run * Adds test for FilterRetriever using run method when query is empty * Adds mock filter retriever and adapts test * Removes old test, adds MockRetriever to test file and test uses document_store * Logs error when query is not of type string with a new test for run batch * Update test/nodes/test_retriever.py * schemas	2022-10-17 19:00:13 +02:00
Sara Zan	101d2bc86c	feat: `MultiModalRetriever` (#2891 ) * Adding Data2VecVision and Data2VecText to the supported models and adapt Tokenizers accordingly * content_types * Splitting classes into respective folders * small changes * Fix EOF * eof * black * API * EOF * whitespace * api * improve multimodal similarity processor * tokenizer -> feature extractor * Making feature vectors come out of the feature extractor in the similarity head * embed_queries is now self-sufficient * couple trivial errors * Implemented separate language model classes for multimodal inference * Document embedding seems to work * removing batch_encode_plus, is deprecated anyway * Realized the base Data2Vec models are not trained on retrieval tasks * Issue with the generated embeddings * Add batching * Try to fit CLIP in * Stub of CLIP integration * Retrieval goes through but returns noise only * Still working on the scores * Introduce temporary adapter for CLIP models * Image retrieval now works with sentence-transformers * Tidying up the code * Refactoring is now functional * Add MPNet to the supported sentence transformers models * Remove unused classes * pylint * docs * docs * Remove the method renaming * mpyp first pass * docs * tutorial * schema * mypy * Move devices setup into get_model * more mypy * mypy * pylint * Move a few params in HaystackModel's init * make feature extractor work with squadprocessor * fix feature_extractor_kwargs forwarding * Forgotten part of the fix * Revert unrelated ES change * Revert unrelated memdocstore changes * comment * Small corrections * mypy and pylint * mypy * typo * mypy * Refactor the call * mypy * Do not make FARMReader use the new FeatureExtractor * mypy * Detach DPR tests from FeatureExtractor too * Detach processor tests too * Add end2end marker * extract end2end feature extractor tests * temporary disable feature extraction tests * Introduce end2end tests for tokenizer tests * pylint * Fix model loading from folder in FeatureExtractor * working on end2end * end2end keeps failing * Restructuring retriever tests * Restructuring retriever tests * remove covert_dataset_to_dataloader * remove comment * Better check sentence-transformers models * Use embed_meta_fields properly * rename passage into document * Embedding dims can't be found * Add check for models that support it * pylint * Split all retriever tests into suites, running mostly on InMemory only * fix mypy * fix tfidf test * fix weaviate tests * Parallelize on every docstore * Fix schema and specify modality in base retriever suite * tests * Add first image tests * remove comment * Revert to simpler tests * Update docs/_src/api/api/primitives.md Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/modeling/model/multimodal/__init__.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * get_args * mypy * Update haystack/modeling/model/multimodal/__init__.py * Update haystack/modeling/model/multimodal/base.py * Update haystack/modeling/model/multimodal/base.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/modeling/model/multimodal/sentence_transformers.py * Update haystack/modeling/model/multimodal/sentence_transformers.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/modeling/model/multimodal/transformers.py * Update haystack/modeling/model/multimodal/transformers.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/modeling/model/multimodal/transformers.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/nodes/retriever/multimodal/retriever.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * mypy * mypy * removing more ContentTypes * more contentypes * pylint * add to __init__ * revert end2end workflow for now * missing integration markers * Update haystack/nodes/retriever/multimodal/embedder.py Co-authored-by: bogdankostic <bogdankostic@web.de> * review feedback, removing HaystackImageTransformerModel * review feedback part 2 * mypy & pylint * mypy * mypy * fix multimodal docs also for Pinecone * add note on internal constants * Fix pinecone write_documents * schemas * keep support for sentence-transformers only * fix pinecone test * schemas * fix pinecone again * temporarily disable some tests, need to understand if they're still relevant Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> Co-authored-by: bogdankostic <bogdankostic@web.de>	2022-10-17 18:58:35 +02:00
Vladimir Blagojevic	159cd5a666	feat: Add OpenAIEmbeddingEncoder to EmbeddingRetriever (#3356 )	2022-10-14 15:01:03 +02:00
Vladimir Blagojevic	5ebe3cb33d	fix: QuestionGenerator generates wrong document questions for non-default `num_queries_per_doc` parameter (#3381 )	2022-10-14 12:08:30 +02:00
Stefano Fiorucci	7290196c32	fix: allow same `vector_id` in different indexes for SQL-based Document stores (#3383 ) * fix_multiple_indexes * improve test names	2022-10-14 09:55:56 +02:00
Massimiliano Pippi	31fa75e9fd	feat: add support for Elasticsearch 7.16.2 (#3318 ) * bump elastic to 7.16.2+ * decouple Elasticsearch and Opensearch use method override instead of func variables fix mypy default value fix broken tests update schema * relax version pin * rename the base class * rename module * fix import order * do not run the new tests in the old job * remove outdated TODO	2022-10-13 11:53:27 +02:00
Sebastian	75641dd024	fix: Added checks for DataParallel and WrappedDataParallel (#3366 ) * Added checks for DataParallel and WrappedDataParallel * Update isinstance checks according to pylint recommendation * Using isinstance over types * Added test for dpr training	2022-10-13 08:05:56 +02:00
Malte Pietsch	fb02b61e90	Update README.md (#3247 )	2022-10-11 10:43:17 +02:00
tstadel	7fe5003c97	fix: eval() with `add_isolated_node_eval=True` breaks if no node supports it (#3347 ) * fix isolated eval for pipelines without a node supporting isolated mode * reformat * add test	2022-10-10 20:48:13 +02:00
bogdankostic	84aff5e2b3	fix: Allow less restrictive values for parameters in Pipeline configurations (#3345 ) * fix: Allow arbitrary values for parameters in Pipeline configurations * Add test * Adapt expected error message in tests * Fix bug * Fix bug on checking JSON * Remove test cases that previously tested if error was thrown * Change encoding in test * Restrict possible values * Re-add tests * Re-add tests * Add value flag to list elements	2022-10-10 13:08:45 +02:00

1204 Commits