haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-07-22 00:11:14 +00:00

Author	SHA1	Message	Date
tstadel	668fd548a6	Fix `embeddings_field_supports_similarity` of `OpenSearchDocumentStore` when creating index (#3030 ) * fix embeddings_field_supports_similarity when creating index * fix test	2022-08-12 11:19:59 +02:00
James Briggs	26c938a8e6	test: add meta fields for meta_config to be used during testing (#3021 ) * added meta fields for meta_config to be used during realtime testing of PineconeDocumentStore * Add documentation on metadata filtering in docstring * docs Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>	2022-08-12 10:27:56 +02:00
bogdankostic	81a5949103	ci: Increase Weaviate's disk usage + print docker logs (#3026 )	2022-08-11 18:13:43 +02:00
Sebastian	44e2b1beed	Resolving issue 2853: no answer logic in FARMReader (#2856 ) * Update FARMReader.eval_on_file to be consistent with FARMReader.eval * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-08-11 16:45:03 +02:00
Sara Zan	fc8ecbf20c	Move `azure-core` pin into the dev dependency list (#3022 )	2022-08-11 15:16:43 +02:00
Zoltan Fedor	408d8e6ff5	Enable the `JoinDocuments` node to work with documents with `score=None` (#2984 ) * Enable the `JoinDocuments` node to work with documents with `score=None` This fixes #2983 As of now, the `JoinDocuments` node will error out if any of the documents has `score=None` - which is possible, as some retrievers are not able to provide a score, like the `TfidfRetriever` on Elasticsearch or the `BM25Retriever` on Weaviate. The reason for the error is that the `JoinDocuments` always sorts the documents by score and cannot sort when `score=None`. There was a very similar issue for `JoinAnswers` too, which was addressed by this PR: https://github.com/deepset-ai/haystack/pull/2436 This solution applies the same solution to `JoinDocuments` - so both the `JoinAnswers` and `JoinDocuments` now will have the same additional argument to disable sorting when that is required. The solution is to add an argument to `JoinDocuments` called `sort_by_score: bool`, which allows the user to turn off the sorting of documents by score, but keeps the current functionality of sorting being performed as the default. * Fixing test bug * Addressing PR review comments - Extending unit tests - Simplifying logic * Making the sorting work even with no scores By making the no score being sorted as -Inf * Forgot to commit the change in `join_docs.py` * [EMPTY] Re-trigger CI * Added an INFO log if the `JoinDocuments` is sorting while some of the docs have `score=None` * Adjusting the arguments of `any()` * [EMPTY] Re-trigger CI	2022-08-11 10:43:25 +02:00
Massimiliano Pippi	2cd65e99b8	revert Remove pipes (#3006 )	2022-08-11 10:42:22 +02:00
Zoltan Fedor	aafa017c17	Refactoring the `Raypipeline.run` method - merging it with the `Pipeline.run` (#2981 ) * Refactoring the `Raypipeline.run` method - merging it with the `Pipeline.run` This is to fix #2968 * Bug: variable `i` was already in use * Removing unused imports * Removing unused import * [EMPTY] Re-trigger CI * Addressing concerns raised pre-review - Removing the attempt to try to make it without the need for `JoinDocuments` - it is okay to fail without `JoinDocuments` for certain pipelines. * Refactoring based on reviews	2022-08-11 09:50:14 +02:00
Zoltan Fedor	f4128d3581	Adding support for additional distance/similarity metrics for Weaviate (#3001 ) * Adding support for additional distance metrics for Weaviate Fixes #3000 * Updating the docs * Fixing error texts * Fixing issues raised by the review * Addressing the last issue from the reviews - removing test `test_weaviate.py::test_similarity` * [EMPTY] Re-trigger CI * Fixing things based on review * [EMPTY] Re-trigger CI	2022-08-11 09:48:21 +02:00
Florian Hardow	0b39ce6431	fetch experiment run results from dc (#2960 ) * feat: fetch results for DeepsetCloudExperiments * chore: test DC fetch predictions for eval run * chore: switch to dict iteration with .items() * chore: update DC url to fetch predictions from * chore: update doc strings for fetching eval run results * chore: update DeepsetCloudExperiments description, change function names for fetching predictions of an eval run * chore: test for DeepsetCloudExperiments.get_run_results * chore: adjust request mock for test_get_eval_run_results * chore: push first row of dataframe into variable for test checks * chore: adjust mock data to correct data types * chore: make documentation more readable with line breaks * chore: update documentation for eval run result fetching	2022-08-10 15:02:36 +02:00
Stefano Fiorucci	5778b6f9e9	fix run_batch unbound error (#3016 )	2022-08-10 12:59:15 +02:00
James Briggs	5d4e3bd7ca	convert to set so not relying on correct order (#3015 )	2022-08-10 12:57:31 +02:00
James Briggs	524c9b959d	switch label variables in test_labels (#3011 )	2022-08-10 12:01:57 +02:00
camille	f363b152ff	bug: make `MultiLabel` ids consistent across python interpreters (#2998 ) * use hashlib.md5() instead of (interpreter dependent) hash() function to generate MultiLabel id * add tests to assess constancy of MultiLabel.id * make test_multilabel_id test ensure that MultiLabel ids are always the same	2022-08-10 09:43:21 +02:00
Julian Risch	b685409c78	chore: add topic tags to auto generation of release notes (#3008 )	2022-08-09 17:12:42 +02:00
bogdankostic	5c3bfad078	feat: Add page number to Documents coming from PDFConverters and PreProcessor (#2932 ) * Add page number to Documents coming from PDFConverters and PreProcessor * Fix mypy * Update API Docs * Update API Docs * Remove unused imports * Generate JSON schema * Generate JSON schema * Make test variable shorter * Make regex a separate function * Move counting of page breaks to a function * Generate JSON schema * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update API Documentation * Don't create instance for testing staticmethod * Update haystack/nodes/preprocessor/preprocessor.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>	2022-08-09 15:55:27 +02:00
Stefano Fiorucci	09707b576a	Make `MultiLabel` preserve order (#2956 ) * try simple approach * added test * add requested test	2022-08-09 15:53:24 +02:00
Branden Chan	dfeb171686	Add API page for util functions (#2863 ) * Clean OpenAIAnswerGenerator docstrings * Incorporate reviewer feedback * Update Documentation & Code Style * Improve id_hash_keys description * Simplify id_hash_keys description * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-08-09 14:53:45 +02:00
Vladimir Blagojevic	50f7d660e2	Add slack hook for test failures (#2996 )	2022-08-09 08:27:52 -04:00
Massimiliano Pippi	862ac31b5c	bump streamlit version (#3002 )	2022-08-09 10:52:41 +02:00
Stefano Fiorucci	4a63484916	feat: Extend `TransformersQueryClassifier`: clean version (#2965 ) * extend query classifier in one commit * variable number of outgoing edges * improve tests * fix unused import * lightweight approach * fix _calculate_outgoing_edges * remove duplicate label validation * Remove print	2022-08-09 09:43:33 +02:00
MichelBartels	c91316e862	feat: add gradient accumulation in FARMReader (#2925 ) * expose gradient accumulation to train function of FARMReader * add documentation for gradient accumulation * Update Documentation & Code Style * doc string improvements Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * doc string improvements Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * doc string improvements Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Julian Risch <julian.risch@deepset.ai> Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>	2022-08-08 18:42:21 +02:00
Sara Zan	82448efa4f	feat: warn users if they're calling `get_all_labels` on a document index and vice-versa (Elasticsearch & Opensearch only) (#2990 ) * Add fix to ES * Update haystack/document_stores/elasticsearch.py	2022-08-08 16:50:42 +02:00
Vladimir Blagojevic	d1f8b7118c	Add progress bar to batch run component ops (#2864 ) * Add progress bar to batch run component ops * Update docs * Update schema * PR review: thanks Bogdan	2022-08-08 09:32:44 -04:00
Massimiliano Pippi	0e8efdafa9	Add enhanced pydoc-markdown pre-hook (#2979 ) * add pydoc-markdown pre-hook * add more comments, remove debug prints	2022-08-08 12:41:21 +02:00
Sara Zan	1a0a4c8836	Remove pipes from code block (#2973 ) * Remove pipes * Generate md	2022-08-05 19:18:57 +02:00
James Briggs	4ba2444652	Update CONTRIBUTING.md (#2975 )	2022-08-05 19:00:18 +02:00
Tobias Wochinger	065173fe5e	chore: add PR template (#2883 ) * chore: add PR template * ci: update PR template after latest discussions in Notion * Apply suggestions from code review Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> * Apply suggestions from code review Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> * Update .github/pull_request_template.md Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> * docs: re-order and add link * docs: add new conventions to contributor guidelines Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2022-08-05 18:14:18 +02:00
Vladimir Blagojevic	4f8d11c591	Update Seq2SeqGenerator API documentation (#2970 ) * Seq2SeqGenerator - update API docs	2022-08-05 17:39:23 +02:00
Sebastian	88cab19bd9	Remove unused variable (#2974 )	2022-08-05 16:41:11 +02:00
Vladimir Blagojevic	762a12fcb1	Print eval reports improvements (#2941 )	2022-08-04 11:21:27 -04:00
Sebastian	1b86b715b3	Better check for "DebertaV2" architecture in Trainer.train (#2966 ) * Update haystack/modeling/training/base.py to better check for "DebertaV2" architecture Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>	2022-08-04 16:30:06 +02:00
Bilge Yücel	489699bd98	Fix docs code format for sentence transformers (#2957 ) Co-authored-by: bilge4 <bilge@techwolf.ai>	2022-08-04 12:31:42 +02:00
Vladimir Blagojevic	368828fd4a	Component batch_size should be defined rather than Optional (#2958 ) * Ensure batch_size for components is defined rather than Optional * PR review - update schema	2022-08-04 12:20:28 +02:00
Vladimir Blagojevic	515a85d633	Update contributing guide, clarify when '.[all]' install is needed (#2961 )	2022-08-04 12:20:07 +02:00
tstadel	b042dd9c82	Fix validation for dynamic outgoing edges (#2850 ) * fix validation for dynamic outgoing edges * Update Documentation & Code Style * use class outgoing_edges as fallback if no instance is provided * implement classmethod approach * readd comment * fix mypy * fix tests * set outgoing_edges for all components * set outgoing_edges for mocks too * set document store outgoing_edges to 1 * set last missing outgoing_edges * enforce BaseComponent subclasses to define outgoing_edges * override _calculate_outgoing_edges for FileTypeClassifier * remove superfluous test * set rest_api's custom component's outgoing_edges * Update docstring Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> * remove unnecessary else Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>	2022-08-04 10:27:50 +02:00
Massimiliano Pippi	40d07c2038	Enable Opensearch unit tests in Windows CI (#2936 ) * enable Opensearch unit tests under Win * move unit tests into a dedicated job * skip audio tests on missing dependencies * avoid failing test collection when soundfile is not available * Update .github/workflows/tests.yml Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>	2022-08-03 19:19:07 +02:00
Francesco Castelli	1b238c880b	Generalize <sep>, <pad> and </s> tokens of QuestionGenerator node (#2769 ) * fixed tokens in question generation * simplified assignment * same behavior also for pad and eos * use skip_special_tokens in batch_decode * fixed black error and update docs * fixed schemas ci error * JSON schemas * Add git diff to debug schema issues * opensearch schema was missing * Add missing instruction in the workflow error message * typo	2022-08-03 18:51:34 +02:00
Zoltan Fedor	1e20818328	Ability to run Ray Serve detached (#2945 ) * Ability to run Ray Serve detached Fixes #2944 Ability to run Ray Serve detached - to allow running multiple instances of the app (HA). See https://docs.ray.io/en/latest/serve/package-ref.html#core-apis * Generating the docs * Re-trigger the CI pipeline * Retrigger the CI Pipeline * Typo in docstrings * Fixing docstring and typing issues * Regenerating docs * [EMPTY] Re-trigger CI * [EMPTY] Re-trigger CI * Refactoring to allow any number of args for the `serve.start()` method There seems to be additional arguments of the `serve.start()` method, so we should probably cover all of them at once, instead of only the `detached` option. * [EMPTY] Re-trigger CI * Test whether the ServeControllerClient in fact has the supplied `detached` parameter	2022-08-03 18:49:03 +02:00
Bijay Gurung	717796c587	Tutorial 06: Replace DPR with EmbeddingRetriever (#2910 ) * Tutorial 06: Replace DPR with EmbeddingRetriever Closes #2887 * Add updated tutorials/6.md file Replace `DensePassageRetriever` with `EmbeddingRetriever` * Update Tutorial 06 based on PR feedback * Further updates to Tutorial-06 according to review feedback * [Tutorial 06] Put in review feedback for the py file	2022-08-03 18:43:54 +02:00
Massimiliano Pippi	3728a95de6	fix docker tag for cuda (#2952 )	2022-08-03 17:59:46 +02:00
Zoltan Fedor	7b97bbbff0	Extending the Ray Serve integration to allow attributes for Serve deployments (#2918 ) * Extending the Ray Serve integration to allow attributes for Serve deployments This closes #2917 We should be able to set Ray Serve attributes for the nodes of pipelines, like amount of GPU to use, max_concurrent_queries, etc. Now this is possible from the pipeline yaml file for each node of the pipeline. * Ran black and regenerated the json schemas * Fixing the JSON Schema generation * Trying to fix the schema CI test issue * Fixing the test and the schemas Python 3.8 was generating a different schema than Python 3.7 is creating in the CI. You MUST use Python 3.7 to generate the schemas, otherwise the CIs will fail. * Merge the two Ray pipeline test cases * Generate the JSON schemas again after `$ pip install .[all]` * Removing `haystack/json-schemas/haystack-pipeline-1.16.schema.json` This was generated by the JSON generator, but based on @ZanSara's instructions, I am removing it. * Making changes based on @ZanSara's request - the newly requested test is failing * Fixing the JSON schema generation again * Renaming `replicas` and moving it under `serve_deployment_kwargs` * add extras validation, untested * Documentation update * Black * [EMPTY] Re-trigger CI Co-authored-by: Sara Zan <sarazanzo94@gmail.com>	2022-08-03 16:38:22 +02:00
Sara Zan	669f6f0128	Add git diff to schema checks (#2959 )	2022-08-03 09:46:38 -04:00
Massimiliano Pippi	e766bb8684	add code owners (#2950 ) * add code owners * add tutorials folder	2022-08-03 10:48:30 +02:00
Sebastian	bde3261b07	Update minimum selenium version supported for crawler (#2921 ) * Update minimum requirement for selenium for using the crawler * Updating pin of grpcio to match default in google colab * Adding requests requirement	2022-08-03 10:11:18 +02:00
tstadel	2c56305ed3	Fix serialization of numpy arrays and pandas dataframes in REST API (#2838 ) * correct serialization of numpy arrays and pandas dataframes * Update Documentation & Code Style * set additional json_encoders globally * Update Documentation & Code Style * add tests for non primitive return types Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-08-02 09:49:32 +02:00
Vladimir Blagojevic	86d56b4dfe	Add HF model caching for integration tests (#2909 ) * Add HF model caching for integration tests * Remove windows mode caching - not worth it	2022-07-29 18:17:05 +02:00
Steven Haley	6b7d4a0514	Bug fix Weaviate document deletion (#2899 ) * Bug fix Weaviate document deletion If no filters param is passed in, then the original code retrieves all documents before then deleting by their IDs. There's no need for that, since we can delete by their IDs directly. * Edit comment to clarify deletion and recreation * Write unit tests for bug fix	2022-07-29 17:21:25 +02:00
Sara Zan	434b1c3682	Disable a few checks in the pre-commit hook (#2929 ) * Disable small checks giving trouble to pydoc-markdown and JSON Schema * Add instructions for JSON schema generator in the workflow logs	2022-07-29 17:02:56 +02:00
Sara Zan	3157e20dff	Change `black` pre-commit hook into `black-jupyter` (#2928 ) * change black into black-jupyter * Revert tutorial changes This reverts commit dd3c5d954d6a9eed41b849e6a3d14269019bf21b. * finalize pre-commit changes	2022-07-29 15:56:22 +02:00
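
Commit 408d8e6ff5 above says that documents with `score=None` are sorted as -Inf so that sorting by score no longer fails. A minimal sketch of that idea follows. The `Doc` class and the `sort_by_score` helper are hypothetical stand-ins for illustration, not Haystack's actual `Document` or `JoinDocuments` code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    id: str
    score: Optional[float] = None


def sort_by_score(docs: List[Doc]) -> List[Doc]:
    """Sort by score, highest first; a missing score is treated as -inf."""
    return sorted(
        docs,
        key=lambda d: d.score if d.score is not None else float("-inf"),
        reverse=True,
    )


docs = [Doc("a", 0.2), Doc("b", None), Doc("c", 0.9)]
ranked = sort_by_score(docs)
print([d.id for d in ranked])
```

Comparing `None` with a float raises `TypeError` in Python 3, which is why the key function substitutes a sentinel value instead of passing the raw score to `sorted`.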


3803 Commits
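
Commit f363b152ff above replaces the built-in `hash()` with `hashlib.md5()` so that `MultiLabel` ids stay the same across Python interpreters. The sketch below illustrates the general technique only. The `stable_id` helper and its input are assumptions made for this example, not the code from that commit.

```python
import hashlib
from typing import List


def stable_id(parts: List[str]) -> str:
    """Hypothetical helper: derive a reproducible id from a list of strings."""
    joined = "|".join(parts)
    # MD5 depends only on the input bytes, so the digest is identical
    # in every interpreter process.
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


print(stable_id(["label-a", "label-b"]))
```

By default, Python randomizes `hash()` for strings per process (controlled by `PYTHONHASHSEED`), so an id built from it can differ between runs, while a digest of the same bytes does not.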