3803 Commits

Author SHA1 Message Date
tstadel
668fd548a6
Fix embeddings_field_supports_similarity of OpenSearchDocumentStore when creating index (#3030)
* fix embeddings_field_supports_similarity when creating index

* fix test
2022-08-12 11:19:59 +02:00
James Briggs
26c938a8e6
test: add meta fields for meta_config to be used during testing (#3021)
* added meta fields for meta_config to be used during realtime testing of PineconeDocumentStore

* Add documentation on metadata filtering in docstring

* docs

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-08-12 10:27:56 +02:00
bogdankostic
81a5949103
ci: Increase Weaviate's disk usage + print docker logs (#3026) 2022-08-11 18:13:43 +02:00
Sebastian
44e2b1beed
Resolving issue 2853: no answer logic in FARMReader (#2856)
* Update FARMReader.eval_on_file to be consistent with FARMReader.eval

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-08-11 16:45:03 +02:00
Sara Zan
fc8ecbf20c
Move azure-core pin into the dev dependency list (#3022) 2022-08-11 15:16:43 +02:00
Zoltan Fedor
408d8e6ff5
Enable the JoinDocuments node to work with documents with score=None (#2984)
* Enable the `JoinDocuments` node to work with documents with `score=None`

This fixes #2983

As of now, the `JoinDocuments` node will error out if any of the documents has `score=None` - which is possible, as some retrievers are not able to provide a score, like the `TfidfRetriever` on Elasticsearch or the `BM25Retriever` on Weaviate.
The reason for the error is that `JoinDocuments` always sorts the documents by score and cannot sort when `score=None`.

There was a very similar issue for `JoinAnswers` too, which was addressed by this PR: https://github.com/deepset-ai/haystack/pull/2436
This PR applies the same solution to `JoinDocuments` - so both `JoinAnswers` and `JoinDocuments` will now have the same additional argument to disable sorting when that is required.

The solution is to add an argument to `JoinDocuments` called `sort_by_score: bool`, which allows the user to turn off the sorting of documents by score, but keeps the current functionality of sorting being performed as the default.

* Fixing test bug

* Addressing PR review comments

- Extending unit tests
- Simplifying logic

* Making the sorting work even with no scores

By treating a missing score as -Inf when sorting

* Forgot to commit the change in `join_docs.py`

* [EMPTY] Re-trigger CI

* Added an INFO log if the `JoinDocuments` is sorting while some of the docs have `score=None`

* Adjusting the arguments of `any()`

* [EMPTY] Re-trigger CI
2022-08-11 10:43:25 +02:00
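
The sorting behaviour described in #2984 can be sketched roughly as follows. This is a minimal illustration and not the code from `join_docs.py`: the `Doc` class and the `sort_docs` helper are hypothetical stand-ins, and only the `sort_by_score` flag and the rule of treating a missing score as -Inf come from the commit message.

```python
from dataclasses import dataclass
from math import inf
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a document with an optional score."""
    content: str
    score: Optional[float] = None


def sort_docs(docs: List[Doc], sort_by_score: bool = True) -> List[Doc]:
    """Return docs by descending score; a missing score counts as -inf."""
    if not sort_by_score:
        # Sorting disabled: keep the incoming order untouched.
        return list(docs)
    return sorted(
        docs,
        key=lambda d: d.score if d.score is not None else -inf,
        reverse=True,
    )


docs = [Doc("a", 0.2), Doc("b", None), Doc("c", 0.9)]
print([d.content for d in sort_docs(docs)])
print([d.content for d in sort_docs(docs, sort_by_score=False)])
```

Mapping `None` to -Inf keeps the comparison well defined, so documents without a score end up last instead of raising a `TypeError` during the sort.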
Massimiliano Pippi
2cd65e99b8
revert Remove pipes (#3006) 2022-08-11 10:42:22 +02:00
Zoltan Fedor
aafa017c17
Refactoring the RayPipeline.run method - merging it with the Pipeline.run (#2981)
* Refactoring the `RayPipeline.run` method - merging it with the `Pipeline.run`

This is to fix #2968

* Bug: variable `i` was already in use

* Removing unused imports

* Removing unused import

* [EMPTY] Re-trigger CI

* Addressing concerns raised pre-review

- Removing the attempt to make it work without the need for `JoinDocuments` - it is okay to fail without `JoinDocuments` for certain pipelines.

* Refactoring based on reviews
2022-08-11 09:50:14 +02:00
Zoltan Fedor
f4128d3581
Adding support for additional distance/similarity metrics for Weaviate (#3001)
* Adding support for additional distance metrics for Weaviate

Fixes #3000

* Updating the docs

* Fixing error texts

* Fixing issues raised by the review

* Addressing the last issue from the reviews - removing test `test_weaviate.py::test_similarity`

* [EMPTY] Re-trigger CI

* Fixing things based on review

* [EMPTY] Re-trigger CI
2022-08-11 09:48:21 +02:00
Florian Hardow
0b39ce6431
fetch experiment run results from dc (#2960)
* feat: fetch results for DeepsetCloudExperiments

* chore: test DC fetch predictions for eval run

* chore: switch to dict iteration with .items()

* chore: update DC url to fetch predictions from

* chore: update doc strings for fetching eval run results

* chore: update DeepsetCloudExperiments description, change function names for fetching predictions of an eval run

* chore: test for DeepsetCloudExperiments.get_run_results

* chore: adjust request mock for test_get_eval_run_results

* chore: push first row of dataframe into variable for test checks

* chore: adjust mock data to correct data types

* chore: make documentation more readable with line breaks

* chore: update documentation for eval run result fetching
2022-08-10 15:02:36 +02:00
Stefano Fiorucci
5778b6f9e9
fix run_batch unbound error (#3016) 2022-08-10 12:59:15 +02:00
James Briggs
5d4e3bd7ca
convert to set so not relying on correct order (#3015) 2022-08-10 12:57:31 +02:00
James Briggs
524c9b959d
switch label variables in test_labels (#3011) 2022-08-10 12:01:57 +02:00
camille
f363b152ff
bug: make MultiLabel ids consistent across python interpreters (#2998)
* use hashlib.md5() instead of the (interpreter-dependent) hash() function to generate MultiLabel id

* add tests to assess constancy of MultiLabel.id

* make test_multilabel_id test ensure that MultiLabel ids are always the same
2022-08-10 09:43:21 +02:00
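
The motivation behind #2998 can be illustrated with a short sketch. Python's built-in `hash()` is salted per process for strings (unless `PYTHONHASHSEED` is fixed), so an id derived from it can differ between interpreter runs, whereas an MD5 digest depends only on the input bytes. The `stable_id` helper and the way the label ids are joined are assumptions made for illustration, not the actual `MultiLabel` implementation.

```python
import hashlib
from typing import Iterable


def stable_id(label_ids: Iterable[str]) -> str:
    """Hypothetical helper: derive a reproducible id from label ids."""
    joined = "".join(label_ids)
    # The digest depends only on the bytes, not on the interpreter process.
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


print(stable_id(["label-1", "label-2"]))
```

Because the digest is deterministic, running this in two separate interpreter sessions prints the same 32-character hex string, which is the property the added tests check for `MultiLabel.id`.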
Julian Risch
b685409c78
chore: add topic tags to auto generation of release notes (#3008) 2022-08-09 17:12:42 +02:00
bogdankostic
5c3bfad078
feat: Add page number to Documents coming from PDFConverters and PreProcessor (#2932)
* Add page number to Documents coming from PDFConverters and PreProcessor

* Fix mypy

* Update API Docs

* Update API Docs

* Remove unused imports

* Generate JSON schema

* Generate JSON schema

* Make test variable shorter

* Make regex a separate function

* Move counting of page breaks to a function

* Generate JSON schema

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update API Documentation

* Don't create instance for testing staticmethod

* Update haystack/nodes/preprocessor/preprocessor.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-08-09 15:55:27 +02:00
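
One way to picture the "counting of page breaks" step from #2932 is the sketch below. It assumes that pages in the converted text are separated by a form feed character (`"\f"`); the `page_number_at` helper is hypothetical and is not the function added in `preprocessor.py`.

```python
PAGE_BREAK = "\f"  # assumption: a form feed marks the end of each page


def page_number_at(text: str, offset: int) -> int:
    """Return the 1-based page number of the character at `offset`."""
    # Every page break before the offset moves us one page forward.
    return text[:offset].count(PAGE_BREAK) + 1


text = "page one" + PAGE_BREAK + "page two" + PAGE_BREAK + "page three"
print(page_number_at(text, text.index("two")))
```

Counting separators up to the start of a split gives the page on which that split begins, which can then be attached to the resulting document, for example as a metadata field.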
Stefano Fiorucci
09707b576a
Make MultiLabel preserve order (#2956)
* try simple approach

* added test

* add requested test
2022-08-09 15:53:24 +02:00
Branden Chan
dfeb171686
Add API page for util functions (#2863)
* Clean OpenAIAnswerGenerator docstrings

* Incorporate reviewer feedback

* Update Documentation & Code Style

* Improve id_hash_keys description

* Simplify id_hash_keys description

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-08-09 14:53:45 +02:00
Vladimir Blagojevic
50f7d660e2
Add slack hook for test failures (#2996) 2022-08-09 08:27:52 -04:00
Massimiliano Pippi
862ac31b5c
bump streamlit version (#3002) 2022-08-09 10:52:41 +02:00
Stefano Fiorucci
4a63484916
feat: Extend TransformersQueryClassifier: clean version (#2965)
* extend query classifier in one commit

* variable number of outgoing edges

* improve tests

* fix unused import

* lightweight approach

* fix _calculate_outgoing_edges

* remove duplicate label validation

* Remove print
2022-08-09 09:43:33 +02:00
MichelBartels
c91316e862
feat: add gradient accumulation in FARMReader (#2925)
* expose gradient accumulation to train function of FARMReader

* add documentation for gradient accumulation

* Update Documentation & Code Style

* doc string improvements

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* doc string improvements

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* doc string improvements

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-08-08 18:42:21 +02:00
Sara Zan
82448efa4f
feat: warn users if they're calling get_all_labels on a document index and vice-versa (Elasticsearch & Opensearch only) (#2990)
* Add fix to ES

* Update haystack/document_stores/elasticsearch.py
2022-08-08 16:50:42 +02:00
Vladimir Blagojevic
d1f8b7118c
Add progress bar to batch run component ops (#2864)
* Add progress bar to batch run component ops

* Update docs

* Update schema

* PR review: thanks Bogdan
2022-08-08 09:32:44 -04:00
Massimiliano Pippi
0e8efdafa9
Add enhanced pydoc-markdown pre-hook (#2979)
* add pydoc-markdown pre-hook

* add more comments, remove debug prints
2022-08-08 12:41:21 +02:00
Sara Zan
1a0a4c8836
Remove pipes from code block (#2973)
* Remove pipes

* Generate md
2022-08-05 19:18:57 +02:00
James Briggs
4ba2444652
Update CONTRIBUTING.md (#2975) 2022-08-05 19:00:18 +02:00
Tobias Wochinger
065173fe5e
chore: add PR template (#2883)
* chore: add PR template

* ci: update PR template after latest discussions in Notion

* Apply suggestions from code review

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* Apply suggestions from code review

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* Update .github/pull_request_template.md

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* docs: re-order and add link

* docs: add new conventions to contributor guidelines

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2022-08-05 18:14:18 +02:00
Vladimir Blagojevic
4f8d11c591
Update Seq2SeqGenerator API documentation (#2970)
* Seq2SeqGenerator - update API docs
2022-08-05 17:39:23 +02:00
Sebastian
88cab19bd9
Remove unused variable (#2974) 2022-08-05 16:41:11 +02:00
Vladimir Blagojevic
762a12fcb1
Print eval reports improvements (#2941) 2022-08-04 11:21:27 -04:00
Sebastian
1b86b715b3
Better check for "DebertaV2" architecture in Trainer.train (#2966)
* Update haystack/modeling/training/base.py to better check for "DebertaV2" architecture

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-08-04 16:30:06 +02:00
Bilge Yücel
489699bd98
Fix docs code format for sentence transformers (#2957)
Co-authored-by: bilge4 <bilge@techwolf.ai>
2022-08-04 12:31:42 +02:00
Vladimir Blagojevic
368828fd4a
Component batch_size should be defined rather than Optional (#2958)
* Ensure batch_size for components is defined rather than Optional

* PR review - update schema
2022-08-04 12:20:28 +02:00
Vladimir Blagojevic
515a85d633
Update contributing guide, clarify when '.[all]' install is needed (#2961) 2022-08-04 12:20:07 +02:00
tstadel
b042dd9c82
Fix validation for dynamic outgoing edges (#2850)
* fix validation for dynamic outgoing edges

* Update Documentation & Code Style

* use class outgoing_edges as fallback if no instance is provided

* implement classmethod approach

* readd comment

* fix mypy

* fix tests

* set outgoing_edges for all components

* set outgoing_edges for mocks too

* set document store outgoing_edges to 1

* set last missing outgoing_edges

* enforce BaseComponent subclasses to define outgoing_edges

* override _calculate_outgoing_edges for FileTypeClassifier

* remove superfluous test

* set rest_api's custom component's outgoing_edges

* Update docstring

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* remove unnecessary else

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-08-04 10:27:50 +02:00
Massimiliano Pippi
40d07c2038
Enable Opensearch unit tests in Windows CI (#2936)
* enable Opensearch unit tests under Win

* move unit tests into a dedicated job

* skip audio tests on missing dependencies

* avoid failing test collection when soundfile is not available

* Update .github/workflows/tests.yml

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-08-03 19:19:07 +02:00
Francesco Castelli
1b238c880b
Generalize <sep>, <pad> and </s> tokens of QuestionGenerator node (#2769)
* fixed tokens in question generation

* simplified assignment

* same behavior also for pad and eos

* use skip_special_tokens in batch_decode

* fixed black error and update docs

* fixed schemas ci error

* JSON schemas

* Add git diff to debug schema issues

* opensearch schema was missing

* Add missing instruction in the workflow error message

* typo
2022-08-03 18:51:34 +02:00
Zoltan Fedor
1e20818328
Ability to run Ray Serve detached (#2945)
* Ability to run Ray Serve detached

Fixes #2944

Ability to run Ray Serve detached - to allow running multiple instances of the app (HA).

See https://docs.ray.io/en/latest/serve/package-ref.html#core-apis

* Generating the docs

* Re-trigger the CI pipeline

* Retrigger the CI Pipeline

* Typo in docstrings

* Fixing docstring and typing issues

* Regenerating docs

* [EMPTY] Re-trigger CI

* [EMPTY] Re-trigger CI

* Refactoring to allow any number of args for the `serve.start()` method

There seem to be additional arguments to the `serve.start()` method, so we should probably cover all of them at once, instead of only the `detached` option.

* [EMPTY] Re-trigger CI

* Test whether the ServeControllerClient in fact has the supplied `detached` parameter
2022-08-03 18:49:03 +02:00
Bijay Gurung
717796c587
Tutorial 06: Replace DPR with EmbeddingRetriever (#2910)
* Tutorial 06: Replace DPR with EmbeddingRetriever

Closes #2887

* Add updated tutorials/6.md file

Replace `DensePassageRetriever` with `EmbeddingRetriever`

* Update Tutorial 06 based on PR feedback

* Further updates to Tutorial-06 according to review feedback

* [Tutorial 06] Put in review feedback for the py file
2022-08-03 18:43:54 +02:00
Massimiliano Pippi
3728a95de6
fix docker tag for cuda (#2952) 2022-08-03 17:59:46 +02:00
Zoltan Fedor
7b97bbbff0
Extending the Ray Serve integration to allow attributes for Serve deployments (#2918)
* Extending the Ray Serve integration to allow attributes for Serve deployments

This closes #2917

We should be able to set Ray Serve attributes for the nodes of pipelines, like amount of GPU to use, max_concurrent_queries, etc.

Now this is possible from the pipeline yaml file for each node of the pipeline.

* Ran black and regenerated the json schemas

* Fixing the JSON Schema generation

* Trying to fix the schema CI test issue

* Fixing the test and the schemas

Python 3.8 was generating a different schema than Python 3.7 is creating in the CI. You MUST use Python 3.7 to generate the schemas, otherwise the CIs will fail.

* Merge the two Ray pipeline test cases

* Generate the JSON schemas again after `$ pip install .[all]`

* Removing `haystack/json-schemas/haystack-pipeline-1.16.schema.json`

This was generated by the JSON generator, but based on @ZanSara's instructions, I am removing it.

* Making changes based on @ZanSara's request - the newly requested test is failing

* Fixing the JSON schema generation again

* Renaming `replicas` and moving it under `serve_deployment_kwargs`

* add extras validation, untested

* Documentation update

* Black

* [EMPTY] Re-trigger CI

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
2022-08-03 16:38:22 +02:00
Sara Zan
669f6f0128
Add git diff to schema checks (#2959) 2022-08-03 09:46:38 -04:00
Massimiliano Pippi
e766bb8684
add code owners (#2950)
* add code owners

* add tutorials folder
2022-08-03 10:48:30 +02:00
Sebastian
bde3261b07
Update minimum selenium version supported for crawler (#2921)
* Update minimum requirement for selenium for using the crawler

* Updating pin of grpcio to match default in google colab

* Adding requests requirement
2022-08-03 10:11:18 +02:00
tstadel
2c56305ed3
Fix serialization of numpy arrays and pandas dataframes in REST API (#2838)
* correct serialization of numpy arrays and pandas dataframes

* Update Documentation & Code Style

* set additional json_encoders globally

* Update Documentation & Code Style

* add tests for non primitive return types

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-08-02 09:49:32 +02:00
Vladimir Blagojevic
86d56b4dfe
Add HF model caching for integration tests (#2909)
* Add HF model caching for integration tests

* Remove windows mode caching - not worth it
2022-07-29 18:17:05 +02:00
Steven Haley
6b7d4a0514
Bug fix Weaviate document deletion (#2899)
* Bug fix Weaviate document deletion

If no filters param is passed in, then the original code retrieves *all* documents before deleting them by their IDs. There's no need for that, since we can delete by their IDs directly.

* Edit comment to clarify deletion and recreation

* Write unit tests for bug fix
2022-07-29 17:21:25 +02:00
Sara Zan
434b1c3682
Disable a few checks in the pre-commit hook (#2929)
* Disable small checks giving trouble to pydoc-markdown and JSON Schema

* Add instructions for JSON schema generator in the workflow logs
2022-07-29 17:02:56 +02:00
Sara Zan
3157e20dff
Change black pre-commit hook into black-jupyter (#2928)
* change black into black-jupyter

* Revert tutorial changes

This reverts commit dd3c5d954d6a9eed41b849e6a3d14269019bf21b.

* finalize pre-commit changes
2022-07-29 15:56:22 +02:00