* Fix when env does not exist
* Fix missed line
* Set conservative chromedriver options
* Set default options based on environment
* Fix removed line
* Updated documentation
* Generate new schemas manually
* Add arguments via iterator and helper function
* Pre-push doc format
* Use imported Option vs full namespace access
* Manually update schema
* Manually add documentation and schema
* Fix language and documentation
* Fix typo
* Auto generated docs
* Updated documentation
* Set translated text on a copy of original document
* Return new translated list
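A minimal sketch of the copy-then-translate behaviour described in the two commits above. `Doc` and the `translate` callable are hypothetical stand-ins for illustration, not the Haystack classes.

```python
from copy import deepcopy
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Doc:
    # Hypothetical stand-in for a document object; not the Haystack class.
    content: str


def translate_documents(docs: List[Doc], translate: Callable[[str], str]) -> List[Doc]:
    """Return a new list of translated copies; the input documents stay unchanged."""
    translated: List[Doc] = []
    for doc in docs:
        new_doc = deepcopy(doc)  # work on a copy so the caller's object is not mutated
        new_doc.content = translate(doc.content)
        translated.append(new_doc)
    return translated


originals = [Doc("hallo"), Doc("welt")]
# str.upper stands in for a real translation function.
result = translate_documents(originals, str.upper)
```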
* Manually generated docs
TODO: check pre-commit
* Hook generated file
* Rename variables for better maintenance
* fix(translator): prevent inputs from being changed
* fix: manual update translator docs
* style(translator): explicit type declaration on List
* docs(translator): re-run pre-commit hook
* style(translator): ignore mypy wrong type check
* docs(translator): re-run pre-commit hook
* prepare 1.7.1 release
* Fix schemas
* Update haystack/json-schemas/haystack-pipeline-1.7.1.schema.json
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* change back main to master
* remove newline at end of file
* generate schema file with no newline
Co-authored-by: ZanSara <sarazanzo94@gmail.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* added meta fields for meta_config to be used during realtime testing of PineconeDocumentStore
* Add documentation on metadata filtering in docstring
* docs
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Enable the `JoinDocuments` node to work with documents with `score=None`
This fixes #2983
As of now, the `JoinDocuments` node errors out if any of the documents has `score=None`, which is possible because some retrievers cannot provide a score, such as the `TfidfRetriever` on Elasticsearch or the `BM25Retriever` on Weaviate.
The reason for the error is that `JoinDocuments` always sorts the documents by score and cannot sort when `score=None`.
There was a very similar issue for `JoinAnswers` too, which was addressed by this PR: https://github.com/deepset-ai/haystack/pull/2436
This PR applies the same solution to `JoinDocuments`, so both `JoinAnswers` and `JoinDocuments` now have the same additional argument to disable sorting when that is required.
The solution is to add an argument to `JoinDocuments` called `sort_by_score: bool`, which lets the user turn off the sorting of documents by score while keeping sorting as the default behavior.
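A minimal sketch of how such a `sort_by_score` flag could behave. The `Doc` class and `join_documents` function are hypothetical and the merge step is simplified; this is not the actual node implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Doc:
    # Hypothetical stand-in for a retrieved document.
    id: str
    score: Optional[float] = None


def join_documents(results: List[List[Doc]], sort_by_score: bool = True) -> List[Doc]:
    """Concatenate retriever outputs, keeping the first occurrence of each id."""
    merged: Dict[str, Doc] = {}
    for docs in results:
        for doc in docs:
            merged.setdefault(doc.id, doc)
    joined = list(merged.values())
    if sort_by_score:
        # Comparing None with a float raises TypeError, which is the failure described above.
        joined.sort(key=lambda d: d.score, reverse=True)
    return joined


scored = join_documents([[Doc("a", 0.2)], [Doc("b", 0.9)]])
unscored = join_documents([[Doc("a", None)], [Doc("b", 0.9)]], sort_by_score=False)
```

With `sort_by_score=False` the documents keep their arrival order, so a missing score never reaches the comparison.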
* Fixing test bug
* Addressing PR review comments
- Extending unit tests
- Simplifying logic
* Making the sorting work even with no scores
By treating a missing score as -Inf when sorting
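The idea of sorting a missing score as -Inf can be sketched with a sort key. The function name is hypothetical and the example works on bare scores rather than documents.

```python
from typing import List, Optional


def sort_scores(scores: List[Optional[float]]) -> List[Optional[float]]:
    """Sort highest first, treating a missing score as negative infinity."""
    return sorted(
        scores,
        key=lambda s: s if s is not None else float("-inf"),
        reverse=True,
    )


ranked = sort_scores([0.3, None, 0.8])
```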
* Forgot to commit the change in `join_docs.py`
* [EMPTY] Re-trigger CI
* Added an INFO log if the `JoinDocuments` is sorting while some of the docs have `score=None`
* Adjusting the arguments of `any()`
* [EMPTY] Re-trigger CI
* Adding support for additional distance metrics for Weaviate
Fixes #3000
* Updating the docs
* Fixing error texts
* Fixing issues raised by the review
* Addressing the last issue from the reviews - removing test `test_weaviate.py::test_similarity`
* [EMPTY] Re-trigger CI
* Fixing things based on review
* [EMPTY] Re-trigger CI
* Add page number to Documents coming from PDFConverters and PreProcessor
* Fix mypy
* Update API Docs
* Update API Docs
* Remove unused imports
* Generate JSON schema
* Generate JSON schema
* Make test variable shorter
* Make regex a separate function
* Move counting of page breaks to a function
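A rough sketch of deriving a page number by counting page breaks. It assumes page breaks are marked with a form-feed character (`\f`); both function names are hypothetical.

```python
def count_page_breaks(text: str) -> int:
    """Count form-feed characters, assumed here to mark page breaks."""
    return text.count("\f")


def page_number_at(text: str, offset: int) -> int:
    """Return the 1-based page number of the character at `offset`."""
    return count_page_breaks(text[:offset]) + 1


sample = "page one\fpage two\fpage three"
first = page_number_at(sample, 0)
second = page_number_at(sample, sample.index("two"))
third = page_number_at(sample, sample.index("three"))
```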
* Generate JSON schema
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Update API Documentation
* Don't create instance for testing staticmethod
* Update haystack/nodes/preprocessor/preprocessor.py
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* fixed tokens in question generation
* simplified assignment
* same behavior also for pad and eos
* use skip_special_tokens in batch_decode
* fixed black error and update docs
* fixed schemas ci error
* JSON schemas
* Add git diff to debug schema issues
* opensearch schema was missing
* Add missing instruction in the workflow error message
* typo
* Ability to run Ray Serve detached
Fixes #2944
Ability to run Ray Serve detached - to allow running multiple instances of the app (HA).
See https://docs.ray.io/en/latest/serve/package-ref.html#core-apis
* Generating the docs
* Re-trigger the CI pipeline
* Retrigger the CI Pipeline
* Typo in docstrings
* Fixing docstring and typing issues
* Regenerating docs
* [EMPTY] Re-trigger CI
* [EMPTY] Re-trigger CI
* Refactoring to allow any number of args for the `serve.start()` method
There seem to be additional arguments to the `serve.start()` method, so we should probably cover all of them at once, instead of only the `detached` option.
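One way to cover any number of arguments is to forward a dict of keyword arguments. In this sketch `start_serving` and `fake_start` are hypothetical; `fake_start` only stands in for `serve.start()` and does not call Ray.

```python
from typing import Any, Callable, Dict, Optional


def start_serving(start_fn: Callable[..., Any], serve_args: Optional[Dict[str, Any]] = None) -> Any:
    """Forward an arbitrary dict of keyword arguments to the start function."""
    return start_fn(**(serve_args or {}))


def fake_start(detached: bool = False, **other: Any) -> Dict[str, Any]:
    # Hypothetical stand-in for `serve.start()`; it only echoes what it received.
    return {"detached": detached, **other}


started = start_serving(fake_start, {"detached": True})
default = start_serving(fake_start)
```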
* [EMPTY] Re-trigger CI
* Test whether the ServeControllerClient in fact has the supplied `detached` parameter
* Extending the Ray Serve integration to allow attributes for Serve deployments
This closes #2917
We should be able to set Ray Serve attributes for the nodes of pipelines, like amount of GPU to use, max_concurrent_queries, etc.
Now this is possible from the pipeline yaml file for each node of the pipeline.
* Ran black and regenerated the json schemas
* Fixing the JSON Schema generation
* Trying to fix the schema CI test issue
* Fixing the test and the schemas
Python 3.8 was generating a different schema than the one Python 3.7 creates in the CI. You MUST use Python 3.7 to generate the schemas, otherwise the CIs will fail.
* Merge the two Ray pipeline test cases
* Generate the JSON schemas again after `$ pip install .[all]`
* Removing `haystack/json-schemas/haystack-pipeline-1.16.schema.json`
This was generated by the JSON generator, but based on @ZanSara's instructions, I am removing it.
* Making changes based on @ZanSara's request - the newly requested test is failing
* Fixing the JSON schema generation again
* Renaming `replicas` and moving it under `serve_deployment_kwargs`
* add extras validation, untested
* Documentation update
* Black
* [EMPTY] Re-trigger CI
Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
* add Opensearch extras
* let OpenSearchDocumentStore use opensearch-py
* Update Documentation & Code Style
* fix a bug found after adding tests
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Upgrading Weaviate used for testing to 1.14.1 from 1.11.0
This has also brought up an issue with one of the tests, which filters for the value "a". The test started to fail because "a" is a default stopword in Weaviate, so I changed it to look for the value "c" instead of "a" to get around the stopword issue.
* Weaviate client upgrade
From v3.3.3 to v3.6.0
* Adding BM25 Retrieval to Weaviate
Weaviate now supports BM25 retrieval in experimental mode and with some limitations (for example, it cannot be combined with filters).
This commit adds support for inverted index (BM25) querying against Weaviate.
* Running Black on the recent code changes
* Update Documentation & Code Style
* Fixing linting issues after code changes by black
* The BM25 query needs to be in all lowercase for now
The BM25 query needs to be provided all lowercase while the functionality is in experimental mode in Weaviate.
See https://app.slack.com/client/T0181DYT9KN/C017EG2SL3H/thread/C017EG2SL3H-1658790227.208119
* Fixing method parameter docstrings to highlight that they are not supported in Weaviate
* Update Documentation & Code Style
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Add support for model folder into BasePreProcessor
* First draft of custom model on PreProcessor
* Update Documentation & Code Style
* Update tests to support custom models
* Update Documentation & Code Style
* Test for wrong models in custom folder
* Default to ISO names on custom model folder
Use long names only when needed
* Update Documentation & Code Style
* Refactoring language names usage
* Update fallback logic
* Check unpickling error
* Updated tests using parametrize
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Refactored common logic
* Add format control to NLTK load
* Tests improvements
Add a sample for specialized model
* Update Documentation & Code Style
* Minor log text update
* Log model format exception details
* Change pickle protocol version to 4 for 3.7 compat
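For illustration, pinning the pickle protocol looks like the sketch below. Protocol 4 is the highest protocol Python 3.7 can read, while Python 3.8 added protocol 5. The `model` object is only a placeholder, not the actual tokenizer model.

```python
import pickle

# Protocol 4 is the highest protocol that Python 3.7 can read;
# Python 3.8 added protocol 5.
PICKLE_PROTOCOL = 4

model = {"language": "en", "abbreviations": ["dr", "mr"]}  # placeholder object
payload = pickle.dumps(model, protocol=PICKLE_PROTOCOL)
restored = pickle.loads(payload)
```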
* Removed unnecessary model folder parameter
Changed logic comparisons
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Update Documentation & Code Style
* Removed unused import
* Change errors with warnings
* Change to absolute path
* Rename sentence tokenizer method
Co-authored-by: tstadel
* Check document content is a string before process
* Change to log errors and not warnings
* Update Documentation & Code Style
* Improve split sentences method
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Update Documentation & Code Style
* Empty commit - trigger workflow
* Remove superfluous parameters
Co-authored-by: tstadel
* Explicit None checking
Co-authored-by: tstadel
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Changing the name under which a crawled page is saved to avoid long file name errors on some file systems
* Custom naming function for saving crawled files
* Update Documentation & Code Style
* Remove bad characters in file name and prefix
* Add test for naming function
* Update Documentation & Code Style
* Fix expensive regex recalculation and linter warnings
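A sketch of a naming function that avoids recompiling its regex on each call and keeps file names short. The naming scheme, the `max_prefix` value, and the function name are assumptions for illustration, not the crawler's actual default.

```python
import hashlib
import re

# Compiled once at import time instead of on every call.
_UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9_.-]")


def file_name_for(url: str, max_prefix: int = 50) -> str:
    """Build a short, filesystem-safe name from a URL (illustrative scheme)."""
    prefix = _UNSAFE_CHARS.sub("_", url)[:max_prefix]
    # A short hash of the full URL keeps names distinct after truncation.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{digest}"


name = file_name_for("https://example.com/docs/a?b=c")
long_name = file_name_for("https://example.com/" + "x" * 500)
```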
* Check for exceptions on file dump
* Remove param_naming variable
* Fix file paths on Windows, Linux and Mac
* Update Documentation & Code Style
* Test using one of the docstrings examples
* Change default naming function
Update docstrings
* Applying formatting rules
* Update Documentation & Code Style
* Fix mypy incompatible assignment error
* Remove unused type declaration
* Fix typo
* Update tests for naming function
* Update Documentation & Code Style
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* first version of save_to_remote for HF from FarmReader
* Update Documentation & Code Style
* Changes based on comments
* Update Documentation & Code Style
* imports order
* making small changes to pydoc
* indent fix
* Update Documentation & Code Style
* keyword arguments instead of positional
* Changing to repo_id
huggingface-hub package would have to be v0.5 or higher - checking how to handle with Thomas
* Update Documentation & Code Style
* adding huggingface-hub dependency 0.5 or above
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
* extract common code for ES and OS into a base class
* Update Documentation & Code Style
* give the base class a more obvious name
* Update Documentation & Code Style
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Add new audio answer primitives
* Add AnswerToSpeech
* Add dependency group
* Update Documentation & Code Style
* Extract TextToSpeech in a helper class, create DocumentToSpeech and primitives
* Add tests
* Update Documentation & Code Style
* Add ability to compress audio and more tests
* Add audio group to test, all and all-gpu
* fix pylint
* Update Documentation & Code Style
* Accidental git tag
* Try pleasing mypy
* Update Documentation & Code Style
* fix pylint
* Add warning for missing OS library and support in CI
* Try fixing mypy
* Update Documentation & Code Style
* Add docs, simplify args for audio nodes and add tutorials
* Fix mypy
* Fix run_batch
* Feedback on tutorials
* fix mypy and pylint
* Fix mypy again
* Fix mypy yet again
* Fix the ci
* Fix dicts merge and install ffmpeg on CI
* Make the audio nodes import safe
* Trying to increase tolerance in audio test
* Fix import paths
* fix linter
* Update Documentation & Code Style
* Add audio libs in unit tests
* Update _text_to_speech.py
* Update answer_to_speech.py
* Use dedicated dataset & update telemetry
* Remove and use distilled roberta
* Revert special primitives so that the nodes run in indexing
* Improve tutorials and fix smaller bugs
* Update Documentation & Code Style
* Fix serialization issue
* Update Documentation & Code Style
* Improve tutorial
* Update Documentation & Code Style
* Update _text_to_speech.py
* Minor lg updates
* Minor lg updates to tutorial
* Making indexing work in tutorials
* Update Documentation & Code Style
* Improve docstrings
* Try to use GPU when available
* Update Documentation & Code Style
* Fix mypy and pylint
* Try to pass the device correctly
* Update Documentation & Code Style
* Use type of device
* use .cpu()
* Improve .ipynb
* update apt index to be able to download libsndfile1
* Fix SpeechDocument.from_dict()
* Change pip URL
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>