* prepare 1.7.1 release
* Fix schemas
* Update haystack/json-schemas/haystack-pipeline-1.7.1.schema.json
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* change back main to master
* remove newline at end of file
* generate schema file with no newline
Co-authored-by: ZanSara <sarazanzo94@gmail.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* added meta fields for meta_config to be used during realtime testing of PineconeDocumentStore
* Add documentation on metadata filtering in docstring
* docs
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Enable the `JoinDocuments` node to work with documents with `score=None`
This fixes #2983
Currently, the `JoinDocuments` node raises an error if any of the documents has `score=None`, which is possible because some retrievers cannot provide a score, like the `TfidfRetriever` on Elasticsearch or the `BM25Retriever` on Weaviate.
The reason for the error is that `JoinDocuments` always sorts the documents by score and cannot sort when `score=None`.
There was a very similar issue for `JoinAnswers` too, which was addressed by this PR: https://github.com/deepset-ai/haystack/pull/2436
This PR applies the same solution to `JoinDocuments`, so both `JoinAnswers` and `JoinDocuments` now have the same additional argument to disable sorting when that is required.
The solution is to add an argument to `JoinDocuments` called `sort_by_score: bool`, which allows the user to turn off the sorting of documents by score, but keeps the current functionality of sorting being performed as the default.
* Fixing test bug
* Addressing PR review comments
- Extending unit tests
- Simplifying logic
* Making the sorting work even with no scores
Documents with no score are sorted as if their score were -Inf
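
A minimal sketch of the idea, assuming a simplified stand-in for the document class (the `Doc` dataclass and the `sort_by_score` helper are hypothetical names for illustration, not the actual code in `join_docs.py`):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    # Hypothetical stand-in for a retrieved document; not Haystack's Document class.
    content: str
    score: Optional[float] = None


def sort_by_score(docs: List[Doc]) -> List[Doc]:
    # Treat a missing score as -inf so that scored documents always rank first.
    return sorted(
        docs,
        key=lambda d: d.score if d.score is not None else float("-inf"),
        reverse=True,
    )


docs = [Doc("a", 0.2), Doc("b", None), Doc("c", 0.9)]
ranked = sort_by_score(docs)
print([d.content for d in ranked])
```

With this key function the comparison never sees `None`, so the sort cannot fail on unscored documents; they simply end up at the bottom of the ranking.
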
* Forgot to commit the change in `join_docs.py`
* [EMPTY] Re-trigger CI
* Added an INFO log if the `JoinDocuments` is sorting while some of the docs have `score=None`
* Adjusting the arguments of `any()`
* [EMPTY] Re-trigger CI
* Adding support for additional distance metrics for Weaviate
Fixes #3000
* Updating the docs
* Fixing error texts
* Fixing issues raised by the review
* Addressing the last issue from the reviews - removing test `test_weaviate.py::test_similarity`
* [EMPTY] Re-trigger CI
* Fixing things based on review
* [EMPTY] Re-trigger CI
* Add page number to Documents coming from PDFConverters and PreProcessor
* Fix mypy
* Update API Docs
* Update API Docs
* Remove unused imports
* Generate JSON schema
* Generate JSON schema
* Make test variable shorter
* Make regex a separate function
* Move counting of page breaks to a function
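
A rough sketch of how a page number could be derived by counting page breaks, assuming that pages are separated by a form feed character (`\f`) and that the splits are consecutive chunks of one converted text. The helper names `count_page_breaks` and `page_numbers_for_splits` are hypothetical and not taken from `preprocessor.py`:

```python
from typing import List

PAGE_BREAK = "\f"  # assumption: the converter marks page boundaries with a form feed


def count_page_breaks(text: str) -> int:
    # Number of page boundaries contained in a piece of text.
    return text.count(PAGE_BREAK)


def page_numbers_for_splits(splits: List[str]) -> List[int]:
    """Return the 1-based page on which each consecutive split starts."""
    pages = []
    current_page = 1
    for split in splits:
        pages.append(current_page)
        # Every page break inside this split pushes the next split to a later page.
        current_page += count_page_breaks(split)
    return pages


splits = ["intro text ", "more text\fpage two ", "still page two\fpage three"]
print(page_numbers_for_splits(splits))
```
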
* Generate JSON schema
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Update API Documentation
* Don't create instance for testing staticmethod
* Update haystack/nodes/preprocessor/preprocessor.py
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* fixed tokens in question generation
* simplified assignment
* same behavior also for pad and eos
* use skip_special_tokens in batch_decode
* fixed black error and update docs
* fixed schemas ci error
* JSON schemas
* Add git diff to debug schema issues
* opensearch schema was missing
* Add missing instruction in the workflow error message
* typo
* Ability to run Ray Serve detached
Fixes #2944
Ability to run Ray Serve detached - to allow running multiple instances of the app (HA).
See https://docs.ray.io/en/latest/serve/package-ref.html#core-apis
* Generating the docs
* Re-trigger the CI pipeline
* Retrigger the CI Pipeline
* Typo in docstrings
* Fixing docstring and typing issues
* Regenerating docs
* [EMPTY] Re-trigger CI
* [EMPTY] Re-trigger CI
* Refactoring to allow any number of args for the `serve.start()` method
There seem to be additional arguments to the `serve.start()` method, so we should probably cover all of them at once, instead of only the `detached` option.
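
A sketch of the forwarding pattern described here, assuming the arguments are collected in a dictionary. The `start_serve` wrapper, the `serve_args` parameter name, and the `fake_start` stub are hypothetical; the stub stands in for `serve.start` so the example runs without Ray installed:

```python
from typing import Any, Callable, Dict, Optional


def start_serve(start_fn: Callable[..., Any], serve_args: Optional[Dict[str, Any]] = None) -> Any:
    # Forward whatever keyword arguments the caller supplied, so that new
    # options need no change in this wrapper. `start_fn` stands in for `serve.start`.
    return start_fn(**(serve_args or {}))


def fake_start(**kwargs: Any) -> Dict[str, Any]:
    # Hypothetical stub used instead of Ray; it only echoes the arguments it received.
    return kwargs


print(start_serve(fake_start, {"detached": True}))
```
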
* [EMPTY] Re-trigger CI
* Test whether the ServeControllerClient in fact has the supplied `detached` parameter
* Tutorial 06: Replace DPR with EmbeddingRetriever
Closes #2887
* Add updated tutorials/6.md file
Replace `DensePassageRetriever` with `EmbeddingRetriever`
* Update Tutorial 06 based on PR feedback
* Further updates to Tutorial-06 according to review feedback
* [Tutorial 06] Put in review feedback for the py file
* Extending the Ray Serve integration to allow attributes for Serve deployments
This closes #2917
We should be able to set Ray Serve attributes for the nodes of pipelines, such as the amount of GPU to use, `max_concurrent_queries`, etc.
Now this is possible from the pipeline yaml file for each node of the pipeline.
* Ran black and regenerated the json schemas
* Fixing the JSON Schema generation
* Trying to fix the schema CI test issue
* Fixing the test and the schemas
Python 3.8 generated a different schema than the one Python 3.7 creates in the CI. You MUST use Python 3.7 to generate the schemas, otherwise the CI will fail.
* Merge the two Ray pipeline test cases
* Generate the JSON schemas again after `$ pip install .[all]`
* Removing `haystack/json-schemas/haystack-pipeline-1.16.schema.json`
This was generated by the JSON generator, but based on @ZanSara's instructions, I am removing it.
* Making changes based on @ZanSara's request - the newly requested test is failing
* Fixing the JSON schema generation again
* Renaming `replicas` and moving it under `serve_deployment_kwargs`
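
An illustrative sketch of per-node deployment settings, written as the dictionary a YAML loader might produce. Only the `serve_deployment_kwargs` key and `max_concurrent_queries` come from this change description; the node names, the other keys, all values, and the `deployment_kwargs_by_node` helper are assumptions made for the example:

```python
from typing import Any, Dict

# Hypothetical pipeline definition; the values are placeholders, not recommendations.
pipeline_config: Dict[str, Any] = {
    "name": "query",
    "nodes": [
        {
            "name": "Retriever",
            "inputs": ["Query"],
            "serve_deployment_kwargs": {
                "num_replicas": 2,
                "max_concurrent_queries": 17,
                "ray_actor_options": {"num_gpus": 0.25},
            },
        },
        {"name": "Reader", "inputs": ["Retriever"]},
    ],
}


def deployment_kwargs_by_node(config: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    # Nodes without the key fall back to an empty dict, i.e. default deployment settings.
    return {node["name"]: node.get("serve_deployment_kwargs", {}) for node in config["nodes"]}


kwargs = deployment_kwargs_by_node(pipeline_config)
print(kwargs["Retriever"]["num_replicas"])
print(kwargs["Reader"])
```
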
* add extras validation, untested
* Documentation update
* Black
* [EMPTY] Re-trigger CI
Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
* add Opensearch extras
* let OpenSearchDocumentStore use opensearch-py
* Update Documentation & Code Style
* fix a bug found after adding tests
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Upgrading Weaviate used for testing to 1.14.1 from 1.11.0
This has also brought up an issue with one of the tests, which filters for the value "a". This test started to fail because "a" is a default stopword in Weaviate, so I changed the test to look for the value "c" instead of "a" to get around the stopword issue.
* Weaviate client upgrade
From v3.3.3 to v3.6.0
* Adding BM25 Retrieval to Weaviate
Weaviate now supports BM25 retrieval in experimental mode and with some limitations (for example, it cannot be combined with filters).
This commit adds support for inverted index (BM25) querying against Weaviate.
* Running Black on the recent code changes
* Update Documentation & Code Style
* Fixing linting issues after code changes by black
* The BM25 query needs to be in all lowercase for now
The BM25 query needs to be provided all lowercase while the functionality is in experimental mode in Weaviate.
See https://app.slack.com/client/T0181DYT9KN/C017EG2SL3H/thread/C017EG2SL3H-1658790227.208119
* Fixing method parameter docstring to highlight that they are not supported in Weaviate
* Update Documentation & Code Style
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Add pre-commit config
* update contributing guidelines
* try failing the workflow
* add pre-commit to the deps
* updating uninstall instructions
* separate jobs in CI
* make tutorials check fail
* make black check fail
* make openapi check fail
* make yaml schema and api docs checks fail
* highlight the instructions
* Update .pre-commit-config.yaml
Co-authored-by: Tobias Wochinger <mail@tobias-wochinger.de>
* Update CONTRIBUTING.md
Co-authored-by: Tobias Wochinger <mail@tobias-wochinger.de>
* Update CONTRIBUTING.md
Co-authored-by: Tobias Wochinger <mail@tobias-wochinger.de>
* Use black --check
* Add images of the CI
* title level
* feedback
Co-authored-by: Tobias Wochinger <mail@tobias-wochinger.de>
* Add support for model folder into BasePreProcessor
* First draft of custom model on PreProcessor
* Update Documentation & Code Style
* Update tests to support custom models
* Update Documentation & Code Style
* Test for wrong models in custom folder
* Default to ISO names on custom model folder
Use long names only when needed
* Update Documentation & Code Style
* Refactoring language names usage
* Update fallback logic
* Check unpickling error
* Updated tests using parametrize
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Refactored common logic
* Add format control to NLTK load
* Tests improvements
Add a sample for specialized model
* Update Documentation & Code Style
* Minor log text update
* Log model format exception details
* Change pickle protocol version to 4 for 3.7 compat
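
A small sketch of pinning the pickle protocol, assuming the model can be pickled like any other object. The `model` dictionary is a hypothetical stand-in for a tokenizer model:

```python
import io
import pickle

# Hypothetical stand-in for a tokenizer model object.
model = {"lang": "en", "abbreviations": ["dr", "mr"]}

buffer = io.BytesIO()
# Protocol 4 is the highest protocol Python 3.7 can read; protocol 5 requires Python 3.8+.
pickle.dump(model, buffer, protocol=4)

buffer.seek(0)
restored = pickle.load(buffer)
print(restored == model)
```

Passing `protocol=4` explicitly keeps the file readable on Python 3.7 even when it is written by a newer interpreter that could otherwise pick a higher protocol.
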
* Removed unnecessary model folder parameter
Changed logic comparisons
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Update Documentation & Code Style
* Removed unused import
* Replace errors with warnings
* Change to absolute path
* Rename sentence tokenizer method
Co-authored-by: tstadel
* Check document content is a string before process
* Change to log errors and not warnings
* Update Documentation & Code Style
* Improve split sentences method
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Update Documentation & Code Style
* Empty commit - trigger workflow
* Remove superfluous parameters
Co-authored-by: tstadel
* Explicit None checking
Co-authored-by: tstadel
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Remove caching and install audio deps
* Fix `Tutorials` as well
* Run all tutorials even though some fail
* Forgot fi
* fix failure condition
* proper bash string equality
* Enable debug logs
* remove audio files
* Update Documentation & Code Style
* Use the setup action in the Tutorial CI as well
* Try with a file that exists
* Update Documentation & Code Style
* Fix the comments in the tutorials
* Update Documentation & Code Style
* Fix tutorials.sh
* Remove debug logging
* import pprint and try editable install
* Update Documentation & Code Style
* extract no run list
* Add tutorial18 to no run list nightly
* import pprint correctly
* Update Documentation & Code Style
* try making site-packages editable
* Make pythonpath editable every time Tut17 is run on CI
* typo
* fix imports in tut5
* add git clean
* Update Documentation & Code Style
* add comments and remove `-e`
* accidentally deleted a line
* Update .github/utils/tutorials.sh
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
* Changing the file name that a crawled page is saved under, to avoid "file name too long" errors on some file systems
* Custom naming function for saving crawled files
* Update Documentation & Code Style
* Remove bad characters from file name and prefix
* Add test for naming function
* Update Documentation & Code Style
* Fix expensive regex recalculation and linter warnings
* Check for exceptions on file dump
* Remove param_naming variable
* Fix file paths on Windows, Linux and Mac
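
A sketch of what such a naming function could look like, assuming a scheme that combines a truncated, sanitized URL prefix with a short hash. The function name `crawled_file_name`, the character set, the length limit, and the `.json` suffix are illustrative assumptions, not necessarily the crawler's actual defaults:

```python
import hashlib
import re

# Compiled once at import time rather than on every call.
_BAD_CHARS = re.compile(r"[<>:'\"/\\|?*\0 ]")


def crawled_file_name(url: str, max_prefix_len: int = 100) -> str:
    # Hypothetical scheme: truncated, sanitized URL plus a short hash for uniqueness.
    prefix = _BAD_CHARS.sub("_", url[:max_prefix_len])
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[-6:]
    return f"{prefix}_{digest}.json"


name = crawled_file_name("https://example.com/docs/a very/long?page=1")
print(name)
```

Truncating the prefix bounds the file name length, while the hash suffix is computed from the full URL so that two long URLs sharing the same prefix are still very likely to get different names.
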
* Update Documentation & Code Style
* Test using one of the docstrings examples
* Change default naming function
Update docstrings
* Applying formatting rules
* Update Documentation & Code Style
* Fix mypy incompatible assignment error
* Remove unused type declaration
* Fix typo
* Update tests for naming function
* Update Documentation & Code Style
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Tutorial 18: Open in Colab doesn't work in Firefox
* Tutorial 18: Open in Colab doesn't work in Firefox v2
* Update Documentation & Code Style
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>