* Refactoring the `Raypipeline.run` method - merging it with the `Pipeline.run`
This is to fix#2968
* Bug: variable `i` was already in use
* Removing unused imports
* Removing unused import
* [EMPTY] Re-trigger CI
* Addressing concerns raised pre-review
- Removing the attempt to try to make it without the need for `JoinDocuments` - it is okey to fail without `JoinDocuments` for certain pipelines.
* Refactoring based on reviews
* Extending the Ray Serve integration to allow attributes for Serve deployments
This closes#2917
We should be able to set Ray Serve attributes for the nodes of pipelines, like amount of GPU to use, max_concurrent_queries, etc.
Now this is possible from the pipeline yaml file for each node of the pipeline.
* Ran black and regenerated the json schemas
* Fixing the JSON Schema generation
* Trying to fix the schema CI test issue
* Fixing the test and the schemas
Python 3.8 was generating a different schema than Python 3.7 is creating in the CI. You MUST use Python 3.7 to generate the schemas, otherwise the CIs will fail.
* Merge the two Ray pipeline test cases
* Generate the JSON schemas again after `$ pip install .[all]`
* Removing `haystack/json-schemas/haystack-pipeline-1.16.schema.json`
This was generated by the JSON generator, but based on @ZanSara's instructions, I am removing it.
* Making changes based on @ZanSara's request - the newly requested test is failing
* Fixing the JSON schema generation again
* Renaming `replicas` and moving it under `serve_deployment_kwargs`
* add extras validation, untested
* Dcoumentation update
* Black
* [EMPTY] Re-trigger CI
Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
* Add support for model folder into BasePreProcessor
* First draft of custom model on PreProcessor
* Update Documentation & Code Style
* Update tests to support custom models
* Update Documentation & Code Style
* Test for wrong models in custom folder
* Default to ISO names on custom model folder
Use long names only when needed
* Update Documentation & Code Style
* Refactoring language names usage
* Update fallback logic
* Check unpickling error
* Updated tests using parametrize
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Refactored common logic
* Add format control to NLTK load
* Tests improvements
Add a sample for specialized model
* Update Documentation & Code Style
* Minor log text update
* Log model format exception details
* Change pickle protocol version to 4 for 3.7 compat
* Removed unnecessary model folder parameter
Changed logic comparisons
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Update Documentation & Code Style
* Removed unused import
* Change errors with warnings
* Change to absolute path
* Rename sentence tokenizer method
Co-authored-by: tstadel
* Check document content is a string before process
* Change to log errors and not warnings
* Update Documentation & Code Style
* Improve split sentences method
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Update Documentation & Code Style
* Empty commit - trigger workflow
* Remove superfluous parameters
Co-authored-by: tstadel
* Explicit None checking
Co-authored-by: tstadel
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Change split logic to list
* Fix wrong parameter for run
* Fix mypy error
* Fix layout/raw parameter
* Add test for filename with whitespaces on PDFToText
* Update Documentation & Code Style
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Changing the name that crawled page is saved to avoid long file names error on some file systems
* Custom naming function for saving crawled files
* Update Documentation & Code Style
* Remove bad characters on file name and preffix
* Add test for naming function
* Update Documentation & Code Style
* Fix expensive regex recalculation and linter warns
* Check for exceptions on file dump
* Remove param_naming variable
* Fix file paths on Windows, Linux and Mac
* Update Documentation & Code Style
* Test using one of the docstrings examples
* Change default naming function
Update docstrings
* Applying formatting rules
* Update Documentation & Code Style
* Fix mypy incompatible assignment error
* Remove unused type declaration
* Fix typo
* Update tests for naming function
* Update Documentation & Code Style
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Add new audio answer primitives
* Add AnswerToSpeech
* Add dependency group
* Update Documentation & Code Style
* Extract TextToSpeech in a helper class, create DocumentToSpeech and primitives
* Add tests
* Update Documentation & Code Style
* Add ability to compress audio and more tests
* Add audio group to test, all and all-gpu
* fix pylint
* Update Documentation & Code Style
* Accidental git tag
* Try pleasing mypy
* Update Documentation & Code Style
* fix pylint
* Add warning for missing OS library and support in CI
* Try fixing mypy
* Update Documentation & Code Style
* Add docs, simplify args for audio nodes and add tutorials
* Fix mypy
* Fix run_batch
* Feedback on tutorials
* fix mypy and pylint
* Fix mypy again
* Fix mypy yet again
* Fix the ci
* Fix dicts merge and install ffmpeg on CI
* Make the audio nodes import safe
* Trying to increase tolerance in audio test
* Fix import paths
* fix linter
* Update Documentation & Code Style
* Add audio libs in unit tests
* Update _text_to_speech.py
* Update answer_to_speech.py
* Use dedicated dataset & update telemetry
* Remove and use distilled roberta
* Revert special primitives so that the nodes run in indexing
* Improve tutorials and fix smaller bugs
* Update Documentation & Code Style
* Fix serialization issue
* Update Documentation & Code Style
* Improve tutorial
* Update Documentation & Code Style
* Update _text_to_speech.py
* Minor lg updates
* Minor lg updates to tutorial
* Making indexing work in tutorials
* Update Documentation & Code Style
* Improve docstrings
* Try to use GPU when available
* Update Documentation & Code Style
* Fixi mypy and pylint
* Try to pass the device correctly
* Update Documentation & Code Style
* Use type of device
* use .cpu()
* Improve .ipynb
* update apt index to be able to download libsndfile1
* Fix SpeechDocument.from_dict()
* Change pip URL
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* make crawler extract also hidden text
* Update Documentation & Code Style
* try to adapt test for extract_hidden_text
* Update Documentation & Code Style
* fix test bug
* fix bug in test
* added test for hidden text"
* Update Documentation & Code Style
* fix bug in test
* Update Documentation & Code Style
* fix test
* Update Documentation & Code Style
* fix other test bug
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Rewrite crawler tests (very slow) and fix small crawler bug
* Update Documentation & Code Style
* compile the regex only once
* Factor out the html files & add content check to most tests
* Clarify that even starting URLs can be excluded
* Update Documentation & Code Style
* Change signature
* Fix failing test
* Update Documentation & Code Style
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Add possibility to upload evaluation sets to DC
* fix test_eval sas comparisons
* quickwin docstring feedback changes
* Add hint about annotation tool and mark optional and required columns
* minor changes to docstrings
* Ray pipelines now validate
* Update Documentation & Code Style
* rename Ray pipeline in tests
* Add extras:ray to the test pipeline
* pylint
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Remove BasePipeline and make a module for RayPipeline
* Can load pipelines from yaml, plenty of issues left
* Extract graph validation logic into _add_node_to_pipeline_graph & refactor load_from_config and add_node to use it
* Fix pipeline tests
* Move some tests out of test_pipeline.py and create MockDenseRetriever
* myoy and pylint (silencing too-many-public-methods)
* Fix issue found in some yaml files and in schema files
* Fix paths to YAML and fix some typos in Ray
* Fix eval tests
* Simplify MockDenseRetriever
* Fix Ray test
* Accidentally pushed merge coinflict, fixed
* Typo in schemas
* Typo in _json_schema.py
* Slightly reduce noisyness of version validation warnings
* Fix version logs tests
* Fix version logs tests again
* remove seemingly unused file
* Add check and test to avoid adding the same node to the pipeline twice
* Update Documentation & Code Style
* Revert config to pipeline_config
* Remo0ve unused import
* Complete reverting to pipeline_config
* Some more stray config=
* Update Documentation & Code Style
* Feedback
* Move back other_nodes tests into pipeline tests temporarily
* Update Documentation & Code Style
* Fixing tests
* Update Documentation & Code Style
* Fixing ray and standard pipeline tests
* Rename colliding load() methods in dense retrievers and faiss
* Update Documentation & Code Style
* Fix mypy on ray.py as well
* Add check for no root node
* Fix tests to use load_from_directory and load_index
* Try to workaround the disabled add_node of RayPipeline
* Update Documentation & Code Style
* Fix Ray test
* Fix FAISS tests
* Relax class check in _add_node_to_pipeline_graph
* Update Documentation & Code Style
* Try to fix mypy in ray.py
* unused import
* Try another fix for Ray
* Fix connector tests
* Update Documentation & Code Style
* Fix ray
* Update Documentation & Code Style
* use BaseComponent.load() in pipelines/base.py
* another round of feedback
* stray BaseComponent.load()
* Update Documentation & Code Style
* Fix FAISS tests too
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>
* Change exception into warning, add strict_version param, and remove compatibility between schemas
* Simplify update_json_schema
* Rename unstable into master
* Prevent validate_config from changing the config to validate
* Fix version validation and add tests
* Rename master into ignore
* Complete parameter rename
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Delete files in _src
* Filter unused images and re-add images that were in use in docs/img
* Remove all usages of user-images.githubusercontent.com
Co-authored-by: ZanSara <sarazanzo94@gmail.com>
* extract extension based on file's content
* Add python-magic dependency
* fix the _estimate_extension function and lowercase the file extensions
* check if the FileTypeClassifier can be imported
* add test and new file types
* fix typing
* import Optional
* revert Optional and make sure a string is always returned
* fix test so that it skips markdown files
* Emulate Code & Docs action
* Generate schemas
* Tidy up test code & extensioness files
* Improve error messages
* Revert schema changes
* Emulate black and docs CI again
* Add BasePipeline.validate_config, BasePipeline.validate_yaml, and some new custom exception classes
* Make error composition work properly
* Clarify typing
* Help mypy a bit more
* Update Documentation & Code Style
* Enable autogenerated docs for Milvus1 and 2 separately
* Revert "Enable autogenerated docs for Milvus1 and 2 separately"
This reverts commit 282be4a78a6e95862a9b4c924fc3dea5ca71e28d.
* Update Documentation & Code Style
* Re-enable 'additionalProperties: False'
* Add pipeline.type to JSON Schema, was somehow forgotten
* Disable additionalProperties on the pipeline properties too
* Fix json-schemas for 1.1.0 and 1.2.0 (should not do it again in the future)
* Cal super in PipelineValidationError
* Improve _read_pipeline_config_from_yaml's error handling
* Fix generate_json_schema.py to include document stores
* Fix json schemas (retro-fix 1.1.0 again)
* Improve custom errors printing, add link to docs
* Add function in BaseComponent to list its subclasses in a module
* Make some document stores base classes abstract
* Add marker 'integration' in pytest flags
* Slighly improve validation of pipelines at load
* Adding tests for YAML loading and validation
* Make custom_query Optional for validation issues
* Fix bug in _read_pipeline_config_from_yaml
* Improve error handling in BasePipeline and Pipeline and add DAG check
* Move json schema generation into haystack/nodes/_json_schema.py (useful for tests)
* Simplify errors slightly
* Add some YAML validation tests
* Remove load_from_config from BasePipeline, it was never used anyway
* Improve tests
* Include json-schemas in package
* Fix conftest imports
* Make BasePipeline abstract
* Improve mocking by making the test independent from the YAML version
* Add exportable_to_yaml decorator to forget about set_config on mock nodes
* Fix mypy errors
* Comment out one monkeypatch
* Fix typing again
* Improve error message for validation
* Add required properties to pipelines
* Fix YAML version for REST API YAMLs to 1.2.0
* Fix load_from_yaml call in load_from_deepset_cloud
* fix HaystackError.__getattr__
* Add super().__init__()in most nodes and docstore, comment set_config
* Remove type from REST API pipelines
* Remove useless init from doc2answers
* Call super in Seq3SeqGenerator
* Typo in deepsetcloud.py
* Fix rest api indexing error mismatch and mock version of JSON schema in all tests
* Working on pipeline tests
* Improve errors printing slightly
* Add back test_pipeline.yaml
* _json_schema.py supports different versions with identical schemas
* Add type to 0.7 schema for backwards compatibility
* Fix small bug in _json_schema.py
* Try alternative to generate json schemas on the CI
* Update Documentation & Code Style
* Make linux CI match autoformat CI
* Fix super-init-not-called
* Accidentally committed file
* Update Documentation & Code Style
* fix test_summarizer_translation.py's import
* Mock YAML in a few suites, split and simplify test_pipeline_debug_and_validation.py::test_invalid_run_args
* Fix json schema for ray tests too
* Update Documentation & Code Style
* Reintroduce validation
* Usa unstable version in tests and rest api
* Make unstable support the latest versions
* Update Documentation & Code Style
* Remove needless fixture
* Make type in pipeline optional in the strings validation
* Fix schemas
* Fix string validation for pipeline type
* Improve validate_config_strings
* Remove type from test p[ipelines
* Update Documentation & Code Style
* Fix test_pipeline
* Removing more type from pipelines
* Temporary CI patc
* Fix issue with exportable_to_yaml never invoking the wrapped init
* rm stray file
* pipeline tests are green again
* Linux CI now needs .[all] to generate the schema
* Bugfixes, pipeline tests seems to be green
* Typo in version after merge
* Implement missing methods in Weaviate
* Trying to avoid FAISS tests from running in the Milvus1 test suite
* Fix some stray test paths and faiss index dumping
* Fix pytest markers list
* Temporarily disable cache to be able to see tests failures
* Fix pyproject.toml syntax
* Use only tmp_path
* Fix preprocessor signature after merge
* Fix faiss bug
* Fix Ray test
* Fix documentation issue by removing quotes from faiss type
* Update Documentation & Code Style
* use document properly in preprocessor tests
* Update Documentation & Code Style
* make preprocessor capable of handling documents
* import document
* Revert support for documents in preprocessor, do later
* Fix bug in _json_schema.py that was breaking validation
* re-enable cache
* Update Documentation & Code Style
* Simplify calling _json_schema.py from the CI
* Remove redundant ABC inheritance
* Ensure exportable_to_yaml works only on implementations
* Rename subclass to class_ in Meta
* Make run() and get_config() abstract in BasePipeline
* Revert unintended change in preprocessor
* Move outgoing_edges_input_node check inside try block
* Rename VALID_CODE_GEN_INPUT_REGEX into VALID_INPUT_REGEX
* Add check for a RecursionError on validate_config_strings
* Address usages of _pipeline_config in data silo and elasticsearch
* Rename _pipeline_config into _init_parameters
* Fix pytest marker and remove unused imports
* Remove most redundant ABCs
* Rename _init_parameters into _component_configuration
* Remove set_config and type from _component_configuration's dict
* Remove last instances of set_config and replace with super().__init__()
* Implement __init_subclass__ approach
* Simplify checks on the existence of _component_configuration
* Fix faiss issue
* Dynamic generation of node schemas & weed out old schemas
* Add debatable test
* Add docstring to debatable test
* Positive diff between schemas implemented
* Improve diff printing
* Rename REST API YAML files to trigger IDE validation
* Fix typing issues
* Fix more typing
* Typo in YAML filename
* Remove needless type:ignore
* Add tests
* Fix tests & validation feedback for accessory classes in custom nodes
* Refactor RAGeneratorType out
* Fix broken import in conftest
* Improve source error handling
* Remove unused import in test_eval.py breaking tests
* Fix changed error message in tests matches too
* Normalize generate_openapi_specs.py and generate_json_schema.py in the actions
* Fix path to generate_openapi_specs.py in autoformat.yml
* Update Documentation & Code Style
* Add test for FAISSDocumentStore-like situations (superclass with init params)
* Update Documentation & Code Style
* Fix indentation
* Remove commented set_config
* Store model_name_or_path in FARMReader to use in DistillationDataSilo
* Rename _component_configuration into _component_config
* Update Documentation & Code Style
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Make YAML files get the same version as Haystack and throw warning at load in case of mismatch
* Update version of most YAMLs in the codebase (aesthethic chamge, only to avoid the warning).
* Remove quotes from version in tests
* Fix version in generate_json_schema.py
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* minimal DCDocumentStore
* support filters
* implement get_documents_by_id
* handle not existing documents
* add docstrings
* auth added
* add tests
* generate docs
* Add latest docstring and tutorial changes
* add responses to dev dependencies
* fix tests
* support query() and quey_by_embedding()
* Add latest docstring and tutorial changes
* query tests added
* read api_key and api_endpoint from env
* Add latest docstring and tutorial changes
* support query() and quey_by_embedding()
* query tests added
* Add latest docstring and tutorial changes
* Add latest docstring and tutorial changes
* support dynamic similarity and return_embedding values
* Add latest docstring and tutorial changes
* adjust KeywordDocumentStore description
* refactoring
* Add latest docstring and tutorial changes
* implement get_document_count and raise on all not implemented methods
* Add latest docstring and tutorial changes
* don't use abbreviation DC in comments and errors
* Add latest docstring and tutorial changes
* docstring added to KeywordDocumentStore
* Add latest docstring and tutorial changes
* enhanced api key set
* split tests into two parts
* change setup.py in order to work around build cache
* added link
* Add latest docstring and tutorial changes
* rename DCDocumentStore to DeepsetCloudDocumentStore
* Add latest docstring and tutorial changes
* remove dc.py
* reinsert link to docs
* fix imports
* Add latest docstring and tutorial changes
* better test structure
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ArzelaAscoIi <kristof.herrmann@rwth-aachen.de>
* add tinybert data augmentation
* don't reload glove in tinybert data augmentation
* fix unnecessary load_glove call
* fix type hints
* add comments and type hints
* add batch_size argument
* don't predict subwords as alternative for words
* fix subword predictions
* limit sequence length
* actually limit sequence length
* improve performance by calculating nearest glove vector on gpu
* add model and tokenizer parameter
* fix type hints
* improve data augmentation performance
* explained limits of script
* corrected comment
* added data augmentation test
* don't label every question in augmented dataset as impossible
* add sample glove
* better handling of downloading of glove
* fix typo of last commit
* set fixture scope to "function"
* run FARMReader without multiprocessing
* dispose off ray after tests
* run most expensive tasks first in test files
* run expensive tests first
* run garbage collector between tests
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* ensure tf-idf matrix calculation before retrieval
* Run fit() automatically if new documents have been added
* Add latest docstring and tutorial changes
* Fix type error
* Add test case for tfidf retriever yaml pipeline
* Use InMemoryDocStore and add 2nd test case
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Add endpoint to get documents by filter
* Add test for /documents/get_by_filter and extend the delete documents test
* Add rest_api/file-upload to .gitignore
* Make sure the document store is empty for each test
* Improve docstrings of delete_documents_by_filters and get_documents_by_filters
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>