1204 Commits

Author SHA1 Message Date
Stefano Fiorucci
f43bc562d3
refactor: replace torch.no_grad with torch.inference_mode (where possible) (#3601)
* try to replace torch.no_grad

* revert erroneous change

* revert other module breaking

* revert training/base
2022-11-23 09:26:11 +01:00
Stefano Fiorucci
3040e59c63
feat: add support for BM25Retriever in InMemoryDocumentStore (#3561)
* very first draft

* implement query and query_batch

* add more bm25 parameters

* add rank_bm25 dependency

* fix mypy

* remove tokenizer callable parameter

* remove unused import

* only json serializable attributes

* try to fix: pylint too-many-public-methods / R0904

* bm25 attribute always present

* convert errors into warnings to make the tutorial 1 work

* add docstrings; tests

* try to make tests run

* better docstrings; revert not running tests

* some suggestions from review

* rename elasticsearch retriever as bm25 in tests; try to test memory_bm25

* exclude tests with filters

* change elasticsearch to bm25 retriever in test_summarizer

* add tests

* try to improve tests

* better type hint

* adapt test_table_text_retriever_embedding

* handle non-textual docs

* query only textual documents
2022-11-22 09:24:52 +01:00
tstadel
0d45cbce56
convert eval metrics to python float (#3612) 2022-11-22 09:05:10 +01:00
Espoir Murhabazi
d114a994f1
refactor: update Squad data (#3513)
* refractor the to_squad data class

* fix the validation label

* refractor the to_squad data class

* fix the validation label

* add the test for the to_label object function

* fix the tests for to_label_objects

* move all the test related to squad data to one file

* remove unused imports

* revert tiny_augmented.json

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2022-11-21 11:06:14 +01:00
Massimiliano Pippi
ea75e2aab5
feat: store metadata using JSON in SQLDocumentStore (#3547)
* add warnings

* make the field cachable

* review comment
2022-11-18 08:26:19 +01:00
Massimiliano Pippi
1399681c81
move milvus tests to their own module (#3596) 2022-11-17 16:22:02 +01:00
Massimiliano Pippi
6cd0e337d0
refactor: Generate JSON schema when missing (#3533)
* removed unused script

* print info logs when generating openapi schema

* create json schema only when needed

* fix tests

* Remove leftover

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2022-11-17 11:09:27 +01:00
Julian Risch
8052632b64
test: add test to check id_hash_keys is not ignored (#3577) 2022-11-17 09:25:02 +01:00
Stefano Fiorucci
dc26e6d43e
fix: Flatten DocumentClassifier output in SQLDocumentStore; remove _sql_session_rollback hack in tests (#3273)
* first draft

* fix

* fix

* move test to test_sql
2022-11-16 12:20:57 +01:00
Massimiliano Pippi
ba75d39029
fix: discard metadata fields if not set in Weaviate (#3578)
* fix weaviate bug in returning embeddings and setting empty meta fields

* review comment
2022-11-15 22:02:53 +01:00
tstadel
6ce2d296f4
fix: Elasticsearch / OpenSearch brownfield function does not incorporate meta (#3572)
* fix meta bug

* adjust brownfield test
2022-11-15 12:13:21 +01:00
Stefano Fiorucci
9de56b0283
fix: write metadata to SQL Document Store when duplicate_documents!="overwrite" (#3548)
* add_all fixes the bug

* improved test
2022-11-15 10:04:04 +01:00
Massimiliano Pippi
6a48ace9b9
BREAKING CHANGE: remove Milvus1DocumentStore along with support for Milvus < 2.x (#3552)
* remove milvus1

* leftover

* revert deprecation process
2022-11-15 09:54:55 +01:00
Massimiliano Pippi
057a8c0b4f
refactor: Pinecone tests (#3555)
* add pytest option to unmock pinecone

* first try

* handle missing answer

* fix labels metadata

* more tests

* adapt workflow

* typo

* address review comments
2022-11-14 15:19:15 +01:00
Massimiliano Pippi
4dfddf0d10
refactor: Refactor Weaviate tests (#3541)
* refactor tests

* fix job

* revert

* revert

* revert

* use latest weaviate

* fix abstract methods signatures

* pass class_name to all the CRUD methods

* finish moving all the tests

* bump weaviate version

* raise, don't pass
2022-11-14 09:57:30 +01:00
Massimiliano Pippi
3319ef6d1c
refactor: refactor FAISS tests (#3537)
* fix write docs behaviour

* refactor FAISS tests

* do not remove the sqlite db

* try

* remove extra slash

* Apply suggestions from code review

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* review comments

* Update test/document_stores/test_faiss.py

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* review comments

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-11-08 16:37:01 +01:00
Sara Zan
43b24fd1a7
fix: strip whitespaces safely from FARMReader's answers (#3526)
* remove .strip()

* check for right-side offset

* return the whitespace-cleaned answer

* lstrip, not rstrip :D

* remove int

* left_offset

* slightly refactor reader fixture

* extend test_output
2022-11-08 09:26:47 +01:00
Mayank Jobanputra
794fe5ffa4
bug: didn't clean up model files after running pytest for test_table_text_retriever_training (#3534)
* Added tmp path to avoid clean up of model files later
2022-11-07 15:07:04 +05:30
Massimiliano Pippi
255072d8d5
refactor: move dC tests to their own module and job (#3529)
* move dC tests to their own module and job

* restore global var

* revert
2022-11-04 17:05:10 +01:00
Massimiliano Pippi
2bb81331b7
feat: add SQLDocumentStore tests (#3517)
* port SQL tests

* cleanup document_store_tests.py from sql tests

* leftover

* Update .github/workflows/tests.yml

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* review comments

* Update test/document_stores/test_base.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2022-11-04 09:24:19 +01:00
Stefano Fiorucci
1a60e21137
refactor: simplify Summarizer, add Document Merger (#3452)
* remove generate_single_summary

* update schemas

* remove unused import

* fix mypy

* fix mypy

* test: summarizer doesnt change content

* other test correction

* move test_summarizer_translation to test_extractor_translation

* fix test

* first try for doc merger

* reintroduce and deprecate generate_single_summary

* progress in document merger

* document merger!

* mypy, pylint fixes

* use generator

* added test that will fail in 1.12

* adapt to review

* extended deprecation docstring

* Update test/nodes/test_extractor_translation.py

* Update test/nodes/test_summarizer.py

* Update test/nodes/test_summarizer.py

* black

* documents fixture

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
2022-11-03 16:04:53 +01:00
Sara Zan
f0be78c6a6
bug: remove useless import in conftest.py (#3362)
* Remove useless milvus import in conftest

* schemas

* schemas
2022-11-02 19:22:24 +05:30
Stefano Fiorucci
4b0894f4c2
fix: support long texts for labels in ElasticsearchDocumentStore (#3346) 2022-11-02 11:16:36 +01:00
Sara Zan
bb1d9983b0
refactor: remove YAML save/load methods for subclasses of BaseStandardPipeline (#3443)
* remove methods & update docstring

* remove irrelevant test
2022-11-02 10:14:33 +01:00
bogdankostic
60224412bc
feat: Add headline extraction to ParsrConverter (#3488)
* Add headline extraction to ParsrConverter

* Add sample PDF file

* Add test

* Use extract_headlines if set in convert method

* Integrate PR feedback
2022-10-31 19:00:02 +01:00
Massimiliano Pippi
b694c7b5cb
Document Store test refactoring (#3449)
* add new marker

* start using test hierarchies

* move ES tests into their own class

* refactor test workflow

* job steps

* add more tests

* move more tests

* more tests

* test labels

* add more tests

* Update tests.yml

* Update tests.yml

* fix

* typo

* fix es image tag

* map es ports

* try

* fix

* default port

* remove opensearch from the markers sorcery

* revert

* skip new tests in old jobs

* skip opensearch_faiss
2022-10-31 15:30:14 +01:00
Sebastian
384663981d
Fixed bug in onnx converter for XLMRoberta architecture (#3470) 2022-10-28 15:35:53 +02:00
Sebastian
8db7dfb884
refactor: TableReader (#3456)
* Refactoring table reader
2022-10-26 20:57:28 +02:00
Sebastian
59857cb492
feat: Speed up reader tests (#3476)
* Use a smaller reader where possible

* Change scope to module of reader to get faster load times
2022-10-26 19:04:18 +02:00
Sara Zan
05c68b6624
feat: add document_store to all BaseRetriever.retrieve() and BaseRetriever.retrieve_batch() implementations (#3379)
* add document_store to retrieve()]

* mypy & pylint

* pass docstore to embedding encoders

* schemas

* mypy and pylint

* fix tfidfretriever

* pylint

* mypy

* pylint

* fix tfidf

* mypy

* pylint

* schemas

* another fix for tfidf

* fix question generation tests

* remove docstore from embedding encoder signature

* pylint

* revert accidental test changes

* Apply suggestions from code review

* check for docstore similarity function only if the docstore is present

* check for docstore similarity function only if the docstore is present
2022-10-26 15:47:06 +02:00
Julian Risch
d0691a4bd5
bug: replace decorator with counter attribute for pipeline event (#3462) 2022-10-26 12:09:04 +02:00
bogdankostic
4fbe80c098
feat: Extraction of headlines in markdown files (#3445)
* Extract headings from markdown files + adapt PreProcessor

* Add tests

* Fix mypy

* Generate JSON schema

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/nodes/file_converter/markdown.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply black

* Add PR feedback

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-10-26 11:57:55 +02:00
Vladimir Blagojevic
5ca96357ff
feat: Add CohereEmbeddingEncoder to EmbeddingRetriever (#3453) 2022-10-25 17:52:29 +02:00
Stefano Fiorucci
54ec13eaf7
refactor: Change no_answer attribute (#3411)
* always run validation

* update schemas

* no_answer as a property. break things!

* forgotten schema

* fix

* update openapi

* removed my unnecessary test

* fix sql document store

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-10-25 13:07:00 +02:00
Mayank Jobanputra
d48577b4e7
bug: removed duplicated meta "name" field addition to content before embedding in update_embeddings workflow (#3368)
* Removed explicit passage formatting by name field

* passing correct input type for embedding the docs

* Updated test, updated similarity scores and added results

* changed expected input to embed method
2022-10-25 14:52:05 +05:30
Sara Zan
cbf44413d8
feat: add __cointains__ to Span (#3446)
* add __contains__

* add tests
2022-10-21 13:58:17 +02:00
Vladimir Blagojevic
8f31228211
feat: Add exponential backoff decorator; apply it to OpenAI requests (#3398) 2022-10-19 17:47:38 +02:00
Ursin Brunner
5fedfb03b0
fix: Fix the error of wrong page numbers when documents contain empty pages. (#3330)
* Fix the error of wrong page numbers when documents contain empty pages.

* Reformat using git hooks.

* Use a more descriptive placeholder
2022-10-18 17:51:02 +02:00
Sebastian
93817f63b4
feat: Speed up integration tests (nodes) (#3408)
* Changed summarizer model to a smaller one (2GB to 500MB) to save on space and speed up the tests.

* Removed google pegasus from cache
2022-10-18 16:23:57 +02:00
Sebastian
15a59fd040
feat: Updated EntityExtractor to handle long texts and added better postprocessing (#3154)
* Remove dependence on HuggingFace TokenClassificationPipeline and group all postprocessing functions under one class

* Added copyright notice for HF and deepset to entity file to acknowledge that a lot of the postprocessing parts came from the transformers library.

* Fixed text squishing problem. Added additional unit test for it.

Co-authored-by: ju-gu <julian.gutsch@deepset.ai>
2022-10-17 21:26:44 +02:00
Unai Garay Maestre
3a2c8ae3c5
bug: Adds better way of checking query in BaseRetriever and Pipeline.run() (#3304)
* changes how query and queries are checked if they have been passed in BaseRetriever

* Fixes checking query properly in Pipeline run

* Fixes checking query properly in Pipeline run

* Adds test for FilterRetriever using run method when query is empty

* Adds mock filter retriever and adapts test

* Removes old test, adds MockRetriever to test file and test uses document_store

* Logs error when query is not of type string with a new test for run batch

* Update test/nodes/test_retriever.py

* schemas
2022-10-17 19:00:13 +02:00
Sara Zan
101d2bc86c
feat: MultiModalRetriever (#2891)
* Adding Data2VecVision and Data2VecText to the supported models and adapt Tokenizers accordingly

* content_types

* Splitting classes into respective folders

* small changes

* Fix EOF

* eof

* black

* API

* EOF

* whitespace

* api

* improve multimodal similarity processor

* tokenizer -> feature extractor

* Making feature vectors come out of the feature extractor in the similarity head

* embed_queries is now self-sufficient

* couple trivial errors

* Implemented separate language model classes for multimodal inference

* Document embedding seems to work

* removing batch_encode_plus, is deprecated anyway

* Realized the base Data2Vec models are not trained on retrieval tasks

* Issue with the generated embeddings

* Add batching

* Try to fit CLIP in

* Stub of CLIP integration

* Retrieval goes through but returns noise only

* Still working on the scores

* Introduce temporary adapter for CLIP models

* Image retrieval now works with sentence-transformers

* Tidying up the code

* Refactoring is now functional

* Add MPNet to the supported sentence transformers models

* Remove unused classes

* pylint

* docs

* docs

* Remove the method renaming

* mpyp first pass

* docs

* tutorial

* schema

* mypy

* Move devices setup into get_model

* more mypy

* mypy

* pylint

* Move a few params in HaystackModel's init

* make feature extractor work with squadprocessor

* fix feature_extractor_kwargs forwarding

* Forgotten part of the fix

* Revert unrelated ES change

* Revert unrelated memdocstore changes

* comment

* Small corrections

* mypy and pylint

* mypy

* typo

* mypy

* Refactor the  call

* mypy

* Do not make FARMReader use the new FeatureExtractor

* mypy

* Detach DPR tests from FeatureExtractor too

* Detach processor tests too

* Add end2end marker

* extract end2end feature extractor tests

* temporary disable feature extraction tests

* Introduce end2end tests for tokenizer tests

* pylint

* Fix model loading from folder in FeatureExtractor

* working o n end2end

* end2end keeps failing

* Restructuring retriever tests

* Restructuring retriever tests

* remove covert_dataset_to_dataloader

* remove comment

* Better check sentence-transformers models

* Use embed_meta_fields properly

* rename passage into document

* Embedding dims can't be found

* Add check for models that support it

* pylint

* Split all retriever tests into suites, running mostly on InMemory only

* fix mypy

* fix tfidf test

* fix weaviate tests

* Parallelize on every docstore

* Fix schema and specify modality in base retriever suite

* tests

* Add first image tests

* remove comment

* Revert to simpler tests

* Update docs/_src/api/api/primitives.md

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/__init__.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* get_args

* mypy

* Update haystack/modeling/model/multimodal/__init__.py

* Update haystack/modeling/model/multimodal/base.py

* Update haystack/modeling/model/multimodal/base.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/sentence_transformers.py

* Update haystack/modeling/model/multimodal/sentence_transformers.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/transformers.py

* Update haystack/modeling/model/multimodal/transformers.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/transformers.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/nodes/retriever/multimodal/retriever.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* mypy

* mypy

* removing more ContentTypes

* more contentypes

* pylint

* add to __init__

* revert end2end workflow for now

* missing integration markers

* Update haystack/nodes/retriever/multimodal/embedder.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* review feedback, removing HaystackImageTransformerModel

* review feedback part 2

* mypy & pylint

* mypy

* mypy

* fix multimodal docs also for Pinecone

* add note on internal constants

* Fix pinecone write_documents

* schemas

* keep support for sentence-transformers only

* fix pinecone test

* schemas

* fix pinecone again

* temporarily disable some tests, need to understand if they're still relevant

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2022-10-17 18:58:35 +02:00
Vladimir Blagojevic
159cd5a666
feat: Add OpenAIEmbeddingEncoder to EmbeddingRetriever (#3356) 2022-10-14 15:01:03 +02:00
Vladimir Blagojevic
5ebe3cb33d
fix: QuestionGenerator generates wrong document questions for non-default num_queries_per_doc parameter (#3381) 2022-10-14 12:08:30 +02:00
Stefano Fiorucci
7290196c32
fix: allow same vector_id in different indexes for SQL-based Document stores (#3383)
* fix_multiple_indexes

* improve test names
2022-10-14 09:55:56 +02:00
Massimiliano Pippi
31fa75e9fd
feat: add support for Elasticsearch 7.16.2 (#3318)
* bump elastic to 7.16.2+

* decouple Elasticsearch and Opensearch

use method override instead of func variables

fix mypy

default value

fix broken tests

update schema

* relax version pin

* rename the base class

* rename module

* fix import order

* do not run the new tests in the old job

* remove outdated TODO
2022-10-13 11:53:27 +02:00
Sebastian
75641dd024
fix: Added checks for DataParallel and WrappedDataParallel (#3366)
* Added checks for DataParallel and WrappedDataParallel

* Update isinstance checks according to pylint recommendation

* Using isinstance over types

* Added test for dpr training
2022-10-13 08:05:56 +02:00
Malte Pietsch
fb02b61e90
Update README.md (#3247) 2022-10-11 10:43:17 +02:00
tstadel
7fe5003c97
fix: eval() with add_isolated_node_eval=True breaks if no node supports it (#3347)
* fix isolated eval for pipelines without a node supporting isolated mode

* reformat

* add test
2022-10-10 20:48:13 +02:00
bogdankostic
84aff5e2b3
fix: Allow less restrictive values for parameters in Pipeline configurations (#3345)
* fix: Allow arbitrary values for parameters in Pipeline configurations

* Add test

* Adapt expected error message in tests

* Fix bug

* Fix bug on checking JSON

* Remove test cases that previously tested if error was thrown

* Change encoding in test

* Restrict possible values

* Re-add tests

* Re-add tests

* Add value flag to list elements
2022-10-10 13:08:45 +02:00