1675 Commits

Author SHA1 Message Date
Sara Zan
8ddeda811a
generate docs for search.engine.py (#3507) 2022-10-31 16:57:39 +01:00
Massimiliano Pippi
9fe2f69d56
add workflow to triage new issues with GH projects (#3508) 2022-10-31 16:01:59 +01:00
Massimiliano Pippi
b694c7b5cb
Document Store test refactoring (#3449)
* add new marker

* start using test hierarchies

* move ES tests into their own class

* refactor test workflow

* job steps

* add more tests

* move more tests

* more tests

* test labels

* add more tests

* Update tests.yml

* Update tests.yml

* fix

* typo

* fix es image tag

* map es ports

* try

* fix

* default port

* remove opensearch from the markers sorcery

* revert

* skip new tests in old jobs

* skip opensearch_faiss
2022-10-31 15:30:14 +01:00
Mayank Jobanputra
85cdc1040a
Added telemetry changes (#3503) 2022-10-31 12:49:52 +01:00
Sara Zan
adc982a624
fix: do not reference package directory in PDFToTextOCRConverter.convert() (#3478)
* remove weird temp path from PDFToTextOCRConverter.convert()

* remove debug lines

* remove os import
2022-10-31 12:48:43 +01:00
Massimiliano Pippi
17cd79e2c8
[release process] Create new schema when bumping unstable (#3416)
* also create new schema when bumping unstable version

* openapi schema

* no need to update the json schema anymore
2022-10-31 12:26:48 +01:00
Sara Zan
54cc9cd4cf
refactor: remove json-schemas (#3485)
* remove json-schemas

* main schema can be removed too

* add .gitignore to schemas folder

* try to explicitly get the new haystack in the rest api tests

* fix workflow again

* fix version string in rest api tests

* add pip freeze

* debug statements in workflow

* -U prevents schema generation
2022-10-31 11:24:43 +01:00
Massimiliano Pippi
b52ed52c4e
fix docker minimal deprecated image (#3497) 2022-10-28 16:46:48 +02:00
Sebastian
384663981d
Fixed bug in onnx converter for XLMRoberta architecture (#3470) 2022-10-28 15:35:53 +02:00
Massimiliano Pippi
9f4a9a76a3
fix: pattern to match tags push (#3469) 2022-10-28 14:52:30 +02:00
Sara Zan
823d0d3006
Add Schemas badge on README.md (#3493) 2022-10-28 13:57:42 +02:00
Sara Zan
a66e7caa34
feat: hatch-autorun generates schemas (#3484)
* hatch-run generates the schemas

* fix path

* keep schemas for now

* fix path

* schemas

* Do not generate rc schemas

* make the autorun hook self-destroy

* typo

* schemas

* schemas were ok

* improve logs to make generate_schema.py usable standalone too

* fix warning

* Update warning

* Update generate_schema.py

* black
2022-10-28 13:55:11 +02:00
Massimiliano Pippi
1f9f4ab03a
fix: fix docs badge (#3491)
* fix: fix docs badge

* format

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2022-10-28 11:59:49 +02:00
Sara Zan
f377b78263
refactor: replace YAML schema check with a dispatch call (#3482)
* Replace yaml check with a dispatch call

* split workflow

* add branch for testing

* access secrets properly

* remove testing branch trigger
2022-10-28 10:48:59 +02:00
Sebastian
8db7dfb884
refactor: TableReader (#3456)
* Refactoring table reader
2022-10-26 20:57:28 +02:00
Sebastian
59857cb492
feat: Speed up reader tests (#3476)
* Use a smaller reader where possible

* Change scope to module of reader to get faster load times
2022-10-26 19:04:18 +02:00
Tuana Celik
a4002ae87c
Updating readme to point to new docs site (#3336)
* Updating readme to point to new docs site

* updating some links

* updating docs link
2022-10-26 17:28:46 +02:00
Sara Zan
dd774b867d
add missing schemas (#3480) 2022-10-26 17:27:42 +02:00
Sara Zan
05c68b6624
feat: add document_store to all BaseRetriever.retrieve() and BaseRetriever.retrieve_batch() implementations (#3379)
* add document_store to retrieve()

* mypy & pylint

* pass docstore to embedding encoders

* schemas

* mypy and pylint

* fix tfidfretriever

* pylint

* mypy

* pylint

* fix tfidf

* mypy

* pylint

* schemas

* another fix for tfidf

* fix question generation tests

* remove docstore from embedding encoder signature

* pylint

* revert accidental test changes

* Apply suggestions from code review

* check for docstore similarity function only if the docstore is present

* check for docstore similarity function only if the docstore is present
2022-10-26 15:47:06 +02:00
Julian Risch
d0691a4bd5
bug: replace decorator with counter attribute for pipeline event (#3462) 2022-10-26 12:09:04 +02:00
bogdankostic
4fbe80c098
feat: Extraction of headlines in markdown files (#3445)
* Extract headings from markdown files + adapt PreProcessor

* Add tests

* Fix mypy

* Generate JSON schema

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/nodes/file_converter/markdown.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply black

* Add PR feedback

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-10-26 11:57:55 +02:00
Vladimir Blagojevic
5ca96357ff
feat: Add CohereEmbeddingEncoder to EmbeddingRetriever (#3453) 2022-10-25 17:52:29 +02:00
Branden Chan
7b15799853
Change slug and title (#3474) 2022-10-25 16:41:27 +01:00
Stefano Fiorucci
a2d459dbed
fix: warning if doc store similarity function is incompatible with Sentence Transformers model (#3455)
* check_docstore_similarity_function

* remove import
2022-10-25 17:00:35 +02:00
Stefano Fiorucci
54ec13eaf7
refactor: Change no_answer attribute (#3411)
* always run validation

* update schemas

* no_answer as a property. break things!

* forgotten schema

* fix

* update openapi

* removed my unnecessary test

* fix sql document store

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-10-25 13:07:00 +02:00
Julian Risch
6a422d588f
fix: disabling telemetry prevents writing config (#3465)
* fix: disabling telemetry prevents writing config

* set user id to empty string if telemetry disabled

* Update haystack/telemetry.py

* set id to None instead of "" in error case

* remove RuntimeError if user id is not set

* Revert "remove RuntimeError if user id is not set"

This reverts commit c59f06d47216afa7ada6199b03f1b09a2b936c02.

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-10-25 12:11:54 +02:00
Mayank Jobanputra
d48577b4e7
bug: removed duplicated meta "name" field addition to content before embedding in update_embeddings workflow (#3368)
* Removed explicit passage formatting by name field

* passing correct input type for embedding the docs

* Updated test, updated similarity scores and added results

* changed expected input to embed method
2022-10-25 14:52:05 +05:30
Vladimir Blagojevic
1b9586ae40
Add indexing pipeline type (#3461) 2022-10-24 17:26:15 +02:00
Timo Moeller
9b931bbf66
Fix prompt length computation (#3448) 2022-10-24 11:59:54 +02:00
Sara Zan
cbf44413d8
feat: add __contains__ to Span (#3446)
* add __contains__

* add tests
2022-10-21 13:58:17 +02:00
Unai Garay Maestre
e41cb24358
Feat: allow decreasing size of datasets loaded from BEIR (#3392)
* Adds cropping of dataset in eval beir

* Adapts queries to remaining cropped documents

* Adds logging warning if num_documents has an invalid value

* Adapts to linting suggestions
2022-10-21 13:54:20 +02:00
Branden Chan
03ba07dcb5
docs: Extend utils API docs coverage (#3402)
* Add more utils modules

* Format docstrings

* Incorporate reviewer feedback
2022-10-21 12:51:11 +01:00
Massimiliano Pippi
df4d20d32c
fix the readme version to sync (#3417) 2022-10-20 16:50:36 +02:00
Vladimir Blagojevic
79c6063ac2
feat: send event if number of queries exceeds threshold (#3419)
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2022-10-20 16:02:45 +02:00
Branden Chan
3f956c75f4
Add multimodal retrieval to API docs (#3430) 2022-10-20 15:07:48 +02:00
Stefano Fiorucci
abdcb8124b
update pyworld pin (#3435) 2022-10-20 12:28:38 +02:00
Stefano Fiorucci
8c1a34494d
refactor: update package strategy in ui (#3396)
* update ui package: first try

* update README

* fixes

* update schemas

* restore schemas

* use matrix folder in tests

* fix tests

* fix schemas

* really fix schemas

* don't use matrix folder

* remove blank line

* cleaner pytest command
2022-10-20 12:18:03 +02:00
Stefano Fiorucci
3860bb9966
fix: improve Document __repr__ (#3385)
* fix document __repr__

* take the best from 2 approaches

* fix schema
2022-10-19 22:32:23 +02:00
Vladimir Blagojevic
8f31228211
feat: Add exponential backoff decorator; apply it to OpenAI requests (#3398) 2022-10-19 17:47:38 +02:00
Massimiliano Pippi
5335e9e4d9
Add new schema for latest unstable (#3415)
* add new schema for latest unstable

* openapi
2022-10-19 13:21:05 +02:00
Julian Risch
16723bf180
bug: change type of split_by to Literal including None (#3389)
* change type of split_by

* fix mypy and update schema files

* change split_by type to Literal

* handle ImportError for Literal py<3.8
2022-10-19 10:11:41 +02:00
github-actions[bot]
f4a49f7178
Bump version (#3409)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2022-10-18 18:05:48 +02:00
Ursin Brunner
5fedfb03b0
fix: Fix the error of wrong page numbers when documents contain empty pages. (#3330)
* Fix the error of wrong page numbers when documents contain empty pages.

* Reformat using git hooks.

* Use a more descriptive placeholder
2022-10-18 17:51:02 +02:00
Sebastian
51d4fe01c3
fix: Update env variable for model caching timeout (#3405)
* fix: Update env variable for model caching timeout

The environment variable used to set the timeout for the model caching step had a typo in it from the maintainers of `actions/cache@v3`, which is why it has not been working (see comment [here](https://github.com/actions/cache/issues/810#issuecomment-1281895575)).

* Removed newline
2022-10-18 17:36:25 +02:00
Branden Chan
cf4642a5f8
[CI] Create Github Workflow that creates a new version branch in Haystack and Readme (#3335)
* Test readme_integration.yml

* Test readme_integration.yml

* Test variables

* Test variables

* Test variables

* Test variables

* Test commit

* Test commit

* Test commit

* Trigger action

* Add v

* Trigger action

* Trigger action

* Trigger action

* Trigger action

* Update API docs headers

* Revert "Update API docs headers"

This reverts commit 34e665063f4de29854befe575a795dbfef04415c.

* Trigger action

* Trigger action

* Trigger action

* Update release

* Update release

* Update release

* Delete File

* Split steps into own files

* Edit action names

* Start making changes

* Start implementing version bump

* Implement minor version release

* Fix github action

* Test action

* Test action

* Test action

* Test action

* Test action

* Change back to main

* Add comments

* Remove line

* Format docstring

* Incorporate reviewer feedback

* Fix variable name

* Print version.txt

* Incorporate Reviewer feedback

* Rename variables for clarity

* Add fetch

* Change branch

* Change branch

* Change branch

* Change branch

* Change branch

* Revert docstring changes

* Incorporate reviewer feedback

* Run black

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2022-10-18 17:09:43 +02:00
Sebastian
93817f63b4
feat: Speed up integration tests (nodes) (#3408)
* Changed summarizer model to a smaller one (2GB to 500MB) to save on space and speed up the tests.

* Removed google pegasus from cache
2022-10-18 16:23:57 +02:00
Branden Chan
3bf5d4350f
docs: Add comment about the generation of no-answer samples in FARMReader training (#3404)
* Add comment about no-answer generation

* Add comment about no-answer generation

* Fix typo

Co-authored-by: Sebastian <sjrl@users.noreply.github.com>

* Incorporate reviewer feedback

* Incorporate reviewer feedback

Co-authored-by: Sebastian <sjrl@users.noreply.github.com>
2022-10-18 14:37:37 +02:00
Sebastian
15a59fd040
feat: Updated EntityExtractor to handle long texts and added better postprocessing (#3154)
* Remove dependence on HuggingFace TokenClassificationPipeline and group all postprocessing functions under one class

* Added copyright notice for HF and deepset to entity file to acknowledge that a lot of the postprocessing parts came from the transformers library.

* Fixed text squishing problem. Added additional unit test for it.

Co-authored-by: ju-gu <julian.gutsch@deepset.ai>
2022-10-17 21:26:44 +02:00
Unai Garay Maestre
3a2c8ae3c5
bug: Adds better way of checking query in BaseRetriever and Pipeline.run() (#3304)
* changes how query and queries are checked if they have been passed in BaseRetriever

* Fixes checking query properly in Pipeline run

* Fixes checking query properly in Pipeline run

* Adds test for FilterRetriever using run method when query is empty

* Adds mock filter retriever and adapts test

* Removes old test, adds MockRetriever to test file and test uses document_store

* Logs error when query is not of type string with a new test for run batch

* Update test/nodes/test_retriever.py

* schemas
2022-10-17 19:00:13 +02:00
Sara Zan
101d2bc86c
feat: MultiModalRetriever (#2891)
* Adding Data2VecVision and Data2VecText to the supported models and adapt Tokenizers accordingly

* content_types

* Splitting classes into respective folders

* small changes

* Fix EOF

* eof

* black

* API

* EOF

* whitespace

* api

* improve multimodal similarity processor

* tokenizer -> feature extractor

* Making feature vectors come out of the feature extractor in the similarity head

* embed_queries is now self-sufficient

* couple trivial errors

* Implemented separate language model classes for multimodal inference

* Document embedding seems to work

* removing batch_encode_plus, is deprecated anyway

* Realized the base Data2Vec models are not trained on retrieval tasks

* Issue with the generated embeddings

* Add batching

* Try to fit CLIP in

* Stub of CLIP integration

* Retrieval goes through but returns noise only

* Still working on the scores

* Introduce temporary adapter for CLIP models

* Image retrieval now works with sentence-transformers

* Tidying up the code

* Refactoring is now functional

* Add MPNet to the supported sentence transformers models

* Remove unused classes

* pylint

* docs

* docs

* Remove the method renaming

* mypy first pass

* docs

* tutorial

* schema

* mypy

* Move devices setup into get_model

* more mypy

* mypy

* pylint

* Move a few params in HaystackModel's init

* make feature extractor work with squadprocessor

* fix feature_extractor_kwargs forwarding

* Forgotten part of the fix

* Revert unrelated ES change

* Revert unrelated memdocstore changes

* comment

* Small corrections

* mypy and pylint

* mypy

* typo

* mypy

* Refactor the  call

* mypy

* Do not make FARMReader use the new FeatureExtractor

* mypy

* Detach DPR tests from FeatureExtractor too

* Detach processor tests too

* Add end2end marker

* extract end2end feature extractor tests

* temporary disable feature extraction tests

* Introduce end2end tests for tokenizer tests

* pylint

* Fix model loading from folder in FeatureExtractor

* working on end2end

* end2end keeps failing

* Restructuring retriever tests

* Restructuring retriever tests

* remove covert_dataset_to_dataloader

* remove comment

* Better check sentence-transformers models

* Use embed_meta_fields properly

* rename passage into document

* Embedding dims can't be found

* Add check for models that support it

* pylint

* Split all retriever tests into suites, running mostly on InMemory only

* fix mypy

* fix tfidf test

* fix weaviate tests

* Parallelize on every docstore

* Fix schema and specify modality in base retriever suite

* tests

* Add first image tests

* remove comment

* Revert to simpler tests

* Update docs/_src/api/api/primitives.md

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/__init__.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* get_args

* mypy

* Update haystack/modeling/model/multimodal/__init__.py

* Update haystack/modeling/model/multimodal/base.py

* Update haystack/modeling/model/multimodal/base.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/sentence_transformers.py

* Update haystack/modeling/model/multimodal/sentence_transformers.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/transformers.py

* Update haystack/modeling/model/multimodal/transformers.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/transformers.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/nodes/retriever/multimodal/retriever.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* mypy

* mypy

* removing more ContentTypes

* more contentypes

* pylint

* add to __init__

* revert end2end workflow for now

* missing integration markers

* Update haystack/nodes/retriever/multimodal/embedder.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* review feedback, removing HaystackImageTransformerModel

* review feedback part 2

* mypy & pylint

* mypy

* mypy

* fix multimodal docs also for Pinecone

* add note on internal constants

* Fix pinecone write_documents

* schemas

* keep support for sentence-transformers only

* fix pinecone test

* schemas

* fix pinecone again

* temporarily disable some tests, need to understand if they're still relevant

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2022-10-17 18:58:35 +02:00