3803 Commits

Author SHA1 Message Date
nickchomey
e6767fccef
bugfix for TranslationWrapperPipeline (#3290)
* bugfix for TranslationWrapperPipeline

* Update standard_pipelines.py

* Update haystack/pipelines/standard_pipelines.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
2022-10-04 09:44:48 +02:00
Jeff Risberg
ad8fbe56ee
bug: JoinDocuments nodes produce incorrect results if preceded by another JoinDocuments node (#3170)
* don't send the list of inputs back as an output in the running of a node.

* updated documentation

* Update pydoc-markdown.py

* added test case for pipeline join fix

Co-authored-by: JeffRisberg <jrisberg@aol.com>
2022-09-30 13:27:17 +02:00
Stefano Fiorucci
e2e6887ee8
Improve TransformersDocumentClassifier tests (#3270)
2022-09-27 13:25:34 +02:00
Taner Topal
24d4591307
docs: Fix a docstring in ray.py
2022-09-27 09:05:04 +02:00
Vladimir Blagojevic
9582a423a2
fix: ONNX FARMReader model conversion is broken (#3211)
2022-09-26 09:18:12 -04:00
Stefano Fiorucci
b579b9d54a
bug: make ElasticSearchDocumentStore use batch_size in get_documents_by_id (#3166)
* use batch_size

* try to fix git mess

* improve docstrings

* fix
2022-09-26 13:21:59 +02:00
Vladimir Blagojevic
9ca3ccae98
fix: MostSimilarDocumentsPipeline doesn't have pipeline property (#3265)
* Add comments and a unit test

* More unit tests for MostSimilarDocumentsPipeline
2022-09-23 09:46:48 -04:00
Vladimir Blagojevic
eba7cf51b1
chore: Remove Update API documentation hook (#3271)
* Remove Update API documentation hook

* Remove .github/utils/pydoc-markdown.py file
2022-09-23 08:54:08 -04:00
tstadel
05a86b9d3d
feat: FAISS in OpenSearch: Support HNSW for cosine (#3217)
* support cosine similarity with faiss

* update docs

* update api docs

* fix tests

* Revert "update api docs"

This reverts commit 6138fdfefb3beaee2d55c5729cd4a2745ea6b143.

* fix api docs

* collapse test

* rename similarity to space_type mappings

* only normalize for faiss

* fix merge

* fix docs normalization

* get rid of List[np.array]

* update docs

* fix tests and tutorials

* fix mypy

* fix mypy

* fix mypy again

* again mypy

* blacken

* update tutorial 4 docs

* fix embeddingretriever

* fix faiss

* move dense specific logic to DenseRetriever

* fix mypy

* cosine tests for all document stores

* fix pinecone

* add docstring

* docstring corrections

* update docs

* add integration test marker

* docstrings update

* update docs

* fix typo

* update docs

* fix MockDenseRetriever

* run integration tests for all documentstores

* fix test_update_embeddings_cosine_similarity

* fix faiss tests not running

* blacken

* make test_cosine_sanity_check integration test

* split PR

* update docs

* manually revert tutorial doc change

* Fix embedding type

* set integration marker correctly

* make BaseDocumentStore.normalize_embedding static

* format

* fix handling of opensearch_faiss param

* fix merge

* add DenseRetriever typing

* organize imports in conftest.py

* organize imports in conftest.py (2)

* fix DenseRetriever import

* add opensearch-tests-linux
2022-09-23 13:26:49 +02:00
tstadel
4fa9d2d8e7
Fix milvus and faiss tests not running (#3263)
* fix milvus and faiss tests not running

* fix schema manually

* fix test_dpr_embedding test for milvus

* pip freeze on milvus tests

* fix milvus1 tests being executed: fix all_doc_stores order

* Revert "pip freeze on milvus tests"

This reverts commit 75ebb6f7e507bb8477e87d9e63b4a294f7946cab.

* make infer_required_doc_store more robust

* don't skip tests without docstore requirements

* use markers for docstore tests
2022-09-22 17:46:49 +02:00
Massimiliano Pippi
2b803a265b
run checks on release branches (#3267)
2022-09-22 16:25:34 +02:00
Vladimir Blagojevic
820742cac7
Fix schema for 1.10.x (#3269)
2022-09-22 15:20:51 +02:00
tstadel
b10e2c392e
chore: add DenseRetriever abstraction (#3252)
* support cosine similarity with faiss

* update docs

* update api docs

* fix tests

* Revert "update api docs"

This reverts commit 6138fdfefb3beaee2d55c5729cd4a2745ea6b143.

* fix api docs

* collapse test

* rename similarity to space_type mappings

* only normalize for faiss

* fix merge

* fix docs normalization

* get rid of List[np.array]

* update docs

* fix tests and tutorials

* fix mypy

* fix mypy

* fix mypy again

* again mypy

* blacken

* update tutorial 4 docs

* fix embeddingretriever

* fix faiss

* move dense specific logic to DenseRetriever

* fix mypy

* cosine tests for all document stores

* fix pinecone

* add docstring

* docstring corrections

* update docs

* add integration test marker

* docstrings update

* update docs

* fix typo

* update docs

* fix MockDenseRetriever

* run integration tests for all documentstores

* fix test_update_embeddings_cosine_similarity

* fix faiss tests not running

* blacken

* make test_cosine_sanity_check integration test

* update docs

* fix imports

* import DenseRetriever normally

* update docs

* fix deepcopy of documents

* update schema

* Revert "update schema"

This reverts commit 83cf8f323648468e1c322d54852bec084d637e3f.

* fix schema for ci manually
2022-09-21 19:08:54 +02:00
Branden Chan
492a8046d8
docs: sync Haystack API with Readme (#3223)
* First pass at syncing Haystack API with Readme

* Reapply changes

* Regularize slugs

* Regularize slugs

* Regularize slugs

* Set category id and regen

* Trigger workflow

* Delete old md files

* Test sync

* Undo test string

* Incorporate reviewer feedback

* Test on the fly API generation and sync

* Test on the fly API generation and sync

* Test on the fly API generation and sync

* Test on the fly API generation and sync

* Test on the fly API generation and sync

* Change name of pydoc-markdown scripts

* Test on the fly API generation and sync

* Remove version tag

* Test version tag

* Test version tag

* Test version tag

* Revert test docstring

* Revert md file changes

* Revert md file changes

* Revert script naming

* Test on the fly generation and sync

* Adjust for on the fly generation and sync

* Revert test string

* Remove old documentation workflow

* Set workflow to work on main

* Change readme version name
2022-09-21 17:18:34 +02:00
Massimiliano Pippi
8f76d64f6f
chore: bump release number for unstable version (#3251)
* bump version for unstable

* allow generation of rc schemas

* update schemas
2022-09-21 16:58:06 +02:00
Vladimir Blagojevic
938e6fda5b
Classify pipeline's type based on its components (#3132)
* Add pipeline get_type method

* Add pipeline uptime

* Add pipeline telemetry event sending

* Send pipeline telemetry once a day (at most)

* Add pipeline invocation counter, change invocation counter logic

* Update allowed telemetry parameters - allow pipeline parameters

* PR review: add unit test
2022-09-21 14:53:42 +02:00
Stefano Fiorucci
89247b804c
refactor: make TransformersDocumentClassifier output consistent between different types of classification (#3224)
* make output consistent

* make output consistent

* added tests for details

* better tests

* Update test_document_classifier.py

* make black happy

* Update test_document_classifier.py

* Update test_document_classifier.py
2022-09-21 13:16:03 +02:00
Massimiliano Pippi
15bb6c2ea2
remove tutorials from the repo (#3244)
2022-09-20 18:32:45 +02:00
Tuana Celik
336c144e72
chore: updating colab links in older docs versions (#3250)
* updating colab links to tutorial 1

* remaining tutorials

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2022-09-20 18:15:29 +02:00
Vladimir Blagojevic
fe31896fcb
Proper retrieval of answers for batch eval (#3245)
* Proper retrieval of answers and documents for batch eval
2022-09-20 08:16:03 -04:00
Malte Pietsch
7e79a48540
bug: reactivate benchmarks with quick fixes (#2766)
* quick fix benchmark runs to make them work with current haystack version

* fix minor typo

* update readme. fix minor things to make benchmarks run again

* Update Documentation & Code Style

* fix typo in readme

* update result files for reader and retriever querying

* reduce batch size for update embeddings to prevent xlarge bulk_update requests that exceed elastic's limits (happening in dense 500k runs)

* change default memory allocation back to normal. add note to readme

* add first indexing results

* add memory to docker cmd

* full benchmarks results on commit c5a2651fcbbeffca06ffa9036b10e62669bcc1b0

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-09-20 10:22:08 +02:00
Massimiliano Pippi
9399ddf949
fix pydoc-markdown hook (#3238)
2022-09-19 18:20:35 +02:00
Sara Zan
dcb132ba59
chore: remove f-strings from logs for performance reasons (#3212)
* Use the %s syntax on all debug messages

* Use the %s syntax on some more debug messages

* Use the %s syntax on info messages

* Use the %s syntax on warning messages

* Use the %s syntax on error and exception messages

* mypy

* pylint

* trigger tutorials execution in CI

* trigger tutorials execution on CI

* black

* remove embeddings from repr

* fix Document `__repr__`

* address feedback

* mypy
2022-09-19 18:18:32 +02:00
Massimiliano Pippi
8fbccbda82
fix: handle Documents containing dataframes in Multilabel constructor (#3237)
* format

* fix docs
2022-09-19 14:59:20 +02:00
banjocustard
19af6f4e40
bug: fix pdftotext installation verification (#3233)
2022-09-19 11:32:58 +02:00
Massimiliano Pippi
859c303c16
include fontconfig in the final image and fix tagging (#3230)
2022-09-16 15:33:24 +02:00
Malte Pietsch
3134b0d679
fix: type of temperature param and adjust defaults for OpenAIAnswerGenerator (#3073)
* fix: type of temperature param and adjust defaults

* update schema

* update api docs
2022-09-16 14:11:33 +02:00
Massimiliano Pippi
4ddeb7b14b
chore: fix Windows CI (#3222)
* replicate issue

* pin openjdk version

* not sure it's needed
2022-09-16 13:08:30 +02:00
nickchomey
42c963f54b
Update rest_api Docker Compose yamls for recent refactoring of rest_api (#3197)
* update rest_api yamls for recent refactoring

* Update docker-compose.yml
2022-09-15 19:47:40 +02:00
Anam Saatvik Reddy
f50b496f03
bug: fix embedding_dim mismatch in DocumentStore (#3183)
* match index dim with embed dim (deepset-ai#3090)

* aligned messages across all docstores

* aligned messages across all docstores (deepset-ai#3090)

* aligned messages across all docstores (deepset-ai#3090)
2022-09-15 15:23:53 +02:00
Sara Zan
768583d00c
chore: disable Windows ES tests on CI (#3220)
* disable Windows ES tests

* Add comments
2022-09-15 15:18:29 +02:00
Daniel Bichuetti
df1f4205b6
feat: add public layout-based extraction support on PDFToTextConverter (#3137)
* feat(PDFToTextConverter): add option to get text in physical layout order

* test: add physical layout extraction test to PDFToTextConverter

* refactor: change layout parameter attribution places

* docs: manually trigger pre-commits

* docs: generate new docs to comply with pydoc-markdown style
2022-09-13 16:55:21 +02:00
Kristof Herrmann
da1cc577ae
feat: exponential backoff with exp decreasing batch size for opensearch client (#3194)
* Validate custom_mapping properly as an object

* Remove related test

* black

* feat: exponential backoff with exp dec batch size

* added docstring and split doc list

* fix

* fix mypy

* fix

* catch generic exception

* added test

* mypy ignore

* fixed no attribute

* added test

* added tests

* revert strange merge conflicts

* revert merge conflict again

* Update haystack/document_stores/elasticsearch.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* done

* adjust test

* remove not required caplog

* fixed comments

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2022-09-13 14:30:30 +01:00
Sara Zan
b47c93989b
remove imports redirect (#3204)
2022-09-13 11:16:39 +01:00
Sara Zan
49b1c8856e
test: lower low boundary for accuracy in test_calculate_context_similarity_on_non_matching_contexts (#3199)
* Change min value

* revert test change and pin rapidfuzz<2.8.0

* duplicate
2022-09-13 09:32:38 +02:00
Massimiliano Pippi
64b0c43885
refactoring: reimplement Docker strategy (#3162)
* setup base images

* add cpu flavor

* use the same Dockerfile for cpu and gpu

* better naming, add docs

* add docker workflow

* add missing image input

* change cwd for bake

* also push api images

* try conditional tagging for releases

* revert testing code

* update docker readme

* document variable override

* use Python 3.10

* allow empty HAYSTACK_EXTRAS

* Apply suggestions from code review

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* remove repo description step, can't make it work so far

* add docs to the last step as it's tricky

* manage tags for the newest images

* tests are passing, checking in the last bit

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-09-12 16:33:56 +02:00
Bijay Gurung
21aedc644f
feat: Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers (#3164)
* Add option to use MultipleNegativesRankingLoss

Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever
training with sentence-transformers

* Move out losses into separate retriever/_losses.py module

* Remove unused import in retriever/_losses.py

* Apply documentation suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-09-12 09:38:04 +02:00
Sebastian
fc07799206
feat: Updates docs and types for language param in PreProcessor (#3186)
* Small update to language param docs in PreProcessor
2022-09-12 08:52:52 +02:00
Sara Zan
96bb9b5905
bug: validate custom_mapping as an object (#3189)
* Validate custom_mapping properly as an object

* Remove related test

* black
2022-09-09 18:03:29 +02:00
Daniel Bichuetti
621e1af74c
refactor: improve support for dataclasses (#3142)
* refactor: improve support for dataclasses

* refactor: refactor class init

* refactor: remove unused import

* refactor: testing 3.7 diffs

* refactor: checking meta where is Optional

* refactor: reverting some changes on 3.7

* refactor: remove unused imports

* build: manual pre-commit run

* doc: run doc pre-commit manually

* refactor: post initialization hack for 3.7-3.10 compat.

TODO: investigate another method to improve 3.7 compatibility.

* doc: force pre-commit

* refactor: refactored for both Python 3.7 and 3.9

* docs: manually run pre-commit hooks

* docs: run api docs manually

* docs: fix wrong comment

* refactor: change no type-checked test code

* docs: update primitives

* docs: api documentation

* docs: api documentation

* refactor: minor test refactoring

* refactor: remove unused enumeration on test

* refactor: remove unneeded dir in gitignore

* refactor: exclude all private fields and change meta def

* refactor: add pydantic comment

* refactor: fix for mypy on Python 3.7

* refactor: revert custom init

* docs: update docs to new pydoc-markdown style

* Update test/nodes/test_generator.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
2022-09-09 11:31:37 +02:00
Daniel Bichuetti
1a6cbca9b6
feat: add health check endpoint to rest api (#3168)
* feat: add /health endpoint to rest api

* refactor: adjust to new dir structure

* fix: add new rest api dependency

* docs: add new openapi schema

* docs: manual black run

* refactor: remove some sys-wide details

* docs: minor description changes

* docs: minor description changes

* docs: generate openapi schemas

* tests: improved tests

* refactor: add cls method decorator
2022-09-08 18:24:16 +02:00
Vladimir Blagojevic
e0d73f3ae0
Replace torch.device(cuda) with torch.device(cuda:0) in devices initialization (#3184)
2022-09-08 09:36:38 -04:00
Vladimir Blagojevic
20880c9d41
Add 15 min timeout for downloading cached HF models (#3179)
2022-09-07 08:35:09 -04:00
Sebastian
62e7c19011
fix: Reduce GPU to CPU copies at inference (#3127)
* Send matrix from gpu to cpu once instead of individual elements

* Moved location of if statement so it would be triggered only when
needed. Provides very modest speedup for large top_k_per_sample
2022-09-07 11:00:05 +02:00
Steven Haley
9a750f7032
docs: Fix the word length splitting; should be set to 100 not 1,000 (#3133)
* Fix the word length splitting; should be set to 100 not 1,000 due to limitations of transformer models

* Update documentation for tutorial change
2022-09-07 10:57:54 +02:00
Vladimir Blagojevic
84acb6584f
Type all parameter constructors, add model_version optional parameter where applicable (#3152)
2022-09-06 05:05:42 -04:00
Sebastian
20c2320434
Fix for torch device (#3161)
2022-09-06 09:03:52 +02:00
Massimiliano Pippi
6790eaf7d8
refactor: update package strategy in rest_api (#3148)
* update packaging

* fix author metadata

* add newline

* add empty readme

* fix path to pipeline files

* fix pylint job

* fix metadata
2022-09-05 16:58:43 +02:00
Massimiliano Pippi
e2110644c4
docs: add tests types to CONTRIBUTING.md (#3158)
* Update CONTRIBUTING.md

Add the outcome of #2811 to the developers docs

Ideally, newly added tests will follow those requirements while we progressively adapt the existing tests to the new model.

* address review comments
2022-09-05 16:56:48 +02:00
Daniel Bichuetti
e1f399284f
refactor: update dependencies and remove pins (#3147)
* refactor: remove azure-core, pydoc and hf-hub pins

* fix: remove extra-comma

* fix: force minimum version of azure forms recognizer

* refactor: allow newer ocr libs

* refactor: update more dependencies and container versions

* refactor: remove extra comment

* docs: pre-commit manual run

* refactor: remove unnecessary dependency

* tests: update weaviate container image version
2022-09-05 14:30:35 +02:00