306 Commits

Author SHA1 Message Date
Malte Pietsch
7e79a48540
bug: reactivate benchmarks with quick fixes (#2766)
* quick fix benchmark runs to make them work with current haystack version

* fix minor typo

* update readme. fix minor things to make benchmarks run again

* Update Documentation & Code Style

* fix typo in readme

* update result files for reader and retriever querying

* reduce batch size for update embeddings to prevent xlarge bulk_update requests that exceed elastic's limits (happening in dense 500k runs)

* change default memory allocation back to normal. add note to readme

* add first indexing results

* add memory to docker cmd

* full benchmarks results on commit c5a2651fcbbeffca06ffa9036b10e62669bcc1b0

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-09-20 10:22:08 +02:00
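The batch-size reduction mentioned in the commit above can be illustrated with a minimal sketch. It only shows the general idea of splitting a large update into smaller bulk requests; the `batched` helper and the batch size are hypothetical and are not taken from the benchmark code.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `batch_size` items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, chunk.
        yield batch


# Each chunk would become one bulk request instead of a single oversized one.
chunks = list(batched(range(5), batch_size=2))
print(chunks)
```

A smaller batch size means more requests, but each one stays below the size limit of the backing store.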
Massimiliano Pippi
9399ddf949
fix pydoc-markdown hook (#3238) 2022-09-19 18:20:35 +02:00
Massimiliano Pippi
8fbccbda82
fix: handle Documents containing dataframes in Multilabel constructor (#3237)
* format

* fix docs
2022-09-19 14:59:20 +02:00
Malte Pietsch
3134b0d679
fix: type of temperature param and adjust defaults for OpenAIAnswerGenerator (#3073)
* fix: type of temperature param and adjust defaults

* update schema

* update api docs
2022-09-16 14:11:33 +02:00
Daniel Bichuetti
df1f4205b6
feat: add public layout-base extraction support on PDFToTextConverter (#3137)
* feat(PDFToTextConverter): add option to get text in physical layout order

* test: add physical layout extraction test to PDFToTextConverter

* refactor: change layout parameter attribution places

* docs: manually trigger pre-commits

* docs: generate new docs to comply with pydoc-markdown style
2022-09-13 16:55:21 +02:00
Bijay Gurung
21aedc644f
feat: Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers (#3164)
* Add option to use MultipleNegativesRankingLoss

Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever
training with sentence-transformers

* Move out losses into separate retriever/_losses.py module

* Remove unused import in retriever/_losses.py

* Apply documentation suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-09-12 09:38:04 +02:00
Sebastian
fc07799206
feat: Updates docs and types for language param in PreProcessor (#3186)
* Small update to language param docs in PreProcessor
2022-09-12 08:52:52 +02:00
Daniel Bichuetti
621e1af74c
refactor: improve support for dataclasses (#3142)
* refactor: improve support for dataclasses

* refactor: refactor class init

* refactor: remove unused import

* refactor: testing 3.7 diffs

* refactor: checking meta where is Optional

* refactor: reverting some changes on 3.7

* refactor: remove unused imports

* build: manual pre-commit run

* doc: run doc pre-commit manually

* refactor: post initialization hack for 3.7-3.10 compat.

TODO: investigate another method to improve 3.7 compatibility.

* doc: force pre-commit

* refactor: refactored for both Python 3.7 and 3.9

* docs: manually run pre-commit hooks

* docs: run api docs manually

* docs: fix wrong comment

* refactor: change no type-checked test code

* docs: update primitives

* docs: api documentation

* docs: api documentation

* refactor: minor test refactoring

* refactor: remove unused enumeration on test

* refactor: remove unneeded dir in gitignore

* refactor: exclude all private fields and change meta def

* refactor: add pydantic comment

* refactor : fix for mypy on Python 3.7

* refactor: revert custom init

* docs: update docs to new pydoc-markdown style

* Update test/nodes/test_generator.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
2022-09-09 11:31:37 +02:00
Daniel Bichuetti
1a6cbca9b6
feat: add health check endpoint to rest api (#3168)
* feat: add /health endpoint to rest api

* refactor: adjust to new dir structure

* fix: add new rest api dependency

* docs: add new openapi schema

* docs: manual black run

* refactor: remove some sys-wide details

* docs: minor description changes

* docs: minor description changes

* docs: generate openapi schemas

* tests: improved tests

* refactor: add cls method decorator
2022-09-08 18:24:16 +02:00
Vladimir Blagojevic
84acb6584f
Type all parameter constructors, add model_version optional parameter where applicable (#3152) 2022-09-06 05:05:42 -04:00
Daniel Bichuetti
e1f399284f
refactor: update dependencies and remove pins (#3147)
* refactor: remove azure-core, pydoc and hf-hub pins

* fix: remove extra-comma

* fix: force minimum version of azure forms recognizer

* refactor: allow newer ocr libs

* refactor: update more dependencies and container versions

* refactor: remove extra comment

* docs: pre-commit manual run

* refactor: remove unnecessary dependency

* tests: update weaviate container image version
2022-09-05 14:30:35 +02:00
Branden Chan
d4722c2ec5
Document FARMReader.train() evaluation report log level (#3129)
* Mention evaluation report logging level

* Mention evaluation report logging level
2022-09-01 10:58:47 +02:00
Vladimir Blagojevic
356537c883
Standardize devices parameter and device initialization (#3062)
* Use devices parameter and initialize devices consistently
2022-08-31 15:30:31 +02:00
Julian Risch
f010a17f04
increase version to next release candidate (#3115) 2022-08-29 17:05:44 +02:00
Julian Risch
4e518cdddd
chore: increase version for 1.8 release (#3109)
* increase version for 1.8 release

* ignore missing-timeout for pylint
2022-08-26 15:00:14 +02:00
Julian Risch
3e3ff33cdd
feat: add batch evaluation method for pipelines (#2942)
* add basic pipeline.eval_batch for qa without filters

* black formatting

* pydoc-markdown

* remove batch eval tests failing due to bugs

* remove comment

* explain commented out tests

* avoid code duplication

* black

* mypy

* pydoc markdown

* add batch option to execute_eval_run

* pydoc markdown

* Apply documentation suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply documentation suggestion from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* add documentation based on review comments

* black

* black

* schema updates

* remove duplicate tests

* add separate method for column reordering

* merge _build_eval_dataframe methods

* pylint ignore in function

* change type annotation of queries to list only

* one-liner addressing review comment on params dict

* markdown files updated

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-08-25 17:50:57 +02:00
Julian Risch
cc9d39c360
increase version to next release candidate (#3100) 2022-08-25 15:55:34 +02:00
Julian Risch
0950db5032
chore: increase version to 1.7.2 for patch release (#3097)
* schema update

* schema update audio nodes

* schema update audio param type
2022-08-25 13:55:28 +02:00
Sebastian
0cf0568dd0
fix: Use use_auth_token in all cases when loading from the HF Hub (#3094)
* Making sure to pass on use_auth_token to all from_pretrained calls
2022-08-25 10:30:03 +02:00
Sara Zan
e92ea4fccb
refactor: rename master into main in documentation and links (#3063)
* master->main

* revert master rename

* Revert change to sphinx link and rename master schema
2022-08-24 19:05:12 +02:00
tstadel
92046ce5b5
feat: FAISS in OpenSearch: Support HNSW for dot product and l2 (#3029)
* support faiss hnsw

* blacken

* update docs

* improve similarity check

* add tests

* update schema

* set ef_search param correctly

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* regenerate docs

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-08-24 16:43:48 +02:00
James Briggs
9b1b03002f
update to PineconeDocumentStore to remove dependency on SQL db (#2749)
* update to PineconeDocumentStore to remove dependency on SQL db

* Update Documentation & Code Style

* typing fixes

* Update Documentation & Code Style

* fixed embedding generator to yield Documents

* Update Documentation & Code Style

* fixes for final typing issues

* fixes for pylint

* Update Documentation & Code Style

* uncomment pinecone tests

* added new params to docstrings

* Update Documentation & Code Style

* Update Documentation & Code Style

* Update haystack/document_stores/pinecone.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>

* Update haystack/document_stores/pinecone.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>

* Update Documentation & Code Style

* Update haystack/document_stores/pinecone.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>

* Update haystack/document_stores/pinecone.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>

* Update haystack/document_stores/pinecone.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>

* Update haystack/document_stores/pinecone.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>

* changes based on comments, updated errors and install

* Update Documentation & Code Style

* mypy

* implement simple filtering in pinecone mock

* typo

* typo in reverse

* account for missing meta key in filtering

* typo

* added metadata filtering to describe index

* added handling for users switching indexes in same doc store, and handling duplicate docs in write

* syntax tweaks

* added index option to document/embedding count calls

* labels implementation in progress

* added metadata fields to be indexed for pinecone tests

* further changes to mock

* WIP implementation of labels+multilabels

* switched to rely on labels namespace rather than filter

* simpler delete_labels

* label fixes, remove debug code

* Apply docstring fixes

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* mypy

* pylint

* docs

* temporarily un-mock Pinecone

* Small Pinecone test suite

* pylint

* Add fake test key to pass the None check

* Add again fake test key to pass the None check

* Add Pinecone to default docstores and fix filters

* Fix field name

* Change field name

* Change field value

* Remove comments

* forgot to upgrade pyproject.toml

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-08-24 13:27:15 +02:00
Branden Chan
6d4031d8f6
Add OpenAI Answer Generator API (#3050)
* Add OpenAI Answer Generator API

* Regen tutorials

* Regen md files

* Incorporate reviewer feedback

* Incorporate reviewer feedback

* Incorporate reviewer feedback

* Incorporate reviewer feedback
2022-08-24 09:20:08 +02:00
Sebastian
3ea57801ae
feat: Early stopping can be used in Reader and Retriever training (#3071)
* Add option to set early stopping in training

* Moved EarlyStopping to haystack/utils/early_stopping.py and added EarlyStopping to training Dense retrievers.
2022-08-23 14:18:12 +02:00
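As a rough illustration of what early stopping in training means, the sketch below stops once the evaluation loss has not improved for a given number of evaluations. `PatienceStopper`, `patience`, and `min_delta` are hypothetical names chosen for this example; it is not the `EarlyStopping` class added in the commit above.

```python
from typing import List, Optional


class PatienceStopper:
    """Hypothetical patience-based early stopping on a loss that should decrease."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, loss: float) -> bool:
        """Record one evaluation result and report whether training should stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def first_stop_index(losses: List[float], patience: int = 2) -> Optional[int]:
    """Return the index of the evaluation at which training would stop, if any."""
    stopper = PatienceStopper(patience=patience)
    for i, loss in enumerate(losses):
        if stopper.should_stop(loss):
            return i
    return None


stop_at = first_stop_index([1.0, 0.8, 0.85, 0.9, 0.7])
print(stop_at)
```

In this run the loss fails to improve at the third and fourth evaluations, so training stops before the later improvement is seen; a larger `patience` trades extra training time for tolerance of such plateaus.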
Daniel Bichuetti
d715d0202d
fix: update ChromeDriver options on restricted environments and add ChromeDriver options as function parameter (#3043)
* Fix when env does not exist

* Fix missed line

* Set conservative chromedriver options

* Set default options based on environment

* Fix removed line

* Updated documentation

* Generate new schemas manually

* Add arguments via iterator and helper function

* Pre-push doc format

* Use imported Option vs full namespace access

* Manually update schema

* Manually add documentation and schema

* Fix language and documentation

* Fix typo

* Auto generated docs

* Updated documentation
2022-08-22 12:59:33 +02:00
Daniel Bichuetti
d5e36ce6b4
fix(translator): write translated text to output documents, while keeping input untouched (#3077)
* Set translated text on a copy of original document

* Return new translated list

* Manually generated docs

TODO: check pre-commit

* Hook generated file

* Rename variables for better maintenance

* fix(translator): prevent inputs from being changed

* fix: manual update translator docs

* style(translator): explicit type declaration on List

* docs(translator): re-run pre-commit hook

* style(translator): ignore mypy wrong type check

* docs(translator): re-run pre-commit hook
2022-08-22 04:07:05 -04:00
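The translator fix above (write translated text to copies, leave inputs untouched) can be sketched as follows. `Doc`, `translate_docs`, and the upper-casing stand-in for a translation model are assumptions for illustration, not the Haystack implementation.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not Haystack's Document class."""
    content: str
    meta: Dict[str, str] = field(default_factory=dict)


def translate_docs(docs: List[Doc], translate_fn: Callable[[str], str]) -> List[Doc]:
    """Return a new list of translated copies; the input documents are not modified."""
    translated: List[Doc] = []
    for doc in docs:
        new_doc = deepcopy(doc)  # copy first so the caller's object stays unchanged
        new_doc.content = translate_fn(doc.content)
        translated.append(new_doc)
    return translated


originals = [Doc("hallo", {"lang": "de"})]
# str.upper is only a placeholder for a real translation function.
results = translate_docs(originals, str.upper)
print(originals[0].content, results[0].content)
```

Copying before mutation lets other pipeline nodes that hold a reference to the original documents still see the untranslated text.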
Julian Risch
bc6f71b5ba
chore: increase version to next release candidate (#3067)
* increase version to next release candidate

* generate schema files
2022-08-19 14:49:50 +02:00
Julian Risch
eb0f0da0fd
Prepare 1.7.1 release (#3061)
* prepare 1.7.1 release

* Fix schemas

* Update haystack/json-schemas/haystack-pipeline-1.7.1.schema.json

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* change back main to master

* remove newline at end of file

* generate schema file with no newline

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-08-19 13:24:40 +02:00
tstadel
1027ab3624
Bump Version to 1.7.1rc (#3041)
* bump version to 1.7.1rc

* update openapi
2022-08-18 10:31:57 +02:00
tstadel
baefd32b6f
Upgrade to v1.7.0 and copy docs folder (#3014)
* update version to 1.7.0

* copy docs

* update openapi

* generate schemas

* make update_json_schema() idempotent

* update docs, schema and openapi
2022-08-15 14:20:30 +02:00
Julian Risch
d61755322f
chore: fix typo in API docs (#3023)
* chore: fix typo in API docs

* fix openapi

Co-authored-by: Thomas Stadelmann <thomas.stadelmann@deepset.ai>
2022-08-15 13:25:20 +02:00
tstadel
0aa0c68785
Fix broken MultiLabel serialization (#3037)
* Fix MultiLabel serialization

* update docs

* better comment

* remove unused imports

* remove unused imports (2)
2022-08-15 13:09:18 +02:00
Branden Chan
ff38a20863
docs: update File Classifier Docstring (#3018)
* Update docstring

* Trigger pre-commit hook

* Trigger pre-commit hook

* Incorporate reviewer feedback

* Incorporate reviewer feedback
2022-08-15 12:37:28 +02:00
Branden Chan
7312f99584
Update Summarizer Docs (#3032)
* Change text to content

* Change text to content
2022-08-15 12:35:41 +02:00
bogdankostic
3a849d6c07
bug: Make TranslationWrapperPipeline work with QuestionAnswerGenerationPipeline (#3034)
* Overwrite output_translator's run method with run_batch

* Fix mypy

* Revert change

* Overwrite run method only with QuestionAnswerGenerationPipeline
2022-08-15 10:05:34 +02:00
Dmitry Goryunov
da7836a931
feat: Support embedding dimensions on DeepsetCloudDocumentStore (#2995)
* Add embedding_dim to dc store

* Remove similarity from query params, it is not used

* Remove unused `return_embedding` parameter

* Remove unused param

* Update the documentation

* Update schemas

* Revert openapi changes

* Revert openapi changes

* Fix openapi

* Fix json schema

* Improve docstrings

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Improve logs

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update the docs

* Fix similarity

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-08-12 11:46:52 +02:00
James Briggs
26c938a8e6
test: add meta fields for meta_config to be used during testing (#3021)
* added meta fields for meta_config to be used during realtime testing of PineconeDocumentStore

* Add documentation on metadata filtering in docstring

* docs

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-08-12 10:27:56 +02:00
Sebastian
44e2b1beed
Resolving issue 2853: no answer logic in FARMReader (#2856)
* Update FARMReader.eval_on_file to be consistent with FARMReader.eval

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-08-11 16:45:03 +02:00
Zoltan Fedor
408d8e6ff5
Enable the JoinDocuments node to work with documents with score=None (#2984)
* Enable the `JoinDocuments` node to work with documents with `score=None`

This fixes #2983

As of now, the `JoinDocuments` node errors out if any of the documents has `score=None`, which is possible because some retrievers cannot provide a score, like the `TfidfRetriever` on Elasticsearch or the `BM25Retriever` on Weaviate.
The reason for the error is that `JoinDocuments` always sorts the documents by score and cannot sort when `score=None`.

There was a very similar issue for `JoinAnswers` too, which was addressed by this PR: https://github.com/deepset-ai/haystack/pull/2436
This PR applies the same solution to `JoinDocuments`, so both `JoinAnswers` and `JoinDocuments` now have the same additional argument to disable sorting when that is required.

The solution is to add an argument to `JoinDocuments` called `sort_by_score: bool`, which lets the user turn off sorting of documents by score while keeping sorting as the default behavior.

* Fixing test bug

* Addressing PR review comments

- Extending unit tests
- Simplifying logic

* Making the sorting work even with no scores

Documents with no score are sorted as -Inf

* Forgot to commit the change in `join_docs.py`

* [EMPTY] Re-trigger CI

* Added an INFO log if the `JoinDocuments` is sorting while some of the docs have `score=None`

* Adjusting the arguments of `any()`

* [EMPTY] Re-trigger CI
2022-08-11 10:43:25 +02:00
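The sorting behavior described in the commit above (a `sort_by_score` switch, with missing scores treated as -Inf) can be sketched in a few lines. `Doc` and `sort_docs` are hypothetical names for this example, and the code is an illustration of the idea rather than the code in `join_docs.py`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document; not Haystack's Document class."""
    content: str
    score: Optional[float] = None


def sort_docs(docs: List[Doc], sort_by_score: bool = True) -> List[Doc]:
    """Order docs by descending score, treating score=None as -inf."""
    if not sort_by_score:
        # Keep the incoming order untouched when sorting is disabled.
        return list(docs)
    return sorted(
        docs,
        key=lambda d: d.score if d.score is not None else float("-inf"),
        reverse=True,
    )


docs = [Doc("a", 0.2), Doc("b", None), Doc("c", 0.9)]
ranked = [d.content for d in sort_docs(docs)]
unsorted = [d.content for d in sort_docs(docs, sort_by_score=False)]
print(ranked, unsorted)
```

Mapping `None` to -inf places unscored documents last instead of raising a `TypeError` when `None` is compared with a float.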
Massimiliano Pippi
2cd65e99b8
revert Remove pipes (#3006) 2022-08-11 10:42:22 +02:00
Zoltan Fedor
f4128d3581
Adding support for additional distance/similarity metrics for Weaviate (#3001)
* Adding support for additional distance metrics for Weaviate

Fixes #3000

* Updating the docs

* Fixing error texts

* Fixing issues raised by the review

* Addressing the last issue from the reviews - removing test `test_weaviate.py::test_similarity`

* [EMPTY] Re-trigger CI

* Fixing things based on review

* [EMPTY] Re-trigger CI
2022-08-11 09:48:21 +02:00
bogdankostic
5c3bfad078
feat: Add page number to Documents coming from PDFConverters and PreProcessor (#2932)
* Add page number to Documents coming from PDFConverters and PreProcessor

* Fix mypy

* Update API Docs

* Update API Docs

* Remove unused imports

* Generate JSON schema

* Generate JSON schema

* Make test variable shorter

* Make regex a separate function

* Move counting of page breaks to a function

* Generate JSON schema

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update API Documentation

* Don't create instance for testing staticmethod

* Update haystack/nodes/preprocessor/preprocessor.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-08-09 15:55:27 +02:00
Branden Chan
dfeb171686
Add API page for util functions (#2863)
* Clean OpenAIAnswerGenerator docstrings

* Incorporate reviewer feedback

* Update Documentation & Code Style

* Improve id_hash_keys description

* Simplify id_hash_keys description

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-08-09 14:53:45 +02:00
Stefano Fiorucci
4a63484916
feat: Extend TransformersQueryClassifier: clean version (#2965)
* extend query classifier in one commit

* variable number of outgoing edges

* improve tests

* fix unused import

* lightweight approach

* fix _calculate_outgoing_edges

* remove duplicate label validation

* Remove print
2022-08-09 09:43:33 +02:00
MichelBartels
c91316e862
feat: add gradient accumulation in FARMReader (#2925)
* expose gradient accumulation to train function of FARMReader

* add documentation for gradient accumulation

* Update Documentation & Code Style

* doc string improvements

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* doc string improvements

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* doc string improvements

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-08-08 18:42:21 +02:00
Vladimir Blagojevic
d1f8b7118c
Add progress bar to batch run component ops (#2864)
* Add progress bar to batch run component ops

* Update docs

* Update schema

* PR review: thanks Bogdan
2022-08-08 09:32:44 -04:00
Sara Zan
1a0a4c8836
Remove pipes from code block (#2973)
* Remove pipes

* Generate md
2022-08-05 19:18:57 +02:00
Vladimir Blagojevic
4f8d11c591
Update Seq2SeqGenerator API documentation (#2970)
* Seq2SeqGenerator - update API docs
2022-08-05 17:39:23 +02:00
Vladimir Blagojevic
762a12fcb1
Print eval reports improvements (#2941) 2022-08-04 11:21:27 -04:00
Bilge Yücel
489699bd98
Fix docs code format for sentence transformers (#2957)
Co-authored-by: bilge4 <bilge@techwolf.ai>
2022-08-04 12:31:42 +02:00