34 Commits

Author SHA1 Message Date
Zoltan Fedor
408d8e6ff5
Enable the JoinDocuments node to work with documents with score=None (#2984)
* Enable the `JoinDocuments` node to work with documents with `score=None`

This fixes #2983

As of now, the `JoinDocuments` node will error out if any of the documents has `score=None` - which is possible, as some retriever are not able to provide a score, like the `TfidfRetriever` on Elasticsearch or the `BM25Retriever` on Weaviate.
THe reason for the error is that the `JoinDocuments` always sorts the documents by score and cannot sort when `score=None`.

There was a very similar issue for `JoinAnswers` too, which was addressed by this PR: https://github.com/deepset-ai/haystack/pull/2436
This solution applies the same solution to `JoinDocuments` - so both the `JoinAnswers` and `JoinDocuments` now will have the same additional argument to disable sorting when that is requried.

The solution is to add an argument to `JoinDocuments` called `sort_by_score: bool`, which allows the user to turn off the sorting of documents by score, but keeps the current functionality of sorting being performed as the default.

* Fixing test bug

* Addressing PR review comments

- Extending unit tests
- Simplifying logic

* Making the sorting work even with no scores

By making the no score being sorted as -Inf

* Forgot to commit the change in `join_docs.py`

* [EMPTY] Re-trigger CI

* Added am INFO log if the `JoinDocuments` is sorting while some of the docs have `score=None`

* Adjusting the arguments of `any()`

* [EMPTY] Re-trigger CI
2022-08-11 10:43:25 +02:00
bogdankostic
5c3bfad078
feat: Add page number to Documents coming from PDFConverters and PreProcessor (#2932)
* Add page number to Documents coming from PDFConverters and PreProcessor

* Fix mypy

* Update API Docs

* Update API Docs

* Remove unused imports

* Generate JSON schema

* Generate JSON schema

* Make test variable shorter

* Make regex a separate function

* Move counting of page breaks to a function

* Generate JSON schema

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update API Documentation

* Don't create instance for testing staticmethod

* Update haystack/nodes/preprocessor/preprocessor.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-08-09 15:55:27 +02:00
Stefano Fiorucci
4a63484916
feat: Extend TransformersQueryClassifier: clean version (#2965)
* extend query classifier in one commit

* variable number of outgoing edges

* improve tests

* fix unused import

* lightweight approach

* fix _calculate_outgoing_edges

* remove duplicate label validation

* Remove print
2022-08-09 09:43:33 +02:00
tstadel
b042dd9c82
Fix validation for dynamic outgoing edges (#2850)
* fix validation for dynamic outgoing edges

* Update Documentation & Code Style

* use class outgoing_edges as fallback if no instance is provided

* implement classmethod approach

* readd comment

* fix mypy

* fix tests

* set outgoing_edges for all components

* set outgoing_edges for mocks too

* set document store outgoing_edges to 1

* set last missing outgoing_edges

* enforce BaseComponent subclasses to define outgoing_edges

* override _calculate_outgoing_edges for FileTypeClassifier

* remove superfluous test

* set rest_api's custom component's outgoing_edges

* Update docstring

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* remove unnecessary else

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-08-04 10:27:50 +02:00
Massimiliano Pippi
40d07c2038
Enable Opensearch unit tests in Windows CI (#2936)
* enable Opensearch unit tests under Win

* move unit tests into a dedicated job

* skip audio tests on missing dependencies

* avoid failing test collection when soundfile is not available

* Update .github/workflows/tests.yml

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-08-03 19:19:07 +02:00
Sara Zan
4e45062a00
Simplify language_modeling.py and tokenization.py (#2703)
* Simplification of language_model.py and tokenization.py to remove code duplication

Co-authored-by: vblagoje <dovlex@gmail.com>
2022-07-22 16:29:30 +02:00
Daniel Bichuetti
3948b997b2
Add support for custom trained PunktTokenizer in PreProcessor (#2783)
* Add support for model folder into BasePreProcessor

* First draft of custom model on PreProcessor

* Update Documentation & Code Style

* Update tests to support custom models

* Update Documentation & Code Style

* Test for wrong models in custom folder

* Default to ISO names on custom model folder

Use long names only when needed

* Update Documentation & Code Style

* Refactoring language names usage

* Update fallback logic

* Check unpickling error

* Updated tests using parametrize

Co-authored-by:  Sara Zan <sara.zanzottera@deepset.ai>

* Refactored common logic

* Add format control to NLTK load

* Tests improvements

Add a sample for specialized model

* Update Documentation & Code Style

* Minor log text update

* Log model format exception details

* Change pickle protocol version to 4 for 3.7 compat

* Removed unnecessary model folder parameter

Changed logic comparisons

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* Update Documentation & Code Style

* Removed unused import

* Change errors with warnings

* Change to absolute path

* Rename sentence tokenizer method

Co-authored-by: tstadel

* Check document content is a string before process

* Change to log errors and not warnings

* Update Documentation & Code Style

* Improve split sentences method

Co-authored-by:  Sara Zan  <sara.zanzottera@deepset.ai>

* Update Documentation & Code Style

* Empty commit - trigger workflow

* Remove superfluous parameters

Co-authored-by: tstadel

* Explicit None checking

Co-authored-by: tstadel

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-07-21 09:50:45 +02:00
Sara Zan
d8e7aaeacc
API key check in OpenAIAnswerGenerator (#2791)
* api key check in node and tests

* Clarify skip message

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-12 14:05:47 +02:00
Sowmiya Jaganathan
4d8f40425b
Passing the meta-data in the summerizer response (#2179)
* Passing the all the meta-data in the summerizer

* Disable metadata forwarding if `generate_single_summary` is `True`

* Update Documentation & Code Style

* simplify tests

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-11 17:28:36 +02:00
Daniel Augustus Bichuetti Silva
1706729e26
Prevent PDFToTextConverter from failing on PDFs with spaces in their names (#2786)
* Change split logic to list

* Fix wrong parameter for run

* Fix mypy error

* Fix layout/raw parameter

* Add test for filename with whitespaces on PDFToText

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-11 13:30:33 +02:00
Daniel Augustus Bichuetti Silva
77a513fe49
Fix crawler long file names (#2723)
* Changing the name that crawled page is saved to avoid long file names error on some file systems

* Custom naming function for saving crawled files

* Update Documentation & Code Style

* Remove bad characters on file name and preffix

* Add test for naming function

* Update Documentation & Code Style

* Fix expensive regex recalculation and linter warns

* Check for exceptions on file dump

* Remove param_naming variable

* Fix file paths on Windows, Linux and Mac

* Update Documentation & Code Style

* Test using one of the docstrings examples

* Change default naming function
Update docstrings

* Applying formatting rules

* Update Documentation & Code Style

* Fix mypy incompatible assignment error

* Remove unused type declaration

* Fix typo

* Update tests for naming function

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-11 12:16:32 +02:00
Malte Pietsch
ba08fc86f5
Add node to use OpenAI's GPT-3 for QA (#2605)
* first draft of openai node for QA

* Update Documentation & Code Style

* fix mypy. add node to inits

* Update Documentation & Code Style

* fix linter

* Adapt OpenAIGenerator to completions endpoint

* Update Documentation & Code Style

* Fix pylint

* Fix doc strings

* Make use of temperature

* Make use of api key in tests

* Adapt doc strings

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ZanSara <sarazanzo94@gmail.com>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2022-07-08 13:59:27 +02:00
Patrick Deutschmann
1db3fd0942
Add support for Multi-Hop Dense Retrieval (#2571)
* Implement MDR

* Adapt conftest to new MDR signature

* Update Documentation & Code Style

* Change signature of queries param in batch methods of MDR like in #2575

* Update Documentation & Code Style

* Rename MultihopDenseRetriever to MultihopEmbeddingRetriever

* Fix filters in retrieve_batch

* Add docstring for MultihopEmbeddingRetriever.__init__

* Update Documentation & Code Style

* Revert forward signature of TextSimilarityHead

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-05 11:31:11 +02:00
bogdankostic
dc48c444d4
Fix loading of tokenizers in DPR (#2755) 2022-07-04 18:18:14 +02:00
Francesco Castelli
31dcd55c24
Validate max_seq_length in SquadProcessor (#2740)
* added max_len_seq validation in SquadProcessor

* fixed string formatting

* added tests for invalid max_seq_len

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-04 13:35:45 +02:00
Daniel Augustus Bichuetti Silva
e3b2ee956a
Improved crawler support for dynamically loaded pages (#2710)
* Improved crawler support for dynamically loaded pages

* Reduced scope of StaleElementReferenceException and removed deprecated code from WebDriver initialization

* Improvements on crawler testing code

* Code format and style applied on f028331948c170448613e86dfdfa222f7c2043fd

* Update Documentation & Code Style

* Remove unused imports/parameters

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-01 10:47:33 +02:00
mathislucka
8d65bc5f9b
Update document scores based on ranker node (#2048)
* ranker should return scores for later usage

* fix wrong tuple order

* adjust ranker scores; add tests

* Update Documentation & Code Style

* fix mypy

* Update Documentation & Code Style

* fix mypy

* Update Documentation & Code Style

* relax ranker test tolerance

* update ranker test score

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2022-06-27 12:17:18 +02:00
Sara Zan
e8546e2124
Replace deprecated Selenium methods (#2724)
* Fix crawler.py

* Fix test_connector.py

* unused import

Co-authored-by: danielbichuetti <daniel.bichuetti@gmail.com>
2022-06-24 12:05:32 +02:00
tstadel
1168f6365d
Fix using id_hash_keys as pipeline params (#2717)
* Fix using id_hash_keys as pipeline params

* Update Documentation & Code Style

* add tests

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-24 09:55:09 +02:00
Aleksander Smywiński-Pohl
642229255f
Use AutoTokenizer by default, to easily adapt to new models and token… (#1902)
* Use AutoTokenizer by default, to easily adapt to new models and tokenizers

* Add missing AutoTokenizer import

* Apply Black

* Missing import

* Fix DPR tests

* Remove tests on max length

* Update Documentation & Code Style

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-15 13:13:48 +02:00
Sara Zan
584e046642
AnswerToSpeech (#2584)
* Add new audio answer primitives

* Add AnswerToSpeech

* Add dependency group

* Update Documentation & Code Style

* Extract TextToSpeech in a helper class, create DocumentToSpeech and primitives

* Add tests

* Update Documentation & Code Style

* Add ability to compress audio and more tests

* Add audio group to test, all and all-gpu

* fix pylint

* Update Documentation & Code Style

* Accidental git tag

* Try pleasing mypy

* Update Documentation & Code Style

* fix pylint

* Add warning for missing OS library and support in CI

* Try fixing mypy

* Update Documentation & Code Style

* Add docs, simplify args for audio nodes and add tutorials

* Fix mypy

* Fix run_batch

* Feedback on tutorials

* fix mypy and pylint

* Fix mypy again

* Fix mypy yet again

* Fix the ci

* Fix dicts merge and install ffmpeg on CI

* Make the audio nodes import safe

* Trying to increase tolerance in audio test

* Fix import paths

* fix linter

* Update Documentation & Code Style

* Add audio libs in unit tests

* Update _text_to_speech.py

* Update answer_to_speech.py

* Use dedicated dataset & update telemetry

* Remove  and use distilled roberta

* Revert special primitives so that the nodes run in indexing

* Improve tutorials and fix smaller bugs

* Update Documentation & Code Style

* Fix serialization issue

* Update Documentation & Code Style

* Improve tutorial

* Update Documentation & Code Style

* Update _text_to_speech.py

* Minor lg updates

* Minor lg updates to tutorial

* Making indexing work in tutorials

* Update Documentation & Code Style

* Improve docstrings

* Try to use GPU when available

* Update Documentation & Code Style

* Fixi mypy and pylint

* Try to pass the device correctly

* Update Documentation & Code Style

* Use type of device

* use .cpu()

* Improve .ipynb

* update apt index to be able to download libsndfile1

* Fix SpeechDocument.from_dict()

* Change pip URL

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-06-15 10:13:18 +02:00
Sara Zan
54518ac790
[CI Refactoring] Refactor Document fixtures in tests (#2577)
* Refactor document fixtures

* Add embedding files

* Update Documentation & Code Style

* Indentation issue

* Update Documentation & Code Style

* Fix type conversion in conftest.py

* Update Documentation & Code Style

* mypy on sql.py

* mypy on crawler.py

* mypy on pinecone.py

* Adapt retriever tests

* Update Documentation & Code Style

* mypy on crawler.py

* Update Documentation & Code Style

* mypy on crawler.py again

* Update Documentation & Code Style

* mypy fix was too rough

* Fix some more tests

* Update Documentation & Code Style

* Skip meaningless test on FilterRetriever

* Make embedding values less specific

* Update Documentation & Code Style

* Use stable IDs in retriever tests that depend on it

* Remove needless fixtures

* docs_with_ids

* Update Documentation & Code Style

* Typo

* Fix retriever tests

* Fix reader tests

* Update Documentation & Code Style

* Workaround #2626

* Update Documentation & Code Style

* Fix label generator tests

* Reorder vectors

* remove print

* Update Documentation & Code Style

* Update Documentation & Code Style

* git tags leftover

* Update Documentation & Code Style

* fix last failing test

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-10 18:22:48 +02:00
Sara Zan
e5423b1515
Fix markers in GPL tests (#2652) 2022-06-10 06:42:19 -04:00
Sara Zan
33a51fa915
[CI Refactoring] Move unrelated tests out of test_pipeline.py (#2573)
* move unrelated tests out of test_pipeline.py

* Update Documentation & Code Style

* fix fixture name

* Typo

* Make sure all docs are Documents in routedocuments tests

* Fix tests

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-10 11:45:13 +02:00
Vladimir Blagojevic
b13c32eb9c
Add GPL API docs, unit tests update (#2634)
* Update test_label_generator.py

* GPL increase default batch size to 16

* GPL - API docs

* GPL - split unit tests

* Make devs aware of multilingual GPL

* Create separate train/save test

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-10 05:25:28 -04:00
Stefano Fiorucci
c178f60e3a
Make crawler extract also hidden text (#2642)
* make crawler extract also hidden text

* Update Documentation & Code Style

* try to adapt test for extract_hidden_text

* Update Documentation & Code Style

* fix test bug

* fix bug in test

* added test for hidden text"

* Update Documentation & Code Style

* fix bug in test

* Update Documentation & Code Style

* fix test

* Update Documentation & Code Style

* fix other test bug

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-10 09:51:41 +02:00
Sara Zan
c17969e001
Fix failing Crawler test (#2640)
* Make tests insensntive to ordering of crawled pages

* fix docstring
2022-06-07 18:14:43 +02:00
Sara Zan
59608ca474
[CI Refactoring] Workflow refactoring (#2576)
* Unify CI tests (from #2466)

* Update Documentation & Code Style

* Change folder names

* Fix markers list

* Remove marker 'slow', replaced with 'integration'

* Soften children check

* Start ES first so it has time to boot while Python is setup

* Run the full workflow

* Try to make pip upgrade on Windows

* Set KG tests as integration

* Update Documentation & Code Style

* typo

* faster pylint

* Make Pylint use the cache

* filter diff files for pylint

* debug pylint statement

* revert pylint changes

* Remove path from asserted log (fails on Windows)

* Skip preprocessor test on Windows

* Tackling Windows specific failures

* Fix pytest command for windows suites

* Remove \ from command

* Move poppler test into integration

* Skip opensearch test on windows

* Add tolerance in reader sas score for Windows

* Another pytorch approx

* Raise time limit for unit tests :(

* Skip poppler test on Windows CI

* Specify to pull with FF only in docs check

* temporarily run the docs check immediately

* Allow merge commit for now

* Try without fetch depth

* Accelerating test

* Accelerating test

* Add repository and ref alongside fetch-depth

* Separate out code&docs check from tests

* Use setup-python cache

* Delete custom action

* Remove the pull step in the docs check, will find a way to run on bot commits

* Add requirements.txt in .github for caching

* Actually install dependencies

* Change deps group for pylint

* Unclear why the requirements.txt is still required :/

* Fix the code check python setup

* Install all deps for pylint

* Make the autoformat check depend on tests and doc updates workflows

* Try installing dependencies in another order

* Try again to install the deps

* quoting the paths

* Ad back the requirements

* Try again to install rest_api and ui

* Change deps group

* Duplicate haystack install line

* See if the cache is the problem

* Disable also in mypy, who knows

* split the install step

* Split install step everywhere

* Revert "Separate out code&docs check from tests"

This reverts commit 1cd59b15ffc5b984e1d642dcbf4c8ccc2bb6c9bd.

* Add back the action

* Proactive support for audio (see text2speech branch)

* Fix label generator tests

* Remove install of libsndfile1 on win temporarily

* exclude audio tests on win

* install ffmpeg for integration tests

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-07 09:23:03 +02:00
Sara Zan
83648b9bc0
[CI refactoring] Rewrite Crawler tests (#2557)
* Rewrite crawler tests (very slow) and fix small crawler bug

* Update Documentation & Code Style

* compile the regex only once

* Factor out the html files & add content check to most tests

* Clarify that even starting URLs can be excluded

* Update Documentation & Code Style

* Change signature

* Fix failing test

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-06 17:52:37 +02:00
Vladimir Blagojevic
e10a3fba74
Add Generative Pseudo Labeling (#2388) 2022-06-02 10:12:47 -04:00
bogdankostic
61d9429c25
Simplify loading of EmbeddingRetriever (#2619)
* Infer model format for EmbeddingRetriever automatically

* Update Documentation & Code Style

* Adapt conftest to automatic inference of model_format

* Update Documentation & Code Style

* Fix tests

* Update Documentation & Code Style

* Fix tests

* Adapt tutorials

* Update Documentation & Code Style

* Add test for similarity scores with sentence transformers

* Adapt doc string and warning message

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-02 15:05:29 +02:00
bogdankostic
867695ad0c
Change signature of queries param in batch methods (#2575)
* Change signature of queries param in batch methods

* Update Documentation & Code Style

* Fix mypy

* Remove unused import

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-05-24 12:33:45 +02:00
Julian Risch
075ed7fbcb
Remove encoding option from PDFToTextOCRConverter (#2553)
* remove encoding option from PDFToTextOCRConverter

* Update Documentation & Code Style

* add unused 'encoding' param to PDFToTextOCRConverter

* Update Documentation & Code Style

* call run instead of convert to use ligature replacing

* Update Documentation & Code Style

* add text to check installed poppler version

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-05-24 11:31:32 +02:00
Sara Zan
ff4303c51b
[CI refactoring] Categorize tests into folders (#2554)
* Categorize tests into folders

* Fix linux_ci.yml and an import

* Wrong path
2022-05-17 09:55:53 +01:00