haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-07-30 20:31:44 +00:00

Author	SHA1	Message	Date
Zoltan Fedor	408d8e6ff5	Enable the `JoinDocuments` node to work with documents with `score=None` (#2984 ) * Enable the `JoinDocuments` node to work with documents with `score=None` This fixes #2983 As of now, the `JoinDocuments` node will error out if any of the documents has `score=None` - which is possible, as some retrievers are not able to provide a score, like the `TfidfRetriever` on Elasticsearch or the `BM25Retriever` on Weaviate. The reason for the error is that the `JoinDocuments` always sorts the documents by score and cannot sort when `score=None`. There was a very similar issue for `JoinAnswers` too, which was addressed by this PR: https://github.com/deepset-ai/haystack/pull/2436 This change applies the same solution to `JoinDocuments` - so both the `JoinAnswers` and `JoinDocuments` now will have the same additional argument to disable sorting when that is required. The solution is to add an argument to `JoinDocuments` called `sort_by_score: bool`, which allows the user to turn off the sorting of documents by score, but keeps the current functionality of sorting being performed as the default. * Fixing test bug * Addressing PR review comments - Extending unit tests - Simplifying logic * Making the sorting work even with no scores by treating a missing score as -Inf * Forgot to commit the change in `join_docs.py` * [EMPTY] Re-trigger CI * Added an INFO log if the `JoinDocuments` is sorting while some of the docs have `score=None` * Adjusting the arguments of `any()` * [EMPTY] Re-trigger CI	2022-08-11 10:43:25 +02:00
bogdankostic	5c3bfad078	feat: Add page number to Documents coming from PDFConverters and PreProcessor (#2932 ) * Add page number to Documents coming from PDFConverters and PreProcessor * Fix mypy * Update API Docs * Update API Docs * Remove unused imports * Generate JSON schema * Generate JSON schema * Make test variable shorter * Make regex a separate function * Move counting of page breaks to a function * Generate JSON schema * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update API Documentation * Don't create instance for testing staticmethod * Update haystack/nodes/preprocessor/preprocessor.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>	2022-08-09 15:55:27 +02:00
Stefano Fiorucci	4a63484916	feat: Extend `TransformersQueryClassifier`: clean version (#2965 ) * extend query classifier in one commit * variable number of outgoing edges * improve tests * fix unused import * lightweight approach * fix _calculate_outgoing_edges * remove duplicate label validation * Remove print	2022-08-09 09:43:33 +02:00
tstadel	b042dd9c82	Fix validation for dynamic outgoing edges (#2850 ) * fix validation for dynamic outgoing edges * Update Documentation & Code Style * use class outgoing_edges as fallback if no instance is provided * implement classmethod approach * readd comment * fix mypy * fix tests * set outgoing_edges for all components * set outgoing_edges for mocks too * set document store outgoing_edges to 1 * set last missing outgoing_edges * enforce BaseComponent subclasses to define outgoing_edges * override _calculate_outgoing_edges for FileTypeClassifier * remove superfluous test * set rest_api's custom component's outgoing_edges * Update docstring Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> * remove unnecessary else Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>	2022-08-04 10:27:50 +02:00
Massimiliano Pippi	40d07c2038	Enable Opensearch unit tests in Windows CI (#2936 ) * enable Opensearch unit tests under Win * move unit tests into a dedicated job * skip audio tests on missing dependencies * avoid failing test collection when soundfile is not available * Update .github/workflows/tests.yml Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>	2022-08-03 19:19:07 +02:00
Sara Zan	4e45062a00	Simplify `language_modeling.py` and `tokenization.py` (#2703 ) * Simplification of language_model.py and tokenization.py to remove code duplication Co-authored-by: vblagoje <dovlex@gmail.com>	2022-07-22 16:29:30 +02:00
Daniel Bichuetti	3948b997b2	Add support for custom trained PunktTokenizer in PreProcessor (#2783 ) * Add support for model folder into BasePreProcessor * First draft of custom model on PreProcessor * Update Documentation & Code Style * Update tests to support custom models * Update Documentation & Code Style * Test for wrong models in custom folder * Default to ISO names on custom model folder Use long names only when needed * Update Documentation & Code Style * Refactoring language names usage * Update fallback logic * Check unpickling error * Updated tests using parametrize Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> * Refactored common logic * Add format control to NLTK load * Tests improvements Add a sample for specialized model * Update Documentation & Code Style * Minor log text update * Log model format exception details * Change pickle protocol version to 4 for 3.7 compat * Removed unnecessary model folder parameter Changed logic comparisons Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> * Update Documentation & Code Style * Removed unused import * Change errors with warnings * Change to absolute path * Rename sentence tokenizer method Co-authored-by: tstadel * Check document content is a string before process * Change to log errors and not warnings * Update Documentation & Code Style * Improve split sentences method Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> * Update Documentation & Code Style * Empty commit - trigger workflow * Remove superfluous parameters Co-authored-by: tstadel * Explicit None checking Co-authored-by: tstadel Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>	2022-07-21 09:50:45 +02:00
Sara Zan	d8e7aaeacc	API key check in `OpenAIAnswerGenerator` (#2791 ) * api key check in node and tests * Clarify skip message * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-07-12 14:05:47 +02:00
Sowmiya Jaganathan	4d8f40425b	Passing the meta-data in the summerizer response (#2179 ) * Passing the all the meta-data in the summerizer * Disable metadata forwarding if `generate_single_summary` is `True` * Update Documentation & Code Style * simplify tests * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-07-11 17:28:36 +02:00
Daniel Augustus Bichuetti Silva	1706729e26	Prevent `PDFToTextConverter` from failing on PDFs with spaces in their names (#2786 ) * Change split logic to list * Fix wrong parameter for run * Fix mypy error * Fix layout/raw parameter * Add test for filename with whitespaces on PDFToText * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-07-11 13:30:33 +02:00
Daniel Augustus Bichuetti Silva	77a513fe49	Fix crawler long file names (#2723 ) * Changing the name that crawled page is saved to avoid long file names error on some file systems * Custom naming function for saving crawled files * Update Documentation & Code Style * Remove bad characters on file name and preffix * Add test for naming function * Update Documentation & Code Style * Fix expensive regex recalculation and linter warns * Check for exceptions on file dump * Remove param_naming variable * Fix file paths on Windows, Linux and Mac * Update Documentation & Code Style * Test using one of the docstrings examples * Change default naming function Update docstrings * Applying formatting rules * Update Documentation & Code Style * Fix mypy incompatible assignment error * Remove unused type declaration * Fix typo * Update tests for naming function * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-07-11 12:16:32 +02:00
Malte Pietsch	ba08fc86f5	Add node to use OpenAI's GPT-3 for QA (#2605 ) * first draft of openai node for QA * Update Documentation & Code Style * fix mypy. add node to inits * Update Documentation & Code Style * fix linter * Adapt OpenAIGenerator to completions endpoint * Update Documentation & Code Style * Fix pylint * Fix doc strings * Make use of temperature * Make use of api key in tests * Adapt doc strings Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: ZanSara <sarazanzo94@gmail.com> Co-authored-by: bogdankostic <bogdankostic@web.de>	2022-07-08 13:59:27 +02:00
Patrick Deutschmann	1db3fd0942	Add support for Multi-Hop Dense Retrieval (#2571 ) * Implement MDR * Adapt conftest to new MDR signature * Update Documentation & Code Style * Change signature of queries param in batch methods of MDR like in #2575 * Update Documentation & Code Style * Rename MultihopDenseRetriever to MultihopEmbeddingRetriever * Fix filters in retrieve_batch * Add docstring for MultihopEmbeddingRetriever.__init__ * Update Documentation & Code Style * Revert forward signature of TextSimilarityHead Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-07-05 11:31:11 +02:00
bogdankostic	dc48c444d4	Fix loading of tokenizers in DPR (#2755 )	2022-07-04 18:18:14 +02:00
Francesco Castelli	31dcd55c24	Validate `max_seq_length` in `SquadProcessor` (#2740 ) * added max_len_seq validation in SquadProcessor * fixed string formatting * added tests for invalid max_seq_len * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-07-04 13:35:45 +02:00
Daniel Augustus Bichuetti Silva	e3b2ee956a	Improved crawler support for dynamically loaded pages (#2710 ) * Improved crawler support for dynamically loaded pages * Reduced scope of StaleElementReferenceException and removed deprecated code from WebDriver initialization * Improvements on crawler testing code * Code format and style applied on f028331948c170448613e86dfdfa222f7c2043fd * Update Documentation & Code Style * Remove unused imports/parameters Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-07-01 10:47:33 +02:00
mathislucka	8d65bc5f9b	Update document scores based on ranker node (#2048 ) * ranker should return scores for later usage * fix wrong tuple order * adjust ranker scores; add tests * Update Documentation & Code Style * fix mypy * Update Documentation & Code Style * fix mypy * Update Documentation & Code Style * relax ranker test tolerance * update ranker test score Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2022-06-27 12:17:18 +02:00
Sara Zan	e8546e2124	Replace deprecated Selenium methods (#2724 ) * Fix crawler.py * Fix test_connector.py * unused import Co-authored-by: danielbichuetti <daniel.bichuetti@gmail.com>	2022-06-24 12:05:32 +02:00
tstadel	1168f6365d	Fix using id_hash_keys as pipeline params (#2717 ) * Fix using id_hash_keys as pipeline params * Update Documentation & Code Style * add tests Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-24 09:55:09 +02:00
Aleksander Smywiński-Pohl	642229255f	Use AutoTokenizer by default, to easily adapt to new models and token… (#1902 ) * Use AutoTokenizer by default, to easily adapt to new models and tokenizers * Add missing AutoTokenizer import * Apply Black * Missing import * Fix DPR tests * Remove tests on max length * Update Documentation & Code Style Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-15 13:13:48 +02:00
Sara Zan	584e046642	`AnswerToSpeech` (#2584 ) * Add new audio answer primitives * Add AnswerToSpeech * Add dependency group * Update Documentation & Code Style * Extract TextToSpeech in a helper class, create DocumentToSpeech and primitives * Add tests * Update Documentation & Code Style * Add ability to compress audio and more tests * Add audio group to test, all and all-gpu * fix pylint * Update Documentation & Code Style * Accidental git tag * Try pleasing mypy * Update Documentation & Code Style * fix pylint * Add warning for missing OS library and support in CI * Try fixing mypy * Update Documentation & Code Style * Add docs, simplify args for audio nodes and add tutorials * Fix mypy * Fix run_batch * Feedback on tutorials * fix mypy and pylint * Fix mypy again * Fix mypy yet again * Fix the ci * Fix dicts merge and install ffmpeg on CI * Make the audio nodes import safe * Trying to increase tolerance in audio test * Fix import paths * fix linter * Update Documentation & Code Style * Add audio libs in unit tests * Update _text_to_speech.py * Update answer_to_speech.py * Use dedicated dataset & update telemetry * Remove and use distilled roberta * Revert special primitives so that the nodes run in indexing * Improve tutorials and fix smaller bugs * Update Documentation & Code Style * Fix serialization issue * Update Documentation & Code Style * Improve tutorial * Update Documentation & Code Style * Update _text_to_speech.py * Minor lg updates * Minor lg updates to tutorial * Making indexing work in tutorials * Update Documentation & Code Style * Improve docstrings * Try to use GPU when available * Update Documentation & Code Style * Fixi mypy and pylint * Try to pass the device correctly * Update Documentation & Code Style * Use type of device * use .cpu() * Improve .ipynb * update apt index to be able to download libsndfile1 * Fix SpeechDocument.from_dict() * Change pip URL Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>	2022-06-15 10:13:18 +02:00
Sara Zan	54518ac790	[CI Refactoring] Refactor `Document` fixtures in tests (#2577 ) * Refactor document fixtures * Add embedding files * Update Documentation & Code Style * Indentation issue * Update Documentation & Code Style * Fix type conversion in conftest.py * Update Documentation & Code Style * mypy on sql.py * mypy on crawler.py * mypy on pinecone.py * Adapt retriever tests * Update Documentation & Code Style * mypy on crawler.py * Update Documentation & Code Style * mypy on crawler.py again * Update Documentation & Code Style * mypy fix was too rough * Fix some more tests * Update Documentation & Code Style * Skip meaningless test on FilterRetriever * Make embedding values less specific * Update Documentation & Code Style * Use stable IDs in retriever tests that depend on it * Remove needless fixtures * docs_with_ids * Update Documentation & Code Style * Typo * Fix retriever tests * Fix reader tests * Update Documentation & Code Style * Workaround #2626 * Update Documentation & Code Style * Fix label generator tests * Reorder vectors * remove print * Update Documentation & Code Style * Update Documentation & Code Style * git tags leftover * Update Documentation & Code Style * fix last failing test Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-10 18:22:48 +02:00
Sara Zan	e5423b1515	Fix markers in GPL tests (#2652 )	2022-06-10 06:42:19 -04:00
Sara Zan	33a51fa915	[CI Refactoring] Move unrelated tests out of `test_pipeline.py` (#2573 ) * move unrelated tests out of test_pipeline.py * Update Documentation & Code Style * fix fixture name * Typo * Make sure all docs are Documents in routedocuments tests * Fix tests * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-10 11:45:13 +02:00
Vladimir Blagojevic	b13c32eb9c	Add GPL API docs, unit tests update (#2634 ) * Update test_label_generator.py * GPL increase default batch size to 16 * GPL - API docs * GPL - split unit tests * Make devs aware of multilingual GPL * Create separate train/save test Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-10 05:25:28 -04:00
Stefano Fiorucci	c178f60e3a	Make crawler extract also hidden text (#2642 ) * make crawler extract also hidden text * Update Documentation & Code Style * try to adapt test for extract_hidden_text * Update Documentation & Code Style * fix test bug * fix bug in test * added test for hidden text" * Update Documentation & Code Style * fix bug in test * Update Documentation & Code Style * fix test * Update Documentation & Code Style * fix other test bug Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-10 09:51:41 +02:00
Sara Zan	c17969e001	Fix failing `Crawler` test (#2640 ) * Make tests insensitive to ordering of crawled pages * fix docstring	2022-06-07 18:14:43 +02:00
Sara Zan	59608ca474	[CI Refactoring] Workflow refactoring (#2576 ) * Unify CI tests (from #2466) * Update Documentation & Code Style * Change folder names * Fix markers list * Remove marker 'slow', replaced with 'integration' * Soften children check * Start ES first so it has time to boot while Python is setup * Run the full workflow * Try to make pip upgrade on Windows * Set KG tests as integration * Update Documentation & Code Style * typo * faster pylint * Make Pylint use the cache * filter diff files for pylint * debug pylint statement * revert pylint changes * Remove path from asserted log (fails on Windows) * Skip preprocessor test on Windows * Tackling Windows specific failures * Fix pytest command for windows suites * Remove \ from command * Move poppler test into integration * Skip opensearch test on windows * Add tolerance in reader sas score for Windows * Another pytorch approx * Raise time limit for unit tests :( * Skip poppler test on Windows CI * Specify to pull with FF only in docs check * temporarily run the docs check immediately * Allow merge commit for now * Try without fetch depth * Accelerating test * Accelerating test * Add repository and ref alongside fetch-depth * Separate out code&docs check from tests * Use setup-python cache * Delete custom action * Remove the pull step in the docs check, will find a way to run on bot commits * Add requirements.txt in .github for caching * Actually install dependencies * Change deps group for pylint * Unclear why the requirements.txt is still required :/ * Fix the code check python setup * Install all deps for pylint * Make the autoformat check depend on tests and doc updates workflows * Try installing dependencies in another order * Try again to install the deps * quoting the paths * Ad back the requirements * Try again to install rest_api and ui * Change deps group * Duplicate haystack install line * See if the cache is the problem * Disable also in mypy, who knows * split the install step * Split install step everywhere * Revert "Separate out code&docs check from tests" This reverts commit 1cd59b15ffc5b984e1d642dcbf4c8ccc2bb6c9bd. * Add back the action * Proactive support for audio (see text2speech branch) * Fix label generator tests * Remove install of libsndfile1 on win temporarily * exclude audio tests on win * install ffmpeg for integration tests Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-07 09:23:03 +02:00
Sara Zan	83648b9bc0	[CI refactoring] Rewrite `Crawler` tests (#2557 ) * Rewrite crawler tests (very slow) and fix small crawler bug * Update Documentation & Code Style * compile the regex only once * Factor out the html files & add content check to most tests * Clarify that even starting URLs can be excluded * Update Documentation & Code Style * Change signature * Fix failing test * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-06 17:52:37 +02:00
Vladimir Blagojevic	e10a3fba74	Add Generative Pseudo Labeling (#2388 )	2022-06-02 10:12:47 -04:00
bogdankostic	61d9429c25	Simplify loading of `EmbeddingRetriever` (#2619 ) * Infer model format for EmbeddingRetriever automatically * Update Documentation & Code Style * Adapt conftest to automatic inference of model_format * Update Documentation & Code Style * Fix tests * Update Documentation & Code Style * Fix tests * Adapt tutorials * Update Documentation & Code Style * Add test for similarity scores with sentence transformers * Adapt doc string and warning message * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-02 15:05:29 +02:00
bogdankostic	867695ad0c	Change signature of queries param in batch methods (#2575 ) * Change signature of queries param in batch methods * Update Documentation & Code Style * Fix mypy * Remove unused import * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-05-24 12:33:45 +02:00
Julian Risch	075ed7fbcb	Remove encoding option from PDFToTextOCRConverter (#2553 ) * remove encoding option from PDFToTextOCRConverter * Update Documentation & Code Style * add unused 'encoding' param to PDFToTextOCRConverter * Update Documentation & Code Style * call run instead of convert to use ligature replacing * Update Documentation & Code Style * add text to check installed poppler version * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-05-24 11:31:32 +02:00
Sara Zan	ff4303c51b	[CI refactoring] Categorize tests into folders (#2554 ) * Categorize tests into folders * Fix linux_ci.yml and an import * Wrong path	2022-05-17 09:55:53 +01:00

34 Commits
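
Note on 408d8e6ff5 (#2984): the message describes sorting documents by score while treating `score=None` as -Inf, logging at INFO level when that happens, and letting the caller switch sorting off with `sort_by_score`. The following is only a rough sketch of that idea and is not the code from `join_docs.py`. The `Doc` class and the `sort_docs` function are hypothetical stand-ins introduced for illustration; only the `sort_by_score` name and the -Inf behaviour come from the commit message.

```python
import logging
import math
from dataclasses import dataclass
from typing import List, Optional

logger = logging.getLogger(__name__)


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not Haystack's Document class."""
    content: str
    score: Optional[float] = None


def sort_docs(docs: List[Doc], sort_by_score: bool = True) -> List[Doc]:
    """Return docs ordered by descending score, ranking score=None last."""
    if not sort_by_score:
        # Sorting disabled: keep the incoming order untouched.
        return list(docs)
    if any(d.score is None for d in docs):
        logger.info("Some documents have score=None; they are sorted as -Inf.")
    # A missing score compares as -Inf, so it can never outrank a real score.
    return sorted(
        docs,
        key=lambda d: d.score if d.score is not None else -math.inf,
        reverse=True,
    )


docs = [Doc("a", 0.2), Doc("b", None), Doc("c", 0.9)]
ranked = sort_docs(docs)
unsorted = sort_docs(docs, sort_by_score=False)
```

Mapping `None` to `-math.inf` in the sort key avoids the `TypeError` that Python raises when comparing `None` with a float, which appears to be the failure the commit set out to fix.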