haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-11-02 10:49:30 +00:00

Author	SHA1	Message	Date
Daniel Bichuetti	df1f4205b6	feat: add public layout-base extraction support on PDFToTextConverter (#3137 ) * feat(PDFToTextConverter): add option to get text in physical layout order * test: add physical layout extraction test to PDFToTextConverter * refactor: change layout parameter attribution places * docs: manually trigger pre-commits * docs: generate new docs to comply with pydoc-markdown style	2022-09-13 16:55:21 +02:00
Kristof Herrmann	da1cc577ae	feat: exponential backoff with exp decreasing batch size for opensearch client (#3194 ) * Validate custom_mapping properly as an object * Remove related test * black * feat: exponential backoff with exp dec batch size * added docstring and split doc list * fix * fix mypy * fix * catch generic exception * added test * mypy ignore * fixed no attribute * added test * added tests * revert strange merge conflicts * revert merge conflict again * Update haystack/document_stores/elasticsearch.py Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> * done * adjust test * remove not required caplog * fixed comments Co-authored-by: ZanSara <sarazanzo94@gmail.com> Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2022-09-13 14:30:30 +01:00
Sara Zan	b47c93989b	remove imports redirect (#3204 )	2022-09-13 11:16:39 +01:00
Sara Zan	49b1c8856e	test: lower low boundary for accuracy in `test_calculate_context_similarity_on_non_matching_contexts` (#3199 ) * Change min value * revert test change and pin rapidfuzz<2.8.0 * duplicate	2022-09-13 09:32:38 +02:00
Massimiliano Pippi	64b0c43885	refactoring: reimplement Docker strategy (#3162 ) * setup base images * add cpu flavor * use the same Dockerfile for cpu and gpu * better naming, add docs * add docker workflow * add missing image input * change cwd for bake * also push api images * try conditional tagging for releases * revert testing code * update docker readme * document variable override * use Python 3.10 * allow empty HAYSTACK_EXTRAS * Apply suggestions from code review Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> * remove repo description step, can't make it work so far * add docs to the last step as it's tricky * manage tags for the newest images * tests are passing, checking in the last bit Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>	2022-09-12 16:33:56 +02:00
Bijay Gurung	21aedc644f	feat: Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers (#3164 ) * Add option to use MultipleNegativesRankingLoss Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers * Move out losses into separate retriever/_losses.py module * Remove unused import in retriever/_losses.py * Apply documentation suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>	2022-09-12 09:38:04 +02:00
Sebastian	fc07799206	feat: Updates docs and types for language param in PreProcessor (#3186 ) * Small update to language param docs in PreProcessor	2022-09-12 08:52:52 +02:00
Sara Zan	96bb9b5905	bug: validate `custom_mapping` as an object (#3189 ) * Validate custom_mapping properly as an object * Remove related test * black	2022-09-09 18:03:29 +02:00
Daniel Bichuetti	621e1af74c	refactor: improve support for dataclasses (#3142 ) * refactor: improve support for dataclasses * refactor: refactor class init * refactor: remove unused import * refactor: testing 3.7 diffs * refactor: checking meta where is Optional * refactor: reverting some changes on 3.7 * refactor: remove unused imports * build: manual pre-commit run * doc: run doc pre-commit manually * refactor: post initialization hack for 3.7-3.10 compat. TODO: investigate another method to improve 3.7 compatibility. * doc: force pre-commit * refactor: refactored for both Python 3.7 and 3.9 * docs: manually run pre-commit hooks * docs: run api docs manually * docs: fix wrong comment * refactor: change no type-checked test code * docs: update primitives * docs: api documentation * docs: api documentation * refactor: minor test refactoring * refactor: remove unused enumeration on test * refactor: remove unneeded dir in gitignore * refactor: exclude all private fields and change meta def * refactor: add pydantic comment * refactor : fix for mypy on Python 3.7 * refactor: revert custom init * docs: update docs to new pydoc-markdown style * Update test/nodes/test_generator.py Co-authored-by: Sara Zan <sarazanzo94@gmail.com>	2022-09-09 11:31:37 +02:00
Daniel Bichuetti	1a6cbca9b6	feat: add health check endpoint to rest api (#3168 ) * feat: add /health endpoint to rest api * refactor: adjust to new dir structure * fix: add new rest api dependency * docs: add new openapi schema * docs: manual black run * refactor: remove some sys-wide details * docs: minor description changes * docs: minor description changes * docs: generate openapi schemas * tests: improved tests * refactor: add cls method decorator	2022-09-08 18:24:16 +02:00
Vladimir Blagojevic	e0d73f3ae0	Replace torch.device(cuda) with torch.device(cuda:0) in devices initialization (#3184 )	2022-09-08 09:36:38 -04:00
Vladimir Blagojevic	20880c9d41	Add 15 min timeout for downloading cached HF models (#3179 )	2022-09-07 08:35:09 -04:00
Sebastian	62e7c19011	fix: Reduce GPU to CPU copies at inference (#3127 ) * Send matrix from gpu to cpu once instead of individual elements * Moved location of if statement so it would be triggered only when needed. Provides very modest speedup for large top_k_per_sample	2022-09-07 11:00:05 +02:00
Steven Haley	9a750f7032	docs: Fix the word length splitting; should be set to 100 not 1,000 (#3133 ) * Fix the word length splitting; should be set to 100 not 1,000 due to limitations of transformer models * Update documentation for tutorial change	2022-09-07 10:57:54 +02:00
Vladimir Blagojevic	84acb6584f	Type all parameter constructors, add model_version optional parameter where applicable (#3152 )	2022-09-06 05:05:42 -04:00
Sebastian	20c2320434	Fix for torch device (#3161 )	2022-09-06 09:03:52 +02:00
Massimiliano Pippi	6790eaf7d8	refactor: update package strategy in rest_api (#3148 ) * update packaging * fix author metadata * add newline * add empty readme * fix path to pipeline files * fix pylint job * fix metadata	2022-09-05 16:58:43 +02:00
Massimiliano Pippi	e2110644c4	docs: add tests types to CONTRIBUTING.md (#3158 ) * Update CONTRIBUTING.md Add the outcome of #2811 to the developers docs Ideally, newly added tests will follow those requirements while we progressively adapt the existing tests to the new model. * address review comments	2022-09-05 16:56:48 +02:00
Daniel Bichuetti	e1f399284f	refactor: update dependencies and remove pins (#3147 ) * refactor: remove azure-core, pydoc and hf-hub pins * fix: remove extra-comma * fix: force minimum version of azure forms recognizer * refactor: allow newer ocr libs * refactor: update more dependencies and container versions * refactor: remove extra comment * docs: pre-commit manual run * refactor: remove unnecessary dependency * tests: update weaviate container image version	2022-09-05 14:30:35 +02:00
Massimiliano Pippi	b07fcb7185	feat: add a security policy for Haystack (#3130 ) * add the security policy * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * include review feedback Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>	2022-09-02 12:00:14 +02:00
Branden Chan	d4722c2ec5	Document FARMReader.train() evaluation report log level (#3129 ) * Mention evaluation report logging level * Mention evaluation report logging level	2022-09-01 10:58:47 +02:00
Vladimir Blagojevic	356537c883	Standardize devices parameter and device initialization (#3062 ) * Use devices parameter and initialize devices consistently	2022-08-31 15:30:31 +02:00
Massimiliano Pippi	ffee36c694	pin pydantic to 1.9.2 (#3126 )	2022-08-31 14:36:40 +02:00
Vladimir Blagojevic	66f3f42a46	fix: Replace multiprocessing tokenization with batched fast tokenization (#3089 ) * Replace multiprocessing tokenization with batched fast tokenization * Replace deprecated tokenization method invocations	2022-08-31 07:33:39 -04:00
Stefano Fiorucci	e7771dc18e	bug: adapt UI random question for streamlit 1.12 and pin to streamlit>=1.9.0 (#3121 ) * adapt for streamlit 1.12.0 and pin to streamlit>=1.9.0 * make pylint happy	2022-08-31 12:35:40 +02:00
Fernando Pereira	911a2fa7e4	feat: Add warnings to PineconeDocumentStore about indexing metadata if filters return no documents (#3086 ) * black-jupyter format changes * fix merge * filters and documents/ids list evaluations fix (for this specific warning context)	2022-08-30 17:02:07 +02:00
Julian Risch	f010a17f04	increase version to next release candidate (#3115 )	2022-08-29 17:05:44 +02:00
Vladimir Blagojevic	99efab7928	Bump transformers to v4.21.2 (#3098 )	2022-08-29 11:02:13 -04:00
Sara Zan	e88f1e2577	Add custom_mapping to the list of fields that can contain string-encoded JSON (#3065 )	2022-08-29 11:10:24 +02:00
Julian Risch	4e518cdddd	chore: increase version for 1.8 release (#3109 ) * increase version for 1.8 release * ignore missing-timeout for pylint v1.8.0	2022-08-26 15:00:14 +02:00
Julian Risch	3e3ff33cdd	feat: add batch evaluation method for pipelines (#2942 ) * add basic pipeline.eval_batch for qa without filters * black formatting * pydoc-markdown * remove batch eval tests failing due to bugs * remove comment * explain commented out tests * avoid code duplication * black * mypy * pydoc markdown * add batch option to execute_eval_run * pydoc markdown * Apply documentation suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply documentation suggestion from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * add documentation based on review comments * black * black * schema updates * remove duplicate tests * add separate method for column reordering * merge _build_eval_dataframe methods * pylint ignore in function * change type annotation of queries to list only * one-liner addressing review comment on params dict * markdown files updated Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>	2022-08-25 17:50:57 +02:00
bogdankostic	e2ec0d1c15	feat: FAISS in OpenSearch: check existing index (#3101 ) * Add check for mapping for existing indices * Add test * Check if "method" field exists	2022-08-25 17:33:26 +02:00
Julian Risch	cc9d39c360	increase version to next release candidate (#3100 )	2022-08-25 15:55:34 +02:00
Julian Risch	0950db5032	chore: increase version to 1.7.2 for patch release (#3097 ) * schema update * schema update audio nodes * schema update audio param type v1.7.2	2022-08-25 13:55:28 +02:00
Sebastian	0cf0568dd0	fix: Use use_auth_token in all cases when loading from the HF Hub (#3094 ) * Making sure to pass on use_auth_token to all from_pretrained calls	2022-08-25 10:30:03 +02:00
Sara Zan	e92ea4fccb	refactor: rename `master` into `main` in documentation and links (#3063 ) * master->main * revert master rename * Revert change to sphinx link and rename master schema	2022-08-24 19:05:12 +02:00
tstadel	92046ce5b5	feat: FAISS in OpenSearch: Support HNSW for dot product and l2 (#3029 ) * support faiss hnsw * blacken * update docs * improve similarity check * add tests * update schema * set ef_search param correctly * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * regenerate docs Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>	2022-08-24 16:43:48 +02:00
James Briggs	9b1b03002f	update to PineconeDocumentStore to remove dependency on SQL db (#2749 ) * update to PineconeDocumentStore to remove dependency on SQL db * Update Documentation & Code Style * typing fixes * Update Documentation & Code Style * fixed embedding generator to yield Documents * Update Documentation & Code Style * fixes for final typing issues * fixes for pylint * Update Documentation & Code Style * uncomment pinecone tests * added new params to docstrings * Update Documentation & Code Style * Update Documentation & Code Style * Update haystack/document_stores/pinecone.py Co-authored-by: Sara Zan <sarazanzo94@gmail.com> * Update haystack/document_stores/pinecone.py Co-authored-by: Sara Zan <sarazanzo94@gmail.com> * Update Documentation & Code Style * Update haystack/document_stores/pinecone.py Co-authored-by: Sara Zan <sarazanzo94@gmail.com> * Update haystack/document_stores/pinecone.py Co-authored-by: Sara Zan <sarazanzo94@gmail.com> * Update haystack/document_stores/pinecone.py Co-authored-by: Sara Zan <sarazanzo94@gmail.com> * Update haystack/document_stores/pinecone.py Co-authored-by: Sara Zan <sarazanzo94@gmail.com> * changes based on comments, updated errors and install * Update Documentation & Code Style * mypy * implement simple filtering in pinecone mock * typo * typo in reverse * account for missing meta key in filtering * typo * added metadata filtering to describe index * added handling for users switching indexes in same doc store, and handling duplicate docs in write * syntax tweaks * added index option to document/embedding count calls * labels implementation in progress * added metadata fields to be indexed for pinecone tests * further changes to mock * WIP implementation of labels+multilabels * switched to rely on labels namespace rather than filter * simpler delete_labels * label fixes, remove debug code * Apply docstring fixes Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * mypy * pylint * docs * temporarily un-mock Pinecone * Small Pinecone test suite * pylint * Add fake test key to pass the None check * Add again fake test key to pass the None check * Add Pinecone to default docstores and fix filters * Fix field name * Change field name * Change field value * Remove comments * forgot to upgrade pyproject.toml Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> Co-authored-by: Sara Zan <sarazanzo94@gmail.com> Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>	2022-08-24 13:27:15 +02:00
Stefano Fiorucci	891707ecaa	bug: handle `Optional` params in schema validation (#2980 ) * not working draft * first draft * fix * revert json schema * better schema * improvements, support different python versions * little simplification * improvements and more tests * Revert "Merge branch 'handle_optional_params' into origin/main" This reverts commit 0114cba1f72c9bab23a3ce6a24cb4b346834cf34. * fix git mess * handle optional params; schema * test null values Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>	2022-08-24 10:40:19 +02:00
Ofek Lev	f6a4a14790	refactor: update package metadata (#3079 ) * Update package metadata * fix yaml * remove Python version cap * address review	2022-08-24 09:46:21 +02:00
Branden Chan	6d4031d8f6	Add OpenAI Answer Generator API (#3050 ) * Add OpenAI Answer Generator API * Regen tutorials * Regen md files * Incorporate reviewer feedback * Incorporate reviewer feedback * Incorporate reviewer feedback * Incorporate reviewer feedback	2022-08-24 09:20:08 +02:00
Malte Pietsch	76af0444cc	feat: add progressbar to upload_files() of deepset Cloud client (#3069 )	2022-08-23 20:51:08 +02:00
Sebastian	3ea57801ae	feat: Early stopping can be used in Reader and Retriever training (#3071 ) * Add option to set early stopping in training * Moved EarlyStopping to haystack/utils/early_stopping.py and added EarlyStopping to training Dense retrievers.	2022-08-23 14:18:12 +02:00
bogdankostic	b03de53716	Use `random_sample` instead of `ndarray` for random array (#3083 )	2022-08-22 13:19:45 +02:00
Daniel Bichuetti	149224fe3a	fix: Crawler quits ChromeDriver on destruction (#3070 ) * Close Chrome and Selenium WebDriver on destruction * Fix failed pre-commit hook	2022-08-22 13:08:16 +02:00
Daniel Bichuetti	d715d0202d	fix: update ChromeDriver options on restricted environments and add ChromeDriver options as function parameter (#3043 ) * Fix when env does not exist * Fix missed line * Set conservative chromedriver options * Set default options based on environment * Fix removed line * Updated documentation * Generate new schemas manually * Add arguments via iterator and helper function * Pre-push doc format * Use imported Option vs full namespace access * Manually update schema * Manually add documentation and schema * Fix language and documentation * Fix typo * Auto generated docs * Updated documentation	2022-08-22 12:59:33 +02:00
David G	e715dee17d	docs:fixed typo (or old documentation) in ipynb tutorial 3 (#3033 ) * Update Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.ipynb Just fixed the key in the document dictionary format so `write_documents()` won't raise an error. By the way the `write_documents()` error is really explicative * Run convert_notebooks_into_webpages.py Co-authored-by: David Gervasoni <david.gervasoni@trix.ai>	2022-08-22 12:56:30 +02:00
Massimiliano Pippi	97a8d30512	feat: Allow exact list matching with field in Elasticsearch filtering (#2988 ) * ES filtering - allow exact list matching with field typing fix Update Documentation & Code Style remove default hit limit in filtering queries Update Documentation & Code Style pytest es list eq filter Update Documentation & Code Style * review feedback * fixed test Co-authored-by: Krak91 <45461739+Krak91@users.noreply.github.com>	2022-08-22 12:42:37 +02:00
Daniel Bichuetti	d5e36ce6b4	fix(translator): write translated text to output documents, while keeping input untouched (#3077 ) * Set translated text on a copy of original document * Return new translated list * Manually generated docs TODO: check pre-commit * Hook generated file * Rename variables for better maintenance * fix(translator): prevent inputs from being changed * fix: manual update translator docs * style(translator): explicit type declaration on List * docs(translator): re-run pre-commit hook * style(translator): ignore mypy wrong type check * docs(translator): re-run pre-commit hook	2022-08-22 04:07:05 -04:00
Julian Risch	bc6f71b5ba	chore: increase version to next release candidate (#3067 ) * increase version to next release candidate * generate schema files	2022-08-19 14:49:50 +02:00


1522 Commits