haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-11-15 09:33:34 +00:00

Author	SHA1	Message	Date
Julian Risch	8cfeed095d	build: Remove mmh3 dependency (#4896 ) * build: Remove mmh3 dependency * resolve circular import * pylint * make mmh3.py sibling of schema.py * pylint import order * pylint * undo example changes * increase coverage in modeling module * increase coverage further * rename new unit tests	2023-05-17 21:31:08 +02:00
Julian Risch	d4bbde2d9d	build: Upgrade transformers to 4.29.1 (#4886 ) * Upgrade transformers to 4.29.0 * Upgrade transformers to 4.29.1	2023-05-15 17:11:17 +02:00
Farzad E	6eb251d1f0	fix: Support for gpt-4-32k (#4825) * Add step to look up tokenizers by prefix in openai_utils * Updated tiktoken min version + openai_utils test * Added test case for GPT-4 and Azure model naming * Broken down tests * Added default case --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-05-12 19:02:12 +02:00
Massimiliano Pippi	428096733d	ci: add a job to vet license of direct dependencies only (#4885 ) * add conversion script * run job in CI * typo * invoke python * install toml * fix pylint error * more exclusions * add toml to dev dependencies * fix exclusions list * fix mypy and remove test clause	2023-05-12 11:20:48 +02:00
Massimiliano Pippi	d322beed6c	build: do not install 'dev' extras with 'all' (#4888 ) * do not install 'dev' with 'all' * some fixes around	2023-05-11 19:24:47 +02:00
Silvano Cerza	98947e4c3c	feat: Add Anthropic invocation layer (#4818 ) * feat: Add Anthropic Claude Invocation Layer * feat: Add AnthropicClaude Invocation Layer * fix: Permission changes * fix: Permission changes * Move anthropic utils in anthropic invocation layer file * Rework method to post data * Simplify invoke * Simplify supports classmethod * Remove unnecessary functions * Use always same tokenizer * Add module import * Rename some members and kwargs * Add tests * Fix _post not handling HTTPError * Fix handling of streamed response * Fix kwargs handling * Update tests * Update supports to be generic * Fix failing test * Use correct tokenizer and fix tests * Update lg * Fix mypy issue * Move requests-cache from dev to base dependencies * Fix failing test * Handle all stop words use cases --------- Co-authored-by: recrudesce <recrudesce@gmail.com> Co-authored-by: agnieszka-m <amarzec13@gmail.com>	2023-05-11 10:14:33 +02:00
ZanSara	611b09b6c0	pin canals (#4853 )	2023-05-10 13:45:57 +02:00
ZanSara	28260c5c3f	feat: introduce `generalimport` (#4662 ) * introduce generalimport * pylint * fix optional deps typing for schema * leftover * typo * typing with faiss * make Base generation optional too * handle sqlalchemy * (almost) all import are optional * TO REMOVE hijacking CI for tests * some deps are actually needed * get feature branch in CI * get feature branch in CI * fix array_equal * pylint * pandas also required * improve imports.yml * fix SquadData * fix SquadData again * generalimport imports list * Update haystack/utils/openai_utils.py Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * Update haystack/utils/openai_utils.py Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * review feedback * remove todos * reference main release * pylint * circular import * review feedback * move is_imported in init * pylint --------- Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>	2023-05-08 15:20:10 +02:00
Massimiliano Pippi	d8dc0d7403	chore: move custom linter to a separate package (#4790 ) * move custom linter to its own package * install the custom linter * fix formatting * drop python 3.7	2023-05-04 15:49:26 +02:00
Silvano Cerza	645a5fe5ba	ci: Add coverage tracking with Coveralls (#4772 ) * Format tests.yml properly * Add pytest-cov dependency * Add coverage in unit tests * Ignore cov.info * Change report format * Unignore cov.info	2023-04-28 11:59:09 +02:00
Vladimir Blagojevic	aebc22d27e	Upgrade transformers to 4.28.1 (#4665 ) * Upgrade to transformers 4.28.1 * Commenting out failing piece of test * trailing-whitespace * Adjust regex for error match - it changed between releases * Remove RAG tests failing with transformers update	2023-04-27 12:55:21 +02:00
ZanSara	1b57b96210	refactor!: extract `elasticsearch` (#4668 ) * extract elasticsearch * update pyproject.toml * make more import optional * move MockBaseRetriever in conftest * install es in the es integration tests	2023-04-26 10:14:20 +02:00
bogdankostic	7db025a97b	Update weaviate-client (#4715 )	2023-04-20 17:54:55 +02:00
Massimiliano Pippi	0c081f19e2	fix: remove warnings from the more recent Elasticsearch client (#4602 ) * clean up the ES instance in a more robust way * do not sleep, refresh the index instead * remove client warnings * fix unit tests * fix opensearch compatibility * fix unit tests * update ES version * bump elasticsearch-py * adjust docs * use recreate_index param * use same fixture strategy for Opensearch * Update lg --------- Co-authored-by: agnieszka-m <amarzec13@gmail.com>	2023-04-18 15:40:17 +02:00
ZanSara	d8ac30fa47	refactor!: extract preprocessing and file conversion deps (#4605 ) * isolate file-conversion deps * pylint * add to all extra * chain was missing * move langdetect into preprocessing and fix tika * add file-conversion extra	2023-04-14 11:34:16 +02:00
ZanSara	ba11d1c2a8	refactor!: extract evaluation and statistical dependencies (#4457 ) * try-catch sklearn and scipy * haystack imports * linting * mypy * try to import baseretriever * remove typing * unused import * remove more typing * pylint * isolate sql imports for postgres, which we don't use anyway * remove stats * replace expit * als inmemory * mypy * feedback * docker * expit * re-add njit	2023-04-12 15:38:56 +02:00
ZanSara	ce61eda970	feat: Haystack CLI (#4568 ) * first implementation * only version * delete rest api management * pylint	2023-04-04 14:24:00 +02:00
ZanSara	c202866093	feat!: drop Python3.7 support (#4421 ) * drop py3.7 * importlib-metadata	2023-04-03 10:34:58 +02:00
Vladimir Blagojevic	be25655663	feat: Add agent tools (#4437) * Initial commit, add search_engine * Add TopPSampler * Add more TopPSampler unit tests * Remove SearchEngineSampler (converted to TopPSampler) * Add some basic WebSearch unit tests * Rename unit tests * Add WebRetriever into agent_tools * Adjust to WebRetriever * Add WebRetriever mode [snippet|document] * Minor changes * SerperDev: add peopleAlsoAsk search results * First agent for hotpotqa * Making WebRetriever work on hotpotqa * refactor: minor WebRetriever improvements (#4377) * refactor: remove doc ids rebuild + anticipate cache * refactor: improve caching, fix Document ids * Minor WebRetriever improvements * Overlooked minor fixes * feat: add Bing API as search engine * refactor: let kwargs pass-through * feat: increase search context * check sampler result, improve batch typing * refactor: increase mypy compliance * Initial commit, add search_engine * Add TopPSampler * Add more TopPSampler unit tests * Remove SearchEngineSampler (converted to TopPSampler) * Add some basic WebSearch unit tests * Rename unit tests * Add WebRetriever into agent_tools * Adjust to WebRetriever * Add WebRetriever mode [snippet|document] * Minor changes * SerperDev: add peopleAlsoAsk search results * First agent for hotpotqa * Making WebRetriever work on hotpotqa * refactor: minor WebRetriever improvements (#4377) * refactor: remove doc ids rebuild + anticipate cache * refactor: improve caching, fix Document ids * Minor WebRetriever improvements * Overlooked minor fixes * feat: add Bing API as search engine * refactor: let kwargs pass-through * feat: increase search context * check sampler result, improve batch typing * refactor: increase mypy compliance * Fix mypy * Minor example fixes * Fix the descriptions * PR feedback updates * More fixes * TopPSampler: handle top p None value, add unit test * Add top_k to WebSearch * Use boilerpy3 instead trafilatura * Remove date finding * Add more WebRetriever docs * Refactor long methods * making the preprocessor optional * hide WebSearch and make NeuralWebSearch a pipeline * remove unused imports * add WebQAPipeline and split example into two * change example search engine to SerperDev * Turn off progress bars in WebRetriever's PreProcessor * Agent tool examples - final updates * Add webqa test, search results ranking scores * Better answer box handling for SerperDev and SerpAPI * Minor fixes * pylint * pylint fixes * extract TopPSampler from WebRetriever * use sampler only for WebRetriever modes other than snippet * add web retriever tests * add web retriever tests * exclude rdflib@6.3.2 due to license issues * add test for preprocessed docs and kwargs examples in docstrings * Move test_webqa_pipeline to test/pipelines * change docstring for join_documents_and_scores * Use WebQAPipeline in examples/web_lfqa.py * Use WebQAPipeline in examples/web_lfqa.py * Move test_webqa_pipeline to e2e * Updated lg * Sampler added automatically in WebQAPipeline, no need to add it * Updated lg * Updated lg * :ignore Update agent tools examples to new templates (#4503) * Update examples to new templates * Add print back * fix linting and black format issues --------- Co-authored-by: Daniel Bichuetti <daniel.bichuetti@gmail.com> Co-authored-by: agnieszka-m <amarzec13@gmail.com> Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2023-03-27 18:14:58 +02:00
Julian Risch	45ce87bb48	bug: Exclude rdflib 6.3.2 because of fossa license issues (#4495 )	2023-03-27 10:07:03 +02:00
Vladimir Blagojevic	c99b58100d	feat: Add agent event callbacks (#4491) * Implement agent callbacks with events * Fix mypy errors * Fix prompt_params assignment * PR review fixes --------- Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>	2023-03-27 10:06:11 +02:00
Silvano Cerza	0f605118d9	ci: remove python_cache internal action (#4429 )	2023-03-17 13:55:07 +01:00
Ahmed Nabil	d29342c8bf	feat: Add the New Tokenizer of `gpt-3.5-turbo` (#4331) * Updated the tokenizer algorithm and pyproject.toml tiktoken version * Updated the tokenizer algorithm and pyproject.toml tiktoken version * Update haystack/utils/openai_utils.py Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update references in openai_utils.py * Update docs/pydoc/config/extractor.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/document-classifier.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/file-converters.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/file-classifier.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/other.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/pipelines.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/preprocessor.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/primitives.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/translator.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/crawler.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/prompt-node.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/pseudo-label-generator.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/query-classifier.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/question-generator.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/reader.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/ranker.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/retriever.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update docs/pydoc/config/transformers-img-to-text.yml Co-authored-by: Sebastian <sjrl@users.noreply.github.com> * Update openai_utils.py Adding GPT-4 tokenization handler * try to fix black --------- Co-authored-by: Sebastian <sjrl@users.noreply.github.com> Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>	2023-03-17 08:20:57 +01:00
Silvano Cerza	b59cf76093	refactor: Remove AnswerToSpeech and DocumentToSpeech nodes (#4391 ) * Remove AnswerToSpeech and DocumentToSpeech nodes * Remove unused dataclasses * Remove unnecessary dependencies * Remove unused error class and imports	2023-03-15 19:31:13 +01:00
Vladimir Blagojevic	f13501309e	OpenAI streaming support (#4397 )	2023-03-15 18:24:47 +01:00
ZanSara	3ecce5cbeb	refactor: rename `v2` package to `preview` (#4409 ) * v2->preview * fossa -> py3.8 * test matrix * test matrix * tests * test imports --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-03-15 18:02:18 +01:00
Silvano Cerza	b3a659cd4a	test: Fix audio tests failing (#4418 ) * Fix audio tests failing * Disable local whisper tests	2023-03-15 15:26:30 +01:00
ZanSara	677fc8badf	feat: new Pipeline (#4368 ) * add import for canals * add stores support to canals * pyproject.toml * move tests * add v2 to the extras in ci * install v2 in action * pylint * save and load * save and load * codename "Alfalfa" * workflows	2023-03-14 17:01:19 +01:00
Vladimir Blagojevic	98256ecf57	Add Whisper node (#4335 ) * Add Whisper node * Add support for audio path, improve tests * Add docs * Improve tests	2023-03-13 16:17:07 +01:00
Sebastian	1a42166978	fix: Prevent going past token limit in OpenAI calls in PromptNode (#4179) * Refactoring to remove duplicate code when using OpenAI API * Adding docstrings * Fix mypy issue * Moved retry mechanism to openai_request function in openai_utils * Migrate OpenAI embedding encoder to use the openai_request util function. * Adding docstrings. * pylint import errors * More pylint import errors * Move construction of headers into openai_request and api_key as input variable. * Made _openai_text_completion_tokenization_details so can be reused in PromptNode and OpenAIAnswerGenerator * Add prompt truncation to the PromptNode. * Removed commented out test. * Bump version of tiktoken to 0.2.0 so we can use MODEL_TO_ENCODING to automatically determine correct tokenizer for the requested model * Change one method back to public * Fixed bug in token length truncation. Included answer length into truncation amount. Moved truncation higher up to PromptNode level. * Pylint error * Improved warning message * Added _ensure_token_limit for HFLocalInvocationLayer. Had to remove max_length from base PromptModelInvocationLayer to ensure that max_length has a default value. * Adding tests * Expanded on doc strings * Updated tests * Update docstrings * Update tests, and go back to how USE_TIKTOKEN was used before. * Update haystack/nodes/prompt/prompt_node.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/nodes/prompt/prompt_node.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/nodes/prompt/prompt_node.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/nodes/retriever/_openai_encoder.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/utils/openai_utils.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/utils/openai_utils.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Updated docstrings, and added integration marks * Remove comment * Update test * Fix test * Update test * Updated openai_request function to work with the azure api * Fixed error in _openai_encoder.py --------- Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>	2023-03-03 13:49:21 +01:00
Silvano Cerza	18e83b3ed4	Pin requests-cache test dependency to <1.0.0 (#4325 )	2023-03-03 12:47:15 +01:00
Daniel Bichuetti	7c49fffc71	feat: Enable PDFToTextConverter multiprocessing, increase general performance and simplify installation (#4226 ) * refactor: isolate PDF converters * refactor: remove xpdf dependency and fix tests * refactor: add min. version * feat: enable multiprocessing and add tests * fix: remove unused imports * fix: regression when moved code * refactor: use itertools * fix: mypy claims * refactor: double tool support * refactor: add fallback to xpdf * refactor: black formatting * refactor: make superclass signature compatible * refactor: complete removal of xPdf * refactor: regroup Haystack imports and fix regression * refactor: remove original declaration * docs: fix docstrings * tests: add [pdf] to [all] * refactor: remove redundant checks, avoid extra processes * refactor: add deprecation warning * refactor: add pytest mark * tests: change PDF test file * fix: correct pytest mark * refactor: deprecate parameter and add new * tests: change pdf sample * Add minor lg changes to docstrings * Fix default value in doc strings * Update test/nodes/test_file_converter.py Co-authored-by: bogdankostic <bogdankostic@web.de> * tests: fix page count * refactor: add imported function * refactor: change default value * tests: change parameters and fix typo * Unify sort_by_position parameter names --------- Co-authored-by: bogdankostic <bogdankostic@web.de> Co-authored-by: agnieszka-m <amarzec13@gmail.com>	2023-03-01 22:34:38 +01:00
ZanSara	13c4ff1b52	refactor: remove direct logging without a logger (#4253 ) * remove direct logging without a logger * add custom pylint checker * add test * pylint * improve checker message * mypy * remove test * add checker for basicConfig * more logging missed * ignore basicConfig * move out logger * move out statement * remove logging configuration	2023-02-23 20:42:42 +01:00
bogdankostic	7eeb3e07bf	feat: Add IVF and Product Quantization support for OpenSearchDocumentStore (#3850 ) * Add IVF and Product Quantization support for OpenSearchDocumentStore * Remove unused import statement * Fix mypy * Adapt doc strings and error messages to account for PQ * Adapt validation of indices * Adapt existing tests * Fix pylint * Add tests * Update lg * Adapt based on PR review comments * Fix Pylint * Adapt based on PR review * Add request_timeout * Adapt based on PR review * Adapt based on PR review * Adapt tests * Pin tenacity * Unpin tenacity * Adapt based on PR comments * Add match to tests --------- Co-authored-by: agnieszka-m <amarzec13@gmail.com>	2023-02-17 10:28:36 +01:00
Daniel Bichuetti	9f5a3344d5	fix: Windows amd64 platform repr (#4175 )	2023-02-16 19:46:34 +01:00
Daniel Bichuetti	5187cc1801	refactor: Remove the pin from the espnet module and fix the audio node tests. (#4128) * fix: fix audio tests + unbound some dependencies * fix: update for Python 3.8 * refactor: change numpy assertion * feat: add voice recog. support on audio tests * fix: fix var assignment * chore: dummy commit * fix: fix sndfile error * refactor: change skip reason * refactor: hardcode variable * refactor: unpin numpy * fix: pin numpy only for audio	2023-02-16 22:12:17 +05:30
Silvano Cerza	274746db07	style: Update black (#4101 ) * Update black version * Format file with new black style * Update black pre-commit hook version	2023-02-08 15:34:43 +01:00
Julian Risch	0e282e5ca4	refactor: replace mutable default arguments (#4070 ) * refactor: replace mutable default arguments * change type annotation in BasePreProcessor to Optional[List]	2023-02-07 09:30:33 +01:00
ZanSara	76db26f228	logging-format-interpolation (#3907 )	2023-02-03 13:30:56 +01:00
Silvano Cerza	6a9cb8651b	Fix pylint version to prevent crash (#4043 )	2023-02-02 17:57:39 +01:00
Massimiliano Pippi	2878c57645	Update pyproject.toml (#4035 )	2023-02-02 11:59:17 +01:00
Tuana Celik	790e9acd3e	feat: add frontmatter to meta in `MarkdownConverter` (#3953 ) * first attempt to add frontmatter of markdown to the metadata * remove bug fix * running black and pre-commit * moving the import line * adding a test * adding pydoc	2023-01-26 17:15:02 +01:00
Daniel Bichuetti	afc1e1ccef	fix: add tiktoken fallback mechanism. (#3929) * feat: migrate to tiktoken when tokenizing for OpenAI * refactor: add OpenAI optional egg * fix: add Python 3.7 fallback support for tiktoken * refactor: change both tokenization implementations and fix mypy * refactor: remove dummy-class * refactor: add tiktoken as core dependency and minor refactoring * refactor: sort imports * refactor: remove out-of-scope PR change * refactor: reintroduce corner case check * refactor: remove unused egg * refactor: remove unused exception after tiktoken as core dep * refactor: reduce ifs and include log warning * refactor: remove timeout linting ignore * refactor: revert change due to mypy * refactor: disable pylint import error * fix: add arm64 fallback to HF tokenizer * fix: add aarch64 fallback mechanism * refactor: improve log message * fix: change platform selection method * refactor: consolidate archs	2023-01-25 11:37:29 +01:00
Daniel Bichuetti	739fc228c6	feat: support cl100k_base tokenization and increase performance for GPT2 (#3897) * feat: migrate to tiktoken when tokenizing for OpenAI * refactor: add OpenAI optional egg * fix: add Python 3.7 fallback support for tiktoken * refactor: change both tokenization implementations and fix mypy * refactor: remove dummy-class * refactor: add tiktoken as core dependency and minor refactoring * refactor: sort imports * refactor: remove out-of-scope PR change * refactor: reintroduce corner case check * refactor: remove unused egg * refactor: remove unused exception after tiktoken as core dep * refactor: reduce ifs and include log warning * refactor: remove timeout linting ignore * refactor: revert change due to mypy * refactor: disable pylint import error	2023-01-24 16:15:49 +01:00
ZanSara	e954230ae7	chore: enable `f-string-without-interpolation` (#3906 ) * f-string-without-interpolation * remove line * missed one line	2023-01-23 17:35:52 +01:00
Zoltan Fedor	e447bd728a	feat: adding the ability to use Ray Serve async functionality (#3769) * Adding the ability to call the Ray pipeline from concurrent apps with async This is to fix #2968 * Fixes: mypy + pylint (`invalid-overridden-method`) * Simplifying - no real need for an `AsyncRayPipeline` anymore * Moving the new `run_async` method to the `RayPipeline` * Cleanup * [EMPTY] Re-trigger CI	2023-01-23 16:23:09 +01:00
ZanSara	62935bde6d	enable `unused-variable` (#3846 )	2023-01-12 19:38:45 +01:00
Zoltan Fedor	9cf80ee07e	feat: add HA support for Weaviate (#3764 ) * feat: add HA support for Weaviate Adding the `replicationConfig => factor` parameter to the Weaviate class at the time of class creation, allowing the user to have Haystack create a Weaviate "Class" with a replication factor set above 1. This enables the use of Weaviate in a HA (High Availability) fashion, where the created class is stored on multiple Weaviate nodes increasing Weaviate's throughput and also ensuring high availability. * Trying out a recommendation from @masci to fix the CI issue	2023-01-12 10:01:38 +01:00
ZanSara	d157e41c1f	chore: enable `logging-fstring-interpolation` and cleanup (#3843 ) * enable logging-fstring-interpolation * remove logging-fstring-interpolation from exclusion list * remove implicit string interpolations added by black * remove from rest_api too * fix % sign	2023-01-12 09:31:21 +01:00
ZanSara	4cbc8550d6	chore: enable `trailing-whitespace` and cleanup (#3847 ) * enable trailing-whitespace * remove trailing whitespace on rest api too	2023-01-11 20:08:19 +01:00


238 Commits