haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-12-12 15:27:06 +00:00

Author	SHA1	Message	Date
Massimiliano Pippi	ac408134f4	feat: add support for async openai calls (#5946 ) * add support for async openai calls * add actual async call * split the async api * ask permission * Update haystack/utils/openai_utils.py Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * Fix OpenAI content moderation tests * Fix ChatGPT invocation layer tests --------- Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>	2023-10-03 10:42:21 +02:00
Lavesh Akhadkar	1ccf674d73	feat: `DocumentWriter` returns number of documents written (#5939 ) * Make DocumentWriter return the number of documents it wrote * Fixed return type	2023-10-03 10:02:33 +02:00
Massimiliano Pippi	0947f59545	feat: add async PromptNode run (#5890 ) * add async promptnode * Remove unnecessary calls to dict.keys() --------- Co-authored-by: Silvano Cerza <silvanocerza@gmail.com> Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>	2023-09-29 08:40:01 +02:00
Vladimir Blagojevic	e882a7d5c8	feat: Add HTMLToDocument component (v2) (#5907 )	2023-09-28 17:22:28 +02:00
Stefano Fiorucci	d4aacad5f9	feat: `OpenAIDocumentEmbedder` (#5822 ) * first draft * release note * mypy fix * fix test * corrections * pr feedback * better secrets handling and new tests * missing imports in embedders/__init__.py * better format condition * address feedback	2023-09-28 15:42:51 +02:00
Julian Risch	4413675e64	feat: Add TextDocumentSplitter that splits by word, sentence, passage (2.0) (#5870 ) * draft split by word, sentence, passage * naive way to split sentences without nltk * reno * add tests * make input list of docs, review feedback * add source_id and more validation * update docstrings * add split delimiters back to strings --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-27 12:26:20 +02:00
bogdankostic	80192589b1	feat: Add `AzureOCRDocumentConverter` (2.0) (#5855 ) * Add AzureOCRDocumentConverter * Add tests * Add release note * Formatting * update docstrings * Apply suggestions from code review Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> * PR feedback * PR feedback * PR feedback * Add secrets as environment variables * Adapt test * Add azure dependency to CI * Add azure dependency to CI --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-26 15:57:55 +02:00
Silvano Cerza	cf7f0ebc22	Add Pipelines async run (#5864 ) * Add Pipeline.arun() * Sleeper node * Fix async running * Add e2e tests To run a Pipeline that doesn't have any async node in async mode: pytest e2e/pipelines/test_standard_pipelines.py::test_query_and_indexing_pipeline To run a Pipeline that has a single async node in concurrent mode: pytest e2e/pipelines/test_standard_pipelines.py::test_async_concurrent_complex_pipeline To run a Pipeline that has a single async node in sequential mode: pytest e2e/pipelines/test_standard_pipelines.py::test_async_sequential_complex_pipeline * Remove unused _adispatch_run method * Make Pipeline.run work with async nodes * Revert "Make Pipeline.run work with async nodes" This reverts commit 22d7a94e4d41aca1b59dad18c0b366fbb6e8f431. * Rename Pipeline.arun to Pipeline._arun * Enhance docstring * Add Sleeper docstring * Add release notes * ignore typing across the node * make pylint happy * skip pylint on needed unused import * fix * if a node has an arun method, use it --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-09-26 15:37:27 +02:00
ZanSara	6cb7d16e22	feat: `preview` extra (#5869 ) * copy the deps list over from haystack-ai * fix lazyimport usage * keep jinja and openai * fix ci * reno * separate out preview unit tests * fix import error message for tika * tika * add preview to all * wrap torch * remove comment * unwrap openai and jinja	2023-09-26 12:48:15 +02:00
bogdankostic	9a4373bf8e	feat: Add `TikaDocumentConverter` (2.0) (#5847 ) * Add TikaFileToDocument component * Add tests * Add tika service to CI * Add release note * Change name * PR feedback * Fix naming in tests * Fix tika version in CI * Update tests --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-25 11:47:21 +02:00
Stefano Fiorucci	c0f22372d4	feat: `OpenAITextEmbedder` (#5801 ) * first draft * release notes * avoid serializing secrets * fix import order * simplify serialization * simplification * monkeypatch delenv * Update haystack/preview/components/embedders/openai_text_embedder.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * docstrings updates * fix test * Update haystack/preview/components/embedders/openai_text_embedder.py Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> * rm comment --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-09-22 21:54:11 +02:00
Massimiliano Pippi	a5a0dc9f87	feat: optionally pass an id to the Document constructor (#5862 ) * revert #5826 * do not use Optional	2023-09-22 11:09:59 +02:00
Silvano Cerza	cc4f95bf51	Remove unnecessary GPT4Generator class (#5863 ) * Remove GPT4Generator class * Rename GPT35Generator to GPTGenerator * Fix tests * Release notes	2023-09-22 11:05:06 +02:00
MichelBartels	f3dc9edd26	feat: initial ExtractiveReader implementation (#5553 ) * initial ExtractiveReader implementation * initial ExtractiveReader implementation * fix mypy * remove unused import * Use AutoTokenizer * rename reader to model * combine no-answer logit * support document slicing with proper probabilities * add variable stride * validate model * fix typo * make postprocessing easier to understand * remove debug code * set default reader * add ExtractiveReader to __init__ * remove validation * use new answer class * add batching * use v2 lazy imports * move reader * fix type hints * add doc strings * add nucleus sampling * fix types * fix doc string * add no_answer parameter * remove print statement * fix gpu support * turn into binary classification task * change dataclass so document does not need to be provided for no answer * add simple tests * add unit tests * rename reader folder to readers * add integration tests * fix type hints * add release notes * remove accidentally included test file * remove unnecessary __init__ file * revert __init__ file to main * rename test script by adding test_ prefix * undo accidentally moving of test script after renaming it * remove use of bisect * rename _flatten and _unflatten * make variable name more intuitive * remove type: ignore * fix mypy issue * refactor long tuple * add doc strings * explain HF test * remove unnecessary top_k check --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-21 12:16:51 +02:00
Vladimir Blagojevic	92a6221927	feat: Add PyPDFToDocument component (2.0) (#5850 ) * Initial PyPDFToDocument implementation * Remove progress bar * Add release note * Minor fix * import check and dependency --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-21 11:52:26 +02:00
bogdankostic	abe2706298	feat: Add `MetadataRouter` (2.0) (#5824 ) * Move filter utilities * Add MetadataRouter * Add tests for MetadataRouter * Add more tests * Rename FileExtensionClassifer to FileExtensionRouter * Add support for dates in filters * Add tests * Add release note * Add release note * Apply suggestions from code review Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-20 14:49:17 +02:00
ZanSara	454988672e	feat: `UrlCacheChecker` (#5841 ) * add UrlCacheChecker * rename * add tests * reno * pylint * review feedback	2023-09-20 14:45:50 +02:00
bogdankostic	719c1c040c	feat: Add support for dates in filters (2.0) (#5823 ) * Add support for dates in filters * Add tests * Add release note * Update haystack/preview/utils/filters.py Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-20 12:05:56 +02:00
Vladimir Blagojevic	0983fb656a	feat: Add `LinkContentFetcher` Haystack 2.0 component (#5724 ) * Add LinkContentFetcher * Add release note * Small fixes * Fix pydocs * PR feedback * Remove handlers registration * PR feedback * adjustments * improve tests * initial draft * tests * add proposal * proposal number * reno * fix tests and usage of content and content_type * update branch & fix more tests * mypy * use the new document * add docstring * fix more tests * mypy * fix tests * add e2e * review feedback * improve __str__ * Apply suggestions from code review Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/preview/dataclasses/document.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * improve __str__ * fix tests * fix more tests * fix test * Fix end-of-file-fixer * Post merge fixes * Move e2e tests back into component --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-20 11:03:52 +02:00
Malte Pietsch	aa3cc3d5ae	feat: Add support for OpenAI's `gpt-3.5-turbo-instruct` model (#5837 ) * support gpt-3.5-turbo-instruct * add release note	2023-09-19 16:06:43 +02:00
Onur Eren Arpacı	8af0d816e6	bug: fix the date_fields request bottleneck (#5695 ) * bug: fix the date_fields request bottleneck I encountered a performance issue while attempting to index 1 million vectors. Despite the Weaviate instance having low utilization, the process was estimated to take around 10 hours. After some investigation, I identified the bottleneck: _get_date_properties function was being called for every document, consequently a request to the Weaviate client was being sent and awaited for each document. To address this, I optimized the code by invoking the _get_date_properties function only when there is a schema change. This modification resulted in a notable performance improvement, reducing the indexing time to approximately 90 minutes for the same 1 million vectors. * bug: fix the date_fields request bottleneck * fix: executed the pre commit hooks for #9341	2023-09-15 18:12:14 +02:00
Silvano Cerza	5c04cd6ba2	Fix Document constructor accepting unused id parameter (#5826 )	2023-09-15 17:03:03 +02:00
Chivereanu Radu	cab21da87b	fix: Support for Azure 16k gpt 35 deployment (#5804 ) * Support for Azure 16k gpt 35 deployment * releasenote added --------- Co-authored-by: user11999 <radugabrielchivereanu@gmail.com>	2023-09-14 18:01:22 +02:00
Ivana Zeljkovic	4bad202197	feat: Pinecone document store refactoring (#5725 ) * Refactor codebase so that doc_type metadata is used instead of namespaces for making distinction between documents without embeddings, documents with embeddings and labels * Fix parameter name in integration test * Remove code under comment in add_type_metadata_filter method * Fix mypy and pylint checks * Add release note * Apply minimal changes: rename method, update method docs and remove redundant method * Mypy fixes * Fix docstrings * Revert helper methods for fetching documents when the number of documents exceeds Pinecone limit * Remove unnecessary attributes in PineconeDocumentStore * Fix unit test --------- Co-authored-by: Ivana Zeljkovic <ivana.zeljkovic@smartcat.io> Co-authored-by: DosticJelena <jelena.dostic@smartcat.io>	2023-09-14 11:46:47 +02:00
Darion	beb8853412	fix: return types of EntityExtractor to work with FAISSDocumentStore (#5750 ) * Changed entity extractor score from type float32 to float64 and start/stop from int64 to int * Added release notes	2023-09-14 10:49:54 +02:00
Stefano Fiorucci	28f42fbaab	move release note to the right directory (#5808 )	2023-09-14 09:57:09 +02:00
Christian Clauss	6dd52d91b2	ci: Fix typos discovered by codespell (#5778 ) * Fix typos discovered by codespell * pylint: max-args = 38	2023-09-13 16:14:45 +02:00
Julian Risch	4ae0924ea0	feat!: Remove SklearnQueryClassifier (#5779 ) * remove SklearnQueryClassifier * reno	2023-09-13 12:55:33 +02:00
Stefano Fiorucci	283ecf2760	feat: add `prefix` and `suffix` to `SentenceTransformersDocumentEmbedder` (#5745 ) * add prefix and suffix * fix test	2023-09-13 12:55:06 +02:00
ZanSara	2c4d839b64	feat: `GPT4Generator` (#5744 ) * add gpt4generator * add e2e * add tests * reno * fix e2e * Update test/preview/components/generators/openai/test_gpt4_generator.py Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com> --------- Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>	2023-09-13 10:07:09 +02:00
Christian Clauss	23f7308bec	ci: pre-commit autoupdate (#5777 )	2023-09-12 14:34:41 +02:00
ZanSara	6e70d403f8	feat: Improve `Document` for Haystack 2.0 (#5738 ) * initial draft * tests * add proposal * proposal number * reno * fix tests and usage of content and content_type * update branch & fix more tests * mypy * add docstring * fix more tests * review feedback * improve __str__ * Apply suggestions from code review Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/preview/dataclasses/document.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * improve __str__ * fix tests * fix more tests * Update haystack/preview/document_stores/memory/document_store.py --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-11 17:40:00 +02:00
Stefano Fiorucci	2edf85f739	`MemoryEmbeddingRetriever` (2.0) (#5726 ) * MemoryDocumentStore - Embedding retrieval draft * add release notes * fix mypy * better comment * improve return_embeddings handling * MemoryEmbeddingRetriever - first draft * address PR comments * release note * update docstrings * update docstrings * incorporated feedback * add return_embedding to __init__ * rm leftover docstring --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-08 15:52:48 +02:00
Stefano Fiorucci	b7bea3ae9c	`MemoryDocumentStore` - Embedding retrieval (2.0) (#5715 ) * MemoryDocumentStore - Embedding retrieval draft * add release notes * fix mypy * better comment * improve return_embeddings handling * address PR comments * update docstrings * incorporated feedback --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-07 15:44:07 +02:00
ZanSara	63cbde7287	feat: `GPT35Generator` (#5714 ) * chatgpt backend * fix tests * reno * remove print * helpers tests * add chatgpt generator * use openai sdk * remove backend * tests are broken * fix tests * stray param * move _check_troncated_answers into the class * wrong import * rename function * typo in test * add openai deps * mypy * improve system prompt docstring * typos update * Update haystack/preview/components/generators/openai/chatgpt.py * pylint * Update haystack/preview/components/generators/openai/chatgpt.py Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * Update haystack/preview/components/generators/openai/chatgpt.py Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * Update haystack/preview/components/generators/openai/chatgpt.py Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * review feedback * fix tests * review feedback * reno * remove tenacity mock * gpt35generator * fix naming * remove stray references to chatgpt * fix e2e * Update releasenotes/notes/chatgpt-llm-generator-d043532654efe684.yaml Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * add another test * test wrong model name * review feedback --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>	2023-09-07 10:06:57 +02:00
Vladimir Blagojevic	c5edb45c10	feat: Add `SerperDevWebSearch` Haystack 2.0 component (#5712 ) * Add SerperDev * Add release note * PR Feedback * Simplify, remove one-liner * Update haystack/preview/components/websearch/serper_dev.py Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> * Update haystack/preview/components/websearch/serper_dev.py Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> * Fix formatting * PR feedback * Fix tests * Function rename * Remove scoring, update tests * PR feedback * Fix return * small adjustments * fix tests * add e2e test * fix release notes * fix tests * fix e2e --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-06 17:31:42 +02:00
bogdankostic	639f7cf888	chore: Rename `AnswersBuilder` to `AnswerBuilder` (#5720 ) * Add AnswersBuilder * Add tests for AnswersBuilder * Add release note * PR feedback * Fix mypy * Remove redundant check for number of groups * Rename AnswersBuilder to AnswerBuilder * Update test/preview/components/builders/test_answer_builder.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Rename reno file --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-05 14:34:22 +02:00
Silvano Cerza	2acc41ea85	Add `PromptBuilder` (#5713 ) * Add PromptBuilder * Update release note * Add test	2023-09-05 12:22:21 +02:00
bogdankostic	a5b815690e	feat: Add `AnswersBuilder` component (2.0) (#5701 ) * Add AnswersBuilder * Add tests for AnswersBuilder * Add release note * PR feedback * Fix mypy * Remove redundant check for number of groups * docstrings upd --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-04 21:16:20 +02:00
bogdankostic	11440395f4	fix: Set model_max_length in the Tokenizer of `DefaultPromptHandler` (#5596 ) * Set model_max_length in tokenizer in prompt handler * Add release note	2023-09-01 11:48:41 +02:00
ZanSara	5f1256ac7e	feat: `generators` (2.0) (#5690 ) * add generators module * add tests for module helper * reno * add another test * move into openai * improve tests	2023-08-31 17:33:12 +02:00
Fanli Lin	40d9f34e68	feat: enable passing `use_fast` to the underlying transformers' pipeline (#5655 ) * copy instead of deepcopy * fix pylint * add use_fast * add release note * remove irrelevant changes * black fix * fix bug * black * bug fix	2023-08-30 10:25:18 +02:00
ZanSara	b1daa7c647	chore: migrate to `canals==0.7.0` (#5647 ) * add default_to_dict and default_from_dict placeholders to ease migration to canals 0.7.0 * canals==0.7.0 * whisper components * add to_dict/from_dict stubs * import serialization methods in init to hide canals imports * reno * export deserializationerror too * Update haystack/preview/__init__.py Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * serialization methods for LocalWhisperTranscriber (#5648) * chore: serialization methods for `FileExtensionClassifier` (#5651) * serialization methods for FileExtensionClassifier * Update test_file_classifier.py * chore: serialization methods for `SentenceTransformersDocumentEmbedder` (#5652) * serialization methods for SentenceTransformersDocumentEmbedder * fix device management * serialization methods for SentenceTransformersTextEmbedder (#5653) * serialization methods for TextFileToDocument (#5654) * chore: serialization methods for `RemoteWhisperTranscriber` (#5650) * serialization methods for RemoteWhisperTranscriber * remove patches * Add default to_dict and from_dict in document stores built with factory (#5674) * fix tests (#5671) * chore: simplify serialization methods for `MemoryDocumentStore` (#5667) * simplify serialization for MemoryDocumentStore * remove redundant tests * pylint * chore: serialization methods for `MemoryRetriever` (#5663) * serialization method for MemoryRetriever * more tests * remove hash from default_document_store_to_dict * remove diff in factory.py * chore: serialization methods for `DocumentWriter` (#5661) * serialization methods for DocumentWriter * more tests * use factory * black --------- Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>	2023-08-29 18:15:07 +02:00
Vladimir Blagojevic	e5e7bb9654	feat: Allow WebRetrieve to use custom LinkContentFetcher (#5662 ) * Allow use of custom LinkContentFetcher * Add release note	2023-08-29 15:46:48 +02:00
Vladimir Blagojevic	1f7c7b716a	Update release note for #5526 (#5664 )	2023-08-29 14:25:52 +02:00
Julian Risch	fa81c611e8	build: Upgrade transformers to v4.32.1 (#5658 ) * upgrade transformers to 4.32.1 * added release notes * upgrade transformers version also for inference extra	2023-08-29 13:46:00 +02:00
Vladimir Blagojevic	f13b37db24	fix: LinkContentFetcher - when no content retrieved (i.e. request blocked), default to snippet text (#5656 ) * When no content retrieved (i.e. request blocked), default to snippet * Add release note	2023-08-29 10:57:47 +02:00
Vladimir Blagojevic	2118f68769	feat: Add domain scoping to WebRetriever (#5587 ) * WebSearch: add allowed_domains scoped search * Add talk to website example * Add release note * Add allowed_domains to WebSearch * Minor fix --------- Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>	2023-08-28 20:02:02 +02:00
Stefano Fiorucci	72fe4fc57b	feat: SentenceTransformersDocumentEmbedder (#5606 ) * first draft * incorporate feedback * some unit tests * release notes * real release notes * refactored to use a factory class * allow forcing fresh instances * first draft * Update haystack/preview/embedding_backends/sentence_transformers_backend.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * simplify implementation and tests * add embed_meta_fields implementation * lg update * improve meta data embedding; tests * support non-string metadata * make factory private * change return type; improve tests * warm_up not called in run * fix typing * rm unused import * Remove base test class * black --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-08-28 16:23:41 +02:00
Stefano Fiorucci	89c1813d9f	feat: SentenceTransformersTextEmbedder (#5600 ) * first draft * incorporate feedback * some unit tests * release notes * real release notes * first draft * refactored to use a factory class * adapt to new ST Embedding Backend implementation * allow forcing fresh instances * add tests * release notes * fix typo * little improvements in tests * Update haystack/preview/embedding_backends/sentence_transformers_backend.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * simplify implementation and tests * lg update * input check * better error message * make factory private * change return type; improve tests * warm_up not called in run * warm_up not called in run * rm unused import; default model * fix typing * rm unused import * Remove BaseTestComponent * black --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-08-28 16:23:26 +02:00

89 Commits