haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-08-01 05:08:42 +00:00

Author	SHA1	Message	Date
Stefano Fiorucci	4e4af99a5e	refactor!: rename `MemoryDocumentStore` and related Retrievers (#6076 ) * rename doc store and retrievers * release note * fix patch	2023-10-17 16:15:16 +02:00
Silvano Cerza	ec9f898cd6	fix: Fix TextDocumentSplitter failing if run with empty list (#6081 ) * Fix TextDocumentSplitter failing if run with empty list * Release notes * Simplify check * Enhance test	2023-10-17 11:25:28 +02:00
Julian Risch	90ddeba579	fix: DocumentSplitter and DocumentCleaner copy `id_hash_keys` to newly created Documents (#6083 ) * copy id_hash_keys in splitter and cleaner * reno	2023-10-17 11:03:48 +02:00
Stefano Fiorucci	e963c8acdd	feat: `HuggingFaceLocalGenerator` - stopwords handling (#6049 ) * first implementation * release notes * fixes * tests * better reno * release note	2023-10-17 10:36:08 +02:00
Ivana Zeljkovic	2326f2f9fe	feat: Pinecone document store optimizations (#5902 ) * Optimize methods for deleting documents and getting vector count. Enable warning messages when Pinecone limits are exceeded on Starter index type. * Fix typo * Add release note * Fix mypy errors * Remove unused import. Fix warning logging message. * Update release note with description about limits for Starter index type in Pinecone * Improve code base by: - Adding new test cases for get_embedding_count method - Fixing get_embedding_count method - Improving delete documents - Fix label retrieval - Increase default batch size - Improve get_document_count method * Remove unused variable * Fix mypy issues	2023-10-16 19:26:24 +02:00
Nicola Procopio	32e87d37c1	fixed join_docs.py concatenate (#5970 ) * added hybrid search example Added an example about hybrid search for faq pipeline on covid dataset * formatted with black formatter * renamed document * fixed * fixed typos * added test added test for hybrid search * fixed whitespaces * removed test for hybrid search * fixed pylint * commented logging * fixed bug in join_docs.py _concatenate_results * Update join_docs.py updated comment * format with black * added releasenote on PR * updated release notes * updated test_join_documents * updated test * updated test * Update test_join_documents.py * formatted with black * fixed test * fixed --------- Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>	2023-10-16 09:31:52 +02:00
Julian Risch	aaee03aee8	feat: Add DocumentCleaner 2.0 (#5976 ) * remove whitespaces, substrings, regex, empty lines * remove repeated substrings * reno * return empty string as shortest common ngram * address first half of review feedback * address second half of review feedback * mention \f page separator for header/footer removal * mention \f page separator for header/footer removal * mark example usage as python code	2023-10-13 12:39:55 +02:00
Bilge Yücel	ad25041618	Remove old Cohere models and add aliases for existing ones (#6007 ) * Remove old cohere models * Add aliases for the existing models according to Cohere documentation * Add release note * put cohere embedding models in a constant * update doc strings --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-10-13 12:08:26 +02:00
Stefano Fiorucci	fbd22bc1e9	feat: `HuggingFaceLocalGenerator` - first implementation (#6022 ) * draft * still a raw draft * still a raw draft * improvements * minimal impl ok * tests * reno * better language * examples of generation_kwargs * incorporate feedback * lg and format updates * don't save valid str tokens * fix style --------- Co-authored-by: Darja Fokina <daria.f93@gmail.com>	2023-10-13 11:23:56 +02:00
Julian Risch	b507f1a124	feat: Add TextLanguageClassifier 2.0 (#6026 ) * draft TextLanguageClassifier * implement language detection with langdetect * add unit test for logging message * reno * pylint * change input from List[str] to str * remove empty output connections * add from_dict/to_dict tests * mark example usage as python code	2023-10-13 10:30:49 +02:00
ZanSara	110aacdc35	feat: add basic telemetry to pipelines 2.0 (#5929 ) * add telemetry to pipelines 2.0 * only collect data if telemetry is on * reno * add downsampling * typing * manual tests * pylint * simplify code * Update haystack/preview/telemetry/__init__.py * rather index by component type * black * mypy * review feedback & small improvements * defaultdict * stray changes * lint * invert condition * always send the first event of the day * collect specs * track 2nd and 3rd events too * send first event and then max 1 event a minute * rename constant * invert condition * linting	2023-10-13 09:31:51 +02:00
ZanSara	adf7e49af3	chore: review `all` extra (#6029 )	2023-10-12 21:50:53 +02:00
Vladimir Blagojevic	6a50123b9f	feat: Adjust LinkContentFetcher run method, use ByteStream (#5972 )	2023-10-10 17:48:31 +02:00
Nicola Procopio	c102b152dc	fix: Run update_embeddings in examples (#6008 ) * added hybrid search example Added an example about hybrid search for faq pipeline on covid dataset * formatted with black formatter * renamed document * fixed * fixed typos * added test added test for hybrid search * fixed whitespaces * removed test for hybrid search * fixed pylint * commented logging * updated hybrid search example * release notes * Update hybrid_search_faq_pipeline.py-815df846dca7e872.yaml * Update hybrid_search_faq_pipeline.py * mention hybrid search example in release notes * reduce installed dependencies in examples test workflow * do not install cuda dependencies * skip models if API key not set; delete document indices * skip models if API key not set; delete document indices * skip models if API key not set; delete document indices * keep roberta-base model and inference extra * pylint * disable pylint no-logging-basicconfig rule --------- Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2023-10-10 16:38:52 +02:00
Vladimir Blagojevic	98215aec0d	feat: Rename `FileExtensionRouter` to `FileTypeRouter`, handle ByteStream(s) (#5998 ) Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-10-10 09:14:04 +02:00
DanShatford	07048791aa	feat: allow list of file paths in `convert_files_to_docs` (#5961 ) * feat: allow list of file paths in `convert_files_to_docs` * Fix validation * Fix check errors	2023-10-09 20:19:03 +02:00
David Berenstein	13fb7c5b5f	feat: added on_agent_final_answer-support to Agent callback_manager (#5736 ) * chore: added on_agent_final_answer-support to Agent callback_manager * chore: format black * run pre-commit to format file * updated release notes * reverted sorted imports --------- Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>	2023-10-09 18:03:47 +02:00
Vladimir Blagojevic	40b83d8a47	feat: Add TopPSampler Haystack 2.0 component (#5924 )	2023-10-09 13:44:01 +02:00
Vladimir Blagojevic	1cdff6427e	feat: Add SimilarityRanker to Haystack 2.0 (#5923 ) * Initial SimilarityRanker	2023-10-06 16:01:34 +02:00
Stefano Fiorucci	ccc9f010bb	fix: fix ChatGPT invocation layer (and add async support) (#5979 ) * ChatGPT async * release note * fix tests	2023-10-05 18:43:26 +02:00
Tobias Wochinger	d5d3a9eef4	chore: adapt deepset cloud sdk endpoint format for saving pipelines (#5969 ) * chore: adapt to new endpoints formats * docs: add release notes	2023-10-05 08:56:28 +02:00
Massimiliano Pippi	c2ec3f5fde	feat: add File type to preview package (#5873 ) * add Blob type * review feedback * fix tests and naming * Update add-blob-type-2a9476a39841f54d.yaml * removed unused import --------- Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>	2023-10-04 17:23:12 +02:00
Stefano Fiorucci	cc70b4b613	deprecation (#5954 )	2023-10-03 12:48:06 +02:00
Massimiliano Pippi	ac408134f4	feat: add support for async openai calls (#5946 ) * add support for async openai calls * add actual async call * split the async api * ask permission * Update haystack/utils/openai_utils.py Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * Fix OpenAI content moderation tests * Fix ChatGPT invocation layer tests --------- Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>	2023-10-03 10:42:21 +02:00
Lavesh Akhadkar	1ccf674d73	feat: `DocumentWriter` returns number of documents written (#5939 ) * Make DocumentWriter return the number of documents it wrote * Fixed return type	2023-10-03 10:02:33 +02:00
Massimiliano Pippi	0947f59545	feat: add async PromptNode run (#5890 ) * add async promptnode * Remove unnecessary calls to dict.keys() --------- Co-authored-by: Silvano Cerza <silvanocerza@gmail.com> Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>	2023-09-29 08:40:01 +02:00
Vladimir Blagojevic	e882a7d5c8	feat: Add HTMLToDocument component (v2) (#5907 )	2023-09-28 17:22:28 +02:00
Stefano Fiorucci	d4aacad5f9	feat: `OpenAIDocumentEmbedder` (#5822 ) * first draft * release note * mypy fix * fix test * corrections * pr feedback * better secrets handling and new tests * missing imports in embedders/__init__.py * better format condition * address feedback	2023-09-28 15:42:51 +02:00
Julian Risch	4413675e64	feat: Add TextDocumentSplitter that splits by word, sentence, passage (2.0) (#5870 ) * draft split by word, sentence, passage * naive way to split sentences without nltk * reno * add tests * make input list of docs, review feedback * add source_id and more validation * update docstrings * add split delimiters back to strings --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-27 12:26:20 +02:00
bogdankostic	80192589b1	feat: Add `AzureOCRDocumentConverter` (2.0) (#5855 ) * Add AzureOCRDocumentConverter * Add tests * Add release note * Formatting * update docstrings * Apply suggestions from code review Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> * PR feedback * PR feedback * PR feedback * Add secrets as environment variables * Adapt test * Add azure dependency to CI * Add azure dependency to CI --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-26 15:57:55 +02:00
Silvano Cerza	cf7f0ebc22	Add Pipelines async run (#5864 ) * Add Pipeline.arun() * Sleeper node * Fix async running * Add e2e tests To run a Pipeline that doesn't have any async node in async mode: pytest e2e/pipelines/test_standard_pipelines.py::test_query_and_indexing_pipeline To run a Pipeline that has a single async node in concurrent mode: pytest e2e/pipelines/test_standard_pipelines.py::test_async_concurrent_complex_pipeline To run a Pipeline that has a single async node in sequential mode: pytest e2e/pipelines/test_standard_pipelines.py::test_async_sequential_complex_pipeline * Remove unused _adispatch_run method * Make Pipeline.run work with async nodes * Revert "Make Pipeline.run work with async nodes" This reverts commit 22d7a94e4d41aca1b59dad18c0b366fbb6e8f431. * Rename Pipeline.arun to Pipeline._arun * Enhance docstring * Add Sleeper docstring * Add release notes * ignore typing across the node * make pylint happy * skip pylint on needed unused import * fix * if a node has an arun method, use it --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-09-26 15:37:27 +02:00
ZanSara	6cb7d16e22	feat: `preview` extra (#5869 ) * copy the deps list over from haystack-ai * fix lazyimport usage * keep jinja and openai * fix ci * reno * separate out preview unit tests * fix import error message for tika * tika * add preview to all * wrap torch * remove comment * unwrap openai and jinja	2023-09-26 12:48:15 +02:00
bogdankostic	9a4373bf8e	feat: Add `TikaDocumentConverter` (2.0) (#5847 ) * Add TikaFileToDocument component * Add tests * Add tika service to CI * Add release note * Change name * PR feedback * Fix naming in tests * Fix tika version in CI * Update tests --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-25 11:47:21 +02:00
Stefano Fiorucci	c0f22372d4	feat: `OpenAITextEmbedder` (#5801 ) * first draft * release notes * avoid serializing secrets * fix import order * simplify serialization * simplification * monkeypatch delenv * Update haystack/preview/components/embedders/openai_text_embedder.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * docstrings updates * fix test * Update haystack/preview/components/embedders/openai_text_embedder.py Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> * rm comment --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-09-22 21:54:11 +02:00
Massimiliano Pippi	a5a0dc9f87	feat: optionally pass an id to the Document constructor (#5862 ) * revert #5826 * do not use Optional	2023-09-22 11:09:59 +02:00
Silvano Cerza	cc4f95bf51	Remove unnecessary GPT4Generator class (#5863 ) * Remove GPT4Generator class * Rename GPT35Generator to GPTGenerator * Fix tests * Release notes	2023-09-22 11:05:06 +02:00
MichelBartels	f3dc9edd26	feat: initial ExtractiveReader implementation (#5553 ) * initial ExtractiveReader implementation * initial ExtractiveReader implementation * fix mypy * remove unused import * Use AutoTokenizer * rename reader to model * combine no-answer logit * support document slicing with proper probabilities * add variable stride * validate model * fix typo * make postprocessing easier to understand * remove debug code * set default reader * add ExtractiveReader to __init__ * remove validation * use new answer class * add batching * use v2 lazy imports * move reader * fix type hints * add doc strings * add nucleus sampling * fix types * fix doc string * add no_answer parameter * remove print statement * fix gpu support * turn into binary classification task * change dataclass so document does not need to be provided for no answer * add simple tests * add unit tests * rename reader folder to readers * add integration tests * fix type hints * add release notes * remove accidentally included test file * remove unnecessary __init__ file * revert __init__ file to main * rename test script by adding test_ prefix * undo accidentally moving of test script after renaming it * remove use of bisect * rename _flatten and _unflatten * make variable name more intuitive * remove type: ignore * fix mypy issue * refactor long tuple * add doc strings * explain HF test * remove unnecessary top_k check --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-21 12:16:51 +02:00
Vladimir Blagojevic	92a6221927	feat: Add PyPDFToDocument component (2.0) (#5850 ) * Initial PyPDFToDocument implementation * Remove progress bar * Add release note * Minor fix * import check and dependency --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-21 11:52:26 +02:00
bogdankostic	abe2706298	feat: Add `MetadataRouter` (2.0) (#5824 ) * Move filter utilities * Add MetadataRouter * Add tests for MetadataRouter * Add more tests * Rename FileExtensionClassifer to FileExtensionRouter * Add support for dates in filters * Add tests * Add release note * Add release note * Apply suggestions from code review Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-20 14:49:17 +02:00
ZanSara	454988672e	feat: `UrlCacheChecker` (#5841 ) * add UrlCacheChecker * rename * add tests * reno * pylint * review feedback	2023-09-20 14:45:50 +02:00
bogdankostic	719c1c040c	feat: Add support for dates in filters (2.0) (#5823 ) * Add support for dates in filters * Add tests * Add release note * Update haystack/preview/utils/filters.py Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-20 12:05:56 +02:00
Vladimir Blagojevic	0983fb656a	feat: Add `LinkContentFetcher` Haystack 2.0 component (#5724 ) * Add LinkContentFetcher * Add release note * Small fixes * Fix pydocs * PR feedback * Remove handlers registration * PR feedback * adjustments * improve tests * initial draft * tests * add proposal * proposal number * reno * fix tests and usage of content and content_type * update branch & fix more tests * mypy * use the new document * add docstring * fix more tests * mypy * fix tests * add e2e * review feedback * improve __str__ * Apply suggestions from code review Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/preview/dataclasses/document.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * improve __str__ * fix tests * fix more tests * fix test * Fix end-of-file-fixer * Post merge fixes * Move e2e tests back into component --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-20 11:03:52 +02:00
Malte Pietsch	aa3cc3d5ae	feat: Add support for OpenAI's `gpt-3.5-turbo-instruct` model (#5837 ) * support gpt-3.5.-turbo-instruct * add release note	2023-09-19 16:06:43 +02:00
Onur Eren Arpacı	8af0d816e6	bug: fix the date_fields request bottleneck (#5695 ) * bug: fix the date_fields request bottleneck I encountered a performance issue while attempting to index 1 million vectors. Despite the Weaviate instance having low utilization, the process was estimated to take around 10 hours. After some investigation, I identified the bottleneck: _get_date_properties function was being called for every document, consequently a request to the Weaviate client was being sent and awaited for each document. To address this, I optimized the code by invoking the _get_date_properties function only when there is a schema change. This modification resulted in a notable performance improvement, reducing the indexing time to approximately 90 minutes for the same 1 million vectors. * bug: fix the date_fields request bottleneck * fix: executed the pre commit hooks for #9341	2023-09-15 18:12:14 +02:00
Silvano Cerza	5c04cd6ba2	Fix Document constructor accepting unused id parameter (#5826 )	2023-09-15 17:03:03 +02:00
Chivereanu Radu	cab21da87b	fix: Support for Azure 16k gpt 35 deployment (#5804 ) * Support for Azure 16k gpt 35 deployment * releasenote added --------- Co-authored-by: user11999 <radugabrielchivereanu@gmail.com>	2023-09-14 18:01:22 +02:00
Ivana Zeljkovic	4bad202197	feat: Pinecone document store refactoring (#5725 ) * Refactor codebase so that doc_type metadata is used instead of namespaces for making distinction between documents without embeddings, documents with embeddings and labels * Fix parameter name in integration test * Remove code under comment in add_type_metadata_filter method * Fix mypy and pylint checks * Add release note * Apply minimal changes: rename method, update method docs and remove redundant method * Mypy fixes * Fix docstrings * Revert helper methods for fetching documents when the number of documents exceeds Pinecone limit * Remove unnecessary attributes in PineconeDocumentStore * Fix unit test --------- Co-authored-by: Ivana Zeljkovic <ivana.zeljkovic@smartcat.io> Co-authored-by: DosticJelena <jelena.dostic@smartcat.io>	2023-09-14 11:46:47 +02:00
Darion	beb8853412	fix: return types of EntityExtractor to work with FAISSDocumentStore (#5750 ) * Changed entity extractor score from type float32 to float64 and start/stop from int64 to int * Added release notes	2023-09-14 10:49:54 +02:00
Stefano Fiorucci	28f42fbaab	move release note to the right directory (#5808 )	2023-09-14 09:57:09 +02:00
Christian Clauss	6dd52d91b2	ci: Fix typos discovered by codespell (#5778 ) * Fix typos discovered by codespell * pylint: max-args = 38	2023-09-13 16:14:45 +02:00

262 Commits