haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-11-11 23:54:37 +00:00

Author	SHA1	Message	Date
Ashwin Mathur	101bd816f8	refactor: Remove api_key from serialization of `AzureOCRDocumentConverter` and `SerperDevWebSearch` (#6150 ) * Remove api_key from serialization of AzureOCRDocumentConverter * Remove api_key from serialization of SerperDevWebSearch * Add release notes * Add init_fail_without_api_key test for SerperDevWebSearch * Rename env var to AZURE_AI_API_KEY	2023-10-23 12:26:23 +02:00
Silvano Cerza	c8d162ced9	refactor: Change `Document.embedding` type to list of floats (#6135 ) * Change Document.embedding type * Add release notes * Fix document_store testing * Fix pylint * Fix tests	2023-10-23 12:26:05 +02:00
Silvano Cerza	8f289282f1	refactor: Remove `id_hash_keys` field from `Document` (#6127 ) * Remove id_hash_fields from Document * Update release notes * Remove unused import	2023-10-23 10:35:24 +02:00
Stefano Fiorucci	7e6c6becd6	fix release note (#6145 )	2023-10-22 11:15:51 +02:00
Julian Risch	64649312bc	build: Upgrade to `canals==0.9.0` (#6133 ) * build: Upgrade to `canals==0.9.0` * reno	2023-10-20 13:00:24 +02:00
Silvano Cerza	3f98bd9137	refactor: Rework `Document.id` generation (#6122 ) * Rework Document id generation * Fix tests * Add release notes * Fix failing integration test * Remove score from Document id generation * Enhance tests * Update release notes --------- Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2023-10-20 10:34:28 +02:00
Sunil Kumar Dash	957d1be68d	Enrich documents with embeddings for OpenAIDocumentEmbedder (#6126 ) * Enrich documents with embeddings Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com> * add release note Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com> * try to fix typing * change embedding field type in Document --------- Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com> Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>	2023-10-19 18:29:16 +02:00
Stefano Fiorucci	ef40c7c728	refactor: make sure that Document's `id_hash_keys` has a valid value (#6112 ) * fix handling id_hash_keys * reno * handle empty id_hash_keys in post_init * fix * reno * test	2023-10-19 12:10:19 +02:00
Julian Risch	9f3b6512be	refactor: Remove reimplementations of default `from_dict`/`to_dict` and corresponding tests in 2.0 (#6108 ) * whisper transcriber * remove from/to_dict from builders * remove from/to_dict from embedders * remove from/to_dict from fetcher, file_converters * remove from/to_dict from generators, preprocessors * remove from/to_dict from ranker, reader * remove from/to_dict from router, sampler, websearch * pylint * reno * refactor import * remove unused import	2023-10-19 11:17:02 +02:00
Stefano Fiorucci	21d894d85a	refactor: adopt `token` instead of `use_auth_token` in HF components (#6040 ) * move embedding backends * use token in Sentence Transformers embeddings * more compact token handling * token parameter in reader * add token to ranker * release note * add test for reader	2023-10-17 16:32:13 +02:00
Stefano Fiorucci	4e4af99a5e	refactor!: rename `MemoryDocumentStore` and related Retrievers (#6076 ) * rename doc store and retrievers * release note * fix patch	2023-10-17 16:15:16 +02:00
Silvano Cerza	ec9f898cd6	fix: Fix TextDocumentSplitter failing if run with empty list (#6081 ) * Fix TextDocumentSplitter failing if run with empty list * Release notes * Simplify check * Enhance test	2023-10-17 11:25:28 +02:00
Julian Risch	90ddeba579	fix: DocumentSplitter and DocumentCleaner copy `id_hash_keys` to newly created Documents (#6083 ) * copy id_hash_keys in splitter and cleaner * reno	2023-10-17 11:03:48 +02:00
Stefano Fiorucci	e963c8acdd	feat: `HuggingFaceLocalGenerator` - stopwords handling (#6049 ) * first implementation * release notes * fixes * tests * better reno * release note	2023-10-17 10:36:08 +02:00
Ivana Zeljkovic	2326f2f9fe	feat: Pinecone document store optimizations (#5902 ) * Optimize methods for deleting documents and getting vector count. Enable warning messages when Pinecone limits are exceeded on Starter index type. * Fix typo * Add release note * Fix mypy errors * Remove unused import. Fix warning logging message. * Update release note with description about limits for Starter index type in Pinecone * Improve code base by: - Adding new test cases for get_embedding_count method - Fixing get_embedding_count method - Improving delete documents - Fix label retrieval - Increase default batch size - Improve get_document_count method * Remove unused variable * Fix mypy issues	2023-10-16 19:26:24 +02:00
Nicola Procopio	32e87d37c1	fixed join_docs.py concatenate (#5970 ) * added hybrid search example Added an example about hybrid search for faq pipeline on covid dataset * formatted with back formatter * renamed document * fixed * fixed typos * added test added test for hybrid search * fixed withespaces * removed test for hybrid search * fixed pylint * commented logging * fixed bug in join_docs.py _concatenate_results * Update join_docs.py updated comment * format with black * added releasenote on PR * updated release notes * updated test_join_documents * updated test * updated test * Update test_join_documents.py * formatted with black * fixed test * fixed --------- Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>	2023-10-16 09:31:52 +02:00
Julian Risch	aaee03aee8	feat: Add DocumentCleaner 2.0 (#5976 ) * remove whitespaces, substrings, regex, empty lines * remove repeated substrings * reno * return empty string as shortest common ngram * address first half of review feedback * address second half of review feedback * mention \f page separator for header/footer removal * mention \f page separator for header/footer removal * mark example usage as python code	2023-10-13 12:39:55 +02:00
Bilge Yücel	ad25041618	Remove old Cohere models and add aliases for existing ones (#6007 ) * Remove old cohere models * Add aliases for the existing models according to Cohere documentation * Add release note * put cohere embdding models in a constant * update doc strings --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-10-13 12:08:26 +02:00
Stefano Fiorucci	fbd22bc1e9	feat: `HuggingFaceLocalGenerator` - first implementation (#6022 ) * draft * still a raw draft * still a raw draft * improvements * minimal impl ok * tests * reno * better language * examples of generation_kwargs * incorporate feedback * lg and format updates * don't save valid str tokens * fix style --------- Co-authored-by: Darja Fokina <daria.f93@gmail.com>	2023-10-13 11:23:56 +02:00
Julian Risch	b507f1a124	feat: Add TextLanguageClassifier 2.0 (#6026 ) * draft TextLanguageClassifier * implement language detection with langdetect * add unit test for logging message * reno * pylint * change input from List[str] to str * remove empty output connections * add from_dict/to_dict tests * mark example usage as python code	2023-10-13 10:30:49 +02:00
ZanSara	110aacdc35	feat: add basic telemetry to pipelines 2.0 (#5929 ) * add telemetry to pipelines 2.0 * only collect data if telemetry is on * reno * add downsampling * typing * manual tests * pylint * simplify code * Update haystack/preview/telemetry/__init__.py * rather index by component type * black * mypy * review feedback & small improvements * defaultdict * stray changes * lint * invert condition * always send the first event of the day * collect specs * track 2nd and 3rd events too * send first event and then max 1 event a minute * rename constant * invert condition * linting	2023-10-13 09:31:51 +02:00
ZanSara	adf7e49af3	chore: review `all` extra (#6029 )	2023-10-12 21:50:53 +02:00
Vladimir Blagojevic	6a50123b9f	feat: Adjust LinkContentFetcher run method, use ByteStream (#5972 )	2023-10-10 17:48:31 +02:00
Nicola Procopio	c102b152dc	fix: Run update_embeddings in examples (#6008 ) * added hybrid search example Added an example about hybrid search for faq pipeline on covid dataset * formatted with back formatter * renamed document * fixed * fixed typos * added test added test for hybrid search * fixed withespaces * removed test for hybrid search * fixed pylint * commented logging * updated hybrid search example * release notes * Update hybrid_search_faq_pipeline.py-815df846dca7e872.yaml * Update hybrid_search_faq_pipeline.py * mention hybrid search example in release notes * reduce installed dependencies in examples test workflow * do not install cuda dependencies * skip models if API key not set; delete document indices * skip models if API key not set; delete document indices * skip models if API key not set; delete document indices * keep roberta-base model and inference extra * pylint * disable pylint no-logging-basicconfig rule --------- Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2023-10-10 16:38:52 +02:00
Vladimir Blagojevic	98215aec0d	feat: Rename `FileExtensionRouter` to `FileTypeRouter`, handle ByteStream(s) (#5998 ) Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-10-10 09:14:04 +02:00
DanShatford	07048791aa	feat: allow list of file paths in `convert_files_to_docs` (#5961 ) * feat: allow list of file paths in `convert_files_to_docs` * Fix validation * Fix check errors	2023-10-09 20:19:03 +02:00
David Berenstein	13fb7c5b5f	feat: added on_agent_final_answer-support to Agent callback_manager (#5736 ) * chore: added on_agent_final_answer-support to Agent callback_manager * chore: format black * run pre-commit to format file * updated release notes * reverted sorted imports --------- Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>	2023-10-09 18:03:47 +02:00
Vladimir Blagojevic	40b83d8a47	feat: Add TopPSampler Haystack 2.0 component (#5924 )	2023-10-09 13:44:01 +02:00
Vladimir Blagojevic	1cdff6427e	feat: Add SimilarityRanker to Haystack 2.0 (#5923 ) * Initial SimilarityRanker	2023-10-06 16:01:34 +02:00
Stefano Fiorucci	ccc9f010bb	fix: fix ChatGPT invocation layer (and add async support) (#5979 ) * ChatGPT async * release note * fix tests	2023-10-05 18:43:26 +02:00
Tobias Wochinger	d5d3a9eef4	chore: adapt deepset cloud sdk endpoint format for saving pipelines (#5969 ) * chore: adapt to new endpoints formats * docs: add release notes	2023-10-05 08:56:28 +02:00
Massimiliano Pippi	c2ec3f5fde	feat: add File type to preview package (#5873 ) * add Blob type * review feedback * fix tests and naming * Update add-blob-type-2a9476a39841f54d.yaml * removed unused import --------- Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>	2023-10-04 17:23:12 +02:00
Stefano Fiorucci	cc70b4b613	deprecation (#5954 )	2023-10-03 12:48:06 +02:00
Massimiliano Pippi	ac408134f4	feat: add support for async openai calls (#5946 ) * add support for async openai calls * add actual async call * split the async api * ask permission * Update haystack/utils/openai_utils.py Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * Fix OpenAI content moderation tests * Fix ChatGPT invocation layer tests --------- Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>	2023-10-03 10:42:21 +02:00
Lavesh Akhadkar	1ccf674d73	feat: `DocumentWriter` returns number of documents written (#5939 ) * Make DocumentWriter return the number of documents it wrote * Fixed return type	2023-10-03 10:02:33 +02:00
Massimiliano Pippi	0947f59545	feat: add async PromptNode run (#5890 ) * add async promptnode * Remove unecessary calls to dict.keys() --------- Co-authored-by: Silvano Cerza <silvanocerza@gmail.com> Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>	2023-09-29 08:40:01 +02:00
Vladimir Blagojevic	e882a7d5c8	feat: Add HTMLToDocument component (v2) (#5907 )	2023-09-28 17:22:28 +02:00
Stefano Fiorucci	d4aacad5f9	feat: `OpenAIDocumentEmbedder` (#5822 ) * first draft * release note * mypy fix * fix test * corrections * pr feedback * better secrets handling and new tests * missing imports in embedders/__init__.py * better format condition * address feedback	2023-09-28 15:42:51 +02:00
Julian Risch	4413675e64	feat: Add TextDocumentSplitter that splits by word, sentence, passage (2.0) (#5870 ) * draft split by word, sentence, passage * naive way to split sentences without nltk * reno * add tests * make input list of docs, review feedback * add source_id and more validation * update docstrings * add split delimiters back to strings --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-27 12:26:20 +02:00
bogdankostic	80192589b1	feat: Add `AzureOCRDocumentConverter` (2.0) (#5855 ) * Add AzureOCRDocumentConverter * Add tests * Add release note * Formatting * update docstrings * Apply suggestions from code review Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> * PR feedback * PR feedback * PR feedback * Add secrets as environment variables * Adapt test * Add azure dependency to CI * Add azure dependency to CI --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-26 15:57:55 +02:00
Silvano Cerza	cf7f0ebc22	Add Pipelines async run (#5864 ) * Add Pipeline.arun() * Sleeper node * Fix async running * Add e2e tests To run a Pipeline that doesn't have any async node in async mode: pytest e2e/pipelines/test_standard_pipelines.py::test_query_and_indexing_pipeline To run a Pipeline that has a single async node in concurrent mode: pytest e2e/pipelines/test_standard_pipelines.py::test_async_concurrent_complex_pipeline To run a Pipeline that has a single async node in sequential mode: pytest e2e/pipelines/test_standard_pipelines.py::test_async_sequential_complex_pipeline * Remove unused _adispatch_run method * Make Pipeline.run work with async nodes * Revert "Make Pipeline.run work with async nodes" This reverts commit 22d7a94e4d41aca1b59dad18c0b366fbb6e8f431. * Rename Pipeline.arun to Pipeline._arun * Enhance docstring * Add Sleeper docstring * Add release notes * ignore typing across the node * make pylint happy * skip pylint on needed unused import * fix * if a node has an arun method, use it --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-09-26 15:37:27 +02:00
ZanSara	6cb7d16e22	feat: `preview` extra (#5869 ) * copy the deps list over from haystack-ai * fix lazyimport usage * keep jinja and openai * fix ci * reno * separate out preview unit tests * fix import error message for tika * tika * add preview to all * wrap torch * remove comment * unwrap openai and jinja	2023-09-26 12:48:15 +02:00
bogdankostic	9a4373bf8e	feat: Add `TikaDocumentConverter` (2.0) (#5847 ) * Add TikaFileToDocument component * Add tests * Add tika service to CI * Add release note * Change name * PR feedback * Fix naming in tests * Fix tika version in CI * Update tests --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-25 11:47:21 +02:00
Stefano Fiorucci	c0f22372d4	feat: `OpenAITextEmbedder` (#5801 ) * first draft * release notes * avoid serializing secrets * fix import order * simplify serialization * simplification * monkeypatch delenv * Update haystack/preview/components/embedders/openai_text_embedder.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * docstrings updates * fix test * Update haystack/preview/components/embedders/openai_text_embedder.py Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> * rm comment --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-09-22 21:54:11 +02:00
Massimiliano Pippi	a5a0dc9f87	feat: optionally pass an id to the Document constructor (#5862 ) * revert #5826 * do not use Optional	2023-09-22 11:09:59 +02:00
Silvano Cerza	cc4f95bf51	Remove unnecessary GPT4Generator class (#5863 ) * Remove GPT4Generator class * Rename GPT35Generator to GPTGenerator * Fix tests * Release notes	2023-09-22 11:05:06 +02:00
MichelBartels	f3dc9edd26	feat: initial ExtractiveReader implementation (#5553 ) * initial ExtractiveReader implementation * initial ExtractiveReader implementation * fix mypy * remove unused import * Use AutoTokenizer * rename reader to model * combine no-answer logit * support document slicing with proper probabilities * add variable stride * validate model * fix typo * make postprocessing easier to understand * remove debug code * set default reader * add ExtractiveReader to __init__ * remove validation * use new answer class * add batching * use v2 lazy imports * move reader * fix type hints * add doc strings * add nucleus sampling * fix types * fix doc string * add no_answer parameter * remove print statement * fix gpu support * turn into binary classification task * change dataclass so document does not need to be provided for no answer * add simple tests * add unit tests * rename reader folder to readers * add integration tests * fix type hints * add release notes * remove accidentally included test file * remove unnecessary __init__ file * revert __init__ file to main * rename test script by adding test_ prefix * undo accidentally moving of test script after renaming it * remove use of bisect * rename _flatten and _unflatten * make variable name more intuitive * remove type: ignore * fix mypy issue * refactor long tuple * add doc strings * explain HF test * remove unnecessary top_k check --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-21 12:16:51 +02:00
Vladimir Blagojevic	92a6221927	feat: Add PyPDFToDocument component (2.0) (#5850 ) * Initial PyPDFToDocument implementation * Remove progress bar * Add release note * Minor fix * import check and dependency --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-21 11:52:26 +02:00
bogdankostic	abe2706298	feat: Add `MetadataRouter` (2.0) (#5824 ) * Move filter utilities * Add MetadataRouter * Add tests for MetadataRouter * Add more tests * Rename FileExtensionClassifer to FileExtensionRouter * Add support for dates in filters * Add tests * Add release note * Add release note * Apply suggestions from code review Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-20 14:49:17 +02:00
ZanSara	454988672e	feat: `UrlCacheChecker` (#5841 ) * add UrlCacheChecker * rename * add tests * reno * pylint * review feedback	2023-09-20 14:45:50 +02:00

623 Commits