haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-11-15 17:43:55 +00:00

Author	SHA1	Message	Date
Ashwin Mathur	101bd816f8	refactor: Remove api_key from serialization of `AzureOCRDocumentConverter` and `SerperDevWebSearch` (#6150 ) * Remove api_key from serialization of AzureOCRDocumentConverter * Remove api_key from serialization of SerperDevWebSearch * Add release notes * Add init_fail_without_api_key test for SerperDevWebSearch * Rename env var to AZURE_AI_API_KEY	2023-10-23 12:26:23 +02:00
Silvano Cerza	c8d162ced9	refactor: Change `Document.embedding` type to list of floats (#6135 ) * Change Document.embedding type * Add release notes * Fix document_store testing * Fix pylint * Fix tests	2023-10-23 12:26:05 +02:00
Silvano Cerza	8f289282f1	refactor: Remove `id_hash_keys` field from `Document` (#6127 ) * Remove id_hash_fields from Document * Update release notes * Remove unused import	2023-10-23 10:35:24 +02:00
Silvano Cerza	2a45e7cc06	refactor: Remove `id_hash_keys` from all `file_converters` (#6125 ) * Remove id_hash_keys from DocumentCleaner * Remove id_hash_keys from TextDocumentSplitter * Remove id_hash_keys from all file_converters * Fix pylint failure * Update docstrings	2023-10-20 16:22:14 +02:00
Silvano Cerza	3d69094f9a	refactor: Remove `id_hash_keys` from `TextDocumentSplitter` (#6124 ) * Remove id_hash_keys from DocumentCleaner * Remove id_hash_keys from TextDocumentSplitter	2023-10-20 15:18:28 +02:00
Silvano Cerza	ec376c7dbd	Remove id_hash_keys from DocumentCleaner (#6123 )	2023-10-20 15:16:06 +02:00
Silvano Cerza	3f98bd9137	refactor: Rework `Document.id` generation (#6122 ) * Rework Document id generation * Fix tests * Add release notes * Fix failing integration test * Remove score from Document id generation * Enhance tests * Update release notes --------- Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2023-10-20 10:34:28 +02:00
Stefano Fiorucci	ef40c7c728	refactor: make sure that Document's `id_hash_keys` has a valid value (#6112 ) * fix handling id_hash_keys * reno * handle empty id_hash_keys in post_init * fix * reno * test	2023-10-19 12:10:19 +02:00
Julian Risch	9f3b6512be	refactor: Remove reimplementations of default `from_dict`/`to_dict` and corresponding tests in 2.0 (#6108 ) * whisper transcriber * remove from/to_dict from builders * remove from/to_dict from embedders * remove from/to_dict from fetcher, file_converters * remove from/to_dict from generators, preprocessors * remove from/to_dict from ranker, reader * remove from/to_dict from router, sampler, websearch * pylint * reno * refactor import * remove unused import	2023-10-19 11:17:02 +02:00
Stefano Fiorucci	21d894d85a	refactor: adopt `token` instead of `use_auth_token` in HF components (#6040 ) * move embedding backends * use token in Sentence Transformers embeddings * more compact token handling * token parameter in reader * add token to ranker * release note * add test for reader	2023-10-17 16:32:13 +02:00
Stefano Fiorucci	4e4af99a5e	refactor!: rename `MemoryDocumentStore` and related Retrievers (#6076 ) * rename doc store and retrievers * release note * fix patch	2023-10-17 16:15:16 +02:00
Silvano Cerza	ec9f898cd6	fix: Fix TextDocumentSplitter failing if run with empty list (#6081 ) * Fix TextDocumentSplitter failing if run with empty list * Release notes * Simplify check * Enhance test	2023-10-17 11:25:28 +02:00
Julian Risch	90ddeba579	fix: DocumentSplitter and DocumentCleaner copy `id_hash_keys` to newly created Documents (#6083 ) * copy id_hash_keys in splitter and cleaner * reno	2023-10-17 11:03:48 +02:00
Stefano Fiorucci	e963c8acdd	feat: `HuggingFaceLocalGenerator` - stopwords handling (#6049 ) * first implementation * release notes * fixes * tests * better reno * release note	2023-10-17 10:36:08 +02:00
Ivana Zeljkovic	2326f2f9fe	feat: Pinecone document store optimizations (#5902 ) * Optimize methods for deleting documents and getting vector count. Enable warning messages when Pinecone limits are exceeded on Starter index type. * Fix typo * Add release note * Fix mypy errors * Remove unused import. Fix warning logging message. * Update release note with description about limits for Starter index type in Pinecone * Improve code base by: - Adding new test cases for get_embedding_count method - Fixing get_embedding_count method - Improving delete documents - Fix label retrieval - Increase default batch size - Improve get_document_count method * Remove unused variable * Fix mypy issues	2023-10-16 19:26:24 +02:00
ZanSara	660f84e6ef	feat: enable telemetry to pick up component data (#5957 ) * add telemetry to pipelines 2.0 * only collect data if telemetry is on * reno * add downsampling * typing * manual tests * pylint * simplify code * Update haystack/preview/telemetry/__init__.py * look for _telemetry_data * rather index by component type * black * mypy * error handling * comment * review feedback & small improvements * defaultdict * stray changes * try-catch * method instead of attribute * fixes * remove print statements * lint * invert condition * always send the first event of the day * collect specs * track 2nd and 3rd events too * send first event and then max 1 event a minute * rename constant * black * add test	2023-10-16 17:43:48 +02:00
Nicola Procopio	32e87d37c1	fixed join_docs.py concatenate (#5970 ) * added hybrid search example Added an example about hybrid search for faq pipeline on covid dataset * formatted with back formatter * renamed document * fixed * fixed typos * added test added test for hybrid search * fixed withespaces * removed test for hybrid search * fixed pylint * commented logging * fixed bug in join_docs.py _concatenate_results * Update join_docs.py updated comment * format with black * added releasenote on PR * updated release notes * updated test_join_documents * updated test * updated test * Update test_join_documents.py * formatted with black * fixed test * fixed --------- Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>	2023-10-16 09:31:52 +02:00
Julian Risch	aaee03aee8	feat: Add DocumentCleaner 2.0 (#5976 ) * remove whitespaces, substrings, regex, empty lines * remove repeated substrings * reno * return empty string as shortest common ngram * address first half of review feedback * address second half of review feedback * mention \f page separator for header/footer removal * mention \f page separator for header/footer removal * mark example usage as python code	2023-10-13 12:39:55 +02:00
Stefano Fiorucci	fbd22bc1e9	feat: `HuggingFaceLocalGenerator` - first implementation (#6022 ) * draft * still a raw draft * still a raw draft * improvements * minimal impl ok * tests * reno * better language * examples of generation_kwargs * incorporate feedback * lg and format updates * don't save valid str tokens * fix style --------- Co-authored-by: Darja Fokina <daria.f93@gmail.com>	2023-10-13 11:23:56 +02:00
Julian Risch	b507f1a124	feat: Add TextLanguageClassifier 2.0 (#6026 ) * draft TextLanguageClassifier * implement language detection with langdetect * add unit test for logging message * reno * pylint * change input from List[str] to str * remove empty output connections * add from_dict/to_dict tests * mark example usage as python code	2023-10-13 10:30:49 +02:00
ZanSara	adf7e49af3	chore: review `all` extra (#6029 )	2023-10-12 21:50:53 +02:00
Stefano Fiorucci	2c2549f13d	move embedding backends (#6033 )	2023-10-12 17:52:28 +02:00
Vladimir Blagojevic	d51be9edac	Add top_k to SimilarityRanker (#6036 )	2023-10-12 13:52:01 +02:00
Vladimir Blagojevic	3803d23ff6	feat: Update `PyPDFToDocument` to process `ByteStream` inputs (#6021 ) * Update PyPDF converter * Add mixed source unit test * Update haystack/preview/components/file_converters/pypdf.py Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-10-11 10:52:08 +02:00
Vladimir Blagojevic	1a6a8863e8	feat: Update `HTMLToDocument` to handle `ByteStream` inputs (#6020 ) * Update HTML converter * Add mixed source unit test * Update haystack/preview/components/file_converters/html.py Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-10-11 10:15:58 +02:00
Vladimir Blagojevic	6a50123b9f	feat: Adjust LinkContentFetcher run method, use ByteStream (#5972 )	2023-10-10 17:48:31 +02:00
Vladimir Blagojevic	98215aec0d	feat: Rename `FileExtensionRouter` to `FileTypeRouter`, handle ByteStream(s) (#5998 ) Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-10-10 09:14:04 +02:00
DanShatford	07048791aa	feat: allow list of file paths in `convert_files_to_docs` (#5961 ) * feat: allow list of file paths in `convert_files_to_docs` * Fix validation * Fix check errors	2023-10-09 20:19:03 +02:00
Vladimir Blagojevic	40b83d8a47	feat: Add TopPSampler Haystack 2.0 component (#5924 )	2023-10-09 13:44:01 +02:00
Vladimir Blagojevic	1cdff6427e	feat: Add SimilarityRanker to Haystack 2.0 (#5923 ) * Initial SimilarityRanker	2023-10-06 16:01:34 +02:00
Stefano Fiorucci	ccc9f010bb	fix: fix ChatGPT invocation layer (and add async support) (#5979 ) * ChatGPT async * release note * fix tests	2023-10-05 18:43:26 +02:00
Vladimir Blagojevic	282419d82b	feat: Unfreeze Document in Haystack 2.0 (#5974 ) * Unfreeze document * Remove immutability test	2023-10-05 17:55:07 +02:00
Tobias Wochinger	d5d3a9eef4	chore: adapt deepset cloud sdk endpoint format for saving pipelines (#5969 ) * chore: adapt to new endpoints formats * docs: add release notes	2023-10-05 08:56:28 +02:00
Massimiliano Pippi	c2ec3f5fde	feat: add File type to preview package (#5873 ) * add Blob type * review feedback * fix tests and naming * Update add-blob-type-2a9476a39841f54d.yaml * removed unused import --------- Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>	2023-10-04 17:23:12 +02:00
Stefano Fiorucci	cc70b4b613	deprecation (#5954 )	2023-10-03 12:48:06 +02:00
Massimiliano Pippi	ac408134f4	feat: add support for async openai calls (#5946 ) * add support for async openai calls * add actual async call * split the async api * ask permission * Update haystack/utils/openai_utils.py Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * Fix OpenAI content moderation tests * Fix ChatGPT invocation layer tests --------- Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>	2023-10-03 10:42:21 +02:00
Massimiliano Pippi	0947f59545	feat: add async PromptNode run (#5890 ) * add async promptnode * Remove unecessary calls to dict.keys() --------- Co-authored-by: Silvano Cerza <silvanocerza@gmail.com> Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>	2023-09-29 08:40:01 +02:00
Vladimir Blagojevic	e882a7d5c8	feat: Add HTMLToDocument component (v2) (#5907 )	2023-09-28 17:22:28 +02:00
Stefano Fiorucci	d4aacad5f9	feat: `OpenAIDocumentEmbedder` (#5822 ) * first draft * release note * mypy fix * fix test * corrections * pr feedback * better secrets handling and new tests * missing imports in embedders/__init__.py * better format condition * address feedback	2023-09-28 15:42:51 +02:00
ZanSara	83724b74e3	feat: Make `metadata` optional in AnswerBuilder (#5909 ) * optional metadata * improve docstring	2023-09-28 14:42:19 +02:00
Stefano Fiorucci	9340c572f9	alternative skipif conditions in azure ocr converter test (#5906 )	2023-09-28 12:09:19 +02:00
Julian Risch	4413675e64	feat: Add TextDocumentSplitter that splits by word, sentence, passage (2.0) (#5870 ) * draft split by word, sentence, passage * naive way to split sentences without nltk * reno * add tests * make input list of docs, review feedback * add source_id and more validation * update docstrings * add split delimiters back to strings --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-27 12:26:20 +02:00
bogdankostic	80192589b1	feat: Add `AzureOCRDocumentConverter` (2.0) (#5855 ) * Add AzureOCRDocumentConverter * Add tests * Add release note * Formatting * update docstrings * Apply suggestions from code review Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> * PR feedback * PR feedback * PR feedback * Add secrets as environment variables * Adapt test * Add azure dependency to CI * Add azure dependency to CI --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-26 15:57:55 +02:00
Stefano Fiorucci	6aa471ac5e	chore: make preview integration tests reproducible (#5871 ) * relax extractive reader integration tests * force reader to CPU * ensure integration tests reproducibility * move set_all_seeds to testing package	2023-09-25 18:39:10 +02:00
bogdankostic	9a4373bf8e	feat: Add `TikaDocumentConverter` (2.0) (#5847 ) * Add TikaFileToDocument component * Add tests * Add tika service to CI * Add release note * Change name * PR feedback * Fix naming in tests * Fix tika version in CI * Update tests --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-25 11:47:21 +02:00
MichelBartels	4da43b6b05	Add link output to `SerperDevWebSearch` (#5853 ) * add link output * adjust tests * fix test * remove print statements --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-25 10:03:01 +02:00
Stefano Fiorucci	c0f22372d4	feat: `OpenAITextEmbedder` (#5801 ) * first draft * release notes * avoid serializing secrets * fix import order * simplify serialization * simplification * monkeypatch delenv * Update haystack/preview/components/embedders/openai_text_embedder.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * docstrings updates * fix test * Update haystack/preview/components/embedders/openai_text_embedder.py Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> * rm comment --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-09-22 21:54:11 +02:00
Massimiliano Pippi	a5a0dc9f87	feat: optionally pass an id to the Document constructor (#5862 ) * revert #5826 * do not use Optional	2023-09-22 11:09:59 +02:00
Silvano Cerza	cc4f95bf51	Remove unnecessary GPT4Generator class (#5863 ) * Remove GPT4Generator class * Rename GPT35Generator to GPTGenerator * Fix tests * Release notes	2023-09-22 11:05:06 +02:00
MichelBartels	f3dc9edd26	feat: initial ExtractiveReader implementation (#5553 ) * initial ExtractiveReader implementation * initial ExtractiveReader implementation * fix mypy * remove unused import * Use AutoTokenizer * rename reader to model * combine no-answer logit * support document slicing with proper probabilities * add variable stride * validate model * fix typo * make postprocessing easier to understand * remove debug code * set default reader * add ExtractiveReader to __init__ * remove validation * use new answer class * add batching * use v2 lazy imports * move reader * fix type hints * add doc strings * add nucleus sampling * fix types * fix doc string * add no_answer parameter * remove print statement * fix gpu support * turn into binary classification task * change dataclass so document does not need to be provided for no answer * add simple tests * add unit tests * rename reader folder to readers * add integration tests * fix type hints * add release notes * remove accidentally included test file * remove unnecessary __init__ file * revert __init__ file to main * rename test script by adding test_ prefix * undo accidentally moving of test script after renaming it * remove use of bisect * rename _flatten and _unflatten * make variable name more intuitive * remove type: ignore * fix mypy issue * refactor long tuple * add doc strings * explain HF test * remove unnecessary top_k check --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-21 12:16:51 +02:00

1524 Commits