haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-09-21 22:23:23 +00:00

Author	SHA1	Message	Date
Stefano Fiorucci	fb96aef4dd	refactor!: move classifiers to an appropriate directory/package (#6240) * mv classifiers * release note	2023-11-06 12:00:01 +01:00
Stefano Fiorucci	063d27c522	refactor!: rename `TextDocumentSplitter` to `DocumentSplitter` (#6223) * rename TextDocumentSplitter to DocumentSplitter * reno * fix init	2023-11-03 11:33:20 +01:00
Julian Risch	29b1fefaa4	feat: Add DocumentLanguageClassifier 2.0 (#6037) * add DocumentLanguageClassifier and tests * reno * fix import, rename DocumentCleaner * mark example usage as python code * add assertions to e2e test * use deserialized document_store * Apply suggestions from code review Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> * remove from/to_dict * use renamed InMemoryDocumentStore * adapt to Document refactoring * improve docstring * fix test for new Document --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com> Co-authored-by: anakin87 <stefanofiorucci@gmail.com>	2023-10-31 15:35:05 +01:00
Silvano Cerza	7287657f0e	refactor: Rename `Document`'s `text` field to `content` (#6181) * Rework Document serialisation Make Document backward compatible Fix InMemoryDocumentStore filters Fix InMemoryDocumentStore.bm25_retrieval Add release notes Fix pylint failures Enhance Document kwargs handling and docstrings Rename Document's text field to content Fix e2e tests Fix SimilarityRanker tests Fix typo in release notes Rename Document's metadata field to meta (#6183) * fix bugs * make linters happy * fix * more fix * match regex --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-10-31 12:44:04 +01:00
Silvano Cerza	3d69094f9a	refactor: Remove `id_hash_keys` from `TextDocumentSplitter` (#6124) * Remove id_hash_keys from DocumentCleaner * Remove id_hash_keys from TextDocumentSplitter	2023-10-20 15:18:28 +02:00
Silvano Cerza	ec376c7dbd	Remove id_hash_keys from DocumentCleaner (#6123)	2023-10-20 15:16:06 +02:00
Julian Risch	9f3b6512be	refactor: Remove reimplementations of default `from_dict`/`to_dict` and corresponding tests in 2.0 (#6108) * whisper transcriber * remove from/to_dict from builders * remove from/to_dict from embedders * remove from/to_dict from fetcher, file_converters * remove from/to_dict from generators, preprocessors * remove from/to_dict from ranker, reader * remove from/to_dict from router, sampler, websearch * pylint * reno * refactor import * remove unused import	2023-10-19 11:17:02 +02:00
Silvano Cerza	ec9f898cd6	fix: Fix TextDocumentSplitter failing if run with empty list (#6081) * Fix TextDocumentSplitter failing if run with empty list * Release notes * Simplify check * Enhance test	2023-10-17 11:25:28 +02:00
Julian Risch	90ddeba579	fix: DocumentSplitter and DocumentCleaner copy `id_hash_keys` to newly created Documents (#6083) * copy id_hash_keys in splitter and cleaner * reno	2023-10-17 11:03:48 +02:00
Julian Risch	aaee03aee8	feat: Add DocumentCleaner 2.0 (#5976) * remove whitespaces, substrings, regex, empty lines * remove repeated substrings * reno * return empty string as shortest common ngram * address first half of review feedback * address second half of review feedback * mention \f page separator for header/footer removal * mention \f page separator for header/footer removal * mark example usage as python code	2023-10-13 12:39:55 +02:00
Julian Risch	b507f1a124	feat: Add TextLanguageClassifier 2.0 (#6026) * draft TextLanguageClassifier * implement language detection with langdetect * add unit test for logging message * reno * pylint * change input from List[str] to str * remove empty output connections * add from_dict/to_dict tests * mark example usage as python code	2023-10-13 10:30:49 +02:00
Julian Risch	4413675e64	feat: Add TextDocumentSplitter that splits by word, sentence, passage (2.0) (#5870) * draft split by word, sentence, passage * naive way to split sentences without nltk * reno * add tests * make input list of docs, review feedback * add source_id and more validation * update docstrings * add split delimiters back to strings --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-27 12:26:20 +02:00

12 Commits