5 Commits

Author SHA1 Message Date
Silvano Cerza
7287657f0e
refactor: Rename Document's text field to content (#6181)
* Rework Document serialisation

Make Document backward compatible

Fix InMemoryDocumentStore filters

Fix InMemoryDocumentStore.bm25_retrieval

Add release notes

Fix pylint failures

Enhance Document kwargs handling and docstrings

Rename Document's text field to content

Fix e2e tests

Fix SimilarityRanker tests

Fix typo in release notes

Rename Document's metadata field to meta (#6183)

* fix bugs

* make linters happy

* fix

* more fix

* match regex

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-31 12:44:04 +01:00
Silvano Cerza
ec376c7dbd
Remove id_hash_keys from DocumentCleaner (#6123) 2023-10-20 15:16:06 +02:00
Julian Risch
9f3b6512be
refactor: Remove reimplementations of default from_dict/to_dict and corresponding tests in 2.0 (#6108)
* whisper transcriber

* remove from/to_dict from builders

* remove from/to_dict from embedders

* remove from/to_dict from fetcher, file_converters

* remove from/to_dict from generators, preprocessors

* remove from/to_dict from ranker, reader

* remove from/to_dict from router, sampler, websearch

* pylint

* reno

* refactor import

* remove unused import
2023-10-19 11:17:02 +02:00
Julian Risch
90ddeba579
fix: DocumentSplitter and DocumentCleaner copy id_hash_keys to newly created Documents (#6083)
* copy id_hash_keys in splitter and cleaner

* reno
2023-10-17 11:03:48 +02:00
Julian Risch
aaee03aee8
feat: Add DocumentCleaner 2.0 (#5976)
* remove whitespaces, substrings, regex, empty lines

* remove repeated substrings

* reno

* return empty string as shortest common ngram

* address first half of review feedback

* address second half of review feedback

* mention \f page separator for header/footer removal

* mention \f page separator for header/footer removal

* mark example usage as python code
2023-10-13 12:39:55 +02:00