mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-06-26 22:00:13 +00:00

* #4983 implemented split by token for tiktoken tokenizer
* #4983 added unit test for tiktoken splitting
* #4983 implemented and added a test for splitting documents with HuggingFace tokenizer
* #4983 added support for passing HF model names (instead of objects) and added an example to the HF token splitting test
* mocked HTTP model loading in unit tests, fixed pylint error
* fix lossy tokenizers splitting, use LazyImport, ignore UnicodeEncodeError for tiktoken
* reno
* rename reno file

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
3 lines
58 B
YAML
features:
  - Add `split_length` by token in PreProcessor
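To illustrate what counting `split_length` in tokens means, here is a minimal sketch. It is not the PreProcessor implementation: `split_by_token` and `whitespace_tokenize` are hypothetical names, and a plain whitespace tokenizer stands in for tiktoken or a HuggingFace tokenizer so the example runs without extra dependencies.

```python
from typing import Callable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): splits on whitespace only."""
    return text.split()


def split_by_token(
    text: str,
    split_length: int,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
    joiner: str = " ",
) -> List[str]:
    """Split `text` into chunks holding at most `split_length` tokens each."""
    if split_length <= 0:
        raise ValueError("split_length must be a positive integer")
    tokens = tokenize(text)
    # Walk the token list in strides of `split_length` and rebuild each chunk.
    return [
        joiner.join(tokens[start : start + split_length])
        for start in range(0, len(tokens), split_length)
    ]


chunks = split_by_token("one two three four five", split_length=2)
print(chunks)  # ['one two', 'three four', 'five']
```

With a subword tokenizer, rejoining tokens does not always reproduce the original text exactly, which is presumably what the "lossy tokenizers" fix in the commit message addresses; this sketch sidesteps that by using whole words.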