haystack/split-by-token-b9a4f954d4077ecc.yaml at main

yujunjun/haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-06-26 22:00:13 +00:00

Ben Heckmann a492771b4d

feat: PreProcessor split by token (tiktoken & Hugging Face) (#5276)

* #4983 implemented split by token for tiktoken tokenizer

* #4983 added unit test for tiktoken splitting

* #4983 implemented and added a test for splitting documents with HuggingFace tokenizer

* #4983 added support for passing HF model names (instead of objects) and added an example to the HF token splitting test

* mocked HTTP model loading in unit tests, fixed pylint error

* fix lossy tokenizers splitting, use LazyImport, ignore UnicodeEncodeError for tiktoken

* reno

* rename reno file

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

2023-11-23 12:26:37 +01:00



	features:
	- Add `split_length` by token in PreProcessor
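
The release note above describes counting `split_length` in tokens rather than words or sentences. As a rough illustration of that idea only, the sketch below chunks a token list into windows of at most `split_length` tokens. It is not Haystack's implementation: `split_by_token`, `toy_tokenize`, and the `split_overlap` handling are hypothetical names and choices made for this example, and the toy regex tokenizer merely stands in for a real tokenizer such as tiktoken or a Hugging Face tokenizer.

```python
import re
from typing import Callable, List


def toy_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in tokenizer: each token is a word plus the
    # whitespace around it, so joining tokens gives back the input text.
    return re.findall(r"\s*\S+\s*", text)


def split_by_token(
    text: str,
    tokenize: Callable[[str], List[str]],
    split_length: int,
    split_overlap: int = 0,
) -> List[str]:
    """Split text into chunks of at most split_length tokens (illustrative helper)."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    tokens = tokenize(text)
    step = split_length - split_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append("".join(tokens[start:start + split_length]))
        # Stop once a window has reached the end of the token list,
        # so overlapping windows do not emit a redundant tail chunk.
        if start + split_length >= len(tokens):
            break
    return chunks


chunks = split_by_token("a b c d e", toy_tokenize, split_length=2)
overlapping = split_by_token("a b c d e", toy_tokenize, split_length=2, split_overlap=1)
```

With a real subword tokenizer the tokens would need to be decoded back to text instead of joined, and the commit message notes that lossy tokenizers required special handling, which this sketch does not attempt.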