2025-01-23 17:52:22 +08:00

DataSet

The datasets listed here are the training data we use to train various models.

Dataset                        Introduction
MLDR                           Document retrieval dataset covering 13 languages
bge-m3-data                    Fine-tuning data used by bge-m3
public-data                    Public data identical to that used by e5-mistral
full-data                      The full dataset we used for training bge-en-icl
bge-multilingual-gemma2-data   The full multilingual dataset we used for training bge-multilingual-gemma2
reranker-data                  A mixture of multilingual datasets