2025-02-22 17:28:45 -05:00

Dataset

This page points to the training data we use to train our various models.

Dataset                        Introduction
MLDR                           Document retrieval dataset covering 13 languages
bge-m3-data                    Fine-tuning data used by bge-m3
public-data                    Public data identical to e5-mistral
full-data                      The full dataset we used for training bge-en-icl
bge-multilingual-gemma2-data   The full multilingual dataset we used for training bge-multilingual-gemma2
reranker-data                  A mixture of multilingual datasets