{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# C-MTEB"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"C-MTEB is the largest benchmark for Chinese text embeddings, built in a similar fashion to MTEB. In this tutorial, we will walk through how to evaluate an embedding model's ability on Chinese tasks in C-MTEB."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 0. Installation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, install the required packages:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install FlagEmbedding mteb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"C-MTEB uses task splits and metrics similar to those of the English MTEB. It contains 35 datasets across 6 different tasks: Classification, Clustering, Pair Classification, Reranking, Retrieval, and Semantic Textual Similarity (STS). \n",
"\n",
"1. **Classification**: The embeddings are used to train a logistic regression classifier on the train set, which is then scored on the test set. F1 is the main metric.\n",
"2. **Clustering**: Train a mini-batch k-means model with batch size 32 and k equal to the number of different labels, then score using v-measure.\n",
"3. **Pair Classification**: A pair of text inputs is provided, and a binary label needs to be assigned. The main metric is average precision.\n",
"4. **Reranking**: Rank a list of relevant and irrelevant reference texts according to a query. Metrics are mean MRR@k and MAP.\n",
"5. **Retrieval**: Each dataset comprises a corpus, queries, and a mapping that links each query to its relevant documents within the corpus. The goal is to retrieve relevant documents for each query. The main metric is nDCG@k. MTEB directly adopts BEIR for the retrieval task.\n",
"6. **Semantic Textual Similarity (STS)**: Determine the similarity between each sentence pair. Spearman correlation based on cosine similarity serves as the main metric.\n",
"\n",
"\n",
"Check the [HF page](https://huggingface.co/C-MTEB) for the details of each dataset."
]
},
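{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of how the STS scoring above works (the `emb1`, `emb2`, and `gold` arrays here are synthetic placeholders, not real C-MTEB data), the main metric is simply the Spearman correlation between the cosine similarities of the embedding pairs and the gold similarity labels:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.stats import spearmanr\n",
"\n",
"rng = np.random.default_rng(0)\n",
"emb1 = rng.standard_normal((8, 16))  # embeddings of the first sentences\n",
"emb2 = rng.standard_normal((8, 16))  # embeddings of the second sentences\n",
"gold = rng.random(8)                 # gold similarity labels\n",
"\n",
"# cosine similarity of each pair, then Spearman correlation against the labels\n",
"cos = (emb1 * emb2).sum(axis=1) / (\n",
"    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)\n",
")\n",
"score, _ = spearmanr(cos, gold)\n",
"```"
]
},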
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ChineseTaskList = [\n",
" 'TNews', 'IFlyTek', 'MultilingualSentiment', 'JDReview', 'OnlineShopping', 'Waimai',\n",
" 'CLSClusteringS2S.v2', 'CLSClusteringP2P.v2', 'ThuNewsClusteringS2S.v2', 'ThuNewsClusteringP2P.v2',\n",
" 'Ocnli', 'Cmnli',\n",
" 'T2Reranking', 'MMarcoReranking', 'CMedQAv1-reranking', 'CMedQAv2-reranking',\n",
" 'T2Retrieval', 'MMarcoRetrieval', 'DuRetrieval', 'CovidRetrieval', 'CmedqaRetrieval', 'EcomRetrieval', 'MedicalRetrieval', 'VideoRetrieval',\n",
" 'ATEC', 'BQ', 'LCQMC', 'PAWSX', 'STSB', 'AFQMC', 'QBQTC'\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, load the model for evaluation. Note that the instruction here is used for retrieval tasks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# requires the C_MTEB directory from the FlagEmbedding repo to be importable\n",
"from C_MTEB.flag_dres_model import FlagDRESModel\n",
"\n",
"instruction = \"为这个句子生成表示以用于检索相关文章:\"\n",
"model_name = \"BAAI/bge-base-zh-v1.5\"\n",
"\n",
"model = FlagDRESModel(model_name_or_path=model_name,\n",
" query_instruction_for_retrieval=instruction,\n",
" pooling_method=\"cls\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, you can load a model using sentence_transformers:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sentence_transformers import SentenceTransformer\n",
"\n",
"model = SentenceTransformer(\"PATH_TO_MODEL\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or implement a class following the structure below:\n",
"\n",
"```python\n",
"class MyModel():\n",
" def __init__(self):\n",
" \"\"\"initialize the tokenizer and model\"\"\"\n",
" pass\n",
"\n",
" def encode(self, sentences, batch_size=32, **kwargs):\n",
" \"\"\" Returns a list of embeddings for the given sentences.\n",
" Args:\n",
" sentences (`List[str]`): List of sentences to encode\n",
" batch_size (`int`): Batch size for the encoding\n",
"\n",
" Returns:\n",
" `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences\n",
" \"\"\"\n",
" pass\n",
"\n",
"model = MyModel()\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Evaluate"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After we've prepared the datasets and model, we can start the evaluation. For time efficiency, we highly recommend using a GPU for evaluation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import mteb\n",
"from mteb import MTEB\n",
"\n",
"tasks = mteb.get_tasks(tasks=ChineseTaskList)\n",
"\n",
"for task in tasks:\n",
" evaluation = MTEB(tasks=[task])\n",
" evaluation.run(model, output_folder=f\"zh_results/{model_name.split('/')[-1]}\")"
]
},
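{
"cell_type": "markdown",
"metadata": {},
"source": [
"The evaluation loop above writes one JSON result file per task. A minimal sketch for gathering those files afterwards (assuming the same `zh_results` output folder as above; the exact JSON layout may vary across mteb versions, so the loaded content is kept as-is):\n",
"\n",
"```python\n",
"import json\n",
"from pathlib import Path\n",
"\n",
"# collect every result file under the output folder, keyed by file name\n",
"scores = {}\n",
"for path in sorted(Path(\"zh_results\").glob(\"**/*.json\")):\n",
"    with open(path) as f:\n",
"        scores[path.stem] = json.load(f)\n",
"```"
]
},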
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Submit to MTEB Leaderboard"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After the evaluation is done, all the evaluation results should be stored in `zh_results/{model_name}/`.\n",
"\n",
"Then run the following shell command to create `model_card.md`. Change {model_name} and the rest of the path to wherever your results are stored."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!mteb create_meta --results_folder zh_results/{model_name}/ --output_path model_card.md"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copy and paste the contents of model_card.md to the top of the README.md of your model on the HF Hub. Then go to the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) and choose the Chinese leaderboard to find your model! It will appear shortly after the website's daily refresh."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}