FlagEmbedding/Tutorials/1_Embedding/1.2.7_BGE_Code_v1.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# BGE-Code-v1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -U FlagEmbedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Model  | Language |   Parameters   |   Model Size   |    Description    |   Base Model     |\n",
    "|:-------|:--------:|:--------------:|:--------------:|:-----------------:|:----------------:|\n",
    "| [BAAI/bge-code-v1](https://huggingface.co/BAAI/bge-code-v1)                  |    Multi-lingual     |   1.54B   |  6.18 GB  |  LLM-based code embedding model with strong text retrieval and multilingual capabilities. | Qwen-2.5-Coder-1.5B |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**[BGE-Code-v1](https://huggingface.co/BAAI/bge-code-v1)** is an LLM-based code embedding model that supports code retrieval, text retrieval, and multilingual retrieval. It primarily demonstrates the following capabilities:\n",
    "- Superior Code Retrieval Performance: The model demonstrates exceptional code retrieval capabilities, supporting natural language queries in both English and Chinese, as well as 20 programming languages.\n",
    "- Robust Text Retrieval Capabilities: The model maintains strong text retrieval capabilities comparable to text embedding models of similar scale.\n",
    "- Extensive Multilingual Support: BGE-Code-v1 offers comprehensive multilingual retrieval capabilities, excelling in languages such as English, Chinese, Japanese, French, and more."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer, AutoModel\n",
    "import torch, os\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"BAAI/bge-code-v1\")\n",
    "raw_model = AutoModel.from_pretrained(\"BAAI/bge-code-v1\")\n",
    "\n",
    "raw_model.eval()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Usage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given the following tiny corpus:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "corpus = [\"\"\"\n",
    "def func_1(arr, target):\n",
    "    low, high = 0, len(arr) - 1\n",
    "    while low <= high:\n",
    "        mid = (low + high) // 2\n",
    "        if arr[mid] == target: return mid\n",
    "        elif arr[mid] < target: low = mid + 1\n",
    "        else: high = mid - 1\n",
    "    return -1\n",
    "\"\"\",\n",
    "\"\"\"\n",
    "def func_2(n, memo={}):\n",
    "    if n <= 1: return n\n",
    "    if n not in memo:\n",
    "        memo[n] = fib(n-1, memo) + fib(n-2, memo)\n",
    "    return memo[n]\n",
    "\"\"\",\n",
    "\"\"\"\n",
    "def func_3(a, b):\n",
    "    while b:\n",
    "        a, b = b, a % b\n",
    "    return a\n",
    "\"\"\",\n",
    "\"\"\"\n",
    "def func_4(n):\n",
    "    if n < 2: return False\n",
    "    for i in range(2, int(n**0.5) + 1):\n",
    "        if n % i == 0: return False\n",
    "    return True\n",
    "\"\"\",\n",
    "\"\"\"\n",
    "int func_5(const vector<int>& arr, int target) {\n",
    "    int low = 0, high = arr.size() - 1;\n",
    "    while (low <= high) {\n",
    "        int mid = low + (high - low) / 2;\n",
    "        if (arr[mid] == target) return mid;\n",
    "        else if (arr[mid] < target) low = mid + 1;\n",
    "        else high = mid - 1;\n",
    "    }\n",
    "    return -1;\n",
    "}\n",
    "\"\"\"\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to find the answer to the following question:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"The fastest way to find an element in a sorted array\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  6.08it/s]\n"
     ]
    }
   ],
   "source": [
    "from FlagEmbedding import FlagLLMModel\n",
    "\n",
    "model = FlagLLMModel('BAAI/bge-code-v1', \n",
    "                     query_instruction_format=\"<instruct>{}\\n<query>{}\",\n",
    "                     query_instruction_for_retrieval=\"Given a question in text, retrieve SQL queries that are appropriate responses to the question.\",\n",
    "                     trust_remote_code=True,\n",
    "                     devices=0,\n",
    "                     use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((1536,), (5, 1536))"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_emb = model.encode_queries(query)\n",
    "corpus_emb = model.encode_corpus(corpus)\n",
    "query_emb.shape, corpus_emb.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.4553 0.2172 0.2277 0.196  0.4355]\n"
     ]
    }
   ],
   "source": [
    "similarity = query_emb @ corpus_emb.T\n",
    "print(similarity)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the elements with index 0 and 5, which are the implementation of binary search in Python and C++, have conspicuously higher similarity than other candidates."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
update tutorials 2025-06-04 17:27:36 +08:00			`{`
			`"cells": [`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# BGE-Code-v1"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## 0. Installation"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"%pip install -U FlagEmbedding"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## 1. Introduction"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"\| Model \| Language \| Parameters \| Model Size \| Description \| Base Model \|\n",`
			`"\|:-------\|:--------:\|:--------------:\|:--------------:\|:-----------------:\|:----------------:\|\n",`
			`"\| [BAAI/bge-code-v1](https://huggingface.co/BAAI/bge-code-v1) \| Multi-lingual \| 1.54B \| 6.18 GB \| LLM-based code embedding model with strong text retrieval and multilingual capabilities. \| Qwen-2.5-Coder-1.5B \|"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"[BGE-Code-v1](https://huggingface.co/BAAI/bge-code-v1) is an LLM-based code embedding model that supports code retrieval, text retrieval, and multilingual retrieval. It primarily demonstrates the following capabilities:\n",`
			`"- Superior Code Retrieval Performance: The model demonstrates exceptional code retrieval capabilities, supporting natural language queries in both English and Chinese, as well as 20 programming languages.\n",`
			`"- Robust Text Retrieval Capabilities: The model maintains strong text retrieval capabilities comparable to text embedding models of similar scale.\n",`
			`"- Extensive Multilingual Support: BGE-Code-v1 offers comprehensive multilingual retrieval capabilities, excelling in languages such as English, Chinese, Japanese, French, and more."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"from transformers import AutoTokenizer, AutoModel\n",`
			`"import torch, os\n",`
			`"\n",`
			`"tokenizer = AutoTokenizer.from_pretrained(\"BAAI/bge-code-v1\")\n",`
			`"raw_model = AutoModel.from_pretrained(\"BAAI/bge-code-v1\")\n",`
			`"\n",`
			`"raw_model.eval()"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## 2. Usage"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Given the following tiny corpus:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 21,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"corpus = [\"\"\"\n",`
			`"def func_1(arr, target):\n",`
			`" low, high = 0, len(arr) - 1\n",`
			`" while low <= high:\n",`
			`" mid = (low + high) // 2\n",`
			`" if arr[mid] == target: return mid\n",`
			`" elif arr[mid] < target: low = mid + 1\n",`
			`" else: high = mid - 1\n",`
			`" return -1\n",`
			`"\"\"\",\n",`
			`"\"\"\"\n",`
			`"def func_2(n, memo={}):\n",`
			`" if n <= 1: return n\n",`
			`" if n not in memo:\n",`
			`" memo[n] = fib(n-1, memo) + fib(n-2, memo)\n",`
			`" return memo[n]\n",`
			`"\"\"\",\n",`
			`"\"\"\"\n",`
			`"def func_3(a, b):\n",`
			`" while b:\n",`
			`" a, b = b, a % b\n",`
			`" return a\n",`
			`"\"\"\",\n",`
			`"\"\"\"\n",`
			`"def func_4(n):\n",`
			`" if n < 2: return False\n",`
			`" for i in range(2, int(n**0.5) + 1):\n",`
			`" if n % i == 0: return False\n",`
			`" return True\n",`
			`"\"\"\",\n",`
			`"\"\"\"\n",`
			`"int func_5(const vector<int>& arr, int target) {\n",`
			`" int low = 0, high = arr.size() - 1;\n",`
			`" while (low <= high) {\n",`
			`" int mid = low + (high - low) / 2;\n",`
			`" if (arr[mid] == target) return mid;\n",`
			`" else if (arr[mid] < target) low = mid + 1;\n",`
			`" else high = mid - 1;\n",`
			`" }\n",`
			`" return -1;\n",`
			`"}\n",`
			`"\"\"\"\n",`
			`"]"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"We want to find the answer to the following question:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 22,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"query = \"The fastest way to find an element in a sorted array\""`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 23,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stderr",`
			`"output_type": "stream",`
			`"text": [`
			`"Loading checkpoint shards: 100%\|██████████\| 2/2 [00:00<00:00, 6.08it/s]\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"from FlagEmbedding import FlagLLMModel\n",`
			`"\n",`
			`"model = FlagLLMModel('BAAI/bge-code-v1', \n",`
			`" query_instruction_format=\"<instruct>{}\\n<query>{}\",\n",`
			`" query_instruction_for_retrieval=\"Given a question in text, retrieve SQL queries that are appropriate responses to the question.\",\n",`
			`" trust_remote_code=True,\n",`
			`" devices=0,\n",`
			`" use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 24,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"((1536,), (5, 1536))"`
			`]`
			`},`
			`"execution_count": 24,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"query_emb = model.encode_queries(query)\n",`
			`"corpus_emb = model.encode_corpus(corpus)\n",`
			`"query_emb.shape, corpus_emb.shape"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 25,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"[0.4553 0.2172 0.2277 0.196 0.4355]\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"similarity = query_emb @ corpus_emb.T\n",`
			`"print(similarity)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"We can see that the elements with index 0 and 5, which are the implementation of binary search in Python and C++, have conspicuously higher similarity than other candidates."`
			`]`
			`}`
			`],`
			`"metadata": {`
			`"kernelspec": {`
			`"display_name": "Python 3",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
			`"version": "3.10.16"`
			`}`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 2`
			`}`