# BGE-Code-v1

## 0. Installation

In [None]:
%pip install -U FlagEmbedding

## 1. Introduction

| Model | Language | Parameters | Model Size | Description | Base Model |
|:-------|:--------:|:--------------:|:--------------:|:-----------------:|:----------------:|
| [BAAI/bge-code-v1](https://huggingface.co/BAAI/bge-code-v1) | Multi-lingual | 1.54B | 6.18 GB | LLM-based code embedding model with strong text retrieval and multilingual capabilities. | Qwen-2.5-Coder-1.5B |

**[BGE-Code-v1](https://huggingface.co/BAAI/bge-code-v1)** is an LLM-based code embedding model that supports code retrieval, text retrieval, and multilingual retrieval. Its main capabilities are:
- Code retrieval: strong code retrieval performance, handling natural language queries in both English and Chinese and covering 20 programming languages.
- Text retrieval: text retrieval quality comparable to text embedding models of similar scale.
- Multilingual retrieval: support for retrieval in many languages, including English, Chinese, Japanese, and French.

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch, os

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-code-v1")
raw_model = AutoModel.from_pretrained("BAAI/bge-code-v1")

raw_model.eval()

## 2. Usage

Given the following tiny corpus:

In [21]:
corpus = ["""
def func_1(arr, target):
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target: return mid
        elif arr[mid] < target: low = mid + 1
        else: high = mid - 1
    return -1
""",
"""
def func_2(n, memo={}):
    if n <= 1: return n
    if n not in memo:
        memo[n] = fib(n-1, memo) + fib(n-2, memo)
    return memo[n]
""",
"""
def func_3(a, b):
    while b:
        a, b = b, a % b
    return a
""",
"""
def func_4(n):
    if n < 2: return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0: return False
    return True
""",
"""
int func_5(const vector<int>& arr, int target) {
    int low = 0, high = arr.size() - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;
        if (arr[mid] == target) return mid;
        else if (arr[mid] < target) low = mid + 1;
        else high = mid - 1;
    }
    return -1;
}
"""
]

We want to find the answer to the following question:

In [22]:
query = "The fastest way to find an element in a sorted array"

In [23]:
from FlagEmbedding import FlagLLMModel

model = FlagLLMModel('BAAI/bge-code-v1',
                     query_instruction_format="{}\n{}",
                     # The instruction describes the retrieval task; adapt the wording to your use case
                     query_instruction_for_retrieval="Given a question in text, retrieve SQL queries that are appropriate responses to the question.",
                     trust_remote_code=True,
                     devices=0,
                     use_fp16=True)  # fp16 speeds up computation at a slight cost in accuracy

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 6.08it/s]
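
To make the two instruction arguments more concrete: `query_instruction_format` is a template with two `{}` placeholders. The sketch below assumes the placeholders are filled with the instruction first and the query second via `str.format`; that ordering is an assumption about the library's internals, and `build_prompt` is a hypothetical helper, not part of FlagEmbedding.

```python
# Hypothetical helper: shows how a "{}\n{}" template could combine
# an instruction with a query (assumed order: instruction, then query).
def build_prompt(template: str, instruction: str, query: str) -> str:
    return template.format(instruction, query)


instruction = (
    "Given a question in text, retrieve SQL queries that are "
    "appropriate responses to the question."
)
query = "The fastest way to find an element in a sorted array"

prompt = build_prompt("{}\n{}", instruction, query)
print(prompt)
```

Under this assumption the model sees the instruction on the first line and the query on the second; corpus entries are encoded without any instruction.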


In [24]:
query_emb = model.encode_queries(query)
corpus_emb = model.encode_corpus(corpus)
query_emb.shape, corpus_emb.shape

((1536,), (5, 1536))
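
The shapes show one 1536-dimensional vector for the query and one per corpus entry, so scoring the query against the whole corpus is a single matrix-vector product. If the vectors are L2-normalized (assumed here; check whether your model normalizes its output), the dot product equals cosine similarity. A minimal sketch with small made-up vectors rather than real embeddings:

```python
import numpy as np

# Toy stand-ins for embeddings (made-up numbers, 3 dims instead of 1536).
toy_query = np.array([1.0, 2.0, 2.0])
toy_corpus = np.array([[2.0, 4.0, 4.0],
                       [0.0, 3.0, -3.0]])


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


q = l2_normalize(toy_query)
c = l2_normalize(toy_corpus)

# Dot product of unit vectors ...
dot_scores = q @ c.T

# ... matches cosine similarity computed from the raw vectors.
cos_scores = (toy_query @ toy_corpus.T) / (
    np.linalg.norm(toy_query) * np.linalg.norm(toy_corpus, axis=1)
)

print(dot_scores.shape)
print(np.allclose(dot_scores, cos_scores))
```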

In [25]:
similarity = query_emb @ corpus_emb.T
print(similarity)

[0.4553 0.2172 0.2277 0.196 0.4355]


We can see that the elements at index 0 and 4, which are the binary search implementations in Python and C++, have clearly higher similarity than the other candidates.
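
To turn the scores into a ranked list, one option is to sort the corpus indices by descending similarity. The sketch below copies the scores printed above into a literal array so it runs without loading the model; `top_k` and `ranked` are illustrative names.

```python
import numpy as np

# Scores copied from the printed similarity output.
scores = np.array([0.4553, 0.2172, 0.2277, 0.196, 0.4355])

top_k = 2
# argsort sorts ascending, so reverse it for best-first order.
ranked = np.argsort(scores)[::-1]

for idx in ranked[:top_k]:
    print(f"corpus[{idx}] score={scores[idx]:.4f}")
```

In the notebook itself, `scores` would simply be the `similarity` array, and `corpus[idx]` gives the matching snippet.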