mirror of
https://github.com/FlagOpen/FlagEmbedding.git
synced 2025-06-27 02:39:58 +00:00
242 lines
6.4 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# BGE-Code-v1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -U FlagEmbedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Model | Language | Parameters | Model Size | Description | Base Model |\n",
    "|:-------|:--------:|:--------------:|:--------------:|:-----------------:|:----------------:|\n",
    "| [BAAI/bge-code-v1](https://huggingface.co/BAAI/bge-code-v1) | Multilingual | 1.54B | 6.18 GB | LLM-based code embedding model with strong text retrieval and multilingual capabilities. | Qwen2.5-Coder-1.5B |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**[BGE-Code-v1](https://huggingface.co/BAAI/bge-code-v1)** is an LLM-based code embedding model that supports code retrieval, text retrieval, and multilingual retrieval. Its main strengths are:\n",
    "- Superior code retrieval performance: the model supports natural language queries in both English and Chinese across 20 programming languages.\n",
    "- Robust text retrieval: the model remains competitive with text embedding models of a similar scale.\n",
    "- Extensive multilingual support: BGE-Code-v1 retrieves well in English, Chinese, Japanese, French, and more."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from transformers import AutoTokenizer, AutoModel\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"BAAI/bge-code-v1\")\n",
    "raw_model = AutoModel.from_pretrained(\"BAAI/bge-code-v1\")\n",
    "\n",
    "raw_model.eval()"
   ]
  },
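  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of what the FlagEmbedding wrapper (used below) does under the hood, the raw model's hidden states can be turned into an embedding. We assume last-token pooling followed by L2 normalization here, as is common for LLM-based embedders; prefer the wrapper for real use, since it also handles instruction formatting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def last_token_pool(last_hidden_state, attention_mask):\n",
    "    # index of the last non-padding token in each sequence\n",
    "    last_idx = attention_mask.sum(dim=1) - 1\n",
    "    return last_hidden_state[torch.arange(last_hidden_state.size(0)), last_idx]\n",
    "\n",
    "with torch.no_grad():\n",
    "    inputs = tokenizer([\"def add(a, b): return a + b\"], return_tensors=\"pt\", padding=True)\n",
    "    hidden = raw_model(**inputs).last_hidden_state\n",
    "    emb = torch.nn.functional.normalize(last_token_pool(hidden, inputs[\"attention_mask\"]), dim=-1)\n",
    "emb.shape"
   ]
  },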
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Usage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given the following tiny corpus:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "corpus = [\"\"\"\n",
    "def func_1(arr, target):\n",
    "    low, high = 0, len(arr) - 1\n",
    "    while low <= high:\n",
    "        mid = (low + high) // 2\n",
    "        if arr[mid] == target: return mid\n",
    "        elif arr[mid] < target: low = mid + 1\n",
    "        else: high = mid - 1\n",
    "    return -1\n",
    "\"\"\",\n",
    "\"\"\"\n",
    "def func_2(n, memo={}):\n",
    "    if n <= 1: return n\n",
    "    if n not in memo:\n",
    "        memo[n] = func_2(n-1, memo) + func_2(n-2, memo)\n",
    "    return memo[n]\n",
    "\"\"\",\n",
    "\"\"\"\n",
    "def func_3(a, b):\n",
    "    while b:\n",
    "        a, b = b, a % b\n",
    "    return a\n",
    "\"\"\",\n",
    "\"\"\"\n",
    "def func_4(n):\n",
    "    if n < 2: return False\n",
    "    for i in range(2, int(n**0.5) + 1):\n",
    "        if n % i == 0: return False\n",
    "    return True\n",
    "\"\"\",\n",
    "\"\"\"\n",
    "int func_5(const vector<int>& arr, int target) {\n",
    "    int low = 0, high = arr.size() - 1;\n",
    "    while (low <= high) {\n",
    "        int mid = low + (high - low) / 2;\n",
    "        if (arr[mid] == target) return mid;\n",
    "        else if (arr[mid] < target) low = mid + 1;\n",
    "        else high = mid - 1;\n",
    "    }\n",
    "    return -1;\n",
    "}\n",
    "\"\"\"\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to find the answer to the following question:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"The fastest way to find an element in a sorted array\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 6.08it/s]\n"
     ]
    }
   ],
   "source": [
    "from FlagEmbedding import FlagLLMModel\n",
    "\n",
    "model = FlagLLMModel('BAAI/bge-code-v1',\n",
    "                     query_instruction_format=\"<instruct>{}\\n<query>{}\",\n",
    "                     query_instruction_for_retrieval=\"Given a question in text, retrieve code snippets that are relevant to the question.\",\n",
    "                     trust_remote_code=True,\n",
    "                     devices=0,\n",
    "                     use_fp16=True)  # use_fp16=True speeds up computation at the cost of a slight performance degradation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((1536,), (5, 1536))"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_emb = model.encode_queries(query)\n",
    "corpus_emb = model.encode_corpus(corpus)\n",
    "query_emb.shape, corpus_emb.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.4553 0.2172 0.2277 0.196 0.4355]\n"
     ]
    }
   ],
   "source": [
    "similarity = query_emb @ corpus_emb.T\n",
    "print(similarity)"
   ]
  },
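  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To turn the raw scores into a ranking, sort the corpus indices by descending similarity:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# rank corpus entries from most to least similar to the query\n",
    "for rank, idx in enumerate(np.argsort(-similarity), start=1):\n",
    "    print(f\"rank {rank}: corpus[{idx}] (score {similarity[idx]:.4f})\")"
   ]
  },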
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the candidates at indices 0 and 4, the Python and C++ implementations of binary search, have conspicuously higher similarity than the other candidates."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}