{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# BGE-Code-v1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Installation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -U FlagEmbedding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Model | Language | Parameters | Model Size | Description | Base Model |\n", "|:-------|:--------:|:--------------:|:--------------:|:-----------------:|:----------------:|\n", "| [BAAI/bge-code-v1](https://huggingface.co/BAAI/bge-code-v1) | Multi-lingual | 1.54B | 6.18 GB | LLM-based code embedding model with strong text retrieval and multilingual capabilities. | Qwen-2.5-Coder-1.5B |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**[BGE-Code-v1](https://huggingface.co/BAAI/bge-code-v1)** is an LLM-based code embedding model that supports code retrieval, text retrieval, and multilingual retrieval. It primarily demonstrates the following capabilities:\n", "- Superior Code Retrieval Performance: The model demonstrates exceptional code retrieval capabilities, supporting natural language queries in both English and Chinese, as well as 20 programming languages.\n", "- Robust Text Retrieval Capabilities: The model maintains strong text retrieval capabilities comparable to text embedding models of similar scale.\n", "- Extensive Multilingual Support: BGE-Code-v1 offers comprehensive multilingual retrieval capabilities, excelling in languages such as English, Chinese, Japanese, French, and more." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import AutoTokenizer, AutoModel\n", "import torch\n", "\n", "# bge-code-v1 also loads as a standard transformers checkpoint\n", "tokenizer = AutoTokenizer.from_pretrained(\"BAAI/bge-code-v1\")\n", "raw_model = AutoModel.from_pretrained(\"BAAI/bge-code-v1\")\n", "\n", "# switch to inference mode (disables dropout)\n", "raw_model.eval()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Usage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given the following tiny corpus:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "corpus = [\"\"\"\n", "def func_1(arr, target):\n", "    low, high = 0, len(arr) - 1\n", "    while low <= high:\n", "        mid = (low + high) // 2\n", "        if arr[mid] == target: return mid\n", "        elif arr[mid] < target: low = mid + 1\n", "        else: high = mid - 1\n", "    return -1\n", "\"\"\",\n", "\"\"\"\n", "def func_2(n, memo={}):\n", "    if n <= 1: return n\n", "    if n not in memo:\n", "        memo[n] = func_2(n-1, memo) + func_2(n-2, memo)\n", "    return memo[n]\n", "\"\"\",\n", "\"\"\"\n", "def func_3(a, b):\n", "    while b:\n", "        a, b = b, a % b\n", "    return a\n", "\"\"\",\n", "\"\"\"\n", "def func_4(n):\n", "    if n < 2: return False\n", "    for i in range(2, int(n**0.5) + 1):\n", "        if n % i == 0: return False\n", "    return True\n", "\"\"\",\n", "\"\"\"\n", "int func_5(const vector<int>& arr, int target) {\n", "    int low = 0, high = arr.size() - 1;\n", "    while (low <= high) {\n", "        int mid = low + (high - low) / 2;\n", "        if (arr[mid] == target) return mid;\n", "        else if (arr[mid] < target) low = mid + 1;\n", "        else high = mid - 1;\n", "    }\n", "    return -1;\n", "}\n", "\"\"\"\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to find the answer to the following question:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "query = \"The fastest way to find an element in a sorted array\"" ] }, { "cell_type": "code", "execution_count": 23, "metadata": 
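{}, "outputs": [], "source": [ "# BGE-Code-v1 retrieves best when the query is prefixed with a task\n", "# instruction. FlagLLMModel (next cell) applies the template \"{}\\n{}\"\n", "# automatically; this sketch only shows what the final query text looks like.\n", "task = \"Given a question in text, retrieve code snippets that are appropriate responses to the question.\"\n", "prompt = \"{}\\n{}\".format(task, query)\n", "print(prompt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": 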
{}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 6.08it/s]\n" ] } ], "source": [ "from FlagEmbedding import FlagLLMModel\n", "\n", "model = FlagLLMModel('BAAI/bge-code-v1',\n", "                     query_instruction_format=\"{}\\n{}\",\n", "                     query_instruction_for_retrieval=\"Given a question in text, retrieve code snippets that are appropriate responses to the question.\",\n", "                     trust_remote_code=True,\n", "                     devices=0,\n", "                     use_fp16=True) # use_fp16=True speeds up computation at the cost of a slight drop in retrieval accuracy" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((1536,), (5, 1536))" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_emb = model.encode_queries(query)\n", "corpus_emb = model.encode_corpus(corpus)\n", "query_emb.shape, corpus_emb.shape" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.4553 0.2172 0.2277 0.196 0.4355]\n" ] } ], "source": [ "# the embeddings are normalized, so the dot product is cosine similarity\n", "similarity = query_emb @ corpus_emb.T\n", "print(similarity)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The candidates at indices 0 and 4, the Python and C++ implementations of binary search, have noticeably higher similarity scores than the other snippets." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 2 }