update docs

ZiyiXia 2025-06-04 17:57:10 +08:00
parent 235f967a01
commit 875fd4ffcb
3 changed files with 85 additions and 0 deletions


@@ -0,0 +1,36 @@
BGE-Code-v1
===========
**`BGE-Code-v1 <https://huggingface.co/BAAI/bge-code-v1>`_** is an LLM-based code embedding model that supports code retrieval, text retrieval, and multilingual retrieval. Its main strengths are:

- Superior Code Retrieval Performance: The model excels at code retrieval, supporting natural language queries in both English and Chinese as well as 20 programming languages.
- Robust Text Retrieval Capabilities: The model maintains text retrieval performance comparable to text embedding models of similar scale.
- Extensive Multilingual Support: BGE-Code-v1 offers comprehensive multilingual retrieval, performing strongly in English, Chinese, Japanese, French, and more.

+-------------------------------------------------------------------+-----------------+------------+--------------+----------------------------------------------------------------------------------------------------+
| Model | Language | Parameters | Model Size | Description |
+===================================================================+=================+============+==============+====================================================================================================+
| `BAAI/bge-code-v1 <https://huggingface.co/BAAI/bge-code-v1>`_ | Multilingual | 1.5B | 6.18 GB | SOTA code retrieval model, with exceptional multilingual text retrieval performance as well |
+-------------------------------------------------------------------+-----------------+------------+--------------+----------------------------------------------------------------------------------------------------+
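
The ``query_instruction_format`` passed to ``FlagLLMModel`` in the example below is an ordinary Python format string that combines the task instruction with each query. As a rough sketch (the exact prompt assembly inside FlagEmbedding may differ), it expands like this:

.. code:: python

    # Illustration only: how the instruction template turns an instruction
    # and a query into a single prompt string. The exact prompt assembly
    # inside FlagEmbedding may differ.
    query_instruction_format = "<instruct>{}\n<query>{}"
    instruction = ("Given a question in text, retrieve SQL queries that are "
                   "appropriate responses to the question.")
    query = "Delete the record with ID 4 from the 'Staff' table."

    prompt = query_instruction_format.format(instruction, query)
    print(prompt)
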

.. code:: python

    from FlagEmbedding import FlagLLMModel

    queries = [
        "Delete the record with ID 4 from the 'Staff' table.",
        'Delete all records in the "Livestock" table where age is greater than 5'
    ]
    documents = [
        "DELETE FROM Staff WHERE StaffID = 4;",
        "DELETE FROM Livestock WHERE age > 5;"
    ]

    model = FlagLLMModel(
        'BAAI/bge-code-v1',
        query_instruction_format="<instruct>{}\n<query>{}",
        query_instruction_for_retrieval="Given a question in text, retrieve SQL queries that are appropriate responses to the question.",
        trust_remote_code=True,
        use_fp16=True  # setting use_fp16 to True speeds up computation with a slight performance degradation
    )

    embeddings_1 = model.encode_queries(queries)
    embeddings_2 = model.encode_corpus(documents)
    similarity = embeddings_1 @ embeddings_2.T
    print(similarity)
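
Assuming ``encode_queries`` and ``encode_corpus`` return L2-normalized embeddings (the library's default behavior, to the best of our knowledge), ``similarity`` holds cosine similarities, and the best document for each query is a row-wise ``argmax``. A minimal sketch with hypothetical scores:

.. code:: python

    import numpy as np

    # Hypothetical similarity matrix: rows are queries, columns are documents.
    similarity = np.array([
        [0.82, 0.41],
        [0.35, 0.79],
    ])

    best = similarity.argmax(axis=1)  # index of the best document per query
    print(best.tolist())  # -> [0, 1]
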


@@ -16,6 +16,8 @@ BGE-VL contains lightweight CLIP-based models as well as more powerful LLAVA-Ne
+----------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+
| `BAAI/bge-vl-MLLM-S2 <https://huggingface.co/BAAI/BGE-VL-MLLM-S2>`_ | English | 7.57B | 15.14 GB | Finetune BGE-VL-MLLM-S1 with one epoch on MMEB training set |
+----------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+
| `BAAI/BGE-VL-v1.5-zs <https://huggingface.co/BAAI/BGE-VL-v1.5-zs>`_  | English   | 7.57B      | 15.14 GB     | Better multi-modal retrieval model that performs well across a wide range of tasks |
+----------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+
| `BAAI/BGE-VL-v1.5-mmeb <https://huggingface.co/BAAI/BGE-VL-v1.5-mmeb>`_ | English | 7.57B      | 15.14 GB     | Better multi-modal retrieval model, additionally fine-tuned on the MMEB training set |
+----------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+
BGE-VL-CLIP
@@ -107,4 +109,50 @@ The normalized last hidden state of the [EOS] token in the MLLM is used as the e
print(scores)
BGE-VL-v1.5
-----------
The BGE-VL-v1.5 series is the updated version of BGE-VL, bringing better performance on both retrieval and multi-modal understanding. The models were trained on 30M MegaPairs data plus an additional 10M natural and synthetic data samples.

`bge-vl-v1.5-zs` is a zero-shot model trained only on the data described above, while `bge-vl-v1.5-mmeb` is additionally fine-tuned on the MMEB training set.

.. code:: python

    import torch
    from transformers import AutoModel
    from PIL import Image

    MODEL_NAME = "BAAI/BGE-VL-v1.5-mmeb"  # or "BAAI/BGE-VL-v1.5-zs"

    model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
    model.eval()
    model.cuda()

    with torch.no_grad():
        model.set_processor(MODEL_NAME)

        query_inputs = model.data_process(
            text="Make the background dark, as if the camera has taken the photo at night",
            images="../../imgs/cir_query.png",
            q_or_c="q",
            task_instruction="Retrieve the target image that best meets the combined criteria by using both the provided image and the image retrieval instructions: "
        )
        candidate_inputs = model.data_process(
            images=["../../imgs/cir_candi_1.png", "../../imgs/cir_candi_2.png"],
            q_or_c="c",
        )

        query_embs = model(**query_inputs, output_hidden_states=True)[:, -1, :]
        candi_embs = model(**candidate_inputs, output_hidden_states=True)[:, -1, :]

        query_embs = torch.nn.functional.normalize(query_embs, dim=-1)
        candi_embs = torch.nn.functional.normalize(candi_embs, dim=-1)

        scores = torch.matmul(query_embs, candi_embs.T)
        print(scores)
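
The scoring pattern at the end of the block above (L2-normalize, then matrix product) does not depend on the model; a self-contained sketch of the same pattern with random vectors:

.. code:: python

    import numpy as np

    rng = np.random.default_rng(0)
    query_embs = rng.normal(size=(1, 8))   # one query embedding
    candi_embs = rng.normal(size=(2, 8))   # two candidate embeddings

    # L2-normalize so the dot product equals cosine similarity.
    query_embs /= np.linalg.norm(query_embs, axis=-1, keepdims=True)
    candi_embs /= np.linalg.norm(candi_embs, axis=-1, keepdims=True)

    scores = query_embs @ candi_embs.T     # shape (1, 2), values in [-1, 1]
    print(scores.shape)
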
For more details, check out the `MegaPairs <https://github.com/VectorSpaceLab/MegaPairs>`_ repository.


@@ -15,6 +15,7 @@ BGE
   bge_m3
   bge_icl
   bge_vl
   bge_code

.. toctree::
   :maxdepth: 1