BGE-VL
======

BGE-VL is a series of multimodal retrieval models trained on `MegaPairs <https://github.com/VectorSpaceLab/MegaPairs>`_.

BGE-VL includes lightweight CLIP-based models as well as more powerful LLaVA-NeXT-based MLLM models:

+----------------------------------------------------------------------+-----------+------------+--------------+--------------------------------------------------------------------+
| Model                                                                | Language  | Parameters | Model Size   | Description                                                        |
+======================================================================+===========+============+==============+====================================================================+
| `BAAI/BGE-VL-base <https://huggingface.co/BAAI/BGE-VL-base>`_        | English   | 150M       | 299 MB       | Lightweight multimodal embedder for image and text                |
+----------------------------------------------------------------------+-----------+------------+--------------+--------------------------------------------------------------------+
| `BAAI/BGE-VL-large <https://huggingface.co/BAAI/BGE-VL-large>`_      | English   | 428M       | 855 MB       | Large-scale multimodal embedder for image and text                |
+----------------------------------------------------------------------+-----------+------------+--------------+--------------------------------------------------------------------+
| `BAAI/BGE-VL-MLLM-S1 <https://huggingface.co/BAAI/BGE-VL-MLLM-S1>`_  | English   | 7.57B      | 15.14 GB     | SOTA in composed image retrieval, trained on the MegaPairs dataset |
+----------------------------------------------------------------------+-----------+------------+--------------+--------------------------------------------------------------------+
| `BAAI/BGE-VL-MLLM-S2 <https://huggingface.co/BAAI/BGE-VL-MLLM-S2>`_  | English   | 7.57B      | 15.14 GB     | BGE-VL-MLLM-S1 fine-tuned for one epoch on the MMEB training set  |
+----------------------------------------------------------------------+-----------+------------+--------------+--------------------------------------------------------------------+

BGE-VL-CLIP
-----------

The base and large models are trained on top of CLIP-ViT-base-patch16 and CLIP-ViT-large-patch14, respectively.
For composed image-text inputs, the model uses score fusion: the outputs of the visual encoder and the text encoder are summed directly to produce the final embedding.
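
The score-fusion idea is easy to see on a plain CLIP checkpoint: encode the image and the text separately, sum the two embeddings, and normalize. The sketch below is only a conceptual illustration; it uses the generic openai/clip-vit-base-patch16 checkpoint rather than the BGE-VL weights, and in practice the fusion is handled internally by model.encode in the example further down.

.. code:: python

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Generic CLIP checkpoint, used here only to illustrate score fusion
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

    image = Image.open("./assets/cir_query.png")
    text = "Make the background dark, as if the camera has taken the photo at night"

    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])

    # Score fusion: sum the image and text embeddings, then L2-normalize
    fused = torch.nn.functional.normalize(image_emb + text_emb, dim=-1)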

.. tip::

    Our code works well with transformers==4.45.2, and we recommend using this version.

You can easily use the BGE-VL-CLIP models with transformers:

.. code:: python

    import torch
    from transformers import AutoModel

    MODEL_NAME = "BAAI/BGE-VL-base"  # or "BAAI/BGE-VL-large"

    model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)  # You must set trust_remote_code=True
    model.set_processor(MODEL_NAME)
    model.eval()

    with torch.no_grad():
        query = model.encode(
            images="./assets/cir_query.png",
            text="Make the background dark, as if the camera has taken the photo at night"
        )
        candidates = model.encode(
            images=["./assets/cir_candi_1.png", "./assets/cir_candi_2.png"]
        )

        scores = query @ candidates.T
    print(scores)
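
To turn the similarity scores into a retrieval result, you can simply take the highest-scoring candidate. The snippet below is a small illustrative follow-up, assuming a single query as in the example above:

.. code:: python

    # Index of the best-matching candidate for the single query above
    best_idx = scores.argmax(dim=-1)
    print(best_idx)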

BGE-VL-MLLM
-----------

Multimodal large language models (MLLMs) incorporate a visual encoder, typically based on a vision transformer, into a large language model (LLM).
This integration allows image tokens to be processed directly by the LLM.
Consequently, MLLMs can effectively handle diverse multimodal inputs by converting any type of input into a sequence of tokens.

BGE-VL-MLLM builds upon LLaVA-NeXT (LLaVA-1.6). In both training and inference, BGE-VL-MLLM (called MMRet in the MegaPairs paper) uses task-specific instructions for query inputs to improve generalization, in line with standard practice for LLM-based embedding models.
A typical multimodal query input is structured as follows:

.. math::

    \langle\text{instruct}\rangle \, \{\text{task\_inst}\} \; \langle\text{query}\rangle \, \{q_t\} \, \{q_i\} \; [\text{EOS}]

where :math:`\text{task\_inst}` represents the task-specific instruction, :math:`q_t` denotes the input query text, and
:math:`q_i` is the input query image.
The normalized last hidden state of the [EOS] token in the MLLM is used as the embedding of any given input sequence.
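
To make the structure above concrete, the sketch below renders it as a flat string. This is purely illustrative: the marker tokens and spacing are hypothetical, and the real prompt template is applied internally by model.data_process in the example that follows.

.. code:: python

    # Hypothetical rendering of the query format above; the actual prompt
    # template is applied internally by model.data_process.
    task_inst = ("Retrieve the target image that best meets the combined criteria by "
                 "using both the provided image and the image retrieval instructions: ")
    query_text = "Make the background dark, as if the camera has taken the photo at night"
    query_image = "<query image tokens>"  # placeholder for the image token sequence

    prompt = f"<instruct> {task_inst} <query> {query_text} {query_image} [EOS]"
    print(prompt)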

.. code:: python

    import torch
    from transformers import AutoModel
    from PIL import Image

    MODEL_NAME = "BAAI/BGE-VL-MLLM-S1"

    model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
    model.eval()
    model.cuda()

    with torch.no_grad():
        model.set_processor(MODEL_NAME)

        query_inputs = model.data_process(
            text="Make the background dark, as if the camera has taken the photo at night",
            images="./assets/cir_query.png",
            q_or_c="q",
            task_instruction="Retrieve the target image that best meets the combined criteria by using both the provided image and the image retrieval instructions: "
        )
        candidate_inputs = model.data_process(
            images=["./assets/cir_candi_1.png", "./assets/cir_candi_2.png"],
            q_or_c="c",
        )

        query_embs = model(**query_inputs, output_hidden_states=True)[:, -1, :]
        candi_embs = model(**candidate_inputs, output_hidden_states=True)[:, -1, :]

        query_embs = torch.nn.functional.normalize(query_embs, dim=-1)
        candi_embs = torch.nn.functional.normalize(candi_embs, dim=-1)

        scores = torch.matmul(query_embs, candi_embs.T)
    print(scores)

For more details, check out the `MegaPairs <https://github.com/VectorSpaceLab/MegaPairs>`_ repository.