# BGE-VL-v1&v1.5

In this tutorial, we will go through the BGE-VL series of multimodal retrieval models, which achieved state-of-the-art performance on four popular zero-shot composed image retrieval benchmarks and the Massive Multimodal Embedding Benchmark (MMEB).

## 0. Installation

Install the required packages in your environment.

- Our model works well on transformers==4.45.2, and we recommend using this version.

In [None]:
%pip install numpy torch transformers pillow
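
Because the note above recommends transformers==4.45.2, it can help to confirm which version is actually installed before loading a model. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` and the `RECOMMENDED` constant are illustrative and not part of any BGE-VL API.

```python
from importlib.metadata import PackageNotFoundError, version

RECOMMENDED = "4.45.2"  # version recommended in this tutorial


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("transformers")
if found is None:
    print("transformers is not installed")
elif found != RECOMMENDED:
    print(f"transformers {found} found; this tutorial recommends {RECOMMENDED}")
else:
    print(f"transformers {found} matches the recommended version")
```

If the versions differ, the models may still load; the check only flags a deviation from the recommended setup.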

## 1. BGE-VL-CLIP

| Model | Language | Parameters | Model Size | Description | Base Model |
|:-------|:--------:|:--------------:|:--------------:|:-----------------:|:----------------:|
| [BAAI/BGE-VL-base](https://huggingface.co/BAAI/BGE-VL-base) | English | 150M | 299 MB | Lightweight multimodal embedder for images and text | CLIP-base |
| [BAAI/BGE-VL-large](https://huggingface.co/BAAI/BGE-VL-large) | English | 428M | 855 MB | Large-scale multimodal embedder for images and text | CLIP-large |

BGE-VL-base and BGE-VL-large are built on CLIP-base and CLIP-large respectively, each of which contains a vision transformer and a text transformer:

In [3]:
import numpy as np
import torch
from transformers import AutoModel

MODEL_NAME = "BAAI/BGE-VL-base"  # or "BAAI/BGE-VL-large"

model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True) # You must set trust_remote_code=True
model.set_processor(MODEL_NAME)
model.eval()

CLIPModel(
 (text_model): CLIPTextTransformer(
 (embeddings): CLIPTextEmbeddings(
 (token_embedding): Embedding(49408, 512)
 (position_embedding): Embedding(77, 512)
 )
 (encoder): CLIPEncoder(
 (layers): ModuleList(
 (0-11): 12 x CLIPEncoderLayer(
 (self_attn): CLIPSdpaAttention(
 (k_proj): Linear(in_features=512, out_features=512, bias=True)
 (v_proj): Linear(in_features=512, out_features=512, bias=True)
 (q_proj): Linear(in_features=512, out_features=512, bias=True)
 (out_proj): Linear(in_features=512, out_features=512, bias=True)
 )
 (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
 (mlp): CLIPMLP(
 (activation_fn): QuickGELUActivation()
 (fc1): Linear(in_features=512, out_features=2048, bias=True)
 (fc2): Linear(in_features=2048, out_features=512, bias=True)
 )
 (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
 )
 )
 )
 (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
 )
 (vision_model): CLIPVisionTransformer(
 (embe

In [6]:
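with torch.no_grad():
    # Composed query: a reference image plus a text instruction describing the change
    query = model.encode(
        images="../../imgs/cir_query.png",
        text="Make the background dark, as if the camera has taken the photo at night",
    )

    # Candidate images to rank against the query
    candidates = model.encode(
        images=["../../imgs/cir_candi_1.png", "../../imgs/cir_candi_2.png"]
    )

    # Similarity between the query and each candidate
    scores = query @ candidates.T
print(scores)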

tensor([[0.2647, 0.1242]])
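
Each entry in `scores` is the similarity between the query and one candidate, so retrieval amounts to sorting the candidates by score. Below is a minimal sketch in plain Python that uses the two values printed above; the helper `rank_candidates` is hypothetical and not part of the model's API.

```python
def rank_candidates(scores, candidates):
    """Return (candidate, score) pairs sorted from most to least similar."""
    if len(scores) != len(candidates):
        raise ValueError("scores and candidates must have the same length")
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)


candidate_paths = ["../../imgs/cir_candi_1.png", "../../imgs/cir_candi_2.png"]
# Scores copied from the output above; with a real tensor, use scores[0].tolist().
query_scores = [0.2647, 0.1242]

ranking = rank_candidates(query_scores, candidate_paths)
for path, score in ranking:
    print(f"{score:.4f}  {path}")
```

Here the first candidate ranks highest, since 0.2647 is greater than 0.1242.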


## 2. BGE-VL-MLLM

| Model | Language | Parameters | Model Size | Description | Base Model |
|:-------|:--------:|:--------------:|:--------------:|:-----------------:|:----------------:|
| [BAAI/BGE-VL-MLLM-S1](https://huggingface.co/BAAI/BGE-VL-MLLM-S1) | English | 7.57B | 15.14 GB | SOTA in composed image retrieval, trained on the MegaPairs dataset | LLaVA-1.6 |
| [BAAI/BGE-VL-MLLM-S2](https://huggingface.co/BAAI/BGE-VL-MLLM-S2) | English | 7.57B | 15.14 GB | BGE-VL-MLLM-S1 fine-tuned for one epoch on the MMEB training set | LLaVA-1.6 |

In [1]:
import torch
from transformers import AutoModel
from PIL import Image

MODEL_NAME = "BAAI/BGE-VL-MLLM-S1"

model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
model.eval()
model.cuda()

Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.28it/s]


LLaVANextForEmbedding(
 (vision_tower): CLIPVisionModel(
 (vision_model): CLIPVisionTransformer(
 (embeddings): CLIPVisionEmbeddings(
 (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
 (position_embedding): Embedding(577, 1024)
 )
 (pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
 (encoder): CLIPEncoder(
 (layers): ModuleList(
 (0-23): 24 x CLIPEncoderLayer(
 (self_attn): CLIPSdpaAttention(
 (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
 (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
 (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
 (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
 )
 (layer_norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
 (mlp): CLIPMLP(
 (activation_fn): QuickGELUActivation()
 (fc1): Linear(in_features=1024, out_features=4096, bias=True)
 (fc2): Linear(in_features=4096, out_features=1024, bias=True)
 )
 (layer_norm2): Layer

In [5]:
with torch.no_grad():
    model.set_processor(MODEL_NAME)

    # Query: reference image, text instruction, and a task instruction prefix
    query_inputs = model.data_process(
        text="Make the background dark, as if the camera has taken the photo at night",
        images="../../imgs/cir_query.png",
        q_or_c="q",
        task_instruction="Retrieve the target image that best meets the combined criteria by using both the provided image and the image retrieval instructions: ",
    )

    # Candidates: images only
    candidate_inputs = model.data_process(
        images=["../../imgs/cir_candi_1.png", "../../imgs/cir_candi_2.png"],
        q_or_c="c",
    )

    # Use the last position of the returned sequence as the embedding
    query_embs = model(**query_inputs, output_hidden_states=True)[:, -1, :]
    candi_embs = model(**candidate_inputs, output_hidden_states=True)[:, -1, :]

    # L2-normalize so that the dot product below is a cosine similarity
    query_embs = torch.nn.functional.normalize(query_embs, dim=-1)
    candi_embs = torch.nn.functional.normalize(candi_embs, dim=-1)

    scores = torch.matmul(query_embs, candi_embs.T)
print(scores)

tensor([[0.4109, 0.1807]], device='cuda:0')
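
The scoring step above indexes the last position of each sequence, L2-normalizes the result, and takes a dot product, which for unit vectors equals cosine similarity. The sketch below reproduces that arithmetic in NumPy on random stand-in arrays, so it runs without the model or a GPU. The shapes and the helper name `last_token_scores` are assumptions for illustration only. Note that position -1 only corresponds to a real token in every sequence when padding is on the left, which this sketch assumes.

```python
import numpy as np


def last_token_scores(query_hidden, cand_hidden):
    """Cosine similarity between last-position embeddings.

    Both inputs have shape (batch, seq_len, dim).
    """
    q = query_hidden[:, -1, :]  # (n_queries, dim)
    c = cand_hidden[:, -1, :]   # (n_candidates, dim)
    # L2-normalize each row so the dot product becomes a cosine similarity
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    c = c / np.linalg.norm(c, axis=-1, keepdims=True)
    return q @ c.T              # (n_queries, n_candidates)


rng = np.random.default_rng(0)
query_hidden = rng.normal(size=(1, 5, 8))  # stand-in for one query
cand_hidden = rng.normal(size=(2, 7, 8))   # stand-in for two candidates

scores = last_token_scores(query_hidden, cand_hidden)
print(scores.shape)  # (1, 2)
```

Because both sides are normalized, every score lies in the range [-1, 1], and a vector compared with itself scores 1.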


## 3. BGE-VL-v1.5

The BGE-VL-v1.5 series is a new version of BGE-VL that brings better performance on both retrieval and multimodal understanding. It is trained on 30M MegaPairs samples plus an additional 10M natural and synthetic samples.

`BGE-VL-v1.5-zs` is the zero-shot model, trained only on the data mentioned above. `BGE-VL-v1.5-mmeb` is additionally fine-tuned on the MMEB training set.

| Model | Language | Parameters | Model Size | Description | Base Model |
|:-------|:--------:|:--------------:|:--------------:|:-----------------:|:----------------:|
| [BAAI/BGE-VL-v1.5-zs](https://huggingface.co/BAAI/BGE-VL-v1.5-zs) | English | 7.57B | 15.14 GB | Better multimodal retrieval model that performs well across all kinds of tasks | LLaVA-1.6 |
| [BAAI/BGE-VL-v1.5-mmeb](https://huggingface.co/BAAI/BGE-VL-v1.5-mmeb) | English | 7.57B | 15.14 GB | Better multi-modal retrieval model, additionally fine-tuned on MMEB training set | LLaVA-1.6 |

You can use the BGE-VL-v1.5 models in exactly the same way as BGE-VL-MLLM.

In [3]:
import torch
from transformers import AutoModel
from PIL import Image

MODEL_NAME = "BAAI/BGE-VL-v1.5-mmeb"  # or "BAAI/BGE-VL-v1.5-zs"

model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
model.eval()
model.cuda()

with torch.no_grad():
    model.set_processor(MODEL_NAME)

    query_inputs = model.data_process(
        text="Make the background dark, as if the camera has taken the photo at night",
        images="../../imgs/cir_query.png",
        q_or_c="q",
        task_instruction="Retrieve the target image that best meets the combined criteria by using both the provided image and the image retrieval instructions: ",
    )

    candidate_inputs = model.data_process(
        images=["../../imgs/cir_candi_1.png", "../../imgs/cir_candi_2.png"],
        q_or_c="c",
    )

    query_embs = model(**query_inputs, output_hidden_states=True)[:, -1, :]
    candi_embs = model(**candidate_inputs, output_hidden_states=True)[:, -1, :]

    query_embs = torch.nn.functional.normalize(query_embs, dim=-1)
    candi_embs = torch.nn.functional.normalize(candi_embs, dim=-1)

    scores = torch.matmul(query_embs, candi_embs.T)
print(scores)


Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 2.26it/s]


tensor([[0.3880, 0.1815]], device='cuda:0')
