# Pre-train
In this example, we show how to do pre-training with RetroMAE, which can improve retrieval performance.
## 1. Installation
* **with pip**
```
pip install -U FlagEmbedding
```
* **from source**
```
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install .
```
For development, install in editable mode:
```
pip install -e .
```
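A quick optional check that the installation succeeded is to import the package (a minimal sketch; the printed module path is just for inspection):
```python
# Minimal sanity check: this import should succeed after installation.
import FlagEmbedding

print(FlagEmbedding.__file__)  # location of the installed package
```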
## 2. Data format
Train data should be a JSON Lines (jsonl) file, where each line is a dict like this:
```
{"text": str}
```
See [toy_pretrain_data.jsonl](https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/pretrain/toy_pretrain_data.jsonl) for a toy data file.
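For illustration, the following minimal sketch writes a few records in this format; the file name and texts are only examples:
```python
import json

# Hypothetical example: write a few pre-training records, one JSON object per line.
samples = [
    {"text": "RetroMAE is a retrieval-oriented pre-training method based on masked auto-encoding."},
    {"text": "Each line of the training file is a JSON object with a single 'text' field."},
]

with open("toy_pretrain_data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```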
## 3. Train
```bash
torchrun --nproc_per_node {number of gpus} \
-m FlagEmbedding.baai_general_embedding.retromae_pretrain.run \
--output_dir {path to save model} \
--model_name_or_path BAAI/bge-large-en \
--train_data toy_pretrain_data.jsonl \
--learning_rate 2e-5 \
--num_train_epochs 2 \
--per_device_train_batch_size {batch size; set 1 for toy data} \
--dataloader_drop_last True \
--max_seq_length 512 \
--logging_steps 10 \
--dataloader_num_workers 12
```
For more training arguments, please refer to [transformers.TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).
After training, the encoder model will be saved to `{output_dir}/encoder_model`.
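Assuming the encoder is saved in the standard Hugging Face format, a minimal sketch of loading it for downstream use might look like this (the path is a placeholder; if no tokenizer files are saved next to the encoder, fall back to the base model's tokenizer):
```python
from transformers import AutoModel, AutoTokenizer

# Placeholder path: replace with the actual {output_dir} used during training.
encoder_path = "path/to/output_dir/encoder_model"

model = AutoModel.from_pretrained(encoder_path)
# Assumption: reuse the base model's tokenizer if none was written alongside the encoder.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en")

inputs = tokenizer("RetroMAE pre-training example", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```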