Mirror of https://github.com/FlagOpen/FlagEmbedding.git (synced 2025-06-27 02:39:58 +00:00)
update link in README
This commit is contained in:
parent 7f02137123
commit d8590e208f
@@ -33,7 +33,7 @@
 </h4>

-[English](README.md) | [中文](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/README_zh.md)
+[English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
@@ -237,7 +237,7 @@ Merge 10 models fine-tuned on other tasks based on five examples for new tasks:
 - Examples Data for dataset from FLAN: [./llm_examples.json]()
 - MMLU dataset: https://huggingface.co/datasets/cais/mmlu (use the example in dev set to do in-context learning)

-You can use these models and our code to produce a new model and evaluate its performance using the [llm-embedder script](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/research/llm_embedder/docs/evaluation.md) as following:
+You can use these models and our code to produce a new model and evaluate its performance using the [llm-embedder script](https://github.com/FlagOpen/FlagEmbedding/blob/master/research/llm_embedder/docs/evaluation.md) as following:
 ```
 # for 30 tasks from FLAN
 torchrun --nproc_per_node 8 -m evaluation.eval_icl \
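
The hunk above points to the MMLU dataset and notes that its dev split supplies the in-context examples. As a hedged illustration (not part of this commit), the snippet below shows one way those dev rows could be loaded with the Hugging Face `datasets` library; the subject name is a placeholder.

```python
# Hypothetical helper, not from the repository: load MMLU dev-set rows to use as
# in-context examples, as described above. Assumes the `datasets` library is
# installed; "abstract_algebra" is just one example subject configuration.
from datasets import load_dataset

dev = load_dataset("cais/mmlu", "abstract_algebra", split="dev")
for row in dev:
    # Each row carries a question, a list of four answer choices, and the answer index.
    print(row["question"], row["choices"], row["answer"])
```
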
@@ -17,7 +17,7 @@ Following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/e
 Some suggestions:

 - Mine hard negatives following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune/embedder#hard-negatives), which can improve the retrieval performance.
-- In general, larger hyper-parameter `per_device_train_batch_size` brings better performance. You can expand it by enabling `--fp16`, `--deepspeed df_config.json` (df_config.json can refer to [ds_config.json](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/examples/finetune/ds_stage0.json), `--gradient_checkpointing`, etc.
+- In general, larger hyper-parameter `per_device_train_batch_size` brings better performance. You can expand it by enabling `--fp16`, `--deepspeed df_config.json` (df_config.json can refer to [ds_config.json](https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/finetune/ds_stage0.json), `--gradient_checkpointing`, etc.
 - If you want to maintain the performance on other tasks when fine-tuning on your data, you can use [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/research/LM_Cocktail) to merge the fine-tuned model and the original bge model. Besides, if you want to fine-tune on multiple tasks, you also can approximate the multi-task learning via model merging as [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/research/LM_Cocktail).
 - If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity.
 - If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker.
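
The LM-Cocktail suggestion in the hunk above merges a fine-tuned model with the original bge model. The sketch below only illustrates the underlying idea, weighted parameter averaging, using plain transformers/PyTorch rather than the LM_Cocktail package; the fine-tuned path and the equal 0.5/0.5 weights are placeholders.

```python
# Illustrative sketch of weighted parameter averaging (the idea behind model merging),
# not the LM_Cocktail API. Paths and weights are placeholders.
from transformers import AutoModel

base = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5")
tuned = AutoModel.from_pretrained("path/to/your-finetuned-bge")  # placeholder path

tuned_state = tuned.state_dict()
merged_state = {
    # Average floating-point weights; copy non-float buffers from the base model.
    name: 0.5 * p + 0.5 * tuned_state[name] if p.is_floating_point() else p
    for name, p in base.state_dict().items()
}
base.load_state_dict(merged_state)
base.save_pretrained("merged-bge")  # reuse the original bge tokenizer/config
```
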
@@ -142,7 +142,7 @@ Please replace the `query_instruction_for_retrieval` with your instruction if yo

 ### 6. Evaluate model

-We provide [a simple script](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/research/baai_general_embedding/finetune/eval_msmarco.py) to evaluate the model's performance.
+We provide [a simple script](https://github.com/FlagOpen/FlagEmbedding/blob/master/research/baai_general_embedding/finetune/eval_msmarco.py) to evaluate the model's performance.
 A brief summary of how the script works:

 1. Load the model on all available GPUs through [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html).
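
Step 1 in the hunk above loads the model on all available GPUs through DataParallel. Below is a minimal sketch of that pattern, assuming at least one CUDA device; it mirrors the idea rather than the eval_msmarco.py code, and the model name and CLS pooling are assumptions.

```python
# Hedged sketch of "load the model on all available GPUs through DataParallel";
# not the eval_msmarco.py code. Model name and CLS pooling are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

name = "BAAI/bge-base-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(name)
model = torch.nn.DataParallel(AutoModel.from_pretrained(name)).cuda().eval()

sentences = ["what is a panda?", "The giant panda is a bear native to China."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt").to("cuda")
with torch.no_grad():
    last_hidden = model(**batch, return_dict=False)[0]  # inputs are split across GPUs
# bge-style encoders take the [CLS] vector, L2-normalized, as the sentence embedding.
embeddings = torch.nn.functional.normalize(last_hidden[:, 0], dim=-1)
print(embeddings @ embeddings.T)  # pairwise cosine similarities
```
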
@@ -63,7 +63,7 @@ torchrun --nproc_per_node {number of gpus} \
 You can also refer to [this script](./unified_finetune_bge-m3_exmaple.sh) for more details. In this script, we use `deepspeed` to perform distributed training. Learn more about `deepspeed` at https://www.deepspeed.ai/getting-started/. Note that there are some important parameters to be modified in this script:

 - `HOST_FILE_CONTENT`: Machines and GPUs for training. If you want to use multiple machines for training, please refer to https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node (note that you should configure `pdsh` and `ssh` properly).
-- `DS_CONFIG_FILE`: Path of deepspeed config file. [Here](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/examples/finetune/ds_stage0.json) is an example of `ds_config.json`.
+- `DS_CONFIG_FILE`: Path of deepspeed config file. [Here](https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/finetune/ds_stage0.json) is an example of `ds_config.json`.
 - `DATA_PATH`: One or more paths of training data. **Each path must be a directory containing one or more jsonl files**.
 - `DEFAULT_BATCH_SIZE`: Default batch size for training. If you use efficient batching strategy, which means you have split your data to different parts by sequence length, then the batch size for each part will be decided by the `get_file_batch_size()` function in [`BGE_M3/data.py`](../../BGE_M3/data.py). Before starting training, you should set the corresponding batch size for each part in this function according to the GPU memory of your machines. `DEFAULT_BATCH_SIZE` will be used for the part whose sequence length is not in the `get_file_batch_size()` function.
 - `EPOCHS`: Number of training epochs.
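
The `DEFAULT_BATCH_SIZE` bullet above describes how `get_file_batch_size()` maps a data part's sequence length to a per-device batch size. The sketch below only illustrates that shape; the length buckets and batch sizes are placeholders to tune to your GPU memory, not the values in `BGE_M3/data.py`.

```python
# Placeholder illustration of the idea behind get_file_batch_size() described above:
# shorter-sequence parts get larger batches, and anything not covered falls back to
# DEFAULT_BATCH_SIZE. The buckets and numbers are made up for this example.
def get_file_batch_size(max_seq_len: int, default_batch_size: int) -> int:
    if max_seq_len <= 512:
        return 64
    if max_seq_len <= 2048:
        return 16
    if max_seq_len <= 8192:
        return 4
    return default_batch_size


print(get_file_batch_size(1024, default_batch_size=2))  # -> 16
```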