Mirror of https://github.com/FlagOpen/FlagEmbedding.git, synced 2025-06-27 02:39:58 +00:00
update readme
This commit is contained in:
parent 0e9cd1c7c5
commit b76eac8d93
163 README.md
@@ -32,7 +32,7 @@
</h4>

[English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
[English](README.md) | [中文](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/README_zh.md)

FlagEmbedding focuses on retrieval-augmented LLMs and currently consists of the following projects:
@@ -42,11 +42,11 @@ FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following p
- **[Dataset](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/dataset)**: [MLDR](https://huggingface.co/datasets/Shitao/MLDR), [bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data), [public-data](https://huggingface.co/datasets/cfli/bge-e5data), [full-data](https://huggingface.co/datasets/cfli/bge-full-data), [reranker-data](https://huggingface.co/datasets/Shitao/bge-reranker-data)
- **[Tutorials](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/Tutorials)**
- **research**:
- **Long-Context LLM**: [Activation Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon), [LongLLM QLoRA](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora)
- **Fine-tuning of LM**: [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail)
- **Embedding Model**: [Visualized-BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual), [BGE-M3](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3), [LLM Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), [BGE Embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding)
- **Reranker Model**: [llm rerankers](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker), [BGE Reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)
- **Benchmark**: [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB), [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench), [MLVU](https://github.com/JUNJIE99/MLVU)
  - **Long-Context LLM**: [Activation Beacon](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/activation_beacon), [LongLLM QLoRA](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/longllm_qlora)
  - **Fine-tuning of LM**: [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail)
  - **Embedding Model**: [Visualized-BGE](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/visual_bge), [BGE-M3](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/BGE_M3), [LLM Embedder](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_embedder), [BGE Embedding](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/baai_general_embedding)
  - **Reranker Model**: [llm rerankers](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_reranker), [BGE Reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/reranker)
  - **Benchmark**: [C-MTEB](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/C_MTEB), [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench), [MLVU](https://github.com/JUNJIE99/MLVU)

## News

- 22/10/2024: :fire: We release another interesting model: [OmniGen](https://github.com/VectorSpaceLab/OmniGen), a unified image generation model that supports various tasks. OmniGen can accomplish complex image generation tasks without additional plugins such as ControlNet and IP-Adapter, or auxiliary models such as pose detection and face detection.
@@ -62,26 +62,26 @@ FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following p
- 6/7/2024: Release a new benchmark [MLVU](https://github.com/JUNJIE99/MLVU), the first comprehensive benchmark specifically designed for long video understanding. MLVU features an extensive range of video durations, a diverse collection of video sources, and a set of evaluation tasks uniquely tailored for long-form video understanding. :fire:
- 5/21/2024: Release a new benchmark [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench) together with Jina AI, Zilliz, HuggingFace, and other partners. AIR-Bench focuses on a fair out-of-distribution evaluation for Neural IR & RAG. It generates synthetic benchmarking data for diverse domains and languages, is dynamic, and will be updated on a regular basis. [Leaderboard](https://huggingface.co/spaces/AIR-Bench/leaderboard) :fire:
- 4/30/2024: Release [Llama-3-8B-Instruct-80K-QLoRA](https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA), extending the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA training on a small amount of synthesized long-context data. The model achieves remarkable performance on various long-context benchmarks. [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora) :fire:
- 3/18/2024: Release new [rerankers](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker), built upon the powerful M3 and LLM (GEMMA and MiniCPM, not so large actually :smiley:) backbones, supporting multilingual processing and larger inputs, with massive improvements in ranking performance on BEIR, C-MTEB/Retrieval, MIRACL, and the LlamaIndex evaluation. :fire:
- 3/18/2024: Release [Visualized-BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual), equipping BGE with visual capabilities. Visualized-BGE can be used to generate embeddings for hybrid image-text data. :fire:
- 4/30/2024: Release [Llama-3-8B-Instruct-80K-QLoRA](https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA), extending the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA training on a small amount of synthesized long-context data. The model achieves remarkable performance on various long-context benchmarks. [Code](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/longllm_qlora) :fire:
- 3/18/2024: Release new [rerankers](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_reranker), built upon the powerful M3 and LLM (GEMMA and MiniCPM, not so large actually :smiley:) backbones, supporting multilingual processing and larger inputs, with massive improvements in ranking performance on BEIR, C-MTEB/Retrieval, MIRACL, and the LlamaIndex evaluation. :fire:
- 3/18/2024: Release [Visualized-BGE](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/visual_bge), equipping BGE with visual capabilities. Visualized-BGE can be used to generate embeddings for hybrid image-text data. :fire:
- 1/30/2024: Release **BGE-M3**, a new member of the BGE model series! M3 stands for **M**ulti-linguality (100+ languages), **M**ulti-granularity (input length up to 8192), and **M**ulti-functionality (unification of dense, lexical, and multi-vector/ColBERT retrieval).
  It is the first embedding model that supports all three retrieval methods, achieving new SOTA on multilingual (MIRACL) and cross-lingual (MKQA) benchmarks.
  [Technical Report](https://arxiv.org/pdf/2402.03216.pdf) and [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3). :fire:
- 1/9/2024: Release [Activation-Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon), an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLMs. [Technical Report](https://arxiv.org/abs/2401.03462)
- 12/24/2023: Release **LLaRA**, a LLaMA-7B-based dense retriever that achieves state-of-the-art performance on MS MARCO and BEIR. The model and code will be open-sourced. Please stay tuned. [Technical Report](https://arxiv.org/abs/2312.15503)
- 11/23/2023: Release [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail), a method to maintain general capabilities during fine-tuning by merging multiple language models. [Technical Report](https://arxiv.org/abs/2311.13534)
- 10/12/2023: Release [LLM-Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), a unified embedding model to support diverse retrieval-augmentation needs for LLMs. [Technical Report](https://arxiv.org/pdf/2310.07554.pdf)
  [Technical Report](https://arxiv.org/pdf/2402.03216.pdf) and [Code](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/BGE_M3). :fire:
- 1/9/2024: Release [Activation-Beacon](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/activation_beacon), an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLMs. [Technical Report](https://arxiv.org/abs/2401.03462)
- 12/24/2023: Release **LLaRA**, a LLaMA-7B-based dense retriever that achieves state-of-the-art performance on MS MARCO and BEIR. The model and code will be open-sourced. Please stay tuned. [Technical Report](https://arxiv.org/abs/2312.15503) and [Code](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LLARA)
- 11/23/2023: Release [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail), a method to maintain general capabilities during fine-tuning by merging multiple language models. [Technical Report](https://arxiv.org/abs/2311.13534)
- 10/12/2023: Release [LLM-Embedder](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_embedder), a unified embedding model to support diverse retrieval-augmentation needs for LLMs. [Technical Report](https://arxiv.org/pdf/2310.07554.pdf)
- 09/15/2023: The [technical report](https://arxiv.org/pdf/2309.07597.pdf) of BGE has been released.
- 09/15/2023: The [massive training data](https://data.baai.ac.cn/details/BAAI-MTP) of BGE has been released.
- 09/12/2023: New models:
  - **New reranker model**: release the cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
  - **Updated embedding model**: release the `bge-*-v1.5` embedding models to alleviate the issue of the similarity distribution and enhance retrieval ability without instructions.
- 09/07/2023: Update [fine-tune code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md): add a script to mine hard negatives and support adding instructions during fine-tuning.
- 09/07/2023: Update [fine-tune code](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/baai_general_embedding): add a script to mine hard negatives and support adding instructions during fine-tuning.
- 08/09/2023: BGE models are integrated into **LangChain**; you can use them like [this](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
- 08/05/2023: Release base-scale and small-scale models, the **best performance among models of the same size 🤗**
- 08/02/2023: Release `bge-large-*` (short for BAAI General Embedding) models, **ranked 1st on the MTEB and C-MTEB benchmarks!** :tada: :tada:
- 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.
- 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.

</details>
@@ -143,116 +143,35 @@ The following contents are releasing in the upcoming weeks:
<img src="./Tutorials/tutorial_map.png"/>
</details>


## [Projects](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research)

### BGE-M3 ([Paper](https://arxiv.org/pdf/2402.03216.pdf), [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))

In this project, we introduce BGE-M3, the first embedding model that supports:
- **Multi-Functionality**: It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.
- **Multi-Linguality**: It supports more than 100 working languages.
- **Multi-Granularity**: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.

**The training code and fine-tuning data will be open-sourced in the near future.**
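A minimal sketch of how all three retrieval modes can be requested in one pass, assuming the `BGEM3FlagModel` interface shipped with the BGE-M3 release of the `FlagEmbedding` package (check the linked code for the current API):

```python
from FlagEmbedding import BGEM3FlagModel

# use_fp16 speeds up encoding with a small precision trade-off.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = [
    "What is BGE-M3?",
    "BGE-M3 is a multi-functional, multi-lingual, multi-granularity embedding model.",
]

# Ask for dense vectors, sparse lexical weights, and per-token ColBERT vectors.
output = model.encode(
    sentences,
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True,
)

dense_vecs = output["dense_vecs"]            # one 1024-d vector per sentence
lexical_weights = output["lexical_weights"]  # per-token weights for sparse retrieval
colbert_vecs = output["colbert_vecs"]        # per-token vectors for multi-vector retrieval
```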

### [Visualized-BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual)

In this project, we introduce Visualized-BGE, which integrates image token embeddings into the BGE text embedding framework. Visualized-BGE can be used for various hybrid modal retrieval tasks, such as Multi-Modal Knowledge Retrieval, Composed Image Retrieval, and Knowledge Retrieval with Multi-Modal Queries.

Our model delivers outstanding zero-shot performance across multiple hybrid modal retrieval tasks. It can also serve as a base model for downstream fine-tuning on hybrid modal retrieval tasks.


### [LongLLM QLoRA](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora)

We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is highly efficient, taking 8 hours on a single 8xA800 (80G) GPU machine (the context length can go far beyond 80K with more computing resources). The resulting model exhibits superior performance across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding, while also preserving the original capability over short contexts.

### [Activation Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon)

The utilization of long contexts poses a big challenge for large language models due to their limited context window length.
Activation Beacon condenses the LLM's raw activations into more compact forms so that it can perceive a much longer context within a limited context window.
It is an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLMs.
For more details, please refer to our [paper](https://arxiv.org/abs/2401.03462) and [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon).

### [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail)

LM-Cocktail automatically merges fine-tuned models and the base model, using a simple function to compute the merging weights.
LM-Cocktail can be used to improve performance on a target domain without decreasing general capabilities beyond that domain,
as well as to generate a model for new tasks without fine-tuning.
You can use it to merge LLMs (e.g., Llama) or embedding models.
For more details, please refer to our report [LM-Cocktail](https://arxiv.org/abs/2311.13534) and [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail).
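As an illustration, merging can be sketched with the `LM_Cocktail` package; the function and parameter names below follow its README at the time of writing, and the fine-tuned model path is a placeholder:

```python
from LM_Cocktail import mix_models

# Merge a base embedding model with a (hypothetical) fine-tuned checkpoint.
# The weights control the trade-off between general and target-domain ability.
merged = mix_models(
    model_names_or_paths=["BAAI/bge-base-en-v1.5", "./my-finetuned-bge-base-en-v1.5"],
    model_type="encoder",   # use "decoder" to merge LLMs such as Llama
    weights=[0.5, 0.5],     # typically chosen to sum to 1
    output_path="./mixed_embedding_model",
)
```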

### [LLM Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder)

LLM Embedder is fine-tuned based on feedback from LLMs.
It supports the retrieval-augmentation needs of large language models, including knowledge retrieval, memory retrieval, example retrieval, and tool retrieval.
It is fine-tuned over 6 tasks: Question Answering, Conversational Search, Long Conversation,
Long-Range Language Modeling, In-Context Learning, and Tool Learning.
For more details, please refer to the [report](https://arxiv.org/abs/2310.07554) and [./FlagEmbedding/llm_embedder/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder).
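As a rough usage sketch, the model has been exposed through an `LLMEmbedder` class in the `FlagEmbedding` package, where one of the six task identifiers (here `"qa"`) selects the task-specific instructions; treat the exact names as assumptions and verify them in the linked README:

```python
from FlagEmbedding import LLMEmbedder

model = LLMEmbedder("BAAI/llm-embedder", use_fp16=False)

task = "qa"  # task identifier selecting the instruction pair for question answering
queries = ["What is the capital of France?"]
keys = ["Paris is the capital and largest city of France."]

# Queries and keys are encoded with different task-specific instructions.
query_embeddings = model.encode_queries(queries, task=task)
key_embeddings = model.encode_keys(keys, task=task)

print(query_embeddings @ key_embeddings.T)  # inner-product relevance scores
```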

### [BGE Reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)

A cross-encoder performs full attention over the input pair, which makes it more accurate than an embedding model (i.e., a bi-encoder) but also more time-consuming.
Therefore, it can be used to re-rank the top-k documents returned by an embedding model.
We train the cross-encoder on multilingual pair data. The data format is the same as for the embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
For more details, please refer to [./FlagEmbedding/reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker).
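For example, re-ranking the candidates returned by an embedding model can be sketched with the `FlagReranker` class from the `FlagEmbedding` package (the query and passages below are made-up examples):

```python
from FlagEmbedding import FlagReranker

# Scores are raw relevance logits; higher means more relevant.
reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

query = "What is the capital of France?"
candidates = [
    "Paris is the capital and largest city of France.",
    "Berlin is the capital of Germany.",
]

scores = reranker.compute_score([[query, passage] for passage in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
print(reranked[0][0])  # most relevant passage first
```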

### [LLM Reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker)

We provide a new version of the cross-encoder that supports more languages and longer inputs. The data format is similar to that of our embedding models, but now includes prompt data for fine-tuning and inference. You can perform inference using specific layers or all layers, and you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker#fine-tune).
For more details, please refer to [./FlagEmbedding/llm_reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker).
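A sketch of the layer-selection idea, assuming the `LayerWiseFlagLLMReranker` class and `cutoff_layers` argument described in the bge-reranker-v2 model cards (verify against the current package before relying on it):

```python
from FlagEmbedding import LayerWiseFlagLLMReranker

reranker = LayerWiseFlagLLMReranker("BAAI/bge-reranker-v2-minicpm-layerwise", use_fp16=True)

pairs = [
    ["what is a panda?", "The giant panda is a bear species endemic to China."],
    ["what is a panda?", "Paris is the capital of France."],
]

# cutoff_layers chooses which layer(s) the relevance score is read from,
# trading a little accuracy for faster inference.
scores = reranker.compute_score(pairs, cutoff_layers=[28])
print(scores)
```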

### [BGE Embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding)

BGE embedding is a general embedding model. We pre-train the models using [RetroMAE](https://github.com/staoxiao/RetroMAE) and train them on large-scale pair data using contrastive learning.
**You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
Note that the goal of pre-training is to reconstruct the text; the pre-trained model cannot be used for similarity calculation directly and needs to be fine-tuned first.
Refer to our [report: C-Pack](https://arxiv.org/pdf/2309.07597.pdf) and [code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md) for more details.

**BGE uses the last hidden state of `[cls]` as the sentence embedding: `sentence_embeddings = model_output[0][:, 0]`. If you use mean pooling, there will be a significant decrease in performance.**
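The CLS-pooling convention can be reproduced with plain `transformers`; this is a minimal sketch (model name and sentences are arbitrary examples) that mirrors the `model_output[0][:, 0]` expression above:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5")
model.eval()

sentences = ["BGE uses CLS pooling.", "Mean pooling degrades BGE performance."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    # CLS pooling: take the last hidden state of the first ([CLS]) token.
    embeddings = outputs[0][:, 0]
    # Normalize so that the dot product equals cosine similarity.
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

print(embeddings @ embeddings.T)
```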

### [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB)

A benchmark for Chinese text embeddings. This benchmark has been merged into MTEB.
Refer to our [report: C-Pack](https://arxiv.org/pdf/2309.07597.pdf) and [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) for more details.
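Since C-MTEB has been merged into MTEB, one way to run a C-MTEB task is through the upstream `mteb` package; the task name and API below follow that package at the time of writing and may need adjusting for your installed version:

```python
# A sketch of evaluating a BGE model on one C-MTEB retrieval task via `mteb`.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-zh-v1.5")
evaluation = MTEB(tasks=["T2Retrieval"])  # one of the Chinese retrieval tasks
evaluation.run(model, output_folder="results/bge-base-zh-v1.5")
```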

## Model List

`bge` is short for `BAAI general embedding`.

| Model | Language | | Description | query instruction for retrieval |
|:--------------------------------------------------------------------------|:--------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------:|
| [BAAI/bge-en-icl](https://huggingface.co/BAAI/bge-en-icl) | English | | A LLM-based embedding model with in-context learning capabilities, which can fully leverage the model's potential based on a few shot examples | Provide instructions and few-shot examples freely based on the given task. |
| [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2) | Multilingual | - | A LLM-based multilingual embedding model, trained on a diverse range of languages and tasks. | Provide instructions based on the given task. |
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | Multilingual | [Inference](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3#usage) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3) | Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) | |
| [LM-Cocktail](https://huggingface.co/Shitao) | English | | fine-tuned models (Llama and BGE) which can be used to reproduce the results of LM-Cocktail | |
| [BAAI/llm-embedder](https://huggingface.co/BAAI/llm-embedder) | English | [Inference](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder) | a unified embedding model to support diverse retrieval augmentation needs for LLMs | See [README](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder) |
| [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) | Multilingual | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | a lightweight cross-encoder model, possesses strong multilingual capabilities, easy to deploy, with fast inference. | |
| [BAAI/bge-reranker-v2-gemma](https://huggingface.co/BAAI/bge-reranker-v2-gemma) | Multilingual | [Inference](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker) | a cross-encoder model which is suitable for multilingual contexts, performs well in both English proficiency and multilingual capabilities. | |
| [BAAI/bge-reranker-v2-minicpm-layerwise](https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise) | Multilingual | [Inference](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker) | a cross-encoder model which is suitable for multilingual contexts, performs well in both English and Chinese proficiency, allows freedom to select layers for output, facilitating accelerated inference. | |
| [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) | Multilingual | [Inference](BAAI/bge-reranker-v2.5-gemma2-lightweight) | a cross-encoder model which is suitable for multilingual contexts, performs well in both English and Chinese proficiency, allows freedom to select layers, compress ratio and compress layers for output, facilitating accelerated inference. | |
| [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | Chinese and English | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | a cross-encoder model which is more accurate but less efficient | |
| [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | Chinese and English | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | a cross-encoder model which is more accurate but less efficient | |
| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-base-zh-v1.5](https://huggingface.co/BAAI/bge-base-zh-v1.5) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh-v1.5](https://huggingface.co/BAAI/bge-small-zh-v1.5) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | Embedding Model which map text into vector | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a base-scale model but with similar ability to `bge-large-en` | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | Embedding Model which map text into vector | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a base-scale model but with similar ability to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |

| Model | Language | Description | query instruction for retrieval |
|:--------------------------------------------------------------------------|:--------:|:-----------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------:|
| [BAAI/bge-en-icl](https://huggingface.co/BAAI/bge-en-icl) | English | A LLM-based embedding model with in-context learning capabilities, which can fully leverage the model's potential based on a few shot examples | Provide instructions and few-shot examples freely based on the given task. |
| [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2) | Multilingual | A LLM-based multilingual embedding model, trained on a diverse range of languages and tasks. | Provide instructions based on the given task. |
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | Multilingual | Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) | |
| [LM-Cocktail](https://huggingface.co/Shitao) | English | fine-tuned models (Llama and BGE) which can be used to reproduce the results of LM-Cocktail | |
| [BAAI/llm-embedder](https://huggingface.co/BAAI/llm-embedder) | English | a unified embedding model to support diverse retrieval augmentation needs for LLMs | See [README](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder) |
| [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) | Multilingual | a lightweight cross-encoder model, possesses strong multilingual capabilities, easy to deploy, with fast inference. | |
| [BAAI/bge-reranker-v2-gemma](https://huggingface.co/BAAI/bge-reranker-v2-gemma) | Multilingual | a cross-encoder model which is suitable for multilingual contexts, performs well in both English proficiency and multilingual capabilities. | |
| [BAAI/bge-reranker-v2-minicpm-layerwise](https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise) | Multilingual | a cross-encoder model which is suitable for multilingual contexts, performs well in both English and Chinese proficiency, allows freedom to select layers for output, facilitating accelerated inference. | |
| [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) | Multilingual | a cross-encoder model which is suitable for multilingual contexts, performs well in both English and Chinese proficiency, allows freedom to select layers, compress ratio and compress layers for output, facilitating accelerated inference. | |
| [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | Chinese and English | a cross-encoder model which is more accurate but less efficient | |
| [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | Chinese and English | a cross-encoder model which is more accurate but less efficient | |
| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | English | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | English | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | English | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5) | Chinese | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-base-zh-v1.5](https://huggingface.co/BAAI/bge-base-zh-v1.5) | Chinese | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh-v1.5](https://huggingface.co/BAAI/bge-small-zh-v1.5) | Chinese | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | Embedding Model which map text into vector | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | a base-scale model but with similar ability to `bge-large-en` | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | Embedding Model which map text into vector | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model but with similar ability to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
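
A usage sketch for the embedding models listed above, based on the `FlagModel` class from the `FlagEmbedding` package; `query_instruction_for_retrieval` takes the instruction shown in the last column and is prepended to queries only:

```python
from FlagEmbedding import FlagModel

passages = [
    "BGE is a general embedding model pre-trained with RetroMAE.",
    "Paris is the capital of France.",
]

model = FlagModel(
    "BAAI/bge-large-en-v1.5",
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
    use_fp16=True,
)

passage_embeddings = model.encode(passages)                 # no instruction added
query_embeddings = model.encode_queries(["what is BGE?"])   # instruction added
print(query_embeddings @ passage_embeddings.T)              # similarity scores
```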

165 README_zh.md
@@ -36,11 +36,17 @@

FlagEmbedding focuses on retrieval-augmented LLMs and currently includes the following projects:

- **Long-Context LLM**: [Activation Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon), [LongLLM QLoRA](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora)
- **Fine-tuning of LM**: [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail)
- **Embedding Model**: [Visualized-BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual), [BGE-M3](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3), [LLM Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), [BGE Embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding)
- **Reranker Model**: [llm rerankers](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker), [BGE Reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)
- **Benchmark**: [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB), [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench), [MLVU](https://github.com/JUNJIE99/MLVU)
- **Inference**: [Embedder](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/inference/embedder), [Reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/inference/reranker)
- **Fine-tuning**: [Embedder](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder), [Reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/reranker)
- **Evaluation**: [MTEB](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#1-mteb), [BEIR](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#2-beir), [MSMARCO](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#3-msmarco), [MIRACL](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#4-miracl), [MLDR](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#5-mldr), [MKQA](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#6-mkqa), [AIR-Bench](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#7-air-bench), [Custom Dataset](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#8-custom-dataset)
- **[Dataset](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/dataset)**: [MLDR](https://huggingface.co/datasets/Shitao/MLDR), [bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data), [public-data](https://huggingface.co/datasets/cfli/bge-e5data), [full-data](https://huggingface.co/datasets/cfli/bge-full-data), [reranker-data](https://huggingface.co/datasets/Shitao/bge-reranker-data)
- **[Tutorials](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/Tutorials)**
- **Research**:
  - **Long-Context LLM**: [Activation Beacon](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/activation_beacon), [LongLLM QLoRA](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/longllm_qlora)
  - **Fine-tuning of LM**: [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail)
  - **Embedding Model**: [Visualized-BGE](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/visual_bge), [BGE-M3](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/BGE_M3), [LLM Embedder](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_embedder), [BGE Embedding](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/baai_general_embedding)
  - **Reranker Model**: [llm rerankers](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_reranker), [BGE Reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/reranker)
  - **Benchmark**: [C-MTEB](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/C_MTEB), [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench), [MLVU](https://github.com/JUNJIE99/MLVU)

## Updates

- 9/2/2024: We have started maintaining and updating the [tutorials](./Tutorials/); the contents of the tutorials folder will keep growing, so stay tuned! :books:

@@ -53,23 +59,23 @@ FlagEmbedding focuses on retrieval-augmented LLMs and currently includes the following projects:

- 6/7/2024: Release [MLVU](https://github.com/JUNJIE99/MLVU), the first comprehensive benchmark designed specifically for long video understanding. MLVU covers a wide range of video durations, diverse video sources, and multiple evaluation tasks designed for long video understanding. :fire:
- 5/21/2024: Release the benchmark [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench) together with Jina AI, Zilliz, HuggingFace, and other organizations, designed for retrieval tasks and RAG scenarios. AIR-Bench is the first to use LLMs to automatically generate evaluation data for retrieval tasks, which prevents models from overfitting the test data. Since AIR-Bench requires no manual annotation, it can flexibly cover more vertical domains and languages, and it will be updated regularly to meet the community's evolving evaluation needs. [Leaderboard](https://huggingface.co/spaces/AIR-Bench/leaderboard) :fire:
- 4/30/2024: Release [Llama-3-8B-Instruct-80K-QLoRA](https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA), which effectively extends the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA training on a small amount of synthesized long-context data. See the [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora) :fire:
- 3/18/2024: Release new [rerankers](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker) with better performance and support for multiple languages and long texts. :fire:
- 3/18/2024: Release [Visualized-BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual), which equips BGE with visual encoding capabilities by introducing image token embeddings. Visualized-BGE can encode hybrid image-text data for a wide range of hybrid-modal retrieval tasks. :fire:
- 1/30/2024: Release **BGE-M3**, the first text retrieval model featuring multi-functionality, multi-linguality, and multi-granularity, efficiently supporting 100+ languages, long inputs of up to 8192 tokens, and hybrid retrieval (dense, sparse, and multi-vector). See the [report](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/BGE_M3/BGE_M3.pdf) and [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3) :fire:
- 1/9/2024: Release [Activation-Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon), an effective, efficient, compatible, and low-cost (training) method to extend the context length of large language models. [Technical Report](https://arxiv.org/abs/2401.03462)
- 12/24/2023: Release **LLaRA**, a LLaMA-7B-based dense retriever that achieves the best results to date on MS MARCO and BEIR. The model and code will be open-sourced soon; stay tuned. [Technical Report](https://arxiv.org/abs/2312.15503)
- 11/23/2023: Release [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail), a method that preserves the general capabilities of the original model during fine-tuning through model merging. [Technical Report](https://arxiv.org/abs/2311.13534)
- 10/12/2023: Release [LLM-Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), an English embedding model designed for the **various retrieval-augmentation tasks** of large language models. [Technical Report](https://arxiv.org/pdf/2310.07554.pdf)
- 4/30/2024: Release [Llama-3-8B-Instruct-80K-QLoRA](https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA), which effectively extends the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA training on a small amount of synthesized long-context data. See the [code](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/longllm_qlora) :fire:
- 3/18/2024: Release new [rerankers](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_reranker) with better performance and support for multiple languages and long texts. :fire:
- 3/18/2024: Release [Visualized-BGE](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/visual_bge), which equips BGE with visual encoding capabilities by introducing image token embeddings. Visualized-BGE can encode hybrid image-text data for a wide range of hybrid-modal retrieval tasks. :fire:
- 1/30/2024: Release **BGE-M3**, the first text retrieval model featuring multi-functionality, multi-linguality, and multi-granularity, efficiently supporting 100+ languages, long inputs of up to 8192 tokens, and hybrid retrieval (dense, sparse, and multi-vector). See the [report](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/BGE_M3/BGE_M3.pdf) and [code](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/BGE_M3) :fire:
- 1/9/2024: Release [Activation-Beacon](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/activation_beacon), an effective, efficient, compatible, and low-cost (training) method to extend the context length of large language models. [Technical Report](https://arxiv.org/abs/2401.03462)
- 12/24/2023: Release **LLaRA**, a LLaMA-7B-based dense retriever that achieves the best results to date on MS MARCO and BEIR. The model and code will be open-sourced soon; stay tuned. [Technical Report](https://arxiv.org/abs/2312.15503) and [Code](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LLARA)
- 11/23/2023: Release [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail), a method that preserves the general capabilities of the original model during fine-tuning through model merging. [Technical Report](https://arxiv.org/abs/2311.13534)
- 10/12/2023: Release [LLM-Embedder](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_embedder), an English embedding model designed for the **various retrieval-augmentation tasks** of large language models. [Technical Report](https://arxiv.org/pdf/2310.07554.pdf)
- 09/15/2023: Release the [technical report](https://arxiv.org/pdf/2309.07597.pdf) and the [dataset](https://data.baai.ac.cn/details/BAAI-MTP).
- 09/12/2023: Updates:
  - **New reranker model**: open-source the cross-encoder model bge-reranker, which ranks more accurately than the embedding models. We highly recommend using or fine-tuning it to re-rank the top-k documents returned by embedding models and improve the relevance of the final results.
  - **Updated embedding models**: release the bge-*-v1.5 embedding models to alleviate the similarity-distribution issue and improve retrieval without instructions (instructions are still recommended for retrieval tasks).
- 09/07/2023: Update the [fine-tuning code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md): add a hard-negative mining script and an instruction argument to make it easy to add instructions during fine-tuning.
- 09/07/2023: Update the [fine-tuning code](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/baai_general_embedding): add a hard-negative mining script and an instruction argument to make it easy to add instructions during fine-tuning.
- 08/09/2023: BGE models are integrated into LangChain and can easily be [used](#using-langchain) there; the C-MTEB Chinese leaderboard is [online](https://huggingface.co/spaces/mteb/leaderboard).
- 08/05/2023: Release smaller models (base, small) that achieve the **best performance among models of the same size! 🤗**
- 08/02/2023: :tada: :tada: Release the Chinese and English embedding models BGE (short for BAAI General Embedding), which achieve the **best performance on the MTEB and C-MTEB leaderboards**
- 08/01/2023: Release the large-scale Chinese text embedding [benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), which includes 31 test tasks.
- 08/01/2023: Release the large-scale Chinese text embedding [benchmark](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/C_MTEB) (**C-MTEB**), which includes 31 test tasks.

</details>

@@ -129,111 +135,32 @@ print(similarity)

<summary>Tutorial roadmap</summary>
<img src="./Tutorials/tutorial_map.png"/>
</details>


## Projects

### BGE-M3 ([Paper](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/BGE_M3/BGE_M3.pdf), [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))

In this project, we release BGE-M3, the first text retrieval model featuring multi-functionality, multi-linguality, and multi-granularity.
- Multi-Functionality: it can simultaneously perform three retrieval modes: dense (single-vector) retrieval, multi-vector retrieval, and sparse retrieval.
- Multi-Linguality: it supports more than 100 working languages.
- Multi-Granularity: it can handle inputs of different granularities, from short sentences to long documents of up to 8192 tokens.

In this project, we also propose a new self-knowledge distillation method to improve the performance of each single retrieval mode.
We optimize the batching strategy to support large batch sizes, which can be used directly when fine-tuning embeddings on long texts or with large language models.
We also build a dataset for document retrieval and propose a simple strategy to improve the modeling of long texts.
The training code and fine-tuning data will be open-sourced in the near future.

### [Visualized-BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual)

In this project, we release Visualized-BGE.
By introducing image token embeddings, Visualized-BGE can encode hybrid image-text data. It can be applied to a wide range of multi-modal retrieval tasks, including but not limited to multi-modal knowledge retrieval and image retrieval with multi-modal queries.

### [LongLLM QLoRA](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora)

We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training process is highly efficient, requiring only 8 hours on one 8xA800 (80G) GPU machine. The model exhibits excellent performance on a wide range of evaluation tasks such as NIHS, topic retrieval, and long-context language understanding, while also preserving its original capability over short contexts. This strong long-context ability is mainly attributed to only 3.5K synthetic samples generated by GPT-4, which indicates that LLMs have an inherent (but largely underestimated) potential to extend their original context. In fact, with more computing resources, the method can extend the context length even further.

### [Activation Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon)

Due to the limited context window length, effectively utilizing long-context information is a big challenge for large language models.
Activation Beacon condenses the LLM's raw activations into more compact forms so that it can perceive a much longer context within a limited context window.
It is an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLMs.
For more details, please refer to the [technical report](https://arxiv.org/abs/2401.03462) and [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon).

### [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail)

Model merging is used to improve the performance of a single model.
We find that this approach is also useful for large language models and text embedding models, so we designed the LM-Cocktail method, which automatically computes merging weights to fuse the base model and the fine-tuned models.
LM-Cocktail alleviates the catastrophic-forgetting problem, i.e., it improves performance on the target task without degrading general performance.
By constructing a small number of data samples, it can also be used to generate a model for a new task without any fine-tuning.
It can be used to merge generative models or embedding models.
For more details, please refer to the [technical report](https://arxiv.org/abs/2311.13534) and [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail).


### [LLM Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder)

The LLM-Embedder model is fine-tuned based on feedback from LLMs.
It supports the retrieval-augmentation needs of large language models, including knowledge retrieval, memory retrieval, example retrieval, and tool retrieval.
It is fine-tuned over 6 tasks: Question Answering, Conversational Search, Long Conversation,
Long-Range Language Modeling, In-Context Learning, and Tool Learning.
For more details, please refer to [./FlagEmbedding/llm_embedder/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder)

### [BGE Reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)

The cross-encoder computes a relevance score over the query and the answer jointly, which is more accurate than the embedding model (i.e., the bi-encoder) but more time-consuming.
Therefore, it can be used to re-rank the top-k documents returned by the embedding model.
We trained the cross-encoder on multilingual pair data. The data format is the same as for the embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
For more details, please refer to [./FlagEmbedding/reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/reranker/README.md)


### [LLM Reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker)

We provide a new version of the cross-encoder that supports more languages and longer inputs. The data format is similar to that of our embedding models, but prompts are now added for fine-tuning and inference. You can perform inference using specific layers or all layers, and you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker#fine-tune).
For more details, please refer to [./FlagEmbedding/llm_reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/llm_reranker/README.md)

### [BGE Embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding)

BGE Embedding is a general-purpose embedding model. We pre-train the models with [RetroMAE](https://github.com/staoxiao/RetroMAE) and then train them on large-scale pair data with contrastive learning.
**You can fine-tune the embedding model on your own data following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
We also provide a [pre-training example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
Note that the goal of pre-training is to reconstruct the text; the pre-trained model cannot be used for similarity calculation directly and must be fine-tuned first.
For more details about the training of BGE, please refer to the [paper](https://arxiv.org/pdf/2309.07597.pdf) and [code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).

**Note that BGE uses the representation of the CLS token as the representation of the whole sentence; using the wrong pooling method (such as mean pooling) will significantly degrade performance.**
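Because the released BGE checkpoints ship with a CLS-pooling configuration, they can also be used through `sentence-transformers`; the snippet below is a sketch, with the Chinese retrieval instruction taken from the model table:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")

instruction = "为这个句子生成表示以用于检索相关文章:"
queries = ["什么是BGE?"]
passages = ["BGE 是一个通用向量模型。", "巴黎是法国的首都。"]

# Prepend the retrieval instruction to queries only, not to passages.
q_emb = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

print(q_emb @ p_emb.T)  # cosine similarities, since embeddings are L2-normalized
```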

### [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB)

A Chinese embedding leaderboard, which has been merged into MTEB. For more details, please refer to the [paper](https://arxiv.org/pdf/2309.07597.pdf) and [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB).


## Model List

| Model | Language | | Description | query instruction for retrieval [1] |
|:--------|:--------:|:--------:|:--------|:--------|
| [BAAI/bge-en-icl](https://huggingface.co/BAAI/bge-en-icl) | English | | An LLM-based embedding model with in-context learning capability, which can fully leverage the model's potential with a few examples | Provide instructions and few-shot examples freely based on the given task. |
| [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2) | Multilingual | | An LLM-based multilingual embedding model, trained on diverse languages and tasks to fit a variety of downstream scenarios. | Provide instructions and few-shot examples freely based on the given task. |
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | Multilingual | [Inference](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3#usage) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3) | Multi-functionality (dense retrieval, sparse retrieval, multi-vector retrieval), multi-linguality, multi-granularity (max length 8192) | |
| [LM-Cocktail](https://huggingface.co/Shitao) | English | | Fine-tuned Llama and BGE models that can be used to reproduce the results of the LM-Cocktail paper | |
| [BAAI/llm-embedder](https://huggingface.co/BAAI/llm-embedder) | English | [Inference](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder) | An embedding model designed for the various retrieval-augmentation tasks of LLMs | See the [README](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder) |
| [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) | Multilingual | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | A lightweight cross-encoder model with strong multilingual capability; easy to deploy, with fast inference. | |
| [BAAI/bge-reranker-v2-gemma](https://huggingface.co/BAAI/bge-reranker-v2-gemma) | Multilingual | [Inference](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker) | A multilingual cross-encoder model that performs well in both English and multilingual capability. | |
| [BAAI/bge-reranker-v2-minicpm-layerwise](https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise) | Multilingual | [Inference](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker) | A multilingual cross-encoder model that performs well in both English and Chinese and allows free selection of output layers to accelerate inference. | |
| [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) | Multilingual | [Inference](BAAI/bge-reranker-v2.5-gemma2-lightweight) | A multilingual cross-encoder model that performs well in both English and Chinese and allows free selection of output layers, compression ratio, and compression layers to accelerate inference. | |
| [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | Chinese and English | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | A cross-encoder model, more accurate than the embedding model but less efficient at inference [2] | |
| [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | Chinese and English | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | A cross-encoder model, more accurate than the embedding model but less efficient at inference [2] | |
| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | Version 1.5 with a more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | Version 1.5 with a more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | Version 1.5 with a more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | Version 1.5 with a more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-base-zh-v1.5](https://huggingface.co/BAAI/bge-base-zh-v1.5) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | Version 1.5 with a more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh-v1.5](https://huggingface.co/BAAI/bge-small-zh-v1.5) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | Version 1.5 with a more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | An embedding model that maps text to vectors | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | A base-scale embedding model | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | A small-scale embedding model | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | An embedding model that maps text to vectors | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | A base-scale embedding model | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | A small-scale embedding model | `为这个句子生成表示以用于检索相关文章:` |

| Model | Language | Description | query instruction for retrieval [1] |
|:--------|:--------:|:--------|:--------|
| [BAAI/bge-en-icl](https://huggingface.co/BAAI/bge-en-icl) | English | An LLM-based embedding model with in-context learning capability, which can fully leverage the model's potential with a few examples | Provide instructions and few-shot examples freely based on the given task. |
| [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2) | Multilingual | An LLM-based multilingual embedding model, trained on diverse languages and tasks to fit a variety of downstream scenarios. | Provide instructions and few-shot examples freely based on the given task. |
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | Multilingual | Multi-functionality (dense retrieval, sparse retrieval, multi-vector retrieval), multi-linguality, multi-granularity (max length 8192) | |
| [LM-Cocktail](https://huggingface.co/Shitao) | English | Fine-tuned Llama and BGE models that can be used to reproduce the results of the LM-Cocktail paper | |
| [BAAI/llm-embedder](https://huggingface.co/BAAI/llm-embedder) | English | An embedding model designed for the various retrieval-augmentation tasks of LLMs | See the [README](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder) |
| [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) | Multilingual | A lightweight cross-encoder model with strong multilingual capability; easy to deploy, with fast inference. | |
| [BAAI/bge-reranker-v2-gemma](https://huggingface.co/BAAI/bge-reranker-v2-gemma) | Multilingual | A multilingual cross-encoder model that performs well in both English and multilingual capability. | |
| [BAAI/bge-reranker-v2-minicpm-layerwise](https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise) | Multilingual | A multilingual cross-encoder model that performs well in both English and Chinese and allows free selection of output layers to accelerate inference. | |
| [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) | Multilingual | A multilingual cross-encoder model that performs well in both English and Chinese and allows free selection of output layers, compression ratio, and compression layers to accelerate inference. | |
| [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | Chinese and English | A cross-encoder model, more accurate than the embedding model but less efficient at inference [2] | |
| [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | Chinese and English | A cross-encoder model, more accurate than the embedding model but less efficient at inference [2] | |
| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | English | Version 1.5 with a more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | English | Version 1.5 with a more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | English | Version 1.5 with a more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5) | Chinese | Version 1.5 with a more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-base-zh-v1.5](https://huggingface.co/BAAI/bge-base-zh-v1.5) | Chinese | Version 1.5 with a more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh-v1.5](https://huggingface.co/BAAI/bge-small-zh-v1.5) | Chinese | Version 1.5 with a more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | An embedding model that maps text to vectors | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | A base-scale embedding model | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | A small-scale embedding model | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | An embedding model that maps text to vectors | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | A base-scale embedding model | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | A small-scale embedding model | `为这个句子生成表示以用于检索相关文章:` |
|
||||
|
||||
|
||||
|
||||
|
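For quick reference, here is a minimal sketch of encoding queries and passages with one of the BGE embedding models via the `FlagEmbedding` package (model choice and sentences are illustrative; assumes `pip install -U FlagEmbedding`):

```python
from FlagEmbedding import FlagModel

# Illustrative inputs; swap in your own queries and corpus.
queries = ["What is BGE?"]
passages = ["BGE is a family of general-purpose text embedding models released by BAAI."]

# The query instruction below matches the English v1.5 models in the table above.
model = FlagModel(
    "BAAI/bge-large-en-v1.5",
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
    use_fp16=True,  # faster inference with a small accuracy cost
)

q_embeddings = model.encode_queries(queries)  # the instruction is prepended to queries only
p_embeddings = model.encode(passages)         # passages are encoded without the instruction
scores = q_embeddings @ p_embeddings.T        # inner-product similarity scores
print(scores)
```
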
# Research

### BGE-M3 ([Paper](https://arxiv.org/pdf/2402.03216.pdf), [Code](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/BGE_M3))

In this project, we introduce BGE-M3, the first embedding model which supports:
- **Multi-Functionality**: it can simultaneously perform dense retrieval, sparse retrieval, and multi-vector retrieval.
- **Multi-Linguality**: it supports more than 100 working languages.
- **Multi-Granularity**: it can process inputs ranging from short sentences to long documents of up to 8192 tokens.

**The training code and fine-tuning data will be open-sourced in the near future.**
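Below is a minimal sketch of using BGE-M3's three retrieval modes through the `BGEM3FlagModel` helper (sentences are illustrative; output keys follow the current `FlagEmbedding` release):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = [
    "What is BGE-M3?",
    "BGE-M3 supports dense, sparse, and multi-vector retrieval.",
]

# Request all three representations at once.
output = model.encode(
    sentences,
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True,
)

dense_vecs = output["dense_vecs"]            # one normalized vector per sentence
lexical_weights = output["lexical_weights"]  # per-sentence token->weight dicts for sparse retrieval
colbert_vecs = output["colbert_vecs"]        # per-token vectors for multi-vector scoring

# Dense similarity between the two sentences.
print(dense_vecs[0] @ dense_vecs[1])
```
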
### [Visualized-BGE](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/visual_bge)

In this project, we introduce Visualized-BGE, which integrates image token embeddings into the BGE text embedding framework. Visualized-BGE can be used for various hybrid-modal retrieval tasks, such as Multi-Modal Knowledge Retrieval, Composed Image Retrieval, and Knowledge Retrieval with Multi-Modal Queries.

Our model delivers outstanding zero-shot performance across multiple hybrid-modal retrieval tasks. It can also serve as a base model for downstream fine-tuning on hybrid-modal retrieval tasks.
### [LongLLM QLoRA](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/longllm_qlora)

We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is highly efficient, taking 8 hours on a single 8xA800 (80G) GPU machine (the context length can go far beyond 80K with more computing resources). The resulting model exhibits superior performance across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding; meanwhile, it also preserves the original capability over short contexts well.
### [Activation Beacon](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/activation_beacon)

The utilization of long contexts poses a big challenge for large language models due to their limited context window length.
Activation Beacon condenses the LLM's raw activations into more compact forms so that it can perceive a much longer context within a limited context window.
It is an effective, efficient, compatible, and low-cost (in training) method to extend the context length of LLMs.
For more details, please refer to our [paper](https://arxiv.org/abs/2401.03462) and [code](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/activation_beacon).
### [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail)

LM-Cocktail automatically merges fine-tuned models and the base model using a simple function to compute merging weights.
It can be used to improve performance on a target domain without degrading general capabilities beyond that domain,
as well as to generate a model for new tasks without fine-tuning.
You can use it to merge LLMs (e.g., Llama) or embedding models.
For more details, please refer to our report: [LM-Cocktail](https://arxiv.org/abs/2311.13534) and [code](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail).
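As an illustration, here is a minimal sketch of merging two decoder models with the `LM_Cocktail` package; the model names and weights are placeholders, and the `mix_models` signature is assumed to follow the LM-Cocktail repository's documented usage:

```python
from LM_Cocktail import mix_models

# Merge a fine-tuned model with its base model using fixed weights.
# The model names and weights below are illustrative placeholders.
model = mix_models(
    model_names_or_paths=["meta-llama/Llama-2-7b-chat-hf", "your/fine-tuned-llama"],
    model_type="decoder",       # use "encoder" to merge embedding models instead
    weights=[0.5, 0.5],         # merging weights, summing to 1
    output_path="./mixed_llm",  # the merged model is saved here
)
```
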
### [LLM Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder)

LLM Embedder is fine-tuned based on the feedback from LLMs.
It supports the retrieval augmentation needs of large language models, including knowledge retrieval, memory retrieval, example retrieval, and tool retrieval.
It is fine-tuned over 6 tasks: Question Answering, Conversational Search, Long Conversation,
Long-Range Language Modeling, In-Context Learning, and Tool Learning.
For more details, please refer to the [report](https://arxiv.org/abs/2310.07554) and [./llm_embedder/README.md](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_embedder).
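Below is a rough sketch of retrieval with llm-embedder. The task-specific instruction strings are assumptions based on the llm_embedder README (each task uses its own query/key instructions), so verify the exact prompts there before relying on them:

```python
from FlagEmbedding import FlagModel

# Assumed instruction strings for the question-answering task; check the
# llm_embedder README for the exact prompts of each task.
QUERY_INSTRUCTION = "Represent this query for retrieving relevant documents: "
KEY_INSTRUCTION = "Represent this document for retrieval: "

model = FlagModel(
    "BAAI/llm-embedder",
    query_instruction_for_retrieval=QUERY_INSTRUCTION,
    use_fp16=True,
)

queries = ["When was the Eiffel Tower built?"]
documents = ["The Eiffel Tower was completed in 1889 for the Exposition Universelle."]

q_emb = model.encode_queries(queries)                        # query instruction is prepended automatically
d_emb = model.encode([KEY_INSTRUCTION + d for d in documents])  # key instruction is prepended manually
print(q_emb @ d_emb.T)
```
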
### [BGE Reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/reranker)

A cross-encoder performs full attention over the input pair,
which is more accurate than an embedding model (i.e., a bi-encoder) but also more time-consuming.
Therefore, it can be used to re-rank the top-k documents returned by an embedding model.
We train the cross-encoder on multilingual pair data.
The data format is the same as for the embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
For more details, please refer to [./reranker/README.md](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/reranker).
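For reference, a minimal sketch of re-ranking with the `FlagReranker` helper (the query/passage pairs are illustrative):

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

# Each item is a [query, passage] pair; a higher score means more relevant.
pairs = [
    ["what is panda?", "The giant panda is a bear species endemic to China."],
    ["what is panda?", "Pandas is a Python library for data analysis."],
]
scores = reranker.compute_score(pairs)
print(scores)  # the first passage should score higher than the second
```
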
### [LLM Reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_reranker)

We provide a new version of the cross-encoder that supports more languages and longer inputs. The data format is similar to that of our embedding models, but now includes prompt data for fine-tuning and inference. You can perform inference using specific layers or all layers, and you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker#fine-tune).
For more details, please refer to [./llm_reranker/README.md](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_reranker).
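Here is a minimal sketch of the layerwise LLM reranker. The `LayerWiseFlagLLMReranker` class and its `cutoff_layers` argument follow the usage shown in the llm_reranker documentation, so treat the exact names as assumptions if your FlagEmbedding version differs:

```python
from FlagEmbedding import LayerWiseFlagLLMReranker

# use_fp16 trades a little accuracy for faster inference.
reranker = LayerWiseFlagLLMReranker(
    "BAAI/bge-reranker-v2-minicpm-layerwise",
    use_fp16=True,
)

# Score a [query, passage] pair using only layer 28 to speed up inference.
score = reranker.compute_score(
    ["what is panda?", "The giant panda is a bear species endemic to China."],
    cutoff_layers=[28],
)
print(score)
```
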
### [BGE Embedding](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/baai_general_embedding)

BGE embedding is a general embedding model. We pre-train the models using [RetroMAE](https://github.com/staoxiao/RetroMAE) and train them on large-scale pair data using contrastive learning.
**You can fine-tune the embedding model on your data following our [examples](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder).**
We also provide a [pre-train example](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/old-examples/pretrain).
Note that the goal of pre-training is to reconstruct the text; the pre-trained model cannot be used for similarity calculation directly and needs to be fine-tuned.
Refer to our [report: c-pack](https://arxiv.org/pdf/2309.07597.pdf) and [code](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/baai_general_embedding) for more details.

**BGE uses the last hidden state of `[cls]` as the sentence embedding: `sentence_embeddings = model_output[0][:, 0]`. If you use mean pooling, there will be a significant decrease in performance.**
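To make the pooling note concrete, here is a minimal sketch of CLS pooling with Hugging Face `transformers` (the sentences are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-large-en-v1.5")
model.eval()

sentences = ["BGE uses CLS pooling.", "Mean pooling degrades BGE's performance."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    model_output = model(**inputs)
    # CLS pooling: take the last hidden state of the [CLS] token (position 0).
    sentence_embeddings = model_output[0][:, 0]
    # Normalize so that the inner product equals cosine similarity.
    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)

print(sentence_embeddings @ sentence_embeddings.T)
```
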
### [C-MTEB](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/C_MTEB)

A benchmark for Chinese text embeddings. This benchmark has been merged into MTEB.
Refer to our [report: c-pack](https://arxiv.org/pdf/2309.07597.pdf) and [code](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/C_MTEB) for more details.
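As an illustration, a sketch of running a Chinese evaluation through the `mteb` package with a SentenceTransformers-compatible model; the `task_langs` argument follows the older `mteb` interface used around the time of C-MTEB's release, so newer `mteb` versions may select tasks differently:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any SentenceTransformers-compatible embedding model works here.
model = SentenceTransformer("BAAI/bge-base-zh-v1.5")

# Select Chinese tasks (the C-MTEB tasks are registered inside MTEB).
evaluation = MTEB(task_langs=["zh"])
evaluation.run(model, output_folder="results/bge-base-zh-v1.5")
```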