update docs

ZiyiXia 2024-12-03 11:31:38 +00:00
parent 6d9fa4ecf6
commit 1374b98da4
20 changed files with 436 additions and 53 deletions


@@ -13,7 +13,7 @@ jobs:
- uses: actions/setup-python@v5
- name: Install dependencies
run: |
pip install . sphinx sphinx_rtd_theme myst_parser myst-nb furo
pip install . sphinx myst_parser myst-nb sphinx-design pydata-sphinx-theme
- name: Sphinx build
run: |
sphinx-build docs/source docs/build


@@ -159,7 +159,7 @@ Currently we are updating the [tutorials](./Tutorials/), we aim to create a comp
The following contents will be released in the upcoming weeks:
- Evaluation
- RAG
- BGE-EN-ICL
<details>
<summary>The whole tutorial roadmap</summary>


@@ -1,3 +1,5 @@
sphinx
myst-nb
furo
sphinx-design
pydata-sphinx-theme
# furo


@@ -3,4 +3,5 @@ Abstract Class
.. toctree::
abc/inference
abc/evaluation
abc/finetune

docs/source/API/index.rst Normal file

@@ -0,0 +1,11 @@
API
===
.. toctree::
:hidden:
:maxdepth: 1
abc
inference
evaluation
finetune


@@ -0,0 +1,2 @@
FAQ
===


@@ -0,0 +1,37 @@
Concept
=======
Embedder
--------
Embedder, or embedding model, is a model designed to convert data, usually text, code, or images, into sparse or dense numerical vectors (embeddings) in a high-dimensional vector space.
These embeddings capture the semantic meaning or key features of the input, enabling efficient comparison and analysis.

A famous demonstration comes from `word2vec <https://arxiv.org/abs/1301.3781>`_, which shows how word embeddings capture semantic relationships through vector arithmetic:
.. image:: ../_static/img/word2vec.png
:width: 500
:align: center
Nowadays, embedders are capable of mapping sentences and even passages into vector space.
They are widely used in real-world tasks such as retrieval and clustering.
In the era of LLMs, embedding models play a pivotal role in RAG, enabling LLMs to access and integrate relevant context from vast external datasets.
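
Below is a minimal sketch of using an embedder, assuming the FlagEmbedding package is installed and using :code:`BAAI/bge-base-en-v1.5` purely as an example checkpoint:

.. code:: python

    from FlagEmbedding import FlagModel

    # load an example embedding model
    model = FlagModel('BAAI/bge-base-en-v1.5')

    sentences = ["I love NLP", "Natural language processing is fascinating"]
    embeddings = model.encode(sentences)

    # BGE embeddings are L2-normalized by default, so the inner product
    # of two embeddings is their cosine similarity
    print(embeddings[0] @ embeddings[1])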
Reranker
--------
Reranker, or Cross-Encoder, is a model that refines the ranking of candidate pairs (e.g., query-document pairs) by jointly encoding and scoring them.
Typically, we use an embedder as a Bi-Encoder: it first computes the embeddings of the two input sentences, then computes their similarity using a metric such as cosine similarity or Euclidean distance.
A reranker, in contrast, takes the two sentences at the same time and directly computes a score representing their similarity.
The following figure shows their difference:
.. figure:: https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/Bi_vs_Cross-Encoder.png
:width: 500
:align: center
Bi-Encoder & Cross-Encoder (from Sentence Transformers)
Although a Cross-Encoder usually performs better than a Bi-Encoder, it is extremely time-consuming to run over a large amount of data.
Thus a widely accepted approach is to use a Bi-Encoder for initial retrieval (e.g., selecting the top 100 candidates from 100,000 sentences) and then refine the ranking of the selected candidates using a Cross-Encoder for more accurate results.
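
Below is a sketch of the rerank scoring step with FlagEmbedding's :code:`FlagReranker`, using :code:`BAAI/bge-reranker-large` as an example checkpoint:

.. code:: python

    from FlagEmbedding import FlagReranker

    reranker = FlagReranker('BAAI/bge-reranker-large')

    query = "What is BGE?"
    candidates = ["BGE stands for BAAI General Embeddings.",
                  "BM25 is a bag-of-words retrieval function."]

    # jointly encode and score each (query, candidate) pair; higher = more relevant
    scores = reranker.compute_score([[query, c] for c in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)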


@@ -0,0 +1,19 @@
Introduction
============
BGE provides a one-stop retrieval toolkit for search and RAG, covering inference, evaluation, and fine-tuning of embedding models and rerankers.
.. figure:: ../_static/img/RAG_pipeline.png
:width: 700
:align: center
BGE embedder and reranker in an RAG pipeline.
Quickly get started with:
.. toctree::
:maxdepth: 1
installation
concept
quick_start


@@ -40,4 +40,9 @@ For development in editable mode:
# If you do not want to finetune the models, you can install the package without the finetune dependency:
pip install -e .
# If you want to finetune the models, you can install the package with the finetune dependency:
pip install -e .[finetune]
pip install -e .[finetune]
PyTorch-CUDA
------------
If you want to use CUDA GPUs during inference and finetuning, please install an appropriate version of `PyTorch <https://pytorch.org/get-started/locally/>`_ with CUDA support.
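
As a quick sanity check (a minimal sketch, assuming PyTorch is already installed), you can confirm that CUDA is visible to PyTorch:

.. code:: python

    import torch

    # True only if PyTorch was built with CUDA support and a GPU is visible
    print(torch.cuda.is_available())
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))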


@@ -0,0 +1,9 @@
.bd-sidebar-primary {
width: 22%;
line-height: 1.4;
}
.col-lg-3 {
flex: 0 0 auto;
width: 22%;
}

(new binary file: image, 297 KiB)

(new binary file: image, 68 KiB)

@@ -1,2 +1,117 @@
======
BGE-M3
======
BGE-M3 is a compound and powerful embedding model distinguished for its versatility in:

- **Multi-Functionality**: It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.
- **Multi-Linguality**: It can support more than 100 working languages.
- **Multi-Granularity**: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
+-------------------------------------------------------------------+-----------------+------------+--------------+-----------------------------------------------------------------------+
| Model | Language | Parameters | Model Size | Description |
+===================================================================+=================+============+==============+=======================================================================+
| `BAAI/bge-m3 <https://huggingface.co/BAAI/bge-m3>`_ | Multi-Lingual | 569M | 2.27 GB | Multi-Functionality, Multi-Linguality, and Multi-Granularity |
+-------------------------------------------------------------------+-----------------+------------+--------------+-----------------------------------------------------------------------+
Multi-Linguality
================
BGE-M3 was trained on multiple datasets covering 170+ different languages.
Since the amount of training data is highly unbalanced across languages, the model's actual performance also varies by language.
For more information on the datasets and evaluation results, please check out our `paper <https://arxiv.org/pdf/2402.03216>`_ for details.
Multi-Granularity
=================
BGE-M3 extends the max position to 8192, enabling the embedding of long documents.
It also proposes a simple but effective method, MCLS (Multiple CLS), to enhance the model's ability on long text without additional fine-tuning.
Multi-Functionality
===================
.. code:: python

    from FlagEmbedding import BGEM3FlagModel

    model = BGEM3FlagModel('BAAI/bge-m3')

    sentences_1 = ["What is BGE M3?", "Definition of BM25"]
    sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
                   "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]
Dense Retrieval
---------------
Similar to BGE v1 or v1.5 models, BGE-M3 uses the normalized hidden state of the special token [CLS] as the dense embedding:

.. math:: e_q = norm(H_q[0])

Next, the relevance score between the query and passage is computed as:

.. math:: s_{dense}=f_{sim}(e_p, e_q)

where :math:`e_p, e_q` are the embedding vectors of passage and query, respectively,
and :math:`f_{sim}` is the score function (such as inner product or L2 distance) for computing the similarity of two embeddings.
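
For instance, here is a minimal sketch of dense scoring with the sentences above; since the dense vectors are normalized, the inner product serves as :math:`f_{sim}`:

.. code:: python

    output_1 = model.encode(sentences_1, return_dense=True)
    output_2 = model.encode(sentences_2, return_dense=True)

    # inner products of normalized embeddings = cosine similarities
    s_dense = output_1['dense_vecs'] @ output_2['dense_vecs'].T
    print(s_dense)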
Sparse Retrieval
----------------
BGE-M3 generates sparse embeddings by adding a linear layer and a ReLU activation function on top of the hidden states:

.. math:: w_{qt} = \text{ReLU}(W_{lex}^T H_q [i])

where :math:`W_{lex}` represents the weights of the linear layer and :math:`H_q[i]` is the encoder's output of the :math:`i^{th}` token.

Based on the token weights of query and passage, the relevance score between them is computed by the joint importance of the co-existing terms within the query and passage:

.. math:: s_{lex} = \sum_{t\in q\cap p}(w_{qt} * w_{pt})

where :math:`w_{qt}, w_{pt}` are the importance weights of each co-existing term :math:`t` in query and passage, respectively.
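
A corresponding sketch of sparse scoring; :code:`compute_lexical_matching_score` is the helper provided by :code:`BGEM3FlagModel` (treat its availability as an assumption on older releases):

.. code:: python

    output_1 = model.encode(sentences_1, return_sparse=True)
    output_2 = model.encode(sentences_2, return_sparse=True)

    # lexical_weights is a list of {token_id: weight} dicts, one per sentence
    s_lex = model.compute_lexical_matching_score(
        output_1['lexical_weights'][0], output_2['lexical_weights'][0]
    )
    print(s_lex)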
Multi-Vector
------------
The multi-vector method utilizes the entire output embeddings for the representation of query :math:`E_q` and passage :math:`E_p`.

.. math::

    E_q = norm(W_{mul}^T H_q)

    E_p = norm(W_{mul}^T H_p)

where :math:`W_{mul}` is the learnable projection matrix.

Following ColBERT, BGE-M3 uses late interaction to compute the fine-grained relevance score:

.. math:: s_{mul}=\frac{1}{N}\sum_{i=1}^N\max_{j=1}^M E_q[i]\cdot E_p^T[j]

where :math:`E_q, E_p` are the entire output embeddings of query and passage, respectively.
This is the average, over each vector :math:`v\in E_q`, of its maximum similarity with the vectors in :math:`E_p`.
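
And a sketch of multi-vector scoring; :code:`colbert_score` is the late-interaction helper exposed by :code:`BGEM3FlagModel`:

.. code:: python

    output_1 = model.encode(sentences_1, return_colbert_vecs=True)
    output_2 = model.encode(sentences_2, return_colbert_vecs=True)

    # late interaction: average of each query token's max similarity over passage tokens
    s_mul = model.colbert_score(
        output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]
    )
    print(s_mul)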
Hybrid Ranking
--------------
BGE-M3's multi-functionality makes hybrid ranking possible for improving retrieval.
First, since the multi-vector method is computationally heavy, we can retrieve the candidate results with either the dense or the sparse method.
Then, to get the final result, we can rerank the candidates based on the integrated relevance score:

.. math:: s_{rank} = w_1\cdot s_{dense}+w_2\cdot s_{lex} + w_3\cdot s_{mul}

where the values chosen for :math:`w_1`, :math:`w_2` and :math:`w_3` vary depending on the downstream scenario.
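
As a minimal sketch, reusing the three scores computed above (the weights here are illustrative assumptions, not recommended values):

.. code:: python

    # example weights; tune w1, w2, w3 for your downstream scenario
    w1, w2, w3 = 0.4, 0.2, 0.4
    s_rank = w1 * s_dense[0][0] + w2 * s_lex + w3 * s_mul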
Usage
=====
.. code:: python

    from FlagEmbedding import BGEM3FlagModel

    model = BGEM3FlagModel('BAAI/bge-m3')

    sentences_1 = ["What is BGE M3?", "Definition of BM25"]
    output = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
    # dense embeddings, per-token lexical weights, and multi-vector (ColBERT) embeddings
    dense, sparse, multiv = output['dense_vecs'], output['lexical_weights'], output['colbert_vecs']


@@ -1,5 +1,7 @@
BGE-v1
======
BGE v1 & v1.5
=============
BGE v1 and v1.5 are series of encoder-only models based on BERT. They achieved the best performance among models of the same size at the time of release.
BGE
---
@@ -26,7 +28,7 @@ C-MTEB benchmarks at the time released.
BGE-v1.5
--------
Then to enhance its retrieval ability without instruction and alleviate the issue of the similarity distribution, :code:`bge-*-1.5` models
Then to enhance its retrieval ability without instruction and alleviate the issue of the similarity distribution, :code:`bge-*-v1.5` models
were released in Sep 2023. They are still among the most popular embedding models, striking a good balance between embedding quality and model size.
+-----------------------------------------------------------------------------+-----------+------------+--------------+--------------+
@@ -37,8 +39,8 @@ were released in Sep 2023. They are still the most popular embedding models that
| `BAAI/bge-base-en-v1.5 <https://huggingface.co/BAAI/bge-base-en-v1.5>`_ | English | 109M | 438 MB | reasonable |
+-----------------------------------------------------------------------------+-----------+------------+--------------+ similarity +
| `BAAI/bge-small-en-v1.5 <https://huggingface.co/BAAI/bge-small-en-v1.5>`_ | English | 33.4M | 133 MB | distribution |
+-----------------------------------------------------------------------------+-----------+------------+--------------+ +
| `BAAI/bge-large-zh-v1.5 <https://huggingface.co/BAAI/bge-large-zh-v1.5>`_ | Chinese | 326M | 1.3 GB | |
+-----------------------------------------------------------------------------+-----------+------------+--------------+ and better +
| `BAAI/bge-large-zh-v1.5 <https://huggingface.co/BAAI/bge-large-zh-v1.5>`_ | Chinese | 326M | 1.3 GB | performance |
+-----------------------------------------------------------------------------+-----------+------------+--------------+ +
| `BAAI/bge-base-zh-v1.5 <https://huggingface.co/BAAI/bge-base-zh-v1.5>`_ | Chinese | 102M | 409 MB | |
+-----------------------------------------------------------------------------+-----------+------------+--------------+ +
@@ -46,4 +48,30 @@ were released in Sep 2023. They are still the most popular embedding models that
+-----------------------------------------------------------------------------+-----------+------------+--------------+--------------+
Usage
-----
To use BGE v1 or v1.5 models for inference, load the model through :code:`FlagModel`:
.. code:: python

    from FlagEmbedding import FlagModel

    model = FlagModel('BAAI/bge-base-en-v1.5')

    sentences = ["Hello world", "I am inevitable"]
    embeddings = model.encode(sentences)
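
Since BGE embeddings are L2-normalized by default (assuming :code:`normalize_embeddings=True` has not been overridden), a sketch of pairwise similarity is a simple matrix product:

.. code:: python

    # inner products of normalized embeddings = cosine similarities
    print(embeddings @ embeddings.T)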
.. tip::
For simple tasks that only encode a few sentences like the above, using a single GPU is faster than multiple GPUs:
.. code:: python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
or
.. code:: python
model = FlagModel('BAAI/bge-base-en-v1.5', devices=0)

docs/source/bge/index.rst Normal file

@@ -0,0 +1,19 @@
BGE
===
**BGE** stands for **BAAI General Embeddings**, which is a series of embedding models released by BAAI.
.. toctree::
:maxdepth: 1
:caption: Embedder
bge_v1_v1.5
bge_m3
bge_icl
.. toctree::
:maxdepth: 1
:caption: Reranker
bge_reranker


@@ -1,5 +0,0 @@
Introduction
============
**BGE** stands for **BAAI General Embeddings**, which is a series of embedding models released by BAAI.


@@ -0,0 +1,2 @@
Community
=========


@@ -24,6 +24,7 @@ extensions = [
"sphinx.ext.githubpages",
"sphinx.ext.viewcode",
"sphinx.ext.coverage",
"sphinx_design",
"myst_nb",
]
@@ -33,21 +34,55 @@ exclude_patterns = []
# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = 'furo'
# html_logo = "_static/img/BAAI_logo.png"
# html_theme = 'furo'
html_theme = "pydata_sphinx_theme"
html_logo = "_static/img/BAAI_logo.png"
html_title = "FlagEmbedding"
html_static_path = ['_static']
html_css_files = ["css/custom.css"]
html_theme_options = {
# "light_logo": "/_static/img/BAAI_logo.png",
"light_css_variables": {
"color-brand-primary": "#238be8",
"color-brand-content": "#238be8",
},
"dark_css_variables": {
"color-brand-primary": "#FBCB67",
"color-brand-content": "#FBCB67",
},
# # "light_logo": "/_static/img/BAAI_logo.png",
# "light_css_variables": {
# "color-brand-primary": "#238be8",
# "color-brand-content": "#238be8",
# },
# "dark_css_variables": {
# "color-brand-primary": "#FBCB67",
# "color-brand-content": "#FBCB67",
# },
"navigation_depth": 5,
}
# MyST-NB conf
nb_execution_mode = "off"
nb_execution_mode = "off"
html_theme_options = {
"external_links": [
{
"url": "https://huggingface.co/collections/BAAI/bge-66797a74476eb1f085c7446d",
"name": "HF Models",
},
],
"icon_links":[
{
"name": "GitHub",
"url": "https://github.com/FlagOpen/FlagEmbedding",
"icon": "fa-brands fa-github",
},
{
"name": "PyPI",
"url": "https://pypi.org/project/FlagEmbedding/",
"icon": "fa-brands fa-python",
},
{
"name": "HF Models",
"url": "https://huggingface.co/collections/BAAI/bge-66797a74476eb1f085c7446d",
"icon": "fa-solid fa-cube",
}
],
"header_links_before_dropdown": 7,
}
html_context = {
"default_mode": "light"
}


@@ -3,23 +3,100 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
BAAI General Embedding
======================
:html_theme.sidebar_secondary.remove: True
|
|
.. image:: _static/img/bge_logo.jpg
:target: https://github.com/FlagOpen/FlagEmbedding
:width: 500
:align: center
|
|
BGE
===
Welcome to the BGE documentation!
We aim to build a one-stop retrieval toolkit for search and RAG.
.. figure:: _static/img/bge_logo.jpg
:width: 400
:align: center
.. grid:: 3
:gutter: 3
.. grid-item-card:: :octicon:`milestone` Introduction
New to BGE? Quickly get hands-on information.
+++
.. button-ref:: Introduction/index
:expand:
:color: primary
:click-parent:
To Introduction
.. grid-item-card:: :octicon:`package` BGE Models
Get to know BGE embedding models and rerankers.
+++
.. button-ref:: bge/index
:expand:
:color: primary
:click-parent:
To BGE
.. grid-item-card:: :octicon:`log` Tutorials
Find useful tutorials to start with if you are looking for guidance.
+++
.. button-ref:: tutorial/index
:expand:
:color: primary
:click-parent:
To Tutorials
.. grid-item-card:: :octicon:`codescan` API
Check the API of classes and functions in FlagEmbedding.
+++
.. button-ref:: API/index
:expand:
:color: primary
:click-parent:
To APIs
.. grid-item-card:: :octicon:`question` FAQ
Take a look at frequently asked questions.
+++
.. button-ref:: FAQ/index
:expand:
:color: primary
:click-parent:
To FAQ
.. grid-item-card:: :octicon:`people` Community
You are welcome to join the BGE community!
+++
.. button-ref:: community/index
:expand:
:color: primary
:click-parent:
To Community
Besides the resources we provide here in this documentation, please visit our `GitHub repo <https://github.com/FlagOpen/FlagEmbedding>`_ for more content, including:
@@ -49,27 +126,39 @@ BGE is developed by Beijing Academy of Artificial Intelligence (BAAI).
:maxdepth: 1
:caption: Introduction
Introduction/installation
Introduction/quick_start
Introduction/index
.. toctree::
:hidden:
:maxdepth: 5
:caption: API
:maxdepth: 1
:caption: BGE
API/abc
API/inference
API/evaluation
API/finetune
bge/index
.. toctree::
:hidden:
:maxdepth: 2
:caption: Tutorials
tutorial/1_Embedding
tutorial/2_Metrics
tutorial/3_Indexing
tutorial/4_Evaluation
tutorial/5_Reranking
tutorial/6_RAG
tutorial/index
.. toctree::
:hidden:
:maxdepth: 5
:caption: API
API/index
.. toctree::
:hidden:
:maxdepth: 1
:caption: FAQ
FAQ/index
.. toctree::
:hidden:
:maxdepth: 1
:caption: Community
community/index


@@ -0,0 +1,14 @@
Tutorials
=========
.. toctree::
:hidden:
:maxdepth: 1
:caption: Tutorials
1_Embedding
2_Metrics
3_Indexing
4_Evaluation
5_Reranking
6_RAG