update docs

ZiyiXia 2024-12-03 11:31:38 +00:00
parent 6d9fa4ecf6
commit 1374b98da4
20 changed files with 436 additions and 53 deletions


@@ -13,7 +13,7 @@ jobs:
- uses: actions/setup-python@v5
- name: Install dependencies
run: |
pip install . sphinx sphinx_rtd_theme myst_parser myst-nb furo
pip install . sphinx myst_parser myst-nb sphinx-design pydata-sphinx-theme
- name: Sphinx build
run: |
sphinx-build docs/source docs/build


@@ -159,7 +159,7 @@ Currently we are updating the [tutorials](./Tutorials/), we aim to create a comp
The following contents will be released in the upcoming weeks:
- Evaluation
- RAG
- BGE-EN-ICL
<details>
<summary>The whole tutorial roadmap</summary>


@@ -1,3 +1,5 @@
sphinx
myst-nb
furo
sphinx-design
pydata-sphinx-theme
# furo


@@ -3,4 +3,5 @@ Abstract Class
.. toctree::
abc/inference
abc/evaluation
abc/finetune

docs/source/API/index.rst Normal file

@@ -0,0 +1,11 @@
API
===
.. toctree::
:hidden:
:maxdepth: 1
abc
inference
evaluation
finetune


@@ -0,0 +1,2 @@
FAQ
===


@@ -0,0 +1,37 @@
Concept
=======
Embedder
--------
Embedder, or embedding model, is a model designed to convert data, usually text, code, or images, into sparse or dense numerical vectors (embeddings) in a high-dimensional vector space.
These embeddings capture the semantic meaning or key features of the input, enabling efficient comparison and analysis.

A famous demonstration comes from `word2vec <https://arxiv.org/abs/1301.3781>`_, which shows how word embeddings capture semantic relationships through vector arithmetic:
.. image:: ../_static/img/word2vec.png
:width: 500
:align: center
Nowadays, embedders are capable of mapping sentences and even passages into vector space.
They are widely used in real-world tasks such as retrieval and clustering.
In the era of LLMs, embedding models play a pivotal role in RAG, enabling LLMs to access and integrate relevant context from vast external datasets.
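
Below is a minimal sketch of using an embedder, assuming the FlagEmbedding package is installed and using :code:`BAAI/bge-base-en-v1.5` purely as an example checkpoint:

.. code:: python

    from FlagEmbedding import FlagModel

    # load an example embedding model
    model = FlagModel('BAAI/bge-base-en-v1.5')

    sentences = ["I love NLP", "Natural language processing is fascinating"]
    embeddings = model.encode(sentences)

    # BGE embeddings are L2-normalized by default, so the inner product
    # of two embeddings is their cosine similarity
    print(embeddings[0] @ embeddings[1])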
Reranker
--------
Reranker, or Cross-Encoder, is a model that refines the ranking of candidate pairs (e.g., query-document pairs) by jointly encoding and scoring them.
Typically, we use an embedder as a Bi-Encoder: it first computes the embeddings of the two input sentences, then computes their similarity using a metric such as cosine similarity or Euclidean distance.
A reranker, in contrast, takes the two sentences at the same time and directly computes a score representing their similarity.
The following figure shows their difference:
.. figure:: https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/Bi_vs_Cross-Encoder.png
:width: 500
:align: center
Bi-Encoder & Cross-Encoder (from Sentence Transformers)
Although a Cross-Encoder usually performs better than a Bi-Encoder, it is extremely time-consuming to run over a large amount of data.
Thus a widely accepted approach is to use a Bi-Encoder for initial retrieval (e.g., selecting the top 100 candidates from 100,000 sentences) and then refine the ranking of the selected candidates using a Cross-Encoder for more accurate results.
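
Below is a sketch of the rerank scoring step with FlagEmbedding's :code:`FlagReranker`, using :code:`BAAI/bge-reranker-large` as an example checkpoint:

.. code:: python

    from FlagEmbedding import FlagReranker

    reranker = FlagReranker('BAAI/bge-reranker-large')

    query = "What is BGE?"
    candidates = ["BGE stands for BAAI General Embeddings.",
                  "BM25 is a bag-of-words retrieval function."]

    # jointly encode and score each (query, candidate) pair; higher = more relevant
    scores = reranker.compute_score([[query, c] for c in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)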


@@ -0,0 +1,19 @@
Introduction
============
BGE provides a one-stop retrieval toolkit for search and RAG, covering inference, evaluation, and fine-tuning of embedding models and rerankers.
.. figure:: ../_static/img/RAG_pipeline.png
:width: 700
:align: center
BGE embedder and reranker in an RAG pipeline.
Quickly get started with:
.. toctree::
:maxdepth: 1
installation
concept
quick_start


@@ -40,4 +40,9 @@ For development in editable mode:
# If you do not want to finetune the models, you can install the package without the finetune dependency:
pip install -e .
# If you want to finetune the models, you can install the package with the finetune dependency:
pip install -e .[finetune]
pip install -e .[finetune]
PyTorch-CUDA
------------
If you want to use CUDA GPUs during inference and finetuning, please install an appropriate version of `PyTorch <https://pytorch.org/get-started/locally/>`_ with CUDA support.
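
As a quick sanity check (a minimal sketch, assuming PyTorch is already installed), you can confirm that CUDA is visible to PyTorch:

.. code:: python

    import torch

    # True only if PyTorch was built with CUDA support and a GPU is visible
    print(torch.cuda.is_available())
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))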


@@ -0,0 +1,9 @@
.bd-sidebar-primary {
width: 22%;
line-height: 1.4;
}
.col-lg-3 {
flex: 0 0 auto;
width: 22%;
}

(new binary file: image, 297 KiB)

(new binary file: image, 68 KiB)

@@ -1,2 +1,117 @@
======
BGE-M3
======
BGE-M3 is a compound and powerful embedding model distinguished for its versatility in:

- **Multi-Functionality**: It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.
- **Multi-Linguality**: It can support more than 100 working languages.
- **Multi-Granularity**: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
+-------------------------------------------------------------------+-----------------+------------+--------------+-----------------------------------------------------------------------+
| Model | Language | Parameters | Model Size | Description |
+===================================================================+=================+============+==============+=======================================================================+
| `BAAI/bge-m3 <https://huggingface.co/BAAI/bge-m3>`_ | Multi-Lingual | 569M | 2.27 GB | Multi-Functionality, Multi-Linguality, and Multi-Granularity |
+-------------------------------------------------------------------+-----------------+------------+--------------+-----------------------------------------------------------------------+
Multi-Linguality
================
BGE-M3 was trained on multiple datasets covering 170+ different languages.
Since the amount of training data is highly unbalanced across languages, the model's actual performance also varies by language.
For more information on the datasets and evaluation results, please check out our `paper <https://arxiv.org/pdf/2402.03216>`_ for details.
Multi-Granularity
=================
BGE-M3 extends the max position to 8192, enabling the embedding of long documents.
It also proposes a simple but effective method, MCLS (Multiple CLS), to enhance the model's ability on long text without additional fine-tuning.
Multi-Functionality
===================
.. code:: python

    from FlagEmbedding import BGEM3FlagModel

    model = BGEM3FlagModel('BAAI/bge-m3')

    sentences_1 = ["What is BGE M3?", "Definition of BM25"]
    sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
                   "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]
Dense Retrieval
---------------
Similar to BGE v1 or v1.5 models, BGE-M3 uses the normalized hidden state of the special token [CLS] as the dense embedding:

.. math:: e_q = norm(H_q[0])

Next, the relevance score between the query and passage is computed as:

.. math:: s_{dense}=f_{sim}(e_p, e_q)

where :math:`e_p, e_q` are the embedding vectors of passage and query, respectively,
and :math:`f_{sim}` is the score function (such as inner product or L2 distance) for computing the similarity of two embeddings.
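
For instance, here is a minimal sketch of dense scoring with the sentences above; since the dense vectors are normalized, the inner product serves as :math:`f_{sim}`:

.. code:: python

    output_1 = model.encode(sentences_1, return_dense=True)
    output_2 = model.encode(sentences_2, return_dense=True)

    # inner products of normalized embeddings = cosine similarities
    s_dense = output_1['dense_vecs'] @ output_2['dense_vecs'].T
    print(s_dense)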
Sparse Retrieval
----------------
BGE-M3 generates sparse embeddings by adding a linear layer and a ReLU activation function on top of the hidden states:

.. math:: w_{qt} = \text{ReLU}(W_{lex}^T H_q [i])

where :math:`W_{lex}` represents the weights of the linear layer and :math:`H_q[i]` is the encoder's output of the :math:`i^{th}` token.

Based on the token weights of query and passage, the relevance score between them is computed by the joint importance of the co-existing terms within the query and passage:

.. math:: s_{lex} = \sum_{t\in q\cap p}(w_{qt} * w_{pt})

where :math:`w_{qt}, w_{pt}` are the importance weights of each co-existing term :math:`t` in query and passage, respectively.
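
A corresponding sketch of sparse scoring; :code:`compute_lexical_matching_score` is the helper provided by :code:`BGEM3FlagModel` (treat its availability as an assumption on older releases):

.. code:: python

    output_1 = model.encode(sentences_1, return_sparse=True)
    output_2 = model.encode(sentences_2, return_sparse=True)

    # lexical_weights is a list of {token_id: weight} dicts, one per sentence
    s_lex = model.compute_lexical_matching_score(
        output_1['lexical_weights'][0], output_2['lexical_weights'][0]
    )
    print(s_lex)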
Multi-Vector
------------
The multi-vector method utilizes the entire output embeddings for the representation of query :math:`E_q` and passage :math:`E_p`.

.. math::

    E_q = norm(W_{mul}^T H_q)

    E_p = norm(W_{mul}^T H_p)

where :math:`W_{mul}` is the learnable projection matrix.

Following ColBERT, BGE-M3 uses late interaction to compute the fine-grained relevance score:

.. math:: s_{mul}=\frac{1}{N}\sum_{i=1}^N\max_{j=1}^M E_q[i]\cdot E_p^T[j]

where :math:`E_q, E_p` are the entire output embeddings of query and passage, respectively.
This is the average, over each vector :math:`v\in E_q`, of its maximum similarity with the vectors in :math:`E_p`.
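
And a sketch of multi-vector scoring; :code:`colbert_score` is the late-interaction helper exposed by :code:`BGEM3FlagModel`:

.. code:: python

    output_1 = model.encode(sentences_1, return_colbert_vecs=True)
    output_2 = model.encode(sentences_2, return_colbert_vecs=True)

    # late interaction: average of each query token's max similarity over passage tokens
    s_mul = model.colbert_score(
        output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]
    )
    print(s_mul)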
Hybrid Ranking
--------------
BGE-M3's multi-functionality makes hybrid ranking possible for improving retrieval.
First, since the multi-vector method is computationally heavy, we can retrieve the candidate results with either the dense or the sparse method.
Then, to get the final result, we can rerank the candidates based on the integrated relevance score:

.. math:: s_{rank} = w_1\cdot s_{dense}+w_2\cdot s_{lex} + w_3\cdot s_{mul}

where the values chosen for :math:`w_1`, :math:`w_2` and :math:`w_3` vary depending on the downstream scenario.
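
As a minimal sketch, reusing the three scores computed above (the weights here are illustrative assumptions, not recommended values):

.. code:: python

    # example weights; tune w1, w2, w3 for your downstream scenario
    w1, w2, w3 = 0.4, 0.2, 0.4
    s_rank = w1 * s_dense[0][0] + w2 * s_lex + w3 * s_mul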
Usage
=====
.. code:: python

    from FlagEmbedding import BGEM3FlagModel

    model = BGEM3FlagModel('BAAI/bge-m3')

    sentences_1 = ["What is BGE M3?", "Definition of BM25"]
    output = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
    # dense embeddings, per-token lexical weights, and multi-vector (ColBERT) embeddings
    dense, sparse, multiv = output['dense_vecs'], output['lexical_weights'], output['colbert_vecs']


@@ -1,5 +1,7 @@
BGE-v1
======
BGE v1 & v1.5
=============
BGE v1 and v1.5 are series of encoder-only models based on BERT. They achieved the best performance among models of the same size at the time of release.
BGE
---
@@ -26,7 +28,7 @@ C-MTEB benchmarks at the time released.
BGE-v1.5
--------
Then to enhance its retrieval ability without instruction and alleviate the issue of the similarity distribution, :code:`bge-*-1.5` models
Then to enhance its retrieval ability without instruction and alleviate the issue of the similarity distribution, :code:`bge-*-v1.5` models
were released in Sep 2023. They are still among the most popular embedding models, striking a good balance between embedding quality and model size.
+-----------------------------------------------------------------------------+-----------+------------+--------------+--------------+
@@ -37,8 +39,8 @@ were released in Sep 2023. They are still the most popular embedding models that
| `BAAI/bge-base-en-v1.5 <https://huggingface.co/BAAI/bge-base-en-v1.5>`_ | English | 109M | 438 MB | reasonable |
+-----------------------------------------------------------------------------+-----------+------------+--------------+ similarity +
| `BAAI/bge-small-en-v1.5 <https://huggingface.co/BAAI/bge-small-en-v1.5>`_ | English | 33.4M | 133 MB | distribution |
+-----------------------------------------------------------------------------+-----------+------------+--------------+ +
| `BAAI/bge-large-zh-v1.5 <https://huggingface.co/BAAI/bge-large-zh-v1.5>`_ | Chinese | 326M | 1.3 GB | |
+-----------------------------------------------------------------------------+-----------+------------+--------------+ and better +
| `BAAI/bge-large-zh-v1.5 <https://huggingface.co/BAAI/bge-large-zh-v1.5>`_ | Chinese | 326M | 1.3 GB | performance |
+-----------------------------------------------------------------------------+-----------+------------+--------------+ +
| `BAAI/bge-base-zh-v1.5 <https://huggingface.co/BAAI/bge-base-zh-v1.5>`_ | Chinese | 102M | 409 MB | |
+-----------------------------------------------------------------------------+-----------+------------+--------------+ +
@@ -46,4 +48,30 @@ were released in Sep 2023. They are still the most popular embedding models that
+-----------------------------------------------------------------------------+-----------+------------+--------------+--------------+
Usage
-----
To use BGE v1 or v1.5 models for inference, load the model through :code:`FlagModel`:
.. code:: python

    from FlagEmbedding import FlagModel

    model = FlagModel('BAAI/bge-base-en-v1.5')

    sentences = ["Hello world", "I am inevitable"]
    embeddings = model.encode(sentences)
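
Since BGE embeddings are L2-normalized by default (assuming :code:`normalize_embeddings=True` has not been overridden), a sketch of pairwise similarity is a simple matrix product:

.. code:: python

    # inner products of normalized embeddings = cosine similarities
    print(embeddings @ embeddings.T)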
.. tip::
For simple tasks that only encode a few sentences like the above, using a single GPU is faster than multiple GPUs:
.. code:: python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
or
.. code:: python
model = FlagModel('BAAI/bge-base-en-v1.5', devices=0)

docs/source/bge/index.rst Normal file

@@ -0,0 +1,19 @@
BGE
===
**BGE** stands for **BAAI General Embeddings**, which is a series of embedding models released by BAAI.
.. toctree::
:maxdepth: 1
:caption: Embedder
bge_v1_v1.5
bge_m3
bge_icl
.. toctree::
:maxdepth: 1
:caption: Reranker
bge_reranker


@@ -1,5 +0,0 @@
Introduction
============
**BGE** stands for **BAAI General Embeddings**, which is a series of embedding models released by BAAI.


@@ -0,0 +1,2 @@
Community
=========


@@ -24,6 +24,7 @@ extensions = [
"sphinx.ext.githubpages",
"sphinx.ext.viewcode",
"sphinx.ext.coverage",
"sphinx_design",
"myst_nb",
]
@@ -33,21 +34,55 @@ exclude_patterns = []
# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = 'furo'
# html_logo = "_static/img/BAAI_logo.png"
# html_theme = 'furo'
html_theme = "pydata_sphinx_theme"
html_logo = "_static/img/BAAI_logo.png"
html_title = "FlagEmbedding"
html_static_path = ['_static']
html_css_files = ["css/custom.css"]
html_theme_options = {
# "light_logo": "/_static/img/BAAI_logo.png",
"light_css_variables": {
"color-brand-primary": "#238be8",
"color-brand-content": "#238be8",
},
"dark_css_variables": {
"color-brand-primary": "#FBCB67",
"color-brand-content": "#FBCB67",
},
# # "light_logo": "/_static/img/BAAI_logo.png",
# "light_css_variables": {
# "color-brand-primary": "#238be8",
# "color-brand-content": "#238be8",
# },
# "dark_css_variables": {
# "color-brand-primary": "#FBCB67",
# "color-brand-content": "#FBCB67",
# },
"navigation_depth": 5,
}
# MyST-NB conf
nb_execution_mode = "off"
nb_execution_mode = "off"
html_theme_options = {
"external_links": [
{
"url": "https://huggingface.co/collections/BAAI/bge-66797a74476eb1f085c7446d",
"name": "HF Models",
},
],
"icon_links":[
{
"name": "GitHub",
"url": "https://github.com/FlagOpen/FlagEmbedding",
"icon": "fa-brands fa-github",
},
{
"name": "PyPI",
"url": "https://pypi.org/project/FlagEmbedding/",
"icon": "fa-brands fa-python",
},
{
"name": "HF Models",
"url": "https://huggingface.co/collections/BAAI/bge-66797a74476eb1f085c7446d",
"icon": "fa-solid fa-cube",
}
],
"header_links_before_dropdown": 7,
}
html_context = {
"default_mode": "light"
}


@@ -3,23 +3,100 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
BAAI General Embedding
======================
:html_theme.sidebar_secondary.remove: True
|
|
.. image:: _static/img/bge_logo.jpg
:target: https://github.com/FlagOpen/FlagEmbedding
:width: 500
:align: center
|
|
BGE
===
Welcome to the BGE documentation!
We aim to build a one-stop retrieval toolkit for search and RAG.
.. figure:: _static/img/bge_logo.jpg
:width: 400
:align: center
.. grid:: 3
:gutter: 3
.. grid-item-card:: :octicon:`milestone` Introduction
New to BGE? Quickly get hands-on information.
+++
.. button-ref:: Introduction/index
:expand:
:color: primary
:click-parent:
To Introduction
.. grid-item-card:: :octicon:`package` BGE Models
Get to know BGE embedding models and rerankers.
+++
.. button-ref:: bge/index
:expand:
:color: primary
:click-parent:
To BGE
.. grid-item-card:: :octicon:`log` Tutorials
Find useful tutorials to start with if you are looking for guidance.
+++
.. button-ref:: tutorial/index
:expand:
:color: primary
:click-parent:
To Tutorials
.. grid-item-card:: :octicon:`codescan` API
Check the API of classes and functions in FlagEmbedding.
+++
.. button-ref:: API/index
:expand:
:color: primary
:click-parent:
To APIs
.. grid-item-card:: :octicon:`question` FAQ
Take a look at frequently asked questions.
+++
.. button-ref:: FAQ/index
:expand:
:color: primary
:click-parent:
To FAQ
.. grid-item-card:: :octicon:`people` Community
You are welcome to join the BGE community!
+++
.. button-ref:: community/index
:expand:
:color: primary
:click-parent:
To Community
Besides the resources we provide here in this documentation, please visit our `GitHub repo <https://github.com/FlagOpen/FlagEmbedding>`_ for more content, including:
@@ -49,27 +126,39 @@ BGE is developed by Beijing Academy of Artificial Intelligence (BAAI).
:maxdepth: 1
:caption: Introduction
Introduction/installation
Introduction/quick_start
Introduction/index
.. toctree::
:hidden:
:maxdepth: 5
:caption: API
:maxdepth: 1
:caption: BGE
API/abc
API/inference
API/evaluation
API/finetune
bge/index
.. toctree::
:hidden:
:maxdepth: 2
:caption: Tutorials
tutorial/1_Embedding
tutorial/2_Metrics
tutorial/3_Indexing
tutorial/4_Evaluation
tutorial/5_Reranking
tutorial/6_RAG
tutorial/index
.. toctree::
:hidden:
:maxdepth: 5
:caption: API
API/index
.. toctree::
:hidden:
:maxdepth: 1
:caption: FAQ
FAQ/index
.. toctree::
:hidden:
:maxdepth: 1
:caption: Community
community/index


@@ -0,0 +1,14 @@
Tutorials
=========
.. toctree::
:hidden:
:maxdepth: 1
:caption: Tutorials
1_Embedding
2_Metrics
3_Indexing
4_Evaluation
5_Reranking
6_RAG