update docs
commit db83452b04 (parent 70c3d329f8)
@ -47,8 +47,7 @@ Below are some commonly asked questions.
:animate: fade-in-slide-down
- Dense retrieval: map the text into a single embedding, e.g., `DPR <https://arxiv.org/abs/2004.04906>`_, `BGE-v1.5 <../bge/bge_v1_v1.5>`_
- Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with most positions set to zero, computing a weight only for tokens present in the text, e.g., BM25, `unicoil <https://arxiv.org/pdf/2106.14807>`_, and `splade <https://arxiv.org/abs/2107.05720>`_
- Multi-vector retrieval: use multiple vectors to represent a text, e.g., `ColBERT <https://arxiv.org/abs/2004.12832>`_. All three paradigms are illustrated in the sketch below.
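A minimal sketch, assuming the ``FlagEmbedding`` package and the ``BAAI/bge-m3`` checkpoint (which supports all three paradigms in a single forward pass):

.. code-block:: python

   from FlagEmbedding import BGEM3FlagModel

   # BGE-M3 can emit all three representation types at once.
   model = BGEM3FlagModel('BAAI/bge-m3')
   output = model.encode(
       ["What is dense retrieval?"],
       return_dense=True,         # one fixed-size vector per text
       return_sparse=True,        # lexical weights for tokens in the text
       return_colbert_vecs=True,  # one vector per token (multi-vector)
   )
   print(output['dense_vecs'].shape)       # dense: (1, hidden_dim)
   print(output['lexical_weights'])        # sparse: token -> weight
   print(output['colbert_vecs'][0].shape)  # multi-vector: (num_tokens, dim)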
.. dropdown:: Recommended vector database?
docs/source/Introduction/embedder.rst (new file)
@ -0,0 +1,39 @@
Embedder
========
.. tip::
If you are already familiar with the concepts, take a look at the :doc:`BGE models <../bge/index>`!
An embedder, also known as an embedding model or bi-encoder, is a model designed to convert data, usually text, code, or images, into sparse or dense numerical vectors (embeddings) in a high-dimensional vector space.
These embeddings capture the semantic meaning or key features of the input, enabling efficient comparison and analysis.
A famous demonstration comes from `word2vec <https://arxiv.org/abs/1301.3781>`_, which shows how word embeddings capture semantic relationships through vector arithmetic:
.. image:: ../_static/img/word2vec.png
:width: 500
:align: center
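A minimal sketch of this arithmetic, assuming the third-party ``gensim`` library and its pre-trained ``word2vec-google-news-300`` vectors (not part of this repo):

.. code-block:: python

   import gensim.downloader as api

   # Downloads the pre-trained vectors (~1.6 GB) on first use.
   wv = api.load('word2vec-google-news-300')

   # king - man + woman ≈ queen
   print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))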
Nowadays, embedders are capable of mapping sentences and even passages into vector space.
They are widely used in real-world tasks such as retrieval and clustering.
In the era of LLMs, embedding models play a pivotal role in RAG (retrieval-augmented generation), enabling LLMs to access and integrate relevant context from vast external datasets.
Sparse Vector
-------------
Sparse vectors are usually high-dimensional, with only a few non-zero values, which makes them effective for tasks like keyword matching.
Typically, though not always, the number of dimensions in a sparse vector corresponds to the number of distinct tokens in the language.
Each dimension is assigned a value representing the token's relative importance within the document.
Some well-known algorithms for sparse vector embedding include `bag-of-words <https://en.wikipedia.org/wiki/Bag-of-words_model>`_, `TF-IDF <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_, and `BM25 <https://en.wikipedia.org/wiki/Okapi_BM25>`_.
Sparse vector embeddings excel at capturing key terms and their relative importance within documents.
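A minimal TF-IDF sketch, assuming ``scikit-learn`` (the choice of library is illustrative, not prescribed by these docs):

.. code-block:: python

   from sklearn.feature_extraction.text import TfidfVectorizer

   docs = [
       "sparse vectors weight the tokens present in a document",
       "dense vectors capture semantics in a latent space",
   ]
   vectorizer = TfidfVectorizer()
   tfidf = vectorizer.fit_transform(docs)  # sparse matrix, shape (2, vocab_size)

   # Only tokens that occur in a document receive a non-zero weight.
   print(dict(zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0])))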
Dense Vector
------------
Dense vectors typically use neural networks to map words, sentences, and passages into a fixed-dimension latent vector space.
The similarity between two objects can then be measured with metrics such as Euclidean distance or cosine similarity.
These vectors can represent the deeper meaning of a sentence, so we can distinguish sentences that use similar words but mean different things, and recognize different phrasings in speech and writing that express the same thing.
Instead of counting and matching keywords, dense vector embeddings directly capture the semantics.
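A minimal sketch with this repo's ``FlagModel`` (the checkpoint name is one of the BGE models; with the default normalization, the inner product of two embeddings equals their cosine similarity):

.. code-block:: python

   from FlagEmbedding import FlagModel

   model = FlagModel('BAAI/bge-base-en-v1.5')

   emb_1 = model.encode(["The cat sits on the mat."])
   emb_2 = model.encode(["A feline is resting on the rug.",
                         "Stock prices fell sharply today."])

   # Inner product of normalized embeddings = cosine similarity.
   print(emb_1 @ emb_2.T)  # the paraphrase should score higher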
@ -25,5 +25,6 @@ Quickly get started with:
:caption: Concept
IR
embedder
reranker
retrieval_demo
@ -1,26 +1,5 @@
Reranker
========

A reranker, also known as a cross-encoder, is a model that refines the ranking of candidate pairs (e.g., query-document pairs) by jointly encoding and scoring them.
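
A minimal sketch with this repo's ``FlagReranker`` (the checkpoint name is illustrative; ``compute_score`` accepts a single pair or a list of pairs):

.. code-block:: python

   from FlagEmbedding import FlagReranker

   reranker = FlagReranker('BAAI/bge-reranker-base')

   # Each (query, passage) pair is jointly encoded and scored.
   scores = reranker.compute_score([
       ['what is a panda?', 'The giant panda is a bear native to China.'],
       ['what is a panda?', 'Paris is the capital of France.'],
   ])
   print(scores)  # higher score = more relevant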