"For text retrieval, pattern matching is the most intuitive way. People would use certain characters, words, phrases, or sentence patterns. However, not only for human, it is also extremely inefficient for computer to do pattern matching between a query and a collection of text files to find the possible results. \n",
"\n",
"For images and acoustic waves, there are rgb pixels and digital signals. Similarly, in order to accomplish more sophisticated tasks of natural language such as retrieval, classification, clustering, or semantic search, we need a way to represent text data. That's how text embedding comes in front of the stage."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Background"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Traditional text embedding methods like one-hot encoding and bag-of-words (BoW) represent words and sentences as sparse vectors based on their statistical features, such as word appearance and frequency within a document. More advanced methods like TF-IDF and BM25 improve on these by considering a word's importance across an entire corpus, while n-gram techniques capture word order in small groups. However, these approaches suffer from the \"curse of dimensionality\" and fail to capture semantic similarity like \"cat\" and \"kitty\", difference like \"play the watch\" and \"watch the play\"."
"To overcome these limitations, dense word embeddings were developed, mapping words to vectors in a low-dimensional space that captures semantic and relational information. Early models like Word2Vec demonstrated the power of dense embeddings using neural networks. Subsequent advancements with neural network architectures like RNNs, LSTMs, and Transformers have enabled more sophisticated models such as BERT, RoBERTa, and GPT to excel in capturing complex word relationships and contexts. **BAAI General Embedding (BGE)** provide a series of open-source models that could satisfy all kinds of demands."
]
},
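{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the limitation concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available) showing that bag-of-words assigns \"cat\" and \"kitty\" zero similarity and cannot tell \"play the watch\" apart from \"watch the play\":"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"texts = [\"cat\", \"kitty\", \"play the watch\", \"watch the play\"]\n",
"# Bag-of-words: each text becomes a sparse count vector over the vocabulary\n",
"bow = CountVectorizer().fit_transform(texts).toarray()\n",
"\n",
"# \"cat\" and \"kitty\" share no terms, so their similarity is 0 despite close meaning\n",
"print(\"cat vs kitty:\", np.dot(bow[0], bow[1]))\n",
"# Word order is discarded, so the two sentences get identical vectors\n",
"print(\"same vector:\", np.array_equal(bow[2], bow[3]))"
]
},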
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get Embedding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first step of modern text retrieval is embedding the text. So let's take a look at how to use the embedding models."
"Then run the following cells to get the embeddings. Check their official [documentation](https://platform.openai.com/docs/guides/embeddings) for more details."
"embeddings = np.asarray([response.data[i].embedding for i in range(len(sentences))])\n",
"print(f\"Embeddings:\\n{embeddings.shape}\")\n",
"\n",
"scores = embeddings @ embeddings.T\n",
"print(f\"Similarity scores:\\n{scores}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Voyage AI"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Voyage AI provides embedding models and rerankers for different purpus and in various fields. Their API keys can be freely used in low frequency and token length."