{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RAG with LangChain"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LangChain is widely adopted by the open-source community because of its diverse functionality and clean API. In this tutorial we will show how to use LangChain to build a RAG pipeline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Preparation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, install all the required packages:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install pypdf langchain langchain-community langchain-openai langchain-huggingface faiss-cpu"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then fill in your OpenAI API key below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For the OpenAI API key\n",
    "import os\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"YOUR_API_KEY\""
   ]
  },
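  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(Optional) If you prefer not to hard-code the key in the notebook, the following added sketch reads it interactively with Python's built-in `getpass` module. The rest of the tutorial only requires that `OPENAI_API_KEY` is set in the environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional alternative: prompt for the key so it is not stored in the notebook\n",
    "import getpass\n",
    "import os\n",
    "\n",
    "if not os.environ.get(\"OPENAI_API_KEY\"):\n",
    "    os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter your OpenAI API key: \")"
   ]
  },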
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "BGE-M3 is a very powerful embedding model. We would like to know what that 'M3' stands for.\n",
    "\n",
    "Let's first ask GPT the question:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "M3-Embedding typically refers to a specific method or framework used in machine learning and natural language processing for creating embeddings, which are dense vector representations of data. The \"M3\" could indicate a particular model, method, or version related to embeddings, but without additional context, it's hard to provide a precise definition.\n",
      "\n",
      "If you have a specific context or source in mind where \"M3-Embedding\" is used, please provide more details, and I may be able to give a more accurate explanation!\n"
     ]
    }
   ],
   "source": [
    "from langchain_openai.chat_models import ChatOpenAI\n",
    "\n",
    "llm = ChatOpenAI(model_name=\"gpt-4o-mini\")\n",
    "\n",
    "response = llm.invoke(\"What does M3-Embedding stand for?\")\n",
    "print(response.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The correct answer can be found by quickly checking the GitHub [repo](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3) of BGE-M3. Since the BGE-M3 paper is not in GPT's training data, GPT is not able to give us the correct answer.\n",
    "\n",
    "Now, let's use the [paper](https://arxiv.org/pdf/2402.03216) of BGE-M3 to build a RAG application that answers our question precisely."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first step is to load the PDF of our paper:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import PyPDFLoader\n",
    "\n",
    "# Or download the paper and pass the path to the local file instead\n",
    "loader = PyPDFLoader(\"https://arxiv.org/pdf/2402.03216\")\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'source': 'https://arxiv.org/pdf/2402.03216', 'page': 0}\n"
     ]
    }
   ],
   "source": [
    "print(docs[0].metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The whole paper contains 18 pages. That is too much information to use as a single context, so we split the paper into chunks to construct a corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "# initialize a splitter\n",
    "splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=1000,   # Maximum size of chunks to return\n",
    "    chunk_overlap=150, # Number of overlapping characters between chunks\n",
    ")\n",
    "\n",
    "# use the splitter to split our paper\n",
    "corpus = splitter.split_documents(docs)"
   ]
  },
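  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional sanity check (an added sketch, so no output is recorded here), you can inspect how many chunks the splitter produced and preview the first one:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: number of chunks and a preview of the first one\n",
    "print(len(corpus))\n",
    "print(corpus[0].page_content[:200])"
   ]
  },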
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indexing is one of the most important parts of RAG. LangChain provides APIs for embedding models and vector databases that make things simple and straightforward.\n",
    "\n",
    "Here, we choose bge-base-en-v1.5 to embed all the chunks into vectors, and use Faiss as our vector database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_huggingface.embeddings import HuggingFaceEmbeddings\n",
    "\n",
    "embedding_model = HuggingFaceEmbeddings(\n",
    "    model_name=\"BAAI/bge-base-en-v1.5\",\n",
    "    encode_kwargs={\"normalize_embeddings\": True},  # normalized vectors make inner product equal cosine similarity\n",
    ")"
   ]
  },
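  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optionally (this is an added sketch with no recorded output), you can embed a single query to check that the model loads correctly and to see the embedding dimensionality (768 for bge-base-en-v1.5):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional check: embed one query and inspect the vector size\n",
    "sample_vector = embedding_model.embed_query(\"What does M3-Embedding stand for?\")\n",
    "print(len(sample_vector))"
   ]
  },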
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, create a Faiss vector database from our corpus and embedding model.\n",
    "\n",
    "If you want to know more about Faiss, refer to the tutorial on [Faiss and indexing](https://github.com/FlagOpen/FlagEmbedding/tree/master/Tutorials/3_Indexing)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.vectorstores import FAISS\n",
    "\n",
    "vectordb = FAISS.from_documents(corpus, embedding_model)\n",
    "\n",
    "# (optional) save the vector database to a local directory\n",
    "vectordb.save_local(\"vectorstore.db\")"
   ]
  },
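  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you saved the index above, it can be reloaded later instead of re-embedding the whole corpus. This is an added sketch rather than part of the original run; depending on your LangChain version, `FAISS.load_local` may require `allow_dangerous_deserialization=True` because the index is stored with pickle."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reload a previously saved Faiss index (skip this if you just built vectordb above)\n",
    "vectordb = FAISS.load_local(\n",
    "    \"vectorstore.db\",\n",
    "    embedding_model,\n",
    "    allow_dangerous_deserialization=True,  # needed in newer LangChain versions for pickle-based loading\n",
    ")"
   ]
  },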
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create retriever for later use\n",
    "retriever = vectordb.as_retriever()"
   ]
  },
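  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default the retriever returns a handful of the most similar chunks (typically four). The following added sketch shows two optional tweaks: querying the vector store directly to see which chunks would be retrieved, and building a retriever with a custom top-k (the variable name `retriever_top3` is illustrative only)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: inspect which chunks would be retrieved for our question\n",
    "hits = vectordb.similarity_search(\"What does M3-Embedding stand for?\", k=3)\n",
    "for doc in hits:\n",
    "    print(doc.metadata, doc.page_content[:100])\n",
    "\n",
    "# Optional: a retriever that returns only the top 3 chunks\n",
    "retriever_top3 = vectordb.as_retriever(search_kwargs={\"k\": 3})"
   ]
  },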
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Retrieve and Generate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's write a simple prompt template. Modify the contents to match your own use cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "\n",
    "template = \"\"\"\n",
    "You are a Q&A chat bot.\n",
    "Use only the given context to answer the question.\n",
    "\n",
    "<context>\n",
    "{context}\n",
    "</context>\n",
    "\n",
    "Question: {input}\n",
    "\"\"\"\n",
    "\n",
    "# Create a prompt template\n",
    "prompt = ChatPromptTemplate.from_template(template)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now everything is ready. Assemble everything into a chain and let the magic happen!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chains.combine_documents import create_stuff_documents_chain\n",
    "from langchain.chains import create_retrieval_chain\n",
    "\n",
    "doc_chain = create_stuff_documents_chain(llm, prompt)\n",
    "chain = create_retrieval_chain(retriever, doc_chain)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the following cell, and we can see that the chatbot answers the question correctly!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "M3-Embedding stands for a new embedding model that is distinguished for its versatility in multi-linguality, multi-functionality, and multi-granularity.\n"
     ]
    }
   ],
   "source": [
    "response = chain.invoke({\"input\": \"What does M3-Embedding stand for?\"})\n",
    "\n",
    "# print the answer only\n",
    "print(response['answer'])"
   ]
  },
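  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The chain built by `create_retrieval_chain` also returns the retrieved chunks under the `context` key, next to `input` and `answer`. The following added sketch (output not recorded) shows how to check which parts of the paper the answer was grounded on:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Added sketch: inspect which chunks were retrieved to ground the answer\n",
    "for doc in response[\"context\"]:\n",
    "    print(doc.metadata.get(\"page\"), doc.page_content[:80])"
   ]
  }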
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}