{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# RAG with LangChain" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "LangChain is widely adopted by the open-source community because of its diverse functionality and clean API. In this tutorial, we will show how to use LangChain to build a RAG pipeline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Preparation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, install all the required packages:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install pypdf faiss-cpu langchain langchain-community langchain-openai langchain-huggingface" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then fill in your OpenAI API key below:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Set the OpenAI API key\n", "import os\n", "os.environ[\"OPENAI_API_KEY\"] = \"YOUR_API_KEY\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "BGE-M3 is a very powerful embedding model. We would like to know what the 'M3' stands for.\n", "\n", "Let's first ask GPT the question:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "M3-Embedding typically refers to a specific method or framework used in machine learning and natural language processing for creating embeddings, which are dense vector representations of data. 
The \"M3\" could indicate a particular model, method, or version related to embeddings, but without additional context, it's hard to provide a precise definition.\n", "\n", "If you have a specific context or source in mind where \"M3-Embedding\" is used, please provide more details, and I may be able to give a more accurate explanation!\n" ] } ], "source": [ "from langchain_openai.chat_models import ChatOpenAI\n", "\n", "llm = ChatOpenAI(model_name=\"gpt-4o-mini\")\n", "\n", "response = llm.invoke(\"What does M3-Embedding stands for?\")\n", "print(response.content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick check of the GitHub [repo](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3) of BGE-M3 shows that GPT's answer is not what we are looking for. Since the BGE-M3 paper is not in its training data, GPT cannot give us the correct answer.\n", "\n", "Now, let's use the [paper](https://arxiv.org/pdf/2402.03216) of BGE-M3 to build a RAG application that answers our question precisely." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step is to load the PDF of our paper:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from langchain_community.document_loaders import PyPDFLoader\n", "\n", "# Or download the paper and pass the path to the local file instead\n", "loader = PyPDFLoader(\"https://arxiv.org/pdf/2402.03216\")\n", "docs = loader.load()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'source': 'https://arxiv.org/pdf/2402.03216', 'page': 0}\n" ] } ], "source": [ "print(docs[0].metadata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The whole paper contains 18 pages, which is a large amount of information. We therefore split the paper into chunks to construct a corpus." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", "\n", "# initialize a splitter\n", "splitter = RecursiveCharacterTextSplitter(\n", "    chunk_size=1000,    # maximum size of each chunk, in characters\n", "    chunk_overlap=150,  # number of characters shared between adjacent chunks\n", ")\n", "\n", "# use the splitter to split our paper\n", "corpus = splitter.split_documents(docs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Indexing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indexing is one of the most important parts of RAG. LangChain provides APIs for embedding models and vector databases that make things simple and straightforward.\n", "\n", "Here, we choose bge-base-en-v1.5 to embed all the chunks into vectors, and use Faiss as our vector database." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from langchain_huggingface.embeddings import HuggingFaceEmbeddings\n", "\n", "# normalize the embeddings to unit length\n", "embedding_model = HuggingFaceEmbeddings(model_name=\"BAAI/bge-base-en-v1.5\",\n", "    encode_kwargs={\"normalize_embeddings\": True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then create a Faiss vector database from our corpus and embedding model.\n", "\n", "If you want to know more about Faiss, refer to the tutorial on [Faiss and indexing](https://github.com/FlagOpen/FlagEmbedding/tree/master/Tutorials/3_Indexing)." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from langchain_community.vectorstores import FAISS\n", "\n", "vectordb = FAISS.from_documents(corpus, embedding_model)\n", "\n", "# (optional) save the vector database to a local directory\n", "vectordb.save_local(\"vectorstore.db\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Create retriever for later use\n", "retriever = vectordb.as_retriever()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Retrieve and Generate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's write a simple prompt template. Modify its contents to fit your own use case." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "from langchain_core.prompts import ChatPromptTemplate\n", "\n", "template = \"\"\"\n", "You are a Q&A chat bot.\n", "Use the given context only, answer the question.\n", "\n", "\n", "{context}\n", "\n", "\n", "Question: {input}\n", "\"\"\"\n", "\n", "# Create a prompt template\n", "prompt = ChatPromptTemplate.from_template(template)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now everything is ready. Assemble the pieces into a chain and let the magic happen!" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from langchain.chains.combine_documents import create_stuff_documents_chain\n", "from langchain.chains import create_retrieval_chain\n", "\n", "# stuff the retrieved chunks into the {context} slot of the prompt\n", "doc_chain = create_stuff_documents_chain(llm, prompt)\n", "# retrieve chunks for the {input} question, then pass them to doc_chain\n", "chain = create_retrieval_chain(retriever, doc_chain)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following cell, and we can see that the chatbot answers the question correctly!" 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "M3-Embedding stands for a new embedding model that is distinguished for its versatility in multi-linguality, multi-functionality, and multi-granularity.\n" ] } ], "source": [ "response = chain.invoke({\"input\": \"What does M3-Embedding stands for?\"})\n", "\n", "# print the answer only\n", "print(response['answer'])" ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 2 }