A vector database can help LLMs access external knowledge. You can load baai-general-embedding as the encoder to generate the vectors. Here is an example that builds a bot which answers your questions using knowledge from Chinese Wikipedia.

Here's a description of the Q&A dialogue scenario using flag embedding and a large language model:

Data Preprocessing and Indexing:
- Download a Chinese Wikipedia dataset.
- Encode the Chinese Wikipedia text using flag embedding.
- Build an index using BM25.
Query Enhancement with Large Language Model (LLM):
- Use an LLM to enrich the original user query with context from the chat history.
- The LLM can complete or paraphrase the query to make it more robust and comprehensive.
Document Retrieval:
- Use BM25 to retrieve the top-n documents from the locally stored Chinese Wikipedia dataset based on the enhanced query.
Embedding Retrieval:
- Perform an embedding retrieval on the top-n retrieved documents using brute force search to get top-k documents.
Answer Generation with Large Language Model (LLM):
- Present the question, the top-k retrieved documents, and the chat history to the LLM.
- The LLM generates the answer from the retrieved documents and the conversation context.
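
The two retrieval stages above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the demo's implementation: the function names (`bm25_scores`, `two_stage_search`), the whitespace tokenizer, and the hand-written three-dimensional vectors are all hypothetical. In the real pipeline the vectors would come from the embedding model and the BM25 index would be built ahead of time.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace tokenization keeps the sketch short; Chinese text needs a real tokenizer.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for t in tokenized for term in set(t))
    n = len(docs)
    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def two_stage_search(query, query_vec, docs, doc_vecs, top_n=3, top_k=1):
    """Return indices of the top_k documents after BM25 filtering and embedding re-ranking."""
    # Stage 1: BM25 keeps the top_n lexical candidates.
    scores = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_n]
    # Stage 2: brute-force cosine similarity over those candidates only.
    reranked = sorted(candidates, key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return reranked[:top_k]

# Toy corpus with hand-made vectors standing in for real embeddings (assumption).
docs = [
    "apollo 17 was the last crewed moon landing",
    "the moon orbits the earth",
    "python is a programming language",
    "moon cakes are eaten at the mid autumn festival",
]
doc_vecs = [[0.9, 0.1, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0], [0.1, 0.9, 0.0]]
query = "last moon landing"
query_vec = [1.0, 0.0, 0.0]
print(two_stage_search(query, query_vec, docs, doc_vecs, top_n=3, top_k=1))
```

Running BM25 first means the brute-force similarity search only touches `top_n` vectors instead of the whole corpus, which is why an exhaustive scan is affordable in the second stage.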

By following these steps, the Q&A system combines BM25 indexing, flag embedding, and a Large Language Model: BM25 narrows the corpus to a candidate set, embedding similarity re-ranks the candidates, and the LLM answers the question using the retrieved documents and the chat history.
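
The two LLM steps (query enhancement and answering) come down to assembling prompts from the question, the chat history, and the retrieved documents. The sketch below shows one possible way to do that. The function names and the prompt wording are hypothetical and are not taken from the demo's code, and the actual call to the LLM is left out.

```python
def format_history(history):
    """Render (user, assistant) turns as plain text."""
    return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)

def build_rewrite_prompt(question, history):
    """Prompt asking the LLM to turn a follow-up into a standalone query."""
    return (
        "Rewrite the final question so that it can be understood "
        "without the conversation.\n\n"
        f"Conversation:\n{format_history(history)}\n\n"
        f"Final question: {question}\n"
        "Standalone question:"
    )

def build_answer_prompt(question, documents, history):
    """Prompt presenting the question, the retrieved documents, and the chat history."""
    refs = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer the question using only the references below.\n\n"
        f"References:\n{refs}\n\n"
        f"Conversation:\n{format_history(history)}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Example: a follow-up question that only makes sense with the history.
history = [("Who was the first person on the moon?", "Neil Armstrong, in 1969.")]
print(build_rewrite_prompt("And the last?", history))
```

The rewritten query returned by the LLM is what would be passed to retrieval, and `build_answer_prompt` would then be filled with the top-k documents.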

Installation

sudo apt install default-jdk
pip install -r requirements.txt
conda install -c anaconda openjdk

Prepare Data

python pre_process.py --data_path ./data

This script downloads the dataset (Chinese Wikipedia), builds the BM25 index, computes the embeddings, and saves them all to data_path.
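
Before moving on, it can be useful to confirm that preprocessing produced its outputs. The check below is a small sketch that assumes the layout implied by the paths in the Quick Start snippet (`dataset`, `emb/data.npy`, and `index` under data_path). The helper name `missing_artifacts` is hypothetical and the layout may differ in your version of the script.

```python
from pathlib import Path

# Relative paths inferred from the Quick Start snippet (an assumption about the layout).
EXPECTED = ("dataset", "emb/data.npy", "index")

def missing_artifacts(data_path):
    """Return the expected outputs that do not exist under data_path."""
    root = Path(data_path)
    return [rel for rel in EXPECTED if not (root / rel).exists()]

print(missing_artifacts("./data"))
```

An empty list means all three outputs were found; anything else names what is still missing.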

Q&A usage

Run Directly

export OPENAI_API_KEY=...
python run.py --data_path ./data

This script launches the Q&A dialogue scenario described above.

Quick Start

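# encoding=gbk
from tool import LocalDatasetLoader, BMVectorIndex, Agent
# Load the preprocessed dataset and its precomputed embeddings.
loader = LocalDatasetLoader(data_path="./data/dataset",
                            embedding_path="./data/emb/data.npy")
# Combine the BM25 index with the embedding model.
index = BMVectorIndex(model_path="BAAI/bge-large-zh",
                      bm_index_path="./data/index",
                      data_loader=loader)
agent = Agent(index)
# Question in Chinese: "When was the last time someone landed on the moon?"
question = "上次有人登月是什么时候"
agent.Answer(question, RANKING=1000, TOP_N=5, verbose=False)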