---
title: Use AutoGen for Local LLMs
authors: jialeliu
tags: [LLM]
---

**TL;DR:**
We demonstrate how to use AutoGen for local LLM applications. As an example, we will initiate an endpoint using [FastChat](https://github.com/lm-sys/FastChat) and perform inference on [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B).

## Preparations

### Clone FastChat

FastChat provides OpenAI-compatible APIs for its supported models, so you can use FastChat as a local drop-in replacement for OpenAI APIs. However, its code needs minor modification in order to function properly.

```bash
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
```

### Download checkpoint

ChatGLM-6B is an open bilingual language model based on the General Language Model (GLM) framework, with 6.2 billion parameters. ChatGLM2-6B is its second-generation version.

Before downloading from HuggingFace Hub, you need to have Git LFS [installed](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage).

```bash
git clone https://huggingface.co/THUDM/chatglm2-6b
```

## Initiate server

First, launch the controller:

```bash
python -m fastchat.serve.controller
```

Then, launch the model worker(s):

```bash
python -m fastchat.serve.model_worker --model-path chatglm2-6b
```

Finally, launch the RESTful API server:

```bash
python -m fastchat.serve.openai_api_server --host localhost --port 8000
```

Normally this will work. However, if you encounter an error like [this](https://github.com/lm-sys/FastChat/issues/1641), commenting out all the lines containing `finish_reason` in `fastchat/protocol/api_protocol.py` and `fastchat/protocol/openai_api_protocol.py` will fix the problem. The modified code looks like:

```python
class CompletionResponseChoice(BaseModel):
    index: int
    text: str
    logprobs: Optional[int] = None
    # finish_reason: Optional[Literal["stop", "length"]]

class CompletionResponseStreamChoice(BaseModel):
    index: int
    text: str
    logprobs: Optional[float] = None
    # finish_reason: Optional[Literal["stop", "length"]] = None
```

## Interact with model using `oai.Completion` (requires openai<1)

Now the models can be directly accessed through the openai-python library, as well as `autogen.oai.Completion` and `autogen.oai.ChatCompletion`.

```python
from autogen import oai

# create a text completion request
response = oai.Completion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "base_url": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",  # just a placeholder
        }
    ],
    prompt="Hi",
)
print(response)

# create a chat completion request
response = oai.ChatCompletion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "base_url": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",
        }
    ],
    messages=[{"role": "user", "content": "Hi"}],
)
print(response)
```

If you would like to switch to different models, download their checkpoints and specify the model path when launching the model worker(s).
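Because FastChat's RESTful API server speaks the OpenAI protocol, you can also query it with the openai-python package (pre-1.0) directly, without going through autogen. Below is a minimal sketch, assuming the server from the previous section is running at `http://localhost:8000/v1` and no API-key check is enabled:

```python
import openai

# Point the openai-python (<1.0) client at the local FastChat endpoint.
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "NULL"  # placeholder; assumes the local server does not validate keys

# Send a chat completion request to the locally served ChatGLM2-6B model.
response = openai.ChatCompletion.create(
    model="chatglm2-6b",
    messages=[{"role": "user", "content": "Hi"}],
)
print(response.choices[0].message["content"])
```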
## Interacting with multiple local LLMs

If you would like to interact with multiple LLMs on your local machine, replace the `model_worker` step above with a multi-model variant:

```bash
python -m fastchat.serve.multi_model_worker \
    --model-path lmsys/vicuna-7b-v1.3 \
    --model-names vicuna-7b-v1.3 \
    --model-path chatglm2-6b \
    --model-names chatglm2-6b
```

The inference code would be:

```python
from autogen import oai

# create a chat completion request
response = oai.ChatCompletion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "base_url": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",
        },
        {
            "model": "vicuna-7b-v1.3",
            "base_url": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",
        },
    ],
    messages=[{"role": "user", "content": "Hi"}],
)
print(response)
```

To confirm that both models are actually being served by the single endpoint, see the sketch after the reading list below.

## For Further Reading

* [Documentation](/docs/Getting-Started) about `autogen`.
* [Documentation](https://github.com/lm-sys/FastChat) about FastChat.
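As a quick check on the multi-model setup above, you can list the models the OpenAI-compatible server exposes and send the same prompt to each of them. A minimal sketch with the pre-1.0 openai-python package, assuming the server is running at `http://localhost:8000/v1` with no API-key check enabled:

```python
import openai

# Point the openai-python (<1.0) client at the local FastChat endpoint.
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "NULL"  # placeholder; assumes the local server does not validate keys

# List the models registered with the API server; with the multi_model_worker
# above, this should include both vicuna-7b-v1.3 and chatglm2-6b.
models = openai.Model.list()
print([m["id"] for m in models["data"]])

# Send the same prompt to each model behind the single endpoint.
for name in ("vicuna-7b-v1.3", "chatglm2-6b"):
    reply = openai.ChatCompletion.create(
        model=name,
        messages=[{"role": "user", "content": "Hi"}],
    )
    print(name, "->", reply.choices[0].message["content"])
```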