Mirror of https://github.com/microsoft/autogen.git, synced 2025-07-23 17:01:35 +00:00

{
"cells": [
{
"cell_type": "markdown",
"id": "e4fccaaa-fda5-4f99-a4c5-c463c5c890f5",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/microsoft/autogen/blob/main/notebook/agentchat_video_transcript_translate_with_whisper.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"id": "a5b4540e-4987-4774-9305-764c3133e953",
"metadata": {},
"source": [
"<a id=\"toc\"></a>\n",
"# Auto Generated Agent Chat: Translating Video Audio Using Whisper and GPT-3.5-turbo\n",
"In this notebook, we demonstrate how to use Whisper and GPT-3.5-turbo with `AssistantAgent` and `UserProxyAgent` to recognize and translate\n",
"the speech in a video file and add timestamps, like a subtitle file, based on [agentchat_function_call.ipynb](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_function_call.ipynb).\n"
]
},
{
"cell_type": "markdown",
"id": "4fd644cc-2b14-4700-8b1d-959fb2e9acb0",
"metadata": {},
"source": [
"## Requirements\n",
"AutoGen requires `Python>=3.8`. To run this notebook example, please install `openai`, `pyautogen`, `openai-whisper`, and `moviepy`:\n",
"```bash\n",
"pip install openai\n",
"pip install openai-whisper\n",
"pip install moviepy\n",
"pip install pyautogen\n",
"```"
]
},
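{
"cell_type": "markdown",
"id": "ffmpeg-install-note",
"metadata": {},
"source": [
"Note: `openai-whisper` and `moviepy` rely on the `ffmpeg` command-line tool, which `pip` does not install. If `ffmpeg` is not already on your system, install it with your platform's package manager, for example:\n",
"```bash\n",
"# Ubuntu / Debian\n",
"sudo apt install ffmpeg\n",
"# macOS (Homebrew)\n",
"brew install ffmpeg\n",
"# Windows (Chocolatey)\n",
"choco install ffmpeg\n",
"```"
]
},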
{
"cell_type": "code",
"execution_count": null,
"id": "bc4600b8-c6df-49dd-945d-ce69f30a65cc",
"metadata": {},
"outputs": [],
"source": [
"%%capture --no-stderr\n",
"# %pip install moviepy~=1.0.3\n",
"# %pip install openai-whisper~=20230918\n",
"# %pip install openai~=1.3.5\n",
"# %pip install \"pyautogen>=0.2.3\""
]
},
{
"cell_type": "markdown",
"id": "18bdeb0b-c4b6-4dec-97d2-d84f09cffa00",
"metadata": {},
"source": [
"## Set your API Endpoint\n",
"It is recommended to store your OpenAI API key in an environment variable, for example `OPENAI_API_KEY`, rather than hard-coding it in the notebook."
]
},
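{
"cell_type": "markdown",
"id": "api-key-env-example",
"metadata": {},
"source": [
"For example, set it in a terminal before launching Jupyter (the exact command depends on your shell; the key value below is a placeholder):\n",
"```bash\n",
"export OPENAI_API_KEY=\"your-api-key\"  # bash/zsh\n",
"```"
]
},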
{
"cell_type": "code",
"execution_count": 1,
"id": "26d1ae87-f007-4286-a56a-dcf68abf9393",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import whisper\n",
"from moviepy.editor import VideoFileClip\n",
"from openai import OpenAI\n",
"\n",
"import autogen\n",
"\n",
"config_list = [\n",
"    {\n",
"        \"model\": \"gpt-4\",\n",
"        \"api_key\": os.getenv(\"OPENAI_API_KEY\"),\n",
"    }\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "324fec65-ab23-45db-a7a8-0aaf753fe19c",
"metadata": {},
"source": [
"## Example and Output\n",
"Below is an example of speech recognition from a [Peppa Pig cartoon video clip](https://drive.google.com/file/d/1QY0naa2acHw2FuH7sY3c-g2sBLtC2Sv4/view?usp=drive_link), originally in English and translated into Chinese.\n",
"`FFmpeg` does not support online files, so to run the code on the example video you need to download it locally first; then change `your_file_path` to your local video file path."
]
},
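{
"cell_type": "markdown",
"id": "srt-conversion-sketch",
"metadata": {},
"source": [
"The code below writes the transcript to `transcription.txt` as plain `<start>s to <end>s: <text>` lines. If you would rather produce a standard `.srt` subtitle file, a minimal sketch (assuming exactly that line format; the helper names are our own) is:\n",
"```python\n",
"def to_srt_timestamp(seconds: float) -> str:\n",
"    # SRT timestamps use the form HH:MM:SS,mmm\n",
"    ms = int(round(seconds * 1000))\n",
"    h, ms = divmod(ms, 3600000)\n",
"    m, ms = divmod(ms, 60000)\n",
"    s, ms = divmod(ms, 1000)\n",
"    return f\"{h:02d}:{m:02d}:{s:02d},{ms:03d}\"\n",
"\n",
"\n",
"def transcript_to_srt(lines):\n",
"    # lines: iterable of strings like \"0s to 3.0s: some text\"\n",
"    blocks = []\n",
"    for i, line in enumerate(lines, start=1):\n",
"        times, _, text = line.partition(\": \")\n",
"        start, _, end = times.partition(\" to \")\n",
"        blocks.append(\n",
"            f\"{i}\\n{to_srt_timestamp(float(start[:-1]))} --> \"\n",
"            f\"{to_srt_timestamp(float(end[:-1]))}\\n{text.strip()}\\n\"\n",
"        )\n",
"    return \"\\n\".join(blocks)\n",
"```"
]
},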
{
"cell_type": "code",
"execution_count": 5,
"id": "ed549b75-b4ea-4ec5-8c0b-a15e93ffd618",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33muser_proxy\u001b[0m (to chatbot):\n",
"\n",
"For the video located in E:\\\\pythonProject\\\\gpt_detection\\\\peppa pig.mp4, recognize the speech and transfer it into a script file, then translate from English text to a Chinese video subtitle text. \n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[33mchatbot\u001b[0m (to user_proxy):\n",
"\n",
"\u001b[32m***** Suggested function Call: recognize_transcript_from_video *****\u001b[0m\n",
"Arguments: \n",
"{\n",
"\"audio_filepath\": \"E:\\\\\\\\pythonProject\\\\\\\\gpt_detection\\\\\\\\peppa pig.mp4\"\n",
"}\n",
"\u001b[32m********************************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[35m\n",
">>>>>>>> EXECUTING FUNCTION recognize_transcript_from_video...\u001b[0m\n",
"Detecting language using up to the first 30 seconds. Use `--language` to specify the language\n",
"Detected language: English\n",
"[00:00.000 --> 00:03.000] This is my little brother George.\n",
"[00:03.000 --> 00:05.000] This is Mummy Pig.\n",
"[00:05.000 --> 00:07.000] And this is Daddy Pig.\n",
"[00:07.000 --> 00:09.000] Pee-pah Pig.\n",
"[00:09.000 --> 00:11.000] Desert Island.\n",
"[00:11.000 --> 00:14.000] Pepper and George are at Danny Dog's house.\n",
"[00:14.000 --> 00:17.000] Captain Dog is telling stories of when he was a sailor.\n",
"[00:17.000 --> 00:20.000] I sailed all around the world.\n",
"[00:20.000 --> 00:22.000] And then I came home again.\n",
"[00:22.000 --> 00:25.000] But now I'm back for good.\n",
"[00:25.000 --> 00:27.000] I'll never forget you.\n",
"[00:27.000 --> 00:29.000] Daddy, do you miss the sea?\n",
"[00:29.000 --> 00:31.000] Well, sometimes.\n",
"[00:31.000 --> 00:36.000] It is Grandad Dog, Grandpa Pig and Grumpy Rabbit.\n",
"[00:36.000 --> 00:37.000] Hello.\n",
"[00:37.000 --> 00:40.000] Can Captain Dog come out to play?\n",
"[00:40.000 --> 00:43.000] What? We are going on a fishing trip.\n",
"[00:43.000 --> 00:44.000] On a boat?\n",
"[00:44.000 --> 00:45.000] On the sea!\n",
"[00:45.000 --> 00:47.000] OK, let's go.\n",
"[00:47.000 --> 00:51.000] But Daddy, you said you'd never get on a boat again.\n",
"[00:51.000 --> 00:54.000] I'm not going to get on a boat again.\n",
"[00:54.000 --> 00:57.000] You said you'd never get on a boat again.\n",
"[00:57.000 --> 01:00.000] Oh, yes. So I did.\n",
"[01:00.000 --> 01:02.000] OK, bye-bye.\n",
"[01:02.000 --> 01:03.000] Bye.\n",
"\u001b[33muser_proxy\u001b[0m (to chatbot):\n",
"\n",
"\u001b[32m***** Response from calling function \"recognize_transcript_from_video\" *****\u001b[0m\n",
"[{'sentence': 'This is my little brother George..', 'timestamp_start': 0, 'timestamp_end': 3.0}, {'sentence': 'This is Mummy Pig..', 'timestamp_start': 3.0, 'timestamp_end': 5.0}, {'sentence': 'And this is Daddy Pig..', 'timestamp_start': 5.0, 'timestamp_end': 7.0}, {'sentence': 'Pee-pah Pig..', 'timestamp_start': 7.0, 'timestamp_end': 9.0}, {'sentence': 'Desert Island..', 'timestamp_start': 9.0, 'timestamp_end': 11.0}, {'sentence': \"Pepper and George are at Danny Dog's house..\", 'timestamp_start': 11.0, 'timestamp_end': 14.0}, {'sentence': 'Captain Dog is telling stories of when he was a sailor..', 'timestamp_start': 14.0, 'timestamp_end': 17.0}, {'sentence': 'I sailed all around the world..', 'timestamp_start': 17.0, 'timestamp_end': 20.0}, {'sentence': 'And then I came home again..', 'timestamp_start': 20.0, 'timestamp_end': 22.0}, {'sentence': \"But now I'm back for good..\", 'timestamp_start': 22.0, 'timestamp_end': 25.0}, {'sentence': \"I'll never forget you..\", 'timestamp_start': 25.0, 'timestamp_end': 27.0}, {'sentence': 'Daddy, do you miss the sea?.', 'timestamp_start': 27.0, 'timestamp_end': 29.0}, {'sentence': 'Well, sometimes..', 'timestamp_start': 29.0, 'timestamp_end': 31.0}, {'sentence': 'It is Grandad Dog, Grandpa Pig and Grumpy Rabbit..', 'timestamp_start': 31.0, 'timestamp_end': 36.0}, {'sentence': 'Hello..', 'timestamp_start': 36.0, 'timestamp_end': 37.0}, {'sentence': 'Can Captain Dog come out to play?.', 'timestamp_start': 37.0, 'timestamp_end': 40.0}, {'sentence': 'What? We are going on a fishing trip..', 'timestamp_start': 40.0, 'timestamp_end': 43.0}, {'sentence': 'On a boat?.', 'timestamp_start': 43.0, 'timestamp_end': 44.0}, {'sentence': 'On the sea!.', 'timestamp_start': 44.0, 'timestamp_end': 45.0}, {'sentence': \"OK, let's go..\", 'timestamp_start': 45.0, 'timestamp_end': 47.0}, {'sentence': \"But Daddy, you said you'd never get on a boat again..\", 'timestamp_start': 47.0, 'timestamp_end': 51.0}, {'sentence': \"I'm not going to get on a boat again..\", 'timestamp_start': 51.0, 'timestamp_end': 54.0}, {'sentence': \"You said you'd never get on a boat again..\", 'timestamp_start': 54.0, 'timestamp_end': 57.0}, {'sentence': 'Oh, yes. So I did..', 'timestamp_start': 57.0, 'timestamp_end': 60.0}, {'sentence': 'OK, bye-bye..', 'timestamp_start': 60.0, 'timestamp_end': 62.0}, {'sentence': 'Bye..', 'timestamp_start': 62.0, 'timestamp_end': 63.0}]\n",
"\u001b[32m****************************************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[33mchatbot\u001b[0m (to user_proxy):\n",
"\n",
"\u001b[32m***** Suggested function Call: translate_transcript *****\u001b[0m\n",
"Arguments: \n",
"{\n",
"\"source_language\": \"en\",\n",
"\"target_language\": \"zh\"\n",
"}\n",
"\u001b[32m*********************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[35m\n",
">>>>>>>> EXECUTING FUNCTION translate_transcript...\u001b[0m\n",
"\u001b[33muser_proxy\u001b[0m (to chatbot):\n",
"\n",
"\u001b[32m***** Response from calling function \"translate_transcript\" *****\u001b[0m\n",
"0s to 3.0s: 这是我小弟弟乔治。\n",
"3.0s to 5.0s: 这是妈妈猪。\n",
"5.0s to 7.0s: 这位是猪爸爸..\n",
"7.0s to 9.0s: 'Peppa Pig...' (皮皮猪)\n",
"9.0s to 11.0s: \"荒岛..\"\n",
"11.0s to 14.0s: 胡椒和乔治在丹尼狗的家里。\n",
"14.0s to 17.0s: 船长狗正在讲述他作为一名海员时的故事。\n",
"17.0s to 20.0s: 我环游了全世界。\n",
"20.0s to 22.0s: 然后我又回到了家。。\n",
"22.0s to 25.0s: \"但现在我回来了,永远地回来了...\"\n",
"25.0s to 27.0s: \"我永远不会忘记你...\"\n",
"27.0s to 29.0s: \"爸爸,你想念大海吗?\"\n",
"29.0s to 31.0s: 嗯,有时候...\n",
"31.0s to 36.0s: 这是大爷狗、爷爷猪和脾气暴躁的兔子。\n",
"36.0s to 37.0s: 你好。\n",
"37.0s to 40.0s: \"船长狗可以出来玩吗?\"\n",
"40.0s to 43.0s: 什么?我们要去钓鱼了。。\n",
"43.0s to 44.0s: 在船上?\n",
"44.0s to 45.0s: 在海上!\n",
"45.0s to 47.0s: 好的,我们走吧。\n",
"47.0s to 51.0s: \"但是爸爸,你说过你再也不会上船了…\"\n",
"51.0s to 54.0s: \"我不会再上船了..\"\n",
"54.0s to 57.0s: \"你说过再也不会上船了...\"\n",
"57.0s to 60.0s: 哦,是的。所以我做了。\n",
"60.0s to 62.0s: 好的,再见。\n",
"62.0s to 63.0s: 再见。。\n",
"\u001b[32m*****************************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[33mchatbot\u001b[0m (to user_proxy):\n",
"\n",
"TERMINATE\n",
"\n",
"--------------------------------------------------------------------------------\n"
]
}
],
"source": [
"def recognize_transcript_from_video(audio_filepath):\n",
"    try:\n",
"        # Load model\n",
"        model = whisper.load_model(\"small\")\n",
"\n",
"        # Transcribe audio with detailed timestamps\n",
"        result = model.transcribe(audio_filepath, verbose=True)\n",
"\n",
"        # Initialize variables for transcript\n",
"        transcript = []\n",
"        sentence = \"\"\n",
"        start_time = 0\n",
"\n",
"        # Iterate through the segments in the result\n",
"        for segment in result[\"segments\"]:\n",
"            # If a new sentence starts, save the previous one and reset variables\n",
"            if segment[\"start\"] != start_time and sentence:\n",
"                transcript.append(\n",
"                    {\n",
"                        \"sentence\": sentence.strip() + \".\",\n",
"                        \"timestamp_start\": start_time,\n",
"                        \"timestamp_end\": segment[\"start\"],\n",
"                    }\n",
"                )\n",
"                sentence = \"\"\n",
"                start_time = segment[\"start\"]\n",
"\n",
"            # Add the segment text to the current sentence\n",
"            sentence += segment[\"text\"] + \" \"\n",
"\n",
"        # Add the final sentence\n",
"        if sentence:\n",
"            transcript.append(\n",
"                {\n",
"                    \"sentence\": sentence.strip() + \".\",\n",
"                    \"timestamp_start\": start_time,\n",
"                    \"timestamp_end\": result[\"segments\"][-1][\"end\"],\n",
"                }\n",
"            )\n",
"\n",
"        # Save the transcript to a file\n",
"        with open(\"transcription.txt\", \"w\") as file:\n",
"            for item in transcript:\n",
"                sentence = item[\"sentence\"]\n",
"                start_time, end_time = item[\"timestamp_start\"], item[\"timestamp_end\"]\n",
"                file.write(f\"{start_time}s to {end_time}s: {sentence}\\n\")\n",
"\n",
"        return transcript\n",
"\n",
"    except FileNotFoundError:\n",
"        return \"The specified audio file could not be found.\"\n",
"    except Exception as e:\n",
"        return f\"An unexpected error occurred: {str(e)}\"\n",
"\n",
"\n",
"def translate_text(input_text, source_language, target_language):\n",
"    client = OpenAI(api_key=key)\n",
"\n",
"    response = client.chat.completions.create(\n",
"        model=\"gpt-3.5-turbo\",\n",
"        messages=[\n",
"            {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
"            {\n",
"                \"role\": \"user\",\n",
"                \"content\": f\"Directly translate the following {source_language} text to a pure {target_language} \"\n",
"                f\"video subtitle text without additional explanation.: '{input_text}'\",\n",
"            },\n",
"        ],\n",
"        max_tokens=1500,\n",
"    )\n",
"\n",
"    # Access the response content\n",
"    translated_text = response.choices[0].message.content if response.choices else None\n",
"    return translated_text\n",
"\n",
"\n",
"def translate_transcript(source_language, target_language):\n",
"    with open(\"transcription.txt\", \"r\") as f:\n",
"        lines = f.readlines()\n",
"\n",
"    translated_transcript = []\n",
"\n",
"    for line in lines:\n",
"        # Split each line into timestamp and text parts\n",
"        parts = line.strip().split(\": \")\n",
"        if len(parts) == 2:\n",
"            timestamp, text = parts[0], parts[1]\n",
"            # Translate only the text part\n",
"            translated_text = translate_text(text, source_language, target_language)\n",
"            # Reconstruct the line with the translated text and the preserved timestamp\n",
"            translated_line = f\"{timestamp}: {translated_text}\"\n",
"            translated_transcript.append(translated_line)\n",
"        else:\n",
"            # If the line doesn't contain a timestamp, add it as is\n",
"            translated_transcript.append(line.strip())\n",
"\n",
"    return \"\\n\".join(translated_transcript)\n",
"\n",
"\n",
"llm_config = {\n",
"    \"functions\": [\n",
"        {\n",
"            \"name\": \"recognize_transcript_from_video\",\n",
"            \"description\": \"Recognize the speech from a video file and save the transcript to a text file\",\n",
"            \"parameters\": {\n",
"                \"type\": \"object\",\n",
"                \"properties\": {\n",
"                    \"audio_filepath\": {\n",
"                        \"type\": \"string\",\n",
"                        \"description\": \"path of the video file\",\n",
"                    }\n",
"                },\n",
"                \"required\": [\"audio_filepath\"],\n",
"            },\n",
"        },\n",
"        {\n",
"            \"name\": \"translate_transcript\",\n",
"            \"description\": \"Translate the transcript text file from the source language to the target language using the translate_text function\",\n",
"            \"parameters\": {\n",
"                \"type\": \"object\",\n",
"                \"properties\": {\n",
"                    \"source_language\": {\n",
"                        \"type\": \"string\",\n",
"                        \"description\": \"source language\",\n",
"                    },\n",
"                    \"target_language\": {\n",
"                        \"type\": \"string\",\n",
"                        \"description\": \"target language\",\n",
"                    },\n",
"                },\n",
"                \"required\": [\"source_language\", \"target_language\"],\n",
"            },\n",
"        },\n",
"    ],\n",
"    \"config_list\": config_list,\n",
"    \"timeout\": 120,\n",
"}\n",
"source_language = \"English\"\n",
"target_language = \"Chinese\"\n",
"key = os.getenv(\"OPENAI_API_KEY\")\n",
"target_video = \"your_file_path\"\n",
"\n",
"chatbot = autogen.AssistantAgent(\n",
"    name=\"chatbot\",\n",
"    system_message=\"For coding tasks, only use the functions you have been provided with. Reply TERMINATE when the task is done.\",\n",
"    llm_config=llm_config,\n",
")\n",
"\n",
"user_proxy = autogen.UserProxyAgent(\n",
"    name=\"user_proxy\",\n",
"    is_termination_msg=lambda x: x.get(\"content\", \"\") and x.get(\"content\", \"\").rstrip().endswith(\"TERMINATE\"),\n",
"    human_input_mode=\"NEVER\",\n",
"    max_consecutive_auto_reply=10,\n",
"    code_execution_config={\n",
"        \"work_dir\": \"coding_2\",\n",
"        \"use_docker\": False,\n",
"    },  # Please set use_docker=True if docker is available to run the generated code. Using docker is safer than running the generated code directly.\n",
")\n",
"\n",
"user_proxy.register_function(\n",
"    function_map={\n",
"        \"recognize_transcript_from_video\": recognize_transcript_from_video,\n",
"        \"translate_transcript\": translate_transcript,\n",
"    }\n",
")\n",
"user_proxy.initiate_chat(\n",
"    chatbot,\n",
"    message=f\"For the video located in {target_video}, recognize the speech and transfer it into a script file, \"\n",
"    f\"then translate from {source_language} text to a {target_language} video subtitle text. \",\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}