autogen/notebook/agentchat_groupchat_RAG.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--\n",
    "tags: [\"group chat\", \"orchestration\", \"RAG\"]\n",
    "description: |\n",
    "    Implement and manage a multi-agent chat system using AutoGen, where AI assistants retrieve information, generate code, and interact collaboratively to solve complex tasks, especially in areas not covered by their training data.\n",
    "-->"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Group Chat with Retrieval Augmented Generation\n",
    "\n",
    "AutoGen supports conversable agents powered by LLMs, tools, or humans, performing tasks collectively via automated chat. This framework allows tool use and human participation through multi-agent conversation.\n",
    "Please find documentation about this feature [here](https://microsoft.github.io/autogen/docs/Use-Cases/agent_chat).\n",
    "\n",
    "````{=mdx}\n",
    ":::info Requirements\n",
    "Some extra dependencies are needed for this notebook, which can be installed via pip:\n",
    "\n",
    "```bash\n",
    "pip install pyautogen[retrievechat]\n",
    "```\n",
    "\n",
    "For more information, please refer to the [installation guide](/docs/installation/).\n",
    ":::\n",
    "````"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set your API Endpoint\n",
    "\n",
    "The [`config_list_from_json`](https://microsoft.github.io/autogen/docs/reference/oai/openai_utils#config_list_from_json) function loads a list of configurations from an environment variable or a json file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LLM models:  ['gpt-35-turbo', 'gpt-35-turbo-0613']\n"
     ]
    }
   ],
   "source": [
    "from typing_extensions import Annotated\n",
    "import chromadb\n",
    "\n",
    "import autogen\n",
    "from autogen import AssistantAgent\n",
    "from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent\n",
    "\n",
    "config_list = autogen.config_list_from_json(\"OAI_CONFIG_LIST\")\n",
    "\n",
    "print(\"LLM models: \", [config_list[i][\"model\"] for i in range(len(config_list))])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "````{=mdx}\n",
    ":::tip\n",
    "Learn more about configuring LLMs for agents [here](/docs/llm_configuration).\n",
    ":::\n",
    "````\n",
    "\n",
    "## Construct Agents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm_config = {\n",
    "    \"timeout\": 60,\n",
    "    \"temperature\": 0,\n",
    "    \"config_list\": config_list,\n",
    "}\n",
    "\n",
    "\n",
    "def termination_msg(x):\n",
    "    return isinstance(x, dict) and \"TERMINATE\" == str(x.get(\"content\", \"\"))[-9:].upper()\n",
    "\n",
    "\n",
    "boss = autogen.UserProxyAgent(\n",
    "    name=\"Boss\",\n",
    "    is_termination_msg=termination_msg,\n",
    "    human_input_mode=\"NEVER\",\n",
    "    code_execution_config=False,  # we don't want to execute code in this case.\n",
    "    default_auto_reply=\"Reply `TERMINATE` if the task is done.\",\n",
    "    description=\"The boss who ask questions and give tasks.\",\n",
    ")\n",
    "\n",
    "boss_aid = RetrieveUserProxyAgent(\n",
    "    name=\"Boss_Assistant\",\n",
    "    is_termination_msg=termination_msg,\n",
    "    human_input_mode=\"NEVER\",\n",
    "    max_consecutive_auto_reply=3,\n",
    "    retrieve_config={\n",
    "        \"task\": \"code\",\n",
    "        \"docs_path\": \"https://raw.githubusercontent.com/microsoft/FLAML/main/website/docs/Examples/Integrate%20-%20Spark.md\",\n",
    "        \"chunk_token_size\": 1000,\n",
    "        \"model\": config_list[0][\"model\"],\n",
    "        \"client\": chromadb.PersistentClient(path=\"/tmp/chromadb\"),\n",
    "        \"collection_name\": \"groupchat\",\n",
    "        \"get_or_create\": True,\n",
    "    },\n",
    "    code_execution_config=False,  # we don't want to execute code in this case.\n",
    "    description=\"Assistant who has extra content retrieval power for solving difficult problems.\",\n",
    ")\n",
    "\n",
    "coder = AssistantAgent(\n",
    "    name=\"Senior_Python_Engineer\",\n",
    "    is_termination_msg=termination_msg,\n",
    "    system_message=\"You are a senior python engineer, you provide python code to answer questions. Reply `TERMINATE` in the end when everything is done.\",\n",
    "    llm_config=llm_config,\n",
    "    description=\"Senior Python Engineer who can write code to solve problems and answer questions.\",\n",
    ")\n",
    "\n",
    "pm = autogen.AssistantAgent(\n",
    "    name=\"Product_Manager\",\n",
    "    is_termination_msg=termination_msg,\n",
    "    system_message=\"You are a product manager. Reply `TERMINATE` in the end when everything is done.\",\n",
    "    llm_config=llm_config,\n",
    "    description=\"Product Manager who can design and plan the project.\",\n",
    ")\n",
    "\n",
    "reviewer = autogen.AssistantAgent(\n",
    "    name=\"Code_Reviewer\",\n",
    "    is_termination_msg=termination_msg,\n",
    "    system_message=\"You are a code reviewer. Reply `TERMINATE` in the end when everything is done.\",\n",
    "    llm_config=llm_config,\n",
    "    description=\"Code Reviewer who can review the code.\",\n",
    ")\n",
    "\n",
    "PROBLEM = \"How to use spark for parallel training in FLAML? Give me sample code.\"\n",
    "\n",
    "\n",
    "def _reset_agents():\n",
    "    boss.reset()\n",
    "    boss_aid.reset()\n",
    "    coder.reset()\n",
    "    pm.reset()\n",
    "    reviewer.reset()\n",
    "\n",
    "\n",
    "def rag_chat():\n",
    "    _reset_agents()\n",
    "    groupchat = autogen.GroupChat(\n",
    "        agents=[boss_aid, pm, coder, reviewer], messages=[], max_round=12, speaker_selection_method=\"round_robin\"\n",
    "    )\n",
    "    manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)\n",
    "\n",
    "    # Start chatting with boss_aid as this is the user proxy agent.\n",
    "    boss_aid.initiate_chat(\n",
    "        manager,\n",
    "        problem=PROBLEM,\n",
    "        n_results=3,\n",
    "    )\n",
    "\n",
    "\n",
    "def norag_chat():\n",
    "    _reset_agents()\n",
    "    groupchat = autogen.GroupChat(\n",
    "        agents=[boss, pm, coder, reviewer],\n",
    "        messages=[],\n",
    "        max_round=12,\n",
    "        speaker_selection_method=\"auto\",\n",
    "        allow_repeat_speaker=False,\n",
    "    )\n",
    "    manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)\n",
    "\n",
    "    # Start chatting with the boss as this is the user proxy agent.\n",
    "    boss.initiate_chat(\n",
    "        manager,\n",
    "        message=PROBLEM,\n",
    "    )\n",
    "\n",
    "\n",
    "def call_rag_chat():\n",
    "    _reset_agents()\n",
    "\n",
    "    # In this case, we will have multiple user proxy agents and we don't initiate the chat\n",
    "    # with RAG user proxy agent.\n",
    "    # In order to use RAG user proxy agent, we need to wrap RAG agents in a function and call\n",
    "    # it from other agents.\n",
    "    def retrieve_content(\n",
    "        message: Annotated[\n",
    "            str,\n",
    "            \"Refined message which keeps the original meaning and can be used to retrieve content for code generation and question answering.\",\n",
    "        ],\n",
    "        n_results: Annotated[int, \"number of results\"] = 3,\n",
    "    ) -> str:\n",
    "        boss_aid.n_results = n_results  # Set the number of results to be retrieved.\n",
    "        # Check if we need to update the context.\n",
    "        update_context_case1, update_context_case2 = boss_aid._check_update_context(message)\n",
    "        if (update_context_case1 or update_context_case2) and boss_aid.update_context:\n",
    "            boss_aid.problem = message if not hasattr(boss_aid, \"problem\") else boss_aid.problem\n",
    "            _, ret_msg = boss_aid._generate_retrieve_user_reply(message)\n",
    "        else:\n",
    "            ret_msg = boss_aid.generate_init_message(message, n_results=n_results)\n",
    "        return ret_msg if ret_msg else message\n",
    "\n",
    "    boss_aid.human_input_mode = \"NEVER\"  # Disable human input for boss_aid since it only retrieves content.\n",
    "\n",
    "    for caller in [pm, coder, reviewer]:\n",
    "        d_retrieve_content = caller.register_for_llm(\n",
    "            description=\"retrieve content for code generation and question answering.\", api_style=\"function\"\n",
    "        )(retrieve_content)\n",
    "\n",
    "    for executor in [boss, pm]:\n",
    "        executor.register_for_execution()(d_retrieve_content)\n",
    "\n",
    "    groupchat = autogen.GroupChat(\n",
    "        agents=[boss, pm, coder, reviewer],\n",
    "        messages=[],\n",
    "        max_round=12,\n",
    "        speaker_selection_method=\"round_robin\",\n",
    "        allow_repeat_speaker=False,\n",
    "    )\n",
    "\n",
    "    manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)\n",
    "\n",
    "    # Start chatting with the boss as this is the user proxy agent.\n",
    "    boss.initiate_chat(\n",
    "        manager,\n",
    "        message=PROBLEM,\n",
    "    )"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Start Chat\n",
    "\n",
    "### UserProxyAgent doesn't get the correct code\n",
    "[FLAML](https://github.com/microsoft/FLAML) was open sourced in 2020, so ChatGPT is familiar with it. However, Spark-related APIs were added in 2022, so they were not in ChatGPT's training data. As a result, we end up with invalid code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mBoss\u001b[0m (to chat_manager):\n",
      "\n",
      "How to use spark for parallel training in FLAML? Give me sample code.\n",
      "\n",
      "--------------------------------------------------------------------------------\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mSenior_Python_Engineer\u001b[0m (to chat_manager):\n",
      "\n",
      "To use Spark for parallel training in FLAML, you need to set up a Spark cluster and configure FLAML to use Spark as the backend. Here's a sample code to demonstrate how to use Spark for parallel training in FLAML:\n",
      "\n",
      "```python\n",
      "from flaml import AutoML\n",
      "from pyspark.sql import SparkSession\n",
      "\n",
      "# Create a Spark session\n",
      "spark = SparkSession.builder \\\n",
      "    .appName(\"FLAML with Spark\") \\\n",
      "    .getOrCreate()\n",
      "\n",
      "# Load your data into a Spark DataFrame\n",
      "data = spark.read.format(\"csv\").option(\"header\", \"true\").load(\"your_data.csv\")\n",
      "\n",
      "# Initialize FLAML with Spark backend\n",
      "automl = AutoML()\n",
      "automl.initialize(spark=spark)\n",
      "\n",
      "# Specify the search space and other settings\n",
      "settings = {\n",
      "    \"time_budget\": 60,  # total time in seconds\n",
      "    \"metric\": 'accuracy',\n",
      "    \"task\": 'classification',\n",
      "    \"log_file_name\": 'flaml.log',\n",
      "}\n",
      "\n",
      "# Train and tune the model using FLAML\n",
      "automl.fit(data=data, **settings)\n",
      "\n",
      "# Get the best model and its hyperparameters\n",
      "best_model = automl.best_model\n",
      "best_config = automl.best_config\n",
      "\n",
      "# Print the best model and its hyperparameters\n",
      "print(\"Best model:\", best_model)\n",
      "print(\"Best hyperparameters:\", best_config)\n",
      "\n",
      "# Terminate the Spark session\n",
      "spark.stop()\n",
      "```\n",
      "\n",
      "Make sure to replace `\"your_data.csv\"` with the path to your actual data file. Adjust the `settings` dictionary according to your requirements.\n",
      "\n",
      "This code initializes a Spark session, loads the data into a Spark DataFrame, and then uses FLAML's `AutoML` class to train and tune a model in parallel using Spark. Finally, it prints the best model and its hyperparameters.\n",
      "\n",
      "Remember to install FLAML and PySpark before running this code.\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "To use Spark for parallel training in FLAML, you need to set up a Spark cluster and configure FLAML to use Spark as the backend. Here's a sample code to demonstrate how to use Spark for parallel training in FLAML:\n",
      "\n",
      "```python\n",
      "from flaml import AutoML\n",
      "from pyspark.sql import SparkSession\n",
      "\n",
      "# Create a Spark session\n",
      "spark = SparkSession.builder \\\n",
      "    .appName(\"FLAML with Spark\") \\\n",
      "    .getOrCreate()\n",
      "\n",
      "# Load your data into a Spark DataFrame\n",
      "data = spark.read.format(\"csv\").option(\"header\", \"true\").load(\"your_data.csv\")\n",
      "\n",
      "# Initialize FLAML with Spark backend\n",
      "automl = AutoML()\n",
      "automl.initialize(spark=spark)\n",
      "\n",
      "# Specify the search space and other settings\n",
      "settings = {\n",
      "    \"time_budget\": 60,  # total time in seconds\n",
      "    \"metric\": 'accuracy',\n",
      "    \"task\": 'classification',\n",
      "    \"log_file_name\": 'flaml.log',\n",
      "}\n",
      "\n",
      "# Train and tune the model using FLAML\n",
      "automl.fit(data=data, **settings)\n",
      "\n",
      "# Get the best model and its hyperparameters\n",
      "best_model = automl.best_model\n",
      "best_config = automl.best_config\n",
      "\n",
      "# Print the best model and its hyperparameters\n",
      "print(\"Best model:\", best_model)\n",
      "print(\"Best hyperparameters:\", best_config)\n",
      "\n",
      "# Terminate the Spark session\n",
      "spark.stop()\n",
      "```\n",
      "\n",
      "Make sure to replace `\"your_data.csv\"` with the path to your actual data file. Adjust the `settings` dictionary according to your requirements.\n",
      "\n",
      "This code initializes a Spark session, loads the data into a Spark DataFrame, and then uses FLAML's `AutoML` class to train and tune a model in parallel using Spark. Finally, it prints the best model and its hyperparameters.\n",
      "\n",
      "Remember to install FLAML and PySpark before running this code.\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[33mCode_Reviewer\u001b[0m (to chat_manager):\n",
      "\n",
      "Looks good to me! The code demonstrates how to use Spark for parallel training in FLAML. It initializes a Spark session, loads the data into a Spark DataFrame, and then uses FLAML's `AutoML` class to train and tune a model in parallel using Spark. Finally, it prints the best model and its hyperparameters. Just make sure to replace `\"your_data.csv\"` with the actual path to the data file and adjust the `settings` dictionary as needed. \n",
      "\n",
      "If there are no further questions, I will terminate.\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[33mProduct_Manager\u001b[0m (to chat_manager):\n",
      "\n",
      "TERMINATE\n",
      "\n",
      "--------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "norag_chat()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### RetrieveUserProxyAgent get the correct code\n",
    "Since RetrieveUserProxyAgent can perform retrieval-augmented generation based on the given documentation file, ChatGPT can generate the correct code for us!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "doc_ids:  [['doc_0', 'doc_1', 'doc_122']]\n",
      "\u001b[32mAdding doc_id doc_0 to context.\u001b[0m\n",
      "\u001b[32mAdding doc_id doc_1 to context.\u001b[0m\n",
      "\u001b[32mAdding doc_id doc_122 to context.\u001b[0m\n",
      "\u001b[33mBoss_Assistant\u001b[0m (to chat_manager):\n",
      "\n",
      "You're a retrieve augmented coding assistant. You answer user's questions based on your own knowledge and the\n",
      "context provided by the user.\n",
      "If you can't answer the question with or without the current context, you should reply exactly `UPDATE CONTEXT`.\n",
      "For code generation, you must obey the following rules:\n",
      "Rule 1. You MUST NOT install any packages because all the packages needed are already installed.\n",
      "Rule 2. You must follow the formats below to write your code:\n",
      "```language\n",
      "# your code\n",
      "```\n",
      "\n",
      "User's question is: How to use spark for parallel training in FLAML? Give me sample code.\n",
      "\n",
      "Context is: # Integrate - Spark\n",
      "\n",
      "FLAML has integrated Spark for distributed training. There are two main aspects of integration with Spark:\n",
      "- Use Spark ML estimators for AutoML.\n",
      "- Use Spark to run training in parallel spark jobs.\n",
      "\n",
      "## Spark ML Estimators\n",
      "\n",
      "FLAML integrates estimators based on Spark ML models. These models are trained in parallel using Spark, so we called them Spark estimators. To use these models, you first need to organize your data in the required format.\n",
      "\n",
      "### Data\n",
      "\n",
      "For Spark estimators, AutoML only consumes Spark data. FLAML provides a convenient function `to_pandas_on_spark` in the `flaml.automl.spark.utils` module to convert your data into a pandas-on-spark (`pyspark.pandas`) dataframe/series, which Spark estimators require.\n",
      "\n",
      "This utility function takes data in the form of a `pandas.Dataframe` or `pyspark.sql.Dataframe` and converts it into a pandas-on-spark dataframe. It also takes `pandas.Series` or `pyspark.sql.Dataframe` and converts it into a [pandas-on-spark](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html) series. If you pass in a `pyspark.pandas.Dataframe`, it will not make any changes.\n",
      "\n",
      "This function also accepts optional arguments `index_col` and `default_index_type`.\n",
      "- `index_col` is the column name to use as the index, default is None.\n",
      "- `default_index_type` is the default index type, default is \"distributed-sequence\". More info about default index type could be found on Spark official [documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type)\n",
      "\n",
      "Here is an example code snippet for Spark Data:\n",
      "\n",
      "```python\n",
      "import pandas as pd\n",
      "from flaml.automl.spark.utils import to_pandas_on_spark\n",
      "# Creating a dictionary\n",
      "data = {\"Square_Feet\": [800, 1200, 1800, 1500, 850],\n",
      "      \"Age_Years\": [20, 15, 10, 7, 25],\n",
      "      \"Price\": [100000, 200000, 300000, 240000, 120000]}\n",
      "\n",
      "# Creating a pandas DataFrame\n",
      "dataframe = pd.DataFrame(data)\n",
      "label = \"Price\"\n",
      "\n",
      "# Convert to pandas-on-spark dataframe\n",
      "psdf = to_pandas_on_spark(dataframe)\n",
      "```\n",
      "\n",
      "To use Spark ML models you need to format your data appropriately. Specifically, use [`VectorAssembler`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html) to merge all feature columns into a single vector column.\n",
      "\n",
      "Here is an example of how to use it:\n",
      "```python\n",
      "from pyspark.ml.feature import VectorAssembler\n",
      "columns = psdf.columns\n",
      "feature_cols = [col for col in columns if col != label]\n",
      "featurizer = VectorAssembler(inputCols=feature_cols, outputCol=\"features\")\n",
      "psdf = featurizer.transform(psdf.to_spark(index_col=\"index\"))[\"index\", \"features\"]\n",
      "```\n",
      "\n",
      "Later in conducting the experiment, use your pandas-on-spark data like non-spark data and pass them using `X_train, y_train` or `dataframe, label`.\n",
      "\n",
      "### Estimators\n",
      "#### Model List\n",
      "- `lgbm_spark`: The class for fine-tuning Spark version LightGBM models, using [SynapseML](https://microsoft.github.io/SynapseML/docs/features/lightgbm/about/) API.\n",
      "\n",
      "#### Usage\n",
      "First, prepare your data in the required format as described in the previous section.\n",
      "\n",
      "By including the models you intend to try in the `estimators_list` argument to `flaml.automl`, FLAML will start trying configurations for these models. If your input is Spark data, FLAML will also use estimators with the `_spark` postfix by default, even if you haven't specified them.\n",
      "\n",
      "Here is an example code snippet using SparkML models in AutoML:\n",
      "\n",
      "```python\n",
      "import flaml\n",
      "# prepare your data in pandas-on-spark format as we previously mentioned\n",
      "\n",
      "automl = flaml.AutoML()\n",
      "settings = {\n",
      "    \"time_budget\": 30,\n",
      "    \"metric\": \"r2\",\n",
      "    \"estimator_list\": [\"lgbm_spark\"],  # this setting is optional\n",
      "    \"task\": \"regression\",\n",
      "}\n",
      "\n",
      "automl.fit(\n",
      "    dataframe=psdf,\n",
      "    label=label,\n",
      "    **settings,\n",
      ")\n",
      "```\n",
      "\n",
      "\n",
      "[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/automl_bankrupt_synapseml.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/automl_bankrupt_synapseml.ipynb)\n",
      "\n",
      "## Parallel Spark Jobs\n",
      "You can activate Spark as the parallel backend during parallel tuning in both [AutoML](/docs/Use-Cases/Task-Oriented-AutoML#parallel-tuning) and [Hyperparameter Tuning](/docs/Use-Cases/Tune-User-Defined-Function#parallel-tuning), by setting the `use_spark` to `true`. FLAML will dispatch your job to the distributed Spark backend using [`joblib-spark`](https://github.com/joblib/joblib-spark).\n",
      "\n",
      "Please note that you should not set `use_spark` to `true` when applying AutoML and Tuning for Spark Data. This is because only SparkML models will be used for Spark Data in AutoML and Tuning. As SparkML models run in parallel, there is no need to distribute them with `use_spark` again.\n",
      "\n",
      "All the Spark-related arguments are stated below. These arguments are available in both Hyperparameter Tuning and AutoML:\n",
      "\n",
      "\n",
      "- `use_spark`: boolean, default=False | Whether to use spark to run the training in parallel spark jobs. This can be used to accelerate training on large models and large datasets, but will incur more overhead in time and thus slow down training in some cases. GPU training is not supported yet when use_spark is True. For Spark clusters, by default, we will launch one trial per executor. However, sometimes we want to launch more trials than the number of executors (e.g., local mode). In this case, we can set the environment variable `FLAML_MAX_CONCURRENT` to override the detected `num_executors`. The final number of concurrent trials will be the minimum of `n_concurrent_trials` and `num_executors`.\n",
      "- `n_concurrent_trials`: int, default=1 | The number of concurrent trials. When n_concurrent_trials > 1, FLAML performes parallel tuning.\n",
      "- `force_cancel`: boolean, default=False | Whether to forcely cancel Spark jobs if the search time exceeded the time budget. Spark jobs include parallel tuning jobs and Spark-based model training jobs.\n",
      "\n",
      "An example code snippet for using parallel Spark jobs:\n",
      "```python\n",
      "import flaml\n",
      "automl_experiment = flaml.AutoML()\n",
      "automl_settings = {\n",
      "    \"time_budget\": 30,\n",
      "    \"metric\": \"r2\",\n",
      "    \"task\": \"regression\",\n",
      "    \"n_concurrent_trials\": 2,\n",
      "    \"use_spark\": True,\n",
      "    \"force_cancel\": True, # Activating the force_cancel option can immediately halt Spark jobs once they exceed the allocated time_budget.\n",
      "}\n",
      "\n",
      "automl.fit(\n",
      "    dataframe=dataframe,\n",
      "    label=label,\n",
      "    **automl_settings,\n",
      ")\n",
      "```\n",
      "\n",
      "\n",
      "[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/integrate_spark.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/integrate_spark.ipynb)\n",
      "\n",
      "2684,4/26/2011,2,0,4,17,0,2,1,1,0.68,0.6364,0.61,0.3582,521\n",
      "2685,4/26/2011,2,0,4,18,0,2,1,1,0.68,0.6364,0.65,0.4478,528\n",
      "2686,4/26/2011,2,0,4,19,0,2,1,1,0.64,0.6061,0.73,0.4179,328\n",
      "2687,4/26/2011,2,0,4,20,0,2,1,1,0.64,0.6061,0.73,0.3582,234\n",
      "2688,4/26/2011,2,0,4,21,0,2,1,1,0.62,0.5909,0.78,0.2836,195\n",
      "2689,4/26/2011,2,0,4,22,0,2,1,2,0.6,0.5606,0.83,0.194,148\n",
      "2690,4/26/2011,2,0,4,23,0,2,1,2,0.6,0.5606,0.83,0.2239,78\n",
      "2691,4/27/2011,2,0,4,0,0,3,1,1,0.6,0.5606,0.83,0.2239,27\n",
      "2692,4/27/2011,2,0,4,1,0,3,1,1,0.6,0.5606,0.83,0.2537,17\n",
      "2693,4/27/2011,2,0,4,2,0,3,1,1,0.58,0.5455,0.88,0.2537,5\n",
      "2694,4/27/2011,2,0,4,3,0,3,1,2,0.58,0.5455,0.88,0.2836,7\n",
      "2695,4/27/2011,2,0,4,4,0,3,1,1,0.56,0.5303,0.94,0.2239,6\n",
      "2696,4/27/2011,2,0,4,5,0,3,1,2,0.56,0.5303,0.94,0.2537,17\n",
      "2697,4/27/2011,2,0,4,6,0,3,1,1,0.56,0.5303,0.94,0.2537,84\n",
      "2698,4/27/2011,2,0,4,7,0,3,1,2,0.58,0.5455,0.88,0.2836,246\n",
      "2699,4/27/2011,2,0,4,8,0,3,1,2,0.58,0.5455,0.88,0.3284,444\n",
      "2700,4/27/2011,2,0,4,9,0,3,1,2,0.6,0.5455,0.88,0.4179,181\n",
      "2701,4/27/2011,2,0,4,10,0,3,1,2,0.62,0.5758,0.83,0.2836,92\n",
      "2702,4/27/2011,2,0,4,11,0,3,1,2,0.64,0.5909,0.78,0.2836,156\n",
      "2703,4/27/2011,2,0,4,12,0,3,1,1,0.66,0.6061,0.78,0.3284,173\n",
      "2704,4/27/2011,2,0,4,13,0,3,1,1,0.64,0.5909,0.78,0.2985,150\n",
      "2705,4/27/2011,2,0,4,14,0,3,1,1,0.68,0.6364,0.74,0.2836,148\n",
      "\n",
      "\n",
      "\n",
      "--------------------------------------------------------------------------------\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mProduct_Manager\u001b[0m (to chat_manager):\n",
      "\n",
      "To use Spark for parallel training in FLAML, you can follow these steps:\n",
      "\n",
      "1. Prepare your data in the required format using the `to_pandas_on_spark` function from the `flaml.automl.spark.utils` module. This function converts your data into a pandas-on-spark dataframe, which is required by Spark estimators. Here is an example code snippet:\n",
      "\n",
      "```python\n",
      "import pandas as pd\n",
      "from flaml.automl.spark.utils import to_pandas_on_spark\n",
      "\n",
      "# Creating a dictionary\n",
      "data = {\n",
      "    \"Square_Feet\": [800, 1200, 1800, 1500, 850],\n",
      "    \"Age_Years\": [20, 15, 10, 7, 25],\n",
      "    \"Price\": [100000, 200000, 300000, 240000, 120000]\n",
      "}\n",
      "\n",
      "# Creating a pandas DataFrame\n",
      "dataframe = pd.DataFrame(data)\n",
      "label = \"Price\"\n",
      "\n",
      "# Convert to pandas-on-spark dataframe\n",
      "psdf = to_pandas_on_spark(dataframe)\n",
      "```\n",
      "\n",
      "2. Format your data appropriately for Spark ML models. Use the `VectorAssembler` from `pyspark.ml.feature` to merge all feature columns into a single vector column. Here is an example:\n",
      "\n",
      "```python\n",
      "from pyspark.ml.feature import VectorAssembler\n",
      "\n",
      "columns = psdf.columns\n",
      "feature_cols = [col for col in columns if col != label]\n",
      "featurizer = VectorAssembler(inputCols=feature_cols, outputCol=\"features\")\n",
      "psdf = featurizer.transform(psdf.to_spark(index_col=\"index\"))[\"index\", \"features\"]\n",
      "```\n",
      "\n",
      "3. Use the Spark ML models in FLAML's AutoML. Include the models you want to try in the `estimator_list` argument to `flaml.AutoML()`. FLAML will start trying configurations for these models. Here is an example code snippet:\n",
      "\n",
      "```python\n",
      "import flaml\n",
      "\n",
      "automl = flaml.AutoML()\n",
      "settings = {\n",
      "    \"time_budget\": 30,\n",
      "    \"metric\": \"r2\",\n",
      "    \"estimator_list\": [\"lgbm_spark\"],\n",
      "    \"task\": \"regression\"\n",
      "}\n",
      "\n",
      "automl.fit(\n",
      "    dataframe=psdf,\n",
      "    label=label,\n",
      "    **settings\n",
      ")\n",
      "```\n",
      "\n",
      "4. To enable parallel Spark jobs during parallel tuning, set the `use_spark` parameter to `True`. FLAML will dispatch your job to the distributed Spark backend using `joblib-spark`. Here is an example code snippet:\n",
      "\n",
      "```python\n",
      "import flaml\n",
      "\n",
      "automl_experiment = flaml.AutoML()\n",
      "automl_settings = {\n",
      "    \"time_budget\": 30,\n",
      "    \"metric\": \"r2\",\n",
      "    \"task\": \"regression\",\n",
      "    \"n_concurrent_trials\": 2,\n",
      "    \"use_spark\": True,\n",
      "    \"force_cancel\": True\n",
      "}\n",
      "\n",
      "automl.fit(\n",
      "    dataframe=dataframe,\n",
      "    label=label,\n",
      "    **automl_settings\n",
      ")\n",
      "```\n",
      "\n",
      "Please note that you should not set `use_spark` to `True` when applying AutoML and Tuning for Spark Data, as SparkML models will be used for Spark Data in AutoML and Tuning.\n",
      "\n",
      "Let me know if you need anything else.\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "To use Spark for parallel training in FLAML, you can follow these steps:\n",
      "\n",
      "1. Prepare your data in the required format using the `to_pandas_on_spark` function from the `flaml.automl.spark.utils` module. This function converts your data into a pandas-on-spark dataframe, which is required by Spark estimators. Here is an example code snippet:\n",
      "\n",
      "```python\n",
      "import pandas as pd\n",
      "from flaml.automl.spark.utils import to_pandas_on_spark\n",
      "\n",
      "# Creating a dictionary\n",
      "data = {\n",
      "    \"Square_Feet\": [800, 1200, 1800, 1500, 850],\n",
      "    \"Age_Years\": [20, 15, 10, 7, 25],\n",
      "    \"Price\": [100000, 200000, 300000, 240000, 120000]\n",
      "}\n",
      "\n",
      "# Creating a pandas DataFrame\n",
      "dataframe = pd.DataFrame(data)\n",
      "label = \"Price\"\n",
      "\n",
      "# Convert to pandas-on-spark dataframe\n",
      "psdf = to_pandas_on_spark(dataframe)\n",
      "```\n",
      "\n",
      "2. Format your data appropriately for Spark ML models. Use the `VectorAssembler` from `pyspark.ml.feature` to merge all feature columns into a single vector column. Here is an example:\n",
      "\n",
      "```python\n",
      "from pyspark.ml.feature import VectorAssembler\n",
      "\n",
      "columns = psdf.columns\n",
      "feature_cols = [col for col in columns if col != label]\n",
      "featurizer = VectorAssembler(inputCols=feature_cols, outputCol=\"features\")\n",
      "psdf = featurizer.transform(psdf.to_spark(index_col=\"index\"))[\"index\", \"features\"]\n",
      "```\n",
      "\n",
      "3. Use the Spark ML models in FLAML's AutoML. Include the models you want to try in the `estimator_list` argument to `flaml.AutoML()`. FLAML will start trying configurations for these models. Here is an example code snippet:\n",
      "\n",
      "```python\n",
      "import flaml\n",
      "\n",
      "automl = flaml.AutoML()\n",
      "settings = {\n",
      "    \"time_budget\": 30,\n",
      "    \"metric\": \"r2\",\n",
      "    \"estimator_list\": [\"lgbm_spark\"],\n",
      "    \"task\": \"regression\"\n",
      "}\n",
      "\n",
      "automl.fit(\n",
      "    dataframe=psdf,\n",
      "    label=label,\n",
      "    **settings\n",
      ")\n",
      "```\n",
      "\n",
      "4. To enable parallel Spark jobs during parallel tuning, set the `use_spark` parameter to `True`. FLAML will dispatch your job to the distributed Spark backend using `joblib-spark`. Here is an example code snippet:\n",
      "\n",
      "```python\n",
      "import flaml\n",
      "\n",
      "automl_experiment = flaml.AutoML()\n",
      "automl_settings = {\n",
      "    \"time_budget\": 30,\n",
      "    \"metric\": \"r2\",\n",
      "    \"task\": \"regression\",\n",
      "    \"n_concurrent_trials\": 2,\n",
      "    \"use_spark\": True,\n",
      "    \"force_cancel\": True\n",
      "}\n",
      "\n",
      "automl.fit(\n",
      "    dataframe=dataframe,\n",
      "    label=label,\n",
      "    **automl_settings\n",
      ")\n",
      "```\n",
      "\n",
      "Please note that you should not set `use_spark` to `True` when applying AutoML and Tuning for Spark Data, as SparkML models will be used for Spark Data in AutoML and Tuning.\n",
      "\n",
      "Let me know if you need anything else.\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[33mSenior_Python_Engineer\u001b[0m (to chat_manager):\n",
      "\n",
      "Here is the sample code to use Spark for parallel training in FLAML:\n",
      "\n",
      "```python\n",
      "import pandas as pd\n",
      "from flaml.automl.spark.utils import to_pandas_on_spark\n",
      "from pyspark.ml.feature import VectorAssembler\n",
      "import flaml\n",
      "\n",
      "# Step 1: Prepare your data in the required format\n",
      "data = {\n",
      "    \"Square_Feet\": [800, 1200, 1800, 1500, 850],\n",
      "    \"Age_Years\": [20, 15, 10, 7, 25],\n",
      "    \"Price\": [100000, 200000, 300000, 240000, 120000]\n",
      "}\n",
      "\n",
      "dataframe = pd.DataFrame(data)\n",
      "label = \"Price\"\n",
      "\n",
      "psdf = to_pandas_on_spark(dataframe)\n",
      "\n",
      "# Step 2: Format your data for Spark ML models\n",
      "columns = psdf.columns\n",
      "feature_cols = [col for col in columns if col != label]\n",
      "featurizer = VectorAssembler(inputCols=feature_cols, outputCol=\"features\")\n",
      "psdf = featurizer.transform(psdf.to_spark(index_col=\"index\"))[\"index\", \"features\"]\n",
      "\n",
      "# Step 3: Use Spark ML models in FLAML's AutoML\n",
      "automl = flaml.AutoML()\n",
      "settings = {\n",
      "    \"time_budget\": 30,\n",
      "    \"metric\": \"r2\",\n",
      "    \"estimator_list\": [\"lgbm_spark\"],\n",
      "    \"task\": \"regression\"\n",
      "}\n",
      "\n",
      "automl.fit(\n",
      "    dataframe=psdf,\n",
      "    label=label,\n",
      "    **settings\n",
      ")\n",
      "\n",
      "# Step 4: Enable parallel Spark jobs during parallel tuning\n",
      "automl_experiment = flaml.AutoML()\n",
      "automl_settings = {\n",
      "    \"time_budget\": 30,\n",
      "    \"metric\": \"r2\",\n",
      "    \"task\": \"regression\",\n",
      "    \"n_concurrent_trials\": 2,\n",
      "    \"use_spark\": True,\n",
      "    \"force_cancel\": True\n",
      "}\n",
      "\n",
      "automl.fit(\n",
      "    dataframe=dataframe,\n",
      "    label=label,\n",
      "    **automl_settings\n",
      ")\n",
      "```\n",
      "\n",
      "Let me know if you need anything else.\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[33mCode_Reviewer\u001b[0m (to chat_manager):\n",
      "\n",
      "The code you provided is correct and follows the guidelines for using Spark for parallel training in FLAML. It includes the necessary steps to prepare the data, format it for Spark ML models, and use Spark ML models in FLAML's AutoML. It also demonstrates how to enable parallel Spark jobs during parallel tuning.\n",
      "\n",
      "Great job! You can now terminate the conversation.\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[33mBoss_Assistant\u001b[0m (to chat_manager):\n",
      "\n",
      "\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[33mProduct_Manager\u001b[0m (to chat_manager):\n",
      "\n",
      "TERMINATE\n",
      "\n",
      "--------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "rag_chat()\n",
    "# type exit to terminate the chat"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Call RetrieveUserProxyAgent while init chat with another user proxy agent\n",
    "Sometimes, there might be a need to use RetrieveUserProxyAgent in group chat without initializing the chat with it. In such scenarios, it becomes essential to create a function that wraps the RAG agents and allows them to be called from other agents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mBoss\u001b[0m (to chat_manager):\n",
      "\n",
      "How to use spark for parallel training in FLAML? Give me sample code.\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "How to use spark for parallel training in FLAML? Give me sample code.\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[33mProduct_Manager\u001b[0m (to chat_manager):\n",
      "\n",
      "\u001b[32m***** Suggested function Call: retrieve_content *****\u001b[0m\n",
      "Arguments: \n",
      "{\n",
      "  \"message\": \"How to use spark for parallel training in FLAML? Give me sample code.\"\n",
      "}\n",
      "\u001b[32m*****************************************************\u001b[0m\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[35m\n",
      ">>>>>>>> EXECUTING FUNCTION retrieve_content...\u001b[0m\n",
      "doc_ids:  [['doc_0', 'doc_1', 'doc_122']]\n",
      "\u001b[32mAdding doc_id doc_0 to context.\u001b[0m\n",
      "\u001b[32mAdding doc_id doc_1 to context.\u001b[0m\n",
      "\u001b[32mAdding doc_id doc_122 to context.\u001b[0m\n",
      "\u001b[33mBoss\u001b[0m (to chat_manager):\n",
      "\n",
      "\u001b[32m***** Response from calling function \"retrieve_content\" *****\u001b[0m\n",
      "You're a retrieve augmented coding assistant. You answer user's questions based on your own knowledge and the\n",
      "context provided by the user.\n",
      "If you can't answer the question with or without the current context, you should reply exactly `UPDATE CONTEXT`.\n",
      "For code generation, you must obey the following rules:\n",
      "Rule 1. You MUST NOT install any packages because all the packages needed are already installed.\n",
      "Rule 2. You must follow the formats below to write your code:\n",
      "```language\n",
      "# your code\n",
      "```\n",
      "\n",
      "User's question is: How to use spark for parallel training in FLAML? Give me sample code.\n",
      "\n",
      "Context is: # Integrate - Spark\n",
      "\n",
      "FLAML has integrated Spark for distributed training. There are two main aspects of integration with Spark:\n",
      "- Use Spark ML estimators for AutoML.\n",
      "- Use Spark to run training in parallel spark jobs.\n",
      "\n",
      "## Spark ML Estimators\n",
      "\n",
      "FLAML integrates estimators based on Spark ML models. These models are trained in parallel using Spark, so we called them Spark estimators. To use these models, you first need to organize your data in the required format.\n",
      "\n",
      "### Data\n",
      "\n",
      "For Spark estimators, AutoML only consumes Spark data. FLAML provides a convenient function `to_pandas_on_spark` in the `flaml.automl.spark.utils` module to convert your data into a pandas-on-spark (`pyspark.pandas`) dataframe/series, which Spark estimators require.\n",
      "\n",
      "This utility function takes data in the form of a `pandas.Dataframe` or `pyspark.sql.Dataframe` and converts it into a pandas-on-spark dataframe. It also takes `pandas.Series` or `pyspark.sql.Dataframe` and converts it into a [pandas-on-spark](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html) series. If you pass in a `pyspark.pandas.Dataframe`, it will not make any changes.\n",
      "\n",
      "This function also accepts optional arguments `index_col` and `default_index_type`.\n",
      "- `index_col` is the column name to use as the index, default is None.\n",
      "- `default_index_type` is the default index type, default is \"distributed-sequence\". More info about default index type could be found on Spark official [documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type)\n",
      "\n",
      "Here is an example code snippet for Spark Data:\n",
      "\n",
      "```python\n",
      "import pandas as pd\n",
      "from flaml.automl.spark.utils import to_pandas_on_spark\n",
      "# Creating a dictionary\n",
      "data = {\"Square_Feet\": [800, 1200, 1800, 1500, 850],\n",
      "      \"Age_Years\": [20, 15, 10, 7, 25],\n",
      "      \"Price\": [100000, 200000, 300000, 240000, 120000]}\n",
      "\n",
      "# Creating a pandas DataFrame\n",
      "dataframe = pd.DataFrame(data)\n",
      "label = \"Price\"\n",
      "\n",
      "# Convert to pandas-on-spark dataframe\n",
      "psdf = to_pandas_on_spark(dataframe)\n",
      "```\n",
      "\n",
      "To use Spark ML models you need to format your data appropriately. Specifically, use [`VectorAssembler`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html) to merge all feature columns into a single vector column.\n",
      "\n",
      "Here is an example of how to use it:\n",
      "```python\n",
      "from pyspark.ml.feature import VectorAssembler\n",
      "columns = psdf.columns\n",
      "feature_cols = [col for col in columns if col != label]\n",
      "featurizer = VectorAssembler(inputCols=feature_cols, outputCol=\"features\")\n",
      "psdf = featurizer.transform(psdf.to_spark(index_col=\"index\"))[\"index\", \"features\"]\n",
      "```\n",
      "\n",
      "Later in conducting the experiment, use your pandas-on-spark data like non-spark data and pass them using `X_train, y_train` or `dataframe, label`.\n",
      "\n",
      "### Estimators\n",
      "#### Model List\n",
      "- `lgbm_spark`: The class for fine-tuning Spark version LightGBM models, using [SynapseML](https://microsoft.github.io/SynapseML/docs/features/lightgbm/about/) API.\n",
      "\n",
      "#### Usage\n",
      "First, prepare your data in the required format as described in the previous section.\n",
      "\n",
      "By including the models you intend to try in the `estimators_list` argument to `flaml.automl`, FLAML will start trying configurations for these models. If your input is Spark data, FLAML will also use estimators with the `_spark` postfix by default, even if you haven't specified them.\n",
      "\n",
      "Here is an example code snippet using SparkML models in AutoML:\n",
      "\n",
      "```python\n",
      "import flaml\n",
      "# prepare your data in pandas-on-spark format as we previously mentioned\n",
      "\n",
      "automl = flaml.AutoML()\n",
      "settings = {\n",
      "    \"time_budget\": 30,\n",
      "    \"metric\": \"r2\",\n",
      "    \"estimator_list\": [\"lgbm_spark\"],  # this setting is optional\n",
      "    \"task\": \"regression\",\n",
      "}\n",
      "\n",
      "automl.fit(\n",
      "    dataframe=psdf,\n",
      "    label=label,\n",
      "    **settings,\n",
      ")\n",
      "```\n",
      "\n",
      "\n",
      "[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/automl_bankrupt_synapseml.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/automl_bankrupt_synapseml.ipynb)\n",
      "\n",
      "## Parallel Spark Jobs\n",
      "You can activate Spark as the parallel backend during parallel tuning in both [AutoML](/docs/Use-Cases/Task-Oriented-AutoML#parallel-tuning) and [Hyperparameter Tuning](/docs/Use-Cases/Tune-User-Defined-Function#parallel-tuning), by setting the `use_spark` to `true`. FLAML will dispatch your job to the distributed Spark backend using [`joblib-spark`](https://github.com/joblib/joblib-spark).\n",
      "\n",
      "Please note that you should not set `use_spark` to `true` when applying AutoML and Tuning for Spark Data. This is because only SparkML models will be used for Spark Data in AutoML and Tuning. As SparkML models run in parallel, there is no need to distribute them with `use_spark` again.\n",
      "\n",
      "All the Spark-related arguments are stated below. These arguments are available in both Hyperparameter Tuning and AutoML:\n",
      "\n",
      "\n",
      "- `use_spark`: boolean, default=False | Whether to use spark to run the training in parallel spark jobs. This can be used to accelerate training on large models and large datasets, but will incur more overhead in time and thus slow down training in some cases. GPU training is not supported yet when use_spark is True. For Spark clusters, by default, we will launch one trial per executor. However, sometimes we want to launch more trials than the number of executors (e.g., local mode). In this case, we can set the environment variable `FLAML_MAX_CONCURRENT` to override the detected `num_executors`. The final number of concurrent trials will be the minimum of `n_concurrent_trials` and `num_executors`.\n",
      "- `n_concurrent_trials`: int, default=1 | The number of concurrent trials. When n_concurrent_trials > 1, FLAML performes parallel tuning.\n",
      "- `force_cancel`: boolean, default=False | Whether to forcely cancel Spark jobs if the search time exceeded the time budget. Spark jobs include parallel tuning jobs and Spark-based model training jobs.\n",
      "\n",
      "An example code snippet for using parallel Spark jobs:\n",
      "```python\n",
      "import flaml\n",
      "automl_experiment = flaml.AutoML()\n",
      "automl_settings = {\n",
      "    \"time_budget\": 30,\n",
      "    \"metric\": \"r2\",\n",
      "    \"task\": \"regression\",\n",
      "    \"n_concurrent_trials\": 2,\n",
      "    \"use_spark\": True,\n",
      "    \"force_cancel\": True, # Activating the force_cancel option can immediately halt Spark jobs once they exceed the allocated time_budget.\n",
      "}\n",
      "\n",
      "automl.fit(\n",
      "    dataframe=dataframe,\n",
      "    label=label,\n",
      "    **automl_settings,\n",
      ")\n",
      "```\n",
      "\n",
      "\n",
      "[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/integrate_spark.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/integrate_spark.ipynb)\n",
      "\n",
      "2684,4/26/2011,2,0,4,17,0,2,1,1,0.68,0.6364,0.61,0.3582,521\n",
      "2685,4/26/2011,2,0,4,18,0,2,1,1,0.68,0.6364,0.65,0.4478,528\n",
      "2686,4/26/2011,2,0,4,19,0,2,1,1,0.64,0.6061,0.73,0.4179,328\n",
      "2687,4/26/2011,2,0,4,20,0,2,1,1,0.64,0.6061,0.73,0.3582,234\n",
      "2688,4/26/2011,2,0,4,21,0,2,1,1,0.62,0.5909,0.78,0.2836,195\n",
      "2689,4/26/2011,2,0,4,22,0,2,1,2,0.6,0.5606,0.83,0.194,148\n",
      "2690,4/26/2011,2,0,4,23,0,2,1,2,0.6,0.5606,0.83,0.2239,78\n",
      "2691,4/27/2011,2,0,4,0,0,3,1,1,0.6,0.5606,0.83,0.2239,27\n",
      "2692,4/27/2011,2,0,4,1,0,3,1,1,0.6,0.5606,0.83,0.2537,17\n",
      "2693,4/27/2011,2,0,4,2,0,3,1,1,0.58,0.5455,0.88,0.2537,5\n",
      "2694,4/27/2011,2,0,4,3,0,3,1,2,0.58,0.5455,0.88,0.2836,7\n",
      "2695,4/27/2011,2,0,4,4,0,3,1,1,0.56,0.5303,0.94,0.2239,6\n",
      "2696,4/27/2011,2,0,4,5,0,3,1,2,0.56,0.5303,0.94,0.2537,17\n",
      "2697,4/27/2011,2,0,4,6,0,3,1,1,0.56,0.5303,0.94,0.2537,84\n",
      "2698,4/27/2011,2,0,4,7,0,3,1,2,0.58,0.5455,0.88,0.2836,246\n",
      "2699,4/27/2011,2,0,4,8,0,3,1,2,0.58,0.5455,0.88,0.3284,444\n",
      "2700,4/27/2011,2,0,4,9,0,3,1,2,0.6,0.5455,0.88,0.4179,181\n",
      "2701,4/27/2011,2,0,4,10,0,3,1,2,0.62,0.5758,0.83,0.2836,92\n",
      "2702,4/27/2011,2,0,4,11,0,3,1,2,0.64,0.5909,0.78,0.2836,156\n",
      "2703,4/27/2011,2,0,4,12,0,3,1,1,0.66,0.6061,0.78,0.3284,173\n",
      "2704,4/27/2011,2,0,4,13,0,3,1,1,0.64,0.5909,0.78,0.2985,150\n",
      "2705,4/27/2011,2,0,4,14,0,3,1,1,0.68,0.6364,0.74,0.2836,148\n",
      "\n",
      "\n",
      "\u001b[32m*************************************************************\u001b[0m\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[33mProduct_Manager\u001b[0m (to chat_manager):\n",
      "\n",
      "To use Spark for parallel training in FLAML, you can follow these steps:\n",
      "\n",
      "1. Prepare your data in the required format using Spark data. You can use the `to_pandas_on_spark` function from the `flaml.automl.spark.utils` module to convert your data into a pandas-on-spark dataframe.\n",
      "\n",
      "```python\n",
      "import pandas as pd\n",
      "from flaml.automl.spark.utils import to_pandas_on_spark\n",
      "\n",
      "# Creating a dictionary\n",
      "data = {\n",
      "    \"Square_Feet\": [800, 1200, 1800, 1500, 850],\n",
      "    \"Age_Years\": [20, 15, 10, 7, 25],\n",
      "    \"Price\": [100000, 200000, 300000, 240000, 120000]\n",
      "}\n",
      "\n",
      "# Creating a pandas DataFrame\n",
      "dataframe = pd.DataFrame(data)\n",
      "label = \"Price\"\n",
      "\n",
      "# Convert to pandas-on-spark dataframe\n",
      "psdf = to_pandas_on_spark(dataframe)\n",
      "```\n",
      "\n",
      "2. Use the Spark ML estimators provided by FLAML. You can include the models you want to try in the `estimator_list` argument of the `flaml.AutoML` class. FLAML will start trying configurations for these models.\n",
      "\n",
      "```python\n",
      "import flaml\n",
      "\n",
      "automl = flaml.AutoML()\n",
      "settings = {\n",
      "    \"time_budget\": 30,\n",
      "    \"metric\": \"r2\",\n",
      "    \"estimator_list\": [\"lgbm_spark\"],  # Optional: specify the Spark estimator\n",
      "    \"task\": \"regression\",\n",
      "}\n",
      "\n",
      "automl.fit(\n",
      "    dataframe=psdf,\n",
      "    label=label,\n",
      "    **settings,\n",
      ")\n",
      "```\n",
      "\n",
      "3. Enable parallel Spark jobs by setting the `use_spark` parameter to `True` in the `fit` method. This will dispatch the job to the distributed Spark backend using `joblib-spark`.\n",
      "\n",
      "```python\n",
      "automl.fit(\n",
      "    dataframe=psdf,\n",
      "    label=label,\n",
      "    use_spark=True,\n",
      ")\n",
      "```\n",
      "\n",
      "Note: Make sure you have Spark installed and configured properly before running the code.\n",
      "\n",
      "Please let me know if you need any further assistance.\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[33mSenior_Python_Engineer\u001b[0m (to chat_manager):\n",
      "\n",
      "TERMINATE\n",
      "\n",
      "--------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "call_rag_chat()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "flaml",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}