autogen/notebook/flaml_forecast.ipynb
Chi Wang 72caa2172d
model_history, ITER_HP, settings in AutoML(), checkpoint bug fix (#283)
if save_best_model_per_estimator is False and retrain_final is True, unfit the model after evaluation in HPO.
retrain if using ray.
update ITER_HP in config after a trial is finished.
change prophet logging level.
example and notebook update.
allow settings to be passed to AutoML constructor (#192, #277): the automl settings can be passed to the constructor instead of requiring a derived class.
remove model_history.
checkpoint bug fix.

* model_history meaning save_best_model_per_estimator

* ITER_HP

* example update

* prophet logging level

* comment update in forecast notebook

* print format improvement

* allow settings to be passed to AutoML constructor

* checkpoint bug fix

* time limit for autohf regression test

* skip slow test on macos

* cleanup before del
2021-11-18 09:39:45 -08:00


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Time Series Forecasting with FLAML Library"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction\n",
"\n",
"FLAML is a Python library (https://github.com/microsoft/FLAML) designed to automatically produce accurate machine learning models with low computational cost. It is fast and cheap. The simple and lightweight design makes it easy to use and extend, such as adding new learners. FLAML can\n",
"\n",
" - serve as an economical AutoML engine,\n",
" - be used as a fast hyperparameter tuning tool, or\n",
" - be embedded in self-tuning software that requires low latency & resource in repetitive tuning tasks.\n",
" - In this notebook, we demonstrate how to use FLAML library to tune hyperparameters of XGBoost with a regression example.\n",
"\n",
"FLAML requires Python>=3.6. To run this notebook example, please install flaml with the notebook and forecast option:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install flaml[notebook,ts_forecast]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Forecast Problem\n",
"\n",
"### Load data and preprocess\n",
"\n",
"Import co2 data from statsmodel. The dataset is from “Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory, Hawaii, U.S.A.,” which collected CO2 samples from March 1958 to December 2001. The task is to predict monthly CO2 samples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import statsmodels.api as sm\n",
"data = sm.datasets.co2.load_pandas()\n",
"data = data.data\n",
"# data is given in weeks, but the task is to predict monthly, so use monthly averages instead\n",
"data = data['co2'].resample('MS').mean()\n",
"data = data.fillna(data.bfill()) # makes sure there are no missing values\n",
"data = data.to_frame().reset_index()\n",
"# data = data.rename(columns={'index': 'ds', 'co2': 'y'})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# split the data into a train dataframe and X_test and y_test dataframes, where the number of samples for test is equal to\n",
"# the number of periods the user wants to predict\n",
"num_samples = data.shape[0]\n",
"time_horizon = 12\n",
"split_idx = num_samples - time_horizon\n",
"train_df = data[:split_idx] # train_df is a dataframe with two columns: timestamp and label\n",
"X_test = data[split_idx:]['index'].to_frame() # X_test is a dataframe with dates for prediction\n",
"y_test = data[split_idx:]['co2'] # y_test is a series of the values corresponding to the dates for prediction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run FLAML\n",
"\n",
"In the FLAML automl run configuration, users can specify the task type, time budget, error metric, learner list, whether to subsample, resampling strategy type, and so on. All these arguments have default values which will be used if users do not provide them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' import AutoML class from flaml package '''\n",
"from flaml import AutoML\n",
"automl = AutoML()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"settings = {\n",
" \"time_budget\": 180, # total running time in seconds\n",
" \"metric\": 'mape', # primary metric for validation: 'mape' is generally used for forecast tasks\n",
" \"task\": 'ts_forecast', # task type\n",
" \"log_file_name\": 'CO2_forecast.log', # flaml log file\n",
" \"eval_method\": \"holdout\", # validation method can be chosen from ['auto', 'holdout', 'cv']\n",
" \"seed\": 7654321, # random seed\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'''The main flaml automl API'''\n",
"automl.fit(dataframe=train_df, # training data\n",
" label='co2', # label column\n",
" period=time_horizon, # key word argument 'period' must be included for forecast task)\n",
" **settings)"
]
},
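{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative (noted in the commit message above), the same settings can be passed to the `AutoML` constructor instead of to `fit`. A minimal sketch, assuming constructor-level settings act as defaults for later `fit` calls:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' alternative: pass settings to the AutoML constructor (sketch) '''\n",
"automl_alt = AutoML(**settings)\n",
"# automl_alt.fit(dataframe=train_df, label='co2', period=time_horizon) # would rerun the full search"
]
},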
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Best model and metric"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' retrieve best config and best learner'''\n",
"print('Best ML leaner:', automl.best_estimator)\n",
"print('Best hyperparmeter config:', automl.best_config)\n",
"print(f'Best mape on validation data: {automl.best_loss}')\n",
"print(f'Training duration of best run: {automl.best_config_train_time}s')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(automl.model.estimator)"
]
},
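{
"cell_type": "markdown",
"metadata": {},
"source": [
"The best configuration found for each estimator tried during the search can also be inspected. A sketch, assuming the `best_config_per_estimator` and `best_loss_per_estimator` attributes are available in this FLAML version:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' per-estimator results (sketch; assumes these attributes exist in this FLAML version) '''\n",
"print('Best config per estimator:', automl.best_config_per_estimator)\n",
"print('Best loss per estimator:', automl.best_loss_per_estimator)"
]
},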
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' pickle and save the automl object '''\n",
"import pickle\n",
"with open('automl.pkl', 'wb') as f:\n",
" pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)"
]
},
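{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pickled object can be loaded back later and used for prediction without rerunning the search. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' load the pickled automl object back (sketch) '''\n",
"with open('automl.pkl', 'rb') as f:\n",
" automl_loaded = pickle.load(f)\n",
"print(automl_loaded.predict(X_test))"
]
},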
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' compute predictions of testing dataset '''\n",
"flaml_y_pred = automl.predict(X_test)\n",
"print(f\"Predicted labels\\n{flaml_y_pred}\")\n",
"print(f\"True labels\\n{y_test}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' compute different metric values on testing dataset'''\n",
"from flaml.ml import sklearn_metric_loss_score\n",
"print('mape', '=', sklearn_metric_loss_score('mape', y_predict=flaml_y_pred, y_true=y_test))"
]
},
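{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other error metrics can be computed on the test set in the same way. A sketch, assuming 'rmse' and 'mae' are among the metric names supported by `sklearn_metric_loss_score`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' additional metrics on the testing dataset (sketch) '''\n",
"print('rmse', '=', sklearn_metric_loss_score('rmse', y_predict=flaml_y_pred, y_true=y_test))\n",
"print('mae', '=', sklearn_metric_loss_score('mae', y_predict=flaml_y_pred, y_true=y_test))"
]
},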
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Log history"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from flaml.data import get_output_from_log\n",
"time_history, best_valid_loss_history, valid_loss_history, config_history, train_loss_history = \\\n",
" get_output_from_log(filename=settings['log_file_name'], time_budget=180)\n",
"\n",
"for config in config_history:\n",
" print(config)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"plt.title('Learning Curve')\n",
"plt.xlabel('Wall Clock Time (s)')\n",
"plt.ylabel('Validation Accuracy')\n",
"plt.scatter(time_history, 1 - np.array(valid_loss_history))\n",
"plt.step(time_history, 1 - np.array(best_valid_loss_history), where='post')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"plt.plot(X_test, y_test, label='Actual level')\n",
"plt.plot(X_test, flaml_y_pred, label='FLAML forecast')\n",
"plt.xlabel('Date')\n",
"plt.ylabel('CO2 Levels')\n",
"plt.legend()"
]
}
],
"metadata": {
"interpreter": {
"hash": "8b6c8c3ba4bafbc4530f534c605c8412f25bf61ef13254e4f377ccd42b838aa4"
},
"kernelspec": {
"display_name": "Python 3.8.10 64-bit ('python38': conda)",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}