autogen/notebook/flaml_forecast.ipynb
Chi Wang 72caa2172d
model_history, ITER_HP, settings in AutoML(), checkpoint bug fix (#283)
if save_best_model_per_estimator is False and retrain_final is True, unfit the model after evaluation in HPO.
retrain if using ray.
update ITER_HP in config after a trial is finished.
change prophet logging level.
example and notebook update.
allow settings to be passed to AutoML constructor (#192, #277): the automl settings can be passed to the constructor instead of requiring a derived class.
remove model_history.
checkpoint bug fix.

* model_history meaning save_best_model_per_estimator

* ITER_HP

* example update

* prophet logging level

* comment update in forecast notebook

* print format improvement

* allow settings to be passed to AutoML constructor

* checkpoint bug fix

* time limit for autohf regression test

* skip slow test on macos

* cleanup before del
2021-11-18 09:39:45 -08:00


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Time Series Forecasting with FLAML Library"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction\n",
"\n",
"FLAML is a Python library (https://github.com/microsoft/FLAML) designed to automatically produce accurate machine learning models with low computational cost. It is fast and cheap. The simple and lightweight design makes it easy to use and extend, such as adding new learners. FLAML can\n",
"\n",
" - serve as an economical AutoML engine,\n",
" - be used as a fast hyperparameter tuning tool, or\n",
" - be embedded in self-tuning software that requires low latency & resource in repetitive tuning tasks.\n",
" - In this notebook, we demonstrate how to use FLAML library to tune hyperparameters of XGBoost with a regression example.\n",
"\n",
"FLAML requires Python>=3.6. To run this notebook example, please install flaml with the notebook and forecast option:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install flaml[notebook,ts_forecast]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Forecast Problem\n",
"\n",
"### Load data and preprocess\n",
"\n",
"Import co2 data from statsmodel. The dataset is from “Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory, Hawaii, U.S.A.,” which collected CO2 samples from March 1958 to December 2001. The task is to predict monthly CO2 samples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import statsmodels.api as sm\n",
"data = sm.datasets.co2.load_pandas()\n",
"data = data.data\n",
"# data is given in weeks, but the task is to predict monthly, so use monthly averages instead\n",
"data = data['co2'].resample('MS').mean()\n",
"data = data.fillna(data.bfill()) # makes sure there are no missing values\n",
"data = data.to_frame().reset_index()\n",
"# data = data.rename(columns={'index': 'ds', 'co2': 'y'})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# split the data into a train dataframe and X_test and y_test dataframes, where the number of samples for test is equal to\n",
"# the number of periods the user wants to predict\n",
"num_samples = data.shape[0]\n",
"time_horizon = 12\n",
"split_idx = num_samples - time_horizon\n",
"train_df = data[:split_idx] # train_df is a dataframe with two columns: timestamp and label\n",
"X_test = data[split_idx:]['index'].to_frame() # X_test is a dataframe with dates for prediction\n",
"y_test = data[split_idx:]['co2'] # y_test is a series of the values corresponding to the dates for prediction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run FLAML\n",
"\n",
"In the FLAML automl run configuration, users can specify the task type, time budget, error metric, learner list, whether to subsample, resampling strategy type, and so on. All these arguments have default values which will be used if users do not provide them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' import AutoML class from flaml package '''\n",
"from flaml import AutoML\n",
"automl = AutoML()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"settings = {\n",
" \"time_budget\": 180, # total running time in seconds\n",
" \"metric\": 'mape', # primary metric for validation: 'mape' is generally used for forecast tasks\n",
" \"task\": 'ts_forecast', # task type\n",
" \"log_file_name\": 'CO2_forecast.log', # flaml log file\n",
" \"eval_method\": \"holdout\", # validation method can be chosen from ['auto', 'holdout', 'cv']\n",
" \"seed\": 7654321, # random seed\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'''The main flaml automl API'''\n",
"automl.fit(dataframe=train_df, # training data\n",
" label='co2', # label column\n",
" period=time_horizon, # key word argument 'period' must be included for forecast task)\n",
" **settings)"
]
},
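{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative (noted in the commit message above), the same settings can be passed to the `AutoML` constructor instead of to `fit`. A minimal sketch, assuming constructor-level settings act as defaults for later `fit` calls:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' alternative: pass settings to the AutoML constructor (sketch) '''\n",
"automl_alt = AutoML(**settings)\n",
"# automl_alt.fit(dataframe=train_df, label='co2', period=time_horizon) # would rerun the full search"
]
},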
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Best model and metric"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' retrieve best config and best learner'''\n",
"print('Best ML leaner:', automl.best_estimator)\n",
"print('Best hyperparmeter config:', automl.best_config)\n",
"print(f'Best mape on validation data: {automl.best_loss}')\n",
"print(f'Training duration of best run: {automl.best_config_train_time}s')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(automl.model.estimator)"
]
},
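{
"cell_type": "markdown",
"metadata": {},
"source": [
"The best configuration found for each estimator tried during the search can also be inspected. A sketch, assuming the `best_config_per_estimator` and `best_loss_per_estimator` attributes are available in this FLAML version:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' per-estimator results (sketch; assumes these attributes exist in this FLAML version) '''\n",
"print('Best config per estimator:', automl.best_config_per_estimator)\n",
"print('Best loss per estimator:', automl.best_loss_per_estimator)"
]
},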
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' pickle and save the automl object '''\n",
"import pickle\n",
"with open('automl.pkl', 'wb') as f:\n",
" pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)"
]
},
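{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pickled object can be loaded back later and used for prediction without rerunning the search. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' load the pickled automl object back (sketch) '''\n",
"with open('automl.pkl', 'rb') as f:\n",
" automl_loaded = pickle.load(f)\n",
"print(automl_loaded.predict(X_test))"
]
},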
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' compute predictions of testing dataset '''\n",
"flaml_y_pred = automl.predict(X_test)\n",
"print(f\"Predicted labels\\n{flaml_y_pred}\")\n",
"print(f\"True labels\\n{y_test}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' compute different metric values on testing dataset'''\n",
"from flaml.ml import sklearn_metric_loss_score\n",
"print('mape', '=', sklearn_metric_loss_score('mape', y_predict=flaml_y_pred, y_true=y_test))"
]
},
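{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other error metrics can be computed on the test set in the same way. A sketch, assuming 'rmse' and 'mae' are among the metric names supported by `sklearn_metric_loss_score`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' additional metrics on the testing dataset (sketch) '''\n",
"print('rmse', '=', sklearn_metric_loss_score('rmse', y_predict=flaml_y_pred, y_true=y_test))\n",
"print('mae', '=', sklearn_metric_loss_score('mae', y_predict=flaml_y_pred, y_true=y_test))"
]
},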
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Log history"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from flaml.data import get_output_from_log\n",
"time_history, best_valid_loss_history, valid_loss_history, config_history, train_loss_history = \\\n",
" get_output_from_log(filename=settings['log_file_name'], time_budget=180)\n",
"\n",
"for config in config_history:\n",
" print(config)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"plt.title('Learning Curve')\n",
"plt.xlabel('Wall Clock Time (s)')\n",
"plt.ylabel('Validation Accuracy')\n",
"plt.scatter(time_history, 1 - np.array(valid_loss_history))\n",
"plt.step(time_history, 1 - np.array(best_valid_loss_history), where='post')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"plt.plot(X_test, y_test, label='Actual level')\n",
"plt.plot(X_test, flaml_y_pred, label='FLAML forecast')\n",
"plt.xlabel('Date')\n",
"plt.ylabel('CO2 Levels')\n",
"plt.legend()"
]
}
],
"metadata": {
"interpreter": {
"hash": "8b6c8c3ba4bafbc4530f534c605c8412f25bf61ef13254e4f377ccd42b838aa4"
},
"kernelspec": {
"display_name": "Python 3.8.10 64-bit ('python38': conda)",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}