{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Copyright (c) Microsoft Corporation. All rights reserved. \n", "\n", "Licensed under the MIT License.\n", "\n", "# Tune LightGBM with FLAML Library\n", "\n", "\n", "## 1. Introduction\n", "\n", "FLAML is a Python library (https://github.com/microsoft/FLAML) designed to automatically produce accurate machine learning models \n", "with low computational cost. It is fast and economical. The simple and lightweight design makes it easy \n", "to use and extend, such as adding new learners. FLAML can \n", "- serve as an economical AutoML engine,\n", "- be used as a fast hyperparameter tuning tool, or \n", "- be embedded in self-tuning software that requires low latency & resource in repetitive\n", " tuning tasks.\n", "\n", "In this notebook, we demonstrate how to use FLAML library to tune hyperparameters of LightGBM with a regression example.\n", "\n", "FLAML requires `Python>=3.7`. To run this notebook example, please install flaml with the `notebook` option:\n", "```bash\n", "pip install flaml[notebook]\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%pip install flaml[notebook]==1.0.10" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 2. Regression Example\n", "### Load data and preprocess\n", "\n", "Download [houses dataset](https://www.openml.org/d/537) from OpenML. The task is to predict median price of the house in the region based on demographic composition and a state of housing market in the region." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/root/.local/lib/python3.9/site-packages/xgboost/compat.py:31: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.\n", " from pandas import MultiIndex, Int64Index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "download dataset from openml\n", "Dataset name: houses\n", "X_train.shape: (15480, 8), y_train.shape: (15480,);\n", "X_test.shape: (5160, 8), y_test.shape: (5160,)\n" ] } ], "source": [ "from flaml.data import load_openml_dataset\n", "X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=537, data_dir='./')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Run FLAML\n", "In the FLAML automl run configuration, users can specify the task type, time budget, error metric, learner list, whether to subsample, resampling strategy type, and so on. All these arguments have default values which will be used if users do not provide them. 
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [], "source": [ "''' import AutoML class from flaml package '''\n", "from flaml import AutoML\n", "automl = AutoML()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "settings = {\n", " \"time_budget\": 240, # total running time in seconds\n", " \"metric\": 'r2', # primary metrics for regression can be chosen from: ['mae','mse','r2','rmse','mape']\n", " \"estimator_list\": ['lgbm'], # list of ML learners; we tune lightgbm in this example\n", " \"task\": 'regression', # task type \n", " \"log_file_name\": 'houses_experiment.log', # flaml log file\n", " \"seed\": 7654321, # random seed\n", "}" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[flaml.automl: 07-01 15:22:15] {2427} INFO - task = regression\n", "[flaml.automl: 07-01 15:22:15] {2429} INFO - Data split method: uniform\n", "[flaml.automl: 07-01 15:22:15] {2432} INFO - Evaluation method: cv\n", "[flaml.automl: 07-01 15:22:15] {2501} INFO - Minimizing error metric: 1-r2\n", "[flaml.automl: 07-01 15:22:15] {2641} INFO - List of ML learners in AutoML Run: ['lgbm']\n", "[flaml.automl: 07-01 15:22:15] {2933} INFO - iteration 0, current learner lgbm\n", "[flaml.automl: 07-01 15:22:16] {3061} INFO - Estimated sufficient time budget=1981s. Estimated necessary time budget=2s.\n", "[flaml.automl: 07-01 15:22:16] {3108} INFO - at 0.3s,\testimator lgbm's best error=0.7383,\tbest estimator lgbm's best error=0.7383\n", "[flaml.automl: 07-01 15:22:16] {2933} INFO - iteration 1, current learner lgbm\n", "[flaml.automl: 07-01 15:22:16] {3108} INFO - at 0.5s,\testimator lgbm's best error=0.7383,\tbest estimator lgbm's best error=0.7383\n", "[flaml.automl: 07-01 15:22:16] {2933} INFO - iteration 2, current learner lgbm\n", "[flaml.automl: 07-01 15:22:16] {3108} INFO - at 0.7s,\testimator lgbm's best error=0.3250,\tbest estimator lgbm's best error=0.3250\n", "[flaml.automl: 07-01 15:22:16] {2933} INFO - iteration 3, current learner lgbm\n", "[flaml.automl: 07-01 15:22:16] {3108} INFO - at 1.1s,\testimator lgbm's best error=0.1868,\tbest estimator lgbm's best error=0.1868\n", "[flaml.automl: 07-01 15:22:16] {2933} INFO - iteration 4, current learner lgbm\n", "[flaml.automl: 07-01 15:22:17] {3108} INFO - at 1.3s,\testimator lgbm's best error=0.1868,\tbest estimator lgbm's best error=0.1868\n", "[flaml.automl: 07-01 15:22:17] {2933} INFO - iteration 5, current learner lgbm\n", "[flaml.automl: 07-01 15:22:19] {3108} INFO - at 3.6s,\testimator lgbm's best error=0.1868,\tbest estimator lgbm's best error=0.1868\n", "[flaml.automl: 07-01 15:22:19] {2933} INFO - iteration 6, current learner lgbm\n", "[flaml.automl: 07-01 15:22:19] {3108} INFO - at 3.8s,\testimator lgbm's best error=0.1868,\tbest estimator lgbm's best error=0.1868\n", "[flaml.automl: 07-01 15:22:19] {2933} INFO - iteration 7, current learner lgbm\n", "[flaml.automl: 07-01 15:22:19] {3108} INFO - at 4.2s,\testimator lgbm's best error=0.1868,\tbest estimator lgbm's best error=0.1868\n", "[flaml.automl: 07-01 15:22:19] {2933} INFO - iteration 8, current learner lgbm\n", "[flaml.automl: 07-01 15:22:20] {3108} INFO - at 4.7s,\testimator lgbm's best error=0.1868,\tbest estimator lgbm's best error=0.1868\n", "[flaml.automl: 07-01 15:22:20] {2933} INFO - 
iteration 9, current learner lgbm\n", "[flaml.automl: 07-01 15:22:20] {3108} INFO - at 4.9s,\testimator lgbm's best error=0.1868,\tbest estimator lgbm's best error=0.1868\n", "[flaml.automl: 07-01 15:22:20] {2933} INFO - iteration 10, current learner lgbm\n", "[flaml.automl: 07-01 15:22:22] {3108} INFO - at 6.6s,\testimator lgbm's best error=0.1744,\tbest estimator lgbm's best error=0.1744\n", "[flaml.automl: 07-01 15:22:22] {2933} INFO - iteration 11, current learner lgbm\n", "[flaml.automl: 07-01 15:22:22] {3108} INFO - at 7.2s,\testimator lgbm's best error=0.1744,\tbest estimator lgbm's best error=0.1744\n", "[flaml.automl: 07-01 15:22:22] {2933} INFO - iteration 12, current learner lgbm\n", "[flaml.automl: 07-01 15:22:28] {3108} INFO - at 12.9s,\testimator lgbm's best error=0.1744,\tbest estimator lgbm's best error=0.1744\n", "[flaml.automl: 07-01 15:22:28] {2933} INFO - iteration 13, current learner lgbm\n", "[flaml.automl: 07-01 15:22:29] {3108} INFO - at 13.6s,\testimator lgbm's best error=0.1744,\tbest estimator lgbm's best error=0.1744\n", "[flaml.automl: 07-01 15:22:29] {2933} INFO - iteration 14, current learner lgbm\n", "[flaml.automl: 07-01 15:22:34] {3108} INFO - at 18.4s,\testimator lgbm's best error=0.1744,\tbest estimator lgbm's best error=0.1744\n", "[flaml.automl: 07-01 15:22:34] {2933} INFO - iteration 15, current learner lgbm\n", "[flaml.automl: 07-01 15:22:39] {3108} INFO - at 23.9s,\testimator lgbm's best error=0.1744,\tbest estimator lgbm's best error=0.1744\n", "[flaml.automl: 07-01 15:22:39] {2933} INFO - iteration 16, current learner lgbm\n", "[flaml.automl: 07-01 15:22:40] {3108} INFO - at 24.5s,\testimator lgbm's best error=0.1744,\tbest estimator lgbm's best error=0.1744\n", "[flaml.automl: 07-01 15:22:40] {2933} INFO - iteration 17, current learner lgbm\n", "[flaml.automl: 07-01 15:22:53] {3108} INFO - at 37.9s,\testimator lgbm's best error=0.1744,\tbest estimator lgbm's best error=0.1744\n", "[flaml.automl: 07-01 15:22:53] {2933} INFO - iteration 18, current learner lgbm\n", "[flaml.automl: 07-01 15:22:53] {3108} INFO - at 38.2s,\testimator lgbm's best error=0.1744,\tbest estimator lgbm's best error=0.1744\n", "[flaml.automl: 07-01 15:22:53] {2933} INFO - iteration 19, current learner lgbm\n", "[flaml.automl: 07-01 15:22:54] {3108} INFO - at 39.2s,\testimator lgbm's best error=0.1744,\tbest estimator lgbm's best error=0.1744\n", "[flaml.automl: 07-01 15:22:54] {2933} INFO - iteration 20, current learner lgbm\n", "[flaml.automl: 07-01 15:22:56] {3108} INFO - at 41.0s,\testimator lgbm's best error=0.1738,\tbest estimator lgbm's best error=0.1738\n", "[flaml.automl: 07-01 15:22:56] {2933} INFO - iteration 21, current learner lgbm\n", "[flaml.automl: 07-01 15:22:58] {3108} INFO - at 42.5s,\testimator lgbm's best error=0.1738,\tbest estimator lgbm's best error=0.1738\n", "[flaml.automl: 07-01 15:22:58] {2933} INFO - iteration 22, current learner lgbm\n", "[flaml.automl: 07-01 15:22:59] {3108} INFO - at 44.2s,\testimator lgbm's best error=0.1738,\tbest estimator lgbm's best error=0.1738\n", "[flaml.automl: 07-01 15:22:59] {2933} INFO - iteration 23, current learner lgbm\n", "[flaml.automl: 07-01 15:23:03] {3108} INFO - at 47.8s,\testimator lgbm's best error=0.1738,\tbest estimator lgbm's best error=0.1738\n", "[flaml.automl: 07-01 15:23:03] {2933} INFO - iteration 24, current learner lgbm\n", "[flaml.automl: 07-01 15:23:04] {3108} INFO - at 48.6s,\testimator lgbm's best error=0.1738,\tbest estimator lgbm's best error=0.1738\n", "[flaml.automl: 07-01 15:23:04] 
{2933} INFO - iteration 25, current learner lgbm\n", "[flaml.automl: 07-01 15:23:05] {3108} INFO - at 49.5s,\testimator lgbm's best error=0.1738,\tbest estimator lgbm's best error=0.1738\n", "[flaml.automl: 07-01 15:23:05] {2933} INFO - iteration 26, current learner lgbm\n", "[flaml.automl: 07-01 15:23:07] {3108} INFO - at 51.4s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:07] {2933} INFO - iteration 27, current learner lgbm\n", "[flaml.automl: 07-01 15:23:09] {3108} INFO - at 53.8s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:09] {2933} INFO - iteration 28, current learner lgbm\n", "[flaml.automl: 07-01 15:23:11] {3108} INFO - at 55.4s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:11] {2933} INFO - iteration 29, current learner lgbm\n", "[flaml.automl: 07-01 15:23:12] {3108} INFO - at 56.6s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:12] {2933} INFO - iteration 30, current learner lgbm\n", "[flaml.automl: 07-01 15:23:15] {3108} INFO - at 59.8s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:15] {2933} INFO - iteration 31, current learner lgbm\n", "[flaml.automl: 07-01 15:23:20] {3108} INFO - at 64.5s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:20] {2933} INFO - iteration 32, current learner lgbm\n", "[flaml.automl: 07-01 15:23:20] {3108} INFO - at 65.1s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:20] {2933} INFO - iteration 33, current learner lgbm\n", "[flaml.automl: 07-01 15:23:31] {3108} INFO - at 76.0s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:31] {2933} INFO - iteration 34, current learner lgbm\n", "[flaml.automl: 07-01 15:23:32] {3108} INFO - at 76.5s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:32] {2933} INFO - iteration 35, current learner lgbm\n", "[flaml.automl: 07-01 15:23:35] {3108} INFO - at 79.3s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:35] {2933} INFO - iteration 36, current learner lgbm\n", "[flaml.automl: 07-01 15:23:35] {3108} INFO - at 80.2s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:35] {2933} INFO - iteration 37, current learner lgbm\n", "[flaml.automl: 07-01 15:23:37] {3108} INFO - at 81.5s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:37] {2933} INFO - iteration 38, current learner lgbm\n", "[flaml.automl: 07-01 15:23:39] {3108} INFO - at 83.8s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:39] {2933} INFO - iteration 39, current learner lgbm\n", "[flaml.automl: 07-01 15:23:40] {3108} INFO - at 84.8s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:40] {2933} INFO - iteration 40, current learner lgbm\n", "[flaml.automl: 07-01 15:23:43] {3108} INFO - at 88.1s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 
07-01 15:23:43] {2933} INFO - iteration 41, current learner lgbm\n", "[flaml.automl: 07-01 15:23:45] {3108} INFO - at 89.4s,\testimator lgbm's best error=0.1611,\tbest estimator lgbm's best error=0.1611\n", "[flaml.automl: 07-01 15:23:45] {2933} INFO - iteration 42, current learner lgbm\n", "[flaml.automl: 07-01 15:23:47] {3108} INFO - at 91.7s,\testimator lgbm's best error=0.1608,\tbest estimator lgbm's best error=0.1608\n", "[flaml.automl: 07-01 15:23:47] {2933} INFO - iteration 43, current learner lgbm\n", "[flaml.automl: 07-01 15:23:48] {3108} INFO - at 92.4s,\testimator lgbm's best error=0.1608,\tbest estimator lgbm's best error=0.1608\n", "[flaml.automl: 07-01 15:23:48] {2933} INFO - iteration 44, current learner lgbm\n", "[flaml.automl: 07-01 15:23:54] {3108} INFO - at 98.5s,\testimator lgbm's best error=0.1608,\tbest estimator lgbm's best error=0.1608\n", "[flaml.automl: 07-01 15:23:54] {2933} INFO - iteration 45, current learner lgbm\n", "[flaml.automl: 07-01 15:23:55] {3108} INFO - at 100.2s,\testimator lgbm's best error=0.1608,\tbest estimator lgbm's best error=0.1608\n", "[flaml.automl: 07-01 15:23:55] {2933} INFO - iteration 46, current learner lgbm\n", "[flaml.automl: 07-01 15:23:58] {3108} INFO - at 102.6s,\testimator lgbm's best error=0.1608,\tbest estimator lgbm's best error=0.1608\n", "[flaml.automl: 07-01 15:23:58] {2933} INFO - iteration 47, current learner lgbm\n", "[flaml.automl: 07-01 15:23:59] {3108} INFO - at 103.4s,\testimator lgbm's best error=0.1608,\tbest estimator lgbm's best error=0.1608\n", "[flaml.automl: 07-01 15:23:59] {2933} INFO - iteration 48, current learner lgbm\n", "[flaml.automl: 07-01 15:24:03] {3108} INFO - at 108.0s,\testimator lgbm's best error=0.1608,\tbest estimator lgbm's best error=0.1608\n", "[flaml.automl: 07-01 15:24:03] {2933} INFO - iteration 49, current learner lgbm\n", "[flaml.automl: 07-01 15:24:04] {3108} INFO - at 108.8s,\testimator lgbm's best error=0.1608,\tbest estimator lgbm's best error=0.1608\n", "[flaml.automl: 07-01 15:24:04] {2933} INFO - iteration 50, current learner lgbm\n", "[flaml.automl: 07-01 15:24:12] {3108} INFO - at 116.3s,\testimator lgbm's best error=0.1558,\tbest estimator lgbm's best error=0.1558\n", "[flaml.automl: 07-01 15:24:12] {2933} INFO - iteration 51, current learner lgbm\n", "[flaml.automl: 07-01 15:25:01] {3108} INFO - at 166.2s,\testimator lgbm's best error=0.1558,\tbest estimator lgbm's best error=0.1558\n", "[flaml.automl: 07-01 15:25:01] {2933} INFO - iteration 52, current learner lgbm\n", "[flaml.automl: 07-01 15:25:02] {3108} INFO - at 167.2s,\testimator lgbm's best error=0.1558,\tbest estimator lgbm's best error=0.1558\n", "[flaml.automl: 07-01 15:25:02] {2933} INFO - iteration 53, current learner lgbm\n", "[flaml.automl: 07-01 15:25:04] {3108} INFO - at 168.7s,\testimator lgbm's best error=0.1558,\tbest estimator lgbm's best error=0.1558\n", "[flaml.automl: 07-01 15:25:04] {2933} INFO - iteration 54, current learner lgbm\n", "[flaml.automl: 07-01 15:25:38] {3108} INFO - at 203.0s,\testimator lgbm's best error=0.1558,\tbest estimator lgbm's best error=0.1558\n", "[flaml.automl: 07-01 15:25:38] {2933} INFO - iteration 55, current learner lgbm\n", "[flaml.automl: 07-01 15:25:47] {3108} INFO - at 211.9s,\testimator lgbm's best error=0.1558,\tbest estimator lgbm's best error=0.1558\n", "[flaml.automl: 07-01 15:25:47] {2933} INFO - iteration 56, current learner lgbm\n", "[flaml.automl: 07-01 15:25:51] {3108} INFO - at 216.2s,\testimator lgbm's best error=0.1558,\tbest estimator lgbm's best 
error=0.1558\n", "[flaml.automl: 07-01 15:25:51] {2933} INFO - iteration 57, current learner lgbm\n", "[flaml.automl: 07-01 15:25:53] {3108} INFO - at 217.8s,\testimator lgbm's best error=0.1558,\tbest estimator lgbm's best error=0.1558\n", "[flaml.automl: 07-01 15:25:53] {2933} INFO - iteration 58, current learner lgbm\n", "[flaml.automl: 07-01 15:26:19] {3108} INFO - at 243.9s,\testimator lgbm's best error=0.1558,\tbest estimator lgbm's best error=0.1558\n", "[flaml.automl: 07-01 15:26:21] {3372} INFO - retrain lgbm for 1.7s\n", "[flaml.automl: 07-01 15:26:21] {3379} INFO - retrained model: LGBMRegressor(colsample_bytree=0.6884091116362046,\n", " learning_rate=0.0825101833775657, max_bin=1023,\n", " min_child_samples=15, n_estimators=436, num_leaves=46,\n", " reg_alpha=0.0010949400705571237, reg_lambda=0.004934208563558304,\n", " verbose=-1)\n", "[flaml.automl: 07-01 15:26:21] {2672} INFO - fit succeeded\n", "[flaml.automl: 07-01 15:26:21] {2673} INFO - Time taken to find the best model: 116.267258644104\n" ] } ], "source": [ "'''The main flaml automl API'''\n", "automl.fit(X_train=X_train, y_train=y_train, **settings)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Best model and metric" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best hyperparmeter config: {'n_estimators': 436, 'num_leaves': 46, 'min_child_samples': 15, 'learning_rate': 0.0825101833775657, 'log_max_bin': 10, 'colsample_bytree': 0.6884091116362046, 'reg_alpha': 0.0010949400705571237, 'reg_lambda': 0.004934208563558304}\n", "Best r2 on validation data: 0.8442\n", "Training duration of best run: 1.668 s\n" ] } ], "source": [ "''' retrieve best config'''\n", "print('Best hyperparmeter config:', automl.best_config)\n", "print('Best r2 on validation data: {0:.4g}'.format(1-automl.best_loss))\n", "print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
LGBMRegressor(colsample_bytree=0.6884091116362046,\n", " learning_rate=0.0825101833775657, max_bin=1023,\n", " min_child_samples=15, n_estimators=436, num_leaves=46,\n", " reg_alpha=0.0010949400705571237, reg_lambda=0.004934208563558304,\n", " verbose=-1)
LGBMRegressor()