autogen/notebook/integrate_azureml.ipynb
Chi Wang efd85b4c86
Deploy a new doc website (#338)
A new documentation website. And:

* add actions for doc

* update docstr

* installation instructions for doc dev

* unify README and Getting Started

* rename notebook

* doc about best_model_for_estimator #340

* docstr for keep_search_state #340

* DNN

Co-authored-by: Qingyun Wu <qingyun.wu@psu.edu>
Co-authored-by: Z.sk <shaokunzhang@psu.edu>
2021-12-16 17:11:33 -08:00

328 lines
8.4 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"source": [
"Copyright (c) 2020-2021 Microsoft Corporation. All rights reserved. \n",
"\n",
"Licensed under the MIT License.\n",
"\n",
"# Run FLAML in AzureML\n",
"\n",
"\n",
"## 1. Introduction\n",
"\n",
"FLAML is a Python library (https://github.com/microsoft/FLAML) designed to automatically produce accurate machine learning models \n",
"with low computational cost. It is fast and cheap. The simple and lightweight design makes it easy \n",
"to use and extend, such as adding new learners. FLAML can \n",
"- serve as an economical AutoML engine,\n",
"- be used as a fast hyperparameter tuning tool, or \n",
"- be embedded in self-tuning software that requires low latency & resource in repetitive\n",
" tuning tasks.\n",
"\n",
"In this notebook, we use one real data example (binary classification) to showcase how to use FLAML library together with AzureML.\n",
"\n",
"FLAML requires `Python>=3.6`. To run this notebook example, please install flaml with the `notebook` and `azureml` option:\n",
"```bash\n",
"pip install flaml[notebook,azureml]\n",
"```"
],
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"!pip install flaml[notebook,azureml]"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"### Enable mlflow in AzureML workspace"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"import mlflow\n",
"from azureml.core import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## 2. Classification Example\n",
"### Load data and preprocess\n",
"\n",
"Download [Airlines dataset](https://www.openml.org/d/1169) from OpenML. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure."
],
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"from flaml.data import load_openml_dataset\n",
"X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir='./')"
],
"outputs": [],
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
}
},
{
"cell_type": "markdown",
"source": [
"### Run FLAML\n",
"In the FLAML automl run configuration, users can specify the task type, time budget, error metric, learner list, whether to subsample, resampling strategy type, and so on. All these arguments have default values which will be used if users do not provide them. For example, the default ML learners of FLAML are `['lgbm', 'xgboost', 'catboost', 'rf', 'extra_tree', 'lrl1']`. "
],
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"''' import AutoML class from flaml package '''\n",
"from flaml import AutoML\n",
"automl = AutoML()"
],
"outputs": [],
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"settings = {\n",
" \"time_budget\": 60, # total running time in seconds\n",
" \"metric\": 'accuracy', # can be: 'r2', 'rmse', 'mae', 'mse', 'accuracy', 'roc_auc', 'roc_auc_ovr',\n",
" # 'roc_auc_ovo', 'log_loss', 'mape', 'f1', 'ap', 'ndcg', 'micro_f1', 'macro_f1'\n",
" \"estimator_list\": ['lgbm', 'rf', 'xgboost'], # list of ML learners\n",
" \"task\": 'classification', # task type \n",
" \"sample\": False, # whether to subsample training data\n",
" \"log_file_name\": 'airlines_experiment.log', # flaml log file\n",
"}"
],
"outputs": [],
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"mlflow.set_experiment(\"flaml\")\n",
"with mlflow.start_run() as run:\n",
" '''The main flaml automl API'''\n",
" automl.fit(X_train=X_train, y_train=y_train, **settings)"
],
"outputs": [],
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
}
},
{
"cell_type": "markdown",
"source": [
"### Best model and metric"
],
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"''' retrieve best config and best learner'''\n",
"print('Best ML leaner:', automl.best_estimator)\n",
"print('Best hyperparmeter config:', automl.best_config)\n",
"print('Best accuracy on validation data: {0:.4g}'.format(1 - automl.best_loss))\n",
"print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))"
],
"outputs": [],
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"automl.model"
],
"outputs": [],
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"''' pickle and save the automl object '''\n",
"import pickle\n",
"with open('automl.pkl', 'wb') as f:\n",
" pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)"
],
"outputs": [],
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"''' compute predictions of testing dataset ''' \n",
"y_pred = automl.predict(X_test)\n",
"print('Predicted labels', y_pred)\n",
"print('True labels', y_test)\n",
"y_pred_proba = automl.predict_proba(X_test)[:,1]"
],
"outputs": [],
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"''' compute different metric values on testing dataset'''\n",
"from flaml.ml import sklearn_metric_loss_score\n",
"print('accuracy', '=', 1 - sklearn_metric_loss_score('accuracy', y_pred, y_test))\n",
"print('roc_auc', '=', 1 - sklearn_metric_loss_score('roc_auc', y_pred_proba, y_test))\n",
"print('log_loss', '=', sklearn_metric_loss_score('log_loss', y_pred_proba, y_test))"
],
"outputs": [],
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
}
},
{
"cell_type": "markdown",
"source": [
"### Log history"
],
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"from flaml.data import get_output_from_log\n",
"time_history, best_valid_loss_history, valid_loss_history, config_history, metric_history = \\\n",
" get_output_from_log(filename = settings['log_file_name'], time_budget = 60)\n",
"\n",
"for config in config_history:\n",
" print(config)"
],
"outputs": [],
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"plt.title('Learning Curve')\n",
"plt.xlabel('Wall Clock Time (s)')\n",
"plt.ylabel('Validation Accuracy')\n",
"plt.scatter(time_history, 1 - np.array(valid_loss_history))\n",
"plt.step(time_history, 1 - np.array(best_valid_loss_history), where='post')\n",
"plt.show()"
],
"outputs": [],
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3.8.0 64-bit ('blend': conda)"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
},
"interpreter": {
"hash": "0cfea3304185a9579d09e0953576b57c8581e46e6ebc6dfeb681bc5a511f7544"
}
},
"nbformat": 4,
"nbformat_minor": 2
}