{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright (c) 2021. All rights reserved.\n", "\n", "Contributed by: @bnriiitb\n", "\n", "Licensed under the MIT License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using AutoML in Sklearn Pipeline\n", "\n", "This tutorial will help you understand how FLAML's AutoML can be used as a transformer in the Sklearn pipeline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1.Introduction\n", "\n", "### 1.1 FLAML - Fast and Lightweight AutoML\n", "\n", "FLAML is a Python library (https://github.com/microsoft/FLAML) designed to automatically produce accurate machine learning models with low computational cost. It is fast and economical. The simple and lightweight design makes it easy to use and extend, such as adding new learners. \n", "\n", "FLAML can \n", "- serve as an economical AutoML engine,\n", "- be used as a fast hyperparameter tuning tool, or \n", "- be embedded in self-tuning software that requires low latency & resource in repetitive\n", " tuning tasks.\n", "\n", "In this notebook, we use one real data example (binary classification) to showcase how to use FLAML library.\n", "\n", "FLAML requires `Python>=3.7`. To run this notebook example, please install flaml with the `notebook` option:\n", "```bash\n", "pip install flaml[notebook]\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Why are pipelines a silver bullet?\n", "\n", "In a typical machine learning workflow we have to apply all the transformations at least twice. \n", "1. During Training\n", "2. During Inference\n", "\n", "Scikit-learn pipelines provide an easy to use inteface to automate ML workflows by allowing several transformers to be chained together. 
\n", "\n", "The key benefits of using pipelines:\n", "* Make ML workflows highly readable, enabling fast development and easy review\n", "* Help to build sequential and parallel processes\n", "* Allow hyperparameter tuning across the estimators\n", "* Easier to share and collaborate with multiple users (bug fixes, enhancements etc)\n", "* Enforce the implementation and order of steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### As FLAML's AutoML module can be used a transformer in the Sklearn's pipeline we can get all the benefits of pipeline and thereby write extremley clean, and resuable code." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "%pip install flaml[notebook]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Classification Example\n", "### Load data and preprocess\n", "\n", "Download [Airlines dataset](https://www.openml.org/d/1169) from OpenML. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "download dataset from openml\n", "Dataset name: airlines\n", "X_train.shape: (404537, 7), y_train.shape: (404537,);\n", "X_test.shape: (134846, 7), y_test.shape: (134846,)\n" ] } ], "source": [ "from flaml.data import load_openml_dataset\n", "X_train, X_test, y_train, y_test = load_openml_dataset(\n", " dataset_id=1169, data_dir='./', random_state=1234, dataset_format='array')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 12., 2648., 4., 15., 4., 450., 67.], dtype=float32)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Create a Pipeline" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('imputuer', SimpleImputer()),\n", " ('standardizer', StandardScaler()),\n", " ('automl',\n", " AutoML(append_log=False, auto_augment=True, custom_hp={},\n", " early_stop=False, ensemble=False, estimator_list='auto',\n", " eval_method='auto', fit_kwargs_by_estimator={},\n", " hpo_method='auto', keep_search_state=False,\n", " learner_selector='sample', log_file_name='',\n", " log_training_metric=False, log_type='better',\n", " max_iter=None, mem_thres=4294967296, metric='auto',\n", " metric_constraints=[], min_sample_size=10000,\n", " model_history=False, n_concurrent_trials=1, n_jobs=-1,\n", " n_splits=5, pred_time_limit=inf, retrain_full=True,\n", " sample=True, split_ratio=0.1, split_type='auto',\n", " starting_points='static', task='classification', ...))])