{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# FLAML AutoML on Apache Spark \n",
"\n",
"| | | | | |\n",
"|-----|-----|--------|--------|--------|\n",
"|| | | \n",
"\n",
"\n",
"\n",
"### Goal\n",
"\n",
"\n",
"## 1. Introduction\n",
"\n",
"### FLAML\n",
"FLAML is a Python library (https://github.com/microsoft/FLAML) designed to automatically produce accurate machine learning models \n",
"with low computational cost. It is fast and economical. The simple and lightweight design makes it easy \n",
"to use and extend, such as adding new learners. FLAML can \n",
"- serve as an economical AutoML engine,\n",
"- be used as a fast hyperparameter tuning tool, or \n",
"- be embedded in self-tuning software that requires low latency & resource in repetitive\n",
" tuning tasks.\n",
"\n",
"In this notebook, we demonstrate how to use FLAML library to do AutoML for SynapseML models and Apache Spark dataframes. We also compare the results between FLAML AutoML and the default SynapseML. \n",
" "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"jupyter": {
"outputs_hidden": true,
"source_hidden": false
}
},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:49:35.7617208Z",
"execution_start_time": "2023-04-19T00:49:35.7615143Z",
"livy_statement_state": "available",
"parent_msg_id": "aada545e-b4b9-4f61-b8f0-0921580f4c4c",
"queued_time": "2023-04-19T00:41:29.8670317Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": null,
"state": "finished",
"statement_id": -1
},
"text/plain": [
"StatementMeta(, 27, -1, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting flaml[synapse]@ git+https://github.com/microsoft/FLAML.git\n",
" Cloning https://github.com/microsoft/FLAML.git to /tmp/pip-install-9bp9bnbp/flaml_f9ddffb8b30b4c1aaffd650b9b9ac29a\n",
" Running command git clone --filter=blob:none --quiet https://github.com/microsoft/FLAML.git /tmp/pip-install-9bp9bnbp/flaml_f9ddffb8b30b4c1aaffd650b9b9ac29a\n",
" Resolved https://github.com/microsoft/FLAML.git to commit 99bb0a8425a58a537ae34347c867b4bc05310471\n",
" Preparing metadata (setup.py) ... \u001b[?25l-\b \b\\\b \bdone\n",
"\u001b[?25hCollecting xgboost==1.6.1\n",
" Downloading xgboost-1.6.1-py3-none-manylinux2014_x86_64.whl (192.9 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m192.9/192.9 MB\u001b[0m \u001b[31m22.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
"\u001b[?25hCollecting pandas==1.5.1\n",
" Downloading pandas-1.5.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.2/12.2 MB\u001b[0m \u001b[31m96.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
"\u001b[?25hCollecting numpy==1.23.4\n",
" Downloading numpy-1.23.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m17.1/17.1 MB\u001b[0m \u001b[31m98.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
"\u001b[?25hCollecting scipy\n",
" Downloading scipy-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m34.5/34.5 MB\u001b[0m \u001b[31m82.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
"\u001b[?25hCollecting pytz>=2020.1\n",
" Downloading pytz-2023.3-py2.py3-none-any.whl (502 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m502.3/502.3 KB\u001b[0m \u001b[31m125.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting python-dateutil>=2.8.1\n",
" Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m247.7/247.7 KB\u001b[0m \u001b[31m104.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting lightgbm>=2.3.1\n",
" Downloading lightgbm-3.3.5-py3-none-manylinux1_x86_64.whl (2.0 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.0/2.0 MB\u001b[0m \u001b[31m137.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting scikit-learn>=0.24\n",
" Downloading scikit_learn-1.2.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.8 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m9.8/9.8 MB\u001b[0m \u001b[31m148.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n",
"\u001b[?25hCollecting joblibspark>=0.5.0\n",
" Downloading joblibspark-0.5.1-py3-none-any.whl (15 kB)\n",
"Collecting optuna==2.8.0\n",
" Downloading optuna-2.8.0-py3-none-any.whl (301 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m302.0/302.0 KB\u001b[0m \u001b[31m107.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting pyspark>=3.2.0\n",
" Downloading pyspark-3.4.0.tar.gz (310.8 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m310.8/310.8 MB\u001b[0m \u001b[31m18.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
"\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25l-\b \bdone\n",
"\u001b[?25hCollecting colorlog\n",
" Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)\n",
"Collecting cmaes>=0.8.2\n",
" Downloading cmaes-0.9.1-py3-none-any.whl (21 kB)\n",
"Collecting cliff\n",
" Downloading cliff-4.2.0-py3-none-any.whl (81 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m81.0/81.0 KB\u001b[0m \u001b[31m44.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting packaging>=20.0\n",
" Downloading packaging-23.1-py3-none-any.whl (48 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m48.9/48.9 KB\u001b[0m \u001b[31m27.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting alembic\n",
" Downloading alembic-1.10.3-py3-none-any.whl (212 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m212.3/212.3 KB\u001b[0m \u001b[31m70.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting sqlalchemy>=1.1.0\n",
" Downloading SQLAlchemy-2.0.9-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.8/2.8 MB\u001b[0m \u001b[31m123.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting tqdm\n",
" Downloading tqdm-4.65.0-py3-none-any.whl (77 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.1/77.1 KB\u001b[0m \u001b[31m34.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting joblib>=0.14\n",
" Downloading joblib-1.2.0-py3-none-any.whl (297 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m298.0/298.0 KB\u001b[0m \u001b[31m114.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting wheel\n",
" Downloading wheel-0.40.0-py3-none-any.whl (64 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m64.5/64.5 KB\u001b[0m \u001b[31m27.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting py4j==0.10.9.7\n",
" Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m200.5/200.5 KB\u001b[0m \u001b[31m84.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting six>=1.5\n",
" Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)\n",
"Collecting threadpoolctl>=2.0.0\n",
" Downloading threadpoolctl-3.1.0-py3-none-any.whl (14 kB)\n",
"Collecting greenlet!=0.4.17\n",
" Downloading greenlet-2.0.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (618 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m618.5/618.5 KB\u001b[0m \u001b[31m131.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting typing-extensions>=4.2.0\n",
" Downloading typing_extensions-4.5.0-py3-none-any.whl (27 kB)\n",
"Collecting importlib-metadata\n",
" Downloading importlib_metadata-6.5.0-py3-none-any.whl (22 kB)\n",
"Collecting Mako\n",
" Downloading Mako-1.2.4-py3-none-any.whl (78 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.7/78.7 KB\u001b[0m \u001b[31m39.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting importlib-resources\n",
" Downloading importlib_resources-5.12.0-py3-none-any.whl (36 kB)\n",
"Collecting cmd2>=1.0.0\n",
" Downloading cmd2-2.4.3-py3-none-any.whl (147 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m147.2/147.2 KB\u001b[0m \u001b[31m68.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting autopage>=0.4.0\n",
" Downloading autopage-0.5.1-py3-none-any.whl (29 kB)\n",
"Collecting PrettyTable>=0.7.2\n",
" Downloading prettytable-3.7.0-py3-none-any.whl (27 kB)\n",
"Collecting stevedore>=2.0.1\n",
" Downloading stevedore-5.0.0-py3-none-any.whl (49 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.6/49.6 KB\u001b[0m \u001b[31m23.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting PyYAML>=3.12\n",
" Downloading PyYAML-6.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (701 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m701.2/701.2 KB\u001b[0m \u001b[31m121.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting wcwidth>=0.1.7\n",
" Downloading wcwidth-0.2.6-py2.py3-none-any.whl (29 kB)\n",
"Collecting attrs>=16.3.0\n",
" Downloading attrs-23.1.0-py3-none-any.whl (61 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m61.2/61.2 KB\u001b[0m \u001b[31m33.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting pyperclip>=1.6\n",
" Downloading pyperclip-1.8.2.tar.gz (20 kB)\n",
" Preparing metadata (setup.py) ... \u001b[?25l-\b \bdone\n",
"\u001b[?25hCollecting zipp>=0.5\n",
" Downloading zipp-3.15.0-py3-none-any.whl (6.8 kB)\n",
"Collecting pbr!=2.1.0,>=2.0.0\n",
" Downloading pbr-5.11.1-py2.py3-none-any.whl (112 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m112.7/112.7 KB\u001b[0m \u001b[31m51.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting MarkupSafe>=0.9.2\n",
" Downloading MarkupSafe-2.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)\n",
"Building wheels for collected packages: pyspark, flaml, pyperclip\n",
" Building wheel for pyspark (setup.py) ... \u001b[?25l-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \bdone\n",
"\u001b[?25h Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317145 sha256=27ed3d6841f2401a2d7018b6b56c164357334e10761228b12c0e5294db8985a4\n",
" Stored in directory: /home/trusted-service-user/.cache/pip/wheels/27/3e/a7/888155c6a7f230b13a394f4999b90fdfaed00596c68d3de307\n",
" Building wheel for flaml (setup.py) ... \u001b[?25l-\b \b\\\b \bdone\n",
"\u001b[?25h Created wheel for flaml: filename=FLAML-1.2.1-py3-none-any.whl size=248482 sha256=01f9d2f101b46c0104ad8919d4a65470ce54f23ef8b3671ac4bb12c2ba6db7dd\n",
" Stored in directory: /tmp/pip-ephem-wheel-cache-o_3986sn/wheels/5c/1a/48/c07dfe482b630f96d7258700d361a971759465895f9dd768ee\n",
" Building wheel for pyperclip (setup.py) ... \u001b[?25l-\b \bdone\n",
"\u001b[?25h Created wheel for pyperclip: filename=pyperclip-1.8.2-py3-none-any.whl size=11107 sha256=e1d85f669e71af3e8f45ffedf4e41257741b841bef852247b94ba8bfff3162ba\n",
" Stored in directory: /home/trusted-service-user/.cache/pip/wheels/7f/1a/65/84ff8c386bec21fca6d220ea1f5498a0367883a78dd5ba6122\n",
"Successfully built pyspark flaml pyperclip\n",
"Installing collected packages: wcwidth, pytz, pyperclip, py4j, zipp, wheel, typing-extensions, tqdm, threadpoolctl, six, PyYAML, pyspark, PrettyTable, pbr, packaging, numpy, MarkupSafe, joblib, greenlet, colorlog, autopage, attrs, stevedore, sqlalchemy, scipy, python-dateutil, Mako, joblibspark, importlib-resources, importlib-metadata, cmd2, cmaes, xgboost, scikit-learn, pandas, cliff, alembic, optuna, lightgbm, flaml\n",
" Attempting uninstall: wcwidth\n",
" Found existing installation: wcwidth 0.2.5\n",
" Not uninstalling wcwidth at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'wcwidth'. No files were found to uninstall.\n",
" Attempting uninstall: pytz\n",
" Found existing installation: pytz 2021.1\n",
" Not uninstalling pytz at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'pytz'. No files were found to uninstall.\n",
" Attempting uninstall: pyperclip\n",
" Found existing installation: pyperclip 1.8.2\n",
" Not uninstalling pyperclip at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'pyperclip'. No files were found to uninstall.\n",
" Attempting uninstall: py4j\n",
" Found existing installation: py4j 0.10.9.3\n",
" Not uninstalling py4j at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'py4j'. No files were found to uninstall.\n",
" Attempting uninstall: zipp\n",
" Found existing installation: zipp 3.5.0\n",
" Not uninstalling zipp at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'zipp'. No files were found to uninstall.\n",
" Attempting uninstall: wheel\n",
" Found existing installation: wheel 0.36.2\n",
" Not uninstalling wheel at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'wheel'. No files were found to uninstall.\n",
" Attempting uninstall: typing-extensions\n",
" Found existing installation: typing-extensions 3.10.0.0\n",
" Not uninstalling typing-extensions at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'typing-extensions'. No files were found to uninstall.\n",
" Attempting uninstall: tqdm\n",
" Found existing installation: tqdm 4.61.2\n",
" Not uninstalling tqdm at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'tqdm'. No files were found to uninstall.\n",
" Attempting uninstall: threadpoolctl\n",
" Found existing installation: threadpoolctl 2.1.0\n",
" Not uninstalling threadpoolctl at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'threadpoolctl'. No files were found to uninstall.\n",
" Attempting uninstall: six\n",
" Found existing installation: six 1.16.0\n",
" Not uninstalling six at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'six'. No files were found to uninstall.\n",
" Attempting uninstall: PyYAML\n",
" Found existing installation: PyYAML 5.4.1\n",
" Not uninstalling pyyaml at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'PyYAML'. No files were found to uninstall.\n",
" Attempting uninstall: pyspark\n",
" Found existing installation: pyspark 3.2.1\n",
" Not uninstalling pyspark at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'pyspark'. No files were found to uninstall.\n",
" Attempting uninstall: PrettyTable\n",
" Found existing installation: prettytable 2.4.0\n",
" Not uninstalling prettytable at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'prettytable'. No files were found to uninstall.\n",
" Attempting uninstall: packaging\n",
" Found existing installation: packaging 21.0\n",
" Not uninstalling packaging at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'packaging'. No files were found to uninstall.\n",
" Attempting uninstall: numpy\n",
" Found existing installation: numpy 1.19.4\n",
" Not uninstalling numpy at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'numpy'. No files were found to uninstall.\n",
" Attempting uninstall: MarkupSafe\n",
" Found existing installation: MarkupSafe 2.0.1\n",
" Not uninstalling markupsafe at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'MarkupSafe'. No files were found to uninstall.\n",
" Attempting uninstall: joblib\n",
" Found existing installation: joblib 1.0.1\n",
" Not uninstalling joblib at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'joblib'. No files were found to uninstall.\n",
" Attempting uninstall: greenlet\n",
" Found existing installation: greenlet 1.1.0\n",
" Not uninstalling greenlet at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'greenlet'. No files were found to uninstall.\n",
" Attempting uninstall: attrs\n",
" Found existing installation: attrs 21.2.0\n",
" Not uninstalling attrs at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'attrs'. No files were found to uninstall.\n",
" Attempting uninstall: sqlalchemy\n",
" Found existing installation: SQLAlchemy 1.4.20\n",
" Not uninstalling sqlalchemy at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'SQLAlchemy'. No files were found to uninstall.\n",
" Attempting uninstall: scipy\n",
" Found existing installation: scipy 1.5.3\n",
" Not uninstalling scipy at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'scipy'. No files were found to uninstall.\n",
" Attempting uninstall: python-dateutil\n",
" Found existing installation: python-dateutil 2.8.1\n",
" Not uninstalling python-dateutil at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'python-dateutil'. No files were found to uninstall.\n",
" Attempting uninstall: importlib-resources\n",
" Found existing installation: importlib-resources 5.10.0\n",
" Not uninstalling importlib-resources at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'importlib-resources'. No files were found to uninstall.\n",
" Attempting uninstall: importlib-metadata\n",
" Found existing installation: importlib-metadata 4.6.1\n",
" Not uninstalling importlib-metadata at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'importlib-metadata'. No files were found to uninstall.\n",
" Attempting uninstall: xgboost\n",
" Found existing installation: xgboost 1.4.0\n",
" Not uninstalling xgboost at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'xgboost'. No files were found to uninstall.\n",
" Attempting uninstall: scikit-learn\n",
" Found existing installation: scikit-learn 0.23.2\n",
" Not uninstalling scikit-learn at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'scikit-learn'. No files were found to uninstall.\n",
" Attempting uninstall: pandas\n",
" Found existing installation: pandas 1.2.3\n",
" Not uninstalling pandas at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'pandas'. No files were found to uninstall.\n",
" Attempting uninstall: lightgbm\n",
" Found existing installation: lightgbm 3.2.1\n",
" Not uninstalling lightgbm at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f\n",
" Can't uninstall 'lightgbm'. No files were found to uninstall.\n",
"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
"tensorflow 2.4.1 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.\n",
"tensorflow 2.4.1 requires typing-extensions~=3.7.4, but you have typing-extensions 4.5.0 which is incompatible.\n",
"pmdarima 1.8.2 requires numpy~=1.19.0, but you have numpy 1.23.4 which is incompatible.\n",
"koalas 1.8.0 requires numpy<1.20.0,>=1.14, but you have numpy 1.23.4 which is incompatible.\n",
"gevent 21.1.2 requires greenlet<2.0,>=0.4.17; platform_python_implementation == \"CPython\", but you have greenlet 2.0.2 which is incompatible.\u001b[0m\u001b[31m\n",
"\u001b[0mSuccessfully installed Mako-1.2.4 MarkupSafe-2.1.2 PrettyTable-3.7.0 PyYAML-6.0 alembic-1.10.3 attrs-23.1.0 autopage-0.5.1 cliff-4.2.0 cmaes-0.9.1 cmd2-2.4.3 colorlog-6.7.0 flaml-1.2.1 greenlet-2.0.2 importlib-metadata-6.5.0 importlib-resources-5.12.0 joblib-1.2.0 joblibspark-0.5.1 lightgbm-3.3.5 numpy-1.23.4 optuna-2.8.0 packaging-23.1 pandas-1.5.1 pbr-5.11.1 py4j-0.10.9.7 pyperclip-1.8.2 pyspark-3.4.0 python-dateutil-2.8.2 pytz-2023.3 scikit-learn-1.2.2 scipy-1.10.1 six-1.16.0 sqlalchemy-2.0.9 stevedore-5.0.0 threadpoolctl-3.1.0 tqdm-4.65.0 typing-extensions-4.5.0 wcwidth-0.2.6 wheel-0.40.0 xgboost-1.6.1 zipp-3.15.0\n",
"\u001b[33mWARNING: You are using pip version 22.0.4; however, version 23.1 is available.\n",
"You should consider upgrading via the '/nfs4/pyenv-8895058f-cb80-488b-b82d-c341dcde311f/bin/python -m pip install --upgrade pip' command.\u001b[0m\u001b[33m\n",
"\u001b[0mNote: you may need to restart the kernel to use updated packages.\n"
]
},
{
"data": {},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning: PySpark kernel has been restarted to use updated packages.\n",
"\n"
]
}
],
"source": [
"%pip install flaml[synapse]==1.2.1 xgboost==1.6.1 pandas==1.5.1 numpy==1.23.4 --force-reinstall"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Uncomment `_init_spark()` if run in local spark env."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def _init_spark():\n",
" import pyspark\n",
"\n",
" spark = (\n",
" pyspark.sql.SparkSession.builder.appName(\"MyApp\")\n",
" .master(\"local[2]\")\n",
" .config(\n",
" \"spark.jars.packages\",\n",
" (\n",
" \"com.microsoft.azure:synapseml_2.12:0.10.2,\"\n",
" \"org.apache.hadoop:hadoop-azure:3.3.5,\"\n",
" \"com.microsoft.azure:azure-storage:8.6.6\"\n",
" ),\n",
" )\n",
" .config(\"spark.jars.repositories\", \"https://mmlspark.azureedge.net/maven\")\n",
" .config(\"spark.sql.debug.maxToStringFields\", \"100\")\n",
" .getOrCreate()\n",
" )\n",
" return spark\n",
"\n",
"# spark = _init_spark()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:49:38.7324858Z",
"execution_start_time": "2023-04-19T00:49:38.4750792Z",
"livy_statement_state": "available",
"parent_msg_id": "fa770a66-05ff-46d0-81b3-3f21c6be1ecd",
"queued_time": "2023-04-19T00:41:29.8741671Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 8
},
"text/plain": [
"StatementMeta(automl, 27, 8, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"spark.conf.set(\"spark.sql.execution.arrow.pyspark.enabled\", \"false\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"## Demo overview\n",
"In this example, we use FLAML & Apache Spark to build a classification model in order to predict bankruptcy.\n",
"1. **Tune**: Given an Apache Spark dataframe, we can use FLAML to tune a SynapseML Spark-based model.\n",
"2. **AutoML**: Given an Apache Spark dataframe, we can run AutoML to find the best classification model given our constraints.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Load data and preprocess"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:50:12.8686555Z",
"execution_start_time": "2023-04-19T00:49:39.0071841Z",
"livy_statement_state": "available",
"parent_msg_id": "f4fddcb8-daa9-4e51-82df-a026ad09848d",
"queued_time": "2023-04-19T00:41:29.8758509Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 9
},
"text/plain": [
"StatementMeta(automl, 27, 9, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"records read: 6819\n"
]
}
],
"source": [
"df = (\n",
" spark.read.format(\"csv\")\n",
" .option(\"header\", True)\n",
" .option(\"inferSchema\", True)\n",
" .load(\n",
" \"wasbs://publicwasb@mmlspark.blob.core.windows.net/company_bankruptcy_prediction_data.csv\"\n",
" )\n",
")\n",
"# print dataset size\n",
"print(\"records read: \" + str(df.count()))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:50:17.1147492Z",
"execution_start_time": "2023-04-19T00:50:13.1478957Z",
"livy_statement_state": "available",
"parent_msg_id": "c3124278-a1fc-4678-ab90-8c1c61b252ed",
"queued_time": "2023-04-19T00:41:29.8770146Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 10
},
"text/plain": [
"StatementMeta(automl, 27, 10, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.synapse.widget-view+json": {
"widget_id": "27e3f6a9-6707-4f94-93cf-05ea98845414",
"widget_type": "Synapse.DataFrame"
},
"text/plain": [
"SynapseWidget(Synapse.DataFrame, 27e3f6a9-6707-4f94-93cf-05ea98845414)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the dataset into train and test"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:55:34.297498Z",
"execution_start_time": "2023-04-19T00:55:34.0061545Z",
"livy_statement_state": "available",
"parent_msg_id": "b7b9be0c-e8cb-4229-a2fb-95f5e0a9bd8f",
"queued_time": "2023-04-19T00:55:33.7779796Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 25
},
"text/plain": [
"StatementMeta(automl, 27, 25, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"train_raw, test_raw = df.randomSplit([0.8, 0.2], seed=41)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Add featurizer to convert features to vector"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:55:49.7837815Z",
"execution_start_time": "2023-04-19T00:55:49.5176322Z",
"livy_statement_state": "available",
"parent_msg_id": "faa6ab52-b98d-4e32-b569-ee27c282ff6e",
"queued_time": "2023-04-19T00:55:49.2823774Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 26
},
"text/plain": [
"StatementMeta(automl, 27, 26, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from pyspark.ml.feature import VectorAssembler\n",
"\n",
"feature_cols = df.columns[1:]\n",
"featurizer = VectorAssembler(inputCols=feature_cols, outputCol=\"features\")\n",
"train_data = featurizer.transform(train_raw)[\"Bankrupt?\", \"features\"]\n",
"test_data = featurizer.transform(test_raw)[\"Bankrupt?\", \"features\"]"
]
},
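{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can confirm that the assembled data now contains just the label column and a single vector-valued `features` column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: schema and a few rows of the assembled training data\n",
"train_data.printSchema()\n",
"train_data.show(5, truncate=60)"
]
},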
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Default SynapseML LightGBM"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:56:14.2639565Z",
"execution_start_time": "2023-04-19T00:55:53.757847Z",
"livy_statement_state": "available",
"parent_msg_id": "29d11dfb-a2ef-4a1e-9dc6-d41d832e83ed",
"queued_time": "2023-04-19T00:55:53.5050188Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 27
},
"text/plain": [
"StatementMeta(automl, 27, 27, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from synapse.ml.lightgbm import LightGBMClassifier\n",
"\n",
"model = LightGBMClassifier(\n",
" objective=\"binary\", featuresCol=\"features\", labelCol=\"Bankrupt?\", isUnbalance=True\n",
")\n",
"\n",
"model = model.fit(train_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Model Prediction"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:56:19.165521Z",
"execution_start_time": "2023-04-19T00:56:14.5127236Z",
"livy_statement_state": "available",
"parent_msg_id": "27aa0ad6-99e5-489f-ab26-b26b1f10834e",
"queued_time": "2023-04-19T00:55:56.0549337Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 28
},
"text/plain": [
"StatementMeta(automl, 27, 28, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------------+--------------------+------------------+-------------------+------------------+------------------+\n",
"|evaluation_type| confusion_matrix| accuracy| precision| recall| AUC|\n",
"+---------------+--------------------+------------------+-------------------+------------------+------------------+\n",
"| Classification|1253.0 20.0 \\n2...|0.9627942293090357|0.42857142857142855|0.3409090909090909|0.6625990859101621|\n",
"+---------------+--------------------+------------------+-------------------+------------------+------------------+\n",
"\n"
]
}
],
"source": [
"def predict(model, test_data=test_data):\n",
" from synapse.ml.train import ComputeModelStatistics\n",
"\n",
" predictions = model.transform(test_data)\n",
" \n",
" metrics = ComputeModelStatistics(\n",
" evaluationMetric=\"classification\",\n",
" labelCol=\"Bankrupt?\",\n",
" scoredLabelsCol=\"prediction\",\n",
" ).transform(predictions)\n",
" return metrics\n",
"\n",
"default_metrics = predict(model)\n",
"default_metrics.show()"
]
},
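{
"cell_type": "markdown",
"metadata": {},
"source": [
"For easier side-by-side comparison later, you can pull the scalar AUC out of the one-row metrics dataframe, the same pattern the `train()` function uses below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Keep the default model's AUC as a scalar for later comparison\n",
"default_auc = default_metrics.toPandas()[\"AUC\"][0]\n",
"print(\"Default SynapseML LightGBM AUC:\", default_auc)"
]
},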
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"## Run FLAML Tune"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:56:19.7604089Z",
"execution_start_time": "2023-04-19T00:56:19.4650633Z",
"livy_statement_state": "available",
"parent_msg_id": "22ff4c92-83c4-433e-8525-4ecb193c7d4e",
"queued_time": "2023-04-19T00:55:59.6397744Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 29
},
"text/plain": [
"StatementMeta(automl, 27, 29, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"train_data_sub, val_data_sub = train_data.randomSplit([0.8, 0.2], seed=41)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:50:56.2968207Z",
"execution_start_time": "2023-04-19T00:50:56.0058549Z",
"livy_statement_state": "available",
"parent_msg_id": "f0106eec-a889-4e51-86b2-ea899afb7612",
"queued_time": "2023-04-19T00:41:29.8989617Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 16
},
"text/plain": [
"StatementMeta(automl, 27, 16, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def train(lambdaL1, learningRate, numLeaves, numIterations, train_data=train_data_sub, val_data=val_data_sub):\n",
" \"\"\"\n",
" This train() function:\n",
" - takes hyperparameters as inputs (for tuning later)\n",
" - returns the AUC score on the validation dataset\n",
"\n",
" Wrapping code as a function makes it easier to reuse the code later for tuning.\n",
" \"\"\"\n",
"\n",
" lgc = LightGBMClassifier(\n",
" objective=\"binary\",\n",
" lambdaL1=lambdaL1,\n",
" learningRate=learningRate,\n",
" numLeaves=numLeaves,\n",
" labelCol=\"Bankrupt?\",\n",
" numIterations=numIterations,\n",
" isUnbalance=True,\n",
" featuresCol=\"features\",\n",
" )\n",
"\n",
" model = lgc.fit(train_data)\n",
"\n",
" # Define an evaluation metric and evaluate the model on the validation dataset.\n",
" eval_metric = predict(model, val_data)\n",
" eval_metric = eval_metric.toPandas()['AUC'][0]\n",
"\n",
" return model, eval_metric"
]
},
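{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can smoke-test `train()` with a single configuration before handing it to the tuner (the hyperparameter values below are illustrative, not tuned):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional single run of train() with arbitrary, illustrative hyperparameters\n",
"_, manual_auc = train(lambdaL1=0.1, learningRate=0.1, numLeaves=31, numIterations=100)\n",
"print(\"validation AUC:\", manual_auc)"
]
},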
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"jupyter": {
"outputs_hidden": true,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:56:20.3156028Z",
"execution_start_time": "2023-04-19T00:56:20.0366204Z",
"livy_statement_state": "available",
"parent_msg_id": "c5c60e40-1edf-4d4f-a106-77ac86ba288c",
"queued_time": "2023-04-19T00:56:07.4221398Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 30
},
"text/plain": [
"StatementMeta(automl, 27, 30, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import flaml\n",
"import time\n",
"\n",
"# define the search space\n",
"params = {\n",
" \"lambdaL1\": flaml.tune.uniform(0.001, 1),\n",
" \"learningRate\": flaml.tune.uniform(0.001, 1),\n",
" \"numLeaves\": flaml.tune.randint(30, 100),\n",
" \"numIterations\": flaml.tune.randint(100, 300),\n",
"}\n",
"\n",
"# define the tune function\n",
"def flaml_tune(config):\n",
" _, metric = train(**config)\n",
" return {\"auc\": metric}"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:57:20.6355868Z",
"execution_start_time": "2023-04-19T00:56:20.5770855Z",
"livy_statement_state": "available",
"parent_msg_id": "ea4962b9-33e8-459b-8b6f-acb4ae7a13d8",
"queued_time": "2023-04-19T00:56:10.1336409Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 31
},
"text/plain": [
"StatementMeta(automl, 27, 31, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[flaml.tune.tune: 04-19 00:56:20] {508} INFO - Using search algorithm BlendSearch.\n",
"No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune\n",
"You passed a `space` parameter to OptunaSearch that contained unresolved search space definitions. OptunaSearch should however be instantiated with fully configured search spaces only. To use Ray Tune's automatic search space conversion, pass the space definition as part of the `config` argument to `tune.run()` instead.\n",
"[flaml.tune.tune: 04-19 00:56:20] {777} INFO - trial 1 config: {'lambdaL1': 0.09833464080607023, 'learningRate': 0.64761881525086, 'numLeaves': 30, 'numIterations': 172}\n",
"[flaml.tune.tune: 04-19 00:56:46] {197} INFO - result: {'auc': 0.7350263891359782, 'training_iteration': 0, 'config': {'lambdaL1': 0.09833464080607023, 'learningRate': 0.64761881525086, 'numLeaves': 30, 'numIterations': 172}, 'config/lambdaL1': 0.09833464080607023, 'config/learningRate': 0.64761881525086, 'config/numLeaves': 30, 'config/numIterations': 172, 'experiment_tag': 'exp', 'time_total_s': 25.78124713897705}\n",
"[flaml.tune.tune: 04-19 00:56:46] {777} INFO - trial 2 config: {'lambdaL1': 0.7715493226234792, 'learningRate': 0.021731197410042098, 'numLeaves': 74, 'numIterations': 249}\n",
"[flaml.tune.tune: 04-19 00:57:19] {197} INFO - result: {'auc': 0.7648994840775662, 'training_iteration': 0, 'config': {'lambdaL1': 0.7715493226234792, 'learningRate': 0.021731197410042098, 'numLeaves': 74, 'numIterations': 249}, 'config/lambdaL1': 0.7715493226234792, 'config/learningRate': 0.021731197410042098, 'config/numLeaves': 74, 'config/numIterations': 249, 'experiment_tag': 'exp', 'time_total_s': 33.43822383880615}\n",
"[flaml.tune.tune: 04-19 00:57:19] {777} INFO - trial 3 config: {'lambdaL1': 0.49900850529028784, 'learningRate': 0.2255718488853168, 'numLeaves': 43, 'numIterations': 252}\n",
"\n"
]
}
],
"source": [
"analysis = flaml.tune.run(\n",
" flaml_tune,\n",
" params,\n",
" time_budget_s=60,\n",
" num_samples=100,\n",
" metric=\"auc\",\n",
" mode=\"max\",\n",
" verbose=5,\n",
" force_cancel=True,\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"Best config and metric on validation data"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:57:21.2098285Z",
"execution_start_time": "2023-04-19T00:57:20.9439827Z",
"livy_statement_state": "available",
"parent_msg_id": "e99f17e0-cd3e-4292-bc10-180386aaf810",
"queued_time": "2023-04-19T00:56:15.0604124Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 32
},
"text/plain": [
"StatementMeta(automl, 27, 32, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Best config: {'lambdaL1': 0.7715493226234792, 'learningRate': 0.021731197410042098, 'numLeaves': 74, 'numIterations': 249}\n",
"Best metrics on validation data: {'auc': 0.7648994840775662, 'training_iteration': 0, 'config': {'lambdaL1': 0.7715493226234792, 'learningRate': 0.021731197410042098, 'numLeaves': 74, 'numIterations': 249}, 'config/lambdaL1': 0.7715493226234792, 'config/learningRate': 0.021731197410042098, 'config/numLeaves': 74, 'config/numIterations': 249, 'experiment_tag': 'exp', 'time_total_s': 33.43822383880615}\n"
]
}
],
"source": [
"tune_config = analysis.best_config\n",
"tune_metrics_val = analysis.best_result\n",
"print(\"Best config: \", tune_config)\n",
"print(\"Best metrics on validation data: \", tune_metrics_val)"
]
},
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"Retrain model on whole train_data and check metrics on test_data"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:58:23.0787571Z",
"execution_start_time": "2023-04-19T00:57:21.4709435Z",
"livy_statement_state": "available",
"parent_msg_id": "35edd709-9c68-4646-8a8f-e757fae8a919",
"queued_time": "2023-04-19T00:56:18.2245009Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 33
},
"text/plain": [
"StatementMeta(automl, 27, 33, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------------+--------------------+------------------+------------------+-------------------+------------------+\n",
"|evaluation_type| confusion_matrix| accuracy| precision| recall| AUC|\n",
"+---------------+--------------------+------------------+------------------+-------------------+------------------+\n",
"| Classification|1247.0 26.0 \\n2...|0.9597570235383447|0.3953488372093023|0.38636363636363635|0.6829697207741198|\n",
"+---------------+--------------------+------------------+------------------+-------------------+------------------+\n",
"\n"
]
}
],
"source": [
"tune_model, tune_metrics = train(train_data=train_data, val_data=test_data, **tune_config)\n",
"tune_metrics = predict(tune_model)\n",
"tune_metrics.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run FLAML AutoML\n",
"In the FLAML AutoML run configuration, users can specify the task type, time budget, error metric, learner list, whether to subsample, resampling strategy type, and so on. All these arguments have default values which will be used if users do not provide them. "
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:58:23.596951Z",
"execution_start_time": "2023-04-19T00:58:23.3265305Z",
"livy_statement_state": "available",
"parent_msg_id": "339c4992-4670-4593-a297-e08970e8ef34",
"queued_time": "2023-04-19T00:56:23.3561861Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 34
},
"text/plain": [
"StatementMeta(automl, 27, 34, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"''' import AutoML class from the FLAML package '''\n",
"from flaml import AutoML\n",
"from flaml.automl.spark.utils import to_pandas_on_spark\n",
"\n",
"automl = AutoML()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:58:24.1706079Z",
"execution_start_time": "2023-04-19T00:58:23.8891255Z",
"livy_statement_state": "available",
"parent_msg_id": "ab1eeb7b-d8fc-4917-9b0d-0e9e05778e6b",
"queued_time": "2023-04-19T00:56:26.0836197Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 35
},
"text/plain": [
"StatementMeta(automl, 27, 35, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import os\n",
"settings = {\n",
" \"time_budget\": 60, # total running time in seconds\n",
" \"metric\": 'roc_auc',\n",
" \"task\": 'classification', # task type\n",
" \"log_file_name\": 'flaml_experiment.log', # flaml log file\n",
" \"seed\": 42, # random seed\n",
" \"force_cancel\": True, # force stop training once time_budget is used up\n",
"}"
]
},
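{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can make the learner list explicit via `estimator_list`. For pandas-on-Spark input, FLAML uses Spark-based learners such as `lgbm_spark`, which is the estimator you will see in the AutoML log below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: pin the search to the Spark-based LightGBM learner.\n",
"# FLAML selects Spark estimators automatically for pandas-on-Spark data,\n",
"# so this only makes the default choice explicit.\n",
"settings[\"estimator_list\"] = [\"lgbm_spark\"]"
]
},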
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:58:24.6581809Z",
"execution_start_time": "2023-04-19T00:58:24.4054632Z",
"livy_statement_state": "available",
"parent_msg_id": "fad5e330-6ea9-4387-9da0-72090ee12857",
"queued_time": "2023-04-19T00:56:56.6277279Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 36
},
"text/plain": [
"StatementMeta(automl, 27, 36, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"pyspark.pandas.frame.DataFrame"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = to_pandas_on_spark(train_data)\n",
"\n",
"type(df)"
]
},
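{
"cell_type": "markdown",
"metadata": {},
"source": [
"`to_pandas_on_spark` wraps the Spark dataframe as a pandas-on-Spark dataframe without collecting it to the driver. Assuming pyspark >= 3.2, the built-in `DataFrame.pandas_api()` is an equivalent conversion:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Equivalent conversion using pyspark directly (assumes pyspark >= 3.2);\n",
"# both yield a pyspark.pandas DataFrame backed by the same Spark data\n",
"df_alt = train_data.pandas_api()\n",
"print(type(df_alt))"
]
},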
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:59:23.5292768Z",
"execution_start_time": "2023-04-19T00:58:24.9037573Z",
"livy_statement_state": "available",
"parent_msg_id": "e85fc33c-0a39-4ec5-a18f-625e4e5991da",
"queued_time": "2023-04-19T00:57:11.2416765Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 37
},
"text/plain": [
"StatementMeta(automl, 27, 37, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[flaml.automl.logger: 04-19 00:58:37] {1682} INFO - task = classification\n",
"[flaml.automl.logger: 04-19 00:58:37] {1689} INFO - Data split method: stratified\n",
"[flaml.automl.logger: 04-19 00:58:37] {1692} INFO - Evaluation method: cv\n",
"[flaml.automl.logger: 04-19 00:58:38] {1790} INFO - Minimizing error metric: 1-roc_auc\n",
"[flaml.automl.logger: 04-19 00:58:38] {1900} INFO - List of ML learners in AutoML Run: ['lgbm_spark']\n",
"[flaml.automl.logger: 04-19 00:58:38] {2210} INFO - iteration 0, current learner lgbm_spark\n",
"[flaml.automl.logger: 04-19 00:58:48] {2336} INFO - Estimated sufficient time budget=104269s. Estimated necessary time budget=104s.\n",
"[flaml.automl.logger: 04-19 00:58:48] {2383} INFO - at 23.9s,\testimator lgbm_spark's best error=0.1077,\tbest estimator lgbm_spark's best error=0.1077\n",
"[flaml.automl.logger: 04-19 00:58:48] {2210} INFO - iteration 1, current learner lgbm_spark\n",
"[flaml.automl.logger: 04-19 00:58:56] {2383} INFO - at 32.0s,\testimator lgbm_spark's best error=0.0962,\tbest estimator lgbm_spark's best error=0.0962\n",
"[flaml.automl.logger: 04-19 00:58:56] {2210} INFO - iteration 2, current learner lgbm_spark\n",
"[flaml.automl.logger: 04-19 00:59:05] {2383} INFO - at 40.2s,\testimator lgbm_spark's best error=0.0943,\tbest estimator lgbm_spark's best error=0.0943\n",
"[flaml.automl.logger: 04-19 00:59:05] {2210} INFO - iteration 3, current learner lgbm_spark\n",
"[flaml.automl.logger: 04-19 00:59:13] {2383} INFO - at 48.4s,\testimator lgbm_spark's best error=0.0760,\tbest estimator lgbm_spark's best error=0.0760\n",
"[flaml.automl.logger: 04-19 00:59:13] {2210} INFO - iteration 4, current learner lgbm_spark\n",
"[flaml.automl.logger: 04-19 00:59:21] {2383} INFO - at 56.5s,\testimator lgbm_spark's best error=0.0760,\tbest estimator lgbm_spark's best error=0.0760\n",
"[flaml.automl.logger: 04-19 00:59:22] {2619} INFO - retrain lgbm_spark for 0.9s\n",
"[flaml.automl.logger: 04-19 00:59:22] {2622} INFO - retrained model: LightGBMClassifier_b4bfafdbcfc1\n",
"[flaml.automl.logger: 04-19 00:59:22] {1930} INFO - fit succeeded\n",
"[flaml.automl.logger: 04-19 00:59:22] {1931} INFO - Time taken to find the best model: 48.424041748046875\n"
]
}
],
"source": [
"'''The main flaml automl API'''\n",
"automl.fit(dataframe=df, label='Bankrupt?', labelCol=\"Bankrupt?\", isUnbalance=True, **settings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Best model and metric"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:59:24.0559557Z",
"execution_start_time": "2023-04-19T00:59:23.7839019Z",
"livy_statement_state": "available",
"parent_msg_id": "211f9184-8589-414a-a39e-33478b83aa4b",
"queued_time": "2023-04-19T00:57:13.8241448Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 38
},
"text/plain": [
"StatementMeta(automl, 27, 38, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Best hyperparmeter config: {'numIterations': 12, 'numLeaves': 6, 'minDataInLeaf': 17, 'learningRate': 0.1444074361218993, 'log_max_bin': 6, 'featureFraction': 0.9006280463830675, 'lambdaL1': 0.0021638671012090007, 'lambdaL2': 0.8181940184285643}\n",
"Best roc_auc on validation data: 0.924\n",
"Training duration of best run: 0.8982 s\n"
]
}
],
"source": [
"''' retrieve best config'''\n",
"print('Best hyperparmeter config:', automl.best_config)\n",
"print('Best roc_auc on validation data: {0:.4g}'.format(1-automl.best_loss))\n",
"print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T00:59:26.6061075Z",
"execution_start_time": "2023-04-19T00:59:24.3019256Z",
"livy_statement_state": "available",
"parent_msg_id": "eb0a6089-adb2-4061-bf64-4e5c4cc228eb",
"queued_time": "2023-04-19T00:57:15.1750669Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 39
},
"text/plain": [
"StatementMeta(automl, 27, 39, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------------+--------------------+------------------+-------------------+------------------+------------------+\n",
"|evaluation_type| confusion_matrix| accuracy| precision| recall| AUC|\n",
"+---------------+--------------------+------------------+-------------------+------------------+------------------+\n",
"| Classification|1106.0 167.0 \\n...|0.8686408504176157|0.18536585365853658|0.8636363636363636|0.8662250946225809|\n",
"+---------------+--------------------+------------------+-------------------+------------------+------------------+\n",
"\n"
]
}
],
"source": [
"automl_metrics = predict(automl.model.estimator)\n",
"automl_metrics.show()"
]
},
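{
"cell_type": "markdown",
"metadata": {},
"source": [
"To summarize the comparison, print the test AUC of the default, tuned, and AutoML models side by side, using the metrics dataframes computed above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Side-by-side test AUC of the default, tuned, and AutoML models\n",
"for name, metrics in [\n",
"    (\"default SynapseML\", default_metrics),\n",
"    (\"FLAML tuned\", tune_metrics),\n",
"    (\"FLAML AutoML\", automl_metrics),\n",
"]:\n",
"    print(name, \"AUC:\", metrics.toPandas()[\"AUC\"][0])"
]
},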
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"## Use Apache Spark to Parallelize AutoML trials and tuning"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T01:10:17.2334202Z",
"execution_start_time": "2023-04-19T01:10:16.938071Z",
"livy_statement_state": "available",
"parent_msg_id": "380652fc-0702-4dff-ba1b-2a74237b414e",
"queued_time": "2023-04-19T01:10:16.7003095Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 44
},
"text/plain": [
"StatementMeta(automl, 27, 44, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"settings = {\n",
" \"time_budget\": 60, # total running time in seconds\n",
" \"metric\": 'roc_auc', # primary metrics for regression can be chosen from: ['mae','mse','r2','rmse','mape']\n",
" \"task\": 'classification', # task type \n",
" \"seed\": 7654321, # random seed\n",
" \"use_spark\": True,\n",
" \"n_concurrent_trials\": 2,\n",
" \"force_cancel\": True,\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [
{
"data": {
"application/vnd.livy.statement-meta+json": {
"execution_finish_time": "2023-04-19T01:10:18.9486035Z",
"execution_start_time": "2023-04-19T01:10:17.4782718Z",
"livy_statement_state": "available",
"parent_msg_id": "9729f077-c1b9-402e-96b9-4fcd9bc960b4",
"queued_time": "2023-04-19T01:10:16.7818706Z",
"session_id": "27",
"session_start_time": null,
"spark_jobs": null,
"spark_pool": "automl",
"state": "finished",
"statement_id": 45
},
"text/plain": [
"StatementMeta(automl, 27, 45, Finished, Available)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"
\n", " | Bankrupt? | \n", "ROA(C) before interest and depreciation before interest | \n", "ROA(A) before interest and % after tax | \n", "ROA(B) before interest and depreciation after tax | \n", "Operating Gross Margin | \n", "Realized Sales Gross Margin | \n", "Operating Profit Rate | \n", "Pre-tax net Interest Rate | \n", "After-tax net Interest Rate | \n", "Non-industry income and expenditure/revenue | \n", "... | \n", "Net Income to Total Assets | \n", "Total assets to GNP price | \n", "No-credit Interval | \n", "Gross Profit to Sales | \n", "Net Income to Stockholder's Equity | \n", "Liability to Equity | \n", "Degree of Financial Leverage (DFL) | \n", "Interest Coverage Ratio (Interest expense to EBIT) | \n", "Net Income Flag | \n", "Equity to Liability | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "0 | \n", "0.0828 | \n", "0.0693 | \n", "0.0884 | \n", "0.6468 | \n", "0.6468 | \n", "0.9971 | \n", "0.7958 | \n", "0.8078 | \n", "0.3047 | \n", "... | \n", "0.0000 | \n", "0.000000e+00 | \n", "0.6237 | \n", "0.6468 | \n", "0.7483 | \n", "0.2847 | \n", "0.0268 | \n", "0.5652 | \n", "1.0 | \n", "0.0199 | \n", "
1 | \n", "0 | \n", "0.1606 | \n", "0.1788 | \n", "0.1832 | \n", "0.5897 | \n", "0.5897 | \n", "0.9986 | \n", "0.7969 | \n", "0.8088 | \n", "0.3034 | \n", "... | \n", "0.5917 | \n", "4.370000e+09 | \n", "0.6236 | \n", "0.5897 | \n", "0.8023 | \n", "0.2947 | \n", "0.0268 | \n", "0.5651 | \n", "1.0 | \n", "0.0151 | \n", "
2 | \n", "0 | \n", "0.2040 | \n", "0.2638 | \n", "0.2598 | \n", "0.4483 | \n", "0.4483 | \n", "0.9959 | \n", "0.7937 | \n", "0.8063 | \n", "0.3034 | \n", "... | \n", "0.6816 | \n", "3.000000e-04 | \n", "0.6221 | \n", "0.4483 | \n", "0.8117 | \n", "0.3038 | \n", "0.0268 | \n", "0.5651 | \n", "1.0 | \n", "0.0136 | \n", "
3 | \n", "0 | \n", "0.2170 | \n", "0.1881 | \n", "0.2451 | \n", "0.5992 | \n", "0.5992 | \n", "0.9962 | \n", "0.7940 | \n", "0.8061 | \n", "0.3034 | \n", "... | \n", "0.6196 | \n", "1.100000e-03 | \n", "0.6236 | \n", "0.5992 | \n", "0.6346 | \n", "0.4359 | \n", "0.0268 | \n", "0.5650 | \n", "1.0 | \n", "0.0108 | \n", "
4 | \n", "0 | \n", "0.2314 | \n", "0.1628 | \n", "0.2068 | \n", "0.6001 | \n", "0.6001 | \n", "0.9988 | \n", "0.7960 | \n", "0.8078 | \n", "0.3015 | \n", "... | \n", "0.5269 | \n", "3.000000e-04 | \n", "0.6241 | \n", "0.6001 | \n", "0.7985 | \n", "0.2903 | \n", "0.0268 | \n", "0.5651 | \n", "1.0 | \n", "0.0164 | \n", "
5 rows × 96 columns
\n", "