# FLAML-Zero: Zero-shot AutoML
## Zero-shot AutoML
There are several ways to use zero-shot AutoML, i.e., to train a model with the data-dependent default configuration.

0. Use estimators in `flaml.default.estimator`.
```python
from flaml.default import LGBMRegressor
estimator = LGBMRegressor()
estimator.fit(X_train, y_train)
estimator.predict(X_test)
```
1. Use `AutoML.fit()` with `starting_points="data"` and `max_iter=0`.
```python
from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True, as_frame=True)
automl = AutoML()
automl_settings = {
    "time_budget": 2,
    "task": "classification",
    "log_file_name": "test/iris.log",
    "starting_points": "data",
    "max_iter": 0,
}
automl.fit(X_train, y_train, **automl_settings)
```
2. Use `flaml.default.preprocess_and_suggest_hyperparams`.
```python
from flaml.default import preprocess_and_suggest_hyperparams
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
hyperparams, estimator_class, X_transformed, y_transformed, feature_transformer, label_transformer = preprocess_and_suggest_hyperparams(
    "classification", X_train, y_train, "lgbm"
)
model = estimator_class(**hyperparams)  # estimator_class is LGBMClassifier
model.fit(X_transformed, y_train)  # LGBMClassifier can handle raw labels
X_test_transformed = feature_transformer.transform(X_test)  # preprocess test data
y_pred = model.predict(X_test_transformed)
```
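The call above fits with `y_train` directly because LGBMClassifier accepts raw labels. For an estimator that expects transformed labels, a hedged sketch of the same flow (using `"xgb_limitdepth"` as the estimator name and assuming that `label_transformer`, when not `None`, follows the scikit-learn encoder convention of exposing `inverse_transform`) could look like:
```python
# Same helper as above, suggesting defaults for xgb_limitdepth instead of lgbm.
hyperparams, estimator_class, X_transformed, y_transformed, feature_transformer, label_transformer = preprocess_and_suggest_hyperparams(
    "classification", X_train, y_train, "xgb_limitdepth"
)
model = estimator_class(**hyperparams)  # e.g., XGBClassifier
model.fit(X_transformed, y_transformed)  # train on the transformed labels
y_pred = model.predict(feature_transformer.transform(X_test))
if label_transformer is not None:
    y_pred = label_transformer.inverse_transform(y_pred)  # map back to the original label space
```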
If you want to use your own meta-learned defaults, specify the path that contains them in `starting_points`. For example,
```python
from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True, as_frame=True)
automl = AutoML()
automl_settings = {
    "time_budget": 2,
    "task": "classification",
    "log_file_name": "test/iris.log",
    "starting_points": "data:test/default",
    "estimator_list": ["lgbm", "xgb_limitdepth", "rf"],
    "max_iter": 0,
}
automl.fit(X_train, y_train, **automl_settings)
```
Since this is a multiclass task, it will look for the following files under `test/default/`:

- `all/multiclass.json`.
- `{learner_name}/multiclass.json` for every learner_name in the estimator_list.

Read the next subsection to understand how to generate these files if you would like to meta-learn the defaults yourself.

To perform hyperparameter search starting with the data-dependent defaults, remove `max_iter=0`.
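For example, a minimal sketch of the preceding settings with `max_iter=0` removed (the 60-second budget is illustrative):
```python
from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True, as_frame=True)
automl = AutoML()
automl_settings = {
    "time_budget": 60,  # illustrative; give the search a real budget instead of stopping at the defaults
    "task": "classification",
    "log_file_name": "test/iris.log",
    "starting_points": "data:test/default",  # begin the search from the data-dependent defaults
    "estimator_list": ["lgbm", "xgb_limitdepth", "rf"],
}
automl.fit(X_train, y_train, **automl_settings)
```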
## Perform Meta Learning
FLAML provides a package `flaml.default` to learn defaults customized for your own tasks/learners/metrics.
### Prepare a collection of training tasks
Collect a diverse set of training tasks. For each task, extract its meta features and save them in a .csv file. For example, `test/default/all/metafeatures.csv`:
```
Dataset,NumberOfInstances,NumberOfFeatures,NumberOfClasses,PercentageOfNumericFeatures
2dplanes,36691,10,0,1.0
adult,43957,14,2,0.42857142857142855
Airlines,485444,7,2,0.42857142857142855
Albert,382716,78,2,0.3333333333333333
Amazon_employee_access,29492,9,2,0.0
bng_breastTumor,104976,9,0,0.1111111111111111
bng_pbc,900000,18,0,0.5555555555555556
car,1555,6,4,0.0
connect-4,60801,42,3,0.0
dilbert,9000,2000,5,1.0
Dionis,374569,60,355,1.0
poker,922509,10,0,1.0
```
The first column is the dataset name, and the remaining four columns are the meta features.
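How you compute the meta features is up to you, as long as the columns are consistent across tasks. A minimal sketch for the four meta features above, given a pandas DataFrame of features and a Series of labels (the helper name and the numeric-dtype heuristic are assumptions for illustration, not part of FLAML):
```python
import pandas as pd


def compute_metafeatures(name: str, X: pd.DataFrame, y: pd.Series, task: str) -> dict:
    """Compute one metafeatures.csv row for a task (illustrative sketch)."""
    n_numeric = sum(pd.api.types.is_numeric_dtype(X[col]) for col in X.columns)
    return {
        "Dataset": name,
        "NumberOfInstances": len(X),
        "NumberOfFeatures": X.shape[1],
        "NumberOfClasses": 0 if task == "regression" else y.nunique(),  # 0 for regression, as in the example above
        "PercentageOfNumericFeatures": n_numeric / X.shape[1],
    }
```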
### Prepare the candidate configurations
You can extract the best configurations for each task in your collection of training tasks by running FLAML on each of them with a long enough budget. Save the best configuration in a .json file under `{location_for_defaults}/{learner_name}/{task_name}.json`. For example,
```python
from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True, as_frame=True)
automl = AutoML()
automl.fit(X_train, y_train, estimator_list=["lgbm"], **settings)  # settings: your usual AutoML settings with a long budget
automl.save_best_config("test/default/lgbm/iris.json")
```
### Evaluate each candidate configuration on each task
Save the evaluation results in a .csv file. For example, save the evaluation results for lgbm under `test/default/lgbm/results.csv`:
```
task,fold,type,result,params
2dplanes,0,regression,0.946366,{'_modeljson': 'lgbm/2dplanes.json'}
2dplanes,0,regression,0.907774,{'_modeljson': 'lgbm/adult.json'}
2dplanes,0,regression,0.901643,{'_modeljson': 'lgbm/Airlines.json'}
2dplanes,0,regression,0.915098,{'_modeljson': 'lgbm/Albert.json'}
2dplanes,0,regression,0.302328,{'_modeljson': 'lgbm/Amazon_employee_access.json'}
2dplanes,0,regression,0.94523,{'_modeljson': 'lgbm/bng_breastTumor.json'}
2dplanes,0,regression,0.945698,{'_modeljson': 'lgbm/bng_pbc.json'}
2dplanes,0,regression,0.946194,{'_modeljson': 'lgbm/car.json'}
2dplanes,0,regression,0.945549,{'_modeljson': 'lgbm/connect-4.json'}
2dplanes,0,regression,0.946232,{'_modeljson': 'lgbm/default.json'}
2dplanes,0,regression,0.945594,{'_modeljson': 'lgbm/dilbert.json'}
2dplanes,0,regression,0.836996,{'_modeljson': 'lgbm/Dionis.json'}
2dplanes,0,regression,0.917152,{'_modeljson': 'lgbm/poker.json'}
adult,0,binary,0.927203,{'_modeljson': 'lgbm/2dplanes.json'}
adult,0,binary,0.932072,{'_modeljson': 'lgbm/adult.json'}
adult,0,binary,0.926563,{'_modeljson': 'lgbm/Airlines.json'}
adult,0,binary,0.928604,{'_modeljson': 'lgbm/Albert.json'}
adult,0,binary,0.911171,{'_modeljson': 'lgbm/Amazon_employee_access.json'}
adult,0,binary,0.930645,{'_modeljson': 'lgbm/bng_breastTumor.json'}
adult,0,binary,0.928603,{'_modeljson': 'lgbm/bng_pbc.json'}
adult,0,binary,0.915825,{'_modeljson': 'lgbm/car.json'}
adult,0,binary,0.919499,{'_modeljson': 'lgbm/connect-4.json'}
adult,0,binary,0.930109,{'_modeljson': 'lgbm/default.json'}
adult,0,binary,0.932453,{'_modeljson': 'lgbm/dilbert.json'}
adult,0,binary,0.921959,{'_modeljson': 'lgbm/Dionis.json'}
adult,0,binary,0.910763,{'_modeljson': 'lgbm/poker.json'}
...
```
The `type` column indicates the type of the task, such as regression, binary or multiclass.
The `result` column stores the evaluation result, assuming the larger the better. The `params` column indicates which json config is used. For example, `lgbm/2dplanes.json` indicates that the best lgbm configuration extracted from 2dplanes is used.
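How you obtain the scores is up to you; the only requirement is the column layout shown above. A minimal sketch that assembles such a file (the `evaluate_config` helper and `my_eval` module are hypothetical stand-ins for your own training/evaluation routine):
```python
import pandas as pd

from my_eval import evaluate_config  # hypothetical: trains lgbm with the stored config on the task and returns a score

rows = []
for task_name, task_type in [("2dplanes", "regression"), ("adult", "binary")]:
    for source in ["2dplanes", "adult", "default"]:  # one row per candidate configuration
        config = f"lgbm/{source}.json"
        score = evaluate_config(task_name, f"test/default/{config}")
        rows.append(
            {"task": task_name, "fold": 0, "type": task_type, "result": score, "params": {"_modeljson": config}}
        )
pd.DataFrame(rows).to_csv("test/default/lgbm/results.csv", index=False)
```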
### Learn data-dependent defaults
To recap, the inputs required for meta-learning are:
1. Metafeatures: e.g., `{location}/all/metafeatures.csv`.
1. Configurations: `{location}/{learner_name}/{task_name}.json`.
1. Evaluation results: `{location}/{learner_name}/results.csv`.

For example, if the input location is `test/default` and the learners are lgbm, xgb_limitdepth and rf, the following command learns data-dependent defaults for binary classification tasks:
```bash
python portfolio.py --output test/default --input test/default --metafeatures test/default/all/metafeatures.csv --task binary --estimator lgbm xgb_limitdepth rf
```
It will produce the following files as output:

- `test/default/lgbm/binary.json`: the learned defaults for lgbm.
- `test/default/xgb_limitdepth/binary.json`: the learned defaults for xgb_limitdepth.
- `test/default/rf/binary.json`: the learned defaults for rf.
- `test/default/all/binary.json`: the learned defaults for lgbm, xgb_limitdepth and rf together.

Change "binary" to "multiclass" or "regression" for the other tasks.
## Reference
For more technical details, please check our research paper.
* [Mining Robust Default Configurations for Resource-constrained AutoML](https://arxiv.org/abs/2202.09927). Moe Kayali, Chi Wang. arXiv preprint arXiv:2202.09927 (2022).
```bibtex
@article{Kayali2022default,
    title={Mining Robust Default Configurations for Resource-constrained AutoML},
    author={Moe Kayali and Chi Wang},
    year={2022},
    journal={arXiv preprint arXiv:2202.09927},
}
```