# FLAML-Zero: Zero-shot AutoML
## Zero-shot AutoML
There are several ways to use zero-shot AutoML, i.e., to train a model with the data-dependent default configuration.

0. Use estimators in `flaml.default.estimator`.
```python
from flaml.default import LGBMRegressor
estimator = LGBMRegressor()
estimator.fit(X_train, y_train)
estimator.predict(X_test)
```
1. Use `AutoML.fit()` with `starting_points="data"` and `max_iter=0`.
```python
from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True, as_frame=True)
automl = AutoML()
automl_settings = {
    "time_budget": 2,
    "task": "classification",
    "log_file_name": "test/iris.log",
    "starting_points": "data",
    "max_iter": 0,
}
automl.fit(X_train, y_train, **automl_settings)
```
2. Use `flaml.default.preprocess_and_suggest_hyperparams`.
```python
from flaml.default import preprocess_and_suggest_hyperparams
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
hyperparams, estimator_class, X_transformed, y_transformed, feature_transformer, label_transformer = preprocess_and_suggest_hyperparams(
    "classification", X_train, y_train, "lgbm"
)
model = estimator_class(**hyperparams)  # estimator_class is LGBMClassifier
model.fit(X_transformed, y_train)  # LGBMClassifier can handle raw labels
X_test_transformed = feature_transformer.transform(X_test)  # preprocess test data
y_pred = model.predict(X_test_transformed)
```
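The call above fits with `y_train` directly because LGBMClassifier accepts raw labels. For an estimator that expects transformed labels, a hedged sketch of the same flow (using `"xgb_limitdepth"` as the estimator name and assuming that `label_transformer`, when not `None`, follows the scikit-learn encoder convention of exposing `inverse_transform`) could look like:
```python
# Same helper as above, suggesting defaults for xgb_limitdepth instead of lgbm.
hyperparams, estimator_class, X_transformed, y_transformed, feature_transformer, label_transformer = preprocess_and_suggest_hyperparams(
    "classification", X_train, y_train, "xgb_limitdepth"
)
model = estimator_class(**hyperparams)  # e.g., XGBClassifier
model.fit(X_transformed, y_transformed)  # train on the transformed labels
y_pred = model.predict(feature_transformer.transform(X_test))
if label_transformer is not None:
    y_pred = label_transformer.inverse_transform(y_pred)  # map back to the original label space
```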
If you want to use your own meta-learned defaults, specify the path that contains them in `starting_points`. For example,
```python
from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True, as_frame=True)
automl = AutoML()
automl_settings = {
    "time_budget": 2,
    "task": "classification",
    "log_file_name": "test/iris.log",
    "starting_points": "data:test/default",
    "estimator_list": ["lgbm", "xgb_limitdepth", "rf"],
    "max_iter": 0,
}
automl.fit(X_train, y_train, **automl_settings)
```
Since this is a multiclass task, it will look for the following files under `test/default/`:

- `all/multiclass.json`.
- `{learner_name}/multiclass.json` for every learner_name in the estimator_list.

Read the next subsection to understand how to generate these files if you would like to meta-learn the defaults yourself.

To perform hyperparameter search starting with the data-dependent defaults, remove `max_iter=0`.
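For example, a minimal sketch of the preceding settings with `max_iter=0` removed (the 60-second budget is illustrative):
```python
from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True, as_frame=True)
automl = AutoML()
automl_settings = {
    "time_budget": 60,  # illustrative; give the search a real budget instead of stopping at the defaults
    "task": "classification",
    "log_file_name": "test/iris.log",
    "starting_points": "data:test/default",  # begin the search from the data-dependent defaults
    "estimator_list": ["lgbm", "xgb_limitdepth", "rf"],
}
automl.fit(X_train, y_train, **automl_settings)
```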
## Perform Meta Learning
FLAML provides a package `flaml.default` to learn defaults customized for your own tasks/learners/metrics.
### Prepare a collection of training tasks
Collect a diverse set of training tasks. For each task, extract its meta features and save them in a .csv file. For example, `test/default/all/metafeatures.csv`:
```
Dataset,NumberOfInstances,NumberOfFeatures,NumberOfClasses,PercentageOfNumericFeatures
2dplanes,36691,10,0,1.0
adult,43957,14,2,0.42857142857142855
Airlines,485444,7,2,0.42857142857142855
Albert,382716,78,2,0.3333333333333333
Amazon_employee_access,29492,9,2,0.0
bng_breastTumor,104976,9,0,0.1111111111111111
bng_pbc,900000,18,0,0.5555555555555556
car,1555,6,4,0.0
connect-4,60801,42,3,0.0
dilbert,9000,2000,5,1.0
Dionis,374569,60,355,1.0
poker,922509,10,0,1.0
```
The first column is the dataset name, and the remaining four columns are the meta features.
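How you compute the meta features is up to you, as long as the columns are consistent across tasks. A minimal sketch for the four meta features above, given a pandas DataFrame of features and a Series of labels (the helper name and the numeric-dtype heuristic are assumptions for illustration, not part of FLAML):
```python
import pandas as pd


def compute_metafeatures(name: str, X: pd.DataFrame, y: pd.Series, task: str) -> dict:
    """Compute one metafeatures.csv row for a task (illustrative sketch)."""
    n_numeric = sum(pd.api.types.is_numeric_dtype(X[col]) for col in X.columns)
    return {
        "Dataset": name,
        "NumberOfInstances": len(X),
        "NumberOfFeatures": X.shape[1],
        "NumberOfClasses": 0 if task == "regression" else y.nunique(),  # 0 for regression, as in the example above
        "PercentageOfNumericFeatures": n_numeric / X.shape[1],
    }
```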
### Prepare the candidate configurations
You can extract the best configurations for each task in your collection of training tasks by running FLAML on each of them with a long enough budget. Save the best configuration in a .json file under `{location_for_defaults}/{learner_name}/{task_name}.json`. For example,
```python
from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True, as_frame=True)
automl = AutoML()
automl.fit(X_train, y_train, estimator_list=["lgbm"], **settings)  # settings: your usual AutoML settings with a long budget
automl.save_best_config("test/default/lgbm/iris.json")
```
### Evaluate each candidate configuration on each task
Save the evaluation results in a .csv file. For example, save the evaluation results for lgbm under `test/default/lgbm/results.csv`:
```
task,fold,type,result,params
2dplanes,0,regression,0.946366,{'_modeljson': 'lgbm/2dplanes.json'}
2dplanes,0,regression,0.907774,{'_modeljson': 'lgbm/adult.json'}
2dplanes,0,regression,0.901643,{'_modeljson': 'lgbm/Airlines.json'}
2dplanes,0,regression,0.915098,{'_modeljson': 'lgbm/Albert.json'}
2dplanes,0,regression,0.302328,{'_modeljson': 'lgbm/Amazon_employee_access.json'}
2dplanes,0,regression,0.94523,{'_modeljson': 'lgbm/bng_breastTumor.json'}
2dplanes,0,regression,0.945698,{'_modeljson': 'lgbm/bng_pbc.json'}
2dplanes,0,regression,0.946194,{'_modeljson': 'lgbm/car.json'}
2dplanes,0,regression,0.945549,{'_modeljson': 'lgbm/connect-4.json'}
2dplanes,0,regression,0.946232,{'_modeljson': 'lgbm/default.json'}
2dplanes,0,regression,0.945594,{'_modeljson': 'lgbm/dilbert.json'}
2dplanes,0,regression,0.836996,{'_modeljson': 'lgbm/Dionis.json'}
2dplanes,0,regression,0.917152,{'_modeljson': 'lgbm/poker.json'}
adult,0,binary,0.927203,{'_modeljson': 'lgbm/2dplanes.json'}
adult,0,binary,0.932072,{'_modeljson': 'lgbm/adult.json'}
adult,0,binary,0.926563,{'_modeljson': 'lgbm/Airlines.json'}
adult,0,binary,0.928604,{'_modeljson': 'lgbm/Albert.json'}
adult,0,binary,0.911171,{'_modeljson': 'lgbm/Amazon_employee_access.json'}
adult,0,binary,0.930645,{'_modeljson': 'lgbm/bng_breastTumor.json'}
adult,0,binary,0.928603,{'_modeljson': 'lgbm/bng_pbc.json'}
adult,0,binary,0.915825,{'_modeljson': 'lgbm/car.json'}
adult,0,binary,0.919499,{'_modeljson': 'lgbm/connect-4.json'}
adult,0,binary,0.930109,{'_modeljson': 'lgbm/default.json'}
adult,0,binary,0.932453,{'_modeljson': 'lgbm/dilbert.json'}
adult,0,binary,0.921959,{'_modeljson': 'lgbm/Dionis.json'}
adult,0,binary,0.910763,{'_modeljson': 'lgbm/poker.json'}
...
```
The `type` column indicates the type of the task, such as regression, binary or multiclass.
The `result` column stores the evaluation result, assuming the larger the better. The `params` column indicates which json config is used. For example, `lgbm/2dplanes.json` indicates that the best lgbm configuration extracted from 2dplanes is used.
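How you obtain the scores is up to you; the only requirement is the column layout shown above. A minimal sketch that assembles such a file (the `evaluate_config` helper and `my_eval` module are hypothetical stand-ins for your own training/evaluation routine):
```python
import pandas as pd

from my_eval import evaluate_config  # hypothetical: trains lgbm with the stored config on the task and returns a score

rows = []
for task_name, task_type in [("2dplanes", "regression"), ("adult", "binary")]:
    for source in ["2dplanes", "adult", "default"]:  # one row per candidate configuration
        config = f"lgbm/{source}.json"
        score = evaluate_config(task_name, f"test/default/{config}")
        rows.append(
            {"task": task_name, "fold": 0, "type": task_type, "result": score, "params": {"_modeljson": config}}
        )
pd.DataFrame(rows).to_csv("test/default/lgbm/results.csv", index=False)
```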
### Learn data-dependent defaults
To recap, the inputs required for meta-learning are:
1. Metafeatures: e.g., `{location}/all/metafeatures.csv`.
1. Configurations: `{location}/{learner_name}/{task_name}.json`.
1. Evaluation results: `{location}/{learner_name}/results.csv`.

For example, if the input location is `test/default` and the learners are lgbm, xgb_limitdepth and rf, the following command learns data-dependent defaults for binary classification tasks:
```bash
python portfolio.py --output test/default --input test/default --metafeatures test/default/all/metafeatures.csv --task binary --estimator lgbm xgb_limitdepth rf
```
It will produce the following files as output:

- `test/default/lgbm/binary.json`: the learned defaults for lgbm.
- `test/default/xgb_limitdepth/binary.json`: the learned defaults for xgb_limitdepth.
- `test/default/rf/binary.json`: the learned defaults for rf.
- `test/default/all/binary.json`: the learned defaults for lgbm, xgb_limitdepth and rf together.

Change "binary" to "multiclass" or "regression" for the other tasks.
## Reference
For more technical details, please check our research paper.
* [Mining Robust Default Configurations for Resource-constrained AutoML](https://arxiv.org/abs/2202.09927). Moe Kayali, Chi Wang. arXiv preprint arXiv:2202.09927 (2022).
```bibtex
@article{Kayali2022default,
    title={Mining Robust Default Configurations for Resource-constrained AutoML},
    author={Moe Kayali and Chi Wang},
    year={2022},
    journal={arXiv preprint arXiv:2202.09927},
}
```