ScaDaMaLe Course site and book

Random Forests and MixUp

First off, we will implement MixUp for a Random Forest applied to the Fashion-MNIST data set. Fashion-MNIST consists of grayscale 28x28 images of clothing items [[6]]. We will use the scikit-learn package to implement the Random Forest algorithm, and then perform a distributed hyperparameter search with Ray Tune. Scalability thus enters this part of the project through the hyperparameter search.

First, we will train the Random Forest using the basic training data and observe its performance. Next, we will do the same but using MixUp. Typically, MixUp is used for iterative algorithms, where a new batch of MixUp data is created at each iteration. However, since a Random Forest is not trained iteratively, we instead use MixUp to augment our data set by adding a number of MixUp data points to the original data set.
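As a minimal illustration of the MixUp idea (using made-up toy vectors, not the Fashion-MNIST data), mixing two samples with a coefficient drawn from a Beta distribution produces a convex combination of both the inputs and the one-hot labels:

```python
import numpy as np

np.random.seed(0)

# Two toy samples with one-hot labels (hypothetical data for illustration)
x1, y1 = np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0])
x2, y2 = np.array([1.0, 1.0, 0.0]), np.array([0.0, 1.0])

# Draw the mixing coefficient from Beta(0.2, 0.2), which concentrates
# mass near 0 and 1, so most mixed points stay close to a real sample
lam = np.random.beta(0.2, 0.2)

# Convex combinations of inputs and labels
x_mixed = lam * x1 + (1 - lam) * x2
y_mixed = lam * y1 + (1 - lam) * y2

print(x_mixed, y_mixed)  # the mixed label is soft: its entries sum to 1
```

The soft labels are why we one-hot encode the targets below and treat the problem as regression rather than classification.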

First, we will load the data set.

```python
# Loading Fashion-MNIST
import tensorflow as tf

(X, y), (testX, testY) = tf.keras.datasets.fashion_mnist.load_data()
X = X.reshape(60000, 28 * 28)

# One-hot encode the labels
from sklearn.preprocessing import LabelBinarizer
enc = LabelBinarizer()
y = enc.fit_transform(y)
```

Next, we define a function that can be used to generate new MixUp data.

```python
# Function to create MixUp data
def create_mixup(X, y, beta_param):
    # Pair each sample with a randomly chosen partner
    n = np.shape(X)[0]
    shuffled_indices = np.arange(n).tolist()
    np.random.shuffle(shuffled_indices)
    X_s = X[shuffled_indices]
    y_s = y[shuffled_indices]
    # Draw the mixing coefficient from a Beta distribution
    mixup_l = np.random.beta(beta_param, beta_param)
    # Convex combinations of inputs and labels
    X_mixed = X * (1 - mixup_l) + mixup_l * X_s
    y_mixed = y * (1 - mixup_l) + mixup_l * y_s
    return X_mixed, y_mixed
```

Next, we split the data into training and validation sets.

```python
# Fixes the issue "AttributeError: 'ConsoleBuffer' has no attribute 'fileno'"
import sys
sys.stdout.fileno = lambda: False

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Prepare the data
num_classes = 10
np.random.seed(1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.5)
X_train_base = X_train.copy()
y_train_base = y_train.copy()
```

We now define the training function that will be used by Ray Tune. For each set of hyperparameters, we initialize a Random Forest and train on the data, either with or without added folds of MixUp data. We then evaluate on some metrics of interest.

```python
# Fixes the issue "AttributeError: 'ConsoleBuffer' has no attribute 'fileno'"
import sys
sys.stdout.fileno = lambda: False

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def training_function(config, checkpoint_dir=None):
    # Hyperparameters
    n_estimators, max_depth, mixup_folds = config["n_estimators"], config["max_depth"], config["mixup_folds"]

    # Augment the training data with the requested number of MixUp folds
    X_train_data = X_train_base.copy()
    y_train_data = y_train_base.copy()
    for i in range(mixup_folds):
        X_mixed, y_mixed = create_mixup(X_train_base, y_train_base, 0.2)
        X_train_data = np.concatenate([X_train_data, X_mixed])
        y_train_data = np.concatenate([y_train_data, y_mixed])

    # Instantiate model with n_estimators decision trees
    rf = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=1)

    # Train the model on the training data
    rf.fit(X_train_data, y_train_data)

    # Log the results: validation accuracy on the held-out set
    y_pred_probs = rf.predict(X_test)
    y_pred = np.zeros_like(y_pred_probs)
    y_pred[np.arange(len(y_pred_probs)), y_pred_probs.argmax(1)] = 1
    val_acc = np.mean(np.argmax(y_test, 1) == np.argmax(y_pred, 1))

    # Training accuracy on the original (non-augmented) training data
    y_pred_probs = rf.predict(X_train_base)
    y_pred = np.zeros_like(y_pred_probs)
    y_pred[np.arange(len(y_pred_probs)), y_pred_probs.argmax(1)] = 1
    train_acc = np.mean(np.argmax(y_train_base, 1) == np.argmax(y_pred, 1))

    mean_loss = 1
    tune.report(mean_loss=mean_loss, train_accuracy=train_acc, val_accuracy=val_acc)
```

Finally, we run the actual hyperparameter search in a distributed fashion. Regarding the amount of MixUp data, we either use no MixUp or add 2 folds, effectively tripling the size of the data set.
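As a quick sanity check of the fold logic (a sketch on small random stand-in data, with the `create_mixup` function repeated so the snippet is self-contained), adding two MixUp folds does triple the number of training rows:

```python
import numpy as np

def create_mixup(X, y, beta_param):
    # Same function as defined earlier, repeated here for self-containment
    n = np.shape(X)[0]
    idx = np.arange(n)
    np.random.shuffle(idx)
    lam = np.random.beta(beta_param, beta_param)
    return X * (1 - lam) + lam * X[idx], y * (1 - lam) + lam * y[idx]

rng = np.random.default_rng(1)
X_base = rng.random((100, 4))                 # toy stand-in for the images
y_base = np.eye(4)[rng.integers(0, 4, 100)]   # toy one-hot labels

X_aug, y_aug = X_base.copy(), y_base.copy()
for _ in range(2):  # mixup_folds = 2
    Xm, ym = create_mixup(X_base, y_base, 0.2)
    X_aug = np.concatenate([X_aug, Xm])
    y_aug = np.concatenate([y_aug, ym])

print(X_aug.shape)  # prints (300, 4): the original 100 rows plus two MixUp folds
```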

```python
from ray import tune
from ray.tune import CLIReporter

# Limit the number of progress rows printed.
reporter = CLIReporter(max_progress_rows=10)
reporter.add_metric_column("val_accuracy")
reporter.add_metric_column("train_accuracy")

analysis = tune.run(
    training_function,
    config={
        'n_estimators': tune.grid_search([10, 20]),
        'max_depth': tune.grid_search([5, 10]),
        'mixup_folds': tune.grid_search([0, 2])
    },
    local_dir='ray_results',
    progress_reporter=reporter
)

print("Best config: ", analysis.get_best_config(metric="val_accuracy", mode="max"))

# Get a dataframe for analyzing trial results.
df = analysis.results_df
```

Let's look at the data from the different trials to see if we can conclude anything about the efficacy of MixUp.

```python
df[['config.n_estimators', 'config.max_depth', 'config.mixup_folds', 'train_accuracy', 'val_accuracy']]
```

| trial_id | config.n_estimators | config.max_depth | config.mixup_folds | train_accuracy | val_accuracy |
|---|---|---|---|---|---|
| 7e269_00000 | 10 | 5 | 0 | 0.735533 | 0.728100 |
| 7e269_00001 | 10 | 10 | 0 | 0.887033 | 0.835733 |
| 7e269_00002 | 10 | 5 | 2 | 0.729433 | 0.720333 |
| 7e269_00003 | 10 | 10 | 2 | 0.888467 | 0.827867 |
| 7e269_00004 | 20 | 5 | 0 | 0.734367 | 0.729667 |
| 7e269_00005 | 20 | 10 | 0 | 0.888367 | 0.837833 |
| 7e269_00006 | 20 | 5 | 2 | 0.724567 | 0.715700 |
| 7e269_00007 | 20 | 10 | 2 | 0.865267 | 0.818967 |

Conclusions

Based on the results, MixUp does not seem to help in this context: for every hyperparameter setting, the validation accuracy with MixUp is slightly lower than without it. Possible reasons are that the data set is too simple, that Random Forests cannot fully exploit MixUp augmentation because they are not trained iteratively, or that the piecewise-constant nature of decision trees means that MixUp cannot help much.
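The last point can be illustrated with a small sketch on synthetic 1-D data (not part of the experiment above): a regression tree can only output the constant value of whichever leaf a query falls into, so its predictions along the segment between two training points never smoothly interpolate the way MixUp's soft targets suggest they should.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Two training points with distinct targets (synthetic 1-D data)
X = np.array([[0.0], [1.0]])
y = np.array([0.0, 1.0])

tree = DecisionTreeRegressor().fit(X, y)

# Evaluate along the segment between the two points: the prediction
# jumps between the two leaf values instead of interpolating linearly
grid = np.linspace(0, 1, 11).reshape(-1, 1)
preds = tree.predict(grid)
print(np.unique(preds))  # prints [0. 1.]
```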