Reproducibility

One of the desirable capabilities of a package that makes several “random” choices is to be able to reproduce the results.

The usual strategy is to fix the random seed that starts generating the pseudo-random numbers. Unfortunately, the DEAP package, which is the main dependency for all the evolutionary algorithms, doesn’t have an explicit parameter to fix this seed.

However, there is a workaround that seems to work to reproduce these results; this is:

  • Set the random seed of numpy and random package, which are the underlying random numbers generators

  • Use the random_state parameter In each of the scikit-learn and sklearn-genetic-opt objects that support it

In the following example, the random_state is set for the train_test_split, cross-validation generator, each of the hyperparameters in the param_grid, the RandomForestClassifier, and at the file level.

Example:

import numpy as np
import random
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Continuous, Categorical, Integer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score


# Random Seed at file level
random_seed = 54

np.random.seed(random_seed)
random.seed(random_seed)


data = load_digits()
n_samples = len(data.images)
X = data.images.reshape((n_samples, -1))
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=random_seed)

clf = RandomForestClassifier(random_state=random_seed)

param_grid = {'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform',
                                                     random_state=random_seed),
              'bootstrap': Categorical([True, False], random_state=random_seed),
              'max_depth': Integer(2, 30, random_state=random_seed),
              'max_leaf_nodes': Integer(2, 35, random_state=random_seed),
              'n_estimators': Integer(100, 300, random_state=random_seed)}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=random_seed)

evolved_estimator = GASearchCV(estimator=clf,
                               cv=cv,
                               scoring='accuracy',
                               population_size=8,
                               generations=5,
                               param_grid=param_grid,
                               n_jobs=-1,
                               verbose=True,
                               keep_top_k=4)

# Train and optimize the estimator
evolved_estimator.fit(X_train, y_train)
# Best parameters found
print(evolved_estimator.best_params_)
# Use the model fitted with the best parameters
y_predict_ga = evolved_estimator.predict(X_test)
print(accuracy_score(y_test, y_predict_ga))

# Saved metadata for further analysis
print("Stats achieved in each generation: ", evolved_estimator.history)
print("Best k solutions: ", evolved_estimator.hof)