Reproducibility
===============


One of the desirable capabilities of a package that makes several "random" choices is to be able to reproduce the results.

The usual strategy is to fix the random seed that starts generating the pseudo-random numbers.
Unfortunately, the DEAP package, which is the main dependency for all the evolutionary algorithms,
doesn't have an explicit parameter to fix this seed.

However, there is a workaround that seems to work to reproduce these results; this is:

* Set the random seed of `numpy` and `random` package, which are the underlying random numbers generators
* Use the random_state parameter In each of the scikit-learn and sklearn-genetic-opt objects that support it

In the following example, the random_state is set for the `train_test_split`, `cross-validation` generator,
each of the hyperparameters in the `param_grid`, the `RandomForestClassifier`, and at the file level.

Example:
--------
.. code:: python3

   import numpy as np
   import random
   from sklearn_genetic import GASearchCV
   from sklearn_genetic.space import Continuous, Categorical, Integer
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import train_test_split, StratifiedKFold
   from sklearn.datasets import load_digits
   from sklearn.metrics import accuracy_score


   # Random Seed at file level
   random_seed = 54

   np.random.seed(random_seed)
   random.seed(random_seed)


   data = load_digits()
   n_samples = len(data.images)
   X = data.images.reshape((n_samples, -1))
   y = data['target']
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=random_seed)

   clf = RandomForestClassifier(random_state=random_seed)

   param_grid = {'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform',
                                                        random_state=random_seed),
                 'bootstrap': Categorical([True, False], random_state=random_seed),
                 'max_depth': Integer(2, 30, random_state=random_seed),
                 'max_leaf_nodes': Integer(2, 35, random_state=random_seed),
                 'n_estimators': Integer(100, 300, random_state=random_seed)}

   cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=random_seed)

   evolved_estimator = GASearchCV(estimator=clf,
                                  cv=cv,
                                  scoring='accuracy',
                                  population_size=8,
                                  generations=5,
                                  param_grid=param_grid,
                                  n_jobs=-1,
                                  verbose=True,
                                  keep_top_k=4)

   # Train and optimize the estimator
   evolved_estimator.fit(X_train, y_train)
   # Best parameters found
   print(evolved_estimator.best_params_)
   # Use the model fitted with the best parameters
   y_predict_ga = evolved_estimator.predict(X_test)
   print(accuracy_score(y_test, y_predict_ga))

   # Saved metadata for further analysis
   print("Stats achieved in each generation: ", evolved_estimator.history)
   print("Best k solutions: ", evolved_estimator.hof)