How to Use sklearn-genetic-opt

Introduction

sklearn-genetic-opt uses evolutionary algorithms to tune scikit-learn estimators and select informative features. It works with classification and regression estimators, including estimators inside a scikit-learn Pipeline.

The package follows the familiar scikit-learn search API, but the search space is defined differently from GridSearchCV. Instead of listing every candidate value, you define the allowed range or choices for each hyperparameter. The optimizer samples candidates from that space, evaluates them with cross-validation, and uses evolutionary operators to produce new candidates over several generations.

Internally, sklearn-genetic-opt uses the DEAP package. A population is a set of candidate solutions. In each generation, every candidate is evaluated, the fittest candidates are selected, and crossover and mutation are applied to them to create the next generation. The process continues until the configured number of generations is reached or a callback stops the search.
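The generation loop can be sketched in plain Python. This is a toy illustration only, not the DEAP or sklearn-genetic-opt implementation: the bit-counting fitness function, tournament selection, one-point crossover, bit-flip mutation, and single-elite strategy are all assumptions chosen for brevity.

```python
import random


def fitness(individual):
    # Toy objective (assumption): count the ones in a bit string.
    return sum(individual)


def tournament(population, rng, k=3):
    # Pick k random candidates and keep the fittest one.
    return max(rng.sample(population, k), key=fitness)


def crossover(a, b, rng):
    # One-point crossover: prefix of one parent, suffix of the other.
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(individual, rng, rate=0.05):
    # Flip each bit independently with probability `rate`.
    return [1 - g if rng.random() < rate else g for g in individual]


def evolve(pop_size=20, n_genes=16, generations=15, seed=0):
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)
    ]
    best_per_generation = []
    for _ in range(generations):
        elite = max(population, key=fitness)  # carry the best candidate over unchanged
        children = [
            mutate(
                crossover(tournament(population, rng), tournament(population, rng), rng),
                rng,
            )
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
        best_per_generation.append(max(fitness(ind) for ind in population))
    return best_per_generation


history = evolve()
print(history)
```

Because the best candidate is carried over unchanged, the best score in this sketch never decreases from one generation to the next.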

This tutorial covers the two most common workflows: hyperparameter tuning with GASearchCV and feature selection with GAFeatureSelectionCV.

Hyperparameter Tuning

For the first example, we will tune an MLPClassifier on the digits dataset, a multi-class classification problem.

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neural_network import MLPClassifier

from sklearn_genetic import EvolutionConfig, GASearchCV, PopulationConfig, RuntimeConfig
from sklearn_genetic.space import Categorical, Continuous, Integer

Load the data, split it into training and test sets, and visualize a few examples:

data = load_digits()
n_samples = len(data.images)
X = data.images.reshape((n_samples, -1))
y = data["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, data.images, data.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title("Training: %i" % label)

The samples should look like this:

../_images/basic_usage_digits_0.png

Next, define the hyperparameter search space. The keys in param_grid must match valid estimator parameters. The values are search-space dimensions:

  • Integer samples integer values from a range.

  • Continuous samples floating-point values from a range.

  • Categorical samples from a fixed list of choices.

param_grid = {
    "tol": Continuous(1e-2, 1e10, distribution="log-uniform"),
    "alpha": Continuous(1e-5, 2e-5),
    "activation": Categorical(["logistic", "tanh"]),
    "batch_size": Integer(300, 350),
}

For example, batch_size can take any integer value from 300 to 350, while activation must be either "logistic" or "tanh". The distribution argument controls how random values are sampled from a dimension. A log-uniform distribution is useful when a parameter spans several orders of magnitude.
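To see why a log-uniform distribution helps with wide ranges, the sketch below compares uniform and log-uniform sampling using only the standard library. It is not the sampler used by sklearn-genetic-opt; the helper names and the range 1e-5 to 1e-1 are illustrative assumptions.

```python
import math
import random


def sample_uniform(low, high, rng):
    # Every value in [low, high] is equally likely.
    return rng.uniform(low, high)


def sample_log_uniform(low, high, rng):
    # Sample uniformly in log space, then map back to the original scale.
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
low, high = 1e-5, 1e-1  # illustrative range spanning four orders of magnitude
uniform_draws = [sample_uniform(low, high, rng) for _ in range(10_000)]
log_draws = [sample_log_uniform(low, high, rng) for _ in range(10_000)]

# Share of draws that land in the two lowest decades (below 1e-3).
below_uniform = sum(x < 1e-3 for x in uniform_draws) / len(uniform_draws)
below_log = sum(x < 1e-3 for x in log_draws) / len(log_draws)
print(f"uniform: {below_uniform:.3f}, log-uniform: {below_log:.3f}")
```

Uniform sampling places roughly 1% of its draws below 1e-3, while log-uniform sampling places about half of them there, so small values are explored as often as large ones.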

Now create the estimator and the cross-validation strategy:

clf = MLPClassifier(hidden_layer_sizes=(50, 30))

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

evolved_estimator = GASearchCV(
    estimator=clf,
    cv=cv,
    scoring="accuracy",
    param_grid=param_grid,
    evolution_config=EvolutionConfig(population_size=10, generations=20),
    population_config=PopulationConfig(initializer="smart"),
    runtime_config=RuntimeConfig(n_jobs=-1, verbose=True),
)

Most arguments have the same meaning as in scikit-learn search estimators: cv controls the validation strategy, scoring controls the metric, and RuntimeConfig.n_jobs controls parallel execution. The genetic-search-specific values EvolutionConfig.population_size and EvolutionConfig.generations determine how many candidate solutions are explored.

During the genetic search, unique candidates in the same generation are evaluated in parallel when possible; each candidate runs its cross-validation sequentially to avoid nested parallelism. Set RuntimeConfig(parallel_backend="cv") to keep candidate evaluation serial and pass n_jobs to each candidate's cross-validation instead.

By default, PopulationConfig(initializer="smart") builds a more diverse initial population using estimator defaults, warm starts when provided, Latin hypercube samples for numeric hyperparameters, and stratified categorical values. Set PopulationConfig(initializer="random") to use the previous random initialization behavior.

After fitting, fit_stats_ reports evaluation counters such as cache hits, duplicate candidates, cross-validation calls, and skipped invalid feature masks.
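The idea behind Latin hypercube sampling can be illustrated with a short standard-library sketch. This is not the library's initializer; the function names and the mapping onto alpha and batch_size are assumptions made for the example. Each dimension is split into as many equal strata as there are samples, and every stratum receives exactly one point.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in the unit cube, one per stratum in each dimension."""
    columns = []
    for _ in range(n_dims):
        # One random point inside each of the n_samples equal-width strata.
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(column)  # decouple the strata across dimensions
        columns.append(column)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


def scale(unit_value, low, high):
    # Map a value from the unit interval onto the range [low, high].
    return low + unit_value * (high - low)


rng = random.Random(0)
points = latin_hypercube(n_samples=10, n_dims=2, rng=rng)
candidates = [
    {
        "alpha": scale(u, 1e-5, 2e-5),
        "batch_size": min(350, 300 + int(v * 51)),  # integers from 300 to 350
    }
    for u, v in points
]
for candidate in candidates[:3]:
    print(candidate)
```

Compared with independent random draws, this spreads a small initial population evenly across each numeric range.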

Run the optimization:

evolved_estimator.fit(X_train, y_train)

During training, you should see a generation-by-generation log:

../_images/basic_usage_train_log_1.jpeg

Each row summarizes one generation:

  • gen: generation number.

  • nevals: number of evaluated individuals in the generation.

  • fitness: average cross-validation score for the generation.

  • fitness_std: standard deviation of the cross-validation scores.

  • fitness_best: best score found so far during the full search.

  • fitness_max: best individual score in the generation.

  • fitness_min: worst individual score in the generation.
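The relationship between these columns can be shown with a small sketch that builds log rows from per-generation score lists. The scores are made up, the summarize helper is hypothetical, and the use of the population standard deviation is an assumption rather than a statement about the library's logging code.

```python
import statistics


def summarize(generations):
    """Build one log row per generation from its list of fitness scores."""
    rows = []
    best_so_far = float("-inf")
    for gen, scores in enumerate(generations):
        best_so_far = max(best_so_far, max(scores))
        rows.append({
            "gen": gen,
            "nevals": len(scores),
            "fitness": statistics.fmean(scores),
            "fitness_std": statistics.pstdev(scores),  # population std (assumption)
            "fitness_best": best_so_far,  # best across all generations so far
            "fitness_max": max(scores),   # best within this generation only
            "fitness_min": min(scores),
        })
    return rows


# Made-up scores for two generations of four individuals each.
rows = summarize([[0.90, 0.92, 0.88, 0.94], [0.91, 0.93, 0.89, 0.92]])
for row in rows:
    print(row)
```

In the second row, fitness_max is 0.93 while fitness_best stays at 0.94, because fitness_best tracks the whole search and fitness_max only the current generation.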

A compact summary of diversity and optimizer state appears at the right of each row:

  • div (genotype_diversity): the average fraction of distinct values per gene position across the population. A value near 1.0 means the population is diverse; a value near 0.0 means it has converged to nearly identical configurations.

  • unique (unique_individual_ratio): the fraction of individuals in the population that are distinct. Values below diversity_threshold (default 0.25) trigger diversity control.

  • stag (stagnation_generations): how many consecutive generations have passed without fitness_best improving. This is useful for deciding when to add an early-stopping callback.

  • events: a compact summary of optimizer interventions in the generation: div (diversity control triggered), imm=N (N random immigrants injected), dup=N (N duplicates replaced), share (fitness sharing applied).
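The three diagnostics can be approximated with a few lines of plain Python. The functions below follow one literal reading of the definitions above and are assumptions; the library's exact formulas may differ, for example in how they normalize diversity.

```python
def genotype_diversity(population):
    # For each gene position, the share of distinct values among individuals,
    # averaged over all positions.
    n_individuals = len(population)
    n_genes = len(population[0])
    per_gene = [
        len({individual[g] for individual in population}) / n_individuals
        for g in range(n_genes)
    ]
    return sum(per_gene) / n_genes


def unique_individual_ratio(population):
    # Share of the population made up of distinct individuals.
    return len({tuple(individual) for individual in population}) / len(population)


def stagnation_generations(best_history):
    # Count the trailing generations in which the best-so-far score did not improve.
    count = 0
    for previous, current in zip(best_history, best_history[1:]):
        count = 0 if current > previous else count + 1
    return count


# Made-up population: (activation, batch_size, tol)
population = [
    ("tanh", 300, 0.01),
    ("tanh", 300, 0.01),
    ("logistic", 320, 0.01),
    ("tanh", 350, 0.02),
]
print(round(genotype_diversity(population), 3))
print(unique_individual_ratio(population))
print(stagnation_generations([0.90, 0.93, 0.93, 0.93]))
```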

After fitting, inspect the full history as a DataFrame:

import pandas as pd

history = pd.DataFrame(evolved_estimator.history)
print(history[[
    "gen", "fitness_best", "genotype_diversity",
    "unique_individual_ratio", "stagnation_generations",
]])

You can also check the evaluation cost with fit_stats_:

print(evolved_estimator.fit_stats_)
# evaluated_candidates: total individuals presented to the evaluator
# unique_candidates:    distinct configurations actually cross-validated
# cache_hits:           evaluations reused from the fitness cache
# random_immigrants:    individuals injected when diversity control triggered
# skipped_invalid_candidates: configs that raised exceptions during fit
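The difference between evaluated candidates, unique candidates, and cache hits can be illustrated with a hypothetical fitness cache. The counter names are borrowed from the comments above, but the caching logic and the toy scoring function are assumptions, not the library's code.

```python
def evaluate_population(population, score_fn, cache, stats):
    """Score every candidate, reusing cached results for repeated configurations."""
    scores = []
    for candidate in population:
        key = tuple(sorted(candidate.items()))  # hashable view of the parameters
        stats["evaluated_candidates"] += 1
        if key in cache:
            stats["cache_hits"] += 1
        else:
            cache[key] = score_fn(candidate)  # stands in for the expensive CV step
            stats["unique_candidates"] += 1
        scores.append(cache[key])
    return scores


def toy_score(candidate):
    # Stand-in for cross-validation (assumption): prefers batch sizes near 320.
    return 1.0 - abs(candidate["batch_size"] - 320) / 100


stats = {"evaluated_candidates": 0, "unique_candidates": 0, "cache_hits": 0}
cache = {}
population = [
    {"activation": "tanh", "batch_size": 300},
    {"activation": "tanh", "batch_size": 320},
    {"activation": "tanh", "batch_size": 300},  # duplicate of the first candidate
]
scores = evaluate_population(population, toy_score, cache, stats)
print(scores)
print(stats)
```

A large gap between evaluated and unique candidates suggests the population is revisiting configurations, which costs little thanks to the cache but also explores little.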

After fitting, GASearchCV behaves like a fitted scikit-learn estimator. It uses the best hyperparameters found during the search:

print(evolved_estimator.best_params_)

y_predict_ga = evolved_estimator.predict(X_test)
print(accuracy_score(y_test, y_predict_ga))

In this run, the test accuracy was approximately 0.96.


You can also inspect the optimization process. The plot_fitness_evolution() helper shows how the best score found so far changed over generations:

from sklearn_genetic.plots import plot_fitness_evolution

plot_fitness_evolution(evolved_estimator)
plt.show()
../_images/basic_usage_fitness_plot_3.png

The evolved_estimator.logbook attribute stores the results generated during the search. You can use plot_search_space() to see which hyperparameter values were sampled:

from sklearn_genetic.plots import plot_search_space

plot_search_space(evolved_estimator, features=["tol", "batch_size", "alpha"])
plt.show()
../_images/basic_usage_plot_space_4.png

In this plot, each axis corresponds to one of the selected hyperparameters, and the plotted values are the ones sampled during the search. The tol range is intentionally broad in this tutorial, so the plot can help you decide whether to narrow that range in a second search.

Feature Selection

For the second example, we will use the Iris dataset and add random noise features. The goal is to recover a useful subset of features while ignoring the noise.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

from sklearn_genetic import (
    EvolutionConfig,
    GAFeatureSelectionCV,
    PopulationConfig,
    RuntimeConfig,
)
from sklearn_genetic.plots import plot_fitness_evolution

data = load_iris()
X, y = data["data"], data["target"]

noise = np.random.uniform(0, 10, size=(X.shape[0], 10))
X = np.hstack((X, noise))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

The resulting dataset contains the original Iris features plus 10 noisy features.

GAFeatureSelectionCV is similar to GASearchCV, but it does not optimize hyperparameters. Instead, it evaluates subsets of columns and tries to maximize the cross-validation score while selecting a compact feature set. The estimator should already be configured with the hyperparameters you want to use.

clf = SVC(gamma="auto")

evolved_estimator = GAFeatureSelectionCV(
    estimator=clf,
    cv=3,
    scoring="accuracy",
    evolution_config=EvolutionConfig(
        population_size=30,
        generations=20,
        keep_top_k=2,
        elitism=True,
    ),
    population_config=PopulationConfig(initializer="smart"),
    runtime_config=RuntimeConfig(n_jobs=-1, verbose=True),
)

Run the feature-selection search:

evolved_estimator.fit(X_train, y_train)

During training, the same log format is displayed:

../_images/basic_usage_train_log_5.PNG

After fitting, GAFeatureSelectionCV also behaves like a scikit-learn estimator. Prediction methods such as predict and predict_proba use only the selected columns.

features = evolved_estimator.support_

y_predict_ga = evolved_estimator.predict(X_test)
accuracy = accuracy_score(y_test, y_predict_ga)
../_images/basic_usage_accuracy_6.PNG

In this run, the test accuracy was approximately 0.98.

The support_ attribute is a boolean mask. Each position corresponds to a column in the input data: True means the feature was selected, and False means it was discarded. In this example, the optimizer selected the informative Iris features and ignored the random noise features.
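As a minimal sketch of how such a mask is applied, the example below filters feature names and one data row with the standard library. The mask and the values are made up for illustration and are not the result of a real search.

```python
from itertools import compress

feature_names = [
    "sepal length", "sepal width", "petal length", "petal width",
    "noise_0", "noise_1",
]
# Made-up mask for illustration: True keeps the column, False drops it.
support = [True, False, True, True, False, False]

selected_names = list(compress(feature_names, support))
print(selected_names)

row = [5.1, 3.5, 1.4, 0.2, 7.3, 2.9]
selected_values = list(compress(row, support))
print(selected_values)
```

With NumPy arrays, boolean indexing such as X_test[:, evolved_estimator.support_] selects the same columns in one step.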

You can plot the fitness evolution for the feature-selection search too:

plot_fitness_evolution(evolved_estimator)
plt.show()
../_images/basic_usage_fitness_plot_7.PNG

This concludes the basic sklearn-genetic-opt workflow. The next tutorials cover callbacks, custom callbacks, schedulers, reproducibility, MLflow integration, outlier detection, and cross-validation behavior in more detail.