Understanding the Evaluation Process

This tutorial explains how GASearchCV evaluates candidate hyperparameters and how cross-validation fits into the evolutionary search process.

Two parameters control most of the evaluation behavior:

cv: The cross-validation strategy. This can be an integer or any compatible scikit-learn cross-validator, such as KFold, StratifiedKFold, or RepeatedKFold. See the scikit-learn cross-validation documentation for more details.
scoring: The metric used to evaluate each candidate. For classification, common choices include "accuracy", "precision", and "recall". For regression, common choices include "r2", "max_error", and "neg_root_mean_squared_error". The full list is available in the scikit-learn model evaluation documentation.
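
One detail worth keeping in mind when choosing scoring: scikit-learn scorers follow a higher-is-better convention, which is why error metrics such as "neg_root_mean_squared_error" carry a neg_ prefix. The sketch below reproduces that convention in plain Python. The helper name neg_rmse and the sample values are illustrative assumptions and are not part of either library.

```python
import math


def neg_rmse(y_true, y_pred):
    """Negated root mean squared error, so that larger values are better."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return -math.sqrt(sum(squared_errors) / len(squared_errors))


y_true = [3.0, 5.0, 7.0]
close_fit = [3.0, 5.0, 8.0]   # one prediction is off by 1
poor_fit = [0.0, 5.0, 11.0]   # two predictions are off by 3 and 4

score_close = neg_rmse(y_true, close_fit)
score_poor = neg_rmse(y_true, poor_fit)

# Both scores are negative; the better fit has the larger (less negative) score.
print(score_close, score_poor)
```

Because the better model always gets the larger score, a search that maximizes fitness can treat error metrics and accuracy-style metrics the same way.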

Evolutionary Algorithm Background

A genetic algorithm is a metaheuristic optimization method inspired by natural selection. In sklearn-genetic-opt, the algorithm searches over possible hyperparameter configurations and uses their cross-validation scores as the fitness signal.

The main concepts are:

Individual: one candidate solution, such as one set of hyperparameters.
Population: a group of individuals evaluated in the same generation.
Generation: one iteration of the evolutionary process.
Fitness value: the score used to compare individuals, usually a cross-validation score.
Genetic operators: operations such as selection, crossover, mutation, and elitism that create the next generation.

At a high level, the process is:

Build an initial population from the search space. This is generation 0.
Evaluate each individual with cross-validation.
Use genetic operators to create a new generation.
Repeat the evaluation and generation steps until the search reaches its generation limit or a callback stops it.
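
The four steps above can be sketched as a small, self-contained loop. This is a toy illustration under stated assumptions, not the implementation used by sklearn-genetic-opt: toy_fitness stands in for a cross-validation score, individuals are plain (x, y) tuples, and the tournament size, mutation rate, and search bounds are arbitrary.

```python
import random


def toy_fitness(individual):
    # Stand-in for a cross-validation score: highest at x = 3, y = -1.
    x, y = individual
    return -((x - 3.0) ** 2) - ((y + 1.0) ** 2)


def evolve(population_size=20, generations=15, seed=0):
    rng = random.Random(seed)
    # Step 1: build generation 0 from the search space [-10, 10] x [-10, 10].
    population = [
        (rng.uniform(-10, 10), rng.uniform(-10, 10))
        for _ in range(population_size)
    ]
    best_per_generation = []
    for _ in range(generations + 1):
        # Step 2: evaluate every individual and rank them by fitness.
        ranked = sorted(population, key=toy_fitness, reverse=True)
        best_per_generation.append(toy_fitness(ranked[0]))
        # Step 3: create the next generation with genetic operators.
        next_population = [ranked[0]]  # elitism: keep the best individual
        while len(next_population) < population_size:
            # Selection: each parent wins a tournament of three random individuals.
            mother = max(rng.sample(population, 3), key=toy_fitness)
            father = max(rng.sample(population, 3), key=toy_fitness)
            genes = [mother[0], father[1]]  # crossover: one gene from each parent
            if rng.random() < 0.2:  # mutation: perturb one randomly chosen gene
                genes[rng.randrange(2)] += rng.gauss(0, 1)
            next_population.append(tuple(genes))
        # Step 4: repeat with the new generation.
        population = next_population
    return best_per_generation


history = evolve()
print(history[0], history[-1])
```

Because the best individual is always copied forward, the best score in this sketch never decreases from one generation to the next.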

Creating the First Generation

By default, the first generation is built with PopulationConfig(initializer="smart"). For GASearchCV, this combines valid warm-start candidates, valid estimator defaults, Latin hypercube samples for numeric hyperparameters, stratified categorical values, and duplicate avoidance. For GAFeatureSelectionCV, it creates duplicate-aware feature masks with varied selected-feature counts. Set PopulationConfig(initializer="random") to use fully random initialization.

Each individual can be represented as a chromosome-like structure. In the example below, the first generation contains three individuals. Each chromosome encodes one candidate set of hyperparameters:

The red arrow represents the encoding step, where hyperparameter values are mapped into a chromosome representation. Each block is a gene, and groups of genes represent hyperparameters. The purple arrow represents scoring: each candidate is decoded, evaluated with cross-validation, and assigned a fitness value.
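
The encoding and decoding steps described above can be sketched as follows. This is a simplified illustration that uses one gene per hyperparameter and stores categorical values as indices; the SEARCH_SPACE layout and the encode and decode helpers are hypothetical and do not reflect how sklearn-genetic-opt represents individuals internally.

```python
# Hypothetical search space; names and ranges are illustrative only.
SEARCH_SPACE = {
    "learning_rate": ("continuous", 0.001, 1.0),
    "layers": ("integer", 1, 8),
    "optimizer": ("categorical", ["SGD", "Adam", "RMSprop"]),
}


def encode(params):
    """Map a hyperparameter dict to a flat list of genes."""
    chromosome = []
    for name, spec in SEARCH_SPACE.items():
        if spec[0] == "categorical":
            chromosome.append(spec[1].index(params[name]))  # store the choice index
        else:
            chromosome.append(params[name])  # numeric values are stored as-is
    return chromosome


def decode(chromosome):
    """Map a flat list of genes back to a hyperparameter dict."""
    params = {}
    for gene, (name, spec) in zip(chromosome, SEARCH_SPACE.items()):
        if spec[0] == "categorical":
            params[name] = spec[1][gene]
        else:
            params[name] = gene
    return params


candidate = {"learning_rate": 0.015, "layers": 6, "optimizer": "SGD"}
chromosome = encode(candidate)
print(chromosome)          # [0.015, 6, 0]
print(decode(chromosome))  # the original candidate dict
```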

Creating New Generations

After the initial population is evaluated, the algorithm creates a new generation. The exact process depends on the selected algorithm's strategy, but the most common operations are crossover, mutation, selection, and elitism.

Crossover

Crossover combines information from two parent chromosomes to create new children. Parent selection usually favors individuals with better fitness, so stronger candidates have a higher chance of contributing to the next generation.

For example, if individuals 1 and 3 are selected as parents, the algorithm can split their chromosomes and exchange sections:

After decoding the child chromosomes, the resulting candidates might look like this:

Child 1: {"learning_rate": 0.015, "layers": 4, "optimizer": "Adam"}
Child 2: {"learning_rate": 0.4, "layers": 6, "optimizer": "SGD"}
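
A one-point crossover that produces these two children can be sketched as below. The parent values are hypothetical: they were chosen so that the children match the example above, and the one gene per hyperparameter layout is a simplification.

```python
def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two chromosomes after position `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


GENE_NAMES = ["learning_rate", "layers", "optimizer"]

# Hypothetical parents, picked so the children match the example above.
individual_1 = [0.015, 6, "SGD"]
individual_3 = [0.4, 4, "Adam"]

child_1, child_2 = one_point_crossover(individual_1, individual_3, point=1)
print(dict(zip(GENE_NAMES, child_1)))
print(dict(zip(GENE_NAMES, child_2)))
```

Each child keeps the learning rate of one parent and inherits the remaining genes from the other, so no value appears that was not already present in a parent.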

Mutation

Crossover alone can make the search converge too quickly around similar solutions. Mutation introduces diversity by randomly changing part of a chromosome. It can alter a single gene or an entire hyperparameter value.

For example, a single gene in a child chromosome can change:

Or the mutation can change a complete hyperparameter, such as the optimizer:

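
Mutation of a complete hyperparameter can be sketched as resampling its value from the allowed choices. The CHOICES table and the mutate helper below are hypothetical names used for illustration; the operators actually applied by the library may differ.

```python
import random

# Hypothetical choices per hyperparameter, for illustration only.
CHOICES = {
    "layers": [2, 4, 6, 8],
    "optimizer": ["SGD", "Adam", "RMSprop"],
}


def mutate(individual, mutation_probability, rng):
    """Return a copy in which each listed hyperparameter may be resampled."""
    mutant = dict(individual)
    for name, options in CHOICES.items():
        if rng.random() < mutation_probability:
            # Draw a different value so the mutation always changes the gene.
            alternatives = [value for value in options if value != mutant[name]]
            mutant[name] = rng.choice(alternatives)
    return mutant


rng = random.Random(7)
child = {"learning_rate": 0.015, "layers": 4, "optimizer": "Adam"}

# A probability of 1.0 forces every listed hyperparameter to change.
mutant = mutate(child, mutation_probability=1.0, rng=rng)
print(mutant)
```

In practice the mutation probability is kept small, so most children pass through unchanged and only a few explore new values.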

Elitism

Elitism keeps the best individuals from one generation and copies them into the next generation. This helps preserve strong candidates while the rest of the population continues exploring.
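
A minimal sketch of elitism, assuming that higher scores are better and that the population size stays constant. The helper name next_generation and the parameter n_elites are illustrative and are not library parameters.

```python
def next_generation(scored_population, offspring, n_elites):
    """Carry the `n_elites` best individuals over, then fill with offspring."""
    ranked = sorted(scored_population, key=lambda pair: pair[1], reverse=True)
    elites = [individual for individual, _ in ranked[:n_elites]]
    size = len(scored_population)
    return elites + offspring[: size - n_elites]


# (individual, fitness) pairs from the current generation.
scored_population = [("a", 0.61), ("b", 0.78), ("c", 0.55), ("d", 0.74)]
offspring = ["x", "y", "z", "w"]  # produced by crossover and mutation

new_population = next_generation(scored_population, offspring, n_elites=2)
print(new_population)  # ['b', 'd', 'x', 'y']
```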

After crossover, mutation, selection, and elitism, a new generation may look like this:

The search repeats this cycle until one of the stopping conditions is met:

The maximum number of generations is reached.
The search exceeds a time budget.
An early-stopping callback detects that the score has reached a threshold or stopped improving.
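
The three stopping conditions can be combined into a single check, sketched below. The function should_stop and its parameters are hypothetical; in sklearn-genetic-opt the equivalent logic is split between the generation limit and callbacks.

```python
import time


def should_stop(generation, max_generations, started_at, time_budget_s,
                best_scores, threshold, patience):
    """Return the name of the first stopping condition met, or None."""
    if generation >= max_generations:
        return "max_generations"
    if time.monotonic() - started_at > time_budget_s:
        return "time_budget"
    if best_scores and best_scores[-1] >= threshold:
        return "threshold"
    # Stagnation: the best score is no higher than it was `patience` generations ago.
    if len(best_scores) > patience and best_scores[-1] <= best_scores[-1 - patience]:
        return "stagnation"
    return None


start = time.monotonic()
reason = should_stop(
    generation=3,
    max_generations=20,
    started_at=start,
    time_budget_s=3600,
    best_scores=[0.40, 0.45, 0.45, 0.45],
    threshold=0.90,
    patience=2,
)
print(reason)  # stagnation
```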

How GASearchCV Evaluates Candidates

In sklearn-genetic-opt, GASearchCV evaluates candidate hyperparameters as follows:

Build an initial population of population_size candidate configurations from param_grid, using the configured initializer.
Fit and score one estimator for each candidate using the configured cv and scoring values.
Log generation-level metrics when verbose=True.
Create the next generation using the selected evolutionary algorithm.
Repeat until generations is reached or callbacks stop the search.
Select the best hyperparameters based on the best individual cross-validation score.

If use_cache=True (the default), candidates that have already been evaluated reuse their stored fitness values. Duplicate candidates inside the same generation are also evaluated only once and then recorded for each occurrence. When n_jobs enables parallel execution, unique candidates in a generation are evaluated in parallel, while each candidate’s own cross-validation runs sequentially to avoid nested parallelism. Set RuntimeConfig(parallel_backend="cv") to keep candidate evaluation serial and pass n_jobs to each candidate’s cross-validation instead. After fitting, fit_stats_ exposes counters for actual cross-validation calls, cache hits, duplicate candidates, skipped invalid candidates, and population-level parallel batches.
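
The caching and duplicate handling described above can be illustrated with a small sketch. This is not the library's code: evaluate_generation, the cache key format, and the counter names are assumptions chosen for readability, and they do not correspond to the actual keys of fit_stats_.

```python
def evaluate_generation(candidates, score_fn, cache, stats):
    """Score a generation, evaluating each distinct candidate at most once."""
    scores = []
    seen_this_generation = set()
    for candidate in candidates:
        key = tuple(sorted(candidate.items()))  # hashable cache key
        if key in cache:
            if key in seen_this_generation:
                stats["duplicates"] += 1   # repeated inside this generation
            else:
                stats["cache_hits"] += 1   # evaluated in an earlier generation
        else:
            cache[key] = score_fn(candidate)  # the only place scoring happens
            stats["evaluations"] += 1
        seen_this_generation.add(key)
        scores.append(cache[key])
    return scores


def fake_score(candidate):
    # Stand-in for cross-validation: best when max_depth equals 5.
    return -abs(candidate["max_depth"] - 5)


cache = {}
stats = {"evaluations": 0, "cache_hits": 0, "duplicates": 0}

generation_0 = [{"max_depth": 3}, {"max_depth": 5}, {"max_depth": 3}]
generation_1 = [{"max_depth": 3}, {"max_depth": 7}]

scores_0 = evaluate_generation(generation_0, fake_score, cache, stats)
scores_1 = evaluate_generation(generation_1, fake_score, cache, stats)
print(scores_0, scores_1, stats)
```

Five candidates are recorded, but only three are actually scored: one is a duplicate inside generation 0 and one is a cache hit in generation 1.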

The history attribute also includes optimizer telemetry for each generation: population_size, unique_individuals, unique_individual_ratio, genotype_diversity, fitness_improvement, fitness_improved, stagnation_generations, best_generation, mutation_probability, diversity_control_triggered, random_immigrants, duplicate_replacements, local_refinements, fitness_sharing_applied, mean_niche_count, and max_niche_count. These fields help diagnose whether the search is still exploring diverse solutions or has started to converge/stagnate around the same candidates.
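
To make two of these fields more concrete, the sketch below shows one plausible way to compute a unique-individual ratio and a stagnation counter. The exact definitions used by the library are not stated here, so both formulas and the helper name diversity_summary are assumptions.

```python
def diversity_summary(population, best_scores):
    """Return (unique ratio, generations since the best score last improved)."""
    unique = {tuple(sorted(individual.items())) for individual in population}
    unique_ratio = len(unique) / len(population)

    # Walk backwards through the best-so-far scores until an improvement is found.
    stagnation = 0
    for i in range(len(best_scores) - 1, 0, -1):
        if best_scores[i] > best_scores[i - 1]:
            break
        stagnation += 1
    return unique_ratio, stagnation


population = [
    {"max_depth": 3, "criterion": "squared_error"},
    {"max_depth": 3, "criterion": "squared_error"},  # duplicate of the first
    {"max_depth": 5, "criterion": "squared_error"},
    {"max_depth": 8, "criterion": "absolute_error"},
]
best_scores = [0.40, 0.45, 0.45, 0.45]

unique_ratio, stagnation = diversity_summary(population, best_scores)
print(unique_ratio, stagnation)  # 0.75 2
```

A falling unique ratio together with a growing stagnation counter is the kind of pattern that suggests the population is collapsing onto the same candidates.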

When the search space is noisy or rugged, OptimizationConfig(diversity_control=True) can help avoid premature convergence by increasing mutation, replacing duplicate candidates, and adding random immigrants after low-diversity or stagnant generations. When the search has found promising regions, OptimizationConfig(local_search=True) can run a short neighborhood refinement around the hall-of-fame candidates without increasing the number of GA generations. OptimizationConfig(fitness_sharing=True) can reduce selection pressure on crowded niches, so similar high-scoring candidates do not immediately dominate the population.
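
The idea behind fitness sharing can be shown with the classic formulation, in which each raw fitness is divided by a niche count that grows with the number of nearby individuals. This is a textbook sketch, not the library's implementation: it assumes positive fitness values, one-dimensional individuals, and a hypothetical sigma_share radius.

```python
def shared_fitness(individuals, raw_fitness, sigma_share):
    """Divide each raw fitness by its niche count (classic fitness sharing)."""
    shared = []
    for xi, fi in zip(individuals, raw_fitness):
        niche_count = 0.0
        for xj in individuals:
            distance = abs(xi - xj)
            if distance < sigma_share:
                # Closer neighbours contribute more to the niche count.
                niche_count += 1.0 - distance / sigma_share
        # niche_count is at least 1 because every individual matches itself.
        shared.append(fi / niche_count)
    return shared


individuals = [0.10, 0.11, 0.12, 0.90]  # three crowded candidates, one isolated
raw = [0.80, 0.81, 0.80, 0.70]

shared = shared_fitness(individuals, raw, sigma_share=0.1)
print(shared)
```

The crowded candidates have the highest raw scores, but after sharing the isolated candidate ranks first, which is how the technique keeps less explored regions in the population.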

The generation log contains summary metrics:

fitness: The average score across the individuals in the current generation.
fitness_std: The standard deviation of the individual scores in the current generation.
fitness_best: The best score found so far. This is the most useful metric for convergence plots and early-stopping callbacks because it is cumulative.
fitness_max: The best individual score in the current generation.
fitness_min: The worst individual score in the current generation.

Except for fitness_best, these values summarize the current population, not just the final selected model. For example, with EvolutionConfig(population_size=10), the fitness value is the average score of the 10 candidates evaluated in that generation.
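
The relationship between these metrics can be sketched with the standard library. The sketch assumes that higher scores are better and that fitness_std is a population standard deviation; the helper name summarize and the scores are illustrative.

```python
import statistics


def summarize(generations):
    """Build one summary row per generation from lists of individual scores."""
    rows = []
    best_so_far = float("-inf")
    for scores in generations:
        best_so_far = max(best_so_far, max(scores))  # cumulative across generations
        rows.append({
            "fitness": statistics.fmean(scores),
            "fitness_std": statistics.pstdev(scores),
            "fitness_best": best_so_far,
            "fitness_max": max(scores),
            "fitness_min": min(scores),
        })
    return rows


# Two generations of three individuals each.
log = summarize([[0.30, 0.50, 0.40], [0.45, 0.35, 0.40]])
for row in log:
    print(row)
```

In the second generation fitness_max drops to 0.45 while fitness_best stays at 0.5, which is why fitness_best is the better choice for convergence plots.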

The complete flow can be represented like this:

Each candidate is evaluated with cross-validation. For example, a 5-fold strategy splits the data into five train/validation rotations:

Image taken from scikit-learn.
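
The fold rotation can also be written out explicitly. The sketch below builds contiguous, unshuffled folds in plain Python to show which samples are held out in each round; k_fold_indices is an illustrative helper, and in real code the scikit-learn KFold class should be used instead.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train, validation) index lists for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=5))
for train, validation in folds:
    print("validation:", validation, "train size:", len(train))
```

Every sample is used for validation exactly once, and each candidate's fitness is derived from the scores of the five rounds.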

Example

This example tunes a DecisionTreeRegressor inside a scikit-learn Pipeline on the diabetes regression dataset. The search uses 5-fold cross-validation and optimizes the "r2" metric.

At the end, we print the best hyperparameters and the R-squared score on the test set.

from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

from sklearn_genetic import EvolutionConfig, GASearchCV, PopulationConfig, RuntimeConfig
from sklearn_genetic.space import Categorical, Continuous, Integer

data = load_diabetes()
X, y = data["data"], data["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("clf", DecisionTreeRegressor(random_state=42)),
    ]
)

param_grid = {
    "clf__ccp_alpha": Continuous(0, 1),
    "clf__criterion": Categorical(["squared_error", "absolute_error"]),
    "clf__max_depth": Integer(2, 20),
    "clf__min_samples_split": Integer(2, 30),
}

evolved_estimator = GASearchCV(
    estimator=pipe,
    cv=cv,
    scoring="r2",
    param_grid=param_grid,
    evolution_config=EvolutionConfig(
        population_size=15,
        generations=20,
        tournament_size=3,
        elitism=True,
        keep_top_k=4,
        crossover_probability=0.9,
        mutation_probability=0.05,
        criteria="max",
        algorithm="eaMuCommaLambda",
    ),
    population_config=PopulationConfig(initializer="smart"),
    runtime_config=RuntimeConfig(n_jobs=-1),
)

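# Run the evolutionary search on the training split.
evolved_estimator.fit(X_train, y_train)

# Predict on the held-out test split with the best pipeline found.
y_predict_ga = evolved_estimator.predict(X_test)
r_squared = r2_score(y_test, y_predict_ga)

print(evolved_estimator.best_params_)
print(f"R-squared: {r_squared:.2f}")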