Advanced Optimizer Control

Introduction

This tutorial shows how to use the optimizer-control features designed for harder search spaces: spaces with many local optima, noisy validation scores, or early population collapse.

The tools covered here are:

PopulationConfig(initializer="smart") to start with better coverage.
Adaptive mutation and crossover schedules.
OptimizationConfig(diversity_control=True) to react when the population becomes too similar.
OptimizationConfig(fitness_sharing=True) to reduce selection pressure on crowded niches.
OptimizationConfig(local_search=True) to refine the best candidates after the genetic search.

These features are optional. The default behavior remains conservative, and the advanced controls are best introduced one at a time when telemetry shows that the optimizer needs them.

When to Use These Controls

Use these controls when one of the following patterns appears in estimator.history:

unique_individual_ratio drops quickly toward zero.
genotype_diversity is low while the score is still not improving.
stagnation_generations grows for several generations.
Many high-scoring candidates are small variations of the same solution.
The final solution is good, but nearby configurations might be better.

The controls target different parts of the optimization process:

Control	Main purpose
`population_initializer`	Better initial exploration
Adaptive schedules	Change exploration pressure over generations
`diversity_control`	Recover diversity after collapse or stagnation
`fitness_sharing`	Keep multiple promising niches alive
`local_search`	Improve exploitation near hall-of-fame solutions

Hyperparameter Search Example

The following example tunes a random forest on a classification problem. It uses a broad search space where multiple different regions can perform well.

import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

from sklearn_genetic import (
    EvolutionConfig,
    GASearchCV,
    OptimizationConfig,
    PopulationConfig,
    RuntimeConfig,
)
from sklearn_genetic.schedules import ExponentialAdapter, InverseAdapter
from sklearn_genetic.space import Categorical, Integer

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

param_grid = {
    "n_estimators": Integer(50, 250),
    "max_depth": Integer(2, 20),
    "min_samples_split": Integer(2, 20),
    "min_samples_leaf": Integer(1, 10),
    "max_features": Categorical(["sqrt", "log2", None]),
    "criterion": Categorical(["gini", "entropy", "log_loss"]),
}

crossover_schedule = InverseAdapter(
    initial_value=0.8,
    end_value=0.6,
    adaptive_rate=0.05,
)
mutation_schedule = ExponentialAdapter(
    initial_value=0.1,
    end_value=0.25,
    adaptive_rate=0.08,
)

search = GASearchCV(
    estimator=RandomForestClassifier(random_state=42, n_jobs=1),
    param_grid=param_grid,
    cv=cv,
    scoring="roc_auc",
    evolution_config=EvolutionConfig(
        population_size=24,
        generations=18,
        crossover_probability=crossover_schedule,
        mutation_probability=mutation_schedule,
        tournament_size=3,
        elitism=True,
        keep_top_k=4,
    ),
    population_config=PopulationConfig(initializer="smart"),
    runtime_config=RuntimeConfig(n_jobs=-1, parallel_backend="auto", verbose=True),
    optimization_config=OptimizationConfig(
        diversity_control=True,
        diversity_threshold=0.18,
        diversity_stagnation_generations=4,
        diversity_mutation_boost=1.8,
        random_immigrants_fraction=0.15,
        fitness_sharing=True,
        sharing_radius=0.25,
        sharing_alpha=1.0,
        local_search=True,
        local_search_top_k=2,
        local_search_steps=4,
        local_search_radius=0.08,
    ),
)

search.fit(X_train, y_train)

y_pred = search.predict(X_test)
y_proba = search.predict_proba(X_test)[:, 1]

print(search.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))

Reading Optimizer Telemetry

After fitting, convert history to a dataframe to inspect the search:

history = pd.DataFrame(search.history)

columns = [
    "gen",
    "fitness_best",
    "fitness_max",
    "unique_individual_ratio",
    "genotype_diversity",
    "stagnation_generations",
    "mutation_probability",
    "diversity_control_triggered",
    "random_immigrants",
    "duplicate_replacements",
    "fitness_sharing_applied",
    "mean_niche_count",
    "max_niche_count",
    "local_refinements",
]

print(history[columns])
print(search.fit_stats_)

The most useful fields are:

unique_individual_ratio: Fraction of candidates in the generation that are unique. Low values mean the population contains many duplicates.
genotype_diversity: Average per-gene diversity. Low values mean the candidates are becoming structurally similar, even if they are not exact duplicates.
stagnation_generations: Number of generations since the best score improved.
diversity_control_triggered: Whether diversity control reacted in the generation.
random_immigrants: Number of random candidates injected into the offspring.
duplicate_replacements: Number of duplicate offspring replaced before evaluation.
fitness_sharing_applied: Whether niche-aware selection pressure was active.
mean_niche_count and max_niche_count: How crowded candidates were during selection. Higher values mean more candidates occupied similar regions.
local_refinements: Number of local neighbor candidates evaluated after the main genetic search. This is usually non-zero only in the final logbook row.

Feature Selection Example

The same controls can be used with GAFeatureSelectionCV. In feature selection, local_search_radius controls the fraction of feature bits flipped when creating local neighbors.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split

from sklearn_genetic import (
    EvolutionConfig,
    GAFeatureSelectionCV,
    OptimizationConfig,
    PopulationConfig,
    RuntimeConfig,
)
from sklearn_genetic.schedules import ExponentialAdapter

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

selector = GAFeatureSelectionCV(
    estimator=RandomForestClassifier(random_state=42, n_jobs=1),
    cv=cv,
    scoring="roc_auc",
    max_features=18,
    evolution_config=EvolutionConfig(
        population_size=30,
        generations=16,
        crossover_probability=0.8,
        mutation_probability=ExponentialAdapter(0.1, 0.25, 0.08),
        keep_top_k=4,
    ),
    population_config=PopulationConfig(initializer="smart"),
    runtime_config=RuntimeConfig(n_jobs=-1, parallel_backend="auto", verbose=True),
    optimization_config=OptimizationConfig(
        diversity_control=True,
        diversity_threshold=0.2,
        diversity_stagnation_generations=4,
        random_immigrants_fraction=0.15,
        fitness_sharing=True,
        sharing_radius=0.2,
        local_search=True,
        local_search_top_k=2,
        local_search_steps=5,
        local_search_radius=0.1,
    ),
)

selector.fit(X_train, y_train)

print("Selected features:", selector.best_features_)
print("Test score:", selector.score(X_test, y_test))

Recommended Workflow

Start from the simplest useful setup and add controls based on telemetry:

Use PopulationConfig(initializer="smart") with high crossover (0.8) and low mutation (0.1). Add adaptive schedules if stagnation is detected.
Inspect unique_individual_ratio, genotype_diversity, and stagnation_generations.
If diversity collapses early, enable OptimizationConfig(diversity_control=True).
If one candidate family dominates too quickly, enable OptimizationConfig(fitness_sharing=True).
If the final region looks promising but not fully refined, enable OptimizationConfig(local_search=True).

Tuning Guidelines

diversity_threshold: Values between 0.1 and 0.3 are a practical starting point. Increase it if you want diversity control to react earlier.
diversity_mutation_boost: Values between 1.5 and 2.5 usually provide a noticeable mutation increase without turning the generation into a fully random search.
random_immigrants_fraction: Start between 0.05 and 0.2. Larger values can help rugged spaces, but may slow convergence.
sharing_radius: Values between 0.15 and 0.35 are often useful. Smaller values only penalize very similar candidates; larger values preserve broader niches.
local_search_radius: Small values such as 0.05 to 0.15 keep refinement local. Larger values behave more like another mutation phase.
local_search_steps: Keep this small at first. Each step creates extra candidates that must be evaluated with cross-validation.

Practical Notes

These controls improve search behavior, but they do not remove the need for a good validation design. Use reproducible cross-validation splits, keep a final holdout set for model assessment, and compare multiple random seeds when the metric is noisy.

Because fitness_sharing only changes temporary selection pressure, it does not alter raw cross-validation scores, best_score_, or cv_results_. Because local_search evaluates extra neighbor candidates after the genetic search, it can improve final quality but may increase total runtime.