Advanced Hyperparameter Search With Random Forest

This notebook is a guided tour of advanced optimization controls available in sklearn-genetic-opt. We will tune a RandomForestClassifier on the breast cancer dataset, inspect optimizer telemetry, compare against a lightweight randomized-search baseline, and then reuse the same ideas for feature selection.

Problem Setup

The breast cancer dataset is a binary classification task. With 569 samples it is small enough for a documentation example, but its 30 numeric features still make model selection and feature selection meaningful.

We fix the random seed for both the train/test split and the shuffled StratifiedKFold so that the data splits are reproducible.

[1]:
import warnings
from pprint import pprint

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
from scipy.stats import randint

from sklearn_genetic import (
    EvolutionConfig,
    GAFeatureSelectionCV,
    GASearchCV,
    OptimizationConfig,
    PopulationConfig,
    RuntimeConfig,
)
from sklearn_genetic.callbacks import ConsecutiveStopping, DeltaThreshold, TimerStopping
from sklearn_genetic.schedules import ExponentialAdapter, InverseAdapter
from sklearn_genetic.space import Categorical, Continuous, Integer

warnings.filterwarnings("ignore", category=UserWarning)

RANDOM_STATE = 42
[2]:
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.30,
    stratify=y,
    random_state=RANDOM_STATE,
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)

print(f"Training shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")
print(f"Positive class rate: {y.mean():.3f}")
Training shape: (398, 30)
Test shape: (171, 30)
Positive class rate: 0.627

Baseline Model

Before tuning anything, train a plain random forest. This gives us a practical reference point: a genetic search should either improve the score, find a simpler configuration, or give us useful telemetry about the search process.

[3]:
def evaluate_classifier(estimator, X_eval, y_eval):
    """Return accuracy, balanced accuracy, and ROC AUC for a fitted classifier."""
    predictions = estimator.predict(X_eval)
    # Column 1 holds the predicted probability of the class labeled 1.
    probabilities = estimator.predict_proba(X_eval)[:, 1]
    return {
        "accuracy": accuracy_score(y_eval, predictions),
        "balanced_accuracy": balanced_accuracy_score(y_eval, predictions),
        "roc_auc": roc_auc_score(y_eval, probabilities),
    }


baseline = RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=1)
baseline.fit(X_train, y_train)

baseline_metrics = evaluate_classifier(baseline, X_test, y_test)
baseline_metrics
[3]:
{'accuracy': 0.935672514619883,
 'balanced_accuracy': 0.9297605140186915,
 'roc_auc': 0.991311331775701}

Define a Genetic Search Space

sklearn-genetic-opt uses explicit search-space objects instead of sklearn parameter distributions. This keeps integer, continuous, and categorical choices clear.

In this example we tune both model capacity and split behavior. The search space is intentionally moderate so the notebook runs quickly.

[4]:
param_grid = {
    "n_estimators": Integer(40, 140),
    "max_depth": Integer(2, 12),
    "min_samples_split": Integer(2, 12),
    "min_samples_leaf": Integer(1, 8),
    "max_features": Categorical(["sqrt", "log2", None]),
    "ccp_alpha": Continuous(0.0, 0.03),
}
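
One way to spread initial candidates evenly over numeric ranges like these is Latin hypercube sampling. The sketch below is a standard-library illustration only: `INT_BOUNDS`, `FLOAT_BOUNDS`, `latin_hypercube`, and `sample_candidates` are hypothetical names, the bounds are plain tuples rather than sklearn-genetic-opt objects, and treating the integer bounds as inclusive is an assumption.

```python
import random

# Hypothetical plain-Python mirror of the numeric ranges above
# (tuples, not sklearn-genetic-opt Integer/Continuous objects).
INT_BOUNDS = {
    "n_estimators": (40, 140),
    "max_depth": (2, 12),
    "min_samples_split": (2, 12),
    "min_samples_leaf": (1, 8),
}
FLOAT_BOUNDS = {"ccp_alpha": (0.0, 0.03)}


def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims with one point per stratum in each dimension."""
    columns = []
    for _ in range(n_dims):
        # Draw one value inside each of n_samples equal-width strata, then shuffle.
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(column)
        columns.append(column)
    return [list(point) for point in zip(*columns)]


def sample_candidates(n_samples, seed=0):
    """Map Latin hypercube points onto the numeric bounds defined above."""
    rng = random.Random(seed)
    names = list(INT_BOUNDS) + list(FLOAT_BOUNDS)
    candidates = []
    for point in latin_hypercube(n_samples, len(names), rng):
        candidate = {}
        for name, u in zip(names, point):
            if name in INT_BOUNDS:
                low, high = INT_BOUNDS[name]
                # Assumption: integer bounds are inclusive on both ends.
                candidate[name] = min(high, low + int(u * (high - low + 1)))
            else:
                low, high = FLOAT_BOUNDS[name]
                candidate[name] = low + u * (high - low)
        candidates.append(candidate)
    return candidates


candidates = sample_candidates(5)
for candidate in candidates:
    print(candidate)
```

Each numeric dimension is cut into as many strata as there are samples, so no two candidates fall in the same slice of any single range. Categorical choices such as max_features are left out of this sketch.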

Configure GASearchCV

This configuration demonstrates several optimizer controls:

  • PopulationConfig(initializer="smart") seeds a more useful initial population using estimator defaults, stratified categorical choices, and Latin hypercube sampling for numeric dimensions.

  • warm_start_configs injects a known reasonable configuration into the first population.

  • RuntimeConfig(parallel_backend="auto") lets the search decide whether to parallelize across candidates or across cross-validation folds.

  • OptimizationConfig(local_search=True) performs a short refinement around the best candidates at the end.

  • OptimizationConfig(diversity_control=True) increases mutation pressure and can inject random candidates when the population collapses too early.

  • OptimizationConfig(fitness_sharing=True) penalizes candidates in crowded regions of the search space so that near-duplicates do not dominate selection too soon.

  • Adaptive schedules (here ExponentialAdapter and InverseAdapter) let the crossover and mutation probabilities change from an initial value toward an end value over the generations.
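
To get a feel for the adaptive schedules in the last bullet, the sketch below evaluates an exponential and an inverse decay curve with the same initial_value, end_value, and adaptive_rate used in the next cell. The two functions are hypothetical stand-ins, assuming the adapters follow the decay shapes their names suggest; the library's exact formulas, and any diversity boost applied on top, may differ.

```python
import math


def exponential_schedule(step, initial_value, end_value, adaptive_rate):
    """Exponential decay from initial_value toward end_value (assumed form)."""
    return (initial_value - end_value) * math.exp(-adaptive_rate * step) + end_value


def inverse_schedule(step, initial_value, end_value, adaptive_rate):
    """Inverse decay from initial_value toward end_value (assumed form)."""
    return (initial_value - end_value) / (1 + adaptive_rate * step) + end_value


# Same settings as the crossover and mutation adapters configured below.
crossover = [exponential_schedule(t, 0.8, 0.4, 0.15) for t in range(15)]
mutation = [inverse_schedule(t, 0.25, 0.05, 0.2) for t in range(15)]

for t in (0, 5, 10, 14):
    print(f"step {t:2d}: crossover={crossover[t]:.3f} mutation={mutation[t]:.3f}")
```

Both curves start at initial_value and approach end_value without reaching it. In the run log that follows, the mut values reported for generations 7 to 9 (0.141, 0.133, 0.127) are consistent with this inverse form evaluated at steps 6 to 8, while generations flagged with div show higher values, presumably because of the diversity mutation boost.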

[5]:
callbacks = [
    ConsecutiveStopping(generations=10, metric="fitness_best"),
    TimerStopping(total_seconds=240),
]

ga_search = GASearchCV(
    estimator=RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=1),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=cv,
    evolution_config=EvolutionConfig(
        population_size=20,
        generations=15,
        crossover_probability=ExponentialAdapter(initial_value=0.8, end_value=0.4, adaptive_rate=0.15),
        mutation_probability=InverseAdapter(initial_value=0.25, end_value=0.05, adaptive_rate=0.2),
        tournament_size=3,
        elitism=True,
        keep_top_k=3,
    ),
    population_config=PopulationConfig(
        initializer="smart",
        warm_start_configs=[
            {
                "n_estimators": 100,
                "max_depth": 6,
                "min_samples_split": 4,
                "min_samples_leaf": 2,
                "max_features": "sqrt",
                "ccp_alpha": 0.0,
            }
        ],
    ),
    runtime_config=RuntimeConfig(
        n_jobs=-1,
        parallel_backend="auto",
        use_cache=True,
        verbose=True,
        return_train_score=False,
    ),
    optimization_config=OptimizationConfig(
        local_search=True,
        local_search_top_k=2,
        local_search_steps=1,
        local_search_radius=0.2,
        diversity_control=True,
        diversity_threshold=0.35,
        diversity_stagnation_generations=3,
        diversity_mutation_boost=1.8,
        random_immigrants_fraction=0.15,
        fitness_sharing=True,
        sharing_radius=0.35,
        sharing_alpha=1.0,
    ),
)

ga_search.fit(X_train, y_train, callbacks=callbacks)
 gen evals           avg          best     div  unique  stag     mut   sel             events
---- ----- ------------- ------------- ------- ------- ----- ------- ----- ------------------
   0    20       0.98625       0.99076   0.579   1.000     0       -     - -
   1    40       0.98519       0.99076   0.386   0.650     1   0.200     3 dup=1,share
   2    40       0.98587       0.99076   0.342   0.900     2   0.217     3 dup=9,share
   3    40       0.98676       0.99076   0.316   0.700     3   0.304     3 div,imm=6,dup=2,sh
   4    40       0.98610       0.99076   0.307   0.750     4   0.315     3 div,imm=6,dup=2,sh
   5    40       0.98464       0.99076   0.386   0.750     5   0.290     3 div,imm=6,dup=12,s
   6    40       0.98632       0.99171   0.412   0.900     0   0.270     3 div,imm=6,dup=7,sh
   7    40       0.98588       0.99171   0.421   0.800     1   0.141     3 dup=15,share
   8    40       0.98520       0.99171   0.421   0.850     2   0.133     3 dup=17,share
   9    40       0.98597       0.99171   0.404   0.750     3   0.127     3 dup=20,share
  10    40       0.98640       0.99171   0.439   0.900     4   0.219     3 div,imm=6,dup=16,s
  11    40       0.98589       0.99171   0.351   0.700     5   0.210     3 div,imm=6,dup=13,s
INFO: TimerStopping callback met its criteria
INFO: Stopping the algorithm
[5]:
GASearchCV(crossover_probability=<sklearn_genetic.schedules.schedulers.ExponentialAdapter object at 0x000001C5C12A12B0>,
           cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
           diversity_control=True, diversity_mutation_boost=1.8,
           diversity_stagnation_generations=3, diversity_threshold=0.35,
           estimator=RandomForestClassifier(ccp_alpha=0.0083469934111643...
                                                                   'max_features': 'sqrt',
                                                                   'min_samples_leaf': 2,
                                                                   'min_samples_split': 4,
                                                                   'n_estimators': 100}]),
           population_size=20, random_immigrants_fraction=0.15,
           return_train_score=True,
           runtime_config=RuntimeConfig(n_jobs=-1,
                                        pre_dispatch='2*n_jobs',
                                        error_score=nan,
                                        return_train_score=False,
                                        use_cache=True,
                                        parallel_backend='auto',
                                        verbose=True),
           scoring='roc_auc', sharing_radius=0.35)

Inspect Results and Telemetry

The usual sklearn-style attributes are available: best_params_, best_score_, and best_estimator_. The library also records optimization mechanics in fit_stats_ and per-generation telemetry in history.

These fields are especially useful when tuning the search itself. A high cache_hits count means the search is revisiting candidates it has already scored. If diversity collapses early, try stronger mutation, more random immigrants, a larger population, or fitness sharing.
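
Fitness sharing, one of the remedies named above, can be illustrated with the classic niche-count formulation: each candidate's fitness is divided by a count of how many neighbors sit within a sharing radius. This is a minimal sketch on a one-dimensional toy population, not the library's implementation; `sharing_kernel` and `shared_fitness` are hypothetical helpers, and the `radius` and `alpha` defaults simply reuse the sharing_radius and sharing_alpha values from the configuration.

```python
def sharing_kernel(distance, radius, alpha):
    """Triangular sharing function: 1 at distance 0, falling to 0 at the radius."""
    if distance >= radius:
        return 0.0
    return 1.0 - (distance / radius) ** alpha


def shared_fitness(fitness, positions, radius=0.35, alpha=1.0):
    """Divide each fitness by its niche count (the candidate itself counts as 1)."""
    shared = []
    for i, f_i in enumerate(fitness):
        niche_count = sum(
            sharing_kernel(abs(positions[i] - p_j), radius, alpha)
            for p_j in positions
        )
        shared.append(f_i / niche_count)
    return shared


# Toy population: three near-duplicates and one isolated candidate,
# with positions normalized to [0, 1].
positions = [0.10, 0.12, 0.14, 0.90]
fitness = [0.99, 0.99, 0.99, 0.98]

adjusted = shared_fitness(fitness, positions)
for p, f, a in zip(positions, fitness, adjusted):
    print(f"position={p:.2f} raw={f:.3f} shared={a:.3f}")
```

The three clustered candidates split their fitness among themselves, so the isolated candidate ranks first after sharing even though its raw score is slightly lower. That is the mechanism that keeps a single region from taking over selection.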

[6]:
print("Best CV ROC AUC:", round(ga_search.best_score_, 4))
print("Best parameters:")
pprint(ga_search.best_params_)

ga_metrics = evaluate_classifier(ga_search, X_test, y_test)
pd.DataFrame([baseline_metrics, ga_metrics], index=["baseline", "ga_search"])
Best CV ROC AUC: 0.9917
Best parameters:
{'ccp_alpha': 0.008346993411164376,
 'max_depth': 6,
 'max_features': 'log2',
 'min_samples_leaf': 5,
 'min_samples_split': 7,
 'n_estimators': 56}
[6]:
           accuracy  balanced_accuracy   roc_auc
baseline   0.935673           0.929761  0.991311
ga_search  0.929825           0.925088  0.986565
[7]:
ga_search.fit_stats_
[7]:
{'evaluated_candidates': 462,
 'unique_candidates': 460,
 'cross_validate_calls': 460,
 'cache_hits': 2,
 'duplicate_candidates': 0,
 'skipped_invalid_candidates': 0,
 'population_parallel_batches': 13,
 'population_serial_batches': 0,
 'random_immigrants': 36,
 'local_refinement_candidates': 2}
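
A few ratios make these counters easier to compare across runs. The sketch below copies the values printed above into a plain dictionary; reading each field according to its name is an assumption, since the exact definitions are not spelled out here.

```python
# Values copied from the fit_stats_ output above.
fit_stats = {
    "evaluated_candidates": 462,
    "unique_candidates": 460,
    "cross_validate_calls": 460,
    "cache_hits": 2,
    "random_immigrants": 36,
}

evaluated = fit_stats["evaluated_candidates"]
cache_hit_rate = fit_stats["cache_hits"] / evaluated
unique_rate = fit_stats["unique_candidates"] / evaluated
immigrant_share = fit_stats["random_immigrants"] / evaluated

print(f"cache hit rate:   {cache_hit_rate:.2%}")
print(f"unique rate:      {unique_rate:.2%}")
print(f"immigrant share:  {immigrant_share:.2%}")
```

In this run the cross-validation calls and cache hits add up to the number of evaluated candidates (460 + 2 = 462), so almost every evaluation paid the full cross-validation cost.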
[8]:
history = pd.DataFrame(ga_search.history)
telemetry_columns = [
    "gen",
    "fitness",
    "fitness_max",
    "fitness_std",
    "unique_individual_ratio",
    "genotype_diversity",
    "stagnation_generations",
    "best_generation",
]
history[[column for column in telemetry_columns if column in history.columns]].tail()
[8]:
    gen   fitness  fitness_max  fitness_std  unique_individual_ratio  genotype_diversity  stagnation_generations  best_generation
 7    7  0.985883     0.990676     0.002142                     0.80            0.421053                       1                6
 8    8  0.985202     0.990143     0.001566                     0.85            0.421053                       2                6
 9    9  0.985969     0.990143     0.002267                     0.75            0.403509                       3                6
10   10  0.986397     0.990106     0.002121                     0.90            0.438596                       4                6
11   11  0.986717     0.991707     0.002035                     0.75            0.385965                       6                6

A compact plot makes the search dynamics easier to read. The first chart shows best-so-far fitness, current-generation best, and population average; the second shows the diversity signals. If the diversity curves fall toward zero early while fitness stops improving, the search is probably over-exploiting one region.
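
The same rule of thumb can be applied without a plot. The sketch below flags generations where diversity is low and fitness has stalled, using rows copied from the history tail shown earlier. `over_exploiting` is a hypothetical helper, and the `diversity_floor` and `stalled_for` thresholds are arbitrary illustrative values, not library defaults.

```python
# Rows copied from the history tail shown earlier (generations 7 to 11).
rows = [
    {"gen": 7, "genotype_diversity": 0.421053, "stagnation_generations": 1},
    {"gen": 8, "genotype_diversity": 0.421053, "stagnation_generations": 2},
    {"gen": 9, "genotype_diversity": 0.403509, "stagnation_generations": 3},
    {"gen": 10, "genotype_diversity": 0.438596, "stagnation_generations": 4},
    {"gen": 11, "genotype_diversity": 0.385965, "stagnation_generations": 6},
]


def over_exploiting(rows, diversity_floor=0.10, stalled_for=3):
    """Return generations where diversity is below the floor AND fitness has stalled."""
    return [
        row["gen"]
        for row in rows
        if row["genotype_diversity"] < diversity_floor
        and row["stagnation_generations"] >= stalled_for
    ]


flagged = over_exploiting(rows)
print("flagged generations:", flagged)
```

With the illustrative floor of 0.10 nothing is flagged for this run, because diversity stays near 0.4 throughout. A stricter floor of 0.40 would flag only generation 11.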

[9]:
ax = history.plot(x="gen", y=["fitness_best", "fitness_max", "fitness"], marker="o", figsize=(8, 4))
ax.set_title("Fitness over generations")
ax.set_xlabel("Generation")
ax.set_ylabel("ROC AUC")
[9]:
Text(0, 0.5, 'ROC AUC')
[Figure: line plot "Fitness over generations"; x-axis: Generation, y-axis: ROC AUC; series: fitness_best, fitness_max, fitness]
[10]:
diversity_columns = [
    column
    for column in ["unique_individual_ratio", "genotype_diversity"]
    if column in history.columns
]

ax = history.plot(x="gen", y=diversity_columns, marker="o", figsize=(8, 4))
ax.set_title("Population diversity over generations")
ax.set_xlabel("Generation")
ax.set_ylabel("Diversity")
[10]:
Text(0, 0.5, 'Diversity')
[Figure: line plot "Population diversity over generations"; x-axis: Generation, y-axis: Diversity; series: unique_individual_ratio, genotype_diversity]

Compare With RandomizedSearchCV

Genetic search is most useful when the search space is large, mixed-type, or expensive enough that exhaustive grids become unattractive. A lightweight RandomizedSearchCV baseline is still useful because it tells us whether the GA is paying for itself.

The parameter distributions below cover roughly the same region as the genetic search space, but they use sklearn/scipy objects instead of sklearn-genetic-opt dimensions.

[11]:
randomized_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=1),
    param_distributions={
        "n_estimators": randint(40, 141),
        "max_depth": randint(2, 13),
        "min_samples_split": randint(2, 13),
        "min_samples_leaf": randint(1, 9),
        "max_features": ["sqrt", "log2", None],
        "ccp_alpha": np.linspace(0.0, 0.03, 20),
    },
    n_iter=12,
    scoring="roc_auc",
    cv=cv,
    n_jobs=-1,
    random_state=RANDOM_STATE,
    refit=True,
)

randomized_search.fit(X_train, y_train)
randomized_metrics = evaluate_classifier(randomized_search, X_test, y_test)

pd.DataFrame(
    [baseline_metrics, randomized_metrics, ga_metrics],
    index=["baseline", "randomized_search", "ga_search"],
)
[11]:
                   accuracy  balanced_accuracy   roc_auc
baseline           0.935673           0.929761  0.991311
randomized_search  0.929825           0.925088  0.986419
ga_search          0.929825           0.925088  0.986565
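
When reading this table, it helps to keep the two search budgets in mind. The arithmetic below is a rough sketch that assumes every scored candidate costs one model fit per fold and ignores the final refit; the counts come from n_iter above and from the fit_stats_ output of the genetic search.

```python
N_SPLITS = 3  # StratifiedKFold(n_splits=3)

randomized_candidates = 12     # n_iter in RandomizedSearchCV
ga_cross_validate_calls = 460  # cross_validate_calls in ga_search.fit_stats_

# Assumption: each scored candidate costs one fit per fold.
randomized_fits = randomized_candidates * N_SPLITS
ga_fits = ga_cross_validate_calls * N_SPLITS
budget_ratio = ga_fits / randomized_fits

print(f"randomized search fits: {randomized_fits}")
print(f"genetic search fits:    {ga_fits}")
print(f"budget ratio:           {budget_ratio:.1f}x")
```

Under these assumptions the genetic search used roughly 38 times as many model fits as the randomized baseline, so the comparison here is a sanity check rather than a budget-matched benchmark.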

Feature Selection With GAFeatureSelectionCV

The same optimizer ideas can be used for feature selection. Here the individual is a binary mask instead of a hyperparameter vector.

PopulationConfig(initializer="smart") creates diverse masks with different numbers of selected features. max_features caps the number of features a valid mask may select. Masks that violate the cap are skipped rather than cross-validated, so no time is spent scoring candidates already known to be invalid.
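
The idea of screening masks before scoring them can be sketched in a few lines. This is an illustration only: `is_valid_mask` and `random_mask` are hypothetical helpers, and requiring at least one selected feature is an assumption added here, not something stated about the library.

```python
import random


def is_valid_mask(mask, max_features):
    """A mask is usable if it selects between 1 and max_features columns.

    The lower bound of 1 is an assumption made for this sketch.
    """
    selected = sum(mask)
    return 1 <= selected <= max_features


def random_mask(n_features, n_selected, rng):
    """Build a 0/1 mask with exactly n_selected features switched on."""
    chosen = set(rng.sample(range(n_features), n_selected))
    return [1 if i in chosen else 0 for i in range(n_features)]


rng = random.Random(0)
# Masks over 30 features with 0, 3, 10, 11, and 30 features selected.
population = [random_mask(30, k, rng) for k in (0, 3, 10, 11, 30)]

# Only valid masks would go on to cross-validation.
valid = [mask for mask in population if is_valid_mask(mask, max_features=10)]
print("selected counts of valid masks:", [sum(mask) for mask in valid])
```

Counting the selected bits is far cheaper than a cross-validation run, which is why filtering first saves time when many candidates exceed the cap.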

[12]:
feature_selector = GAFeatureSelectionCV(
    estimator=RandomForestClassifier(
        random_state=RANDOM_STATE,
        n_jobs=1,
        **ga_search.best_params_,
    ),
    scoring="roc_auc",
    cv=cv,
    max_features=10,
    evolution_config=EvolutionConfig(population_size=14, generations=10),
    population_config=PopulationConfig(initializer="smart"),
    runtime_config=RuntimeConfig(
        n_jobs=-1,
        parallel_backend="auto",
        use_cache=True,
        verbose=True,
    ),
    optimization_config=OptimizationConfig(
        local_search=True,
        local_search_top_k=2,
        local_search_steps=1,
        local_search_radius=0.15,
        diversity_control=True,
        diversity_threshold=0.30,
        random_immigrants_fraction=0.10,
        fitness_sharing=True,
        sharing_radius=0.40,
    ),
)

feature_selector.fit(X_train, y_train, callbacks=[TimerStopping(total_seconds=120)])
 gen evals           avg          best     div  unique  stag     mut   sel             events
---- ----- ------------- ------------- ------- ------- ----- ------- ----- ------------------
   0    14       0.93158       0.98816   0.074   1.000     0       -     - -
   1    28       0.98468       0.98816   0.074   0.500     1   0.800     3 div,imm=3,dup=2,sh
   2    28       0.98443       0.98849   0.077   0.786     0   0.800     3 div,imm=3,dup=1,sh
   3    28       0.98230       0.98849   0.074   0.643     1   0.800     3 div,imm=3,share
   4    28       0.98342       0.98849   0.074   0.857     2   0.800     3 div,imm=3,dup=1,sh
   5    28       0.98267       0.99079   0.074   0.786     0   0.800     3 div,imm=3,dup=1,sh
   6    28       0.98361       0.99079   0.077   0.714     1   0.800     3 div,imm=3,share
   7    28       0.98253       0.99079   0.077   0.786     2   0.800     3 div,imm=3,share
   8    28       0.98454       0.99381   0.077   0.857     0   0.800     3 div,imm=3,share
   9    28       0.97952       0.99381   0.077   0.857     1   0.800     3 div,imm=3,dup=2,sh
  10    28       0.98315       0.99381   0.077   0.571     2   0.800     3 div,imm=3,dup=2,sh
[12]:
GAFeatureSelectionCV(cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
                     diversity_control=True, diversity_threshold=0.3,
                     estimator=RandomForestClassifier(ccp_alpha=0.008346993411164376,
                                                      max_depth=6,
                                                      max_features='log2',
                                                      min_samples_leaf=5,
                                                      min_samples_split=7,
                                                      n_estimators=56, n_jobs=1,
                                                      random_state=42),
                     evolution_config=EvolutionConfig(population_s...
                                                            final_selection=False,
                                                            final_selection_top_k=3,
                                                            final_selection_cv=None),
                     population_config=PopulationConfig(initializer='smart',
                                                        warm_start_configs=[]),
                     population_size=14,
                     runtime_config=RuntimeConfig(n_jobs=-1,
                                                  pre_dispatch='2*n_jobs',
                                                  error_score=nan,
                                                  return_train_score=False,
                                                  use_cache=True,
                                                  parallel_backend='auto',
                                                  verbose=True),
                     scoring='roc_auc', sharing_radius=0.4)
[13]:
selected_features = X_train.columns[feature_selector.support_]
print(f"Selected {len(selected_features)} features:")
print(selected_features.tolist())

selector_metrics = evaluate_classifier(feature_selector, X_test, y_test)
pd.DataFrame(
    [baseline_metrics, randomized_metrics, ga_metrics, selector_metrics],
    index=["baseline", "randomized_search", "ga_search", "feature_selector"],
)
Selected 10 features:
['mean texture', 'mean symmetry', 'area error', 'smoothness error', 'concave points error', 'symmetry error', 'worst radius', 'worst texture', 'worst concavity', 'worst concave points']
[13]:
                   accuracy  balanced_accuracy   roc_auc
baseline           0.935673           0.929761  0.991311
randomized_search  0.929825           0.925088  0.986419
ga_search          0.929825           0.925088  0.986565
feature_selector   0.935673           0.923481  0.989486
[14]:
print(classification_report(y_test, feature_selector.predict(X_test), target_names=data.target_names))
              precision    recall  f1-score   support

   malignant       0.95      0.88      0.91        64
      benign       0.93      0.97      0.95       107

    accuracy                           0.94       171
   macro avg       0.94      0.92      0.93       171
weighted avg       0.94      0.94      0.94       171

Practical Takeaways

  • Start with PopulationConfig(initializer="smart"); it usually gives better early coverage than random initialization.

  • Use fit_stats_ to understand the cost of the run: evaluated candidates, unique candidates, cache hits, skipped invalid masks, and cross-validation calls.

  • Use history to decide whether the optimizer is exploring enough. Low diversity plus stalled fitness suggests stronger mutation, fitness sharing, random immigrants, or a larger population.

  • Use OptimizationConfig(local_search=True) when the GA already finds good regions and you want a final exploitation pass.

  • Keep a sklearn baseline such as RandomizedSearchCV nearby. It is the simplest way to check whether a more advanced optimizer is improving quality enough to justify extra search time.