Multi-Metric Hyperparameter Search on Iris

This notebook shows how GASearchCV can optimize with multiple scorers and refit the final estimator using one selected metric, with Iris as the example dataset.

Problem Setup

Iris is a compact multi-class classification dataset. Its small size keeps this notebook fast while still providing three classes for demonstrating multi-metric evaluation.

We use a Pipeline with scaling plus multinomial logistic regression. The genetic search tunes the regularization strength (C), the elastic-net mixing ratio (l1_ratio), class weighting (class_weight), and the iteration budget (max_iter).

[ ]:

import warnings
from pprint import pprint

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn_genetic import (
    EvolutionConfig,
    GASearchCV,
    OptimizationConfig,
    PopulationConfig,
    RuntimeConfig,
)
from sklearn_genetic.callbacks import ConsecutiveStopping, DeltaThreshold, TimerStopping
from sklearn_genetic.schedules import ExponentialAdapter, InverseAdapter
from sklearn_genetic.space import Categorical, Continuous, Integer

warnings.filterwarnings("ignore", category=UserWarning)

RANDOM_STATE = 42

[2]:

iris = load_iris(as_frame=True)
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.30,
    stratify=y,
    random_state=RANDOM_STATE,
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)

print(f"Training shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")
print(f"Classes: {iris.target_names.tolist()}")

Training shape: (105, 4)
Test shape: (45, 4)
Classes: ['setosa', 'versicolor', 'virginica']

Define Multiple Metrics

A multi-metric search receives a dictionary of scorers. The refit parameter decides which metric is used to choose best_params_ and refit best_estimator_.

Here we track three metrics:

accuracy: overall correctness.
balanced_accuracy: average recall across classes.
f1_macro: macro-averaged F1, useful when classes should contribute equally.

We set refit="balanced_accuracy" so the final model is selected by class-balanced behavior.

[3]:

scoring = {
    "accuracy": "accuracy",
    "balanced_accuracy": make_scorer(balanced_accuracy_score),
    "f1_macro": make_scorer(f1_score, average="macro"),
}
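To see why the three metrics can disagree, the sketch below computes them by hand on a small, deliberately imbalanced label set. The labels and predictions are made up for illustration and are not Iris results; the formulas follow the standard definitions of accuracy, mean per-class recall, and macro-averaged F1.

```python
from collections import Counter

# Hypothetical, deliberately imbalanced labels (not Iris data).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0]

classes = sorted(set(y_true))
true_counts = Counter(y_true)   # samples per true class
pred_counts = Counter(y_pred)   # samples per predicted class
hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class

# Accuracy: fraction of all samples predicted correctly.
accuracy = sum(hits.values()) / len(y_true)

# Per-class recall and precision (precision is 0 if a class is never predicted).
recall = {c: hits[c] / true_counts[c] for c in classes}
precision = {c: hits[c] / pred_counts[c] if pred_counts[c] else 0.0 for c in classes}

# Balanced accuracy: unweighted mean of per-class recall.
balanced_accuracy = sum(recall.values()) / len(classes)


def f1(p, r):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Macro F1: unweighted mean of per-class F1.
f1_macro = sum(f1(precision[c], recall[c]) for c in classes) / len(classes)

print(f"accuracy          = {accuracy:.4f}")
print(f"balanced_accuracy = {balanced_accuracy:.4f}")
print(f"f1_macro          = {f1_macro:.4f}")
```

On this toy data accuracy is 0.80, while balanced accuracy drops to about 0.67 because the two minority classes are each recalled only half the time. That gap is the reason a class-balanced metric can be a better refit target when classes should count equally.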

Configure GASearchCV

This example turns on several optimizer controls while keeping the search small enough for a notebook.

PopulationConfig(initializer="smart") improves the first generation. warm_start_configs includes a sensible logistic-regression configuration. Diversity control, fitness sharing, and local search help balance exploration and exploitation.
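As a rough illustration of the idea behind fitness sharing, the sketch below uses the textbook formulation: each candidate's raw fitness is divided by a niche count that grows with the number of neighbours closer than a sharing radius. The function names, the one-dimensional normalized positions, and the toy population are assumptions made for illustration; this is not sklearn-genetic-opt's internal implementation.

```python
def sharing_weight(distance, radius):
    """Triangular kernel: 1 at distance 0, falling linearly to 0 at the radius."""
    return 1.0 - distance / radius if distance < radius else 0.0


def shared_fitness(positions, fitness, radius):
    """Divide each raw fitness by its niche count (sum of sharing weights)."""
    shared = []
    for i, x_i in enumerate(positions):
        # The candidate itself contributes weight 1, so the count is never zero.
        niche_count = sum(sharing_weight(abs(x_i - x_j), radius) for x_j in positions)
        shared.append(fitness[i] / niche_count)
    return shared


# Hypothetical population: three near-duplicates and one isolated candidate,
# each described by a single normalized coordinate in [0, 1].
positions = [0.10, 0.12, 0.11, 0.90]
fitness = [0.98, 0.98, 0.98, 0.95]

shared = shared_fitness(positions, fitness, radius=0.40)
for pos, raw, adj in zip(positions, fitness, shared):
    print(f"position={pos:.2f}  raw={raw:.3f}  shared={adj:.3f}")
```

The crowded candidates keep the highest raw fitness but end up with a much lower shared fitness than the isolated one, which is how sharing discourages the population from collapsing onto a single region.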

[ ]:

model = Pipeline(
    [
        ("scaler", StandardScaler()),
        (
            "logistic",
            LogisticRegression(
                solver="saga",
                max_iter=1200,
                random_state=RANDOM_STATE,
            ),
        ),
    ]
)

param_grid = {
    "logistic__C": Continuous(1e-3, 30.0, distribution="log-uniform"),
    "logistic__l1_ratio": Continuous(0.0, 1.0),
    "logistic__class_weight": Categorical([None, "balanced"]),
    "logistic__max_iter": Integer(1000, 1500),
}

search = GASearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring=scoring,
    refit="balanced_accuracy",
    cv=cv,
    evolution_config=EvolutionConfig(
        population_size=12,
        generations=10,
        crossover_probability=ExponentialAdapter(initial_value=0.8, end_value=0.4, adaptive_rate=0.15),
        mutation_probability=InverseAdapter(initial_value=0.25, end_value=0.08, adaptive_rate=0.25),
        tournament_size=3,
        elitism=True,
        keep_top_k=3,
    ),
    population_config=PopulationConfig(
        initializer="smart",
        warm_start_configs=[
            {
                "logistic__C": 1.0,
                "logistic__l1_ratio": 0.0,
                "logistic__class_weight": None,
                "logistic__max_iter": 1200,
            }
        ],
    ),
    runtime_config=RuntimeConfig(n_jobs=-1, parallel_backend="auto", use_cache=True, verbose=True),
    optimization_config=OptimizationConfig(
        local_search=True,
        local_search_top_k=2,
        local_search_steps=1,
        local_search_radius=0.20,
        diversity_control=True,
        diversity_threshold=0.30,
        diversity_stagnation_generations=3,
        diversity_mutation_boost=1.8,
        random_immigrants_fraction=0.10,
        fitness_sharing=True,
        sharing_radius=0.40,
    ),
)

callbacks = [
    DeltaThreshold(threshold=0.001, generations=5, metric="fitness_best"),
    ConsecutiveStopping(generations=7, metric="fitness_best"),
    TimerStopping(total_seconds=90),
]

search.fit(X_train, y_train, callbacks=callbacks)
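The two adapters used above decay a probability from initial_value toward end_value over the generations. The sketch below reproduces the resulting schedules in plain Python, assuming the decay forms described in the sklearn-genetic-opt adapter documentation: exponential decay (p0 - pf) * exp(-rate * t) + pf and inverse decay (p0 - pf) / (1 + rate * t) + pf. The helper names are hypothetical and the library's own classes may differ in detail.

```python
import math


def exponential_decay(initial, end, rate, step):
    """Assumed exponential schedule: starts at `initial`, approaches `end`."""
    return (initial - end) * math.exp(-rate * step) + end


def inverse_decay(initial, end, rate, step):
    """Assumed inverse schedule: starts at `initial`, approaches `end` more slowly."""
    return (initial - end) / (1.0 + rate * step) + end


# Same settings as the crossover and mutation adapters in the search above.
crossover = [exponential_decay(0.8, 0.4, 0.15, gen) for gen in range(10)]
mutation = [inverse_decay(0.25, 0.08, 0.25, gen) for gen in range(10)]

for gen, (cx, mut) in enumerate(zip(crossover, mutation)):
    print(f"gen {gen}: crossover={cx:.3f}  mutation={mut:.3f}")
```

Under these assumptions both probabilities start at their initial values and shrink every generation, so the search recombines and mutates aggressively early on and becomes more conservative later.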

Inspect Best Parameters and Test Metrics

Because refit="balanced_accuracy", the best parameters and final estimator are selected by the cross-validation rank of balanced_accuracy.

[5]:

print("Refit metric:", search.refit_metric)
print("Best balanced-accuracy CV score:", round(search.best_score_, 4))
print("Best parameters:")
pprint(search.best_params_)

Refit metric: balanced_accuracy
Best balanced-accuracy CV score: 0.9798
Best parameters:
{'logistic__C': 1.0,
 'logistic__class_weight': None,
 'logistic__l1_ratio': 0.0,
 'logistic__max_iter': 1200}

[6]:

predictions = search.predict(X_test)
test_metrics = {
    "accuracy": accuracy_score(y_test, predictions),
    "balanced_accuracy": balanced_accuracy_score(y_test, predictions),
    "f1_macro": f1_score(y_test, predictions, average="macro"),
}
test_metrics

[6]:

{'accuracy': 0.9111111111111111,
 'balanced_accuracy': 0.9111111111111111,
 'f1_macro': 0.9107142857142857}
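Accuracy and balanced accuracy coincide on this test set, which is what one would expect when every class has the same number of test samples: Iris has 50 samples per class, so a stratified 30% split should hold out 15 per class. The sketch below checks the identity on made-up predictions with four errors out of 45 (41/45 ≈ 0.9111). Which samples the notebook's model actually misclassified is not known, so the error placement here is hypothetical.

```python
# 15 test samples per class, as a stratified 30% split of Iris should give.
y_true = [0] * 15 + [1] * 15 + [2] * 15

# Hypothetical predictions with 4 errors (the real error placement is unknown).
y_pred = list(y_true)
y_pred[15] = 2  # two class-1 samples predicted as class 2
y_pred[16] = 2
y_pred[30] = 1  # two class-2 samples predicted as class 1
y_pred[31] = 1

correct = [t == p for t, p in zip(y_true, y_pred)]
accuracy = sum(correct) / len(y_true)

# Balanced accuracy: mean of per-class recall.
recalls = []
for cls in (0, 1, 2):
    idx = [i for i, t in enumerate(y_true) if t == cls]
    recalls.append(sum(correct[i] for i in idx) / len(idx))
balanced_accuracy = sum(recalls) / len(recalls)

print(f"accuracy          = {accuracy:.4f}")
print(f"balanced_accuracy = {balanced_accuracy:.4f}")
```

With equal class sizes, the mean of per-class recalls is the same as the overall fraction correct, regardless of where the errors fall. The two metrics only separate when class sizes differ.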

Explore Multi-Metric cv_results_

For multi-metric searches, cv_results_ includes one set of columns for each metric. The most useful columns usually start with mean_test_, std_test_, or rank_test_.

[7]:

results = pd.DataFrame(search.cv_results_)
metric_columns = [
    "mean_test_accuracy",
    "rank_test_accuracy",
    "mean_test_balanced_accuracy",
    "rank_test_balanced_accuracy",
    "mean_test_f1_macro",
    "rank_test_f1_macro",
]
parameter_columns = [column for column in results.columns if column.startswith("param_")]

results[metric_columns + parameter_columns].sort_values("rank_test_balanced_accuracy").head()

[7]:

	mean_test_accuracy	rank_test_accuracy	mean_test_balanced_accuracy	rank_test_balanced_accuracy	mean_test_f1_macro	rank_test_f1_macro	param_logistic__C	param_logistic__l1_ratio	param_logistic__class_weight	param_logistic__max_iter
0	0.980952	1	0.979798	1	0.980529	1	1.000000	0.000000	None	1200
1	0.980952	1	0.979798	1	0.980529	1	2.111163	0.031362	None	1185
3	0.980952	1	0.979798	1	0.980529	1	4.680447	0.521491	None	1102
4	0.980952	1	0.979798	1	0.980529	1	0.687406	0.391485	balanced	1258
12	0.980952	1	0.979798	1	0.980529	1	1.000000	0.000000	None	1119
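Several candidates share rank 1 in the table above because their mean scores are identical. This is consistent with the "min" ranking convention that scikit-learn uses for rank_test_ columns, where tied scores all receive the smallest rank of the tied group. A minimal pure-Python sketch of that rule, with made-up scores and a hypothetical helper name:

```python
def min_rank_descending(scores):
    """Rank scores so the highest gets rank 1; ties share the smallest rank."""
    # A score's rank is one plus the number of strictly better scores.
    return [1 + sum(other > s for other in scores) for s in scores]


# Hypothetical mean test scores for five candidates (three tied at the top).
mean_test_scores = [0.979798, 0.979798, 0.952381, 0.971429, 0.979798]
ranks = min_rank_descending(mean_test_scores)
print(ranks)  # [1, 1, 5, 4, 1]
```

Because ties share a rank, sorting by a rank column alone does not say which of the tied rows comes first; the order among tied candidates depends on the sort, not on the scores.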

Read Optimizer Telemetry

Even with several metrics tracked, the optimizer works with a single scalar fitness: the selected refit metric. Telemetry helps explain how the optimizer moved through the search space while optimizing that metric.

[8]:

search.fit_stats_

[8]:

{'evaluated_candidates': 110,
 'unique_candidates': 101,
 'cross_validate_calls': 101,
 'cache_hits': 9,
 'duplicate_candidates': 0,
 'skipped_invalid_candidates': 0,
 'population_parallel_batches': 6,
 'population_serial_batches': 0,
 'random_immigrants': 6,
 'local_refinement_candidates': 2}
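A few quick ratios make these counters easier to read. The sketch below copies the relevant values from the output above and derives the cache hit rate and an estimate of the number of cross-validation model fits. The estimate assumes each cross_validate call fits one model per fold (three folds here) and ignores the final refit; that is an assumption about how the counter is defined, not something the output states.

```python
# Values copied from the fit_stats_ output above.
fit_stats = {
    "evaluated_candidates": 110,
    "unique_candidates": 101,
    "cross_validate_calls": 101,
    "cache_hits": 9,
}
n_splits = 3  # StratifiedKFold(n_splits=3) from the setup cell

# Share of evaluations answered from the cache instead of re-running CV.
cache_hit_rate = fit_stats["cache_hits"] / fit_stats["evaluated_candidates"]

# Assumed: one model fit per fold per cross_validate call.
cv_model_fits = fit_stats["cross_validate_calls"] * n_splits

# In this run, unique candidates plus cache hits account for every evaluation.
counts_add_up = (
    fit_stats["unique_candidates"] + fit_stats["cache_hits"]
    == fit_stats["evaluated_candidates"]
)

print(f"cache hit rate: {cache_hit_rate:.1%}")
print(f"estimated CV model fits: {cv_model_fits}")
print(f"unique + cache hits == evaluated: {counts_add_up}")
```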

[9]:

history = pd.DataFrame(search.history)
telemetry_columns = [
    "gen",
    "fitness",
    "fitness_max",
    "fitness_std",
    "unique_individual_ratio",
    "genotype_diversity",
    "stagnation_generations",
    "best_generation",
]
history[[column for column in telemetry_columns if column in history.columns]].tail()

[9]:

	gen	fitness	fitness_max	fitness_std	unique_individual_ratio	genotype_diversity	stagnation_generations
0	0	0.751964	0.979798	0.297005	1.000000	0.772727	0
1	1	0.954335	0.979798	0.032981	0.750000	0.477273	1
2	2	0.938692	0.979798	0.035403	0.666667	0.295455	2
3	3	0.947040	0.979798	0.037654	0.750000	0.454545	3
4	4	0.971310	0.979798	0.007982	0.916667	0.477273	5
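The history has only five rows even though generations=10, and fitness_max never moves from 0.979798. One plausible reading is that a plateau-style stopping rule fired once five generations had passed without a meaningful change. The sketch below is a hypothetical re-implementation of such a rule, not the library's DeltaThreshold, and it uses the fitness_max column as a stand-in for the metric the callbacks monitor.

```python
def first_plateau_generation(values, window, threshold):
    """Return the first index whose last `window` values span <= threshold."""
    for end in range(window - 1, len(values)):
        recent = values[end - window + 1 : end + 1]
        if max(recent) - min(recent) <= threshold:
            return end
    return None  # no plateau found


# fitness_max per generation, copied from the history table above.
fitness_max = [0.979798, 0.979798, 0.979798, 0.979798, 0.979798]
stop_at = first_plateau_generation(fitness_max, window=5, threshold=0.001)
print("plateau detected at generation:", stop_at)

# A made-up, steadily improving run never satisfies the rule.
improving = [0.90, 0.92, 0.94, 0.96, 0.98]
print("improving run:", first_plateau_generation(improving, window=5, threshold=0.001))
```

Under this reading the rule is satisfied at generation 4, the last row of the history, which would explain the early stop.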

[10]:

ax = history.plot(x="gen", y=["fitness_best", "fitness_max", "fitness"], marker="o", figsize=(8, 4))
ax.set_title("Balanced-accuracy fitness over generations")
ax.set_xlabel("Generation")
ax.set_ylabel("Balanced accuracy")

[10]:

Text(0, 0.5, 'Balanced accuracy')

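[Figure: line plot "Balanced-accuracy fitness over generations"; x-axis: Generation; y-axis: Balanced accuracy; series: fitness_best, fitness_max, fitness]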

Change the Refit Metric

Each metric has its own rank_test_ column in cv_results_, so the same fitted search can be read under a different ranking without rerunning it. The example below shows the top-ranked row for each metric. In a real workflow, choose refit before fitting, based on the metric that best matches the product or scientific goal.

[11]:

best_rows = []
for metric_name in ["accuracy", "balanced_accuracy", "f1_macro"]:
    row = results.sort_values(f"rank_test_{metric_name}").iloc[0]
    best_rows.append(
        {
            "metric": metric_name,
            "mean_test_score": row[f"mean_test_{metric_name}"],
            "rank": row[f"rank_test_{metric_name}"],
            "C": row["param_logistic__C"],
            "l1_ratio": row["param_logistic__l1_ratio"],
            "class_weight": row["param_logistic__class_weight"],
        }
    )

pd.DataFrame(best_rows)

[11]:

	metric	mean_test_score	rank	C	class_weight
0	accuracy	0.980952	1	1.0	None
1	balanced_accuracy	0.979798	1	1.0	None
2	f1_macro	0.980529	1	1.0	None
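In this run every metric points to the same candidate, so the choice of refit would not have changed the selected parameters. On less separable data the rankings can diverge. The sketch below uses a made-up results table (hypothetical candidates and invented scores) to show how the selected candidate changes with the metric.

```python
# Hypothetical cv_results_-style rows; all scores are invented for illustration.
candidates = [
    {"name": "A", "mean_test_accuracy": 0.94,
     "mean_test_balanced_accuracy": 0.88, "mean_test_f1_macro": 0.89},
    {"name": "B", "mean_test_accuracy": 0.92,
     "mean_test_balanced_accuracy": 0.91, "mean_test_f1_macro": 0.90},
    {"name": "C", "mean_test_accuracy": 0.93,
     "mean_test_balanced_accuracy": 0.90, "mean_test_f1_macro": 0.92},
]


def best_candidate(rows, metric):
    """Return the name of the row with the highest mean test score for `metric`."""
    return max(rows, key=lambda row: row[f"mean_test_{metric}"])["name"]


winners = {
    metric: best_candidate(candidates, metric)
    for metric in ("accuracy", "balanced_accuracy", "f1_macro")
}
print(winners)  # {'accuracy': 'A', 'balanced_accuracy': 'B', 'f1_macro': 'C'}
```

Here each metric picks a different winner, which is the situation where the refit choice matters for the final model.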

Practical Notes

With multi-metric scoring, set refit to the metric that should define the final model.
best_score_, best_params_, and best_estimator_ follow the refit metric, not every metric at once.
Use cv_results_ to inspect tradeoffs between metrics after fitting.
Use fit_stats_ and history to understand optimizer cost, duplicate candidates, diversity, stagnation, and convergence behavior.