Multi-Metric Hyperparameter Search

GASearchCV can track several metrics simultaneously while optimizing for one. This is useful when you care about several properties of the model (for example, both accuracy and class-balanced recall) but need to select a single best configuration at the end.

How Multi-Metric Search Works

Pass a dictionary to scoring where each key is a metric name and each value is a scorer string or a callable built with make_scorer. Set refit to the name of the metric that should determine best_params_ and refit best_estimator_.

During the genetic search, candidates are ranked and selected by the refit metric only. Every metric is still evaluated for every candidate in every generation and stored in cv_results_, so you can inspect tradeoffs after fitting without rerunning the search.

from sklearn.metrics import balanced_accuracy_score, f1_score, make_scorer

scoring = {
    "accuracy": "accuracy",
    "balanced_accuracy": make_scorer(balanced_accuracy_score),
    "f1_macro": make_scorer(f1_score, average="macro"),
}
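
The selection rule described above can be illustrated with a small, library-free sketch. The candidate dictionaries and the pick_best helper are hypothetical; they only mimic the idea that every metric is recorded while a single refit metric decides the winner, and they are not GASearchCV internals.

```python
# Hypothetical candidates: each one carries a score for every tracked metric.
candidates = [
    {"params": {"C": 0.1},
     "scores": {"accuracy": 0.95, "balanced_accuracy": 0.90, "f1_macro": 0.91}},
    {"params": {"C": 1.0},
     "scores": {"accuracy": 0.93, "balanced_accuracy": 0.94, "f1_macro": 0.93}},
    {"params": {"C": 10.0},
     "scores": {"accuracy": 0.94, "balanced_accuracy": 0.92, "f1_macro": 0.95}},
]


def pick_best(candidates, refit):
    """Return the candidate with the highest score on the refit metric."""
    return max(candidates, key=lambda cand: cand["scores"][refit])


# All three metrics are stored, but only the refit metric picks the winner.
best = pick_best(candidates, refit="balanced_accuracy")
print(best["params"])  # {'C': 1.0}
```

Changing the refit argument in this toy example selects a different candidate, which is why the choice of refit matters even though all metrics are recorded.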

Full Example

This example tunes a logistic-regression pipeline on the Iris dataset, tracking three metrics and selecting the final model by balanced accuracy.

Setup

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    make_scorer,
)
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn_genetic import EvolutionConfig, GASearchCV, PopulationConfig, RuntimeConfig
from sklearn_genetic.space import Categorical, Continuous, Integer

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

Define the scorers:

scoring = {
    "accuracy": "accuracy",
    "balanced_accuracy": make_scorer(balanced_accuracy_score),
    "f1_macro": make_scorer(f1_score, average="macro"),
}

Define the search:

model = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic", LogisticRegression(solver="saga", max_iter=1200, random_state=42)),
])

param_grid = {
    "logistic__C": Continuous(1e-3, 30.0, distribution="log-uniform"),
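    # Elastic-net mixing ratio. On older scikit-learn releases this has no
    # effect unless the estimator is created with penalty="elasticnet".
    "logistic__l1_ratio": Continuous(0.0, 1.0),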
    "logistic__class_weight": Categorical([None, "balanced"]),
    "logistic__max_iter": Integer(1000, 1500),
}

search = GASearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring=scoring,
    refit="balanced_accuracy",   # select and refit on this metric
    cv=cv,
    evolution_config=EvolutionConfig(population_size=15, generations=12),
    population_config=PopulationConfig(
        initializer="smart",
        warm_start_configs=[{
            "logistic__C": 1.0,
            "logistic__l1_ratio": 0.0,
            "logistic__class_weight": None,
            "logistic__max_iter": 1200,
        }],
    ),
    runtime_config=RuntimeConfig(n_jobs=-1, use_cache=True),
)

search.fit(X_train, y_train)

After fitting, best_score_ and best_params_ reflect the refit metric:

print("Refit metric:", search.refit)
print("Best balanced-accuracy CV score:", round(search.best_score_, 4))
print("Best parameters:", search.best_params_)

Evaluate all metrics on the holdout set:

predictions = search.predict(X_test)

print("Accuracy:          ", round(accuracy_score(y_test, predictions), 4))
print("Balanced accuracy: ", round(balanced_accuracy_score(y_test, predictions), 4))
print("F1 macro:          ", round(f1_score(y_test, predictions, average="macro"), 4))

Inspect `cv_results_`

For multi-metric searches cv_results_ gains one set of columns per metric: mean_test_<metric>, std_test_<metric>, and rank_test_<metric>.

import pandas as pd

results = pd.DataFrame(search.cv_results_)

metric_cols = [
    "mean_test_accuracy",
    "rank_test_accuracy",
    "mean_test_balanced_accuracy",
    "rank_test_balanced_accuracy",
    "mean_test_f1_macro",
    "rank_test_f1_macro",
]
param_cols = [col for col in results.columns if col.startswith("param_")]

print(
    results[metric_cols + param_cols]
    .sort_values("rank_test_balanced_accuracy")
    .head()
)
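
As a rough illustration of what the mean_test_, std_test_, and rank_test_ columns represent, the sketch below derives them from per-fold scores using only the standard library. The fold scores and the summarize helper are hypothetical. The population standard deviation and the shared-lowest-rank handling of ties follow the convention of scikit-learn's search classes and are assumed, not confirmed, to carry over to GASearchCV.

```python
from statistics import fmean, pstdev

# Hypothetical per-fold scores for one metric: one inner list per candidate.
fold_scores = [
    [0.875, 0.9375, 1.0],
    [1.0, 1.0, 1.0],
    [0.9375, 0.9375, 0.9375],
]


def summarize(fold_scores):
    """Return (means, stds, ranks) across folds for each candidate."""
    means = [fmean(scores) for scores in fold_scores]
    # Population standard deviation (ddof=0) across folds.
    stds = [pstdev(scores) for scores in fold_scores]
    # Rank 1 is the highest mean; tied candidates share the lowest rank.
    ranks = [1 + sum(other > mean for other in means) for mean in means]
    return means, stds, ranks


means, stds, ranks = summarize(fold_scores)
print(ranks)  # [2, 1, 2]
```

The first and third candidates have the same mean but different spread across folds, so std_test_<metric> can help separate candidates that rank_test_<metric> treats as tied.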

Find the best configuration for each metric without rerunning the search:

for metric in ["accuracy", "balanced_accuracy", "f1_macro"]:
    best = results.sort_values(f"rank_test_{metric}").iloc[0]
    print(f"\nBest by {metric}:")
    print(f"  score = {best[f'mean_test_{metric}']:.4f}")
    print(f"  params = {best['params']}")

This is useful when you want to compare what configuration each metric would select after a single search run.
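
One possible way to look at tradeoffs across metrics, beyond the per-metric winners, is to keep only the candidates that no other candidate beats on every metric. The sketch below is a hedged illustration of that idea; the rows, the dominates and pareto_front helpers, and the name field are hypothetical and not part of GASearchCV.

```python
def dominates(a, b, metrics):
    """True if a is at least as good as b everywhere and strictly better once."""
    at_least_as_good = all(a[m] >= b[m] for m in metrics)
    strictly_better = any(a[m] > b[m] for m in metrics)
    return at_least_as_good and strictly_better


def pareto_front(rows, metrics):
    """Keep the rows that are not dominated by any other row."""
    return [
        row for row in rows
        if not any(dominates(other, row, metrics) for other in rows)
    ]


# Hypothetical rows shaped like records taken from cv_results_.
metrics = ["mean_test_accuracy", "mean_test_f1_macro"]
rows = [
    {"name": "A", "mean_test_accuracy": 0.95, "mean_test_f1_macro": 0.91},
    {"name": "B", "mean_test_accuracy": 0.93, "mean_test_f1_macro": 0.94},
    {"name": "C", "mean_test_accuracy": 0.92, "mean_test_f1_macro": 0.90},
]

front = pareto_front(rows, metrics)
print([row["name"] for row in front])  # ['A', 'B']
```

With a pandas DataFrame such as results, rows of this shape can be obtained with results.to_dict("records").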

Change the Refit Metric

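refit must be set before calling fit, because it controls which metric the GA optimizes during the search, not just which score is reported. To select on a different metric, run the search again with a different refit value.

For a quick comparison without rerunning the search, you can inspect cv_results_ manually and fit a new estimator with the best parameters for any metric. Keep in mind that the candidates were evolved toward the refit metric, so this gives the best configuration among those explored, not necessarily the one a search on that metric would have found: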

from sklearn.base import clone

# Best parameters by f1_macro, even though the search was refit on balanced_accuracy
best_row = results.sort_values("rank_test_f1_macro").iloc[0]
alt_params = best_row["params"]

alt_model = clone(model).set_params(**alt_params)
alt_model.fit(X_train, y_train)
print("F1-macro selected model accuracy:", accuracy_score(y_test, alt_model.predict(X_test)))

Practical Notes

The refit metric is the only one used to rank candidates during the genetic search. The other metrics are recorded but do not influence selection, crossover, or mutation.
best_estimator_, best_params_, and best_score_ always refer to the refit metric.
Use cv_results_ to inspect tradeoffs after fitting. There is no need to rerun the search to see how different metrics would rank the same candidates.
For regression, use negative scorers: "neg_root_mean_squared_error", "neg_mean_absolute_error". The GA maximizes by default, so these scorers produce correct rankings when used as the refit target.
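
The last note can be checked with a small, library-free sketch. The neg_rmse helper and the two sets of predictions are hypothetical; the point is only that negating an error turns "lower is better" into "higher is better", so a maximizer picks the model with the smallest error.

```python
from math import sqrt


def neg_rmse(y_true, y_pred):
    """Negated root mean squared error: higher (closer to 0) is better."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return -sqrt(mse)


y_true = [3.0, 5.0, 7.0]

# Hypothetical predictions from two candidate models.
predictions = {
    "close": [3.0, 5.0, 8.0],   # small error
    "far": [0.0, 5.0, 10.0],    # large error
}

scores = {name: neg_rmse(y_true, pred) for name, pred in predictions.items()}

# Maximizing the negated error selects the model with the lowest RMSE.
best = max(scores, key=scores.get)
print(best)  # close
```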