.. _multi-metric:

Multi-Metric Hyperparameter Search
====================================

``GASearchCV`` can track several metrics simultaneously while optimizing for one. This is useful when you care about multiple properties of the model (for example, both accuracy and class-balanced recall) but need to select a single best configuration at the end.

How Multi-Metric Search Works
------------------------------

Pass a dictionary to ``scoring`` where each key is a metric name and each value is a scorer string or a callable built with ``make_scorer``. Set ``refit`` to the name of the metric that should determine ``best_params_`` and refit ``best_estimator_``.

During the genetic search, candidates are ranked and selected by the ``refit`` metric only. Every metric is still evaluated at each generation and stored in ``cv_results_``, so you can inspect tradeoffs after fitting without rerunning the search.

.. code:: python3

    from sklearn.metrics import balanced_accuracy_score, f1_score, make_scorer

    scoring = {
        "accuracy": "accuracy",
        "balanced_accuracy": make_scorer(balanced_accuracy_score),
        "f1_macro": make_scorer(f1_score, average="macro"),
    }

Full Example
------------

This example tunes a logistic-regression pipeline on the Iris dataset, tracking three metrics and selecting the final model by balanced accuracy.

Setup
^^^^^

.. code:: python3

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (
        accuracy_score,
        balanced_accuracy_score,
        f1_score,
        make_scorer,
    )
    from sklearn.model_selection import StratifiedKFold, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    from sklearn_genetic import EvolutionConfig, GASearchCV, PopulationConfig, RuntimeConfig
    from sklearn_genetic.space import Categorical, Continuous, Integer

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

Define the scorers:

.. code:: python3

    scoring = {
        "accuracy": "accuracy",
        "balanced_accuracy": make_scorer(balanced_accuracy_score),
        "f1_macro": make_scorer(f1_score, average="macro"),
    }

Define the search:

.. code:: python3

    model = Pipeline([
        ("scaler", StandardScaler()),
        ("logistic", LogisticRegression(solver="saga", max_iter=1200, random_state=42)),
    ])

    param_grid = {
        "logistic__C": Continuous(1e-3, 30.0, distribution="log-uniform"),
        "logistic__l1_ratio": Continuous(0.0, 1.0),
        "logistic__class_weight": Categorical([None, "balanced"]),
        "logistic__max_iter": Integer(1000, 1500),
    }

    search = GASearchCV(
        estimator=model,
        param_grid=param_grid,
        scoring=scoring,
        refit="balanced_accuracy",  # select and refit on this metric
        cv=cv,
        evolution_config=EvolutionConfig(population_size=15, generations=12),
        population_config=PopulationConfig(
            initializer="smart",
            warm_start_configs=[{
                "logistic__C": 1.0,
                "logistic__l1_ratio": 0.0,
                "logistic__class_weight": None,
                "logistic__max_iter": 1200,
            }],
        ),
        runtime_config=RuntimeConfig(n_jobs=-1, use_cache=True),
    )

    search.fit(X_train, y_train)

After fitting, ``best_score_`` and ``best_params_`` reflect the ``refit`` metric:

.. code:: python3

    print("Refit metric:", search.refit)
    print("Best balanced-accuracy CV score:", round(search.best_score_, 4))
    print("Best parameters:", search.best_params_)

Evaluate all metrics on the holdout set:

.. code:: python3

    predictions = search.predict(X_test)

    print("Accuracy:          ", round(accuracy_score(y_test, predictions), 4))
    print("Balanced accuracy: ", round(balanced_accuracy_score(y_test, predictions), 4))
    print("F1 macro:          ", round(f1_score(y_test, predictions, average="macro"), 4))

Inspect ``cv_results_``
------------------------

For multi-metric searches, ``cv_results_`` gains one set of columns per metric: ``mean_test_<metric>``, ``std_test_<metric>``, and ``rank_test_<metric>``.

.. code:: python3

    import pandas as pd

    results = pd.DataFrame(search.cv_results_)

    metric_cols = [
        "mean_test_accuracy",
        "rank_test_accuracy",
        "mean_test_balanced_accuracy",
        "rank_test_balanced_accuracy",
        "mean_test_f1_macro",
        "rank_test_f1_macro",
    ]
    param_cols = [col for col in results.columns if col.startswith("param_")]

    print(
        results[metric_cols + param_cols]
        .sort_values("rank_test_balanced_accuracy")
        .head()
    )

Find the best configuration for each metric without rerunning the search:

.. code:: python3

    for metric in ["accuracy", "balanced_accuracy", "f1_macro"]:
        best = results.sort_values(f"rank_test_{metric}").iloc[0]
        print(f"\nBest by {metric}:")
        print(f"  score  = {best[f'mean_test_{metric}']:.4f}")
        print(f"  params = {best['params']}")

This is useful when you want to compare which configuration each metric would select after a single search run.

Change the Refit Metric
------------------------

``refit`` must be set before calling ``fit`` because it controls which metric the GA optimizes during the search, not just which score is returned. If you want a different refit metric, run the search again with a different ``refit`` value.

For a quick comparison without rerunning the search, you can manually inspect ``cv_results_`` and build a new estimator from the best parameters for any metric:

.. code:: python3

    from sklearn.base import clone

    # Best parameters by f1_macro, even though the search refitted on balanced_accuracy
    best_row = results.sort_values("rank_test_f1_macro").iloc[0]
    alt_params = best_row["params"]

    alt_model = clone(model).set_params(**alt_params)
    alt_model.fit(X_train, y_train)

    print("F1-macro selected model accuracy:", accuracy_score(y_test, alt_model.predict(X_test)))

Practical Notes
---------------

* The ``refit`` metric is the only one used to rank candidates during the genetic search. The other metrics are recorded but do not influence selection, crossover, or mutation.
* ``best_estimator_``, ``best_params_``, and ``best_score_`` always refer to the ``refit`` metric.
* Use ``cv_results_`` to inspect tradeoffs after fitting. There is no need to rerun the search to see how different metrics would rank the same candidates.
* For regression, use negated error scorers such as ``"neg_root_mean_squared_error"`` and ``"neg_mean_absolute_error"``. The GA maximizes by default, so these scorers produce correct rankings when used as the ``refit`` target.
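
The sign convention for regression scorers can be illustrated without any library. The sketch below is a minimal, standard-library example with made-up targets and predictions; ``candidates``, ``neg_rmse``, and ``neg_mae`` are hypothetical names that only mimic the idea behind the built-in negated scorers and are not part of ``GASearchCV``. It shows that taking the maximum of a negated error selects the smallest error, and that two metrics can prefer different candidates:

.. code:: python3

    import math

    # Hypothetical targets and predictions from two candidates (made-up numbers).
    y_true = [3.0, 5.0, 2.0, 7.0]
    candidates = {
        "candidate_a": [3.0, 5.0, 2.0, 9.0],      # one large miss
        "candidate_b": [3.75, 5.75, 2.75, 7.75],  # several small misses
    }

    def neg_rmse(y, y_hat):
        """Negated root mean squared error: closer to 0 is better."""
        mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
        return -math.sqrt(mse)

    def neg_mae(y, y_hat):
        """Negated mean absolute error: closer to 0 is better."""
        return -sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

    scores = {
        name: {"neg_rmse": neg_rmse(y_true, preds), "neg_mae": neg_mae(y_true, preds)}
        for name, preds in candidates.items()
    }

    # A maximizer picks the largest score, which is the smallest error.
    best_by_rmse = max(scores, key=lambda name: scores[name]["neg_rmse"])
    best_by_mae = max(scores, key=lambda name: scores[name]["neg_mae"])

    print("Best by neg_rmse:", best_by_rmse)
    print("Best by neg_mae: ", best_by_mae)

With these numbers the negated RMSE prefers ``candidate_b`` while the negated MAE prefers ``candidate_a``, which is the kind of disagreement that makes the choice of ``refit`` matter.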