Pipeline Tuning with GASearchCV

scikit-learn Pipeline objects let you chain preprocessing steps and a final estimator into a single object. GASearchCV tunes a pipeline the same way it tunes a plain estimator; the only difference is the parameter naming convention.

Parameter Naming Inside a Pipeline

Pipeline parameters follow the pattern stepname__paramname (two underscores). The step name is the string you assigned when creating the pipeline. For example, a pipeline built with:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor()),
])

exposes parameters like regressor__n_estimators, regressor__learning_rate, and scaler__with_mean. These are the same names you would pass in param_grid to GridSearchCV or in param_distributions to RandomizedSearchCV.
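To see every name a pipeline exposes, one option is to list the keys returned by get_params(). The following is a minimal sketch that rebuilds the same pipeline; the variable name tunable_names is only illustrative.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor()),
])

# get_params() returns the step objects themselves as well as their nested
# parameters; keep only the nested "stepname__paramname" entries.
tunable_names = sorted(name for name in pipe.get_params() if "__" in name)

for name in tunable_names:
    print(name)
```

Any name printed here is a valid key for the search space; anything else will be rejected when the parameters are set on the pipeline.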

Full Example: Gradient Boosting Regression Pipeline

This example tunes a GradientBoostingRegressor inside a preprocessing pipeline on the diabetes regression dataset. The search space has six parameters with known interactions (learning_rate × n_estimators, max_depth × min_samples_leaf).

Setup

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn_genetic import EvolutionConfig, GASearchCV, PopulationConfig, RuntimeConfig
from sklearn_genetic.space import Categorical, Continuous, Integer

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

cv = KFold(n_splits=4, shuffle=True, random_state=42)

Fit a pipeline with default hyperparameters and score it on the holdout set to get a comparison point:

def make_pipeline(**kwargs):
    return Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", GradientBoostingRegressor(random_state=42, **kwargs)),
    ])

baseline = make_pipeline()
baseline.fit(X_train, y_train)

baseline_pred = baseline.predict(X_test)  # predict once, reuse for both metrics
baseline_r2 = r2_score(y_test, baseline_pred)
baseline_rmse = mean_squared_error(y_test, baseline_pred) ** 0.5
print(f"Baseline R²: {baseline_r2:.4f}  RMSE: {baseline_rmse:.2f}")

Define the Search Space

Note the regressor__ prefix on every key:

param_grid = {
    "regressor__n_estimators": Integer(50, 200),
    "regressor__learning_rate": Continuous(0.01, 0.2, distribution="log-uniform"),
    "regressor__max_depth": Integer(1, 5),
    "regressor__min_samples_leaf": Integer(1, 12),
    "regressor__subsample": Continuous(0.6, 1.0),
    "regressor__loss": Categorical(["squared_error", "absolute_error", "huber"]),
}

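The log-uniform distribution samples learning_rate uniformly on a logarithmic scale, so small values are drawn more often than they would be under a plain uniform distribution. This matches the prior that differences between learning rates matter multiplicatively and that the small end of the range deserves more of the search budget.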
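The effect of log-uniform sampling can be checked with a few lines of plain Python. This is an illustration of the distribution only, not the library's sampling code; sample_log_uniform is a hypothetical helper written for this sketch.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(0.01, 0.2, rng) for _ in range(10_000)]

# The geometric midpoint splits a log-uniform range into two equally likely halves.
geometric_mid = math.sqrt(0.01 * 0.2)
share_below_geometric_mid = sum(s < geometric_mid for s in samples) / len(samples)

# Share of draws that land in the lower half of the *linear* range (below 0.105).
share_below_linear_mid = sum(s < 0.105 for s in samples) / len(samples)

print(f"below geometric midpoint ({geometric_mid:.4f}): {share_below_geometric_mid:.2f}")
print(f"below linear midpoint (0.1050): {share_below_linear_mid:.2f}")
```

Roughly half of the draws fall below the geometric midpoint, while the share below the linear midpoint is ln(10.5)/ln(20) in theory, or about 0.78. A plain uniform distribution would put only half of its draws below 0.105.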

Configure and Run

search = GASearchCV(
    estimator=make_pipeline(),
    param_grid=param_grid,
    cv=cv,
    scoring="neg_root_mean_squared_error",
    evolution_config=EvolutionConfig(
        population_size=20,
        generations=15,
        elitism=True,
        keep_top_k=3,
    ),
    population_config=PopulationConfig(
        initializer="smart",
        warm_start_configs=[
            {
                "regressor__n_estimators": 100,
                "regressor__learning_rate": 0.1,
                "regressor__max_depth": 3,
                "regressor__min_samples_leaf": 4,
                "regressor__subsample": 0.8,
                "regressor__loss": "squared_error",
            }
        ],
    ),
    runtime_config=RuntimeConfig(n_jobs=-1, parallel_backend="auto", use_cache=True),
)

search.fit(X_train, y_train)

print("Best CV negative RMSE:", round(search.best_score_, 4))
print("Best parameters:", search.best_params_)

Each warm_start_configs entry seeds the initial population with a known-good configuration. The optimizer then explores variations around it alongside the candidates drawn by Latin hypercube sampling (LHS).
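Because a warm-start entry has to name every parameter and stay inside the search space, it can be worth checking it before starting a long run. The sketch below is plain Python and independent of the library: bounds is a hand-written mirror of the search space, and find_warm_start_problems is a hypothetical helper. How the library itself treats out-of-range entries is not covered here.

```python
# Hand-written mirror of the search space: (low, high) for numeric ranges,
# a list of choices for categorical parameters.
bounds = {
    "regressor__n_estimators": (50, 200),
    "regressor__learning_rate": (0.01, 0.2),
    "regressor__max_depth": (1, 5),
    "regressor__min_samples_leaf": (1, 12),
    "regressor__subsample": (0.6, 1.0),
    "regressor__loss": ["squared_error", "absolute_error", "huber"],
}

warm_start = {
    "regressor__n_estimators": 100,
    "regressor__learning_rate": 0.1,
    "regressor__max_depth": 3,
    "regressor__min_samples_leaf": 4,
    "regressor__subsample": 0.8,
    "regressor__loss": "squared_error",
}


def find_warm_start_problems(config, bounds):
    """Return a list of problems; an empty list means the config fits the space."""
    problems = [f"missing parameter: {name}" for name in sorted(set(bounds) - set(config))]
    for name, value in config.items():
        if name not in bounds:
            problems.append(f"unknown parameter: {name}")
        elif isinstance(bounds[name], tuple):
            low, high = bounds[name]
            if not low <= value <= high:
                problems.append(f"{name}={value} outside [{low}, {high}]")
        elif value not in bounds[name]:
            problems.append(f"{name}={value!r} not in {bounds[name]}")
    return problems


print(find_warm_start_problems(warm_start, bounds))
```

An empty list means the entry names every parameter and stays within range.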

Evaluate on the Holdout Set

After fitting, the search object behaves like a fitted pipeline. Call predict directly:

ga_pred = search.predict(X_test)  # predict once, reuse for both metrics
ga_r2 = r2_score(y_test, ga_pred)
ga_rmse = mean_squared_error(y_test, ga_pred) ** 0.5

print(f"Baseline → R²: {baseline_r2:.4f}  RMSE: {baseline_rmse:.2f}")
print(f"GA tuned → R²: {ga_r2:.4f}  RMSE: {ga_rmse:.2f}")

Inspect Evaluation Cost

print(search.fit_stats_)
# evaluated_candidates: total individuals presented to the evaluator
# unique_candidates:    distinct configurations actually cross-validated
# cache_hits:           re-used scores from the fitness cache
# random_immigrants:    individuals injected by diversity control

Common Pitfalls

Wrong step name in param_grid

The step name must exactly match what you passed to Pipeline([...]). If your pipeline uses ("clf", LogisticRegression()), the parameter is clf__C, not logistic__C or C.
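A misspelled key can be caught before the search starts by comparing the keys against get_params(). This is a minimal sketch; unknown_param_names is a hypothetical helper, not part of scikit-learn or sklearn_genetic.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])


def unknown_param_names(estimator, names):
    """Return the names that the estimator does not expose via get_params()."""
    valid = set(estimator.get_params())
    return sorted(name for name in names if name not in valid)


candidate_keys = ["clf__C", "logistic__C", "C"]
print(unknown_param_names(pipe, candidate_keys))  # ['C', 'logistic__C']
```

Passing the keys of the search-space dictionary to the helper should return an empty list before fit is called.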

Tuning preprocessor parameters

You can also tune scaler__with_std, pca__n_components (given a step named pca), or any other preprocessor parameter using the same stepname__paramname pattern. When a preprocessor parameter changes, the transformation changes, so the GA effectively searches the combined (preprocessing + model) space.
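To make the point concrete, the sketch below changes pca__n_components through the pipeline and checks how many features reach the final estimator. It uses only scikit-learn; Ridge stands in for the regressor to keep the example quick, and the step name pca is an assumption of this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("regressor", Ridge()),
])

widths = {}
for n_components in (2, 5):
    # The same stepname__paramname key a search would use.
    pipe.set_params(pca__n_components=n_components)
    pipe.fit(X, y)
    # pipe[:-1] is the preprocessing part: every step except the regressor.
    widths[n_components] = pipe[:-1].transform(X).shape[1]

print(widths)  # {2: 2, 5: 5}
```

In a search space the same key would simply appear next to the model parameters, for example "pca__n_components": Integer(2, 8).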

Negative scorers for regression

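The scikit-learn convention is that higher scores are better, so error metrics are exposed as negated scorers such as "neg_root_mean_squared_error" and "neg_mean_absolute_error". GASearchCV uses criteria="max" by default, which is the correct setting for these scorers. The reported best_score_ is then negative; flip its sign to read it as an error.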
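The sign convention is easy to verify with scikit-learn alone. The sketch below cross-validates a default pipeline with a negated scorer and converts the result back to an RMSE; it mirrors the dataset and fold setup of the full example but does not run a search.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
cv = KFold(n_splits=4, shuffle=True, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor(random_state=42)),
])

# Each fold's score is the RMSE multiplied by -1, so values closer to 0 are better.
neg_rmse = cross_val_score(pipe, X, y, cv=cv, scoring="neg_root_mean_squared_error")

# Flip the sign to report the error in the units of the target.
mean_rmse = -neg_rmse.mean()
print(f"mean CV score: {neg_rmse.mean():.2f} -> mean RMSE: {mean_rmse:.2f}")
```

The same sign flip applies to best_score_ after a search that uses a negated scorer.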

Nested parallelism

By default, RuntimeConfig(parallel_backend="auto") parallelizes across the unique candidates in a generation. If your pipeline also uses n_jobs internally (e.g., RandomForestClassifier(n_jobs=-1)), the two levels of parallelism multiply and you may get oversubscription: more workers than CPU cores, which usually slows the search down. Either set the estimator's n_jobs=1, or switch to RuntimeConfig(parallel_backend="cv") to parallelize within each candidate's cross-validation instead.
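One way to apply the first option without rebuilding the pipeline is to find every nested n_jobs parameter and pin it to 1 with set_params. This is a sketch using only scikit-learn; the dictionary name inner_n_jobs is illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_jobs=-1)),
])

# Collect every nested "<step>__n_jobs" parameter and pin it to 1, so that
# only the outer search spawns parallel workers.
inner_n_jobs = {name: 1 for name in pipe.get_params() if name.endswith("__n_jobs")}
pipe.set_params(**inner_n_jobs)

print(inner_n_jobs)  # {'clf__n_jobs': 1}
```

The pipeline can then be passed to the search as the estimator, with parallelism handled by the search alone.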

Next Steps