Pipeline Tuning with GASearchCV
scikit-learn Pipeline objects let you chain preprocessing steps and an
estimator into a single object. GASearchCV tunes pipelines the same way it
tunes plain estimators — the only difference is the parameter naming
convention.
Parameter Naming Inside a Pipeline
Pipeline parameters follow the pattern stepname__paramname (two
underscores). The step name is the string you assigned when creating the
pipeline. For example, a pipeline built with:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor()),
])
exposes parameters like regressor__n_estimators,
regressor__learning_rate, and scaler__with_mean. These are the same
names used in param_grid for any sklearn search method.
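If you are unsure which names a pipeline exposes, scikit-learn can list them. The sketch below rebuilds the two-step pipeline shown above and prints every step-level parameter name using Pipeline.get_params(); the variable name tunable is only illustrative.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor()),
])

# get_params() returns every settable parameter keyed by its full name.
# Names containing "__" belong to a step; the remaining keys are the
# steps themselves or pipeline-level options such as "steps".
tunable = sorted(name for name in pipe.get_params() if "__" in name)

for name in tunable:
    print(name)
```

Any name printed here should be usable as a param_grid key; a key that is not in this list will be rejected by the pipeline.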
Full Example: Gradient Boosting Regression Pipeline
This example tunes a GradientBoostingRegressor inside a preprocessing
pipeline on the diabetes regression dataset. The search space has six
parameters, some of which are known to interact (learning_rate × n_estimators,
max_depth × min_samples_leaf).
Setup
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn_genetic import EvolutionConfig, GASearchCV, PopulationConfig, RuntimeConfig
from sklearn_genetic.space import Categorical, Continuous, Integer
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
cv = KFold(n_splits=4, shuffle=True, random_state=42)
Build and evaluate a default baseline to have a comparison point:
def make_pipeline(**kwargs):
    return Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", GradientBoostingRegressor(random_state=42, **kwargs)),
    ])
baseline = make_pipeline()
baseline.fit(X_train, y_train)
baseline_r2 = r2_score(y_test, baseline.predict(X_test))
baseline_rmse = mean_squared_error(y_test, baseline.predict(X_test)) ** 0.5
print(f"Baseline R²: {baseline_r2:.4f} RMSE: {baseline_rmse:.2f}")
Define the Search Space
Note the regressor__ prefix on every key:
param_grid = {
    "regressor__n_estimators": Integer(50, 200),
    "regressor__learning_rate": Continuous(0.01, 0.2, distribution="log-uniform"),
    "regressor__max_depth": Integer(1, 5),
    "regressor__min_samples_leaf": Integer(1, 12),
    "regressor__subsample": Continuous(0.6, 1.0),
    "regressor__loss": Categorical(["squared_error", "absolute_error", "huber"]),
}
The log-uniform distribution samples learning_rate uniformly on a
logarithmic scale, so small values are drawn more often than under a plain
uniform distribution. This matches the prior that small learning rates are
generally more interesting.
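To see what this means in practice, the sketch below compares uniform and log-uniform draws over the same learning_rate range using NumPy alone. It illustrates the general definition of a log-uniform distribution (uniform in log space) and is not taken from sklearn-genetic-opt's internals; the threshold of 0.05 is an arbitrary choice for the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 0.01, 0.2  # same bounds as regressor__learning_rate
n = 100_000

# Uniform: every value in [low, high] is equally likely.
uniform = rng.uniform(low, high, size=n)

# Log-uniform: draw uniformly in log space, then exponentiate.
log_uniform = np.exp(rng.uniform(np.log(low), np.log(high), size=n))

# Share of samples that fall in the "small learning rate" region.
threshold = 0.05
frac_uniform = float(np.mean(uniform < threshold))
frac_log_uniform = float(np.mean(log_uniform < threshold))

print(f"share below {threshold}: uniform={frac_uniform:.2f}, "
      f"log-uniform={frac_log_uniform:.2f}")
```

Analytically, about 21% of uniform draws fall below 0.05 (0.04 / 0.19), against about 54% of log-uniform draws (ln 5 / ln 20), so the log-uniform space spends more than twice as many samples on small learning rates.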
Configure and Run
search = GASearchCV(
    estimator=make_pipeline(),
    param_grid=param_grid,
    cv=cv,
    scoring="neg_root_mean_squared_error",
    evolution_config=EvolutionConfig(
        population_size=20,
        generations=15,
        elitism=True,
        keep_top_k=3,
    ),
    population_config=PopulationConfig(
        initializer="smart",
        warm_start_configs=[
            {
                "regressor__n_estimators": 100,
                "regressor__learning_rate": 0.1,
                "regressor__max_depth": 3,
                "regressor__min_samples_leaf": 4,
                "regressor__subsample": 0.8,
                "regressor__loss": "squared_error",
            }
        ],
    ),
    runtime_config=RuntimeConfig(n_jobs=-1, parallel_backend="auto", use_cache=True),
)
search.fit(X_train, y_train)
print("Best CV negative RMSE:", round(search.best_score_, 4))
print("Best parameters:", search.best_params_)
A warm_start_configs entry seeds the initial population with a known-good
configuration. Its keys use the same stepname__paramname names as
param_grid. The optimizer then explores variations around it alongside the
candidates drawn by Latin hypercube sampling (LHS).
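Because a warm-start configuration is written by hand, a typo in a key is easy to make. One way to catch it before a long search, sketched here with plain scikit-learn calls only, is to check the keys against the pipeline's own parameter names. The variable names warm_start, unknown, and typo_rejected are illustrative and not part of any library API.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor(random_state=42)),
])

warm_start = {
    "regressor__n_estimators": 100,
    "regressor__learning_rate": 0.1,
    "regressor__max_depth": 3,
    "regressor__min_samples_leaf": 4,
    "regressor__subsample": 0.8,
    "regressor__loss": "squared_error",
}

# Keys the pipeline does not recognise (expected to be empty here).
unknown = sorted(set(warm_start) - set(pipe.get_params()))
print("Unknown parameter names:", unknown)

# set_params accepts the full configuration when every name is valid.
pipe.set_params(**warm_start)

# A misspelled step name is rejected with ValueError.
typo_rejected = False
try:
    pipe.set_params(regresor__max_depth=3)  # deliberate typo: "regresor"
except ValueError:
    typo_rejected = True
print("Typo rejected:", typo_rejected)
```

Note that set_params only checks names. Out-of-range values, such as a negative learning rate, are not detected until the estimator is fitted.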
Evaluate on the Holdout Set
After fitting, the search object behaves like a fitted pipeline. Call
predict directly:
ga_r2 = r2_score(y_test, search.predict(X_test))
ga_rmse = mean_squared_error(y_test, search.predict(X_test)) ** 0.5
print(f"Baseline → R²: {baseline_r2:.4f} RMSE: {baseline_rmse:.2f}")
print(f"GA tuned → R²: {ga_r2:.4f} RMSE: {ga_rmse:.2f}")
Inspect Evaluation Cost
print(search.fit_stats_)
# evaluated_candidates: total individuals presented to the evaluator
# unique_candidates: distinct configurations actually cross-validated
# cache_hits: re-used scores from the fitness cache
# random_immigrants: individuals injected by diversity control
Visualize the Search
import matplotlib.pyplot as plt
from sklearn_genetic.plots import plot_fitness_evolution, plot_search_space
plot_fitness_evolution(search)
plt.show()
# Inspect which learning_rate / n_estimators pairs were explored
plot_search_space(
    search,
    features=["regressor__learning_rate", "regressor__n_estimators"],
)
plt.show()
Common Pitfalls
- Wrong step name in ``param_grid``
  The step name must exactly match what you passed to Pipeline([...]). If
  your pipeline uses ("clf", LogisticRegression()), the parameter is clf__C,
  not logistic__C or C.
- Tuning preprocessor parameters
  You can also tune scaler__with_std, pca__n_components, or any preprocessor
  parameter using the same stepname__param pattern. When the preprocessor
  parameters change, the transformation changes, so the GA effectively
  searches the combined (preprocessing + model) space.
- Negative scorers for regression
  sklearn convention is to maximize scores, so regression losses must be
  negated: "neg_root_mean_squared_error", "neg_mean_absolute_error".
  GASearchCV uses criteria="max" by default, which is correct for negative
  scorers.
- Nested parallelism
  By default RuntimeConfig(parallel_backend="auto") parallelizes across
  unique candidates in a generation. If your pipeline itself uses n_jobs
  internally (e.g., RandomForestClassifier(n_jobs=-1)), you may get
  oversubscription. Either set the estimator's n_jobs=1, or switch to
  RuntimeConfig(parallel_backend="cv") to parallelize within each candidate's
  cross-validation instead.
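The sign convention in the "Negative scorers for regression" item can be checked with scikit-learn alone. This minimal sketch cross-validates the default pipeline on the diabetes data and shows that the scorer returns negative values, which are negated to recover RMSE. It uses cross_val_score rather than GASearchCV, so it only illustrates the scorer, not the search.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor(random_state=42)),
])
cv = KFold(n_splits=4, shuffle=True, random_state=42)

# The scorer returns -RMSE per fold, so "higher is better" still holds.
neg_rmse = cross_val_score(
    pipe, X, y, cv=cv, scoring="neg_root_mean_squared_error"
)

# Flip the sign to report the error in its usual, positive form.
rmse = -neg_rmse
print("Per-fold negative RMSE:", neg_rmse.round(2))
print("Mean RMSE:", round(float(rmse.mean()), 2))
```

The same sign flip applies to search.best_score_ when the search is scored with a neg_ scorer.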
Next Steps
Using Callbacks — stop the search early when the score plateaus.
Using Adapters — schedule crossover and mutation probabilities over generations.
Advanced Optimizer Control — diversity control, local refinement, and fitness sharing for harder pipeline spaces.