.. _pipeline-tuning:

Pipeline Tuning with GASearchCV
================================

scikit-learn ``Pipeline`` objects chain preprocessing steps and a final
estimator into a single object. ``GASearchCV`` tunes a pipeline the same way it
tunes a plain estimator; the only difference is the parameter naming
convention.

Parameter Naming Inside a Pipeline
-----------------------------------

Pipeline parameters follow the pattern ``stepname__paramname`` (two
underscores). The step name is the string you assigned when creating the
pipeline. For example, a pipeline built with:

.. code:: python3

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import GradientBoostingRegressor

    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", GradientBoostingRegressor()),
    ])

exposes parameters like ``regressor__n_estimators``,
``regressor__learning_rate``, and ``scaler__with_mean``. These are the same
names that scikit-learn's own search classes, such as ``GridSearchCV``, expect
in ``param_grid``.
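
One way to check which names are available is to ask the pipeline itself.
The sketch below uses ``get_params()``, the standard scikit-learn
introspection call, and keeps only the nested ``stepname__paramname`` keys.
The variable names ``nested_names`` and ``regressor_names`` are illustrative.

.. code:: python3

    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", GradientBoostingRegressor()),
    ])

    # Keep only the nested keys of the form "<step>__<param>".
    nested_names = sorted(name for name in pipe.get_params() if "__" in name)

    # Keys that belong to the "regressor" step.
    regressor_names = [n for n in nested_names if n.startswith("regressor__")]
    print(regressor_names)

Any key in this list is a valid ``param_grid`` key for the pipeline.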

Full Example: Gradient Boosting Regression Pipeline
-----------------------------------------------------

This example tunes a ``GradientBoostingRegressor`` inside a preprocessing
pipeline on the diabetes regression dataset. The search space has six
parameters with known interactions (``learning_rate`` × ``n_estimators``,
``max_depth`` × ``min_samples_leaf``).

Setup
^^^^^

.. code:: python3

    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import r2_score, mean_squared_error
    from sklearn.model_selection import KFold, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    from sklearn_genetic import EvolutionConfig, GASearchCV, PopulationConfig, RuntimeConfig
    from sklearn_genetic.space import Categorical, Continuous, Integer

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )

    cv = KFold(n_splits=4, shuffle=True, random_state=42)

Fit a pipeline with default hyperparameters first, so the tuned result has a
baseline to compare against:

.. code:: python3

    def make_pipeline(**kwargs):
        return Pipeline([
            ("scaler", StandardScaler()),
            ("regressor", GradientBoostingRegressor(random_state=42, **kwargs)),
        ])

    baseline = make_pipeline()
    baseline.fit(X_train, y_train)

    # Predict once and reuse the result for both metrics.
    baseline_pred = baseline.predict(X_test)
    baseline_r2 = r2_score(y_test, baseline_pred)
    baseline_rmse = mean_squared_error(y_test, baseline_pred) ** 0.5
    print(f"Baseline R²: {baseline_r2:.4f}  RMSE: {baseline_rmse:.2f}")

Define the Search Space
^^^^^^^^^^^^^^^^^^^^^^^^

Note the ``regressor__`` prefix on every key:

.. code:: python3

    param_grid = {
        "regressor__n_estimators": Integer(50, 200),
        "regressor__learning_rate": Continuous(0.01, 0.2, distribution="log-uniform"),
        "regressor__max_depth": Integer(1, 5),
        "regressor__min_samples_leaf": Integer(1, 12),
        "regressor__subsample": Continuous(0.6, 1.0),
        "regressor__loss": Categorical(["squared_error", "absolute_error", "huber"]),
    }

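The ``log-uniform`` distribution for ``learning_rate`` samples uniformly on a
logarithmic scale, so small values are drawn more often than they would be
under a plain uniform distribution. This suits learning rates, whose effect
tends to vary on a multiplicative rather than an additive scale.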
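
The difference between log-uniform and plain uniform sampling can be seen with
a few lines of NumPy. This is an illustrative sketch of the distribution only,
not the library's sampling code; the bounds match the ``learning_rate`` range
above, and the sample size and seed are arbitrary.

.. code:: python3

    import numpy as np

    rng = np.random.default_rng(0)
    low, high = 0.01, 0.2

    # Log-uniform: draw uniformly in log space, then exponentiate.
    log_uniform = np.exp(rng.uniform(np.log(low), np.log(high), size=10_000))
    # Plain uniform draw over the same interval, for comparison.
    uniform = rng.uniform(low, high, size=10_000)

    # Geometric midpoint of the interval (about 0.045).
    geo_mid = np.sqrt(low * high)

    # Share of samples that fall below the geometric midpoint.
    frac_log = np.mean(log_uniform < geo_mid)
    frac_uni = np.mean(uniform < geo_mid)
    print(f"log-uniform: {frac_log:.2f}  uniform: {frac_uni:.2f}")

Roughly half of the log-uniform draws land below the geometric midpoint,
compared with under a fifth of the uniform draws.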

Configure and Run
^^^^^^^^^^^^^^^^^^

.. code:: python3

    search = GASearchCV(
        estimator=make_pipeline(),
        param_grid=param_grid,
        cv=cv,
        scoring="neg_root_mean_squared_error",
        evolution_config=EvolutionConfig(
            population_size=20,
            generations=15,
            elitism=True,
            keep_top_k=3,
        ),
        population_config=PopulationConfig(
            initializer="smart",
            warm_start_configs=[
                {
                    "regressor__n_estimators": 100,
                    "regressor__learning_rate": 0.1,
                    "regressor__max_depth": 3,
                    "regressor__min_samples_leaf": 4,
                    "regressor__subsample": 0.8,
                    "regressor__loss": "squared_error",
                }
            ],
        ),
        runtime_config=RuntimeConfig(n_jobs=-1, parallel_backend="auto", use_cache=True),
    )

    search.fit(X_train, y_train)

    print("Best CV negative RMSE:", round(search.best_score_, 4))
    print("Best parameters:", search.best_params_)

A ``warm_start_configs`` entry seeds the initial population with a known-good
configuration. The optimizer then explores variations around it alongside the
candidates drawn by Latin hypercube sampling (LHS).
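
``best_score_`` is reported in the scorer's negated units, so it is easiest to
judge against a baseline scored the same way. The sketch below is an
illustration that uses only scikit-learn's ``cross_val_score``. It assumes
``best_score_`` is a mean over the CV folds, as in scikit-learn's own search
classes, and it rebuilds the data split and baseline pipeline from the setup
above so that it runs on its own.

.. code:: python3

    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import KFold, cross_val_score, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )
    cv = KFold(n_splits=4, shuffle=True, random_state=42)

    baseline = Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", GradientBoostingRegressor(random_state=42)),
    ])

    # One negated RMSE per fold; values closer to zero are better.
    fold_scores = cross_val_score(
        baseline, X_train, y_train, cv=cv, scoring="neg_root_mean_squared_error"
    )
    baseline_cv_score = fold_scores.mean()

    print("Baseline CV negative RMSE:", round(baseline_cv_score, 4))
    print("Baseline CV RMSE:", round(-baseline_cv_score, 4))

Negating either value gives an RMSE in the target's units. A tuned
``best_score_`` that is closer to zero than the baseline value indicates an
improvement under cross-validation.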

Evaluate on the Holdout Set
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After fitting, the search object behaves like a fitted pipeline. Call
``predict`` directly:

.. code:: python3

    # Predict once and reuse the result for both metrics.
    ga_pred = search.predict(X_test)
    ga_r2 = r2_score(y_test, ga_pred)
    ga_rmse = mean_squared_error(y_test, ga_pred) ** 0.5

    print(f"Baseline → R²: {baseline_r2:.4f}  RMSE: {baseline_rmse:.2f}")
    print(f"GA tuned → R²: {ga_r2:.4f}  RMSE: {ga_rmse:.2f}")

Inspect Evaluation Cost
^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python3

    print(search.fit_stats_)
    # evaluated_candidates: total individuals presented to the evaluator
    # unique_candidates:    distinct configurations actually cross-validated
    # cache_hits:           re-used scores from the fitness cache
    # random_immigrants:    individuals injected by diversity control

Visualize the Search
^^^^^^^^^^^^^^^^^^^^^

.. code:: python3

    import matplotlib.pyplot as plt
    from sklearn_genetic.plots import plot_fitness_evolution, plot_search_space

    plot_fitness_evolution(search)
    plt.show()

    # Inspect which learning_rate / n_estimators pairs were explored
    plot_search_space(
        search,
        features=["regressor__learning_rate", "regressor__n_estimators"],
    )
    plt.show()

Common Pitfalls
---------------

**Wrong step name in** ``param_grid``
    The step name must exactly match what you passed to ``Pipeline([...])``. If
    your pipeline uses ``("clf", LogisticRegression())``, the parameter is
    ``clf__C``, not ``logistic__C`` or ``C``.

**Tuning preprocessor parameters**
    You can also tune ``scaler__with_std``, ``pca__n_components``, or any
    preprocessor parameter using the same ``stepname__param`` pattern. When
    the preprocessor parameters change, the transformation changes, so the GA
    effectively searches the combined (preprocessing + model) space.

**Negative scorers for regression**
    scikit-learn scorers follow a higher-is-better convention, so error
    metrics are exposed in negated form: ``"neg_root_mean_squared_error"``,
    ``"neg_mean_absolute_error"``. ``GASearchCV`` uses ``criteria="max"`` by
    default, which is the correct setting for these negated scorers.

**Nested parallelism**
    By default ``RuntimeConfig(parallel_backend="auto")`` parallelizes across
    unique candidates in a generation. If your pipeline itself uses ``n_jobs``
    internally (e.g., ``RandomForestClassifier(n_jobs=-1)``), you may get
    oversubscription. Either set the estimator's ``n_jobs=1``, or switch to
    ``RuntimeConfig(parallel_backend="cv")`` to parallelize within each
    candidate's cross-validation instead.
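
Several of these pitfalls can be checked with plain scikit-learn before a
search is started. The sketch below is an illustration on a hypothetical
three-step pipeline (the step names ``scaler``, ``pca``, and ``clf`` are
chosen for the example). It relies only on ``Pipeline.set_params``, which
rejects unknown names with a ``ValueError``.

.. code:: python3

    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("clf", RandomForestClassifier(n_jobs=-1)),
    ])

    # Preprocessor and model parameters share the same naming pattern.
    pipe.set_params(pca__n_components=5, clf__n_estimators=50)

    # Avoid nested parallelism by making the estimator single-threaded.
    pipe.set_params(clf__n_jobs=1)

    # A step name that does not exist in the pipeline is rejected.
    try:
        pipe.set_params(forest__n_estimators=50)
        wrong_name_rejected = False
    except ValueError as exc:
        wrong_name_rejected = True
        print("Rejected:", exc)

Running the same ``set_params`` check on each ``param_grid`` key is a cheap
way to catch a misspelled step name before any cross-validation runs.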

Next Steps
----------

* :doc:`callbacks` — stop the search early when the score plateaus.
* :doc:`adapters` — schedule crossover and mutation probabilities over
  generations.
* :doc:`advanced_optimizer_control` — diversity control, local refinement, and
  fitness sharing for harder pipeline spaces.