.. _when-to-use:

When to Use sklearn-genetic-opt
================================

This page helps you decide whether a genetic search is the right tool for your
tuning problem and shows what a realistic setup looks like.

Choosing a Search Method
-------------------------

The table below compares scikit-learn's two most widely used search strategies
with ``GASearchCV`` from sklearn-genetic-opt. Each targets a different
situation:

.. list-table::
   :header-rows: 1
   :widths: 20 30 30 20

   * - Method
     - Best for
     - Weakness
     - Typical space size
   * - ``GridSearchCV``
     - Small, fully discrete grids
     - Candidate count multiplies with each dimension
     - ≤ 3 parameters
   * - ``RandomizedSearchCV``
     - Continuous spaces, large grids
     - Samples parameters independently; does not exploit interactions
     - 3–6 parameters
   * - ``GASearchCV``
     - Mixed or large spaces with parameter interactions
     - Adds overhead on trivially small spaces
     - 5+ parameters

The key limitation of random search is **independence**: it draws each
parameter without regard to the others and never uses the scores of earlier
candidates. When two parameters interact, for example ``learning_rate`` and
``n_estimators`` in a gradient boosting model, random search is as likely to
pair a low learning rate with few estimators (underfit) as with many estimators
(good). A genetic algorithm recombines *complete configurations* that performed
well, so it tends to concentrate on combinations that work together.
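
The effect of independent sampling can be illustrated with a toy sketch that
uses only the standard library. The ranges and the ``is_compatible`` rule are
assumptions made for illustration; they are not taken from any real model or
from sklearn-genetic-opt.

.. code:: python3

    import random

    rng = random.Random(0)

    def sample_independently():
        """Draw learning_rate and n_estimators without regard to each other."""
        learning_rate = 10 ** rng.uniform(-2, -0.5)  # log-uniform, about 0.01 to 0.32
        n_estimators = rng.randint(50, 300)
        return learning_rate, n_estimators

    def is_compatible(learning_rate, n_estimators):
        """Toy rule (assumption): the product must fall inside a 'good' band."""
        return 5.0 <= learning_rate * n_estimators <= 30.0

    samples = [sample_independently() for _ in range(1000)]
    share = sum(is_compatible(lr, n) for lr, n in samples) / len(samples)
    print(f"Compatible pairs under independent sampling: {share:.0%}")

Every pair outside the band is budget spent on a combination that could have
been ruled out by looking at both parameters together.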

Signs That GA Will Help
-----------------------

* **Five or more hyperparameters.** The search space grows exponentially with
  the number of parameters. GA explores it with a population of complete
  solutions rather than sampling each axis independently.
* **Known or suspected parameter interactions.** ``learning_rate`` × number of
  estimators, regularization strength × solver, kernel bandwidth × ``C`` in
  SVMs.
* **Mixed parameter types in the same space.** Integers, floats, and categoricals
  in one search. Grid search becomes unwieldy; random search handles them but
  ignores interactions.
* **Expensive evaluations.** GA caches every evaluated candidate and reuses its
  score when the same configuration appears again. Cache hits become valuable
  when each cross-validation takes seconds or minutes.
* **You want to narrow the space iteratively.** ``plot_search_space`` shows
  where the algorithm sampled. You can tighten ranges between runs.
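
The caching idea mentioned above can be sketched in a few lines. This is a
minimal illustration, not the library's implementation: ``CachedEvaluator``,
``make_key``, and ``expensive_score`` are hypothetical names, and the scoring
function is a stand-in for a real cross-validation run.

.. code:: python3

    def make_key(params):
        """Build a hashable key that ignores the order of the dict entries."""
        return tuple(sorted(params.items()))

    class CachedEvaluator:
        """Score each distinct configuration once and reuse the result."""

        def __init__(self, score_fn):
            self.score_fn = score_fn
            self.cache = {}
            self.hits = 0

        def __call__(self, params):
            key = make_key(params)
            if key in self.cache:
                self.hits += 1
            else:
                self.cache[key] = self.score_fn(params)
            return self.cache[key]

    def expensive_score(params):
        # Stand-in for cross-validation (assumption): higher is better.
        return -abs(params["max_depth"] - 5) - abs(params["learning_rate"] - 0.1)

    evaluate = CachedEvaluator(expensive_score)
    population = [
        {"max_depth": 4, "learning_rate": 0.1},
        {"learning_rate": 0.1, "max_depth": 4},  # same configuration, different key order
        {"max_depth": 6, "learning_rate": 0.05},
    ]
    scores = [evaluate(p) for p in population]
    print(evaluate.hits, len(evaluate.cache))  # 1 2

The second candidate costs nothing because it duplicates the first. Duplicates
of this kind can arise when a population converges on a few good
configurations.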

Signs That GA Will Not Help
---------------------------

* **One or two continuous parameters.** Random search covers this well and is
  faster to configure.
* **Very fast evaluations with a large budget.** If you can afford ten thousand
  random candidates in seconds, sheer volume usually beats a smarter search.
* **A fully discrete grid with few values.** ``GridSearchCV`` is exhaustive and
  easier to reason about.
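
A quick way to judge whether a grid is still small enough for exhaustive search
is to multiply the number of values per parameter. The sketch below does this
for a made-up grid; the parameter names and value lists are illustrative
assumptions, not recommendations.

.. code:: python3

    from math import prod

    # Hypothetical discrete grid: parameter name -> candidate values.
    grid = {
        "max_depth": [2, 4, 6, 8],
        "learning_rate": [0.01, 0.03, 0.1, 0.3],
        "max_iter": [50, 100, 200, 300],
        "min_samples_leaf": [5, 20, 50],
        "max_leaf_nodes": [15, 31, 63, 127],
    }

    def grid_size(grid, n_params):
        """Count candidates when only the first n_params dimensions are searched."""
        return prod(len(values) for values in list(grid.values())[:n_params])

    sizes = [grid_size(grid, n) for n in range(1, len(grid) + 1)]
    print(sizes)  # [4, 16, 64, 192, 768]

Two parameters give 16 candidates, which is easy to enumerate. Five give 768,
and each of those still has to be cross-validated.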

Example: Gradient Boosting With Seven Parameters
-------------------------------------------------

The following example shows a case where a genetic search is likely to pay off.
``HistGradientBoostingClassifier`` has a well-known interaction between
``learning_rate`` and ``max_iter``: a low learning rate needs more iterations,
and a high learning rate converges in fewer. Random search samples the two
independently. The genetic algorithm recombines configurations that scored well
and tends to find consistent (learning rate, iteration count) pairs.

.. code:: python3

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import StratifiedKFold, train_test_split
    from sklearn.metrics import roc_auc_score

    from sklearn_genetic import EvolutionConfig, GASearchCV, PopulationConfig, RuntimeConfig
    from sklearn_genetic.space import Categorical, Continuous, Integer

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42
    )

    param_grid = {
        "learning_rate": Continuous(0.01, 0.3, distribution="log-uniform"),
        "max_iter": Integer(50, 300),
        "max_depth": Integer(2, 8),
        "min_samples_leaf": Integer(5, 50),
        "l2_regularization": Continuous(1e-6, 1.0, distribution="log-uniform"),
        "max_features": Continuous(0.3, 1.0),
        "max_leaf_nodes": Integer(15, 127),
    }

    search = GASearchCV(
        estimator=HistGradientBoostingClassifier(random_state=42, early_stopping=False),
        param_grid=param_grid,
        cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
        scoring="roc_auc",
        evolution_config=EvolutionConfig(population_size=20, generations=15),
        population_config=PopulationConfig(initializer="smart"),
        runtime_config=RuntimeConfig(n_jobs=-1, parallel_backend="auto", use_cache=True),
    )

    search.fit(X_train, y_train)

    print("Best CV score:", round(search.best_score_, 4))
    print("Best parameters:", search.best_params_)

    y_prob = search.predict_proba(X_test)[:, 1]
    print("Holdout ROC-AUC:", round(roc_auc_score(y_test, y_prob), 4))

After fitting, you can inspect the cache efficiency and diversity telemetry:

.. code:: python3

    import pandas as pd

    # Summary counters such as evaluated_candidates, unique_candidates, and cache_hits.
    print(search.fit_stats_)

    # Per-generation log; the last rows show how the search ended.
    history = pd.DataFrame(search.history)
    columns = ["gen", "fitness_best", "genotype_diversity", "stagnation_generations"]
    print(history[columns].tail())

Minimum Recommended Configuration
----------------------------------

The essentials for any production run:

.. code:: python3

    from sklearn_genetic import EvolutionConfig, GASearchCV, PopulationConfig, RuntimeConfig

    search = GASearchCV(
        estimator=your_estimator,
        param_grid=your_param_grid,
        cv=your_cv_strategy,
        scoring="your_metric",
        evolution_config=EvolutionConfig(
            population_size=20,   # start here; increase for larger spaces
            generations=15,       # 10-20 is a reasonable default
        ),
        population_config=PopulationConfig(initializer="smart"),
        runtime_config=RuntimeConfig(n_jobs=-1, use_cache=True),
    )

    search.fit(X_train, y_train)

    print(search.best_params_)
    print(search.best_score_)

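``population_size`` and ``generations`` control the total evaluation budget:
``population_size + generations × 2 × population_size`` candidates are
generated. With a population of 20 and 15 generations, that is
20 + 15 × 2 × 20 = 620 candidate configurations, a reasonable budget for a
7-parameter space. Candidates served from the cache do not need to be
re-evaluated, so the number of cross-validation runs can be lower.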
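
The budget formula above is easy to check before launching a long run. The
helper below is a hypothetical convenience function, not part of
sklearn-genetic-opt; the factor of 2 is taken from the formula, and the fold
count of 3 matches the example earlier on this page.

.. code:: python3

    def max_candidates(population_size, generations, offspring_factor=2):
        """Upper bound on generated candidates, following the formula above."""
        return population_size + generations * offspring_factor * population_size

    budget = max_candidates(population_size=20, generations=15)
    print(budget)  # 620

    cv_folds = 3
    print(budget * cv_folds)  # 1860 model fits if nothing is served from the cache

Multiplying by the number of folds gives a rough ceiling on model fits, which
helps when deciding whether to reduce ``population_size`` or ``generations``.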

Next Steps
----------

* :doc:`basic_usage` — full workflow for hyperparameter tuning and feature
  selection with plots and fitness inspection.
* :doc:`understand_cv` — how the genetic algorithm evaluates candidates and
  what the generation log means.
* :doc:`pipeline_tuning` — how to tune a scikit-learn ``Pipeline`` and use the
  ``step__param`` naming convention.
* :doc:`advanced_optimizer_control` — diversity control, local search, and
  fitness sharing for harder search spaces.