.. _when-to-use:

When to Use sklearn-genetic-opt
================================

This page helps you decide whether a genetic search is the right tool for your
tuning problem and shows what a realistic setup looks like.

Choosing a Search Method
-------------------------

The table below compares three search strategies: ``GridSearchCV`` and
``RandomizedSearchCV`` from scikit-learn, and ``GASearchCV`` from
sklearn-genetic-opt. Each targets a different situation:

.. list-table::
   :header-rows: 1
   :widths: 20 30 30 20

   * - Method
     - Best for
     - Weakness
     - Typical space size
   * - ``GridSearchCV``
     - Small, fully discrete grids
     - Candidate count multiplies with each dimension
     - ≤ 3 parameters
   * - ``RandomizedSearchCV``
     - Continuous spaces, large grids
     - Samples every parameter independently, so it misses interactions
     - 3–6 parameters
   * - ``GASearchCV``
     - Mixed or large spaces with parameter interactions
     - Adds overhead on trivially small spaces
     - 5+ parameters

The key limitation of random search is **independence**: it samples each
parameter as if the others do not exist. When two parameters interact (for
example, ``learning_rate`` and ``n_estimators`` in a gradient boosting model),
random search is as likely to pair a low learning rate with few estimators
(underfit) as to pair a low learning rate with many estimators (good). A
genetic algorithm recombines *complete configurations* that performed well, so
it tends to gravitate toward combinations that work together.

Signs That GA Will Help
-----------------------

* **Five or more hyperparameters.** The search space grows exponentially with
  the number of parameters. GA explores it with a population of complete
  solutions rather than independently sampling each axis.
* **Known or suspected parameter interactions.** ``learning_rate`` × number of
  estimators, regularization strength × solver, kernel bandwidth × ``C`` in
  SVMs.
* **Mixed parameter types in the same space.** Integers, floats, and
  categoricals in one search. Grid search becomes unwieldy; random search
  handles them but ignores interactions.
* **Expensive evaluations.** GA caches every evaluated candidate and reuses its
  score when the same configuration appears again. Cache hits become valuable
  when each cross-validation takes seconds or minutes.
* **You want to narrow the space iteratively.** ``plot_search_space`` shows
  where the algorithm sampled, so you can tighten ranges between runs.

Signs That GA Will Not Help
---------------------------

* **One or two continuous parameters.** Random search covers this well and is
  faster to configure.
* **Very fast evaluations with a large budget.** If you can afford ten thousand
  random candidates in seconds, more candidates beat a smarter search.
* **A fully discrete grid with few values.** ``GridSearchCV`` is exhaustive and
  easier to reason about.

Example: Gradient Boosting With Seven Parameters
-------------------------------------------------

The following example shows a case where GA is likely to have a real advantage.
``HistGradientBoostingClassifier`` has a well-known interaction between
``learning_rate`` and ``max_iter``: a low learning rate needs more iterations,
and a high learning rate converges faster. Random search samples these
independently. The genetic algorithm recombines configurations that worked and
tends to find consistent (learning rate, iteration count) pairs.

.. code:: python3

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold, train_test_split
    from sklearn_genetic import (
        EvolutionConfig,
        GASearchCV,
        PopulationConfig,
        RuntimeConfig,
    )
    from sklearn_genetic.space import Categorical, Continuous, Integer

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42
    )

    # Seven-parameter space mixing integer and continuous dimensions.
    param_grid = {
        "learning_rate": Continuous(0.01, 0.3, distribution="log-uniform"),
        "max_iter": Integer(50, 300),
        "max_depth": Integer(2, 8),
        "min_samples_leaf": Integer(5, 50),
        "l2_regularization": Continuous(1e-6, 1.0, distribution="log-uniform"),
        "max_features": Continuous(0.3, 1.0),
        "max_leaf_nodes": Integer(15, 127),
    }

    search = GASearchCV(
        estimator=HistGradientBoostingClassifier(random_state=42, early_stopping=False),
        param_grid=param_grid,
        cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
        scoring="roc_auc",
        evolution_config=EvolutionConfig(population_size=20, generations=15),
        population_config=PopulationConfig(initializer="smart"),
        runtime_config=RuntimeConfig(n_jobs=-1, parallel_backend="auto", use_cache=True),
    )
    search.fit(X_train, y_train)

    print("Best CV score:", round(search.best_score_, 4))
    print("Best parameters:", search.best_params_)

    # Score the refit best estimator on the held-out split.
    y_prob = search.predict_proba(X_test)[:, 1]
    print("Holdout ROC-AUC:", round(roc_auc_score(y_test, y_prob), 4))

After fitting, you can inspect the cache efficiency and diversity telemetry:

.. code:: python3

    import pandas as pd

    # Shows evaluated_candidates, unique_candidates, cache_hits, etc.
    print(search.fit_stats_)

    # Per-generation log: best fitness, diversity, and stagnation counters.
    history = pd.DataFrame(search.history)
    print(history[["gen", "fitness_best", "genotype_diversity", "stagnation_generations"]].tail())

Minimum Recommended Configuration
----------------------------------

The essentials for any production run:

.. code:: python3

    from sklearn_genetic import EvolutionConfig, GASearchCV, PopulationConfig, RuntimeConfig

    # your_estimator, your_param_grid, your_cv_strategy, and "your_metric"
    # are placeholders: substitute your own objects.
    search = GASearchCV(
        estimator=your_estimator,
        param_grid=your_param_grid,
        cv=your_cv_strategy,
        scoring="your_metric",
        evolution_config=EvolutionConfig(
            population_size=20,  # start here; increase for larger spaces
            generations=15,      # 10-20 is a reasonable default
        ),
        population_config=PopulationConfig(initializer="smart"),
        runtime_config=RuntimeConfig(n_jobs=-1, use_cache=True),
    )
    search.fit(X_train, y_train)
    print(search.best_params_)
    print(search.best_score_)

``population_size`` and ``generations`` control the total evaluation budget:
``population_size + generations × 2 × population_size`` candidates are
generated. With population 20 and 15 generations, that is
20 + 15 × 2 × 20 = 620 candidate configurations, a reasonable budget for a
7-parameter space.

Next Steps
----------

* :doc:`basic_usage`: full workflow for hyperparameter tuning and feature
  selection with plots and fitness inspection.
* :doc:`understand_cv`: how the genetic algorithm evaluates candidates and what
  the generation log means.
* :doc:`pipeline_tuning`: how to tune a scikit-learn ``Pipeline`` and use the
  ``step__param`` naming convention.
* :doc:`advanced_optimizer_control`: diversity control, local search, and
  fitness sharing for harder search spaces.
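
Appendix: Checking the Evaluation Budget
----------------------------------------

The budget formula from the Minimum Recommended Configuration section,
``population_size + generations × 2 × population_size``, can be checked with a
few lines of arithmetic before committing to a long run. The sketch below is a
hypothetical helper written for this page, not a function provided by
sklearn-genetic-opt. It only restates the formula as given above and makes no
assumption about how many of those candidates are later served from the cache.

.. code:: python3

    def candidate_budget(population_size: int, generations: int) -> int:
        """Candidates generated according to the formula stated on this page.

        Hypothetical helper for illustration only; not part of the library.
        """
        if population_size < 1 or generations < 0:
            raise ValueError("population_size must be >= 1 and generations >= 0")
        # First term of the formula: population_size.
        first_term = population_size
        # Second term of the formula: generations * 2 * population_size.
        second_term = generations * 2 * population_size
        return first_term + second_term


    # Compare a few settings before choosing one.
    for pop, gens in [(10, 10), (20, 15), (40, 20)]:
        total = candidate_budget(pop, gens)
        print(f"population_size={pop}, generations={gens}: {total} candidates")

Multiplying the result by the number of cross-validation folds gives a rough
upper bound on the number of model fits, since cached configurations are not
refit.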