Advanced Hyperparameter Search With Random Forest

This notebook is a guided tour of advanced optimization controls available in sklearn-genetic-opt. We will tune a RandomForestClassifier on the breast cancer dataset, inspect optimizer telemetry, compare against a lightweight randomized-search baseline, and then reuse the same ideas for feature selection.

Problem Setup

The breast cancer dataset is a binary classification task with 569 samples and 30 numeric features. It is small enough for a documentation example, but the 30 features are still enough to make model selection and feature selection meaningful.

We fix the random seed for both the train/test split and the shuffled StratifiedKFold so the notebook is reproducible.

[1]:

import warnings
from pprint import pprint

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
from scipy.stats import randint

from sklearn_genetic import (
    EvolutionConfig,
    GAFeatureSelectionCV,
    GASearchCV,
    OptimizationConfig,
    PopulationConfig,
    RuntimeConfig,
)
from sklearn_genetic.callbacks import ConsecutiveStopping, DeltaThreshold, TimerStopping
from sklearn_genetic.schedules import ExponentialAdapter, InverseAdapter
from sklearn_genetic.space import Categorical, Continuous, Integer

warnings.filterwarnings("ignore", category=UserWarning)

RANDOM_STATE = 42

[2]:

data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.30,
    stratify=y,
    random_state=RANDOM_STATE,
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)

print(f"Training shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")
print(f"Positive class rate: {y.mean():.3f}")

Training shape: (398, 30)
Test shape: (171, 30)
Positive class rate: 0.627

Baseline Model

Before tuning anything, train a plain random forest. This gives us a practical reference point: a genetic search should either improve the score, find a simpler configuration, or give us useful telemetry about the search process.

[3]:

def evaluate_classifier(estimator, X_eval, y_eval):
    predictions = estimator.predict(X_eval)
    probabilities = estimator.predict_proba(X_eval)[:, 1]
    return {
        "accuracy": accuracy_score(y_eval, predictions),
        "balanced_accuracy": balanced_accuracy_score(y_eval, predictions),
        "roc_auc": roc_auc_score(y_eval, probabilities),
    }


baseline = RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=1)
baseline.fit(X_train, y_train)

baseline_metrics = evaluate_classifier(baseline, X_test, y_test)
baseline_metrics

[3]:

{'accuracy': 0.935672514619883,
 'balanced_accuracy': 0.9297605140186915,
 'roc_auc': 0.991311331775701}

Define a Genetic Search Space

sklearn-genetic-opt uses explicit search-space objects instead of sklearn parameter distributions. This keeps integer, continuous, and categorical choices clear.

In this example we tune both model capacity and split behavior. The search space is intentionally moderate so the notebook runs quickly.

[4]:

param_grid = {
    "n_estimators": Integer(40, 140),
    "max_depth": Integer(2, 12),
    "min_samples_split": Integer(2, 12),
    "min_samples_leaf": Integer(1, 8),
    "max_features": Categorical(["sqrt", "log2", None]),
    "ccp_alpha": Continuous(0.0, 0.03),
}
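To make the three dimension types concrete, the sketch below draws one candidate from an equivalent space using only the standard library. The `SPACE` dictionary and the `sample_candidate` helper are hypothetical stand-ins rather than part of sklearn-genetic-opt, and the sketch assumes uniform sampling with inclusive integer bounds.

```python
import random

# Hypothetical plain-Python mirror of param_grid: (kind, spec) per dimension.
SPACE = {
    "n_estimators": ("integer", (40, 140)),
    "max_depth": ("integer", (2, 12)),
    "min_samples_split": ("integer", (2, 12)),
    "min_samples_leaf": ("integer", (1, 8)),
    "max_features": ("categorical", ["sqrt", "log2", None]),
    "ccp_alpha": ("continuous", (0.0, 0.03)),
}


def sample_candidate(space, rng):
    """Draw one value per dimension, respecting each dimension's type."""
    candidate = {}
    for name, (kind, spec) in space.items():
        if kind == "integer":
            low, high = spec
            candidate[name] = rng.randint(low, high)  # both ends inclusive
        elif kind == "continuous":
            low, high = spec
            candidate[name] = rng.uniform(low, high)
        else:
            candidate[name] = rng.choice(spec)
    return candidate


rng = random.Random(42)
candidate = sample_candidate(SPACE, rng)
print(candidate)
```

Each key of the resulting dictionary is a valid RandomForestClassifier argument, which is what lets a candidate be passed straight to the estimator during cross-validation.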

Configure GASearchCV

This configuration demonstrates several optimizer controls:

- PopulationConfig(initializer="smart") seeds a more useful initial population using estimator defaults, stratified categorical choices, and Latin hypercube sampling for numeric dimensions.
- warm_start_configs injects a known reasonable configuration into the first population.
- RuntimeConfig(parallel_backend="auto") lets the estimator decide whether to parallelize candidate evaluation or cross-validation.
- OptimizationConfig(local_search=True) performs a short refinement around the best candidates at the end.
- OptimizationConfig(diversity_control=True) increases mutation pressure and can inject random candidates when the population collapses too early.
- OptimizationConfig(fitness_sharing=True) reduces crowding pressure so similar candidates do not dominate selection too soon.
- Adaptive schedules let crossover and mutation probabilities evolve over generations.
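As a rough illustration of the adaptive schedules, the sketch below evaluates an exponential and an inverse decay with the values this notebook passes to ExponentialAdapter and InverseAdapter. The two formulas are an assumption about the general shape of such schedules and may differ in detail from the library's adapters; the helper names are hypothetical, and any mutation boost applied by diversity control is ignored.

```python
import math


def exponential_decay(t, initial_value, end_value, adaptive_rate):
    """Assumed form: (p0 - pf) * exp(-rate * t) + pf."""
    return (initial_value - end_value) * math.exp(-adaptive_rate * t) + end_value


def inverse_decay(t, initial_value, end_value, adaptive_rate):
    """Assumed form: (p0 - pf) / (1 + rate * t) + pf."""
    return (initial_value - end_value) / (1 + adaptive_rate * t) + end_value


# Values taken from the EvolutionConfig used in this notebook.
for generation in range(0, 16, 5):
    crossover = exponential_decay(generation, 0.8, 0.4, 0.15)
    mutation = inverse_decay(generation, 0.25, 0.05, 0.2)
    print(f"gen {generation:2d}: crossover={crossover:.3f} mutation={mutation:.3f}")
```

Under these assumptions both probabilities start at their initial value and approach the end value without crossing it, so early generations explore more and later generations change candidates less aggressively.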

[5]:

callbacks = [
    ConsecutiveStopping(generations=10, metric="fitness_best"),
    TimerStopping(total_seconds=240),
]

ga_search = GASearchCV(
    estimator=RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=1),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=cv,
    evolution_config=EvolutionConfig(
        population_size=20,
        generations=15,
        crossover_probability=ExponentialAdapter(initial_value=0.8, end_value=0.4, adaptive_rate=0.15),
        mutation_probability=InverseAdapter(initial_value=0.25, end_value=0.05, adaptive_rate=0.2),
        tournament_size=3,
        elitism=True,
        keep_top_k=3,
    ),
    population_config=PopulationConfig(
        initializer="smart",
        warm_start_configs=[
            {
                "n_estimators": 100,
                "max_depth": 6,
                "min_samples_split": 4,
                "min_samples_leaf": 2,
                "max_features": "sqrt",
                "ccp_alpha": 0.0,
            }
        ],
    ),
    runtime_config=RuntimeConfig(
        n_jobs=-1,
        parallel_backend="auto",
        use_cache=True,
        verbose=True,
        return_train_score=False,
    ),
    optimization_config=OptimizationConfig(
        local_search=True,
        local_search_top_k=2,
        local_search_steps=1,
        local_search_radius=0.2,
        diversity_control=True,
        diversity_threshold=0.35,
        diversity_stagnation_generations=3,
        diversity_mutation_boost=1.8,
        random_immigrants_fraction=0.15,
        fitness_sharing=True,
        sharing_radius=0.35,
        sharing_alpha=1.0,
    ),
)

ga_search.fit(X_train, y_train, callbacks=callbacks)

 gen evals           avg          best     div  unique  stag     mut   sel             events
---- ----- ------------- ------------- ------- ------- ----- ------- ----- ------------------
   0    20       0.98625       0.99076   0.579   1.000     0       -     - -
   1    40       0.98519       0.99076   0.386   0.650     1   0.200     3 dup=1,share
   2    40       0.98587       0.99076   0.342   0.900     2   0.217     3 dup=9,share
   3    40       0.98676       0.99076   0.316   0.700     3   0.304     3 div,imm=6,dup=2,sh
   4    40       0.98610       0.99076   0.307   0.750     4   0.315     3 div,imm=6,dup=2,sh
   5    40       0.98464       0.99076   0.386   0.750     5   0.290     3 div,imm=6,dup=12,s
   6    40       0.98632       0.99171   0.412   0.900     0   0.270     3 div,imm=6,dup=7,sh
   7    40       0.98588       0.99171   0.421   0.800     1   0.141     3 dup=15,share
   8    40       0.98520       0.99171   0.421   0.850     2   0.133     3 dup=17,share
   9    40       0.98597       0.99171   0.404   0.750     3   0.127     3 dup=20,share
  10    40       0.98640       0.99171   0.439   0.900     4   0.219     3 div,imm=6,dup=16,s
  11    40       0.98589       0.99171   0.351   0.700     5   0.210     3 div,imm=6,dup=13,s
INFO: TimerStopping callback met its criteria
INFO: Stopping the algorithm

[5]:

GASearchCV(crossover_probability=<sklearn_genetic.schedules.schedulers.ExponentialAdapter object at 0x000001C5C12A12B0>,
           cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
           diversity_control=True, diversity_mutation_boost=1.8,
           diversity_stagnation_generations=3, diversity_threshold=0.35,
           estimator=RandomForestClassifier(ccp_alpha=0.0083469934111643...
                                                                   'max_features': 'sqrt',
                                                                   'min_samples_leaf': 2,
                                                                   'min_samples_split': 4,
                                                                   'n_estimators': 100}]),
           population_size=20, random_immigrants_fraction=0.15,
           return_train_score=True,
           runtime_config=RuntimeConfig(n_jobs=-1,
                                        pre_dispatch='2*n_jobs',
                                        error_score=nan,
                                        return_train_score=False,
                                        use_cache=True,
                                        parallel_backend='auto',
                                        verbose=True),
           scoring='roc_auc', sharing_radius=0.35)
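The `share` events in the log come from `fitness_sharing=True`. A minimal sketch of classic fitness sharing follows, in which each raw score is divided by a niche count so that crowded candidates look less attractive to selection. The `radius` and `alpha` arguments mirror `sharing_radius` and `sharing_alpha`, but the helper names, the one-dimensional positions, and the scores are hypothetical, and the library's own distance measure and formula may differ.

```python
def sharing(distance, radius, alpha):
    """Sharing kernel: 1 at distance 0, falling to 0 at the radius."""
    if distance >= radius:
        return 0.0
    return 1.0 - (distance / radius) ** alpha


def shared_fitness(fitnesses, positions, radius=0.35, alpha=1.0):
    """Divide each raw fitness by its niche count (sum of sharing values)."""
    shared = []
    for i, fitness in enumerate(fitnesses):
        niche_count = sum(
            sharing(abs(positions[i] - positions[j]), radius, alpha)
            for j in range(len(positions))
        )
        # niche_count is at least 1 because every candidate shares with itself.
        shared.append(fitness / niche_count)
    return shared


# Three near-identical candidates and one isolated candidate on a
# normalized 1-D genotype axis (hypothetical positions and scores).
positions = [0.10, 0.12, 0.14, 0.90]
fitnesses = [0.990, 0.990, 0.990, 0.985]
shared = shared_fitness(fitnesses, positions)
print([round(value, 3) for value in shared])
```

In this toy population the isolated candidate keeps its full score while the three clustered candidates are penalized, even though their raw scores are slightly higher.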


Inspect Results and Telemetry

The usual sklearn-style attributes are available: best_params_, best_score_, and best_estimator_. The library also records optimization mechanics in fit_stats_ and per-generation telemetry in history.

These fields are especially useful when tuning the search itself. A high cache_hits count means the search keeps revisiting candidates it has already scored. If diversity collapses early, try stronger mutation, more random immigrants, a larger population, or fitness sharing.

[6]:

print("Best CV ROC AUC:", round(ga_search.best_score_, 4))
print("Best parameters:")
pprint(ga_search.best_params_)

ga_metrics = evaluate_classifier(ga_search, X_test, y_test)
pd.DataFrame([baseline_metrics, ga_metrics], index=["baseline", "ga_search"])

Best CV ROC AUC: 0.9917
Best parameters:
{'ccp_alpha': 0.008346993411164376,
 'max_depth': 6,
 'max_features': 'log2',
 'min_samples_leaf': 5,
 'min_samples_split': 7,
 'n_estimators': 56}

[6]:

	accuracy	balanced_accuracy	roc_auc
baseline	0.935673	0.929761	0.991311
ga_search	0.929825	0.925088	0.986565

[7]:

ga_search.fit_stats_

[7]:

{'evaluated_candidates': 462,
 'unique_candidates': 460,
 'cross_validate_calls': 460,
 'cache_hits': 2,
 'duplicate_candidates': 0,
 'skipped_invalid_candidates': 0,
 'population_parallel_batches': 13,
 'population_serial_batches': 0,
 'random_immigrants': 36,
 'local_refinement_candidates': 2}
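The counters are easier to interpret as ratios. The sketch below copies a few values from the `fit_stats_` output above into a plain dictionary and derives a cache hit rate and a unique-candidate rate; the variable names are illustrative only.

```python
# Values copied from the fit_stats_ output of this run.
fit_stats = {
    "evaluated_candidates": 462,
    "unique_candidates": 460,
    "cross_validate_calls": 460,
    "cache_hits": 2,
    "random_immigrants": 36,
}

evaluated = fit_stats["evaluated_candidates"]
cache_hit_rate = fit_stats["cache_hits"] / evaluated
unique_rate = fit_stats["unique_candidates"] / evaluated

print(f"cache hit rate:        {cache_hit_rate:.2%}")
print(f"unique candidate rate: {unique_rate:.2%}")

# In this run, cache hits plus cross-validation calls account for every
# evaluated candidate.
accounted = fit_stats["cache_hits"] + fit_stats["cross_validate_calls"]
print("all evaluations accounted for:", accounted == evaluated)
```

A cache hit rate this low suggests the search rarely revisited candidates in this run, so the cache saved little work here.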

[8]:

history = pd.DataFrame(ga_search.history)
telemetry_columns = [
    "gen",
    "fitness",
    "fitness_max",
    "fitness_std",
    "unique_individual_ratio",
    "genotype_diversity",
    "stagnation_generations",
    "best_generation",
]
history[[column for column in telemetry_columns if column in history.columns]].tail()

[8]:

	gen	fitness	fitness_max	fitness_std	unique_individual_ratio	genotype_diversity	stagnation_generations	best_generation
7	7	0.985883	0.990676	0.002142	0.80	0.421053	1	6
8	8	0.985202	0.990143	0.001566	0.85	0.421053	2	6
9	9	0.985969	0.990143	0.002267	0.75	0.403509	3	6
10	10	0.986397	0.990106	0.002121	0.90	0.438596	4	6
11	11	0.986717	0.991707	0.002035	0.75	0.385965	6	6
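The stagnation counter can be derived from the best-so-far score alone. The sketch below uses a hypothetical helper, `stagnation_counts`, and feeds it the `best` column from the verbose log of the GASearchCV fit; it assumes the counter resets only on a strict improvement.

```python
def stagnation_counts(best_so_far):
    """Count generations since the best-so-far score last improved."""
    counts = []
    best = float("-inf")
    stagnant = 0
    for score in best_so_far:
        if score > best:
            best = score
            stagnant = 0
        else:
            stagnant += 1
        counts.append(stagnant)
    return counts


# `best` column of the verbose log, generations 0 to 11.
best_column = [0.99076] * 6 + [0.99171] * 6
counts = stagnation_counts(best_column)
print(counts)
# [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
```

The result matches the `stag` column of that log. The counter never reaches 10, which is consistent with TimerStopping rather than ConsecutiveStopping(generations=10) ending this run.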

A compact plot can make the search dynamics easier to read. The first chart shows best-so-far fitness, current-generation best, and population average; the second chart shows diversity signals. If the diversity curves drop to zero early while fitness stops improving, the search is probably over-exploiting one region.

[9]:

ax = history.plot(x="gen", y=["fitness_best", "fitness_max", "fitness"], marker="o", figsize=(8, 4))
ax.set_title("Fitness over generations")
ax.set_xlabel("Generation")
ax.set_ylabel("ROC AUC")

[9]:

Text(0, 0.5, 'ROC AUC')

[Figure: "Fitness over generations". Lines for fitness_best, fitness_max, and fitness; x-axis: Generation, y-axis: ROC AUC.]

[10]:

diversity_columns = [
    column
    for column in ["unique_individual_ratio", "genotype_diversity"]
    if column in history.columns
]

ax = history.plot(x="gen", y=diversity_columns, marker="o", figsize=(8, 4))
ax.set_title("Population diversity over generations")
ax.set_xlabel("Generation")
ax.set_ylabel("Diversity")

[10]:

Text(0, 0.5, 'Diversity')

[Figure: "Population diversity over generations". Lines for unique_individual_ratio and genotype_diversity; x-axis: Generation, y-axis: Diversity.]
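To build intuition for the two diversity signals, the sketch below computes two common measures on toy populations: the share of distinct individuals and the mean pairwise Hamming distance. This notebook does not state how the library defines `unique_individual_ratio` and `genotype_diversity`, so both helpers are assumptions for illustration.

```python
from itertools import combinations


def unique_ratio(population):
    """Fraction of distinct individuals in the population."""
    return len(set(population)) / len(population)


def mean_hamming_diversity(population):
    """Mean pairwise fraction of differing genes (0 means all identical)."""
    pairs = list(combinations(population, 2))
    total = sum(
        sum(a != b for a, b in zip(left, right)) / len(left)
        for left, right in pairs
    )
    return total / len(pairs)


# Toy populations of 4-gene individuals (hypothetical data).
diverse = [(1, 0, 1, 0), (0, 1, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1)]
collapsed = [(1, 0, 1, 0)] * 3 + [(1, 0, 1, 1)]

for name, population in [("diverse", diverse), ("collapsed", collapsed)]:
    print(
        f"{name:9s} unique={unique_ratio(population):.2f} "
        f"hamming={mean_hamming_diversity(population):.3f}"
    )
```

The collapsed population scores much lower on both measures, which is the pattern to watch for when fitness also stops improving.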

Compare With RandomizedSearchCV

Genetic search is most useful when the search space is large, mixed-type, or expensive enough that exhaustive grids become unattractive. A lightweight RandomizedSearchCV baseline is still useful because it tells us whether the GA is paying for itself.

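The parameter distributions below cover roughly the same region as the genetic search space, but they use sklearn/scipy objects instead of sklearn-genetic-opt dimensions. scipy's randint excludes its upper bound, so randint(40, 141) draws integers from 40 to 140, and ccp_alpha is sampled from a 20-point grid rather than a continuous range.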

[11]:

randomized_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=1),
    param_distributions={
        "n_estimators": randint(40, 141),
        "max_depth": randint(2, 13),
        "min_samples_split": randint(2, 13),
        "min_samples_leaf": randint(1, 9),
        "max_features": ["sqrt", "log2", None],
        "ccp_alpha": np.linspace(0.0, 0.03, 20),
    },
    n_iter=12,
    scoring="roc_auc",
    cv=cv,
    n_jobs=-1,
    random_state=RANDOM_STATE,
    refit=True,
)

randomized_search.fit(X_train, y_train)
randomized_metrics = evaluate_classifier(randomized_search, X_test, y_test)

pd.DataFrame(
    [baseline_metrics, randomized_metrics, ga_metrics],
    index=["baseline", "randomized_search", "ga_search"],
)

[11]:

	accuracy	balanced_accuracy	roc_auc
baseline	0.935673	0.929761	0.991311
randomized_search	0.929825	0.925088	0.986419
ga_search	0.929825	0.925088	0.986565
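When reading this table, keep the two search budgets in mind. The arithmetic below is a rough estimate, assuming each candidate costs one model fit per CV fold and ignoring the final refit; the GA figure comes from `cross_validate_calls` in `fit_stats_`.

```python
n_splits = 3  # StratifiedKFold(n_splits=3)

randomized_candidates = 12  # n_iter in RandomizedSearchCV
ga_cross_validate_calls = 460  # from ga_search.fit_stats_

randomized_fits = randomized_candidates * n_splits
ga_fits = ga_cross_validate_calls * n_splits

print(f"RandomizedSearchCV fits: {randomized_fits}")
print(f"GASearchCV fits:         {ga_fits}")
print(f"Budget ratio:            {ga_fits / randomized_fits:.1f}x")
```

Under these assumptions the genetic search used far more model fits than the randomized baseline, so a like-for-like comparison would need a larger `n_iter` or a smaller GA budget.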

Feature Selection With GAFeatureSelectionCV

The same optimizer ideas can be used for feature selection. Here the individual is a binary mask instead of a hyperparameter vector.

PopulationConfig(initializer="smart") creates diverse masks with different numbers of selected features. max_features caps how many features a valid mask may select (10 here). Invalid masks are skipped instead of being cross-validated, so no evaluation time is spent on candidates already known to be invalid.

[12]:

feature_selector = GAFeatureSelectionCV(
    estimator=RandomForestClassifier(
        random_state=RANDOM_STATE,
        n_jobs=1,
        **ga_search.best_params_,
    ),
    scoring="roc_auc",
    cv=cv,
    max_features=10,
    evolution_config=EvolutionConfig(population_size=14, generations=10),
    population_config=PopulationConfig(initializer="smart"),
    runtime_config=RuntimeConfig(
        n_jobs=-1,
        parallel_backend="auto",
        use_cache=True,
        verbose=True,
    ),
    optimization_config=OptimizationConfig(
        local_search=True,
        local_search_top_k=2,
        local_search_steps=1,
        local_search_radius=0.15,
        diversity_control=True,
        diversity_threshold=0.30,
        random_immigrants_fraction=0.10,
        fitness_sharing=True,
        sharing_radius=0.40,
    ),
)

feature_selector.fit(X_train, y_train, callbacks=[TimerStopping(total_seconds=120)])

 gen evals           avg          best     div  unique  stag     mut   sel             events
---- ----- ------------- ------------- ------- ------- ----- ------- ----- ------------------
   0    14       0.93158       0.98816   0.074   1.000     0       -     - -
   1    28       0.98468       0.98816   0.074   0.500     1   0.800     3 div,imm=3,dup=2,sh
   2    28       0.98443       0.98849   0.077   0.786     0   0.800     3 div,imm=3,dup=1,sh
   3    28       0.98230       0.98849   0.074   0.643     1   0.800     3 div,imm=3,share
   4    28       0.98342       0.98849   0.074   0.857     2   0.800     3 div,imm=3,dup=1,sh
   5    28       0.98267       0.99079   0.074   0.786     0   0.800     3 div,imm=3,dup=1,sh
   6    28       0.98361       0.99079   0.077   0.714     1   0.800     3 div,imm=3,share
   7    28       0.98253       0.99079   0.077   0.786     2   0.800     3 div,imm=3,share
   8    28       0.98454       0.99381   0.077   0.857     0   0.800     3 div,imm=3,share
   9    28       0.97952       0.99381   0.077   0.857     1   0.800     3 div,imm=3,dup=2,sh
  10    28       0.98315       0.99381   0.077   0.571     2   0.800     3 div,imm=3,dup=2,sh

[12]:

GAFeatureSelectionCV(cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
                     diversity_control=True, diversity_threshold=0.3,
                     estimator=RandomForestClassifier(ccp_alpha=0.008346993411164376,
                                                      max_depth=6,
                                                      max_features='log2',
                                                      min_samples_leaf=5,
                                                      min_samples_split=7,
                                                      n_estimators=56, n_jobs=1,
                                                      random_state=42),
                     evolution_config=EvolutionConfig(population_s...
                                                            final_selection=False,
                                                            final_selection_top_k=3,
                                                            final_selection_cv=None),
                     population_config=PopulationConfig(initializer='smart',
                                                        warm_start_configs=[]),
                     population_size=14,
                     runtime_config=RuntimeConfig(n_jobs=-1,
                                                  pre_dispatch='2*n_jobs',
                                                  error_score=nan,
                                                  return_train_score=False,
                                                  use_cache=True,
                                                  parallel_backend='auto',
                                                  verbose=True),
                     scoring='roc_auc', sharing_radius=0.4)
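The binary-mask representation can be sketched in a few lines. The helpers below are hypothetical and only illustrate the idea of checking a mask against `max_features` before spending any cross-validation time on it; treating an empty mask as invalid is an assumption, not something this notebook states about the library.

```python
def is_valid_mask(mask, max_features):
    """Assume a usable mask keeps between 1 and max_features columns."""
    selected = sum(mask)
    return 1 <= selected <= max_features


def apply_mask(row, mask):
    """Keep only the values whose mask entry is True."""
    return [value for value, keep in zip(row, mask) if keep]


n_features = 30  # columns in the breast cancer dataset
max_features = 10  # same cap as the GAFeatureSelectionCV call

small_mask = [index % 5 == 0 for index in range(n_features)]  # keeps 6 columns
full_mask = [True] * n_features
empty_mask = [False] * n_features

for name, mask in [("small", small_mask), ("full", full_mask), ("empty", empty_mask)]:
    print(f"{name:5s} selected={sum(mask):2d} valid={is_valid_mask(mask, max_features)}")

row = list(range(n_features))  # stand-in for one sample's feature values
print(apply_mask(row, small_mask))
```

After fitting, `best_features_` plays the role of the winning mask, so it can be used to index the columns of the training data in the same way.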

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.


[13]:

# Map the boolean support mask back to the training column names.
selected_features = X_train.columns[feature_selector.support_]
print(f"Selected {len(selected_features)} features:")
print(selected_features.tolist())

# Score the fitted selector on the held-out test set and compare all four models.
selector_metrics = evaluate_classifier(feature_selector, X_test, y_test)
pd.DataFrame(
    [baseline_metrics, randomized_metrics, ga_metrics, selector_metrics],
    index=["baseline", "randomized_search", "ga_search", "feature_selector"],
)

Selected 10 features:
['mean texture', 'mean symmetry', 'area error', 'smoothness error', 'concave points error', 'symmetry error', 'worst radius', 'worst texture', 'worst concavity', 'worst concave points']

[13]:

                   accuracy  balanced_accuracy   roc_auc
baseline           0.935673           0.929761  0.991311
randomized_search  0.929825           0.925088  0.986419
ga_search          0.929825           0.925088  0.986565
feature_selector   0.935673           0.923481  0.989486
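
The accuracy gaps in this table are small relative to the size of the test set. As a rough sanity check, the sketch below converts each accuracy back into a count of correct predictions. It assumes 171 test samples (the total support in the classification report below); the dictionary simply copies the accuracy column, and nothing here is recomputed from the fitted models.

```python
# Accuracy values copied from the comparison table above.
accuracy = {
    "baseline": 0.935673,
    "randomized_search": 0.929825,
    "ga_search": 0.929825,
    "feature_selector": 0.935673,
}

# Assumption: 171 held-out samples, taken from the report's total support.
N_TEST = 171

# Convert each accuracy back into a whole number of correct predictions.
correct = {name: round(acc * N_TEST) for name, acc in accuracy.items()}

for name, n_correct in correct.items():
    errors = N_TEST - n_correct
    print(f"{name:18s} {n_correct}/{N_TEST} correct, {errors} errors")

# The smallest accuracy change that the test set can register.
print(f"One test sample is worth {1 / N_TEST:.4f} accuracy")
```

Under that assumption, the baseline and the feature selector get 160 samples right and the two searches get 159, so the accuracy column differs by a single test sample. Differences of that size are better judged with ROC AUC or cross-validated scores than with test accuracy alone.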

[14]:

print(classification_report(y_test, feature_selector.predict(X_test), target_names=data.target_names))

              precision    recall  f1-score   support

   malignant       0.95      0.88      0.91        64
      benign       0.93      0.97      0.95       107

    accuracy                           0.94       171
   macro avg       0.94      0.92      0.93       171
weighted avg       0.94      0.94      0.94       171
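
The rounded values in this report can be tied back to the feature selector's accuracy and balanced accuracy in the comparison table. The sketch below is a worked example under one assumption: that 56 of the 64 malignant samples and 104 of the 107 benign samples were classified correctly. Those counts are inferred from the rounded recall and support columns, not read from the notebook.

```python
# Assumption: per-class counts inferred from the rounded report above
# (recall x support); they are not taken from the notebook itself.
support = {"malignant": 64, "benign": 107}
correct = {"malignant": 56, "benign": 104}

# Recall: fraction of each true class that was predicted correctly.
recall = {label: correct[label] / support[label] for label in support}

# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = sum(recall.values()) / len(recall)

# Plain accuracy pools all samples, so the larger class counts for more.
accuracy = sum(correct.values()) / sum(support.values())

# In a binary problem, every miss of one class is predicted as the other class.
missed = {label: support[label] - correct[label] for label in support}
precision = {
    "malignant": correct["malignant"] / (correct["malignant"] + missed["benign"]),
    "benign": correct["benign"] / (correct["benign"] + missed["malignant"]),
}

print(f"recall:            {recall}")
print(f"precision:         {precision}")
print(f"accuracy:          {accuracy:.6f}")
print(f"balanced accuracy: {balanced_accuracy:.6f}")
```

With these counts, accuracy and balanced accuracy agree with the feature_selector row of the table to six decimals. The gap between the two metrics comes from the lower recall on the smaller malignant class.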

Practical Takeaways

Start with PopulationConfig(initializer="smart"); it usually gives better early coverage than random initialization.
Use fit_stats_ to understand the cost of the run: evaluated candidates, unique candidates, cache hits, skipped invalid masks, and cross-validation calls.
Use history to decide whether the optimizer is exploring enough. Low diversity plus stalled fitness suggests stronger mutation, fitness sharing, random immigrants, or a larger population.
Use OptimizationConfig(local_search=True) when the GA already finds good regions and you want a final exploitation pass.
Keep a sklearn baseline such as RandomizedSearchCV nearby. It is the simplest way to check whether a more advanced optimizer is improving quality enough to justify extra search time.
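
The "low diversity plus stalled fitness" rule of thumb above can be sketched as a small check over per-generation values. The helper name `looks_stagnant`, its default thresholds, and the plain-list inputs are all hypothetical. In practice the two sequences would be extracted from the optimizer's history, whose exact keys are not assumed here, and the example numbers are made up for illustration. The check assumes a score that is being maximized.

```python
from typing import Sequence


def looks_stagnant(
    best_fitness: Sequence[float],
    diversity: Sequence[float],
    window: int = 5,
    min_improvement: float = 1e-3,
    diversity_threshold: float = 0.3,
) -> bool:
    """Flag a run whose recent generations show little gain and low diversity.

    Hypothetical helper: assumes a maximized score and one value per generation.
    """
    if len(best_fitness) != len(diversity):
        raise ValueError("best_fitness and diversity must have the same length")
    if len(best_fitness) <= window:
        return False  # Not enough generations to judge.

    # Improvement of the best score over the last `window` generations.
    recent_gain = best_fitness[-1] - best_fitness[-1 - window]
    # Average population diversity over the same window.
    recent_diversity = sum(diversity[-window:]) / window

    return recent_gain < min_improvement and recent_diversity < diversity_threshold


# Made-up illustration values, one entry per generation.
fitness = [0.950, 0.962, 0.970, 0.971, 0.971, 0.971, 0.971, 0.971, 0.971]
diversity = [0.60, 0.45, 0.35, 0.28, 0.25, 0.22, 0.20, 0.20, 0.19]

if looks_stagnant(fitness, diversity):
    print("Consider stronger mutation, fitness sharing, or a larger population.")
```

Requiring both conditions is a deliberate choice in this sketch: a flat score with high diversity may simply mean the search is still exploring, while low diversity with a rising score is normal convergence.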
