Advanced Hyperparameter Search With Random Forest
This notebook is a guided tour of advanced optimization controls available in sklearn-genetic-opt. We will tune a RandomForestClassifier on the breast cancer dataset, inspect optimizer telemetry, compare against a lightweight randomized-search baseline, and then reuse the same ideas for feature selection.
Problem Setup
The breast cancer dataset is a binary classification task. It is small enough for a documentation example, but it still has enough numeric features to make model selection and feature selection meaningful.
We use a fixed train/test split and a shuffled StratifiedKFold so the notebook is reproducible.
[1]:
import warnings
from pprint import pprint
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
from scipy.stats import randint
from sklearn_genetic import (
    EvolutionConfig,
    GAFeatureSelectionCV,
    GASearchCV,
    OptimizationConfig,
    PopulationConfig,
    RuntimeConfig,
)
from sklearn_genetic.callbacks import ConsecutiveStopping, DeltaThreshold, TimerStopping
from sklearn_genetic.schedules import ExponentialAdapter, InverseAdapter
from sklearn_genetic.space import Categorical, Continuous, Integer
warnings.filterwarnings("ignore", category=UserWarning)
RANDOM_STATE = 42
[2]:
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.30,
    stratify=y,
    random_state=RANDOM_STATE,
)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)
print(f"Training shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")
print(f"Positive class rate: {y.mean():.3f}")
Training shape: (398, 30)
Test shape: (171, 30)
Positive class rate: 0.627
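To make the reproducibility claim concrete, here is a minimal standard-library sketch of what a shuffled, stratified split does: samples are shuffled within each class using a seeded generator and then dealt out to folds, so every fold keeps roughly the overall class ratio and the same seed always yields the same folds. The function `stratified_folds` and the toy labels are hypothetical illustrations; this is not how `StratifiedKFold` is implemented internally.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits, seed):
    """Assign sample indices to folds so class ratios stay similar per fold."""
    rng = random.Random(seed)  # fixed seed -> identical folds on every run
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class before dealing out
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Toy labels with roughly the positive rate printed above (hypothetical data).
labels = [1] * 63 + [0] * 37
folds = stratified_folds(labels, n_splits=3, seed=42)
for fold in folds:
    positive_rate = sum(labels[i] for i in fold) / len(fold)
    print(len(fold), round(positive_rate, 3))
```

Each fold's positive rate stays close to the overall rate, which is the property that keeps cross-validated scores comparable across candidates.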
Baseline Model
Before tuning anything, train a plain random forest. This gives us a practical reference point: a genetic search should either improve the score, find a simpler configuration, or give us useful telemetry about the search process.
[3]:
def evaluate_classifier(estimator, X_eval, y_eval):
    predictions = estimator.predict(X_eval)
    probabilities = estimator.predict_proba(X_eval)[:, 1]
    return {
        "accuracy": accuracy_score(y_eval, predictions),
        "balanced_accuracy": balanced_accuracy_score(y_eval, predictions),
        "roc_auc": roc_auc_score(y_eval, probabilities),
    }
baseline = RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=1)
baseline.fit(X_train, y_train)
baseline_metrics = evaluate_classifier(baseline, X_test, y_test)
baseline_metrics
[3]:
{'accuracy': 0.935672514619883,
'balanced_accuracy': 0.9297605140186915,
'roc_auc': 0.991311331775701}
Define a Genetic Search Space
sklearn-genetic-opt uses explicit search-space objects instead of sklearn parameter distributions. This keeps integer, continuous, and categorical choices clear.
In this example we tune both model capacity and split behavior. The search space is intentionally moderate so the notebook runs quickly.
[4]:
param_grid = {
    "n_estimators": Integer(40, 140),
    "max_depth": Integer(2, 12),
    "min_samples_split": Integer(2, 12),
    "min_samples_leaf": Integer(1, 8),
    "max_features": Categorical(["sqrt", "log2", None]),
    "ccp_alpha": Continuous(0.0, 0.03),
}
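As a rough mental model of what these typed dimensions mean, the sketch below draws one candidate from the same ranges using only the standard library. The helper functions are hypothetical stand-ins, not the library's `Integer`, `Continuous`, or `Categorical` classes, and treating both integer bounds as inclusive is an assumption made for this illustration.

```python
import random

rng = random.Random(42)  # seeded so the draw is repeatable


# Hypothetical stand-ins for the three dimension types (not the library's classes).
def sample_integer(low, high):
    return rng.randint(low, high)  # both bounds inclusive in this sketch


def sample_continuous(low, high):
    return rng.uniform(low, high)


def sample_categorical(choices):
    return rng.choice(choices)


def sample_candidate():
    """Draw one hyperparameter set from the same ranges as param_grid above."""
    return {
        "n_estimators": sample_integer(40, 140),
        "max_depth": sample_integer(2, 12),
        "min_samples_split": sample_integer(2, 12),
        "min_samples_leaf": sample_integer(1, 8),
        "max_features": sample_categorical(["sqrt", "log2", None]),
        "ccp_alpha": sample_continuous(0.0, 0.03),
    }


candidate = sample_candidate()
print(candidate)
```

Keeping the type explicit matters for the genetic operators: a mutation on an integer gene must stay an integer, and a categorical gene can only switch between its listed choices.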
Configure GASearchCV
This configuration demonstrates several optimizer controls:
- PopulationConfig(initializer="smart") seeds a more useful initial population using estimator defaults, stratified categorical choices, and Latin hypercube sampling for numeric dimensions.
- warm_start_configs injects a known reasonable configuration into the first population.
- RuntimeConfig(parallel_backend="auto") lets the estimator decide whether to parallelize candidate evaluation or cross-validation.
- OptimizationConfig(local_search=True) performs a short refinement around the best candidates at the end.
- OptimizationConfig(diversity_control=True) increases mutation pressure and can inject random candidates when the population collapses too early.
- OptimizationConfig(fitness_sharing=True) reduces crowding pressure so similar candidates do not dominate selection too soon.
- Adaptive schedules let crossover and mutation probabilities evolve over generations.
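Latin hypercube sampling, mentioned above for numeric dimensions, is easy to picture with a small sketch: each dimension's range is cut into as many equal strata as there are samples, and every stratum is used exactly once. The function `latin_hypercube` below is a hypothetical textbook version written with the standard library; the library's own initializer may differ in its details.

```python
import random


def latin_hypercube(n_samples, bounds, seed=0):
    """Return n_samples points where each dimension uses every stratum once."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        width = (high - low) / n_samples
        # Draw one point uniformly inside each of the n_samples strata ...
        column = [low + (k + rng.random()) * width for k in range(n_samples)]
        # ... then shuffle so strata are paired at random across dimensions.
        rng.shuffle(column)
        columns.append(column)
    return [tuple(column[i] for column in columns) for i in range(n_samples)]


# Two numeric ranges borrowed from the search space: ccp_alpha and max_depth.
bounds = [(0.0, 0.03), (2.0, 12.0)]
points = latin_hypercube(5, bounds, seed=42)
for point in points:
    print(tuple(round(value, 4) for value in point))
```

Compared with independent uniform draws, this guarantees that no part of any single dimension is left unsampled in the first generation, which is the "better early coverage" the smart initializer aims for.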
[5]:
callbacks = [
    ConsecutiveStopping(generations=10, metric="fitness_best"),
    TimerStopping(total_seconds=240),
]
ga_search = GASearchCV(
    estimator=RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=1),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=cv,
    evolution_config=EvolutionConfig(
        population_size=20,
        generations=15,
        crossover_probability=ExponentialAdapter(initial_value=0.8, end_value=0.4, adaptive_rate=0.15),
        mutation_probability=InverseAdapter(initial_value=0.25, end_value=0.05, adaptive_rate=0.2),
        tournament_size=3,
        elitism=True,
        keep_top_k=3,
    ),
    population_config=PopulationConfig(
        initializer="smart",
        warm_start_configs=[
            {
                "n_estimators": 100,
                "max_depth": 6,
                "min_samples_split": 4,
                "min_samples_leaf": 2,
                "max_features": "sqrt",
                "ccp_alpha": 0.0,
            }
        ],
    ),
    runtime_config=RuntimeConfig(
        n_jobs=-1,
        parallel_backend="auto",
        use_cache=True,
        verbose=True,
        return_train_score=False,
    ),
    optimization_config=OptimizationConfig(
        local_search=True,
        local_search_top_k=2,
        local_search_steps=1,
        local_search_radius=0.2,
        diversity_control=True,
        diversity_threshold=0.35,
        diversity_stagnation_generations=3,
        diversity_mutation_boost=1.8,
        random_immigrants_fraction=0.15,
        fitness_sharing=True,
        sharing_radius=0.35,
        sharing_alpha=1.0,
    ),
)
ga_search.fit(X_train, y_train, callbacks=callbacks)
gen evals avg best div unique stag mut sel events
---- ----- ------------- ------------- ------- ------- ----- ------- ----- ------------------
0 20 0.98625 0.99076 0.579 1.000 0 - - -
1 40 0.98519 0.99076 0.386 0.650 1 0.200 3 dup=1,share
2 40 0.98587 0.99076 0.342 0.900 2 0.217 3 dup=9,share
3 40 0.98676 0.99076 0.316 0.700 3 0.304 3 div,imm=6,dup=2,sh
4 40 0.98610 0.99076 0.307 0.750 4 0.315 3 div,imm=6,dup=2,sh
5 40 0.98464 0.99076 0.386 0.750 5 0.290 3 div,imm=6,dup=12,s
6 40 0.98632 0.99171 0.412 0.900 0 0.270 3 div,imm=6,dup=7,sh
7 40 0.98588 0.99171 0.421 0.800 1 0.141 3 dup=15,share
8 40 0.98520 0.99171 0.421 0.850 2 0.133 3 dup=17,share
9 40 0.98597 0.99171 0.404 0.750 3 0.127 3 dup=20,share
10 40 0.98640 0.99171 0.439 0.900 4 0.219 3 div,imm=6,dup=16,s
11 40 0.98589 0.99171 0.351 0.700 5 0.210 3 div,imm=6,dup=13,s
INFO: TimerStopping callback met its criteria
INFO: Stopping the algorithm
[5]:
GASearchCV(crossover_probability=<sklearn_genetic.schedules.schedulers.ExponentialAdapter object at 0x000001C5C12A12B0>,
cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
diversity_control=True, diversity_mutation_boost=1.8,
diversity_stagnation_generations=3, diversity_threshold=0.35,
estimator=RandomForestClassifier(ccp_alpha=0.0083469934111643...
'max_features': 'sqrt',
'min_samples_leaf': 2,
'min_samples_split': 4,
'n_estimators': 100}]),
population_size=20, random_immigrants_fraction=0.15,
return_train_score=True,
runtime_config=RuntimeConfig(n_jobs=-1,
pre_dispatch='2*n_jobs',
error_score=nan,
return_train_score=False,
use_cache=True,
parallel_backend='auto',
verbose=True),
scoring='roc_auc', sharing_radius=0.35)
Parameters
| Parameter | Value | |
|---|---|---|
| estimator | RandomForestC...ndom_state=42) | |
| cv | StratifiedKFo... shuffle=True) | |
| param_grid | {'ccp_alpha': <sklearn_gene...001C5C12A0D70>, 'max_depth': <sklearn_gene...001C5C1231BD0>, 'max_features': <sklearn_gene...001C5C12A0EC0>, 'min_samples_leaf': <sklearn_gene...001C5C12395B0>, ...} | |
| scoring | 'roc_auc' | |
| population_size | 20 | |
| generations | 15 | |
| crossover_probability | <sklearn_gene...001C5C12A12B0> | |
| mutation_probability | <sklearn_gene...001C5C12A1400> | |
| keep_top_k | 3 | |
| n_jobs | -1 | |
| return_train_score | True | |
| evolution_config | EvolutionConf...MuPlusLambda') | |
| population_config | PopulationCon...alpha': 0.0}]) | |
| runtime_config | RuntimeConfig... verbose=True) | |
| optimization_config | OptimizationC...ction_cv=None) | |
| local_search | True | |
| local_search_top_k | 2 | |
| local_search_radius | 0.2 | |
| diversity_control | True | |
| diversity_threshold | 0.35 | |
| diversity_stagnation_generations | 3 | |
| diversity_mutation_boost | 1.8 | |
| random_immigrants_fraction | 0.15 | |
| fitness_sharing | True | |
| sharing_radius | 0.35 | |
| tournament_size | 3 | |
| elitism | True | |
| verbose | True | |
| criteria | 'max' | |
| algorithm | 'eaMuPlusLambda' | |
| refit | True | |
| pre_dispatch | '2*n_jobs' | |
| error_score | nan | |
| log_config | None | |
| use_cache | True | |
| warm_start_configs | None | |
| parallel_backend | 'auto' | |
| population_initializer | 'smart' | |
| local_search_steps | 1 | |
| adaptive_selection | False | |
| selection_pressure_min | 2 | |
| selection_pressure_max | None | |
| offspring_diversity_retries | 0 | |
| sharing_alpha | 1.0 | |
| final_selection | False | |
| final_selection_top_k | 3 | |
| final_selection_cv | None |
Fitted attributes
| Name | Type | Value |
|---|---|---|
| X_ | DataFrame | mean rad... x 30 columns] |
| best_estimator_ | RandomForestClassifier | RandomForestC...ndom_state=42) |
| best_index_ | int | 57 |
| best_params_ | dict | {'cc...ha': 0.008346993411164376, 'ma...th': 6, 'ma...es': 'log2', 'mi...af': 5, ...} |
| best_score_ | float | 0.9917 |
| classes_ | ndarray[int64](2,) | [0,1] |
| cv_results_ | dict | {'me...me': [np.float64(1.2785807450612385), np.float64(0.8232693672180176), np.float64(0.9571025371551514), np.float64(2.049020846684774), ...], 'me...me': [np.float64(0.4316854476928711), np.float64(0....4202550252277), np.float64(0....5157559712726), np.float64(0.5433539549509684), ...], 'me...re': [np.float64(0.9877237619086648), np.float64(0.9859223873140764), np.float64(0.9898175793553222), np.float64(0.9855488426007236), ...], 'me...re': [np.float64(0.999695527146649), np.float64(0.9978708142672813), np.float64(0.9972397883546328), np.float64(0.9972309381777881), ...], ...} |
| estimator_ | RandomForestClassifier | RandomForestC...ndom_state=42) |
| final_selection_results_ | dict | {'ca...es': [], 'changed': False, 'cv': None, 'enabled': False, ...} |
| fit_stats_ | dict | {'ca...ts': 2, 'cr...ls': 460, 'du...es': 0, 'ev...es': 462, ...} |
| multimetric_ | bool | False |
| n_features_in_ | int | 30 |
| n_splits_ | int | 3 |
| refit_time_ | float | 0.1146 |
| scorer_ | _Scorer | make_scorer(r...edict_proba')) |
| y_ | Series[int64](398,) | 469 1 561 ..., dtype: int64 |
RandomForestClassifier(ccp_alpha=0.008346993411164376, max_depth=6,
max_features='log2', min_samples_leaf=5,
min_samples_split=7, n_estimators=56, n_jobs=1,
random_state=42)
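The `mut` column in the training log above can be related to the `InverseAdapter` settings with a short calculation. The sketch below assumes the inverse schedule has the form end + (initial - end) / (1 + rate * step) and the exponential schedule the form end + (initial - end) * exp(-rate * step); both forms, and the offset step = generation - 1, are assumptions inferred from the printed values rather than taken from the library source. The function names are hypothetical.

```python
import math


def inverse_decay(initial_value, end_value, adaptive_rate, step):
    """Assumed inverse schedule: decays hyperbolically toward end_value."""
    return end_value + (initial_value - end_value) / (1 + adaptive_rate * step)


def exponential_decay(initial_value, end_value, adaptive_rate, step):
    """Assumed exponential schedule: decays geometrically toward end_value."""
    return end_value + (initial_value - end_value) * math.exp(-adaptive_rate * step)


# Mutation settings from the cell above: initial 0.25, end 0.05, rate 0.2.
for generation in (2, 7, 8, 9):
    step = generation - 1  # assumed offset, inferred from the printed log
    print(generation, round(inverse_decay(0.25, 0.05, 0.2, step), 3))
# prints:
# 2 0.217
# 7 0.141
# 8 0.133
# 9 0.127

# Crossover settings from the cell above: initial 0.8, end 0.4, rate 0.15.
print([round(exponential_decay(0.8, 0.4, 0.15, step), 3) for step in range(4)])
```

Under these assumptions the values agree with the `mut` column for generations 2 and 7 to 9. Rows flagged with `div` show larger values, presumably because the diversity mutation boost is applied on top of the schedule; the sketch does not model that adjustment.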
Inspect Results and Telemetry
The usual sklearn-style attributes are available: best_params_, best_score_, and best_estimator_. The library also records optimization mechanics in fit_stats_ and per-generation telemetry in history.
These fields are especially useful when tuning performance. If cache_hits is high, the search is revisiting candidates. If diversity collapses early, try stronger mutation, more random immigrants, a larger population, or fitness sharing.
[6]:
print("Best CV ROC AUC:", round(ga_search.best_score_, 4))
print("Best parameters:")
pprint(ga_search.best_params_)
ga_metrics = evaluate_classifier(ga_search, X_test, y_test)
pd.DataFrame([baseline_metrics, ga_metrics], index=["baseline", "ga_search"])
Best CV ROC AUC: 0.9917
Best parameters:
{'ccp_alpha': 0.008346993411164376,
'max_depth': 6,
'max_features': 'log2',
'min_samples_leaf': 5,
'min_samples_split': 7,
'n_estimators': 56}
[6]:
| | accuracy | balanced_accuracy | roc_auc |
|---|---|---|---|
| baseline | 0.935673 | 0.929761 | 0.991311 |
| ga_search | 0.929825 | 0.925088 | 0.986565 |
[7]:
ga_search.fit_stats_
[7]:
{'evaluated_candidates': 462,
'unique_candidates': 460,
'cross_validate_calls': 460,
'cache_hits': 2,
'duplicate_candidates': 0,
'skipped_invalid_candidates': 0,
'population_parallel_batches': 13,
'population_serial_batches': 0,
'random_immigrants': 36,
'local_refinement_candidates': 2}
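A quick way to read these counters is to turn them into a cache hit rate and an approximate number of model fits. The sketch below copies the relevant values from the output above; the assumption that each cross-validation call fits one model per fold (three folds here, excluding the final refit) is mine and is not reported by the library.

```python
# Values copied from the fit_stats_ output above.
fit_stats = {
    "evaluated_candidates": 462,
    "unique_candidates": 460,
    "cross_validate_calls": 460,
    "cache_hits": 2,
}
n_splits = 3  # matches StratifiedKFold(n_splits=3) used in this notebook

cache_hit_rate = fit_stats["cache_hits"] / fit_stats["evaluated_candidates"]
# Assumption: one model fit per fold for every cross-validation call.
model_fits = fit_stats["cross_validate_calls"] * n_splits

print(f"cache hit rate: {cache_hit_rate:.2%}")      # cache hit rate: 0.43%
print(f"model fits during search: {model_fits}")    # model fits during search: 1380
```

With a hit rate this low the cache saved almost nothing in this run, which is expected when diversity control and random immigrants keep producing new candidates.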
[8]:
history = pd.DataFrame(ga_search.history)
telemetry_columns = [
    "gen",
    "fitness",
    "fitness_max",
    "fitness_std",
    "unique_individual_ratio",
    "genotype_diversity",
    "stagnation_generations",
    "best_generation",
]
history[[column for column in telemetry_columns if column in history.columns]].tail()
[8]:
| | gen | fitness | fitness_max | fitness_std | unique_individual_ratio | genotype_diversity | stagnation_generations | best_generation |
|---|---|---|---|---|---|---|---|---|
| 7 | 7 | 0.985883 | 0.990676 | 0.002142 | 0.80 | 0.421053 | 1 | 6 |
| 8 | 8 | 0.985202 | 0.990143 | 0.001566 | 0.85 | 0.421053 | 2 | 6 |
| 9 | 9 | 0.985969 | 0.990143 | 0.002267 | 0.75 | 0.403509 | 3 | 6 |
| 10 | 10 | 0.986397 | 0.990106 | 0.002121 | 0.90 | 0.438596 | 4 | 6 |
| 11 | 11 | 0.986717 | 0.991707 | 0.002035 | 0.75 | 0.385965 | 6 | 6 |
A compact plot can make the search dynamics easier to read. The first chart shows best-so-far fitness, current-generation best, and population average; the second chart shows diversity signals. If the diversity curves drop to zero early while fitness stops improving, the search is probably over-exploiting one region.
[9]:
ax = history.plot(x="gen", y=["fitness_best", "fitness_max", "fitness"], marker="o", figsize=(8, 4))
ax.set_title("Fitness over generations")
ax.set_xlabel("Generation")
ax.set_ylabel("ROC AUC")
[9]:
Text(0, 0.5, 'ROC AUC')
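The difference between the current-generation best and the best-so-far curve is simply a running maximum, and a stagnation counter can be derived from it. The sketch below uses hypothetical scores, not values from the run above, and the counter logic is one plausible definition rather than the library's own.

```python
from itertools import accumulate

# Hypothetical per-generation best scores (not taken from the run above).
generation_best = [0.9908, 0.9901, 0.9905, 0.9917, 0.9910, 0.9912]

# Best-so-far is the running maximum of the per-generation bests.
best_so_far = list(accumulate(generation_best, max))

# Count how many generations have passed since the running maximum improved.
stagnation = []
count = 0
for i, value in enumerate(best_so_far):
    if i > 0 and value <= best_so_far[i - 1]:
        count += 1
    else:
        count = 0
    stagnation.append(count)

print(best_so_far)  # [0.9908, 0.9908, 0.9908, 0.9917, 0.9917, 0.9917]
print(stagnation)   # [0, 1, 2, 0, 1, 2]
```

A stopping rule based on consecutive generations without improvement amounts to a threshold on a counter of this kind.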
[10]:
diversity_columns = [
    column
    for column in ["unique_individual_ratio", "genotype_diversity"]
    if column in history.columns
]
ax = history.plot(x="gen", y=diversity_columns, marker="o", figsize=(8, 4))
ax.set_title("Population diversity over generations")
ax.set_xlabel("Generation")
ax.set_ylabel("Diversity")
[10]:
Text(0, 0.5, 'Diversity')
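Fitness sharing, one of the remedies for collapsing diversity, can be illustrated with the classic formulation: each individual's fitness is divided by a niche count that grows with the number of close neighbours. The sketch below uses one-dimensional hypothetical genotypes and maps `radius` and `alpha` to the roles that `sharing_radius` and `sharing_alpha` presumably play; the library's actual distance measure and formula are not shown in this notebook, so treat this as a textbook illustration only.

```python
def sharing_weight(distance, radius, alpha):
    """Triangular sharing kernel: 1 at distance 0, falling to 0 at the radius."""
    if distance >= radius:
        return 0.0
    return 1.0 - (distance / radius) ** alpha


def shared_fitness(fitnesses, positions, radius, alpha):
    """Divide each fitness by its niche count (the individual itself counts as 1)."""
    shared = []
    for i, fitness in enumerate(fitnesses):
        niche_count = sum(
            sharing_weight(abs(positions[i] - positions[j]), radius, alpha)
            for j in range(len(positions))
        )
        shared.append(fitness / niche_count)
    return shared


# Three crowded candidates and one isolated candidate (hypothetical values).
positions = [0.10, 0.12, 0.14, 0.90]
fitnesses = [0.99, 0.99, 0.99, 0.98]
shared = shared_fitness(fitnesses, positions, radius=0.35, alpha=1.0)
print([round(value, 3) for value in shared])
```

The isolated candidate keeps its full fitness while the three near-duplicates are penalized, so selection no longer favours the crowded region merely because it is crowded.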
Compare With RandomizedSearchCV
Genetic search is most useful when the search space is large, mixed-type, or expensive enough that exhaustive grids become unattractive. A lightweight RandomizedSearchCV baseline is still useful because it tells us whether the GA is paying for itself.
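The parameter distributions below cover roughly the same region as the genetic search space, but they use sklearn/scipy objects instead of sklearn-genetic-opt dimensions. Two differences are worth keeping in mind: ccp_alpha is drawn from 20 evenly spaced values rather than a continuous range, and with n_iter=12 this baseline evaluates far fewer candidates than the genetic run above (462 evaluated candidates), so the comparison is not budget-matched.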
[11]:
randomized_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=1),
    param_distributions={
        "n_estimators": randint(40, 141),
        "max_depth": randint(2, 13),
        "min_samples_split": randint(2, 13),
        "min_samples_leaf": randint(1, 9),
        "max_features": ["sqrt", "log2", None],
        "ccp_alpha": np.linspace(0.0, 0.03, 20),
    },
    n_iter=12,
    scoring="roc_auc",
    cv=cv,
    n_jobs=-1,
    random_state=RANDOM_STATE,
    refit=True,
)
randomized_search.fit(X_train, y_train)
randomized_metrics = evaluate_classifier(randomized_search, X_test, y_test)
pd.DataFrame(
    [baseline_metrics, randomized_metrics, ga_metrics],
    index=["baseline", "randomized_search", "ga_search"],
)
[11]:
| | accuracy | balanced_accuracy | roc_auc |
|---|---|---|---|
| baseline | 0.935673 | 0.929761 | 0.991311 |
| randomized_search | 0.929825 | 0.925088 | 0.986419 |
| ga_search | 0.929825 | 0.925088 | 0.986565 |
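One detail in the distributions above is easy to miss: `scipy.stats.randint(low, high)` excludes `high`, which is why the upper bounds are 141, 13, 13, and 9 rather than 140, 12, 12, and 8. The sketch below shows the same conversion with the standard library, where `random.randrange` plays the role of the exclusive-upper-bound sampler. It assumes the `Integer` dimensions treat their upper bound as inclusive, which is what the notebook's choice of bounds implies; `exclusive_args` is a hypothetical helper.

```python
import random

rng = random.Random(0)

# Inclusive ranges as written in the genetic search space.
inclusive_bounds = {
    "n_estimators": (40, 140),
    "max_depth": (2, 12),
    "min_samples_split": (2, 12),
    "min_samples_leaf": (1, 8),
}


def exclusive_args(low, high):
    """Convert an inclusive (low, high) range to exclusive-upper-bound arguments."""
    return low, high + 1


for name, (low, high) in inclusive_bounds.items():
    args = exclusive_args(low, high)
    draws = [rng.randrange(*args) for _ in range(10_000)]
    # Every draw stays inside the original inclusive range.
    assert low <= min(draws) and max(draws) <= high
    print(name, args)
# prints:
# n_estimators (40, 141)
# max_depth (2, 13)
# min_samples_split (2, 13)
# min_samples_leaf (1, 9)
```

Getting this off by one would silently shrink the baseline's search region relative to the genetic search.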
Feature Selection With GAFeatureSelectionCV
The same optimizer ideas can be used for feature selection. Here the individual is a binary mask instead of a hyperparameter vector.
PopulationConfig(initializer="smart") creates diverse masks with different numbers of selected features. max_features caps how many features a valid mask may select. Masks that violate this limit are skipped outright, so no cross-validation time is spent on candidates already known to be invalid.
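The idea of a binary mask with a size limit can be sketched in a few lines. The helpers `is_valid_mask` and `apply_mask` are hypothetical, and treating an empty mask as invalid is an assumption made for this illustration; the library's own validity rules may differ.

```python
import random


def is_valid_mask(mask, max_features):
    """A mask is usable if it selects at least one and at most max_features columns."""
    selected = sum(mask)
    return 1 <= selected <= max_features


def apply_mask(row, mask):
    """Keep only the values whose mask entry is True."""
    return [value for value, keep in zip(row, mask) if keep]


rng = random.Random(42)
n_features, max_features = 30, 10

# Random candidate masks; each feature is switched on with probability 0.4.
candidates = [[rng.random() < 0.4 for _ in range(n_features)] for _ in range(6)]
for mask in candidates:
    action = "evaluate" if is_valid_mask(mask, max_features) else "skip"
    print(sum(mask), action)

print(apply_mask([1.5, 2.5, 3.5], [True, False, True]))  # [1.5, 3.5]
```

Checking validity before scoring is cheap, whereas scoring a candidate costs one model fit per fold, which is why skipping invalid masks up front saves time.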
[12]:
feature_selector = GAFeatureSelectionCV(
    estimator=RandomForestClassifier(
        random_state=RANDOM_STATE,
        n_jobs=1,
        **ga_search.best_params_,
    ),
    scoring="roc_auc",
    cv=cv,
    max_features=10,
    evolution_config=EvolutionConfig(population_size=14, generations=10),
    population_config=PopulationConfig(initializer="smart"),
    runtime_config=RuntimeConfig(
        n_jobs=-1,
        parallel_backend="auto",
        use_cache=True,
        verbose=True,
    ),
    optimization_config=OptimizationConfig(
        local_search=True,
        local_search_top_k=2,
        local_search_steps=1,
        local_search_radius=0.15,
        diversity_control=True,
        diversity_threshold=0.30,
        random_immigrants_fraction=0.10,
        fitness_sharing=True,
        sharing_radius=0.40,
    ),
)
feature_selector.fit(X_train, y_train, callbacks=[TimerStopping(total_seconds=120)])
gen evals avg best div unique stag mut sel events
---- ----- ------------- ------------- ------- ------- ----- ------- ----- ------------------
0 14 0.93158 0.98816 0.074 1.000 0 - - -
1 28 0.98468 0.98816 0.074 0.500 1 0.800 3 div,imm=3,dup=2,sh
2 28 0.98443 0.98849 0.077 0.786 0 0.800 3 div,imm=3,dup=1,sh
3 28 0.98230 0.98849 0.074 0.643 1 0.800 3 div,imm=3,share
4 28 0.98342 0.98849 0.074 0.857 2 0.800 3 div,imm=3,dup=1,sh
5 28 0.98267 0.99079 0.074 0.786 0 0.800 3 div,imm=3,dup=1,sh
6 28 0.98361 0.99079 0.077 0.714 1 0.800 3 div,imm=3,share
7 28 0.98253 0.99079 0.077 0.786 2 0.800 3 div,imm=3,share
8 28 0.98454 0.99381 0.077 0.857 0 0.800 3 div,imm=3,share
9 28 0.97952 0.99381 0.077 0.857 1 0.800 3 div,imm=3,dup=2,sh
10 28 0.98315 0.99381 0.077 0.571 2 0.800 3 div,imm=3,dup=2,sh
[12]:
GAFeatureSelectionCV(cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
diversity_control=True, diversity_threshold=0.3,
estimator=RandomForestClassifier(ccp_alpha=0.008346993411164376,
max_depth=6,
max_features='log2',
min_samples_leaf=5,
min_samples_split=7,
n_estimators=56, n_jobs=1,
random_state=42),
evolution_config=EvolutionConfig(population_s...
final_selection=False,
final_selection_top_k=3,
final_selection_cv=None),
population_config=PopulationConfig(initializer='smart',
warm_start_configs=[]),
population_size=14,
runtime_config=RuntimeConfig(n_jobs=-1,
pre_dispatch='2*n_jobs',
error_score=nan,
return_train_score=False,
use_cache=True,
parallel_backend='auto',
verbose=True),
scoring='roc_auc', sharing_radius=0.4)
Parameters
| Parameter | Value | |
|---|---|---|
| estimator | RandomForestC...ndom_state=42) | |
| cv | StratifiedKFo... shuffle=True) | |
| scoring | 'roc_auc' | |
| population_size | 14 | |
| generations | 10 | |
| max_features | 10 | |
| n_jobs | -1 | |
| evolution_config | EvolutionConf...MuPlusLambda') | |
| population_config | PopulationCon...rt_configs=[]) | |
| runtime_config | RuntimeConfig... verbose=True) | |
| optimization_config | OptimizationC...ction_cv=None) | |
| local_search | True | |
| local_search_top_k | 2 | |
| local_search_radius | 0.15 | |
| diversity_control | True | |
| diversity_threshold | 0.3 | |
| fitness_sharing | True | |
| sharing_radius | 0.4 | |
| crossover_probability | 0.2 | |
| mutation_probability | 0.8 | |
| tournament_size | 3 | |
| elitism | True | |
| verbose | True | |
| keep_top_k | 1 | |
| criteria | 'max' | |
| algorithm | 'eaMuPlusLambda' | |
| refit | True | |
| pre_dispatch | '2*n_jobs' | |
| error_score | nan | |
| return_train_score | False | |
| log_config | None | |
| use_cache | True | |
| parallel_backend | 'auto' | |
| population_initializer | 'smart' | |
| local_search_steps | 1 | |
| diversity_stagnation_generations | 5 | |
| diversity_mutation_boost | 2.0 | |
| random_immigrants_fraction | 0.1 | |
| adaptive_selection | False | |
| selection_pressure_min | 2 | |
| selection_pressure_max | None | |
| offspring_diversity_retries | 0 | |
| sharing_alpha | 1.0 |
Fitted attributes
| Name | Type | Value |
|---|---|---|
| X_ | ndarray[float64](398, 30) | [[ 11.62, 18.18, 76.38,..., 0.14, 0.27, 0.09], [ 11.2 , 29.37, 70.67,..., 0. , 0.16, 0.06], [ 10.57, 18.32, 66.82,..., 0.02, 0.27, 0.07], ..., [ 13.65, 13.16, 87.88,..., 0.08, 0.24, 0.09], [ 17.05, 19.08,113.4 ,..., 0.25, 0.31, 0.09], [ 9.9 , 18.06, 64.6 ,..., 0.1 , 0.26, 0.12]] |
| best_estimator_ | RandomForestClassifier | RandomForestC...ndom_state=42) |
| best_features_ | ndarray[bool](30,) | [False, True,False,..., True,False,False] |
| cv_results_ | dict | {'fe...es': [array([False,...False, False]), array([False,...False, False]), array([False,...False, False]), array([False,...False, False]), ...], 'me...me': [np.float64(1.1971229712168376), np.float64(1.2588351567586262), np.float64(1.1254057884216309), np.float64(1.1517895062764485), ...], 'me...me': [np.float64(0.2826677958170573), np.float64(0....4295845031738), np.float64(0....2118593851727), np.float64(0....8149388631183), ...], 'me...re': [np.float64(0.5983236737035607), np.float64(0.9223554080266645), np.float64(0.9676136421292556), np.float64(0.7894379071192447), ...], ...} |
| estimator_ | RandomForestClassifier | RandomForestC...ndom_state=42) |
| fit_stats_ | dict | {'ca...ts': 0, 'cr...ls': 295, 'du...es': 0, 'ev...es': 295, ...} |
| multimetric_ | bool | False |
| n_features_in_ | int | 30 |
| n_splits_ | int | 3 |
| refit_time_ | float | 0.1084 |
| scorer_ | _Scorer | make_scorer(r...edict_proba')) |
| support_ | ndarray[bool](30,) | [False, True,False,..., True,False,False] |
| y_ | ndarray[int64](398,) | [1,1,1,...,1,0,1] |
RandomForestClassifier(ccp_alpha=0.008346993411164376, max_depth=6,
max_features='log2', min_samples_leaf=5,
min_samples_split=7, n_estimators=56, n_jobs=1,
random_state=42)
10 features: x1, x8, x13, x14, x17, x18, x20, x21, x26, x27
[13]:
selected_features = X_train.columns[feature_selector.support_]
print(f"Selected {len(selected_features)} features:")
print(selected_features.tolist())
selector_metrics = evaluate_classifier(feature_selector, X_test, y_test)
pd.DataFrame(
    [baseline_metrics, randomized_metrics, ga_metrics, selector_metrics],
    index=["baseline", "randomized_search", "ga_search", "feature_selector"],
)
Selected 10 features:
['mean texture', 'mean symmetry', 'area error', 'smoothness error', 'concave points error', 'symmetry error', 'worst radius', 'worst texture', 'worst concavity', 'worst concave points']
[13]:
| | accuracy | balanced_accuracy | roc_auc |
|---|---|---|---|
| baseline | 0.935673 | 0.929761 | 0.991311 |
| randomized_search | 0.929825 | 0.925088 | 0.986419 |
| ga_search | 0.929825 | 0.925088 | 0.986565 |
| feature_selector | 0.935673 | 0.923481 | 0.989486 |
[14]:
print(classification_report(y_test, feature_selector.predict(X_test), target_names=data.target_names))
precision recall f1-score support
malignant 0.95 0.88 0.91 64
benign 0.93 0.97 0.95 107
accuracy 0.94 171
macro avg 0.94 0.92 0.93 171
weighted avg 0.94 0.94 0.94 171
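The report also helps explain why the feature selector matches the baseline on accuracy yet scores lower on balanced accuracy. The per-class counts of correct predictions below are not printed anywhere in the notebook; they are reconstructed as the integer counts consistent with the class supports (64 malignant, 107 benign) and the metrics table, so treat them as an inference.

```python
# Class supports from the classification report above.
support = {"malignant": 64, "benign": 107}

# Correct predictions per class, inferred from the reported metrics (assumption).
correct = {
    "baseline": {"malignant": 58, "benign": 102},
    "feature_selector": {"malignant": 56, "benign": 104},
}


def accuracy(counts):
    """Share of all samples predicted correctly."""
    return sum(counts.values()) / sum(support.values())


def balanced_accuracy(counts):
    """Unweighted mean of per-class recall."""
    recalls = [counts[label] / support[label] for label in support]
    return sum(recalls) / len(recalls)


for name, counts in correct.items():
    print(name, round(accuracy(counts), 6), round(balanced_accuracy(counts), 6))
# prints:
# baseline 0.935673 0.929761
# feature_selector 0.935673 0.923481
```

Both models classify 160 of 171 samples correctly, but the feature selector trades two malignant hits for two benign ones. Because the malignant class is smaller, each missed malignant case costs more recall, which lowers the balanced accuracy.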
Practical Takeaways
- Start with PopulationConfig(initializer="smart"); it usually gives better early coverage than random initialization.
- Use fit_stats_ to understand the cost of the run: evaluated candidates, unique candidates, cache hits, skipped invalid masks, and cross-validation calls.
- Use history to decide whether the optimizer is exploring enough. Low diversity plus stalled fitness suggests stronger mutation, fitness sharing, random immigrants, or a larger population.
- Use OptimizationConfig(local_search=True) when the GA already finds good regions and you want a final exploitation pass.
- Keep a sklearn baseline such as RandomizedSearchCV nearby. It is the simplest way to check whether a more advanced optimizer is improving quality enough to justify extra search time.