Feature Selection With Noisy Iris Data

This notebook keeps the original goal of the Iris feature-selection tutorial: use GAFeatureSelectionCV to find a compact subset of useful features. The example now adds synthetic noise features so the selection problem is more realistic.

Problem Setup

The original Iris dataset has only four informative features. To make feature selection visible, we add random noise columns. A useful selector should keep a small subset of original measurements and avoid most noise columns.

We use a Pipeline with StandardScaler and SVC because SVMs are sensitive to feature scale.
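The scale sensitivity mentioned above can be illustrated with a small sketch. The toy array and the helper `squared_distance_share` are hypothetical and not part of the notebook. The sketch only shows how one large-valued column can dominate the squared Euclidean distance that an RBF kernel relies on, and how standardization (the same zero-mean, unit-variance transform StandardScaler applies) rebalances it.

```python
import numpy as np

# Toy data (hypothetical): column 0 is in the thousands, column 1 is below ten.
X_toy = np.array(
    [
        [1000.0, 1.0],
        [2000.0, 5.0],
        [3000.0, 9.0],
    ]
)


def squared_distance_share(X, i, j):
    """Fraction of the squared Euclidean distance between rows i and j
    that each column contributes."""
    squared_diff = (X[i] - X[j]) ** 2
    return squared_diff / squared_diff.sum()


# Standardize each column to zero mean and unit variance (population std),
# which is the transform StandardScaler applies by default.
X_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

raw_share = squared_distance_share(X_toy, 0, 2)
scaled_share = squared_distance_share(X_scaled, 0, 2)

print("raw share per column:   ", raw_share.round(4))
print("scaled share per column:", scaled_share.round(4))
```

Before scaling, almost all of the distance comes from the large-valued column. After scaling, both columns contribute equally in this toy case.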

[1]:

import warnings

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, balanced_accuracy_score, classification_report
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

from sklearn_genetic import (
    EvolutionConfig,
    GAFeatureSelectionCV,
    OptimizationConfig,
    PopulationConfig,
    RuntimeConfig,
)
from sklearn_genetic.callbacks import ConsecutiveStopping, DeltaThreshold, TimerStopping
from sklearn_genetic.schedules import ExponentialAdapter, InverseAdapter

warnings.filterwarnings("ignore", category=UserWarning)

RANDOM_STATE = 42
rng = np.random.default_rng(RANDOM_STATE)

[2]:

iris = load_iris(as_frame=True)
X_original = iris.data
y = iris.target

noise = pd.DataFrame(
    rng.normal(size=(X_original.shape[0], 12)),
    columns=[f"noise_{index:02d}" for index in range(12)],
)
X = pd.concat([X_original, noise], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.30,
    stratify=y,
    random_state=RANDOM_STATE,
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)

print(f"Original features: {X_original.shape[1]}")
print(f"Noise features: {noise.shape[1]}")
print(f"Total features: {X.shape[1]}")

Original features: 4
Noise features: 12
Total features: 16

Baseline With All Features

The baseline trains on all original and noise columns. This gives us a reference for whether feature selection preserves model quality while reducing the feature set.

[3]:

def make_svc_pipeline():
    return Pipeline(
        [
            ("scaler", StandardScaler()),
            (
                "svc",
                SVC(
                    kernel="rbf",
                    C=2.0,
                    gamma="scale",
                    random_state=RANDOM_STATE,
                ),
            ),
        ]
    )


def evaluate(estimator, X_eval, y_eval):
    predictions = estimator.predict(X_eval)
    return {
        "accuracy": accuracy_score(y_eval, predictions),
        "balanced_accuracy": balanced_accuracy_score(y_eval, predictions),
    }


baseline = make_svc_pipeline()
baseline.fit(X_train, y_train)
baseline_metrics = evaluate(baseline, X_test, y_test)
baseline_metrics

[3]:

{'accuracy': 0.8222222222222222, 'balanced_accuracy': 0.8222222222222223}

Configure GAFeatureSelectionCV

GAFeatureSelectionCV searches over binary masks. A value of 1 means the feature is selected; a value of 0 means it is excluded.
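As a minimal sketch of what a binary mask means in practice, the snippet below applies a hand-written mask to a small array. The feature names, the mask values, and `X_demo` are made up for illustration. The optimizer's internal representation may differ, but the selection semantics are the same as boolean column indexing.

```python
import numpy as np

# Hypothetical feature names and a hand-written candidate mask.
feature_names = np.array(
    ["sepal length", "sepal width", "petal length", "petal width", "noise_00", "noise_01"]
)
mask = np.array([0, 0, 1, 1, 0, 1], dtype=bool)  # 1 = keep, 0 = drop

# Two demo rows with six columns each.
X_demo = np.arange(12, dtype=float).reshape(2, 6)

# Boolean indexing on the column axis keeps only the selected features.
X_selected = X_demo[:, mask]
selected_names = feature_names[mask].tolist()

print(selected_names)
print(X_selected.shape)
```

A cap such as max_features=6 then amounts to requiring `mask.sum() <= 6` for a candidate to be valid.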

This configuration uses several optimizer controls:

PopulationConfig(initializer="smart") starts from diverse masks instead of purely random masks.
max_features=6 caps the number of selected features at six, which pushes the optimizer toward a compact subset.
OptimizationConfig(diversity_control=True) boosts exploration when the population collapses or stalls.
OptimizationConfig(fitness_sharing=True) reduces pressure for many similar masks to dominate too early.
OptimizationConfig(local_search=True) performs a small final neighborhood search around strong masks.
Adaptive schedules (ExponentialAdapter for crossover, InverseAdapter for mutation) gradually change the crossover and mutation probabilities during the run.
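To give a feel for how such schedules behave, here is a rough sketch of an exponential and an inverse decay from an initial value toward an end value. The two formulas and the function names `exponential_decay` and `inverse_decay` are assumptions chosen for illustration, not the library's code, and the rates reported in a verbose log can differ, for example when diversity control boosts mutation.

```python
import math


def exponential_decay(initial_value, end_value, adaptive_rate, generation):
    """Assumed form: the gap to end_value shrinks by exp(-rate * generation)."""
    return (initial_value - end_value) * math.exp(-adaptive_rate * generation) + end_value


def inverse_decay(initial_value, end_value, adaptive_rate, generation):
    """Assumed form: the gap to end_value shrinks by 1 / (1 + rate * generation)."""
    return (initial_value - end_value) / (1 + adaptive_rate * generation) + end_value


# Same start, end, and rate values as the notebook's configuration.
crossover_schedule = [exponential_decay(0.8, 0.4, 0.15, g) for g in range(15)]
mutation_schedule = [inverse_decay(0.30, 0.08, 0.25, g) for g in range(15)]

for generation in (0, 1, 5, 14):
    print(
        generation,
        round(crossover_schedule[generation], 3),
        round(mutation_schedule[generation], 3),
    )
```

Both curves start at the initial value and approach the end value without reaching it. Under these assumed forms, the inverse curve drops faster at first and flattens out more slowly.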

[4]:

selector = GAFeatureSelectionCV(
    estimator=make_svc_pipeline(),
    cv=cv,
    scoring="balanced_accuracy",
    max_features=6,
    evolution_config=EvolutionConfig(
        population_size=20,
        generations=15,
        crossover_probability=ExponentialAdapter(initial_value=0.8, end_value=0.4, adaptive_rate=0.15),
        mutation_probability=InverseAdapter(initial_value=0.30, end_value=0.08, adaptive_rate=0.25),
        tournament_size=3,
        elitism=True,
        keep_top_k=3,
    ),
    population_config=PopulationConfig(initializer="smart"),
    runtime_config=RuntimeConfig(n_jobs=-1, parallel_backend="auto", use_cache=True, verbose=True),
    optimization_config=OptimizationConfig(
        local_search=True,
        local_search_top_k=2,
        local_search_steps=1,
        local_search_radius=0.15,
        diversity_control=True,
        diversity_threshold=0.30,
        diversity_stagnation_generations=3,
        diversity_mutation_boost=1.8,
        random_immigrants_fraction=0.10,
        fitness_sharing=True,
        sharing_radius=0.40,
    ),
)

callbacks = [
    DeltaThreshold(threshold=0.001, generations=5, metric="fitness_best"),
    ConsecutiveStopping(generations=7, metric="fitness_best"),
    TimerStopping(total_seconds=90),
]

selector.fit(X_train, y_train, callbacks=callbacks)

 gen evals           avg          best     div  unique  stag     mut   sel             events
---- ----- ------------- ------------- ------- ------- ----- ------- ----- ------------------
   0    20       0.58725       0.96128   0.053   1.000     0       -     - -
   1    40       0.75526       0.96128   0.053   0.750     1   0.200     3 div,imm=4,dup=15,s
   2    40       0.84625       0.96128   0.053   0.750     2   0.256     3 div,imm=4,dup=10,s
   3    40       0.85387       0.96128   0.053   0.650     3   0.304     3 div,imm=4,dup=10,s
   4    40       0.86208       0.96128   0.053   0.650     4   0.345     3 div,imm=4,dup=6,sh
INFO: DeltaThreshold callback met its criteria
INFO: Stopping the algorithm
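
The early stop in the log above can be reasoned about with a small sketch. It assumes that a delta-threshold rule compares the spread (max minus min) of the tracked metric over the last N generations against the threshold. The helper `delta_threshold_met` is hypothetical, and the library's exact rule may differ. The values are the "best" column from the log.

```python
def delta_threshold_met(values, threshold, generations):
    """Assumed rule: stop once the last `generations` values differ by at most `threshold`."""
    if len(values) < generations:
        return False  # not enough history yet
    window = values[-generations:]
    return max(window) - min(window) <= threshold


# "best" column from the verbose log, generations 0 through 4.
best_per_generation = [0.96128, 0.96128, 0.96128, 0.96128, 0.96128]

stop_generation = None
for generation in range(len(best_per_generation)):
    seen_so_far = best_per_generation[: generation + 1]
    if delta_threshold_met(seen_so_far, threshold=0.001, generations=5):
        stop_generation = generation
        break

print(stop_generation)
```

Under this assumption the rule first fires at generation 4, once five identical best scores are available. This is consistent with the run stopping well before the configured 15 generations.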

[4]:

GAFeatureSelectionCV(crossover_probability=<sklearn_genetic.schedules.schedulers.ExponentialAdapter object at 0x00000226B432D7F0>,
                     cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
                     diversity_control=True, diversity_mutation_boost=1.8,
                     diversity_stagnation_generations=3,
                     diversity_threshold=0.3,
                     estimator=Pipeline(steps=[('scaler', StandardScaler()...
                                                            final_selection_top_k=3,
                                                            final_selection_cv=None),
                     population_config=PopulationConfig(initializer='smart',
                                                        warm_start_configs=[]),
                     population_size=20,
                     runtime_config=RuntimeConfig(n_jobs=-1,
                                                  pre_dispatch='2*n_jobs',
                                                  error_score=nan,
                                                  return_train_score=False,
                                                  use_cache=True,
                                                  parallel_backend='auto',
                                                  verbose=True),
                     scoring='balanced_accuracy', sharing_radius=0.4)

Inspect Selected Features

The fitted selector exposes support_, just like many sklearn feature selectors. Because our input is a pandas DataFrame, we can recover the selected column names directly.

[5]:

selected_features = X_train.columns[selector.support_]
selected_summary = pd.DataFrame(
    {
        "feature": X_train.columns,
        "selected": selector.support_,
        "kind": ["original" if column in X_original.columns else "noise" for column in X_train.columns],
    }
)

print(f"Selected {len(selected_features)} of {X_train.shape[1]} features")
selected_summary[selected_summary["selected"]]

Selected 1 of 16 features

[5]:

	feature	selected	kind
3	petal width (cm)	True	original

Read Fit Statistics and Telemetry

fit_stats_ summarizes search cost, and history stores per-generation optimizer telemetry. Both are useful when feature selection is slow or when the search converges too early.

[6]:

selector.fit_stats_

[6]:

{'evaluated_candidates': 182,
 'unique_candidates': 182,
 'cross_validate_calls': 182,
 'cache_hits': 0,
 'duplicate_candidates': 0,
 'skipped_invalid_candidates': 0,
 'population_parallel_batches': 6,
 'population_serial_batches': 0,
 'random_immigrants': 16,
 'local_refinement_candidates': 2}
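
A few ratios make these counters easier to read. The sketch below copies the dictionary shown above and derives a cache hit rate, a unique-candidate ratio, and a rough count of model fits. The meaning of each counter is inferred from its name, and the fit estimate assumes one fit per fold with the 3-fold cv configured earlier, excluding the final refit.

```python
# Values copied from the fit_stats_ output above.
fit_stats = {
    "evaluated_candidates": 182,
    "unique_candidates": 182,
    "cross_validate_calls": 182,
    "cache_hits": 0,
    "duplicate_candidates": 0,
    "skipped_invalid_candidates": 0,
    "population_parallel_batches": 6,
    "population_serial_batches": 0,
    "random_immigrants": 16,
    "local_refinement_candidates": 2,
}

N_SPLITS = 3  # matches StratifiedKFold(n_splits=3) in the setup cell

cache_hit_rate = fit_stats["cache_hits"] / fit_stats["evaluated_candidates"]
unique_ratio = fit_stats["unique_candidates"] / fit_stats["evaluated_candidates"]

# Assumption: each cross-validation call fits one model per fold.
estimated_model_fits = fit_stats["cross_validate_calls"] * N_SPLITS

print(f"cache hit rate: {cache_hit_rate:.2f}")
print(f"unique ratio: {unique_ratio:.2f}")
print(f"estimated model fits: {estimated_model_fits}")
```

A cache hit rate of zero together with a unique ratio of one suggests that every evaluated mask was new, so caching saved no work in this run.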

[7]:

history = pd.DataFrame(selector.history)
telemetry_columns = [
    "gen",
    "fitness",
    "fitness_max",
    "fitness_std",
    "unique_individual_ratio",
    "genotype_diversity",
    "stagnation_generations",
    "random_immigrants",
    "local_refinement_candidates",
]
history[[column for column in telemetry_columns if column in history.columns]].tail()

[7]:

	gen	fitness	fitness_max	fitness_std	unique_individual_ratio	genotype_diversity	stagnation_generations	random_immigrants
0	0	0.587247	0.961279	0.273261	1.00	0.052632	0	0
1	1	0.755261	0.944444	0.199663	0.75	0.052632	1	4
2	2	0.846254	0.934343	0.108298	0.75	0.052632	2	4
3	3	0.853872	0.933502	0.080829	0.65	0.052632	3	4
4	4	0.889310	0.922559	0.025854	0.70	0.052632	5	4

[8]:

ax = history.plot(x="gen", y=["fitness_best", "fitness_max", "fitness"], marker="o", figsize=(8, 4))
ax.set_title("Feature-selection fitness over generations")
ax.set_xlabel("Generation")
ax.set_ylabel("Balanced accuracy")

[8]:

Text(0, 0.5, 'Balanced accuracy')

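[Figure: line plot "Feature-selection fitness over generations", showing fitness_best, fitness_max, and fitness against Generation; y-axis: Balanced accuracy.]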

Compare Baseline and Selected-Feature Model

The selected-feature estimator supports the usual sklearn prediction API, so it can be evaluated just like the baseline pipeline.

[9]:

selector_metrics = evaluate(selector, X_test, y_test)
pd.DataFrame([baseline_metrics, selector_metrics], index=["all_features", "selected_features"])

[9]:

	accuracy	balanced_accuracy
all_features	0.822222	0.822222
selected_features	0.933333	0.933333

[10]:

print(classification_report(y_test, selector.predict(X_test), target_names=iris.target_names))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        15
  versicolor       0.88      0.93      0.90        15
   virginica       0.93      0.87      0.90        15

    accuracy                           0.93        45
   macro avg       0.93      0.93      0.93        45
weighted avg       0.93      0.93      0.93        45

Practical Notes

max_features is a useful way to make feature selection prefer compact solutions.
If many candidates are skipped as invalid, increase max_features or reduce mutation strength.
If diversity drops quickly, use diversity_control, random_immigrants_fraction, and fitness_sharing before simply increasing generations.
Always compare with an all-feature baseline. A smaller selected subset is only useful if quality remains acceptable.
