.. _basic-usage: How to Use sklearn-genetic-opt ============================== Introduction ------------ sklearn-genetic-opt uses evolutionary algorithms to tune scikit-learn estimators and select informative features. It works with classification and regression estimators, including estimators inside a scikit-learn ``Pipeline``. The package follows the familiar scikit-learn search API, but the search space is defined differently from :class:`~sklearn.model_selection.GridSearchCV`. Instead of listing every candidate value, you define the allowed range or choices for each hyperparameter. The optimizer samples candidates from that space, evaluates them with cross-validation, and uses evolutionary operators to produce new candidates over several generations. Internally, sklearn-genetic-opt uses the `DEAP package `__. A population is a set of candidate solutions. Each candidate is evaluated, selected, crossed over, or mutated to create the next generation. The process continues until the configured number of generations is reached or a callback stops the search. This tutorial covers the two most common workflows: - Hyperparameter tuning with :class:`~sklearn_genetic.GASearchCV`. - Feature selection with :class:`~sklearn_genetic.GAFeatureSelectionCV`. Hyperparameter Tuning --------------------- For the first example, we will tune an :class:`~sklearn.neural_network.MLPClassifier` on the `digits dataset `__. The digits dataset is a multi-class classification problem. .. code:: python3 import matplotlib.pyplot as plt from sklearn.datasets import load_digits from sklearn.metrics import accuracy_score from sklearn.model_selection import StratifiedKFold, train_test_split from sklearn.neural_network import MLPClassifier from sklearn_genetic import EvolutionConfig, GASearchCV, PopulationConfig, RuntimeConfig from sklearn_genetic.space import Categorical, Continuous, Integer Load the data, split it into training and test sets, and visualize a few examples: .. code:: python3 data = load_digits() n_samples = len(data.images) X = data.images.reshape((n_samples, -1)) y = data["target"] X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.33, random_state=42 ) _, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3)) for ax, image, label in zip(axes, data.images, data.target): ax.set_axis_off() ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest") ax.set_title("Training: %i" % label) The samples should look like this: .. image:: ../images/basic_usage_digits_0.png Next, define the hyperparameter search space. The keys in ``param_grid`` must match valid estimator parameters. The values are search-space dimensions: - :class:`~sklearn_genetic.space.Integer` samples integer values from a range. - :class:`~sklearn_genetic.space.Continuous` samples floating-point values from a range. - :class:`~sklearn_genetic.space.Categorical` samples from a fixed list of choices. .. code:: python3 param_grid = { "tol": Continuous(1e-2, 1e10, distribution="log-uniform"), "alpha": Continuous(1e-5, 2e-5), "activation": Categorical(["logistic", "tanh"]), "batch_size": Integer(300, 350), } For example, ``batch_size`` can take any integer value from 300 to 350, while ``activation`` must be either ``"logistic"`` or ``"tanh"``. The ``distribution`` argument controls how random values are sampled from a dimension. A log-uniform distribution is useful when a parameter spans several orders of magnitude. Now create the estimator and the cross-validation strategy: .. code:: python3 clf = MLPClassifier(hidden_layer_sizes=(50, 30)) cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42) evolved_estimator = GASearchCV( estimator=clf, cv=cv, scoring="accuracy", param_grid=param_grid, evolution_config=EvolutionConfig(population_size=10, generations=20), population_config=PopulationConfig(initializer="smart"), runtime_config=RuntimeConfig(n_jobs=-1, verbose=True), ) Most arguments have the same meaning as in scikit-learn search estimators: ``cv`` controls the validation strategy, ``scoring`` controls the metric, and ``RuntimeConfig.n_jobs`` controls parallel execution. During the genetic search, unique candidates in the same generation are evaluated in parallel when possible; each candidate runs its cross-validation sequentially to avoid nested parallelism. Set ``RuntimeConfig(parallel_backend="cv")`` to keep candidate evaluation serial and pass ``n_jobs`` to each candidate's cross-validation instead. The genetic-search-specific values ``EvolutionConfig.population_size`` and ``EvolutionConfig.generations`` determine how many candidate solutions are explored. By default, ``PopulationConfig(initializer="smart")`` builds a more diverse initial population using estimator defaults, warm starts when provided, Latin hypercube samples for numeric hyperparameters, and stratified categorical values. Set ``PopulationConfig(initializer="random")`` to use the previous random initialization behavior. After fitting, ``fit_stats_`` reports evaluation counters such as cache hits, duplicate candidates, cross-validation calls, and skipped invalid feature masks. Run the optimization: .. code:: python3 evolved_estimator.fit(X_train, y_train) During training, you should see a generation-by-generation log: .. image:: ../images/basic_usage_train_log_1.jpeg Each row summarizes one generation: * **gen:** generation number. * **nevals:** number of evaluated individuals in the generation. * **fitness:** average cross-validation score for the generation. * **fitness_std:** standard deviation of the cross-validation scores. * **fitness_best:** best score found so far during the full search. * **fitness_max:** best individual score in the generation. * **fitness_min:** worst individual score in the generation. A compact summary of diversity and optimizer state appears at the right of each row: * **div:** ``genotype_diversity`` — the average fraction of distinct values per gene position across the population. A value near 1.0 means the population is diverse; a value near 0.0 means it has converged to nearly identical configurations. * **unique:** ``unique_individual_ratio`` — the fraction of the population that are distinct individuals. Values below ``diversity_threshold`` (default 0.25) trigger diversity control. * **stag:** ``stagnation_generations`` — how many consecutive generations have passed without ``fitness_best`` improving. Useful for deciding when to add an early-stopping callback. * **events:** a compact summary of optimizer interventions in the generation — ``div`` (diversity control triggered), ``imm=N`` (N random immigrants injected), ``dup=N`` (N duplicates replaced), ``share`` (fitness sharing applied). After fitting, inspect the full history as a DataFrame: .. code:: python3 import pandas as pd history = pd.DataFrame(evolved_estimator.history) print(history[[ "gen", "fitness_best", "genotype_diversity", "unique_individual_ratio", "stagnation_generations", ]]) And check evaluation cost via ``fit_stats_``: .. code:: python3 print(evolved_estimator.fit_stats_) # evaluated_candidates: total individuals presented to the evaluator # unique_candidates: distinct configurations actually cross-validated # cache_hits: evaluations reused from the fitness cache # random_immigrants: individuals injected when diversity control triggered # skipped_invalid_candidates: configs that raised exceptions during fit After fitting, ``GASearchCV`` behaves like a fitted scikit-learn estimator. It uses the best hyperparameters found during the search: .. code:: python3 print(evolved_estimator.best_params_) y_predict_ga = evolved_estimator.predict(X_test) print(accuracy_score(y_test, y_predict_ga)) In this run, the test accuracy was approximately 0.96. .. code:: python3 y_predict_ga = evolved_estimator.predict(X_test) accuracy_score(y_test, y_predict_ga) .. image:: ../images/basic_usage_accuracy_2.jpeg .. code:: python3 evolved_estimator.best_params_ .. image:: ../images/basic_usage_params_0.jpeg You can also inspect the optimization process. The :func:`~sklearn_genetic.plots.plot_fitness_evolution` helper shows how the best score found so far changed over generations: .. code:: python3 from sklearn_genetic.plots import plot_fitness_evolution plot_fitness_evolution(evolved_estimator) plt.show() .. image:: ../images/basic_usage_fitness_plot_3.png The ``evolved_estimator.logbook`` attribute stores the results generated during the search. You can use :func:`~sklearn_genetic.plots.plot_search_space` to see which hyperparameter values were sampled: .. code:: python3 from sklearn_genetic.plots import plot_search_space plot_search_space(evolved_estimator, features=["tol", "batch_size", "alpha"]) plt.show() .. image:: ../images/basic_usage_plot_space_4.png In this plot, each axis represents a sampled hyperparameter value. For example, the ``tol`` range is intentionally broad in this tutorial, and the plot can help you decide whether to narrow that range in a second search. Feature Selection ----------------- For the second example, we will use the Iris dataset and add random noise features. The goal is to recover a useful subset of features while ignoring the noise. .. code:: python3 import matplotlib.pyplot as plt import numpy as np from sklearn.datasets import load_iris from sklearn.metrics import accuracy_score from sklearn.model_selection import train_test_split from sklearn.svm import SVC from sklearn_genetic import ( EvolutionConfig, GAFeatureSelectionCV, PopulationConfig, RuntimeConfig, ) from sklearn_genetic.plots import plot_fitness_evolution data = load_iris() X, y = data["data"], data["target"] noise = np.random.uniform(0, 10, size=(X.shape[0], 10)) X = np.hstack((X, noise)) X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.33, random_state=0 ) The resulting dataset contains the original Iris features plus 10 noisy features. ``GAFeatureSelectionCV`` is similar to ``GASearchCV``, but it does not optimize hyperparameters. Instead, it evaluates subsets of columns and tries to maximize the cross-validation score while selecting a compact feature set. The estimator should already be configured with the hyperparameters you want to use. .. code:: python3 clf = SVC(gamma="auto") evolved_estimator = GAFeatureSelectionCV( estimator=clf, cv=3, scoring="accuracy", evolution_config=EvolutionConfig( population_size=30, generations=20, keep_top_k=2, elitism=True, ), population_config=PopulationConfig(initializer="smart"), runtime_config=RuntimeConfig(n_jobs=-1, verbose=True), ) Run the feature-selection search: .. code:: python3 evolved_estimator.fit(X_train, y_train) During training, the same log format is displayed: .. image:: ../images/basic_usage_train_log_5.PNG After fitting, ``GAFeatureSelectionCV`` also behaves like a scikit-learn estimator. Prediction methods such as ``predict`` and ``predict_proba`` use only the selected columns. .. code:: python3 features = evolved_estimator.support_ y_predict_ga = evolved_estimator.predict(X_test) accuracy = accuracy_score(y_test, y_predict_ga) .. image:: ../images/basic_usage_accuracy_6.PNG In this run, the test accuracy was approximately 0.98. The ``support_`` attribute is a boolean mask. Each position corresponds to a column in the input data: ``True`` means the feature was selected, and ``False`` means it was discarded. In this example, the optimizer selected the informative Iris features and ignored the random noise features. You can plot the fitness evolution for the feature-selection search too: .. code:: python3 plot_fitness_evolution(evolved_estimator) plt.show() .. image:: ../images/basic_usage_fitness_plot_7.PNG This concludes the basic sklearn-genetic-opt workflow. The next tutorials cover callbacks, custom callbacks, schedulers, reproducibility, MLflow integration, outlier detection, and cross-validation behavior in more detail.