Outlier Detection Support

Overview

sklearn-genetic now includes native support for tuning outlier detection models such as IsolationForest, OneClassSVM, and LocalOutlierFactor using GASearchCV and GAFeatureSelectionCV. These models are recognized automatically, and a default scoring function is applied when scoring=None is passed.

This feature simplifies hyperparameter optimization for unsupervised anomaly detection problems, where ground-truth labels (y) are not available.

Default Scoring Logic

When scoring=None and an estimator is recognized as an outlier detector, a default scorer is used. This scorer attempts the following, in order:

  1. If the estimator has score_samples, the mean of the scores is used.

  2. If score_samples is unavailable but decision_function exists, its mean value is used.

  3. As a fallback, fit_predict is called on the evaluation data (which refits the estimator on that data), and the fraction of samples predicted as inliers (predictions == 1) is returned.

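Because each step needs only one of the three methods, the default scorer works with any detector that exposes at least one of them. In all three cases, a higher value means the evaluated samples look more normal to the model.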

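import numpy as np

def default_outlier_scorer(estimator, X, y=None):
    # y is accepted for scorer-API compatibility but is never used.
    if hasattr(estimator, 'score_samples'):
        # Higher score_samples values indicate more normal samples.
        return np.mean(estimator.score_samples(X))
    elif hasattr(estimator, 'decision_function'):
        return np.mean(estimator.decision_function(X))
    else:
        # Fallback: refit on X and return the fraction predicted as inliers (+1).
        predictions = estimator.fit_predict(X)
        return np.mean(predictions == 1)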
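
Which branch runs depends on the methods the estimator exposes. The sketch below is illustrative and not part of sklearn-genetic: scorer_branch is a hypothetical helper that mirrors the hasattr checks above. It assumes a scikit-learn version in which LocalOutlierFactor hides score_samples and decision_function unless novelty=True.

```python
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM


def scorer_branch(estimator):
    """Hypothetical helper: name the method the default scorer would call."""
    for method in ("score_samples", "decision_function", "fit_predict"):
        if hasattr(estimator, method):
            return method
    return None  # none of the three methods is available


detectors = (
    IsolationForest(),
    OneClassSVM(),
    LocalOutlierFactor(),              # novelty=False by default
    LocalOutlierFactor(novelty=True),
)
for est in detectors:
    print(type(est).__name__, "->", scorer_branch(est))
```

On recent scikit-learn versions, LocalOutlierFactor with the default novelty=False should fall through to the fit_predict branch, while the other detectors resolve to score_samples.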

Examples

Using GASearchCV with IsolationForest:

from sklearn.ensemble import IsolationForest
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Integer, Continuous
from sklearn.datasets import make_blobs
import numpy as np

# Create synthetic data with outliers
X_normal, _ = make_blobs(n_samples=200, centers=1, n_features=4, random_state=42)
rng = np.random.RandomState(42)  # seeded so the outliers are reproducible
X_outliers = rng.uniform(low=-6, high=6, size=(20, 4))
X = np.vstack([X_normal, X_outliers])

estimator = IsolationForest(random_state=42)

param_grid = {
    'contamination': Continuous(0.05, 0.3),
    'n_estimators': Integer(50, 150)
}

search = GASearchCV(estimator=estimator,
                    param_grid=param_grid,
                    scoring=None,  # triggers default_outlier_scorer
                    cv=3,
                    generations=4,
                    population_size=6,
                    n_jobs=-1)

search.fit(X)
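
As a rough illustration of what the search maximizes, the sketch below scores a single hand-picked candidate with scikit-learn's cross_val_score. It assumes that GASearchCV applies the scorer to each held-out fold and averages the results, which this page does not spell out; the candidate's parameter values are arbitrary, and the scorer is reduced to its score_samples branch.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import cross_val_score


def default_outlier_scorer(estimator, X, y=None):
    # Only the score_samples branch is needed: IsolationForest always has it.
    return np.mean(estimator.score_samples(X))


# Same data recipe as the example above, with a seeded generator.
X_normal, _ = make_blobs(n_samples=200, centers=1, n_features=4, random_state=42)
X_outliers = np.random.RandomState(42).uniform(low=-6, high=6, size=(20, 4))
X = np.vstack([X_normal, X_outliers])

# One hypothetical candidate drawn from the search space.
candidate = IsolationForest(contamination=0.1, n_estimators=100, random_state=42)

# With y=None and cv=3, scikit-learn uses an unshuffled 3-fold split.
fold_scores = cross_val_score(
    candidate, X, y=None, scoring=default_outlier_scorer, cv=3
)
print(fold_scores, fold_scores.mean())
```

Because the outliers are stacked at the end of X, an unshuffled split places all of them in the last fold; shuffling the rows beforehand may give more comparable folds.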

Using GAFeatureSelectionCV with outlier detection:

from sklearn_genetic import GAFeatureSelectionCV
from sklearn.ensemble import IsolationForest

selector = GAFeatureSelectionCV(
    estimator=IsolationForest(random_state=42),
    scoring=None,  # default_outlier_scorer used
    cv=3,
    generations=4,
    population_size=6,
    n_jobs=-1
)

selector.fit(X)  # X is the synthetic dataset built in the previous example

Custom Scoring

To override the default logic, pass your own scoring callable with the signature (estimator, X, y=None):

def custom_score(estimator, X, y=None):
    # Rewards a wide spread of anomaly scores across the evaluated samples.
    return np.std(estimator.score_samples(X))

search = GASearchCV(
    estimator=IsolationForest(random_state=42),
    param_grid=param_grid,
    scoring=custom_score,
    cv=3,
    generations=4,
    population_size=6,
    n_jobs=1
)

search.fit(X)
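
Before handing a custom scorer to the search, it can help to call it once on a fitted estimator and confirm that it returns a single finite number. The following is a minimal sketch on throwaway data; X_demo and fitted are illustrative names, and the "larger is better" reading follows the usual scikit-learn scorer convention.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest


def custom_score(estimator, X, y=None):
    # Spread of the anomaly scores over the evaluated samples.
    return np.std(estimator.score_samples(X))


# Small throwaway dataset used only for the sanity check.
X_demo, _ = make_blobs(n_samples=100, centers=1, n_features=4, random_state=0)
fitted = IsolationForest(random_state=0).fit(X_demo)

value = custom_score(fitted, X_demo)

# A scorer should return one finite number; larger values are treated as better.
assert np.isscalar(value) and np.isfinite(value)
print(value)
```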

Limitations

  • Only estimators that expose at least one of score_samples, decision_function, or fit_predict can be scored by default.

  • Models not recognized as outlier detectors do not receive the default outlier scorer; they must be scored explicitly through the scoring argument, otherwise a ValueError is raised.
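
How the library recognizes an outlier detector is not specified on this page. One plausible pre-check, shown here as an assumption and not as the actual implementation, combines scikit-learn's is_outlier_detector with the method requirement from the first limitation. can_use_default_scorer is a hypothetical name.

```python
from sklearn.base import is_outlier_detector
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

METHODS = ("score_samples", "decision_function", "fit_predict")


def can_use_default_scorer(estimator):
    """Hypothetical pre-check, not part of sklearn-genetic."""
    has_method = any(hasattr(estimator, m) for m in METHODS)
    return is_outlier_detector(estimator) and has_method


for est in (IsolationForest(), LocalOutlierFactor(), KMeans()):
    print(type(est).__name__, "->", can_use_default_scorer(est))
```

KMeans exposes fit_predict but is a clusterer, so a check of this kind would send it down the explicit-scoring path.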