1.8. Feature Selection

1.8.1. Introduction

This example shows how to apply pSeven Core to feature selection based on iterative forward/backward search.

Consider the test function \(f:[0,1]^4 \to \mathbb{R}\),

\[f(x_1, x_2, x_3, x_4) = 2 + 0.25(x_2 - 5 x_1^2)^2 + (1 - 5 x_1)^2 + 2 (2 - 5 x_2)^2 + 7 \sin(2.5 x_1) \sin(17.5 x_1 x_2) + 10 x_3^2.\]

The fourth feature \(x_4\) has no influence on the test function value; it is a redundant feature that the selection procedure should discard.

The task is to detect, from a sample of data points, which input features provide the best approximation quality for the output.

1.8.2. Feature Selection

Start by importing the Generic Tool for Sensitivity and Dependency Analysis (GTSDA) module:

from da.p7core import gtsda

Describe the problem:

import numpy as np

def mystery_function(x):
  # Terms of the test function f(x); note that the fourth column x[:, 3] is never used
  term1 = x[:, 1] - 5.0 * x[:, 0] * x[:, 0]  # x2 - 5 x1^2
  term2 = 1.0 - 5.0 * x[:, 0]                # 1 - 5 x1
  term3 = 2.0 - 5.0 * x[:, 1]                # 2 - 5 x2
  term4 = np.sin(2.5 * x[:, 0]) * np.sin(17.5 * x[:, 0] * x[:, 1])
  result = 2.0 + 0.25 * term1 * term1 + term2 * term2 + 2.0 * term3 * term3 + 7.0 * term4 + 10.0 * x[:, 2]**2
  return result.reshape((x.shape[0], 1))
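As a quick sanity check (not part of the original example), one can verify numerically that \(x_4\) does not affect the output: perturbing only the fourth column leaves the function values unchanged.

```python
import numpy as np

def mystery_function(x):
  # Same test function as above; the fourth column x[:, 3] is never used
  term1 = x[:, 1] - 5.0 * x[:, 0] * x[:, 0]
  term2 = 1.0 - 5.0 * x[:, 0]
  term3 = 2.0 - 5.0 * x[:, 1]
  term4 = np.sin(2.5 * x[:, 0]) * np.sin(17.5 * x[:, 0] * x[:, 1])
  result = 2.0 + 0.25 * term1 * term1 + term2 * term2 + 2.0 * term3 * term3 + 7.0 * term4 + 10.0 * x[:, 2]**2
  return result.reshape((x.shape[0], 1))

rng = np.random.RandomState(0)
x = rng.uniform(size=(10, 4))
x_perturbed = x.copy()
x_perturbed[:, 3] = rng.uniform(size=10)  # change only the fourth feature

# The outputs coincide because x4 does not enter the formula
print(np.allclose(mystery_function(x), mystery_function(x_perturbed)))
```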

Generate a sample:

sample_size = 100
input_dim = 4
x = np.random.uniform(low=0.0, high=1.0, size=sample_size*input_dim).reshape((sample_size, input_dim))
y = mystery_function(x)

Create a gtsda.Analyzer instance:

analyzer = gtsda.Analyzer()

Set options and logger (see Options Interface, Loggers):

from da.p7core import loggers

analyzer.options.set("GTSDA/Seed", 100)
analyzer.set_logger(loggers.StreamLogger())

Get ranking of features based on their sensitivities:

ranking = analyzer.score2rank(analyzer.rank(x=x, y=y).scores)
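The idea behind converting sensitivity scores into a ranking can be pictured in plain NumPy. This is only an illustration of the concept, not the GTSDA internals, and the scores array below is made up:

```python
import numpy as np

# Hypothetical sensitivity scores for the four input features
scores = np.array([0.45, 0.35, 0.15, 0.05])

# Order feature indices by decreasing score: most influential first
ranking = np.argsort(scores)[::-1]
print(ranking.tolist())
```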

Perform feature selection based on the Internal Validation error estimate:

options = {'GTSDA/Selector/ValidationType': 'Internal'}
result_internal = analyzer.select(x=x, y=y, ranking=ranking, options=options)

Perform feature selection based on the error on the training sample (training error typically decreases as features are added, so this criterion is prone to keeping redundant features):

options = {'GTSDA/Selector/ValidationType': 'TrainSample'}
result_train = analyzer.select(x=x, y=y, ranking=ranking, options=options)

Compare these two feature selection results:

print("\nOptimal features: %s" % str([0, 1, 2]))
print("Selected features with IV: %s" % result_internal.feature_list.tolist())
print("Selected features with train sample validation: %s" % result_train.feature_list.tolist())
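The iterative forward search that select performs can be illustrated, in simplified form, with a plain-NumPy sketch: a greedy loop that at each step adds the feature which most reduces the residual sum of squares of a linear least-squares fit. This is only an illustration of the search strategy under an assumed linear surrogate; GTSDA uses its own approximation models and validation criteria, and forward_select here is a hypothetical helper:

```python
import numpy as np

def forward_select(x, y, max_features=None):
  """Greedy forward search: repeatedly add the feature that most
  reduces the training RSS of a linear least-squares fit."""
  n, d = x.shape
  max_features = max_features or d
  selected, remaining = [], list(range(d))
  best_rss = np.sum((y - y.mean()) ** 2)  # RSS of the constant model
  while remaining and len(selected) < max_features:
    scores = []
    for j in remaining:
      cols = selected + [j]
      a = np.column_stack([x[:, cols], np.ones(n)])  # design matrix with intercept
      coef = np.linalg.lstsq(a, y.ravel(), rcond=None)[0]
      resid = y.ravel() - a @ coef
      scores.append((np.sum(resid ** 2), j))
    rss, j = min(scores)
    if rss >= best_rss:  # no improvement: stop
      break
    best_rss, selected = rss, selected + [j]
    remaining.remove(j)
  return selected

rng = np.random.RandomState(1)
x = rng.uniform(size=(100, 4))
y = 2.0 * x[:, 0] + 3.0 * x[:, 2]  # depends only on features 0 and 2
print(forward_select(x, y))
```

A practical implementation would also use a validation-based stopping rule (as the 'Internal' option above does) rather than training RSS alone, since training error keeps improving as features are added.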

1.8.3. Full Example Code

import numpy as np

from da.p7core import gtsda
from da.p7core import loggers

def mystery_function(x):
  # Terms of the test function f(x); note that the fourth column x[:, 3] is never used
  term1 = x[:, 1] - 5.0 * x[:, 0] * x[:, 0]  # x2 - 5 x1^2
  term2 = 1.0 - 5.0 * x[:, 0]                # 1 - 5 x1
  term3 = 2.0 - 5.0 * x[:, 1]                # 2 - 5 x2
  term4 = np.sin(2.5 * x[:, 0]) * np.sin(17.5 * x[:, 0] * x[:, 1])
  result = 2.0 + 0.25 * term1 * term1 + term2 * term2 + 2.0 * term3 * term3 + 7.0 * term4 + 10.0 * x[:, 2]**2
  return result.reshape((x.shape[0], 1))

sample_size = 100
input_dim = 4
x = np.random.uniform(low=0.0, high=1.0, size=sample_size*input_dim).reshape((sample_size, input_dim))
y = mystery_function(x)

analyzer = gtsda.Analyzer()

analyzer.options.set("GTSDA/Seed", 100)
analyzer.set_logger(loggers.StreamLogger())

ranking = analyzer.score2rank(analyzer.rank(x=x, y=y).scores)

options = {'GTSDA/Selector/ValidationType': 'Internal'}
result_internal = analyzer.select(x=x, y=y, ranking=ranking, options=options)

options = {'GTSDA/Selector/ValidationType': 'TrainSample'}
result_train = analyzer.select(x=x, y=y, ranking=ranking, options=options)

print("\nOptimal features: %s" % str([0, 1, 2]))
print("Selected features with IV: %s" % result_internal.feature_list.tolist())
print("Selected features with train sample validation: %s" % result_train.feature_list.tolist())