1.8. Feature Selection
1.8.1. Introduction
This example shows how to apply pSeven Core to feature selection based on iterative forward/backward search.
Consider the test function \(f:[0,1]^4 \to \mathbb{R}\),

\[f(x) = 2 + 0.25\,(x_2 - 5x_1^2)^2 + (1 - 5x_1)^2 + 2\,(2 - 5x_2)^2 + 7\sin(2.5 x_1)\sin(17.5 x_1 x_2) + 10 x_3^2.\]
The fourth feature \(x_4\) is intended to have no influence on the test function value.
The task is to detect, from a sample of data points, which input features provide the best approximation quality for the output.
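The iterative forward-search idea behind such feature selection can be sketched with plain NumPy before turning to GTSDA: starting from an empty set, repeatedly add the feature that most reduces the validation error of a simple surrogate model. This sketch uses an ordinary least-squares fit and a holdout split; the helper `forward_select` is illustrative only and is not part of pSeven Core, which uses its own approximation models.

```python
import numpy as np

def forward_select(x, y, n_holdout=30, seed=0):
    """Greedy forward feature search with a linear least-squares surrogate.

    Illustrative sketch only -- GTSDA uses its own models and validation.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    train, hold = idx[n_holdout:], idx[:n_holdout]

    def holdout_error(features):
        # Fit y ~ [1, x[:, features]] on the train part, score on the holdout.
        a_tr = np.column_stack([np.ones(len(train))] + [x[train, f] for f in features])
        a_ho = np.column_stack([np.ones(len(hold))] + [x[hold, f] for f in features])
        coef, *_ = np.linalg.lstsq(a_tr, y[train], rcond=None)
        return float(np.sqrt(np.mean((a_ho @ coef - y[hold]) ** 2)))

    selected, best_err = [], np.inf
    while len(selected) < x.shape[1]:
        candidates = [f for f in range(x.shape[1]) if f not in selected]
        errs = {f: holdout_error(selected + [f]) for f in candidates}
        f_best = min(errs, key=errs.get)
        if errs[f_best] >= best_err:
            break  # no remaining candidate improves the holdout error -- stop
        selected.append(f_best)
        best_err = errs[f_best]
    return selected, best_err
```

Backward search works analogously, starting from the full feature set and removing the least useful feature at each step.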
1.8.2. Feature Selection
Start by importing the Generic Tool for Sensitivity and Dependency Analysis (GTSDA) module:
from da.p7core import gtsda
Describe the problem:
import numpy as np

def mystery_function(x):
    # Depends only on the first three features; x[:, 3] is intentionally unused.
    term1 = x[:, 1] - 5.0 * x[:, 0] * x[:, 0]
    term2 = 1.0 - 5.0 * x[:, 0]
    term3 = 2.0 - 5.0 * x[:, 1]
    term4 = np.sin(2.5 * x[:, 0]) * np.sin(17.5 * x[:, 0] * x[:, 1])
    result = np.array(2.0 + 0.25 * term1 * term1 + term2 * term2
                      + 2 * term3 * term3 + 7 * term4
                      + 10 * x[:, 2] ** 2).reshape((x.shape[0], 1))
    return result
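As a quick sanity check that \(x_4\) really has no influence, the function can be evaluated twice with only the fourth column changed; the outputs coincide. The function is restated here so the snippet runs standalone.

```python
import numpy as np

def mystery_function(x):
    # Same definition as above: the fourth column x[:, 3] is never used.
    term1 = x[:, 1] - 5.0 * x[:, 0] * x[:, 0]
    term2 = 1.0 - 5.0 * x[:, 0]
    term3 = 2.0 - 5.0 * x[:, 1]
    term4 = np.sin(2.5 * x[:, 0]) * np.sin(17.5 * x[:, 0] * x[:, 1])
    return np.array(2.0 + 0.25 * term1 * term1 + term2 * term2
                    + 2 * term3 * term3 + 7 * term4
                    + 10 * x[:, 2] ** 2).reshape((x.shape[0], 1))

x_a = np.random.uniform(size=(5, 4))
x_b = x_a.copy()
x_b[:, 3] = np.random.uniform(size=5)  # change only the fourth feature
assert np.allclose(mystery_function(x_a), mystery_function(x_b))
```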
Generate a sample:
sample_size = 100
input_dim = 4
x = np.random.uniform(low=0.0, high=1.0, size=sample_size*input_dim).reshape((sample_size, input_dim))
y = mystery_function(x)
Create a gtsda.Analyzer instance:
analyzer = gtsda.Analyzer()
Set options and logger (see Options Interface, Loggers):
from da.p7core import loggers
analyzer.options.set("GTSDA/Seed", 100)
analyzer.set_logger(loggers.StreamLogger())
Get ranking of features based on their sensitivities:
ranking = analyzer.score2rank(analyzer.rank(x=x, y=y).scores)
Perform feature selection based on the Internal Validation error estimate:
options = {'GTSDA/Selector/ValidationType': 'Internal'}
result_internal = analyzer.select(x=x, y=y, ranking=ranking, options=options)
Perform feature selection based on error on the train sample:
options = {'GTSDA/Selector/ValidationType': 'TrainSample'}
result_train = analyzer.select(x=x, y=y, ranking=ranking, options=options)
Compare these two feature selection results:
print("\nOptimal features: %s" % str([0, 1, 2]))
print("Selected features with IV: %s" % result_internal.feature_list.tolist())
print("Selected features with train sample validation: %s" % result_train.feature_list.tolist())
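The two validation types can disagree for a generic reason: error measured on the training sample never increases as features are added, so train-sample validation is biased toward keeping spurious features, while an independent error estimate is not. A minimal NumPy illustration of this effect with an ordinary least-squares fit (not the GTSDA internal models; the setup below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=(100, 4))
# The output depends on the first three features only, plus noise.
y = x[:, 0] + 2.0 * x[:, 1] - x[:, 2] + 0.1 * rng.standard_normal(100)

train, hold = np.arange(70), np.arange(70, 100)

def rmse(a, coef, target):
    return float(np.sqrt(np.mean((a @ coef - target) ** 2)))

train_err, hold_err = [], []
for k in range(1, 5):  # nested feature sets {0}, {0,1}, {0,1,2}, {0,1,2,3}
    a_tr = np.column_stack([np.ones(len(train)), x[train, :k]])
    a_ho = np.column_stack([np.ones(len(hold)), x[hold, :k]])
    coef, *_ = np.linalg.lstsq(a_tr, y[train], rcond=None)
    train_err.append(rmse(a_tr, coef, y[train]))
    hold_err.append(rmse(a_ho, coef, y[hold]))

# Train error can only decrease (or stay flat) as features are added;
# the holdout error need not, and typically stalls once the spurious
# fourth feature enters the model.
print(train_err)
print(hold_err)
```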
1.8.3. Full Example Code
import numpy as np
from da.p7core import gtsda
from da.p7core import loggers
def mystery_function(x):
    # Depends only on the first three features; x[:, 3] is intentionally unused.
    term1 = x[:, 1] - 5.0 * x[:, 0] * x[:, 0]
    term2 = 1.0 - 5.0 * x[:, 0]
    term3 = 2.0 - 5.0 * x[:, 1]
    term4 = np.sin(2.5 * x[:, 0]) * np.sin(17.5 * x[:, 0] * x[:, 1])
    result = np.array(2.0 + 0.25 * term1 * term1 + term2 * term2
                      + 2 * term3 * term3 + 7 * term4
                      + 10 * x[:, 2] ** 2).reshape((x.shape[0], 1))
    return result
sample_size = 100
input_dim = 4
x = np.random.uniform(low=0.0, high=1.0, size=sample_size*input_dim).reshape((sample_size, input_dim))
y = mystery_function(x)
analyzer = gtsda.Analyzer()
analyzer.options.set("GTSDA/Seed", 100)
analyzer.set_logger(loggers.StreamLogger())
ranking = analyzer.score2rank(analyzer.rank(x=x, y=y).scores)
options = {'GTSDA/Selector/ValidationType': 'Internal'}
result_internal = analyzer.select(x=x, y=y, ranking=ranking, options=options)
options = {'GTSDA/Selector/ValidationType': 'TrainSample'}
result_train = analyzer.select(x=x, y=y, ranking=ranking, options=options)
print("\nOptimal features: %s" % str([0, 1, 2]))
print("Selected features with IV: %s" % result_internal.feature_list.tolist())
print("Selected features with train sample validation: %s" % result_train.feature_list.tolist())