5.4. Feature Selection

The feature selection functionality of GT SDA searches for good features in user-provided data. It is intended to select the smallest possible subset of features which accurately describes the dependency. Technically, it iteratively processes different feature subsets in some order to determine which features are important for the model. The analysis is performed on the basis of a data sample, also known as training data (or just sample).

5.4.1. Introduction

Let \(y=f(x)\), \(x \in R^p\), \(y \in R^q\) be the dependency under consideration. \(f(x)\) may be some physical experiment or a solver code. Let \((X, Y) = \bigl\{(x_i, y_i = f(x_i)), i = 1, \dots, n\bigr\}\) be a training sample. We assume that the function \(f(x)\) in fact depends only on a subset of features, or can be well approximated by a function of this subset of features. Our goal is to detect this subset of features. This drives the idea of the selection procedure described below.

There are two possible approaches to feature selection in this scenario, depending on the quality measure used:

  • Error-based feature selection, which evaluates candidate subsets by the error of an approximation model constructed on them (see section 5.4.2.1).
  • Dependency-based feature selection, which evaluates candidate subsets using partial correlations between features and outputs (see section 5.4.2.2).

The quality measure can be selected via the GTSDA/Selector/QualityMeasure option. By default the error-based feature selection approach is selected.

Example of usage:

from da.p7core import gtsda

analyzer = gtsda.Analyzer()
selectResult = analyzer.select(x=x, y=y, ranking=None, options={}, approx_options={}, x_test=None, y_test=None)

The subset search strategy is selected via the GTSDA/Selector/Technique option. By default the Add-Del approach is selected.

Input parameters are:

  • training sample
  • feature ordering (optional)
  • GTSDA options (optional)
  • GTApprox options (optional)
  • test sample (optional)

Output parameters are:

  • selected subset of features
  • validation error obtained on the selected subset
  • approximation model constructed for selected subset of features

Ordering gives the method a priori knowledge of which features are more important. If the ordering is correct, the search time may be significantly reduced. The ordering can be obtained using the Sensitivity Analysis methods of GTSDA or by expert judgment. By default the natural feature ordering is used. The GTSDA options are described in detail in section Option Reference. Any combination of GTApprox options is supported.
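
For example, an expert-defined ordering could be passed via the ranking argument of select(). The sketch below assumes that ranking accepts a plain list of feature indices ordered from most to least important; consult the class reference for the exact format:

# Hypothetical ordering: feature 2 is believed to be the most important,
# followed by features 0 and 1 (the list format is an assumption).
ordering = [2, 0, 1]
selectResult = analyzer.select(x=x, y=y, ranking=ordering)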

5.4.2. Quality measures

5.4.2.1. Error-based feature selection

The error-based algorithm works as follows:

  1. Select a subset of features \(I \subseteq \{1, \dots, p\}\), which amounts to considering a training set consisting of pairs \((x_I, y)\) with \(x_I = (x^{i}, i \in I)\).
  2. Construct a surrogate model \(\hat{f}_{I}\) which mimics the behavior of the dependency \(f\) (using GTApprox).
  3. Estimate the quality of the model \(\hat{f}_{I}\) by computing some error metric \(Q_I\).

These steps are performed for a series of feature subsets \(I\), and the best subset is chosen as the one which gives the lowest possible value of \(Q_I\) together with the smallest possible size of the feature subset \(I\). The subset search strategy is selected via the GTSDA/Selector/Technique option; by default the Add-Del approach is selected.
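
As a purely illustrative sketch (not GTSDA code), the evaluation of a single candidate subset \(I\) can be written as follows, using an ordinary least-squares surrogate in place of a GTApprox model and cross-validation RMSE as the error metric \(Q_I\):

import numpy as np

def subset_quality(X, Y, I, n_folds=5):
    # Step 1: restrict the inputs to the candidate subset I
    X_I = X[:, list(I)]
    folds = np.array_split(np.random.permutation(len(X_I)), n_folds)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X_I)), test_idx)
        # Step 2: fit a simple least-squares surrogate (GTSDA uses GTApprox models instead)
        A_train = np.hstack([X_I[train_idx], np.ones((len(train_idx), 1))])
        coef, *_ = np.linalg.lstsq(A_train, Y[train_idx], rcond=None)
        # Step 3: compute the error metric Q_I on the held-out part
        A_test = np.hstack([X_I[test_idx], np.ones((len(test_idx), 1))])
        errors.append(np.sqrt(np.mean((A_test.dot(coef) - Y[test_idx]) ** 2)))
    return np.mean(errors)

A full search would compare such a quality value for every possible subset; the greedy strategies described below evaluate far fewer subsets.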

The iterative search over subsets can be configured via several options. In general, all the methods except Full Search try to improve the currently best subset by considering subsets which differ from the current one in a single feature. It may happen that none of these subsets improves the error, while among the subsets which differ from the current one in two or more features there are successful candidates for improvement. To avoid getting stuck in such local minima, the optimization procedure is allowed to make several iterations of the search without significant improvement in the approximation error:

  1. The option GTSDA/Selector/Criteria/MaxLookAheadSteps determines how many iterations without significant improvement the algorithm may make.
  2. The option GTSDA/Selector/Criteria/MinImprovement sets the minimum relative error change which is considered a significant improvement during the iterations of the feature selector.
  3. Finally, some error value can be considered sufficient, which means that all errors less than or equal to this value are treated the same. This value can be set by the option GTSDA/Selector/Criteria/TargetError.

If some subset of features is a priori known to be in the model, then this subset can be specified by the option GTSDA/Selector/RequiredFeatures.
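
For illustration, these criteria might be set via the options argument of select(). The option names below are those described above, while the particular values and the list format of RequiredFeatures are assumptions; see the Option Reference:

options = {
    "GTSDA/Selector/Criteria/MaxLookAheadSteps": 3,   # up to 3 iterations without significant improvement
    "GTSDA/Selector/Criteria/MinImprovement": 0.01,   # illustrative threshold for a significant relative improvement
    "GTSDA/Selector/Criteria/TargetError": 1.0e-4,    # errors at or below this value are treated as equal
    "GTSDA/Selector/RequiredFeatures": [0, 3],        # hypothetical features known a priori to be in the model
}
selectResult = analyzer.select(x=x, y=y, options=options)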

There are several possible ways to compute the error of approximation for the feature selection task. Here we concentrate on error metrics for the regression problem, all of which are based on the error of the approximation. There are three possible ways to compute the approximation error in regression problems:

  1. Error on the train sample
  2. Error on the test sample (test sample is required)
  3. Cross validation error

Every choice of error has its pros and cons. The error on the train sample is known to underestimate the error of approximation, since for some approximation algorithms the errors on the train sample can be substantially lower than those for points outside the train sample. The error on the test sample is the preferable way if such a test sample is available. It is assumed that the test sample is different from the train sample and contains enough points to represent the behaviour of the function. The cross-validation approach to error computation is preferable when there is no possibility to use a separate test sample. This approach splits the train set into two parts, constructs a model on one part and computes the approximation error on the other part. The split can be performed several times; in this case the resulting error is the mean of the errors over all splits.
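
For example, the error can be computed on a separate test sample by passing it to select() via the x_test and y_test arguments shown in the call above (x_test and y_test are hypothetical arrays held out from the training data):

# Error of approximation computed on a separate test sample.
# If no test sample is passed, the train-sample or cross-validation error
# is used instead (see the Option Reference for how this choice is made).
selectResult = analyzer.select(x=x, y=y, x_test=x_test, y_test=y_test)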

5.4.2.2. Dependency-based feature selection

Dependency-based feature selection is based on the Correlation Analysis functionality of GTSDA. The idea is to select a subset \(I\) of features with two properties:

  1. All features from the subset \(I\) have significant partial correlation (given all other features from the subset \(I\)) with outputs.
  2. Features outside the subset \(I\) don’t have significant partial correlation (given all the features from the subset \(I\)) with outputs.

These two properties are achieved via an iterative procedure: at every step some partial correlation (Partial Pearson Correlation or Partial Distance Correlation) is computed, and either a new feature is selected as the one with the highest significant correlation (if features are being added), or a feature is deleted as the one with the lowest non-significant correlation (if features are being deleted).
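
As a purely illustrative sketch (not GTSDA code), the Partial Pearson Correlation of a candidate feature with a scalar output, given an already selected subset, can be computed from regression residuals:

import numpy as np

def partial_pearson(X, y, candidate, selected):
    # Remove the linear effect of the already selected features from both
    # the candidate feature and the (scalar) output, then correlate the residuals.
    A = np.hstack([X[:, list(selected)], np.ones((len(X), 1))])
    resid_x = X[:, candidate] - A.dot(np.linalg.lstsq(A, X[:, candidate], rcond=None)[0])
    resid_y = y - A.dot(np.linalg.lstsq(A, y, rcond=None)[0])
    return np.corrcoef(resid_x, resid_y)[0, 1]

The selection loop then adds the candidate with the highest significant partial correlation, or deletes the feature with the lowest non-significant one; GTSDA additionally offers Partial Distance Correlation to capture general, nonlinear dependencies.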

There are two possible dependency types:

  1. Linear, which is based on Partial Pearson Correlation.
  2. General, which is based on Partial Distance Correlation.

The dependency type can be selected via the option GTSDA/Selector/Dependency/Type.

The significance level for the significance tests can be tuned via the option GTSDA/Selector/Dependency/SignificanceLevel.
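
A hypothetical call using the dependency-based measure might look as follows; the option names are those described in this section, while the value strings and the significance level are illustrative assumptions to be checked against the Option Reference:

options = {
    "GTSDA/Selector/QualityMeasure": "Dependency",        # assumed value string for the dependency-based measure
    "GTSDA/Selector/Dependency/Type": "General",          # Partial Distance Correlation (assumed value string)
    "GTSDA/Selector/Dependency/SignificanceLevel": 0.05,  # illustrative significance level
}
selectResult = analyzer.select(x=x, y=y, options=options)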

5.4.3. Methods

5.4.3.2. Add

This method starts from a subset containing a single feature (or from the initial subset of features set by the option GTSDA/Selector/RequiredFeatures) and iteratively adds features while the quality of the approximation increases.

Pros:

  • Works well if the target function depends only on a few of the features.

Cons:

  • If the actual function depends on a relatively big number of features and the approximation quality is bad for small feature subsets, then a good subset of features may never be reached.
  • If some feature was not chosen at some iteration, it will never be chosen later.

5.4.3.3. Del

This method starts from the subset of all features and iteratively deletes features while the quality of the approximation increases.

Pros:

  • Best suited for problems with a moderate total number of features.

Cons:

  • Can take a long time when the total number of features is big.
  • If some feature was deleted at some iteration, it will never be returned to the subset.

5.4.3.4. Add-Del

This method alternates Add and Del steps in order to find the best possible combination of features.

Pros:

  • Normally allows finding a much better set of features.

Cons:

  • Can be relatively time-consuming compared to single Add or Del runs.