GT SDA allows studying the dependencies present in the data. Ultimately, it answers the following questions:

- Which input parameters have no influence on the output and can thus be dropped from further study?
- Which features should be dropped first to reduce the number of parameters considered in the problem?
- Which features should be taken to construct the best surrogate model with GT Approx?

The Checker method quickly checks whether a dependency exists between the given inputs and outputs and measures the strength of the statistical dependency.

- Popular correlation coefficients are implemented: Pearson, Spearman, and partial Pearson correlation.
- The method checks the obtained correlations for statistical significance and tells the user whether each correlation is meaningful.
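To make the idea concrete, here is a minimal standard-library sketch of the three coefficients together with a significance check. It is not the GT SDA API: all function names are hypothetical, and a permutation test is used as a stand-in because the text does not say which significance test Checker applies.

```python
import math
import random


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def partial_pearson(x, y, z):
    """Pearson correlation of x and y with the linear effect of z removed."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def permutation_p_value(x, y, stat=pearson, n_perm=1000, seed=0):
    """Share of random shufflings of y whose |stat| reaches the observed one."""
    rng = random.Random(seed)
    observed = abs(stat(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(stat(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (assumed for illustration): y depends on x, z is unrelated.
rng = random.Random(1)
x = [rng.random() for _ in range(60)]
y = [2.0 * a + rng.gauss(0.0, 0.1) for a in x]
z = [rng.random() for _ in range(60)]

p_dependent = permutation_p_value(x, y)
print("Pearson:", pearson(x, y), "Spearman:", spearman(x, y))
print("partial Pearson given z:", partial_pearson(x, y, z))
print("p-value for x vs y:", p_dependent)
```

A small p-value means that a correlation this strong would rarely arise by chance, which is the sense in which a correlation is reported as meaningful.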

The Ranker method ranks the available parameters with respect to their influence on the given response function, assigning each variable a score that reflects its importance.

- The result allows the user to **determine the most important parameters** and disregard the unimportant ones.
- Several different **state-of-the-art ranking methods** are implemented:
  - Sobol indices
  - Morris screening
- The method also tells whether the score estimates can be trusted.
- Ranker may work with a user-provided dataset, or it can **itself generate** a suitable dataset specifically tailored for score estimation if the user provides a means to compute the response function.
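The sketch below illustrates how Sobol indices can be estimated when the response function is available, so that the sample is generated specifically for the estimation. It uses the textbook pick-freeze scheme (a Saltelli-type estimator for first-order indices and the Jansen estimator for total indices) on a toy function. The function names and the toy response are assumptions for illustration, and this is not a description of how GT SDA Ranker is implemented.

```python
import random


def sobol_indices(f, dim, n=5000, seed=0):
    """Estimate first-order and total Sobol indices of f on the unit cube.

    f takes a list of `dim` inputs in [0, 1] and returns a float.
    """
    rng = random.Random(seed)
    # Two independent base samples, A and B.
    A = [[rng.random() for _ in range(dim)] for _ in range(n)]
    B = [[rng.random() for _ in range(dim)] for _ in range(n)]
    fA = [f(row) for row in A]
    fB = [f(row) for row in B]

    # Center the outputs to reduce the variance of the estimators.
    mean = (sum(fA) + sum(fB)) / (2 * n)
    fA = [v - mean for v in fA]
    fB = [v - mean for v in fB]
    variance = (sum(v * v for v in fA) + sum(v * v for v in fB)) / (2 * n)

    first_order, total = [], []
    for i in range(dim):
        # "Pick-freeze": rows of A with only column i replaced from B.
        fAB = [f(a[:i] + [b[i]] + a[i + 1:]) - mean for a, b in zip(A, B)]
        v_i = sum(yb * (yab - ya) for ya, yb, yab in zip(fA, fB, fAB)) / n
        vt_i = sum((ya - yab) ** 2 for ya, yab in zip(fA, fAB)) / (2 * n)
        first_order.append(v_i / variance)
        total.append(vt_i / variance)
    return first_order, total


def response(x):
    """Toy response (assumed): x[2] has no influence at all."""
    return 4.0 * x[0] + 2.0 * x[1] + 0.0 * x[2]


first_order, total = sobol_indices(response, dim=3)
for i, (s, st) in enumerate(zip(first_order, total)):
    print(f"x{i}: first-order = {s:.2f}, total = {st:.2f}")
```

For this additive toy function the exact indices are 0.8, 0.2, and 0, and the first-order and total indices coincide, so the estimates should land close to those values. The cost is n * (dim + 2) evaluations of the response function.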

Selector searches for the subset of the given input features that provides the best approximation quality with GT Approx.

- The result allows the user to **understand what set of** input features would provide the best surrogate model.
- Different search strategies for subset selection may be used.
- Expert knowledge of relative feature importance (or the results of GT SDA Ranker) can be incorporated into the selector to increase search quality.
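One possible search strategy is greedy forward selection, sketched below under explicit assumptions: a leave-one-out 1-nearest-neighbor predictor stands in for the surrogate model (it is not GT Approx), the data is a toy sample, and all names are hypothetical. The strategies actually offered by Selector are not specified here.

```python
import random


def loo_error(X, y, features):
    """Leave-one-out mean squared error of a 1-nearest-neighbor predictor
    that only uses the given feature columns."""
    n = len(X)
    total = 0.0
    for i in range(n):
        best_dist, best_j = float("inf"), None
        for j in range(n):
            if j == i:
                continue
            dist = sum((X[i][k] - X[j][k]) ** 2 for k in features)
            if dist < best_dist:
                best_dist, best_j = dist, j
        total += (y[i] - y[best_j]) ** 2
    return total / n


def forward_selection(X, y):
    """Add one feature at a time while the leave-one-out error improves."""
    remaining = list(range(len(X[0])))
    selected, best_err = [], float("inf")
    while remaining:
        err, feat = min((loo_error(X, y, selected + [k]), k) for k in remaining)
        if err >= best_err:
            break  # no candidate improves the model: stop searching
        selected.append(feat)
        remaining.remove(feat)
        best_err = err
    return selected, best_err


# Toy sample (assumed): only features 0 and 2 influence the output.
rng = random.Random(0)
X = [[rng.random() for _ in range(4)] for _ in range(150)]
y = [row[0] + 2.0 * row[2] for row in X]

selected, err = forward_selection(X, y)
print("selected features:", selected, "leave-one-out MSE:", err)
```

Prior knowledge of feature importance could be used in such a scheme, for example by trying candidates in order of their Ranker scores, which is one way to read the last point above.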

- In Surrogate Model (SM) construction, it may be beneficial to remove the least important features: fewer features mean a denser sample, and a denser sample may provide a more accurate approximation. Also, many SM construction techniques work better in lower dimensions in terms of time and memory requirements.
- In Design of Experiments, knowing which features influence the dependency the most, one can plan sample generation so that the most important features have the highest variability. Also, if the data comes from physical measurements, feature scores may indicate which input variables should be measured with the highest accuracy.
- In Optimization, when the number of allowed function calls (the budget) is limited, knowing which features are less important allows them to be held fixed during optimization. By dropping variables that have little effect on the dependency, one can run more optimization iterations with the same budget, possibly reaching a better solution.

Below is an example of applying GT SDA to an optimization problem. Consider the problem of designing the geometry of the rotating disk shown in the picture.

The goal is to create a disk with minimal mass that satisfies the given mechanical and stability constraints. The disk's geometry is parameterized and may be represented as a vector of 9 numbers (3 thicknesses and 6 radii).

The optimization goal is to:

- Minimize disk mass
- Satisfy the following constraints:
  - maximal tension
  - radial displacement on the outer disk radius

Before configuring the problem in the optimization tool (GT Opt) and starting to solve it, an analysis is performed using GT SDA.

- First, a small sample of 50 random geometries is generated. The sample size required for the estimation depends on the problem, but GT SDA checks whether the resulting estimate is reliable: the tool sends a **warning** to the user if it considers the results unreliable and throws an **exception** if the results are considered meaningless.
- Sobol indices are computed with GT SDA Ranker on the obtained sample.
- Sobol indices are computed for each feature. They indicate what portion of the output variance is explained by the considered input (values range from 0 to 1).
- Total **Sobol** indices are computed with **GT SDA Ranker**

One can see that, for all 3 outputs, most of the output variance is explained **by just 3 features out of the original 6.**

**Optimization Results Comparison** (original problem vs. the reduced one, where only features **r1, t1, t3 are considered**)

Problem | Resulting optimum, kg | Calls to function |
---|---|---|
Original (6 inputs) | 13.91 | 4624 |
Reduced (3 inputs) | 14.30 | 663 + 50 (to run SDA) |

The result is a **~6.5 times reduction in function calls during optimization at the cost of an optimum that is under 3% heavier** *(if the quality of the optimum found is not satisfactory, one may continue optimizing the full model starting from the optimum of the reduced one).*
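As a quick check of the arithmetic behind this comparison, the snippet below recomputes both ratios from the table values. The variable names are chosen here for illustration.

```python
# Values taken from the comparison table above.
original_calls = 4624
reduced_calls = 663 + 50  # optimization calls plus the 50-point SDA sample
original_mass_kg = 13.91
reduced_mass_kg = 14.30

call_reduction = original_calls / reduced_calls
mass_penalty = (reduced_mass_kg - original_mass_kg) / original_mass_kg

print(f"function calls reduced {call_reduction:.2f} times")  # 6.49
print(f"optimum is {mass_penalty:.1%} heavier")              # 2.8%
```

Note that the 50 evaluations spent on the SDA sample are counted against the reduced problem, so the comparison includes the cost of the analysis itself.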