12.4. GTSDA

In this section we apply sensitivity analysis methods from GTSDA to artificial model functions and example data sets in order to demonstrate method properties.

12.4.1. Using Screening Indices

In this example we will consider the function:

\[f(x_1,x_2,x_3,x_4) = x_1 + 2x_2 + x_3^2 + x_4^3, \, x_i \in [ -1, \, 1], \; i=1, \ldots, 4.\]

The table below shows the different scores computed with screening indices using a budget of 100 points.
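
A minimal sketch of how this 100-point sample could be generated (assuming a uniform random design on \([-1, 1]^4\); the actual example script may use a different design):

import numpy as np

# generate a 100-point sample of f(x) = x1 + 2*x2 + x3^2 + x4^3 on [-1, 1]^4
np.random.seed(0)  # for reproducibility
x = np.random.uniform(-1.0, 1.0, (100, 4))
y = (x[:, 0] + 2.0 * x[:, 1] + x[:, 2]**2 + x[:, 3]**3).reshape(-1, 1)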

The code to do it:

from da.p7core import gtsda

# doing analysis...
analyzer = gtsda.Analyzer()
# 'screening' is the default technique, so rank_result = analyzer.rank(x=x, y=y) would give the same result
rank_result = analyzer.rank(x=x, y=y, options={'GTSDA/Ranker/Technique': 'screening'})

# and reading results...
mu_star = rank_result.info['Ranker']['Detailed info']['mu_star']
mu = rank_result.info['Ranker']['Detailed info']['mu']
sigma = rank_result.info['Ranker']['Detailed info']['sigma']

Screening indices

Score type | \(x_1\) | \(x_2\) | \(x_3\) | \(x_4\)
\(\mu^*\) | 1.97 | 3.96 | 1.08 | 1.17
\(\mu\) | 1.97 | 3.96 | 0.17 | 1.17
\(\sigma\) | 0.00 | 0.00 | 1.25 | 0.48

Now let’s do some analysis based on obtained values:

  • For \(x_1\) and \(x_2\) the value of \(\sigma\) is \(0\), meaning that the function depends on both of them linearly (with \(x_2\) being twice as important as \(x_1\)).
  • For \(x_3\) the value of \(\sigma\) is significantly non-zero, meaning that the function depends on \(x_3\) non-linearly; moreover, \(\mu^* \gg \mu\), meaning that the dependency is strongly non-monotonic (in our case it is even symmetric).
  • For \(x_4\) the value of \(\sigma\) is again significantly non-zero, meaning that the function depends on \(x_4\) non-linearly, but \(\mu^* = \mu\), meaning that the dependency is monotonic.

The value of \(\mu^*\) also indicates how much the function value changes on average when an input variable changes by \(1\) in absolute value (if GTSDA/Ranker/NormalizeInputs = False) or by 100% of its range (if GTSDA/Ranker/NormalizeInputs = True). The values in the table are consistent with the latter interpretation: for example, \(x_2\) enters the function linearly with coefficient \(2\) and has a range of length \(2\), so a change over the full range changes the function by \(4\), which agrees with \(\mu^* \approx 3.96\).

You can also get the full code and run the example as a script: example_gtsda_ranker_screening.py.

12.4.2. Using Sobol Indices

In this example we will consider the performance of Sobol indices for the function:

\[f(x_1,x_2,x_3,x_4) = x_1^2 + 2x_1 x_2 + x_3^2, \, x_i \in [ -1, \, 1], \; i=1, \ldots, 4.\]

which, on the one hand, is still simple enough to form expectations of what the true scores should be, but, on the other hand, already contains feature interactions.

So in this example one may expect \(x_1\) to be the most important feature, \(x_2\) to be in second place, and \(x_3\) to be the least important, while \(x_4\) should not affect the output at all.

We will also use this example to demonstrate the difference between main and total Sobol indices. Main indices take into account only the isolated contribution of a variable to the output variance, meaning that they ignore the influence of the \(x_1 \cdot x_2\) term. Total indices, on the other hand, account for all feature interactions. In manual dependency analysis, comparing these two indices allows some investigation of the nature of the dependency. We have estimated these scores using 2000 sample points.
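
As in the previous example, a minimal sketch of how the 2000-point sample could be generated (again assuming a uniform random design on \([-1, 1]^4\)):

import numpy as np

# generate a 2000-point sample of f(x) = x1^2 + 2*x1*x2 + x3^2 on [-1, 1]^4
np.random.seed(0)
x = np.random.uniform(-1.0, 1.0, (2000, 4))
y = (x[:, 0]**2 + 2.0 * x[:, 0] * x[:, 1] + x[:, 2]**2).reshape(-1, 1)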

The code to do it:

from da.p7core import gtsda

# doing analysis...
analyzer = gtsda.Analyzer()
rank_result = analyzer.rank(x=x, y=y, options={'GTSDA/Ranker/Technique': 'sobol'})

# and reading results...
total_indices = rank_result.info['Ranker']['Detailed info']['Total indices']
main_indices = rank_result.info['Ranker']['Detailed info']['Main indices']
interact_indices = rank_result.info['Ranker']['Detailed info']['Interaction indices']

Different scores for the Sobol indices are presented in the table (note that in this example the total index of each input is approximately the sum of its main and interaction indices):

Sobol indices

Score type | \(x_1\) | \(x_2\) | \(x_3\) | \(x_4\)
Total index | 0.51 | 0.36 | 0.14 | 0.00
Main index | 0.14 | 0.00 | 0.14 | 0.00
Interaction index | 0.36 | 0.36 | 0.00 | 0.00

Now let’s do some analysis based on obtained values:

  • First of all, one may see that the total index for \(x_4\) is almost \(0\), so it appears that this feature does not affect the output, i.e. \(f(x) \approx f(x_1, x_2, x_3)\).
  • Also note that the interaction index for \(x_3\) is small and its main index is very close to its total index; this tells us that \(x_3\) does not interact with other features and probably enters the dependency as a separate additive term, so \(f(x) \approx f_{1, 2}(x_1, x_2) + f_3(x_3)\).
  • On the contrary, for \(x_2\) the main index is small while the interaction index is large, which means that \(x_2\) enters the dependency only together with other features.
  • As for \(x_1\), it has a main index similar to that of \(x_3\) and an interaction index similar to that of \(x_2\) (which tells us that the only interaction is between these two variables).

So, as a final conclusion, one can make an educated guess that the considered dependency has the following form

\[f(x_1, x_2, x_3, x_4) = f_1(x_1) + f_2(x_1, x_2) + f_3(x_3),\]

where \(f_1\) and \(f_3\) have about the same variance and \(f_2\) describes most of the variance of the function.

You can also get the full code and run the example as a script: example_gtsda_ranker_sobol.py.

12.4.3. Using Taguchi Analysis

The data for the following example is not real, and details pertaining to microprocessor fabrication may not be completely accurate [1].

A microprocessor company is having difficulty with its current yields. Silicon processors are made on a large die, cut into pieces, and each one is tested to match specifications. The company has requested that you run experiments to increase processor yield. The factors that affect processor yields are temperature, pressure, doping amount, and deposition rate.

The operating conditions for each parameter and level are listed below:

  • Temperature: \(100, 150, 200\) Celsius;
  • Pressure: \(2, 5, 8\) psi;
  • Doping Amount: \(4\%, 6\%, 8\%\);
  • Deposition Rate: \(0.1, 0.2, 0.3\) mg/s.

So one needs to generate an orthogonal array with \(d=4\) design variables and factor cardinalities \(s_1=s_2=s_3=s_4=3\). A possible solution is the classical \(L9\) array with \(9\) points:

Taguchi Analysis. Design matrix `L9`

experiment number | Temperature | Pressure | Doping Amount | Deposition Rate
1 | 100 | 2 | 4 | 0.1
2 | 100 | 5 | 6 | 0.2
3 | 100 | 8 | 8 | 0.3
4 | 150 | 2 | 6 | 0.3
5 | 150 | 5 | 8 | 0.1
6 | 150 | 8 | 4 | 0.2
7 | 200 | 2 | 8 | 0.2
8 | 200 | 5 | 4 | 0.3
9 | 200 | 8 | 6 | 0.1

This setup allows testing all four variables without having to run a Full Factorial design with \(s_1 \cdot s_2 \cdot s_3 \cdot s_4 = 3^4 = 81\) separate trials.
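
The reduction is possible because \(L9\) is an orthogonal array: written in level indices \(1, 2, 3\) instead of the physical values, every pair of columns contains each ordered pair of levels exactly once, so the design is balanced with respect to any two factors:

\[L_9 = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 3 & 3 & 3 \\ 2 & 1 & 2 & 3 \\ 2 & 2 & 3 & 1 \\ 2 & 3 & 1 & 2 \\ 3 & 1 & 3 & 2 \\ 3 & 2 & 1 & 3 \\ 3 & 3 & 2 & 1 \end{pmatrix}\]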

Conducting three trials for each experiment (columns Trial 1, 2, 3 in the table below), the data shown below was collected. In general, for the blackbox-based case the number of trials may differ between design points (but it must be at least \(2\) for the "signal_to_noise" value of GTSDA/Ranker/Taguchi/Method). Therefore the sample should contain one response per point, with repeated points corresponding to repeated trials (see the code below).

import numpy as np

# define sample
x = np.array([[100, 2, 4, 0.1],
              [100, 5, 6, 0.2],
              [100, 8, 8, 0.3],
              [150, 2, 6, 0.3],
              [150, 5, 8, 0.1],
              [150, 8, 4, 0.2],
              [200, 2, 8, 0.2],
              [200, 5, 4, 0.3],
              [200, 8, 6, 0.1]])
y = np.array([[87.3, 82.3, 70.7],
              [74.8, 70.7, 63.2],
              [56.5, 54.0, 45.7],
              [79.8, 78.2, 62.3],
              [77.3, 76.5, 54.0],
              [89.0, 87.3, 83.2],
              [64.8, 62.3, 55.7],
              [99.0, 93.2, 87.3],
              [75.7, 74.0, 63.2]])
# each of the 9 design points is repeated 3 times (once per trial)...
x = np.tile(x, (3, 1))
# ...and the 9x3 response matrix is flattened column-wise into a 27x1 column,
# so that the k-th trial of point i is aligned with the k-th copy of point i
y = y.reshape(27, 1, order='F')

Taguchi Analysis. Experimental responses and signal-to-noise ratio

experiment number | Temperature | Pressure | Doping Amount | Deposition Rate | Trial 1 | Trial 2 | Trial 3 | SN
1 | 100 | 2 | 4 | 0.1 | 87.3 | 82.3 | 70.7 | 19.5
2 | 100 | 5 | 6 | 0.2 | 74.8 | 70.7 | 63.2 | 21.4
3 | 100 | 8 | 8 | 0.3 | 56.5 | 54.0 | 45.7 | 19.3
4 | 150 | 2 | 6 | 0.3 | 79.8 | 78.2 | 62.3 | 17.6
5 | 150 | 5 | 8 | 0.1 | 77.3 | 76.5 | 54.0 | 14.3
6 | 150 | 8 | 4 | 0.2 | 89.0 | 87.3 | 83.2 | 29.2
7 | 200 | 2 | 8 | 0.2 | 64.8 | 62.3 | 55.7 | 22.2
8 | 200 | 5 | 4 | 0.3 | 99.0 | 93.2 | 87.3 | 24.0
9 | 200 | 8 | 6 | 0.1 | 75.7 | 74.0 | 63.2 | 20.4

We will consider the signal-to-noise ratio analysis (see GTSDA/Ranker/Taguchi/Method): compute the SN ratio of each experiment for the target value case, create a response chart, and determine the parameters that have the highest and lowest effect on the processor yield. The computed SN values are given in the SN column of the table above.
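
For reference, for the target value case the SN ratio apparently follows the classical "nominal-the-best" form (this reading is consistent with the SN column above; see the GTSDA/Ranker/Taguchi/Method option description for the exact definition):

\[SN = 10 \log_{10} \frac{\bar{y}^2}{s^2},\]

where \(\bar{y}\) and \(s^2\) are the mean and sample variance of the trials of one experiment. For experiment 1, \(\bar{y} = (87.3 + 82.3 + 70.7)/3 = 80.1\) and \(s^2 \approx 72.5\), so \(SN \approx 10 \log_{10}(80.1^2 / 72.5) \approx 19.5\), matching the table.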

from da.p7core import gtsda

# set options
options = {"GTSDA/Ranker/Technique": "Taguchi",
           "GTSDA/Ranker/Taguchi/Method": "signal_to_noise"}
# rank
ranker = gtsda.Analyzer()
result = ranker.rank(x=x, y=y, options=options)

# print result
print(result.scores)

Shown below is the response table. This table was created by calculating the average SN value for each factor at each level. A sample calculation is shown for the pressure variable.

Taguchi Analysis. SN aggregation

experiment number | Temperature | Pressure | Doping Amount | Deposition Rate | SN
1 | 100 | 2 | 4 | 0.1 | 19.5
2 | 100 | 5 | 6 | 0.2 | 21.4
3 | 100 | 8 | 8 | 0.3 | 19.3
4 | 150 | 2 | 6 | 0.3 | 17.6
5 | 150 | 5 | 8 | 0.1 | 14.3
6 | 150 | 8 | 4 | 0.2 | 29.2
7 | 200 | 2 | 8 | 0.2 | 22.2
8 | 200 | 5 | 4 | 0.3 | 24.0
9 | 200 | 8 | 6 | 0.1 | 20.4

\[SN_{Pr=2}=\frac{19.5 + 17.6 + 22.2}{3} = 19.8;\]
\[SN_{Pr=5}=\frac{21.4+14.3+24.0}{3} = 19.9;\]
\[SN_{Pr=8}=\frac{19.3+29.2+20.4}{3} = 23.0.\]

The effect of each design variable is then calculated by determining the range of its level averages:

\[\Delta = Max - Min = 23.0 - 19.8 = 3.2.\]

Taguchi Analysis. Design variables effects

Level index | Temperature | Pressure | Doping Amount | Deposition Rate
1 | 20.1 | 19.8 | 24.2 | 18.1
2 | 20.4 | 19.9 | 19.8 | 24.3
3 | 22.2 | 23.0 | 18.6 | 20.3
\(\Delta\) | 2.2 | 3.2 | 5.6 | 6.3
Rank | 4 | 3 | 2 | 1
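
The same aggregation can be reproduced directly from the SN values; the sketch below uses the level indices of the \(L9\) array to compute the per-level averages and ranges. Small differences from the table may appear because the SN values used here are already rounded.

import numpy as np

# SN ratio of each of the 9 experiments (rounded values from the table)
sn = np.array([19.5, 21.4, 19.3, 17.6, 14.3, 29.2, 22.2, 24.0, 20.4])
# L9 design in level indices (1, 2, 3) for the four factors
levels = np.array([[1, 1, 1, 1], [1, 2, 2, 2], [1, 3, 3, 3],
                   [2, 1, 2, 3], [2, 2, 3, 1], [2, 3, 1, 2],
                   [3, 1, 3, 2], [3, 2, 1, 3], [3, 3, 2, 1]])
for j, name in enumerate(['Temperature', 'Pressure', 'Doping Amount', 'Deposition Rate']):
    means = [sn[levels[:, j] == level].mean() for level in (1, 2, 3)]
    delta = max(means) - min(means)
    print('%-16s levels: %s  delta: %.1f' % (name, ['%.1f' % m for m in means], delta))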

It can be seen that deposition rate has the largest effect on the processor yield and that temperature has the smallest effect on the processor yield.

You can also get the full code and run the example as a script: example_gtsda_ranker_taguchi.py.

12.4.4. Irrelevant Features in Artificial Model Problem

In this example we apply feature selection from GTSDA to an artificial model function with an irrelevant feature to demonstrate method properties.

Consider the test function \(f:[0,1]^3 \to \mathbb{R}\),

\[f(x_1, x_2, x_3) = 2 + 0.25(x_2 - 5 x_1^2)^2 + (1 - 5 x_1)^2 + 2 (2 - 5 x_2)^2 + 7 \sin(2.5 x_1) \sin(17.5 x_1 x_2) + x_3^2.\]

We add a fourth feature \(x_4\), which has no influence on the function value.

We generated a training set of \(100\) points and want to compare the performance of feature selection in two error computation modes (a sketch of how such a training set could be generated is shown after the list):

  • Internal Validation
  • Training sample
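
A minimal sketch of how such a training set could be generated (assuming a uniform random design on \([0, 1]^4\); the actual example script may use a different design):

import numpy as np

# 100 training points in [0, 1]^4; the fourth column is the irrelevant feature
np.random.seed(0)
x = np.random.uniform(0.0, 1.0, (100, 4))
x1, x2, x3 = x[:, 0], x[:, 1], x[:, 2]
y = (2 + 0.25 * (x2 - 5 * x1**2)**2 + (1 - 5 * x1)**2 + 2 * (2 - 5 * x2)**2
     + 7 * np.sin(2.5 * x1) * np.sin(17.5 * x1 * x2) + x3**2).reshape(-1, 1)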

For this sample size GTApprox by default constructs a Gaussian process regression model, which normally almost interpolates the training data. In this case the training error is very small even for sets of features which do not describe the dependency. As a result, feature selection validated on the training sample selects a subset of features which misses inputs relevant for the problem in question.

The following feature subsets were selected:

  • By IV based algorithm: [1, 2, 3]
  • By training sample based algorithm: [1, 2]

The table Feature subsets and corresponding errors summarizes the results given by both methods for the two chosen subsets.

Feature subsets and corresponding errors

Feature subset | IV error | Training sample error
[1, 2] | 0.6658 | 0.0007
[1, 2, 3] | 0.3538 | 0.0001

It can be seen that the training sample error is completely uninformative here, while the IV error estimate allows selecting the better subset.

You can also get the full code and run the example as a script: example_gtsda_selector_simple.py.

More GTSDA examples can also be found in Code Samples.

12.4.5. References

[1] S. Fraley, M. Oom, B. Terrien, J. Zalewski. Design of Experiments via Taguchi Methods: Orthogonal Arrays. controls.engin.umich.edu, 2007.