February 14, 2018

# Intel® Distribution for Python Speeds Up Estimation of the Sobol Indices

## Introduction

The study was performed to demonstrate the benefits of using Intel® Distribution for Python to increase the overall performance of mathematical algorithms implemented in pSeven Core, a Python library for Design Space Exploration. In particular, GT SDA (General Tool for Sensitivity and Dependency Analysis) was used to assess performance improvements.

Sensitivity analysis is the study of how the variation (uncertainty) in the output of a statistical model can be attributed to variations in its inputs. Variance-based sensitivity measures, usually associated with the Sobol indices, are well suited to global analysis over the entire input design space of nonlinear models. Performance estimates were obtained for the FAST method of computing the Sobol indices.

## What are the Sobol indices?

The idea of the Sobol indices is to measure what portion of the model's output variance is explained by the variance of its inputs. The score for an input estimates the part of the output variation that is explained by the variation of that input.

The method can provide the following types of indices:

• Main indices take into account only the influence of the input itself, with the other inputs fixed. The index tells what portion of the output variance would be explained by the considered input, provided that all other inputs are fixed at their mean values.
• Total indices take into account the interactions of the i-th input with the other inputs. The index tells what portion of the output variance would be lost if the considered input were fixed at its mean value while the other inputs still varied.
• Interaction indices are equal to the difference between the total and main indices, which represents the strength of the input's interactions with the other inputs.

For each input, the total index is estimated as:

$$T_i = 1 - \frac{1}{V(Y)}V_{\sim x_i}[E_{x_i}(Y|\sim x_i)]$$,

where $$E_{x_i}[\cdot]$$ is the mean with respect to $$x_i$$, whereas $$V_{\sim x_i}[\cdot]$$ is the variance with respect to all inputs except $$x_i$$.
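To make the definitions above concrete, here is a minimal sketch that estimates main and total indices by plain Monte Carlo sampling. It is not the FAST method and not pSeven Core code: it assumes the widely used "pick-freeze" estimators (Saltelli for main indices, Jansen for total indices), independent inputs uniform on [0, 1], and a hypothetical test model `product_model` chosen because its indices are known analytically (main index 3/7 and total index 4/7 for each input).

```python
import numpy as np

def sobol_indices_mc(model, d, n, seed=0):
    """Monte Carlo estimates of main and total Sobol indices (pick-freeze scheme)."""
    rng = np.random.default_rng(seed)
    a = rng.random((n, d))          # first independent sample
    b = rng.random((n, d))          # second independent sample
    f_a = model(a)
    f_b = model(b)
    var_y = np.var(np.concatenate([f_a, f_b]))

    main = np.empty(d)
    total = np.empty(d)
    for i in range(d):
        ab = a.copy()
        ab[:, i] = b[:, i]          # matrix A with column i taken from B
        f_ab = model(ab)
        # Saltelli estimator of V[E(Y | x_i)]
        main[i] = np.mean(f_b * (f_ab - f_a)) / var_y
        # Jansen estimator of E[V(Y | all inputs except x_i)]
        total[i] = 0.5 * np.mean((f_a - f_ab) ** 2) / var_y
    return main, total

def product_model(x):
    # Hypothetical test model Y = x1 * x2 (illustration only)
    return x[:, 0] * x[:, 1]

main, total = sobol_indices_mc(product_model, d=2, n=100000)
interaction = total - main          # interaction indices as defined above
```

For this model the two inputs act only through their product, so the total indices exceed the main indices and the difference is the interaction index.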

Exploring the interactions between all the inputs through higher-order indices requires $$O(2^d)$$ model evaluations, where $$d$$ is the number of inputs, which may become too expensive for models with high input dimensionality. In such cases, the total indices approach is more suitable since it requires $$O(d)$$ model evaluations.

## FAST Method

There are several methods to compute the Sobol indices: FAST, CSTA and EASI. Each method has its pros and cons. In particular, the FAST and CSTA methods are suited to computing all types of indices (total, main and interaction), while EASI is designed specifically to compute main indices. All the performance estimates here are based on the FAST method applied to total indices evaluation, since it is widely used in pSeven Core for sensitivity analysis of the models under study.

The method uses space-filling one-dimensional curves of the form:

$$x_i(s)=\frac{1}{2}+\frac{1}{\pi}\arcsin(\sin(v_i s+\phi_i))$$,

to generate sample points. Here each feature has some frequency $$v_i$$, $$s$$ is the coordinate on the one-dimensional curve and $$\phi_i$$ is a random constant phase shift. Using Fourier decomposition, we may say that:

$$f(X) = \sum_{j=-\infty}^{\infty }(A_j \cos (js) + B_j\sin (js))$$,

$$A_j = \frac{1}{2\pi }\int_{-\pi}^{\pi}f(s)\cos(js)ds$$,

$$B_j = \frac{1}{2\pi }\int_{-\pi}^{\pi}f(s)\sin(js)ds$$.

These integrals can be estimated using points generated on the curve. In this case, for example, the variance of the conditional mean can be estimated as:

$$V_{x_i}[E_{\sim x_i}(Y|x_i)] = 2 \sum_{j=1}^{K}(A_{jv_i}^2 + B_{jv_i}^2), \;jv_i \; \text{is an integer}$$,

where $$K$$ is some predefined number of harmonics. Another appealing property of this approach is its ability to estimate total indices accurately. To do this, a unique frequency $$v_i$$ is assigned to $$x_i$$ and a single common frequency, different from $$v_i$$, is assigned to all other features; then the same procedure as above is performed. Notably, both main and total indices can be estimated in a single run of the method.
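The curve and Fourier-coefficient formulas above can be sketched in a few lines of NumPy. This is an illustrative sketch of main-index estimation only, not the GT SDA implementation: the function name `fast_main_indices`, the frequencies 11 and 21, the sample count 1025, and the test model are all assumptions chosen so that low-order harmonics of the two inputs do not overlap.

```python
import numpy as np

def fast_main_indices(model, freqs, n_samples=1025, n_harmonics=4, seed=0):
    """Estimate main Sobol indices with a basic FAST scheme (sketch)."""
    rng = np.random.default_rng(seed)
    freqs = np.asarray(freqs)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=freqs.size)  # random phase shifts

    # Equally spaced curve coordinate s on (-pi, pi)
    s = -np.pi + 2.0 * np.pi * (np.arange(n_samples) + 0.5) / n_samples
    # Space-filling curve: x_i(s) = 1/2 + arcsin(sin(v_i * s + phi_i)) / pi
    x = 0.5 + np.arcsin(np.sin(np.outer(s, freqs) + phases)) / np.pi
    y = model(x)
    var_y = np.var(y)

    def power(j):
        # A_j and B_j approximated by sample means over the curve
        a_j = np.mean(y * np.cos(j * s))
        b_j = np.mean(y * np.sin(j * s))
        return a_j ** 2 + b_j ** 2

    indices = np.empty(freqs.size)
    for i, v in enumerate(freqs):
        # Partial variance: 2 * sum over the first K harmonics of v_i
        partial = 2.0 * sum(power(p * v) for p in range(1, n_harmonics + 1))
        indices[i] = partial / var_y
    return indices

def product_model(x):
    # Hypothetical test model Y = x1 * x2; analytic main indices are 3/7 each
    return x[:, 0] * x[:, 1]

s_main = fast_main_indices(product_model, freqs=[11, 21])
```

The sample count must be large enough to resolve the highest harmonic used (here 4 × 21 = 84), and the frequencies should be chosen so that harmonics of different inputs, and their low-order combinations, do not coincide.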

## Computational Experiment and Results

A sample of 100 points was generated using a test model with 2 inputs and 4 outputs. The method was executed 10 times for each configuration to increase the accuracy of the performance estimates. The figure below shows the results for both the 2.x and 3.x Python versions available for pSeven Core.

Summary of time estimations for FAST method

For Python 2.7.13 + NumPy 1.11.3 and Python 3.6.3 + NumPy 1.13.3, Intel® Distribution for Python produced the estimates 4.93 and 3.72 times faster, respectively.
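A repeated-run timing comparison of this kind could be set up roughly as follows. This is only a sketch under stated assumptions: `workload` is a placeholder NumPy kernel, not the GT SDA code, and the helper `speedup` is hypothetical. The same script would be run once under each Python distribution and the resulting times compared.

```python
import statistics
import timeit

import numpy as np

def workload():
    # Placeholder numerical kernel (hypothetical); stands in for the analysis call
    x = np.linspace(0.0, 1.0, 1 << 14)
    return np.abs(np.fft.rfft(np.sin(40.0 * x))).sum()

# Execute the workload 10 times, timing each run separately
timings = timeit.repeat(workload, repeat=10, number=1)
median_time = statistics.median(timings)

def speedup(baseline_seconds, optimized_seconds):
    # Ratio of run times measured under two distributions
    return baseline_seconds / optimized_seconds
```

Taking the median of several runs reduces the influence of one-off delays such as cold caches or background processes.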

You can request a fully functional 30-day demo license of pSeven or pSeven Core here. Intel® Distribution for Python is available for download here.

Read an official Intel press-release >

By Dmitry Vetrov, Chief Developer & Taleh Agasiev, Software Developer, DATADVANCE
