5.2. Correlation Analysis

The correlation analysis functionality of GTSDA helps to understand whether there is a statistical dependency in the data. The tool computes correlation scores between inputs and outputs and uses statistical tests to decide whether the correlations are significant. The analysis is performed on the basis of a data sample (also known as training data, or just sample).

5.2.1. Introduction

This functionality of GTSDA allows checking whether a dependency between inputs \(x\) and outputs \(y\) exists. More specifically, it computes a correlation score and the corresponding p-value (see Testing for significance) for each pair of input feature and output. If the user sets \(y\) to \(None\), the result contains cross-correlations of \(x\).

  • correlation score is a number in the range between \(-1\) and \(1\), where a \(0\) value means that the output does not depend on the considered input feature, and a \(\pm 1\) value means that the output is a deterministic function of the considered input. A ‘\(+\)’ sign of the correlation says that the output value tends to increase if the input value increases, while a ‘\(-\)’ sign tells the opposite. Note that correlations which detect a general form of dependency do not return a sign.
  • p-value is the probability that the obtained correlation score appeared on the given sample while the output does not depend on the considered input in reality, i.e. the true value is 0 and the observed value appeared by chance. So the smaller the p-value is, the more likely it is that the correlation does exist. Usually a p-value of \(0.05\) (i.e. \(5\%\) probability) is considered small enough to be sure that the correlation is significant.

Example of usage:

analyzer = gtsda.Analyzer()
checkResult = analyzer.check(x=x, y=y, options={})

Input parameters are:

  • training sample
  • (optional) control variables for partial correlation (if not defined, all inputs except the considered one are taken as control variables for that input)
  • GTSDA options (optional)

The output parameters are:

  • correlation scores
  • corresponding p-values
  • decisions about statistical significance of scores

The GTSDA options are described in detail in section Option Reference.
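
For example, the correlation technique is selected by passing the GTSDA/Checker/Technique option (see Methods below) in the options dictionary. The sketch below is illustrative only: the import path, the option value string and the synthetic sample are assumptions, not part of the reference; consult the Option Reference for the exact option values.

import numpy as np
from da.p7core import gtsda  # assumed pSeven Core import path

# synthetic training sample: the output depends mainly on the first input
x = np.random.rand(100, 2)
y = 3.0 * x[:, [0]] + 0.1 * np.random.rand(100, 1)

analyzer = gtsda.Analyzer()
# the technique name below is an assumption; valid values are listed in the Option Reference
result = analyzer.check(x=x, y=y, options={"GTSDA/Checker/Technique": "PearsonCorrelation"})
# the result contains correlation scores, the corresponding p-values and significance decisions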

5.2.2. Testing for significance

When computing correlations to decide whether a dependency between the considered variables exists at all, it is important to check whether the obtained correlation is significant.

Imagine that no real dependency exists (this assumption is called the null hypothesis); even then, due to random sampling error, one is very unlikely to obtain on a finite sample a correlation value that is exactly 0.

So when looking at correlation coefficients the user has to distinguish between the case when the null hypothesis is false and the case when it is true and the observed correlation value differs from 0 just due to random sampling error.

As a common way to handle this, GTSDA computes a p-value for each correlation.

  • p-value estimates the probability of obtaining the observed correlation value on the given data if the null hypothesis is true.

Common practice is to consider the dependency significant if the computed p-value is smaller than 0.05 (i.e. there is only a \(5\%\) chance of the observed correlation value under the null hypothesis). However, for each application one can use a different threshold to make decisions.
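
A minimal sketch of this decision rule applied to hypothetical p-values:

import numpy as np

p_values = np.array([0.001, 0.20, 0.04])  # hypothetical p-values, one per input feature
threshold = 0.05                          # common significance level; adjust per application
significant = p_values < threshold
print(significant)                        # [ True False  True]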

5.2.3. Methods

At the moment the tool can compute the following types of correlation scores: Pearson Correlation, Robust Pearson Correlation, Partial Pearson Correlation, Spearman Correlation, Kendall Correlation, Mutual Information, Distance Correlation, and Partial Distance Correlation (see the subsections below).

Partial correlation scores measure the degree of association between two random variables, with the effect of a set of controlling variables removed.

To select which correlation score to compute, set the GTSDA/Checker/Technique option.

By default Pearson Correlation is used.

5.2.3.1. Pearson Correlation

Pearson correlation is a popular method of estimating a linear dependency.

The Pearson correlation between output \(y\) and feature \(x\) is equal to 1 if and only if \(y = a x + b\) with \(a > 0\), and is equal to -1 if and only if \(y = a x + b\) with \(a < 0\).

Pearson correlation coefficient between output \(y\) and feature \(x\) is:

\[\rho = \frac{\sum_{j=1}^{n} \Big ( x_j - \bar{x} \Big ) \Big ( y_j - \bar{y} \Big )}{\sqrt{\sum_{j=1}^{n}\Big ( x_j - \bar{x} \Big )^2} \sqrt{\sum_{j=1}^{n}\Big ( y_j - \bar{y} \Big )^2}},\]

where \(\bar{x}\) and \(\bar{y}\) are the mean values of \(x\) and \(y\) computed from the sample.
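
For reference, a direct NumPy transcription of this formula (an illustrative sketch, not the GTSDA implementation):

import numpy as np

def pearson_correlation(x, y):
    # sample Pearson correlation between two 1-D arrays, following the formula above
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

x = np.linspace(0.0, 1.0, 50)
print(pearson_correlation(x, 2.0 * x + 1.0))   # 1.0: exact increasing linear dependency
print(pearson_correlation(x, (x - 0.5) ** 2))  # close to 0: symmetric nonlinear dependency is missed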

Pros:

  • Very simple and reliable measure.

Cons:

  • Degrades if output depends on many inputs.
  • May not catch some non-linear dependencies.

5.2.3.2. Robust Pearson Correlation

This technique was added to GTSDA in version 6.2. The procedure starts by removing outliers from the sample (see the definition of outliers in Section Outlier Detection). To do this, the MinCovDet algorithm (see [2], [3]) is used. After that, the Pearson correlation is calculated on the remaining set of points.

Pros:

  • Provides a version of Pearson correlation that is robust to outliers.
  • Linear complexity.

Cons:

  • Detects only linear dependence.
  • Degrades if input is high-dimensional.

5.2.3.3. Partial Pearson Correlation

This method also computes the Pearson Correlation between each input and the output and, likewise, assumes the relation between input and output to be linear. The difference is that the method subtracts from the considered input and output the possible influence of the other inputs.

Let \(x\) and \(y\) be, as above, random variables taking real values, and let \(\mathbf{z}\) be the corresponding matrix of explanatory variables. The method starts by solving the linear regression problems:

\[ \begin{align}\begin{aligned}\mathbf{w}_X^* = \arg\min_{\mathbf{w}} \left\{ \sum_{i=1}^n (x_i - \langle\mathbf{w}, \mathbf{z}_i \rangle)^2 \right\},\\\mathbf{w}_Y^* = \arg\min_{\mathbf{w}} \left\{ \sum_{i=1}^n (y_i - \langle\mathbf{w}, \mathbf{z}_i \rangle)^2 \right\},\end{aligned}\end{align} \]

where \(n\) is the number of samples and \(\langle\mathbf{v},\mathbf{w} \rangle\) is the scalar product between the vectors \(\mathbf{v}\) and \(\mathbf{w}\).

The residuals are then

\[ \begin{align}\begin{aligned}r_{X,i} = x_i - \langle\mathbf{w}_X^*,\mathbf{z}_i \rangle,\\r_{Y,i} = y_i - \langle\mathbf{w}_Y^*,\mathbf{z}_i \rangle\end{aligned}\end{align} \]

and the sample partial correlation is then given by the usual formula for sample correlation, but between these new derived values.

\[\hat{\rho}_{xy\cdot\mathbf{z}}=\frac{n\sum_{i=1}^n r_{X,i}r_{Y,i}-\sum_{i=1}^n r_{X,i}\sum_{i=1}^n r_{Y,i}} {\sqrt{n\sum_{i=1}^n r_{X,i}^2-\left(\sum_{i=1}^n r_{X,i}\right)^2}~\sqrt{n\sum_{i=1}^n r_{Y,i}^2-\left(\sum_{i=1}^n r_{Y,i}\right)^2}}.\]

Explanatory variables \(\mathbf{z}\) are optional. If \(\mathbf{z}\) is not provided, for each input variable the method computes the partial correlation with all other inputs playing the role of \(\mathbf{z}\).
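
An illustrative NumPy sketch of this procedure (least-squares residuals followed by the ordinary sample correlation); it is not the tool's implementation:

import numpy as np

def partial_pearson(x, y, z):
    # partial Pearson correlation of x and y given control variables z (shape (n, k))
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    z = np.asarray(z, dtype=float).reshape(len(x), -1)
    # least-squares fits x ~ <w, z> and y ~ <w, z>, then the residuals
    r_x = x - z @ np.linalg.lstsq(z, x, rcond=None)[0]
    r_y = y - z @ np.linalg.lstsq(z, y, rcond=None)[0]
    # usual sample correlation between the residuals
    return np.corrcoef(r_x, r_y)[0, 1]

# y depends on both inputs; the partial correlation of x1 and y given x2
# isolates the direct contribution of x1
x1, x2 = np.random.rand(200), np.random.rand(200)
print(partial_pearson(x1, 2.0 * x1 + 5.0 * x2, x2))  # close to 1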

Pros:

  • In the case of a multivariate linear dependency the method provides accurate estimates.

Cons:

  • Can have a high running time when the total number of features is large.
  • If the dependency is not linear (and especially if there are strong cross-feature interactions), the method's estimates may degrade.

5.2.3.4. Spearman Correlation

Spearman correlation is a generalization of Pearson Correlation which is \(\pm 1\) if and only if \(y\) is a monotonic function of the considered feature \(x\). The raw values \(x_i\), \(y_i\) are converted to ranks \(X_i\), \(Y_i\), and the Spearman correlation coefficient is computed in the form:

\[\rho = 1 - \frac{6 \sum_{i = 1}^n d_i^2}{n (n^2 -1)},\]

where \(d_i = Y_i - X_i\) is the difference between ranks.
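
An illustrative NumPy sketch of this rank-based formula (it assumes there are no ties in the data):

import numpy as np

def spearman_correlation(x, y):
    # Spearman correlation via the rank-difference formula above (no ties assumed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    rank_x = np.argsort(np.argsort(x)) + 1  # ranks 1..n of the raw values
    rank_y = np.argsort(np.argsort(y)) + 1
    d = rank_y - rank_x
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

x = np.linspace(0.1, 5.0, 100)
print(spearman_correlation(x, np.exp(x)))  # 1.0: monotonic but strongly nonlinear dependency
print(spearman_correlation(x, np.sin(x)))  # below 1: non-monotonic dependency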

Pros:

  • Can catch not only linear but also any monotonic dependency.

Cons:

  • Degrades if output depends on many inputs.
  • May not catch some non-monotonic dependencies.

5.2.3.5. Kendall Correlation

Kendall correlation is a method to calculate rank correlations between samples. It can detect monotonic dependency.

Any pair of observations \((x_i, y_i)\) and \((x_j, y_j)\) is:

  • said to be concordant if the ranks for both elements agree: that is, if both \(x_i > x_j\) and \(y_i > y_j\) or if both \(x_i < x_j\) and \(y_i < y_j\),
  • said to be discordant, if \(x_i > x_j\) and \(y_i < y_j\) or if \(x_i < x_j\) and \(y_i > y_j\),
  • said to be tied (i.e. neither concordant nor discordant) if \(x_i = x_j\) or \(y_i = y_j\).

The Kendall correlation score computes the difference between the fractions of concordant and discordant pairs. In the monotonic case, when all pairs are either concordant or discordant, the coefficient equals \(+1\) or \(-1\) respectively. A value of zero indicates an equal number of concordant and discordant pairs.

Generally Kendall is preferred over Spearman Correlation as it handles discrete variables better and has an easier interpretation. In addition, significance estimates for Kendall correlation are more reliable than those for Spearman Correlation.

In pSeven Core we use the variant of this score known as Tau-b (which corrects for possible ties).

More formally, the Kendall Tau-b coefficient is defined as:

\[\tau_B = \frac{n_c-n_d}{\sqrt{(n_0-n_{tx})(n_0-n_{ty})}},\]

where

  • \(n_0 = n(n-1)/2\) is the total number of pairs that can be constructed for a sample of size \(n\),
  • \(n_c\) is the number of concordant pairs,
  • \(n_d\) is the number of discordant pairs,
  • \(n_{tx}\) is the number of pairs tied on the \(x\) variable,
  • \(n_{ty}\) is the number of pairs tied on the \(y\) variable.
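
An illustrative O(n^2) pair-counting sketch of this coefficient (the tool may use a faster algorithm):

import numpy as np

def kendall_tau_b(x, y):
    # Kendall Tau-b by direct pair counting, following the definition above
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    n_c = n_d = n_tx = n_ty = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = np.sign(x[i] - x[j]), np.sign(y[i] - y[j])
            if dx == 0:
                n_tx += 1          # pair tied on x (possibly also on y)
            if dy == 0:
                n_ty += 1          # pair tied on y
            if dx != 0 and dy != 0:
                if dx == dy:
                    n_c += 1       # concordant pair
                else:
                    n_d += 1       # discordant pair
    n_0 = n * (n - 1) // 2
    return (n_c - n_d) / np.sqrt(float((n_0 - n_tx) * (n_0 - n_ty)))

print(kendall_tau_b([1, 2, 2, 3, 4], [1, 2, 3, 3, 5]))  # ~0.89: nearly monotonic sample with ties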

Pros:

  • Can catch not only linear but also any monotonic dependency.
  • Is effective for discrete variables.

Cons:

  • Degrades if output depends on many inputs.
  • May not catch some non-monotonic dependencies.

5.2.3.6. Mutual Information

The idea of Mutual Information is to measure how far the joint distribution \(p\left(x,y\right)\) of the feature and the output is from the case of two independent random variables, where \(p\left(x,y\right) = p\left(x\right) p\left(y\right)\). The greater the difference, the more relevant the feature is.

Formally Mutual Information can be computed as:

\[I\left(x, y\right)=\int p\left(x,y\right) \log\frac{p\left(x,y\right)}{p\left(x\right) p\left(y\right)}\,dx\,dy.\]

By default, in pSeven Core we normalize its value to the \(\left[0, 1\right]\) interval, so the final feature score for the \(i\)-th variable is estimated as:

\[w_i = \frac{I(x,y)}{\min \Big ( H(x), H(y) \Big )},\]

where \(H(x)\) and \(H(y)\) are corresponding entropies:

\[H \left( x \right)=-\int p\left(x\right) \log \Big( p\left(x\right) \Big) dx.\]

One can disable normalization with the GTSDA/Checker/MutualInformation/Normalize option.

Currently we use a histogram-based estimate to reconstruct the variable distributions, i.e. we define bins and count the fraction of data points within each bin as the local density. This approach may be crude on small samples, but it is very cheap in terms of memory and computation time, so it can be applied to very large data sets.

The histogram approach requires specification of the bin size. Since version 6.2, pSeven Core uses one of the two following approaches, which can be selected via the GTSDA/Checker/MutualInformation/BinsMethod option:

  1. The full search approach performs a grid search over bin configurations and selects the best one in terms of leave-one-out cross-validation; see [1] for an explanation of the criterion.

  2. Scott’s method, based on an asymptotic approximation (see [5]):

    \[h_i = \frac{3.5 \sigma_i}{n^{1/3}},\]

    where

    • \(n\) - sample size,
    • \(h_i\) - bin size for the \(i\)-th variable,
    • \(\sigma_i\) - standard deviation of the \(i\)-th variable.
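
An illustrative NumPy sketch of the histogram estimate with Scott's rule for the bins (it omits the full-search bin selection and any corrections used by the tool):

import numpy as np

def mutual_information(x, y, normalize=True):
    # histogram-based mutual information estimate with Scott's rule bin sizes
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    # Scott's rule h = 3.5 * sigma / n^(1/3), converted to a number of bins
    bins_x = max(1, int(np.ceil((x.max() - x.min()) * n ** (1.0 / 3.0) / (3.5 * x.std()))))
    bins_y = max(1, int(np.ceil((y.max() - y.min()) * n ** (1.0 / 3.0) / (3.5 * y.std()))))
    p_xy, _, _ = np.histogram2d(x, y, bins=(bins_x, bins_y))
    p_xy /= n                                  # joint probabilities per bin
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    nz = p_xy > 0                              # empty bins contribute 0 * log 0 = 0
    mi = np.sum(p_xy[nz] * np.log(p_xy[nz] / np.outer(p_x, p_y)[nz]))
    if normalize:                              # normalize by min(H(x), H(y)) as described above
        h_x = -np.sum(p_x[p_x > 0] * np.log(p_x[p_x > 0]))
        h_y = -np.sum(p_y[p_y > 0] * np.log(p_y[p_y > 0]))
        mi /= min(h_x, h_y)
    return mi

x = np.random.rand(1000)
print(mutual_information(x, np.sin(6.0 * x)))       # high: deterministic nonlinear dependency
print(mutual_information(x, np.random.rand(1000)))  # close to 0: independent variables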

Pros:

  • Can detect non-linear dependency.
  • Works fast.

Cons:

  • Degrades if output depends on many inputs.
  • Can be unreliable for small (< 200 points) samples.

5.2.3.7. Distance Correlation

Distance correlation, measured for a sample of \((x, y)\) pairs, is the Pearson Correlation between the elements of the centred distance matrices of \(x\) and \(y\).

It equals zero if and only if \(x\) and \(y\) are statistically independent and equals 1 if and only if \(y\) is a linear function of \(x\).

Denote the non-centred distance matrices by

\[ \begin{align}\begin{aligned}a_{ij} = |x_{i} - x_{j}|,\\b_{ij} = |y_{i} - y_{j}|.\end{aligned}\end{align} \]

Denote the centred matrices by

\[ \begin{align}\begin{aligned}A_{ij} = a_{ij} - a_{i.} - a_{.j} + a_{..},\\B_{ij} = b_{ij} - b_{i.} - b_{.j} + b_{..}.\end{aligned}\end{align} \]

where \(a_{i.}\) is the mean of the \(i\)-th row, \(a_{.j}\) is the mean of the \(j\)-th column, and \(a_{..}\) is the mean over the whole matrix.

The resulting DC-score (which is always non-negative) is

\[\rho = \frac{\langle A, B \rangle}{\sqrt{\langle A, A \rangle \langle B, B \rangle}},\]

where

\[\langle A, B \rangle=\frac{1}{n(n-3)}\sum_{i,j=1}^{n} A_{i,j} B_{i,j}.\]
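
An illustrative NumPy sketch of this score for scalar samples (the constant factor in the inner product cancels in the ratio, so plain sums are used):

import numpy as np

def distance_correlation(x, y):
    # distance correlation via double-centred distance matrices, O(n^2) memory and time
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    a = np.abs(x[:, None] - x[None, :])  # pairwise distances for x
    b = np.abs(y[:, None] - y[None, :])  # pairwise distances for y
    A = a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()
    B = b - b.mean(axis=1, keepdims=True) - b.mean(axis=0, keepdims=True) + b.mean()
    return np.sum(A * B) / np.sqrt(np.sum(A * A) * np.sum(B * B))

x = np.linspace(-1.0, 1.0, 200)
print(distance_correlation(x, x ** 2))         # clearly above 0, although the Pearson correlation is ~0
print(distance_correlation(x, 2.0 * x + 1.0))  # 1.0 for an exact linear dependency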

One can also define the Unbiased Distance Correlation, which uses U-centred matrices:

\[UA_{ij} = a_{ij} - \frac{n}{n-2}a_{i.} - \frac{n}{n-2}a_{.j} + \frac{n^2}{(n-1)(n-2)}a_{..},\]
\[UB_{ij} = b_{ij} - \frac{n}{n-2}b_{i.} - \frac{n}{n-2}b_{.j} + \frac{n^2}{(n-1)(n-2)}b_{..},\]

where, as above, \(a_{i.}\), \(b_{i.}\) are the row means, \(a_{.j}\), \(b_{.j}\) are the column means, and \(a_{..}\), \(b_{..}\) are the means over the whole matrices.

The resulting Unbiased DC-score is

\[U\rho = \frac{\langle UA, UB \rangle}{\sqrt{\langle UA, UA \rangle \langle UB, UB \rangle}},\]

This score can be < 0.

Some details of the algorithm can be found in [4].

Pros:

  • Can detect non-linear dependency.

Cons:

  • Degrades if output depends on many inputs.
  • Works in \(O(n^2)\) operations.

5.2.3.8. Partial Distance Correlation

Partial distance correlation is a measure of conditional dependence between two sample vectors given control variables or other inputs (whose influence has to be removed from the sample).

As shown above, distance correlation is the Pearson Correlation between the elements of the so-called U-centred distance matrices. In the unbiased case the inner product of two such matrices is defined by

\[\langle A, B \rangle=\frac{1}{n(n-3)}\sum_{i,j=1}^{n} a_{i,j}b_{i,j}.\]

Consider the Hilbert space \(H\) of all possible U-centred matrices. Let A, B and C be the U-centred matrices from \(H\) corresponding to the samples X, Y and Z. We would like to calculate the conditional dependency between X and Y, knowing \(y=f(x,z)\), \(x \in R, y \in R, z \in R^{p-1}\). We can eliminate the influence of Z with the help of orthogonal projections of A and B onto C:

\[ \begin{align}\begin{aligned}A^{*} = A - \frac{\langle A,C \rangle}{\langle C,C \rangle}C,\\B^{*} = B - \frac{\langle B,C \rangle}{\langle C,C \rangle}C.\end{aligned}\end{align} \]

These matrices are also U-centred and belong to the Hilbert space \(H\). The resulting Partial Distance Correlation score is

\[R^{*} = \frac{\langle A^{*},B^{*} \rangle}{\sqrt{\langle A^{*},A^{*} \rangle\langle B^{*},B^{*} \rangle}}.\]

Because this score is based on the unbiased variant, it can be negative.
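
An illustrative NumPy sketch of this score for scalar samples, using the U-centring and projection formulas above; setting the diagonal of the U-centred matrices to zero follows the standard convention from the literature and is an assumption here:

import numpy as np

def _u_centred(d):
    # U-centre a pairwise distance matrix; the diagonal is set to zero
    n = d.shape[0]
    u = (d - d.sum(axis=1, keepdims=True) / (n - 2)
           - d.sum(axis=0, keepdims=True) / (n - 2)
           + d.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(u, 0.0)
    return u

def _inner(u, v):
    # unbiased inner product of two U-centred matrices
    n = u.shape[0]
    return np.sum(u * v) / (n * (n - 3))

def partial_distance_correlation(x, y, z):
    # partial distance correlation of x and y given a scalar control variable z
    dist = lambda s: np.abs(np.subtract.outer(np.asarray(s, float).ravel(),
                                              np.asarray(s, float).ravel()))
    A, B, C = _u_centred(dist(x)), _u_centred(dist(y)), _u_centred(dist(z))
    A_star = A - _inner(A, C) / _inner(C, C) * C  # remove the part of A explained by C
    B_star = B - _inner(B, C) / _inner(C, C) * C
    return _inner(A_star, B_star) / np.sqrt(_inner(A_star, A_star) * _inner(B_star, B_star))

x, z = np.random.rand(300), np.random.rand(300)
y = x + 4.0 * z
print(partial_distance_correlation(x, y, z))                    # noticeably above 0: x still matters given z
print(partial_distance_correlation(np.random.rand(300), y, z))  # near 0: unrelated input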

Pros:

  • Can detect non-linear dependency.
  • Can work with multidimensional arrays.

Cons:

  • Works in \(O(n^2)\) operations.

5.2.4. Sample size requirements

The maximum size of the training sample, which can be processed by GTSDA, is primarily determined by the user’s hardware. Necessary hardware resources depend significantly on the specific technique — see descriptions of individual techniques. Accuracy of estimation tends to improve as the sample size increases.

Contrary to the maximum size, there is a certain minimum for the size of the sample, which depends on the technique used. This condition refers to the effective values, i.e. the ones obtained after preprocessing. An error with the corresponding error code will be returned if this condition is violated.

At the moment all techniques used for correlation analysis in GTSDA require a sample with at least 3 unique rows to compute correlations.

5.2.5. References

[1] H. Shimazaki, S. Shinomoto. A recipe for optimizing a time histogram. NIPS, 2006.
[2] P. J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79:871-880, 1984.
[3] P. J. Rousseeuw, K. Van Driessen. A Fast Algorithm for the Minimum Covariance Determinant Estimator. Technometrics, 1999.
[4] G. J. Székely, M. L. Rizzo, N. K. Bakirov. Measuring and testing independence by correlation of distances. Annals of Statistics, 35(6):2769-2794, 2007.
[5] David W. Scott. Multivariate Density Estimation and Visualization. Papers No. 2004-16, Humboldt-Universität Berlin, Center for Applied Statistics and Economics (CASE), 2004.