9. Statistical Utilities

pSeven Core provides additional statistical utilities, including the calculation of elementary statistics, outlier detection and distribution tests.

9.1. Elementary Statistics

Implemented in calculate_statistics().

This analysis method calculates various statistics for the data sample (matrix) \(X\).

The statistics include:

  • Component mean, minimum, maximum, median values, ranges and standard deviations.
  • Upper and lower quantiles for the given confidence level.
  • Correlation coefficients between components (based on empirical or robust covariance, or on rank correlation).

Quantiles are computed with respect to the given confidence level (default is 0.95).

There are three correlation coefficient calculation methods available:

  • Empirical - ordinary correlation calculation; this is the default.
  • Robust - a robust correlation computation, used if the covariance parameter to calculate_statistics() is 'robust'.
  • Rank - rank correlation measured via the Kendall rank correlation coefficient (only for ranked components). Ranked components are specified by the rank_components parameter to calculate_statistics(); rank_components is a zero-based list of indices of rank components. The list length must be equal to the total number of components in \(X\).

Below we give exact mathematical formulations of the computed statistics. Let \(X\) be an \(n \times m\) real matrix, where observations (objects) are located in rows and components (features) in columns. By \(X_i\) we denote the \(i\)-th row of \(X\), by \(x_j\) the \(j\)-th column of \(X\), and by \(X_{ij}\) the \((i, j)\)-th element of \(X\).

9.1.1. Min

Returns the smallest elements along the first dimension of \(X\).

9.1.2. Max

Returns the largest elements along the first dimension of \(X\).

9.1.3. Mean

Computes the arithmetic mean values of the elements along the first dimension of \(X\).

9.1.4. Median

Computes median values of the elements along the first dimension of \(X\). Given a vector \(x\) of length \(n\), the median of \(x\) is the middle value of a sorted copy (in ascending order) \(x_{sorted}\) of \(x\). When \(n\) is odd it is \(x_{sorted}[(n-1)/2]\). When \(n\) is even, it is the average of the two middle values of \(x_{sorted}\).

9.1.5. Range

Returns the difference between the maximum and the minimum of the elements along the first dimension of \(X\).
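As a quick sanity check, the column-wise statistics above can be reproduced directly with NumPy (a sketch; the actual calculate_statistics() interface and output format may differ):

```python
import numpy as np

# Sample matrix X: 4 observations (rows) x 2 components (columns).
X = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [2.0, 20.0],
              [4.0, 40.0]])

col_min = X.min(axis=0)            # smallest element in each column
col_max = X.max(axis=0)            # largest element in each column
col_mean = X.mean(axis=0)          # arithmetic mean of each column
col_median = np.median(X, axis=0)  # middle value of each sorted column
col_range = col_max - col_min      # difference between max and min per column
```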

9.1.6. Upper and lower quantiles

Computes empirical quantiles for each column of \(X\). Let \(x_{sorted}\) be a copy of a real vector \(x\) sorted in ascending order. The \(p\)-th lower quantile is the order statistic \(Q_l(p) = x_{sorted}[j]\), where \(j = \max(1, \lfloor n (1 - p) \rfloor)\). The \(p\)-th upper quantile is then \(Q_u(p) = x_{sorted}[n - j + 1]\).
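The quantile definition above can be transcribed almost literally, shifting the 1-based indices to 0-based NumPy indexing (the helper name below is illustrative, not part of the library API):

```python
import numpy as np

def lower_upper_quantiles(x, p=0.95):
    """Empirical lower and upper quantiles as defined in the text."""
    x_sorted = np.sort(np.asarray(x, dtype=float))
    n = len(x_sorted)
    j = max(1, int(np.floor(n * (1.0 - p))))  # 1-based order statistic index
    q_lower = x_sorted[j - 1]  # x_sorted[j] in 1-based notation
    q_upper = x_sorted[n - j]  # x_sorted[n - j + 1] in 1-based notation
    return q_lower, q_upper

# For the sample 1, 2, ..., 100 and p = 0.95: j = 5,
# so the lower quantile is 5 and the upper quantile is 96.
lo, up = lower_upper_quantiles(range(1, 101), p=0.95)
```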

9.1.7. Std

Computes standard deviations of the elements along the first dimension of \(X\). If the covariance type is 'empirical', the ordinary unbiased estimate of the standard deviation for the \(j\)-th column of \(X\) is computed:

\[S_j = \biggl(\frac{1}{n - 1} \sum_{i = 1}^{n} (X_{ij} - \bar{x_j})^2\biggr)^{1/2}.\]

If the covariance type is 'robust' and \(m > 1\), an algorithm based on the Minimum Covariance Determinant estimator (see [1] for details) is used to select the subsample yielding the most robust covariance estimate, and the unbiased estimate of the standard deviation is computed from this subsample.

If the covariance type is 'robust' and \(m = 1\), the standard deviation is estimated from the median absolute deviation (MAD). For a vector \(x\), the MAD is computed as

\[M(x) = median(|\tilde{x}|), \quad \tilde{x} = x - (median(x), median(x), \ldots, median(x))\]

and

\[S(x) \approx 1.4826 \cdot M(x).\]
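The MAD-based robust estimate for a single column (\(m = 1\)) is easy to sketch; the function name below is illustrative, not the library's:

```python
import numpy as np

def robust_std_mad(x):
    """Robust standard deviation via the median absolute deviation."""
    x = np.asarray(x, dtype=float)
    mad = np.median(np.abs(x - np.median(x)))  # M(x) in the text
    return 1.4826 * mad                        # consistency factor for normal data

rng = np.random.default_rng(0)
sample = rng.standard_normal(100000)
s_robust = robust_std_mad(sample)  # close to 1 for standard normal data
s_plain = sample.std(ddof=1)       # ordinary unbiased estimate, for comparison
```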

9.1.8. Correlation

Computes empirical correlation coefficients between different columns of \(X\). In the 'empirical' case, the correlation coefficient for columns \(x_i\) and \(x_j\) is

\[R(i, j) = \frac{C(i, j)}{\sqrt{C(i, i)} \sqrt{C(j, j)}},\]

where \(C(i, j)\) is the covariance between \(x_i\) and \(x_j\):

\[C(i, j) = \frac{1}{n - 1} \sum_{k = 1}^{n} (X_{ki} - \bar{x}_i) (X_{kj} - \bar{x}_j).\]

By \(\bar{x}_i\) we denote the arithmetic mean of \(x_i\). The values of correlation coefficients are between -1 and 1, inclusive. The correlation coefficient indicates the degree to which two variables are linearly dependent.
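For the 'empirical' case, the formula for \(R(i, j)\) agrees with NumPy's built-in correlation (a quick cross-check, not the library's code):

```python
import numpy as np

# Nearly linear columns, so the correlation coefficient is close to 1.
X = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, 5.9],
              [4.0, 8.0]])

C = np.cov(X, rowvar=False)  # covariance with the 1/(n - 1) normalization
# R(i, j) = C(i, j) / sqrt(C(i, i) * C(j, j))
R_manual = C / np.sqrt(np.outer(np.diag(C), np.diag(C)))
R_numpy = np.corrcoef(X, rowvar=False)  # should match R_manual
```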

In the 'robust' case, the same quantity is computed for the vectors corresponding to the subsample determined by the Minimum Covariance Determinant estimator (see [1] for details).

If both the \(i\)-th and \(j\)-th columns of the matrix \(X\) are marked as rank components via the rank_components parameter, the Kendall rank correlation coefficient is used for computing correlations between these columns:

\[R(i, j) = \frac{n(n - 1) - 4 \nu(x_i, x_j) - 2 u^{i} - 2 u^{j}}{\sqrt{(n(n - 1) - 2 u^{i}) (n(n - 1) - 2 u^{j})}}, \quad i, j = 1, \dots, m,\]

where \(\nu(x_i, x_j)\) is the minimal number of exchanges of adjacent elements of \(x_{i}\), required to make it ordered as \(x_{j}\), and

\[u^{l} = \sum_{t = 1}^{m^{l}} n_t^{l} (n_t^{l} - 1), \quad l = i, j,\]

where \(m^{l}\) is the number of distinct ranks of the \(l\)-th variable and \(n_t^{l}\) is the number of elements of vector \(x_l\) that have the \(t\)-th rank. Note that if all elements of the \(i\)-th vector are distinct and all elements of the \(j\)-th vector are distinct, \(R(i, j)\) can be written in the form:

\[R(i, j) = 1 - \frac{4 \nu(x_i, x_j)}{n (n - 1)}.\]

If only one of the two variables (say, the \(i\)-th) is marked as a rank component and the other is not, the empirical correlation coefficient is computed between the corresponding columns of \(X\). If the matrix \(X\) contains both ranked and real components, the correlation matrix will contain both empirical and Kendall correlations; in that case it may lose the property of being non-negative definite.
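The rank correlation above is the Kendall tau-b coefficient; assuming SciPy is available, scipy.stats.kendalltau computes the same quantity (a cross-check, not the pSeven code path for rank_components):

```python
import numpy as np
from scipy.stats import kendalltau

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 2, 3, 5, 4])  # ordered like x up to one exchange of adjacent elements

# All elements are distinct, so the simplified formula applies:
# R = 1 - 4 * nu / (n * (n - 1)) = 1 - 4 * 1 / 20 = 0.8.
tau, _ = kendalltau(x, y)
```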

9.2. Outlier Detection

Implemented in detect_outliers().

The problem is to find abnormal observations in a given set of observations \(X\). An abnormal observation is also referred to as an outlying observation, or outlier.

An outlying observation is one that appears to deviate significantly from other members of the sample. Outlier detection is performed on the basis of a covariance matrix estimate. The obtained estimate is then used to compute outlier scores. A score represents how likely it is that the corresponding object is an outlier. An object is labeled as an outlier if its score is higher than the quantile for the given confidence level.

9.2.1. Covariance Estimate

There are two types of covariance estimates implemented in our tool.

  1. Empirical covariance.

    \[S = \frac{1}{n - 1} \sum_{i = 1}^n (X_i - \bar{X})(X_i - \bar{X})^T,\]

    where \(\bar{X} = \frac{1}{n} \sum_{i = 1}^n X_i\).

  2. Robust covariance estimate based on the Minimum Covariance Determinant estimator [1]. This method uses only part of the given sample to estimate the empirical covariance matrix. It chooses the subsample so that the determinant of its covariance matrix is as small as possible, which means the observations of this subsample lie close to each other and do not deviate significantly from one another. That is, the covariance is estimated using non-outlier observations.
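Both estimates can be sketched with standard tools, assuming scikit-learn is available for the MCD step (pSeven's own robust estimator may differ in details):

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
# 200 correlated Gaussian observations, then a few gross outliers.
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=200)
X[:5] = [10.0, -10.0]

S_empirical = np.cov(X, rowvar=False)  # 1/(n - 1) normalization, as in the formula
S_robust = MinCovDet(random_state=0).fit(X).covariance_  # MCD-based estimate

# The outliers drag the empirical off-diagonal entry strongly negative,
# while the robust estimate stays close to the true value 0.5.
```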

9.2.2. Calculation of Scores

Scores can be computed by two different methods.

  1. The score is calculated from distances to the mean vector of the sample in the metric induced by the covariance matrix. It is assumed that non-outlier points can be described by a Gaussian distribution, so that the squared distances \((X_i - \bar{X}) S^{-1} (X_i - \bar{X})^T, i = 1, \ldots, n\) follow a \(\chi^2\) distribution with \(m\) degrees of freedom. For distant points the \(\chi^2\) cumulative distribution function (CDF) is close to 1, and such points can be considered outliers; for non-outlier points the CDF stays well below 1.

  2. The score is based on the notion of local outlier probability [2]. The idea of this approach is that non-outlier points lie in dense regions, whereas outliers lie in sparse regions. The algorithm uses the probabilistic distance of an observation \(x \in X\) to a set \(U \subseteq X\)

    \[\begin{split}\begin{array}{l} {\rm pdist}(\lambda, x, U) = \lambda \sigma(x, U), \\ \sigma^2(x, U) = \frac{1}{|U|} \sum_{y \in U} (y - x)S^{-1}(y - x)^T, \end{array}\end{split}\]

    where \(S\) is a covariance matrix estimated by one of the methods described previously, \(|U|\) is the size of the set \(U\), and \(\lambda\) is a normalization factor. Intuitively, \({\rm pdist}\) estimates the inverse of the local density around observation \(x\). We can therefore compute the ratio of the density estimate around \(x\) based on \(U\) to the expected value of the density estimates around all objects in \(U\). The score is the normal CDF of this ratio: for outlier points the score is close to 1, for non-outlier points it is close to 0.
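The first scoring method can be sketched as Mahalanobis-type distances passed through the \(\chi^2\) CDF, assuming SciPy is available (a sketch, not the exact detect_outliers() implementation):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 3))
X[0] = [8.0, 8.0, 8.0]  # plant one obvious outlier

mean = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
# Squared distances (X_i - mean) S^{-1} (X_i - mean)^T for every row.
d2 = np.einsum('ij,jk,ik->i', X - mean, S_inv, X - mean)
scores = chi2.cdf(d2, df=X.shape[1])  # close to 1 for outliers

outlier_idx = np.flatnonzero(scores > 0.999)  # the planted point shows up here
```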

9.3. Distribution Tests

Implemented in check_distribution().

If data comes from a specific type of probability distribution, this knowledge can be crucial for its successful analysis. Tests for specific distribution types can be performed: they check whether the points in the sample are distributed uniformly or normally in the design space. The uniformity test is based on the notion of discrepancy (a special measure of uniformity in multidimensional spaces) and on minimum spanning tree statistics (a special measure of uniformity in low-dimensional spaces). Note that the uniformity tests check for random uniformity, so a full factorial design may not be uniform in this sense. The normality tests are based on the asymptotic properties of the skewness and kurtosis coefficients of the normal distribution. For small sample sizes the skewness test is corrected to compensate for the error due to the finite size of the sample.

All the tests are probabilistic and are performed for the given confidence level (default 0.99). For large samples, the number of points processed by a test can be limited by setting a maximum allowed budget: if the number of points in the sample exceeds this budget, a random subsample of size equal to the budget is taken for the distribution tests.
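A univariate skewness/kurtosis-based normality check in the spirit of these tests can be sketched with SciPy's normaltest (check_distribution() itself uses multivariate variants):

```python
import numpy as np
from scipy.stats import normaltest  # combines skewness and kurtosis statistics

rng = np.random.default_rng(3)
normal_sample = rng.standard_normal(5000)
uniform_sample = rng.uniform(-1.0, 1.0, 5000)

_, p_normal = normaltest(normal_sample)    # large p-value: normality not rejected
_, p_uniform = normaltest(uniform_sample)  # tiny p-value: normality rejected

alpha = 1.0 - 0.99  # significance threshold for the default 0.99 confidence level
reject_uniform_sample = p_uniform < alpha
```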

9.4. References

[1] Rousseeuw, P. J.; Van Driessen, K. (1999), A fast algorithm for the Minimum Covariance Determinant estimator. Technometrics, Vol. 41, No. 3, p. 212–223.
[2] Kriegel, H.-P.; Kröger, P.; Schubert, E.; Zimek, A. (2009), LoOP: Local Outlier Probabilities. Proc. 18th ACM Conference on Information and Knowledge Management (CIKM).
[3] Mardia, K. V. (1970), Measures of multivariate skewness and kurtosis with applications. Biometrika, 57(3):519–530.
[4] Mardia, K. V. (1974), Applications of some measures of multivariate skewness and kurtosis for testing normality and robustness studies. Sankhya A, 36:115–128.
[5] Stevens, J. (1992), Applied Multivariate Statistics for Social Sciences. 2nd ed. New Jersey: Lawrence Erlbaum Associates Publishers. pp. 247–248.
[6] Friedman, J. H.; Rafsky, L. C. (1979), Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, Vol. 7, No. 4, p. 697–717.
[7] Liang, J.-J.; Fang, K.-T.; Hickernell, F. J.; Li, R. (2000), Testing multivariate uniformity and its applications. Mathematics of Computation, Vol. 70, No. 233, p. 337–355.