12.8. Statistics

12.8.1. Robust Covariance

This example explains benefits of using robust covariance estimation in calculate_statistics() and detect_outliers().

Robust covariance is computed as empirical covariance for subsample determined by Minimum Covariance Determinant Estimator. This algorithm allows for efficient outlier detection and leads to more stable and robust covariance estimates.

In the example, we consider a sample consisting of two parts (subsamples):

  • The first one is a sample of 100 points with mean \([0, 0]\) and standard deviation \([1, 4]\).
  • The second one is a sample of 20 points with mean \([4, 4]\) and standard deviation \([4, 1]\).

Both subsamples are generated by a normal distribution in \(R^{2}\). The first subsample is considered to contain inliers, while the second contains outliers. This example sample is generated right in the script using the following code:

  print('Sample generation...')
  n_features = 2
  n_inliers = 100
  n_outliers = 20

  rand_gen = np.random.RandomState(0)
  inliers_mean = np.array([0, 0])
  inliers_std = [1, 2]
  inliers = rand_gen.randn(n_inliers, n_features)
  inliers = inliers * inliers_std + inliers_mean

  outliers_mean = np.array([4, 4])
  outliers_std = [2, 1]
  outliers = rand_gen.randn(n_outliers, n_features)
  outliers = outliers * outliers_std + outliers_mean

  sample = np.vstack((inliers, outliers))

It is worth noting that the subsamples have nontrivial intersection area. Due to this, the outlier detection and correct covariance estimation for inliers becomes a complicated problem.

We can calculate elementary statistics for the full sample both with ordinary empirical covariance estimation (which is default) and with robust covariance. Covariance estimation mode is controlled by the covariance argument. The following code demonstrates the computation of elementary statistics and compares results for standard deviation and correlation in different modes.

  print("Computing statistiscs...")
  stat_object = stat.Analyzer()
  statistics = stat_object.calculate_statistics(sample)
  statistics_robust = stat_object.calculate_statistics(sample, covariance='robust')

  print('Standard deviation of inliers (no correlation) is %s' % inliers_std)
  print('Computed standard deviation in case of empirical covariance is %s' % statistics.std)
  print('Computed standard deviation in case of robust covariance is %s' % statistics_robust.std)
  print('Computed matrix of correlation coefficients in case of empirical covariance is %s' % statistics.correlation)
  print('Computed matrix of correlation coefficients in case of robust covariance is %s' % statistics_robust.correlation)

An example output is (actual results vary due to random sample generation):

Standard deviation of inliers (no correlation) is [1, 2]
Computed standard deviation in case of empirical covariance is [1.9411665976192736, 2.3120140588291305]
Computed standard deviation in case of robust covariance is [1.0221658859902678, 2.0696661352205399]
Computed matrix of correlation coefficients in case of empirical covariance is [[ 1., 0.42593518], [ 0.42593518, 1.]]
Computed matrix of correlation coefficients in case of robust covariance is [[ 1., 0.00954776], [ 0.00954776,  1.]]

Clearly, ordinary empirical estimates are far from expected (due to presence of outliers), while the robust method still managed to provide accurate variance and correlation estimates.

To address the problem with data outliers, we could first perform outlier detection using detect_outliers(). The following code tries to detect data points for which the probability of being an outlier is at least 90% (confidence is set to 0.9), using both robust and empirical methods:

  print('Detecting outliers...')
  robust_outlier_detection_result = stat_object.detect_outliers(sample, covariance='robust', confidence=0.9)
  robust_outliers_mask = robust_outlier_detection_result.outliers
  robust_inliers_mask = [not outlier for outlier in robust_outliers_mask]
  robust_detected_inliers = sample[np.where(robust_inliers_mask)]
  robust_detected_outliers = sample[np.where(robust_outliers_mask)]

  outlier_detection_result = stat_object.detect_outliers(sample, confidence=0.9)
  outliers_mask = outlier_detection_result.outliers
  inliers_mask = [not outlier for outlier in outliers_mask]
  detected_inliers = sample[np.where(inliers_mask)]
  detected_outliers = sample[np.where(outliers_mask)]

The inliers and outliers may then be segregated as shown above. When plotting the results, it is easy to see that the algorithm based on robust covariance is much more accurate in the case we consider:

../_images/inliers_outliers.png

To run the example, open a command prompt and execute:

python -m da.p7core.examples.stat.example_robust_covariance

You can also get the full code and run the example as a script: example_robust_covariance.py.