June 6, 2017

Approximation in pSeven vs. Open Algorithms

Not long ago, a group of DATADVANCE software developers and scientific advisors published an article on different aspects of using approximation in pSeven (GTApprox in pSeven Core). The full version of the article can be found here.

In this note we present the accuracy comparison from that article between the approximation algorithms used in GTApprox and some of the most popular Python libraries: Scikit-learn, XGBoost, and GPy.

We emphasize that there are a few caveats to this comparison:

  • First, these libraries are aimed at a technically advanced audience of data analysts, who are expected to select appropriate algorithms and tune their parameters themselves. In particular, Scikit-learn does not provide a single entry point wrapping multiple techniques the way GTApprox does. We therefore compare GTApprox, as a single algorithm, to multiple algorithms of Scikit-learn. We also select two different modes in each of XGBoost and GPy.
  • Second, the scope of Scikit-learn and XGBoost is somewhat different from that of GTApprox: they are not focused on regression problems and their engineering applications. In particular, their most powerful nonlinear regression methods appear to be tree ensembles (Random Forests and Gradient Boosted Trees), which produce piecewise constant approximations that are presumably not fully suitable for modeling continuous response functions. Keeping these points in mind, our comparison should otherwise be reasonably fair and informative.
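
The piecewise-constant behavior of tree ensembles mentioned above is easy to observe directly. Below is a minimal sketch using scikit-learn's RandomForestRegressor on a smooth 1D function of our own choosing (the data and settings are illustrative, not taken from the article):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# 20 training points sampled from a smooth 1D response function
X_train = np.sort(rng.uniform(0.0, 1.0, size=(20, 1)), axis=0)
y_train = np.sin(2 * np.pi * X_train).ravel()

forest = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Predict on a dense grid: the result is a step function, because each
# tree partitions the input space into cells with constant predictions,
# and the forest averages a finite number of such step functions.
X_grid = np.linspace(0.0, 1.0, 1000).reshape(-1, 1)
y_grid = forest.predict(X_grid)

# The 1000 grid predictions collapse to far fewer distinct values
print(len(np.unique(y_grid)), "distinct values over", len(y_grid), "grid points")
```

A smooth interpolating model (e.g. a Gaussian process) would instead produce a continuous curve through the same data, which is the distinction the caveat above is drawing.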

All the chosen techniques are used with default settings:

  • Scikit-learn methods for regression, both linear and nonlinear: Ridge Regression with cross-validation (SL_RidgeCV), Support Vector Regression (SL_SVR), Gaussian Processes (SL_GP), Kernel Ridge (SL_KR), and Random Forest Regression (SL_RFR).
  • XGBoost: with the gbtree booster (default, XGB) and with the gblinear booster (XGB_lin).
  • GPy: the GPRegression model (GPy) and, since some of our test sets are relatively large, the SparseGPRegression model (GPy_sparse).
  • GTApprox in its two meta-algorithm modes: the basic decision-tree-based algorithm (gtapprox) and the SmartSelection technique (gta_smart).
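
To make "default settings" concrete, here is a sketch of how the Scikit-learn techniques from the list above can be run out of the box and scored by RMSE. The toy dataset is ours, not one of the article's benchmarks, and GTApprox is omitted since it is part of the commercial pSeven Core package:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2  # smooth toy response function

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The five Scikit-learn techniques, all with default constructor settings
models = {
    "SL_RidgeCV": RidgeCV(),
    "SL_SVR": SVR(),
    "SL_GP": GaussianProcessRegressor(),
    "SL_KR": KernelRidge(),
    "SL_RFR": RandomForestRegressor(random_state=0),
}

errors = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    errors[name] = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name}: RMSE = {errors[name]:.3f}")
```

The article's actual comparison is run over a large benchmark suite rather than a single test set, but the per-method protocol is the same idea: fit with defaults, predict on held-out data, record the error.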

Accuracy profiles of different approximation algorithms

The resulting accuracy profiles are shown in the figure above. We find that the default GTApprox is much more accurate than the default implementations of techniques from the other libraries. As expected, gta_smart yields even better results than gtapprox.

By Dmitry Yartosky, Scientific Advisor, DATADVANCE