October 13, 2016

“Build. Validate. Explore.” - Part 1: SmartSelection in Predictive Modeling Toolkit

What is “SmartSelection”?

SmartSelection is an intelligent model training technology in pSeven that automatically selects an approximation technique and its options in order to obtain the most accurate surrogate model.

SmartSelection lets the user focus on the task at hand without delving into the details of approximation techniques and methods, by automating the trial-and-error process of surrogate model construction. Its automatic model training hides the complexity of the underlying machine learning algorithms behind a user-friendly interface.

There is a large set of approximation techniques under the hood, from simple linear regression, quadratic polynomials, and splines to Gaussian processes, the original High Dimensional Approximation (HDA) method based on neural networks, and gradient boosted regression trees. Every technique has its own tunable parameters and its own strengths and weaknesses, and no single technique is best for all possible problems and data sets, i.e. there is No Free Lunch.

The purpose of SmartSelection is to automate the selection of the model with the best predictive performance by exploring different techniques and optimizing their parameters to minimize the approximation error on cross-validation or on a holdout test set.
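The idea can be illustrated with a minimal sketch in plain Python. It is not pSeven's algorithm or API: the two candidate techniques (a least-squares line and a k-nearest-neighbour average), the parameter grid, the synthetic data, and all function names are assumptions chosen for brevity. The sketch scores every candidate configuration by its k-fold cross-validation error and keeps the one with the lowest error.

```python
import math
import random

def fit_linear(xs, ys):
    """Least-squares line y = a + b*x; returns a predictor function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return lambda x: a + b * x

def fit_knn(xs, ys, k=3):
    """k-nearest-neighbour average; k is the tunable parameter."""
    pts = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pts, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def cv_rmse(fit, xs, ys, folds=5):
    """Root-mean-square error estimated by k-fold cross-validation."""
    sq_err = 0.0
    for f in range(folds):
        train = [i for i in range(len(xs)) if i % folds != f]
        test = [i for i in range(len(xs)) if i % folds == f]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((model(xs[i]) - ys[i]) ** 2 for i in test)
    return math.sqrt(sq_err / len(xs))

def select_model(xs, ys):
    """Try every candidate configuration and keep the lowest CV error."""
    candidates = {"linear": fit_linear}
    for k in (1, 3, 5, 9):
        candidates[f"knn(k={k})"] = lambda a, b, k=k: fit_knn(a, b, k)
    scores = {name: cv_rmse(fit, xs, ys) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic one-dimensional data (assumed): a noisy sine curve.
random.seed(0)
xs = [random.uniform(0.0, 6.0) for _ in range(200)]
ys = [math.sin(x) + random.gauss(0.0, 0.05) for x in xs]
best_name, cv_scores = select_model(xs, ys)
print(best_name, round(cv_scores[best_name], 3))
```

On this nonlinear data the straight line scores worse than the nearest-neighbour candidates, so the search settles on one of the latter. A real selection procedure would search a far larger space of techniques and options than this fixed grid.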

Use case

Let's consider the simple “Static mixer optimization” example from the pSeven package. When an engineer wants to study a certain process, a design of experiments (DOE) is used to sample data. This data is then used as the training sample to create a surrogate model.

In this example, the DOE samples 200 points:

  • 4 inputs: ‘Flow temperature’, ‘Pressure drop’, ‘1st flow velocity’, ‘2nd flow velocity’
  • 2 outputs: ‘Nozzle angle’, ‘Nozzle diameter’
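For illustration, the input part of a training sample of this shape can be sketched in plain Python. The article does not state which DOE technique the example uses, so the Latin hypercube design below is an assumption, as are the input ranges and the function name `latin_hypercube`. The two outputs would come from running the simulation at each point and are not generated here.

```python
import random

def latin_hypercube(bounds, n_points, seed=0):
    """Return n_points rows; each input's range is split into n_points
    equal strata and every stratum is sampled exactly once."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        step = (high - low) / n_points
        column = [low + (i + rng.random()) * step for i in range(n_points)]
        rng.shuffle(column)  # decouple the strata order between inputs
        columns.append(column)
    return [list(row) for row in zip(*columns)]

# Hypothetical ranges for the four inputs (units and values are assumed).
input_bounds = {
    "Flow temperature": (300.0, 400.0),
    "Pressure drop": (0.1, 2.0),
    "1st flow velocity": (0.5, 5.0),
    "2nd flow velocity": (0.5, 5.0),
}
doe = latin_hypercube(list(input_bounds.values()), n_points=200)
print(len(doe), len(doe[0]))
```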

Input data properties

At minimum, a training sample is required. Additional data, if provided, may improve the quality of approximation.

High level hints

Three groups of high-level hints can be used to express domain knowledge, model requirements and time/quality constraints.

1. Domain knowledge about the data and the process being studied

Any additional prior knowledge narrows the search space of possible configurations, which reduces training time and may influence the predictive performance of the final model.

2. Requirements for the model properties

3. Time constraints and quality management: balancing the time/quality tradeoff

  • for this example, define acceptable quality as R² = 0.99 on cross-validation
  • limit the time for the selection process: set a nightly experiment
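These two constraints can be read as stopping criteria for a search loop. The sketch below is a minimal illustration under that assumption, not pSeven's implementation. `r_squared` uses the standard coefficient of determination, R² = 1 - SS_res / SS_tot, while `search_until`, its arguments, the eight-hour default standing in for a "nightly" run, and the toy candidates are all hypothetical.

```python
import time

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def search_until(candidates, score, target_r2=0.99, time_limit_s=8 * 3600):
    """Score candidates in order; stop when the target is met or time runs out.

    The 8-hour default is an assumed stand-in for a "nightly" experiment.
    """
    deadline = time.monotonic() + time_limit_s
    best, best_r2 = None, float("-inf")
    for cand in candidates:
        if time.monotonic() >= deadline:
            break  # time budget exhausted: return the best found so far
        r2 = score(cand)
        if r2 > best_r2:
            best, best_r2 = cand, r2
        if best_r2 >= target_r2:
            break  # acceptable quality reached: no need to search further
    return best, best_r2

# Toy usage: each "candidate" is a ready-made prediction vector.
y = [1.0, 2.0, 3.0, 4.0]
preds = {
    "rough": [1.5, 1.5, 3.5, 3.5],
    "close": [1.0, 2.1, 2.9, 4.0],
    "never tried": [1.0, 2.0, 3.0, 4.0],
}
best, r2 = search_until(preds, lambda name: r_squared(y, preds[name]))
print(best, round(r2, 4))
```

The search stops at the second candidate because it already meets the R² target, so the third one is never evaluated. This is how a quality threshold can save time compared with an exhaustive search.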

The user interface shows the declared hints in the form of tags.

The SmartSelection algorithm then starts the selection using the given knowledge, requirements, and time/quality tradeoff.

The quality of approximation can be measured in three different ways:

  1. Using Internal Validation
  2. Via splitting given training sample into train/test subsets
  3. Using additional holdout test sample
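The difference between the three options lies in which points are used for fitting and which for scoring. The sketch below shows this in plain Python. It assumes that Internal Validation refers to a cross-validation procedure. The straight-line model, the RMSE metric, the synthetic data, and all function names are illustrative assumptions rather than pSeven functionality.

```python
import math
import random

def fit_line(points):
    """Least-squares line through (x, y) pairs; returns a predictor."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx
    return lambda x: my + slope * (x - mx)

def rmse(model, points):
    """Root-mean-square error of a predictor on (x, y) pairs."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in points) / len(points))

def cross_validation_error(train, folds=5):
    """1. Every point is predicted by a model that did not see it."""
    sq = 0.0
    for f in range(folds):
        held_out = train[f::folds]
        rest = [p for i, p in enumerate(train) if i % folds != f]
        model = fit_line(rest)
        sq += sum((model(x) - y) ** 2 for x, y in held_out)
    return math.sqrt(sq / len(train))

def split_error(train, test_fraction=0.2):
    """2. Part of the training sample is set aside for testing."""
    n_test = int(len(train) * test_fraction)
    return rmse(fit_line(train[n_test:]), train[:n_test])

def holdout_error(train, holdout):
    """3. The whole training sample is used; a separate sample tests it."""
    return rmse(fit_line(train), holdout)

rng = random.Random(1)

def sample(n):
    """Synthetic data (assumed): y = 2x + 1 plus Gaussian noise."""
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1))
            for x in (rng.uniform(0.0, 10.0) for _ in range(n))]

train, holdout = sample(200), sample(50)
errors = {
    "cross-validation": cross_validation_error(train),
    "train/test split": split_error(train),
    "holdout sample": holdout_error(train, holdout),
}
print({name: round(err, 3) for name, err in errors.items()})
```

Only the third option fits the final model on the full training sample and still scores it on unseen points, which is why an additional holdout sample is useful when one is available.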

In the case of a vector (multidimensional) output, an optimal model is constructed for each output variable.

Manual mode vs. SmartSelection

For advanced users who want to get closer to the core approximation techniques, with all their knobs and switches, Manual mode is available. Compared with it, however, SmartSelection always gives similar or, in most cases, better approximation results.

Further use

After a surrogate model is built, it can be validated on new data and compared with other models using Model Validator, additionally smoothed, evaluated, and exported in different formats (C, Octave, FMI, etc.).

By Dennis Shilko, Senior Software Engineer, DATADVANCE