October 13, 2016

“Build. Validate. Explore.” - Part 1: SmartSelection in Predictive Modeling Toolkit

What is SmartSelection?

SmartSelection is an intelligent model training technology in pSeven that automatically selects an approximation technique and its options to obtain the most accurate approximation model.

SmartSelection lets the user focus on solving a particular task without delving into the details of approximation techniques and methods: it automates the trial-and-error process of building an approximation model. It features automatic model training technology that hides the complexity of the underlying machine learning algorithms behind a user-friendly interface.

There’s a large set of approximation techniques under the hood, from simple linear regression, quadratic polynomials and splines to Gaussian processes, the original High Dimensional Approximation (HDA) method based on neural networks, and gradient boosted regression trees. Every technique has its own tunable parameters, strengths and weaknesses, and no single technique is best for all possible problems and data sets, i.e. there’s No Free Lunch.

The purpose of SmartSelection is to automate the selection of the model with the best predictive performance: it explores different techniques and optimizes their parameters to minimize the approximation error on cross-validation or on a holdout test set.
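The idea of picking a technique by its cross-validation error can be illustrated with a small toy example. This is only a sketch in plain Python under stated assumptions, not pSeven's algorithm or API: the three candidate techniques, the helper names (`cv_rmse`, `select_best`) and the synthetic data are all hypothetical and chosen for brevity.

```python
import random
from statistics import mean

def fit_constant(xs, ys):
    """Baseline: always predict the mean of the training outputs."""
    m = mean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Ordinary least squares for a single input: y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return lambda x: a + b * x

def make_knn(k):
    """Factory for a k-nearest-neighbours regressor; k is its tunable option."""
    def fit(xs, ys):
        pairs = list(zip(xs, ys))
        def predict(x):
            nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
            return mean(y for _, y in nearest)
        return predict
    return fit

def cv_rmse(fit, xs, ys, folds=5):
    """Root-mean-square error estimated by k-fold cross-validation."""
    n = len(xs)
    sq_errors = []
    for f in range(folds):
        test_idx = set(range(f, n, folds))  # every folds-th point is held out
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        model = fit(train_x, train_y)
        sq_errors += [(model(xs[i]) - ys[i]) ** 2 for i in test_idx]
    return mean(sq_errors) ** 0.5

def select_best(candidates, xs, ys):
    """Score every candidate and return the name with the lowest CV error."""
    scores = {name: cv_rmse(fit, xs, ys) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores

# Synthetic nonlinear data (an assumption made for the demo).
random.seed(0)
xs = [i / 20 for i in range(100)]
ys = [x * x + random.gauss(0, 0.05) for x in xs]

candidates = {
    "constant": fit_constant,
    "linear": fit_linear,
    "knn(k=1)": make_knn(1),
    "knn(k=5)": make_knn(5),
}
best, scores = select_best(candidates, xs, ys)
print(best, scores)
```

On this quadratic toy data the nearest-neighbour candidates beat the linear and constant ones, which is the "no single technique is best" point in miniature: on a linear data set the ranking would change.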

Use Case

Let’s consider the simple “Static mixer optimization” example from the pSeven package. When an engineer wants to study a certain process, a design of experiments (DOE) is used to sample data. This data is then used as a training sample to create an approximation model.

In this example, the DOE samples 200 points:

4 inputs: ‘Flow temperature’, ‘Pressure drop’, ‘1st flow velocity’, ‘2nd flow velocity’
2 outputs: ‘Nozzle angle’, ‘Nozzle diameter’

Input Data Properties

A training sample is the minimum requirement. Additional data, if provided, may improve the quality of the approximation:

Hold-out test sample for validation (cross-validation is used by default)
Weights for input points of training sample
Output noise variance of training sample
Marks for categorical variables
Data filters for training and test samples, e.g. to remove outliers

High-Level Hints

Three groups of high-level hints can be used to express domain knowledge, model requirements and time/quality constraints.

1. Domain knowledge about the data and the process under study

Any additional prior knowledge narrows the search space of possible configurations, which reduces training time and may influence the predictive performance of the final model.

process is nonlinear or discontinuous
input data is noisy
there is a dependency or correlation between outputs

2. Requirements for the model properties

require the model to be a smooth (differentiable) approximation function
require the ability to evaluate the uncertainty of the predictions
predict NaN values in regions close to points where the training sample contained a NaN output (an invalid design point is marked as NaN)
require the model to fit the training sample exactly
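To illustrate how requirements like these can narrow the search space, here is a minimal sketch in Python. The catalogue, its property flags and the `narrow` helper are hypothetical: the flags are rough simplifications of how such generic methods usually behave, not a description of pSeven's internal technique catalogue or of how it processes hints.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Technique:
    name: str
    smooth: bool       # yields a differentiable approximation
    uncertainty: bool  # can estimate the uncertainty of predictions
    exact_fit: bool    # can reproduce the training points exactly

# Simplified, illustrative properties (assumptions for this sketch).
CATALOGUE = [
    Technique("linear regression", smooth=True, uncertainty=False, exact_fit=False),
    Technique("splines", smooth=True, uncertainty=False, exact_fit=True),
    Technique("Gaussian process", smooth=True, uncertainty=True, exact_fit=True),
    Technique("gradient boosted trees", smooth=False, uncertainty=False, exact_fit=False),
]

def narrow(catalogue, **required):
    """Keep only the techniques that satisfy every requirement set to True."""
    wanted = [key for key, value in required.items() if value]
    return [t for t in catalogue if all(getattr(t, key) for key in wanted)]

# Requiring a smooth model with uncertainty estimates leaves a single
# candidate in this toy catalogue, so there is far less to search through.
remaining = narrow(CATALOGUE, smooth=True, uncertainty=True)
print([t.name for t in remaining])
```

Each requirement acts as a filter applied before any training happens, which is why declaring hints up front can shorten the selection process.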

3. Time constraints and quality management: balance the time/quality tradeoff

for this example, define acceptable quality as R² = 0.99 on cross-validation
limit the time for the selection process, e.g. run it as a nightly experiment
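A time/quality tradeoff of this kind amounts to a search loop with two stopping rules. The sketch below, in plain Python, assumes a hypothetical `search` helper and a toy `evaluate` function; it only illustrates the stopping logic and is not how pSeven schedules its search.

```python
import time

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def search(configurations, evaluate, target_r2=0.99, time_limit_s=60.0):
    """Try configurations in order; stop at the target R^2 or when time is up."""
    deadline = time.monotonic() + time_limit_s
    best_config, best_score = None, float("-inf")
    for config in configurations:
        if time.monotonic() >= deadline:
            break  # time budget exhausted: return the best model so far
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
        if best_score >= target_r2:
            break  # acceptable quality reached: no need to keep searching
    return best_config, best_score

# Toy problem: the data follows y = x^2 and each "configuration" is the
# power p of a model y = x^p (an assumption made only for this demo).
xs = list(range(10))
y_true = [x ** 2 for x in xs]
evaluated = []

def evaluate(power):
    evaluated.append(power)
    return r_squared(y_true, [x ** power for x in xs])

best_power, best_r2 = search([1, 2, 3], evaluate, target_r2=0.99)
print(best_power, best_r2, evaluated)
```

Here the loop stops as soon as the second configuration reaches the target, so the third one is never trained. With a tighter time limit the loop would instead return whatever it had found by the deadline.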

The user interface shows the declared hints in the form of tags.

The SmartSelection algorithm then starts the selection using the given knowledge, requirements and time/quality tradeoff.

The quality of approximation can be measured in 3 different ways:

Using Internal Validation
By splitting the given training sample into train/test subsets
Using additional holdout test sample
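The first two options differ only in how the training sample is partitioned. The sketch below assumes that internal validation refers to cross-validation, as the data properties section suggests; the helper names `kfold_indices` and `split_indices` are hypothetical, and the 80/20 split ratio is an arbitrary choice for the demo.

```python
import random

def kfold_indices(n, folds):
    """Yield (train, test) index lists; every point is tested exactly once."""
    for f in range(folds):
        test = list(range(f, n, folds))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test

def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices once and cut off a single test subset."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

n = 200  # size of the DOE sample in the example above

# Cross-validation: 5 models are trained, each tested on a different fifth.
cv_folds = list(kfold_indices(n, 5))

# Single split: one model is trained on 160 points and tested on 40.
train_idx, test_idx = split_indices(n)
print(len(cv_folds), len(train_idx), len(test_idx))
```

With a separate holdout test sample no partitioning is needed at all: the model is trained on the full training sample and the error is measured on the extra data.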

In the case of vector (multidimensional) output, an optimal model is constructed for each output variable.
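Selecting a model per output means that different outputs of the same data set can end up with different techniques. A minimal sketch, assuming two hypothetical candidate techniques, a simple even/odd holdout split and made-up output names:

```python
from statistics import mean

def fit_linear(xs, ys):
    """Ordinary least squares for a single input: y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_nearest(xs, ys):
    """Nearest-neighbour model: return the output of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def holdout_rmse(fit, xs, ys):
    """Train on even-indexed points, measure RMSE on odd-indexed ones."""
    model = fit(xs[0::2], ys[0::2])
    return mean((model(x) - y) ** 2 for x, y in zip(xs[1::2], ys[1::2])) ** 0.5

def select_per_output(candidates, xs, outputs):
    """Pick the lowest-error candidate independently for every output column."""
    return {
        out_name: min(candidates, key=lambda name: holdout_rmse(candidates[name], xs, ys))
        for out_name, ys in outputs.items()
    }

xs = [float(i) for i in range(20)]
outputs = {
    "linear_output": [3.0 * x + 1.0 for x in xs],            # smooth trend
    "step_output": [0.0 if x < 10 else 1.0 for x in xs],     # discontinuous
}
candidates = {"linear": fit_linear, "nearest neighbour": fit_nearest}

chosen = select_per_output(candidates, xs, outputs)
print(chosen)
```

The smooth output is best served by the linear model, while the discontinuous one goes to the nearest-neighbour model, so a single shared technique would have been a compromise for at least one of them.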

Manual Mode vs. SmartSelection

For advanced users who want to get closer to the core approximation techniques, with all their knobs and switches, Manual mode is available. Compared with it, however, SmartSelection always gives similar or, in most cases, better approximation results.

Further Applications

After the approximation model is built, it can be validated on new data and compared with other models using Model Validator. It can also be smoothed, evaluated and exported in different formats (C, Octave, FMI etc.).

In “Build. Validate. Explore.” - Part 2 we will describe Model Validator, an interactive analysis tool that lets you estimate a model’s quality (i.e. its predictive performance) and compare different models. It lets you test models against reference data and find the most accurate one using error plots and statistics.

In “Build. Validate. Explore.” - Part 3 we’ll see how to “look inside the model” and explore its behaviour with an interactive visual tool called Model Explorer. Stay tuned!