4.1. Introduction

4.1.1. Overview

GTApprox is a software package for constructing approximation models based on a user-provided training data set; it provides quality assessment of the approximations and further analysis of the constructed models, including smoothing, Accuracy Evaluation, and more. GTApprox includes a set of approximation techniques that take into account different properties of the training data and the user's requirements, which can be set via the related GTApprox options.

We suppose that the user-provided training data set \(S\) consists of two samples: input and response. The input and response samples consist of \(d_{in}\)-dimensional vectors \(X\) and \(d_{out}\)-dimensional vectors \(Y\), respectively. \(d_{in}\) and \(d_{out}\) are called the input dimension and the output dimension.

The response of interest \(Y\) is supposed to be sampled from an unknown dependency \(f\): \(Y = f(X)\). Thus a single element of the training set \(S\) is a pair \((X_k, Y_k)\) with \(Y_k = f(X_k)\), and \(S = (X_k, Y_k)_{k=1}^{|S|}\), where \(|S|\) is the sample size.

The training set \(S\) is usually numerically represented by a \(|S|\times(d_{in}+d_{out})\) matrix \(\mathbf{XY}_{train}\). We can think of this matrix as having \(|S|\) rows, corresponding to different training vectors, and \(d_{in}+d_{out}\) columns, corresponding to different scalar components of these vectors (\(d_{in}\) components of the input and \(d_{out}\) of the output). This matrix naturally divides into two submatrices: \(\mathbf{X}_{train}\), corresponding to the DoE, and \(\mathbf{Y}_{train}\), corresponding to the output components of the training set.
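As an illustration, this matrix layout can be sketched in a few lines of NumPy (the sample values and variable names here are purely illustrative and not part of GTApprox):

```python
import numpy as np

# A toy training set with |S| = 4 points, d_in = 2 inputs and
# d_out = 1 output, stored as a single |S| x (d_in + d_out) matrix.
d_in, d_out = 2, 1
XY_train = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 1.5],
    [0.0, 0.5, 2.0],
    [0.5, 0.5, 2.5],
])

X_train = XY_train[:, :d_in]   # the DoE submatrix
Y_train = XY_train[:, d_in:]   # the output submatrix
```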

Based on the training set \(S\), GTApprox constructs an approximation model \(\hat{f}:R^{d_{in}}\to R^{d_{out}}\) (also known as a surrogate model, meta-model, or response surface) of the unknown dependency \(f\). The approximation model \(\hat{f}\) is the main result of GTApprox. It makes it possible to predict the responses of interest not only on the training set (for points \(X \in (X_k)_{k=1}^{|S|}\)), but also for points \(X\) not belonging to it. This important property of an approximation is called predictive power. In addition to predicting responses, the approximation model can predict the gradient of the model and other characteristics. The formal problem statement is given in the Problem Statement section.
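The notion of predictive power can be illustrated with a minimal sketch that is unrelated to the actual GTApprox techniques: a least-squares line fitted to a sample can be evaluated at points outside that sample.

```python
import numpy as np

# Conceptual sketch (not the GTApprox API): fit a simple linear
# surrogate f_hat to a training sample and evaluate it at a point
# that does not belong to the training set.
X_train = np.array([0.0, 1.0, 2.0, 3.0])
Y_train = 2.0 * X_train + 1.0            # samples of f(x) = 2x + 1

# Least-squares fit of f_hat(x) = a*x + b.
A = np.vstack([X_train, np.ones_like(X_train)]).T
(a, b), *_ = np.linalg.lstsq(A, Y_train, rcond=None)

def f_hat(x):
    return a * x + b

print(f_hat(1.5))   # prediction at a new point; here f(1.5) = 4
```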

Besides the training set, additional data may be provided by the user:

  • weights of points in the training set: GTApprox tries to fit the points \((X_k, Y_k)\) with greater weights better (see Sample Weighting);
  • noise levels of the responses \(Y\), which allow GTApprox to provide better approximation quality for noisy problems (see sections Noisy Problems and Data with Errorbars for details).

By taking this additional information into account, GTApprox provides more accurate approximations.

GTApprox provides several types of approximation models. Different approximation models may differ in accuracy, predictive power, construction time, capabilities for further analysis, and other properties. The type of approximation model may be selected either automatically, based on the properties of the training data and the user-defined options, or manually by the user (see General Usage).

The available approximation techniques are described in section Techniques.

These techniques have additional internal options for finer configuration of approximation model properties. See the detailed descriptions of these techniques and their options in the related sections.

In addition to the technique type and internal options, before constructing an approximation model the user may also set options related to special features of GTApprox:

  • Acceleration of approximation construction if it takes too much time. See Training Time and Accuracy Tradeoffs;
  • Cross-validation, which provides an estimate of the overall accuracy of the approximation model; this method of accuracy control helps prevent overfitting (the situation when the approximation quality is high on the training set but degrades considerably outside of it). See Model Validation;
  • Approximation can be constructed jointly for all scalar components of the response or separately for each scalar response component. Approximation can also preserve linear dependencies between response components. See Output Dependency Modes;
  • Exact Fit, which makes the constructed approximation pass through the points of the training set. Note that an interpolating approximation may not be the most accurate. See Exact Fit.

When constructing models, GTApprox can take advantage of shared-memory multiprocessing and of training on a remote host or an HPC cluster. See Multi-core Scalability and Using Clusters.

In addition to estimating responses and their gradients, the constructed approximation model can provide (depending on the options specified beforehand) some additional features:

  • Validation of the approximation model on a test set provided by the user. See Validation on test set;
  • Accuracy Evaluation, which estimates the accuracy of the approximation at a given point \(X\). See Evaluation of accuracy in given point;
  • Smoothing of the model. The user can smooth the model after training in order to decrease its variability (see section Model Smoothing). Smoothing affects the gradient of the function as well as the function itself;
  • Export, saving, and loading of the model. See Model Export.

The constructed approximation model also contains additional information about its errors on the training set, cross-validation results, and so on. See Model Details for details.

To simplify the modeling process, GTApprox also provides functionality for automatic selection of the most accurate technique for a given problem. For details, see section Smart Training.


4.1.2. Problem Statement

The main goal of GTApprox is to construct approximations, also known as surrogate models, meta-models, or response surfaces (see, e.g., [Forrester2008]), fitting the user-provided samples, also known as training data or training sets. A training set \(S\) is a collection of vectors, sometimes also called prototypes, representing an unknown numerical response function \(Y=f(X)\). Here, \(X\) is a \(d_{in}\)-dimensional vector, and \(Y\) is \(d_{out}\)-dimensional. In general, the dimensions \(d_{in}\) and \(d_{out}\) can be greater than 1.

We denote the total number of elements in the training set \(S\) — the size of the training set — by \(|S|\).

The training set \(S\) is usually numerically represented by a \(|S|\times(d_{in}+d_{out})\) matrix \(\mathbf{XY}_{train}\). We can think of this matrix as having \(|S|\) rows, corresponding to different observations, and \(d_{in}+d_{out}\) columns, corresponding to different components of these vectors (\(d_{in}\) components of the input and \(d_{out}\) of the output). This matrix naturally divides into two submatrices: \(\mathbf{X}_{train}\), corresponding to the DoE, and \(\mathbf{Y}_{train}\), corresponding to the output components of the training set. We will also denote a single observation as \((X_k,\,Y_k)\).

Given a training set \(S\), GTApprox constructs an approximation \(\hat{f}:R^{d_{in}}\to R^{d_{out}}\) to the response function \(f\). The tool attempts to find an optimal approximation: one providing a good fit to the training data, yet reasonably simple, so as to maintain the predictive power of the approximation.

The approximation \(\hat{f}\) produced by GTApprox is defined for all \(X\in R^{d_{in}}\), but its accuracy normally deteriorates away from the training set. In practice, the approximation is usually considered on a bounded subset \(D \subset R^{d_{in}}\) known as the design space.

GTApprox implements a number of ways to measure the accuracy of constructed models; please refer to the Model Validation section for details.

If the noise levels of the output values in the training set are available, these values can be provided as another input to GTApprox. In this case the algorithm works with the extended training set \((X_k, Y_k, \varepsilon_k)\), where \(\varepsilon_k \in R^{d_{out}}\) are the estimated noise variances of the output components at the point \(X_k\). Knowledge of the noise variances allows the algorithm to provide better approximation quality for noisy problems (see section Data with Errorbars for details).

As an alternative to specifying noise variances, one may provide a relative “weight” for each point (see section Sample Weighting for details). Weights do not have a strict physical meaning; the general rule is that points with greater weights are considered more important, and GTApprox spends more effort fitting them.
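The general idea behind both noise variances and point weights can be shown with a minimal weighted least-squares sketch (not the actual GTApprox algorithm): points with greater weights pull the fit more strongly, and for known noise variances \(\varepsilon_k\) a natural choice of weights is \(1/\varepsilon_k\).

```python
import numpy as np

# Conceptual sketch (not the GTApprox API): a weighted linear fit.
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([1.0, 3.1, 4.9, 7.4])       # noisy samples of y = 2x + 1
eps = np.array([0.01, 0.04, 0.04, 1.0])  # estimated noise variances
w = 1.0 / eps                            # greater weight = tighter fit there

# Weighted least squares: minimize sum_k w_k * (a*X_k + b - Y_k)^2,
# solved by rescaling the system with sqrt(w_k).
A = np.vstack([X, np.ones_like(X)]).T
sw = np.sqrt(w)
(a, b), *_ = np.linalg.lstsq(sw[:, None] * A, sw * Y, rcond=None)
```

The noisy last point gets a small weight, so the fitted coefficients stay close to the underlying \(a = 2\), \(b = 1\).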

If the approximation goes through the training points, i.e., \(\hat{f}(X_k) = Y_k\), it is referred to as interpolation. In general, approximations created by GTApprox are not interpolating, but the tool has an option enforcing this property; see section Exact Fit. Note that an interpolating approximation is not necessarily the most accurate model over the whole design space.
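The interpolation property itself is easy to illustrate with a sketch unrelated to the actual GTApprox techniques: a polynomial of degree \(|S|-1\) interpolates \(|S|\) points with distinct inputs.

```python
import numpy as np

# Conceptual sketch: an interpolating model satisfies
# f_hat(X_k) = Y_k at every training point.
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([1.0, 2.0, 0.0, 5.0])

coef = np.polyfit(X, Y, deg=len(X) - 1)   # degree-3 interpolant
f_hat = np.poly1d(coef)

# The interpolation property holds on the training set ...
assert np.allclose(f_hat(X), Y)
# ... but says nothing about accuracy between the training points.
```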

4.1.3. Sample Size Requirements

Training sample size requirements in GTApprox depend on the approximation technique and some training features. For example, the required minimum size may increase if you enable internal validation (see Internal Validation). Training in the dependent outputs mode (see Output Dependency Modes) also typically increases the required sample size, since more data is required to estimate output dependencies.

This section provides a summary of the sample size requirements for the various techniques with regard to the above training features. Note that training with small samples, whose size is close to the required minimum, is not recommended, as it decreases model quality and may even produce a degenerate model. For details on recommended sample sizes, see the technique descriptions in section Techniques.

All requirements in this section refer to the effective sample size and the effective dimensions of the sample inputs and outputs. These are the size and dimensions obtained after internal preprocessing of the training sample (removal of constants, duplicates, and so on — see Sample Cleanup for details).
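For illustration, the difference between the raw and effective sample can be sketched as follows (the actual preprocessing steps are described in Sample Cleanup; this is only a toy example):

```python
import numpy as np

# A raw input sample with one duplicate point and one constant column.
X_raw = np.array([
    [0.0, 1.0, 5.0],
    [0.0, 1.0, 5.0],   # duplicate point
    [1.0, 2.0, 5.0],
    [2.0, 3.0, 5.0],   # third input component is constant
])

X_unique = np.unique(X_raw, axis=0)      # drop duplicate rows
varying = np.ptp(X_unique, axis=0) > 0   # drop constant columns
X_eff = X_unique[:, varying]

N, d_in = X_raw.shape
N_eff, d_in_eff = X_eff.shape
print(N, d_in, "->", N_eff, d_in_eff)    # 4 3 -> 3 2
```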

General notations below are:

  • \(N\) is the raw sample size, \(\tilde{N}\) is the size after sample cleanup (the effective sample size).
  • \(\tilde{N}_{min}\) is the minimum requirement for the effective sample size. Most of the techniques have a strict \(\tilde{N}_{min}\) requirement, which also depends on the training settings (internal validation, dependent outputs training mode).
  • \(\tilde{N}_{max}\) is the maximum limit for the effective sample size. Most techniques have no hard \(\tilde{N}_{max}\) limit, although using large samples typically increases the training time.
  • \(\tilde{d}_{in}\) and \(\tilde{d}_{out}\) are the effective input and output dimensions. There are no hard limits for \(\tilde{d}_{in}\) and \(\tilde{d}_{out}\), but some of the techniques are not recommended for high-dimensional data.
  • \(N_{ss}\) is the number of data subsets used in internal validation (see Cross-validation procedure details).
  • \(\lceil K \rceil\) denotes the value of \(K\) rounded up to the nearest integer.

GP, HDA, HDAGP, SGP, SPLT

The GP, HDA, HDAGP, SGP, and SPLT techniques have the following common requirements for \(\tilde{N}_{min}\):

  • With internal validation disabled:
    • \(\tilde{N}_{min} = 2 \tilde{d}_{in} + 3\) in the independent outputs mode (default) and the partial linear dependency mode.
    • In the dependent outputs mode, \(\tilde{N}_{min} = 2 (\tilde{d}_{in} + \tilde{d}_{out}) + 1\).
  • With internal validation enabled, at least one more point is required, and \(\tilde{N}\) cannot be less than \(N_{ss}\):
    • \(\tilde{N}_{min} = \max (N_{ss}, 2 \tilde{d}_{in} + 4)\) in the independent outputs mode (default) and the partial linear dependency mode.
    • \(\tilde{N}_{min} = \max (N_{ss}, 2 (\tilde{d}_{in} + \tilde{d}_{out}) + 2)\) in the dependent outputs mode.

The HDA, SGP, and SPLT techniques support large samples and have no hard \(\tilde{N}_{max}\) limit. The HDA and SGP techniques can be recommended for \(\tilde{N} \gtrsim 1000\).

For the GP and HDAGP techniques, there is a hard limit of \(\tilde{N}_{max} = 4000\) regardless of the training features. For example, if you select the GP or HDAGP technique manually using the GTApprox/Technique option, and the effective sample size \(\tilde{N} > 4000\), training raises an InvalidOptionsError exception. Note that GTApprox does not apply the maximum size limit to the raw sample size \(N\) — so the GP and HDAGP techniques work normally, if \(N > 4000\) but \(\tilde{N} \leq 4000\) (for example, due to removing duplicates).

GBRT, RSM

For the GBRT and RSM techniques, \(\tilde{N}_{min}\) requirements are affected only by internal validation:

  • With internal validation disabled, \(\tilde{N}_{min} = 1\).
  • With internal validation enabled, \(\tilde{N}_{min} = N_{ss}\).

The GBRT and RSM techniques support large samples and have no hard \(\tilde{N}_{max}\) limit.

TA, iTA, TGP

The TA and TGP techniques require a certain input sample structure, and \(\tilde{N}_{min}\) depends on the number of input factors \(f\) (see Tensor Products of Approximations for details). Note that constant factors are excluded from \(f\): if all components of a factor are constant in the training sample, this factor does not count towards \(f\). Also, \(f\) must be greater than 0.

For TA and TGP, \(\tilde{N}_{min}\) requirements are further affected by internal validation:

  • With internal validation disabled, \(\tilde{N}_{min} = 2^f\).
  • With internal validation enabled, \(\tilde{N}_{min} = 2^{f-1}(N_{ss} + 1)\).
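These two formulas can be sketched as follows (the default n_ss value here is a placeholder, not an actual GTApprox default):

```python
def n_min_ta(f, internal_validation=False, n_ss=10):
    """Minimum effective sample size for the TA and TGP techniques,
    given the number of non-constant input factors f (f > 0).
    n_ss is the number of internal-validation subsets."""
    assert f > 0
    if internal_validation:
        return 2 ** (f - 1) * (n_ss + 1)
    return 2 ** f

print(n_min_ta(3))                            # 8
print(n_min_ta(3, internal_validation=True))  # 44
```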

The iTA technique also has sample structure requirements, although they are less strict (see Tensor Products of Approximations). These requirements impose \(\tilde{N}_{min}\) limits that cannot be expressed in a general form.

For iTA, there is a hard requirement for \(\tilde{N}\), but meeting it is not sufficient:

  • \(({\lg 2^{\tilde{d}_{in}}}/{\lg \tilde{N}}) - 1 < 0.5\) is necessary. As noted above, \(\tilde{N}_{min}\) also depends on the sample structure, so it may be higher than the minimum \(\tilde{N}\) that meets the hard requirement.
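The hard requirement can be checked as follows (a sketch; remember that passing this check does not guarantee that the sample structure is suitable for iTA):

```python
import math

def ita_size_ok(n_eff, d_in_eff):
    """Necessary (but not sufficient) iTA size condition:
    (lg 2^d / lg N) - 1 < 0.5."""
    return d_in_eff * math.log10(2.0) / math.log10(n_eff) - 1.0 < 0.5

print(ita_size_ok(100, 9))   # True:  9*lg(2)/lg(100) - 1 = 0.354...
print(ita_size_ok(10, 9))    # False: 9*lg(2)/lg(10) - 1 = 1.709...
```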

The TA, iTA, and TGP techniques are intended for large training samples and have no hard \(\tilde{N}_{max}\) limit. However, the iTA technique limits \(\tilde{d}_{in} \leq 19\); input dimension 20 and higher is not yet supported.

MoA

The MoA technique is intended for large training samples. It requires at least \(\tilde{N}_{min} = 2 (\tilde{d}_{in} + \tilde{d}_{out}) + 2\) training points regardless of training settings, although using MoA with such small samples is not recommended. On the other hand, MoA is one of the recommended techniques for large samples containing \(\sim 10^6\) points, and it has no hard \(\tilde{N}_{max}\) limit.

\(\tilde{N}_{min}\) for MoA also depends on clustering parameters and techniques used to train local models and may increase if you configure this technique manually — see Mixture of Approximators for details.

PLA

For the PLA technique, \(\tilde{N}_{min}\) requirements are affected only by internal validation:

  • With internal validation disabled, \(\tilde{N}_{min} = \tilde{d}_{in} + 1\).
  • With internal validation enabled, \(\tilde{N}_{min} = \max (N_{ss}, \tilde{d}_{in} + 2)\).

Large sample support in PLA depends heavily on the effective input dimension \(\tilde{d}_{in}\), because the computational complexity of this technique increases rapidly with input dimension. If \(\tilde{d}_{in} = 1\), there is no \(\tilde{N}_{max}\) limit. For \(\tilde{d}_{in} > 1\), effective size \(\tilde{N}\) must satisfy the following requirement:

  • Regardless of the training settings, \(\frac{1}{2}(\tilde{d}_{in}+1) \lg \tilde{N} < 9\).

The above requirements determine the following actual \(\tilde{N}_{min}\) and \(\tilde{N}_{max}\) for the PLA technique depending on the effective input dimension \(\tilde{d}_{in}\):

\(\tilde{d}_{in}\) \(\tilde{N}_{min}\) (no IV) \(\tilde{N}_{max}\)
1 2 any
2 3 1000000
3 4 31622
4 5 3981
5 6 1000
6 7 372
7 8 177
8 9 100
9 10 63
10 11 43
11 12 31
12 13 24
13 14 19
14 15 15
>14 not supported not supported

Effectively, the PLA technique does not support \(\tilde{d}_{in}\) higher than 14, and cannot perform internal validation if \(\tilde{d}_{in}\) is close to this maximum.
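The table above can be reproduced from the two rules stated earlier: \(\tilde{N}_{min} = \tilde{d}_{in} + 1\) without internal validation, and for \(\tilde{d}_{in} > 1\) the hard limit \(\frac{1}{2}(\tilde{d}_{in}+1) \lg \tilde{N} < 9\), which gives \(\tilde{N}_{max} = \lfloor 10^{18/(\tilde{d}_{in}+1)} \rfloor\). A short sketch:

```python
import math

# Reproduce the PLA sample size table from the two rules above.
for d in range(1, 16):
    n_min = d + 1                       # no internal validation
    # Hard limit (1/2)(d+1) lg N < 9 applies only for d > 1;
    # the small epsilon guards against floating-point round-off.
    n_max = None if d == 1 else math.floor(10 ** (18.0 / (d + 1)) + 1e-9)
    if n_max is not None and n_min > n_max:
        print(d, "not supported")       # N_min would exceed N_max
    else:
        print(d, n_min, "any" if n_max is None else n_max)
```

For example, \(\tilde{d}_{in} = 15\) gives \(\tilde{N}_{max} = 13 < \tilde{N}_{min} = 16\), which is why dimensions above 14 are not supported.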

TBL

The TBL technique does not have any sample size requirements as it simply stores the training sample after preprocessing and, when evaluating the model, performs a table lookup in the stored sample.