ApproxBuilder

Tag: Modeling

ApproxBuilder trains an approximation model using the data received at the x_sample and f_sample input ports as the training sample. The model can be output to the model port and saved to disk. After successfully training a model, the block also outputs a human-readable model summary to info.

Specific additional data can be sent to the following optional input ports:

  • output_noise_variance — noise variance for sample outputs. See Data with errorbars for details.
  • weights — point weights (relative importance measure) in the training sample. See Sample Weighting for details.

ApproxBuilder can also train a model incrementally (this feature is supported only by the GBRT technique). If you select an initial model in the block's configuration, the block uses the input data to improve this model, then outputs the updated model or saves it to disk (as a new file). For more details, see section Incremental Training in Gradient Boosted Regression Trees.

See also

GTApprox guide
The guide to the Generic Tool for Approximation (GTApprox) — the pSeven Core approximation component used by ApproxBuilder to train approximation models.

Sections

Options


GTApprox/Accelerator

Five-position switch to control the trade-off between speed and accuracy.

Value:integer in range \([1, 5]\)
Default:1

This option controls training time by changing some internal technique parameters. Possible values are from 1 (low speed, highest quality) to 5 (high speed, lower quality).

For the GBRT and HDA techniques, GTApprox/Accelerator also changes values of some public technique-specific options (the dependent options). User changes to dependent options always override settings made by GTApprox/Accelerator: if you set both GTApprox/Accelerator and some dependent option, GTApprox will use your value of this dependent option, not the value automatically set by GTApprox/Accelerator.

Changed in version 5.1: GTApprox/GBRTMaxDepth and GTApprox/GBRTNumberOfTrees added to dependent options.

Dependent GBRT options are GTApprox/GBRTMaxDepth and GTApprox/GBRTNumberOfTrees. GTApprox/Accelerator sets them as follows:

GTApprox/Accelerator         1     2     3     4     5
GTApprox/GBRTMaxDepth        10    10    10    6     6
GTApprox/GBRTNumberOfTrees   500   400   300   200   100
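
For illustration of the override rule, here is a minimal sketch; it assumes the pSeven Core Python interface (da.p7core.gtapprox) with options passed as a dict to Builder.build(), and uses hypothetical toy data. In a pSeven workflow the same options are set in the ApproxBuilder block configuration.

    # Minimal sketch: an explicitly set dependent option overrides the value
    # implied by GTApprox/Accelerator (assumes da.p7core.gtapprox is available).
    import numpy as np
    from da.p7core import gtapprox

    x = np.random.rand(200, 3)                    # toy training inputs
    y = np.sum(x**2, axis=1).reshape(-1, 1)       # toy training outputs

    model = gtapprox.Builder().build(x, y, options={
        "GTApprox/Technique": "GBRT",
        "GTApprox/Accelerator": 4,      # alone, this would set GBRTMaxDepth to 6
        "GTApprox/GBRTMaxDepth": 12,    # explicit user value: 12 is used instead
    })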

Settings made by GTApprox/Accelerator for HDA depend on input sample size. There are two cases:

  • Ordinary sample size (the sample contains less than 10 000 points).
  • Big sample size (the sample contains 10 000 points or more).

In the case of ordinary sized sample, dependent options are GTApprox/HDAFDGauss, GTApprox/HDAMultiMax, GTApprox/HDAMultiMin and GTApprox/HDAPhaseCount. GTApprox/Accelerator sets them as follows:

GTApprox/Accelerator     1    2    3    4    5
GTApprox/HDAFDGauss      1    1    0    0    0
GTApprox/HDAMultiMax     10   6    4    4    2
GTApprox/HDAMultiMin     5    4    2    2    1
GTApprox/HDAPhaseCount   10   7    5    1    1

In the case of big sized sample, dependent options are GTApprox/HDAFDGauss, GTApprox/HDAHessianReduction, GTApprox/HDAMultiMax, GTApprox/HDAMultiMin, GTApprox/HDAPhaseCount, GTApprox/HDAPMax, and GTApprox/HDAPMin. GTApprox/Accelerator sets them as follows:

GTApprox/Accelerator           1     2     3     4     5
GTApprox/HDAFDGauss            0     0     0     0     0
GTApprox/HDAHessianReduction   0.3   0.3   0     0     0
GTApprox/HDAMultiMax           3     2     2     2     1
GTApprox/HDAMultiMin           1     1     1     1     1
GTApprox/HDAPhaseCount         5     5     3     1     1
GTApprox/HDAPMax               150   150   150   150   150
GTApprox/HDAPMin               150   150   150   150   150

GTApprox/AccuracyEvaluation

Require accuracy evaluation.

Value:Boolean
Default:off

If this option is on (True), then, in addition to the approximation, the constructed model contains a function that estimates the approximation error over the design space.

Read Accuracy Evaluation chapter for details.
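
A minimal sketch of using such a model; it assumes the pSeven Core Python interface (da.p7core.gtapprox), where the accuracy evaluation function is exposed as the model's calc_ae method, and uses hypothetical toy data. In a pSeven workflow the option is set in the ApproxBuilder block configuration.

    # Minimal sketch: train with accuracy evaluation enabled, then query both
    # the approximation and its pointwise error estimate.
    import numpy as np
    from da.p7core import gtapprox

    x = np.random.rand(100, 2)
    y = np.sin(5 * x[:, 0] * x[:, 1]).reshape(-1, 1)

    model = gtapprox.Builder().build(x, y, options={
        "GTApprox/AccuracyEvaluation": True,
    })

    test = np.random.rand(5, 2)
    print(model.calc(test))     # approximation values
    print(model.calc_ae(test))  # accuracy evaluation (error estimates)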

GTApprox/CategoricalVariables

Specifies discrete (categorical) input variables.

Value:a list of zero-based indexes of input variables
Default:[] (no discrete variables)

New in version 6.3.

Treat the listed variables as discrete (categorical). These variables can take only predefined values (levels). For every discrete variable, each unique value from the training sample becomes a level. Note that a discrete variable never takes a value not found in the training sample, and a model with discrete variables cannot be evaluated at values of discrete variables that were not found in the training sample.

Note

Discrete variables are supported only by the RSM, HDA, GP, SGP, HDAGP, TA, iTA, TGP, and PLA techniques.

See section Categorical Variables for more details.
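
A minimal configuration sketch; it assumes the pSeven Core Python interface (da.p7core.gtapprox) with options passed as a dict to Builder.build(), and uses hypothetical toy data with a three-level categorical input.

    # Minimal sketch: treat the third input (zero-based index 2) as categorical;
    # its levels are taken from the training sample.
    import numpy as np
    from da.p7core import gtapprox

    x = np.random.rand(100, 3)
    x[:, 2] = np.random.choice([0.0, 1.0, 2.0], size=100)   # categorical levels
    y = (x[:, 0] + x[:, 1] * (1.0 + x[:, 2])).reshape(-1, 1)

    model = gtapprox.Builder().build(x, y, options={
        "GTApprox/CategoricalVariables": [2],
    })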

Note that if you specify tensor factors for the TA and TGP techniques manually, you can select categorical variables with GTApprox/TensorFactors instead of GTApprox/CategoricalVariables. Using both options at the same time is not recommended since they can conflict; see section Categorical Variables for TA, iTA and TGP techniques for more details.

GTApprox/Componentwise

Perform componentwise approximation of the output.

Value:Boolean or "Auto"
Default:"Auto"

Deprecated since version 6.3: kept for compatibility, use GTApprox/DependentOutputs instead.

Prior to 6.3, this option was used to enable componentwise approximation which was disabled by default.

Since 6.3, componentwise approximation is enabled by default and can be disabled with GTApprox/DependentOutputs. If GTApprox/Componentwise is default ("Auto"), GTApprox/DependentOutputs takes priority. If GTApprox/Componentwise is not default while GTApprox/DependentOutputs is "Auto", GTApprox/Componentwise takes priority. In the case of a conflict (both options explicitly set on or off), GTApprox raises an error (the conflict is ignored if the output is 1-dimensional).

GTApprox/DependentOutputs

Assume that training outputs are dependent and do not use componentwise approximation.

Value:Boolean or "Auto"
Default:"Auto"

New in version 6.3.

In case of multidimensional output there are two possible approaches:

  1. a separate approximator is used for each output component (componentwise approximation), or
  2. approximation is performed for all output components simultaneously using the same approximator.

GTApprox/DependentOutputs switches between these two modes. Componentwise approximation is enabled by default: with the "Auto" setting it stays on unless the deprecated GTApprox/Componentwise option explicitly disables it. Note that GTApprox/Componentwise is kept for version compatibility only and should not be used since 6.3.

For more details on componentwise approximation, see section Componentwise Approximation.

GTApprox/Deterministic

Controls the behavior of randomized initialization algorithms in certain techniques.

Value:Boolean
Default:on

New in version 5.0.

Several model training techniques in GTApprox feature randomized initialization of their internal parameters. These techniques include:

  • GBRT, which can select random subsamples of the full training set when creating regression trees (see section Stochastic Boosting).
  • HDA and HDAGP, which use randomized initialization of approximator parameters.
  • MoA, if the approximation technique for its local models is set to HDA, HDAGP or SGP using GTApprox/MoATechnique, or the same selection is done automatically.
  • SGP, which uses randomized selection of base points when approximating the full covariance matrix of the points from the training sample (Nystrom method).
  • TA, if for some of its factors the HDA technique is specified manually or is selected automatically (see GTApprox/TensorFactors).

The determinacy of randomized techniques can be controlled in the following way:

  • If GTApprox/Deterministic is on (deterministic training mode, default), a fixed seed is used in all randomized initialization algorithms. The seed is set by GTApprox/Seed. This makes the technique behavior reproducible: for example, two models trained in deterministic mode with the same data, the same GTApprox/Seed, and the same other settings will be exactly the same, since the training algorithm is initialized with the same parameters.
  • Alternatively, if GTApprox/Deterministic is off (non-deterministic training mode), a new seed is generated internally every time you train a model. As a result, models trained with randomized techniques may slightly differ even if all settings and training samples are the same. In this case, GTApprox/Seed is ignored. The generated seed that was actually used for initialization can be found in model info, so later the training run can still be reproduced exactly by switching to the deterministic mode and setting GTApprox/Seed to this value.

In the case of randomized techniques, repeated non-deterministic training runs may be used to try to obtain a more accurate approximation, because the results will be slightly different. On the contrary, deterministic techniques always produce exactly the same model given the same training data and settings, and are not affected by GTApprox/Deterministic and GTApprox/Seed.
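
A minimal reproducibility sketch; it assumes the pSeven Core Python interface (da.p7core.gtapprox) with options passed as a dict to Builder.build(), and uses hypothetical toy data.

    # Minimal sketch: in deterministic mode, two HDA models trained with the
    # same data, seed and settings should give identical predictions.
    import numpy as np
    from da.p7core import gtapprox

    x = np.random.rand(300, 4)
    y = np.sin(x).sum(axis=1).reshape(-1, 1)

    opts = {
        "GTApprox/Technique": "HDA",
        "GTApprox/Deterministic": True,   # the default, shown for clarity
        "GTApprox/Seed": 100,
    }
    m1 = gtapprox.Builder().build(x, y, options=opts)
    m2 = gtapprox.Builder().build(x, y, options=opts)

    test = np.random.rand(10, 4)
    print(np.allclose(m1.calc(test), m2.calc(test)))   # expected: True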

GTApprox/EnableTensorFeature

Enable automatic selection of the TA and iTA techniques.

Value:Boolean
Default:on

New in version 1.9.2: allows the automatic selection of the iTA technique. Previously affected only the TA technique selection.

If on (True), makes TA and iTA techniques available for auto selection. If off (False), neither TA nor iTA will ever be selected automatically based on decision tree. Has no effect if any approximation technique is selected manually using the GTApprox/Technique option.

Note

This option does not enable the automatic selection of the TGP technique.

GTApprox/ExactFitRequired

Require the model to fit sample data exactly.

Value:Boolean
Default:off

If this option is on, the model fits the points of the training sample exactly — that is, the model response at the point which is found in the input part of the training sample is equal to the value found in the response part of the training sample for this point.

If GTApprox/ExactFitRequired is off then no fitting condition is imposed, and the approximation can be either fitting or non-fitting depending on the training data. Typical example: if GTApprox finds that the sample is noisy, it does not create an exact-fitting model to avoid overtraining.

Changed in version 4.2: the iTA technique no longer ignores this option.

Read Exact Fit section for details.

GTApprox/GBRTColsampleRatio

Column subsample ratio.

Works only for Gradient Boosted Regression Trees technique.

Value:floating point number in range \((0, 1]\)
Default:1.0

New in version 5.1.

The GBRT technique uses random subsamples of the full training set when training weak estimators (regression trees). GTApprox/GBRTColsampleRatio specifies the fraction of columns (input features) to be included in a subsample: for example, setting it to 0.5 will randomly select half of the input features to form a subsample.

For more details, see section Stochastic Boosting.

GTApprox/GBRTMaxDepth

Maximum regression tree depth.

Works only for Gradient Boosted Regression Trees technique.

Value:non-negative integer
Default:0 (auto)

New in version 5.1.

Sets the maximum depth allowed for each regression tree (GBRT weak estimator). Greater depth results in a more complex final model.

Default (0) means that the tree depth will be set by GTApprox/Accelerator as follows:

GTApprox/Accelerator    1    2    3    4    5
GTApprox/GBRTMaxDepth   10   10   10   6    6

For example, if both options are default (GTApprox/GBRTMaxDepth is 0 and GTApprox/Accelerator is 1), actual depth setting is 10.

For more details, see section Model Complexity.

GTApprox/GBRTMinChildWeight

Minimum total weight of points in a regression tree leaf.

Works only for Gradient Boosted Regression Trees technique.

Value:non-negative floating point number
Default:1

New in version 5.1.

The GBRT technique stops growing a branch of a regression tree if the total weight of points assigned to a leaf becomes less than GTApprox/GBRTMinChildWeight. If the sample is not weighted, this is the same as limiting the number of points in a leaf. Zero minimum weight means that no such limit is imposed.

For more details, see section Leaf Weighting.

GTApprox/GBRTMinLossReduction

Minimum significant reduction of loss function.

Works only for Gradient Boosted Regression Trees technique.

Value:non-negative floating point number
Default:0

New in version 5.1.

The GBRT technique stops growing a branch of a regression tree if the reduction of loss function (model’s mean square error over the training set) becomes less than GTApprox/GBRTMinLossReduction.

For more details, see section Model Complexity.

GTApprox/GBRTNumberOfTrees

The number of regression trees in the model.

Works only for Gradient Boosted Regression Trees technique.

Value:non-negative integer
Default:0 (auto)

New in version 5.1.

Sets the number of weak estimators (regression trees) in a GBRT model, the same as the number of gradient boosting stages. Greater number results in a more complex final model.

Changed in version 5.2: 0 is allowed and means auto setting.

Default (0) means that the number of trees will be set by GTApprox/Accelerator as follows:

GTApprox/Accelerator         1     2     3     4     5
GTApprox/GBRTNumberOfTrees   500   400   300   200   100

For example, if both options are default (GTApprox/GBRTNumberOfTrees is 0 and GTApprox/Accelerator is 1), the actual number of trees is 500.

For more details, see section Model Complexity.

Note that in incremental training the auto (0) number of trees is not affected by GTApprox/Accelerator but depends on the number of trees in the initial model and training sample sizes — see Incremental Training for details.

GTApprox/GBRTShrinkage

Shrinkage step, or learning rate

Works only for Gradient Boosted Regression Trees technique.

Value:floating point number in range \((0, 1]\)
Default:0.3

New in version 5.1.

GBRT scales each weak estimator by a factor of GTApprox/GBRTShrinkage; smaller step values provide a kind of regularization.

For more details, see section Shrinkage.

GTApprox/GBRTSubsampleRatio

Row subsample ratio

Works only for Gradient Boosted Regression Trees technique.

Value:floating point number in range \((0, 1]\)
Default:1.0

New in version 5.1.

The GBRT technique uses random subsamples of the full training set when training weak estimators (regression trees). GTApprox/GBRTSubsampleRatio specifies the fraction of rows (sample points) to be included in a subsample: for example, setting it to 0.5 will randomly select half of the points to form a subsample.

For more details, see section Stochastic Boosting.
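
A minimal configuration sketch combining the stochastic boosting options; it assumes the pSeven Core Python interface (da.p7core.gtapprox) with options passed as a dict to Builder.build(), and uses hypothetical toy data.

    # Minimal sketch: GBRT with 70% of rows and 50% of input features
    # randomly selected for each regression tree, and a reduced learning rate.
    import numpy as np
    from da.p7core import gtapprox

    x = np.random.rand(500, 6)
    y = (x[:, 0] * x[:, 1] + np.sin(x[:, 2])).reshape(-1, 1)

    model = gtapprox.Builder().build(x, y, options={
        "GTApprox/Technique": "GBRT",
        "GTApprox/GBRTSubsampleRatio": 0.7,   # rows per tree
        "GTApprox/GBRTColsampleRatio": 0.5,   # input features per tree
        "GTApprox/GBRTShrinkage": 0.1,        # smaller step, stronger regularization
    })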

GTApprox/GPInteractionCardinality

Allowed orders of additive covariance function.

Works for Gaussian Processes, Sparse Gaussian Process and High Dimensional Approximation combined with Gaussian Processes techniques.

Value:list of unique unsigned integers in range \([1, dim(X)]\) each
Default:[] (equivalent to [1, n], \(n = dim(X)\))

New in version 1.10.3.

This option takes effect only when using the additive covariance function (GTApprox/GPType is set to "Additive"), otherwise it is ignored. In particular, the TGP technique always ignores this option since its covariance function is always "Wlp".

The additive covariance function is a sum of products of one-dimensional covariance functions, where each additive component (a summand) depends on a subset of initial input variables. GTApprox/GPInteractionCardinality defines the degree of interaction between input variables by specifying allowed subset sizes, which are in fact the allowed values of covariance function order.

All values in the list must be unique, and none of them can be greater than the number of input components excluding constant inputs (the effective dimension of the input part of the training sample).

Consider an \(n\)-dimensional \(X\) sample with \(m\) variable components and \(n-m\) constant components (sample matrix columns). Valid GTApprox/GPInteractionCardinality settings then would be:

  • [1, n]: simplified syntax, implicitly converts to [1, m].
  • [1, 2, ... m-1, m, m+1, ... k], where \(m < k \le n\): treated as a consecutive list of interactions up to cardinality k, implicitly converts to [1, 2, ... m-1, m]. Note that in this case all values from 1 to m have to be included in the list, otherwise it is considered invalid.
  • [i1, i2, ... ik], where \(i_j \le m\): valid list of interaction cardinalities, no conversion needed.
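
A minimal configuration sketch; it assumes the pSeven Core Python interface (da.p7core.gtapprox) with options passed as a dict to Builder.build(), and uses hypothetical toy data with no constant inputs.

    # Minimal sketch: additive GP covariance with summands of order 1 and 2 only.
    import numpy as np
    from da.p7core import gtapprox

    x = np.random.rand(120, 4)                        # 4 variable inputs
    y = (x[:, 0] + x[:, 1] * x[:, 2]).reshape(-1, 1)

    model = gtapprox.Builder().build(x, y, options={
        "GTApprox/Technique": "GP",
        "GTApprox/GPType": "Additive",                # required for this option
        "GTApprox/GPInteractionCardinality": [1, 2],  # allowed interaction orders
    })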

GTApprox/GPLearningMode

Give priority to either model accuracy or robustness

Works for Gaussian Processes, Sparse Gaussian Process, High Dimensional Approximation combined with Gaussian Processes and Tensored Gaussian Processes techniques.

Value:"Accurate" or "Robust"
Default:"Accurate"

New in version 1.9.6.

By default, the Gaussian Processes technique creates accurate models which may, however, lack robustness. To create a more robust model, set GTApprox/GPLearningMode to "Robust". The cost is a certain decrease in model accuracy.

GTApprox/GPLinearTrend

Deprecated since version 3.2: kept for compatibility only, use GTApprox/GPTrendType instead.

Since version 3.2 this option is deprecated in favor of the more advanced GTApprox/GPTrendType option, which allows selecting a linear trend, a quadratic trend, or no trend.

GTApprox/GPMeanValue

Specifies the model output mean values

Works for Gaussian Processes, Sparse Gaussian Process, High Dimensional Approximation combined with Gaussian Processes and Tensored Gaussian Processes techniques.

Value:list of floating point numbers
Default:[] (automatic estimate)

Model output mean values are essential for constructing a GP approximation. These values may be defined by the user or estimated from the given sample (the bigger and more representative the sample, the better the estimate of the model output mean values). Misspecification of the output mean values decreases approximation accuracy: the larger the error in the output mean values, the worse the final approximation model. If left default (empty list), the model output mean values are estimated from the given sample.

Option value is a list of floating point numbers. This list should either be empty or contain a number of elements equal to the output dimensionality.

GTApprox/GPPower

The value of p in the p-norm used to measure the distance between input vectors

Works for Gaussian Processes, Sparse Gaussian Process, High Dimensional Approximation combined with Gaussian Processes and Tensored Gaussian Processes techniques.

Value:floating point number in range \([1, 2]\)
Default:2.0

The main component of Gaussian Processes based regression is the covariance function, which measures the similarity between two input points. The covariance between two input points uses the p-norm of the difference between their coordinates. The case p = 2 corresponds to the usual Gaussian covariance function (better suited for modeling smooth functions), and the case p = 1 corresponds to the Laplacian covariance function (better suited for modeling non-smooth functions).

For the GP technique, this option takes effect only if GTApprox/GPType is "Wlp". However, the TGP technique is always affected by GTApprox/GPPower, since its covariance type is always "Wlp", regardless of the GTApprox/GPType setting.

GTApprox/GPTrendType

Specifies the trend type.

Works for Gaussian Processes, Sparse Gaussian Process, High Dimensional Approximation combined with Gaussian Processes and Tensored Gaussian Processes techniques.

Value:"None", "Linear", "Quadratic", or "Auto"
Default:"Auto"

New in version 3.2.

This option allows taking into account specific (linear or quadratic) behavior of the modeled dependency by selecting the type of trend to use.

  • "None" — no trend.
  • "Linear" — linear trend.
  • "Quadratic" — polynomial trend with constant, linear and pure quadratic terms (no interaction terms, no feature selection).
  • "Auto" — automatic selection, defaults to no trend unless GTApprox/GPLinearTrend is on (provides compatibility with the deprecated GTApprox/GPLinearTrend option).

GTApprox/GPType

Specify covariance function for the GP technique

Works for Gaussian Processes, Sparse Gaussian Process and High Dimensional Approximation combined with Gaussian Processes techniques.

Value:"Additive", "Mahalanobis", or "Wlp"
Default:"Wlp"

New in version 1.10.3: additive covariance function.

Specifies the covariance function used in Gaussian processes. Available modes:

  • "Additive": summarized coordinate-wise products of 1-dimensional Gaussian covariance functions. With this setting, GTApprox/GPInteractionCardinality may be used to set the degree of interaction between input variables.
  • "Mahalanobis": squared exponential covariance function with Mahalanobis distance.
  • "Wlp": widely-used exponential Gaussian covariance function with weighted \(L_p\) distance.

If "Additive" is set, but the \(X\) sample is 1-dimensional, then the additive covariance function is implicitly replaced with the ordinary covariance function ("Wlp"), and the GTApprox/GPInteractionCardinality option value is ignored.

Note

The TGP technique ignores this option and always uses "Wlp".

GTApprox/Heteroscedastic

Treat input sample as a sample containing heteroscedastic noise.

Value:Boolean or "Auto"
Default:"Auto"

New in version 1.9.0.

If this option is on (True), the builder assumes that heteroscedastic noise is present in the input sample. The default value ("Auto") currently means that the option is off.

This option has certain limitations:

  • It is valid for GP and HDAGP techniques only. For other techniques the value is ignored (treated as always off).
  • Heteroscedasticity is incompatible with covariance functions other than "Wlp": if GTApprox/Heteroscedastic is True and GTApprox/GPType is not "Wlp", an exception is thrown.
  • If noise variance is given, the GTApprox/Heteroscedastic option is ignored and non-variational GP (or HDAGP) technique is used.

See the Heteroscedastic data section for details.

GTApprox/HDAFDGauss

Include Gaussian functions in the functional dictionary used to construct approximations

Works for High Dimensional Approximation and High Dimensional Approximation combined with Gaussian Processes techniques.

Value:"No" or "Ordinary"
Default:"Ordinary"

To construct an approximation, a linear expansion over functions from a special functional dictionary is used. This option controls whether Gaussian functions are included in this dictionary.

In general, using Gaussian functions as building blocks for the approximation can significantly increase accuracy, especially when the approximated function is bell-shaped. However, it may also significantly increase training time.

GTApprox/HDAFDLinear

Include linear functions in the functional dictionary used to construct approximations

Works for High Dimensional Approximation and High Dimensional Approximation combined with Gaussian Processes techniques.

Value:"No" or "Ordinary"
Default:"Ordinary"

To construct an approximation, a linear expansion over functions from a special functional dictionary is used. This option controls whether linear functions are included in this dictionary.

In general, using linear functions as building blocks for the approximation can increase accuracy, especially when the approximated function has a significant linear component. However, it may also increase training time.

GTApprox/HDAFDSigmoid

Include sigmoid functions in the functional dictionary used to construct approximations

Works for High Dimensional Approximation and High Dimensional Approximation combined with Gaussian Processes techniques.

Value:"No" or "Ordinary"
Default:"Ordinary"

To construct an approximation, a linear expansion over functions from a special functional dictionary is used. This option controls whether sigmoid-like functions are included in this dictionary.

In general, using sigmoid-like functions as building blocks for the approximation can increase accuracy, especially when the approximated function has square-like or discontinuity regions. However, it may also significantly increase training time.

GTApprox/HDAHessianReduction

Maximum proportion of data used in evaluating the Hessian matrix

Works for High Dimensional Approximation and High Dimensional Approximation combined with Gaussian Processes techniques.

Value:floating point number in range \([0, 1]\)
Default:0.0

New in version 1.6.1.

This option limits the number of data points used for Hessian estimation (part of the high-precision algorithm). If the value is 0, the whole set of points is used for Hessian estimation; if the value is in the range \((0, 1]\), only a fraction of the set (no larger than GTApprox/HDAHessianReduction of the whole set) is used. The reduction applies only to samples bigger than 1250 points (if the number of points is smaller than 1250, this option is ignored and the Hessian is estimated using the whole training sample).

Note

In some cases, the high-precision algorithm can be disabled automatically, regardless of the GTApprox/HDAHessianReduction value. This happens if:

  1. \((dim(X) + 1) \cdot p \ge 4000\), where dim(X) is the dimension of the input vector X and p is the total number of basis functions, or
  2. \(dim(X) \ge 25\), where dim(X) is the dimension of the input vector X, or
  3. there are no sufficient computational resources to use the high precision algorithm.

GTApprox/HDAMultiMax

Maximum number of basic approximators constructed during one approximation phase.

Works for High Dimensional Approximation and High Dimensional Approximation combined with Gaussian Processes techniques.

Value:integer in range \([\)GTApprox/HDAMultiMin\(, 1000]\)
Default:10

This option specifies the maximum number of basic approximators constructed during one approximation phase. The option value is a positive integer which must be greater than or equal to the value of GTApprox/HDAMultiMin. This option only sets an upper limit on the number of basic approximators and does not require the limit to be reached: the approximation algorithm stops constructing basic approximators as soon as construction of a subsequent basic approximator does not increase accuracy. In general, the bigger the value of GTApprox/HDAMultiMax, the more accurate the constructed approximator. However, increasing the value may significantly increase training time and/or lead to overtraining in some cases.

GTApprox/HDAMultiMin

Minimum number of basic approximators constructed during one approximation phase.

Works for High Dimensional Approximation and High Dimensional Approximation combined with Gaussian Processes techniques.

Value:integer in range \([1,\) GTApprox/HDAMultiMax\(]\)
Default:5

This option specifies the minimum number of basic approximators constructed during one approximation phase. The option value is a positive integer which must be less than or equal to the value of GTApprox/HDAMultiMax. In general, the bigger the value of GTApprox/HDAMultiMin, the more accurate the constructed approximator. However, increasing the value may significantly increase training time and/or lead to overtraining in some cases.

GTApprox/HDAPhaseCount

Maximum number of approximation phases.

Works for High Dimensional Approximation and High Dimensional Approximation combined with Gaussian Processes techniques.

Value:integer in range \([1, 50]\)
Default:10

This option specifies the maximum possible number of approximation phases. It only sets an upper limit and does not require the limit to be reached: the approximation algorithm stops performing new phases as soon as a subsequent approximation phase does not increase accuracy. In general, the more approximation phases, the more accurate the approximator. However, increasing the maximum number of approximation phases may significantly increase training time and/or lead to overtraining in some cases.

GTApprox/HDAPMax

Maximum allowed approximator complexity.

Works for High Dimensional Approximation and High Dimensional Approximation combined with Gaussian Processes techniques.

Value:integer in range \([\)GTApprox/HDAPMin\(, 5000]\)
Default:150

This option specifies the maximum allowed complexity of the approximator. Its value must be greater than or equal to the value of the GTApprox/HDAPMin option. The approximation algorithm selects the approximator with optimal complexity pOpt from the range \([\)GTApprox/HDAPMin, GTApprox/HDAPMax\(]\). Optimality here means that, depending on the complexity of the approximated function behavior and the size of the available training sample, the constructed approximator with complexity pOpt fits this function in the best possible way compared to other approximators with complexity in this range. Thus the GTApprox/HDAPMax value should be big enough to allow selecting the approximator complexity most appropriate for the considered problem. Note, however, that increasing the GTApprox/HDAPMax value may significantly increase training time and/or lead to overtraining in some cases.

GTApprox/HDAPMin

Minimum allowed approximator complexity.

Works for High Dimensional Approximation and High Dimensional Approximation combined with Gaussian Processes techniques.

Value:integer in range \([0,\) GTApprox/HDAPMax\(]\)
Default:0

This option specifies the minimum allowed complexity of the approximator. Its value must be less than or equal to the value of the GTApprox/HDAPMax option. The approximation algorithm selects the approximator with optimal complexity pOpt from the range \([\)GTApprox/HDAPMin, GTApprox/HDAPMax\(]\). Optimality here means that, depending on the complexity of the approximated function behavior and the size of the available training sample, the constructed approximator with complexity pOpt fits this function in the best possible way compared to other approximators with complexity in this range. Thus the GTApprox/HDAPMin value should not be too big, so that the approximator complexity most appropriate for the considered problem can be selected. Note that increasing the GTApprox/HDAPMin value may significantly increase training time and/or lead to overtraining in some cases.

GTApprox/InputNanMode

Specifies how to handle non-numeric values in the input part of the training sample.

Value:"raise", "ignore"
Default:"raise"

New in version 6.8.

GTApprox cannot obtain any information from non-numeric (NaN or infinity) values of variables. This option controls its behavior when such values are encountered. Default ("raise") means to raise an exception and cancel training; "ignore" means to exclude data points with non-numeric values from the sample and continue training.
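
A minimal sketch of the "ignore" behavior; it assumes the pSeven Core Python interface (da.p7core.gtapprox) with options passed as a dict to Builder.build(), and uses hypothetical toy data with one corrupted input row.

    # Minimal sketch: with "ignore", the point whose input contains NaN is
    # silently excluded from the training sample.
    import numpy as np
    from da.p7core import gtapprox

    x = np.random.rand(50, 2)
    y = (x[:, 0] + x[:, 1]).reshape(-1, 1)
    x[3, 0] = np.nan                  # corrupt one input value

    model = gtapprox.Builder().build(x, y, options={
        "GTApprox/InputNanMode": "ignore",   # the default "raise" would cancel training
    })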

GTApprox/InputsTolerance

Specifies up to which tolerance each input variable would be rounded.

Value:list of length \(dim(X)\) of floating point numbers
Default:[]

New in version 6.3.

If default, the option has no effect. Otherwise each input variable in the training sample is rounded to the specified tolerance. Note that this may cause some points to merge.

See section Sample Cleanup for details.

GTApprox/InternalValidation

Enable or disable internal validation.

Value:Boolean
Default:off

If this option is on (True) then, in addition to the approximation, the constructed model contains a table of cross validation errors of different types, which may serve as a measure of accuracy of approximation.

See Model Validation chapter for details.

GTApprox/IVDeterministic

Controls the behavior of the pseudorandom algorithm selecting data subsets in cross validation.

Works only if GTApprox/InternalValidation is on.

Value:Boolean
Default:on

New in version 5.0.

Cross validation involves partitioning the training sample into a number of subsets (defined by GTApprox/IVSubsetCount) and randomized combination of these subsets for each training (validation) session. Since the algorithm that combines subsets is pseudorandom, its behavior can be controlled in the following way:

  • If GTApprox/IVDeterministic is on (deterministic cross validation mode, default), a fixed seed is used in the combination algorithm. The seed is set by GTApprox/IVSeed. This makes cross validation reproducible: a different combination is selected for each session, but if you repeat a cross validation run, each session selects the same combination as in the first run.
  • Alternatively, if GTApprox/IVDeterministic is off (non-deterministic cross validation mode), a new seed is generated internally for every run, so cross validation results may slightly differ. In this case, GTApprox/IVSeed is ignored. The generated seed that was actually used in cross validation can be found in model info, so results can still be reproduced exactly by switching to the deterministic mode and setting GTApprox/IVSeed to this value.

The final model is never affected by GTApprox/IVDeterministic because it is always trained using the full sample.

GTApprox/IVSavePredictions

Save model values calculated during internal validation.

Works only if GTApprox/InternalValidation is on.

Value:Boolean or "Auto"
Default:"Auto"

New in version 2.0rc2.

If this option is on (True), internal validation information, in addition to error values, also contains raw validation data: model values calculated during internal validation, as well as validation inputs and outputs.

GTApprox/IVSeed

Fixed seed used in the deterministic cross validation mode.

Works only if GTApprox/InternalValidation is on.

Value:positive integer
Default:15313

New in version 5.0.

Fixed seed for the pseudorandom algorithm that selects the combination of data subsets for each cross validation session. GTApprox/IVSeed has an effect only if GTApprox/IVDeterministic is on — see its description for more details.

GTApprox/IVSubsetCount

The number of cross validation subsets.

Works only if GTApprox/InternalValidation is on.

Value:0 (auto) or an integer in range \([2, |S|]\), where \(|S|\) is the size of the training set; also cannot be less than GTApprox/IVTrainingCount
Default:0 (auto)

The number of subsets (of approximately equal size) into which the training set is divided for the cross validation.

If left default, the number of cross validation subsets is selected automatically and will be equal to \(min(10, |S|)\), where \(|S|\) is the size of the training set.

GTApprox/IVTrainingCount

The number of training sessions in cross validation.

Works only if GTApprox/InternalValidation is on.

Value:integer in range \([1,\) GTApprox/IVSubsetCount\(]\), or 0 (auto)
Default:0 (auto)

The number of training sessions performed during the cross validation. Each training session includes the following steps:

  1. Select one of the cross validation subsets.
  2. Construct a complement of this subset which is the training sample excluding the selected subset.
  3. Build a model using this complement as a training sample (so the selected subset is excluded from builder input).
  4. Validate this model on the previously selected subset.

These steps are repeated until the number of sessions reaches GTApprox/IVTrainingCount.

If left default, the number of cross validation sessions is selected automatically and will be equal to:

\[\begin{split}N_{\rm tr} & = \bigg\lceil\min\Big(|S|, \frac{100}{|S|}\Big)\bigg\rceil,\end{split}\]

where \(|S|\) is the training sample size.
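
For example, by this formula a training sample of 40 points gives \(N_{\rm tr} = \lceil \min(40, 100/40) \rceil = \lceil 2.5 \rceil = 3\) sessions.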

GTApprox/LinearityRequired

Require the model to be linear.

Value:Boolean
Default:off

If this option is on (True), then the approximation is constructed as a linear function which fits the training data optimally. If the option is off (False), then no condition related to linearity is imposed on the approximation: it can be either linear or non-linear, depending on which fits the training data best.

Note

The TGP technique does not support linear models: if GTApprox/Technique is "TGP", GTApprox/LinearityRequired should be off.

GTApprox/LogLevel

Set minimum log level.

Value:"Debug", "Info", "Warn", "Error", "Fatal"
Default:"Info"

If this option is set, only messages with log level greater than or equal to the threshold are written to the log.

GTApprox/MaxExpectedMemory

Maximum expected amount of memory (in GB) allowed for model training.

Value:positive integer or 0 (no limit)
Default:0 (no limit)

New in version 6.4.

This option currently works for the GBRT technique only.

GTApprox/MaxExpectedMemory is intended to avoid the case when a long training process fails due to memory overflow, spending much time and giving no results. If GTApprox/MaxExpectedMemory is not default, GTApprox tries to estimate the expected memory usage at each stage of the training algorithm, and if the estimate exceeds the option value, the training is suspended: the process stops and returns a “partially trained” model which then can be trained incrementally (see Incremental Training).

To check whether the training stopped due to memory limit violation or for other reasons, you can test the value of model.info['ModelInfo']['Builder']['Details']['/GTApprox/MemoryOverflowDetected']. This key is present in model info only if GTApprox/MaxExpectedMemory was non-default when training the model; its value is True if GTApprox/MaxExpectedMemory stopped the training and False otherwise.
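
A minimal sketch of this check; it assumes the pSeven Core Python interface (da.p7core.gtapprox) with options passed as a dict to Builder.build(), and uses hypothetical toy data (the 2 GB limit here is arbitrary).

    # Minimal sketch: train GBRT with an expected memory limit and check
    # whether training was suspended due to the limit.
    import numpy as np
    from da.p7core import gtapprox

    x = np.random.rand(1000, 5)
    y = np.sum(x, axis=1).reshape(-1, 1)

    model = gtapprox.Builder().build(x, y, options={
        "GTApprox/Technique": "GBRT",
        "GTApprox/MaxExpectedMemory": 2,   # limit in GB
    })

    # The key is present because GTApprox/MaxExpectedMemory was non-default.
    details = model.info["ModelInfo"]["Builder"]["Details"]
    if details["/GTApprox/MemoryOverflowDetected"]:
        print("Training was suspended by the memory limit; continue it incrementally.")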

Note that when componentwise training for models with multidimensional output is enabled (default, see GTApprox/DependentOutputs), each of the component models gets its own expected memory limit — which is GTApprox/MaxExpectedMemory divided by the output dimension. The flag described above will be set to True if any of the componentwise models is out of the limit.

With GTApprox/MaxExpectedMemory set, it is also possible that the training sample is so big that it can never be processed within the allowed amount of memory; in this case, the training does not start.

If GTApprox/MaxExpectedMemory is default (0, no limit) or training technique is not GBRT, then GTApprox does not try to prevent memory overflow.

GTApprox/MaxParallel

Set the maximum number of parallel threads to use when building a model.

Value:positive integer or 0 (auto)
Default:0 (auto)

New in version 5.0rc1.

GTApprox can run in parallel to speed up model training. This option sets the maximum number of threads the builder is allowed to create.

Changed in version 6.0: auto (0) sets the number of threads to 1 for small training samples.

The default setting (0) normally uses the value given by the OMP_NUM_THREADS environment variable, which by default is equal to the number of virtual processors, including hyperthreading CPUs. However, in the case of a small training sample, parallelization becomes inefficient and is disabled by allowing only one thread.

Note that non-default values always override OMP_NUM_THREADS.

GTApprox/MoACovarianceType

Type of covariance matrix to use in Gaussian Mixture Model.

Works only for Mixture of Approximators technique.

Value:"Full", "Tied", "Diag", "Spherical", "BIC".
Default:"BIC"

New in version 1.10.0.

Type of covariance matrix used to construct Gaussian Mixture Model.

  • "Full" - all covariance matrices are positive semidefinite and symmetric,
  • "Tied" - all covariance matrices are positive semidefinite, symmetric and equal,
  • "Diag" - all covariance matrices are diagonal,
  • "Spherical" - diagonal matrix with equal elements on its diagonal.
  • "BIC" - the type of covariance matrix is chosen according to Bayesian Information Criterion.

This option allows the user to control accuracy and training time. For example, if it is known that the design space consists of regions of regularity with a similar structure, it may be reasonable to use the "Tied" matrix type for the Gaussian Mixture Models. "Full" has the slowest training time, while "Diag" and "Spherical" have the fastest. In "BIC" mode, Gaussian Mixture Models are constructed for all types of covariance matrices, and the best one in the sense of the Bayesian Information Criterion (BIC) is chosen.

GTApprox/MoANumberOfClusters

Sets the number of design space clusters.

Works only for Mixture of Approximators technique.

Value:list of positive integers, or an empty list (auto)
Default:[] (auto)

New in version 1.10.0.

New in version 1.11.0: empty list is also a valid value which selects the number of clusters automatically.

If set, the effective number of clusters is selected from the list according to Bayesian Information Criterion (BIC). To fix the number of clusters, you may specify a list containing a single positive integer.

Auto setting ([]) generates a list of possible numbers based on the sample size \(N\) and the effective input dimension \(\tilde{p}\). The minimum number of points to form a cluster is \(K_{min} = 2\tilde{p} + 3\). The maximum number of clusters is then defined as \(C_{max} = min(max(1, \lfloor N / K_{min} \rfloor), 5)\) — so 5 is the maximum ever possible number of clusters selected automatically. A list of integers in range \([1, C_{max}]\) is generated, and the effective number of clusters is again selected from this list according to BIC.
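
For example, a sample of \(N = 200\) points with effective input dimension \(\tilde{p} = 3\) gives \(K_{min} = 2 \cdot 3 + 3 = 9\) and \(C_{max} = \min(\max(1, \lfloor 200 / 9 \rfloor), 5) = \min(22, 5) = 5\), so the generated list is [1, 2, 3, 4, 5].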

Note that unless \(N > 2 K_{min}\), the above method selects a single cluster (generated list is [1]), which in fact means that MoA is not applied: a single global approximation is constructed, using the technique selected by GTApprox/MoATechnique.

GTApprox/MoAPointsAssignment

Select the technique for assigning points to clusters.

Works only for Mixture of Approximators technique, see Design Space Decomposition.

Value:"Probability" or "Mahalanobis".
Default:"Probability"

New in version 1.10.0.

  • "Probability" corresponds to points assignment based on posterior probability.
  • "Mahalanobis" corresponds to points assignment based on Mahalanobis distance.

For the Mahalanobis distance based technique, the confidence value \(\alpha\) may be changed using the GTApprox/MoAPointsAssignmentConfidence option.

GTApprox/MoAPointsAssignmentConfidence

This option sets the confidence value for the points assignment technique based on Mahalanobis distance.

Works only for Mixture of Approximators technique, see Design Space Decomposition.

Value:floating point number in range \((0, 1)\).
Default:0.97

New in version 1.10.0.

This option controls the size of clusters: the greater the value, the greater the cluster size.

GTApprox/MoATechnique

This option specifies the approximation technique for local models.

Works only for Mixture of Approximators.

Value:"SPLT", "HDA", "GP", "HDAGP", "SGP", "TA", "iTA", "RSM", or "Auto".
Default:"Auto"

New in version 1.10.0.

This option selects the local approximation technique; the same technique is used for all local models.

GTApprox/MoATypeOfWeights

This option sets the type of weighting used for “gluing” local approximations.

Works only for Mixture of Approximators, see Calculating Model Output.

Value:"Probability" or "Sigmoid".
Default:"Probability"

New in version 1.10.0.

  • "Probability" corresponds to weights based on posterior probability.
  • "Sigmoid" corresponds to weights based on sigmoid function.

Sigmoid weighting can be fine-tuned with GTApprox/MoAWeightsConfidence.

GTApprox/MoAWeightsConfidence

This option sets confidence for sigmoid based weights.

Works only for Mixture of Approximators, see Calculating Model Output.

Value:floating point number in range \((0, 1)\); must be greater than GTApprox/MoAPointsAssignmentConfidence
Default:0.99

New in version 1.10.0.

This option controls the smoothness of the weights: the greater the value, the smoother the weights, which yields a smoother approximation.

GTApprox/OutputNanMode

Specifies how to handle non-numeric values in the output part of the training sample.

Value:"raise", "ignore", or "predict"
Default:"raise".

New in version 6.8.

By convention, NaN output values signify undefined function behavior. This option controls whether the model should try to predict undefined behavior. If set to "predict", NaN values in training sample outputs are accepted, and the model will return NaN values in regions close to those points for which the training sample contained NaN output values. Default ("raise") means that NaN output values are not accepted: GTApprox raises an exception and cancels training if they are found. "ignore" means that such points are excluded from the sample and training continues.

GTApprox/RSMCategoricalVariables

Specifies categorical variables.

Value:a list of zero-based indexes of input variables
Default:[] (no categorical variables)

Deprecated since version 6.3: kept for compatibility only, use GTApprox/CategoricalVariables instead.

Prior to 6.3 this option listed categorical variables for the RSM technique specifically. With the improved support for categorical variables added in 6.3, there is now the general GTApprox/CategoricalVariables option, which deprecates GTApprox/RSMCategoricalVariables and overrides it unless the two options conflict.

GTApprox/RSMElasticNet/L1_ratio

Specifies ratio between L1 and L2 regularization.

Works only for Response Surface Model technique.

Value:list of floats in range \([0, 1]\).
Default:[]

Each element of the list sets the trade-off between L1 and L2 regularization: 1 means L1 regularization only, while 0 means L2 regularization only. The best value among those given is chosen via a cross-validation procedure. If none is given (default), RSM with pure L1 regularization is constructed.
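
A minimal configuration sketch; it assumes the pSeven Core Python interface (da.p7core.gtapprox) with options passed as a dict to Builder.build(), and uses hypothetical toy data.

    # Minimal sketch: RSM with elastic net regularization; cross-validation
    # picks the best L1/L2 ratio from the candidate list.
    import numpy as np
    from da.p7core import gtapprox

    x = np.random.rand(150, 5)
    y = (2.0 * x[:, 0] - x[:, 3] + 0.5).reshape(-1, 1)

    model = gtapprox.Builder().build(x, y, options={
        "GTApprox/Technique": "RSM",
        "GTApprox/RSMFeatureSelection": "ElasticNet",
        "GTApprox/RSMElasticNet/L1_ratio": [0.25, 0.5, 0.75, 1.0],  # candidates
    })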

GTApprox/RSMFeatureSelection

Specifies the regularization and term selection procedures.

Works only for Response Surface Model technique.

Value:"LS", "RidgeLS", "MultipleRidgeLS", "ElasticNet", or "StepwiseFit"
Default:"RidgeLS"

The technique to use for regularization and term selection:

  • "LS" — ordinary least squares (no regularization, no term selection).
  • "RidgeLS" — least squares with Tikhonov regularization (no term selection).
  • "MultipleRidgeLS" — multiple ridge regression that also filters non-important terms.
  • "ElasticNet" — linear combination of L1 and L2 regularizations.
  • "StepwiseFit" — ordinary least squares regression with stepwise inclusion/exclusion for term selection.

GTApprox/RSMMapping

Specifies mapping type for data pre-processing.

Works only for Response Surface Model technique.

Value:"None", "MapStd" or "MapMinMax"
Default:"MapStd"

The technique to use for data pre-processing:

  • "None" - no data pre-processing.
  • "MapStd" - linear mapping of standard deviation for each variable to \([-1, 1]\) range.
  • "MapMinMax" - linear mapping of values for each variable to \([-1, 1]\) range.

GTApprox/RSMStepwiseFit/inmodel

Selects the starting model for stepwise-fit regression.

Works only for Response Surface Model technique, see GTApprox/RSMFeatureSelection.

Value:"IncludeAll", "ExcludeAll"
Default:"IncludeAll"

This option specifies the terms initially included in the model when stepwise-fit regression is used.

  • "IncludeAll" starts with a full model (all terms included).
  • "ExcludeAll" assumes none of the terms are included at the starting step.

Depending on the terms included in the initial model and the order in which terms are moved in and out, the method may build different models from the same set of potential terms.

GTApprox/RSMStepwiseFit/penter

Specifies p-value of inclusion for stepwise-fit regression.

Works only for Response Surface Model technique.

Value:floating point number in range \((0,\) GTApprox/RSMStepwiseFit/premove\(]\)
Default:0.05

Option value is the maximum p-value of F-test for a term to be added into the model. Generally, the higher the value, the more terms are included into the final model.

GTApprox/RSMStepwiseFit/premove

Specifies p-value of exclusion for stepwise-fit regression.

Works only for Response Surface Model technique.

Value:floating point number in range \([\)GTApprox/RSMStepwiseFit/penter\(, 1)\)
Default:0.10

Option value is the minimum p-value of F-test for a term to be removed from the model. Generally, the higher the value, the more terms are included into the final model.

GTApprox/RSMType

Specifies the type of response surface model.

Works only for Response Surface Model technique.

Value:"Linear", "Interaction", "Quadratic", or "PureQuadratic"
Default:"Linear"

Changed in version 6.8: default is "Linear" (was "PureQuadratic")

This option restricts the type of terms that may be included into the regression model.

  • "Linear" — only constant and linear terms may be included.
  • "Interaction" — constant, linear, and interaction terms may be included.
  • "Quadratic" — constant, linear, interaction, and quadratic terms may be included.
  • "PureQuadratic" — only constant, linear, and quadratic terms may be included (interaction terms are excluded).

GTApprox/Seed

Fixed seed used in the deterministic training mode.

Value:positive integer
Default:15313

New in version 5.0.

In the deterministic training mode, GTApprox/Seed sets the seed for randomized initialization algorithms in certain techniques. See GTApprox/Deterministic for more details.

GTApprox/SGPNumberOfBasePoints

The number of base points used to approximate the full covariance matrix of the points from the training sample.

Works only for Sparse Gaussian Process technique.

Value:integer in range \([1, 2^{31}-2]\)
Default:1000

Base points (a subset of regressors) are selected randomly among points from the training sample and used for the reduced-rank approximation of the full covariance matrix of the training points. The reduced-rank approximation is done using the Nystrom method for the selected subset of regressors. Note that if the value of this option is greater than the dataset size, then the GP technique is used instead of SGP.

GTApprox/SPLTContinuity

Required approximation smoothness.

Works only for 1D Splines with tension technique.

Value:"C1" or "C2"
Default:"C2"

If this option value is "C2" (default), then the approximation curve is required to have continuous second derivative. If it is "C1", only the first derivative is required to be continuous.

GTApprox/StoreTrainingSample

Save a copy of training data with the model.

Value:Boolean or "Auto"
Default:"Auto"

New in version 6.6.

If on, the trained model stores a copy of the training sample. If off, the stored sample is an empty list. The "Auto" setting currently defaults to off.

Note that in case of GBRT incremental training (see Incremental Training), GTApprox/StoreTrainingSample saves only the last (most recent) training sample on each training iteration.

GTApprox/TADiscreteVariables

Specifies discrete input variables.

Value:a list of zero-based indexes of input variables
Default:[] (no discrete variables)

Deprecated since version 6.3: kept for compatibility only, use GTApprox/CategoricalVariables instead.

Prior to 6.3 this option specified discrete input variables for the TA technique specifically. With the improved support for categorical variables added in 6.3, there is now the general GTApprox/CategoricalVariables option, which deprecates GTApprox/TADiscreteVariables and overrides it unless the two options conflict.

GTApprox/TALinearBSPLExtrapolation

Use linear extrapolation for BSPL factors.

Works for Tensor Products of Approximations and Incomplete Tensor Products of Approximations techniques.

Value:Boolean or "Auto"
Default:"Auto"

New in version 1.9.4.

This option switches the extrapolation type for BSPL factors to linear. By default, BSPL factors extrapolate to a constant. If GTApprox/TALinearBSPLExtrapolation is True, extrapolation is linear within the range specified by the GTApprox/TALinearBSPLExtrapolationRange option and falls back to constant outside this range.

  • True: use linear extrapolation in the range specified by GTApprox/TALinearBSPLExtrapolationRange.
  • False: do not use linear extrapolation (always use constant extrapolation).
  • "Auto": defaults to False.

This option affects only the Tensor Products of Approximations (including Incomplete Tensor Products of Approximations) models that contain BSPL factors. It does not affect non-BSPL factors at all, and if a Tensor Products of Approximations model is built using only non-BSPL factors, this option is ignored.

GTApprox/TALinearBSPLExtrapolationRange

Sets linear BSPL extrapolation range.

Works for Tensor Products of Approximations and Incomplete Tensor Products of Approximations techniques.

Value:floating point number in range \((0, \infty)\)
Default:1.0

New in version 1.9.4.

Sets the range in which the BSPL factors extrapolation will be linear (see GTApprox/TALinearBSPLExtrapolation), relative to the variable range of this factor in the training sample. This setting “expands” the sample range: let \(x_{min}\) and \(x_{max}\) be the minimum and maximum value of a variable found in the sample (BSPL factors are always 1-dimensional), then the extrapolation range is \((x_{max} - x_{min}) \cdot (1 + 2r)\), where \(r\) is the GTApprox/TALinearBSPLExtrapolationRange option value (the range is expanded by \((x_{max} - x_{min}) \cdot r\) on each bound).
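
For example, if a BSPL factor variable spans \([0, 10]\) in the training sample and \(r = 1.0\) (default), the linear extrapolation range has length \(10 \cdot (1 + 2) = 30\) and covers \([-10, 20]\); outside it, extrapolation falls back to constant.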

This option affects only the Tensor Products of Approximations (including Incomplete Tensor Products of Approximations) models that contain BSPL factors, and only if GTApprox/TALinearBSPLExtrapolation is set to True. It does not affect non-BSPL factors at all, and if a Tensor Approximation model is built using only non-BSPL factors, this option is ignored.

GTApprox/TAModelReductionRatio

Sets the ratio the model complexity should be reduced by.

Works for Tensor Products of Approximations and Incomplete Tensor Products of Approximations techniques.

Value:floating point number in range \([1, \infty)\) or 0 (auto)
Default:0 (auto)

New in version 6.2.

Sets the ratio of the complexity (number of basis functions) of the default TA or iTA model to the desired complexity (for a detailed description see Model Complexity Reduction). For example, if this option is set to 2, the number of basis functions in the model will be 2 times less than in the default model. This option affects only TA models with BSPL factors; all other factors ignore it.

Using this option slightly increases model size but reduces memory consumption during model evaluation and the size of the model exported to C and Octave. The model accuracy in most cases decreases.

Note that the model complexity has a lower bound, which means the reduction ratio has an upper bound; so the actual reduction ratio can be smaller than the value of GTApprox/TAModelReductionRatio.

Default setting (0) means that no reduction is performed.

Setting GTApprox/TAModelReductionRatio greater than 1 does not guarantee exact fit, so this option is not compatible with GTApprox/ExactFitRequired set to True.

Note that this option is not compatible with GTApprox/TAReducedBSPLModel, because both options reduce model complexity but use different algorithms to do so.

GTApprox/TAReducedBSPLModel

Deprecated since version 6.2: kept for compatibility only, use GTApprox/TAModelReductionRatio instead.

Since version 6.2 this option is deprecated in favor of the more advanced GTApprox/TAModelReductionRatio option, which allows setting the desired complexity of the final model.

GTApprox/Technique

Specify the approximation algorithm to use.

Value:"RSM", "SPLT", "HDA", "GP", "SGP", "HDAGP", "TA", "iTA", "TGP", "MoA", "GBRT", "PLA", "TBL" or "Auto"
Default:"Auto"

New in version 1.9.2: added the incomplete Tensor Approximation technique.

New in version 1.10.0: added the Mixture of Approximators technique.

New in version 3.0rc1: added the Tensor Gaussian Processes technique.

New in version 5.1: added the Gradient Boosted Regression Trees technique.

New in version 6.3: added the Piecewise Linear Approximation technique.

New in version 6.8: added the Table Function technique.

Changed in version 6.8: removed the deprecated Linear Regression (LR) technique. This technique is no longer supported; instead, use RSM with GTApprox/RSMType set to "Linear".

This option allows the user to explicitly specify the algorithm used for approximation. Its default value is "Auto", meaning that the tool automatically determines and uses the best algorithm (except TGP and GBRT, which are never selected automatically, and TA and iTA, which are by default excluded from automatic selection; see GTApprox/EnableTensorFeature). To select a technique manually, set this option to one of the values listed above.

Sample size requirements taking effect when the approximation technique is selected manually are described in section Sample Size Requirements.

Note

Smart training of GBRT technique can be time consuming even in case of small training samples. Details on smart training can be found in section Smart Training.

GTApprox/TensorFactors

Describes tensor factors to use in the Tensor Approximation technique.

Value:factorization vector in JSON format (see description)
Default:[] (automatic factorization)

This option allows the user to specify a custom factorization of the input when the TA technique is used. It can also be used with TGP, but in this case it cannot change factor techniques; it can only specify discrete variables. iTA and other techniques ignore this option completely.

Note

The incomplete tensor approximation (iTA) technique ignores factorization specified by GTApprox/TensorFactors because it always uses 1-dimensional BSPL factors. The tensor Gaussian processes (TGP) technique applies factorization, but in this case the option value cannot include technique labels (see below). The only valid label for TGP is "DV", but it is better to use the GTApprox/CategoricalVariables option instead.

Option value is a list of user-defined tensor factors, each factor being a subset of input dataset components selected by the user. A factor is defined by a list of component indices and may optionally include, as the last element of the list, a label specifying the approximation technique to use. Indices are zero-based; lists are comma-separated and enclosed in square brackets.

For example, [[0, 2], [1, "BSPL"]] specifies factorization of a 3-dimensional input dataset into two factors. The first factor includes the first and third components, and the approximation technique for this factor will be selected automatically (no technique specified by user). The second factor includes the second component, and splines ("BSPL" label) will be used in the approximation of this factor.

Technique label must be the last element of the list defining a factor. Valid labels are:

  • "Auto" - automatic selection (same as no label).
  • "BSPL" - use 1-dimensional cubic smoothing splines.
  • "GP" - use Gaussian processes.
  • "SGP" - use Sparse Gaussian Process (added in 6.2).
  • "HDA" - use high dimensional approximation.
  • "LR" - linear approximation (linear regression).
  • "LR0" - constant approximation (zero order linear regression).
  • "DV" - discrete variable. The only valid label for the tensor Gaussian processes (TGP) technique. To specify discrete variable GTApprox/CategoricalVariables option can also be used. Interaction between these two options is described in section Categorical Variables for TA, iTA and TGP techniques.

Note

The splines technique ("BSPL") is available only for 1-dimensional factors.

Note

For factors using sparse Gaussian processes ("SGP"), the number of base points is specified by GTApprox/SGPNumberOfBasePoints. Note that this number is the same for all SGP factors. If a factor's cardinality is less than the number of base points, a warning is generated and the Gaussian processes ("GP") technique is used for this factor instead.

Warning

The "DV" label may conflict with the GTApprox/CategoricalVariables option — see its description for details. For this reason, when using the TGP technique, GTApprox/CategoricalVariables should be used instead of specifying discrete variables using the "DV" label.

The factorization has to be full (it has to include all components). If some component is not included in any of the factors, an exception is raised.
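
A minimal builder sketch; it assumes the pSeven Core Python interface (da.p7core.gtapprox) with options passed as a dict to Builder.build(), a hypothetical gridded (full factorial) training sample as required by TA, and the factorization passed as a JSON string.

    # Minimal sketch: 3-dimensional input split into two factors; splines are
    # forced for the second factor, the first factor's technique is auto-selected.
    import numpy as np
    from da.p7core import gtapprox

    # Full factorial sample: TA expects the training set to be a Cartesian
    # product of the factor designs.
    g0, g1, g2 = np.meshgrid(np.linspace(0.0, 1.0, 4),
                             np.linspace(0.0, 1.0, 4),
                             np.linspace(0.0, 1.0, 8), indexing="ij")
    x = np.column_stack([g0.ravel(), g1.ravel(), g2.ravel()])
    y = (np.sin(x[:, 0]) + x[:, 1] * x[:, 2]).reshape(-1, 1)

    model = gtapprox.Builder().build(x, y, options={
        "GTApprox/Technique": "TA",
        "GTApprox/TensorFactors": '[[0, 2], [1, "BSPL"]]',
    })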

GTApprox/TrainingAccuracySubsetSize

Limits the number of points selected from the training set to calculate model accuracy on the training set.

Value:integer in range \([1, 2^{32}-1]\), or 0 (no limit).
Default:100 000

New in version 1.9.0.

After a model has been built by GTApprox, it is evaluated on the input values from the training set to test model accuracy (calculate model errors, that is, the deviation of model output values from the original output values). The result is an integral characteristic named “Training Set Accuracy”, which is found in the model info. For very large samples this test is time-consuming and may significantly increase the build time. If the number of points in the training set exceeds the GTApprox/TrainingAccuracySubsetSize option value, some of the points are dropped to make the test take less time, and the training set accuracy statistic is based only on the model errors calculated using the limited subset of points (whose size is equal to the GTApprox/TrainingAccuracySubsetSize option value). The number of points actually used in the test is also found in the model info.

If the sample size is less than the GTApprox/TrainingAccuracySubsetSize value, this option has no effect. In this case the number of points used in the model accuracy test is equal to the number of points used to build the model (which may still differ from the number of points in the training set, for example if the training set contains duplicate values).

When this option does take effect, it writes a warning to the model build log stating that only a limited subset of points selected from the training set is used to calculate model accuracy.

To cancel the limit, set this option to 0. With this setting, the model will always be evaluated on the same set of points which were used to build the model.