4.2. General Usage

GTApprox supports two modes of model training:

  • Smart training: in this mode the tool tries to build the best model according to quality metrics defined by the user.
  • Manual training: in this mode the training parameters are defined manually by the user.

This section describes the sequence of steps required to build a model in each mode.

4.2.1. Smart Training

Smart training is a procedure that automatically chooses an approximation technique and tunes values of its options in order to obtain the most accurate model for a given problem. It is designed to enable non-experts to build accurate surrogate models easily.

The accuracy of the model is measured by the quality metrics defined by the user and the data set used to calculate these metrics. This data set can be either the training set or an independent test set.

A general sequence of steps required to build a model is the following.

  1. Prepare a training sample.
  • The training sample consists of an array of inputs and an array of outputs. Pairs of rows \((X_k, Y_k)\) of these two arrays with the same index are the elements of the training sample. For details see section Overview.
  • GTApprox performs some additional sample preprocessing, see section Sample Cleanup.
  2. Prepare additional input data.
  • This step is optional. It allows you to improve the quality of the model.
  • You can specify values of noise in the outputs of the training sample if they are known. See section Data with Errorbars for details.
  • You can specify weights of points in the training sample. A weight expresses the importance of a point: the greater the weight, the better the model fits the corresponding point. See section Sample Weighting for details.
  3. Prepare a test sample.
  • This step is optional. It allows you to reduce training time.
  • The test sample is used to calculate quality metrics.
    • If there is no test sample, by default GTApprox automatically selects a certain subset of the training sample to use for model validation. This subset is generated by a special algorithm, which selects validation points in such a way that models with higher quality metrics calculated for the validation subset show higher accuracy overall (for the entire training sample), thus minimizing the possible reduction of model quality due to excluding a subset of data from training.
    • If the sample size is not sufficient to prepare separate training and test samples (typically less than 100 points), the cross-validation procedure is used instead (see Cross-validation procedure details). Note that cross-validation is time consuming, so using a separate test sample or splitting the input sample into training and test subsets considerably reduces the training time.
    • When not using a separate test sample, you can also specify the sample splitting ratio manually, using the @GTApprox/TrainingSubsampleRatio hint (see Training Features). Automatic and manual splitting use different algorithms: the former determines the split ratio by analyzing the sample data, while the latter supports arbitrary ratios. Due to this, in the case of manual splitting the effects of excluding a data subset from training may be more significant than in the case of automatic splitting.
  • The structure of the test sample is the same as the structure of the training sample.
  4. Configure smart training.
  • This step is optional. It allows you to improve model quality, reduce training time, specify requirements to the model, and set the approximation techniques to use in smart training.
  • You can provide prior knowledge about the data or the underlying dependency. Using such information helps find a more accurate model and can also reduce training time. See section Additional Data Properties for details.
  • You can specify the properties the model should possess. See section Model Features for details.
  • You can control the model building process by setting the time-quality trade-off or specifying the enabled approximation techniques. See section Training Features for details.
  5. Fine-tuning.
  • This step is optional. It is an advanced functionality that allows you to specify options manually. Details on manual tuning are given in section Manual Training.
  • Note that manually specified options can interfere with smart training hints:
    • If an option is set manually, it will not be tuned: the specified value will be used to build the model.
    • Some model features can be set both via hints and via options. If hints and options specify conflicting features (for example, accuracy evaluation is required by hints but turned off in options), an exception is raised.
  6. Run model training.
  • Use build_smart() to train the model.
  • Note that smart training is more time consuming than manual training, as it tunes the options of the selected technique. There are several ways to reduce training time, see section Training Features.
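The steps above can be sketched in plain Python. The sample arrays and the hints dictionary below are ordinary Python objects; the commented build_smart() call stands in for the actual GTApprox training call, and its exact signature here is an assumption for illustration (consult the API reference for the real one).

```python
# Sketch of the smart training workflow (steps 1-6 above).
# The build_smart() call in the final comment is illustrative only.

# Step 1: training sample -- paired rows (X_k, Y_k).
x = [[0.0, 0.0], [0.5, 0.2], [1.0, 0.4], [1.5, 0.6]]  # inputs
y = [[0.0], [0.4], [0.8], [1.2]]                      # outputs
assert len(x) == len(y)  # rows must pair up by index

# Step 2 (optional): point weights -- a larger weight makes the
# model fit the corresponding point more closely.
weights = [1.0, 1.0, 2.0, 1.0]

# Step 3 (optional): a test sample with the same structure.
x_test = [[0.25, 0.1], [0.75, 0.3]]
y_test = [[0.2], [0.6]]

# Step 4 (optional): smart training hints.
hints = {
    "@GTApprox/DataFeatures": ["Linear"],
    "@GTApprox/Accelerator": 3,
}

# Step 6: run training (illustrative call, signature assumed):
# model = builder.build_smart(x, y, hints=hints)
```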

4.2.1.1. Additional Data Properties

In practice we sometimes know something about the data or the underlying dependency. Such information can be useful during model building: it allows GTApprox to obtain a model that preserves the data properties, and can also reduce the training time. This knowledge is incorporated into smart training via the @GTApprox/DataFeatures hint.

You can specify one or several values from the following list:

  • "Discontinuous" — specify this value if the dependency is discontinuous.
  • "Linear" — specify this value if the dependency is linear.
  • "Quadratic" — specify this value if the dependency is quadratic.
  • "DependentOutputs" — specify this value if there is a dependence between different outputs. For details see section Output Dependency Modes.
  • "TensorStructure" — specify this value if the training sample has tensor structure (complete of incomplete). For details on samples with tensor structure see sections Tensor Products of Approximations and Incomplete Tensor Products of Approximations.

Note that some values of this hint are incompatible (e.g. "Linear" is not compatible with "Quadratic"). In such a case an exception is raised.
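The compatibility rule can be illustrated with a small checker. Only the "Linear" vs "Quadratic" conflict mentioned above is encoded, and check_data_features() is a hypothetical helper written for this example, not part of the GTApprox API.

```python
# Minimal illustration of the @GTApprox/DataFeatures compatibility rule.
# check_data_features() is an illustrative helper, not a GTApprox function.

VALID_FEATURES = {"Discontinuous", "Linear", "Quadratic",
                  "DependentOutputs", "TensorStructure"}

# Only the conflict named in the text; the real rule set may be larger.
INCOMPATIBLE_PAIRS = [{"Linear", "Quadratic"}]

def check_data_features(features):
    """Raise ValueError on unknown or mutually incompatible values."""
    features = set(features)
    unknown = features - VALID_FEATURES
    if unknown:
        raise ValueError("unknown data features: %s" % sorted(unknown))
    for pair in INCOMPATIBLE_PAIRS:
        if pair <= features:
            raise ValueError("incompatible data features: %s" % sorted(pair))
    return features

check_data_features(["Linear", "DependentOutputs"])   # OK
# check_data_features(["Linear", "Quadratic"])        # raises ValueError
```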

4.2.1.2. Model Features

By default, smart training tries to find the best model regardless of its features. To force smart training to find a model with certain desired properties, use the @GTApprox/ModelFeatures hint.

You can specify one or several values from the following list:

  • "AccuracyEvaluation". By default GTApprox doesn’t build the model which supports accuracy evaluation. Specify this value, if you want to obtain model with accuracy evaluation. More details on accuracy evaluation can be found in section Evaluation of accuracy in given point.
  • "Smooth". The default constructed model is not necessarily smooth or even continuous. Specify this value to force GTApprox to build only smooth models. If “smooth model” is selected as one of model requirements in smart training, input dimension is greater than 1 and there are less than 10,000 points in the training set, special analysis is enabled helping to prevent oscillations of the model between the training points. “Smooth model” requirement is incompatible with the “exact fit” requirement.
  • "ExactFit". By default GTApprox is not going to build the model which perfectly fits the training sample, though supports this functionality. Specify this value of the hint if you want to obtain the model with such behavior. Models with exact fit are described in section Exact Fit.
  • "Gradient". Since 6.30, GTApprox does not require the model to support gradients (the grad() method) by default. Gradient support depends on the technique selected when training the model: for example, if the GBRT technique was selected, the final model will not support gradients. If you need a model with gradient support, add this requirement.

Note that some values of this hint are not compatible with some values of the @GTApprox/DataFeatures hint. For example, the accuracy evaluation requirement is not compatible with the "Linear" value of the @GTApprox/DataFeatures hint: linear models do not support accuracy evaluation.
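These constraints can be sketched with a small checker that encodes only the two conflicts stated in this section ("Smooth" vs "ExactFit", and "AccuracyEvaluation" vs the "Linear" data feature). check_hints() is a hypothetical helper written for this example, not a GTApprox function.

```python
# Illustration of the hint constraints stated in this section.
# check_hints() is a sketch, not part of the GTApprox API; the real
# compatibility rules may include more cases.

def check_hints(model_features, data_features):
    """Raise ValueError on the conflicts named in the manual text."""
    model_features = set(model_features)
    data_features = set(data_features)
    if {"Smooth", "ExactFit"} <= model_features:
        raise ValueError('"Smooth" is incompatible with "ExactFit"')
    if "AccuracyEvaluation" in model_features and "Linear" in data_features:
        raise ValueError("linear models do not support accuracy evaluation")
    return True

check_hints(["Smooth", "Gradient"], ["DependentOutputs"])       # OK
# check_hints(["AccuracyEvaluation"], ["Linear"])               # raises
```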

4.2.1.3. Training Features

In addition to data features and model features there are several training features which allow you to control specific parameters of the training process — for example, the time-quality trade-off, the type of quality metrics used to estimate accuracy of the model, the set of enabled approximation techniques, and others.

These training features are specified via the @GTApprox/Accelerator, @GTApprox/QualityMetrics, @GTApprox/AcceptableQualityLevel, @GTApprox/TryOutputTransformations, and @GTApprox/EnabledTechniques hints. In addition, the @GTApprox/TimeLimit hint allows you to limit the total time spent in training.

  • @GTApprox/Accelerator is a five-position switch that allows you to control the training time by changing certain internal parameters of the technique selection procedure. Generally, reducing the training time decreases model quality. Possible values range from 1 (low speed, highest quality) to 5 (high speed, lower quality).

  • @GTApprox/QualityMetrics sets the error metric to use for continuous outputs when estimating model prediction quality. Smart training will tune the model to the specified metric (by default, RRMS). For details on quality metrics see section Componentwise errors.

    The above metrics are defined only for continuous outputs, so the @GTApprox/QualityMetrics hint is ignored for categorical outputs. To estimate prediction quality for a categorical output, smart training always uses the cross-entropy loss metric (smaller values are better).

  • @GTApprox/AcceptableQualityLevel sets the acceptable level of the error metric specified by the @GTApprox/QualityMetrics hint. This hint can be used to stop training early or to reduce training time by accepting a less accurate model. Smart training stops once the model error becomes less than or equal to @GTApprox/AcceptableQualityLevel. Default acceptable levels are 0.001 for RRMS, 0.999 for \(R^2\), and 0.0 for other quality metrics.

  • @GTApprox/TryOutputTransformations is an advanced hint, which enables GTApprox to train and compare two models: one with log transformation applied to the training sample output data, and the other without the transformation.

    By default, the above comparison is done for each output, which can noticeably increase the training time — up to doubling the time compared to training with disabled @GTApprox/TryOutputTransformations. To avoid testing every output, you can additionally set the GTApprox/OutputTransformation option so as to prohibit transformation for certain outputs — see the @GTApprox/TryOutputTransformations hint description for details.

  • @GTApprox/EnabledTechniques sets the approximation techniques enabled for smart training. With this hint, you can explicitly specify the techniques to use in smart training. This hint can be used only when the GTApprox/Technique option is default ("Auto"). Note that if you enable a technique that is incompatible with the training data, this technique is not used. In the case when none of the enabled techniques can be applied, build_smart() raises an exception.

  • @GTApprox/TimeLimit sets a soft limit for the total training time in seconds. The default value is 0 meaning no limit. Values greater than 0 set a time limit. The limit is not exact and can be violated to a certain extent. Allowed violation/limit ratio decreases as the limit increases: for example, if the limit is 1, actual training may take up to about 2.5 seconds (150% violation), and if the limit is 3600, training may take up to about 4040 seconds (12% violation). When the time limit is exceeded, GTApprox automatically interrupts training and returns the current best (most accurate) model.

    Note

    When model quality is estimated using cross-validation, all intermediate models are trained using a smaller subset of the training sample (see Cross-validation procedure details). The final model, which uses the full training sample, is trained only after determining the optimum training settings. If a time limit is set, it is possible to exceed the limit while the final model is not ready yet. In this case GTApprox returns the best (most accurate) model selected from the intermediate models trained during cross-validation. To indicate that the model is not final, GTApprox issues a warning in the training log. For example:

    [w] Model was created on a subset of the training set using the following options:
    [w]   GTApprox/Accelerator = 1
    [w]   GTApprox/EnableTensorFeature = False
    
  • @GTApprox/TrainingSubsampleRatio specifies the fraction of the given training sample to use for model training. The remaining part of the given sample becomes a test sample, and model quality is estimated using this test sample to avoid the time consuming cross-validation procedure. This speeds up model training, at the expense of a slight reduction of model quality if the initial sample is small, because the final model is trained on a subset of the initial sample. If a separate test sample is given, this hint is ignored.

    This hint’s default value 0 is a special value which allows GTApprox to automatically select a validation method as follows:

    • The preferred method is to split the sample into training and test subsets using a special algorithm, which selects validation points in such a way that models with higher quality metrics calculated for the validation subset show higher accuracy overall (for the entire training sample), thus minimizing the possible reduction of model quality due to excluding a subset of data from training. This algorithm is different from the one used when the ratio is specified manually, and determines the split ratio automatically, depending on data properties.
    • Cross-validation is used only when the split ratio is not specified manually and it is not possible to automatically split the input sample.

    The most common case when smart training uses cross-validation by default is a training sample that is too small to split into representative training and test subsets (typically less than 100 points).
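The training features above can be combined in a single hints dictionary. The toy loop below only illustrates the @GTApprox/AcceptableQualityLevel stopping rule; the candidate error values are invented for the example, and smart_search() is a sketch, not a GTApprox function.

```python
# Training feature hints from this section, combined in one dictionary.
hints = {
    "@GTApprox/Accelerator": 2,              # 1 (slow, best) .. 5 (fast)
    "@GTApprox/QualityMetrics": "RRMS",      # default error metric
    "@GTApprox/AcceptableQualityLevel": 0.05,
    "@GTApprox/TimeLimit": 3600,             # soft limit, in seconds
}

def smart_search(candidate_errors, acceptable_level):
    """Toy model of the early-stop rule: stop once the best error so far
    is less than or equal to the acceptable level."""
    best = float("inf")
    tried = 0
    for err in candidate_errors:
        tried += 1
        best = min(best, err)
        if best <= acceptable_level:
            break
    return best, tried

# With made-up candidate errors, the search stops at the third model.
best, tried = smart_search([0.30, 0.12, 0.04, 0.01],
                           hints["@GTApprox/AcceptableQualityLevel"])
# best == 0.04, tried == 3
```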

4.2.2. Manual Training

This mode is intended for advanced users who want to manually specify options of GTApprox and control the quality of the model.

The general sequence of steps to build a model using manual training is:

  1. Prepare a training sample.

    • The training sample consists of an array of inputs and an array of outputs. Pairs of rows \((X_k, Y_k)\) of these two arrays with the same index are the elements of the training sample. For details see section Overview.
    • GTApprox performs some additional sample preprocessing, see section Sample Cleanup.
  2. Prepare additional input data.

    • This step is optional. It allows you to improve the quality of the model.
    • You can specify values of noise in the outputs of the training sample if they are known. See section Data with Errorbars for details.
    • You can specify weights of points in the training sample. A weight expresses the importance of a point: the greater the weight, the better the model fits the corresponding point. See section Sample Weighting for details.
  3. Configure manual training.

  4. Run model construction.

  5. Assess model quality.

    • There are several ways to assess model quality: using a test set, using internal validation, and using the accuracy evaluation procedure. Details can be found in section Quality Assessment.
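As a sketch of step 5, the RRMS metric mentioned in this chapter can be computed on a separate test set as the RMS prediction error divided by the standard deviation of the true outputs (a common definition; see section Componentwise errors for the exact formula used by GTApprox). The model below is a trivial stand-in for an actual trained model.

```python
# Quality assessment sketch: RRMS of a stand-in model on a test set.
import math

def rrms(y_true, y_pred):
    """RMS prediction error divided by the std of the true outputs."""
    n = len(y_true)
    mean = sum(y_true) / n
    rms_err = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    std = math.sqrt(sum((t - mean) ** 2 for t in y_true) / n)
    return rms_err / std

model = lambda x: 2.0 * x          # stand-in for a trained model
x_test = [0.0, 1.0, 2.0, 3.0]
y_test = [0.1, 2.0, 4.1, 5.9]      # true test outputs

score = rrms(y_test, [model(x) for x in x_test])
# RRMS much smaller than 1 indicates a good fit on this data.
```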