4.6. Details

4.6.1. Sample Cleanup

Before applying the approximation technique, the training set is preprocessed in order to remove possible degeneracies in the data. Let \(\mathbf{XY}\) be the \(|S|\times (d_{in}+d_{out})\) matrix of the training data, where the rows are \((d_{in}+d_{out})\)-dimensional training points, and the columns are individual scalar components of the input or output. As explained in Section Problem Statement, the matrix \(\mathbf{XY}\) consists of the sub-matrices \(\mathbf{X}\) and \(\mathbf{Y}\). We perform the following operations with the matrix \(\mathbf{XY}\):

  • Check for non-numeric values in \(\mathbf{XY}\). By default, NaN or infinity values are not accepted, and training is cancelled if \(\mathbf{XY}\) contains such values. This behavior can be changed with the following options:
    • GTApprox/InputNanMode can be set to ignore non-numeric values in \(\mathbf{X}\). Ignoring means that rows with non-numeric values are removed from the \(\mathbf{XY}\) matrix, and it is processed further.
      • The Gradient Boosted Regression Trees (GBRT) technique is unique in that it can handle NaN (but not infinity) values in \(\mathbf{X}\). If you set the GTApprox/InputNanMode option, GBRT keeps points where some (but not all) input components are NaN and actually uses them in training.
    • GTApprox/OutputNanMode can be set to handle non-numeric values in \(\mathbf{Y}\). There are two possible variants of behavior: ignore (remove corresponding rows from \(\mathbf{XY}\)), or accept such values and train a model that predicts NaN output in regions near those points that contained non-numeric output values.
  • Remove duplicate rows from \(\mathbf{XY}\). A duplicated row means that the same training vector is included more than once in the training matrix. Repetitions bring no additional information and are therefore ignored; a repeating row is counted only once.
  • Round input values (only if an input tolerance is set — that is, GTApprox/InputsTolerance is not default). Each column of \(\mathbf{X}\) is rounded to the specified tolerance (see Input Rounding for details). Note that this operation can again result in repeating rows; in this case, they are merged as described in section Input Rounding.
  • Remove all constant columns in the sub-matrix \(\mathbf{X}\). A constant column in \(\mathbf{X}\) means that all the training vectors have the same value of one of the input components. In particular, this means that the training DoE is degenerate and lies in a proper subspace of the design space. When constructing the approximation, such input components are ignored.

If the above operations are nontrivial, e.g., if the matrix \(\mathbf{X}\) does contain constant columns or the matrix \(\mathbf{XY}\) does contain repeating rows, then the removals are accompanied by warnings. As a result of these operations, we obtain a reduced matrix \(\mathbf{XY}_r\) consisting of the submatrices \(\mathbf{X}_r\) and \(\mathbf{Y}_r\). Accordingly, we define the effective input dimension and the effective sample size as the corresponding dimensions of the sub-matrix \(\mathbf{X}_r\).

Note

After removing repeating rows, the reduced matrix \(\mathbf{XY}_r\) may still contain rows which have the same \(X\) components but different \(Y\) components, e.g., when the output contains random noise. Such problems may require special tuning of GTApprox (see Section Noisy Problems); in particular, not all approximation techniques are appropriate for them. If the training data does contain rows with equal \(X\) but different \(Y\) components, the tool produces a warning.

GTApprox constructs the model using the reduced matrix \(\mathbf{XY}_r\) rather than the original matrix \(\mathbf{XY}\). Furthermore, it is the effective input dimension and sample size, rather than the original ones, that are used when required at certain steps, in particular when determining the default approximation technique (see Section General Usage) and choosing the default Internal Validation parameters. For example, if the original input dimension has been reduced to 1, then the default 1D technique will be applied to this data. The resulting approximation is then considered as a function of the full \(d_{in}\)-dimensional input, but it depends only on those components of the full input which have been included in \(\mathbf{X}_r\).
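
For illustration, the cleanup logic described above can be sketched in NumPy as follows. This is a simplified sketch only, not the GTApprox implementation; it assumes the default NaN handling and omits input rounding and point weights.

import numpy as np

def cleanup_sample(X, Y):
    # Simplified illustration of the cleanup logic (not the GTApprox implementation):
    # drop non-numeric rows, remove duplicates, and ignore constant input columns.
    XY = np.hstack([X, Y])
    XY = XY[np.all(np.isfinite(XY), axis=1)]      # reject rows with NaN/infinity
    XY = np.unique(XY, axis=0)                    # each repeated training vector counts once
    X_r, Y_r = XY[:, :X.shape[1]], XY[:, X.shape[1]:]
    keep = np.ptp(X_r, axis=0) > 0                # drop constant columns of X
    X_r = X_r[:, keep]
    effective_size, effective_dim = X_r.shape     # effective sample size and input dimension
    return X_r, Y_r, effective_size, effective_dim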

4.6.1.1. Input Rounding

The rounding procedure consists of the following steps:

  1. For each column of \(\mathbf{X}\), a uniform grid is constructed (that always contains the 0 value) with the step equal to the corresponding tolerance specified by GTApprox/InputsTolerance.
  2. For each element of \(\mathbf{X}\), the closest value on the grid corresponding to its column is found.
  3. Each element of \(\mathbf{X}\) is replaced with the found value.
  4. If duplicate rows in \(\mathbf{X}\) appear after rounding, they are merged into one row.

When rows are merged, the following rules apply (a sketch of the rounding and merging procedure is given after this list):

  • Resulting output values are computed as average, or a weighted average if point weights are specified (see Sample Weighting).
  • Point weights are averaged.
  • Resulting output noise variance (see Data with Errorbars) is computed as \(\frac{\sum_i \sigma_i^2}{N_{merged}^2}\), where \(\sigma_i^2\) are the noise variances of the merged points and \(N_{merged}\) is the number of points to merge.
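
A minimal NumPy sketch of this rounding and merging logic (unit point weights are assumed, and tol stands for an array of per-column tolerances; this is an illustration, not the GTApprox implementation):

import numpy as np

def round_and_merge(X, Y, noise_variance, tol):
    # Steps 1-3: snap each element of X to the nearest node of a zero-anchored grid
    # whose step equals the per-column tolerance `tol`.
    X_rounded = np.round(X / tol) * tol
    # Step 4: merge rows of X that became identical after rounding.
    X_merged, inverse = np.unique(X_rounded, axis=0, return_inverse=True)
    Y_merged = np.empty((len(X_merged), Y.shape[1]))
    var_merged = np.empty_like(Y_merged)
    for k in range(len(X_merged)):
        idx = np.flatnonzero(inverse == k)
        Y_merged[k] = Y[idx].mean(axis=0)                                 # average the outputs
        var_merged[k] = noise_variance[idx].sum(axis=0) / len(idx) ** 2   # sum(variances) / N^2
    return X_merged, Y_merged, var_merged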

4.6.2. Automatic Technique Selection

The automatic selection of the approximation technique is based on the decision tree shown in Figure Decision tree.

../../_images/decision_tree_vers1110.png

Figure: The GTApprox internal decision tree for the choice of default approximation methods.

The factors of choice are:

  • Effective sample size \(S\) (the number of data points after the sample cleanup).
  • Input dimensionality \(d_{in}\) (the number of input components, or variables).
  • Output dimensionality \(d_{out}\) (the number of output components, or responses).
  • Exact fit requirement (“Exact Fit” nodes). See option GTApprox/ExactFitRequired.
  • Accuracy evaluation requirement (“AE” nodes). See option GTApprox/AccuracyEvaluation.
  • The availability of output variance data (responses with “errorbars”). See Data with Errorbars for details.
  • Tensor approximation switch (“Tensor Approximation enabled” node). See option GTApprox/EnableTensorFeature. Note that the Tensor Approximation technique requires a specific sample structure and has several more options not shown on Figure Decision tree, which may affect the automatic selection. See Tensor Products of Approximations for more details, in particular The overall workflow for the full TA decision tree.

The result is the constructed approximation, possibly with an accuracy prediction, or an error. The selection is performed in agreement with the properties of the individual approximation techniques. In particular:

The threshold values \(2d_{in} + 2\), \(125\), \(500\), \(1000\) and \(10000\) for \(|S|\) have been set based on previous experience and extensive testing.

The figure below shows the “sample size vs. dimension” diagram for the default choice.

../../_images/samplesize_dim_chart_vers16.png

Figure: The sample size vs. dimension diagram of default techniques in GTApprox.

4.6.3. Deterministic and Randomized Training

Several model training techniques feature randomized initialization of their internal parameters. Since the algorithms that select these parameters are pseudorandom, their behavior can be controlled with GTApprox/Deterministic in the following way:

  • Deterministic training mode, default. A fixed seed is used in all pseudorandom initialization algorithms. The seed is set by GTApprox/Seed. This makes the behavior reproducible — for example, two models trained in deterministic mode with the same data, same GTApprox/Seed and other settings will be exactly the same, since a training algorithm is initialized with the same parameters.
  • Non-deterministic training mode. A new seed is generated internally every time you train a model. As a result, models trained with randomized techniques may slightly differ even if all settings and training samples are the same. In this case, GTApprox/Seed is ignored. The generated seed that was actually used for initialization can be found in model info, so later the training run can still be reproduced exactly by switching to the deterministic mode and setting GTApprox/Seed to this value.

For randomized techniques, repeated non-deterministic training runs may be used to try to obtain a more accurate approximation. These techniques include:

Other techniques are always deterministic, meaning that models trained with the same data and settings are always the same, regardless of GTApprox/Deterministic and GTApprox/Seed.
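
A brief sketch of switching between the two modes, assuming the pSeven Core gtapprox.Builder interface with options.set(); the import path, the toy training sample, and the seed value are placeholders, while the option names are those documented here:

import numpy as np
from da.p7core import gtapprox  # import path assumed

x = np.random.rand(100, 3)                   # toy training sample
y = np.sum(x ** 2, axis=1, keepdims=True)

builder = gtapprox.Builder()

# Deterministic (default): fixed seed, fully reproducible results.
builder.options.set("GTApprox/Deterministic", True)
builder.options.set("GTApprox/Seed", 100)
model_a = builder.build(x, y)
model_b = builder.build(x, y)                # exactly the same model as model_a

# Non-deterministic: a new seed is generated internally for every run;
# the seed actually used is stored in the model info for later reproduction.
builder.options.set("GTApprox/Deterministic", False)
model_c = builder.build(x, y)                # may differ slightly for randomized techniques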

4.6.4. Training Time and Accuracy Tradeoffs

As GTApprox normally constructs approximations by complex nonlinear techniques, this process may take a while. In general, the quality of the model trained by GTApprox is positively correlated with the time spent to train it. GTApprox contains a number of parameters affecting this time. The default values of the parameters are selected so as to reasonably balance the training time with the accuracy of the model, but in some cases it may be desirable to adjust them so as to decrease the training time at the cost of decreasing the accuracy, or increase the accuracy at the cost of increasing the training time. The following general recommendations apply:

  • The fastest approximation algorithms of GTApprox are, by far, 1D Splines with tension and Response Surface Model (see Techniques). They have, however, limited applicability: SPLT works only for 1D problems, and RSM is crude in many cases.
  • Internal Validation involves training the approximation multiple times, which increases the training time by a factor of \(N_{tr}+1\), where \(N_{tr}\) is the number of IV training/validation sessions. To speed up training, turn Internal Validation off or decrease \(N_{tr}\). See Section Model Validation for details.
  • The only nonlinear approximation algorithm effectively available in GTApprox for very large training sets (larger than 10000) in dimensions higher than 1 is HDA. This algorithm can be quite slow on very large training sets, but it has several options which can be adjusted to decrease the training time. See Section High Dimensional Approximation for details.

In addition, GTApprox has a special option \(\tt{Accelerator}\) which allows the user to tune the training time by simply choosing the level 1 to 5; the detailed specific options of the approximation techniques are then set to pre-determined values. See Section Accelerator.

We emphasize that the above remarks refer to the training time of the model, which should not be confused with the evaluation time.
After the model has been constructed, evaluating it is usually a very fast operation, and this time is negligible in most cases (very complex models, in particular tensor products of approximations, may be an exception; see Section Tensor Products of Approximations).

4.6.4.1. Accelerator

The switch \(\tt{Accelerator}\) allows the user to reduce training time at the cost of some loss of accuracy. The switch takes values 1 to 5. Increasing the value reduces the training time and, in general, lowers the accuracy. The default value is 1.

Increasing the value of \(\tt{Accelerator}\) simplifies the training process and, accordingly, always reduces the training time. The accuracy, however, though it degrades in general, may change in a non-monotone fashion: in some cases it may remain constant or even improve.
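
For example, the tradeoff could be adjusted as follows. This is a sketch only: the gtapprox.Builder interface, the import path, the full option name GTApprox/Accelerator, and the training sample x, y are assumptions or placeholders.

from da.p7core import gtapprox  # import path assumed

builder = gtapprox.Builder()
# Level 1 (default) is the slowest and generally the most accurate setting,
# level 5 is the fastest; here some accuracy is traded for training speed.
builder.options.set("GTApprox/Accelerator", 3)
model = builder.build(x, y)     # x, y: a training sample (placeholders)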

\(\tt{Accelerator}\) acts differently depending on the approximation technique.

  • High Dimensional Approximation: Here, the following rules apply:

    • \(\tt{Accelerator}\) changes the default settings of HDA.
    • User-defined settings of HDA have priority over those imposed by \(\tt{Accelerator}\).
    • The technique’s parameters are selected differently for very large training sets (more than 10000 points) and moderate sets (less than 10000 points).

In brief, the following settings are modified by \(\tt{Accelerator}\) (see Section High Dimensional Approximation):

To illustrate the effect of \(\tt{Accelerator}\), we show how it changes the training time and accuracy of the default approximation on a few test functions. We consider the following functions:

  • \(\tt{Rosenbrock}\):
\[Y = \sum_{i=1}^{d-1} \left[ \left( 1 - x_i \right)^2 + 100\left( x_{i+1} - x_i^2 \right)^2 \right], \quad X=(x_1,\ldots,x_d) \in [-2.048, 2.048]^d\]
  • \(\tt{Michalewicz}\):
\[Y = \sum_{i=1}^d \sin(x_i)\sin\left(\frac{x_i^2}{\pi}\right), \quad X \in [0, \pi]^d\]
  • \(\tt{Ellipsoidal}\):
\[Y = \sum_{i=1}^d i x_i^2, \quad X \in [-6, 6]^d\]
  • \(\tt{Whitley}\):
\[Y = \sum_{i=1}^d\sum_{j=1}^d \Bigl( c + \frac{c}{b}\left(a\left(x_i^2-x_j\right)^2 + \left(c-x_j\right)^2\right)^2 - \cos\left(a \left(x_i^2 - x_j\right)^2 + \left(c-x_j\right)^2\right) \Bigr),\]
\[a = 100, b=400, c=1, \quad X \in [-2, 2]^d\]

We consider these functions for different input dimensions \(d\) and consider training sets of different sizes. The approximation techniques are chosen by default by the rules described in Section General Usage.

The table below compares training times and errors of approximations obtained on the same computer with different settings of \(\tt{Accelerator}\). Times and errors are geometrically averaged over all the test problems. For the default value \(\tt{Accelerator}=1\), the actual averaged training times and errors are given. For the other values, the ratios \(\frac{\text{reference value}}{\text{current value}}\) are given. We observe a clear general trend of increasing error and decreasing training time, though the precise quantitative characteristics of the trend may depend significantly on the test functions and approximation techniques.

Table: Effect of \(\tt{Accelerator}\) on training time and accuracy. For the default value \(\tt{Accelerator}=1\), the actual averaged training times and errors are given. For the other values, the ratios \(\frac{reference\; value}{current\; value}\) are given, where \({reference\; value}\) corresponds to \(\tt{Accelerator}=1\).
  Technique                           GP       GP       HDAGP    HDAGP    HDA      HDA      HDA      SGP
  Sample size                         40       80       160      320      640      1280     2560     2560
  Accelerator=1   reference time, s   0.45     0.91     22       130      740      690      1700     850
                  reference error     9.7e-5   2.3e-5   5.6e-5   5.5e-5   9.5e-6   2.2e-6   3.5e-8   9.8e-7
  Accelerator=2   time ratio          1.1      1.4      1.6      1.5      1.4      1.3      1.1      1.7
                  error ratio         0.89     1.0      0.96     0.77     0.98     0.99     0.97     0.96
  Accelerator=3   time ratio          1.2      1.8      1.8      1.8      1.5      2.1      4.1      2.6
                  error ratio         0.81     0.56     0.74     0.79     0.23     0.53     0.43     0.45
  Accelerator=4   time ratio          1.2      2.2      3.8      3.0      2.4      4.1      4.7      7.3
                  error ratio         0.015    0.040    0.26     0.61     0.14     0.23     0.42     0.17
  Accelerator=5   time ratio          1.3      2.4      4.0      3.0      2.5      19       9.8      21
                  error ratio         0.010    0.0068   0.26     0.61     0.14     0.025    0.39     0.021

4.6.5. Multi-core Scalability

GTApprox takes advantage of shared memory multiprocessing when multiple processor cores are available. It uses an OpenMP multithreading implementation, which allows the user to control parallelization by setting the maximum number of threads in the parallel region. Generally, some increase in performance may be expected when using more threads, though the actual GTApprox performance gain depends on problem properties, the choice of approximation techniques, host specifications, and the execution environment. This section presents the conclusions drawn from GTApprox scalability tests and the resulting recommendations for the end user.

A series of GTApprox technique tests, conducted internally on various supported platforms, on hosts with different numbers of cores, and with samples of various dimensionality and size, showed that a significant performance increase may be achieved for the High Dimensional Approximation, Gaussian Processes, Sparse Gaussian Process, and High Dimensional Approximation combined with Gaussian Processes techniques when the sample size is \(\gtrsim 500\), by increasing the number of cores available to GTApprox up to 4. A further increase up to 8 cores gives little effect, and there is no noticeable gain after that due to parallel overhead (sometimes there is even a slight performance decrease). The nature of this dependency is the same regardless of input dimensionality, although the absolute gain values may, of course, differ.

Figure Multi-core scalability illustrates the typical behaviour of the HDA technique. This particular test was run on a 3.40 GHz Intel i7-2600 CPU (4 physical cores) under Ubuntu Linux x64 with a training sample of 1000 points in 10-dimensional space. Note that, due to high CPU load, setting the number of OpenMP threads to more than 4 on a quad-core CPU only degrades performance (not shown), but as long as the number of threads does not exceed the number of physical cores, performance scales well, as expected.

The easiest way to set the number of parallel threads is the OMP_NUM_THREADS environment variable. Its default value is equal to the number of logical cores, which gives good results in the cases described above on processors with 2-4 physical cores, but may be undesirable in other cases, especially on CPUs with hyperthreading. See option GTApprox/MaxParallel.
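
For example, the thread count could be limited to four as follows. This is a sketch: the Builder interface and import path are assumptions, while OMP_NUM_THREADS and GTApprox/MaxParallel are the controls mentioned above, and x, y are placeholders for a training sample.

import os

# Limit OpenMP to the number of physical cores; the variable must be set
# before the OpenMP runtime is initialized (i.e. before training starts).
os.environ["OMP_NUM_THREADS"] = "4"

from da.p7core import gtapprox  # import path assumed

builder = gtapprox.Builder()
builder.options.set("GTApprox/MaxParallel", 4)  # in-tool alternative to the variable
model = builder.build(x, y)                     # x, y: a training sample (placeholders)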

../../_images/scalability_hda1.png

Figure: Multi-core scalability of GTApprox HDA. Host: Intel i7-2600 @ 3.40 GHz, Ubuntu x64. Sample size: 1000 points, dimensionality: 10.

4.6.6. Using Clusters

GTApprox has initial support for running approximation model training on a remote host or a HPC cluster (currently only LSF clusters are supported). See set_remote_build() for details.

4.6.7. Model Smoothing

GTApprox provides an option to additionally smooth the approximation after it has been constructed. The user can transform the trained model into another, smoother model. Smoothing affects the gradient of the function as well as the function itself. Therefore, smoothing may be useful in tasks for which smooth derivatives are important, e.g., in surrogate-based optimization.

The model can be smoothed via the smooth() method. The amount of smoothing is specified by the smoothing factor — an arbitrary float value in the range \([0.0, 1.0]\) (the f_smoothness argument of the smooth() method). The smoothing factor has no physical meaning by itself: it simply requests less smoothing (smaller values) or more smoothing (larger values); 0 means no smoothing (the new model will be identical to the original one), while 1 is extreme smoothing (the new model will be almost linear).

For a model with multidimensional output, a scalar smoothing factor applies the same smoothing to all components of the output; to set different smoothing for individual output components, use the array form (the array length must be equal to the model output dimension size_f). In the array form, each element (an individual smoothing factor) is also a float in the range [0.0, 1.0].
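
For example, for a model with two outputs (a short sketch; model_orig is a placeholder for an already trained model):

model_s1 = model_orig.smooth(0.3)         # both outputs smoothed equally
model_s2 = model_orig.smooth([0.1, 0.8])  # light smoothing of the first output, strong of the second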

Smoothing may be applied to an already smoothed model, but in such case the result will be the same as applying it to the original model with the same smoothing factor(s). For example:

model_a = model_orig.smooth(0.5)
model_b = model_orig.smooth(0.1)
model_c = model_a.smooth(0.1)

Both model_b and model_c are different from the model_orig prototype, yet model_b is identical to model_c. The resulting smoothness factor for model_c is not 0.6 or 0.5 \(\cdot\) 0.1 as it may seem, but simply 0.1 as specified in the model_a.smooth() call.

Also, applying zero smoothness to an already smoothed model returns a model which is up to the computational error identical to the original model:

model_smoothed = model_orig.smooth(0.1)
model_restored = model_smoothed.smooth(0)

Here, model_restored is identical to model_orig, up to the computational error of the smoothing restoration procedure.

smooth() method returns a new smoothed model; the original one (the model that called the method) is not changed. To check if the model is a smoothed model, see is_smoothed. Model’s smoothness factors can be found in info.

4.6.7.1. Anisotropic Smoothing

Anisotropic smoothing extends the simple smoothing functionality (see section Model Smoothing) by additionally allowing you to specify the smoothing strength along different input components. It is implemented in the smooth_anisotropic() method, which has two arguments: f_smoothness, the smoothing factor for output components (as in simple smoothing described in section Model Smoothing), and x_weights, which defines the smoothing strength along different input components.

Note

GP-based techniques (Gaussian Processes, High Dimensional Approximation combined with Gaussian Processes, Sparse Gaussian Process) do not support anisotropic smoothing. For these techniques, smooth_anisotropic() ignores the x_weights argument and performs the usual smoothing.

The x_weights argument can be either 1D or 2D array-like. If x_weights is 1D, the same smoothing along input components is applied to all output components. If x_weights is 2D, each row specifies the smoothing of the corresponding output component. As an example, for a model with 2D input and output, the 1D form:

x_weights = [1, 0]

specifies that both output components are smoothed by the first input component only, while the 2D form:

x_weights = [[0.5, 0.5], [1, 0]]

specifies that the first output component is smoothed equally by both input components, and the second output component is smoothed only by the first input component. In either form, each element of x_weights is a float in range [0.0, 1.0].

The values in x_weights are relative to each other — for example, these settings are essentially the same:

x_weights = [0.5, 0.5]
x_weights = [0.1, 0.1]

So are these:

x_weights = [1, 0]
x_weights = [0.1, 0]

This means that x_weights controls only the relative smoothing along different input components, in contrast to f_smoothness, which specifies the overall amount of output smoothing (f_smoothness works just like in the smooth() method and has the same valid values).

Input weights and smoothing factors interact in the following way (again, for a model with 2D input and output as an example; a usage sketch for this configuration is given after the list):

x_weights = [[0.5, 0.5], [1, 0]]
f_smoothness = [0.4, 0.6]
  • The first output component is smoothed less than the second one, and it is smoothed equally along both input components.
  • The second output component is smoothed more than the first one, but only along the first input component; it is not smoothed at all along the second input component.
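
A usage sketch for this configuration (model is a placeholder for an already trained model with 2D input and output; keyword usage of the arguments is an assumption):

# Keyword names follow the argument names used above; the exact signature may differ.
smoothed = model.smooth_anisotropic(f_smoothness=[0.4, 0.6],
                                    x_weights=[[0.5, 0.5], [1, 0]])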

This method returns a new smoothed model; the original one (the model that called the method) is not changed. To check if the model is a smoothed model, see is_smoothed. Model’s smoothness factors can be found in info.

4.6.7.2. Error-Based Smoothing

Model errors commonly increase when the model is smoothed. To control the accuracy of the smoothed model, you can use error-based smoothing: you specify thresholds for the model errors, which are calculated on a reference sample that you also provide. The final model is the model with the maximum possible smoothness whose approximation errors do not exceed the given thresholds.

This type of smoothing is implemented in smooth_errbased(). As input arguments, you should provide a reference sample (x_sample, f_sample), the error type (error_type), and the thresholds (error_thresholds). There is also an optional x_weights argument; if it is provided, the smoothing is anisotropic (see section Anisotropic Smoothing for details).
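
For example (a sketch; model, x_test, and f_test are placeholders, and keyword usage of the arguments listed above is an assumption):

# Smooth as much as possible while keeping the RRMS error on the
# reference sample (x_test, f_test) within 0.05.
smoothed = model.smooth_errbased(x_sample=x_test, f_sample=f_test,
                                 error_type='RRMS', error_thresholds=0.05)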

The error type can be a list of strings (different error types for different outputs) or a single string (one error type for all outputs). Available error types are:

  • RMS — root-mean-squared error;
  • Mean — mean absolute error;
  • Max — maximum absolute error;
  • RMS_PTP, Mean_PTP, Max_PTP — the same as above, but normalized by the range of output values in the given reference sample;
  • RRMS — relative root-mean-squared error: RMS normalized by standard deviation of given reference sample;
  • Median — median of absolute errors;
  • Q0.95 — 95-percent quantile of absolute errors;
  • Q0.99 — 99-percent quantile of absolute errors.

See definitions of error types in section Componentwise errors.

Also, error aggregation for different output components is supported. There are three types of aggregation available (see definitions of aggregated errors in section Aggregated errors):

  • Mean — arithmetic average of errors for different outputs;
  • Max — maximum of errors for different outputs;
  • RMS — root-mean-squared of errors for different outputs.

For example, to calculate the mean of RMS errors, set error_type to 'Mean RMS'.

The error_thresholds parameter sets the threshold for errors. If error aggregation is used, error_thresholds must be a single float. If errors are not aggregated, error_thresholds can be a float (the same threshold for all output components) or a 1D array-like (an individual threshold for each output component).

This method returns a new smoothed model; the original one (the model that called the method) is not changed. To check if the model is a smoothed model, see is_smoothed. Model’s smoothness factors can be found in info.

4.6.7.3. Notes on Smoothing Methods

Details of smoothing method depend on the technique that was used to construct the approximation:

  • SPLT. The implementation of smoothing is based on article [Bernstein2011] and consists in convolving the approximation with a smoothing kernel \(K_h\), where \(h\) is the kernel width. The kernel width controls the degree of smoothing: the larger the width, the more intensive the smoothing.
  • Gaussian Processes, Sparse Gaussian Process, High Dimensional Approximation combined with Gaussian Processes. Smoothing is based on adding a linear trend to the model, which leads to a new covariance function. The degree of smoothness depends on the coefficient of the non-stationary covariance term (based on the linear trend): the larger this coefficient, the closer the model is to a linear one.
  • High Dimensional Approximation. Smoothing is based on penalization of the second-order derivatives of the model. The HDA model is represented as a linear decomposition over nonlinear parametric functions, so the decomposition coefficients are re-estimated (via penalized least squares) during smoothing. Smoothness is controlled by the regularization parameter (the coefficient of the penalty term in the least squares problem).
  • Response Surface Model. RSM models are not affected by smoothing.

The figures below show examples of smoothing in 2D and 1D.

../../_images/2D_smoothing.png

Figure: 2D smoothing: a) original function; b) non-smoothed approximation; c) smoothed approximation.

../../_images/1D_smoothing.png

Figure: 1D smoothing with different values of the parameter smoothness.

Note

This image was obtained using an older pSeven Core version. Actual results in the current version may differ.

4.6.8. Model Metainformation

Approximation models can store additional information which is not required by model functions. This information describes the model in a human-readable form:

  • Comment — a simple text comment, added with the comment parameter to build() and build_smart(). The comment is stored to model comment. The comment parameter is a single string.
  • Annotations — an extended comment, added with the annotations parameter to build() and build_smart(). Annotations are stored to model annotations. The annotations parameter is a dictionary. All its keys and values must be strings. It can be used to store a number of notes or even some documentation for the model.
  • Descriptions of model inputs and outputs, including their names and other details. These descriptions are added with the x_meta and y_meta parameters to build() and build_smart(). They are stored to model details (see Input and Output Descriptions). The x_meta and y_meta parameters are explained below.

The x_meta and y_meta parameters can be used to:

  • Add or edit metainformation on model inputs and outputs (names, descriptions, measurement units).
  • Manually limit the model’s input domain, applying simple box constraints to its inputs (see GTApprox/InputDomainType, section Input Constraints).
  • Add thresholds to model outputs (see Output Constraints).

Both x_meta and y_meta are lists of length equal to the number of model inputs and outputs respectively, or the number of columns in the input and response parts of the training sample. A list element can be a (Unicode) string or a dictionary. A string specifies only the name for the respective input or output; note that there are certain restrictions on names (see below). A dictionary describes a single input or output and can have the following keys (all keys are optional; a usage sketch is given after the list):

  • "name": contains the name (str or unicode) for this input or output. If this key is omitted, default names will be saved to the model: "x[i]" for inputs, "f[i]" for outputs, where i is the index of the respective column in the training samples.
  • "description": contains a brief description, any text (str or unicode).
  • "quantity": physical quantity, for example "Angle" or "Energy" (str or unicode).
  • "unit": measurement units used for this input or output, for example "deg" or "J" (str or unicode).
  • "min" and "max" (added in 6.16): in x_meta, specify the lower and upper bounds (float) for an input. If some of these keys is omitted, the input is unbound in the respective direction. In y_meta, specify the low and high threshold for an output (added in 6.17). If a key is omitted, the respective threshold is not changed (note that there are no thresholds by default). Also, -Inf and Inf can be used to remove the low and high thresholds with modify().

Names of inputs and outputs must satisfy the following rules:

  • Name must not be empty.
  • All names must be unique. The same name for an input and an output is also prohibited.
  • The only whitespace character allowed in names is the ASCII space, so \t, \n, \r, and various Unicode whitespace characters are prohibited.
  • Name cannot contain leading or trailing spaces, and cannot contain two or more consecutive spaces.
  • Name cannot contain leading or trailing dots, and cannot contain two or more consecutive dots, since dots are commonly used as name separators.
  • Parts of the name separated by dots must not begin or end with a space, so the name cannot contain '. ' or ' .'.
  • Name cannot contain control characters and Unicode separators. Prohibited Unicode character categories are: Cc, Cf, Cn, Co, Cs, Zl, Zp, Zs.
  • Name cannot contain characters from this set: :"/\|?*.

You can also add or edit metainformation after training the model (see modify()). Note that modify() returns a new modified model, which is identical to the original except for your edits.

Since version 6.17, if you use an initial model in training, its metainformation is copied to the trained model by default. New information specified in x_meta and y_meta then updates the existing information, so you can use these parameters to edit metainformation when updating a model.

Comments and descriptions of model inputs and outputs are kept when you export the model code to C and other common formats (see export_to()) or to a Functional Mock-up Unit (see export_fmi_cs(), export_fmi_me()).

  • Comment:
    • If using export_to(), comment is added to the comment in the exported code.
    • If using export_fmi_cs(), comment is copied to the description attribute of the <fmiModelDescription> element, unless you specify another description using the meta parameter.
  • Input and output descriptions:
    • If using export_to(), the descriptions are added to the comment in the exported code.
    • If using export_fmi_cs(), the descriptions are used to set properties (name, quantity, unit) of the FMU variables. Note that the method also allows setting variable properties with its inputs_meta and outputs_meta parameters. Descriptions stored with the model are used together with inputs_meta and outputs_meta, but if some property is specified by both, information from the parameters takes priority.

4.6.9. Model Details

Trained GTApprox models contain a detailed description which is stored in the details dictionary. Its structure depends on the training technique and mode (a short example of reading some of these keys is given after the list):

  • details["Input Variables"] and details["Output Variables"] (list) — descriptions of model’s inputs and outputs (added in 6.14). Values under these keys are lists of length size_x and size_f respectively. Each element of the list is a dictionary describing a single input or output. See Input and Output Descriptions for full details.

  • details["Technique"] (str) — the technique that was actually used to train the model. If the technique was specified manually (using the GTApprox/Technique option), this key will show the short name of the technique (same as the GTApprox/Technique value). If you did not specify the technique (automatic selection was used), there are two possible cases:

    • The model was trained using a single technique. In this case the value will be the short name of this automatically selected technique.
    • The model includes submodels (component models) trained using different techniques. In this case the value will be "Composite". Further details on composite models can be found in model decomposition information (see below).
  • details["Training Options"] (dict) — all options that were set to non-default values when training (added in 6.6). This will include both options that were set manually and options that were optimized by smart training, if it was used. The value under this key is a dictionary, its keys are option names, and values are the option values. If all options were default, the value is an empty dictionary.

  • details["Training Hints"] (dict) — smart training hints (added in 6.8). This key is present only if the model was obtained from smart training. The value under this key is a dictionary, its key are hint names, and values are the specified hint values. If no hints were specified, the dictionary is empty.

  • details["Issues"] (dict) — all training warnings extracted from the model’s build_log (added in 6.16). A key in the issues dictionary is a string identifying the source of a warning, while a value under this key is a list containing all warnings (as strings) collected from this source. Different warning sources are usually separate submodels related to a specific output and/or combination of levels of categorical variables, for example: "[output #3: categorical input x[2]=[3.]]". If there are no submodels, all warnings are stored under the "general" key. The "general" key is also used for non-specific warnings.

  • details["Training Dataset"] (dict) — detailed information on the training dataset (added in 6.0). See Training Dataset Information for a full description.

  • details["Training Time"] (dict) — model training start, finish, and total time (added in 6.14.3). See Training Time.

  • details["Regression Model"] (dict) — detailed structure information for RSM models. This key is present only if the model was trained using the RSM technique. See Regression Model Information for a full description.

  • details["Input Constraints"] (dict) — model input constraints which describe the model’s input domain (added in 6.16). This key is present only if the model was trained with an option which limits the model’s input domain — for example, non-default GTApprox/InputDomainType, or GTApprox/OutputNanMode set to "predict". See Input Constraints for a full description.

  • details["Output Constraints"] (dict) — model output constraints which describe dependencies between model outputs (added in 6.16) and output thresholds (added in 6.17). This key is present only in the following cases:

    See Output Constraints for a full description.

  • details["Model Decomposition"] (list) — detailed structure information for composite models (added in 6.8). This key is present only if the model contains submodels. This is the case when there are categorical variables in the model, or the model has multidimensional output and was trained in component-wise mode. See Model Decomposition for a full description.

4.6.9.1. Input and Output Descriptions

New in version 6.14.

Changed in version 6.22: added information about categorical outputs.

Changed in version 6.25: added new input and output variability types: set (input) and piecewise constant (output).

Descriptions under details["Input Variables"] and details["Output Variables"] contain metainformation about model’s inputs and outputs. Some information is added automatically by GTApprox, and some is available only if it was specified manually — for example, using the x_meta and y_meta arguments to build(), or using modify().

Structure of input and output descriptions is generally similar: the value under the details["Input Variables"] key is a list of length size_x (model’s input dimension), the value under the details["Output Variables"] key is a list of length size_f (model’s output dimension). Each element of the list is a dictionary describing a single input or output. List order follows the order of columns in the training data samples.

A dictionary describing the i-th input, found as details["Input Variables"][i], has the following keys:

  • "name" (str) — contains the variable’s name. This key always exists. If a name for the variable was never specified, a default name (x[i]) is stored here.
  • "description" (str) — contains a brief description for the variable. This key exists only if the description was specified by user.
  • "quantity" (str) — physical quantity of the variable. This key exists only if variable’s quantity was specified by user.
  • "unit" (str) — measurement units used for the variable. This key exists only if measurement units were specified by user.
  • "variability" (str) — type of the variable, automatically determined by GTApprox. This key always exists. Its value can be:
    • "continuous" — a generic continuous variable.
    • "enumeration" — a categorical variable (see Categorical Variables). Levels of a categorical variable are found under the "enumerators" key.
    • "neglected" — indicates a variable which was not used when training the model — for example, a variable which value in the training sample is constant (a constant column). Generally such variables are ignored by the model and have only formal meaning. Note that constant columns can appear due to input rounding — see Sample Cleanup for details.
    • "set" — a variable that is not categorical or constant by meaning yet takes only values from a finite set (such as a stepped variable). Valid values are listed under the "enumerators" key.
  • "enumerators" (list) — contains a list of valid input values (float). This key exists only if input variability is "enumeration" or "set".

A dictionary describing the j-th output, found as details["Output Variables"][j], contains the following keys:

  • "name" (str) — contains the name of the output. This key always exists. If a name for the output was never specified, a default name (f[j]) is stored here.
  • "description" (str) — contains a brief description for the output. This key exists only if the description was specified by user.
  • "quantity" (str) — physical quantity of the output. This key exists only if quantity was specified by user.
  • "unit" (str) — measurement units used for the output. This key exists only if measurement units were specified by user.
  • "variability" (str) — type of the output, automatically determined by GTApprox. This key always exists. Its value can be:
    • "continuous" — a generic continuous function.
    • "constant" — a constant output. The value of a constant output value is found under the "value" key.
    • "enumeration" — a categorical output. Levels of a categorical output are found under the "enumerators" key.
    • "piecewise constant" — an output that is not categorical or constant by meaning yet takes only values from a finite set (such as a stepped function). Possible output values are listed under the "enumerators" key.
  • "value" (float) — only for constant outputs, contains the output value.
  • "enumerators" (list) — contains a list of possible output values (float or str). This key exists only if output variability is "enumeration" or "piecewise constant".

4.6.9.2. Training Dataset Information

The details["Training Dataset"] dictionary contains various information about the data used to train the model.

  • details["Training Dataset"]["Total Points Number"] (int) — the total number of points in the dataset.
  • details["Training Dataset"]["Ambiguous Points Number"] (int) — the number of points with ambiguous output values — that is, points with the same input values but different outputs.
  • details["Training Dataset"]["Duplicate Points Number"] (int) — the number of exact duplicates.
  • details["Training Dataset"]["Accuracy"] (dict) — contains model error values calculated on the training dataset. See Accuracy for details.
  • details["Training Dataset"]["Sample Statistics"] (dict) — descriptive statistics of the training sample. See Sample Statistics for details.
  • details["Training Dataset"]["Test Sample Statistics"] (dict) — if the training dataset included a separate test sample, this dictionary contains its descriptive statistics (added in 6.8). Dictionary structure is the same as in details["Training Dataset"]["Sample Statistics"].

4.6.9.2.1. Accuracy

Changed in version 6.22: added an error metric for categorical outputs.

The details["Training Dataset"]["Accuracy"] dictionary contains componentwise and aggregated model errors and statistics.

  • details["Training Dataset"]["Accuracy"]["Componentwise"] (dict) — errors calculated individually for each output component. All values in this dictionary are 1D NumPy arrays (float), array length is equal to the number of model’s output components. Keys are:
    • "Count" — the number of numeric output values per component.
    • "Inf Count" — the number of infinite output values in the training sample, per each output component.
    • "NaN Count" — the number of NaN output values in the training dataset, per each output component.
    • "Unpredicted NaN Count" — if handling of NaN output values was enabled when training the model (see GTApprox/OutputNanMode), this is the number of points (per output component) for which this model’s component predicts a numeric value while training data contains a NaN value.
    • "False NaN predictions" — likewise, the number of points (per output component) where this component predicts NaN while training data contains a numeric value.
    • Error metrics for continuous outputs: "Max", "Mean", "Median", "Q_0.95", "Q_0.99", "RMS", "RRMS", and "R^2". Contain corresponding errors for each of the continuous outputs. For categorical outputs, values of these metrics are always NaN. For metric definitions, see section Error metrics.
    • Error metric for categorical outputs: "LogLoss". The sample mean cross-entropy loss value for each of the categorical outputs (smaller is better). For continuous outputs, this metric is always NaN.
  • details["Training Dataset"]["Accuracy"]["Aggregate"] — the same statistics and errors aggregated in three different ways: taking maximum, mean, and root-mean-squared values of componentwise errors, excluding NaN values (see Aggregated errors). Keys are the same as above, values are also NumPy arrays but contain a single element only.

4.6.9.2.2. Sample Statistics

The details["Training Dataset"]["Sample Statistics"] and (if present) details["Training Dataset"]["Test Sample Statistics"] dictionaries each contain two dictionaries with statistics for the input and output parts of the respective data sample:

  • details["Training Dataset"]["Sample Statistics"] (dict) — training sample statistics.
    • details["Training Dataset"]["Sample Statistics"]["Input"] (dict) — input sample statistics.
    • details["Training Dataset"]["Sample Statistics"]["Output"] (dict) — output sample statistics.
  • details["Training Dataset"]["Test Sample Statistics"] (dict) — test sample statistics (only if the test sample was included in training data).
    • details["Training Dataset"]["Test Sample Statistics"]["Input"] (dict) — input sample statistics.
    • details["Training Dataset"]["Test Sample Statistics"]["Output"] (dict) — output sample statistics.

Every dictionary (train, test, input, output) includes the same set of statistics. All values in these dictionaries are NumPy arrays with lengths equal to the sample dimension (so all statistics are component-wise). Keys (statistics) are:

  • "Count" — the number of numeric values for each component.
  • "Inf Count" — the number of infinite values per component.
  • "NaN Count" — the number of NaN values per component.
  • "Min" — minimum value of this component found in the sample.
  • "Max" — maximum value of this component found in the sample.
  • "Range" — the difference between the minimum and maximum values.
  • "Mean" — mean component value.
  • "Median" — component median.
  • "Q1" — first quartile of the component. 25% of sample values are lower than this value.
  • "Q3" — third quartile of the component. 75% of sample values are lower than this value.
  • "IQR" — interquartile range, the difference between Q1 and Q3.
  • "Q_0.01" — first percentile (only 1% of values in the sample are lower than this value).
  • "Q_0.05" — fifth percentile (5% values are lower than this).
  • "Q_0.95" — 95-th percentile (95% values are lower than this).
  • "Q_0.99" — 99-th percentile (99% of values in the sample are lower).
  • "Std" — standard sample deviation for each component.
  • "Var" — sample variance for each component.

4.6.9.3. Training Time

New in version 6.14.3.

The details["Training Time"] dictionary contains training time statistics in human-readable format:

  • details["Training Time"]["Start"] (str) — training start time.
  • details["Training Time"]["Finish"] (str) — finish time.
  • details["Training Time"]["Total"] (str) — the difference between the start and finish times.

Note that the total is wall-clock time, which may differ from the time actually spent in training. For example, if you run training on a laptop and it enters suspend mode (sleeps) during training, the suspend period is included in the total time, even though training was actually paused.

Information includes date and time up to milliseconds, for example:

{u'Finish': u'2018-11-01 17:55:44.119000',
 u'Start': u'2018-11-01 17:55:43.778000',
 u'Total': u'0:00:00.341000'}

4.6.9.4. Regression Model Information

The details["Regression Model"] dictionary is present in details only if the model was trained with the RSM technique. Contents of this dictionary are:

  • details["Regression Model"]["categorical"] (dict) — informs which model variables are categorical, and what values are allowed for them. Keys are 0-based indices of variables, values are tuples of categorical values.
  • details["Regression Model"]["model"] (numpy.array) — a matrix describing model terms. Each column corresponds to an input variable, and each row describes a single term (see the example below).
  • details["Regression Model"]["terms"] (tuple) — lists model terms in a human-readable format.
  • details["Regression Model"]["weights"] (numpy.array) — a matrix of term weight coefficients for each model output. Each column corresponds to a term, and each row specifies term weights for a single output.

For example, consider the following excerpt from a details printout:

{
 'Regression Model':
 {
   'categorical': {1: (3.0, 5.0, 11.0)},
   'model': array([[  0.,  nan],
                   [  1.,  nan],
                   [  2.,  nan],
                   [  0.,   5.],
                   [  0.,  11.]]),
   'terms': ('1', 'x[0]', 'x[0] * x[0]', 'I(x[1], 5.0)', 'I(x[1], 11.0)'),
   'weights': array([[  4.60000000e+00,  -2.73000000e+00,   3.47308902e-13,  -1.31534390e-13,   3.19255476e-15],
                     [  3.00000000e+00,  -1.00000000e+00,  -4.63691057e-12,   2.00000000e+00,   8.00000000e+00]])
 },
 'Technique': 'RSM',
 'Training Dataset':
 {
   ...
 }
}

It describes a model with 2 input variables (the model matrix contains 2 columns) and 2 outputs (2 rows in the weights matrix). The second variable (indexed 1) is categorical with allowed values 3, 5, and 11.

For the purpose of this example, let us use the following as a general definition of an RSM model:

\[ \begin{align}\begin{aligned}y_k(\mathbf{x}) = \sum_{i=1}^N w_{ki} q_i(\mathbf{x}), ~ k = \{1, 2, \dots, D\},\\\mathbf{x} = \{x_j\}, ~ j = \{1, 2, \dots, d\},\end{aligned}\end{align} \]

where:

  • \(q_i\) are terms, listed in the terms tuple,
  • \(w_{ki}\) are term weight coefficients from the weights matrix (i-th term weight for the k-th output),
  • \(N\) is the number of model terms,
  • \(d\) is the input dimension (the number of model variables \(x_j\)), and
  • \(D\) is the output dimension.

When reading details, you can easily reconstruct the model in analytic form using the terms tuple and the weights matrix. In the terms tuple, '1' denotes the constant term; 'x[0]', 'x[1]' and so on are variables; a string like 'I(x[1], 5.0)' denotes an indicator function \(I\) of a categorical variable defined as

\[ I(x, t) = \begin{cases} 1, & x = t, \\ 0, & x \neq t. \end{cases} \]

Note

If names of variables were specified when training the model (see Model Metainformation, Input and Output Descriptions), these names are used in the terms tuple instead of default ones ('x[0]', 'x[1]' and so on).

In the weights matrix, each row corresponds to a single model output and contains term weights for this output (so there are \(D\) rows and \(N\) columns).

Thus the model shown in this example is (omitting the negligibly small terms and rounding the weights for readability):

\[\begin{split}\begin{array}{l} y_1(\mathbf{x}) = 4.6 - 2.73 x_1, \\ y_2(\mathbf{x}) = 3 - x_1 + 2 I(x_2, 5) + 8 I(x_2, 11) \end{array}\end{split}\]

Note

For reference, in the true function modelled in this example, the second output is \(y_2^*(\mathbf{x}) = -x_1 + x_2\). Note how the linear term \(x_2\) (categorical variable) has been replaced with a sum of indicator functions and has produced a constant term which is actually one of the valid values (3) for \(x_2\). Using the indicator notation, \(y_2^*\) could also be written as \(y_2^*(\mathbf{x}) = -x_1 + 3 I(x_2, 3) + 5 I(x_2, 5) + 11 I(x_2, 11)\).

The model matrix is intended to be processed programmatically.

As noted above, each column in this matrix corresponds to an input variable, and each row corresponds to a single term. So, the i-th row describes how different variables contribute to the term \(q_i\), and the j-th column describes how the variable \(x_j\) contributes to different terms. Matrix elements \(p_{ij}\) have different meanings for continuous and categorical variables:

  • For a continuous variable \(x_j\), \(p_{ij}\) is the power of this variable in the i-th term. By convention, any value raised to the power 0 equals 1 (\(x_j^0 = 1\) regardless of the value of \(x_j\)), which effectively means that the variable does not contribute to the term. With this assumption, if all model variables are continuous, its terms \(q_i\) can be written simply as \(q_i(\mathbf{x}) = \prod_{j=1}^d x_j^{p_{ij}}\), where \(d\) is the input dimension.
  • For a categorical variable \(x_j\), a NaN value of \(p_{ij}\) means that the variable does not contribute to the i-th term. A numeric \(p_{ij}\) should be understood as the test value \(t\) for the indicator function \(I\) (see above), meaning that the term \(q_i\) includes the indicator \(I(x_j, p_{ij})\).

So, the first row of the model matrix describes a constant term: [0., nan] means that neither of the two variables contribute to this term. Its weights are found in the first column of the weights matrix: 4.6 for the first output and 3.0 for the second.

The second row of the model matrix, [1., nan], describes the linear term \(x_1^1\); the categorical variable \(x_2\) does not contribute to this term. Term weights are found in the second column of the weights matrix: -2.73 for the first output and -1.0 for the second.

The third row of the model matrix describes a quadratic term \(x_1^2\), the \(x_2\) variable does not contribute to this term. Since its weights are so small compared to other terms’ weights, this third term can probably be neglected (this decision depends on its physical meaning).

Next two terms include the categorical variable \(x_2\) but do not depend on the continuous variable \(x_1\): row [0., 5.] in the model matrix gives the term \(x_1^0 I(x_2, 5)\), which is denoted 'I(x[1], 5.0)' in the terms tuple; row [0., 11.] in the model matrix gives the term \(x_1^0 I(x_2, 11)\).
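
The following sketch shows how the model and weights matrices could be used to evaluate the reconstructed model programmatically. It illustrates the conventions described above (the rounded weights are taken from the analytic form); it is not library code, and evaluate_rsm is a hypothetical helper.

import numpy as np

def evaluate_rsm(x, model_matrix, weights, categorical_columns):
    # Evaluate an RSM model reconstructed from details['Regression Model'].
    terms = []
    for row in model_matrix:                  # one row per term q_i
        term = 1.0
        for j, p in enumerate(row):
            if j in categorical_columns:
                if not np.isnan(p):           # NaN: variable does not contribute
                    term *= 1.0 if x[j] == p else 0.0   # indicator I(x_j, p)
            else:
                term *= x[j] ** p             # power of a continuous variable
        terms.append(term)
    return np.asarray(weights).dot(terms)     # one value per output

# Matrices from the example above (weights rounded as in the analytic form):
model_matrix = np.array([[0., np.nan], [1., np.nan], [2., np.nan], [0., 5.], [0., 11.]])
weights = np.array([[4.6, -2.73, 0.0, 0.0, 0.0],
                    [3.0, -1.0, 0.0, 2.0, 8.0]])
print(evaluate_rsm([1.5, 5.0], model_matrix, weights, categorical_columns={1}))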

4.6.9.5. Input Constraints

New in version 6.16.

The details["Input Constraints"] dictionary is present in details only if the model has a limited input domain. By default, GTApprox models have no input constraints; these constraints appear only when the model is trained with some option which limits the input domain — for example, when GTApprox/InputDomainType is used, or when GTApprox/OutputNanMode is set to "predict" and the training sample contains NaN values in outputs. A model with input constraints returns NaN values for input points which are out of the domain. Note that if the model has independent outputs, the submodels which evaluate these outputs can have different input domains (see Output Dependency Modes, Model Decomposition) — so it is possible that for a given input point some model outputs return NaN while other return numeric values. In this case, the model’s input domain is considered to be a union of input domains if all submodels.

Each input constraint is expressed in the following form:

\[ \begin{align}\begin{aligned}l_k \le \sum_{i=1}^{N} w_{ki} q_i(\mathbf{x}) \le u_k, ~ k = \{1, 2, \dots, N_c\},\\\mathbf{x} = \{x_j\}, ~ j = \{1, 2, \dots, d\}\end{aligned}\end{align} \]

where:

  • \(N_c\) is the total number of input constraints,
  • \(l_k\) and \(u_k\) are the k-th constraint’s lower and upper bounds respectively,
  • \(q_i\) are constraint terms,
  • \(N\) is the total number of terms,
  • \(w_{ki}\) are term weight coefficients (i-th term weight in the k-th constraint), and
  • \(d\) is the input dimension (the number of model variables \(x_j\)).

Contents of the details["Input Constraints"] dictionary are:

  • details["Input Constraints"]["terms"] (tuple) — lists constraint terms \(q_i\) in a human-readable format.
  • details["Input Constraints"]["categorical"] (dict) — informs which input variables \(x_j\) are categorical, and what values are allowed for them. Keys are 0-based indices of variables, values are tuples of categorical values.
  • details["Input Constraints"]["weights"] (numpy.array) — a matrix of term weight coefficients \(w_{ki}\). Each column corresponds to a term (\(N\) columns), and each row specifies term weights for a single constraint (\(N_c\) rows).
  • details["Input Constraints"]["lower_bound"] (numpy.array) — a vector of lower bounds \(l_k\), contains \(N_c\) elements. Note that -inf is a valid lower bound value, meaning an absent bound.
  • details["Input Constraints"]["upper_bound"] (numpy.array) — a vector of upper bounds \(u_k\), contains \(N_c\) elements. Note that +inf is a valid upper bound value, meaning an absent bound.
  • details["Input Constraints"]["model"] (numpy.array) — a matrix describing model terms \(q_i\) (see the example below). Each column corresponds to an input variable (\(d\) columns), and each row describes a single term (\(N\) rows).
  • details["Input Constraints"]["rpn_formula"] (list) — defines a logical formula, which describes how the constraints are applied. The formula is written in reverse Polish notation.

For example, consider the following excerpt from a details printout:

{
  'Input Constraints':
  {
    'terms': ('x[0]*x[0]', 'x[0]', 'x[1]*x[1]', 'x[1]', '1'),
    'categorical': {},
    'lower_bound': array([ 0.03942652,  0.03942652]),
    'upper_bound': array([ inf,  inf]),
    'weights': array([[ 7.88530466, -3.15412186, 7.88530466, -3.15412186, 0.63082437],
                      [ 7.88530466, -12.61648746, 7.88530466, -12.61648746, 10.09318996]]),
    'rpn_formula': [0, 1, 'and'],
    'model': array([[ 2.,  0.],
                    [ 1.,  0.],
                    [ 0.,  2.],
                    [ 0.,  1.],
                    [ 0.,  0.]])
  },
  'Input Variables': [ ... ]
  ...
}

It describes 2 input constraints with 5 terms (the weights matrix contains 2 rows and 5 columns) for a model with 2 input variables (the model matrix contains 2 columns).

When reading details, you can easily reconstruct constraints in analytic form, using the terms tuple, weights matrix, and arrays of bounds. In the terms tuple, '1' denotes the constant term; 'x[0]', 'x[1]' and so on are input variables.
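
For illustration, the following minimal sketch (a hypothetical helper; constraints is assumed to be the details["Input Constraints"] dictionary) prints each constraint as a weighted sum of the terms together with its bounds:

import numpy as np

def print_input_constraints(constraints):
  terms = constraints["terms"]
  weights = np.asarray(constraints["weights"])
  lower = np.asarray(constraints["lower_bound"])
  upper = np.asarray(constraints["upper_bound"])
  for k, row in enumerate(weights):
    # Skip zero-weight terms to keep the printout readable.
    expr = " + ".join("(%g)*%s" % (w, t) for w, t in zip(row, terms) if w != 0.0)
    print("%g <= %s <= %g" % (lower[k], expr, upper[k]))

For the example printout above, this yields the two constraints written out in analytic form further below.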

Note

If names of variables were specified when training the model (see Model Metainformation, Input and Output Descriptions), these names are used in the terms tuple instead of default ones ('x[0]', 'x[1]' and so on).

In the weights matrix, each row corresponds to a single constraint and contains term weights for this constraint (so there are \(N_c\) rows and \(N\) columns). Constraint bounds follow the order of constraints in the weights matrix.

Thus the example constraints are:

\[\begin{split}\begin{array}{l} 0.03942652 \le 7.88530466 x_1^2 - 3.15412186 x_1 + 7.88530466 x_2^2 - 3.15412186 x_2 + 0.63082437 \\ 0.03942652 \le 7.88530466 x_1^2 - 12.61648746 x_1 + 7.88530466 x_2^2 - 12.61648746 x_2 + 10.09318996 \end{array}\end{split}\]

Or in common form:

\[\begin{split}\begin{array}{l} (x_1-0.2)^2 + (x_2-0.2)^2 \ge 0.005 \\ (x_1-0.8)^2 + (x_2-0.8)^2 \ge 0.005 \end{array}\end{split}\]

The model function in this example has undefined behavior at \([0.2, 0.2]\) and \([0.8, 0.8]\), so the constraints exclude these points and their vicinity from the model’s input domain.

The final step is to understand how multiple constraints are applied: the list under the "rpn_formula" key defines a logical formula, which has to be evaluated to determine whether a given point is inside the model’s input domain. In the simple example above, both constraints must be satisfied simultaneously. This is denoted by the formula [0, 1, 'and']: 0 and 1 are constraint indices, 'and' is the logical operator applied to these constraints (the formula uses the reverse Polish notation). For the general explanation, see section Constraints Formula.

Note

Categorical variables in input constraints are replaced with indicator functions \(I(x, t)\) defined the same as in the regression model information for RSM models. For an example of their interpretation, see section Regression Model Information.

The model matrix is intended to be processed programmatically.

As noted above, each column in this matrix corresponds to an input variable, and each row corresponds to a term. So, the i-th row describes how different variables contribute to the term \(q_i\), and the j-th column describes how the variable \(x_j\) contributes to different terms. Matrix elements \(p_{ij}\) have different meanings for continuous and categorical variables:

  • For a continuous variable \(x_j\), \(p_{ij}\) is the power of this variable in the i-th term. By convention, any value raised to the power 0 equals 1 (\(x_j^0 = 1\) regardless of the value of \(x_j\)), which effectively means that the variable does not contribute to the term. With this assumption, if all model variables are continuous, constraint terms \(q_i\) can be written simply as \(q_i(\mathbf{x}) = \prod_{j=1}^d x_j^{p_{ij}}\), where \(d\) is the model’s input dimension.
  • For a categorical variable \(x_j\), a NaN value of \(p_{ij}\) means that the variable does not contribute to the i-th term. A numeric \(p_{ij}\) should be understood as the test value \(t\) for the indicator function \(I\) (see above), meaning that the term \(q_i\) includes the indicator \(I(x_j, p_{ij})\).

So, the first row of the model matrix describes a quadratic term: [2., 0.] translates to \(x_1^2 x_2^0\) (compare it with the first element of the terms tuple). The weight of this term is found in the first column of the weights matrix and equals 7.88530466 for both constraints. Likewise, the second and following rows of the model matrix give terms \(x_1^1 x_2^0\), \(x_1^0 x_2^2\), \(x_1^0 x_2^1\), and \(x_1^0 x_2^0\).
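
To show how these pieces fit together, here is a minimal sketch (hypothetical helper names; it assumes that all model variables are continuous, so categorical indicator terms are not handled) which evaluates the terms \(q_i\) from the model matrix and tests a point against each constraint:

import numpy as np

def input_constraint_predicates(constraints, x):
  x = np.asarray(x, dtype=float)
  powers = np.asarray(constraints["model"])      # N terms x d variables
  weights = np.asarray(constraints["weights"])   # N_c constraints x N terms
  lower = np.asarray(constraints["lower_bound"])
  upper = np.asarray(constraints["upper_bound"])
  q = np.prod(x ** powers, axis=1)               # q_i(x) = prod_j x_j^p_ij, with x_j^0 = 1
  values = weights.dot(q)                        # one weighted sum per constraint
  return (lower <= values) & (values <= upper)   # Boolean predicate per constraint

For the example above, the point [0.2, 0.2] fails the first constraint (its weighted sum evaluates to approximately 0, below the lower bound 0.03942652) and satisfies the second one; combining the two predicates according to rpn_formula (see Constraints Formula) then places the point outside the input domain, as expected.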

4.6.9.6. Output Constraints

New in version 6.16.

The details["Output Constraints"] dictionary is present in details only in the following cases:

When output thresholds are applied, the model limits its output range: if it calculates some output value that exceeds the (low, high) threshold, it returns the (low, high) threshold value instead of the calculated one. Thus the model guarantees that the output value is always within bounds set by y_meta. Model gradients in such cases are set to 0, and accuracy estimates to NaN.

When the search for linear output dependencies is enabled, GTApprox analyzes the training sample to find such dependencies, and trains a model which keeps these dependencies between outputs.

Thresholds and output dependencies are expressed as linear constraints of the form:

\[l_k \le \sum_{i=1}^{D} w_{ki} f_i(\mathbf{x}) \le u_k ~ \forall \mathbf{x} \in \mathbb{D}, ~ k = \{1, 2, \dots, N_c\},\]

where:

  • \(N_c\) is the total number of output constraints,
  • \(l_k\) and \(u_k\) are the k-th constraint’s lower and upper bounds respectively,
  • \(w_{ki}\) are linear combination coefficients (for the i-th output in the k-th constraint),
  • \(f_i\) are model outputs,
  • \(D\) is the model’s output dimension (the number of outputs \(f_i\)), and
  • \(\mathbb{D}\) is the model’s input domain, which is unbounded by default (so \(\mathbb{D} \equiv \mathbb{R}^d\), where \(d\) is the model’s input dimension), but may also be constrained (see Input Constraints).

Contents of the details["Output Constraints"] dictionary are:

  • details["Output Constraints"]["terms"] (tuple) — lists constraint terms in a human-readable format.
  • details["Output Constraints"]["weights"] (numpy.array) — a matrix of linear combination coefficients \(w_{ki}\), which contains \(N_c\) rows, each row corresponds to a constraint.
  • details["Output Constraints"]["lower_bound"] (numpy.array) — a vector of lower bounds \(l_k\), contains \(N_c\) elements. Note that -inf is a valid lower bound value, meaning an absent bound.
  • details["Output Constraints"]["upper_bound"] (numpy.array) — a vector of upper bounds \(u_k\), contains \(N_c\) elements. Note that +inf is a valid upper bound value, meaning an absent bound.
  • details["Output Constraints"]["model"] (numpy.array) — a matrix specifying the linear combination terms in constraints.
  • details["Output Constraints"]["rpn_formula"] (list) — defines a logical formula, which describes how the constraints are applied. The formula is written in reverse Polish notation.

For example, consider the following excerpt from a details printout:

{
  'Output Constraints':
  {
    'terms': (u'f4', u'f1', u'f5', u'f2', u'f3'),
    'lower_bound': array([ 2.80000000e+00, -1.11022302e-15]),
    'upper_bound': array([ 2.80000000e+00, -1.11022302e-15]),
    'weights': array([[ 1. , -0.5,  0. ,  0. ,  0. ],
                      [ 0. ,  0. ,  1. , -1.2, -0.7]]),
    'rpn_formula': [0, 1, 'and'],
    'model': array([[ 0.,  0.,  0.,  1.,  0.],
                    [ 1.,  0.,  0.,  0.,  0.],
                    [ 0.,  0.,  0.,  0.,  1.],
                    [ 0.,  1.,  0.,  0.,  0.],
                    [ 0.,  0.,  1.,  0.,  0.]]),
  },
  'Output Variables': [ ... ]
  ...
}

It describes 2 output constraints with 5 terms (the weights matrix contains 2 rows and 5 columns). When reading details, you can easily reconstruct the linear combinations for constraints, using the terms tuple and the weights matrix:

  • Apply coefficients from the first row of weights matrix to the terms tuple to get the first combination: \(f_4 - 0.5 f_1\). Note that terms come in the order of their appearance in the terms tuple, not in the “natural” order.
  • Apply coefficients from the second row: \(f_5 - 1.2 f_2 - 0.7 f_3\).
  • Apply bounds. Note that the lower and upper bounds are equal for each constraint, so these are equality constraints.

Note

When training the model used in this example, the output names were specified manually, so they match the \(f_i\) notations used in this section. If you do not specify output names, the terms tuple contains default zero-indexed names ('f[0]', 'f[1]' and so on). See Model Metainformation and Input and Output Descriptions for details.

Thus the constraints are:

\[\begin{split}\begin{array}{l} - 0.5 f_1 + f_4 = 2.8 \\ - 1.2 f_2 - 0.7 f_3 + f_5 = - 1.1 \cdot 10^{-15} \end{array}\end{split}\]

The final step is to understand how the model applies constraints to its outputs: the list under the "rpn_formula" key defines a logical formula, which has to be evaluated to determine whether a given point is inside the model’s output domain. In the simple example above, both constraints are always satisfied simultaneously. This is denoted by the formula [0, 1, 'and']: 0 and 1 are constraint indices, 'and' is the logical operator applied to these constraints (the formula uses the reverse Polish notation). For the general explanation, see section Constraints Formula.

Note

In this example, GTApprox has correctly identified output dependencies in the training data, which was generated as follows:

\[\begin{split}\begin{array}{l} f_1 = f_1(\mathbf{x}) \\ f_2 = f_2(\mathbf{x}) \\ f_3 = f_3(\mathbf{x}) \\ f_4 = 0.5 f_1 + 2.8 \\ f_5 = 1.2 f_2 + 0.7 f_3 \end{array}\end{split}\]

where \(f_1\), \(f_2\), \(f_3\) are some independent functions.

The model matrix is intended to be processed programmatically. Each column in this matrix corresponds to an output, and each row corresponds to a constraint term. Matrix element \(p_{mi}\) is the power of the output \(f_i\) in the m-th term. Since output constraints are linear, \(p_{mi}\) is either 0 or 1, and there can be only one non-zero element in each row (all terms are linear). Thus the model matrix for output constraints simply identifies which output each term stands for: for example, the first row [ 0.,  0.,  0.,  1.,  0.] means that the first term is \(f_4\), and so on.
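
Putting the pieces together, the following sketch (hypothetical helper names; f is assumed to be a vector of model output values in their natural order) uses the model matrix to evaluate the term values and then checks the constraints:

import numpy as np

def check_output_constraints(constraints, f, tol=1e-9):
  f = np.asarray(f, dtype=float)
  model = np.asarray(constraints["model"])       # N terms x D outputs, a single 1 per row
  weights = np.asarray(constraints["weights"])   # N_c constraints x N terms
  lower = np.asarray(constraints["lower_bound"])
  upper = np.asarray(constraints["upper_bound"])
  terms = model.dot(f)                           # picks the output value for each term
  values = weights.dot(terms)                    # linear combination per constraint
  # tol is an illustrative tolerance for checking the equality constraints.
  return (values >= lower - tol) & (values <= upper + tol)

For the example above, model.dot(f) returns the term values in the order \((f_4, f_1, f_5, f_2, f_3)\), so values[0] equals \(f_4 - 0.5 f_1\), which the model keeps equal to 2.8.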

4.6.9.7. Constraints Formula

If a model has multiple input (output) constraints, the final step in defining the model’s input (output) domain is to understand how different constraints are combined. The list under the "rpn_formula" key found in the input (output) constraints dictionary defines a logical formula, which has to be evaluated to determine whether a point belongs to the domain.

The evaluation procedure is the same for input and output constraints. Let \(\mathbf{z}\) denote a point in the input or output space (a set of model’s input or output values). As an example, consider the following:

  • A model with \(N_c = 7\) constraints \(C_k\), \(k = \{0, 1, \dots, N_c-1\}\).
  • A list defining the logical formula for these 7 constraints: [3, 0, 'or', 4, 1, 'or', 5, 2, 'or', 6, 1, 'or', 'and', 'and', 'and'].

Numeric elements of the list are indices of predicates \(P_k\), where \(P_k\) is a Boolean value which is true if \(\mathbf{z}\) satisfies the constraint \(C_k\) and false otherwise. String elements are logical operators which can be "and", "or", and "not". The formula is written in reverse Polish notation and is processed from left to right following the common algorithm, that is:

  • Begin with the following:
    • A list pre of Boolean predicates, calculated for \(\mathbf{z}\) by testing it against each constraint \(C_k\): pre[k] is True if \(\mathbf{z}\) satisfies \(C_k\) and False otherwise.
    • The formula list.
    • An empty stack of operands.
  • For each element in formula:
    • If the element is an index \(k\) (int), push pre[k] to the stack.
    • If the element is an operator (string):
      • If it is a binary operator ("and", "or"):
        • Pop 2 operands from the stack.
        • Evaluate the operation with these two operands.
        • Push the result to the stack.
      • If it is a unary operator ("not"):
        • Pop 1 operand from the stack.
        • Evaluate the operation.
        • Push the result to the stack.
  • Finish: pop the final result \(\mathbf{F}\) from the stack. The stack always contains a single element after the formula list is processed.

The following is an example implementation of this algorithm:

def evaluate_formula(formula, predicates):
  # predicates[k] is True if the point satisfies the constraint C_k.
  stack = []
  for op in formula:
    if op == "and":
      # Pop both operands first; "stack.pop() and stack.pop()" would skip
      # the second pop when the first operand is False (short-circuit),
      # corrupting the stack.
      rhs, lhs = stack.pop(), stack.pop()
      stack.append(lhs and rhs)
    elif op == "or":
      rhs, lhs = stack.pop(), stack.pop()
      stack.append(lhs or rhs)
    elif op == "not":
      stack.append(not stack.pop())
    else:
      # A numeric element is a constraint index: push its predicate.
      stack.append(predicates[op])
  return stack.pop()

So the example list defines the following condition: (\(P_3\) or \(P_0\)) and (\(P_4\) or \(P_1\)) and (\(P_5\) or \(P_2\)) and (\(P_6\) or \(P_1\)), where \(P_k\) means “\(\mathbf{z}\) satisfies \(C_k\)“.
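
For instance, with hypothetical predicate values (an arbitrary illustration, not tied to any particular model) the example formula can be evaluated with the function above:

# Hypothetical predicate values for the 7 constraints C_0 .. C_6.
pre = [False, True, False, True, False, True, True]
formula = [3, 0, 'or', 4, 1, 'or', 5, 2, 'or', 6, 1, 'or', 'and', 'and', 'and']
# (P3 or P0) and (P4 or P1) and (P5 or P2) and (P6 or P1) -> True
print(evaluate_formula(formula, pre))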

The final assertions then are:

  • If \(\mathbf{z}\) is an input point: \(\mathbf{z}\) is in the model’s input domain if and only if \(\mathbf{F}\) is True. That is, the model requires \(\mathbf{F}\) to be True for its input, and if this requirement is not met, model output evaluates to NaN. For models with independent outputs, this assertion has to be considered separately for each output, as the submodels which evaluate different outputs can have different input domains.
  • If \(\mathbf{z}\) is an output point: \(\mathbf{F}\) is True for \(\mathbf{z}\). That is, the model guarantees that \(\mathbf{F}\) is True for every output returned by the model.

4.6.9.8. Model Decomposition

Model decomposition information is present in details only for composite models — those that contain several submodels. Submodels are created for each output component (in the component-wise training mode), and also for each unique combination of values of categorical variables. Each submodel is trained independently, so their training techniques, options, and other details may differ.

Elements of the details["Model Decomposition"] list are dictionaries; each submodel adds its own dictionary to the list. A dictionary for the k-th submodel, found as details["Model Decomposition"][k], has the following keys (a usage sketch follows the list):

  • "Categorical Variables" — a list of 0-based indices of categorical variables. Can be empty if there are none. Note that RSM models represent categorical variables using indicator functions (see Regression Model Information), so the original categorical variables will not be listed here if entire model is trained using the RSM technique (the list will be empty).
  • "Categorical Signature" — the combination of values of categorical variables that identifies this submodel. Can be empty, also always empty for RSM models.
  • "Dependent Outputs" — a list 0-based indices of output components predicted by this submodel. Note that the model can be trained in the dependent outputs mode but still include submodels due to the presence of categorical variables — so the value under this key is a list despite by default it contains a single number (component-wise training is the default mode).
  • "Technique" — a string naming the technique that was actually used to train this submodel.
  • "Training Options" — non-default options selected when training this submodel. Dictionary, keys are option names, values are option values.
  • "Training Dataset" — descriptive statistics of the data sample that was used to train the submodel. A dictionary with the same structure as details["Training Dataset"], except that it does not include any accuracy data — the accuracy on the training dataset is tested for the final model only).
  • "Regression Model" — the same as details["Regression Model"] (see Regression Model Information), but for a specific submodel. Present only if the submodel was trained using the RSM technique.
  • "Input Constraints" — input constraints for a specific submodel. The key is present only if the submodel has a limited input domain. Data structure is the same as in details["Input Constraints"] (see Input Constraints).
  • "Output Constraints" — output constraints for a specific submodel. The key is present only if the search for linear dependencies in outputs was performed for this submodel. Data structure is the same as in details["Output Constraints"] (see Output Constraints).

4.6.9.9. Structure Reference

Full structure of details, excluding the inner details of the accuracy and sample statistics dictionaries (described in sections Accuracy and Training Dataset Information). Note that many of these keys are optional, and certain keys can contain empty values such as empty dictionaries or lists.

  • details["Input Variables"]
    • details["Input Variables"][i] (the i-th input).
      • details["Input Variables"][i]["name"]
      • details["Input Variables"][i]["description"]
      • details["Input Variables"][i]["quantity"]
      • details["Input Variables"][i]["unit"]
      • details["Input Variables"][i]["variability"]
      • details["Input Variables"][i]["enumerators"]
  • details["Output Variables"]
    • details["Output Variables"][j] (the j-th output).
      • details["Output Variables"][j]["name"]
      • details["Output Variables"][j]["description"]
      • details["Output Variables"][j]["quantity"]
      • details["Output Variables"][j]["unit"]
      • details["Output Variables"][j]["variability"]
      • details["Output Variables"][j]["value"]
      • details["Output Variables"][j]["enumerators"]
  • details["Technique"]
  • details["Training Options"]
  • details["Training Hints"]
  • details["Issues"]
  • details["Training Time"]
    • details["Training Time"]["Start"]
    • details["Training Time"]["Finish"]
    • details["Training Time"]["Total"]
  • details["Regression Model"]
    • details["Regression Model"]["categorical"]
    • details["Regression Model"]["model"]
    • details["Regression Model"]["terms"]
    • details["Regression Model"]["weights"]
  • details["Input Constraints"]
    • details["Input Constraints"]["model"]
    • details["Input Constraints"]["categorical"]
    • details["Input Constraints"]["terms"]
    • details["Input Constraints"]["weights"]
    • details["Input Constraints"]["lower_bound"]
    • details["Input Constraints"]["upper_bound"]
    • details["Input Constraints"]["rpn_formula"]
  • details["Output Constraints"]
    • details["Output Constraints"]["model"]
    • details["Output Constraints"]["terms"]
    • details["Output Constraints"]["weights"]
    • details["Output Constraints"]["lower_bound"]
    • details["Output Constraints"]["upper_bound"]
    • details["Output Constraints"]["rpn_formula"]
  • details["Training Dataset"]
    • details["Training Dataset"]["Ambiguous Points Number"]
    • details["Training Dataset"]["Duplicate Points Number"]
    • details["Training Dataset"]["Total Points Number"]
    • details["Training Dataset"]["Accuracy"]
      • details["Training Dataset"]["Accuracy"]["Aggregate"]
      • details["Training Dataset"]["Accuracy"]["Componentwise"]
    • details["Training Dataset"]["Sample Statistics"]
      • details["Training Dataset"]["Sample Statistics"]["Input"]
      • details["Training Dataset"]["Sample Statistics"]["Output"]
    • details["Training Dataset"]["Test Sample Statistics"]
      • details["Training Dataset"]["Test Sample Statistics"]["Input"]
      • details["Training Dataset"]["Test Sample Statistics"]["Output"]
  • details["Model Decomposition"]
    • details["Model Decomposition"][k] (the k-th submodel).
      • details["Model Decomposition"][k]["Categorical Variables"]
      • details["Model Decomposition"][k]["Categorical Signature"]
      • details["Model Decomposition"][k]["Dependent Outputs"]
      • details["Model Decomposition"][k]["Technique"]
      • details["Model Decomposition"][k]["Training Options"]
      • details["Model Decomposition"][k]["Training Dataset"]
        • details["Model Decomposition"][k]["Training Dataset"]["Ambiguous Points Number"]
        • details["Model Decomposition"][k]["Training Dataset"]["Duplicate Points Number"]
        • details["Model Decomposition"][k]["Training Dataset"]["Total Points Number"]
        • details["Model Decomposition"][k]["Training Dataset"]["Sample Statistics"]
          • details["Model Decomposition"][k]["Training Dataset"]["Sample Statistics"]["Input"]
          • details["Model Decomposition"][k]["Training Dataset"]["Sample Statistics"]["Output"]
        • details["Model Decomposition"][k]["Training Dataset"]["Test Sample Statistics"]
          • details["Model Decomposition"][k]["Training Dataset"]["Test Sample Statistics"]["Input"]
          • details["Model Decomposition"][k]["Training Dataset"]["Test Sample Statistics"]["Output"]
      • details["Model Decomposition"][k]["Regression Model"]
        • details["Model Decomposition"][k]["Regression Model"]["categorical"]`
        • details["Model Decomposition"][k]["Regression Model"]["model"]`
        • details["Model Decomposition"][k]["Regression Model"]["terms"]`
        • details["Model Decomposition"][k]["Regression Model"]["weights"]`
      • details["Model Decomposition"][k]["Input Constraints"]
        • details["Model Decomposition"][k]["Input Constraints"]["model"]
        • details["Model Decomposition"][k]["Input Constraints"]["categorical"]
        • details["Model Decomposition"][k]["Input Constraints"]["terms"]
        • details["Model Decomposition"][k]["Input Constraints"]["weights"]
        • details["Model Decomposition"][k]["Input Constraints"]["lower_bound"]
        • details["Model Decomposition"][k]["Input Constraints"]["upper_bound"]
        • details["Model Decomposition"][k]["Input Constraints"]["rpn_formula"]
      • details["Model Decomposition"][k]["Output Constraints"]
        • details["Model Decomposition"][k]["Output Constraints"]["model"]
        • details["Model Decomposition"][k]["Output Constraints"]["terms"]
        • details["Model Decomposition"][k]["Output Constraints"]["weights"]
        • details["Model Decomposition"][k]["Output Constraints"]["lower_bound"]
        • details["Model Decomposition"][k]["Output Constraints"]["upper_bound"]
        • details["Model Decomposition"][k]["Output Constraints"]["rpn_formula"]

4.6.10. Model Gradients and AE Gradients

GTApprox can evaluate model gradients for a point or a sample. See grad() and grad_ae().

Also, GTApprox lets you enumerate the available gradient output modes:

  • \(\tt{F\_MAJOR}\): indexed in function-major order (\(grad_{ij}=\frac{d f_i}{d x_j}\)).
  • \(\tt{X\_MAJOR}\): indexed in variable-major order (\(grad_{ij}=\frac{d f_j}{d x_i}\)).

See GradMatrixOrder.
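
As a hypothetical illustration (values not tied to any particular model), the two orders are simply transposes of each other:

import numpy as np

# F_MAJOR gradient matrix for a model with 2 outputs and 3 inputs:
# element [i][j] is d f_i / d x_j.
grad_f_major = np.array([[1., 2., 3.],
                         [4., 5., 6.]])
# The same gradients in X_MAJOR order: element [i][j] is d f_j / d x_i.
grad_x_major = grad_f_major.T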

4.6.11. Model Export

GTApprox can save and load models using a binary file format, serialize a model, or export it to various formats.

4.6.12. Approximation Model Structure

This section describes the general internal structure of a model saved or serialized to a file.

A model file is divided into several sections containing different parts of model information. When reading an existing file, you can load the whole model, or select only specific sections in order to save memory or to load faster. Likewise, when saving a model, you can select which sections to save. This is controlled using the sections argument in load(), fromstring(), save(), and tostring(). To see which sections are available in a model, use available_sections().

The sections are listed below. Note that this list is not intended to describe the exact structure and section order, but only the available contents.

  • Main model section: contains the model binary, required to evaluate model values and gradients. This section is always saved, but you can skip it on load. For some models, the size of this section can be additionally reduced by removing the accuracy evaluation or smoothing information with modify().
  • Information section: full model information accessible via info.
  • Comment section: contains your text comment to the model, see comment.
  • Annotations section: contains the information accessible via annotations (extended comments or notes). Also, some models trained in older versions of pSeven Core and pSeven (prior to 6.14) use the annotations section to store the names of model inputs and outputs, which are included in the descriptions found in details under the "Input Variables" and "Output Variables" keys.
  • Training sample section: contains a copy of data used to train the model, see training_sample. Note that this section may be empty (if GTApprox/StoreTrainingSample was off when training the model). Also, in case of GBRT incremental training (see Incremental Training) only the last (most recent) training sample is saved here (if any).
  • Internal validation information section: contains model internal validation data, iv_info.
  • Log section: contains the text of model training log, build_log.

Note that depending on which sections are loaded, certain Model methods or attributes may not be available:

  • Main model section is required by any method that evaluates model values or gradients, since it contains the model binary (see above).
  • Annotations section is required by models trained in older versions of pSeven Core and pSeven (prior to 6.14) to get the names of inputs and outputs which are found in details. For such models, if the annotations section is not loaded, the names of inputs and outputs are replaced with defaults.
  • For RSM models only:
    • Reading detailed RSM model information found in details under the "Regression Model" key also requires loading the main model section. The rest of details for RSM models does not require any section and is included in a minimum model load (see below), except for models trained in versions prior to 6.14, which also require the annotations section to get the names of inputs and outputs.
  • Obviously, reading annotations, build_log, comment, info, or iv_info requires loading the respective sections.

Some minimum model information is available after load even if you do not load any of the sections. This information, just like the main model section, is also always saved with the model. The following attributes can be read after a minimum load:

  • Input and output dimension: size_x and size_f.
  • Model description in details — excluding the "Regression Model" key for RSM models, and the names of inputs and outputs for models trained in older versions of pSeven Core and pSeven (see above).
  • License information: license.
  • Feature flags: has_ae, has_smoothing, and is_smoothed.