# DFBuilder

Tag: Modeling

DFBuilder builds an approximation model from two training samples of different fidelity: a high fidelity sample and a low fidelity sample. The high fidelity sample is received on the x_hf_sample and f_hf_sample input ports, and the low fidelity sample on the x_lf_sample and f_lf_sample input ports. The model may be output to the model port, saved to disk, or both.

GTDF guide
The guide to the Generic Tool for Data Fusion (GTDF) — the pSeven Core advanced approximation component used by DFBuilder to train approximation models combining two training samples of different fidelity.

## Options

GTDF/Accelerator

Five-position switch to control the trade-off between speed and accuracy.

Value: integer in range $$[1, 5]$$

Default: 1

This option controls training time by changing the values of other options. If you later modify any of these dependent options, your value overrides the one set by GTDF/Accelerator.

Possible values are from 1 (low speed, highest quality) to 5 (high speed, lower quality).

GTDF/AccuracyEvaluation

Require accuracy evaluation.

Value: Boolean

Default: off

If on, then in addition to the approximation, the constructed model contains a function that estimates the approximation error at any point of the design space.

GTDF/Componentwise

Perform componentwise approximation of the output.

Value: Boolean or "Auto"

Default: "Auto"

Deprecated since version 6.3: kept for compatibility, use GTDF/DependentOutputs instead.

Prior to 6.3, this option was used to enable componentwise approximation which was disabled by default.

Since 6.3, componentwise approximation is enabled by default, and can be disabled with GTDF/DependentOutputs. Now if GTDF/Componentwise is default ("Auto"), GTDF/DependentOutputs takes priority. If GTDF/Componentwise is not default while GTDF/DependentOutputs is "Auto", then GTDF/Componentwise takes priority. In case of conflict (both options explicitly set on or off) GTDF raises an error (but this conflict is ignored if the output is 1-dimensional).
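The priority rules above can be summarized in a short Python sketch. The function name `uses_componentwise` and its arguments are hypothetical and are not part of GTDF. The sketch assumes that a conflict means the two options ask for opposite modes (both on or both off), and that GTDF/DependentOutputs decides in the 1-dimensional case, where the choice has no practical effect.

```python
def uses_componentwise(componentwise="Auto", dependent_outputs="Auto", output_dim=2):
    """Return True if componentwise approximation would be used.

    Hypothetical helper mirroring the documented priority rules.
    Option values are True (on), False (off), or "Auto".
    """
    if componentwise == "Auto":
        # GTDF/DependentOutputs takes priority; its "Auto" keeps the
        # default behavior (componentwise approximation enabled).
        return True if dependent_outputs == "Auto" else not dependent_outputs
    if dependent_outputs == "Auto":
        # Only the deprecated option is set explicitly, so it decides.
        return componentwise
    # Both options are explicit: equal values request opposite modes.
    if componentwise == dependent_outputs and output_dim > 1:
        raise ValueError("GTDF/Componentwise conflicts with GTDF/DependentOutputs")
    # Assumption: without a conflict (or with a 1-dimensional output)
    # the non-deprecated option is followed.
    return not dependent_outputs
```

For example, `uses_componentwise(componentwise=False)` returns `False`, while setting both options to on for a 3-dimensional output raises an error.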

GTDF/DependentOutputs

Assume that training outputs are dependent and do not use componentwise approximation.

Value: Boolean or "Auto"

Default: "Auto"

New in version 6.3.

In case of multidimensional output there are two possible approaches:

1. a separate approximator is used for each output component (componentwise approximation), or
2. approximation is performed for all output components simultaneously using the same approximator.

GTDF/DependentOutputs switches between these two modes. Componentwise approximation is enabled by default: "Auto" defaults to off (outputs are not assumed dependent) unless GTDF/Componentwise explicitly disables componentwise approximation. Note that GTDF/Componentwise is a deprecated option that is kept for version compatibility only and should not be used since 6.3.

For more details on componentwise approximation, see section Componentwise Approximation.

GTDF/Deterministic

Controls the behavior of randomized initialization algorithms in certain techniques.

Value: Boolean

Default: on

New in version 5.2.

Several model training techniques in GTDF feature randomized initialization of their internal parameters. These techniques include:

• DA, which may automatically (after analyzing the training sample) select a randomized technique for the approximator used internally by GTDF.
• HFA, if GTDF/HFA/SurrogateModelType is set to use one of the randomized approximation techniques (HDA, HDAGP, or SGP, and TA in certain cases). Note that HFA can also select one of these techniques automatically if GTDF/HFA/SurrogateModelType is default.
• DA_BB and VFGP_BB — blackbox-based techniques which perform randomized sampling of a low-fidelity blackbox.

The determinacy of randomized techniques can be controlled in the following way:

• If GTDF/Deterministic is on (deterministic training mode, default), a fixed seed is used in all randomized algorithms. The seed is set by GTDF/Seed. This makes the technique behavior reproducible — for example, two models trained in deterministic mode with the same data, same GTDF/Seed and other settings will be exactly the same, since a training algorithm is initialized with the same parameters.
• Alternatively, if GTDF/Deterministic is off (non-deterministic training mode), a new seed is generated internally every time you train a model. As a result, models trained with randomized techniques may slightly differ even if all settings and training samples are the same. In this case, GTDF/Seed is ignored. The generated seed that was actually used for initialization can be found in model info, so later the training run can still be reproduced exactly by switching to the deterministic mode and setting GTDF/Seed to this value.
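The two modes can be illustrated with Python's standard `random` module. This is a minimal sketch of the seeding logic only: `choose_seed` and `random_initialization` are hypothetical stand-ins, not GTDF functions, and the way GTDF actually generates and stores seeds is not shown here.

```python
import random

DEFAULT_SEED = 15313  # documented default of GTDF/Seed

def choose_seed(deterministic=True, seed=DEFAULT_SEED):
    """Return the seed a training run would use (hypothetical helper)."""
    if deterministic:
        return seed
    # Non-deterministic mode: the configured seed is ignored
    # and a fresh one is drawn for this run.
    return random.SystemRandom().randint(1, 2**31 - 1)

def random_initialization(seed, size=3):
    """Stand-in for a technique's randomized parameter initialization."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(size)]

# Deterministic mode: same seed, same initial parameters.
first = random_initialization(choose_seed())
second = random_initialization(choose_seed())

# Non-deterministic mode: record the generated seed so that
# the run can be replayed later in deterministic mode.
used_seed = choose_seed(deterministic=False)
original = random_initialization(used_seed)
replayed = random_initialization(choose_seed(deterministic=True, seed=used_seed))
```

Here `first == second` and `original == replayed` both hold, which mirrors how a recorded seed lets a non-deterministic run be reproduced.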

Note that GTDF/Deterministic and GTDF/Seed settings are passed to the approximator and (in case of blackbox-based techniques) sample generator used internally by GTDF; in fact, they indirectly set GTApprox/Deterministic, GTApprox/Seed, GTDoE/Deterministic, and GTDoE/Seed.

In case of randomized techniques, repeated non-deterministic training runs may be used to try obtaining a more accurate approximation, because results will be slightly different. On the contrary, deterministic techniques always produce exactly the same model given the same training data and settings, and are not affected by GTDF/Deterministic and GTDF/Seed. Deterministic techniques include:

• MFGP, SVFGP, VFGP — always deterministic.
• DA, which can be deterministic for certain training samples. In general, this technique is non-deterministic because its behavior depends on the automatic selection of the internal approximation technique (which can result in using a randomized technique).
• HFA, if GTDF/HFA/SurrogateModelType is set to use the LR, SPLT, GP, iTA, or RSM technique.

GTDF/ExactFitRequired

Require the model to fit sample data exactly.

Value: Boolean

Default: off

If on, the approximation fits the points of the training sample. If GTDF/ExactFitRequired is off then no fitting condition is imposed, and the approximation can be either fitting or non-fitting depending on the training data (typically, noisy data means there will be no exact fit).

GTDF/HFA/SurrogateModelType

Specify the algorithm for the approximator used in the HFA technique.

Value: "LR", "SPLT", "HDA", "GP", "HDAGP", "SGP", "TA", "iTA", "RSM", or "Auto"

Default: "Auto"

New in version 1.10.2.

This option lets you explicitly specify the approximation algorithm used whenever the HFA technique is selected (manually or automatically). It is essentially the same as GTApprox/Technique, except that it does not allow selecting the Mixture of Approximators (MoA) technique. The default ("Auto"), as in GTApprox, means that the algorithm is selected automatically according to the GTApprox automatic technique selection logic (see the GTApprox user manual for details).

GTDF/InputNanMode

Specifies how to handle non-numeric values in the input part of the training sample.

Value: "raise", "ignore"

Default: "raise"

New in version 6.8.

GTDF cannot obtain any information from non-numeric (NaN or infinity) values of variables. This option controls its behavior when such values are encountered. Default ("raise") means to raise an exception; "ignore" means to exclude data points with non-numeric values from the sample and continue training.
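The difference between the two settings can be sketched in plain Python. The helper `filter_inputs` below is hypothetical and only imitates the described policy on lists of rows; it is not how GTDF implements the option.

```python
import math

def filter_inputs(x_sample, f_sample, input_nan_mode="raise"):
    """Apply a "raise"/"ignore" policy to non-numeric inputs (illustrative only)."""
    if input_nan_mode not in ("raise", "ignore"):
        raise ValueError("unknown mode: %r" % (input_nan_mode,))
    kept_x, kept_f = [], []
    for x_row, f_row in zip(x_sample, f_sample):
        if all(math.isfinite(v) for v in x_row):
            kept_x.append(x_row)
            kept_f.append(f_row)
        elif input_nan_mode == "raise":
            raise ValueError("non-numeric input value in point %r" % (x_row,))
        # "ignore": the point is dropped and processing continues.
    return kept_x, kept_f

x = [[0.0, 1.0], [float("nan"), 2.0], [3.0, float("inf")], [4.0, 5.0]]
f = [[10.0], [20.0], [30.0], [40.0]]
clean_x, clean_f = filter_inputs(x, f, input_nan_mode="ignore")
```

With "ignore", only the first and last points remain in `clean_x` and `clean_f`; with "raise", the same call stops at the first point containing NaN.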

GTDF/InternalValidation

Enable or disable internal validation.

Value: Boolean

Default: off

If on, then in addition to the approximation, the constructed model contains a table of cross-validation errors of different types, which may serve as an indication of the expected accuracy of the approximation.

GTDF/IVDeterministic

Controls the behavior of the pseudorandom algorithm selecting data subsets in cross validation.

Value: Boolean

Default: on

New in version 5.0.

Cross validation involves partitioning the training sample into a number of subsets (defined by GTDF/IVSubsetCount) and randomized combination of these subsets for each training (validation) session. Since the algorithm that combines subsets is pseudorandom, its behavior can be controlled in the following way:

• If GTDF/IVDeterministic is on (deterministic cross validation mode, default), a fixed seed is used in the combination algorithm. The seed is set by GTDF/IVSeed. This makes cross-validation reproducible — a different combination is selected for each session, but if you repeat a cross validation run, for each session it will select the same combination as the first run.
• Alternatively, if GTDF/IVDeterministic is off (non-deterministic cross validation mode), a new seed is generated internally for every run, so cross validation results may slightly differ. In this case, GTDF/IVSeed is ignored. The generated seed that was actually used in cross validation can be found in model info, so results can still be reproduced exactly by switching to the deterministic mode and setting GTDF/IVSeed to this value.

The final model is never affected by GTDF/IVDeterministic because it is always trained using the full sample.

GTDF/IVSavePredictions

Save model values calculated during internal validation.

Value: Boolean or "Auto"

Default: "Auto"

New in version 3.0beta1.

If on, internal validation information, in addition to error values, also contains raw validation data: model values calculated during internal validation, as well as validation inputs and outputs.

GTDF/IVSeed

Fixed seed used in the deterministic cross validation mode.

Value: positive integer

Default: 15313

New in version 5.0.

Fixed seed for the pseudorandom algorithm that selects the combination of data subsets for each cross validation session. GTDF/IVSeed has an effect only if GTDF/IVDeterministic is on — see its description for more details.

GTDF/IVSubsetCount

The number of cross validation subsets.

Value: 0 (auto) or an integer in range $$[2, |S|]$$, where $$|S|$$ is the size of the high fidelity sample; cannot be less than GTDF/IVTrainingCount

Default: 0 (auto)

The number of subsets (of approximately equal size) into which the high fidelity sample is divided for the cross validation.

If left default, the number of cross validation subsets is selected automatically and will be equal to $$\min(10, |S|)$$, where $$|S|$$ is the size of the high fidelity sample.

GTDF/IVTrainingCount

The number of training sessions in cross validation.

Value: 0 (auto) or an integer in range [1, GTDF/IVSubsetCount]

Default: 0 (auto)

The number of training sessions performed during the cross validation. Each training session includes the following steps:

1. Select one of the cross validation subsets.
2. Construct a complement of this subset which is the high fidelity training sample excluding the selected subset.
3. Build a model using a full low fidelity sample and the above complement as a high fidelity training sample (so the selected subset is excluded from builder input).
4. Validate this model on the previously selected subset.

Repeat until the number of such sessions reaches GTDF/IVTrainingCount.
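The session procedure can be sketched in plain Python. Everything below is an illustrative assumption rather than GTDF code: `build_model` is a hypothetical callable taking the high and low fidelity samples, `mean_model` is a trivial stand-in builder, and the way indices are shuffled into subsets is just one possible partitioning scheme.

```python
import random

def split_into_subsets(n_points, subset_count, seed):
    """Shuffle point indices and deal them into subsets of near-equal size."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    return [indices[i::subset_count] for i in range(subset_count)]

def cross_validate(x_hf, f_hf, x_lf, f_lf, build_model,
                   subset_count, training_count, seed=15313):
    """Run training_count sessions and return the RMS error of each one."""
    subsets = split_into_subsets(len(x_hf), subset_count, seed)
    errors = []
    for held_out in subsets[:training_count]:
        # The complement is the high fidelity sample without the held-out subset.
        excluded = set(held_out)
        train_idx = [i for i in range(len(x_hf)) if i not in excluded]
        # The full low fidelity sample is always passed to the builder.
        model = build_model([x_hf[i] for i in train_idx],
                            [f_hf[i] for i in train_idx], x_lf, f_lf)
        # Validate on the held-out high fidelity points only.
        squared = [(model(x_hf[i]) - f_hf[i]) ** 2 for i in held_out]
        errors.append((sum(squared) / len(squared)) ** 0.5)
    return errors

def mean_model(x_hf, f_hf, x_lf, f_lf):
    """Stand-in builder: predicts the mean high fidelity output."""
    mean = sum(f_hf) / len(f_hf)
    return lambda x: mean

x_hf = [float(i) for i in range(10)]
f_hf = [2.0 * x for x in x_hf]
x_lf = [0.5 * i for i in range(20)]
f_lf = [2.0 * x + 1.0 for x in x_lf]
rmse_per_session = cross_validate(x_hf, f_hf, x_lf, f_lf, mean_model,
                                  subset_count=5, training_count=3)
```

With 5 subsets and 3 sessions, three of the five subsets are used for validation in turn, so `rmse_per_session` contains three error values.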

If left default, the number of cross validation sessions is selected automatically and will be equal to:

$N_{\rm tr} = \bigg\lceil\min\Big(|S|, \frac{100}{|S|}\Big)\bigg\rceil,$

where $$|S|$$ is the size of the high fidelity sample.
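As a quick check of the automatic values, the defaults for GTDF/IVSubsetCount and GTDF/IVTrainingCount can be computed directly from the two formulas in this section. The function names below are hypothetical and serve only as a worked example.

```python
import math

def default_subset_count(hf_size):
    """Automatic GTDF/IVSubsetCount: min(10, |S|)."""
    return min(10, hf_size)

def default_training_count(hf_size):
    """Automatic GTDF/IVTrainingCount: ceil(min(|S|, 100 / |S|))."""
    return math.ceil(min(hf_size, 100 / hf_size))

for size in (5, 30, 50, 200):
    print(size, default_subset_count(size), default_training_count(size))
```

For a high fidelity sample of 30 points this gives 10 subsets and 4 training sessions, since 100/30 is about 3.33 and is rounded up.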

GTDF/LogLevel

Set minimum log level.

Value: "Debug", "Info", "Warn", "Error", "Fatal"

Default: "Info"

Only messages with a log level greater than or equal to this threshold are written to the log.
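The threshold rule amounts to a comparison of level ranks. The sketch below assumes that severity increases in the order in which the values are listed; `is_logged` is a hypothetical helper, not a GTDF function.

```python
# Assumed severity order, lowest first, as listed for the option.
LOG_LEVELS = ["Debug", "Info", "Warn", "Error", "Fatal"]

def is_logged(message_level, threshold="Info"):
    """Return True if a message of this level passes the threshold."""
    return LOG_LEVELS.index(message_level) >= LOG_LEVELS.index(threshold)
```

With the default threshold, `is_logged("Debug")` is `False` and `is_logged("Warn")` is `True`.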

GTDF/MaxParallel

Set the maximum number of parallel threads to use when building a model.

Value: positive integer or 0 (auto)

Default: 0 (auto)

New in version 5.0rc1.

GTDF can run in parallel to speed up model training. This option sets the maximum number of threads the builder is allowed to create. Default setting (0) uses the value given by the OMP_NUM_THREADS environment variable, which by default is equal to the number of virtual processors, including hyperthreading CPUs. Other values override OMP_NUM_THREADS.
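The resolution order described above can be sketched as follows. The helper `resolve_thread_limit` is hypothetical, handles only a plain integer value of OMP_NUM_THREADS, and assumes that an unset variable falls back to the processor count reported by the operating system.

```python
import os

def resolve_thread_limit(max_parallel=0, environ=None):
    """Return the thread limit implied by the option and the environment."""
    environ = os.environ if environ is None else environ
    if max_parallel > 0:
        # An explicit positive value overrides OMP_NUM_THREADS.
        return max_parallel
    omp_value = environ.get("OMP_NUM_THREADS", "")
    if omp_value.isdigit() and int(omp_value) > 0:
        return int(omp_value)
    # Assumed fallback when the variable is unset or not a plain integer.
    return os.cpu_count() or 1
```

For example, `resolve_thread_limit(4, {"OMP_NUM_THREADS": "8"})` returns 4, while `resolve_thread_limit(0, {"OMP_NUM_THREADS": "8"})` returns 8.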

GTDF/Seed

Fixed seed used in the deterministic training mode.

Value: positive integer

Default: 15313

New in version 5.2.

In the deterministic training mode, GTDF/Seed sets the seed for randomized initialization algorithms in certain techniques. See GTDF/Deterministic for more details.

GTDF/StoreTrainingSample

Save a copy of training data with the model.

Value: Boolean or "Auto"

Default: "Auto"

New in version 6.6.

If on, the trained model stores copies of the training samples, sorted in order of increasing fidelity. If off, no training data is stored and the corresponding model attribute is an empty list. The "Auto" setting currently defaults to off.

GTDF/Technique

Specify the approximation algorithm to use.

Value: "DA", "HFA", "VFGP", "SVFGP" (in the sample-based mode); "DA_BB", "VFGP_BB" (in the blackbox-based mode); "Auto"

Default: "Auto"

This option specifies the algorithm used to build the approximation.

• Sample-based techniques (available only when the sample-based mode is selected in block configuration):
  • "DA": Difference Approximation
  • "HFA": High Fidelity Approximation
  • "SVFGP": Sparse Variable Fidelity Gaussian Process
  • "VFGP": Variable Fidelity Gaussian Process
• Blackbox-based techniques (available only when the blackbox-based mode is selected in block configuration):
  • "DA_BB": blackbox-based Difference Approximation
  • "VFGP_BB": blackbox-based Variable Fidelity Gaussian Process

Default value ("Auto") means that the best algorithm will be determined automatically.
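The dependence of the allowed values on the block mode can be expressed as a small validation sketch. The mode names `"sample"` and `"blackbox"` and the helper `check_technique` are hypothetical labels chosen for this example; they are not identifiers from the block configuration.

```python
SAMPLE_BASED = {"DA", "HFA", "VFGP", "SVFGP"}
BLACKBOX_BASED = {"DA_BB", "VFGP_BB"}

def check_technique(technique, mode):
    """Return the technique if it is valid for the mode, else raise ValueError."""
    # "Auto" is accepted in either mode.
    allowed = {"sample": SAMPLE_BASED, "blackbox": BLACKBOX_BASED}[mode] | {"Auto"}
    if technique not in allowed:
        raise ValueError("technique %r is not available in %s mode" % (technique, mode))
    return technique
```

For example, `check_technique("VFGP", "sample")` passes, while `check_technique("VFGP", "blackbox")` raises an error because the blackbox-based counterpart is "VFGP_BB".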

Sample size and blackbox budget requirements that take effect when the technique is selected manually are described in section Sample Size and Budget Requirements.

GTDF/UnbiasLowFidelityModel

Try compensating the low-fidelity sample bias.

Value: Boolean or "Auto"

Default: "Auto"

New in version 1.10.4.

If on, then after building an initial low-fidelity model (the approximation model trained using the low-fidelity sample only), GTDF will try to find and compensate its bias, using the high-fidelity sample.

For example, consider a high-fidelity sample generated by function $$f_{hf}(x)$$ and a low-fidelity sample generated by $$f_{lf}(x) \approx f_{hf}(x+e)$$. If GTDF/UnbiasLowFidelityModel is on, GTDF will use the algorithm that compensates the bias $$e$$, resulting in a more accurate final model.
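To make the idea of bias compensation concrete, the sketch below recovers the shift $$e$$ in a toy one-dimensional case by a simple grid search. This is not the algorithm GTDF uses: the functions, the value of the shift, the grid, and the use of the exact low fidelity function instead of a trained low fidelity model are all assumptions made for illustration.

```python
import math

TRUE_SHIFT = 0.3  # the bias e, chosen arbitrarily for this example

def f_hf(x):
    """Toy high fidelity function."""
    return math.sin(x)

def f_lf(x):
    """Toy low fidelity function: f_lf(x) = f_hf(x + e)."""
    return f_hf(x + TRUE_SHIFT)

# A small high fidelity sample: x = 0.0, 0.5, ..., 3.0.
x_hf = [0.5 * i for i in range(7)]
y_hf = [f_hf(x) for x in x_hf]

def mismatch(shift):
    """Sum of squared differences between shifted LF values and HF values."""
    return sum((f_lf(x - shift) - y) ** 2 for x, y in zip(x_hf, y_hf))

# Try shifts from -1.00 to 1.00 in steps of 0.01 and keep the best one.
candidates = [k / 100 for k in range(-100, 101)]
estimated_shift = min(candidates, key=mismatch)
```

The search returns a shift of 0.3, and evaluating the low fidelity function at `x - estimated_shift` then matches the high fidelity data much more closely than the uncorrected function does.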

This option affects all techniques except HFA. If GTDF/Technique is set to "HFA", the GTDF/UnbiasLowFidelityModel option value is ignored.

The "Auto" setting currently defaults to off (no bias compensation).