# DFBuilder¶

**Tag:** Modeling

*DFBuilder* builds an approximation model using two samples containing high and low fidelity data (the high fidelity and low fidelity training samples). The high fidelity training sample is received on the x_hf_sample and f_hf_sample input ports, while the low fidelity sample is received on the x_lf_sample and f_lf_sample input ports. The model may be output to the model port and/or saved to disk.

See also

*GTDF guide* - the guide to the Generic Tool for Data Fusion (GTDF), the pSeven Core advanced approximation component used by *DFBuilder* to train approximation models combining two training samples of different fidelity.


## Options¶

- Basic options:
    - *GTDF/Accelerator* - five-position switch to control the trade-off between speed and accuracy.
    - *GTDF/AccuracyEvaluation* - require accuracy evaluation.
    - *GTDF/ExactFitRequired* - require the model to fit sample data exactly.
    - *GTDF/InternalValidation* - enable or disable internal validation.
    - *GTDF/LogLevel* - minimum log level.

- Advanced options:
    - *GTDF/Componentwise* - perform componentwise approximation of the output (*deprecated since 6.3*).
    - *GTDF/DependentOutputs* - assume that training outputs are dependent and do not use componentwise approximation (*added in 6.3*).
    - *GTDF/Deterministic* - controls the behavior of randomized initialization algorithms in certain techniques (*added in 5.2*).
    - *GTDF/IVDeterministic* - controls the behavior of the pseudorandom algorithm selecting data subsets in cross validation (*added in 5.0*).
    - *GTDF/IVSavePredictions* - save model values calculated during internal validation (*added in 3.0 beta 1*).
    - *GTDF/IVSeed* - fixed seed used in the deterministic cross validation mode (*added in 5.0*).
    - *GTDF/IVSubsetCount* - the number of cross validation subsets.
    - *GTDF/IVTrainingCount* - the number of training sessions in cross validation.
    - *GTDF/MaxParallel* - maximum number of parallel threads (*added in 5.0 RC 1*).
    - *GTDF/Seed* - fixed seed used in the deterministic training mode (*added in 5.2*).
    - *GTDF/Technique* - specify the approximation algorithm to use.
    - *GTDF/UnbiasLowFidelityModel* - try compensating the low-fidelity sample bias (*added in 1.10.4*).

- High Fidelity Approximation (HFA) options:
    - *GTDF/HFA/SurrogateModelType* - specify the algorithm for the approximator used in the HFA technique (*added in 1.10.2*).

**GTDF/Accelerator**

Five-position switch to control the trade-off between speed and accuracy.

Value: integer in range \([1, 5]\)

Default: 1

This option controls training time by changing the values of other options. If any of these dependent options is later modified by the user, the user change overrides the setting previously made by changing the value of GTDF/Accelerator. Possible values range from 1 (low speed, highest quality) to 5 (high speed, lower quality).
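The precedence rule described above (explicit user settings override the accelerator preset) can be sketched in pure Python. The preset values below are hypothetical placeholders, not the actual GTDF internals:

```python
# Hypothetical per-level presets (NOT the real GTDF internals) used only
# to illustrate how an accelerator switch interacts with user settings.
ACCELERATOR_PRESETS = {
    1: {"GTDF/IVSubsetCount": "10", "GTDF/MaxParallel": "0"},
    5: {"GTDF/IVSubsetCount": "3", "GTDF/MaxParallel": "0"},
}

def resolve_options(accelerator, user_options):
    """Merge a preset with user options; explicit user settings win."""
    resolved = dict(ACCELERATOR_PRESETS[accelerator])
    resolved.update(user_options)  # user changes override the preset
    return resolved

# The user overrides one dependent option; the rest follow the preset.
opts = resolve_options(5, {"GTDF/IVSubsetCount": "7"})
print(opts["GTDF/IVSubsetCount"])  # prints 7: the user value wins
```

The key point is the order of the merge: preset first, user settings last.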

**GTDF/AccuracyEvaluation**

Require accuracy evaluation.

Value: Boolean

Default: off

If on, then in addition to the approximation, the constructed model contains a function that provides an estimate of the approximation error as a function on the design space.

**GTDF/Componentwise**

Perform componentwise approximation of the output.

Value: Boolean or "Auto"

Default: "Auto"

Deprecated since version 6.3: kept for compatibility, use GTDF/DependentOutputs instead.

Prior to 6.3, this option was used to enable componentwise approximation, which was disabled by default. Since 6.3, componentwise approximation is enabled by default and can be disabled with GTDF/DependentOutputs. Now, if GTDF/Componentwise is default ("Auto"), GTDF/DependentOutputs takes priority. If GTDF/Componentwise is not default while GTDF/DependentOutputs is "Auto", then GTDF/Componentwise takes priority. In case of conflict (both options explicitly set on, or both explicitly set off), GTDF raises an error; this conflict is ignored if the output is 1-dimensional.
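The priority rules above can be condensed into a small decision function. This is a sketch of the documented logic only (the function name and boolean encoding are illustrative, not part of the GTDF API):

```python
def resolve_componentwise(componentwise="Auto", dependent_outputs="Auto",
                          output_dim=2):
    """Sketch of the documented priority between the deprecated
    GTDF/Componentwise and GTDF/DependentOutputs options.
    Returns True if componentwise approximation is used."""
    if componentwise == "Auto":
        # GTDF/DependentOutputs takes priority; its "Auto" keeps the
        # default behavior (componentwise approximation enabled).
        if dependent_outputs == "Auto":
            return True
        return not dependent_outputs
    if dependent_outputs == "Auto":
        return componentwise  # deprecated option takes priority
    # Both set explicitly: on/on or off/off is contradictory
    # (componentwise=on means the opposite of dependent_outputs=on).
    if componentwise == dependent_outputs and output_dim > 1:
        raise ValueError("conflicting GTDF/Componentwise and "
                         "GTDF/DependentOutputs settings")
    return componentwise

print(resolve_componentwise())                       # defaults: componentwise
print(resolve_componentwise(dependent_outputs=True)) # joint approximation
```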

**GTDF/DependentOutputs**

Assume that training outputs are dependent and do not use componentwise approximation.

Value: Boolean or "Auto"

Default: "Auto"

New in version 6.3.

In case of multidimensional output there are two possible approaches:

- a separate approximator is used for each output component (componentwise approximation), or
- approximation is performed for all output components simultaneously using the same approximator.

GTDF/DependentOutputs switches between these two modes. Componentwise approximation is enabled by default - that is, "Auto" defaults to "off" unless GTDF/Componentwise explicitly disables componentwise approximation. Note that GTDF/Componentwise is a deprecated option kept for version compatibility only and should not be used since 6.3.

For more details on componentwise approximation, see section Componentwise Approximation.
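The componentwise mode amounts to the following data flow, sketched here with a trivial one-dimensional least-squares fit in place of GTDF's actual approximators (illustration only):

```python
# Sketch of componentwise approximation: for a sample with a 2-component
# output, fit an independent model (here: a line) to each output column.

def fit_linear(xs, ys):
    """Ordinary least squares for y ~ a*x + b, one output component."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

xs = [0.0, 1.0, 2.0, 3.0]
f = [[2 * x + 1, -x + 4] for x in xs]  # 2-dimensional output sample

# Componentwise: one separate model per output column.
models = [fit_linear(xs, [row[k] for row in f]) for k in range(2)]
print(models)  # (slope, intercept) of the two independent fits
```

In the dependent-outputs mode a single approximator would instead be fit to both columns at once, which can help when the output components are correlated.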

**GTDF/Deterministic**

Controls the behavior of randomized initialization algorithms in certain techniques.

Value: Boolean

Default: on

New in version 5.2.

Several model training techniques in GTDF feature randomized initialization of their internal parameters. These techniques include:

- DA, which may automatically (after analyzing the training sample) select a randomized technique for the approximator used internally by GTDF.
- HFA, if GTDF/HFA/SurrogateModelType is set to use one of the randomized approximation techniques (HDA, HDAGP, or SGP, and TA in certain cases). Note that HFA can also select one of these techniques automatically if GTDF/HFA/SurrogateModelType is default.
- DA_BB and VFGP_BB - blackbox-based techniques which perform randomized sampling of a low-fidelity blackbox.

The determinacy of randomized techniques can be controlled in the following way:

- If GTDF/Deterministic is on (deterministic training mode, default), a fixed seed is used in all randomized algorithms. The seed is set by GTDF/Seed. This makes the technique behavior reproducible - for example, two models trained in deterministic mode with the same data, the same GTDF/Seed, and the same other settings will be exactly the same, since the training algorithm is initialized with the same parameters.
- Alternatively, if GTDF/Deterministic is off (non-deterministic training mode), a new seed is generated internally every time you train a model. As a result, models trained with randomized techniques may slightly differ even if all settings and training samples are the same. In this case, GTDF/Seed is ignored. The generated seed that was actually used for initialization can be found in the model info, so the training run can later be reproduced exactly by switching to the deterministic mode and setting GTDF/Seed to this value.

Note that the GTDF/Deterministic and GTDF/Seed settings are passed to the approximator and (in case of blackbox-based techniques) the sample generator used internally by GTDF; in fact, they indirectly set GTApprox/Deterministic, GTApprox/Seed, GTDoE/Deterministic, and GTDoE/Seed.

In case of randomized techniques, repeated non-deterministic training runs may be used to try to obtain a more accurate approximation, because results will be slightly different. On the contrary, deterministic techniques always produce exactly the same model given the same training data and settings, and are not affected by GTDF/Deterministic and GTDF/Seed. Deterministic techniques include:

- MFGP, SVFGP, VFGP — always deterministic.
- DA, which can be deterministic for certain training samples. In general, this technique is non-deterministic because its behavior depends on the automatic selection of the internal approximation technique (which can result in using a randomized technique).
- HFA, if GTDF/HFA/SurrogateModelType is set to use the LR, SPLT, GP, iTA, or RSM technique.
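The two modes follow the usual seeded-RNG pattern; a standard-library sketch (not GTDF code) of fixed-seed versus generated-seed initialization:

```python
import random

def randomized_init(seed=None):
    """Sketch of a randomized initialization step.

    With a fixed seed (deterministic mode) the draw is reproducible;
    with seed=None (non-deterministic mode) a fresh seed is generated
    and returned, so the run can be replayed later.
    """
    if seed is None:
        seed = random.SystemRandom().randrange(2 ** 31)  # fresh seed
    rng = random.Random(seed)
    params = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    return seed, params

# Deterministic mode: same seed -> identical initialization
# (15313 is the documented GTDF/Seed default).
_, p1 = randomized_init(seed=15313)
_, p2 = randomized_init(seed=15313)
print(p1 == p2)  # True

# Non-deterministic mode: the generated seed is reported (cf. model info),
# so the run can be reproduced by passing it back explicitly.
used_seed, p3 = randomized_init()
_, p4 = randomized_init(seed=used_seed)
print(p3 == p4)  # True
```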

**GTDF/ExactFitRequired**

Require the model to fit sample data exactly.

Value: Boolean

Default: off

If on, the approximation exactly fits the points of the training sample. If GTDF/ExactFitRequired is off, then no fitting condition is imposed, and the approximation can be either fitting or non-fitting depending on the training data (typically, noisy data means there will be no exact fit).

**GTDF/HFA/SurrogateModelType**

Specify the algorithm for the approximator used in the HFA technique.

Value: "LR", "SPLT", "HDA", "GP", "HDAGP", "SGP", "TA", "iTA", "RSM", or "Auto"

Default: "Auto"

New in version 1.10.2.

This option allows you to explicitly specify the approximation algorithm used whenever the HFA technique is selected (manually or automatically). It is essentially the same as GTApprox/Technique, except that it does not allow selecting the Mixture of Approximators (MoA) technique. The default ("Auto"), as in GTApprox, means that the algorithm is selected automatically according to the GTApprox automatic technique selection logic (see the GTApprox user manual for details).

**GTDF/InputNanMode**

Specifies how to handle non-numeric values in the input part of the training sample.

Value: "raise", "ignore"

Default: "raise"

New in version 6.8.

GTDF cannot obtain any information from non-numeric (NaN or infinity) values of variables. This option controls its behavior when such values are encountered. The default ("raise") means to raise an exception; "ignore" means to exclude data points with non-numeric values from the sample and continue training.

**GTDF/InternalValidation**

Enable or disable internal validation.

Value: Boolean

Default: off

If on, then in addition to the approximation, the constructed model contains a table of cross-validation errors of different types, which may serve as an indication of the expected accuracy of the approximation.

**GTDF/IVDeterministic**

Controls the behavior of the pseudorandom algorithm selecting data subsets in cross validation.

Value: Boolean

Default: on

New in version 5.0.

Cross validation involves partitioning the training sample into a number of subsets (defined by GTDF/IVSubsetCount) and a randomized combination of these subsets for each training (validation) session. Since the algorithm that combines subsets is pseudorandom, its behavior can be controlled in the following way:

- If GTDF/IVDeterministic is on (deterministic cross validation mode, default), a fixed seed is used in the combination algorithm. The seed is set by GTDF/IVSeed. This makes cross validation reproducible - a different combination is selected for each session, but if you repeat a cross validation run, each session will select the same combination as in the first run.
- Alternatively, if GTDF/IVDeterministic is off (non-deterministic cross validation mode), a new seed is generated internally for every run, so cross validation results may slightly differ. In this case, GTDF/IVSeed is ignored. The generated seed that was actually used in cross validation can be found in the model info, so results can still be reproduced exactly by switching to the deterministic mode and setting GTDF/IVSeed to this value.

The final model is never affected by GTDF/IVDeterministic because it is always trained using the full sample.
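A seeded partition into subsets can be sketched with the standard library (illustration of the reproducibility property only; GTDF's actual partitioning algorithm is internal):

```python
import random

def make_cv_subsets(sample_size, subset_count, seed):
    """Split sample indices into subsets of approximately equal size,
    using a seeded shuffle so the partition is reproducible."""
    rng = random.Random(seed)
    indices = list(range(sample_size))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin into subset_count subsets.
    return [indices[k::subset_count] for k in range(subset_count)]

# Deterministic mode: the same seed reproduces the same partition
# (15313 is the documented GTDF/IVSeed default).
a = make_cv_subsets(sample_size=20, subset_count=5, seed=15313)
b = make_cv_subsets(sample_size=20, subset_count=5, seed=15313)
print(a == b)  # True
print(sorted(i for s in a for i in s) == list(range(20)))  # a true partition
```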

**GTDF/IVSavePredictions**

Save model values calculated during internal validation.

Value: Boolean or "Auto"

Default: "Auto"

New in version 3.0 beta 1.

If on, internal validation information, in addition to error values, also contains raw validation data: model values calculated during internal validation, as well as validation inputs and outputs.

**GTDF/IVSeed**

Fixed seed used in the deterministic cross validation mode.

Value: positive integer

Default: 15313

New in version 5.0.

Fixed seed for the pseudorandom algorithm that selects the combination of data subsets for each cross validation session. GTDF/IVSeed has an effect only if GTDF/IVDeterministic is on - see its description for more details.

**GTDF/IVSubsetCount**

The number of cross validation subsets.

Value: 0 (auto) or an integer in range \([2, |S|]\), where \(|S|\) is the size of the high fidelity sample; also cannot be less than GTDF/IVTrainingCount

Default: 0 (auto)

The number of subsets (of approximately equal size) into which the high fidelity sample is divided for cross validation.

If left default, the number of cross validation subsets is selected automatically and is equal to \(\min(10, |S|)\), where \(|S|\) is the size of the high fidelity sample.

**GTDF/IVTrainingCount**

The number of training sessions in cross validation.

Value: 0 (auto) or an integer in range \([1,\) GTDF/IVSubsetCount\(]\)

Default: 0 (auto)

The number of training sessions performed during the cross validation. Each training session includes the following steps:

- Select one of the cross validation subsets.
- Construct a complement of this subset which is the high fidelity training sample excluding the selected subset.
- Build a model using a full low fidelity sample and the above complement as a high fidelity training sample (so the selected subset is excluded from builder input).
- Validate this model on the previously selected subset.
These steps are repeated until the number of sessions reaches GTDF/IVTrainingCount.

If left default, the number of cross validation sessions is selected automatically and is equal to

\[N_{\rm tr} = \bigg\lceil\min\Big(|S|, \frac{100}{|S|}\Big)\bigg\rceil,\]

where \(|S|\) is the size of the high fidelity sample.
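The automatic defaults of the two cross-validation options reduce to two small formulas, sketched here as plain functions:

```python
import math

def auto_iv_subset_count(hf_sample_size):
    """Default GTDF/IVSubsetCount: min(10, |S|)."""
    return min(10, hf_sample_size)

def auto_iv_training_count(hf_sample_size):
    """Default GTDF/IVTrainingCount: ceil(min(|S|, 100 / |S|))."""
    s = hf_sample_size
    return math.ceil(min(s, 100 / s))

# Small samples get more sessions; large samples need only a few.
for s in (5, 10, 40, 200):
    print(s, auto_iv_subset_count(s), auto_iv_training_count(s))
```

For example, a high fidelity sample of 40 points gets 10 subsets and \(\lceil 100/40 \rceil = 3\) sessions, while a 200-point sample gets 10 subsets and a single session.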

**GTDF/LogLevel**

Set minimum log level.

Value: "Debug", "Info", "Warn", "Error", "Fatal"

Default: "Info"

If this option is set, only messages with log level greater than or equal to the threshold are dumped into the log.

**GTDF/MaxParallel**

Set the maximum number of parallel threads to use when building a model.

Value: positive integer or 0 (auto)

Default: 0 (auto)

New in version 5.0 RC 1.

GTDF can run in parallel to speed up model training. This option sets the maximum number of threads the builder is allowed to create. The default setting (0) uses the value given by the OMP_NUM_THREADS environment variable, which by default is equal to the number of virtual processors, including hyperthreading CPUs. Other values override OMP_NUM_THREADS.
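The resolution order described above can be sketched as follows (a simplification for illustration; the actual builder delegates this to its OpenMP runtime):

```python
import os

def resolve_thread_count(max_parallel):
    """Resolve the thread limit: 0 (auto) falls back to OMP_NUM_THREADS,
    which itself defaults to the number of virtual processors."""
    if max_parallel > 0:
        return max_parallel  # explicit setting overrides OMP_NUM_THREADS
    env = os.environ.get("OMP_NUM_THREADS")
    if env is not None:
        return int(env)
    return os.cpu_count()

os.environ["OMP_NUM_THREADS"] = "4"
print(resolve_thread_count(0))  # 4: auto mode reads the environment
print(resolve_thread_count(2))  # 2: an explicit value wins
```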

**GTDF/Seed**

Fixed seed used in the deterministic training mode.

Value: positive integer

Default: 15313

New in version 5.2.

In the deterministic training mode, GTDF/Seed sets the seed for randomized initialization algorithms in certain techniques. See GTDF/Deterministic for more details.

**GTDF/StoreTrainingSample**

Save a copy of training data with the model.

Value: Boolean or "Auto"

Default: "Auto"

New in version 6.6.

If on, the trained model stores copies of the training samples, sorted in order of increasing fidelity. If off, this attribute is an empty list. The "Auto" setting currently defaults to off.

**GTDF/Technique**

Specify the approximation algorithm to use.

Value: "DA", "HFA", "VFGP", "SVFGP" (in the sample-based mode); "DA_BB", "VFGP_BB" (in the blackbox-based mode); "Auto"

Default: "Auto"

This option specifies the algorithm to be used in approximation.

- Sample-based techniques (available only when the sample-based mode is selected in block configuration):
    - "DA" - Difference Approximation
    - "HFA" - High Fidelity Approximation
    - "SVFGP" - Sparse Variable Fidelity Gaussian Process
    - "VFGP" - Variable Fidelity Gaussian Process
- Blackbox-based techniques (available only when the blackbox-based mode is selected in block configuration):
    - "DA_BB" - blackbox-based Difference Approximation
    - "VFGP_BB" - blackbox-based Variable Fidelity Gaussian Process

The default value ("Auto") means that the best algorithm is determined automatically. Sample size and blackbox budget requirements that take effect when the technique is selected manually are described in section Sample Size and Budget Requirements.

**GTDF/UnbiasLowFidelityModel**

Try compensating the low-fidelity sample bias.

Value: Boolean or "Auto"

Default: "Auto"

New in version 1.10.4.

If on, then after building an initial low-fidelity model (the approximation model trained using the low-fidelity sample only), GTDF will try to find and compensate its bias using the high-fidelity sample.

For example, consider a high-fidelity sample generated by a function \(f_{hf}(x)\) and a low-fidelity sample generated by \(f_{lf}(x) \approx f_{hf}(x+e)\). If GTDF/UnbiasLowFidelityModel is on, GTDF will use the algorithm that compensates the bias \(e\), resulting in a more accurate final model.

This option affects all techniques except HFA. If GTDF/Technique is set to "HFA", the GTDF/UnbiasLowFidelityModel option value is ignored.

The "Auto" setting currently defaults to off (no bias compensation).