6.1. Introduction

Generic Tool for Data Fusion (GTDF) is a software package for constructing approximations that fit user-provided training data, including both high- and low-fidelity data, and for assessing the quality of the constructed approximations.

6.1.1. Problem Statement

The problem to be solved is to build an approximation \(\hat{f}(X)\) of \(f^h(X)\) using a training set. The training set consists of two parts.

The first part \(D_h = (F_h, X_h) = \{(f^h(X_i^h), X_i^h)\}_{i = 1}^{N_h}\) is a collection of pairs of points and high fidelity function values at these points. The matrix \(F_h\) corresponds to the output data of the high fidelity model, and the matrix \(X_h\) corresponds to the input data of the high fidelity model.

The second part \(D_l = (F_l, X_l) = \{(f^l(X_i^l), X_i^l)\}_{i = 1}^{N_l}\) is a collection of pairs of points and low fidelity function values at these points. The matrix \(F_l\) corresponds to the output data of the low fidelity model, and the matrix \(X_l\) corresponds to the input data of the low fidelity model.

Input vectors \(X_i^h\), \(X_i^l\) come from the same space \(\Re^{d_{in}}\); output vectors \(f^h(X^h_i)\), \(f^l(X^l_i)\) lie in the space \(\Re^{d_{out}}\). It is assumed that \(N_h \ll N_l\).
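For illustration, the sketch below sets up two such training samples as NumPy arrays; the functions f_high and f_low are hypothetical stand-ins for the user's high- and low-fidelity models and are not part of GTDF:

    import numpy as np

    # Minimal sketch of the two-part training set, assuming NumPy arrays.
    d_in, d_out = 3, 1            # input and output dimensions
    N_h, N_l = 20, 500            # high-fidelity sample is much smaller: N_h << N_l

    def f_high(x):                # expensive high-fidelity model (placeholder)
        return np.sum(np.sin(x), axis=1, keepdims=True)

    def f_low(x):                 # cheap low-fidelity model (placeholder)
        return f_high(x) + 0.1 * x[:, :1]

    X_h = np.random.rand(N_h, d_in)   # high-fidelity inputs,  shape (N_h, d_in)
    F_h = f_high(X_h)                 # high-fidelity outputs, shape (N_h, d_out)
    X_l = np.random.rand(N_l, d_in)   # low-fidelity inputs,   shape (N_l, d_in)
    F_l = f_low(X_l)                  # low-fidelity outputs,  shape (N_l, d_out)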

6.1.2. Sample Size and Budget Requirements

If the approximation technique is selected manually (see option GTDF/Technique), there are certain requirements on the high- and low-fidelity sample sizes (or, in the case of blackbox-based techniques, on the low-fidelity blackbox budget), depending on the technique and certain other option settings, primarily GTDF/InternalValidation. As shown below, internal validation typically increases the required high-fidelity sample size.

Required high- and low-fidelity sample sizes (\(N^h_{min}\) and \(N^l_{min}\), respectively) are as follows (an example calculation is sketched after the notes below):

  • Internal validation off (GTDF/InternalValidation is False, default):
    • DA technique: \(N^h_{min} = N^l_{min} = 1\).
    • HFA technique: \(N^h_{min} = 1\), \(N^l_{min} = 0\).
    • VFGP and SVFGP techniques: \(N^h_{min} = N^l_{min} = 2\tilde{p}+3\).
    • MFGP technique: \(N_{min} = 2\tilde{p}+3\) for all input samples (this technique works with multiple samples of different fidelity).
  • Internal validation on (GTDF/InternalValidation is set to True by user):
    • DA technique: \(N^h_{min} = s\), \(N^l_{min} = 1\).
    • HFA technique: \(N^h_{min} = s\), \(N^l_{min} = 0\).
    • VFGP and SVFGP techniques: \(N^h_{min} = \lceil\frac{2\tilde{p}+3}{s-1}\rceil\cdot s\), \(N^l_{min} = 2\tilde{p}+3\).
    • MFGP technique: \(N^*_{min} = \lceil\frac{2\tilde{p}+3}{s-1}\rceil\cdot s\), \(N^i_{min} = 2\tilde{p}+3\), where \(N^*_{min}\) is the size of the most accurate sample (the last one in the samples list in build_MF()), and \(N^i_{min}\) is the required size for all other samples.

In the above:

  • \(\tilde{p}\) is the effective input dimension.
  • \(s\) is the value of GTDF/IVSubsetCount (the number of data subsets in cross-validation).
  • \(\lceil x \rceil\) means the value of \(x\) rounded up (to the next integer).

For the blackbox-based DA_BB and VFGP_BB techniques, \(N^l_{min}\) becomes the requirement for the blackbox budget.

In all cases, the sample size requirements apply to the effective sample sizes, and \(\tilde{p}\) is the effective input dimension; both are obtained after internal preprocessing of the training samples (removal of redundant data, see section Preprocessing in the GTDF User Manual).
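As an illustration, here is a small Python helper that simply restates the requirements listed above; it is not part of the GTDF API, and the technique name, p_eff (standing for \(\tilde{p}\)) and s (the GTDF/IVSubsetCount value) are passed explicitly as plain arguments:

    import math

    # Illustrative helper, not part of the GTDF API: it restates the requirements
    # above. Returns (N^h_min, N^l_min); for MFGP the pair is interpreted as
    # (most accurate sample, every other sample).
    def min_sample_sizes(technique, p_eff, internal_validation=False, s=None):
        base = 2 * p_eff + 3                                # 2*p~ + 3
        if not internal_validation:                         # GTDF/InternalValidation off
            return {"DA": (1, 1), "HFA": (1, 0),
                    "VFGP": (base, base), "SVFGP": (base, base),
                    "MFGP": (base, base)}[technique]
        iv_high = math.ceil(base / (s - 1)) * s             # ceil((2*p~+3)/(s-1)) * s
        return {"DA": (s, 1), "HFA": (s, 0),
                "VFGP": (iv_high, base), "SVFGP": (iv_high, base),
                "MFGP": (iv_high, base)}[technique]

    # Example: VFGP with internal validation, effective dimension 3, 5 subsets.
    print(min_sample_sizes("VFGP", p_eff=3, internal_validation=True, s=5))  # (15, 9)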

6.1.3. Data Fusion Model Structure

This section describes the general internal structure of a model saved or serialized to a file.

A model file is divided into several sections containing different parts of model information. When reading an existing file, you can load the whole model, or select only specific sections in order to save memory or to load faster. Likewise, when saving a model, you can select which sections to save. This is controlled using the sections argument in load(), fromstring(), save(), and tostring(). To see which sections are available in a model, use available_sections().
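For intuition only, the following is a conceptual Python analogy of such section-based storage; it does not reproduce the actual GTDF file format, API, or section identifiers, and merely mimics the idea of a file split into named parts that can be written and read selectively, with the main model part always kept:

    import pickle

    # Conceptual analogy only -- not the actual GTDF file format or API.
    sections = {
        "model": b"\x00\x01",                # stands in for the model binary (always saved)
        "info": {"technique": "VFGP"},       # stands in for the information section
        "comment": "baseline fusion model",  # user comment
        "build_log": "training log text",    # training log
    }

    def save(path, data, keep=None):
        """Write only the requested sections; the model binary is always kept."""
        keep = set(keep or data) | {"model"}
        with open(path, "wb") as f:
            pickle.dump({k: v for k, v in data.items() if k in keep}, f)

    def load(path, keep=None):
        """Read the file and return only the requested sections."""
        with open(path, "rb") as f:
            data = pickle.load(f)
        return data if keep is None else {k: v for k, v in data.items() if k in keep}

    save("model.bin", sections, keep=["info", "build_log"])  # drop the comment on save
    print(load("model.bin", keep=["info"]))                  # selective load: only "info"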

The sections are listed below. Note that this list is not intended to describe the exact structure and section order, but only the available contents.

  • Main model section: contains the model binary, required to evaluate model values and gradients. This section is always saved, but you can skip it on load.
  • Information section: full model information accessible via info.
  • Comment section: contains your text comment to the model, see comment.
  • Annotations section: contains the information accessible via annotations (extended comments or notes).
  • Training sample section: contains a copy of data used to train the model, see training_sample. Note that this section may be empty (if GTDF/StoreTrainingSample was off when training the model).
  • Internal validation information section: contains model internal validation data, iv_info.
  • Log section: contains the text of model training log, build_log.

Note that depending on which sections are loaded, certain Model methods or attributes may not be available:

Some minimum model information is available after loading even if you do not load any of the sections. This information, like the main model section, is always saved with the model. The following attributes can be read after a minimum load: