8.1. Introduction

Generic Tool for Dimension Reduction (GTDR) automatically generates a data-based dimension reduction model (DR-model) from a specified set of data (prototypes) consisting of multidimensional vectors. The DR-model performs a dimension reduction procedure for multidimensional vectors that is adapted to the specified data set.

8.1.1. Problem Statement

The GTDR module implements the solution of a mathematical task that can be formulated as follows: based on the specified training data set (its structure is described below), create a DR-model \(\Sigma=\{n,m,C,D\}\) defined by the following parameters:

  • \(n\) - dimension of the initial vector \(X\);
  • \(m\) - dimension of the compressed vector \(\lambda\);
  • \(C\) - the compression procedure that reduces the vector \(X\) of dimension \(n\) to the \(m\)-dimensional compressed vector \(\lambda\);
  • \(D\) - the reconstruction procedure that restores the full \(n\)-dimensional vector \(X^*\) from the \(m\)-dimensional compressed vector \(\lambda\).

The DR-model must also comply with the specified requirements (depending on the dimension reduction problem type), which are determined either by the dimension \(m\) of the compressed vector or by the accuracy of the procedure.

In dimension reduction tasks 1 and 2 considered below, the specified training data set consists of \(N\) \(n\)-dimensional vectors \(\{X_1, X_2, ..., X_N\}\) (prototypes). The accuracy of a DR-model \(\Sigma=\{n,m,C,D\}\), applied to an initial vector \(X\), is determined by the reconstruction error, understood as the distance \(d(X,X^*)=|X-X^*|\) between the original vector \(X\) and the reconstructed vector \(X^*=D(C(X))\), obtained by first applying the compression procedure and then the reconstruction procedure to the original vector \(X\). For a given data set \(\{X_1, X_2, ..., X_N\}\), the accuracy of the DR-model is governed by the root-mean-square error of reconstruction \(\varepsilon\):

\[\varepsilon(\Sigma) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}{|X_i-X_i^*|^2}}.\]
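For illustration, the short NumPy sketch below computes this root-mean-square reconstruction error for an arbitrary pair of compression and reconstruction procedures passed as callables. It is not part of the GTDR API; the function name and the trivial coordinate-truncation model in the usage example are assumptions made only to demonstrate the formula.

```python
import numpy as np

def reconstruction_rms_error(X, compress, reconstruct):
    """Root-mean-square reconstruction error eps(Sigma) over a data set.

    X           -- (N, n) array of prototypes X_1, ..., X_N
    compress    -- compression procedure C: R^n -> R^m
    reconstruct -- reconstruction procedure D: R^m -> R^n
    """
    # X_i^* = D(C(X_i)): compress, then reconstruct every prototype.
    X_rec = np.array([reconstruct(compress(x)) for x in X])
    # eps(Sigma) = sqrt( (1/N) * sum_i |X_i - X_i^*|^2 )
    return np.sqrt(np.mean(np.sum((X - X_rec) ** 2, axis=1)))

# Usage: a trivial DR-model that keeps only the first m coordinates.
n, m = 5, 2
X = np.random.default_rng(0).normal(size=(100, n))
C = lambda x: x[:m]
D = lambda lam: np.concatenate([lam, np.zeros(n - m)])
print(reconstruction_rms_error(X, C, D))
```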

Now let us formulate the dimension reduction tasks 1 and 2.

8.1.1.1. Dimension Reduction Task 1

Based on a given data set \(\{X_1, X_2,...,X_N\}\), construct a DR-model \(\Sigma=\{n,m,C,D\}\) such that the root-mean-square error of reconstruction \(\varepsilon(\Sigma)\) does not exceed the specified value \(\varepsilon\) and the compressed vector has the smallest possible dimension \(m\).
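As an illustration of task 1 only (GTDR's actual construction is not shown here and need not coincide with it), the following sketch builds a simple PCA-like linear DR-model from the data, taking the rows of a hypothetical matrix B as the leading right singular vectors of the data matrix, and sweeps the reduced dimension \(m\) upwards until the error tolerance is met. The helper names linear_dr_model and smallest_dimension are assumptions of this sketch.

```python
import numpy as np

def linear_dr_model(X, m):
    """Illustrative linear DR-model of reduced dimension m built from the data.

    The rows of B are the m leading right singular vectors of X, so the model
    compresses as C(x) = B x and reconstructs as D(lam) = B^T lam.  This is a
    PCA-like stand-in, not necessarily the model GTDR constructs.
    """
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    B = Vt[:m]                              # m x n compression matrix
    X_rec = (X @ B.T) @ B                   # X_i^* = D(C(X_i)) for all prototypes
    eps = np.sqrt(np.mean(np.sum((X - X_rec) ** 2, axis=1)))
    return B, eps

def smallest_dimension(X, eps_max):
    """Task 1 illustration: smallest m such that eps(Sigma) <= eps_max."""
    for m in range(1, X.shape[1] + 1):
        B, eps = linear_dr_model(X, m)
        if eps <= eps_max:
            return m, B, eps
    return X.shape[1], B, eps               # tolerance not met: keep full dimension
```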

8.1.1.2. Dimension Reduction Task 2

Based on a given data set \(\{X_1, X_2,...,X_N\}\) and specified dimension \(m\) of the compressed vector, construct a DR-model \(\Sigma=\{n,m,C,D\}\) having the lowest possible root-mean-square error of reconstruction \(\varepsilon(\Sigma)\) for the specified reduced dimension value \(m\).
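With the same illustrative helper from the previous sketch (again, only a stand-in for whatever model GTDR actually builds), task 2 amounts to a single call with the reduced dimension fixed by the user:

```python
# Task 2 illustration: m is specified, report the resulting error eps(Sigma).
B, eps = linear_dr_model(X, m=3)
```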

8.1.1.3. Dimension Reduction Task 3

Now let us consider the statement of dimension reduction task 3 (the feature extraction problem). First we describe the training data set for dimension reduction task 3 and the way the accuracy of the corresponding DR-model is estimated.

Let \(Y = f(X), X \in \mathbb{R}^n, Y \in \mathbb{R}^q\) be some dependency. The specified training data set consists of \(N\) \(n\)-dimensional input vectors and corresponding \(q\)-dimensional output vectors, namely, \(\{(X_1,Y_1), (X_2,Y_2), ..., (X_N,Y_N)\}\), where \(Y_i = f(X_i), i = 1,...,N\).

The accuracy of a DR-model \(\Sigma=\{n,m,C,D\}\), applied to an initial vector \(X\), is determined by the discrepancy between the function value \(f(X)\) for the initial vector \(X\) and the function value \(f(X^*)\) for the reconstructed vector \(X^*=D(C(X))\), obtained by first applying the compression procedure and then the reconstruction procedure to the original vector \(X\). For a given data set \(\{(X_1,Y_1), (X_2,Y_2), ..., (X_N,Y_N)\}\), the accuracy of the DR-model is governed by the root-mean-square error \(\varepsilon\):

\[\varepsilon(\Sigma) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}{|Y_i-f(X_i^*)|^2}}.\]

Finally, the dimension reduction task 3 can be formulated as follows. Based on a given data set \(\{(X_1,Y_1), (X_2,Y_2), ..., (X_N,Y_N)\}\), construct a DR-model \(\Sigma=\{n,m,C,D\}\) having the lowest possible root-mean-square error \(\varepsilon(\Sigma)\) for the specified reduced dimension value \(m\).

In the current version of GTDR it is assumed that

\[\begin{split}\begin{array}{l} C: X \in \mathbb{R}^n \rightarrow \lambda = \lambda(X) = B\cdot X \in \mathbb{R}^m \\ D: \lambda \in \mathbb{R}^m \rightarrow X^* = X^*(\lambda) = B^T\cdot\lambda \in \mathbb{R}^n \end{array}\end{split}\]

for some unknown matrix \(B\in\mathbb{R}^{m\times n}\) and that the elements of the given data set were generated according to the following model:

\[Y_i=f(X_i) = g(\lambda(X_i))+\varepsilon_i \approx g(B\cdot X_i), i = 1,...,N,\]

for some unknown function \(g(\lambda), \lambda \in \mathbb{R}^m\). The feature extraction algorithm of GTDR, which solves dimension reduction task 3, estimates the matrix \(B \in \mathbb{R}^{m\times n}\) from the sample \(\{(X_1,Y_1), (X_2,Y_2), ..., (X_N,Y_N)\}\).
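The estimation procedure itself is not described in this section. Purely to illustrate the assumed structure, the sketch below shows how a matrix \(B\) of the stated form, however obtained, defines the compression and reconstruction procedures and how the task-3 error is then evaluated; the synthetic data, the choice of \(g\), and the function names are assumptions of the example, not GTDR functionality.

```python
import numpy as np

def task3_rms_error(X, Y, B, f):
    """Task 3 error for a linear DR-model given by the m x n matrix B.

    X -- (N, n) input vectors, Y -- (N, q) outputs with Y_i = f(X_i),
    f -- the underlying dependency, evaluated on the reconstructed points.
    """
    compress = lambda x: B @ x             # C(x) = B x
    reconstruct = lambda lam: B.T @ lam    # D(lam) = B^T lam
    F_rec = np.array([f(reconstruct(compress(x))) for x in X])  # f(X_i^*)
    return np.sqrt(np.mean(np.sum((Y - F_rec) ** 2, axis=1)))

# Synthetic data following the assumed model Y_i = g(B X_i) + eps_i.
rng = np.random.default_rng(0)
n, m, q, N = 6, 2, 1, 200
Q, _ = np.linalg.qr(rng.normal(size=(n, m)))     # n x m, orthonormal columns
B_true = Q.T                                     # m x n, so B_true @ B_true.T = I
g = lambda lam: np.array([np.sum(lam ** 2)])     # some dependency g: R^m -> R^q
X = rng.normal(size=(N, n))
Y = np.array([g(B_true @ x) for x in X]) + 0.01 * rng.normal(size=(N, q))
f = lambda x: g(B_true @ x)                      # the full dependency f(X)
# With orthonormal rows of B the error reduces to the noise level.
print(task3_rms_error(X, Y, B_true, f))
```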

8.1.1.4. Dimension Reduction Task 4

Task 4 (the blackbox-based feature extraction problem) is based on the same assumptions as dimension reduction task 3; however, the algorithm input is significantly different.

GTDR is trained not on a predefined data set, but on a user-provided blackbox that may be evaluated at any point of the box \([a_1, b_1] \times \ldots \times [a_n, b_n]\).

Dimension reduction task 4 may be formulated as follows. Based on the bounds \(\{a_i, b_i\}, i = 1, ..., n\), the evaluation budget \(K\), and the user-defined blackbox function \(f(X)\), construct a DR-model \(\Sigma=\{n,m,C,D\}\) having the lowest possible root-mean-square error \(\varepsilon(\Sigma)\) for the specified reduced dimension value \(m\).
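One way to read task 4 is that the evaluation budget \(K\) is spent on blackbox evaluations inside the given box, after which the problem reduces to the task-3 setting on the collected sample. The sketch below uses plain uniform random sampling as a stand-in for whatever sampling scheme GTDR actually applies; the function name sample_blackbox and the toy blackbox in the usage line are assumptions of this example.

```python
import numpy as np

def sample_blackbox(f, bounds, budget, rng=None):
    """Spend the evaluation budget K on the user-provided blackbox.

    f      -- callable accepting an n-dimensional point inside the box
    bounds -- sequence of n pairs (a_i, b_i) defining the box
    budget -- number K of allowed blackbox evaluations

    Returns a sample (X, Y) to which the task-3 formulation then applies.
    Uniform random sampling is used here only for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    lo, hi = np.asarray(bounds, dtype=float).T      # vectors of a_i and b_i
    X = lo + (hi - lo) * rng.random((budget, len(bounds)))
    Y = np.array([f(x) for x in X])
    return X, Y

# Usage: an illustrative blackbox on the unit box in R^6, budget K = 100.
f = lambda x: np.array([np.sum(x ** 2)])
X_bb, Y_bb = sample_blackbox(f, bounds=[(0.0, 1.0)] * 6, budget=100)
```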