4. Comparison with OLS and OMP#

FastCan has a close relationship with Orthogonal Least Squares (OLS) [1] and Orthogonal Matching Pursuit (OMP) [2]. The detailed difference between OLS and OMP can be found in [3]. Here, let’s briefly compare the three methods.

Assume we have a feature matrix \(X_s \in \mathbb{R}^{N\times t}\), which contains \(t\) selected features, and a target vector \(y \in \mathbb{R}^{N\times 1}\). Then the residual \(r \in \mathbb{R}^{N\times 1}\) of the least-squares can be found by

\[r = y - X_s \beta \;\; \text{where} \;\; \beta = (X_s^\top X_s)^{-1}X_s^\top y\]

When evaluating a new candidate feature \(x_i \in \mathbb{R}^{N\times 1}\)

  • for OMP, the feature which maximizes \(r^\top x_i\) will be selected,

  • for OLS, the feature which maximizes \(r^\top w_i\) will be selected, where \(w_i \in \mathbb{R}^{N\times 1}\) is the projection of \(x_i\) on the orthogonal subspace so that it is orthogonal to \(X_s\), i.e., \(X_s^\top w_i = \mathbf{0} \in \mathbb{R}^{t\times 1}\),

  • for FastCan (h-correlation algorithm), it is almost same as OLS, but the difference is that in FastCan, \(X_s\), \(y\), and \(x_i\) are centered (i.e., zero mean in each column) before the selection.

The small difference makes the feature ranking criterion of FastCan is equivalent to the sum of squared canonical correlation coefficients, which gives it the following advantages over OLS and OMP:

  • Affine invariance: if features are polluted by affine transformation, i.e., scaled and/or added some constants, the selection result given by FastCan will be unchanged. See Affine invariance.

  • Multioutput: as FastCan use canonical correlation for feature ranking, it is naturally support feature selection on dataset with multioutput.

References