6. Dictionary learning based unsupervised data pruning
Unlike feature selection, which reduces the size of a dataset column-wise,
data pruning reduces the size of a dataset row-wise.
To use FastCan for unsupervised data pruning, the target matrix \(Y\) is
first obtained with dictionary learning.
Dictionary learning learns a dictionary, which is composed of atoms.
The atoms should be representative enough that each sample in the dataset can be represented (with some error)
by a sparse linear combination of the atoms.
We use these atoms as the target \(Y\) and select samples based on their correlation with \(Y\).
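The step of obtaining \(Y\) can be sketched as follows. This is only an illustration under stated assumptions: it uses a small hand-written k-means as a stand-in dictionary learner (each sample is then represented by a single atom, the simplest sparse combination), and the helper name learn_atoms_kmeans is hypothetical, not part of any library. The transposes at the end produce the layout used in this section, where rows are features and columns are samples or atoms.

```python
import numpy as np


def learn_atoms_kmeans(X, n_atoms, n_iter=20, seed=0):
    """Hypothetical stand-in dictionary learner: k-means centres used as atoms.

    X has one sample per row. Returns an array of shape (n_atoms, n_features).
    """
    rng = np.random.default_rng(seed)
    # Initialise the atoms with randomly chosen distinct samples.
    atoms = X[rng.choice(X.shape[0], size=n_atoms, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every sample to every atom.
        dist = ((X[:, None, :] - atoms[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(n_atoms):
            members = X[labels == k]
            if len(members) > 0:  # keep the old atom if its cluster is empty
                atoms[k] = members.mean(axis=0)
    return atoms


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # 200 samples (rows), 3 features (columns)
atoms = learn_atoms_kmeans(X, n_atoms=5)

# Transposed layout: rows are features, columns are samples or atoms.
X_t = X.T    # shape (3, 200): candidate samples as columns
Y = atoms.T  # shape (3, 5): atoms y_i as columns
```

Any other dictionary learner that returns one atom per row could be substituted for the k-means helper without changing the rest of the sketch.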
One challenge in using FastCan for data pruning is that the number of samples to select is much larger than in feature selection.
Normally, this number is greater than the number of features, which makes the pruned data matrix singular.
In other words, FastCan will readily treat the pruned data as redundant and conclude that no additional sample
should be selected, since any additional sample can be represented by a linear combination of the selected samples.
Therefore, the number to select would have to be kept small.
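The singularity can be checked numerically. The sketch below uses random data and assumes samples are stored as columns, so a matrix with \(n\) features has \(n\) rows. Once \(n\) linearly independent samples are selected, any further sample is an exact linear combination of them and the rank stops growing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3                          # number of features
X = rng.normal(size=(n, 10))   # 10 candidate samples stored as columns

X_s = X[:, :n]                 # n already-selected samples
extra = X[:, n]                # one more candidate sample

# The extra sample is reproduced exactly by the selected ones.
coef = np.linalg.solve(X_s, extra)
residual = np.linalg.norm(X_s @ coef - extra)

rank_selected = np.linalg.matrix_rank(X_s)
rank_with_extra = np.linalg.matrix_rank(X[:, : n + 1])

print(rank_selected, rank_with_extra, residual)
```

Both ranks equal the number of features and the residual is zero up to floating-point error, which is why a strict redundancy check stops selecting after about \(n\) samples.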
To solve this problem, we use minibatch() to loosen the redundancy check of FastCan.
The original FastCan checks the redundancy within \(X_s \in \mathbb{R}^{n\times t}\),
which contains \(t\) selected samples and \(n\) features,
and the redundancy within \(Y \in \mathbb{R}^{n\times m}\), which contains \(m\) atoms \(y_i\).
Instead of the canonical correlation coefficients between \(X_s\) and \(Y\) that FastCan uses,
minibatch() ranks samples with the multiple correlation coefficients between \(X_b \in \mathbb{R}^{n\times b}\) and \(y_i\),
where \(b\) is the batch size and \(b \le t\).
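For reference, the two quantities can be written with their standard textbook definitions. The weight vectors \(w\), \(u\), and \(v\) are notation introduced here for illustration, and only the leading canonical correlation is shown:

\[
R(X_b, y_i) = \max_{w \in \mathbb{R}^{b}} \operatorname{corr}(X_b w,\; y_i),
\qquad
\rho_1(X_s, Y) = \max_{u \in \mathbb{R}^{t},\; v \in \mathbb{R}^{m}} \operatorname{corr}(X_s u,\; Y v).
\]

The multiple correlation coefficient is the special case of canonical correlation in which the second block is a single vector \(y_i\), so no weighting across the atoms takes place.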
Therefore, minibatch() loosens the redundancy check in two ways.
First, it uses \(y_i\) instead of \(Y\), so no redundancy check is performed within \(Y\).
Second, it uses \(X_b\) instead of \(X_s\), so
minibatch() only checks the redundancy within a batch \(X_b\), but does not check the redundancy between batches.
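The batch-wise idea can be sketched in plain NumPy. This is a simplified illustration, not the library's implementation: the names multiple_r2 and minibatch_sketch are hypothetical, the even split of the selection budget across atoms is an assumption, and the squared multiple correlation is computed by brute-force least squares rather than by an efficient update.

```python
import numpy as np


def multiple_r2(X_b, y):
    """Squared multiple correlation between the columns of X_b and y."""
    X_c = X_b - X_b.mean(axis=0)
    y_c = y - y.mean()
    coef, *_ = np.linalg.lstsq(X_c, y_c, rcond=None)
    resid = y_c - X_c @ coef
    return 1.0 - (resid @ resid) / (y_c @ y_c)


def minibatch_sketch(X, Y, n_to_select, batch_size):
    """Greedy selection of columns of X, one atom and one batch at a time.

    X: (n, N) candidate samples as columns. Y: (n, m) atoms as columns.
    """
    n_candidates, m = X.shape[1], Y.shape[1]
    selected = []
    # Assumption: split the selection budget as evenly as possible over atoms.
    quota = np.diff(np.linspace(0, n_to_select, m + 1).astype(int))
    for i in range(m):
        y_i = Y[:, i]  # one atom at a time: no redundancy check within Y
        remaining = int(quota[i])
        while remaining > 0:
            batch = []  # redundancy is only accounted for inside this batch
            for _ in range(min(batch_size, remaining)):
                candidates = [
                    j for j in range(n_candidates)
                    if j not in selected and j not in batch
                ]
                scores = [multiple_r2(X[:, batch + [j]], y_i) for j in candidates]
                batch.append(candidates[int(np.argmax(scores))])
            selected.extend(batch)
            remaining -= len(batch)
    return selected


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 40))  # 6 features (rows), 40 candidate samples (columns)
Y = rng.normal(size=(6, 3))   # 3 atoms; random here, only to exercise the code
picked = minibatch_sketch(X, Y, n_to_select=12, batch_size=2)
```

Because each batch starts from an empty set, the sketch can select 12 samples even though there are only 6 features, which a check over the whole selected set would not allow.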
References
“Dictionary learning-based data pruning for system identification” Wang, T., Zhang, S., Song, M., & Sun, L. Applied Sciences, 15(17), 9368 (2025).
Examples
See Data pruning for an example of dictionary learning based data pruning.