

Poster in the Workshop on Machine Learning and Compression

Randomly Pivoted V-optimal Design: Fast Data Selection under Low Intrinsic Dimension

Yijun Dong · Xiang Pan · Viet Hoang Phan · Qi Lei


Abstract: Despite the high dimensionality brought about by ever-larger models and datasets, low intrinsic dimension is common in many high-dimensional learning problems (e.g., fine-tuning). To explore sample-efficient learning that leverages such low intrinsic dimension, we introduce randomly pivoted V-optimal design (RPVopt), a fast data selection algorithm that combines dimension reduction via sketching with optimal experimental design. Given a large dataset of $N$ samples in a high dimension $d$, RPVopt first reduces the dimension from $d$ to $m \ll d$ by embedding the data into a random low-dimensional subspace via sketching. A coreset of size $n > m$ is then selected from the low-dimensional sketched data through an efficient two-stage random pivoting algorithm. With a fast embedding matrix for sketching, RPVopt achieves an asymptotic complexity of $O(Nd + Nnm)$, linear in the full dataset size, the data dimension, and the coreset size. Through extensive experiments in both regression and classification settings, we demonstrate the empirical effectiveness of RPVopt for data selection when fine-tuning on vision tasks.
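The abstract does not spell out the two-stage pivoting rule, but the overall pipeline it describes can be sketched in a few lines. Below is a minimal, hypothetical Python illustration: it uses a dense Gaussian sketch in place of the fast embedding the abstract mentions, and a randomly-pivoted-Cholesky-style residual-sampling rule as a stand-in for the paper's selection stage. The function `rpvopt_select` and all parameter names are our own assumptions, not the authors' implementation.

```python
import numpy as np

def rpvopt_select(X, m, n, rng=None):
    """Hypothetical sketch of the pipeline from the abstract:
    (1) sketch the N x d data down to m dimensions, then
    (2) pick a coreset of n > m rows by random pivoting on the sketch.
    The pivoting rule here samples each pivot with probability
    proportional to its residual squared norm (as in randomly pivoted
    Cholesky); the paper's actual two-stage procedure may differ.
    """
    rng = np.random.default_rng(rng)
    N, d = X.shape

    # Stage 0: dimension reduction via a dense Gaussian sketch, O(N d m).
    # A fast/structured embedding (e.g., SRHT) would give the O(N d) term
    # in the abstract's complexity bound.
    S = rng.standard_normal((d, m)) / np.sqrt(m)
    Y = X @ S                                  # sketched data, N x m

    # Random pivoting: keep residuals of Y after projecting out the span
    # of already-selected rows; sample the next pivot proportionally to
    # its residual squared norm. O(N n m) overall.
    R = Y.copy()
    selected = []
    for _ in range(n):
        p = np.einsum("ij,ij->i", R, R)        # residual squared norms
        p = np.clip(p, 0.0, None)
        if selected:
            p[selected] = 0.0                  # never re-select a pivot
        if p.sum() <= 1e-12:                   # residual exhausted: fall back
            remaining = np.setdiff1d(np.arange(N), selected)
            extra = rng.choice(remaining, size=n - len(selected),
                               replace=False)
            selected.extend(extra.tolist())
            break
        i = rng.choice(N, p=p / p.sum())
        selected.append(int(i))
        # Deflate: project the pivot direction out of all residuals
        # (one Gram-Schmidt step).
        u = R[i] / np.linalg.norm(R[i])
        R -= np.outer(R @ u, u)
    return np.asarray(selected[:n])

# Usage: pick 256 of 100k points after sketching 4096 dims down to 64.
if __name__ == "__main__":
    X = np.random.default_rng(0).standard_normal((100_000, 4_096))
    idx = rpvopt_select(X, m=64, n=256, rng=0)
    print(idx.shape)  # (256,)
```

Sampling pivots proportionally to residual norms, rather than greedily taking the largest residual, is what makes this style of selection both fast and robust to adversarial outliers, which is presumably why the authors adopt random pivoting over a purely greedy V-optimal rule.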
