Abstract:
We consider the linear model $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}_{\star}+\boldsymbol{\varepsilon}$ with $\mathbf{X}\in \mathbb{R}^{n\times p}$ in the overparameterized regime $p>n$. We estimate $\boldsymbol{\beta}_{\star}$ via generalized (weighted) ridge regression: $\hat{\boldsymbol{\beta}}_{\lambda}=\left(\mathbf{X}^{\top}\mathbf{X}+\lambda\boldsymbol{\Sigma}_w\right)^{\dagger}\mathbf{X}^{\top}\mathbf{y}$, where $\boldsymbol{\Sigma}_w$ is the weighting matrix. Under a random design with general data covariance $\boldsymbol{\Sigma}_x$ and an anisotropic prior on the true coefficients, $\mathbb{E}\,\boldsymbol{\beta}_{\star}\boldsymbol{\beta}_{\star}^{\top}=\boldsymbol{\Sigma}_\beta$, we provide an exact characterization of the prediction risk $\mathbb{E}(y-\mathbf{x}^{\top}\hat{\boldsymbol{\beta}}_{\lambda})^2$ in the proportional asymptotic limit $p/n\rightarrow \gamma \in (1,\infty)$. This general setup leads to a number of interesting findings. We outline precise conditions that determine the sign of the optimal ridge parameter $\lambda_{\mathrm{opt}}$; these conditions suggest an implicit $\ell_2$-regularization effect of overparameterization and theoretically justify the surprising empirical observation that $\lambda_{\mathrm{opt}}$ can be \textit{negative} in the overparameterized regime. We also characterize the double descent phenomenon for principal component regression (PCR) when $\mathbf{X}$ and $\boldsymbol{\beta}_{\star}$ are non-isotropic. Finally, we determine the optimal weighting matrix $\boldsymbol{\Sigma}_w$ for both the ridgeless ($\lambda\to 0$) and optimally regularized ($\lambda = \lambda_{\mathrm{opt}}$) cases, and demonstrate the advantage of the weighted objective over standard ridge regression and PCR.
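To make the setup concrete, the following is a minimal numerical sketch (not the paper's experiments) of the weighted ridge estimator and a Monte Carlo estimate of the prediction risk over a grid of $\lambda$ values, including negative ones. All specifics here (dimensions, covariance spectra, noise level, the identity choice of $\boldsymbol{\Sigma}_w$) are illustrative assumptions; the pseudoinverse is used so that $\lambda \le 0$ is handled as in the estimator above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 300            # overparameterized: gamma = p/n = 3 (assumed)
sigma_eps = 0.5            # noise standard deviation (assumed)

# Anisotropic data covariance Sigma_x and coefficient prior Sigma_beta,
# both diagonal here for simplicity (assumed spectra).
eig_x = np.linspace(0.5, 2.0, p)
eig_b = np.linspace(2.0, 0.5, p) / p     # scaled so E||beta_star||^2 = O(1)
Sigma_x_half = np.diag(np.sqrt(eig_x))
beta_star = np.sqrt(eig_b) * rng.standard_normal(p)

# Random design and response: y = X beta_star + eps.
X = rng.standard_normal((n, p)) @ Sigma_x_half
y = X @ beta_star + sigma_eps * rng.standard_normal(n)

Sigma_w = np.eye(p)        # standard ridge weighting; the paper optimizes this choice

def weighted_ridge(X, y, lam, Sigma_w):
    """beta_hat = (X^T X + lam * Sigma_w)^dagger X^T y."""
    return np.linalg.pinv(X.T @ X + lam * Sigma_w) @ X.T @ y

def prediction_risk(beta_hat, n_test=20000):
    """Monte Carlo estimate of E(y - x^T beta_hat)^2 on fresh draws."""
    X_test = rng.standard_normal((n_test, p)) @ Sigma_x_half
    y_test = X_test @ beta_star + sigma_eps * rng.standard_normal(n_test)
    return np.mean((y_test - X_test @ beta_hat) ** 2)

# Scan lambda over a range that includes negative values; lam = 0 recovers
# the min-norm (ridgeless) interpolator via the pseudoinverse.
for lam in [-0.5, -0.1, 0.0, 0.1, 0.5, 1.0]:
    risk = prediction_risk(weighted_ridge(X, y, lam, Sigma_w))
    print(f"lambda = {lam:5.2f}   estimated risk = {risk:.4f}")
```

Swapping `Sigma_w` for another positive-definite matrix gives the weighted objective studied in the abstract; the empirical minimizer of the printed risk curve plays the role of $\lambda_{\mathrm{opt}}$ in this toy setting.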