Posters in this session:
A PAC-Bayesian Perspective on the Interpolating Information Criterion
Graph Neural Networks Benefit from Structural Information Provably: A Feature Learning Perspective
Linear attention is (maybe) all you need (to understand transformer optimization)
Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study
Feature Learning in Infinite-Depth Neural Networks
Variational Classification
Implicit biases in multitask and continual learning from a backward error analysis perspective
Spectrum Extraction and Clipping for Implicitly Linear Layers
The Noise Geometry of Stochastic Gradient Descent: A Quantitative and Analytical Characterization
Curvature-Dimension Tradeoff for Generalization in Hyperbolic Space
Complexity Matters: Dynamics of Feature Learning in the Presence of Spurious Correlations
Unveiling the Hessian's Connection to the Decision Boundary
Nonparametric Classification on Low Dimensional Manifolds using Overparameterized Convolutional Residual Networks
Large Learning Rates Improve Generalization: But How Large Are We Talking About?
Understanding the Role of Noisy Statistics in the Regularization Effect of Batch Normalization
Generalization Guarantees of Deep ResNets in the Mean-Field Regime
Theoretical Explanation for Generalization from Adversarial Perturbations
In-Context Convergence of Transformers
How Two-Layer Neural Networks Learn, One (Giant) Step at a Time
Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States
Unraveling the Complexities of Simplicity Bias: Mitigating and Amplifying Factors
Transformers as Support Vector Machines
Symmetric Mean-field Langevin Dynamics for Distributional Minimax Problems
A Theoretical Study of Dataset Distillation
Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models
Introducing an Improved Information-Theoretic Measure of Predictive Uncertainty
In-Context Learning on Unstructured Data: Softmax Attention as a Mixture of Experts
Attention-Only Transformers and Implementing MLPs with Attention Heads
Privacy at Interpolation: Precise Analysis for Random and NTK Features
Denoising Low-Rank Data Under Distribution Shift: Double Descent and Data Augmentation
A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks
Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data
How does Gradient Descent Learn Features – A Local Analysis for Regularized Two-Layer Neural Networks
Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP
Provably Efficient CVaR RL in Low-rank MDPs
Analysis of Task Transferability in Large Pre-trained Classifiers
On Scale-Invariant Sharpness Measures
Gibbs-Based Information Criteria and the Over-Parameterized Regime
Grokking modular arithmetic can be explained by margin maximization