| Session 1 (Moderator: Courtney Paquette)
|
9:00am-9:05am | Organizers
| Welcome Remarks |
9:00am-9:30am | Yair Carmon (Tel-Aviv University)
| DoG is SGD’s best friend: toward tuning-free stochastic optimization | [abstract] |
While stochastic optimization methods drive continual improvements in machine learning, choosing the optimization parameters, and particularly the learning rate (LR), remains difficult. In this talk, I will describe our work on removing LR tuning from stochastic gradient descent (SGD), culminating in a tuning-free dynamic SGD step-size formula that we call Distance over Gradients (DoG). We show that DoG removes the need to tune the learning rate both theoretically (obtaining strong parameter-free convergence guarantees) and empirically (performing nearly as well as expensively tuned SGD on neural network training tasks).
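For intuition, the sketch below illustrates a DoG-style step size: the largest distance traveled from the initial point divided by the square root of the accumulated squared gradient norms. It is a rough reconstruction for illustration only, not the authors' reference implementation; the `grad_fn` signature, the `r_eps` floor, and the loop structure are assumptions made here.

```python
import numpy as np

def dog_sgd(grad_fn, x0, n_steps=1000, r_eps=1e-6, rng=None):
    """Sketch of SGD with a DoG-style (Distance over Gradients) step size.

    eta_t = (largest distance from x0 seen so far, floored at r_eps)
            / sqrt(running sum of squared stochastic-gradient norms).
    Illustrative reconstruction only, not the authors' reference implementation.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    max_dist = r_eps          # stands in for max_{i <= t} ||x_i - x0||, with a small floor
    grad_sq_sum = 0.0         # running sum of ||g_i||^2
    for _ in range(n_steps):
        g = grad_fn(x, rng)                              # stochastic gradient at x (assumed signature)
        grad_sq_sum += float(np.dot(g, g))
        eta = max_dist / np.sqrt(grad_sq_sum + 1e-12)    # "distance over gradients" step size
        x = x - eta * g
        max_dist = max(max_dist, float(np.linalg.norm(x - x0)))
    return x
```

A toy call such as `dog_sgd(lambda x, rng: 2 * x + 0.1 * rng.standard_normal(x.size), np.ones(10))` runs the sketch on a noisy quadratic; note that no learning rate is supplied.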
|
9:30am-10:00am | Contributed talks
|
- Bruno Loureiro: Escaping mediocrity: how two-layer networks learn hard generalized linear models
- Daniil Vankov: Last Iterate Convergence of Popov Method for Non-monotone Stochastic Variational Inequalities
| [papers] |
10:00am-11:00am | Poster Session 1
|
| [posters] |
| Session 2 (Moderator: Aaron Sidford)
|
11:00am-11:30am | Contributed talks
|
- Guy Kornowski: An Algorithm with Optimal Dimension-Dependence for Zero-Order Nonsmooth Nonconvex Stochastic Optimization
- Michael Lu: Practical Principled Policy Optimization for Finite MDPs
| [papers] |
11:30am-12:00pm | Dmitry Drusvyatskiy (University of Washington)
| Aiming towards the minimizers: fast convergence of SGD for overparameterized problems | [abstract] |
Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method, while using only a small minibatch of sampled gradients in each iteration. In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence. Finally, we demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.
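To make the interpolation setting concrete (this is background intuition only, not the regularity condition proposed in the talk), the snippet below runs constant-step minibatch SGD on a noiseless overparameterized least-squares problem, where a single parameter vector fits every sample exactly; the dimensions, batch size, and step size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 50, 500, 5                   # far more parameters than samples
A = rng.standard_normal((n, d)) / np.sqrt(d)
y = A @ rng.standard_normal(d)             # noiseless labels: an interpolating solution exists

x = np.zeros(d)
step = 0.5                                 # constant (non-decaying) step size, illustrative choice
for _ in range(3000):
    idx = rng.choice(n, size=batch, replace=False)
    grad = A[idx].T @ (A[idx] @ x - y[idx]) / batch   # minibatch least-squares gradient
    x -= step * grad

print("final training loss:", 0.5 * np.mean((A @ x - y) ** 2))
```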
|
12:00pm-02:00pm | Lunch
|
| Session 3 (Moderator: Sebastian Stich)
|
02:00pm-02:30pm | Virginia Smith (Carnegie Mellon University)
| Evaluating Large-Scale Learning Systems | [abstract] |
To deploy machine learning models in practice, it is critical to have a way to reliably evaluate their effectiveness. Unfortunately, the scale and complexity of modern machine learning systems make it difficult to provide faithful evaluations and to gauge performance across potential deployment scenarios. In this talk, I discuss our work addressing challenges in large-scale ML evaluation. First, I explore the problem of hyperparameter optimization in federated networks of devices, where issues of device subsampling, heterogeneity, and privacy can introduce noise into the evaluation process and make it challenging to perform optimization effectively. Second, I present ReLM, a system for validating and querying large language models (LLMs). Although LLMs have been touted for their ability to generate natural-sounding text, there is a growing need to evaluate their behavior in light of issues such as data memorization, bias, and inappropriate language. ReLM poses LLM validation queries as regular expressions to enable faster and more effective LLM evaluation.
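As a toy illustration of posing validation queries as regular expressions (this is not the ReLM interface; `generate` and the patterns below are hypothetical stand-ins), one can scan sampled model outputs for strings matching patterns of concern, such as text that looks like a memorized email address or phone number:

```python
import re

# Hypothetical stand-in for sampling text from a language model.
def generate(prompt: str, n_samples: int = 3) -> list[str]:
    return ["Contact me at jane.doe@example.com for details."] * n_samples

# Validation queries expressed as regular expressions (illustrative patterns only).
QUERIES = {
    "email_like": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone_like": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def screen_outputs(prompt: str) -> dict[str, list[str]]:
    """Return, for each query, the matches found across sampled model outputs."""
    hits: dict[str, list[str]] = {name: [] for name in QUERIES}
    for text in generate(prompt):
        for name, pattern in QUERIES.items():
            hits[name].extend(pattern.findall(text))
    return hits

print(screen_outputs("Tell me about Jane Doe."))
```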
|
02:30pm-03:00pm | Contributed talks
|
- Naren Sarayu Manoj: Dueling Optimization with a Monotone Adversary
- Georgy Noarov: High-Dimensional Prediction for Sequential Decision Making
| [papers] |
03:00pm-04:00pm | Poster Session 2
|
| [posters] |
| Session 4 (Moderator: Courtney Paquette)
|
04:00pm-04:30pm | Ashwin Pananjady (Georgia Tech)
| Sharply predicting the behavior of complex iterative algorithms with random data | [abstract] |
Iterative algorithms are the workhorses of modern signal processing and statistical learning, and are widely used to fit complex models to random data. While the choice of an algorithm and its hyperparameters determines both the speed and fidelity of the learning pipeline, it is common for this choice to be made heuristically, either by expensive trial-and-error or by comparing upper bounds on convergence rates of various candidate algorithms. Motivated by these issues, we develop a principled framework that produces sharp, iterate-by-iterate characterizations of solution quality for complex iterative algorithms on several nonconvex model-fitting problems with random data. Such sharp predictions can provide precise separations between families of algorithms while also revealing nonstandard convergence phenomena. We will showcase the general framework on several canonical models in statistical machine learning.
|
04:30pm-05:00pm | Jason Lee (Princeton University)
| Provable Feature Learning in Gradient Descent | [abstract] |
We focus on the task of learning a single-index model $\sigma(w^* \cdot x)$ with respect to the isotropic Gaussian distribution in $d$ dimensions, including the special case in which $\sigma$ is a $k$-th order Hermite polynomial, the Gaussian analog of parity learning. Prior work has shown that the sample complexity of learning $w^*$ is governed by the *information exponent* $k^*$ of the link function $\sigma$, defined as the index of the first nonzero Hermite coefficient of $\sigma$. Prior upper bounds show that $n \gtrsim d^{k^*-1}$ samples suffice for learning $w^*$ and that this is tight for online SGD (Ben Arous et al., 2020). However, the CSQ lower bound for gradient-based methods only shows that $n \gtrsim d^{k^*/2}$ samples are necessary. In this work, we close the gap between the upper and lower bounds by showing that online SGD on a smoothed loss learns $w^*$ with $n \gtrsim d^{k^*/2}$ samples. Next, we turn to the problem of learning multi-index models $f(x) = g(Ux)$, where $U$ encodes a low-dimensional latent representation. Significant prior work has established that neural networks trained by gradient descent behave like kernel methods, despite the significantly worse empirical performance of kernel methods. In this work, however, we demonstrate that for this large class of functions there is a large gap between kernel methods and gradient descent on a two-layer neural network, by showing that gradient descent learns representations relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials that depend on only a few relevant directions, i.e. of the form $f^*(x) = g(Ux)$, where $U$ maps $\mathbb{R}^d$ to the $r$ relevant directions. When the degree of $f^*$ is $p$, it is known that $n \asymp d^p$ samples are necessary to learn $f^*$ in the kernel regime. Our primary result is that gradient descent learns a representation of the data that depends only on the directions relevant to $f^*$. This yields an improved sample complexity of $n \asymp d^2 r + d r^p$. Furthermore, in a transfer-learning setup where the data distributions in the source and target domains share the same representation $U$ but have different polynomial heads, we show that a popular heuristic for transfer learning has a target sample complexity independent of $d$.
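For reference, the quantities in this abstract can be written out explicitly; the display below (assuming amsmath) paraphrases the stated definitions and bounds via the Hermite expansion of the link function and adds nothing beyond them:

```latex
\begin{align*}
\sigma(z) &= \sum_{j \ge 0} c_j\, h_j(z),
  \qquad k^* = \min\{\, j \ge 1 : c_j \neq 0 \,\}
  && \text{(information exponent of the link)}\\
\text{plain online SGD:}\quad & n \gtrsim d^{\,k^*-1}
  && \text{(tight; Ben Arous et al., 2020)}\\
\text{CSQ lower bound:}\quad & n \gtrsim d^{\,k^*/2} && \\
\text{online SGD on a smoothed loss:}\quad & n \gtrsim d^{\,k^*/2}
  && \text{(closes the gap)}
\end{align*}
```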
|
05:00pm-05:01pm | Cristóbal Guzmán
| Closing Remarks |