| Session 1 (Moderator: Courtney Paquette)
|
9:00am-9:05am | Organizers
| Welcome Remarks |
9:00am-9:30am | Yair Carmon (Tel-Aviv University)
| DoG is SGD’s best friend: toward tuning-free stochastic optimization | [abstract] |
While stochastic optimization methods drive continual improvements in machine learning, choosing the optimization parameters, and particularly the learning rate (LR), remains difficult. In this talk, I will describe our work on removing LR tuning from stochastic gradient descent (SGD), culminating in a tuning-free dynamic SGD step-size formula that we call Distance over Gradients (DoG). We show that DoG removes the need to tune the learning rate both theoretically (obtaining strong parameter-free convergence guarantees) and empirically (performing nearly as well as expensively tuned SGD on neural network training tasks).
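For intuition, the sketch below illustrates a DoG-style step size: the largest distance traveled from the initial point divided by the square root of the accumulated squared gradient norms. It is a rough reconstruction for illustration only, not the authors' reference implementation; the `grad_fn` signature, the `r_eps` floor, and the loop structure are assumptions made here.

```python
import numpy as np

def dog_sgd(grad_fn, x0, n_steps=1000, r_eps=1e-6, rng=None):
    """Sketch of SGD with a DoG-style (Distance over Gradients) step size.

    eta_t = (largest distance from x0 seen so far, floored at r_eps)
            / sqrt(running sum of squared stochastic-gradient norms).
    Illustrative reconstruction only, not the authors' reference implementation.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    max_dist = r_eps          # stands in for max_{i <= t} ||x_i - x0||, with a small floor
    grad_sq_sum = 0.0         # running sum of ||g_i||^2
    for _ in range(n_steps):
        g = grad_fn(x, rng)                              # stochastic gradient at x (assumed signature)
        grad_sq_sum += float(np.dot(g, g))
        eta = max_dist / np.sqrt(grad_sq_sum + 1e-12)    # "distance over gradients" step size
        x = x - eta * g
        max_dist = max(max_dist, float(np.linalg.norm(x - x0)))
    return x
```

A toy call such as `dog_sgd(lambda x, rng: 2 * x + 0.1 * rng.standard_normal(x.size), np.ones(10))` runs the sketch on a noisy quadratic; note that no learning rate is supplied.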
|
9:30am-10:00am | Contributed talks
|
- Bruno Loureiro: Escaping mediocrity: how two-layer networks learn hard generalized linear models
- Daniil Vankov: Last Iterate Convergence of Popov Method for Non-monotone Stochastic Variational Inequalities
| [papers] |
10:00am-11:00am | Poster Session 1
|
| [posters] |
| Session 2 (Moderator: Aaron Sidford)
|
11:00am-11:30am | Contributed talks
|
- Guy Kornowski: An Algorithm with Optimal Dimension-Dependence for Zero-Order Nonsmooth Nonconvex Stochastic Optimization
- Michael Lu: Practical Principled Policy Optimization for Finite MDPs
| [papers] |
11:30am-12:00pm | Dmitry Drusvyatskiy (University of Washington)
| Aiming towards the minimizers: fast convergence of SGD for overparameterized problems | [abstract] |
Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method, while using only a small minibatch of sampled gradients in each iteration. In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence. Finally, we demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.
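To make the interpolation setting concrete (this is background intuition only, not the regularity condition proposed in the talk), the snippet below runs constant-step minibatch SGD on a noiseless overparameterized least-squares problem, where a single parameter vector fits every sample exactly; the dimensions, batch size, and step size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 50, 500, 5                   # far more parameters than samples
A = rng.standard_normal((n, d)) / np.sqrt(d)
y = A @ rng.standard_normal(d)             # noiseless labels: an interpolating solution exists

x = np.zeros(d)
step = 0.5                                 # constant (non-decaying) step size, illustrative choice
for _ in range(3000):
    idx = rng.choice(n, size=batch, replace=False)
    grad = A[idx].T @ (A[idx] @ x - y[idx]) / batch   # minibatch least-squares gradient
    x -= step * grad

print("final training loss:", 0.5 * np.mean((A @ x - y) ** 2))
```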
|
12:00pm-02:00pm | Lunch
|
| Session 3 (Moderator: Sebastian Stich)
|
02:00pm-02:30pm | Virginia Smith (Carnegie Mellon University)
| Evaluating Large-Scale Learning Systems | [abstract] |
To deploy machine learning models in practice, it is critical to have a way to reliably evaluate their effectiveness. Unfortunately, the scale and complexity of modern machine learning systems make it difficult to provide faithful evaluations and to gauge performance across potential deployment scenarios. In this talk, I discuss our work addressing challenges in large-scale ML evaluation. First, I explore the problem of hyperparameter optimization in federated networks of devices, where issues of device subsampling, heterogeneity, and privacy can introduce noise into the evaluation process and make it challenging to perform optimization effectively. Second, I present ReLM, a system for validating and querying large language models (LLMs). Although LLMs have been touted for their ability to generate natural-sounding text, there is a growing need to evaluate their behavior in light of issues such as data memorization, bias, and inappropriate language. ReLM poses LLM validation queries as regular expressions to enable faster and more effective LLM evaluation.
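As a toy illustration of posing validation queries as regular expressions (this is not the ReLM interface; `generate` and the patterns below are hypothetical stand-ins), one can scan sampled model outputs for strings matching patterns of concern, such as text that looks like a memorized email address or phone number:

```python
import re

# Hypothetical stand-in for sampling text from a language model.
def generate(prompt: str, n_samples: int = 3) -> list[str]:
    return ["Contact me at jane.doe@example.com for details."] * n_samples

# Validation queries expressed as regular expressions (illustrative patterns only).
QUERIES = {
    "email_like": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone_like": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def screen_outputs(prompt: str) -> dict[str, list[str]]:
    """Return, for each query, the matches found across sampled model outputs."""
    hits: dict[str, list[str]] = {name: [] for name in QUERIES}
    for text in generate(prompt):
        for name, pattern in QUERIES.items():
            hits[name].extend(pattern.findall(text))
    return hits

print(screen_outputs("Tell me about Jane Doe."))
```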
|
02:30pm-03:00pm | Contributed talks
|
- Naren Sarayu Manoj: Dueling Optimization with a Monotone Adversary
- Georgy Noarov: High-Dimensional Prediction for Sequential Decision Making
| [papers] |
03:00pm-04:00pm | Poster Session 2
|
| [posters] |
| Session 4 (Moderator: Courtney Paquette)
|
04:00pm-04:30pm | Ashwin Pananjady (Georgia Tech)
| Sharply predicting the behavior of complex iterative algorithms with random data | [abstract] |
Iterative algorithms are the workhorses of modern signal processing and statistical learning, and are widely used to fit complex models to random data. While the choice of an algorithm and its hyperparameters determines both the speed and fidelity of the learning pipeline, it is common for this choice to be made heuristically, either by expensive trial-and-error or by comparing upper bounds on convergence rates of various candidate algorithms. Motivated by these issues, we develop a principled framework that produces sharp, iterate-by-iterate characterizations of solution quality for complex iterative algorithms on several nonconvex model-fitting problems with random data. Such sharp predictions can provide precise separations between families of algorithms while also revealing nonstandard convergence phenomena. We will showcase the general framework on several canonical models in statistical machine learning.
|
04:30pm-05:00pm | Jason Lee (Princeton University)
| Provable Feature Learning in Gradient Descent | [abstract] |
We focus on the task of learning a single-index model $\sigma(w^* \cdot x)$ with respect to the isotropic Gaussian distribution in $d$ dimensions, including the special case in which $\sigma$ is a $k$-th order Hermite polynomial, the Gaussian analog of parity learning. Prior work has shown that the sample complexity of learning $w^*$ is governed by the *information exponent* $k^*$ of the link function $\sigma$, defined as the index of the first nonzero Hermite coefficient of $\sigma$. Prior upper bounds show that $n \gtrsim d^{k^*-1}$ samples suffice for learning $w^*$ and that this is tight for online SGD (Ben Arous et al., 2020). However, the CSQ lower bound for gradient-based methods only shows that $n \gtrsim d^{k^*/2}$ samples are necessary. In this work, we close the gap between the upper and lower bounds by showing that online SGD on a smoothed loss learns $w^*$ with $n \gtrsim d^{k^*/2}$ samples. Next, we turn to the problem of learning multi-index models $f(x) = g(Ux)$, where $U$ encodes a low-dimensional latent representation. Significant prior work has established that neural networks trained by gradient descent behave like kernel methods, despite the significantly worse empirical performance of kernel methods. In this work, however, we demonstrate that for this large class of functions there is a large gap between kernel methods and gradient descent on a two-layer neural network, by showing that gradient descent learns representations relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials that depend on only a few relevant directions, i.e. of the form $f^*(x) = g(Ux)$, where $U$ maps $\mathbb{R}^d$ to the $r$ relevant directions. When the degree of $f^*$ is $p$, it is known that $n \asymp d^p$ samples are necessary to learn $f^*$ in the kernel regime. Our primary result is that gradient descent learns a representation of the data that depends only on the directions relevant to $f^*$. This yields an improved sample complexity of $n \asymp d^2 r + d r^p$. Furthermore, in a transfer-learning setup where the data distributions in the source and target domains share the same representation $U$ but have different polynomial heads, we show that a popular heuristic for transfer learning has a target sample complexity independent of $d$.
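For reference, the quantities in this abstract can be written out explicitly; the display below (assuming amsmath) paraphrases the stated definitions and bounds via the Hermite expansion of the link function and adds nothing beyond them:

```latex
\begin{align*}
\sigma(z) &= \sum_{j \ge 0} c_j\, h_j(z),
  \qquad k^* = \min\{\, j \ge 1 : c_j \neq 0 \,\}
  && \text{(information exponent of the link)}\\
\text{plain online SGD:}\quad & n \gtrsim d^{\,k^*-1}
  && \text{(tight; Ben Arous et al., 2020)}\\
\text{CSQ lower bound:}\quad & n \gtrsim d^{\,k^*/2} && \\
\text{online SGD on a smoothed loss:}\quad & n \gtrsim d^{\,k^*/2}
  && \text{(closes the gap)}
\end{align*}
```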
|
05:00pm-05:01pm | Cristóbal Guzmán
| Closing Remarks |