| Session 1 (Moderator: Courtney Paquette)
|
9:00am-9:05am | Organizers
| Welcome Remarks |
9:05am-9:30am | Ben Grimmer (Johns Hopkins)
| Optimizing Optimization Methods with Computer Assistance | [abstract] |
This talk will present highly optimized (sometimes minimax optimal) convergence theory for Gradient Descent, Frank-Wolfe, Alternating Projections, and other methods, along with key intuitions about optimal methods discovered along the way. Our optimized convergence theory is made possible by a novel computer-assisted technology developed over the last decade called "Performance Estimation Problems" (PEPs). At their core, PEPs are infinite-dimensional problems that compute a worst-case problem instance from a given class for a given algorithm. The key ingredient to making PEPs useful for algorithm design is reducing them to an equivalent, tractable, finite-dimensional mathematical program. To facilitate our optimized convergence theory, we will develop new PEP reductions capable of computing worst-case constraint sets satisfying a range of possible structural conditions (smooth / strongly convex / bounded diameter / contains a Slater point).
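As a concrete illustration of the reduction just described, the sketch below sets up the one-step PEP for gradient descent on an L-smooth convex function as a small semidefinite program via the standard interpolation conditions. This is a minimal sketch, not the speaker's code: the use of cvxpy (with an SDP-capable solver such as SCS), the stepsize 1/L, the single iteration, and all variable names are illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

# Minimal one-step PEP sketch for gradient descent on L-smooth convex functions.
# Assumptions for illustration only: stepsize 1/L, one iteration, cvxpy + SDP solver.
L, R = 1.0, 1.0        # smoothness constant and initial distance ||x0 - x*||
h = 1.0 / L            # gradient-descent stepsize

# Represent x0, g0 = f'(x0), g1 = f'(x1) abstractly through their Gram matrix.
e_x0 = np.array([1.0, 0.0, 0.0])
e_g0 = np.array([0.0, 1.0, 0.0])
e_g1 = np.array([0.0, 0.0, 1.0])
e_xs = np.zeros(3)               # optimal point x*, placed at the origin WLOG
e_gs = np.zeros(3)               # f'(x*) = 0
e_x1 = e_x0 - h * e_g0           # one step of gradient descent

G = cp.Variable((3, 3), PSD=True)       # Gram matrix of (x0, g0, g1)
f0, f1 = cp.Variable(), cp.Variable()   # function values f(x0), f(x1)
fs = 0.0                                # f(x*) = 0 WLOG

def ip(a, b):
    # Inner product <a, b>, evaluated through the Gram matrix.
    return a @ G @ b

pts = [(e_xs, e_gs, fs), (e_x0, e_g0, f0), (e_x1, e_g1, f1)]

# Smooth convex interpolation conditions for every ordered pair of points.
cons = [fi >= fj + ip(gj, xi - xj) + ip(gi - gj, gi - gj) / (2 * L)
        for i, (xi, gi, fi) in enumerate(pts)
        for j, (xj, gj, fj) in enumerate(pts) if i != j]
cons.append(ip(e_x0, e_x0) <= R ** 2)            # ||x0 - x*||^2 <= R^2

prob = cp.Problem(cp.Maximize(f1 - fs), cons)    # worst-case final objective gap
prob.solve()
print("worst-case f(x1) - f(x*):", prob.value)   # should be near L*R^2/6
```

The optimal value of this small SDP is the worst-case guarantee over the whole function class, which is the kind of certificate the talk's PEP reductions produce at scale.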
|
9:30am-10:00am | Contributed talks
|
- Devansh Gupta: On the Inherent Privacy of Two Point Zeroth Order Projected Gradient Descent
- Matan Schliserman: The Dimension Strikes Back with Gradients: Generalization of Gradient Methods in Stochastic Convex Optimization
| [papers] |
10:00am-11:00am | Poster Session 1
| Paper ID: 1, 11, 13, 14, 21, 23, 25, 27, 28, 29, 30, 31, 33, 37, 38, 40, 47, 48, 49, 53, 54, 58, 62, 64, 66, 67, 72, 75, 76, 77, 78, 84, 85, 89, 91, 92, 93, 95, 97, 99, 100, 102, 103, 105, 106, 108, 119, 122, 123, 126, 130
| [posters] |
| Session 2 (Moderator: Sebastian Stich)
|
11:00am-11:30am | Contributed talks
|
- Depen Morwani: SOAP: Improving and Stabilizing Shampoo using Adam
- Benjamin Thérien: μLO: Compute-Efficient Meta-Generalization of Learned Optimizers
| [papers] |
11:30am-12:00pm | Misha Belkin (University of California San Diego)
| Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning | [abstract] |
I will discuss the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that the spikes in the training loss of SGD are "catapults", an optimization phenomenon originally observed in GD with large learning rates. First, we empirically show that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. Second, we posit an explanation for how catapults lead to better generalization by demonstrating that catapults promote feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.
Joint work with Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan.
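The AGOP alignment referenced above can be made concrete with a short sketch. The following is a plausible way to estimate the Average Gradient Outer Product of a scalar-valued predictor and to measure alignment between two such matrices; it assumes PyTorch, a model mapping a single input in R^d to a scalar, and hypothetical names (trained_net, true_predictor). It is an illustration of the quantity, not the authors' code.

```python
import torch

def agop(model, X):
    # X: (n, d) inputs; model maps a single R^d input to a scalar (assumption).
    # Returns the d x d Average Gradient Outer Product (1/n) sum_i g_i g_i^T.
    n, d = X.shape
    M = torch.zeros(d, d)
    for x in X:
        x = x.detach().clone().requires_grad_(True)
        (g,) = torch.autograd.grad(model(x), x)   # gradient w.r.t. the input
        M += torch.outer(g, g)
    return M / n

def alignment(A, B):
    # Cosine similarity under the Frobenius inner product.
    return (A * B).sum() / (A.norm() * B.norm())

# Hypothetical usage: compare the model's AGOP with that of the true predictor.
# score = alignment(agop(trained_net, X), agop(true_predictor, X))
```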
|
12:00pm-02:00pm | Lunch
|
|
| Session 3 (Moderator: Courtney Paquette)
|
02:00pm-02:30pm | Jason Altschuler (University of Pennsylvania)
| Acceleration by Stepsize Hedging | [abstract] |
Can we accelerate the convergence of gradient descent without changing the algorithm — just by optimizing stepsizes? Surprisingly, we show that the answer is yes. Our proposed Silver Stepsize Schedule optimizes strongly convex functions in \kappa^{\log_\rho 2} \approx \kappa^{0.7864} iterations, where \rho = 1+\sqrt{2} is the silver ratio and \kappa is the condition number. This is intermediate between the textbook unaccelerated rate \kappa and the accelerated rate \sqrt{\kappa} due to Nesterov in 1983. The non-strongly convex setting is conceptually identical and leads to an analogously accelerated rate \varepsilon^{-\log_\rho 2} \approx \varepsilon^{-0.7864}. We conjecture and provide partial evidence that these rates are optimal among all possible stepsize schedules.
The Silver Stepsize Schedule is an explicit non-monotonic fractal. Why should such stepsizes help? The core intuition is “hedging” between individually suboptimal strategies — short steps and long steps — since bad cases for the former are good cases for the latter, and vice versa. Properly combining these stepsizes yields faster convergence due to the misalignment of worst-case functions. This talk is based on a line of work with Pablo Parrilo that originates from my 2018 Master’s Thesis — which established for the first time that judiciously chosen stepsizes can enable accelerated convex optimization. Prior to this thesis, the only such result was for the special case of quadratics, due to Young in 1953.
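To make the non-monotonic fractal structure of the schedule concrete, here is a minimal sketch that builds a candidate silver schedule recursively and runs plain gradient descent with it. The recursion used (h^(1) = [sqrt(2)], h^(j+1) = [h^(j), 1 + rho^(j-1), h^(j)]) is our reading of the smooth convex construction and may differ in normalization from the papers; the toy quadratic and all names are illustrative assumptions.

```python
import numpy as np

RHO = 1 + np.sqrt(2)   # the silver ratio

def silver_schedule(k):
    # Candidate silver stepsize schedule of length 2^k - 1, in units of 1/L.
    # Recursion (assumption, see lead-in): start from [sqrt(2)] and repeatedly
    # sandwich one long step 1 + RHO^(j-1) between two copies of the schedule.
    h = [np.sqrt(2.0)]
    for j in range(1, k):
        h = h + [1.0 + RHO ** (j - 1)] + h
    return np.array(h)

def gd(grad_f, x0, stepsizes, L):
    # Plain gradient descent with a prescribed (non-monotonic) stepsize schedule.
    x = np.asarray(x0, dtype=float)
    for h in stepsizes:
        x = x - (h / L) * grad_f(x)
    return x

# Toy example: a smooth convex quadratic f(x) = 0.5 * x^T A x with L = ||A|| = 1.
A = np.diag([1.0, 0.1, 0.01])
x_final = gd(lambda x: A @ x, np.ones(3), silver_schedule(5), L=1.0)
```

The individual long steps overshoot on the sharpest directions, but the schedule as a whole is arranged so that short and long steps compensate for each other, which is the hedging intuition described above.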
|
02:30pm-03:00pm | Contributed talks
|
- Artavazd Maranjyan: MindFlayer: Efficient Asynchronous Parallel SGD in the Presence of Heterogeneous and Random Worker Compute Times
- Gauthier Gidel: Provable non-accelerations of the heavy-ball method
| [papers] |
03:00pm-04:00pm | Poster Session 2
| Paper ID: 12, 15, 16, 20, 22, 26, 35, 39, 46, 51, 52, 55, 56, 59, 60, 61, 63, 68, 69, 70, 71, 73, 74, 79, 80, 82, 83, 86, 88, 90, 94, 98, 104, 107, 109, 111, 112, 113, 114, 115, 117, 118, 124, 125, 129
| [posters] |
| Session 4 (Moderator: Cristóbal Guzmán)
|
04:00pm-04:30pm | Aryan Mokhtari (University of Texas, Austin)
| Online Learning Guided Quasi-Newton Methods: Improved Global Non-asymptotic Guarantees | [abstract] |
Quasi-Newton (QN) methods are popular iterative algorithms known for their superior practical performance compared to Gradient Descent (GD)-type methods. However, the existing theoretical results for this class of algorithms do not sufficiently justify their advantage over GD-type methods. In this talk, we discuss our recent efforts to address this issue. Specifically, in the strongly convex setting, we propose the first “globally” convergent QN method that achieves an explicit “non-asymptotic superlinear” rate. We show that the rate presented for our method is provably faster than GD after at most O(d) iterations, where d is the problem dimension. Additionally, in the convex setting, we present an accelerated variant of our proposed method that provably outperforms the accelerated gradient method and converges at a rate of O(\min\{1/k^2, \sqrt{d \log k}/k^{2.5}\}), where k is the number of iterations. To attain these results, we diverge from conventional approaches and construct our QN methods based on the Hybrid Proximal Extragradient (HPE) framework and its accelerated variants. Furthermore, a pivotal algorithmic concept underpinning our methodologies is an online learning framework for updating the Hessian approximation matrices. Specifically, we relate our method's convergence rate to the regret of a specific online convex optimization problem in the matrix space and choose the sequence of Hessian approximation matrices to minimize its overall regret.
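The online-learning view of the Hessian-approximation update can be illustrated with a toy sketch. The secant-error loss ||B s_t - y_t||^2, the stepsize, and the symmetric projection below are simplifying assumptions chosen for illustration; this is not the speaker's algorithm, only the general idea of choosing the approximation sequence to control the regret of an online matrix problem.

```python
import numpy as np

# Toy illustration: treat the Hessian-approximation update as online convex
# optimization in matrix space.  After each step the pair
#   (s_t, y_t) = (x_{t+1} - x_t, grad f(x_{t+1}) - grad f(x_t))
# is revealed, we suffer the (assumed) secant-error loss l_t(B) = ||B s_t - y_t||^2,
# and we update B by a projected online gradient step.

def online_hessian_update(B, s, y, eta=0.1):
    grad = 2.0 * np.outer(B @ s - y, s)   # gradient of ||B s - y||^2 with respect to B
    B = B - eta * grad                    # online gradient descent step
    return 0.5 * (B + B.T)                # keep the approximation symmetric

# A quasi-Newton step would then use the current approximation, e.g.
#   x_next = x - alpha * np.linalg.solve(B + lam * np.eye(len(x)), g)
# with damping lam > 0 (an illustrative choice) so the linear system is well posed.
```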
|
04:30pm-05:00pm | Organizers
| Discussion about the future of OPT-ML | |
TBD
|
05:00pm-05:00pm | Cristóbal Guzmán
| Closing Remarks |
|