6:00am-7:00am | Socializing in gather.town.
| Log in and say hello! |
| Session 1 (Moderator: Sebastian Stich, co-moderator Courtney Paquette)
|
6:50am-7:00am | Organizers
| Opening Remarks |
7:00am-7:30am | Tong Zhang (HKUST)
| The Convexity of Learning Infinite-width Deep Neural Networks | [abstract] |
Deep learning has achieved considerable empirical success in recent years. Although deep neural networks (DNNs) are highly nonconvex with respect to the model parameters, it has been observed that training overparametrized DNNs leads to consistent solutions that are highly reproducible across different random initializations.
I will explain this phenomenon by modeling DNNs using feature representations, and show that the optimization landscape is convex with respect to the features. Moreover, we show that optimization with respect to the nonconvex DNN parameters leads to a globally optimal solution under an idealized regularity condition, which can explain various empirical findings.
|
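As a companion note to the abstract above (our notation and a deliberately simplified setting, not necessarily the talk's exact formulation), one standard way to see convexity in a feature-style parametrization is the mean-field view of a two-layer network: the loss is convex in the distribution over neurons even though it is nonconvex in the raw weights.

```latex
% Illustrative mean-field view (our notation; an assumption, not the speaker's construction):
% an infinite-width two-layer network is parameterized by a measure \rho over neurons.
\[
  f_\rho(x) \;=\; \int \phi(x;\theta)\,\mathrm{d}\rho(\theta),
  \qquad
  L(\rho) \;=\; \mathbb{E}_{(x,y)}\!\left[\ell\big(f_\rho(x),\,y\big)\right] \;+\; \lambda\,R(\rho).
\]
% Since \rho \mapsto f_\rho is linear, L is convex in \rho whenever \ell is convex in its first
% argument and R is convex, even though L is nonconvex in the finite-width weight parametrization.
```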
7:30am-8:00am | Volkan Cevher (EPFL)
| Adaptation and universality in first-order methods | [abstract] |
In this talk, we review some of the recent advances in first-order methods for convex and non-convex optimization, as well as their universality properties. We call an algorithm universal if it does not need to know whether the optimization objective is smooth.
We first recall the AdaGrad method and show that AdaGrad is a universal algorithm without any modifications: It implicitly exploits the smoothness of the problem to achieve the standard O(1/k) rate in the presence of smooth objectives, where k is the iteration count.
To this end, we introduce an accelerated, universal variant of AdaGrad, dubbed AcceleGrad, that in addition obtains the optimal convergence rate of O(1/k^2) in the smooth setting with deterministic oracles. We then introduce UniXGrad, which is the first algorithm that simultaneously achieves optimal rates for smooth or non-smooth problems with either deterministic or stochastic first-order oracles in the constrained convex setting.
We conclude the presentation with results in non-convex optimization revolving around ADAM-type algorithms, including new convergence rates.
|
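For readers who want a concrete anchor for the AdaGrad discussion above, here is a minimal diagonal-AdaGrad sketch in plain NumPy; the function name, step size, and toy objectives are our own placeholders, and this is not AcceleGrad or UniXGrad. The point it illustrates is the universality claim: the same update runs whether or not the objective is smooth, with no smoothness constant supplied.

```python
import numpy as np

def adagrad(grad, x0, lr=1.0, eps=1e-8, steps=1000):
    """Diagonal AdaGrad: scale each coordinate by the root of its accumulated squared gradients."""
    x = np.asarray(x0, dtype=float).copy()
    s = np.zeros_like(x)                      # running sum of squared gradients
    for _ in range(steps):
        g = grad(x)
        s += g * g
        x -= lr * g / (np.sqrt(s) + eps)      # no smoothness parameter is ever supplied
    return x

# Toy usage: the same call works for a smooth quadratic or a nonsmooth objective.
Q = np.diag([1.0, 10.0])
x_smooth = adagrad(lambda x: Q @ x, x0=[5.0, -3.0])           # gradient of 0.5 x^T Q x
x_nonsmooth = adagrad(lambda x: np.sign(x), x0=[5.0, -3.0])   # subgradient of ||x||_1
```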
8:00am-8:30am | Spotlight Presentations live Q&A
| 10-minute talks are available (on-demand) on the website. Presenters will join the Zoom session for live Q&A.
- Laurent Condat, Distributed Proximal Splitting Algorithms with Rates and Acceleration
- Zhize Li, PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization
- Ohad Shamir, Can We Find Near-Approximately-Stationary Points of Nonsmooth Nonconvex Functions?
- Tiffany Vlaar, Constraint-Based Regularization of Neural Networks
- Mohammadi Zaki, Employing No Regret Learners for Pure Exploration in Linear Bandits
| [papers] |
8:30am-10:00am | Break and Poster Session
| Join for the poster session on gather.town.
Authors will present their work.
| [posters] |
| Session 2 (Moderator: Martin Takáč, co-moderator Courtney Paquette)
|
9:50am-10:00am | Organizers
| Opening Remarks |
10:00am-10:30am | Andreas Krause (ETH Zurich)
| Adaptive Sampling for Stochastic Risk-Averse Learning | [abstract] |
In high-stakes machine learning applications, it is crucial to perform well not only on average, but also when restricted to difficult examples. To address this, we consider the problem of training models in a risk-averse manner. We propose an adaptive sampling algorithm for stochastically optimizing the Conditional Value-at-Risk (CVaR) of a loss distribution. We use a distributionally robust formulation of the CVaR to phrase the problem as a zero-sum game between two players, and solve it efficiently using regret minimization. Our approach relies on sampling from structured Determinantal Point Processes (DPPs), which allows scaling it to large data sets. Finally, we empirically demonstrate its effectiveness on large-scale convex and non-convex learning tasks.
|
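As a hedged illustration of the CVaR objective mentioned above, the sketch below runs plain SGD with uniform sampling (not the DPP-based adaptive sampling of the talk) on the standard Rockafellar-Uryasev reformulation, in which CVaR minimization becomes a joint minimization over the model parameters and an auxiliary threshold t. The sampler `sample_loss_grad`, the constants, and the synthetic data are our own placeholders.

```python
import numpy as np

def cvar_sgd(sample_loss_grad, theta0, alpha=0.95, lr=1e-2, steps=5000, seed=0):
    """SGD on the Rockafellar-Uryasev formulation
         min_{theta, t}  t + E[(loss(theta; xi) - t)_+] / (1 - alpha),
    whose optimal value equals CVaR_alpha of the loss distribution."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    t = 0.0
    for _ in range(steps):
        loss, grad = sample_loss_grad(theta, rng)   # one stochastic loss and its gradient in theta
        tail = float(loss > t)                      # subgradient of (loss - t)_+
        theta -= lr * (tail / (1.0 - alpha)) * grad
        t     -= lr * (1.0 - tail / (1.0 - alpha))
    return theta, t

# Hypothetical usage: CVaR of squared errors for a linear model on synthetic data.
rng = np.random.default_rng(1)
A, b = rng.standard_normal((500, 5)), rng.standard_normal(500)
def sample_loss_grad(theta, rng):
    i = rng.integers(len(b))
    r = A[i] @ theta - b[i]
    return r * r, 2.0 * r * A[i]
theta_hat, t_hat = cvar_sgd(sample_loss_grad, theta0=np.zeros(5))
```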
10:30am-11:00am | Donald Goldfarb (Columbia)
| Practical Kronecker-factored BFGS and L-BFGS methods for training deep neural networks | [abstract] |
In training deep neural network (DNN) models, computing and storing a full BFGS approximation or storing a modest number of (step, change in gradient) vector pairs for use in an L-BFGS implementation is impractical. In our methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices, analogous to the approach in KFAC for approximating the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the BFGS and L-BFGS approximations bounded, both above and below. In tests on autoencoder feed-forward and convolutional neural network models, our methods outperformed KFAC and were competitive with state-of-the-art first-order stochastic methods.
|
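The practical payoff of the Kronecker factorization described above is that applying (or inverting) a curvature block A ⊗ B never requires forming the large matrix. The NumPy snippet below simply checks the underlying identity (A ⊗ B) vec(X) = vec(B X A^T) and the resulting cheap solve; it is a generic linear-algebra illustration, not the authors' BFGS/L-BFGS update, and all matrices here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
vec = lambda M: M.reshape(-1, order="F")        # column-stacking vec(.)

# Identity behind Kronecker-factored preconditioners: (A ⊗ B) vec(X) = vec(B X A^T).
m, n, p, q = 3, 4, 5, 6
A, B, X = rng.standard_normal((m, n)), rng.standard_normal((p, q)), rng.standard_normal((q, n))
assert np.allclose(np.kron(A, B) @ vec(X), vec(B @ X @ A.T))

# Consequence: solving (A ⊗ B) v = vec(G) needs only two small solves, never the mp x mp matrix.
A, B = rng.standard_normal((m, m)), rng.standard_normal((p, p))
G = rng.standard_normal((p, m))
v_cheap = vec(np.linalg.solve(B, G) @ np.linalg.inv(A).T)   # vec(B^{-1} G A^{-T})
v_dense = np.linalg.solve(np.kron(A, B), vec(G))
assert np.allclose(v_cheap, v_dense)
```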
11:00am-11:30am | Spotlight Presentations live Q&A
| 10-minute talks are available (on-demand) on the website. Presenters will join the Zoom session for live Q&A.
- Samuel Horvath, Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization
- Guan-Horng Liu, DDPNOpt: Differential Dynamic Programming Neural Optimizer
- Nicolas Loizou, Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence
- Sharan Vaswani, Adaptive Gradient Methods Converge Faster with Over-Parameterization (and you can do a line-search)
- Sharan Vaswani, How to make your optimizer generalize better
| [papers] |
11:30am-12:00pm | Break
| Coffee break in gather.town.
| |
12:00pm-12:30pm | Suvrit Sra (MIT)
| SGD without replacement: optimal rate analysis and more | [abstract] |
Stochastic gradient descent (SGD) is the workhorse of machine learning. There are two fundamental versions of SGD: (i) those that pick stochastic gradients with replacement, and (ii) those that pick without replacement. Ironically, version (ii) is what is used in practice (across most ML toolkits), while version (i) is what almost all published work analyzes. This mismatch is well-known. It arises because analyzing SGD without replacement involves biased gradients and must cope with the lack of independence between the stochastic gradients used. In this talk, I will present recent progress on analyzing without-replacement SGD, the bulk of which will focus on minimax optimal convergence rates. The rates are obtained without assuming componentwise convexity. I will mention further refinements of the results assuming this additional convexity, which remove drawbacks common to previous works (such as the large number of epochs required).
|
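To make the with/without-replacement distinction in the abstract concrete, here is a minimal least-squares SGD sketch in NumPy in which the only difference between the two versions is how indices are drawn. The problem, step size, and function name are illustrative placeholders of ours, not the analysis from the talk.

```python
import numpy as np

def sgd_least_squares(A, b, lr=1e-3, epochs=30, with_replacement=False, seed=0):
    """SGD on (1/2n)||Ax - b||^2 with the two sampling schemes from the abstract:
    with replacement    -> an i.i.d. uniform index at every step;
    without replacement -> random reshuffling, a fresh permutation each epoch,
    so every sample is used exactly once per epoch."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        order = rng.integers(0, n, size=n) if with_replacement else rng.permutation(n)
        for i in order:
            x -= lr * (A[i] @ x - b[i]) * A[i]    # stochastic gradient of 0.5*(a_i^T x - b_i)^2
    return x

# Illustrative comparison on synthetic data.
rng = np.random.default_rng(1)
A, x_true = rng.standard_normal((1000, 10)), rng.standard_normal(10)
b = A @ x_true + 0.01 * rng.standard_normal(1000)
x_rr  = sgd_least_squares(A, b, with_replacement=False)   # the version used in practice
x_iid = sgd_least_squares(A, b, with_replacement=True)    # the version usually analyzed
```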
12:30pm-2:00pm | Break and Poster Session
| Join for the poster session on gather.town.
Authors will present their work.
| [posters] |
| Session 3 (Moderator: Mark Schmidt, co-moderator Martin Takáč)
|
1:50pm-2:00pm | Organizers
| Opening Remarks |
2:00pm-2:30pm | Ashia Wilson (MSR)
| Stochastic Geodesic Optimization | [abstract] |
Geodesic convexity offers a promising systematic way to handle non-convexity for many problems of interest in statistics and computer science. The focus of this talk will be to describe efforts to extend the basic tools of convex optimization on Euclidean space to the general setting of Riemannian manifolds. We begin by motivating our focus on geodesic optimization with several examples, reviewing the basics of geodesic spaces and several techniques from optimization along the way. Particular attention will be given to optimization techniques which achieve oracle lower bounds for minimizing stochastic functions, namely accelerated methods. We end with a discussion of how one might adapt these techniques to the Riemannian setting.
|
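As a small, hedged companion to the abstract above, the sketch below is plain (non-accelerated) Riemannian gradient descent on the unit sphere, one of the simplest geodesic settings: project the Euclidean gradient onto the tangent space, then move along the geodesic via the exponential map. The function names, step size, and the eigenvector toy example are our own; the accelerated methods discussed in the talk are not reproduced here.

```python
import numpy as np

def sphere_rgd(grad_f, x0, lr=0.1, steps=200):
    """Riemannian gradient descent on the unit sphere S^{d-1}."""
    x = np.asarray(x0, dtype=float).copy()
    x /= np.linalg.norm(x)
    for _ in range(steps):
        g = grad_f(x)
        rg = g - (x @ g) * x                        # Riemannian gradient = tangent-space projection
        v = -lr * rg                                # step in the tangent space at x
        nv = np.linalg.norm(v)
        if nv < 1e-12:
            break
        x = np.cos(nv) * x + np.sin(nv) * (v / nv)  # exponential map Exp_x(v): follow the geodesic
    return x

# Toy usage: minimizing -x^T M x over the sphere recovers a leading eigenvector of M.
M = np.diag([3.0, 1.0, 0.5])
x_top = sphere_rgd(lambda x: -2.0 * M @ x, x0=np.ones(3))
```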
2:30pm-3:00pm | Rachel Ward (UT Austin)
| Concentration for matrix products, and convergence of Oja’s algorithm for streaming PCA | [abstract] |
We present new nonasymptotic growth and concentration bounds for a product of independent random matrices, similar in spirit to concentration for sums of independent random matrices developed in the previous decade. Our matrix product concentration bounds provide a new, direct convergence proof of Oja's algorithm for streaming Principal Component Analysis, and should be useful more broadly for analyzing the convergence of stochastic gradient descent for certain classes of nonconvex optimization problems, including neural networks. This talk covers joint work with Amelia Henriksen, De Huang, Jon Niles-Weed, and Joel Tropp.
|
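Since the abstract centers on Oja's algorithm, here is a minimal streaming-PCA sketch of it in its generic textbook form with QR re-orthonormalization. The step size, block size k, and synthetic data are our placeholders, and the matrix-product concentration analysis from the talk is not reproduced here.

```python
import numpy as np

def oja(stream, d, k=1, lr=1e-3, seed=0):
    """Oja's algorithm: a single pass over streaming samples x_t in R^d,
    maintaining an orthonormal d x k estimate of the top-k principal subspace."""
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for x in stream:
        W += lr * np.outer(x, x @ W)   # rank-one stochastic update with x_t x_t^T
        W, _ = np.linalg.qr(W)         # re-orthonormalize (plain normalization when k = 1)
    return W

# Illustrative usage: data whose dominant principal direction is the first coordinate axis.
rng = np.random.default_rng(1)
samples = rng.standard_normal((20000, 5)) * np.array([3.0, 1.0, 1.0, 1.0, 1.0])
W_hat = oja(samples, d=5, k=1)
```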
3:00pm-3:30pm | Spotlight Presentations live Q&A
| 10-minute talks are available (on-demand) on the website. Presenters will join the Zoom session for live Q&A.
- Zhan Gao, Incremental Greedy BFGS: An Incremental Quasi-Newton Method with Explicit Superlinear Rate
- Wenjie Li, Variance Reduction on Adaptive Stochastic Mirror Descent
- Preetum Nakkiran, Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems
- Denny Wu, When Does Preconditioning Help or Hurt Generalization?
- Chengrun Yang, TenIPS: Inverse Propensity Sampling for Tensor Completion
| [papers] |
3:30pm-4:30pm | Break
| Coffee break in gather.town.
| |
4:30pm-5:00pm | Michael Friedlander (UBC)
| Fast convergence of stochastic subgradient method under interpolation | [abstract] |
This paper studies the behaviour of the stochastic subgradient descent (SSGD) method applied to over-parameterized empirical-risk optimization models that exactly fit the training data. We prove that for models with composite structures often found in neural networks, the interpolation condition implies that the model is effectively smooth at all minimizers, and therefore that SSGD converges at rates normally achievable only for smooth convex problems. We also prove that the fast rates we derive are optimal for any subgradient method applied to convex problems where interpolation holds.
This is joint work with Huang Fang and Zhenan Fan.
|
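To give a concrete, deliberately simplified picture of the interpolation setting above: the toy sketch below runs a stochastic subgradient method on the nonsmooth losses f_i(w) = |a_i^T w - b_i| for a consistent linear system, using the stochastic Polyak step size, which is well defined here because every f_i attains zero under interpolation. This is our own illustration, not the paper's composite models or its step-size rule.

```python
import numpy as np

def ssgd_interpolation(A, b, steps=3000, seed=0):
    """Stochastic subgradient method on f_i(w) = |a_i^T w - b_i| when Aw = b is consistent,
    with the stochastic Polyak step f_i(w) / ||g_i||^2 (valid since min_w f_i(w) = 0)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)
        r = A[i] @ w - b[i]
        if r == 0.0:
            continue
        g = np.sign(r) * A[i]              # a subgradient of |a_i^T w - b_i|
        w -= (abs(r) / (g @ g)) * g        # Polyak step; here this reduces to randomized Kaczmarz
    return w

# Interpolation setup: b comes from a planted model, so every individual loss can reach zero.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 20))
b = A @ rng.standard_normal(20)
w_hat = ssgd_interpolation(A, b)
```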
5:00pm-6:00pm | Break and Poster Session
| Join for the poster session on gather.town.
Authors will present their work.
| [posters] |
| Session 4 (Moderator: Quanquan Gu, co-moderator Mark Schmidt)
|
5:50pm-6:00pm | Organizers
| Opening Remarks |
6:00pm-6:30pm | Deanna Needell (UCLA), co-speaker Hanbaek Lyu (UCLA)
| Online nonnegative matrix factorization for Markovian and other real data | [abstract] |
Online Matrix Factorization (OMF) is a fundamental tool for dictionary learning problems, giving an approximate representation of complex data sets in terms of a reduced number of extracted features. Convergence guarantees for most of the OMF algorithms in the literature assume independence between data matrices, and the case of dependent data streams remains largely unexplored. In this talk, we present results showing that a non-convex generalization of the well-known OMF algorithm for i.i.d. data converges almost surely to the set of critical points of the expected loss function, even when the data matrices are functions of some underlying Markov chain satisfying a mild mixing condition. As the main application, by combining online nonnegative matrix factorization and a recent MCMC algorithm for sampling motifs from networks, we propose a novel framework of Network Dictionary Learning that extracts "network dictionary patches" from a given network in an online manner; these patches encode the main features of the network. We demonstrate this technique on real-world data and discuss recent extensions and variations.
|
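For orientation on the online matrix factorization loop discussed above, here is a much-simplified sketch in the spirit of classical online dictionary learning: a nonnegative least-squares code step per incoming sample, followed by a coordinate-descent dictionary update driven by aggregated sufficient statistics. It assumes an i.i.d.-style stream and our own update details; the Markovian-data guarantees and the Network Dictionary Learning pipeline from the talk are not captured here.

```python
import numpy as np
from scipy.optimize import nnls

def online_nmf(stream, d, r, eps=1e-10, seed=0):
    """Simplified online nonnegative matrix factorization.
    For each sample x in R_+^d: code step h = argmin_{h >= 0} ||x - W h||, then a
    dictionary update using the running statistics A = sum h h^T and B = sum x h^T."""
    rng = np.random.default_rng(seed)
    W = np.abs(rng.standard_normal((d, r)))          # nonnegative dictionary, d x r
    A = np.zeros((r, r))
    B = np.zeros((d, r))
    for x in stream:
        h, _ = nnls(W, x)                            # nonnegative least-squares code
        A += np.outer(h, h)
        B += np.outer(x, h)
        for j in range(r):                           # one pass of block coordinate descent on W
            W[:, j] += (B[:, j] - W @ A[:, j]) / max(A[j, j], eps)
            W[:, j] = np.maximum(W[:, j], 0.0)
            W[:, j] /= max(np.linalg.norm(W[:, j]), 1.0)   # keep each column in the unit ball
    return W

# Illustrative usage on a synthetic nonnegative stream.
rng = np.random.default_rng(1)
stream = np.abs(rng.standard_normal((500, 30)))
W_hat = online_nmf(stream, d=30, r=5)
```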
6:30pm-7:00pm | Spotlight Presentations live Q&A
| 10-minute talks are available (on-demand) on the website. Presenters will join the Zoom session for live Q&A.
- Tolga Ergen, Convex Programs for Global Optimization of Convolutional Neural Networks in Polynomial-Time
- Charles Guille-Escuret, A Study of Condition Numbers for First-Order Optimization
- Lewis Liu, Affine-Invariant Analysis of Frank-Wolfe on Strongly Convex Sets
- Sanae Lotfi, Stochastic Damped L-BFGS with controlled norm of the Hessian approximation
- Dongruo Zhou, On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization
| [papers] |
7:00pm-7:05pm | Organizers
| Closing Remarks |