8:15am-8:50am | Light Breakfast and Coffee |
8:50am-9:00am | Organizers | Opening Remarks |
9:00am-9:30am | John Duchi | Optimality in optimization | [abstract] |
I will discuss what it means for a method to be optimal in optimization. When a method matches a lower bound, does that mean the method is good? How can we develop lower bounds and optimality results that are meaningful? Can theoretical results actually direct progress in what we do? Some of this talk will be speculative, some will cover my and others' results, and some will be polemic.
|
9:30am-10:00am | Mert Gürbüzbalaban | Recent Advances in Stochastic Gradient Methods: From Convex to Non-Convex Optimization and Deep Learning | [abstract] |
For many large-scale optimization and machine learning problems, first-order methods and their accelerated variants based on momentum have been a leading approach for computing low-to-medium accuracy solutions because of their cheap iterations and mild dependence on the problem dimension and data size. Even though the momentum-based accelerated gradient (AG) methods proposed by Nesterov for convex optimization converge provably faster than gradient descent (GD) in the absence of noise, the comparison is no longer clear when the gradients are stochastic, i.e., contain random gradient errors.
In the first part of the talk, we focus on stochastic gradient (SGD) and accelerated stochastic gradient (ASG) methods for convex optimization when the gradient has random errors. We study the trade-offs between convergence rate and robustness to gradient errors in designing a first-order algorithm and provide a systematic way of trading off these two in an optimal fashion. Our results show that stochastic momentum methods can achieve acceleration while being more robust to random gradient errors. Our framework also leads to "optimal" algorithms that can perform better than other state-of-the-art methods in the presence of random gradient noise. We also discuss extensions of our results and algorithms to distributed convex optimization problems.
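The rate-versus-robustness trade-off can be seen in a small simulation. Below is a minimal sketch, not the talk's algorithm, of Nesterov-style accelerated stochastic gradient on a toy quadratic; the step size `lr`, momentum `beta`, and noise scale `sigma` are illustrative values invented for this example:

```python
import numpy as np

def asg_quadratic(A, b, lr=0.05, beta=0.9, sigma=0.1, iters=500, seed=0):
    """Accelerated stochastic gradient on f(x) = 0.5 x'Ax - b'x
    with additive Gaussian gradient noise of scale sigma (toy model)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(len(b))
    y = x.copy()  # extrapolation point
    for _ in range(iters):
        grad = A @ y - b + sigma * rng.standard_normal(len(b))
        x_prev, x = x, y - lr * grad      # gradient step taken at y
        y = x + beta * (x - x_prev)       # momentum extrapolation
    return x

# Toy problem: larger beta speeds noiseless convergence but can
# amplify the stationary error caused by gradient noise.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)
for beta in (0.0, 0.5, 0.9):
    x = asg_quadratic(A, b, beta=beta)
    print(f"beta={beta}: ||x - x*|| = {np.linalg.norm(x - x_star):.4f}")
```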
In the second part of the talk, we focus on SGD for non-convex optimization and deep learning. The gradient noise (GN) in the SGD algorithm is often considered to be Gaussian in the large-data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT (GCLT), which suggests that the GN converges to a heavy-tailed α-stable random variable. Accordingly, we propose to analyze SGD as an SDE driven by a Lévy motion. Such SDEs can incur 'jumps', which force the SDE to transition from narrow minima to wider minima, as proven by existing metastability theory. To validate the α-stable assumption, we conduct extensive experiments on common deep learning architectures and show that in all settings, the GN is highly non-Gaussian and exhibits heavy tails. We further investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets, and obtain guarantees for convergence to local minima under heavy-tailed noise. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.
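To give a flavor of the Gaussian-versus-α-stable distinction, here is a small hedged illustration comparing tail probabilities of Gaussian noise and symmetric α-stable noise via `scipy.stats.levy_stable`; it only demonstrates the heavy-tail phenomenon on synthetic draws, not the authors' measurement procedure on real networks, and the choice α = 1.5 is arbitrary:

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
n = 100_000

gaussian = rng.standard_normal(n)                      # alpha = 2: light tails
stable = levy_stable.rvs(alpha=1.5, beta=0.0, size=n,  # alpha < 2: heavy tails
                         random_state=0)

# Heavy tails show up as far more frequent large deviations.
for thr in (3.0, 5.0, 10.0):
    p_gauss = np.mean(np.abs(gaussian) > thr)
    p_stable = np.mean(np.abs(stable) > thr)
    print(f"P(|X| > {thr:>4}): Gaussian {p_gauss:.2e}  alpha-stable {p_stable:.2e}")
```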
Finally, we focus on the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm for non-convex learning, a popular variant of stochastic gradient with momentum in which a controlled and properly scaled Gaussian noise is added to the unbiased gradient estimates to steer them towards a global minimum. We obtain the first non-asymptotic global convergence guarantees for SGHMC for both empirical risk and population risk optimization problems. Our results demonstrate that momentum-based Monte Carlo algorithms can lead to better iteration complexity as well as better generalization performance compared to known guarantees for the (reversible) standard stochastic gradient Langevin Monte Carlo methods: momentum improves global iteration complexity on a number of problems, further justifying its use in non-convex optimization and non-convex learning.
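A minimal sketch of a standard SGHMC update (friction plus injected Gaussian noise) on a one-dimensional double-well objective appears below; the friction `gamma`, step size `eta`, inverse temperature `inv_temp`, and gradient-noise scale are made-up illustrative values, and this is the textbook discretization rather than the talk's analyzed variant:

```python
import numpy as np

def grad_U(x):
    """Gradient of the non-convex double-well potential U(x) = (x^2 - 1)^2."""
    return 4.0 * x * (x**2 - 1.0)

def sghmc(x0, eta=1e-3, gamma=1.0, inv_temp=5.0, iters=50_000, noise=0.5, seed=0):
    """SGHMC: momentum dynamics with friction gamma and injected Gaussian
    noise of variance 2*gamma*eta/inv_temp; `noise` models the stochastic
    gradient error on top of the exact gradient."""
    rng = np.random.default_rng(seed)
    x, v = x0, 0.0
    samples = np.empty(iters)
    for k in range(iters):
        g = grad_U(x) + noise * rng.standard_normal()  # noisy gradient estimate
        v += (-eta * g - eta * gamma * v
              + np.sqrt(2.0 * gamma * eta / inv_temp) * rng.standard_normal())
        x += v
        samples[k] = x
    return samples

samples = sghmc(x0=2.0)
# The chain visits both wells (minima near -1 and +1) instead of getting
# stuck in one, illustrating the global-exploration behavior.
print("fraction of time with x < 0:", np.mean(samples < 0))
```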
|
10:00am-10:20am | Spotlight Talks for posters | See accepted papers | [posters] |
10:20am-11:15am | POSTER SESSION (including coffee break 10:30am - 11:00am) |
11:15am-11:45am | Mengdi Wang | Sample-Efficient Reinforcement Learning in Feature Space | [abstract] |
In this talk we discuss the statistical efficiency of policy optimization in reinforcement learning. For example, how many observations are necessary and sufficient for learning a good policy? How can one learn to act in a live environment when a good feature representation is given? We approach the policy optimization problem by leveraging minimax duality and the contractive Bellman operator. We introduce feature-based variants of Q-learning algorithms and show that they achieve minimax-optimal sample complexity and near-optimal regret.
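As a rough sketch of the feature-based Q-learning idea, here is classic Q-learning with linear function approximation, Q(s, a) ≈ φ(s, a)ᵀw, on a tiny invented chain environment; all names, the environment, and the one-hot features are hypothetical and this is not necessarily the exact algorithm analyzed in the talk:

```python
import numpy as np

def linear_q_learning(env_step, phi, n_actions, d, episodes=500,
                      horizon=50, lr=0.1, gamma=0.99, eps=0.1, seed=0):
    """Q-learning with linear features: Q(s, a) = phi(s, a) @ w.
    `env_step(s, a)` returns (next_state, reward); `phi(s, a)` returns a
    length-d feature vector."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            # epsilon-greedy action from the current linear Q estimate
            if rng.random() < eps:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax([phi(s, b) @ w for b in range(n_actions)]))
            s_next, r = env_step(s, a)
            # TD target: reward plus discounted max over next-state Q values
            target = r + gamma * max(phi(s_next, b) @ w for b in range(n_actions))
            w += lr * (target - phi(s, a) @ w) * phi(s, a)
            s = s_next
    return w

# Toy 2-state chain: action 1 moves toward state 1, which pays reward 1.
def env_step(s, a):
    s_next = min(s + 1, 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == 1)

def phi(s, a):  # one-hot features over (state, action) pairs
    v = np.zeros(4)
    v[2 * s + a] = 1.0
    return v

w = linear_q_learning(env_step, phi, n_actions=2, d=4)
print("learned Q-values:", w)
```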
|
11:45am-12:00pm | Itay M Safran, Ohad Shamir | How Good is SGD with Random Shuffling? |
12:00pm-12:15pm | Oliver Hinder, Aaron Sidford, Nimit S Sohoni | Near-Optimal Methods for Minimizing Star-Convex Functions and Beyond |
12:15pm-12:30pm | Dmitrii Avdiukhin, Grigory Yaroslavtsev, Chi Jin | Escaping Saddle Points with Inequality Constraints via Noisy Sticky Projected Gradient Descent |
12:30pm-2:00pm | Lunch Break (on your own) |
2:00pm-2:30pm | Swati Gupta | Impact of Bias on Hiring Decisions | [abstract] |
Optimization and machine learning algorithms often work with real-world data that has been generated through complex socio-economic and behavioral processes. This data, however, is noisy, at times unbalanced, and naturally encodes a variety of unquantifiable systemic biases. In this work, we focus on automated hiring decisions. There is ample evidence that there exists implicit bias in evaluations of candidates depending on their demographic groups or the socio-economic opportunities available to them in the past. Motivated by this, we present the first mathematical analysis of the impact of bias in evaluations on the performance of online secretary algorithms. We assume that the candidates belong to disjoint groups, for instance defined by race, gender, nationality, or age. Each candidate has a "true" nonnegative utility $Z \in \mathbb{R}$ and an "observed" utility $Z/\beta$, where $\beta$ is unknown and group-dependent. We show that in the biased setting, group-agnostic algorithms for the online secretary problem are suboptimal, often causing starvation of jobs for groups with $\beta > 1$. We propose multiple group-aware algorithms that achieve demographic parity (i.e., an equal proportion of people hired from each group) by design while obtaining a competitive ratio for maximizing the true utility of hired candidates in a variety of settings.
This talk is based on ongoing work with Jad Salem.
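A minimal, hypothetical sketch of the contrast between a group-agnostic and a group-aware secretary rule on synthetic biased observations follows; the bias model (observed utility $Z/\beta$) is from the abstract, but the classical 1/e rule, the per-group variant, and all parameters are illustrative stand-ins rather than the paper's algorithms:

```python
import numpy as np

rng = np.random.default_rng(0)

def secretary_pick(observed):
    """Classical 1/e rule: observe the first n/e candidates, then hire the
    first one whose observed utility beats everything seen so far."""
    n = len(observed)
    cutoff = int(n / np.e)
    best_seen = max(observed[:cutoff], default=-np.inf)
    for i in range(cutoff, n):
        if observed[i] > best_seen:
            return i
    return n - 1  # forced to take the last candidate

# Two groups with identical true-utility distributions; group B's observed
# utility is deflated by an unknown bias factor beta > 1.
n, beta = 1000, 2.0
group = rng.integers(2, size=n)            # 0 = group A, 1 = group B
true_util = rng.exponential(size=n)
observed = np.where(group == 1, true_util / beta, true_util)

# Group-agnostic: the biased observations make group-B hires very rare.
pick = secretary_pick(list(observed))
print("group-agnostic hire is from group", "B" if group[pick] else "A")

# Group-aware: run the rule separately within each group, so hires are
# proportional to group sizes (demographic parity by design).
for g, name in ((0, "A"), (1, "B")):
    idx = np.flatnonzero(group == g)
    pick_g = idx[secretary_pick(list(observed[idx]))]
    print(f"group-aware hire from group {name}: true utility "
          f"{true_util[pick_g]:.3f}")
```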
|
2:30pm-2:45pm | Alejandro Carderera, Jelena Diakonikolas, Sebastian Pokutta | Breaking the Curse of Dimensionality (Locally) to Accelerate Conditional Gradients |
2:45pm-3:00pm | Geoffrey Negiar, Armin Askari, Martin Jaggi, Fabian Pedregosa | Linearly Convergent Frank-Wolfe without Prior Knowledge |
3:00pm-3:15pm | Marwa El Halabi, Stefanie Jegelka | Minimizing approximately submodular functions |
3:15pm-4:15pm | POSTER SESSION (including coffee break 3:30pm - 4:00pm) |
4:15pm-4:45pm | Yin-Tat Lee | Solving Linear Programs in the Current Matrix Multiplication Time | [abstract] |
We show how to solve linear programs with accuracy $\epsilon$ in time $n^{\omega+o(1)} \log(1/\epsilon)$, where $\omega \approx 2.3728639$ is the current matrix multiplication constant. This hits a natural barrier for solving linear programs, since linear systems are a special case of linear programs and solving linear systems currently requires time $n^{\omega}$.
Joint work with Michael B. Cohen and Zhao Song.
|
4:45pm-5:00pm | Wei Hu, Jeffrey Pennington | Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks |
5:00pm-5:15pm | Jingfeng Wu, Vladimir Braverman, Lin Yang | Obtaining Regularization for Free via Iterate Averaging |
5:15pm-5:16pm | Organizers | Closing Remarks |
5:16pm-6:30pm | POSTER SESSION |