8:15am-8:50am | Light Breakfast and Coffee |
8:50am-9:00am | Organizers | Opening Remarks |
9:00am-9:30am | John Duchi | Optimality in optimization | [abstract] |
I will discuss what it means for a method to be optimal in optimization. When a method matches a lower bound, does that mean the method is good? How can we develop lower bounds and optimality results that are meaningful? Can theoretical results actually direct progress in what we do? Some of this talk will be speculative, some will cover my and others' results, and some will be polemic.
|
9:30am-10:00am | Mert Gürbüzbalaban | Recent Advances in Stochastic Gradient Methods: From Convex to Non-Convex Optimization and Deep Learning | [abstract] |
For many large-scale optimization and machine learning problems, first-order methods and their accelerated variants based on momentum have been a leading approach for computing low-to-medium accuracy solutions because of their cheap iterations and mild dependence on the problem dimension and data size. Even though the momentum-based accelerated gradient (AG) methods proposed by Nesterov for convex optimization converge provably faster than gradient descent (GD) in the absence of noise, the comparison is no longer clear when the gradients are stochastic, i.e., contain random gradient errors.
In the first part of the talk, we focus on stochastic gradient (SGD) and accelerated stochastic gradient (ASG) methods for convex optimization when the gradient has random errors. We study the trade-offs between convergence rate and robustness to gradient errors in designing a first-order algorithm and provide a systematic way of trading off these two in an optimal fashion. Our results show that stochastic momentum methods can achieve acceleration while being more robust to random gradient errors. Our framework also leads to "optimal" algorithms that can perform better than other state-of-the-art methods in the presence of random gradient noise. We also discuss extensions of our results and algorithms to distributed convex optimization problems.
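The rate-versus-robustness trade-off can be seen in a small simulation. Below is a minimal sketch, not the talk's algorithm, of Nesterov-style accelerated stochastic gradient on a toy quadratic; the step size `lr`, momentum `beta`, and noise scale `sigma` are illustrative values invented for this example:

```python
import numpy as np

def asg_quadratic(A, b, lr=0.05, beta=0.9, sigma=0.1, iters=500, seed=0):
    """Accelerated stochastic gradient on f(x) = 0.5 x'Ax - b'x
    with additive Gaussian gradient noise of scale sigma (toy model)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(len(b))
    y = x.copy()  # extrapolation point
    for _ in range(iters):
        grad = A @ y - b + sigma * rng.standard_normal(len(b))
        x_prev, x = x, y - lr * grad      # gradient step taken at y
        y = x + beta * (x - x_prev)       # momentum extrapolation
    return x

# Toy problem: larger beta speeds noiseless convergence but can
# amplify the stationary error caused by gradient noise.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)
for beta in (0.0, 0.5, 0.9):
    x = asg_quadratic(A, b, beta=beta)
    print(f"beta={beta}: ||x - x*|| = {np.linalg.norm(x - x_star):.4f}")
```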
In the second part of the talk, we focus on SGD for non-convex optimization and deep learning. The gradient noise (GN) in the SGD algorithm is often considered to be Gaussian in the large-data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT (GCLT), which suggests that the GN converges to a heavy-tailed α-stable random variable. Accordingly, we propose to analyze SGD as an SDE driven by a Lévy motion. Such SDEs can incur 'jumps', which force the SDE to transition from narrow minima to wider minima, as proven by existing metastability theory. To validate the α-stable assumption, we conduct extensive experiments on common deep learning architectures and show that in all settings, the GN is highly non-Gaussian and exhibits heavy tails. We further investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets, and obtain guarantees for convergence to local minima under heavy-tailed noise. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.
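To give a flavor of the Gaussian-versus-α-stable distinction, here is a small hedged illustration comparing tail probabilities of Gaussian noise and symmetric α-stable noise via `scipy.stats.levy_stable`; it only demonstrates the heavy-tail phenomenon on synthetic draws, not the authors' measurement procedure on real networks, and the choice α = 1.5 is arbitrary:

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
n = 100_000

gaussian = rng.standard_normal(n)                      # alpha = 2: light tails
stable = levy_stable.rvs(alpha=1.5, beta=0.0, size=n,  # alpha < 2: heavy tails
                         random_state=0)

# Heavy tails show up as far more frequent large deviations.
for thr in (3.0, 5.0, 10.0):
    p_gauss = np.mean(np.abs(gaussian) > thr)
    p_stable = np.mean(np.abs(stable) > thr)
    print(f"P(|X| > {thr:>4}): Gaussian {p_gauss:.2e}  alpha-stable {p_stable:.2e}")
```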
Finally, we focus on the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm for non-convex learning, a popular variant of stochastic gradient with momentum in which a controlled and properly scaled Gaussian noise is added to the unbiased gradient estimates to steer them towards a global minimum. We obtain the first non-asymptotic global convergence guarantees for SGHMC for both empirical risk and population risk optimization problems. Our results demonstrate that momentum-based Monte Carlo algorithms can lead to better iteration complexity as well as better generalization performance compared to known guarantees for the (reversible) standard stochastic gradient Langevin Monte Carlo methods: momentum improves global iteration complexity on a number of problems, further justifying its use in non-convex optimization and non-convex learning.
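A minimal sketch of a standard SGHMC update (friction plus injected Gaussian noise) on a one-dimensional double-well objective appears below; the friction `gamma`, step size `eta`, inverse temperature `inv_temp`, and gradient-noise scale are made-up illustrative values, and this is the textbook discretization rather than the talk's analyzed variant:

```python
import numpy as np

def grad_U(x):
    """Gradient of the non-convex double-well potential U(x) = (x^2 - 1)^2."""
    return 4.0 * x * (x**2 - 1.0)

def sghmc(x0, eta=1e-3, gamma=1.0, inv_temp=5.0, iters=50_000, noise=0.5, seed=0):
    """SGHMC: momentum dynamics with friction gamma and injected Gaussian
    noise of variance 2*gamma*eta/inv_temp; `noise` models the stochastic
    gradient error on top of the exact gradient."""
    rng = np.random.default_rng(seed)
    x, v = x0, 0.0
    samples = np.empty(iters)
    for k in range(iters):
        g = grad_U(x) + noise * rng.standard_normal()  # noisy gradient estimate
        v += (-eta * g - eta * gamma * v
              + np.sqrt(2.0 * gamma * eta / inv_temp) * rng.standard_normal())
        x += v
        samples[k] = x
    return samples

samples = sghmc(x0=2.0)
# The chain visits both wells (minima near -1 and +1) instead of getting
# stuck in one, illustrating the global-exploration behavior.
print("fraction of time with x < 0:", np.mean(samples < 0))
```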
|
10:00am-10:20am | Spotlight Talks for posters | See accepted papers | [posters] |
10:20am-11:15am | POSTER SESSION (including coffee break 10:30am - 11:00am) |
11:15am-11:45am | Mengdi Wang | Sample-Efficient Reinforcement Learning in Feature Space | [abstract] |
In this talk we discuss the statistical efficiency of policy optimization in reinforcement learning. For example, how many observations are necessary and sufficient for learning a good policy? How can one learn to act in a live environment when a good feature representation is given? We approach the policy optimization problem by leveraging minimax duality and the contractive Bellman operator. We introduce feature-based variants of Q-learning algorithms and show that they achieve minimax-optimal sample complexity and near-optimal regret.
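As a rough sketch of the feature-based Q-learning idea, here is classic Q-learning with linear function approximation, Q(s, a) ≈ φ(s, a)ᵀw, on a tiny invented chain environment; all names, the environment, and the one-hot features are hypothetical and this is not necessarily the exact algorithm analyzed in the talk:

```python
import numpy as np

def linear_q_learning(env_step, phi, n_actions, d, episodes=500,
                      horizon=50, lr=0.1, gamma=0.99, eps=0.1, seed=0):
    """Q-learning with linear features: Q(s, a) = phi(s, a) @ w.
    `env_step(s, a)` returns (next_state, reward); `phi(s, a)` returns a
    length-d feature vector."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            # epsilon-greedy action from the current linear Q estimate
            if rng.random() < eps:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax([phi(s, b) @ w for b in range(n_actions)]))
            s_next, r = env_step(s, a)
            # TD target: reward plus discounted max over next-state Q values
            target = r + gamma * max(phi(s_next, b) @ w for b in range(n_actions))
            w += lr * (target - phi(s, a) @ w) * phi(s, a)
            s = s_next
    return w

# Toy 2-state chain: action 1 moves toward state 1, which pays reward 1.
def env_step(s, a):
    s_next = min(s + 1, 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == 1)

def phi(s, a):  # one-hot features over (state, action) pairs
    v = np.zeros(4)
    v[2 * s + a] = 1.0
    return v

w = linear_q_learning(env_step, phi, n_actions=2, d=4)
print("learned Q-values:", w)
```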
|
11:45am-12:00pm | Itay M Safran, Ohad Shamir | How Good is SGD with Random Shuffling? |
12:00pm-12:15pm | Oliver Hinder, Aaron Sidford, Nimit S Sohoni | Near-Optimal Methods for Minimizing Star-Convex Functions and Beyond |
12:15pm-12:30pm | Dmitrii Avdiukhin, Grigory Yaroslavtsev, Chi Jin | Escaping Saddle Points with Inequality Constraints via Noisy Sticky Projected Gradient Descent |
12:30pm-2:00pm | Lunch Break (on your own) |
2:00pm-2:30pm | Swati Gupta | Impact of Bias on Hiring Decisions | [abstract] |
Optimization and machine learning algorithms often work with real-world data that has been generated through complex socio-economic and behavioral processes. This data, however, is noisy, at times unbalanced, and naturally encodes a variety of unquantifiable systemic biases. In this work, we focus on automated hiring decisions. There is ample evidence that there exists implicit bias in evaluations of candidates depending on their demographic groups or the socio-economic opportunities available to them in the past. Motivated by this, we present the first mathematical analysis of the impact of bias in evaluations on the performance of online secretary algorithms. We assume that the candidates belong to disjoint groups, for instance defined by race, gender, nationality, or age. Each candidate has a "true" nonnegative utility $Z \in \mathbb{R}$ and an "observed" utility $Z/\beta$, where $\beta$ is unknown and group-dependent. We show that in the biased setting, group-agnostic algorithms for the online secretary problem are suboptimal, often causing starvation of jobs for groups with $\beta > 1$. We propose multiple group-aware algorithms that achieve demographic parity (i.e., an equal proportion of people hired from each group) by design while obtaining a competitive ratio for maximizing the true utility of hired candidates in a variety of settings.
This talk is based on ongoing work with Jad Salem.
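A minimal, hypothetical sketch of the contrast between a group-agnostic and a group-aware secretary rule on synthetic biased observations follows; the bias model (observed utility $Z/\beta$) is from the abstract, but the classical 1/e rule, the per-group variant, and all parameters are illustrative stand-ins rather than the paper's algorithms:

```python
import numpy as np

rng = np.random.default_rng(0)

def secretary_pick(observed):
    """Classical 1/e rule: observe the first n/e candidates, then hire the
    first one whose observed utility beats everything seen so far."""
    n = len(observed)
    cutoff = int(n / np.e)
    best_seen = max(observed[:cutoff], default=-np.inf)
    for i in range(cutoff, n):
        if observed[i] > best_seen:
            return i
    return n - 1  # forced to take the last candidate

# Two groups with identical true-utility distributions; group B's observed
# utility is deflated by an unknown bias factor beta > 1.
n, beta = 1000, 2.0
group = rng.integers(2, size=n)            # 0 = group A, 1 = group B
true_util = rng.exponential(size=n)
observed = np.where(group == 1, true_util / beta, true_util)

# Group-agnostic: the biased observations make group-B hires very rare.
pick = secretary_pick(list(observed))
print("group-agnostic hire is from group", "B" if group[pick] else "A")

# Group-aware: run the rule separately within each group, so hires are
# proportional to group sizes (demographic parity by design).
for g, name in ((0, "A"), (1, "B")):
    idx = np.flatnonzero(group == g)
    pick_g = idx[secretary_pick(list(observed[idx]))]
    print(f"group-aware hire from group {name}: true utility "
          f"{true_util[pick_g]:.3f}")
```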
|
2:30pm-2:45pm | Alejandro Carderera, Jelena Diakonikolas, Sebastian Pokutta | Breaking the Curse of Dimensionality (Locally) to Accelerate Conditional Gradients |
2:45pm-3:00pm | Geoffrey Negiar, Armin Askari, Martin Jaggi, Fabian Pedregosa | Linearly Convergent Frank-Wolfe without Prior Knowledge |
3:00pm-3:15pm | Marwa El Halabi, Stefanie Jegelka | Minimizing approximately submodular functions |
3:15pm-4:15pm | POSTER SESSION (including coffee break 3:30pm - 4:00pm) |
4:15pm-4:45pm | Yin-Tat Lee | Solving Linear Programs in the Current Matrix Multiplication Time | [abstract] |
We show how to solve linear programs with accuracy $\epsilon$ in time $n^{\omega+o(1)} \log(1/\epsilon)$, where $\omega \approx 2.3728639$ is the current matrix multiplication constant. This hits a natural barrier for solving linear programs, since linear systems are a special case of linear programs and solving linear systems currently requires time $n^{\omega}$.
Joint work with Michael B. Cohen and Zhao Song.
|
4:45pm-5:00pm | Wei Hu, Jeffrey Pennington | Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks |
5:00pm-5:15pm | Jingfeng Wu, Vladimir Braverman, Lin Yang | Obtaining Regularization for Free via Iterate Averaging |
5:15pm-5:16pm | Organizers | Closing Remarks |
5:16pm-6:30pm | POSTER SESSION |