6:15am-6:55am | Welcome social event in gather.town (lounge).
| Log in and say hello! |
| Session 1 (Moderator: Sebastian Stich, co-moderator: Martin Takáč)
|
6:55am-7:00am | Organizers
| Opening Remarks to Session 1 |
7:00am-7:30am | Shai Shalev-Shwartz (Hebrew University of Jerusalem) (video stream)
| Deep Learning: Success, Failure, and the Border between them | [abstract] |
Deep learning has revolutionized computer vision, speech recognition, natural language understanding, and more. However, from the theoretical perspective we know that training neural networks is computationally hard, and one can construct distributions on which deep learning fails. The talk will propose several parameters of distributions that can move them from being easy to train to being hard to train.
|
7:30am-8:00am | Martin Jaggi (EPFL) (video stream)
| Learning with Strange Gradients | [abstract] |
Gradient methods form the foundation of current machine learning. A vast literature covers the use of stochastic gradients as simple unbiased estimators of the full gradient of our objective. In this talk, we discuss four applications motivated by practical machine learning in which this key assumption is violated, and show new ways to cope with gradients that are only loosely related to the original objective. We demonstrate that algorithms with rigorous convergence guarantees can still be obtained in such settings, for
1) federated learning on heterogeneous data,
2) personalized collaborative learning,
3) masked training of neural networks with partial gradients,
4) learning with malicious participants, in the sense of Byzantine robust training.
|
8:00am-8:30am | Contributed talks (live stream)
|
Oral (10min):
- Futong Liu (EPFL): Understanding Memorization from the Perspective of Optimization via Efficient Influence Estimation
Spotlights (5min):
- Abdurakhmon Sadiev (MIPT): Decentralized Personalized Federated Learning: Lower Bounds and Optimal Algorithm for All Personalization Modes
- Frederik Benzing (ETH Zurich): Fast, Exact Subsampled Natural Gradients and First-Order KFAC
- Simon Roburin (valeo.ai / imagine ENPC): Spherical Perspective on Learning with Normalization Layers
| [papers] |
8:30am-9:55am | Poster Session 1 and Break
| Join the poster session on gather.town (room 1).
Authors will present their work.
| [posters] |
| Session 2 (Moderator: Katya Scheinberg, co-moderator: Courtney Paquette)
|
9:55am-10:00am | Organizers
| Opening Remarks |
10:00am-10:30am | Coralia Cartis (University of Oxford) (video stream)
| The global optimization of functions with low effective dimension - better than worst-case? | [abstract] |
We investigate the unconstrained and constrained global optimization of functions with low effective dimension, which are constant along an (unknown) linear subspace and only vary over the effective (complement) subspace. Known also as multi-ridge or active-subspace objectives, these functions appear frequently in applications, such as those involving complex engineering simulations, overly parametrised models, and more. We aim to implicitly explore the intrinsic low dimensionality of the constrained/unconstrained landscape using (feasible) random embeddings, in order to understand and improve the scalability of algorithms for the global optimization of these special-structure problems. We show that for these functions, the convergence and complexity of a random embedding algorithmic framework can potentially be exponentially better than if no special structure were present and, under certain assumptions on the problems, are independent of the ambient dimension. Our numerical experiments on functions with low effective dimension illustrate the improved scalability of the framework when coupled with state-of-the-art global, and even local, optimization solvers for the subproblems. Our analysis uses tools from random matrix theory, stochastic optimization and conic integral geometry.
Authors: Coralia Cartis, Estelle Massart and Adilet Otemissov (Mathematical Institute, University of Oxford and Alan Turing Institute, London)
Papers:
https://arxiv.org/pdf/2003.09673.pdf,
https://arxiv.org/pdf/2009.10446.pdf,
https://arxiv.org/pdf/2107.12102.pdf
|
10:30am-11:00am | Cristóbal Guzmán (University of Twente) (video stream)
| Non-Euclidean Differentially Private Stochastic Convex Optimization | [abstract] |
Ensuring privacy of users' data in machine learning models has become a crucial requirement in multiple domains. In this respect, differential privacy (DP) is the gold standard, due to its general and rigorous privacy guarantees, as well as its high composability. For the particular case of stochastic convex optimization (SCO), recent efforts have established optimal rates for the excess risk under differential privacy in Euclidean setups. These bounds suffer a polynomial degradation of accuracy with respect to the dimension, which limits their applicability in high-dimensional settings. In this talk, I will present nearly dimension-independent rates on the excess risk for DP-SCO in the $\ell_1$ setup, as well as an investigation of more general $\ell_p$ setups, where $1\leq p\leq \infty$. Based on joint work with Raef Bassily and Anupama Nandi.
|
11:00am-11:30am | Contributed talks (live stream)
|
Oral (10min):
- Junchi Li (UC Berkeley): On the convergence of stochastic extragradient for bilinear games using restarted iteration averaging
Spotlights (5min):
- Jeffery Kline (American Family Insurance): Farkas' Theorem of the Alternative for Prior Knowledge in Deep Networks
- Lyle Kim (Rice University): Acceleration and Stability of the Stochastic Proximal Point Algorithm
- Pascal Esser (Technical University of Munich): Towards Modeling and Resolving Singular Parameter Spaces using Stratifolds
| [papers] |
11:30am-12:55pm | Break
| Join the gather.town lounge or the poster rooms
(room 1)
(room 2) for discussions with other participants.
| [posters] |
| Session 3 (Moderator: Oliver Hinder, co-moderator: Courtney Paquette)
|
12:55pm-1:00pm | Organizers
| Opening Remarks |
1:00pm-1:30pm | Damek Davis (Cornell University) (video stream)
| Avoiding saddle points in nonsmooth optimization | [abstract] |
We introduce a geometrically transparent strict saddle property for nonsmooth functions. When present, this property guarantees that simple randomly initialized proximal algorithms on weakly convex problems converge only to local minimizers. We argue that the strict saddle property may be a realistic assumption in applications since it provably holds for generic semi-algebraic optimization problems. Finally, we close the talk with an extension of the result to "perturbed" subgradient methods.
|
1:30pm-2:00pm | Jelena Diakonikolas (University of Wisconsin-Madison) (video stream)
| Faster Empirical Risk Minimization | [abstract] |
Empirical Risk Minimization (ERM) problems are central to machine learning, and their efficient optimization has been studied from different perspectives, often taking advantage of the finite-sum structure present in typical problem formulations. In particular, tight oracle complexity bounds have been established under fairly general assumptions about the loss functions. In this talk, I will present a rather surprising and general result that takes advantage of the separability of nonsmooth convex loss functions with efficiently computable proximal operators, such as the hinge loss and the sum of absolute errors, to obtain an algorithm that exhibits significantly lower complexity than what is predicted by the lower bounds for general nonsmooth convex losses. I will then talk about how this result can be further improved for problems that can be stated in a form close to a linear program, and discuss how these results relate to robust empirical risk minimization. The talk is based on joint results with Chaobing Song, Eric Lin, and Stephen Wright.
|
2:00pm-2:30pm | Contributed talks (live stream)
|
Oral (10min):
- Wenhao Zhan (Princeton University): Policy Mirror Descent for Regularized RL: A Generalized Framework with Linear Convergence
Spotlights (5min):
- Akhilesh Soni (University of Wisconsin-Madison): Integer Programming Approaches To Subspace Clustering With Missing Data
- Boyue Li (Carnegie Mellon University): DESTRESS: Computation-Optimal and Communication-Efficient Decentralized Nonconvex Finite-Sum Optimization
- Grigory Malinovsky (KAUST): Better Linear Rates for SGD with Data Shuffling
| [papers] |
2:30pm-3:55pm | Poster Session 2 and Break
| Join the poster session on gather.town (room 2).
Authors will present their work.
| [posters] |
| Session 4 (Moderator: Quanquan Gu, co-moderator: Oliver Hinder)
|
3:55pm-4:00pm | Organizers
| Opening Remarks |
4:00pm-4:30pm | Yinyu Ye (Stanford University) (video stream)
| Online Learning via Linear Programming | [abstract] |
A natural optimization model that formulates many online resource allocation and decision-making problems is online linear programming (OLP), where the constraint matrix, along with the objective coefficients and decision variables, is revealed and decided column by column sequentially. We review near-optimal algorithms and theories for solving this surprisingly general class of online problems under the assumption of random order of arrival and/or i.i.d. distributions of the input data. Then we present a few recent applications of the model/algorithm, including a fast online algorithm as a pre-solver for solving large-scale offline (binary) LPs, an interior-point online algorithm to address “fairness” in resource allocation, and a provable logarithmic regret bound for the Bandits with Knapsacks (BwK) problem.
|
4:30pm-5:00pm | Michael Mahoney (UC Berkeley) (video stream)
| Putting Randomized Matrix Algorithms in LAPACK, and Connections with Second-order Stochastic Optimization | [abstract] |
LAPACK (Linear Algebra PACKage) is a widely used, high-quality software library for numerical linear algebra that provides routines for solving systems of linear equations, linear least squares, eigenvalue problems, singular value decomposition, and matrix factorizations such as the LU, QR, Cholesky and Schur decompositions. Randomized Numerical Linear Algebra (RandNLA) is an interdisciplinary research area that exploits randomization as a computational resource to develop improved algorithms for large-scale linear algebra problems. In addition to providing some of the best linear algebra algorithms (in terms of worst-case theory, numerical implementations, and non-trivial implicit statistical properties and machine learning implementations), RandNLA techniques are the basis for the best stochastic second-order optimization algorithms (such as SubSampled Newton's methods, Iterative Hessian Sketch, and Newton-LESS). The time has come to put RandNLA methods into the next generation of LAPACK, and we have begun to do that. We will present our high-level plan to introduce RandNLA algorithms into LAPACK. The RandLAPACK library will implement state-of-the-art randomized algorithms for problems such as low-rank approximation and least squares, but its full scope will be larger than linear algebra per se. It will include certain higher-level primitives in optimization and machine learning that require judicious use of RandNLA. We will describe the building blocks and the modular design that will help RandLAPACK evolve with the field of RandNLA (as well as the needs of machine learning, scientific computing, and other users) over time, together with connections to and implications for optimization in machine learning. Joint work with Riley Murray, Jim Demmel, and others.
|
5:00pm-5:30pm | Contributed talks (live stream)
|
Oral (10min):
- Agnieszka Słowik (University of Cambridge): On the Relation between Distributionally Robust Optimization and Data Curation
Spotlights (5min):
- Jacques Chen (University of British Columbia): Heavy-tailed noise does not explain the gap between SGD and Adam on Transformers
- Neha Wadia (UC Berkeley): Optimization with Adaptive Step Size Selection from a Dynamical Systems Perspective
- Difan Zou (UC Los Angeles): Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization
| [papers] |
| Closing Remarks
|
5:30pm-5:35pm | Organizers
| Closing Remarks |