WF-02: Beyond First-order Optimization Methods for Machine Learning – Part I
Stream: Advances in mathematical optimization for machine learning
Chair(s): Fred Roosta, Albert Berahas
Sequential Quadratic Optimization for Nonlinear Equality Constrained Stochastic Optimization
Albert Berahas, Frank E. Curtis
Stochastic gradient and related methods for solving unconstrained stochastic optimization problems have been studied extensively in recent years. However, settings with general nonlinear constraints have received less attention, and many of the proposed methods resort to using penalty or Lagrangian methods, which are often not the most effective strategies. In this work, we propose and analyze stochastic optimization methods based on the sequential quadratic optimization methodology. We discuss advantages and disadvantages of our approaches. Collaborators: F. E. Curtis, D. Robinson & B. Zhou.
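For context, a sequential quadratic optimization (SQP) method computes each step from a quadratic subproblem; in the stochastic setting the exact objective gradient is replaced by a stochastic estimate. A generic sketch in standard notation (not the authors' specific formulation) is:

\begin{aligned}
\min_{d \in \mathbb{R}^n} \quad & g_k^\top d + \tfrac{1}{2}\, d^\top H_k\, d \\
\text{s.t.} \quad & c(x_k) + J(x_k)\, d = 0,
\end{aligned}

where $g_k$ is a (possibly stochastic) estimate of $\nabla f(x_k)$, $c$ collects the equality constraints with Jacobian $J(x_k)$, and $H_k$ is a symmetric matrix, positive definite on the null space of $J(x_k)$, so the subproblem has a unique solution.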
Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence
We propose a stochastic variant of the classical Polyak step-size (Polyak, 1987) commonly used in the subgradient method. Although computing the Polyak step-size requires knowledge of the optimal function values, this information is readily available for typical modern machine learning applications. Consequently, the proposed stochastic Polyak step-size (SPS) is an attractive choice for setting the learning rate for stochastic gradient descent (SGD). We provide theoretical convergence guarantees for SGD with SPS in different settings, including strongly convex, convex and non-convex functions.
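The SPS rule sets the learning rate from the gap between the current sampled loss and its optimal value. A minimal sketch follows, assuming an interpolated least-squares problem so each per-sample optimum $f_i^* = 0$; the function name, the damping constant `c`, and the cap `gamma_max` are illustrative, not the paper's exact algorithm:

```python
import numpy as np

def sgd_sps(X, y, steps=1000, c=0.5, gamma_max=10.0, seed=0):
    """SGD with a stochastic Polyak step-size (SPS) on least squares.

    For an interpolated least-squares problem the per-sample optimal
    value f_i* is 0, so the SPS rule reduces to
        gamma_k = f_i(w_k) / (c * ||grad f_i(w_k)||^2),
    capped at gamma_max.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)
        r = X[i] @ w - y[i]           # residual of sampled example
        f_i = 0.5 * r ** 2            # per-sample loss, f_i* = 0
        g = r * X[i]                  # per-sample gradient
        sq = g @ g
        if sq == 0.0:
            continue                  # zero gradient: skip the update
        w -= min(gamma_max, f_i / (c * sq)) * g
    return w
```

Note that with `c = 0.5` on this problem the update coincides with randomized Kaczmarz, `w -= (r / ||x_i||^2) * x_i`, which illustrates how SPS adapts the step to each sample without any tuned schedule.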
Systematic Second-order Methods for Training, Designing, and Deploying Neural Networks
Finding the right neural network model and training it for a new task requires considerable expertise and extensive computational resources. Moreover, the process often includes ad hoc rules that do not generalize across application domains. These issues have limited the applicability and usefulness of DNN models, especially for new learning tasks. The problem is becoming more acute as datasets and models grow larger, which increases training time and quickly makes random/brute-force search approaches untenable. In large part, this situation is due to the first-order stochastic gradient
Distributed Learning of Deep Neural Networks using Independent Subnet Training
We propose a new approach to distributed neural network learning, called independent subnet training (IST). In IST, a neural network is decomposed into a set of subnetworks of the same depth as the original network, each of which is trained locally, before the various subnets are exchanged and the process is repeated. IST has many advantages over standard data-parallel approaches. We show experimentally that IST results in training times that are much lower than those of data-parallel approaches to distributed learning.
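The decomposition described above can be sketched in a single process for a one-hidden-layer ReLU network: each "worker" receives a disjoint slice of the hidden units (columns of `W1`, entries of `w2`), trains only that subnetwork, and the slices are then reassembled and repartitioned. All names, hyperparameters, and the dropout-style `1/workers` rescaling at inference are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def ist_train(X, y, hidden=16, workers=4, rounds=20,
              local_steps=50, lr=0.1, seed=0):
    """Minimal single-process sketch of independent subnet training (IST).

    Each round, the hidden units are partitioned among `workers`; each
    worker trains only its slice of (W1, w2) with full-batch gradient
    descent on squared loss, then the slices are reassembled and the
    partition is redrawn.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.5, size=(d, hidden))
    w2 = rng.normal(scale=0.5, size=hidden)
    for _ in range(rounds):
        perm = rng.permutation(hidden)
        for part in np.array_split(perm, workers):   # one slice per worker
            for _ in range(local_steps):             # local training steps
                h = np.maximum(X @ W1[:, part], 0.0) # subnet forward pass
                err = h @ w2[part] - y
                grad_w2 = h.T @ err / n
                grad_h = np.outer(err, w2[part]) * (h > 0)
                grad_W1 = X.T @ grad_h / n
                W1[:, part] -= lr * grad_W1
                w2[part] -= lr * grad_w2
    return W1, w2

def ist_predict(X, W1, w2, workers):
    """Full-model inference. Since each subnet was trained to fit the
    targets on its own, the summed output is rescaled by 1/workers."""
    return np.maximum(X @ W1, 0.0) @ w2 / workers
```

Because each subnet only touches its own parameter slice and never communicates during local steps, the slices could run on separate machines with communication only at the reassembly points, which is the source of the claimed savings over data-parallel training.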