On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

Hao Yu; Rong Jin; Sen Yang

On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

Hao Yu, Rong Jin, Sen Yang

Proceedings of the 36th International Conference on Machine Learning, PMLR 97:7184-7193, 2019.

Abstract

Recent developments on large-scale distributed machine learning applications, e.g., deep neural networks, benefit enormously from the advances in distributed non-convex optimization techniques, e.g., distributed Stochastic Gradient Descent (SGD). A series of recent works study the linear speedup property of distributed SGD variants with reduced communication. The linear speedup property enables us to scale out the computing capability by adding more computing nodes into our system. The reduced communication complexity is desirable since communication overhead is often the performance bottleneck in distributed systems. Recently, momentum methods are more and more widely adopted by practitioners to train machine learning models since they can often converge faster and generalize better. However, it remains unclear whether any distributed momentum SGD possesses the same linear speedup property as distributed SGD and has reduced communication complexity. This paper fills the gap by considering a distributed communication efficient momentum SGD method and proving its linear speedup property.

Cite this Paper

BibTeX


@InProceedings{pmlr-v97-yu19d,
  title = 	 {On the Linear Speedup Analysis of Communication Efficient Momentum {SGD} for Distributed Non-Convex Optimization},
  author =       {Yu, Hao and Jin, Rong and Yang, Sen},
  booktitle = 	 {Proceedings of the 36th International Conference on Machine Learning},
  pages = 	 {7184--7193},
  year = 	 {2019},
  editor = 	 {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume = 	 {97},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {09--15 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v97/yu19d/yu19d.pdf},
  url = 	 {https://proceedings.mlr.press/v97/yu19d.html},
  abstract = 	 {Recent developments on large-scale distributed machine learning applications, e.g., deep neural networks, benefit enormously from the advances in distributed non-convex optimization techniques, e.g., distributed Stochastic Gradient Descent (SGD). A series of recent works study the linear speedup property of distributed SGD variants with reduced communication. The linear speedup property enables us to scale out the computing capability by adding more computing nodes into our system. The reduced communication complexity is desirable since communication overhead is often the performance bottleneck in distributed systems. Recently, momentum methods are more and more widely adopted by practitioners to train machine learning models since they can often converge faster and generalize better. However, it remains unclear whether any distributed momentum SGD possesses the same linear speedup property as distributed SGD and has reduced communication complexity. This paper fills the gap by considering a distributed communication efficient momentum SGD method and proving its linear speedup property.}
}

Endnote

%0 Conference Paper
%T On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization
%A Hao Yu
%A Rong Jin
%A Sen Yang
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov	
%F pmlr-v97-yu19d
%I PMLR
%P 7184--7193
%U https://proceedings.mlr.press/v97/yu19d.html
%V 97
%X Recent developments on large-scale distributed machine learning applications, e.g., deep neural networks, benefit enormously from the advances in distributed non-convex optimization techniques, e.g., distributed Stochastic Gradient Descent (SGD). A series of recent works study the linear speedup property of distributed SGD variants with reduced communication. The linear speedup property enables us to scale out the computing capability by adding more computing nodes into our system. The reduced communication complexity is desirable since communication overhead is often the performance bottleneck in distributed systems. Recently, momentum methods are more and more widely adopted by practitioners to train machine learning models since they can often converge faster and generalize better. However, it remains unclear whether any distributed momentum SGD possesses the same linear speedup property as distributed SGD and has reduced communication complexity. This paper fills the gap by considering a distributed communication efficient momentum SGD method and proving its linear speedup property.

APA


Yu, H., Jin, R. & Yang, S.. (2019). On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:7184-7193 Available from https://proceedings.mlr.press/v97/yu19d.html.

On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

Abstract

Cite this Paper

Related Material