
An empirical study of cyclical learning rate on neural machine translation

Published online by Cambridge University Press: 09 February 2022

Weixuan Wang, Choon Meng Lee, Jianfeng Liu, Talha Colakoglu and Wei Peng*

Affiliation: Artificial Intelligence Application Research Center, Huawei Technologies, Co., Ltd., Shenzhen, People’s Republic of China

*Corresponding author. E-mail: peng.wei1@huawei.com

Abstract

In training deep learning networks, the optimizer and its associated learning rate schedule are often chosen without much thought or with minimal tuning, even though they are crucial for fast convergence to a good-quality minimum of the loss function that also generalizes well on the test dataset. Drawing inspiration from the successful application of cyclical learning rate policies to computer vision tasks, we explore how cyclical learning rates can be applied to train transformer-based neural networks for neural machine translation. From our carefully designed experiments, we show that the choice of optimizer and the associated cyclical learning rate policy can have a significant impact on performance. In addition, we establish guidelines for applying cyclical learning rates to neural machine translation tasks.
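The abstract does not spell out the schedule itself, so the sketch below is a rough illustration only: it implements the standard triangular cyclical learning rate policy as a plain Python function. The learning-rate bounds, the step size, and the PyTorch-style set_lr helper are assumptions made for illustration, not values or code taken from the paper.

import math

# Minimal sketch of a triangular cyclical learning rate (CLR) schedule.
# The bounds (min_lr, max_lr) and step_size are illustrative assumptions,
# not values reported in the paper.
def triangular_clr(step, min_lr=1e-5, max_lr=1e-3, step_size=2000):
    """Learning rate at a given training step: it rises linearly from
    min_lr to max_lr over step_size steps, falls back to min_lr over the
    next step_size steps, and the cycle then repeats."""
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return min_lr + (max_lr - min_lr) * max(0.0, 1.0 - x)

# Hypothetical helper: apply the schedule at each training step to a
# PyTorch-style optimizer, which exposes its parameter groups via param_groups.
def set_lr(optimizer, step):
    lr = triangular_clr(step)
    for group in optimizer.param_groups:
        group["lr"] = lr

For PyTorch users, torch.optim.lr_scheduler.CyclicLR provides an equivalent built-in triangular schedule.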

Type: Article

Copyright: © The Author(s), 2022. Published by Cambridge University Press

