
An empirical study of cyclical learning rate on neural machine translation

Published online by Cambridge University Press: 09 February 2022

Weixuan Wang, Choon Meng Lee, Jianfeng Liu, Talha Colakoglu and Wei Peng*

Affiliation: Artificial Intelligence Application Research Center, Huawei Technologies, Co., Ltd., Shenzhen, People’s Republic of China

*Corresponding author. E-mail: peng.wei1@huawei.com

Abstract

In training deep learning networks, the optimizer and its associated learning rate schedule are often chosen without much thought or with minimal tuning, even though they are crucial for fast convergence to a good-quality minimum of the loss function that also generalizes well on the test dataset. Drawing inspiration from the successful application of cyclical learning rate policies to computer vision tasks, we explore how cyclical learning rates can be applied to train transformer-based neural networks for neural machine translation. From our carefully designed experiments, we show that the choice of optimizer and the associated cyclical learning rate policy can have a significant impact on performance. In addition, we establish guidelines for applying cyclical learning rates to neural machine translation tasks.
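The abstract does not spell out the schedule itself, so the sketch below is a rough illustration only: it implements the standard triangular cyclical learning rate policy as a plain Python function. The learning-rate bounds, the step size, and the PyTorch-style set_lr helper are assumptions made for illustration, not values or code taken from the paper.

import math

# Minimal sketch of a triangular cyclical learning rate (CLR) schedule.
# The bounds (min_lr, max_lr) and step_size are illustrative assumptions,
# not values reported in the paper.
def triangular_clr(step, min_lr=1e-5, max_lr=1e-3, step_size=2000):
    """Learning rate at a given training step: it rises linearly from
    min_lr to max_lr over step_size steps, falls back to min_lr over the
    next step_size steps, and the cycle then repeats."""
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return min_lr + (max_lr - min_lr) * max(0.0, 1.0 - x)

# Hypothetical helper: apply the schedule at each training step to a
# PyTorch-style optimizer, which exposes its parameter groups via param_groups.
def set_lr(optimizer, step):
    lr = triangular_clr(step)
    for group in optimizer.param_groups:
        group["lr"] = lr

For PyTorch users, torch.optim.lr_scheduler.CyclicLR provides an equivalent built-in triangular schedule.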

Type: Article

Copyright: © The Author(s), 2022. Published by Cambridge University Press

