Jointly learning sentence embeddings and syntax with unsupervised Tree-LSTMs

Jean Maillard; Stephen Clark; Dani Yogatama

doi:10.1017/S1351324919000184

Published online by Cambridge University Press: 31 July 2019

Affiliations:
Jean Maillard*: Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
Stephen Clark: DeepMind, London, UK
Dani Yogatama: DeepMind, London, UK
*Corresponding author. Email: jean@maillard.it

Abstract

We present two studies on neural network architectures that learn to represent sentences by composing their words according to automatically induced binary trees, without ever being shown a correct parse tree. We use Tree-LSTMs (tree-structured Long Short-Term Memory networks) as our composition function, applied along a tree structure found by a differentiable natural language chart parser. The models simultaneously optimise both the composition function and the parser, thus eliminating the need for the externally provided parse trees that Tree-LSTMs normally require. They can therefore be seen as tree-based recurrent neural networks that are unsupervised with respect to the parse trees. Because they are fully differentiable, the models are easily trained with an off-the-shelf gradient descent method and backpropagation.

In the first part of this paper, we introduce a model based on the CKY chart parser, and evaluate its downstream performance on a natural language inference task and a reverse dictionary task. Further, we show how its performance can be improved with an attention mechanism which fully exploits the parse chart, by attending over all possible subspans of the sentence. We find that our approach is competitive against similar models of comparable size and outperforms Tree-LSTMs that use trees produced by a parser.
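The abstract does not spell out how the chart is made differentiable, so the following is only a minimal sketch of one way a CKY-style chart could be relaxed. It assumes that each span's vector is a softmax-weighted mixture of the candidates obtained from every possible split point. The names `compose`, `soft_cky` and `score_vector` are hypothetical, and `compose` is a simple stand-in rather than a Tree-LSTM.

```python
import math

def compose(left, right):
    # Stand-in composition function (an assumption, NOT a Tree-LSTM):
    # elementwise tanh of the sum of the two child vectors.
    return [math.tanh(l + r) for l, r in zip(left, right)]

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cky(word_vectors, score_vector):
    """Fill a CKY-style chart bottom-up.

    chart[(i, j)] holds a vector for the span covering words i..j inclusive.
    Every split point k yields one candidate; candidates are scored by a dot
    product with score_vector and mixed with softmax weights, so the whole
    computation stays differentiable with respect to its inputs.
    """
    n = len(word_vectors)
    dim = len(score_vector)
    chart = {(i, i): list(v) for i, v in enumerate(word_vectors)}
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            candidates = [compose(chart[(i, k)], chart[(k + 1, j)])
                          for k in range(i, j)]
            scores = [sum(c * u for c, u in zip(cand, score_vector))
                      for cand in candidates]
            weights = softmax(scores)
            chart[(i, j)] = [sum(w * cand[d] for w, cand in zip(weights, candidates))
                             for d in range(dim)]
    return chart

# Toy three-word sentence with two-dimensional word vectors (made-up values).
words = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
chart = soft_cky(words, score_vector=[1.0, -1.0])
sentence_vector = chart[(0, len(words) - 1)]
```

Because the chart keeps a vector for every subspan rather than only the top cell, an attention mechanism of the kind mentioned above could in principle attend over all of `chart`'s entries. That step is not shown here.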

Finally, we present an alternative architecture based on a shift-reduce parser. We perform an analysis of the trees induced by both our models, to investigate whether they are consistent with each other and across re-runs, and whether they resemble the trees produced by a standard parser.
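As an illustration of how agreement between two induced binary trees might be quantified, the sketch below computes an unlabelled bracket F1 over their spans. The abstract does not state which measure the authors use, so this metric, the nested-tuple tree encoding and the function names `spans` and `bracket_f1` are all assumptions made for the example.

```python
def spans(tree, start=0):
    """Return (set of (start, end) spans, end index) for a binary tree.

    Trees are nested 2-tuples with word strings at the leaves; spans use
    half-open word offsets, and single-word spans are not counted.
    """
    if isinstance(tree, str):
        return set(), start + 1
    left, right = tree
    left_spans, mid = spans(left, start)
    right_spans, end = spans(right, mid)
    return left_spans | right_spans | {(start, end)}, end

def bracket_f1(reference, predicted):
    # Unlabelled bracket F1 between two trees over the same sentence.
    ref, _ = spans(reference)
    pred, _ = spans(predicted)
    if not ref and not pred:
        return 1.0
    overlap = len(ref & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Three bracketings of the same four-word sentence.
left_branching = ((("the", "cat"), "sat"), "down")
right_branching = ("the", ("cat", ("sat", "down")))
balanced = (("the", "cat"), ("sat", "down"))

f1_left_right = bracket_f1(left_branching, right_branching)
f1_left_balanced = bracket_f1(left_branching, balanced)
```

Note that the whole-sentence span is shared by any two trees over the same sentence, so this score never reaches zero in practice. Some evaluations exclude that trivial span.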

Keywords

sentence representations; representation learning; latent tree learning; grammar induction

Information

Type: Article
Information: Natural Language Engineering, Volume 25, Issue 4, July 2019, pp. 433–449

DOI: https://doi.org/10.1017/S1351324919000184
Copyright: © Cambridge University Press 2019
