To save content items to your account,
please confirm that you agree to abide by our usage policies.
If this is the first time you use this feature, you will be asked to authorise Cambridge Core to connect with your account.
Find out more about saving content to .
To save content items to your Kindle, first ensure no-reply@cambridge.org
is added to your Approved Personal Document E-mail List under your Personal Document Settings
on the Manage Your Content and Devices page of your Amazon account. Then enter the ‘name’ part
of your Kindle email address below.
Find out more about saving to your Kindle.
Note you can select to save to either the @free.kindle.com or @kindle.com variations.
‘@free.kindle.com’ emails are free but can only be saved to your device when it is connected to wi-fi.
‘@kindle.com’ emails can be delivered even when you are not connected to wi-fi, but note that service fees apply.
In this chapter, we provide an implementation of the multilayer neural network described in Chapter 5, along with several of the best practices discussed in Chapter 6. Still keeping things fairly simple, our network will consist of two fully connected layers: a hidden layer and an output layer. Between these layers, we will include dropout and a nonlinearity. Further, we make use of two PyTorch classes: a Dataset and a DataLoader. The advantage of using these classes is that they make several things easy, including data shuffling and batching. Last, since the classifier’s architecture has become more complex, for optimization we transition from stochastic gradient descent to the Adam optimizer to take advantage of its additional features such as momentum and L2 regularization.
This chapter motivates the need for a book that covers both theoretical and practical aspects of deep learning for natural language processing. We summarize the content of the book, as well as aspects that are not within scope, and current limitations of deep learning in general.
In Chapters 10 and 12, we focused on two common usages of recurrent neural networks and transformer networks: acceptors and transducers. In this chapter, we discuss a third architecture for both recurrent neural networks and transformer networks: encoder-decoder methods. We introduce three encoder-decoder architectures, which enable important NLP applications such as machine translation. In particular, we discuss the sequence-to-sequence method of Sutskever et al. (2014), which couples an encoder long short-term memory with a decoder long short-term memory. We follow this method with the approach of Bahdanau et al. (2015), which extends the previous decoder with an attention component, which produces a different encoding of the source text for each decoded word. Last, we introduce the complete encoder-decoder transformer network, which relies on three attention mechanisms: one within the encoder (which we discussed in Chapter 12), a similar one that operates over decoded words, and, importantly, an attention component that connects the input words with the decoded ones.
As mentioned in Chapter 8, the distributional similarity algorithms discussed there conflate all senses of a word into a single numerical representation (or embedding). For example, the word bank receives a single representation, regardless of its financial (e.g., as in the bank gives out loans) or geological (e.g., bank of the river) sense. This chapter introduces a solution for this limitation in the form of a new neural architecture called transformer networks, which learns contextualized embeddings of words, which, as the name indicates, change depending on the context in which the words appear. That is, the word bank receives a different numerical representation for each of its instances in the two texts above because the contexts in which they occur are different. We also discuss several architectural choices that enabled the tremendous success of transformer networks: self attention, multiple heads, stacking of multiple layers, and subword tokenization, as well as how transformers can be pretrained on large amounts of data through through masked language modeling and next-sentence prediction.
Up to this point, we have only discussed neural approaches for text classification (e.g., review and news classification) that handle the text as a bag of words. That is, we aggregate the words either by representing them as explicit features in a feature vector or by averaging their numerical representations (i.e., embeddings). Although this strategy completely ignores the order in which words occur in a sentence, it has been repeatedly shown to be a good solution for many practical natural language processing applications that are driven by text classification. Nevertheless, for many natural language processing tasks such as part-of-speech tagging, we need to capture the word-order information more explicitly. Sequence models capture exactly this scenario, where classification decisions must be made using not only the current information but also the context in which it appears. In particular, we discuss several types of recurrent neural networks, including stacked (or deep) recurrent neural networks, bidirectional recurrent neural networks, and long short-term memory networks. Last, we introduced conditional random fields, which extend recurrent neural networks with an extra layer that explicitly models transition probabilities between two cells.