In the previous chapter, we introduced word embeddings, which are real-valued vectors that encode semantic representations of words. We discussed how to learn them and how they capture semantic information that makes them useful for downstream tasks. In this chapter, we show how to use word embeddings that have been pretrained using a variant of the algorithm discussed in the previous chapter. We show how to load them, explore some of their characteristics, and apply them to a text classification task.
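As a concrete illustration of loading and probing pretrained embeddings, the short sketch below uses the gensim downloader and a GloVe model; the specific library and model name are assumptions made for this example, not necessarily the chapter's exact setup.

# Illustrative sketch: load pretrained word embeddings and inspect them.
# The gensim downloader and the "glove-wiki-gigaword-100" model are
# assumptions for this example, not necessarily the chapter's choices.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # 100-dimensional GloVe vectors

# Words with similar meanings should receive similar vectors.
print(vectors.most_similar("great", topn=5))
print(vectors.similarity("great", "fantastic"))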
In this chapter, we describe several common applications (including the ones we touched on before) and multiple possible neural approaches for each. We focus on simple neural approaches that work well and should be familiar to anybody beginning research in natural language processing or interested in deploying robust strategies in industry. In particular, we describe the implementation of the following applications: text classification, part-of-speech tagging, named entity recognition, syntactic dependency parsing, relation extraction, question answering, and machine translation.
The previous chapter introduced feed-forward neural networks and demonstrated that, theoretically, implementing the training procedure for an arbitrary feed-forward neural network is relatively simple. Unfortunately, neural networks trained this way will suffer from several problems, such as instability of the training process – that is, slow convergence due to parameters jumping around a good minimum – and overfitting. In this chapter, we will describe several practical solutions that mitigate these problems. In particular, we discuss minibatching, multiple optimization algorithms, other activation and cost functions, regularization, dropout, temporal averaging, and parameter initialization and normalization.
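As a rough sketch of how several of these remedies come together in code, the PyTorch fragment below combines minibatching, dropout, and the Adam optimizer; the network shape, data, and hyperparameters are placeholders rather than the chapter's actual configuration.

# Illustrative sketch: minibatching, dropout, and the Adam optimizer in PyTorch.
# Shapes, data, and hyperparameters are placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(1000, 20)                       # toy features
y = torch.randint(0, 2, (1000,))                # toy binary labels
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # minibatching

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),                          # dropout regularization
    nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # adaptive optimizer
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()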
As mentioned in the previous chapter, the perceptron does not perform smooth updates during training, which may slow down learning or cause it to miss good solutions entirely in real-world situations. In this chapter, we will discuss logistic regression, a machine learning algorithm that elegantly addresses this problem. We also extend the vanilla logistic regression, which was designed for binary classification, to handle multiclass classification. Through logistic regression, we introduce the concept of a cost function (i.e., the function we aim to minimize during training) and gradient descent, the algorithm that implements this minimization procedure.
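For concreteness, a standard formulation of binary logistic regression and its gradient descent update is sketched below in generic notation, which may differ from the notation used in the chapter.

% Binary logistic regression in generic notation.
\hat{y} = \sigma(\mathbf{w} \cdot \mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
% Cross-entropy cost over m training examples.
C(\mathbf{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
% Gradient descent update with learning rate \alpha.
\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial C}{\partial \mathbf{w}}, \qquad b \leftarrow b - \alpha \frac{\partial C}{\partial b}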
In the previous chapters, we have discussed the theory behind the perceptron and logistic regression, including mathematical explanations of how and why they are able to learn from examples. In this chapter, we will transition from math to code. Specifically, we will discuss how to implement these models in the Python programming language. All the code that we will introduce throughout this book is available in this GitHub repository as well: https://github.com/clulab/gentlenlp. To get a better understanding of how these algorithms work under the hood, we will start by implementing them from scratch. However, as the book progresses, we will introduce some of the popular tools and libraries that make Python the language of choice for machine learning – for example, PyTorch and Hugging Face’s transformers. The code for all the examples in the book is provided in the form of Jupyter notebooks. Fragments of these notebooks will be presented in the implementation chapters so that the reader has the whole picture just by reading the book.
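To give a flavor of the from-scratch style, here is a generic perceptron training sketch; it is an illustration of the approach, not code taken from the clulab/gentlenlp repository.

# Generic from-scratch perceptron sketch; an illustration of the style,
# not code from the clulab/gentlenlp repository.
import numpy as np

def train_perceptron(X, y, epochs=10):
    """X: (n_examples, n_features) array; y: labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            prediction = 1 if np.dot(w, x_i) + b > 0 else 0
            if prediction != y_i:            # update only on mistakes
                update = y_i - prediction    # +1 or -1
                w = w + update * x_i
                b = b + update
    return w, b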
The previous chapter was our first exposure to recurrent neural networks, which included intuitions for why they are useful for natural language processing, various architectures, and training algorithms. In this chapter, we will put them to use in order to implement a common sequence modeling task. In particular, we implement a Spanish part-of-speech tagger using a bidirectional long short-term memory and a set of pretrained, static word embeddings. Through this process, we also introduce several new PyTorch features such as the pad_sequence, pack_padded_sequence, and pad_packed_sequence functions, which allow us to work more efficiently with variable-length sequences in recurrent neural networks.
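The sketch below shows how these three functions typically fit together around a bidirectional LSTM; the tensor shapes and data are made up for illustration and do not correspond to the tagger built in the chapter.

# Illustration of pad_sequence / pack_padded_sequence / pad_packed_sequence
# around a bidirectional LSTM; shapes and data are made up.
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three "sentences" of different lengths, each token a 50-dimensional embedding.
sentences = [torch.randn(5, 50), torch.randn(3, 50), torch.randn(7, 50)]
lengths = torch.tensor([len(s) for s in sentences])

padded = pad_sequence(sentences, batch_first=True)            # shape (3, 7, 50)
packed = pack_padded_sequence(padded, lengths,
                              batch_first=True, enforce_sorted=False)

lstm = nn.LSTM(input_size=50, hidden_size=32,
               batch_first=True, bidirectional=True)
packed_output, _ = lstm(packed)

# Back to a regular padded tensor: (3, 7, 64) because of the two directions.
output, output_lengths = pad_packed_sequence(packed_output, batch_first=True)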
All the algorithms we covered so far rely on handcrafted features that must be designed and implemented by the machine learning developer. This is problematic for two reasons. First, designing such features can be a complicated endeavor. Second, most words in any language tend to be very infrequent. In our context, this means that most words are very sparse, and our text classification algorithm trained on word-occurrence features may generalize poorly. For example, if the training data for a review classification dataset contains the word great but not the word fantastic, a learning algorithm trained on these data will not be able to properly handle reviews containing the latter word, even though there is a clear semantic similarity between the two. In this chapter, we will begin to address this limitation. In particular, we will discuss methods that learn numerical representations of words that capture some semantic knowledge. Under these representations, similar words such as great and fantastic will have similar forms, which will improve the generalization capability of our machine learning algorithms.
One of the key advantages of transformer networks is the ability to take a model that was pretrained over vast quantities of text and fine-tune it for the task at hand. Intuitively, this strategy allows transformer networks to achieve higher performance on smaller datasets by relying on statistics acquired at scale in an unsupervised way (e.g., through the masked language model training objective). To this end, in this chapter, we will use the Hugging Face library, which has a rich repository of datasets and pretrained models, as well as helper methods and classes that make it easy to target downstream tasks. Using pretrained transformer encoders, we will implement the two tasks that served as use cases in the previous chapters: text classification and part-of-speech tagging.
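A minimal sketch of this pattern is shown below; the checkpoint name, labels, and texts are illustrative assumptions rather than the chapter's exact setup.

# Minimal fine-tuning sketch with the Hugging Face transformers library.
# The checkpoint name, labels, and texts are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

texts = ["a wonderful, heartfelt film", "a tedious mess"]
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=labels)       # returns loss and logits

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
outputs.loss.backward()                        # fine-tune the whole encoder
optimizer.step()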
In this chapter, we implement a machine translation application as an example of an encoder-decoder task. In particular, we build on pretrained encoder-decoder transformer models, which exist in the Hugging Face library for a wide variety of language pairs. We first show how to use one of these models out-of-the-box to perform translation for one of the language pairs it has been exposed to during pretraining: English to Romanian. Afterward, we fine-tune the model to a new language combination that it has not seen before: Romanian to English. In both use cases, we use the T5 encoder-decoder model, which has been pretrained for several tasks, including machine translation.
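Out of the box, that first use case looks roughly like the following sketch; the t5-small checkpoint and the task prefix are assumptions for illustration.

# Sketch of out-of-the-box translation with a pretrained T5 checkpoint.
# The t5-small checkpoint and the task prefix are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

text = "translate English to Romanian: The book is on the table."
inputs = tokenizer(text, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(generated[0], skip_special_tokens=True))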
So far we have explored classifiers with decision boundaries that are linear, or, in the case of the multiclass logistic regression, a combination of linear segments. In this chapter, we will expand what we have learned so far to classifiers that are capable of learning nonlinear decision boundaries. The classifiers that we will discuss here are called feed-forward neural networks, and are a generalization of both logistic regression and the perceptron. Despite the more complicated structures presented, we show that the key building blocks remain the same: the network is trained by minimizing a cost function. This minimization is implemented with backpropagation, which adapts the gradient descent algorithm introduced in the previous chapter to multilayer neural networks.
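The core recipe can be sketched in a few lines of PyTorch with made-up data; this is purely an illustration of minimizing a cost function with backpropagation, not the chapter's own material.

# Illustration of a nonlinear feed-forward classifier trained by minimizing
# a cost function with backpropagation; the data are made up.
import torch
from torch import nn

X = torch.randn(200, 2)
y = ((X[:, 0] * X[:, 1]) > 0).long()           # a nonlinearly separable pattern

model = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()                # the cost function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # gradient descent

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()                            # backpropagation computes gradients
    optimizer.step()                           # gradient descent update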
This chapter covers the perceptron, the simplest neural network architecture. In general, neural networks are machine learning architectures loosely inspired by the structure of biological brains. The perceptron is the simplest example of such architectures: it contains a single artificial neuron. The perceptron will form the building block for the more complicated architectures discussed later in the book. However, rather than starting directly with the discussion of this algorithm, we will start with something simpler: a children’s book and some fundamental observations about machine learning. From these, we will formalize our first machine learning algorithm, the perceptron.