Emerging trends: General fine-tuning (gft)

Abstract

This paper describes gft (general fine-tuning), a little language for deep nets, introduced at an ACL-2022 tutorial. gft makes deep nets accessible to a broad audience, including non-programmers. It is standard practice in many fields to use statistics packages such as R: one should not need to know how to program in order to fit a regression or classification model and to use the model to make predictions for novel inputs. With gft, fine-tuning and inference are similar to fit and predict in regression and classification. gft demystifies deep nets; no one would suggest that regression-like methods are "intelligent."


Introduction
This paper introduces gft (general fine-tuning),1 a little language2 for deep nets, introduced at an ACL-2022 tutorial.3 There are two parts to the tutorial:

1. Glass is half-full: make deep nets accessible to a mass audience, including non-programmers.
2. Glass is half-empty: given the successes of the first part on so many benchmarks, one might come away with the mistaken impression that deep nets are more successful than they are. There are always opportunities for improvement.

We advocate an interdisciplinary approach that combines the successes of the first part with decades of work on representation in AI and centuries of work in linguistics and philosophy.
This paper will use gft to discuss the first part. It is amazing how much can be done with so little. gft demystifies deep nets; no one would suggest that regression-like methods are "intelligent." There are two main functions in gft: fit and predict. Fit takes a pretrained model, f_pre, as input and fine-tunes it on data to produce a post-trained model, f_post, as output. Predict takes a novel input, x, and predicts ŷ = f(x). Hopefully, the prediction, ŷ, will be close to the gold label, y.
We discussed deep nets in two previous articles in this journal (Church et al., 2021a, b). gft makes it possible to do much of that in short (1-line) programs. 1-line programs are easier to read, write, understand, and port from one environment to another than examples on hubs (typically hundreds of lines of Python, PyTorch,4 TensorFlow,5 Jax,6 and/or PaddlePaddle7). gft is designed to make much of this functionality accessible to non-programmers. Just as one does not need to know Python and machine learning to use an off-the-shelf regression package, so too, deep nets should not require much (if any) programming skill.
Following the advice in "Crossing the Chasm" (Moore and McKenna, 1999), the long-term success of deep nets will depend on finding ways to cross the chasm from the current set of loyal users (so-called early adopters) to a much larger set of users. Early adopters may be willing to invest in machine learning and programming, but most users have other priorities.
The gft interpreter is based on examples from hubs. 8,9 Hubs encourage users to modify hundreds of lines of Python code as necessary if they want to change models, data sets, and/or tasks. gft generalizes the examples so users can do much of that in a single line of gft code (with comparable performance).
gft supports most of the arguments in the examples on the hubs, so it is possible to tune hyperparameters such as batch size, learning rate, and stopping rules. Tuning matters for SOTA (state-of-the-art) chasing, though default settings are recommended for most users, who prefer results that are easy to replicate and reasonably competitive.
There is already too much SOTA-chasing in the literature (Church and Kordoni, 2022). Users should avoid wasting time on hyperparameter tuning unless they are about to ship a model to a large number of users for an application where small improvements in performance are worth the effort.

gft Cheatsheet
gft supports the following functions:10

1. fit (also known as fine-tuning): f_pre + data → f_post
2. predict (also known as inference): ŷ = f(x), where x is an input from stdin or from a data set
3. eval: f + data → score (produce a single score for a data set split, as opposed to a prediction, ŷ, for each input row, x, in the split)
4. summary: search hubs for popular data sets, models, and tasks, and provide snippets. Popularity is estimated from metrics on downloads.
5. cat_data: output a data set on stdout

There are four major arguments:

1. -data: a data set on a hub, or a local file
2. -model: a model on a hub, or a local file
3. -task: for example, classify, regress11
4. -eqn (e.g., classify: y ~ x1 + x2), where a task appears before the colon, and variables refer to columns in the data set
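Putting the cheatsheet together, a complete gft call is a single command line. The sketch below is not one of the paper's listings; the exact flag spellings and the model and data set names are assumptions based on the argument list above.

```shell
# Hypothetical gft one-liner (sketch; flag spellings assumed):
# fine-tune a pretrained BERT model on the emotion data set
gft_fit -model H:bert-base-cased \
        -data  H:emotion         \
        -eqn   'classify: label ~ text'
```

The equation reads like an R formula: the task (classify) appears before the colon, and label and text name columns in the data set.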

The standard recipe
Following Howard and Ruder (2018) and Devlin et al. (2019), it has become standard practice to use the 3-step recipe in Table 1. We prefer the terms fit and predict to fine-tuning and inference. The proposed terminology has a long tradition in statistics and predates the relatively recent work on deep nets.12 Fit and predict were discussed in two previous Emerging Trends articles in this journal (Church et al., 2021a, b). This paper unifies much of that discussion into a single github (see footnote 1) with hundreds of examples of short (1-line) programs.13
gft makes it easy to use models and data sets on hubs: HuggingFace14 and PaddleHub/PaddleNLP.15 The hubs are large (∼40k models and ∼4k data sets) and growing quickly (∼3x/year). The challenge is to make these amazing resources more accessible to as many users as possible. The target audience has diverse interests and skills. It should not be necessary for them to know much (if any) programming to join in on the fun.
The 40k models include both pretrained and post-trained models, f_pre and f_post. gft provides tools that make it easy to find popular models, as well as popular data sets. We recommend that users make as much use as possible of these resources and resist the temptation to pretrain their own models from scratch, for reasons discussed in Appendix A.1.

An example of fit and predict in R
As mentioned above, gft is inspired by glm (generalized linear models) (Guisan et al., 2002) in R.16 Listing 1 illustrates the use of fit and predict in R. The R environment provides a number of standard data sets such as cars, a data table with two columns, speed and dist, shown as black points in Figure 1. The model, g, fits dist as a quadratic function of speed. Predictions from this model are shown in red in Figure 1.
In addition to history, there are two more reasons to prefer the terms fit and predict. First, the proposed terminology, as mentioned above, demystifies deep nets; no one would suggest that regression-like methods are "intelligent." Second, the proposed terminology is intended to discourage work on foundation models, f_pre. As will be discussed in Appendix A.1, the term foundation models was introduced to encourage work on f_pre (Bommasani et al., 2021), but we believe it is a mistake for academics to compete with industry on tasks that require large investments, and more logistics and systems work than creative contributions to computational linguistics research.
The summary function in R is applied to both the data table cars and the model g. The R summary function can be applied to almost any object and provides a useful description of its argument.

An example of fit (aka fine-tuning)
Listing 2 shows an example of gft_fit, which is similar to Listing 1 in a number of ways. Fit takes a pretrained model, f_pre, and uses a data set to output a post-trained model, f_post. In Listing 2, f_pre is a BERT model, and the data table is the emotion data set on HuggingFace. The model in Listing 1, g, is analogous to f_post = $outdir in Listing 2. The variables in both equations, line 7 of Listing 1 and line 3 of Listing 2, refer to columns in the relevant data table.
Many gft programs take four arguments; in Listing 2, for example, -data specifies the use of the emotion data set on HuggingFace (https://huggingface.co/datasets/emotion).17
Listings 3 and 4 show two examples of gft_predict. Predict takes a novel input, x, and applies the model, f, to produce a prediction, ŷ = f(x). The default model (for the classification task) performs sentiment analysis; other models output other labels. In particular, the f in Listing 4 outputs emotion classes: anger, fear, joy, love, sadness, surprise. To see the set of classes for a model, we recommend the use of gft_summary, as illustrated in Listing 5. gft_summary outputs the set of classes, among other things.
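Since the listings themselves are not reproduced in this text, here is a hedged sketch of what gft_predict calls might look like (hypothetical flag spellings; the paper's actual Listings 3 and 4 may differ):

```shell
# Hypothetical gft_predict calls (sketch; not the paper's exact listings)
echo 'I love you' | gft_predict -task classify                   # default model: sentiment
echo 'I love you' | gft_predict -model C:$outdir -task classify  # emotion classes
```

The second call assumes $outdir holds the post-trained model, f_post, produced by the gft_fit example in Listing 2.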
Some more classifications of x = "I love you" are shown in Tables 2 and 3, using a number of different models from HuggingFace. Most of these models agree that x is positive, though many of them classify x as fake news and some classify x as spam. One can use other models to classify x in many other ways, such as offensive or not and hate speech or not.
Many of these classifiers were trained on corpora that may not be appropriate for this task. In particular, we really should not apply a Spanish classifier to English inputs, but mistakes like this are likely to happen given how easy they are to make.
Most of the models on the hubs were created by the community. The hubs do not vet models for quality. The best models on the hubs are very good, though maybe not state of the art (SOTA); we rarely see results as good as those reported on PWC19 and leaderboards.20 Some models produce poor results, or no results at all (using the standard mechanisms in gft). The most popular models (in terms of downloads) often produce competitive results, though the most popular models rarely produce the best results.

Embarrassment of riches
As mentioned at the beginning of this section, there are a huge number of models and data sets on the hubs. There are currently 40k models and 4k data sets, and these numbers are increasing rapidly (∼3x/year). How do we find the good stuff? And how do we use it?
The hubs provide a number of useful tools to answer these questions. There are GUI interfaces (as illustrated by footnotes 17 and 18), as well as APIs. gft_summary uses the APIs to provide much of this functionality, as illustrated in Listing 6, which finds the five most popular data sets (or models) that contain the substring "emotion." Popularity is estimated from downloads.
Listing 6. Example of gft_summary as a search engine.
Listing 7 finds the most popular data sets and models overall by searching for data sets and models that contain the null string.
Listing 7. Example of gft_summary with the null string as a query.
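A hedged sketch of such queries (the flag spellings are assumptions, not the paper's actual Listings 6 and 7):

```shell
# Hypothetical gft_summary queries (sketch; flag spellings assumed)
gft_summary -data H:emotion     # snippets for data sets matching "emotion"
gft_summary -model H:bert       # popular models matching the substring "bert"
gft_summary -model H:           # null string: most popular models overall
```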
There are a few common naming conventions. Models containing the string "base" are likely to be base models, f_pre (also known as pretrained models or foundation models). Models containing the string "distil" are likely to be distilled (compressed) models. Models containing the names of popular tasks such as "squad" and GLUE subtasks are likely to be post-trained models, f_post.
gft_summary can also be used to summarize data sets, models, tasks, etc. As mentioned in Section 3.1, these summaries are modeled after the summary function in R, which takes many different types of objects and produces useful descriptions.

Portability across hubs and frameworks

3.5.1. Portability → stability over time
The code in the listings above takes a dependency on HuggingFace, a small start-up company that has done very well recently. There are also dependencies on a number of Python packages that are constantly changing. We have seen many hardware and software platforms come and go. Many companies do well for a while, but success rarely lasts long (decades). Deep nets will be more likely to survive the test of time if they are written in high-level languages, such as gft, that can be ported from one environment to another as necessary.
Consider the example of operating systems. Unix survived the test of time better than alternatives such as VMS 21 because Unix was designed to port easily across suppliers. There was a time when Unix was mostly running on DEC machines, 22 and then there was a time when Unix was mostly running on Sun computers. 23 These days, Unix has moved on to other platforms. If programs are written in a relatively stable higher level environment like Unix (and gft), then old programs are more likely to continue to work for decades, despite instabilities at lower levels in the hardware and software stacks.
Too many deep nets take dependencies on Python packages that are updated very frequently (almost daily), often in incompatible ways. Many of these resources are supported by companies that could go out of business or could decide to sunset support at any time. Given recent events, there is a risk that support could also be cut off by sanctions and other instabilities in international relations. Because of these realities, gft is designed to make it easy to port from one hub to another.

H is for HuggingFace and P is for PaddleNLP/PaddleHub
Listing 9 is similar to Listing 2, though dependencies on one company (H → HuggingFace) are replaced by dependencies on another company (P → Baidu's PaddleNLP/PaddleHub). gft supports mixing and matching models and data sets from different suppliers: "H:" uses resources from HuggingFace, "P:" uses resources from PaddleNLP/PaddleHub, and "C:" uses custom resources on the local file system.
Listing 9. An example of gft_fit using P for PaddleNLP/PaddleHub.
Note that most of the models on HuggingFace are based on PyTorch, whereas models on PaddleNLP and PaddleHub use a different framework called PaddlePaddle. gft hides much of this complexity.
Listing 9 uses the chnsenticorp data set,24 which is different from the emotion data set in Listing 2. The chnsenticorp data set specifies a sentiment analysis task in Chinese, whereas the emotion data set specifies an emotion classification task in English.
Listing 9 uses the ernie-tiny model (Su et al., 2021), a compressed version of an ERNIE model. ERNIE models are similar to BERT models, though ERNIE models may be more appropriate for Chinese applications. Distillation (Hinton et al., 2015) is a popular method for compressing models. Compressed models tend to trade off a little bit of performance (accuracy) in order to save a substantial amount of space and time when making predictions at inference time (Ganesh et al., 2021). Distillation can be important for commercial applications.

Data sets
As mentioned in Section 3.4, there are currently more than 4000 data sets on the hubs. We have already mentioned the emotion data set. Many data sets provide splits for training, validation, and test, though different data sets may name these splits differently. Each split provides a data table with columns and rows. The emotion data set, for example, contains two columns, named text and label. As can be seen in HuggingFace's data set viewer, 25 each row specifies a text field (e.g., "i didnt feel humiliated") and a label field (e.g., "sadness"). We will refer to the label field as a gold label. The task is to predict the gold labels.
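The split and column structure can be inspected with the cat_data function from the cheatsheet; a hedged sketch of such an invocation (the flag names here are assumptions):

```shell
# Hypothetical: dump the first rows of a split on stdout (flag names assumed)
gft_cat_data -data H:emotion -split validation | head -3
```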
SQuAD26,27 (Rajpurkar et al., 2016, 2018) is a popular data set for question answering. This data set has five columns: id, title, context, question, and answers. The answers are substrings of the context, which makes this task considerably easier than the general case of Q&A (question answering), where the answer could be almost anything and need not be mentioned in any of the other columns.
In Section 2.1 of Church and Kordoni (2022), there is a discussion of constructed queries like those in SQuAD. The TREC QA track28 started with "constructed" questions in 1999, but quickly moved to "real" questions from query logs for subsequent TREC QA tracks (2000-2007), because constructed questions are too easy for systems and unrealistic (Voorhees, 2001).

Examples of -data and -eqn
Short (1-line) gft programs can fit (fine-tune) many benchmarks, as illustrated in Table 4. Table 4 shows -data and -eqn arguments for a number of popular benchmarks.
-data arguments start with a supplier, for example, H, P, or C. After the colon, there can be one or two substrings, delimited by a comma. For example, for the cola subtask of GLUE, the -data argument is H:glue,cola.
-eqn arguments consist of a task, plus a formula expressed in terms of columns in the data set. See Table 5 for examples of some tasks; for a more comprehensive list of tasks, see footnote 11. For QA (question answering), for example, the task is classify_spans, which classifies the beginning and end of spans (substrings); it assumes the answer is a substring of the right-hand side (rhs), following the conventions of the SQuAD task, as discussed in Section 4.
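For concreteness, the QA row might expand to something like the following sketch (the exact -eqn spelling is an assumption; the column names are the SQuAD columns discussed in Section 4):

```shell
# Hypothetical gft_fit for SQuAD-style QA (sketch; eqn spelling assumed)
gft_fit -model H:bert-base-cased \
        -data  H:squad           \
        -eqn   'classify_spans: answers ~ question + context'
```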

More examples and more tasks
As mentioned in footnote 13, there are hundreds of examples of gft in the github: fit,32 predict,33 summary,34 and eval.35 A few examples have already been discussed in Sections 3.2 and 3.3. Many more will be discussed in the next few subsections:

1. Predict (Section 5.1): token-classification, fill-mask, MT, ASR, etc.
2. Input from data sets (as opposed to stdin) (Section 5.2).
3. gft_predict → gft_eval (Section 5.3).

Predict
A few examples of predict were shown in Listing 3. The gft documentation has many more examples of predict. 36

Token classification
Some examples of token classification with PaddleNLP are shown in Listing 11. Many of these tasks have been in the literature for a long time. Fill-mask is similar to the cloze task (Taylor, 1953), as illustrated in Listing 12.
Text generation is one of the more popular use cases for GPT-3, though Listing 13 uses a different model.
Listing 11. Example of token classification with PaddleNLP.
Listing 12. Example of fill-mask (also known as cloze task).
Listing 13. Example of text generation.
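As with the other listings, the bodies of Listings 11-13 are not reproduced here; a hedged sketch of the fill-mask and text-generation cases (hypothetical flags and inputs, not the paper's exact listings) might look like:

```shell
# Hypothetical gft_predict calls for fill-mask and text generation (sketch)
echo 'Paris is the [MASK] of France.' | gft_predict -task fill-mask
echo 'Once upon a time'               | gft_predict -task text-generation
```

The [MASK] convention follows BERT-style models; other models use other mask tokens.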

MT, ASR and more
There are translation models for many language pairs, as illustrated in Listing 14.37
Listing 14. Example of machine translation (MT).

Listing 15. Example of automatic speech recognition (ASR).
Listing 16. Example of image classification.

Input from data sets (as opposed to stdin)
Listing 17 shows an example of input from a data set.
Listing 17. Example of input from data set (as opposed to stdin).
Listing 18. gft_eval outputs a single score for a data set, as opposed to gft_predict, which outputs a prediction for each row.

Debugging, confusion matrices, and error analysis
In addition to producing a score with gft_eval, suppose we want to do some deep dives to look at particular errors. The code in Listing 19 will create a confusion matrix based on the validation split.
Listing 19. Code to create a confusion matrix.

gft_predict outputs TSV (tab-separated values) with four columns:

1. Input, x
2. Gold label, y
3. Predicted label, ŷ
4. Score

The cut statement on line 4 of Listing 19 selects y and ŷ. The sort and uniq statements count the number of confusions, producing the confusion matrix shown in Table 6. Standard Unix tools such as grep (or AWK) can be used to find more details for particular confusions.
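The counting step is plain Unix and can be demonstrated without gft. The sketch below substitutes a few made-up rows for gft_predict's TSV output; the cut/sort/uniq pipeline is the same idea as in Listing 19.

```shell
# Simulate gft_predict's 4-column TSV output (input, gold, predicted, score),
# then count (gold, predicted) pairs as in Listing 19.
{
  printf 'i am happy\tjoy\tjoy\t0.99\n'
  printf 'i am scared\tfear\tsadness\t0.61\n'
  printf 'i am thrilled\tjoy\tjoy\t0.97\n'
} |
  cut -f2,3 |   # keep columns 2 (gold) and 3 (predicted)
  sort |        # group identical (gold, predicted) pairs together
  uniq -c       # count each pair: one cell of the confusion matrix
```

Each output line is a count followed by a (gold, predicted) pair; off-diagonal pairs such as (fear, sadness) are the confusions worth grepping for.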

Vectors on the left hand side (LHS)
With regression and classification, the left-hand side (lhs) of the equation is typically a scalar, but gft has been generalized so that the lhs can also be a point in a vector space, as shown in Listing 20. This example fine-tunes BERT with the NRC-VAD lexicon38 (Mohammad, 2018). Words are assigned to points in R^3, Valence, Arousal, and Dominance, based on VAD norms in psychology (Osgood et al., 1957).
Listing 20 is our first example of a custom data set. There are three CSV files on the local filesystem:

1. train split: $gft/datasets/VAD/VAD.train
2. validation split: $gft/datasets/VAD/VAD.val
3. test split: $gft/datasets/VAD/VAD.test

The three CSV files start with a header row that specifies the names of the columns. The variables in the equation refer to these columns in the CSV files.
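For concreteness, the head of a file such as $gft/datasets/VAD/VAD.train might look like the following sketch (the column names and all numeric values are illustrative assumptions, not actual NRC-VAD norms):

```
V,A,D,text
0.93,0.52,0.67,love
0.05,0.77,0.23,terror
```

An equation along the lines of regress: V + A + D ~ text (spelling assumed) would then put a three-dimensional point on the lhs and the text column on the rhs.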
In addition to illustrating the use of custom data sets, Listing 20 introduces two new features. First, we normally train models on corpora, but Listing 20 trains a model on a lexicon, the NRC-VAD lexicon. Second, regression usually takes scalar values on the left-hand side (lhs), but in this case, the lhs is a point in R 3 .
Listing 20 produces a post-trained model, f_post. A few results with f_post are shown in Table 7, which shows predictions, ŷ, for some inputs, x. These predictions can be compared with the gold labels, y, the VAD scores from NRC-VAD (last three columns).
Although the model was trained on words (lemmas in the NRC-VAD Lexicon), the inputs, x, in Table 7 include a number of words, phrases, and texts, many of which are (by construction) not in the NRC-VAD Lexicon. That is, f_post can be applied to any input text (up to 512 subword units). Table 7 shows predictions, V̂, Â, and D̂, as well as gold values, V, A, and D. When the input, x, is not in the NRC-VAD Lexicon, the gold value, y, is NA (not available). Since NRC-VAD is based on lemmas, NAs are to be expected for inflected forms, OOV (out-of-vocabulary) words such as unlovable, MWEs (multiword expressions) such as ugly duckling, sentences, and documents.

Conclusions
This paper proposed gft, a little language for fine-tuning pretrained base (foundation) models. Little languages make it easier for a broader audience (including non-programmers) to join in on the fun. Just as most users of regression do not need to know how to solve the regression optimization, so too, users of deep nets should not need to understand hundreds of lines of Python and PyTorch. Higher-level environments offer a number of advantages: ease of use, transparency, and portability. gft removes much of the complexity, and much of the magic (and alchemy), in deep nets, reducing fine-tuning to an optimization similar to regression. No one would suggest that regression-like methods are "intelligent."

A.1 Pretraining (f_pre): Don't do it (yourself)
Recent work on foundation models39 (Bommasani et al., 2021) attempts to compete with industry on what industry does best. We think this is a mistake. Industry has "unfair" advantages40 on tasks like pretraining f_pre, which require large investments in people and machines, as shown in Table 8. We recommend that academics focus on fit and predict, which are much more affordable than pretraining f_pre. The last two columns in Table 8, time and hardware, obviously depend on many factors, such as the size of the model. One of the motivations behind distillation (Hinton et al., 2015; Ganesh et al., 2021) is to reduce the size of the model; smaller models tend to run faster at inference time. While inference is much faster than training, inference time is often a bottleneck for commercial applications, since training is a one-time investment, whereas inference is a recurring cost. For successful applications with millions or billions of users, recurring costs can easily dominate one-time training costs.
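To see why recurring inference costs can dominate one-time training costs, consider a back-of-envelope calculation. All numbers below are illustrative assumptions, not figures from Table 8.

```shell
# Illustrative arithmetic: daily inference compute vs. one-time training compute
awk 'BEGIN {
  train_gpu_hours     = 24000   # assumption: ~1000 GPU-days of pretraining
  ms_per_query        = 10      # assumption: 10 ms of GPU time per prediction
  queries_per_day     = 1e9     # assumption: 1 billion queries per day
  infer_hours_per_day = queries_per_day * ms_per_query / 1000 / 3600
  printf "inference: %.0f GPU-hours/day; training: %d GPU-hours (one-time)\n",
         infer_hours_per_day, train_gpu_hours
}'
```

Under these assumed rates, cumulative inference compute overtakes the one-time training cost in under ten days.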
As for training costs, pretraining is much more expensive than fine-tuning, especially for large models. Pretraining is already very expensive and will become even more expensive in the future as models become larger and larger. Pretraining large models will be beyond the means of academics (and governments).
Consider the pretrained models in Table 9, and especially the largest model, PaLM (Chowdhery et al., 2022). PaLM produces impressive results, using a huge model (540B parameters). That said, the size of the investment is even more impressive: the paper has dozens of authors using thousands of TPUs (distributed over multiple data centers).
When the investments are this large, projects become risk averse. Projects of this size cannot afford to fail. Academics should focus on projects that reward creativity and avoid projects that are too big to fail.
We like to think of f_pre like Intel CPU chips. Universities can afford to program CPUs, but universities cannot afford to compete with Intel and fabricate their own CPUs. So too, we argue that universities can afford to fit and predict deep nets, but they cannot afford to compete with industry on f_pre. When the first author was a student at MIT, his thesis advisor, Jon Allen, urged the university to make large investments in VLSI fabrication. In retrospect, it was probably a mistake for a university to invest in VLSI fabrication, though others may disagree with that assessment.41
In short, we recommend that users start by downloading f_pre from hubs and focus on steps 2 (fit) and 3 (predict) of the standard recipe. Some examples of f_pre are shown in Table 9. Many of these models can be downloaded from hubs, with a few exceptions, especially for larger models such as ERNIE 3.0, GPT-3, and PaLM. Most models are trained on corpora, as shown in Table 10.

Footnote 39: https://crfm.stanford.edu/workshop.html
Footnote 40: "Unfair advantages" is management jargon, common in industry, especially when discussing strategy. Obviously, there is nothing "unfair" about taking advantage of one's strengths.
Footnote 41: http://www.eecs.mit.edu/docs/newsletter/VLSI.pdf

Table 9. gft starts with large pretrained base models, f_pre, typically trained on the large corpora in Table 10, using expensive GPU clusters. (Columns: Base Model (f_pre), Params, Training Data.)