The usage-based linguist Michael Tomasello has compared a number of theories about language acquisition (Tomasello 2003), including what Pierrehumbert in the last chapter called the “standard” model, generative linguistics. Tomasello is not an admirer of the work of Chomsky and Pinker. He begins with the following characterization of the difference between “formal linguistics” and usage-based linguistics (98–99):
formal linguistic approaches (including generative grammar) characterize natural languages in terms of formal languages, using as basic theoretical primitives meaningless algebraic rules and meaningful linguistic elements that serve as variables in the rules. But many contemporary linguists simply do not believe that the analogy between natural languages and formal languages is a particularly accurate or productive one – most importantly because it effaces the symbolic dimension of grammatical constructions. The alternative is to look at linguistic competence, not in terms of the possession of a formal grammar of semantically empty rules, but rather in terms of the mastery of a structured inventory of meaningful linguistic constructions.
The previous chapter has provided the basic terms for the alternative approach, including exemplar theory and constructions. Tomasello sums up the difference between the “standard” model and its alternative (Tomasello 2003: 100):
The most important point is that constructions are nothing more or less than patterns of usage, which may therefore become relatively abstract if these patterns include many different kinds of specific linguistic symbols. But never are they empty rules devoid of semantic content or communicative function. In usage-based approaches, countless rules, principles, parameters, constraints, features, and so forth are the formal devices of professional linguists; they simply do not exist in the minds of speakers of a natural language.
Patterns of usage, then, not formal rules, constitute the grammar of a language for Tomasello as it inheres in human cognitive processes. Still, the patterns may “become relatively abstract” under the right conditions, moving towards the algebraic formulas of the generativists, and indeed, generativists would insist that all linguists have an empirical side, so perhaps the difference between usage-based and formal models might consist in how the patterns are formulated. As discussed in the last chapter, usage-based linguists have not broken away from the old baggage of formal linguistics as much as they might like to think.
Tomasello does answer the generativist claim about the biological innateness of grammar (284):
The best-known theory about the role of biology in human linguistic competence is Chomsky's proposal that with respect to core grammar biology is everything. In this theory, the innate language module (universal grammar) does not contain things like special learning procedures and perceptual biases but rather real linguistic content; it is thus a theory of “representational innateness”…But representational innateness is a very unlikely theory.
Tomasello then treats grammar genes, linguistic savants, brain localization, the critical period, deficient input, and poverty of the stimulus as proposed reasons to believe in representational innateness, and finds all of them lacking. He concludes (289):
Overall, then, the case for linguistic nativism – in the form of representational innateness – is very poor. Combining all the data and arguments throughout this book, we can say that: (1) there are virtually no linguistic items or structures that are universal in the world's languages; (2) there is no poverty of the stimulus in language acquisition; (3) linking does not work; (4) parameters do not help; (5) the continuity assumption is demonstrably false; (6) performance factors and the maturation of universal grammar are simply unprincipled fudge factors used to explain recalcitrant data; (7) invoking extensive lexical learning as necessary for triggering parameters makes the theory basically indistinguishable from other learning theories – except that it has in addition the linking problem; and (8) although the empirical situations cited in support of biological bases for language acquisition mostly do demonstrate such bases, they do not demonstrate in any form representational innateness.
This is quite an indictment, and yet as before it does not completely dispose of the generative model. We may even agree with Tomasello that the claim for biological innateness cannot stand, but we could still uphold the logical bases for the generative model. After all, most generativists spend little time arguing for representational innateness, and a great deal more time describing relationships within sentences of a language. The principle of “elegance” guides the logic of generativism, and elegance is a matter separate from representational innateness.
What I propose to do in this chapter is to suggest why we in language studies ought to pay more attention to what I will call the 80/20 Rule than we have before now, especially for the way that we understand the grammar of English. We can, and indeed will have to, accommodate more than one theory of grammar in the environment for language study that we have inherited. We should be prepared to cope with Tomasello's usage-based approach and related historical theories from structural grammar, and with traditional prescriptive school grammar, and with the “standard” generative model. Once we accept that there are in fact other different approaches to grammar that compete with generativism in our linguistic marketplace of ideas (see Kretzschmar 2009: 6–20), we need to discover the principles of each one so that we can assess them appropriately. Complexity science offers us the means to do these things.
Zipf's Law and the 80/20 Rule
The distributional pattern in language known as Zipf's Law has been around for nearly one hundred years now. Jean-Baptiste Estoup may have noticed the pattern first around 1912 (Petruszewycz 1973), but Zipf made it famous about three decades later in association with what he would come to call the “Principle of Least Effort” (Zipf 1949). Wikipedia gives us the basic low-down (http://en.wikipedia.org/wiki/Zipf's_law):
Zipf's Law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc. For example, in the Brown Corpus “the” is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's Law, the second-place word “of” accounts for slightly over 3.5% of words (36,411 occurrences), followed by “and” (28,852).
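As a minimal sketch of that inverse proportion (assuming only the Python standard library, with a hypothetical file name standing in for any plain text), we can count word frequencies and check that rank times frequency stays roughly constant:

import re
from collections import Counter

# Count word tokens in a plain-text file (the file name is hypothetical).
with open("tom_sawyer.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(words)

# Under Zipf's Law, frequency is inversely proportional to rank,
# so the product rank * frequency should stay roughly constant.
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(f"{rank:>2}  {word:<10} freq={freq:>6}  rank*freq={rank * freq}")

On a text the size of Tom Sawyer the products will not be perfectly constant – Zipf's Law is approximate – but they will stay within the same order of magnitude while the raw frequencies fall steeply.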
In any text of at least moderate size, the frequency of any word in the text is just about inversely proportional to its rank. This statistical relationship is noticed in two of the best modern books on Natural Language Processing, a field which specializes in quantitative measurements of language, but the key word here is “noticed.” Manning and Schütze (1999: 23–29) offer a concise account of Zipf's Law early in their very long book, using the text of Mark Twain's novel Tom Sawyer for illustration, but then they mention Zipf on only two occasions after the initial description, both in passing. Jurafsky and Martin's textbook Speech and Language Processing (2000) does not even mention the Zipfian distribution until a passing reference on page 603, and that is the only mention of Zipf in the volume. Those of us in language studies who want to learn more about Zipf's Law will be faced with explanations like this one from PlanetMath (http://planetmath.org/zipfslaw):
Zipf's Law says that the ith most frequent object will appear 1/i^θ times the frequency of the most frequent object, or that the ith most frequent object from an object “vocabulary” of size V occurs

f(i; θ, V) = n / (i^θ · H_θ(V))

times in a collection of n objects, where H_θ(V) = Σ_{k=1}^{V} 1/k^θ is the harmonic number of order θ of V.
PlanetMath is a website that bills itself as “math for the people, by the people,” but whose language and notation may leave many non-mathematicians behind. Figure 4.1, however, is a bit more intelligible, especially if we call it an “A-curve” as described in Chapter 1. If we put all the different word types in a text or corpus in order by frequency, there will only be a few word types that occur very often, at the left of the graph, and there will be a great many word types that do not occur very often at all, at the right of the graph in what is often called its “long tail.” Let us say that each of the thirty gradations on the PlanetMath curve represents one hundred different words. Then, as in Figure 4.2, the first five or six gradations at the left of the graph, about 500 or 600 words, account for about 80 percent of all the running words in the text, while the remaining 24 or 25 gradations in the long tail at the right, the other 2,400 or 2,500 different words, account for only about 20 percent of the running words. This is the source of the so-called “80/20 Rule,” known as the Pareto Principle in economics from Pareto's early-twentieth-century observation that 80 percent of the land in Italy was owned by 20 percent of the population. The 80/20 Rule is often used in business (e.g. 80 percent of profits come from 20 percent of customers), quality control (e.g., fixing the top 20 percent of mistakes will control 80 percent of the complaints), and elsewhere. We in language studies can and should make good practical use of the 80/20 Rule on a conceptual basis, with the understanding that there is real mathematics behind it that derives from the complex system of speech.

Figure 4.1 Zipf's Law (adapted from PlanetMath, http://planetmath.org/zipfslaw)
Figure 4.2 The 80/20 Rule
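Here is a minimal sketch of that arithmetic (my own Python illustration, not PlanetMath's): it builds an idealized Zipfian frequency list for 3,000 word types, with frequency proportional to 1/rank (θ = 1), and reports how many top-ranked types are needed to cover 80 percent of the running words.

# Idealized Zipfian frequencies for V = 3,000 word types:
# the word of rank i occurs in proportion to 1/i (theta = 1).
V = 3000
freqs = [1 / i for i in range(1, V + 1)]   # already sorted, most frequent first
total = sum(freqs)

# Find how many top-ranked types cover 80 percent of the tokens.
cumulative = 0.0
for i, f in enumerate(freqs, start=1):
    cumulative += f
    if cumulative / total >= 0.80:
        print(f"Top {i} types ({i / V:.0%} of the vocabulary) "
              f"cover 80% of the running words.")
        break

Run as written, the sketch reports that roughly the top 540 types, under 20 percent of the vocabulary, cover 80 percent of the tokens, in line with the 500 or 600 words estimated above.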
In this practical vein, perhaps the most entertaining account of Zipf's Law was written by Hugh Kenner, in an essay called “Neatness Doesn't Count After All – Tidy vs. Untidy Desks” (Kenner 1986). Kenner justified his practice of keeping an untidy desk with Zipf's Principle of Least Effort:
untidy-deskers of the world may take heart. It's we who have mathematical validation. Forget what you may have thought about the swept and tidy world of numbers. Concentrate on the fine randomness of Einstein's hair. We connoisseurs of scrutable chaos have been guided all along by an inscrutable proposition called the 80–20 rule: a special case of Zipf's Law.
That is to say, if we keep the 20 percent of our books that we want 80 percent of the time near us on our desks, we save effort, as Kenner says, because they are “instantly within reach from where you sit. At the cost of a little rummaging, of course.” The more serious discussion in the Kenner article mentions the use of the 80/20 Rule by IBM to optimize data processing in information science, and he offers examples of Zipf's Law in practice for literary works from Henry James, T. S. Eliot, James Joyce, and Shakespeare. In the end, Kenner suggests that “Zipf and 80–20 end up sketching what actually goes on, quite as if they were laws of impersonal Nature, like gravity and thermodynamics. But at bottom – so Zipf assured us – such laws work because they describe the situation we create in the course of intelligent coping. Situations like my desktop.” While there are many reasons why you might like to keep particular books on your desk, Kenner's wit found a practical function for the Zipfian distribution there – at the cost of a little rummaging. Whether we use mathematical formulae or Kenner's desk as our mental model, we can see that Zipf's Law and the 80/20 Rule describe a distributional pattern that we can apply to language.
The problem with Zipf's Law is that it implies that regular behavior, in nature and in language, ought to be described as a law, a matter of cause and effect, here the Principle of Least Effort. Kenner makes the analogy explicit by mentioning gravity and thermodynamics, “laws of impersonal Nature.” In fact, Zipf's formula has been superseded by other formulations, among them that of the mathematician Mandelbrot (who spent his career at IBM). Mandelbrot's improved formula (1968) shows that the top-ranked words on the curve deviate from the frequency that Zipf expected, and the lower-ranked words also deviate, owing to what he called “the wealth of vocabulary.” In Linguistic Atlas survey data (not written words in continuous discourse but spoken words and phrases gathered in the field), the top-ranked variant is often three, four, five, even ten times more frequent than the second-ranked variant, and we also see curves that are shallower than a 2:1 ratio between the first and second variant (discussed in depth in Chapter 7). If Zipf's Law were really a law, in the same way that thermodynamics and gravity are natural laws, then Zipf's Law just does not work well enough. Frequency is not always and exactly inversely proportional to rank. Hugh Kenner said as much in his article, when he reported that the 80/20 Rule in Henry James was actually 73/27, and in T. S. Eliot it was actually 71/35. What is really important here, however, is not the Principle of Least Effort or any formula of inverse proportions, but the shape of the curve, whether the proportion is actually 80/20 or 90/10 or 73/27. We know that a relatively few types will account for most of the tokens (which we might call “the 80 percent group”), and the remaining types will account for a much smaller quantity of the tokens (which we might call “the 20 percent group”), given some margin for variation in the actual ratio.
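The usual statement of Mandelbrot's generalization is f(r) ∝ 1/(r + β)^α, which adds two parameters that let the top ranks flatten and the tail bend away from Zipf's strict inverse proportion. A small sketch shows how the two formulas diverge; the parameter values here are invented for illustration, not fitted to any corpus.

# Plain Zipf vs. the Zipf-Mandelbrot form f(r) ~ C / (r + beta)^alpha.
# Parameter values are purely illustrative, not fitted to any corpus.
def zipf(rank, c=70000):
    return c / rank

def zipf_mandelbrot(rank, c=70000, beta=2.7, alpha=1.1):
    # Scaled so that rank 1 matches the plain Zipf value.
    return c * (1 + beta) ** alpha / (rank + beta) ** alpha

for r in (1, 2, 3, 10, 100, 1000):
    print(f"rank {r:>4}: Zipf = {zipf(r):>8.0f}   Zipf-Mandelbrot = {zipf_mandelbrot(r):>8.0f}")

With these values the second- and third-ranked words come out more frequent than plain Zipf predicts, which is just the kind of deviation at the top of the curve that Mandelbrot's formula was designed to capture.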
English grammar
Let us now turn to English studies and to grammar. If we begin with the long view, Saussure has described the history of “grammar” (1916/1986: 1):
This discipline, first instituted by the Greeks and continued mainly by the French, is based on logic. It offers no scientific or objective approach to a language as such. Grammar aims solely at providing rules which distinguish between correct and incorrect forms. It is a prescriptive discipline, far removed from any concern with impartial observation, and its outlook is inevitably a narrow one.
“Traditional grammar,” as Saussure later calls it, is “concerned with the description of linguistic states,” and not at all with any historical development in a language. It fails to be scientific because (1916/1986: 82)
Traditional grammar pays no attention to whole areas of linguistic structure, such as word formation. It is normative grammar, concerned with laying down rules instead of observing facts. It makes no attempt at syntheses. Often, it even fails to distinguish between the written word and the spoken word.
The implication here is that scientific observation is systematic and comprehensive, and traditional grammarians leave too much out of account in their attempt selectively to lay down rules. Nonetheless, we must all cope with traditional grammar because it is embedded in our educational systems, not just for the French but everywhere. Traditional grammar, then, is not a product of the complex system of speech, but instead arises from the unscientific opinions of grammarians.
Our popular belief in the prescriptive approach to grammar has generated a parallel belief in “standard” forms of languages. Besides prescriptive teaching in the schools, language standards are also recognized in the law courts. So, in his opinion in the famous Ann Arbor decision, Judge C. W. Joiner equated the school “standard” with the language of “the commercial world, the arts, science, and professions” (Martin Luther King Junior Elementary School Children et al. v. Ann Arbor School District, 473 F. Supp. 1371 (E.D. Mich. 1979); italics added):
The problem in this case revolves around the ability of the school system, King School in particular, to teach the reading of standard English to children who, it is alleged, speak “black English” as a matter of course at home and in their home community (the Green Road Housing Development). This case is not an effort on the part of the plaintiffs to require that they be taught “black English” or that their instruction throughout their schooling be in “black English” or that a dual language program be provided…It is a straightforward effort to require the court to intervene on the children's behalf to require the defendant School District Board to take appropriate action to teach them to read in the standard English of the school, the commercial world, the arts, science, and professions.
Judge Joiner distinguished between “black English” and “standard English,” and clearly thought that “black English” deviated from the “standard English” that he believed to be in use in school and in educated realms of society. For Judge Joiner, “standard English” constituted the normal language of his own community of educated people, to which students in the school should aspire even if their own (deviant) home and community language differed from it. US Supreme Court Justice Antonin Scalia has similarly advocated the “plain language rule” for the interpretation of statutes and of the constitution. The idea that there is such a thing as a “plain meaning” to be drawn from a statute requires the implicit assumption that the language is somehow standard enough to support such a meaning for all readers. The language of the law has historically been notorious for convolutions designed to make sure that only one meaning can be derived from, say, a contract, which suggests that plain language may be more difficult to interpret than it is sometimes given credit for. Still, the idea that plain language should have a plain meaning, a consequence of our belief in linguistic standards, obtains in the law and elsewhere in life around us.
The reference to “the Greeks” and to “logic” by Saussure is more complex. As Dinneen has remarked (1967: 87):
Grammatical distinctions…are not explicitly grammatical as opposed to logical in Aristotle's works, since he did not know of this distinction. The purpose he had in mind was logical, but the definitions he gives of the forms he uses in his logic, as well as the examples he uses, justify our considering these distinctions as grammatical.
Grammar and logic have remained linked for many teachers in Western educational institutions. As Dinneen remarks (1967: 71), “The terminology of traditional grammar, inherited from the Greeks, comprises the most widespread, best understood, and most generally applied grammatical distinctions in the world.” Thus, as Saussure said, traditional grammar, from its origins among the Greeks, came to be linked with logic and to be characterized by prescription. It remains the primary mode of language instruction in our elementary and secondary schools, and as such we cannot ignore it.
Of course, modern linguists are not bound by the prescriptive tradition, and we do try to make observations scientifically, based on the evidence before us. Still, we have mainly preserved traditional grammatical distinctions and categories, from parts of speech to functional categories. We still talk about the subject and predicate, as Aristotle did, or about subject–verb–object structure, or about NPs and VPs. Moreover, we still think of these as a hierarchy of rules, things that we do in English (or another language) as opposed to things that we do not do. In their Comprehensive Grammar of the English Language (1985), Quirk, Greenbaum, Leech, and Svartvik wrote that (11–12)
It is grammar that is our primary concern in this book. Words must be combined into larger units, and grammar encompasses the complex set of rules specifying such combination.
The “complex set of rules” in the Comprehensive Grammar turns out to be not complex enough: we maintain professional journals that continually publish additional claims for what Quirk et al. called “the regularities of grammar.” Structural grammars like the Comprehensive Grammar must therefore keep expanding in size, even though the Comprehensive Grammar already runs to 1,779 pages. The idea of a structural grammar thus includes documentation of as many regularities of grammar as might be identified in a language. Structural grammars must allow for growth in the grammar.
Academic grammars following Chomsky are not essentially different from this idea, but take grammatical complication in a different direction. Chomsky suggests that the central problem is deciding which generalizations are significant (1965: 41):
We have a generalization when a set of rules about distinct items can be replaced by a single rule (or more generally, partially identical rules) about the whole set, or when it can be shown that a “natural class” of items undergoes a certain process or set of similar processes. Thus, choice of an evaluation measure constitutes a decision as to what are “similar processes” and “natural classes” – in short, what are significant generalizations.
Chomsky called generalizations that render the description of a system more complicated “spurious generalizations,” as opposed to significant ones; this of course reversed the structuralist process of evaluation that has always led to bigger grammars, as structuralist grammars keep adding newly noticed grammatical constructions to the list. The distinction between significant and spurious generalizations underlies the generative principle of elegance. The generative trend in grammar seeks the fewest number of rules which can account for the regularities of, first, a language like English, on the way to universal rules that can account for human language in general (Chomsky 1965). In the end, however, Huddleston and Pullum (2002), a full-featured generative grammar that parallels the Comprehensive Grammar, is not smaller in size at 1,860 pages in the same large format. Generative grammars may have a principle of elegance that is supposed to control growth of the grammar by distinguishing significant generalizations from spurious generalizations, but in practice the principle of elegance has not led to smaller grammars.
What linguists have called a “complex set of rules” is just not the same thing as a complex system. Östen Dahl (2004) has written a book about the growth and maintenance of linguistic complexity, but that topic concerns the enrichment of possibilities with a hierarchy of rules, not complexity theory as described in this book. Indeed, the basic principles underlying grammar are just the opposite of those for a complex system:
1 grammar is a static structure of rules, as opposed to being open and dynamic;
2 grammar consists of a hierarchical arrangement of rules, ideally the smallest possible set of rules, as opposed to a very large number of interactive components/agents;
3 grammar exhibits fixed relations between elements, as opposed to emergent order. Because of these fixed relations, grammar ideally has binary distributions, right vs. wrong, acceptable vs. unacceptable constructions, instead of the nonlinear frequency distributions characteristic of complex systems; and
4 grammar has homogeneous unity, as opposed to the property of scaling.
Grammar, then, as a way of thinking about language, is complementary to the complex system of speech, not a description of it. The properties of a grammar consist of idealizations, generalizations away from the dynamic qualities of speech as people actually use it. Generative grammar ideally should apply the principle of elegance to create the smallest number of rules that can generate the acceptable sentences of a language, while structural grammar collects regularities that need to be included in the static structure and thus allows for growth in the grammar. Traditional grammar allows grammarians to choose what rules should be included, without any necessary limitation by what people actually say or write in their language.
The “Five Graces Group” has proposed, on the basis of complex adaptive systems:
a usage-based theory of grammar in which the cognitive organization of language is based directly on experience with language. Rather than being an abstract set of rules or structures that are only indirectly related to experience with language, we see grammar as a network built up from the categorized instances of language use…The basic units of grammar are constructions, which are direct form-meaning pairings that range from the very specific (words or idioms) to the more general (passive construction, ditransitive construction), and from very small units (words with affixes, walked) to clause-level or even discourse-level units.
The Five Graces Group continues:
Because grammar is based on usage, it contains many details of cooccurrence as well as a record of the probabilities of occurrence and co-occurrence. The evidence for the impact of usage on cognitive organization includes the fact that language users are aware of specific instances of constructions that are conventionalized and the multiple ways in which frequency of use has an impact on structure. (2009a: 5)
There are many things to agree with in this programmatic statement, but the direct connection of speech and grammar is not one of them. The Five Graces’ assertion of a “network built up from categorized instances of language use” bears within it the notion of categorization, which implies some sort of a priori grammatical hierarchy, the “natural classes” in grammar cited earlier from Chomsky. The natural classes that Chomsky in 1965 optimistically predicted to exist, linguistic universals, have never actually been documented (as Tomasello argued at the beginning of this chapter). Without natural classes, the process of categorization is fraught with problems that arise from all of the variation in speech around us, as we saw in the discussion of exemplar theory in Chapter 3. Indeed, in modern cognitive science, the theory of mental representations and the language of thought as championed by Jerry Fodor (see Bermudez 2010: 156–158, 251–253) is just one way of imagining the cognitive process, and categorization into representations is thus not something that we should accept as a given. The neural network theory of activation patterns (Bermudez 2010: 215–245) also explains how information and its frequencies can be processed, without categorization into representations. The newest work on the dynamical systems model of cognitive processing explicitly dispenses with representations (Bermudez 2010: 414–431; Van Gelder 1998 is a key source) in favor of the process involved in complex systems, interactions with feedback (see Chapter 6 for further discussion of cognitive processing). Thus the Five Graces Group is right to insist on usage as what builds a speaker's cognitive sense of a language, but not credible in its assertion of a direct connection between speech and grammar as a network of categories. Grammaticalization is a process without an endpoint, whether for speech in society as Hopper argued in 1987 or for cognitive information processing by individuals. Grammar, on the other hand, when it is defined as a network or hierarchy of categories or rules, is something essentially different from the output of the complex system of speech, something only indirectly related to language in use.
Our interest in making grammars that transform the complex system of speech into a hierarchy of rules comes originally, in our Western culture, from just the motivation that Saussure talked about. First, we have Aristotle's wish to specify how language can be used to make particular logical, philosophical relations between subjects and predicates. By extension, the logic of structure in a language tells us about what Saussure called “linguistic value” (see Kretzschmar 2009: 53–55). Linguistic value is the heart of Saussure's Course in General Linguistics (1916/1986), because linguistic value is the aspect of language that we can only describe through analysis of a fixed structure. Linguistic values are established as a result of the relational fit of each linguistic feature within the hierarchical system of the language. Thus, in Saussure's example (1916/1986: 114), the French word mouton covers the same semantic space as both English sheep (‘live animal’) and English mutton (‘meat from the animal’), and so the French word is the equivalent of neither English word. The French system has one element while the English system has two, and the linguistic values must therefore be different because of the relational fit of the elements. The same sort of difference in linguistic values obtains for elements in both paradigmatic structures such as the list of words in a language, and also in syntagmatic structures such as the constructions into which those words may be assembled. Linguistic value can only arise from a hierarchical structure, whether a paradigmatic inventory of words or an array of syntagmatic rules. This means that it does not, cannot, come directly from a dynamic complex system with emergent order. Grammar is something that requires us to abstract away from what people are doing, to get outside of the process of interaction of the complex system, in order to build a hierarchy. Thus generative and structural grammars must of necessity rely on linguists’ perceptions of what people are doing, and traditional grammars on grammarians’ opinions of what people ought to be doing. Still, linguistic value will always justify an interest in linguistic structure as an appropriate mode for study of a language, even if we cannot extract it directly from the complex system of speech.
The real problem with grammars has always been the relation of a grammar to the complex system of speech, to language as we use it. Ideally, grammatical rules are binary: they tell us what we can do in a language and what we cannot do; we think that speakers of a language should be able to make grammaticality judgments. As Quirk et al. said of speech sounds in the Comprehensive Grammar (1985: 11),
When people speak, they emit a stream of sounds. We hear the sounds not as indefinitely variable in acoustic quality (however much they may be so in actual physical fact). Rather, we hear them as corresponding to one of the very small set of sound units (in English /p/, /l/, /n/, /I/, /ð/, /s/ …) which can combine in certain ways and not others.
Such is the nature of rules, which rely on our perceptions. Anybody making a grammar will assemble a perceived set of categorizations from the stream of speech into a structure of combinatorial patterns that people use (“certain ways and not others,” whether by observation or opinion). With regard to word-formation, Quirk et al. say that “products of word-formation are subject to the idiosyncrasies inherent in the lexicon, whereas the rules of grammar are far broader, transcending the individualities and unpredictabilities of the lexical items which the grammatical rules manipulate” (1985: 1518). Unfortunately, as speakers of English know very well, the “sound units,” or phonemes, of English are not as clear-cut as this statement suggests. The inventory of phonemes is not the same for all speakers of English, and speakers of English also do not all share the same rules for combinations of sounds, say for the many speakers who vary in rhoticity, or who employ consonant cluster simplification, or who substitute glottals for other stops. The same is true of rules of grammar. Quirk et al. offer the “Rules for forming open-class -ly adverbs from adjectives,” but also cite exceptions like wholly for expected wholy, gaily for expected gayly, publicly for expected publically (1985: 439). They talk about rules for clause formation, and then say that “These restrictions are frequently violated in poetry and in other imaginative uses of language” (1985: 772). In their discussion of pro-forms, with regard to constructions that they have earlier described “in terms of grammatical rule,” they say that “What is grammatically tolerable has to be mediated by what is textually tolerable” (1985: 1460). So, even if grammatical and phonological rules are supposed to be binary, in fact they have never been so in structural grammars like the Comprehensive Grammar. Newer generative approaches to rules do not allow exceptions or leaks, but instead adopt a strategy of rule ordering, or of multilevel surface and deep rules, or other ingenious manipulation of the hierarchy of rules in order to compensate for the basic fact that binary decision trees are a poor match for language in use.
Sapir has famously written that “All grammars leak” (1921: 38–39), which may give the impression that all grammars are somehow badly constructed. It would, however, be better to say that “The idea of grammar leaks” because the basic binary notion of acceptability simply does not fit the facts of any language. And that is actually what Sapir was talking about:
The fact of grammar, a universal trait of language, is simply a generalized expression of the feeling that analogous concepts and relations are most conveniently symbolized in analogous forms. Were a language ever completely “grammatical,” it would be a perfect engine of conceptual expression. Unfortunately, or luckily, no language is tyrannically consistent. All grammars leak.
Rather than talking about the universal linguistic features or structures that Tomasello claims have never been discovered, Sapir says merely that the fact of grammar is universal. And grammar for Sapir is a “generalized expression,” not a fixed hierarchy of rules or structures but instead something that is never tyrannically consistent. His generalization is not susceptible to binary decisions of acceptability because, like many another generalization, it captures what usually happens without governing every case.
Complexity science and grammar
Complexity science, applied to the complex system of language in use, tells us why this should be so, through a combination of nonlinear frequency distributions and scale-free networks. First of all, we have the 80/20 Rule. The different ways of putting words together have a nonlinear distribution, just like other linguistic features. Just to show once more the generality and relevance of the A-curve pattern, Michael Stubbs has reported the top-ranked collocates of undergo in a corpus of British English. He never claimed that this distribution was Zipfian, but it certainly has the shape of the curve when we graph his list (Figure 4.3). Of course the long tail of the curve is cut off here, since Stubbs only reported the most frequent collocates. Collocation is a different sense of “putting words together” than grammar per se, especially when we consider grammar to be composed of constructions as the usage-based linguists do, but it is difficult to know where the notion of collocations ends and constructions begin. Nick Ellis and Diane Larsen-Freeman have documented the nonlinear distribution of grammatical “constructions,” as shown in Chapter 3. For Stubbs’ corpus we cannot calculate the 80/20 proportion because we do not have all of the data, but we can see that the top 20 percent of co-occurrences belong to the top three collocates, surgery, tests, treatment. The top twenty collocates occupy 63 percent of co-occurrences, which is many times more than we might expect if words were arranged randomly – Stubbs has calculated that a 1 percent co-occurrence rate for this corpus, the rate for the twentieth-ranked word here, is 125 times greater than would be expected by chance (2001: 73–74). Feedback in the complex system of speech has created a pattern, one that we see wherever we look in speech data, that is wildly improbable by chance.
Figure 4.3 Stubbs, top twenty collocates of undergo, with frequency of collocate co-occurrences given in percentages in parentheses (adapted from Stubbs 2001: 89)
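For readers who want to see what collocate co-occurrence involves computationally, here is a minimal sketch – a toy text, not Stubbs’ corpus or his exact method – that counts the words appearing within a few tokens of a node word and reports each collocate's share of the co-occurrences:

from collections import Counter

def collocates(tokens, node, window=4):
    """Count words co-occurring with `node` within +/- `window` tokens."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

# Toy example; a real corpus would be needed for Stubbs-like results.
text = "patients undergo surgery and patients undergo tests before treatment".split()
ranked = collocates(text, "undergo").most_common()
total = sum(count for _, count in ranked)
for word, count in ranked:
    print(f"{word:<10} {count / total:.0%} of co-occurrences")

Applied to a corpus of millions of words, the same counting procedure produces ranked lists like the one in Figure 4.3, with the familiar A-curve shape.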
What is important to note here is that the 80/20 Rule for such nonlinear distributions (whether the actual percentages are 80/20 or 90/10 or 70/30) tells us that we will always find one or a few constructions that account for the great majority of the instances for the feature under study, and that there will be a large number of variant constructions for the feature that account for a small minority of the instances. What the 80/20 Rule tells us about grammatical rules, then, is that we know in advance that there will be exceptions. Indeed, exceptions are not rare events, because we can predict that the class of exceptions will account for about 20 percent of the instances of any feature we study. Moreover, the exceptions will account for about 80 percent of the different constructions possible for any feature. Once we understand that the 80/20 Rule is not a curiosity but instead the hallmark of a complex system, we can understand what we take to be grammatical regularities in a different way. Grammatical rules, it turns out, are not laws but more like guidelines.
Grammatical rules are, however, more than mere suggestions because they have a nonlinear frequency curve behind them. Indeed, it is highly likely that the 80/20 distributional pattern gave rise to the idea of grammar in the first place, because speakers of any language perceive that for any question about how to put words together, there will be one or a few constructions that occur a great majority of the time, in the 80 percent group. The idea that there is a fixed objective language hierarchy, a linguistic system, originates as an observational artifact, something that we just perceive to be there because we usually do one of just a few things for any construction. Variants of features that are massively more common have a strong perceptual advantage because of the ever-present nonlinear distributional pattern, and they come to be perceived as “grammatical” or “acceptable.” On the other hand, the less-frequent variants in the 20 percent group in the long tail of the curve are at best perceived as exceptions, and at worst perceived as “unacceptable” or “ungrammatical.” If speech is a complex system from which order emerges in the shape of nonlinear distributional curves for every linguistic variant, as all the evidence confirms, then grammar cannot begin as an underlying system of rules that generates all and only the acceptable structures. Grammaticality is something that we notice after the fact of usage, given our perception of the nonlinear frequency distribution of variants. Complex systems, then, support Paul Hopper (1987) when he called grammar “epiphenomenal,” because our perception of grammar arises as a secondary effect of the complex system of speech, not as an intrinsic cause of how languages work.
It follows, then, that the 80/20 Rule that arises from the output of complex systems can account for Sapir's statement that “the fact of grammar” is “a universal trait of language.” The Five Graces Group was wrong to make the tacit assumption of universal grammar, just as the generativists were wrong to do so. What is actually universal is the trait that speech is a complex system. Grammar is not a fact because it is an underlying system, but because of the belief of language users in the real existence of languages and dialects, based on their perceptions of the A-curves for every feature. That is, people share cognitive tools for processing their experience with speech, and the nonlinear distribution of variants allows for habituation of neural pathways for common features in the 80 percent group (whether or not we choose to call such habituation “mental representations” or instead consider it to be the output of neural networks or dynamical systems). We then come to believe that the variants with which we are most familiar constitute our language, and more locally constitute our dialects. The components of a grammar for some particular population of speakers are first and foremost the top-ranked variants of every linguistic feature, those in the 80 percent group, which linguists can make into a hierarchical system. Traditional grammarians also perceive that there is an 80 percent group of frequent usages, and they just substitute their own perceptions to make a grammar, instead of the scientific observations made by linguists of the 80 percent group in a community. Even though the nonlinear distribution of linguistic tokens is not something that our grammars have taken into account, it is the nonlinear distribution that allows us to perceive grammar as instantiated through rankings of feature variants for any and all populations of speakers. Grammar is thus, as Saussure has written, “a social phenomenon,” something “social in its essence and independent of the individual,” but for a reason that he had no way to anticipate. The idea of grammar requires, for its very existence, the nonlinear distributional pattern of linguistic variants that continually emerges, at all levels of scale, from huge numbers of interactions between speakers who deploy feature variants in the complex system of speech.
Improved grammars
Let us now consider how our knowledge of speech as a complex system can help us to improve the grammars we create. First of all, the foundation of grammar in nonlinear distributions allows us to emphasize in contrast that prescriptive rules, from Saussure's dark side of grammar, have no such foundation. Instead of the justification of the ever-present A-curve of frequency, prescriptive grammars rely on someone's perception of social acceptability, whether the opinion comes from the Académie française or, for English, from self-appointed usage mavens. Prescriptive judgments are thus inherently arbitrary in a way that grammatical descriptions based on the A-curve are not. The 80/20 Rule weighs against arbitrary judgments of acceptability and supports the idea in the Comprehensive Grammar that grammar is “immanent” in language – just not built-in in the same way that the authors thought it was. A knowledge of complex systems therefore frees us to consider grammatical prescriptions as a social affectation. Prescription is inherently political, a product of social debate and the exercise of social influence, in a way that the frequency profiles arising from the complex system are not. Judge Joiner's view that “standard English” belongs to the schools and to the professions can thus be seen as a political statement, as can Justice Scalia's belief in “plain language.” The importance of Judge Joiner's Ann Arbor decision comes in its recognition that “black English” exists as a home and community language, not in its ratification of “standard English” as the property of the educated class. Justice Scalia's belief in “plain language” is revealed as an activist judicial tactic that justifies a judge's belief in his own perceptions of language, as opposed to historical or scientific modern views of how people actually use language. When we know the true grounds of prescriptive grammar, and that nothing about prescriptions is natural or inevitable, we can do a better job of managing expectations in our schools, where notions of prescriptive grammar are fostered and maintained. So, it is not true that we might as well be prescriptive because of the natural irregularity of grammar. We just need to use the 80/20 Rule to modify our understanding of how the “rules” of grammar arise, and to see more clearly how prescription differs from descriptive grammar.
Quirk et al., ever the most reasonable of grammarians, knew nothing of the A-curve and complex systems, but they did include a notion of frequency and its relation to acceptability. They were interested not just in binary decisions but in “regularities” (1985: 12):
The study of words is the business of lexicology, but the regularities in their formation are similar in kind to the regularities of grammar and are closely connected with them. [the authors provide an appendix on word formation]
They further specify that the regularities of grammar cannot be entirely separated from other systematic ways of observing the language, from phonology or lexicology or semantics or pragmatics, so that, they say (1985: 15),
Our general principle will be to regard grammar as accounting for constructions where greatest generalization is possible, assigning to lexicology…constructions on which least generalization can be formulated. In applying this principle we will necessarily make arbitrary decisions along the gradient from greatest to least generalization.
Quirk et al. would not be very scientific if they were proposing that we should keep rules when they work and throw everything else into the category of lexical exceptions (unfortunately, this is a strategy not unknown in linguistics). However, the important point here is their recognition of “the gradient from greatest to least generalization.” They are proposing a continuum in which grammar permits the greatest degree of generalization, but still not perfect generalization. Further to the same point, Quirk et al. create a relationship between acceptability and frequency. They claim that “Assessments by native speakers of relative acceptability largely correlate with their assessments of relative frequency” (1985: 33), and they have used the major corpora available at the time of their writing (Survey of English Usage, Brown, LOB) in order to establish the frequency of language phenomena. Quirk et al. thus differ from Chomsky's famous rejection of corpus evidence (see e.g. Andor 2004). They are also quite different in approach from Chomsky's principle of elegance in the creation of generalizations, which shows Chomsky's confidence in our ability to make binary judgments, a confidence not supported by the 80/20 Rule for language in use.
For the Comprehensive Grammar, since it already includes the idea of a continuum of acceptability based on frequency, all we need to do to make it align with the complex system of language in use is to say that any grammatical rule is subject to the 80/20 Rule. This change for the Comprehensive Grammar would really amount to a change in tone, not in substance. Instead of presenting rule systems that could be confused for prescriptions, the grammar would freely admit its leakage and build “exceptions” into the discussion of regularities – as the Comprehensive Grammar already does for some rules. The application of the 80/20 Rule would also allow us to evaluate how different constructions are included as part of a grammar. We can use frequency to determine whether any given construction is in the 80 percent group, and thus a part of what we might want to call the “core” of the grammar constituted by the top-ranked variants. Alternatively, we can determine that constructions are in the 20 percent group, the great majority of constructions that rarely occur, more like what Quirk et al. call “exceptions.” To do so improves their structural grammar by specifying that their generalizations are actually guidelines: a grammatical rule, if well formulated in a structural grammar, can be expected to apply in the great majority of uses in practice, but we can expect a large array of infrequent cases where it does not. These are not all errors. The cases in the long tail of the A-curve are as much a part of the language as the highly frequent cases. The 80/20 Rule just gives a better metric for such evaluations, to replace the impression of some users of the grammar that rules are binary distinctions of what is right and what is wrong, and to minimize the necessity in Quirk et al. for the authors to make arbitrary decisions along the continuum.
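A short sketch can make this kind of evaluation concrete (the variant names and counts below are invented for illustration): rank the variants of a feature by frequency, then assign to the “core” those whose cumulative share of the tokens falls within the top 80 percent, and the rest to the “periphery.”

def core_and_periphery(variant_counts, threshold=0.80):
    """Split ranked variants into an 80 percent 'core' and a 20 percent 'periphery'."""
    ranked = sorted(variant_counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(variant_counts.values())
    core, periphery, cumulative = [], [], 0
    for variant, count in ranked:
        (core if cumulative / total < threshold else periphery).append(variant)
        cumulative += count
    return core, periphery

# Invented counts for the variants of one hypothetical feature.
counts = {"that": 620, "which": 240, "who": 90, "zero": 35, "what": 10, "as": 5}
core, periphery = core_and_periphery(counts)
print("core:", core)            # the few variants that carry most of the usage
print("periphery:", periphery)  # the long tail of infrequent variants

The point of the metric is not the exact threshold, which could as well be 90/10 or 70/30, but the principled separation of the few high-frequency variants from the long tail.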
For generative grammar, the best thing would be for Chomsky and others to continue to insist on the separation between language in use and the rational linguistic structure they prefer to talk about. As Tomasello has argued, the evidence of the world's languages has not borne out the prediction of representational innateness, and so generativists need not press that old case. Generativism, however, still offers a substantially elaborated model of logical relations in language, and we need not abandon that logic along with innateness. Indeed, we should not abandon a model that helps us to discover the linguistic value that Saussure identified. The generative model begins with idealization in order to avoid the complex interactions of speech – the ideal speaker in a homogeneous community – and yet it must still have some empirical connection with speech. The 80/20 Rule suggests that there is no end to the problem of trying to write rules that can generate all of the acceptable sentences of the language. It cannot happen because of the long tail. Once we accept that infrequent constructions are normal parts of language in use, then we can understand that there are too many constructions to accommodate in any elegant rule system. It will, however, continue to be possible to write rule systems that account for the 80 percent group in the 80/20 Rule, and that is in fact what has mostly been happening already. Indeed, the 80/20 Rule can offer assistance to generative grammars because it also provides a metric for them, like Chomsky's earlier metric of elegance, to help decide which constructions to include in such a grammar and which ones to leave out. Application of the 80/20 Rule to a generative grammar would retain the core variants that are most important to linguistic value, and would offer a principled way to set aside variants on the periphery. Bill Labov once famously said, in his essay “The Logic of Nonstandard English,” “For many generations, American school teachers have devoted themselves to correcting a small number of nonstandard English rules to their standard equivalents, under the impression that they were teaching logic” (1972a: 255). If we do not apply the 80/20 Rule to generative grammars, we risk turning Labov's statement around: generativists would be proposing a small number of rules of logic, under the impression that they were describing language. The 80/20 Rule offers generative linguists a better way to align language in use with the kind of binary logical decisions important to the generative model. And those binary logical decisions are just what is required to analyze linguistic value, as Saussure described it, the meaning that comes from hierarchical relations within a linguistic system.
The 80/20 Rule thus highlights an important difference between structural and generative grammars. Neither one tells the whole story of grammar. As its name says, the Comprehensive Grammar does the best job of collecting all of the grammatical constructions of English. The structural grammar is comprehensive at the cost of allowing exceptions, of including the constructions in the long tail. Generative grammars, on the other hand, are superior for the analysis of linguistic value because of their reliance on binary logic, but they retain this superiority only when they study just the top-ranked variants and do not attempt to be comprehensive. As consumers of grammars we want both kinds, and we want them to concentrate on their strong points. We do not want structural grammars that try to compete with generative grammars by asserting iron-clad binary rules that instead come to look like prescriptions. We do not want generative grammars that try to compete with structural ones by accounting for every low-frequency construction, and so lose the ability to write elegant rules that reveal linguistic value in the core elements of grammar. We should prefer structural grammars that do the best job with paradigmatic relations, listing and organizing all of the available constructions. We should prefer generative grammars that do the best job with syntagmatic relations, revealing linguistic value in the relationships between constructions. To make Hugh Kenner's insight into a metaphor for grammars, we want all of the books in our structural library, but we find value in having the books we use most often on our generative desks.
Complex systems offer us another way to write better grammars when we apply the principle of scale-free networks to what we have learned from the 80/20 Rule. The property of scaling tells us that we have to pay close attention to exactly which population of speakers or texts our grammar will apply to. Contemporary sociolinguistics addresses the evident problem that different groups of speakers, whether distinct according to geographical/social criteria or just at different levels of scale, have different language behavior and thus different grammars. Grammars can only be defined for the speech of one population at a time because, while the 80/20 Rule always applies, it will apply differently for every different population of speakers. There is literally an infinite number of possible grammars, because the number of possible groupings of speakers along the geographical/social continuum is infinite. Our problem with scaling is how to relate the grammars described for different population groups, either as part to whole or between different parts of the whole. If we use the same array of grammatical constructions to compare two or more populations, comparison of the results will show that some variant constructions will be higher and some lower on the A-curves in each population, and that some will be in the 80 percent group for one population while they are down in the 20 percent group for another population. We can measure these frequency differences, but they necessarily create categorically different grammars. There will be differences in which variants happen to be top ranked for different populations, and also differences in whether particular low-frequency variant constructions happen to be present or absent for different populations. To use the terms already suggested, constructions in the “core” and “periphery” will therefore differ, so that in grammars for different populations, constructions will sometimes belong to the “core” and sometimes appear as exceptions, and the linguistic value from relationships in the hierarchy of constructions will necessarily be different. The same problem occurs for different kinds of texts. As Douglas Biber has demonstrated (1988; Biber, Conrad, and Reppen 1998), there is a continuum of texts in multiple dimensions that we can observe by measuring the frequency of an array of grammatical constructions. The notion that “every text has its own grammar” is true, for text types, in the same way that the language behavior of different groups of speakers yields different grammars. Complex systems explain how this could be so, and show why we should avoid the proposal of the Five Graces Group that speech is directly connected to grammar and the consequent implication that a language has only a single grammar. Grammar is always relative to the population of speakers and texts that we examine.
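A brief sketch (all counts invented) shows the point: the same variants, tallied for two hypothetical populations, yield different cores, so that a variant can be core for one population and peripheral for another.

# Same variants, two hypothetical populations; all counts are invented.
pop_a = {"that": 500, "which": 300, "who": 120, "zero": 60, "as": 20}
pop_b = {"zero": 450, "that": 350, "who": 110, "which": 70, "as": 20}

def core(counts, threshold=0.80):
    """Variants whose cumulative share of tokens stays under the threshold."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    total, cumulative, members = sum(counts.values()), 0, []
    for variant in ranked:
        if cumulative / total < threshold:
            members.append(variant)
        cumulative += counts[variant]
    return members

for name, counts in (("Population A", pop_a), ("Population B", pop_b)):
    print(f"{name}: core = {core(counts)}")

Run as written, which sits in the core for Population A but falls into the periphery for Population B, while zero is top ranked only for Population B: categorically different grammars built from the same inventory of variants.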
Once we know about and expect the property of scaling, we can accommodate it in our grammars. For one thing, as I have argued in detail elsewhere (2009: ch. 7), we cannot rely on anecdotal evidence. Because of scaling, we cannot assume that others will share our own intuitions about grammatical constructions. To indulge in another metaphor, we cannot assume that speech is like the paint on a wall, where we can tell the color of the entire wall by looking at any small part of it. We need to use valid randomized statistical sampling when we measure the frequencies of constructions in different, particular populations. We can do so for very small populations, as sociolinguists now do for communities of practice: we can sample from just a small section of a wall, and tell what color the section is, say the part of the wall that has been behind a picture for many years. Alternatively, we can sample from an entire wall and tell what color the wall is, say, the wall in a room with an accent color. And, again, we can sample from the walls of an entire room and tell what color the room is, whether it is dark or light overall. The judgment we make about the room as a whole will not tell us about an accent wall or the darker patch where the picture used to hang, and vice versa, the small patch or the single wall will not tell us about the room as a whole. In some cases they might – there might not have been a picture above the patch we sample, and all the walls may be the same color. But in the complex system of speech not only can we not rely on such uniformity, we know even from our own experience, no sampling required, that English is continuously variable around us. If we take care to match the grammars we make to particular populations we can write better grammars, whether for English as a whole, for national or regional varieties, for social groups, or for particular kinds of texts.
The Longman Grammar by Biber, Johansson, Leech, Conrad, and Finegan (1999; also Biber, Conrad, and Leech 2002, from which the illustrations in Figure 4.4 are adapted), for example, makes an excellent start on the process of recognizing differential frequencies of different grammatical constructions in four different “registers,” or text types: conversation, news, fiction, and academic prose. So, in Figure 4.4 different relativizers show up at different frequencies in the four registers; the chart for fiction looks most like an A-curve as the authors ordered the forms, but each chart here has been reordered to make a good A-curve. Conversation has a low frequency of relativizers overall, but the A-curve would appear more clearly there as well if the scale of the y-axis were changed to fit the data. In Figure 4.5 the same distributional pattern appears for negatives across the registers, at right for the frequency of function word classes within conversation and within academic prose, and for the syntactic forms of adverbials in the overall corpus. In the latter case, the last category is “other forms of adverbials,” so we know that the long tail would appear if these forms were broken out separately. These illustrations make it clear that, when we associate the frequency of constructions with text types or populations of speakers, we will not be able to generalize from individuals to small groups to larger groups of speakers or texts, or across parallel populations, so we will always need to be satisfied with what our sampling tells us for the particular population under study.
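The reordering just described is easy to make concrete. In this sketch (the counts are invented placeholders, not Biber et al.'s published figures), each register's frequencies are sorted into descending order to form its A-curve and then normalized, so that registers with very different overall rates can be read on one scale:

# Reorder per-register frequencies into descending A-curves and normalize.
# Counts are invented placeholders, not Biber et al.'s published figures.
registers = {
    "conversation": {"that": 30, "which": 5, "who": 10, "zero": 25},
    "academic":     {"that": 40, "which": 90, "who": 15, "zero": 5},
}
for register, counts in registers.items():
    total = sum(counts.values())
    curve = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    shares = ", ".join(f"{word} = {count / total:.0%}" for word, count in curve)
    print(f"{register}: {shares}")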
What, then, of the idea of Standard English, from the perspective of complex systems? Standard English is an institutional construct, and not, to use Quirk et al.'s term, something “immanent” in the language. The idea of Standard English is of a different order from the grammars that we can write based on the frequency of variant constructions from particular populations. Because Standard English is institutionalized, it promotes variants as if they were top ranked, even though they may not in fact be top ranked across a national population or for the group of English speakers as a whole. In this way, Standard English is on the edge between prescriptive grammar and descriptive grammar. The variants promoted institutionally in Standard English are often presumed to be those of some preferred population of speakers, but in fact experimental speech evidence shows that highly educated speakers (the group typically assumed to embody Standard English) vary substantially in their language behavior, whether between countries (British vs. American Standard), or between regions, or between localities, or even between different kinds of occupations. In contrast to Judge Joiner's formulation, it is simply not true that “the school, the commercial world, the arts, science, and professions” all share the same grammar. Artists do not talk like lawyers. Scientists do not talk like businesspeople. What the schools try to teach is not identical with the usage of any post-secondary educated community. Unlike the social arbitrariness of prescriptive grammar, institutional standardization can have good reasons behind it, as with spelling. Any agreement that limits the variation in grammar that will necessarily arise across time or large distances will improve the chances for successful communication. However, the relationship between Standard English and the usage of any individual speaker, or any group of speakers, will remain probabilistic. Any claims for linguistic value extracted from the structure of Standard English would be relevant just for the institutionalized construct, since no particular population has been identified for it.
Figure 4.4 Differential frequencies of relativizers in the Longman Grammar (adapted from Biber, Conrad, and Leech 2002, figures 8.14–8.17)
Figure 4.5 Differential frequencies of various constructions in the Longman Grammar (adapted from Biber, Conrad, and Leech 2002, figures 2.2–2.3, 8.6, 11.2)
The 80/20 Rule does not replace traditional grammars. It does help us to think more clearly about how we make and how we understand our grammars. Complex systems, with their nonlinear distributions and scale-free networks, offer a means to distinguish prescriptive traditional grammar from contemporary scientific grammars of different kinds. Complex systems also show us how to make better structural and generative grammars. If a knowledge of complex systems can help us to relax the sometimes strident competition between promoters of different kinds of grammars, that too will be welcome. Once we understand the relation of a grammar to the complex system of speech, we are free to do a better job of making the kind of grammatical generalizations that we will always want to make about a language, and need to make in order to exploit what Saussure recognized as linguistic value. Grammar per se is not the output of the complex system of speech, but we cannot make a satisfactory grammar without understanding the complex system of speech.
