Published online by Cambridge University Press: 22 April 2026
The following investigation presents a constructional procedure segmenting an utterance in a way which correlates well with word and morpheme boundaries. The procedure requires a large set of utterances, elicited in a certain manner from an informant (or found in a very large corpus); and it requires that all the utterances be written in the same phonemic representation, determined without reference to morphemes. It then investigates a particular distributional relation among the phonemes in the utterances thus collected; and on the basis of this relation among the phonemes, it indicates particular points of segmentation within one utterance at a time. For example, in the utterance /hiyzkwikər/ He's quicker it will indicate segmentation at the points marked by dots: /hiy.z.kwik.ər/; and it will do so purely by comparing this phonemic sequence with the phonemic sequences of other utterances.
1 I have had the advantage of discussing the subject of this paper with Noam Chomsky. Bernard Bloch and Charles F. Hockett have devoted a great amount of their time to a careful reading of the paper, which now appears considerably modified as a result of their valuable criticism. For data and comments on particular languages I am indebted to Henry M. Hoenigswald (German), Carol Schatz (French), Fred Lukoff (Korean), and Leigh Lisker and Bh. Krishnamurti of Andhra University (Telugu). Yuen Ren Chao, Murray Fowler, and William S. Cornyn have made tests for me in Chinese, Thai, and Burmese respectively. The English data were obtained with the aid of the Committee on the Advancement of Research of the University of Pennsylvania.
2 We are concerned here only with the segmentation at morpheme boundaries. The fact that some of the morphs are alternants of each other (allomorphs), and together comprise a single morpheme, is not relevant here. In /hiy iz leyt/ He is late, there are three morphemic segments, even though the middle one is an alternant of a morpheme unit. We are here seeking a method that will locate cuts after the third and fifth and last phonemes (not counting junctures) in this sequence. Such a method will give us the morphemic segments of an utterance, whether or not some of these are alternants of other segments.
3 In some cases a segment, when morphologically tested, turns out not to constitute a morph. In almost all such cases the lack of correlation between this segmentation and the desired morphemic boundaries affects only a small portion of the utterance, and is automatically corrected by the ancillary procedures discussed in §3 and §4, or else by morphological analysis. For example, in /itdistərbdmiy/ It disturbed me, the segmentation comes out /it.dis.tərbd.miy./: we lack a cut at the morpheme boundary before /d/. But this affects only the segment /tərbd/, which contains two morphs instead of one. When we test the morphological relations of this stretch, we find that it is not a morphemic segment, but that it can be divided into two morphemic segments. Analogously, in /ðətæksiy/ The taxi, we get /ðə.tæk.s.iy./, with two cuts at points that are not morpheme boundaries. But when we test morphologically, we find that /s/ and /iy/ and even their sum /siy/ cannot be morphemic segments in this position, whereas the somewhat larger sum /tæksiy/ can be. Almost all cases where our segmentations do not coincide with morpheme boundaries fall within short stretches of this type; cf. §5.
4 Note that we are asking not the frequency of the various phonemes, but only which ones ever occur in that position. In the example of §1, the test utterance is /hiyzklevər/ He's clever. After the first 5 phonemes of that utterance we find 11 different successors: that is, in all the sentences that begin with /hiyzk/ we can find 11 different phonemes after the /k/. Some of these are more frequent than others: the successor /ə/ is frequent, as in /hiyzkəvərd/ He's covered, /hiyzkəmiŋ/ He's coming; the successor /r/ is less so, as in /hiyzkreyziy/ He's crazy; and the successor /y/ is rare, as in /hiyzkyuwrd/ He's cured. We ask only how many different successors there are to the first 5 phonemes. We next consider the first 6 phonemes of the test sentence, and find that in all the utterances which begin with /hiyzkl/ there are only 7 different phonemes that ever occur after the /l/, again without regard to how frequent they are.
5 This is a special case, though the most common one. More generally: we segment the utterance at those points where the number and variety of successors (see below) is similar to that at utterance end. This formulation is needed, for example, in cases where strong syllabic and other phonemic restrictions are not corrected for (§5). It is also needed if, contrary to Table 2, we wish to apply this procedure to a phonemic writing in which the juncture /+/ is kept as a separate segmental phoneme, e.g. /2hìyz+3kwíkər1+/ for He's quicker.
6 The list of phonemes which occurs in any utterance after a particular utterance-initial sequence may be called the successor variety for that sequence; while the number of phonemes in that list is the successor count for that sequence.
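The two definitions can be made concrete with a minimal sketch. It assumes each utterance is stored as a string in which one character stands for one phoneme (as in /hiyzk/ counted as 5 phonemes). The function names and the six-utterance sample are illustrative only; the sample reuses transcriptions quoted in these notes and is far too small to reproduce the count of 11 reported for /hiyzk/.

```python
def successor_variety(corpus, prefix):
    """Return the set of phonemes that ever follow `prefix` when it is utterance-initial."""
    n = len(prefix)
    return {u[n] for u in corpus if len(u) > n and u.startswith(prefix)}


def successor_count(corpus, prefix):
    """Return the number of distinct successors; frequency is deliberately ignored."""
    return len(successor_variety(corpus, prefix))


# Illustrative sample only: one character per phoneme (an assumption of this sketch).
sample = [
    "hiyzklevər",   # He's clever
    "hiyzkəvərd",   # He's covered
    "hiyzkəmiŋ",    # He's coming
    "hiyzkreyziy",  # He's crazy
    "hiyzkyuwrd",   # He's cured
    "hiyzkwikər",   # He's quicker
]

variety = successor_variety(sample, "hiyzk")
count = successor_count(sample, "hiyzk")
```

In this sample /ə/ follows /hiyzk/ twice but enters the variety once, so the count is 5: the measure records which phonemes occur, not how often.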
7 Though we are here correlating the variety with the phoneme which it follows, we must remember that the variety and the count depend upon the whole utterance-initial sequence. In the example above, the 11-phoneme variety occurs after the phoneme /k/, but it is the successor variety of utterance-initial /hiyzk/, not of /k/ in general. After /k/ in general we can find other phonemes in addition, for example /s/ after /k/ in pixie. The results of §4 are obtained by correlating the successors of an utterance-initial sequence with the last phoneme of that sequence.
8 Here and in similar cases, it is understood that we refer to utterance-initial sequences.
8a The tables illustrating this article are assembled at the end of the paper.
9 The examples above suggest (without assuming morphological knowledge) that junctures and intonation or stress contours have a special relation to morpheme boundaries. Junctures and some contours correlate with morpheme boundaries; other contours correlate with phrase or sentence boundaries without regard to morphemes and words. In contrast, if we dropped some segmental phonemes (e.g. the vowels), we would not obtain a segmentation similar to that obtained with these phonemes included. This applies also to tones and stresses which are not part of long contours, such as the tones in ‘tone languages’. Such tones have distributions like those of other phonemes of the language. We can therefore tell whether, in a given language, tones are parts of a contour (and can be omitted in these tests) by seeing what kind of phonemic distribution they have.
10 More exactly: if we want to segment an utterance which has a junctural pair, like the two sets in Table 2, the junctural allophones must be specified. But if we are segmenting some other utterance, we can usually get a successor count for the segmental phonemes alone that is almost the same as (not better than) the count we get if we specify junctural allophones and contours. And this with less work and confusion on the informant's part.
11 In general, different phonemic representations will give somewhat different successor counts; necessarily, since the different analyses mean that the same allophonic facts are represented by different phoneme sequences. In most cases these differences will not suffice to yield different peaks, i.e. different locations for our tentative segmentations. But sometimes this will happen. In particular, phonemes with great restrictions of occurrence usually yield very low successor counts (e.g. the successor of /o/ is usually only /h, w, y/); and this may raise a neighboring moderate count to a relative peak. A particular phonemic analysis may eliminate certain of these undesired low counts. But some difficulties are unavoidable, for frequently, when we are dealing with very restricted allophones, our phonemic representation will have to have one or another serious restriction, especially since solutions by means of long components cannot be used here (since they involve unpronounceable and nonsuccessive elements).
12 When we counted the successors of n we approximated the morpheme dependence at position (n + 1) upon the preceding phonemic sequence; but we made no use of any morphemic dependence of position n upon the following phonemic sequence. Sometimes (or always, depending on the structure of the language) the dependence upon the preceding sequence suffices to show whether there is a morpheme boundary before position (n + 1). When it does not suffice, we may be able to find out whether there is a morpheme boundary before position (n + 1) by counting the predecessors of position m from the end (where the utterance is of length n + m), thus finding the dependence of position (m + 1) from the end (= position n from the end) upon the phoneme sequence which follows it.
13 Of course, all the inadequacies of the forward operations can also occur in the backward operation if the positions are reversed: if a morpheme in a particular position is in grammatical agreement with something later in the sentence; if a morpheme or alternant has limited distribution in respect to what precedes it; if the last few phonemes of a morpheme are identical with the total phonemes of some other morpheme. As an example of the limited morpheme: in It disturbs me we find, on going backward, only 2 predecessors before /tərbzmiy/: /s/ and /r/ (in It perturbs me); but on going forward we find a peak of 15 successors after /itdis/.
14 E.g. The silo walls were up has a successor peak after /ðəsay/ (The sigh ...) and also a predecessor peak before /lowwohlzwərə́p/ (... low walls were up). In this case we would get a segmentation in the middle of silo.
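The backward operation can be sketched as the mirror image of the forward count. The sketch assumes one character per phoneme; the function name is illustrative, and the full transcriptions of the two utterances (in particular /itpərtərbzmiy/ for It perturbs me) are assumed here, not quoted from the text.

```python
def predecessor_variety(corpus, suffix):
    """Return the set of phonemes that ever precede `suffix` when it is utterance-final."""
    m = len(suffix)
    return {u[-m - 1] for u in corpus if len(u) > m and u.endswith(suffix)}


# Illustrative two-utterance sample; the transcriptions are assumptions of this sketch.
sample = [
    "itdistərbzmiy",  # It disturbs me
    "itpərtərbzmiy",  # It perturbs me
]

predecessors = predecessor_variety(sample, "tərbzmiy")
```

With these two utterances the predecessors of /tərbzmiy/ are /s/ and /r/, the two reported in this note, so no backward peak appears at that point.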
15 For example, in It disturbed me we find 16 successors after the first /i/. Of these successors, which are in the (n + 1)th place, 6 had 29 successors after them in turn, in the (n + 2)th position: it, if, itch, is, ill, in; after these 6 successors a new morpheme could begin in the (n + 2)th place. Of the other 10 successors, 1 had 18 successors (/y/: eat, eager, easy, each, either, aeons, etc.), 1 had 10 successors (/m/: imp, imbibe, immune, immediate, etc.), and 8 had from 1 to 4 successors (/ŋ/: ink, English; /d/: idiot; etc.).
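The tabulation described here, counting for each (n + 1)th phoneme how many phonemes follow it in (n + 2)th place, might be sketched as follows. One character per phoneme is assumed; the function names and the small sample of utterance beginnings are hypothetical and serve only to show the mechanics, not to reproduce the counts of 29, 18, and 10 given above.

```python
def successor_variety(corpus, prefix):
    """Return the set of phonemes that ever follow utterance-initial `prefix`."""
    n = len(prefix)
    return {u[n] for u in corpus if len(u) > n and u.startswith(prefix)}


def second_order_counts(corpus, prefix):
    """Map each (n+1)th successor of `prefix` to its own count of (n+2)th successors."""
    return {
        s: len(successor_variety(corpus, prefix + s))
        for s in successor_variety(corpus, prefix)
    }


# Hypothetical sample, invented for illustration (one character per phoneme).
sample = ["itiz", "itkeym", "itwəz", "ifay", "ifyuw", "iŋk"]

counts = second_order_counts(sample, "i")
```

Here /t/ has three successors in turn, /f/ two, and /ŋ/ one. On a large corpus, successors whose own counts approach the utterance-initial figure are those after which a new morpheme could begin.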
16 To put it differently, the roughly decreasing numbers as we go from peak to peak (when we interpret peaks as word or morpheme boundaries) mean that there are in English about 29 ways of choosing the initial phoneme of a word; then depending upon the choice of the initial there are about 6 to 18 ways of choosing the second phoneme; and depending on the choice of the second phoneme (and somewhat on the first too) there are about 2 to 15 ways of choosing the third phoneme; about 1 to 10 ways of choosing the fourth; and 1 to 3 ways of choosing each following phoneme up to the end of the morpheme.
17 These categories depend on the decreasing numbers between peaks. If we say that the (n + 2)th phonemes for a given (n + 1)th are in category B, we mean that there are about as many (n + 2)th phonemes here as we would expect to find if the (n + 1)th were the first phoneme of an utterance. On the basis of the successor varieties of §4, we can go back and modify these categories, so as to obtain categories which closely characterize the first few (and, backward, the last few) places of an utterance. Thus modified, the calculations of the present section yield segmentations that agree even more closely with morphological boundaries. The adjusted categories are: A for the class J and high-count M of §4; B for the class K and high-count L of §4; C for middle-count L and N, and low-count M; D for low-count L and N. Part of this adjustment can be obtained simply by doubling the value of every successor vowel, thus correcting for some of the difference between the possible number of vowel and consonant successors. The adjusted categories are used in Table 1 above; for purposes of the arithmetic averaging there we set B = 15, C = 5, D = 1. Then if after n, 6 of the (n + 1)th phonemes have successors in category B (i.e. about 15 successors each) and 2 have successors in category C, and 1 has its successor in category D, the total in (n + 2)th place is 6 B + 2 C + 1 D, and the (n + 2)th average per (n + 1)th phoneme is 11.2.
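The averaging at the end of this note can be checked with a short sketch. The category values B = 15, C = 5, D = 1 are those stated above; the function name and the dictionary layout are illustrative, and category A is left out because no numerical value is given for it here.

```python
# Numerical values assigned to the adjusted categories in this note.
CATEGORY_VALUE = {"B": 15, "C": 5, "D": 1}


def average_next_count(tally):
    """Average (n+2)th count per (n+1)th phoneme.

    `tally` maps a category letter to the number of (n+1)th phonemes
    whose successors fall in that category.
    """
    phonemes = sum(tally.values())
    total = sum(CATEGORY_VALUE[cat] * k for cat, k in tally.items())
    return total / phonemes


# 6 B + 2 C + 1 D, spread over 9 phonemes in (n+1)th place.
average = average_next_count({"B": 6, "C": 2, "D": 1})
```

The total is 6 × 15 + 2 × 5 + 1 × 1 = 101 over 9 phonemes, which rounds to the 11.2 given in the note.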
18 If the adjusted categories of fn. 17 are used, there is no drop at all between /hiyz/ and /hiyzə/. The 29 successors to /hiy/ then have the following total of (n + 2)th successors in turn: 6 A + 21 B + 1 C + 1 D = an average of 14.2 per (n + 1)th phoneme in the numerical values of fn. 17 and Table 1 above. The 29 successors to /hiyz/ have the following total for their successors: 27 B + 1 C + 1 D = an average of 14.3. The 22 successors to /hiyzə/ have for their successors: 22 B = an average of 15. The average is virtually the same, A not being counted since its distribution differs from that of B, C, D. Almost all the (n + 2)th places (including the C and D) have the distribution we would expect if the (n + 2)th phoneme were the second of an utterance, that is, if there were a word boundary after the nth phoneme.
19 This does not mean that there is necessarily a morpheme boundary before xy in this sequence.
20 Thus the J 28 above is merely J at a later point of the sentence, or at a grammatically more limited point. However, after consonants of rare occurrence, a vowel may have low L successor even near the beginning of an utterance, as in they /ðey/, where the successors are K 6, L 4, J 28.
21 K and L can occur in first position; K' and L in second or third position; M, N, L can occur in the same medial positions.
22 The rise in L 8 can also be eliminated by correcting for the vowel-consonant distribution, as in §5.
23 Before some consonants, T consists only of the vowels (i.e. they are always the first of a postvocalic cluster). In these cases we may write T' before S, and T before Z; but one could also adopt some different convention. Similarly K' after a few consonants contains only vowels, and is hence identical with N.
24 If a sequence ends in R instead of T, we understand (since R includes T) that it sometimes is a morpheme separate from the following stretch, and sometimes constitutes a single morpheme with the following stretch.
25 Recognizing this /t/ as a separate morpheme is less obvious than recognizing the /d/ (and so for /s/ as compared with /z/) because /t/ has the same predecessors when it ends a morpheme as when it is a suffix. However, the predecessor variety tells us that in a certain utterance position the sequences preceding /t/ are the same as those which can themselves end an utterance (and have the same [n + 2]th predecessors), whereas in other positions we find fewer and more restricted sequences preceding the /t/. The phonetic possibilities may be the same in both positions, but the variety that we find is different. This is precisely the difference between the present method and a study of phonetic structure. In those cases where the predecessors to /t/ are the same as those we would find before a peak or utterance end, i.e. where the sequence ends in T rather than T', we place a tentative cut.
26 I.e. 0 or 1 G followed by 0 or 1 H, repeated up to 10 or 15 times.
27 Note also that segmental morphemes which consist of one phoneme are not easily separated out, since their boundary may be overshadowed by the neighboring boundary. In any case, a plateau of two high numbers (as in 9, 14, 29, 29 for /hiyz/ He's) indicates two segmentations, even though there are not two separate peaks.
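One way to read the plateau rule is as a non-strict peak test: a cut is placed after every position whose count is at least as high as both neighbors. The sketch below is an assumption-laden illustration, not the procedure of the paper. The function name and the optional `min_count` floor are hypothetical, and the end of the list is treated as lower than any count, which is appropriate only when the list stops at a high point, as the four counts for /hiyz/ do.

```python
def plateau_cuts(counts, min_count=0):
    """Return the positions (1-based, in phonemes) after which a cut is placed.

    A position qualifies when its count is >= both neighbors, so each
    member of a plateau of equal high counts yields its own cut.
    `min_count` is a hypothetical floor for suppressing flat low runs.
    """
    lowest = float("-inf")
    cuts = []
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else lowest
        right = counts[i + 1] if i + 1 < len(counts) else lowest
        if c >= left and c >= right and c >= min_count:
            cuts.append(i + 1)
    return cuts


# Successor counts after /h/, /hi/, /hiy/, /hiyz/ as quoted in this note.
cuts = plateau_cuts([9, 14, 29, 29])
```

For these counts the cuts fall after the third and fourth phonemes, i.e. /hiy.z./, whereas a strict peak test would find no single position higher than both neighbors within the plateau.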
28 This can be established by purely distributional investigations of the successors and predecessors of phonemes, as in J. D. O'Connor and J. L. M. Trim, Vowel, consonant, and syllable—a phonological definition, Word 9.103-22 (1953).
29 Ibid. Particularly valuable here would be the phonemic system used by Stanley S. Newman, On the stress system of English, Word 2.171-87 (1946).
30 For the inclusion of such considerations in the bases of phonology, see Charles F. Hockett, A manual of phonology (Indiana University Publications in Anthropology and Linguistics, 1955).