1. Introduction
Prosody, that is, suprasegmental aspects of speech and language such as stress, rhythm and intonation, often remains understudied in language documentation (e.g. Himmelmann & Ladd Reference Himmelmann and Ladd2008). In part, this is due to the complexity of the prosodic structure, often affected by linguistic and paralinguistic factors (e.g. Ladd Reference Ladd2008). Furthermore, initial language descriptions often lack in-depth acoustic analyses that provide quantitative support for (impressionistically) observed phenomena. The study of prosody in underresearched languages makes a particularly valuable contribution towards refining existing theory. That is, typological accounts are often shaped by work on well-studied languages and may therefore be biased. This study focuses on Yali, a Trans-New Guinea language of Indonesia that is documented to some extent, but lacks contemporary analysis of its prosodic structure (Riesberg et al. Reference Riesberg2021). The limited studies available for Yali and other members of its language family suggest that Yali has a form of word prosody, possibly using word-final tones. The current study provides an overview of the available literature on Yali and related languages and then explores the nature of Yali’s word-level prosody using cluster analysis on f0 contours (Kaland Reference Kaland2021). This analysis offers a solid starting point for prosodic analysis due to its data-driven approach and reproducible nature. Although the current paper covers a specific feature of Yali prosody and leaves several aspects for future research, it offers a pioneering in-depth study of an otherwise unknown prosodic system using modern analysis methods and, crucially, analysing naturalistic spontaneously produced speech.
The following (sub)sections discuss three different typological accounts of prosodic prominence and the role of f0 in languages with different prosodic structures (Section 1.1). Thereafter, the Yali language situation and the available studies on the prosody of related Dani languages are reviewed (Section 1.2). Note that the literature review concerns all relevant works known to the authors to date and can therefore be taken as a reference for future work on prosody of languages in this area. The research questions of this study are formulated in Section 1.3.
1.1. F0 in prosodic typology
Ladd (Reference Ladd2008:165) proposes an intonational typology along two dimensions (Table 1), one of which is based on Beckman Reference Beckman1986 (see Lindström & Remijsen Reference Lindström and Remijsen2005 for a similar proposal which provides a space for languages without any pitch accents such as Wolof). The typology distinguishes four categories of languages, based on whether phonetic cues to word stress are used to signal the phonologically strongest syllable in a word or not (stress-accent or non-stress accent in Beckman Reference Beckman1986) and whether they use f0 only at the phrase level (post-lexical) or at both word and phrase level (lexical pitch). Swedish and Serbo-Croatian are both categorised as stress languages that make use of lexical f0 contrasts. A common example from Swedish is stegen (e.g. Bruce Reference Bruce1977); either [ˈstê:gen] ‘steps’ (accent 1) or [ˈstě:gen] ‘ladder’ (accent 2). Note that in both words the first syllable is stressed. Japanese, a language without stress accent, shows f0 contrasts; either an accent on the first syllable [kámì] ‘God’ or on the second syllable [kàmí] ‘hair’/‘paper’ (Beckman Reference Beckman1986). In English, the minimal stress pair permit, either [ˈpǝrmıt] (noun) or [pǝrˈmıt] (verb) differs acoustically mainly in vowel quality, whereas f0 does not play a consistent role in signaling word meaning distinctions. At the phrase level, however, English is claimed to have different pitch accent types that highlight important information (i.e. focus) or demarcate (intermediate) phrases (e.g. Pierrehumbert & Hirschberg Reference Pierrehumbert, Hirschberg, Cohen, Morgan and Pollack1990). Bengali is a language that has no lexical specification of pitch, no or weak phonetic marking of word stress, and only uses f0 at the phrase level (focus marking, boundary tones; Hayes & Lahiri Reference Hayes and Lahiri1991; see also Ladd Reference Ladd2008).
Prosodic typology in Ladd (2008).

In another account by Gordon (2014), word stress and pitch accents are disentangled in a typology of where prosodically prominent syllables are found across languages (Table 2). A central assumption of this account is that word stress patterns reflect phrasal prominence, in particular due to edge repulsion. Edge repulsion is a compensatory mechanism to avoid tonal crowding at phrase edges. That is, if a boundary tone and another (phrasal) tone are both phonologically required at the right edge of a phrase, that is, on the final syllable, some languages realise the phrasal tone on the pre-final syllable instead. For example, in American English a prenuclear H* pitch accent is realised earlier in the prenuclear syllable when immediately followed by the nuclear accented syllable (Silverman & Pierrehumbert 1987). In Papiamentu the LH tone is also realised more leftward (away from the boundary) when occurring phrase-finally (Remijsen & Van Heuven 2005). The typology predicts that in languages with non-peripheral (e.g. penultimate or antepenultimate) stress there must be edge repulsion at the phrase level (Gordon 2014). English and Egyptian Arabic are examples of such languages, in which word and phrase level are symmetrical in the sense that edge repulsion applies to both. In languages without edge repulsion at the phrase level, by contrast, stress can be found on either a peripheral syllable (ultimate, e.g. Hebrew, symmetrical) or a non-peripheral one ((ante-)penultimate, e.g. Chickasaw, asymmetrical). A fourth category consists of asymmetrical languages that have word stress but do not make use of pitch accents that align with stressed syllables and therefore have no reason to apply edge repulsion (e.g. Wolof).
Although languages might hypothetically have edge repulsion only at the word level and not at the phrase level, such languages have so far not been attested. The unattested categories support the theory that if edge repulsion is found at the word level it must reflect the location of prominence at the phrase level, ‘since there are less likely to be any pitch properties bounded by the word that would repel stress from an edge’ (Gordon 2014:20).
Prosodic typology in Gordon 2014; PA = pitch accent.

The most recent overview of prosodic prominence across languages by Jun (2014) (following Jun 2005) proposes a typological account according to three major aspects: prominence type, word prosody, and macro-rhythm (Table 3). Prominence type refers to the way a language marks phrasal prominence by f0. This can be done by means of pitch accents on the lexically or post-lexically specified head of the phrase (head languages). Languages belong to this prominence type category when they only mark the head as prominent (not the edge). For example, English and Arabic (stress) or Swedish (stress and tone) are head languages. Other languages mark both the head and the edge of a prosodic domain, that is, word or phrase. These head/edge languages can have a lexically specified head (e.g. Tokyo Japanese) or a post-lexical one (e.g. French). The edges in this type of language are often marked at the level of the accentual phrase (AP; Pierrehumbert & Beckman 1988), a level corresponding to one or more words, approximating the phonological phrase level ($\phi$ in Nespor & Vogel 2007). Edge languages are claimed not to have any (post-)lexically specified head that is marked with prosodic prominence, but to only mark the edges of prosodic phrases (on a number of different levels). These languages are reported to not exhibit any word prosody (e.g. Seoul Korean).
Prosodic typology in Jun (2014: Table 17.4). Shaded cells indicate language categories that are not expected to be attested.

As for the head and head/edge languages, different types of word prosody are distinguished. A language could have word stress (e.g. English and Bengali), some tonal marking at the lexical level (tone, e.g. Somali, or lexical pitch accent, e.g. Japanese) or both of these (e.g. Swedish and Serbo-Croatian). The typology in Jun 2014 does not exclude head languages for which the head is only post-lexically specified (i.e. no word prosody); however, these head languages are unattested. Post-lexical marking of prosodic heads is attested in two head/edge languages, albeit with a strong tendency to occur at the right word edge (French and Basque). Thus, the typology strongly suggests that prominence marking in non-edge phrase positions (i.e. phrase medially) is mainly driven by word prosody, similar to Gordon’s typology (Gordon 2014). In this respect there is an important trade-off between head or edge marking as a means to make the word prominent: ‘… edge tones will help to mark the prominence of [a] word when the head marking (stress) is weak, which is common in head/edge-prominence languages’ (Jun 2014:537).
The concept of macro-rhythm was introduced to capture similarities between languages that differ with respect to prominence type and/or word prosody (Jun Reference Jun2014). Macro-rhythm concerns the regularity with which high and low tones alternate in a phrasal intonation contour (as opposed to micro-rhythm, which concerns the smaller units such as syllables or feet). Three degrees of macro-rhythm are distinguished; strong, medium, and weak. This degree is determined by (i) the number of possible phrase-medial pitch accents with larger pitch accent inventories leading to weaker degrees of macro-rhythm at the phrase level, (ii) the type of the most common phrase-medial pitch accent with rising or falling accents being more rhythmic than level ones, and (iii) the frequency or domain of pitch accents, with smaller domains (e.g. the word) being often marked, leading to a higher degree of macro-rhythm. As a consequence, the less marking of heads the stronger the degree of macro-rhythm, to the extent that all edge languages are classified with a strong degree of macro-rhythm (Table 3). It is important to note that the concept of macro-rhythm constitutes an overall tonal regularity that surfaces within a phrase. That is, phonological rules governing the association of pitch accents to stressed syllables are not immediately relevant for determining the degree of macro-rhythm, because a subunit of tonal rhythm (e.g. LH in L-H-L-H-L-H) ‘can cover materials larger or smaller than a word … and does not necessarily form a prosodic unit’ (Jun Reference Jun2014:524).
The typological accounts of Ladd (2008), Gordon (2014), and Jun (2014) have two important aspects in common. First, they all allow for a special status of f0 compared to other acoustic cues such as duration or intensity. This special status is reflected in the central role of specific tonal phenomena such as lexical tone contrasts (Ladd 2008), edge repulsion (Gordon 2014), or tonal rhythm (Jun 2014). Second, f0 plays the most important role at the phrase level and a secondary role at the word level. That is, languages are assumed to show at least post-lexical use of f0 (Ladd 2008) and to exhibit some degree of tonal macro-rhythm, irrespective of whether a language has word prosody (Jun 2014). Importantly, edge repulsion at the phrase level may determine the location of prominences at the word level, but not vice versa (Gordon 2014). The latter point requires careful and systematic analyses to reveal the potential interplay of word-level and phrase-level prosody, in particular in languages that make use of tonal targets at both levels. Yali may be such a language and is further discussed in the following section.
1.2. Language situation
Yali is spoken in the area east of Wamena, the capital of the Jayawijaya district in the Papuan Highlands (Papua Pergunungan, Indonesia). It is a member of the Dani family, spoken in the Baliem Valley and surrounding highlands. The language is ‘stable’ (Eberhard et al. Reference Eberhard, Simons and Fennig2020) and between 4 and 5 on the EGIDS scale (Lewis & Simons Reference Lewis and Simons2010, Anderbeck Reference Anderbeck, Arka, Malini and Puspani2015). Classification of these languages and of Papuan languages in general is subject to ongoing revision given that they are often underresearched (e.g. Palmer Reference Palmer2018, Usher & Suter Reference Usher and Suter2021). The name Yali (iso: yac) refers to a Trans-New Guinea dialect cluster with a total number of speakers estimated at around 30,000, according to several sources (Silzer & Heikkinen-Clouse Reference Silzer and Heikkinen-Clouse1991, Wilson Reference Wilson1999, Lewis et al. Reference Lewis, Simons and Fennig2015, Riesberg et al. Reference Riesberg2021). However, these counts all lead to the same original sources (Voorhoeve Reference Voorhoeve1975 and Grimes Reference Grimes1984), the validity of which to date could be questioned. It has been suggested that there are three Yalic languages (Hammarström et al. Reference Hammarström, Forkel and Haspelmath2021): Angguruk, Ninia, and Pass Valley Yali. However, it remains unclear how many dialects can be subsumed under Pass Valley Yali. Hammarström et al. (Reference Hammarström, Forkel and Haspelmath2021) report three dialects: Apahapsili, Landikma, and Pass Valley Yali. Fahner (Reference Fahner1979) and Riesberg (Reference Riesberg2021) also list Abenaho as an additional dialect.
The available published linguistic work on Yali is limited. Most recently, Angguruk Yali was covered in a dictionary (Riesberg et al. Reference Riesberg2021), which includes a grammar sketch that mainly focuses on the verbal system. Abenaho Yali morphology is studied in Fahner (Reference Fahner1979). A small number of unpublished master’s theses have focused on specific aspects of Yali; the verbal system (Sawaki Reference Sawaki1997), affixation (Walianggen Reference Walianggen2012a), verb agreement (Walianggen Reference Walianggen2012b), causative constructions (Lekeneny Reference Lekeneny2014), and reduplication (Wandik Reference Wandik2014). As for phonology, existing work does not go beyond sound inventories. Angguruk Yali is reported to have the following vowels: /i, i, u, u, e, o, a/ and consonants: /p, t, k, ϕ, s, χ, m, n, ŋ, l, w, j/ (Riesberg Reference Riesberg2021). Thus, prosodic aspects of Yali remain to be studied. Note, however, that more elaborate phonological descriptions do exist for other Dani languages found in areas bordering on the ones where Yali is spoken. It is relevant to also consider these, as neighboring language varieties have been reported to be mutually intelligible in this area (e.g. Heeschen Reference Heeschen1998). This will be done in the following discussion, which is limited to languages in the Dani family as classified in Glottolog (Hammarström et al. Reference Hammarström, Forkel and Haspelmath2021).
1.2.1. Lower Grand Valley Dani
The phonology of Lower Grand Valley Dani (iso: dni), spoken south of Wamena, has been covered extensively in Bromley Reference Bromley1961. The analysis is couched in the Pikean version of the American structuralist model of intonation (Pike Reference Pike1945), which distinguishes (primary) stress from four contrastive pitch levels associated with the boundaries of intonationally defined units on a number of different hierarchical levels. This model has been criticized in a number of regards (cp., for example, Lieberman Reference Lieberman1965, Crompton Reference Crompton1980, Ladd Reference Ladd2008:62–65).
One problem pertains to the fact that analyses presented in this framework are difficult to interpret without access to a recording. Bromley’s analysis of Dani—one of the most detailed analyses of the prosodic system of a Papuan language published in the twentieth century—is, unfortunately, no exception. Here we only attempt to summarize the points that are most salient to us and provide a basis for generating testable hypotheses. In Bromley’s analysis, the phonological word in Dani is defined as the unit carrying a single primary stress. This word stress occurs on the final syllable of the stem. The stem may be followed by one or more clitics, which are unstressed. The acoustic cues for word stress are ‘upgliding pitch and/or increased loudness’ (Bromley Reference Bromley1961:49). The details of the pitch glide ‘are matters of intonation’ (loc. cit.) and appear to be influenced primarily by position in the phrase, but also by word length.
The next level in the hierarchy of prosodic units is the ‘intonation contour group’, which consists of one or more words (most examples provided by Bromley only encompass one word). It is at this level that contrastive pitch levels, which are combined to describe distinctive tunes, are assigned (Bromley Reference Bromley1961:53). The intonation contour group is further defined by including one syllable carrying primary contour stress, which involves lengthening of the syllable nucleus in addition to being associated with a contrastive pitch level. The stressed syllable is generally the final syllable of the stem of its last word.
The third level in Bromley’s prosodic hierarchy is called the ‘phonological sentence’, which consists of one or more intonation contour groups. This level is essentially defined by the ‘final relaxation of the articulatory muscles’ (Bromley 1961:58). Otherwise, phonological sentences involve relatively minor modifications of their final intonation contour group.
A major shortcoming of Bromley’s analysis is that it does not propose a clear inventory of possible tunes. That is, in principle it appears to be possible that the four pitch levels all can occur on the contour group-final stressed syllable and that they may be combined rather freely with each other (i.e. level 1 may follow level 3 and vice versa, level 4 level 2, etc.). Another problem of interpretation pertains to the question of how Bromley’s prosodic levels relate to the levels of the prosodic hierarchy in Nespor & Vogel Reference Nespor and Vogel2007. The correspondence is unproblematic in the case of the phonological word as this unit appears to be the same in both hierarchies. The ‘intonation contour group’ appears to cover both the intonational phrase (IP) and the intermediate (or phonological) phrase (ip), as it involves boundary tone phenomena and allows for pauses to occur. For current purposes, we simply equate it with the IP, as the distinction between these two levels in the Nespor & Vogel hierarchy is not clear in all regards (see Ladd Reference Ladd2008:274–84, 297–308 for pertinent discussion). The ‘phonological sentence’, finally, appears to pertain to a level higher than the IP for which there is very little terminological consensus in current work, ‘utterance’ perhaps being the most widely used term. Despite its shortcomings, Bromley’s analysis constitutes a number of useful observations and testable claims that warrant further investigation. Most important for the current study is the analysis of Lower Grand Valley Dani as having word-level stress that is signalled by f0.
1.2.2. Wano
Spoken north-west of Wamena, Wano (iso: wno) is another Dani language, the phonology of which has been studied to some extent (Burung 2007, 2017). These works claim regular ultimate word stress, also for affixed words, but not for cliticized words, as clitics remain unstressed (see 1, taken from Burung 2017:64). No information is provided about how stress is realized.

For phrase intonation, Burung (Reference Burung2017) distinguishes between falling, flat (most common), and rising. Falling contours are found at the end of an utterance, in questions without a compulsory answer, and in tail-head linked clauses, often accompanied by the clitic {=o} (pause). Flat contours accompany clauses that are chained. Rising contours are found in questions requiring a compulsory answer, for example with the clitic {=a} (question), and in simple (i.e. short) statements where the rise coincides with the stressed syllable.
1.2.3. Western Dani
An extensive grammar of Western Dani (iso: dnw) covers many aspects of this language (Barclay 2008), but does not cover prosody. A phonological description of Western Dani, although minimal, does make reference to word prosody (Purba et al. 1993:36). A tentative analysis of ultimate word stress in three-syllable words (the most common word length reported) is given (see examples in 2) and it is concluded on the basis of a lack of minimal pairs that stress is not lexical.

1.2.4. Nggem
Nggem (iso: nbq) is spoken north-west of the Pass Valley Yali dialects. Its morphology and syntax are described in Etherington 2002. This language is claimed to have regular penultimate word stress, and although there are no minimal stress pairs, some exceptions in which stress is not penultimate are reported (Etherington 2002); see 3. No descriptions of f0 contours are reported.

1.2.5. Yali
To date, there is no literature on the prosody of Yali (iso: yac), although data collections are publicly available for the Angguruk and Apahapsili dialects (Riesberg et al. Reference Riesberg, Walianggen and Zöllner2012). The Angguruk dictionary (Riesberg et al. Reference Riesberg2021) as well as the other sources on Dani languages make no reference to lexical items that can be distinguished solely by f0, which for now we take as a reason to exclude the option of classifying Yali as a language with lexically contrastive tone (see further discussion of lexical pitch accent languages in the summary sub-section below). From the available data (descriptions in Section 2) some examples are given here to illustrate the f0 contours found in words and phrases.
At the word level, final syllables are regularly found with a high or rising f0 movement, whereas non-final syllables show lower-level movements; see the syllable ho in 4, and Figure 1. Note the transcription with initial ‘h’ (Riesberg 2021), which corresponds to a voiceless [h] or voiced glottal fricative [ɦ] as an allophone of /χ/. The high movement is still found on the final syllable of the word if that word contains a clitic (final syllable sat and ma in 5 and Figure 2). Note that the classification of affixes vs. clitics given here is based on morphosyntactic behaviour. Both suffixes and enclitics are treated as word-final syllables for the purposes of placing the word-final rise: there are usually rises on suffixed or cliticized syllables and there never are rises on stem-final syllables when followed by a suffix or a clitic. In this regard, suffixed/enclitic syllables differ clearly from free/non-clitic grammatical function words, where the final syllable of the preceding lexical item can carry a rising tone (e.g. fam ‘on/in’). This differs from the account given for Wano, but it is quite possible that the author of the Wano description uses the terminology differently. If the syllable is both word-final and phrase-final, the f0 movement is not necessarily high or rising. Consider ma in 6 and Figure 3; the final movement does not clearly stand out as higher than the non-final syllable ap, but stays level. In 7 and Figure 4 the word- and phrase-final syllable lo has a falling f0 movement, whereas in non-final phrase position the same word-final syllable is high.
The syllable ho in word-final (left) and word-non-final (right) context (gloss in 4).

The syllable sat in word-final (left) and word-non-final (right) context (gloss in 5).

The syllable ma in word-/phrase-final (left) and word-/phrase-non-final (right) context (gloss in 6).

The syllable lo in phrase-final (left) and phrase-non-final (right) context (both cases word-final; gloss in 7).

It seems, therefore, that f0 might play a role at the word-level in Yali prosody such that the final syllables of words have a high tonal target. This target is reached most clearly in non-final phrase positions by a high or rising f0 movement. In phrase-final positions, the f0 movement appears to be affected primarily by the demands of the (pragmatic) discourse context, clause type, and speech act. Falling movements are found in elicited statements (see 5 and 7a), rising movements accompany elicited questions (7b) and level or rising movements are found in spontaneous speech to indicate continuation (6).








1.2.6. Dani languages: summary
Although the literature on Dani languages is far from comprehensive in its analysis of prosody, regular word stress is reported for all languages for which a (minimal) phonological description has been published: ultimate stress for dni, wno, and dnw, and penultimate stress for nbq. Furthermore, except for Nggem, all these sources mention at least one interaction between word-level and phrase-level prosody. F0 is a particularly relevant cue for this interaction (Bromley 1961), although it remains unclear what exactly the status of f0 is at either level. As for Yali, the few examples discussed above also hint at a word-final tonal target, although lexical contrasts based on f0 differences are unattested. Given that the word-final syllable appears to be the locus of the relevant f0 movement, it remains to be seen whether it could be classified as a lexical pitch accent (e.g. Beckman 1986, Hyman 2006, Ladd 2008:164–68). Note that lexical pitch accent languages are claimed to differ from tone languages in that the former apply a single word-level tone to one syllable in the word and are more prone to phrase-level intonation influences, whereas the latter apply tones on multiple syllables in a more stable way (e.g. Ota et al. 2018). It is worth emphasizing that the notion of lexical pitch accent is controversial. It has been argued that lexical pitch accents represent just one way tonal distinctions may be organized in human languages (Hyman 2006). That is, so-called lexical pitch accent languages are just one type of (lexical) tone languages, other major types being syllable tone languages and word tone languages.
On the other hand, ‘accent’ has been argued to be an abstract prosodic category that differs fundamentally from (lexical) tone and allows for both tonal and non-tonal marking (Beckman 1986, Beckman & Venditti 2010, van der Hulst 2012). If Yali indeed has tones on word-final syllables, it remains to be investigated at which phonological level exactly these apply and how the interaction with phrase intonation can be characterised. Even for Swedish there are competing analyses, namely binary (Bruce 1977) and privative (Riad 2003), in which the f0 movement is either present or absent rather than chosen from a set of possible alternatives. Also recall the discussion of Japanese (Section 1.1) in this respect. It would take more than a single study to tackle all the issues just discussed. The exploratory nature and goals of the current study are further described in the next section.
1.3. Research questions
The aims of this study can therefore be summed up by the following two research questions:
I. Do specific f0 movements align with word-final syllables (compared to non-final ones)?
II. To what extent do word-level f0 movements interact with phrase-level ones?
In the following, the research questions of the current study are further explained. Note, however, that the state of the research on Dani languages, in particular on Yali prosody, leaves many questions unanswered. Previous work on Dani languages has described a word-final prominence that is most likely cued by pitch. The central question in this study is therefore whether we can find acoustic evidence for word-level tonal targets. At the same time it is expected for f0 to play a role at the phrase level, regardless of potential word-level tones. Thus, the second research question concerns the extent to which word-level f0 movements interact with phrase-level tones. The minimal assumption regarding phrase prosody is that phrase-final syllables show a tonal movement that is different from other phrase positions. Which direction that phrase-final f0 movement has (i.e. falling, level, or rising) requires detailed study of the context in which the phrase was uttered, taking into account pragmatic, information-structural and speaker-individual factors. Although it is plausible that, at the phrase level, Yali prosody shows similarities with Lower Grand Valley Dani (Bromley Reference Bromley1961), this cannot be taken for granted and will be investigated in the current study.
In the next sections we report an exploratory investigation of the f0 movements at the word and phrase level, as found in spontaneously produced Yali speech.
2. Methodology
This section describes the collection, processing, acoustic measurements, and analysis of the Yali data.
2.1. Data collection
The Yali speech data investigated in this paper consist mainly of short stories and interactive dialogue tasks. They were collected in Apahapsili between 2014 and 2019 by Sonja Riesberg, funded by the Volkswagen Foundation (DoBeS project ‘Documentation summits in the central mountains of Papua’) and the Australian Research Council. Among the collected materials are retellings of Pear stories (Chafe 1980), QUIS tasks (Skopeteas et al. 2006), Space games (Senft 2007), retellings of the Frog story (Mayer 1969), and Tangram tasks (as used in e.g. Savino et al. 2019). The recordings lasted from one and a half to seventeen minutes ($\mu$ = 4.6 minutes) and were made using different devices (Sony HDR-PJ260, Sony HDR-CX250, Zoom Q8, or Zoom H4) in the houses of some of the speakers. A subset of the recordings used in this study were archived (see documentation information in Riesberg et al. 2012). In total, thirty-seven recordings from thirty different native Yali speakers were used. Their personal data could not always be accurately documented. For twenty-seven speakers, age and gender were known: sixteen males and eleven females, age range: thirteen to fifty-three ($\mu$ = 28.7 years). Informed consent was given by all participants.
2.2. Data processing
The recordings were transcribed into (intonation) phrases by two language experts with the help of native speakers. The segmentation was carried out at the level of intonation units (Chafe Reference Chafe1994), corresponding to intonation phrases in the phonological hierarchy (e.g. Nespor & Vogel Reference Nespor and Vogel2007). The intonation units were defined on the basis of auditory boundary cues such as pitch modulations and pauses. In a previous crosslinguistic comparative study using Yali among other languages, these boundaries were found to be perceived in highly similar ways by native and non-native speakers (Himmelmann et al. Reference Himmelmann, Sandler, Strunk and Unterladstetter2018). Based on the transcriptions the phrases were automatically split into words (based on spaces) and syllables (four types of syllables were distinguished: V, VC, CV, and CVC). The positions of the syllables were then indicated by two numbers for each level. At the phrase level, the number of the word in the phrase and the total number of words in that phrase were counted. At the word level, the number of the syllable in the word and the total number of syllables in that word were indicated. In this way, final versus non-final syllables were distinguished on each level such that all syllables not occurring in a final position were marked as non-final. No further distinction was made as to which non-final position in the phrase or word these syllables occupied.
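The position coding described above can be illustrated with a minimal Python sketch. It is not the script used in this study: the names `SyllableToken` and `index_phrase` are hypothetical, and the sketch assumes that syllable boundaries are already marked in the transcription with a separator (here a dot), since the syllabification procedure itself is not specified in the text. The example string is an invented placeholder, not Yali data.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class SyllableToken:
    syllable: str
    word_index: int       # position of the word in the phrase (1-based)
    words_in_phrase: int  # total number of words in the phrase
    syll_index: int       # position of the syllable in the word (1-based)
    sylls_in_word: int    # total number of syllables in the word

    @property
    def word_final(self) -> bool:
        return self.syll_index == self.sylls_in_word

    @property
    def phrase_final(self) -> bool:
        # Only the last syllable of the last word counts as phrase-final.
        return self.word_final and self.word_index == self.words_in_phrase


def index_phrase(phrase: str, syll_sep: str = ".") -> list[SyllableToken]:
    """Split a phrase into words (on spaces) and pre-marked syllables.

    ASSUMPTION: syllables are separated by `syll_sep` in the input.
    """
    words = phrase.split()
    tokens = []
    for w_idx, word in enumerate(words, start=1):
        sylls = [s for s in word.split(syll_sep) if s]
        for s_idx, syll in enumerate(sylls, start=1):
            tokens.append(
                SyllableToken(syll, w_idx, len(words), s_idx, len(sylls))
            )
    return tokens


# Invented placeholder phrase of two disyllabic words.
for tok in index_phrase("ka.ma ta.lo"):
    print(tok.syllable, tok.word_final, tok.phrase_final)
```

Storing the two numbers per level, rather than only a final/non-final flag, keeps the coding reversible: the binary distinction used in the analysis is derived from the indices and could be refined later if specific non-final positions turn out to matter.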
Given the use of Indonesian loanwords in most of the recordings, all words were looked up in a Toolbox (SIL International 2021) lexicon compiled as part of the language documentation (Section 2.1) and selected if they matched an entry. Matches also included forms with bound morphemes (affixed entries and words including clitics), as preliminary observations suggest that these do not necessarily affect final f0 movements (see example 5, and discussion in Section 4). The selection was further limited to content words (verbs and nouns) as these are crosslinguistically the most informative words and tend to attract prosodic phenomena (i.e. tone, word stress, and/or (post-lexical) pitch accents). Selected syllables were further limited to the ones with a CV structure, which represents the most common syllable structure in Yali. In order to disentangle the two prosodic levels, final and non-final syllables at either the phrase ($ \phi $) or word ($ w $) level were marked. This resulted in three finality categories ($ \phi w $): phrase-final, word-final (ff); phrase-non-final, word-final (nf); and phrase-non-final, word-non-final (nn) (see Table 4). Note that phrase-final syllables are inherently word-final, such that the category phrase-final, word-non-final (fn) does not exist. The final selection consisted of the syllables that occurred in all three level/finality categories. For example, the syllable [ma] was only selected if found in all three categories. This resulted in a set of 381 syllables (127 triplets, twenty-seven unique syllable types), for which the boundaries were manually indicated in Praat textgrids. The syllables were taken from words with a length of two up to five syllables ($ \mu $ = 2.47), which occurred in phrases with a length of one up to fifteen words ($ \mu $ = 4.47).
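The position counts and the triplet selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the tuple layout, and the toy tokens are hypothetical and are not taken from the authors' scripts.

```python
from collections import defaultdict

def finality_category(word_idx, n_words, syll_idx, n_sylls):
    """Return 'ff', 'nf' or 'nn' from 1-based position counts.

    word_idx / n_words : position of the word in the phrase, phrase length
    syll_idx / n_sylls : position of the syllable in the word, word length
    """
    word_final = syll_idx == n_sylls
    # A syllable is phrase-final only if it is the last syllable of the last word.
    phrase_final = word_final and word_idx == n_words
    if phrase_final:
        return "ff"
    return "nf" if word_final else "nn"

def syllables_in_all_categories(tokens):
    """Keep the syllable types that are attested in all three categories."""
    seen = defaultdict(set)
    for syllable, positions in tokens:
        seen[syllable].add(finality_category(*positions))
    return sorted(s for s, cats in seen.items() if cats == {"ff", "nf", "nn"})

# Hypothetical toy tokens: (syllable, (word_idx, n_words, syll_idx, n_sylls))
tokens = [
    ("ma", (3, 3, 2, 2)),  # last syllable of the last word
    ("ma", (1, 3, 2, 2)),  # last syllable of a non-final word
    ("ma", (1, 3, 1, 2)),  # non-final syllable of a word
    ("li", (2, 2, 3, 3)),  # attested phrase-finally only
]
selected = syllables_in_all_categories(tokens)
print(selected)
```

In this toy set only [ma] survives the selection, because [li] is not attested in the nf and nn categories.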
Three categories of syllables (marked with $ \surd $) according to prosodic level (phrase $ \phi $ / word $ w $) and finality (final/non-final).

2.3. Acoustic measurements
Five syllables appeared to be produced either too quietly, too briefly, or with background noise such that they needed to be excluded from acoustic analysis. The remaining syllables ($ N $ = 376, from twenty-nine different speakers; see the appendix for finality type per speaker) were acoustically analysed for f0 and duration. That is, f0 was measured in a time-series manner, specifically, at twenty evenly distributed time points in the syllable. The f0 measurements were taken in Hertz (Hz) in Praat (Boersma & Weenink Reference Boersma and Weenink2019) with the autocorrelation method and an f0 window of 50 to 500 Hz, with otherwise default pitch settings. The measurements were also checked and corrected for octave jumps and f0 tracking errors. Voiceless parts of the contour were interpolated linearly to obtain an uninterrupted series of f0 values. Interpolation is also perceptually motivated, that is, listeners perceive intonational prominence largely in uninterrupted ways (Mixdorff & Niebuhr Reference Mixdorff and Niebuhr2013). Although f0 jumps were mainly absent in the data, it is still possible that f0 perturbations due to the effect of consonants affected the extracted contours. These perturbations could affect the cluster analysis if they concern a significant part of the f0 contour (e.g. onset level, register). This issue is further discussed in the final section.
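The two processing steps mentioned above, linear interpolation of voiceless stretches and sampling at twenty evenly distributed points, can be illustrated with the following minimal sketch. It does not reproduce the Praat procedure used in the study: the function names and the toy f0 track are hypothetical, and filling gaps at the contour edges with the nearest voiced value is an assumption that the text does not specify.

```python
def interpolate_gaps(values):
    """Linearly fill unvoiced frames (None) between voiced neighbours.

    Assumption: gaps at the edges copy the nearest voiced value.
    """
    voiced = [i for i, v in enumerate(values) if v is not None]
    if not voiced:
        raise ValueError("contour contains no voiced frames")
    out = list(values)
    for i in range(len(out)):
        if out[i] is not None:
            continue
        left = max((j for j in voiced if j < i), default=None)
        right = min((j for j in voiced if j > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            out[i] = values[left] + weight * (values[right] - values[left])
    return out

def resample(values, n_points=20):
    """Sample a gap-free contour at n_points evenly spaced positions."""
    last = len(values) - 1
    sampled = []
    for k in range(n_points):
        pos = k * last / (n_points - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        sampled.append(values[lo] + (pos - lo) * (values[hi] - values[lo]))
    return sampled

# Hypothetical f0 track in Hz with a voiceless stretch in the middle.
track = [200.0, None, None, 170.0, 160.0]
contour = resample(interpolate_gaps(track), n_points=20)
```

Sampling every syllable at the same number of points is what makes contours of different durations directly comparable in the later cluster analysis.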
Time-series measures furthermore allow us to compare contours of different lengths in a more equal way and provide a more precise measurement of the f0 contour compared to static measures such as f0 mean or f0 range. F0 measurements were speaker corrected, to account for effects of gender on the f0 level. This was done using standardization, as proposed for word-level f0 movements in Rose Reference Rose1987. A crude categorical distinction was made for each contour between a rise and a fall. These categories were obtained by subtracting the standardised f0 level of the first measurement point from the standardised f0 level of the last measurement point. This resulted in an f0 value indicating the direction and steepness of the f0 contour (see Section 3 for the analysis of f0 shape). In this way, negative values are obtained for falls and positive values are obtained for rises. The further the f0 direction value is from zero, the steeper the f0 movement is. Note that there were no cases in which multiple directions were found in one syllable (i.e. a rise+fall or a fall+rise). Duration was measured as the time (in ms) between the start and the end of the syllable. The measurements were obtained in Praat (Boersma & Weenink Reference Boersma and Weenink2019).
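A minimal sketch of the speaker correction and the direction measure follows. It assumes that standardization means a z-score against each speaker's own mean and standard deviation, computed over all of that speaker's f0 values; the use of the sample standard deviation, the function names, and the toy values are assumptions. The direction is taken as last point minus first point, which yields the sign convention stated in the text (falls negative, rises positive).

```python
from statistics import mean, stdev

def standardise_by_speaker(contours_by_speaker):
    """z-score every f0 value against its speaker's own mean and SD."""
    standardised = {}
    for speaker, contours in contours_by_speaker.items():
        pooled = [value for contour in contours for value in contour]
        m, sd = mean(pooled), stdev(pooled)  # sample SD is an assumption
        standardised[speaker] = [[(v - m) / sd for v in c] for c in contours]
    return standardised

def f0_direction(contour):
    """Last minus first point: positive for rises, negative for falls."""
    return contour[-1] - contour[0]

# Hypothetical contours in Hz for a lower-pitched and a higher-pitched speaker.
raw = {
    "A": [[120.0, 115.0, 110.0], [105.0, 112.0, 125.0]],
    "B": [[240.0, 230.0, 215.0], [210.0, 225.0, 245.0]],
}
z = standardise_by_speaker(raw)
directions = {spk: [f0_direction(c) for c in cs] for spk, cs in z.items()}
```

After standardization the two speakers' contours are expressed on the same scale, so the direction values can be compared across speakers despite the difference in f0 level.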
2.4. Statistical analysis
Statistical analyses were carried out using R (R Core Team 2020, R Studio Team 2020) and the lme4 package (Bates et al. Reference Bates, Mächler, Bolker and Walker2015). Two linear mixed models (LMM) fit by maximum likelihood (using Satterthwaite approximations to degrees of freedom to calculate p-values) were generated for the acoustic measures (one for f0 direction and one for duration). In both LMMs, ‘Finality category’ (three dummy-coded levels: ff, nf, nn) was included as fixed factor and ‘Speaker’ was included as random intercept (the maximal converging model). Note that the incorporation of speaker as random intercept does not entirely account for the violation of the independence assumption (Winter & Grice Reference Winter and Grice2021) as multiple observations (words) were taken for each speaker (see also Table A1 in the appendix). Both models were then submitted to a post-hoc analysis in order to compare the finality categories pairwise using the Tukey honestly significant difference (HSD) test.
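The model structure described in this section can be written out as follows. This is a sketch rather than the authors' own specification: the symbols are introduced here for illustration, and the choice of ff as the reference level of the dummy coding is an assumption, since the text does not state which level served as reference.

```latex
y_{ij} = \beta_0 + \beta_1\,\mathrm{nf}_{ij} + \beta_2\,\mathrm{nn}_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $ y_{ij} $ is the response (f0 direction or duration) of observation $ i $ from speaker $ j $, $ \mathrm{nf}_{ij} $ and $ \mathrm{nn}_{ij} $ are 0/1 indicators of the finality category, $ \beta_0 $ is the mean of the assumed reference category, $ u_j $ is the random intercept for speaker $ j $, and $ \varepsilon_{ij} $ is the residual error.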
2.5. Results
The results of the acoustic measurements (Table 5, Table 6, and Figure 5) show that syllables in phrase-final and word-final position (ff) are on average produced with a falling f0 movement and significantly lengthened (approximately 2.7 times) compared to syllables in the other finality categories. Phrase-non-final syllables in word-final position (nf) show on average a rising f0 movement, whereas phrase-non-final syllables in word-non-final positions (nn) show a slight falling movement. The f0 direction measure was significantly different between each pair of finality categories. The duration of the phrase-final and word-final (ff) syllables was significantly longer compared to each of the other finality categories. The phrase-non-final syllables (nf and nn) did not differ significantly in duration.
Results of the acoustic measurements of f0 direction (standardized) and duration (ms) in each finality category ($ \phi w $), with standard deviations between brackets.

Pairwise comparisons for each finality category ($ \phi w $) on the LMMs for f0 direction (standardized) and duration (ms).

Mean f0 contour (speaker standardised) and standard deviation in each finality category (ff: $ \phi $-final $ w $-final, nf: $ \phi $-non-final $ w $-final, nn: $ \phi $-non-final $ w $-non-final) as measured by the time-series f0 analysis (20 measurement points).

3. Contour clustering
The results discussed above provide a first coarse-grained analysis of the acoustic realisation of the syllables in each finality category. To obtain a more detailed overview of the variation in the f0 movements, the time-series measurements were analysed using cluster analysis, following the method proposed in Kaland Reference Kaland2021. This approach consists of two components: measurement of f0 and clustering. The measurements described in Section 2.3 were taken as the input for the cluster analysis. Cluster analysis essentially divides the f0 contours into maximally different groups based on their numerical (shape) differences. The division into clusters is based on distances computed for each possible pair of observations (distance matrix). The distance metric used in the current analysis is Euclidean distance, a generic method used in previous literature (e.g. Calhoun & Schweitzer Reference Calhoun, Schweitzer, Elordieta and Prieto2012, Cole & Steffman Reference Cole and Steffman2021). Clusters were merged in a bottom-up hierarchical fashion, such that at the initial state of the analysis each observation formed a cluster on its own (maximum number of clusters). In subsequent steps the clusters with the smallest distance were merged, keeping the distances between the clusters maximal (complete linkage). The merging stops when all observations are in the same cluster, which is equivalent to a state in which nothing is clustered. The merging process follows a tree structure (dendrogram) in which the number of clusters (X) is determined by the height in the tree, that is, the height at which X branches are found. Finding the number of clusters that best describes the data is crucial; see also Kaland Reference Kaland2021 for a discussion and details below on how this was done in the current study.
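The bottom-up merging with Euclidean distance and complete linkage can be illustrated with the following naive sketch. It is not the contour clustering tool used in the study; the function names and the four toy contours are hypothetical, and the implementation favours readability over speed.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long contours."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(contours, n_clusters):
    """Merge bottom-up until n_clusters remain; return lists of indices."""
    clusters = [[i] for i in range(len(contours))]  # every contour on its own
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is the
                # largest distance between any pair of their members.
                d = max(euclidean(contours[i], contours[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical standardised contours: two falls and two rises.
toy = [
    [1.0, 0.5, 0.0, -0.5],
    [0.9, 0.4, -0.1, -0.6],
    [-0.5, 0.0, 0.5, 1.0],
    [-0.6, -0.1, 0.4, 0.9],
]
two_clusters = complete_linkage(toy, n_clusters=2)
```

Stopping the merging at a given number of clusters corresponds to cutting the dendrogram at the height where that many branches are found.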
It is crucial to note that with contour clustering, there is no direct correspondence between the number of clusters and the possible inventory of phonological units that underlie an f0 contour in the language under investigation. The dataset, the domain of the contour (syllable, word, phrase) and potentially many other factors affect the contour shapes and hence the clustering output. It is often informative to run multiple analyses with varying numbers of clusters, to see how observations are split or merged depending on whether one increases or decreases the number of clusters respectively. This was done in the current study and described in further detail below.
Before clustering, the data was subset based on outlying contours as detected by the contour clustering tool (Kaland Reference Kaland2021). Outlying contours mainly consisted of ones for which interpolation was challenging, either due to too many missing f0 values or inaccurate estimation of f0 in voiceless parts. Note that these missing values did not affect the calculation of the f0 direction in Section 2.3. The outlying contours were removed from the data to obtain a more homogeneous set that would result in clusters of more comparable sizes, specifically, clusterings for which the mean contour of a certain cluster indeed represents the contours in that cluster. Although this procedure made the dataset more suitable for cluster analysis, it could not prevent contour shape differences due to local microprosodic or segmental influences (i.e. creaky phonation or consonantal effects; see discussion in Section 4). In total, 329 contours were clustered. The standardization to account for speaker differences led to contours with values centered around zero (the speaker’s mean).
Several clustering rounds were carried out, starting with two clusters and going up to ten clusters. This was done because we did not know how many different contours were actually present in the data. It is important to note that the three finality categories indeed showed contour differences (Figure 5); however, there could be more (subtle) variation in the contours that could not be captured by a categorization based on the three finality categories. Furthermore, the clustering rounds provide an additional indication of the extent to which the shape differences among the clustered contours match their position at the word or phrase level, as distinguished by the finality categories.
How the contours were divided over the finality categories for each round of clustering can be seen in Table 7. Starting from six clusters, one cluster appeared that contains only four contours. This is cluster number 5 (rounds 6 and 7), cluster number 6 (round 8), and cluster number 7 (rounds 9 and 10) respectively. This asymmetrical outcome is one indication that clusterings with six or more clusters might not be representative, given the low number of contours in that cluster. In addition, an evaluation metric based on information costs to describe the data was applied to find the ideal number of clusters for this data set, following the method in Kaland & Ellison Reference Kaland, Ellison, Skarnitzl and Volín2023. This metric suggested five clusters as the most informative description of the contour differences (Table 7). The mean f0 contours of the round with five clusters are plotted in Figure 6.
Number of contours in each clustering round (N clusters) as divided over the finality categories ($ \phi w $) and clusters (cluster number), including their information cost (shaded area indicates round with lowest information cost).

Mean f0 contour per cluster with five clusters assumed.

The contours in Figure 6 can be roughly divided into falling and rising movements. Clusters 1, 2, and 5 all show a falling movement, comparable in steepness but different in register. That is, the fall is either just below the speaker’s mean (cluster 1), just above it (cluster 2), or well below it (cluster 5). The two rises differ in both steepness and register. That is, the rise just below the speaker’s mean (cluster 3) is steeper than the one that is well above the speaker’s mean (cluster 4). How the five contours are divided over the finality categories can be seen from Table 7. Examples of each of the five contours are given in Figures 7–11 and examples 8–12. The contour in cluster 1 (mid fall) appears approximately half the time in ff syllables (25/51). The contour in cluster 2 (high fall) appears more than half of the time in nn syllables (63/122). Almost half of the contours in cluster 3 (low rise) appear in nf syllables (42/88), although a large number of contours in that cluster also appears in nn syllables. The contours in cluster 4 (high rise) are mainly found in nf syllables (31/48). More than half of the contours in cluster 5 (low fall) are found in ff syllables (12/20).





Example phrase with a mid fall grouped under cluster 1 (ff syllable: li).

Example phrase with a high fall grouped under cluster 2 (nn syllable: ne).

Example phrase with a low rise grouped under cluster 3 (nf syllable: go).

Example phrase with a high rise grouped under cluster 4 (nf syllable: ha in laha).

Example phrase with a low fall grouped under cluster 5 (ff syllable: sa).

To further clarify the relative number of contours in each finality category and in each cluster, two additional proportion tables are presented. In Table 8 the proportions of contours in each finality category add up to 1 (horizontal), whereas in Table 9 the proportions of contours in each cluster add up to 1 (vertical).
Proportions of contours in each finality category and in each cluster. Proportions add up to 1 for each finality category (rows).

Proportions of contours in each finality category and in each cluster. Proportions add up to 1 for each cluster (columns).
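The difference between the two proportion tables is only the direction of normalisation, as the following sketch shows. The counts are hypothetical toy values, not the counts of Table 7, and the function names are introduced here for illustration.

```python
def row_proportions(table):
    """Proportions that add up to 1 within each finality category (row)."""
    return {
        category: {k: n / sum(row.values()) for k, n in row.items()}
        for category, row in table.items()
    }

def column_proportions(table):
    """Proportions that add up to 1 within each cluster (column)."""
    totals = {}
    for row in table.values():
        for cluster, n in row.items():
            totals[cluster] = totals.get(cluster, 0) + n
    return {
        category: {k: n / totals[k] for k, n in row.items()}
        for category, row in table.items()
    }

# Hypothetical counts: finality category -> cluster number -> contours.
toy_counts = {
    "ff": {1: 6, 2: 4},
    "nf": {1: 2, 2: 8},
}
by_row = row_proportions(toy_counts)
by_column = column_proportions(toy_counts)
```

Row proportions answer the question of which contour a given finality category typically carries, whereas column proportions answer the question of where a given contour is typically found.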

4. Discussion and conclusion
We now turn to the two research questions (RQs) outlined in Section 1.3. Concerning RQ I, the present study provides strong support for word-final high tones. The support is found in the type of contours that are dominant in phrase-non-final, word-final syllables (nf). These are the low rise and the (shallow) high rise found in clusters 3 and 4 respectively (Table 7 and Figure 6). The low rise and high rise appear to be the most common contours compared to other contours found on nf syllables (Table 8) and compared to other finality categories in which these contours occur (Table 9).
It needs to be noted that there is overlap in the sense that the rises occur in other finality categories too. Moreover, other contours are found for the nf syllables. For example, low falls (cluster 5) were found for six nf syllables (Table 7). This would constitute the clearest counter-evidence for the alignment of a rise with word-final syllables in Yali. However, a post-hoc inspection of these six syllables revealed that all of them were produced with creaky voice, which could significantly have lowered the f0 (e.g. Keating et al. Reference Keating, Garellek and Kreiman2015). It is important to acknowledge that our data comes from unscripted speech production and we cannot conclude here whether the observed differences indeed constitute noticeable ones for listeners, or whether creaky-voiced productions constitute a phonologically different category.
The shallow high fall cluster (2) contains the highest number of contours compared to the other clusters, and these contours from cluster 2 were also found on twenty-seven phrase-non-final, word-final syllables, although with a low proportion. Two remarks should be made here. First, the large number of contours in cluster 2 indicates that variation can be expected among the contours in that cluster. This variation is probably too subtle to be clustered in a more informative way. Second, the mean contour of cluster 2 remains above the speaker’s mean throughout the contour and it is shallower than the low fall (cluster 5). Both observations call for a more precise look at which features of the contour shape in word-final syllables are relevant. That is, the contours found in 84% of the nf syllables (Table 8) are either a rise or a fall that is found in a high register of the speaker. This leaves two features for further scrutiny: both the rising movement and the overall target level might be important phonologically on word-final syllables. Numerically, however, the rising movement would stand out as the most salient feature as 61% of the contours on nf syllables show it.
As for RQ II, the results show a clear interference between the word and phrase levels. In phrase-final syllables (ff) contours tend to be low and/or falling (Table 8 and Table 9). The eight contours on ff syllables found with high rises (cluster 4) were further inspected as these would most clearly counter this general tendency. The inspection revealed that they were often located at the end of short phrases with a length of one or two words (6/8) and immediately followed by phrases elaborating what was said in the short phrase; see example 13. Similarly, for the twelve ff syllables with a low rise (cluster 3), eleven of them were followed by another phrase in which the speaker elaborated the same discourse topic. Generally, the phrase following a low rise started later than the phrase following a high rise. Taken together, it seems therefore that phrasal demands can override the word-final rise such that the phrase-level tone takes precedence over the word-level tone. That is, most phrases end with a falling movement, most likely signalling discourse finality, whereas the rises found in that position clearly signal that more discourse material is upcoming, either immediately (high rise) or up to 1,400 ms later (low rise). We can therefore conclude that the word-final rise in Yali is most clearly realised in non-final phrase positions.

The nn syllables are most often realised with a high fall (cluster 2). A falling movement would indeed be expected without phrase- or word-level demands for specific tonal targets, that is, following the natural declination of f0. The somewhat higher start of this contour could also be the result of interpolation. Although a fair amount of low rises (cluster 3) also occurs on nn syllables, the majority of these thirty-four cases occurred on penultimate syllables (N = 19) and/or were realised on words starting with [ɦ] (N = 23). Both observations are indicative of a low f0 on these syllables, either due to an anticipatory movement toward the final rise or a lowering effect due to voiced obstruents (Kingston Reference Kingston, Lahiri, Gussenhoven and Riad2007). The [ɦ] lowering can be furthermore observed from (subtle) f0 dips in example 4.
The results of this study are a first exploration of how word-level and phrase-level prosody are realised and how they might interact. Two important phonological aspects remain to be investigated in more detail. The first aspect pertains to the status of the rising f0 movement in the prosodic structure. From example 5 it appears that the final position of the rise is determined after morphological operations such as cliticization and/or affixation. This would suggest that the f0 movement is applied post-lexically. Whether this level is the one of the phonological word or clitic group (Nespor & Vogel Reference Nespor and Vogel2007) remains to be seen. Another possibility would be to analyse the rise as a marker of the AP, which coincides in most cases with the word domain or is slightly larger, as in Korean (Jun Reference Jun1996). It is furthermore likely that the occurrence of the final rise is affected by multiple factors that were not directly investigated in the current study. From the examples, it seems that the final rise does not appear in all cases but rather shows an overall trend (Table 8 and Table 9). To conclude which (phonological) factors could underlie the realisation of the final rise, more research is needed on the phonological and prosodic structure of Yali. The second aspect pertains to word stress: it seems unlikely that Yali has stressed syllables, where stress would then be signaled at the word level by cues other than f0. The duration measures in this study provide a first coarse indication that there are no stress differences between final and non-final syllables at the word level (i.e. both in non-final phrase positions; nf, nn). Although duration is crosslinguistically a reliable cue to word stress (Gordon & Roettger Reference Gordon and Roettger2017), and would be expected to be greater in stressed syllables than in unstressed ones, the only effect of duration found in this study could be ascribed to phrase-final lengthening. An investigation of other stress cues such as vowel quality or spectral tilt could shed more light on this issue.
Concerning the three prosodic typologies discussed in Section 1.1, it seems that Yali can be tentatively categorised as a non-stress-accent language with only post-lexical pitch (as Bengali; Table 1). In Gordon’s typology (Table 2) Yali would most likely be classified as a language without edge repulsion at the word level due to the word-final rise. As for macro-rhythm (Table 3), Yali would most plausibly fit in the category of languages that have it to a strong degree due to the marking of word ends with rises. It is therefore plausible that Yali can be classified as a head/edge or edge language. The latter classification would fit an analysis in which word-final rises are post-lexically applied. Whether or not Yali makes use of pitch accents at levels higher than the phonological word remains to be investigated. This study found clear support for the use of phrase-final boundary tones (falls), but does not provide evidence on the use of pitch accents at the phrase level. Depending on future research on this issue, Yali would be classified as either a symmetrical language (no edge repulsion at word and phrase level; like Hebrew and Farsi) or as an asymmetrical language (no edge repulsion at the word level and no pitch accents; like Wolof).
It is challenging to compare the current results with the ones available for the other Dani languages (see Section 1.2: ‘Dani languages: summary’), in particular given the methodological differences between the studies. Previous work has mainly provided impressionistic descriptions, whereas the current study is the first to provide an in-depth acoustic analysis. While the generalization in terms of a word-final prosodic event is shared between the previous and current results, the only acoustic evidence for Yali so far concerns f0. It is therefore more plausible that Yali has a word-level (pitch) accent system rather than word stress. This observation could cast doubt on the claims that attribute high f0 to word stress, as done in the previous literature for the other Dani languages. The Yali word-level accent’s identity is notably subject to or part of a phrase-level interaction, along the lines of what has been suggested for Lower Grand Valley Dani (Bromley Reference Bromley1961 and Section 1.2).
To conclude, this study has shown that Apahapsili Yali makes use of word-final rising f0 contours on content words and that these can be overridden by phrase-level boundary tones. Although this study has strongly hinted at post-lexical levels, the exact phonological level at which f0 requirements are applied remains to be investigated. The minimal assumption based on the current outcomes is that there are two levels in the Yali prosodic structure: the phonological word for the word-final rises and the (intermediate/intonational) phrase for boundary tones. Relatedly, more detailed investigations on the presence of pitch accents are needed to obtain a more complete profile of post-lexical prosody in Yali. Likewise, it remains a question for future research whether word stress is indeed absent in this language, as tentatively concluded in this study. Although several of the issues just mentioned remain unanswered, this study has shown that an initial exploration of f0 contours in words using acoustic measurements and cluster analysis provides a methodologically novel and representative account of the type of prosodic variation found in (semi-)spontaneous Yali.
Acknowledgements
The authors thank Sonja Riesberg for comments on an earlier version of this paper, and two anonymous reviewers for their valuable feedback.
Data Availability Statement
Data collection information is provided in Riesberg et al. (2012).
Conflict of interest statement
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Funding disclosure statement
The research for this paper has been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—Project-ID 281511265—SFB 1252 Prominence in Language.
Ethics statement
Informed consent was given by all speakers in the corpus that was used for the research conducted for this paper.
Appendix
Overview of the number of syllables per finality category (ff, nf, nn) per speaker (1–29) in the final subset of syllables analysed in this study ($ N $ = 376).