Predictability effects in degraded speech comprehension are reduced as a function of attention

Abstract The aim of this study was to examine the role of attention in understanding linguistic information even in a noisy environment. To assess the role of attention, we varied task instructions in two experiments in which participants were instructed to listen to short sentences and thereafter to type in the last word they heard or to type in the whole sentence. We were interested in how these task instructions influence the interplay between top-down prediction and bottom-up perceptual processes during language comprehension. Therefore, we created sentences that varied in the degree of predictability (low, medium, and high) as well as in the degree of speech degradation (four, six, and eight noise-vocoding channels). Results indicated better word recognition for highly predictable sentences for moderate, though not for high, levels of speech degradation, but only when attention was directed to the whole sentence. This underlines the important role of attention in language comprehension.


Introduction
Spoken language comprehension seems like an easy, automatized process. But intelligibility and comprehension of speech can be rendered difficult in our daily conversations due to adverse listening conditions like background noise and distortion of the speech signal (e.g., Chen & Loizou, 2011; Fontan et al., 2015). For example, the voice of a person talking on the other end of a telephone connection can sound robotic and difficult to understand when the signal quality or transmission is poor. Perception and comprehension of speech in such adverse conditions is effortful (Pals et al., 2013; Strauss & Francis, 2017; Winn et al., 2015). To deal with perceptual difficulties, listeners rely on top-down prediction based on the context that has been understood so far (Obleser & Kotz, 2010; Pichora-Fuller, 2008; Sheldon et al., 2008b). The context can contain information about the topic of the conversation, syntactic information about the structure of the sentence, world knowledge, visual information, and so forth (Altmann & Kamide, 2007; Brothers et al., 2020; Kaiser & Trueswell, 2004; Knoeferle et al., 2005; Xiang & Kuperberg, 2015; for reviews, see Ryskin & Fang, 2021; Stilp, 2020).
To utilize context information, listeners must attend to it and build up a meaning representation of what has been said. Listeners attend to the context information in clear speech with minimal effort, but processing and comprehending degraded speech is more effortful and requires more attentional resources (Eckert et al., 2016; Peelle, 2018; Wild et al., 2012). However, it is less clear how listeners distribute attentional resources: On the one hand, listeners can attend throughout the whole stream of speech and may thereby profit from the context information to predict sentence endings. On the other hand, listeners can focus their attention on linguistic material at a particular time point in the speech stream and, as a result, miss critical parts of the sentence context. If the goal is to understand a specific word in an utterance, there is a trade-off between allocating attentional resources to the perception of that word vs. allocating resources also to the understanding of the linguistic context and generating predictions.
The aim of this study was to investigate how the allocation of attentional resources induced by different task instructions influences language comprehension and, in particular, the use of context information under adverse listening conditions. To examine the role of attention on predictive processing under degraded speech, we conducted two experiments in which we manipulated task instructions. In Experiment 1, participants were instructed to only repeat the final word of the sentence they heard, while in Experiment 2, they were instructed to repeat the whole sentence, thus drawing attention to the entire sentence including the context. In both experiments, we varied the degree of predictability of sentence endings as well as the degree of speech degradation. In the following, we first summarize the findings of studies that have investigated predictive language processing in the comprehension of degraded speech, and then results on the role of attention and task instruction in speech perception.

Predictive processing and language comprehension under degraded speech
It is broadly agreed that human comprehenders generate expectations about upcoming linguistic material based on context information (for reviews, see Kuperberg & Jaeger, 2016; Nieuwland, 2019; Pickering & Gambi, 2018; Staub, 2015). These expectations are formed while a sentence unfolds. The claims about the predictive nature of language comprehension are based on a variety of behavioral and electrophysiological experimental measures including eye-tracking and electroencephalography (EEG). For instance, in the well-known visual world paradigm, listeners fixate on a picture of an object (e.g., a cake) that is predictable based on the prior sentence context (e.g., 'The boy will eat the …') even before hearing the final target word (e.g., Altmann & Kamide, 1999; Ankener et al., 2018). Moreover, highly predictable words are read faster and are skipped more often compared to less predictable words (Frisson et al., 2005; Rayner et al., 2011).
In EEG studies, the N400, a negative-going EEG component that usually peaks around 400 ms poststimulus, is considered a neural marker of semantic unexpectedness (Kutas & Federmeier, 2011). For instance, in the highly predictable sentence context 'The day was breezy so the boy went outside to fly …', DeLong et al. (2005) found that the amplitude of the N400 component for the expected continuation 'a kite' was much smaller than for the unexpected continuation 'an airplane'. Although these studies demonstrated that as the sentence context builds up, listeners form predictions about upcoming words in the sentence, the universality and ubiquity of predictive language processing have been questioned (see Huettig & Mani, 2016). Also, the use of context for top-down prediction can be limited by factors like literacy (Mishra et al., 2012), age, and working memory (Federmeier et al., 2002, 2010), as well as by the experimental setup (Huettig & Guerra, 2019). While these language comprehension studies investigating predictive processing have used clean speech and sentence reading, the present study focuses on examining how attention influences the use of context to form top-down predictions under adverse listening conditions.
There is already some evidence that when the bottom-up speech signal is less reliable due to degradation, listeners tend to rely more on the context information to support language comprehension (Amichetti et al., 2018; Obleser & Kotz, 2010; Sheldon et al., 2008a). For example, Sheldon et al. (2008a, Figure 2) estimated that for both younger and older adults, the number of noise-vocoding channels required to achieve 50% accuracy varied as a function of sentence context. Compared to highly predictable sentences, a greater number of channels (i.e., more bottom-up information) was required in less predictable sentences to achieve the same level of accuracy. Therefore, they concluded that when speech is degraded, predictable sentence context facilitates word recognition. Obleser et al. (2007) found that at a moderate level of spectral degradation, listeners' word recognition accuracy was higher for highly predictable sentence contexts than for less predictable ones. However, while listening to the least degraded speech, there was no such beneficial effect of sentence context (see also Obleser & Kotz, 2010). Hence, especially when the bottom-up speech signal is less reliable due to moderate degradation, information available from the sentence context is used to enhance language comprehension, suggesting that there is a dynamic interaction between top-down predictive and bottom-up sensory processes in language comprehension (Bhandari et al., 2021).

Attention and predictive language processing
It is not only the quality of the speech signal that influences the reliance on and use of predictive processing; attention to auditory input is also important. Auditory attention allows a listener to focus on the speech signal of interest (for reviews, see Fritz et al., 2007; Lange, 2013). For instance, it has been shown that a listener can attend to and derive information from one stream of sound among many competing streams, as demonstrated in the well-known cocktail party effect (Cherry, 1953; Hafter et al., 2007). When a participant is instructed to attend to only one of two or more competing speech streams in a diotic or dichotic presentation, response accuracy for the attended speech stream is higher than for the unattended speech (e.g., Tóth et al., 2020). Similarly, when a listener is presented with a stream of tones (e.g., musical notes varying in pitch, pure tones of different harmonics) but attends to any one of the tones appearing at a specified time point, this is reflected in a larger amplitude of the N1 (e.g., Lange & Röder, 2010; see also Sanders & Astheimer, 2008), the first negative-going ERP component, peaking around 100 ms poststimulus, which is considered a marker of auditory selective attention (Näätänen & Picton, 1987; Thornton et al., 2007). Hence, listeners can direct attention to and process one among multiple competing speech streams.
So far, most previous studies have investigated listeners' attention within a single speech stream by using acoustic cues like accentuation and prosodic emphasis. For example, Li et al. (2014) examined whether the comprehension of critical words in a sentence context was influenced by a linguistic attention probe such as 'ba' presented together with an accented or deaccented critical word. The N1 amplitude was larger for words with such an attention probe than for words without a probe. These findings support the view that attention can be flexibly directed either by instructions toward a specific signal or by linguistic probes (Li et al., 2017; see also Brunellière et al., 2019). Thus, listeners are able to select a part or segment of a stream of auditory stimuli to pay attention to.
The findings on the interplay of attention and prediction mentioned above come from studies which, for the most part, used a single stream of clean speech or multiple streams of clean speech. They cannot tell us about the attention-prediction interplay in degraded speech comprehension. Specifically, we do not know what role attention to a segment of a speech stream plays in the contextual facilitation of degraded speech comprehension, although separate lines of research show that listeners attend to the most informative portion of the speech stream (e.g., Astheimer & Sanders, 2011), and that semantic predictability facilitates comprehension of degraded speech (e.g., Obleser & Kotz, 2010).

The present study
We examined whether context-based semantic predictions are automatic during effortful listening to degraded speech, when participants are instructed to report either the final word of the sentence or the entire sentence. We manipulated semantic predictions and speech degradation by orthogonally varying the cloze probability of target words and the number of channels for the noise-vocoding of speech in a factorial design. Noise-vocoded speech is difficult to understand, as the frequency-specific information of a given bandwidth is replaced with white noise while temporal cues are preserved (e.g., Corps & Rabagliati, 2020; Davis et al., 2005; Shannon et al., 1995).
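The noise-vocoding scheme described above can be sketched computationally: the signal is split into frequency bands, each band's temporal envelope is extracted, and that envelope modulates band-limited white noise. The following is a minimal illustration of the general technique, not the authors' actual pipeline (they used a Praat script); the filter order and channel edges are assumptions for demonstration.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def noise_vocode(signal, fs, band_edges, rng=None):
    """Replace band-limited fine structure with envelope-modulated noise.

    band_edges: channel boundary frequencies in Hz (n + 1 values for n channels).
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.zeros_like(signal, dtype=float)
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, signal)
        envelope = np.abs(hilbert(band))          # temporal envelope of the band
        noise = rng.standard_normal(len(signal))  # white-noise carrier
        carrier = sosfilt(sos, noise)             # noise restricted to the same band
        out += envelope * carrier                 # envelope-modulated noise channel
    return out
```

Fewer channels means coarser spectral detail, which is why a 4-channel version is much harder to understand than an 8-channel one.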
In two experiments, we varied the task instructions to the listeners, which required them to attend differentially to the target word. In Experiment 1, listeners were asked to report the noun in the final position of the sentence they heard. This instruction did not require listeners to pay attention to the context; hence, processing the context was not strictly necessary for the task. In Experiment 2, listeners were asked to report the entire sentence by typing in everything they heard. Thus, the listeners' attention in Experiment 2 was not focused on any specific part of the sentence. We hypothesized that when listeners pay attention only to the contextually predicted target word, as they might choose to do in Experiment 1, they do not form top-down predictions; that is, there should be no facilitatory effect of target word predictability. In contrast, when listeners attend to the whole sentence, they do form expectations, such that a facilitatory effect of target word predictability should be observed.

Participants

We recruited 50 participants online via Prolific Academic (Prolific, 2014). One participant whose response accuracy was less than 50% across all experimental conditions was removed. Among the remaining 49 participants (M age ± SD = 23.31 ± 3.53 years; age range = 18-30 years), 27 were male and 22 were female. All participants were native speakers of German and reported no speech-language disorder, hearing loss, or neurological disorder. All participants received 6.20 euros as monetary compensation for their participation. The experiment lasted approximately 40 minutes. The ethics committee of the German Society for Language Science approved the study, and participants provided informed consent in accordance with the Declaration of Helsinki.

Materials
We used the same materials as in our previous study (Bhandari et al., 2021). They consist of 360 German sentences spoken by a female native German speaker, unaccented, at a normal rate of speech. The sentences were recorded and digitized at 44.1 kHz with 32-bit linear encoding. All sentences consisted of a pronoun, verb, determiner, and object noun (for example stimulus sentences with their English translations, see the Supplementary Material). We used 120 nouns to create three types of sentences differing in the cloze probability of the target words (nouns), which mostly appeared as the final word of the sentence. We thereby compared sentences with low, medium, and high cloze target words.
The cloze probability ratings for each of these sentences were measured in a norming study with a separate group of participants (n = 60; age range = 18-30 years). Mean cloze probabilities for sentences with low cloze target words (low predictability sentences), medium cloze target words (medium predictability sentences), and high cloze target words (high predictability sentences) were 0.022 ± 0.027 (M ± SD; range = 0.00-0.09), 0.274 ± 0.134 (M ± SD; range = 0.10-0.55), and 0.752 ± 0.123 (M ± SD; range = 0.56-1.00), respectively.
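Cloze probability is simply the proportion of norming participants who complete a sentence frame with a given word. A minimal sketch of this computation (the German completions below are hypothetical, not items from the study):

```python
from collections import Counter

def cloze_probability(responses, target):
    """Proportion of norming participants whose sentence completion
    matched the target word (case-insensitive)."""
    normalized = [r.strip().lower() for r in responses]
    return Counter(normalized)[target.strip().lower()] / len(normalized)
```

For example, if 45 of 60 norming participants completed a frame with the same noun, that noun's cloze probability would be 0.75, placing the sentence in the high-predictability bin (0.56-1.00).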
The speech signal was divided into 1, 4, 6, and 8 frequency bands between 70 and 9,000 Hz to create four different levels of speech degradation for each of the 360 recorded sentences. Frequency boundaries were approximately logarithmically spaced, determined by cochlear frequency-position functions (Erb, 2014; Greenwood, 1990). A customized Praat script originally written by Darwin (2005) was used to create the noise-vocoded speech. Boundary frequencies for each noise-vocoding condition are given in Table 1.
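One common way to obtain such boundaries is to space them equally along the cochlea using Greenwood's (1990) frequency-position function for humans. The sketch below illustrates this approach; the exact boundary values used in the study are those in Table 1 and may differ slightly.

```python
import numpy as np

A, a, k = 165.4, 2.1, 0.88  # Greenwood (1990) constants for the human cochlea

def greenwood(x):
    """Frequency (Hz) at relative cochlear position x in [0, 1]."""
    return A * (10 ** (a * x) - k)

def band_edges(n_channels, f_lo=70.0, f_hi=9000.0):
    """n_channels + 1 boundary frequencies, equally spaced along the cochlea."""
    def position(f):  # inverse of the Greenwood function
        return np.log10(f / A + k) / a
    x = np.linspace(position(f_lo), position(f_hi), n_channels + 1)
    return greenwood(x)
```

Because the Greenwood map is close to logarithmic above a few hundred hertz, these edges come out approximately logarithmically spaced, as described above.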

Procedure
Participants were asked to use headphones or earphones. A sample of vocoded speech not used in the practice trials or the main experiment was provided at the beginning of the experiment so that participants could adjust the volume to a comfortable level. The participants were instructed to listen to the sentences and to type in the target word (noun) using the keyboard. The time for typing in the response was not limited. They were also informed at the beginning of the experiment that some of the sentences would be 'noisy' and not easy to understand; in these cases, they were encouraged to guess what they might have heard. Eight practice trials with different levels of speech degradation were given to familiarize the participants with the task before presenting all 120 experimental trials with an intertrial interval of 1,000 ms.
Each participant had to listen to 40 high predictability, 40 medium predictability, and 40 low predictability sentences. Levels of speech degradation were also balanced across each predictability level, so that for each of the three predictability conditions (high, medium, and low predictability), ten 1-channel, ten 4-channel, ten 6-channel, and ten 8-channel noise-vocoded sentences were presented, resulting in 12 experimental lists. The sentences in each list were pseudo-randomized so that no more than three sentences of the same degradation and predictability condition appeared consecutively.
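A pseudo-randomization constraint like the one above (no more than three consecutive trials from the same degradation-by-predictability cell) can be implemented by rejection sampling: shuffle, check the longest run, and retry if needed. This is an illustrative sketch, not the authors' list-generation code.

```python
import random

def max_run(seq, key):
    """Length of the longest run of items with the same key value."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if key(cur) == key(prev) else 1
        longest = max(longest, run)
    return longest

def pseudo_randomize(trials, key, max_repeats=3, seed=None, max_tries=10_000):
    """Shuffle trials until no condition repeats more than max_repeats in a row."""
    rng = random.Random(seed)
    trials = list(trials)
    for _ in range(max_tries):
        rng.shuffle(trials)
        if max_run(trials, key) <= max_repeats:
            return trials
    raise RuntimeError("no valid ordering found; relax the constraint")
```

With 12 conditions of 10 trials each, a plain shuffle already satisfies the constraint most of the time, so rejection sampling converges almost immediately.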

Analyses
We performed data preprocessing and analyses in RStudio (R version 3.6.3; R Core Team, 2020). In the 1-channel condition, there were only five correct responses in total (one each from five of the 49 participants). Therefore, the 1-channel speech degradation condition was excluded from the analyses.
Accuracy was analyzed using generalized linear mixed models (GLMMs) with the lmerTest (Kuznetsova et al., 2017) and lme4 (Bates et al., 2015) packages. Binary responses (categorical: correct and incorrect) for all participants were fit with a binomial linear mixed-effects model (Jaeger, 2006, 2008). Correct responses were coded as 1 and incorrect responses as 0. Number of channels (categorical: 4-channel, 6-channel, and 8-channel noise-vocoding), target word predictability (categorical: high, medium, and low predictability sentences), and the interaction of number of channels and target word predictability were included as fixed effects.
We first fitted a model with the maximal random effects structure supported by the experimental design, which included random intercepts for each participant and item (Barr et al., 2013) as well as by-participant and by-item random slopes for number of channels, target word predictability, and their interaction. Based on previous findings on perceptual adaptation (e.g., Cooke et al., 2022; Davis et al., 2005; Erb et al., 2013; but see also Bhandari et al., 2021), we further added trial number (centered) to the fixed-effects structure to control for whether the listeners adapted to the degraded speech. We report the results of the model that includes trial number as a fixed effect. 1 We applied treatment contrasts for number of channels (8-channel as the baseline) and sliding difference contrasts for target word predictability (low predictability vs. medium predictability, and low predictability vs. high predictability sentences). The code and data are available in the following publicly accessible repository: https://osf.io/t6unj/.
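For readers unfamiliar with sliding difference (successive-differences) coding: for a three-level factor it yields two coefficients, each estimating the difference between adjacent levels while the intercept remains the grand mean. A sketch of the contrast matrix as produced by R's MASS::contr.sdif (shown in Python for illustration):

```python
import numpy as np

def sliding_difference_contrasts(n_levels):
    """Contrast matrix whose coefficients estimate successive differences
    (level 2 - level 1, level 3 - level 2, ...), as in MASS::contr.sdif."""
    C = np.zeros((n_levels, n_levels - 1))
    for j in range(1, n_levels):
        C[:j, j - 1] = -(n_levels - j) / n_levels  # levels below the cut
        C[j:, j - 1] = j / n_levels                # levels above the cut
    return C
```

Each column sums to zero, so the contrasts are centered and the intercept corresponds to the mean across predictability levels.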

Results and discussion
Mean response accuracy for all experimental conditions is shown in Table 2 and Fig. 1. We found that accuracy increased with an increase in the number of noise-vocoding channels, that is, with a decrease in speech degradation. However, accuracy did not increase with an increase in target word predictability. The results of the statistical analysis confirmed these observations (see Table 3). There was a significant main effect of number of channels, indicating that response accuracy for the 8-channel vocoded speech was higher than for both 4-channel (β = −3.50, SE = 0.22, z(4,410) = −16.19, p < 0.001) and 6-channel vocoded speech (β = −0.70, SE = 0.21, z(4,410) = −3.29, p = 0.001); that is, when the number of channels increased to 8, listeners gave more correct responses (see Fig. 2). There was, however, no significant main effect of target word predictability (β = 0.30, SE = 0.36, z(4,410) = 0.84, p = 0.40, and β = 0.50, SE = 0.43, z(4,410) = 1.16, p = 0.25), and no interaction between number of channels and target word predictability (all ps > 0.05). There was also no significant main effect of trial number (β = 0.001, SE = 0.002, z(4,410) = 0.48, p = 0.63), suggesting that the listeners' performance did not improve over time.
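Because the GLMM coefficients are on the log-odds (logit) scale, their magnitudes are easier to interpret after conversion. Two small helper functions (illustrative, not part of the reported analysis code):

```python
import math

def odds_ratio(beta):
    """Convert a logit-scale GLMM coefficient to an odds ratio."""
    return math.exp(beta)

def inverse_logit(x):
    """Map a logit-scale value back to a probability."""
    return 1 / (1 + math.exp(-x))
```

For instance, the 4-channel coefficient of β = −3.50 corresponds to an odds ratio of about 0.03: the odds of a correct response under 4-channel vocoding are roughly 3% of the odds under 8-channel vocoding.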
These results indicated a decrease in response accuracy with an increase in speech degradation from the 8-channel to the 6-channel noise-vocoding condition, and from the 8-channel to the 4-channel noise-vocoding condition. However, response accuracy did not increase with an increase in target word predictability, and the interaction between number of channels and target word predictability was also absent, in contrast to previous findings (Obleser & Kotz, 2010;Obleser et al., 2007; see also Hunter & Pisoni, 2018). These results suggest that the task instruction, which asked participants to report only the final word, indeed led to neglecting the context. Although participants were able to neglect the context, there was still uncertainty about the speech quality of the next trial; hence, they could not adapt to the different levels of degraded speech.
To confirm that the predictability effect (or contextual facilitation) is replicable and dependent on attentional focus, we conducted a second experiment in which we changed the task instruction to draw participants' attention to decoding the whole sentence.

Participants and materials
We recruited 48 participants (M age ± SD = 24.44 ± 3.55 years; age range = 18-31 years; 32 males) online via Prolific Academic. The same procedure was followed as in Experiment 1, and the same stimuli were used.

Procedure
Participants were presented with sentences at a comfortable volume level. They were asked to use headphones or earphones, and a prompt was presented before the experiment began to adjust the volume to their level of comfort. Eight practice trials were presented, followed by 120 experimental trials. The participants were instructed to report the entire sentence by typing in what they heard. We did not limit the response time.

Analysis
We followed the same data analysis procedure as in Experiment 1. The 1-channel speech degradation condition was excluded from the analysis. We did not consider whether listeners reported the other words in a sentence correctly; only the final words of the sentences (target words) were scored as correct or incorrect responses. As in Experiment 1, we report the results from the maximal model supported by the design. 2

Results and discussion

Mean response accuracy for the different conditions is shown in Table 4 and Fig. 2. We found that accuracy increased when the number of noise-vocoding channels increased, as well as when target word predictability increased. The results of the statistical analysis confirmed these observations (Table 5): We again found a main effect of number of channels, such that response accuracy at 8-channel was higher than for both 4-channel (β = −3.51, SE = 0.24, z(4,320) = −14.64, p < 0.001) and 6-channel noise-vocoding (β = −0.65, SE = 0.22, z(4,320) = −2.93, p = 0.003). Similar to Experiment 1, the main effect of trial number was not significant (β = 0.002, SE = 0.002, z(4,320) = 1.11, p = 0.27), indicating that response accuracy did not increase over the course of the experiment.
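The final-word scoring rule described above (only the sentence-final target word counts, regardless of the rest of the typed sentence) can be sketched as follows. The exact matching criteria (e.g., treatment of typos or punctuation) are not specified in the text, so this is one plausible scheme, and the example sentence is hypothetical.

```python
import string

def score_final_word(typed_response, target):
    """Return 1 if the last typed word matches the target noun, else 0.
    Matching is case-insensitive and ignores surrounding punctuation."""
    words = typed_response.split()
    if not words:
        return 0
    last = words[-1].strip(string.punctuation).lower()
    return int(last == target.lower())
```

Applied to whole-sentence responses from Experiment 2, this makes the dependent measure directly comparable to the single-word responses of Experiment 1.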
In contrast to Experiment 1, these results indicate an effect of target word predictability; that is, response accuracy was higher when the target word predictability was high as compared to low. Also, the interaction between target word predictability and speech degradation, which was not observed in Experiment 1, showed that semantic predictability facilitated the comprehension of degraded speech already at moderate levels (like 6-or 8-channel). In line with the findings from Experiment 1, response accuracy was better with a higher number of channels.
We combined the data from both experiments in a single analysis to test whether participants' response accuracy changed across experiments, that is, whether the difference between the experimental manipulations was statistically significant. We ran a binomial linear mixed-effects model on response accuracy and followed the same procedure as in Experiments 1 and 2. A full random effects structure supported by the study design was modeled. 3 The model summary is shown in Table 6. The model revealed no significant main effect of experimental group (β = 0.04, SE = 0.26, z(8,730) = 0.15, p = 0.88), indicating that overall response accuracy did not change with the change in instructions from Experiment 1 to Experiment 2. However, the critical interaction between experimental group and target word predictability was statistically significant (β = 0.46, SE = 0.20, z(8,730) = 2.34, p = 0.02); that is, the effect of predictability was larger in the group that was asked to type in the whole sentence (Experiment 2) than in the group that was asked to type in only the sentence-final target word (Experiment 1). Together, these findings suggest that the change in task instruction, which draws attention either to the entire sentence or only to the final word, is critical to whether context information is used under degraded speech. Overall comprehension of degraded speech, however, was not reduced by binding listeners' attention to one part of the speech stream.

General discussion
The main goal of the present study was to investigate whether online semantic predictions are formed in the comprehension of degraded speech when task instructions encourage attention to the processing of the context information, or only to the critical target word. The results of two experiments revealed that attentional processes clearly modulate the use of context information for predicting sentence endings when the speech signal is moderately degraded.
In contrast to the first experiment, the results of our second experiment show an interaction between target word predictability and degraded speech. This is generally in line with the few existing studies that found a facilitatory effect of predictability at different levels of speech degradation when participants were instructed to pay attention to the entire sentence (e.g., at 4-channel or at 8-channel; Bhandari et al., 2021; Obleser & Kotz, 2010; Obleser et al., 2007). The important new finding that our study adds to the literature is that this effect may be weakened or lost when listeners are instructed to report only the final word of the sentence they heard (Experiment 1). The lack of a predictability effect (or contextual facilitation) can most likely be attributed to listeners not successfully decoding the meaning of the verb of the sentence, as the verb is the primary predictive cue for the target word (noun) in our stimuli. Hence, this small change in task instructions from Experiment 1 to Experiment 2 sheds light on the role of top-down regulation of attention in using context for language comprehension in adverse listening conditions. In an adverse listening condition, language comprehension is generally effortful, so focusing attention on only a part of the speech signal seems beneficial to enhance stimulus decoding. However, the results of this study also show that this comes at the cost of neglecting context information that could be beneficial for language comprehension. Our findings hence demonstrate that there is a trade-off between using the context to generate top-down predictions vs. focusing all attention on a target word. Specifically, engagement in the use of context and the generation of top-down predictions may change as a function of attention (see also Li et al., 2014).
This claim is also corroborated by the significant change in predictability effects (or contextual facilitation) from Experiment 1 to Experiment 2 in the combined dataset. Findings from the irrelevant-speech paradigm support our conclusion as well: the predictability of unattended speech has been shown to have no effect on the main experimental task (e.g., memorization of auditorily presented digits). Wöstmann and Obleser (2016) did not find predictability effects when participants ignored the degraded speech (see also Ellermeier et al., 2015). An alternative explanation to 'participants neglecting the context' could be that participants did not listen to the context at all, or that they heard the context but did not process it. However, irrelevant-speech paradigm studies show that listeners cannot avoid listening to speech presented to them; to-be-ignored speech interferes with the main experimental task (e.g., LeCompte, 1995). It is thus plausible that the listeners heard the context but did not process it deeply. This is not incompatible with our first explanation: in either case, attention to the final word leaves listeners with limited resources to process and form a representation of the context information.
At this point, we note the differences in response accuracies across the different levels of speech degradation, and the contextual facilitation therein. In the 8-channel condition, the speech was least degraded, and listeners recognized more words than in the 4- or 6-channel conditions, in line with prior studies that have found an increase in intelligibility and word recognition with an increase in the number of channels (e.g., Davis et al., 2005; Obleser et al., 2011). Speech passed through 4-channel noise-vocoding was most degraded. Therefore, in the second experiment, attending to the entire sentence did not confer contextual facilitation at 4-channel because decoding the context itself was difficult: listeners could not utilize the context differentially across high and low predictability sentences to generate semantic predictions. At 6-channel, a moderate level of degradation, listeners could attend to, identify, and decode the context; hence, we observed the significant difference in response accuracy between high and low predictability sentences. We observed a similar contextual facilitation at 8-channel as well. This is in line with previous findings (e.g., Obleser et al., 2007; but see also Obleser & Kotz, 2010) showing that predictability effects can be observed at a moderate degradation level of 8-channel or less. To summarize, our results indicate a very strong difference in intelligibility between the 4- and 6-channel conditions, but only a minor difference between the 6- and 8-channel conditions. Note, however, that even at 8-channel, low predictability sentences were not always understood correctly.
Considering theoretical accounts of predictive language processing (Friston et al., 2020; Kuperberg & Jaeger, 2016; McClelland & Elman, 1986; Norris et al., 2016; Pickering & Gambi, 2018), one would expect that listeners automatically form top-down predictions about upcoming linguistic stimuli based on prior context. Also, when speech is degraded, top-down predictions render a benefit in word recognition and language comprehension (e.g., Corps & Rabagliati, 2020; Sheldon et al., 2008a, 2008b). The results of our study offer new theoretical insights by showing that this is not always the case. Top-down predictions depend on attentional processes (see also Kok et al., 2012) directed by task instructions; thus, they are not always automatic, and predictability does not always facilitate comprehension of degraded speech. On this point, our findings add to the growing body of literature indicating limitations of predictive language processing accounts (Huettig & Guerra, 2019; Huettig & Mani, 2016; Mishra et al., 2012; Nieuwland et al., 2018).
Results from both experiments show that the effect of trial number was not significant. In contrast to previous studies (e.g., Davis et al., 2005; Erb et al., 2013), we did not observe adaptation to noise-vocoded speech. In those studies, there was certainty about the speech quality of the next trial, as participants were presented with only one level of spectral degradation (only 4-channel or only 6-channel noise-vocoding), and crucially with no specific regard to semantic predictability. By contrast, in our study, listeners were always uncertain about the speech quality of the next trial as well as its semantic predictability. Because of this changing context, the perceptual system of the participants may not have retuned itself (cf. Goldstone, 1998; Mattys et al., 2012). This is also in line with our prior finding that listeners do not adapt to degraded speech when there is trial-by-trial variation in perceptual and semantic features (Bhandari et al., 2021).
We should also note the limitations of the current study. In our experiments, we used short subject-verb-object sentences in which the verb is predictive of the noun, and we gave participants the somewhat unnatural task of reporting the last word of a sentence. In more naturalistic sentence comprehension, participants would normally aim to understand the full utterance and would most likely not have restricted goals such as first and foremost decoding a word in a specific position of the sentence. Instead, the speaker would usually indicate important words or concepts via pitch contours, stress, or intonation patterns, which would then direct the attention of a listener. Furthermore, the sentences uttered in most day-to-day conversations are longer, and context information builds up more gradually: information from several words is usually jointly predictive of upcoming linguistic units. Similarly, the design of our experiments limits our ability to discern whether participants generated predictions online while processing the speech, or while typing in the words after listening to the degraded speech.
To conclude, we show that task instructions affect the distribution of attention to a noisy speech signal. This, in turn, means that when insufficient attention is given to the context, top-down predictions cannot be generated, and the facilitatory effect of predictability is substantially reduced.

Data Availability Statement. The code and data mentioned above are available in the public repository of the Open Science Framework: https://osf.io/t6unj/.
Conflict of Interest. We conducted this research with no relationship, financial or otherwise, that could be a potential conflict of interest.