Efficacy of different presentation modes for L2 video comprehension: Full versus partial display of verbal and nonverbal input

Abstract
Video materials require learners to manage concurrent verbal and pictorial processing. To facilitate second language (L2) learners’ video comprehension, the amount of presented information should thus be compatible with human beings’ finite cognitive capacity. In light of this, the current study explored whether a reduction in multimodal comprehension scaffolding would lead to better L2 comprehension gain when viewing captioned videos and, if so, which type of reduction (verbal vs. nonverbal) is more beneficial. A total of 62 L2 learners of English were randomly assigned to one of the following viewing conditions: (1) full captions + animation, (2) full captions + static key frames, (3) partial captions + animation, and (4) partial captions + static key frames. They then completed a comprehension test and cognitive load questionnaire. The results showed that while viewing the video with reduced nonverbal visual information (static key frames), the participants had well-rounded performance in all aspects of comprehension. However, their local comprehension (extraction of details) was particularly enhanced after viewing a key-framed video with full captions. Notably, this gain in local comprehension was not as manifest after viewing animated video content with full captions. The qualitative data also revealed that although animation may provide a perceptually stimulating viewing experience, its transient feature most likely taxed the participants’ attention, thus impacting their comprehension outcomes. These findings underscore the benefit of a reduction in nonverbal input and the interplay between verbal and nonverbal input. The findings are discussed in relation to the use of verbal and nonverbal input for different pedagogical purposes.


Introduction
For second language (L2) video comprehension, the pedagogical potency of captions (verbatim transcripts in the same language as the spoken narration) has been established by a multitude of studies (e.g. Montero Perez, Van Den Noortgate & Desmet, 2013; Winke, Gass & Sydorenko, 2010). Despite the supportive evidence, studies based on Mayer's (2002, 2005) cognitive theory of multimedia learning (CTML) have expressed reservations about the use of full-captioned videos (e.g. Mayer, Lee & Peebles, 2014). CTML, which was originally proposed for learning math and science, stipulates that multimedia instruction should be designed in light of human beings' limited cognitive capacity; otherwise, incoming information that exceeds learners' processing limit will lead to cognitive overload and will thus inhibit learning.
To design multimedia material that does not overburden one's cognitive system, instructors should consider three kinds of cognitive demand (Mayer, 2005). First, intrinsic cognitive load is determined by learners' perceived difficulty or complexity of the learning material, and is thus not directly amenable to the instructor's control. Extraneous cognitive load, on the other hand, is determined by the design and presentation of the material and is therefore more amenable to the instructor's control. High extraneous load results when the instruction requires learners to simultaneously process a large amount of input presented in different modalities. It interferes with schema acquisition and automation because learners need to devote their cognitive resources to unnecessary processing (Kam, Liu & Tseng, 2020). Lastly, germane cognitive load stems from the mental effort required to make sense of the learning material, and thus contributes to schema acquisition and learning. Given that the three kinds of cognitive load are additive, scholars generally agree on the need to minimize extraneous cognitive load (e.g. Kam et al., 2020; Mayer & Moreno, 2010), but how this goal could be realized in L2 multimedia learning environments has not been thoroughly examined.
To date, L2 captioning research has sought to reduce the extraneous cognitive load of viewing and understanding multimodal video materials at the verbal level by utilizing partial captions, that is, on-screen transcripts of only selected words from the oral discourse (Guillory, 1998; Montero Perez, Peters & Desmet, 2014; Teng, 2019). In contrast to full captions, partial captions can highlight the targeted learning information and direct L2 learners' attention to specific content (Guillory, 1998). Although it is unresolved whether partial captioning can unequivocally lead to superior video comprehension compared to full captioning, some studies, albeit limited in number, have demonstrated the added advantages of partial captions for cognitive load reduction and attention guiding (Rooney, 2014). L2 learners' cognitive load was found to be higher when viewing videos with full captioning than when watching videos with partial captioning (Mohsen & Mahdi, 2021). Furthermore, Mirzaei, Meshgi, Akita and Kawahara (2017) found that presenting only selected words in the captioning line may help avoid L2 learners' over-reliance on caption reading and better prepare them for real-world listening.
Besides the verbal approach, extraneous cognitive load in viewing and understanding multimodal video materials may also be reduced at a nonverbal level. Studies based on native speakers have found that fast-changing images (e.g. animations) would impose greater extraneous load than slow-changing images (Hegarty, 2004; Höffler & Leutner, 2007). To this end, presenting a series of static key frames or images extracted from the animation may help reduce the extraneous processing of transitory pictorial input (Paas, Van Gerven & Wouters, 2007). Similar to partial captions, which present key ideas of an oral discourse, static key frames are key screenshots that present essential pictorial information for understanding the gist of the video. The missing details in the deliberately chosen frames encourage learners to fill in the gaps using their prior knowledge (Hegarty, 1992). From the perspective of CTML, presenting partial captions (the verbal approach) and static key frames (the nonverbal approach) in L2 videos are viewed in this study as two plausible ways to minimize extraneous cognitive load and to promote germane cognitive load.
Nevertheless, in previous CTML studies (e.g. Kam et al., 2020; Lee, Liu & Tseng, 2021), the reduction in extraneous load was mostly discussed at the verbal rather than at the nonverbal level.
More research is needed to shed light on the optimal strategies for reducing extraneous cognitive load and their potential benefits in L2 multimedia learning scenarios. To address the gap in the research literature, the following research questions were examined:
1. Does the reduction in verbal information through the manipulation of caption presentation modes (i.e. full captions vs. partial captions) affect L2 learners' video comprehension?
2. Does the reduction in nonverbal information through the manipulation of pictorial presentation modes (i.e. animation vs. static key frames) affect L2 learners' video comprehension?
3. How do L2 learners perceive their cognitive load when viewing videos with different verbal supports (i.e. full captions vs. partial captions) and different pictorial content (i.e. animation vs. static key frames)?

Literature review

Theoretical accounts of processing multimodal input in captioned videos
Mayer's (2002, 2005) CTML specifies the cognitive processes of multimedia learning, as shown in Figure 1. Starting from the left side, words and pictures enter through the eyes and ears and are briefly held in sensory memory. With only parts of the information selected into working memory, the learner then draws on long-term memory to make sense of what has been heard or seen by piecing together the attended input and then organizing it into a coherent verbal/pictorial representational model; the resulting model may or may not enter long-term memory, depending on whether it can be integrated with prior knowledge. According to Mayer (2014), the efficiency of the above processes in working memory can be facilitated through the design of multimedia materials, where two assumptions must be considered. One is the dual-channel assumption, which indicates humans' use of interconnected channels to process input presented in different modalities. The ideal presentation mode is the distribution of the information across both channels. The second assumption is limited capacity, which stresses that each channel has only finite cognitive resources. It explains why viewers have to selectively attend to different parts of multimodal video content at a time, rather than processing the exact copy of the input in working memory (see Hsieh, 2020). Based on the above-mentioned two assumptions, when a viewer watches a narrated animation, the narration enters the auditory/verbal channel, while the animation enters the visual/pictorial channel. The concurrent activation of the two channels is beneficial for deeper processing. However, when full captions are also presented, they enter the verbal channel as well as the visual channel, introducing more than one source of information to the same channel at the same time.
In this situation, learners need to reconcile the incoming spoken and written text, and constantly scan the animation to search for images corresponding to the text (Mayer & Moreno, 2010). For L2 learners with lower working memory capacity, reading full captions in the L2 may overburden their working memory capacity and result in less information intake (Kam et al., 2020; Winke et al., 2010; see also Lee et al., 2021).
Although full-captioned videos may create redundancy and impose additional cognitive demands on L2 learners, the CTML does not elucidate how to reduce the extraneous load in such cases. Therefore, the current study examined whether a reduction in input presentation would have positive effects on L2 learners' comprehension of the video content and, if so, which level of input reduction (i.e. verbal or pictorial) would be more effective.
In the extant captioning research, partial captions are usually operationalized in the form of keyword captions. By excluding on-screen text that is less crucial for the video content, keyword or partial captions may prevent visual or verbal channel overload, which is likely to occur in a full-captioning viewing environment. The pioneering research on the use of partial captions by L2 learners was conducted by Guillory (1998). The study revealed that both full and partial captions led to superior comprehension compared with no captions. In particular, partial captions were considered as effective as full captions for video comprehension, because no significant difference was found between the two. Moreover, with less text in the visual channel, learners viewing with partial captions may be less prone to cognitive overload.
Supporting evidence regarding partial captions was also provided by Yang, Chang, Lin and Shih (2010), who compared the effects of full captions, captions of only nouns, and captions of only verbs on video comprehension. The result showed that captions comprising only nouns led to comparable comprehension to full captions. However, with captions comprising only verbs, the learners' video comprehension was significantly poorer than that of the other groups. Although Yang et al. did not explain what caused the differences between the two types of partial captions, their findings implied the importance of partial selection criteria. A more recent study by Mirzaei et al. (2017) investigated the effects of partial and synchronized captions (PSC), which consisted of words less frequently seen in corpora and/or uttered at a high speech rate. Consistent with Guillory (1998) and Yang et al. (2010), the result showed that the PSC group performed as well as the full-captioning group in listening comprehension (see also Mohsen & Mahdi, 2021). Moreover, PSC seemed to help reduce the students' reliance on caption reading, thus better preparing them for real-world listening.
Although partial captions were postulated to be as effective as full captions in the aforementioned studies, others have shown contradictory results. For instance, Montero Perez et al. (2014) investigated the effects of full, partial, and no captions on global and detailed comprehension. Results revealed no significant differences between the three types of captions in detailed comprehension, but in global comprehension, the full-captioning group significantly outperformed the others. Full captions were thus suggested to be more assistive for L2 learners. However, the authors acknowledged that the lack of inferencing comprehension questions prevented the study from gaining a more complete view of the efficacy of different captions. Similar to Montero Perez et al. (2014), Teng (2019) found full captions more beneficial than partial captions for L2 learners' global comprehension, and no significant differences were found in detailed comprehension. The researcher speculated that because of its unusual or incomplete presentation, the partial captions might have drawn so much attention from the viewers that they were not able to holistically interpret the video content.
One factor that may have caused the mixed results in previous studies is the type of selected videos. For example, the videos of situational conversation (in Guillory, 1998) and TED Talks (in Mirzaei et al., 2017) pertain to a more "unidimensional talking-head genre" (Yeldham, 2018: 375), in which the images are less dynamic and less crucial to understanding the content. Such visual distinction may have affected how viewers allocated their attention, further determining their comprehension outcomes.

Static key frames as the nonverbal approach
Empirical evidence based on young native speakers has shown that narrated animations were more beneficial for story comprehension than audio picture books (Takacs, Swart & Bus, 2015). With dynamic images, animations attract visual attention and provide a closer match between verbal and nonverbal information, which makes the plot more explicit and easier to comprehend (Takacs & Bus, 2016).
Nevertheless, these benefits may only be realized in animations accompanied by spoken text, where the viewer's visual and auditory channels deal with one source of input at a time (Mayer & Moreno, 2010). When other sources of visual information are presented concurrently (e.g. captions), viewers may experience split attention between the text and images. Particularly for novices lacking sufficient background knowledge, the presence of multiple dynamic visuals may cause constant searches for pertinent information, which consume the cognitive resources required for deeper processing (Ayres & Paas, 2007). To minimize the extraneous processing in animated lessons, the nonverbal approach (presenting viewers with only the key segments of a video) may have the potency to reduce cognitive load resulting from constant, transient pictorial displays, and may serve as an attention-getting device to (re)direct learners' focus to key frames that are crucial for meaning interpretation.
In this vein, Moreno (2007) examined the effects of the "segmentation" and "signalling" methods on video-instructed lessons. Through segmentation, the dynamic video content was divided into smaller chunks, so that the viewers only had to process portions of the information at a time. Through signalling, the viewers could see a written list on the screen, which was expected to guide their selection of the most relevant pictorial displays. Results showed that the segmented materials could lead to better performance on retention and transfer tests. However, the signalled video/animation did not benefit the viewers' test performance, and it even imposed higher cognitive load than the original material. The major cause, as pointed out by Moreno, was that the verbal signals may have introduced another source of visual information, which split the viewers' attention between the dynamic video content and the signalling text. Moreno's finding suggests that reduction (of transient pictorial displays) was more helpful than addition (of artificially imposed input enhancement or attention-getting devices) in retaining the processed information in a multimodal environment.
It is feasible that presenting the most representative frames of a portion of a video (static key frames) may merge the benefits of segmenting and attention guiding. Paas et al. (2007), for instance, employed static key frames as a follow-up session to an animation-based lesson. After viewing the animation, some of the participants studied all the key frames simultaneously, while the others studied the frames sequentially. No significant between-group differences in comprehension were found; however, the participants exposed to sequential key frames perceived less mental effort, which in turn indicated higher instructional efficiency. In contrast, participants viewing the key frames simultaneously all indicated perceiving a higher cognitive load. Paas et al. thus postulated that when the learners must simultaneously pay attention to multiple sources of key information (a processing environment similar to the viewing of animation), extraneous cognitive load would increase. However, when the learners only had to focus on one piece of key information at a time, it required less mental effort and decreased extraneous cognitive load. Given that the research examining L2 learners' processing of dynamic images is still in its infancy, more studies are warranted to explore how the amount of pictorial input affects L2 learners' cognitive load and video comprehension.

Participants
A total of 62 students majoring in applied English at a public vocational high school in northern Taiwan participated in this study. They were all 12th graders recruited from two intact classes and were all motivated to improve their English proficiency through viewing various kinds of video content online. The two classes had comparable English competence according to their average scores on the achievement tests in the previous academic years. Additionally, based on their performance in a standardized English proficiency test, the participants' English proficiency was considered to be high-intermediate, according to the Common European Framework of Reference for Languages.

Video selection
The viewing material, dubbed by a native English speaker, was selected based on the following three criteria. First, the topic of the video was familiar to the participants to prevent comprehension barriers caused by lack of background knowledge (Othman & Vanathas, 2017). Second, to measure the participants' inferential comprehension, the video did not consist solely of concrete, factual information, as such material may not be the most suitable genre for measuring inferential comprehension (Montero Perez et al., 2014). Finally, the video matched the participants' language proficiency, so that they could manage the intrinsic cognitive load imposed by the material (Martin & Evans, 2018). Accordingly, an animated TED-Ed video explaining the clustering phenomenon of competing stores was adopted. All participants viewed the video with headphones.

Caption modes
Participants viewed the video with either full or partial captions. With full captioning, there were at most two lines of text presented on screen at the same time; with partial captioning, only the selected words or phrases were shown in the captioning line (see Figure 2). To select the words and phrases in the partial captions, the current study utilized MonkeyLearn, online software that can automatically extract high-frequency words and co-occurring lexical strings from a given text.¹ Additionally, two experienced English teachers were invited to watch the video, read the automatically extracted phrases, and delete semantically redundant words. The final set of keywords and phrases for partial captioning accounted for approximately 45% of the full transcript, close to the 50% caption ratio that is considered most effective for L2 learners' listening comprehension (Rooney, 2014).

Pictorial modes
The participants watched the video with either animation or static key frames. All videos, whether animated or key-framed, were presented to the participants in high-definition format. To extract the key frames, the present study utilized VLC software, which can automatically capture a frame at a particular interval. The recording ratio was set to capture one frame every 2 seconds, resulting in 120 frames extracted as candidate key frames. Among them, blurry, incomplete, similar, and repeated images were manually deleted. By doing so, we minimized the amount of redundant pictorial input. There were 50 images reserved as the static key frames. Finally, to create the key-framed videos, the full/partial captions were imported via VLC. The duration of each key frame was reset to ensure that all video materials were identical in length.
¹ According to Meyer (2011), the cognitive load of multimodal content can be reduced through "explicit material scaffolding" (e.g. showing only novel or unfamiliar terms or vocabulary) or "embedded material scaffolding" (e.g. showing shortened lexical strings consisting of high-frequency keywords that should be known to the viewers). The former scaffolding can be employed when the purpose of multimodal video viewing mainly concerns (explicit) vocabulary learning; the latter scaffold can be utilized when the purpose is pertinent to multimodal comprehension and to activating students' prior knowledge. Both types of scaffolding aim to reduce the cognitive load through reducing the amount of verbal information, but the latter (i.e. focusing on high-frequency words) is more relevant to the context of this study.
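The key-frame procedure can be illustrated with a small Python sketch. It is not the authors' VLC workflow: the 240-second video length is only inferred from the reported 120 candidates at one frame every 2 seconds, the retained timestamps are hypothetical, and the rule of holding each retained frame until the next one begins is an assumption about how durations were reset.

```python
def candidate_timestamps(video_seconds, interval_seconds=2):
    """Timestamps (in seconds) at which a candidate frame is captured."""
    return list(range(0, video_seconds, interval_seconds))


def keyframe_durations(kept_timestamps, video_seconds):
    """Hold each retained frame until the next retained frame begins.

    Assumed rule (not stated in the study): the durations must sum to the
    original video length so that all materials stay identical in length.
    """
    starts = sorted(kept_timestamps)
    ends = starts[1:] + [video_seconds]
    durations = [end - start for start, end in zip(starts, ends)]
    # The first retained frame also covers any time before its own timestamp.
    durations[0] += starts[0]
    return durations


VIDEO_SECONDS = 240  # assumption: 120 candidate frames x 2 s per frame
candidates = candidate_timestamps(VIDEO_SECONDS)

# Hypothetical manual selection: pretend every fifth candidate survived screening.
kept = candidates[::5]
durations = keyframe_durations(kept, VIDEO_SECONDS)
```

Under this rule, removing a frame simply lengthens the display time of the preceding retained frame, which keeps the key-framed video synchronized with the unchanged audio track.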

Design
The study used a 2 × 2 factorial design in which a verbal factor (i.e. full captions vs. partial captions) was paired with a nonverbal factor (i.e. animation vs. static key frames). For grouping, a randomized block design was used to assign the students of Class A and Class B evenly to the four viewing conditions. First, the average scores of all students' mid-term and final English examinations in the previous two semesters (covering both Class A and Class B) were collected. Second, students were numbered in ascending order of their average scores (from the lowest to the highest). Based on the numerical order, students were sequentially assigned to the four viewing conditions. Figure 3 visually schematizes the grouping process.
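The grouping steps above can be sketched as follows. This is a minimal illustration, assuming that "sequentially assigned" means cycling through the four conditions in rank order; the student identifiers, scores, and condition labels are hypothetical, and the exact procedure is the one schematized in Figure 3.

```python
from collections import Counter

# Condition labels are shorthand introduced for this sketch.
CONDITIONS = ["FC+A", "FC+S", "PC+A", "PC+S"]


def assign_by_rank(mean_scores, conditions=CONDITIONS):
    """Rank students by average score (ascending) and deal them out in turn."""
    # Ties are broken by student id so the result is deterministic.
    ranked = sorted(mean_scores, key=lambda s: (mean_scores[s], s))
    return {s: conditions[i % len(conditions)] for i, s in enumerate(ranked)}


# Hypothetical average examination scores for 62 students.
scores = {f"S{i:02d}": 50 + (i * 7) % 40 for i in range(1, 63)}
groups = assign_by_rank(scores)
sizes = Counter(groups.values())
```

Because adjacent ranks land in different conditions, each group receives a similar spread of prior achievement; with 62 students this scheme yields two groups of 16 and two of 15.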

Video comprehension test
This test consisted of 12 multiple-choice questions, among which four items assessed global understanding, another four gauged local understanding (details or understanding of a particular sentence), and the remaining four required inference making (see Appendix in the supplementary material for example items). Each question was followed by four options, with only one correct answer and three distractors. One point was given for each correct answer, so the participants received a maximum of 12 points. Additionally, to ensure that the participants in the partial information groups were as capable of answering the questions as those viewing with full information, all questions were derived from the information available in the partial captions and key frames. This particular focus on partial information was suggested by Montero Perez et al. (2014), who targeted the passages with keywords in the question items, so that participants' comprehension could be solely attributed to the mode of video presentation rather than the amount of information they received.

Cognitive load questionnaire
After the video comprehension test, all participants completed a 9-item questionnaire (see Table 6) to evaluate their cognitive load during the video viewing session. For measuring intrinsic and extraneous cognitive load, six items were adapted from Leppink, Paas, Van der Vleuten, Van Gog and Van Merriënboer's (2013) subjective rating scale, which has been proved to be a valid instrument for measuring the two types of cognitive load (Sweller, van Merriënboer & Paas, 2019). Furthermore, to probe the participants' germane cognitive load, three items developed and validated by Klepsch, Schmitz and Seufert (2017) were adapted; the original questions were slightly rephrased to conform to the context of this study. The participants were required to rate each statement on a 10-point Likert scale (0 = strongly disagree; 10 = strongly agree).

Procedure
Prior to the study, all participants were given experience watching full-/partial-captioned videos as well as animated/key-framed videos in their regular class hours. This was to avoid confounds arising from unfamiliarity with certain presentation modes. The experiment began with a 5-minute oral instruction, which informed the participants of their assigned viewing conditions and the overall procedure. Next, the participants watched the system-paced video twice, following the repeated viewing practice in Winke et al. (2013) and Teng (2019). Immediately after the two viewings, which took approximately 10 minutes, the participants were given 20 minutes to complete the video comprehension test and the questionnaire. All participants were invited to comment on their responses after they had completed the questionnaire (informal post-study interview).

Descriptive statistics
As shown in Table 1, participants viewing full captions + static key frames attained the highest mean score (M = 8.75), followed by those viewing partial captions + static key frames (M = 8.12) and then partial captions + animation (M = 7.53), whereas full captions + animation (M = 7.47) led to the lowest score. Besides an overview of the participants' performance in each viewing condition, the results showed a higher mean score of video comprehension for full captions (M = 8.13) than for partial captions (M = 7.84) in caption mode, and a higher mean score for static key frames (M = 8.44) than for animation (M = 7.50) in pictorial mode. Table 2 shows each group's performance on different types of comprehension questions (global, local, and inferential comprehension). The participants viewing the animation video consistently scored lower than their counterparts viewing the key-framed video (probably due to higher cognitive load). Notably, seeing key-framed video content with full captions ("FCS") appears to lead to the highest scores on both global (M = 2.93) and local (M = 3.06) comprehension, while for inferential comprehension, seeing the key-framed video with partial captioning ("PCS") yielded better performance (M = 2.88), a result empirically supporting the assumption of this study that partial display of multimodal input, whether verbal or nonverbal, may promote inference making.

Two-way ANOVA analysis
The normality of the participants' overall comprehension scores was assessed through the Shapiro-Wilk test, which indicated no significant departure from a normal distribution (W = .962, p = .054). Levene's test, which assesses the equality of variances across two or more groups, also indicated homogeneous performance of the participants in all four groups. As shown in Table 3, the differences in variance were nonsignificant for all types of comprehension scores. Accordingly, the statistical data fulfilled the prerequisites of normality and homogeneity for further ANOVA analysis. Results of the ANOVA are displayed in Table 4. While caption mode did not cause a significant main effect on any type of comprehension, pictorial mode significantly affected overall comprehension, F(1, 58) = 4.85, p < .05, with a medium effect size (partial η² = .073). This suggested that the presentation of pictorial mode in the video was particularly effective in terms of promoting the participants' well-rounded performance in the major aspects of comprehension, including global, local, and inferential comprehension, and this was true irrespective of whether the video came with partial or full captions. In addition, the interaction effect between caption and pictorial mode also reached significance in local comprehension, F(1, 58) = 4.56, p < .05, also with a medium effect size (partial η² = .073). The Bonferroni test, a set of pairwise t tests adjusted to reduce the chance of obtaining a false-positive result across multiple comparisons, was used to further examine this interaction effect. As shown in Table 5, the test indicated that with static key frames, the participants who viewed key-framed video content with full captions scored 0.625 points higher than the other key-framed video group viewing with partial captions (p < .05). To put this gain (0.625 points) into perspective, participants scored about 25% higher (3.06/4 points vs. 2.44/4 points, a ratio of roughly 1.25) on items assessing local comprehension (extracting details) if they were simultaneously exposed to video content consisting of full captions and static key frames (FCS condition).
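As a worked check on the effect size and the relative gain, the sketch below recomputes both from the reported values. It assumes the standard identity for a one-effect partial eta squared, F × df_effect / (F × df_effect + df_error); the function names are introduced here for illustration and do not come from the study.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def relative_gain(higher, lower):
    """Proportional increase of the higher mean over the lower mean."""
    return (higher - lower) / lower


# Interaction effect on local comprehension: F(1, 58) = 4.56.
eta_interaction = partial_eta_squared(4.56, 1, 58)

# Local comprehension means: FCS (3.06) versus PCS (2.44), each out of 4 points.
gain_local = relative_gain(3.06, 2.44)
```

Here `eta_interaction` rounds to .073, matching the reported interaction effect size, and `gain_local` is about 0.25, meaning the FCS mean is roughly 1.25 times the PCS mean.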

Results of the cognitive load questionnaire
Cronbach's alpha, a measure of internal consistency among a set of test items, indicated high reliability for the subscales of intrinsic (α = .84), extraneous (α = .81), and germane (α = .82) cognitive load. As shown in Table 6, the amount of perceived intrinsic cognitive load (statements 1-3) did not vary considerably among the four groups, all of which showed a similar rating of approximately 4 (out of 10).
Although the participants' responses to statements 1-3 (which deal with intrinsic load) were quite comparable, their average ratings of the statements focusing on extraneous cognitive load (statements 4-6) revealed an interesting picture of the role of nonverbal (pictorial) input in the participants' meaning-making process. Statement 4 shows that the participants who viewed key-framed video content with partial captioning (PCS) tended to agree that their viewing condition was an effective viewing environment for learning about the video content (M = 7.88). Notably, those assigned to seeing the key-framed video with full captions (FCS) exhibited an even stronger tendency (M = 8.50) to agree with the potency of their viewing condition, suggesting that reduction in nonverbal (pictorial) input (seeing videos consisting of static key frames) appeared to direct their attention to what the verbal input (captions) could offer. However, when asked to rate only the effectiveness of caption mode (statement 5), the participants viewing with animation all considered their caption mode (both full and partial captioning) an ineffective scaffold (M = 3.20). This finding corroborated the previous contention that nonverbal pictorial input was a strong attention-getter, insofar as having full access to pictorial details (animation) might have distracted their attention from what the captions could offer, whereas those who did not have access to pictorial details were more likely to see captions as an effective tool (M = 7.00 and M = 6.00, for the FCS and PCS conditions, respectively).
As for the ratings for pictorial mode (statement 6), the participants assigned to the FCS condition (seeing key-framed video with full captions) tended to strongly agree that their viewing environment was an effective scaffold (M = 7.50); those in the PCS condition (seeing key-framed video with partial captions) did not give such a high rating for the same statement (M = 6.76). Both findings from the participants' self-perception data again confirmed that reduction in pictorial images played a prominent role in deciding the participants' perceptions of the usefulness of captions. It is worth noting that when invited to comment on their responses to statement 6 in the informal post-study interview, more than half the participants assigned to the animation condition indicated that they felt they could not appropriately process all the verbal and nonverbal input because "everything happened so fast" and "they only had a fleeting glimpse of the animation". This is indirect evidence that some L2 learners might not have sufficient capacity to process multimodal input simultaneously presented from different channels, in particular the transient input (Kam et al., 2020; Lee et al., 2021).
Note. CI = confidence interval; LL = lower limit; UL = upper limit; S = static key frames; FC = full captions; PC = partial captions; A = animation. *p < .05.
The last three statements targeting germane cognitive load demonstrate that the static key frames might have induced higher germane cognitive load than the animation, irrespective of the concurrent caption mode. As revealed by statements 7-9, the PCS and FCS conditions led to more effort invested in correctly understanding the details and overall content of the video.

RQ1: Effects of caption presentation mode
Based on the quantitative data, full captions, regardless of the concurrent pictorial mode, showed only a marginal advantage over partial captions in L2 viewers' overall comprehension (FC = 8.13 vs. PC = 7.84). This appears to be aligned with the findings of some L2 captioning research (Mirzaei et al., 2017; Yang et al., 2010). In spite of this marginal difference, we did observe a significant impact of full captioning when the purpose of the multimodal learning material targeted local comprehension. In the informal post-study interview, nearly two thirds of the participants viewing the videos with full captions indicated that they preferred full captioning for extracting details, as it provided more semantic and syntactic details not available in partial captioning. One may recall that the participants' local comprehension performance was enhanced by 125% when they viewed the key-framed video that came with full captions. The gain in this viewing condition is by no means insignificant. Accordingly, when the objective of using multimodal video materials concerns the learning of details, key-framed videos with full captions are the advisable viewing material. This result seems to contradict the original stipulations of CTML (which were proposed for the learning of science and math), namely that full transcription of spoken narrative would impose undesirable cognitive loads on learners. However, Mayer, Fiorella and Stull (2020) noted that "when learning from video lessons in a second language ... [many CTML] principles are reversed" (p. 848). This phenomenon was also pointed out by Plass and Jones (2005), who observed that full captions may not be redundant for L2 learners, because "reading and listening ..., in many cases one is used as input enhancement for the other" (p. 480). As a result, for L2 learners, any meaningful verbal input in the full captions, including conjunctions omitted in the partial captions, may have potential value for building a more detailed understanding.
Note. FC = full captions; A = animation; S = static key frames; PC = partial captions.
Note that full captioning, which was beneficial for the participants' local comprehension (getting the details), was not as effective in promoting their overall comprehension; in this study, the participants seemed to rely heavily on nonverbal pictorial input (in particular, the static key frames) for all aspects of comprehension, as borne out by the main effect of the pictorial mode in the ANOVA. This observation is generally aligned with Tragant and Pellicer-Sánchez's (2019) eye-tracking study, which also dealt with L2 learners of a similar (high-intermediate) proficiency profile. Tragant and Pellicer-Sánchez found that although their participants could exploit a variety of meaning-making strategies to extract the gist of multimodal content, they seemed to rely more on the pictorial input, which was probably more effective in providing a coherent gist of the video content, than on text (captions) for overall comprehension.

RQ2: Effects of pictorial presentation mode
While the ANOVA did not detect any main effect of the caption mode on the participants' overall comprehension, it did detect a main effect of the pictorial mode. These findings suggest that the (nonverbal) pictorial presentation mode played a more prominent role than the (verbal) caption presentation mode in enhancing the participants' well-rounded performance in global, local, and inferential comprehension.
Note that this does not unequivocally endorse the use of all types of pictorial input. One may recall that the ANOVA also showed that the participants were better able to take advantage of what full captioning could offer in extracting details (local comprehension) only when it was accompanied by static key frames (Table 4); this advantage disappeared when full captioning was accompanied by animation. It is possible that pictorial input was more effective than verbal (captions) input in attracting L2 learners' attention (see Tragant & Pellicer-Sánchez, 2019); in this case, the transiency of fast-changing dynamic animation might have imposed a high cognitive load that stopped the participants, in particular those who did not have sufficient working memory capacity, from taking full advantage of both the verbal and nonverbal input due to inherent cognitive constraints (see Kam et al., 2020; Lee et al., 2021). In contrast, the key-framed video, which was less cognitively demanding, allowed the participants to process the nonverbal (pictorial) input without taxing their mental resources, which in turn enabled them to further process the concurrent verbal (captions) input. Accordingly, key-framed video content may be more helpful than animation for promoting L2 learners' overall comprehension. This speculation is also corroborated by the participants' qualitative comments in the post-study interview, which revealed that reduction in pictorial input (seeing key-framed video content) shaped the participants' perceptions of their capacity to handle the verbal input.
The above contention does not mean that reduction in both nonverbal (pictorial) input and verbal input presents the most desirable learning environment in all cases. Specifically, we observed that local comprehension scores were lowest when the participants viewed key-framed video content with partial captions. This less desirable performance might be attributed to the simultaneous reduction in both verbal and pictorial input, which might have discarded too much key information from the video content. In this regard, the information was probably too fragmented for the participants to obtain the details of the video content (Teng, 2019).
Taken together, the above discussion lends some support to the effectiveness of key-framed videos in terms of overall comprehension (Table 1); however, if the learning purpose is concerned with extracting details, the key-framed video should be supplied with full captions. Reduction in both verbal and nonverbal (pictorial) input is probably only desirable when the objective of a class is concerned with inference making; as can be seen in Table 2, the participants' inferencing performance was the best when they viewed the key-framed video with partial captioning. With the help of software such as VLC, static key frames can be extracted easily, without too much manual work from the instructor.

RQ3: Perceived cognitive load under different presentation modes
The last research question sheds light on the intrinsic, extraneous, and germane cognitive loads perceived by the participants in different viewing conditions. As reflected by the very close ratings on intrinsic cognitive load, the difficulty level of the video content might have been equivalent for all participants. The similarly perceived intrinsic load also indicated that the participants could maintain comparable cognitive capacity for dealing with extraneous and germane processing, for which the ratings of the four viewing groups showed a larger discrepancy.
For the perceived extraneous cognitive load, the participants' responses are generally consistent with the discussion based on the participants' quantitative performance data. First, the high ratings from participants assigned to the FCS (M = 8.50) and PCS conditions (M = 7.88) for statement 4 showed that, overall, static key frames were deemed more useful than animated content for promoting their learning (M = 4.54 and M = 5.06 for the FCA and PCA conditions, respectively). This pattern is also replicated in the participants' responses to statement 6, which examined their perceptions of the role of animation/images in learning about a given topic. As noted earlier, several participants who viewed the animated video content did not find the transient nature of the pictorial content helpful, especially when they intended to process textual (captions) and pictorial input at the same time. The most interesting picture lies in the participants' responses to statement 5, where they indicated that the usefulness of captions (full and partial) only became manifest when they were accompanied by static frames (rather than animation). This again suggests that pictorial mode mediates the effect of captioning mode, and that the potency of captions is manifested when the participants' attention is not taxed.
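As a rough illustration of this pattern, the statement 4 ratings reported above can be collapsed into unweighted marginal means for each factor. The sketch below is not part of the study's analysis: it assumes that the four cell means can be weighted equally (the actual group sizes are not used), and the names `ratings` and `marginal_mean` are hypothetical.

```python
# Mean ratings for statement 4 as reported in the text.
# Keys are (caption mode, pictorial mode); labels are illustrative only.
ratings = {
    ("full", "static"): 8.50,        # FCS
    ("partial", "static"): 7.88,     # PCS
    ("full", "animation"): 4.54,     # FCA
    ("partial", "animation"): 5.06,  # PCA
}


def marginal_mean(cell_means, factor_index, level):
    """Unweighted mean of the cell means that share one factor level.

    Assumption: every cell counts equally, which ignores group sizes.
    """
    cells = [m for key, m in cell_means.items() if key[factor_index] == level]
    return sum(cells) / len(cells)


# Marginal means for the pictorial factor (index 1) and caption factor (index 0).
static_mean = marginal_mean(ratings, 1, "static")
animation_mean = marginal_mean(ratings, 1, "animation")
full_mean = marginal_mean(ratings, 0, "full")
partial_mean = marginal_mean(ratings, 0, "partial")

pictorial_gap = static_mean - animation_mean
caption_gap = full_mean - partial_mean

print(f"static vs. animation: {static_mean:.2f} vs. {animation_mean:.2f}")
print(f"full vs. partial captions: {full_mean:.2f} vs. {partial_mean:.2f}")
```

Under these assumptions, the gap between the static and animation conditions (about 3.4 points) is far larger than the gap between full and partial captions (about 0.05 points), which is consistent with pictorial mode being the more influential factor in these ratings.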
With regard to germane cognitive load, the questionnaire results indicated that the participants in the key-framing groups perceived higher germane cognitive load than those in the animation groups. In contrast, captioning mode did not seem to be the main factor that determined the participants' germane processing. The participants' ratings also concurred with their inferential comprehension outcomes, where the static key frames led to higher scores (Tables 3 and 5). The result lends support to the theoretical assumption of CTML that more germane cognitive load can elicit more cognitive effort toward deeper processing, including inference making based on available information.

Pedagogical implications
As revealed in this study, the key-framed video yielded better overall comprehension with full captions. Similarly, if the pictorial backdrop of a video is less dynamic and rather monotonous (e.g. TED Talks and slow-paced documentaries), full captions would be cognitively manageable and more supportive for L2 viewers. In particular, for educational programs that often need to build up linguistic scaffolds for L2 learners within limited instructional hours, such as remedial courses and content and language integrated learning lessons, key-framed videos with full captions can be a useful multimedia learning material. When a program's instructional goal is to promote learners' overall comprehension, the results of this study suggest that key-framed videos or equivalent pictorial presentations would be more desirable; key-framed videos, in particular those accompanied by full captioning, would be helpful for promoting learners' balanced performance in all aspects of comprehension. With the slow-changing backdrop of the video, L2 (remedial) learners can focus on the captions in a more consistent manner, which may help them to process the text more deeply and achieve better content learning. Notably, as the participants commented, such pictorial reduction via static key frames was preferable to verbal reduction via partial captions.
When employing key-framed videos, L2 instructors can enhance students' extraction of details through the inclusion of full captions. Preparing such materials may involve some manual work, but if it can significantly enhance learners' performance in extracting and retaining more details, this additional effort, which can be made easier through the use of VLC or MonkeyLearn, is probably worth considering.
Finally, if the objective of a class is concerned with inferential comprehension or learning (e.g. speculating about the speakers' attitudes and predicting the subsequent development), instructors may want to use key-framed video materials with partial captions. The reduction in both verbal and nonverbal input may create desirable difficulty that prompts L2 learners to generate reasonable inferences and guesses and to fill in the comprehension gaps. Knowing how learners benefit from different verbal and nonverbal input presentation modes will help L2 instructors better prepare multimedia materials for different comprehension purposes.

Conclusions and limitations
Since multimedia learning requires very complex mental processes, the results of this one-time experiment based on manipulating a captioned animation should be interpreted with care. The major limitation is that this study focused only on the effects of animation and captions (and their reduced modes), so other types of multimedia presentation remain to be examined. For nonverbal display, videos that include real-life scenarios may not draw the viewer's attention in the same way as animation and static key frames do. Likewise, other types of textual support, such as annotations and bilingual subtitles, may activate different verbal processing strategies. Future research can thus shed light on the efficacy of these verbal/nonverbal elements commonly seen in videos.
Supplementary material. To view supplementary material referred to in this article, please visit https://doi.org/10.1017/S0958344022000088
Ethical statement and competing interests. The authors declare that there is no conflict of interest. The participants were informed about the experiment details and agreed to participate without any duress. All participants understood their rights, and their participation consent was obtained before this study. Results of experiments were immediately substituted with ID codes after the information was analyzed. Confidentiality of data is thus ensured.