Introduction
Artificial intelligence (AI) refers to the development of computer systems that can perform tasks typically requiring human intelligence, such as problem-solving, learning, and language comprehension and production. AI technologies such as automated writing assessment systems, spoken dialogue systems, and intelligent personal assistants (IPAs) are increasingly applied in language teaching and assessment and have the potential to transform second language (L2) education (Aryadoust, forthcoming; Aryadoust & Zakaria, forthcoming; Dizon, 2020; Huang et al., 2023; Shi & Aryadoust, 2024; Tai & Chen, 2023). As AI becomes an integral part of language education, language teachers, learners, and researchers will have to understand how to leverage it for better outcomes in their respective roles, particularly for listening, speaking, and oral interaction, which are the focus of this paper. At the same time, it is useful to keep in mind issues concerning the adoption and use of AI, such as the place of real human interaction in oral language learning, the complex nature of spoken language competencies, the role of teachers, and matters of ethics, equity, and accessibility that also affect language teaching and assessment.
Among the most exciting AI advancements is generative AI (GenAI), which has shown promise in creating interactive language learning experiences and personalized assessment tools (Aryadoust, forthcoming; Aryadoust & Zakaria, forthcoming). GenAI is a subset of AI that uses generative models to produce new content, such as text, images, audio, and video, based on patterns learned from existing data. GenAI models are trained to understand the underlying structures of their training data so that they can generate novel outputs in response to specific prompts. They take various forms, such as text generation models like OpenAI’s GPT (generative pre-trained transformer) series, image generation tools like DALL-E and Midjourney, video generation models, and text-to-speech models (Gozalo-Brizuela & Garrido-Merchán, 2023).
OpenAI’s ChatGPT, one of the most well-known GenAI systems, has to date been used in language learning primarily for content generation, feedback, and teaching support, while the predominant uses of prompts have been context specification (i.e., providing detailed information regarding the topic, target audience, etc.) and response customization (adjusting the tone or structure of the response) (Yang & Li, 2024). ChatGPT and similar GenAI models such as Claude and Google Gemini are based on a technology called transformers, which process and understand language by analyzing words in relation to each other (Vaswani et al., 2017). Unlike older natural language processing (NLP) technologies that relied on rigid rules or limited context windows, transformers enable GenAI systems to generate coherent and meaningful text by capturing long-range dependencies and contextual relationships, making them more effective for applications like chatbots and translation tools (Yang et al., 2023). Due to this capability, GenAI systems have also proven valuable for L2 listening, speaking, and oral interaction. Examples of GenAI application include material development in listening assessment (Aryadoust et al., 2024) and use as a companion or foreign language learning partner for speaking (Belda-Medina & Calvo-Ferrer, 2022; Zhou, 2023). Despite the increased adoption of GenAI, its applications remain in the developmental stage.
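The attention computation at the core of transformers can be illustrated with a short, self-contained sketch. The code below is our own illustration (in Python with NumPy; all variable names and sizes are arbitrary), not drawn from any cited system. It computes scaled dot-product self-attention for a toy sequence of word vectors, showing how each word’s output representation is a weighted mix of every word in the sequence:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of word vectors.

    Each output row mixes every value vector, weighted by how strongly
    the corresponding words relate to each other -- the mechanism behind
    the 'long-range dependencies' that transformers capture.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise word similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # 5 "words", 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

Stacking many such attention layers, with learned weight matrices and positional information, is what allows transformer models to capture the contextual relationships described above.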
In this paper, we review the application of AI in general and examine, in particular, the role of GenAI in L2 listening and speaking, which has garnered a great deal of interest from educators and researchers alike. We discuss the application of GenAI technologies including personal assistants, large language models (LLMs) used in, inter alia, text production and comprehension, and text-to-speech AI. We also explore possibilities and challenges of using these GenAI technologies for assessing and developing L2 listening and speaking competence and offer recommendations for future work with AI and for engaging with its spread. The paper is organized into the following main sections: (1) Listening and speaking competence; (2) Listening and GenAI; (3) Speaking and GenAI; (4) GenAI for assessing listening and speaking; (5) GenAI for teaching listening and speaking; and (6) Future research directions.
Listening and speaking competence
With the rapid advancement of AI systems, the expectations for using AI in L2 education will continue to mount for both teachers and learners. In recent years, studies have examined the effectiveness of readily available AI tools such as Google Assistant for language learners’ speaking (Hadi & Junor, 2022; Tai & Chen, 2023) and listening (Tai & Chen, 2024), with positive outcomes. Most studies use test results as an indication of improvement; few tease out the specific listening, speaking, and interaction skills that AI tools contributed to. The use of any technological tool for teaching should be underpinned by teachers’ understanding of the nature of the competence being developed and suitable pedagogy for teaching it. Here we describe the cognitive processes and enabling skills needed to achieve effective listening, speaking, and oral interaction competences. An understanding of what each of these sets of competences constitutes should form the theoretical foundation for using AI, be it in teaching and learning or assessment. A theoretical description of the competence, or the construct, should also provide the conceptual backdrop for evaluating the functionality, suitability, and potential uses of AI in spoken language learning and assessment.
Listening
Second language listening comprehension processes are broadly distinguished between bottom-up and top-down processing according to the types of knowledge being applied (Aryadoust & Luo, 2022). Bottom-up processing, also known as decoding, requires the use of phonological and vocabulary knowledge to segment word boundaries and recognize words (Aryadoust, forthcoming; Field, 2008). Top-down processing makes use of other kinds of knowledge, such as background factual knowledge and knowledge about the use of language in particular cultures, to make inferences and elaborate an emerging interpretation into an acceptable whole (Goh & Vandergrift, 2022). Listeners process input linguistically, semantically, and pragmatically, using multimodal cues such as nonverbal cues, text, and images to augment their comprehension processes according to the purpose for listening (Aryadoust, forthcoming; Rost, 2025). Listening processes are affected by several factors, including cognitive and affective differences of L2 listeners, such as linguistic knowledge, prior knowledge, working memory, and metacognition, as well as variations in listening contexts created by the environment, speaker, task, and text features (Goh & Vandergrift, 2022).
L2 listening as a cognitive process can be further understood by considering Anderson’s (2010) cognitive framework for language comprehension, which comprises three partly ordered, but overlapping, phases: perception, parsing, and utilization. While decoding occurs at the perception phase, the processing of meaning through lexico-syntactic (grammar and vocabulary) knowledge occurs at the parsing phase, where the words in the input are transformed into a mental representation of the message. In the utilization phase, listeners may do one or more of the following: embellish the mental representation with other cues for a more complete interpretation, respond to the message, or store it for later retrieval. These processes would be more challenging for L2 listeners because of language constraints that may affect the efficiency of their working memory and hence, listening achievement (Kormos & Sáfár, 2008).
To direct their attention to the aural input, listeners must use enabling skills, or specific ways of listening, such as listening for details, listening for global understanding, listening for gist, listening and predicting, listening and inferring, and listening selectively. These skills are necessary for one-way listening (when speaking is not involved, e.g., listening to a talk, listening to an audio podcast, watching a movie or an online video) as well as during oral interactions. Listening is also a social process, as seen in oral interactions where the listener alternates between being a listener and a speaker. During oral interaction, two-way or interactive listening skills are needed, and they work in tandem with various speech functions to facilitate comprehension, for example, to clarify and confirm what is said as well as to ask for and give explanations (Goh & Vandergrift, 2022). Although often subsumed under speaking, interactive listening skills should be an important part of teaching (Ryan, 2022) and assessing L2 listening (Aryadoust & Luo, 2022; Lam, 2021, 2024).
Speaking
A dominant theory of second language speech production holds that speech is the outcome of three connected processes: conceptualization, formulation, and articulation (Bygate, 1998; Goh & Burns, 2012; Kormos, 2006; Newton, 2018), a view based on Levelt’s (1989) model of L1 speaking (Kormos, 2006). In conceptualization, or conceptual preparation, speakers establish their speaking goal and select the information to be conveyed. This may be done in real time or as preparation before saying anything. Formulation, or lexical and grammatical encoding, puts ideas together more precisely in utterances. Articulation is the “visible” process in which speakers express their ideas audibly; it involves phonologically encoding into speech words that until then exist only in the mind. Articulation can be delayed after conceptualization and formulation, as when individuals prepare for a talk.
Hughes and Reed (2017) emphasize that speaking is situated in specific social contexts of interaction and is complex, drawing on “oral/aural, cognitive, processing, pragmatic, interpersonal, cultural, and motor skills simultaneously” (p. 83). This is challenging for L2 learners who are learning to use language appropriately to convey their intentions and meaning according to their communicative purpose, as highlighted in the functional-notional approach (Finocchiaro & Brumfit, 1983; Munby, 1978). This approach delineated the functions that spoken and written language fulfill in everyday discourse, such as requesting, apologizing, explaining, describing, and so forth. The identification of speech functions and of how to use spoken language to express them continues to be relevant in syllabuses and resources for teaching speaking today.
Speaking typically occurs with another person present. This element of interactivity has implications for how much and how often language learners can practice their speaking outside class (Newton, 2018) and how they are assessed for speaking (Hughes & Reed, 2017). Oral interactions can be social or transactional in nature and require participants to alternate between speaking and listening, albeit to different extents depending on the social context (Burns et al., 1996). Learners learn skills to manage an interaction, for example, initiating, maintaining, and closing conversations; offering and taking conversational turns; modifying and changing topics; directing the focus of an interaction; clarifying meaning; and recognizing and using verbal and nonverbal clues (Goh & Liu, 2024, p. 14). L2 oral interactions can be challenging because learners must think on their feet, and the processes of conceptualization, formulation, and articulation may be compressed. They need to use communication strategies to cope with speaking and listening difficulties and to move the interaction forward by giving appropriate responses as well as by using strategies such as clarification requests and back-channeling signals to indicate comprehension (Nakatani, 2006). Interlocutors also work interactively to resolve communication problems arising from comprehension difficulties through negotiation of meaning (Long, 1996). Given the cognitive and affective complexities of speaking, some L2 learners may experience high anxiety that affects their willingness to communicate, that is, their readiness to engage in conversations with others (MacIntyre et al., 1998).
Their self-perceptions of their ability and willingness to speak can be affected by classroom situations such as the use of pair and small group work, teachers’ use of the language, preparation time, and task complexity (Humphries et al., 2015).
Listening and GenAI
Listening has received less attention in the application of GenAI compared with speaking and other language skills (Aryadoust, forthcoming; Bibauw et al., 2022; Huang et al., 2023). One technology that has been used in previous studies is the IPA, such as Siri, Xiao Ai, Celia, Alexa, and Google Assistant. IPAs are AI-powered software applications designed to perform tasks, provide information, and assist users through voice- or text-based interactions. Google Assistant via Google Nest Hub has been reported to improve L2 listening performance and enjoyment for high school students (Tai & Chen, 2024), while listening tests created using AI were helpful to learners’ listening development because of the nuanced evaluation and individualized feedback provided (Abdellatif et al., 2024). Integrating deep learning models for automated speech recognition and text-to-speech conversion with ChatGPT has been proposed as a useful technology for enhancing listening practice (Xing, 2023). The impact of such multimodal GenAI systems, however, has yet to be established. Perception studies on learners’ views about AI mobile applications for listening practice showed learners to be positive and to believe that using such apps had helped improve their listening skills (Azzahra et al., 2024; Suryana et al., 2020).
Research examining whether AI technologies such as IPAs improve listening comprehension is emerging and has mainly shown an absence of significant outcomes (Bibauw et al., 2022; Dizon, 2020). However, Tai and Chen’s (2024) study yielded some positive outcomes. Their study involved a large sample of 92 high school students and used an experimental design to examine the influence of IPAs, differentiated by the presentation modes of responses (audio and visual, screen-based responses, and audio only) as well as learners’ preference for IPA interaction styles. Both experimental groups receiving Google Assistant responses outperformed the control group, which listened only to CD recordings. The multimodal presentation of IPA responses was found to benefit learners more, partly due to the presence of clues from text, images, and videos that enabled inferencing of missing information. Students welcomed the IPA’s role as an interlocutor, as it provided a conversation partner for practicing their listening. What is not apparent from the study, however, is what kinds of conditions were created for the use of specific listening skills, such as listening for details, listening to infer, and so forth.
Research on listening assessment offers further insights into the benefits and limitations of using AI. Two studies are reviewed here. The first, by Runge et al. (2024), aimed to address limitations in traditional language assessments by using LLMs to develop interactive listening tasks for the Duolingo English Test. These tasks were designed to assess interactional competence and provide a more authentic evaluation of listening skills compared to earlier automated methods. The authors further implemented a human-in-the-loop approach to create diverse content at scale; such an approach typically means that human experts actively supervise, review, or refine AI-generated content. The authors reported that the pilot study, involving 713 tasks and hundreds of responses per task, demonstrated the practicality of this method for enhancing large-scale language assessments.
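As an illustration of what a human-in-the-loop pipeline can look like in code, the sketch below is a minimal illustration of ours: the function names, data fields, and review logic are hypothetical and are not drawn from Runge et al. It drafts candidate items and routes each one through an expert review step before the item can enter an item bank:

```python
from dataclasses import dataclass

@dataclass
class Item:
    prompt: str
    approved: bool = False
    notes: str = ""

def generate_candidates(topics):
    # Stand-in for an LLM call that drafts one listening task per topic.
    return [Item(prompt=f"Dialogue task on: {t}") for t in topics]

def human_review(item, reviewer):
    # The human-in-the-loop step: an expert approves or rejects each
    # AI-drafted item before it can enter the live item bank.
    verdict = reviewer(item)
    item.approved = (verdict == "approve")
    if not item.approved:
        item.notes = verdict  # keep the reviewer's reason for rejection
    return item

def build_item_bank(topics, reviewer):
    return [item for item in (human_review(c, reviewer)
                              for c in generate_candidates(topics))
            if item.approved]

# Toy reviewer that rejects one off-syllabus topic.
bank = build_item_bank(
    ["ordering food", "visa interview", "campus tour"],
    lambda item: "approve" if "visa" not in item.prompt else "off-syllabus",
)
```

In practice the reviewer would be a human rater working through a queue, but the structure (generation, mandatory review, filtered acceptance) is the same.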
Despite its innovative approach to task generation, Runge et al.’s (2024) investigation has several limitations. First, a “predetermined” and “non-branching” dialogue path was applied wherein the dialogue does not evolve based on previous answers, something that one would expect to get in a natural conversation during an oral interaction. Instead, test-takers must select the correct option from a set of choices to complete the task. The authors indicated that this addressed the challenge of creating interactive large-scale listening assessments, ensuring (self-)consistency, reducing unpredictability and risk of hallucination, and streamlining the process of item generation and review. However, the design limits the ability of the task to simulate real-world communication, where conversations are dynamic and reactive. In real-world target language use (TLU) domains, interlocutors adjust their responses based on prior turns, which requires active negotiation of meaning and flexible language use.
Second, the format may also jeopardize the cognitive validity of the assessment (Field, 2013), as test takers are primarily focused on selecting the “correct” option rather than engaging in the deeper cognitive processes involved in authentic communication. The cognitive and affective processes involved in this form of discrete-point assessment and those in a TLU domain seem to be significantly different, thus misaligning the cognitive demands of the test with those required in real-world situations. It appears that, in addition to the above-mentioned reasons, the decision to develop a non-dynamic system stems from practicality as well as the limitations of psychometric analysis, where test designers must, for example, ensure that task difficulty remains consistent across all administrations. In our view, the main question is whether, or to what extent, the requirements of pre-GenAI listening assessments, such as psychometric modeling for rank-ordering test takers, are still applicable in the current context of GenAI-designed listening assessments. GenAI promises greater interactivity and “personalization,” and therefore raises the possibility of redefining traditional constraints and rethinking how authenticity and validity are conceptualized in assessment design. Therefore, it is not recommended that GenAI-based assessments be strictly aligned with the conventional standards of traditional psychometric-based approaches.
The second study, by Aryadoust et al. (2024), aimed to develop scripts and test items for listening assessments. It investigated how GenAI adapts its outputs based on task complexity and thematic variation, exploring the lexical and grammatical distinctions in texts generated at varying levels of difficulty, such as advanced or basic, and examining how generated test items differ in their thematic focus and content similarity. The authors adopted a progressive hint prompting (PHP) method to generate texts and thereafter multiple-choice questions (MCQs) to assess listening comprehension. PHP involves engaging with ChatGPT-4 by using its earlier responses as cues and refining subsequent prompts iteratively to progressively obtain the desired output. Overall, 96 scripts at different difficulty levels and 384 MCQs measuring “the main idea, purpose, information details, and inferred information” (p. 4) were generated, and their difficulty levels were compared. The researchers found that while ChatGPT-4 was able to develop texts properly tailored to the four different levels of proficiency, and the MCQ stems were well designed, the MCQ options were inappropriate, as most of them could be eliminated by using one’s world knowledge or by comparing them with the other options. One common feature between this study and Runge et al.’s (2024) study is their attempt to use GenAI to create test items that would be appropriate for inclusion in conventional assessment, specifically MCQs. While MCQs have been a useful test method in listening assessment and beyond, the questions of authenticity and cognitive validity raised earlier also apply to this test format.
In other words, MCQs elicit cognitive processes that may differ from those involved in real-world language comprehension (Rupp et al., 2006), thereby weakening their cognitive validity (Field, 2013). Additionally, the MCQ format does not reflect the structure of authentic language tasks that users typically encounter in most TLU domains. Another limitation of Aryadoust et al.’s (2024) study is the relatively small sample size of the texts generated, even though the test items made a sufficiently large dataset for statistical analysis and inference.
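The progressive hint prompting loop described earlier can be sketched in code. This is our own schematic: the prompt wording, round count, and the `ask_model` stub are illustrative assumptions, not the exact protocol of Aryadoust et al. The essential idea is that each model response is fed back into the next prompt as a hint:

```python
def progressive_hint_prompting(ask_model, base_prompt, rounds=3):
    """Iteratively re-prompt a model, feeding each answer back as a hint.

    `ask_model` stands in for a call to a chat model (e.g., an API
    client); here it is any function mapping a prompt string to a
    response string.
    """
    hints = []
    answer = ""
    for _ in range(rounds):
        hint_text = (" Hint (your previous answers): " + "; ".join(hints)
                     if hints else "")
        answer = ask_model(base_prompt + hint_text)  # refine with hints
        hints.append(answer)
    return answer, hints

# Stub model that labels its draft by how many hints it was shown.
def stub_model(prompt):
    return f"draft-{prompt.count(';') + (1 if 'Hint' in prompt else 0)}"

final, history = progressive_hint_prompting(
    stub_model, "Write a B1-level listening script about travel.")
```

With a real chat model, `ask_model` would wrap an API call, and the refinement prompts would also carry instructions about level, length, and topic.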
Overall, the application of GenAI in listening assessment is a relatively new development, and research in this area remains in its early stages (Aryadoust, forthcoming). While the above two studies highlight the potential of GenAI in creating listening tests, they also underscore limitations such as the lack of dynamism in interactions, small sample sizes, issues with authenticity, and challenges in achieving cognitive validity. For language teaching, there is a vast untapped potential to leverage GenAI not only for innovative assessment designs but also for creating interactive and personalized learning experiences that align closely with real-world communication demands. Aryadoust (forthcoming) has proposed several methods for leveraging GenAI in listening and speaking assessment, particularly for purposes such as assessment for learning (AfL) and assessment as learning (AaL), which will also be discussed in the following sections.
Speaking and GenAI
The integration of AI into language learning and assessment has a history that predates the current advancements in GenAI. AI has been utilized in automated evaluation systems (Shi & Aryadoust, 2024) and spoken dialogue systems employed in language learning and assessment (Bibauw et al., 2022). Spoken dialogue-based systems, in which humans interact verbally on a turn-by-turn basis with computer systems, alongside chatbots and virtual agents, show promise for language learning and assessment. While older systems equipped with “simpler” NLP algorithms were primarily task-oriented (Deriu et al., 2021) and could only focus on specific tasks for which they were trained, such as assessing communication skills in tourism domains (Karatay, 2023), recent advancements in transformer-based LLMs and text-to-speech AI have significantly expanded the versatility of dialogue systems (Xu et al., 2024).
AI personal assistants such as chatbots have been examined for their effectiveness in speaking practice. Chatbots with voice facilities can simulate the output of a human conversation partner. Not only do these programs respond to questions in language that sounds natural, but they can also recognize emotions in words and respond to them appropriately. Although they are not human interlocutors, many language learners enjoyed using them and reported greater motivation and confidence in practicing their speaking.
The positive impact of spoken dialogue systems on learners’ willingness to communicate has been consistent (see Ayedoun et al., 2019; Fathi et al., 2024; Huang & Zou, 2024; Rad, 2024; Tai & Chen, 2023; Zhang et al., 2024). The study by Ayedoun et al. (2019) stands out for its use of embodied conversational agents (ECAs), which were augmented with communication strategies and nonverbal back-channels, such as gaze, nods, and smiles, to encourage learners to continue speaking. While the ECA’s use of these back-channels may not constitute what the authors claimed to be “empathic care” (p. 53), the study nevertheless showed that spoken dialogue interfaces can be programmed with more humanlike qualities to promote learners’ readiness to engage in conversations. Other studies (Bashori et al., 2021; Zhang et al., 2024) have found that spoken dialogue systems can benefit learners affectively by lowering their anxiety. When learners’ confidence and interest in speaking increased, more opportunities for practice and language use were created. In an experiment comparing IPA- and non-IPA-based speaking instruction, learners who practiced with IPAs improved their comprehensibility and reduced their accentedness more than their non-IPA counterparts, but both groups showed improvements in their overall speaking based on several assessment criteria, with no significant differences in fluency (Fathi et al., 2025).
Many language learners in countries with limited exposure to the target language seek out competent speakers of the language to practice their speaking and listening. With the successes reported in the emerging research mentioned above and the increased availability of AI-powered tools, language learners will have more access to GenAI systems as conversation partners. In addition, the transformative power of GenAI such as ChatGPT has continued to add more humanlike interactional features to human–computer interactions. LLMs, which are advanced AI systems trained on extensive datasets to generate contextually appropriate and coherent text, are unlike traditional NLP systems that often rely on rule-based or task-specific algorithms. LLMs use attention mechanisms and transformers to understand and respond to a wide variety of user inputs, thereby enabling more dynamic and humanlike interactions (Brown et al., 2020). Their ability to generate precise, context-aware responses makes LLMs particularly effective in open-ended applications such as conversation tasks and language learning, where traditional NLP systems often struggle to adapt to diverse topics or provide rich interaction (Xiao et al., 2024; Xu et al., 2024). For example, ChatGPT combines LLMs with text-to-speech and image-generation GenAI, creating multimodal capabilities for dialogue-based learning and assessment on virtually any topic permitted under OpenAI’s terms and conditions (Aryadoust & Zakaria, forthcoming).
Xiao et al. (2024) conducted a review of AI-based conversational agents (another term for spoken dialogue systems) in language learning and identified three areas of application for these agents: “general communication practice, task-based language learning, and structured pre-programmed dialogue” (p. 303). First, the studies they reviewed demonstrate that these dialogue systems can improve speaking skills and overall language proficiency, with some systems performing as effectively as first language English speakers in enhancing oral proficiency (e.g., Aryadoust & Zakaria, forthcoming). Second, the authors found that learners generally perceived these tools as humanlike and showed a clear preference for one-on-one interactions, especially with tools providing audio-visual and screen-based feedback, which could suggest that learners perceive them as having anthropomorphic qualities. However, the studies did not systematically evaluate whether AI-driven conversation tasks contribute to actual linguistic development, making it difficult to assess their pedagogical efficacy beyond student engagement. Finally, studies also showed that these systems support young children’s literacy and content and vocabulary learning, with English learners benefiting the most. Overall, learners appreciated these systems for their simplicity, consistent interactions, and capacity to meet their need for practicing English-speaking skills, particularly in settings where access to first language speakers is limited. However, the systems reviewed by Xiao et al. (2024) demand substantial effort in design and programming, which accounts for their limited use but highlights their promising potential for future research and practical applications.
Xiao et al.’s (2024) findings align closely with Bibauw et al.’s (2022) meta-analysis, which identified a significant medium effect of spoken dialogue systems on language learning across 11 studies. This suggests that these systems can play a meaningful role in enhancing language acquisition outcomes. Similarly, Xu et al. (2024) fine-tuned several LLMs on a variety of topics and discovered that the 7B and 14B parameter models (i.e., AI language models with 7 billion and 14 billion parameters, respectively, designed for NLP tasks such as text generation, conversation, and summarization) were not only adept at handling conversations on the topics they were fine-tuned for but also demonstrated the ability to engage on new, unrelated topics. Xu et al. (2024) claimed that this represents a significant advantage compared to task-oriented educational dialogue systems, which usually rely on experts to predefine the scope and structure of the dialogues.
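To give a rough sense of what such parameter counts imply for computational resources, the back-of-envelope calculation below (our own, not taken from the cited study) estimates the memory occupied by the model weights alone:

```python
def model_memory_gb(n_params_billion, bytes_per_param=2):
    # Half-precision (fp16/bf16) weights use 2 bytes per parameter.
    # This counts weights only, ignoring activations, the key-value
    # cache, and any optimizer state needed for fine-tuning.
    return n_params_billion * 1e9 * bytes_per_param / 1e9

mem_7b = model_memory_gb(7)    # 7B model: 14 GB of weights at fp16
mem_14b = model_memory_gb(14)  # 14B model: 28 GB of weights at fp16
```

Figures of this magnitude help explain why fine-tuning even the smaller models discussed here typically requires dedicated graphics processing units rather than ordinary classroom hardware.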
Fine-tuning refers to the process of taking a pre-trained AI model and further training it on a specific dataset to improve its performance for a particular task (see Prottasha et al., 2024). While fine-tuning an LLM appears beneficial and appealing, it comes with certain limitations, including the need for substantial computational resources (e.g., graphics processing units), the risk of overfitting to the fine-tuned data, and the potential for the model to lose some of its generalization ability (see Lv et al., 2024; Prottasha et al., 2024). In addition, for many teachers and researchers in language learning and assessment, coding and fine-tuning skills may not be readily accessible. Therefore, leveraging existing systems that require minimal or no technical expertise to create dialogue systems becomes a more practical and desirable solution. In this regard, Aryadoust and Zakaria (forthcoming) described a structured process for developing and testing a GPT-based system tailored for beginner English learners. The process begins with defining task specifications, including user proficiency levels, discussion topics, tone, and feedback types. Proficiency levels in the study ranged from beginner to advanced, with the system adapting based on learner progress. The authors designed the tone to mimic a teacher’s warmth and guidance, while feedback focused on promoting global understanding, pronunciation, grammar, and vocabulary, particularly when errors hindered comprehension. The GPT creation further involved an interactive process where specifications were input incrementally to ensure precise implementation and responsiveness of the system.
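A task specification of this kind can be entered into a no-code builder as plain-prose instructions. The following is a hypothetical sketch of what such a specification might look like, not the authors’ actual wording; the role, topics, and feedback rules are illustrative assumptions:

```
Role: A warm, encouraging English conversation teacher for beginner learners.
Proficiency: Start at beginner level; move toward intermediate as the learner progresses.
Topics: Everyday subjects such as food, school, hobbies, and travel.
Tone: Patient and supportive, mimicking a teacher's warmth and guidance.
Feedback: Comment on global understanding first; address pronunciation,
grammar, and vocabulary only when errors hinder comprehension.
```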
The resultant GPT, named “English Practice Buddy,” provided supportive feedback through both spoken and written formats, encouraging learners with constructive suggestions and positive reinforcement. Testing included engaging with different English varieties, such as Singlish, demonstrating the GPT’s adaptability in understanding and responding appropriately. Feedback levels included explicit corrective feedback, implicit recasts, and self-regulation advice, progressively tailoring responses to learner needs. The iterative design and feedback system exemplify a useful approach to building AI tools for language learning using ChatGPT’s capabilities.
The studies reviewed above demonstrate the unprecedented capabilities of GenAI in teaching and assessing speaking and communication skills, particularly through spoken dialogue systems. It is therefore reasonable to say that GenAI is bringing significant changes to the fields of language learning and assessment. Commercialized LLMs like ChatGPT have integrated automated speech recognition and text-to-speech functionalities. OpenAI has also provided the GPT Builder, which enables users to develop their own custom GPT without any coding requirements. In addition, users can customize the voice and emotional undertones of the conversational AI. For instance, OpenAI’s ChatGPT allows users to specify their desired tone and voice, such as preferring concise, straightforward language or a conversational and approachable style. This easy customization allows users to focus on developing the technical aspects of speaking tasks, including domain definition and theoretical frameworks, content and language, alignment with target language use (TLU) domains (in proficiency testing) or curricula (in formative assessments), evaluation criteria, as well as considerations of fairness, equity, and inclusivity. However, a caveat of using these systems is the recording of users’ language and voice data, which raises privacy and legal issues. In some jurisdictions, disclosing or storing student data – especially when involving third-party platforms – may not only require strict compliance with data protection regulations but could also be illegal. This issue is particularly concerning for young learners, who are more vulnerable to privacy risks. Therefore, scrutiny of data security policies, legal frameworks, and ethical implications is essential before implementing such technologies in educational settings.
The question of hallucination is also relevant. Hallucination, or confabulation, refers to GenAI’s tendency to generate factually incorrect or unreliable information. Studies have shown that while AI-generated hallucinations stem from limitations in training data and probabilistic text generation (Ji et al., 2023), humans also engage in similar behaviors, such as misremembering details or fabricating information in conversation (Roediger & McDermott, 1995). Although human false memories and fact fabrication differ in nature from GenAI’s confabulation, hallucination in GenAI systems should not be evaluated as an isolated flaw but as a characteristic that, while needing attention, can also be observed in human communication.
GenAI for assessing listening and speaking
GenAI systems can be leveraged as efficient tools for developing language tasks for assessing listening and speaking. Over the past two years, GenAI systems such as ChatGPT have developed multimodal capabilities. ChatGPT can now hear (process audio input), see (analyze images), and speak to users, depending on the version and platform being used. This allows for creating assessment tasks that simulate oral interaction and listening tasks. Aryadoust and Zakaria (forthcoming) outlined steps for developing speaking assessment tasks for young English learners, highlighting the capabilities of GenAI systems like ChatGPT in oral interactions. At present, these systems can be customized for specific assessment tasks by incorporating task descriptions in prose format, selecting a voice, and implementing the design seamlessly. For example, one of the authors (see https://youtu.be/lDNuqfnLdjA) developed a GPT-powered speaking examiner by using ChatGPT to administer a speaking test. In this GPT, the AI examiner asks five questions on a topic chosen by the test taker, followed by a 4-minute monologue prepared and delivered by the test taker. The AI examiner then provides both qualitative and quantitative feedback. The test takers’ responses are transcribed in real time, providing the user or teacher with valuable material for further linguistic and content analysis. For instance, by inputting the transcripts into ChatGPT again, one can request a semantic analysis or an evaluation of vocabulary sophistication and other linguistic features (see Aryadoust, forthcoming, for details).
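Under the hood, a custom GPT of this kind amounts to a prose task description attached to a chat session. The sketch below is a minimal illustration, not the author’s implementation: the instruction text, function names, and scoring scale are assumptions, and a real system would send these messages to a provider’s API and transcribe audio automatically rather than build everything locally.

```python
# Hypothetical configuration of a GPT-based speaking examiner: a prose task
# description attached as a system message, plus a transcript log that can
# later be reused for linguistic and content analysis.

EXAMINER_INSTRUCTIONS = (
    "You are a speaking examiner. Ask the test taker five questions on a "
    "topic of their choice, then listen to a 4-minute prepared monologue. "
    "Afterwards, provide qualitative feedback and a score from 1 to 6 for "
    "fluency, vocabulary, grammar, and pronunciation."
)

def build_examiner_session(topic: str) -> list[dict]:
    """Assemble the opening message sequence for one test administration."""
    return [
        {"role": "system", "content": EXAMINER_INSTRUCTIONS},
        {"role": "user", "content": f"My chosen topic is: {topic}"},
    ]

def record_turn(spoken_text: str, log: list[str]) -> str:
    """Stand-in for real-time transcription: store each spoken turn so the
    teacher can later analyze vocabulary sophistication or content."""
    log.append(spoken_text)
    return spoken_text

# Usage: open a session and log the test taker's first turn
transcript_log: list[str] = []
session = build_examiner_session("urban wildlife")
record_turn("I would like to talk about foxes in cities.", transcript_log)
```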
A distinctive advantage of these GPT-based examiners over traditional spoken dialogue systems built on simpler NLP lies in their underlying LLM engines, which give them remarkable versatility (Aryadoust, forthcoming). LLMs enable meaningful conversations on virtually any topic, though there is a potential for confabulated information. Moreover, ChatGPT and similar systems can be fine-tuned and personalized by educators and examiners to simulate authentic scenarios that reflect real-world TLU domains, such as lectures, situational conversations, professional discussions, and (in)formal meetings, providing learners with immersive and contextually relevant practice opportunities. Aryadoust (forthcoming) has introduced a fine-tuning method termed “double-AI in the loop,” which is based on API (application programming interface) platforms and leverages the adaptability of LLMs such as GPT and Claude, capitalizing on their advanced learning capabilities.
Finally, there are several multimedia GenAI systems, such as HeyGen, that can be used for developing multimodal listening assessments. HeyGen enables the design of tasks such as watching virtual presenters deliver instructions in multiple languages or interpreting video scenarios with synchronized speech and lip movements. For example, learners can watch a simulated speech and then answer comprehension questions, summarize key points, or write a reflection on the video. By integrating realistic audiovisual elements, such platforms can enhance the authenticity and engagement of listening assessments.
Despite the dominance of psychometric and statistical modeling for rank-ordering test takers in the listening assessment field (Aryadoust & Luo, 2022; Shang et al., 2024), we argue that not all assessments should be limited to ranking and comparison, nor should they be confined to summative evaluation or assessment of learning (see Fulcher, 2025). In language learning, particularly in listening practices and assessments, the focus should extend beyond measuring proficiency and standardization to supporting assessment for learning (AfL) and assessment as learning (AaL) (see Dann, 2014). These alternative approaches prioritize helping students develop self-regulation, refine and extend their listening processes and strategies beyond comprehension, and actively engage with feedback. This is where the use of GenAI can be particularly valuable. GenAI can enhance AaL in listening by creating personalized, interactive experiences that support learner autonomy and self-regulation, and it can embed assessment within the learning process by offering real-time, individualized feedback. GenAI-powered listening tools, especially those leveraging LLMs and deep-learning-based text-to-speech technology, can enhance language learning by providing automatic transcriptions, highlighting key terms, generating contextual explanations, posing questions, offering perspectives, and engaging in dialogues across an unlimited range of topics (see Aryadoust & Zakaria, forthcoming). Furthermore, GenAI can tailor listening materials to each learner’s proficiency level and interests (see Aryadoust et al., 2024, for an example), while promoting what Dann (2014, p. 157) describes as a “zone of curiosity,” where learners are optimally challenged.
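One of the capabilities listed above, highlighting key terms in an automatic transcript, is simple to sketch. The snippet below is illustrative only: the transcript and the target vocabulary list are invented, and a deployed tool would draw key terms from the learner’s profile or from an LLM rather than from a hand-written list.

```python
# Highlight target vocabulary in a transcript so a learner can notice
# and revisit key terms after a listening task.
import re

def highlight_key_terms(transcript: str, key_terms: list[str]) -> str:
    """Wrap each key term in **...** so it stands out in the transcript."""
    for term in key_terms:
        transcript = re.sub(rf"\b({re.escape(term)})\b", r"**\1**",
                            transcript, flags=re.IGNORECASE)
    return transcript

text = "The lecturer contrasted urbanisation with rural migration patterns."
print(highlight_key_terms(text, ["urbanisation", "migration"]))
# → The lecturer contrasted **urbanisation** with rural **migration** patterns.
```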
Thus, we believe that rather than viewing listening assessment merely as a device for comparison and classification, listening assessment tools can be designed to guide learners in recognizing their comprehension gaps, interpreting feedback effectively, and applying it in meaningful ways to improve their skills. By fostering this reflective mindset, listening assessments can serve as integral parts of the learning process rather than merely measuring performance. To formally examine whether AaL or AfL fosters self-regulation, metacognitive awareness, and meaningful engagement with feedback, researchers can employ computational methods that assess growth at the individual level. These approaches provide an alternative to group-based psychometric and statistical models, such as item response theory, which primarily focus on population-level inferences. Methods suited for analyzing individual change include latent Markov models and idiographic approaches, as discussed by Molenaar (2004).
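As a concrete illustration of the individual-level approach, a first-order Markov chain can be estimated from a single learner’s sequence of outcomes rather than from population-level data. The sketch below is a minimal, hypothetical example: the state coding and the data are invented, and full latent Markov models additionally posit unobserved states and are typically fitted with specialized software.

```python
# Idiographic sketch: estimate a first-order Markov transition matrix from
# ONE learner's sequence of listening-item outcomes over time.
# Hypothetical states: 0 = comprehension gap, 1 = successful comprehension.
from collections import Counter

def transition_matrix(states: list[int], n_states: int = 2) -> list[list[float]]:
    """Maximum-likelihood estimate of P(next state | current state)."""
    counts = Counter(zip(states, states[1:]))  # count observed transitions
    matrix = []
    for i in range(n_states):
        row_total = sum(counts[(i, j)] for j in range(n_states))
        matrix.append([counts[(i, j)] / row_total if row_total else 0.0
                       for j in range(n_states)])
    return matrix

# One learner's item outcomes across several practice sessions
outcomes = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]
P = transition_matrix(outcomes)
# P[0][1] estimates this learner's probability of moving from a
# comprehension gap to success on the next item (here 3 of 4 such moves)
```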
GenAI for developing listening and speaking
GenAI can play a part in some aspects of alternative assessment approaches that support learning and has the potential to make listening and speaking development interesting and engaging for language learners (Aryadoust, forthcoming). One consistent outcome regarding the use of GenAI is how it increases learners’ willingness to communicate (see Fathi et al., 2024; Rad, 2024; Tai & Chen, 2023; Zhang et al., 2024). Engaging with an IPA not only lowers learners’ anxiety (Bashori et al., 2021; Zhang et al., 2024) but also gives learners opportunities to practice speaking about a topic as many times as they need. Task repetition has many benefits for L2 speaking, such as improving fluency (Ahmadian & Tavakoli, 2011; Bygate, 2001), expanding vocabulary use (Lynch & Maclean, 2000), and better focus on form (Fukuta, 2016; Hawkes, 2012). Learners who repeat a task more than once can see a more lasting impact on their speaking performance (Lambert et al., 2017). In developing conversational skills, learners may have problems finding human interlocutors who are willing to talk about the same topic repeatedly. This, however, is possible with AI conversation partners, which can promote task repetition that is self-directed by learners.
However, AI as an intelligent conversation partner (particularly older systems) has its limitations. For example, while spoken dialogue systems can offer humanlike features during oral interactions, they do not necessarily simulate the complexities of interaction with human interlocutors, which requires a wider repertoire of speaking and listening skills. Oral interactions occur not only in one-to-one conversations but also in groups. They are situated in specific social contexts that may present various challenges (Hughes & Reed, 2017), as previously highlighted. Thus, while GenAI offers optimistic prospects for speaking and listening practice, teachers would need to make decisions about how to balance learners’ AI-driven practice with learning through face-to-face activities in class involving communicative language activities or explicit teaching (Goh, 2014). In other words, while GenAI can provide plenty of opportunities to generate talk, it does not necessarily help to develop a wider set of speaking and listening skills, which is possible through well-designed lessons in class.
It is as yet unclear whether future AI technologies will provide learners with opportunities to interact in ways that can develop a wider repertoire of listening and speaking skills and the type of interaction that can contribute to language acquisition. There is the question of whether AI assistants can respond with interactional adjustments to encourage more output and interaction from learners. Any form of more humanlike interaction from LLMs should include interactional processes for negotiation of meaning that enable learners to focus on language and meaning (Long, 1996), as one would in classroom interactions (Yan & Goh, 2024). Interactional processes that push learners to produce comprehensible output will help them develop their listening and speaking skills and strategies while contributing to their language proficiency (Swain, 1995). AI conversational partners that incorporate communication strategies and affective feedback for learners (Ayedoun et al., 2019), as well as clarification techniques, can potentially give learners more opportunities to develop their skills and language proficiency. L2 listeners who experience problems processing input during perception and parsing (Anderson, 2010) may be able to find help if intelligent assistants can adjust their responses to support greater comprehension during practice. While research projects have revealed various possibilities, the translation and scaling up of new systems for wider use remains uncertain because of practical challenges (Xiao et al., 2024).
In general education, AI is recognized as a catalyst for students to generate ideas and a tool for refining conceptual understanding and exploring new perspectives (Wong et al., 2024). In the language class, AI can assist language learners who struggle with not knowing what to say because they lack adequate content and ideas or are not confident about their own ideas. Conceptualization is a key component of speaking (Kormos, 2006; Levelt, 1989), and learners may benefit from some help here. Pre-task planning (Foster & Skehan, 1996) can assist learners in improving their speaking performance in language complexity and fluency, particularly in monologic tasks (Yuan & Ellis, 2003). For example, Geng and Ferguson (2013) found that learners who planned with a partner, individually, or guided by a teacher all performed significantly better at oral tasks than those who did no planning. In traditional language classrooms, teachers may help learners plan what to say by generating content through class brainstorming or getting pairs to share ideas before they present to the class, as a form of scaffolding (Goh, 2017). With the advent of GenAI, language teachers may consider scaffolding pre-task planning by allowing learners to enlist AI assistants for some of their content planning. This should not be mistaken for getting AI to prepare a script. Rather, learners use AI to enhance their ideas and provide some key vocabulary items that they could use in subsequent interaction.
“Co-intelligence” with AI (Mollick, 2024, p. 49) has several benefits. Learners can use the content generated to review the adequacy and appropriateness of their own ideas and augment what they have, thereby improving their conceptualization process. By doing this, learners can shift their cognitive load from processing the message content to formulating utterances and articulating the message. In listening activities, GenAI tools such as ChatGPT and Gemini may be used to help learners prepare for a listening topic, while teachers can use ChatGPT to generate questions based on a listening transcript to support and direct learner listening (Meniado & Marlina, forthcoming). When encouraging the use of GenAI for supporting listening and speaking, teachers must apply their professional judgment to decide how and how often AI tools should be used as a thinking and practice companion for their students and balance this with other forms of supporting listening and speaking processes.
Other than considering how AI is used for teaching, it is important to consider its impact on students’ learning. If AI is introduced in language learning, teachers should provide learners with opportunities to understand how AI can support their task of learning to listen and speak and how best to use AI for this purpose. This is one aspect of AI literacy that is highly valued in education and contributes to the holistic development of individuals as they learn to use spoken language. AI literacy encompasses an “understanding of AI concepts, applications, and potential risks” and involves “grasping its societal implications, ethical considerations, and the potential impact on various aspects of life, including education, employment, ethics, and privacy” (Looi, 2024, pp. 481–482). By developing better AI literacy in the context of language learning, learners will become more discerning of what using AI means for them as individual learners and of the pros and cons of practicing speaking and listening with AI assistants to prepare them to interact with the world beyond their devices.
With the pervasiveness of AI in education, students will need even greater self-awareness and metacognitive skills to derive the full benefits of engaging with GenAI technologies (Looi, 2024). Effective language learning is supported by learners’ own metacognitive processes (Haukås et al., 2018; Wenden, 1987). By engaging in planning, monitoring, evaluation, and problem-solving strategies, learners can become more self-directed and strategic in their L2 learning (Wenden, 1987). Building on earlier work leveraging technology for L2 learners’ metacognitive development (Tan & Tan, 2010), AI tools can also be incorporated as an option for metacognitive engagement, such as reflection on learners’ personal knowledge about speaking and listening, their task knowledge of how they should approach tasks involving spoken language, and their use and learning of strategies for problem solving, planning, monitoring, and evaluation, while developing their oral language competencies. As AI becomes increasingly used for speaking and listening development, AI literacy is a new dimension of learners’ metacognitive knowledge that teachers can develop alongside knowledge about listening and speaking processes.
In considering AI and the development of listening and speaking skills, we have taken reference from four rules for co-intelligence, or thinking together with AI, that Mollick (2024, pp. 46–62) proposed: (a) always invite AI to the table – get AI to help us in what we do; (b) be the human in the loop – incorporate human judgment in its operations; (c) treat AI like a person (but tell it what kind of person it is) – give it specific instructions on how it can do the work better for us; and (d) assume this is the worst AI you will ever use – there will be better versions of AI in the future. Rather than thinking that resistance is futile, it is better to embrace a future of living and working with AI with caution, foresight, and agency. L2 educators and researchers must strive to be the human in the loop, exercising control over the technology for teaching, learning, and assessment, thus “ensuring that AI-driven solutions align with human values, ethical standards, and social norms …” as well as being “responsible for the output of the AI, which can help prevent harm” (Mollick, 2024, p. 54). This can happen in two ways. Experts in language education and assessment must gain a place at the table when new AI programs are being developed for language learning so that the tools that are pushed out can genuinely benefit teachers and learners. In addition, teachers should exercise their time-honored expertise in evaluating teaching materials such as textbooks, as well as assessment materials, in deciding which AI tools can benefit learners.
Future research directions
On the research front, there has been a surge of interest in using GenAI in language education in general. With the successes reported in emerging research, we expect even greater interest in adopting AI assistants and chatbots as conversation partners for language learners. For listening, the presence of multimodal cues can potentially support top-down comprehension processes to compensate for poor input perception. Internal and external factors for facilitating listening comprehension are well recognized, but current research with AI has not provided sufficient clarity on how listening with AI is influenced by these factors. Future research can examine how various factors influence the cognitive, social, and affective processes of L2 listening with AI. In addition, it will be worthwhile to examine how AI can support learners when they experience listening comprehension problems, such as attentional failure, inability to decode or segment streams of speech, inadequate or partial automatization of linguistic knowledge, inability to elaborate and utilize mental models for better interpretation, failure to use cues to draw inferences, and so forth. Similarly, for speaking, research is needed to show the impact of practice with GenAI not only on willingness to communicate but also on a repertoire of oral interaction skills as well as improvements in cognitive processes and speech monitoring. A pertinent issue for the comprehension and use of spoken language, particularly for interaction, is the relevance of human teachers and interlocutors. Research can examine how learners conceive of AI assistants and human beings as interlocutors. An examination of these interactions may help us understand whether learners apply the same or different interactional rules in the respective interactions, as well as how listening and speaking processes are influenced by these conceptions, positively or otherwise.
Future research should focus more on clearer operationalization of the constructs of listening and speaking and also examine the cognitive and metacognitive processes involved. Because of the pervasiveness of AI, wider issues of access and equity for language teachers and learners must remain important. Future research could focus on sociocultural issues, exploring learning opportunities with AI and their impact while ensuring fairness and accessibility for all language learners and teachers. Additionally, with the emergence of multimodal GenAI, there is the possibility of integrating multiple skills and modes of communication to enhance the authenticity and cognitive validity of language learners’ experiences, an emerging stream of research that needs further development. Research can also examine learners’ interactional competence (Roever & Dai, 2021) when AI is used for assessment. In the next phase of the use of AI, we should look toward research that can contribute to the theoretical development not just of oral language comprehension and production but also of L2 acquisition that specifically arises from the context of human–AI spoken interaction.
Conclusion
GenAI has become an integral part of society in many parts of the world. It is often perceived to have anthropomorphic qualities such as humanlike empathy and understanding (e.g., Welivita & Pu, 2024), which can make interactions with GenAI systems more engaging and meaningful. With the growth of GenAI models, learners have unprecedented opportunities to personalize their learning experiences and gain access to their own virtual tutor, examiner, and companion. This gives learners the unique opportunity to engage in humanlike interactions with the technology and even experience the affective dimensions inherent in human communication and interactions. With AIs that can generate text, voices, images, videos, and more, there is much that language education can leverage. The rapid advancement of AI models offers the potential for better support for the teaching, learning, and assessment of oral language use and skills, even completely revolutionizing it, much like how our lives have been transformed by the power residing in our mobile phones. Despite its current limitations, AI has the potential to improve the teaching, learning, and assessment of listening and speaking. Language educators and researchers should use their expertise in theory and practice to evaluate AI tools that are being considered in a language program. Currently, there is a lack of rigorous evaluation of such tools, particularly commercial ones, when they are being adopted. To do this, teachers must develop their own AI literacy and continue to develop pedagogical content knowledge for L2 teaching. Finally, as the line of research examining AI and oral language comprehension and use develops further, it will be important to ensure that future publications are accompanied by increasing rigor and depth in the questions asked, the research design, and the constructs that are examined.
After the initial surge in rapid publications, the field will benefit from sustaining the momentum to drive further quality research that can contribute to the development of these related areas, ultimately helping the many language learners who will sooner or later be thrust into a landscape shaped by AI’s prominence in language teaching, learning, and use.