1. Motivation: LLMs vs engineers in requirement generation
Engineering requirements are fundamental to product development, as they form the basis for meeting the needs, wants, and constraints of stakeholders (Ullman, 2017). These requirements define the problem space and guide the engineering design process [2–4]. Traditionally, requirements are gathered through direct interaction with stakeholders via interviews, surveys, and other elicitation methods (Morkos & Summers, 2009). However, challenges arise when access to these stakeholders is limited, especially in large, complex projects with diverse or geographically dispersed teams. Stakeholders who are difficult to engage may include those from underrepresented or marginalized communities, those from small populations, or those in locations not accessible to the designers (Valerio et al., 2016). In such cases, traditional methods may fall short in capturing the full scope of stakeholder needs.
Recent advancements in Artificial Intelligence (AI) and Natural Language Processing (NLP), particularly with Large Language Models (LLMs), have opened new possibilities in automating and enhancing the requirements engineering process (Ha et al., 2024; Hao et al., 2024; Krapp et al., 2024; Liu et al., 2024). LLMs, such as ChatGPT, Gemini, and others, can process vast amounts of data and generate coherent and contextually relevant text (Krapp et al., 2024; Liu et al., 2024). This study explores the potential of using LLMs as surrogate stakeholders in requirements elicitation, comparing the quantity, variety, and completeness of requirements generated by LLMs to those produced by human engineers. By evaluating LLM performance, this research aims to understand whether LLMs can be a viable tool in supporting or supplementing traditional requirements generation methods.
2. Background
Requirements engineering ensures that products align with stakeholder needs, but capturing clear and comprehensive requirements is often challenging. This section discusses key evaluation metrics, the importance of requirements engineering, and the potential of LLMs as surrogate stakeholders.
2.1. Importance of requirements engineering
Requirements engineering is a critical aspect of engineering design that ensures the designed product aligns with stakeholder needs (Darlington & Culley, 2002). Well-defined requirements are key to preventing miscommunication, reducing the risk of failure, and ensuring that the final product meets its intended purpose (Lind et al., 2023). Standards organizations, such as INCOSE (International Council on Systems Engineering), IEEE (Institute of Electrical and Electronics Engineers), and NASA, emphasize the importance of clear, concise, and complete requirements to ensure successful product development (Lind et al., 2023; National Aeronautical and Space Administration, 2016).
The International Council on Systems Engineering (INCOSE) defines requirements as well-formed textual “shall” statements that communicate in a structured, natural language what an entity must do to realize the intent of the needs from which they were transformed (Institute of Electrical and Electronics Engineers (IEEE), 2018). The Institute of Electrical and Electronics Engineers (IEEE) defines a requirement as a statement that translates or expresses a need, along with its associated constraints and conditions (Institute of Electrical and Electronics Engineers (IEEE), 2018). NASA’s Systems Engineering Handbook defines a requirement as the agreed-upon need, desire, want, capability, capacity, or demand for resources or services, expressed as a “shall” statement (National Aeronautical and Space Administration, 2016). These organizations emphasize writing requirements with a single subject and using modal terms such as “shall”, “must”, or “should” to capture the criticality of a requirement (National Aeronautical and Space Administration, 2016). The requirement should also end with a verb phrase, capturing what the subject must do, along with any necessary information to fulfill the requirement (Spivey, 2019). Based on guidelines from INCOSE, IEEE, and NASA, as well as other research (Joshi & Summers, 2014a, 2014b; Spivey, 2019; Spivey et al., 2021), three metrics to evaluate requirements have been identified: quantity, variety, and completeness.
- Quantity: This refers to the total number of requirements generated during the elicitation phase. Having a sufficient number of requirements ensures that all aspects of the project are covered.
- Variety: The diversity of requirements is essential to ensure that different categories are included, such as functional, non-functional, performance, safety, and regulatory requirements. A broad variety helps ensure the product is suitable for its intended environment.
- Completeness: A good requirement should be fully detailed, including the subject, verb, modal terms (e.g., “shall” or “must”), and target values. Completeness refers to the depth and specificity of each requirement to ensure clarity and traceability.
Despite the importance of these metrics, capturing stakeholder needs effectively remains challenging (Huzzard, 2021). In projects, especially new product development, requirements are often incomplete or inaccurate due to stakeholders’ difficulty in articulating their needs (Huzzard, 2021). As products evolve from conception to prototype, latent needs emerge that further complicate requirements elicitation. The process is resource-intensive, particularly when the product is novel and its design is uncertain.
2.2. Challenges in requirement elicitation
Eliciting accurate and comprehensive requirements is a complex task (Christophe et al., 2011). The challenge is exacerbated by the fact that stakeholders often have limited time and availability, and their needs can evolve over time (Nilsson & Fagerström, 2006). Traditional methods, such as one-on-one interviews, focus groups, and surveys, can be time-consuming, costly, and dependent on the skill of facilitators (Morkos & Summers, 2009). These methods can also fail to capture the full spectrum of requirements, particularly in large-scale systems where stakeholders may have conflicting interests or differing perspectives (Khade et al., 2022; Ulrich et al., 2020).
Furthermore, the richness of the information stakeholders provide can vary significantly based on their role and understanding of the project (Nilsson & Fagerström, 2006). For example, end-users might provide insights into usability, while regulatory bodies might focus on safety and compliance (Khade et al., 2022; Ulrich et al., 2020). This variability in input complicates the task of establishing a comprehensive set of requirements (Morkos & Summers, 2010). Additionally, as the project progresses and new information surfaces, previously gathered requirements may need to be refined, updated, or even discarded [14].
The complexity of modern projects, combined with evolving stakeholder needs, makes it clear that traditional methods have limitations (Yasin et al., 2024). As a result, there is growing interest in exploring whether automated tools like LLMs could offer a more efficient and effective means of gathering and refining requirements (Wei, 2024).
2.3. Potential of LLMs as surrogate stakeholders
Large language models (LLMs) represent a promising tool for assisting in requirements engineering (Liu et al., 2024). Their advanced capabilities in text generation and context understanding make them potentially valuable for simulating stakeholder input (Liu et al., 2024). By carefully designing prompts and providing context, LLMs can generate requirements that align with typical stakeholder concerns, potentially serving as surrogate stakeholders in the requirements elicitation process (Marvin et al., 2024).
Prompt engineering plays a crucial role in optimizing the performance of LLMs (Gao, 2023; Marvin et al., 2024; Wang et al., 2024). This process involves crafting task-specific prompts to guide LLMs in producing relevant, coherent, and high-quality outputs (Marvin et al., 2024). Research has demonstrated that well-designed prompts can significantly enhance the capabilities of LLMs in generating domain-specific outputs, improving their ability to match the needs of the project at hand (Marvin et al., 2024).
However, while LLMs have shown promise, their ability to generate complete and varied requirements remains an open question. Initial studies have demonstrated that LLMs can produce contextually relevant text, but whether they can consistently generate requirements that are as comprehensive, diverse, and detailed as those developed through traditional methods is still under investigation. A study examining the use of LLMs for generating safety requirements for autonomous vehicles found that LLMs could generate requirements comparable to human-written ones, but certain areas still required refinement to fully capture the complexity of the system (Liu et al., 2024).
This study aims to explore the comparative effectiveness of LLMs in generating requirements, focusing on key metrics such as quantity, variety, and completeness. By identifying areas where LLMs excel or fall short, the research will contribute to understanding how these tools can best support or supplement human-driven requirements engineering efforts.
3. Experimental design
This study investigates the comparison between engineering requirements generated by preservice engineers (undergraduate engineering students) and those produced by large language models (LLMs). The objective is to evaluate the effectiveness of LLMs in generating engineering requirements and assess how their output compares to human-generated requirements in terms of quantity, variety, and completeness. This section details the experimental setup, including participant selection, procedures, and evaluation criteria, to systematically analyse the generated requirements. By employing a controlled experimental framework, this study aims to provide insights into the potential of artificial intelligence to complement traditional methods of requirements generation in engineering design.
The study focuses on two primary independent variables: preservice engineers and LLMs. These variables are analyzed to determine their respective strengths in generating engineering requirements, with the comparison grounded in three metrics: quantity, variety, and completeness. A standardized rubric based on the requirement guidelines outlined in Spivey et al. (2021) serves as the evaluation tool for both groups.
3.1. Design prompts
Two distinct design prompts were created to elicit requirements from both participant groups. Design Prompt 1 was administered to both preservice engineers (undergraduate engineering students) and LLMs. Design Prompt 2 with Persona was administered exclusively to the LLMs. This variant incorporated a persona intended to mimic the context and behavior of preservice engineers. These design prompts are detailed in Table 1.
Table 1. Design prompts

3.2. Human participants
The study was conducted under an approved experimental protocol through the local Institutional Review Board, ensuring no risks to the participants. The participants included 116 preservice engineers enrolled in four sections of an “Introduction to Engineering and Computer Science” course at a large public research university in the United States. While the study initially involved 116 participants, 50 of them were given a different design prompt, resulting in 66 participants included in the analysis. The course, a required first-semester class for all engineering and computer science students, comprised approximately half computer science majors, with the remaining students enrolled in bioengineering, electrical engineering, or mechanical engineering programs. About four-fifths of the students identified as male, and only 5% were international students, with an age range typically between 17 and 20 years. Specific demographic data were not collected, as it was deemed outside the scope of the experiment.
Although this study has implications for both educational and industrial settings, preservice engineering students from the School of Engineering and Computer Science were recruited to ensure a consistent baseline of engineering knowledge. Prior research demonstrated that preservice engineers performed similarly to practitioners with at least three years of experience when generating requirements (Elena, 2019). Additionally, another study found no significant difference in solution quality or technical feasibility between freshman and senior engineering students when generating innovative solutions (Genco et al., 2012).
The four sections involved in the study had enrollments ranging from 170 to 300 students and met in a traditional lecture hall on Tuesdays and Thursdays. Participants were asked to refrain from discussing the in-class activity with others. The experiment was conducted during regular class time, following a lecture on requirements as part of the coursework. While each section had a different instructor, the content was standardized with a common guest lecture delivered by the same researcher. The lecture covered topics such as the role of requirements in engineering design, methods for measuring requirements, common issues in requirement formulation, and the involvement of stakeholders.
After the lecture, participants were given detailed instructions for the requirements generation task, specifying the required format, structure, and key considerations. They were presented with a problem statement and given 20 minutes to generate requirements. Once the time limit elapsed, the required documents were collected. Independent observers, unaffiliated with the participants, supervised the activity through real-time observation and periodic check-ins to ensure adherence to experimental protocols. This monitoring process was designed to be non-intrusive.
The participant packets contained two sections: the first requested the creation of a unique identifier, while the second included instructions for the requirements generation task and the design problem prompt. Participants were asked to imagine working for an engineering design firm tasked with generating requirements to address the given problem.
3.3. AI participants
Four prominent LLMs were evaluated in this study: Copilot by Microsoft, Gemini by Google, ChatGPT by OpenAI, and Claude by Anthropic. These LLMs were selected based on their demonstrated capabilities in supporting engineering design activities, specifically in requirements generation. Copilot has been used effectively to assist with requirement formulation tasks, such as writing requirements for a “tennis ball launcher” (Valentine, n.d.). Gemini has been evaluated as having advanced capabilities in understanding complex instructions (Rane et al., 2024). ChatGPT has been explored with respect to improving software requirements engineering (Marques et al., 2024). Claude has been recognized for its ability to process multimodal prompts, including text, code, and structured data files (Caruccio et al., 2024). These LLMs were chosen for their demonstrated utility in engineering design and their distinct approaches to requirement generation. The inclusion of multiple LLMs allows for comparative analysis and highlights variations in their outputs.
4. Results and discussions
This section presents the responses of the human and AI participants. The human participants responded to the baseline prompt, while large language models (LLMs) were given both the baseline prompt (B) and a persona-enhanced prompt (P). Responses were evaluated based on quantity, variety, and completeness of requirements. Differences and similarities between human and LLM responses are also discussed.
4.1. Quantity coding
Quantity refers to the number of requirements generated by the participant, reflecting their ability to propose design specifications. To standardize analysis, requirements were split into atomic statements, each expressing a single thought (Spivey et al., 2021). For example, consider the requirement:
- “While in operation, the system must cool the motor and remain under 60 decibels.”
The improved requirement would be written as two requirement statements:
- “While in operation, the system must cool the motor.”
- “While in operation, the system must remain under 60 decibels.”
This splitting process adhered to established criteria, including the presence of multiple verbs, adjectives, or conditional expressions. Exceptions to splitting were made when clauses explained a requirement’s purpose or when conditions applied collectively rather than independently. A comprehensive description of the splitting criteria can be found in Spivey et al. (2021).
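As a rough illustration of the splitting step, the sketch below handles only the simplest case shown in the example above: one modal term followed by verb phrases joined by “and”. The function name split_requirement and the regular expressions are assumptions made for illustration. They do not implement the full criteria of Spivey et al. (2021), and in particular they ignore the exceptions for purpose clauses and collectively applied conditions, so the sketch will over-split requirements in which “and” joins nouns rather than verb phrases.

```python
import re

# Modal terms named in the requirement-writing guidelines discussed earlier.
MODALS = ("shall", "must", "should")


def split_requirement(text):
    """Naively split a requirement whose verb phrases are joined by 'and'.

    Hypothetical helper: everything up to and including the first modal term
    is treated as a shared stem and repeated in front of each verb phrase.
    """
    text = text.strip().rstrip(".")
    match = re.search(r"\b(" + "|".join(MODALS) + r")\b", text)
    if match is None:
        # No modal term found: return the statement unchanged.
        return [text + "."]
    stem = text[: match.end()]
    remainder = text[match.end():].strip()
    phrases = [p.strip() for p in re.split(r"\s+and\s+", remainder) if p.strip()]
    return [f"{stem} {phrase}." for phrase in phrases]


example = ("While in operation, the system must cool the motor "
           "and remain under 60 decibels.")
atomic = split_requirement(example)
for statement in atomic:
    print(statement)
```

In practice the split was a coding decision made against the published criteria; a rule of this kind could at most flag candidate requirements for a human coder to review.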
4.2. Quantity analysis
The initial count of requirements generated totalled 842, distributed across four course sections with varying participant numbers: 12, 9, 25, and 21. Statistical analysis using ANOVA confirmed no significant differences in the number of requirements generated between sections (α = 0.05). Following the initial count, requirements were split into atomic units to facilitate detailed analysis.
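For readers who wish to run a comparable check between sections, the following minimal sketch computes the one-way ANOVA F statistic in plain Python. The function name one_way_anova_f and the per-participant counts are illustrative assumptions; they are not the study's data or its analysis script.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of counts."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical requirement counts per participant in three sections.
# These numbers are placeholders, NOT values from the study.
sections = [[10, 12, 14], [11, 13, 15], [9, 12, 15]]
f_stat, df_b, df_w = one_way_anova_f(sections)
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

The resulting F value would then be compared against the critical value of the F distribution with the reported degrees of freedom at α = 0.05. Statistical libraries provide an equivalent test, for example scipy.stats.f_oneway in SciPy, which also returns the p-value.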
Table 2 presents the initial and final counts of requirements, along with the average number of requirements per participant and the percentage of splits. While human participants generated more preliminary requirements than the LLMs, splitting revealed that LLMs using the persona-enhanced prompt produced more refined and focused outputs. The results indicate that the persona prompt improved the LLMs’ ability to generate a higher quantity of meaningful requirements.
Table 2. Number of split requirements for human participants

These results demonstrate that while human participants generated a larger volume of requirements, LLMs using the persona prompt achieved better efficiency, with fewer splits needed to clarify requirements.
4.3. Variety coding
Variety evaluates the range of categories covered by participants’ requirements. Categories, adapted from Pahl et al. (2007), include geometry, kinematics, forces, energy, material, signal, safety, ergonomics, production, quality control, assembly, transport, operation, maintenance, recycling, cost, and schedules. If a requirement does not fall within these categories, it is coded as “NA”. Examples of requirement categorizations can be found in Table 3.
Table 3. Examples of requirement categorizations

4.4. Variety analysis
Variety was chosen as a key metric in this study to evaluate the breadth of requirements generated. Research indicates that greater variety suggests a more comprehensive understanding of the problem domain (Viswanathan & Linsey, 2012). Variety was assessed by categorizing requirements based on the split requirements approach detailed in Section 4.1. Each requirement was assigned to a maximum of three categories to minimize overgeneralization and ensure relevance.
The analysis began by examining the variety of requirements generated by human participants and large language models (LLMs). For each participant group, the number of categories addressed was counted. To account for variations in the total number of requirements generated, percentages were calculated by dividing the frequency of each category by the total number of categories across all requirements. This approach normalized the results, enabling comparisons between participants who generated different quantities of requirements. To facilitate a more nuanced comparison, categories were grouped into five percentile ranges. The highest frequency categories were assigned a score of 5, while the lowest received a score of 1. These percentiles provided a clear basis for comparing the diversity of responses between human participants and LLMs.
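The normalization described above can be sketched as follows. The function names, the example category codes, and in particular the rank-based binning into five scores are assumptions made for illustration; the text does not specify how the percentile boundaries were drawn, so score_by_rank is only one plausible reading.

```python
from collections import Counter


def category_percentages(coded_requirements):
    """Share of each category among all category assignments.

    coded_requirements: one list of category labels per requirement
    (at most three labels each, following the coding rule above).
    """
    counts = Counter(label for labels in coded_requirements for label in labels)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


def score_by_rank(percentages, bins=5):
    """Assumed binning: rank categories by frequency and map ranks to 1..bins."""
    ordered = sorted(percentages, key=percentages.get)  # least frequent first
    n = len(ordered)
    return {category: 1 + (i * bins) // n for i, category in enumerate(ordered)}


# Hypothetical coded requirements, NOT data from the study.
coded = [
    ["safety", "cost", "geometry"],
    ["safety", "cost", "geometry"],
    ["safety", "cost", "geometry"],
    ["safety", "cost", "energy"],
    ["safety", "energy", "material"],
]
pct = category_percentages(coded)
scores = score_by_rank(pct)
for category in sorted(pct, key=pct.get, reverse=True):
    print(f"{category:10s} {pct[category]:5.1f}%  score {scores[category]}")
```

Because the denominator is the total number of category assignments rather than the number of requirements, the percentages sum to 100 even when a requirement carries two or three codes, which is what makes groups of different sizes comparable.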
Table 4 provides a summary of the variety in responses between the human participants and the LLMs. It includes the percentage distribution of requirements across the categories and the assigned percentiles. The categories, adapted from Pahl et al. (2007), are listed on the x-axis, and the variety percentages for humans, baseline LLM responses, and persona-driven LLM responses are shown.
Table 4. Summary of the variety in responses between the human participants and the LLMs

Figure 1 visually represents the data from Table 4, illustrating the percentage of categories occupied by the responses from human participants and LLMs. The blue bars represent the categories covered by the human participants, the orange bars represent the baseline LLM responses, and the green bars represent the persona-driven LLM responses.

Figure 1. Number of categories occupied
The results in Figure 1 demonstrate that human participants generated requirements with higher variety compared to the LLMs. Categories such as geometry, kinematics, forces, safety, operation, and cost were more extensively addressed by human participants, likely reflecting their familiarity with these areas. Conversely, the LLMs, particularly when prompted with a persona, generated more requirements in energy, material, production, and maintenance categories. Notably, human participants produced minimal requirements related to production and transport, which may stem from limited manufacturing experience.
LLM responses to the baseline prompt failed to include force or recycling requirements, whereas persona-augmented prompts resulted in more assembly-related requirements but excluded schedule requirements entirely. Similarly, the signals category was not covered in baseline responses but was partially addressed by persona-augmented prompts. These findings suggest that while LLMs can expand their coverage with appropriate prompting, they still lag behind human participants in variety across critical categories.
4.5. Completeness coding
Completeness, another critical criterion, measures how well requirements encapsulate all necessary functionalities, constraints, and specifications. A complete requirement should include a subject, modality, and verb phrase to avoid ambiguity and ensure clarity [54][55]. Additional components, such as modifiers, objects, and target values, further enhance the precision and applicability of the requirements. Complete requirements are instrumental in facilitating effective communication and project planning [33].
4.6. Completeness analysis
Completeness was evaluated by analysing the structural components of the requirements generated by both human participants and LLMs. The presence of a subject, modality, and verb phrase served as the baseline for completeness, while the inclusion of modifiers, objects, and target values was also assessed. Percentages were calculated for each component relative to the total number of requirements generated.
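The tally behind these percentages can be sketched in a few lines. The sketch assumes that each requirement has already been coded by hand into boolean flags, one per structural component; the component keys, the function names, and the four example records are hypothetical and are not drawn from the study's data.

```python
# Structural components assessed in the completeness analysis.
COMPONENTS = ("subject", "modality", "verb", "modifier", "object", "target_value")
BASELINE = ("subject", "modality", "verb")


def component_percentages(coded):
    """Percentage of requirements that contain each component.

    coded: one dict per requirement mapping component name -> bool.
    Missing keys are treated as absent components.
    """
    total = len(coded)
    return {
        c: 100.0 * sum(1 for req in coded if req.get(c, False)) / total
        for c in COMPONENTS
    }


def meets_baseline(req):
    """True when subject, modality, and verb phrase are all present."""
    return all(req.get(c, False) for c in BASELINE)


# Hypothetical hand-coded requirements, NOT data from the study.
coded = [
    {"subject": True, "modality": True, "verb": True, "object": True, "target_value": True},
    {"subject": True, "modality": True, "verb": True},
    {"verb": True, "object": True},
    {"subject": True, "modality": True, "verb": True, "modifier": True},
]
percentages = component_percentages(coded)
baseline_count = sum(meets_baseline(req) for req in coded)
print(percentages)
print(f"{baseline_count} of {len(coded)} requirements meet the baseline")
```

Keeping the baseline check separate from the per-component tally mirrors the distinction drawn above between the minimum structure of a requirement and the additional components that add precision.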
Table 5 presents the results, highlighting the disparities between the human-generated and LLM-generated requirements. Human participants generated 842 requirements, compared with 159 from the LLMs. Across all components, the human-generated requirements exhibited higher percentages. For instance, 75.27% of human-generated requirements contained a subject, compared to only 25% and 30% for the baseline and persona-augmented LLM responses, respectively. Similarly, 98.53% of human-generated requirements included a verb, far exceeding the 38% and 33% observed for the LLMs.
Modifiers, objects, and target values were significantly underrepresented in the LLM-generated requirements. Only 1% of baseline LLM responses included modifiers, and none were observed in persona-augmented responses. In contrast, human participants included modifiers in 56.01% of their requirements. Similarly, objects appeared in 44.09% of human-generated requirements, compared to 38% and 32% for the LLMs. Target values, a critical indicator of specificity, were present in 22.09% of human-generated requirements but only 3% of LLM responses.
Table 5. Percentages of requirement components

The disparity in completeness can be attributed to the training provided to human participants, who received a lecture on requirement writing before the task. This preparation likely equipped them to generate more comprehensive and structured requirements. In contrast, the LLMs, while capable of producing coherent responses, struggled to incorporate critical components consistently, highlighting a gap in their ability to replicate human-level understanding and detail.
5. Conclusions
This study explored the potential of Large Language Models (LLMs) as surrogate stakeholders in requirements engineering, comparing their performance to human engineers. Because a design team may not be able to engage with all stakeholders, surrogates may be necessary. While LLMs demonstrated the ability to generate a substantial quantity of requirements and explore a variety of categories, their coverage and depth remain limited compared to human participants. Human engineers consistently outperformed LLMs in producing well-structured, detailed, and comprehensive requirements.
A key limitation of this study is that the LLMs, unlike the human participants, were not given a lecture on requirements, so the results may not fully capture the diversity of their capabilities. Future research should explore a wider range of LLMs and prompt engineering techniques to further assess their potential.
Future research could investigate the effectiveness of combining human and LLM efforts in a collaborative approach. By leveraging the strengths of both, it may be possible to achieve a more efficient and effective requirements engineering process. Additionally, exploring the impact of different personas and prompt engineering techniques on LLM performance could provide valuable insights for optimizing their use in practical applications.