1. Introduction
Technical product development generally begins by clarifying the problem or task, often culminating in a comprehensive requirements list (Pahl et al., 2007; VDI 2221-1, 2019; Schlattmann & Seibel, 2021). This document, often presented in tabular format, details all the essential specifications, functions, and constraints necessary for product development. It is continuously updated throughout the development process and serves as a foundation for design, evaluation, and decision-making.
Compiling the relevant information to create a requirements list is a complex and time-consuming task, involving substantial manual effort (Dąbrowski et al., 2020), particularly for new product development. This process is usually carried out by experts using methods such as market analysis (Palomares et al., 2021), stakeholder surveys (Franch, 2021), product reviews (Lim et al., 2021), user feedback (Van Vliet et al., 2020), and/or quality function deployment (Pokorni et al., 2022). In this context, large language models (LLMs) have recently emerged as a promising tool to support and accelerate the requirements engineering process (Ronanki et al., 2024) as they can understand and process natural language and are able to effectively extract, structure, and refine information from large datasets.
Interacting with LLMs typically involves a process known as prompt engineering (Sahoo et al., 2024), where users craft specific requests to elicit desired responses. However, producing high-quality outputs through this method can be challenging, as it often requires providing extensive context for the request and additional details about the desired result. Furthermore, LLMs are generally sensitive to the phrasing of prompts, meaning that even slight variations in wording can result in significantly different outcomes (Jiang et al., 2020).
To address these challenges and enhance the utility of LLMs for generating requirements lists, we have developed a domain-specific GPT (generative pre-trained transformer) that enables product developers to create high-quality requirements lists for technical products using a single standardized prompt, such as “Create a requirements document for [product].” This document can be tailored to specific contexts and company needs by incorporating elements such as stakeholder surveys, ISO standards, international and regional regulations, or internal requirements from previous projects (Dehn et al., 2023).
2. Background
2.1. General LLMs
In the field of artificial intelligence, LLMs have become groundbreaking tools, revolutionizing human-machine interactions. These models are distinguished by the massive datasets on which they are trained, enabling them to process and respond to complex requests. Their extensive training allows not only for the comprehension of natural language but also for the interpretation of linguistic relationships and the generation of coherent, contextually appropriate responses.
In contrast to traditional natural language processing (NLP) models, which heavily depend on recurrent or convolutional neural networks (Kombrink et al., 2011), modern transformer-based models such as GPT-4 from OpenAI are built entirely on attention mechanisms, particularly self-attention (Vaswani et al., 2017). This architecture facilitates parallel data processing, enabling these models to efficiently extract meaningful information from vast amounts of text. However, while general LLMs are powerful text creators, they may lack the precision and domain-specific knowledge required for specialized tasks.
2.2. Domain-specific LLMs
Domain-specific LLMs are tailored to support specialized tasks across various disciplines. In contrast to general-purpose LLMs, which are designed to handle a wide variety of tasks, domain-specific LLMs are explicitly trained to address the unique requirements and nuances of specific industrial segments. In these specialized contexts, they can even surpass much larger general LLMs on previously unseen tasks (Chung et al., 2024).
Examples of domain-specific LLMs already exist in fields such as finance (Wu et al., 2023; Yang et al., 2023), law (Cui et al., 2023), natural sciences (Xie et al., 2023), biomedicine (Lee et al., 2020), and bio-inspired design (Chen et al., 2024). These models are typically derived from general-purpose LLMs via a process known as fine-tuning (Moradi et al., 2024). Fine-tuning involves retraining or adapting a model using training datasets to improve its performance on targeted tasks.
Several methods exist for fine-tuning LLMs for specific tasks, with full-parameter fine-tuning being the most straightforward approach (Sun et al., 2023). The method involves adjusting all layers of the model by training it on task-specific data, making it particularly effective for scenarios with large and distinct datasets that differ significantly from the original pre-training data. While it allows the model to deeply learn and adapt to new requirements, it has notable drawbacks, including substantial memory demands and the need for high-performance hardware (Ding et al., 2023).
To reduce the computational and memory effort for fine-tuning, low-rank adaptation (LoRA) provides a more efficient alternative (Hu et al., 2021). LoRA operates on the assumption that the weight changes during model adaptation have a low “intrinsic rank.” Instead of retraining the entire model, the method optimizes low-rank decomposition matrices that encapsulate these changes in the dense layers. Building on this concept, quantized low-rank adaptation (QLoRA) further minimizes the memory requirements by introducing advanced memory management techniques (Dettmers et al., 2024). QLoRA achieves this by applying four-bit quantization to the LLM and enabling backpropagation of gradients through these quantized, frozen layers, significantly improving efficiency without sacrificing performance.
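To make the low-rank idea concrete, the following sketch (a minimal illustration in PyTorch, not taken from the cited papers; the layer sizes, rank, and scaling are assumptions) shows how a frozen pretrained weight matrix can be augmented with two trainable low-rank factors so that only a small fraction of the parameters is updated.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style layer: y = base(x) + x A^T B^T * (alpha / r).

    The pretrained weights stay frozen; only the low-rank factors A and B
    are trained, which drastically reduces the number of updated parameters.
    """
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Wrapping a 4096x4096 projection trains ~131k instead of ~16.8M parameters.
layer = LoRALinear(nn.Linear(4096, 4096), r=16)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))
```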
2.3. LLMs for requirements lists
The creation of a requirements list is a text-based activity, making it highly suitable for the application of NLP models. LLMs can assist in extracting relevant information from their extensive pre-training data while addressing ambiguities and inconsistencies. This ensures that requirements are clear, structured, and well-aligned with product development needs by converting project-specific language into formal notations (Bertram et al., 2022) and machine-readable formats (Ray et al., 2023).
While prompt engineering serves as an alternative to fine-tuning for requirements elicitation, it is highly dependent on precise wording. Research on LLM-based prompt engineering recommends incorporating key elements such as intent, context, motivation, structure, output indicators, example implementations, and additional consequences (Giray, 2023; White et al., 2024). Techniques such as few-shot and chain-of-thought prompting are particularly useful (Ronanki et al., 2024). However, prompt engineering lacks domain specificity and often yields suboptimal output performance (Gu et al., 2023).
Recent advances propose various approaches to address requirements engineering tasks. For example, LLMs’ inherent capability to detect semantic similarities between words and sentences can be applied to summarize contextually identical requirements (Norheim et al., 2024). Multi-layered frameworks can help address issues arising from conflicting or redundant requirements (Malik et al., 2023). By analysing text data from interviews, online platforms, and other databases, LLMs can optimize formatting, length, memory, and wording more effectively than humans (Singhal et al., 2023; Talebirad & Nadiri, 2023; Kapoor et al., 2024).
The MARE framework investigates requirements engineering in software development by using a multi-agent system based on LLMs. The agents collaborate to perform the entire requirements engineering process, improving the accuracy and quality of requirements specifications (Jin et al., 2024).
To match user needs with their expectations, the Elicitron framework introduces a way to discover linguistic intricacies through the analysis of interviews conducted with LLMs (Ataei et al., 2025). By categorizing needs as either direct or latent, this method aims to identify subliminal expectations, thereby facilitating information extraction from text, a crucial aspect of requirements engineering.
3. Method
3.1. ReqGPT workflow
Figure 1 illustrates the development process of ReqGPT. First, 120 requirements lists for products from the consumer goods sector were automatically created by few-shot prompting GPT-4 with a requirements list structure extracted from Hubka et al. (1988). Of these lists, 107 satisfied all necessary criteria according to ISO 29148 (2018), providing a robust dataset for the subsequent training. The training was performed on the base model of Mistral-7B-Instruct-v0.2 over nine epochs, ensuring comprehensive data integration and processing efficiency. The training process resulted in a parameter-efficient and domain-specific API designed for generating high-quality requirements lists for technical product development.

Figure 1. Flowchart of the development process of ReqGPT
The goal of this project was to develop a fine-tuned LLM to enhance the creation of requirements lists for technical product development. The model was not specifically designed to eliminate hallucinations but rather to generate typical requirements for developmental purposes. By connecting the model with external databases, such as standards and regulatory resources, through retrieval-augmented generation (RAG), we expect that hallucinations can be reduced to an acceptable level (Su et al., 2024).
3.2. Data generation
To generate training data for ReqGPT, we synthesized a generalized structure for a requirements list by abstracting from various specific examples within the field of mechanical design (Hubka et al., 1988). The identified requirement types were systematically grouped into categories and further organized into appropriate subcategories. Prompt engineering techniques using GPT-4 were then employed to generate generalized requirements lists. To respect the token context window limit of GPT-4 (8,192 tokens at the time), the prompt was divided into two sections. The entire composite prompt is provided in Appendix A.
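The data-generation step can be illustrated with a short sketch. It assumes the OpenAI Python client and GPT-4 as described above; the two prompt parts are placeholders standing in for the composite prompt reproduced in Appendix A.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Placeholders for the two sections of the composite prompt (see Appendix A).
PART_1 = "Create a detailed and structured requirements list for {product} ..."
PART_2 = "Continue the requirements list for {product} with the remaining categories ..."

def generate_requirements_list(product: str) -> str:
    """Query GPT-4 in two turns to stay within the 8,192-token context window."""
    messages = [{"role": "user", "content": PART_1.format(product=product)}]
    first = client.chat.completions.create(model="gpt-4", messages=messages)
    messages += [
        {"role": "assistant", "content": first.choices[0].message.content},
        {"role": "user", "content": PART_2.format(product=product)},
    ]
    second = client.chat.completions.create(model="gpt-4", messages=messages)
    return first.choices[0].message.content + "\n" + second.choices[0].message.content

# e.g. dataset = [generate_requirements_list(p) for p in consumer_products]
```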
To optimize the prompt for generating consistent requirements lists, we performed several iterations of refinement, including manual instance selection (Blachnik et al., 2020) as well as deduplication (Lee et al., 2021). During this process, some subcategories were deemed inadequate as they produced irrelevant, misleading, or inconsistent outputs. These outputs were manually refined until the desired results were achieved. This refinement ensured that the input clearly defined individual requirements, avoiding the generation of instruction manuals within the output.
For the dataset, we compiled a sample of 120 consumer products to generate requirements documents. Of these, 13 products deviated strongly from the rest and were therefore excluded from further analysis. The remaining 107 documents were curated and manually refined to ensure consistency and relevance.
3.3. Model selection
The selection criteria for the base model used for fine-tuning focused on open-source accessibility and the ability to perform both training and inference on consumer-grade hardware. At the time of selection, the best-performing model that fulfilled these criteria was Mistral-7B-Instruct-v0.2. With just over seven billion parameters, this model runs on lightweight consumer notebooks, making it ideal for deployment in typical business environments.
The instruction-tuned variant of Mistral-7B was selected as the base model, as fine-tuning small LLMs with task-specific instructions has been shown to enhance performance on new, unrelated tasks in both zero-shot and few-shot scenarios without increasing computational resources and costs (Ouyang et al., 2022). Given the limited size of the dataset, task-specific instructions were integrated into the model instead of performing full-parameter fine-tuning. This approach reduced training time while delivering improved results with fewer input tokens.
3.4. Model training
As outlined in the previous section, Mistral-7B-Instruct-v0.2 serves as the base model for ReqGPT. The fine-tuning process was conducted using an NVIDIA A10 GPU with 24 GB VRAM. The learning rate was set to 3e−5, and the AdamW optimizer was employed (Kingma & Ba, 2014). Later stages of training incorporated QLoRA with four-bit quantization and an adapter rank of 16.
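A hedged sketch of this setup is shown below, using the Hugging Face transformers, peft, and bitsandbytes libraries; the target modules, LoRA alpha, dropout, and batch-size settings are assumptions, while the base model, learning rate, optimizer, adapter rank, and four-bit quantization reflect the values reported above.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# Load the base model with four-bit quantization (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters with rank 16; the target modules are an assumption.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="reqgpt-adapter",
    num_train_epochs=9,
    learning_rate=3e-5,
    optim="adamw_torch",
    per_device_train_batch_size=1,   # assumed to fit a 24 GB A10
    gradient_accumulation_steps=4,   # assumed
    logging_steps=10,
)
# Training itself would then run via a supervised fine-tuning trainer
# (e.g. trl's SFTTrainer) on the 107 curated requirements lists.
```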
The fine-tuning process was carried out with the 107 curated requirements lists and completed over nine epochs. Training progressed with smooth transitions and incremental improvements, ensuring consistency in the outputs. The model achieved a validation loss of 0.4332 (smoothed: 0.4419) after 350 steps. Notably, the threshold for an acceptable validation loss (< 1) was already reached after processing 75 lists that adhere to the characteristics of requirements according to ISO 29148 (2018).
Following training, ReqGPT exhibited strong coherence and replicability, effectively generating high-quality requirements lists for defined products. Additionally, significant prompt shortening (Patil et al., 2023) augmented the efficiency of the model, enabling the generation of detailed requirements lists from input prompts as concise as 15 tokens. Data for training, analysis, and output are accessible on GitHub (Footnote 1).
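For illustration, the following sketch shows how such a short prompt could be sent to the fine-tuned model, assuming the trained LoRA adapter is stored under the illustrative path "reqgpt-adapter" and loaded with peft; the [INST] tags follow the Mistral instruction format.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, "reqgpt-adapter")  # illustrative adapter path

# A standardized prompt of roughly 15 tokens is sufficient after fine-tuning.
prompt = "[INST] Create a requirements document for a smart electric toothbrush. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```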
3.5. Evaluation
To evaluate the output quality of ReqGPT, the corresponding requirements lists were compared to those produced by GPT-4 (the source of the training data) and the baseline Mistral model (prior to fine-tuning). Since language quality is best assessed by human evaluators, we conducted a qualitative study involving 18 graduate-level master’s students specializing in product design. These participants represented diverse technical disciplines, such as engineering, management, and business informatics.
Participants were tasked with acting as product developers improving a technical product by evaluating multiple features of the requirements lists. The assessment included nine criteria according to ISO 29148 (2018): necessity, appropriateness, correctness, creativity, completeness, coherence, unambiguousness, verifiability, and uniformity of requirements and sub-requirements.
The evaluation was conducted during a 60-minute session where the participants rated the lists on a five-point Likert scale (1: strongly disagree, 2: disagree, 3: neutral, 4: agree, 5: strongly agree). Subsequently, the authors analysed the responses to assess linguistic, semantic, and structural differences between the outputs of the different LLMs. The evaluation results are summarized in Table 1.
Table 1. Weighted mean scores for requirements lists generated by Mistral, GPT-4, and ReqGPT

The study’s results show that ReqGPT outperforms both the baseline LLM and GPT-4 in overall mean score across the measured criteria. ReqGPT achieved the highest similarity score at 77% with a moderate interquartile range (IQR) of 17% (69–86%), indicating a consistent evaluation across all nine criteria. GPT-4 followed with a similarity score of 69% and a narrower IQR of 13% (62–75%), reflecting somewhat smaller variability in output quality than ReqGPT. In contrast, Mistral recorded the lowest similarity score at 34% and exhibited the widest IQR of 32% (22–54%), highlighting significant inconsistencies and, comparatively, the poorest outputs. A representative output from ReqGPT can be found in Appendix B.
Overall, ReqGPT outperformed GPT-4 and Mistral, providing more accurate and consistent results, as reflected by its higher median and lower variability. Furthermore, a two-sided test across all evaluation criteria and lists yielded a p-value below 1%, highlighting the statistical significance of ReqGPT’s superior performance compared to the other LLMs.
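For readers who wish to reproduce this kind of analysis, the sketch below computes medians and IQRs from Likert ratings and runs a two-sided Mann-Whitney U test; the paper does not name the specific test used, and the rating arrays are placeholders rather than the study data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder Likert ratings (1-5), not the study data.
reqgpt_ratings = np.array([5, 4, 4, 5, 3, 4, 5, 4])
gpt4_ratings   = np.array([4, 3, 4, 4, 3, 4, 4, 3])

def summarize(scores: np.ndarray) -> tuple[float, float]:
    """Return the median and interquartile range of a set of ratings."""
    q1, q3 = np.percentile(scores, [25, 75])
    return float(np.median(scores)), float(q3 - q1)

print("ReqGPT median/IQR:", summarize(reqgpt_ratings))
print("GPT-4  median/IQR:", summarize(gpt4_ratings))

# Two-sided Mann-Whitney U test, one common choice for ordinal Likert data.
stat, p_value = mannwhitneyu(reqgpt_ratings, gpt4_ratings, alternative="two-sided")
print(f"two-sided p-value: {p_value:.4f}")
```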
4. Challenges and limitations
4.1. Hallucinations
Current LLMs are generic tools that can generate texts, but this ability alone is insufficient for creating adequate requirements documents. While text processing is a prerequisite for formulating requirements lists, domain specificity is indispensable for a successful application in product development. In its current form, however, ReqGPT is not able to mitigate hallucinations. Doing so requires a detailed understanding of the market situation, the company’s internal constraints, legal requirements, as well as relevant norms and standards. To realise this, ReqGPT needs to be extended with relevant databases using RAG methods (Su et al., 2024), which is intended for future research.
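As a rough illustration of the intended RAG extension, the sketch below retrieves the most relevant passages from a small standards corpus and prepends them to the prompt; the corpus entries, embedding model, and prompt wording are assumptions and are not part of the current ReqGPT.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy corpus standing in for databases of norms, standards, and regulations.
corpus = [
    "ISO 29148:2018 defines characteristics of well-formed requirements.",
    "IEC 60529 specifies IP ratings for protection against water and dust ingress.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = embedder.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k corpus passages most similar to the query."""
    q = embedder.encode([query], normalize_embeddings=True)
    scores = corpus_emb @ q[0]
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n".join(retrieve("waterproofness requirement for an electric toothbrush"))
prompt = (f"Context:\n{context}\n\n"
          "Create a requirements document for a smart electric toothbrush.")
# The grounded prompt would then be passed to ReqGPT as in Section 3.4.
```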
4.2. Training dataset
The initial dataset of 120 requirements lists generated via few-shot prompting was considerably small and was further reduced to 107 lists for model retraining. This reduction was justified by the observation that an acceptable validation loss was reached after training with just 75 lists, rendering the additional 32 lists less necessary. Working with a limited dataset was possible because it was meticulously curated and deliberately selected upfront, rather than relying on randomized data generation through the LLM. For future replications of similar fine-tuning processes, careful monitoring of the validation loss is advised, as it serves as a key indicator of effective learning while minimizing the risk of overfitting (Schubert et al., 2024).
4.3. Bias in human evaluation
The analysis of LLM outputs through human evaluation inherently involves a degree of subjectivity and can lead to biased results (Tjuatja et al., 2024). To address this limitation, incorporating a more diverse group of evaluators, beyond engineering master’s students, could provide a broader and more balanced assessment. Expanding the cohort of evaluators would improve the clarity and precision of the analysis, leading to more robust and significant outcomes.
4.4. LLMs for evaluation
An alternative evaluation approach involves using multiple-agent collaborations of LLMs as impartial evaluators. Unlike human evaluation, which is both time-consuming and costly, LLM-based evaluation addresses these challenges more efficiently (Dubois et al., 2024). Research has shown that humans often focus less on output correctness and more on relatively marginal aspects, such as geometry, potentially missing critical issues (Singhal et al., 2023). LLMs, by contrast, provide the possibility for more accurate assessments with fewer errors, enhancing the reliability of evaluation outcomes (Chang et al., 2024).
5. Conclusion and future work
This study demonstrates that ReqGPT, even though it is based on the smaller Mistral-7B-Instruct-v0.2 model, can outperform the significantly larger, general-purpose GPT-4 in generating requirements lists for technical product development. Despite the larger scale and broader data access of GPT-4, ReqGPT’s targeted adaptations enable it to better address industry-specific demands.
The findings highlight several important advantages of smaller LLMs. First, they operate with increased efficiency, requiring less computational power and functioning effectively on local hardware. Second, local deployment of LLMs enhances data privacy and cybersecurity, which are essential for industries handling sensitive or proprietary information. By minimizing reliance on external networks and servers, local models reduce data exposure risks while also offering stability in environments with inconsistent internet connectivity (Da Silva et al., 2022). Third, smaller LLMs are easier to customize and specialize for specific tasks or industries. Models like Mistral-7B-Instruct-v0.2 can be fine-tuned with greater ease, enabling improved performance in targeted applications and generating outputs that are highly relevant and customized to the user’s operational context.
The results from this study pave the way for two critical avenues for future research. First, efforts should focus on integrating LLM-accelerated workflows into real-world product development processes. This involves refining the models to improve their specificity and accuracy and embedding them in existing product development workflows to evaluate their practical impact. Second, incorporating agent-based retrieval systems offers significant potential to improve LLMs’ contextual understanding. These systems would extract specific data from knowledge databases, thus enriching the models’ context with relevant, up-to-date information, improving both the quality and relevance of the content produced.
Acknowledgement
We would like to thank the participants of the master course “AI-supported Product Development” from Leuphana University Lüneburg for taking part in the evaluation.
Appendix A
Prompt details for GPT-4
No "```markdown" and "```" at start and end. Create a detailed and structured requirements list that can be used for product development for [product]. Use Markup to structure the text. Do not use explanations, just provide the necessary information/facts about this list. Follow this structure:
Product design specifications document for [product]

Appendix B
Output from ReqGPT for a smart electronic toothbrush (excerpt)
1. Main Function
Description
– The primary purpose of the smart electric toothbrush is to facilitate efficient and effective oral hygiene through automated brushing.
– The product is designed to enhance user experience by integrating smart features such as app connectivity, personalized brushing programs, and real-time feedback.
Key Features
– Automated Brushing: Oscillating, pulsating, or sonic brushing action to remove plaque effectively.
– App Connectivity: Bluetooth or Wi-Fi integration for syncing brushing data with a companion app.
– Personalized Brushing Programs: Customizable brushing modes based on user needs (Sensitive Teeth, Whitening, Gum Care).
– Real-time Feedback: Vibration or audio signals to guide the user for the correct brushing technique.
– Battery Management: Rechargeable battery with smart charging indicators.
– Water Resistance: IPX7 or higher rating for waterproofness.
– Replaceable Brush Heads: Compatibility with multiple types of replaceable brush heads.
– Smart Timer: Built-in timer to ensure the recommended brushing duration (e.g., 2 minutes).
– Pressure Sensor: Alerts user when brushing too hard to prevent enamel damage.
2. Functionally Determined Properties
Performance Requirements
– Battery Life: Minimum of 14 full brushing sessions on a single charge.
– Charging Time: Less than 4 hours to fully charge.
– Brushing Modes: At least 5 distinct brushing modes (e.g., Daily Clean, Sensitive, Whitening, Gum Care).
– Oscillations Per Minute (OPM): Minimum of 31,000 OPM for sonic models.
– Pulsations Per Minute (PPM): Adjustable pulsations from 5,000 to 31,000 PPM.
– App Compatibility: Compatible with iOS and Android devices.
– Bluetooth Range: Up to 10 meters for seamless connection.
– Data Storage: Ability to store brushing history for at least 6 months.
– Waterproof Depth: IPX7 rating to withstand immersion in up to 1 meter of water. […]