
Can large language models support machine learning implementation in product development? A comparative analysis and perspectives

Published online by Cambridge University Press:  27 August 2025

Sebastian Sonntag*
Affiliation:
Universität Duisburg-Essen, Germany
Janosch Luttmer
Affiliation:
Universität Duisburg-Essen, Germany
Arun Nagarajah
Affiliation:
Universität Duisburg-Essen, Germany

Abstract:

Recent advancements in machine learning (ML) offer substantial potential for enhancing product development. However, adoption in companies remains limited due to challenges in framing domain-specific problems as ML tasks and selecting suitable ML algorithms, which requires expertise that is often lacking. This study investigates the use of large language models (LLMs) as recommender systems for facilitating ML implementation. Using a dataset derived from peer-reviewed publications, the LLMs were evaluated for their ability to recommend ML algorithms for product development-related problems. The results indicate moderate success, with GPT-4o achieving the highest accuracy by recommending suitable ML algorithms in 61% of cases. Key limitations include inaccurate recommendations and challenges in identifying multiple sub-problems. Future research will explore prompt engineering to improve performance.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is unaltered and is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use or in order to create a derivative work.
Copyright
© The Author(s) 2025

1. Introduction

Recent advancements in artificial intelligence (AI), particularly in machine learning (ML) as a subfield of AI, have led to a growing number of ML-based approaches being applied in product development (Yüksel et al., 2023). Driven by increasing cost and innovation pressures, intensified international competition (Krause, 2018), and additional challenges and expenses resulting from the transition to a circular economy (Pluhnau et al., 2023), ML has shown significant potential to support product development tasks, ranging from routine activities in requirements management (Luttmer et al., 2024) and information allocation (Ehring et al., 2024) to creative tasks in concept development (Sonntag, Brünjes, et al., 2024). However, the adoption of these methods within companies remains limited (Cooper & Brem, 2024). One key challenge lies in companies' difficulties with taking the initial steps to identify opportunities for applying ML to product development-related tasks. In this context, tailored solutions are required to meet individual needs, making the selection of suitable ML algorithms particularly complex. Currently, this process requires specialized expertise that is often lacking within companies, is highly subjective, and typically involves a lengthy trial-and-error approach in which numerous ML algorithms are tested (Müller et al., 2023; Sarang, 2023). Moreover, poorly defined use cases can result in underwhelming outcomes and reduced willingness to invest, thereby hindering the adoption of ML in product development (Cooper & Brem, 2024).
Consequently, there is a pressing need to support companies in exploring and identifying ML’s potential applications to effectively address their unique product development related problems.

1.1. Implementing machine learning in product development

Building on the preliminary work presented in Sonntag et al. (2023, 2024), key preparatory steps are required before implementing ML in the context of product development. These steps include identifying and formulating the domain-specific product development-related problem, deriving the associated ML-related problems, and subsequently selecting suitable ML algorithms as potential solutions. As illustrated in Figure 1, a major challenge lies in the translation of the product development-related problems into ML-related problems, which requires specialized ML expertise that is often not readily available within companies (Zschech et al., 2019).

Figure 1. Required translation of the product-related problem formulation into the corresponding ML-related problem formulation

Consequently, several approaches have been proposed in the literature to support these initial steps, particularly the selection of suitable ML algorithms, with the goal of making the early stages of ML implementation more accessible to non-experts. These approaches can be categorized into three distinct classes:

Quantitative decision support – Several publications focus on testing and comparing a defined selection of ML algorithms for specific problems and datasets. For instance, Luttmer et al. (2023) evaluate four different ML algorithms for requirement extraction from standard documents. The objective is to compare the performance of these algorithms using predefined performance measures such as precision, recall, and F1 score. To aid in this process, Blagec et al. (2022) introduce a knowledge graph containing various benchmarking results. Additionally, in the broader field of AutoML, optimization approaches iteratively adjust ML algorithm hyperparameters to identify the most suitable one for a given dataset (Waring et al., 2020). More recently, Nascimento et al. (2023) explored the application of LLMs as an alternative to AutoML. Here, the aim is to identify suitable ML algorithms for solving specific ML problems defined by prompting with a preexisting dataset. However, these methods generally assume that the ML problem to be addressed is already known and that a corresponding dataset is available.

Qualitative decision support – In addition to comparing ML algorithms based on performance metrics, some approaches utilize predefined criteria lists to evaluate algorithms based on qualitative aspects such as transparency or computational efficiency for specific problems. For example, Lickert et al. (2021) compare several supervised learning algorithms for their application in reverse logistics, while Riesener and Klumpen (2020) examine a limited selection of ML algorithms in the context of product development. However, these approaches are limited by their reliance on predefined ML problems and vary significantly in the criteria and ML algorithms considered.

Knowledge-based support – Several approaches have been proposed that introduce databases containing information on ML algorithm characteristics, such as advantages, disadvantages, and application scenarios. These databases aim to assist in exploring the solution space of applicable ML algorithms for predefined product development-related problems. In this context, Gerschütz et al. (2021) present a semantic web-based approach that covers a limited number of ML algorithms, including their characterization in the context of potential use cases in product development, while Gerschütz et al. (2023) propose an ontology-based approach that enables the identification of suitable ML algorithms through query syntax. Additionally, Zschech et al. (2020) introduce a text-based recommender system that identifies the type of ML problem (e.g. regression or classification) based on a textual description of the domain-specific problem. However, these approaches primarily serve as explorable knowledge databases that cover only a limited portion of the solution space, rely on expert knowledge to evaluate potential solutions, or fail to recommend specific ML algorithms capable of directly addressing the problem.

1.2. Research gaps and objectives

Based on the literature review, the following deficits of the existing approaches can be derived:

Prerequisite of ML problem definitions – Current approaches require the ML-related problem to be known and offer no support in deriving ML problems, including specific ML algorithms as solution candidates, from product development-related problem formulations.

Prerequisite of existing datasets – Quantitative decision support methods in particular rely on the availability of datasets, which may not exist in the early stages of exploring the possibility of applying an ML algorithm as support.

Restriction to predefined solutions – Existing support methods are limited to predefined ML algorithms and problem types, reducing their applicability to novel or tailored problem formulations.

In summary, no existing approach effectively bridges the gap between product development-related domain-specific problem formulations and the selection of ML algorithms without requiring knowledge of the inherent ML problem or predefined datasets. This highlights the need for a method that supports early-stage exploration of the potential of applying ML algorithms for product development problems.

As product development problems are formulated in natural language and require domain-specific knowledge of both product development and ML, LLMs present a promising opportunity to address these gaps, given their language understanding, general knowledge, and resulting flexibility with respect to novel problems. Since LLMs have already proven these capabilities by serving as the foundation for similar tasks, such as recommender systems for data science-related problems (Alsayed et al., 2024) and expert systems for assembly processes (Hu et al., 2023), the following research question arises:

To what extent can LLMs support the identification of suitable ML algorithms capable of solving product development-related problems based on a given problem formulation?

Therefore, the goal of this study is to explore the capabilities and limitations of LLMs as recommender systems for the initial stages of implementing ML algorithms for product development-related problems.

The remainder of the paper is structured as follows: Section 2 describes the study design, with the results presented in Section 3 and discussed in Section 4. Finally, the research concludes with a summary and an outlook on future work in Section 5.

2. Study design

The study design is based on four main steps, as illustrated in Figure 2, which are elaborated in detail in the following subsections.

Figure 2. Study design

First, a novel dataset of product development-related problem statements and corresponding ML algorithms is developed, derived from peer-reviewed publications focusing on the application of ML to product development-related tasks, to ensure validity and reliability. To this end, the publications were identified through a systematic literature review (Section 2.1). The problem statements from the dataset are used to design prompts that describe the product development-related problems and the task for the LLM of identifying suitable ML algorithms (Section 2.2). Next, evaluation metrics are defined to provide a foundation for answering the research question. These metrics also serve as a starting point for analysing the limitations and areas for improvement of the investigated LLMs (Section 2.3). Finally, the derived prompts are used as input to the LLMs under investigation (Section 2.4).

2.1. Dataset creation

The systematic literature review, conducted to identify publications that serve as the foundation for creating the dataset, follows the PRISMA 2020 guidelines to ensure a transparent and reproducible process (Page et al., 2021). A detailed documentation of the literature review process, along with the resulting dataset, is available on GitHub.

Definition of search strategy and databases – The literature review aims to identify publications on applying ML algorithms to product development-related problems. Therefore, the literature search is guided by the topics of product development and ML.

The product development process and its associated activities are defined based on VDI 2221, a well-established framework consisting of four main phases: problem definition (PD), concept development (CD), embodiment design (ED), and detailed design (DD) (Gericke et al., 2021). These phases guided both the literature search and the categorization of the identified literature, ensuring that the dataset represents problem formulations from all phases equally.

This study focuses on ML, reflecting recent advancements in addressing product development-related problems (Yüksel et al., 2023). As a subfield of AI, ML encompasses algorithms capable of identifying data correlations and making predictions, categorized into supervised, unsupervised, and reinforcement learning, along with subfields like deep learning and generative AI.

The search was conducted using the Scopus and Web of Science databases, selected for their extensive coverage and partially distinct indexed publications (Singh et al., 2021).

Definition of inclusion and exclusion criteria – The study focuses on peer-reviewed journal articles and conference papers published in English between 2004 and 2024. Only publications addressing the specific application of ML algorithms to product development-related problems were considered. General frameworks and studies involving, for example, the application of genetic algorithms were excluded.

Data extraction and synthesis – To establish the dataset, two key information artifacts were extracted using narrative synthesis: product development problem statements and the ML algorithms applied. As problem statements are typically presented alongside methods in abstracts, the abstracts were fully extracted to serve as the basis for formulating the dataset’s problem statements.

Performing the literature search – Based on the review process illustrated in Figure 3, a total of 170 publications were identified and categorized by product development phases. To ensure balance, 39 publications were randomly selected from each of the overrepresented phases, resulting in a total of 156 problem statements and corresponding ML algorithms for the dataset.
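The balancing step described above amounts to a per-phase random draw. The following sketch illustrates it under stated assumptions: the phase labels follow VDI 2221 as used in this study, the cap of 39 per phase follows the text, and the corpus counts are dummy values chosen only so that 170 publications reduce to 156.

```python
import random

# The four VDI 2221 phases used to categorize the publications.
PHASES = ["PD", "CD", "ED", "DD"]

def balance_dataset(publications, cap=39, seed=42):
    """Randomly keep at most `cap` publications per product development phase."""
    rng = random.Random(seed)
    balanced = []
    for phase in PHASES:
        in_phase = [p for p in publications if p["phase"] == phase]
        if len(in_phase) > cap:
            in_phase = rng.sample(in_phase, cap)  # random down-sampling
        balanced.extend(in_phase)
    return balanced

# Dummy corpus: 52 + 40 + 39 + 39 = 170 publications (counts are invented).
corpus = (
    [{"phase": "PD", "id": i} for i in range(52)]
    + [{"phase": "CD", "id": i} for i in range(40)]
    + [{"phase": "ED", "id": i} for i in range(39)]
    + [{"phase": "DD", "id": i} for i in range(39)]
)
subset = balance_dataset(corpus)  # 39 per phase, 156 in total
```

With four phases capped at 39 publications each, the balanced dataset contains exactly 4 × 39 = 156 entries, matching the dataset size reported above.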

Figure 3. Review process

2.2. Prompt design

Prompts form the foundation of input for LLMs, describing the task to be performed as well as the desired output generated by the LLM (Sahoo et al., 2024). In the scope of this study, the investigation focuses on the core capabilities of LLMs, excluding specialized techniques such as prompt engineering. To ensure standardized and comparable inputs, the prompts were built upon a structured format, as illustrated in Figure 4. This structure consists of three elements: the problem description, the task description, and the format description.

Figure 4. Prompt design based on the information provided in the abstracts

The problem description – The problem description defines the product development-related issue for which an ML algorithm is being sought. It also provides contextual information and outlines any constraints associated with the problem. These descriptions are formulated based on abstracts extracted from the reviewed publications. However, as illustrated in Figure 4, the problems described in the abstracts are often solution-specific and therefore must be rephrased into solution-neutral problem statements.

The task description – The task description defines the specific task the LLM is expected to perform. In this study, the LLMs were tasked with identifying suitable ML algorithms to solve the product development-related problems described in the problem descriptions; this task is consistent across all derived prompts. It involves translating the domain-specific problem into an ML-specific problem and providing a suitable ML algorithm for application.

The format description – To facilitate evaluation, a clear description of the desired format for the output generated by the LLM was included in the prompt. This ensured that the responses could be easily analyzed and compared.

This study investigates the capabilities of LLMs using a zero-shot approach as the primary focus. However, initial exploratory trials with prompts based on publications outside the dataset identified a notable pattern: when the number of ML algorithms to suggest was not explicitly specified, the LLM occasionally provided a range of suitable ML algorithms instead of identifying a single one. To ensure consistent and comparable responses, two additional one-shot prompts were defined and used whenever the LLM offered multiple ML algorithms as a solution. These additional prompts are as follows:

  • Name only the most appropriate machine learning algorithm.

  • Name for each identified problem only the most appropriate machine learning algorithm.
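The three-part prompt structure and the conditional one-shot follow-up described above can be sketched as follows. The template wording and the heuristic for detecting a multi-algorithm answer are illustrative assumptions, not the authors' exact prompts; only the follow-up sentence is quoted from the study.

```python
# Illustrative task and format descriptions (assumed wording).
TASK_DESCRIPTION = (
    "Identify a suitable machine learning algorithm to solve the "
    "product development-related problem described above."
)
FORMAT_DESCRIPTION = "Answer with the name of the algorithm only."
# One-shot follow-up taken from the study.
ONE_SHOT_FOLLOW_UP = "Name only the most appropriate machine learning algorithm."

def build_prompt(problem_description: str) -> str:
    """Assemble the three prompt elements: problem, task, and format description."""
    return "\n\n".join([problem_description, TASK_DESCRIPTION, FORMAT_DESCRIPTION])

def query_with_follow_up(llm, problem_description: str) -> str:
    """Zero-shot query; if the model lists several algorithms, re-prompt once."""
    answer = llm(build_prompt(problem_description))
    if "," in answer or " and " in answer:  # crude multi-algorithm heuristic
        answer = llm(ONE_SHOT_FOLLOW_UP)
    return answer
```

Here `llm` stands in for any callable that sends a prompt to a model and returns its text response; in the study this role is filled by the web APIs of GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet.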

2.3. Evaluation metrics

To evaluate the LLMs' capability in performing the task, which forms the basis for answering the research question, two evaluation metrics were defined: the task fulfilment rate (TF-rate) and the identical algorithm rate (IA-rate). They are calculated as follows (TF = number of cases in which the task can be noted as fulfilled, NP = number of problems tested, IA = number of cases in which identical algorithms were identified):

(1) $$\textit{TF-rate} = {{TF} \over {NP}}$$
(2) $$\textit{IA-rate} = {{IA} \over {TF}}$$

The TF-rate evaluates whether the LLM translates a product development problem into an ML-related problem by recommending a suitable ML algorithm. A task is considered fulfilled if the suggested ML algorithm is either identical to or equally applicable for the given problem. Alternative ML algorithms proposed by the LLM were reviewed for applicability based on their traits, acknowledging the LLM’s ability to suggest new, valid solutions.

In contrast, the IA-rate measures the proportion of cases where the ML algorithm suggested by the LLM is identical to the one used in the corresponding publication.

Both metrics are applied with respect to the distinct product development phases. Additionally, the TF-rate is further analysed to determine whether a zero-shot approach was sufficient (TF-rate0) or if a one-shot approach (TF-rate1) was required.
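Given per-case annotations, the metrics above reduce to simple ratios. The sketch below assumes a minimal record per case (the field names are invented); note that the IA-rate is taken relative to the fulfilled cases, while the TF-rates are taken relative to all tested problems.

```python
def evaluate(cases):
    """Compute TF-rate, IA-rate, and the zero-shot TF-rate from annotated cases.

    Each case is a dict with (assumed) fields:
      fulfilled - the suggested algorithm is identical or equally applicable
      identical - the suggestion matches the publication's algorithm exactly
      zero_shot - True if no one-shot follow-up prompt was needed
    """
    np_ = len(cases)                             # NP: number of problems tested
    tf = sum(c["fulfilled"] for c in cases)      # TF: fulfilled cases
    ia = sum(c["identical"] for c in cases)      # IA: identical algorithms
    return {
        "TF-rate": tf / np_,
        "IA-rate": ia / tf if tf else 0.0,       # measured among fulfilled cases
        "TF-rate0": sum(c["fulfilled"] and c["zero_shot"] for c in cases) / np_,
    }
```

For example, four cases of which three are fulfilled and two match the published algorithm exactly yield a TF-rate of 0.75 and an IA-rate of 2/3.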

2.4. Study execution

For this study, three state-of-the-art LLMs with comparable capabilities are investigated: GPT-4o (OpenAI, 2024), Gemini 1.5 Pro (Google, 2024) and Claude 3.5 Sonnet (Anthropic, 2024). The prompts were uniformly provided as zero-shot inputs via the web API of each LLM, using their default parameters. When necessary, the one-shot prompt was applied individually.

3. Results

A total of 156 experiments were performed for each LLM under investigation. The responses were assessed based on predefined criteria, yielding the calculated TF- and IA-rates, as summarized in Tables 1 and 2. These results are analyzed for individual product development phases and for the dataset as a whole. Table 2 presents a detailed analysis, specifying whether the task was successfully completed in a zero-shot setting or required a one-shot approach.

Table 1. Comparison of the achieved TF- and IA-rates of the LLMs

Overall, all LLMs demonstrated a comparable, moderate ability to translate product development-related problems into ML-related problems by recommending appropriate ML algorithms, achieving TF-rates between 54% and 61%. Among them, GPT-4o demonstrated the best performance, achieving a TF-rate of 61% across all cases, followed closely by Claude 3.5 Sonnet with 59%, while Gemini 1.5 Pro showed the lowest performance at 54%. A closer examination of the product development phases to which the problems were assigned reveals an interesting trend: GPT-4o performed worst on problems related to detailed design, with a TF-rate of 54%, whereas Gemini 1.5 Pro performed best there with a TF-rate of 69%. However, Gemini 1.5 Pro and Claude 3.5 Sonnet struggled significantly, with TF-rates of 36% and 54%, respectively, on problems related to the problem definition phase.

With respect to the alignment between the ML algorithms proposed by the LLMs and those used in the publications, all LLMs achieved relatively moderate results. Here, GPT-4o and Claude 3.5 Sonnet achieved relatively close IA-rates of 43% and 50%, respectively, indicating that these LLMs tended to propose valid alternative solutions. In contrast, Gemini 1.5 Pro demonstrated the best performance, achieving an IA-rate of 68%.

Table 2. Comparison of TF-rates for zero-shot and one-shot approaches

R-ZS: ratio of zero-shots

In several cases, an additional prompt was necessary because the LLMs generated outputs that either listed multiple ML algorithms of the same type (e.g. K-means and DBSCAN), provided a general term (e.g. Natural Language Processing (NLP)), or referred to a category of ML algorithms (e.g. Classification). As shown in Table 2, Claude 3.5 Sonnet exhibited this behavior in only a small number of cases, successfully generating the desired output in a zero-shot approach 99% of the time. Moreover, when an additional prompt was required, Claude 3.5 Sonnet provided a valid solution. By comparison, Gemini 1.5 Pro successfully generated the desired output in 73% of cases, while GPT-4o achieved 66%.

However, GPT-4o demonstrated a higher capability to generate a valid solution after the one-shot prompt, with a TF-rate1 of 62%, than Gemini 1.5 Pro, which performed significantly worse at 36%, often repeating or sticking to general terms rather than specifying a valid solution. This indicates that the additional prompt in the one-shot approach was particularly effective in enabling Claude 3.5 Sonnet and GPT-4o to refine their outputs and provide a valid ML algorithm as a solution.

4. Discussion

The moderate TF-rate of 61% achieved by GPT-4o, the best-performing model, indicates that LLMs cannot reliably support the translation of product development-related problems into ML-related ones, as posed in the research question. This raises the question of the specific limitations or classes of failures encountered by the LLMs and how these issues can be effectively addressed. To investigate this, Section 4.1 explores and discusses the reasons behind incorrect translations of product development-related problems into ML-related ones. Based on these findings, Section 4.2 examines the underlying causes and proposes potential strategies to overcome these challenges. Finally, Section 4.3 presents the study's limitations and connects them to directions for future research.

4.1. Classes of failures

An investigation into the reasons behind the LLMs' inability to provide a suitable ML algorithm for a given product development-related problem revealed six distinct classes of failures. As shown in Figure 5, half of these failure classes are observed in all three LLMs. However, three exceptions were identified: GPT-4o uniquely exhibited issues related to the identification of multiple non-existent ML problems, Gemini 1.5 Pro and Claude 3.5 Sonnet provided general AI-related terms as solutions, and Gemini 1.5 Pro declared a problem unsolvable with ML.

Figure 5. Classes of failures

Naming unsuitable ML algorithms – The majority of failures for GPT-4o and Claude 3.5 Sonnet resulted from suggesting unsuitable ML algorithms for the given product development-related problems. In particular, the LLMs frequently suggested ML algorithms that were incompatible due to factors such as the provided data type, the available data volume, or their inability to correctly identify the type of the ML-related problem. For instance, they occasionally misclassified a clustering problem as a classification problem.

Inability to identify multiple ML-related problems – A significant reason for failure in all LLMs was their inability to identify multiple ML-related problems within a single product development problem description. For instance, some cases required the identification of both a classification algorithm and a clustering algorithm to address two distinct sub-problems. However, none of the LLMs were able to recognize all relevant problems in such scenarios. Nevertheless, they were generally capable of identifying at least one issue and providing a suitable solution for it. It is also worth noting that GPT-4o occasionally made the opposite error by identifying more ML-related problems within a single problem description than actually existed.

Naming general AI-related terms – A significant proportion of the failures specific to Gemini 1.5 Pro, partly also observed for Claude 3.5 Sonnet but absent in GPT-4o, involved the use of general AI-related terms instead of specific ML algorithms. These included broad topics such as NLP, regression, or clustering. In such cases, even applying the one-shot approach failed to yield a satisfactory result.

Naming non-ML-related solutions – In a total of 17 cases, all LLMs suggested solutions that fell outside the scope of ML, such as genetic or evolutionary algorithms or techniques like Particle Swarm Optimization (PSO). Notably, Gemini 1.5 Pro exhibited this behaviour more frequently than GPT-4o and Claude 3.5 Sonnet.

Stating insolvability of the problem – In one instance, Gemini 1.5 Pro claimed that the given problem description was unsolvable using ML algorithms.

To be reliably utilized as a recommender system supporting the implementation of ML in product development, an LLM must accurately identify the underlying ML problems and recommend appropriate ML algorithms for their application. However, the analysis revealed that the LLMs struggle to interpret the problem formulation: they fail to recognize multiple ML-related problems within a single problem formulation and frequently suggest unsuitable ML algorithms. Additionally, the LLMs struggle to reliably present specific, actionable ML algorithms, sometimes providing overly generic or irrelevant solutions. These limitations indicate that the basic application of LLMs for translating product development-related problems into ML solutions is currently inadequate, and significant advancements are necessary, particularly in problem recognition, domain-specific reasoning, and algorithm recommendation.

4.2. Possible root causes and potential for advancement

Several optimization methods exist to enhance the output quality and task fulfilment of LLMs (Ibrahim et al., 2024; Sahoo et al., 2024). In this context, the identified failures can be attributed to three possible root causes, each of which has implications for potential advancements: the problem description, problem understanding, and ML algorithm selection.

Problem description – Failures related to the LLMs presenting unsuitable ML algorithms and their inability to identify all ML-related problems in the product development-related problem description may stem from an insufficient problem description, in this case, the prompt. A lack of detailed information regarding data characteristics, constraints, or an imprecise formulation of the problem may have hindered the LLMs’ ability to accurately identify and address the issues. This limitation could potentially be mitigated by using a more structured and detailed prompt that includes additional problem-related information. Alternatively, the application of prompt engineering techniques could enhance the solution quality for similar tasks, offering a promising avenue for improvement.

Problem understanding – The inability to identify multiple problems, the failure to consider relevant background information that could narrow down the range of applicable ML solutions, and the provision of general terms instead of specific ML algorithms may stem from insufficient knowledge of product development-related problems and ML, along with their corresponding definitions. Rather than employing fine-tuning, which requires a large amount of data, this knowledge could be integrated into the LLM using Retrieval-Augmented Generation (RAG) (Ibrahim et al., 2024). Previous studies applying LLMs to domain-specific tasks have demonstrated that this technique successfully enhances domain-specific knowledge, leading to improved solutions (Liu et al., 2024).
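The RAG idea suggested here can be illustrated as follows: algorithm descriptions are retrieved by similarity to the problem statement and prepended to the prompt as context. This is a toy sketch under stated assumptions: word-overlap (Jaccard) similarity stands in for learned embeddings, and the knowledge snippets are invented examples, not a real knowledge base.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity as a stand-in for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Invented mini knowledge base of ML algorithm characteristics.
KNOWLEDGE = [
    "K-means groups unlabeled samples into clusters of similar items",
    "Random forests classify labeled samples using ensembles of decision trees",
    "LSTM networks predict the next values of sequential time series data",
]

def augment_prompt(problem: str, k: int = 1) -> str:
    """Prepend the k most similar knowledge snippets to the problem statement."""
    ranked = sorted(KNOWLEDGE, key=lambda doc: jaccard(problem, doc), reverse=True)
    context = "\n".join(ranked[:k])
    return f"Context:\n{context}\n\nProblem:\n{problem}"
```

In a real RAG pipeline, the retrieval step would use dense embeddings over a curated corpus of ML algorithm characterizations, but the control flow, retrieve, then generate with the retrieved context in the prompt, is the same.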

ML algorithm selection – A significant proportion of failures, particularly those generated by GPT-4o, stemmed from the selection of incorrect ML algorithms. Similar to the issues caused by insufficient problem descriptions, an inadequate or unreasoned selection process, lacking a systematic comparison of ML algorithms based on their characteristics, may have contributed to these errors. Implementing a systematic evaluation approach, such as the one presented in Sonntag, Pohl, et al. (2024), could improve the quality and reliability of the algorithm selection process, thereby reducing the occurrence of such failures.

4.3. Limitations and future work

The primary goal of this study was to assess the fundamental capabilities of LLMs in translating product development-related problems into ML-related problems by identifying suitable ML algorithms for application. Additionally, the study aimed to highlight areas and directions for future research. To this end, default hyperparameter settings were used throughout the study. However, the effects of varying hyperparameters, such as temperature, top-k sampling, or repetition penalties, on the quality and relevance of the outputs should be tested.

Additionally, prompt engineering techniques (e.g. chain-of-thought prompting, contextual rephrasing), which were deliberately excluded in this study, should be incorporated in future experimental comparisons to evaluate baseline results against outputs generated using these advanced techniques.

The study did not consider the probabilistic nature of LLMs and whether their generated solutions are reproducible. Future work should include an analysis of the consistency and variability of the outputs. Moreover, incorporating a broader range of open-source and commercial LLMs could enable a more comprehensive comparison. In this context, smaller language models, which require less effort for fine-tuning compared to LLMs, should also be examined to assess the capabilities necessary for effectively solving the given tasks and to evaluate their potential as viable alternatives to larger models.

The dataset used in this study was built from peer-reviewed publications, ensuring a balance across the product development phases and their corresponding problem types. However, the dataset lacks balance regarding the ML algorithms represented in the included problems. Future studies should address this limitation by expanding the dataset to encompass a wider range and a more balanced representation of ML algorithms, ensuring that the findings are not biased toward specific methods. Furthermore, the validity of the ML approaches presented in the publications was not tested but assumed by the peer-reviewed status.

5. Conclusion

The implementation of ML in product development within companies involves several critical preparatory steps. These include the translation of domain-specific product development problems into corresponding ML-related problems and the selection of suitable ML algorithms as potential solutions. This process often requires specialized ML expertise, which is frequently unavailable within companies. Accordingly, this study investigated the extent to which LLMs can assist in identifying suitable ML algorithms for product development tasks based on specified problem formulations.

A novel dataset of product development-related problem formulations and corresponding ML algorithms, derived from peer-reviewed publications, was used to evaluate three LLMs: GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet. GPT-4o achieved the highest TF-rate at 61%, which is nonetheless insufficient for reliable support in practice. The observed failures were attributed to unsuitable algorithm suggestions, an inability to identify multiple sub-problems, and a reliance on generic terms instead of specific algorithms, rooted in deficiencies in problem formulation, problem understanding, and algorithm selection.
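For replication, the headline metric can be computed from paired lists of recommendations and reference algorithms. The sketch below assumes the TF-rate is the fraction of cases in which the recommended algorithm matches the one reported in the source publication; the study's actual matching criterion may be more nuanced (e.g. accepting several suitable algorithms per problem), and the example data are invented.

```python
def tf_rate(recommended, reference):
    """Fraction of cases where the recommendation matches the reference
    algorithm (simplified exact match; the study's criterion may differ)."""
    hits = sum(r.lower() == g.lower() for r, g in zip(recommended, reference))
    return hits / len(reference)

# Hypothetical model outputs vs. ground-truth algorithms:
rec = ["Random Forest", "CNN", "SVM", "k-means", "CNN"]
ref = ["random forest", "CNN", "GAN", "k-means", "LSTM"]
print(tf_rate(rec, ref))  # 0.6
```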

These findings lay a foundation for future research focused on: (1) expanding the dataset to balance ML algorithm representation and address biases; (2) employing advanced prompt engineering techniques to test whether output precision improves; and (3) analysing the reproducibility and variability of LLM-generated solutions under varying hyperparameters.

Figure 1. Required translation of the product-related problem formulation into the corresponding ML-related problem formulation

Figure 2. Study design

Figure 3. Review process

Figure 4. Prompt design based on the information provided in the abstracts

Table 1. Comparison of the achieved TF- and IA-rates of the LLMs

Table 2. Comparison of TF-rates for zero-shot and one-shot approaches

Figure 5. Classes of failures