1. Introduction
Eco-design is an approach aimed at minimizing a product’s environmental impact by integrating sustainable materials and energy efficiency and by promoting user behaviors that support sustainability throughout its life cycle (Balikci et al., 2021; Cor & Zwolinski, 2015; MacDonald & She, 2015). Within eco-design, Design for Sustainable Behavior (DfSB) considers product features that guide or control users’ actions toward environmentally friendly outcomes, such as conserving electricity and water or reducing waste (De Medeiros et al., 2018; Shu et al., 2017).
The degree of control afforded to users is a primary consideration for user receptivity to behavioral change. Designs that offer too little control can feel restrictive or frustrating, while too much control may result in inconsistent sustainable behavior (Consolvo et al., 2009; Coskun et al., 2015). Persuasive designs encourage or enable users to make sustainable choices by providing cues, feedback, or incentives (Asbjørnsen et al., 2022; Yun et al., 2013). These persuasive design features are considered cognitive interventions, as they intend to influence users’ thoughts, attitudes, and decision-making processes to promote eco-friendly actions (Saadi & Yang, 2020; Wilson et al., 2024). Decisive design, conversely, automates eco-friendly behavior with minimal or no user involvement; these are physical interventions. For example, a paper towel dispenser designed to reduce paper consumption might employ persuasive design by displaying the environmental impact of deforestation, whereas a decisive design could use automatic dispensing to limit paper towel usage (Saadi & Yang, 2020).
This study aims to develop a research methodology that bridges the gap between designer intentions and user perceptions in DfSB. Traditional user-centered research methods, such as interviews, surveys, and observations, are effective but can be time-consuming and difficult to scale (Ghazi et al., 2019). This study seeks to leverage large datasets of user-generated content, specifically online product reviews and ratings, to efficiently gather insights from a broad range of users to inform eco-design decision-making.
To achieve this, the study employs Large Language Models (LLMs) to assess product information from a user’s perspective and generate actionable design recommendations tailored to target users. Recent advancements in LLMs have shown their potential for perspective-taking, a cognitive mechanism that allows models to infer and evaluate user needs and experiences (Siddharth et al., 2022; Zhu et al., 2024). Simulating users’ cognitive processes could be particularly insightful for persuasive design, which utilizes cognitive interventions to change user behavior.
This potential for simulating user perspectives can be harnessed through specific techniques that enhance the reasoning capabilities of LLMs. In-context learning is a method where LLMs are provided with contextual information to perform specific tasks without the need for fine-tuning (Dong et al., 2024). The model can infer patterns and apply similar reasoning to new data. This method allows LLMs to dynamically adapt to various tasks, such as classifying products or evaluating user-generated content, by emulating the perspective and reasoning processes detailed in the contextual information.
A related technique, Chain-of-Thought (CoT) prompting, enhances the LLM’s reasoning capabilities by breaking down complex tasks into a sequence of simpler, intermediate steps (Li et al., 2024). CoT prompting guides the model to articulate its thought process in a structured manner, allowing it to handle tasks requiring multi-step, context-sensitive reasoning. Encouraging the model to explicitly consider each stage of the analysis improves its ability to generate contextually aware and nuanced responses (Ge et al., 2024). Together, in-context learning and CoT prompting could prove powerful methods for leveraging LLMs to analyze user perceptions, interpret product information, and generate design recommendations that align with user evaluations.
The research question this study addresses is: How can LLMs be leveraged to provide user-centered design recommendations for sustainable behavior by accurately adopting the perspective of target users? The methodology involves a two-step process. First, an LLM classifies product descriptions using in-context learning informed by user-generated insights. Second, the LLM’s classifications and reasoning are evaluated against human responses, and the sentiment and ratings of user reviews for these products are analyzed to determine the effectiveness of different design interventions. This approach assesses an LLM’s ability to adopt a user’s perspective and provides actionable design recommendations for sustainable behavior.
2. Methods
The product of interest for this case study was household thermostats. This study employed a multi-stage approach to classify thermostats. The process began with a user survey that collected qualitative data on participants’ perceptions of eco-friendly thermostats and analysis of design interventions for sustainable behavior. The survey responses were processed to filter out inconsistencies and extract key eco-design terms. These keywords were then applied to identify relevant user reviews from a dataset of Amazon user reviews, ensuring that the analysis of the reviews focused specifically on evaluations related to eco-design features.
Applying in-context learning with CoT prompting to simulate the thought processes of users, OpenAI’s GPT-4o was then tasked with classifying 196 thermostats based on their product descriptions. The classifications determined whether the thermostats were eco-friendly or not, and further categorized the design interventions as persuasive, decisive, or both. Finally, sentiment analysis of user reviews and comparisons of user ratings provided insights into the receptivity of these design interventions.
2.1. User survey
The purpose of this survey is to understand users’ perceptions and interpretations of product descriptions, specifically identifying the cues that signal product features designed for sustainable behavior. The goal is to uncover the keywords and phrases within these descriptions that lead users to perceive a product to be “eco-friendly,” as well as those that indicate the presence of persuasive or decisive design elements.
Terminology associated with sustainability and pro-environmental behavior is highly product-specific. There is no standardized corpus of terms that apply across different product categories to identify features enabling sustainable behavior (Telenko & Seepersad, 2010). Without a comprehensive understanding of product-specific terminology, identifying eco-friendly products within large datasets becomes prone to inaccuracies and misclassification.
Moreover, there is an anticipated gap between designers’ intentions and users’ interpretations of sustainable features (Qazi et al., 2014). Users may perceive and evaluate a product’s eco-friendliness differently than designers intend. Understanding how users interpret these cues through this survey is essential for accurately predicting their reception of and interactions with products designed for sustainable behavior. Therefore, this study situates the analysis from the users’ frame of reference to bridge this gap effectively.
2.1.1. Survey design and deployment
The survey, conducted online via Qualtrics, aimed to understand how users identify and interpret sustainable design features in product descriptions. The study was determined to be exempt by the MIT Institutional Review Board.
Fifty participants were recruited via Prolific, with eligibility criteria including being 18 years or older, residing in the United States, having English as a primary language, holding an Amazon Prime membership, and maintaining a 95% or higher approval rating for at least 50 prior studies. CAPTCHA verification was also used to protect the survey from automated bots. The survey took approximately 20 minutes to complete, and participants were compensated $5 USD for their time. The study was conducted in November 2024, and informed consent was obtained from all participants.
Participants were provided with definitions and examples of persuasive design and decisive design before beginning the survey. Persuasive design was defined as enabling or encouraging sustainable choices, while decisive design was defined as automating eco-friendly behavior. These concepts were illustrated with images of paper towel dispensers to clarify the distinctions.
Ten thermostats available on Amazon were selected to represent frequently purchased brands and a range of functionalities. Participants were provided with the product descriptions and images of these thermostats. For each thermostat, participants evaluated whether the product and its description conveyed eco-friendly attributes. They were then prompted to provide their written reasonings–including specific keywords, phrases, or features–for their interpretation. If a thermostat was identified as eco-friendly, participants rated its degree of eco-friendliness on a 5-point Likert Scale: (1) Not eco-friendly, (2) Slightly eco-friendly, (3) Moderately eco-friendly, (4) Very eco-friendly, (5) Extremely eco-friendly. They then classified the design interventions as persuasive, decisive, or both, and again provided written reasoning for their classification. From the survey, 830 evaluations were generated: 500 for eco-classification and 330 for design feature classification.
2.1.2. Processing survey data
The intention of collecting user evaluations and reasonings was to train an LLM on this data to perform the same evaluations. Therefore, processing the user survey data ensured that the classifications and reasonings provided by participants were consistent, interpretable, and non-conflicting.
First, a thermostat was filtered out of the dataset if its user evaluations were inconsistent. For each thermostat, the percentage of users classifying it as “eco-friendly” or “not eco-friendly” was calculated. To ensure reliability, only thermostats with at least 80% user agreement were retained in the dataset. Thermostats classified as “eco” also required a mean eco-friendliness score above 3 (indicating “Moderately eco-friendly”) on the 5-point Likert scale. Conversely, thermostats classified as “not eco” were retained if their mean eco-friendliness score was below 2 (indicating “Slightly eco-friendly”). After applying these criteria, four of the ten thermostats were filtered out. The final dataset, as seen in Tables 1 and 2, included the user evaluations for six thermostats: four classified as “eco-friendly” and two classified as “not eco-friendly.”
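The filtering criteria above can be illustrated with a minimal sketch. The data layout (a mapping from thermostat ID to a list of label and score pairs), the function name, and the convention that every evaluation carries a 1 to 5 score are assumptions made for illustration; they are not taken from the study's processing code.

```python
from statistics import mean

def filter_thermostats(evaluations, min_agreement=0.80,
                       eco_min_score=3.0, not_eco_max_score=2.0):
    """Keep thermostats whose user evaluations are consistent.

    evaluations: dict mapping a thermostat id to a list of
    (label, score) tuples, where label is "eco" or "not eco" and
    score is a 1-5 Likert rating (hypothetical layout).
    Returns a dict of id -> (majority_label, agreement, mean_score).
    """
    retained = {}
    for tid, votes in evaluations.items():
        n = len(votes)
        eco_count = sum(1 for label, _ in votes if label == "eco")
        majority = "eco" if eco_count * 2 >= n else "not eco"
        # Share of participants who chose the majority label.
        agreement = max(eco_count, n - eco_count) / n
        if agreement < min_agreement:
            continue  # evaluations too inconsistent
        avg = mean(score for _, score in votes)
        if majority == "eco" and avg > eco_min_score:
            retained[tid] = (majority, agreement, avg)
        elif majority == "not eco" and avg < not_eco_max_score:
            retained[tid] = (majority, agreement, avg)
    return retained
```

Under these assumptions, a thermostat with 90% "eco" votes and a mean score of 3.7 is retained, while one with 80% agreement but a mean score of 2.6 is dropped by the score threshold.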
Table 1. Summary of users’ classifications for “eco” thermostats that passed filtering criteria

Table 2. Summary of users’ classifications for “not eco” thermostats that passed filtering criteria

For the thermostats classified as “eco-friendly,” the next step was to determine their design intervention type: persuasive, decisive, or both. Classifications were assigned based on the following rules: if more than 40% of users agreed on an intervention, the thermostat was classified accordingly. If two labels each were selected by more than 40%, such as “both” and “persuasive,” the classification was recorded as “both” with a leaning toward the more dominant label. Of the four eco-friendly thermostats, the final classifications were persuasive, both, both (leaning decisive), and both (leaning persuasive), shown in Table 3.
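The labeling rule above can be sketched as a short function. The function name, the return of `None` when no label exceeds the threshold, and the choice to lean toward the most-selected label other than "both" are assumptions for illustration; the source does not specify how such edge cases were handled.

```python
from collections import Counter

def classify_intervention(votes, threshold=0.40):
    """Assign a design intervention label from participant votes.

    votes: list of strings, each "persuasive", "decisive", or "both".
    A label qualifies when strictly more than `threshold` of the
    participants selected it. At most two labels can qualify at 40%.
    """
    total = len(votes)
    shares = {label: count / total for label, count in Counter(votes).items()}
    above = {label: s for label, s in shares.items() if s > threshold}
    if not above:
        return None  # assumption: no label is assigned in this case
    if len(above) == 1:
        return next(iter(above))
    # Two labels qualify: record "both", leaning toward the more
    # selected specific label (ties resolve to the first one seen).
    specific = {label: s for label, s in above.items() if label != "both"}
    leaning = max(specific, key=specific.get)
    return f"both (leaning {leaning})"
```

For example, votes split 45% "both", 45% "persuasive", and 10% "decisive" yield "both (leaning persuasive)" under this sketch.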
Table 3. Summary of users’ behavior design intervention classifications for thermostats that passed filtering criteria

The processed dataset compiled the product’s name, description, user eco-classification, mean user eco-friendliness score, design feature classification, and all users’ written reasonings for classification of the six thermostats. This refined dataset provides a consistent foundation for the selected LLM, GPT-4o, to learn how to perform similar eco-classifications and design intervention categorizations.
2.1.3. Keyword identification
KeyBERT was applied to extract keywords from user reasonings for each classification. KeyBERT is a keyword extraction tool that leverages the BERT (Bidirectional Encoder Representations from Transformers) model to identify the most relevant keywords and key phrases within a piece of text (Issa et al., 2023). The tool generates vector embeddings for the input text and potential keywords and then calculates the cosine similarity between these embeddings. This similarity score indicates the contextual relevance of each keyword to the input text, with scores closer to 1 signifying stronger alignment. This process revealed key features that participants associated with eco-friendliness and the design interventions of persuasive and decisive products. Keywords with a high similarity score (close to 1) associated with persuasive design interventions included terms like “optional,” “usage reports and alerts,” and “incentives for participation.” In contrast, keywords for decisive design interventions included “automatic adjustments” and “default mode.”
2.2. Amazon reviews dataset
The dataset analyzed in this study is the Amazon Reviews’23 dataset from the McAuley Lab at UC San Diego (Hou et al., 2024). The intention of this phase of the methodology is to extend the survey’s baseline categorization of thermostats to larger datasets and also pair these thermostats with their associated reviews and ratings for additional analysis. This dataset comprises 571.54 million Amazon reviews and associated product metadata across various product categories. For this study, which focuses specifically on household thermostats, the relevant category is “Home and Kitchen.”
The dataset is split into two primary files: one for user reviews and another for product metadata. The review data includes key attributes such as user ratings (on a scale of 1 to 5 stars), the review title, and the full review text. The product metadata file contains each product’s name, category, description, and features.
2.2.1. Identifying thermostats in the dataset
The Amazon reviews dataset was filtered using several steps to identify thermostats designed for room or home temperature control. First, only reviews from verified users—those who purchased the product directly through Amazon—were included to ensure data reliability. Next, the dataset was narrowed by selecting products that specifically included the term “thermostat” in the product name.
To exclude irrelevant products, a secondary filter was applied to remove entries containing keywords associated with appliance-specific thermostats. The 27 excluded keywords included refrigerator, dish washer, dryer, freezer, stove, microwave, and oven. After applying these filters, the dataset was condensed to 196 thermostats, with a total of 4303 user reviews.
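The filtering steps described in this subsection can be sketched as follows. The record keys (`title`, `parent_asin`, `verified_purchase`) reflect field names as commonly documented for the dataset but should be treated as assumptions here, as should the function names and the use of simple substring matching. Only the seven exclusion keywords named in the text are listed; the remainder of the 27 are not given in the source.

```python
# Exclusion keywords named in the text (the full list has 27 entries).
EXCLUDED_KEYWORDS = (
    "refrigerator", "dish washer", "dryer", "freezer",
    "stove", "microwave", "oven",
)

def is_home_thermostat(title, excluded=EXCLUDED_KEYWORDS):
    """True if the product name mentions "thermostat" and none of the
    appliance-specific keywords. Matching is case-insensitive substring
    matching, so "oven" would also match a word such as "proven"."""
    name = title.lower()
    return "thermostat" in name and not any(k in name for k in excluded)

def select_reviews(products, reviews):
    """Keep verified-purchase reviews of products that pass the title filter.

    products: iterable of dicts with "parent_asin" and "title" keys.
    reviews: iterable of dicts with "parent_asin" and "verified_purchase" keys.
    """
    keep_ids = {p["parent_asin"] for p in products
                if is_home_thermostat(p["title"])}
    return [r for r in reviews
            if r["verified_purchase"] and r["parent_asin"] in keep_ids]
```

A stricter variant could match whole words with regular expressions to avoid accidental substring hits, at the cost of missing plural forms.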
2.2.2. Processing user reviews
KeyBERT-identified keywords derived from the user survey were used to refine the dataset by filtering for reviews containing these relevant keywords. This ensured the selected reviews specifically discussed features related to the design interventions of interest, persuasive and decisive design. After applying the keyword filters, 1892 user reviews remained for analysis.
2.3. LLM in-context learning and CoT prompting
GPT-4o was guided to classify thermostat descriptions using a combination of in-context learning and CoT prompting. GPT-4o was selected for this research due to its extensive training on diverse datasets, enabling it to capture patterns in users’ written reasonings and replicate natural language. To perform in-context learning, GPT-4o was provided with the processed dataset of user survey classifications and reasonings. These examples served as contextual references, enabling the model to learn how users reasoned through these classifications and apply similar patterns to new data.
Building on this foundation, CoT prompting was used to structure the model’s reasoning process into a sequence of intermediate steps. This step-by-step breakdown mirrored the thought process structured in the user survey. This prompting taught the model to justify its evaluations by referencing specific product features and design interventions in the same manner conducted by the users. The prompt instructed GPT-4o to first determine if a thermostat was eco-friendly or not based on the product description. The model then assigned an eco-friendliness score on the same 5-point Likert scale as in the user survey. If the product was classified as eco-friendly, the model proceeded to identify the design intervention type as persuasive, decisive, or both. Finally, GPT-4o generated detailed natural language reasoning for its classifications, articulating the logic and observations behind its decisions.
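One way to combine in-context examples with a step-by-step instruction is sketched below. The prompt wording, the dictionary keys, and the function names are hypothetical and are not the prompt used in this study; the sketch only mirrors the order of steps described above and does not call any model API.

```python
# Hypothetical instruction text that mirrors the steps described above.
INSTRUCTIONS = (
    "You are evaluating thermostat listings from a user's perspective.\n"
    "Read the product description, then work through these steps in order:\n"
    "Step 1: Decide whether the thermostat is eco-friendly or not eco-friendly.\n"
    "Step 2: Give an eco-friendliness score from 1 (Not eco-friendly) "
    "to 5 (Extremely eco-friendly).\n"
    "Step 3: If eco-friendly, classify the design intervention as "
    "persuasive, decisive, or both.\n"
    "Finally, explain your reasoning with reference to specific features."
)

def format_example(example):
    """Render one survey-derived example as a worked demonstration."""
    lines = [
        f"Product description: {example['description']}",
        f"Step 1 - Eco-classification: {example['eco_label']}",
        f"Step 2 - Eco-friendliness score (1-5): {example['eco_score']}",
    ]
    # The intervention step only applies to eco-friendly products.
    if example["eco_label"] == "eco-friendly":
        lines.append(f"Step 3 - Design intervention: {example['intervention']}")
    lines.append(f"Reasoning: {example['reasoning']}")
    return "\n".join(lines)

def build_prompt(examples, new_description):
    """Assemble instructions, worked examples, and the new description."""
    parts = [INSTRUCTIONS]
    parts += [format_example(e) for e in examples]
    # End on the first step so the model continues the same pattern.
    parts.append(f"Product description: {new_description}\n"
                 "Step 1 - Eco-classification:")
    return "\n\n".join(parts)
```

The resulting string would be sent to the model as the prompt; ending on the first step is one design choice for eliciting the same stepwise structure in the response.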
2.3.1. LLM thermostat evaluations
GPT-4o was tasked with evaluating the product descriptions of the 196 thermostats identified in the Amazon dataset. Out of the 196 thermostats evaluated, 102 were classified as “not eco,” while 94 were classified as “eco.” Among the eco-friendly thermostats, the model identified 31 as having persuasive design features, 19 as having decisive design features, and 44 as incorporating both persuasive and decisive elements. Tables 4 and 5 show selected reasonings from users and from GPT-4o.
Table 4. User and LLM reasoning for classification of thermostat 4 as “eco-friendly”

Table 5. User and LLM reasoning for classification of thermostat 8 as “persuasive”

2.3.2. Validation of LLM classifications and reasonings
To validate GPT-4o’s performance in classifying thermostats, two primary validation methods were employed: semantic similarity analysis using BERT and inter-rater reliability.
The cosine similarity heat map shown in Figure 1 demonstrates the semantic closeness of BERT embeddings between user-generated reasoning and GPT-4o-generated reasoning. BERT created contextual embeddings for the text-based reasonings. Cosine similarity was then calculated between user and GPT-4o embeddings. The resulting similarity matrix showed high semantic agreement between all embeddings, with all values being at least 0.81. Interestingly, the highest similarity scores (0.88) were found in comparisons of the reasonings for thermostats classified as “not eco” (5 and 10).

Figure 1. Cosine similarity of BERT embeddings between user-generated reasoning and LLM reasoning for eco-classification of six thermostats
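The similarity computation behind Figure 1 can be illustrated with a minimal sketch. Short numeric vectors stand in for the BERT embeddings, which are not reproduced here; the function names are illustrative assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(user_vectors, llm_vectors):
    """Rows index user reasonings, columns index LLM reasonings."""
    return [[cosine_similarity(u, v) for v in llm_vectors]
            for u in user_vectors]
```

With one embedding per thermostat from each source, the matrix for six thermostats would be 6 by 6, and its diagonal would compare user and LLM reasoning for the same product.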
Inter-rater reliability between GPT-4o’s classifications and those of human evaluators was also verified. Three types of inter-rater reliability were measured using Cohen’s Kappa (κ). In the first validation, GPT-4o’s eco-classifications for the 10 thermostats from the user survey matched the users’ classifications with perfect agreement (κ = 1.000). In the second validation, GPT-4o’s eco-classifications for 196 thermostats from the Amazon dataset were compared with an evaluator’s classifications of the thermostats, resulting in high agreement (κ = 0.8674). In the third validation for design intervention classifications, GPT-4o’s outputs for 94 eco-friendly thermostats were compared with an evaluator’s classifications (κ = 0.6382). These validation results demonstrate that this method for GPT-4o learning and prompting produced reasonings that closely align with users’ reasonings, both in terms of semantic content and classification outcomes, supporting the reliability of using GPT-4o for automated product evaluations.
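For reference, Cohen’s Kappa compares the observed agreement between two raters with the agreement expected by chance from each rater’s label frequencies. The sketch below is a minimal implementation of that standard definition; the function name and the handling of the degenerate case are assumptions, and it is not the code used in this study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equal-length lists of categorical labels."""
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_expected = sum(freq_a[l] * freq_b[l] for l in labels) / (n * n)
    if p_expected == 1:
        return 1.0  # assumption: both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)
```

As a small worked case, if two raters agree on three of four items with label frequencies of 2/2 and 1/3, the observed agreement is 0.75, the chance agreement is 0.5, and κ = 0.5.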
3. Results
Following the classification of Amazon thermostats, analysis of the thermostats’ associated user reviews was conducted to identify patterns in sentiment and satisfaction. Reviews were examined for thermostats under each classification to determine how users respond to each. This analysis aimed to uncover trends in user sentiment and ratings that reflect the effectiveness of different approaches to sustainable design.
To evaluate the users’ receptivity to various eco-designed thermostats, statistical methods were used to determine the significance of the differences observed. The T-statistic, or T-score, was used to measure the difference between the means of two groups, specifically comparing eco-friendly and non-eco-friendly classifications. This measure accounts for the size of the difference relative to the variation within each group. For comparisons involving three groups, such as for the design intervention classification, the F-statistic was used to evaluate the variance between these groups. The significance of these results was determined by the p-value.
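The two test statistics can be sketched as follows. The source does not state which form of the t-test was used, so the unequal-variance (Welch) form shown here is an assumption, as are the function names. The p-values require the corresponding t and F distributions and are omitted from this sketch.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """t-statistic for two independent samples without assuming equal
    variances (Welch's form; an assumption, see the lead-in)."""
    standard_error = math.sqrt(variance(group_a) / len(group_a)
                               + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error

def anova_f(*groups):
    """One-way ANOVA F-statistic: between-group mean square divided by
    within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

In practice, a statistics library would typically be used to obtain both the statistic and its p-value in one call.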
3.1. Aspect-based sentiment analysis
Aspect-based sentiment analysis was conducted to examine the sentiment expressed in user reviews of thermostats that reference sustainable design features. The analysis was conducted on the 1892 user reviews in which one or more of the identified eco-design keywords was present.
The sentiment polarity of phrases containing these keywords was calculated using TextBlob, a Python library designed for natural language processing tasks. TextBlob assigns polarity scores ranging from -1 to 1, where -1 indicates highly negative sentiment, 1 indicates highly positive sentiment, and 0 represents neutral sentiment (Diyasa et al., 2021). This tool leverages pre-trained models and lexicons to assess the sentiment of words and their context within sentences.
For eco-friendly versus non-eco-friendly thermostats, the T-statistic was -1.1383 with a p-value of 0.2564, indicating that the difference in sentiment was not statistically significant. For the evaluation of design interventions, the F-statistic was 0.0033 with a p-value of 0.9547, also showing no statistically significant difference in sentiment. These results, shown in Figure 2, suggest that user sentiment towards thermostats does not significantly differ based on eco-classifications or design intervention types.

Figure 2. Distribution of aspect-based sentiment analysis polarity scores for thermostat user reviews by (left) eco-classification and (right) behavior design intervention classification
3.2. User ratings analysis
User ratings for eco-friendly and non-eco-friendly thermostats, as well as for persuasive and decisive design interventions, were analyzed to identify differences in user satisfaction. The comparison between eco-friendly and non-eco-friendly thermostats resulted in a T-statistic of 1.3999 and a p-value of 0.1631, indicating that the difference in ratings between these two groups was not statistically significant.
However, as seen in Figure 3, a significant difference was observed when comparing persuasive and decisive thermostats. The T-statistic for this comparison was 3.0959, with a p-value of 0.0027, which indicates a statistically significant difference in user ratings. Persuasive thermostats received significantly higher user ratings than decisive thermostats. These findings suggest that users favor designs that allow them to exercise control over sustainable behavior rather than those that automate eco-friendly actions.

Figure 3. Distribution of thermostat product ratings by (left) eco-classification and (right) behavior design intervention
3.3. LLM design recommendations
Based on these evaluations of user sentiment and rating, GPT-4o was prompted to give design recommendations for thermostats to maximize user satisfaction. In the prompt, GPT-4o was first given a summary of the sentiment analysis and rating patterns associated with each thermostat classification. Through CoT prompting, GPT-4o was guided to break down the task into intermediate steps: interpreting user reviews, identifying recurring themes in positive reviews, and mapping these themes to specific design features.
The resulting recommendations focused on incorporating persuasive design elements. To enhance user satisfaction with persuasive design, thermostats should provide programmable settings and remote access, empowering users to actively optimize their energy usage. Integrating with utility programs, such as demand response initiatives, would allow users to receive incentives like rebates or rewards for participating in energy-saving activities. Offering detailed energy usage reports and timely alerts would educate users and help them make informed decisions, reinforcing sustainable behavior.
4. Discussion
This study demonstrates the potential of LLMs in user-centered design research, particularly for design for sustainable behavior. Through in-context learning and CoT prompting, LLMs can effectively adopt a user’s perspective, evaluate product information, and generate informed design recommendations. This capability provides designers with a scalable, efficient, and data-driven method for understanding user responses to design interventions. LLMs offer a promising complement to traditional user research methods by processing large volumes of user-generated content to identify patterns in user perceptions. Furthermore, the ability to simulate user reasoning allows LLMs to infer how target users are likely to interpret and respond to design features. This enables designers to leverage forms of user-generated content, like user reviews, to identify interventions that are more likely to be well-received and effective in promoting eco-friendly behaviors.
The preference for user-controlled, persuasive design interventions suggests that users are more likely to engage with and adopt sustainable behaviors when they feel a sense of agency and autonomy. Allowing users to make decisions regarding energy-efficient settings or eco-friendly features fosters a sense of ownership and responsibility, which can lead to more consistent and long-term engagement with sustainability goals. For example, thermostats with optional energy-saving modes or customizable schedules empower users to integrate energy efficiency into their routines without feeling restricted. This approach aligns with psychological theories of motivation, such as Self-Determination Theory, which posits that autonomy is a key factor in fostering intrinsic motivation and sustained behavioral change (Larson et al., 2015). Features such as energy usage reports, alerts, and real-time feedback can help users understand the environmental impact of their behaviors and make more informed decisions. When users receive actionable insights about their energy consumption, they are better equipped to adjust their behavior to align with sustainability goals. This feedback loop not only enhances user satisfaction but also promotes a learning process that reinforces eco-friendly habits over time.
Additionally, the preference for persuasive design interventions has implications for product marketing and communication strategies. Designers and marketers should emphasize the flexibility and user-centric nature of eco-friendly features in their product descriptions. Highlighting how a product empowers users to make sustainable choices can enhance its appeal, particularly among consumers who value autonomy.
5. Conclusions and future work
This study demonstrates that persuasive design interventions, which offer users autonomy and control over their behavior, are more positively received than decisive ones, which automate eco-friendly behaviors. The analysis of user-generated content, combined with the classification and reasoning capabilities of GPT-4o, reveals that users give higher ratings to thermostats that provide them with multiple options and incentives for choosing to behave sustainably. This preference for persuasive design underscores the importance of cognitive interventions, such as customizable settings, feedback, and incentives, that enhance user agency in sustainable practices.
The validation of GPT-4o’s evaluations against users’ evaluations also indicates that GPT-4o, through in-context learning and CoT prompting, can effectively take on the perspective of a user and provide design recommendations that align with user expectations. This methodology presents a scalable alternative to traditional user research methods, enabling designers to analyze large datasets of product descriptions and user reviews quickly and efficiently.
Several directions for future research emerge from these findings. Expanding the dataset to include a broader range of product categories beyond thermostats can provide more comprehensive insights into how persuasive and decisive design interventions perform across different contexts. This expansion would help determine whether user preferences for control are consistent across various types of eco-friendly products or if they vary based on the nature of the product and its specific use case. For example, products like smart home devices or water-efficient appliances may elicit different responses based on how users interact with them.
In addition, comparing the performance of GPT-4o with other LLMs could further refine this methodology. Such comparisons can highlight the strengths and weaknesses of various models in capturing user perspectives, identifying effective cognitive interventions, and generating actionable design recommendations. This investigation could also reveal areas for improvement in prompt engineering, model training, and the contextual understanding of sustainable design principles.
Acknowledgments
The authors would like to express their gratitude to Professor Faez Ahmed for his teaching in AI and Machine Learning for Engineering Design. We also thank Qihao Zhu for his insightful contributions to the dataset processing methodology.