Evaluating large-language-model chatbots to engage communities in large-scale design projects

Abstract Recent advances in machine learning have enabled computers to converse with humans meaningfully. In this study, we propose using this technology to facilitate design conversations in large-scale urban development projects by creating chatbot systems that can automate and streamline information exchange between stakeholders and designers. To this end, we developed and evaluated a proof-of-concept chatbot system that can perform design conversations on a specific construction project and convert those conversations into a list of requirements. Next, in an experiment with 56 participants, we compared the chatbot system to a regular online survey, focusing on user satisfaction and the quality and quantity of collected information. The results revealed that, with regard to user satisfaction, the participants preferred the chatbot experience to a regular survey. However, we found that chatbot conversations produced more data than the survey, with a similar rate of novel ideas but fewer themes. Our findings provide robust evidence that chatbots can be effectively used for design discussions in large-scale design projects and offer a user-friendly experience that can help to engage people in the design process. Based on this evidence, by providing a space for meaningful conversations between stakeholders and expanding the reach of design projects, the use of chatbot systems in interactive design systems can potentially improve design processes and their outcomes.


Introduction
Recent advances in machine learning (ML) have enabled computers to converse with humans in meaningful ways (Brown et al., 2020;OpenAI, 2023).In this paper, we argue that this technology can potentially revolutionize the design process, particularly in large-scale urban development projects.Traditionally, in small architecture projects, there is a negotiation between a client (the end user) and the architect; during this negotiation, information is exchanged, thereby facilitating the progression of reflection into an agreement among all stakeholders (McDonnell, 2009;Oak, 2009).However, in large-scale projects, the client is typically a governmental agency or developer who is not the end user of the constructed buildings, making it extremely challenging to engage in meaningful conversations with thousands of potential stakeholders who are the end users.
Since the mid-20th century, such urban development projects have faced extensive criticism due to their disconnect with end users, often resulting in underperforming designs (Alexander, 1964), the destruction of thriving neighborhoods (Jacobs, 1961), and a lack of inclusivity and democracy (Harvey, 1973), which affects marginalized communities (Arnstein, 1969).This issue has become increasingly relevant as cities continue to densify, and urban renewal projects significantly impact various aspects of urban life, including social, economic, and environmental factors.In response to this growing need, numerous participatory design methods have emerged since the 1970s (Simonsen and Robertson, 2012), aiming to incorporate diverse perspectives in architectural projects (Luck, 2018).By involving end users in the design process, these methods foster a more comprehensive approach to urban development, better suited to address the complex challenges faced by contemporary cities.
However, despite the widespread agreement on the significance of community participation in fostering sustainable development, promoting democratic culture, and creating equitable communities (Münster et al., 2017;Calderon, 2020), the practical implementation of participatory design in urban design remains challenging.Various factors have hindered effective public participation, including intra-community politics and power dynamics (Krüger et al., 2019), bureaucratic obstacles and red tape (Brabham, 2009), knowledge gaps between experts and community members (Dortheimer and Margalit, 2020), and a pervasive lack of public trust in politicians and local authorities (Giering, 2011).Additionally, the considerable time and effort required from participants and organizers can impede the successful execution of participatory design processes.
In response to these challenges, researchers and practitioners have explored various strategies to enhance participatory design by leveraging digital tools to enable more accessible and inclusive participation (Luck, 2018).The emergence of crowdsourcing technologies has significantly strengthened the trend toward participatory design in recent years (Robertson and Simonsen, 2012;Gooch et al., 2018, Dortheimer et al., 2020).These technologies facilitate individual communication, thereby relieving political pressure on participants and allowing them to express their opinions freely and, when necessary, anonymously (Dortheimer et al., 2023).Numerous studies have examined the application of crowdsourcing in architecture and urban design, encompassing ideation (Lu et al., 2018), architectural design (Dortheimer, 2022), co-creation (Mueller et al., 2018;Hofmann et al., 2020), mapping (Borges et al., 2015), and opinion-gathering (Hosio et al., 2015;Wang et al., 2021).However, survey-like methods remain the primary method for collecting information on an urban scale.
Unlike surveys, conversations can be more open and flexible, fostering an environment that encourages stakeholders to reflect upon and share their ideas and experiences.This interactive approach to data collection allows for a deeper understanding of the participants' perspectives and emotions.
Therefore, we argue that chatbots have the potential to revolutionize the field of urban design by facilitating meaningful conversations with a diverse array of stakeholders.Some studies have indicated that chatbots may be more effective for gathering respondents' information than traditional surveys (te Pas et al., 2020;Xiao et al., 2020a,b).Furthermore, the engaging nature of chatbot-facilitated conversations can help maintain participants' interest and make the interaction more enjoyable.Finally, adopting chatbots in urban design can lead to more comprehensive and accurate insights, driving more effective decision-making and resulting in better-designed urban spaces that cater to the needs and preferences of all stakeholders.
Implementing chatbots in participatory urban design projects can lead to an overwhelming amount of conversational data, which poses a significant challenge for human designers.Consequently, it is essential to develop a comprehensive framework that not only streamlines chatbots' effective communication with stakeholders but also enhances the efficient analysis and summarization of the vast data collected from these interactions.By doing so, this framework would transform chatbots into practical and valuable tools in participatory urban design processes, equipping urban planners with meaningful and actionable insights while effectively managing communication and data volume.
Furthermore, there is a limited understanding of the differences in the quality and quantity of information gathered from such chatbot frameworks compared to traditional surveys in the context of urban design.This research gap warrants a comprehensive examination to inform better the implementation of chatbot interventions in urban planning processes.
Consequently, this research aims to investigate the use of chatbots within the realm of participatory urban design and assess their efficacy compared to conventional survey methods.The research question addressed in this study is as follows: "What are the differences between using a chatbot framework and surveys to collect information and design ideas in the context of urban design?" To answer the research question, we develop and test a chatbot framework capable of performing design conversations and summarizing such conversations into design requirements.Next, in an experiment with 56 participants, we compared the chatbot framework to a traditional online survey, focusing on the quality and quantity of collected information.
The contributions of this paper are as follows: 1) We present a novel chatbot system and human-artificial intelligence (AI) prompt framework for initiating and managing design conversations in large-scale urban design projects, contributing to the emerging field of AI-assisted participatory design; 2) we perform a comprehensive experiment involving 56 participants, comparing the chatbot framework with traditional online surveys in the context of participatory urban design that offers insight into the differences between chatbot and survey outputs regarding the quality and quantity of information collected in the context of urban design, addressing the current research gap; and 3) we propose recommendations for effectively using the chatbot framework in participatory urban design, based on the findings and observations from the experiment, to facilitate more productive and engaging interactions between stakeholders and designers.

Related works
In this section, we provide an overview of the related work in the areas of conversation in the design process, chatbots, design, and the large language models (LLMs) to establish the context for our study.

Participatory design
Proponents of participatory design maintain that a more suitable design fit can be achieved when end users actively participate in the design process (Reich et al., 1996).This approach brings together individuals with diverse backgrounds, roles, and expertise to collaboratively examine a problem and collectively generate potential solutions.The concept of incorporating local residents into urban and architectural design planning has been in practice for over five decades (Luck, 2018).Generally, urban planning research posits that community involvement fosters democratic values and equitable communities, making it a vital component of sustainable development (Münster et al., 2017;Calderon, 2020).

The role of conversation in the design process
While design frequently involves working with visual representations, verbal communication plays a crucial role in the design process.Verbal conversations with various stakeholders, such as clients, contractors, and community representatives, allow designers to understand their needs (Lawson and Loke, 1997;Dubberly and Pangaro, 2019) and constraints better and negotiate project requirements (McDonnell, 2009).Similarly, conversations among designers facilitate knowledge sharing and negotiation of new ideas, allowing design teams to transcend the individual abilities of a single designer (Arias et al., 2000).Conversations with end users and community members help explicate their needs and concerns about a design project, making tacit knowledge explicit from the user's perspective (Luck, 2003).Although verbal communication is essential, it is important to consider its challenges and limitations, such as potential misunderstandings and difficulties translating verbal ideas into concrete design elements (Karlgren and Ramberg, 2012).In early architectural discussions between designers and clients, clients often concentrate on familiar functional and structural aspects of building designs, while designers seek to identify problems and understand the design's significance to the client (Luck and McDonnell, 2006).Understanding how project stakeholders communicate verbally to express their needs is a valuable insight for developing effective design chatbots.

Chatbots and the automation of conversation
The history of chatbots dates back to the first chatbot, ELIZA, developed between 1964 and 1966, which was based on a language model that identified keywords and, following a set of rules, provided a response (Weizenbaum, 1983).Over the years, chatbots have evolved significantly, owing to advancements in ML and AI.There are now three primary types of chatbots: rulebased, retrieval-based, and generative models (Hussain et al., 2019).Rule-based chatbots function based on a predefined set of rules, while retrieval-based chatbots choose responses from a pre-built database.However, both are limited by linguistic knowledge hard-coded into their software (Shawar and Atwell, 2005).Therefore, generative models powered by ML techniques have shown the most progress in recent years since these models can construct novel responses, adapting better to various conversational situations.
Overall, building a chatbot that can understand complex conversations and answer appropriately was reported to be a challenging task (Xiao et al., 2020a).However, modern chatbots such as Apple's Siri or Amazon's Echo leverage ML techniques to create LLMs and web search results to produce meaningful responses.
At the time we completed the work described here, GPT-3 was the largest publicly available LLM that produces human-like text (Brown et al., 2020).The model includes 175 billion parameters and produces high-quality texts.The model generates texts based on a provided text prompt.For instance, if a prompt is the beginning of a story, the model would try to predict the continuation of that story.Several previous applications demonstrated that this model can be meaningfully used for grammar correction, summarizing, answering questions, parsing unstructured data, classification, and translation (Radford et al., 2019;Brown et al., 2020), among other tasks.LLMs have also been used to build chatbots to capture self-reported user data chatting (Wei et al., 2023).However, since the model cannot reason, solve mathematical and ethical questions, or pass the Turing test (Floridi and Chiriatti, 2020), it is regarded as a human-like text generator rather than a general AI in the strict sense.
To build useful chatbots, an LLM must be provided with a welldesigned prompt that can steer the LLM to generate topic-relevant text.Designing these prompts, as described by Zamfirescu-Pereira et al., is akin to "herding AI cats" due to the unpredictable nature of LLMs (Zamfirescu-Pereira et al., 2023).The challenges in crafting good prompts are manifold.For instance, adding a new instruction to repair specific issues found using a previous prompt might unpredictably affect other instructions.In addition, the model may generate "hallucinations," which are instances of fabricated information.Furthermore, the GPT-3 model could not acknowledge that it did not know some information.Similarly, a study on prompting generative image models for design highlighted the unpredictability and challenges of using these models (Dortheimer et al., 2023b).The impact of prompts on model outputs and prompting techniques are active areas of natural language processing research (White et al., 2023).

Chatbots for creativity and design
Several previous studies have utilized human-chatbot interactions to generate creative ideas in various design fields (Kulcke, 2018;Cuadra et al., 2021;Shin et al., 2022).For instance, a notable study investigated how humans converse with a perceived AI during a Wizard of Oz study where designers prototyped speech interaction with music systems (Martelaro et al., 2020).Other studies have explored human-chatbot interactions in spatial design (Kulcke, 2018;Dortheimer et al., 2023) and ornament design (Cuadra et al., 2021).Additionally, chatbots have been employed to mediate consensus-building conversations (Shin et al., 2022).
Moreover, numerous studies have focused on the development (Ahmed, 2019) and evaluation (Tavanapour and Bittner, 2018;Hwang and Won, 2021) of chatbots for ideation tasks.Chatbots have also been utilized to facilitate design thinking through the empathy map method (Bittner and Shoury, 2019).An intriguing example is CharacterChat, a chatbot designed to assist writers in creating fictional characters (Schmitt and Buschek, 2021).However, to the best of the authors' knowledge, no studies have specifically examined the use of chatbots in urban design tasks.

Comparisons of chatbots and surveys
Another challenge in human-chatbot communication is the human behavior and understanding that needs to be taken into account (Nguyen et al., 2022).First, compared to human-human interaction, human-chatbot communication was reported to take longer and include shorter messages (Hill et al., 2015).In addition, there is evidence that human-chatbot messages lack vocabulary richness and can contain profane words (Hill et al., 2015).However, compared to web surveys with open-ended questions, chatbots were reported to produce longer and richer responses from humans (Xiao et al., 2020b).Interestingly, some evidence suggests that humans can generate more or better-quality ideas when communicating with a chatbot rather than with a human partner (Hwang and Won, 2021).
Furthermore, with regard to effectiveness in eliciting information from respondents, several previous reports noted that chatbot interfaces could be more effective (Xiao et al., 2020a,b) and preferable (te Pas et al., 2020) than surveys.A reason that could underlie this finding is that there is evidence suggesting that people are more willing to share information through chatbots (Lee et al., 2020), which are generally believed to be useful for collaboration (Kim et al., 2021).Together with LLMs such as GPT-3, recent research has explored the usability of these models in operating chatbots (Wei et al., 2023).In line with this novel technology, in the present study, we empirically test a chatbot framework that may be more effective than survey methods for eliciting meaningful responses from participants in the context of urban design.

Chatbot design and development
We developed a chatbot system with dual functionalities to discuss an urban design project.The first function involves conversing with users to gather their responses and insights about the project.The second function is to analyze and extract a set of design requirements from these conversations.
The chatbot design and development process can be broken down into several stages.Initially, we experimented with a "mock" chatbot to explore design conversations.Next, we constructed a prototype utilizing an LLM for the chatbot system.Finally, we enhanced the chatbot's performance through a series of experiments and testing.

Exploring design conversations
We developed a chatbot prototype to investigate the ideal structure of human-chatbot design conversations.In order to achieve this, we conducted a "Wizard of Oz experiment" wherein a human subject engaged in a discussion about an architecture project of their choice with a human-operated chatbot.
A chat system was created to enable the participants to converse with a human architect, as depicted in Figure 1a.The chatbot was operated by a certified and experienced architect, who was instructed to engage in a natural conversation with the participants.Each chatbot message was automatically converted into audio using text-to-speech technology and played when the message was displayed to the user, facilitating a more authentic interaction.The chat was initiated with a predefined user prompt introducing the chatbot: "I'm a design bot and would love to speak with you about a design project of your choice!"The conversations were recorded in a database.The chat logs were then summarized to create a set of design requirements for the design project.The participant would then rate these design requirements, as shown in Figure 1b.Participants were later interviewed to gather their experiences and suggestions for improving the dialogue.Finally, the recorded conversations were thoroughly analyzed.

Preliminary experiment
After the initial test with the Wizard of Oz controlled chatbot, we implemented a preliminary chatbot using the GPT-3 LLM (textdavinci-001) as the backend of the application instead of the human operator (see the appendix section "Final chatbot implementation" for further detail).Since GPT-3 did not know when the conversation was over and always produced new responses to user inputs, we added a "finish chat" button to allow the participants to conclude the discussion.Alternatively, the chat automatically ended after exchanging 50 messages.Then, the discussion log was automatically summarized into a list of design requirements that resulted from the discussion transcript.The participants then rated the correctness of each requirement on a five-point Likert-type scale.Finally, the process concluded when the participants provided requirement evaluations.
In order to test the new chatbot's performance, we conducted a preliminary experiment where participants (n = 51) were asked to discuss an architecture project of their choice with the bot, rate the automatically produced requirement list, and answer a user experience survey.The participants were students who used the chatbot during a design class at a university.Based on the results of this test, the chatbot was improved and finetuned to be used later in the controlled experiment.In addition, the chatbot LLM was updated to the newer "text-davinci-002" for the conversations and "text-curie-001" for design requirement summarizing.

Human-AI prompt framework
The present study generated chatbot responses using the improved "text-davinci-002" GPT-3 LLM.The request to the LLM included multiple parameters, including a "text prompt" that the model would use as input to predict how the text would continue.The text prompt is the foundation for the chatbot's ability to comprehend and respond to user inputs in a meaningful and contextually relevant manner.It is the most critical element, as the quality and relevance of the chatbot's responses heavily depend on the information and context provided by the prompt.
However, a part of the prompt influences how the human engages with the chatbot since it sets the stage for their expectations from the conversation.Designing the optimal prompt is challenging, given the nuances of human language and the need to account for numerous conversational scenarios.Consequently, extensive testing and fine-tuning are necessary to identify the appropriate balance of context, specificity, and flexibility to achieve the desired performance (Zamfirescu-Pereira et al., 2023).
Our chatbot prompt contained the context of the conversation, including a specification of whether it was a conversation between an architect and a client, some character descriptions, and conversion goals.To design a chatbot operated solely by an LLM, we used the terms internal prompt and shared prompt to describe the structure of the "prompt" as the input of the LLM.The key difference between the two prompts was that the shared prompt also acted as a user prompt to provide the human with a shared understanding of the design situation and expectations from the conversation (see Fig. 2).Our chatbot implementation can be viewed in the appendix section "Final chatbot implementation." The internal prompt includes the definitions of the human and bot personas and the technical context of the conversation provided solely to the LLM.In our case, we defined the chatbot first as an architect conversing with a client.In the first experiments, we noticed that the chatbot sometimes stopped discussing the design, asked to show site photos, presented non-existing sketches, or negotiated for payment and budget.
In order to address the observed shortcomings, we modified the prompt by specifying that it involves a "conversation between an architect and her client."We also added to the prompt that the conversation focuses on aspects such as aesthetics, functional elements, and social preferences of the design.
Additionally, we noted that providing "The architect is very kind and professional" in the prompt led the chatbot to agree with most of the participants' suggestions without further discussion.In order to generate more engaging and reflective conversations, we replaced the prompt as mentioned above with "The architect is challenging the client with questions to gain a deeper mutual understanding of the requirements."This adjustment helped make the chatbot more critical without dismissing the participants' ideas.
The shared prompt was also improved between the experiments and included the names of the chatbot and the human, relevant project information, and the appropriate context of the conversation.We started with a generic prompt, "Hi, I am your architect.Let us discuss your architecture project.I would like to know what kind of project you had in mind?"This allowed the participants to discuss any architecture project, which resulted in different kinds of architecture project conversations with various qualities.
We added personal attributions to the shared prompt by asking the participants to provide their names before the chat started.We chose the chatbot to be female and named it Zaha, in reference to the late influential architect Zaha Hadid.To reduce the risk of conversations about business issues, we stated that the chatbot is part of a design team.This resulted in the following shared prompt "Hello username, my name is Zaha, and I am on the design team of the project."Next, we wanted the chatbot to discuss a single project with several participants to investigate how to aggregate several conversations.To this end, we outlined a specific construction project the participants would be familiar witha new architecture school building on campus.The project was presented with the following shared prompt: "The (Technical University of Munich) is planning to build a new architecture school instead of the outdated (electrical facility on Theresienstrasse).The building will host the architecture school, and the design will be based on the preferences of the students and faculty.That's why we want to ask you about your ideas for the new building." Finally, we defined the design conversations' scope and goals with the criteria we thought were important to address.These included the following: functionality, aesthetics, cultural values, and the desired social effect.This was done with the following shared prompt: "Let's discuss the project requirements.What spaces should be in the building?How can the building have a positive impact on the community and the environment?What should the building look like?What values should the building express?"We found that the detailed shared and internal prompts produced better conversations that were more focused and produced higher-quality chatbot texts that helped the chatbot keep the discussion on topic and ask relevant questions.
In summary, the internal and shared prompt structure was a practical, functional approach to designing communication about urban design between LLM-based chatbots and humans.

Method
Upon chatbot improvement, we conducted a controlled experiment to answer our research question ("What are the differences between using a chatbot framework and using surveys to collect information and design ideas in the context of urban design?").The experiment was designed to evaluate the chatbot's performance as a design requirement collection tool and compare its performance to a webbased survey.
As mentioned, we outlined a hypothetical urban architecture project as a test case, which involved constructing a new architecture school in place of a historic building (in Munich).The project's context, situated within an existing neighborhood and incorporating public spaces, added complexity and relevance to the experiment.We chose this scenario to present a realistic and multifaceted design challenge, requiring participants to consider various factors and constraints in a familiar setting.The nature of this design project was both demanding and engaging for our participants, enabling us to assess the chatbot's performance more effectively.
The new school building had to offer the most suitable environment for studying and working, enrich the university campus with the best quality and sustainable architecture in mind, serve around 1,400 architecture students, and provide them with various spaces such as studios, lecture halls, and so forth.These constraints were based on the current architecture school needs.By providing a realistic design problem, we aimed to create an environment where participants would be more likely to engage with the experiment.
The study participants were divided into the experimental and control groups (see the "Participants" section for further detail).The experimental group had conversations with the chatbot.The control group completed a web-based survey derived from a realistic urban public participation project survey.The adapted survey was built using Google Forms and included the following questions: • What are your hopes for this project?
• What are your concerns about this project?
• How can the building be more sustainable and have a positive effect on the environment?• Which values should the new building design express?
• How can the building have a positive impact on the community and society?
The same questions were also provided in the chatbot's internal prompt so that it could discuss these questions with the study participants.

Participants
We recruited students and faculty from the architecture department.All participants were stakeholders of our hypothetical project, possessed learning or teaching experiences, and had a good knowledge of the existing facility.The experiment group consisted of 35 participants; the control group had 21 participants.No compensation was offered.
In the experimental group, there were 12 participants aged 18-24 and 13 participants aged 25-34.Unfortunately, the remaining participants did not complete the user experience survey, so we lack information about their age, gender, and education.Concerning gender, 10 participants were male, 13 were female, and two indicated "other" as gender.As concerns educational attainment, eight of the study participants in the experimental group had a master's degree, while 14 had a bachelor's degree, and three were bachelor's students at the time when the study was conducted.
In the control group, which consisted of 21 participants, all except one were students, with the remaining participant being a faculty member.Regarding age, eight respondents were 18-24, while 13 were 25-35 years old.Furthermore, most of the participants (N = 12) were male, while eight were female, and one indicated "other" as gender.Finally, regarding educational attainment, six respondents were bachelor's students, 12 had a bachelor's degree, and three held a master's degree.

Analyzing and evaluating chatbot responses
We evaluated the chatbot performance by manually analyzing the conversation transcripts in terms of logic, behavior, and accuracy of the produced requirement list, inspired by Han et al. (2021).This analysis encompassed several stages: response quality, user experience, and ethical considerations.The quality of responses was assessed through a detailed text analysis, which will be elaborated upon later.User experience was measured using survey data, while ethical considerations were evaluated by identifying any potentially problematic chatbot messages containing hate speech or privacyintrusive questions.
First, we checked whether the chatbot messages were coherent and relevant in the context of the conversation.A chatbot that does not produce logically sound responses is more likely to be perceived as unreliable by users.To this end, we differentiated between the following three possible levels: "the message makes perfect sense" (level 2), "the message makes sense but is slightly wrong" (level 1), and "the message is a mistake" (level 0).
To illustrate, consider the following example of a level 0 response in a discussion where the participant specifically asked to design a fire station with a parking space for three fire trucks: H: "Okay, but I asked for space for just 3 trucks" C: "But you said 6 trucks" The next example shows a level 1 response when the chatbot recommended cladding the walls with basalt while the participant wanted a white bedroom: H: "but basalt is black and I like white, maybe is not such a good idea?"C: "I recommend you to paint the basalt white."Finally, consider an example of a level 2 response, which is a response that makes perfect sense: Human: "What about taking marble or other material that originally is white?" Chatbot: "Marble is a great idea."Next, in our analysis of conversation dynamics, we sought to identify when the chatbot discussed a specific topic and when it changed it.In doing so, we aimed to identify when these events happened and whether they were in the proper context of the conversation.Accordingly, we manually tagged the chatbot responses with labels "changing subject," "reflecting," and "suggesting ideas" when responses made sense.
To illustrate, a reflective conversation was assumed to occur when the chatbot asked a relevant question in response to a user's idea, which could cause the user to reflect on the idea and develop it.For example: H: "Hey, I would like a building with community space" C: "Excellent, we were thinking the same thing.What kind of community space?" When the chatbot continued asking repetitive questions or fixated on a topic instead of changing the conversation subject, we marked such responses as "repetition."For example: C: "What about the other rooms?"H: "There are no other rooms" C: "How would you like to use the rooms?"H: "There are no other rooms" Finally, we tallied and assessed the design requirements generated by the chatbot and those from the web-based survey.To evaluate the accuracy of the design briefs summarizing the design requirements created by the chatbot, we first matched each conversation with its corresponding design brief.We then labeled each accurately represented design brief item as "correct."Conversely, items that failed to express a design requirement were labeled as "incorrect" or "invalid" if they were technically flawed.Lastly, we pinpointed any absent design brief items and marked them as "missing."

Comparison between chatbot and survey
The comparison between two different methodschatbots and web-based surveyswas challenging since these methods produce different kinds of information.The following three metrics were used to analyze the performance of both methods.
The first metric was the number of words in an interaction, which was taken to attest to the quantity of data produced.The second metric was the number of themes the participant mentioned, which was done by manually analyzing the text and identifying the themes.The third metric was novel ideas.Innovative and novel ideas hold a significant value that can elevate the design process, allowing designers to explore new perspectives and approaches.
Novel ideas were identified by comparing the participant's suggestions to a list of existing concepts.These ideas were mentioned in the provided project introduction document to the participants or popular themes, for example, producing a suitable environment for studying, quality of architecture, sustainable construction, availability of studios, lecture halls, facilitating idea exchange or creativity, shared spaces, and accessibility issues.To these themes, we added a list of themes that were most frequently mentioned in similar projects or academic discourse but not in the project introduction document.Such themes included having a public space, facilitating connections to the community, generic sustainability ideas (e.g., recycling, solar power, water preserving, wood construction, green facades or roofs), simplicity, efficiency, cafe, coworking spaces, colors, and ordinary materials.
The statistical analysis of the data was conducted using the jStat statistical library, implementing Welch's t-test.
Once the experimental and control groups completed the experiment, they filled out a user experience survey that contained 10 questions about their recent experience.The questions included in the study were taken from Ashfaq et al.'s (2020) subset relevant to the chatbot experiment and web surveys (see Table A1 in the appendix section "Survey questions").The items in the survey were rated on a five-point Likert scale.For further analysis, an average score was computed for each survey item.Finally, the participants were asked to share their qualitative opinions about using the chatbot or survey.

Results
In the experiment, 751 messages were collected in 35 conversations.We provide example chatbot and survey output in the appendix sections "Chatbot conversation example" and "Survey response example."The participants produced 377 messages, and the chatbot produced 374, with an average of 21.45 messages per conversation.The average interaction duration was 9.75 minutes (min = 2.25, max = 24.74,SD 6.1).The chatbot generated 145 design requirements from 26 conversations since some participants did not click on the "finish chat" button or did not reach 50 messages.Twenty-five participants filled out the user experience survey.We decided to keep the conversation transcripts of the participants who did not complete the user experience survey since they are valuable to the analysis and might contain failed conversations.However, the analysis did not reveal chatbot conversation failures.

Conversation quality
The summary of our conversation analysis results, including the chatbot and human participants' evaluation of the preliminary and controlled experiment, is provided in Table 1. Figure 3 shows the chatbot's human word count per message distribution compared to the survey method.The chatbot produced responses of an average length of 24.56 words (SD 22.11), with the longest response being 134 words long.By contrast, the participants' responses to the chatbot produced significantly fewer words in each message, with an average length of 10.51 words (SD 15.42).This result demonstrates a clear LLM performance difference from our preliminary experiments, where the chatbot produced shorter responses (M = 10.83,SD 8.41), which caused the participants to respond with shorter messages (M = 4.62,SD 4.68).
With regard to the qualitative aspects of the analyzed conversations, of a total of 374 messages, 297 were marked as most comprehensive (level 2, 86.84%), 34 were marked as lightly flawed (level 1, 9.94%), and only 11 did not make sense (level 0, 3.22%) (see Table 2).
This improvement between the preliminary and the controlled experiments can be attributed to the enhancement in the LLM and prompt, which highlights the crucial role of coherent and logical responses in fostering effective interaction.However, there is still a need for improvement in the LLM's performance.
In their responses to the chatbot, the participants produced an average of 117.91 words in each conversation (SD 75.09), with a maximum of 350 words.The control group respondents who filled out the survey produced considerably fewer words per responsean average of 59.04 words for each survey record, with a maximum of 175 words (SD 35.78).The results show that the chatbot elicited twice as much data from the participants as the survey.

Behavior analysis
The subsequent behavior analysis indicated that the chatbot changed the conversation topic by asking a different question 41.44% of the time (see Table 3).However, in 5.26% of the messages, the chatbot demonstrated a reflecting behavior and asked the participants about their ideas.In 5.14% of the messages, the chatbot repeated the previous message, albeit with some (minor) changes.Finally, the chatbot did not produce any messages that could be considered hate speech, nor did it commit privacy intrusion.

Themes and novel ideas
The study found no significant difference in the topics generated through the chatbot or survey, t(38) = 0.49, p = .062,although the survey (M = 6.90, SD = 1.86) produced more topics on average than the chatbot (M = 4.82, SD = 2.49).Similarly, there was no significant difference in generating novel ideas between the two methods (t(38) = 0.36, p = .071).The difference was not statistically significant despite surveys (M = 0.43, SD = 0.62) yielding a little more novel ideas on average than chatbots (M = 0.36, SD = 0.70).These findings suggest that the choice of method, whether chatbot or survey, does not significantly impact the number of topics or novel ideas generated.
Several survey response examples are provided below: • "Modern, sustainable architecture" • "Opening up to the public and creating spaces for idea exchange and collaboration" • "Space for experimentation, generous connection to public space, space for experimentation" Below is an example of a chatbot conversation: H: "well, I think that the spaces should be focused on interaction and collaboration… so the biggest amount of exchange can take place" C: "Excellent, so you want the spaces in the building to facilitate interaction and collaboration.Do you have any ideas about how we can achieve that?" H: "well it could be achieved through different methods.One, for example, could be the relationship between public and private spaces -leaning more to the public side.Another method could be through the typology of the floorplan and the construction methods implemented -meaning to have a high degree of "permeability" in the architectural concept" C: "That makes sense.Are there any other values that you feel are important for the building to express?" The findings can be attributed to the fact that participants can contemplate and offer well-considered responses in the form of a list during a survey.Furthermore, the interactive nature of a chatbot may result in more impulsive responses, potentially generating fewer ideas.Additionally, as demonstrated in the above example, chatbot conversations tend to be lengthier and demand more effort.Nevertheless, chatbot discussions yielded many ideas, exhibiting a comparable novelty rate, which is promising as chatbot technology improves.

Design brief analysis
The chatbot generated 26 valid design brief lists with 140 list items.The design briefs contained an average of 5.38 list items per discussion (SD = 2.49, min = 3, max = 15).No relationship was observed between the length of conversations and the number of design brief items.Accordingly, a more extended discussion did not translate into more ideas and topics for the subsequent generation of design requirements.Next, to measure the system's performance, we manually analyzed conversation transcripts and corresponding design briefs (see Table 4).As a result, in addition to 140 requirements captured by the chatbot, we identified 30 further requirements and preferences that were overlooked by the chatbot.Furthermore, out of 140 list items, 130 (92.85%) were correct, 10 were incorrect (7.15%), and 43 were invalid (12.2%).Most invalid items were texts that were generated by the chatbot or fragmented sentences.

User experience
As mentioned earlier, the survey was conducted in both the experimental and control groups to evaluate service quality, enjoyment, usefulness, and ease of use of the chatbot or the survey.
The results of the comparative analysis between the chatbot and survey methods demonstrated that participants rated the user experience of the chatbot more positively than the survey (see Fig. 4).Participants reported a higher level of enjoyment when interacting with the chatbot than when completing the survey.They also perceived the chatbot's service quality to be superior.Furthermore, the chatbot was deemed considerably easier to use, and satisfaction ratings were higher for the chatbot experience.Interestingly, the perceived usefulness and continuance intention were similar for both methods.
In conclusion, these findings suggest that the chatbot provides a more engaging and satisfying user experience than traditional surveys, and may encourage increased digital public participation.

User feedback
According to the results of the user feedback survey, most participants found the chatbot system to be a valuable tool for data collection in the early stages of the project.The participants appreciated the chatbot's conversational nature, which made it easier for them to understand and express their opinions.
However, two participants mentioned that they preferred merely writing down their ideas instead of having a longer conversation.Furthermore, some other participants noted that the chatbot lacked the conversational qualities of real-life interaction.They felt that the chatbot was too quick to agree or thank them for their contribution without providing a meaningful response.
One participant admitted having a negative bias toward chatbot systems, mentioning that a real person could still do the job better.However, they acknowledged that chatbot systems could be helpful in certain situations, such as when architects do not have time to discuss their ideas with stakeholders.

Discussion
The chatbot tool demonstrated its ability to handle extensive conversations, allowing for significant and focused human-like discussions while gathering new types of information.In the study, participants' input during their interactions with the chatbot was automatically transformed into a valuable list of requirements.This innovative method can enhance participation in large-scale urban design projects.
However, the findings of this study indicate that although chatbot technology can generate meaningful conversations, there are still numerous challenges to address in order to guarantee its successful implementation in such urban design projects.

Human-chatbot prompt design
In chatbot prompt design, we face a unique challenge that stems not only from the unpredictable nature of LLMs but also from the variability of human behavior.Previous chatbot research has compared the unpredictability of LLMs to "herding AI cats" due to the tendency of LLMs to generate unexpected responses (Zamfirescu-Pereira et al., 2023).This metaphor is particularly pertinent when designing an assistant chatbot.However, our chatbot's primary aim is significantly differentto extract specific information from human users.
To achieve this goal, designers must also consider the significant variability in human behavior.For instance, our data show a wide range of responses to the same prompt that resulted in different kinds of conversations and a varying number of design requirements, demonstrating high variability in human behavior.This challenge is further compounded by the chat conversation format's inherently unstructured and open-ended nature.Therefore, it is not just about "herding AI cats" but also about managing "human cats."This metaphor underscores the need for designers to account for human behavior's unpredictability and LLMs' inherent unpredictability.In essence, designers of participatory chatbot systems, which involve users in the design process, must be prepared to navigate the dual unpredictability of human interaction and LLMs.
To address this complexity, we propose a prompt design framework for information collection comprising two key components: an internal and shared prompt.The internal prompt encapsulates project-specific information and a personality modifier to influence the chatbot's responses.The shared prompt, on the other hand, plays a crucial role in shaping the human-AI interaction by providing a clear prompt for the human user.Both prompts must be fine-tuned through iterative testing to produce the expected interaction between humans and AI.While testing LLM prompts can be done automatically, human subject experiments are much more complex but can be automated using crowdsourcing platforms.
In conclusion, our proposed framework provides an essential foundation for designing effective human and chatbot prompts, addressing the challenges posed by the unpredictable nature of both.Our chatbot design framework distinguishes itself from Perceived ease of use: The interaction with the software is clear and understandable (g) Perceived ease of use  previous studies by extending the prompt design to the human, which is essential for fostering clear communication and effectively framing the conversation (Wei et al., 2023).However, the design of chatbot prompts is still very challenging.It necessitates improved tools and methodologies, as they demand extensive trials and testing.Therefore, future research should focus on developing more sophisticated tools and methodologies for prompt chatbot design, considering the unpredictability of human input and AI responses.

Enhancing stakeholder engagement through enjoyable participation methods
In accordance with prior research, our study demonstrated that the participants perceived the chatbot interaction as more enjoyable than completing a traditional survey (Kim et al., 2019;Xiao et al., 2020b).Previous research emphasizes the significance of stakeholder involvement in design projects to achieve successful outcomes (Arnstein, 1969;Münster et al., 2017;Calderon, 2020).However, urban design and planning initiatives often struggle with low participation rates due to their professional and political nature (Brabham, 2009;Giering, 2011;Krüger et al., 2019;Dortheimer and Margalit, 2020).This creates difficulties in meaningfully engaging underrepresented communities in such projects.
To address this challenge, it is essential to comprehend the motivations behind participation.Researchers agree that enjoyment is a crucial factor in driving participation (Lindenberg, 2001;Malone et al., 2010).A robust connection has been discovered between enjoyment and increased engagement in a crowdsourcing activity.Individuals who partake in enjoyable crowdsourcing activities are more likely to invest time and effort, resulting in more contributions and greater satisfaction with the final product (Frey et al., 2011;Liang et al., 2018).
Our study results indicate that chatbots are perceived as more enjoyable than traditional surveys, which can enhance participation and foster greater engagement with diverse communities.Practitioners should consider this finding when planning large-scale participatory urban projects, as incorporating chatbots may lead to improved involvement and more successful outcomes.
However, it is crucial to consider the potential novelty effect of the chatbot.Participants may have enjoyed their first interaction with the LLM-based chatbot due to its novelty, which could have influenced their favorable ratings.Over time, as the novelty wears off, users may experience "chatbot fatigue," similar to "survey fatigue."This can impact the long-term effectiveness and user satisfaction of chatbots.More research is needed to understand this potential effect.
We propose several strategies to maximize the potential of chatbots in stakeholder involvement in spatial design to make them more enjoyable.Firstly, the chatbot persona should be designed to be engaging and enjoyable, incorporating elements of humor, empathy, and a conversational style in the internal prompt.Secondly, chat interfaces should be made more accessible by integrating chatbots into popular instant messaging platforms, reducing the learning curve for users.Lastly, it is crucial to continually monitor and improve the chatbot based on user feedback and performance analysis.

Comparing information quality and quantity in chatbot conversations and surveys
In examining the quality and quantity of information generated, we discovered that chatbot conversations yielded more data than surveys despite covering fewer topics.However, the rate of novel ideas was similar.While the differences observed were not statistically significant, preventing us from definitively stating one method as superior, our findings suggest that chatbots can generate design requirements and ideas similar to those obtained from surveys.We recommend that future research delve deeper into this comparison, utilizing larger datasets from chatbot conversations and surveys to substantiate these preliminary findings further.
Various factors may explain the differences between surveys and chatbot conversations concerning information quality and quantity.Firstly, our chatbot conversations were less structured than a survey, which may have led to answering fewer questions.Secondly, surveys offer a clear outline of the required input through a series of pages and questions, whereas chatbot conversations may lack clarity regarding completing information collection, allowing users to end the conversation when they feel it is over.Lastly, the casual, conversational nature of chatbot interactions may result in fewer topics being discussed, as they demand more effort from users and need to be longer to cover all questions.
This limitation of current chatbot technology should be considered when using chatbots for ideation or design requirement collection and should inform future research.Therefore, to ensure that sufficient issues and novel ideas are generated, we suggest using chatbots with a substantial participation group.With small groups, the current chatbot should not be seen as a replacement for surveys but rather as an additional input method.Notwithstanding, according to the results of the present study, chatbots can be a valuable tool for engaging stakeholders as they are more enjoyable and, thus, are more likely to be used by a broader pool of stakeholders.

Future research
Future research should focus on the experimental testing of chatbots in real-world urban design settings to identify new challenges.In addition, further research on improving the proposed chatbot in terms of conversation structure and summarization algorithms would also be needed.In particular, future studies could explore how chatbots can communicate design using visual communication.Furthermore, considering our findings on both the strengths and limitations of using chatbots in design projects, it would be meaningful to examine the combined use of chatbots and surveys to leverage the advantages of both approaches.

Limitations
The present study has several limitations.First, the study participants were architecture students and faculty members proficient in verbally expressing design ideas, understanding what buildings require, and having outstanding novel ideas.This could have led to the creation of more topics and a higher topic and novelty rate than lay people.
The second limitation is that the participants were aware that the project used in the present study was not real, and thus, their ideas would not be realized.This could have caused some participants to have a lighter conversation with the chatbot, knowing there would be no ramifications.Furthermore, it may have led people to also report less in their survey responses.
The third limitation concerns certain obscurities in utilizing the chatbot's design requirement generation process.Upon the conversation's conclusion, participants were required to activate a "finish chat" button to compile and generate a list of design requirements.However, some participants failed to execute this step.Upon analysis of the conversations, no causative factors related to the chatbot's performance could be identified, leading us to conclude that the issue lies within the user interface.To address this limitation in future chatbot research, we propose automatically generating design requirements once the conversation ends, eliminating the need for user-initiated activation.

Conclusion
Chatbots have the potential to transform stakeholder engagement in urban design projects.By offering a more engaging and interactive experience than traditional surveys, they can potentially help urban designers connect with more extensive and diverse communities.However, the findings of this study do not show a significant difference between chatbots and surveys in generating topics and new ideas.Despite this limitation, the study demonstrates that chatbots can successfully automate design conversations in architecture and urban design.Participants found the chatbot experience enjoyable and stimulating, which could lead to increased public involvement in participatory design processes.
This research suggests a chatbot system and prompt framework that can be utilized in large-scale participatory design projects, streamlining data collection and analysis.The system enables automated conversations while providing a summarization mechanism to help designers manage the vast amounts of data generated.In conclusion, our findings support the effective use of chatbots in facilitating design conversations, highlighting necessary further research to enhance the data collection capabilities of chatbots, making them even more beneficial for design processes.

Figure 2 .
Figure 2. Conceptual diagram of the LLM prompt and user prompt, made out of internal and shared prompts.

Figure 3 .
Figure3.Distribution of the number of human-provided words per message, comparing chatbot and survey responses with a bucket size of two words.Both mediums are similarly distributed, peaking at 4-12 words per message.Notably, the chatbot generated a significantly higher number compared to the surveys.

Figure 4 .
Figure4.Chatbot and survey user experience evaluation result comparison in terms of service quality, perceived enjoyment, perceived usefulness, perceived ease of use, satisfaction, and continuance intention.

Table 1 .
Summary of human-provided information in chatbot and survey https://doi.org/10.1017/S0890060424000027Published online by Cambridge University Press

Table 2 .
Comparative analysis of chatbot response quality evaluation between preliminary experiments and a controlled experiment.It shows that the enhanced GPT-3 model, coupled with refined text prompts, improved the quality of the generated text

Table 4 .
Summary of design brief analysis success