1. Introduction
Within design research, joint attention during co-creation has emerged as a pivotal phenomenon, linking cognitive actors through their dynamic interactions (Falck-Ytter et al. Reference Falck-Ytter, Kleberg, Portugal and Thorup2023; Sani-Bozkurt & Bozkus-Genc Reference Sani-Bozkurt and Bozkus-Genc2023). Rooted in the broader paradigm of intersubjectivity, joint attention represents the confluence of individual cognitive processes within a shared, collaborative space. Intersubjectivity, as postulated by Fuchs & De Jaegher (Reference Fuchs and De Jaegher2009) and Racine & Carpendale (Reference Racine and Carpendale2007), encapsulates a shared intellectual and emotional state. Yet the specific manifestation of joint attention within co-creation remains underexplored and warrants rigorous academic scrutiny.
Traditional approaches to assessing co-creation effectiveness have relied heavily on subjective observation methods, including self-report questionnaires, expert evaluations and post-hoc interviews (Shalley, Zhou & Oldham Reference Shalley, Zhou and Oldham2004; Heiss & Kokshagina Reference Heiss and Kokshagina2021). These subjective methods, while valuable for capturing experiential dimensions of collaboration, present several critical limitations that constrain the advancement of co-creation research. First, subjective observations are inherently susceptible to observer bias, where researchers’ theoretical preconceptions and expectations can influence their interpretation of collaborative behaviors (Nguyen & Mougenot Reference Nguyen and Mougenot2022). Second, self-report measures suffer from retrospective bias, as participants may struggle to accurately recall or articulate the nuanced dynamics that occurred during co-creation activities (Cash, Dekoninck & Ahmed-Kristensen Reference Cash, Dekoninck and Ahmed-Kristensen2020). Third, the temporal granularity of traditional methods is insufficient for capturing the micro-level interactions that constitute joint attention, as these methods typically rely on broad, summary assessments rather than moment-by-moment behavioral tracking (Behoora & Tucker Reference Behoora and Tucker2015). Fourth, the scalability of subjective methods is limited, as expert observation becomes resource-intensive when applied across multiple sessions or large numbers of participants (Kassner, Patera & Bulling Reference Kassner, Patera and Bulling2014; Spagnolli et al. Reference Spagnolli, Guardigli, Orso, Varotto, Gamberini, Jacucci, Gamberini, Freeman and Spagnolli2014).
Quantitative research methodologies offer complementary advantages that can address these limitations while preserving the valuable insights from qualitative approaches. Objective measurement systems provide consistent, reproducible data that is independent of observer interpretation, enabling more reliable cross-study comparisons and meta-analyses (Kent et al. Reference Kent, Gopsill, Giunta, Goudswaard, Snider and Hicks2022). The temporal precision of computer vision-based approaches allows for the capture of brief, ephemeral moments of joint attention that might be missed by human observers (Erichsen et al. Reference Erichsen, Sjöman, Steinert and Welo2021). Furthermore, quantitative methods enable the analysis of large datasets, facilitating the identification of patterns and relationships that emerge across multiple co-creation sessions (Hansen & Özkil Reference Hansen and Özkil2020). Most importantly, the integration of quantitative and qualitative approaches creates a more comprehensive understanding of co-creation dynamics, where objective behavioral indicators can validate and complement subjective experiential reports (Kleinsmann, Valkenburg & Sluijs Reference Kleinsmann, Valkenburg and Sluijs2017).
Co-creation, as a distinctive form of collaborative activity, represents more than mere coordination of efforts among participants. As Cash et al. (Reference Cash, Hicks, Culley and Salustri2021) demonstrate in their Design Science research, it involves complex intersubjective dynamics where participants not only work together but also actively construct shared meaning through joint attentional processes. This stands in contrast to general collaboration, which may involve coordinated action without necessarily sharing cognitive and emotional states. The distinction is crucial for understanding how design knowledge emerges through collective processes rather than individual contributions alone.
Historically, intersubjectivity has been assessed predominantly through the lens of self-descriptions and expert observations, as elucidated by Shalley et al. (Reference Shalley, Zhou and Oldham2004). While these traditional methodologies offer invaluable insights, they are not without limitations. The inherent subjectivity of self-descriptions, coupled with the potential biases of expert observations, often yields data riddled with discrepancies (So et al. Reference So, Cheng, Law, Wong, Lee, Kwok, Lee and Lam2023). This gap underscores the need for a more objective, data-driven approach to deciphering joint attention in co-creation.
Recent advancements in design research methodology have begun to address this gap through computational approaches to measuring design activity. Kent et al. (Reference Kent, Gopsill, Giunta, Goudswaard, Snider and Hicks2022) demonstrate how network analysis can reveal patterns in prototyping activities that remain invisible to traditional observation methods. Similarly, Erichsen et al. (Reference Erichsen, Sjöman, Steinert and Welo2021) have pioneered digital approaches to capturing physical design artifacts, providing more objective measures of design processes. These methodological innovations align with what Cash et al. (Reference Cash, Dekoninck and Ahmed-Kristensen2020) identify as a broader trend toward more rigorous, quantitative approaches to understanding design cognition and collaboration.
Recent advances in human activity recognition, underpinned by deep learning algorithms, offer the possibility of objectively quantifying joint attention and transcending the constraints of subjective interpretation (Ozdemir, Akin-Bulbul & Yildiz Reference Ozdemir, Akin-Bulbul and Yildiz2024). This aligns with growing recognition in the design research community of the need for more robust measurement approaches to intersubjective phenomena. Our study addresses this methodological gap by developing a computational framework for measuring joint attention in co-creation contexts.
This study therefore harnesses deep learning to elucidate the nuances of joint attention within co-creation. By developing and validating a quantitative measurement framework for joint attention in co-creation, our research contributes to what Hansen & Özkil (Reference Hansen and Özkil2020) identify as a critical need for more objective approaches to understanding design collaboration. This framework not only enables more precise assessment of co-creation effectiveness but also provides a foundation for evidence-based enhancement of co-creation processes across diverse domains.
The remainder of this paper is structured as follows. Section 2 presents a comprehensive literature review, elucidating the theoretical underpinnings of intersubjectivity and its manifestations in co-creation. Section 3 introduces the optimized deep learning algorithm tailored for human activity recognition. Section 4 delineates the research methodology, encompassing the procedural details and data sources. Section 5 presents the empirical findings, accentuating the reliability and intercorrelations of the joint attention measures. Finally, Section 6 synthesizes the research insights and charts potential avenues for future exploration.
Our study makes three principal contributions to the field of design research. First, it develops a novel computational approach to measuring joint attention in co-creation contexts, addressing what Kleinsmann et al. (Reference Kleinsmann, Valkenburg and Sluijs2017) identify as a significant methodological gap in design research. Second, it identifies and quantifies three dimensions of joint attention – empathic sharing, social context and key area – providing a structured framework for understanding intersubjectivity in co-creation. Third, it establishes weighted indicators that enable design researchers and practitioners to optimize co-creation environments for enhanced effectiveness. Collectively, these contributions advance the field beyond subjective assessments of co-creation toward more rigorous, evidence-based approaches to understanding and facilitating this complex form of collaborative design activity.
2. Literature review
2.1. Joint attention in the cocreation process
Co-creation is a multifaceted concept that has evolved significantly over time across various disciplines. At its core, co-creation refers to the joint creation of value by multiple stakeholders through collaborative processes (Prahalad & Ramaswamy Reference Prahalad and Ramaswamy2004). In the management and marketing literature, co-creation has been conceptualized as “the joint creation of value by the company and the customer; allowing the customer to co-construct the service experience to suit their context.”
From a design perspective, Sanders & Stappers (Reference Sanders and Stappers2008) define co-creation as “any act of collective creativity, i.e., creativity that is shared by two or more people.” In this context, co-creation encompasses various participatory approaches for design and decision-making with diverse participants (Ramaswamy & Ozcan Reference Ramaswamy and Ozcan2018), distinguished by assisted involvement in orchestrated multistakeholder interactions, such as formal workshops and self-organizing modes of engagement.
Importantly, across these diverse conceptualizations, several common elements emerge: co-creation involves multiple stakeholders working together, it is an interactive and collaborative process and it aims to create value that benefits all involved parties (Ind & Coates Reference Ind and Coates2013; Ramaswamy & Ozcan Reference Ramaswamy and Ozcan2018). These characteristics highlight the fundamentally social nature of co-creation processes.
The limitations of traditional co-creation assessment methods have been increasingly recognized in recent design research literature. Lloyd & Oak (Reference Lloyd and Oak2018) identified significant challenges in capturing the temporal dynamics of collaborative design processes through conventional observation methods, noting that “categories, stories, and value tensions” often emerge through micro-interactions that escape traditional documentation approaches. Similarly, Andersen & Mosleh (Reference Andersen and Mosleh2021) demonstrated that conflicts and resolutions in co-design activities occur through subtle gestural and spatial interactions that require fine-grained temporal analysis to understand fully. These findings support the need for more sophisticated measurement approaches that can capture the nuanced behavioral indicators underlying effective co-creation (Cooper Reference Cooper2023). Furthermore, Devos & Loopmans (Reference Devos and Loopmans2022) emphasize the importance of embodied intersubjectivity in co-creation processes, arguing that traditional verbal and survey-based assessments fail to capture the full spectrum of collaborative engagement that occurs through physical presence and spatial interaction.
In recent years, a growing number of scholars have sought to investigate the social dimension of the co-creation process (Park et al. Reference Park, O’Brien, Cai, Morris, Liang and Bernstein2023). Analogous to the study of individual designers, co-design has been examined through the lens of social psychology, as a type of design work that relies on ongoing, subtle social interactions and transformative work involving the design of artifacts (Button & Sharrock Reference Button and Sharrock1996). According to Devos and Loopmans, co-creation involves the enactment of creation through interactions that go beyond mere collaboration between two or more human actors, thereby revealing its inherently social nature (Devos & Loopmans Reference Devos and Loopmans2022). Within this context, numerous studies have examined the social interactions and conflicts that emerge during the design process (Andersen & Mosleh Reference Andersen and Mosleh2021). This line of inquiry encompasses various facets, including how designers create artifacts or employ them to facilitate and promote collaborative efforts (Andersen & Mosleh Reference Andersen and Mosleh2021; Christensen & Abildgaard Reference Christensen and Abildgaard2021), as well as the gestures and sketches they produce during interactions. The primary objectives of these research endeavors are to foster collaboration and communication (Howard & Bevins Reference Howard and Bevins2022), resolve conflicts and discrepancies (Le Bail, Baker & Détienne Reference Le Bail, Baker and Détienne2022), make decisions (Cooper Reference Cooper2023) and ultimately examine the nature of joint and collaborative meaning-making, which is of paramount importance (Ind & Coates Reference Ind and Coates2013).
The measurement of co-creation processes presents significant challenges due to its abstract and multifaceted nature. Various approaches have been proposed in the literature, focusing on different aspects of co-creation. Some studies have measured outcomes, such as innovation performance (Frow et al. Reference Frow, Nenonen, Payne and Storbacka2015) or customer satisfaction (Grönroos & Voima Reference Grönroos and Voima2013), while others have examined process aspects like customer participation (Yi & Gong Reference Yi and Gong2013) or collaboration quality (Ranjan & Read Reference Ranjan and Read2016). In the context of design, researchers have investigated the development of shared understanding during collaborative work (Cash et al. Reference Cash, Dekoninck and Ahmed-Kristensen2020) and significant “episodes” during the process (Lloyd & Oak Reference Lloyd and Oak2018).
Therefore, this paper places a greater emphasis on the role of interaction in co-creation as a means of promoting mutual understanding and embodied cognition (Devos & Loopmans Reference Devos and Loopmans2022), which in turn helps to create and reconstruct our own and others’ roles in the process of relating (Mosleh & Larsen Reference Mosleh and Larsen2021), thus fostering social innovation. To understand the positive impact of social interactions that emerge during a series of design processes, we must borrow relevant concepts from the fields of social psychology, cognitive science and human-computer interaction (Wang, Kim & Lin Reference Wang, Kim and Lin2024). Additionally, we need to develop appropriate measures and evaluation methods that allow us to assess the quality and effectiveness of social interaction in the context of co-creation. Researchers have focused on individual behaviors by investigating the “episodes” during the process that are especially significant to participants (Lloyd & Oak Reference Lloyd and Oak2018) and have presented the relations among collaborative design work and the development of shared understanding (Cash et al. Reference Cash, Dekoninck and Ahmed-Kristensen2020).
Most recent studies have measured these indices using qualitative or subjective questionnaires (Heiss & Kokshagina Reference Heiss and Kokshagina2021). Quantitative approaches reported in the literature require each participant to wear intrusive sensors (Kassner et al. Reference Kassner, Patera and Bulling2014; Spagnolli et al. Reference Spagnolli, Guardigli, Orso, Varotto, Gamberini, Jacucci, Gamberini, Freeman and Spagnolli2014). Behoora & Tucker (Reference Behoora and Tucker2015) examine interpersonal interactions in cocreation activities by using computer vision and machine learning methods to measure the emotional state of individuals. However, identifying motions and facial expressions separately yields an inaccurate picture of “interaction” behavior during multiperson cocreation.
After thorough consideration of various potential measures, joint attention has been selected as the indicator for measuring co-creation in this study. This selection is based on several theoretical and practical considerations that align with the social and interactive nature of co-creation processes:
(1) Fundamental to social interaction: Joint attention refers to the ability of two or more individuals to focus on the same object, event or person with the intention of interacting with each other (Tomasello Reference Tomasello, Moore and Dunham1995; Moore, Dunham & Dunham Reference Moore, Dunham and Dunham2014). This directly corresponds to the interactive nature of co-creation, which involves multiple stakeholders working together toward shared goals (Prahalad & Ramaswamy Reference Prahalad and Ramaswamy2004).
(2) Indicator of shared understanding: Joint attention is closely linked to the development of shared understanding among participants (Carpenter, Nagell & Tomasello Reference Carpenter, Nagell and Tomasello1998), which is a crucial aspect of effective co-creation (Ind & Coates Reference Ind and Coates2013). By measuring joint attention, we can assess the extent to which participants are developing a common ground for collaboration.
(3) Observable and measurable: Unlike some abstract aspects of co-creation, joint attention can be observed and measured through behaviors, such as gaze direction, gestures and verbal references (Mundy, Sullivan & Mastergeorge Reference Mundy, Sullivan and Mastergeorge2007), making it a practical indicator for empirical research.
(4) Associated with positive outcomes: Research has shown that joint attention is associated with improved communication, enhanced problem-solving and greater mutual understanding in collaborative contexts (Richardson, Dale & Kirkham Reference Richardson, Dale and Kirkham2007; Shteynberg & Galinsky Reference Shteynberg and Galinsky2011), all of which are essential for successful co-creation.
Joint attention necessitates the ability to gain, maintain and shift attention (Mundy et al. Reference Mundy, Sullivan and Mastergeorge2007), and it is located at the intersection of various complex abilities that facilitate our cognitive, emotional and action-oriented connections with other individuals (Rauschnabel et al. Reference Rauschnabel, Felix, Heller and Hinsch2024).
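Operationally, the observable behaviors listed above can be reduced to simple geometric tests. The following toy sketch (all coordinates, thresholds and function names are illustrative and are not part of this study's measurement system) flags a joint-attention event when two participants' estimated gaze rays land on nearby points of a shared work surface:

```python
import math

def gaze_point(head_pos, gaze_dir, plane_y=0.0):
    """Project a 2D gaze ray from head_pos along gaze_dir onto the table plane y = plane_y."""
    (x, y), (dx, dy) = head_pos, gaze_dir
    if abs(dy) < 1e-9:        # gaze parallel to the plane: no intersection
        return None
    t = (plane_y - y) / dy
    if t < 0:                 # plane lies behind the participant
        return None
    return (x + t * dx, plane_y)

def joint_attention(p1, g1, p2, g2, radius=0.3):
    """Two participants jointly attend if both gaze points fall within `radius` of each other."""
    a, b = gaze_point(p1, g1), gaze_point(p2, g2)
    if a is None or b is None:
        return False
    return math.dist(a, b) <= radius

# Two participants on opposite sides of a table, both looking at the same prototype
print(joint_attention((0.0, 1.0), (0.5, -1.0), (2.0, 1.0), (-1.5, -1.0)))  # → True
```

In practice the gaze directions would come from a vision model rather than being given, and the convergence threshold would be calibrated to the workshop layout.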
The emergence of positive interactions in the context of co-creation is a complex process that is influenced by several factors. Understanding these underlying principles and processes is crucial in order to identify the measurable indicators that characterize these factors. Furthermore, it is important to explore how design strategies can be leveraged to enhance these indicators in the context of co-creation activities. In this regard, our focus is on research related to joint attention, which examines how individuals share subjective experiences and construct meaning through their interactions. By drawing on theories and concepts from this field, we aim to shed light on the underlying mechanisms that enable positive effects among interactors and to identify design strategies that can be used to promote effective social interaction in the context of co-creation.
2.2. Joint attention in intersubjectivity
Intersubjectivity has been recognized as a critical factor in small-group research, as mutual understanding among group members promotes productivity, dependability and flexibility (Weick & Roberts Reference Weick and Roberts1993). Given that the cocreation process is characterized by “social interaction” (Edvardsson, Tronvoll & Gruber Reference Edvardsson, Tronvoll and Gruber2011; Finsterwalder & Kuppelwieser Reference Finsterwalder and Kuppelwieser2011) and “meaning-making in dialogue” (De Jaegher, Peräkylä & Stevanovic Reference De Jaegher, Peräkylä and Stevanovic2016), assessing intersubjectivity in the cocreation process can provide valuable insights into the experience of the process and the effectiveness of its outcomes. Several design researchers have emphasized the role of intersubjectivity in understanding the design process, including its role in facilitating mutual understanding (Ma Reference Ma2013) and establishing spaces for equal dialogue among participants (Ho & Lee Reference Ho and Lee2012). In addition, social cognition situates the study of intersubjectivity within interaction theory (IT) (Gallagher Reference Gallagher2001, Reference Gallagher2009), providing further support for research on intersubjectivity in the context of social interaction in co-creation activities. These findings are highly relevant to our study and contribute to our understanding of the importance of intersubjectivity in the cocreation process.
Intersubjectivity has been theorized to entail a shared intellectual and emotional state of social competence (Djenar, Ewing & Howard Reference Djenar, Ewing and Howard2017) and a shared involvement in a reciprocal exchange (Loots & Devisé Reference Loots and Devisé2003). Shared involvement refers to concurrently observing or concentrating on the same aspect of the environment (Moore et al. Reference Moore, Dunham and Dunham2014). Reciprocal exchange refers to the active and reciprocal involvement of both interaction partners, whether physically in coordinated behavior patterns and vitality affects; existentially in the sharing of intentions, feelings and objects of joint attention or symbolically in the creation of linguistic and symbolic meaning (Rochat, Passos-Ferreira & Salem Reference Rochat, Passos-Ferreira, Salem, Carassa, Morganti and Riva2009). Beebe & Lachmann (Reference Beebe and Lachmann2002) proposed a systems model of interaction illustrating that verbalizable symbolic narratives (dialogues), unconscious gaze, facial expressions, eye contact, spatial orientation and body posture momentarily influence intersubjectivity during interactions.
Several studies have begun to describe the impact of intersubjectivity on social interactions and applied intersubjectivity as a measure (Loots, Devisé & Jacquet Reference Loots, Devisé and Jacquet2005; Damen et al. Reference Damen, Janssen, Ruijssenaars and Schuengel2015). Scholars have assessed children’s joint attention, joint focus, shared meaning-making (Trevarthen & Aitken Reference Trevarthen and Aitken2001; Göncü, Patt & Kouba Reference Göncü, Patt, Kouba, Smith and Hart2002), emotional attunement and social coordination (Bateman, Campbell & Fonagy Reference Bateman, Campbell and Fonagy2021) during group interactions. Garte (Reference Garte2015) provides a method to capture the interactive social competence development process. This assessment approach can provide new insights into how intersubjectivity supports social cognition and competence. Matsumae captured each participant’s emotional fluctuation during the cocreation process to assess the degree of qualitative coincidence of fluctuation as a state of intersubjectivity being formed among them (Matsumae & Nagai Reference Matsumae and Nagai2018) (Table 1).
Table 1. List of important references

Based on previous studies, joint attention has been identified as a critical dimension in measuring intersubjectivity (Garte Reference Garte2015). It serves as a link between primary intersubjectivity and secondary intersubjectivity (Trevarthen Reference Trevarthen1998; Trevarthen Reference Trevarthen2012). Because joint attention plays a critical role in the development of intersubjectivity, it is measured here as an indicator of intersubjectivity in the co-creation process. Interaction turns have also been used to represent intersubjectivity in previous research, such as in Loots’ study. However, the way interaction turns are represented closely resembles the method used to measure joint attention. Therefore, joint attention is considered a reliable and measurable indicator of co-creation in our research.
Computer vision was utilized to recognize behaviors and emotions among a group of individuals in videos. By computing metrics related to interactions in nonverbal and implicit modes, which are typically outside of conscious awareness, we were able to measure joint attention among participants in various cocreation scenarios. The subsequent table presents indicators of joint attention among groups of individuals in the cocreation process, inspired by existing research (Table 2).
Table 2. Reference source comparison table of important indicators

3. Methods and technical principles
In design workshops, the accurate identification of participants’ targets serves as the foundation for subsequent joint attention analysis. Current methods predominantly encompass questionnaire surveys, artificial scene observations and on-site interviews, among others. However, these approaches are prone to measurement inaccuracies and may introduce varying degrees of interference for participants. Notably, there is a paucity of quantitative research on individuals’ engagement levels in design workshops within the design field, and the factors influencing the extent of collaborative participation in such settings remain ambiguous. Consequently, it is imperative to incorporate non-contact observation techniques to gather information on design workshop participants without compromising their experience and to quantitatively evaluate the specific indicators that influence their engagement levels. To accomplish this research objective, we employed deep learning-based visual measurement technology to precisely identify workshop participants during the study. Given the intricate nature of diverse design workshop scenarios, we further optimized and enhanced this methodology, as elaborated in this section.
3.1. YOLO-TP: design workshop person target recognition network
To measure joint attention in design workshop scenes using computer vision, we must accurately identify personnel targets in various complex scenes. The accuracy of personnel target recognition affects the reliability of the subsequent joint attention analysis. Therefore, this study uses extensive sample data combined with the improved YOLO-TP (YOLO Transformer Person) deep learning network to accurately identify and extract personnel targets in the design workshop.
For the optimization and improvement of the network model, this study uses YOLO v5s as the primary network structure, as shown in Figure 1. The basic architecture of the network consists of backbone, neck and head structures. The actual space of design creativity workshops poses many problems, such as multiple personnel targets, complex environmental backgrounds, varying lighting conditions and large data volumes (Wang et al. Reference Wang, Wu, Yang, Thirunavukarasu, Evison and Zhao2021; Wu et al. Reference Wu, Liu, Li, Long, Wang, Wang, Li and Chang2021; Sharma et al. Reference Sharma, Debaque, Duclos, Chehri, Kinder and Fortier2022). This research adopts several optimization methods to improve the traditional YOLO v5s network structure (Aziz et al. Reference Aziz, Salam, Sheikh and Ayub2020), forming a YOLO-TP target recognition network dedicated to personnel recognition in the design creativity workshop space.
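Downstream of any YOLO-style detector, overlapping candidate boxes must be reduced to one detection per participant before behavior analysis. The following minimal sketch of IoU-based non-maximum suppression (independent of the YOLO-TP implementation, which is not reproduced here; all boxes and scores are made up) illustrates that standard post-processing step:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, iou_thresh=0.5):
    """Keep the highest-confidence box in each cluster of overlapping detections.
    `detections` is a list of (box, score) pairs."""
    keep = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) < iou_thresh for kept_box, _ in keep):
            keep.append((box, score))
    return keep

dets = [((10, 10, 50, 90), 0.92),   # participant A
        ((12, 12, 52, 88), 0.85),   # duplicate detection of A
        ((70, 15, 110, 95), 0.88)]  # participant B
print(len(nms(dets)))  # → 2
```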

Figure 1. YOLO v5s network structure diagram.
3.2. Improvement and optimization of model structure
3.2.1. Introducing the TransformV2 module to optimize the backbone structure
First, this part optimizes the backbone of the network. We replace CSP1_1 in Figure 1 with the latest Transformer structure. Compared with the traditional CSP structure (Zhang et al. Reference Zhang, Wan, Wu and Du2022), the Transformer structure can better overcome bottlenecks when training on large datasets and provides a more accurate and stable detection model for reliably identifying personnel targets in the design workshop.
Here, we introduce the structure layer of the new Transformer V2 to address the problems of the traditional Transformer V1 structure layer. The improvements are marked in red in Figure 2. The traditional Transformer V1 structure layer faces three problems:
(1) Scaling up the vision model can lead to severe training instability.
(2) For downstream tasks that require high resolution, there is no well-explored method for transferring a model trained at low resolution to a larger-scale, higher-resolution model.
(3) In complex background environments, a small number of pixels can introduce considerable interference.

Figure 2. Improvement of V2 compared with V1.
For the first problem of unstable training, we adopt the post-norm idea, which moves the layer-norm layer in the Transformer block from before the attention layer to after it. The advantage of this arrangement is that the output of the attention computation is normalized, stabilizing the magnitude of the output values.
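The difference is purely in the order of operations within a residual block. This simplified numpy sketch (single-head attention with identity projections, illustrative only, not the actual Swin implementation) contrasts the two arrangements and shows why the post-norm residual branch stays well scaled:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(x):
    """Simplified single-head self-attention with identity projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(-1, keepdims=True)            # numerically stable softmax
    w = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return w @ x

def block_pre_norm(x):       # V1: normalize *before* attention
    return x + attention(layer_norm(x))

def block_post_norm(x):      # V2 post-norm: normalize the attention *output*
    return x + layer_norm(attention(x))

x = np.random.default_rng(0).normal(size=(4, 8))
# In the post-norm block, the residual branch has unit scale no matter how
# large the activations grow, which stabilizes the training of deep models.
for scale in (1.0, 100.0):
    branch = layer_norm(attention(x * scale))
    print(round(float(branch.std()), 2))
```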
For the second problem, the module uses log-spaced continuous position bias to migrate a model pretrained at low resolution to high resolution. The conventional approach is the continuous position bias method (Liu et al. Reference Liu, Hu, Lin, Yao, Xie, Wei and Guo2022), whose principle is shown in Equation (1): a meta-network over relative coordinates is adopted.
$ B\left(\Delta x,\Delta y\right)=G\left(\Delta x,\Delta y\right) $ (1)
In the above formula, G is a small network that generates bias parameters for arbitrary relative coordinates, so it naturally transfers across variable window sizes. The log-spaced variant alleviates the problem that a large proportion of the relative coordinate range must be extrapolated when migrating across large windows. This technique is expressed in Equation (2).
$ \widehat{\Delta x}=\operatorname{sign}(x)\cdot \log \left(1+\left|\Delta x\right|\right),\quad \widehat{\Delta y}=\operatorname{sign}(y)\cdot \log \left(1+\left|\Delta y\right|\right) $ (2)
A logarithmic transform is adopted here because the extrapolation ratio required for resolution migration is much smaller in log-spaced coordinates, which lays the foundation for migrating a model pretrained at low resolution to a high-resolution model.
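The benefit of log-spaced coordinates is easy to verify numerically. This sketch uses the standard example from the Swin Transformer V2 literature (migrating attention windows from 8×8 to 16×16, an assumption for illustration) and compares how far beyond the pretrained coordinate range the model must extrapolate:

```python
import math

def log_spaced(x):
    """Map a relative coordinate to log space: sign(x) * log(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)

# Migrating windows from 8x8 to 16x16: relative coordinates grow
# from [-7, 7] to [-15, 15].
linear_ratio = (15 - 7) / 7                                 # extrapolation in linear space
log_ratio = (log_spaced(15) - log_spaced(7)) / log_spaced(7)  # extrapolation in log space
print(round(linear_ratio, 2), round(log_ratio, 2))  # → 1.14 0.33
```

The required extrapolation shrinks from roughly 1.14× the pretrained range to about 0.33×, which is why log-spaced coordinates transfer better across window sizes.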
For the third problem, we note that in the V1 self-attention computation, the similarity of a pixel pair is calculated as the dot product of query and key. However, in data-rich workshop scene models, the attention maps of some blocks and heads become dominated by a small number of pixels. To alleviate this problem, the scaled cosine attention (SCA) method is applied in this part, as shown in Equation (3).
$ \operatorname{Sim}\left({q}_i,{k}_j\right)=\cos \left({q}_i,{k}_j\right)/\tau +{B}_{ij} $ (3)
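A minimal numpy sketch of this attention variant follows (τ is a learned per-head scalar in the original method; it is fixed here for illustration, and all tensors are random placeholders). Because the cosine term is bounded in [-1, 1], no single pixel pair can dominate the pre-softmax scores:

```python
import numpy as np

def scaled_cosine_attention(q, k, v, bias, tau=0.1):
    """Sim(q_i, k_j) = cos(q_i, k_j) / tau + B_ij, followed by a softmax over j."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    sim = qn @ kn.T / tau + bias          # cosine term bounded in [-1, 1]
    sim -= sim.max(-1, keepdims=True)     # numerically stable softmax
    w = np.exp(sim) / np.exp(sim).sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(1)
q, k, v = rng.normal(size=(3, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
out = scaled_cosine_attention(q, k, v, bias=np.zeros((3, 3)))
print(out.shape)  # → (3, 4)
```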
3.2.2. Introducing the ASFF module to optimize the head structure
Pyramid feature representation is a standard method for handling target scale variation in object detection. However, the inconsistency between different feature scales is the main limitation of feature-pyramid-based detectors. Here, we use a new, data-driven pyramid feature fusion strategy referred to in the literature as adaptive spatial feature fusion (ASFF). This structure effectively overcomes the problems caused by differently scaled data features across diverse workshop scenarios, thereby improving the scale invariance of features. Additionally, it requires no extra computing resources. In the actual design workshop, during personnel recognition, the adaptive spatial feature fusion module effectively fuses pedestrian features across different scene backgrounds and further improves the robustness of the model. In the network, we replace the traditional Detect module with the ASFF module discussed in this section. Figure 3 shows a schematic of the three-layer ASFF structure.

Figure 3. ASFF module structure diagram.
As shown in Figure 3, the module achieves feature fusion by setting weights α, β and γ. Note that the weight coefficients are generated automatically by a 1×1 convolution layer, a softmax function and backpropagation. The formula is shown in (4).
$ {y}_{ij}^l={\alpha}_{ij}^l\cdot {x}_{ij}^{1\to l}+{\beta}_{ij}^l\cdot {x}_{ij}^{2\to l}+{\gamma}_{ij}^l\cdot {x}_{ij}^{3\to l} $
where X is the input of each scale and y is the feature map output after scale fusion in space. The weights must satisfy $ {\alpha}_{ij}^l+{\beta}_{ij}^l+{\gamma}_{ij}^l=1 $ and $ {\alpha}_{ij}^l,{\beta}_{ij}^l,{\gamma}_{ij}^l\in \left[0,1\right] $, where $ {\alpha}_{ij}^l=\frac{e^{\lambda_{\alpha_{ij}}^l}}{e^{\lambda_{\alpha_{ij}}^l}+{e}^{\lambda_{\beta_{ij}}^l}+{e}^{\lambda_{\gamma_{ij}}^l}} $ (and analogously for $ {\beta}_{ij}^l $ and $ {\gamma}_{ij}^l $).
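As a concrete illustration of the constraints above, the softmax construction can be sketched in a few lines of Python. This is a didactic example of our own: in the network, the control scalars λ are produced by 1×1 convolutions and learned by backpropagation, whereas here they are supplied directly.

```python
import math

def asff_weights(lam_alpha, lam_beta, lam_gamma):
    """Softmax over the three control scalars, guaranteeing that
    alpha + beta + gamma = 1 and each weight lies in [0, 1]."""
    m = max(lam_alpha, lam_beta, lam_gamma)
    exps = [math.exp(x - m) for x in (lam_alpha, lam_beta, lam_gamma)]
    s = sum(exps)
    return tuple(e / s for e in exps)

def asff_fuse(x1, x2, x3, lams):
    """Per-position fusion y = alpha*x1 + beta*x2 + gamma*x3 of the three
    rescaled pyramid levels at one spatial location."""
    a, b, g = asff_weights(*lams)
    return a * x1 + b * x2 + g * x3
```

With equal control scalars the three levels are averaged; as one scalar grows, its level dominates the fused output, which is exactly how ASFF adapts the fusion spatially.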
3.2.3. Introducing the SKAttention mechanism to optimize the neck structure
It is undeniable that the introduction of an attention mechanism plays a vital role in the accuracy of the personnel identification model. In the previous section, we reduced GPU computational overhead by optimizing with Transformer V2. In this section, we reinvest the computation saved by that optimization to improve accuracy by adding one-dimensional convolutions.
As shown in Figure 4, the primary processing process of this module is divided into the following three parts:
-
(1) Split: perform convolution operations (group convolutions) on the input vector X with different kernel sizes. In particular, to further improve efficiency, the traditional 5×5 convolution is replaced by a dilated convolution with dilation rate 2 and a 3×3 kernel.
-
(2) Fusion: after the two feature maps are added, a global average pooling operation is performed, followed by a two-layer fully connected bottleneck (reducing and then restoring the dimension) that outputs two attention coefficient vectors, a and b, where a + b = 1.
-
(3) Select: the two weight vectors, a and b, are used to weight the two preceding feature maps channel by channel, an operation similar to feature selection between the two branches.
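The select step can be sketched as follows. This is a didactic simplification of our own in which the branch logits are supplied directly, whereas in the real module they come from the pooled descriptor passed through the fully connected bottleneck:

```python
import math

def sk_select(u1, u2, logits1, logits2):
    """'Select' step of SKAttention: for each channel, a two-way softmax over
    the branch logits yields weights a + b = 1, which gate the two branch
    feature maps channel-wise."""
    out = []
    for x1, x2, l1, l2 in zip(u1, u2, logits1, logits2):
        m = max(l1, l2)
        e1, e2 = math.exp(l1 - m), math.exp(l2 - m)
        a = e1 / (e1 + e2)  # weight for branch 1; branch 2 gets 1 - a
        out.append(a * x1 + (1.0 - a) * x2)
    return out
```

Equal logits mix the branches evenly, while a strongly positive logit for one branch effectively selects its kernel size for that channel.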

Figure 4. SKAttention module structure diagram.
3.2.4. Improved pedestrian target detection network: YOLO-TP
After the traditional YOLO v5s network is integrated with the above three improved modules, a new network, referred to as YOLO-TP, is obtained that is especially suitable for high-precision recognition of workshop staff. The network is optimized by inserting the Transformer V2 structure in the backbone, using the SKAttention mechanism in the neck and introducing the ASFF structure in the head. The modified YOLO-TP network achieves accurate recognition of pedestrian targets in different scenes, effectively overcomes the problem of inconsistent data characteristics caused by changes in scene, lighting conditions and personnel actions, and exhibits high robustness. The improved network structure is shown in Figure 5.

Figure 5. Schematic of the improved YOLO-TP.
3.3. Comparison of human target recognition test results under a complex background environment
For the training model, we selected video stream data collected from the (Tongji University) Design Institute and the Shanghai NICE 2035 Creative Work Community. After frames were extracted from the video, we annotated the personnel target samples to build a dataset of 7,000 samples across multiple scenes. The training and validation sets were split at a ratio of 9:1. Training used an NVIDIA GeForce RTX 2080 Ti GPU for 10 epochs. Figure 6 compares the traditional and improved networks on various evaluation indicators.

Figure 6. Comparison of the effect between the improved YOLO-TP network structure and the traditional network: (a) comparison of personnel target detection accuracy, (b) comparison of mAP50 and mAP50:90 accuracy indicators, (c) comparison of F1-Score model evaluation indicators and (d) comparison of training accuracy loss.
As shown in Figure 6(a), the detection accuracy of the traditional algorithm reaches 97.39%, while that of the improved algorithm proposed in this paper reaches 98.47%, an increase of 1.08 percentage points. Although our method achieves a recognition accuracy of over 98%, when processing large amounts of data there remain 1–2% discrimination errors under large changes in environmental lighting and fast personnel movement. Such errors are normal in the automatic processing of large-scale data, so we believe they do not affect the subsequent analysis. Figure 6(b) shows that the improved YOLO-TP model is superior to the traditional algorithm in terms of mAP50 and mAP50:90, with increases of 0.11 and 1.07 percentage points, respectively. According to Figure 6(c), the best F1-Score of the traditional algorithm reaches 97.59%, whereas our model reaches 97.86%, an increase of 0.27 percentage points. As shown in Figure 6(d), compared with the traditional YOLO v5 model, the YOLO-TP model reduces both the bounding box loss and the object detection loss: the smaller the former, the more accurately the bounding box is positioned, and the smaller the latter, the more accurately personnel targets are detected. The improved algorithm thus improves both detection accuracy and detection quality.
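For reference, the indicators compared in Figure 6 follow the standard detection definitions, sketched below (the counts in the usage note are hypothetical, not our experimental data):

```python
def precision(tp, fp):
    """Fraction of detections that are true personnel targets."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of true personnel targets that are detected."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall (the F1-Score of Figure 6(c))."""
    return 2 * p * r / (p + r)
```

For example, 985 true positives with 15 false positives would give a precision of 98.5%, in the same range as the accuracies reported above.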
To assess the performance of the proposed model in a design workshop setting, we utilized real-world data collected from design workshops at (Tongji University). The results of the test are depicted in Figure 7.

Figure 7. Test results of real design workshop environment.
Figure 7 illustrates that the conventional YOLO v5 model exhibits some errors in the actual personnel target detection within the workshop, leading to a degree of omission. The red marks in the figure highlight the targets that were missed during detection. In contrast, the YOLO-TP model proposed in this paper accurately identifies personnel targets across various workshop scenarios, establishing a solid foundation for high-precision target recognition in subsequent personnel statistical analyses.
In summary, the YOLO-TP network proposed in this paper demonstrates a strong capability for accurately identifying participants (detecting participant targets) within various complex design workshop scenarios, significantly improving upon baseline models as shown in our validation (Section 3.3). Building upon this reliable foundation of participant detection and localization, we proceed in this study to analyze specific behavioral indicators related to joint attention exhibited during these workshop activities. The eight specific indicators utilized for this quantitative assessment, which are derived from the YOLO-TP output, are formally introduced and detailed in Section 4.2 and the Appendix. As an innovative endeavor, we apply this non-contact visual measurement technique to the study of design activities, leveraging deep learning to precisely capture participant presence and interaction patterns in diverse contexts. This provides a robust methodological basis for future analyses of design activities. To further demonstrate the application of these methods, the subsequent sections elaborate on experiments conducted using data from real-world design settings.
3.4. The significance of the passive measurement method for joint attention
Joint attention constitutes a significant area of research within the field of interactive scene design. The precise measurement of joint attention in the milieu of design creativity presents an enduring challenge. This paper proposes the employment of the YOLO-TP network as a primary method for the quantification of joint attention, achieved through the efficient identification of individuals within a scene. Moreover, this network forms the foundation for the eight quantifiable indicators detailed in the Appendix of this manuscript, thereby underscoring its considerable import.
In the present study, the examination of human activities within design scenes and the quantification of joint attention are executed through video and image data gleaned from visual sensors. This paper contrasts conventional methodologies dependent on scenario analysis and questionnaire statistics, adopting instead a non-contact measurement approach via video and image data. This method thereby addresses the inherent limitations of survey-based approaches in accurately quantifying the involvement of individuals.
Moreover, the YOLO personnel analysis network is trained on specialized datasets, particularly tailored for individuals involved in design scenarios. This network is especially adapted for the automated detection and statistical analysis of the number of participants during design workshops and exchanges. It has been optimized to quantify joint attention accurately in creative design scenarios.
In summation, the network and methodologies proposed within this paper facilitate the capturing of data from diverse perspectives, thereby enabling their application across a range of scenarios. This approach offers quantitative processing support for data collated from different contexts. It is recognized that the observation and quantification method presented herein is not the only approach to analyze joint attention in the design field. Nonetheless, it provides a potent means for quantification, thus enabling the transition from qualitative to quantitative analysis in design research.
4. Data and experiments
4.1. Data
The data presented in this study were collected from design studios within the NICE 2035 living labs at the College of Design and Innovation, Tongji University over a period of 24 months, between 2021 and 2023. A total of 296 students were sampled from 24 design workshops, with an average workshop size of 12.33 members. Of these participants, 46% were male and 54% were female, with an average age of 27.7 years. The sampled students had backgrounds in Design, Technology and Business, primarily from (Tongji University). All participants had normal communication and activity abilities and no language barriers or psychological issues, ensuring normal interaction among workshop participants. We installed video cameras in each workshop scenario to capture and record the participants’ design activities and processes. Figure 8 depicts an example of the camera fields of view after YOLO-TP processing.

Figure 8. Example of tagged information picture after YOLO-TP processing.
Table 3 shows the basic data of each design studio, from which we screened a total of 40 video clips to better measure joint attention in design workshops, each with a duration controlled at 10 minutes. This selection was guided primarily by the need to balance analytical depth against the significant computational cost of processing long video sequences with our deep learning pipeline. Furthermore, we prioritized clips that exhibited high levels of participant interaction relevant to co-creation and joint attention. The algorithm designed in this paper is used to compute eight key indicators, which then feed the subsequent principal component analysis.
Table 3. List of video data of the design workshop

4.2. Experimental processing
4.2.1. Image measurement
In the camera imaging system used in this article, there are four coordinate systems: the world coordinate system, the camera coordinate system, the image coordinate system and the pixel coordinate system. A rigorous mathematical relationship links these four coordinate systems, as shown in Figure 9. On this basis, the image measurement method in this study shows that pixel distance in the image is positively correlated with spatial distance.

Figure 9. Relationship between image coordinates and real coordinates.
The transformation relationship between the above four coordinate systems is:

$ Z\left(\begin{array}{c}u\\ {}v\\ {}1\end{array}\right)=\left(\begin{array}{ccc}\frac{1}{dX}& -\frac{\cot \theta }{dX}& {u}_0\\ {}0& \frac{1}{dY\sin \theta }& {v}_0\\ {}0& 0& 1\end{array}\right)\left(\begin{array}{cccc}f& 0& 0& 0\\ {}0& f& 0& 0\\ {}0& 0& 1& 0\end{array}\right)\left(\begin{array}{cc}R& T\\ {}0& 1\end{array}\right)\left(\begin{array}{c}U\\ {}V\\ {}W\\ {}1\end{array}\right) $

where $ \left(U,V,W\right) $ is the physical coordinate of a point in the world coordinate system, $ \left(u,v\right) $ is the corresponding pixel coordinate in the pixel coordinate system and $ Z $ is a scale factor. The matrix $ \left(\begin{array}{ccc}\frac{1}{dX}& -\frac{\cot \theta }{dX}& {u}_0\\ {}0& \frac{1}{dY\sin \theta }& {v}_0\\ {}0& 0& 1\end{array}\right) $ is an affine transformation matrix and $ \left(\begin{array}{cccc}f& 0& 0& 0\\ {}0& f& 0& 0\\ {}0& 0& 1& 0\end{array}\right) $ is a projection transformation matrix; together they constitute the camera’s internal parameter matrix. Here, $ f $ is the image distance; $ dX $ and $ dY $ are the physical lengths of one pixel in the $ X $ and $ Y $ directions on the camera’s photosensitive plate (i.e., how many millimeters one pixel occupies on the plate); $ \left({u}_0,{v}_0\right) $ are the pixel coordinates of the center of the photosensitive plate; and $ \theta $ is the angle between the horizontal and vertical edges of the plate (90° indicates no error). The matrix $ \left(\begin{array}{cc}R& T\\ {}0& 1\end{array}\right) $ is a rigid body transformation that constitutes the camera’s external parameter matrix, where $ R $ is the rotation matrix and $ T $ is the translation vector.
When we obtain an image and perform recognition, the distance between two detected parts is a certain number of pixels, but how many meters do these pixels correspond to in the real world? Answering this requires the camera calibration results to convert pixel coordinates to physical coordinates. For camera calibration, we use Zhang’s calibration method (Lu, Liu & Guo Reference Lu, Liu and Guo2016). The purpose of calibration is to obtain the camera’s internal and external parameter matrices, as shown in Formula (5), which convert distances on the image to actual distances. By the derivation above, the distance on the image and the actual distance are strictly positively correlated; therefore, in the subsequent analysis, we directly use relative pixel distances on the image to measure distance, as these values faithfully reflect distances in the real world.
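As a minimal illustration of this linearity, the following sketch converts a pixel separation to a physical separation under the pinhole model at a fixed depth; the camera parameters are hypothetical, and a full conversion would use the calibrated internal and external matrices:

```python
def pixel_to_physical(pixel_dist, f_mm, pixel_pitch_mm, depth_mm):
    """Pinhole-model conversion at a fixed depth Z (theta = 90 degrees,
    square pixels): a separation of n pixels corresponds to a real
    separation of n * dX * Z / f by similar triangles.  The mapping is
    linear in the pixel distance, which is the positive correlation
    exploited in our analysis."""
    return pixel_dist * pixel_pitch_mm * depth_mm / f_mm

# Hypothetical camera: f = 4 mm, pixel pitch dX = 0.002 mm, depth Z = 2 m.
d_mm = pixel_to_physical(100, f_mm=4.0, pixel_pitch_mm=0.002, depth_mm=2000.0)
```

Doubling the pixel distance doubles the physical distance at the same depth, so relative pixel distances preserve the ordering of real-world distances.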
4.2.2. Workshop scenario analysis
The first step in analyzing the videotapes consisted of separating peer interactions from the continuous co-creation process on the recording. The inclusion of episodes was contingent on two criteria: length and participation. During the creation of the measures, it became apparent that interactions lasting less than 60 s could not be consistently recorded. In addition, the theoretical concept of joint attention as focus on a shared activity required that contacts be sustained long enough for any shared activity to be identified. A study of the tapes confirmed that a minimum of 1 minute of contact was necessary to meet these criteria. The second criterion for inclusion was participant continuity: the interacting people must remain the only participants throughout the episode. A previously established episode was considered concluded if a new person joined. This approach reflects the notion that joint attention among interacting partners emerges from their particular social dynamic.
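The two inclusion criteria can be expressed as a simple filter. The sketch below uses a hypothetical episode representation (duration in seconds plus the set of participant IDs observed in each sampled frame) and is illustrative rather than our production tooling:

```python
def valid_episodes(episodes, min_duration_s=60):
    """Keep only episodes that (1) last at least 60 s and (2) show
    participant continuity, i.e., the same set of people in every
    sampled frame of the episode."""
    kept = []
    for duration_s, frame_participants in episodes:
        if duration_s < min_duration_s:
            continue  # too short to record joint attention consistently
        if any(ids != frame_participants[0] for ids in frame_participants[1:]):
            continue  # someone joined or left; the episode would have closed
        kept.append((duration_s, frame_participants))
    return kept
```

An episode in which a third person appears partway through is excluded, matching the rule that a new arrival concludes the established episode.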
To better analyze the pedestrian activities in the design workshop, we extracted the following eight quantifiable indicators from the video stream data collected by the visual sensor: number of people in the scene, number of activity tracks, number of people in key areas, time of appearance of people in key areas, frequency of eye contact, frequency of common facial expressions, mutual social distance and frequency of common attention. For each indicator, we designed and specified a dedicated algorithm. In summary, for the above eight indicators, this paper provides a statistical method based on computer vision and strives to derive the joint attention relationships in the design workshop from the data. The overall statistical flow chart of the above indicators is shown in Figure 10.

Figure 10. Key indicator statistics flow chart.
For each indicator, we designed a dedicated calculation procedure, which is included in the Appendix of this article. Additionally, during data analysis, two of the authors, acting as data analysis engineers, examined the 40 analysis video clips, reviewed the correctness of the algorithm’s output on the 8 indicators and evaluated the algorithm’s reliability on that basis, providing a credible data source for the subsequent principal component analysis (PCA), as shown in Table 4. We paid special attention to the indicators “eye contact frequency,” “common facial expression frequency” and “common attention frequency,” comparing the key event time points identified by the algorithm with the reviewers’ direct visual interpretation of the interactions in the corresponding video segments. These qualitative evaluations confirmed that the automatically generated indicators are consistent with the reviewers’ observations and with common patterns in these co-creation activities, providing confidence in the validity of the indicators. Finally, we believe that introducing senior experts for interpretation is a necessary step and one of the improvement points for our future research.
Table 4. Reliability evaluation table for 8 indicators (compared with manual interpretation)

As shown in Table 4, each column represents different observation indicators, and each row represents different scenarios. The values in the table represent the accuracy of the program’s calculated results and manual interpretation results. The closer the value is to 1, the closer the accuracy of the program’s inferred indicator value is to manual interpretation. The specific results can be found in the Appendix of this article.
5. Results and analysis
5.1. Results related to the three dimensions of joint attention
To reduce the dimensionality of the behavioral measures derived from the video analysis and identify underlying latent constructs of joint attention, we performed principal component analysis (PCA). The analysis was conducted on the dataset comprising the calculated values for the eight quantitative indicators (number of people in the scene, number of activity tracks, number of people in key areas, time of appearance of people in key areas, frequency of eye contact, frequency of common facial expressions, mutual social distance and frequency of common attention) obtained from the 40 video clips, as detailed in Section 4.2 and the Appendix. PCA is a multivariate statistical method that linearly transforms multiple variables to reduce their number. In this study, we used PCA to identify comprehensive indicators for joint attention measurement. The KMO test value of 0.646 indicates that the indicators share sufficient common variance for factor extraction. The significance of Bartlett’s test of sphericity is less than 0.01 (p < 0.001), indicating that the indicators are sufficiently correlated and that the data are suitable for principal component analysis. Three principal components were extracted from the PCA, each with an eigenvalue greater than 1. The factor loadings of the generalized principal components (shown in Table 5) indicate that the frequency of eye contact, common facial expressions and common attention have the highest loadings on the first principal component, which we define as the empathic sharing dimension (PCA 1). Mutual social distance, the number of people in the scene and the number of activity tracks have the highest loadings on the second principal component, which we define as the social context dimension (PCA 2). The number of people in key areas and the time of appearance of people in key areas have the highest loadings on the third principal component, which we define as the key area dimension (PCA 3).
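The mechanics of PCA can be illustrated in the two-variable case, where the eigenvalues of the correlation matrix have a closed form. The sketch below is didactic only and is not a reproduction of our eight-indicator analysis:

```python
import math

def pca_2d_variance_shares(xs, ys):
    """For two standardized variables, the correlation matrix [[1, r], [r, 1]]
    has eigenvalues 1 + |r| and 1 - |r|, so the first principal component
    explains (1 + |r|) / 2 of the total variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)
    return (1 + abs(r)) / 2, (1 - abs(r)) / 2
```

In the eight-indicator case the same logic applies to the 8×8 correlation matrix, whose three largest eigenvalues (each greater than 1) correspond to the three extracted dimensions.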
Table 5. Factor loadings of the generalized principal component

The empathic sharing dimension is the key dimension of joint attention, comprising the frequency of eye contact, common facial expressions and common attention. Previous research has shown that empathic sharing is associated with higher levels of social connection and cooperation between participants in a design process (Swan & Riley Reference Swan and Riley2015). Designers can use this dimension to understand participants’ emotional states and needs and to facilitate interactions between them, leading to better design outcomes. The social context dimension is another important dimension of joint attention, comprising mutual social distance, the number of people in the scene and the number of activity tracks. These factors influence the level of social interaction and attention among participants, which can affect design outcomes. For example, a higher number of people in the scene may increase the complexity of the design process and require more careful coordination among participants. Thus, designers should consider the social context dimension when designing collaborative environments and activities. The key area dimension, which includes the number of people in key areas and the time of appearance of people in key areas, is also crucial to joint attention. Key areas are the places that participants focus on during the design process. Designers can use this dimension to identify the most important areas of focus for participants and to facilitate collaboration and interaction in these areas.
5.2. Results related to the items of joint attention
The weightings assigned to each of the eight indicators in the PCA (shown in Table 6) reflect their respective contributions to the variability within the dataset. Specifically, each weight represents the share of the total variance across the original eight indicators that can be attributed to that indicator through the retained principal components. In this case, the weights for the eight indicators range from 11.47% to 14.22%, indicating that every indicator contributes to the variability within the dataset, albeit to varying degrees. For example, the indicator “Mutual social distance” has the highest weighting at 14.22%, indicating that it is the most influential factor in shaping the underlying patterns of joint attention in the dataset. Conversely, the indicators “Number of people in the scene” and “Number of activity tracks” have the lowest weightings at 11.47%, suggesting that they contribute the least to the overall variability of the dataset. These weightings provide a useful understanding of the relative importance of each indicator in shaping joint attention, which can inform the design of collaborative environments and activities.
Table 6. Linear combination coefficients and weights

5.3. The Relevance for design researchers
In design research, the complex interactions between joint attention dimensions and their respective weightings present significant implications for co-creation processes. This discussion explores the consequences of each dimension and the weightings of indicators, grounded in academic literature and analytical reflection, while clarifying how our findings contribute to and extend the broader discourse on co-creation rather than merely collaboration.
5.3.1. Empathic sharing dimension: a nexus of emotional resonance in co-creation
The empathic sharing dimension, encapsulating the frequency of eye contact, common facial expressions and common attention, serves as a linchpin in the co-creation process. This dimension directly addresses the intersubjective aspects of co-creation, which have been identified as critical for meaningful collaborative design. Co-creation, as conceptualized by Sanders & Stappers (Reference Sanders and Stappers2008), goes beyond mere collaboration to encompass “collective creativity applied across the whole span of a design process,” with joint attention being a fundamental mechanism through which this collective creativity manifests. Our findings extend the current understanding of co-creation by demonstrating that joint attention, specifically through empathic sharing, quantifiably contributes to more effective co-creation outcomes. Cash & Maier (Reference Cash and Maier2016) demonstrated in their Design Science research that gestural communication plays a crucial role in establishing shared understanding during collaborative design activities. Our work builds upon theirs by providing a computational framework for measuring these previously qualitative aspects of co-design interaction.
The computational measurement of empathic sharing provides design researchers with unprecedented insights into the quality of co-creation processes. Rather than relying on subjective assessments or post-hoc analyses, our framework allows for real-time evaluation and enhancement of co-creation activities. As Calvo, Sclater & Smith (Reference Calvo, Sclater and Smith2021) note, “achieving collaboration through co-design is challenging as people need to understand each other, and develop trust and rapport.” Our measurement framework specifically addresses this challenge by providing objective metrics for these hard-to-quantify aspects of intersubjective engagement. Furthermore, our research contributes to resolving what Trischler et al. (Reference Trischler, Pervan, Kelly and Scott2018) identify as a key challenge in co-design: balancing diverse participant contributions while maintaining cohesive progress toward shared goals. The empathic sharing dimension provides a metric for assessing this balance, offering design researchers a tool for orchestrating more effective co-creation sessions where collaborative empathy can be fostered and measured.
5.3.2. Social context and key area dimensions: the infrastructure of co-creation
The social context dimension, encompassing mutual social distance, the number of people present, and activity tracks, reveals the spatial and social infrastructure of co-creation environments. This dimension builds upon Menold, Jablokow & Simpson’s (Reference Menold, Jablokow and Simpson2017) “Prototype for X (PFX)” framework by extending its principles to the social dynamics of co-creation, demonstrating how physical arrangements and movement patterns directly influence the quality of collaborative design. Our research contributes to the co-creation literature by establishing that physical proximity and movement patterns significantly impact the quality of co-creative design. Cash, Dekoninck & Ahmed-Kristensen (Reference Cash, Dekoninck and Ahmed-Kristensen2017) previously identified in their Design Science research that the spatial arrangement of design teams influences communication patterns, but our study extends this understanding by providing a measurable framework for optimizing spatial configurations in co-creation settings. The implications of our findings on the Social Context Dimension align with what Protzen & Harris (Reference Protzen and Harris2010) term the “ecology of design spaces,” wherein the physical environment serves not merely as a backdrop but as an active agent in shaping design discourse. By quantifying the impact of spatial arrangements on joint attention, our research provides design facilitators with evidence-based guidelines for configuring co-creation environments to maximize collaborative potential. Furthermore, our work contributes to what Carlile (Reference Carlile2002) identifies as the challenge of “knowledge boundaries” in cross-functional team interactions. By measuring how spatial proximity influences joint attention across participants from diverse backgrounds, our framework offers insights into how physical space can be leveraged to overcome disciplinary boundaries that often hinder effective co-creation.
The key area dimension, highlighting the number of individuals in pivotal areas and their temporal presence, offers critical insights into attention allocation during co-creation processes. This dimension builds upon and extends Cash et al.’s (Reference Cash, Hicks, Culley and Salustri2021) research on the role of prototypes in facilitating shared understanding, suggesting that key areas—whether physical spaces or conceptual domains—serve as “boundary objects” that facilitate cross-disciplinary exchange in co-design activities. Our research advances the co-creation discourse by quantifying how attention to specific physical or conceptual spaces correlates with collaborative outcomes. Erichsen et al. (Reference Erichsen, Sjöman, Steinert and Welo2021) in their Design Science research explored how physical prototypes capture design knowledge, but our findings provide a measurable framework for identifying and leveraging these focal points in co-creation settings. The key area dimension intersects with what Kleinsmann et al. (Reference Kleinsmann, Valkenburg and Sluijs2017) term “collaborative design loops,” wherein participants iterate between individual exploration and collective synthesis. Our measurement framework provides a means to assess the effectiveness of these loops by tracking how participants converge around and engage with key areas during the co-creation process. The practical implications of this dimension extend to what Sanders & Stappers (Reference Sanders and Stappers2014) describe as the “front end” of co-design, where the problem space is still being explored and defined. By identifying which key areas attract joint attention during early co-creation phases, facilitators can more effectively structure subsequent activities to build upon emergent shared understanding.
5.3.3. Weightings of indicators: the quantifiable metrics of co-creation
The weightings, ranging from 11.47% to 14.22%, offer a nuanced, data-driven understanding of each indicator’s contribution to co-creation effectiveness. “Mutual social distance” emerges as a dominant force (14.22%), which extends research on proximity in design collaboration by quantifying its precise contribution to co-creation outcomes. Our research significantly contributes to the co-creation literature by providing a weighted framework that allows design researchers to prioritize specific aspects of co-creation environments based on their measurable impact. Previously, as noted by Dorst & Cross (Reference Dorst and Cross2001), such prioritization was largely intuitive or based on qualitative assessments. Our findings transform this approach by offering a quantitative basis for decision-making in co-creation facilitation. The weighted indicator framework connects directly to what Bjögvinsson, Ehn & Hillgren (Reference Bjögvinsson, Ehn and Hillgren2012) describe as “infrastructuring” in co-design—the process of establishing conditions that enable productive participation. Our research provides empirical evidence for which aspects of this infrastructuring most significantly impact the quality of joint attention and, by extension, co-creation outcomes. In synthesizing these reflections, we advance the understanding of co-creation by moving beyond treating it as merely a collaborative activity to recognizing it as a complex, measurable phenomenon with specific dimensions that can be optimized. Co-creation is not just a designerly collaboration to involve people but a formal research practice with a general model that produces new academic knowledge. Our weighted framework provides the empirical foundation for such a formal model of co-creation effectiveness.
6. Conclusion
6.1. Practical applications for co-creation process optimization
The joint attention measurement framework developed in this study offers several concrete pathways for optimizing co-creation processes in real-world design settings. First, real-time monitoring capabilities enable facilitators to identify when joint attention is declining during co-creation sessions and implement targeted interventions to re-engage participants (Trischler et al. Reference Trischler, Pervan, Kelly and Scott2018). For example, when the Empathic Sharing Dimension indicators show decreased eye contact frequency and common facial expressions, facilitators can introduce structured interaction activities or modify the physical arrangement to enhance face-to-face engagement (Cash & Maier Reference Cash and Maier2016).
Second, the weighted indicator framework provides evidence-based guidance for designing optimal co-creation environments. Given that mutual social distance emerged as the most influential factor (14.22% weighting), design practitioners can prioritize spatial configurations that promote appropriate proximity levels (Cash et al. Reference Cash, Dekoninck and Ahmed-Kristensen2017). This might involve adjusting table arrangements, seating configurations, or workspace layouts to facilitate the social distances that correlate with enhanced joint attention. Similarly, the importance of the key area dimension suggests that co-creation spaces should include clearly defined focal points that naturally draw participant attention and provide shared reference points for collaborative work (Menold et al. Reference Menold, Jablokow and Simpson2017).
Third, the three-dimensional framework enables diagnostic assessment of co-creation sessions, allowing researchers and practitioners to identify specific areas for improvement (Sanders & Stappers Reference Sanders and Stappers2014). Sessions scoring low on the social context dimension might benefit from interventions targeting group size optimization or movement pattern enhancement, while sessions with poor key area dimension scores might require better definition of focal work areas or improved tool accessibility (Christensen & Ball Reference Christensen and Ball2016). This diagnostic capability transforms co-creation facilitation from an intuitive practice to an evidence-based discipline (Calvo et al. Reference Calvo, Sclater and Smith2021).
Fourth, the framework supports the development of adaptive co-creation protocols that respond to real-time joint attention measurements. Advanced implementations could incorporate automated feedback systems that adjust lighting, spatial arrangements, or activity structures based on ongoing joint attention assessments, creating responsive environments that continuously optimize collaborative conditions (Oertzen et al. Reference Oertzen, Odekerken-Schröder, Brax and Mager2018). This approach aligns with emerging trends in design research toward more responsive and data-driven facilitation methods (Bjögvinsson et al. Reference Bjögvinsson, Ehn and Hillgren2012).
6.2. Key findings and theoretical implications
In this study, we endeavored to elucidate a machine learning-oriented paradigm for ascertaining joint attention within the co-creation milieu. Our research directly addresses the growing need within design research for objective, quantifiable measures of co-creation effectiveness, moving beyond the traditional reliance on subjective assessments that has limited the field’s advancement (Kleinsmann et al. Reference Kleinsmann, Valkenburg and Sluijs2017). Leveraging the precision of computer vision, we have developed a novel methodological approach that not only captures joint attention data from multifaceted co-creation scenarios but also circumvents potential pitfalls inherent in traditional sampling surveys. This methodological contribution responds directly to calls within the co-creation literature for more robust measurement frameworks (Sanders & Stappers Reference Sanders and Stappers2014).
The empirical revelations from our study proffer three significant contributions to the co-creation discourse. First, our Principal Component Analysis has identified and quantified three cardinal dimensions of joint attention in co-creation – empathic sharing, social context and key area – providing a structured framework for understanding the previously amorphous concept of intersubjectivity in collaborative design. This framework extends beyond mere collaboration to address the distinctive characteristics of co-creation as a specific form of collaborative activity where shared attention and intersubjectivity are paramount. Second, we have demonstrated that joint attention, as a specific manifestation of intersubjectivity, can be objectively measured and correlated with co-creation effectiveness. This finding bridges the gap between the experiential aspects of design collaboration and measurable outcomes, offering a quantitative basis for evaluating and enhancing co-creation processes. As Nguyen & Mougenot (Reference Nguyen and Mougenot2022) note in their systematic review of empirical studies on multidisciplinary design collaboration, shared understanding is consistently identified as crucial across diverse collaborative contexts, yet methodological approaches for measuring it remain inconsistent. Our computational framework for measuring joint attention provides a standardized approach to evaluating this critical aspect of co-creation. Third, our weighted indicator framework provides design researchers and practitioners with a practical tool for optimizing co-creation environments and activities. As Cash & Maier (Reference Cash and Maier2016) demonstrated in their Design Science research on gestural communication in design, non-verbal interactions significantly impact shared understanding development. Our framework extends this line of inquiry by providing quantitative metrics for measuring these interactions and their effects on joint attention in co-creation settings.
Central to our discourse is the pivotal role of refined deep learning frameworks in objectively measuring design processes. Our approach aligns with recent advances in applying computational methods to design research, such as Kent et al.’s (Reference Kent, Gopsill, Giunta, Goudswaard, Snider and Hicks2022) network analysis approach to prototyping and Erichsen et al.’s (Reference Erichsen, Sjöman, Steinert and Welo2021) digital capture of physical prototypes. These approaches collectively represent a paradigm shift toward more objective, data-driven assessment of design activities. Our research clarifies the distinction between general collaboration and co-creation by focusing specifically on the intersubjective dimensions that make co-creation a unique form of collaborative activity. While collaboration broadly encompasses coordinated effort toward shared goals, co-creation, as defined by Sanders & Stappers (Reference Sanders and Stappers2008) and expanded by Oertzen et al. (Reference Oertzen, Odekerken-Schröder, Brax and Mager2018), distinctively involves “joint value creation among multiple actors through resource integration.” Our quantitative framework for measuring joint attention provides a means to distinguish co-creation from other collaborative activities by quantifying the degree to which participants achieve shared focus, understanding and engagement – the hallmarks of true co-creation. This distinction is particularly important in light of Castañer & Oliveira’s (Reference Castañer and Oliveira2020) systematic review clarifying the differences between collaboration, coordination and cooperation in organizational contexts. While these terms are often used interchangeably, co-creation represents a specific form of collaborative activity characterized by mutual focus, shared understanding and joint value creation, and our measurement framework provides empirical support for this distinction.
However, the current study focused on developing and validating the methodology for measuring joint attention components. Consequently, we did not investigate the direct correlation between these automatically extracted measures (either the eight indicators or the three PCA dimensions) and specific outcomes of the co-design process, such as participant experience ratings or the quality/quantity of design outputs. Exploring these relationships is a critical avenue for future research to determine the practical utility and predictive validity of our joint attention metrics for assessing and potentially enhancing co-creation effectiveness. Future studies should aim to collect both the automated behavioral data and corresponding process/outcome measures.
6.3. Limitations and future research directions
Looking beyond the current horizon, we are poised to integrate speech emotion recognition with computer vision, aligning with calls for multimodal approaches to design research. This future direction aims to unravel the nuanced behavioral attributes of designers in co-creation settings via a comprehensive analytical lens. The integration of multiple data streams will enable a more holistic assessment of co-creation dynamics, potentially revealing interaction patterns that remain invisible when examined through a single modality. However, a critical introspection warrants the acknowledgment of certain limitations. At the outset, our focus on human-centric parameters to gauge joint attention in intersubjectivity provides but a glimpse into the vast expanse of co-creation analytics. As Christensen & Ball (Reference Christensen and Ball2016) noted in their Design Studies research on creative analogy use in heterogeneous design teams, the cognitive aspects of design collaboration extend beyond observable behaviors. Future research should explore how joint attention correlates with cognitive processes underlying co-creation, potentially through mixed-methods approaches combining our computational framework with qualitative assessment of participants’ thought processes.
Furthermore, the inherent reliance on visual sensor deployment, especially in expansive co-creation environments, amplifies both fiscal and computational burdens. This limitation echoes concerns about the scalability of technological approaches to design research. As Hansen & Özkil (Reference Hansen and Özkil2020) demonstrated in their longitudinal case study of prototyping strategies, design processes unfold across multiple spaces and timeframes, presenting challenges for comprehensive data capture. Future research should explore more efficient sensor systems and sampling strategies to reduce the resource intensiveness of our approach while maintaining measurement validity. A pivotal caveat lies in the modest sample size, underscoring the preliminary nature of our findings and beckoning corroborative studies with augmented sample sizes. While our study demonstrates the feasibility and potential value of computational approaches to measuring joint attention, broader deployment across diverse co-creation contexts is needed to establish the generalizability of our findings. As Erichsen et al. (Reference Erichsen, Sjöman, Steinert and Welo2021) note in their work on prototyping data capture, design research methods must balance specificity with generalizability to maximize their value to the field.
In conclusion, our research advances the co-creation discourse by providing a quantitative framework for measuring and enhancing the intersubjective dimensions that distinguish co-creation from general collaboration. By objectively measuring joint attention, we offer design researchers a powerful tool for understanding and optimizing co-creation processes, addressing a significant gap in the current literature. As co-creation continues to gain prominence across diverse domains – from product development to service design to policy formulation – our framework provides a foundation for evidence-based enhancement of co-creation processes across these contexts. This contribution represents not merely an incremental advancement in co-creation methodology but a fundamental shift toward more objective, data-driven approaches to understanding and facilitating this complex form of collaborative design activity.
Acknowledgments
We would like to thank NICE2035 for providing the experimental facilities essential for this study. We also appreciate the support and feedback from colleagues and participants, which greatly contributed to the success of this work. The data used in the paper can be obtained by contacting the corresponding author or via the following link (https://www.wjx.cn/vm/PQfR5gm.aspx#). Because the data involve personal images of experimental participants, the researchers reserve the right to review the qualifications of data requesters during the data-sharing process.
Financial support
This research was funded by the Chinese Ministry of Education Humanities and Social Sciences Research Youth Fund Project [23YJC760101] and the MOE (China) Research Innovation Team on “Design-Driven High-Quality Urban Development” [20242717].
Competing interest
The authors declare none.
A. Appendix
The specific measurement methods for the eight indicators mentioned in this article are detailed below. For each indicator, we provide the calculation process, supporting the digital analysis of information collected from visual sensors and the shift of design methods from qualitative to quantitative analysis.
(1) Number of people in the scene: This indicator is computed with a target-counting algorithm based on the YOLO-TP deep learning network proposed in this paper. After pedestrian targets are accurately recognized by the network, the number of targets is counted over time. The algorithm flow is shown in Table A1.
Table A1. Statistical algorithm flow of the number of people in the scene

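As a minimal sketch of the counting step (the YOLO-TP detector itself is assumed here and not reproduced; the data layout and frame rate are illustrative), per-frame detections can be aggregated into per-second head counts:

```python
from collections import defaultdict


def people_count_over_time(detections, fps=25):
    """Aggregate per-frame person detections into per-second head counts.

    detections: iterable of (frame_index, boxes) pairs, where boxes is the
    list of person bounding boxes returned by the detector for that frame.
    Returns {second: maximum simultaneous count observed in that second}.
    """
    per_second = defaultdict(int)
    for frame_idx, boxes in detections:
        sec = frame_idx // fps
        per_second[sec] = max(per_second[sec], len(boxes))
    return dict(per_second)
```

Summary statistics over seconds (mean, peak) can then feed directly into the indicator framework.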
(2) Number of activity tracks: This indicator is closely related to circulation in the design workshop space and the activities of participants. Its quantitative analysis mainly counts the number of trace lines within a specified time series. The key challenge is accurately associating the dynamic trajectory of the same pedestrian target across frames of the video stream; for this, we use the DeepSORT algorithm with a dynamic Kalman filter (Zhao, Zhang & Fu Reference Zhao, Zhang and Fu2020). The statistical algorithm for this trajectory is shown in Table A2.
Table A2. Statistical algorithm flow of the number of activity tracks

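The track-counting step can be sketched as follows, assuming an ID-preserving tracker such as DeepSORT has already associated detections into track identities (the tracker itself, and the observation layout, are assumptions of this sketch):

```python
from collections import defaultdict


def build_trajectories(observations):
    """Group per-frame tracker output into one polyline per pedestrian.

    observations: iterable of (timestamp, track_id, centroid) tuples, as
    produced by an ID-preserving multi-object tracker (assumed here).
    """
    trajectories = defaultdict(list)
    for ts, track_id, centroid in sorted(observations):
        trajectories[track_id].append(centroid)
    return dict(trajectories)


def count_activity_tracks(observations, t_start, t_end):
    """Number of distinct tracks observed in the window [t_start, t_end)."""
    return len({tid for ts, tid, _ in observations if t_start <= ts < t_end})
```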
(3) Number of people in key areas: This indicator counts the effective collisions between bounding-box labels and the key areas. The key areas are defined according to the situation; for example, in a kitchen scene, we usually set the key area around the stove to count the number of people cooking. The steps and principles of the specific statistical algorithm are shown in Table A3.
Table A3. Statistical algorithm flow of the number of people in key areas

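A minimal sketch of the collision test follows, under the assumption that a “collision” is a bounding-box center falling inside the key-area rectangle (the paper’s exact collision criterion may differ):

```python
def in_key_area(box, area):
    """True when the bounding box's center falls inside the key area.

    box and area are (x1, y1, x2, y2) rectangles in pixel coordinates.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    ax1, ay1, ax2, ay2 = area
    return ax1 <= cx <= ax2 and ay1 <= cy <= ay2


def people_in_key_area(boxes, area):
    """Count effective collisions between person boxes and the key area."""
    return sum(1 for box in boxes if in_key_area(box, area))
```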
(4) Time of appearance of people in key areas: Here, following indicator 3, we measure the duration of effective collisions in the key areas and accumulate it. This indicator reflects the sustained attractiveness of key areas to participants in the design workshop. The statistical algorithm flow for this indicator is shown in Table A4.
Table A4. Statistical algorithm flow of time of appearance of people in key areas

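Given the per-frame collision counts from indicator 3, the accumulation step reduces to summing person-frames and converting to seconds (a sketch; the frame rate is illustrative):

```python
def key_area_dwell_time(per_frame_counts, fps=25):
    """Accumulate total person-time (in seconds) spent in a key area.

    per_frame_counts: number of people colliding with the key area in
    each frame (e.g. the output of the indicator-3 counting step). Each
    person present in a frame contributes one frame-duration of dwell.
    """
    return sum(per_frame_counts) / fps
```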
(5) Frequency of eye contact: This indicator counts the frequency of eye contact between two people in communication. Eye contact is evaluated mainly by the duration of their overlapping eye gaze; when this duration is greater than 3 seconds, we consider the two people to be making eye contact. The statistical algorithm flow for this indicator is shown in Table A5.
Table A5. Statistical algorithm flow of frequency of eye contact

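The 3-second thresholding can be sketched as run-length counting over per-frame mutual-gaze flags; the gaze estimator that produces those flags is assumed here, as is the frame rate:

```python
def count_eye_contacts(mutual_gaze, fps=25, min_seconds=3.0):
    """Count eye-contact events between one pair of people.

    mutual_gaze: per-frame booleans, True when a gaze estimator (assumed)
    reports the pair looking at each other. A continuous run of mutual
    gaze longer than min_seconds counts as a single eye-contact event.
    """
    events, run = 0, 0
    for looking in list(mutual_gaze) + [False]:  # sentinel closes last run
        if looking:
            run += 1
        else:
            if run / fps > min_seconds:
                events += 1
            run = 0
    return events
```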
(6) Frequency of common facial expressions: Common facial expressions describe the interaction between two people in the design workshop space. In general, if people show the same facial expressions in the design workshop, a typical emotional interaction between them can be inferred. We count the number of changes in the common facial expressions, and these data allow us to observe the activity level of the workshop. The specific statistical algorithm is shown in Table A6.
Table A6. Statistical algorithm flow of the frequency of common facial expressions

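One way to sketch the change-counting step, assuming a facial-expression classifier provides per-frame labels for each person (the classifier and label set are assumptions of this sketch):

```python
def common_expression_changes(expr_a, expr_b):
    """Count changes in the common facial expression of two people.

    expr_a, expr_b: per-frame expression labels from a facial-expression
    classifier (assumed). Frames where both labels agree define the
    current common expression; each switch to a different common
    expression is counted as one change.
    """
    changes, current = 0, None
    for a, b in zip(expr_a, expr_b):
        if a == b:
            if current is not None and a != current:
                changes += 1
            current = a
    return changes
```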
(7) Mutual social distance: This indicator mainly reflects the interaction distance between participants in the design workshop. In the actual statistical process, we record the average relative pixel distance: in each detected frame, we connect each pedestrian target to its nearest neighbour using the center of gravity of the bounding box, sum the pixel distances of all such pairs, and divide by the total number of people in the scene to obtain the frame’s average distance; we then average these per-frame values over all frames. The magnitude of this indicator reflects the degree of intimacy between participants in the design workshop. The specific statistical algorithm is shown in Table A7.
Table A7. Statistical algorithm flow of mutual social distance

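The nearest-neighbour averaging described above can be sketched directly from bounding-box centroids (pixel coordinates; frames with fewer than two people are skipped, a choice this sketch assumes rather than the paper specifying):

```python
import math


def mutual_social_distance(frames):
    """Average nearest-neighbour pixel distance over a video clip.

    frames: list of per-frame lists of bounding-box centroids (x, y).
    Per frame, each person is linked to their nearest neighbour; those
    distances are summed and divided by the number of people to give a
    per-frame average, and the averages are then averaged over frames.
    """
    per_frame = []
    for centroids in frames:
        if len(centroids) < 2:
            continue  # distance is undefined with fewer than two people
        total = sum(
            min(math.dist(p, q) for j, q in enumerate(centroids) if j != i)
            for i, p in enumerate(centroids)
        )
        per_frame.append(total / len(centroids))
    return sum(per_frame) / len(per_frame) if per_frame else 0.0
```

Converting pixel distances to physical distances would additionally require camera calibration, which this sketch leaves out.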
(8) Frequency of common attention: Although different design workshops have different themes and environmental scenes, the theme of each workshop is relatively fixed, and so are the critical objects in the scene. It is therefore important to measure whether participants in the design workshop attend to the same objects in the scene. Here, we count the number of times two people pay attention to the same thing. The specific statistical algorithm is shown in Table A8.
Table A8. Statistical algorithm flow of the frequency of common attention

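Assuming a gaze-to-object mapping provides, per frame, the scene object each person is fixating (that mapping is an assumption of this sketch), episodes of common attention can be counted as new stretches of agreement:

```python
def common_attention_count(targets_a, targets_b):
    """Count joint-attention episodes for one pair of people.

    targets_a, targets_b: per-frame identifiers of the scene object each
    person's gaze falls on, with None when no object is fixated (the
    gaze-to-object mapping is assumed). Each new stretch of frames in
    which both people attend to the same object counts as one episode.
    """
    episodes, prev_joint = 0, False
    for a, b in zip(targets_a, targets_b):
        joint = a is not None and a == b
        if joint and not prev_joint:
            episodes += 1
        prev_joint = joint
    return episodes
```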
To verify the reliability of the eight statistics obtained by our method, we selected 40 scene segments for algorithm metric calculation. We compared the data obtained by our method with the manually interpreted indicators; the detailed comparison is shown in Table A9.
Table A9. Comparison between algorithm calculation of 8 indicators and manual interpretation data

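Such a comparison can be summarized with standard agreement statistics; the sketch below computes mean absolute error and Pearson correlation between algorithmic and manually interpreted values for the same segments. It is illustrative only; the actual comparison figures are those reported in Table A9.

```python
def agreement_metrics(algo_values, manual_values):
    """Mean absolute error and Pearson correlation between algorithmic
    and manually interpreted indicator values for the same segments.
    """
    n = len(algo_values)
    mae = sum(abs(a - m) for a, m in zip(algo_values, manual_values)) / n
    mean_a = sum(algo_values) / n
    mean_m = sum(manual_values) / n
    cov = sum((a - mean_a) * (m - mean_m)
              for a, m in zip(algo_values, manual_values))
    var_a = sum((a - mean_a) ** 2 for a in algo_values)
    var_m = sum((m - mean_m) ** 2 for m in manual_values)
    r = cov / (var_a * var_m) ** 0.5
    return mae, r
```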