1. Introduction
Space significantly influences key human activities such as collaboration, learning, and creative thinking by shaping the activity process and human behavior (Thoring et al., 2019; Mehta & Zhu, 2009). For instance, workspaces outfitted with advanced information and communication technologies can enhance social interaction and cognitive functions, thereby fostering collaboration and new knowledge creation (Peschl & Fundneider, 2014). Similarly, spatial features that encourage physical activity and relaxation, such as flexible furniture and lounge areas, can stimulate creative thinking and support reflection and problem-solving (Meinel et al., 2017).
To investigate these spatial influences, researchers commonly implement video-based behavioral analysis, which involves recording entire activity sessions and systematically annotating observed behaviors (Cash et al., 2015). However, this manual method is time-consuming and tedious, as it often requires repeatedly reviewing large volumes of video data (Brudy et al., 2018b). Moreover, relying solely on the researcher's observation can introduce subjectivity and potential inaccuracy.
Advances in computer vision, particularly vision-based artificial intelligence (AI), offer promising solutions to these challenges. Recent progress in spatio-temporal feature representation and action classification shows that vision-based AI models can automatically detect, track, and segment human subjects and objects in a given space, providing both visualized outputs and quantitative metrics of observed behaviors (Abou Elassad et al., 2020; Jaouedi et al., 2022). Moreover, these models show promising capabilities in recognizing human posture, actions, and facial expressions in videos (Al-Faris et al., 2020). With improvements in algorithmic performance and expansions of behavior datasets, vision-based AI demonstrates great potential to support human behavior analysis and understanding in design research.
Despite these advancements, challenges remain when applying vision-based AI in the design domain. Specifically, there is a significant need for domain-specific datasets that reflect the specific behaviors relevant to design practice, and for a closer alignment of AI's current capabilities with the particular analytic tasks required in design studies. To better understand and leverage AI capabilities in design research, our work seeks to investigate the following research questions:
1. How can we bridge specific behavior analysis tasks in design research and current AI capabilities for behavior analysis?
2. Which AI models are currently available and well suited for video-based behavior analysis tasks common in design research?
3. Which spatial behaviors commonly occur and are observed in design research contexts?
4. How can we apply these vision-based AI models to analyze behaviors in real physical spaces in design research, and what are the advantages and limitations of these models?
The contributions of this work are four-fold. First, we propose a framework for utilizing vision-based AI models for spatial behavior analysis tasks in design research. Second, we compile and evaluate a set of suitable vision-based AI models—considering their usability, capabilities, and applications—to guide researchers in the effective selection and employment of AI tools. Third, we identify and categorize relevant spatial behaviors observed in workspaces into four distinct groups, drawing on insights from previous behavior studies and our own design research. This categorization helps pinpoint specific research tasks where AI can enhance behavior analysis. Finally, we apply the proposed framework and selected AI models to video data collected in design research settings. Through these applications, we assess the models' performance, discussing both their advantages and limitations, and ultimately offer practical experiences and new insights for researchers interested in integrating AI into design research.
2. Related Work
2.1. Video-based human behavior analysis in design research
Video-based human behavior analysis involves recording activities on video and then observing, annotating, and analyzing behavior for deeper insight (Jordan & Henderson, 1995; Aggarwal & Ryoo, 2011). In design research, this approach has been used to study collaboration, creativity, and social interaction. For instance, Brudy et al. (2018a) examined how shared screens influence team decision-making and sensemaking via video-based open coding. Similarly, Cash et al. (2015) leveraged multi-perspective video recordings to identify complex behavior patterns in product design, organizational processes, and management tasks, enabling multi-level behavior analysis. Jakobsen & Hornbæk (2014) used video coding to assess communication frequency, attention, and spatial preferences, indicating team interaction dynamics.
However, manual video annotation is both time-consuming and labor-intensive (Nebeling et al., 2015). To mitigate this, visualization tools such as VisTACO (Tang et al., 2010), EagleView (Brudy et al., 2018b), and MIRIA (Büschel et al., 2021) provide insights into spatial behaviors (e.g., distance, orientation, movement) and offer data visualizations such as scatterplots, heatmaps, and 3D trails. Yet these tools still require substantial manual effort in observation and annotation, underscoring the need for more automated and efficient analysis methods.
2.2. State of the art: vision-based AI for human behavior analysis
With the development of AI in computer vision, vision-based AI models have been widely researched to detect and recognize human behaviors from video data (Li & Zhu, 2024; Pareek & Thakkar, 2021; Jaouedi et al., 2022). In education settings, researchers can detect students' attention states in different classrooms by assessing their facial expressions, hand gestures, and body postures using AI models (Ashwin & Guddeti, 2020). In healthcare, vision-based AI is applied for behavior monitoring and posture correction (Sharma et al., 2022). In workspace scenarios, large vision models can support the detection of overexertion behavior in the office (Marfia & Roccetti, 2017) and workspace occupancy monitoring (Zou et al., 2017), facilitating human well-being and efficient space usage in the office. AI-based computer vision methods have also been applied in human-computer interaction to develop user-friendly interfaces and operating systems (Sharma et al., 2023). These advancements highlight the potential of vision-based AI to streamline and enhance behavior analysis in diverse spatial and contextual settings, including design research.
3. Methodology
The previous studies highlight the need for behavior analysis methods requiring less manual effort and the potential of AI in this field. Building on these insights, our work investigates how to leverage AI capabilities for behavior analysis tasks in design research; to this end, we conducted a workspace study and applied vision-based AI models to the resulting video data.
Data Set. We designed three rooms: the activating room, the relaxing room, and the neutral room (as a control condition), as shown in Fig. 1. Participants worked individually for 30 minutes in each setting. Each session was recorded from a top-down camera view, resulting in approximately 45 hours of video data. We used data from 10 participants: eight for defining workspace behaviors through qualitative coding and two for testing AI models.

Figure 1. Room setups
AI Model Research. We searched for open-source, actively maintained vision-based AI models capable of motion tracking, human detection, object recognition, facial expression analysis, and posture/action analysis. Models were identified through literature reviews, GitHub, Hugging Face, and relevant model hubs, and then filtered by reported accuracy above 75% (Wu et al., 2023) and by core functionalities.
Qualitative Video Coding. Based on the method proposed by Saldaña (2021), we coded eight hours of video from eight participants in ATLAS.ti to identify workspace behaviors. We annotated behaviors with behavior codes (e.g., “turning on a light,” “sitting on a stool”) and timestamps, then grouped similar behaviors into clusters and categorized them following a taxonomy adapted from Larsen et al. (2021).
4. Results
4.1. A framework for using AI for behavior analysis in design research
We developed a framework that leverages AI for video-based human behavior analysis in design research (Fig. 2). The process begins with deconstructing complex human behaviors into more specific target behaviors, allowing researchers to identify key objects and contextual elements. These factors define the behavior features, which guide the selection and placement of camera systems—including the number, types, and optimal angles of cameras—to ensure that critical aspects of the behavior are adequately captured on video. Next, vision-based AI models process these video inputs to extract behavior features using capabilities such as detection, tracking, segmentation, and recognition. For instance, AI tools can detect and track both humans and objects, or recognize human postures such as standing and sitting. By representing the observed objects and their relationships, the AI models generate analysis results stored in accessible data formats. These outputs can include visualizations, analytic graphs, and qualitative annotations (e.g., object and behavior labels).

Figure 2. Framework for video-based human behavior analysis utilizing AI in design research
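To make the framework's stages concrete, the sketch below shows one possible way to record them as a simple analysis plan before any model is run. It is purely illustrative: the class and field names are our own and are not part of the framework's formal definition.

```python
# Illustrative sketch only: capturing the framework's stages (target behavior,
# behavior features, camera setup, AI capabilities, outputs) as a plain data object.
# All names below are hypothetical and chosen for readability.
from dataclasses import dataclass, field


@dataclass
class BehaviorAnalysisPlan:
    target_behavior: str                    # e.g. "manipulate cushion"
    key_objects: list[str]                  # objects involved in the behavior
    behavior_features: list[str]            # measurable features, e.g. "hand-object overlap"
    camera_setup: str                       # number, type, and angle of cameras
    ai_capabilities: list[str]              # detection, tracking, segmentation, recognition
    outputs: list[str] = field(default_factory=lambda: ["annotations", "visualizations"])


plan = BehaviorAnalysisPlan(
    target_behavior="manipulate cushion",
    key_objects=["hand", "cushion"],
    behavior_features=["hand-cushion overlap"],
    camera_setup="one top-down camera, sampled at 2 fps",
    ai_capabilities=["segmentation", "tracking"],
)
```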
4.2. Vision-based AI models for behavior analysis
Table 1 summarizes a selection of open-source vision-based AI models identified for behavior analysis tasks. Pose estimation models, such as AlphaPose, PoseNet, DensePose, and OpenPose, can detect and track key points of the whole body, including the face, body, hands, and feet, enabling the capture of subtle human actions in complex behavior analysis. YOLOv8, an object detection model, can detect and track humans and objects and classify diverse objects with name labels. SAM2 is an object segmentation model with which users can identify segments of interest in a video using positive and negative prompts. MMAction2 and SlowFast provide comprehensive frameworks for action recognition, supporting various algorithms and integrating with popular datasets. Moreover, DeepFace can analyze facial expressions to predict human emotions.
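As an illustration of how such models can be used in practice, the following is a minimal sketch of person detection and tracking with YOLOv8 through the ultralytics Python package. It is an example under our own assumptions (file name, model variant, and output handling), not the exact pipeline of any cited study.

```python
# Minimal sketch: detect and track people in a recorded session with YOLOv8.
# Assumes the `ultralytics` package is installed and "session.mp4" exists;
# class 0 in the pretrained COCO weights corresponds to "person".
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# Run the built-in tracker over the video, keeping only person detections.
for frame_idx, result in enumerate(model.track(source="session.mp4", classes=[0], stream=True)):
    if result.boxes is None:
        continue
    for box in result.boxes:
        x, y, w, h = box.xywh[0].tolist()                        # box centre and size (pixels)
        track_id = int(box.id[0]) if box.id is not None else -1  # -1 if no track assigned
        print(f"frame {frame_idx}: person {track_id} at ({x:.0f}, {y:.0f})")
```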
Table 1. Vision-based AI models for behavior analysis.

4.3. Behavior classifications in workspace design research
We qualitatively coded the behaviors of eight participants in approximately eight hours of video and present the results in Table 2.
We defined 106 spatial behavior codes in the workspace, grouped them into 32 behaviors, and then classified them into four categories: object manipulation, spatial movement, posture, and eye gaze. The frequency (Freq.) column reports the overall number of occurrences of each behavior, and the following columns show each behavior's occurrence percentage in the activating room (Act.), the neutral room 1 (Neu.1), the relaxing room (Rlx.), and the neutral room 2 (Neu.2). Specific actions, such as manipulating the light and leaning over the desk, show considerable differences in frequency depending on the workspace arrangement. General behaviors, such as moving, standing, and reading, occurred with relatively balanced frequencies across the different workspaces.
Table 2. Observed Behaviors by Qualitative Video Coding (Percentages by Room).

Note: “-” indicates not applicable to this room due to the workspace element setup.
4.4. Applications of AI for human behavior analysis in the workspace
We applied the proposed AI-based behavior analysis framework to the selected test videos. Specifically, we used a segmentation AI model (SAM2) and a recognition AI toolkit (MMAction2) from Section 4.2 to analyze one representative behavior in each behavior category defined in Section 4.3. We then evaluated the AI-generated results.
Spatial Movement. In this behavior category, we focused on using AI models to observe movement-related behavior patterns, such as walking and changing position, in the activating and neutral rooms. Using SAM2 (Ravi et al., 2024), we tracked the human movement path, as illustrated in Fig. 3. Specifically, we used SAM2 to track the participant's central body point at a rate of two frames per second, resulting in a sequence of 3,600 yellow points over the 30-minute video. Linking these points by timecode generated the observed movement path. The coordinate axes correspond to the original image frame (1080 × 1920 pixels). Figs. 4b and 4d display the participant's stay positions and movement paths in the activating and neutral rooms, respectively.

Figure 3. AI-tracked human movement path in a 30-minute video in two rooms
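The path-reconstruction step can be sketched as follows, assuming the per-frame person masks produced by SAM2 are already available as boolean NumPy arrays (the mask extraction itself is omitted); frame size and sampling rate are placeholders to be adapted.

```python
# Minimal sketch: turn per-frame segmentation masks (e.g. SAM2 output at 2 fps)
# into a movement path by linking mask centroids in time order.
import numpy as np
import matplotlib.pyplot as plt


def mask_centroid(mask: np.ndarray) -> tuple[float, float]:
    """Return the (x, y) pixel centroid of a boolean segmentation mask."""
    ys, xs = np.nonzero(mask)
    return float(xs.mean()), float(ys.mean())


def plot_movement_path(masks: list[np.ndarray], frame_size=(1080, 1920)) -> None:
    """Plot centroids linked in time order; frame_size is an assumed (width, height) in pixels."""
    points = [mask_centroid(m) for m in masks if m.any()]
    xs, ys = zip(*points)
    plt.plot(xs, ys, "-o", markersize=2)
    plt.xlim(0, frame_size[0])
    plt.ylim(frame_size[1], 0)   # image coordinates: y grows downwards
    plt.title("Reconstructed movement path")
    plt.show()
```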
To evaluate accuracy, we reviewed a result video composed of the 3,600 analyzed frames and identified mis-tracked frames. Out of 3,600 frames, the AI incorrectly tracked nine, achieving 99.75% accuracy in path tracking.
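This accuracy follows directly from the frame counts:

(3600 − 9) / 3600 = 3591 / 3600 = 0.9975 = 99.75%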
Object Manipulation. As an example of object manipulation, we examined participants' interactions with a soft cushion on the desk (labeled “Manipulate cushion” in Table 2). We analyzed when and for how long the participant interacted with the cushion. We used SAM2 to detect, segment, and track the relevant objects (the human hands and the cushion) twice per second and to cover them with masks. Based on the overlap of the different masks, we captured the timestamps and durations of human-cushion interactions. The AI-estimated interaction timestamps are represented by the orange line in Fig. 4.
We then compared these AI-estimated interaction timestamps with the timestamps coded by two human researchers (blue line in Fig. 4). According to the human-coded results, the video contained eight human-cushion interactions, all of which were correctly identified by the AI, but with one additional false detection (nine estimated interactions in total). Among the eight correctly detected interactions, five showed start and end times closely aligning with the human researchers' analysis, while three exhibited larger discrepancies in timing.

Figure 4. AI-estimated human-cushion interactions in 30 minutes of video in the activating room
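The overlap-based extraction of interaction intervals can be sketched as follows, assuming paired boolean masks for the hand and the cushion at each sampled frame (2 fps); the exact mask post-processing in our pipeline may differ.

```python
# Minimal sketch: derive human-cushion interaction intervals from mask overlap.
# `hand_masks` and `cushion_masks` are boolean arrays, one pair per sampled frame.
import numpy as np


def overlap_series(hand_masks, cushion_masks) -> list[bool]:
    """True wherever the hand mask and cushion mask share at least one pixel."""
    return [bool(np.logical_and(h, c).any()) for h, c in zip(hand_masks, cushion_masks)]


def to_intervals(flags: list[bool], fps: float = 2.0) -> list[tuple[float, float]]:
    """Convert a per-frame boolean series into (start_s, end_s) interaction intervals."""
    intervals, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i / fps                      # interaction begins
        elif not flag and start is not None:
            intervals.append((start, i / fps))   # interaction ends
            start = None
    if start is not None:                        # still interacting at the end of the video
        intervals.append((start, len(flags) / fps))
    return intervals
```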
Posture. We employed MMAction2 (Contributors, 2020) to analyze participants' postures during the first five minutes of the videos recorded in both the activating and neutral rooms. More specifically, within MMAction2, we used VideoMAE (Tong et al., 2022), an action recognition model, combined with the Atomic Visual Actions (AVA) dataset. The AI model generated posture predictions at a frequency of one frame per second and only reported results with confidence scores above 0.48. Figure 5 presents the AI-predicted postures, their frequency, and their temporal distribution.

Figure 5. AI-predicted postures in 5 minutes of video in two rooms
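The thresholding of the per-second predictions can be sketched as follows; the tuple-based input format is our own assumption about how the predictions are exported, not MMAction2's native API.

```python
# Minimal sketch: keep only confident per-second posture predictions and count them.
from collections import Counter

CONF_THRESHOLD = 0.48   # only predictions above this score are reported


def filter_and_count(predictions):
    """predictions: list of (second, label, score); returns kept predictions and label counts."""
    kept = [(t, label) for t, label, score in predictions if score > CONF_THRESHOLD]
    return kept, Counter(label for _, label in kept)


# Hypothetical example with three per-second predictions
kept, freq = filter_and_count([(0, "sit", 0.91), (1, "stand", 0.35), (2, "touch", 0.62)])
print(freq)   # Counter({'sit': 1, 'touch': 1}) -- "stand" is dropped (score below 0.48)
```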
We evaluated the AI-based posture recognition against human-coded reference data using three performance indicators: recall, precision, and F1-score (Yacouby & Axman, 2020) (Table 3). The AI model produced posture predictions for the 300 frames of each 5-minute video; these predictions were then compared frame by frame with the video annotations from the human researcher. For “get up,” “jump,” and “fall down,” the data were insufficient to compute these indicators, so these actions were excluded from the table.
Recall (1) quantifies the proportion of researcher-identified behavior instances correctly detected by the AI:

Recall = TP / (TP + FN)    (1)

where TP denotes true positives (frames in which both the AI and the researcher identified the posture) and FN false negatives (frames identified by the researcher but missed by the AI).
“Stand” showed low recall (18.4%), whereas “Bend/Bow” and “Touch” both exceeded 90%, indicating strong coverage by the AI. Precision (2) measures the proportion of correct AI predictions among all AI predictions:

Precision = TP / (TP + FP)    (2)

where FP denotes false positives (frames in which the AI predicted the posture although the researcher did not).
Except for “Stand,” all postures surpassed 63% precision, with particularly strong results for “Walk” and “Touch.” Finally, the F1-score (3), the harmonic mean of recall and precision, provides an overall performance metric:

F1 = 2 × (Precision × Recall) / (Precision + Recall)    (3)
“Carry/Hold” and “Touch” achieved high F1-scores, refecting robust overall detection, while “Stand” remained problematic under the tested conditions.
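A minimal sketch of this frame-by-frame comparison is given below, assuming both the AI output and the human coding are represented as one set of posture labels per analysed frame; it illustrates the metric definitions rather than our exact evaluation script.

```python
# Minimal sketch: recall, precision, and F1 for one posture label, computed
# frame-by-frame against human-coded reference labels.
def frame_metrics(ai_frames, human_frames, target):
    """ai_frames / human_frames: one set of posture labels per analysed frame."""
    pairs = list(zip(ai_frames, human_frames))
    tp = sum(1 for a, h in pairs if target in a and target in h)        # both agree
    fp = sum(1 for a, h in pairs if target in a and target not in h)    # AI only
    fn = sum(1 for a, h in pairs if target not in a and target in h)    # researcher only
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else float("nan")
    return recall, precision, f1


# Hypothetical example with three frames
print(frame_metrics([{"sit"}, {"stand"}, {"touch"}],
                    [{"sit"}, {"sit"}, {"touch"}], target="sit"))
```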
Table 3. Performance analysis of AI posture recognition for participant 3382

5. Discussion
5.1. Performance of AI models for behavior analysis
In this study, current vision-based open-source AI models showed promising capabilities, especially in tracking, detection, and segmentation. Moreover, several AI models demonstrate good usability for non-experts, providing no-code browser interfaces and demos. Notably, tools like SAM2 and YOLOv8 provide accessible interfaces and comprehensive documentation, making them more user-friendly for design researchers with limited coding backgrounds. In contrast, more specialized tools such as MMAction2 may remain challenging for researchers without coding experience.
Overall, the tested AI models achieved adequate accuracy in most tasks. In spatial movement analysis, the AI-generated movement paths closely matched the human-coded references, demonstrating high accuracy. In object manipulation, the AI effectively identified all interactions with the object, though it struggled to estimate interaction durations; these inaccuracies may be caused by the proximity of the human hand to the object. In posture recognition, the AI's overall accuracy was moderate, achieving F1-scores above 57% for most predicted postures.
Two main types of AI error emerged in posture recognition: (1) failure to respond to a target behavior when it occurred, and (2) misclassifications, most notably erroneous predictions of standing and bending/bowing. These issues are likely related to the camera's top-down angle, which resulted in substantial occlusion of participants' lower bodies and thus limited the model's ability to interpret torso positions accurately. In contrast, the upper body and hands were generally visible, facilitating more accurate recognition of hand-related actions. For instance, the system effectively identified interactions between the hands and the environment, such as carrying/holding and touching objects. However, it struggled to infer their specific context and purpose. As a result, it grouped a range of specific behaviors (e.g., adjusting the height of a desk, holding a chair's armrest, or using a mobile phone) under a single generic category, “carry/hold”.
5.2. Limitations and future work
Several limitations apply to this study. We used only top-down camera angles; incorporating multiple camera perspectives could improve the accuracy of posture and action recognition, especially for lower-body movements. Moreover, accurately detecting eye gaze from video alone is significantly limited by camera angles and the inherent limitations of AI models. Integrating sensor-based technologies (e.g., eye-tracking devices) with AI-driven video analysis may offer a more comprehensive approach. While AI-based posture and action recognition has great potential for making behavior analysis more automatic, its direct application in design research is still challenging. Achieving higher accuracy and context-specific understanding will likely require domain-specific datasets and advanced algorithms tailored to design-related behaviors.
Moving forward, we plan to refine and expand the space-related behavior categories proposed in Section 4.3, ensuring they reflect the common needs of human behavior analysis and design research. Moreover, we will break down the generally observed actions into fine-grained actions to improve the generalizability of AI-detected behaviors across diverse design-related research scopes. Building on that, we aim to develop an AI-enhanced analysis toolkit to detect behavior patterns in design-related research more automatically. Future work will also include direct comparisons between human-coded and AI-based behavior analysis to more thoroughly assess the advantages and limitations of current AI models in design research contexts.
6. Conclusions
This study introduced a framework for leveraging vision-based AI models to analyze human behavior in spatial environments. By aligning AI capabilities with specific analytical tasks, we aimed to enhance the efficiency of video-based behavior analysis and reduce the extensive manual effort traditionally involved. Moreover, we compiled currently available vision-based AI models and tools that offer reliable tracking, detection, and segmentation capabilities, providing a reference for design researchers. Additionally, we identified a set of commonly observed spatial behaviors and tested suitable AI tools on them.
Our applications demonstrated how identified behavior categories—ranging from spatial movement to posture recognition—could be examined following the proposed framework. These practical implementations provide valuable guidance for researchers seeking to incorporate AI into their analyses, illustrating the processes and considerations necessary for successful integration. Finally, our evaluation of the AI models' performance offers new insights for design researchers, pointing toward strategies for refining AI-enhanced human behavior analysis and integrating emerging AI technologies into the study of human behavior in design settings.