Computational Modeling of Affective User Experience Using Multimodal Physiological and Behavioral Signals

Xiaohong Zhang; Ikseo Choi

doi:10.3791/69823

Research Article

April 7th, 2026

Xiaohong Zhang¹ , Ikseo Choi²

¹School of Space Design, Hongik University, ²School of Industrial Design, Hongik University

Summary

This protocol describes a computational framework that models affective user experience by integrating multimodal physiological and behavioral signals through correlation-based feature learning and multimodal fusion. The framework is tested on the AMIGOS benchmark dataset.

Abstract

This work proposes a reproducible computational protocol for multimodal affective modeling that utilizes physiological signals. The goal of the protocol is to enable offline emotion recognition by integrating multiple bio signals using a unified deep learning framework. The proposed work consists of five steps: data collection, preprocessing, feature alignment, multimodal fusion, and evaluation. EEG, ECG, and GSR signals from publicly accessible AMIGOS data were used as the experimental baseline in this work. Bio signals were pre-processed and normalized to extract modality-specific features. Heterogeneous feature spaces were aligned across modalities using Deep Canonical Correlation Analysis, followed by a multimodal fusion network for classifying an affective state. The protocol has been evaluated with offline experiments and compared to conventional fusion and classification models using standard performance metrics such as accuracy, precision, recall, F1-score, and AUC. This study focuses on the development and validation of a computational framework for multimodal affective user experience modeling rather than the deployment of a real-time interactive system. With 92.1% accuracy for UX-affective state prediction and 94.2% F1-score for valence-arousal classification, the results consistently outperformed baseline models on emotional dimensions. These findings verified the effectiveness of the proposed multimodal fusion workflow for computational affective modeling by benchmarking physiological data.

Introduction

The interplay of cognition, emotion, and behavior shapes how people think and act. Affective computing studies these relationships by drawing on neuroscience, psychology, and artificial intelligence to build systems capable of analysing, understanding, and reacting to human emotion. The field has increasingly been applied to human–technology communication by incorporating emotional awareness into responsive AI systems, so that technology responds not only to users' cognitive states but also to their emotional states, resulting in a more individualized and emotion-aware user experience. Emotion, a complex mental process, reflects human perception and plays a substantial role in human interaction¹. Many human–computer interaction (HCI) applications now depend on research in emotion recognition². The HCI system environment is dynamic and complex, and in most instances the system must synchronize its functions with its users; a system with emotional intelligence can therefore adapt better to this kind of setting. This study accordingly explores computational modeling of affective user experience through multimodal physiological and behavioral signals³. Rather than developing or validating an interactive system in practice, the work makes a methodological contribution to multimodal signal integration and affect inference, with immersive and exhibition-based settings as the application context. To limit its scope, this work focuses on two major concepts: the implementation of a multimodal affective modeling framework using computational approaches, and multimodal data fusion techniques for combining physiological and behavioral data. All other topics that emerge in the manuscript are referenced for context with regard to the relevance of the developed models.

Each emotion modality conveys information about a person's affective state that cannot be derived from the others. Extracting emotion from body motion, which involves overall body movement patterns, has the merit of capturing a person's nonverbal feelings⁴. Speech and facial expression can convey information that is not available from body movement or physiological signals. Humans use various emotion modalities in parallel and in combination, and each modality has its own strengths and limitations. Multimodal emotion recognition methods may therefore achieve better recognition performance than unimodal methods. During an emotional shift, facial expression cues, physiological signals, and speech signals tend to appear earlier than other signals. Thus, computer vision-based emotion detection is mainly concerned with facial emotional expression⁵.

Affective computing has been investigated extensively in child–robot interaction (CRI), where emotion-aware systems change their responses based on users' physiological and behavioral signals. These works illustrate the feasibility of biophysical emotion sensing and adaptive interaction. However, most CRI systems are designed only for structured, task-oriented, and participant-specific contexts, and consequently generalize poorly to dynamic, public environments such as interactive exhibitions. This body of work points to emotion-aware robots as essential for enhancing motivation and for tackling issues such as personalization. Engagement is a key element of affective child–robot interaction, since emotionally responsive robots provide richer and more effective learning experiences. The authors of⁶ present a framework for detecting user engagement on the basis of task and social interaction characteristics. Their findings show that identifying affective cues, including facial expressions and social behaviour, allows robots to adjust their responses dynamically to enhance user interaction and engagement. Affective adaptation in long-term CRI remains an ongoing challenge. The authors of⁷ track children's emotions over a period of time, underlining that robots need to adjust to individual emotional patterns and learning styles.

EEG-based emotion recognition has also achieved strong performance owing to EEG's high temporal resolution and sensitivity to affective states. Recent works⁸,⁹,¹⁰ using deep learning approaches improved cross-subject robustness by introducing spatiotemporal modeling and graph-based attention mechanisms. However, most studies remain unimodal, and only a few take into account perception–emotion interactions across multiple sensory channels; thus, they are less applicable in immersive UX scenarios¹¹,¹²,¹³. The model was evaluated on the dataset and showed improved performance compared with current state-of-the-art techniques. The authors of¹⁴ proposed an emotion-dependent critical selection algorithm and examined the strength, clustering coefficient, and eigenvector centrality of EEG functional connectivity network features. The authors of¹⁵ investigated a deep simple recurrent unit network to learn temporal features from EEG signals, and their experimental results surpassed related work in the literature¹⁶,¹⁷,¹⁸,¹⁹. Collecting a large EEG dataset is difficult, so alternative strategies such as cross-subject approaches are often explored.

Recent studies²⁰,²¹ have made significant strides in applying advanced spatiotemporal modeling algorithms to cross-subject emotion classification from EEG signals. One example is a spatio-temporal hybrid network with enhanced domain adaptation and dynamic graph attention for handling inter-subject variability²². By combining temporal modeling with dynamic-attention modeling of spatial relationships between brain regions, it substantially improved subject-independent emotion classification performance. Another effort on this problem is a spatiotemporal isomorphic cross-brain region interaction network²³, which places greater emphasis on modeling spatial-temporal invariants to enhance inter-subject robustness. Although both contributions report outstanding performance on emotion classification from purely neural signals, they are limited to a single modality and do not explore multimodal perception–emotion (P-E) interactions. In contrast, the current research extends unimodal signal models by using deep learning to bind multiple P-E modalities together via deep correlation learning, providing a richer affective UX modeling solution.

Although current research, including work on an emotion-aware child–robot interface using biophysical information, has validated the potential of multimodal physiological signals to improve affect recognition, it remains limited by domain-related constraints. In particular, these systems are largely optimized for structured, task-oriented interaction with children and do not generalize to more intricate, dynamic environments such as interactive exhibitions. Furthermore, the feature fusion techniques used in the baseline work are not scalable and tend to rely on shallow fusion approaches, which do not effectively capture the complex perception–emotion interaction that exists within multisensory UX environments. The baseline paper also does not use state-of-the-art deep learning techniques that can align and fuse high-dimensional multimodal data streams, such as EEG, facial expressions, and eye-tracking, under different environmental stimuli. This exposes a fundamental knowledge gap: the lack of a high-performance computational paradigm that both accommodates varying audience demographics and stimulus contexts and captures perception–emotion interaction through deep canonical correlation analysis (DCCA) and multimodal fusion network (MMFN) architectures. Bridging this gap can significantly improve the affective understanding of user experience (UX).

The main goal of this work is to create a computational system that facilitates emotionally adaptive user experience within interactive exhibition contexts through multimodal biophysical and behavioral sensing. Building upon the basic principles of emotion modeling from child–robot interaction research and affective computing, the framework aims to model and record user perception–emotion interactions based on heterogeneous physiological and behavioral cues such as EEG, ECG, EDA, facial expressions, and eye-tracking. By integrating deep canonical correlation analysis (DCCA) with a multimodal fusion network (MMFN), the system seeks to learn common latent emotional representations across modalities and map these representations onto affective user experience (UX) states. The architecture is designed to overcome the limitations of shallow feature fusion and age-constrained emotional models by supporting context-aware emotion inference and multisensory integration in dynamic, public environments. Ultimately, this work aims to support the development of more intelligent and affectively sensitive exhibition systems that respond to users' real-time affective feedback and interaction modes, and thereby heighten engagement, satisfaction, and cognitive-emotional resonance in digital cultural experiences.

This paper contributes to the literature in the following ways: i) a computational framework for multimodal affective modeling that considers both physiological and behavioral data; and ii) a multimodal fusion approach that facilitates effective learning of perception and emotion representations from multiple data sources. The work introduces a computational paradigm for augmenting affective user experience (UX) through AI-based multimodal sensing and a multisensory integration pipeline that models the perception–emotion interaction. The primary contribution is the combination of DCCA with MMFN to facilitate strong feature alignment and high-level representation learning across heterogeneous modalities such as EEG, ECG, EDA, facial expression, and eye-tracking. This enables an offline mapping from sensory perception to emotional states such as interest, engagement, surprise, or boredom. Experimental validation with the publicly available AMIGOS dataset shows that the developed DCCA+MMFN model performs better than baseline models (e.g., 1D-CNN, CNN-ResNet, and LSTM-CNN), with an average classification accuracy of 89.4% for valence-arousal emotion states and 87.1% for discrete emotion classes.

In contrast to previous studies that either focused on participant- or task-centered situations or used surface-level feature-fusion techniques, the proposed approach provides a deep learning framework designed for dynamic, real-world exhibition environments. This paper presents an AI-driven framework to model affective user experiences in interactive exhibitions using multimodal sensing and multisensory integration. By combining DCCA with an MMFN, the framework achieves robust alignment and fusion of heterogeneous modalities such as EEG, ECG, EDA, facial expressions, and eye-tracking, and provides an extensible, noise-robust method capable of capturing continuous perception–emotion shifts in heterogeneous user groups. This explicit fusion of high-dimensional physiological, behavioral, and environmental signals, together with support for continuous perception–emotion modeling within dynamic public environments, is the primary innovation for affective UX modeling.

The creation of a systematic, repeatable multimodal computational approach for physiological affect modeling that methodically handles heterogeneous feature alignment before fusion is the main novelty of this work. The suggested framework enforces correlated representation learning across EEG, ECG, and GSR signals by introducing an intermediate cross-modal latent alignment stage using DCCA, in contrast to traditional multimodal emotion recognition studies that rely on direct feature concatenation or decision-level fusion. By guaranteeing modality consistency prior to classification, this alignment-driven fusion approach enhances generalizability and robustness across affective dimensions. The contribution is a benchmark-oriented, end-to-end workflow that standardizes preprocessing, representation alignment, multimodal fusion, and evaluation within a single deep learning architecture, allowing for repeatable and task-agnostic physiological emotion modeling, as opposed to suggesting a single algorithmic change. The proposed framework in this study is set as a feasibility and conceptual demonstration of multimodal affective modeling for exhibition-like experiences. Instead of trying to directly model real multisensory exhibition environments, the AMIGOS dataset is used as a benchmark proxy to validate the modeling approach in a controlled environment.

Protocol

The AMIGOS dataset used in this study is publicly available and was collected with prior institutional review board approval and informed consent, as reported in the original publication. This study involves only secondary analysis of the dataset, and no additional ethical approval was required.

The present method uses feature alignment and multimodal fusion to process multimodal physiological and behavioral data in order to describe perception–emotion correlations. This study proposes a computational model for affective user experience (UX) in interactive exhibitions, leveraging multimodal biophysical sensing and AI-based emotion modeling. Building on biophysical data insights from the baseline paper, the methodology computationally models user states from synchronized EEG, ECG, EDA, eye-tracking, facial expressions, and environmental inputs. The term “spatial–temporal modeling” in the manuscript has two parts. Spatial: multi-channel physiological feature representation and inter-modality correlation alignment via DCCA. Temporal: sequential encoding through BiLSTM and preservation of temporal dependencies within segmented windows. The signals are first preprocessed and normalized to account for temporal and spatial correlations across modalities. Feature vectors are learned independently for each modality, capturing emotion-related patterns such as arousal, attention, and engagement. DCCA is then used to learn shared emotional representations across the heterogeneous data streams. DCCA projects each modality into a common latent subspace in which the correlation between features is maximized, while retaining modality-specific information through augmented cross-modal relevance. These modality-aligned latent embeddings are passed to a Multimodal Fusion Network (MMFN) that combines temporal and contextual information via hybrid BiLSTM and attention layers. The fusion produces high-level affective states, which are translated into user experience indicators (e.g., boredom, interest, discomfort). Lastly, a classification module estimates the affective UX state and provides feedback for exhibition adaptation within the computational model, e.g., adjusting visual or audio stimuli.
Figure 1 shows the architecture of the proposed method.

Figure 1: Architecture of the proposed method. Structure of the proposed UX modeling framework fusing multimodal inputs, including EEG, ECG, EDA, facial expressions, and eye gaze, using DCCA and MMFN. Abbreviations: UX = User Experience; EEG = Electroencephalography; ECG = Electrocardiography; EDA = Electrodermal Activity; DCCA = Deep Canonical Correlation Analysis; MMFN = Multimodal Fusion Network.

The architecture in Figure 1 illustrates an end-to-end affective user experience (UX) system that uses Deep Canonical Correlation Analysis (DCCA) and a Multimodal Fusion Network (MMFN) to model perception–emotion interaction in interactive exhibitions. The system begins with raw multimodal inputs: EEG, ECG, and EDA physiological signals; facial expression and eye gaze behavioral signals; and contextual inputs such as audio-visual and environmental information. These signals are fed into a preprocessing and feature extraction module, where signal-specific processing (e.g., noise removal, normalization, feature calculation) produces meaningful descriptors for every modality. Next, the DCCA module projects pairs of modalities into a common latent space and learns maximally correlated embeddings that encode intrinsic emotional patterns across modalities. The high-correlation features from the multiple DCCA modules are then fused by the MMFN, which combines the learned features hierarchically with deep learning layers (such as BiLSTM, attention mechanisms, or transformers) to create a compact, high-level affective representation. This integrated representation is fed into the emotion and UX state classification layer, in which the model predicts affective outcomes such as valence and arousal, or cognitive/emotional states such as engagement, boredom, or overload. An adaptive UX feedback loop is optionally provided at the end of the pipeline, enabling the system to respond to the user's state by adapting stimuli or content in the exhibition context. This scalable, modular architecture supports robust emotion modeling through correlation-aware representation learning and deep multimodal integration, which makes it well suited to computational modeling of affect in interactive or experiential environments.

The selection of the fusion of DCCA with MMFN is based on the current limitations of standard unimodal and shallow fusion models. Although CNN-ResNet and LSTM-CNN extract temporal or spatial features, they are poor at balancing heterogeneous modalities like EEG, EDA, and facial expressions under different environmental stimulations. DCCA maximizes cross-modal correlation to ensure shared latent representations, whereas MMFN utilizes hierarchical fusion and attention mechanisms to dynamically highlight the most informative signals. When combined, these modules offer a more reliable, comprehensible, and broadly applicable affective UX modeling approach for interactive multisensory environments.
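The attention-based weighting described above can be illustrated with a minimal sketch that fuses three pairwise embeddings through a single attention layer. This is not the authors' MMFN: the BiLSTM stage is omitted, and the function name `attention_fuse`, the scoring vector `w_score`, the batch size, and the random inputs are assumptions introduced only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attention_fuse(embeddings, w_score):
    """Weight each source embedding by a relevance score and sum them.

    embeddings: array of shape (batch, n_sources, dim)
    w_score:    array of shape (dim,), a hypothetical scoring vector
    """
    scores = np.tanh(embeddings) @ w_score                  # (batch, n_sources)
    weights = softmax(scores, axis=1)                       # sums to 1 over sources
    fused = (weights[..., None] * embeddings).sum(axis=1)   # (batch, dim)
    return fused, weights


rng = np.random.default_rng(0)
# Stand-ins for three pairwise DCCA embeddings (EEG-ECG, EEG-GSR, EEG-facial),
# each 64-dimensional; real embeddings would come from trained DCCA modules.
pair_embeddings = rng.standard_normal((8, 3, 64))
w_score = 0.1 * rng.standard_normal(64)  # random stand-in for a trained vector
fused, weights = attention_fuse(pair_embeddings, w_score)
```

In this sketch each row of `weights` indicates how strongly each modality pair contributes to the fused vector for one sample; in a trained network the scoring vector would be learned jointly with the classifier.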

Data acquisition
In the proposed framework for modeling perception–emotion interaction in interactive exhibitions, the publicly available AMIGOS dataset²⁴ is used as the source of multimodal affective data. The AMIGOS dataset provides an extensive collection of physiological and behavioral measures from human participants exposed to affect-evoking stimuli such as video clips. The data contain synchronized EEG, ECG, and GSR signals, facial video recordings, and self-reported emotional labels in discrete (e.g., happy, sad) and dimensional (valence/arousal) forms. These modalities align well with the needs of the proposed system, which aims to infer affective user experience (UX) states during immersive, multisensory exhibition interactions. The dataset includes recordings from 40 subjects, each exposed to both short (video clips <150 s) and long (movies >14 min) emotion-inducing stimuli under conditions that simulate real-life media engagement. For the scope of this work, a pre-processed subset of the dataset is employed: records from 30 subjects with artifact-free, high-quality EEG, ECG, and GSR signals, as well as full facial video and annotation files. These samples were chosen to mimic multimodal emotional dynamics in exhibition-based situations. The selected subset serves as the training and benchmarking base for the DCCA + Multimodal Fusion Network (MMFN) pipeline, with affective states such as engagement, boredom, confusion, and arousal explored as responses to visual, auditory, and spatial stimuli in a simulated exhibition environment. Table 1 shows the dataset description of the proposed method.

Parameter	Details
Dataset Name	AMIGOS (A Dataset for Affect, Personality and Mood Research)
No. of Participants	40 total (30 used in proposed work with complete multimodal data)
Stimuli Type	Short and long video clips (emotionally annotated)
EEG Channels	14-channel Emotiv EPOC EEG headset
ECG	Heart rate sensor (for HRV and arousal)
GSR	Skin conductance sensor (measuring sympathetic activity)
Facial Video	High-resolution frontal face video for expression analysis
Emotion Labels	Self-assessed valence, arousal, dominance (scale 1–9), plus discrete labels
Sampling Rate	EEG: 128 Hz, ECG/GSR: 1000 Hz
Data Format	.mat files and synchronized timestamp logs
Modality Synchronization	Yes – Synchronized across all sensors and video
Duration per Session	Short clips (<150 s) and long movies (>14 min)
Application Fit	Affective UX modeling, multimodal fusion, real-time perception-emotion mapping

Table 1: AMIGOS dataset description. Description of the AMIGOS dataset used in the proposed framework, including participant information, modalities, sampling rates, and experimental design. Abbreviations: AMIGOS = A Dataset for Affect, Personality and Mood Research on Individuals and Groups.

The publicly available AMIGOS dataset used in the proposed framework consists of multimodal behavioral and physiological data collected from 40 participants, of which 30 were chosen for this study based on the quality and completeness of their recordings. The dataset records emotional reactions to short and long video clips and includes EEG (from a 14-channel Emotiv headset), ECG for heart rate variability, GSR for skin conductance, and high-resolution facial video for expression measurement. Emotional states are self-reported in both dimensional (valence, arousal, dominance) and discrete categories. The modalities are synchronized with one another and sampled at 128 Hz for EEG and 1000 Hz for ECG and GSR. This configuration makes the dataset well suited to modeling user experience (UX) in interactive exhibitions, so that perception–emotion interactions can be tracked using AI-based multimodal sensing.

Data pre-processing
All preprocessing steps, model training, and evaluations were implemented in custom-written Python scripts (Python 3.10, PyTorch 2.0), with a separate module for each of signal preprocessing, DCCA-based alignment, MMFN training, and performance evaluation; key parameter settings and essential computational configurations are summarized in Table 1. To prepare the publicly available AMIGOS dataset for affective UX modeling of interactive exhibitions, first implement a systematic data preprocessing pipeline for the physiological and behavioral modalities (EEG, ECG, GSR, and facial video). Standardize and synchronize each modality, and then perform modality-specific preprocessing. At present, contextual information is obtained only from intrinsic audio-visual cues within the video stimuli, for example, facial expressions, visual motion patterns, and acoustic features such as speech prosody and background audio.

Signal normalization and resampling
Raw physiological signals x(t) from all modalities are resampled to a common frequency fs=128 Hz for temporal alignment across modalities:

$x_{resampled}(t) = \mathrm{resample}(x(t), f_s)$ (1)

The resampling was performed using a standard signal processing library function rather than a custom-designed algorithm. Specifically, the implementation was carried out using the resampling utility available in the Python signal processing ecosystem (SciPy), which applies Fourier-method-based interpolation to convert signals to the target sampling frequency. The target sampling rate f_s was selected to ensure uniform temporal resolution across modalities prior to segmentation and feature extraction. Every signal is subsequently z-score normalized in order to eliminate inter-subject variability:

$x_{norm}(t) = \frac{x(t) - \mu_x}{\sigma_x}$ (2)

where μ_x and σ_x are the trial window mean and standard deviation of the signal.
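Equations 1 and 2 can be sketched with SciPy's Fourier-method `scipy.signal.resample`, followed by per-window z-scoring. The helper names, the small `eps` guard against division by zero, and the synthetic 1000 Hz trace are assumptions made for illustration, not the authors' scripts.

```python
import numpy as np
from scipy.signal import resample

TARGET_FS = 128  # common sampling rate f_s from the text (Hz)


def resample_to_target(x, orig_fs, target_fs=TARGET_FS):
    """Fourier-method resampling of a 1-D signal to target_fs (Equation 1)."""
    n_out = int(round(len(x) * target_fs / orig_fs))
    return resample(x, n_out)


def zscore(x, eps=1e-8):
    """Z-score a trial window with its own mean and standard deviation (Equation 2)."""
    return (x - np.mean(x)) / (np.std(x) + eps)


# Synthetic 2 s trace sampled at 1000 Hz (illustrative data only).
orig_fs = 1000
t = np.arange(2 * orig_fs) / orig_fs
ecg = 0.5 + 2.0 * np.sin(2 * np.pi * 1.2 * t)

ecg_128 = resample_to_target(ecg, orig_fs)   # 2000 samples -> 256 samples
ecg_norm = zscore(ecg_128)                   # zero mean, unit variance
```

Because the Fourier method assumes a periodic signal, long recordings with mismatched endpoints can show edge ringing; a polyphase resampler such as `scipy.signal.resample_poly` is a common alternative in that case.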

EEG preprocessing
The EEG signals, captured from 14 channels, are band-pass filtered in the 4–45 Hz range to preserve cognition-related frequencies:

$x_{EEG}(t) = \mathrm{BandPass}(x(t), 4, 45)$ (3)
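One possible realization of Equation 3 is a zero-phase Butterworth band-pass filter. The source does not state the filter family or order, so the fourth-order Butterworth design, the helper name `bandpass_eeg`, and the synthetic 14-channel signal below are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def bandpass_eeg(x, fs=128, low=4.0, high=45.0, order=4):
    """Zero-phase band-pass filter applied along the last (time) axis."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x, axis=-1)


fs = 128
t = np.arange(4 * fs) / fs
in_band = np.sin(2 * np.pi * 10 * t)     # 10 Hz component inside the pass band
drift = np.sin(2 * np.pi * 0.5 * t)      # 0.5 Hz drift outside the pass band
eeg = np.tile(in_band + drift, (14, 1))  # 14 identical synthetic channels

filtered = bandpass_eeg(eeg, fs)         # drift removed, 10 Hz component kept
```

Second-order sections with forward-backward filtering are used here because they are numerically stable and introduce no phase delay, which helps preserve the cross-modal alignment required in later steps.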

The power spectrum is used to identify frequency patterns in the signals. For brain signals these patterns can differ with the type of emotion and therefore provide useful information about the signal. Welch's technique is an improved periodogram-based estimator of the spectral density and can also be used to obtain spectrograms. It segments the time-domain signal into discrete intervals, computes a periodogram for every segment, and then averages the periodograms, as shown in Equation 4. This yields a smoother estimate than a single FFT over the full signal. Finally, Power Spectral Density (PSD)¹⁹ is calculated using Welch's technique to capture frequency-domain characteristics:

$PSD(f) = \frac{1}{K}\sum_{k=1}^{K} |X_k(f)|^2$ (4)

where K is the number of segments and X_k(f) is the Fourier transform of the k-th windowed segment.

PSD features are summed over four bands: Theta (4–8 Hz), Alpha (8–13 Hz), Beta (13–30 Hz), and Low Gamma (30–45 Hz).
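The band-power step can be sketched with `scipy.signal.welch`, summing the PSD bins that fall inside each band. The segment length `nperseg=256`, the half-open band edges, and the synthetic test signal are assumptions, not values reported in the protocol.

```python
import numpy as np
from scipy.signal import welch

# Band edges in Hz, as listed in the text.
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "low_gamma": (30, 45)}


def band_powers(x, fs=128, nperseg=256):
    """Welch PSD of a 1-D signal, summed over each frequency band."""
    freqs, psd = welch(x, fs=fs, nperseg=nperseg)
    powers = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)   # half-open interval [lo, hi)
        powers[name] = float(np.sum(psd[mask]))
    return powers


fs = 128
t = np.arange(8 * fs) / fs
rng = np.random.default_rng(0)
# Synthetic channel: a 10 Hz (alpha-range) oscillation plus weak noise.
x = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(t.size)

powers = band_powers(x, fs)
dominant = max(powers, key=powers.get)  # "alpha" for this synthetic signal
```

Applying `band_powers` to each of the 14 channels would give a 14 × 4 feature matrix per window under these assumptions.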

Video-based facial features
Facial Action Units (AUs) and eye gaze vectors are extracted for each video frame using facial analysis software, and are subsequently combined as temporal sequences:

$AU_j(t) = \text{intensity of action unit } j \text{ at time } t$ (5)

Emotion label mapping
Both valence-arousal scores and discrete emotion tags are given by AMIGOS. The continuous ratings are normalized to [0,1]:

$V' = \frac{V - V_{min}}{V_{max} - V_{min}}$ (6)

where V is the observed value before normalization, V_min is the smallest value of the variable in the dataset, V_max is the largest value of the variable in the dataset, and V' is the normalized value in the range 0 to 1. These labels are used as ground truth for supervised affect modeling.
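Equation 6 amounts to a one-line rescaling. The sketch below assumes that V_min and V_max are the ends of the 1–9 self-assessment scale listed in Table 1; if the observed minimum and maximum of the ratings were used instead, they would be passed in place of the defaults. The function name and sample ratings are hypothetical.

```python
import numpy as np


def minmax_scale(v, v_min=1.0, v_max=9.0):
    """Rescale ratings to [0, 1] following Equation 6."""
    v = np.asarray(v, dtype=float)
    return (v - v_min) / (v_max - v_min)


valence = [1, 5, 9, 7.5]          # illustrative self-assessed ratings
scaled = minmax_scale(valence)    # 0.0, 0.5, 1.0, 0.8125
```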

Synchronization across modalities
All modalities are synchronized using timestamp alignment or dynamic time warping (DTW) when necessary:

Equation for signal alignment of multimodal data, including EEG, ECG, GSR, and video analysis. (7)
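Where timestamps alone are insufficient, dynamic time warping can be illustrated with a small dynamic-programming sketch. This is a textbook DTW on two 1-D sequences with an absolute-difference cost, not the authors' alignment routine; the function name `dtw_path` and the toy sequences are assumptions.

```python
import numpy as np


def dtw_path(a, b):
    """Return the DTW distance and warping path between 1-D sequences a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])

    # Backtrack from the end to recover which samples were matched.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(cost[i - 1, j - 1], i - 1, j - 1),
                 (cost[i - 1, j], i - 1, j),
                 (cost[i, j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])
        path.append((i - 1, j - 1))
    path.reverse()
    return float(cost[n, m]), path


a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 3, 2, 1, 0]      # same shape, delayed by one sample
distance, path = dtw_path(a, b)   # the delay is absorbed, so distance is 0.0
```

This implementation is quadratic in sequence length, so for full-length recordings a windowed or library-based DTW would be more practical.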

The preprocessed feature-rich data is then input to the Deep Canonical Correlation Analysis (DCCA) module, which learns maximally correlated representations between modalities.

Feature extraction using DCCA
In the proposed framework of affective UX modeling for interactive exhibitions, DCCA is used to learn correlated latent features across heterogeneous modalities such as EEG, ECG, EDA, facial expressions, and eye-tracking data. The objective is to learn a common representation space in which signals from the various modalities, although individually noisy or modality-specific, are maximally correlated and semantically consistent in terms of perceived emotional states. In this work, DCCA was applied to the following modality pairs: i) EEG–ECG, ii) EEG–GSR, and iii) EEG–facial feature representations. These pairs were selected to capture complementary neural-physiological and neural-behavioral correlations relevant to affective engagement. To extend DCCA to a multimodal setting, the pairwise DCCA embeddings were computed for each modality pair and then fused using the proposed MMFN, which allows efficient integration of more than two modalities without altering the core DCCA formulation.

By performing correlation-driven latent alignment prior to fusion, DCCA structurally separates representation alignment from decision fusion, in contrast to Cross-Modal Attention or Tensor Fusion Networks, which operate directly on possibly misaligned representations. This two-stage architecture reduces redundancy and improves representation coherence. Furthermore, DCCA directly optimizes cross-modal statistical dependency instead of depending only on shared loss functions, which makes it more appropriate for heterogeneous physiological data than multi-task learning frameworks. DCCA²⁰ computes representations of the modalities by passing each of them through multiple stacked layers of nonlinear transformations. A grid search was used to determine the hyperparameters of the deep learning model used in the DCCA approach. After experimental analysis, the authors selected a stochastic gradient descent optimizer, cross-entropy loss, and a regularization parameter of 1e-5. The authors also selected 15 optimization steps, bias vectors initialized to all zeros, early stopping on the validation set, and the Xavier initializer for the weights. Figure 2 shows the working process of DCCA.

Figure 2: Working process of DCCA. Process workflow of DCCA, showing the mapping of characteristics from diverse modalities into a common latent space to maximize correlation.
Abbreviations: DCCA = Deep Canonical Correlation Analysis.

Figure 2 depicts the internal structure of Deep Canonical Correlation Analysis (DCCA) for shared representation learning between two modalities, EEG (Modality A) and EDA (Modality B). Each modality is fed into its own deep feature extractor network (Feature Extractor A and B), whose hidden layers learn abstract, modality-specific features. The two networks process their input streams in parallel but independently of each other. DCCA is used to achieve cross-modal latent alignment prior to fusion. Two parallel deep projection networks are built for each modality pair (EEG–ECG, EEG–EDA, and EEG–Facial). Each projection network consists of three fully connected layers with dimensions [256, 128, 64] and ReLU activation, so the embedding dimension of the shared latent space is 64. The objective function maximizes the canonical correlation between the projected modality embeddings, with an L2 regularization coefficient of 1e-5 to ensure numerical stability. Bias vectors are initialized to zero, and weights are initialized via Xavier initialization. To stabilize alignment learning, the DCCA modules are trained separately before MMFN training, and there is no end-to-end fine-tuning between DCCA and MMFN, which keeps the alignment consistent. The output of each feature extractor is projected into the common latent space, where the features of both modalities become comparable in structure. The projections are optimized according to the DCCA objective, which maximizes the correlation between the corresponding latent representations of EEG and EDA. By learning this maximally correlated subspace, DCCA helps ensure that the output embeddings retain complementary and semantically significant patterns from both modalities, which are essential for downstream applications such as emotion recognition or affective computing.
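As a rough illustration of the projection networks described above, the following NumPy sketch builds one three-layer network with the stated layer sizes [256, 128, 64], Xavier-initialized weights, and zero biases. The input dimension (70), the batch size, the function names, and the choice to leave the last layer linear are assumptions made for the example; the text does not specify them.

```python
import numpy as np

def xavier_uniform(rng, fan_in, fan_out):
    """Xavier/Glorot uniform initialization for a (fan_in, fan_out) weight matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def init_projection_net(rng, input_dim, layer_dims=(256, 128, 64)):
    """Return a list of (weights, bias) pairs; biases start at zero."""
    params, fan_in = [], input_dim
    for fan_out in layer_dims:
        params.append((xavier_uniform(rng, fan_in, fan_out), np.zeros(fan_out)))
        fan_in = fan_out
    return params

def project(params, x):
    """Map a batch x of shape (batch, input_dim) to the 64-d latent space."""
    h = x
    for layer_index, (weights, bias) in enumerate(params):
        h = h @ weights + bias
        if layer_index < len(params) - 1:  # assumption: last layer stays linear
            h = np.maximum(h, 0.0)         # ReLU on the hidden layers
    return h

rng = np.random.default_rng(0)
eeg_net = init_projection_net(rng, input_dim=70)   # 70 is a hypothetical feature size
eeg_batch = rng.standard_normal((8, 70))           # 8 hypothetical trials
eeg_latent = project(eeg_net, eeg_batch)
print(eeg_latent.shape)  # (8, 64)
```

A second network with the same layout would be created for the paired modality, and both outputs would then be passed to the correlation objective.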

In this work, features are transformed using DCCA and then fused for classification. The DCCA model, shown in Figure 2, uses deep neural networks for feature transformation. The CCA layer calculates the correlation, which is then used for feature fusion and classification. Suppose the matrix $IM_1 \in \mathbb{R}^{MA \times n_1}$ stores the trials of the EEG modality, and the matrix $IM_2 \in \mathbb{R}^{MA \times n_2}$ stores the trials of the face video clip modality. Here, $n_1$ and $n_2$ denote the feature dimensions of the EEG trials and the face video clips, respectively, and $MA$ is the total number of trials. For each modality, the authors built the following deep neural network to transform the input features nonlinearly:

$ON_1 = f_1(IM_1; HT_1), \quad ON_2 = f_2(IM_2; HT_2)$ (8)

where $HT_1$ and $HT_2$ denote the parameters of the nonlinear transformations, $ON_1 \in \mathbb{R}^{MA \times n}$ and $ON_2 \in \mathbb{R}^{MA \times n}$ are the outputs of the two neural networks, and $n$ is the output dimension of DCCA. DCCA jointly learns the parameters $HT_1$ and $HT_2$ so that the correlation between $ON_1$ and $ON_2$ is as high as possible:

$(HT_1^{*}, HT_2^{*}) = \arg\max_{HT_1, HT_2} \operatorname{corr}\big(f_1(IM_1; HT_1),\, f_2(IM_2; HT_2)\big)$ (9)
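The correlation objective in Equation 9 can be made concrete with a small NumPy sketch. It computes the sum of canonical correlations between two sets of network outputs using a formulation that is common in the DCCA literature (whitened cross-covariance followed by singular values); whether the authors used exactly this variant is an assumption. The function names, the toy data, and the placement of the 1e-5 regularization term on the covariance diagonals are likewise illustrative.

```python
import numpy as np

def inv_sqrt(mat):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

def total_correlation(on1, on2, reg=1e-5):
    """Sum of canonical correlations between two (trials, n) output matrices."""
    m = on1.shape[0]
    c1 = on1 - on1.mean(axis=0)                 # center each view
    c2 = on2 - on2.mean(axis=0)
    s11 = c1.T @ c1 / (m - 1) + reg * np.eye(c1.shape[1])
    s22 = c2.T @ c2 / (m - 1) + reg * np.eye(c2.shape[1])
    s12 = c1.T @ c2 / (m - 1)
    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)     # whitened cross-covariance
    return np.linalg.svd(t, compute_uv=False).sum()

rng = np.random.default_rng(0)
view_a = rng.standard_normal((200, 3))
view_b = 2.0 * view_a + 1.0                     # linearly related view
view_c = rng.standard_normal((200, 3))          # unrelated view
print(total_correlation(view_a, view_b))        # close to 3, the latent dimension
print(total_correlation(view_a, view_c))        # much smaller
```

During training, the negative of this quantity would serve as the loss that is minimized with respect to the network parameters.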

The parameters $HT_1$ and $HT_2$ were trained jointly with the backpropagation algorithm, with the gradients of the objective function approximated as suggested in prior work. After the two neural networks have been trained, the transformed features $ON_1, ON_2 \in SP$ lie in the shared space $SP$. The original DCCA authors²⁰ did not specify how the transformed features should be used, so they can be used in whatever way best serves the application. In this work, the authors obtained the fused features from the transformed features as follows:

$ON = \alpha\, ON_1 + \beta\, ON_2$ (10)

where $\alpha$ and $\beta$ are weights satisfying $\alpha + \beta = 1$. The fused features $ON$ are fed to the SoftMax classifier, which is trained on the emotion recognition tasks. Building DCCA for fusing data from multiple modalities has several advantages. For instance, because DCCA explicitly produces $ON_1$ and $ON_2$ for each modality at the feature-fusion level, the characteristics and correlations of the modality-specific transformations can be inspected. Moreover, emotion-related information can be preserved by controlling the nonlinear mapping functions $f_1(\cdot)$ and $f_2(\cdot)$. In the weighted-sum fusion, the authors use equal weights for each modality. This fused space is passed to a Multimodal Fusion Network (MMFN) that predicts user experience states. Multimodal alignment is hierarchical: the raw signals are first temporally aligned through resampling and timestamp alignment, and the semantic alignment of signals from different modalities is then obtained via Deep Canonical Correlation Analysis (DCCA). This ensures that the modality-specific representations are mapped into a common latent space before the attention-based fusion in the MMFN.
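The weighted-sum fusion of Equation 10 is simple enough to sketch directly. The example below is illustrative only: the function name and the toy feature values are hypothetical, and the default of 0.5 reflects the equal weighting stated above.

```python
import numpy as np

def fuse_features(on1, on2, alpha=0.5):
    """Weighted-sum fusion ON = alpha * ON1 + beta * ON2 with beta = 1 - alpha."""
    if on1.shape != on2.shape:
        raise ValueError("both transformed feature matrices need the same shape")
    beta = 1.0 - alpha                      # enforces alpha + beta = 1
    return alpha * on1 + beta * on2

on1 = np.array([[1.0, 2.0], [3.0, 4.0]])    # hypothetical transformed EEG features
on2 = np.array([[3.0, 0.0], [1.0, 0.0]])    # hypothetical features of the paired modality
on_fused = fuse_features(on1, on2)          # equal weights, alpha = beta = 0.5
print(on_fused)
```

Because both inputs must have the same shape, this step presupposes that the two DCCA networks share the same output dimension.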

Multimodal fusion network (MMFN) for affective UX modeling
The MMFN encodes each modality separately, then fuses them using a fusion gate and attention mechanism to output an integrated representation for emotion prediction.

Modality-specific encoding
Let $x^{(i)} \in \mathbb{R}^{d_i}$ be the feature vector of the $i$-th modality (e.g., EEG, EDA). Each modality is encoded with a neural encoder:

$h^{(i)} = f^{(i)}(x^{(i)}), \quad h^{(i)} \in \mathbb{R}^{d}$ (11)

Gated fusion mechanism
To manage the flow of information from each modality, a fusion gate is employed:

$g^{(i)} = \sigma\left(W_g^{(i)} h^{(i)} + b_g^{(i)}\right)$ (12)

$\tilde{h}^{(i)} = g^{(i)} \odot h^{(i)}$ (13)

Here, $\sigma$ is the sigmoid activation function and $\odot$ denotes element-wise multiplication. This enables the network to dynamically down-weight noisy modalities or up-weight useful features.
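A minimal NumPy sketch of the gating in Equations 12 and 13 is given below for a single encoded modality vector. The dimensions, weight values, and function names are hypothetical; in the actual model the gate parameters would be learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_representation(h, w_g, b_g):
    """Apply Equations 12 and 13 to one encoded modality vector h of shape (d,)."""
    g = sigmoid(w_g @ h + b_g)   # Equation 12: gate values in (0, 1)
    return g * h                 # Equation 13: element-wise gating

d = 4
h_eeg = np.array([1.0, -2.0, 0.5, 3.0])   # hypothetical encoded EEG vector
open_gate = gated_representation(h_eeg, np.zeros((d, d)), np.full(d, 10.0))
closed_gate = gated_representation(h_eeg, np.zeros((d, d)), np.full(d, -10.0))
print(open_gate)    # nearly equal to h_eeg
print(closed_gate)  # nearly all zeros
```

The two extreme bias settings show how a gate close to 1 passes a modality through and a gate close to 0 suppresses it.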

Cross-modality attention fusion
An attention layer learns how much weight to put on each modality's representation:

[Equation image not recovered: definition of the attention weights $\alpha^{(i)}$] (14)

$h_{\text{fused}} = \sum_{i=1}^{M} \alpha^{(i)} \tilde{h}^{(i)}$ (15)

where $\alpha^{(i)}$ are the attention weights and $h_{\text{fused}} \in \mathbb{R}^{d}$ is the final fused representation.
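The attention-weighted sum of Equation 15 can be sketched as follows. The scoring function used here (a vector `w_att` dotted with each gated representation, followed by a softmax over the M modalities) is an assumption chosen for illustration and is not necessarily the form of Equation 14; all names and values are hypothetical.

```python
import numpy as np

def attention_fuse(h_tilde, w_att):
    """Fuse M gated modality vectors (rows of h_tilde) into one vector.

    The score w_att . h_tilde[i] followed by a softmax is an assumed
    scoring function; only the weighted sum mirrors Equation 15.
    """
    scores = h_tilde @ w_att
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # weights sum to 1
    h_fused = alpha @ h_tilde                       # Equation 15
    return h_fused, alpha

h_tilde = np.array([[1.0, 0.0, 2.0],    # hypothetical gated EEG vector
                    [0.0, 1.0, 0.0],    # hypothetical gated ECG vector
                    [2.0, 2.0, 2.0]])   # hypothetical gated facial vector
h_fused, alpha = attention_fuse(h_tilde, w_att=np.zeros(3))
print(alpha)    # uniform weights when all scores are equal
print(h_fused)  # here: the plain average of the three rows
```

With a trained `w_att`, the weights would no longer be uniform, and modalities with higher scores would dominate the fused vector.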

Affective state prediction
The fused vector is input to an output layer to make predictions of valence/arousal or UX states:

$\hat{y} = \mathrm{softmax}\left(W_o \cdot \mathrm{ReLU}(h_{\text{fused}}) + b_o\right)$²² (16)

where $\hat{y}$ can represent discrete emotion classes (happy, neutral, sad), a continuous scale (valence/arousal), or UX categories (engaged, distracted, overloaded).
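For the discrete-class case, the output layer of Equation 16 can be sketched in a few lines of NumPy. The weight matrix, bias, and fused vector below are made-up values for demonstration, and the three UX categories are taken from the list above.

```python
import numpy as np

UX_CLASSES = ("engaged", "distracted", "overloaded")

def predict_state(h_fused, w_o, b_o):
    """Equation 16: softmax(W_o . ReLU(h_fused) + b_o)."""
    logits = w_o @ np.maximum(h_fused, 0.0) + b_o
    logits = logits - logits.max()                  # numerical stability
    return np.exp(logits) / np.exp(logits).sum()

h_fused = np.array([0.5, -1.0, 2.0])                # hypothetical fused vector
w_o = np.array([[1.0, 0.0, 1.0],                    # hypothetical output weights
                [0.0, 1.0, 0.0],
                [0.0, 0.0, -1.0]])
b_o = np.zeros(3)
probs = predict_state(h_fused, w_o, b_o)
print(UX_CLASSES[int(np.argmax(probs))])  # engaged
```

For the continuous valence/arousal case, the softmax would be replaced by a linear output, as described for the regression setting.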

Figure 3: Structure of MMFN. Structure of the MMFN with modality-specific encoders, gated fusion, and attention-based integration for affective state prediction.
Abbreviations: MMFN = Multimodal Fusion Network.

Figure 3 shows the working process of the Multimodal Fusion Network (MMFN) employed in the affective computing framework for emotion and user experience (UX) prediction. The MMFN has a hierarchical fusion architecture made up of modality-specific encoders, gated fusion layers, an attention-based integration module, and a final classification head. The figure illustrates how heterogeneous input modalities, such as EEG (Modality A), EDA (Modality B), and facial expressions (Modality C), are processed by their own specialized encoder networks (Encoder A, B, and C). Each modality (EEG, ECG, EDA, and facial features) is processed by an independent encoder made up of two fully connected layers with ReLU activation, followed by a Bidirectional Long Short-Term Memory (BiLSTM) layer that captures temporal dependencies. Each encoder learns a modality-specific feature representation that retains the significant emotional or physiological features of the corresponding data stream. Each direction of the BiLSTM has 128 hidden units, giving each modality a 256-dimensional contextual embedding. To reduce overfitting, dropout (rate = 0.5) is applied after the BiLSTM layer. A gated fusion mechanism that uses sigmoid activation to dynamically control modality contributions is then applied to the encoded modality representations. A self-attention layer then computes attention weights across modalities to suppress noisy signals and highlight relevant information. The output of each encoder is thus passed to a Multimodal Fusion Layer, which concatenates the modality features and applies the attention mechanism to emphasize the more informative signals and mask noise. This operation produces a common latent representation that integrates emotional signals across all modalities into a high-level dense vector. Finally, the fused representation is projected by a fully connected layer followed by either a linear output layer for valence-arousal regression tasks or a Softmax output layer for discrete classification tasks.

The BiLSTM layers applied to the modality-specific sequential embeddings model temporal dynamics directly. By capturing both forward and backward temporal dependencies within the segmented physiological sequences, the BiLSTM enables contextual modeling of changing affective states. Furthermore, the subsequent attention-based fusion layer operates on the temporally encoded representations, which allows adaptive weighting of temporally informative features. Sequence construction, recurrent encoding, and attention weighting thus work together to support temporal modeling.

Pseudocode 1 shows the overall process of affective user experience modeling using DCCA-MMFN in a reproducible form. First, the multimodal physiological and behavioral signals are preprocessed independently, including noise reduction and scale normalization. The features are then fed into DCCA, where correlated features are jointly learned for the specified modality pairs to obtain robust cross-modal alignment. The aligned features are subsequently combined in an MMFN whose gating and attention mechanisms modulate each modality, promoting informative modalities and suppressing noise. The final fused representation is used to train a classifier for affective and UX-related state estimation, and performance is evaluated using standard metrics. The step-by-step structure of the pseudocode is intended to make the protocol transparent, feasible, and easy to reproduce for technically informed readers.

The proposed framework aims to predict the affective user experience (UX) state using multimodal physiological and behavioral signals from the AMIGOS dataset based on the following steps:

Step 1: The input multimodal data consist of EEG, ECG, EDA, eye-gaze, and facial expression signals. Initially, the AMIGOS multimodal dataset is accessed, and each modality undergoes signal preprocessing, including noise filtering and normalization to ensure data quality and consistency.

Step 2: To maintain temporal uniformity across modalities, all signals are resampled to a standard frequency. Subsequently, modality-specific feature extraction is performed to obtain distinctive feature representations corresponding to each signal type.

Step 3: To capture cross-modal relationships, selected modality pairs (EEG–ECG, EEG–EDA, and EEG–Facial) are processed using Deep Canonical Correlation Analysis (DCCA). The DCCA networks are trained to learn correlated latent representations between paired modalities, resulting in aligned feature embeddings for each modality pair. These aligned representations are then combined to form a unified multimodal feature space.

Step 4: The aggregated aligned features are provided as input to a Multimodal Fusion Network (MMFN), where attention mechanisms and gated fusion strategies are employed to effectively integrate complementary information across modalities. This fusion process generates a comprehensive affective representation.

Step 5: The resulting representation is subsequently used to train a classifier with labeled emotion data to predict the affective UX state. Finally, the model performance is evaluated using standard metrics, including Accuracy, Precision, Recall, F1-score, and Area Under the Curve (AUC). The predicted affective UX state is returned as the final output of the framework.
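To make the evaluation in Step 5 concrete, the sketch below computes accuracy, precision, recall, and F1-score for a binary task (for example, high versus low valence) from predicted and true labels. The label vectors are invented for illustration, AUC is omitted because it requires predicted scores rather than hard labels, and the function name is hypothetical; in practice a library such as scikit-learn, which is listed in the Materials, would typically be used.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels coded as 0/1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / y_true.size,
            "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # hypothetical ground-truth labels
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]   # hypothetical model predictions
metrics = binary_metrics(y_true, y_pred)
print(metrics)  # all four metrics equal 0.75 for this toy example
```

Under 5-fold cross-validation, these metrics would be computed on each held-out fold and then averaged.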

Perception-emotion modelling using the proposed DCCA-MMFN framework
In the proposed computational framework for the affective user experience (UX) of interactive exhibitions, the Perception–Emotion Modeling module is the central process that links the user's perceptual reactions to environmental stimuli with emotional states derived from multimodal biophysical signals. This module builds on the common representations that DCCA learns from physiological and behavioral modalities such as EEG, EDA, ECG, facial expressions, and eye movements to capture the user's affective dynamics. In parallel, metadata about the surrounding exhibition environment (visual stimuli, soundscapes, interaction patterns, and ambient characteristics) are used to encode perceptual cues. By combining these multimodal representations in a Multimodal Fusion Network (MMFN), the model learns detailed, non-linear mappings between perceptual stimulus features and emotion labels or dimensions (e.g., valence, arousal, engagement). This mapping is intended to let a system continuously estimate how a user reacts emotionally to different exhibition elements and then adapt the content or interface accordingly to increase engagement, satisfaction, or cognitive resonance. Finally, the perception-emotion modeling facilitates emotion-aware adaptation of interaction, which constitutes the core of computational modeling and user-sensitive exhibition systems.

First, DCCA modules were trained to learn correlated latent representations for each modality pair. The resultant embedding serves as input to the MMFN, further trained sequentially for the affective prediction task. No end-to-end fine-tuning is done to maintain the stability of training.

Category	Parameter	Value / Description
Signal Preprocessing	EEG band-pass filter	4–45 Hz
	Resampling frequency	128 Hz
	Window length	2 s
	Window overlap	50%
	Normalization	Z-score normalization
DCCA Architecture	Number of hidden layers	3 layers per modality
	Neurons per layer	128–64–32
	Latent dimension size (n)	32
	Regularization parameter	1 × 10⁻⁵
	Optimizer	Adam
	Learning rate	0.001
	Batch size	32
	Maximum training epochs	150
	Early stopping patience	15 epochs
MMFN Configuration	Encoder type	BiLSTM
	Hidden units	64
	Dropout rate	0.3
	Fusion strategy	Gated fusion with attention
	Attention type	EEG, ECG, GSR, Video
Training & Evaluation	Loss function	Cross-entropy loss
	Train–test split	5-fold cross validation
	Evaluation metrics	Accuracy, Precision, Recall, F1-score, Cohen’s Kappa, AUC–ROC
Reproducibility Checkpoints	EEG input tensor shape	Channels × time samples (e.g., 14 × 256 per window)
	DCCA output representation	32-dimensional shared latent embedding
	MMFN output	Emotion class probabilities (valence–arousal or discrete classes)
	Expected validation performance	85–90% classification accuracy

Table 2: Parameters of the proposed DCCA–MMFN algorithm. Summary of preprocessing, architecture, fusion, training, evaluation, and reproducibility parameters considered in the proposed affective UX modeling framework. Abbreviations: DCCA = Deep Canonical Correlation Analysis; MMFN = Multimodal Fusion Network; UX = User Experience.

Table 2 lists the full set of parameters used in the proposed DCCA-MMFN framework, covering signal preprocessing, deep architecture design, fusion technique, and training conditions. Physiological signals were uniformly resampled and normalized so that they are synchronized across modalities. The DCCA layer was configured with multi-layer non-linear transformations, which support the learning of maximally correlated latent spaces, whereas the MMFN component used BiLSTM encoder networks with gated attention mechanisms to dynamically weight the contribution of each modality. The training and evaluation conditions are explicitly defined to support a fair comparison with existing methods.
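The preprocessing parameters in Table 2 (128 Hz sampling, 2 s windows, 50% overlap, z-score normalization) imply windows of 256 samples with a hop of 128 samples, which matches the 14 × 256 EEG tensor shape given as a reproducibility checkpoint. The NumPy sketch below illustrates this on synthetic data. Normalizing each channel over the whole recording, rather than per window, is an assumption, as are the function names and the 10 s recording length.

```python
import numpy as np

def zscore(signal, eps=1e-8):
    """Z-score each channel (row) of a (channels, samples) array."""
    mean = signal.mean(axis=-1, keepdims=True)
    std = signal.std(axis=-1, keepdims=True)
    return (signal - mean) / (std + eps)

def segment(signal, fs=128, window_s=2.0, overlap=0.5):
    """Cut a (channels, samples) array into overlapping windows."""
    win = int(round(window_s * fs))            # 2 s at 128 Hz -> 256 samples
    step = int(round(win * (1.0 - overlap)))   # 50% overlap -> 128-sample hop
    n_samples = signal.shape[-1]
    if n_samples < win:
        raise ValueError("recording is shorter than one window")
    starts = range(0, n_samples - win + 1, step)
    return np.stack([signal[:, s:s + win] for s in starts])

rng = np.random.default_rng(0)
eeg = rng.standard_normal((14, 10 * 128))      # 10 s of synthetic 14-channel data
windows = segment(zscore(eeg))
print(windows.shape)  # (9, 14, 256)
```

Each of the nine windows has the 14 × 256 shape listed in Table 2, and consecutive windows share half of their samples.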

The suggested framework is intended to serve as the computational foundation for affect-aware systems that use behavioral and physiological information in several dimensions. The architecture is conceptually adaptable to real-world environments like intelligent exhibition spaces, adaptive smart interfaces, and affect-aware human–computer interaction systems, even though the current study validates the model through controlled offline experiments using the publicly available AMIGOS dataset. The framework can function as a fundamental module for applications needing reliable emotion inference by offering unified fusion techniques, cross-modal alignment, and structured preprocessing. However, rather than being systems that have been experimentally deployed, these environments are described in this study as possible deployment contexts.

Results

Assessment of the proposed system
To assess the proposed system, experiments were performed on the publicly available AMIGOS dataset, which provides synchronized EEG, ECG, GSR, video, and audio recordings of 40 users exposed to emotionally engaging stimuli. For this research, the authors used data from 33 participants (following preprocessing and removal of incomplete trials), resulting in 1,320 valid samples on the valence and arousal dimensions. The assessment emphasized emotion c...

Discussion

Spatial, environmental, and physical interaction contexts, such as spatial layout, crowd density, or ambient environmental conditions, are not provided in the AMIGOS dataset. These factors are therefore not directly modeled in the current experiments. The proposed computational framework for Affective User Experience (UX) modeling goes well beyond the foundational concepts of the base paper, which dealt with task-oriented child–robot interaction employing biophysical emotion detection. Gene...

Disclosures

The authors have no conflicts of interest.

Acknowledgements

The authors acknowledge the support of the School of Space Design and the School of Industrial Design at Hongik University. The authors also thank the exhibition partners and participants for their contributions to the study.

Materials

List of materials used in this article
Name	Company	Catalog Number	Comments
Dataset	AMIGOS dataset	40 participants; EEG (128 Hz), ECG (1000 Hz), GSR (1000 Hz), facial video, self-reported valence/arousal labels	Multimodal ground truth data for affective state modeling
Physiological Sensors	EEG headset	Emotiv EPOC+ (14 channels, 128 Hz)	Capturing brain activity related to attention, arousal, and engagement
	ECG sensor	Biopac MP150 or equivalent (1000 Hz)	Heart rate variability and arousal
	GSR/EDA sensor	Shimmer GSR+ or equivalent (1000 Hz)	Skin conductance as measure of arousal
Behavioral Sensors	Eye-tracking device	Tobii Pro X2-60 or equivalent	Recording gaze fixation and saccades
	Facial expression recording	High-resolution video camera; analyzed with OpenFace (AUs, gaze vectors)	Extracting facial Action Units (AUs) and gaze cues
Environmental Inputs	Audio-visual recording setup	Microphone + Camera (synchronized with stimuli)	Capturing contextual stimuli during exhibition
Software / Toolkits	OpenFace	Open-source facial behavior analysis toolkit	Extracting Action Units (AUs), gaze direction
	MATLAB / Python (NumPy, SciPy, scikit-learn)	Signal preprocessing (resampling, z-score normalization, PSD computation)	Data preprocessing and feature extraction
	TensorFlow v2.13 / PyTorch v2.0	Deep learning framework for DCCA and MMFN	Model implementation and training
Algorithms / Models	Deep Canonical Correlation Analysis (DCCA)	Nonlinear feature alignment method	Learning correlated latent representations across modalities
	Multimodal Fusion Network (MMFN)	BiLSTM + Attention-based fusion layers	Hierarchical fusion of heterogeneous modalities for UX state classification
Evaluation Metrics	Accuracy, Precision, Recall, F1-Score, Cohen’s Kappa, AUC-ROC, Confusion Matrix	Implemented with scikit-learn / TensorFlow metrics	Model performance assessment
Computing Hardware	Workstation / GPU cluster	NVIDIA RTX 3080 (10GB) or equivalent, 32 GB RAM, Intel i9 processor	Model training and simulation
