Seamless Multimodal Human-Robot Communication: Integration Techniques in Human–Computer Interaction

Weiqing Zheng; Yifang Gao

doi:10.3791/70218

Review Article

June 9th, 2026

Weiqing Zheng¹, Yifang Gao²

¹Automation Engineering Department, Meizhouwan Vocational Technology College, ²School of Electrical and Electronic Engineering, Universiti Sains Malaysia

Summary

This review synthesizes recent advances in deep learning, multimodal sensing, and integration strategies that enable seamless, adaptive, and human-centered communication between humans and robots.

Abstract

Seamless multimodal human-robot communication has become essential as social, assistive, and educational robots move into everyday settings such as homes, healthcare facilities, and classrooms, where they need to integrate speech, gesture, gaze, facial expressions, and tactile cues with high precision, low latency, and real-world robustness. This review systematically examines how current techniques achieve real-time, reciprocal multimodal human-robot interaction (HRI), focusing on fusion strategies, system architectures, and applications in specific domains. We searched the Web of Science Core Collection for English-language empirical studies from 2015 to 2025, selecting 148 papers that address communicative integration. The most important insights include a clear shift toward hybrid and attention-based fusion since 2022 (about 43% of approaches), which handles temporal asynchrony and embodied interactions better than earlier methods; widespread inconsistency in evaluation metrics that hinders cross-study comparisons; and a persistent gap between strong laboratory performance and weaker real-world robustness under noise, occlusion, or user variability. Key challenges include real-time processing, semantic alignment, data efficiency, deployment robustness, and the under-explored integration of tactile signals with affect. Looking ahead, the review suggests prioritizing adaptive fusion policies, standardized benchmarks for synchronization and fluency, and continual learning to enable user-personalized adaptation. Ultimately, it aims to guide the development of more human-centered robotic agents that can engage collaboratively, meaningfully, and in socially acceptable ways in daily life.

Introduction

Robots have evolved from industrial tools in structured settings to socially embedded agents in homes, hospitals, and classrooms. They now assist in education, therapy, and companionship, reflecting a shift from task-oriented automation to user-centered, adaptive interaction. At the core of this transition lies multimodal communication, enabling robots to perceive and respond to human cues—speech, gaze, gestures, facial expressions, and touch—with the temporal precision needed for natural exchanges in dynamic environments. This emphasis on coordination and responsiveness aligns with central concerns in human–computer interaction and cognitive science. In this review, seamless multimodal communication refers to interactions that feel fluid and intuitive to users, characterized by sub-second response times (typically less than 300 ms), minimal perceptible delays or mismatches, and robust adaptation to real-world variability. This distinguishes it from traditional command-based or unimodal systems, where timing errors, modality failures, or rigid responses disrupt natural flow.

Within human-robot interaction (HRI), this review focuses on human-robot communication, defined as the reciprocal, real-time exchange of multimodal signals between humans and robots. While HRI also encompasses planning, control, and safety, communication concerns the synchronization of perception and action across sensory channels. Early command-based systems prioritized efficiency over engagement, often producing rigid or delayed behavior. Contemporary work pursues adaptive, context-aware models that accommodate user state and emotion¹,². This evolution underscores the need for precise, intuitive, and personalized interaction frameworks.

Initial social and assistive robots commonly relied on scripted routines or single-modality inputs such as voice or touch, limiting flexibility³. Rule-based logic and slow processing frequently led to timing mismatches between gestures and speech, or to poor adaptation to user behavior. To address these issues, researchers employ low-latency fusion, multimodal sensory integration, and user-adaptive learning⁴,⁵. These strategies can be understood through three foundational communication principles widely recognized in HRI: complementarity (modalities compensate for each other), redundancy (overlap improves robustness), and synergy (convergence enhances naturalness).

Advances in sensing and learning have yielded socially responsive robots that integrate visual, auditory, tactile, and physiological inputs in real time⁵,⁶. Rather than fixed scripts, these systems continuously adapt to user cues. Accessible programming interfaces and learning-from-demonstration methods allow educators and therapists to customize behaviors for caregiving and instruction, where precision, responsiveness, and adaptability are essential.

Communication patterns in social HRI can be grouped into parallel, complementary, and interactive modes (Figure 1). In parallel interaction, humans and robots act independently with minimal exchange. Complementary interaction introduces structured prompts or event-based support. The most advanced, interactive mode involves continuous bidirectional adjustment across channels such as speech, movement, and gaze⁷.

Literature selection and analysis

Systematic review methodology: We conducted a systematic literature search in the Web of Science Core Collection, targeting peer-reviewed journal articles and full conference proceedings from major publishers (IEEE, ACM, Springer, Elsevier, Wiley, Taylor & Francis). The Boolean query was: (("human-robot interaction" OR HRI) AND (multimodal OR "sensor fusion" OR gesture OR gaze OR speech OR tactile) AND ("real-time" OR "low-latency" OR adaptive OR robust)). Inclusion criteria were: English-language publications from January 2015 to 2025—a period capturing the shift from rule-based, single-modality systems to deep learning architectures emphasizing low-latency, adaptive, and robust interaction—with a focus on multimodal integration techniques for communicative HRI and empirical validation (quantitative or qualitative). We excluded studies limited to unimodal interaction, those addressing robot control without communicative intent, and non-empirical works (e.g., purely theoretical or simulation-only). The initial search returned a substantial corpus. After removing duplicates, we screened titles and abstracts against the inclusion criteria, then conducted full-text reviews to assess relevance and contribution to reliable multimodal fusion. This multi-stage process, summarized in the PRISMA-style flowchart (Figure 2), yielded a final corpus of 148 papers that form the empirical foundation of this review.

Overview of selected literature: Figure 3 shows publication trends and disciplinary distribution, with Computer Science and Automation & Robotics dominating, alongside growing contributions from Behavioral Sciences and Neuroscience, which underscores the field's interdisciplinary trajectory. We thematically clustered the literature by technical contribution to multimodal integration (Table 1). This synthesis covered modalities, fusion methods, latency handling, robustness strategies, adaptability, and metrics. Studies from 2019 to 2021 predominantly used early or late fusion, often emphasizing gestures and tactile inputs. Since 2022, hybrid and attention-based architectures have gained favor, particularly for affective and embodied interaction. In this corpus, hybrid fusion accounts for approximately 43% of approaches, early fusion 29%, and late fusion 28%—a distribution suggesting improved tolerance for temporal asynchrony.

Although real-time capability is commonly claimed in the broader literature screened initially, only a limited number of studies detail specific mitigation strategies (e.g., buffered processing, predictive models, or sensor-level optimization). Robustness typically relies on filtering, redundancy, or rule-based fallbacks, whereas anticipatory adaptation remains rare. Evaluation practices are inconsistent, and standardized assessments of temporal synchronization or interaction fluency remain infrequent. The metrics vary so widely across studies that direct comparisons are often difficult. Accuracy, F1 scores, and latency figures are common, but they stem from different datasets, tasks, and environmental conditions; few papers adopt shared benchmarks for temporal alignment or noise robustness. Consequently, strong performance in clean lab setups frequently overstates what methods can achieve outside controlled settings. Developing consistent evaluation frameworks—perhaps including standardized latency tests under realistic variability—would make it far easier to judge which approaches truly advance the field.

Key gaps remain in tactile-affective integration (only six studies⁸,⁹,¹⁰,¹¹,¹²,¹³ address this) and dynamic latency adjustment under user-driven temporal uncertainty. Several interesting tensions emerge from the literature. Hybrid fusion is often praised for handling temporal misalignment well¹⁴, yet truly anticipatory or adaptive behaviors remain rare, calling into question just how seamless that claimed real-time performance really is. At the same time, impressive progress in vision and speech-based affect recognition stands in stark contrast to the minimal work integrating tactile cues with emotion understanding. High accuracy in controlled experiments also tends to degrade in real-world conditions, underscoring ongoing challenges in generalization across environments and users. Table 2 summarizes representative architectural trends and evaluation approaches, informing the challenges discussed in subsequent sections.

This review offers a distinct contribution to multimodal HRI literature. Recent surveys, such as Zhao et al.¹⁴, provide broad overviews of perception-driven decision-making, while Wang et al.¹⁵ focus on 5G-enabled visual-tactile transmission. Both emphasize general frameworks or communication protocols, with limited discussion of real-time fusion trade-offs, metric comparability, or deployment challenges. In contrast, we offer a more targeted analysis: comparing fusion strategies under specific constraints, critiquing cross-study metric inconsistencies, and highlighting practical gaps—particularly the lab-to-field performance divide that prior surveys have largely overlooked.

Review and Perspective

As robots become fixtures in homes, classrooms, and healthcare settings, natural and adaptive communication is critical to their success. Increasingly viewed not as tools but as social partners, they must perceive, interpret, and respond to human signals across modalities. Systems that accurately capture user intent and deliver timely feedback enhance task performance, trust, and emotional comfort, particularly when engaging children, older adults, or individuals with disabilities. Achieving this requires continuous, intuitive interaction that aligns robot responses with ongoing human behavior rather than interrupting it. Such fluency demands multimodal inputs (voice, gaze, gesture, facial expression) integrated through mechanisms that mirror human communication itself¹⁴,¹⁶.

Perception and communication channels
Vision remains the foundation of multimodal human–robot interaction. Visual cues such as facial expression, gaze, posture, and gesture underpin intent recognition, emotion inference, and engagement tracking¹⁷,¹⁸,¹⁹,²⁰,²¹. Early handcrafted methods using Haar detection²² or background subtraction²³ often failed under occlusion or lighting changes¹⁶,²¹. Deep learning systems now dominate, employing CNNs and recurrent models with multi-camera inputs. Figure 4 illustrates the Hi-ROS tracking pipeline, which reduces motion error by 27% in real-time gesture monitoring.

Integration of gaze, arm, and head cues improves prediction of user actions¹⁷, while proprioceptive and depth fusion enhances precision²⁴,²⁵. Mutual gaze¹⁹ and biomimetic eye designs¹⁸ further demonstrate the role of subtle signals in trust and coordination. Figure 5 depicts an architecture²⁶ in which dual Leap Motion Controllers capture hand motion, which is then fused through an LSTM-RNN for robust surgical teleoperation. Recent multi-camera systems, such as Hi-ROS²⁰ and RPF²¹,²⁷, enhance person-following in unstructured settings. Figure 6 shows the adaptive ReID lifecycle that maintains tracking under occlusion by continuous classifier refinement. Complementary work on active labeling²⁸ and semantic SLAM²⁶ improves context awareness, extending perception from object recognition to behavioral understanding.

Language provides a parallel communicative channel. Early rule-based dialogue systems struggled with ambiguity, prompting the adoption of data-driven and transformer-based models²⁹. Emotional and social responsiveness are now key: multimodal systems align facial and vocal signals for adaptive dialogue²,³⁰. Reinforcement learning improves coordination between affective and acoustic cues⁵,³¹,³². Large language models such as GPT-3³³ and ChatGPT³⁴ enable contextual reasoning and instruction following across tasks³⁵,³⁶,³⁷. Their integration with vision and reward modeling³⁸,³⁹,⁴⁰,⁴¹,⁴²,⁴³,⁴⁴ supports socially aware, generalized interaction.

Auditory perception complements vision and language. Prosody, pitch, and rhythm communicate intent when visual cues are unreliable. Reinforcement-based feature extraction improves synchronization between speech and expression. Datasets such as AVA ActiveSpeaker⁴⁵, AFFECT-HRI⁵, SAVEn-Vid⁴⁶, HumorHRI⁴⁷, and CHiME-5 enrich model robustness in noisy, affective environments. Pan et al.³² advance speech–lip synchronization for multi-speaker scenes, while multimodal fusion with tactile or physiological inputs informs adaptive behavior.

Embodied and physiological sensing
Tactile and physiological perception enhance social and physical awareness. Force, pressure, and biosignals enable inference of intent, stress, and engagement. Vision-based tactile simulators such as TACTO⁴ and neural mappers like FOTS⁴⁸ generate high-resolution contact data at 300 fps, supporting low-latency control. FOTS models illumination and deformation via a learned reflectance function R:

[Equation: image intensity I(x, y) expressed through the reflectance function R, with gradient and Hessian terms]

and shadow projection:

[Equation: shadow projection written as a matrix transformation involving M_s]

where M_s encodes projection geometry. High-resolution tactile skins have been realized using magnetorheological elastomers⁴⁹, EtherCAT sensor arrays⁵⁰, and auxetic Hall-effect materials⁵¹. Systems such as DiGeTac¹³ unify gesture, tactile, and distance signals through modular pipelines. Filtering of time-of-flight distance data is modeled as:

[Equation: ARMA model of the time-of-flight distance signal]

followed by neural gesture classification and CNN-based contact localization. Safety control is enforced by scaling the robot velocity with separation d. These modules enable robust human–robot collaboration. Multimodal transformers further integrate tactile, visual, and textual signals. The TVT-Transformer⁶ aligns modality-specific representations (q_i, k_i, v_i) into joint matrices:

$$Q = [q_{vis}, q_{tac}, q_{text}], \quad K = [k_{vis}, k_{tac}, k_{text}], \quad V = [v_{vis}, v_{tac}, v_{text}]$$

enabling cross-modal self-attention for unified understanding (Figure 7).
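The joint-matrix attention described above can be illustrated with a minimal sketch. It assumes that the per-modality queries, keys, and values are stacked token-wise and passed through standard scaled dot-product attention; the function names and toy values are hypothetical and are not taken from the TVT-Transformer implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(q_parts, k_parts, v_parts):
    """Scaled dot-product attention over tokens stacked from all modalities.

    Each argument is a list of per-modality token lists, for example
    [vis_tokens, tac_tokens, text_tokens]; every token is a list of floats.
    """
    # Stack the modality blocks: Q = [q_vis, q_tac, q_text], same for K and V.
    Q = [tok for part in q_parts for tok in part]
    K = [tok for part in k_parts for tok in part]
    V = [tok for part in v_parts for tok in part]
    d = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Every query scores every key, regardless of source modality.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))]
        )
    return outputs, weights


# Toy example: one 2-D token per modality (illustrative values only).
vis, tac, text = [[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]
parts = [vis, tac, text]
fused, attn = joint_attention(parts, parts, parts)
```

Each row of `attn` shows how strongly one token attends to tokens from every modality, which is the sense in which the attention is cross-modal.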

Physiological sensing advances emotional alignment. Heinisch et al.⁵ synchronize physiological and audiovisual signals for affect modeling. Electromyography-based systems⁵²,⁵³ decode gestures and silent speech, while MetaSonic⁵⁴ improves spatial perception through acoustic localization. Large-scale datasets such as AgiBot World Colosseo⁵⁵ support multimodal training with over one million tactile-visual trajectories. Together, these advances mark a transition from unimodal perception to integrated understanding of social, physical, and internal states. Improved sensing fidelity enables robots not only to measure contact but to interpret hesitation, tension, or discomfort, responding with sensitivity and adaptive control, which are key foundations for the seamless multimodal frameworks discussed in later sections.

Multimodal fusion and representation learning
Fusion enables the transformation of distributed sensory streams into coherent, context-aware control. Approaches are typically grouped as early, late, or hybrid fusion. Early fusion integrates features at the input stage, supporting synchronized cues such as speech–gesture or facial–prosodic alignment. The Multimodal Transformer by Tsai et al.¹⁴ exemplifies attention-based early integration of visual, acoustic, and textual signals. Xue et al.⁵⁶ extended this with residual coupling of facial landmarks and speech features, improving robustness under lighting variation, while Hu et al. proposed MMSERNet⁵⁷, a multiscale fusion network using dilated convolution and residual fusion to merge MFCC, spectrogram, and lexical features for emotion recognition. The fused output is expressed as:

[Equation: fused output computed with ReLU and batch normalization operations]

Late fusion, in contrast, combines decisions after independent processing, offering higher tolerance to asynchronous or noisy inputs. Examples include decision-level combination of gaze, inertial, and motion signals in underwater HRI⁵⁸,⁵⁹. Hybrid fusion integrates both approaches, mapping heterogeneous modalities to shared embedding spaces while preserving temporal specificity. The Multimodal Brain–Computer Fusion Network⁶⁰ aligns EEG and visual features before classification, while cross-modal designs for 3D pose estimation merge inertial and monocular data⁶¹. Recent frameworks combine convolutional, recurrent, and transformer layers with attention alignment⁶²,⁶³, dynamically weighting modalities by context or task relevance⁶⁴,⁶⁵,⁶⁶. The choice between early, late, and hybrid fusion is not one-size-fits-all. Early fusion tends to work well when modalities are tightly synchronized and noise levels are low, allowing the model to capture rich cross-modal relationships from the start. Late fusion, by contrast, is generally more robust in messy, asynchronous real-world conditions because each modality can be processed independently before decisions are combined. Hybrid approaches, which now appear in roughly 43% of recent studies, strike a useful middle ground, especially for affective and embodied interactions, by using attention to dynamically balance inputs, though they do come with added computational demands. Ultimately, the best strategy depends on the specific task, noise profile, and hardware constraints, suggesting that future systems might benefit from switching fusion modes on the fly.
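The contrast between early and late fusion can be made concrete with a small sketch. It assumes a single linear scorer for the feature-level case and a plain average for the decision-level case; the function names, modality names, and numbers are illustrative and do not correspond to any reviewed system.

```python
def early_fusion(features, weights):
    """Feature-level fusion: concatenate modality features, then score once."""
    # A fixed (sorted) modality order keeps the joint vector aligned with weights.
    joint = [x for name in sorted(features) for x in features[name]]
    if len(joint) != len(weights):
        raise ValueError("weights must match the concatenated feature length")
    return sum(w * x for w, x in zip(weights, joint))


def late_fusion(scores):
    """Decision-level fusion: average the scores of modalities that reported."""
    present = [s for s in scores.values() if s is not None]
    if not present:
        raise ValueError("no modality produced a decision")
    return sum(present) / len(present)


# Illustrative values only.
joint_score = early_fusion({"audio": [1.0, 2.0], "vision": [3.0]}, [0.5, 0.5, 1.0])

# Vision dropped out (None); late fusion still returns a decision.
fallback_score = late_fusion({"audio": 0.8, "vision": None, "gaze": 0.4})
```

In this sketch the early-fusion scorer needs every feature to be present, whereas the late-fusion average degrades gracefully when a modality is missing, which mirrors the robustness trade-off discussed above.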

The complementarity between modalities also varies strongly with context. Vision paired with haptics is particularly effective in close-range physical tasks, such as grasping or object handover, where tactile feedback compensates for visual occlusion, lighting changes, or distance limitations and supplies precise force and contact information that audio cannot provide. Conversely, vision combined with audio performs better in distant or social settings, such as emotion recognition or dialogue in noisy environments, where prosody and tone carry intent and affect when visual cues are unavailable or unreliable.

Beyond the classic early, late, and hybrid framework, deep learning has enabled more granular categorizations. Attention-based fusion allows models to learn dynamic modality importance per input, effectively creating task- and context-specific integration paths without rigid stage boundaries. Learned fusion strategies go further by treating the entire fusion process as an end-to-end differentiable module, letting the network discover optimal combination patterns directly from data. These developments move the paradigm from predefined rules toward fully data-driven, adaptive architectures that better address the complexity of real-time multimodal HRI.
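A minimal sketch of attention-style modality weighting follows. It assumes that each modality already has a scalar relevance score (in a trained model this would be produced by a learned network) and that all embeddings share one dimension; the names and values are hypothetical.

```python
import math


def attention_weights(relevance):
    """Softmax over per-modality relevance scores."""
    m = max(relevance.values())
    exps = {name: math.exp(score - m) for name, score in relevance.items()}
    z = sum(exps.values())
    return {name: e / z for name, e in exps.items()}


def attention_fusion(embeddings, relevance):
    """Weighted sum of same-sized modality embeddings."""
    w = attention_weights(relevance)
    dim = len(next(iter(embeddings.values())))
    fused = [sum(w[name] * embeddings[name][j] for name in embeddings)
             for j in range(dim)]
    return fused, w


# Illustrative input: vision is judged more relevant than audio for this sample.
fused_vec, weights = attention_fusion(
    {"vision": [1.0, 0.0], "audio": [0.0, 1.0]},
    {"vision": 2.0, "audio": 0.0},
)
```

Because the weights are recomputed for every input, the same module can lean on vision in one moment and on audio in the next, without a fixed fusion stage.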

Vision-language grounding
The rapid growth of large language models has expanded the reach of multimodal reasoning, particularly through vision–language models (VLMs). These architectures align linguistic and visual information to enable robots to interpret commands, perceive scenes, and generate context-sensitive actions. Foundational frameworks such as LXMERT⁶⁴ and CLIP⁶⁶ established joint embeddings for image–text alignment. Flamingo⁶⁵ introduced few-shot multimodal reasoning, while Maggio’s Clio³⁹ dynamically links instructions to scene graphs via CLIP semantics. Gao et al.⁴¹ developed LAMARS for anticipatory planning based on multimodal prediction, and Xie et al.⁶⁷ applied spatiotemporal convolutional models for efficient gesture comprehension. Arenas et al. introduced prompt-based customization for personalized interaction styles⁶⁸. These systems allow commands such as "pick up the red cup" to be grounded directly in scene semantics.
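Grounding a command in a joint embedding space can be sketched as a nearest-neighbor lookup. The example below assumes that a text encoder and an image encoder have already mapped the command and each detected object into the same vector space; the three-dimensional vectors are made-up stand-ins, not outputs of CLIP or any other model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length embedding")
    return dot / (norm_a * norm_b)


def ground_command(command_vec, object_vecs):
    """Return the object whose embedding is most similar to the command."""
    return max(object_vecs, key=lambda name: cosine(command_vec, object_vecs[name]))


# Hypothetical embeddings for the command "pick up the red cup"
# and for two objects detected in the scene.
command = [1.0, 0.0, 0.1]
scene = {
    "red cup": [0.9, 0.1, 0.0],
    "blue plate": [0.0, 0.2, 0.9],
}
target = ground_command(command, scene)
```

A deployed system would also need a similarity threshold so that the robot can ask for clarification when no object matches well.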

Integration with spatial reasoning further enhances robotic navigation and interpretation. Wilson’s LatentBKI³⁸ grounds instructions in voxel-based spatial maps, while Nwankwo et al.⁴⁰ fuse large language models with VLM reasoning to interpret dialogue acts under visual uncertainty. Event-based segmentation and asynchronous perception methods add efficiency to dynamic tasks. Collectively, VLMs act as bridges between symbolic instruction and embodied understanding, providing the cognitive substrate for flexible interaction.

Domain-specific applications
In healthcare and eldercare, multimodal communication enables safe, intuitive, and empathetic assistance. Core functions—fall detection, daily activity monitoring, emergency response—rely on posture, voice, and facial cues³. Hierarchical control models improve motion precision in arm-assistive robots¹, while proprioceptive feedback guides cooperative dressing assistants⁶⁹. Object handover, depicted in Figure 8, exemplifies synchronized perception and motor coordination⁷⁰. Emotional responsiveness, studied through systems like LOVOT, fosters user trust but raises ethical concerns regarding overdependence.

Social robots require fluid multimodal responsiveness to sustain empathy and trust in domestic and public settings. Figure 9 outlines a typical architecture integrating sensing, planning, and socio-emotional reasoning layers. Large language models integrated with perception allow open-ended dialogue modulated by gaze and tone⁴⁰. Pediatric studies show that expressive movement and prosody reduce anxiety in children⁷¹, while surveys highlight nonverbal behaviors as determinants of engagement⁸. Proactive models such as RoboAgent³⁷ and Sub-Prior-guided Ant Colony Optimization⁷² enable joint exploration of shared goals. Systems capable of expressing internal states (ready, confused)⁷³ enhance transparency and coordination, while adaptive social agents mediate group collaboration⁷⁴,⁷⁵.

Educational robots: In education, multimodal HRI promotes motivation and learning by leveraging embodiment and social signaling⁷⁶. Combining gestural, facial, and verbal cues enhances accessibility and engagement⁶,⁹,⁷⁷,⁷⁸. Emotional adaptivity supports younger learners: systems that sense affective state and modulate speech and motion increase attention and vocabulary⁷,⁷⁹. Integration with virtual reality improves immersion but challenges scalability⁸⁰. Collectively, these systems shift the robot’s role from instructor to emotionally attuned collaborator.

Personalization ensures relevance and trust in multimodal systems⁷¹,⁸¹,⁸². Ethical personalization balances autonomy and empathy, tailoring responses through participatory feedback. Maaz et al.⁸³ demonstrated affective tutoring guided by facial and cognitive cues, while Gomez-Izquierdo et al.⁸⁴ modeled user preferences for synchronized behavior. Personalized adaptation can be formalized as:

$$U(u, c) = \sum_i w_i \, f_i(u, c)$$

where w_i are learned preference weights. Implementations such as RFIS achieve real-time person re-identification and gesture recognition at the edge⁸⁵, while proprioceptive touch interfaces¹⁰ enable intuitive physical communication. Social perception effects further modulate personalization outcomes, where narrator identity shifts user response timing and utterance length⁸⁶. Personalization thus transforms users from passive recipients into co-designers, shaping interaction through mutual adaptation.
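The weighted-sum utility above can be sketched in a few lines. The feature functions, dictionary keys, and weight values below are hypothetical illustrations of f_i(u, c) and w_i; they are not drawn from any of the cited systems.

```python
def utility(weights, features, user, context):
    """Weighted sum of feature functions f_i(user, context)."""
    if len(weights) != len(features):
        raise ValueError("one weight per feature function is required")
    return sum(w * f(user, context) for w, f in zip(weights, features))


# Hypothetical feature functions, each returning a value in [0, 1].
def pace_match(user, context):
    """1.0 when the robot's speech pace equals the user's preferred pace."""
    return 1.0 - abs(user["preferred_pace"] - context["speech_pace"])


def proximity_comfort(user, context):
    """1.0 when the robot keeps at least the user's preferred distance."""
    return 1.0 if context["distance_m"] >= user["min_distance_m"] else 0.0


weights = [0.7, 0.3]  # assumed to have been learned from user feedback
user = {"preferred_pace": 0.5, "min_distance_m": 1.0}
context = {"speech_pace": 0.5, "distance_m": 1.2}
score = utility(weights, [pace_match, proximity_comfort], user, context)
```

One possible use is to evaluate the score for several candidate behaviors and select the highest, although the formula itself does not prescribe a selection rule.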

Although much of the reviewed work remains at the prototype stage, several multimodal HRI techniques have reached commercial or industrial use. SoftBank's Pepper robot, for instance, integrates speech, gesture, and facial expression recognition for customer service in retail and for teaching support in educational settings⁸⁷. Toyota's Human Support Robot employs vision-gesture fusion for assistive tasks in homes and healthcare facilities⁸⁸. These commercial implementations often simplify fusion strategies, typically using late or hybrid approaches with rule-based fallbacks to ensure reliability and low cost, underscoring the persistent gap between advanced academic models and scalable industrial solutions. Bridging this gap will require greater emphasis on edge deployment and rigorous real-world validation.

Challenges to seamless communication in HRI
Despite advances in multimodal sensing and fusion, sustaining seamless human–robot communication outside controlled settings remains difficult. Models that excel in labs often degrade amid noise, delay, occlusion, shifting intent, or resource limits. The literature converges on six intertwined barriers: vision–language integration, real-time synchronization, multimodal robustness, semantic precision, dataset coverage, and deployment constraints.

Vision-language integration
Visual understanding underpins grounded language and action. Foundational steps improved spatial consistency and abstraction through visual–inertial SLAM without manual initialization⁸⁹, re-identification plus sensor fusion for occlusion-robust tracking¹⁶, lane graph prediction⁹⁰, semi-supervised segmentation that aligns labels with navigation utility, and egocentric video with SLAM and large-scale segmentation for traversability²⁶. Uncertainty-aware mapping remains central: uPLAM fuses probabilistic localization with panoptic segmentation for dynamic scenes⁹¹, while Clio adaptively builds CLIP-based hierarchical 3D scene graphs for task-sensitive semantics³⁰. Grounding approaches progressed from structured descriptions and perception APIs⁹²,⁹³ to multimodal encoders and decoders⁹⁴,⁹⁵.

Instruction following in open environments integrates perception and reasoning. RobotGPT³⁵ and GroundingGPT³⁶ interpret commands via coupled visual–linguistic inference; SayPlan, LFG, and LLM-Planner align navigation and planning to grounded language³⁴,⁴⁴. To reduce hallucination and enforce semantic consistency, GLAM and VTimeLLM introduce alignment objectives⁹⁶,⁹⁷, while adaptive frameworks validate LLM outputs against perception⁹⁸,⁹⁹. Timing critically shapes legibility: optimized gesture timing improves predictability⁹; reinforcement learning reduces audiovisual lag in emotion pipelines²; gesture entrainment increases perceived fluency¹⁰⁰; and reviews consolidate LSTM, DTW, and GAN tools for alignment¹⁰¹. Overall, seamlessness demands a latency-aware loop across perception, fusion, and response, where mimicry³⁰ and cerebellar-inspired prediction¹⁰² coordinate timing, as also illustrated by the system-level timing loops in Figure 10.

Real-time perception across modalities
Fluency depends on bounded end-to-end delay across heterogeneous streams. TACTO delivers high-resolution tactile feedback with low latency for responsive manipulation. Tendon-driven hands achieve fast interactive play through event vision and lightweight inference¹⁰³, as shown in Figure 11. For vision–language–action, coupling CLIP with GraspNet accelerates grasping in clutter¹⁰⁴. Transformers support rapid social haptic gesture classification¹¹. Edge-optimized audiovisual models sustain sub-second inference²⁴. Precise head-movement timing improves engagement¹⁰⁵. Together, these results argue for synchronized fusion, buffer policies tailored by modality, and lightweight attention to preserve temporal co-adaptation. Acceptable latency varies significantly by task domain. In social dialogue or emotion recognition, delays under 200–300 ms maintain perceived fluency and natural interaction. Assistive tasks like object handover or fall detection demand tighter constraints (50–150 ms) to ensure safety and responsiveness. Teleoperation or collaborative manipulation often requires sub-100 ms latency to prevent operator disorientation or motion sickness, while navigation or monitoring applications can tolerate 300–500 ms without critical impact.
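The task-dependent latency ranges quoted above can be turned into a simple admissibility check. The sketch below takes the upper end of each quoted range as a hard budget, which is a simplifying assumption; the domain keys and function names are hypothetical.

```python
# Upper latency bounds in milliseconds, taken from the ranges quoted in the text.
LATENCY_BUDGET_MS = {
    "social_dialogue": 300,
    "assistive": 150,
    "teleoperation": 100,
    "navigation": 500,
}


def within_budget(domain, measured_ms):
    """True if a measured end-to-end latency stays under the domain's budget."""
    return measured_ms < LATENCY_BUDGET_MS[domain]


def admissible_domains(measured_ms):
    """All domains whose budget the measured latency satisfies."""
    return sorted(d for d, budget in LATENCY_BUDGET_MS.items() if measured_ms < budget)


# Example: a pipeline measured at 180 ms end to end (illustrative value).
ok_for = admissible_domains(180)
```

Under these assumed budgets, a 180 ms pipeline would suit dialogue and navigation but not handover or teleoperation.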

Temporal and semantic processing challenges
Misalignment and latency erode trust and task efficiency, especially in distributed teleoperator–robot–human systems¹⁰⁶. Perceptual and interface strategies help users tolerate delay: bodily gestures and non-lexical fillers¹⁰⁷, visual overlays and adaptive cues¹⁰⁸,¹⁰⁹,¹¹⁰,¹¹¹. Control-level prediction shortens response gaps¹⁰². Parallel dialogue pipelines overlap filler generation, speech synthesis, and prompt editing to reduce perceived lag, as illustrated in Figure 12¹¹². System studies decompose cumulative latency across capture, encode/decode, networks, decision, and actuation¹¹³. As shown in Figure 13, latency is introduced through sensor capture, video encoding and decoding, network transmission, operator processing, and vehicle actuation, forming a complex, bidirectional feedback loop. Syntalos provides millisecond synchronization for closed-loop experiments¹¹⁴. In affective exchanges, real-time facial mimicry enhances synchrony²³. Delay-sensitive domains such as remote surgery show that anchoring haptics against delayed vision supports precision and responsiveness¹¹⁵. As shown in Figure 14, anchoring haptic feedback led to better performance and a stronger sense of responsiveness.
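The stage-wise decomposition of cumulative latency can be sketched as follows. The stage names follow the list above, while the millisecond values are invented for illustration and do not come from any cited measurement.

```python
def latency_breakdown(stage_ms):
    """Return total latency, per-stage shares, and the slowest stage."""
    total = sum(stage_ms.values())
    if total <= 0:
        raise ValueError("total latency must be positive")
    shares = {stage: ms / total for stage, ms in stage_ms.items()}
    bottleneck = max(stage_ms, key=stage_ms.get)
    return total, shares, bottleneck


# Illustrative per-stage latencies in milliseconds.
stages = {
    "capture": 15,
    "encode_decode": 40,
    "network": 60,
    "decision": 25,
    "actuation": 20,
}
total_ms, shares, bottleneck = latency_breakdown(stages)
```

A breakdown of this kind indicates which stage to optimize first; in the toy numbers the network stage dominates.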

Fine-grained affect and intention recognition remain challenging. Microexpressions, prosody, gaze, and anticipatory motion often occur briefly and asynchronously. Emo produces anticipatory facial expressions aligned to user state¹¹⁶. Body language alone conveys intent where face or voice is unavailable¹¹⁷. Pepper-based multimodal displays combine verbal classification with expressive gestures and emoji for emotional synchrony¹¹⁸. Online RL strengthens fusion of facial and vocal streams², while lightweight mimicry supports edge deployment³⁰. Open issues include group affect, cultural variation, and temporal affect drift.

Generalization in open domains is also limited. Reinforcement learning from human feedback (RLHF) exhibits sparse rewards and brittleness under out-of-distribution inputs¹¹⁹. RoboAgent blends spatial reasoning with language grounding for cross-task transfer²⁹. Reasoning through Action-free Data (RAD) leverages passive video and language for zero-shot capabilities without manual labels¹²⁰. Perceiver-Actor grounds object properties to manipulate unfamiliar items¹²¹. Continual alignment between perception and semantics in RobotGPT and GroundingGPT improves adaptive reasoning but still faces real-time feedback challenges³⁵,³⁶.

Multimodal robustness and environmental variability
Robustness must span signal integrity and behavioral consistency. SimuMuHRI injects modality-specific noise and dropout to test resilience¹²². Physically grounded structured-light simulation improves RGB-D reliability under lighting and occlusion¹²³,¹²⁴. Latency–haptics coupling exposes sensitivity in remote manipulation¹²⁵. Principles of redundancy and thermal resilience from safety-critical energy systems inform fault-tolerant HRI¹²⁶,¹²⁷.
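The idea of stress-testing a system with modality-specific noise and dropout can be sketched in a few lines of Python. This is a generic illustration under stated assumptions and does not reproduce the SimuMuHRI interface: the function name `perturb`, the dictionary-based sample layout, and the noise and dropout parameters are all hypothetical.

```python
import random


def perturb(sample, noise_std, dropout_p, rng):
    """Return a perturbed copy of a multimodal sample.

    sample:    dict mapping modality name -> list of float features
    noise_std: dict mapping modality name -> Gaussian noise std (default 0)
    dropout_p: dict mapping modality name -> probability of dropping the
               whole modality for this sample (default 0)
    A dropped modality is represented as None.
    """
    out = {}
    for modality, values in sample.items():
        # Drop the entire modality with the configured probability.
        if rng.random() < dropout_p.get(modality, 0.0):
            out[modality] = None
            continue
        std = noise_std.get(modality, 0.0)
        if std > 0:
            out[modality] = [v + rng.gauss(0.0, std) for v in values]
        else:
            out[modality] = list(values)
    return out


# Hypothetical two-modality sample; values are placeholders.
rng = random.Random(0)
clean = {"audio": [0.1, 0.2, 0.3], "vision": [1.0, 2.0]}
stressed = perturb(
    clean,
    noise_std={"audio": 0.05},
    dropout_p={"vision": 0.5},
    rng=rng,
)
print(stressed)
```

Sweeping the noise level and dropout probability per modality, and recording task performance at each setting, gives a rough resilience curve for a fusion model.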

Behavioral coherence is equally vital. Voice–appearance mismatch reduces trust¹²⁸, and erratic but correct motion undermines cooperation¹²⁹. Observer-based adaptive force control and robust interaction controllers sustain stability under structured and unstructured uncertainties¹³⁰,¹³¹. Dataset coverage also constrains progress. Multirobot perception datasets (OPV2V¹³², CoPeD¹³³, CSE¹³⁴, and AgiBot World Colosseo⁵⁵) expand collaborative sensing across simulation and field. AV-HRI resources (CHiME-5¹³⁵ with the subsequent CHiME-8 challenge¹³⁶, AVA-ActiveSpeaker⁴⁵, AFFECT-HRI⁵, and SAVEn-Vid⁴⁶) enable ASR, speaker activity, affect, and long-context instruction following. Large-scale social benchmarks (HumorHRI⁴⁷, THÖR-MAGNI¹³⁷, RW4T¹³⁸, and InViG¹³⁹) target humor, navigation, teaming, and interactive grounding. Despite this breadth, many datasets are domain-specific or scripted, with limited temporal depth for evaluating seamlessness, as summarized in Table 3.

Evaluation of seamlessness
Conventional metrics such as accuracy and latency do not fully capture co-adaptation, mutual prediction, or social fluency. New evaluators rate contextual coherence and collaboration¹⁴⁰, integrate user feedback and situational signals¹⁴¹, and add predictability, coordination, and comfort to physical HRI⁸⁴. Team frameworks quantify shared anticipation⁷⁴, while digital twins track trust calibration and team fluency¹⁴². A unified benchmark that fuses temporal synchronization, affective alignment, and user-centered fluency remains an open need.

Efforts toward standardization in multimodal HRI evaluation remain limited and fragmented. Notable initiatives such as the CHiME challenges¹³⁵,¹³⁶ have advanced benchmarks for audio-visual speech recognition, particularly in terms of noise robustness. Datasets like THÖR-MAGNI¹³⁷ and RW4T¹³⁸ provide shared resources for motion capture and teaming behaviors. However, no widely adopted unified protocol yet exists for cross-modal temporal synchronization, interaction fluency, or affective alignment across diverse modalities and domains. This absence of standardized metrics continues to hinder reproducibility and comparability, underscoring the need for community-driven benchmarks.

From lab to real-world deployment
Performance often degrades in unstructured settings due to sensory irregularities, shifting intent, and computing bottlenecks¹⁴³. This performance degradation highlights a persistent divide between laboratory and real-world settings. In controlled environments, hybrid and attention-based fusion excel, delivering high accuracy and low latency for tasks like object manipulation or scene understanding. Real-world deployments, however, tell a different story. Assistive robots in eldercare, companion systems in homes, and educational tools in classrooms routinely contend with noise, occlusion, lighting changes, and unpredictable user behaviors that erode effectiveness. Late fusion approaches tend to prove more reliable under such variable, asynchronous conditions, while hybrid models, despite their strengths in affective and contextual adaptation, can be constrained by computational demands on edge devices. Many successful field implementations still incorporate simpler redundancy or rule-based safeguards for reliability, underscoring that moving from promising lab demonstrations to robust everyday performance remains a formidable challenge. Ecologically valid testing is essential to identify which techniques truly succeed beyond controlled experiments.

Adaptive robust controllers maintain force accuracy amid disturbance¹³¹. LLM-driven collaborative planning updates goals as user intent evolves¹⁴⁴. Visual overlays and latency-aware feedback help sustain operator awareness under degraded links¹⁰⁷,¹⁰⁹. Lightweight supervision with LoRa and digital shadows enables monitoring in low-bandwidth fields¹⁴⁵. Experience from space robotics highlights that autonomy must co-evolve with human cognition under uncertainty and delay¹⁴⁶.
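One reason late fusion can tolerate variable, asynchronous conditions is that each modality produces its own decision, so a missing stream can simply be left out of the combination. The Python sketch below illustrates this with a weighted average whose weights are renormalized over the modalities that are actually present. It is a simplified illustration, not a method from the reviewed papers: the function name `late_fuse`, the modality names, the labels, and all weights and scores are assumptions.

```python
def late_fuse(scores, weights):
    """Combine per-modality class probabilities by weighted averaging.

    scores:  dict mapping modality -> dict(label -> probability),
             or None when that modality is unavailable
    weights: dict mapping modality -> non-negative reliability weight
    Weights are renormalized over the available modalities, so a dropped
    stream does not distort the remaining evidence.
    """
    available = {m: s for m, s in scores.items() if s is not None}
    if not available:
        raise ValueError("no modality available for fusion")
    total_weight = sum(weights[m] for m in available)
    if total_weight <= 0:
        raise ValueError("available modalities have zero total weight")

    fused = {}
    for modality, probs in available.items():
        w = weights[modality] / total_weight
        for label, p in probs.items():
            fused[label] = fused.get(label, 0.0) + w * p
    return fused


# Hypothetical gesture-recognition scores; the audio stream is missing.
scores = {
    "vision": {"wave": 0.8, "point": 0.2},
    "audio": None,
    "touch": {"wave": 0.4, "point": 0.6},
}
weights = {"vision": 0.5, "audio": 0.3, "touch": 0.2}

fused = late_fuse(scores, weights)
decision = max(fused, key=fused.get)
print(fused, decision)
```

Hybrid and attention-based models can learn richer cross-modal interactions than this fixed weighting, which is consistent with the trade-off between robustness and contextual adaptation described above.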

Resource constraints shape feasibility. Sound-based affect and touch recognition run under 1 MB and 0.7 GFLOPs for low-power platforms¹². Adding social features may improve presence but risks overload when delays rise¹⁴⁷. Scheduling and feature selection must balance cognitive demand with continuity¹⁴⁸. Latency mitigation spans perception, control, and networks¹¹³. Users frequently prefer transparent, stable interaction to maximal task efficiency, reinforcing that seamless HRI must prioritize human comfort alongside technical capability¹⁴⁹,¹⁵¹. Across domains, real-world deployments show distinct preferences: social and educational applications often rely on hybrid vision-audio fusion for affective dialogue, while assistive and collaborative tasks favor vision-haptics combinations with late or early fusion for precise physical interaction. These patterns align with the summarized strategies, which prioritize late fusion for robustness in variable settings and hybrid or attention-based fusion for richer contextual adaptation, though simplification for reliability remains common.

Conclusions

This review has several limitations. Restricting the search to English-language publications in the Web of Science Core Collection may have introduced language bias and excluded relevant non-English work. Focusing on peer-reviewed articles from 2015 to 2025 could also reflect publication bias, potentially overlooking preprints, grey literature, or emerging unpublished results. Manual screening of the 148 papers, though guided by explicit criteria, inevitably involves some subjectivity. Finally, the thematic interpretatio...

Disclosures

The authors declare that they have no conflicts of interest.

Acknowledgements

This work was supported by Universiti Sains Malaysia, Bridging Grant with Project No: R501-LR-RND003-0000001342-0000.
