This review synthesizes recent advances in deep learning, multimodal sensing, and integration strategies that enable seamless, adaptive, and human-centered communication between humans and robots.
Method Article
Seamless multimodal human-robot communication has become essential as social, assistive, and educational robots move into everyday settings like homes, healthcare facilities, and classrooms, where they need to integrate speech, gesture, gaze, facial expressions, and tactile cues with high precision, low latency, and real-world robustness. This review systematically examines how current techniques achieve real-time, reciprocal multimodal human-robot interaction (HRI), focusing on fusion strategies, system architectures, and applications in specific domains. We searched the Web of Science Core Collection for English-language empirical studies from 2015 to 2025, selecting 148 papers that address communicative integration. The most important insights include a clear shift toward hybrid and attention-based fusion since 2022 (about 43% of approaches), which better handles temporal asynchrony and embodied interactions than earlier methods; widespread inconsistency in evaluation metrics that hinders cross-study comparisons; and a persistent gap between strong laboratory performance and weaker real-world robustness under noise, occlusion, or user variability. Key challenges include real-time processing, semantic alignment, data efficiency, deployment robustness, and the under-explored integration of tactile signals with affect. Looking ahead, the review suggests prioritizing adaptive fusion policies, standardized benchmarks for synchronization and fluency, and continual learning to enable user-personalized adaptation. Ultimately, it aims to guide the development of more human-centered robotic agents that engage collaboratively and meaningfully and that are socially acceptable in daily life.
Robots have evolved from industrial tools in structured settings to socially embedded agents in homes, hospitals, and classrooms. They now assist in education, therapy, and companionship, reflecting a shift from task-oriented automation to user-centered, adaptive interaction. At the core of this transition lies multimodal communication, enabling robots to perceive and respond to human cues—speech, gaze, gestures, facial expressions, and touch—with the temporal precision needed for natural exchanges in dynamic environments. This emphasis on coordination and responsiveness aligns with central concerns in human–computer interaction and cognitive science. In this review, seamless multimodal communication refers to interactions that feel fluid and intuitive to users, characterized by sub-second response times (typically less than 300 ms), minimal perceptible delays or mismatches, and robust adaptation to real-world variability. This distinguishes it from traditional command-based or unimodal systems, where timing errors, modality failures, or rigid responses disrupt natural flow.
Within human-robot interaction (HRI), this review focuses on human-robot communication, defined as the reciprocal, real-time exchange of multimodal signals between humans and robots. While HRI also encompasses planning, control, and safety, communication concerns the synchronization of perception and action across sensory channels. Early command-based systems prioritized efficiency over engagement, often producing rigid or delayed behavior. Contemporary work pursues adaptive, context-aware models that accommodate user state and emotion1,2. This evolution underscores the need for precise, intuitive, and personalized interaction frameworks.
Initial social and assistive robots commonly relied on scripted routines or single-modality inputs such as voice or touch, limiting flexibility3. Rule-based logic and slow processing frequently led to timing mismatches between gestures and speech, or to poor adaptation to user behavior. To address these issues, researchers employ low-latency fusion, multimodal sensory integration, and user-adaptive learning4,5. These strategies can be understood through three foundational communication principles widely recognized in HRI: complementarity (modalities compensate for each other), redundancy (overlap improves robustness), and synergy (convergence enhances naturalness).
Advances in sensing and learning have yielded socially responsive robots that integrate visual, auditory, tactile, and physiological inputs in real-time5,6. Rather than fixed scripts, these systems continuously adapt to user cues. Accessible programming interfaces and learning-from-demonstration methods allow educators and therapists to customize behaviors for caregiving and instruction, where precision, responsiveness, and adaptability are essential.
Communication patterns in social HRI can be grouped into parallel, complementary, and interactive modes (Figure 1). In parallel interaction, humans and robots act independently with minimal exchange. Complementary interaction introduces structured prompts or event-based support. The most advanced, interactive mode involves continuous bidirectional adjustment across channels such as speech, movement, and gaze7.
Literature selection and analysis
Systematic review methodology: We conducted a systematic literature search in the Web of Science Core Collection, targeting peer-reviewed journal articles and full conference proceedings from major publishers (IEEE, ACM, Springer, Elsevier, Wiley, Taylor & Francis). The Boolean query was: (("human-robot interaction" OR HRI) AND (multimodal OR "sensor fusion" OR gesture OR gaze OR speech OR tactile) AND ("real-time" OR "low-latency" OR adaptive OR robust)). Inclusion criteria were: English-language publications from January 2015 to 2025—a period capturing the shift from rule-based, single-modality systems to deep learning architectures emphasizing low-latency, adaptive, and robust interaction—with a focus on multimodal integration techniques for communicative HRI and empirical validation (quantitative or qualitative). We excluded studies limited to unimodal interaction, those addressing robot control without communicative intent, and non-empirical works (e.g., purely theoretical or simulation-only). The initial search returned a substantial corpus. After removing duplicates, we screened titles and abstracts against the inclusion criteria, then conducted full-text reviews to assess relevance and contribution to reliable multimodal fusion. This multi-stage process, summarized in the PRISMA-style flowchart (Figure 2), yielded a final corpus of 148 papers that form the empirical foundation of this review.
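The Boolean query above is a conjunction of three OR-groups (topic, modality, and performance terms). As a rough illustration of that logic only, the sketch below applies it to a title or abstract string. It assumes plain case-insensitive whole-phrase matching, which is simpler than the matching rules of Web of Science itself; the function name `matches_query` and the term-list names are hypothetical.

```python
import re

# Term groups copied from the Boolean query in the text.
TOPIC = ["human-robot interaction", "hri"]
MODALITY = ["multimodal", "sensor fusion", "gesture", "gaze", "speech", "tactile"]
PERFORMANCE = ["real-time", "low-latency", "adaptive", "robust"]


def _has_any(text, terms):
    # True if any term occurs as a whole word or phrase (assumed matching rule).
    return any(
        re.search(r"(?<![a-z0-9])" + re.escape(term) + r"(?![a-z0-9])", text)
        for term in terms
    )


def matches_query(record_text):
    # Lowercase and normalize en dashes so "human–robot" matches "human-robot".
    text = record_text.lower().replace("\u2013", "-")
    return (
        _has_any(text, TOPIC)
        and _has_any(text, MODALITY)
        and _has_any(text, PERFORMANCE)
    )
```

A record has to hit all three groups to pass, so a title such as "Multimodal HRI survey" would be rejected for lacking a performance term. The later screening stages (empirical validation, communicative intent) are judgment calls that a keyword filter like this cannot reproduce.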
Overview of selected literature: Figure 3 shows publication trends and disciplinary distribution, with Computer Science and Automation & Robotics dominating, alongside growing contributions from Behavioral Sciences and Neuroscience, which underscores the field's interdisciplinary trajectory. We thematically clustered the literature by technical contribution to multimodal integration (Table 1). This synthesis covered modalities, fusion methods, latency handling, robustness strategies, adaptability, and metrics. Studies from 2019 to 2021 predominantly used early or late fusion, often emphasizing gestures and tactile inputs. Since 2022, hybrid and attention-based architectures have gained favor, particularly for affective and embodied interaction. In this corpus, hybrid fusion accounts for approximately 43% of approaches, early fusion 29%, and late fusion 28%—a distribution suggesting improved tolerance for temporal asynchrony.
Although real-time capability is commonly claimed in the broader literature screened initially, only a limited number of studies detail specific mitigation strategies (e.g., buffered processing, predictive models, or sensor-level optimization). Robustness typically relies on filtering, redundancy, or rule-based fallbacks, whereas anticipatory adaptation remains rare. Evaluation practices are inconsistent, and standardized assessments of temporal synchronization or interaction fluency remain infrequent. The metrics vary so widely across studies that direct comparisons are often difficult. Accuracy, F1 scores, and latency figures are common, but they stem from different datasets, tasks, and environmental conditions; few papers adopt shared benchmarks for temporal alignment or noise robustness. Consequently, strong performance in clean lab setups frequently overstates what methods can achieve outside controlled settings. Developing consistent evaluation frameworks—perhaps including standardized latency tests under realistic variability—would make it far easier to judge which approaches truly advance the field.
Key gaps remain in tactile-affective integration (only six studies8,9,10,11,12,13 address this) and dynamic latency adjustment under user-driven temporal uncertainty. Several interesting tensions emerge from the literature. Hybrid fusion is often praised for handling temporal misalignment well14, yet truly anticipatory or adaptive behaviors remain rare, calling into question just how seamless that claimed real-time performance really is. At the same time, impressive progress in vision and speech-based affect recognition stands in stark contrast to the minimal work integrating tactile cues with emotion understanding. High accuracy in controlled experiments also tends to degrade in real-world conditions, underscoring ongoing challenges in generalization across environments and users. Table 2 summarizes representative architectural trends and evaluation approaches, informing the challenges discussed in subsequent sections.
This review offers a distinct contribution to multimodal HRI literature. Recent surveys, such as Zhao et al.14, provide broad overviews of perception-driven decision-making, while Wang et al.15 focus on 5G-enabled visual-tactile transmission. Both emphasize general frameworks or communication protocols, with limited discussion of real-time fusion trade-offs, metric comparability, or deployment challenges. In contrast, we offer a more targeted analysis: comparing fusion strategies under specific constraints, critiquing cross-study metric inconsistencies, and highlighting practical gaps—particularly the lab-to-field performance divide that prior surveys have largely overlooked.
As robots become fixtures in homes, classrooms, and healthcare settings, natural and adaptive communication is critical to their success. Increasingly viewed not as tools but as social partners, they must perceive, interpret, and respond to human signals across modalities. Systems that accurately capture user intent and deliver timely feedback enhance task performance, trust, and emotional comfort—particularly when engaging children, older adults, or individuals with disabilities. Achieving this requires continuous, intuitive interaction that aligns robot responses with ongoing human behavior rather than interrupting it. Such fluency demands multimodal inputs—voice, gaze, gesture, facial expression—integrated through mechanisms that mirror human communication itself14,16.
Perception and communication channels
Vision remains the foundation of multimodal human–robot interaction. Visual cues such as facial expression, gaze, posture, and gesture underpin intent recognition, emotion inference, and engagement tracking17,18,19,20,21. Early handcrafted methods using Haar detection22 or background subtraction23 often failed under occlusion or lighting changes16,21. Deep learning systems now dominate, employing CNNs and recurrent models with multi-camera inputs. Figure 4 illustrates the Hi-ROS tracking pipeline, which reduces motion error by 27% in real-time gesture monitoring.
Integration of gaze, arm, and head cues improves prediction of user actions17, while proprioceptive and depth fusion enhances precision24,25. Mutual gaze19 and biomimetic eye designs18 further demonstrate the role of subtle signals in trust and coordination. Figure 5 depicts the architecture from 26, where dual Leap Motion Controllers capture hand motion, fused through LSTM-RNN for robust surgical teleoperation. Recent multi-camera systems, such as Hi-ROS20 and RPF21,27, enhance person-following in unstructured settings. Figure 6 shows the adaptive ReID lifecycle that maintains tracking under occlusion by continuous classifier refinement. Complementary work on active labeling28 and semantic SLAM26 improves context awareness, extending perception from object recognition to behavioral understanding.
Language provides a parallel communicative channel. Early rule-based dialogue systems struggled with ambiguity, prompting the adoption of data-driven and transformer-based models29. Emotional and social responsiveness are now key: multimodal systems align facial and vocal signals for adaptive dialogue2,30. Reinforcement learning improves coordination between affective and acoustic cues5,31,32. Large language models such as GPT-333 and ChatGPT34 enable contextual reasoning and instruction following across tasks35,36,37. Their integration with vision and reward modeling38,39,40,41,42,43,44 supports socially aware, generalized interaction.
Auditory perception complements vision and language. Prosody, pitch, and rhythm communicate intent when visual cues are unreliable. Reinforcement-based feature extraction improves synchronization between speech and expression. Datasets such as AVA ActiveSpeaker45, AFFECT-HRI5, SAVEn-Vid46, HumorHRI47, and CHiME-5 enrich model robustness in noisy, affective environments. Pan et al.32 advance speech–lip synchronization for multi-speaker scenes, while multimodal fusion with tactile or physiological inputs informs adaptive behavior.
Embodied and physiological sensing
Tactile and physiological perception enhance social and physical awareness. Force, pressure, and biosignals enable inference of intent, stress, and engagement. Vision-based tactile simulators such as TACTO4 and neural mappers like FOTS48 generate high-resolution contact data at 300 fps, supporting low-latency control. FOTS models illumination and deformation via a learned reflectance function R:

and shadow projection:

where Ms encodes projection geometry. High-resolution tactile skins have been realized using magnetorheological elastomers49, EtherCAT sensor arrays50, and auxetic Hall-effect materials51. Systems such as DiGeTac13 unify gesture, tactile, and distance signals through modular pipelines. Filtering of time-of-flight distance data is modeled as:

followed by neural gesture classification and CNN-based contact localization. Safety control is enforced by scaling the robot velocity with separation d. These modules enable robust human–robot collaboration. Multimodal transformers further integrate tactile, visual, and textual signals. The TVT-Transformer6 aligns modality-specific representations (qi, ki, vi) into joint matrices:
$$Q = [q_{\text{vis}},\, q_{\text{tac}},\, q_{\text{text}}], \quad K = [k_{\text{vis}},\, k_{\text{tac}},\, k_{\text{text}}], \quad V = [v_{\text{vis}},\, v_{\text{tac}},\, v_{\text{text}}]$$
enabling cross-modal self-attention for unified understanding (Figure 7).
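To make the joint-matrix idea concrete, the following is a minimal sketch of standard scaled dot-product attention over one query, key, and value vector per modality (visual, tactile, textual). It is not the TVT-Transformer implementation: the projection layers, dimensions, and multi-head structure of that model are not specified here, and the function name `cross_modal_attention` is an assumption made for illustration.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(Q, K, V):
    """Scaled dot-product attention over per-modality rows.

    Q, K, V are lists of equal-length vectors, one row per modality,
    for example [visual, tactile, text].
    Returns (outputs, weights), where weights[i][j] is how much
    modality i attends to modality j.
    """
    d = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights
```

Each output row is a weighted mix of all modality values, so a tactile token can, for instance, draw on visual context when its own signal is ambiguous. The attention weights also give a simple way to inspect which modality dominated a given decision.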
Physiological sensing advances emotional alignment. Heinisch et al.5 synchronize physiological and audiovisual signals for affect modeling. Electromyography-based systems52,53 decode gestures and silent speech, while MetaSonic54 improves spatial perception through acoustic localization. Large-scale datasets such as AgiBot World Colosseo55 support multimodal training with over one million tactile-visual trajectories. Together, these advances mark a transition from unimodal perception to integrated understanding of social, physical, and internal states. Improved sensing fidelity enables robots not only to measure contact but to interpret hesitation, tension, or discomfort, responding with sensitivity and adaptive control, which are key foundations for the seamless multimodal frameworks discussed in later sections.
Multimodal fusion and representation learning
Fusion enables the transformation of distributed sensory streams into coherent, context-aware control. Approaches are typically grouped as early, late, or hybrid fusion. Early fusion integrates features at the input stage, supporting synchronized cues such as speech–gesture or facial–prosodic alignment. The Multimodal Transformer by Tsai et al.14 exemplifies attention-based early integration of visual, acoustic, and textual signals. Xue et al.56 extended this with residual coupling of facial landmarks and speech features, improving robustness under lighting variation, while Hu et al. proposed MMSERNet57, a multiscale fusion network using dilated convolution and residual fusion to merge MFCC, spectrogram, and lexical features for emotion recognition. The fused output is expressed as:

Late fusion, in contrast, combines decisions after independent processing, offering higher tolerance to asynchronous or noisy inputs. Examples include decision-level combination of gaze, inertial, and motion signals in underwater HRI58,59. Hybrid fusion integrates both approaches, mapping heterogeneous modalities to shared embedding spaces while preserving temporal specificity. The Multimodal Brain–Computer Fusion Network60 aligns EEG and visual features before classification, while cross-modal designs for 3D pose estimation merge inertial and monocular data61. Recent frameworks combine convolutional, recurrent, and transformer layers with attention alignment62,63, dynamically weighting modalities by context or task relevance64,65,66. No single fusion strategy fits every case. Early fusion tends to work well when modalities are tightly synchronized and noise levels are low, allowing the model to capture rich cross-modal relationships from the start. Late fusion, by contrast, is generally more robust in messy, asynchronous real-world conditions because each modality can be processed independently before decisions are combined. Hybrid approaches, which now appear in roughly 43% of recent studies, strike a useful middle ground, especially for affective and embodied interactions, by using attention to dynamically balance inputs, though they do come with added computational demands. Ultimately, the best strategy depends on the specific task, noise profile, and hardware constraints, suggesting that future systems might benefit from switching fusion modes on the fly.
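The structural difference between early and late fusion can be shown with a toy example. The sketch below is illustrative only: it uses hypothetical logistic scorers with hand-set weights in place of trained networks, and the `available` flag is an assumed mechanism for modelling a missing modality.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_score(features, weights, bias=0.0):
    # Toy stand-in for a trained classifier: logistic score in (0, 1).
    return sigmoid(sum(f * w for f, w in zip(features, weights)) + bias)


def early_fusion(modalities, joint_weights):
    """Concatenate all modality features, then apply one joint model."""
    joint = [f for feats in modalities for f in feats]
    return linear_score(joint, joint_weights)


def late_fusion(modalities, per_modality_weights, available=None):
    """Score each modality independently, then average the decisions
    of the modalities that are currently available."""
    if available is None:
        available = [True] * len(modalities)
    scores = [
        linear_score(feats, w)
        for feats, w, ok in zip(modalities, per_modality_weights, available)
        if ok
    ]
    if not scores:
        raise ValueError("no modality available")
    return sum(scores) / len(scores)
```

In this sketch, late fusion keeps producing a decision when one stream drops out, because each modality has its own scorer. The early-fusion model expects the full concatenated vector, which is one reason it is more sensitive to asynchrony and sensor failure. A hybrid design would share an embedding across modalities while keeping per-modality paths, at higher computational cost.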
The complementarity between modalities also varies strongly with context. Vision paired with haptics is particularly effective in close-range physical tasks, such as grasping or object handover, where tactile feedback compensates for visual occlusion, lighting changes, or distance limitations and supplies precise force and contact information that audio cannot provide. Conversely, vision combined with audio performs better in distant or social settings, such as emotion recognition or dialogue in noisy environments, where prosody and tone carry intent and affect when visual cues are unavailable or unreliable.
Beyond the classic early, late, and hybrid framework, deep learning has enabled more granular categorizations. Attention-based fusion allows models to learn dynamic modality importance per input, effectively creating task- and context-specific integration paths without rigid stage boundaries. Learned fusion strategies go further by treating the entire fusion process as an end-to-end differentiable module, letting the network discover optimal combination patterns directly from data. These developments move the paradigm from predefined rules toward fully data-driven, adaptive architectures that better address the complexity of real-time multimodal HRI.
Vision-language grounding
The rapid growth of large language models has expanded the reach of multimodal reasoning, particularly through vision–language models (VLMs). These architectures align linguistic and visual information to enable robots to interpret commands, perceive scenes, and generate context-sensitive actions. Foundational frameworks such as LXMERT64 and CLIP66 established joint embeddings for image–text alignment. Flamingo65 introduced few-shot multimodal reasoning, while Maggio’s Clio39 dynamically links instructions to scene graphs via CLIP semantics. Gao et al.41 developed LAMARS for anticipatory planning based on multimodal prediction, and Xie et al.67 applied spatiotemporal convolutional models for efficient gesture comprehension. Arenas et al. introduced prompt-based customization for personalized interaction styles68. These systems allow commands like pick up the red cup to be grounded directly in scene semantics.
Integration with spatial reasoning further enhances robotic navigation and interpretation. Wilson’s LatentBKI38 grounds instructions in voxel-based spatial maps, while Nwankwo et al.40 fuse large language models with VLM reasoning to interpret dialogue acts under visual uncertainty. Event-based segmentation and asynchronous perception methods add efficiency to dynamic tasks. Collectively, VLMs act as bridges between symbolic instruction and embodied understanding, providing the cognitive substrate for flexible interaction.
Domain-specific applications
In healthcare and eldercare, multimodal communication enables safe, intuitive, and empathetic assistance. Core functions—fall detection, daily activity monitoring, emergency response—rely on posture, voice, and facial cues3. Hierarchical control models improve motion precision in arm-assistive robots1, while proprioceptive feedback guides cooperative dressing assistants69. Object handover, depicted in Figure 8, exemplifies synchronized perception and motor coordination70. Emotional responsiveness, studied through systems like LOVOT, fosters user trust but raises ethical concerns regarding overdependence.
Social robots require fluid multimodal responsiveness to sustain empathy and trust in domestic and public settings. Figure 9 outlines a typical architecture integrating sensing, planning, and socio-emotional reasoning layers. Large language models integrated with perception allow open-ended dialogue modulated by gaze and tone40. Pediatric studies show that expressive movement and prosody reduce anxiety in children71, while surveys highlight nonverbal behaviors as determinants of engagement8. Proactive models such as RoboAgent37 and Sub-Prior-guided Ant Colony Optimization72 enable joint exploration of shared goals. Systems capable of expressing internal states (ready, confused)73 enhance transparency and coordination, while adaptive social agents mediate group collaboration74,75.
Educational robots: In education, multimodal HRI promotes motivation and learning by leveraging embodiment and social signaling76. Combining gestural, facial, and verbal cues enhances accessibility and engagement6,9,77,78. Emotional adaptivity supports younger learners: systems that sense affective state and modulate speech and motion increase attention and vocabulary7,79. Integration with virtual reality improves immersion but challenges scalability80. Collectively, these systems shift the robot’s role from instructor to emotionally attuned collaborator.
Personalization ensures relevance and trust in multimodal systems71,81,82. Ethical personalization balances autonomy and empathy, tailoring responses through participatory feedback. Maaz et al.83 demonstrated affective tutoring guided by facial and cognitive cues, while Gomez-Izquierdo et al.84 modeled user preferences for synchronized behavior. Personalized adaptation can be formalized as:

where wi are learned preference weights. Implementations such as RFIS achieve real-time person re-identification and gesture recognition at the edge85, while proprioceptive touch interfaces10 enable intuitive physical communication. Social perception effects further modulate personalization outcomes, where narrator identity shifts user response timing and utterance length86. Personalization thus transforms users from passive recipients into co-designers, shaping interaction through mutual adaptation.
While much of the reviewed work remains at the prototype stage, several multimodal HRI techniques have reached commercial or industrial applications. SoftBank's Pepper robot, for instance, integrates speech, gesture, and facial expression recognition for customer service in retail and for teaching support in educational settings87. Toyota's Human Support Robot employs vision-gesture fusion for assistive tasks in homes and healthcare facilities88. These commercial implementations often simplify fusion strategies, typically using late or hybrid approaches with rule-based fallbacks to ensure reliability and low cost, underscoring the persistent gap between advanced academic models and scalable industrial solutions. Bridging this gap will require greater emphasis on edge deployment and rigorous real-world validation.
Challenges to seamless communication in HRI
Despite advances in multimodal sensing and fusion, sustaining seamless human–robot communication outside controlled settings remains difficult. Models that excel in labs often degrade amid noise, delay, occlusion, shifting intent, or resource limits. The literature converges on six intertwined barriers: vision–language integration, real-time synchronization, multimodal robustness, semantic precision, dataset coverage, and deployment constraints.
Vision-language integration
Visual understanding underpins grounded language and action. Foundational steps improved spatial consistency and abstraction through visual–inertial SLAM without manual initialization89, re-identification plus sensor fusion for occlusion-robust tracking16, lane graph prediction90, semi-supervised segmentation that aligns labels with navigation utility, and egocentric video with SLAM and large-scale segmentation for traversability26. Uncertainty-aware mapping remains central: uPLAM fuses probabilistic localization with panoptic segmentation for dynamic scenes91, while Clio adaptively builds CLIP-based hierarchical 3D scene graphs for task-sensitive semantics30. Grounding approaches progressed from structured descriptions and perception APIs92,93 to multimodal encoders and decoders94,95.
Instruction following in open environments integrates perception and reasoning. RobotGPT35 and GroundingGPT36 interpret commands via coupled visual–linguistic inference; SayPlan, LFG, and LLM-Planner align navigation and planning to grounded language34,44. To reduce hallucination and enforce semantic consistency, GLAM and VTimeLLM introduce alignment objectives96,97, while adaptive frameworks validate LLM outputs against perception98,99. Timing critically shapes legibility: optimized gesture timing improves predictability9; reinforcement learning reduces audiovisual lag in emotion pipelines2; gesture entrainment increases perceived fluency100; and reviews consolidate LSTM, DTW, and GAN tools for alignment101. Overall, seamlessness demands a latency-aware loop across perception, fusion, and response, where mimicry30 and cerebellar-inspired prediction102 coordinate timing, as also illustrated by the system-level timing loops in Figure 10.
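Of the alignment tools mentioned above, dynamic time warping (DTW) is the simplest to show in a few lines. The sketch below is the textbook DTW recurrence for two one-dimensional sequences with an absolute-difference cost. It is a generic illustration, not code from any cited system, and it assumes scalar features (for example, a per-frame gesture velocity and a per-frame speech energy that have been normalized to a common scale).

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1-D sequences.

    cost[i][j] holds the minimal cumulative cost of aligning a[:i]
    with b[:j]; each step may advance in a, in b, or in both.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Two streams that carry the same pattern at different speeds align at zero cost, which is why DTW is useful when gesture and speech are offset or stretched relative to each other. Its quadratic time and memory cost means real-time systems usually restrict it to a sliding window.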
Real-time perception across modalities
Fluency depends on bounded end-to-end delay across heterogeneous streams. TACTO delivers high-resolution tactile feedback with low latency for responsive manipulation. Tendon-driven hands achieve fast interactive play through event vision and lightweight inference103, as shown in Figure 11. For vision–language–action, coupling CLIP with GraspNet accelerates grasping in clutter104. Transformers support rapid social haptic gesture classification11. Edge-optimized audiovisual models sustain sub-second inference24. Precise head-movement timing improves engagement105. Together, these results argue for synchronized fusion, buffer policies tailored by modality, and lightweight attention to preserve temporal co-adaptation. Acceptable latency varies significantly by task domain. In social dialogue or emotion recognition, delays under 200–300 ms maintain perceived fluency and natural interaction. Assistive tasks like object handover or fall detection demand tighter constraints (50–150 ms) to ensure safety and responsiveness. Teleoperation or collaborative manipulation often requires sub-100 ms latency to prevent operator disorientation or motion sickness, while navigation or monitoring applications can tolerate 300–500 ms without critical impact.
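The task-dependent latency ranges above can be encoded as a simple budget table for checking a measured end-to-end delay. The sketch below is one possible encoding, not an established standard: it takes the upper end of each quoted range as the limit, and the task keys and the function name `within_budget` are hypothetical.

```python
# Upper latency limits in milliseconds, taken from the ranges quoted in the text:
# social dialogue 200-300 ms, assistive tasks 50-150 ms,
# teleoperation below 100 ms, navigation or monitoring 300-500 ms.
LATENCY_BUDGET_MS = {
    "social_dialogue": 300,
    "assistive": 150,
    "teleoperation": 100,
    "navigation": 500,
}


def within_budget(task, measured_ms):
    """Return True if the measured end-to-end delay is strictly below
    the limit for the given task domain."""
    return measured_ms < LATENCY_BUDGET_MS[task]
```

For example, a 250 ms response would pass for social dialogue but fail for object handover. The comparison is strict, so a delay exactly equal to the limit is flagged; whether that boundary should count as acceptable is a design choice the text does not settle.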
Temporal and semantic processing challenges
Misalignment and latency erode trust and task efficiency, especially in distributed teleoperator–robot–human systems106. Perceptual and interface strategies help users tolerate delay: bodily gestures and non-lexical fillers107, visual overlays and adaptive cues108,109,110,111. Control-level prediction shortens response gaps102. Parallel dialogue pipelines overlap filler generation, speech synthesis, and prompt editing to reduce perceived lag, as in112 and Figure 12. System studies decompose cumulative latency across capture, encode/decode, networks, decision, and actuation113. As shown in Figure 13, latency is introduced through sensor capture, video encoding and decoding, network transmission, operator processing, and vehicle actuation, forming a complex, bidirectional feedback loop. Syntalos provides millisecond synchronization for closed-loop experiments114. In affective exchanges, real-time facial mimicry enhances synchrony23. Delay-sensitive domains such as remote surgery show that anchoring haptics against delayed vision supports precision and responsiveness115. As shown in Figure 14, anchoring haptic feedback led to better performance and a stronger sense of responsiveness.
Fine-grained affect and intention recognition remain challenging. Microexpressions, prosody, gaze, and anticipatory motion often occur briefly and asynchronously. Emo produces anticipatory facial expressions aligned to user state116. Body language alone conveys intent where face or voice is unavailable117. Pepper-based multimodal displays combine verbal classification with expressive gestures and emoji for emotional synchrony118. Online RL strengthens fusion of facial and vocal streams2, while lightweight mimicry supports edge deployment30. Open issues include group affect, cultural variation, and temporal affect drift.
Generalization in open domains is also limited. Reinforcement learning from human feedback (RLHF) exhibits sparse rewards and brittleness under out-of-distribution inputs119. RoboAgent blends spatial reasoning with language grounding for cross-task transfer29. Reasoning through Action-free Data (RAD) leverages passive video and language for zero-shot capabilities without manual labels120. Perceiver-Actor grounds object properties to manipulate unfamiliar items121. Continual alignment between perception and semantics in RobotGPT and GroundingGPT improves adaptive reasoning but still faces real-time feedback challenges35,36.
Multimodal robustness and environmental variability
Robustness must span signal integrity and behavioral consistency. SimuMuHRI injects modality-specific noise and dropout to test resilience122. Physically grounded structured-light simulation improves RGB-D reliability under lighting and occlusion123,124. Latency–haptics coupling exposes sensitivity in remote manipulation125. Principles of redundancy and thermal resilience from safety-critical energy systems inform fault-tolerant HRI126,127.
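Noise and dropout injection of the kind described above can be sketched in a few lines. This is a generic illustration and does not reproduce the SimuMuHRI interface: the function name `corrupt_stream`, its parameters, and the use of `None` to mark a dropped frame are all assumptions.

```python
import random


def corrupt_stream(samples, noise_std=0.0, dropout_prob=0.0, rng=None):
    """Return a copy of a 1-D sensor stream with Gaussian noise added
    and frames randomly dropped (dropped frames become None).

    Different noise_std and dropout_prob values can be used per
    modality to probe how a fusion model degrades.
    """
    if rng is None:
        rng = random.Random()
    corrupted = []
    for x in samples:
        if rng.random() < dropout_prob:
            corrupted.append(None)  # frame lost
        else:
            corrupted.append(x + rng.gauss(0.0, noise_std))
    return corrupted
```

Passing a seeded `random.Random` makes a stress test repeatable, which matters when comparing fusion strategies under identical corruption. Sweeping `dropout_prob` for one modality while leaving the others clean is a simple way to measure how much a model relies on that channel.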
Behavioral coherence is equally vital. Voice–appearance mismatch reduces trust128, and erratic but correct motion undermines cooperation129. Observer-based adaptive force control and robust interaction controllers sustain stability under structured and unstructured uncertainties130,131. Dataset coverage also constrains progress. Multi-robot perception datasets (OPV2V132, CoPeD133, CSE134, and AgiBot World Colosseo55) expand collaborative sensing across simulation and field. AV-HRI resources (CHiME-5135 with the subsequent CHiME-8 challenge136, AVA-ActiveSpeaker45, AFFECT-HRI5, and SAVEn-Vid46) enable ASR, speaker activity, affect, and long-context instruction following. Large-scale social benchmarks (HumorHRI47, THÖR-MAGNI137, RW4T138, and InViG139) target humor, navigation, teaming, and interactive grounding. Despite this breadth, many datasets are domain-specific or scripted, with limited temporal depth for evaluating seamlessness, as summarized in Table 3.
Evaluation of seamlessness
Conventional metrics such as accuracy and latency do not fully capture co-adaptation, mutual prediction, or social fluency. New evaluators rate contextual coherence and collaboration140, integrate user feedback and situational signals141, and add predictability, coordination, and comfort to physical HRI84. Team frameworks quantify shared anticipation74, while digital twins track trust calibration and team fluency142. A unified benchmark that fuses temporal synchronization, affective alignment, and user-centered fluency remains an open need.
Efforts toward standardization in multimodal HRI evaluation remain limited and fragmented. Notable initiatives such as the CHiME challenges135,136 have advanced benchmarks for audio-visual speech recognition, particularly in terms of noise robustness. Datasets like THÖR-MAGNI137 and RW4T138 provide shared resources for motion capture and teaming behaviors. However, no widely adopted unified protocol yet exists for cross-modal temporal synchronization, interaction fluency, or affective alignment across diverse modalities and domains. This absence of standardized metrics continues to hinder reproducibility and comparability, underscoring the need for community-driven benchmarks.
From lab to real-world deployment
Performance often degrades in unstructured settings due to sensory irregularities, shifting intent, and computing bottlenecks143, exposing a persistent divide between laboratory and real-world results. In controlled environments, hybrid and attention-based fusion excel, delivering high accuracy and low latency for tasks like object manipulation or scene understanding. Real-world deployments are harder: assistive robots in eldercare, companion systems in homes, and educational tools in classrooms routinely contend with noise, occlusion, lighting changes, and unpredictable user behaviors that erode effectiveness. Late fusion approaches tend to prove more reliable under such variable, asynchronous conditions, while hybrid models, despite their strengths in affective and contextual adaptation, can be constrained by computational demands on edge devices. Many successful field implementations still incorporate simpler redundancy or rule-based safeguards for reliability, underscoring that moving from promising lab demonstrations to robust everyday performance remains a formidable challenge. Ecologically valid testing is essential to identify which techniques truly succeed beyond controlled experiments.
Adaptive robust controllers maintain force accuracy amid disturbance131. LLM-driven collaborative planning updates goals as user intent evolves144. Visual overlays and latency-aware feedback help sustain operator awareness under degraded links107,109. Lightweight supervision with LoRa and digital shadows enables monitoring in low-bandwidth fields145. Experience from space robotics highlights that autonomy must co-evolve with human cognition under uncertainty and delay146.
Resource constraints shape feasibility. Sound-based affect and touch recognition run under 1 MB and 0.7 GFLOPs for low-power platforms12. Adding social features may improve presence but risks overload when delays rise147. Scheduling and feature selection must balance cognitive demand with continuity148. Latency mitigation spans perception, control, and networks, as organized in ref. 113. Users frequently prefer transparent, stable interaction to maximal task efficiency, reinforcing that seamless HRI must weigh human comfort alongside technical capability149,151. Across domains, real-world deployments show distinct preferences: social and educational applications often rely on hybrid vision-audio fusion for affective dialogue, while assistive and collaborative tasks favor vision-haptics combinations with late or early fusion for precise physical interaction. These patterns align with the summarized strategies (late fusion for robustness in variable settings, hybrid and attention-based fusion for richer contextual adaptation), though simplification for reliability remains common.
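To make the robustness argument for late fusion concrete, the sketch below combines per-modality classifier outputs at the decision level and simply skips any modality that is unavailable. It is an illustrative example only: the function `late_fuse`, the confidence weights, and the labels are assumptions, not a method taken from any of the reviewed systems.

```python
def late_fuse(predictions):
    """Confidence-weighted late fusion over per-modality class scores.

    predictions: dict modality -> (scores, confidence), or None if unavailable
      scores:     dict label -> probability from that modality's own classifier
      confidence: non-negative weight (a hypothetical reliability estimate)
    Returns (label, fused_scores); label is None when no modality is usable.
    """
    fused = {}
    total = 0.0
    for pred in predictions.values():
        if pred is None:
            continue  # missing or occluded modality: ignore it
        scores, conf = pred
        if conf <= 0:
            continue  # modality judged unreliable: ignore it
        total += conf
        for label, p in scores.items():
            fused[label] = fused.get(label, 0.0) + conf * p
    if total == 0.0:
        return None, {}
    # Normalize by total weight so fused scores remain a distribution.
    fused = {label: s / total for label, s in fused.items()}
    return max(fused, key=fused.get), fused

label, fused = late_fuse({
    "speech": ({"stop": 0.7, "go": 0.3}, 0.5),   # noisy room, low weight
    "gesture": ({"stop": 0.2, "go": 0.8}, 1.0),  # clear view, full weight
    "gaze": None,                                # tracker lost the face
})
```

Because each modality is classified independently, losing one stream changes only the weights in the final average, which is one reason decision-level fusion tends to degrade gracefully under dropout and asynchrony.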
This review has several limitations. Restricting the search to English-language publications in the Web of Science Core Collection may have introduced language bias and excluded relevant non-English work. Focusing on peer-reviewed articles from 2015 to 2025 could also reflect publication bias, potentially overlooking preprints, grey literature, or emerging unpublished results. Manual screening of the 148 papers, though guided by explicit criteria, inevitably involves some subjectivity. Finally, the thematic interpretation of trends reflects our own perspective in a rapidly evolving field.
Despite these limitations, the review points to clear priorities for advancing seamless multimodal HRI. Real-time synchronization and robustness in unstructured environments remain the most critical challenges, as they directly affect deployment reliability and user trust. Moving forward, the field should prioritize three interconnected areas: context-adaptive fusion policies that dynamically select strategies based on noise and synchrony levels; standardized benchmarks for temporal alignment, cross-modal latency, and interaction fluency; and breakthrough approaches such as edge-optimized multimodal transformers or continual learning that enable long-term adaptation without full retraining. Integrating lightweight large language models for anticipatory intent prediction and haptic-visual co-adaptation could further accelerate progress.
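A context-adaptive fusion policy of the kind proposed here could, in its simplest form, be a rule that maps measured noise and cross-modal timing skew to a fusion strategy. The following is a minimal sketch under stated assumptions: the function `select_fusion`, its thresholds, and the compute-budget flag are hypothetical, and a deployed policy would more likely be learned than hand-tuned.

```python
def select_fusion(noise_level, skew_ms, compute_ok,
                  low_noise=0.1, high_noise=0.5,
                  low_skew_ms=20.0, high_skew_ms=100.0):
    """Pick a fusion strategy from current sensing conditions.

    noise_level: estimated noise in [0, 1] (hypothetical scale)
    skew_ms:     timing offset between modalities, in milliseconds
    compute_ok:  True if the platform can afford attention-based fusion
    All thresholds are illustrative assumptions, not values from the literature.
    """
    # Clean, well-synchronized input: early fusion can exploit
    # fine-grained cross-modal relationships.
    if noise_level <= low_noise and skew_ms <= low_skew_ms:
        return "early"
    # Very noisy or badly skewed input: fall back to decision-level fusion.
    if noise_level >= high_noise or skew_ms >= high_skew_ms:
        return "late"
    # In between: use hybrid fusion when the compute budget allows,
    # otherwise stay with the cheaper, more robust option.
    return "hybrid" if compute_ok else "late"

choice = select_fusion(noise_level=0.3, skew_ms=30.0, compute_ok=True)
```

The rule mirrors the pattern summarized in this review: early fusion for low-noise synchronized input, late fusion for noisy or asynchronous input, and hybrid fusion where richer adaptation is worth the added computation.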
Seamless communication ultimately depends on integrating perception, reasoning, and adaptive behavior. While advances in multimodal fusion, vision-language grounding, and affective modeling have improved robots' ability to interpret human intent and respond naturally, persistent challenges remain in synchronization, environmental robustness, and ethical design. True seamlessness requires not just accurate sensing and response, but mutual understanding and temporal fluency between humans and machines.
The implications are clear: future HRI systems must prioritize real-time coordination and interpretability over raw computational power. Latency-aware design and adaptive resource management, summarized in Table 4, will be central to sustaining responsiveness in real-world settings. Equally important are ethical considerations—privacy risks from continuous data collection, security concerns, and user vulnerability demand privacy-by-design, informed consent, and robust oversight to ensure trustworthy interaction.
Looking ahead, research should focus on co-adaptive frameworks that balance precision with trust, enabling robots to adjust dynamically to users' preferences and emotional states. Seamless communication reflects not only technological fluency but the quality of partnership between humans and machines. The future of HRI lies in creating systems that understand context, anticipate intention, and respond with empathy—transforming automation into meaningful collaboration.

Figure 1: Three communication patterns in human-robot interaction: parallel, complementary, and interactive. Arrows indicate the direction and intensity of interaction flows. Parallel interaction involves minimal exchange; complementary interaction includes structured prompts; interactive mode features continuous bidirectional adaptation across modalities.

Figure 2: PRISMA-style flow diagram of the study selection process. The diagram outlines the systematic literature screening process, including identification, duplicate removal, title/abstract screening, full-text review, and final inclusion. A total of 148 papers were selected for qualitative synthesis.

Figure 3: Publication trends and disciplinary distribution in multimodal HRI research. (left) Annual publication trends in multimodal human-robot communication (2015–2025). The graph shows a steady increase in publications, with a notable rise after 2020, reflecting growing interest in deep learning-based multimodal systems. Data for 2025 is partial and based on early access records. (right) Distribution of selected papers by research discipline (2015–2025). The pie chart illustrates the interdisciplinary nature of multimodal HRI research, with dominant contributions from Computer Science, Automation & Robotics, and emerging contributions from Behavioral Sciences and Neuroscience.

Figure 4: Overview of the HI-ROS multi-camera people tracking framework. Yellow blocks denote key modules, including merger, optimizer, and filter, which enable real-time sensor fusion and improve tracking accuracy under occlusion and motion variability.

Figure 5: Architecture of a multi-sensor hand gesture recognition system for surgical robot teleoperation. Dual Leap Motion Controllers capture hand motion data, which is fused via LSTM-RNN for real-time gesture classification. The system supports precise and low-latency control in surgical settings.

Figure 6: Architecture of the Robot Person Following (RPF) system with adaptive ReID. The framework integrates online continual learning for person re-identification, enabling robust tracking under occlusion and appearance changes through dynamic classifier updates.

Figure 7: Cross-modal fusion mechanism in the TVT-Transformer architecture. Visual, tactile, and textual features are aligned via modality-specific encoders and fused through a shared self-attention mechanism, enabling unified multimodal understanding for object recognition.

Figure 8: Human–robot object handover sequence. The figure illustrates the coordinated actions of giver and receiver, highlighting gaze, hand positioning, and timing synchronization required for fluent physical interaction.

Figure 9: Main components of a social robot architecture. The system integrates perception, socio-emotional reasoning, planning, and expression modules, enabling context-aware and empathetic interaction in dynamic environments.

Figure 10: Latency-aware interaction loop in multimodal HRI. The loop depicts real-time coordination across perception, fusion, and response modules. Key strategies include facial mimicry, cerebellar-inspired prediction, and cross-modal timing adaptation to maintain fluency.

Figure 11: Low-latency tendon-driven robotic hand for interactive tasks. The tendon-driven hand plays rock-paper-scissors with humans104 by combining event-based vision with fast sensorimotor control, and it can achieve full flexion in under 60 ms, enabling timely interaction. An event camera captures gesture information, which is processed by a lightweight CNN to classify hand signs. A majority vote over five frames determines the robot’s response, which is then actuated via a microcontroller.

Figure 12: Architecture of a spoken dialogue system incorporating delay mitigation strategies. The system includes filler insertion to bridge silent pauses, parallel response generation to reduce conversational gaps, and prompt editing to adapt utterances in real time. Processing is distributed across multiple modules: automatic speech recognition (ASR), which converts spoken input to text; natural language processing (NLP), which interprets and generates appropriate responses; and text-to-speech synthesis (TTS), which vocalizes the system’s reply. These components operate in a multi-threaded framework to enhance the user’s perception of responsiveness despite backend delays.

Figure 13: Sources of latency in a CAV (Connected and Automated Vehicle) teleoperation control loop. Uplink delays include sensor exposure, encoding, transmission to the operator, decoding, and visual display, while downlink delays stem from operator decision time and vehicle actuation. The total end-to-end latency is given by ΔE2E, comprising both uplink and downlink paths. This visualization highlights the compounding structure of delay in real-time HRI.

Figure 14: Tested sensory delay conditions in remote robotic surgery. System latency was separated into onset delay, haptic feedback delay, and visual feedback delay. The four configurations included a control condition with minimal delay, a setting where haptic feedback remained real-time but visual feedback was delayed, a condition where both channels were delayed by the same amount, and one where the delays were mismatched. Visual feedback typically accounted for the greatest delay.
Table 1: Representative studies of multimodal fusion strategies in HRI (2015–2025). Summary of representative studies employing early fusion, late fusion, and hybrid fusion strategies for multimodal human-robot interaction. The table includes columns for reference, year, application domain, modalities, latency mitigation, temporal synchronization, robustness, adaptability level, interaction type, and key performance metrics. Early fusion excels in low-noise synchronized tasks by capturing rich cross-modal relationships from the outset. Late fusion demonstrates greater robustness in asynchronous or noisy real-world conditions by processing each modality independently before combining decisions. Hybrid approaches leverage attention mechanisms to dynamically balance inputs, offering particular advantages for affective and embodied interactions, though often with increased computational demands.
Table 2: Thematic grouping of selected studies by integration technique and seamlessness attributes. Papers are categorized based on their primary contribution (fusion architecture, temporal alignment, robustness, or adaptability), providing an overview of methodological trends and underexplored areas.
Table 3: Overview of key multimodal datasets for collaborative perception and HRI. Datasets are listed with their focus area (e.g., affect, speech, gesture, navigation), modality composition, scale, and intended use case. The table emphasizes the diversity and limitations of existing resources for evaluating seamless interaction.
Table 4: Hierarchical classification of latency mitigation strategies in HRI systems. Strategies are organized by system level (perception, fusion, control, and interface), with examples from the literature. The table provides a reference for designing latency-aware multimodal architectures.
The authors declare that they have no conflicts of interest.
This work was supported by Universiti Sains Malaysia, Bridging Grant with Project No: R501-LR-RND003-0000001342-0000.
Request permission to reuse the text or figures of this JoVE article