Semantic Anchor-Aligned Model for Interpretable Video Anomaly Detection under Cross-Modal Weak Supervision

Weishan Gao; Jiangang Wang; Ye Wang; Xiaoyin Wang; Xiaochuan Jing

doi:10.3791/69286

Research Article

November 14th, 2025

Weishan Gao¹,², Jiangang Wang³, Ye Wang¹,², Xiaoyin Wang¹,², Xiaochuan Jing¹,²

¹China Aerospace Academy of Systems Science and Engineering, ²Aerospace Hongka Intelligent Technology (Beijing) Co., Ltd., ³Institute of Education, Tsinghua University

Summary

This protocol aims to improve weakly supervised video anomaly detection by integrating structured knowledge. Its goal is to enable the classification of specific anomaly types and provide clear, interpretable explanations for detection results, moving beyond simple binary decisions and enhancing transparency.

Abstract

Weakly supervised video anomaly detection identifies anomalous events using only video-level labels. However, traditional multiple instance learning (MIL) methods rely on coarse-grained binary supervision, which makes it difficult to distinguish between fine-grained anomaly categories. They also tend to focus solely on the most anomalous segments, so their detection outcomes lack interpretability. To overcome these limitations, this study proposes a feature modelling approach that incorporates structured knowledge. A dynamic semantic guidance mechanism combines external category-level information with learnable prompts to generate semantic signals within the feature space. These signals are aligned with the visual evidence extracted by the base anomaly detection module, producing two complementary outputs: an anomaly score that quantifies the severity of anomalous events, and semantic descriptions aligned with external concepts, which are turned into structured, interpretable explanatory text through predefined templates. Experimentally, the proposed method achieves an AUC of 88.03% on the UCF-Crime dataset and 98.23% on the ShanghaiTech dataset, and attains 87.05% accuracy in fine-grained anomaly classification. The generated semantic explanations extend weakly supervised detection from a binary classification task to a semantics-driven, interpretable anomaly analysis framework.
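The alignment between semantic signals and visual evidence can be illustrated with a minimal sketch. It is not the authors' implementation: the toy three-dimensional features, the "normal" anchor, the temperature value, the use of 1 - P(normal) as a segment score, and the top-k mean as a video-level MIL score are all assumptions chosen for illustration. In the actual method the anchors come from a text encoder with learnable prompts and the anomaly score comes from the base detection module.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(xs, temperature=0.1):
    """Numerically stable softmax with a temperature (assumed value)."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def align_segments(segment_feats, anchors, temperature=0.1):
    """Per-segment category probabilities from similarity to each anchor."""
    names = list(anchors)
    probs = []
    for feat in segment_feats:
        sims = [cosine(feat, anchors[n]) for n in names]
        probs.append(dict(zip(names, softmax(sims, temperature))))
    return probs

def video_score(segment_scores, k=2):
    """Mean of the k highest segment scores (a common MIL-style aggregate)."""
    top = sorted(segment_scores, reverse=True)[:k]
    return sum(top) / len(top)

# Hypothetical anchors; real ones would be text embeddings of category prompts.
anchors = {
    "fighting": [1.0, 0.0, 0.0],
    "explosion": [0.0, 1.0, 0.0],
    "normal": [0.0, 0.0, 1.0],
}
# Hypothetical segment features standing in for projected visual features.
segments = [[0.1, 0.0, 0.9], [0.9, 0.1, 0.1], [0.8, 0.2, 0.0]]

probs = align_segments(segments, anchors)
predicted = [max(p, key=p.get) for p in probs]
# Assumption: treat 1 - P(normal) as a stand-in segment anomaly score.
segment_scores = [1.0 - p["normal"] for p in probs]
score = video_score(segment_scores, k=2)
print(predicted, round(score, 3))
```

The sketch shows the two complementary outputs side by side: a scalar score for the video and a category label per segment that can later be mapped to explanatory text.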

Introduction

Video anomaly detection (VAD) plays an essential role in fields such as public safety and intelligent manufacturing, where it offers a scalable solution to the limitations of manual surveillance¹. In particular, weakly supervised video anomaly detection (WS-VAD) has gained prominence due to its reliance on video-level labels, substantially reducing the burden of manual annotation while improving generalizability across diverse scenes. Despite its practical advantages, WS-VAD methods still face significant challenges in precisely identifying the nature of abnormal events². Existing models often detect the presence of anom....

Protocol

This study uses only publicly available datasets (UCF-Crime¹³ and ShanghaiTech¹⁴). All videos were obtained from the official releases of the datasets, which anonymize personally identifiable information to the extent provided by the dataset maintainers. No additional data collection or human subject interaction was conducted by the authors; therefore, no new IRB approval was required. Data usage complies with the licenses and terms of use of the respective datasets.

Data preparation and feature extraction

Dataset preparation: The UCF-Crime

Results

To further validate the utility of semantic anchors, an additional experiment was conducted to compare different prompt templates (see Table 1). The variant using Wikidata-augmented semantic prototypes not only achieved higher AUC values but also led to BLEU-4 score improvements of 16.6 and 14.8 for the generated explanations. These findings indicate a strong link between the semantic richness of prompts and the quality of textual interpretations, confirming that the explanation generation process goes b.......
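For readers unfamiliar with the AUC metric reported here, the following sketch computes it as the probability that a randomly chosen anomalous frame receives a higher score than a randomly chosen normal frame. The frame-level labels and scores are made-up toy values, and evaluating at the frame level is an assumption rather than something stated in this section. The pairwise loop is quadratic and meant only for illustration.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positives and negatives; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: 1 marks an anomalous frame, 0 a normal frame.
frame_labels = [0, 0, 1, 1, 0, 1]
frame_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
auc = roc_auc(frame_labels, frame_scores)
print(f"AUC = {auc:.4f}")
```

In this toy case eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9.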

Discussion

The protocol presented here advances weakly supervised video anomaly detection by shifting from simple binary classification to a semantically grounded, interpretable analysis. Its significance lies in identifying not only whether an anomaly occurred but also what kind of anomaly it is and why the model reached that conclusion, addressing a major limitation of existing WS-VAD systems⁹,²⁹.
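How a detection might be turned into a "what and why" statement through a predefined template can be sketched as follows. The template wording, the field names, and the 0.5 threshold are hypothetical; the authors' actual templates are not shown in the accessible text.

```python
# Hypothetical template; the wording and fields are illustrative assumptions.
TEMPLATE = (
    "Segments {start}-{end}: detected '{category}' "
    "(anomaly score {score:.2f}, anchor similarity {similarity:.2f})."
)

def explain(detections, score_threshold=0.5):
    """Fill the template for every detection at or above the threshold."""
    lines = [
        TEMPLATE.format(**d)
        for d in detections
        if d["score"] >= score_threshold
    ]
    return lines or ["No anomalous segments above threshold."]

# Toy detections standing in for model outputs.
detections = [
    {"start": 0, "end": 15, "category": "normal", "score": 0.08, "similarity": 0.91},
    {"start": 16, "end": 31, "category": "fighting", "score": 0.93, "similarity": 0.87},
]
report = explain(detections)
for line in report:
    print(line)
```

Keeping the explanation template-based means every sentence can be traced back to a specific score and category, which is what makes the output auditable.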

Disclosures

The authors have no conflicts of interest.

Acknowledgements

We gratefully acknowledge the financial support provided by Aerospace Hongka Intelligent Technology (Beijing) Co., Ltd.

....

Materials

List of materials used in this article
Name	Company	Catalog Number	Comments
BLIP Model	Salesforce Research	ViT-B/16	Used as the text encoder for creating semantic anchors.
GPU	NVIDIA	RTX 4060Ti	Used for training the model.
I3D Model	Google DeepMind	Pre-trained on Kinetics	Used as the backbone for video feature extraction.
Numpy	The NumPy Development Team	1.21.2	Used for handling and processing numerical data arrays, such as the video features, before they are fed into the deep learning model.
PyTorch	Facebook AI Research	1.8.0	Used for defining the model architecture, implementing the custom loss functions, and running the back-propagation for optimization.
Python	Python Software Foundation	3.8.20	Used to write all scripts for data processing, model implementation, training, and evaluation.
ShanghaiTech Dataset	ShanghaiTech University		Used for training and evaluation in 13 different campus scenes.
UCF-Crime Dataset	University of Central Florida		Used for training and evaluation of 13 anomaly categories.
Wikidata	Wikimedia Foundation		Used as the external knowledge graph for prompt construction.

References

Abdalla, M., Javed, S., Radi, M. A., Ulhaq, A., Werghi, N. Video anomaly detection in 10 years: A survey and outlook. arXiv. (2024).
Caetano, F., Carvalho, P., Mastralexi, C., Cardoso, J. S. Enhancing weakly-supervised video anomaly detection with temporal constraints. IEEE Access. ....
