Research Article

Semantic Anchor-Aligned Model for Interpretable Video Anomaly Detection under Cross-Modal Weak Supervision

DOI: 10.3791/69286 · November 14th, 2025

Summary

This protocol aims to improve weakly supervised video anomaly detection by integrating structured knowledge. Its goal is to enable the classification of specific anomaly types and provide clear, interpretable explanations for detection results, moving beyond simple binary decisions and enhancing transparency.

Abstract

Weakly supervised video anomaly detection (WS-VAD) relies solely on video-level labels to identify anomalous events. However, traditional multiple instance learning (MIL) methods use coarse-grained binary supervision, which makes it difficult to distinguish between fine-grained anomaly categories. These methods also tend to focus only on the most anomalous segments, so their detection results lack interpretability. To overcome these limitations, this study proposes a feature modeling approach that incorporates structured knowledge. A dynamic semantic guidance mechanism combines external category-level information with learnable prompts to generate semantic signals within the feature space. These signals are aligned with the visual evidence extracted by the base anomaly detection module, producing two complementary outputs: an anomaly score that quantifies the severity of anomalous events, and semantic descriptions aligned with external concepts, which are used to generate structured, interpretable explanatory text through predefined templates. Experimental results show that the proposed method achieves an AUC of 88.03% on the UCF-Crime dataset and 98.23% on the ShanghaiTech dataset, and attains 87.05% accuracy in fine-grained anomaly classification. The generated semantic explanations extend weakly supervised detection from a binary classification task to a semantics-driven, interpretable anomaly analysis framework.
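
The alignment between visual evidence and semantic signals can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: segment features and semantic anchors are treated as plain vectors in a shared space, categories are scored by temperature-scaled cosine similarity, and segment scores are pooled into a video-level score with a mean over the top-k segments (a common MIL-style choice). All function names, the temperature, and k are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(values, temperature=0.1):
    """Numerically stable softmax; the temperature is an assumed hyperparameter."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def classify_segment(feature, anchors):
    """Return (category, probability) for the anchor closest to the segment feature.

    `anchors` maps a category name to its (hypothetical) semantic anchor vector.
    """
    names = list(anchors)
    sims = [cosine(feature, anchors[name]) for name in names]
    probs = softmax(sims)
    best = max(range(len(names)), key=probs.__getitem__)
    return names[best], probs[best]


def video_score(segment_scores, k=2):
    """Pool per-segment anomaly scores into one video-level score (mean of top-k)."""
    if not segment_scores:
        raise ValueError("segment_scores must not be empty")
    top = sorted(segment_scores, reverse=True)[:k]
    return sum(top) / len(top)


if __name__ == "__main__":
    # Toy anchors and a toy segment feature, for illustration only.
    anchors = {"fighting": [1.0, 0.0, 0.0], "burglary": [0.0, 1.0, 0.0]}
    print(classify_segment([0.9, 0.1, 0.0], anchors))
    print(video_score([0.1, 0.9, 0.7, 0.2], k=2))
```

In this sketch the two outputs stay separate: `video_score` gives the severity estimate, while `classify_segment` supplies the category that a template-based explanation would refer to.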

Introduction

Video anomaly detection (VAD) plays an essential role in fields such as public safety and intelligent manufacturing, where it offers a scalable solution to the limitations of manual surveillance [1]. In particular, weakly supervised video anomaly detection (WS-VAD) has gained prominence due to its reliance on video-level labels, substantially reducing the burden of manual annotation while improving generalizability across diverse scenes. Despite its practical advantages, WS-VAD methods still face significant challenges in precisely identifying the nature of abnormal events [2]. Existing models often detect the presence of anom....

Protocol

This study uses only publicly available datasets (UCF-Crime [13] and ShanghaiTech [14]). All videos were obtained from the official releases of the datasets, which anonymize personally identifiable information to the extent provided by the dataset maintainers. No additional data collection or human subject interaction was conducted by the authors; therefore, no new IRB approval was required. Data usage complies with the licenses and terms of use of the respective datasets.

Data preparation and feature extraction

Dataset preparation: The UCF-Crime

Results

To further validate the utility of semantic anchors, an additional experiment was conducted to compare different prompt templates (see Table 1). The variant using Wikidata-augmented semantic prototypes not only achieved higher AUC values but also led to BLEU-4 score improvements of 16.6 and 14.8 for the generated explanations. These findings indicate a strong link between the semantic richness of prompts and the quality of textual interpretations, confirming that the explanation generation process goes b.......

Discussion

The protocol presented here advances weakly supervised video anomaly detection by shifting from simple binary classification to a semantically grounded, interpretable analysis. Its significance lies in its ability to identify not only whether an anomaly occurred but also what kind of anomaly it is and why the model reached that conclusion, addressing a major limitation in existing WS-VAD systems [9,29].
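
As an illustration of how "whether, what kind, and why" could be turned into structured text, the sketch below fills a predefined template from a detection result. It is a hedged example only: the template wording, the field names, and the 0.5 threshold are assumptions made for illustration, and the article's actual templates are not shown in this excerpt.

```python
from string import Template

# Hypothetical explanation template; the wording is illustrative only.
EXPLANATION = Template(
    "Segment $start-$end s: anomaly score $score; "
    "closest concept: $category ($description)."
)


def explain(start, end, score, category, description, threshold=0.5):
    """Render a one-sentence explanation for a scored video segment.

    Segments scoring below the (assumed) threshold are reported as normal,
    so the category is only mentioned when an anomaly is actually flagged.
    """
    if score < threshold:
        return f"Segment {start}-{end} s: no anomaly detected (score {score:.2f})."
    return EXPLANATION.substitute(
        start=start,
        end=end,
        score=f"{score:.2f}",
        category=category,
        description=description,
    )


if __name__ == "__main__":
    print(explain(12, 16, 0.91, "robbery", "taking property by force"))
    print(explain(0, 4, 0.10, "robbery", "taking property by force"))
```

Keeping the template separate from the scoring logic means the explanatory wording can be revised without touching the detector.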

Disclosures

The authors have no conflicts of interest.

Acknowledgements

We gratefully acknowledge the financial support provided by Aerospace Hongka Intelligent Technology (Beijing) CO., LTD.

Materials

List of materials used in this article
Name | Company | Catalog Number | Comments
BLIP Model | Salesforce Research | ViT-B/16 | Used as the text encoder for creating semantic anchors.
GPU | NVIDIA | RTX 4060Ti | Used for training the model.
I3D Model | Google DeepMind | Pre-trained on Kinetics | Used as the backbone for video feature extraction.
NumPy | The NumPy Development Team | 1.21.2 | Used for handling and processing numerical data arrays, such as the video features, before they are fed into the deep learning model.
PyTorch | Facebook AI Research | 1.8.0 | Used for defining the model architecture, implementing the custom loss functions, and running the back-propagation for optimization.
Python | Python Software Foundation | 3.8.20 | Used to write all scripts for data processing, model implementation, training, and evaluation.
ShanghaiTech Dataset | ShanghaiTech University | - | Used for training and evaluation in 13 different campus scenes.
UCF-Crime Dataset | University of Central Florida | - | Used for training and evaluation of 13 anomaly categories.
Wikidata | Wikimedia Foundation | - | Used as the external knowledge graph for prompt construction.

References

  1. Abdalla, M., Javed, S., Radi, M. A., Ulhaq, A., Werghi, N. Video anomaly detection in 10 years: A survey and outlook. arXiv. (2024).
  2. Caetano, F., Carvalho, P., Mastralexi, C., Cardoso, J. S. Enhancing weakly-supervised video anomaly detection with temporal constraints. IEEE Access. ....

Tags

Video Anomaly Detection, Weak Supervision, Semantic Guidance, Multiple Instance Learning, Anomaly Score, Fine Grained Classification, Structured Knowledge, Cross Modal Learning, Interpretable Detection, Visual Evidence