Research Article
Erratum Notice
Important: There has been an erratum issued for this article.
This study proposes a multimodal deep learning framework for video summarization by integrating audio-visual features using pretrained GARNN models. Leveraging GRUs, AlexNet, and an adversarial LSTM classifier, the system enhances keyframe detection, reduces redundancy, and achieves high summarization accuracy with an average F-score of 0.985.
Video summarization creates concise versions of lengthy videos while preserving essential content. This study presents a multimodal machine learning strategy that integrates visual and auditory information using a pretrained Gated Recurrent Neural Network architecture, referred to as GARNN, which combines Gated Recurrent Units (GRUs) and AlexNet to extract audio, image, and visual features. Keyframe detection is improved by removing redundant frames and applying motion-compensated feature reduction, followed by optional PCA-based dimensionality reduction. An adversarial encoder-based Long Short-Term Memory (AE-LSTM) classifier is employed for temporal modeling, achieving high summarization accuracy. Results were evaluated using sensitivity, F-scores, and positive predictive values, and the method attained an average F-score of 0.985. A gated AlexNet is introduced in a multimodal GARNN-AE-LSTM framework, where motion-compensated PCA-based reduction eliminates redundancy, GRUs capture temporal progression, and gating fine-tunes spatial feature selection, all of which contribute to a more accurate and efficient video summarization system. The improved F1 score demonstrates the model's effectiveness in generating accurate, meaningful video summaries. This approach highlights the potential of multimodal feature extraction and advanced deep learning techniques for robust video analysis and compression.
Multimodal Analysis (MMA) systems have emerged that combine several modalities, such as audio, video, text, and sensor data, to provide insights into how students learn1. These technologies can support a thorough understanding of student behavior and engagement levels by evaluating multimodal data2. However, a number of obstacles and restrictions arise when creating efficient MMA systems3. The growth of the Internet and of security cameras has produced an enormous volume of digital video, which must be compiled into databases. A video summary can be helpful in this situation. Video summarization is the creation of a useful synopsis of the original video, and it supports activity tracking, anomaly detection, and video retrieval4. Video summarization can be accomplished using a variety of strategies, which fall into two categories5: extractive and abstractive video summarization.
Because video summarization is computationally demanding, efficient methods are needed. If every frame of a video is examined for selection, the summarization process can be slow and lengthy, with processing resources wasted on identical or similar frames. Moreover, to expedite the process and ensure that only important features are considered, feature-space reduction should be applied to any feature set6. Video summarization techniques can be categorized into four major areas based on the type of audio-visual signals that are produced and shown to the end user7, i.e., their output. Specifically, the result of a video summarization computation can be: (i) primary frames8, which are sequentially displayed extracted video frames, commonly known as static summaries; (ii) video frames9 played in sequence, which are called dynamic summaries; (iii) visual signals10, which enhance summaries for the end user by adding graphical syntax to other cues; and (iv) computer-generated textual annotations11, designed to offer efficient summaries of video information.
Summaries based on keyframes are typically smaller than those based on keyshots. A keyframe is an individual representative frame selected from the video, whereas a keyshot is a short continuous segment of video frames grouped around keyframes. The summary class refers to the classifier's output, labeling frames or segments as informative (to be included in the summary) or non-informative (to be excluded). Nevertheless, the compactness of keyframe summaries comes at the expense of omitting important details. For instance, it can be difficult to recover the context of the preceding frame in a keyframe-based summary, and the original sound is lost. To address these issues, keyshot-based video summarization techniques are frequently chosen. Keyshot-based techniques are used to create subsets of either long-form or short-form videos; short- and long-form videos typically have durations of less than and more than 10 min, respectively12. Generating keyshot-assisted subsets for short-form videos is impractical and may not be successful due to their already brief replay length. Furthermore, the playback time of long-form videos, particularly movies or sports videos, can exceed 90 min. For such video categories, keyshot-based summarization techniques are more useful and efficient in giving users a brief overview.
The subjective nature of video summarization makes it a daunting task, because each user has distinct tastes, even for comparable video content. This issue can be addressed with a tailored video summarization approach13, in which the algorithm provides each user with content suited to their interests. However, adapted video summaries with ideal durations for new long-form videos (such as sporting events) are not readily available. Current methods14,15 require massive computing resources to evaluate customer preference data and video footage in order to provide individualized summaries in real time; real-time, personalized video summaries can be obtained from centralized dedicated servers. Babu Veesam and Satish16 provided a thorough taxonomic analysis of the major video summarization strategies, demonstrating that widely used approaches effectively compress vast amounts of video data. The review classifies and assesses methodologies by their fundamental approaches, including multimodal integration strategies, deep learning frameworks, and clustering-based methods. Among the noteworthy methods are the KDAN framework, which uses knowledge distillation for supervised summarization, and the SVS_MCO method, which uses DBSCAN clustering optimized by the Artificial Algae Algorithm. The review also presents sophisticated models, such as the Audio-Visual Recurrent Network and the Query-based Deep African Vulture Learning, which have been found to be highly effective in managing dynamic and multimodal video summarization problems.
The novelty of this study is the inclusion of a gated AlexNet Recurrent Neural Network (GARNN-AE-LSTM) multimodal framework for video summarization. This approach improves the accuracy and efficiency of the summarization process by reducing redundancy with motion-compensated PCA, capturing temporal dynamics with GRUs, and optimizing spatial feature selection with gating methods. To provide efficient video summarization with reduced complexity, this paper contributes the following: (i) Multimodal video summarization: a framework for generating concise video summaries by integrating both visual and auditory information using a pretrained GARNN model that combines gated RNNs and AlexNet. (ii) Gated AlexNet for feature refinement: a novel gated mechanism (sigmoid gate) in AlexNet's dense layer that filters irrelevant spatial features, thereby enhancing the extraction of high-level visual features; integration with the RNN enables effective temporal modeling of frame-to-frame dependencies. (iii) Redundant frame elimination: a motion-compensated, variance-based approach that removes identical or similar frames prior to keyframe detection, followed by optional PCA-based dimensionality reduction for efficient feature representation. (iv) Accurate keyframe detection: an adversarial encoder-based LSTM classifier for temporal modeling, achieving an average F-score of 0.985 and thus demonstrating superior summarization accuracy compared to baseline approaches. (v) Efficiency over transformer models: the proposed GARNN model is computationally efficient, with AlexNet capturing spatial features, the RNN modeling sequential dependencies, and the gating mechanism filtering out unimportant frames, outperforming transformer-based alternatives in feature selection and summarization.
This study proposes a multimodal supervised video summarization method that falls into the generic video summarization category, commonly referred to as video skimming. This category includes techniques that concentrate on finding key segments of a longer video to create a temporally condensed version of it. The proposed supervised technique analyzes the video stream in 1-s segments containing audio and visual representations, as shown in Figure 1. These segments are then categorized as either "uninteresting" or "informative". As opposed to synthetic or simulated data, "real data" here refers to the actual video datasets used for the experiments. Video segments that provide important audio-visual signals necessary for summarization, such as high motion, voice, or scene transitions, are referred to as informative segments; uninteresting segments are repetitious or redundant frames that add nothing new to the synopsis.
The input videos are segmented into frames, and multimodal features are extracted from each frame using the proposed GARNN model. The audio and visual features are fused and fed into the dimensionality reduction phase, where redundant frames are removed before further processing. Finally, the proposed AELSTM is applied to the reduced data for video summarization. A supervised binary classifier trained with feature representations from the visual modality, the audio modality, or their fusion (multi-modalities) is used to achieve this.
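The pipeline just described (multimodal feature extraction, dimensionality reduction, segment classification) can be sketched as a chain of stages. Every function body below is an illustrative stand-in, not the authors' implementation:

```python
import numpy as np

def extract_features(frames, audio):
    # Stand-in for GARNN multimodal extraction: concatenate simple
    # per-segment statistics of the visual and audio streams.
    visual = np.stack([f.mean(axis=(0, 1)) for f in frames])        # (T, C)
    aud = np.stack([np.array([a.mean(), a.std()]) for a in audio])  # (T, 2)
    return np.hstack([visual, aud])                                  # fused

def reduce_dims(X, k):
    # Stand-in for PCA-based reduction: keep top-k principal components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def classify_segments(Z, threshold=0.0):
    # Stand-in for the AE-LSTM classifier: score each segment and label
    # it informative (1) or uninteresting (0).
    return (Z[:, 0] > threshold).astype(int)

# Toy run: 6 one-second segments of 4x4 RGB frames plus audio slices.
rng = np.random.default_rng(0)
frames = [rng.random((4, 4, 3)) for _ in range(6)]
audio = [rng.random(100) for _ in range(6)]
X = extract_features(frames, audio)
labels = classify_segments(reduce_dims(X, 2))
```
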

Figure 1: Overview of the proposed video summarization model. The pipeline includes multimodal feature extraction, dimensionality reduction, and summary generation using AELSTM.
Dataset
The effectiveness of the suggested solutions is determined using the VSUMM17 and SumMe18 datasets, a pair of benchmarks in static video summarization. The datasets include CNN features created using AlexNet with LSTM, together with real data; both are publicly accessible19. The 50 videos in the VSUMM dataset come from websites such as YouTube, span 110 min at 30 frames per second, and cover a variety of genres, including cartoons, news, sports, advertising, TV series, and home videos. The SumMe20 video summary dataset consists of 25 videos of holidays, festivals, and sports obtained from YouTube, each tagged with at least 15 human-generated summaries (390 in total). Video durations range from 1 to 6 min.
The original video is converted into RGB images, which are fed into the pretrained GARNN model for feature extraction. This model consists of AlexNet and a gated RNN. The original number of features is 64, and AlexNet yields 4100 features.
Proposed Gated-AlexNet-RNN based feature extraction
Both the visual and audio modalities have been used to summarize the videos. To achieve a feature representation in both modalities, hand-crafted features were identified that are often used in audio and visual clustering and classification tasks, such as image retrieval, video classification, auditory scene analysis, and music information retrieval. The goal was to include as many informative visual and audio components as possible. Figure 2 shows a conceptual illustration of the process used to extract features for the audio and visual modalities.

Figure 2: Proposed multimodal GARNN-based feature extraction. Features from both visual and audio modalities are extracted using AlexNet and GARNN components.
Gated AlexNet-RNN (GARNN)
For visual image analysis, the CNN is a popular DL model21. In general, a CNN takes an image as input and classifies it into one of several groups. Its structure comprises input neurons, a sequence of convolutional and pooling layers, fully connected layers, and normalization layers22. Neurons in a convolution layer are connected to a narrow region of the preceding layer, whereas neurons in the fully connected layers are connected to all activations of the preceding layer. Eqns (1) and (2) represent the forward and backward propagation of a fully connected layer.
(1) $a_i^{l} = f\left(\sum_j w_{ji}^{l}\, a_j^{l-1} + b_i^{l}\right)$

(2) $\delta_i^{l} = f'\left(z_i^{l}\right) \sum_j w_{ij}^{l+1}\, \delta_j^{l+1}$

where $a_i^l$ is the activation of the ith neuron in the lth layer, $\delta_i^l$ is the gradient of the ith neuron in the lth layer, and $w_{ij}^{l+1}$ is the weight connecting the ith neuron to the (l+1)th layer. Numerous CNN designs have emerged as a result of recent research advancements; AlexNet has been used in this work.
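As a minimal numerical illustration of Eqns (1) and (2), the forward activation and backpropagated gradient of a sigmoid fully connected layer can be computed as follows; the weights here are random placeholders, not trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fc_forward(a_prev, W, b):
    # Eqn (1): a^l = f(W a^{l-1} + b), with f = sigmoid here.
    z = W @ a_prev + b
    return sigmoid(z), z

def fc_backward(delta_next, W_next, z):
    # Eqn (2): delta^l = f'(z^l) * (W^{l+1})^T delta^{l+1}
    a = sigmoid(z)
    return (a * (1 - a)) * (W_next.T @ delta_next)

rng = np.random.default_rng(1)
W, b = rng.random((4, 3)), rng.random(4)
W_next = rng.random((2, 4))
a, z = fc_forward(rng.random(3), W, b)          # layer activation
delta = fc_backward(rng.random(2), W_next, z)   # layer gradient
```
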
The architecture of AlexNet, shown in Figure 3, reflects a meticulous and well-structured design. Its eight learning layers comprise five convolution layers and three fully connected layers. The class labels are produced by feeding the output of the last layer into a softmax activation. The kernels of the second, fourth, and fifth convolution layers are connected only to the kernel maps of the preceding layer residing on the same GPU, whereas the kernels of the third layer are fully connected to all kernel maps of the second layer. Normalization layers, each followed by max pooling, come after the first and second convolution layers, and ReLU is applied to the output of every learning layer.

Figure 3: Architecture of AlexNet. Five convolutional layers and three fully connected layers are used to process visual data in the summarization model.
The primary backbone network for this work is a dense-layer GARNN. The dense RNN has three layers: two fully connected (fc) layers, with 170 neurons in the first and 80 in the second, followed by batch normalization and dropout. The final fc layer has three neurons and produces the classification output. The dense layer follows the LSTM, and AlexNet supplies the features to the LSTM layer. The maximum sequence length is 20, equal to the declared number of sequence steps. The first layer of the GARNN contains 100 hidden units, whereas the third layer contains 125 hidden units.
The gating mechanism was incorporated into the dense (fully connected) layer of AlexNet in the proposed GARNN-AE-LSTM framework. Convolutional feature maps are normally processed by AlexNet and sent straight into fully connected layers for classification, so all retrieved spatial characteristics, including redundant or less significant patterns, are treated equally. We addressed this by introducing a sigmoid-based gating function that acts as a feature filter. Mathematically, a gate vector g = σ(wX + b) modulates the dense layer's output, with σ representing the sigmoid function. During training, this gate learns to assign feature activations weights ranging from 0 to 1: (i) low gate values suppress irrelevant features (e.g., background noise or redundant textures); (ii) higher gate values highlight discriminative features (such as important object regions or motion-sensitive patterns); (iii) the refined feature set is then used by the RNN component, enabling it to capture temporal dependencies without being sidetracked by irrelevant spatial information; (iv) the gating parameters are updated by backpropagation during training in tandem with AlexNet and the RNN. This means that, over time, the model learns not only which features to extract but also which features matter most for summarization.
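A minimal sketch of this gate, assuming a plain dense layer and random weights in place of trained AlexNet parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_dense(X, W_dense, b_dense, W_gate, b_gate):
    # Gate g = sigmoid(wX + b) modulates the dense output elementwise,
    # damping features the gate judges irrelevant.
    h = W_dense @ X + b_dense           # ordinary dense-layer output
    g = sigmoid(W_gate @ X + b_gate)    # gate vector in (0, 1)
    return g * h, g                     # gated features passed to the RNN

rng = np.random.default_rng(2)
X = rng.random(8)
out, g = gated_dense(X,
                     rng.standard_normal((5, 8)), rng.standard_normal(5),
                     rng.standard_normal((5, 8)), rng.standard_normal(5))
```
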
In Figure 4, the variable X denotes the input data, C denotes the cell, and H denotes the hidden state.

Figure 4: Gated Recurrent Neural Network architecture (GARNN) model. The GARNN integrates batch normalization, dropout, and multiple dense layers for effective sequence modeling.
In each block, the input weights Iw, the recurrent weights Rw, and the bias b are utilized as in Eqn (3) to Eqn (5):
(3) $z_t = \sigma\left(I_w^{z} X_t + R_w^{z} H_{t-1} + b^{z}\right)$

(4) $r_t = \sigma\left(I_w^{r} X_t + R_w^{r} H_{t-1} + b^{r}\right)$

(5) $\tilde{H}_t = \tanh\left(I_w^{h} X_t + R_w^{h}\left(r_t \odot H_{t-1}\right) + b^{h}\right)$
At a given time stamp t, the state update is denoted as in Eqn (6):

(6) $H_t = \left(1 - z_t\right) \odot H_{t-1} + z_t \odot \tilde{H}_t$

where $\odot$ denotes the Hadamard (elementwise) product of the gate and hidden-state vectors, as in Eqn (7):

(7) $\left(A \odot B\right)_{i} = A_{i}\, B_{i}$
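One GRU step with these gates can be sketched numerically as follows; layer sizes and weights are illustrative, not the trained network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Iw, Rw, b):
    # Eqns (3)-(5): update gate z, reset gate r, candidate state.
    z = sigmoid(Iw['z'] @ x + Rw['z'] @ h_prev + b['z'])
    r = sigmoid(Iw['r'] @ x + Rw['r'] @ h_prev + b['r'])
    h_cand = np.tanh(Iw['h'] @ x + Rw['h'] @ (r * h_prev) + b['h'])
    # Eqns (6)-(7): Hadamard mix of previous and candidate states.
    return (1 - z) * h_prev + z * h_cand

rng = np.random.default_rng(3)
n_in, n_h = 4, 6
Iw = {k: rng.standard_normal((n_h, n_in)) for k in 'zrh'}
Rw = {k: rng.standard_normal((n_h, n_h)) for k in 'zrh'}
b = {k: np.zeros(n_h) for k in 'zrh'}
h = gru_step(rng.random(n_in), np.zeros(n_h), Iw, Rw, b)
```
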
The pyAudioAnalysis module (https://github.com/tyiannak/pyAudioAnalysis)23 is used to compute segment-level audio characteristics for every audio clip extracted from the corresponding video file with ffmpeg. The extracted features are listed in Table 1. Following this process, audio feature extraction is first performed on a short-term basis; the final part of the representation is composed of segment-level feature statistics calculated at a second level. Specifically, short-term processing is performed on each segment of the audio signal, yielding 68 short-term features (34 features and 34 deltas) for every segment-level window, which may be overlapping or non-overlapping. In addition to extracting auditory elements from each video's sound signal, a variety of visual characteristics is used to convey the content of the visual information; this modality is anticipated to be crucial to the summarization process. The multimodal_movie_analysis library (https://github.com/tyiannak/multimodal_movie_analysis) has been used to extract features that reflect the visual aspects of a video. Specifically, 88 visual features are extracted from the corresponding frame every 0.2 s. The multimodal features are then fused, and a total of 59 features is used for the subsequent analysis that classifies video segments as informative or uninteresting during summarization.
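The short-term-plus-segment-statistics scheme can be illustrated with a toy two-feature extractor. The real pipeline computes 34 features and their deltas via pyAudioAnalysis; the energy and zero-crossing features below are stand-ins:

```python
import numpy as np

def short_term_features(signal, win, step):
    # Slide a window over the signal and compute per-window features.
    frames = [signal[i:i + win] for i in range(0, len(signal) - win + 1, step)]
    # Two toy short-term features: energy and zero-crossing rate.
    feats = np.array([[np.mean(f ** 2),
                       np.mean(np.abs(np.diff(np.sign(f)))) / 2]
                      for f in frames])
    # Deltas: frame-to-frame differences (first frame gets zeros).
    deltas = np.vstack([np.zeros((1, feats.shape[1])),
                        np.diff(feats, axis=0)])
    return np.hstack([feats, deltas])            # (n_windows, 4)

def segment_stats(st_feats):
    # Segment-level representation: mean and std of each short-term feature.
    return np.hstack([st_feats.mean(axis=0), st_feats.std(axis=0)])

rng = np.random.default_rng(4)
seg = segment_stats(short_term_features(rng.standard_normal(16000), 400, 200))
```
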
Feature reduction using PCA
The dimensionality of the features in this study is reduced by principal component analysis (PCA). The original characteristics are transformed into principal components in order to boost their prominence and importance. PCA has been widely used by numerous researchers in a variety of domains24. The components are determined from the eigenvalues: features with the highest eigenvalues are selected, while those with the lowest eigenvalues are removed.
Following the removal of superfluous frames, PCA is used in this study as an optional feature reduction step. PCA suppresses noise and redundant information by compressing the high-dimensional feature vectors produced by GARNN into a small subspace. This increases the AE-LSTM's computational efficiency while guaranteeing that keyframe detection is dominated by the most discriminative visual and aural cues. PCA is thus a useful tool for balancing scalability and accuracy in video summarization. The algorithm (Algorithm 1) is provided as Supplementary File 1.
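An eigenvalue-based PCA reduction of the kind described above can be sketched as:

```python
import numpy as np

def pca_reduce(X, k):
    # Project the fused features onto the top-k eigenvectors of their
    # covariance matrix (highest-eigenvalue axes are kept).
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)            # ascending eigenvalues
    top = evecs[:, np.argsort(evals)[::-1][:k]]   # keep the largest k
    return Xc @ top

rng = np.random.default_rng(5)
X = rng.random((100, 20))   # 100 segments, 20 fused features
Z = pca_reduce(X, 5)
```
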
Any frame that is exactly the same as, or strikingly similar to, the preceding frame is considered redundant. Changing the frame rate can clearly affect the proportion of removed video frames: as frame rates increase, the similarity between successive frames increases as well, leading to a higher percentage of frames being removed. The dimensionality-reduced features are then used to train the ML model, the adversarial AE-LSTM.
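A simple stand-in for the redundancy test (plain frame-difference thresholding rather than the full motion-compensated variance computation) looks like:

```python
import numpy as np

def drop_redundant(frames, threshold=0.05):
    # Keep a frame only when its mean absolute difference from the
    # previously kept frame reaches the threshold.
    kept = [0]
    for i in range(1, len(frames)):
        diff = np.mean(np.abs(frames[i] - frames[kept[-1]]))
        if diff >= threshold:
            kept.append(i)
    return kept

# Toy clip: three identical frames, then a clearly different one.
a = np.zeros((4, 4))
b = np.ones((4, 4))
kept = drop_redundant([a, a, a, b])
```
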
Training using AELSTM
A neural network called a GAN25 is made up of two rival subnetworks: i) a "generator" network (G) that creates data that resembles an unknown distribution, and ii) a "discriminator" network (D) that distinguishes between the created samples and those from actual observations. The aim is to find a generator that maximizes the likelihood of the discriminator committing a mistake while fitting the actual data distribution.
Assume X is the input and E is the prior input noise, so that X' = g(E) is the generated sample. Learning is formulated as the minimax optimization in Eqn (8):
(8) $\min_{g} \max_{D}\; \mathbb{E}_{X \sim p_{data}}\left[\log D(X)\right] + \mathbb{E}_{E \sim p_{E}}\left[\log\left(1 - D\left(g(E)\right)\right)\right]$
where D is trained to output the correct classification probability. The components of the developed training model are shown in Figure 5. The selector-LSTM chooses a subset of frames from the input video sequence X. The selected frames are converted into a fixed-length feature E using the encoder-LSTM, followed by reconstruction of the video X' using the decoder-LSTM. The discriminator-LSTM classifies X' into the real-video or summary class. In this study, AE-LSTM is used for video summarization with a GAN structure to obtain efficient, diverse, and structured summaries: the AE eliminates unnecessary background changes, the LSTM detects scene transitions rather than treating the frames as independent images, the generator produces the summary, and the discriminator ensures that the summary is diverse.
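The value function of Eqn (8) can be evaluated numerically for a toy linear discriminator and generator; both are illustrative placeholders, not the LSTM networks of Figure 5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def value_fn(D_w, g_w, X_real, E_noise):
    # V(D, g) = E[log D(X)] + E[log(1 - D(g(E)))]
    d_real = sigmoid(X_real @ D_w)            # D's score on real samples
    d_fake = sigmoid((E_noise * g_w) @ D_w)   # D's score on generated samples
    return np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake))

rng = np.random.default_rng(6)
X_real = rng.normal(1.0, 0.3, size=(200, 2))   # "real" feature samples
E_noise = rng.normal(0.0, 1.0, size=(200, 2))  # prior input noise E
v = value_fn(np.ones(2), np.full(2, 0.1), X_real, E_noise)
```

Since D outputs probabilities below 1 for both terms, the value is always negative; the generator seeks to minimize it while the discriminator maximizes it.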

Figure 5: Architecture of the AELSTM summarization model. Includes Selector-LSTM, Encoder-LSTM, Decoder-LSTM, and Discriminator-LSTM to optimize video summaries via adversarial training.
Given the GARNN features of each frame in the input video X, the summarizer utilizes the selector-LSTM to choose a subset of frames; the encoder encodes the selected frames as E, and the decoder reconstructs the video as X' at each frame xt. The selector uses an importance score when selecting frames: the features are weighted by the scores and then passed to the encoder, which receives only the frames with score st = 1. The discriminator assigns the class original or summary by estimating the distance between X and X', i.e., it computes the error between the original and summary videos. Algorithm 2 (Supplementary File 2) summarizes the training of AELSTM for video summarization.
Post-processing – Video summarization
Video summaries can be produced by the segment-level classifiers once they have been trained. Three steps are involved: (i) determine each video segment's audio, visual, or fused characteristics; (ii) classify each video segment using the appropriate audio, visual, or fusion classifier; (iii) post-process the ordered classifier predictions to prevent obvious errors.
To meet this demand, a pipeline of two distinct filters has been developed for the post-processing step. The input array is first subjected to a median filter of length N1 over local windows, smoothing the sequential classifier predictions. The final predictions are then determined by hard filtering using a straightforward rule that maintains a series of consecutive positive predictions (informative segments) only if at least N2 segments are included in that sequence. Stated differently, this criterion requires an informative segment to last at least N2 seconds.
The proposed GARNN-based feature extraction and AELSTM-based video summarization of lengthy videos were evaluated on two datasets, VSUMM and SumMe. This section discusses the experimental results and the comparison with conventional approaches. For the discriminator-LSTM, a two-layer LSTM with 1024 hidden units per layer is employed. For the encoder-LSTM and decoder-LSTM, two two-layer LSTMs with 2048 hidden units per layer are employed. A decoder LSTM that stores and synthesizes the reverse sequence has been shown to be simpler to train26, so our decoder-LSTM also reconstructs the feature sequence in reverse order. The encoder-LSTM and decoder-LSTM are initialized with the settings of a pre-trained recurrent autoencoder model trained on feature sequences from the original videos; we find that this leads to faster convergence and helps increase overall accuracy.
Evaluation metrics
The metrics used in this study are positive predictive value (PPv), sensitivity (SE), F1 score, and overall accuracy (OA). The goal of video summarization is to select the relevant frames while reducing the redundant ones. Accuracy alone is not a suitable metric for video summarization because it does not account for the imbalance of keyframes. PPv penalizes the selection of too many unimportant frames, ensuring a compact and high-quality summary; sensitivity ensures the summary does not miss critical moments; and the F1 score balances the trade-off between avoiding redundant frames and capturing all key moments.
PPv is the proportion of true positive predictions relative to all positive predictions, as shown in Eqn (9), where P represents all positive predictions and Pp represents the true positive predictions.
(9) $PPv = \dfrac{P_p}{P}$
The percentage of genuine positive predictions over the user's ground truth (at identical frame indices) is known as sensitivity (S), as shown in Eqn (10), where Pu is the selected user frame and their indices, and Pp is the true positive prediction.
(10) $S = \dfrac{P_p}{P_u}$
The F-measure, or F-score, is the harmonic mean of precision and recall, used when neither score alone adequately characterizes unbalanced classification approaches, as shown in Eqn (11).
(11) $F_1 = \dfrac{2 \cdot PPv \cdot S}{PPv + S}$
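Eqns (9)-(11) can be computed directly from predicted and ground-truth keyframe indices; the index lists below are toy examples:

```python
def summary_metrics(predicted, ground_truth):
    # Compare predicted keyframe indices against a user's ground truth.
    pred, gt = set(predicted), set(ground_truth)
    tp = len(pred & gt)                 # true-positive frame indices
    ppv = tp / len(pred)                # Eqn (9): Pp / P
    se = tp / len(gt)                   # Eqn (10): Pp / Pu
    f1 = 2 * ppv * se / (ppv + se)      # Eqn (11): harmonic mean
    return ppv, se, f1

# Three of four predicted keyframes match the user's selection.
ppv, se, f1 = summary_metrics([1, 5, 9, 12], [1, 5, 9, 20])
```
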
Overall accuracy is the percentage of 1-s segments that are correctly classified. To guarantee consistent and manageable processing across modalities, the video data is separated into 1-s temporal segments. Each segment corresponds to both:
Visual stream: a video at 30 frames per second (fps) has 30 consecutive frames in each 1-s segment. AlexNet is used to extract visual information from these frames, filtering out insignificant spatial features; temporal dependencies between successive frames are further modeled by the gated recurrent neural network (GARNN).
Audio stream: each 1-s audio slice precisely matches the frames from that time period; the associated audio track is segmented using the same timestamps as the visual segmentation. Pretrained GRU-based models extract audio features to capture temporal and frequency patterns.
This synchronization guarantees that the visual and audio streams are precisely aligned on the same temporal axis. For tasks such as keyframe recognition and the categorization of "informative" versus "uninteresting" segments, the fused multimodal feature representation thus describes the same temporal context.
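The aligned 1-s slicing of the two streams can be sketched as follows; the frame rate and audio sampling rate values are illustrative:

```python
import numpy as np

def segment_streams(frames, audio, fps=30, sr=16000):
    # Cut both streams with the same timestamps so each 1-s video
    # segment aligns exactly with its 1-s audio slice.
    n_seg = min(len(frames) // fps, len(audio) // sr)
    video_segs = [frames[i * fps:(i + 1) * fps] for i in range(n_seg)]
    audio_segs = [audio[i * sr:(i + 1) * sr] for i in range(n_seg)]
    return video_segs, audio_segs

frames = np.zeros((95, 2, 2))       # ~3.2 s of toy video at 30 fps
audio = np.zeros(3 * 16000 + 500)   # ~3.03 s of toy audio at 16 kHz
v_segs, a_segs = segment_streams(frames, audio)
```

Trailing partial seconds are discarded so that every emitted segment pair covers exactly one second in both modalities.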
The area under the ROC curve, or AUC, is a more general metric of the classifier that can operate at different "operation points," which correspond to distinct thresholds applied to the positive class's posterior outputs.
The experimental results are based on the confusion matrices of the developed model on the VSUMM and SumMe datasets, illustrated in Figure 6. The developed GARNN feature extraction with PCA-based feature reduction secured an OA of 98.2%, a sensitivity of 0.982, a PPv of 0.947, and an F1 score of 0.956.

Figure 6: Confusion matrices for SumMe and VSUMM datasets. Illustrates the classification accuracy of informative and uninformative video segments in both datasets.
Once the input data is prepared, the proposed PCA-based feature reduction with multimodal GARNN feature extraction is compared with SAE-based27 feature reduction. The developed model was also compared with multimodal features, Cu features alone, and audio and video features alone; with multimodal features, the proposed model secured improved performance over the other features and models. The proposed model is further compared with feature extraction models such as AlexNet, GoogleNet, and VGG16, along with the proposed multimodal features and SIFT and HEVC features. Each run uses an 80:20 split of training and testing data. Table 2 shows the experimental results of the proposed GARNN feature extraction with PCA-based feature reduction.
The proposed model was evaluated on the VSUMM dataset with the extracted multimodal features, and the results are illustrated in Table 3. The proposed model with multimodal features secured improved performance over any single feature type. Similarly, the considered feature extraction approaches were tested on the VSUMM dataset using SIFT, HEVC, and multimodal features; each model improves when multimodal features are applied compared to the other two feature types. Overall, the developed model improved performance by 22% compared to SAE.
The developed model was applied to the SumMe dataset and compared with existing video summarization approaches, such as MultiCNN-SAE-RF24, Multi-CNN-HEVC28, and SUM-GAN17, in terms of the metrics. The hyperparameters of the considered networks are: SAE (3 hidden layers, ReLU activation, Adam optimizer, learning rate 0.001, batch size 64, 50 epochs) and Multi-CNN (4 conv layers, ReLU activation, SGD optimizer, learning rate 0.01, batch size 32, 100 epochs). The results are illustrated in Figure 7. The proposed model secured improved performance over the considered approaches, with a PPv of 94.7%, sensitivity of 98.2%, F1 score of 95.6%, and OA of 98.4%. Among the considered approaches, CNN secured 71.2% PPv, 95.3% sensitivity, 84.5% F1 score, and 91.4% OA; the HEVC with Multi-CNN model secured 71.3% PPv, 96.7% sensitivity, 85.3% F1 score, and 92.1% OA; and the SUM-GAN model obtained 65.4% PPv, 83.4% sensitivity, 81.2% F1 score, and 78.2% OA. The overall performance of the proposed PCA-GARNN-based feature extraction with AELSTM is superior in sensitivity, F1 score, and OA to the conventional systems for summarizing lengthy videos without missing their important frames.
The developed model was applied to the VSUMM dataset and compared with the same existing approaches, MultiCNN-SAE-RF24, Multi-CNN-HEVC28, and SUM-GAN17, in terms of the metrics. The results are illustrated in Figure 8. The proposed model secured improved performance compared to the considered approaches, with a PPv of 95.2%, sensitivity of 97.1%, F1 score of 94.2%, and OA of 98.5%. Among the considered approaches, Multi-CNN secured 73.4% PPv, 89.4% sensitivity, 83.4% F1 score, and 91.2% OA; the HEVC with Multi-CNN model secured 74.5% PPv, 95.2% sensitivity, 87.3% F1 score, and 93.4% OA; and the SUM-GAN model obtained 68.2% PPv, 84.5% sensitivity, 81.2% F1 score, and 87.3% OA. The overall performance of the proposed PCA-GARNN-based feature extraction with AELSTM is superior to the conventional systems in sensitivity, F1 score, and OA for summarizing lengthy videos without missing their important frames.
The performance of the proposed adversarial encoder-based LSTM classifier is compared with other classification approaches, such as MultiCNN-RF27, MultiCNN-HEVC28, and SUM-GAN17, in terms of AUC-ROC. The ROC analysis is shown in Figure 9. Figure 9A presents the SumMe dataset, while Figure 9B depicts the variation on the VSUMM dataset. The AUC value of the proposed video summarization system is 0.982 on the SumMe dataset and 0.984 on the VSUMM dataset. Similarly, the AUC value of MultiCNN-RF is 0.842 on the SumMe dataset and 0.851 on the VSUMM dataset. On the SumMe dataset, the AUC value of MultiCNN-HEVC is 0.891 and that of SUM-GAN is 0.812, versus 0.911 for MultiCNN-HEVC and 0.821 for SUM-GAN on the VSUMM dataset.
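The AUC-ROC values above are computed from frame-level keyframe/non-keyframe decisions. A minimal sketch of this evaluation step, using scikit-learn (listed in the materials table) on toy labels and scores rather than the study's data, would look like this:

```python
# Hedged sketch: AUC-ROC over per-frame importance scores.
# y_true and y_score are toy values, not the study's results.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                      # 1 = frame kept in the summary
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]    # classifier confidence per frame

auc = roc_auc_score(y_true, y_score)
print(round(auc, 3))
# → 0.938
```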
The comparison in terms of summary duration for the approaches is shown in Table 4. Additionally, approaches such as lightweight thumbnail container-based summarization (LTC-SUM)18 and AC-SUM-GAN29 were included for comparison. The proposed model produced shorter summaries than the recent summarization approaches. Therefore, the developed model is efficient and effective at extracting features, reducing irrelevant features, and classifying the interesting segments for video summarization. Analyzing the SumMe and VSUMM datasets with the suggested approach significantly improves overall computing, communication, and storage efficiency. Based on user settings, it can extract and provide summaries of pertinent content (e.g., events and objects). However, short-form videos with numerous, quick scene changes might not benefit from it. The multimodal system described in this study preserves both visual continuity and audio cues, ensuring more thorough and informative video summaries than standard keyframe-only techniques, which risk losing temporal or auditory context.
DATA AVAILABILITY:
The datasets analyzed in this study are publicly available. The SumMe and TVSum datasets are archived on Zenodo at DOI 10.5281/zenodo.4884870, and the VSUMM dataset is available at http://www.vision.ime.usp.br/~creativision/vsumm/.

Figure 7: Performance comparison on the SumMe dataset. The proposed model outperforms state-of-the-art summarization models in PPv, sensitivity, F1 score, and overall accuracy. Please click here to view a larger version of this figure.

Figure 8: Performance comparison on the VSUMM dataset. The proposed model demonstrates superior results compared to existing approaches on standard metrics. Please click here to view a larger version of this figure.

Figure 9: ROC analysis of video summarization models. (A) ROC curve for the SumMe dataset. (B) ROC curve for the VSUMM dataset showing AUC superiority of the proposed model. Please click here to view a larger version of this figure.
Table 1: Extracted audio and visual features with descriptions. The table lists all features used in the multimodal summarization pipeline, including both handcrafted and learned components. Please click here to download this table.
Table 2: Performance of GARNN + PCA on the SumMe dataset. The table presents evaluation metrics (PPv, sensitivity, F1 score, OA) comparing different features and models. Please click here to download this table.
Table 3: Performance of GARNN + PCA on the VSUMM dataset. The table displays accuracy and quality metrics across different model configurations using VSUMM. Please click here to download this table.
Table 4: Summary duration comparison. Compares the duration of generated summaries across proposed and baseline models. Please click here to download this table.
In this study, the summarization of long videos is presented using efficient multimodal ML and DL models that use the audio, image, and visual modalities of the input data. A binary classifier is trained to discriminate between the important segments, which form the produced summary, and the non-important segments, which are discarded. The model is trained on the SumMe and VSUMM datasets, and its scalability is demonstrated in terms of the metrics. Initially, the features are extracted from the videos using the GARNN model and further processed with PCA to reduce dimensionality. Using the AE-LSTM-based classifier, the efficiency of the model is tested through quantitative and qualitative evaluation. To establish the superiority of the model, it is compared with four existing ML- and DL-based video summarization approaches based on audio, image, and video features. The results show that the multimodal features secured an enhanced outcome, with a PPv of 0.947, sensitivity of 0.982, F1 score of 0.956, and OA of 98.4% on the SumMe dataset. The application on the VSUMM dataset obtained a PPv of 0.952, sensitivity of 0.971, F1 score of 0.943, and OA of 98.5%, which is superior to the conventional systems in sensitivity, F1 score, and OA. The model's strong performance on benchmark datasets emphasizes how crucial multimodal characteristics are. In addition to offering useful applications in surveillance monitoring, medical video analysis, educational video condensation, and media content summarization, this method provides a strong foundation for further study in multimodal representation learning and temporal pattern analysis. Due to frame-level feature extraction and temporal modeling, the suggested GARNN-AE-LSTM architecture has comparatively higher computing costs, even though it achieves high accuracy in video summarization.
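The processing chain described above (per-frame multimodal features, PCA-based reduction, then a recurrent binary classifier over the frame sequence) can be sketched as follows. This is a simplified illustration under assumed dimensions, not the study's exact GARNN-AE-LSTM implementation: the random array stands in for GARNN features, and the plain LSTM head stands in for the adversarial encoder-based classifier.

```python
# Simplified sketch of the pipeline: features -> PCA -> LSTM binary scorer.
# All dimensions and the feature array are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

frames, feat_dim, pca_dim = 120, 256, 32
features = np.random.rand(frames, feat_dim).astype(np.float32)   # stand-in for GARNN features

# PCA-based dimensionality reduction of the per-frame features.
reduced = PCA(n_components=pca_dim).fit_transform(features).astype(np.float32)

class SegmentClassifier(nn.Module):
    """Scores each frame's importance with an LSTM over the sequence."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                          # x: (batch, frames, in_dim)
        out, _ = self.lstm(x)                      # temporal modeling
        return torch.sigmoid(self.head(out)).squeeze(-1)  # (batch, frames) in [0, 1]

model = SegmentClassifier(pca_dim)
scores = model(torch.from_numpy(reduced).unsqueeze(0))
keyframes = (scores > 0.5).nonzero()               # candidate summary frames
print(tuple(scores.shape))                         # → (1, 120)
```

In the actual system, the classifier is trained so that high-scoring frames form the summary and the rest are discarded.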
For real-time or resource-constrained applications, further optimization is required, even though redundant-frame removal and PCA-based feature reduction lessen this overhead. To lower processing requirements while preserving summarization quality, future research will focus on hardware-accelerated implementations, pruning techniques, and lightweight backbone networks.
Training the suggested approach on a sizable dataset that includes numerous videos from various categories, annotated by a large number of people, would be an intriguing future project. This video analysis can be multimodal, using not just the video frames or audio but also text from various parts of the video metadata or viewer interactions. Given the potential lack of such a large dataset, investigating its availability and beginning the data collection procedure for this methodology would be a worthwhile first step.
The authors have no conflicts of interest.
The authors are thankful to Dr. Television School, Sichuan Film and Television University, for providing the lab facilities to conduct the research study.
| Name | Source | URL / Reference | Comments |
| --- | --- | --- | --- |
| AlexNet (pre-trained model) | MATLAB / PyTorch | PyTorch Hub — AlexNet pre-trained model: https://pytorch.org/hub/pytorch_vision_alexnet/ | Used for visual feature extraction |
| FFmpeg | FFmpeg.org | https://ffmpeg.org/ | Audio extraction from video |
| GRNN-based model | Custom implementation | Library “neupy” implements GRNN: http://neupy.com/apidocs/neupy.algorithms.rbfn.grnn.html | Used for multimodal feature fusion |
| LSTM (Long Short-Term Memory) | PyTorch | https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html | Used in Selector, Encoder, Decoder |
| Multimodal_movie_analysis lib | GitHub | https://github.com/tyiannak/multimodal_movie_analysis | For visual feature extraction |
| NumPy | Python Software Foundation | https://pypi.org/project/numpy/ | Numerical computation and matrix ops |
| PCA (Principal Component Analysis) | Scikit-learn | https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html | Dimensionality reduction |
| PyAudioAnalysis | GitHub | https://github.com/tyiannak/pyaudioanalysis | Audio feature extraction |
| PyTorch | PyTorch Foundation | https://pytorch.org/ | Deep learning framework |
| SumMe Dataset | Public dataset | https://gyglim.github.io/me/vsum/index.html | Benchmark video summarization dataset |
| VSUMM Dataset | Public dataset | http://www.vision.ime.usp.br/~creativision/vsumm/ | Benchmark video summarization dataset |