Research Article
This study proposes an innovative approach based on Support Vector Machine integrated with a Van der Waerden rank score-enhanced feature attention mechanism, aiming to address the challenges of high-dimensional sparse spam data and improve the classification performance of spam detection.
As email usage expands, spam has become a critical challenge, threatening network security and reducing communication efficiency. Conventional detection methods face persistent limitations: traditional machine learning models often struggle with high-dimensional sparse data, while deep learning requires substantial computational resources.
This study introduces a Van der Waerden rank score feature attention-enhanced Support Vector Machine (VWR-Attn-SVM) to address these issues. The method applies Van der Waerden rank transformation to normalize text features, improving robustness against outliers and preserving ordinal relationships. An enhanced attention mechanism further optimizes feature selection through non-linear processing with regularization, highlighting the features most relevant to spam detection.
Experiments on the UCI Spambase and Indonesian Spam datasets show that VWR-Attn-SVM outperforms traditional classifiers in accuracy, precision, recall, F1-score, and AUC. By combining high performance with reduced computational cost, the method provides an efficient and interpretable solution for spam classification, with potential extension to other text-based platforms such as messaging and social media.
In the contemporary digital era, characterized by the rapid evolution of the internet and digital technologies, email has remained an indispensable cornerstone of electronic transactions and corporate communication, despite the continuous emergence of instant messaging and social media platforms1. Its ability to transcend temporal and spatial boundaries gives it a unique advantage, allowing seamless communication across the globe at any time. However, this extensive adoption has given rise to a pressing and detrimental issue: the rampant spread of spam. Malicious actors exploit email systems as vehicles to distribute vast quantities of unsolicited commercial advertisements, malicious software, and illegal content. According to research, from 2012 to 2023, the proportion of global spam in total email traffic grew by 7700%2,3. This inundation of spam not only severely disrupts users' normal email operations but also poses multifaceted threats: it undermines personal privacy by potentially exposing sensitive information, jeopardizes corporate security through the risk of data breaches and malware infections, and even destabilizes the economic order by facilitating fraudulent activities4,5. Effective spam classification reduces phishing-related financial losses by 40-60%6, highlighting the practical value of efficient, accurate filtering methods. Consequently, developing an efficient and accurate spam detection model has emerged as a crucial research area for ensuring network security and communication efficiency.
A substantial body of existing research on spam detection has centered around machine learning and deep learning methodologies. In the field of traditional machine learning, a diverse array of techniques has been explored and applied. Rule-based methods, such as decision trees7, have been utilized to make classification decisions based on predefined rules derived from data features. Boosting methods8,9,10, which aggregate multiple weak learners into a strong one, and rough set theory11, which deals with uncertainty and imprecision in data, have also shown potential. Additionally, statistical methods including logistic regression, K-nearest neighbors (KNN)12,13, Naive Bayes14,15,16, and SVM17,18,19 have been widely employed. These approaches commonly rely on traditional feature extraction methods like TF-IDF. While TF-IDF is effective in quantifying the importance of words in a document, it struggles to capture the intricate semantic relationships and contextual nuances inherent in email texts. Moreover, when confronted with high-dimensional and sparse data, which is typical of email feature spaces, these methods often encounter computational bottlenecks. Their limited robustness can lead to getting trapped in local optima during training, thereby severely restricting the classification accuracy and generalization ability of the models.
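To make the high-dimensional, sparse nature of TF-IDF features concrete, the following minimal sketch builds such a matrix with scikit-learn. The three-message corpus is illustrative only; the study's actual features come from the Spambase and Indonesian Spam datasets.

```python
# Minimal TF-IDF sketch with scikit-learn; the corpus is an illustrative
# assumption, not data from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "win a free prize now",        # spam-like message
    "meeting agenda for monday",   # ham-like message
    "free free money win",         # spam-like message
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse matrix: rows = emails, cols = terms

# Email feature spaces like this are typically sparse: most entries are zero.
print(X.shape)
print("nonzero entries:", X.nnz, "of", X.shape[0] * X.shape[1])
```

On real email corpora the vocabulary grows to thousands of terms, which is exactly the high-dimensional sparse regime the paper targets.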
Deep learning, with its remarkable capacity for automatic feature extraction, has emerged as a powerful alternative in spam detection. Algorithms such as Convolutional Neural Networks (CNN)20,21,22, Recurrent Neural Networks (RNN)23, and Long Short-Term Memory networks (LSTM)24,25, as well as embedding- and Transformer-based models such as Word2vec and BERT26,27, have made significant strides in improving classification performance. CNNs are adept at extracting local features from data, RNNs and LSTMs handle sequential data well, capturing temporal dependencies in text, and Transformer-based models excel at mining complex semantic relationships and contextual information. Recent efficient NLP methods, such as TinyML-based text classifiers28, offer strong baselines for spam classification; TinyML models are optimized for edge devices with limited memory. We compare our method to these approaches in the Results section, highlighting trade-offs between accuracy, computational efficiency, and deployment flexibility. However, these deep learning models come with their own set of limitations. They typically require a large number of training parameters, resulting in high computational resource demands and extended training times. Deep learning models like BERT require 3-5x more memory and 10x longer training times than traditional SVMs29, making them less practical for deployment in resource-constrained environments, such as mobile devices or low-end servers. Moreover, their complex architectures often render them less interpretable, which can be a significant drawback in applications where understanding the model's decision-making process is crucial.
Against this backdrop, the overarching goal of this study is to develop an innovative approach that can overcome the limitations of existing methods and effectively address the challenges posed by the high-dimensional and sparse nature of spam data. The proposed Van der Waerden Rank Score Feature Attention-Enhanced SVM (VWR-Attn-SVM) represents a novel integration of techniques aimed at enhancing spam detection performance (Figure 1). The fundamental principle behind the VWR-Attn-SVM lies in its unique design that combines the strengths of multiple components.

Figure 1: Overall flow chart of research on spam classification with VWR-Attn-SVM. This flowchart illustrates the workflow of spam classification based on the Van der Waerden rank score and feature attention-enhanced SVM, covering data preparation (loading, splitting, preprocessing), experimental preparation, verification of TF-IDF feature-label statistical correlations, attention-enhanced SVM-based spam detection, and multi-classifier comparison. Please click here to view a larger version of this figure.
The core Enhanced Feature Attention Mechanism processes individual email samples with a specific dimensionality. By applying the Van der Waerden rank transformation, it normalizes email text features distorted by abnormal word frequencies into a form resembling a standard normal distribution. This transformation significantly enhances the model's robustness, enabling it to better handle the variability of email data. Van der Waerden rank scores were preferred over log-scaling and quantile transforms for three reasons: (1) they are robust to spam feature outliers (e.g., extreme word frequencies), unlike log-scaling, which amplifies low-frequency noise; (2) they preserve feature ordinal relationships (critical for a spam indicator hierarchy such as "free" vs. "win"), whereas quantile transforms flatten distributions; and (3) they normalize values to [0,1], easing attention mechanism integration and ensuring consistent weighting (Figure 2).
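A minimal sketch of the rank transformation is shown below, assuming the textbook definition of Van der Waerden normal scores, Φ⁻¹(r/(n+1)) for rank r among n values; the paper's exact implementation (and any rescaling it applies) lives in the supplemental code.

```python
# Van der Waerden normal-score sketch: replace each value by the normal
# quantile of its rank. Uses scipy with average ranks for ties; this is a
# generic textbook version, not the paper's exact code.
import numpy as np
from scipy.stats import norm, rankdata

def van_der_waerden(x):
    """Map a 1-D feature vector to Van der Waerden normal scores."""
    r = rankdata(x)                 # average ranks, 1..n
    return norm.ppf(r / (len(x) + 1))

freq = np.array([0.0, 0.1, 0.1, 5.0, 90.0])   # extreme word-frequency outlier
scores = van_der_waerden(freq)
print(scores)
```

Note how the outlier 90.0 is bounded by the score of the top rank, Φ⁻¹(n/(n+1)), while the ordering of all values is preserved; this is the robustness and ordinality property the text describes.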

Figure 2: Experimental Flowchart. (A-C) Workflows for spam classification, covering data handling, feature selection, model training, evaluation, and comparison with/without Van der Waerden rank score transformation. Please click here to view a larger version of this figure.
Structurally, the mechanism features a two-layer fully connected network for non-linear feature transformation (Figure 2). The first layer, equipped with a LeakyReLU activation function, reduces the input dimensions while introducing non-linearity and incorporates a Dropout layer to mitigate overfitting. The second layer, using a Sigmoid function, outputs attention weights that can precisely quantify the importance of each feature. An L1/L2 regularization strategy is integrated into the model to optimize feature selection, where L1 regularization promotes sparsity, effectively screening out less relevant features, and L2 regularization prevents overfitting by constraining the magnitude of the weights. During the training phase, a multi-task learning framework is adopted, combining feature reconstruction loss and classification loss to optimize the model parameters. This allows the VWR-Attn-SVM to adapt precisely to the high-dimensional, sparse TF-IDF features of email texts, which are characteristic of the complex nature of email content.
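The forward pass of this two-layer attention network can be sketched in NumPy as follows. The weight values, dimensions beyond Spambase's 57 features, and the regularization coefficients are illustrative assumptions, not the paper's trained parameters; Dropout is omitted for brevity.

```python
# NumPy sketch of the enhanced feature attention forward pass:
# dense layer + LeakyReLU, then dense layer + Sigmoid, producing one
# attention weight per feature. Values are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
d, k = 57, 64                       # feature dim (Spambase) and hidden size

W1 = rng.normal(0, 0.1, (k, d))     # first layer weights
W2 = rng.normal(0, 0.1, (d, k))     # second layer weights

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_weights(x):
    h = leaky_relu(W1 @ x)          # non-linear transform (Dropout omitted)
    return sigmoid(W2 @ h)          # per-feature weight in (0, 1)

x = rng.random(d)                   # one TF-IDF feature vector
a = attention_weights(x)
x_weighted = a * x                  # element-wise feature re-weighting

# L1/L2 penalty on the attention weights, added to the training loss;
# the coefficients below are assumed, not the paper's values.
l1, l2 = 1e-4, 1e-4
penalty = l1 * np.abs(a).sum() + l2 * (a ** 2).sum()
print(a.shape, x_weighted.shape)
```

Sigmoid (rather than SoftMax) keeps each feature's weight independent, so several strong spam indicators can all receive weights near 1 simultaneously.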
Our method is optimized for text-based spam datasets ranging from several thousand to ten thousand samples (e.g., Spambase, Indonesian Spam dataset (Supplemental File 1)) and requires standard computational resources (Intel Core i7 processor, 16 GB RAM) for training; inference can run on a standard laptop (Intel Core i5, 8 GB RAM) with sub-second latency. Key constraints include limited performance on non-text spam (e.g., image-embedded spam) and reliance on structured text features. Compared with existing alternative technologies, VWR-Attn-SVM has several notable advantages. Unlike traditional machine learning methods, it does not rely solely on basic feature extraction but actively learns to weight features according to their importance through the enhanced attention mechanism, better capturing the features most relevant to spam classification. In contrast to deep learning models, it achieves a favorable balance between performance and computational efficiency: it requires fewer computational resources and shorter training times, making it suitable for a wide range of applications, especially those with limited resources. This approach is applicable not only to spam detection in email systems but also holds potential for extension to other text-based communication channels, such as instant messaging apps, social media platforms, and SMS services, where similar issues of unwanted and malicious content dissemination exist. Overall, the VWR-Attn-SVM offers a practical, efficient, and versatile solution to the persistent problem of spam in the digital communication landscape.
1. Experimental preparation (Supplemental File 2 and Supplemental File 3)
Table 1: Summary of dataset statistics and feature definitions. This table presents variables for spam classification, including word frequency (word_freq_WORD), character frequency (char_freq_CHAR), capital run length metrics, and the target class variable, with descriptions of each variable type and meaning. Please click here to download this Table.
2. Experiment to verify the statistical association between TF-IDF features and labels (Supplemental File 2 and Supplemental File 3)
3. Attention-enhanced SVM classification for spam detection (Supplemental File 2 and Supplemental File 3)
In the notation of the protocol, for an input TF-IDF feature vector x ∈ R^d, the attention weights are computed as follows:

h = LeakyReLU(W1 x), where W1 ∈ R^(k×d) and k = 64 (hidden neurons).

a = Sigmoid(W2 h), where W2 ∈ R^(d×k) and a holds the attention weight for each feature. Sigmoid is selected instead of SoftMax to maintain the independence of the importance of multiple features.

x̃ = a ⊙ x, where ⊙ denotes element-wise multiplication.
4. Comparison of multiple classifiers (Supplemental File 2 and Supplemental File 3)
5. Comparison chart of multi-metric performance of different classifiers in training/test time and memory (Supplemental File 4)
6. Experimental results of CNN, RNN, LSTM, or Transformers (Supplemental File 5)
7. Supplemental code instructions
To commence, as per the established experimental protocol, Figure 1 provides an overview of the overall workflow of this study, and Figure 2 depicts the operation flowcharts of the individual experiments. Additionally, Table 1 presents the word and character frequencies within the spam email dataset, spam.csv.
Regarding the model performance evaluation, five key metrics were employed: accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (AUC). Table 2 defines the concepts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The F1-score, the harmonic mean of precision and recall, serves to balance these two crucial aspects of classification performance. The receiver operating characteristic (ROC) curve, with the false positive rate (FPR) plotted on the x-axis and the true positive rate (TPR) on the y-axis, offers a comprehensive visualization of the classification performance across various decision thresholds. Consequently, the AUC, quantifying the area under the ROC curve, emerges as a pivotal metric for evaluating the effectiveness of binary classifiers.
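All five metrics are available in scikit-learn; the toy labels and decision scores below are illustrative only, with 1 denoting spam.

```python
# Computing the five evaluation metrics on a toy prediction set.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                    # 1 = spam, 0 = non-spam
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                    # hard decisions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]    # decision scores for AUC

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("auc      :", roc_auc_score(y_true, y_score))
```

Note that AUC is computed from the continuous scores rather than the thresholded decisions, which is what lets the ROC curve sweep across decision thresholds.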
Table 2: Classification performance evaluation indicators. This table defines evaluation metrics for classification: Accuracy, Precision, Recall, F1-score (with their formulae), and AUC (Area Under the ROC Curve), providing a basis for model performance assessment. Please click here to download this Table.
The heatmaps in Figure 3, from the spambase and spam_indonesian datasets, reveal a key trait of spam detection datasets: key feature correlations are sparse. The "spam" label has strong positive correlations with only a few features (like char_freq_$ in spambase), while most features show weak correlations. Traditional models, treating all features equally, dilute critical signals and lower classification accuracy. In contrast, the attention mechanism dynamically assigns weights, emphasizing strong-correlation features and reducing irrelevant ones' impact. Thus, given the datasets' feature distribution, introducing the attention mechanism is vital for boosting spam detection model performance.

Figure 3: Feature Correlation Heatmap. These two heatmaps depict feature correlations for the spambase (left) and spam_indonesian (right) datasets, visualizing pairwise relationships to inform feature interdependency analysis in spam classification. Please click here to view a larger version of this figure.
Figure 4 from Experiment 3 showcases the visualized results of loss functions obtained from two comparative experimental groups across two datasets (the spambase dataset on the left and the spam_indonesian dataset on the right, respectively), designed to analyze the impact of incorporating or excluding the Van der Waerden rank score on the evolution of model training loss. For each dataset, the experiments were divided into two groups: "without normal rank score" (upper graph) and "with normal rank score" (lower graph), and involved three types of losses: total loss, mean squared error (MSE) loss, and cross-entropy loss.
As the number of training epochs increased, the overall decrease in these losses reflected the effectiveness of the training process. In the absence of the normal rank score, the total loss decreased at a slow pace, and the validation loss became notably higher in the later stages, suggesting issues of underfitting and poor generalization. The MSE loss remained stable during training but increased during validation, leading to unstable numerical predictions. The cross-entropy loss decreased rapidly during training but showed a slow decline and high errors during validation, indicating room for classification optimization. Conversely, after introducing the normal rank score, the total loss decreased more significantly, with the training and validation losses converging, thereby enhancing generalization. The MSE loss decreased rapidly and steadily during training, with a lower validation loss, resulting in improved numerical prediction accuracy. The cross-entropy loss decreased rapidly in both training and validation phases with a minimal gap, enhancing the confidence and accuracy of classification.
In summary, the introduction of the Van der Waerden rank score accelerates the loss reduction process, narrows the disparity between training and validation losses, optimizes the accuracy and stability of numerical and classification predictions, and proves to be an effective strategy for enhancing model performance, especially for tasks that are highly sensitive to losses and demand robust generalization capabilities.
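The three curves above correspond to a multi-task objective combining reconstruction and classification terms. A minimal sketch of such a combined loss is given below; the weighting coefficient lam and the toy values are assumptions, not the paper's settings.

```python
# Multi-task loss sketch: total loss = cross-entropy (classification)
# + lam * MSE (feature reconstruction). Values are illustrative.
import numpy as np

def mse(x, x_hat):
    return np.mean((x - x_hat) ** 2)

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

x     = np.array([0.0, 0.3, 0.7, 1.0])   # original features
x_hat = np.array([0.1, 0.2, 0.6, 0.9])   # reconstructed features
y     = np.array([1.0, 0.0])             # spam / non-spam labels
p     = np.array([0.8, 0.2])             # predicted spam probabilities

lam = 0.5                                 # reconstruction weight (assumed)
total = binary_cross_entropy(y, p) + lam * mse(x, x_hat)
print(total)
```

During training, both terms are minimized jointly, which is why the figure tracks total, MSE, and cross-entropy losses separately.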

Figure 4: Loss Curves (Training and Validation) of introducing or not introducing the Van der Waerden rank score. These plots show the impacts of introducing or excluding the Van der Waerden rank score on total, MSE, and cross-entropy losses during training/validation for spambase (left) and spam_indonesian (right) datasets. Please click here to view a larger version of this figure.
The four bar charts in Figure 5, all titled "Top Feature Attention Weights", display the attention weights assigned to different features by two models-Attention-Enhanced SVM (Attn-SVM) and Van der Waerden Rank Score Attention-Enhanced SVM (VWR-Attn-SVM)-for two datasets (spambase on the left and spam_indonesian on the right).
For the spambase dataset: In the Attn-SVM model (top-left chart), "word_freq_remove" has the highest attention weight of 0.6466, followed by "word_freq_000" (0.5505) and "word_freq_hp" (0.5367), showing the model's focus on these features. In the VWR-Attn-SVM model (bottom-left chart), "word_freq_your" has the highest weight of 0.3246, with "char_freq_#" (0.3218) and "word_freq_data" (0.3057) also having notable weights.
For the spam_indonesian dataset: In the Attn-SVM model (top-right chart), "edanrodan" has the highest attention weight of 0.5209, followed by "vidanceweke" (0.5182) and "tidissaiyaadanga" (0.5124). In the VWR-Attn-SVM model (bottom-right chart), "edanrodan" still has the highest weight (0.5655), with "vidanceweke" (0.5575) and "tidissaiyaadanga" (0.5038) following.
A comparative analysis of the two models for both datasets shows that integrating the Van der Waerden Rank Score reshapes the feature attention distribution. For example, in the spambase dataset, the weight of "word_freq_000" decreases, and new features like "word_freq_meeting" are added to the attention list, leading to a more balanced weight distribution. This indicates the Van der Waerden Rank Score can adjust the model's feature focus, helping capture more comprehensive feature interactions and enhancing the model's ability to identify spam emails by optimizing feature attention allocation.

Figure 5: Attention weights with Attn-SVM and VWR-Attn-SVM (spambase (left two panels); spam_indonesian (right two panels)). These bar plots present attention weights from Attn-SVM and VWR-Attn-SVM. The left two panels correspond to the spambase dataset, while the right two panels relate to the spam_indonesian dataset, illustrating feature importance in spam classification. Please click here to view a larger version of this figure.
Table 3, presenting the "Evaluation Metrics Performance Comparison of Classifiers for spambase", compares classifiers such as Logistic Regression and KNN. VWR-Attn-SVM demonstrates distinct advantages. In the training set for non-spam emails, its precision (0.9264) and recall (0.9582) are close to SVM's (0.9447, 0.9637), while refining features through rank transformation. For spam email training set classification, it outperforms Logistic Regression (F1: 0.8423; AUC: 0.9509) and Naive Bayes (F1: 0.7932; AUC: 0.8789) with F1 = 0.9400 and AUC = 0.9821, and exceeds basic SVM (F1: 0.9243; AUC: 0.9789) by balancing weights and reducing noise. It maintains SVM's accuracy, enhances robustness via Van der Waerden scores, and in the face of spam's complex features, its precision (0.9567), recall (0.9239), and superior AUC outshine most traditional models, providing an accurate and robust solution.
In the test set, for non-spam emails, VWR-Attn-SVM outperforms Logistic Regression, KNN, and Naive Bayes, with precision at 0.9436, F1-score at 0.9506, and AUC at 0.9812. Compared to SVM (AUC 0.9816) and Attn-SVM (AUC 0.9770), it remains accurate, optimizing features with the Van der Waerden rank score. In spam classification, it has precision 0.9398, F1-score 0.9299, and AUC 0.9812. Against traditional classifiers, it balances weights and reduces the impact of extreme values, surpassing most models in complex scenarios. With the Van der Waerden rank score, it combines accuracy and robustness for email classification.
Table 4 presents the "Performance Comparison of Evaluation Metrics for Indonesian Spam Classifiers". When compared with classifiers like Logistic Regression and KNN, the Van der Waerden Rank Score Attention-Enhanced SVM (VWR-Attn-SVM) highlights its advantages in cross-language classification. In the training set, for non-spam emails, its precision (0.9319) and recall (0.9592) are close to those of SVM (0.9447, 0.9637), and it optimizes Indonesian features through rank transformation. For spam emails, with an F1 score of 0.9437 and an AUC value of 0.9828, it far surpasses Logistic Regression (F1: 0.8424; AUC: 0.9510) and Naive Bayes (F1: 0.7933; AUC: 0.8789), and also exceeds the basic SVM (F1: 0.9243; AUC: 0.9789), being able to balance the weights of Indonesian features and reduce noise interference.
In the test set, for non-spam email classification, its precision is 0.9391, F1 score is 0.9489, and AUC is 0.9803, which is better than that of traditional classifiers and comparable in accuracy to SVM (AUC 0.9816) and Attn-SVM (AUC 0.9769). In spam email classification, with a precision of 0.9411, an F1 score of 0.9270, and an AUC of 0.9803, it still leads. Relying on the Van der Waerden rank score, VWR-Attn-SVM can adapt to the unique lexical structure and semantic features of Indonesian and takes into account both accuracy and robustness in cross-language spam classification, with significant advantages.
Table 3: Evaluation metrics performance comparison of classifiers for spambase. This table presents evaluation metrics (Accuracy, Precision, Recall, F1-Score, AUC) of classifiers on the spambase dataset, comparing performance for non-spam and spam emails in training and test sets. Please click here to download this Table.
Table 4: Evaluation metrics performance comparison of classifiers for spam_indonesian. This table presents evaluation metrics (Accuracy, Precision, Recall, F1-Score, AUC) of classifiers on the spam_indonesian dataset, comparing performance for non-spam and spam emails in training and test sets. Please click here to download this Table.
The two groups of bar charts in Figure 6 compare the performance of multiple classifiers (KNN, Logistic Regression, AdaBoost, Naive Bayes, SVM, Attn-SVM, VWR-Attn-SVM) in spam email classification for the spambase and spam_indonesian datasets across five metrics: accuracy, precision, recall, F1-score, and AUC. Overall, VWR-Attn-SVM (Van der Waerden Rank Score Attention-Enhanced SVM) shows competitive or superior performance on both datasets. It achieves relatively high values on most metrics, indicating that integrating the Van der Waerden rank score and the attention mechanism can effectively enhance the classifier's accuracy, robustness, and generalization ability in spam email classification across different datasets. Thus, VWR-Attn-SVM outperforms traditional classifiers such as KNN and Naive Bayes in comprehensive performance for both datasets.

Figure 6: Comparison chart of multi-metric performance of different classifiers in spam email classification task. These subfigures present classifier performance across accuracy, precision, recall, F1-score, and AUC for the spambase (left) and spam_indonesian (right) datasets, enabling comparative analysis of model effectiveness in spam detection. Please click here to view a larger version of this figure.
Computational Performance: As shown in Table 5 and Figure 7, computational efficiency differs significantly among models. Naive Bayes has optimal efficiency, with O(n·d) time complexity, 0.001 s test time, and 0.11 MB memory usage, being highly resource-efficient. VWR-Attn-SVM shows notable improvements over Attn-SVM: training time is cut by around 85% (from over 32 s to about 4.7 s), and memory usage drops by about 43% (from 70 MB to ~39.5 MB), consistent with the goal of balancing performance and resources. Compared to the deep-learning baseline TinyML, which uses 640+ MB of memory, VWR-Attn-SVM is much more resource-efficient. Yet, there are limitations. SVM (RBF kernel) has poor training efficiency (over 74 s), and though TinyML has a reasonable test time (~0.0967 s), it uses too much memory. Such trade-offs suggest that efficiency gains might be limited under certain constraints. The training time, test time, and memory usage patterns for Indonesian spam filtering, as presented, show similar traits.
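Training/test time and memory figures like these can be collected with a few lines of instrumentation. The supplemental code uses psutil; the standard-library sketch below (with a placeholder workload) is a stand-in assumption that illustrates the idea.

```python
# Measuring wall-clock time and peak Python memory around a workload,
# using only the standard library. fit_dummy_model is a placeholder
# for an actual training call.
import time
import tracemalloc

def fit_dummy_model(n=200_000):
    # placeholder "training" workload
    return sum(i * i for i in range(n))

tracemalloc.start()
t0 = time.perf_counter()
fit_dummy_model()
train_time = time.perf_counter() - t0
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"training time: {train_time:.4f} s, peak memory: {peak / 1e6:.2f} MB")
```

Note that tracemalloc tracks only Python-level allocations; process-wide memory (as reported in Table 5) is typically measured with psutil's Process.memory_info().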
Table 5: Comparison table of complexity (time/space) and key influencing factors for various machine learning models. This table compares time/space complexity, key influencing factors, and resource usage (training time, test time, memory) of machine learning models on spambase and spam_indonesian datasets. Please click here to download this Table.

Figure 7: Comparison chart of Multi-metric performance of different classifiers in Training_test time and memory. These subfigures illustrate classifier resource usage (training time, test time, memory) for the spambase (left) and spam_indonesian (right) datasets, facilitating comparison of computational efficiency across models. Please click here to view a larger version of this figure.
To address the gap between the promised and actual comparisons with deep learning methods, experimental results against CNN, RNN, LSTM, and Transformer are now added. From Table 6, CNN achieves 0.7742 accuracy for non-spam (0 precision/recall/F1) and 0.5478 for spam (all metrics 0), while RNN, LSTM, and Transformer yield 1 for non-spam metrics but 0 for spam, with all models having 0.5 AUC. These poor/inconsistent performances are likely tied to TF-IDF data, which prioritizes word frequency over the semantic and structural information deep learning models require.
Table 6: Experimental results of CNN, RNN, LSTM, or Transformers. This table presents evaluation metrics (Accuracy, Precision, Recall, F1-Score, AUC) of CNN, RNN, LSTM, and Transformer for non-spam and spam emails in training and test sets. Please click here to download this Table.
The data in Table 7 present the spam classification accuracy rates of various models obtained using different random seeds. To test whether there are significant differences among the models, we conducted analysis of variance and multiple comparison tests on the data in Table 7. One-way analysis of variance (ANOVA) showed a significant difference among the models (F = 136.2448, p < 0.00000001). Tukey HSD multiple comparisons indicated no significant difference between VWR-Attn-SVM and AdaBoost or TinyML (p > 0.05), while significant differences existed between VWR-Attn-SVM and Attn-SVM, SVM, KNN, Logistic Regression, and Naive Bayes (p < 0.05).
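This significance-testing step can be reproduced with scipy; the per-seed accuracy values below are synthetic illustrations, not the values from Table 7, and pairwise Tukey HSD would then be run with scipy.stats.tukey_hsd or statsmodels' pairwise_tukeyhsd.

```python
# One-way ANOVA over per-seed accuracies of several classifiers.
# The accuracy lists are synthetic placeholders.
from scipy.stats import f_oneway

vwr_attn_svm = [0.941, 0.943, 0.940, 0.944, 0.942]
svm          = [0.925, 0.927, 0.924, 0.926, 0.923]
naive_bayes  = [0.880, 0.882, 0.879, 0.881, 0.878]

f_stat, p_val = f_oneway(vwr_attn_svm, svm, naive_bayes)
print(f"F = {f_stat:.2f}, p = {p_val:.2e}")
```

A significant omnibus F only says that at least one model differs; the Tukey HSD post-hoc step identifies which specific pairs differ, as reported above.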
Table 7: Statistical significance testing. This table shows accuracy values of classifiers (Logistic Regression, KNN, AdaBoost, etc.) for spam emails in the test set across different conditions, supporting statistical significance analysis. Please click here to download this Table.
Limitation Analysis: As can also be seen in Table 3 and Figure 6, VWR-Attn-SVM achieves high performance: 0.9421 accuracy, 0.9398 precision, 0.9203 recall, 0.9299 F1-score, and 0.9812 AUC. However, from Table 5 and Figure 7, its test time (0.1496 s) is slightly longer than Attn-SVM's 0.1456 s. Moreover, like other models, it may struggle with non-text or semantically complex spam, as seen in prior analyses of such limitations in spam classification tasks. Table 4, Figure 6, Table 5, and Figure 7 exhibit similar patterns and trends for the spam_indonesian dataset.
Supplemental File 1: spambase and spam_indonesian data for all representative results. First dataset: source spambase.csv from the UCI Machine Learning Repository; Second dataset: generate spam_indonesian.csv from the Indonesian email spam dataset 'Indonesian Email Spam' on Kaggle, via preprocessing and TF-IDF transformation. Please click here to download this File.
Supplemental File 2: code_new.py. This code implements the full spam classification workflow, integrating an attention-enhanced SVM and multi-classifier comparison, with four core modules: Experimental Preparation, Statistical Association Validation, Attention-Enhanced SVM Classification, and Multi-Classifier Comparison. It specifies input ('spam.csv'), outputs (metrics, visualizations), and dependencies (e.g., NumPy 1.23.5, TensorFlow 2.12.0). Please click here to download this File.
Supplemental File 3: code_indonesian.py. This code implements a complete spam classification workflow, integrating an attention-enhanced Support Vector Machine (SVM) and multi-classifier comparison. Its core functions are consistent with Supplemental File 2, encompassing the same four core modules, with the key modification being the input file, replaced with "spam_indonesian.csv" (containing TF-IDF features and binary labels: 0 for non-spam, 1 for spam). Please click here to download this File.
Supplemental File 4: code_compute_time.py. This code follows Supplemental File 2's spam classification workflow (attention-enhanced SVM, multi-classifier comparison) but adds a system resource monitoring module. It has four core modules (largely consistent with File 2) and specifies inputs (spam_indonesian.csv/spam.csv), outputs (metrics, resource data), and dependencies (e.g., psutil). Please click here to download this File.
Supplemental File 5: DNN.py. This code implements spam classification via CNN, RNN, LSTM, and Transformer, using TF-IDF, SMOTE, and class weights. It includes system resource monitoring, computes per-class metrics, visualizes results, and saves them to CSV, with specified inputs, outputs, and dependencies. Please click here to download this File.
This study verified the effectiveness of VWR-Attn-SVM on the Spambase dataset, providing insights for addressing the high-dimensional, sparse nature of spam data. Experiments revealed that only a few features in spam data correlate strongly with the labels; traditional models treat all features equally, leading to poor performance, whereas the attention mechanism of this model dynamically weights the key features. After integrating the Van der Waerden (VWR) rank transformation, the model achieves faster loss convergence, stronger generalization, and balanced feature weights, and captures more interaction information. It exhibits excellent classification metrics on the test set, outperforming traditional methods while consuming fewer resources. Its innovation lies in solving inherent problems of traditional machine learning and deep learning, providing a new paradigm for text classification that adapts well to resource-constrained scenarios and offers good interpretability.
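The Van der Waerden rank transformation itself is not shown inline in this section; the following minimal sketch (function name hypothetical, not taken from the supplemental code) illustrates the idea: rank each feature column, then map the ranks through the inverse normal CDF, which damps outliers while preserving ordinal relationships.

```python
import numpy as np
from scipy.stats import norm, rankdata

def van_der_waerden_transform(X):
    """Replace each feature column with its Van der Waerden rank scores:
    rank the n values, then apply the inverse normal CDF to rank/(n + 1).
    Skewed TF-IDF features become approximately normal, outliers are
    damped, and ordinal relationships are preserved."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    ranks = np.apply_along_axis(rankdata, 0, X)  # average ranks per column
    return norm.ppf(ranks / (n + 1))

# An extreme outlier (100.0) lands on the same scale as ordinary values.
X = np.array([[0.0], [0.1], [0.2], [0.3], [100.0]])
Z = van_der_waerden_transform(X)
```

Note that the outlier's transformed value is norm.ppf(5/6) regardless of how extreme the raw value is, which is exactly the robustness property attributed to the transformation above.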
Key steps in experimental operations
Several key steps in the experimental operations of this study significantly influenced the outcomes of spam email classification. In data preparation, the selection of the Spambase dataset from the UCI Machine Learning Repository, with its 4,601 instances, 57 continuous features, and a binary class label, laid a solid foundation. Verifying the statistical association between TF-IDF features and labels was crucial: employing the chi-squared test for feature screening and visualizing the correlations through heatmaps helped identify the most relevant features, guiding the subsequent model construction. When building the attention-enhanced SVM classification model, the design of the enhanced feature attention layer was pivotal. Through a dual-layer fully connected network for non-linear feature transformation, the integration of L1/L2 regularization strategies, and the adoption of a multi-task learning framework, the model was able to adapt effectively to the high-dimensional, sparse TF-IDF features of email texts.
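The chi-squared screening step can be sketched with scikit-learn's SelectKBest. The random matrix below is a hypothetical stand-in for the non-negative TF-IDF features (chi2 requires non-negative inputs), with the label deliberately tied to feature 0 so that the screening has something to find.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((200, 20))                                 # stand-in TF-IDF matrix
y = (X[:, 0] + 0.1 * rng.random(200) > 0.5).astype(int)   # label driven by feature 0

# Keep the 5 features with the strongest chi-squared association to the label.
selector = SelectKBest(score_func=chi2, k=5)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)                 # indices of retained features
```

The retained indices (and selector.scores_) are what a correlation heatmap would visualize in the protocol.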
Improvement schemes for the experimental method and solutions to technical issues
Several directions exist for further enhancing the performance of the experimental method. The following are explicit troubleshooting guidelines for common experimental issues:
Handling data imbalance
Detection: Check label distribution using df['spam'].value_counts(normalize=True) to identify imbalance ratios.
Solution 1: Apply SMOTE oversampling (implemented in load_data()):
smote = SMOTE(random_state=RANDOM_SEED)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
Verify effectiveness by comparing class counts before/after:
np.bincount(y_train) vs. np.bincount(y_train_smote).
Solution 2: Adjust class weights in SVM (implemented in attention-SVM models):
SVC(class_weight={0: 1, 1: 1}) # Modify ratios (e.g., {0:1, 1:2}) for severe imbalance
Choose between the two solutions based on F1-score changes for the minority class ("Spam").
Optimizing model parameters
SVM-specific tuning: Use grid search with cross-validation (implemented in Experiment 4):
svm_param_grid = {'C': [0.01, 0.1, 1, 10, 100], # Regularization strength
'gamma': [0.01, 0.1, 1, 10, 100], # Kernel coefficient
'kernel': ['rbf', 'linear']
}
GridSearchCV(SVC(), param_grid=svm_param_grid, cv=5, scoring='f1')
Interpret results via clf.best_params_ and clf.best_score_ to identify optimal combinations.
Neural network tuning
Learning rate: Use ReduceLROnPlateau callback to adjust dynamically:
ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=0.0005)
Regularization: Modify l1_reg and l2_reg in EnhancedFeatureAttention layer (current values: 0.0002-0.002). Increase if overfitting (large train-test gap).
Early stopping: Prevent overfitting with:
EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
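The three tuning levers above plug into a standard Keras fit loop. The toy model and data below are illustrative stand-ins, while the callback arguments match the values quoted above.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

# Toy stand-in for the 57-feature Spambase inputs.
rng = np.random.default_rng(0)
X = rng.random((300, 57)).astype("float32")
y = (X[:, 0] > 0.5).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(57,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

callbacks = [
    # Halve the learning rate after 5 stagnant epochs, down to a floor.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5, min_lr=0.0005),
    # Stop when validation loss stops improving; restore the best weights.
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
]
history = model.fit(X, y, validation_split=0.3, epochs=20,
                    batch_size=32, callbacks=callbacks, verbose=0)
```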
Addressing overfitting
Detection: Compare training vs. test metrics (e.g., AUC, F1-score). A gap > 5% indicates overfitting.
Solutions:
Increase dropout rate in attention layers (current: 0.2) via dropout_rate=0.3.
Strengthen regularization: Increase l1_reg/l2_reg in EnhancedFeatureAttention.
Reduce model capacity: Decrease units in attention layers (current: 64-128).
Verify with loss curves (via visualize_training_history); ensure validation loss stabilizes near training loss.
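The > 5% gap check can be scripted directly. The synthetic data and SVM settings below are illustrative, but the 70-30 split mirrors the protocol.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=57, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

clf = SVC(C=1, gamma="scale", kernel="rbf").fit(X_tr, y_tr)
f1_train = f1_score(y_tr, clf.predict(X_tr))
f1_test = f1_score(y_te, clf.predict(X_te))

# A train-test gap above 5 percentage points flags overfitting.
gap = f1_train - f1_test
overfitting = gap > 0.05
```

If the flag trips, apply the remedies above (higher dropout, stronger regularization, reduced capacity) and re-run the check.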
Poor attention mechanism performance
Diagnosis: Check attention weight sparsity with np.mean(train_attention_weights < 0.01). Values > 30% indicate uninformative weighting.
Fixes:
Reduce regularization strength (l1_reg/l2_reg) to encourage diverse weights.
Adjust units in EnhancedFeatureAttention to match feature dimensionality.
Enable rank transformation (use_rank_transform=True) to normalize feature distributions, as shown in the VWR-Attn-SVM model.
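The knobs named above (units, l1_reg, l2_reg, dropout_rate, and per-feature attention weights) can be pictured with the following hypothetical sketch. The article's actual EnhancedFeatureAttention layer lives in the supplemental code; this reconstruction only mirrors the described design (a dense network with L1/L2 regularization and dropout, plus a softmax producing feature weights that sum to 1).

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, regularizers

class FeatureAttentionSketch(layers.Layer):
    """Hypothetical reconstruction of a feature-attention layer: a hidden
    Dense layer with L1/L2 regularization and dropout scores the input,
    and a softmax over features yields weights used to reweight it."""
    def __init__(self, units=64, l1_reg=0.001, l2_reg=0.001, dropout_rate=0.2):
        super().__init__()
        self.hidden = layers.Dense(
            units, activation="relu",
            kernel_regularizer=regularizers.l1_l2(l1=l1_reg, l2=l2_reg))
        self.dropout = layers.Dropout(dropout_rate)

    def build(self, input_shape):
        # One attention score per input feature, normalized by softmax.
        self.score = layers.Dense(input_shape[-1], activation="softmax")

    def call(self, x, training=False):
        h = self.dropout(self.hidden(x), training=training)
        w = self.score(h)       # per-feature weights, each row sums to 1
        return x * w, w

x = tf.random.uniform((8, 57))
weighted, weights = FeatureAttentionSketch()(x)
```

The returned weights tensor is what the sparsity diagnostic above inspects: if most entries fall below 0.01, loosen the regularization or resize units.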
Grid search-based SVM optimization with cross-validation
To optimize SVM, a comprehensive parameter grid is defined: regularization strength (C: [0.01, 0.1, 1, 10, 100]), kernel coefficient (γ: [0.01, 0.1, 1, 10, 100]), and kernel type (kernel: ['rbf', 'linear']). The SVM is encapsulated in GridSearchCV with 5-fold cross-validation (cv=5) to split the training data into 5 subsets for iterative parameter evaluation, preventing overfitting and ensuring generalizability. Grid search optimizes for F1-score (scoring='f1'), a critical metric for imbalanced spam detection that balances precision and recall more meaningfully than accuracy alone. Parallel processing (n_jobs=-1) accelerates the search by using all CPU cores. The SVM baseline adopts the same preprocessing (70-30 train-test split, MinMaxScaler) as the other models to avoid comparison biases. After the search, the optimal parameters and best cross-validation F1-score are reported for optimization transparency, establishing a robust SVM baseline for fair comparison with Attn-SVM and VWR-Attn-SVM in subsequent evaluations.
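The procedure above can be reproduced in miniature; the grid is trimmed and the data synthetic to keep the sketch fast, but the 70-30 split, MinMaxScaler, 5-fold CV, F1 scoring, and n_jobs=-1 follow the described protocol.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = MinMaxScaler().fit(X_tr)              # fit on training data only
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

svm_param_grid = {"C": [0.1, 1, 10],           # trimmed from the full grid
                  "gamma": [0.1, 1],
                  "kernel": ["rbf", "linear"]}
clf = GridSearchCV(SVC(), param_grid=svm_param_grid,
                   cv=5, scoring="f1", n_jobs=-1)
clf.fit(X_tr_s, y_tr)

best_params, best_cv_f1 = clf.best_params_, clf.best_score_
test_f1 = clf.score(X_te_s, y_te)              # F1 of the refit best model
```

best_params_ and best_score_ provide the transparency reporting described above; fitting the scaler on the training split only avoids test-set leakage.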
These solutions are directly implementable using existing code structures, with clear validation steps to measure effectiveness.
Limitations of the experimental method
Despite its effectiveness, the proposed experimental method has certain limitations. Although the VWR-Attn-SVM model demonstrates good performance in general spam email classification, our method uses TF-IDF + Van der Waerden rank scores, which remain shallow lexical representations; unlike contextual embeddings from BERT, they cannot capture nuanced semantic relationships (e.g., sarcasm or ambiguous phrasing). This limits performance on semantically complex spam, such as emails with indirect phishing prompts.
Significance of the method compared to existing or alternative approaches
Compared with traditional spam classification methods, such as decision trees and Naive Bayes, the VWR-Attn-SVM method proposed in this study overcomes the shortcomings of traditional methods that rely on simple feature extraction and are prone to local optimal solutions. By introducing the Van der Waerden rank score and the enhanced feature attention mechanism, it can better capture the complex semantic relationships and context information in email texts, significantly improving the classification accuracy and generalization ability. When compared with deep learning models, although deep learning models have powerful automatic feature extraction capabilities, they often suffer from issues such as excessive training parameters and long training cycles. In contrast, the VWR-Attn-SVM method achieves a good balance between performance and resource consumption, making it more suitable for real-world applications where efficiency and resource utilization are crucial. Therefore, this method provides a more practical and efficient alternative for spam email classification.
Importance and potential application prospects of the method in the specific research field
In the fields of network security and email system optimization, the VWR-Attn-SVM method holds great significance. Accurate spam classification effectively protects users' personal privacy, safeguards enterprise operational security, and maintains the stability of the economic order. Looking ahead, its potential application prospects are extensive. Beyond email systems, this method can be extended to filter spam in various other text-based communication channels, such as instant messaging apps, social media platforms, and SMS services. As the forms and dissemination channels of spam continue to evolve with the development of network technology, continuous optimization of this method will enable it to play an increasingly important role in a wider range of information security scenarios, providing essential technical support for building a secure and efficient network information environment. Additionally, we encourage researchers to test the model on domain-specific spam datasets (e.g., social media spam, comment spam) and multilingual datasets (e.g., Spanish or Mandarin spam) to further validate generalizability. Furthermore, testing on real-time spam streams (e.g., live email feeds) would assess performance in dynamic, real-world environments. Future research could explore combining metaheuristic algorithms with the proposed method to further optimize model performance, similar to their use in related fields, such as the detection of Parkinson's disease with deep long short-term memory networks optimized by a modified metaheuristic algorithm32, or forecasting Bitcoin using decomposition-aided long short-term memory-based time series modeling explained with Shapley values33.
The authors have no conflicts of interest to disclose.
We thank the Fujian Alliance of Mathematics (Grant No. 2023SXLMMS10) and Natural Science Foundation of Fujian Province (2023J05083, 2022J011396, 2023J011434) for funding this work.
| Supplemental File 2: code_new.py; Supplemental File 3: code_indonesian.py. | |||
| numpy | NumPy Developers | Library for numerical computing in Python | |
| pandas | pandas Development Team | Library for data manipulation and analysis | |
| matplotlib | Matplotlib Developers | Library for creating static, animated, and interactive visualizations | |
| seaborn | Michael Waskom et al. | Statistical data visualization library based on matplotlib | |
| scikit-learn | scikit-learn Developers Team | Machine learning library featuring various classification, regression, and clustering algorithms | |
| tensorflow | TensorFlow Developers | Open-source machine learning framework, including Keras API for building neural networks | |
| imblearn | imbalanced-learn Developers | Library for handling imbalanced datasets, including SMOTE for oversampling | |
| warnings | Python Standard Library | Module for issuing warning messages | |
| Supplemental File 4: code_compute_time.py | |||
| numpy | NumPy Developers | Numerical computing library for Python | |
| pandas | pandas Development Team | Data manipulation and analysis library | |
| matplotlib | Matplotlib Developers | Visualization library for creating plots and figures | |
| seaborn | Michael Waskom et al. | Statistical data visualization library built on matplotlib | |
| scikit-learn | scikit-learn Developers Team | Machine learning library with classification, regression, and preprocessing tools | |
| tensorflow | TensorFlow Developers | Open-source machine learning framework with Keras API for neural networks | |
| imblearn | imbalanced-learn Developers Team | Library for handling imbalanced datasets (includes SMOTE) | |
| warnings | Python Standard Library | Module for issuing warning messages | |
| time | Python Standard Library | Module for time-related functions | |
| psutil | Giampaolo Rodola | Library for retrieving system information and monitoring resource usage | |
| os | Python Standard Library | Module for interacting with the operating system | |
| Supplemental File 5: DNN.py. | |||
| pandas | pandas Development Team | Data manipulation and analysis library | |
| numpy | NumPy Developers | Numerical computing library for Python | |
| time | Python Standard Library | Module for time-related functions | |
| psutil | Giampaolo Rodola | Library for system information retrieval and resource monitoring | |
| matplotlib | Matplotlib Developers | Visualization library for creating plots and figures | |
| scikit-learn | scikit-learn Developers Team | Machine learning library with data preprocessing, model selection, and metrics tools | |
| imblearn | imbalanced-learn Developers Team | Library for handling imbalanced datasets (includes SMOTE) | |
| tensorflow | TensorFlow Developers | Open-source machine learning framework with Keras API for building neural networks | |