Spam Classification with Support Vector Machines Using Van der Waerden Rank Score Attention

Nenghui Zhu; Jiaxin Cai

doi:10.3791/69082

Method Article

Spam Classification with Support Vector Machines Using Van der Waerden Rank Score Attention

DOI:

10.3791/69082

⸱

October 31st, 2025

Nenghui Zhu*¹ , Jiaxin Cai*¹

¹School of Mathematics and Statistics, Xiamen University of Technology

Summary

$$\rightleftharpoonup{xx}$$ $$\longleftharp{xx}$$, $$\longrightharp{xx}$$,

This study proposes an innovative approach based on Support Vector Machine integrated with a Van der Waerden rank score-enhanced feature attention mechanism, aiming to address the challenges of high-dimensional sparse spam data and improve the classification performance of spam detection.

Abstract

$$\rightleftharpoonup{xx}$$ $$\longleftharp{xx}$$, $$\longrightharp{xx}$$,

As email usage expands, spam has become a critical challenge, threatening network security and reducing communication efficiency. Conventional detection methods face persistent limitations: traditional machine learning models often struggle with high-dimensional sparse data, while deep learning requires substantial computational resources.

This study introduces a Van der Waerden rank score feature attention-enhanced Support Vector Machine (VWR-Attn-SVM) to address these issues. The method applies Van der Waerden rank transformation to normalize text features, improving robustness against outliers and preserving ordinal relationships. An enhanced attention mechanism further optimizes feature selection through non-linear processing with regularization, highlighting the features most relevant to spam detection.

Experiments on the UCI Spambase and Indonesian Spam datasets show that VWR-Attn-SVM outperforms traditional classifiers in accuracy, precision, recall, F1-score, and AUC. By combining high performance with reduced computational cost, the method provides an efficient and interpretable solution for spam classification, with potential extension to other text-based platforms such as messaging and social media.

Introduction

$$\rightleftharpoonup{xx}$$ $$\longleftharp{xx}$$, $$\longrightharp{xx}$$,

In the contemporary digital era, characterized by the rapid evolution of the internet and digital technologies, email has remained an indispensable cornerstone in the domains of electronic transactions and corporate communication, despite the continuous emergence and innovation of instant messaging and social media platforms¹. Its ability to transcend temporal and spatial boundaries endows it with unique advantages, allowing seamless communication across the globe at any time. However, this extensive adoption has given rise to a pressing and detrimental issue-the rampant spread of spam. Malicious actors have exploited email systems as vehicles ....

Access restricted. Please log in or start a trial to view this content.

Protocol

$$\rightleftharpoonup{xx}$$ $$\longleftharp{xx}$$, $$\longrightharp{xx}$$,

1. Experimental preparation (Supplemental File 2 and Supplemental File 3)

Data description: Load the open-source spam dataset from the UCI Machine Learning Repository for spam email detection³⁰. Document that the dataset contains 4,601 instances with 57 continuous features and 1 class label, including 1,813 spam (39.4%) and 2,788 non-spam (60.6%) samples (Table 1).
Library import
1. Import the essential libraries (see the Table of Materials).
2. Set a global random seed to 42 to ensure the reproducibility of results.

Access restricted. Please log in or start a trial to view this content.

Results

$$\rightleftharpoonup{xx}$$ $$\longleftharp{xx}$$, $$\longrightharp{xx}$$,

To commence, as per the established experimental protocol, Figure 1 provides an overview of the overall flowchart of this study. Figure 2, sequentially depict the operation flowcharts of Experiments 2. Additionally, Table 1 primarily presents the word and character frequencies within the spam email dataset, spam.csv.

Regarding the model performance evaluation, five key metrics were employed: accuracy, precision, recal.......

Access restricted. Please log in or start a trial to view this content.

Discussion

$$\rightleftharpoonup{xx}$$ $$\longleftharp{xx}$$, $$\longrightharp{xx}$$,

This study verified the effectiveness of VWR-Attn-SVM based on the Spambase dataset, providing insights for addressing the high-dimensional and sparse nature of spam data. Experiments revealed that only a few features in spam data have a strong correlation with labels; traditional models treat all features equally, leading to poor performance, whereas the attention mechanism of this model can dynamically weight key features. After integrating the Van der Waerden (VWR) rank transformation, the model achieves faster loss c.......

Access restricted. Please log in or start a trial to view this content.

Disclosures

$$\rightleftharpoonup{xx}$$ $$\longleftharp{xx}$$, $$\longrightharp{xx}$$,

The authors have no conflicts of interest to disclose.

Acknowledgements

$$\rightleftharpoonup{xx}$$ $$\longleftharp{xx}$$, $$\longrightharp{xx}$$,

We thank the Fujian Alliance of Mathematics (Grant No. 2023SXLMMS10) and Natural Science Foundation of Fujian Province (2023J05083, 2022J011396, 2023J011434) for funding this work.

....

Access restricted. Please log in or start a trial to view this content.

Materials

List of materials used in this article
Name	Company	Comments
Supplemental File 2: code_new.py; Supplemental File 3: code_indonesian.py.
numpy	NumPy Developers	Library for numerical computing in Python
pandas	pandas Development Team	Library for data manipulation and analysis
matplotlib	Matplotlib Developers	Library for creating static, animated, and interactive visualizations
seaborn	Michael Waskom et al.	Statistical data visualization library based on matplotlib
scikit-learn	scikit-learn Developers Team	Machine learning library featuring various classification, regression, and clustering algorithms
tensorflow	Google	Open-source machine learning framework, including Keras API for building neural networks
imblearn	imbalanced-learn Developers	Library for handling imbalanced datasets, including SMOTE for oversampling
warnings	Python Standard Library	Module for issuing warning messages
Supplemental File 4: code_compute_time.py
numpy	NumPy Developers	Numerical computing library for Python
pandas	pandas Development Team	Data manipulation and analysis library
matplotlib	Matplotlib Developers	Visualization library for creating plots and figures
seaborn	Michael Waskom et al.	Statistical data visualization library built on matplotlib
scikit-learn	scikit-learn Developers Team	Machine learning library with classification, regression, and preprocessing tools
tensorflow	Google	Open-source machine learning framework with Keras API for neural networks
imblearn	imbalanced-learn Developers Team	Library for handling imbalanced datasets (includes SMOTE)
warnings	Python Standard Library	Module for issuing warning messages
time	Python Standard Library	Module for time-related functions
psutil	Giampaolo Rodola	Library for retrieving system information and monitoring resource usage
os	Python Standard Library	Module for interacting with the operating system
Supplemental File 5: DNN.py.
pandas	pandas Development Team	Data manipulation and analysis library
numpy	NumPy Developers	Numerical computing library for Python
time	Python Standard Library	Module for time-related functions
psutil	Giampaolo Rodola	Library for system information retrieval and resource monitoring
matplotlib	Matplotlib Developers	Visualization library for creating plots and figures
scikit-learn	scikit-learn Developers Team	Machine learning library with data preprocessing, model selection, and metrics tools
imblearn	imbalanced-learn Developers Team	Library for handling imbalanced datasets (includes SMOTE)
tensorflow	Google	Open-source machine learning framework with Keras API for building neural networks

References

$$\rightleftharpoonup{xx}$$ $$\longleftharp{xx}$$, $$\longrightharp{xx}$$,

Ayo, F. E., Ogundele, L. A., Olakunle, S., Awotunde, J. B., Kasali, F. A. A hybrid correlation-based deep learning model for email spam classification using fuzzy inference system. Decis Anal J. 10, 100390(2024).
Douzi, S., AlShahwan, F. A., Lemoudden, M., Ouahidi, B.

Access restricted. Please log in or start a trial to view this content.

Reprints and Permissions

Request permission to reuse the text or figures of this JoVE article

Request Permission

Spam Classification with Support Vector Machines Using Van der Waerden Rank Score Attention

In This Article

Summary

Abstract

Introduction

Protocol

Results

Discussion

Disclosures

Acknowledgements

Materials

References

Reprints and Permissions

Tags

Related Articles