Method Article

Spam Classification with Support Vector Machines Using Van der Waerden Rank Score Attention

DOI: 10.3791/69082

October 31st, 2025

Summary

This study proposes a Support Vector Machine integrated with a Van der Waerden rank score-enhanced feature attention mechanism to address the challenges of high-dimensional, sparse spam data and to improve spam classification performance.

Abstract

As email usage expands, spam has become a critical challenge, threatening network security and reducing communication efficiency. Conventional detection methods face persistent limitations: traditional machine learning models often struggle with high-dimensional sparse data, while deep learning requires substantial computational resources.

This study introduces a Van der Waerden rank score feature attention-enhanced Support Vector Machine (VWR-Attn-SVM) to address these issues. The method applies Van der Waerden rank transformation to normalize text features, improving robustness against outliers and preserving ordinal relationships. An enhanced attention mechanism further optimizes feature selection through non-linear processing with regularization, highlighting the features most relevant to spam detection.
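
As a minimal sketch of the rank transformation described above, the block below uses the standard textbook definition of Van der Waerden scores: each value is replaced by the standard normal quantile of rank/(n + 1). The exact variant used in the authors' supplemental code is not visible in this excerpt, so the function names `average_ranks` and `van_der_waerden_scores` and the average-rank handling of ties are assumptions made for illustration.

```python
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank (assumed tie rule)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of the 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


def van_der_waerden_scores(values):
    """Map one feature column to normal scores: inverse normal CDF of rank / (n + 1)."""
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf(r / (n + 1)) for r in average_ranks(values)]


# Illustrative column with one extreme value (made-up numbers, not from the dataset).
column = [0.0, 0.1, 0.2, 1000.0, 0.1]
scores = van_der_waerden_scores(column)
```

Because only the ranks enter the transformation, the extreme value 1000.0 receives the same score it would get if it were 0.3, which is the sense in which the transformation is robust to outliers while preserving order.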

Experiments on the UCI Spambase and Indonesian Spam datasets show that VWR-Attn-SVM outperforms traditional classifiers in accuracy, precision, recall, F1-score, and AUC. By combining high performance with reduced computational cost, the method provides an efficient and interpretable solution for spam classification, with potential extension to other text-based platforms such as messaging and social media.

Introduction

Despite the continued emergence of instant messaging and social media platforms, email remains a cornerstone of electronic transactions and corporate communication [1]. Because it is not bound by time or place, it allows communication across the globe at any time. However, this widespread adoption has given rise to a pressing problem: the rampant spread of spam. Malicious actors have exploited email systems as vehicles ....

Protocol

1. Experimental preparation (Supplemental File 2 and Supplemental File 3)

  1. Data description: Load the open-source spam dataset from the UCI Machine Learning Repository for spam email detection [30]. Document that the dataset contains 4,601 instances with 57 continuous features and 1 class label, including 1,813 spam (39.4%) and 2,788 non-spam (60.6%) samples (Table 1).
  2. Library import
    1. Import the essential libraries (see the Table of Materials).
    2. Set a global random seed to 42 to ensure the reproducibility of results.
    3. ....
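
The visible preparation steps (fixing the seed at 42 and documenting the class balance) can be sketched as follows. This is an illustration rather than the contents of the supplemental scripts: the helper names `set_global_seed` and `class_distribution` are hypothetical, and the label list is rebuilt from the counts quoted in step 1.1 instead of being read from the dataset file.

```python
import random
from collections import Counter

SEED = 42  # value stated in the protocol


def set_global_seed(seed=SEED):
    """Seed Python's RNG, and NumPy's if it is installed (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    # With TensorFlow, tf.random.set_seed(seed) would be called here as well.


def class_distribution(labels):
    """Return {label: (count, percentage rounded to one decimal)}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, round(100 * n / total, 1)) for label, n in counts.items()}


set_global_seed()

# Labels reconstructed from the counts in the text: 1 = spam, 0 = non-spam.
labels = [1] * 1813 + [0] * 2788
distribution = class_distribution(labels)
print(len(labels), distribution)
```

Running the check on the real label column after loading the data is a quick way to confirm that the file matches the 4,601-instance description before any training starts.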

Results

Following the experimental protocol, Figure 1 provides an overview of the overall workflow of this study, and Figure 2 depicts the operation flowchart of Experiment 2. Table 1 presents the word and character frequencies in the spam email dataset, spam.csv.

Regarding the model performance evaluation, five key metrics were employed: accuracy, precision, recal.......
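
The five metrics named in the abstract (accuracy, precision, recall, F1-score, and AUC) can be computed from predictions as in the sketch below. It assumes binary labels with 1 = spam and 0 = non-spam and uses the pairwise-ranking definition of AUC; the function names and the toy labels and scores are illustrative and are not results from the study.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1}, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up values, not from the experiments).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1]

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, y_score)
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a dataset of a few thousand emails; scikit-learn, listed in the Materials, provides equivalent metric functions for larger data.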

Discussion

This study verified the effectiveness of VWR-Attn-SVM based on the Spambase dataset, providing insights for addressing the high-dimensional and sparse nature of spam data. Experiments revealed that only a few features in spam data have a strong correlation with labels; traditional models treat all features equally, leading to poor performance, whereas the attention mechanism of this model can dynamically weight key features. After integrating the Van der Waerden (VWR) rank transformation, the model achieves faster loss c.......

Disclosures

The authors have no conflicts of interest to disclose.

Acknowledgements

We thank the Fujian Alliance of Mathematics (Grant No. 2023SXLMMS10) and the Natural Science Foundation of Fujian Province (Grant Nos. 2023J05083, 2022J011396, 2023J011434) for funding this work.

Materials

List of materials used in this article
Name | Company | Catalog Number | Comments

Supplemental File 2: code_new.py; Supplemental File 3: code_indonesian.py
numpy | NumPy Developers | | Library for numerical computing in Python
pandas | pandas Development Team | | Library for data manipulation and analysis
matplotlib | Matplotlib Developers | | Library for creating static, animated, and interactive visualizations
seaborn | Michael Waskom et al. | | Statistical data visualization library based on matplotlib
scikit-learn | scikit-learn Developers Team | | Machine learning library featuring various classification, regression, and clustering algorithms
tensorflow | Google | | Open-source machine learning framework, including Keras API for building neural networks
imblearn | imbalanced-learn Developers | | Library for handling imbalanced datasets, including SMOTE for oversampling
warnings | Python Standard Library | | Module for issuing warning messages

Supplemental File 4: code_compute_time.py
numpy | NumPy Developers | | Numerical computing library for Python
pandas | pandas Development Team | | Data manipulation and analysis library
matplotlib | Matplotlib Developers | | Visualization library for creating plots and figures
seaborn | Michael Waskom et al. | | Statistical data visualization library built on matplotlib
scikit-learn | scikit-learn Developers Team | | Machine learning library with classification, regression, and preprocessing tools
tensorflow | Google | | Open-source machine learning framework with Keras API for neural networks
imblearn | imbalanced-learn Developers Team | | Library for handling imbalanced datasets (includes SMOTE)
warnings | Python Standard Library | | Module for issuing warning messages
time | Python Standard Library | | Module for time-related functions
psutil | Giampaolo Rodola | | Library for retrieving system information and monitoring resource usage
os | Python Standard Library | | Module for interacting with the operating system

Supplemental File 5: DNN.py
pandas | pandas Development Team | | Data manipulation and analysis library
numpy | NumPy Developers | | Numerical computing library for Python
time | Python Standard Library | | Module for time-related functions
psutil | Giampaolo Rodola | | Library for system information retrieval and resource monitoring
matplotlib | Matplotlib Developers | | Visualization library for creating plots and figures
scikit-learn | scikit-learn Developers Team | | Machine learning library with data preprocessing, model selection, and metrics tools
imblearn | imbalanced-learn Developers Team | | Library for handling imbalanced datasets (includes SMOTE)
tensorflow | Google | | Open-source machine learning framework with Keras API for building neural networks

References

  1. Ayo, F. E., Ogundele, L. A., Olakunle, S., Awotunde, J. B., Kasali, F. A. A hybrid correlation-based deep learning model for email spam classification using fuzzy inference system. Decis Anal J. 10, 100390 (2024).
  2. Douzi, S., AlShahwan, F. A., Lemoudden, M., Ouahidi, B.

Tags

Spam Classification, Support Vector Machines, Van Der Waerden, Rank Score Attention, Feature Selection, Text Normalization, Outlier Robustness, Attention Mechanism, High Dimensional Data, Text Based Platforms
