Voting-based Classification for E-mail Spam Detection

Bashar Awad Al-Shboul, Heba Hakh, Hossam Faris, Ibrahim Aljarah, Hamad Alsawalqah

Abstract


Spam e-mail has attracted a tremendous amount of attention. Although organizations commonly deploy spam filters, marketing companies still send unsolicited e-mail in bulk, and users still receive a considerable amount of spam despite those filters. This work proposes a new method for classifying e-mails as spam or non-spam. First, several content features are extracted from each e-mail; those features are then used to classify each e-mail individually. The outputs of three different classifiers (i.e. Decision Trees, Random Forests and k-Nearest Neighbor) are combined under various voting schemes (i.e. majority vote, average probability, product of probabilities, minimum probability and maximum probability) to make the final decision. To validate the method, two different spam e-mail collections were used.
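The five voting schemes named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `combine_votes`, the input layout (one `(p_ham, p_spam)` pair per classifier), the tie-breaking toward non-spam, and the example probabilities are all assumptions made for illustration.

```python
from math import prod

CLASSES = ("ham", "spam")  # index 0 = non-spam, index 1 = spam

def combine_votes(probas, scheme="majority"):
    """Combine per-classifier class probabilities into one label.

    probas: list of (p_ham, p_spam) pairs, one pair per classifier
            (hypothetical layout chosen for this sketch).
    scheme: "majority", "average", "product", "minimum" or "maximum".
    Ties resolve to "ham" (an assumption, not taken from the paper).
    """
    if scheme == "majority":
        # Each classifier casts one vote for its most probable class.
        votes = [max(range(2), key=lambda c: p[c]) for p in probas]
        scores = [votes.count(c) for c in range(2)]
    else:
        rules = {
            "average": lambda xs: sum(xs) / len(xs),
            "product": prod,
            "minimum": min,
            "maximum": max,
        }
        rule = rules[scheme]
        # Apply the rule to each class across all classifiers.
        scores = [rule([p[c] for p in probas]) for c in range(2)]
    # Pick the class with the highest combined score.
    return CLASSES[max(range(2), key=lambda c: scores[c])]

# Made-up outputs standing in for a decision tree, a random forest and k-NN.
example = [(0.45, 0.55), (0.40, 0.60), (0.95, 0.05)]
for name in ("majority", "average", "product", "minimum", "maximum"):
    print(name, combine_votes(example, name))
```

With these made-up probabilities, two weakly confident classifiers outvote one highly confident classifier under majority voting, while the probability-based rules side with the confident one. This shows why the choice of scheme can change the final decision.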

DOI: http://dx.doi.org/10.5614%2Fitbj.ict.res.appl.2016.10.1.3


Contact Information:

ITB Journal Publisher, LPPM – ITB, 

Center for Research and Community Services (CRCS) Building, 7th Floor,
Jl. Ganesha No. 10 Bandung 40132, Indonesia,

Tel. +62-22-86010080,

Fax.: +62-22-86010051;

e-mail: jictra@lppm.itb.ac.id.