Feature-based Malicious URL and Attack Type Detection Using Multi-class Classification

Document Type: ORIGINAL RESEARCH PAPER

Authors

Department of Computer Engineering, R.C. Patel Institute of Technology, Shirpur-25405, India

Abstract

Malicious URLs are now a common threat to businesses, social networks, and net-banking. Existing approaches have focused on binary detection, i.e., whether a URL is malicious or benign. Very little of the literature addresses the detection of both malicious URLs and their attack types. Yet knowing the attack type is necessary in order to adopt an effective countermeasure. This paper proposes a methodology to detect malicious URLs and their attack types based on multi-class classification. In this work, we propose 42 new features of spam, phishing, and malware URLs. These features have not been considered in earlier studies on malicious URL detection and attack type identification. Binary and multi-class datasets are constructed using 49935 malicious and benign URLs: 26041 benign and 23894 malicious URLs, the latter comprising 11297 malware, 8976 phishing, and 3621 spam URLs. To evaluate the proposed approach, state-of-the-art supervised batch and online machine learning classifiers are used in experiments on both the binary and the multi-class dataset. Using our proposed URL features, the confidence-weighted learning classifier performs best, achieving 98.44% average detection accuracy with a 1.56% error rate in the multi-class setting and 99.86% detection accuracy with a negligible error rate of 0.14% in the binary setting.
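The dataset composition and the accuracy figures quoted above can be cross-checked with a short sketch. The class counts and accuracies are taken from the abstract; the variable names, the `error_rate` helper, and the reading of error rate as the complement of accuracy are illustrative assumptions (consistent with the reported numbers), not the authors' code.

```python
# Class counts as stated in the abstract.
counts = {"benign": 26041, "malware": 11297, "phishing": 8976, "spam": 3621}

# Multi-class setting: each of the four classes keeps its own label.
multiclass_total = sum(counts.values())

# Binary setting: the three attack types collapse into one "malicious" label.
malicious = sum(n for label, n in counts.items() if label != "benign")
binary_counts = {"benign": counts["benign"], "malicious": malicious}

# Share of each class in the multi-class dataset, in percent.
shares = {label: round(100.0 * n / multiclass_total, 2)
          for label, n in counts.items()}


def error_rate(accuracy_pct: float) -> float:
    """Assumed definition: error rate is the complement of accuracy (percent)."""
    return round(100.0 - accuracy_pct, 2)


print("total URLs:", multiclass_total)
print("binary split:", binary_counts)
print("class shares (%):", shares)
print("error rates (%):", error_rate(98.44), error_rate(99.86))
```

The totals reproduce the 49935 URLs and 23894 malicious URLs given in the abstract, and the complements of the two accuracies match the reported error rates of 1.56% and 0.14%.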
