Analyzing new features of infected web content in detection of malicious web pages

Document Type: ORIGINAL RESEARCH PAPER

Authors

1 Department of Computer Engineering, Imam Reza University, Mashhad, Iran

2 Department of Computer Engineering, Islamic Azad University, Mashhad, Iran

3 Department of Electrical and Computer Science, University of Glasgow, Glasgow, U.K.

Abstract

Recent advances in web standards and technologies enable attackers to hide and obfuscate malicious code with new methods and thus evade security filters. In this paper, we study the application of machine learning techniques to the detection of malicious web pages. To this end, we propose and analyze a novel set of features covering HTML, JavaScript (including the jQuery library), and XSS attacks. The proposed features are evaluated on a data set gathered by a crawler from malicious web domain, IP, and address blacklists. For the evaluation, we use a number of machine learning algorithms. Experimental results show that, with the proposed set of features, the C4.5-Tree algorithm offers the best performance, with an accuracy of 97.61% and an F1-measure of 96.75%. We also rank the features by quality. The results suggest that nine of the proposed features are among the twenty most discriminative features.
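For readers less familiar with the two reported metrics, the following minimal Python sketch shows how accuracy and the F1-measure are conventionally computed from the counts of a binary confusion matrix (malicious vs. benign). The function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp: malicious pages correctly flagged    fp: benign pages wrongly flagged
    tn: benign pages correctly passed        fn: malicious pages missed
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (not the paper's data).
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because accuracy counts correctly classified benign pages while F1 does not, the two values can differ noticeably when malicious pages are a minority of the data set, which is why both are usually reported.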

Keywords

