Document Type: Research Article

Abstract

Today, the world's growing dependence on the Internet and the emergence of Web 2.0 applications are significantly increasing the need for web robots that crawl sites to support services and technologies. Despite their advantages, robots can consume bandwidth and degrade the performance of web servers. Despite extensive research, there is no accurate method for classifying huge data sets of web visitors in a reasonable amount of time. Moreover, such a method should be insensitive to the ordering of instances and should produce deterministic, accurate results. Therefore, this paper presents a density-based clustering approach that uses Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to classify the web visitors of two large real-world data sets. We propose two new features, based on the behavioral patterns of visitors, to describe them. In addition, we consider 12 common features and use the test of significance of difference (t-test) to reduce the number of dimensions and thereby overcome one of the disadvantages of DBSCAN. Based on supervised evaluation metrics, the proposed algorithm achieves a Jaccard metric of 95% and produces two clusters with entropy and purity of 0.024 and 0.97, respectively. Furthermore, in terms of clustering quality and accuracy, the proposed method performs better than state-of-the-art algorithms. Finally, we conclude that some known web robots are difficult to identify because they imitate human users.
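
The supervised metrics named in the abstract (Jaccard, entropy, and purity) can be illustrated with a small sketch. The abstract does not give the formulas, so the definitions below are the common textbook ones and should be read as an assumption: purity as the weighted share of the majority class per cluster, entropy as the size-weighted class entropy per cluster, and Jaccard as the pair-counting coefficient. The function names and the toy labels are hypothetical and are not taken from the paper's data sets.

```python
from collections import Counter
from itertools import combinations
from math import log2


def _class_counts(clusters, classes):
    """Group the true class labels by assigned cluster id."""
    counts = {}
    for cluster_id, true_class in zip(clusters, classes):
        counts.setdefault(cluster_id, Counter())[true_class] += 1
    return counts


def purity(clusters, classes):
    """Fraction of points that belong to the majority class of their cluster."""
    counts = _class_counts(clusters, classes)
    return sum(max(c.values()) for c in counts.values()) / len(clusters)


def entropy(clusters, classes):
    """Size-weighted average of the class entropy (base 2) inside each cluster."""
    n = len(clusters)
    total = 0.0
    for c in _class_counts(clusters, classes).values():
        size = sum(c.values())
        h = -sum((v / size) * log2(v / size) for v in c.values())
        total += (size / n) * h
    return total


def jaccard(clusters, classes):
    """Pair-counting Jaccard: pairs grouped together in both labelings,
    divided by pairs grouped together in at least one of them."""
    both = either = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        both += same_cluster and same_class
        either += same_cluster or same_class
    return both / either if either else 1.0


# Toy example (assumed data): two clusters, one human/robot mix-up in cluster 0.
toy_clusters = [0, 0, 0, 0, 1, 1, 1, 1]
toy_classes = ["human", "human", "human", "robot",
               "robot", "robot", "robot", "robot"]

print(purity(toy_clusters, toy_classes))
print(entropy(toy_clusters, toy_classes))
print(jaccard(toy_clusters, toy_classes))
```

Under these definitions, a perfect clustering has purity 1, entropy 0, and Jaccard 1, so low entropy combined with high purity indicates that each cluster is dominated by a single visitor class.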
