Document Type: Research Article


Computer Engineering Department, Tarbiat Modares University, Tehran, Iran.


Classifying the type of a file fragment in the absence of header and file system information is a major building block of many solutions for file carving, memory analysis, and network forensics. Over the past decades, substantial effort has gone into developing methods for classifying file fragments, yet there has been little innovation in the fundamental approaches to file and fragment type classification. In this research, each fragment is mapped to an 8-bit grayscale image, and a texture analysis method is used in place of a classifier. Essentially, we show how to construct a vocabulary of visual words with the Bag-of-Visual-Words method. Using the n-gram technique, the feature vector is composed of visual-word occurrences. In classifying 31 file types over 31,000 fragments, our approach reached a maximum overall accuracy of 74.9% on 512-byte fragments and 87.3% on 4096-byte fragments.
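The pipeline outlined in the abstract (fragment to grayscale image, visual-word vocabulary, n-gram feature vector) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the texture descriptor, the clustering method, the image width, or the value of n. The sketch assumes raw 4x4 pixel patches as local descriptors, plain k-means for the vocabulary, a 32-pixel row width, and bigrams of visual words. All function names and parameters are hypothetical.

```python
import random
from collections import Counter


def fragment_to_image(fragment: bytes, width: int = 32):
    """Reshape a fragment into rows of `width` 8-bit grayscale pixels (assumed layout)."""
    if len(fragment) % width != 0:
        raise ValueError("fragment length must be a multiple of width")
    return [list(fragment[i:i + width]) for i in range(0, len(fragment), width)]


def extract_patches(image, size: int = 4):
    """Cut the image into non-overlapping size x size patches, flattened row-major."""
    patches = []
    for r in range(0, len(image) - size + 1, size):
        for c in range(0, len(image[0]) - size + 1, size):
            patches.append([image[r + dr][c + dc]
                            for dr in range(size) for dc in range(size)])
    return patches


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(patch, vocabulary):
    """Index of the visual word (centroid) closest to the patch."""
    return min(range(len(vocabulary)), key=lambda k: sq_dist(patch, vocabulary[k]))


def build_vocabulary(patches, k, iterations=10, seed=0):
    """Plain k-means (assumed clustering step); the centroids act as visual words."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(patches, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in patches:
            clusters[nearest_word(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def ngram_feature_vector(fragment, vocabulary, n=2, width=32, size=4):
    """Count n-grams of consecutive visual words; returns a vector of length k**n."""
    image = fragment_to_image(fragment, width)
    words = [nearest_word(p, vocabulary) for p in extract_patches(image, size)]
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    k = len(vocabulary)
    vector = [0] * (k ** n)
    for gram, count in counts.items():
        index = 0
        for w in gram:  # encode the n-gram as a base-k number
            index = index * k + w
        vector[index] = count
    return vector


# Usage on a synthetic 512-byte fragment (illustrative data, not from the paper).
fragment = bytes(range(256)) * 2
patches = extract_patches(fragment_to_image(fragment))
vocabulary = build_vocabulary(patches, k=4)
features = ngram_feature_vector(fragment, vocabulary, n=2)
```

In practice the vocabulary would be learned from patches pooled across many training fragments, and the resulting feature vectors would be passed to whatever classifier or matching step the full method uses. The vector length grows as k to the power n, so small vocabularies or small n keep the representation manageable.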

