Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

BaniMustafa, Ahmed

doi:10.22042/isecure.2019.11.0.11

Document Type : Research Article

Author

Ahmed BaniMustafa

American University of Madaba

https://doi.org/10.22042/isecure.2019.11.0.11

Abstract

This paper presents a data mining application in metabolomics. It aims to build an enhanced machine learning classifier that can be used to diagnose cachexia syndrome and identify the biomarkers involved. To achieve this goal, a data-driven analysis is carried out on a public dataset of 1H-NMR metabolite profiles. This dataset suffers from imbalanced classes, a problem known to degrade classifier performance and to affect the validity and generalizability of the resulting models. The classification models in this study were built using five machine learning algorithms: PLS-DA, MLP, SVM, C4.5 and ID3. The models were built after a number of intensive data preprocessing procedures intended to tackle the class imbalance and improve the performance of the constructed classifiers. These procedures involve data transformation, normalization, standardization, re-sampling, and data reduction using a number of variable importance scorers. The best performance was achieved by an MLP model trained and tested with five-fold cross-validation on data that were re-sampled using the SMOTE method and then reduced using the SVM variable importance scorer. This model classified samples with excellent accuracy and identified potential disease biomarkers. The results confirm the validity of metabolomics data mining for the diagnosis of cachexia. They also emphasize the importance of data preprocessing procedures such as sampling and data reduction for improving data mining results, particularly when the data suffer from imbalanced classes.
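
The re-sampling step can be illustrated with a minimal sketch of the idea behind SMOTE: new minority-class samples are interpolated between an existing minority sample and one of its nearest minority neighbours. This is not the implementation used in the study. The function name `smote`, its parameters, and the toy values below are assumptions made purely for illustration, and the data are invented rather than taken from the cachexia dataset.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples.

    minority: list of equal-length feature lists (minority class only).
    k: number of nearest minority neighbours considered per sample.
    Hypothetical helper for illustration, not a library API.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to base.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: math.dist(base, m))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment
        # joining base and the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy, made-up data: 3 minority samples against 6 majority samples.
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4]]
majority_count = 6
new_points = smote(minority, majority_count - len(minority), k=2)
balanced_minority = minority + new_points
print(len(balanced_minority))  # prints 6
```

In a cross-validated setting such as the five-fold scheme described above, over-sampling of this kind is usually applied to the training folds only, so that synthetic points do not leak into the test fold. Whether the study followed that practice is not stated in the abstract.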

Keywords

Data Mining

Metabolomics

Cachexia

Preprocessing

Imbalanced Classes

Re-sampling

Data Reduction

20.1001.1.20082045.2019.11.3.12.5
