Document Type: Research Article


Information Technology Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.


Motif discovery is a challenging problem in bioinformatics and an essential step towards understanding gene regulation. Although numerous algorithms and tools have been proposed in the literature, the accuracy of motif finding remains low. In this paper, we tackle the motif discovery problem using ensemble methods. We first present a review and classification of current ensemble motif discovery tools. We then propose the Cluster-based Ensemble Motif Discovery Tool (CEMD), which is based on k-medoids clustering of the outputs of state-of-the-art stand-alone motif finding tools. We evaluate the performance of CEMD on benchmark datasets and compare the results with both stand-alone tools and similar ensemble tools. Experimental results indicate that CEMD has higher sensitivity than state-of-the-art stand-alone tools on human datasets. CEMD also achieves higher sensitivity when motifs are implanted in real promoter sequences. Compared with ensemble motif discovery tools, CEMD outperforms MEME-ChIP on all evaluation measures and shows performance comparable to RSAT peak-motifs and MODSIDE.
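To illustrate the clustering idea named in the abstract, the following is a minimal sketch of k-medoids applied to motifs predicted by several tools. It is not the CEMD implementation. The motif strings, the Hamming distance, the first-k initialization, and the function names `hamming`, `assign`, and `k_medoids` are all assumptions made for illustration; a real pipeline would more likely compare position weight matrices.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length motifs."""
    if len(a) != len(b):
        raise ValueError("motifs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign(items, medoids, dist):
    """Map each medoid index to the indices of the items closest to it."""
    clusters = {m: [] for m in medoids}
    for i, item in enumerate(items):
        # A medoid always stays in its own cluster, so no cluster is empty.
        if i in clusters:
            nearest = i
        else:
            nearest = min(medoids, key=lambda m: dist(item, items[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(items, k, dist, max_iter=100):
    """Alternating k-medoids: assign items, then re-pick each cluster's medoid."""
    medoids = list(range(k))  # assumption: the first k items seed the medoids
    for _ in range(max_iter):
        clusters = assign(items, medoids, dist)
        # New medoid = member with the smallest total distance to its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist(items[c], items[j]) for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return assign(items, medoids, dist)


# Hypothetical consensus motifs reported by different stand-alone tools.
motifs = ["ACGTAC", "ACGTAA", "ACGTTC", "TTGCAA", "TTGCAT", "TTGGAA"]
clusters = k_medoids(motifs, 2, hamming)
for medoid, members in clusters.items():
    print(motifs[medoid], [motifs[i] for i in members])
```

Each resulting cluster groups similar predictions, and its medoid is an actual predicted motif that can serve as the cluster's representative. This is one reason a medoid-based method may suit motif ensembles better than a mean-based one.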

