Optimization of Spaced K-mer Frequency Feature Extraction using Genetic Algorithms for Metagenome Fragment Classification

Arini Pekuwali, Wisnu Ananta Kusuma, Agus Buono


K-mer frequencies are commonly used in extracting features from metagenome fragments. In spite of this, researchers have found that their use is still inefficient. In this research, a genetic algorithm was employed to find optimally spaced k-mers. These were obtained by generating the possible combinations of match positions and don’t care positions (written as *). This approach was adopted from the concept of spaced seeds in PatternHunter. The use of spaced k-mers could reduce the size of the k-mer frequency feature’s dimension. To measure the accuracy of the proposed method we used the naïve Bayesian classifier (NBC). The result showed that the chromosome 111111110001, representing spaced k-mer model [111 1111 10001], was the best chromosome, with a higher fitness (85.42) than that of the k-mer frequency feature. Moreover, the proposed approach also reduced the feature extraction time. 


genetic algorithm; k-mers; metagenome; naïve Bayesian classifier; spaced k-mers

Full Text:



Zerbino, D.R. & Birney, E. Velvet: Algorithms for De Novo Short Read Assembly Using de Bruijn Graph. Genome Research, 18, pp. 821-829, 2008.

Hernandez, D., François, P., Farinelli, L., Osterås, M. & Schrenzel, J., De novo Bacterial Genom Sequencing: Millions of Very Short Reads Assembled on a Desktop Computer,. Genome Research, 18, pp. 802-809, 2008.

Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K., Li, S., Yang, H., Wang, J., De novo Assembly of Human Genomes with Massively Parallel Short Read Sequencing, Genome Research, 20 (2) pp. 265-272, 2009.

Wu, H., PCA-based Linear Combinations of Oligonucleotide Frequencies for Metagenomic DNA Fragment Binning Proc., IEEE Symp. Computational Intelligence in Bioinformatics and Computational Biology (CIBCB '08), IEEE Press, pp. 46-53, doi:10.1109/CIBCB.2008.4675758, 2008.

Bouchot, J.L., Trimble, W.L., Ditzler, G., Lan, Y., Essinger, S. & Rosen, G., Advances in Machine Learning for Processing and Comparison of Metagenomic Data, Available on: http://www.math.drexel.edu/~jb3455/publications/preprints/metagenomics_BC.pdf, 2013. [download September 15, 2014]

Thomas, T., Gilbert, J. & Meyer, F., Metagenomics – A Guide from Sampling to Data Analysis, Microbial Informatics and Experimentation, 2012.

Meyerdierks, A. & Glockner, F.O., Metagenome Analysis, Advances in Marine Genomics, 2010.

Altschul, S., Gish, W., Miller, W., Myers, E., & Lipman, D., Basic Local Alignment Search Tool, Journal of Molecular Biology, 215(3) pp. 403-410, 1990.

Huson, D.H., Auch, A.F. & Schuster, S.C., MEGAN Analysis of Metagenomic Data, Genome Research, 17(3) pp. 337-386, 2007.

McHardy, A.C., Martin, H.G., Tsirigos, A., Hugenholtz, P. & Rigoutsos, I., Accurate Phylogenetic Classification of Variable-Length DNA Fragments, Nature Methods, 4(1) pp. 63-72, 2007.

Rosen, G., Garbarine, E., Caseiro, D., Polikar, R. & Sokhansanj, Metagenome Fragment Classification Using N-mer Frequency Profiles, Advances in Bioinformatics, 2008.

Overbeek, M.V., Kusuma, W.A. & Buono, A., Clustering Metagenome Fragments Using Growing Selforganizing Map,Advanced Computer Science and Information Systems (ICACSIS), 285-289, DOI: 10.1109/ICACSIS.2013. 6761590, 2013.

Rosen, G.L., Reichenberger, E.R. & Rosenfeld, A.M., NBC: The Naive Bayes Classification Tool Webserver for Taxonomic Classification of Metagenomic Reads, Bioinformatics, 27(1), pp. 127-129, 2011.

Gsponer, S., Smyth, B. & Ifrim, G., Efficient Sequence Regression by Learning Linear Models in All-Subsequence Space, Insight Centre for Data Analytics, 2017.

Kusuma, W.A., Combined Approaches for Improving the Performance of Denovo DNA Sequence Assembly and Metagenomic Classification of Short Fragments from Next Generation Sequencer [Dissertation], Tokyo (JP): Tokyo Institute of Technology, 2012.

Ma, B., Tromp, J. & Li, M, PatternHunter: Faster and More Sensitive Homology Search, Bioinformatics, 18(3) pp. 440-445, 2002.

Alomari, O.A., Khader, A.T., Al-Betar, M.A. & Abualigah, L.M., Gene Selection for Cancer Classification by Combining Minimum Redundancy Maximum Relevancy and Bat-inspired Algorithm, Journal of Data Mining and Bioinformatics, 19(1) pp. 32-51, 2017.

Alomari, O.A., Khader, A.T., Al Betar, M.A. & Abualigah, L.M., MRMR BA: A Hybrid Gene Selection Algorithm for Cancer Classification, Journal of Theoretical and Applied Information Technology, 95(12), pp. 2610-2618, 2017.

Kusuma, W.A. & Akiyama, Y., Metagenome Fragments Classification Based on Characterization Vectors. Proceedings of International Conference on Bioinformatics and Biomedical Technology, Sanya China, pp. 50-54, March, 2011.

Richter, D.C., Ott, F., Auch, A.F., Schmid, R. & Huson, D.H., MetaSim: a Sequencing Simulator for Genomics and Metagenomics, PloS One, 3(10) pp. 1-12, 2008.

Goh, K. S., Lim, A. & Rodrigues, B., Sexual Selection for Genetic Algorithms, Artificial Intelligence Review, 19, pp. 123-152, 2003.

Schwartz, S., Kent W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D. & Miller,W., Human-Mouse Alignment with BLASTZ, Genome Research, 13, pp. 103-107, 2003.

Mitchell, T.M., Machine Learning, US: Mc Graw-Hill Science/ Engineering/ Math, 1997.

Han, J. & Kamber, M., Data Mining: Concepts and Techniques, San Francisco (US): Morgan Kaufmann Publishers, 2001.

Abualigah, L.M., Khader, A.T. & Al-Betar, M.A., Unsupervised Feature Selection Technique Based on Genetic Algorithm for Improving the Text Clustering, 7th CSIT (International Conference on Computer Science and Information Technology), pp. 1-6, 2016.

Abualigah, L.M. & Hanandeh, E. S., Applying Genetic Algorithms to Information Retrieval using Vector Space Model, IJCSEA, 5(1), pp. 19-28, 2015.

Angelova, M. & Pencheva, T., Tuning Genetic Algorithm Parameter to Improve Converge Time, International Journal of Chemical Engineering. doi: 10.1155/2011/646917, 2011.

Gen, M. & Cheng, R., Genetic Algorithms and Engineering Design, New York: John Wiley & Sons, Inc., 1997.

Razali, N. M. & Geraghty, J., Genetic Algorithm Performance with different Selection Strategies in Solving TSP, Proceedings of the World Congress on Engineering, London (UK), II, July 6-8, 2011.

Zhang, X. & Liu, S., Image Edge Feature Extraction and Refining Based on Genetic-Ant Colony Algorithm, Telkomnika, 3(1), pp. 118-127, 2015.

DOI: http://dx.doi.org/10.5614%2Fitbj.ict.res.appl.2018.12.2.2


  • »
  • »
  • »
  • »
  • »
  • »

Contact Information:

ITB Journal Publisher, LPPM – ITB, 

Center for Research and Community Services (CRCS) Building Floor 7th, 
Jl. Ganesha No. 10 Bandung 40132, Indonesia,

Tel. +62-22-86010080,

Fax.: +62-22-86010051;

e-mail: jictra@lppm.itb.ac.id.