Combinatorial Chemistry & High Throughput Screening

Author(s): Loren Hansen, Ernestine A. Lee, Kevin Hestir, Lewis T. Williams and David Farrelly

DOI: 10.2174/138620709788488984

Controlling Feature Selection in Random Forests of Decision Trees Using a Genetic Algorithm: Classification of Class I MHC Peptides

Page: [514 - 519] Pages: 6


Abstract

Feature selection is an important challenge in many classification problems, especially when the number of features greatly exceeds the number of available examples. We have developed a procedure - GenForest - which controls feature selection in random forests of decision trees by using a genetic algorithm. This approach was tested through our entry into the Comparative Evaluation of Prediction Algorithms 2006 (CoEPrA) competition (accessible online at: http://www.coepra.org). CoEPrA was a modeling competition organized to provide objective testing of various classification and regression algorithms through blind prediction. In the competition, GenForest ranked 10/23, 5/16 and 9/16 on CoEPrA classification problems 1, 3 and 4, respectively. Each of these problems involved the classification of a different set of class I MHC nonapeptides, i.e., peptides containing nine amino acids. Each amino acid was described by a set of 643 features, giving a total of 5787 (9 × 643) features per peptide. The method, its application to the CoEPrA datasets, and its performance in the competition are described.

Keywords: Decision trees, random forests, feature selection, genetic algorithms, evolutionary computation
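
The abstract's general idea, a genetic algorithm deciding which of the 5787 features a classifier may use, can be illustrated with a minimal sketch. This is not the GenForest procedure itself, whose details the abstract does not give. The names `feature_index`, `evolve_mask` and `toy_fitness`, the position-major feature layout, the GA operators (truncation selection, one-point crossover, bit-flip mutation) and all parameter values are assumptions made for illustration. The toy fitness function is a stand-in for what would more plausibly be the validation accuracy of a random forest trained on the selected features.

```python
import random

N_POSITIONS = 9               # residues in a nonapeptide (from the abstract)
FEATURES_PER_RESIDUE = 643    # features per amino acid (from the abstract)
N_FEATURES = N_POSITIONS * FEATURES_PER_RESIDUE  # 5787 features per peptide


def feature_index(position, feature):
    """Flat index of one per-residue feature (position-major layout is assumed)."""
    return position * FEATURES_PER_RESIDUE + feature


def evolve_mask(fitness, n_features, pop_size=20, generations=30,
                mutation_rate=0.02, seed=0):
    """Hypothetical GA over boolean feature masks; returns the fittest mask found."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]       # truncation selection
        children = [ranked[0]]                 # elitism: keep the best mask unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation: each bit is toggled with probability mutation_rate.
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    return max(population, key=fitness)


# Toy stand-in for classifier accuracy: reward three "informative" features
# and lightly penalise mask size. A real fitness would train and score a forest.
INFORMATIVE = {0, 3, 7}


def toy_fitness(mask):
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return 2.0 * hits - 0.1 * sum(mask)


best = evolve_mask(toy_fitness, n_features=12)
selected = [i for i, keep in enumerate(best) if keep]
print(N_FEATURES, selected)
```

The toy problem uses only 12 features so that it runs instantly. Applying the same loop to all 5787 features would only require a mask of length `N_FEATURES` and a fitness function that trains a classifier on the selected columns.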