An improved version of Chou's pseudo amino acid composition is proposed for extracting features from protein sequences, and a predictor based on the k-nearest neighbor algorithm is built on it to identify the types of lipases from their sequences. To avoid redundancy and bias, the predictor was evaluated on a dataset in which no protein has ≥ 25% sequence identity to any other. The overall success rate obtained in the 10-fold cross-validation test was over 90%, indicating that the improved Chou's pseudo amino acid composition may be a useful tool for extracting features from protein sequences, or can at least play a complementary role to many existing approaches.
Keywords: Lipase, improved Chou's pseudo amino acid composition, feature extraction, k-nearest neighbor, bioinformatics
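
As a rough illustration of the kind of pipeline summarized above, the following minimal sketch computes the classic (type 1) pseudo amino acid composition and classifies a query with a plain k-nearest neighbor vote. It is not the improved variant proposed here, whose details are not given in this summary. The Kyte-Doolittle hydrophobicity scale, the parameters `lam` and `w`, the function names, and the toy sequences and labels are all assumptions chosen for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (an assumed choice of property scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(scale):
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    values = [scale[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {a: (scale[a] - mean) / std for a in AMINO_ACIDS}


H = standardize(KYTE_DOOLITTLE)


def pse_aac(sequence, lam=3, w=0.05):
    """Return a (20 + lam)-dimensional classic pseudo amino acid composition."""
    seq = [r for r in sequence.upper() if r in H]  # drop non-standard residues
    n = len(seq)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    counts = Counter(seq)
    freqs = [counts[a] / n for a in AMINO_ACIDS]
    # Sequence-order correlation factors for gaps k = 1..lam.
    thetas = [
        sum((H[seq[i]] - H[seq[i + k]]) ** 2 for i in range(n - k)) / (n - k)
        for k in range(1, lam + 1)
    ]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to query (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy usage with made-up sequences and hypothetical class labels.
training = [
    (pse_aac("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEV"), "type_A"),
    (pse_aac("MKTAYLAKQRQISFVKAHFSRQLEERLGLVEV"), "type_A"),
    (pse_aac("GSSGSSGPLLPPWWFFYYCCHHGSSGPLLPPW"), "type_B"),
    (pse_aac("GSSGSTGPLLPPWWFFYYCCHHGSAGPLLPPW"), "type_B"),
]
print(knn_predict(training, pse_aac("MKTAYIAKQRQLSFVKSHFSRQLEERLGLIEV"), k=3))
```

In a real evaluation the training vectors would come from a curated, low-identity dataset, and `k`, `lam`, and `w` would be tuned within the cross-validation folds rather than fixed as they are here.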