As the knowledge of protein signal peptides can be used to reprogram cells in a desired way for gene therapy, signal peptides have become a crucial tool for researchers to design new drugs for targeting a particular organelle to correct a specific defect. To effectively use such a technique, however, we have to develop an automated method for fast and accurately predicting signal peptides and their cleavage sites, particularly in the post-genomic era when the number of protein sequences is being explosively increased. To realize this, the first important thing is to discriminate secretory proteins from non-secretory proteins. On the basis of the Needleman-Wunsch algorithm, we proposed a new alignment kernel function. The novel approach can be effectively used to extract the statistical properties of protein sequences for machine learning, leading to a higher prediction success rate.
Keywords: Kernel function, global alignment, support vector machine, signal sequence, cleavage site, scaled window