Background: Identifying protein-ligand binding sites is an important step in characterizing molecular function. Although many ligand-binding site prediction methods have been developed, there is still strong demand for higher prediction accuracy and lower computational cost.
Objective: In this paper, we introduce a structure alignment-based binding site prediction method that combines a large, well-refined template database, homologous index-guided alignment, conservation-based ranking of binding sites, and Hadoop-based acceleration of alignment.
Method: We first build a large template database under strict quality control. During structure alignment, a homologous index is used to refine the set of templates for a given query chain. The structure alignment itself is run on Hadoop, which improves prediction efficiency. A clustering method is then used to analyze the candidate sites. Finally, the sites are ranked according to the conservation scores of all residues in each site.
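The final ranking step can be illustrated with a minimal sketch. The abstract does not specify how the residue conservation scores within a site are combined, so the mean is used here purely as an assumption; the function name `rank_sites` and the input layout (a mapping from site identifier to residue identifiers, and a mapping from residue identifier to conservation score) are hypothetical.

```python
from statistics import mean


def rank_sites(sites, conservation):
    """Rank candidate sites by the mean conservation of their residues.

    sites: dict mapping a site id to a list of residue ids (hypothetical layout).
    conservation: dict mapping a residue id to its conservation score.
    Aggregating by the mean is an assumption, not the published scoring rule.
    """
    scored = [
        (site_id, mean(conservation[res] for res in residues))
        for site_id, residues in sites.items()
        if residues  # skip empty sites, which have no defined mean
    ]
    # Highest mean conservation first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    example_sites = {"site1": [10, 11, 12], "site2": [40, 41]}
    example_scores = {10: 0.2, 11: 0.4, 12: 0.3, 40: 0.9, 41: 0.7}
    for site_id, score in rank_sites(example_sites, example_scores):
        print(site_id, round(score, 3))
```

Any other aggregate (sum, median, or a weighted score) could be substituted in the same place without changing the rest of the ranking.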
Results: On the 210 bound test dataset, our method achieved an accuracy (ACC) of up to 0.93 and a Matthews correlation coefficient (MCC) of 0.80. On the 48 unbound/bound test dataset, it achieved an ACC of up to 0.97 for bound proteins (MCC 0.87) and 0.95 for unbound proteins (MCC 0.66). Structure alignment is also accelerated on a Hadoop cluster, as illustrated for chain 1qif.A.
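For reference, ACC and MCC follow their standard definitions over true/false positive and negative counts. The sketch below assumes the counts are taken per residue (binding versus non-binding), which the abstract does not state; the function names are illustrative only.

```python
import math


def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Because non-binding residues usually far outnumber binding residues, ACC can be high even when few binding residues are recovered; MCC accounts for all four counts and is therefore reported alongside it.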
Conclusion: Compared with other binding site prediction methods evaluated on the same test datasets, our method can reduce computation time and improve prediction accuracy.
Keywords: Binding site, structure alignment, homologous index, clustering, conservation, Hadoop.