Background: The analysis of DNA nucleotide sequence similarity among different species is crucial in identifying their functional, structural or evolutionary relationships. The number of bioinformatics tools designed to perform similarity analysis of nucleotide sequences has been growing rapidly. According to the current literature, alignment-free methods have not yet been applied to repetitive nucleotide sequences of different lengths.
Objective: To develop a new algorithm for determining sequence characteristics and similarity based on statistically significant repetitive elements of different lengths located in the analyzed sequences.
Methods: This paper presents the Repeats-Position/Frequency method (R-P/F method) for determining nucleotide sequence similarity, which takes into consideration the statistically significant repetitive parts of the analyzed sequences. It is based on information theory and on the fact that repeated sequences are not expected to occur with the same positions and frequencies in a random sequence of the same length. Nucleotide sequences are represented in an rn-dimensional vector space, and their hierarchy is constructed by applying a hierarchical clustering algorithm.
Results: The R-P/F method has been validated on multiple data sets of nucleotide sequences and compared with results obtained from the alignment-based algorithms BLAST and Clustal Omega and from multiple well-established alignment-free dissimilarity measures. The presented method provides results comparable with those of other commonly used methods that address the same problem, while offering a novel view of the repetitive parts of sequences used in these calculations.
Conclusion: The presented novel algorithm for calculating a sequence similarity measure is effective in discovering relationships among sequences and makes a powerful and complementary addition to existing sequence similarity methods.
Keywords: Sequence similarity analysis, alignment-free method, statistically significant repeat, local frequency based entropy, hierarchical clustering, multi-dimensional vector space.
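Illustrative sketch: The general idea outlined under Methods (describe each sequence by the position and frequency of its repeats, place the sequences in a vector space, then cluster them hierarchically) can be illustrated with a minimal Python sketch. This is not the R-P/F method itself. It assumes a simplified repeat definition (any k-mer occurring at least twice), omits the statistical significance and entropy calculations, and uses single-linkage clustering as a stand-in for whichever clustering algorithm the paper applies. All function names, parameters, and example sequences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def repeat_features(seq, k_min=2, k_max=4):
    """Map each repeated k-mer to (relative frequency, mean relative position).

    Assumption: a "repeat" is any k-mer seen at least twice; no
    statistical significance test is applied in this sketch.
    """
    positions = defaultdict(list)
    for k in range(k_min, k_max + 1):
        for i in range(len(seq) - k + 1):
            positions[seq[i:i + k]].append(i)
    features = {}
    for kmer, pos in positions.items():
        if len(pos) >= 2:
            windows = len(seq) - len(kmer) + 1
            frequency = len(pos) / windows
            mean_position = sum(pos) / len(pos) / len(seq)
            features[kmer] = (frequency, mean_position)
    return features


def distance(fa, fb):
    """Euclidean distance over the union of repeats; absent repeats count as (0, 0)."""
    total = 0.0
    for kmer in set(fa) | set(fb):
        a = fa.get(kmer, (0.0, 0.0))
        b = fb.get(kmer, (0.0, 0.0))
        total += (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return sqrt(total)


def single_linkage(names, dist):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, distance)."""
    def gap(pair):
        return min(dist[x, y] for x in pair[0] for y in pair[1])

    clusters = [frozenset([name]) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=gap)
        merges.append((set(a), set(b), gap((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical toy sequences, for illustration only.
sequences = {
    "s1": "ACGTACGTACGTTT",
    "s2": "ACGTACGTACGTTA",
    "s3": "GGGCCCGGGCCCAA",
}
features = {name: repeat_features(seq) for name, seq in sequences.items()}

# Symmetric pairwise distance table keyed by (name, name).
dist = {}
for x, y in combinations(sequences, 2):
    dist[x, y] = dist[y, x] = distance(features[x], features[y])

merges = single_linkage(list(sequences), dist)
for a, b, d in merges:
    print(sorted(a), sorted(b), round(d, 3))
```

In this toy setup the two ACGT-rich sequences share most of their repeats and are merged first, while the GC-rich sequence joins last. A faithful implementation would replace the "at least twice" rule with the paper's significance criterion.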