Abstract
BACKGROUND: Repetition of super-secondary structure is a common phenomenon, especially in higher eukaryotic organisms. The copy number and assembly of these repeating units are responsible for diverse protein-protein interactions, and consequently, defects in repeat proteins are linked to many diseases. Variation within the repeat units makes them difficult to identify at the sequence level, so structure-based approaches are desirable. Since most structure-based methods employ computationally intensive structure-structure alignment, we propose a computationally efficient structure-based approach for the identification of repeats using concepts from graph theory. RESULTS: The three-dimensional topology of protein structures is known to be well captured by protein contact graphs. The connectivity information in a graph is represented in its adjacency matrix, and the eigenspectra of the adjacency matrix reflect the topological importance of each node to the connectivity of the graph. In our earlier work, we observed that the principal eigenspectra of the adjacency matrix capture tandemly repeated structural motifs well. Here we propose an algorithm for the identification of tandemly repeated structural motifs using graph properties and secondary structure information from the STRIDE database. The algorithm first identifies the length of the repeat motif by analyzing the periodicity of peaks in eigenvector centrality. The repeat boundaries are then identified by superposing the contiguous repeats, extending on either side of the peak regions to the start or end of the secondary structure elements, and checking for periodicity of the secondary structure architecture in the identified repeat regions. Using the secondary structure annotation thus helps to refine the boundaries of the repeat regions and to discard false positives.
We tested the algorithm on various structural repeats such as HEAT, WD, Ankyrin (ANK), Tetratricopeptide repeat (TPR), Leucine rich repeat (LRR), etc., with different super-secondary structure motifs, ranging from all-alpha to all-beta to mixed topologies such as alpha-turn-alpha, beta-alpha, etc. The predictions agree with the annotations in the UniProt database. CONCLUSIONS: Graph-based analysis of protein structures, together with domain information such as the organization of the secondary structure elements, provides a computationally efficient approach for the identification of structural repeats.
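
The first step described above, estimating the repeat length from the periodicity of the eigenvector centrality profile, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the C-alpha distance cutoff of 7 Å, and the use of autocorrelation in place of explicit peak picking are all assumptions made for illustration, and the refinement of repeat boundaries with STRIDE secondary structure is not covered.

```python
import math


def contact_graph(coords, cutoff=7.0):
    """Adjacency matrix of a protein contact graph.

    Assumption: `coords` holds one (x, y, z) C-alpha position per residue,
    and two residues are in contact when they lie within `cutoff` angstroms.
    """
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i][j] = adj[j][i] = 1
    return adj


def eigenvector_centrality(adj, iterations=1000, tol=1e-10):
    """Principal eigenvector of the adjacency matrix by power iteration.

    Iterating with (A + I) instead of A leaves the eigenvectors unchanged
    but avoids oscillation on bipartite graphs.
    """
    n = len(adj)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        nxt = [vec[i] + sum(a * v for a, v in zip(adj[i], vec)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        nxt = [x / norm for x in nxt]
        if max(abs(a - b) for a, b in zip(nxt, vec)) < tol:
            return nxt
        vec = nxt
    return vec


def dominant_period(profile, min_lag=2):
    """Lag with the largest unnormalised autocorrelation of the profile.

    Assumption: this stands in for the peak-periodicity analysis; the
    unnormalised sum favours the shortest lag among equivalent multiples.
    """
    n = len(profile)
    mean = sum(profile) / n
    centred = [x - mean for x in profile]
    best_lag, best_score = None, float("-inf")
    for lag in range(min_lag, n // 2 + 1):
        score = sum(centred[i] * centred[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Under these assumptions, a candidate repeat length for a structure would be obtained as `dominant_period(eigenvector_centrality(contact_graph(coords)))`, with `coords` read from the structure file by whatever parser is at hand.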