Abstract
Much recent attention has focused on analyzing the three-dimensional structure of proteins using graph and network concepts. The simplest protein network graph is constructed by defining the amino acids as nodes and drawing edges between amino acids that lie within a distance threshold, in an attempt to capture not only covalent but also non-covalent interactions. Analysis of the topological details of proteins with known structures, such as the clustering of specific types of amino acids important for structure, folding, and function, is of great value and is an active field of research. Since the structures of a large number of proteins are now available, automatic methods are required to analyze them, and tools from graph theory are being explored for this purpose. Here, we propose to use graph properties and their spectral analysis to identify patterns in protein structures, in particular structural repeats or motifs. These repeats are not easily detectable at the sequence level because of low conservation or high divergence between independent repeated units. The importance of such repeats in understanding biological function resides not only in their high frequency among known sequences, but also in their ability to confer multiple binding and structural roles on proteins. One example is the zinc finger domain, a constituent of transcription factors involved in DNA binding, where the composition and copy number of individual tandem repeats confer selectivity and activity of DNA binding. This functional versatility is apparent not only among different repeat types, but also among similar repeats from the same family. Understanding repeats with respect to their structures, functions, and evolution therefore represents a considerable challenge. Our analysis of the degree distribution and graph spectra of a set of proteins from various repeat families shows that the repeat regions exhibit similar connectivity patterns in the graph. 
The network property betweenness was found to accurately identify the repeat boundaries in protein families. This approach can therefore help identify repeated motifs in proteins that are difficult to detect at the sequence level.
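The graph construction and the betweenness measure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `contact_graph` and `betweenness`, the default 7.0 Å cutoff, the use of one representative coordinate per residue, and the toy coordinates are all assumptions made for illustration. Betweenness is computed here with Brandes' algorithm on an unweighted graph, one common way to obtain it.

```python
from collections import deque
from itertools import combinations
from math import dist


def contact_graph(coords, cutoff=7.0):
    """Build adjacency sets: residues i and j are linked when their
    representative atoms lie within `cutoff` angstroms.
    The 7.0 default is an assumed value, not one taken from the paper."""
    adj = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if dist(coords[i], coords[j]) <= cutoff:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def betweenness(adj):
    """Unnormalized betweenness centrality of every node in an
    unweighted, undirected graph (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        depth = {v: -1 for v in adj}
        sigma[s], depth[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if depth[w] < 0:
                    depth[w] = depth[v] + 1
                    queue.append(w)
                if depth[w] == depth[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was visited from both endpoints.
    return {v: b / 2 for v, b in score.items()}


# Toy input: five pseudo-residues on a straight line, 3.8 angstroms apart.
# With a 4.0 cutoff only sequence neighbours are in contact (a path graph).
coords = [(3.8 * i, 0.0, 0.0) for i in range(5)]
adj = contact_graph(coords, cutoff=4.0)
degree = {v: len(nbrs) for v, nbrs in adj.items()}
bc = betweenness(adj)
print(degree)
print(bc)
```

In this toy chain the central residue carries the most shortest paths and so has the highest betweenness. For a real structure, the coordinates would come from a parsed structure file, and residues with high betweenness could then be inspected as candidate boundaries between repeat units.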