Abstract
Pattern discovery is one of the fundamental problems in bioinformatics, and algorithms from computer science have been widely used for identifying biological patterns. The assumption behind pattern discovery approaches is that a pattern that occurs often enough in biological sequences or structures, or that is conserved across organisms, is expected to play a role in defining the functional behavior and/or evolutionary relationships of the respective sequence or structure. Typical examples of pattern recognition at the sequence level involve identifying repeats, conserved regulatory sequences, protein-coding regions, horizontally transferred regions, motifs, profiles, etc. Similarly, at the structure level pattern recognition involves identifying domains, structural motifs, repeats, etc. In this thesis we have addressed two problems of pattern identification. The first is at the structure level and involves identifying structural tandem repeats using graph theoretic approaches. The second is at the genomic level and involves identifying horizontally transferred regions. A web-based tool, IGIPT, incorporating a number of measures based on anomalies in nucleotide compositions has been developed.
1. Graph Theoretic Analysis of Protein Structures
Recently, much attention has been focused on the analysis of the three-dimensional structure of proteins using graph theory. Here, the protein structure is represented as a graph whose nodes are the amino acids on the polypeptide backbone chain and whose edges connect amino acids so as to capture not only covalent but also non-covalent interactions. A wide variety of problems dealing with protein structure, such as protein fold recognition, active site identification, detection of functionally and structurally important amino acids, domain identification, etc., have been addressed using graph theory. In this study we propose to use graph properties and their spectral analysis for identifying tandemly repeated patterns in protein structures. Specifically, we focus on identifying structural repeats that are not easily detectable at the sequence level because of low conservation and high divergence between the individual repeat units. For such repeats, analysis at the structure level is expected to be more reliable. Structural analysis of the repeat regions may also give some insight into the possible function of the repeat assembly. The analysis has been carried out by constructing a protein contact graph from the three-dimensional structure of the protein and using tools from graph theory to investigate the topology of that structure. The proposed approach, when applied to a number of proteins belonging to different repeat families in the Pfam database, viz., Armadillo, Spectrin, HEAT, Ankyrin, and Leucine-rich repeats, and to a number of proteins with structural repeats in the PROPEAT database, shows promising results. Compared to the self structure-comparison approach of the PROPEAT database, the graph-theoretic approach provides an elegant and efficient method of identifying repeats in protein structures. Since this approach can be easily automated, one can readily compare multiple copies of repeats between homologues. This would help in gaining insight into the mechanisms of evolution of these proteins by observing the conservation of the copy number or the gain/loss of repeating units.
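The construction of a protein contact graph and a basic spectral quantity derived from it can be sketched as follows. This is only an illustrative sketch, not the thesis implementation: the use of C-alpha coordinates, the 7.0 angstrom distance cutoff, the choice of the adjacency matrix (rather than, say, the Laplacian), and the function names `contact_graph` and `principal_eigenvector` are all assumptions made for the example.

```python
import numpy as np


def contact_graph(ca_coords, cutoff=7.0):
    """Return a binary adjacency matrix for a protein contact graph.

    Residues i and j are joined by an edge when their C-alpha atoms lie
    within `cutoff` angstroms (the cutoff value is an assumed parameter).
    """
    coords = np.asarray(ca_coords, dtype=float)
    # Pairwise Euclidean distances between all residues.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    adj = (dist <= cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops
    return adj


def principal_eigenvector(adj):
    """Return component magnitudes of the eigenvector of the largest eigenvalue.

    The adjacency matrix is symmetric, so `eigh` applies; it returns
    eigenvalues in ascending order, so the last column is the one we want.
    The overall sign of an eigenvector is arbitrary, hence the magnitudes.
    """
    _, vecs = np.linalg.eigh(adj)
    return np.abs(vecs[:, -1])


if __name__ == "__main__":
    # Toy "structure": four residues on a straight line, 3.8 angstroms apart.
    toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    adj = contact_graph(toy)
    print(adj.sum(axis=1))             # node degrees
    print(principal_eigenvector(adj))  # per-residue spectral profile
```

Plotting such a per-residue profile along the chain is one way a periodic pattern might be inspected; whether and how a given profile reveals repeat boundaries depends on the graph definition and is not something this sketch establishes.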
2. Integrated Genomic Island Prediction Tool
Numerous biological events drive genome evolution, such as gene conversions, rearrangements (e.g., inversion or translocation), deletions, and insertions of foreign DNA (e.g., plasmid integration, transposition). Horizontal Gene Transfer (HGT) is one of the major events responsible for significant alterations in genome composition. Evidence for HGT is stronger in prokaryotes (bacteria and archaea) than in eukaryotes. Horizontally transferred regions generally display anomalous nucleotide composition, and measures that capture these anomalies are used to identify them.
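One simple composition-based measure of this kind is GC content computed in sliding windows along the genome, with windows that deviate strongly from the genome-wide average flagged as candidates. The sketch below is an illustration under stated assumptions, not the procedure used by any particular tool: the window and step sizes, the z-score threshold of 2.0, and the function names are all hypothetical choices.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C bases in `seq` (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(genome, window=1000, step=500):
    """Return (start, GC fraction) for each full-length sliding window."""
    return [
        (start, gc_content(genome[start:start + window]))
        for start in range(0, len(genome) - window + 1, step)
    ]


def flag_anomalous(windows, z_cutoff=2.0):
    """Return start positions of windows whose GC content deviates from the
    mean over all windows by more than `z_cutoff` standard deviations
    (the threshold is an assumed parameter)."""
    values = [gc for _, gc in windows]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # perfectly uniform composition: nothing stands out
    return [start for start, gc in windows if abs(gc - mu) / sigma > z_cutoff]


if __name__ == "__main__":
    # Toy genome: AT-rich background with a short GC-rich insert.
    genome = "AT" * 50 + "GC" * 5 + "AT" * 50
    windows = windowed_gc(genome, window=10, step=10)
    print(flag_anomalous(windows))
```

A single measure such as this can miss transferred regions whose GC content happens to resemble that of the host, which is one reason to combine several composition measures.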
We have developed a web-based tool, the Integrated Genomic Island Prediction Tool (IGIPT), for identifying genomic islands (GIs) in prokaryotic genomes. It incorporates various measures aimed at capturing anomalous nucleotide composition, viz., GC content (at the genome level and at individual codon positions), dinucleotide biases, biases in codon and amino acid usage, and k-mer distribution (k = 2 to 6). The objective of developing such a tool was to provide all nucleotide-based measures on a single platform, since it has been observed that no single measure is sufficient for identifying GIs. Moreover, when more than one measure predicts the same region of the genome, confidence in the prediction increases. IGIPT has been tested on a set of prokaryotic geno