@inproceedings{bib_Prot_2024, AUTHOR = {Sri Lakshmi Bhavani Pagolu and Nita Parekh}, TITLE = {Protocol for Analyzing Epigenetic Regulation Mechanisms in Breast Cancer}, BOOKTITLE = {Methods in Molecular Biology}, YEAR = {2024}}
DNA methylation and gene expression are two critical aspects of the epigenetic landscape that contribute significantly to cancer pathogenesis. Analysis of aberrant genome-wide methylation patterns can provide insights into how these affect the cancer transcriptome and possible clinical implications for cancer diagnosis and treatment. The role of tumor suppressors and oncogenes in tumorigenesis is well known. Epigenetic alterations can significantly impact the expression and function of these critical genes, contributing to the initiation and progression of cancer. This protocol chapter presents a unified workflow to explore the role of DNA methylation in gene expression regulation in breast cancer by identifying differentially expressed genes whose promoter or gene body regions are differentially methylated, using various Bioconductor packages in the R environment. Functional enrichment analysis of these genes can help in understanding the mechanisms leading to tumorigenesis due to epigenetic alterations.
@inproceedings{bib_On_A_2024, AUTHOR = {Hrishi Narayanan and Prasad Krishnan and Nita Parekh}, TITLE = {On Achievable Rates for the Shotgun Sequencing Channel with Erasures}, BOOKTITLE = {International Symposium on Information Theory}, YEAR = {2024}}
In shotgun sequencing, the input string (typically, a long DNA sequence composed of nucleotide bases) is sequenced as multiple overlapping fragments of much shorter lengths (called \textit{reads}). Modelling the shotgun sequencing pipeline as a communication channel for DNA data storage, the capacity of this channel was identified in a recent work, assuming that the reads themselves are noiseless substrings of the original sequence. Modern shotgun sequencers, however, also output quality scores for each base read, indicating the confidence in its identification. Bases with low quality scores can be considered to be erased. Motivated by this, we consider the \textit{shotgun sequencing channel with erasures}, where each symbol in any read can be independently erased with some probability $\delta$. We identify achievable rates for this channel, using a random code construction and a decoder that uses typicality-like arguments to merge the reads.
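The channel described in this abstract can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the authors' construction: read start positions are drawn uniformly without wrap-around, the erasure symbol `?` and the function name `sample_reads` are hypothetical, and start positions are returned purely so the output can be checked (a real decoder would not see them).

```python
import random

ERASURE = "?"  # hypothetical placeholder for an erased base


def sample_reads(sequence, read_length, num_reads, delta, rng):
    """Sample reads from `sequence` and erase each symbol independently
    with probability `delta`. Returns a list of (start, read) pairs."""
    reads = []
    last_start = len(sequence) - read_length
    for _ in range(num_reads):
        # Assumption: uniform start position, no circular wrap-around.
        start = rng.randint(0, last_start)
        clean = sequence[start:start + read_length]
        noisy = "".join(
            ERASURE if rng.random() < delta else base for base in clean
        )
        reads.append((start, noisy))
    return reads


rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(200))  # toy input string
reads = sample_reads(genome, read_length=20, num_reads=50, delta=0.1, rng=rng)
```

Every non-erased symbol of a read agrees with the input string at its position, so setting `delta=0.0` recovers the noiseless setting mentioned in the abstract.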
@inproceedings{bib_Mamm_2024, AUTHOR = {Bhole Gaurav Hitesh and Suba S and Nita Parekh}, TITLE = {Mammo-Bench: A Large-scale Benchmark Dataset of Mammography Images}, BOOKTITLE = {International Conference on Computational Advances in Bio and Medical Sciences}, YEAR = {2024}}
Breast cancer remains a significant global health concern, and machine learning algorithms and computer-aided detection systems have shown great promise in enhancing the accuracy and efficiency of mammography image analysis. However, there is a critical need for large benchmark datasets for training deep learning models for breast cancer detection. In this work we developed Mammo-Bench, a large-scale benchmark dataset of mammography images, by collating data from seven well-curated resources, viz., INbreast, Mini-DDSM, KAU-BCMD, CMMD, CDD-CESM, DMID, and RSNA Screening Dataset. To ensure consistency across images from diverse sources while preserving clinically relevant features, all the images underwent a preprocessing pipeline that includes breast segmentation, pectoral muscle removal, and intelligent cropping. The dataset consists of 71,844 high-quality mammographic images from 26,500 patients across 8 countries and is, to the best of our knowledge, one of the largest open-source mammography databases. To show the utility of Mammo-Bench, the ResNet101 architecture was used for classifying the images into Normal, Benign and Malignant classes. Performance of ResNet101 was evaluated on the proposed dataset and the results compared with a few member datasets and an external dataset, VinDr-Mammo. We show that training on the larger, proposed benchmark dataset is more reliable than training on the smaller datasets. Accuracies of 78.8% (with data augmentation of the minority classes) and 77.8% (without data augmentation) were achieved on the proposed benchmark dataset, whereas accuracy on the other datasets varied from 25% to 69%. Most striking was the improved prediction of the minority classes using Mammo-Bench. These results establish baseline performance and demonstrate Mammo-Bench's utility as a comprehensive resource for developing and evaluating mammography analysis systems.
In Silico Identification and Functional Characterization of Genetic Variations across DLBCL Cell Lines
@inproceedings{bib_In_S_2023, AUTHOR = {Prashanthi Dharanipragada and Nita Parekh}, TITLE = {In Silico Identification and Functional Characterization of Genetic Variations across DLBCL Cell Lines}, BOOKTITLE = {Journal on CELL}, YEAR = {2023}}
Diffuse large B-cell lymphoma (DLBCL) is the most common form of non-Hodgkin lymphoma and frequently develops through the accumulation of several genetic variations. With the advancement in high-throughput techniques, in addition to mutations and copy number variations, structural variations have gained importance for their role in genome instability leading to tumorigenesis. In this study, in order to understand the genetics of DLBCL pathogenesis, we carried out a whole-genome mutation profile analysis of eleven human cell lines from germinal-center B-cell-like (GCB-7) and activated B-cell-like (ABC-4) subtypes of DLBCL. Analysis of genetic variations including small sequence variants and large structural variations across the cell lines revealed distinct variation profiles indicating the heterogeneous nature of DLBCL and the need for novel patient stratification methods to design potential intervention strategies. Validation and prognostic significance of the variants was assessed using annotations provided for DLBCL samples in cBioPortal for Cancer Genomics. Combining genetic variations revealed new subgroups between the subtypes and associated enriched pathways, viz., PI3K-AKT signaling, cell cycle, TGF-beta signaling, and WNT signaling. Mutation landscape analysis also revealed drug–variant associations and possible effectiveness of known and novel DLBCL treatments. From the whole-genome-based mutation analysis, our findings suggest putative molecular genetics of DLBCL lymphomagenesis and potential genomics-driven precision treatments. Keywords: diffuse large B-cell lymphoma (DLBCL); copy number variations; sequence variations; structural variations; cell lines; precision medicine
@inproceedings{bib_Atte_2023, AUTHOR = {Suba S and Nita Parekh}, TITLE = {Attention-CNN Model for COVID-19 Diagnosis Using Chest CT Images}, BOOKTITLE = {Conference on Pattern Recognition and Machine Intelligence}, YEAR = {2023}}
Deep learning assisted disease diagnosis using chest radiology images to assess severity of various respiratory conditions has garnered a lot of attention after the recent COVID-19 pandemic. Understanding characteristic features associated with the disease in radiology images, along with variations observed from patient-to-patient and with the progression of disease, is important while building such models. In this work, we carried out comparative analysis of various deep architectures with the proposed attention-based Convolutional Neural Network (CNN) model with customized bottleneck residual module (Attn-CNN) in classifying chest CT images into three categories, COVID-19, Normal, and Pneumonia. We show that the attention model with fewer parameters achieved better classification performance compared to state-of-the-art deep architectures such as EfficientNet-B7, Inceptionv3, ResNet-50 and VGG-16, and customized models proposed in similar studies such as COVIDNet-CT, CTnet-10, COVID-19Net, etc.
@inproceedings{bib_DNA__2023, AUTHOR = {Sri Lakshmi Bhavani Pagolu and Suba S and Nita Parekh}, TITLE = {DNA Methylation-Based Subtype Classification of Breast Cancer}, BOOKTITLE = {International Conference on Computational Advances in Bio and Medical Sciences}, YEAR = {2023}}
Aberrant genome-wide DNA methylation patterns are common in cancers. Understanding how these affect the transcriptome can provide insights into subtype-specific development and progression of tumorigenesis. In this study we carried out genome-wide analysis of DNA methylation and gene expression profiles in TCGA-BRCA breast cancer samples to propose a novel set of 35 methylation-based prognostic markers that may provide insights into molecular subtype-specific disease stratification. Gene-set enrichment and pathway analysis of the predicted markers using MSigDB and DAVID revealed their role in the mammary gland development pathway, various signaling pathways (ERBB2, NOTCH, etc.), and other cancer pathways, and showed a clear association with genes affected by hormone receptor status. We further show the discriminative power of the proposed DNA methylation signature in classifying breast cancer samples into three molecular subtypes, viz., Luminal, HER2-enriched and Triple Negative. An accuracy of 94.12% and MCC of 0.87 are obtained in stratified 5-fold cross-validation for the three-class classification using SVM-RBF.
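For readers unfamiliar with the two metrics reported in this abstract, the sketch below shows how accuracy and the multi-class Matthews correlation coefficient (MCC) can be computed from true and predicted labels. It is an illustration only: the labels are invented toy values, not results from the paper, and the function names are hypothetical. The MCC uses the standard multi-class generalization based on per-class true and predicted counts.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multi-class MCC computed from label counts.

    With s samples, c correct predictions, t_k true occurrences and
    p_k predictions of class k:
    MCC = (c*s - sum_k t_k*p_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    """
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_counts = Counter(y_true)
    p_counts = Counter(y_pred)
    labels = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    var_true = s * s - sum(t_counts[k] ** 2 for k in labels)
    var_pred = s * s - sum(p_counts[k] ** 2 for k in labels)
    denominator = math.sqrt(var_true * var_pred)
    # Convention: return 0.0 when the coefficient is undefined.
    return numerator / denominator if denominator else 0.0


# Invented toy labels for the three subtypes named in the abstract.
y_true = ["Luminal"] * 4 + ["HER2-enriched"] * 3 + ["Triple Negative"] * 3
y_pred = ["Luminal"] * 4 + ["HER2-enriched"] * 2 + ["Luminal"] + ["Triple Negative"] * 3

toy_accuracy = accuracy(y_true, y_pred)
toy_mcc = multiclass_mcc(y_true, y_pred)
```

MCC takes class imbalance into account, which is presumably why it is reported alongside accuracy for a three-class problem.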
@inproceedings{bib_Atte_2023, AUTHOR = {Suba S and Nita Parekh}, TITLE = {Attention-CNN Model for COVID-19 Diagnosis Using Chest CT Images}, BOOKTITLE = {Lecture Notes on Computer Science}, YEAR = {2023}}
Deep learning assisted disease diagnosis using chest radiology images to assess severity of various respiratory conditions has garnered a lot of attention after the recent COVID-19 pandemic. Understanding characteristic features associated with the disease in radiology images, along with variations observed from patient-to-patient and with the progression of disease, is important while building such models. In this work, we carried out comparative analysis of various deep architectures with the proposed attention-based Convolutional Neural Network (CNN) model with customized bottleneck residual module (Attn-CNN) in classifying chest CT images into three categories, COVID-19, Normal, and Pneumonia. We show that the attention model with fewer parameters achieved better classification performance compared to state-of-the-art deep architectures such as EfficientNet-B7, Inceptionv3, ResNet-50 and VGG-16, and customized models proposed in similar studies such as COVIDNet-CT, CTnet-10, COVID-19Net, etc.
Evaluating Generalizability of Deep Learning Models Using Indian-COVID-19 CT Dataset
@inproceedings{bib_Eval_2023, AUTHOR = {Suba S and Nita Parekh and Ramesh Loganathan and Vikram Pudi and Chinnababu Sunkavalli}, TITLE = {Evaluating Generalizability of Deep Learning Models Using Indian-COVID-19 CT Dataset}, BOOKTITLE = {International Conference on Bioinformatics and Data Science}, YEAR = {2023}}
Computed tomography (CT) has been routinely used for the diagnosis of lung diseases and recently, during the pandemic, for detecting the infectivity and severity of COVID-19 disease. One of the major concerns in using machine learning (ML) approaches for automatic processing of CT scan images in clinical settings is that these methods are trained on limited and biased subsets of publicly available COVID-19 data. This has raised concerns regarding the generalizability of these models on external datasets, not seen by the model during training. To address some of these issues, in this work CT scan images from confirmed COVID-19 data obtained from one of the largest public repositories, COVIDx CT 2A, were used for training and internal validation of machine learning models. For the external validation we generated the Indian-COVID-19 CT dataset, an open-source repository containing 3D CT volumes and 12096 chest CT images from 288 COVID-19 patients from India. Comparative performance evaluation of four state-of-the-art machine learning models, viz., a lightweight convolutional neural network (CNN), and three other CNN-based deep learning (DL) models, VGG-16, ResNet-50 and Inception-v3, in classifying CT images into three classes, viz., normal, non-covid pneumonia, and COVID-19, is carried out on these two datasets. Our analysis showed that the performance of all the models is comparable on the hold-out COVIDx CT 2A test set with 90%–99% accuracies (96% for CNN), while on the external Indian-COVID-19 CT dataset a drop in performance is observed for all the models (8%–19%). The traditional machine learning model, CNN, performed the best on the external dataset (accuracy 88%) in comparison to the deep learning models, indicating that a lightweight CNN generalizes better to unseen data. The data and code are made available at https://github.com/aleesuss/c19.
@inproceedings{bib_Sequ_2022, AUTHOR = {BROTO CHAKRABARTY and Nita Parekh}, TITLE = {Sequence and Structure-Based Analyses of Human Ankyrin Repeats}, BOOKTITLE = {Molecules}, YEAR = {2022}}
Ankyrin is one of the most abundant protein repeat families found across all forms of life. It is found in a variety of multi-domain and single domain proteins in humans with diverse number of repeating units. They are observed to occur in several functionally diverse proteins, such as transcriptional initiators, cell cycle regulators, cytoskeletal organizers, ion transporters, signal transducers, developmental regulators, and toxins, and, consequently, defects in ankyrin repeat proteins have been associated with a number of human diseases. In this study, we have classified the human ankyrin proteins into clusters based on the sequence similarity in their ankyrin repeat domains. We analyzed the amino acid compositional bias and consensus ankyrin motif sequence of the clusters to understand the diversity of the human ankyrin proteins. We carried out network-based structural analysis of human ankyrin proteins across different clusters and showed the association of conserved residues with topologically important residues identified by network centrality measures. The analysis of conserved and structurally important residues helps in understanding their role in structural stability and function of these proteins. In this paper, we also discuss the significance of these conserved residues in disease association across the human ankyrin protein clusters.
NetREx: Network-based Rice Expression Analysis Server for abiotic stress conditions
SANCHARI SIRCAR,Mayank Musaddi,Nita Parekh
@inproceedings{bib_NetR_2022, AUTHOR = {SANCHARI SIRCAR and Mayank Musaddi and Nita Parekh}, TITLE = {NetREx: Network-based Rice Expression Analysis Server for abiotic stress conditions}, BOOKTITLE = {Database}, YEAR = {2022}}
Recent focus on transcriptomic studies in food crops like rice, wheat and maize provides new opportunities to address issues related to agriculture and climate change. Re-analysis of such data available in the public domain, supplemented with annotations across the molecular hierarchy, can be of immense help to the plant research community, particularly co-expression networks representing transcriptionally coordinated genes that are often part of the same biological process. With this objective, we have developed NetREx, a Network-based Rice Expression Analysis Server, that hosts ranked co-expression networks of Oryza sativa using publicly available messenger RNA sequencing data across uniform experimental conditions. It provides a range of interactive data viewers and modules for analysing user-queried genes across different stress conditions (drought, flood, cold and osmosis), hormonal treatments (abscisic and jasmonic acid) and tissues (root and shoot). Subnetworks of user-defined genes can be queried in pre-constructed tissue-specific networks, allowing users to view the fold change, module memberships, gene annotations and analysis of their neighbourhood genes and associated pathways. The web server also allows querying of orthologous genes from Arabidopsis, wheat, maize, barley and sorghum. Here, we demonstrate that NetREx can be used to identify novel candidate genes and tissue-specific interactions under stress conditions and can aid in the analysis and understanding of complex phenotypes linked to stress response in rice.
Explainable and Lightweight Model for COVID-19 Detection Using Chest Radiology Images
Suba S,Nita Parekh
Technical Report, arXiv, 2022
@inproceedings{bib_Expl_2022, AUTHOR = {Suba S and Nita Parekh}, TITLE = {Explainable and Lightweight Model for COVID-19 Detection Using Chest Radiology Images}, BOOKTITLE = {Technical Report}, YEAR = {2022}}
Deep learning (DL) analysis of chest X-ray (CXR) and computed tomography (CT) images has garnered a lot of attention in recent times due to the COVID-19 pandemic. Convolutional Neural Networks (CNNs) are well suited for image analysis tasks when trained on very large amounts of data. Applications developed for medical image analysis require higher sensitivity and precision than those in other fields. Most of the tools proposed for detection of COVID-19 claim to have high sensitivity and recall but have failed to generalize and perform when tested on unseen datasets. This encouraged us to develop a CNN model and to analyze and understand its performance by visualizing the predictions of the model using class activation maps generated with the Gradient-weighted Class Activation Mapping (Grad-CAM) technique. This study provides a detailed discussion of the success and failure of the proposed model at an image level. Performance of the model is compared with state-of-the-art DL models and shown to be comparable. The data and code used are available at https://github.com/aleesuss/c19
Comparative Analysis of SARS-CoV-2 Variants Across Three Waves in India
Kushagra Agarwal,Nita Parekh
International Conference on Bioinformatics and Data Science, ICBDS, 2022
@inproceedings{bib_Comp_2022, AUTHOR = {Kushagra Agarwal and Nita Parekh}, TITLE = {Comparative Analysis of SARS-CoV-2 Variants Across Three Waves in India}, BOOKTITLE = {International Conference on Bioinformatics and Data Science}, YEAR = {2022}}
In this study we carried out a comprehensive analysis of SARS-CoV-2 mutations and their spread in India over the past two years of the pandemic (27 Jan 2020 – 8 Mar 2022). The analysis covers four important timelines, viz., the early phase, followed by the first, second and third waves of the pandemic in the country. Phylogenetic analysis of the isolates indicated multiple independent entries of coronavirus in the country, while principal component analysis identified a few state-specific clusters. Genetic analysis of isolates during the first year revealed that though lockdown helped in controlling the spread of the virus, region-specific sets of shared mutations developed during the early phase due to local community transmissions. We thus report the evolution of state-specific subclades, namely, I/GJ-20A (Gujarat), I/MH-2 (Maharashtra), I/Tel-A-20B, I/Tel-B-20B (Telangana), and I/AP-20A (Andhra Pradesh) that explain the demographic variation in the impact of COVID-19 across states. In the second year of the pandemic, India faced an aggressive second wave while the third wave was quite mild in terms of severity. Here we also discuss the prevalence and impact of different lineages and Variants of Concern/Interest, viz., Delta, Kappa, Omicron, etc. observed during this period. From the genetic analysis of mutation spectra of Indian isolates, the insights gained in its transmission, geographic distribution, containment, and impact are discussed.
DbStRiPs: Database of structural repeats in proteins
BROTO CHAKRABARTY,Nita Parekh
Protein Science, PS, 2021
@inproceedings{bib_DbSt_2021, AUTHOR = {BROTO CHAKRABARTY and Nita Parekh}, TITLE = {DbStRiPs: Database of structural repeats in proteins}, BOOKTITLE = {Protein Science}, YEAR = {2021}}
Recent interest in repeat proteins has arisen due to stable structural folds, high evolutionary conservation and repertoire of functions provided by these proteins. However, repeat proteins are poorly characterized because of high sequence variation between repeating units, so structure-based identification and classification of repeats is desirable. Using a robust network-based pipeline, manual curation and Kajava's structure-based classification schema, we have developed a database of tandem structural repeats, Database of Structural Repeats in Proteins (DbStRiPs). A unique feature of this database is that available knowledge on sequence repeat families is incorporated by mapping the Pfam classification scheme onto structural classification. Integration of sequence and structure-based classifications helps in identifying different functional groups within the same structural subclass, leading to refinement in the annotation of repeat proteins. Analysis of the complete Protein Data Bank revealed 16,472 repeat annotations in 15,141 protein chains, one previously uncharacterized novel protein repeat family (PRF), named left-handed beta helix, and 33 protein repeat clusters (PRCs). Based on their unique structural motif, 79% of these repeat proteins are classified in one of the 14 PRFs or 33 PRCs, and the remaining are grouped as unclassified repeat proteins. Each repeat protein is provided with a detailed annotation in DbStRiPs that includes start and end boundaries of repeating units, copy number, secondary and tertiary structure view, repeat class/subclass, disease association, MSA of repeating units and cross-references to various protein pattern databases, human protein atlas and interaction resources. DbStRiPs provides easy search and download options to high-quality annotations of structural repeat proteins (URL: http://bioinf.iiit.ac.in/dbstrips/). Keywords: protein repeat database; protein repeats; structural repeat proteins; tandem repeat
Demographic Analysis of Mutations in Indian SARS-CoV-2 Isolates
Kushagra Agarwal,Nita Parekh
Joint conference on Intelligent Systems for Molecular Biology and European Conference on Computationa, ISMB/ECCB, 2021
@inproceedings{bib_Demo_2021, AUTHOR = {Kushagra Agarwal and Nita Parekh}, TITLE = {Demographic Analysis of Mutations in Indian SARS-CoV-2 Isolates}, BOOKTITLE = {Joint conference on Intelligent Systems for Molecular Biology and European Conference on Computationa}, YEAR = {2021}}
In this study we analyzed the early state-wise distribution of clades and subclades based on shared mutations in Indian SARS-CoV-2 isolates collected between 27 Jan and 27 May 2020. Phylogenetic analysis of these isolates indicates multiple independent sources of introduction of the virus in the country, while principal component analysis revealed some state-specific clusters. It is observed that clade 20A defining mutations C241T (ORF1ab: 5’ UTR), C3037T (ORF1ab: F924F), C14408T (ORF1ab: P4715L), and A23403G (S: D614G) are predominant in Indian isolates during this period. A higher number of coronavirus cases was observed in certain states, viz., Delhi, Tamil Nadu, and Telangana. Genetic analysis of isolates from these states revealed a cluster with shared mutations, C6312A (ORF1ab: T2016K), C13730T (ORF1ab: A4489V), C23929T, and C28311T (N: P13L). Analysis of region-specific shared mutations, carried out to understand the large number of deaths in Gujarat and Maharashtra, identified shared mutations defining subclade I/GJ-20A (C18877T, C22444T, G25563T (ORF3a: H57Q), C26735T, C28854T (N: S194L), C2836T) in Gujarat and two sets of co-occurring mutations, C313T, C5700A (ORF1ab: A1812D) and A29827T, G29830T, in Maharashtra. From the genetic analysis of mutation spectra of Indian isolates, the insights gained in its transmission, geographic distribution, containment, and impact are discussed.
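The notion of region-specific shared mutations used in this abstract can be sketched as a simple set computation. This is an illustrative assumption about how such sets might be derived, not the authors' pipeline: the function name `shared_mutations`, the `min_fraction` parameter and the four toy isolates are all hypothetical, and only the mutation labels are taken from the abstract.

```python
from collections import Counter, defaultdict


def shared_mutations(isolates, min_fraction=1.0):
    """Group isolates by state and return, per state, the mutations present
    in at least `min_fraction` of that state's isolates.

    `isolates` is a list of (state, set_of_mutations) pairs.
    """
    by_state = defaultdict(list)
    for state, mutations in isolates:
        by_state[state].append(set(mutations))

    shared = {}
    for state, mutation_sets in by_state.items():
        counts = Counter(m for ms in mutation_sets for m in ms)
        threshold = min_fraction * len(mutation_sets)
        shared[state] = {m for m, n in counts.items() if n >= threshold}
    return shared


# Invented toy isolates; real input would come from aligned genome sequences.
toy_isolates = [
    ("Gujarat", {"C241T", "C18877T", "G25563T"}),
    ("Gujarat", {"C241T", "C18877T", "G25563T", "C28854T"}),
    ("Maharashtra", {"C241T", "C313T", "C5700A"}),
    ("Maharashtra", {"C313T", "C5700A"}),
]
strict = shared_mutations(toy_isolates)                    # present in all isolates
relaxed = shared_mutations(toy_isolates, min_fraction=0.5)  # present in at least half
```

Lowering `min_fraction` trades specificity for tolerance to sequencing gaps; which threshold is appropriate for real data is a separate question that this sketch does not address.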
Identification of co-expression gene networks controlling rice blast disease during an incompatible reaction.
R Bevitori,SANCHARI SIRCAR, R.N. de Mello,R.C. Togawa,M.V.C.B. Côrtes,T.S. Oliveira,M.F. Grossi-de-Sá,Nita Parekh
Embrapa Arroz e Feijão-Artigo em periódico indexado, ALICE, 2020
@inproceedings{bib_Iden_2020, AUTHOR = {R Bevitori and SANCHARI SIRCAR and R.N. De Mello and R.C. Togawa and M.V.C.B. Côrtes and T.S. Oliveira and M.F. Grossi-de-Sá and Nita Parekh}, TITLE = {Identification of co-expression gene networks controlling rice blast disease during an incompatible reaction.}, BOOKTITLE = {Embrapa Arroz e Feijão-Artigo em periódico indexado}, YEAR = {2020}}
Rice blast disease is a major threat to rice production worldwide; the causative pathogenic fungus Magnaporthe oryzae induces rice (Oryza sativa) plants to undergo molecular changes that help them to circumvent this fungal attack. Transcriptome studies have demonstrated that many genes are involved in the defense response of rice to M. oryzae, but most of these studies focused on the screening of differentially expressed genes and the studies did not investigate the interactions among genes. We examined the interaction of rice and M. oryzae in a network context. Two near-isogenic lines were profiled at different time-points. Using transcriptome data obtained from an RNASeq analysis, a network based on the relationships among genes was developed through weighted gene co-expression network analysis. The analysis of degree centrality identified numerous hub genes and potential key regulators that control the rice response, providing new insights into the molecular network underlying the resistance of rice to M. oryzae infection. Additionally, a protein-protein interaction network was derived to identify complexes that might physically interact. For example, complexes of OsbHLH148/OsJAZ, OsMYB4 and some components of the phenylpropanoid pathway, as well as MYB/bHLH
Genome-wide characterization of copy number variations in diffuse large B-cell lymphoma with implications in targeted therapy
Prashanthi Dharanipragada,Nita Parekh
Precision Clinical Medicine, PCM, 2019
@inproceedings{bib_Geno_2019, AUTHOR = {Prashanthi Dharanipragada and Nita Parekh}, TITLE = {Genome-wide characterization of copy number variations in diffuse large B-cell lymphoma with implications in targeted therapy}, BOOKTITLE = {Precision Clinical Medicine}, YEAR = {2019}}
Diffuse large B-cell lymphoma (DLBCL) is an aggressive haematological malignancy, with relapsed/refractory disease in ~40% of cases. It mostly develops due to accumulation of various genetic and epigenetic variations that contribute to its aggressiveness. Though large-scale structural alterations have been reported in DLBCL, their functional role in pathogenesis and as potential targets for therapy is not yet well understood. In this study we performed detection and analysis of copy number variations (CNVs) in 11 human DLBCL cell lines (4 activated B-cell–like [ABC] and 7 germinal-centre B-cell–like [GCB]) that serve as model systems for DLBCL cancer cell biology. Significant heterogeneity observed in CNV profiles of these cell lines and poor prognosis associated with the ABC subtype indicate the importance of individualized screening for diagnostic and prognostic targets. Functional analysis of key cancer genes exhibiting copy alterations across the cell lines revealed activation/disruption of ten potentially targetable immuno-oncogenic pathways. Genome-guided in silico therapy that putatively targets these pathways is elucidated. Based on our analysis, five CNV-genes associated with the worst survival prognosis are proposed as potential prognostic markers of DLBCL.
Identifying Biomarkers for Diffuse Large B-Cell Lymphoma Subtypes
Prashanthi Dharanipragada,Nita Parekh
Molecular cancer, MC, 2019
@inproceedings{bib_Iden_2019, AUTHOR = {Prashanthi Dharanipragada and Nita Parekh}, TITLE = {Identifying Biomarkers for Diffuse Large B-Cell Lymphoma Subtypes}, BOOKTITLE = {Molecular cancer}, YEAR = {2019}}
Diffuse Large B-cell Lymphoma (DLBCL) is a highly heterogeneous cancer of B cells. Apart from cell-of-origin, genetic variations have been observed to contribute towards heterogeneity, leading to different pathogenic mechanisms and overall survival outcomes. Various classification schemes have been proposed that may aid in risk stratification and in developing new therapeutics for those who fail frontline therapy. This mini review highlights the significance of genetic variations as biomarkers for DLBCL and the ease of extending them to the clinical setting.
NAPS update: network analysis of molecular dynamics data and protein–nucleic acid complexes
BROTO CHAKRABARTY,VARUN NAGANATHAN,Kanak Garg,YASH AGARWAL,Nita Parekh
Nucleic Acids Research, NAR, 2019
@inproceedings{bib_NAPS_2019, AUTHOR = {BROTO CHAKRABARTY and VARUN NAGANATHAN and Kanak Garg and YASH AGARWAL and Nita Parekh}, TITLE = {NAPS update: network analysis of molecular dynamics data and protein–nucleic acid complexes}, BOOKTITLE = {Nucleic Acids Research}, YEAR = {2019}}
Network theory is now a method of choice to gain insights in understanding protein structure, folding and function. In combination with molecular dynamics (MD) simulations, it is an invaluable tool with widespread applications such as analyzing subtle conformational changes and flexible regions in proteins, dynamic correlation analysis across distant regions for allosteric communications, and drug design to reveal alternative binding pockets for drugs. The updated version of NAPS now facilitates network analysis of the complete repertoire of these biomolecules, i.e., proteins, protein–protein/nucleic acid complexes, MD trajectories, and RNA. Various options provided for analysis of MD trajectories include individual network construction and analysis of intermediate time-steps, comparative analysis of these networks, construction and analysis of the average network of the ensemble of trajectories, and dynamic cross-correlations. For protein–nucleic acid complexes, networks of the whole complex as well as that of the interface can be constructed and analyzed. For analysis of proteins, protein–protein complexes and MD trajectories, network construction based on inter-residue interaction energies with realistic edge-weights obtained from standard force fields is provided to capture the atomistic details. The updated version of NAPS also provides improved visualization features, interactive plots and bulk execution. URL: http://bioinf.iiit.ac.in/NAPS/
Meta-analysis of drought-tolerant genotypes in Oryza sativa: A network-based approach
SANCHARI SIRCAR,Nita Parekh
@inproceedings{bib_Meta_2019, AUTHOR = {SANCHARI SIRCAR, Nita Parekh}, TITLE = {Meta-analysis of drought-tolerant genotypes in Oryza sativa: A network-based approach}, BOOKTITLE = {Plos One}, YEAR = {2019}}
Background: Drought is a severe environmental stress. It is estimated that about 50% of the world rice production is affected mainly by drought. Apart from conventional breeding strategies to develop drought-tolerant crops, innovative computational approaches may provide insights into the underlying molecular mechanisms of stress response and identify drought-responsive markers. Here we propose a network-based computational approach involving a meta-analytic study of seven drought-tolerant rice genotypes under drought stress. Results: Co-expression networks enable large-scale analysis of gene-pair associations and tightly coupled clusters that may represent coordinated biological processes. Considering differentially expressed genes in the co-expressed modules and supplementing external information such as resistance/tolerance QTLs, transcription factors, network-based topological measures, we identify and prioritize drought-adaptive co-expressed gene modules and potential candidate genes. Using the candidate genes that are well-represented across the datasets as ‘seed’ genes, two drought-specific protein-protein interaction networks (PPINs) are constructed with up- and down-regulated genes. Cluster analysis of the up-regulated PPIN revealed ABA signalling pathway as a central process in drought response with a probable crosstalk with energy metabolic processes. Tightly coupled gene clusters representing up-regulation of core cellular respiratory processes and enhanced degradation of branched chain amino acids and cell wall metabolism are identified. Cluster analysis of down-regulated PPIN provides a snapshot of major processes associated with photosynthesis, growth, development and protein synthesis, most of which are shut down during drought. Differential regulation of phytohormones, e.g., jasmonic acid, cell wall metabolism, signalling and posttranslational modifications associated with biotic stress are elucidated.
Functional characterization of topologically important, drought-responsive uncharacterized genes that may play a role in important processes such as ABA signalling, calcium signalling, photosynthesis and cell wall metabolism is discussed. Further transgenic studies on these genes may help in elucidating their biological role under stress conditions.
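The notion of a co-expression network used in this abstract can be sketched as follows. This is an illustration under our own assumptions, not the pipeline used in the paper: genes become nodes, and an edge is drawn when the absolute Pearson correlation of two expression profiles reaches a hard threshold. The gene names, profiles and the 0.9 cutoff are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def coexpression_edges(expression, threshold=0.9):
    """Return (gene, gene, r) for every pair with |r| >= threshold."""
    genes = sorted(expression)
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(expression[g], expression[h])
            if abs(r) >= threshold:
                edges.append((g, h, r))
    return edges

# Hypothetical expression profiles across four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with geneA
    "geneC": [4.0, 1.0, 3.0, 2.0],   # weakly anti-correlated with geneA
}
edges = coexpression_edges(profiles)
```

Only the geneA/geneB pair survives the cutoff here. Modules would then be found by clustering the resulting graph, a step this sketch leaves out.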
iCopyDAV: Integrated platform for copy number variations—Detection, annotation and visualization
Prashanthi Dharanipragada,VOGETI SRI HARSHA,Nita Parekh
@inproceedings{bib_iCop_2018, AUTHOR = {Prashanthi Dharanipragada, VOGETI SRI HARSHA, Nita Parekh}, TITLE = {iCopyDAV: Integrated platform for copy number variations—Detection, annotation and visualization}, BOOKTITLE = {Plos One}, YEAR = {2018}}
Discovery of copy number variations (CNVs), a major category of structural variations, has dramatically changed our understanding of differences between individuals and provides an alternate paradigm for the genetic basis of human diseases. CNVs include both copy gain and copy loss events, and their genome-wide detection is now possible using high-throughput, low-cost next generation sequencing (NGS) methods. However, accurate detection of CNVs from NGS data is not straightforward due to non-uniform coverage of reads resulting from various systemic biases. We have developed an integrated platform, iCopyDAV, to handle some of these issues in CNV detection in whole genome NGS data. It has a modular framework comprising five major modules: data pre-treatment, segmentation, variant calling, annotation and visualization. An important feature of iCopyDAV is the functional annotation module that enables the user to identify and prioritize CNVs encompassing various functional elements, genomic features and disease-associations. Parallelization of the segmentation algorithms makes the iCopyDAV platform accessible even on a desktop. Here we show the effect of sequencing coverage, read length, bin size, data pre-treatment and segmentation approaches on accurate detection of the complete spectrum of CNVs. Performance of iCopyDAV is evaluated on both simulated data and real data for different sequencing depths. It is an open-source integrated pipeline available at https://github.com/vogetihrsh/icopydav and as a Docker image at http://bioinf.iiit.ac.in/icopydav/.
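To make the read-depth idea behind binning, segmentation and calling concrete, here is a deliberately simplified sketch. It is not iCopyDAV's algorithm: the function name, the log2-ratio thresholds and the merge-adjacent-bins rule are hypothetical stand-ins, and no GC or mappability correction is applied.

```python
import math
from statistics import median

def call_cnv_bins(read_starts, genome_length, bin_size, gain=0.4, loss=-0.6):
    """Return merged (start, end, state) segments from read start positions.

    Reads are counted in fixed-size bins, each bin is compared with the
    median bin count as a log2 ratio, and consecutive bins with the same
    non-neutral state are merged into one segment.
    """
    n_bins = math.ceil(genome_length / bin_size)
    counts = [0] * n_bins
    for pos in read_starts:
        counts[pos // bin_size] += 1
    baseline = median(counts)
    segments = []
    for b, c in enumerate(counts):
        # A pseudocount of 0.5 avoids log2(0) in empty bins.
        ratio = math.log2((c + 0.5) / (baseline + 0.5))
        state = "gain" if ratio >= gain else "loss" if ratio <= loss else "neutral"
        start, end = b * bin_size, min((b + 1) * bin_size, genome_length)
        if segments and segments[-1][2] == state:
            segments[-1] = (segments[-1][0], end, state)  # extend the current run
        else:
            segments.append((start, end, state))
    return [s for s in segments if s[2] != "neutral"]

# Hypothetical toy data: six 100-bp bins with 10, 10, 20, 2, 10, 10 reads.
reads = [b * 100 + i for b, n in enumerate([10, 10, 20, 2, 10, 10]) for i in range(n)]
calls = call_cnv_bins(reads, genome_length=600, bin_size=100)
```

Changing `bin_size` in this toy example already shows the trade-off the abstract refers to: larger bins smooth out noise but blur the boundaries of short events.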
SeqVItA: Sequence Variant Identification and Annotation Platform for Next Generation Sequencing Data
Prashanthi Dharanipragada,SEELAM SAMPREETH REDDY,Nita Parekh
Frontiers in genetics, FiG, 2018
@inproceedings{bib_SeqV_2018, AUTHOR = {Prashanthi Dharanipragada, SEELAM SAMPREETH REDDY, Nita Parekh}, TITLE = {SeqVItA: Sequence Variant Identification and Annotation Platform for Next Generation Sequencing Data}, BOOKTITLE = {Frontiers in genetics}, YEAR = {2018}}
NAPS: Network analysis of protein structures
BROTO CHAKRABARTY,Nita Parekh
Nucleic Acids Research, NAR, 2016
@inproceedings{bib_NAPS_2016, AUTHOR = {BROTO CHAKRABARTY, Nita Parekh}, TITLE = {NAPS: Network analysis of protein structures}, BOOKTITLE = {Nucleic Acids Research}, YEAR = {2016}}
Traditionally, protein structures have been analysed by the secondary structure architecture and fold arrangement. An alternative approach that has shown promise is modelling proteins as a network of non-covalent interactions between amino acid residues. The network representation of proteins provides a systems approach to topological analysis of complex three-dimensional structures irrespective of secondary structure and fold type, and provides insights into the structure-function relationship. We have developed a web server for network-based analysis of protein structures, NAPS, that facilitates quantitative and qualitative (visual) analysis of residue–residue interactions in single chains, protein complexes, modelled protein structures and trajectories (e.g. from molecular dynamics simulations). The user can specify atom type for network construction, distance range (in Å) and minimal amino acid separation along the
Copy number variation detection workflow using next generation sequencing data
Prashanthi Dharanipragada,Nita Parekh
International Conference on Bioinformatics and Systems Biology, BSB, 2016
@inproceedings{bib_Copy_2016, AUTHOR = {Prashanthi Dharanipragada, Nita Parekh}, TITLE = {Copy number variation detection workflow using next generation sequencing data}, BOOKTITLE = {International Conference on Bioinformatics and Systems Biology}, YEAR = {2016}}
In the last decade, the discovery of copy number variations (CNVs) has dramatically changed our understanding of differences between individuals. CNVs include both additional copies of sequence (duplications) and loss of genetic material (deletions) and provide an alternate paradigm for the genetic basis of human diseases. Genome-wide CNV detection is now possible using high-throughput, low-cost next generation sequencing (NGS) methods. The nature of NGS data demands various preprocessing and pretreatment steps before any meaningful information can be extracted. Among the plethora of variant calling methods available, R-based methods offer a flexible environment, facilitating the choice of methods depending on the type of data or the type of analysis to be performed. Here we give a pipeline for the various steps involved in CNV detection in NGS data using R-based algorithms and packages.
Robustness of Chimera States in Non-locally Coupled Logistic Maps
BELLAMKONDA PRANEETHA,Nita Parekh
The Chaotic Modeling and Simulation Journal, CMSIM, 2015
@inproceedings{bib_Robu_2015, AUTHOR = {BELLAMKONDA PRANEETHA, Nita Parekh}, TITLE = {Robustness of Chimera States in Non-locally Coupled Logistic Maps}, BOOKTITLE = {The Chaotic Modeling and Simulation Journal}, YEAR = {2015}}
In the last decade there has been considerable interest in a novel dynamical phenomenon of chimera states observed in an array of non-locally coupled oscillators where regions of coherence and incoherence coexist across the network. In this study we show how chimera states emerge in coupled logistic maps for certain specified initial conditions when the range and strength of coupling are varied. Here we show that these states are very robust and persist even in the presence of noise in the network parameters. On applying localized external perturbation to the incoherent regions, it is possible to obtain completely coherent/incoherent dynamics in the whole network depending on the strength and sign of perturbation. This has important applications in the control of undesirable local dynamics, such as seizures in neural systems, or fibrillations in cardiac
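The kind of system described in this abstract can be sketched with a ring of non-locally coupled logistic maps. The update rule below is the form commonly used in the chimera-state literature and is an assumption here, as are the parameter values; the abstract itself does not give the equations. Each site is updated as z_i(t+1) = f(z_i) + sigma/(2P) * sum over j within distance P of [f(z_j) - f(z_i)], with f(z) = a z (1 - z).

```python
import math

def logistic(z, a=3.8):
    """Logistic map f(z) = a z (1 - z); a = 3.8 is an assumed chaotic value."""
    return a * z * (1.0 - z)

def step(state, sigma, radius, a=3.8):
    """One synchronous update of a ring of non-locally coupled logistic maps.

    sigma is the coupling strength and radius (P) the coupling range;
    2 * radius must be smaller than len(state).
    """
    n = len(state)
    f = [logistic(z, a) for z in state]
    new_state = []
    for i in range(n):
        # Differences to the `radius` neighbours on each side (k = 0 adds zero).
        coupling = sum(f[(i + k) % n] - f[i] for k in range(-radius, radius + 1))
        new_state.append(f[i] + sigma / (2 * radius) * coupling)
    return new_state

def simulate(initial, sigma, radius, steps, a=3.8):
    state = list(initial)
    for _ in range(steps):
        state = step(state, sigma, radius, a)
    return state

# Hypothetical smooth initial profile on a ring of 50 sites.
n_sites = 50
initial = [0.5 + 0.4 * math.sin(2 * math.pi * i / n_sites) for i in range(n_sites)]
final = simulate(initial, sigma=0.3, radius=8, steps=100)
```

Scanning `sigma` and `radius` and plotting `final` against the site index is the usual way to see whether neighbouring sites stay on a smooth profile (coherent) or scatter (incoherent).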
Functional characterization of drought-responsive modules and genes in Oryza sativa: a network-based approach
SANCHARI SIRCAR,Nita Parekh
Frontiers in genetics, FiG, 2015
@inproceedings{bib_Func_2015, AUTHOR = {SANCHARI SIRCAR, Nita Parekh}, TITLE = {Functional characterization of drought-responsive modules and genes in Oryza sativa: a network-based approach}, BOOKTITLE = {Frontiers in genetics}, YEAR = {2015}}
Drought is one of the major environmental stress conditions affecting the yield of rice across the globe. Unraveling the functional roles of the drought-responsive genes and their underlying molecular mechanisms will provide important leads to improve the yield of rice. Co-expression relationships derived from condition-dependent gene expression data are an effective way to identify the functional associations between genes that are part of the same biological process and may be under similar transcriptional control. For this purpose, the vast amount of freely available transcriptomic data can be used for functional annotation. In this study we consider gene expression data for different tissues and developmental stages in response to drought stress. We analyze the network of co-expressed genes to identify drought-responsive gene modules in a tissue- and stage-specific manner based on differential expression and gene enrichment analysis. Taking cues from the systems-level behavior of these modules, we propose two approaches to identify clusters of tightly co-expressed/co-regulated genes. Using graph-centrality measures and differential gene expression, we identify biologically informative genes that lack any functional annotation. We show that using orthologous information from other plant species, the conserved co-expression patterns of the uncharacterized genes can be identified. Presence of a conserved neighborhood enables us to extrapolate functional annotation. Alternatively, we show that the ‘guide-gene’ approach can help in understanding the tissue-specific transcriptional regulation of uncharacterized genes. Finally, we confirm the …
Identifying tandem Ankyrin repeats in protein structures
BROTO CHAKRABARTY,Nita Parekh
BMC Bioinformatics, BIO INFO, 2014
@inproceedings{bib_Iden_2014, AUTHOR = {BROTO CHAKRABARTY, Nita Parekh}, TITLE = {Identifying tandem Ankyrin repeats in protein structures}, BOOKTITLE = {BMC Bioinformatics}, YEAR = {2014}}
Tandem repetition of structural motifs in proteins is frequently observed across all forms of life. The topology of the repeating unit and its frequency of occurrence are associated with a wide range of structural and functional roles in diverse proteins, and defects in repeat proteins have been associated with a number of diseases. It is thus desirable to accurately identify the specific repeat type and its copy number. Weak evolutionary constraints on repeat units and insertions/deletions between them make their identification difficult at the sequence level, and structure-based approaches are desired. The proposed graph spectral approach is based on protein structure represented as a graph for detecting one of the most frequently observed structural repeats, the Ankyrin repeat.
PRIGSA: Protein repeat identification by graph spectral analysis
BROTO CHAKRABARTY,Nita Parekh
Journal of Bioinformatics and Computational Biology, JBCB, 2014
@inproceedings{bib_PRIG_2014, AUTHOR = {BROTO CHAKRABARTY, Nita Parekh}, TITLE = {PRIGSA: Protein repeat identification by graph spectral analysis}, BOOKTITLE = {Journal of Bioinformatics and Computational Biology}, YEAR = {2014}}
Repetition of a structural motif within a protein is associated with a wide range of structural and functional roles. In most cases the repeating units are well conserved at the structural level while at the sequence level they are mostly undetectable, suggesting the need for structure-based methods. Since most known methods require a training dataset, a de novo approach is desirable. Here, we propose an efficient graph-based approach for detecting structural repeats in proteins. In a protein structure represented as a graph, interactions between inter- and intra-repeat units are well captured by the eigen spectra of the adjacency matrix of the graph. These conserved interactions give rise to similar connections and a unique profile of the principal eigen spectra for each repeating unit. The efficacy of the approach is shown on eight repeat families annotated in UniProt, comprising both solenoid and nonsolenoid repeats with …
Graph centrality analysis of structural ankyrin repeats
BROTO CHAKRABARTY,Nita Parekh
International Journal of Computer Information Systems and Industrial Management Applications, IJCISIM, 2014
@inproceedings{bib_Grap_2014, AUTHOR = {BROTO CHAKRABARTY, Nita Parekh}, TITLE = {Graph centrality analysis of structural ankyrin repeats}, BOOKTITLE = {International Journal of Computer Information Systems and Industrial Management Applications}, YEAR = {2014}}
In recent studies it has been shown that graph representation of protein structures is capable of capturing the 3-dimensional fold of the protein very well, thus providing a computationally efficient approach for protein structure analysis. Centrality measures are generally used to identify the relative importance of a node in the network. Here we demonstrate a novel application of centrality analysis: to identify tandemly repeated structural motifs in 3-d protein structures. This is done by analyzing the profile of various centrality measures in the repeat region. The comparative analysis of five centrality measures based on local connectivity, shortest paths, principal eigen spectra and feedback centrality is presented on proteins containing contiguous ankyrin structural motifs to identify which centrality measure best captures the repetitive pattern of ankyrin. We observe that principal eigen spectra of the adjacency matrix and Katz status index, both exhibit a distinct profile for the ankyrin motif capturing its characteristic anti-parallel helix-turn-helix fold. No such conserved pattern was observed in the repeat regions of equivalent random networks, suggesting that the conserved pattern arises from the 3d fold of the structural motif.
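Two of the measures named in this abstract, the principal eigenvector of the adjacency matrix and the Katz status index, can be sketched on a residue contact graph as follows. This is an illustration, not the authors' code: the 7 Å cutoff on representative-atom coordinates, the function names and the toy coordinates are assumptions. The eigenvector is obtained by power iteration on A + I (same eigenvectors as A, but no oscillation on bipartite graphs), and the Katz status by iterating x = alpha * A (x + 1), which converges to the sum over k >= 1 of alpha^k A^k 1 when alpha is below the reciprocal of the largest eigenvalue.

```python
import math

def contact_graph(coords, cutoff=7.0):
    """Adjacency matrix with an edge between residues closer than `cutoff`."""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i][j] = adj[j][i] = 1
    return adj

def principal_eigenvector(adj, iterations=500):
    """Unit-norm eigenvector of the largest eigenvalue, by power iteration on A + I."""
    n = len(adj)
    v = [1.0] * n
    for _ in range(iterations):
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def katz_status(adj, alpha, iterations=500):
    """Katz status index; alpha must be smaller than 1 / (largest eigenvalue)."""
    n = len(adj)
    x = [0.0] * n
    for _ in range(iterations):
        x = [alpha * sum(adj[i][j] * (x[j] + 1.0) for j in range(n)) for i in range(n)]
    return x

# Hypothetical three-residue chain: only consecutive residues are in contact.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
toy_adj = contact_graph(toy_coords)
eig = principal_eigenvector(toy_adj)
katz = katz_status(toy_adj, alpha=0.1)
```

Plotting either vector against residue number gives the per-residue profile in which a repeated pattern would be looked for.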
Relaxed neighbor based graph transformations for effective preprocessing: A function prediction case study
SATHEESH KUMAR DWADASI,Krishna Reddy Polepalli,Nita Parekh
International Conference on Big Data Analytics, BDA, 2014
@inproceedings{bib_Rela_2014, AUTHOR = {SATHEESH KUMAR DWADASI, Krishna Reddy Polepalli, Nita Parekh}, TITLE = {Relaxed neighbor based graph transformations for effective preprocessing: A function prediction case study}, BOOKTITLE = {International Conference on Big Data Analytics}, YEAR = {2014}}
Protein-protein interaction (PPI) networks are a valuable biological source of data containing rich information useful for protein function prediction. PPI networks face data quality challenges such as noise in the form of false positive edges and incompleteness in the form of missing biologically valued edges. These issues can be handled by enhancing data quality through graph transformations for improved protein function prediction. We propose an improved method to extract similar proteins based on the notion of relaxed neighborhood. The proposed method can be applied to carry out graph transformation of PPI network data sets to improve the performance of the protein function prediction task by adding biologically important protein interactions, removing dissimilar interactions and increasing the reliability score of the interactions. Experimental results on both un-weighted and weighted PPI network data sets preprocessed with the proposed methodology show that it enhances the data quality and improves prediction accuracy over other approaches. The results indicate that the proposed approach could utilize underutilized knowledge, such as distant relationships embedded in the PPI graph.
Graph based Identification of Structural Repeats in Proteins
BROTO CHAKRABARTY,Nita Parekh
International Conference in Bioinformatics, INCOB, 2013
@inproceedings{bib_Grap_2013, AUTHOR = {BROTO CHAKRABARTY, Nita Parekh}, TITLE = {Graph based Identification of Structural Repeats in Proteins}, BOOKTITLE = {International Conference in Bioinformatics}, YEAR = {2013}}
BACKGROUND: Repetition of super secondary structure is a common phenomenon, especially in higher eukaryotic organisms. The copy number and assembly of these repeating units are responsible for diverse protein-protein interactions, and consequently, defects in repeat proteins are linked to many diseases. The variation within the repeat units makes their identification difficult at the sequence level, and structure-based approaches are desired. Since most structure-based methods employ computationally intensive structure-structure alignment, we propose a computationally efficient structure-based approach for the identification of repeats using concepts from graph theory. RESULTS: The three-dimensional topology of protein structures is known to be well captured by protein contact graphs. The connectivity information in a graph is represented in the adjacency matrix, and the eigenspectra of the adjacency matrix depict the topological importance of each node to the connectivity of the graph. In our earlier work, we observed that the principal eigenspectra of the adjacency matrix capture the tandemly repeated structural motifs well. Here we propose an algorithm for the identification of tandemly repeated structural motifs using graph properties and secondary structure information from the STRIDE database. The algorithm begins by identifying the length of the repeat motif by analyzing the periodicity of peaks in eigenvector centrality. The repeat boundaries are then identified by superposing the contiguous repeats, extending on either side of the peak regions to the start/end of the secondary structure elements, and checking for periodicity of the secondary structure architecture in the identified repeat regions. Thus, using the secondary structure annotation helps to refine the boundaries of the repeat regions and to discard false positives.
We have tested the algorithm for identifying various structural repeats such as HEAT, WD, Ankyrin (ANK), Tetratricopeptide repeat (TPR), Leucine rich repeat (LRR), etc. with different super secondary structure motifs, ranging from all alpha, to all beta, to a mixed topology such as alpha-turn-alpha, beta-alpha, etc. The predictions are in agreement with the annotation in the UniProt database. CONCLUSIONS: The graph-based analysis of protein structures, along with domain information such as the organization of the secondary structure elements, provides a computationally efficient approach for the identification of structural repeats.
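The first step of the algorithm described above, estimating the repeat length from the periodicity of peaks in a per-residue centrality profile, might look roughly like this. The sketch is ours and simplifies the published method: it assumes one dominant peak per repeat unit (a `min_height` filter can suppress minor peaks), takes the most frequent spacing between consecutive peaks as the repeat length, and omits the boundary refinement with secondary structure. The function names and the synthetic profile are hypothetical.

```python
from collections import Counter

def find_peaks(profile, min_height=0.0):
    """Indices of strict local maxima with value at or above min_height."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i] > profile[i - 1]
            and profile[i] > profile[i + 1]
            and profile[i] >= min_height]

def estimate_repeat_length(profile, min_height=0.0):
    """Most frequent spacing between consecutive peaks, or None if fewer than two peaks."""
    peaks = find_peaks(profile, min_height)
    spacings = [b - a for a, b in zip(peaks, peaks[1:])]
    if not spacings:
        return None
    return Counter(spacings).most_common(1)[0][0]

# Synthetic centrality profile with a pattern that repeats every 5 residues.
unit = [0.1, 0.3, 0.9, 0.3, 0.1]
profile = [unit[i % 5] for i in range(20)]
peaks = find_peaks(profile)
repeat_length = estimate_repeat_length(profile)
```

On real profiles the spacing distribution is noisier, so a tolerance around the modal spacing would be needed before the repeats are superposed and their boundaries refined.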
Graph Centrality Analysis of Structural Ankyrin Repeats
BROTO CHAKRABARTY,Nita Parekh
International Journal of Computer Information Systems and Industrial Management Applications, IJCISIM, 2013
@inproceedings{bib_Grap_2013, AUTHOR = {BROTO CHAKRABARTY, Nita Parekh}, TITLE = {Graph Centrality Analysis of Structural Ankyrin Repeats}, BOOKTITLE = {International Journal of Computer Information Systems and Industrial Management Applications}, YEAR = {2013}}
In recent studies it has been shown that graph representation of protein structures is capable of capturing the 3-dimensional fold of the protein very well, thus providing a computationally efficient approach for protein structure analysis. Centrality measures are generally used to identify the relative importance of a node in the network. Here we demonstrate a novel application of centrality analysis: to identify tandemly repeated structural motifs in 3-d protein structures. This is done by analyzing the profile of various centrality measures in the repeat region. The comparative analysis of five centrality measures based on local connectivity, shortest paths, principal eigen spectra and feedback centrality is presented on proteins containing contiguous ankyrin structural motifs to identify which centrality measure best captures the repetitive pattern of ankyrin. We observe that principal eigen spectra of the adjacency matrix and Katz status index, both exhibit a distinct profile for the ankyrin motif capturing its characteristic anti-parallel helix-turn-helix fold. No such conserved pattern was observed in the repeat regions of equivalent random networks, suggesting that the conserved pattern arises from the 3d fold of the structural motif.
Identifying Tandem Structural Repeat Motifs in Protein by Graph Spectral Analysis
BROTO CHAKRABARTY,Nita Parekh
International Conference on Biomolecular Forms & Functions, ICBFF, 2013
@inproceedings{bib_Iden_2013, AUTHOR = {BROTO CHAKRABARTY, Nita Parekh}, TITLE = {Identifying Tandem Structural Repeat Motifs in Protein by Graph Spectral Analysis}, BOOKTITLE = {International Conference on Biomolecular Forms \& Functions}, YEAR = {2013}}
BACKGROUND: Ankyrin repeat is one of the most frequently observed structural motifs in proteins across all kingdoms of life. These proteins are involved in a diverse set of cellular functions and act as transcriptional initiators, cell-cycle regulators, cytoskeletal, ion transporters and signal transducers, and consequently, defects in ankyrin repeat proteins have been found in a number of human diseases. Identification of these structural repeats at the sequence level is difficult due to low conservation between the repeat copies. Thus, analysis at the structure level is desirable. RESULTS: In this study, we propose a graph-based approach for the identification and analysis of ankyrin repeats. The 3-dimensional topology of protein structures has been shown to be well captured by protein contact graphs. The connectivity information of these networks is represented in the adjacency matrix, and here we propose the analysis of the eigen spectra of the adjacency matrix for the identification of structural repeats. A clear two-peak pattern corresponding to the helix-turn-helix region of the Ankyrin motif is observed in the principal eigenvector of the adjacency matrix. The length distribution of this repetitive pattern, along with the organization of the secondary structure elements, is used to design an algorithm to identify the Ankyrin structural motifs. The analysis has been carried out on a non-redundant set of 51 proteins annotated in the UniProt database and a very good agreement is observed. Analysis of all the proteins in the alpha class and alpha+beta class in the SCOP database has been performed, and a number of novel repeats not annotated in the database have been identified. This approach is then applied to other structural repeats such as Tetratricopeptide repeat (TPR), Annexin, HEAT, ARM, etc.
CONCLUSIONS: The graph-based analysis of protein structures, along with domain information such as the organization of the secondary structure architecture, provides a computationally efficient approach for the identification of structural repeats.
Identification of Genomic Islands by Pattern Discovery
Nita Parekh
BMC Microbiology, BMC-M, 2012
@inproceedings{bib_Iden_2012, AUTHOR = {Nita Parekh}, TITLE = {Identification of Genomic Islands by Pattern Discovery}, BOOKTITLE = {BMC Microbiology}, YEAR = {2012}}
Pattern discovery is at the heart of bioinformatics, and algorithms from computer science have been widely used for identifying biological patterns. The assumption behind pattern discovery approaches is that a pattern that occurs often enough in biological sequences/structures or is conserved across organisms is expected to play a role in defining the respective sequence’s or structure’s functional behavior and/or evolutionary relationships. The pattern recognition problem addressed here is at the genomic level and involves identifying horizontally transferred regions, called genomic islands. A horizontally transferred event is defined as the movement of genetic material between phylogenetically unrelated organisms by mechanisms other than parent to progeny inheritance. Increasing evidence suggests the importance of horizontal transfer events in the evolution of bacteria, influencing traits such as antibiotic …
Identification of Ankyrin Repeats in Three-Dimensional Protein Structures
BROTO CHAKRABARTY,Nita Parekh
Accelerating Biology, AB, 2012
@inproceedings{bib_Iden_2012, AUTHOR = {BROTO CHAKRABARTY, Nita Parekh}, TITLE = {Identification of Ankyrin Repeats in Three-Dimensional Protein Structures}, BOOKTITLE = {Accelerating Biology}, YEAR = {2012}}
Internal repetition within proteins is a commonly observed phenomenon and presents multiple binding and structural roles to proteins. Ankyrin repeat, one of the most widely existing protein motifs in nature, forms a helix-turn-helix motif and exclusively functions to mediate protein-protein interactions, some of which are directly involved in the development of human cancer and other diseases. It has been observed that the ankyrin repeat motif is defined by its fold rather than by its function, and as there is no specific sequence underlying the fold, its identification at the structural level is desirable. We propose a graph-based approach for the identification and analysis of these repeats. The topological details of protein structures have been shown to be effectively captured by protein graphs. In this study we analyzed about twelve graph measures for the identification of repeated structural motifs. Of these, degree, Katz status, Page rank, closeness vitality and eigenvectors corresponding to the principal eigenvalue of the adjacency matrix are found to be promising. We observe that on considering the secondary structure information along with the analysis of various graph measures, it is possible to accurately identify the boundaries of individual repeats. This approach is computationally very efficient compared to structure-structure self-comparison. Design and engineering of repeat proteins may help to elucidate their structural and biophysical properties, such as the dependence of stability and folding on the number of repeats and the importance of key intra- and inter-repeat interactions, and to identify novel binding molecules suitable for biotechnological or medical applications.
Analysis of graph centrality measures for identifying Ankyrin repeats
BROTO CHAKRABARTY,Nita Parekh
World Congress on Information and Communication Technologies, WICT, 2012
@inproceedings{bib_Anal_2012, AUTHOR = {BROTO CHAKRABARTY, Nita Parekh}, TITLE = {Analysis of graph centrality measures for identifying Ankyrin repeats}, BOOKTITLE = {World Congress on Information and Communication Technologies}, YEAR = {2012}}
Internal repetition within proteins is a commonly observed phenomenon as these provide multiple binding sites for protein-protein interactions by forming integrated multi-repeat assemblies in three-dimension. However, at the sequence level the similarity between internal copies of the repeats is generally low, and identification at the structural level is desired. In recent studies it has been shown that graph representation of protein structures is capable of capturing the 3-dimensional fold of the protein very well, thus providing a computationally efficient approach for protein structure analysis. Here we exploit this feature of protein structure graphs and carry out a comparative analysis of eight graph centrality measures for the identification of tandem structural repeats. We observe that principal eigen spectra of the adjacency matrix and Katz status index are able to capture the repetitive pattern of the structural units. The spectral analysis being computationally efficient, in this paper we present the analysis of the principal eigen spectra of the adjacency matrix of proteins containing Ankyrin (ANK) repeats, the most commonly occurring structural motifs in proteins.
Identification of Genomic Islands by Pattern Discovery
Nita Parekh
Pattern Discovery Using Sequence Data Mining: Applications and Studies, PDUSDM, 2011
@inproceedings{bib_Iden_2011, AUTHOR = {Nita Parekh}, TITLE = {Identification of Genomic Islands by Pattern Discovery}, BOOKTITLE = {Pattern Discovery Using Sequence Data Mining: Applications and Studies}, YEAR = {2011}}
Pattern discovery is at the heart of bioinformatics, and algorithms from computer science have been widely used for identifying biological patterns. The assumption behind pattern discovery approaches is that a pattern that occurs often enough in biological sequences/structures or is conserved across organisms is expected to play a role in defining the respective sequence’s or structure’s functional behavior and/or evolutionary relationships. The pattern recognition problem addressed here is at the genomic level and involves identifying horizontally transferred regions, called genomic islands. A horizontal transfer event is defined as the movement of genetic material between phylogenetically unrelated organisms by mechanisms other than parent to progeny inheritance. Increasing evidence suggests the importance of horizontal transfer events in the evolution of bacteria, influencing traits such as antibiotic resistance, symbiosis and fitness, virulence, and adaptation in general. In the genomic era, with the availability of a large number of bacterial genomes, the identification of genomic islands also forms the first step in the annotation of newly sequenced genomes and in identifying the differences between virulent and non-virulent strains of a species. Considerable effort is being made in their identification and analysis, and this chapter gives a brief summary of the various approaches used in the identification and validation of horizontally acquired regions.
IGIPT - Integrated genomic island prediction tool
Ruchi Jain,Sandeep Ramineni,Nita Parekh
Bioinformation, BII, 2011
@inproceedings{bib_IGIP_2011, AUTHOR = {Ruchi Jain, Sandeep Ramineni, Nita Parekh}, TITLE = {IGIPT - Integrated genomic island prediction tool}, BOOKTITLE = {Bioinformation}. YEAR = {2011}}
IGIPT is a web-based integrated platform for the identification of genomic islands (GIs). It incorporates thirteen parametric measures based on anomalous nucleotide composition on a single platform, thus improving the predictive power for a horizontally acquired region, since it is known that no single measure can absolutely predict a horizontally transferred region. The tool filters putative GIs based on standard deviation from the genomic average and also provides raw output in MS Excel format for further analysis. To facilitate the identification of various structural features, viz., tRNA integration sites, repeats, etc. in the vicinity of GIs, the tool provides an option to extract the predicted regions and their flanking regions.
Construction and Analysis of Enzyme Centric Network of A. thaliana using Graph Theory
KASTHURIBAI VISWANATHAN,Nita Parekh
International Workshop on Soft Computing Applications and Knowledge Discovery, SKAD, 2011
@inproceedings{bib_Cons_2011, AUTHOR = {KASTHURIBAI VISWANATHAN, Nita Parekh}, TITLE = {Construction and Analysis of Enzyme Centric Network of A. thaliana using Graph Theory}, BOOKTITLE = {International Workshop on Soft Computing Applications and Knowledge Discovery}. YEAR = {2011}}
Graph comparisons, quantitative characterizations, computation of topological indices, clustering and partitioning are some of the major graph computations that have yielded valuable results in various disciplines. Motivated by the potential benefits of applying graph theory to biological data, we discuss the reconstruction and analysis of the enzyme centric network of Arabidopsis thaliana using graph theory concepts. We had earlier constructed the metabolite network of Arabidopsis thaliana and observed the scale-free and small-world nature of the network. Compared to metabolites, the enzymes are more conserved in and across many pathways. So the aim of constructing the enzyme centric network is to see if it follows network properties similar to those of the metabolite network and to look for additional details that a metabolite network cannot reveal. The enzyme flat file from KEGG FTP is used as the data set for the reconstruction of the enzyme centric network. We examined the network to find the relationship between topological connections among enzymes and their functions during evolution. The enzyme sequences of high degree and high betweenness enzymes belonged to ancient fold class and ancestry value, showing evidence that they evolved very slowly.
Construction and Analysis of Metabolite Network from Arabidopsis thaliana pathways
KASTHURIBAI VISWANATHAN,Nita Parekh
International Conference on Bioinformatics and Computational Biology, BIOCOMP, 2011
@inproceedings{bib_Cons_2011, AUTHOR = {KASTHURIBAI VISWANATHAN, Nita Parekh}, TITLE = {Construction and Analysis of Metabolite Network from Arabidopsis thaliana pathways}, BOOKTITLE = {International Conference on Bioinformatics and Computational Biology}. YEAR = {2011}}
The recent large-scale advances in science and technology have resulted in the accumulation of a large amount of biological pathway data. Any metabolic pathway contains a large number of enzymes, metabolites and reactions. To make sense of the diverse data available on a system, one needs to correlate and analyze them as a whole. Motivated by the potential benefits of graph theory and its application to biological data, we discuss the automated reconstruction and analysis of the metabolite network of Arabidopsis thaliana using concepts of graph theory. The A. thaliana metabolite network was reconstructed, and analysis of the global properties of its metabolite-centric graph shows that the network is small-world and scale-free in nature. The investigation of nodes with high centrality values, such as high degree and high betweenness, in this network helps in identifying important metabolites, reactions, etc. Newman’s modularity-based approach has been used in the analysis of the metabolite network of A. thaliana to identify pathway clusters, isolated pathways, and orphan metabolites or products. Our analysis of network representations helps in understanding the relationship between the metabolites, enzymes and reactions of metabolic pathways in A. thaliana.
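The small-world property mentioned in this abstract is usually judged from two graph statistics: the average clustering coefficient and the average shortest path length. The following is a minimal sketch of both, not the authors' pipeline; the toy adjacency dictionary and its node labels are hypothetical stand-ins for a reconstructed metabolite network, and the graph is assumed to be undirected and unweighted.

```python
from collections import deque

def clustering_coefficient(adj):
    """Average local clustering coefficient of an undirected graph.

    adj maps each node to the set of its neighbours. Nodes with fewer
    than two neighbours contribute 0 to the average.
    """
    total = 0.0
    for node, neighbours in adj.items():
        nbrs = list(neighbours)
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges that exist among the neighbours of this node.
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbrs[j] in adj[nbrs[i]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total = 0
    pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Hypothetical toy network: a triangle (m0, m1, m2) with one pendant node m3.
toy = {
    "m0": {"m1", "m2"},
    "m1": {"m0", "m2"},
    "m2": {"m0", "m1", "m3"},
    "m3": {"m2"},
}
print(clustering_coefficient(toy), average_path_length(toy))
```

A network is commonly called small-world when its clustering is much higher than that of a random graph of the same size and density while its average path length stays comparably short; that comparison is left out of this sketch.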
Analysis of Centrality Measures of Airport Network of India
SAPRE MANASI SUDHIR,Nita Parekh
Lecture Notes on Computer Science, LNCS, 2011
@inproceedings{bib_Anal_2011, AUTHOR = {SAPRE MANASI SUDHIR, Nita Parekh}, TITLE = {Analysis of Centrality Measures of Airport Network of India}, BOOKTITLE = {Lecture Notes on Computer Science}. YEAR = {2011}}
In this paper we analyze the topological properties of the airport network of India (ANI) using a graph theoretic approach. We show that such an analysis can be useful not only in planning the infrastructure and growth of air-traffic connectivity, but also in managing the flow of transportation during emergencies such as accidental failure of an airport, closure of an airport due to unexpected climate changes, terrorist attacks, etc. Knowledge of the connectivity pattern and load on various routes can also help in making judicious decisions on the reduction of flights to contain the spread of an infectious disease.
Analysis of Airport Network of India
SAPRE MANASI SUDHIR,Nita Parekh
Grace Hopper Conference in Computer Science, GHCCS, 2010
@inproceedings{bib_Anal_2010, AUTHOR = {SAPRE MANASI SUDHIR, Nita Parekh}, TITLE = {Analysis of Airport Network of India}, BOOKTITLE = {Grace Hopper Conference in Computer Science}. YEAR = {2010}}
In this study we analyze the topology and structure of the Airport Network of India (ANI) using a graph theoretic approach. We investigate the transmission flow in the network using graph centrality measures. We observe that though ANI exhibits scale-free behaviour, it differs from the Barabasi-Albert scale-free model, and we suggest that a scale-free model that results in a high clustering coefficient is most suitable to model ANI. The analysis is not only useful in planning the infrastructure and expansion of air-traffic connectivity, but also in managing the flow of transportation during emergencies such as accidental failure of an airport, closure of an airport due to unexpected climate changes, terrorist attacks, etc. It has been observed that densely connected air-transportation networks play a major role in the spread of infectious diseases, viz., Avian influenza, Swine flu, etc., turning epidemics into pandemics. Knowledge of the connectivity pattern for reduction of flights on certain routes can help to reduce the spread of the disease.
Controlling Dynamical Networks
RAINA ARORA,Nita Parekh
@inproceedings{bib_Cont_2010, AUTHOR = {RAINA ARORA, Nita Parekh}, TITLE = {Controlling Dynamical Networks}, BOOKTITLE = {Chaos}. YEAR = {2010}}
Two classes of networks that have been extensively studied in the analysis of physical systems are: (i) regular networks, wherein each node interacts with a specified number of neighboring nodes on geometrical lattices, and (ii) random networks, wherein every pair of nodes has a fixed probability of interacting with each other. Recently much attention has been focused on a class of network models that are neither strictly regular nor completely random, but are somewhere in-between and exhibit properties of both, called small-world networks. These networks are shown to exhibit high clustering (i.e., nodes sharing a common neighbor have a higher probability of being connected to each other than to other nodes) and a low average path length (the average of the shortest distance/path between every pair of nodes in the network). Examples of small-world networks are found to occur widely across the biological (e.g., neural connection patterns), social (e.g., friendship network, co-authorship) and technological (e.g., the world wide web) domains. It has also been observed that a large number of real complex systems exhibit power-law behaviour in their degree distribution, i.e., a few nodes have a very high degree. Such networks are referred to as scale-free networks. Thus, non-standard topologies with long-range connections (i.e., non-local diffusion) and uneven degree distributions are not uncommon in real-life systems and may provide different kinds of spatiotemporal dynamics depending on the extent of non-local diffusion. Here we discuss the characterization and control of spatiotemporal dynamics on four different network topologies, viz., (i) regular, (ii) random, (iii) small-world, and (iv) scale-free, by external perturbation or pinning of a few nodes in the network.
This would provide us insight into the role of the network topology (the underlying connectivity structure) in the dynamics of the systems defined on such networks and the efficacy with which the dynamics can be controlled or manipulated. An extensively studied example of a nonlinear system exhibiting a wide variety of complex dynamics ranging from simple periodic behavior to chaos is the logistic map. Here we define coupled logistic maps on the four different topologies and systematically investigate the control by external perturbation/pinning for two chaotic regimes: (i) r = 3.6 (weak chaos), and (ii) r = 3.9 (strong chaos). Our preliminary results show that on pinning nodes at regularly spaced intervals, 2nd, 4th, etc., the pinning strength required on a small-world network is similar to that in the case of regular networks. For 25% of the nodes pinned, the dynamics on the scale-free topology is controllable but on random networks, it is not. However, on pinning nodes having high centrality measures, viz., degree, betweenness and closeness, we observe that control of the spatiotemporal dynamics is achieved by pinning only 10% of high degree/betweenness nodes in the weakly chaotic regime on both small-world and scale-free networks. Complete control of the network is not observed on pinning nodes with high closeness values. For strongly chaotic dynamics, control is achievable only on scale-free networks on pinning 20% of nodes having either high degree, betweenness, or closeness values. Key Words: Dynamics, Control, Topologies, Coupled Logistic Maps, Chaos, Regular, Small-world, Random, Scale-free networks, Centrality.
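To make the setup concrete, the following is a minimal sketch of coupled logistic maps on a regular ring lattice with a subset of pinned nodes. It is an illustration under stated assumptions, not the authors' code: the diffusive coupling form, the coupling strength `eps`, the additive constant pinning term, and the names `ring_neighbours` and `step` are all assumptions, since the abstract does not give the update rule.

```python
def logistic(x, r):
    """Logistic map f(x) = r * x * (1 - x)."""
    return r * x * (1.0 - x)

def ring_neighbours(n, k=1):
    """Regular lattice: each node is linked to its k nearest neighbours on each side."""
    return {i: [(i + d) % n for d in range(-k, k + 1) if d != 0] for i in range(n)}

def step(state, neighbours, r, eps, pinned=(), strength=0.0):
    """One synchronous update of the coupled map network.

    Assumed update rule (diffusive coupling):
        x_i(t+1) = (1 - eps) * f(x_i) + eps * mean_j f(x_j)  [+ strength if i is pinned]
    where j runs over the neighbours of node i.
    """
    fx = [logistic(x, r) for x in state]
    new_state = []
    for i in range(len(state)):
        nbrs = neighbours[i]
        coupling = sum(fx[j] for j in nbrs) / len(nbrs)
        new_state.append((1.0 - eps) * fx[i] + eps * coupling)
    for i in pinned:
        new_state[i] += strength  # assumed constant additive pinning
    return new_state

# Weak-chaos regime from the abstract (r = 3.6), pinning every 4th node (25%).
n = 20
neighbours = ring_neighbours(n)
state = [0.1 + 0.04 * i for i in range(n)]
pinned_nodes = list(range(0, n, 4))
for _ in range(200):
    state = step(state, neighbours, r=3.6, eps=0.3,
                 pinned=pinned_nodes, strength=-0.05)
print([round(x, 3) for x in state[:5]])
```

Replacing `ring_neighbours` with a small-world, random, or scale-free neighbour dictionary, and choosing `pinned_nodes` by degree or betweenness rather than by regular spacing, would give the other configurations discussed above; whether control is achieved depends on the pinning strength and is not asserted by this sketch.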
Comparative Analysis of Metabolic Networks
K M SHUBHI GUPTA,Nita Parekh
International Conference On Frontiers of Interface between Statistics and Sciences, FISS, 2010
@inproceedings{bib_Comp_2010, AUTHOR = {K M SHUBHI GUPTA, Nita Parekh}, TITLE = {Comparative Analysis of Metabolic Networks}, BOOKTITLE = {International Conference On Frontiers of Interface between Statistics and Sciences}. YEAR = {2010}}
In the present study, we have constructed and analysed the metabolic networks of virulent and non-virulent strains of E. coli. Concepts from graph theory are used to systematically determine how differences in the basic metabolism of different strains of E. coli are reflected at the systems level. Analysis of metabolic networks can help us understand the evolutionary history of cellular metabolic processes, and comparing the networks can also help identify potential drug targets.
Graph Spectral Approach for Identifying Protein Domains
HARI KRISHNA YALAMANCHILI,Nita Parekh
International Conference on Bioinformatics & Computational Biology, BICOB, 2009
@inproceedings{bib_Grap_2009, AUTHOR = {HARI KRISHNA YALAMANCHILI, Nita Parekh}, TITLE = {Graph Spectral Approach for Identifying Protein Domains}, BOOKTITLE = {International Conference on Bioinformatics & Computational Biology}. YEAR = {2009}}
Here we present a simple method based on graph spectral properties to automatically partition multi-domain proteins into individual domains. The identification of structural domains in proteins is based on the assumption that the interactions between the amino acids are higher within a domain than across the domains. These interactions and the topological details of protein structures can be effectively captured by the protein contact graph, constructed by considering each amino acid as a node with an edge drawn between two nodes if the Cα atoms of the amino acids are within 7 Å. Here we show that Newman’s community detection approach in social networks can be used to identify domains in protein structures. We have implemented this approach on protein contact networks and analyze the eigenvectors of the largest eigenvalue of the modularity matrix, which is a modified form of the adjacency matrix, using a quality function called “modularity” to identify optimal divisions of the network into domains. The proposed approach works even when the domains are formed with amino acids not occurring sequentially along the polypeptide chain and no a priori information regarding the number of nodes is required.
Dynamical Networks
RAINA ARORA,Nita Parekh
Symposium on Theoretical and Mathematical Biology, TMB, 2009
@inproceedings{bib_Dyna_2009, AUTHOR = {RAINA ARORA, Nita Parekh}, TITLE = {Dynamical Networks}, BOOKTITLE = {Symposium on Theoretical and Mathematical Biology}. YEAR = {2009}}
In most real-world networks the connectivity topology between individual entities is not governed only by regular, nearest-neighbour interactions but also includes long-range interactions, e.g., small-world topology of interacting neurons, scale-free behaviour of gene regulatory networks, etc. In this study we have analyzed the synchronization and control of the dynamics in such networks having non-regular topology, viz., small-world, random and scale-free topologies, compared to that in regular networks modeled on coupled map lattices. For the analysis, the logistic map, which exhibits rich dynamical behaviour, is modeled on each node, and connectivity between the nodes is defined based on the Watts-Strogatz model for small-world and random networks and the Barabasi-Albert model for the scale-free network. The synchronization and control of the complex dynamics are analyzed for varying coupling strengths and connectivity delays in the four network topologies. We observe that the synchronization of the network depends on the connection topology and also on connection delays, with poor synchronization observed in regular and small-world networks compared to random and scale-free topologies. The connection delays between constituent units are capable of sustaining complex dynamical behaviour not seen in the undelayed system.
Controlling Dynamics in Different Topological Networks
RAINA ARORA,Nita Parekh
International Conference On Frontiers of Interface between Statistics and Sciences, FISS, 2009
@inproceedings{bib_Cont_2009, AUTHOR = {RAINA ARORA, Nita Parekh}, TITLE = {Controlling Dynamics in Different Topological Networks}, BOOKTITLE = {International Conference On Frontiers of Interface between Statistics and Sciences}. YEAR = {2009}}
The synchronization and control of dynamical systems on a regular lattice have been extensively studied. However, it has been observed that the underlying topology of a real system need not always be regular, e.g., the brain has been observed to have a connection structure with small-world properties. Thus, non-standard topologies with long-range connections (i.e., non-local diffusion) are not uncommon in real-life systems and may provide different kinds of spatiotemporal dynamics depending on the extent of non-local diffusion. This led us to investigate the spatiotemporal dynamics of coupled logistic maps on four different topological networks, viz., regular, small-world, random and scale-free. In particular, we are interested in the synchronization and control of the spatiotemporal dynamics. It has been shown earlier that the transmission delays between constituent units are capable of sustaining complex dynamical behavior, a phenomenon not observed in the undelayed systems. In fact, the synchronization of the dynamics is observed for a much larger parameter space when the underlying topology is scale-free or random. In our earlier work on the control of spatiotemporal dynamics, we have shown the effect of externally applied perturbation or 'pinning' on a regular network. By varying the magnitude and sign of the pinning strength we were able to target the system to any desired dynamical state. Here we investigate the effect of pinning the dynamics on various topologies. Using graph centrality measures, viz., degree, betweenness and closeness, to select the nodes for pinning, our preliminary results suggest that high betweenness nodes perform better in controlling the spatiotemporal dynamics, with as few as 10% of nodes required for pinning in the case of a system exhibiting weakly chaotic dynamics. This study has important practical applications in the control of dynamical diseases such as epileptic seizures and bursts on a small-world brain topology.
Graph Theoretic Approach for Domain Identification
HARI KRISHNA YALAMANCHILI,Nita Parekh
International Conference on Bioinformatics & Computational Biology, BICOB, 2009
@inproceedings{bib_Grap_2009, AUTHOR = {HARI KRISHNA YALAMANCHILI, Nita Parekh}, TITLE = {Graph Theoretic Approach for Domain Identification}, BOOKTITLE = {International Conference on Bioinformatics & Computational Biology}. YEAR = {2009}}
Here we present a simple method based on graph spectral properties to automatically partition multi-domain proteins into individual domains. The identification of structural domains in proteins is based on the assumption that the interactions between the amino acids are higher within a domain than across the domains. These interactions and the topological details of protein structures can be effectively captured by the protein contact graph, constructed by considering each amino acid as a node with an edge drawn between two nodes if the Cα atoms of the amino acids are within 7Å. Here we show that Newman’s community detection approach in social networks can be used to identify domains in protein structures. We have implemented this approach on protein contact networks and analyze the eigenvectors of the largest eigenvalue of modularity matrix, which is a modified form of the Adjacency matrix, using a quality function called “modularity” to identify optimal divisions of the network into domains. The proposed approach works even when the domains are formed with amino acids not occurring sequentially along the polypeptide chain and no a priori information regarding the number of nodes is required.
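The two ingredients described above, a contact graph built from a 7 Å Cα cutoff and a split along the leading eigenvector of the modularity matrix, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it performs a single bisection, uses a shifted power iteration in place of a full eigensolver, and the function names `contact_matrix` and `modularity_split` as well as the toy graph are hypothetical.

```python
import math

def contact_matrix(ca_coords, cutoff=7.0):
    """Adjacency matrix of a contact graph: residues i != j are linked
    if their C-alpha coordinates lie within `cutoff` angstroms."""
    n = len(ca_coords)
    return [
        [1 if i != j and math.dist(ca_coords[i], ca_coords[j]) <= cutoff else 0
         for j in range(n)]
        for i in range(n)
    ]

def modularity_split(adj, iters=500):
    """Bisect a graph by the signs of the leading eigenvector of the
    modularity matrix B_ij = A_ij - k_i * k_j / (2m).

    Assumes an undirected graph with at least one edge. A fuller
    implementation would also check that the split increases modularity
    and would bisect the resulting groups recursively.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    B = [[adj[i][j] - degree[i] * degree[j] / two_m for j in range(n)]
         for i in range(n)]
    # Shift by a Gershgorin bound so every eigenvalue of B + shift*I is
    # non-negative; power iteration then converges to the eigenvector of
    # the largest (most positive) eigenvalue of B.
    shift = max(sum(abs(x) for x in row) for row in B)
    v = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(n)) + shift * v[i]
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    group_a = [i for i in range(n) if v[i] >= 0]
    group_b = [i for i in range(n) if v[i] < 0]
    return group_a, group_b

# Hypothetical toy "protein": two triangles of residues joined by one contact.
toy = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(modularity_split(toy))
```

Because the split depends only on the contact pattern, residues that are far apart in sequence but close in space can fall into the same group, which is the property the abstract highlights.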
Identifying Structural Repeats in Proteins using Graph Centrality Measures
HARI KRISHNA YALAMANCHILI,RUCHI JAIN,Nita Parekh
World Congress on Nature & Biologically Inspired Computing, NaBIC, 2009
@inproceedings{bib_Iden_2009, AUTHOR = {HARI KRISHNA YALAMANCHILI, RUCHI JAIN, Nita Parekh}, TITLE = {Identifying Structural Repeats in Proteins using Graph Centrality Measures}, BOOKTITLE = {World Congress on Nature & Biologically Inspired Computing}. YEAR = {2009}}
Here we apply the graph-theoretic concept of Betweenness centrality to a class of protein repeats, e.g., Armadillo (ARM) and HEAT. The Betweenness of a node represents how often a node is traversed on the shortest path between all pairs of nodes i, j in the network and thus gives the contribution of each node in the network. These repeats are not easily detectable at the sequence level because of low conservation between independent repeated units, e.g., HEAT repeats are known to have less than 13% identity. Their identification at the structure level typically involves self structure-structure comparison, which can be computationally very intensive. Our analysis of a set of proteins from ARM and HEAT repeat family shows that the repeat regions exhibit similar connectivity patterns for the repeating units. Since it is generally accepted that in many networks, the larger the degree of a node, the larger the chance that many of the shortest paths will pass through this node, computing vertex Betweenness provides a simple and elegant approach for identifying tandem structural repeats in proteins.
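As a rough illustration of the measure used here, the sketch below computes vertex betweenness on an unweighted, undirected graph with Brandes' algorithm. It is not taken from the paper: the adjacency dictionary is a hypothetical toy graph, and in practice the input would be a protein contact graph whose per-residue betweenness profile is inspected for a periodic pattern.

```python
from collections import deque

def betweenness(adj):
    """Vertex betweenness for an unweighted, undirected graph (Brandes).

    adj maps each node to an iterable of its neighbours. Returns, for each
    node, the sum over unordered node pairs of the fraction of shortest
    paths between the pair that pass through the node.
    """
    score = {v: 0.0 for v in adj}
    for source in adj:
        order = []                       # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}     # predecessors on shortest paths
        sigma = {v: 0 for v in adj}      # number of shortest paths from source
        dist = {v: -1 for v in adj}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: s / 2.0 for v, s in score.items()}

# Hypothetical toy graph: a chain of five nodes.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(betweenness(chain))
```

On the chain, the middle node lies on the most shortest paths and therefore scores highest, while the end nodes score zero.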
Graph Spectral Analysis of Protein Repeat Families
Nita Parekh
Recent Developments and Applications of Probability Theory, Random Process and Random Variables in C, DAPTRPRVC, 2008
@inproceedings{bib_Grap_2008, AUTHOR = {Nita Parekh}, TITLE = {Graph Spectral Analysis of Protein Repeat Families}, BOOKTITLE = {Recent Developments and Applications of Probability Theory, Random Process and Random Variables in C}. YEAR = {2008}}
Recently much attention has been focused on the analysis of the three-dimensional structure of proteins using graph and network concepts. The simplest protein network graph is constructed by defining the amino acids as nodes and drawing edges between amino acids (within a threshold), which attempts to capture not only covalent but also non-covalent interactions. Analysis of the topological details of proteins with known structures, such as clustering of specific types of amino acids important for structure, folding and function, is of great value and is an active field of research. Since structures of a large number of proteins are now available, automatic methods are required to analyze them, and recently tools from graph theory are being explored for such analysis. Here, we propose to use graph properties and their spectral analysis for identifying patterns in protein structures, in particular, structural repeats or motifs. These repeats are not easily detectable at the sequence level because of low conservation or high divergence between independent repeated units. The importance of such repeats in understanding biological function resides not only in their high frequency among known sequences, but also in their ability to confer multiple binding and structural roles on proteins, e.g., the zinc finger domain, a constituent of transcription factors involved in DNA binding, where the composition and copy number of individual tandem repeats confer selectivity and activity of DNA binding. This functional versatility is apparent not only among different repeat types, but also for similar repeats from the same family. Our understanding of repeats, with respect to their structures, functions, and evolution, therefore represents a considerable challenge. Our analysis of the degree distribution and spectral analysis of a set of proteins from various repeat families shows that the repeat regions exhibit similar connectivity patterns in the graph.
The network property, Betweenness, was found to identify accurately the repeat boundaries in protein families. Thus, this approach can help in identifying repeated motifs in proteins which are difficult to identify at the sequence level.
Protein Tandem Repeat Database (PTRDB)
P KRISHNA MANJARI,RIMA KUMARI,Nita Parekh
International Conference on Bioinformatics and Drug Discovery, Bioconvene, BDDB, 2008
@inproceedings{bib_Prot_2008, AUTHOR = {P KRISHNA MANJARI, RIMA KUMARI, Nita Parekh}, TITLE = {Protein Tandem Repeat Database (PTRDB)}, BOOKTITLE = {International Conference on Bioinformatics and Drug Discovery, Bioconvene}. YEAR = {2008}}
About 14% of the protein sequences in the SwissProt database contain a repetitive region, viz., tandem repeats, multiple copies of motifs/profiles, or multiple copies of a domain. Eukaryotic proteins are more likely to have repeats than those of Bacteria and Archaea. Our main focus is only on tandem repeats. A tandem repeat can be defined as a contiguous repeat pattern of two or more copies. These copies can be exact or approximate. Many proteins with these repeats are involved in functions like transcription, translation, and protein-protein interaction. Proteins with tandem repeats are involved in various neurodegenerative diseases like Huntington's disease. These proteins are also found to occur in sequences which are poorly conserved in evolution. We have developed a database of tandem repeats in protein sequences: Protein Tandem Repeats DataBase (PTRDB). The data for this database is extracted using our in-house tool PEPPER, a tool for identifying PEPtide PEriodic Repeats. This database is built on SwissProt (ver 51.7). PTRDB has 3145 proteins with 4713 tandem repeats. 77.74% are found in Eukaryota, 16.63% in Bacteria and 0.67% in Archaea. About 5% of proteins in this database are associated with disease. We have classified the database organism-wise. This database can be queried by various attributes like SwissProt ID, organism name, PDBID, repeat pattern, repeat length, keyword and copy number. It gives detailed information on tandem repeats, such as SwissProt ID (hyperlinked to SwissProt to give the FASTA sequence), organism name, PDBID (hyperlinked to PDB), one-line description of the protein, gene name, protein name, taxonomy ID (hyperlinked to EBI), family information (hyperlinked to Pfam), description of associated disease, OMIM ID, scoring matrix with gap extension for the repeat, repeat pattern with alignment score, copy number, repeat length, start point, end point, alignment of repeat pattern with repeat region, and secondary structure information of the repeat region.
With the increasing abundance of tandem protein repeats in various proteomes this database will be increasingly important in proteome comparison.
PEPPER - A Tool for Identifying Peptide Periodic Repeats
KASTURI KRISHNA KIRAN,BATHULA RADHIKA,Nita Parekh
International Conference in Bioinformatics, INCOB, 2008
@inproceedings{bib_PEPP_2008, AUTHOR = {KASTURI KRISHNA KIRAN, BATHULA RADHIKA, Nita Parekh}, TITLE = {PEPPER - A Tool for Identifying Peptide Periodic Repeats}, BOOKTITLE = {International Conference in Bioinformatics}. YEAR = {2008}}
Identifying tandem repeats in the proteome of any organism is important not only for understanding the structure and function of the proteins but also for analyzing the association of abnormal expansion of repeat regions with disorders. We have developed an efficient tool for identifying Peptide Periodic Repeats (PEPPER) in protein sequences. The tool has three in-built modules. In the first module, initial screening of the protein sequence by a uni-dimensional vector equivalent to the dot-matrix is used to identify all possible repeat regions. In the second module, a sliding window analysis of the putative repeat region using a K-tuple identifies the period and the possible pattern of the repeat. In the third module, the wraparound dynamic programming (WDP) algorithm is implemented to obtain the exact repeat boundaries. If the protein sequence does not contain a tandem repeat region, the second and third steps of the algorithm are skipped, making it computationally efficient for database search compared to other existing algorithms. We report up to three possible multiple-periodicities in a given repeat region. The tool reports the consensus repeat pattern, the complete repeat region, and the score and alignment of the consensus with the repeat region, along with percentage mismatch/indels. A preliminary comparison of our results with the TRIPS database of protein tandem repeats showed that our program not only identifies all the repeats in the TRIPS database, but also allowed us to re-annotate a few incorrectly annotated repeats and find a few new repeats not reported in the TRIPS database. PEPPER provides the user the facility to change parameter values for identifying small or approximate tandem repeats. A database of repeats can be built by the user for their own set of protein sequences on the fly with PEPPER.
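The screening idea in the first module, comparing a sequence with a shifted copy of itself, can be illustrated with a much simpler exact-match detector. The sketch below is not PEPPER: it handles only exact tandem repeats, does no K-tuple analysis or wraparound dynamic programming, and the function names, parameters and example sequences are hypothetical.

```python
def is_primitive(unit):
    """True if `unit` is not itself a repetition of a shorter unit."""
    return unit not in (unit + unit)[1:-1]

def find_tandem_repeats(seq, max_period=10, min_copies=2):
    """Find exact tandem repeats by comparing seq with itself shifted by `period`.

    A run of `run` consecutive positions with seq[i] == seq[i + period]
    means the stretch starting at i holds run // period + 1 complete copies
    of a unit of length `period`.
    Returns a list of (start, period, copies, unit) tuples.
    """
    hits = []
    n = len(seq)
    for period in range(1, max_period + 1):
        i = 0
        while i + period < n:
            run = 0
            while i + run + period < n and seq[i + run] == seq[i + run + period]:
                run += 1
            copies = run // period + 1
            unit = seq[i:i + period]
            if copies >= min_copies and is_primitive(unit):
                hits.append((i, period, copies, unit))
            i += run + 1  # later starts inside this run cannot give a longer run
    return hits

# Hypothetical example: three copies of "PAG" embedded in a short peptide.
print(find_tandem_repeats("MKPAGPAGPAGTT", max_period=6, min_copies=3))
```

Approximate repeats with mismatches or indels, which the tool described above reports, would need an alignment step on top of this kind of screening.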
Identifying Genomic Islands in Prokaryotic Genomes (GIPro)
RUCHI JAIN,Nita Parekh
International Conference on Bioinformatics and Drug Discovery, Bioconvene, BDDB, 2008
@inproceedings{bib_Iden_2008, AUTHOR = {RUCHI JAIN, Nita Parekh}, TITLE = {Identifying Genomic Islands in Prokaryotic Genomes (GIPro)}, BOOKTITLE = {International Conference on Bioinformatics and Drug Discovery, Bioconvene}. YEAR = {2008}}
Prokaryotic genomes undergo various mutational events during the course of genome evolution through a variety of processes, including Horizontal Gene Transfer. Horizontally acquired regions of about 10 - 100 Kb in length are called Genomic Islands; the presence of such regions offers a selective advantage to the organism and may aid in the organism's growth, survival (saprophytic or symbiotic mode) or display of pathogenicity. Earlier analysis of horizontally transferred regions has shown significant variations from the average genomic values in their GC content, codon preferences and dinucleotide bias. We have developed a web-based tool for identifying genomic islands in prokaryotic genomes (GIPro) based on various measures, viz., GC content (total and at individual codon positions), dinucleotide biases, and biases in codon and amino acid usage, carried out in a sliding window analysis along the length of the genome. The regions that deviate by n standard deviations (n is user defined, default value 2) from the average genome values for at least two measures are identified as probable genomic islands by the tool. GIPro can also be used for comparison of user-defined regions / gene sets to confirm whether both belong to the same genome or have differences in their dinucleotide or codon usage biases as a result of horizontal transfer.
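The sliding-window filter described above can be sketched for a single measure. The example below is a simplified illustration, not the GIPro implementation: it uses total GC content only (the tool requires deviation in at least two measures), takes the mean and standard deviation over the window values as the genomic reference, and the function names, window size and synthetic sequence are assumptions.

```python
from statistics import mean, pstdev

def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def flag_windows(genome, window, step, n_sd=2.0):
    """Return (start, end, gc) for windows whose GC content deviates from
    the mean over all windows by more than n_sd standard deviations."""
    starts = list(range(0, len(genome) - window + 1, step))
    values = [gc_content(genome[s:s + window]) for s in starts]
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return []  # perfectly uniform composition: nothing to flag
    return [(s, s + window, v)
            for s, v in zip(starts, values)
            if abs(v - mu) > n_sd * sd]

# Synthetic sequence: AT-only background with one GC-only window inserted.
genome = "ATATATATAT" * 9 + "GCGCGCGCGC" + "ATATATATAT" * 10
print(flag_windows(genome, window=10, step=10))
```

Extending the sketch to the approach in the abstract would mean computing several measures per window (codon-position GC, dinucleotide bias, codon usage) and keeping only the windows flagged by at least two of them.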
Gene Identification in silico
Nita Parekh
Bioinformatics and Functional Genomics, BFG, 2006
@inproceedings{bib_Gene_2006, AUTHOR = {Nita Parekh}, TITLE = {Gene Identification in silico}, BOOKTITLE = {Bioinformatics and Functional Genomics}. YEAR = {2006}}
The ultimate goal of molecular cell biology is to understand the physiology of living cells in terms of the information that is encoded in the genome of the cell, and we would like to address the question of how computational biology can help in achieving this goal. Identifying genes coding for proteins is a very important pattern recognition problem in bioinformatics, and this lecture focuses on some computational issues in gene identification.
Controllability of spatiotemporal systems using constant pinnings
Nita Parekh,Somdatta Sinha
Statistical Mechanics and its Applications, Physica-A, 2003
@inproceedings{bib_Cont_2003, AUTHOR = {Nita Parekh, Somdatta Sinha}, TITLE = {Controllability of spatiotemporal systems using constant pinnings}, BOOKTITLE = {Statistical Mechanics and its Applications}. YEAR = {2003}}
Most natural spatiotemporal systems are an organized ensemble of dynamical subsystems whose behaviour is regulated by nonlinearly coupled multi-variable processes. The response of each of these variables and/or combinations of them to any single or composite external perturbation can influence its spatiotemporal behaviour in a complex and non-intuitive manner. This paper attempts to study the dynamical response of two coupled map lattice systems having local dynamics described by (a) coupled discrete maps, and (b) coupled differential equations, to external perturbation (or “pinning”) applied to each variable individually or simultaneously. We show that the response of different variables to external pinning is quite different. Our results indicate that, though this pinning approach is useful in controlling complex dynamics both globally and locally, enhancing complexity in dynamics (“anti-control”) is …