Abstract
The digital age presents an overwhelming deluge of multimodal data, underscoring the need for effective text summarization techniques. Such techniques transform vast amounts of textual and visual data into concise, comprehensible, and insightful summaries, facilitating information retrieval, comprehension, and decision-making. This thesis develops novel strategies to enhance text summarization through various forms of contextual guidance and multimodal data, contributing to the evolution of the field through a cohesive narrative that links these diverse yet interconnected areas of study.
The journey begins with an exploration of "Popularity Forecasting" of sentences within news articles. This approach goes beyond traditional salience-based extractive summarization by predicting the "popularity", or "eye-catching" potential, of sentences. We construct a popularity dataset of news articles from CNN/DM [47], each annotated with a mapping from sentences to popularity scores, obtained by comparing each sentence against the search queries associated with the article. We then adapt trained extractive summarizers to perform regression and predict the popularity of each sentence within a news article, yielding a ranking of sentences by their popularity scores.
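A minimal sketch of this adaptation, assuming a BERT-style sentence encoder with a single regression head fit against the dataset's popularity scores (the checkpoint name, loss choice, and example sentences are illustrative, not the thesis configuration):

```python
# Sketch: adapt a pretrained sentence encoder to regress per-sentence
# popularity scores, then rank sentences by predicted score.
# Assumption: bert-base-uncased as the encoder; head trained with MSE loss.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SentencePopularityRegressor(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] embedding per sentence
        return self.head(cls).squeeze(-1)   # one scalar popularity score each

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = SentencePopularityRegressor()

sentences = ["The storm made landfall overnight.",
             "Officials released a routine statement."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(batch["input_ids"], batch["attention_mask"])
ranked = sorted(zip(scores.tolist(), sentences), reverse=True)
```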
Next, the research advances into the realm of "Multimodal Summarization," which synergizes textual and visual elements to create a more holistic summary. By pairing concise textual summaries with the most salient images from news articles, this technique delivers a richer and more comprehensive understanding of the content. We also show that images can improve the accuracy of the summarization process itself: we employ visuolinguistic transformers such as CLIP [54] and OSCAR [36] to model the interaction between the two modalities, and we adapt general-purpose summarization models to incorporate both textual and visual information.
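As an illustration of the visual side, a minimal sketch of using CLIP to pick the article image most aligned with a textual summary (the checkpoint name, summary text, and file paths are illustrative):

```python
# Sketch: rank candidate article images by CLIP similarity to the summary text.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

summary = "Flooding displaces thousands after record rainfall."
images = [Image.open(p) for p in ["img0.jpg", "img1.jpg", "img2.jpg"]]  # placeholders

inputs = processor(text=[summary], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity for each (image, text) pair;
# the highest-scoring image is the one paired with the summary.
best = outputs.logits_per_image.squeeze(-1).argmax().item()
print(f"most salient image: index {best}")
```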
Building on the foundation of extractive summarization, and reusing the core logic of the multimodal summarization work, the study then introduces "Guided Summarization." This method uses sentence salience scores, obtained from an extractive summarizer, to guide an abstractive summarizer. The symbiotic relationship between the two forms of summarization yields more contextually relevant and focused abstractive summaries.
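One simple way to realize such guidance, sketched below under the assumption that salience scores are already available and that guidance is injected by prepending the top-scoring sentences to a BART summarizer's input (the separator scheme and generation hyperparameters are illustrative, not the thesis architecture):

```python
# Sketch: guide an abstractive summarizer (BART) with the sentences an
# extractive model scored as most salient, by prepending them to the input.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
abstractive = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

def guided_summarize(article_sents, salience_scores, top_k=3):
    # Keep the k sentences the extractive summarizer deems most salient.
    ranked = sorted(zip(salience_scores, article_sents), reverse=True)
    guidance = " ".join(sent for _, sent in ranked[:top_k])
    # A separator keeps the guidance distinct from the article body.
    text = guidance + " </s> " + " ".join(article_sents)
    ids = tokenizer(text, truncation=True, return_tensors="pt").input_ids
    out = abstractive.generate(ids, num_beams=4, max_length=142)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```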
The research further pushes the boundaries of personalization with "Persona-based Summarization," applied to SEBI legal case files. This technique generates tailored summaries based on the specific information needs of different personas, such as investors, defense lawyers, and judges. It underscores the potential of personalization in text summarization, making the information more accessible and relevant to each user profile.
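A minimal sketch of one way to condition generation on a persona, assuming each persona's information needs are expressed as a natural-language prefix to a generic abstractive summarizer (the persona descriptions and model are illustrative assumptions, not the thesis setup):

```python
# Sketch: persona-conditioned summarization via a persona-specific prefix.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Hypothetical persona descriptions for SEBI case files.
PERSONAS = {
    "investor": "Focus on monetary penalties, market impact, and investor protection.",
    "defense_lawyer": "Focus on the charges, cited regulations, and the defense's arguments.",
    "judge": "Focus on precedents, findings of fact, and the final order.",
}

def persona_summary(case_text, persona):
    prompt = PERSONAS[persona] + " " + case_text
    return summarizer(prompt, max_length=130, min_length=30)[0]["summary_text"]
```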
Finally, building on the insights gleaned from the exploration of multimodal summarization, the study culminates in the creation of an "Indic Multimodal Text-Image Pair Dataset." This unique resource is a rich collection of text-image pairs across multiple Indian languages, serving as a critical foundation for the development and evaluation of visuolinguistic transformers, especially those focused on data from the Indian subcontinent.
In summary, this thesis provides a comprehensive exploration of how contextual guidance and multimodal data can significantly enhance text summarization. The techniques and resources proposed and developed in this research, connected through a cohesive narrative, promise to advance the field of text summarization, paving the way for more engaging, comprehensive, and personalized summary generation.