IIITH

Do Multilingual Transformers Encode Paninian Grammatical Relations? A Layer-wise Probing Study

Workshop on Quantitative Syntax, QUASY-W, 2025

@inproceedings{bib_Do_M_2025, AUTHOR = {Kumar, Akshit and Sharma, Dipti Mishra and Krishnamurthy, Parameswari}, TITLE = {Do Multilingual Transformers Encode Paninian Grammatical Relations? A Layer-wise Probing Study}, BOOKTITLE = {Workshop on Quantitative Syntax}, YEAR = {2025}}

Abstract

Large multilingual transformers such as XLM-RoBERTa achieve impressive performance on diverse NLP benchmarks, but understanding how they internally encode grammatical information remains challenging. This study investigates the encoding of syntactic and morphological information derived from the Paninian grammatical framework—specifically designed for morphologically rich Indian languages—across model layers. Using diagnostic probing, we analyze the hidden representations of frozen XLM-RoBERTa-base, mBERT, and IndicBERT models across seven Indian languages (Hindi, Kannada, Malayalam, Marathi, Telugu, Urdu, Bengali). Probes are trained to predict Paninian dependency relations (by edge probing) and essential morphosyntactic features (UPOS tags, Vibhakti markers). We find that syntactic structure (dependencies) is primarily encoded in the middle-to-upper-middle layers (layers 6–9), while lexical features peak slightly earlier. Although the general layer-wise trends are shared across models, significant variations in absolute probing performance reflect differences in model capacity, pre-training data, and language-specific characteristics. These findings shed light on how theory-specific grammatical information emerges implicitly within multilingual transformer representations trained largely on unstructured raw text.

Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment

International Joint Conference on Natural Language Processing, IJCNLP, 2025

Core Rank : B Google Rank :32

@inproceedings{bib_Reth_2025, AUTHOR = {Vemula, Saketh Reddy and Dandapat, Sandipan and Sharma, Dipti Mishra and Krishnamurthy, Parameswari}, TITLE = {Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment}, BOOKTITLE = {International Joint Conference on Natural Language Processing}, YEAR = {2025}}

Abstract

The relationship between tokenizer algorithm (e.g., Byte-Pair Encoding (BPE), Unigram), morphological alignment, tokenization quality (e.g., compression efficiency), and downstream performance remains largely unclear, particularly for languages with complex morphology. In this paper, we conduct a comprehensive evaluation of tokenizers using small-sized BERT models -- from pre-training through fine-tuning -- for Telugu (agglutinative), along with preliminary evaluation in Hindi (primarily fusional with some agglutination) and English (fusional). To evaluate morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal two key findings for Telugu. First, the choice of tokenizer algorithm is the most significant factor influencing performance, with Unigram-based tokenizers consistently outperforming BPE across most settings. Second, while better morphological alignment shows a moderate, positive correlation with performance on text classification and structure prediction tasks, its impact is secondary to the tokenizer algorithm. Notably, hybrid approaches that use morphological information for pre-segmentation significantly boost the performance of BPE, though not Unigram. Our results further showcase the need for comprehensive intrinsic evaluation metrics for tokenizers that could explain downstream performance trends consistently.
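One common way to quantify how well a tokenizer's splits line up with gold morpheme segmentations is boundary precision, recall, and F1. The sketch below is illustrative only: the abstract does not state which alignment metric the paper uses, and the function names and the toy English word are assumptions, not items from the Telugu dataset.

```python
def boundaries(segments):
    """Return the set of internal split offsets for a segmented word."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_f1(token_segments, gold_segments):
    """Precision, recall, and F1 of tokenizer splits against gold morpheme splits."""
    if "".join(token_segments) != "".join(gold_segments):
        raise ValueError("both segmentations must spell the same word")
    pred, gold = boundaries(token_segments), boundaries(gold_segments)
    hits = len(pred & gold)
    # Convention (an assumption): an empty boundary set counts as perfect on its side.
    precision = hits / len(pred) if pred else 1.0
    recall = hits / len(gold) if gold else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: gold morphemes vs. a tokenizer's subword split.
gold = ["un", "break", "able"]
tokens = ["un", "bre", "akable"]
print(boundary_f1(tokens, gold))
```

Here the tokenizer recovers one of the two gold boundaries and proposes one spurious boundary, so precision and recall are both one half.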

Progressive Perturbation with KTO for Enhanced Machine Translation of Indian Languages

MT Summit, MT-S, 2025

@inproceedings{bib_Prog_2025, AUTHOR = {Bhaskar, Yash and Shetye, Ketaki Mangesh and Mujadia, Vandan and Sharma, Dipti Mishra and Krishnamurthy, Parameswari}, TITLE = {Progressive Perturbation with KTO for Enhanced Machine Translation of Indian Languages}, BOOKTITLE = {MT Summit}, YEAR = {2025}}

Abstract

This study addresses the critical challenge of data scarcity in machine translation for Indian languages, particularly given their morphological complexity and limited parallel data. We investigate an effective strategy to maximize the utility of existing data by generating negative samples from positive training instances using a progressive perturbation approach. This is used to align the model with preferential data using Kahneman-Tversky Optimization (KTO). Comparing it against traditional Supervised Fine-Tuning (SFT), we demonstrate how generating negative samples and leveraging KTO enhances data efficiency. By creating rejected samples through progressively perturbed translations from the available dataset, we fine-tune the Llama 3.1 Instruct 8B model using QLoRA across 16 language directions, including English, Hindi, Bangla, Tamil, Telugu, and Santali. Our results show that KTO-based preference alignment with progressive perturbation consistently outperforms SFT, achieving significant gains in translation quality with an average BLEU increase of 1.84 to 2.47 and CHRF increase of 2.85 to 4.01 compared to SFT for selected languages, while using the same positive training samples and under similar computational constraints. This highlights the potential of our negative sample generation strategy within KTO, especially in low-resource scenarios.
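The idea of deriving rejected samples from existing positive translations can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the abstract does not specify the perturbation operations, so the word drops and adjacent swaps, the edit levels, the function names, and the prompt/completion/label field layout are all hypothetical choices made for the example.

```python
import random


def perturb(tokens, n_edits, rng):
    """Apply n_edits random word-level edits (drop or adjacent swap) to a copy."""
    out = list(tokens)
    for _ in range(n_edits):
        if len(out) < 2:
            break
        i = rng.randrange(len(out) - 1)
        if rng.random() < 0.5:
            del out[i]  # drop one word
        else:
            out[i], out[i + 1] = out[i + 1], out[i]  # swap two neighbours
    return out


def build_kto_examples(source, reference, levels=(1, 2, 4), seed=0):
    """Return the desirable example plus one undesirable example per edit level."""
    rng = random.Random(seed)
    examples = [{"prompt": source, "completion": reference, "label": True}]
    ref_tokens = reference.split()
    for n_edits in levels:  # more edits means a more heavily corrupted translation
        noisy = " ".join(perturb(ref_tokens, n_edits, rng))
        if noisy != reference:  # skip the rare case where edits cancel out
            examples.append({"prompt": source, "completion": noisy, "label": False})
    return examples


# Hypothetical sentence pair, used only to show the output shape.
for ex in build_kto_examples("Translate to English: ...", "the cat sat on a mat"):
    print(ex["label"], "|", ex["completion"])
```

Each level perturbs the clean reference independently, so the rejected samples range from near-correct to clearly degraded while the positive sample is reused unchanged.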

Towards Large Language Model driven Reference-less Translation Evaluation for English and Indian Languages

International Conference on Natural Language Processing, ICON, 2024

Core Rank : - Google Rank :5

@inproceedings{bib_Towa_2024, AUTHOR = {Mujadia, Vandan and Mishra, Pruthwik and Ahsan, Arafat and Sharma, Dipti Mishra}, TITLE = {Towards Large Language Model driven Reference-less Translation Evaluation for English and Indian Languages}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2024}}

Abstract

With the primary focus on evaluating the effectiveness of large language models for automatic reference-less translation assessment, this work presents our experiments on mimicking human direct assessment to evaluate the quality of translations in English and Indian languages. We constructed a translation evaluation task where we performed zero-shot learning, in-context example-driven learning, and fine-tuning of large language models to provide a score out of 100, where 100 represents a perfect translation and 1 represents a poor translation. We compared the performance of our trained systems with existing methods such as COMET, BERT-Scorer, and LABSE, and found that the LLM-based evaluator (LLaMA-2-13B) achieves a comparable or higher overall correlation with human judgments for the considered Indian language pairs (refer to Figure 1).

Overview of MTIL Track at FIRE 2023: Machine Translation for Indian Languages

Forum for Information Retrieval Evaluation, FIRE, 2024

Core Rank : - Google Rank :30

@inproceedings{bib_Over_2024, AUTHOR = {Gangopadhyay, Surupendu and Majumder, Prasenjit and Gain, Baban and Appicharla, Ramakrishna and Ekbal, Asif and Ahsan, Arafat and Sharma, Dipti Mishra}, TITLE = {Overview of MTIL Track at FIRE 2023: Machine Translation for Indian Languages}, BOOKTITLE = {Forum for Information Retrieval Evaluation}, YEAR = {2024}}

Abstract

The objective of the MTIL track in FIRE 2023 was to encourage the development of Indian Language to Indian Language (IL-IL) Neural Machine Translation models. The languages covered in the track included Hindi, Gujarati, Kannada, Odia, Punjabi, Urdu, Telugu, Kashmiri, and Sindhi. The track consisted of two tasks: (i) a General Translation Task and (ii) a Domain-specific Translation Task, with Governance and Healthcare being the chosen domains. For the listed languages, we proposed 12 diverse language directions for the general domain translation task and 8 each for the healthcare and governance domains. Participants were encouraged to submit models for one or more language pairs. We witnessed the creation of 34 distinct models spanning various language pairs and domains. Model assessments were conducted using five evaluation metrics: BLEU, CHRF, CHRF++, TER, and COMET. The submitted model outputs were ranked based on the CHRF score.

Estimating the Quality of Translated Medical Texts using Back Translation & Resource Description Framework

Semantic Web Solutions for Large-scale Biomedical Data Analytics, SeWeBMeDA, 2024

@inproceedings{bib_Esti_2024, AUTHOR = {Neekhra, Binay Kumar and Sharma, Dipti Mishra}, TITLE = {Estimating the Quality of Translated Medical Texts using Back Translation \& Resource Description Framework}, BOOKTITLE = {Semantic Web Solutions for Large-scale Biomedical Data Analytics}, YEAR = {2024}}

Abstract

How can we effectively estimate the quality of translated texts in the medical field, where back-translation is usually available and/or recommended for sensitive documents? This paper proposes a novel metric, GATE, for the translation quality estimation task, leveraging the Resource Description Framework (RDF) to encode both semantic and syntactical information of the original and back-translated sentences into RDF graphs. The distance between these graphs is measured to obtain a semantic similarity score that assesses the quality of the translation. Unlike traditional metrics like BLEU and METEOR, our approach is reference-less, capturing both semantic and syntactical information for a comprehensive assessment of translation quality. Our results correlate better with human judgment, giving a better Pearson correlation (0.357) as compared to BLEU (0.200), thereby showing ~70% improvement over BLEU. Our research shows that, in the field of translation evaluation, existing resources like back-translation and RDF could be useful.
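Agreement between a quality metric and human judgment, as reported above with Pearson correlation, can be computed along these lines. This is a plain-Python sketch of the standard Pearson formula; the score lists are made-up placeholders and do not reproduce the paper's data or its reported values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-sentence scores: a metric's output and human ratings.
metric_scores = [0.62, 0.35, 0.80, 0.51, 0.44]
human_scores = [70, 40, 85, 65, 30]
print(round(pearson(metric_scores, human_scores), 3))
```

A metric is then compared with a baseline such as BLEU by computing the same coefficient for each against the same human ratings.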

Estimating the Quality of Translated Medical Texts using Back Translation & Resource Description Framework

Extended Semantic Web Conference, ESWC, 2024

Core Rank : B Google Rank :28

Abstract

How can we effectively estimate the quality of translated texts in the medical field, where back-translation is usually available and/or recommended for sensitive documents? This paper proposes a novel metric, GATE, for the translation quality estimation task, leveraging the Resource Description Framework (RDF) to encode both semantic and syntactical information of the original and back-translated sentences into RDF graphs. The distance between these graphs is measured to obtain a semantic similarity score that assesses the quality of the translation. Unlike traditional metrics like BLEU and METEOR, our approach is reference-less, capturing both semantic and syntactical information for a comprehensive assessment of translation quality. Our results correlate better with human judgment, giving a better Pearson correlation (0.357) as compared to BLEU (0.200), thereby showing ~70% improvement over BLEU. Our research shows that, in the field of translation evaluation, existing resources like back-translation and RDF could be useful.

Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages

NAACL Student Research Workshop, NAACL-SRW, 2024

Core Rank : - Google Rank :-

@inproceedings{bib_Fine_2024, AUTHOR = {Bahad, Sankalp Sanjay and Mishra, Pruthwik and Arora, Karunesh and Balabantaray, Rakesh Chandra and Sharma, Dipti Mishra and Krishnamurthy, Parameswari}, TITLE = {Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages}, BOOKTITLE = {NAACL Student Research Workshop}, YEAR = {2024}}

Abstract

Named Entity Recognition (NER) is a useful component in Natural Language Processing (NLP) applications. It is used in various tasks such as Machine Translation, Summarization, Information Retrieval, and Question-Answering systems. The research on NER is centered around English and some other major languages, whereas limited attention has been given to Indian languages. We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian Languages. We present human-annotated named entity corpora of ∼40K sentences for 4 Indian languages from two of the major Indian language families. Additionally, we present a multilingual model fine-tuned on our dataset, which achieves an F1 score of ∼0.80 on our dataset on average. We achieve comparable performance on completely unseen benchmark datasets for Indian languages, which affirms the usability of our model.

Towards Disfluency Annotated Corpora for Indian Languages

International Conference Computational Linguistics Workshops, COLING - W, 2024

Core Rank : - Google Rank :-

@inproceedings{bib_Towa_2024, AUTHOR = {Kochar, Chayan and Mujadia, Vandan and Mishra, Pruthwik and Sharma, Dipti Mishra}, TITLE = {Towards Disfluency Annotated Corpora for Indian Languages}, BOOKTITLE = {International Conference Computational Linguistics Workshops}, YEAR = {2024}}

Abstract

In the natural course of spoken language, individuals often engage in thinking and self-correction during speech production. These instances of interruption or correction are commonly referred to as disfluencies. When preparing data for subsequent downstream NLP tasks, these linguistic elements can be systematically removed, or handled as required, to enhance data quality. In this study, we present a comprehensive study of disfluencies in Indian languages. Our approach involves not only annotating real-world conversation transcripts but also conducting a detailed analysis of the linguistic nuances inherent to Indian languages that must be considered during annotation. Additionally, we introduce a robust algorithm for the synthetic generation of disfluent data. This algorithm aims to facilitate more effective model training for the identification of disfluencies in real-world conversations, thereby contributing to the advancement of disfluency research in Indian languages.

Assessing Translation capabilities of Large Language Models involving English and Indian Languages

Technical Report, arXiv, 2023

Core Rank : - Google Rank :-

@inproceedings{bib_Asse_2023, AUTHOR = {Mujadia, Vandan and Ashok, Urlana and Bhaskar, Yash and Pavani, Penumalla Aditya and Shravya, Kukkapalli and Krishnamurthy, Parameswari and Sharma, Dipti Mishra}, TITLE = {Assessing Translation capabilities of Large Language Models involving English and Indian Languages}, BOOKTITLE = {Technical Report}, YEAR = {2023}}

Abstract

Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. In this work, our aim is to explore the multilingual capabilities of large language models by using machine translation as a task involving English and 22 Indian languages. We first investigate the translation capabilities of raw large language models, followed by exploring the in-context learning capabilities of the same raw models. We fine-tune these large language models using parameter-efficient fine-tuning methods such as LoRA and additionally with full fine-tuning. Through our study, we have identified the best performing large language model for the translation task involving LLMs, which is based on LLaMA.