Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages
Sankalp Sanjay Bahad, Pruthwik Mishra, Karunesh Arora, Rakesh Chandra Balabantaray, Dipti Mishra Sharma, Parameswari Krishnamurthy
NAACL Student Research Workshop, NAACL-SRW, 2024
@inproceedings{bib_Fine_2024, AUTHOR = {Sankalp Sanjay Bahad, Pruthwik Mishra, Karunesh Arora, Rakesh Chandra Balabantaray, Dipti Mishra Sharma, Parameswari Krishnamurthy}, TITLE = {Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages}, BOOKTITLE = {NAACL Student Research Workshop}, YEAR = {2024}}
Named Entity Recognition (NER) is a useful component in Natural Language Processing (NLP) applications. It is used in various tasks such as Machine Translation, Summarization, Information Retrieval, and Question-Answering systems. The research on NER is centered around English and some other major languages, whereas limited attention has been given to Indian languages. We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian Languages. We present human-annotated named entity corpora of ∼40K sentences for 4 Indian languages from two of the major Indian language families. Additionally, we present a multilingual model fine-tuned on our dataset, which achieves an average F1 score of ∼0.80 on our dataset. We achieve comparable performance on completely unseen benchmark datasets for Indian languages, which affirms the usability of our model.
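A minimal sketch of the kind of fine-tuning this abstract describes, using Hugging Face Transformers for token classification. The base checkpoint (xlm-roberta-base), the BIO label set, and the toy sentence below are assumptions, not the paper's actual setup:

```python
# Hedged sketch: fine-tuning a multilingual encoder for NER.
# Checkpoint, label set, and data are illustrative stand-ins; the paper's
# model and 40K-sentence corpora are not reproduced here.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
model_name = "xlm-roberta-base"  # assumed base model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels))

# One pre-tokenized sentence with word-level BIO tags (toy example).
words = ["Sachin", "Tendulkar", "was", "born", "in", "Mumbai"]
tags = [1, 2, 0, 0, 0, 3]  # indices into `labels`

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Align word-level tags to subword tokens; special tokens get -100 (ignored).
aligned = [-100 if w is None else tags[w] for w in enc.word_ids()]
loss = model(**enc, labels=torch.tensor([aligned])).loss
loss.backward()  # an optimizer step would follow in a full training loop
```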
Exploring News Summarization and Enrichment in a Highly Resource-Scarce Indian Language: A Case Study of Mizo
Abhinaba Bala, Urlana Ashok, Rahul Mishra, Parameswari Krishnamurthy
Workshop on Indian Language Data, ILD-W, 2024
@inproceedings{bib_Expl_2024, AUTHOR = {Abhinaba Bala, Urlana Ashok, Rahul Mishra, Parameswari Krishnamurthy}, TITLE = {Exploring News Summarization and Enrichment in a Highly Resource-Scarce Indian Language: A Case Study of Mizo}, BOOKTITLE = {Workshop on Indian Language Data}, YEAR = {2024}}
Obtaining sufficient information in one’s mother tongue is crucial for satisfying the information needs of the users. While high-resource languages have abundant online resources, the situation is less than ideal for very low-resource languages. Moreover, the insufficient reporting of vital national and international events continues to be a worry, especially in languages with scarce resources, like Mizo. In this paper, we conduct a study to investigate the effectiveness of a simple methodology designed to generate a holistic summary for Mizo news articles, which leverages English-language news to supplement and enhance the information related to the corresponding news events. Furthermore, we make available 500 Mizo news articles and corresponding enriched holistic summaries. Human evaluation confirms that our approach significantly enhances the information coverage of Mizo news articles.
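A rough sketch of the enrichment idea described above, assuming off-the-shelf components (an English summarizer and NLLB-200, which covers Mizo as lus_Latn); the paper's actual pipeline and model choices may differ:

```python
# Hedged sketch: supplement a Mizo news summary with information drawn from
# a related English article. Both models here are assumptions, not the paper's.
from transformers import pipeline

summarize = pipeline("summarization", model="facebook/bart-large-cnn")
to_mizo = pipeline("translation", model="facebook/nllb-200-distilled-600M",
                   src_lang="eng_Latn", tgt_lang="lus_Latn")  # NLLB's Mizo code

def holistic_summary(mizo_summary: str, related_english_article: str) -> str:
    """Append a Mizo rendering of the English coverage to the Mizo summary."""
    eng = summarize(related_english_article,
                    max_length=80, min_length=20)[0]["summary_text"]
    supplement = to_mizo(eng)[0]["translation_text"]
    return mizo_summary + " " + supplement
```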
AbhiPaw@ DravidianLangTech: Multimodal Abusive Language Detection and Sentiment Analysis using Transformer based architecture
Abhinaba Bala, Parameswari Krishnamurthy
Workshop on Speech and Language Technologies for Dravidian Languages, DravidianLangTech, 2023
@inproceedings{bib_Abhi_2023, AUTHOR = {Abhinaba Bala, Parameswari Krishnamurthy}, TITLE = {AbhiPaw@ DravidianLangTech: Multimodal Abusive Language Detection and Sentiment Analysis using Transformer based architecture}, BOOKTITLE = {Workshop on Speech and Language Technologies for Dravidian Languages}, YEAR = {2023}}
Detecting abusive language in multimodal videos has become a pressing need in ensuring a safe and inclusive online environment. This paper focuses on addressing this challenge through the development of a novel approach for multimodal abusive language detection in Tamil videos and sentiment analysis for Tamil/Malayalam videos. By leveraging state-of-the-art models such as Multiscale Vision Transformers (MViT) for video analysis, OpenL3 for audio analysis, and the bert-base-multilingual-cased model for textual analysis, our proposed framework integrates visual, auditory, and textual features. Through extensive experiments and evaluations, we demonstrate the effectiveness of our model in accurately detecting abusive content and predicting sentiment categories. The limited availability of effective tools for performing these tasks in Dravidian languages has prompted a new avenue of research in these domains. Keywords: abusive language detection, sentiment analysis, multimodal analysis, video analysis, Dravidian languages.
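A compact sketch of late fusion over the three modalities named above; the embedding widths (768 for MViT and BERT, 512 for OpenL3) and the head architecture are assumptions, with random tensors standing in for precomputed features:

```python
# Hedged sketch: fuse precomputed video/audio/text embeddings and classify.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, d_video=768, d_audio=512, d_text=768, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_video + d_audio + d_text, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, n_classes),
        )

    def forward(self, v, a, t):
        # Concatenate per-modality embeddings along the feature axis.
        return self.head(torch.cat([v, a, t], dim=-1))

# Random tensors stand in for MViT, OpenL3, and BERT features (batch of 4).
clf = LateFusionClassifier()
logits = clf(torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 768))
```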
AbhiPaw@ DravidianLangTech: Fake News Detection in Dravidian Languages using Multilingual BERT
Abhinaba Bala, Parameswari Krishnamurthy
Workshop on Speech and Language Technologies for Dravidian Languages, DravidianLangTech, 2023
@inproceedings{bib_Abhi_2023, AUTHOR = {Abhinaba Bala, Parameswari Krishnamurthy}, TITLE = {AbhiPaw@ DravidianLangTech: Fake News Detection in Dravidian Languages using Multilingual BERT}, BOOKTITLE = {Workshop on Speech and Language Technologies for Dravidian Languages}, YEAR = {2023}}
This study addresses the challenge of detecting fake news in Dravidian languages by leveraging Google's MuRIL (Multilingual Representations for Indian Languages) model. Drawing upon previous research, we investigate the intricacies involved in identifying fake news and explore the potential of transformer-based models for linguistic analysis and contextual understanding. Through supervised learning, we fine-tune the "muril-base-cased" variant of MuRIL using a carefully curated dataset of labeled comments and posts in Dravidian languages, enabling the model to discern between original and fake news. During the inference phase, the fine-tuned MuRIL model analyzes new textual content, extracting contextual and semantic features to predict the content's classification. We evaluate the model's performance using standard metrics, highlighting the effectiveness of MuRIL in detecting fake news in Dravidian languages and contributing to the establishment of a safer digital ecosystem. Keywords: fake news detection, Dravidian languages, MuRIL, transformer-based models, linguistic analysis, contextual understanding.
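A minimal sketch of the supervised fine-tuning step described above, using the public google/muril-base-cased checkpoint for binary sequence classification; the toy batch and the label convention are assumptions:

```python
# Hedged sketch: fine-tune MuRIL for original-vs-fake classification.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "google/muril-base-cased"  # public MuRIL checkpoint on the HF Hub
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

batch = tok(["An example news post in a Dravidian language"],
            padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([0])  # assumed convention: 0 = original, 1 = fake
loss = model(**batch, labels=labels).loss
loss.backward()  # an optimizer step would follow in the training loop
```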
AbhiPaw@ DravidianLangTech: Abusive Comment Detection in Tamil and Telugu using Logistic Regression
Abhinaba Bala, Parameswari Krishnamurthy
Workshop on Speech and Language Technologies for Dravidian Languages, DravidianLangTech, 2023
@inproceedings{bib_Abhi_2023, AUTHOR = {Abhinaba Bala, Parameswari Krishnamurthy}, TITLE = {AbhiPaw@ DravidianLangTech: Abusive Comment Detection in Tamil and Telugu using Logistic Regression}, BOOKTITLE = {Workshop on Speech and Language Technologies for Dravidian Languages}, YEAR = {2023}}
Abusive comments in online platforms have become a significant concern, necessitating the development of effective detection systems. However, limited work has been done in low-resource languages, including Dravidian languages. This paper addresses this gap by focusing on abusive comment detection in a dataset containing Tamil, Tamil-English and Telugu-English code-mixed comments. Our methodology involves logistic regression and explores suitable embeddings to enhance the performance of the detection model. Through rigorous experimentation, we identify the most effective combination of logistic regression and embeddings. The results demonstrate the performance of our proposed model, which contributes to the development of robust abusive comment detection systems in low-resource language settings. Keywords: Abusive comment detection, Dravidian languages, logistic regression, embeddings, low resource languages, code-mixed dataset.
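A small sketch of the logistic-regression setup, with character n-gram TF-IDF as a stand-in for the embeddings the paper explores (its best-performing embedding choice is not reproduced here):

```python
# Hedged sketch: logistic regression over character n-gram features,
# a common baseline representation for code-mixed text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

# Toy comments with 0 = not abusive, 1 = abusive (illustrative only).
texts = ["nalla video, thanks!", "very bad words here",
         "super content bro", "worst abusive stuff"]
y = [0, 1, 0, 1]
clf.fit(texts, y)
print(clf.predict(["semma video"]))
```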
Assessing Translation capabilities of Large Language Models involving English and Indian Languages
Vandan Mujadia, Urlana Ashok, Yash Bhaskar, Penumalla Aditya Pavani, Kukkapalli Shravya, Parameswari Krishnamurthy, Dipti Mishra Sharma
Technical Report, arXiv, 2023
@inproceedings{bib_Asse_2023, AUTHOR = {Vandan Mujadia, Urlana Ashok, Yash Bhaskar, Penumalla Aditya Pavani, Kukkapalli Shravya, Parameswari Krishnamurthy, Dipti Mishra Sharma}, TITLE = {Assessing Translation capabilities of Large Language Models involving English and Indian Languages}, BOOKTITLE = {Technical Report}, YEAR = {2023}}
Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. In this work, our aim is to explore the multilingual capabilities of large language models by using machine translation as a task involving English and 22 Indian languages. We first investigate the translation capabilities of raw large language models, followed by exploring the in-context learning capabilities of the same raw models. We fine-tune these large language models using parameter-efficient fine-tuning methods such as LoRA and additionally with full fine-tuning. Through our study, we have identified the best-performing large language model for the translation task involving LLMs, which is based on LLaMA.
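A brief sketch of the parameter-efficient route mentioned above, attaching LoRA adapters to a causal LM with the peft library; the base checkpoint and hyperparameters are assumptions, not the paper's reported configuration:

```python
# Hedged sketch: LoRA fine-tuning setup for a decoder-only LLM.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # assumed LLaMA-family base (gated on the Hub)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16, lora_dropout=0.05,  # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],    # attention projections in LLaMA blocks
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```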
Transformer-based Context Aware Morphological Analyzer for Telugu
Chelpuri Abhijith, Dasari Priyanka, Nagaraju Vuppala, Mounika Marreddy, Radhika Mamidi, Parameswari Krishnamurthy
Workshop on Speech and Language Technologies for Dravidian Languages, DravidianLangTech, 2023
@inproceedings{bib_Tran_2023, AUTHOR = {Chelpuri Abhijith, Dasari Priyanka, Nagaraju Vuppala, Mounika Marreddy, Radhika Mamidi, Parameswari Krishnamurthy}, TITLE = {Transformer-based Context Aware Morphological Analyzer for Telugu}, BOOKTITLE = {Workshop on Speech and Language Technologies for Dravidian Languages}, YEAR = {2023}}
This paper addresses the challenges faced by Indian languages in leveraging deep learning for natural language processing (NLP) due to limited resources, annotated datasets, and Transformer-based architectures. We specifically focus on Telugu and aim to construct a Telugu morph analyzer dataset comprising 10,000 sentences. Furthermore, we assess the performance of established multilingual Transformer models (m-BERT, XLM-R, IndicBERT) and the monolingual Transformer model BERT-Te (trained from scratch on an extensive Telugu corpus comprising 80,15,588 sentences). Our findings demonstrate that Transformer-based representations pre-trained on Telugu data improve the performance of the Telugu morph analyzer, surpassing existing multilingual approaches. This highlights the necessity of developing dedicated corpora, annotated datasets, and machine learning models in a monolingual setting. Using our dataset, we present benchmark results for the Telugu morph analyzer achieved through simple fine-tuning on BERT-Te. The morph analyzer dataset and codes are open-sourced and available here.
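One way to read "simple fine-tuning" here is the same token-classification recipe sketched for NER above, with bundled morph-feature tags as labels; a minimal setup under that assumption (the BERT-Te checkpoint name is not given, so a multilingual model stands in):

```python
# Hedged sketch: morph analysis framed as token classification, where each
# token receives a bundled feature tag. Tags and checkpoint are stand-ins.
from transformers import AutoTokenizer, AutoModelForTokenClassification

morph_tags = ["noun.sg.nom", "noun.pl.obl", "verb.past.3sg", "verb.fut.1pl"]  # illustrative
name = "bert-base-multilingual-cased"  # stand-in for the authors' BERT-Te

tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(
    name, num_labels=len(morph_tags))
# Training then follows the subword-alignment pattern shown in the NER sketch above.
```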