A Category-agnostic Graph Attention-based approach for determining Notability of complex article titles for Wikipedia
Thota Gokul Vamsi, Vasudeva Varma Kalidindi
Companion Proceedings of the ACM Web Conference, WWW Companion, 2024
@inproceedings{bib_A_Ca_2024, AUTHOR = {Thota Gokul Vamsi, Vasudeva Varma Kalidindi}, TITLE = {A Category-agnostic Graph Attention-based approach for determining Notability of complex article titles for Wikipedia}, BOOKTITLE = {Companion Proceedings of the ACM Web Conference}, YEAR = {2024}}
Wikipedia is a highly essential platform because of its informative, dynamic, and easily accessible nature. To identify topics/titles warranting their own Wikipedia article, editors of Wikipedia defined “Notability” guidelines. So far, notability has been enforced by humans, which makes scalability an issue. There has been no significant work on notability determination for titles with complex category dependencies. We design a mechanism to identify such titles. We construct a dataset with 9k such titles and propose a category-agnostic approach utilizing graph neural networks for their notability determination. Our system outperforms machine learning-based and transformer-based classifiers as well as entity salience methods, and provides a scalable alternative to manual decision-making about notability.
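The abstract does not include code; as a rough, dependency-free sketch of the attention-weighted neighbourhood aggregation that graph attention networks perform (all function and variable names below are hypothetical, and the paper's actual architecture is certainly more elaborate), a minimal single-head aggregation step might look like:

```python
import math

def attention_aggregate(node_feats, neighbors, scores):
    """Aggregate neighbour features with softmax-normalised attention.

    node_feats : dict mapping node id -> feature vector (list of floats)
    neighbors  : list of neighbour node ids
    scores     : dict mapping neighbour id -> raw attention score
    Returns the attention-weighted sum of the neighbour feature vectors.
    """
    # Softmax over raw scores so the weights sum to 1 (numerically stabilised).
    m = max(scores[n] for n in neighbors)
    exp = {n: math.exp(scores[n] - m) for n in neighbors}
    z = sum(exp.values())
    weights = {n: exp[n] / z for n in neighbors}

    dim = len(next(iter(node_feats.values())))
    out = [0.0] * dim
    for n in neighbors:
        for i in range(dim):
            out[i] += weights[n] * node_feats[n][i]
    return out
```

In a full model the raw scores would themselves be learned from node-feature pairs; here they are supplied directly to keep the sketch self-contained.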
Generating entity embeddings for populating Wikipedia Knowledge Graph by Notability detection
Thota Gokul Vamsi, Vasudeva Varma Kalidindi
International Conference on Natural Language to Data Bases, NLDB, 2024
@inproceedings{bib_Gene_2024, AUTHOR = {Thota Gokul Vamsi, Vasudeva Varma Kalidindi}, TITLE = {Generating entity embeddings for populating Wikipedia Knowledge Graph by Notability detection}, BOOKTITLE = {International Conference on Natural Language to Data Bases}, YEAR = {2024}}
Knowledge graphs (KGs) have been playing a crucial role in leveraging information on the web for several downstream tasks, making it vital to construct and maintain them. Despite previous efforts in populating KGs, these methods typically do not focus on analyzing entity-specific content exclusively but rely on a fixed collection of documents. We define an approach to populate such KGs by utilizing entity-specific content on the web to generate entity embeddings that establish entity-category interconnections. We empirically demonstrate our approach's effectiveness by utilizing it for a downstream task of Notability detection, associated with one of the most popular and important knowledge graphs - the Wikipedia platform. To moderate the content uploaded to Wikipedia, “Notability” guidelines are defined by its editors to identify named entities that warrant their own article on Wikipedia. So far, notability has been enforced by humans, which makes scalability an issue, and there has been no significant work on automating this process. In this paper, we define a multipronged category-agnostic approach based on web-based entity features and their text-based salience encodings to construct entity embeddings for determining an entity's notability. We distinguish entities based on their categories and utilize neural networks to perform classification. For validation, we utilize accuracy and prediction confidence on popular Wikipedia pages. Our system outperforms machine learning-based classifier approaches and handcrafted entity salience detection algorithms, achieving an accuracy of around 88%. It provides an efficient and scalable alternative to manual decision-making about the importance of a topic, which could be extended to other such KG-based tasks.
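One simple way to combine web-based entity features with a text-based salience encoding into a single entity embedding is late fusion: normalise each view, then concatenate. This is only an illustrative sketch (the function names are hypothetical and the paper's actual fusion may differ):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (leave zero vectors unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_entity_embedding(web_features, salience_encoding):
    """Late fusion: normalise each feature view separately, then
    concatenate them into one fixed-length entity embedding that a
    downstream classifier can consume."""
    return l2_normalize(web_features) + l2_normalize(salience_encoding)
```

Per-view normalisation keeps one view from dominating the concatenated embedding purely because of its scale.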
Can LLMs Generate Architectural Design Decisions? - An Exploratory Empirical Study
Rudra Dhar, Karthik Vaidhyanathan, Vasudeva Varma Kalidindi
IEEE International Conference on Software Architecture Companion, ICSA, 2024
@inproceedings{bib_Can__2024, AUTHOR = {Rudra Dhar, Karthik Vaidhyanathan, Vasudeva Varma Kalidindi}, TITLE = {Can LLMs Generate Architectural Design Decisions? - An Exploratory Empirical Study}, BOOKTITLE = {IEEE International Conference on Software Architecture Companion}, YEAR = {2024}}
Architectural Knowledge Management (AKM) involves the organized handling of information related to architectural decisions and design within a project or organization. An essential artefact of AKM is the Architecture Decision Record (ADR), which documents key design decisions. ADRs capture the decision context, the decision made, and various related aspects of a design decision, thereby promoting transparency, collaboration, and understanding. Despite their benefits, ADR adoption in software development has been slow due to challenges like time constraints and inconsistent uptake. Recent advancements in Large Language Models (LLMs) may help bridge this adoption gap by facilitating ADR generation. However, the effectiveness of LLMs for ADR generation or understanding has not been explored. To this end, in this work, we perform an exploratory study which aims to investigate the feasibility of using LLMs for the generation of ADRs given the decision context. In our exploratory study, we utilize GPT and T5-based models with 0-shot, few-shot, and fine-tuning approaches to generate the Decision of an ADR given its Context. Our results indicate that in a 0-shot setting, state-of-the-art models such as GPT-4 generate relevant and accurate Design Decisions, although they fall short of human-level performance. Additionally, we observe that more cost-effective models like GPT-3.5 can achieve similar outcomes in a few-shot setting, and smaller models such as Flan-T5 can yield comparable results after fine-tuning. To conclude, this exploratory study suggests that LLMs can generate Design Decisions, but further research is required to attain human-level generation and establish standardized, widespread adoption.
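The few-shot setting described above amounts to prepending solved Context/Decision pairs to the unsolved Context before querying the model. A minimal prompt builder in that spirit might look like this (the prompt format is illustrative, not the paper's exact template):

```python
def build_fewshot_prompt(examples, context):
    """Assemble a few-shot ADR prompt: each example pairs a decision
    Context with its Decision; the final Context is left open for the
    model to complete.

    examples : list of (context, decision) string pairs
    context  : the new decision context to be completed
    """
    parts = []
    for ex_context, ex_decision in examples:
        parts.append(f"Context: {ex_context}\nDecision: {ex_decision}")
    # The trailing "Decision:" cues the model to produce the answer.
    parts.append(f"Context: {context}\nDecision:")
    return "\n\n".join(parts)
```

With zero examples this degenerates to the 0-shot prompt; fine-tuning instead bakes the Context-to-Decision mapping into the model weights.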
Summarizing Indian Languages using Multilingual Transformers based Models
Dhaval Taunk, Vasudeva Varma Kalidindi
Technical Report, arXiv, 2023
@inproceedings{bib_Summ_2023, AUTHOR = {Dhaval Taunk, Vasudeva Varma Kalidindi}, TITLE = {Summarizing Indian Languages using Multilingual Transformers based Models}, BOOKTITLE = {Technical Report}, YEAR = {2023}}
With the advent of multilingual models like mBART, mT5, and IndicBART, summarization in low-resource Indian languages is receiving a lot of attention nowadays. However, the number of available datasets is still small. In this work, we (Team HakunaMatata) study how these multilingual models perform on datasets that have Indian languages as both source and target text for summarization. We experiment with the IndicBART and mT5 models and report ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-4 scores as performance metrics.
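For reference, ROUGE-N is the n-gram-overlap F1 between a candidate summary and a reference. The sketch below is a simplified whitespace-tokenised reimplementation for illustration only; reported scores would come from standard ROUGE tooling with proper tokenisation and stemming:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Compute a simplified ROUGE-N F1 from n-gram overlap counts."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.split())
    ref = ngrams(reference.split())
    # Clipped overlap: each n-gram counts at most as often as in either side.
    overlap = sum((cand & ref).values())
    if not cand or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 through ROUGE-4 are the same computation with `n=2`, `3`, and `4`.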
Generative Models For Indic Languages: Evaluating Content Generation Capabilities
Savita Bhat, Vasudeva Varma Kalidindi, Niranjan Pedaneka
Recent Advances in Natural Language Processing, RANLP, 2023
@inproceedings{bib_Gene_2023, AUTHOR = {Savita Bhat, Vasudeva Varma Kalidindi, Niranjan Pedaneka}, TITLE = {Generative Models For Indic Languages: Evaluating Content Generation Capabilities}, BOOKTITLE = {Recent Advances in Natural Language Processing}, YEAR = {2023}}
Large language models (LLMs) and generative AI have emerged as the most important areas in the field of natural language processing (NLP). LLMs are considered to be a key component in several NLP tasks, such as summarization, question-answering, sentiment classification, and translation. Newer LLMs, such as ChatGPT, BLOOMZ, and several such variants, are known to train on multilingual training data and hence are expected to process and generate text in multiple languages. Considering the widespread use of LLMs, evaluating their efficacy in multilingual settings is imperative. In this work, we evaluate the newest generative models (ChatGPT, mT0, and BLOOMZ) in the context of Indic languages. Specifically, we consider natural language generation (NLG) applications such as summarization and question-answering in monolingual and cross-lingual settings. We observe that current generative models have limited capability for generating text in Indic languages in a zero-shot setting. In contrast, generative models perform consistently better on manual quality-based evaluation of Indic-language and English-language generation. Considering the limited generation performance, we argue that these LLMs are not intended to be used in a zero-shot fashion in downstream applications.
Cross-Lingual Fact Checking: Automated Extraction and Verification of Information from Wikipedia using References
S Shivansh, Ankita Maity, Aakash Jain, Bhavyajeet Singh, Harshit Gupta, Lakshya Khanna, Vasudeva Varma Kalidindi
International Conference on Natural Language Processing, ICON, 2023
@inproceedings{bib_Cros_2023, AUTHOR = {S Shivansh, Ankita Maity, Aakash Jain, Bhavyajeet Singh, Harshit Gupta, Lakshya Khanna, Vasudeva Varma Kalidindi}, TITLE = {Cross-Lingual Fact Checking: Automated Extraction and Verification of Information from Wikipedia using References}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2023}}
The paper presents a novel approach for automated cross-lingual fact-checking that extracts and verifies information from Wikipedia using references. The problem involves determining whether a factoid in an article is supported or needs additional citations based on the provided references, with granularity at the fact level. We introduce a cross-lingual manually annotated dataset for fact extraction and verification and an entirely automated pipeline for the task. The proposed solution operates entirely in a cross-lingual setting, where the article text and the references can be in any language. The pipeline integrates several natural language processing techniques to extract the relevant facts from the input sources. The extracted facts are then verified against the references, leveraging the semantic relationships between the facts and the reference sources. Experimental evaluation on a large-scale dataset demonstrates the effectiveness and efficiency of the proposed approach in handling cross-lingual fact-checking tasks. We make our code and data publicly available.
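The verification step described above matches extracted facts against reference text by semantic similarity. As a toy stand-in for the multilingual encoders such a pipeline would actually use, a bag-of-words cosine similarity with a support threshold (threshold value and names hypothetical) can illustrate the decision:

```python
import math
from collections import Counter

def cosine_support(fact, reference, threshold=0.5):
    """Score whether a reference supports a fact via bag-of-words
    cosine similarity; return (similarity, supported?).

    A real cross-lingual pipeline would replace the word counts with
    multilingual sentence embeddings.
    """
    a = Counter(fact.lower().split())
    b = Counter(reference.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    sim = dot / (na * nb) if na and nb else 0.0
    return sim, sim >= threshold
```

Facts scoring below the threshold against every reference would be flagged as needing additional citations.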
Multilingual Bias Detection and Mitigation for Low Resource Languages
Anubhav Sharma, Ankita Maity, Tushar Abhishek, Rudra Dhar, Radhika Mamidi, Manish Gupta, Vasudeva Varma Kalidindi
Wiki Workshop, Wiki-W, 2023
@inproceedings{bib_Mult_2023, AUTHOR = {Anubhav Sharma, Ankita Maity, Tushar Abhishek, Rudra Dhar, Radhika Mamidi, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Multilingual Bias Detection and Mitigation for Low Resource Languages}, BOOKTITLE = {Wiki Workshop}, YEAR = {2023}}
Subjective bias in Wikipedia textual data is a significant problem and affects millions of readers worldwide. Though some monolingual work has been done on classifying and debiasing biased text in resource-rich languages, low-resource languages with large numbers of speakers remain unaddressed. We present an approach for the dual problems of multilingual bias detection and its mitigation with a thorough analysis. In this work, we establish competitive baselines with our preliminary approach, which includes classification-based modelling for bias detection on a multilingual dataset curated from existing monolingual sources. For the problem of bias mitigation, we follow the style transfer paradigm and model it using transformer-based seq2seq architectures. We also discuss several approaches for further improvement in both problems as part of our ongoing work.
Multilingual Bias Detection and Mitigation for Indian Languages
Ankita Maity, Anubhav Sharma, Rudra Dhar, Tushar Abhishek, Manish Gupta, Vasudeva Varma Kalidindi
Technical Report, arXiv, 2023
@inproceedings{bib_Mult_2023, AUTHOR = {Ankita Maity, Anubhav Sharma, Rudra Dhar, Tushar Abhishek, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Multilingual Bias Detection and Mitigation for Indian Languages}, BOOKTITLE = {Technical Report}, YEAR = {2023}}
Lack of diverse perspectives causes neutrality bias in Wikipedia content, leading to millions of readers worldwide being exposed to potentially inaccurate information. Hence, neutrality bias detection and mitigation is a critical problem. Although previous studies have proposed effective solutions for English, no work exists for Indian languages. First, we contribute two large datasets, mWIKIBIAS and mWNC, covering 8 languages, for the bias detection and mitigation tasks respectively. Next, we investigate the effectiveness of popular multilingual Transformer-based models for the two tasks by modeling detection as a binary classification problem and mitigation as a style transfer problem. We make the code and data publicly available.
XFLT: Exploring Techniques for Generating Cross Lingual Factually Grounded Long Text
Bhavyajeet Singh, Kancharla Aditya Hari, Rahul Mehta, Tushar Abhishek, Manish Gupta, Vasudeva Varma Kalidindi
European Conference on Artificial Intelligence, ECAI, 2023
@inproceedings{bib_XFLT_2023, AUTHOR = {Bhavyajeet Singh, Kancharla Aditya Hari, Rahul Mehta, Tushar Abhishek, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {XFLT: Exploring Techniques for Generating Cross Lingual Factually Grounded Long Text}, BOOKTITLE = {European Conference on Artificial Intelligence}, YEAR = {2023}}
Multiple business scenarios require an automated generation of descriptive human-readable long text from structured input data, where the source is typically a high-resource language and the target is a low or medium resource language. We define Cross-Lingual Fact to Long Text Generation (XFLT) as a novel natural language generation (NLG) task that involves generating descriptive and human-readable long text in a target language from structured input data (such as fact triples) in a source language. XFLT is challenging because of (a) the hallucinatory nature of the state-of-the-art NLG models, (b) a lack of good quality training data, and (c) the lack of a suitable cross-lingual NLG metric. Unfortunately, previous work focuses on different related problem settings (cross-lingual facts to short text or monolingual graph to text) and has made no efforts to handle hallucinations. In this paper, we contribute a novel dataset, XLALIGN, with over 64,000 paragraphs across 12 different languages, and English facts. We propose a novel solution to the XFLT task which addresses these challenges by training multilingual Transformer-based encoder-decoder models with coverage prompts and grounded decoding. Further, it improves on the XFLT quality by defining task-specific reward functions and training on them using reinforcement learning. On XLALIGN, we compare this novel solution with several strong baselines using a new metric, cross-lingual PARENT. We also make our code and data publicly available.
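The coverage prompts mentioned above serialise the input fact triples so the decoder is steered to cover every fact. A minimal serialiser in that spirit (the exact prompt format and delimiters are illustrative, not the paper's) might be:

```python
def coverage_prompt(triples, target_lang):
    """Serialise English fact triples into a generation prompt that
    asks for target-language text covering all of the facts.

    triples     : list of (subject, predicate, object) string triples
    target_lang : name of the language the output should be in
    """
    # One "subject | predicate | object" segment per fact, joined with ';'.
    facts = "; ".join(f"{s} | {p} | {o}" for s, p, o in triples)
    return f"Generate a {target_lang} paragraph covering all facts: {facts}"
```

Grounded decoding would then penalise output tokens that cannot be traced back to any serialised fact, attacking the hallucination problem from the decoding side.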
XOutlineGen: Cross-lingual Outline Generation for Encyclopedic Text in Low Resource Languages
S Shivansh, Dhaval Taunk, Manish Gupta, Vasudeva Varma Kalidindi
Wiki Workshop, Wiki-W, 2023
@inproceedings{bib_XOut_2023, AUTHOR = {S Shivansh, Dhaval Taunk, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {XOutlineGen: Cross-lingual Outline Generation for Encyclopedic Text in Low Resource Languages}, BOOKTITLE = {Wiki Workshop}, YEAR = {2023}}
One crucial aspect of content organization is the creation of article outlines, which summarize the primary topics and subtopics covered in an article in a structured manner. This paper introduces a solution called XOutlineGen, which generates cross-lingual outlines for encyclopedic texts from reference articles. XOutlineGen uses the XWikiRef dataset, which consists of encyclopedic texts generated from reference articles and section titles. The dataset is enhanced with two new languages and three new domains, resulting in ∼92K articles. Our pipeline employs this dataset to train a two-step generation model, which takes the article title and set of references as inputs and produces the article outline.
Graph-based Keyword Planning for Legal Clause Generation from Topics
Aparna Garimella, Vasudeva Varma Kalidindi
Natural Legal Language Processing Workshop, NLLP-W, 2022
@inproceedings{bib_Grap_2022, AUTHOR = {Aparna Garimella, Vasudeva Varma Kalidindi}, TITLE = {Graph-based Keyword Planning for Legal Clause Generation from Topics}, BOOKTITLE = {Natural Legal Language Processing Workshop}, YEAR = {2022}}
Generating domain-specific content such as legal clauses based on minimal user-provided information can be of significant benefit in automating legal contract generation. In this paper, we propose a controllable graph-based mechanism that can generate legal clauses using only the topic or type of the legal clauses. Our pipeline consists of two stages: a graph-based planner followed by a clause generator. The planner outlines the content of a legal clause as a sequence of keywords, ordered from generic to more specific clause information, based on the input topic. The generation stage takes in the resulting plan and generates a clause. We illustrate the effectiveness of our proposed two-stage approach on a broad set of clause topics in contracts.
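As a much-simplified sketch of the planning stage, a greedy walk over a weighted keyword graph can produce a generic-to-specific keyword sequence from a topic node (the graph shape, weights, and greedy policy here are all illustrative assumptions, not the paper's controllable mechanism):

```python
def plan_keywords(graph, topic, max_len=5):
    """Greedy keyword plan: starting from the topic node, repeatedly
    follow the highest-weight outgoing edge to an unvisited keyword.

    graph : dict mapping keyword -> {next_keyword: edge_weight}
    topic : starting node (the clause topic)
    Returns the keyword sequence, topic first.
    """
    plan, node = [topic], topic
    while len(plan) < max_len:
        candidates = [(w, k) for k, w in graph.get(node, {}).items()
                      if k not in plan]
        if not candidates:
            break  # dead end: no unvisited neighbours left
        _, node = max(candidates)  # strongest edge wins
        plan.append(node)
    return plan
```

The resulting keyword sequence would then condition the clause generator, giving the user a controllable intermediate representation to inspect or edit.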
Knowledge-based neural framework for sexism detection and classification
Harika Abburi, Shradha Sehgal, Himanshu Maheshwari, Vasudeva Varma Kalidindi
Iberian Languages Evaluation Forum, IberLEF, 2021
@inproceedings{bib_Know_2021, AUTHOR = {Harika Abburi, Shradha Sehgal, Himanshu Maheshwari, Vasudeva Varma Kalidindi}, TITLE = {Knowledge-based neural framework for sexism detection and classification}, BOOKTITLE = {Iberian Languages Evaluation Forum}, YEAR = {2021}}
Sexism, a prejudice that causes enormous suffering, manifests in blatant as well as subtle ways. As sexist content towards women is increasingly spread on social networks, the automatic detection and categorization of these tweets/posts can help social scientists and policymakers in research, thereby combating sexism. In this paper, we explore the problem of detecting whether a Twitter/Gab post is sexist or not. We further discriminate the detected sexist post into one of the fine-grained sexism categories. We propose a neural model for this sexism detection and classification that combines representations obtained using the RoBERTa model with linguistic features such as Empath, Hurtlex, and Perspective API by involving recurrent components. We also leverage unlabeled sexism data to infuse a domain-specific transformer model into our framework. Our proposed framework also features a knowledge module comprised of emoticon and hashtag representations to infuse external knowledge-specific features into the learning process. Several proposed methods outperform various baselines across several standard metrics.
T3N: Harnessing Text and Temporal Tree Network for Rumor Detection on Twitter
Nikhil Pinnaparaju, Manish Gupta, Vasudeva Varma Kalidindi
Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD, 2021
@inproceedings{bib_T3N__2021, AUTHOR = {Nikhil Pinnaparaju, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {T3N: Harnessing Text and Temporal Tree Network for Rumor Detection on Twitter}, BOOKTITLE = {Pacific-Asia Conference on Knowledge Discovery and Data Mining}, YEAR = {2021}}
Social media platforms have democratized the publication process, resulting in easy and viral propagation of information. However, the spread of rumors via such media often results in undesired and extremely impactful political, economic, social, psychological and criminal consequences. Several manual as well as automated efforts have been undertaken in the past to solve this critical problem. Existing automated methods are text based, user credibility based, or use signals from the tweet propagation tree. We aim at using the text, user, propagation tree and temporal information jointly for rumor detection on Twitter. This involves several challenges, such as how to handle text variations on Twitter, what signals from the user profile could be useful, how to best encode the propagation tree information, and how to incorporate the temporal signal. Our novel architecture, T3N (Text and Temporal Tree Network), leverages deep learning based architectures to encode text, user and tree information in a temporal-aware manner. Our extensive comparisons show that our proposed methods outperform the state-of-the-art techniques by 7 and 6 percentage points respectively on two popular benchmark datasets, and also lead to better early detection results.
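One small, illustrative piece of temporal-aware tree handling is flattening a retweet/reply propagation tree into a time-ordered node sequence that a sequence encoder can consume (this is a hypothetical simplification for exposition, not T3N's actual encoder):

```python
def temporal_order(tree, root):
    """Flatten a propagation tree into a timestamp-ordered node list.

    tree : dict mapping node id -> list of (child_id, timestamp) edges
    root : id of the source tweet
    Expands the tree by always visiting the earliest unvisited reply
    reachable from already-visited nodes, so tree structure and timing
    jointly determine the sequence fed downstream.
    """
    visited, order = {root}, [root]
    frontier = list(tree.get(root, []))
    while frontier:
        frontier.sort(key=lambda e: e[1])   # earliest reply first
        child, _ = frontier.pop(0)
        if child in visited:
            continue
        visited.add(child)
        order.append(child)
        frontier.extend(tree.get(child, []))
    return order
```

Each node id in the resulting sequence would then be replaced by its text and user-feature embedding before being fed to the temporal encoder.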