@inproceedings{bib_Curr_2024, AUTHOR = {Kancharla Aditya Hari, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Curriculum Learning for Cross-Lingual Data-to-Text Generation With Noisy Data}, BOOKTITLE = {Technical Report}, YEAR = {2024}}
Curriculum learning has been used to improve the quality of text generation systems by ordering the training samples according to a particular schedule in various tasks. In the context of data-to-text generation (DTG), previous studies used various difficulty criteria to order the training samples for monolingual DTG. These criteria, however, do not generalize to the cross-lingual variant of the problem and do not account for noisy data. We explore multiple criteria that can be used for improving the performance of cross-lingual DTG systems with noisy data using two curriculum schedules. Using the alignment score criterion for ordering samples and an annealing schedule to train the model, we show an increase in BLEU score of up to 4 points, and improvements in faithfulness and coverage of generations by 5-15% on average, across 11 Indian languages and English in two separate datasets. We make code and data publicly available.
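As an illustration of the approach sketched above, a minimal curriculum schedule, assuming each training sample already carries a precomputed alignment score; the annealing policy (easiest-first, with the admitted pool widening over epochs) and all helper names are illustrative, not the authors' implementation:

import random

def curriculum_batches(samples, scores, epoch, total_epochs, batch_size=32):
    # Rank samples easiest-first: higher alignment score = cleaner sample.
    ranked = [s for _, s in sorted(zip(scores, samples), key=lambda p: -p[0])]
    # Annealing schedule: the admitted pool grows until mid-training,
    # after which the full (noisier) data is available.
    frac = min(1.0, (epoch + 1) / max(1, total_epochs // 2))
    pool = ranked[:max(batch_size, int(frac * len(ranked)))]
    random.shuffle(pool)  # shuffle within the admitted pool only
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]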
@inproceedings{bib_Circ_2024, AUTHOR = {Rahul Mehta, Bhavyajeet Singh, Vasudeva Varma Kalidindi, Manish Gupta}, TITLE = {CircuitVQA: A Visual Question Answering Dataset for Electrical Circuit Images}, BOOKTITLE = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases}, YEAR = {2024}}
A visual question answering (VQA) system for electrical circuit images could be useful as a quiz generator, a design and verification assistant, or an electrical diagnosis tool. Although there exists a vast literature on VQA, to the best of our knowledge, there is no existing work on VQA for electrical circuit images. To this end, we curate a new dataset, CircuitVQA, of 115K+ questions on 5725 electrical images with 70 circuit symbols. The dataset contains schematic as well as hand-drawn images. The questions span various categories like counting, value, junction and position based questions. To be effective, models must demonstrate skills like object detection, text recognition, spatial understanding, question intent understanding and answer generation. We experiment with multiple foundational visio-linguistic models for this task and find that a finetuned BLIP model with component descriptions as additional input provides the best results. We make the code and dataset publicly available.
Gauging, enriching and applying geography knowledge in Pre-trained Language Models
@inproceedings{bib_Gaug_2024, AUTHOR = {Nitin Ramrakhiyani, Vasudeva Varma Kalidindi, Girish Keshav Palshikar, Sachin Pawar}, TITLE = {Gauging, enriching and applying geography knowledge in Pre-trained Language Models}, BOOKTITLE = {Information Processing & Management}, YEAR = {2024}}
To employ Pre-trained Language Models (PLMs) as knowledge containers in niche domains it is important to gauge the knowledge of these PLMs about facts in these domains. It is also an important pre-requisite to know how much enrichment effort is required to make them better. As part of this work, we aim to gauge and enrich small PLMs for knowledge of world geography. Firstly, we develop a moderately sized dataset of masked sentences covering 24 different fact types about world geography to estimate knowledge of PLMs on these facts. We hypothesize that for this niche domain, smaller PLMs may not be well equipped. Secondly, we enrich PLMs with this knowledge through fine-tuning and check if the knowledge in the dataset is infused sufficiently. We further hypothesize that linguistic variability in the manual templates used to embed the knowledge in masked sentences does not affect the knowledge infusion. Finally, we demonstrate the application of PLMs to tourism blog search and Wikidata KB augmentation. In both applications, we aim at showing the effectiveness of using PLMs to achieve competitive performance.
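To make the probing setup concrete, a minimal masked-token query against a small PLM via the Hugging Face fill-mask pipeline; the geography sentence and the bert-base-cased checkpoint are examples, not the paper's templates or models:

from transformers import pipeline

# Any small masked LM can be probed this way; bert-base-cased is an example.
fill = pipeline("fill-mask", model="bert-base-cased")

# The PLM "knows" the fact if the gold token ranks among its top predictions.
for pred in fill("The capital of Australia is [MASK].", top_k=5):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")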
Towards Understanding the Robustness of LLM-based Evaluations under Perturbations
@inproceedings{bib_Towa_2024, AUTHOR = {Manav Chaudhary, Harshit Gupta, Savita Bhat, Vasudeva Varma Kalidindi}, TITLE = {Towards Understanding the Robustness of LLM-based Evaluations under Perturbations}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2024}}
Traditional evaluation metrics like BLEU and ROUGE fall short when capturing the nuanced qualities of generated text, particularly when there is no single ground truth. In this paper, we explore the potential of Large Language Models (LLMs), specifically Google Gemini, to serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments on the SummEval and USR datasets, asking the model to generate both a score as well as a justification for the score. Furthermore, we explore the robustness of the LLM evaluator by using perturbed inputs. Our findings suggest that while LLMs show promise, their alignment with human evaluators is limited, they are not robust against perturbations, and significant improvements are required before they can be used standalone as reliable evaluators for subjective metrics.
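A sketch of how such a score-plus-justification evaluation prompt might be assembled; the rubric wording and the build_eval_prompt helper are illustrative assumptions, not the paper's exact prompt, and the returned string would be sent to the LLM of choice:

def build_eval_prompt(metric, definition, source, generated):
    # Rubric-style prompt asking for a score plus a justification.
    return (
        f"You are a careful evaluator. Rate the {metric} of the response "
        f"on a scale of 1-5.\n"
        f"Definition of {metric}: {definition}\n\n"
        f"Source text:\n{source}\n\n"
        f"Generated response:\n{generated}\n\n"
        "Answer in the format:\nScore: <1-5>\nJustification: <one paragraph>"
    )

prompt = build_eval_prompt(
    "coherence", "the summary is well-structured and logically ordered",
    "<article text>", "<model summary>")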
@inproceedings{bib_Leve_2024, AUTHOR = {Rudra Dhar, Karthik Vaidhyanathan, Vasudeva Varma Kalidindi}, TITLE = {Leveraging Generative AI for Architecture Knowledge Management}, BOOKTITLE = {International Conference on Software Architecture Companion}, YEAR = {2024}}
While documenting Architectural Knowledge (AK) is crucial, it is frequently neglected in many projects, and existing manual tools are underutilized. Although undocumented, AK is dispersed across various sources such as source code, documentation, and runtime logs. To address this, automated tools for efficient AK extraction and documentation are essential. Even after generating AK, navigating through the vast Architectural Records can be overwhelming. Building on that, we propose an automated Architectural Knowledge Management (AKM) system using Information Extraction and Generative AI, which generates AK from various sources for a given system and answers architectural queries with respect to that system. The development of an AKM system which is both effective and user-friendly entails the resolution of numerous challenges. It requires consolidating diverse AK data sources scattered across code, diagrams, repository commits, and online platforms. The integration of multimodal AI for AK extraction, the incorporation of global AK, and the use of Generative AI for AK documentation further compound the problem. Moreover, generating contextually appropriate query responses adds another layer of complexity. To this end, we performed an initial exploratory study on generating Architectural Design Decisions using generative Large Language Models (LLMs) in the context of Architecture Decision Records (ADRs). Our initial results have been promising, indicating the potential impact of GenAI for architectural knowledge management.
Can LLMs Generate Architectural Design Decisions? - An Exploratory Empirical study
@inproceedings{bib_Can__2024, AUTHOR = {Rudra Dhar, Karthik Vaidhyanathan, Vasudeva Varma Kalidindi}, TITLE = {Can LLMs Generate Architectural Design Decisions? - An Exploratory Empirical study}, BOOKTITLE = {IEEE International Conference on Software Architecture Companion}, YEAR = {2024}}
Architectural Knowledge Management (AKM) involves the organized handling of information related to architectural decisions and design within a project or organization. An essential artefact of AKM is the Architecture Decision Record (ADR), which documents key design decisions. ADRs are documents that capture the decision context, the decision made and various aspects related to a design decision, thereby promoting transparency, collaboration, and understanding. Despite their benefits, ADR adoption in software development has been slow due to challenges like time constraints and inconsistent uptake. Recent advancements in Large Language Models (LLMs) may help bridge this adoption gap by facilitating ADR generation. However, the effectiveness of LLMs for ADR generation or understanding has not been explored. To this end, in this work, we perform an exploratory study which aims to investigate the feasibility of using LLMs for the generation of ADRs given the decision context. In our exploratory study, we utilize GPT and T5-based models with 0-shot, few-shot, and fine-tuning approaches to generate the Decision of an ADR given its Context. Our results indicate that in a 0-shot setting, state-of-the-art models such as GPT-4 generate relevant and accurate Design Decisions, although they fall short of human-level performance. Additionally, we observe that more cost-effective models like GPT-3.5 can achieve similar outcomes in a few-shot setting, and smaller models such as Flan-T5 can yield comparable results after fine-tuning. To conclude, this exploratory study suggests that LLMs can generate Design Decisions, but further research is required to attain human-level generation and establish standardized widespread adoption.
iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers
@inproceedings{bib_iREL_2024, AUTHOR = {Harshit Gupta, Manav Chaudhary, Tathagata Raha, S Shivansh, Vasudeva Varma Kalidindi}, TITLE = {iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2024}}
This paper describes our approach for SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense. The BRAINTEASER task comprises multiple-choice Question Answering designed to evaluate the models' lateral thinking capabilities. It consists of Sentence Puzzle and Word Puzzle subtasks that require models to defy default commonsense associations and exhibit unconventional thinking. We propose a unique strategy to improve the performance of pre-trained language models, notably the Gemini 1.0 Pro Model, in both subtasks. We employ static and dynamic few-shot prompting techniques and introduce a model-generated reasoning strategy that utilizes the LLM's reasoning capabilities to improve performance. Our approach demonstrated significant improvements: it performed better than the baseline models by a considerable margin but fell short of performing as well as the human annotators, thus highlighting the efficacy of the proposed strategies. We have open-sourced our code for the replicability of our methods.
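A minimal sketch of dynamic few-shot prompting as described above, where exemplars are retrieved per query by similarity instead of being fixed; the TF-IDF retrieval and prompt layout are assumptions, not the team's exact method:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def dynamic_few_shot(query, pool, k=3):
    # Pick the k solved puzzles most similar to the query as in-context shots.
    texts = [q for q, _ in pool] + [query]
    tfidf = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(tfidf[-1], tfidf[:-1]).ravel()
    shots = [pool[i] for i in sims.argsort()[::-1][:k]]
    demos = "\n\n".join(f"Puzzle: {q}\nAnswer: {a}" for q, a in shots)
    return f"{demos}\n\nPuzzle: {query}\nAnswer:"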
Generating entity embeddings for populating Wikipedia Knowledge Graph by Notability detection
@inproceedings{bib_Gene_2024, AUTHOR = {Thota Gokul Vamsi, Vasudeva Varma Kalidindi}, TITLE = {Generating entity embeddings for populating Wikipedia Knowledge Graph by Notability detection}, BOOKTITLE = {International Conference on Natural Language to Databases}, YEAR = {2024}}
Knowledge graphs (KGs) have been playing a crucial role in leveraging information on the web for several downstream tasks, making it vital to construct and maintain them. Despite previous efforts in populating KGs, these methods typically do not focus on analyzing entity-specific content exclusively but rely on a fixed collection of documents. We define an approach to populate such KGs by utilizing entity-specific content on the web, generating entity embeddings to establish entity-category interconnections. We empirically prove our approach's effectiveness by utilizing it for the downstream task of Notability detection, associated with one of the most popular and important knowledge graphs, the Wikipedia platform. To moderate the content uploaded to Wikipedia, "Notability" guidelines are defined by its editors to identify named entities that warrant their own article on Wikipedia. So far, notability has been enforced by humans, which makes scalability an issue, and there has been no significant work on automating this process. In this paper, we define a multipronged category-agnostic approach based on web-based entity features and their text-based salience encodings, to construct entity embeddings for determining an entity's notability. We distinguish entities based on their categories and utilize neural networks to perform classification. For validation, we utilize accuracy and prediction confidence on popular Wikipedia pages. Our system outperforms machine learning-based classifier approaches and handcrafted entity salience detection algorithms, achieving an accuracy of around 88%. Our system provides an efficient and scalable alternative to manual decision-making about the importance of a topic, which could be extended to other such KG-based tasks.
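A minimal sketch of the classification stage, assuming entity embeddings have already been built from web-based features and salience encodings; the PyTorch layer sizes and names are illustrative:

import torch
import torch.nn as nn

class NotabilityClassifier(nn.Module):
    # Small feed-forward head over a precomputed entity embedding.
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, 2))

    def forward(self, x):
        return self.net(x)

model = NotabilityClassifier()
entity_emb = torch.randn(4, 768)       # batch of entity embeddings
probs = model(entity_emb).softmax(-1)  # [P(non-notable), P(notable)] per entity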
A Category-agnostic Graph Attention-based approach for determining Notability of complex article titles for Wikipedia
@inproceedings{bib_A_Ca_2024, AUTHOR = {Thota Gokul Vamsi, Vasudeva Varma Kalidindi}, TITLE = {A Category-agnostic Graph Attention-based approach for determining Notability of complex article titles for Wikipedia}, BOOKTITLE = {Companion Proceedings of the ACM Web Conference}, YEAR = {2024}}
Wikipedia is a highly essential platform because of its informative, dynamic, and easily accessible nature. To identify topics/titles warranting their own Wikipedia article, editors of Wikipedia defined "Notability" guidelines. So far, notability has been enforced by humans, which makes scalability an issue. There has been no significant work on Notability determination for titles with complex category dependencies. We design a mechanism to identify such titles. We construct a dataset with 9k such titles and propose a category-agnostic approach utilizing Graph Neural Networks for their notability determination. Our system outperforms machine learning-based and transformer-based classifiers as well as entity salience methods. It provides a scalable alternative to manual decision-making about notability.
@inproceedings{bib_Zero_2023, AUTHOR = {Nitin Ramrakhiyani, Vasudeva Varma Kalidindi, Girish Keshav Palshikar, Sachin Pawar}, TITLE = {Zero-shot Probing of Pretrained Language Models for Geography Knowledge}, BOOKTITLE = {Workshop on Evaluation and Comparison of NLP Systems}, YEAR = {2023}}
Gauging the knowledge of Pretrained Language Models (PLMs) about facts in niche domains is an important step towards making them better in those domains. In this paper, we aim at evaluating multiple PLMs for their knowledge about world Geography. We contribute (i) a sufficiently sized dataset of masked Geography sentences to probe PLMs on masked token prediction and generation tasks, and (ii) a benchmark of the performance of multiple PLMs on the dataset. We also provide a detailed analysis of the performance of the PLMs on different Geography facts.
XOutlineGen: Cross-lingual Outline Generation for Encyclopedic Text in Low Resource Languages
S Shivansh, Dhaval Taunk, Manish Gupta, Vasudeva Varma Kalidindi
Wiki Workshop, Wiki-W, 2023
@inproceedings{bib_XOut_2023, AUTHOR = {S Shivansh, Dhaval Taunk, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {XOutlineGen: Cross-lingual Outline Generation for Encyclopedic Text in Low Resource Languages}, BOOKTITLE = {Wiki Workshop}, YEAR = {2023}}
One crucial aspect of content organization is the creation of article outlines, which summarize the primary topics and subtopics covered in an article in a structured manner. This paper introduces a solution called XOutlineGen, which generates cross-lingual outlines for encyclopedic texts from reference articles. XOutlineGen uses the XWikiRef dataset, which consists of encyclopedic texts generated from reference articles and section titles. The dataset is enhanced with two new languages and three new domains, resulting in ∼92K articles. Our pipeline employs this dataset to train a two-step generation model, which takes the article title and set of references as inputs and produces the article outline.
XFLT: Exploring Techniques for Generating Cross Lingual Factually Grounded Long Text
Bhavyajeet Singh, Kancharla Aditya Hari, Rahul Mehta, Tushar Abhishek, Manish Gupta, Vasudeva Varma Kalidindi
European Conference on Artificial Intelligence, ECAI, 2023
@inproceedings{bib_XFLT_2023, AUTHOR = {Bhavyajeet Singh, Kancharla Aditya Hari, Rahul Mehta, Tushar Abhishek, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {XFLT: Exploring Techniques for Generating Cross Lingual Factually Grounded Long Text}, BOOKTITLE = {European Conference on Artificial Intelligence}, YEAR = {2023}}
Multiple business scenarios require an automated generation of descriptive human-readable long text from structured input data, where the source is typically a high-resource language and the target is a low or medium resource language. We define Cross-Lingual Fact to Long Text Generation (XFLT) as a novel natural language generation (NLG) task that involves generating descriptive and human-readable long text in a target language from structured input data (such as fact triples) in a source language. XFLT is challenging because of (a) the hallucinatory nature of state-of-the-art NLG models, (b) lack of good quality training data, and (c) lack of a suitable cross-lingual NLG metric. Unfortunately, previous work focuses on different related problem settings (cross-lingual facts to short text or monolingual graph to text) and has made no effort to handle hallucinations. In this paper, we contribute a novel dataset, XLALIGN, with over 64,000 paragraphs across 12 different languages, and English facts. We propose a novel solution to the XFLT task which addresses these challenges by training multilingual Transformer-based encoder-decoder models with coverage prompts and grounded decoding. Further, it improves on the XFLT quality by defining task-specific reward functions and training on them using reinforcement learning. On XLALIGN, we compare this novel solution with several strong baselines using a new metric, cross-lingual PARENT. We also make our code and data publicly available.
A Web-centric entity-salience based system for determining Notability of entities for Wikipedia
Thota Gokul Vamsi, Rahul Khandelwal, Vasudeva Varma Kalidindi
@inproceedings{bib_A_We_2023, AUTHOR = {Thota Gokul Vamsi, Rahul Khandelwal, Vasudeva Varma Kalidindi}, TITLE = {A Web-centric entity-salience based system for determining Notability of entities for Wikipedia}, BOOKTITLE = {Others}, YEAR = {2023}}
Multilingual Bias Detection and Mitigation for Indian Languages
Ankita Maity, Anubhav Sharma, Rudra Dhar, Tushar Abhishek, Manish Gupta, Vasudeva Varma Kalidindi
Technical Report, arXiv, 2023
@inproceedings{bib_Mult_2023, AUTHOR = {Ankita Maity, Anubhav Sharma, Rudra Dhar, Tushar Abhishek, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Multilingual Bias Detection and Mitigation for Indian Languages}, BOOKTITLE = {Technical Report}, YEAR = {2023}}
Lack of diverse perspectives causes neutrality bias in Wikipedia content, leading to millions of readers worldwide being exposed to potentially inaccurate information. Hence, neutrality bias detection and mitigation is a critical problem. Although previous studies have proposed effective solutions for English, no work exists for Indian languages. First, we contribute two large datasets, mWIKIBIAS and mWNC, covering 8 languages, for the bias detection and mitigation tasks respectively. Next, we investigate the effectiveness of popular multilingual Transformer-based models for the two tasks by modeling detection as a binary classification problem and mitigation as a style transfer problem. We make the code and data publicly available.
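A minimal sketch of the detection formulation described above, framing bias detection as binary classification with a multilingual Transformer; the mBERT checkpoint and label order are assumptions, and the model would still need fine-tuning on a dataset such as mWIKIBIAS:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-multilingual-cased"  # example multilingual encoder
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

batch = tok(["The city is arguably the most beautiful in India."],
            return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.softmax(-1))  # assumed label order: [neutral, biased]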
Multilingual Bias Detection and Mitigation for Low Resource Languages
Anubhav Sharma, Ankita Maity, Tushar Abhishek, Rudra Dhar, Radhika Mamidi, Manish Gupta, Vasudeva Varma Kalidindi
Wiki Workshop, Wiki-W, 2023
@inproceedings{bib_Mult_2023, AUTHOR = {Anubhav Sharma, Ankita Maity, Tushar Abhishek, Rudra Dhar, Radhika Mamidi, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Multilingual Bias Detection and Mitigation for Low Resource Languages}, BOOKTITLE = {Wiki Workshop}, YEAR = {2023}}
Subjective bias in Wikipedia textual data is a significant problem and affects millions of readers worldwide. Though some monolingual work has been done on classifying and debiasing biased text in resource-rich languages, the low-resource languages with large numbers of speakers remain unattended. We present an approach for the dual problems of multilingual bias detection and its mitigation, with a thorough analysis. In this work, we establish competitive baselines with our preliminary approach, which includes classification-based modelling for bias detection on a multilingual dataset curated from existing monolingual sources. For the problem of bias mitigation, we follow the style transfer paradigm and model it using transformer-based seq2seq architectures. We also discuss several approaches for further improvement in both problems as part of our ongoing work.
Cross-Lingual Fact Checking: Automated Extraction and Verification of Information from Wikipedia using References
S Shivansh, Ankita Maity, Aakash Jain, Bhavyajeet Singh, Harshit Gupta, Lakshya Khanna, Vasudeva Varma Kalidindi
International Conference on Natural Language Processing, ICON, 2023
@inproceedings{bib_Cros_2023, AUTHOR = {S Shivansh, Ankita Maity, Aakash Jain, Bhavyajeet Singh, Harshit Gupta, Lakshya Khanna, Vasudeva Varma Kalidindi}, TITLE = {Cross-Lingual Fact Checking: Automated Extraction and Verification of Information from Wikipedia using References}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2023}}
The paper presents a novel approach for automated cross-lingual fact-checking that extracts and verifies information from Wikipedia using references. The problem involves determining whether a factoid in an article is supported or needs additional citations based on the provided references, with granularity at the fact level. We introduce a cross-lingual manually annotated dataset for fact extraction and verification and an entirely automated pipeline for the task. The proposed solution operates entirely in a cross-lingual setting, where the article text and the references can be in any language. The pipeline integrates several natural language processing techniques to extract the relevant facts from the input sources. The extracted facts are then verified against the references, leveraging the semantic relationships between the facts and the reference sources. Experimental evaluation on a large-scale dataset demonstrates the effectiveness and efficiency of the proposed approach in handling cross-lingual fact-checking tasks. We make our code and data publicly available.
Generative Models For Indic Languages: Evaluating Content Generation Capabilities
Savita Bhat, Vasudeva Varma Kalidindi, Niranjan Pedaneka
Recent Advances in Natural Language Processing, RANLP, 2023
@inproceedings{bib_Gene_2023, AUTHOR = {Savita Bhat, Vasudeva Varma Kalidindi, Niranjan Pedaneka}, TITLE = {Generative Models For Indic Languages: Evaluating Content Generation Capabilities}, BOOKTITLE = {Recent Advances in Natural Language Processing}, YEAR = {2023}}
Large language models (LLMs) and generative AI have emerged as the most important areas in the field of natural language processing (NLP). LLMs are considered to be a key component in several NLP tasks, such as summarization, question-answering, sentiment classification, and translation. Newer LLMs, such as ChatGPT, BLOOMZ, and several such variants, are known to train on multilingual training data and hence are expected to process and generate text in multiple languages. Considering the widespread use of LLMs, evaluating their efficacy in multilingual settings is imperative. In this work, we evaluate the newest generative models (ChatGPT, mT0, and BLOOMZ) in the context of Indic languages. Specifically, we consider natural language generation (NLG) applications such as summarization and question-answering in monolingual and cross-lingual settings. We observe that current generative models have limited capability for generating text in Indic languages in a zero-shot setting. In contrast, generative models perform consistently better on manual quality-based evaluation in Indic languages and English language generation. Considering the limited generation performance, we argue that these LLMs are not yet suited to zero-shot use in downstream applications.
Multi-task learning neural framework for categorizing sexism
HARIKA ABBURI, PARIKH PULKIT TRUSHANT KUMAR, Niyati Chhaya, Vasudeva Varma Kalidindi
Computer Speech & Language, CS&L, 2023
@inproceedings{bib_Mult_2023, AUTHOR = {HARIKA ABBURI, PARIKH PULKIT TRUSHANT KUMAR, Niyati Chhaya, Vasudeva Varma Kalidindi}, TITLE = {Multi-task learning neural framework for categorizing sexism}, BOOKTITLE = {Computer Speech & Language}, YEAR = {2023}}
Sexism, a form of oppression based on one's sex, manifests itself in numerous ways and causes enormous suffering. In view of the growing number of experiences of sexism reported online, automatically classifying these recollections can aid in the battle against sexism by allowing gender studies researchers and government officials involved in policymaking to conduct more effective analyses. This paper investigates the 23-class fine-grained, multi-label classification of accounts (reports) of sexism.
Francis Wilde at SemEval-2023 Task 5: Clickbait Spoiler Type Identification with Transformers
I Vijayasaradhi, Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2023
@inproceedings{bib_Fran_2023, AUTHOR = {I Vijayasaradhi, Vasudeva Varma Kalidindi}, TITLE = {Francis Wilde at SemEval-2023 Task 5: Clickbait Spoiler Type Identification with Transformers}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2023}}
Clickbait is the text or a thumbnail image that entices the user to click the accompanying link. Clickbaits employ strategies that deliberately hide the critical elements of the article and reveal partial information in the title, which arouses sufficient curiosity and motivates the user to click the link. In this work, we identify the kind of spoiler given a clickbait title. We formulate this as a text classification problem. We finetune pretrained transformer models on the title of the post and build models for the clickbait-spoiler classification. We achieve a balanced accuracy of 0.70, which is close to the baseline.
Billy-Batson at SemEval-2023 Task 5: An Information Condensation based System for Clickbait Spoiling
Anubhav Sharma, Sagar Sandeep Joshi, Tushar Abhishek, Radhika Mamidi, Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2023
@inproceedings{bib_Bill_2023, AUTHOR = {Anubhav Sharma, Sagar Sandeep Joshi, Tushar Abhishek, Radhika Mamidi, Vasudeva Varma Kalidindi}, TITLE = {Billy-Batson at SemEval-2023 Task 5: An Information Condensation based System for Clickbait Spoiling}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2023}}
The Clickbait Challenge targets spoiling clickbaits using short pieces of information, known as spoilers, to satisfy the curiosity induced by a clickbait post. The large context of the article associated with the clickbait and the differences in spoiler forms make the task challenging. Hence, to tackle the large context, we propose an Information Condensation-based approach, which prunes down the unnecessary context. Given an article, our filtering module, optimised with a contrastive learning objective, first selects the paragraphs that are the most relevant to the corresponding clickbait. The resulting condensed article is then fed to the two downstream tasks of spoiler type classification and spoiler generation. We demonstrate and analyze the gains from this approach on both tasks. Overall, we win the task of spoiler type classification and achieve competitive results on spoiler generation.
IREL at SemEval-2023 Task 11: User Conditioned Modelling for Toxicity Detection in Subjective Tasks
Ankita Maity, Kandru Siri Venkata Pavan Kumar, Bhavyajeet Singh, Kancharla Aditya Hari, Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2023
@inproceedings{bib_IREL_2023, AUTHOR = {Ankita Maity, Kandru Siri Venkata Pavan Kumar, Bhavyajeet Singh, Kancharla Aditya Hari, Vasudeva Varma Kalidindi}, TITLE = {IREL at SemEval-2023 Task 11: User Conditioned Modelling for Toxicity Detection in Subjective Tasks}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2023}}
This paper describes our system used in the SemEval-2023 Task 11 Learning With Disagreements (Le-Wi-Di). This is a subjective task since it deals with detecting hate speech, misogyny and offensive language. Thus, disagreement among annotators is expected. We experiment with different settings like loss functions specific to subjective tasks and include anonymized annotator-specific information to help us understand the level of disagreement. We perform an in-depth analysis of the performance discrepancy of these different modelling choices. Our system achieves a cross-entropy of 0.58, 4.01 and 3.70 on the test sets of HS-Brexit, ArMIS and MD-Agreement, respectively. Our code implementation is publicly available.
Tenzin-Gyatso at SemEval-2023 Task 4: Identifying Human Values behind Arguments using DeBERTa
Kandru Siri Venkata Pavan Kumar, Bhavyajeet Singh, Ankita Maity, Kancharla Aditya Hari, Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2023
@inproceedings{bib_Tenz_2023, AUTHOR = {Kandru Siri Venkata Pavan Kumar, Bhavyajeet Singh, Ankita Maity, Kancharla Aditya Hari, Vasudeva Varma Kalidindi}, TITLE = {Tenzin-Gyatso at SemEval-2023 Task 4: Identifying Human Values behind Arguments using DeBERTa}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2023}}
Identifying human values behind arguments is a complex task which requires understanding the premise, stance and conclusion together. We propose a method that uses a pre-trained language model, DeBERTa, to tokenize and concatenate the text before feeding it into a fully connected neural network. We also show that leveraging the hierarchy in values improves the performance by 0.14 F1 score compared to only using level 2 values. Our code is made publicly available.
iREL at SemEval-2023 Task 10: Multi-level Training for Explainable Detection of Online Sexism
Nirmal Manoj C, Sagar Sandeep Joshi, Ankita Maity, Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2023
@inproceedings{bib_iREL_2023, AUTHOR = {Nirmal Manoj C, Sagar Sandeep Joshi, Ankita Maity, Vasudeva Varma Kalidindi}, TITLE = {iREL at SemEval-2023 Task 10: Multi-level Training for Explainable Detection of Online Sexism}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2023}}
This paper describes our approach for SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS). The task deals with the identification and categorization of sexist content into fine-grained categories for explainability in sexism classification. The explainable categorization is proposed through a set of three hierarchical tasks that constitute a taxonomy of sexist content, each task being more granular than the former. Our team (iREL) participated in all three hierarchical subtasks. Considering the inter-connected task structure, we study multilevel training to examine the transfer learning from coarser to finer tasks. Our experiments based on pretrained transformer architectures also make use of additional strategies such as domain-adaptive pre-training to adapt our models to the nature of the content dealt with, and use of the focal loss objective for handling class imbalances. Our best-performing systems on the three tasks achieve macro-F1 scores of 85.93, 69.96 and 54.62 on their respective validation sets.
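The focal loss objective mentioned above down-weights easy examples so that training focuses on hard, minority-class ones; a common PyTorch formulation (the gamma and alpha values are conventional defaults, not necessarily the team's settings):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)
    ce = F.cross_entropy(logits, targets, reduction="none")  # -log(p_t)
    p_t = torch.exp(-ce)  # probability assigned to the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()

logits = torch.randn(8, 2)             # e.g. sexist / not-sexist logits
targets = torch.randint(0, 2, (8,))
print(focal_loss(logits, targets))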
Summarizing Indian Languages using Multilingual Transformers based Models
Dhaval Taunk, Vasudeva Varma Kalidindi
Technical Report, arXiv, 2023
@inproceedings{bib_Summ_2023, AUTHOR = {Dhaval Taunk, Vasudeva Varma Kalidindi}, TITLE = {Summarizing Indian Languages using Multilingual Transformers based Models}, BOOKTITLE = {Technical Report}, YEAR = {2023}}
With the advent of multilingual models like mBART, mT5, IndicBART etc., summarization in low-resource Indian languages is getting a lot of attention nowadays. However, the number of available datasets remains low. In this work, we (Team HakunaMatata) study how these multilingual models perform on datasets that have Indian languages as the source and target text while performing summarization. We experiment with IndicBART and mT5 models and report the ROUGE-1, ROUGE-2, ROUGE-3 and ROUGE-4 scores as performance metrics.
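The reported ROUGE-n scores can be computed with the rouge-score package as below; the summary pair is a placeholder, and for Indic scripts the default tokenization may need adaptation:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rouge3", "rouge4"], use_stemmer=False)
scores = scorer.score("reference summary text here",
                      "model generated summary text here")
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F={s.fmeasure:.3f}")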
XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages
Dhaval Taunk, Sagare Shivprasad Rajendra, Anupam Patil, S Shivansh, Manish Gupta, Vasudeva Varma Kalidindi
International Conference on World Wide Web, WWW, 2023
@inproceedings{bib_XWik_2023, AUTHOR = {Dhaval Taunk, Sagare Shivprasad Rajendra, Anupam Patil, S Shivansh, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages}, BOOKTITLE = {International Conference on World Wide Web}, YEAR = {2023}}
Lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for low resource (LR) languages a critical problem. Existing work on Wikipedia text generation has focused on English only, where English reference articles are summarized to generate English Wikipedia pages. But, for low-resource languages, the scarcity of reference articles makes monolingual summarization ineffective in solving this problem. Hence, in this work, we propose XWikiGen, which is the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset, XWikiRef, spanning ∼69K Wikipedia articles covering five domains and eight languages. We harness this dataset to train a two-stage system where the input is a set of citations and a section title and the output is a section-specific LR summary. The proposed system is based on a novel idea of neural unsupervised extractive summarization to coarsely identify salient information, followed by a neural abstractive model to generate the section-specific text. Extensive experiments show that multi-domain training is better than the multi-lingual setup on average. We make our code and dataset publicly available.
LLM-RM at SemEval-2023 Task 2: Multilingual Complex NER using XLM-RoBERTa
Rahul Mehta, Vasudeva Varma Kalidindi
Technical Report, arXiv, 2023
@inproceedings{bib_LLM-_2023, AUTHOR = {Rahul Mehta, Vasudeva Varma Kalidindi}, TITLE = {LLM-RM at SemEval-2023 Task 2: Multilingual Complex NER using XLM-RoBERTa}, BOOKTITLE = {Technical Report}, YEAR = {2023}}
Named Entity Recognition (NER) is the task of recognizing entities at a token level in a sentence. This paper focuses on solving NER tasks in a multilingual setting for complex named entities. Our team, LLM-RM, participated in the recently organized SemEval 2023 task, Task 2: MultiCoNER II, Multilingual Complex Named Entity Recognition. We approach the problem by leveraging cross-lingual representations provided by fine-tuning the XLM-RoBERTa base model on datasets of all of the 12 languages provided: Bangla, Chinese, English, Farsi, French, German, Hindi, Italian, Portuguese, Spanish, Swedish and Ukrainian.
Neural models for Factual Inconsistency Classification with Explanations
Tathagata Raha, Mukund Choudhary, Abhinav S Menon, Harshit Gupta, K V Aditya Srivatsa, Manish Gupta, Vasudeva Varma Kalidindi
Technical Report, arXiv, 2023
@inproceedings{bib_Neur_2023, AUTHOR = {Tathagata Raha, Mukund Choudhary, Abhinav S Menon, Harshit Gupta, K V Aditya Srivatsa, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Neural models for Factual Inconsistency Classification with Explanations}, BOOKTITLE = {Technical Report}, YEAR = {2023}}
Factual consistency is one of the most important requirements when editing high quality documents. It is extremely important for automatic text generation systems like summarization, question answering, dialog modeling, and language modeling. Still, automated factual inconsistency detection is rather under-studied. Existing work has focused on (a) finding fake news keeping a knowledge base in context, or (b) detecting broad contradiction (as part of natural language inference literature). However, there has been no work on detecting and explaining types of factual inconsistencies in text, without any knowledge base in context. In this paper, we leverage existing work in linguistics to formally define five types of factual inconsistencies. Based on this categorization, we contribute a novel dataset, FICLE (Factual Inconsistency CLassification with Explanation), with ~8K samples where each sample consists of two sentences (claim and context) annotated with type and span of inconsistency. When the inconsistency relates to an entity type, it is labeled as well at two levels (coarse and fine-grained). Further, we leverage this dataset to train a pipeline of four neural models to predict inconsistency type with explanations, given a (claim, context) sentence pair. Explanations include inconsistent claim fact triple, inconsistent context span, inconsistent claim component, coarse and fine-grained inconsistent entity types. The proposed system first predicts inconsistent spans from claim and context; and then uses them to predict inconsistency types and inconsistent entity types (when inconsistency is due to entities). We experiment with multiple Transformer …
iREL at SemEval-2023 Task 9: Improving understanding of multilingual Tweets using Translation-Based Augmentation and Domain Adapted Pre-Trained Models
Bhavyajeet Singh, Ankita Maity, Kandru Siri Venkata Pavan Kumar, Kancharla Aditya Hari, Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2023
@inproceedings{bib_iREL_2023, AUTHOR = {Bhavyajeet Singh, Ankita Maity, Kandru Siri Venkata Pavan Kumar, Kancharla Aditya Hari, Vasudeva Varma Kalidindi}, TITLE = {iREL at SemEval-2023 Task 9: Improving understanding of multilingual Tweets using Translation-Based Augmentation and Domain Adapted Pre-Trained Models}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2023}}
This paper describes our system (iREL) for the Tweet intimacy analysis shared task of the SemEval 2023 workshop at ACL 2023. Our system achieved an overall Pearson's r score of 0.5924 and ranked 10th on the overall leaderboard. For the unseen languages, we ranked third on the leaderboard and achieved a Pearson's r score of 0.485. We used a single multilingual model for all languages, as discussed in this paper. We provide a detailed description of our pipeline along with multiple ablation experiments to further analyse each component of the pipeline. We demonstrate how translation-based augmentation, domain-specific features, and domain-adapted pretrained models improve the understanding of intimacy in tweets. The code can be found at https://github.com/bhavyajeet/Multilingualtweet-intimacy
Cross-lingual Multi-Sentence Fact-to-Text Generation: Generating factually grounded Wikipedia Articles using Wikidata
Bhavyajeet Singh, Kancharla Aditya Hari, Rahul Mehta, Tushar Abhishek, Manish Gupta, Vasudeva Varma Kalidindi
Wiki Workshop, Wiki-W, 2023
@inproceedings{bib_Cros_2023, AUTHOR = {Bhavyajeet Singh, Kancharla Aditya Hari, Rahul Mehta, Tushar Abhishek, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Cross-lingual Multi-Sentence Fact-to-Text Generation: Generating factually grounded Wikipedia Articles using Wikidata}, BOOKTITLE = {Wiki Workshop}, YEAR = {2023}}
Fact-to-text generation can allow for the generation of high-quality, informative texts such as Wikipedia articles. Cross-lingual fact-to-text generation (XF2T) involves using facts available in a language, typically English, and generating texts in a different language based on these facts. This is particularly relevant for low and medium-resource languages, which have relatively structured informative content. This work explores the problem of XF2T for generating long text from given facts with a specific focus on generating factually grounded content. Unfortunately, previous work either focuses on cross-lingual facts to short text or monolingual graph to text generation. In this paper, we propose a novel solution to the multi-sentence XF2T task, which addresses these challenges by training multilingual Transformer-based models with coverage prompts and rebalanced beam search, and further improving the quality by defining task-specific reward functions and training on them using reinforcement learning. Keywords: XF2T, text generation, cross-lingual, NLG evaluation, low resource NLG
Extracting Orientation Relations between Geo-Political Entities from their Wikipedia Text
Nitin Ramrakhiyani, Vasudeva Varma Kalidindi, Girish Keshav Palshikar
CEUR Workshop Proceedings, CEUR, 2023
@inproceedings{bib_Extr_2023, AUTHOR = {Nitin Ramrakhiyani, Vasudeva Varma Kalidindi, Girish Keshav Palshikar}, TITLE = {Extracting Orientation Relations between Geo-Political Entities from their Wikipedia Text}, BOOKTITLE = {CEUR Workshop Proceedings}, YEAR = {2023}}
Augmenting Wikidata with spatial relations specific to Geography can be useful for increasing its utility in multiple applications. In this paper, we aim to extract orientation of borders between countries in the world, from their Wikipedia text and suggest its use to augment the shares_borders_with relation in Wikidata. We propose the use of Natural Language Inference (NLI) for extracting the orientation relations from text and show that when combined with contextual lexical patterns, the performance becomes better than the standard NLI setting.
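A minimal sketch of the NLI idea, treating each candidate orientation as a hypothesis scored against the Wikipedia sentence with an off-the-shelf NLI model through the zero-shot-classification pipeline; the model choice and hypothesis template are assumptions, not the paper's setup:

from transformers import pipeline

nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sentence = ("Nepal is bordered by China to the north and by India "
            "to the south, east, and west.")
result = nli(sentence,
             candidate_labels=["north", "south", "east", "west"],
             hypothesis_template="India lies to the {} of Nepal.")
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.3f}")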
GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering
Dhaval Taunk, Lakshya Khanna, Kandru Siri Venkata Pavan Kumar, Vasudeva Varma Kalidindi, Charu Sharma, Makarand Tapaswi
WWW Workshop on Natural Language Processing for Knowledge Graph Construction, NLP4KGc, 2023
@inproceedings{bib_Grap_2023, AUTHOR = {Dhaval Taunk, Lakshya Khanna, Kandru Siri Venkata Pavan Kumar, Vasudeva Varma Kalidindi, Charu Sharma, Makarand Tapaswi}, TITLE = {GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering}, BOOKTITLE = {WWW Workshop on Natural Language Processing for Knowledge Graph Construction}, YEAR = {2023}}
Commonsense question-answering (QA) methods combine the power of pre-trained Language Models (LM) with the reasoning provided by Knowledge Graphs (KG). A typical approach collects nodes relevant to the QA pair from a KG to form a Working Graph (WG), followed by reasoning using Graph Neural Networks (GNNs). This faces two major challenges: (i) it is difficult to capture all the information from the QA pair in the WG, and (ii) the WG contains some irrelevant nodes from the KG. To address these, we propose GrapeQA with two simple improvements on the WG: (i) Prominent Entities for Graph Augmentation identifies relevant text chunks from the QA pair and augments the WG with corresponding latent representations from the LM, and (ii) Context-Aware Node Pruning removes nodes that are less relevant to the QA pair. We evaluate our results on OpenBookQA, CommonsenseQA and MedQA-USMLE and see that GrapeQA shows consistent improvements over its LM + KG predecessor (QA-GNN in particular) and large improvements on OpenBookQA.
Graph-based Keyword Planning for Legal Clause Generation from Topics
Aparna Garimella, Vasudeva Varma Kalidindi
Natural Legal Language Processing Workshop, NLLP-W, 2022
@inproceedings{bib_Grap_2022, AUTHOR = {Aparna Garimella, Vasudeva Varma Kalidindi}, TITLE = {Graph-based Keyword Planning for Legal Clause Generation from Topics}, BOOKTITLE = {Natural Legal Language Processing Workshop}, YEAR = {2022}}
Generating domain-specific content such as legal clauses based on minimal user-provided information can be of significant benefit in automating legal contract generation. In this paper, we propose a controllable graph-based mechanism that can generate legal clauses using only the topic or type of the legal clauses. Our pipeline consists of two stages: a graph-based planner followed by a clause generator. The planner outlines the content of a legal clause as a sequence of keywords, ordered from generic to more specific clause information based on the input topic, using a controllable graph-based mechanism. The generation stage takes in a given plan and generates a clause. We illustrate the effectiveness of our proposed two-stage approach on a broad set of clause topics in contracts.
Efficacy of Pretrained Architectures for Code Comment Usefulness Prediction
Sagar Sandeep Joshi, Sumanth Balaji, Aditya Harikrish, Abhijeeth Reddy Singam, Vasudeva Varma Kalidindi
Forum for Information Retrieval Evaluation, FIRE, 2022
@inproceedings{bib_Effi_2022, AUTHOR = {Sagar Sandeep Joshi, Sumanth Balaji, Aditya Harikrish, Abhijeeth Reddy Singam, Vasudeva Varma Kalidindi}, TITLE = {Efficacy of Pretrained Architectures for Code Comment Usefulness Prediction}, BOOKTITLE = {Forum for Information Retrieval Evaluation}, YEAR = {2022}}
Source code is usually accompanied by comments which help improve code comprehension. However, not all comments are helpful in this respect: some are redundant and some are unclear, resulting in poorer code readability. Since sifting through large volumes of code to identify such comments manually is tedious, automatically evaluating the usefulness of a comment in the context of the code it accompanies can add value. In this work, we evaluate the performance of various pretrained transformer encoders on this task. We demonstrate decent performance of a few models alongside abnormally high performance obtained by a few other pretrained architectures.
Massively Multilingual Language Models for Cross Lingual Fact Extraction from Low Resource Indian Languages
Bhavyajeet Singh, Kandru Siri Venkata Pavan Kumar, Anubhav Sharma, Vasudeva Varma Kalidindi
International Conference on Natural Language Processing, ICON, 2022
@inproceedings{bib_Mass_2022, AUTHOR = {Bhavyajeet Singh, Kandru Siri Venkata Pavan Kumar, Anubhav Sharma, Vasudeva Varma Kalidindi}, TITLE = {Massively Multilingual Language Models for Cross Lingual Fact Extraction from Low Resource Indian Languages}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2022}}
Massive knowledge graphs like Wikidata attempt to capture world knowledge about multiple entities. Recent approaches concentrate on automatically enriching these KGs from text. However, a lot of information present in the form of natural text in low resource languages is often missed out. Cross Lingual Information Extraction aims at extracting factual information in the form of English triples from low resource Indian language text. Despite its massive potential, progress made on this task lags behind Monolingual Information Extraction. In this paper, we propose the task of Cross Lingual Fact Extraction (CLFE) from text and devise an end-to-end generative approach for the same, which achieves an overall F1 score of 77.46.
Fact Aware Multi-task Learning for Text Coherence Modeling
Tushar Abhishek, Daksh Rawat, Manish Gupta, Vasudeva Varma Kalidindi
Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD, 2022
@inproceedings{bib_Fact_2022, AUTHOR = {Tushar Abhishek, Daksh Rawat, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Fact Aware Multi-task Learning for Text Coherence Modeling}, BOOKTITLE = {Pacific-Asia Conference on Knowledge Discovery and Data Mining}, YEAR = {2022}}
Coherence is an important aspect of text quality and is crucial for ensuring its readability. It is essential for outputs from text generation systems like summarization, question answering, machine translation, question generation, table-to-text, etc. An automated coherence scoring model is also helpful in essay scoring or providing writing feedback. A large body of previous work has leveraged entity-based methods, syntactic patterns, discourse relations, and traditional deep learning architectures for text coherence assessment. However, these approaches do not consider factual information present in the documents. The transitions of facts associated with entities across sentences could help capture the essence of textual coherence better. We hypothesize that coherence assessment is a cognitively complex task that requires deeper fact-aware models and can benefit from other related tasks. In this work, we propose a novel deep learning model that fuses document-level information with factual information to improve coherence modeling. We further enhance the model efficacy by training it simultaneously with the Natural Language Inference task in a multi-task learning setting, taking advantage of inductive transfer between the two tasks. Our experiments with popular benchmark datasets across multiple domains demonstrate that the proposed model achieves state-of-the-art results on a synthetic coherence evaluation task and two real-world tasks involving prediction of varying degrees of coherence.
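A minimal sketch of the multi-task setup described above: a shared encoder with a coherence-scoring head and an NLI head, optimized on a weighted sum of both losses; the stand-in encoder, feature dimensions, and loss weighting are illustrative, not the paper's architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskCoherence(nn.Module):
    # Shared encoder with a coherence-scoring head and an NLI head.
    def __init__(self, dim=768):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        self.coherence_head = nn.Linear(dim, 1)  # scalar coherence score
        self.nli_head = nn.Linear(dim, 3)        # entail/neutral/contradict

    def forward(self, feats, task):
        h = self.encoder(feats)
        return self.coherence_head(h) if task == "coherence" else self.nli_head(h)

model = MultiTaskCoherence()
doc_feats = torch.randn(4, 768)  # stand-in for fused document+fact features
nli_feats = torch.randn(4, 768)
coh_loss = F.mse_loss(model(doc_feats, "coherence").squeeze(-1), torch.rand(4))
nli_loss = F.cross_entropy(model(nli_feats, "nli"), torch.randint(0, 3, (4,)))
(coh_loss + 0.5 * nli_loss).backward()  # joint objective drives inductive transfer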
‘You Are Big, S/he Is Small’ Detecting Body Shaming in Online User Content
Redla Varsha Reddy, HARIKA ABBURI, Niyati Chhaya, Tamara Mitrovska, Vasudeva Varma Kalidindi
International Conference on Social Informatics, SocInfo, 2022
@inproceedings{bib_‘Y_2022, AUTHOR = {Redla Varsha Reddy, HARIKA ABBURI, Niyati Chhaya, Tamara Mitrovska, Vasudeva Varma Kalidindi}, TITLE = {‘You Are Big, S/he Is Small’ Detecting Body Shaming in Online User Content}, BOOKTITLE = {International Conference on Social Informatics}, YEAR = {2022}}
Body shaming, a criticism based on the body’s shape, size, or appearance, has become a dangerous act on social media. With a rise in the reporting of body shaming experiences on the web, automated monitoring of body shaming posts will help rescue individuals, especially adolescents, from the emotional anguish they experience. To the best of our knowledge, this is the first work on body shaming detection, and we contribute the dataset in which the posts are tagged as body shaming or non-body shaming. We use transformer-based language models to detect body shaming posts. Further, we leverage unlabeled data in a semi-supervised manner using the GAN-BERT model, as it was developed for tasks where labeled data is scarce and unlabeled data is abundant. The findings of the experiments reveal that the algorithm learns valuable knowledge from the unlabeled dataset and outperforms many deep learning and conventional machine learning baselines.
Profiling irony and stereotype spreaders on Twitter based on term frequency in tweets
Dhaval Taunk, Sagar Sandeep Joshi, Vasudeva Varma Kalidindi
Conference and Labs of the Evaluation Forum, CLEF, 2022
@inproceedings{bib_Prof_2022, AUTHOR = {Dhaval Taunk, Sagar Sandeep Joshi, Vasudeva Varma Kalidindi}, TITLE = {Profiling irony and stereotype spreaders on Twitter based on term frequency in tweets}, BOOKTITLE = {Conference and Labs of the Evaluation Forum}, YEAR = {2022}}
The use of stereotypes, irony, mocking and scornful language is prevalent on social media platforms such as Twitter. Identifying or profiling users who are involved in spreading such content is beneficial for monitoring it. In our work, we study the problem of profiling irony and stereotype spreaders on Twitter as part of the PAN shared task at CLEF 2022. We experiment with machine learning models applied to a TF-IDF representation of user tweets, and find Random Forest to be the best-performing one.
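A minimal sketch of the pipeline described above: one concatenated document per user, TF-IDF features, and a Random Forest classifier; the toy texts stand in for the PAN training data:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# One concatenated document per Twitter user (toy stand-ins).
users = ["oh sure, politicians ALWAYS tell the truth...",
         "lovely weather today, out for a run"]
labels = [1, 0]  # 1 = irony/stereotype spreader

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    RandomForestClassifier(n_estimators=300, random_state=0))
clf.fit(users, labels)
print(clf.predict(["yeah right, as if that ever works"]))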
Leveraging Mental Health Forums for User-level Depression Detection on Social Media
Sravani Boinepelli, Tathagata Raha, HARIKA ABBURI, PARIKH PULKIT TRUSHANT KUMAR, Niyati Chhaya, Vasudeva Varma Kalidindi
International Conference on Language Resources and Evaluation, LREC, 2022
@inproceedings{bib_Leve_2022, AUTHOR = {Sravani Boinepelli, Tathagata Raha, HARIKA ABBURI, PARIKH PULKIT TRUSHANT KUMAR, Niyati Chhaya, Vasudeva Varma Kalidindi}, TITLE = {Leveraging Mental Health Forums for User-level Depression Detection on Social Media}, BOOKTITLE = {International Conference on Language Resources and Evaluation}, YEAR = {2022}}
The number of depression and suicide risk cases on social media platforms is ever-increasing, and the lack of depression detection mechanisms on these platforms is becoming increasingly apparent. A majority of work in this area has focused on leveraging linguistic features while dealing with small-scale datasets. However, one faces many obstacles when factoring into account the vastness and inherent imbalance of social media content. In this paper, we aim to optimize the performance of user-level depression classification to lessen the burden on computational resources. The resulting system executes in a quicker, more efficient manner, in turn making it suitable for deployment. To simulate a platform agnostic framework, we simultaneously replicate the size and composition of social media to identify victims of depression. We systematically design a solution that categorizes post embeddings, obtained by fine-tuning transformer models such as RoBERTa, and derives user-level representations using hierarchical attention networks. We also introduce a novel mental health dataset to enhance the performance of depression categorization. We leverage accounts of depression taken from this dataset to infuse domain-specific elements into our framework. Our proposed methods outperform numerous baselines across standard metrics for the task of depression detection in text.
Towards Proactively Forecasting Sentence-Specific Information Popularity within Online News Documents
Sayar Ghosh Roy,Anshul Padhi,Risubh Jain,Manish Gupta,Vasudeva Varma Kalidindi
ACM Conference on Hypertext and Social Media, HT&SM, 2022
@inproceedings{bib_Towa_2022, AUTHOR = {Sayar Ghosh Roy, Anshul Padhi, Risubh Jain, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Towards Proactively Forecasting Sentence-Specific Information Popularity within Online News Documents}, BOOKTITLE = {ACM Conference on Hypertext and Social Media}. YEAR = {2022}}
Multiple studies have focused on predicting the prospective popularity of an online document as a whole, without paying attention to the contributions of its individual parts. We introduce the task of proactively forecasting popularities of sentences within online news documents solely utilizing their natural language content. We model sentence-specific popularity forecasting as a sequence regression task. For training our models, we curate InfoPop, the first dataset containing popularity labels for over 1.7 million sentences from over 50,000 online news documents. To the best of our knowledge, this is the first dataset automatically created using streams of incoming search engine queries to generate sentence-level popularity annotations. We propose a novel transfer learning approach involving sentence salience prediction as an auxiliary task. Our proposed technique coupled with a BERT-based neural model exceeds nDCG values of 0.8 for proactive sentence-specific popularity forecasting. Notably, our study presents a non-trivial takeaway: though popularity and salience are different concepts, transfer learning from salience prediction enhances popularity forecasting. We release InfoPop and make our code publicly available.
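A hedged sketch of the core modeling step: encode each sentence and regress a scalar popularity score. The encoder choice and head are assumptions; the paper's sequence-level modeling and salience transfer are not reproduced here.

```python
# Sketch of per-sentence popularity regression: a BERT encoder plus a scalar
# regression head. Training would minimize, e.g., MSE against InfoPop's
# query-derived labels; this snippet only runs the forward pass.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
regressor = nn.Linear(encoder.config.hidden_size, 1)  # popularity score per sentence

sentences = ["The minister announced a new policy.", "The event took place on Tuesday."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    cls = encoder(**batch).last_hidden_state[:, 0]    # [CLS] embedding per sentence
scores = regressor(cls).squeeze(-1)                   # one popularity score per sentence
```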
IIITH at SemEval-2022 Task 5: A comparative study of deep learning models for identifying misogynous memes
Tathagata Raha,Sagar Sandeep Joshi,Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2022
@inproceedings{bib_IIIT_2022, AUTHOR = {Tathagata Raha, Sagar Sandeep Joshi, Vasudeva Varma Kalidindi}, TITLE = {IIITH at SemEval-2022 Task 5: A comparative study of deep learning models for identifying misogynous memes}, BOOKTITLE = {International Workshop on Semantic Evaluation}. YEAR = {2022}}
This paper provides a comparison of different deep learning methods for identifying misogynous memes for SemEval-2022 Task 5: Multimedia Automatic Misogyny Identification. In this task, we experiment with architectures for identifying misogynous content in memes by making use of text and image-based information. The different deep learning methods compared in this paper are: (i) unimodal image or text models, (ii) fusion of unimodal models, (iii) multimodal transformer models, and (iv) transformers further pretrained on a multimodal task. From our experiments, we found pretrained multimodal transformer architectures to strongly outperform the models involving fusion of representations from the two modalities.
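Category (ii) above, fusion of unimodal models, is easy to sketch: concatenate text and image embeddings and classify. The embedding sizes below are illustrative assumptions.

```python
# Minimal late-fusion baseline of the kind compared in the paper: concatenate
# unimodal text and image embeddings before a small classifier head.
import torch
import torch.nn as nn

text_emb = torch.randn(8, 768)    # e.g., from a BERT-style text encoder
img_emb = torch.randn(8, 2048)    # e.g., from a ResNet-style image encoder

fusion_clf = nn.Sequential(
    nn.Linear(768 + 2048, 512), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(512, 2),            # misogynous vs. not
)
logits = fusion_clf(torch.cat([text_emb, img_emb], dim=-1))
```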
IIIT-MLNS at SemEval-2022 Task 8: Siamese Architecture for Modeling Multilingual News Similarity
Sagar Sandeep Joshi,Dhaval Taunk,Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2022
@inproceedings{bib_IIIT_2022, AUTHOR = {Sagar Sandeep Joshi, Dhaval Taunk, Vasudeva Varma Kalidindi}, TITLE = {IIIT-MLNS at SemEval-2022 Task 8: Siamese Architecture for Modeling Multilingual News Similarity}, BOOKTITLE = {International Workshop on Semantic Evaluation}. YEAR = {2022}}
The task of multilingual news article similarity entails determining the degree of similarity of a given pair of news articles in a language-agnostic setting. This task aims to determine the extent to which the articles deal with the entities and events in question without much consideration of the subjective aspects of the discourse. Given the superior representations produced by pretrained multilingual transformer models, as validated on other NLP tasks across an array of high- and low-resource languages, and given that this task imposes no restricted set of languages to focus on, we used the encoder representations from these models throughout our experiments. For modeling the similarity task with these representations, a Siamese network was used as the underlying architecture. In our experimentation, we investigated several fronts, including the features passed to the encoder model, data augmentation, and ensembling. We found data augmentation to be the most effective strategy among our experiments.
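The Siamese setup reduces to a few lines with a shared multilingual encoder; this is a minimal sketch with an assumed checkpoint, scoring one pair by cosine similarity rather than reproducing the trained similarity head.

```python
# Sketch of the Siamese idea: one shared multilingual encoder applied to both
# articles, compared with cosine similarity. The checkpoint is an assumption.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # shared weights
article_a = "Wahlen in Deutschland: Ergebnisse ..."
article_b = "German election results announced ..."
emb_a, emb_b = encoder.encode([article_a, article_b], convert_to_tensor=True)
similarity = util.cos_sim(emb_a, emb_b).item()  # pair similarity score in [-1, 1]
```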
Towards Capturing Changes in Mood and Identifying Suicidality Risk
Sravani Boinepelli,Shivansh Subramanian,Abhijeeth Reddy Singam,Tathagata Raha,Vasudeva Varma Kalidindi
Workshop on Computational Linguistics and Clinical Psychology, CLPsych-w, 2022
@inproceedings{bib_Towa_2022, AUTHOR = {Sravani Boinepelli, Shivansh Subramanian, Abhijeeth Reddy Singam, Tathagata Raha, Vasudeva Varma Kalidindi}, TITLE = {Towards Capturing Changes in Mood and Identifying Suicidality Risk}, BOOKTITLE = {Workshop on Computational Linguistics and Clinical Psychology}. YEAR = {2022}}
This paper describes our systems for CLPsych's 2022 Shared Task. Subtask A involves capturing moments of change in an individual's mood over time, while Subtask B asks us to identify the suicidality risk of a user. We explore multiple machine learning and deep learning methods for these tasks, taking real-life applicability into account while designing the architecture. Our team, IIITH, achieved top results in different categories for both subtasks. Task A was evaluated at the post level (using macro-averaged F1) and at a window-based timeline level (using macro-averaged precision and recall). We scored a post-level F1 of 0.520 and ranked second with a timeline-level recall of 0.646. Task B was a user-level task where we also came in second with a micro F1 of 0.520 and scored third place on the leaderboard with a macro F1 of 0.380.
Gui at MixMT 2022: English-Hinglish: An MT approach for translation of code mixed data
Akshat Gahoi,Saransh Rajput,Jayant Duneja,Tanvi Kamble,Anshul Padhi,Dipti Mishra Sharma,Shivam Sadashiv Mangale,Vasudeva Varma Kalidindi
Technical Report, arXiv, 2022
@inproceedings{bib_Gui__2022, AUTHOR = {Akshat Gahoi, Saransh Rajput, Jayant Duneja, Tanvi Kamble, Anshul Padhi, Dipti Mishra Sharma, Shivam Sadashiv Mangale, Vasudeva Varma Kalidindi}, TITLE = {Gui at MixMT 2022: English-Hinglish: An MT approach for translation of code mixed data}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
Code-mixed machine translation has become an important task in multilingual communities, and extending machine translation to code-mixed data has become a common goal for these languages. In the shared tasks of WMT 2022, we tackle both English + Hindi to Hinglish and Hinglish to English translation. The first task dealt with both Roman and Devanagari script, as we had monolingual data in both English and Hindi, whereas the second task only had data in Roman script. To our knowledge, we achieved one of the top ROUGE-L and WER scores for the first task of monolingual to code-mixed machine translation. In this paper, we discuss in detail the use of mBART with some special pre-processing and post-processing (transliteration from Devanagari to Roman) for the first task, and the experiments that we performed for the second task of translating code-mixed Hinglish to monolingual English.
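As an illustration of the translate-then-transliterate pipeline described above, this hedged sketch uses an off-the-shelf many-to-many mBART checkpoint and the indic-transliteration package; the actual system fine-tuned mBART on the shared-task data, so the model choice and the ITRANS romanization scheme here are assumptions.

```python
# Sketch: English -> Hindi (Devanagari) with mBART-50, then Devanagari -> Roman
# transliteration as post-processing to obtain Hinglish-style output.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

model_name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)

tokenizer.src_lang = "en_XX"
inputs = tokenizer("How are you doing today?", return_tensors="pt")
generated = model.generate(**inputs,
                           forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"])
hindi = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

hinglish = transliterate(hindi, sanscript.DEVANAGARI, sanscript.ITRANS)
print(hinglish)
```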
Investigating Strategies for Clause Recommendation
Sagar Sandeep Joshi,Sumanth Balaji,Jerrin John Thomas,Aparna Garimella,Vasudeva Varma Kalidindi
Legal Knowledge and Information Systems, LKIS, 2022
@inproceedings{bib_Inve_2022, AUTHOR = {Sagar Sandeep Joshi, Sumanth Balaji, Jerrin John Thomas, Aparna Garimella, Vasudeva Varma Kalidindi}, TITLE = {Investigating Strategies for Clause Recommendation}, BOOKTITLE = {Legal Knowledge and Information Systems}. YEAR = {2022}}
Clause recommendation is the problem of recommending a clause to a legal contract, given the context of the contract in question and the clause type to which the clause should belong. With little prior work on the generation of legal contracts, this problem was proposed as a first step toward the bigger problem of contract generation. As an open-ended text generation problem, its distinguishing characteristics lie in the nature of legal language as a sublanguage and in the considerable similarity of textual content within clauses of a specific type. This similarity in legal clauses drives us to investigate the importance of similar contracts' representations for recommending clauses. In our work, we experiment with generating clauses for 15 commonly occurring clause types in contracts, expanding upon previous work on this problem and analyzing clause recommendations in varying settings using information derived from similar contracts.
XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages
Sagare Shivprasad Rajendra,Tushar Abhishek,Bhavyajeet Singh,Anubhav Sharma,Vasudeva Varma Kalidindi
Technical Report, arXiv, 2022
@inproceedings{bib_XF2T_2022, AUTHOR = {Sagare Shivprasad Rajendra, Tushar Abhishek, Bhavyajeet Singh, Anubhav Sharma, Vasudeva Varma Kalidindi}, TITLE = {XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
Multiple business scenarios require an automated generation of descriptive human-readable text from structured input data. Hence, fact-to-text generation systems have been developed for various downstream tasks like generating soccer reports, weather and financial reports, medical reports, person biographies, etc. Unfortunately, previous work on fact-to-text (F2T) generation has focused primarily on English, mainly due to the high availability of relevant datasets. Only recently, the problem of cross-lingual fact-to-text (XF2T) was proposed for generation across multiple languages, along with a dataset, XALIGN, for eight languages. However, there has been no rigorous work on the actual XF2T generation problem. We extend the XALIGN dataset with annotated data for four more languages: Punjabi, Malayalam, Assamese and Oriya. We conduct an extensive study using popular Transformer-based text generation models on our extended multi-lingual dataset, which we call XALIGNV2. Further, we investigate the performance of different text generation strategies: multiple variations of pretraining, fact-aware embeddings and structure-aware input encoding. Our extensive experiments show that a multi-lingual mT5 model which uses fact-aware embeddings with structure-aware input encoding leads to the best results on average across the twelve languages. We make our code, dataset and model publicly available, and hope that this will help advance further research in this critical area.
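A hedged illustration of structure-aware input linearization for XF2T with mT5: facts are flattened with separator tags before encoding. The tag vocabulary and prompt format are assumptions rather than the paper's exact scheme, and the untuned checkpoint's output is illustrative only.

```python
# Sketch: linearize fact triples with separator tags and feed them to mT5.
# The <S>/<P>/<O> tags here are plain text, not registered special tokens.
from transformers import MT5ForConditionalGeneration, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

facts = [("Sachin Tendulkar", "occupation", "cricketer"),
         ("Sachin Tendulkar", "birth place", "Mumbai")]
linearized = "generate hi: " + " ".join(
    f"<S> {s} <P> {p} <O> {o}" for s, p, o in facts)

inputs = tokenizer(linearized, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)  # untuned: output is illustrative
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```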
An Ensemble Approach to Detect Emotions at an Essay Level
Himanshu Maheshwari,Vasudeva Varma Kalidindi
Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA, 2022
@inproceedings{bib_An_E_2022, AUTHOR = {Himanshu Maheshwari, Vasudeva Varma Kalidindi}, TITLE = {An Ensemble Approach to Detect Emotions at an Essay Level}, BOOKTITLE = {Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis}. YEAR = {2022}}
This paper describes our system (IREL, referred to as himanshu.1007 on CodaLab) for the Shared Task on Empathy Detection, Emotion Classification, and Personality Detection at the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis at ACL 2022. We participated in track 2 for predicting emotion at the essay level. We propose an ensemble approach that leverages the linguistic knowledge of RoBERTa, BART-large, and a RoBERTa model finetuned on the GoEmotions dataset. Each brings in its unique advantage, as we discuss in the paper. Our proposed system achieved a Macro F1 score of 0.585 and ranked first out of thirteen teams (the current top team on the leaderboard submitted after the deadline). The code can be found here
XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages
Tushar Abhishek,Shivprasad Sagare,Bhavyajeet Singh,Anubhav Sharma,Manish Gupta,Vasudeva Varma Kalidindi
International Conference on World wide web, WWW, 2022
@inproceedings{bib_XAli_2022, AUTHOR = {Tushar Abhishek, Shivprasad Sagare, Bhavyajeet Singh, Anubhav Sharma, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2022}}
Multiple critical scenarios (like Wikipedia text generation given English Infoboxes) need automated generation of descriptive text in low resource (LR) languages from English fact triples. Previous work has focused on English fact-to-text (F2T) generation. To the best of our knowledge, there has been no previous attempt on cross-lingual alignment or generation for LR languages. Building an effective cross-lingual F2T (XF2T) system requires alignment between English structured facts and LR sentences. We propose two unsupervised methods for cross-lingual alignment. We contribute XALIGN, an XF2T dataset with 0.45M pairs across 8 languages, of which 5402 pairs have been manually annotated. We also train strong baseline XF2T generation models on the XAlign dataset.
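One way to picture the unsupervised alignment step (not the paper's exact method): verbalize the English facts naively and score candidate low-resource sentences with a multilingual sentence encoder such as LaBSE.

```python
# Illustrative alignment-by-similarity: keep the candidate sentence that is
# most similar to the naively verbalized facts. Encoder choice is an assumption.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/LaBSE")
facts_text = "Sachin Tendulkar ; occupation ; cricketer"
candidate_sentences = ["सचिन तेंदुलकर एक भारतीय क्रिकेटर हैं।",
                       "मुंबई भारत का एक बड़ा शहर है।"]

fact_emb = encoder.encode(facts_text, convert_to_tensor=True)
sent_embs = encoder.encode(candidate_sentences, convert_to_tensor=True)
scores = util.cos_sim(fact_emb, sent_embs)          # similarity of facts to each sentence
best = candidate_sentences[int(scores.argmax())]    # best-scoring alignment candidate
```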
System and method for retrieving and extracting security information
Raghu Babu Reddy Y,LALIT MOHAN S,Vasudeva Varma Kalidindi
United States Patent, Us patent, 2022
@inproceedings{bib_Syst_2022, AUTHOR = {Raghu Babu Reddy Y, LALIT MOHAN S, Vasudeva Varma Kalidindi}, TITLE = {System and method for retrieving and extracting security information}, BOOKTITLE = {United States Patent}. YEAR = {2022}}
A system and method for automatically extracting contract data from electronic contracts includes an administrator module configured to provide templates for inputting document patterns and a list of contract data tags for each of a plurality of contract document types. A parser is configured to convert an electronic contract document into a contract text document and reformat the contract text document to provide a pattern for the text contract document. A pattern recognition engine is configured to determine a list of contract document types in the electronic contract by comparing and matching patterns of all known contract document types with the pattern of the contract text document. A contract data extraction engine is configured to extract contract data for each contract document type on the list.
Cross-lingual Alignment of Knowledge Graph Triples with Sentences
Swayatta Daw,Sagare Shivprasad Rajendra,Tushar Abhishek,Vikram Pudi,Vasudeva Varma Kalidindi
International Conference on Natural Language Processing., ICON, 2021
@inproceedings{bib_Cros_2021, AUTHOR = {Swayatta Daw, Sagare Shivprasad Rajendra, Tushar Abhishek, Vikram Pudi, Vasudeva Varma Kalidindi}, TITLE = {Cross-lingual Alignment of Knowledge Graph Triples with Sentences}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2021}}
The pairing of natural language sentences with knowledge graph triples is essential for many downstream tasks like data-to-text generation, fact extraction from sentences (semantic parsing), knowledge graph completion, etc. Most existing methods solve these downstream tasks using neural-based end-to-end approaches that require a large amount of well-aligned training data, which is difficult and expensive to acquire. Recently, various unsupervised techniques have been proposed to alleviate this alignment step by automatically pairing structured data (knowledge graph triples) with textual data. However, these approaches are not well suited for low-resource languages, which present two major challenges: (1) unavailability of pairs of triples and native text with the same content distribution, and (2) limited Natural Language Processing (NLP) resources. In this paper, we address the unsupervised pairing of knowledge graph triples with sentences for low-resource languages, selecting Hindi as the low-resource language. We propose cross-lingual pairing of English triples with Hindi sentences to mitigate the unavailability of content overlap. We propose two novel approaches: NER-based filtering with semantic similarity, and key-phrase extraction with relevance ranking. We use our best method to create a collection of 29,224 well-aligned English triple and Hindi sentence pairs. Additionally, we have curated a golden test set of 350 human-annotated pairs for evaluation. We make the code and dataset publicly available.
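A simplified sketch of the NER-based filtering with semantic similarity idea: keep a (triple, sentence) pair only if known surface forms of the triple's entities occur in the sentence and a multilingual encoder scores the pair as similar. The encoder, the surface-form lookup, and the threshold are assumptions.

```python
# Two-stage filter sketch: (1) entity presence check, (2) semantic similarity
# check between the verbalized triple and the candidate Hindi sentence.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def keep_pair(triple, sentence, entity_surface_forms, threshold=0.6):
    subj, pred, obj = triple
    # (1) entity filter: a known Hindi surface form of the subject must occur
    if not any(sf in sentence for sf in entity_surface_forms.get(subj, [])):
        return False
    # (2) semantic filter: verbalized triple and sentence must be similar enough
    emb = encoder.encode([f"{subj} {pred} {obj}", sentence], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

triple = ("Sachin Tendulkar", "occupation", "cricketer")
surface = {"Sachin Tendulkar": ["सचिन तेंदुलकर"]}
print(keep_pair(triple, "सचिन तेंदुलकर एक भारतीय क्रिकेटर हैं।", surface))
```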
Retrieval of Prior Court Cases Using Witness Testimonies
Kripabandhu GHOSH,Sachin PAWAR,Girish PALSHIKAR,Pushpak BHATTACHARYYA,Vasudeva Varma Kalidindi
Legal Knowledge and Information Systems, LKIS, 2021
@inproceedings{bib_Retr_2021, AUTHOR = {Kripabandhu GHOSH, Sachin PAWAR, Girish PALSHIKAR, Pushpak BHATTACHARYYA, Vasudeva Varma Kalidindi}, TITLE = {Retrieval of Prior Court Cases Using Witness Testimonies}, BOOKTITLE = {Legal Knowledge and Information Systems}. YEAR = {2021}}
Witness testimonies are important constituents of a court case description and play a significant role in the final decision. We propose two techniques to identify sentences representing witness testimonies. The first technique employs linguistic rules, whereas the second applies distant supervision, where the training set is constructed automatically using the output of the first technique. We then represent the identified witness testimonies in a more meaningful structure: the event verb (predicate) along with its arguments corresponding to semantic roles A0 and A1 [1]. We demonstrate the effectiveness of such a representation in retrieving semantically similar prior relevant cases. To the best of our knowledge, this is the first paper to apply NLP techniques to extract witness information from court judgements and use it for retrieving prior court cases.
Fine-Grained Multi-label Sexism Classification Using a Semi-Supervised Multi-level Neural Approach
Harika Abburi,PARIKH PULKIT TRUSHANT KUMAR,Niyati Chhaya,Vasudeva Varma Kalidindi
Data Science and Engineering, DSE, 2021
@inproceedings{bib_Fine_2021, AUTHOR = {Harika Abburi, PARIKH PULKIT TRUSHANT KUMAR, Niyati Chhaya, Vasudeva Varma Kalidindi}, TITLE = {Fine-Grained Multi-label Sexism Classification Using a Semi-Supervised Multi-level Neural Approach}, BOOKTITLE = {Data Science and Engineering}. YEAR = {2021}}
Sexism, a pervasive form of oppression, causes profound suffering through various manifestations. Given the increasing number of experiences of sexism shared online, categorizing these recollections automatically can support the battle against sexism, since it can promote successful evaluations by gender studies researchers and government representatives engaged in policy making. In this paper, we examine the fine-grained, multi-label classification of accounts (reports) of sexism. To the best of our knowledge, we consider substantially more categories of sexism than any related prior work through our 23-class problem formulation. Moreover, we present the first semi-supervised work for the multi-label classification of accounts describing any type(s) of sexism. We devise self-training-based techniques tailor-made for the multi-label nature of the problem to utilize unlabeled samples for augmenting the labeled set. We identify high textual diversity with respect to the existing labeled set as a desirable quality for candidate unlabeled instances and develop methods for incorporating it into our approach. We also explore ways of infusing class imbalance alleviation for multi-label classification into our semi-supervised learning, independently and in conjunction with the method involving diversity. In addition to data augmentation methods, we develop a neural model which combines biLSTM and attention with a domain-adapted BERT model in an end-to-end trainable manner. Further, we formulate a multi-level training approach in which models are sequentially trained.
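A schematic sketch of one self-training round with the diversity criterion described above, simplified to single-label pseudo-labeling; the thresholds, vectorizer, and diversity measure are assumptions.

```python
# Self-training round sketch: pseudo-label unlabeled samples, keep only those
# that are confident AND textually far from the current labeled set, retrain.
from sklearn.feature_extraction.text import TfidfVectorizer

def self_training_round(model, X_lab, y_lab, X_unlab, conf_t=0.9, div_t=0.7):
    proba = model.predict_proba(X_unlab)
    vec = TfidfVectorizer().fit(X_lab + X_unlab)
    L, U = vec.transform(X_lab), vec.transform(X_unlab)
    # tf-idf rows are l2-normalized, so the dot product is cosine similarity
    nearest_labeled_sim = (U @ L.T).max(axis=1).toarray().ravel()

    added_X, added_y = [], []
    for i, p in enumerate(proba):
        if p.max() >= conf_t and nearest_labeled_sim[i] <= div_t:  # confident + diverse
            added_X.append(X_unlab[i])
            added_y.append(p.argmax())
    model.fit(X_lab + added_X, list(y_lab) + added_y)
    return model, added_X
```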
Knowledge-based Neural Framework for Sexism Detection and Classification
HARIKA ABBURI,Shradha Sehgal,Himanshu Maheshwari,Vasudeva Varma Kalidindi
Iberian Languages Evaluation Forum, IberLEF, 2021
@inproceedings{bib_Know_2021, AUTHOR = {HARIKA ABBURI, Shradha Sehgal, Himanshu Maheshwari, Vasudeva Varma Kalidindi}, TITLE = {Knowledge-based Neural Framework for Sexism Detection and Classification}, BOOKTITLE = {Iberian Languages Evaluation Forum}. YEAR = {2021}}
Sexism, a prejudice that causes enormous suffering, manifests in blatant as well as subtle ways. As sexist content towards women is increasingly spread on social networks, the automatic detection and categorization of these tweets/posts can help social scientists and policymakers in research, thereby combating sexism. In this paper, we explore the problem of detecting whether a Twitter/Gab post is sexist or not. We further discriminate the detected sexist post into one of the fine-grained sexism categories. We propose a neural model for this sexism detection and classification that can combine representations obtained using RoBERTa model and linguistic features such as Empath, Hurtlex, and Perspective API by involving recurrent components. We also leverage the unlabeled sexism data to infuse the domain-specific transformer model into our framework. Our proposed framework also features a knowledge module comprised of emoticon and hashtag representations to infuse the external knowledge-specific features into the learning process. Several proposed methods outperform various baselines across several standard metrics.
SCATE: Shared Cross Attention Transformer Encoders for Multimodal Fake News Detection
Tanmay Sachan,Nikhil Pinnaparaju,Manish Gupta,Vasudeva Varma Kalidindi
IEEE International Conference on Advances in Social Networks Analysis and Mining, ASONAM, 2021
@inproceedings{bib_SCAT_2021, AUTHOR = {Tanmay Sachan, Nikhil Pinnaparaju, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {SCATE: Shared Cross Attention Transformer Encoders for Multimodal Fake News Detection}, BOOKTITLE = {IEEE International Conference on Advances in Social Networks Analysis and Mining}. YEAR = {2021}}
Social media platforms have democratized the publication process, resulting in easy and viral propagation of information. Oftentimes this misinformation is accompanied by misleading or doctored images that quickly circulate across the internet and reach many unsuspecting users. Several manual as well as automated efforts have been undertaken in the past to solve this critical problem. While manual efforts cannot keep up with the rate at which this content is churned out, many automated approaches only leverage concatenation (of the image and text representations), thereby failing to build effective crossmodal embeddings. Such architectures fail in many cases because neither the text nor the image needs to be false on its own for the corresponding (text, image) pair to be misinformation. While some recent work attempts to use attention techniques to compute a crossmodal representation using pretrained text and image embeddings, we show a more effective approach towards utilizing such pretrained embeddings to build richer representations that can be classified better. This involves several challenges like how to handle text variations on Twitter and Weibo, how to encode the image information and how to leverage the text and image encodings together effectively. Our architecture, SCATE (Shared Cross Attention Transformer Encoders), leverages deep convolutional neural networks and transformer-based methods to encode image and text information utilizing crossmodal attention and shared layers for the two modalities. Our experiments with three popular benchmark datasets (Twitter, WeiboA and WeiboB) show that our proposed methods outperform the state-of-the-art methods.
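The crossmodal attention ingredient can be sketched compactly: text features attend over image region features via multi-head attention. The dimensions and single attention layer below are illustrative assumptions, not the full SCATE architecture.

```python
# Minimal crossmodal attention sketch: text tokens (queries) attend to image
# regions (keys/values), and the pooled output feeds a classification head.
import torch
import torch.nn as nn

text_feats = torch.randn(1, 32, 512)    # (batch, text tokens, dim) from a text encoder
image_feats = torch.randn(1, 49, 512)   # (batch, image regions, dim) from a CNN

cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
attended, _ = cross_attn(query=text_feats, key=image_feats, value=image_feats)
fused = attended.mean(dim=1)            # pooled crossmodal representation
fake_news_logits = nn.Linear(512, 2)(fused)
```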
Goal-Directed Extractive Summarization of Financial Reports
Yash Agrawal,Vivek Anand,Manish Gupta,S Arunachalam,Vasudeva Varma Kalidindi
International Conference on Information and Knowledge Management, CIKM, 2021
@inproceedings{bib_Goal_2021, AUTHOR = {Yash Agrawal, Vivek Anand, Manish Gupta, S Arunachalam, Vasudeva Varma Kalidindi}, TITLE = {Goal-Directed Extractive Summarization of Financial Reports}, BOOKTITLE = {International Conference on Information and Knowledge Management}. YEAR = {2021}}
Financial reports filed by various companies discuss compliance, risks, and future plans, such as goals and new projects, which directly impact their stock price. Quick consumption of such information is critical for financial analysts and investors to make stock buy/sell decisions and for equity evaluations. Hence, we study the problem of extractive summarization of 10-K reports. Recently, Transformer-based summarization models have become very popular. However, the lack of in-domain labeled summarization data is a major roadblock to training such finance-specific summarization models. We also show that zero-shot inference with such pretrained models is not effective either. In this paper, we address this challenge by modeling 10-K report summarization in a goal-directed setting, where we leverage summaries with labeled goal-related data for the stock buy/sell classification goal. Further, we provide improvements by considering a multi-task learning method with an industry classification auxiliary task. Intrinsic evaluation as well as extrinsic evaluation for the stock buy/sell classification and portfolio construction tasks shows that our proposed method significantly outperforms strong baselines.
Transformer Models for Text Coherence Assessment
Tushar Abhishek,Daksh Rawat,Manish Gupta,Vasudeva Varma Kalidindi
Technical Report, arXiv, 2021
@inproceedings{bib_Tran_2021, AUTHOR = {Tushar Abhishek, Daksh Rawat, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Transformer Models for Text Coherence Assessment}, BOOKTITLE = {Technical Report}. YEAR = {2021}}
Coherence is an important aspect of text quality and is crucial for ensuring its readability. It is an essential desideratum for outputs from text generation systems like summarization, question answering, machine translation, question generation, table-to-text, etc. An automated coherence scoring model is also helpful in essay scoring or providing writing feedback. A large body of previous work has leveraged entity-based methods, syntactic patterns, discourse relations and, more recently, traditional deep learning architectures for text coherence assessment. We hypothesize that coherence assessment is a cognitively complex task which requires deeper models and can benefit from other related tasks. Accordingly, in this paper, we propose four different Transformer-based architectures for the task: a vanilla Transformer, a hierarchical Transformer, a multi-task learning-based model, and a model with fact-based input representation. Our experiments with popular benchmark datasets across multiple domains on four different coherence assessment tasks demonstrate that our models achieve state-of-the-art results, outperforming existing models by a good margin.
IIITH at SemEval-2021 Task 7: Leveraging transformer-based humourous and offensive text detection architectures using lexical and hurtlex features and task adaptive pretraining
Tathagata Raha,Ishan Sanjeev Upadhyay,Radhika Mamidi,Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2021
@inproceedings{bib_IIIT_2021, AUTHOR = {Tathagata Raha, Ishan Sanjeev Upadhyay, Radhika Mamidi, Vasudeva Varma Kalidindi}, TITLE = {IIITH at SemEval-2021 Task 7: Leveraging transformer-based humourous and offensive text detection architectures using lexical and hurtlex features and task adaptive pretraining}, BOOKTITLE = {International Workshop on Semantic Evaluation}. YEAR = {2021}}
This paper describes our approach (IIITH) for SemEval-2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense. Our results focus on two major objectives: (i) the effect of task adaptive pretraining on the performance of transformer-based models, and (ii) how lexical and HurtLex features help in quantifying humour and offense. In this paper, we provide a detailed description of our approach along with the comparisons mentioned above.
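A hedged sketch of what task adaptive pretraining amounts to in practice: continue masked-LM pretraining on unlabeled task text before attaching a classification head. The model name, data, and hyperparameters below are placeholders, not the submitted configuration.

```python
# TAPT sketch: continue masked-language-model pretraining on in-task text,
# then fine-tune the resulting checkpoint with a classification head.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

task_texts = ["unlabeled text from the task corpus ...", "another in-domain example ..."]
ds = Dataset.from_dict({"text": task_texts}).map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tapt-ckpt", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # afterwards, load "tapt-ckpt" with a classification head and fine-tune
```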
Scibert sentence representation for citation context classification
Himanshu Maheshwari,Bhavyajeet Singh,Vasudeva Varma Kalidindi
Workshop on Scholarly Document Processing, SDP-W, 2021
@inproceedings{bib_Scib_2021, AUTHOR = {Himanshu Maheshwari, Bhavyajeet Singh, Vasudeva Varma Kalidindi}, TITLE = {Scibert sentence representation for citation context classification}, BOOKTITLE = {Workshop on Scholarly Document Processing}. YEAR = {2021}}
This paper describes our system (IREL) for the 3C Citation Context Classification shared task of the Scholarly Document Processing Workshop at NAACL 2021. We participated in both subtask A and subtask B. Our best system achieved a Macro F1 score of 0.26973 on the private leaderboard for subtask A and was ranked first. For subtask B, our best system achieved a Macro F1 score of 0.59071 on the private leaderboard and was ranked second. We used similar models for both subtasks with some minor changes, as discussed in this paper. Our best-performing model for both subtasks was a finetuned SciBERT model followed by a linear layer. This paper provides a detailed description of all the approaches we tried and their results.
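A minimal sketch of the reported architecture (SciBERT followed by a linear layer), here realized with the stock sequence-classification head; the label count and the citation-marker format are assumptions.

```python
# Sketch: SciBERT with a linear classification layer over the pooled output,
# applied to a citation context. Forward pass only; no training shown.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=6)  # e.g., six citation purposes

context = "Following [CITATION], we adopt their evaluation protocol."
inputs = tokenizer(context, return_tensors="pt", truncation=True)
with torch.no_grad():
    purpose_logits = model(**inputs).logits
```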
Addressing issues in training Neural Extractive Summarisation models
Ramkishore S,Nikhil Pinnaparaju,Vasudeva Varma Kalidindi
Conference on Pattern Recognition and Machine Intelligence, PReMI, 2021
@inproceedings{bib_Addr_2021, AUTHOR = {Ramkishore S, Nikhil Pinnaparaju, Vasudeva Varma Kalidindi}, TITLE = {Addressing issues in training Neural Extractive Summarisation models}, BOOKTITLE = {Conference on Pattern Recognition and Machine Intelligence}. YEAR = {2021}}
In many works, the problem of extractive summarisation has been framed as extracting the best summary from a given document. Many popular recent works aim to solve this by employing neural networks, yet many of these models are trained with a very limited scope; for example, a vast majority of the neural models are trained only on the single best summary. Some also consider pruning useless summaries using other models. In this work, we show the problems that can arise when training neural models using such methods. We analyse those problems in some major milestones in neural extractive summarisation, and we demonstrate ways to overcome them experimentally.
Identifying COVID-19 Fake News in Social Media
Tathagata Raha,I Vijayasaradhi,Aayush Upadhyaya,Jeevesh Kataria,Pramud Bommakanti,Vikram Keswani,Vasudeva Varma Kalidindi
Technical Report, arXiv, 2021
@inproceedings{bib_Iden_2021, AUTHOR = {Tathagata Raha, I Vijayasaradhi, Aayush Upadhyaya, Jeevesh Kataria, Pramud Bommakanti, Vikram Keswani, Vasudeva Varma Kalidindi}, TITLE = {Identifying COVID-19 Fake News in Social Media }, BOOKTITLE = {Technical Report}. YEAR = {2021}}
The evolution of social media platforms has empowered everyone to access information easily. Social media users can easily share information with the rest of the world. This may sometimes encourage the spread of fake news, which can result in undesirable consequences. In this work, we train models which can identify health news related to the COVID-19 pandemic as real or fake. Our models achieve a high F1-score of 98.64%. Our models achieved second place on the leaderboard, trailing the first position by a very narrow margin of 0.05 percentage points.
Knowledge-based Extraction of Cause-Effect Relations from Biomedical Text
Sachin Pawar,Ravina More,Girish K. Palshikar,Pushpak Bhattacharyya,Vasudeva Varma Kalidindi
Technical Report, arXiv, 2021
@inproceedings{bib_Know_2021, AUTHOR = {Sachin Pawar, Ravina More, Girish K. Palshikar, Pushpak Bhattacharyya, Vasudeva Varma Kalidindi}, TITLE = {Knowledge-based Extraction of Cause-Effect Relations from Biomedical Text}, BOOKTITLE = {Technical Report}. YEAR = {2021}}
We propose a knowledge-based approach for the extraction of Cause-Effect (CE) relations from biomedical text. Our approach combines an unsupervised machine learning technique to discover causal triggers with a set of high-precision linguistic rules to identify the cause/effect arguments of these causal triggers. We evaluate our approach using a corpus of 58,761 Leukaemia-related PubMed abstracts consisting of 568,528 sentences. We could extract 152,655 CE triplets from this corpus, where each triplet consists of a cause phrase, an effect phrase and a causal trigger. Compared to the existing knowledge base SemMedDB (Kilicoglu et al., 2012), the number of extractions is almost twice as large. Moreover, the proposed approach outperformed the existing technique SemRep (Rindflesch and Fiszman, 2003) on a dataset of 500 sentences.
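To make the trigger-plus-rules idea concrete, here is a toy illustration; the paper's high-precision rules operate over syntactic analyses rather than surface regexes, so this is only a schematic.

```python
# Toy cause-effect extraction: once a causal trigger is known, a pattern picks
# out its cause and effect arguments from the sentence.
import re

TRIGGERS = ["induces", "leads to", "results in", "causes"]

def extract_ce(sentence):
    for trig in TRIGGERS:
        m = re.search(rf"^(?P<cause>.+?)\s+{trig}\s+(?P<effect>.+?)[.]?$", sentence)
        if m:
            return m.group("cause"), trig, m.group("effect")
    return None

print(extract_ce("Imatinib resistance leads to relapse in CML patients"))
# ('Imatinib resistance', 'leads to', 'relapse in CML patients')
```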
Hierarchical Model for Goal Guided Summarization of Annual Financial Reports
Agrawal Yash Chandrakant,Vivek Anand,S. Arunachalam,Vasudeva Varma Kalidindi
International Conference on World wide web - workshop, WWW-W, 2021
@inproceedings{bib_Hier_2021, AUTHOR = {Agrawal Yash Chandrakant, Vivek Anand, S. Arunachalam, Vasudeva Varma Kalidindi}, TITLE = {Hierarchical Model for Goal Guided Summarization of Annual Financial Reports}, BOOKTITLE = {International Conference on World wide web - workshop}. YEAR = {2021}}
Every year, publicly listed companies file financial reports to give insights about their activities. These reports are meant for shareholders or the general public to evaluate the company's health and decide whether to buy or sell stakes in the company. However, these annual financial reports tend to be long, and it is time-consuming to go through the reports for each company. We propose a Goal Guided Summarization technique through which the summary is extracted. The goal, in our case, is the decision to buy or sell the company's shares. We use hierarchical neural models for achieving this goal while extracting summaries. By means of intrinsic and extrinsic evaluation, we observe that the summaries extracted by our approach can model the decision of buying and selling shares better compared to summaries extracted by other summarization techniques as well as the complete document itself. We also observe that the summary extractor model can be used to construct stock portfolios which give better returns compared to a major stock index.
T3N: Harnessing Text and Temporal Tree Network for Rumor Detection on Twitter
Nikhil Pinnaparaju,Manish Gupta,Vasudeva Varma Kalidindi
Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD, 2021
@inproceedings{bib_T3N:_2021, AUTHOR = {Nikhil Pinnaparaju, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {T3N: Harnessing Text and Temporal Tree Network for Rumor Detection on Twitter}, BOOKTITLE = {Pacific-Asia Conference on Knowledge Discovery and Data Mining}. YEAR = {2021}}
Social media platforms have democratized the publication process, resulting in easy and viral propagation of information. However, the spread of rumors via such media often results in undesired and extremely impactful political, economic, social, psychological and criminal consequences. Several manual as well as automated efforts have been undertaken in the past to solve this critical problem. Existing automated methods are text based, user credibility based, or use signals from the tweet propagation tree. We aim at using the text, user, propagation tree and temporal information jointly for rumor detection on Twitter. This involves several challenges like how to handle text variations on Twitter, what signals from a user profile could be useful, how to best encode the propagation tree information, and how to incorporate the temporal signal. Our novel architecture, T3N (Text and Temporal Tree Network), leverages deep learning based architectures to encode text, user and tree information in a temporal-aware manner. Our extensive comparisons show that our proposed methods outperform the state-of-the-art techniques by ∼7 and ∼6 percentage points respectively on two popular benchmark datasets, and also lead to better early detection results.
FINSIM20 at the FinSim Task: Making Sense of Text in Financial Domain
Vivek Anand,YASH AGARWAL,Aarti Pol,Vasudeva Varma Kalidindi
Financial Technology and Natural Language Processing, FinNLP, 2021
@inproceedings{bib_FINS_2021, AUTHOR = {Vivek Anand, YASH AGARWAL, Aarti Pol, Vasudeva Varma Kalidindi}, TITLE = {FINSIM20 at the FinSim Task: Making Sense of Text in Financial Domain}, BOOKTITLE = {Financial Technology and Natural Language Processing}. YEAR = {2021}}
Semantics play an important role when it comes to automated systems using text or language, and they differ across domains. In this paper, we tackle the FinSim 2020 shared task at IJCAI-PRICAI 2020. The task deals with designing a semantic model which can automatically classify short phrases/terms from the financial domain into the most relevant hypernym (or top-level) concept in an external ontology. We perform several experiments using different kinds of word and phrase-level embeddings to solve the problem in an unsupervised manner. We also explore the use of supplementary financial domain data, either to learn better concept representations or to generate more training samples. We discuss both the positive and negative results that we observed while applying these approaches.
Task Adaptive Pretraining of Transformers for Hostility Detection
Tathagata Raha,Sayar Ghosh Roy,Ujwal Narayan N,Zubair Abid,Vasudeva Varma Kalidindi
Technical Report, arXiv, 2021
@inproceedings{bib_Task_2021, AUTHOR = {Tathagata Raha, Sayar Ghosh Roy, Ujwal Narayan N, Zubair Abid, Vasudeva Varma Kalidindi}, TITLE = {Task Adaptive Pretraining of Transformers for Hostility Detection}, BOOKTITLE = {Technical Report}. YEAR = {2021}}
Identifying adverse and hostile content on the web, and more particularly on social media, has become a problem of paramount interest in recent years. With their ever increasing popularity, fine-tuning of pretrained Transformer-based encoder models with a classifier head is gradually becoming the new baseline for natural language classification tasks. In our work, we explore the gains attributed to Task Adaptive Pretraining (TAPT) prior to fine-tuning of Transformer-based architectures. We specifically study two problems, namely, (a) coarse binary classification of Hindi tweets as hostile or not, and (b) fine-grained multi-label classification of tweets into four categories: hate, fake, offensive, and defamation. Building on an architecture which takes emojis and segmented hashtags into consideration for classification, we experimentally showcase the performance upgrades due to TAPT. Our system (with team name 'iREL IIIT') ranked first in the 'Hostile Post Detection in Hindi' shared task with an F1 score of 97.16% for coarse-grained detection and a weighted F1 score of 62.96% for fine-grained multi-label classification on the provided blind test corpora.
Leveraging Multilingual Transformers for Hate Speech Detection
Sayar Ghosh Roy,Ujwal Narayan N,Tathagata Raha,Zubair Abid,Vasudeva Varma Kalidindi
Technical Report, arXiv, 2021
@inproceedings{bib_Leve_2021, AUTHOR = {Sayar Ghosh Roy, Ujwal Narayan N, Tathagata Raha, Zubair Abid, Vasudeva Varma Kalidindi}, TITLE = {Leveraging Multilingual Transformers for Hate Speech Detection }, BOOKTITLE = {Technical Report}. YEAR = {2021}}
Detecting and classifying instances of hate in social media text has been a problem of interest in Natural Language Processing in recent years. Our work leverages state-of-the-art Transformer language models to identify hate speech in a multilingual setting. Capturing the intent of a post or a comment on social media involves careful evaluation of the language style, semantic content and additional pointers such as hashtags and emojis. In this paper, we look at the problem of identifying whether a Twitter post is hateful and offensive or not. We further discriminate the detected toxic content into one of the following three classes: (a) Hate Speech (HATE), (b) Offensive (OFFN) and (c) Profane (PRFN). With a pre-trained multilingual Transformer-based text encoder at the base, we are able to successfully identify and classify hate speech from multiple languages. On the provided testing corpora, we achieve Macro F1 scores of 90.29, 81.87 and 75.40 for English, German and Hindi respectively while performing hate speech detection, and of 60.70, 53.28 and 49.74 during fine-grained classification. In our experiments, we show the efficacy of Perspective API features for hate speech classification and the effects of exploiting a multilingual training scheme. A feature selection study is provided to illustrate the impact of specific features upon the architecture's classification head.
Summaformers@ LaySumm 20, LongSumm 20
Sayar Ghosh Roy,Nikhil Pinnaparaju,Risubh Jain,Manish Gupta,Vasudeva Varma Kalidindi
Technical Report, arXiv, 2021
@inproceedings{bib_Summ_2021, AUTHOR = {Sayar Ghosh Roy, Nikhil Pinnaparaju, Risubh Jain, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Summaformers@ LaySumm 20, LongSumm 20}, BOOKTITLE = {Technical Report}. YEAR = {2021}}
Automatic text summarization has been widely studied as an important task in natural language processing. Traditionally, various feature engineering and machine learning based systems have been proposed for extractive as well as abstractive text summarization. Recently, deep learning based, specifically Transformer-based, systems have been immensely popular. Summarization is a cognitively challenging task: extracting summary-worthy sentences is laborious, and expressing semantics in brief when doing abstractive summarization is complicated. In this paper, we specifically look at the problem of summarizing scientific research papers from multiple domains. We differentiate between two types of summaries, namely, (a) LaySumm: a very short summary that captures the essence of the research paper in layman terms, restricting overtly specific technical jargon, and (b) LongSumm: a much longer detailed summary aimed at providing specific insights into various ideas touched upon in the paper. While leveraging the latest Transformer-based models, our systems are simple, intuitive and based on how specific paper sections contribute to human summaries of the two types described above. Evaluations against gold standard summaries using ROUGE metrics prove the effectiveness of our approach. On blind test corpora, our system ranks first and third for the LongSumm and LaySumm tasks respectively.
SIS@IIITH at SemEval-2020 Task 8: An Overview of Simple Text Classification Methods for Meme Analysis
Sravani Boinepelli,Manish Shrivastava,Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation , SemEval, 2020
@inproceedings{bib_SIS@_2020, AUTHOR = {Sravani Boinepelli, Manish Shrivastava, Vasudeva Varma Kalidindi}, TITLE = {SIS@IIITH at SemEval-2020 Task 8: An Overview of Simple Text Classification Methods for Meme Analysis}, BOOKTITLE = {International Workshop on Semantic Evaluation }. YEAR = {2020}}
Memes are steadily taking over the feeds of the public on social media. There is always the threat of malicious users on the internet posting offensive content, even through memes. Hence, the automatic detection of offensive images/memes is imperative, along with the detection of offensive text. However, this is a much more complex task as it involves both visual cues as well as language understanding and cultural/context knowledge. This paper describes our approach to SemEval-2020 Task 8: Memotion Analysis. We chose to participate only in Task A, which dealt with sentiment classification and which we formulated as a text classification problem. Through our experiments, we explored multiple training models to evaluate the performance of simple text classification algorithms on the raw text obtained after running OCR on meme images. Our submitted model achieved an accuracy of 72.69% and exceeded the existing baseline's Macro F1 score by 8% on the official test dataset. Apart from describing our official submission, we elucidate how different classification models respond to this task.
Identifying Fake News Spreaders in Social Media Notebook for PAN at CLEF 2020
Nikhil Pinnaparaju,I VIJAYASARADHI,Vasudeva Varma Kalidindi
Conference and Labs of the Evaluation Forum, CLEF, 2020
@inproceedings{bib_Iden_2020, AUTHOR = {Nikhil Pinnaparaju, I VIJAYASARADHI, Vasudeva Varma Kalidindi}, TITLE = {Identifying Fake News Spreaders in Social Media Notebook for PAN at CLEF 2020}, BOOKTITLE = {Conference and Labs of the Evaluation Forum}. YEAR = {2020}}
With the rise of social networking platforms, everyone now has free access to information from around the world. Anyone from anywhere can now share content with the entire world. This allows for more connectivity around the world and more transparency. However, it also allows for the spread of misinformation and fake news, often resulting in undesired and extremely impactful political, economic, social, psychological and criminal consequences. Identifying fake news spreaders is as important as identifying the fake news itself. We put forward a method that utilizes content analysis and user modelling to capture who is more likely to share fake news. We use TF-IDF as our text transformation method coupled with a simple classification algorithm, Logistic Regression, and achieve accuracies of 71.5% and 70% in identifying fake news spreaders on the English and Spanish test sets respectively.
Distant supervision for medical concept normalization
P NIKHIL PRIYATAM,Vivek Anand,Sangameshwar Patil, Girish Palshikar,Vasudeva Varma Kalidindi
Journal of biomedical informatics, JOBM, 2020
@inproceedings{bib_Dist_2020, AUTHOR = {P NIKHIL PRIYATAM, Vivek Anand, Sangameshwar Patil, Girish Palshikar, Vasudeva Varma Kalidindi}, TITLE = {Distant supervision for medical concept normalization}, BOOKTITLE = {Journal of biomedical informatics}. YEAR = {2020}}
We consider the task of Medical Concept Normalization (MCN), which aims to map informal medical phrases such as "loosing weight" to formal medical concepts such as "Weight loss". Deep learning models have shown high performance across various MCN datasets containing a small number of target concepts along with an adequate number of training examples per concept. However, scaling these models to millions of medical concepts entails the creation of much larger datasets, which is cost and effort intensive. Recent works have shown that training MCN models using automatically labeled examples extracted from medical knowledge bases partially alleviates this problem. We extend this idea by computationally creating a distant dataset from patient discussion forums. We extract informal medical phrases and medical concepts from these forums using a synthetically trained classifier and an off-the-shelf medical entity linker, respectively. We use pretrained sentence encoding models to find the k-nearest phrases corresponding to each medical concept. These mappings are used in combination with the examples obtained from medical knowledge bases to train an MCN model. Our approach outperforms the previous state-of-the-art by 15.9% and 17.1% classification accuracy across two datasets while avoiding manual labeling.
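The k-nearest-phrase step lends itself to a short sketch; the encoder choice, toy data, and k below are illustrative assumptions.

```python
# Distant-labeling sketch: embed forum phrases and concept names with a
# pretrained sentence encoder and map each concept to its k nearest phrases.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
concepts = ["Weight loss", "Insomnia"]
forum_phrases = ["loosing weight", "can't sleep at night", "dropping pounds fast"]

c_emb = encoder.encode(concepts, convert_to_tensor=True)
p_emb = encoder.encode(forum_phrases, convert_to_tensor=True)
sims = util.cos_sim(c_emb, p_emb)                 # (num_concepts, num_phrases)

k = 2
for i, concept in enumerate(concepts):
    top = sims[i].topk(k).indices.tolist()        # k-nearest phrases become distant labels
    print(concept, "<-", [forum_phrases[j] for j in top])
```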
Scientific Document Summarization for LaySumm’20 and LongSumm’20
Sayar Ghosh Roy,Nikhil Pinnaparaju,Risubh Jain,Manish Gupta,Vasudeva Varma Kalidindi
Conference on Empirical Methods in Natural Language Processing, EMNLP, 2020
@inproceedings{bib_Scie_2020, AUTHOR = {Sayar Ghosh Roy, Nikhil Pinnaparaju, Risubh Jain, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Scientific Document Summarization for LaySumm’20 and LongSumm’20}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}. YEAR = {2020}}
Automatic text summarization has been widely studied as an important task in natural language processing. Traditionally, various feature engineering and machine learning based systems have been proposed for extractive as well as abstractive text summarization. Recently, deep learning based, specifically Transformer-based, systems have been immensely popular. Summarization is a cognitively challenging task: extracting summary-worthy sentences is laborious, and expressing semantics in brief when doing abstractive summarization is complicated. In this paper, we specifically look at the problem of summarizing scientific research papers from multiple domains. We differentiate between two types of summaries, namely, (a) LaySumm: a very short summary that captures the essence of the research paper in layman terms, restricting overtly specific technical jargon, and (b) LongSumm: a much longer detailed summary aimed at providing specific insights into various ideas touched upon in the paper. While leveraging the latest Transformer-based models, our systems are simple, intuitive and based on how specific paper sections contribute to human summaries of the two types described above. Evaluations against gold standard summaries using ROUGE (Lin, 2004) metrics prove the effectiveness of our approach. On blind test corpora, our system ranks first and third for the LongSumm and LaySumm tasks respectively.
Compression of Deep Learning Models for NLP
Manish Gupta,Vasudeva Varma Kalidindi,Sonam Damani,Kedhar Nath Narahari
International Conference on Information and Knowledge Management, CIKM, 2020
@inproceedings{bib_Comp_2020, AUTHOR = {Manish Gupta, Vasudeva Varma Kalidindi, Sonam Damani, Kedhar Nath Narahari}, TITLE = {Compression of Deep Learning Models for NLP}, BOOKTITLE = {International Conference on Information and Knowledge Management}. YEAR = {2020}}
In recent years, the fields of NLP and information retrieval have made tremendous progress thanks to deep learning models like RNNs and LSTMs, and Transformer [35] based models like BERT [9]. But these models are humongous in size. Real-world applications, however, demand small model sizes, low response times and low computational power consumption. We will discuss six different types of methods (pruning, quantization, knowledge distillation, parameter sharing, matrix decomposition, and other Transformer-based methods) for compressing such models to enable their deployment in real industry NLP projects. Given the critical need of building applications with efficient and small models, and the large amount of recently published work in this area, we believe that this tutorial is very timely. We will organize related work done by the 'deep learning for NLP' community in the past few years and present it as a coherent story.
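As one concrete instance of the quantization family covered in the tutorial, the sketch below applies PyTorch post-training dynamic quantization to a BERT classifier; the model choice and the size-measurement helper are illustrative.

```python
# Post-training dynamic quantization: convert all Linear layers of a BERT
# classifier to int8 weights and compare checkpoint sizes on disk.
import os
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)  # int8 weights for Linear layers

def size_mb(m):
    torch.save(m.state_dict(), "tmp.pt")
    size = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return size

print(f"fp32: {size_mb(model):.0f} MB -> int8: {size_mb(quantized):.0f} MB")
```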
Fine-grained Multi-label Sexism Classification Using Semi-supervised Learning
HARIKA ABBURI,PARIKH PULKIT TRUSHANT KUMAR,Niyati Chhaya,Vasudeva Varma Kalidindi
International Conference on Web Information Systems Engineering, WISE, 2020
@inproceedings{bib_Fine_2020, AUTHOR = {HARIKA ABBURI, PARIKH PULKIT TRUSHANT KUMAR, Niyati Chhaya, Vasudeva Varma Kalidindi}, TITLE = {Fine-grained Multi-label Sexism Classification Using Semi-supervised Learning}, BOOKTITLE = {International Conference on Web Information Systems Engineering}. YEAR = {2020}}
Sexism, a pervasive form of oppression, causes profound suffering through various manifestations. Given the rising number of experiences of sexism reported online, categorizing these recollections automatically can aid the fight against sexism, as it can facilitate effective analyses by gender studies researchers and government officials involved in policy making. In this paper, we explore the fine-grained, multi-label classification of accounts (reports) of sexism. To the best of our knowledge, we consider substantially more categories of sexism than any related prior work through our 23-class problem formulation. Moreover, we present the first semi-supervised work for the multi-label classification of accounts describing any type(s) of sexism wherein the approach goes beyond merely fine-tuning pre-trained models using unlabeled data. We devise self-training based techniques tailor-made for the multi-label nature of the problem to utilize unlabeled samples for augmenting the labeled set. We identify high textual diversity with respect to the existing labeled set as a desirable quality for candidate unlabeled instances and develop methods for incorporating it into our approach. We also explore ways of infusing class imbalance alleviation for multi-label classification into our semi-supervised learning, independently and in conjunction with the method involving diversity. Several proposed methods outperform a variety of baselines on a recently released dataset for multi-label sexism categorization across several standard metrics.
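The self-training core of such an approach can be sketched schematically as below. The base model, thresholds and toy data are illustrative assumptions; the paper's variants additionally factor in textual diversity of candidates and class-imbalance alleviation.

```python
# Schematic self-training loop for multi-label classification: train on the
# labeled set, pseudo-label confident unlabeled samples, grow the labeled set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def self_train(X_lab, Y_lab, X_unlab, rounds=3, threshold=0.9):
    X, Y = X_lab.copy(), Y_lab.copy()
    for _ in range(rounds):
        clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
        if len(X_unlab) == 0:
            break
        probs = clf.predict_proba(X_unlab)  # (n_unlab, n_labels)
        # Naive confidence rule for illustration: any label above threshold.
        confident = np.max(probs, axis=1) >= threshold
        if not confident.any():
            break
        pseudo = (probs[confident] >= 0.5).astype(int)
        X = np.vstack([X, X_unlab[confident]])
        Y = np.vstack([Y, pseudo])
        X_unlab = X_unlab[~confident]
    return clf

# Toy usage: 20-dim features, 3 labels.
rng = np.random.default_rng(0)
X_lab, Y_lab = rng.normal(size=(50, 20)), rng.integers(0, 2, size=(50, 3))
X_unlab = rng.normal(size=(200, 20))
model = self_train(X_lab, Y_lab, X_unlab)
```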
Semi-supervised Multi-task Learning for Multi-label Fine-grained Sexism Classification
HARIKA ABBURI,PARIKH PULKIT TRUSHANT KUMAR,Niyati Chhaya,Vasudeva Varma Kalidindi
International Conference on Computational Linguistics, COLING, 2020
@inproceedings{bib_Semi_2020, AUTHOR = {HARIKA ABBURI, PARIKH PULKIT TRUSHANT KUMAR, Niyati Chhaya, Vasudeva Varma Kalidindi}, TITLE = {Semi-supervised Multi-task Learning for Multi-label Fine-grained Sexism Classification}, BOOKTITLE = {International Conference on Computational Linguistics}. YEAR = {2020}}
Sexism, a form of oppression based on one’s sex, manifests itself in numerous ways and causes enormous suffering. In view of the growing number of experiences of sexism reported online, categorizing these recollections automatically can assist the fight against sexism, as it can facilitate effective analyses by gender studies researchers and government officials involved in policy making. In this paper, we investigate the fine-grained, multi-label classification of accounts (reports) of sexism. To the best of our knowledge, we work with considerably more categories of sexism than any published work through our 23-class problem formulation. Moreover, we propose a multi-task approach for fine-grained multi-label sexism classification that leverages several supporting tasks without incurring any manual labeling cost. Unlabeled accounts of sexism are utilized through unsupervised learning to help construct our multi-task setup. We also devise objective functions that exploit label correlations in the training data explicitly. Multiple proposed methods outperform the state-of-the-art for multi-label sexism classification on a recently released dataset across five standard metrics.
Predicting Clickbait Strength in Online Social Media
I VIJAYASARADHI,Bakhtiyar Hussain Syed,Manish,Vasudeva Varma Kalidindi
International Conference on Computational Linguistics, COLING, 2020
@inproceedings{bib_Pred_2020, AUTHOR = {I VIJAYASARADHI, Bakhtiyar Hussain Syed, Manish, Vasudeva Varma Kalidindi}, TITLE = {Predicting Clickbait Strength in Online Social Media}, BOOKTITLE = {International Conference on Computational Linguistics}. YEAR = {2020}}
Hoping for a large number of clicks and potentially high social shares, journalists of various news media outlets publish sensationalist headlines on social media. These headlines lure the readers to click on them and satisfy the curiosity gap in their mind. Low quality material pointed to by clickbaits leads to time wastage and annoyance for users. Even for enterprises publishing clickbaits, it hurts more than it helps as it erodes user trust, attracts wrong visitors, and produces negative signals for ranking algorithms. Hence, identifying and flagging clickbait titles is very essential. Previous work on clickbaits has majorly focused on binary classification of clickbait titles. However, not all clickbaits are equally clickbaity. It is essential not only to identify a clickbait, but also to quantify its intensity. In this work, we model clickbait strength prediction as a regression problem. While previous methods have relied on traditional machine learning or vanilla recurrent neural networks, we rigorously investigate the use of transformers for clickbait strength prediction. On a benchmark dataset with ∼39K posts, our methods outperform all the existing methods in the Clickbait Challenge.
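Framing clickbait strength as regression with a Transformer can be sketched as below. This is a minimal sketch assuming Hugging Face Transformers; the model choice, the example post and the gold strength value are invented, not the paper's setup.

```python
# With num_labels=1 and a float label, Hugging Face sequence-classification
# models compute an MSE loss, i.e., they act as regressors.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-uncased"  # illustrative backbone
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)

batch = tok(["You won't believe what happened next!"], return_tensors="pt",
            truncation=True, padding=True)
labels = torch.tensor([[0.8]])  # hypothetical gold clickbait strength in [0, 1]
out = model(**batch, labels=labels)
print(float(out.loss), float(out.logits))  # MSE loss and predicted strength
```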
SIS@ IIITH at SemEval-2020 Task 8: An Overview of Simple Text Classification Methods for Meme Analysis
Sravani Boinepelli,Manish Srivastava,Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2020
@inproceedings{bib_SIS@_2020, AUTHOR = {Sravani Boinepelli, Manish Srivastava, Vasudeva Varma Kalidindi}, TITLE = {SIS@ IIITH at SemEval-2020 Task 8: An Overview of Simple Text Classification Methods for Meme Analysis}, BOOKTITLE = {International Workshop on Semantic Evaluation}. YEAR = {2020}}
Memes are steadily taking over the feeds of the public on social media. There is always the threat of malicious users on the internet posting offensive content, even through memes. Hence, the automatic detection of offensive images/memes is imperative along with detection of offensive text. However, this is a much more complex task as it involves both visual cues as well as language understanding and cultural/context knowledge. This paper describes our approach to the task of SemEval-2020 Task 8: Memotion Analysis. We chose to participate only in Task A which dealt with Sentiment Classification, which we formulated as a text classification problem. Through our experiments, we explored multiple training models to evaluate the performance of simple text classification algorithms on the raw text obtained after running OCR on meme images. Our submitted model achieved an accuracy of 72.69% and exceeded the existing baseline’s Macro F1 score by 8% on the official test dataset. Apart from describing our official submission, we shall elucidate how different classification models respond to this task.
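The "OCR then text classification" formulation can be sketched as a stand-in pipeline like the one below; the OCR tool, file name, training texts and labels are all invented placeholders, and the paper's actual classifiers may differ.

```python
# OCR a meme image with pytesseract (requires a local Tesseract install),
# then classify the extracted text with TF-IDF plus a linear model.
import pytesseract
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

text = pytesseract.image_to_string(Image.open("meme.png"))  # placeholder file

# Toy training corpus of OCR'd meme texts with sentiment labels.
train_texts = ["when monday hits you hard", "wholesome doggo made my day"]
train_labels = ["negative", "positive"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)
print(clf.predict([text]))
```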
Rehoboam at the NTCIR-15 SHINRA2020-ML Task
Tushar Abhishek,AYUSH AGGARWAL,Anubhav Sharma,Vasudeva Varma Kalidindi,Manish
Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access, NTCIR, 2020
@inproceedings{bib_Reho_2020, AUTHOR = {Tushar Abhishek, AYUSH AGGARWAL, Anubhav Sharma, Vasudeva Varma Kalidindi, Manish}, TITLE = {Rehoboam at the NTCIR-15 SHINRA2020-ML Task}, BOOKTITLE = {Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access}. YEAR = {2020}}
Maintaining a unified ontology across various languages is expected to result in effective and consistent organization of Wikipedia entities. Such organization of the Wikipedia knowledge base (KB) will in turn improve the effectiveness of various KB-oriented multilingual downstream tasks like entity linking, question answering, fact checking, etc. As a first step toward a unified ontology, it is important to classify Wikipedia entities into consistent fine-grained categories across 30 languages. While there is existing work on fine-grained entity categorization for rich-resource languages, there is hardly any such work for consistent classification across multiple low-resource languages. Wikipedia webpage format variations, content imbalance per page, and imbalance with respect to categories across languages make the problem challenging. We model this problem as a document classification task. We propose a novel architecture, RNN_GNN_XLM-R, which leverages the strengths of various popular deep learning architectures. Across ten participant teams at the NTCIR-15 Shinra 2020-ML Classification Task, our proposed model stands second in the overall evaluation.
Semantic Textual Similarity of Sentences with Emojis
Alok Debnath,Nikhil Pinnaparaju,Manish Srivastava,Vasudeva Varma Kalidindi,Isabelle Augenstein
International Conference on World wide web, WWW, 2020
@inproceedings{bib_Sema_2020, AUTHOR = {Alok Debnath, Nikhil Pinnaparaju, Manish Srivastava, Vasudeva Varma Kalidindi, Isabelle Augenstein}, TITLE = {Semantic Textual Similarity of Sentences with Emojis}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2020}}
In this paper, we extend the task of semantic textual similarity to include sentences which contain emojis. Emojis are ubiquitous on social media today, but are often removed in the pre-processing stage of curating datasets for NLP tasks. In this paper, we qualitatively ascertain the amount of semantic information lost by discounting emojis, as well as show a mechanism of accounting for emojis in a semantic task. We create a sentence similarity dataset of 4000 pairs of tweets with emojis, which have been annotated for relatedness. The corpus contains tweets curated based on common topic as well as by replacement of emojis. The latter was done to analyze the difference in semantics associated with different emojis. We aim to provide an understanding of the information lost by removing emojis by providing a qualitative analysis of the dataset. We also aim to present a method of using both emojis and words for downstream NLP tasks beyond sentiment analysis.
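One simple way to keep emoji semantics in a text-similarity pipeline, rather than stripping emojis in preprocessing, is to replace them with textual descriptions before encoding. The sketch below uses the `emoji` package's demojize as an example mechanism; the tweets are invented, and this is not necessarily the paper's own processing.

```python
# Replace emojis with their names so a downstream sentence encoder can see
# the affective difference between otherwise identical texts.
import emoji

t1 = "had the best day at the beach \U0001F60A"  # smiling face
t2 = "had the best day at the beach \U0001F622"  # crying face

print(emoji.demojize(t1))
print(emoji.demojize(t2))
# After demojizing, the two tweets are no longer identical, unlike after
# plain emoji removal, and can be fed to any sentence similarity model.
```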
Medical Concept Normalization by Encoding Target Knowledge
P NIKHIL PRIYATAM,Sangameshwar Patil,Girish Palshikar,Vasudeva Varma Kalidindi
Neural Information Processing Systems Workshops, NeurIPS-W, 2020
@inproceedings{bib_Medi_2020, AUTHOR = {P NIKHIL PRIYATAM, Sangameshwar Patil, Girish Palshikar, Vasudeva Varma Kalidindi}, TITLE = {Medical Concept Normalization by Encoding Target Knowledge}, BOOKTITLE = {Neural Information Processing Systems Workshops}. YEAR = {2020}}
Medical concept normalization aims to map a variable-length message such as ‘unable to sleep’ to an entry in a target medical lexicon, such as ‘Insomnia’. Current approaches formulate medical concept normalization as a supervised text classification problem. This formulation has several drawbacks. First, creating training data requires manually mapping medical concept mentions to their corresponding entries in a target lexicon. Second, these models fail to map a mention to target concepts which were not encountered during the training phase. Lastly, these models have to be retrained from scratch whenever new concepts are added to the target lexicon. In this work, we propose a method which overcomes these limitations. We first use various text and graph embedding methods to encode medical concepts into an embedding space. We then train a model which transforms concept mentions into vectors in this target embedding space. Finally, we use cosine similarity to find the nearest medical concept to a given input medical concept mention. Our model scales to millions of target concepts and trivially accommodates a growing target lexicon without incurring significant computational cost. Experimental results show that our model outperforms the previous state-of-the-art by 4.2% and 6.3% classification accuracy across two benchmark datasets. We also present a variety of studies to evaluate the robustness of our model under different training conditions.
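The retrieval formulation can be sketched as below: concepts live in a fixed embedding matrix, a small network maps a mention vector into that space, and the nearest concept by cosine similarity is returned. Dimensions, the random concept matrix and the projection network are illustrative stand-ins, not the paper's actual encoders.

```python
# Nearest-concept retrieval over a fixed concept embedding table.
import torch
import torch.nn.functional as F

num_concepts, concept_dim, mention_dim = 1000, 128, 768
concept_emb = F.normalize(torch.randn(num_concepts, concept_dim), dim=1)

projector = torch.nn.Sequential(
    torch.nn.Linear(mention_dim, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, concept_dim),
)

def nearest_concept(mention_vec):
    z = F.normalize(projector(mention_vec), dim=-1)
    sims = concept_emb @ z          # cosine similarities to all concepts
    return int(torch.argmax(sims))  # index of the closest concept

print(nearest_concept(torch.randn(mention_dim)))
# New concepts only extend concept_emb; no classifier retraining is forced by
# a growing lexicon, which is what lets the approach scale.
```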
Extracting Message Sequence Charts from Hindi Narrative Text
Swapnil Hingmire,Nitin Ramrakhiyani,Avinash Kumar Singh,Sangameshwar Patil,Girish K. Palshikar,Pushpak Bhattacharyya,Vasudeva Varma Kalidindi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2020
@inproceedings{bib_Extr_2020, AUTHOR = {Swapnil Hingmire, Nitin Ramrakhiyani, Avinash Kumar Singh, Sangameshwar Patil, Girish K. Palshikar, Pushpak Bhattacharyya, Vasudeva Varma Kalidindi}, TITLE = {Extracting Message Sequence Charts from Hindi Narrative Text}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2020}}
In this paper, we propose the use of Message Sequence Charts (MSC) as a representation for visualizing narrative text in Hindi. An MSC is a formal representation allowing the depiction of actors and interactions among these actors in a scenario, apart from supporting a rich framework for formal inference. We propose an approach to extract MSC actors and interactions from a Hindi narrative. As a part of the approach, we enrich an existing event annotation scheme where we provide guidelines for annotation of the mood of events (realis vs irrealis) and guidelines for annotation of event arguments. We report performance on multiple evaluation criteria by experimenting with Hindi narratives from Indian History. Though Hindi is the fourth most-spoken first language in the world, from the NLP perspective it has comparatively lesser resources than English. Moreover, there is relatively less work in the context of event processing in Hindi. Hence, we believe that this work is among the initial works for Hindi event processing.
Adapting Language Models for Non-Parallel Author-Stylized Rewriting
Bakhtiyar Hussain Syed,Gaurav Verma,Balaji Vasan Srinivasan,Anandhavelu Natarajan,Vasudeva Varma Kalidindi
American Association for Artificial Intelligence, AAAI, 2020
@inproceedings{bib_Adap_2020, AUTHOR = {Bakhtiyar Hussain Syed, Gaurav Verma, Balaji Vasan Srinivasan, Anandhavelu Natarajan, Vasudeva Varma Kalidindi}, TITLE = {Adapting Language Models for Non-Parallel Author-Stylized Rewriting}, BOOKTITLE = {American Association for Artificial Intelligence}. YEAR = {2020}}
Given the recent progress in language modeling using Transformer-based neural models and an active interest in generating stylized text, we present an approach to leverage the generalization capabilities of a language model to rewrite an input text in a target author’s style. Our proposed approach adapts a pre-trained language model to generate author-stylized text by fine-tuning on the author-specific corpus using a denoising autoencoder (DAE) loss in a cascaded encoder-decoder framework. Optimizing over DAE loss allows our model to learn the nuances of an author’s style without relying on parallel data, which has been a severe limitation of the previous related works in this space. To evaluate the efficacy of our approach, we propose a linguistically motivated framework to quantify stylistic alignment of the generated text to the target author at lexical, syntactic and surface levels. The evaluation framework is both interpretable as it leads to several insights about the model, and self-contained as it does not rely on external classifiers, e.g. sentiment or formality classifiers. Qualitative and quantitative assessment indicates that the proposed approach rewrites the input text with better alignment to the target style while preserving the original content better than state-of-the-art baselines.
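The denoising side of the DAE objective can be illustrated with a generic noise function of the kind commonly used for non-parallel text rewriting (token dropout plus local shuffling); this is a sketch under that assumption, not the paper's exact corruption scheme.

```python
# Corrupt a token sequence; training pairs are (noise(x), x), so reconstruction
# on author-specific text teaches the decoder the author's style without
# parallel data.
import random

def noise(tokens, drop_prob=0.1, shuffle_window=3):
    kept = [t for t in tokens if random.random() > drop_prob]  # word dropout
    # Local shuffle: each position is perturbed by less than shuffle_window.
    keyed = sorted(range(len(kept)),
                   key=lambda i: i + random.uniform(0, shuffle_window))
    return [kept[i] for i in keyed]

sentence = "it was the best of times it was the worst of times".split()
print(noise(sentence))
```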
Extraction of Message Sequence Charts from Software Use-Case Descriptions
Girish Keshav Palshikar,Nitin Ramrakhiyani,Sangameshwar Patil,Sachin Pawar,Swapnil Hingmire,Vasudeva Varma Kalidindi,Pushpak Bhattacharyya
North American Association for Computational Linguistics, NAACL, 2019
@inproceedings{bib_Extr_2019, AUTHOR = {Girish Keshav Palshikar, Nitin Ramrakhiyani, Sangameshwar Patil, Sachin Pawar, Swapnil Hingmire, Vasudeva Varma Kalidindi, Pushpak Bhattacharyya}, TITLE = {Extraction of Message Sequence Charts from Software Use-Case Descriptions}, BOOKTITLE = {North American Association for Computational Linguistics}. YEAR = {2019}}
Software Requirement Specification documents provide natural language descriptions of the core functional requirements as a set of use-cases. Essentially, each use-case contains a set of actors and sequences of steps describing the interactions among them. Goals of use-case reviews and analyses include their correctness, completeness, detection of ambiguities, prototyping, verification, test case generation and traceability. Message Sequence Charts (MSCs) have been proposed as an expressive, rigorous yet intuitive visual representation of use-cases. In this paper, we describe a linguistic knowledge-based approach to extract MSCs from use-cases. Compared to existing techniques, we extract richer constructs of the MSC notation such as timers, conditions and alt-boxes. We apply this tool to extract MSCs from several real-life software use-case descriptions and show that it performs better than the existing techniques. We also discuss the benefits and limitations of the extracted MSCs to meet the above goals.
Helium @ CL-SciSumm-19 : Transfer learning for effective scientific research comprehension
Bakhtiyar Syed,Vijayasaradhi Indurthi,Balaji Vasan Srinivasan,Vasudeva Varma Kalidindi
Joint Workshop on Bibliometric-Enhanced Information Retrieval and Natural Language Processing for Digital Libraries, BIRNDL, 2019
@inproceedings{bib_Heli_2019, AUTHOR = {Bakhtiyar Syed, Vijayasaradhi Indurthi, Balaji Vasan Srinivasan, Vasudeva Varma Kalidindi}, TITLE = {Helium @ CL-SciSumm-19 : Transfer learning for effective scientific research comprehension}, BOOKTITLE = {Joint Workshop on Bibliometric-Enhanced Information Retrieval and Natural Language Processing for Digital Libraries}. YEAR = {2019}}
Automatic research paper summarization is a fairly interesting topic that has garnered significant interest in the research community in recent years. In this paper, we introduce team Helium’s system description for the CL-SciSumm shared task colocated with SIGIR 2019. We specifically attempt the first task, which targets building an improved system for recall of reference text spans for a given citing research paper (Task 1A) and constructing better models for comprehension of scientific facets (Task 1B). Our architecture incorporates transfer learning by utilising a combination of pretrained embeddings which are subsequently used for building models for the given tasks. In particular, for Task 1A, we locate the related text spans referred to by the citation text by creating paired text representations and employ pre-trained embedding mechanisms in conjunction with XGBoost, a gradient boosted decision tree algorithm, to identify textual entailment. For Task 1B, we make use of the same pretrained embeddings and use the RAKEL algorithm for multi-label classification. Our goal is to enable better scientific research comprehension and we believe that a new approach involving transfer learning will certainly add value to the research community working on these tasks.
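The Task 1A recipe, paired text representations fed to a gradient boosted tree classifier, can be sketched as below. The pairing scheme, feature dimensions and synthetic data are assumptions for illustration; the paper combines several pretrained embedding mechanisms.

```python
# Represent a (citation text, candidate span) pair as one feature vector
# built from sentence embeddings, then train XGBoost on entailment labels.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
dim, n = 384, 500
cite_vecs, span_vecs = rng.normal(size=(n, dim)), rng.normal(size=(n, dim))

# A common pairing scheme: [u; v; |u - v|; u * v]
X = np.hstack([cite_vecs, span_vecs,
               np.abs(cite_vecs - span_vecs), cite_vecs * span_vecs])
y = rng.integers(0, 2, size=n)  # 1 = span is referred to by the citation

clf = XGBClassifier(n_estimators=100, max_depth=4)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))
```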
Multi-label Categorization of Accounts of Sexism using a Neural Framework
PARIKH PULKIT TRUSHANT KUMAR,Harika Abburi,PINKESH BADJATIYA,Radhika Krishnan,Niyati Chhaya,Manish,Vasudeva Varma Kalidindi
Conference on Empirical Methods in Natural Language Processing, EMNLP, 2019
@inproceedings{bib_Mult_2019, AUTHOR = {PARIKH PULKIT TRUSHANT KUMAR, Harika Abburi, PINKESH BADJATIYA, Radhika Krishnan, Niyati Chhaya, Manish, Vasudeva Varma Kalidindi}, TITLE = {Multi-label Categorization of Accounts of Sexism using a Neural Framework}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}. YEAR = {2019}}
Sexism, an injustice that subjects women and girls to enormous suffering, manifests in blatant as well as subtle ways. In the wake of growing documentation of experiences of sexism on the web, the automatic categorization of accounts of sexism has the potential to assist social scientists and policy makers in utilizing such data to study and counter sexism better. The existing work on sexism classification, which is different from sexism detection, has certain limitations in terms of the categories of sexism used and/or whether they can co-occur. To the best of our knowledge, this is the first work on the multi-label classification of sexism of any kind(s), and we contribute the largest dataset for sexism categorization. We develop a neural solution for this multi-label classification that can combine sentence representations obtained using models such as BERT with distributional and linguistic word embeddings using a flexible, hierarchical architecture involving recurrent components and optional convolutional ones. Further, we leverage unlabeled accounts of sexism to infuse domain-specific elements into our framework. The best proposed method outperforms several deep learning as well as traditional machine learning baselines by an appreciable margin.
Fermi at SemEval-2019 Task 6: Identifying and categorizing offensive language in social media using sentence embeddings
I VIJAYASARADHI,Bakhtiyar Hussain Syed,Manish Srivastava,Manish Gupta,Vasudeva Varma Kalidindi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Ferm_2019, AUTHOR = {I VIJAYASARADHI, Bakhtiyar Hussain Syed, Manish Srivastava, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Fermi at SemEval-2019 Task 6: Identifying and categorizing offensive language in social media using sentence embeddings}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
This paper describes our system (Fermi) for Task 6: OffensEval: Identifying and Categorizing Offensive Language in Social Media of SemEval-2019. We participated in all three sub-tasks within Task 6. We evaluate multiple sentence embeddings in conjunction with various supervised machine learning algorithms and evaluate the performance of simple yet effective embedding-ML combination algorithms. Our team Fermi’s model achieved an F1-score of 64.40%, 62.00% and 62.60% for sub-tasks A, B and C respectively on the official leaderboard. Our model for sub-task C, which uses pre-trained ELMo embeddings for transforming the input and SVM (RBF kernel) for training, scored third position on the official leaderboard. Through the paper, we provide a detailed description of the approach, as well as the results obtained for the task.
Fermi at semeval-2019 task 5: Using sentence embeddings to identify hate speech against immigrants and women in twitter
I VIJAYASARADHI,Bakhtiyar Hussain Syed,Manish Srivastava,Nikhil Chakravartula,Manish Gupta,Vasudeva Varma Kalidindi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Ferm_2019, AUTHOR = {I VIJAYASARADHI, Bakhtiyar Hussain Syed, Manish Srivastava, Nikhil Chakravartula, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Fermi at semeval-2019 task 5: Using sentence embeddings to identify hate speech against immigrants and women in twitter}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
This paper describes our system (Fermi) for Task 5 of SemEval-2019: HatEval: Multilingual Detection of Hate Speech Against Immigrants and Women on Twitter. We participated in Subtask A for English and ranked first in the evaluation on the test set. We evaluate the quality of multiple sentence embeddings and explore multiple training models to evaluate the performance of simple yet effective embedding-ML combination algorithms. Our team Fermi’s model achieved an accuracy of 65.00% for the English language in Subtask A. Our models, which use pretrained Universal Sentence Encoder embeddings for transforming the input and SVM (with RBF kernel) for classification, scored first position (among 68) on the leaderboard on the test set for Subtask A in the English language. In this paper, we provide a detailed description of the approach, as well as the results obtained in the task.
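The embedding-plus-classical-ML recipe shared by these Fermi systems is sketched below. The paper uses the Universal Sentence Encoder; the sketch substitutes sentence-transformers purely as a convenient stand-in, and the texts and labels are invented.

```python
# Pretrained sentence embeddings as features, SVM with an RBF kernel on top.
from sentence_transformers import SentenceTransformer
from sklearn.svm import SVC

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for USE
texts = ["example hateful tweet", "example neutral tweet",
         "another hateful one", "another harmless one"]
labels = [1, 0, 1, 0]

X = encoder.encode(texts)
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(encoder.encode(["some new tweet"])))
```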
Fermi at SemEval-2019 task 8: An elementary but effective approach to question discernment in community qa forums
Bakhtiyar Hussain Syed,I VIJAYASARADHI,Manish Srivastava,Manish Gupta,Vasudeva Varma Kalidindi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Ferm_2019, AUTHOR = {Bakhtiyar Hussain Syed, I VIJAYASARADHI, Manish Srivastava, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Fermi at SemEval-2019 task 8: An elementary but effective approach to question discernment in community qa forums}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
Online Community Question Answering Forums (cQA) have gained massive popularity within recent years. The rise in users of such forums has led to an increase in the need for automated evaluation of question comprehension and fact evaluation of the answers provided by various participants in the forum. Our team, Fermi, participated in sub-task A of Task 8 at SemEval 2019, which tackles the first problem in the pipeline of factual evaluation in cQA forums, i.e., deciding whether a posed question asks for factual information, an opinion/advice or is just socializing. This information is highly useful in segregating factual questions from non-factual ones, which helps greatly in organizing the questions into useful categories and trims down the problem space for the next task in the pipeline: fact evaluation among the available answers. Our system uses the embeddings obtained from the Universal Sentence Encoder combined with XGBoost for the classification sub-task A. We also evaluate other combinations of embeddings and off-the-shelf machine learning algorithms to demonstrate the efficacy of the various representations and their combinations. Our system achieved an accuracy of 84% on the evaluation test set and received the first position in the final standings judged by the organizers.
Ingredients for Happiness: Modeling constructs via semi-supervised content driven inductive transfer learning
Bakhtiyar Hussain Syed,I VIJAYASARADHI,Shah Kulin Nitinkumar,Manish Gupta,Vasudeva Varma Kalidindi
American Association for Artificial Intelligence Workshops, AAAI-W, 2019
@inproceedings{bib_Ingr_2019, AUTHOR = {Bakhtiyar Hussain Syed, I VIJAYASARADHI, Shah Kulin Nitinkumar, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Ingredients for Happiness: Modeling constructs via semi-supervised content driven inductive transfer learning}, BOOKTITLE = {American Association for Artificial Intelligence Workshops}. YEAR = {2019}}
Modeling affect via understanding the social constructs behind it is an important task in devising robust and accurate systems for socially relevant scenarios. In the CL-Aff Shared Task (part of the Affective Content Analysis workshop @ AAAI 2019), the organizers released a dataset of ‘happy’ moments, called the HappyDB corpus. The task is to detect two social constructs: the agency (i.e., whether the author is in control of the happy moment) and the social characteristics (i.e., whether anyone else other than the author was also involved in the happy moment). We employ an inductive transfer learning technique where we utilize a pre-trained language model and fine-tune it on the target task for both the binary classification tasks. At first, we use a language model pre-trained on the huge WikiText-103 corpus. This step utilizes an AWD-LSTM with three hidden layers for training the language model. In the second step, we fine-tune the pre-trained language model on both the labeled and unlabeled instances from the HappyDB dataset. Finally, we train a classifier on top of the language model for each of the identification tasks. Our experiments using 10-fold cross validation on the corpus show that we achieve a high accuracy of ∼93% for detection of the social characteristic and ∼87% for agency of the author, showing significant gains over other baselines. We also show that using the unlabeled dataset for fine-tuning the language model in the second step improves our accuracy by 1-2% across detection of both the constructs.
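The three-step ULMFiT-style recipe can be sketched as below. The paper uses an AWD-LSTM; a small GPT-2 stands in here purely for illustration, and the domain text is a toy example.

```python
# (1) Start from a pretrained LM; (2) continue LM training on task-domain
# text (labeled + unlabeled); (3) train a classifier head on its features.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("distilgpt2")
lm = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Step 2: one illustrative LM fine-tuning step on domain text.
batch = tok(["I finally finished my first marathon today"], return_tensors="pt")
loss = lm(**batch, labels=batch["input_ids"]).loss
loss.backward()

# Step 3: a classifier over the LM's final hidden states (mean-pooled).
hidden = lm.transformer(**batch).last_hidden_state.mean(dim=1)
head = torch.nn.Linear(hidden.size(-1), 2)  # e.g., one binary construct
print(head(hidden).shape)
```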
clstk: The Cross-lingual Summarization Tool-kit
JHAVERI NISARG KETAN,Manish Gupta,Vasudeva Varma Kalidindi
International conference on Web search and Data Mining, WSDM, 2019
@inproceedings{bib_clst_2019, AUTHOR = {JHAVERI NISARG KETAN, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {clstk: The Cross-lingual Summarization Tool-kit}, BOOKTITLE = {International conference on Web search and Data Mining}. YEAR = {2019}}
Cross-lingual summarization (CLS) aims to create summaries in a target language, from a document or document set given in a different, source language. Cross-lingual summarization can play a critical role in enabling cross-lingual information access for millions of people across the globe who do not speak or understand languages having large representation on the web. It can also make documents originally published in local languages quickly accessible to a large audience which does not understand those local languages. Though cross-lingual summarization has gathered some attention in the last decade, there has been no serious effort to publish rigorous software for this task. In this paper, we provide a design for an end-to-end CLS software called clstk. Besides implementing a number of methods proposed by different CLS researchers over years, the software integrates multiple components critical for CLS. We hope that this extremely modular tool-kit will help CLS researchers to contribute more effectively to the area.
Inductive Transfer Learning for Detection of Well-formed Natural Language Search Queries
Bakhtiyar Hussain Syed,I VIJAYASARADHI,Manish Gupta,Manish Srivastava,Vasudeva Varma Kalidindi
European Conference on Information Retrieval, ECIR, 2019
@inproceedings{bib_Indu_2019, AUTHOR = {Bakhtiyar Hussain Syed, I VIJAYASARADHI, Manish Gupta, Manish Srivastava, Vasudeva Varma Kalidindi}, TITLE = {Inductive Transfer Learning for Detection of Well-formed Natural Language Search Queries}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2019}}
Users have been trained to type keyword queries on search engines. However, recently there has been a significant rise in the number of verbose queries. Oftentimes, such queries are not well-formed. The lack of well-formedness in the query might adversely impact the downstream pipeline which processes these queries. A well-formed natural language question as a search query aids heavily in reducing errors in downstream tasks and further helps in improved query understanding. In this paper, we employ an inductive transfer learning technique by fine-tuning a pretrained language model to identify whether a search query is a well-formed natural language question or not. We show that our model trained on a recently released benchmark dataset spanning 25,100 queries gives an accuracy of 75.03%, thereby improving by ∼5 absolute percentage points over the state-of-the-art.
Domain Adaptive Neural Sentence Compression by Tree Cutting
LITTON J KURISINKEL,Yue Zhang,Vasudeva Varma Kalidindi
European Conference on Information Retrieval, ECIR, 2019
@inproceedings{bib_Doma_2019, AUTHOR = {LITTON J KURISINKEL, Yue Zhang, Vasudeva Varma Kalidindi}, TITLE = {Domain Adaptive Neural Sentence Compression by Tree Cutting}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2019}}
Sentence compression has traditionally been tackled as syntactic tree pruning, where rules and statistical features are defined for pruning less relevant words. Recent years have witnessed the rise of neural models that do not leverage syntax trees, learning sentence representations automatically and pruning words from such representations. We investigate syntax tree based noise pruning methods for neural sentence compression. Our method identifies the most informative regions in a syntactic dependency tree by self-attention over context nodes and maximum density subtree extraction. Empirical results show that the model outperforms the state-of-the-art methods in terms of both accuracy and F1-measure. The model also yields comparable readability and informativeness as assessed by human evaluators.
A Simple Neural Approach to Spatial Role Labelling
Nitin Ramrakhiyani,Girish .K.Palshikar,Vasudeva Varma Kalidindi
European Conference on Information Retrieval, ECIR, 2019
@inproceedings{bib_A_Si_2019, AUTHOR = {Nitin Ramrakhiyani, Girish .K.Palshikar, Vasudeva Varma Kalidindi}, TITLE = {A Simple Neural Approach to Spatial Role Labelling}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2019}}
Spatial Role Labelling involves identification of text segments which emit spatial semantics such as describing an object of interest, a reference point or the object’s relative position with the reference. Tasks in SemEval exercises of 2012 and 2013 propose problems and datasets for Spatial Role Labelling. In this paper, we propose a simple two-step neural network based approach to identify static spatial relations along with the three primary roles - Trajector, Landmark and Spatial Indicator. Our approach outperforms the task submission results and other state-of-the-art results on these datasets. We also include a discussion on the explainability of our model.
Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations
PINKESH BADJATIYA,Manish Gupta,Vasudeva Varma Kalidindi
International Conference on World wide web, WWW, 2019
@inproceedings{bib_Ster_2019, AUTHOR = {PINKESH BADJATIYA, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2019}}
With the ever-increasing cases of hate spread on social media platforms, it is critical to design abuse detection mechanisms to proactively avoid and control such incidents. While there exist methods for hate speech detection, they stereotype words and hence suffer from inherently biased training. Bias removal has been traditionally studied for structured datasets, but we aim at bias mitigation from unstructured text data. In this paper, we make two important contributions. First, we systematically design methods to quantify the bias for any model and propose algorithms for identifying the set of words which the model stereotypes. Second, we propose novel methods leveraging knowledge-based generalizations for bias-free learning. Knowledge-based generalization provides an effective way to encode knowledge because the abstraction they provide not only generalizes content but also facilitates retraction of information from the hate speech detection classifier, thereby reducing the imbalance. We experiment with multiple knowledge generalization policies and analyze their effect on general performance and in mitigating bias. Our experiments with two real-world datasets, a Wikipedia Talk Pages dataset (WikiDetox) of size ∼96k and a Twitter dataset of size ∼24k, show that the use of knowledge-based generalizations results in better performance by forcing the classifier to learn from generalized content. Our methods utilize existing knowledge-bases and can easily be extended to other tasks.
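A knowledge-based generalization policy of the kind described here can be illustrated as below, using WordNet hypernyms as an example knowledge source; the helper function and word choices are illustrative assumptions, not the paper's exact policies.

```python
# Replace a word with a more abstract ancestor from a lexical knowledge base,
# so the classifier learns from generalized content instead of the surface
# (possibly stereotyped) word.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def generalize(word, levels=1):
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return word
    node = synsets[0]
    for _ in range(levels):
        hypernyms = node.hypernyms()
        if not hypernyms:
            break
        node = hypernyms[0]
    return node.lemmas()[0].name().replace("_", " ")

print(generalize("apple"))     # climbs one hypernym level, e.g. a fruit class
print(generalize("apple", 2))  # climbs two levels toward more abstract nodes
```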
MVAE: Multimodal Variational Autoencoder for Fake News Detection
DHRUV KHATTAR,JAIPAL SINGH GOUD,Manish Gupta,Vasudeva Varma Kalidindi
International Conference on World wide web, WWW, 2019
@inproceedings{bib_MVAE_2019, AUTHOR = {DHRUV KHATTAR, JAIPAL SINGH GOUD, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {MVAE: Multimodal Variational Autoencoder for Fake News Detection}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2019}}
In recent times, fake news and misinformation have had a disruptive and adverse impact on our lives. Given the prominence of microblogging networks as a source of news for most individuals, fake news now spreads at a faster pace and has a more profound impact than ever before. This makes detection of fake news an extremely important challenge. Fake news articles, just like genuine news articles, leverage multimedia content to manipulate user opinions but spread misinformation. A shortcoming of the current approaches for the detection of fake news is their inability to learn a shared representation of multimodal (textual + visual) information. We propose an end-to-end network, Multimodal Variational Autoencoder (MVAE), which uses a bimodal variational autoencoder coupled with a binary classifier for the task of fake news detection. The model consists of three main components, an encoder, a decoder and a fake news detector module. The variational autoencoder is capable of learning probabilistic latent variable models by optimizing a bound on the marginal likelihood of the observed data. The fake news detector then utilizes the multimodal representations obtained from the bimodal variational autoencoder to classify posts as fake or not. We conduct extensive experiments on two standard fake news datasets collected from popular microblogging websites: Weibo and Twitter. The experimental results show that across the two datasets, on average our model outperforms state-of-the-art methods by margins as large as ∼6% in accuracy and ∼5% in F1 scores.
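The shape of the MVAE objective, an ELBO over both modalities plus a classification loss on the shared latent code, can be sketched as below. The encoder/decoder bodies are reduced to toy linear layers and all sizes are invented; this is a skeleton of the loss structure, not the paper's architecture.

```python
# Shared latent code reconstructs text and image features (ELBO) while a
# detector head classifies fake vs. real from the same code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMVAE(nn.Module):
    def __init__(self, txt_dim=300, img_dim=512, z_dim=64):
        super().__init__()
        self.enc = nn.Linear(txt_dim + img_dim, 2 * z_dim)  # -> mu, logvar
        self.dec_txt = nn.Linear(z_dim, txt_dim)
        self.dec_img = nn.Linear(z_dim, img_dim)
        self.detector = nn.Linear(z_dim, 1)

    def forward(self, txt, img):
        mu, logvar = self.enc(torch.cat([txt, img], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec_txt(z), self.dec_img(z), self.detector(z), mu, logvar

def mvae_loss(model, txt, img, label):
    txt_hat, img_hat, logit, mu, logvar = model(txt, img)
    recon = F.mse_loss(txt_hat, txt) + F.mse_loss(img_hat, img)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    clf = F.binary_cross_entropy_with_logits(logit.squeeze(-1), label)
    return recon + kl + clf

model = TinyMVAE()
loss = mvae_loss(model, torch.randn(8, 300), torch.randn(8, 512),
                 torch.randint(0, 2, (8,)).float())
loss.backward()
```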
A Distant Supervision Based Approach to Medical Persona Classification
P NIKHIL PRIYATAM,Ponnurangam Kumaraguru,Vasudeva Varma Kalidindi
Journal of biomedical informatics, JOBM, 2019
@inproceedings{bib_A_Di_2019, AUTHOR = {P NIKHIL PRIYATAM, Ponnurangam Kumaraguru, Vasudeva Varma Kalidindi}, TITLE = {A Distant Supervision Based Approach to Medical Persona Classification}, BOOKTITLE = {Journal of biomedical informatics}. YEAR = {2019}}
Identifying medical persona from a social media post is critical for drug marketing, pharmacovigilance and patient recruitment. Medical persona classification aims to computationally model the medical persona associated with a social media post. We present a novel deep learning model for this task which consists of two parts: Convolutional Neural Networks (CNNs), which extract highly relevant features from the sentences of a social media post and average pooling, which aggregates the sentence embeddings to obtain task-specific document embedding. We compare our approach against standard baselines, such as Term Frequency - Inverse Document Frequency (TF-IDF), averaged word embedding based methods and popular neural architectures, such as CNN-Long Short Term Memory (CNN-LSTM) and Hierarchical Attention Networks (HANs). Our model achieves an improvement of 19.7% for classification accuracy and 20.1% for micro F1 measure over the current state-of-the-art. We eliminate the need for manual labeling by employing a distant supervision based method to obtain labeled examples for training the models. We thoroughly analyze our model to discover cues that are indicative of a particular persona. Particularly, we use first derivative saliency to identify the salient words in a particular social media post.
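First-derivative saliency, used above to find the words a persona classifier relies on, can be sketched generically as below; the embedding layer and linear classifier are toy stand-ins for the paper's CNN-plus-average-pooling model.

```python
# Backpropagate the predicted class score to the input embeddings and rank
# tokens by gradient magnitude.
import torch
import torch.nn as nn

vocab, dim, n_classes = 100, 16, 4
embed = nn.Embedding(vocab, dim)
clf = nn.Linear(dim, n_classes)

tokens = torch.tensor([[5, 17, 42, 7]])
emb = embed(tokens).detach().requires_grad_(True)  # leaf we can differentiate
score = clf(emb.mean(dim=1))[0].max()              # top-class score
score.backward()

saliency = emb.grad.abs().sum(dim=-1)  # per-token importance
print(saliency)
```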
Extraction of Message Sequence Charts from Narrative History Text
Girish K. Palshikar,Sachin Pawar,Sangameshwar Patil,Swapnil Hingmire,Nitin Ramrakhiyani,Harsimran Bedi,Pushpak Bhattacharyya,Vasudeva Varma Kalidindi
Conference of the North American Chapter of the Association for Computational Linguistics Workshops, NAACL-W, 2019
@inproceedings{bib_Extr_2019, AUTHOR = {Girish K. Palshikar, Sachin Pawar, Sangameshwar Patil, Swapnil Hingmire, Nitin Ramrakhiyani, Harsimran Bedi, Pushpak Bhattacharyya, Vasudeva Varma Kalidindi}, TITLE = {Extraction of Message Sequence Charts from Narrative History Text}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
In this paper, we advocate the use of Message Sequence Chart (MSC) as a knowledge representation to capture and visualize multi-actor interactions and their temporal ordering. We propose algorithms to automatically extract an MSC from a history narrative. For a given narrative, we first identify verbs which indicate interactions and then use dependency parsing and Semantic Role Labelling based approaches to identify senders (initiating actors) and receivers (other actors involved) for these interaction verbs. As a final step in MSC extraction, we employ a state-of-the-art algorithm to temporally re-order these interactions. Our evaluation on multiple publicly available narratives shows improvements over four baselines.
Extraction of Message Sequence Charts from Software Use-Case Descriptions.
Girish K. Palshikar,Nitin Ramrakhiyani,Sangameshwar Patil,Sachin Pawar,Swapnil Hingmire,Vasudeva Varma Kalidindi,Pushpak Bhattacharyya
Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 2019
@inproceedings{bib_Extr_2019, AUTHOR = {Girish K. Palshikar, Nitin Ramrakhiyani, Sangameshwar Patil, Sachin Pawar, Swapnil Hingmire, Vasudeva Varma Kalidindi, Pushpak Bhattacharyya}, TITLE = {Extraction of Message Sequence Charts from Software Use-Case Descriptions.}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics}. YEAR = {2019}}
Software Requirement Specification documents provide natural language descriptions of the core functional requirements as a set of use-cases. Essentially, each use-case contains a set of actors and sequences of steps describing the interactions among them. Goals of use-case reviews and analyses include their correctness, completeness, detection of ambiguities, prototyping, verification, test case generation and traceability. Message Sequence Charts (MSCs) have been proposed as an expressive, rigorous yet intuitive visual representation of use-cases. In this paper, we describe a linguistic knowledge-based approach to extract MSCs from use-cases. Compared to existing techniques, we extract richer constructs of the MSC notation such as timers, conditions and alt-boxes. We apply this tool to extract MSCs from several real-life software use-case descriptions and show that it performs better than the existing techniques.
Using Sentence Embeddings to identify Hate Speech against Immigrants and Women on Twitter
I VIJAYASARADHI,Bakhtiyar Hussain Syed,Manish Srivastava,Nikhil Chakravartula,Manish Gupta,Vasudeva Varma Kalidindi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Usin_2019, AUTHOR = {I VIJAYASARADHI, Bakhtiyar Hussain Syed, Manish Srivastava, Nikhil Chakravartula, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Using Sentence Embeddings to identify Hate Speech against Immigrants and Women on Twitter}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
This paper describes our system (Fermi) for Task 5 of SemEval-2019: HatEval: Multilingual Detection of Hate Speech Against Immigrants and Women on Twitter. We participated in Subtask A for English and ranked first in the evaluation on the test set. We evaluate the quality of multiple sentence embeddings and explore multiple training models to evaluate the performance of simple yet effective embedding-ML combination algorithms. Our team Fermi’s model achieved an accuracy of 65.00% for the English language in Subtask A. Our models, which use pretrained Universal Sentence Encoder embeddings for transforming the input and SVM (with RBF kernel) for classification, scored first position (among 68) on the leaderboard on the test set for Subtask A in the English language. In this paper, we provide a detailed description of the approach, as well as the results obtained in the task.
Identifying and Categorizing Offensive Language in Social Media using Sentence Embeddings
I VIJAYASARADHI,Bakhtiyar Hussain Syed,Manish Srivastava,Manish Gupta,Vasudeva Varma Kalidindi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Iden_2019, AUTHOR = {I VIJAYASARADHI, Bakhtiyar Hussain Syed, Manish Srivastava, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Identifying and Categorizing Offensive Language in Social Media using Sentence Embeddings}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
This paper describes our system (Fermi) for Task 6: OffensEval: Identifying and Categorizing Offensive Language in Social Media of SemEval-2019. We participated in all three sub-tasks within Task 6. We evaluate multiple sentence embeddings in conjunction with various supervised machine learning algorithms and evaluate the performance of simple yet effective embedding-ML combination algorithms. Our team (Fermi)’s model achieved an F1-score of 64.40%, 62.00% and 62.60% for sub-tasks A, B and C respectively on the official leaderboard. Our model for sub-task C, which uses pretrained ELMo embeddings for transforming the input and SVM (RBF kernel) for training, scored third position on the official leaderboard. Through the paper, we provide a detailed description of the approach, as well as the results obtained for the task.
An elementary but effective approach to Question Discernment in Community QA Forums.
Bakhtiyar Hussain Syed,I VIJAYASARADHI,Manish Srivastava,Manish Gupta,Vasudeva Varma Kalidindi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_An_e_2019, AUTHOR = {Bakhtiyar Hussain Syed, I VIJAYASARADHI, Manish Srivastava, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {An elementary but effective approach to Question Discernment in Community QA Forums.}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
Online Community Question Answering Forums (cQA) have gained massive popularity within recent years. The rise in users of such forums has led to an increase in the need for automated evaluation of question comprehension and fact evaluation of the answers provided by various participants in the forum. Our team, Fermi, participated in sub-task A of Task 8 at SemEval 2019, which tackles the first problem in the pipeline of factual evaluation in cQA forums, i.e., deciding whether a posed question asks for factual information, an opinion/advice or is just socializing. This information is highly useful in segregating factual questions from non-factual ones, which helps greatly in organizing the questions into useful categories and trims down the problem space for the next task in the pipeline: fact evaluation among the available answers. Our system uses the embeddings obtained from the Universal Sentence Encoder combined with XGBoost for the classification sub-task A. We also evaluate other combinations of embeddings and off-the-shelf machine learning algorithms to demonstrate the efficacy of the various representations and their combinations. Our system achieved an accuracy of 84% on the evaluation test set and received the first position in the final standings judged by the organizers.
Transfer learning for effective scientific research comprehension.
Bakhtiyar Hussain Syed,I VIJAYASARADHI,Balaji Vasan Srinivasan,Vasudeva Varma Kalidindi
International SIGIR Conference on Research and Development in Information Retrieval Workshops, SIGIR-W, 2019
@inproceedings{bib_Tran_2019, AUTHOR = {Bakhtiyar Hussain Syed, I VIJAYASARADHI, Balaji Vasan Srinivasan, Vasudeva Varma Kalidindi}, TITLE = {Transfer learning for effective scientific research comprehension.}, BOOKTITLE = {International SIGIR Conference on Research and Development in Information Retrieval Workshops}. YEAR = {2019}}
Automatic research paper summarization is a fairly interesting topic that has garnered significant interest in the research community in recent years. In this paper, we introduce team Helium’s system description for the CL-SciSumm shared task colocated with SIGIR 2019. We specifically attempt the first task, which targets building an improved system for recall of reference text spans for a given citing research paper (Task 1A) and constructing better models for comprehension of scientific facets (Task 1B). Our architecture incorporates transfer learning by utilising a combination of pretrained embeddings which are subsequently used for building models for the given tasks. In particular, for Task 1A, we locate the related text spans referred to by the citation text by creating paired text representations and employ pre-trained embedding mechanisms in conjunction with XGBoost, a gradient boosted decision tree algorithm, to identify textual entailment. For Task 1B, we make use of the same pretrained embeddings and use the RAKEL algorithm for multi-label classification. Our goal is to enable better scientific research comprehension and we believe that a new approach involving transfer learning will certainly add value to the research community working on these tasks.
Cause–Effect Relation Extraction from Documents in Metallurgy and Materials Science
Sachin Pawar,Raksha Sharma,Girish Keshav Palshikar,Pushpak Bhattacharyya,Vasudeva Varma Kalidindi
Transactions of the Indian Institute of Metals, TIIM, 2019
@inproceedings{bib_Caus_2019, AUTHOR = {Sachin Pawar, Raksha Sharma, Girish Keshav Palshikar, Pushpak Bhattacharyya, Vasudeva Varma Kalidindi}, TITLE = {Cause–Effect Relation Extraction from Documents in Metallurgy and Materials Science}, BOOKTITLE = {Transactions of the Indian Institute of Metals}. YEAR = {2019}}
Given the explosion in availability of scientific documents (books and research papers), automatically extracting cause–effect (CE) relation mentions, along with other arguments such as polarity, uncertainty and evidence, is becoming crucial for creating scientific knowledge bases from scientific text documents. Such knowledge bases can be used for multiple tasks, such as question answering, exploring research hypotheses and identifying opportunities for new research. Linguistically complex constructs are used to express CE relations in text, which requires complex natural language processing techniques for CE relation extraction. In this paper, we propose two machine learning techniques for automatically extracting CE relation mentions from documents in metallurgy and materials science domains. We show experimentally that our algorithms outperform several baselines for extracting intrasentence CE relation mentions. To the best of our knowledge, this is the first work for extraction of CE relations from documents in metallurgy and materials science domains.
Multi-label Categorization of Accounts of Sexism using a Neural Framework.
PARIKH PULKIT TRUSHANT KUMAR,HARIKA ABBURI,PINKESH BADJATIYA,Radhika Krishnan,Niyati Chhaya,Manish Gupta,Vasudeva Varma Kalidindi
Technical Report, arXiv, 2019
@inproceedings{bib_Mult_2019, AUTHOR = {PARIKH PULKIT TRUSHANT KUMAR, HARIKA ABBURI, PINKESH BADJATIYA, Radhika Krishnan, Niyati Chhaya, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Multi-label Categorization of Accounts of Sexism using a Neural Framework.}, BOOKTITLE = {Technical Report}. YEAR = {2019}}
Sexism, an injustice that subjects women and girls to enormous suffering, manifests in blatant as well as subtle ways. In the wake of growing documentation of experiences of sexism on the web, the automatic categorization of accounts of sexism has the potential to assist social scientists and policy makers in studying and countering sexism better. The existing work on sexism classification, which is different from sexism detection, has certain limitations in terms of the categories of sexism used and/or whether they can co-occur. To the best of our knowledge, this is the first work on the multi-label classification of sexism of any kind(s), and we contribute the largest dataset for sexism categorization. We develop a neural solution for this multi-label classification that can combine sentence representations obtained using models such as BERT with distributional and linguistic word embeddings using a flexible, hierarchical architecture involving recurrent components and optional convolutional ones. Further, we leverage unlabeled accounts of sexism to infuse domain-specific elements into our framework. The best proposed method outperforms several deep learning as well as traditional machine learning baselines by an appreciable margin.
Highlights of Software R&D in India
SUPRATIK CHAKRABORTY,Vasudeva Varma Kalidindi
Communications of the ACM, CACM, 2019
@inproceedings{bib_High_2019, AUTHOR = {SUPRATIK CHAKRABORTY, Vasudeva Varma Kalidindi}, TITLE = {Highlights of Software R&D in India}, BOOKTITLE = {Communications of the ACM}. YEAR = {2019}}
India is a software superpower today. This achievement rests on more than four decades of work spanning software processes, rigorous engineering and value-adding technologies, among others. In this article, we present highlights of some of these activities. This regional section also contains other articles that complement this account of exciting work in software systems stemming from India.
Tapping Community Memberships and Devising a Novel Homophily Modeling Approach for Trust Prediction
PARIKH PULKIT TRUSHANT KUMAR,Manish Gupta,Vasudeva Varma Kalidindi
Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD, 2018
@inproceedings{bib_Tapp_2018, AUTHOR = {PARIKH PULKIT TRUSHANT KUMAR, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Tapping Community Memberships and Devising a Novel Homophily Modeling Approach for Trust Prediction}, BOOKTITLE = {Pacific-Asia Conference on Knowledge Discovery and Data Mining}. YEAR = {2018}}
Prediction of trust relations between users of social networks is critical for finding credible information. Inferring trust is challenging, since user-specified trust relations are highly sparse and power-law distributed. In this paper, we explore utilizing community memberships for trust prediction in a principled manner. We also propose a novel method to model homophily that complements existing work. To the best of our knowledge, this is the first work that mathematically formulates an insight based on community memberships for unsupervised trust prediction. We propose and model the hypothesis that a user is more likely to develop a trust relation within the user’s community than outside it. Unlike existing work, our approach for encoding homophily directly links user-user similarities with the pair-wise trust model. We derive mathematical factors that model our hypothesis relating community memberships to trust relations and the homophily effect. Along with low-rank matrix factorization, they are combined into chTrust, the proposed multi-faceted optimization framework. Our experiments on the standard Ciao and Epinions datasets show that the proposed framework outperforms multiple unsupervised trust prediction baselines for all test user pairs as well as the low-degree segment, across evaluation settings.
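The low-rank factorization component at the core of such a framework can be sketched as below; chTrust additionally combines community-membership and homophily factors into the objective, which are omitted here, and all sizes and data are synthetic.

```python
# Approximate an observed trust matrix T by sigmoid(U V^T) over known entries.
import torch

n_users, rank = 50, 8
T = (torch.rand(n_users, n_users) < 0.05).float()  # sparse synthetic trust
mask = torch.rand(n_users, n_users) < 0.5          # observed entries

U = torch.randn(n_users, rank, requires_grad=True)
V = torch.randn(n_users, rank, requires_grad=True)
opt = torch.optim.Adam([U, V], lr=0.05)

for step in range(200):
    opt.zero_grad()
    pred = torch.sigmoid(U @ V.T)
    loss = (((pred - T) ** 2) * mask).sum() \
        + 0.1 * (U.norm() ** 2 + V.norm() ** 2)  # Frobenius regularization
    loss.backward()
    opt.step()
print(float(loss))
```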
Check It Out : Politics and Neural Networks
Yash Kumar Lal,DHRUV KHATTAR,VAIBHAV KUMAR,Abhimanshu Mishra,Vasudeva Varma Kalidindi
CEUR Workshop Proceedings, CEUR, 2018
@inproceedings{bib_Chec_2018, AUTHOR = {Yash Kumar Lal, DHRUV KHATTAR, VAIBHAV KUMAR, Abhimanshu Mishra, Vasudeva Varma Kalidindi}, TITLE = {Check It Out : Politics and Neural Networks}, BOOKTITLE = {CEUR Workshop Proceedings}. YEAR = {2018}}
The task of fact-checking has been formalised as the assessment of the truthfulness of a claim. Be it a political proclamation or a technological development, verification of a new tidbit of information before its propagation to the general public is of utmost importance. Failing to do so leads to the spread of misinformation, which is a devious tool. Fact-checking is commonly performed by journalists, manually looking up information pertaining to the statement in question. This is a drawn out and tedious process with a chance of the concerned person not covering the domain exhaustively. Some of this effort is reduced by the use of knowledge bases created over a period of time. In this work under Task 2 (Factuality) of the CLEF 2018 CheckThat! Lab, we detail a neural network based methodology which models the textual data of a claim based on various representations of its words and characters. An affixed attention mechanism allows us to encapsulate linguistic features common in false claims. We achieve an accuracy of 39.57% on the task dataset.
Weave&Rec: A Word Embedding based 3-D Convolutional Network for News Recommendation
DHRUV KHATTAR,VAIBHAV KUMAR,Vasudeva Varma Kalidindi,Manish Gupta
International Conference on Information and Knowledge Management, CIKM, 2018
@inproceedings{bib_Weav_2018, AUTHOR = {DHRUV KHATTAR, VAIBHAV KUMAR, Vasudeva Varma Kalidindi, Manish Gupta}, TITLE = {Weave&Rec: A Word Embedding based 3-D Convolutional Network for News Recommendation}, BOOKTITLE = {International Conference on Information and Knowledge Management}. YEAR = {2018}}
An effective news recommendation system should harness the historical information of the user based on her interactions as well as the content of the articles. In this paper we propose a novel deep learning model for news recommendation which utilizes the content of the news articles as well as the sequence in which the articles were read by the user. To model both of these information types, which are essentially different, we propose a simple yet effective architecture which utilizes a 3-dimensional Convolutional Neural Network which takes the word embeddings of the articles present in the user history as its input. Using such a method endows the model with the capability to automatically learn spatial (features of a particular article) as well as temporal features (features across articles read by a user) which signify the interest of the user. At test time, we use this in combination with a 2-dimensional Convolutional Neural Network for recommending articles to users. On a real-world dataset our method outperformed strong baselines which also model the news recommendation problem using neural networks.
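A minimal PyTorch sketch of the 3-D convolution idea, treating a user's history as a volume of (articles x words x embedding dimensions); all layer sizes and the pooling choice here are illustrative assumptions rather than the paper's configuration:

    import torch
    import torch.nn as nn

    class History3DCNN(nn.Module):
        """3-D convolution over (articles, words, embedding dims) of a user's history."""
        def __init__(self, emb_dim=100, n_filters=32):
            super().__init__()
            # kernel spans 3 consecutive articles x 3 consecutive words x full embedding
            self.conv = nn.Conv3d(1, n_filters, kernel_size=(3, 3, emb_dim))
            self.pool = nn.AdaptiveMaxPool3d(1)
            self.out = nn.Linear(n_filters, 1)

        def forward(self, history):
            # history: (batch, n_articles, n_words, emb_dim)
            x = history.unsqueeze(1)            # add a channel dimension
            x = torch.relu(self.conv(x))
            x = self.pool(x).flatten(1)         # (batch, n_filters)
            return self.out(x)                  # relevance score per user

    scores = History3DCNN()(torch.randn(2, 10, 30, 100))  # 2 users, 10 articles, 30 words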
Sci-Blogger: A Step Towards Automated Science Journalism
RAGHU RAM VADAPALLI,Bhaktiyar Syed,Nishanth Prabhu,Balaji Vasan Srinivasan,Vasudeva Varma Kalidindi
International Conference on Information and Knowledge Management, CIKM, 2018
@inproceedings{bib_Sci-_2018, AUTHOR = {RAGHU RAM VADAPALLI, Bhaktiyar Syed, Nishanth Prabhu, Balaji Vasan Srinivasan, Vasudeva Varma Kalidindi}, TITLE = {Sci-Blogger: A Step Towards Automated Science Journalism}, BOOKTITLE = {International Conference on Information and Knowledge Management}. YEAR = {2018}}
Science journalism is the art of conveying a detailed scientific research paper in a form that nonscientists can understand and appreciate while ensuring that its underlying information is conveyed accurately. It aims to transform jargon-laden scientific articles into a form that a common reader can comprehend while ensuring that the meaning of the article is retained. It plays a crucial role in making scientific content suitable for consumption by the public at large. Recent advances in Deep Learning research and its applications in natural language processing have paved the way for impressive results in Natural Language Generation. We leverage these advances to explore the possibility of their use in journalism, science journalism in particular, as comprehension of scientific content is a much harder challenge than most other forms of content, like the shorthand journalists use while writing articles. In this work, we introduce the problem of automated science journalism and present ways to automate some parts of the workflow by automatically generating the ‘title’ of a blog version of a scientific paper. We have built a corpus of 87,328 pairs of research papers and their corresponding blogs from two science news aggregators and have used it to build Sci-Blogger, a pipeline-based architecture consisting of a two-stage mechanism to generate the blog titles. To demonstrate the models, we built an interactive tool, where a user can give the abstract and title of a research paper, which is processed by our APIs to produce a blog title, along with some relevant information about the model used for the generation. Evaluation using standard metrics indicates the viability of the proposed system.
RARE : A Recurrent Attentive Recommendation Engine for News Aggregators
DHRUV KHATTAR,VAIBHAV KUMAR,SHASHANK GUPTA,Manish Shrivastava,Vasudeva Varma Kalidindi
International Conference on Information and Knowledge Management, CIKM, 2018
@inproceedings{bib_RARE_2018, AUTHOR = {DHRUV KHATTAR, VAIBHAV KUMAR, SHASHANK GUPTA, Manish Shrivastava, Vasudeva Varma Kalidindi}, TITLE = {RARE : A Recurrent Attentive Recommendation Engine for News Aggregators}, BOOKTITLE = {International Conference on Information and Knowledge Management}. YEAR = {2018}}
With news stories coming from a variety of sources, it is crucial for news aggregators to present interesting articles to the user to maximize their engagement. This creates the need for a news recommendation system which understands the content of the articles as well as accounts for the users’ preferences. Methods such as Collaborative Filtering, which are well known for general recommendations, are not suitable for news because of the short life span of articles and because of the large number of articles published each day. Apart from this, such methods do not harness the information present in the sequence in which the articles are read by the user and hence are unable to account for the specific and generic interests of the user which may keep changing with time. In order to address these issues for news recommendation, we propose the Recurrent Attentive Recommendation Engine (RARE). RARE consists of two components and utilizes the distributed representations of news articles. The first component is used to model the user’s sequential behaviour of news reading in order to understand her general interests, i.e., to get a summary of her interests. The second component utilizes an article level attention mechanism to understand her specific interests. We feed the information obtained from both the components to a Siamese Network in order to make predictions which pertain to the user’s generic as well as specific interests. We carry out extensive experiments over three real-world datasets and show that RARE outperforms the state-of-the-art. Furthermore, we also demonstrate the effectiveness of our method in handling the cold start cases.
Fine Grained Approach for Domain Specific Seed URL Extraction
LALIT MOHAN S,SOURAV SARANGI,Raghu Babu Reddy Y,Vasudeva Varma Kalidindi
Hawaii International Conference on System Sciences, HICSS, 2018
@inproceedings{bib_Fine_2018, AUTHOR = {LALIT MOHAN S, SOURAV SARANGI, Raghu Babu Reddy Y, Vasudeva Varma Kalidindi}, TITLE = {Fine Grained Approach for Domain Specific Seed URL Extraction}, BOOKTITLE = {Hawaii International Conference on System Sciences}. YEAR = {2018}}
Domain Specific Search Engines are expected to provide relevant search results. Availability of an enormous number of URLs across subdomains improves the relevance of domain specific search engines. Current methods for seed URL extraction could be more systematic in ensuring representation of subdomains. We propose a fine grained approach for automatic extraction of seed URLs at subdomain level using Wikipedia and Twitter as repositories. A SeedRel metric and a Diversity Index for seed URL relevance are proposed to measure subdomain coverage. We implemented our approach for the 'Security - Information and Cyber' domain and identified 34,007 seed URLs and 400,726 URLs across subdomains. The measured Diversity Index value of 2.10 confirms that all subdomains are represented; hence, a relevant 'Security Search Engine' can be built. Our approach also extracted more URLs (seed and child) as compared to existing approaches for URL extraction.
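The abstract does not define the Diversity Index formula; one standard choice consistent with a value like 2.10 is the Shannon diversity index over the distribution of seed URLs across subdomains, sketched below:

    import math

    def shannon_diversity(counts):
        """Shannon diversity H = -sum(p_i * ln p_i) over subdomain URL counts."""
        total = sum(counts)
        return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

    # Hypothetical seed-URL counts for a few security subdomains
    counts = [5200, 4100, 3900, 4800, 5000, 4300, 3500, 3200]
    print(round(shannon_diversity(counts), 2))  # near ln(8) ~ 2.08 when balanced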
Translation Quality Estimation for Indian Languages
JHAVERI NISARG KETAN,Manish Gupta,Vasudeva Varma Kalidindi
Conference of the European Association for Machine Translation, EAMT, 2018
@inproceedings{bib_Tran_2018, AUTHOR = {JHAVERI NISARG KETAN, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Translation Quality Estimation for Indian Languages}, BOOKTITLE = {Conference of the European Association for Machine Translation}. YEAR = {2018}}
Translation Quality Estimation (QE) aims to estimate the quality of an automated machine translation (MT) output without any human intervention or reference translation. With the increasing use of MT systems in various cross-lingual applications, the need for and applicability of QE systems are increasing. We study existing approaches and propose multiple neural network approaches for sentence-level QE, with a focus on MT outputs in Indian languages. For this, we also introduce five new datasets for four language pairs: two for English–Gujarati, and one each for English–Hindi, English–Telugu and English–Bengali, which includes one manually post-edited dataset for English–Gujarati. These Indian languages are spoken by around 689M speakers worldwide. We compare results obtained using our proposed models with multiple state-of-the-art systems, including the winning system in the WMT17 shared task on QE, and show that our proposed neural model, which combines the discriminative power of carefully chosen features with Siamese Convolutional Neural Networks (CNNs), works best for all Indian language datasets.
SWDE : A Sub-Word And Document Embedding Based Engine for Clickbait Detection
VAIBHAV KUMAR,DHRUV KHATTAR,MRINAL DHAR,Yash Kumar Lal,Abhimanshu Mishra,Vasudeva Varma Kalidindi,Manish Srivastava
International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, 2018
@inproceedings{bib_SWDE_2018, AUTHOR = {VAIBHAV KUMAR, DHRUV KHATTAR, MRINAL DHAR, Yash Kumar Lal, Abhimanshu Mishra, Vasudeva Varma Kalidindi, Manish Srivastava}, TITLE = {SWDE : A Sub-Word And Document Embedding Based Engine for Clickbait Detection}, BOOKTITLE = {International ACM SIGIR Conference on Research and Development in Information Retrieval}. YEAR = {2018}}
In order to expand their reach and increase website ad revenue, media outlets have started using clickbait techniques to lure readers to click on articles on their digital platform. Having successfully enticed the user to open the article, the article fails to satiate the reader's curiosity, serving only to boost click-through rates. Initial methods for this task were dependent on feature engineering, which varies with each dataset. Industry systems have relied on an exhaustive set of rules to get the job done. Neural networks have barely been explored to perform this task. We propose a novel approach considering different textual embeddings of a news headline and the related article. We generate sub-word level embeddings of the title using Convolutional Neural Networks and use them to train a bidirectional LSTM architecture. An attention layer allows for calculation of the significance of each term towards the nature of the post. We also generate Doc2Vec embeddings of the title and article text and model how they interact, following which they are concatenated with the output of the previous component. Finally, this representation is passed through a neural network to obtain a score for the headline. We test our model over 2538 posts (having trained it on 17,000 records) and achieve an accuracy of 83.49%, outscoring previous state-of-the-art approaches.
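The headline branch (sub-word embeddings into a bidirectional LSTM with an attention layer) can be sketched as follows in PyTorch; the sizes are illustrative and the additive attention form is an assumption:

    import torch
    import torch.nn as nn

    class AttentiveBiLSTM(nn.Module):
        """BiLSTM over (sub-word) title embeddings with a simple attention
        layer that weights each term's contribution; sizes are illustrative."""
        def __init__(self, emb_dim=64, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
            self.att = nn.Linear(2 * hidden, 1)

        def forward(self, x):                      # x: (batch, seq_len, emb_dim)
            h, _ = self.lstm(x)                    # (batch, seq_len, 2*hidden)
            w = torch.softmax(self.att(h), dim=1)  # per-term significance weights
            return (w * h).sum(dim=1)              # attention-pooled title vector

    title_vec = AttentiveBiLSTM()(torch.randn(4, 20, 64))  # 4 titles, 20 sub-word units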
RARE : A Recurrent Attentive Recommendation Engine for News Aggregators
DHRUV KHATTAR,VAIBHAV KUMAR,SHASHANK GUPTA,Manish Gupta,Vasudeva Varma Kalidindi
Conference on Information & Knowledge Management Workshops, CIKM-W, 2018
@inproceedings{bib_RARE_2018, AUTHOR = {DHRUV KHATTAR, VAIBHAV KUMAR, SHASHANK GUPTA, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {RARE : A Recurrent Attentive Recommendation Engine for News Aggregators}, BOOKTITLE = {Conference on Information & Knowledge Management Workshops}. YEAR = {2018}}
With news stories coming from a variety of sources, it is crucial for news aggregators to present interesting articles to the user to maximize their engagement. This creates the need for a news recommendation system which understands the content of the articles as well as accounts for the users’ preferences. Methods such as Collaborative Filtering, which are well known for general recommendations, are not suitable for news because of the short life span of articles and because of the large number of articles published each day. Apart from this, such methods do not harness the information present in the sequence in which the articles are read by the user and hence are unable to account for the specific and generic interests of the user which may keep changing with time. In order to address these issues for news recommendation, we propose the Recurrent Attentive Recommendation Engine (RARE). RARE consists of two components and utilizes the distributed representations of news articles. The first component is used to model the user’s sequential behaviour of news reading in order to understand her general interests, i.e., to get a summary of her interests. The second component utilizes an article level attention mechanism to understand her specific interests. We feed the information obtained from both the components to a Siamese Network in order to make predictions which pertain to the user’s generic as well as specific interests. We carry out extensive experiments over three real-world datasets and show that RARE outperforms the state-of-the-art. Furthermore, we also demonstrate the effectiveness of our method in handling the cold start cases.
Believe It or Not! Identifying Bizarre News in Online News Media
I VIJAYASARADHI,OOTA SUBBA REDDY,Manish Gupta,Vasudeva Varma Kalidindi
India Joint International Conference on. Data Science & Management of Data, COMAD/CODS, 2018
@inproceedings{bib_Beli_2018, AUTHOR = {I VIJAYASARADHI, OOTA SUBBA REDDY, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Believe It or Not! Identifying Bizarre News in Online News Media}, BOOKTITLE = {India Joint International Conference on. Data Science & Management of Data}. YEAR = {2018}}
Bizarre news items are those news items so strange and unusual that readers might question the claims presented in the news. This paper presents the first machine learning approach to bizarre news detection in online news media. We contribute by compiling the first bizarre news corpus of 23754 bizarre news headlines, and by developing a bizarre news detection model based on various syntactic and semantic features. We experiment with various machine learning techniques including deep learning methods, and find that a logistic regression classifier achieves an F1 score of 0.92 for the task.
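As a hedged sketch of such a classifier, the following sklearn pipeline uses TF-IDF n-grams in place of the paper's richer syntactic and semantic features; the headlines and labels here are purely illustrative:

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    headlines = ["Man marries his own car in beach ceremony",
                 "Parliament passes the annual budget bill"]
    labels = [1, 0]                               # 1 = bizarre, 0 = regular

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(headlines, labels)
    print(clf.predict(["Cat elected honorary mayor of small town"]))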
Multi-Task Learning for Extraction of Adverse Drug Reaction Mentions from Tweets
SHASHANK GUPTA,Manish Gupta,Vasudeva Varma Kalidindi,Sachin Pawar,Nitin Ramrakhiyani,Girish Keshav Palshikar
European Conference on Information Retrieval, ECIR, 2018
@inproceedings{bib_Mult_2018, AUTHOR = {SHASHANK GUPTA, Manish Gupta, Vasudeva Varma Kalidindi, Sachin Pawar, Nitin Ramrakhiyani, Girish Keshav Palshikar}, TITLE = {Multi-Task Learning for Extraction of Adverse Drug Reaction Mentions from Tweets}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2018}}
Adverse drug reactions (ADRs) are one of the leading causes of mortality in health care. Current ADR surveillance systems are often associated with a substantial time lag before such events are officially published. On the other hand, online social media such as Twitter contain information about ADR events in real-time, much before any official reporting. Current state-of-the-art in ADR mention extraction uses Recurrent Neural Networks (RNN), which typically need large labeled corpora. Towards this end, we propose a multi-task learning based method which can utilize a similar auxiliary task (adverse drug event detection) to enhance the performance of the main task, i.e., ADR extraction. Furthermore, in absence of the auxiliary task dataset, we propose a novel joint multi-task learning method to automatically generate weak supervision dataset for the auxiliary task when a large pool of unlabeled tweets is available. Experiments with ∼0.48M tweets show that the proposed approach outperforms the state-of-the-art methods for the ADR mention extraction task by ∼7.2% in terms of F1 score.
Co-training for Extraction of Adverse Drug Reaction Mentions from Tweets
SHASHANK GUPTA,Manish Gupta,Vasudeva Varma Kalidindi,Sachin Pawar,Nitin Ramrakhiyani,Girish Keshav Palshikar
European Conference on Information Retrieval, ECIR, 2018
@inproceedings{bib_Co-t_2018, AUTHOR = {SHASHANK GUPTA, Manish Gupta, Vasudeva Varma Kalidindi, Sachin Pawar, Nitin Ramrakhiyani, Girish Keshav Palshikar}, TITLE = {Co-training for Extraction of Adverse Drug Reaction Mentions from Tweets}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2018}}
Adverse drug reactions (ADRs) are one of the leading causes of mortality in health care. Current ADR surveillance systems are often associated with a substantial time lag before such events are officially published. On the other hand, online social media such as Twitter contain information about ADR events in real-time, much before any official reporting. Current state-of-the-art methods in ADR mention extraction use Recurrent Neural Networks (RNN), which typically need large labeled corpora. Towards this end, we propose a semi-supervised method based on co-training which can exploit a large pool of unlabeled tweets to augment the limited supervised training data, and as a result enhance the performance. Experiments with ∼0.1M tweets show that the proposed approach outperforms the state-of-the-art methods for the ADR mention extraction task by ∼5% in terms of F1 score.
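A generic co-training loop in the spirit of this approach might look like the following sketch; it uses two feature views and sklearn classifiers for illustration, whereas the paper applies co-training to a neural sequence-labeling model:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def co_train(Xa, Xb, y, Xa_u, Xb_u, rounds=5, k=10):
        """A simplified co-training loop: two feature views (Xa, Xb) of the
        same examples, a small labeled set y, and an unlabeled pool
        (Xa_u, Xb_u). The view-A classifier confidently pseudo-labels
        unlabeled examples, which augment the training set of both views."""
        Xa, Xb, y = Xa.copy(), Xb.copy(), y.copy()
        for _ in range(rounds):
            ca = LogisticRegression(max_iter=1000).fit(Xa, y)
            cb = LogisticRegression(max_iter=1000).fit(Xb, y)
            if len(Xa_u) == 0:
                break
            conf = ca.predict_proba(Xa_u).max(axis=1)
            pick = np.argsort(-conf)[:k]              # most confident examples
            pseudo = ca.predict(Xa_u[pick])
            Xa = np.vstack([Xa, Xa_u[pick]])
            Xb = np.vstack([Xb, Xb_u[pick]])
            y = np.concatenate([y, pseudo])
            keep = np.setdiff1d(np.arange(len(Xa_u)), pick)
            Xa_u, Xb_u = Xa_u[keep], Xb_u[keep]
        return ca, cb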
Neural Content-Collaborative Filtering for News Recommendation
DHRUV KHATTAR,VAIBHAV KUMAR,Manish Gupta,Vasudeva Varma Kalidindi
European Conference on Information Retrieval, ECIR, 2018
@inproceedings{bib_Neur_2018, AUTHOR = {DHRUV KHATTAR, VAIBHAV KUMAR, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Neural Content-Collaborative Filtering for News Recommendation}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2018}}
Popular methods like collaborative filtering and content-based filtering have their own disadvantages. The former method requires a considerable amount of user data before making predictions, while the latter suffers from over-specialization. In this work, we address both of these issues with a hybrid approach based on neural networks for news recommendation. The hybrid approach incorporates both (1) user-item interaction and (2) content information of the articles read by the user in the past. We first come up with an article-embedding based profile for the user. We then use this user profile with adequate positive and negative samples in order to train the neural network based model. The resulting model is then applied on a real-world dataset. We compare it with a set of established baselines and the experimental results show that our model outperforms the state-of-the-art.
Attention-based Neural Text Segmentation
PINKESH BADJATIYA,LITTON J KURISINKEL,Manish Gupta,Vasudeva Varma Kalidindi
European Conference on Information Retrieval, ECIR, 2018
@inproceedings{bib_Atte_2018, AUTHOR = {PINKESH BADJATIYA, LITTON J KURISINKEL, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Attention-based Neural Text Segmentation}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2018}}
Text segmentation plays an important role in various Natural Language Processing (NLP) tasks like summarization, context understanding, document indexing and document noise removal. Previous methods for this task require manual feature engineering and suffer from huge memory requirements and large execution times. To the best of our knowledge, this paper is the first one to present a novel supervised neural approach for text segmentation. Specifically, we propose an attention-based bidirectional LSTM model where sentence embeddings are learned using CNNs and the segments are predicted based on contextual information. This model can automatically handle variable sized context information. Compared to the existing competitive baselines, the proposed model shows a performance improvement of ∼7% in WinDiff score on three benchmark datasets.
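The WinDiff metric used for evaluation can be computed directly; a small pure-Python sketch (boundary sequences as 0/1 lists, with the usual convention that the window size k is about half the mean segment length):

    def windiff(reference, hypothesis, k):
        """Fraction of sliding windows in which the reference and hypothesis
        contain a different number of segment boundaries (lower is better)."""
        assert len(reference) == len(hypothesis)
        n = len(reference)
        errors = sum(
            sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
            for i in range(n - k + 1)
        )
        return errors / (n - k + 1)

    ref = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]   # 1 marks a segment boundary
    hyp = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
    print(windiff(ref, hyp, k=3))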
Unity in Diversity: Learning Distributed Heterogeneous Sentence Representation for Extractive Summarization
ABHISHEK KUMAR SINGH,Manish Gupta,Vasudeva Varma Kalidindi
American Association for Artificial Intelligence, AAAI, 2018
@inproceedings{bib_Unit_2018, AUTHOR = {ABHISHEK KUMAR SINGH, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Unity in Diversity: Learning Distributed Heterogeneous Sentence Representation for Extractive Summarization}, BOOKTITLE = {American Association for Artificial Intelligence}. YEAR = {2018}}
Automated multi-document extractive text summarization is a widely studied research problem in the field of natural language understanding. Such extractive mechanisms compute in some form the worthiness of a sentence to be included in the summary. While the conventional approaches rely on human-crafted document-independent features to generate a summary, we develop a novel data-driven summarization system called HNet, which exploits the various semantic and compositional aspects latent in a sentence to capture document-independent features. The network learns sentence representations in a way that salient sentences are closer in the vector space than non-salient sentences. This semantic and compositional feature vector is then concatenated with the document-dependent features for sentence ranking. Experiments on the DUC benchmark datasets (DUC-2001, DUC-2002 and DUC-2004) indicate that our model shows a significant performance gain of around 1.5-2 points in terms of ROUGE score compared with the state-of-the-art baselines.
A Workbench for Rapid Generation of Cross-Lingual Summaries
JHAVERI NISARG KETAN,Manish Gupta,Vasudeva Varma Kalidindi
International Conference on Language Resources and Evaluation, LREC, 2018
@inproceedings{bib_A_Wo_2018, AUTHOR = {JHAVERI NISARG KETAN, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {A Workbench for Rapid Generation of Cross-Lingual Summaries}, BOOKTITLE = {International Conference on Language Resources and Evaluation}. YEAR = {2018}}
The need for cross-lingual information access is greater than ever with easy accessibility to the Internet, especially in vastly multilingual societies like India. Cross-lingual summarization can help minimize the human effort needed for producing publishable articles in multiple languages, while making the most important information available in the target language in the form of summaries. We describe a flexible, web-based tool for human editing of cross-lingual summaries to rapidly generate publishable summaries in a number of Indian languages for news articles originally published in English, and simultaneously collect detailed logs about the process, at both article and sentence level. Similar to translation post-editing logs, such logs can be used to evaluate the automated cross-lingual summaries, in terms of the effort needed to make them publishable. The generated summaries along with the logs can be used to train and improve the automatic system over time.
Semi-Supervised Recurrent Neural Network for Adverse Drug Reaction Mention Extraction
SHASHANK GUPTA,Sachin Pawar,Nitin Ramrakhiyani,Girish Keshav Palshikar,Vasudeva Varma Kalidindi
BMC Bioinformatics, BIO INFO, 2018
@inproceedings{bib_Semi_2018, AUTHOR = {SHASHANK GUPTA, Sachin Pawar, Nitin Ramrakhiyani, Girish Keshav Palshikar, Vasudeva Varma Kalidindi}, TITLE = {Semi-Supervised Recurrent Neural Network for Adverse Drug Reaction Mention Extraction}, BOOKTITLE = {BMC Bioinformatics}. YEAR = {2018}}
Social media is a useful platform to share health-related information due to its vast reach. This makes it a good candidate for public health monitoring tasks, specifically for pharmacovigilance. We study the problem of extraction of Adverse-Drug-Reaction (ADR) mentions from social media, particularly from Twitter. Medical information extraction from social media is challenging, mainly due to the short and highly informal nature of the text, as compared to more technical and formal medical reports. Current methods in ADR mention extraction rely on supervised learning methods, which suffer from the labeled data scarcity problem. The state-of-the-art method uses deep neural networks, specifically a class of Recurrent Neural Network (RNN) called Long Short-Term Memory networks (LSTMs) [6]. Deep neural networks, due to their large number of free parameters, rely heavily on large annotated corpora for learning the end task. But in the real world, it is hard to get large labeled data, mainly due to the heavy cost associated with manual annotation. To this end, we propose a novel semi-supervised learning based RNN model, which can leverage unlabeled data, also present in abundance on social media. Through experiments we demonstrate the effectiveness of our method, achieving state-of-the-art performance in ADR mention extraction.
ELDEN: Improved Entity Linking using Densified Knowledge Graphs
PRIYA RADHAKRISHNAN,Partha Talukdar,Vasudeva Varma Kalidindi
Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 2018
@inproceedings{bib_ELDE_2018, AUTHOR = {PRIYA RADHAKRISHNAN, Partha Talukdar, Vasudeva Varma Kalidindi}, TITLE = {ELDEN: Improved Entity Linking using Densified Knowledge Graphs}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics}. YEAR = {2018}}
Entity Linking (EL) systems aim to automatically map mentions of an entity in text to the corresponding entity in a Knowledge Graph (KG). The degree of connectivity of an entity in the KG directly affects an EL system’s ability to correctly link mentions in text to the entity in the KG. This causes many EL systems to perform well for entities well connected to other entities in the KG, bringing into focus the role of KG density in EL. In this paper, we propose Entity Linking using Densified Knowledge Graphs (ELDEN). ELDEN is an EL system which first densifies the KG with co-occurrence statistics from a large text corpus, and then uses the densified KG to train entity embeddings. Entity similarity measured using these trained entity embeddings results in improved EL. ELDEN outperforms state-of-the-art EL systems on benchmark datasets. Due to such densification, ELDEN performs well for sparsely connected entities in the KG too. ELDEN’s approach is simple, yet effective. We have made ELDEN’s code and data publicly available.
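The densification step can be sketched as adding weighted edges between entities that co-occur often in a corpus; pointwise mutual information is used as the edge weight in this illustrative version (the exact statistic and all names here are assumptions):

    import math
    from collections import Counter
    from itertools import combinations

    def pmi_edges(entity_sets, threshold=0.0):
        """entity_sets: iterable of sets of entities co-occurring in a text
        window. Returns (e1, e2, pmi) for pairs whose PMI exceeds threshold."""
        n = 0
        single, pair = Counter(), Counter()
        for ents in entity_sets:
            n += 1
            single.update(ents)
            pair.update(frozenset(p) for p in combinations(sorted(ents), 2))
        edges = []
        for p, c in pair.items():
            e1, e2 = tuple(p)
            pmi = math.log(c * n / (single[e1] * single[e2]))
            if pmi > threshold:
                edges.append((e1, e2, pmi))
        return edges

    docs = [{"Paris", "France"}, {"Paris", "France", "Seine"}, {"Berlin", "Germany"}]
    print(pmi_edges(docs))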
Identifying Clickbait: A Multi-Strategy Approach Using Neural Networks
VAIBHAV KUMAR,DHRUV KHATTAR,SIDDHARTHA GAIROLA,Yash Kumar Lal,Vasudeva Varma Kalidindi
International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, 2018
@inproceedings{bib_Iden_2018, AUTHOR = {VAIBHAV KUMAR, DHRUV KHATTAR, SIDDHARTHA GAIROLA, Yash Kumar Lal, Vasudeva Varma Kalidindi}, TITLE = {Identifying Clickbait: A Multi-Strategy Approach Using Neural Networks}, BOOKTITLE = {International ACM SIGIR Conference on Research and Development in Information Retrieval}. YEAR = {2018}}
Online media outlets, in a bid to expand their reach and subsequently increase revenue through ad monetisation, have begun adopting clickbait techniques to lure readers to click on articles. Such articles often fail to fulfill the promise made by the headline. Traditional methods for clickbait detection have relied heavily on feature engineering which, in turn, is dependent on the dataset it is built for. The application of neural networks for this task has only been explored partially. We propose a novel approach considering all information found in a social media post. We train a bidirectional LSTM with an attention mechanism to learn the extent to which a word contributes to the post’s clickbait score in a differential manner. We also employ a Siamese net to capture the similarity between source and target information. Information gleaned from images has not been considered in previous approaches. We learn image embeddings from large amounts of data using Convolutional Neural Networks to add another layer of complexity to our model. Finally, we concatenate the outputs from the three separate components, serving them as input to a fully connected layer. We conduct experiments over a test corpus of 19538 social media posts, attaining an F1 score of 65.37% on the dataset, bettering the previous state-of-the-art as well as other proposed approaches, feature engineering or otherwise.
Identification of Alias Links among Participants in Narratives
Sangameshwar Patil,Sachin Pawar,Swapnil Hingmire,Girish K Palshikar,Vasudeva Varma Kalidindi,Pushpak Bhattacharyya
Conference of the Association of Computational Linguistics, ACL, 2018
@inproceedings{bib_Iden_2018, AUTHOR = {Sangameshwar Patil, Sachin Pawar, Swapnil Hingmire, Girish K Palshikar, Vasudeva Varma Kalidindi, Pushpak Bhattacharyya}, TITLE = {Identification of Alias Links among Participants in Narratives}, BOOKTITLE = {Conference of the Association of Computational Linguistics}. YEAR = {2018}}
Identification of distinct and independent participants (entities of interest) in a narrative is an important task for many NLP applications. This task becomes challenging because these participants are often referred to using multiple aliases. In this paper, we propose an approach based on linguistic knowledge for identification of aliases mentioned using proper nouns, pronouns or noun phrases with a common noun headword. We use a Markov Logic Network (MLN) to encode the linguistic knowledge for identification of aliases. We evaluate on four diverse history narratives of varying complexity as well as the newswire subset of the ACE 2005 dataset. Our approach performs better than the state-of-the-art.
Check It Out : Politics and Neural Networks
Yash Kumar Lal,DHRUV KHATTAR,VAIBHAV KUMAR,Abhimanshu Mishra,Vasudeva Varma Kalidindi
International Conference of the CLEF Association, CLEFS, 2018
@inproceedings{bib_Chec_2018, AUTHOR = {Yash Kumar Lal, DHRUV KHATTAR, VAIBHAV KUMAR, Abhimanshu Mishra, Vasudeva Varma Kalidindi}, TITLE = {Check It Out : Politics and Neural Networks}, BOOKTITLE = {International Conference of the CLEF Association}. YEAR = {2018}}
The task of fact-checking has been formalised as the assessment of the truthfulness of a claim. Be it a political proclamation or a technological development, verification of a new tidbit of information before its propagation to the general public is of utmost importance. Failing to do so leads to the spread of misinformation, which is a devious tool. Fact-checking is commonly performed by journalists, manually looking up information pertaining to the statement in question. This is a drawn-out and tedious process with a chance of the concerned person not covering the domain exhaustively. Some of this effort is reduced by the use of knowledge bases created over a period of time. In this work under Task 2 (Factuality) of the CLEF 2018 CheckThat! Lab, we detail a neural network based methodology which models the textual data of a claim based on various representations of its words and characters. An affixed attention mechanism allows us to encapsulate linguistic features common in false claims. We achieve an accuracy of 39.57% on the task dataset.
HRAM : A Hybrid Recurrent Attention Machine for News Recommendation
DHRUV KHATTAR,VAIBHAV KUMAR,Vasudeva Varma Kalidindi,Manish Gupta
International Conference on Information and Knowledge Management, CIKM, 2018
@inproceedings{bib_HRAM_2018, AUTHOR = {DHRUV KHATTAR, VAIBHAV KUMAR, Vasudeva Varma Kalidindi, Manish Gupta}, TITLE = {HRAM : A Hybrid Recurrent Attention Machine for News Recommendation}, BOOKTITLE = {International Conference on Information and Knowledge Management}. YEAR = {2018}}
Popular methods for news recommendation which are based on collaborative filtering and content-based filtering have multiple drawbacks. The former method does not account for the sequential nature of news reading and suffers from the problem of cold-start, while the latter suffers from over-specialization. In order to address these issues for news recommendation we propose a Hybrid Recurrent Attention Machine (HRAM). HRAM consists of two components. The first component utilizes a neural network for matrix factorization. In the second component, we first learn the distributed representation of each news article. We then use the historical data of the user in a sequential manner and feed it to an attention-based recurrent layer. Finally, we concatenate the outputs from both these components and use further hidden layers in order to make predictions. In this way, we harness the information present in the user reading history and boost it with the information available through collaborative filtering for providing better news recommendations. Extensive experiments over two real-world datasets show that the proposed model provides significant improvement over the state-of-the-art.
When science journalism meets artificial intelligence : An interactive demonstration
RAGHU RAM VADAPALLI,Bakhtiyar Hussain Syed,Nishant Prabhu,Balaji Vasan Srinivasan,Vasudeva Varma Kalidindi
Conference on Empirical Methods in Natural Language Processing, EMNLP, 2018
@inproceedings{bib_When_2018, AUTHOR = {RAGHU RAM VADAPALLI, Bakhtiyar Hussain Syed, Nishant Prabhu, Balaji Vasan Srinivasan, Vasudeva Varma Kalidindi}, TITLE = {When science journalism meets artificial intelligence : An interactive demonstration}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}. YEAR = {2018}}
We present an online interactive tool that generates blog titles and thus takes a first step toward automating science journalism. Science journalism aims to transform jargon-laden scientific articles into a form that the common reader can comprehend while ensuring that the underlying meaning of the article is retained. In this work, we present a tool which, given the title and abstract of a research paper, will generate a blog title by mimicking a human science journalist. The tool makes use of a model trained on a corpus of 87,328 pairs of research papers and their corresponding blogs, built from two science news aggregators. The architecture of the model is a two-stage mechanism which generates blog titles. Evaluation using standard metrics indicates the viability of the proposed system.
EquGener: A Reasoning Network for Word Problem Solving by Generating Arithmetic Equations
PRUTHWIK MISHRA,LITTON J KURISINKEL,Dipti Mishra Sharma,Vasudeva Varma Kalidindi
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2018
@inproceedings{bib_EquG_2018, AUTHOR = {PRUTHWIK MISHRA, LITTON J KURISINKEL, Dipti Mishra Sharma, Vasudeva Varma Kalidindi}, TITLE = {EquGener: A Reasoning Network for Word Problem Solving by Generating Arithmetic Equations}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}. YEAR = {2018}}
Word problem solving has always been a challenging task as it involves reasoning across sentences, identification of operations and their order of application on relevant operands. Most of the earlier systems attempted to solve word problems with tailored features for handling each category of problems. In this paper, we present a new approach to solve simple arithmetic problems. Through this work we introduce a novel method where we first learn a dense representation of the problem description conditioned on the question in hand. We leverage this representation to generate the operands and operators in the appropriate order. Our approach improves upon the state-of-the-art system by 3% in one benchmark dataset while ensuring comparable accuracies in other datasets.
Resolving Actor Coreferences in Hindi Narrative Text
Nitin Ramrakhiyani,Swapnil Hingmire,Sachin Pawar,Sangameshwar Patil,Girish K Palshikar,Pushpak Bhattacharyya,Vasudeva Varma Kalidindi
International Conference on Natural Language Processing., ICON, 2018
@inproceedings{bib_Reso_2018, AUTHOR = {Nitin Ramrakhiyani, Swapnil Hingmire, Sachin Pawar, Sangameshwar Patil, Girish K Palshikar, Pushpak Bhattacharyya, Vasudeva Varma Kalidindi}, TITLE = {Resolving Actor Coreferences in Hindi Narrative Text}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2018}}
An important aspect of understanding narrative text is identification of actors, their mentions and coreferences among them. Coreference resolution in Hindi is a relatively under-explored area. In this paper, we focus on the task of resolving coreferences of actor mentions in Hindi narrative text. We propose a linguistically grounded approach for the task using Markov Logic Networks (MLN). Our approach outperforms two strong baselines on a publicly available dataset and 4 other manually created datasets.
An Alternate Load Distribution Scheme in DHTs
PULKIT GOEL,KUMAR RISHABH,Vasudeva Varma Kalidindi
2018 IEEE International Conference on Cloud Computing Technology and Science, CLOUDCOM, 2017
@inproceedings{bib_An_A_2017, AUTHOR = {PULKIT GOEL, KUMAR RISHABH, Vasudeva Varma Kalidindi}, TITLE = {An Alternate Load Distribution Scheme in DHTs}, BOOKTITLE = {2018 IEEE International Conference on Cloud Computing Technology and Science}. YEAR = {2017}}
This paper analyzes, compares and looks at tradeoffs in different load distribution schemes for consistent hashing in DHTs (distributed hash tables). Different traffic patterns, including an adversarial pattern, were constructed to test the load distribution. A simulator was made for each load distribution scheme and the load on each node was recorded for each traffic pattern. It was shown that increasing the number of consistent hash rings doesn't change the load distribution characteristics of the DHT in simple traffic scenarios, but in the case of adversarial traffic, increasing the number of hash rings leads to an improvement of around 15% in the load distribution statistics.
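A minimal sketch of consistent hashing with a configurable number of hash rings, the variable studied here; this is a generic textbook construction in Python, not the paper's simulator:

    import bisect
    import hashlib
    from collections import Counter

    def h(value, salt=""):
        """Stable integer hash of a string."""
        return int(hashlib.md5((salt + value).encode()).hexdigest(), 16)

    class MultiRingDHT:
        """Consistent hashing with r independent rings; each key is routed to
        its successor node on the ring selected by the key's hash modulo r."""
        def __init__(self, nodes, rings=1):
            self.rings = [sorted((h(n, salt=str(r)), n) for n in nodes)
                          for r in range(rings)]

        def lookup(self, key):
            hk = h(key)
            ring = self.rings[hk % len(self.rings)]
            i = bisect.bisect(ring, (hk, "")) % len(ring)  # clockwise successor
            return ring[i][1]

    dht = MultiRingDHT(["node-%d" % i for i in range(8)], rings=4)
    load = Counter(dht.lookup("key-%d" % i) for i in range(10000))
    print(load)   # per-node load under a simple uniform traffic pattern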
Enhancing categorization of computer science research papers using knowledge bases
SHASHANK GUPTA,PRIYA RADHAKRISHNAN,Manish Gupta,Vasudeva Varma Kalidindi,Umang Gupta
International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, 2017
@inproceedings{bib_Enha_2017, AUTHOR = {SHASHANK GUPTA, PRIYA RADHAKRISHNAN, Manish Gupta, Vasudeva Varma Kalidindi, Umang Gupta}, TITLE = {Enhancing categorization of computer science research papers using knowledge bases}, BOOKTITLE = {International ACM SIGIR Conference on Research and Development in Information Retrieval}. YEAR = {2017}}
Automatic categorization of computer science research papers using just the abstracts is a hard problem to solve. This is due to the short text length of the abstracts. Also, abstracts are a general discussion of the topic with few domain specific terms. These reasons make it hard to generate good representations of abstracts, which in turn leads to poor categorization performance. To address this challenge, external Knowledge Bases (KB) like Wikipedia, Freebase etc. can be used to enrich the representations for abstracts, which can aid in the categorization task. In this work, we propose a novel method for enhancing classification performance of research papers into ACM computer science categories using knowledge extracted from related Wikipedia articles and Freebase entities. We use state-of-the-art representation learning methods for feature representation of documents, followed by a learning to rank method for classification. Given the abstracts of research papers from the Citation Network Dataset containing 0.24M papers, our method of using KBs outperforms a baseline method and the state-of-the-art deep learning method in the classification task by 13.25% and 5.41% respectively, in terms of accuracy. We have also open-sourced the implementation of the project.
TCS Research at TAC 2017: Joint Extraction of Entities and Relations from Drug Labels using an Ensemble of Neural Networks.
Sachin Pawar,Girish K. Palshikar,Pushpak Bhattacharyya,Nitin Ramrakhiyani,SHASHANK GUPTA,Vasudeva Varma Kalidindi
Text Analysis Conference Workshop, TAC, 2017
@inproceedings{bib_TCS__2017, AUTHOR = {Sachin Pawar, Girish K. Palshikar, Pushpak Bhattacharyya, Nitin Ramrakhiyani, SHASHANK GUPTA, Vasudeva Varma Kalidindi}, TITLE = {TCS Research at TAC 2017: Joint Extraction of Entities and Relations from Drug Labels using an Ensemble of Neural Networks.}, BOOKTITLE = {Text Analysis Conference Workshop}. YEAR = {2017}}
We describe our submission at TAC 2017 for extracting entities and relations of interest from drug labels. We employ an end-to-end relation extraction system which jointly extracts both entities and relations. The task of end-to-end relation extraction consists of identifying boundaries of entity mentions, entity types of these mentions and the appropriate semantic relation for each pair of mentions. Based on our earlier work (Pawar et al., 2017), a single neural network model (the “All Word Pairs” model, i.e., AWP-NN) is trained to assign an appropriate label to each word pair in a given sentence for performing end-to-end relation extraction. Moreover, we build an ensemble of multiple AWP-NN models to achieve better performance than the individual models. We achieved 73.18% and 24.79% F-measures for entity and relation extraction, respectively.
Supporting comprehension of unfamiliar programs by modeling cues
NAVEEN N. KULKARNI,Vasudeva Varma Kalidindi
Software Quality Journal, SQJ, 2017
@inproceedings{bib_Supp_2017, AUTHOR = {NAVEEN N. KULKARNI, Vasudeva Varma Kalidindi}, TITLE = {Supporting comprehension of unfamiliar programs by modeling cues}, BOOKTITLE = {Software Quality Journal}. YEAR = {2017}}
Developers need to comprehend a program before modifying it. In such cases, developers use cues to establish the relevance of a piece of information with a task. Being familiar with different kinds of cues will help the developers to comprehend a program faster. But unlike the experienced developers, novice developers fail to recognize the relevant cues, because (a) there are many cues and (b) they might be unfamiliar with the artifacts. However, not much is known about the developers’ choice of cue. Hence, we conducted two independent studies to understand the kind of cues used by the developers and how a tool influences their cue selection. First, from our user study on two common comprehension tasks, we found that developers actively seek the cues and their cue source choices are similar but task dependent. In our second exploratory study, we investigated whether an integrated development environment (IDE) influences a developer’s cue choices. By observing their interaction history while fixing bugs in Eclipse IDE, we found that the IDE’s influence on their cue choices was not statistically significant. Finally, as a case in point, we propose a novel task-specific program summarization approach to aid novice developers in comprehending an unfamiliar program. Our approach used developers’ cue choices to synthesize the summaries. A comparison of the synthesized summaries with the summaries recorded by the developers shows both had similar content. This promising result encourages us to explore task-specific cue models, which can aid novice developers to accomplish complex comprehension tasks faster.
Deep learning for hate speech detection in tweets
PINKESH BADJATIYA,SHASHANK GUPTA,Manish gupta,Vasudeva Varma Kalidindi
International Conference on World wide web, WWW, 2017
@inproceedings{bib_Deep_2017, AUTHOR = {PINKESH BADJATIYA, SHASHANK GUPTA, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Deep learning for hate speech detection in tweets}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2017}}
Hate speech detection on Twitter is critical for applications like controversial event extraction, building AI chatterbots, content recommendation, and sentiment analysis. We define this task as being able to classify a tweet as racist, sexist or neither. The complexity of the natural language constructs makes this task very challenging. We perform extensive experiments with multiple deep learning architectures to learn semantic word embeddings to handle this complexity. Our experiments on a benchmark dataset of 16K annotated tweets show that such deep learning methods outperform state-of-the-art char/word n-gram methods by ~18 F1 points.
Scientific article recommendation by using distributed representations of text and graph
SHASHANK GUPTA,Vasudeva Varma Kalidindi
International Conference on World wide web, WWW, 2017
@inproceedings{bib_Scie_2017, AUTHOR = {SHASHANK GUPTA, Vasudeva Varma Kalidindi}, TITLE = {Scientific article recommendation by using distributed representations of text and graph}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2017}}
The scientific article recommendation problem deals with recommending similar scientific articles given a query article. It can be categorized as a content based similarity system. Recent advancements in representation learning methods have proven to be effective in modeling distributed representations in different modalities like images, languages, speech, networks etc. The distributed representations obtained using such techniques can in turn be used to calculate similarities. In this paper, we address the problem of scientific paper recommendation through a novel method which aims to combine multimodal distributed representations, which in this case are: 1. distributed representations of the paper's content, and 2. distributed representations of the graph constructed from the bibliographic network. Through experiments we demonstrate that our method outperforms the state-of-the-art distributed representation methods in text and graph by 29.6% and 20.4%, in terms of precision and mean average precision respectively.
Interpretation of semantic tweet representations
GANESH J,Manish Gupta,Vasudeva Varma Kalidindi
IEEE International Conference on Advances in Social Networks Analysis and Mining, ASONAM, 2017
@inproceedings{bib_Inte_2017, AUTHOR = {GANESH J, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Interpretation of semantic tweet representations}, BOOKTITLE = {IEEE International Conference on Advances in Social Networks Analysis and Mining}. YEAR = {2017}}
Research in analysis of microblogging platforms is experiencing a renewed surge with a large number of works applying representation learning models for applications like sentiment analysis, semantic textual similarity computation, hashtag prediction, etc. Although the performance of the representation learning models has been better than the traditional baselines for such tasks, little is known about the elementary properties of a tweet encoded within these representations, or why particular representations work better for certain tasks. Our work presented here constitutes the first step in opening the black-box of vector embeddings for tweets.
Simultaneous Inference of User Representations and Trust
SHASHANK GUPTA,PARIKH PULKIT TRUSHANT KUMAR,Manish Gupta,Vasudeva Varma Kalidindi
IEEE International Conference on Advances in Social Networks Analysis and Mining, ASONAM, 2017
@inproceedings{bib_Simu_2017, AUTHOR = {SHASHANK GUPTA, PARIKH PULKIT TRUSHANT KUMAR, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Simultaneous Inference of User Representations and Trust}, BOOKTITLE = {IEEE International Conference on Advances in Social Networks Analysis and Mining}. YEAR = {2017}}
Inferring trust relations between social media users is critical for a number of applications wherein users seek credible information. The fact that available trust relations are scarce and skewed makes trust prediction a challenging task. To the best of our knowledge, this is the first work on exploring representation learning for trust prediction. We propose an approach that uses only a small amount of binary user-user trust relations to simultaneously learn user embeddings and a model to predict trust between user pairs. We empirically demonstrate that for trust prediction, our approach outperforms classifier-based approaches which use state-of-the-art representation learning methods like DeepWalk and LINE as features. We also conduct experiments which use embeddings pre-trained with DeepWalk and LINE each as an input to our model, resulting in further performance improvement. Experiments with a dataset of ~356K user pairs show that the proposed method can obtain a high F-score of 92.65%.
Medical persona classification in social media
Nikhil Pattisapu,Manish Gupta,Ponnurangam Kumaraguru,Vasudeva Varma Kalidindi
IEEE International Conference on Advances in Social Networks Analysis and Mining, ASONAM, 2017
@inproceedings{bib_Medi_2017, AUTHOR = {Nikhil Pattisapu, Manish Gupta, Ponnurangam Kumaraguru, Vasudeva Varma Kalidindi}, TITLE = {Medical persona classification in social media}, BOOKTITLE = {IEEE International Conference on Advances in Social Networks Analysis and Mining}. YEAR = {2017}}
Identifying medical persona from a social media post is of paramount importance for drug marketing and pharmacovigilance. In this work, we propose multiple approaches to infer the medical persona associated with a social media post. We pose this as a supervised multi-label text classification problem. The main challenge is to identify the hidden cues in a post that are indicative of a particular persona. We first propose a large set of manually engineered features for this task. Further, we propose multiple neural network based architectures to extract useful features from these posts using pre-trained word embeddings. Our experiments on thousands of blogs and tweets show that the proposed approach results in 7% and 5% gain in F-measure over manual feature engineering based approach for blogs and tweets respectively.
Deep Neural Architecture for News Recommendation.
VAIBHAV KUMAR,DHRUV KHATTAR,SHASHANK GUPTA,Vasudeva Varma Kalidindi
International Conference of the CLEF Association, CLEFS, 2017
@inproceedings{bib_Deep_2017, AUTHOR = {VAIBHAV KUMAR, DHRUV KHATTAR, SHASHANK GUPTA, Vasudeva Varma Kalidindi}, TITLE = {Deep Neural Architecture for News Recommendation.}, BOOKTITLE = {International Conference of the CLEF Association}. YEAR = {2017}}
Deep neural networks have yielded immense success in speech recognition, computer vision and natural language processing. However, the exploration of deep neural networks for recommender systems has received relatively little attention. Also, different recommendation scenarios have their own issues, which creates the need for different approaches for recommendation. Specifically, in news recommendation a major problem is that of varying user interests. In this work, we use deep neural networks with attention to tackle the problem of news recommendation. The key factor in user-item based collaborative filtering is to identify the interaction between user and item features. Matrix factorization is one of the most common approaches for identifying this interaction. It maps both the users and the items into a joint latent factor space such that user-item interactions can be modeled as inner products in that space. Some recent work has used deep neural networks with the motive to learn an arbitrary function instead of the inner product that is used for capturing the user-item interaction. However, directly adapting this for the news domain does not seem very suitable. This is because of the dynamic nature of news readership, where the interests of users keep changing with time. Hence, it becomes challenging for recommendation systems to model both user preferences as well as account for interests which keep changing over time. We present a deep neural model where non-linear mappings of user and item features are learnt first. For learning a non-linear mapping for the users we use an attention-based recurrent layer in combination with fully connected layers. For learning the mappings for the items we use only fully connected layers. We then use a ranking based objective function to learn the parameters of the network. We also use the content of the news articles as features for our model. Extensive experiments on a real-world dataset show a significant improvement of our proposed model over the state-of-the-art by 4.7% (Hit Ratio@10). Along with this, we also show the effectiveness of our model in handling the user cold-start and item cold-start problems.
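The contrast between the inner-product interaction of matrix factorization and a learned interaction function can be made concrete with a small PyTorch sketch; this is a simplification that leaves out the attention-based recurrent user encoder described above:

    import torch
    import torch.nn as nn

    class MLPInteraction(nn.Module):
        """Replace the dot product u . v with a learned function f([u; v])."""
        def __init__(self, dim=32):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(),
                                   nn.Linear(64, 1))

        def forward(self, u, v):
            return self.f(torch.cat([u, v], dim=-1)).squeeze(-1)

    u = torch.randn(5, 32)                  # user representations
    v = torch.randn(5, 32)                  # item (article) representations
    dot_scores = (u * v).sum(-1)            # matrix-factorization interaction
    mlp_scores = MLPInteraction()(u, v)     # learned interaction function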
Abstractive Multi-document Summarization by Partial Tree Extraction, Recombination and Linearization
LITTON J KURISINKEL,Yue Zhang,Vasudeva Varma Kalidindi
International Joint Conference on Natural Language Processing, IJCNLP, 2017
@inproceedings{bib_Abst_2017, AUTHOR = {LITTON J KURISINKEL, Yue Zhang, Vasudeva Varma Kalidindi}, TITLE = {Abstractive Multi-document Summarization by Partial Tree Extraction, Recombination and Linearization}, BOOKTITLE = {International Joint Conference on Natural Language Processing}. YEAR = {2017}}
Existing work for abstractive multi-document summarization utilises existing phrase structures directly extracted from input documents to generate summary sentences. These methods can suffer from a lack of consistency and coherence in merging phrases. We introduce a novel approach for abstractive multi-document summarization through partial dependency tree extraction, recombination and linearization. The method entrusts the summarizer to generate its own topically coherent sequential structures from scratch for effective communication. Results on TAC-2011, DUC-2004 and DUC-2005 show that our system gives competitive results compared with state-of-the-art abstractive summarization approaches in the literature. We also achieve competitive results in linguistic quality assessed by human evaluators.
SSAS: semantic similarity for abstractive summarization
RAGHU RAM VADAPALLI,LITTON J KURISINKEL,manish Gupta,Vasudeva Varma Kalidindi
International Joint Conference on Natural Language Processing, IJCNLP, 2017
@inproceedings{bib_SSAS_2017, AUTHOR = {RAGHU RAM VADAPALLI, LITTON J KURISINKEL, manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {SSAS: semantic similarity for abstractive summarization}, BOOKTITLE = {International Joint Conference on Natural Language Processing}. YEAR = {2017}}
Ideally, a metric evaluating an abstractive system summary should represent the extent to which the system-generated summary approximates the semantic inference conceived by the reader using a human-written reference summary. Most of the previous approaches relied upon word or syntactic sub-sequence overlap to evaluate system-generated summaries. Such metrics cannot evaluate the summary at the semantic inference level. Through this work we introduce the metric of Semantic Similarity for Abstractive Summarization (SSAS), which leverages natural language inference and paraphrasing techniques to frame a novel approach to evaluate system summaries at the semantic inference level. SSAS is based upon a weighted composition of quantities representing the level of agreement, contradiction, independence, paraphrasing, and optionally ROUGE score between a system-generated and a human-written summary.
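The weighted composition underlying SSAS can be written out directly; the component weights below are hypothetical placeholders for values the authors would tune or learn:

    def ssas_score(entail, contradict, neutral, paraphrase, rouge,
                   w=(1.0, -1.0, -0.2, 0.5, 0.3)):
        """Weighted composition of NLI-derived quantities (entailment,
        contradiction, independence), a paraphrase score and a ROUGE score."""
        return sum(wi * ci for wi, ci in
                   zip(w, (entail, contradict, neutral, paraphrase, rouge)))

    print(ssas_score(0.80, 0.05, 0.15, 0.60, 0.40))  # higher means more faithful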
Hybrid memnet for extractive summarization
ABHISHEK KUMAR SINGH,Manish Gupta,Vasudeva Varma Kalidindi
International Conference on Information and Knowledge Management, CIKM, 2017
@inproceedings{bib_Hybr_2017, AUTHOR = {ABHISHEK KUMAR SINGH, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Hybrid memnet for extractive summarization}, BOOKTITLE = {International Conference on Information and Knowledge Management}. YEAR = {2017}}
Extractive text summarization has been an extensively studied problem in the field of natural language understanding. While conventional approaches rely mostly on manually compiled features to generate the summary, few attempts have been made at developing data-driven systems for extractive summarization. To this end, we present a fully data-driven, end-to-end deep network, which we call Hybrid MemNet, for the single-document summarization task. The network learns a continuous unified representation of a document before generating its summary. It jointly captures local and global sentential information along with the notion of summary-worthy sentences. Experimental results on two different corpora confirm that our model shows significant performance gains compared with state-of-the-art baselines.
User profiling based deep neural network for temporal news recommendation
VAIBHAV KUMAR,DHRUV KHATTAR,SHASHANK GUPTA,Vasudeva Varma Kalidindi
International Conference on Data Mining Workshops, ICDM-W, 2017
@inproceedings{bib_User_2017, AUTHOR = {VAIBHAV KUMAR, DHRUV KHATTAR, SHASHANK GUPTA, Vasudeva Varma Kalidindi}, TITLE = {User profiling based deep neural network for temporal news recommendation}, BOOKTITLE = {International Conference on Data Mining Workshops}. YEAR = {2017}}
One of the most important and challenging problems in recommendation systems is that of modeling temporal behavior. Typically, modeling temporal behavior increases the cost of parameter inference and estimation, and it also requires a large amount of data for reliably learning the model parameters. Therefore, it is often difficult to model temporal behavior in large-scale real-world recommendation systems. In this work, we propose a deep neural network architecture based on a two-level approach. We first generate document embeddings for every news article. We then use these embeddings and the articles previously read by a user to construct her user profile, and train our model with this profile along with adequate positive and negative samples. The resulting model is applied to a real-world dataset and compared with a set of established baselines; the experimental results show that our model outperforms the state-of-the-art. We also use the learned model to recommend articles to users who have had very little interaction with items, i.e., who have read very few news articles, and thereby demonstrate the effectiveness of our model in solving the item cold-start problem.
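A deliberately simplified sketch of the two-level idea: build a user profile from the embeddings of previously read articles, then rank candidate articles against it. Mean pooling and cosine ranking here are stand-ins for the paper's learned architecture; all names and dimensions are illustrative.

```python
import numpy as np

def build_profile(read_article_embs):
    """User profile as the normalized mean of embeddings of previously
    read articles -- a simplified stand-in for the learned profile."""
    profile = np.mean(read_article_embs, axis=0)
    return profile / np.linalg.norm(profile)

def rank_candidates(profile, candidate_embs):
    """Rank candidate articles by cosine similarity to the profile."""
    cands = candidate_embs / np.linalg.norm(candidate_embs, axis=1,
                                            keepdims=True)
    return np.argsort(-cands @ profile)

rng = np.random.default_rng(0)
history = rng.normal(size=(20, 100))     # 20 read articles, 100-d embeddings
candidates = rng.normal(size=(50, 100))  # 50 candidate articles
print(rank_candidates(build_profile(history), candidates)[:5])
```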
Word semantics based 3-d convolutional neural networks for news recommendation
VAIBHAV KUMAR,DHRUV KHATTAR,SHASHANK GUPTA,Vasudeva Varma Kalidindi
International Conference on Data Mining Workshops, ICDM-W, 2017
@inproceedings{bib_Word_2017, AUTHOR = {VAIBHAV KUMAR, DHRUV KHATTAR, SHASHANK GUPTA, Vasudeva Varma Kalidindi}, TITLE = {Word semantics based 3-d convolutional neural networks for news recommendation}, BOOKTITLE = {International Conference on Data Mining Workshops}. YEAR = {2017}}
Deep neural networks have yielded immense success in speech recognition, computer vision and natural language processing. However, their use for content-based recommendation has received relatively little attention. Moreover, different recommendation scenarios have their own issues, which creates the need for different recommendation approaches. One problem in news recommendation is handling temporal changes in user interests; modelling temporal behaviour thus becomes very important in this domain. In this work, we propose a recommendation model which uses semantic similarity between words as input to a 3-D Convolutional Neural Network in order to extract the temporal news-reading patterns of users, which in turn improves the quality of recommendations. We compare our model to a set of established baselines and the experimental results show that our model performs better than the state-of-the-art by 5.8% (Hit Ratio@10).
Leveraging Moderate User Data for News Recommendation
DHRUV KHATTAR,VAIBHAV KUMAR,Vasudeva Varma Kalidindi
International Conference on Data Mining Workshops, ICDM-W, 2017
@inproceedings{bib_Leve_2017, AUTHOR = {DHRUV KHATTAR, VAIBHAV KUMAR, Vasudeva Varma Kalidindi}, TITLE = {Leveraging Moderate User Data for News Recommendation}, BOOKTITLE = {International Conference on Data Mining Workshops}. YEAR = {2017}}
It is crucial for news aggregator websites that are recent entrants in the market to actively engage their existing users, and a recommendation system can help tackle this problem. However, due to the lack of a sufficient amount of data, most state-of-the-art methods perform poorly in recommending relevant news items to users. In this paper, we propose a novel approach for item-based collaborative filtering for recommending news items using a Markov Decision Process (MDP). Due to the sequential nature of news reading, we choose an MDP to model our recommendation system, as it is based on a sequence-optimization paradigm. Further, we incorporate factors like article freshness and similarity into our system by modelling them extrinsically in terms of the MDP's reward. We compare our approach with various state-of-the-art methods. On a moderately small amount of data, our MDP-based approach outperforms the others; one reason for this is that the baselines fail to identify the underlying patterns in the sequence in which users read articles, and hence do not generalize well.
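One way to picture the extrinsic reward shaping described above is a reward that mixes a click signal with content similarity and an exponentially decaying freshness term. This is a hypothetical sketch; the weights, half-life and functional form are assumptions, not the paper's.

```python
import math

def reward(base_click_reward, similarity, age_hours,
           w_sim=0.5, w_fresh=0.3, half_life=24.0):
    """Illustrative MDP reward: a base click reward shaped by content
    similarity to the previously read article and an exponentially
    decaying freshness term (half-life in hours)."""
    freshness = math.exp(-math.log(2) * age_hours / half_life)
    return base_click_reward + w_sim * similarity + w_fresh * freshness

# A fresh, similar article earns a larger shaped reward than a stale one.
print(reward(1.0, similarity=0.8, age_hours=6))
print(reward(1.0, similarity=0.8, age_hours=72))
```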
Sneit: Salient named entity identification in tweets
PRIYA RADHAKRISHNAN,Ganesh Jawahar,Manish Gupta,Vasudeva Varma Kalidindi
Computacion y Sistemas, CyS, 2017
@inproceedings{bib_Snei_2017, AUTHOR = {PRIYA RADHAKRISHNAN, Ganesh Jawahar, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Sneit: Salient named entity identification in tweets}, BOOKTITLE = {Computacion y Sistemas}. YEAR = {2017}}
Social media is a rich source of information and opinion, with an exponential data growth rate. However, social media posts are difficult to analyze since they are brief, unstructured and noisy. Interestingly, many social media posts are about an entity or entities, and understanding which entity is central (the Salient Entity) to a post helps analyze it better. In this paper we propose a model that aids such analysis by identifying the Salient Entity in a social media post, tweets in particular. We present a supervised machine-learning model to identify the Salient Entity in a tweet and propose that the tweet is most likely about that particular entity. To build a dataset of tweets and salient entities, we used the premise that when an image accompanies a text, the text is most likely about the entity in that image. We trained our model using this dataset. Note that this does not restrict the applicability of our model in any way: we use tweets with images only to obtain objective ground-truth data, while the model's features are derived from the tweet text alone. Our experiments show that the model identifies the Salient Named Entity with an F-measure of 0.63. We show the effectiveness of the proposed model for tweet-filtering and salience identification tasks. We have made the human-annotated dataset and the source code of this model publicly available.
Perils of opportunistically reusing software module
NAVEEN N. KULKARNI,Vasudeva Varma Kalidindi
SOFTWARE: PRACTICE AND EXPERIENCE, SAPE, 2016
@inproceedings{bib_Peri_2016, AUTHOR = {NAVEEN N. KULKARNI, Vasudeva Varma Kalidindi}, TITLE = {Perils of opportunistically reusing software module}, BOOKTITLE = {SOFTWARE: PRACTICE AND EXPERIENCE}. YEAR = {2016}}
Opportunistic reuse is need-based sourcing of software modules without a prior reuse plan. It is a common tactical approach in software development: developers often reuse an external software module opportunistically to improve their productivity. But studies have shown that this results in extensive refactoring and adds maintenance woes. We attribute this problem to mismatches between the software under development and the reused external module, caused by their differing assumptions and constraints. We highlight the problems of such opportunistic reuse practices with the help of a case study, in which we found issues such as unanticipated behavior, violated constraints, conflicting assumptions, fragile structure, and software bloat. In this paper, we draw the attention of the research community to widespread opportunistic reuse practices and the lack of methods to proactively identify and resolve the mismatches. We propose the need for supporting developers in reasoning before reuse, from the perspective of identifying and fixing both local and global mismatches. Furthermore, we identify other opportunistic software development practices where similar issues can be observed, and suggest research areas where further investigation can benefit developers in improving their productivity.
Unsupervised deep semantic and logical analysis for identification of solution posts from community answers
NIRAJ KUMAR,Srinathan Kannan,Vasudeva Varma Kalidindi
International Journal of Information and Decision Sciences, IJIDS, 2016
@inproceedings{bib_Unsu_2016, AUTHOR = {NIRAJ KUMAR, Srinathan Kannan, Vasudeva Varma Kalidindi}, TITLE = {Unsupervised deep semantic and logical analysis for identification of solution posts from community answers}, BOOKTITLE = {International Journal of Information and Decision Sciences}. YEAR = {2016}}
These days, discussion forums provide dependable solutions to problems related to multiple domains and areas. However, due to the presence of a huge number of less-informative or inappropriate posts, identifying appropriate problem-solution pairs has become a challenging task. The emergence of a wide variety of topics, domains and areas has made manual labelling of problem-solution post pairs very costly and time consuming. To solve these issues, we concentrate on deep semantic and logical relations between terms. For this, we introduce a novel semantic correlation graph to represent the text, which helps us identify topical and semantic relations between terms at a fine-grained level. Next, we apply an improved version of personalised PageRank using random walks with restarts, with the aim of boosting the rank scores of terms having a direct or indirect relation with terms in the given question. Finally, we introduce the use of a node-overlapping version of GAAC to find the actual span of the answer text. Our experimental results show that the devised system performs better than existing unsupervised systems.
A graph-based unsupervised N-gram filtration technique for automatic keyphrase extraction
NIRAJ KUMAR,Srinathan Kannan,Vasudeva Varma Kalidindi
International Journal of Data Mining, Modelling and Management, IJDMMM, 2016
@inproceedings{bib_A_gr_2016, AUTHOR = {NIRAJ KUMAR, Srinathan Kannan, Vasudeva Varma Kalidindi}, TITLE = {A graph-based unsupervised N-gram filtration technique for automatic keyphrase extraction}, BOOKTITLE = {International Journal of Data Mining, Modelling and Management}. YEAR = {2016}}
In this paper, we present a novel N-gram (N ≥ 1) filtration technique for keyphrase extraction. To filter sophisticated candidate keyphrases (N-grams), we introduce the combined use of: (1) a statistical feature obtained using weighted betweenness centrality scores of words, which is generally used to identify border nodes/edges in community detection techniques; and (2) co-location strength, calculated using nearest-neighbour DBpedia texts. We also introduce the use of an N-gram (N ≥ 1) graph, which reduces the bias effect of shorter N-grams in the ranking process and preserves the semantics of words (phraseness) based upon local context. To capture the theme of the document and to reduce the effect of noisy terms in the ranking process, we apply an information-theoretic framework for key-player detection on the proposed N-gram graph. Our experimental results show that the devised system performs better than current state-of-the-art unsupervised systems and is comparable to or better than supervised systems.
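The betweenness-centrality feature can be illustrated on a toy word co-occurrence graph with networkx: words with high betweenness sit on borders between topical communities. The graph and weights below are invented purely for illustration.

```python
import networkx as nx

# Toy word co-occurrence graph; the paper's statistical feature is derived
# from weighted betweenness centrality of words, which flags border
# nodes/edges between communities. This is a minimal illustration only.
G = nx.Graph()
edges = [("data", "mining", 3), ("data", "graph", 2),
         ("graph", "community", 4), ("community", "detection", 5),
         ("mining", "text", 1)]
G.add_weighted_edges_from(edges)

bc = nx.betweenness_centrality(G, weight="weight")
# Words with high betweenness lie on boundaries between topical clusters.
print(sorted(bc.items(), key=lambda kv: -kv[1]))
```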
TweetGrep: weakly supervised joint retrieval and sentiment analysis of topical tweets
SATARUPA GUHA,Tanmoy Chakraborty,Samik Datta,Vasudeva Varma Kalidindi
International Conference on Web and Social Media, ICWSM, 2016
@inproceedings{bib_Twee_2016, AUTHOR = {SATARUPA GUHA, Tanmoy Chakraborty, Samik Datta, Vasudeva Varma Kalidindi}, TITLE = {TweetGrep: weakly supervised joint retrieval and sentiment analysis of topical tweets}, BOOKTITLE = {International Conference on Web and Social Media}. YEAR = {2016}}
An overwhelming amount of data is generated every day on social media, encompassing a wide spectrum of topics. With almost every business decision depending on customer opinion, mining of social media data needs to be quick and easy. For a data analyst to keep up with the agility and the scale of the data, it is impossible to bank on fully supervised techniques to mine topics and their associated sentiments from social media. Motivated by this, we propose a weakly supervised approach (named TweetGrep) that lets the data analyst easily define a topic by a few keywords and adapt a generic sentiment classifier to the topic by jointly modeling topics and sentiments using label regularization. Experiments with diverse datasets show that TweetGrep beats the state-of-the-art models for both the tasks of retrieving topical tweets and analyzing the sentiment of the tweets (average improvement of 4.97% and 6.91% respectively in terms of area under the curve). Further, we show that TweetGrep can also be adopted for the novel task of hashtag disambiguation, where it significantly outperforms the baseline methods.
Author2Vec: Learning Author Representations by Combining Content and Link Information
GANESH J,SOUMYAJIT GANGULY,Manish Gupta,Vasudeva Varma Kalidindi,Vikram Pudi
International Conference on World wide web, WWW, 2016
@inproceedings{bib_Auth_2016, AUTHOR = {GANESH J, SOUMYAJIT GANGULY, Manish Gupta, Vasudeva Varma Kalidindi, Vikram Pudi}, TITLE = {Author2Vec: Learning Author Representations by Combining Content and Link Information}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2016}}
In this paper, we consider the problem of learning representations for authors from bibliographic co-authorship networks. Existing methods for deep learning on graphs, such as DeepWalk, suffer from the link sparsity problem as they model link information only. We hypothesize that capturing both content and link information in a unified way will help mitigate this sparsity problem. To this end, we present a novel model, Author2Vec, which learns low-dimensional author representations such that authors who write similar content and share similar network structure are closer in vector space. Such embeddings are useful in a variety of applications such as link prediction, node classification, recommendation and visualization. The author embeddings we learn are empirically shown to outperform DeepWalk by 2.35% and 0.83% on the link prediction and clustering tasks respectively.
Non-decreasing sub-modular function for comprehensible summarization
LITTON J KURISINKEL,PRUTHWIK MISHRA,VIGNESHWARAN M,Vasudeva Varma Kalidindi,Dipti Mishra Sharma
Conference of the North American Chapter of the Association for Computational Linguistics Workshops, NAACL-W, 2016
@inproceedings{bib_Non-_2016, AUTHOR = {LITTON J KURISINKEL, PRUTHWIK MISHRA, VIGNESHWARAN M, Vasudeva Varma Kalidindi, Dipti Mishra Sharma}, TITLE = {Non-decreasing sub-modular function for comprehensible summarization}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics Workshops}. YEAR = {2016}}
Extractive summarization techniques typically aim to maximize the information coverage of the summary with respect to the original corpus and report accuracies in ROUGE scores. Automated text summarization techniques should also consider the dimensions of comprehensibility, coherence and readability. In the current work, we identify the discourse structure which provides the context for the creation of a sentence, and leverage information from this structure to frame a monotone (non-decreasing) sub-modular scoring function for generating comprehensible summaries. Our approach improves the overall comprehensibility of the summary in terms of human evaluation and gives sufficient content coverage with a comparable ROUGE score. We also formulate a metric to measure summary comprehensibility in terms of the Contextual Independence of a sentence; the metric is shown to be representative of human judgement of text comprehensibility.
Doc2Sent2Vec: A Novel Two-Phase Approach for Learning Document Representation
GANESH J,Manish Gupta,Vasudeva Varma Kalidindi
International Conference on World wide web, WWW, 2016
@inproceedings{bib_Doc2_2016, AUTHOR = {GANESH J, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Doc2Sent2Vec: A Novel Two-Phase Approach for Learning Document Representation}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2016}}
Doc2Sent2Vec is an unsupervised approach to learn a low-dimensional feature vector (or embedding) for a document. This embedding captures the semantics of the document and can be fed as input to machine learning algorithms for a myriad of applications in data mining and information retrieval, including document classification, retrieval, and ranking. The proposed approach is two-phased. In the first phase, the model learns a vector for each sentence in the document using a standard word-level language model. In the next phase, it learns the document representation from the sentence sequence using a novel sentence-level language model. Intuitively, the first phase captures word-level coherence to learn sentence embeddings, while the second phase captures sentence-level coherence to learn document embeddings. Compared to state-of-the-art models that learn document vectors directly from word sequences, we hypothesize that the proposed decoupled strategy of learning sentence embeddings followed by document embeddings helps the model learn accurate and rich document representations. We evaluate the learned document embeddings on two classification tasks: scientific article classification and Wikipedia page classification. Our model outperforms the current state-of-the-art models in the scientific article classification task by ~12.07% and in the Wikipedia page classification task by ~6.93%, both in terms of F1 score. These results highlight the superior quality of document embeddings learned by the Doc2Sent2Vec approach.
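A schematic rendering of the two-phase decoupling, with mean pooling standing in for both the word-level and the sentence-level language models the paper actually trains; shapes and names are illustrative only.

```python
import numpy as np

def sentence_vectors(doc_as_word_vecs):
    """Phase 1 (schematic): one vector per sentence from its word vectors.
    The paper learns these with a word-level language model; mean pooling
    is a deliberately simplified stand-in."""
    return [np.mean(sent, axis=0) for sent in doc_as_word_vecs]

def document_vector(sent_vecs):
    """Phase 2 (schematic): a document vector from the sentence sequence.
    The paper uses a sentence-level language model; again we just pool."""
    return np.mean(sent_vecs, axis=0)

rng = np.random.default_rng(1)
doc = [rng.normal(size=(n, 50)) for n in (7, 12, 9)]  # 3 sentences of words
print(document_vector(sentence_vectors(doc)).shape)   # (50,)
```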
MultiStack: Multi-Cloud Big Data Research Framework/Platform
VISHRUT MEHTA SANJIV,KUMAR RISHABH,Reddy Raja,Vasudeva Varma Kalidindi
IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), CCEM, 2016
@inproceedings{bib_Mult_2016, AUTHOR = {VISHRUT MEHTA SANJIV, KUMAR RISHABH, Reddy Raja, Vasudeva Varma Kalidindi}, TITLE = {MultiStack: Multi-Cloud Big Data Research Framework/Platform}, BOOKTITLE = {IEEE International Conference on Cloud Computing in Emerging Markets (CCEM)}. YEAR = {2016}}
With the improving efficiency and cost-effectiveness of public cloud systems, there has been a growing trend of complementing in-house cloud environments with them. We propose an open solution to the problem of distributing jobs in a hybrid environment, as a substitute for Hadoop systems that are cloud-specific. The proposed system, MultiStack, is a big data orchestration platform for deploying big data jobs across multiple cloud providers. The specific architecture elaborated in this paper uses Amazon Web Services as the public cloud provider and OpenStack as the private cloud framework. Our solution supports the complete Hadoop ecosystem of tools (Hive, Pig, HBase, Oozie, etc.) and on-demand scaling of Hadoop clusters. The proposed framework aims at reducing job completion time on workloads along with a decrease in cost, using spot instance provisioning compared to on-demand provisioning. This is achieved by providing two modes of operation, proactive scheduling and reactive scheduling, which take into account user-provided job characteristics (e.g., memory, CPU), quota limitations and business objectives. From our experiments, we conclude that MultiStack is able to reduce average job completion time by 30-37% with a minimal increase in cost.
Hetstore: A platform for io workload assignment in a heterogeneous storage environment
KUMAR RISHABH,VISHRUT MEHTA SANJIV,TUSHANT JHA,Vasudeva Varma Kalidindi
IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), CCEM, 2016
@inproceedings{bib_Hets_2016, AUTHOR = {KUMAR RISHABH, VISHRUT MEHTA SANJIV, TUSHANT JHA, Vasudeva Varma Kalidindi}, TITLE = {Hetstore: A platform for io workload assignment in a heterogeneous storage environment}, BOOKTITLE = {IEEE International Conference on Cloud Computing in Emerging Markets (CCEM)}. YEAR = {2016}}
The problem of providing optimal assignment of backend storage is central to the design of cloud systems, and has taken on a further central role as a result of the growing heterogeneity of emerging Software Defined Storage (SDS) systems. In this paper, we propose a solution to optimal IO workload assignment using statistical modelling to estimate performance measures such as throughput and IOPS. The proposed system uses support vector regression to estimate the performance of individual IO workloads on each available SDS system for optimal assignment. As a proof of concept, we demonstrate our solution in a heterogeneous environment comprising HDFS, GlusterFS, and Ceph. We first show the accuracy of the throughput and IOPS estimates, with coefficients of determination over 0.65 in all cases. We then show how this regression model can be used to assign workloads to the SDS backend that maximizes throughput.
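A minimal sketch of the regression step with scikit-learn's SVR, under assumed, hand-made workload features and throughput labels; in the described system, one such model would be fit per SDS backend and the workload sent to the backend with the highest predicted performance.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical workload features: [read_ratio, io_size_kb, queue_depth]
X = np.array([[0.9, 4, 8], [0.5, 64, 16], [0.1, 512, 32],
              [0.7, 16, 4], [0.3, 128, 64], [0.8, 8, 2]])
y = np.array([420.0, 310.0, 150.0, 380.0, 210.0, 400.0])  # MB/s (made up)

# One regressor per SDS backend; at assignment time the workload goes to
# the backend whose model predicts the highest throughput.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X, y)
print(model.predict([[0.6, 32, 8]]))
```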
Interpreting the syntactic and social elements of the tweet representations via elementary property prediction tasks
GANESH J,Manish Gupta,Vasudeva Varma Kalidindi
Neural Information Processing Systems Workshops, NeurIPS-W, 2016
@inproceedings{bib_Inte_2016, AUTHOR = {GANESH J, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Interpreting the syntactic and social elements of the tweet representations via elementary property prediction tasks}, BOOKTITLE = {Neural Information Processing Systems Workshops}. YEAR = {2016}}
Research in social media analysis is experiencing a recent surge, with a large number of works applying representation learning models to solve high-level syntactico-semantic tasks such as sentiment analysis, semantic textual similarity computation, hashtag prediction and so on. Although the performance of representation learning models is better than that of traditional baselines for these tasks, little is known about the core properties of a tweet encoded within the representations. Understanding these core properties would empower us to make generalizable conclusions about the quality of the representations. The work presented here constitutes the first step in opening the black box of vector embeddings for social media posts, with emphasis on tweets in particular. To understand the core properties encoded in a tweet representation, we evaluate the representations to estimate the extent to which they can model each of those properties, such as tweet length, presence of words, hashtags, mentions, capitalization, and so on. This is done with the help of multiple classifiers which take the representation as input; essentially, each classifier evaluates one of the syntactic or social properties that are arguably salient for a tweet. This is also the first holistic study to extensively analyse the ability to encode these properties across a wide variety of tweet representation models, including traditional unsupervised methods (BOW, LDA), unsupervised representation learning methods (Siamese CBOW, Tweet2Vec) and supervised methods (CNN, BLSTM).
Towards sub-word level compositions for sentiment analysis of hindi-english code mixed text
ADITYA JOSHI,PRABHU AMEYA PANDURANG,Manish Srivastava,Vasudeva Varma Kalidindi
International Conference on Computational Linguistics, COLING, 2016
@inproceedings{bib_Towa_2016, AUTHOR = {ADITYA JOSHI, PRABHU AMEYA PANDURANG, Manish Srivastava, Vasudeva Varma Kalidindi}, TITLE = {Towards sub-word level compositions for sentiment analysis of hindi-english code mixed text}, BOOKTITLE = {International Conference on Computational Linguistics}. YEAR = {2016}}
Sentiment analysis (SA) using code-mixed data from social media has several applications in opinion mining, ranging from customer satisfaction to social campaign analysis in multilingual societies. Advances in this area are impeded by the lack of a suitable annotated dataset. We introduce a Hindi-English (Hi-En) code-mixed dataset for sentiment analysis and perform an empirical analysis comparing the suitability and performance of various state-of-the-art SA methods on social media. In this paper, we introduce learning sub-word level representations in our LSTM architecture (Subword-LSTM) instead of character-level or word-level representations. This linguistic prior enables us to learn information about the sentiment value of important morphemes, and also works well for highly noisy text containing misspellings, as demonstrated by the morpheme-level feature maps learned by our model. We hypothesize that encoding this linguistic prior in the Subword-LSTM architecture leads to its superior performance. Our system attains an accuracy 4-5% greater than traditional approaches on our dataset, and also outperforms the available system for sentiment analysis in Hi-En code-mixed text by 18%.
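A schematic Subword-LSTM in PyTorch: a 1-D convolution over character embeddings yields morpheme-like sub-word features, which an LSTM composes into a sentence representation for sentiment classification. Hyperparameters, vocabulary size and class count are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class SubwordLSTM(nn.Module):
    """Schematic Subword-LSTM: Conv1d over character embeddings produces
    sub-word (morpheme-like) features; an LSTM composes them into a
    sentence representation for 3-way sentiment classification."""
    def __init__(self, n_chars=128, dim=32, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding(n_chars, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_classes)

    def forward(self, char_ids):                       # (B, T) char ids
        x = self.emb(char_ids).transpose(1, 2)         # (B, d, T)
        x = torch.relu(self.conv(x)).transpose(1, 2)   # sub-word features
        _, (h, _) = self.lstm(x)
        return self.out(h[-1])                         # (B, n_classes)

model = SubwordLSTM()
print(model(torch.randint(0, 128, (4, 40))).shape)     # torch.Size([4, 3])
```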
Query-based evolutionary graph cuboid outlier detection
AYUSHI DALMIA,Manish Gupta,Vasudeva Varma Kalidindi
International Conference on Data Mining Workshops, ICDM-W, 2016
@inproceedings{bib_Quer_2016, AUTHOR = {AYUSHI DALMIA, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Query-based evolutionary graph cuboid outlier detection}, BOOKTITLE = {International Conference on Data Mining Workshops}. YEAR = {2016}}
Graph-OLAP is an online analytical framework which allows us to obtain various projections of a graph, each of which helps us view the graph along multiple dimensions and levels. Given a series of snapshots of a temporal heterogeneous graph, we aim to find interesting projections of the graph which have anomalous evolutionary behavior. Detecting anomalous projections in a series of such snapshots can help an analyst understand the regions of interest in the temporal graph, and identifying such semantically related regions allows the analyst to derive insights from temporal graphs and make decisions. While most work on temporal outlier detection is performed on nodes, subgraphs and communities, we are the first to propose detection of evolutionary graph cuboid outliers, and we perform this detection in a query-sensitive manner. An evolutionary graph cuboid outlier is thus a projection (or cuboid) of a snapshot of the temporal graph that contains an unexpected number of matches for the query with respect to other cuboids, both in the same snapshot and in the other snapshots. Identifying such outliers is challenging because (1) the number of cuboids per snapshot could be large, and (2) the number of snapshots could itself be large. We model the problem as predicting an outlier score for each cuboid in each snapshot, and propose building subspace ensemble regression models to learn (a) the behavior of a cuboid across different snapshots, and (b) the behavior of all the cuboids in a given snapshot. Experimental results on both synthetic and real datasets show the effectiveness of the proposed algorithm in discovering evolutionary graph cuboid outliers.
Improving Tweet Representations using Temporal and User Context
GANESH J,Manish Gupta,Vasudeva Varma Kalidindi
European Conference on Information Retrieval, ECIR, 2016
@inproceedings{bib_Impr_2016, AUTHOR = {GANESH J, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Improving Tweet Representations using Temporal and User Context}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2016}}
In this work we propose a novel representation learning model which computes accurate semantic representations for tweets. Our model systematically exploits the chronologically adjacent tweets (the 'context') from users' Twitter timelines for this task. Further, we make our model user-aware so that it can model the target tweet better by exploiting rich knowledge about the user, such as the way the user writes posts and the topics on which the user writes. We empirically demonstrate that the proposed models outperform the state-of-the-art models in predicting user profile attributes like spouse, education and job by 19.66%, 2.27% and 2.22% respectively.
A Case Study on Teaching Software Engineering Concepts using a Case-Based Learning Environment.
KIRTI GARG,Ashish Sureka,Vasudeva Varma Kalidindi
International Workshop on Case Method for Computing Education (, CMCE, 2015
@inproceedings{bib_A_Ca_2015, AUTHOR = {KIRTI GARG, Ashish Sureka, Vasudeva Varma Kalidindi}, TITLE = {A Case Study on Teaching Software Engineering Concepts using a Case-Based Learning Environment.}, BOOKTITLE = {International Workshop on Case Method for Computing Education (}. YEAR = {2015}}
Case-based teaching is a well-known teaching methodology consisting of learning by reading, discussing and analyzing real-life cases and scenarios. We present a Case-Oriented Learning Environment (COSEEd) for teaching Software Engineering concepts to undergraduate and graduate students in a first course on Software Engineering. The novelty of the proposed model lies in being a complete learning environment framework, consisting of pedagogy, broad-level learning objectives, assessment, resources and management details, all designed specifically for Software Engineering. Learning and teaching are centered around well-designed SE case studies drawn from authentic software development instances. We describe the COSEEd model and a sample case study, and share our insights as well as lessons learnt while applying the proposed model in practice. We implement and evaluate the proposed model in Software Engineering courses at a university in India focused on the core areas of Information Technology, and use empirical studies on student perception and actual performance to determine the effectiveness of COSEEd towards achieving various learning goals of SE.
Systemic requirements of a software engineering learning environment
KIRTI GARG,Vasudeva Varma Kalidindi
India Software Engineering Conference, ISECo, 2015
@inproceedings{bib_Syst_2015, AUTHOR = {KIRTI GARG, Vasudeva Varma Kalidindi}, TITLE = {Systemic requirements of a software engineering learning environment}, BOOKTITLE = {India Software Engineering Conference}. YEAR = {2015}}
Software Engineering (SE) educators worldwide are attempting to create learning environments that can effectively achieve their desired learning objectives. However, there exist additional needs that impact the learning process and the overall quality of a learning environment. We identified two sets of differentiating requirements, Climatic and Systemic, whose inclusion in the design can assist in creating an effective, sustainable and usable SE learning environment. In this paper, we describe the Systemic requirements, i.e., the desired system-wide capabilities that impact the sustainability of an SE learning environment by affecting its operationalization and use in the short and long term. We also discuss, through a few examples, the interactions between the various differentiating requirements. Current SE course design and evaluation treat these as challenges to be dealt with later, instead of addressing them through conscientious design; such courses find it hard to sustain and evolve with time, despite using powerful pedagogies. We intend to change this design approach by identifying and recording the various needs (as requirements) and their influence on the learning environment. Our aim is to draw attention to these differentiating requirements and help educators look beyond learning objectives towards a more holistic and systematic design of SE learning environments.
Towards deep semantic analysis of hashtags
PIYUSH BANSAL,ROMIL BANSAL,Vasudeva Varma Kalidindi
European Conference on Information Retrieval, ECIR, 2015
@inproceedings{bib_Towa_2015, AUTHOR = {PIYUSH BANSAL, ROMIL BANSAL, Vasudeva Varma Kalidindi}, TITLE = {Towards deep semantic analysis of hashtags}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2015}}
Hashtags are semantico-syntactic constructs used across various social networking and microblogging platforms to enable users to start a topic-specific discussion or classify a post into a desired category. Segmenting and linking the entities present within hashtags could therefore help in better understanding and extraction of information shared across social media. However, due to the lack of space delimiters in hashtags (e.g., #nsavssnowden), segmenting hashtags into constituent entities ("NSA" and "Edward Snowden" in this case) is not a trivial task. Most current state-of-the-art social media analytics systems, such as sentiment analysis and entity linking, tend to either ignore hashtags or treat them as a single word. In this paper, we present a context-aware approach to segment and link entities in hashtags to a knowledge base (KB) entry, based on the context within the tweet. Our approach segments and links the entities in hashtags such that the coherence between the hashtag semantics and the tweet is maximized. To the best of our knowledge, no existing study addresses the issue of linking entities in hashtags for extracting semantic information. We evaluate our method on two different datasets and demonstrate its effectiveness in improving overall entity linking in tweets via the additional semantic information provided by segmenting and linking entities in a hashtag.
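The segmentation step (without the context-aware linking the paper adds on top) reduces to a word-break dynamic program over a vocabulary. A minimal sketch, with a toy vocabulary, follows.

```python
def segment_hashtag(tag, vocab):
    """Segment a hashtag into known words via dynamic programming
    (word-break). The paper's method additionally scores candidate
    segmentations by coherence with the tweet context; this sketch
    covers only the segmentation step."""
    tag = tag.lstrip("#").lower()
    best = [None] * (len(tag) + 1)
    best[0] = []                          # empty prefix is segmentable
    for i in range(1, len(tag) + 1):
        for j in range(i):
            if best[j] is not None and tag[j:i] in vocab:
                cand = best[j] + [tag[j:i]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand        # prefer fewer, longer words
    return best[-1]

vocab = {"nsa", "vs", "snowden"}
print(segment_hashtag("#nsavssnowden", vocab))  # ['nsa', 'vs', 'snowden']
```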
Readable and Coherent MultiDocument Summarization.
LITTON J KURISINKEL,VIGNESHWARAN M,Vasudeva Varma Kalidindi,Dipti Mishra Sharma
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2015
@inproceedings{bib_Read_2015, AUTHOR = {LITTON J KURISINKEL, VIGNESHWARAN M, Vasudeva Varma Kalidindi, Dipti Mishra Sharma}, TITLE = {Readable and Coherent MultiDocument Summarization.}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2015}}
Extractive summarization is the process of precisely choosing a set of sentences from a corpus that can be representative of the original corpus in a limited space. In addition to exhibiting good content coverage, the final summary should be readable as well as structurally and topically coherent. In this paper we present a holistic multi-document summarization approach which takes care of content coverage, sentence ordering, maintenance of topical coherence, topical order and inter-sentence structural relationships. To achieve this we introduce the novel concept of a Local Coherent Unit (LCU). Our results are comparable with peer systems for content coverage and sentence ordering, measured in terms of ROUGE and τ score respectively. Human evaluators' preference for the readability and coherence of summaries is significantly better for our approach vis-à-vis other approaches. The approach also scales to larger real-time corpora.
Towards semantic retrieval of hashtags in microblogs
PIYUSH BANSAL,SOMAY JAIN,Vasudeva Varma Kalidindi
International Conference on World wide web, WWW, 2015
@inproceedings{bib_Towa_2015, AUTHOR = {PIYUSH BANSAL, SOMAY JAIN, Vasudeva Varma Kalidindi}, TITLE = {Towards semantic retrieval of hashtags in microblogs}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2015}}
On various microblogging platforms like Twitter, users post short text messages ranging from news and information to thoughts and daily chatter. These messages often contain keywords called hashtags, which are semantico-syntactic constructs that enable topical classification of microblog posts. In this poster, we propose and evaluate a novel method of semantic enrichment of microblogs for a particular type of entity search: retrieving a ranked list of the top-k hashtags relevant to a user's query Q. Such a list can help users track posts of their general interest. We show that our technique significantly improves microblog retrieval as well. We tested our approach on the publicly available Stanford sentiment analysis tweet corpus, and observed an improvement of more than 10% in NDCG for the microblog retrieval task, and around 11% in mean average precision for the hashtag retrieval task.
IIIT-H at SemEval 2015: Twitter Sentiment Analysis–The Good, the Bad and the Neutral!
AYUSHI DALMIA,Manish Gupta,Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2015
@inproceedings{bib_IIIT_2015, AUTHOR = {AYUSHI DALMIA, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {IIIT-H at SemEval 2015: Twitter Sentiment Analysis–The Good, the Bad and the Neutral!}, BOOKTITLE = {International Workshop on Semantic Evaluation}. YEAR = {2015}}
This paper describes the system that was submitted to SemEval-2015 Task 10: Sentiment Analysis in Twitter. We participated in Subtask B: Message Polarity Classification, a message-level classification of tweets into positive, negative and neutral sentiments. Our model is primarily a supervised one, consisting of well-designed features fed into an SVM classifier. In previous runs of this task, lexicons were found to play an important role in determining the sentiment of a tweet; we use existing lexicons to extract lexicon-specific features, and augment them with tweet-specific features. We also improve our system by using acronym and emoticon dictionaries. The proposed system achieves an F1 score of 59.83 and 67.04 on the Test Data and Progress Data respectively, placing us at the 18th position for the Test Dataset and the 16th position for the Progress Test Dataset.
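A bare-bones sketch of the supervised pipeline shape, using scikit-learn with TF-IDF word features and a linear SVM on a toy training set; the actual system adds lexicon, acronym and emoticon features on top of such word features.

```python
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; the described system augments word
# features with lexicon-, acronym- and emoticon-based features.
tweets = ["I love this phone :)", "worst service ever",
          "meh, it is okay I guess", "absolutely fantastic day!"]
labels = ["positive", "negative", "neutral", "positive"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(tweets, labels)
print(clf.predict(["this is fantastic :)"]))
```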
SIEL: aspect based sentiment analysis in reviews
SATARUPA GUHA,ADITYA JOSHI,Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2015
@inproceedings{bib_SIEL_2015, AUTHOR = {SATARUPA GUHA, ADITYA JOSHI, Vasudeva Varma Kalidindi}, TITLE = {SIEL: aspect based sentiment analysis in reviews}, BOOKTITLE = {International Workshop on Semantic Evaluation}. YEAR = {2015}}
Following in the footsteps of SemEval-2014 Task 4 (Pontiki et al., 2014), SemEval-2015 too had a task dedicated to aspect-level sentiment analysis (Pontiki et al., 2015), which saw participation from over 25 teams. In Aspect-based Sentiment Analysis, the aim is to identify the aspects of entities and the sentiment expressed for each aspect. In this paper, we present a detailed description of our system, which stood 4th in the Aspect Category subtask (slot 1), 7th in the Opinion Target Expression subtask (slot 2) and 8th in the Sentiment Polarity subtask (slot 3) on the Restaurant datasets.
Sentibase: Sentiment analysis in twitter on a budget
SATARUPA GUHA,ADITYA JOSHI,Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2015
@inproceedings{bib_Sent_2015, AUTHOR = {SATARUPA GUHA, ADITYA JOSHI, Vasudeva Varma Kalidindi}, TITLE = {Sentibase: Sentiment analysis in twitter on a budget}, BOOKTITLE = {International Workshop on Semantic Evaluation}. YEAR = {2015}}
Like SemEval 2013 and 2014, the Sentiment Analysis in Twitter task found a place in this year's SemEval too and attracted an unprecedented number of participants. The task comprises four subtasks; we participated in Subtask 2, Message Polarity Classification. Although we lie a few notches below the top system, we present a very simple yet effective approach to this problem that can be implemented in a single day!
How education, stimulation, and incubation encourage student entrepreneurship: Observations from MIT, IIIT, and Utrecht University
Slinger Jansen,Tommy van de Zande,Sjaak Brinkkemper,Erik Stam,Vasudeva Varma Kalidindi
The International Journal of Management Education, IJME, 2015
@inproceedings{bib_How__2015, AUTHOR = {Slinger Jansen, Tommy Van De Zande, Sjaak Brinkkemper, Erik Stam, Vasudeva Varma Kalidindi}, TITLE = {How education, stimulation, and incubation encourage student entrepreneurship: Observations from MIT, IIIT, and Utrecht University}, BOOKTITLE = {The International Journal of Management Education}. YEAR = {2015}}
Universities across the world are increasingly trying to become more entrepreneurial, in order to stay competitive, generate new sources of income through licensing or contract research, and follow policy guidelines from governments. The most powerful resource universities have to stimulate entrepreneurship is their students. However, there is no evaluated theory on how to encourage students to become entrepreneurs. Through three case studies, the entrepreneurial encouragement offerings at MIT in the United States, IIIT in India, and Utrecht University in the Netherlands are investigated. The offerings provided by these institutes have been surveyed, interviews about these offerings were held with university staff, and the findings were reflected upon through interviews with entrepreneurs who graduated from these institutes. The three case studies provide insight into how student entrepreneurship encouragement offerings contributed to students choosing a career as an entrepreneur. Several successful examples of such offerings are presented, and a model is proposed for effectively encouraging entrepreneurship among students. The model supports academic institutes in constructing an environment that encourages student entrepreneurship and aims to help universities convince students to continue their careers as entrepreneurs.
Query-based graph cuboid outlier detection
AYUSHI DALMIA,Manish Gupta,Vasudeva Varma Kalidindi
IEEE International Conference on Advances in Social Networks Analysis and Mining, ASONAM, 2015
@inproceedings{bib_Quer_2015, AUTHOR = {AYUSHI DALMIA, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Query-based graph cuboid outlier detection}, BOOKTITLE = {IEEE International Conference on Advances in Social Networks Analysis and Mining}. YEAR = {2015}}
Various projections or views of a heterogeneous information network can be modeled using the graph OLAP (On-line Analytical Processing) framework for effective decision making. Detecting anomalous projections of the network can help analysts identify regions of interest in the graph specific to the projection attribute. While most previous studies on outlier detection in graphs deal with outlier nodes, edges or subgraphs, we are the first to propose detection of graph cuboid outliers, and we perform this detection in a query-sensitive way. Given a general subgraph query on a heterogeneous network, we study the problem of finding outlier cuboids from the graph OLAP lattice. A Graph Cuboid Outlier (GCOutlier) is a cuboid with an exceptionally high density of matches for the query. The GCOutlier detection task is clearly challenging because: (1) finding matches for the query (subgraph isomorphism) is NP-hard; (2) the number of matches for the query can be very high; and (3) the number of cuboids can be large. We provide an approximate solution by computing only a fraction of the total matches, originating from a smartly chosen set of candidate nodes and including a select set of edges. We perform extensive experiments on synthetic datasets to showcase the execution time versus accuracy trade-off, and experiments on real datasets like Four Area and Delicious, containing thousands of nodes, reveal interesting GCOutliers.
Survey of social commerce research
Anuhya Vajapeyajula,PRIYA RADHAKRISHNAN,Vasudeva Varma Kalidindi
International Conference on Mining Intelligence and Knowledge Exploration, MIKE, 2015
@inproceedings{bib_Surv_2015, AUTHOR = {Anuhya Vajapeyajula, PRIYA RADHAKRISHNAN, Vasudeva Varma Kalidindi}, TITLE = {Survey of social commerce research}, BOOKTITLE = {International Conference on Mining Intelligence and Knowledge Exploration}. YEAR = {2015}}
Social commerce is a field growing rapidly with the rise of Web 2.0 technologies. This paper presents a review of existing research on the topic to ensure a comprehensive understanding of social commerce. First, we explore the evolution of social commerce from its marketing origins. Next, we examine various definitions of social commerce and the motivations behind it, and investigate its advantages and disadvantages for both businesses and customers. Then, we explore two major tools important for social commerce: sentiment analysis and social network analysis. By delving into well-known research papers in information retrieval and complex networks, we seek to present a survey of current research on the multifarious aspects of social commerce to the scientific research community.
Seed selection for domain-specific search
P NIKHIL PRIYATAM,AJAY DUBEY,Krish Perumal,SUGGU SAI PRANEETH,Dharmesh Kakadia,Vasudeva Varma Kalidindi
International Conference on World wide web, WWW, 2014
@inproceedings{bib_Seed_2014, AUTHOR = {P NIKHIL PRIYATAM, AJAY DUBEY, Krish Perumal, SUGGU SAI PRANEETH, Dharmesh Kakadia, Vasudeva Varma Kalidindi}, TITLE = {Seed selection for domain-specific search}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2014}}
The last two decades have witnessed an exponential rise in web content from a plethora of domains, which has necessitated the use of domain-specific search engines. Diversity of crawled content is one of the crucial aspects of a domain-specific search engine, and to a large extent, diversity is governed by the initial set of seed URLs. Most existing approaches rely on manual effort for seed selection. In this work we automate this process using URLs posted on Twitter: we propose an algorithm to obtain a set of diverse seed URLs from a Twitter URL graph. We compare the performance of our approach against the baseline zero-similarity seed selection method and find that our approach beats the baseline by a significant margin.
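One generic way to realize "a set of diverse seed URLs" is greedy farthest-point selection over URL embeddings; this is a stand-in illustration under assumed inputs, not the paper's Twitter-graph algorithm.

```python
import numpy as np

def diverse_seeds(url_embs, k):
    """Greedy farthest-point selection of k seed URLs: each new seed is
    the URL least similar to those already chosen. A generic stand-in
    for a graph-based diverse seed selection algorithm."""
    chosen = [0]                       # start from an arbitrary URL
    sims = url_embs @ url_embs.T       # cosine sims (rows are normalized)
    while len(chosen) < k:
        rest = [i for i in range(len(url_embs)) if i not in chosen]
        # pick the URL whose max similarity to chosen seeds is smallest
        chosen.append(min(rest, key=lambda i: sims[i, chosen].max()))
    return chosen

rng = np.random.default_rng(2)
E = rng.normal(size=(30, 16))
E /= np.linalg.norm(E, axis=1, keepdims=True)
print(diverse_seeds(E, 5))
```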
Entity tracking in real-time using sub-topic detection on twitter
SANDEEP PANEM,ROMIL BANSAL,Manish Gupta,Vasudeva Varma Kalidindi
European Conference on Information Retrieval, ECIR, 2014
@inproceedings{bib_Enti_2014, AUTHOR = {SANDEEP PANEM, ROMIL BANSAL, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Entity tracking in real-time using sub-topic detection on twitter}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2014}}
The velocity, volume and variety with which Twitter generates text are increasing exponentially. It is critical to determine latent sub-topics from such tweet data at any given point in time in order to provide better topic-wise search results relevant to users' informational needs. The two main challenges in mining sub-topics from tweets in real time are (1) understanding the semantic and conceptual representation of the tweets, and (2) determining when a new sub-topic (or cluster) appears in the tweet stream. We address these challenges by proposing two unsupervised clustering approaches. In the first approach, we generate a semantic-space representation for each tweet by keyword expansion and keyphrase identification. In the second approach, we transform each tweet into a conceptual space that represents the latent concepts of the tweet. We empirically show that the proposed methods outperform the state-of-the-art methods.
EDIUM: Improving Entity Disambiguation via User Modeling
ROMIL BANSAL,SANDEEP PANEM,Manish Gupta,Vasudeva Varma Kalidindi
European Conference on Information Retrieval, ECIR, 2014
@inproceedings{bib_EDIU_2014, AUTHOR = {ROMIL BANSAL, SANDEEP PANEM, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {EDIUM: Improving Entity Disambiguation via User Modeling}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2014}}
Entity Disambiguation is the task of associating entity name mentions in text with the correct referent entities in a knowledge base, with the goal of understanding and extracting useful information from the document. Entity disambiguation is a critical component of systems designed to harness information shared by users on microblogging sites like Twitter. However, noise and lack of context in tweets make disambiguation a difficult task. In this paper, we describe an Entity Disambiguation system, EDIUM, which uses user interest models to disambiguate the entities in a user's tweets. Our system jointly models the user's interest scores and the context disambiguation scores, thus compensating for the sparse context in the tweets of a given user. We evaluated the system's entity linking capabilities on tweets from multiple users and showed that improvement can be achieved by combining the user models and the context-based models.
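The joint modeling of user-interest and context scores can be sketched as a convex combination per candidate entity; the mixing weight and scores below are invented purely for illustration.

```python
def disambiguate(candidates, context_score, user_interest, alpha=0.6):
    """Schematic EDIUM-style joint scoring: each candidate entity gets a
    convex combination of its tweet-context score and a user-interest
    score from the user's model. The mixing weight alpha is illustrative."""
    joint = {e: alpha * context_score[e]
                + (1 - alpha) * user_interest.get(e, 0.0)
             for e in candidates}
    return max(joint, key=joint.get)

cands = ["Apple_Inc.", "Apple_(fruit)"]
ctx = {"Apple_Inc.": 0.40, "Apple_(fruit)": 0.45}  # sparse tweet context
usr = {"Apple_Inc.": 0.90}                         # user tweets about tech
print(disambiguate(cands, ctx, usr))               # 'Apple_Inc.'
```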
Enrichment of Bilingual Dictionary through News Stream Data.
AJAY DUBEY,Parth Gupta,Vasudeva Varma Kalidindi,Paolo Rosso
International Conference on Language Resources and Evaluation, LREC, 2014
@inproceedings{bib_Enri_2014, AUTHOR = {AJAY DUBEY, Parth Gupta, Vasudeva Varma Kalidindi, Paolo Rosso}, TITLE = {Enrichment of Bilingual Dictionary through News Stream Data.}, BOOKTITLE = {International Conference on Language Resources and Evaluation}. YEAR = {2014}}
Bilingual dictionaries are the key component of cross-lingual similarity estimation methods. Usually, such dictionary generation is accomplished by manual or automatic means; automatic approaches exploit parallel or comparable data to derive dictionary entries, and require a large amount of bilingual data to produce a good-quality dictionary. Many language pairs do not have large bilingual comparable corpora, and in such cases the best automatic dictionary is upper-bounded by the quality and coverage of the available corpora. In this work we propose a method which exploits continuous quasi-comparable corpora to derive term-level associations for the enrichment of such a limited dictionary. Though we present our experiments for English and Hindi, our approach is easily extendable to other languages. We evaluated the dictionary by manually computing the precision, and show in experiments that our approach is able to derive interesting term-level associations across languages.
Supporting comprehension of unfamiliar programs by modeling an expert's perception
NAVEEN N. KULKARNI,Vasudeva Varma Kalidindi
International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering, RAISE, 2014
@inproceedings{bib_Supp_2014, AUTHOR = {NAVEEN N. KULKARNI, Vasudeva Varma Kalidindi}, TITLE = {Supporting comprehension of unfamiliar programs by modeling an expert's perception}, BOOKTITLE = {International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering}. YEAR = {2014}}
Developers need to understand many Software Engineering (SE) artifacts while making changes to code. In such cases, developers use cues extensively to establish the relevance of a piece of information to the task, and their familiarity with different kinds of cues helps them comprehend a program. But developers face information overload because (a) there are many cues and (b) they might be unfamiliar with the artifacts. We therefore propose a novel approach to overcome the information overload problem by modeling a developer's perceived value of information based on cues. In this preliminary study, we validate one such model for common comprehension tasks, and also apply the model to summarize source code. An evaluation of the generated summaries showed 83% similarity with summaries recorded by developers. These promising results encourage us to create a repository of perception models that can later aid complex SE tasks.
Exploiting Wikipedia Categorization for Predicting Age and Gender of Blog Authors
Santosh K,ADITYA JOSHI,Manish Gupta,Vasudeva Varma Kalidindi
Conference on User Modeling, Adaptation and Personalization Workshop, UMAP-W, 2014
@inproceedings{bib_Expl_2014, AUTHOR = {Santosh K, ADITYA JOSHI, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Exploiting Wikipedia Categorization for Predicting Age and Gender of Blog Authors}, BOOKTITLE = {Conference on User Modeling, Adaptation and Personalization Workshop}. YEAR = {2014}}
For privacy reasons, personally identifiable information like the age and gender of people is not available publicly. However, accurate prediction of such information has important applications in advertising, forensics and business intelligence. Existing methods for this problem have focused on classifier learning using content-based features like word n-grams and style-based features like Part-of-Speech (POS) n-grams. Two major drawbacks of previous approaches are that (1) they do not consider the semantic relation between words, and (2) they do not handle polysemy. We propose a novel method to address these drawbacks by representing the document using Wikipedia concepts and category information. Experimental results show that classifiers learned using such features, along with previously used features, achieve significantly better accuracy than state-of-the-art methods. Indeed, feature selection shows that our novel features are more effective than previously used content-based features.
Modeling the evolution of product entities
PRIYA RADHAKRISHNAN,Manish Gupta,Vasudeva Varma Kalidindi
International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, 2014
@inproceedings{bib_Mode_2014, AUTHOR = {PRIYA RADHAKRISHNAN, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Modeling the evolution of product entities}, BOOKTITLE = {International ACM SIGIR Conference on Research and Development in Information Retrieval}. YEAR = {2014}}
A large number of web queries are related to product entities. Studying the evolution of product entities can help analysts understand changes in particular attribute values for these products. However, studying the evolution of a product requires us to link various versions of the product together in temporal order. While it is easy to manually link recent versions of products in a few domains, solving the problem in general is challenging. The ability to temporally order and link various versions of a single product can also improve product search engines. In this paper, we tackle the problem of finding the previous version (predecessor) of a product entity. Given a repository of product entities, we first parse the product names using a CRF model. After identifying entities corresponding to a single product, we find the previous version of any given version of the product, leveraging innovative features with a Naïve Bayes classifier. Our methods achieve a precision of 88% in identifying product versions from product entity names, and a precision of 53% in identifying the predecessor.
CharBoxes: a system for automatic discovery of character infoboxes from books
Manish Gupta,PIYUSH BANSAL,Vasudeva Varma Kalidindi
International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, 2014
@inproceedings{bib_Char_2014, AUTHOR = {Manish Gupta, PIYUSH BANSAL, Vasudeva Varma Kalidindi}, TITLE = {CharBoxes: a system for automatic discovery of character infoboxes from books}, BOOKTITLE = {International ACM SIGIR Conference on Research and Development in Information Retrieval}. YEAR = {2014}}
Entities are central to a large number of real-world applications. Wikipedia shows entity infoboxes for a large number of entities; however, not much structured information is available about character entities in books. Automatic discovery of characters from books can help in effective summarization. Such a structured summary, which not just introduces the characters in the book but also provides high-level relationships between them, can be of critical importance for buyers. This task involves the following challenging novel problems: 1. automatic discovery of important characters given a book; 2. automatic social graph construction relating the discovered characters; 3. automatic summarization of text most related to each of the characters; and 4. automatic infobox extraction from such summarized text for each character. As part of this demo, we design mechanisms to address these challenges and experiment with publicly available books.
Exploiting wikipedia inlinks for linking entities in queries
PRIYA RADHAKRISHNAN,ROMIL BANSAL,Manish Gupta,Vasudeva Varma Kalidindi
international workshop on Entity recognition & disambiguation, ERD, 2014
@inproceedings{bib_Expl_2014, AUTHOR = {PRIYA RADHAKRISHNAN, ROMIL BANSAL, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Exploiting wikipedia inlinks for linking entities in queries}, BOOKTITLE = {international workshop on Entity recognition & disambiguation}. YEAR = {2014}}
Given a knowledge base, annotating any text with entities in the knowledge base enhances automated understanding of the text. Entities provide extra contextual information for the automated system to understand and interpret the text better. In the special case when the text is in the form of short text queries, automated understanding can be critical in improving the quality of search results and recommendations. Annotation of queries helps semantic retrieval, ensuring diversity of search results including retrieval of relevant news stories. In this paper, we present SIEL@ERD, a system for automated stamping of entity information in short query text. Our system builds on the state-of-the-art TAGME system and is optimized for time and performance efficiency. It achieved an F1 measure of 0.53 and a latency of 0.31 seconds on a dataset of 500 queries and a Freebase snapshot provided for the short track in the Entity Recognition and Disambiguation Challenge at SIGIR 2014.
Cost Based Approach to Block Placement for Distributed File Systems
Lakshminarayanan Srinivasan,Vasudeva Varma Kalidindi
Future Internet of Things and Cloud, FiCloud, 2014
@inproceedings{bib_Cost_2014, AUTHOR = {Lakshminarayanan Srinivasan, Vasudeva Varma Kalidindi}, TITLE = {Cost Based Approach to Block Placement for Distributed File Systems}, BOOKTITLE = {Future Internet of Things and Cloud}. YEAR = {2014}}
Our computing demands have grown so much that we need robust distributed computing platforms to process data. To feed such data-hungry systems, we need equally robust distributed file systems that span multiple geographically separate locations. Most distributed file systems break a file into a set of blocks or chunks which are spread across the cluster. The major bottleneck is identifying where to place the blocks and their replicas so that the cluster is optimized on parameters like disk utilization, network congestion, throughput, power utilization, etc. This paper proposes assigning a distance measure between each pair of data sources and placing each block on the disk and node that minimize the total distance of the last few requests made for that block. As the request pattern and parameters change, the distances are updated and the blocks are moved dynamically to minimize the distance, in effect optimizing on the required parameters. The distance function is modeled based on the cluster and the parameters to be optimized. It can be a function of bandwidth alone, of bandwidth and latency, or of other features like disk utilization, processing power, disk speed, power utilization, cooling requirements, temperature, etc. A detailed performance analysis with disk bandwidth, network bandwidth and disk utilization as the parameters shows performance far better (10% or more, depending on specification differences) than a reference system that has no understanding of the different types of disks present or the nature of the cluster. Owing to its performance-aware nature, the system was inherently able to use memory to speed up performance through in-memory partitions.
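A minimal sketch of the placement rule described above, assuming a simple additive distance function over per-link bandwidth and latency tables and a short per-block request history (the names, data structures and window size are illustrative, not the paper's implementation):

```python
def distance(src, dst, bandwidth, latency):
    # Toy distance: penalize high latency and low bandwidth on the link.
    return latency[(src, dst)] + 1.0 / bandwidth[(src, dst)]

def best_placement(block_id, nodes, request_history, bandwidth, latency):
    # Choose the node minimizing the total distance to the sources of
    # the last few requests made for this block.
    recent = request_history[block_id][-10:]
    return min(nodes, key=lambda node: sum(
        distance(src, node, bandwidth, latency) for src in recent))
```

Re-running best_placement as the request history evolves yields the dynamic block movement the abstract describes; swapping in a different distance function changes which parameters the cluster optimizes.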
MultiPaaS-PaaS on Multiple Clouds
SHASHANK SAHNI,Vasudeva Varma Kalidindi
IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), CCEM, 2014
@inproceedings{bib_Mult_2014, AUTHOR = {SHASHANK SAHNI, Vasudeva Varma Kalidindi}, TITLE = {MultiPaaS-PaaS on Multiple Clouds}, BOOKTITLE = {IEEE International Conference on Cloud Computing in Emerging Markets (CCEM)}. YEAR = {2014}}
Current PaaS solutions use an underlying IaaS provider and restrict themselves to a few geographical regions. They operate in only one or two zones and use the others for failover or disaster recovery. In this paper, we present the design of MultiPaaS, a PaaS solution which runs on multiple cloud/IaaS providers and leverages their combined global span. Unlike traditional scaling techniques where app servers are scaled up or down in a specific region or data center, we present a design that performs scaling on a global scale: the origin of a request burst is identified and instances are booted in the data center closest to it. MultiPaaS ensures database persistence and cross-region scalability via a combination of replication techniques. We identify common features available across all IaaS providers and use them as a provider interface to ensure interoperability, while using provider-specific extensions to leverage unique offerings and benefits.
Computational advertising: Techniques for targeting relevant ads
KUSHAL SHAILESH DAVE,Vasudeva Varma Kalidindi
Foundations and Trends in Information Retrieval, FTIR, 2014
@inproceedings{bib_Comp_2014, AUTHOR = {KUSHAL SHAILESH DAVE, Vasudeva Varma Kalidindi}, TITLE = {Computational advertising: Techniques for targeting relevant ads}, BOOKTITLE = {Foundations and Trends in Information Retrieval}. YEAR = {2014}}
Computational Advertising, popularly known as online advertising or Web advertising, refers to finding the most relevant ads matching a particular context on the Web. The context depends on the type of advertising and could mean the content where the ad is shown, the user who is viewing the ad, or the social network of the user. Computational Advertising (CA) is a scientific sub-discipline at the intersection of information retrieval, statistical modeling, machine learning, optimization, large-scale search and text analysis. The core problem addressed in CA is match-making between the ads and the context. CA is prevalent in three major forms on the Web. One form involves showing textual ads relevant to a query on the search page, known as Sponsored Search. Showing textual ads relevant to a third-party webpage's content is known as Contextual Advertising. The third form also deals with the placement of ads on third-party Web pages, but here the ads are rich multimedia ads (image, video, audio, flash); the business model with rich media ads is slightly different from the ones with textual ads. These ads are also called banner ads, and this form of advertising is known as Display Advertising. Both Sponsored Search and Contextual Advertising involve retrieving relevant ads for different types of content (query and Web page). As ads are short and are mainly written to attract the user, retrieval of ads poses challenges like vocabulary mismatch between the query/content and the ad. Also, as the user's probability of examining an ad decreases with the position of the ad in the ranked list, it is imperative to keep the best ads at the top positions. Display Advertising poses several challenges, including modeling user behaviour, handling noisy page content, and bid optimization on the advertiser's side. Additionally, online advertising faces challenges like false bidding, click spam and ad spam, which are prevalent in all forms of advertising. A lot of research has been published in different areas of CA over the last decade and a half. The focus of this survey is to discuss the problems and solutions pertaining to the information retrieval, machine learning and statistics aspects of CA; it covers techniques and approaches that deal with several of the issues mentioned above. Research in CA has evolved over time and currently continues both in traditional areas (vocabulary mismatch, query rewriting, click prediction) and in recently identified areas (user targeting, mobile advertising, social advertising). In this study, we predominantly focus on the problems and solutions proposed in traditional areas in detail and briefly cover the emerging areas in the latter half of the survey. To facilitate future research, a discussion of available resources, a list of public benchmark datasets and future directions of work are also provided at the end.
Structured information extraction from natural disaster events on twitter
SANDEEP PANEM,Manish Gupta,Vasudeva Varma Kalidindi
International Workshop on Web-scale Knowledge Representation Retrieval & Reasoning, WEB-KR, 2014
@inproceedings{bib_Stru_2014, AUTHOR = {SANDEEP PANEM, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Structured information extraction from natural disaster events on twitter}, BOOKTITLE = {International Workshop on Web-scale Knowledge Representation Retrieval & Reasoning}. YEAR = {2014}}
As soon as natural disaster events happen, users are eager to know more about them. However, search engines currently provide a ten blue links interface for queries related to such events. Relevance of results for such queries can be significantly improved if users are shown a structured summary of the fresh events related to such queries. This would not just reduce the number of user clicks to get the relevant information but would also help users get updated with more fine grained attribute-level information. Twitter is a great source that can be exploited for obtaining such fine-grained structured information for fresh natural disaster events. Such events are often reported on Twitter much earlier than on other news media. However, extracting such structured information from tweets is challenging because: 1. tweets are noisy and ambiguous; 2. there is no well defined schema for various types of natural disaster events; 3. it is not trivial to extract attribute-value pairs and facts from unstructured text; and 4. it is difficult to find good mappings between extracted attributes and attributes in the event schema. We propose algorithms to extract attribute-value pairs, and also devise novel mechanisms to map such pairs to manually generated schemas for natural disaster events. Besides the tweet text, we also leverage text from URL links in the tweets to fill such schemas. Our schemas are temporal in nature and the values are updated whenever fresh information flows in from human sensors on Twitter. Evaluation on ~58000 tweets for 20 events shows that our system can fill such event schemas with an F1 of ~0.6.
Energy and SLA aware VM Scheduling
RADHESHYAM NANDURI,Dharmesh Kakadia,Vasudeva Varma Kalidindi
Technical Report, arXiv, 2014
@inproceedings{bib_Ener_2014, AUTHOR = {RADHESHYAM NANDURI, Dharmesh Kakadia, Vasudeva Varma Kalidindi}, TITLE = {Energy and SLA aware VM Scheduling}, BOOKTITLE = {Technical Report}. YEAR = {2014}}
With the advancement of cloud computing over the past few years, there has been a massive shift from traditional data centers to cloud-enabled data centers. Enterprises with cloud data centers are focusing their attention on energy savings through effective utilization of resources. In this work, we propose algorithms which minimize the energy consumption of the data center while maintaining SLA guarantees. The algorithms utilize the least number of physical machines in the data center by dynamically rebalancing the physical machines based on their resource utilization. They also perform an optimal consolidation of virtual machines onto physical machines, minimizing SLA violations. In extensive simulations, our algorithms achieve energy savings of about 21% and, in terms of maintaining SLAs, perform 60% better than the Single Threshold algorithm.
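A toy sketch of the rebalancing idea, assuming unit-free VM loads and a hard utilization cap standing in for the SLA guarantee (the thresholds, names and data structures are illustrative assumptions, not the paper's algorithm):

```python
def rebalance(hosts, low=0.2, high=0.8):
    """hosts: {host: [vm_load, ...]}. Drain underutilized hosts so they can
    be switched off, never pushing a target past `high` (the SLA stand-in)."""
    util = {h: sum(vms) for h, vms in hosts.items()}
    for h in sorted(hosts, key=util.get):          # least utilized first
        if util[h] >= low:
            continue                               # host is busy enough
        for vm in list(hosts[h]):
            fits = [t for t in hosts if t != h and util[t] + vm <= high]
            if not fits:
                break                              # nowhere to consolidate
            t = max(fits, key=util.get)            # pack the busiest target
            hosts[h].remove(vm); hosts[t].append(vm)
            util[h] -= vm; util[t] += vm
    return {h: vms for h, vms in hosts.items() if vms}  # empties power off
```

Packing onto the busiest target that still has headroom is one simple way to free machines quickly while keeping every host below the SLA cap.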
A sandhi splitter for malayalam
DEVADATH V V,LITTON J KURISINKEL,Dipti Mishra Sharma,Vasudeva Varma Kalidindi
International Conference on Natural Language Processing., ICON, 2014
@inproceedings{bib_A_sa_2014, AUTHOR = {DEVADATH V V, LITTON J KURISINKEL, Dipti Mishra Sharma, Vasudeva Varma Kalidindi}, TITLE = {A sandhi splitter for malayalam}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2014}}
Sandhi splitting is a primary task for the computational processing of text in Sanskrit and Dravidian languages. In these languages, words can join together with morpho-phonemic changes at the point of joining; this phenomenon is known as Sandhi. A sandhi splitter splits a string of conjoined words into individual words. Accurate sandhi splitting is crucial for text processing tasks such as POS tagging, topic modelling and document indexing. We tried different approaches to address the challenges of sandhi splitting in Malayalam and finally chose to exploit the phonological changes that take place in words while joining. This resulted in a hybrid method which statistically identifies the split points and splits using predefined character-level linguistic rules. Currently, our system gives an accuracy of 91.1%.
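To make the hybrid idea concrete, here is a toy illustration of the rule step only: given a split point already predicted by the statistical component, character-level rules undo the phonological change at the boundary. The rule table and the romanized example are simplified stand-ins, not the paper's Malayalam rule set:

```python
# joined character -> (tail of first word, head of second word)
RULES = {"e": ("a", "i"),   # a + i -> e  (a common vowel sandhi)
         "o": ("a", "u")}   # a + u -> o

def split_at(word, i):
    """Split `word` at predicted index i, undoing the sandhi change there."""
    if word[i] in RULES:
        left, right = RULES[word[i]]
        return word[:i] + left, right + word[i + 1:]
    return word[:i], word[i:]        # no phonemic change at the boundary

print(split_at("mahendra", 3))       # ('maha', 'indra')
```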
TagMiner: A Semisupervised Associative POS Tagger Effective for Resource Poor Languages
PRATIBHA RANI,Vikram Pudi,Vasudeva Varma Kalidindi
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databa, PKDD/ECML, 2014
@inproceedings{bib_TagM_2014, AUTHOR = {PRATIBHA RANI, Vikram Pudi, Vasudeva Varma Kalidindi}, TITLE = {TagMiner: A Semisupervised Associative POS Tagger Effective for Resource Poor Languages}, BOOKTITLE = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databa}. YEAR = {2014}}
We present TagMiner, a data mining approach for part-of-speech (POS) tagging, an important Natural Language Processing (NLP) classification task. It is a semi-supervised associative classification method for POS tagging. Existing methods for building POS taggers require extensive domain and linguistic knowledge and resources. Our method uses a combination of a small POS-tagged corpus and raw untagged text as training data to build the classifier model using association rules. Our tagger works well even with very little training data. The use of semi-supervised learning provides the advantage of not requiring a large high-quality tagged corpus. These properties make it especially suitable for resource-poor languages. Our experiments on various resource-rich, resource-moderate and resource-poor languages show good performance without using any language-specific linguistic information. We note that the inclusion of such features in our method may further improve performance. Results also show that for smaller training data sizes our tagger performs better than a state-of-the-art CRF tagger using the same features.
Cross lingual text reuse detection based on keyphrase extraction and similarity measures
RAMBHOOPAL REDDY K,Vasudeva Varma Kalidindi
Forum for Information Retrieval Evaluation, FIRE, 2013
@inproceedings{bib_Cros_2013, AUTHOR = {RAMBHOOPAL REDDY K, Vasudeva Varma Kalidindi}, TITLE = {Cross lingual text reuse detection based on keyphrase extraction and similarity measures}, BOOKTITLE = {Forum for Information Retrieval Evaluation}. YEAR = {2013}}
Information on the web is growing fast in various languages, but a large amount of content still exists only in English. Several cases of English text reuse (cross-language plagiarism) have been observed in non-English languages. Detecting text reuse in non-English languages is a challenging task due to the complexity of the languages involved, and the complexity increases further for low-resource languages like Arabic and the Indian languages. In this paper, we address the FIRE CL!TR 2011 task of detecting plagiarized Hindi documents that were reused from English source documents. We propose three approaches using classification and key-phrase retrieval techniques. Our winning approach attained an F-measure of 0.792.
Summarizing answers for community question answer services
Vinay Pande,TANMOY KR. MUKHERJEE,Vasudeva Varma Kalidindi
International conference on Language Processing and Knowledge in the Web, GSCL, 2013
@inproceedings{bib_Summ_2013, AUTHOR = {Vinay Pande, TANMOY KR. MUKHERJEE, Vasudeva Varma Kalidindi}, TITLE = {Summarizing answers for community question answer services}, BOOKTITLE = {International conference on Language Processing and Knowledge in the Web}. YEAR = {2013}}
This paper presents a novel answer summarization approach for community Question Answering services (cQAs) to address the problem of the "incomplete answer", i.e., valuable information missing from the "best answer" of a complex multi-sentence question, which can be obtained from other answers to the same question. Our method automatically generates a novel and non-redundant summary from cQA answers using structured determinantal point processes (SDPP). Experimental evaluation on a sample dataset from Yahoo Answers shows significant improvement over baseline approaches.
Author profiling using LDA and maximum entropy—Notebook for PAN at CLEF 2013
M S S ADITYA PAVAN,M. ADITYA,Vasudeva Varma Kalidindi
Conference and Labs of the Evaluation Forum, CLEF, 2013
@inproceedings{bib_Auth_2013, AUTHOR = {M S S ADITYA PAVAN, M. ADITYA, Vasudeva Varma Kalidindi}, TITLE = {Author profiling using LDA and maximum entropy—Notebook for PAN at CLEF 2013}, BOOKTITLE = {Conference and Labs of the Evaluation Forum}. YEAR = {2013}}
This paper describes our participation in the traditional authorship attribution subtask of the PAN/CLEF 2013 workshop. To classify documents based on the gender and age of the author, we applied a traditional topic modeling approach using Latent Dirichlet Allocation (LDA). We used content-based features like topics and style-based features like preposition frequencies, which act as efficient markers of authorship attributes based on age and gender. Using tenfold cross-validation, we observed that our classification approach using MaxEnt and LDA gave an accuracy of 53.3% for English and 52% for Spanish.
Don't Use a Lot When Little Will Do: Genre Identification Using URLs.
P. NIKHIL PRIYATAM,Srinivasan Iyengar,Krish Perumal,Vasudeva Varma Kalidindi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2013
@inproceedings{bib_Don'_2013, AUTHOR = {P. NIKHIL PRIYATAM, Srinivasan Iyengar, Krish Perumal, Vasudeva Varma Kalidindi}, TITLE = {Don't Use a Lot When Little Will Do: Genre Identification Using URLs.}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2013}}
The ever-increasing data on the World Wide Web calls for the use of vertical search engines. Sandhan is one such search engine, offering search in the tourism and health genres in more than 10 different Indian languages. In this work we build a URL-based genre identification module for Sandhan; a direct impact of this work is on building focused crawlers to gather Indian-language content. We conduct experiments on tourism and health web pages in Hindi. We experiment with three approaches: list-based, naive Bayes, and incremental naive Bayes. We evaluate these against another web page classification algorithm built on the parsed text of manually labeled web pages, and find that the incremental naive Bayes approach outperforms the other two. In our experiments we work with different features like words, n-grams and all-grams. Using n-gram features we achieve classification accuracies of 0.858 and 0.873 for the tourism and health genres respectively.
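A minimal runnable sketch of incremental naive Bayes over character n-grams of URLs, using scikit-learn's partial_fit for the incremental updates; the URLs, labels and n-gram range are illustrative assumptions, not the paper's setup:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

vec = HashingVectorizer(analyzer="char", ngram_range=(3, 5),
                        alternate_sign=False)  # non-negative counts for NB
clf = MultinomialNB()
classes = ["tourism", "health"]

def train_batch(urls, labels):
    # partial_fit lets the model absorb newly crawled URLs incrementally.
    clf.partial_fit(vec.transform(urls), labels, classes=classes)

train_batch(["http://example.in/yatra/taj-mahal",
             "http://example.in/hospital/fever-care"],
            ["tourism", "health"])
print(clf.predict(vec.transform(["http://example.in/temple-tour"])))
```

Because only the URL string is featurized, such a module can steer a focused crawler before any page content is fetched.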
Leveraging latent concepts for retrieving relevant ads for short text
ANKIT PATIL,KUSHAL SHAILESH DAVE,Vasudeva Varma Kalidindi
European Conference on Information Retrieval, ECIR, 2013
@inproceedings{bib_Leve_2013, AUTHOR = {ANKIT PATIL, KUSHAL SHAILESH DAVE, Vasudeva Varma Kalidindi}, TITLE = {Leveraging latent concepts for retrieving relevant ads for short text}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2013}}
Microblogging platforms are increasingly becoming a lucrative prospect for advertisers to attract customers. The challenge with advertising on such platforms is that there is very little content from which to retrieve relevant ads. As the microblogging content is short and noisy and the ads are short too, there is a high degree of lexical/vocabulary mismatch between the micropost and the ads. To bridge this vocabulary mismatch, we propose a conceptual approach that transforms the content into a conceptual space representing its latent concepts. We empirically show that the conceptual model performs better than various state-of-the-art techniques; the performance gains obtained are substantial and significant.
Topic-focused summarization of chat conversations
Arpit Sood,MOHAMED THANVIR P,Vasudeva Varma Kalidindi
European Conference on Information Retrieval, ECIR, 2013
@inproceedings{bib_Topi_2013, AUTHOR = {Arpit Sood, MOHAMED THANVIR P, Vasudeva Varma Kalidindi}, TITLE = {Topic-focused summarization of chat conversations}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2013}}
In this paper, we propose a novel approach to the problem of chat summarization. We summarize real-time chat conversations involving multiple users and frequent shifts in topic. Our approach consists of two phases. In the first phase, we leverage topic modeling using web documents to find the primary topic of discussion in the chat. Then, in the summary generation phase, we build a semantic word space to score sentences based on their association with the primary topic. Experimental results show that our method significantly outperforms the baseline systems on ROUGE F-scores.
Timespent based models for predicting user retention
DAVE KUSHAL SHAILESH,Vishal Vaingankar,Sumanth Kolar,Vasudeva Varma Kalidindi
International Conference on World wide web, WWW, 2013
@inproceedings{bib_Time_2013, AUTHOR = {DAVE KUSHAL SHAILESH, Vishal Vaingankar, Sumanth Kolar, Vasudeva Varma Kalidindi}, TITLE = {Timespent based models for predicting user retention}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2013}}
Content discovery is fast becoming the preferred tool for user engagement on the web. Discovery allows users to get educated and entertained about their topics of interest. StumbleUpon is the largest personalized content discovery engine on the Web, delivering more than 1 billion personalized recommendations per month. As a recommendation system, one of the primary metrics we track is whether the user returns (retention) to use the product after their initial experience (session) with StumbleUpon.
Sielers: Feature analysis and polarity classification of expressions from Twitter and SMS data
HARSHIT JAIN,M ADITYA,Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2013
@inproceedings{bib_Siel_2013, AUTHOR = {HARSHIT JAIN, M ADITYA, Vasudeva Varma Kalidindi}, TITLE = {Sielers: Feature analysis and polarity classification of expressions from Twitter and SMS data}, BOOKTITLE = {International Workshop on Semantic Evaluation}. YEAR = {2013}}
In this paper, we describe our system for the SemEval-2013 Task 2, Sentiment Analysis in Twitter. We formed features that take into account the context of the expression and take a supervised approach towards subjectivity and polarity classification. Experiments were performed on the features to find out whether they were more suited for subjectivity or polarity classification. We tested our model for sentiment polarity classification on Twitter as well as SMS chat expressions, analyzed their F-measure scores and drew some interesting conclusions from them.
Generation of bilingual dictionaries using structural properties
AJAY DUBEY,Vasudeva Varma Kalidindi
Computacion y Sistemas, CyS, 2013
@inproceedings{bib_Gene_2013, AUTHOR = {AJAY DUBEY, Vasudeva Varma Kalidindi}, TITLE = {Generation of bilingual dictionaries using structural properties}, BOOKTITLE = {Computacion y Sistemas}. YEAR = {2013}}
Building bilingual dictionaries from Wikipedia has been extensively studied in computational linguistics. These dictionaries play a crucial role in Natural Language Processing (NLP) applications like Cross-Lingual Information Retrieval, Machine Translation and Named Entity Recognition. To build these dictionaries, most existing approaches use information present in Wikipedia titles, info-boxes and categories; interestingly, not many use the structural properties of a document like sections, subsections, etc. In this work we exploit the structural properties of documents to build a bilingual English-Hindi dictionary. The main intuition is that documents in different languages discussing the same topic are likely to have similar structural elements. Though we present our experiments only for Hindi, our approach is language independent and can easily be extended to other languages. The major contribution of our work is that the dictionary contains translations and transliterations of words, including Named Entities to a large extent. We evaluate our dictionary using manually computed precision: our approach generated a list of 72k tokens with a precision of 0.75.
MECCA: mobile, efficient cloud computing workload adoption framework using scheduler customization and workload migration decisions
Dharmesh Kakadia,Prasad Saripalli,Vasudeva Varma Kalidindi
International Symposium on Mobile Ad Hoc Networking and Computing, MOBIHOC, 2013
@inproceedings{bib_MECC_2013, AUTHOR = {Dharmesh Kakadia, Prasad Saripalli, Vasudeva Varma Kalidindi}, TITLE = {MECCA: mobile, efficient cloud computing workload adoption framework using scheduler customization and workload migration decisions}, BOOKTITLE = {International Symposium on Mobile Ad Hoc Networking and Computing}. YEAR = {2013}}
The availability of increasingly richer applications is providing a surprisingly wide range of functionality and new use cases on mobile devices. Even though mobile devices are becoming more powerful, the resource demands of richer applications can overwhelm them. At the same time, the ubiquitous connectivity of mobile devices opens up the possibility of leveraging cloud resources. A seamless and flexible path to mobile cloud computing requires recognizing opportunities where executing an application on the cloud is preferable to executing it on the mobile device. In this paper we propose a cloud-aware scheduler for offloading applications from mobile devices to clouds. We use a learning-based algorithm to predict the attainable gain using performance monitoring and high-level features. We evaluated a prototype of our system on various workloads and under various conditions.
Pairwise Tensor Factorization for learning new facts in Knowledge Bases
TANMOY KR. MUKHERJEE,PANDE VINAY NARAYAN,Vasudeva Varma Kalidindi
Workshop on Mining and Learning with Graphs, MLG, 2013
@inproceedings{bib_Pair_2013, AUTHOR = {TANMOY KR. MUKHERJEE, PANDE VINAY NARAYAN, Vasudeva Varma Kalidindi}, TITLE = {Pairwise Tensor Factorization for learning new facts in Knowledge Bases}, BOOKTITLE = {Workshop on Mining and Learning with Graphs}. YEAR = {2013}}
Knowledge bases offer the benefit of organizing knowledge in relational form, but suffer from incompleteness with respect to new entities and relationships. Prior work on relation extraction has focused on supervised learning techniques, which are quite expensive. An alternative approach based on distant supervision, where database records are aligned with sentences expressing those records, has attracted significant interest. A new line of work on embeddings of symbolic representations [2] has also shown promise. We introduce a matrix tri-factorization model which can find missing information in knowledge bases. Experiments show that we are able to query and find missing information from text, with improvements over existing methods.
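A toy numpy sketch of the general matrix tri-factorization idea, X ≈ A S B^T, fit by gradient descent on the squared error over observed cells; the reconstruction's unobserved entries then serve as scores for missing facts. Dimensions, learning rate and the masking scheme are illustrative assumptions, not the paper's exact model:

```python
import numpy as np

def trifactorize(X, mask, k=5, lr=0.01, steps=2000, seed=0):
    """Fit X ~= A @ S @ B.T on observed cells (mask == 1) by gradient descent."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    A = rng.normal(0, 0.1, (n, k))
    S = rng.normal(0, 0.1, (k, k))
    B = rng.normal(0, 0.1, (m, k))
    for _ in range(steps):
        E = mask * (A @ S @ B.T - X)     # error restricted to observed cells
        A, B, S = (A - lr * (E @ B @ S.T),
                   B - lr * (E.T @ A @ S),
                   S - lr * (A.T @ E @ B))
    return A @ S @ B.T                   # unobserved entries score missing facts
```

The three gradients follow directly from differentiating the masked squared error with respect to A, B and S.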
Extracting semantic knowledge from wikipedia category names
PRIYA RADHAKRISHNAN,Vasudeva Varma Kalidindi
workshop on Automated knowledge base construction, AKBC, 2013
@inproceedings{bib_Extr_2013, AUTHOR = {PRIYA RADHAKRISHNAN, Vasudeva Varma Kalidindi}, TITLE = {Extracting semantic knowledge from wikipedia category names}, BOOKTITLE = {workshop on Automated knowledge base construction}. YEAR = {2013}}
Wikipedia, being a large, freely available, frequently updated and community-maintained knowledge base, has been central to much recent research. However, quite often the information extracted from it contains extraneous content. This paper proposes a method to extract useful information from Wikipedia using semantic features derived from Wikipedia categories. The proposed method performs well as a Wikipedia-category-based method. Experimental results on benchmark datasets show that it achieves a correlation coefficient of 0.66 with human judgments. The semantic features derived by this method also correlated well with human rankings in a web search query completion application.
Network-aware virtual machine consolidation for large data centers
Dharmesh Kakadia,Nandish Kopri,Vasudeva Varma Kalidindi
International Workshop on Network-Aware Data Management, NDM, 2013
@inproceedings{bib_Netw_2013, AUTHOR = {Dharmesh Kakadia, Nandish Kopri, Vasudeva Varma Kalidindi}, TITLE = {Network-aware virtual machine consolidation for large data centers}, BOOKTITLE = {International Workshop on Network-Aware Data Management}. YEAR = {2013}}
Resource management in modern data centers has become a challenging task due to their tremendous growth. In large virtual data centers, the performance of applications is highly dependent on the communication bandwidth available among virtual machines. Traditional algorithms either do not consider the network I/O behavior of applications or are computationally intensive. We address the problem of identifying virtual machine clusters based on network traffic and placing them intelligently in order to improve application performance and optimize network usage in large data centers. We propose a greedy consolidation algorithm that keeps the number of migrations small and the placement decisions fast, which makes it practical for large data centers. We evaluated our approach in simulation on real-world traces from private and academic data centers, comparing against existing algorithms on parameters like scheduling time, performance improvement and number of migrations. We observed ~70% savings of interconnect bandwidth and an overall ~60% improvement in application performance, achieved with a fraction of the scheduling time and migrations of the baselines.
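A minimal sketch of the greedy flavor of such consolidation: visit VM pairs in decreasing order of mutual traffic and co-locate each pair when capacity allows, so migrations stay few and decisions fast. Unit VM sizes and the data structures are illustrative assumptions, not the paper's algorithm:

```python
def consolidate(traffic, placement, load, capacity):
    """traffic: {(vm_a, vm_b): bytes}; placement: {vm: host}.
    Unit-sized VMs for simplicity; returns the migrations performed."""
    migrations = []
    for (a, b), _ in sorted(traffic.items(), key=lambda kv: -kv[1]):
        if placement[a] == placement[b]:
            continue                                  # already co-located
        host = placement[a]
        if load[host] + 1 <= capacity[host]:          # move b next to a
            load[placement[b]] -= 1
            placement[b] = host
            load[host] += 1
            migrations.append((b, host))
    return migrations
```

Handling the heaviest-talking pairs first captures most of the interconnect savings with only a handful of moves.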
Domain Specific Facts Extraction Using Weakly Supervised Active Learning Approach
PANDE VINAY NARAYAN,TANMOY KR. MUKHERJEE,Vasudeva Varma Kalidindi
International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, WI, 2013
@inproceedings{bib_Doma_2013, AUTHOR = {PANDE VINAY NARAYAN, TANMOY KR. MUKHERJEE, Vasudeva Varma Kalidindi}, TITLE = {Domain Specific Facts Extraction Using Weakly Supervised Active Learning Approach}, BOOKTITLE = {International Joint Conferences on Web Intelligence and Intelligent Agent Technologies}. YEAR = {2013}}
An ontology is defined using concepts and the relationships between them; in this paper, we focus on the second component: relation extraction from plain text. Generic knowledge bases like YAGO, Freebase, and DBpedia have made huge collections of facts and their properties from various domains accessible. But acquiring and maintaining facts and their relations from a domain-specific corpus is an important and challenging task due to the low availability of annotated data. We propose a label propagation based semi-supervised approach to relation extraction that chooses the most informative instances for annotation. We also propose a weakly supervised approach to data annotation using generic ontologies like Freebase, which further reduces the cost of manual annotation. We verify the efficiency of our approach through experiments on several domain-specific corpora.
Network virtualization platform for hybrid cloud
Dharmesh Kakadia,Vasudeva Varma Kalidindi
IEEE International Conference on Cloud Computing Technology and Science, CLOUDCOM, 2013
@inproceedings{bib_Netw_2013, AUTHOR = {Dharmesh Kakadia, Vasudeva Varma Kalidindi}, TITLE = {Network virtualization platform for hybrid cloud}, BOOKTITLE = {IEEE International Conference on Cloud Computing Technology and Science}. YEAR = {2013}}
Cloud computing has enabled elastic and transparent access to infrastructure services without IT operating overhead, and virtualization has been a key enabler for it. While resource virtualization and service abstraction have been widely investigated, networking in the cloud remains a difficult puzzle. Even though the network has a significant role in facilitating hybrid cloud scenarios, it had not received much attention in the research community until recently. We propose Network as a Service (NaaS), which forms the basis for unifying public and private clouds. In this paper, we identify various challenges in the adoption of hybrid clouds, and discuss the design and implementation of a cloud platform that provides NaaS alongside other compute services by exploiting the functionality of software-defined networks. We also provide a preliminary evaluation of our platform.
Exploring the Role of Logically Related Non-Question Phrases for Answering Why-Questions
NIRAJ KUMAR,Rashmi Gangadharia,Srinathan Kannan,Vasudeva Varma Kalidindi
Technical Report, arXiv, 2013
@inproceedings{bib_Expl_2013, AUTHOR = {NIRAJ KUMAR, Rashmi Gangadharia, Srinathan Kannan, Vasudeva Varma Kalidindi}, TITLE = {Exploring the Role of Logically Related Non-Question Phrases for Answering Why-Questions}, BOOKTITLE = {Technical Report}. YEAR = {2013}}
In this paper, we show that certain phrases, although not present in a given question/query, play a very important role in answering the question. Exploring the role of such phrases not only reduces the dependency on matching question phrases (phrases which co-occur in the given question and candidate answers) for extracting answers, but also improves the quality of the extracted answers. To achieve this goal, we introduce a bigram-based word graph model populated with the semantic and topical relatedness of terms in the given document. Next, we apply an improved version of ranking with a prior, which ranks all words in the candidate document with respect to a set of root words (i.e., non-stopwords present in both the question and the candidate document). As a result, terms logically related to the root words are scored higher than unrelated terms. Experimental results show that our system performs better than the state-of-the-art for the task of answering why-questions.
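One way to realize "ranking with a prior" on a bigram word graph is personalized PageRank with the question's root words as the prior distribution. A minimal sketch with networkx; the stopword list, damping factor and toy sentences are assumptions, not the paper's configuration:

```python
import networkx as nx

STOP = {"why", "is", "the", "a", "of", "to"}

def rank_terms(doc_tokens, question_tokens, alpha=0.85):
    G = nx.DiGraph()
    for w1, w2 in zip(doc_tokens, doc_tokens[1:]):   # bigram edges
        G.add_edge(w1, w2)
    roots = {w for w in question_tokens if w not in STOP and w in G}
    prior = {n: 1.0 if n in roots else 0.0 for n in G} if roots else None
    return nx.pagerank(G, alpha=alpha, personalization=prior)

scores = rank_terms("the sky is blue because air scatters blue light".split(),
                    "why is the sky blue".split())
```

Mass repeatedly teleports back to the root words, so terms reachable from them ("scatters", "light") outrank terms that are merely frequent.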
A knowledge induced graph-theoretical model for extract and abstract single document summarization
NIRAJ KUMAR,Srinathan Kannan,Vasudeva Varma Kalidindi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2013
@inproceedings{bib_A_kn_2013, AUTHOR = {NIRAJ KUMAR, Srinathan Kannan, Vasudeva Varma Kalidindi}, TITLE = {A knowledge induced graph-theoretical model for extract and abstract single document summarization}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2013}}
Summarization provides the major topics or themes of a document in a limited number of words. An extract summary is composed of extracted sentences, while in an abstract summary each summary sentence may contain concise information from multiple sentences. The major factors affecting summary quality are: (1) the handling of noisy or less important terms in the document, (2) the use of the information content of terms (as each term may have a different level of importance in the document), and (3) the way appropriate thematic facts are identified to form the summary. To reduce the effect of noisy terms and to utilize the information content of terms, we introduce a graph-theoretical model populated with the semantic and statistical importance of terms. Next, we introduce the concept of a weighted minimum vertex cover, which helps identify the most representative and thematic facts in the document. Additionally, to generate abstract summaries, we introduce a vertex-constrained shortest path technique which uses the minimum vertex cover information as a valuable resource. Experimental results on the DUC-2001 and DUC-2002 datasets show that our system performs better than baseline systems.
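A sketch of the weighted minimum vertex cover step using the classic greedy heuristic: repeatedly pick the node with the smallest weight-to-degree ratio over the still-uncovered edges. The term graph and weights below are illustrative, and the paper's exact formulation may differ:

```python
def greedy_weighted_vertex_cover(edges, weight):
    """Greedily build a low-weight set of nodes touching every edge."""
    uncovered, cover = set(edges), set()
    while uncovered:
        degree = {}
        for u, v in uncovered:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        best = min(degree, key=lambda n: weight[n] / degree[n])
        cover.add(best)
        uncovered = {e for e in uncovered if best not in e}
    return cover

print(greedy_weighted_vertex_cover(
    {("summary", "topic"), ("topic", "term"), ("term", "noise")},
    {"summary": 1.0, "topic": 0.5, "term": 0.7, "noise": 2.0}))
# {'topic', 'term'}: two cheap, well-connected terms cover every relation
```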
Online Debate Summarization using Topic Directed Sentiment Analysis
RANADE SARVESH AJIT,JAYANT GUPTA,Vasudeva Varma Kalidindi,Radhika Mamidi
International Workshop on Issues of Sentiment Discovery and Opinion Mining, WISDOM, 2013
@inproceedings{bib_Onli_2013, AUTHOR = {RANADE SARVESH AJIT, JAYANT GUPTA, Vasudeva Varma Kalidindi, Radhika Mamidi}, TITLE = {Online Debate Summarization using Topic Directed Sentiment Analysis}, BOOKTITLE = {International Workshop on Issues of Sentiment Discovery and Opinion Mining}. YEAR = {2013}}
Social networking sites provide users a virtual community interaction platform to share their thoughts, life experiences and opinions. Online debate forums are one such platform, where people can take a stance and argue in support of or opposition to debate topics. An important feature of such forums is that they are dynamic and grow rapidly. In such situations, effective opinion summarization approaches are needed so that readers need not go through the entire debate. This paper aims to summarize online debates by extracting highly topic-relevant and sentiment-rich sentences. The proposed approach takes into account topic-relevant, document-relevant and sentiment-based features to capture topic-opinionated sentences. ROUGE scores are used to evaluate our system. Our system significantly outperforms several baseline systems and shows 5.2% (ROUGE-1), 7.3% (ROUGE-2) and 5.5% (ROUGE-L) improvement over the state-of-the-art opinion summarization system. The results verify that topic-directed sentiment features are the most important for generating effective debate summaries.
Dynamic energy efficient data placement and cluster reconfiguration algorithm for mapreduce framework
NITESH MAHESHWARI,RADHESHYAM NANDURI,Vasudeva Varma Kalidindi
Future Generation Computer Systems, FGCS, 2012
@inproceedings{bib_Dyna_2012, AUTHOR = {NITESH MAHESHWARI, RADHESHYAM NANDURI, Vasudeva Varma Kalidindi}, TITLE = {Dynamic energy efficient data placement and cluster reconfiguration algorithm for mapreduce framework}, BOOKTITLE = {Future Generation Computer Systems}. YEAR = {2012}}
With the recent emergence of cloud computing based services on the Internet, MapReduce and distributed file systems like HDFS have emerged as the paradigm of choice for developing large-scale data-intensive applications. Given the scale at which these applications are deployed, minimizing the power consumption of these clusters can significantly cut down operational costs and reduce their carbon footprint, thereby increasing the utility from a provider's point of view. This paper addresses energy conservation for clusters of nodes that run MapReduce jobs. Our algorithm dynamically reconfigures the cluster based on the current workload and turns cluster nodes on or off when the average cluster utilization rises above or falls below administrator-specified thresholds, respectively. We evaluate our algorithm using the GridSim toolkit, and our results show that it achieves an energy reduction of 33% under average workloads and up to 54% under low workloads.
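The threshold rule itself is simple; a minimal sketch, assuming a homogeneous cluster and administrator-specified thresholds (the names and default values are illustrative, not the paper's implementation):

```python
def reconfigure(active, total, avg_util, low=0.3, high=0.8):
    """Return the new number of active nodes given average utilization."""
    if avg_util > high and active < total:
        return active + 1    # power a node on and rebalance data onto it
    if avg_util < low and active > 1:
        return active - 1    # migrate data off one node, then power it down
    return active
```

In practice the data placement step dominates: blocks on a node must be re-replicated elsewhere before it can be switched off, which is why the paper couples reconfiguration with an energy-aware placement policy.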
Hindi subjective lexicon generation using WordNet graph traversal
PIYUSH ARORA,AKSHAT BAKLIWAL,Vasudeva Varma Kalidindi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2012
@inproceedings{bib_Hind_2012, AUTHOR = {PIYUSH ARORA, AKSHAT BAKLIWAL, Vasudeva Varma Kalidindi}, TITLE = {Hindi subjective lexicon generation using WordNet graph traversal}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2012}}
With the adoption of the UTF-8 Unicode standard, web content in the Hindi language is increasing at a rapid pace. There is a great opportunity to mine this content and gain insight into the sentiments and opinions expressed by people and various communities. In this paper, we present a graph-based method to build a subjective lexicon for Hindi, using WordNet as a resource. Our method takes a pre-annotated seed list and expands it into a full lexicon using synonym and antonym relations. We show two different evaluation strategies to validate the Hindi lexicon built. The main contributions of our work are: 1) a subjective lexicon of adjectives built using Hindi WordNet, and 2) an annotated corpus of Hindi reviews.
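A minimal sketch of the graph traversal: breadth-first expansion from the seed list, in which synonym edges preserve polarity and antonym edges flip it. The toy English graph below stands in for Hindi WordNet:

```python
from collections import deque

def expand_lexicon(seeds, synonyms, antonyms):
    """BFS from the seeds: synonym edges keep polarity, antonym edges flip it."""
    polarity = dict(seeds)                    # word -> +1 / -1
    queue = deque(polarity)
    while queue:
        word = queue.popleft()
        edges = [(n, +1) for n in synonyms.get(word, [])]
        edges += [(n, -1) for n in antonyms.get(word, [])]
        for nbr, sign in edges:
            if nbr not in polarity:           # first assignment wins
                polarity[nbr] = sign * polarity[word]
                queue.append(nbr)
    return polarity

print(expand_lexicon({"good": +1},
                     synonyms={"good": ["fine"]},
                     antonyms={"good": ["bad"], "fine": ["poor"]}))
# {'good': 1, 'fine': 1, 'bad': -1, 'poor': -1}
```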
IIIT Hyderabad at TAC 2012.
Vasudeva Varma Kalidindi,BHASKAR JYOTI GHOSH,MOHAN S,DEEPTI AGGARWAL,PRIYA RADHAKRISHNAN
Text Analysis Conference Workshop, TAC, 2012
@inproceedings{bib_IIIT_2012, AUTHOR = {Vasudeva Varma Kalidindi, BHASKAR JYOTI GHOSH, MOHAN S, DEEPTI AGGARWAL, PRIYA RADHAKRISHNAN}, TITLE = {IIIT Hyderabad at TAC 2012.}, BOOKTITLE = {Text Analysis Conference Workshop}. YEAR = {2012}}
In this paper, we report our participation in Knowledge Base Population at TAC 2012. We adopted an Information Retrieval based approach for the Entity Linking and Slot Filling tasks. In Entity Linking, we identify potential nodes from the Knowledge Base and then identify the mapping node using tf-idf similarity; we achieved very good performance in this task. For the Slot Filling task, we identify documents from the collection that might contain attribute information and extract the attributes using a rule-based approach. Our rule-based approach hasn't performed up to the mark.
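A minimal sketch of the tf-idf linking step, scoring candidate knowledge-base nodes against the query context with cosine similarity (scikit-learn assumed; the mention and candidate texts are toy examples, not TAC data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def link(mention_context, candidates):
    """candidates: {kb_node_id: node text}. Returns the best-matching node."""
    ids, texts = zip(*candidates.items())
    vec = TfidfVectorizer().fit(texts + (mention_context,))
    sims = cosine_similarity(vec.transform([mention_context]),
                             vec.transform(texts))[0]
    return ids[sims.argmax()]

print(link("jaguar speed big cat habitat",
           {"Jaguar_(animal)": "jaguar big cat habitat americas predator",
            "Jaguar_Cars": "jaguar british luxury car manufacturer"}))
```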
Web Image Annotation Using an Effective Term Weighting
VUNDAVALLI SRINIVASA RAO,Vasudeva Varma Kalidindi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2012
@inproceedings{bib_Web__2012, AUTHOR = {VUNDAVALLI SRINIVASA RAO, Vasudeva Varma Kalidindi}, TITLE = {Web Image Annotation Using an Effective Term Weighting}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2012}}
The number of images on the World Wide Web has been increasing tremendously, and providing search services for them has been an active research area. Web images are often surrounded by different associated texts such as ALT text, surrounding text, image filename, and HTML page title. Many popular internet search engines make use of these associated texts while indexing images and give higher importance to the terms present in ALT text. However, a recent study has shown that around half of the images on the web have no ALT text, so predicting the ALT text of an image in a web page would be of great use in web image retrieval. We propose an approach to ALT text prediction that builds on the term co-occurrence approach proposed in the literature. Our results show that our approach and the simple term co-occurrence approach produce almost the same results; we analyze both methods and describe the situations in which each is appropriate. We also build an image annotation system on top of our proposed approach and compare it with one built on the term co-occurrence approach. Preliminary experiments on a set of 1000 images show that our proposed approach performs well over the simple term co-occurrence approach for web image annotation.
Hindi subjective lexicon: A lexical resource for hindi polarity classification
AKSHAT BAKLIWAL,PIYUSH ARORA,Vasudeva Varma Kalidindi
International Conference on Language Resources and Evaluation, LREC, 2012
@inproceedings{bib_Hind_2012, AUTHOR = {AKSHAT BAKLIWAL, PIYUSH ARORA, Vasudeva Varma Kalidindi}, TITLE = {Hindi subjective lexicon: A lexical resource for hindi polarity classification}, BOOKTITLE = {International Conference on Language Resources and Evaluation}. YEAR = {2012}}
With recent developments in web technologies, the percentage of web content in Hindi is growing at lightning speed. This information can prove very useful for researchers, governments and organizations in learning what is on the public's mind and making sound decisions. In this paper, we present a graph-based WordNet expansion method to generate a full (adjective and adverb) subjective lexicon. We used synonym and antonym relations to expand the initial seed lexicon, and we show three different evaluation strategies to validate the lexicon. We achieve 70.4% agreement with human annotators and ~79% accuracy on product review classification.
Hindi web page collection tagged with tourism health and miscellaneous
P NIKHIL PRIYATAM,V. SRIKANTH REDDY,Vasudeva Varma Kalidindi
International Conference on Language Resources and Evaluation, LREC, 2012
@inproceedings{bib_Hind_2012, AUTHOR = {P NIKHIL PRIYATAM, V. SRIKANTH REDDY, Vasudeva Varma Kalidindi}, TITLE = {Hindi web page collection tagged with tourism health and miscellaneous}, BOOKTITLE = {International Conference on Language Resources and Evaluation}. YEAR = {2012}}
Web page classification has a wide range of applications in Information Retrieval and is a crucial part of building domain-specific search engines. Be it 'Google Scholar' for scholarly articles or 'Google News' for news articles, searching within a specific domain is common practice. Sandhan is one such project, offering domain-specific search for the Tourism and Health domains across 10 different Indian languages. Much of the accuracy of a web page classification algorithm depends on the data it is trained on. The motivation behind this paper is to provide a proper set of guidelines for collecting and storing this data in an efficient and error-free way. The major contribution of this paper is a Hindi web page collection manually classified into Tourism, Health and Miscellaneous.
Identifying Microblogs for Targeted Contextual Advertising.
KUSHAL SHAILESH DAVE,Vasudeva Varma Kalidindi
International Conference on Web and Social Media, ICWSM, 2012
@inproceedings{bib_Iden_2012, AUTHOR = {KUSHAL SHAILESH DAVE, Vasudeva Varma Kalidindi}, TITLE = {Identifying Microblogs for Targeted Contextual Advertising.}, BOOKTITLE = {International Conference on Web and Social Media}. YEAR = {2012}}
Micro-blogging sites such as Facebook, Twitter and Google+ present a nice opportunity for targeting advertisements that are contextually related to the microblog content. However, the sparse and noisy text makes identifying the microblogs suitable for advertising a very hard problem. In this work, we approach the problem as a two-step classification: in the first pass, microblogs suitable for advertising are identified; in the second pass, we build a model to find the sentiment of an advertisable microblog. The system uses features derived from part-of-speech tags and the tweet content, along with external resources such as query logs and n-gram dictionaries from previously labeled data. This work aims at providing a thorough insight into the problem and analyzing various features to assess which contribute the most towards identifying tweets that can be targeted for advertisements.
Mining sentiments from tweets
AKSHAT BAKLIWAL,PIYUSH ARORA,M SENTHIL KUMAR,NIKHIL V KAPRE,MUKESH KUMAR SINGH,Vasudeva Varma Kalidindi
Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA, 2012
@inproceedings{bib_Mini_2012, AUTHOR = {AKSHAT BAKLIWAL, PIYUSH ARORA, M SENTHIL KUMAR, NIKHIL V KAPRE, MUKESH KUMAR SINGH, Vasudeva Varma Kalidindi}, TITLE = {Mining sentiments from tweets}, BOOKTITLE = {Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis}. YEAR = {2012}}
Twitter is a microblogging website where users post messages in very short texts called tweets. Tweets contain user opinion and sentiment towards an object or person, and this sentiment information is very useful in various respects for businesses and governments. In this paper, we present a method which performs tweet sentiment identification using a corpus of pre-annotated tweets. We present a sentiment scoring function which uses prior information to classify (binary classification) and weight various sentiment-bearing words/phrases in tweets. Using this scoring function we achieve classification accuracy of 87% on the Stanford dataset and 88% on the Mejaj dataset. Using a supervised machine learning approach, we achieve classification accuracy of 88% on the Stanford dataset.
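A toy version of such a prior-based scoring function: sum lexicon polarities with simple negation flipping, then threshold for a binary label. The lexicon, weights and negation handling are illustrative assumptions, not the paper's function:

```python
PRIOR = {"love": 2, "great": 1, "hate": -2, "slow": -1}
NEGATIONS = {"not", "no", "never"}

def tweet_polarity(tokens):
    score, negate = 0.0, False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True                  # flip the next sentiment word
            continue
        if tok in PRIOR:
            score += -PRIOR[tok] if negate else PRIOR[tok]
            negate = False
    return "positive" if score >= 0 else "negative"

print(tweet_polarity("i do not love slow apps".split()))   # negative
```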
Language Independent Named Entity Identification using Wikipedia
MAHATHI BHAGAVATULA,SANTOSH SHANTARAM GAIKVVAD,Vasudeva Varma Kalidindi
Workshop on Multilingual Modeling, MLM-W, 2012
@inproceedings{bib_Lang_2012, AUTHOR = {MAHATHI BHAGAVATULA, SANTOSH SHANTARAM GAIKVVAD, Vasudeva Varma Kalidindi}, TITLE = {Language Independent Named Entity Identification using Wikipedia}, BOOKTITLE = {Workshop on Multilingual Modeling}. YEAR = {2012}}
Recognition of Named Entities (NEs) is difficult in Indian languages like Hindi and Telugu, where sufficient gazetteers and annotated corpora are not available, in contrast with English. This paper details a novel clustering and co-occurrence based approach to map English NEs to their equivalent representations in different languages, recognized in a language-independent way. We substitute the required language-specific resources with the richly structured multilingual content of Wikipedia. The approach clusters highly similar Wikipedia articles and then maps the NEs in an English article to terms in interlinked articles in other languages based on co-occurrence frequencies. The cluster information and the term co-occurrences are used to extract the NEs from non-English languages; hence, the English Wikipedia is used to bootstrap the NEs for other languages. Through this approach, we make extensive use of the structured, semi-structured and multilingual content of Wikipedia. Experimental results suggest that the proposed approach yields promising precision and recall.
Retrieval approach to extract opinions about people from resource scarce language news articles
M. ADITYA,Vasudeva Varma Kalidindi
International Workshop on Issues of Sentiment Discovery and Opinion Mining, WISDOM, 2012
@inproceedings{bib_Retr_2012, AUTHOR = {M. ADITYA, Vasudeva Varma Kalidindi}, TITLE = {Retrieval approach to extract opinions about people from resource scarce language news articles}, BOOKTITLE = {International Workshop on Issues of Sentiment Discovery and Opinion Mining}. YEAR = {2012}}
We address the challenging task of mining opinions about organizations, people and places in different languages, where resources and tools for mining opinions are scarce. In our study, we leverage a comparable news article collection to retrieve opinions about people (opinion targets) in a resource-scarce language like Hindi. Opinion words (adjectives and verbs expressing opinions about Named Entity opinion targets) are extracted from the English side of the comparable corpora and are transliterated and translated into the resource-scarce language. The transformed opinion words are then used to create a subjective language model (SLM) and structured opinion queries (OQs) using an inference network (IN) for retrieval, to confirm the opinion about opinion targets in documents. Experiments show that OQs and the SLM within the IN framework are effective for opinion mining in minimal-resource languages when compared to other retrieval approaches.
Exploiting Wikipedia in Identifying Named Entities: A Language-Independent Approach
MAHATHI BHAGAVATULA,SANTOSH SHANTARAM GAIKWAD,Vasudeva Varma Kalidindi
International Conference on Information and Knowledge Management, CIKM, 2012
@inproceedings{bib_Expl_2012, AUTHOR = {MAHATHI BHAGAVATULA, SANTOSH SHANTARAM GAIKWAD, Vasudeva Varma Kalidindi}, TITLE = {Exploiting Wikipedia in Identifying Named Entities: A Language-Independent Approach}, BOOKTITLE = {International Conference on Information and Knowledge Management}. YEAR = {2012}}
This paper details an approach to identify Named Entities (NEs) from a large non-English corpus and associate them with appropriate tags, requiring minimal human intervention and no linguistic expertise. The main focus is on Indian languages like Telugu, Hindi, Tamil and Marathi, which are considered resource-poor compared to English. The inherent structure of Wikipedia is exploited to develop an efficient co-occurrence frequency based NE identification algorithm for Indian languages. We describe how English Wikipedia data can be used to bootstrap the identification of NEs in other languages. On a dataset of 2,622 Marathi Wikipedia articles, with around 10,000 NEs manually tagged, our system achieved an F-measure of 81.25% without relying on language expertise. Similarly, an F-measure of 80.42% was achieved on around 12,000 NEs tagged within 2,935 Hindi Wikipedia articles.
Energy efficient data center networks - A SDN based approach
Dharmesh Kakadia,Vasudeva Varma Kalidindi
IBM Collaborative Academia Research Exchange, I-CARE, 2012
@inproceedings{bib_Ener_2012, AUTHOR = {Dharmesh Kakadia, Vasudeva Varma Kalidindi}, TITLE = {Energy efficient data center networks - A SDN based approach}, BOOKTITLE = {IBM Collaborative Academia Research Exchange}. YEAR = {2012}}
Energy consumption in data centers is a key factor in the sustainable growth of paradigms like cloud computing. As servers become more and more energy efficient, concerns around network power consumption are increasing. We propose an approach that dynamically decides the number of network devices required to support the current load. In this paper, we evaluate our approach and report findings in terms of energy savings and network performance.
A hybrid approach to live migration of virtual machines
SHASHANK SAHNI,Vasudeva Varma Kalidindi
IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), CCEM, 2012
@inproceedings{bib_A_hy_2012, AUTHOR = {SHASHANK SAHNI, Vasudeva Varma Kalidindi}, TITLE = {A hybrid approach to live migration of virtual machines}, BOOKTITLE = {IEEE International Conference on Cloud Computing in Emerging Markets (CCEM)}. YEAR = {2012}}
We present, discuss and evaluate a hybrid approach to live migration of a virtual machine across hosts in a Gigabit LAN. Our hybrid approach takes the best of the two traditional live migration methods, pre-copy and post-copy. In pre-copy, the CPU state and memory are transferred before spawning the VM on the destination host, whereas post-copy is exactly the opposite and spawns the VM on the destination right after transferring the processor state. In our approach, in addition to the processor state, we bundle useful state information: device state and the frequently accessed pages of the VM, i.e., the working set. This drastically reduces the number of page faults over the network while memory is actively transferred. Additionally, on every page fault over the network, we transfer a set of pages in its locality along with the page itself. We propose a prototype design on KVM/QEMU and present a comparative analysis of pre-copy, post-copy and our hybrid approach.
Twitter user behavior understanding with mood transition prediction
M. ADITYA,Vasudeva Varma Kalidindi
Workshop on Data-driven User Behavioral Modelling and Mining from Social Media, DUBMMSM, 2012
@inproceedings{bib_Twit_2012, AUTHOR = {M. ADITYA, Vasudeva Varma Kalidindi}, TITLE = {Twitter user behavior understanding with mood transition prediction}, BOOKTITLE = {Workshop on Data-driven User Behavioral Modelling and Mining from Social Media}. YEAR = {2012}}
Human moods change continuously over time. Tracking moods can provide important information about the psychological and health behavior of an individual, and mood history can be used to predict an individual's future moods. In this paper, we predict the mood transition of a Twitter user by regression analysis on the tweets posted on their timeline. User tweets are first automatically labeled with mood labels from time 0 to t-1; these labels are then used to predict the user's mood at time t. Experiments show that SVM regression attains a lower root-mean-square error than other regression approaches for mood transition prediction.
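A hedged sketch of the regression setup, using scikit-learn's SVR to predict the mood value at time t from a sliding window of earlier moods; the numeric mood encoding and window size are assumptions, not the paper's exact features:

```python
import numpy as np
from sklearn.svm import SVR

# Toy mood scores over a timeline; in practice these come from auto-labeled tweets.
moods = np.array([0.2, 0.5, 0.4, 0.7, 0.6, 0.8, 0.9, 0.7])
k = 3  # assumed window size: predict mood at t from moods at t-k..t-1

X = np.array([moods[i:i + k] for i in range(len(moods) - k)])
y = moods[k:]

model = SVR(kernel="rbf").fit(X, y)
print(model.predict(moods[-k:].reshape(1, -1)))  # predicted mood at time t
```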
Language independent sentence-level subjectivity analysis with feature selection
M. ADITYA,Vasudeva Varma Kalidindi
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2012
@inproceedings{bib_Lang_2012, AUTHOR = {M. ADITYA, Vasudeva Varma Kalidindi}, TITLE = {Language independent sentence-level subjectivity analysis with feature selection}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}. YEAR = {2012}}
Identifying and extracting subjective information from news, blogs and other user-generated content has many applications. Most earlier work concentrated on English data, but subjectivity research at the sentence level in other languages has recently increased. In this paper, we perform sentence-level subjectivity classification using language-independent feature weighting and selection methods that are consistent across languages. Experiments performed on 5 languages, including English and the South Asian language Hindi, show that the entropy-based category coverage difference (ECCD) feature selection method, combined with language-independent feature weighting, outperforms other approaches for subjectivity classification.
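The selection criterion can be approximated by a score of the following shape: terms whose occurrences concentrate in one category (large coverage gap, low entropy) score high. This is a generic reconstruction for illustration, not the paper's exact ECCD formula:

```python
import math

def eccd_like(term_docs_per_class, docs_per_class):
    """Score a term; term_docs_per_class[c] = #docs of class c containing it.

    Generic reconstruction: coverage gap weighted by (1 - normalized entropy).
    """
    coverage = {c: term_docs_per_class.get(c, 0) / docs_per_class[c]
                for c in docs_per_class}
    total = sum(term_docs_per_class.values()) or 1
    probs = [n / total for n in term_docs_per_class.values() if n]
    entropy = -sum(p * math.log2(p) for p in probs)
    max_entropy = math.log2(len(docs_per_class)) or 1
    gap = max(coverage.values()) - min(coverage.values())
    return gap * (1 - entropy / max_entropy)

# A term occurring in 40% of subjective docs but 2% of objective docs scores high.
print(eccd_like({"subj": 40, "obj": 2}, {"subj": 100, "obj": 100}))
```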
Named entity recognition an aid to improve multilingual entity filling in language-independent approach
MAHATHI BHAGAVATULA,SANTOSH SHANTARAM GAIKWAD,Vasudeva Varma Kalidindi
Conference on Information & Knowledge Management Workshops, CIKM-W, 2012
@inproceedings{bib_Name_2012, AUTHOR = {MAHATHI BHAGAVATULA, SANTOSH SHANTARAM GAIKWAD, Vasudeva Varma Kalidindi}, TITLE = {Named entity recognition an aid to improve multilingual entity filling in language-independent approach}, BOOKTITLE = {Conference on Information & Knowledge Management Workshops}. YEAR = {2012}}
This paper details an approach to identify Named Entities (NEs) from a large non-English corpus and associate them with appropriate tags, requiring minimal human intervention and no linguistic expertise. The main focus is on Indian languages like Telugu, Hindi, Tamil and Marathi, which are considered resource-poor compared to English. The inherent structure of Wikipedia is exploited to develop an efficient co-occurrence frequency based NE identification algorithm for Indian languages, and we describe how English Wikipedia data can be used to bootstrap the identification of NEs in other languages, generating a list of NEs. The paper then focuses on utilizing this NE list to improve multilingual entity filling, which shows promising results. On a dataset of 2,622 Marathi Wikipedia articles, with around 10,000 NEs manually tagged, our system achieved an F-measure of 81.25% without relying on language expertise. Similarly, an F-measure of 80.42% was achieved on around 12,000 NEs tagged within 2,935 Hindi Wikipedia articles.
Domain specific search in Indian languages
P NIKHIL PRIYATAM,VADDEPALLY SRIKANTH,Vasudeva Varma Kalidindi
Conference on Information & Knowledge Management Workshops, CIKM-W, 2012
@inproceedings{bib_Doma_2012, AUTHOR = {P NIKHIL PRIYATAM, VADDEPALLY SRIKANTH, Vasudeva Varma Kalidindi}, TITLE = {Domain specific search in Indian languages}, BOOKTITLE = {Conference on Information & Knowledge Management Workshops}. YEAR = {2012}}
Focused crawling has a wide range of applications in information retrieval. It is a crucial part of building domain-specific search engines and personalized search tools, and of extending digital libraries. Be it Google Scholar for scholarly articles or Google News for news articles, domain-specific search is the most widely acclaimed application of focused crawling. Unfortunately, there are very few domain-specific search engines available for Indian languages.
Entity centric opinion mining from blogs
AKSHAT BAKLIWAL,PIYUSH ARORA,Vasudeva Varma Kalidindi
Workshop on Sentiment Analysis where AI meets Psychology., SAAIP, 2012
@inproceedings{bib_Enti_2012, AUTHOR = {AKSHAT BAKLIWAL, PIYUSH ARORA, Vasudeva Varma Kalidindi}, TITLE = {Entity centric opinion mining from blogs}, BOOKTITLE = {Workshop on Sentiment Analysis where AI meets Psychology.}. YEAR = {2012}}
With the growth of Web 2.0, people are using it as a medium to express their opinions and thoughts. With the explosion of blogs and journal-like user-generated content on the web, companies, celebrities and politicians are concerned about mining and analyzing the discussions about them or their products. In this paper, we present a method to perform opinion mining and summarize opinions at the entity level for English blogs. We first identify the various objects (named entities) the blogger talks about, then identify the modifiers that set the orientation towards these objects. Finally, we generate an object-centric opinionated summary from the blogs. We perform experiments on named entity identification, entity-modifier relationship extraction and modifier orientation estimation. The experiments and results presented in this paper are cross-verified against the judgments of human annotators.
Towards Contextual Advertising on Microblogs
DAVE KUSHAL SHAILESH,Vasudeva Varma Kalidindi
International Conference on Natural Language Processing., ICON, 2012
@inproceedings{bib_Towa_2012, AUTHOR = {DAVE KUSHAL SHAILESH, Vasudeva Varma Kalidindi}, TITLE = {Towards Contextual Advertising on Microblogs}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2012}}
Microblogging platforms such as Facebook, Twitter and Google+ present an opportunity for serving advertisements that are contextually related to the microblog content. The challenge with advertising on such platforms is that there is very little content with which to classify a tweet as suitable for advertising, making this a very hard classification problem. Despite the importance of this area, little formal published research exists. In this work, we address the problem of identifying microblogs that can be targeted for advertisements. We describe a system that learns to classify which microblogs are suitable for targeted advertising. The experiments are conducted on Twitter data. The system uses features derived from the user data and the tweet content, along with external resources such as query logs and term frequencies from previous Twitter data.
Summarizing Online Conversations: A Machine Learning Approach
Arpit Sood,MOHAMED THANVIR P,Vasudeva Varma Kalidindi
International Conference on Computational Linguistics, COLING, 2012
@inproceedings{bib_Summ_2012, AUTHOR = {Arpit Sood, MOHAMED THANVIR P, Vasudeva Varma Kalidindi}, TITLE = {Summarizing Online Conversations: A Machine Learning Approach}, BOOKTITLE = {International Conference on Computational Linguistics}. YEAR = {2012}}
Summarization has emerged as an increasingly useful approach to tackle the problem of information overload. Extracting information from online conversations can be of considerable commercial and educational value, but the majority of this information is present as noisy, unstructured text, making traditional document summarization techniques difficult to apply. In this paper, we propose a novel approach to conversation summarization: an automatic text summarizer that extracts sentences from the conversation to form a summary. Our approach consists of three phases. In the first phase, we prepare the dataset by correcting spellings and segmenting the text. In the second phase, we represent each sentence by a set of predefined features that capture the statistical, linguistic and sentimental aspects of the conversation along with its dialogue structure. Finally, in the third phase, we use a machine learning algorithm to train the summarizer on the set of feature vectors. Experiments performed on conversations from the technical domain show that our system significantly outperforms the baselines on ROUGE F-scores.
Using wikipedia anchor text and weighted clustering coefficient to enhance the traditional multi-document summarization
NIRAJ KUMAR,Srinathan Kannan,Vasudeva Varma Kalidindi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2012
@inproceedings{bib_Usin_2012, AUTHOR = {NIRAJ KUMAR, Srinathan Kannan, Vasudeva Varma Kalidindi}, TITLE = {Using wikipedia anchor text and weighted clustering coefficient to enhance the traditional multi-document summarization}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2012}}
As in the traditional approach, we consider the task of summarization as the selection of top-ranked sentences from ranked sentence clusters. To achieve this goal, we rank the sentence clusters using the importance of words, calculated by running the PageRank algorithm on a reverse-directed word graph of sentences. Next, to rank the sentences within every cluster, we introduce the use of the weighted clustering coefficient, using the PageRank scores of words in its calculation. Finally, an important issue is the presence of many noisy entries in the text, which degrades the performance of most text mining algorithms. To solve this problem, we introduce a Wikipedia anchor-text based phrase mapping scheme. Our experimental results on the DUC-2002 and DUC-2004 datasets show that our system performs better than unsupervised systems and better than or comparable to novel supervised systems in this area.
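The word-ranking step can be sketched with networkx: build a reverse-directed word graph from adjacent words in each sentence and run PageRank over it. The adjacency-only windowing here is an assumption, and the weighted clustering coefficient step is omitted:

```python
import networkx as nx

sentences = [["data", "center", "energy"], ["energy", "efficient", "network"]]

g = nx.DiGraph()
for sent in sentences:
    for w1, w2 in zip(sent, sent[1:]):
        g.add_edge(w2, w1)  # reverse-directed edge, as the abstract suggests

word_scores = nx.pagerank(g)
print(sorted(word_scores, key=word_scores.get, reverse=True)[:3])
```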
Using graph based mapping of co-occurring words and closeness centrality score for summarization evaluation
NIRAJ KUMAR,Srinathan Kannan,Vasudeva Varma Kalidindi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2012
@inproceedings{bib_Usin_2012, AUTHOR = {NIRAJ KUMAR, Srinathan Kannan, Vasudeva Varma Kalidindi}, TITLE = {Using graph based mapping of co-occurring words and closeness centrality score for summarization evaluation}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2012}}
The use of predefined phrase patterns such as N-grams (N>=2), longest common subsequences or predefined linguistic patterns does not give any credit to non-matching or smaller useful patterns, and may thus result in loss of information, while 1-gram based models yield several noisy matches. Additionally, since a summary may contain more than one topic with different levels of importance, we treat summarization evaluation as topic-based evaluation of information content. That is, in the first stage we identify the topics covered in a given model/reference summary and calculate their importance. In the next stage, we calculate the information coverage of the test (machine-generated) summary with respect to every identified topic. We introduce a graph-based mapping scheme and the concept of the closeness centrality measure to calculate the information depth and sense of the co-occurring words in every identified topic. Our experimental results show that the devised system is better than or comparable to the best results on the TAC 2011 AESOP dataset.
uPick: Crowdsourcing Based Approach to Extract Relations Among Named Entities
DEEPTI AGGARWAL,KHOT ROHIT ASHOK,Vasudeva Varma Kalidindi,Venkatesh Choppella
Conference on Human-Computer Interaction, HCI, 2012
@inproceedings{bib_uPic_2012, AUTHOR = {DEEPTI AGGARWAL, KHOT ROHIT ASHOK, Vasudeva Varma Kalidindi, Venkatesh Choppella}, TITLE = {uPick: Crowdsourcing Based Approach to Extract Relations Among Named Entities}, BOOKTITLE = {Conference on Human-Computer Interaction}. YEAR = {2012}}
Despite advances in information extraction, the task of identifying relations among named entities within a text document remains a significant challenge. Existing automated approaches lack human precision and also struggle with erroneous documents. In this paper, we propose a crowdsourcing-based approach to improve the accuracy of the relations generated by existing extraction techniques. Our idea is to gather judgments on the extracted relations of an article from interested users; by contributing, the users in return remember the facts related to a document. This paper presents the complete design of the approach along with a user study conducted with twelve participants. Results show that the users rated the proposed system positively and were willing to contribute their time and energy to the task.
A simple unsupervised query categorizer for web search engines
PRASHANT ULLEGADDI,Vasudeva Varma Kalidindi
International Conference on Natural Language Processing., ICON, 2011
@inproceedings{bib_A_si_2011, AUTHOR = {PRASHANT ULLEGADDI, Vasudeva Varma Kalidindi}, TITLE = {A simple unsupervised query categorizer for web search engines}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2011}}
In web search, understanding the intent behind a user query can help in tasks such as placing ads relevant to the query, routing the query to an appropriate vertical search, and building a user profile for personalization. Such an intent can be represented by the categories of the information a user is looking for, often expressed through a short query. In this paper, we address query categorization, which involves classifying a given query into one or more pre-defined categories. We propose an information retrieval based approach, similar to document retrieval, to solve query categorization: for a given query, we retrieve and rank the categories just as in document retrieval, effectively resulting in query categorization. Unlike previous work, the simplicity of the proposed approach makes it practical in a web search scenario while achieving performance comparable with other systems when evaluated on the KDD Cup 2005 dataset. Further, we report an improvement of 4.2% in precision at position 1 compared with the best results of KDD Cup 2005.
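A minimal sketch of the retrieval view of categorization: represent each category by a text description, index the descriptions with tf-idf, and rank categories by cosine similarity to the query. The category descriptions below are invented placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical category descriptions; a real system would use richer text.
categories = {
    "Sports": "football cricket match score team league",
    "Travel": "flight hotel booking destination tourism",
}

vec = TfidfVectorizer()
cat_matrix = vec.fit_transform(categories.values())

def rank_categories(query: str):
    """Rank categories by cosine similarity between query and descriptions."""
    sims = cosine_similarity(vec.transform([query]), cat_matrix)[0]
    return sorted(zip(categories, sims), key=lambda p: -p[1])

print(rank_categories("cheap flights to goa"))  # Travel ranks first
```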
TAC 2011 MultiLing pilot overview
G. Giannakopoulos,M. El-Haj,B. Favre,M. Litvak,J. Steinberger,Vasudeva Varma Kalidindi
Text Analysis Conference Workshop, TAC, 2011
@inproceedings{bib_TAC__2011, AUTHOR = {G. Giannakopoulos, M. El-Haj, B. Favre, M. Litvak, J. Steinberger, Vasudeva Varma Kalidindi}, TITLE = {TAC 2011 MultiLing pilot overview}, BOOKTITLE = {Text Analysis Conference Workshop}. YEAR = {2011}}
The Text Analysis Conference MultiLing Pilot of 2011 posed a multi-lingual summarization task to the summarization community, aiming to quantify and measure the performance of multi-lingual, multi-document summarization systems. The task was to create a 240–250 word summary from 10 news texts describing a given topic. The texts of each topic were provided in seven languages (Arabic, Czech, English, French, Greek, Hebrew, Hindi), and each participant generated summaries for at least 2 languages. The summaries were evaluated using automatic (AutoSummENG, ROUGE) and manual (Overall Responsiveness score) processes. Eight systems participated, some providing summaries across all languages. This paper provides a brief description of the data collection, the evaluation methodology, the problems and challenges faced, and an overview of participation and the corresponding results.
Ranking multilingual documents using minimal language dependent resources
G SANTOSH SAI KRISHNA,N KIRAN KUMAR,Vasudeva Varma Kalidindi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2011
@inproceedings{bib_Rank_2011, AUTHOR = {G SANTOSH SAI KRISHNA, N KIRAN KUMAR, Vasudeva Varma Kalidindi}, TITLE = {Ranking multilingual documents using minimal language dependent resources}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2011}}
This paper proposes an approach to extracting simple and effective features that enhance multilingual document ranking (MLDR). There is limited prior research on capturing multilingual document similarity for ranking documents, and the available literature relies heavily on language-specific tools, making the approaches hard to re-implement for other languages. Our approach extracts various multilingual and monolingual similarity features using a basic language resource (a bilingual dictionary). No language-specific tools are used, making the approach extensible to other languages. We used the datasets provided by the Forum for Information Retrieval Evaluation (FIRE) for their 2010 ad-hoc cross-lingual document retrieval task on Indian languages. Experiments were performed with different ranking algorithms and their results compared. The results showcase the effectiveness of the considered features in enhancing multilingual document ranking.
Finding Influence by Cross-Lingual Blog Mining through Multiple Language Lists
M. ADITYA,Vasudeva Varma Kalidindi
International Conference on Information systems for Indian Languages, ICISIL, 2011
@inproceedings{bib_Find_2011, AUTHOR = {M. ADITYA, Vasudeva Varma Kalidindi}, TITLE = {Finding Influence by Cross-Lingual Blog Mining through Multiple Language Lists}, BOOKTITLE = {International Conference on Information systems for Indian Languages}. YEAR = {2011}}
Blogs have been one of the important sources of information on the internet, and nowadays a lot of Indian-language content is generated in the form of blogs in which people express their opinions on various situations and events. Blog content may contain named entities: names of people, places and organizations, including eminent personalities who are famous within or outside that language community. The goal of this paper is to find the influence of a personality among cross-language bloggers. Our approach is to collect information from blog pages and index the named entities along with their probabilities of occurrence, removing irrelevant information from the blog. When a user queries in an Indian language to find the influence of a personality, we use a cross-language lexicon in the form of multiple parallel language lists to transliterate the query into other Indian languages, and mine blogs to return the influence of the personality across Indian-language bloggers. An overview of the system and preliminary results are described.
Language-independent context aware query translation using Wikipedia
ROHIT BHARADWAJ G,Vasudeva Varma Kalidindi
Conference of the Association of Computational Linguistics, ACL, 2011
@inproceedings{bib_Lang_2011, AUTHOR = {ROHIT BHARADWAJ G, Vasudeva Varma Kalidindi}, TITLE = {Language-independent context aware query translation using Wikipedia}, BOOKTITLE = {Conference of the Association of Computational Linguistics}. YEAR = {2011}}
Cross-lingual information access (CLIA) systems are required to access the large amounts of multilingual content generated on the world wide web in the form of blogs, news articles and documents. In this paper, we discuss our approach to query formation for CLIA systems, where language resources are replaced by Wikipedia. We claim that Wikipedia, with its rich multilingual content and structure, forms an ideal platform for building a CLIA system. Our approach is particularly useful for under-resourced languages, since not all languages have tools with sufficient accuracy. We propose a context-aware, language-independent query formation method which, with the help of bilingual dictionaries, forms queries in the target language. Results are encouraging, with a precision of 69.75%, and thus endorse our claim that Wikipedia can be used to build CLIA systems.
Sentiment classification: a lexical similarity based approach for extracting subjectivity in documents
SARVABHOTLA KIRAN,Prasad Pingali,Vasudeva Varma Kalidindi
Information Retrieval, IR, 2011
@inproceedings{bib_Sent_2011, AUTHOR = {SARVABHOTLA KIRAN, Prasad Pingali, Vasudeva Varma Kalidindi}, TITLE = {Sentiment classification: a lexical similarity based approach for extracting subjectivity in documents}, BOOKTITLE = {Information Retrieval}. YEAR = {2011}}
With the growth of social media, document sentiment classification has become an active area of research in this decade. It can be viewed as a special case of topical classification applied only to subjective portions of a document (sources of sentiment). Hence, the key task in document sentiment classification is extracting subjectivity. Existing approaches to extract subjectivity rely heavily on linguistic resources such as sentiment lexicons and complex supervised patterns based on part-of-speech (POS) information. This makes the task of subjective feature extraction complex and resource dependent. In this work, we try to minimize the dependency on linguistic resources in sentiment classification. We propose a simple and statistical methodology called review summary (RSUMM) and use it in combination with well-known feature selection methods to extract subjectivity. Our experimental results on a movie review dataset prove the effectiveness of the proposed methodology.
Multilingual document clustering using wikipedia as external knowledge
N KIRAN KUMAR,G SANTOSH SAI KRISHNA,Vasudeva Varma Kalidindi
Information Retrieval Facility Conference, IRFC, 2011
@inproceedings{bib_Mult_2011, AUTHOR = {N KIRAN KUMAR, G SANTOSH SAI KRISHNA, Vasudeva Varma Kalidindi}, TITLE = {Multilingual document clustering using wikipedia as external knowledge}, BOOKTITLE = {Information Retrieval Facility Conference}. YEAR = {2011}}
This paper presents Multilingual Document Clustering (MDC) on comparable corpora. Wikipedia has evolved into a major structured multilingual knowledge base. It has been widely exploited in monolingual clustering approaches and in comparing multilingual corpora, but no prior work has studied its impact on MDC. Here, we study how Wikipedia can enhance MDC performance. We leverage Wikipedia's knowledge structure (cross-lingual links, categories, outlinks, infobox information, etc.) to enrich the document representation for clustering multilingual documents. We implemented the bisecting k-means clustering algorithm, and experiments were conducted on a standard dataset provided by FIRE for their 2010 ad-hoc cross-lingual document retrieval task on Indian languages, considering English and Hindi datasets. By avoiding language-specific tools, our approach provides a general framework that is easily extendable to other languages. The system was evaluated using F-score and purity measures, and the results obtained were encouraging.
Applying key phrase extraction to aid invalidity search
Manisha Verma,Vasudeva Varma Kalidindi
International Conference on Artificial Intelligence and Law, ICAIL, 2011
@inproceedings{bib_Appl_2011, AUTHOR = {Manisha Verma, Vasudeva Varma Kalidindi}, TITLE = {Applying key phrase extraction to aid invalidity search}, BOOKTITLE = {International Conference on Artificial Intelligence and Law}. YEAR = {2011}}
Invalidity search poses different challenges compared to conventional information retrieval problems. Presently, the success of invalidity search relies on the queries created from a patent application by the patent examiner. Since a lot of time is spent constructing relevant queries, automatically creating them from a patent would save the examiner considerable effort. In this paper, we address the problem of automatically creating queries from an input patent. An optimal query can be formed by extracting important keywords or phrases from a patent using Key Phrase Extraction (KPE) techniques. Several KPE algorithms have been proposed in the literature, but their performance on query construction for patents has not yet been explored. We systematically evaluate and analyze the performance of queries created using state-of-the-art KPE techniques for the invalidity search task. Our experiments show that queries formed by KPE approaches perform better than those formed by selecting phrases based on tf or tf-idf scores.
Effectively mining Wikipedia for clustering multilingual documents
N KIRAN KUMAR,G SANTOSH SAI KRISHNA,Vasudeva Varma Kalidindi
International Conference on Natural Language to Data bases, NLDB, 2011
@inproceedings{bib_Effe_2011, AUTHOR = {N KIRAN KUMAR, G SANTOSH SAI KRISHNA, Vasudeva Varma Kalidindi}, TITLE = {Effectively mining Wikipedia for clustering multilingual documents}, BOOKTITLE = {International Conference on Natural Language to Data bases}. YEAR = {2011}}
This paper presents Multilingual Document Clustering (MDC) using Wikipedia on comparable corpora. In particular, we utilize the cross-lingual link, category, outlink and infobox information present in Wikipedia to enrich the document representation. We use the bisecting k-means algorithm to cluster multilingual documents based on document similarities. Experiments are conducted using English and Hindi Wikipedia, on the English and Hindi datasets provided by FIRE'10 for the ad-hoc cross-lingual document retrieval task on Indian languages. No language-specific tools are used, which makes the proposed approach easily extendable to other languages. The system is evaluated using F-score and purity measures, and the results obtained are encouraging.
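A compact sketch of bisecting k-means as used here, assuming documents are already vectorized (e.g., enriched with the Wikipedia-derived features described above); the random data below stands in for real document vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters, seed=0):
    """Repeatedly split the largest cluster in two until n_clusters remain."""
    clusters = [np.arange(X.shape[0])]
    while len(clusters) < n_clusters:
        biggest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(biggest)
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

X = np.random.RandomState(0).rand(20, 5)  # stand-in for document vectors
print([len(c) for c in bisecting_kmeans(X, 4)])
```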
Tossing coins to trim long queries
SUDIP DATTA,Vasudeva Varma Kalidindi
International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, 2011
@inproceedings{bib_Toss_2011, AUTHOR = {SUDIP DATTA, Vasudeva Varma Kalidindi}, TITLE = {Tossing coins to trim long queries}, BOOKTITLE = {International ACM SIGIR Conference on Research and Development in Information Retrieval}. YEAR = {2011}}
Verbose web queries are often descriptive in nature, and a term-based search engine is unable to distinguish between the essential and the noisy words, which can result in a drift from the user intent. We present a randomized query reduction technique that builds on an earlier learning-to-rank based approach. The proposed technique randomly picks only a small set of samples instead of the exponentially many sub-queries, thus being fast enough to be useful for web search engines while still covering a wide sub-query space.
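The randomized reduction idea can be sketched as follows: rather than scoring all 2^n sub-queries, keep each term with some probability and draw a small sample. The keep probability and the absence of the learning-to-rank scoring step are simplifications:

```python
import random

def sample_subqueries(terms, n_samples=10, keep_prob=0.6, seed=0):
    """Draw distinct random sub-queries by keeping each term independently."""
    rng = random.Random(seed)
    samples = set()
    while len(samples) < n_samples:
        sub = tuple(t for t in terms if rng.random() < keep_prob)
        if sub:
            samples.add(sub)
    return [list(s) for s in samples]

query = "find cheap direct flights from delhi to london in december".split()
for sub in sample_subqueries(query, n_samples=5):
    print(" ".join(sub))  # each sample would then be scored by the ranker
```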
User context as a source of topic retrieval in Twitter
P RAVALI,Vasudeva Varma Kalidindi
Enriching Information Retrieval, ENIR, 2011
@inproceedings{bib_User_2011, AUTHOR = {P RAVALI, Vasudeva Varma Kalidindi}, TITLE = {User context as a source of topic retrieval in Twitter}, BOOKTITLE = {Enriching Information Retrieval}. YEAR = {2011}}
Context in a web-based social system can be a valuable source of user information. On Twitter, context can be derived from user interactions, content streams and friendships. In this paper, we focus on extracting user context from conversation patterns and user-generated Twitter lists. We present a novel approach that utilizes just this context to extract Twitter users' topics of interest. Our approach achieved a precision of 84%, indicating that user context can be exploited for topic information.
Exploring Keyphrase Extraction and IPC Classification Vectors for Prior Art Search.
MANISHA VERMA,Vasudeva Varma Kalidindi
International Conference of the CLEF Association, CLEFS, 2011
@inproceedings{bib_Expl_2011, AUTHOR = {MANISHA VERMA, Vasudeva Varma Kalidindi}, TITLE = {Exploring Keyphrase Extraction and IPC Classification Vectors for Prior Art Search.}, BOOKTITLE = {International Conference of the CLEF Association}. YEAR = {2011}}
In this paper, we describe experiments conducted for the CLEF-IP 2011 Prior Art Retrieval track. We examined the impact of 1) using key phrase extraction to generate queries from the input patent and 2) using the citation network and International Patent Classification (IPC) class vectors in ranking patents. Variations of a popular key phrase extraction technique were explored for extracting and scoring terms of the query patent; these terms are used as queries to retrieve similar patents. In the second approach, we use a two-stage retrieval model to find similar patents. Each patent is represented as an IPC class vector, and the citation network of patents is used to propagate these vectors from a node (patent) to its neighbors (cited patents). Similar patents are found by comparing the query vector with the vectors of patents in the corpus, and text-based search is used to re-rank this solution set to improve precision. Finally, we also extract the citations present within the text of a query patent and add them to the result set, which shows a significant improvement in Mean Average Precision (MAP).
CROSS LANGUAGE INFORMATION ACCESS IN TELUGU
Vasudeva Varma Kalidindi,M. ADITYA,V. SRIKANTH REDDY,RAMBHOOPAL REDDY K
SiliconAndhra Conference, SiAC, 2011
@inproceedings{bib_CROS_2011, AUTHOR = {Vasudeva Varma Kalidindi, M. ADITYA, V. SRIKANTH REDDY, RAMBHOOPAL REDDY K}, TITLE = {CROSS LANGUAGE INFORMATION ACCESS IN TELUGU}, BOOKTITLE = {SiliconAndhra Conference}. YEAR = {2011}}
This paper describes a large-scale system for Cross-Language Information Access (CLIA), which helps in accessing information available in a language different from the one in which a query or information need is expressed. We specifically focus on a Telugu-to-English-and-Hindi CLIA system: for a query given in Telugu, the system retrieves information available in English and Hindi. It also tries to present that information back in Telugu using several information access technologies, such as information extraction, summarization and machine translation. The system was developed at IIIT Hyderabad in the context of a larger project funded by the Ministry of Communication and Information Technology, executed by a consortium of ten universities and research organizations, covering six Indian languages for input and English and Hindi as target languages.
A language-independent approach to identify the named entities in under-resourced languages and clustering multilingual documents
N KIRAN KUMAR,G SANTOSH SAI KRISHNA,Vasudeva Varma Kalidindi
International Conference of the CLEF Association, CLEFS, 2011
@inproceedings{bib_A_la_2011, AUTHOR = {N KIRAN KUMAR, G SANTOSH SAI KRISHNA, Vasudeva Varma Kalidindi}, TITLE = {A language-independent approach to identify the named entities in under-resourced languages and clustering multilingual documents}, BOOKTITLE = {International Conference of the CLEF Association}. YEAR = {2011}}
This paper presents a language-independent Multilingual Document Clustering (MDC) approach on comparable corpora. Named entities (NEs) such as persons, locations and organizations play a major role in measuring document similarity. We propose a method to identify the NEs present in under-resourced Indian languages (Hindi and Marathi) using the NEs present in English, a high-resourced language. The identified NEs are then utilized to form multilingual document clusters using the bisecting k-means clustering algorithm. We make no use of non-English linguistic tools or resources such as WordNet, part-of-speech taggers, bilingual dictionaries, etc., which makes the proposed approach completely language-independent. Experiments are conducted on a standard dataset provided by FIRE for their 2010 ad-hoc cross-lingual document retrieval task on Indian languages, considering English, Hindi and Marathi news datasets. The system is evaluated using F-score, purity and normalized mutual information measures, and the results obtained are encouraging.
Patent search using IPC classification vectors
MANISHA VERMA,Vasudeva Varma Kalidindi
workshop on Patent information retrieval., PaIR, 2011
@inproceedings{bib_Pate_2011, AUTHOR = {MANISHA VERMA, Vasudeva Varma Kalidindi}, TITLE = {Patent search using IPC classification vectors}, BOOKTITLE = {workshop on Patent information retrieval.}. YEAR = {2011}}
Finding similar patents is a challenging task in patent information retrieval. A patent application is often the starting point for finding similar inventions, and keyword search for similar patents requires significant domain expertise and may not fetch relevant results. We propose a novel representation for patents and use a two-stage approach to find similar patents. Each patent is represented as an IPC class vector, and the citation network of patents is used to propagate these vectors from a node (patent) to its neighbors (cited patents). Thus, each patent is represented as a weighted combination of its own IPC information and that of its neighbors. A query patent is represented as a vector using its IPC information, and similar patents can be found simply by comparing this vector with the vectors of patents in the corpus. Text-based search is used to re-rank this solution set to improve precision. We experiment with two similarity measures and re-ranking strategies to show empirically that our representation is effective in improving both the precision and recall of queries on the CLEF 2011 dataset.
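A toy sketch of the propagation step, assuming multi-hot IPC base vectors and a single round of damped propagation along citations; the patent IDs, classes and damping factor below are invented:

```python
import numpy as np

ipc_index = {"G06F": 0, "H04L": 1, "G06N": 2}          # hypothetical IPC classes
patents = {"P1": ["G06F"], "P2": ["H04L"], "P3": ["G06F", "G06N"]}
citations = {"P1": ["P2", "P3"], "P2": [], "P3": ["P2"]}

def base_vector(classes):
    """Multi-hot vector over the IPC class vocabulary."""
    v = np.zeros(len(ipc_index))
    for c in classes:
        v[ipc_index[c]] = 1.0
    return v

def propagated_vector(pid, alpha=0.5):
    """Mix a patent's own IPC vector with damped vectors of cited patents."""
    v = base_vector(patents[pid])
    for cited in citations[pid]:
        v += alpha * base_vector(patents[cited])
    return v / np.linalg.norm(v)

# Similar patents would then be ranked by cosine similarity of these vectors.
print(propagated_vector("P1"))
```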
Learning to Rank Categories for Web Queries
PRASHANT ULLEGADDI,Vasudeva Varma Kalidindi
International Conference on Information and Knowledge Management, CIKM, 2011
@inproceedings{bib_Lear_2011, AUTHOR = {PRASHANT ULLEGADDI, Vasudeva Varma Kalidindi}, TITLE = {Learning to Rank Categories for Web Queries}, BOOKTITLE = {International Conference on Information and Knowledge Management}. YEAR = {2011}}
In web search, understanding the user intent plays an important role in improving the search experience of end users. Such an intent can be represented by the categories to which the user query belongs. In this work, we propose an information retrieval based approach to query categorization with an emphasis on learning category rankings. To carry out categorization, we first represent a category by web documents (from the Open Directory Project) that describe the semantics of the category. Then, we learn category rankings for the queries using learning-to-rank techniques. To show that the results obtained are consistent and do not vary across datasets, we evaluate our approach on two datasets, including the publicly available KDD Cup dataset. We report an overall improvement of 20% on all evaluation metrics (precision, recall and F-measure) over two baselines: a text categorization baseline and an unsupervised IR baseline.
Domain independent model for product attribute extraction from user reviews using wikipedia
KOVELAMUDI SUDHEER,Sethu Ramalingam,Arpit Sood,Vasudeva Varma Kalidindi
International Joint Conference on Natural Language Processing, IJCNLP, 2011
@inproceedings{bib_Doma_2011, AUTHOR = {KOVELAMUDI SUDHEER, Sethu Ramalingam, Arpit Sood, Vasudeva Varma Kalidindi}, TITLE = {Domain independent model for product attribute extraction from user reviews using wikipedia}, BOOKTITLE = {International Joint Conference on Natural Language Processing}. YEAR = {2011}}
The world of e-commerce is expanding, posing a large arena of products, their descriptions, and the customer and professional reviews pertinent to them. Most product attribute extraction techniques in the literature work on structured descriptions using several text analysis tools. However, the attributes in these descriptions are limited compared to those in customer reviews of a product, where users discuss deeper and more specific attributes. In this paper, we propose a novel supervised, domain-independent model for product attribute extraction from user reviews. User-generated content contains unstructured and semi-structured text, where conventional grammar-dependent tools like part-of-speech taggers, named entity recognizers and parsers do not perform at expected levels. We use Wikipedia and the Web to identify product attributes from customer reviews and achieve an F1-score of 0.73.
Towards Enhanced Opinion Classification using NLP Techniques.
AKSHAT BAKLIWAL,PIYUSH ARORA,ANKIT PATIL,Vasudeva Varma Kalidindi
Workshop on Sentiment Analysis where AI meets Psychology., SAAIP, 2011
@inproceedings{bib_Towa_2011, AUTHOR = {AKSHAT BAKLIWAL, PIYUSH ARORA, ANKIT PATIL, Vasudeva Varma Kalidindi}, TITLE = {Towards Enhanced Opinion Classification using NLP Techniques.}, BOOKTITLE = {Workshop on Sentiment Analysis where AI meets Psychology.}. YEAR = {2011}}
Sentiment mining and classification play an important role in predicting what people think about products, places, etc. In this work, using basic NLP techniques like n-grams and POS-tagged n-grams, we classify movie and product reviews into two polarities: positive and negative. We propose a model to determine whether a review is positive or negative, and experiment with several machine learning algorithms, Naive Bayes (NB), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM), for a comparative study of the performance of the devised method. We also perform negation handling and observe improvements in classification. The proposed algorithm achieves an average accuracy of 78.32% on the movie dataset and 70.06% on the multi-category dataset. In this paper, we focus on the collective study of the n-gram and POS-tagged information available in reviews.
IIIT Hyderabad in Summarization and Knowledge Base Population at TAC 2011.
Vasudeva Varma Kalidindi,KOVELAMUDI SUDHEER,Arpit Sood,JAYANT GUPTA,HARSHIT JAIN,P NIKHIL PRIYATAM,M. ADITYA,V. SRIKANTH REDDY
Text Analysis Conference Workshop, TAC, 2011
@inproceedings{bib_IIIT_2011, AUTHOR = {Vasudeva Varma Kalidindi, KOVELAMUDI SUDHEER, Arpit Sood, JAYANT GUPTA, HARSHIT JAIN, P NIKHIL PRIYATAM, M. ADITYA, V. SRIKANTH REDDY}, TITLE = {IIIT Hyderabad in Summarization and Knowledge Base Population at TAC 2011.}, BOOKTITLE = {Text Analysis Conference Workshop}. YEAR = {2011}}
In this report, we present details about the participation of IIIT Hyderabad in the Guided Summarization and Knowledge Base Population tracks at TAC 2011. We have enhanced our summarization system with knowledge-based measures: Wikipedia-based extraction methods and topic modelling are used to score sentences in the guided summarization track. For the multilingual summarization task, we investigated HAL (Hyperspace Analogue to Language), creating a semantic space from word co-occurrences; we show that the results obtained with this unsupervised, language-independent method are competitive with other state-of-the-art systems. For the monolingual and multilingual entity linking tasks, we extended our previous year's model to a lightweight, language-independent system without utilizing any other external knowledge or resources.
Job aware scheduling algorithm for mapreduce framework
RADHESHYAM NANDURI,NITESH MAHESHWARI,A Reddyraja,Vasudeva Varma Kalidindi
IEEE International Conference on Cloud Computing Technology and Science, CLOUDCOM, 2011
@inproceedings{bib_Job__2011, AUTHOR = {RADHESHYAM NANDURI, NITESH MAHESHWARI, A Reddyraja, Vasudeva Varma Kalidindi}, TITLE = {Job aware scheduling algorithm for mapreduce framework}, BOOKTITLE = {IEEE International Conference on Cloud Computing Technology and Science}. YEAR = {2011}}
The MapReduce framework has received wide acclaim over the past few years for large-scale computing and has become a standard paradigm for batch-oriented workloads. As the adoption of this paradigm has increased rapidly, scheduling MapReduce jobs has become a problem of great interest in the research community. We propose an approach that tries to maintain harmony among the jobs running on a cluster and, in turn, decrease their runtime. In our model, the scheduler is made aware of the different types of jobs running on the cluster, and it allocates a task to a node only if the incoming task does not affect the tasks already running on that node. From the list of available pending tasks, our algorithm selects the one most compatible with the tasks already running on that node. We present heuristic and machine learning based solutions under this approach and maintain a resource balance on the cluster by not overloading any of the nodes, thereby reducing the overall runtime of the jobs. The results show runtime savings of around 21% for the heuristic-based approach and around 27% for the machine learning based approach when compared to Yahoo's Capacity Scheduler.
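The compatibility heuristic can be sketched as follows, assuming each task carries a resource profile (the CPU/IO intensities here are invented) and the scheduler prefers the pending task that least stresses the resources already loaded on the node:

```python
def pick_task(pending, running_on_node):
    """Pick the pending task most compatible with a node's current load.

    Tasks are dicts like {"id": ..., "cpu": 0.8, "io": 0.2} with assumed
    per-resource intensity scores.
    """
    load_cpu = sum(t["cpu"] for t in running_on_node)
    load_io = sum(t["io"] for t in running_on_node)

    def incompatibility(task):
        # Penalize tasks that stress the same resource as the current load.
        return task["cpu"] * load_cpu + task["io"] * load_io

    return min(pending, key=incompatibility)

pending = [{"id": "sort", "cpu": 0.3, "io": 0.9}, {"id": "grep", "cpu": 0.8, "io": 0.2}]
running = [{"id": "index", "cpu": 0.9, "io": 0.1}]
print(pick_task(pending, running)["id"])  # picks the IO-heavy task on a CPU-loaded node
```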
Using Unsupervised System with least linguistic features for TAC-AESOP Task.
NIRAJ KUMAR,Srinathan Kannan,Vasudeva Varma Kalidindi
Text Analysis Conference Workshop, TAC, 2011
@inproceedings{bib_Usin_2011, AUTHOR = {NIRAJ KUMAR, Srinathan Kannan, Vasudeva Varma Kalidindi}, TITLE = {Using Unsupervised System with least linguistic features for TAC-AESOP Task.}, BOOKTITLE = {Text Analysis Conference Workshop}. YEAR = {2011}}
We consider the AESOP task as topic-based evaluation of information content. That is, in the first stage we identify the topics covered in a given model/reference summary and calculate their importance, using the local importance of words in the calculation. In the next stage, we calculate the information coverage of the test (machine-generated) summary with respect to every identified topic. The experiments make clear that different methods for identifying topics and for calculating information coverage with respect to each identified topic have different effects on the result. It is important to note that our system does not require any linguistic support, learning or training in its entire execution.
Using pattern classification for task assignment in mapreduce
Jaideep Dhok,Vasudeva Varma Kalidindi
India Software Engineering Conference, ISECo, 2010
@inproceedings{bib_Usin_2010, AUTHOR = {Jaideep Dhok, Vasudeva Varma Kalidindi}, TITLE = {Using pattern classification for task assignment in mapreduce}, BOOKTITLE = {India Software Engineering Conference}. YEAR = {2010}}
MapReduce has become a popular paradigm for large-scale data processing in the cloud. The sheer scale of MapReduce deployments makes task assignment in MapReduce an interesting problem, and the scale of MapReduce applications presents a unique opportunity to use data-driven algorithms in resource management. We present a learning-based scheduler that uses pattern classification for utilization-oriented task assignment in MapReduce, along with its application to the Hadoop platform. The scheduler assigns tasks by classifying them into two classes, good and bad; from the tasks labeled good, it selects the task least likely to overload a worker node. We allow users to plug in their own policy schemes for prioritizing jobs. The scheduler learns the impact of different applications on utilization rather quickly and achieves a user-specified level of utilization. Our results show that our scheduler reduces the response times of jobs in certain cases by a factor of two.
IIIT Hyderabad in Guided Summarization and Knowledge Base Population.
Vasudeva Varma Kalidindi,B PRAVEEN KUMAR,KRANTHI REDDY B,VIJAYBHARATH REDDY YARAM,KOVELAMUDI SUDHEER,V. SRIKANTH REDDY,RADHESHYAM NANDURI,N KIRAN KUMAR
Text Analysis Conference Workshop, TAC, 2010
@inproceedings{bib_IIIT_2010, AUTHOR = {Vasudeva Varma Kalidindi, B PRAVEEN KUMAR, KRANTHI REDDY B, VIJAYBHARATH REDDY YARAM, KOVELAMUDI SUDHEER, V. SRIKANTH REDDY, RADHESHYAM NANDURI, N KIRAN KUMAR}, TITLE = {IIIT Hyderabad in Guided Summarization and Knowledge Base Population.}, BOOKTITLE = {Text Analysis Conference Workshop}. YEAR = {2010}}
In this report, we present details about the participation of IIIT Hyderabad in the Guided Summarization and Knowledge Base Population tracks at TAC 2010. This year, we enhanced our summarization system with knowledge-based measures and utilized domain and sentence-tag models to score sentences to suit the guided summarization track. We used an external tool, WikiMiner, to identify key concepts in the documents and extract important sentences. We adopted an information retrieval based approach for the Entity Linking and Slot Filling tasks. In Entity Linking, we identified potential nodes from the knowledge base and then identified the mapping node using tf-idf similarity, achieving very good performance. For the Slot Filling task, we identified documents from the collection that might contain attribute information and extracted that information using a rule-based approach.
Prominence based scoring of speech segments for automatic speech-to-speech summarization
YELLA SREE HARSHA,Vasudeva Varma Kalidindi,Kishore Prahallad
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2010
@inproceedings{bib_Prom_2010, AUTHOR = {YELLA SREE HARSHA, Vasudeva Varma Kalidindi, Kishore Prahallad}, TITLE = {Prominence based scoring of speech segments for automatic speech-to-speech summarization}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2010}}
In order to perform speech summarization, it is necessary to identify important segments in the speech signal. The importance of a speech segment can be effectively determined using information from lexical and prosodic features. Standard speech summarization systems depend on ASR transcripts or gold-standard human reference summaries to train a supervised system that combines lexical and prosodic features to choose segments for the summary. We propose a method that uses the prominence values of syllables in a speech segment to rank the segment for summarization; the proposed method depends on neither ASR transcripts nor gold-standard human summaries. Evaluation results show that the summaries generated by the proposed method are as good as summaries generated using tf-idf scores and a supervised system trained on gold-standard summaries.
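A minimal sketch of the ranking idea: score each segment by the mean prominence of its syllables and keep the top-scoring segments. The prominence values below are placeholders; the paper derives them from the speech signal:

```python
def summarize(segments, budget=2):
    """Rank segments by mean syllable prominence and keep the top `budget`.

    segments: list of (segment_id, [syllable prominence values]).
    """
    scored = [(sid, sum(p) / len(p)) for sid, p in segments]
    scored.sort(key=lambda x: -x[1])
    return [sid for sid, _ in scored[:budget]]

segments = [("s1", [0.2, 0.4]), ("s2", [0.9, 0.7, 0.8]), ("s3", [0.5, 0.6])]
print(summarize(segments))  # -> ['s2', 's3']
```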
An iterative approach to extract dictionaries from wikipedia for under-resourced languages
ROHIT BHARADWAJ G,Niket Tandon,Vasudeva Varma Kalidindi
International Conference on Natural Language Processing., ICON, 2010
@inproceedings{bib_An_i_2010, AUTHOR = {ROHIT BHARADWAJ G, Niket Tandon, Vasudeva Varma Kalidindi}, TITLE = {An iterative approach to extract dictionaries from wikipedia for under-resourced languages}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2010}}
The problem of extracting bilingual dictionaries from Wikipedia is well known and well researched. Given the structured and rich multilingual content of Wikipedia, a language-independent approach is necessary for extracting dictionaries for various languages, more so for under-resourced languages. In our attempt to mine dictionaries for under-resourced languages, we developed an iterative approach that constructs a parallel corpus for building a dictionary, considering several kinds of Wikipedia article information: title, infobox information, category, article text, and the dictionaries already built in each phase. The average precision over various datasets is encouraging, with a maximum precision of 76.7%, performing better than existing systems. As no language-specific resources are used, our method is applicable to any language pair, with special focus on under-resourced languages, hence breaking the language barrier.
Query Processing for Enterprise Search with Wikipedia Link Structure.
NIHAR SHARMA,Vasudeva Varma Kalidindi
International Conference on Knowledge Discovery and Information Retrieval, KDIR, 2010
@inproceedings{bib_Quer_2010, AUTHOR = {NIHAR SHARMA, Vasudeva Varma Kalidindi}, TITLE = {Query Processing for Enterprise Search with Wikipedia Link Structure.}, BOOKTITLE = {International Conference on Knowledge Discovery and Information Retrieval}. YEAR = {2010}}
We present a phrase-based query expansion (QE) technique for enterprise search using a domain-independent concept thesaurus constructed from the Wikipedia link structure. Our approach analyzes article and category link information to derive sets of related concepts for building the thesaurus. In addition, we build a vocabulary set containing natural word order and usage that semantically represents concepts, and extract query-representational concepts from this vocabulary set with a three-layered approach. The concept thesaurus then yields related concepts for expanding a query. Evaluation on TRECENT 2007 data shows an impressive 9 percent increase in recall over fifty queries. We also observed that our implementation improves precision at top-k results by 0.7, 1, 6 and 9 percent for the top 10, 20, 50 and 100 search results respectively, demonstrating the promise that a Wikipedia-based thesaurus holds for domain-specific search.
Learning based opportunistic admission control algorithm for MapReduce as a service
Jaideep Dhok,NITESH MAHESHWARI,Vasudeva Varma Kalidindi
India Software Engineering Conference, ISECo, 2010
@inproceedings{bib_Lear_2010, AUTHOR = {Jaideep Dhok, NITESH MAHESHWARI, Vasudeva Varma Kalidindi}, TITLE = {Learning based opportunistic admission control algorithm for MapReduce as a service}, BOOKTITLE = {India Software Engineering Conference}. YEAR = {2010}}
Admission control has proven essential for avoiding resource overload and meeting user service demands in utility-driven grid computing. The recent emergence of cloud-based services and the popularity of the MapReduce paradigm in cloud computing environments make the problem of admission control intriguing. We propose a model that allows one to offer MapReduce jobs as on-demand services, and present a learning-based opportunistic algorithm that admits MapReduce jobs only if they are unlikely to cross the overload threshold set by the service provider. The algorithm meets deadlines negotiated by users in more than 80% of cases. We employ an automatically supervised Naive Bayes classifier to label incoming jobs as admissible or non-admissible; from the jobs classified as admissible, we then pick the job expected to maximize service provider utility. An external supervision rule automatically evaluates the decisions made by the algorithm in retrospect and trains the classifier. We evaluate the algorithm by simulating a MapReduce cluster hosted in the cloud that offers a set of MapReduce jobs as services to its users. Our results show that admission control is useful in minimizing losses due to resource overload and in choosing jobs that maximize the revenue of the service provider.
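A hedged sketch of the admission step: a Naive Bayes classifier over job and load features labels incoming jobs as admissible or not. The feature set and training examples below are invented for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical features: [map_tasks, reduce_tasks, current_utilization].
# Labels would come from the retrospective supervision rule the abstract
# describes (1 = admissible, 0 = would overload).
X = np.array([[10, 2, 0.3], [200, 50, 0.8], [50, 10, 0.5], [300, 80, 0.9]])
y = np.array([1, 0, 1, 0])

clf = GaussianNB().fit(X, y)

new_job = np.array([[80, 20, 0.6]])
print("admit" if clf.predict(new_job)[0] == 1 else "reject")
```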
Supervised learning approaches for rating customer reviews
SARVABHOTLA KIRAN,R V V PRASAD RAO,Vasudeva Varma Kalidindi
Journal of Intelligent Systems, JIS, 2010
@inproceedings{bib_Supe_2010, AUTHOR = {SARVABHOTLA KIRAN, R V V PRASAD RAO, Vasudeva Varma Kalidindi}, TITLE = {Supervised learning approaches for rating customer reviews}, BOOKTITLE = {Journal of Intelligent Systems}. YEAR = {2010}}
Social media has become so popular in recent years that people express their views and thoughts about any product or movie through reviews. Reviews have a great influence on people and the decisions they make, which has led researchers and market analysts to analyze the opinions of users in reviews and model their preferences. Reviews are also often scored with a customer satisfaction rating for the product or movie, usually on a scale from one to five stars, or from very bad to excellent. In this paper, we address the problem of attributing a numerical score (one to five stars) to a review. We view it as a multi-label classification (supervised learning) problem and present two approaches, using Naive Bayes (NB) and Support Vector Machines (SVMs). We focus on widely used feature representations of reviews and the problems associated with them, and present solutions that address these problems.
Towards analyzing data security risks in cloud computing environments
AMIT SANGROYA,SAURABH KUMAR,Jaideep Dhok,Vasudeva Varma Kalidindi
International Conference on Information Systems and Technology Management, ICISTM, 2010
@inproceedings{bib_Towa_2010, AUTHOR = {AMIT SANGROYA, SAURABH KUMAR, Jaideep Dhok, Vasudeva Varma Kalidindi}, TITLE = {Towards analyzing data security risks in cloud computing environments}, BOOKTITLE = {International Conference on Information Systems and Technology Management}. YEAR = {2010}}
There is a growing trend of using cloud environments for ever-growing storage and data processing needs. However, adopting a cloud computing paradigm may have positive as well as negative effects on the data security of service consumers. This paper primarily aims to highlight the major security issues in current cloud computing environments. We carry out a survey of the security mechanisms enforced by major cloud service providers. We also propose a risk analysis approach that a prospective cloud service consumer can use to analyze the data security risks before putting confidential data into a cloud computing environment.
SAGE: An approach to evaluate the impact of SOA governance policies
AMIT SANGROYA,KIRTI GARG,Vasudeva Varma Kalidindi
IEEE International Workshop on Service Oriented Architectures in Converging Networked Environments, SOCNE, 2010
@inproceedings{bib_SAGE_2010, AUTHOR = {AMIT SANGROYA, KIRTI GARG, Vasudeva Varma Kalidindi}, TITLE = {SAGE: An approach to evaluate the impact of SOA governance policies}, BOOKTITLE = {IEEE International Workshop on Service Oriented Architectures in Converging Networked Environments}. YEAR = {2010}}
We propose SAGE (Service Oriented Architecture Governance Evaluation), a quantitative approach to evaluate the impact of SOA governance policies on prominent quality attributes. Our approach facilitates an exhaustive quantitative impact analysis to evaluate the effectiveness of SOA governance policies, which can further be used to estimate the current maturity level of any SOA-based enterprise. The use of SAGE at an early stage will help make SOA management more structured and effective.
An information retrieval approach to spelling suggestion
SAI KRISHNA PENDYALA V,PRASAD RAO P V V,Vasudeva Varma Kalidindi
International Conference on World wide web, WWW, 2010
@inproceedings{bib_An_i_2010, AUTHOR = {SAI KRISHNA PENDYALA V, PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {An information retrieval approach to spelling suggestion}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2010}}
In this paper, we present a two-step language-independent spelling suggestion system. In the first step, candidate suggestions are generated using an Information Retrieval (IR) approach. In the second step, candidate suggestions are re-ranked using a new string similarity measure that uses the length of the longest common substrings occurring at the beginning and end of the words. We obtained very impressive results by re-ranking candidate suggestions using the new similarity measure: the accuracy of the first suggestion is 92.3%, 90.0% and 83.5% for the Dutch, Danish and Bulgarian datasets respectively.
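The re-ranking measure rewards shared substrings at the word boundaries. Below is a hedged Python sketch: it sums the longest common prefix and suffix lengths and normalizes by the average word length; the normalization is our assumption, not necessarily the paper's exact formula.

# Sketch of a boundary-substring similarity for re-ranking suggestions.
def prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def boundary_similarity(word, candidate):
    p = prefix_len(word, candidate)                     # common prefix
    s = prefix_len(word[::-1], candidate[::-1])         # common suffix
    # Avoid double-counting when prefix and suffix overlap in a short word.
    s = min(s, min(len(word), len(candidate)) - p)
    return (p + s) / ((len(word) + len(candidate)) / 2)

suggestions = ["information", "informal", "intention"]
query = "informtion"
print(sorted(suggestions, key=lambda c: -boundary_similarity(query, c)))
# ['information', 'informal', 'intention']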
Learning the click-through rate for rare/new ads from similar ads
DAVE KUSHAL SHAILESH,Vasudeva Varma Kalidindi
International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, 2010
@inproceedings{bib_Lear_2010, AUTHOR = {DAVE KUSHAL SHAILESH, Vasudeva Varma Kalidindi}, TITLE = {Learning the click-through rate for rare/new ads from similar ads}, BOOKTITLE = {International ACM SIGIR Conference on Research and Development in Information Retrieval}. YEAR = {2010}}
Ads on a search engine (SE) are generally ranked based on their click-through rates (CTR). Hence, accurately predicting the CTR of an ad is of paramount importance for maximizing the SE's revenue. We present a model that inherits the click information of rare/new ads from other semantically related ads. The semantic features are derived from query-ad click-through graphs and advertiser account information. We show that the model learned using these features gives very good predictions of CTR values.
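The following Python sketch conveys the idea: the CTR of a new ad is estimated as a similarity-weighted average of the CTRs of related ads, smoothed towards a global prior. The Jaccard similarity over keywords is a placeholder for the paper's click-graph and account-based features.

# Minimal sketch of inheriting CTR from semantically related ads.
def inherited_ctr(new_ad, neighbours, similarity, prior=0.01, prior_weight=1.0):
    num = prior * prior_weight          # smooth towards a global prior CTR
    den = prior_weight
    for ad in neighbours:
        w = similarity(new_ad, ad)
        num += w * ad["clicks"] / max(ad["impressions"], 1)
        den += w
    return num / den

neighbours = [
    {"keywords": {"shoes", "running"}, "clicks": 30, "impressions": 1000},
    {"keywords": {"shoes", "leather"}, "clicks": 5, "impressions": 1000},
]
jaccard = lambda a, b: (len(a["keywords"] & b["keywords"])
                        / len(a["keywords"] | b["keywords"]))
print(inherited_ctr({"keywords": {"shoes", "running", "sale"}},
                    neighbours, jaccard))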
Generating simulated relevance feedback: a prognostic search approach
NITHIN KUMAR M,Vasudeva Varma Kalidindi
International Conference on Computational Linguistics, COLING, 2010
@inproceedings{bib_Gene_2010, AUTHOR = {NITHIN KUMAR M, Vasudeva Varma Kalidindi}, TITLE = {Generating simulated relevance feedback: a prognostic search approach}, BOOKTITLE = {International Conference on Computational Linguistics}. YEAR = {2010}}
Implicit relevance feedback has proved to be an important resource for improving search accuracy and personalization. However, researchers who rely on feedback data for testing their algorithms or other personalization-related problems face issues such as unavailability of data, data staleness, and so on. Given these problems, we are motivated to create synthetic user relevance feedback data, based on insights from query log analysis. We call this simulated feedback. We believe that simulated feedback can be immensely beneficial to the web search engine and personalization research communities by greatly reducing the effort involved in collecting user feedback. The benefits of simulated feedback are that it is easy to obtain, and the process of obtaining it is repeatable, customizable, and does not require user interaction. In this paper, we describe a simple yet effective approach to creating simulated feedback. We evaluated our system using users' click-through data and achieved 77% accuracy in generating click-through data.
Pattern based keyword extraction for contextual advertising
DAVE KUSHAL SHAILESH,Vasudeva Varma Kalidindi
International Conference on Information and Knowledge Management, CIKM, 2010
@inproceedings{bib_Patt_2010, AUTHOR = {DAVE KUSHAL SHAILESH, Vasudeva Varma Kalidindi}, TITLE = {Pattern based keyword extraction for contextual advertising}, BOOKTITLE = {International Conference on Information and Knowledge Management}. YEAR = {2010}}
Contextual Advertising (CA) refers to the placement of ads that are contextually related to the web page content. The science of CA deals with the task of finding advertising keywords in web pages. We present a different candidate selection method to extract advertising keywords from a web page. This method makes use of Part-of-Speech (POS) patterns that restrict the number of potential candidates the classifier has to handle: it fetches words/phrases that belong to a selected set of POS patterns. We design four systems based on the chunking method and the features they use. These systems are trained with a naive Bayes classifier on a set of web pages annotated with 'advertising' keywords, and can then find advertising keywords in previously unseen web pages. Empirical evaluation shows that systems using the proposed chunking method perform better than systems using N-gram based chunking. All improvements are statistically significant at the 99% confidence level.
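A minimal Python sketch of the candidate selection step: only token spans whose POS tag sequence matches a selected pattern set become candidates for the classifier. The pattern set and the pre-tagged input are illustrative assumptions, not the paper's exact configuration.

# Sketch of POS-pattern based candidate chunking.
PATTERNS = {
    ("NN",), ("NNS",), ("JJ", "NN"), ("NN", "NN"), ("JJ", "NN", "NN"),
}

def pos_pattern_candidates(tagged_tokens, max_len=3):
    """tagged_tokens: [(word, pos_tag), ...], e.g. from nltk.pos_tag."""
    candidates = []
    for i in range(len(tagged_tokens)):
        for j in range(i + 1, min(i + max_len, len(tagged_tokens)) + 1):
            words, tags = zip(*tagged_tokens[i:j])
            if tags in PATTERNS:
                candidates.append(" ".join(words))
    return candidates

tagged = [("cheap", "JJ"), ("car", "NN"), ("insurance", "NN"),
          ("is", "VBZ"), ("available", "JJ"), ("online", "RB")]
print(pos_pattern_candidates(tagged))
# ['cheap car', 'cheap car insurance', 'car', 'car insurance', 'insurance']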
A weighted tag similarity measure based on a collaborative weight model
R J SRINIVAS GOKAVARAPU,Niket Tandon,Vasudeva Varma Kalidindi
international workshop on Search and mining user-generated contents, SMUC-W, 2010
@inproceedings{bib_A_we_2010, AUTHOR = {R J SRINIVAS GOKAVARAPU, Niket Tandon, Vasudeva Varma Kalidindi}, TITLE = {A weighted tag similarity measure based on a collaborative weight model}, BOOKTITLE = {international workshop on Search and mining user-generated contents}. YEAR = {2010}}
The problem of measuring semantic relatedness between social tags remains largely open. Given the structure of social bookmarking systems, similarity measures need to be addressed from a social bookmarking perspective. We address the fundamental problem of the weight model for tags, on which every similarity measure is based. We propose a weight model for tagging systems that considers the user dimension, unlike existing measures based on tag frequency. Visual analysis of tag clouds shows that the proposed model provides intuitively better weight scores than tag frequency. We also propose a weighted similarity model that is conceptually different from contemporary frequency-based similarity measures. Based on this model, we present weighted variations of several existing measures, such as the Dice and cosine similarity measures. We evaluate the proposed similarity model using Spearman's correlation coefficient, with WordNet as the gold standard. Our method achieves a 20% improvement over traditional similarity measures such as Dice and cosine similarity, and also over recent tag similarity measures such as mutual information with distributional aggregation. Finally, we show the practical effectiveness of the proposed weighted similarity measures by performing search over tagged documents using Social SimRank on a large real-world dataset.
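The sketch below (Python) illustrates the idea with one plausible instantiation: tag weights count distinct users per (resource, tag) pair instead of raw frequency, and a weighted Dice score is computed over those weights. Both formulas are assumptions standing in for the paper's exact model.

# Illustrative user-aware tag weights and a weighted Dice similarity.
from collections import defaultdict

def tag_weights(annotations):
    """annotations: iterable of (user, resource, tag) triples."""
    users = defaultdict(set)
    for user, resource, tag in annotations:
        users[(resource, tag)].add(user)
    return {key: len(u) for key, u in users.items()}   # distinct-user count

def weighted_dice(tag_a, tag_b, weights):
    resources_a = {r for (r, t) in weights if t == tag_a}
    resources_b = {r for (r, t) in weights if t == tag_b}
    shared = sum(weights[(r, tag_a)] + weights[(r, tag_b)]
                 for r in resources_a & resources_b)
    total = (sum(weights[(r, tag_a)] for r in resources_a)
             + sum(weights[(r, tag_b)] for r in resources_b))
    return shared / total if total else 0.0

annotations = [("u1", "d1", "python"), ("u2", "d1", "python"),
               ("u1", "d1", "coding"), ("u3", "d2", "python")]
w = tag_weights(annotations)
print(weighted_dice("python", "coding", w))   # 0.75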
Significance of anchor speaker segments for constructing extractive audio summaries of broadcast news
YELLA SREE HARSHA,Vasudeva Varma Kalidindi,Kishore Prahallad
IEEE Spoken Language Technology Workshop, SLT-W, 2010
@inproceedings{bib_Sign_2010, AUTHOR = {YELLA SREE HARSHA, Vasudeva Varma Kalidindi, Kishore Prahallad}, TITLE = {Significance of anchor speaker segments for constructing extractive audio summaries of broadcast news}, BOOKTITLE = {IEEE Spoken Language Technology Workshop}. YEAR = {2010}}
Analysis of human reference summaries of broadcast news showed that humans prefer anchor speaker segments when constructing a summary. We therefore exploit the role of the anchor speaker in a news show by tracking his/her speech to construct indicative/informative extractive audio summaries. Speaker tracking is done using the Bayesian information criterion (BIC). The proposed technique does not require Automatic Speech Recognition (ASR) transcripts or human reference summaries for training. Objective evaluation with ROUGE showed that summaries generated by the proposed technique are as good as summaries generated by a baseline text summarization system taking manual transcripts as input, and as summaries generated by a supervised speech summarization system trained on human summaries. Subjective evaluation of the audio summaries showed that humans prefer summaries generated by the proposed technique to those generated by the supervised speech summarization system.
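For readers unfamiliar with BIC-based speaker tracking, the Python sketch below computes the standard delta-BIC statistic at a candidate boundary: a positive value favors modeling the two sides as different speakers. The Gaussian formulation and lambda=1 are the usual textbook choices, not details taken from the paper, and real systems would use MFCC features.

# Delta-BIC for a candidate speaker-change boundary (textbook form).
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """X, Y: (frames x dims) feature matrices on either side of a boundary."""
    Z = np.vstack([X, Y])
    n, d = Z.shape

    def logdet_cov(A):
        sign, logdet = np.linalg.slogdet(np.cov(A, rowvar=False))
        return logdet

    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(Z)
            - 0.5 * len(X) * logdet_cov(X)
            - 0.5 * len(Y) * logdet_cov(Y)
            - lam * penalty)

rng = np.random.default_rng(1)
same = delta_bic(rng.normal(0, 1, (200, 12)), rng.normal(0, 1, (200, 12)))
diff = delta_bic(rng.normal(0, 1, (200, 12)), rng.normal(3, 1, (200, 12)))
print(f"same speaker: {same:.1f}, different speakers: {diff:.1f}")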
Smeo: A platform for smart classrooms with enhanced information access and operations automation
AKSHEYA JAWA,SUDIP DATTA,Vishal Garg,Vasudeva Varma Kalidindi,Suresh Chande,Murali Krishna Punaganti Venkata
Smart Spaces and Next Generation Wired/Wireless Networking, SSNGWWN, 2010
@inproceedings{bib_Smeo_2010, AUTHOR = {AKSHEYA JAWA, SUDIP DATTA, Vishal Garg, Vasudeva Varma Kalidindi, Suresh Chande, Murali Krishna Punaganti Venkata}, TITLE = {Smeo: A platform for smart classrooms with enhanced information access and operations automation}, BOOKTITLE = {Smart Spaces and Next Generation Wired/Wireless Networking}. YEAR = {2010}}
This paper explores the creation of smart spaces using mobile Internet devices to develop intelligent classrooms. We propose a suite of applications that leverage recent progress in mobile computing and information access, resulting in a richer classroom experience where knowledge is dispensed with greater efficiency and can also be accessed readily from the Internet as well as from peers. We attempt to achieve this by introducing novel means of classroom teaching and management and an intelligent information access system, while harnessing the computational capabilities of mobile Internet devices. Though this is still work in progress, the paper elaborates on our attempts to reduce the gap between technological progress in various domains and existing pedagogy.
Exploiting N-gram Importance and Wikipedia based Additional Knowledge for Improvements in GAAC based Document Clustering.
NIRAJ KUMAR, Vinay Babu Vemula,Srinathan Kannan,Vasudeva Varma Kalidindi
International Conference on Knowledge Discovery and Information Retrieval, KDIR, 2010
@inproceedings{bib_Expl_2010, AUTHOR = {NIRAJ KUMAR, Vinay Babu Vemula, Srinathan Kannan, Vasudeva Varma Kalidindi}, TITLE = {Exploiting N-gram Importance and Wikipedia based Additional Knowledge for Improvements in GAAC based Document Clustering.}, BOOKTITLE = {International Conference on Knowledge Discovery and Information Retrieval}. YEAR = {2010}}
This paper provides a solution to the question: "How can we use Wikipedia-based concepts in document clustering with less human involvement, accompanied by effective improvements in results?" In the devised system, we propose a method to exploit the importance of N-grams in a document and use Wikipedia-based additional knowledge for GAAC-based document clustering. The importance of N-grams in a document depends on many features including, but not limited to, frequency, position of occurrence in a sentence, and the position of that sentence in the document. First, we introduce a new similarity measure that takes weighted N-gram importance into account when computing similarity during document clustering; as a result, the chances of topical similarity in clustering are improved. Second, we use Wikipedia as an additional knowledge base both to remove noisy entries from the extracted N-grams and to reduce the information gap between conceptually related N-grams that do not match owing to differences in writing scheme or strategy. Our experimental results on a publicly available text dataset clearly show that the devised system significantly improves on bag-of-words based state-of-the-art systems in this area.
An Effective Approach for AESOP and Guided Summarization Task.
NIRAJ KUMAR,Srinathan Kannan,Vasudeva Varma Kalidindi
Text Analysis Conference Workshop, TAC, 2010
@inproceedings{bib_An_E_2010, AUTHOR = {NIRAJ KUMAR, Srinathan Kannan, Vasudeva Varma Kalidindi}, TITLE = {An Effective Approach for AESOP and Guided Summarization Task.}, BOOKTITLE = {Text Analysis Conference Workshop}. YEAR = {2010}}
In this paper, we present (1) an unsupervised system for the AESOP task and (2) a generic multi-document summarization system for the guided summarization task. We propose the use of (1) the role and importance of words and sentences in a document and (2) the number and coverage strength of topics in a document, for both the AESOP and guided summarization tasks. We also use other statistical features, simple heuristics and grammatical facts to capture the important facts and information in the source document(s).
A language-independent transliteration schema using character aligned models at NEWS 2009
PRANEETH MEDHATITHI SHISHTLA,SURYA GANESH VEERAVALLI,S Sethuramalingam,Vasudeva Varma Kalidindi
International Joint Conference on Natural Language Processing, IJCNLP, 2009
@inproceedings{bib_A_la_2009, AUTHOR = {PRANEETH MEDHATITHI SHISHTLA, SURYA GANESH VEERAVALLI, S Sethuramalingam, Vasudeva Varma Kalidindi}, TITLE = {A language-independent transliteration schema using character aligned models at NEWS 2009}, BOOKTITLE = {International Joint Conference on Natural Language Processing}. YEAR = {2009}}
In this paper we present a statistical transliteration technique that is language independent. The technique uses statistical alignment models and Conditional Random Fields (CRFs). The alignment models maximize the probability of the observed (source, target) word pairs using the expectation maximization algorithm, after which the character-level alignments are set to the maximum posterior predictions of the model. The CRF, which is conditioned on both source and target languages, has efficient training and decoding processes and produces globally optimal solutions.
Exploiting structure and content of wikipedia for query expansion in the context
SURYA GANESH VEERAVALLI,Vasudeva Varma Kalidindi
INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING, RANLP, 2009
@inproceedings{bib_Expl_2009, AUTHOR = {SURYA GANESH VEERAVALLI, Vasudeva Varma Kalidindi}, TITLE = {Exploiting structure and content of wikipedia for query expansion in the context}, BOOKTITLE = {INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING}. YEAR = {2009}}
Retrieving answer-containing passages is a challenging task in Question Answering. In this paper we describe a novel query expansion method that aims to better rank answer-containing passages. It uses the content and structured information (link structure and category information) of Wikipedia to generate a set of terms semantically related to the question. As the Boolean model allows fine-grained control over query expansion, these semantically related terms are added to the original query to form an expanded Boolean query. We conducted experiments on TREC 2006 QA data. The results show significant improvements of about 24.6%, 11.1% and 12.4% in precision at 1, MRR at 20 and TDRR scores respectively when using our query expansion method.
Exploiting the Use of Prior Probabilities for Passage Retrieval in Question Answering
SURYA GANESH VEERAVALLI,Vasudeva Varma Kalidindi
INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING, RANLP, 2009
@inproceedings{bib_Expl_2009, AUTHOR = {SURYA GANESH VEERAVALLI, Vasudeva Varma Kalidindi}, TITLE = {Exploiting the Use of Prior Probabilities for Passage Retrieval in Question Answering}, BOOKTITLE = {INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING}. YEAR = {2009}}
Document retrieval assumes that a document's prior probability is independent of its relevance and non-relevance. Previous work carried the same assumption over to passage retrieval in the context of Question Answering. In this paper, we relax this assumption and describe a method for estimating the prior probability of a passage being relevant or non-relevant to a question; these priors are then used in ranking passages. We also describe a simple method for identifying relevant and non-relevant text for a question using the Web and the AQUAINT corpus as information sources. An empirical evaluation on the TREC 2006 Question Answering test set showed that, in the context of Question Answering, prior probabilities are valuable in ranking passages.
Query-focused summaries or query-biased summaries?
K RAHUL,Vasudeva Varma Kalidindi
International Joint Conference on Natural Language Processing, IJCNLP, 2009
@inproceedings{bib_Quer_2009, AUTHOR = {K RAHUL, Vasudeva Varma Kalidindi}, TITLE = {Query-focused summaries or query-biased summaries?}, BOOKTITLE = {International Joint Conference on Natural Language Processing}. YEAR = {2009}}
In the context of the Document Understanding Conferences, the task of query-focused multi-document summarization is intended to improve agreement in content among human-generated model summaries. Query focus also aids automated summarizers in directing the summary at specific topics, which may result in better agreement with these model summaries. However, while query focus correlates with performance, we show that high-performing automatic systems produce summaries with disproportionately higher query term density than human summarizers do. Experimental evidence suggests that automatic systems rely heavily on query term occurrence and repetition to achieve good performance.
Experiments in CLIR using fuzzy string search based on surface similarity
S Sethuramalingam,ANIL KUMAR SINGH,Pradeep Dasigi,Vasudeva Varma Kalidindi
International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, 2009
@inproceedings{bib_Expe_2009, AUTHOR = {S Sethuramalingam, ANIL KUMAR SINGH, Pradeep Dasigi, Vasudeva Varma Kalidindi}, TITLE = {Experiments in CLIR using fuzzy string search based on surface similarity}, BOOKTITLE = {International ACM SIGIR Conference on Research and Development in Information Retrieval}. YEAR = {2009}}
Cross Language Information Retrieval (CLIR) between languages of the same origin is an interesting topic of research. The similarity of the writing systems used by these languages can be used effectively not only to improve CLIR, but also to overcome, to an extent, the problems of textual variations, textual errors, and even the lack of linguistic resources such as stemmers. We conducted CLIR experiments between three languages that use writing systems (scripts) of Brahmi origin, namely Hindi, Bengali and Marathi. We found significant improvements for all six language pairs using a method for fuzzy text search based on surface similarity. In this paper we report these results and compare them with a baseline CLIR system and a CLIR system that uses Scaled Edit Distance (SED) for fuzzy string matching.
Sentence Position revisited: A robust light-weight Update Summarization ‘baseline’ Algorithm
K RAHUL,PRASAD RAO P V V,Vasudeva Varma Kalidindi
Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societ, CLIAWS, 2009
@inproceedings{bib_Sent_2009, AUTHOR = {K RAHUL, PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {Sentence Position revisited: A robust light-weight Update Summarization ‘baseline’Algorithm}, BOOKTITLE = {Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societ}. YEAR = {2009}}
In this paper, we describe a sentence-position based summarizer built on a sentence position policy created from the evaluation testbed of recent summarization tasks at the Document Understanding Conferences (DUC). We show that the summarizer thus built outperforms most systems participating in the task-focused summarization evaluations at the Text Analysis Conference (TAC) 2008. Our experiments also show that such a method performs better at producing short summaries (up to 100 words) than longer ones. Further, we discuss the baselines traditionally used for summarization evaluation and suggest reviving an old baseline to suit the current summarization task at TAC: the update summarization task.
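A position-based baseline is simple enough to fit in a few lines; the Python sketch below scores sentences by a reciprocal-position policy (our stand-in for the policy learned from DUC data) and fills a word budget greedily.

# Minimal sketch of a sentence-position baseline summarizer.
def position_baseline_summary(documents, budget_words=100):
    scored = []
    for doc in documents:
        for pos, sentence in enumerate(doc):
            scored.append((1.0 / (pos + 1), sentence))   # position policy
    summary, used = [], 0
    for _, sentence in sorted(scored, key=lambda x: -x[0]):
        n = len(sentence.split())
        if used + n > budget_words:
            break
        summary.append(sentence)
        used += n
    return " ".join(summary)

docs = [["First sentence of doc one.", "Second sentence of doc one."],
        ["First sentence of doc two.", "Second sentence of doc two."]]
print(position_baseline_summary(docs, budget_words=12))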
Transliteration based text input methods for telugu
SOWMYA V B,Vasudeva Varma Kalidindi
International Conference on the Computer Processing of Oriental Languages, ICCPOL, 2009
@inproceedings{bib_Tran_2009, AUTHOR = {SOWMYA V B, Vasudeva Varma Kalidindi}, TITLE = {Transliteration based text input methods for telugu}, BOOKTITLE = {International Conference on the Computer Processing of Oriental Languages}. YEAR = {2009}}
Telugu is the third most spoken language in India and one of the fifteen most spoken languages in the world. Yet there is no standardized, widely used input method for Telugu. Since the majority of users of Telugu typing tools are familiar with English, we propose a transliteration-based text input method in which users type Telugu using the Roman script. We show that a simple edit-distance based approach can give a light-weight, efficient text input system. We tested the approach with three datasets: general data, countries and places, and person names. The approach worked considerably well on all the datasets and holds promise as an efficient text input method.
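The core of such an input method can be sketched as follows: the typed Roman-script keystrokes are compared, by Levenshtein edit distance, against romanized forms in a lexicon, and the closest Telugu words are suggested. The three-entry lexicon is purely illustrative.

# Sketch of edit-distance ranking for a transliteration-based input method.
lexicon = {"telugu": "తెలుగు", "telangana": "తెలంగాణ", "tenali": "తెనాలి"}

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def suggest(typed, k=3):
    ranked = sorted(lexicon, key=lambda w: edit_distance(typed, w))
    return [(w, lexicon[w]) for w in ranked[:k]]

print(suggest("telgu"))   # closest romanisation first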
A light-weight summarizer based on language model with relative entropy
CHANDAN KUMAR,PRASAD RAO P V V,Vasudeva Varma Kalidindi
ACM Symposium on Applied Computing, SAC, 2009
@inproceedings{bib_A_li_2009, AUTHOR = {CHANDAN KUMAR, PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {A light-weight summarizer based on language model with relative entropy}, BOOKTITLE = {ACM Symposium on Applied Computing}. YEAR = {2009}}
A new method for sentence extraction based on a language model with relative entropy is presented in this paper. The proposed technique first builds a sentence language model and a document-cluster language model for each sentence and the document set respectively. Sentences are then ranked according to the relative entropy of the estimated document language model with respect to the estimated sentence language model. The overall results on the DUC and MSE corpora demonstrate that the proposed approach outperforms some of the best reported results for generic multi-document summarization.
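A compact Python sketch of the ranking step: unigram language models are estimated for the document cluster and for each sentence (add-one smoothing is our assumption), and sentences are ordered by the KL divergence of the document model from the sentence model, smallest first.

# Sketch of relative-entropy (KL) sentence ranking.
import math
from collections import Counter

def unigram_lm(tokens, vocab):
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)                 # add-one smoothing
    return {w: (counts[w] + 1) / total for w in vocab}

def kl(p, q):
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def rank_sentences(sentences):
    vocab = {w for s in sentences for w in s.lower().split()}
    doc_lm = unigram_lm([w for s in sentences for w in s.lower().split()],
                        vocab)
    scored = [(kl(doc_lm, unigram_lm(s.lower().split(), vocab)), s)
              for s in sentences]
    return [s for _, s in sorted(scored)]            # lowest divergence first

sents = ["the cat sat on the mat",
         "the cat chased the mouse",
         "stocks fell sharply on friday"]
print(rank_sentences(sents)[0])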
Estimating risk of picking a sentence for document summarization
CHANDAN KUMAR,PRASAD RAO P V V,Vasudeva Varma Kalidindi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2009
@inproceedings{bib_Esti_2009, AUTHOR = {CHANDAN KUMAR, PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {Estimating risk of picking a sentence for document summarization}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2009}}
Automatic document summarization is an increasingly important task for overcoming information overload. The primary task of the document summarization process is to pick a subset of sentences as a representative of the whole document set. We treat this as a decision-making problem and estimate the risk involved in each decision: we calculate the risk of information loss associated with each sentence and extract sentences in ascending order of risk. The experimental results show that the proposed approach performs better than various state-of-the-art approaches.
An effective learning environment for teaching problem solving in software architecture
KIRTI GARG,Vasudeva Varma Kalidindi
India Software Engineering Conference, ISECo, 2009
@inproceedings{bib_An_e_2009, AUTHOR = {KIRTI GARG, Vasudeva Varma Kalidindi}, TITLE = {An effective learning environment for teaching problem solving in software architecture}, BOOKTITLE = {India Software Engineering Conference}. YEAR = {2009}}
A software architect engages in solving Software Engineering (SE) problems throughout his or her career. Inculcating problem-solving skills should therefore be one of the learning objectives of SE academic and training programs, yet structured problem solving is usually latent or missing in most current curricula. In this paper, we describe an effective learning environment for SE education and training with problem solving as an integral part.
Case studies as assessment tools in software engineering classrooms
KIRTI GARG,Vasudeva Varma Kalidindi
Conference on Software Engineering Education and Training, CSEE&T, 2009
@inproceedings{bib_Case_2009, AUTHOR = {KIRTI GARG, Vasudeva Varma Kalidindi}, TITLE = {Case studies as assessment tools in software engineering classrooms}, BOOKTITLE = {Conference on Software Engineering Education and Training}. YEAR = {2009}}
Software engineering (SE) courses aim to make students well-versed in solving authentic and complex problems by applying varied SE knowledge and skills along with problem solving, critical thinking, use of tools, communication skills, etc. They thus have multiple, complex and partly higher-order cognitive learning goals. Traditional assessment tools like multiple-choice questions, subjective questions, etc., may not always be an optimal choice for such an SE course and may not satisfy learning-sciences principles and guidelines for assessment design. Case studies in SE have long been used for research and investigation, and recently in teaching, but not as assessment tools. In this paper, we propose and examine the use of SE case studies as assessment tools. We also describe an approach for objective and easy evaluation of solutions. This work argues that carefully designed case studies can be effective as well as efficient assessment tools for SE education: not only closely aligned with learning goals, but also in accordance with the learning sciences.
A Naïve Approach for Monolingual Question Answering.
ROHIT BHARADWAJ G,SURYA GANESH VEERAVALLI,Vasudeva Varma Kalidindi
International Conference of the CLEF Association, CLEFS, 2009
@inproceedings{bib_A_Na_2009, AUTHOR = {ROHIT BHARADWAJ G, SURYA GANESH VEERAVALLI, Vasudeva Varma Kalidindi}, TITLE = {A Naïve Approach for Monolingual Question Answering.}, BOOKTITLE = {International Conference of the CLEF Association}. YEAR = {2009}}
This paper describes the system we submitted for the ResPubliQA task. We participated in building the QA system for the en-en subtask, following a different method for each question type. In this paper we outline the methods we adopted and the results we obtained.
Building a semantic virtual museum: from Wiki to semantic Wiki using named entity recognition
Alain Plantec ,Vincent Ribaud,Vasudeva Varma Kalidindi
Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA, 2009
@inproceedings{bib_Buil_2009, AUTHOR = {Alain Plantec , Vincent Ribaud, Vasudeva Varma Kalidindi}, TITLE = {Building a semantic virtual museum: from Wiki to semantic Wiki using named entity recognition}, BOOKTITLE = {Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications}. YEAR = {2009}}
In this paper, we describe an approach for creating semantic wiki pages from regular wiki pages, in the domain of scientific museums, using information extraction methods in general and named entity recognition in particular. We make use of a domain-specific ontology called CIDOC-CRM as a base structure for representing and processing knowledge. We describe the major components of the proposed approach and a three-step process involving named entity recognition, identifying domain classes using the ontology, and establishing the properties of the entities in order to generate semantic wiki pages. Our initial evaluation of the prototype shows promising results in terms of enhanced efficiency and time and cost benefits.
IIIT Hyderabad at TAC 2009.
Vasudeva Varma Kalidindi,VIJAYBHARATH REDDY YARAM,Sudheer Kovelamudi,B PRAVEEN KUMAR,G SANTOSH SAI KRISHNA,N KIRAN KUMAR,KRANTHI REDDY B,KARUNA KUMAR YATAM,NITHIN KUMAR M
Text Analysis Conference Workshop, TAC, 2009
@inproceedings{bib_IIIT_2009, AUTHOR = {Vasudeva Varma Kalidindi, VIJAYBHARATH REDDY YARAM, Sudheer Kovelamudi, B PRAVEEN KUMAR, G SANTOSH SAI KRISHNA, N KIRAN KUMAR, KRANTHI REDDY B, KARUNA KUMAR YATAM, NITHIN KUMAR M}, TITLE = {IIIT Hyderabad at TAC 2009.}, BOOKTITLE = {Text Analysis Conference Workshop}. YEAR = {2009}}
In this paper, we report our participation in Update Summarization, Knowledge Base Population and Recognizing Textual Entailment at TAC 2009. This year, we enhanced our basic summarization system with support vector regression to better estimate the combined effect of different features in ranking, and devised a novelty measure to effectively capture the relevance and novelty of a term. For Knowledge Base Population, we analyzed IR approaches and Naive Bayes classification with phrase and token searches. Finally, for RTE, we built templates using WordNet and predicted entailments.
Modeling novelty and feature combination using support vector regression for update summarization
LAKSHMI V S N PRAVEEN B,VIJAYBHARATH REDDY YARAM,Vasudeva Varma Kalidindi
International Conference on Natural Language Processing., ICON, 2009
@inproceedings{bib_Mode_2009, AUTHOR = {LAKSHMI V S N PRAVEEN B, VIJAYBHARATH REDDY YARAM, Vasudeva Varma Kalidindi}, TITLE = {Modeling novelty and feature combination using support vector regression for update summarization}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2009}}
Summarization is the process of condensing a piece of text while retaining important information; a well-composed, coherent summary is a solution to information overload. A sentence-extractive summarization system requires different features to rank sentences and generate summaries. In this paper we provide a detailed analysis of the effect of various features in the context of update summarization. We adapt a machine learning algorithm to combine features when scoring a sentence. Further, we propose a new feature that effectively captures the novelty as well as the relevancy of a sentence within a topic. Evaluation results show that our summarizer surpasses the top-performing systems that participated at the Text Analysis Conference 2008. The gap between oracle summaries and state-of-the-art summaries is analyzed to depict the scope for improvement in sentence-extractive summarization.
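The feature-combination step can be pictured with scikit-learn's SVR, shown below as an assumed stand-in for the paper's regressor: sentence feature vectors are regressed against ROUGE-style targets from past data, and the fitted model scores unseen sentences. The feature names and target values are invented for illustration.

# Sketch of combining sentence features with support vector regression.
import numpy as np
from sklearn.svm import SVR

# One row per training sentence -> [query_sim, novelty, 1/position]
X_train = np.array([[0.8, 0.7, 1.0],
                    [0.6, 0.2, 0.5],
                    [0.1, 0.9, 0.33],
                    [0.4, 0.5, 0.25]])
y_train = np.array([0.9, 0.4, 0.3, 0.35])   # e.g. ROUGE overlap with models

model = SVR(kernel="rbf", C=1.0).fit(X_train, y_train)

new_sentences = np.array([[0.7, 0.8, 1.0], [0.2, 0.1, 0.2]])
print(model.predict(new_sentences))          # higher score = rank earlier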
A Graph Clustering Approach to Product Attribute Extraction.
SANTOSH RAJU VYSYARAJU,PRANEETH MEDHATITHI SHISHTLA,Vasudeva Varma Kalidindi
Indian International Conference on Artificial Intelligence, IICAI, 2009
@inproceedings{bib_A_Gr_2009, AUTHOR = {SANTOSH RAJU VYSYARAJU, PRANEETH MEDHATITHI SHISHTLA, Vasudeva Varma Kalidindi}, TITLE = {A Graph Clustering Approach to Product Attribute Extraction.}, BOOKTITLE = {Indian International Conference on Artificial Intelligence}. YEAR = {2009}}
This work focuses on attribute extraction from product descriptions. We propose a novel solution to extract the attributes of a product from a set of text documents. A graph is constructed from the text using word co-occurrence statistics; we compute word clusters and extract attributes from these clusters using graph-based methods. Our solution achieves nearly 80% precision and 45% recall. Experiments show that the methods employed are effective in identifying attributes across different dataset sizes.
Classification based approach for Summarizing Opinions in Blog Posts.
SARVABHOTLA KIRAN,KRANTHI REDDY B,Vasudeva Varma Kalidindi
Indian International Conference on Artificial Intelligence, IICAI, 2009
@inproceedings{bib_Clas_2009, AUTHOR = {SARVABHOTLA KIRAN, KRANTHI REDDY B, Vasudeva Varma Kalidindi}, TITLE = {Classification based approach for Summarizing Opinions in Blog Posts.}, BOOKTITLE = {Indian International Conference on Artificial Intelligence}. YEAR = {2009}}
With the growth of the web, people use it as a medium for expressing their opinions and thoughts through blog posts, reviews (in the form of ratings), and forums. The blogosphere is a place where people read and write their views and comment on others' views, thereby exchanging information. It is very difficult for any business, organization or individual to go through and understand the thoughts expressed by others on a product or topic of interest. Hence a summarization system that extracts, analyzes and summarizes opinions is useful; our summarization system does exactly this for blog posts. Summary generation proceeds in three stages: extracting the sentences that carry opinions (opinion mining), analyzing the extracted opinions to determine polarity (opinion analysis), and finally ranking the opinion sentences (opinion summarization). We present a classification-based approach for extracting and analyzing opinions and show how we use this approach to rank sentences when generating the summary.
Passage retrieval using answer type profiles in question answering
SURYA GANESH VEERAVALLI,Vasudeva Varma Kalidindi
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2009
@inproceedings{bib_Pass_2009, AUTHOR = {SURYA GANESH VEERAVALLI, Vasudeva Varma Kalidindi}, TITLE = {Passage retrieval using answer type profiles in question answering}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}. YEAR = {2009}}
Retrieving answer-containing passages is a challenging task in Question Answering. In this paper, we describe a novel passage retrieval methodology using answer type profiles. Our methodology has two steps: estimation and ranking. In the estimation step, answer type profiles are constructed from a parallel corpus of question-answer sentence pairs using a statistical alignment model; each answer type profile consists of triples of the query word, the answering-sentence word, and the probability of translation. In the ranking step, the answer type profiles are incorporated into the language modeling framework known as statistical machine translation models for information retrieval, and a set of relevant passages is retrieved for a given question. We conducted experiments on factoid questions from the TREC 2002 to 2006 QA tracks. The results showed significant improvements over different retrieval models including TFIDF, Okapi BM25, Indri and KL-divergence.
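The ranking step follows the translation language-modeling idea; the Python sketch below scores a passage by summing, over question terms, the log of the translation probabilities accumulated over passage words, mixed with exact self-matches and a background probability. The tiny profile and mixing weights are illustrative assumptions, not the paper's estimates.

# Sketch of translation-model passage scoring for QA retrieval.
import math

# Hypothetical answer-type profile: t(question_word | answer_word).
t_prob = {("born", "1879"): 0.3, ("born", "birthplace"): 0.4,
          ("where", "in"): 0.2}

def translation_score(question, passage, alpha=0.5, bg=1e-4):
    p_words = passage.lower().split()
    score = 0.0
    for q in question.lower().split():
        trans = sum(t_prob.get((q, w), 0.0) for w in p_words) / len(p_words)
        self_match = p_words.count(q) / len(p_words)
        score += math.log(alpha * (trans + self_match) + (1 - alpha) * bg)
    return score

passage = "Einstein was born in 1879 in Ulm"
print(translation_score("where was einstein born", passage))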
Software engineering education in India: Issues and challenges
KIRTI GARG,Vasudeva Varma Kalidindi
Conference on Software Engineering Education and Training, CSEE&T, 2008
@inproceedings{bib_Soft_2008, AUTHOR = {KIRTI GARG, Vasudeva Varma Kalidindi}, TITLE = {Software engineering education in India: Issues and challenges}, BOOKTITLE = {Conference on Software Engineering Education and Training}. YEAR = {2008}}
The Indian software industry has set huge growth targets for the future. These targets will be heavily affected by the software engineering (SE) education scenario in the country. The purpose of this paper is to provide a holistic understanding of SE education issues and challenges specific to the Indian context, from both industry and academic perspectives. This study is based on (a) our interaction with industry through SE education related projects, surveys and discussions, and (b) our observations as an integral part of the Indian SE educators' community. There is an urgent need to address these deep-rooted issues, as the lack of proper SE education may be the single largest factor negatively affecting the industry. Understanding these issues will help identify the action items that can initiate software engineering educational reforms in the country. We also discuss the essential and minimal set of SE knowledge, skills and dispositions that the industry expects from engineers willing to join it. Though the Indian software industry is growing at phenomenal rates, there are few studies on the issues associated with SE education in the Indian context. While the discussion is limited to India, we believe it represents the existing conditions in many developing countries where the IT and ITES (IT Enabled Services) industries are becoming important.
People issues relating to software engineering education and training in India
KIRTI GARG,Vasudeva Varma Kalidindi
India Software Engineering Conference, ISECo, 2008
@inproceedings{bib_Peop_2008, AUTHOR = {KIRTI GARG, Vasudeva Varma Kalidindi}, TITLE = {People issues relating to software engineering education and training in India}, BOOKTITLE = {India Software Engineering Conference}. YEAR = {2008}}
Software Engineering (SE) and Information Technology (IT) jobs have been among the most sought-after career options for Indian youth in recent times. The Indian software industry is expected to grow at a very healthy rate, and every major software company has ambitious plans and growth targets. However, the lack of proper SE education may have severe consequences and may negatively affect these growth targets. In this paper we discuss the challenges and issues related to software engineering education and training in Indian academia and industry, based on our interaction with industry and our experience as software engineering educators. These challenges arise from deep-rooted issues in SE educational goals, pedagogy and instruction, as well as in infrastructure. We discuss their long-term effects on various aspects of software development and put forth suggestions that may handle these challenges to an extent. We also discuss the essential and minimal set of software engineering knowledge, skills and dispositions that the industry requires from young engineers. This paper provides course design guidelines for academia and the learning centers of industry by focusing on important SE education issues, their causes and possible solutions. This, in turn, should help make SE education more effective and in line with the requirements of the Indian software industry.
Enabling reuse of citizen centric government processes through service oriented architecture
BHUDEB CHAKRAVARTHI,Vasudeva Varma Kalidindi
India Software Engineering Conference, ISECo, 2008
@inproceedings{bib_Enab_2008, AUTHOR = {BHUDEB CHAKRAVARTHI, Vasudeva Varma Kalidindi}, TITLE = {Enabling reuse of citizen centric government processes through service oriented architecture}, BOOKTITLE = {India Software Engineering Conference}. YEAR = {2008}}
Information technology (IT) is widely used to automate traditional government processes and to provide easy access to information. Such systems are developed to increase the efficiency of individual government departments and represent their needs. As a result, governments have accumulated a huge inventory of IT applications implementing disparate processes on different technologies and platforms. It has thus become very important to use IT as an "integrator" that allows the disparate processes running in different departments to interconnect. This helps citizens access different government services through a common single-point window and helps the government transform from a government-centric organization to a citizen-centric one.
Statistical transliteration for cross language information retrieval using HMM alignment model and CRF
PRASAD RAO P V V,SURYA GANESH VEERAVALLI,YELLA SREE HARSHA,Vasudeva Varma Kalidindi
Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societ, CLIAWS, 2008
@inproceedings{bib_Stat_2008, AUTHOR = {PRASAD RAO P V V, SURYA GANESH VEERAVALLI, YELLA SREE HARSHA, Vasudeva Varma Kalidindi}, TITLE = {Statistical transliteration for cross language information retrieval using HMM alignment model and CRF}, BOOKTITLE = {Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societ}. YEAR = {2008}}
In this paper we present a statistical transliteration technique that is language independent. The technique uses Hidden Markov Model (HMM) alignment and Conditional Random Fields (CRFs), a discriminative model. HMM alignment maximizes the probability of the observed (source, target) word pairs using the expectation maximization algorithm, after which the character-level (n-gram) alignments are set to the maximum posterior predictions of the model. The CRF, which is conditioned on both source and target languages, has efficient training and decoding processes and produces globally optimal solutions. We apply this technique to the Hindi-English transliteration task. The results show that our technique performs better than the existing transliteration system that uses HMM alignment and conditional probabilities derived from counting the alignments.
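Assuming the alignment step has already emitted character-level (source, target n-gram) pairs, the CRF stage might look like the sklearn-crfsuite sketch below, where each source character carries simple contextual features and its label is the aligned target n-gram. The toolkit choice, the feature set and the two training pairs are assumptions for illustration, not the paper's setup.

# Minimal sketch of the CRF stage of HMM-alignment + CRF transliteration.
import sklearn_crfsuite

def char_features(word, i):
    return {
        "char": word[i],
        "prev": word[i - 1] if i > 0 else "<s>",
        "next": word[i + 1] if i < len(word) - 1 else "</s>",
        "pos_begin": i == 0,
        "pos_end": i == len(word) - 1,
    }

# (source word, aligned target labels) pairs as the alignment step would emit.
pairs = [("dilli", ["d", "i", "ll", "ll", "i"]),
         ("mumbai", ["m", "u", "m", "b", "a", "i"])]

X = [[char_features(w, i) for i in range(len(w))] for w, _ in pairs]
y = [labels for _, labels in pairs]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
test = "dili"
print(crf.predict([[char_features(test, i) for i in range(len(test))]]))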
Hindi and Telugu to English CLIR using Query Expansion
PRASAD RAO P V V,Vasudeva Varma Kalidindi
Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societ, CLIAWS, 2008
@inproceedings{bib_Hind_2008, AUTHOR = {PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {Hindi and Telugu to English CLIR using Query Expansion}, BOOKTITLE = {Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societ}. YEAR = {2008}}
This paper presents the experiments of the Language Technologies Research Centre (LTRC) as part of its participation in the CLEF 2007 Indian language to English ad-hoc cross-language document retrieval task. We discuss our Hindi and Telugu to English CLIR system and the experiments conducted with the CLEF 2007 dataset. We used a variant of the TFIDF algorithm in combination with a bilingual lexicon for query translation. We also explored the role of a document summary in fielded queries and two different Boolean formulations of query translations. We find that a hybrid Boolean formulation combining the AND and OR operators improves document ranking, and that a simple disjunctive combination of translated query keywords yields maximum recall.
Statistical machine translation models for personalized search
Rohini U,Vamshi Ambat,Vasudeva Varma Kalidindi
International Joint Conference on Natural Language Processing, IJCNLP, 2008
@inproceedings{bib_Stat_2008, AUTHOR = {Rohini U, Vamshi Ambat, Vasudeva Varma Kalidindi}, TITLE = {Statistical machine translation models for personalized search}, BOOKTITLE = {International Joint Conference on Natural Language Processing}. YEAR = {2008}}
Web search personalization has been well studied in recent years, and relevance feedback has been used in various ways to improve the relevance of search results. In this paper, we propose a novel use of relevance feedback to effectively model the process of query formulation and to better characterize how a user relates a query to the document he or she intends to retrieve, using a noisy channel model. We model a user profile as the probabilities of translation from query to document in this noisy channel, learned from the relevance feedback obtained from the user. The learned user profile is applied in a re-ranking phase to rescore the search results retrieved by an underlying search engine. We evaluate our approach through experiments using relevance feedback data collected from users of a popular search engine. The results show improvement over the baseline, demonstrating that our approach can be applied to the personalization of web search. The experiments also yielded a valuable observation: learning user profiles from the snippets surrounding the results for a query performs better than learning from the entire document collection.
IIIT Hyderabad’s CLIR experiments for FIRE-2008
S Sethuramalingam,Vasudeva Varma Kalidindi
Forum for Information Retrieval Evaluation, FIRE, 2008
@inproceedings{bib_IIIT_2008, AUTHOR = {S Sethuramalingam, Vasudeva Varma Kalidindi}, TITLE = {IIIT Hyderabad’s CLIR experiments for FIRE-2008}, BOOKTITLE = {Forum for Information Retrieval Evaluation}. YEAR = {2008}}
This paper describes our CLIR experiments for the FIRE workshop. We submitted runs for ad-hoc monolingual document retrieval in Hindi and English, and for ad-hoc cross-lingual document retrieval from Hindi to English and from English to Hindi. We describe our English to Hindi and Hindi to English CLIR systems and the experiments conducted on them using the FIRE 2008 dataset. We used a dictionary-based approach for query translation and transliterated the named entities in the queries using a mapping-based Compressed Word Format (CWF) algorithm [1]. Disjunctive query formulation with different scoring weights, based on the part of the topic from which the terms originated, gave us better overall performance.
Experiments in telugu ner: A conditional random field approach
PRANEETH MEDHATITHI SHISHTLA,G. KARTHIK,PRASAD RAO P V V,Vasudeva Varma Kalidindi
International Joint Conference on Natural Language Processing Workshop, IJCNLP-W, 2008
@inproceedings{bib_Expe_2008, AUTHOR = {PRANEETH MEDHATITHI SHISHTLA, G. KARTHIK, PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {Experiments in telugu ner: A conditional random field approach}, BOOKTITLE = {International Joint Conference on Natural Language Processing Workshop}. YEAR = {2008}}
Named Entity Recognition (NER) is the task of identifying and classifying tokens in a text document into a predefined set of classes. In this paper we present our experiments with various feature combinations for Telugu NER. We observed that prefix and suffix information helps considerably in finding the class of a token. We also show the effect of training data size on the performance of the system. The best performing model gave an Fβ=1 measure of 44.91. The language-independent features alone gave an Fβ=1 measure of 44.89, which is close to the measure obtained even after including the language-dependent features.
Applying Lexical Semantics to Improve Text Classification
Diwakar Padmaraju,Vasudeva Varma Kalidindi
symposium on Indian Morphology, Phonology and Language Engineering, SIMPLE, 2008
@inproceedings{bib_Appl_2008, AUTHOR = {Diwakar Padmaraju, Vasudeva Varma Kalidindi}, TITLE = {Applying Lexical Semantics to Improve Text Classification}, BOOKTITLE = {symposium on Indian Morphology, Phonology and Language Engineering}. YEAR = {2008}}
Text classification techniques such as Bayesian classifiers have been shown to give results as good as, or better than, much more sophisticated approaches to induction. These algorithms depend on the conditional independence of attributes given the class and on term-based statistical representations of documents. However, one of their main shortcomings is that they largely disregard lexical semantics and, as a consequence, are not sufficiently robust to variations in word usage. In this paper we investigate the use of word sense disambiguation and other lexical-semantic techniques to achieve consistent improvements with Bayesian classifiers. Finally, we discuss the test cases experimented with Rainbow, a standard Bayesian classifier.
Issues, Challenges and Opportunities for Research in Software Engineering
MANISH KUMAR ANAND,Vasudeva Varma Kalidindi
International Conference on Software Engineering and Applications, ICSEA, 2008
@inproceedings{bib_Issu_2008, AUTHOR = {MANISH KUMAR ANAND, Vasudeva Varma Kalidindi}, TITLE = {Issues, Challenges and Opportunities for Research in Software Engineering}, BOOKTITLE = {International Conference on Software Engineering and Applications}. YEAR = {2008}}
The impact and importance of software have grown enormously, and yet a new generation of software developers must meet many of the same challenges that faced earlier generations. This paper aims to identify some of the most fundamental issues, challenges and opportunities for research in software engineering. It starts by examining the past, current and future states of software engineering. It then examines the critical technical issues, including the complexity, structure and evolution of software systems, the economics of software engineering, and the measurement of software engineering products and processes, as well as the critical people and organizational issues, including learning, motivation and performance improvement.
IIIT-Sum at DUC 2007
PRASAD RAO P V V,K RAHUL,Vasudeva Varma Kalidindi
Document Understanding Conference, DUC, 2008
@inproceedings{bib_IIIT_2008, AUTHOR = {PRASAD RAO P V V, K RAHUL, Vasudeva Varma Kalidindi}, TITLE = {IIIT-Sum at DUC 2007}, BOOKTITLE = {Document Understanding Conference}. YEAR = {2008}}
In this paper we report our performance in the DUC 2007 summarization tasks. We participated both in the query-focused multi-document summarization main task and in the pilot update summary generation task. This year we used a term clustering approach to better estimate a sentence prior. We used only the sentence prior, which is query independent, in the update summarization task and found that its performance is comparable to the top performing systems. In the main task our system ranked first in ROUGE-2, ROUGE-SU4 and ROUGE-BE scores as well as in pyramid scores.
Approximate String Matching Techniques for Effective CLIR Among
RANBEER MAKIN,NIKITA PANDEY,R V V PRASAD RAO,Vasudeva Varma Kalidindi
International Workshop on Fuzzy Logic and Application, WILF, 2008
@inproceedings{bib_Appr_2008, AUTHOR = {RANBEER MAKIN, NIKITA PANDEY, R V V PRASAD RAO, Vasudeva Varma Kalidindi}, TITLE = {Approximate String Matching Techniques for Effective CLIR Among}, BOOKTITLE = {International Workshop on Fuzzy Logic and Application}. YEAR = {2008}}
Commonly used vocabulary in Indian language documents found on the web contains a number of words of Sanskrit, Persian or English origin. However, such words may be written in different scripts with slight variations in spelling and morphology. In this paper we explore approximate string matching techniques to exploit the relatively large number of cognates among Indian languages, which is higher than between an Indian language and a non-Indian language. We present an approach to identify cognates and make use of them to improve dictionary-based CLIR when the query and the documents belong to two different Indian languages. We conduct experiments using a Hindi document collection and a set of Telugu queries, and report the improvement due to cognate recognition and translation.
A Character n-gram Based Approach for Improved Recall in Indian
PRANEETH MEDHATITHI SHISHTLA,PRASAD RAO P V V,Vasudeva Varma Kalidindi
International Joint Conference on Natural Language Processing Workshop, IJCNLP-W, 2008
@inproceedings{bib_A_Ch_2008, AUTHOR = {PRANEETH MEDHATITHI SHISHTLA, PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {A Character n-gram Based Approach for Improved Recall in Indian}, BOOKTITLE = {International Joint Conference on Natural Language Processing Workshop}. YEAR = {2008}}
Named Entity Recognition (NER) is the task of identifying and classifying all proper nouns in a document as person names, organization names, location names, date & time expressions, and miscellaneous. Previous work (Cucerzan and Yarowsky, 1999) used complete words as features, which suffers from a low-recall problem. A character n-gram based approach (Klein et al., 2003) using generative models was experimented with on English and proved useful over word-based models. Applying the same technique to Indian languages, we experimented with Conditional Random Fields (CRFs), a discriminative model, and evaluated our system on two Indian languages, Telugu and Hindi. The character n-gram based models showed considerable improvement over the word-based models. This paper describes the features used and the experiments to increase the recall of named entity recognition systems, in a language-independent manner.
A Dictionary Based Approach with Query Expansion to Cross Language Query Based Multi-Document Summarization: Experiments in Telugu-English
PRASAD RAO P V V,JAGADEESH JAGARLAMUDI,Vasudeva Varma Kalidindi
National Workshop on Artificial Intelligence, NWAI, 2008
@inproceedings{bib_A_Di_2008, AUTHOR = {PRASAD RAO P V V, JAGADEESH JAGARLAMUDI, Vasudeva Varma Kalidindi}, TITLE = {A Dictionary Based Approach with Query Expansion to Cross Language Query Based Multi-Document Summarization: Experiments in Telugu-English}, BOOKTITLE = {National Workshop on Artificial Intelligence}. YEAR = {2008}}
A cross-language query-based multi-document summarization task requires a system to produce a brief, coherent and readable summary serving the information need represented in the query. In this paper we present a cross-language query-based multi-document summarization system for the Telugu-English language pair: the topics are in Telugu, while the documents from which the summary is to be generated are in English. The system uses a Telugu to English bilingual lexicon to obtain translations for each word, along with transliteration techniques and target-language relevance-based language modeling using the Hyperspace Analogue to Language. The system was evaluated using DUC 2005 topics translated into Telugu and the English model summaries provided at DUC 2005; the results are discussed in the evaluation section.
An enterprise architecture framework for building service oriented e-governance portal
BHUDEB CHAKRAVARTHI,Vasudeva Varma Kalidindi
IEEE Region 10 Conference, TENCON, 2008
@inproceedings{bib_An_e_2008, AUTHOR = {BHUDEB CHAKRAVARTHI, Vasudeva Varma Kalidindi}, TITLE = {An enterprise architecture framework for building service oriented e-governance portal}, BOOKTITLE = {IEEE Region 10 Conference}. YEAR = {2008}}
In recent times, information technology (IT) has had a tremendous influence on how different Indian government departments operate. At one level, it has created a requirement for round-the-clock delivery of information and services to citizens. At a higher level, it enables citizens to conduct transactions through government portals. This has changed the purpose of government portals from primarily a place to advertise the success stories of government to an effective means of providing efficient services to citizens. As citizens' needs shift and IT capabilities evolve, government departments are under increasing pressure to become more innovative, both in terms of government processes and their delivery model. One important factor in providing efficient government services is the availability of all services through a single-point delivery platform: the departments should share an integrated platform with a secured access control system. It is thus important to design a coherent enterprise architecture framework for government departments. The aim of this paper is to outline an effective enterprise architecture framework and an innovative technological solution that can serve as a common platform for the provision of all government services to the citizens of India.
Generating personalized summaries using publicly available web documents
CHANDAN KUMAR,R V V PRASAD RAO,Vasudeva Varma Kalidindi
IEEE/WIC International Conference on Web Intelligence, WI, 2008
@inproceedings{bib_Gene_2008, AUTHOR = {CHANDAN KUMAR, R V V PRASAD RAO, Vasudeva Varma Kalidindi}, TITLE = {Generating personalized summaries using publicly available web documents}, BOOKTITLE = {IEEE/WIC International Conference on Web Intelligence}. YEAR = {2008}}
Many knowledge workers increasingly use online resources to find the latest developments in their specialty and articles of interest. Summarization is used to extract relevant information from such multiple online information sources. Current summarization systems produce a uniform summary for all users; however, generic summaries do not cater to a user's background and interests. In this paper we propose to make the summarization process user specific and present a design for generating personalized summaries of online articles tailored to each person's interests. The user's data available on the Web is used to model their background and interests. A controlled, user-centered qualitative evaluation carried out on news articles from the science and technology domain indicates better user satisfaction with personalized summaries compared to generic summaries.
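One simple way to realize such personalization is to blend a generic sentence score with similarity to a user-profile term vector. The sketch below is a minimal illustration under that assumption; the weighting scheme and all names are ours, not the paper's design.

```python
# Minimal sketch: blend a generic sentence score with cosine similarity
# to a user-profile term vector. alpha and all names are illustrative.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def personalized_score(sentence, generic_score, profile, alpha=0.6):
    sent_vec = Counter(sentence.lower().split())
    return alpha * generic_score + (1 - alpha) * cosine(sent_vec, profile)

profile = Counter("machine learning web search".split())  # toy user profile
print(personalized_score("Search engines use machine learning.", 0.5, profile))
```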
IIIT Hyderabad at DUC 2007
PRASAD RAO P V V,Vasudeva Varma Kalidindi
Document Understanding Conference, DUC, 2007
@inproceedings{bib_Iiit_2007, AUTHOR = {PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {Iiit hyderabad at duc 2007}, BOOKTITLE = {Document Understanding Conference}. YEAR = {2007}}
In this paper we report our performance at the DUC 2007 summarization tasks. We participated both in the query-focused multi-document summarization main task and in the pilot update summary generation task. This year we used a term clustering approach to better estimate a sentence prior. In the update summarization task we used only the sentence prior, which is query independent, and found that its performance is comparable with the top performing systems. In the main task our system ranked 1st in ROUGE-2, ROUGE-SU4 and ROUGE-BE scores as well as in pyramid scores.
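One plausible reading of a cluster-based, query-independent sentence prior is sketched below; the toy clusters and the coverage-based prior are illustrative assumptions, not the system's actual estimator.

```python
# Minimal sketch of a query-independent sentence prior from term clusters:
# sentences touching more clusters receive a higher prior. Illustrative only.

# toy clusters: each maps a cluster id to its member terms
CLUSTERS = {0: {"flood", "rain", "river"}, 1: {"rescue", "relief"}}

def sentence_prior(sentence, clusters=CLUSTERS):
    tokens = set(sentence.lower().split())
    covered = sum(1 for members in clusters.values() if tokens & members)
    return covered / len(clusters)  # fraction of clusters touched

print(sentence_prior("Heavy rain caused the river to flood."))  # 0.5
```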
Evaluation of Oromo-English Cross-Language Information Retrieval
Kula Kekeba Tune,Vasudeva Varma Kalidindi,PRASAD RAO P V V
International Joint Conference on Artificial Intelligence, IJCAI, 2007
@inproceedings{bib_Eval_2007, AUTHOR = {Kula Kekeba Tune, Vasudeva Varma Kalidindi, PRASAD RAO P V V}, TITLE = {Evaluation of Oromo-English Cross-Language Information Retrieval}, BOOKTITLE = {International Joint Conference on Artificial Intelligence}. YEAR = {2007}}
This paper reports on the first Oromo-English CLIR system, which is based on dictionary-based query translation techniques. The basic objective of the study is to design and develop an Oromo-English CLIR system with a view to enabling Afaan Oromo speakers to access and retrieve, using queries in their own (native) language, the vast online information sources that are available in English. We describe the major approaches and procedures used in designing and developing the Oromo-English CLIR system, together with the information retrieval evaluation experiments recently conducted at the CLEF 2006 ad hoc track. The purpose of these initial evaluation experiments was to assess the overall performance of the Oromo-English CLIR system using different fields of the Afaan Oromo topics. We thus submitted to CLEF 2006 three official runs (experiments) that differed in terms of the topic fields utilized: a title run (OMT), a title and description run (OMTD), and a title, description and narration run (OMTDN). Since appropriate online language processing resources and information retrieval tools are not available for Afaan Oromo, only limited linguistic resources, such as an Oromo-English dictionary and an Afaan Oromo stemmer designed and developed at our research center, were used in conducting the experiments. Yet we found the performance of our evaluation experiments at CLEF 2006 very encouraging, which is good news for the development and application of CLIR systems in other indigenous African languages.
Experiments in cross language query focused multi-document summarization
PRASAD RAO P V V,JAGADEESH JAGARLAMUDI,Vasudeva Varma Kalidindi
International Joint Conference on Artificial Intelligence, IJCAI, 2007
@inproceedings{bib_Expe_2007, AUTHOR = {PRASAD RAO P V V, JAGADEESH JAGARLAMUDI, Vasudeva Varma Kalidindi}, TITLE = {Experiments in cross language query focused multi-document summarization}, BOOKTITLE = {International Joint Conference on Artificial Intelligence}. YEAR = {2007}}
The twin challenges of massive information overload via the web and ubiquitous computing present us with an unavoidable task: developing techniques to handle multilingual information robustly and efficiently, with as high quality performance as possible. Previous research on multilingual information access systems has studied cross-language information retrieval (CLIR), information extraction and factoid-based question answering in detail. It is believed by the research community, and was previously acknowledged in a US-NSF report [Hovy et al., 1999], that cross-language query-focused summarization could play a vital role in multilingual information access, as a bridge between CLIR and machine translation. Surprisingly, no detailed studies exist yet on the effects of cross-linguality on query-focused summarization. In this paper we study the effects of adding a cross-language dimension to query-focused multi-document summarization for the Telugu-English language pair. We use a cross-lingual relevance-based language modeling approach to generate an extraction-based summary. We evaluate the system on the DUC 2005 dataset using ROUGE metrics and compare with a mono-lingual baseline that uses relevance-based language modeling in a mono-lingual setting.
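The relevance-based language modeling step can be sketched roughly as follows: estimate a relevance model from documents weighted by their query likelihood, then score sentences by that model. The smoothing and estimator below are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of relevance-model sentence scoring for extractive
# summarization; add-one smoothing and all names are illustrative.
import math
from collections import Counter

def relevance_model(query_tokens, documents):
    """Estimate P(w|R) from documents weighted by query likelihood."""
    model = Counter()
    for doc in documents:
        toks = doc.lower().split()
        tf = Counter(toks)
        # crude query likelihood of the document (add-one smoothing)
        ql = math.prod((tf[q] + 1) / (len(toks) + 1) for q in query_tokens)
        for w, c in tf.items():
            model[w] += ql * c / len(toks)
    total = sum(model.values())
    return {w: v / total for w, v in model.items()}

def score_sentence(sentence, rel_model, eps=1e-9):
    return sum(math.log(rel_model.get(w, eps)) for w in sentence.lower().split())

docs = ["floods hit the coastal villages", "relief teams reached the villages"]
rm = relevance_model(["floods", "relief"], docs)
print(score_sentence("relief reached the villages", rm))
```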
Experiments in cross-lingual ir among indian languages
RANBEER MAKIN,NIKITA PANDEY,PRASAD RAO P V V,Vasudeva Varma Kalidindi
International Workshop on Cross Language Information Processing, CLIP, 2007
@inproceedings{bib_Expe_2007, AUTHOR = {RANBEER MAKIN, NIKITA PANDEY, PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {Experiments in cross-lingual ir among indian languages}, BOOKTITLE = {International Workshop on Cross Language Information Processing}. YEAR = {2007}}
Commonly used vocabulary in Indian language documents found on the web contains a number of words of Sanskrit, Persian or English origin. However, such words may be written in different scripts with slight variations in spelling and morphology. In this paper we explore fuzzy string matching techniques to exploit the relatively large number of cognates among Indian languages, which is higher than that between an Indian language and a non-Indian language. We present an approach to identify cognates and make use of them for improving dictionary-based CLIR when the query and the documents belong to two different Indian languages. We conduct experiments using a Hindi document collection and a set of Telugu queries, and report the improvement due to cognate recognition and translation.
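A common realization of fuzzy cognate matching is to romanize both words and accept pairs whose normalized edit distance falls below a threshold. The sketch below illustrates that idea; the romanized forms and the 0.3 threshold are assumptions, not the paper's tuned values.

```python
# Minimal sketch of cognate identification via normalized edit distance.
def edit_distance(a, b):
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # min over deletion, insertion, and (mis)match via the diagonal
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def is_cognate(w1, w2, threshold=0.3):
    dist = edit_distance(w1, w2)
    return dist / max(len(w1), len(w2)) <= threshold

# romanized Hindi/Telugu forms of the same Sanskrit-origin word
print(is_cognate("pustak", "pustakam"))  # True
```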
Use of ontologies for organizational knowledge management and knowledge management systems
Vasudeva Varma Kalidindi
Ontologies, 2007
@inproceedings{bib_Use__2007, AUTHOR = {Vasudeva Varma Kalidindi}, TITLE = {Use of ontologies for organizational knowledge management and knowledge management systems}, BOOKTITLE = {Ontologies}. YEAR = {2007}}
This chapter describes the role of ontologies and corporate taxonomies in managing the content and knowledge within organizations. Managing content in a reusable and effective manner is becoming increasingly important in knowledge-centric organizations as the amount of content generated, both text-based and rich media, is growing exponentially. Search, categorization and document characterization, content staging and content delivery are the key technology challenges in knowledge management systems. This chapter describes how corporate taxonomies and ontologies can help in making sense of the huge amount of content that gets generated across locations in different languages and formats. Different information silos can be connected, and workflow and collaboration can be achieved, using ontologies. As KM solutions move from a centralized approach to a distributed one, a framework where multiple taxonomies and ontologies can co-exist with uniform interfaces is needed.
Multi-lingual Indexing Support for CLIR using Language Modeling.
PRASAD RAO P V V,Vasudeva Varma Kalidindi
IEEE Data Engineering Bulletin, DEB, 2007
@inproceedings{bib_Mult_2007, AUTHOR = {PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {Multi-lingual Indexing Support for CLIR using Language Modeling.}, BOOKTITLE = {IEEE Data Engineering Bulletin}. YEAR = {2007}}
An indexing model is the heart of an Information Retrieval (IR) system. Data structures such as term-based inverted indices have proved to be very effective for IR using vector space retrieval models. However, when functional aspects of such models were tested, it was soon felt that better relevance models were required to more accurately compute the relevance of a document towards a query. It was shown that language modeling approaches [1] in monolingual IR tasks improve the quality of search results in comparison with the TFIDF [2] algorithm. The disadvantage of language modeling approaches when used in a monolingual IR task, as suggested in [1], is that they require both the inverted index (term-to-document) and the forward index (document-to-term) to compute the rank of a document for a given query. This entails additional space and computation overhead compared to inverted index models. Such a cost may be acceptable if the quality of search results is significantly improved. In a cross-lingual IR (CLIR) task, we have previously shown in [3] that using a bilingual dictionary along with term co-occurrence statistics and a language modeling approach helps improve functional IR performance. However, no studies exist on the performance overhead in a CLIR task due to language modeling. In this paper we present an augmented index model which can be used for fast retrieval while retaining the benefits of language modeling in a CLIR task. The model is capable of retrieval and ranking with or without query expansion techniques using term collocation statistics of the indexed corpus. Finally, we conduct performance experiments on our indexing model to determine the cost overheads in space and time.
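The inverted-plus-forward index pairing the abstract refers to can be illustrated with a tiny data structure that supports language-model scoring. The Dirichlet smoothing, the class and all names below are illustrative assumptions, not the paper's augmented index.

```python
# Minimal sketch: one pass builds both a term-to-document (inverted) and a
# document-to-term (forward) index, so LM scoring can read per-document
# term statistics. Collection stats are recomputed per call for brevity.
from collections import defaultdict, Counter

class AugmentedIndex:
    def __init__(self):
        self.inverted = defaultdict(set)   # term -> {doc_id}
        self.forward = {}                  # doc_id -> Counter(term -> tf)

    def add(self, doc_id, text):
        tf = Counter(text.lower().split())
        self.forward[doc_id] = tf
        for term in tf:
            self.inverted[term].add(doc_id)

    def lm_score(self, doc_id, query_terms, mu=100):
        """Dirichlet-smoothed query likelihood using both index halves."""
        tf = self.forward[doc_id]
        dlen = sum(tf.values())
        coll = sum(sum(d.values()) for d in self.forward.values())
        score = 1.0
        for q in query_terms:
            cf = sum(d[q] for d in self.forward.values())
            score *= (tf[q] + mu * cf / coll) / (dlen + mu)
        return score

idx = AugmentedIndex()
idx.add("d1", "language modeling for retrieval")
idx.add("d2", "inverted index structures")
print(idx.lm_score("d1", ["language", "retrieval"]))
```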
A novel approach for re-ranking of search results using collaborative filtering
U ROHINI,Vasudeva Varma Kalidindi
International Conference on Computer Theory and Applications, ICCTA, 2007
@inproceedings{bib_A_no_2007, AUTHOR = {U ROHINI, Vasudeva Varma Kalidindi}, TITLE = {A novel approach for re-ranking of search results using collaborative filtering}, BOOKTITLE = {International Conference on Computer Theory and Applications}. YEAR = {2007}}
Search engines today often return a large volume of results with possibly only a few relevant ones. The notion of relevance is subjective and depends on the user and the context. Re-ranking of results to reflect those most relevant to the user, using relevance feedback, has received wide attention in information retrieval in recent years. Likewise, sharing of information among users having similar interests using collaborative filtering techniques has achieved wide success in recommendation systems. In this paper, we propose a novel approach for re-ranking of search results using collaborative filtering techniques, based on the relevance feedback of a given user as well as that of other users. Our approach is to learn user profiles using machine learning techniques, making use of past browsing histories, including the queries posed and the documents found relevant or irrelevant. Re-ranking of the results is done using collaborative filtering techniques. First, the context of the query is inferred from the query category. The user's community is determined dynamically in the context of the query by using the user profiles. The rank of a document is then calculated using the user's profile as well as the profiles of the other users in the community.
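A minimal sketch of this kind of collaborative re-ranking follows: a document's score blends the target user's profile overlap with that of a dynamically chosen community of similar users. The Jaccard similarity, the 0.2 cut-off and the beta weight are illustrative assumptions, not the paper's learned model.

```python
# Minimal sketch of collaborative re-ranking with set-based profiles.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def rerank(results, user, others, beta=0.5):
    community = [u for u in others if jaccard(u, user) > 0.2]  # similar users
    def score(doc):
        terms = set(doc.lower().split())
        own = jaccard(terms, user)
        peers = (sum(jaccard(terms, u) for u in community) / len(community)
                 if community else 0.0)
        return beta * own + (1 - beta) * peers
    return sorted(results, key=score, reverse=True)

user = {"python", "search", "ranking"}
others = [{"search", "ranking", "web"}]
print(rerank(["ranking search results", "cooking recipes"], user, others))
```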
Capturing sentence prior for query-based multi-document summarization
Jagadeesh J,PRASAD RAO P V V,Vasudeva Varma Kalidindi
International Conference on Computer-Assisted Information Retrieval, RIAO, 2007
@inproceedings{bib_Capt_2007, AUTHOR = {Jagadeesh J, PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {Capturing sentence prior for query-based multi-document summarization}, BOOKTITLE = {International Conference on Computer-Assisted Information Retrieval}. YEAR = {2007}}
In this paper, we have considered a real-world information synthesis task: generating a fixed-length multi-document summary which satisfies a specific information need. This task was mapped to topic-oriented, informative multi-document summarization. We also tried to estimate, given the human-written reference summaries and the document set, the maximum performance (ROUGE scores) that can be achieved by an extraction-based summarization technique. Motivated by the observation that current approaches are far behind this estimated maximum, we looked at Information Retrieval techniques to improve the relevance scoring of sentences towards the information need. Following an information-theoretic approach, we identified a measure to capture the notion of importance, or prior, of a sentence. Following a different decomposition of the Probability Ranking Principle, the calculated importance/prior is incorporated into the final sentence score by weighted linear combination. To evaluate the performance, we explored information sources like the WWW and an encyclopedia for computing the information measure in a set of different experiments. The t-test analysis of the improvement on the DUC 2005 data set is found to be significant (p ∼ 0.05). The same system outperformed the rest of the systems at the DUC 2006 challenge in terms of ROUGE scores, with a significant margin over the next best system.
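The weighted linear combination described above can be illustrated concretely. In the sketch below, the prior is the average corpus-level self-information of a sentence's terms; this particular information measure and the lambda weight are illustrative assumptions, not the paper's estimator.

```python
# Minimal sketch: final score = lambda * query relevance
#                              + (1 - lambda) * query-independent prior.
import math
from collections import Counter

CORPUS_TF = Counter({"the": 1000, "flood": 20, "relief": 10})  # toy stats
TOTAL = sum(CORPUS_TF.values())

def information_prior(sentence):
    toks = sentence.lower().split()
    # average self-information (rarer terms contribute more), add-one smoothed
    return sum(-math.log((CORPUS_TF[t] + 1) / (TOTAL + 1)) for t in toks) / len(toks)

def final_score(relevance, sentence, lam=0.7):
    return lam * relevance + (1 - lam) * information_prior(sentence)

print(final_score(0.8, "relief after the flood"))
```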
A study of the effectiveness of case study approach in software engineering education
KIRTI GARG,Vasudeva Varma Kalidindi
Conference on Software Engineering Education and Training, CSEE&T, 2007
@inproceedings{bib_A_st_2007, AUTHOR = {KIRTI GARG, Vasudeva Varma Kalidindi}, TITLE = {A study of the effectiveness of case study approach in software engineering education}, BOOKTITLE = {Conference on Software Engineering Education and Training}. YEAR = {2007}}
Software engineering (SE) educators have long been advocating the use of non-conventional approaches for SE education. In this context, we conducted action research to compare the effectiveness of a case study approach with the conventional lecture-based approach. We designed and taught a first course in SE that involved the case study approach as well as the traditional lecture-based approach. We recorded and analyzed students' perception of learning using well-defined parameters that reconciled with the cognitive, skill and metacognitive goals of SE education. Results corroborated that the case study approach is more effective and interesting for learning SE than the lecture-based approach. These results indicate that academia and industry should further explore the learning-by-doing paradigm, especially case studies. Benefits of the approach include bridging of the industry-academia gap and the creation of professionals who are well versed in theory and practice and have experienced the intricacies of real software development even before entering the industry. This paper provides empirical data to support that the case study approach is very effective in SE education and hence useful for curriculum designers. This work is useful for SE educators and researchers as it describes a methodology for rigorous, scientific educational research.
IIIT Hyderabad at CLEF 2007-Adhoc Indian Language CLIR Task.
PRASAD RAO P V V,Vasudeva Varma Kalidindi
International Conference of the CLEF Association, CLEFS, 2007
@inproceedings{bib_IIIT_2007, AUTHOR = {PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {IIIT Hyderabad at CLEF 2007-Adhoc Indian Language CLIR Task.}, BOOKTITLE = {International Conference of the CLEF Association}. YEAR = {2007}}
This paper presents the experiments of the Language Technologies Research Centre (LTRC) as part of its participation in the CLEF 2007 Indian language to English ad-hoc cross-language document retrieval task. In this paper we discuss our Hindi and Telugu to English CLIR system and the experiments using the CLEF 2007 dataset. We used a variant of the TFIDF algorithm in combination with a bilingual lexicon for query translation. We also explored the role of a document summary in fielded queries and two different boolean formulations of query translations. We find that a hybrid boolean formulation using a combination of boolean AND and boolean OR operators improves the ranking of documents. We also find that simple disjunctive combination of translated query
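The hybrid boolean formulation mentioned above is straightforward to illustrate: translations of the same source word are OR-ed together, and the per-word groups are AND-ed. The Lucene-style query syntax below is an illustrative assumption.

```python
# Minimal sketch of a hybrid boolean query formulation for CLIR.
def hybrid_boolean(translations):
    """translations: list of per-source-word English alternatives."""
    groups = ["(" + " OR ".join(alts) + ")" for alts in translations if alts]
    return " AND ".join(groups)

# two Hindi query words, each with multiple dictionary translations
print(hybrid_boolean([["flood", "deluge"], ["relief", "aid"]]))
# -> (flood OR deluge) AND (relief OR aid)
```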
WebKhoj: Indian language IR from multiple character encodings
PRASAD RAO P V V,JAGADEESH JAGARLAMUDI,Vasudeva Varma Kalidindi
International Conference on World wide web, WWW, 2006
@inproceedings{bib_WebK_2006, AUTHOR = {PRASAD RAO P V V, JAGADEESH JAGARLAMUDI, Vasudeva Varma Kalidindi}, TITLE = {WebKhoj: Indian language IR from multiple character encodings}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2006}}
Today web search engines provide the easiest way to reach information on the web. Yet more than 95% of Indian language content on the web is not searchable due to the multiple encodings of web pages. Most of these encodings are proprietary, and hence some kind of standardization is needed to make the content accessible via a search engine. In this paper we present a search engine called WebKhoj which is capable of searching multi-script and multi-encoded Indian language content on the web. We describe a language-focused crawler and the transcoding processes involved in achieving accessibility of Indian language content. Finally, we report some of the experiments that were conducted, along with results, on Indian language web content.
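The transcoding step can be pictured as a byte-to-codepoint mapping from a proprietary font encoding to Unicode. The tiny mapping below is a fabricated placeholder; real tables cover a full font and handle multi-byte glyph sequences.

```python
# Minimal sketch of transcoding a proprietary font encoding to Unicode.
PROPRIETARY_TO_UNICODE = {
    0xC1: "\u0C05",  # hypothetical font byte -> Telugu letter A
    0xC2: "\u0C06",  # hypothetical font byte -> Telugu letter AA
}

def transcode(raw: bytes) -> str:
    out = []
    for b in raw:
        out.append(PROPRIETARY_TO_UNICODE.get(b, chr(b)))  # pass ASCII through
    return "".join(out)

print(transcode(bytes([0xC1, 0xC2])))
```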
Query independent sentence scoring approach to duc 2006
JAGADEESH JAGARLAMUDI,PRASAD RAO P V V,Vasudeva Varma Kalidindi
Document Understanding Conference, DUC, 2006
@inproceedings{bib_Quer_2006, AUTHOR = {JAGADEESH JAGARLAMUDI, PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {Query independent sentence scoring approach to duc 2006}, BOOKTITLE = {Document Understanding Conference}. YEAR = {2006}}
The task in the Document Understanding Conference (DUC) 2006 is to generate a fixed-length, user-oriented, multi-document summary, and it remains the same as that of DUC 2005. We used two features to score the sentences, and sentences are picked to form the summary based on the calculated score. The first feature is a query-dependent score of a sentence, which is an improvement over the HAL feature. The second feature is based on the observation that sentence importance, which is independent of the query, needs to be captured in current approaches. We explored the use of the web for scoring the sentences in a query-independent manner. Experiments show a performance gain of 6-7% over the HAL feature from the inclusion of the two new features. Our summarization system was ranked 1st in all automatic evaluations with a significant margin over the second best system, 5th in responsiveness and 9th in linguistic quality evaluations at DUC 2006. The relatively lower performance in linguistic quality can be attributed to sentences being stripped off at the end of the summary when the summary length exceeds 250 words.
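The HAL (Hyperspace Analogue to Language) feature referred to here builds a distance-weighted co-occurrence space over a sliding window. The sketch below shows the standard construction; the window size and weighting are common HAL choices presented as assumptions, not the paper's exact settings.

```python
# Minimal sketch of a HAL co-occurrence space: co-occurrences within a
# window are weighted by inverse distance.
from collections import defaultdict

def build_hal(tokens, window=4):
    hal = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                hal[w][tokens[i + d]] += (window - d + 1) / window
    return hal

hal = build_hal("the river flooded the low villages".split())
print(dict(hal["river"]))  # distance-weighted right-context counts
```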
Hindi and Telugu to English Cross Language Information Retrieval at CLEF 2006.
PRASAD RAO P V V,Vasudeva Varma Kalidindi
International Conference of the CLEF Association, CLEFS, 2006
@inproceedings{bib_Hind_2006, AUTHOR = {PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {Hindi and Telugu to English Cross Language Information Retrieval at CLEF 2006.}, BOOKTITLE = {International Conference of the CLEF Association}. YEAR = {2006}}
This paper presents the experiments of the Language Technologies Research Centre (LTRC) as part of its participation in the CLEF 2006 ad-hoc document retrieval task. This is our first participation in the CLEF evaluation tasks, and we focused on Afaan Oromo, Hindi and Telugu as query languages for retrieval from an English document collection. In this paper we discuss our Hindi and Telugu to English CLIR system and the experiments at CLEF.
Security: Bridging the academia-industry gap using a case study
KIRTI GARG,Vasudeva Varma Kalidindi,Giridhar K N,Abhishek K. Mishra
Asia-Pacific Software Engineering Conference, APSEC, 2006
@inproceedings{bib_Secu_2006, AUTHOR = {KIRTI GARG, Vasudeva Varma Kalidindi, Giridhar K N, Abhishek K. Mishra }, TITLE = {Security: Bridging the academia-industry gap using a case study}, BOOKTITLE = {Asia-Pacific Software Engineering Conference}. YEAR = {2006}}
Security is one of the major concerns of modern software development, but there is a wide gap between industry practice and academic instruction. Security issues are usually not addressed in academic setups, and those who attempt to make them part of their software engineering curriculum quickly realize that it is difficult to make learning happen in a conventional lecture-based approach. We explored a non-conventional, case-study-centered approach and found it to be instrumental in bridging this gap between the industry and the academia. We give a detailed methodology for using the case study, along with our experience, observations and results of the approach. The whole case exercise aims at creating software professionals who realize the importance of security, are well versed with the related concepts, skills and dispositions, and can thus handle security-related challenges in a structured manner.
A relevance-based language modeling approach to duc 2005
JAGADEESH JAGARLAMUDI,PRASAD RAO P V V,Vasudeva Varma Kalidindi
Document Understanding Conference, DUC, 2005
@inproceedings{bib_A_re_2005, AUTHOR = {JAGADEESH JAGARLAMUDI, PRASAD RAO P V V, Vasudeva Varma Kalidindi}, TITLE = {A relevance-based language modeling approach to duc 2005}, BOOKTITLE = {Document Understanding Conference}. YEAR = {2005}}
The task in the Document Understanding Conference (DUC) 2005 is to generate a fixed-length, user-oriented, multi-document summary. Our approach to this task is primarily motivated by the observation that metrics based on key-concept overlap give better results compared to metrics based on n-gram and sentence overlap. In this paper, we present a sentence extraction based summarization system which scores sentences using Relevance Based Language Modeling, Latent Semantic Indexing (LSI) and the number of special words. From these scored sentences, the system generates a summary of the required granularity. Our summarization system was ranked 3rd, 4th, 8th and 17th in the ROUGE-SU4, ROUGE-2, responsiveness and linguistic quality evaluations respectively. In post-DUC analysis we found that LSI has a negative effect on the system's performance: performance gained 5.4% when the system was implemented using only language modeling and the number of special words.
Exploring Creative Concepts in the Nearest Neighborhood using Lexical Ontologies.
PRASAD RAO P V V,JAGADEESH JAGARLAMUDI,Vasudeva Varma Kalidindi,Bipin Indurkhya
Indian International Conference on Artificial Intelligence, IICAI, 2005
@inproceedings{bib_Expl_2005, AUTHOR = {PRASAD RAO P V V, JAGADEESH JAGARLAMUDI, Vasudeva Varma Kalidindi, Bipin Indurkhya}, TITLE = {Exploring Creative Concepts in the Nearest Neighborhood using Lexical Ontologies.}, BOOKTITLE = {Indian International Conference on Artificial Intelligence}. YEAR = {2005}}
Conceptual blending is an important area of research in creativity modeling. In this paper we present a creativity model that takes an existing blend and generates new blends using nearest-neighborhood replacements from a lexical ontology. For example, “Artificial Intelligence” is a compounded concept comprising “Artificiality” and “Intelligence” as two sub-concepts; after concept generation, the system comes up with “Artificial Creativity” as one of the generated concepts. This work is part of a larger ongoing project called CAPRICON, whose goal is to provide stimuli for new futuristic concepts. The resulting tool is a creativity enhancement tool that helps users enhance their creative thinking by providing stimuli for new concepts.
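The nearest-neighborhood replacement idea can be illustrated with WordNet via NLTK: replace one sub-concept of a blend with sister terms from the ontology. The neighborhood definition below (sister terms of the first synset) is an illustrative assumption, not the paper's method; it requires the WordNet corpus (nltk.download('wordnet')).

```python
# Minimal sketch of generating new blends by swapping one sub-concept for
# its WordNet neighbours (sister terms of the first synset).
from nltk.corpus import wordnet as wn

def neighbours(word, limit=5):
    syns = wn.synsets(word)
    if not syns:
        return []
    sisters = [lemma.name() for h in syns[0].hypernyms()
               for s in h.hyponyms() for lemma in s.lemmas()]
    return [w for w in sisters if w != word][:limit]

def blend(modifier, head):
    return [f"{modifier} {n.replace('_', ' ')}" for n in neighbours(head)]

print(blend("artificial", "intelligence"))  # new "artificial X" blends
```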
Case studies: the potential teaching instruments for software engineering education
Vasudeva Varma Kalidindi,KIRTI GARG
International Conference on Quality Software, QSIC, 2005
@inproceedings{bib_Case_2005, AUTHOR = {Vasudeva Varma Kalidindi, KIRTI GARG}, TITLE = {Case studies: the potential teaching instruments for software engineering education}, BOOKTITLE = {International Conference on Quality Software}. YEAR = {2005}}
Current approaches to software engineering education fall short of fulfilling the industry's demand for quality software engineering. There is a growing need to create and imbibe more effective learning environments in order to meet this demand. This paper discusses the learning disabilities of both conventional and non-conventional approaches for teaching software engineering. We propose that case studies can be used as effective teaching instruments and that a case-study-centric learning environment can address these learning disabilities. A case study approach can help students gain and retain realistic exposure to the concepts of software engineering as they are applied in the real world, so that the students of today can be groomed into excellent professionals who have experienced the intricacies and complexities of the real world and have tried their hand at managing those complexities.
A case study initiative for software engineering education
Vasudeva Varma Kalidindi,KIRTI GARG,Vamsi
International Conference on Software Engineering Research and Practice, SERP, 2004
@inproceedings{bib_A_ca_2004, AUTHOR = {Vasudeva Varma Kalidindi, KIRTI GARG, Vamsi}, TITLE = {A case study initiative for software engineering education}, BOOKTITLE = {International Conference on Software Engineering Research and Practice}. YEAR = {2004}}
There has been an increasing need to find effective ways to impart software engineering education. This paper discusses the problems that different approaches, both conventional and non-conventional, face, and proposes a case study approach that addresses these issues. This research aims to create a teaching methodology that will assist teachers in creating an effective learning environment and help students gain realistic exposure to the concepts of Software Engineering as they are applied in the real world. The focus of the paper is on creating a context for learning Software Engineering through case studies that imbibe best practices from real-world experiences.
Building large scale ontology networks
Vasudeva Varma Kalidindi
Language Engineering Conference, LEC, 2002
@inproceedings{bib_Buil_2002, AUTHOR = {Vasudeva Varma Kalidindi}, TITLE = {Building large scale ontology networks}, BOOKTITLE = {Language Engineering Conference}. YEAR = {2002}}
Adoptable, high-performance, large-scale ontologies that can be extended to support multimedia play a crucial role in building effective content and knowledge management systems and applications. In the context of developing a unified taxonomy and ontology network (UTON), we have undertaken the task of developing a technology framework for building large-scale ontologies. This paper describes the architecture of UTON, shows how general-purpose resources such as WordNet and the Open Directory Project can be used to create large-scale ontology networks, and explains how application-specific taxonomies or ontologies can be derived from these general-purpose tools and resources.