Abstract
The advent of social media has greatly increased the number of opinions and arguments voiced
on the internet. Social media platforms account for a significant part of an individual’s social interaction.
These interactions generate many opinions on issues where there is significant division, and the
resulting debates often manifest cases of aggression.
Various online platforms, such as forums and blogs, let users post comments and reply to other
users’ comments. Some of these comments are aggressive, hateful, or offensive, while others are friendly.
As the population on social media has grown, interactions over the web have increased and become more aggressive, and related activities such as cyberbullying, trolling, and hate speech have also increased
manifold across the globe. Aggressive online behaviour has thus become a significant source
of social conflict, potentially resulting in activity of a criminal nature.
A fundamental challenge in identifying aggression on social media is therefore to distinguish it from offensive or vitriolic language. For the task of Aggression Detection, we used a Hindi-English code-mixed
dataset provided for the shared task in the 1st Workshop on Trolling, Aggression and Cyberbullying
(TRAC-1). Keeping these ideas in mind, we developed a system to discriminate between Overtly Aggressive, Covertly Aggressive and Non-aggressive content in texts.
While research has mostly focused on analyzing aggression, stance, and other dimensions of
speech in isolation from each other, this work also attempts to gain an extensive and fine-grained understanding of aggression and figurative language use when voicing an opinion. However, this task
is daunting, since natural language is fraught with ambiguities and language on social media is noisy. Specialized techniques are therefore required to handle the unstructured
and dynamic nature of these data streams; such techniques can then be used in various contexts to analyze and gain insights from social
behaviours.
Since users on these social media platforms tend to write informally and in real time, it is
relatively natural for them to mix languages, as doing so eases communication. This could be attributed to these
users being informal, multilingual, or non-native speakers. However, it adds another
layer of complexity on top of the dynamic nature of social media data. This thesis explores and develops
techniques that can help us gain in-depth insights from such data.
We also present an English-Hindi code-mixed dataset of opinions on a politico-social issue. We
annotate it across multiple dimensions: aggression, hate speech, emotion arousal, and figurative language usage (such as sarcasm/irony, metaphors/similes, puns/word-play) across varied modalities. Like
the one presented, such in-depth datasets are required to analyze the less apparent forms of verbal
aggression displayed on social media and the social dynamics of opinion. The thesis also aims
to better understand the linguistic patterns used when voicing an opinion and showing aggression. Furthermore,
such datasets facilitate classification models that leverage corpora annotated for auxiliary tasks
through transfer learning, joint modelling, and semi-supervised label propagation methods.