Abstract
Exploring and quantifying semantic relatedness is central to representing language and
has significant implications for a range of
NLP tasks. While earlier NLP research primarily focused on semantic similarity, often only for English, we
instead investigate the broader phenomenon
of semantic relatedness. In this paper, we
present SemRel, a new collection of semantic relatedness
datasets annotated by native speakers
across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic,
Modern Standard Arabic, Spanish, and Telugu.
These languages originate from five distinct
language families and are predominantly spoken in Africa and Asia, regions with
relatively few NLP resources. Each instance in the SemRel datasets
is a sentence pair associated with a score that
represents the degree of semantic textual relatedness between the two sentences. The scores
are obtained using a comparative annotation
framework. We describe the data collection and
annotation processes, the challenges encountered when building the datasets, baseline experiments, and the
datasets' impact and utility in NLP.