Abstract
This thesis explores the intersection of applied natural language processing (NLP) and cognitive
neuroscience to address two critical challenges: enhancing knowledge accessibility for low-resource
Indian languages and investigating whether computational language models emulate human brain activity during naturalistic language comprehension. Together, these efforts aim to bridge the gap between
practical NLP applications and our understanding of how the human brain processes language, with
potential implications for designing more inclusive and cognitively aligned AI systems.
The first part of this work focuses on addressing the digital divide faced by low-resource Indian
languages, which are underrepresented in global knowledge repositories like Wikipedia. To tackle this
challenge, we developed a scalable, template-based approach to automatically generate high-quality
Wikipedia articles for Telugu, one of India’s widely spoken yet digitally underserved languages. Using
structured data sources from platforms such as IMDb, USDA, JSTOR, and IUCN, we extracted and
enriched information about diverse domains, including movies, plants, animals, and volcanoes. Translation and transliteration, using the Bing Translate API and DeepTranslit, converted attribute values into Telugu while preserving linguistic accuracy, and manual corrections resolved remaining nuances and ambiguities. The result is over 25,000 machine-generated articles that meet Wikipedia's quality standards for word count, formatting, and references (a minimal sketch of the template-filling step appears after this paragraph).
This effort not only expanded the digital footprint of Telugu Wikipedia but also demonstrated the feasibility of automating knowledge creation for low-resource languages. However, the success of this
template-based system raised intriguing questions: How do humans process and organize such structured information? Can large language models (LLMs) mimic the cognitive mechanisms underlying
naturalistic language comprehension?
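The core template-filling step can be illustrated with a minimal sketch. The record fields, the Telugu template sentence, and the transliteration helper below are hypothetical stand-ins chosen for this example; the actual pipeline draws attributes from the structured sources named above and uses the Bing Translate API and DeepTranslit, followed by manual correction.

# Minimal sketch of the template-filling idea (illustrative only).
# Field names, the template sentence, and transliterate_to_telugu are
# hypothetical stand-ins; the real pipeline uses the Bing Translate API
# and DeepTranslit and applies manual correction afterwards.

from dataclasses import dataclass

@dataclass
class MovieRecord:
    # Structured attributes as they might arrive from a source like IMDb.
    title: str
    director: str
    release_year: int
    genre: str

def transliterate_to_telugu(text: str) -> str:
    """Placeholder for a transliteration call (e.g. via DeepTranslit)."""
    return text  # the real pipeline would return Telugu script here

# Illustrative Telugu template:
# "{title} is a {genre} film released in {release_year}, directed by {director}."
ARTICLE_TEMPLATE = (
    "{title} అనేది {release_year}లో విడుదలైన {genre} చిత్రం. "
    "ఈ చిత్రానికి {director} దర్శకత్వం వహించారు."
)

def generate_article(record: MovieRecord) -> str:
    # Transliterate proper nouns, then slot every attribute into the template.
    return ARTICLE_TEMPLATE.format(
        title=transliterate_to_telugu(record.title),
        director=transliterate_to_telugu(record.director),
        release_year=record.release_year,
        genre=record.genre,
    )

if __name__ == "__main__":
    print(generate_article(MovieRecord("Mayabazar", "K. V. Reddy", 1957, "fantasy")))

Separating the structured record from the template in this way is what allows the same pipeline to scale across domains: only the templates and attribute extractors change, while translation, transliteration, and quality checks remain shared.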
Motivated by these questions, the second part of the thesis delves into the neural underpinnings
of language processing using magnetoencephalography (MEG), a non-invasive neuroimaging technique
with millisecond temporal resolution. We recorded MEG from 27 participants listening to naturalistic stories and investigated how text and speech representations from unimodal and multimodal
LLMs align with brain activity. Specifically, we compared embeddings from three multimodal models (CLAP, SpeechT5, Pengi) with those from four unimodal text models (BERT, LLaMA-2, XLNet,
FLAN-T5) and three unimodal speech models (Wav2Vec2.0, WavLM, Whisper). Ridge regression was
used to predict MEG responses from these embeddings, enabling us to evaluate their ability to capture
brain-relevant semantics. To isolate higher-level semantic processing from low-level acoustic features,
we employed a residual approach (sketched below), removing phonological and articulatory cues from multimodal embeddings. Our findings reveal several key insights. First, text embeddings, particularly from multimodal models, closely mirror brain responses during semantic processing, especially beyond 350 ms, indicating alignment with higher-order cognitive functions. Second, speech embeddings predominantly capture low-level acoustic features and fail to encode meaningful semantics after 350 ms. Third, we uncover an asymmetry in cross-modal knowledge transfer: while the text modality benefits significantly from speech-derived information, particularly around 200 ms (auditory processing), the reverse transfer is limited. These results deepen our understanding of how multimodal models integrate information across
modalities and highlight opportunities to refine their design for applications in low-resource contexts.
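A minimal sketch of the encoding analysis follows, using synthetic arrays in place of the real stimuli and recordings: X stands in for model embeddings, L for low-level phonological and articulatory features, and Y for MEG sensor responses. The use of scikit-learn's RidgeCV, the regularization grid, and the train/test split are illustrative assumptions, not the exact settings of the thesis experiments.

# Minimal sketch of ridge-regression brain encoding with a residual step,
# on synthetic data with assumed shapes (these are not the thesis settings).

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_dims, n_sensors, n_lowlevel = 2000, 768, 306, 40
X = rng.standard_normal((n_samples, n_dims))      # model embeddings
L = rng.standard_normal((n_samples, n_lowlevel))  # low-level acoustic features
Y = rng.standard_normal((n_samples, n_sensors))   # MEG sensor responses

# Residual approach: regress the embeddings on the low-level features and
# keep only the part of X that those features cannot explain.
low_level_model = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(L, X)
X_residual = X - low_level_model.predict(L)

# Encoding model: ridge regression from (residual) embeddings to MEG sensors,
# evaluated by per-sensor correlation on held-out data.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_residual, Y, test_size=0.2, random_state=0
)
encoder = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_tr, Y_tr)
Y_pred = encoder.predict(X_te)

corr_per_sensor = [
    np.corrcoef(Y_te[:, s], Y_pred[:, s])[0, 1] for s in range(n_sensors)
]
print(f"mean held-out correlation: {np.mean(corr_per_sensor):.3f}")

In the actual analyses, this procedure is repeated per time window relative to word onset, which is what allows the comparison of embedding families before and after the 200 ms and 350 ms landmarks discussed above.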
Together, these contributions bridge applied NLP and cognitive science, offering a novel perspective
on designing linguistically inclusive and cognitively aligned AI systems. By juxtaposing the scalability of template-based systems for low-resource languages with the temporal precision of MEG-based
brain encoding, this thesis advances inclusivity in digital knowledge and paves the way for future interdisciplinary innovation at the intersection of AI and neuroscience. The findings not only enrich our
understanding of how the human brain organizes and processes language but also provide actionable
insights for improving the alignment of NLP systems with human cognition. Ultimately, this research
underscores the importance of integrating biologically inspired mechanisms into computational models,
fostering advancements in both neurolinguistics and applied NLP.