Abstract
Telugu, a Dravidian language spoken by approximately 95.7 million people, faces significant barriers
in digital ecosystems due to its agglutinative morphology and limited computational and informational
resources. This thesis addresses these challenges through two transformative contributions: a context-aware morphological analyzer and the generation of 25,000 high-quality Telugu Wikipedia articles to enhance knowledge accessibility. To tackle the scarcity of annotated datasets, a novel 10,000-sentence Telugu corpus was developed, meticulously annotated with morphological features such as lexical category, gender, number, person, case, tense-aspect-mood (TAM), and clause markers. This dataset, capturing complex forms like āḍavāḷḷandarikī (‘to all women’) and ambiguities like pagalu (‘day’ vs. ‘to
break’), powers a Transformer-based analyzer. Fine-tuned on BERT-Te, a monolingual model pretrained
on 8,015,588 Telugu sentences, the analyzer achieves F1 scores of 0.778 for gender and 0.602 for lexical
category, outperforming multilingual models (mBERT, IndicBERT, XLM-R) by 20-25%. This enables
precise parsing for applications like machine translation and sentiment analysis, addressing a critical
gap in Telugu NLP infrastructure.
Concurrently, a scalable pipeline leveraging Jinja2 templates and structured data from IMDb, USDA,
IUCN, and the Smithsonian generated 25,000 Wikipedia articles across movies (8,929), plants (2,140), animals (4,928), and volcanoes (8,676). These articles, incorporating infoboxes, tables, and citations,
achieve a BERTScore of 0.79 for semantic similarity and a Fleiss’ Kappa of 0.82 for human-rated
quality, adhering to Wikipedia’s core principles (neutrality, verifiability, no original research, free content, civility). Expanding Telugu Wikipedia’s 95,599-article base by over 25%, they empower native speakers with accessible knowledge in entertainment, science, and geography. Both contributions are open-sourced at https://github.com/parameshkrishnaa/Telugu-Morph-Dataset and https://tewiki.iiit.ac.in, ensuring reproducibility and adaptability to other low-resource languages. Despite challenges such as sandhi splitting (8% manual corrections) and transliteration errors (e.g., “The” rendered as Te, 8% manual fixes), the
research establishes robust benchmarks—F1 scores, BERTScore, and Fleiss’ Kappa—for Telugu NLP.
By enhancing computational tools and public knowledge access, this work elevates Telugu’s digital
presence, fosters linguistic equity, and provides a scalable model for global low-resource language advancement. Future directions include expanding datasets, automating preprocessing, and deploying
articles on Wikipedia to further bridge the digital divide.