Abstract
Cinema, as a pervasive cultural medium, wields profound influence over societal perceptions, yet its
role in perpetuating objectification remains underexplored in non-Western contexts. This thesis bridges
this gap by investigating how cinematic techniques in Indian "item songs," a genre marked by provocative choreography and strategic camera framing, induce objectifying gaze behavior in viewers. Pairing this investigation with a computer vision model, we present a dual-methodological approach: an empirical
eye-tracking study and a novel multi-modal deep learning framework for understanding and detecting
visual objectification in videos.
In the experimental component, 91 participants viewed sexualized (SV) and non-sexualized (TV)
music videos while their gaze metrics—fixation duration, visit counts, and scanpaths—were recorded.
Results revealed that sexualized framing significantly redirected attention toward objectified body regions (torso, lower body), with gaze synchronization rates 6× higher in SV than in TV (p < 0.001). Dynamic segmentation and ScanGraph analyses demonstrated that camera techniques such as close-ups
and rapid editing overrode individual differences, homogenizing gaze patterns across viewers. These
findings empirically validate theoretical frameworks like objectification theory (Fredrickson & Roberts,
1997) and the male gaze (Mulvey, 1975), highlighting how exogenous cinematic cues drive a sexually
objectifying gaze in viewers.
Complementing this, our computational contribution introduces an interpretable multi-modal AI
framework. By fusing video (LLaVA-NeXT-Video-7B-hf), audio (Whisper-large-v2), and text (all-mpnet-base-v2) embeddings via contrastive learning, the model quantifies objectification intensity by
dynamically weighting multi-modal cues, achieving state-of-the-art objectification detection (F1: 0.783,
Acc: 0.826). A concept bottleneck mechanism further links predictions to human-interpretable cinematic elements (e.g., "male gaze framing," AUC: 0.803).
This work advances interdisciplinary research by quantifying the cognitive impact of cultural media
practices and providing scalable tools for detecting bias introduced by directorial choices. Its implications extend
to AI-driven content moderation, policy frameworks, ethical cinematography, and cross-cultural studies
of media effects, establishing a foundation for mitigating objectification in increasingly visual digital
ecosystems.