Abstract
Embodiments herein provide a system and a method for automatically generating at least one synthetic talking head video using a machine learning model. The method includes (i) extracting features from each frame of a video that is extracted from data sources, (ii) analyzing the video, using a face-detection model, to determine a driving face video if the number of identities and the number of faces of speakers are each equal to one in all frames of the video, (iii) generating, using a text-to-speech model, synthetic speech utterances by automatically selecting a vocabulary of words and sentences from the data sources, (iv) modifying lip movements that are originally present in the driving face video corresponding to the synthetic speech utterances, and (v) generating, using the machine learning model, the at least one synthetic talking head video based on the lip movements that are modified corresponding to the synthetic speech utterances.
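By way of illustration only, the selection condition of step (ii) may be sketched as follows. The sketch assumes a hypothetical per-frame record, `FrameDetections`, standing in for the output of the face-detection model, and a hypothetical helper, `is_driving_face_video`; neither name appears in the embodiments. It also assumes one reading of the condition, namely that every frame contains exactly one face and that the same single identity appears throughout the video.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FrameDetections:
    """Hypothetical per-frame output of a face-detection model (an assumption)."""
    face_count: int          # number of faces detected in the frame
    identity_ids: frozenset  # identity labels assigned to those faces


def is_driving_face_video(frames: Iterable[FrameDetections]) -> bool:
    """Return True only if each frame has one face and the video has one identity."""
    seen_identities = set()
    frame_total = 0
    for frame in frames:
        frame_total += 1
        # Reject the video as soon as any frame violates the one-face condition.
        if frame.face_count != 1 or len(frame.identity_ids) != 1:
            return False
        seen_identities |= frame.identity_ids
    # An empty video is not a usable driving face video.
    return frame_total > 0 and len(seen_identities) == 1
```

Under these assumptions, a video whose frames all carry one face with the same identity label qualifies as a driving face video, while a video containing a frame with two faces, or a change of identity between frames, does not.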