Real Time GAZED: Online Shot Selection and Editing of Virtual Cameras from Wide-Angle Monocular Video Recordings
@inproceedings{bib_Real_2024, AUTHOR = {ACHARY SUDHEER, Girmaji Rohit, Adhiraj Anil Deshmukh, Vineet Gandhi}, TITLE = {Real Time GAZED: Online Shot Selection and Editing of Virtual Cameras from Wide-Angle Monocular Video Recordings}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2024}}
Eliminating time-consuming post-production processes and delivering high-quality videos in today’s fast-paced digital landscape are the key advantages of real-time approaches. To address these needs, we present Real Time GAZED: a real-time adaptation of the GAZED framework integrated with CineFilter, a novel real-time camera trajectory stabilization approach. It enables users to create professionally edited videos in real-time. Comparative evaluations against baseline methods, including the non-real-time GAZED, demonstrate that Real Time GAZED achieves similar editing results, ensuring high-quality video output. Furthermore, a user study confirms the aesthetic quality of the video edits produced by the Real Time GAZED approach. These advancements in real-time camera trajectory optimization and video editing help meet the demand for immediate and dynamic content creation in industries such as live broadcasting, sports coverage, news reporting, and social media.
Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models
@inproceedings{bib_Towa_2024, AUTHOR = {Neilkumar Milankumar Shah, Shirish Karande, Vineet Gandhi}, TITLE = {Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2024}}
We propose a novel approach to significantly improve the intelligibility in the Non-Audible Murmur (NAM)-to-speech conversion task, leveraging self-supervision and sequence-to-sequence (Seq2Seq) learning techniques. Unlike conventional methods that explicitly record ground-truth speech, our methodology relies on self-supervision and speech-to-speech synthesis to simulate ground-truth speech. Despite utilizing simulated speech, our method surpasses the current state-of-the-art (SOTA) by 29.08% on the Mel-Cepstral Distortion (MCD) metric. Additionally, we present error rates and demonstrate our model’s proficiency in synthesizing speech in novel voices of interest. Moreover, we present a methodology for augmenting the existing CSTR NAM TIMIT Plus corpus, setting a benchmark with a Word Error Rate (WER) of 42.57% to gauge the intelligibility of the synthesized speech. Speech samples can be found at https://nam2speech.github.io/NAM2Speech/
StethoSpeech: Speech Generation Through a Clinical Stethoscope Attached to the Skin
Neilkumar Milankumar Shah,Neha S,Vishal Thambrahalli,Ramanathan Subramanian,Vineet Gandhi
@inproceedings{bib_Stet_2024, AUTHOR = {Neilkumar Milankumar Shah, Neha S, Vishal Thambrahalli, Ramanathan Subramanian, Vineet Gandhi}, TITLE = {StethoSpeech: Speech Generation Through a Clinical Stethoscope Attached to the Skin}, BOOKTITLE = {international joint conference on pervasive and ubiquitous computing}. YEAR = {2024}}
We introduce StethoSpeech, a silent speech interface that transforms flesh-conducted vibrations behind the ear into speech. This innovation is designed to improve social interactions for those with voice disorders, and furthermore enable discreet public communication. Unlike prior efforts, StethoSpeech does not require (a) paired-speech data for recorded vibrations and (b) a specialized device for recording vibrations, as it can work with an off-the-shelf clinical stethoscope. The novelty of our framework lies in the overall design, simulation of the ground-truth speech, and a sequence-to-sequence translation network, which works in the latent space. We present comprehensive experiments on the existing CSTR NAM TIMIT Plus corpus and
our proposed StethoText: a large-scale synchronized database of non-audible murmur and text for speech research. Our results show that StethoSpeech provides natural-sounding and intelligible speech, significantly outperforming existing methods on several quantitative and qualitative metrics. Additionally, we showcase its capacity to extend its application to speakers not encountered during training and its effectiveness in challenging, noisy environments. Speech samples are available at https://stethospeech.github.io/StethoSpeech/.
ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations
Neilkumar Milankumar Shah,K Saiteja,Vishal Thambrahalli,Neha S,ANIL KUMAR NELAKANTI,Vineet Gandhi
@inproceedings{bib_Parr_2024, AUTHOR = {Neilkumar Milankumar Shah, K Saiteja, Vishal Thambrahalli, Neha S, ANIL KUMAR NELAKANTI, Vineet Gandhi}, TITLE = {ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations}, BOOKTITLE = {Conference of the European Chapter of the Association for Computational Linguistics (EACL)}. YEAR = {2024}}
We present ParrotTTS, a modularized text-to-speech synthesis model leveraging disentangled self-supervised speech representations. It can train a multi-speaker variant effectively using transcripts from a single speaker. ParrotTTS adapts to a new language in a low-resource setup and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving the speaker-specific characteristics, e.g., synthesizing fluent Hindi speech using a French speaker’s voice and accent. We present extensive results in monolingual and multi-lingual scenarios. ParrotTTS outperforms state-of-the-art multi-lingual text-to-speech (TTS) models while using only a fraction of the paired data required by the latter. Speech samples from ParrotTTS and code can be found at https://parrot-tts.github.io/tts/
Major Entity Identification: A Generalizable Alternative to Coreference Resolution
@inproceedings{bib_Majo_2024, AUTHOR = {S Kawshik Manikantan, Shubham Toshniwal, Makarand Tapaswi, Vineet Gandhi}, TITLE = {Major Entity Identification: A Generalizable Alternative to Coreference Resolution}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}. YEAR = {2024}}
The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task’s broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative referential task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, MEI fits the classification framework, which enables the use of robust and intuitive classification-based metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.
Bringing Generalization to Deep Multi-View Pedestrian Detection
Vora Jeet Vipul,Swetanjal Murati Dutta,Kanishk Jain,Shyamgopal Karthik,Vineet Gandhi
@inproceedings{bib_Brin_2023, AUTHOR = {Vora Jeet Vipul, Swetanjal Murati Dutta, Kanishk Jain, Shyamgopal Karthik, Vineet Gandhi}, TITLE = {Bringing Generalization to Deep Multi-View Pedestrian Detection}, BOOKTITLE = {Winter Conference on Applications of Computer Vision Workshops}. YEAR = {2023}}
Multi-view Detection (MVD) is highly effective for occlusion reasoning in a crowded environment. While recent works using deep learning have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. The key novelty of our work is to formalize three critical forms of generalization and propose experiments to evaluate them: generalization with i) a varying number of cameras, ii) varying camera positions, and finally, iii) to new scenes. We find that existing state-of-the-art models show poor generalization by overfitting to a single scene and camera configuration. To address the concerns: (a) we propose a novel Generalized MVD (GMVD) dataset, assimilating diverse scenes with changing daytime, camera configurations, varying number of cameras, and (b) we discuss the properties essential to bring generalization to MVD and propose a barebones model to incorporate them. We perform a comprehensive set of experiments on the WildTrack, MultiViewX and the GMVD datasets to motivate the necessity to evaluate generalization abilities of MVD methods and to demonstrate the efficacy of the proposed approach. The code and the proposed dataset can be found at https://github.com/jeetv/GMVD
Ground then Navigate: Language-guided Navigation in Dynamic Scenes
@inproceedings{bib_Grou_2023, AUTHOR = {Kanishk Jain, Varun Chhangani, Amogh Tiwari, K Madhava Krishna, Vineet Gandhi}, TITLE = {Ground then Navigate: Language-guided Navigation in Dynamic Scenes}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2023}}
We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings. We solve the problem by explicitly grounding the navigable regions corresponding to the textual command. At each timestamp, the model predicts a segmentation mask corresponding to the intermediate or the final navigable region. Our work contrasts with existing efforts in VLN, which pose this task as a node selection problem, given a discrete connected graph corresponding to the environment. We do not assume the availability of such a discretised map. Our work moves towards continuity in action space, provides interpretability through visual feedback and allows VLN on commands requiring finer manoeuvres like "park between the two cars". Furthermore, we propose a novel meta-dataset CARLA-NAV to allow efficient training and validation. The dataset comprises pre-recorded training sequences and a live environment for validation and testing. We provide extensive qualitative and quantitative empirical results to validate the efficacy of the proposed approach.
Test-Time Amendment with a Coarse Classifier for Fine-Grained Classification
@inproceedings{bib_Test_2023, AUTHOR = {Kanishk Jain, Shyamgopal Karthik, Vineet Gandhi}, TITLE = {Test-Time Amendment with a Coarse Classifier for Fine-Grained Classification}, BOOKTITLE = {Neural Information Processing Systems}. YEAR = {2023}}
We investigate the problem of reducing mistake severity for fine-grained classification. Fine-grained classification can be challenging, mainly due to the requirement of knowledge or domain expertise for accurate annotation. However, humans are particularly adept at performing coarse classification as it requires relatively low levels of expertise. To this end, we present a novel approach for Post-Hoc Correction called Hierarchical Ensembles (HiE) that utilizes the label hierarchy to improve the performance of fine-grained classification at test time using the coarse-grained predictions. By only requiring the parents of leaf nodes, our method significantly reduces average mistake severity while improving top-1 accuracy on the iNaturalist-19 and tieredImageNet-H datasets, achieving a new state-of-the-art on both benchmarks. We also investigate the efficacy of our approach in the semi-supervised setting. Our approach brings notable gains in top-1 accuracy while significantly decreasing the severity of mistakes as training data decreases for the fine-grained classes. The simplicity and post-hoc nature of HiE render it practical to use with any off-the-shelf trained model to improve its predictions further.
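To make the test-time amendment concrete, the sketch below reweights each leaf (fine-grained) probability by the coarse classifier's belief in that leaf's parent and renormalizes. This is an illustrative combination rule under the stated assumption that only leaf-to-parent links are available; it is not claimed to be the exact HiE formulation, and the class counts and probabilities are toy values.

```python
import numpy as np

def amend_with_coarse(fine_probs, coarse_probs, parent):
    # Illustrative test-time amendment (not the exact HiE rule):
    # reweight fine-grained (leaf) probabilities by the coarse classifier's
    # belief in each leaf's parent, then renormalize.
    # parent[k] = index of the coarse class that leaf class k belongs to.
    weights = coarse_probs[:, parent]                 # (N, K_fine)
    amended = fine_probs * weights
    return amended / amended.sum(axis=1, keepdims=True)

# Toy setup: 4 leaf classes grouped under 2 coarse classes.
parent = np.array([0, 0, 1, 1])
fine = np.array([[0.4, 0.1, 0.3, 0.2]])
coarse = np.array([[0.2, 0.8]])
print(amend_with_coarse(fine, coarse, parent).argmax(axis=1))  # prediction flips to class 2
```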
Instance-Level Semantic Maps for Vision Language Navigation
Laksh Nanwani,Anmol Agarwal,Kanishk Jain,Raghav Prabhakar,Aaron Anthony Monis,Krishna Murthy,Abdul Hafez,Vineet Gandhi,K Madhava Krishna
@inproceedings{bib_Inst_2023, AUTHOR = {Laksh Nanwani, Anmol Agarwal, Kanishk Jain, Raghav Prabhakar, Aaron Anthony Monis, Krishna Murthy, Abdul Hafez, Vineet Gandhi, K Madhava Krishna}, TITLE = {Instance-Level Semantic Maps for Vision Language Navigation}, BOOKTITLE = {Technical Report}. YEAR = {2023}}
Humans have a natural ability to perform semantic associations with the surrounding objects in the environment. This allows them to create a mental map of the environment which helps them to navigate on-demand when given a linguistic instruction. A natural goal in Vision Language Navigation (VLN) research is to impart autonomous agents with similar capabilities. Recently introduced VL Maps [Huang et al., 2023] take a step towards this goal by creating a semantic spatial map representation of the environment without any labelled data. However, their representations are limited for practical applicability as they do not distinguish between different instances of the same object. In this work, we address this limitation by integrating instance-level information into spatial map representation using a community detection algorithm and by utilizing word ontology learned by large language models (LLMs) to perform open-set semantic associations in the mapping representation. The resulting map representation improves the navigation performance by two-fold (233%) on realistic language commands with instance-specific descriptions compared to VL Maps. We validate the practicality and effectiveness of our approach through extensive qualitative and quantitative experiments.
RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations
@inproceedings{bib_Robu_2023, AUTHOR = {Neha Sahipjohn, Neil Shah, Vishal Tambrahalli, Vineet Gandhi}, TITLE = {RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations}, BOOKTITLE = {Technical Report}. YEAR = {2023}}
Significant progress has been made in speaker-dependent Lip-to-Speech synthesis, which aims to generate speech from silent videos of talking faces. Current state-of-the-art approaches primarily employ non-autoregressive sequence-to-sequence architectures to directly predict mel-spectrograms or audio waveforms from lip representations. We hypothesize that the direct mel-prediction hampers training/model efficiency due to the entanglement of speech content with ambient information and speaker characteristics. To this end, we propose RobustL2S, a modularized framework for Lip-to-Speech synthesis. First, a non-autoregressive sequence-to-sequence model maps self-supervised visual features to a representation of disentangled speech content. A vocoder then converts the speech features into raw waveforms. Extensive evaluations confirm the effectiveness of our setup, achieving state-of-the-art performance on the unconstrained Lip2Wav dataset and the constrained GRID and TCD-TIMIT datasets. Speech samples from RobustL2S can be found at https://neha-sherin.github.io/RobustL2S/
Assessing active speaker detection algorithms through the lens of automated editing
Girmaji Rohit,ACHARY SUDHEER,Adhiraj Anil Deshmukh,Vineet Gandhi
Workshop on Intelligent Cinematography and Editing, WICED, 2023
@inproceedings{bib_Asse_2023, AUTHOR = {Girmaji Rohit, ACHARY SUDHEER, Adhiraj Anil Deshmukh, Vineet Gandhi}, TITLE = {Assessing active speaker detection algorithms through the lens of automated editing}, BOOKTITLE = {Workshop on Intelligent Cinematography and Editing}. YEAR = {2023}}
This paper addresses the challenge of active speaker detection in automated video editing and highlights the limitations of current audio-only and audio-visual speaker detection methods in handling unseen data with overlapped speakers, speaker occlusions, low video resolution, and random noise. Firstly, we select the BBC Old School Dataset, a comprehensive dataset introduced for automated video editing, and annotate it with active speaker labels. We propose an audio-based nearest neighbour algorithm that utilizes additional inputs, such as audio samples of each speaker and faces, to predict and track the active speaker. We evaluate the effectiveness of our approach on the BBC Old School Dataset by utilizing frame-level speaker accuracy, which we consider a more suitable metric in the context of video editing. We observe that this simple setup outperforms the current state-of-the-art methods in predicting the active speaker. By incorporating these methods into our speaker-based editing approaches, we also notice that our method closely approximates the output obtained using ground truth annotations.
Real Time GAZED: Online Shot Selection and Editing of Virtual Cameras from Wide-Angle Monocular Video Recordings
ACHARY SUDHEER,Girmaji Rohit,Adhiraj Anil Deshmukh,Vineet Gandhi
Technical Report, arXiv, 2023
@inproceedings{bib_Real_2023, AUTHOR = {ACHARY SUDHEER, Girmaji Rohit, Adhiraj Anil Deshmukh, Vineet Gandhi}, TITLE = {Real Time GAZED: Online Shot Selection and Editing of Virtual Cameras from Wide-Angle Monocular Video Recordings}, BOOKTITLE = {Technical Report}. YEAR = {2023}}
Eliminating time-consuming post-production processes and delivering high-quality videos in today’s fast-paced digital landscape are the key advantages of real-time approaches. To address these needs, we present Real Time GAZED: a real-time adaptation of the GAZED framework integrated with CineFilter, a novel real-time camera trajectory stabilization approach. It enables users to create professionally edited videos in real-time. Comparative evaluations against baseline methods, including the non-real-time GAZED, demonstrate that Real Time GAZED achieves similar editing results, ensuring high-quality video output. Furthermore, a user study confirms the aesthetic quality of the video edits produced by the Real Time GAZED approach. These advancements in real-time camera trajectory optimization and video editing help meet the demand for immediate and dynamic content creation in industries such as live broadcasting, sports coverage, news reporting, and social media.
RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self Supervised Representations
Neha S,Neilkumar Milankumar Shah,Vishal Thambrahalli,Vineet Gandhi
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), APSIPA, 2023
@inproceedings{bib_Robu_2023, AUTHOR = {Neha S, Neilkumar Milankumar Shah, Vishal Thambrahalli, Vineet Gandhi}, TITLE = {RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self Supervised Representations}, BOOKTITLE = {Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)}. YEAR = {2023}}
Significant progress has been made in speaker-dependent Lip-to-Speech synthesis, which aims to generate speech from silent videos of talking faces. Current state-of-the-art approaches primarily employ non-autoregressive sequence-to-sequence architectures to directly predict mel-spectrograms or audio waveforms from lip representations. We hypothesize that the direct mel-prediction hampers training/model efficiency due to the entanglement of speech content with ambient information and speaker characteristics. To this end, we propose RobustL2S, a modularized framework for Lip-to-Speech synthesis. First, a non-autoregressive sequence-to-sequence model maps self-supervised visual features to a representation of disentangled speech content. A vocoder then converts the speech features into raw waveforms. Extensive evaluations confirm the effectiveness of our setup, achieving state-of-the-art performance on the unconstrained Lip2Wav dataset and the constrained GRID and TCD-TIMIT datasets. Speech samples from RobustL2S can be found at https://neha-sherin.github.io/RobustL2S/
Adversarial Robustness of Mel Based Speaker Recognition Systems
Ritu Srivastava,K Saiteja,Sarath S,Neha S,Vineet Gandhi
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), APSIPA, 2023
@inproceedings{bib_Adve_2023, AUTHOR = {Ritu Srivastava, K Saiteja, Sarath S, Neha S, Vineet Gandhi}, TITLE = {Adversarial Robustness of Mel Based Speaker Recognition Systems}, BOOKTITLE = {Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)}. YEAR = {2023}}
Convolutional neural networks (CNNs) applied to Mel spectrograms now have a dominant presence in the landscape of speaker recognition systems. Correspondingly, it is important to evaluate their robustness to adversarial attacks, which remains under-explored for end-to-end trained CNNs for speaker recognition. Our work addresses this gap and investigates variations of the iterative Fast Gradient Sign Method (FGSM) to perform adversarial attacks. We observe that a vanilla iterative FGSM can flip the identity of each speaker sample to that of every other speaker in the LibriSpeech dataset. Furthermore, we propose adversarial attacks specific to Mel spectrogram features by (a) limiting the number of pixels attacked, (b) restricting changes to specific frequency bands, (c) restricting changes to a particular time duration, and (d) using a substitute model to craft the adversarial sample. Using thorough qualitative and quantitative results, we demonstrate the fragility and non-intuitive nature of current CNN-based speaker recognition systems, where the predicted speaker identities can be flipped without any perceptible changes in the audio. The samples are available at https://advdemo.github.io/speech/
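For context, the vanilla iterative FGSM referenced above repeatedly steps the input in the direction of the sign of the loss gradient while staying inside an eps-ball of the original Mel spectrogram. The sketch below assumes a generic PyTorch speaker classifier; `eps`, `alpha`, and the step count are placeholder values, not the settings used in the paper.

```python
import torch

def iterative_fgsm(model, mel, label, eps=0.002, alpha=0.0005, steps=10):
    # mel: Mel-spectrogram batch, e.g. (B, 1, n_mels, T); label: (B,) speaker ids.
    loss_fn = torch.nn.CrossEntropyLoss()
    adv = mel.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(model(adv), label)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()            # ascend the classification loss
            adv = mel + (adv - mel).clamp(-eps, eps)   # project back into the eps-ball
    return adv.detach()
```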
Instance-Level Semantic Maps for Vision Language Navigation
Laksh Nanwani,K Madhava Krishna,Anmol Agarwal,Kanishk Jain,Raghav Prabhakar,Aaron Anthony Monis,Aditya Mathur,Krishna Murthy,Abdul Hafez,Vineet Gandhi
IEEE International Conference on Robot and Human Interactive Communication, RO-MAN, 2023
@inproceedings{bib_Inst_2023, AUTHOR = {Laksh Nanwani, K Madhava Krishna, Anmol Agarwal, Kanishk Jain, Raghav Prabhakar, Aaron Anthony Monis, Aditya Mathur, Krishna Murthy, Abdul Hafez, Vineet Gandhi}, TITLE = {Instance-Level Semantic Maps for Vision Language Navigation}, BOOKTITLE = {IEEE International Conference on Robot and Human Interactive Communication}. YEAR = {2023}}
Humans have a natural ability to perform semantic associations with the surrounding objects in the environment. This allows them to create a mental map of the environment, which they can use to navigate on demand when given linguistic instructions. A natural goal in Vision Language Navigation (VLN) research is to impart autonomous agents with similar capabilities. Recent works take a step towards this goal by creating a semantic spatial map representation of the environment without any labeled data. However, their representations are limited for practical applicability as they do not distinguish between different instances of the same object. In this work, we address this limitation by integrating instance-level information into spatial map representation using a community detection algorithm and utilizing word ontology learned by large language models (LLMs) to perform open-set semantic associations in the mapping representation. The resulting map representation improves the navigation performance by two-fold (233%) on realistic language commands with instance-specific descriptions compared to the baseline. We validate the practicality and effectiveness of our approach through extensive qualitative and quantitative experiments.
The prose storyboard language: A tool for annotating and directing movies
REMI RONFARD,Vineet Gandhi,LAURENT BOIRON,VAISHNAVI AMEYA MURUKUTLA
Eurographics Workshop on Intelligent Cinematography and Editing, WICED, 2022
@inproceedings{bib_The__2022, AUTHOR = {REMI RONFARD, Vineet Gandhi, LAURENT BOIRON, VAISHNAVI AMEYA MURUKUTLA}, TITLE = {The prose storyboard language: A tool for annotating and directing movies}, BOOKTITLE = {Eurographics Workshop on Intelligent Cinematography and Editing}. YEAR = {2022}}
The prose storyboard language is a formal language for describing movies shot by shot, where each shot is described with a unique sentence. The language uses a simple syntax and limited vocabulary borrowed from working practices in traditional movie-making, and is intended to be readable both by machines and humans. The language is designed to serve as a high-level user interface for intelligent cinematography and editing systems.
Comprehensive Multi-Modal Interactions for Referring Image Segmentation
Kanishk Jain,Vineet Gandhi
Conference of the Association of Computational Linguistics, ACL, 2022
@inproceedings{bib_Comp_2022, AUTHOR = {Kanishk Jain, Vineet Gandhi}, TITLE = {Comprehensive Multi-Modal Interactions for Referring Image Segmentation}, BOOKTITLE = {Conference of the Association of Computational Linguistics}. YEAR = {2022}}
We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the natural language description. Addressing RIS efficiently requires considering the interactions happening across visual and linguistic modalities and the interactions within each modality. Existing methods are limited because they either compute different forms of interactions sequentially (leading to error propagation) or ignore intramodal interactions. We address this limitation by performing all three interactions simultaneously through a Synchronous Multi-Modal Fusion Module (SFM). Moreover, to produce refined segmentation masks, we propose a novel Hierarchical Cross-Modal Aggregation Module (HCAM), where linguistic features facilitate the exchange of contextual information across the visual hierarchy. We present thorough ablation studies and validate our approach’s performance on four benchmark datasets, showing considerable performance gains over the existing state-of-the-art (SOTA) methods.
Does Audio help in deep Audio-Visual Saliency prediction models?
Ritvik Agrawal,Shreyank Jyoti,Girmaji Rohit,Sarath Sivaprasad,Vineet Gandhi
International Conference on Multimodal Interaction, ICMI, 2022
@inproceedings{bib_Does_2022, AUTHOR = {Ritvik Agrawal, Shreyank Jyoti, Girmaji Rohit, Sarath Sivaprasad, Vineet Gandhi}, TITLE = {Does Audio help in deep Audio-Visual Saliency prediction models?}, BOOKTITLE = {International Conference on Multimodal Interaction}. YEAR = {2022}}
Despite existing Audio-Visual Saliency Prediction (AVSP) models claiming to achieve promising results by fusing the audio modality over visual-only models, these models fail to leverage audio information. In this paper, we investigate the relevance of audio cues in conjunction with visual ones and conduct an extensive analysis by employing well-established audio modules and fusion techniques from diverse correlated audio-visual tasks. Our analysis on ten diverse saliency datasets suggests that none of the evaluated methods effectively incorporates audio. Furthermore, we bring to light why AVSP models show a gain in performance over visual-only models even though the audio branch is agnostic at inference. Our work questions the role of audio in current deep AVSP models and motivates the community to reconsider the complex architectures by demonstrating that simpler alternatives work equally well.
Salient Face Prediction without Bells and Whistles
Shreyank Jyoti,Girmaji Rohit,Ritvik Agrawal,Sarath Sivaprasad,Vineet Gandhi
Proceedings of the Digital Image Computing: Techniques and Applications, DICTA, 2022
@inproceedings{bib_Sali_2022, AUTHOR = {Shreyank Jyoti, Girmaji Rohit, Ritvik Agrawal, Sarath Sivaprasad, Vineet Gandhi}, TITLE = {Salient Face Prediction without Bells and Whistles}, BOOKTITLE = {Proceedings of the Digital Image Computing: Techniques and Applications}. YEAR = {2022}}
Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems
K Saiteja,Sarath S,Niranjan Pedanekar,Anil Nelakanti,Vineet Gandhi
Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 2022
@inproceedings{bib_Empa_2022, AUTHOR = {K Saiteja, Sarath S, Niranjan Pedanekar, Anil Nelakanti, Vineet Gandhi}, TITLE = {Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics}. YEAR = {2022}}
We present a method to control the emotional prosody of Text to Speech (TTS) systems by using phoneme-level intermediate features (pitch, energy, and duration) as levers. As a key idea, we propose Differential Scaling (DS) to disentangle features relating to affective prosody from those arising due to acoustic conditions and speaker identity. With thorough experimental studies, we show that the proposed method improves over the prior art in accurately emulating the desired emotions while retaining the naturalness of speech. We extend the traditional evaluation of using individual sentences for a more complete evaluation of HCI systems. We present a novel experimental setup by replacing an actor with a TTS system in offline and live conversations. The emotion to be rendered is either predicted or manually assigned. The results show that the proposed method is strongly preferred over the state-of-the-art TTS system and adds the much-coveted “human touch” in machine dialogue. Audio samples for our experiments and the code are available at: https://emtts.github.io/tts-demo/
Framework to Computationally Analyze Kathakali Videos
Bulani Pratikkumar Sureshkumar Komal,Jayachandran S,Sarath Sivaprasad,Vineet Gandhi
Eurographics Workshop on Intelligent Cinematography and Editing, WICED, 2022
@inproceedings{bib_Fram_2022, AUTHOR = {Bulani Pratikkumar Sureshkumar Komal, Jayachandran S, Sarath Sivaprasad, Vineet Gandhi}, TITLE = {Framework to Computationally Analyze Kathakali Videos}, BOOKTITLE = {Eurographics Workshop on Intelligent Cinematography and Editing}. YEAR = {2022}}
Kathakali is one of the major forms of Classical Indian Dance. The dance form is distinguished by the elaborately colourful makeup, costumes and face masks. In this work, we present (a) a framework to analyze the facial expressions of the actors and (b) novel visualization techniques for the same. Due to extensive makeup, costumes and masks, the general face analysis techniques fail on Kathakali videos. We present a dataset with manually annotated Kathakali sequences for four downstream tasks, i.e. face detection, background subtraction, landmark detection and face segmentation. We rely on transfer learning and fine-tune deep learning models and present qualitative and quantitative results for these tasks. Finally, we present a novel application of style-transfer of Kathakali video onto a cartoonized face. The comprehensive framework presented in the paper paves the way for better understanding, analysis, pedagogy and visualization of Kathakali videos.
Cross-Domain Class-Contrastive Learning: Finding Lower Dimensional Representations for Improved Domain Generalization
Saransh Dave,Ritam Basu,Vineet Gandhi
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2022
@inproceedings{bib_Cros_2022, AUTHOR = {Saransh Dave, Ritam Basu, Vineet Gandhi}, TITLE = {Cross-Domain Class-Contrastive Learning: Finding Lower Dimensional Representations for Improved Domain Generalization}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2022}}
Domain Generalization (DG) requires a model to learn a hypothesis from multiple distributions that generalizes to an unseen distribution. Recent explorations show that, for neural networks, the choice of hyper-parameters and model architecture significantly affects DG performance, and making the right choice is non-trivial. In this paper, we show evidence suggesting that the models that perform better at DG might be implicitly learning a low-dimensional representation in the feature space. Furthermore, we take forward this idea and employ explicit feature learning to improve DG. To this end, we propose a DG-specific supervised contrastive loss. We show how this performance improvement correlates with reduced dimensionality of the representation. Our work establishes a new state-of-the-art on five different DG benchmarks, compared against over two dozen existing approaches in DomainBed.
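The DG-specific loss itself is not reproduced here, but the underlying class-contrastive idea (pulling together embeddings that share a class label regardless of their source domain, within a batch pooled across domains) can be sketched as a standard supervised contrastive objective. The temperature and normalization choices below are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def class_contrastive_loss(features, labels, temperature=0.1):
    # features: (B, D) embeddings from a batch pooled across training domains.
    # labels:   (B,) class labels; positives are same-class samples from any domain.
    z = F.normalize(features, dim=1)                     # unit-norm embeddings
    sim = z @ z.t() / temperature                        # (B, B) similarity logits
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    logits = sim.masked_fill(self_mask, -1e9)            # exclude self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```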
The Curious Case of Convex Networks
Sarath S,Naresh Manwani,Vineet Gandhi
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, PKDD/ECML, 2021
@inproceedings{bib_The__2021, AUTHOR = {Sarath S, Naresh Manwani, Vineet Gandhi}, TITLE = {The Curious Case of Convex Networks}, BOOKTITLE = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases}. YEAR = {2021}}
In this paper, we investigate a constrained formulation of neural networks where the output is a convex function of the input. We show that the convexity constraints can be enforced on both fully connected and convolutional layers, making them applicable to most architectures. The convexity constraints include restricting the weights (for all but the first layer) to be non-negative and using a non-decreasing convex activation function. Albeit simple, these constraints have profound implications on the generalization abilities of the network. We draw three valuable insights: (a) Input Output Convex Networks (IOC-NN) self-regularize and almost uproot the problem of overfitting; (b) although heavily constrained, they come close to the performance of the base architectures; and (c) an ensemble of convex networks can match or outperform the non-convex counterparts. We demonstrate the efficacy of the proposed idea using thorough experiments and ablation studies on MNIST, CIFAR10, and CIFAR100 datasets with three different neural network architectures. The code for this project is publicly available at: https://github.com/sarathsp1729/Convex-Networks.
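A minimal sketch of the convexity constraints described in the abstract: leave the first layer unconstrained, keep the weights of every subsequent layer non-negative, and use a convex, non-decreasing activation such as ReLU. The module below is illustrative (fully connected only) and not the paper's implementation; `project_weights` would be called after each optimizer step.

```python
import torch
import torch.nn as nn

class IOCMLP(nn.Module):
    # The output is convex in the input if all layers after the first have
    # non-negative weights and the activation is convex and non-decreasing
    # (ReLU satisfies both).
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.first = nn.Linear(in_dim, hidden)    # unconstrained first layer
        self.hidden = nn.Linear(hidden, hidden)   # weights kept >= 0
        self.out = nn.Linear(hidden, out_dim)     # weights kept >= 0
        self.act = nn.ReLU()

    def forward(self, x):
        h = self.act(self.first(x))
        h = self.act(self.hidden(h))
        return self.out(h)

    @torch.no_grad()
    def project_weights(self):
        # Call after each optimizer step to enforce the convexity constraint.
        for layer in (self.hidden, self.out):
            layer.weight.clamp_(min=0.0)
```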
ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction
Samyak Jain,SREE RAM SAI PRADEEP YARLAGADDA,Shreyank Jyoti,SHYAMGOPAL KARTHIK,Ramanathan Subramanian,Vineet Gandhi
International Conference on Intelligent Robots and Systems, IROS, 2021
@inproceedings{bib_ViNe_2021, AUTHOR = {Samyak Jain, SREE RAM SAI PRADEEP YARLAGADDA, Shreyank Jyoti, SHYAMGOPAL KARTHIK, Ramanathan Subramanian, Vineet Gandhi}, TITLE = {ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2021}}
We propose the ViNet architecture for audiovisual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first model to do so. We also explore a variation of the ViNet architecture by augmenting audio features into the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and provides the same output irrespective of the input. Interestingly, we also observe similar behaviour in the previous state-of-the-art models [1] for audio-visual saliency prediction. Our findings contrast with previous works on deep learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations incorporating audio in a more effective manner. The code and pre-trained models are publicly available.
AViNet: Diving Deep into Audio-Visual Saliency Prediction
Samyak Jain,PRADEEP YARLAGADDA,Ramanathan Subramanian,Vineet Gandhi
International Conference on Intelligent Robots and Systems, IROS, 2021
@inproceedings{bib_AViN_2021, AUTHOR = {Samyak Jain, PRADEEP YARLAGADDA, Ramanathan Subramanian, Vineet Gandhi}, TITLE = {AViNet: Diving Deep into Audio-Visual Saliency Prediction}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2021}}
We propose the AViNet architecture for audiovisual saliency prediction. AViNet is a fully convolutional encoder-decoder architecture. The encoder combines visual features learned for action recognition with audio embeddings learned via an aural network designed to classify objects and scenes. The decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining hierarchical features. The overall architecture is conceptually simple, causal, and runs in real-time (60 fps). AViNet outperforms the state-of-the-art on ten (seven audiovisual and three visual-only) datasets while surpassing human performance on the CC, SIM, and AUC metrics for the AVE dataset. Visual features maximally account for saliency on existing datasets, with audio only contributing to minor gains, except in specific contexts like social events. Our work, therefore, motivates the need to curate saliency datasets reflective of real life, where both the visual and aural modalities complementarily drive saliency. Our code and pre-trained models are publicly available.
Methods and systems of automatically generating video content from scripts/text
Vineet Gandhi,SRINIVASA RAGHAVAN RAJENDRAN
United States Patent, Us patent, 2021
@inproceedings{bib_Meth_2021, AUTHOR = {Vineet Gandhi, SRINIVASA RAGHAVAN RAJENDRAN}, TITLE = {Methods and systems of automatically generating video content from scripts/text}, BOOKTITLE = {United States Patent}. YEAR = {2021}}
In one aspect, a computerized method for automatically generating digital video content from scripts includes a media engine. The media engine receives a script and a plurality of flags and sends the script to a natural language processing (NLP) engine. The NLP engine parses the script. The script is broken into a set of keywords and phrases in a JSON format. The NLP engine, based on the keywords and phrases and the plurality of flags, obtains a relevant background scene for the video, a relevant set of assets for the video, and a set of associated attributes for each of the assets. An asset is a character or object for the video. The NLP engine provides the parsed script in the JSON format, the relevant background scene of the video, the relevant set of assets of the video, and the set of associated attributes of each of the assets to a layout engine. The layout engine, based on the parsed script …
No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks
SHYAMGOPAL KARTHIK,Ameya Prabhu,Puneet K. Dokania,Vineet Gandhi
International Conference on Learning Representations, ICLR, 2021
@inproceedings{bib_No_C_2021, AUTHOR = {SHYAMGOPAL KARTHIK, Ameya Prabhu, Puneet K. Dokania, Vineet Gandhi}, TITLE = {No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks}, BOOKTITLE = {International Conference on Learning Representations}. YEAR = {2021}}
There has been increasing interest in building deep hierarchy-aware classifiers that aim to quantify and reduce the severity of mistakes, and not just reduce the number of errors. The idea is to exploit the label hierarchy (e.g., the WordNet ontology) and consider graph distances as a proxy for mistake severity. Surprisingly, on examining mistake-severity distributions of the top-1 prediction, we find that current state-of-the-art hierarchy-aware deep classifiers do not always show practical improvement over the standard cross-entropy baseline in making better mistakes. The reason for the reduction in average mistake-severity can be attributed to the increase in low-severity mistakes, which may also explain the noticeable drop in their accuracy. To this end, we use the classical Conditional Risk Minimization (CRM) framework for hierarchy-aware classification. Given a cost matrix and a reliable estimate of likelihoods (obtained from a trained network), CRM simply amends mistakes at inference time; it needs no extra hyperparameters and requires adding just a few lines of code to the standard cross-entropy baseline. It significantly outperforms the state-of-the-art and consistently obtains large reductions in the average hierarchical distance of top-k predictions across datasets, with very little loss in accuracy. CRM, because of its simplicity, can be used with any off-the-shelf trained model that provides reliable likelihood estimates.
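The CRM amendment itself reduces to choosing the class with minimum expected cost under the network's likelihoods. A minimal sketch, assuming a hierarchy-derived cost matrix and softmax probabilities from any trained classifier:

```python
import numpy as np

def crm_predict(probs, cost):
    # probs: (N, K) softmax likelihoods from a trained classifier.
    # cost:  (K, K) matrix, cost[i, j] = severity of predicting class i
    #        when the true class is j (e.g., tree distance in the hierarchy).
    # Expected cost of predicting class i is sum_j cost[i, j] * p(j | x).
    risk = probs @ cost.T            # (N, K) conditional risk per class
    return risk.argmin(axis=1)       # class with minimum expected cost

# Toy example: class 0 is far from classes 1 and 2 in the hierarchy.
probs = np.array([[0.40, 0.35, 0.25]])
cost = np.array([[0, 3, 3],
                 [3, 0, 1],
                 [3, 1, 0]])
print(crm_predict(probs, cost))      # picks class 1, unlike probs.argmax(axis=1)
```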
Grounding Linguistic Commands to Navigable Regions
Nivedita Rufus,Kanishk Jain,Unni Krishnan R Nair,Vineet Gandhi,K Madhava Krishna
International Conference on Intelligent Robots and Systems, IROS, 2021
@inproceedings{bib_Grou_2021, AUTHOR = {Nivedita Rufus, Kanishk Jain, Unni Krishnan R Nair, Vineet Gandhi, K Madhava Krishna}, TITLE = {Grounding Linguistic Commands to Navigable Regions}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2021}}
Humans have a natural ability to effortlessly comprehend linguistic commands such as “park next to the yellow sedan” and instinctively know which region of the road the vehicle should navigate. Extending this ability to autonomous vehicles is the next step towards creating fully autonomous agents that respond and act according to human commands. To this end, we propose the novel task of Referring Navigable Regions (RNR), i.e., grounding regions of interest for navigation based on the linguistic command. RNR is different from Referring Image Segmentation (RIS), which focuses on grounding an object referred to by the natural language expression instead of grounding a navigable region. For example, for a command “park next to the yellow sedan,” RIS will aim to segment the referred sedan, and RNR aims to segment the suggested parking region on the road. We introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2car [1] dataset with segmentation masks for the regions described by the linguistic commands. A separate test split with concise manoeuvre-oriented commands is provided to assess the practicality of our dataset. We benchmark the proposed dataset using a novel transformer-based architecture. We present extensive ablations and show superior performance over baselines on multiple evaluation metrics. A downstream path planner generating trajectories based on RNR outputs confirms the efficacy of the proposed framework.
The Curious case of Convex Neural networks
Sivaprasad S,Ankur Singh,Naresh Manwani,Vineet Gandhi
The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD, 2021
@inproceedings{bib_The__2021, AUTHOR = {Sivaprasad S, Ankur Singh, Naresh Manwani, Vineet Gandhi}, TITLE = {The Curious case of Convex Neural networks}, BOOKTITLE = {The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases}. YEAR = {2021}}
Emotional Prosody Control for Speech Generation
Sarath S,K Saiteja,Vineet Gandhi
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2021
@inproceedings{bib_Emot_2021, AUTHOR = {Sarath S, K Saiteja, Vineet Gandhi}, TITLE = {Emotional Prosody Control for Speech Generation}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2021}}
Machine-generated speech is characterized by its limited or unnatural emotional variation. Current text-to-speech systems generate speech with either a flat emotion, an emotion selected from a predefined set, an average variation learned from prosody sequences in the training data, or a style transferred from a source. We propose a text-to-speech (TTS) system where a user can choose the emotion of the generated speech from a continuous and meaningful emotion space (Arousal-Valence space). The proposed TTS system can generate speech from text in any speaker’s style, with fine control of emotion. We show that the system works on emotions unseen during training and can scale to previously unseen speakers given a speech sample. Our work expands the horizon of the state-of-the-art FastSpeech2 backbone to a multi-speaker setting and gives it much-coveted continuous (and interpretable) affective control, without any observable degradation in the quality of the synthesized speech. Audio samples are available at https://researchweb.iiit.ac.in/~sarath.s/emotts/
Reappraising Domain Generalization in Neural Networks
Sarath Sivaprasad,Akshay Goindani,Vaibhav Garg,Vineet Gandhi
Technical Report, arXiv, 2021
@inproceedings{bib_REAP_2021, AUTHOR = {Sarath Sivaprasad, Akshay Goindani, Vaibhav Garg, Vineet Gandhi}, TITLE = {REAPPRAISING DOMAIN GENERALIZATION IN NEURAL NETWORKS}, BOOKTITLE = {Technical Report}. YEAR = {2021}}
Domain generalization (DG) of machine learning algorithms is defined as their ability to learn a domain-agnostic hypothesis from multiple training distributions, which generalizes onto data from an unseen domain. DG is vital in scenarios where the target domain with distinct characteristics has sparse data for training. Aligning with recent work [9], we find that a straightforward Empirical Risk Minimization (ERM) baseline consistently outperforms existing DG methods. We present ablation studies indicating that the choice of backbone, data augmentation, and optimization algorithms overshadows the many tricks and trades explored in the prior art. Our work leads to a new state of the art on the four popular DG datasets, surpassing previous methods by large margins. Furthermore, as a key contribution, we propose a classwise-DG formulation, where for each class, we randomly select one of the domains and keep it aside for testing. We argue that this benchmarking is closer to human learning and relevant in real-world scenarios. We comprehensively benchmark classwise-DG on DomainBed [9] and propose a method combining ERM and reverse gradients to achieve state-of-the-art results. To our surprise, despite being exposed to all domains during training, classwise DG is more challenging than the traditional DG evaluation and motivates more fundamental rethinking of the problem of DG.
Simple Unsupervised Multi-Object Tracking
SHYAMGOPAL KARTHIK,Ameya Prabhu,Vineet Gandhi
Technical Report, arXiv, 2020
@inproceedings{bib_Simp_2020, AUTHOR = {SHYAMGOPAL KARTHIK, Ameya Prabhu, Vineet Gandhi}, TITLE = {Simple Unsupervised Multi-Object Tracking}, BOOKTITLE = {Technical Report}. YEAR = {2020}}
Multi-object tracking has seen a lot of progress recently, albeit with substantial annotation costs for developing better and larger labeled datasets. In this work, we remove the need for annotated datasets by proposing an unsupervised re-identification network, thus sidestepping the labeling costs entirely required for training. Given unlabeled videos, our proposed method (SimpleReID) first generates tracking labels using SORT [3] and trains a ReID network to predict the generated labels using cross-entropy loss. We demonstrate that SimpleReID performs substantially better than simpler alternatives, and we recover the full performance of its supervised counterpart consistently across diverse tracking frameworks. The observations are unusual because unsupervised ReID is not expected to excel in crowded scenarios with occlusions and drastic viewpoint changes. By incorporating our unsupervised SimpleReID with CenterTrack trained on augmented still images, we establish a new state-of-the-art performance on popular datasets like MOT16/17 without using tracking supervision, beating the current best (CenterTrack) by 0.2-0.3 MOTA and 4.4-4.8 IDF1 scores. We further provide evidence for limited scope for improvement in IDF1 scores beyond our unsupervised ReID in the studied settings. Our investigation suggests reconsideration towards more sophisticated, supervised, end-to-end trackers [56, 5] by showing promise in simpler unsupervised alternatives.
GAZED–Gaze-guided Cinematic Editing of Wide-Angle Monocular Video Recordings
Kommu Lakshmi Bhanu Moorthy,MONEISH KUMAR,Ramanathan Subramanian ,Vineet Gandhi
International Conference of Human Factors in Computing Systems, CHI, 2020
@inproceedings{bib_GAZE_2020, AUTHOR = {Kommu Lakshmi Bhanu Moorthy, MONEISH KUMAR, Ramanathan Subramanian , Vineet Gandhi}, TITLE = {GAZED–Gaze-guided Cinematic Editing of Wide-Angle Monocular Video Recordings}, BOOKTITLE = {International Conference of Human Factors in Computing Systems}. YEAR = {2020}}
We present GAZED – eye GAZe-guided EDiting for videos captured by a solitary, static, wide-angle and high-resolution camera. Eye-gaze has been effectively employed in computational applications as a cue to capture interesting scene content; we employ gaze as a proxy to select shots for inclusion in the edited video. Given the original video, scene content and user eye-gaze tracks are combined to generate an edited video comprising cinematically valid actor shots and shot transitions to generate an aesthetic and vivid representation of the original narrative. We model cinematic video editing as an energy minimization problem over shot selection, whose constraints capture cinematographic editing conventions. Gazed scene locations primarily determine the shots constituting the edited video. Effectiveness of GAZED against multiple competing methods is demonstrated via a psychophysical study involving 12 users and twelve performance videos.

Professional video recordings of stage performances are typically created by employing skilled camera operators, who record the performance from multiple viewpoints. These multi-camera feeds, termed rushes, are then edited together to portray an eloquent story intended to maximize viewer engagement. Generating professional edits of stage performances is both difficult and challenging. Firstly, maneuvering cameras during a live performance is difficult even for experts as there is no option of a retake upon error, and camera viewpoints are limited as the use of large supporting equipment (trolley, crane, etc.) is infeasible. Secondly, manual video editing is an extremely slow and tedious process and leverages the experience of skilled editors. Overall, the need for (i) a professional camera crew, (ii) multiple cameras and supporting equipment, and (iii) expert editors escalates the process complexity and costs. Consequently, most production houses employ a large field-of-view static camera, placed far enough to capture the entire stage. This approach is widespread as it is simple to implement and also captures the entire scene. Such static visualizations are apt for archival purposes; however, they are often unsuccessful at captivating attention when presented to the target audience. While conveying the overall context, the distant camera feed fails to bring out vivid scene details like close-up faces, character emotions and actions, and ensuing interactions, which are critical for cinematic storytelling.

GAZED denotes an end-to-end pipeline to generate an aesthetically edited video from a single static, wide-angle stage recording. This is inspired by prior work [GRG14], which describes how a plural camera crew can be replaced by a single high-resolution static camera, and multiple virtual camera shots or rushes generated by simulating several virtual pan/tilt/zoom cameras to focus on actors and actions within the original recording. In this work, we demonstrate that the multiple rushes can be automatically edited by leveraging user eye gaze information, by modeling (virtual) shot selection as a discrete optimization problem. Eye-gaze represents an inherent guiding factor for video editing, as eyes are sensitive to interesting scene events [RKH∗09, SSSM14] that need to be vividly presented in the edited video. The objective critical for video editing, and the key contribution of our work, is to decide which shot (or rush) needs to be selected to describe each frame of the edited video.

The shot selection problem is modeled as an optimization, which incorporates gaze information along with other cost terms that model cinematic editing principles. Gazed scene locations are utilized to define gaze potentials, which measure the importance of the different shots to choose from. Gaze potentials are then combined with other terms that model cinematic principles like avoiding jump cuts (which produce jarring shot transitions), rhythm (pace of shot transitioning), and avoiding transient shots. The optimization is solved using dynamic programming. See [MKSG20] for the detailed full article.
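The shot-selection optimization admits a standard Viterbi-style dynamic program. The sketch below uses a generic per-frame cost for each rush (standing in for the gaze potentials) and a single scalar cut penalty; the actual GAZED energy combines several cinematic terms (jump cuts, rhythm, transient shots), so this is a schematic of the DP machinery rather than the full objective.

```python
import numpy as np

def select_shots(gaze_cost, switch_cost):
    # gaze_cost: (T, S) cost of showing shot s at frame t (lower = better).
    # switch_cost: scalar penalty for cutting between consecutive frames.
    T, S = gaze_cost.shape
    dp = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    dp[0] = gaze_cost[0]
    for t in range(1, T):
        # cost of arriving at shot s: stay (no penalty) or cut (penalty)
        trans = dp[t - 1][None, :] + switch_cost * (1 - np.eye(S))
        back[t] = trans.argmin(axis=1)
        dp[t] = gaze_cost[t] + trans.min(axis=1)
    # backtrack the optimal shot sequence
    seq = [int(dp[-1].argmin())]
    for t in range(T - 1, 0, -1):
        seq.append(int(back[t, seq[-1]]))
    return seq[::-1]
```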
ColorArt: Suggesting Colorizations for Graphic Arts Using Optimal Color-Graph Matching
Murtuza Bohra,Vineet Gandhi
Graphics Interface Conference, GI, 2020
@inproceedings{bib_Colo_2020, AUTHOR = {Murtuza Bohra, Vineet Gandhi}, TITLE = {ColorArt: Suggesting Colorizations for Graphic Arts Using Optimal Color-Graph Matching}, BOOKTITLE = {Graphics Interface Conference}. YEAR = {2020}}
Colorization is a complex task of selecting a combination of colors and arriving at an appropriate spatial arrangement of the colors in an image. In this paper, we propose a novel approach for automatic colorization of graphic arts like graphic patterns, info-graphics and cartoons. Our approach uses the artist’s colored graphics as a reference to color a template image. We also propose a retrieval system for selecting a relevant reference image corresponding to the given template from a dataset of reference images colored by different artists. Finally, we formulate the problem of colorization as an optimal graph matching problem over color groups in the reference and the template image. We demonstrate results on a variety of coloring tasks and evaluate our model through multiple perceptual studies. The studies show that the results generated through our model are significantly preferred by the participants over other automatic colorization methods.
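The final step casts colorization as optimal matching between color groups of the reference and the template. A minimal sketch under simplifying assumptions (only a Lab color-distance cost, ignoring the spatial relations encoded in the paper's color graph) can be written with the linear-assignment solver in SciPy:

```python
# Hedged sketch of colour-group assignment as optimal bipartite matching
# (a simplification of the paper's colour-graph matching formulation).
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_color_groups(template_groups, reference_groups):
    """Both inputs: (N, 3) / (M, 3) arrays of mean Lab colours per group.
    Returns, for each template group, the index of its matched reference group."""
    cost = np.linalg.norm(template_groups[:, None, :] - reference_groups[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)   # solve the linear sum assignment problem
    return dict(zip(rows.tolist(), cols.tolist()))

template = np.array([[50., 0., 0.], [80., 10., -5.]])
reference = np.array([[78., 12., -4.], [48., 2., 1.], [20., 0., 0.]])
print(match_color_groups(template, reference))  # {0: 1, 1: 0}
```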
LiDAR guided Small obstacle Segmentation
Aasheesh Singh,Aditya Kamireddypalli,Vineet Gandhi,K Madhava Krishna
International Conference on Intelligent Robots and Systems, IROS, 2020
@inproceedings{bib_LiDA_2020, AUTHOR = {Aasheesh Singh, Aditya Kamireddypalli, Vineet Gandhi, K Madhava Krishna}, TITLE = {LiDAR guided Small obstacle Segmentation}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2020}}
Detecting small obstacles on the road is critical for autonomous driving. In this paper, we present a method to reliably detect such obstacles through a multi-modal framework of sparse LiDAR (VLP-16) and monocular vision. LiDAR is employed to provide additional context in the form of confidence maps to monocular segmentation networks. We show significant performance gains when the context is fed as an additional input to monocular semantic segmentation frameworks. We further present a new semantic segmentation dataset to the community, comprising over 3000 image frames with corresponding LiDAR observations. The images come with pixel-wise annotations of three classes: off-road, road, and small obstacle. We stress that precise calibration between LiDAR and camera is crucial for this task and thus propose a novel Hausdorff distance based calibration refinement method over extrinsic parameters. As a first benchmark over this dataset, we report 73% instance detection up to a distance of 50 meters in challenging scenarios. Qualitatively, by showcasing accurate segmentation of obstacles smaller than 15 cm at 50 m depth, and quantitatively, through favourable comparisons vis-à-vis prior art, we demonstrate the method’s efficacy. Our project and dataset are hosted at https://small-obstacle-dataset.github.io/
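The core integration step, feeding the LiDAR-derived confidence map to a monocular segmentation network as extra context, can be illustrated with a toy PyTorch snippet. The tiny `seg_net` below is a stand-in, not the network used in the paper:

```python
# Minimal PyTorch sketch: the sparse-LiDAR confidence map is concatenated to the
# RGB image as a fourth input channel before semantic segmentation.
import torch
import torch.nn as nn

seg_net = nn.Sequential(                         # toy stand-in for a segmentation network
    nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 1),                         # off-road / road / small obstacle
)

rgb = torch.rand(1, 3, 128, 256)
lidar_confidence = torch.rand(1, 1, 128, 256)    # projected LiDAR context map (assumed precomputed)
logits = seg_net(torch.cat([rgb, lidar_confidence], dim=1))
print(logits.shape)                              # torch.Size([1, 3, 128, 256])
```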
Tidying Deep Saliency Prediction Architectures
MALLU NAVYASRI REDDY,Samyak Jain,SREE RAM SAI PRADEEP YARLAGADDA,Vineet Gandhi
International Conference on Intelligent Robots and Systems, IROS, 2020
@inproceedings{bib_Tidy_2020, AUTHOR = {MALLU NAVYASRI REDDY, Samyak Jain, SREE RAM SAI PRADEEP YARLAGADDA, Vineet Gandhi}, TITLE = {Tidying Deep Saliency Prediction Architectures}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2020}}
Learning computational models for visual attention (saliency estimation) is an effort to inch machines/robots closer to human visual cognitive abilities. Data-driven efforts have dominated the landscape since the introduction of deep neural network architectures. In deep learning research, the choices in architecture design are often empirical and frequently lead to more complex models than necessary. The complexity, in turn, hinders the application requirements. In this paper, we identify four key components of saliency models, i.e., input features, multi-level integration, readout architecture, and loss functions. We review the existing state-of-the-art models on these four components and propose novel and simpler alternatives. As a result, we propose two novel end-to-end architectures called SimpleNet and MDNSal, which are neater, minimal, more interpretable and achieve state-of-the-art performance on public saliency benchmarks. SimpleNet is an optimized encoder-decoder architecture and brings notable performance gains on the SALICON dataset (the largest saliency benchmark). MDNSal is a parametric model that directly predicts parameters of a GMM distribution and aims to bring more interpretability to the prediction maps. The proposed saliency models can be inferred at 25 fps, making them suitable for real-time applications. Code and pre-trained models are available at https://github.com/samyak0210/saliency.
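As a rough illustration of the parametric readout idea behind MDNSal (the component count, head design and rendering below are assumptions, not the released architecture), a feature vector can be mapped to 2D Gaussian-mixture parameters that are then rasterized into a saliency map:

```python
# Illustrative PyTorch sketch of a parametric GMM readout in the spirit of MDNSal.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMReadout(nn.Module):
    def __init__(self, feat_dim=512, n_components=8):
        super().__init__()
        self.K = n_components
        self.head = nn.Linear(feat_dim, n_components * 5)   # mu_x, mu_y, sig_x, sig_y, weight

    def forward(self, feat, height=64, width=64):
        p = self.head(feat).view(-1, self.K, 5)
        mu = torch.sigmoid(p[..., 0:2])                      # means in normalised [0, 1] coords
        sigma = F.softplus(p[..., 2:4]) + 1e-3               # positive standard deviations
        w = torch.softmax(p[..., 4], dim=-1)                 # mixture weights
        gy, gx = torch.meshgrid(torch.linspace(0, 1, height),
                                torch.linspace(0, 1, width), indexing="ij")
        grid = torch.stack([gx, gy], dim=-1)                 # (H, W, 2)
        diff = grid[None, None] - mu[:, :, None, None, :]    # (B, K, H, W, 2)
        logp = -0.5 * (diff ** 2 / sigma[:, :, None, None, :] ** 2).sum(-1)
        sal = (w[:, :, None, None] * torch.exp(logp)).sum(1)
        return sal / sal.amax(dim=(1, 2), keepdim=True)      # normalised saliency map

print(GMMReadout()(torch.randn(2, 512)).shape)               # torch.Size([2, 64, 64])
```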
TextureToMTF: predicting spatial frequency response in the wild
Murtuza Bohra,SAJAL MAHESHWARI,Vineet Gandhi
Signal,Image and Video Processing, SIViP, 2020
@inproceedings{bib_Text_2020, AUTHOR = {Murtuza Bohra, SAJAL MAHESHWARI, Vineet Gandhi}, TITLE = {TextureToMTF: predicting spatial frequency response in the wild}, BOOKTITLE = {Signal,Image and Video Processing}. YEAR = {2020}}
In this work, we propose a no-reference image quality assessment (NR-IQA) approach at the confluence of signal processing and deep learning. We use MTF50 (the spatial frequency at which the modulation transfer function is 50% of its peak value) on slanted edges as a measure of image quality. We propose a comprehensive IQA dataset of images captured with a hand-held phone camera in a variety of situations, with slanted edges present in the scene. The MTF50 values at the slanted edges are then used to obtain ground-truth values for each patch in the captured images. A convolutional neural network is then trained to predict MTF50 values from arbitrary image patches. We present results on the proposed dataset and the synthetically generated TID2013 dataset and show state-of-the-art performance for IQA in the wild.
Exploring 3 R's of Long-term Tracking: Redetection, Recovery and Reliability
SHYAMGOPAL KARTHIK,ABHINAV MOUDGIL,Vineet Gandhi
Winter Conference on Applications of Computer Vision, WACV, 2020
@inproceedings{bib_Expl_2020, AUTHOR = {SHYAMGOPAL KARTHIK, ABHINAV MOUDGIL, Vineet Gandhi}, TITLE = {Exploring 3 R's of Long-term Tracking: Redetection, Recovery and Reliability}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2020}}
Recent works have proposed several long-term tracking benchmarks and highlight the importance of moving towards long-duration tracking to bridge the gap with application requirements. The current evaluation methodologies, however, do not focus on several aspects that are crucial from a long-term perspective, like re-detection, recovery, and reliability. In this paper, we propose novel evaluation strategies for a more in-depth analysis of trackers from a long-term perspective. More specifically, (a) we test the re-detection capability of trackers in the wild by simulating virtual cuts, (b) we investigate the role of chance in the recovery of a tracker after failure and (c) we propose a novel metric allowing visual inference on the ability of a tracker to track contiguously (without any failure) at a given accuracy. We present several original insights derived from an extensive set of quantitative and qualitative experiments.
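A toy version of this reliability-style analysis, measuring the longest contiguous run of frames tracked above an accuracy threshold, looks as follows; the exact metric definition in the paper differs, so treat this only as an assumed illustration:

```python
# Sketch: longest contiguous run of frames with IoU above a chosen threshold.
import numpy as np

def longest_contiguous_success(ious, threshold=0.5):
    ok = np.asarray(ious) >= threshold
    best = run = 0
    for good in ok:
        run = run + 1 if good else 0
        best = max(best, run)
    return best

print(longest_contiguous_success([0.8, 0.7, 0.2, 0.9, 0.9, 0.6], threshold=0.5))  # 3
```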
CineFilter: Unsupervised Filtering for Real Time Autonomous Camera Systems
ACHARY SUDHEER,Kommu Lakshmi Bhanu Moorthy,Ashar Javed,P Nikitha Shravan,Vineet Gandhi,Anoop Namboodiri
Eurographics Workshop on Intelligent Cinematography and Editing, WICED, 2020
@inproceedings{bib_Cine_2020, AUTHOR = {ACHARY SUDHEER, Kommu Lakshmi Bhanu Moorthy, Ashar Javed, P Nikitha Shravan, Vineet Gandhi, Anoop Namboodiri}, TITLE = {CineFilter: Unsupervised Filtering for Real Time Autonomous Camera Systems}, BOOKTITLE = {Eurographics Workshop on Intelligent Cinematography and Editing}. YEAR = {2020}}
Autonomous camera systems are often subjected to an optimization/filtering operation to smooth and stabilize rough trajectory estimates. Most common filtering techniques do reduce the irregularities in data; however, they fail to mimic the behavior of a human cameraman. Global filtering methods modeling human camera operators have been successful; however, they are limited to offline settings. In this paper, we propose two online filtering methods, together called CineFilter, which produce smooth camera trajectories motivated by cinematographic principles. The first filter (CineConvex) uses a sliding-window-based convex optimization formulation, and the second (CineCNN) is a CNN-based encoder-decoder model. We evaluate the proposed filters in two different settings, namely a basketball dataset and a stage performance dataset. Our models outperform previous methods and baselines on both quantitative and qualitative metrics. The CineConvex and CineCNN filters operate at about 250 fps and 1000 fps, respectively, with a minor latency (half a second), making them apt for a variety of real-time applications.
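The flavour of the CineConvex filter can be sketched with cvxpy: within a sliding window, the smoothed trajectory is fit to the noisy estimates while L1 penalties on the first and second differences favour static segments and constant-velocity pans. The weights and window size below are assumptions, and a real online system would commit only the leading samples of each solved window before sliding forward:

```python
# Hedged cvxpy sketch of sliding-window trajectory smoothing in the spirit of CineConvex.
import numpy as np
import cvxpy as cp

def smooth_window(noisy, lam1=50.0, lam2=500.0):
    x = cp.Variable(len(noisy))
    cost = cp.sum_squares(x - noisy)                          # stay close to raw estimates
    cost += lam1 * cp.norm1(cp.diff(x, 1))                    # sparse velocity -> static segments
    cost += lam2 * cp.norm1(cp.diff(x, 2))                    # sparse acceleration -> constant pans
    cp.Problem(cp.Minimize(cost)).solve()
    return x.value

window = np.cumsum(np.random.randn(60)) + 100                 # noisy 1D pan trajectory
print(smooth_window(window)[:5])
```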
Cosine meets softmax: A tough-to-beat baseline for visual grounding
Nivedita Rufus,Unni Krishnan R Nair,K Madhava Krishna,Vineet Gandhi
European Conference on Computer Vision Workshops, ECCV-W, 2020
@inproceedings{bib_Cosi_2020, AUTHOR = {Nivedita Rufus, Unni Krishnan R Nair, K Madhava Krishna, Vineet Gandhi}, TITLE = {Cosine meets softmax: A tough-to-beat baseline for visual grounding}, BOOKTITLE = {European Conference on Computer Vision Workshops}. YEAR = {2020}}
In this paper, we present a simple baseline for visual grounding for autonomous driving which outperforms state-of-the-art methods while retaining minimal design choices. Our framework minimizes the cross-entropy loss over the cosine distance between multiple image ROI features and a text embedding (representing the given sentence/phrase). We use pre-trained networks for obtaining the initial embeddings and learn a transformation layer on top of the text embedding. We perform experiments on the Talk2Car dataset and achieve 68.7% AP50 accuracy, improving upon the previous state of the art by 8.6%. By showing the promise of simpler alternatives, our investigation suggests reconsidering approaches that employ sophisticated attention mechanisms, multi-stage reasoning, or complex metric-learning loss functions.
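The scoring idea, cosine similarities between ROI features and the phrase embedding treated as logits for a softmax cross-entropy over candidate regions, fits in a few lines of PyTorch. The feature extractors and learned text transformation are omitted, and the temperature is an added assumption, so this is a simplified sketch rather than the paper's exact objective:

```python
# PyTorch sketch of cosine-similarity logits with a softmax cross-entropy over regions.
import torch
import torch.nn.functional as F

def grounding_loss(roi_feats, text_emb, target_idx, temperature=0.1):
    """roi_feats: (N, D) region features, text_emb: (D,) phrase embedding,
    target_idx: index of the ground-truth region."""
    sims = F.cosine_similarity(roi_feats, text_emb.unsqueeze(0), dim=-1)   # (N,)
    logits = sims / temperature
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_idx]))

loss = grounding_loss(torch.randn(32, 256), torch.randn(256), target_idx=7)
print(loss.item())
```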
GAZED-Gaze-guided Cinematic Editing of Wide-Angle Monocular Video Recordings
Kommu Lakshmi Bhanu Moorthy,Moneish Kumar,Ramanathan Subramanian,Vineet Gandhi
Technical Report, arXiv, 2020
@inproceedings{bib_GAZE_2020, AUTHOR = {Kommu Lakshmi Bhanu Moorthy, Moneish Kumar, Ramanathan Subramanian, Vineet Gandhi}, TITLE = {GAZED-Gaze-guided Cinematic Editing of Wide-Angle Monocular Video Recordings}, BOOKTITLE = {Technical Report}. YEAR = {2020}}
We present GAZED-eye GAZe-guided EDiting for videos captured by a solitary, static, wide-angle and high-resolution camera. Eye-gaze has been effectively employed in computational applications as a cue to capture interesting scene content; we employ gaze as a proxy to select shots for inclusion in the edited video. Given the original video, scene content and user eye-gaze tracks are combined to generate an edited video comprising cinematically valid actor shots and shot transitions to generate an aesthetic and vivid representation of the original narrative. We model cinematic video editing as an energy minimization problem over shot selection, whose constraints capture cinematographic editing conventions. Gazed scene locations primarily determine the shots constituting the edited video. Effectiveness of GAZED against multiple competing methods is demonstrated via a psychophysical study involving 12 …
Methods and systems of automatic one click virtual button with ai assist for diy animation
Vineet Gandhi,Srinivasa Raghavan Rajendran
United States Patent, Us patent, 2020
@inproceedings{bib_Meth_2020, AUTHOR = {Vineet Gandhi, Srinivasa Raghavan Rajendran}, TITLE = {Methods and systems of automatic one click virtual button with ai assist for diy animation}, BOOKTITLE = {United States Patent}. YEAR = {2020}}
In one aspect, a computerized method of automatically generating video content using a one click artificial-intelligence assistant for generating an animation video includes the step of providing a do-it-yourself (DIY) computer animation generation system. The DIY computer animation generation system includes an animation generation dashboard. The method includes the step of providing a one click AI assistant for generating an animation video in the DIY computer animation generation system. The method includes the step of providing a one click virtual button that is displayed in the animation generation dashboard. The one click AI assistant automatically suggests a set of animation choices to a user on a single button press of the one click virtual button.
Talk to the Vehicle: Language Conditioned Autonomous Navigation of Self Driving Cars.
SRIRAM NARAYANAN,Maniar Tirth Anup,Jayaganesh K,Vineet Gandhi,Brojeshwar Bhowmick,K Madhava Krishna
International Conference on Intelligent Robots and Systems, IROS, 2019
@inproceedings{bib_Talk_2019, AUTHOR = {SRIRAM NARAYANAN, Maniar Tirth Anup, Jayaganesh K, Vineet Gandhi, Brojeshwar Bhowmick, K Madhava Krishna}, TITLE = {Talk to the Vehicle: Language Conditioned Autonomous Navigation of Self Driving Cars.}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2019}}
We propose a novel pipeline that blends encodings from natural language and 3D semantic maps obtained from visual imagery to generate local trajectories that are executed by a low-level controller. The pipeline precludes the need for a prior registered map through a local waypoint generator neural network. The waypoint generator network (WGN) maps semantics and natural language encodings (NLE) to local waypoints. A local planner then generates a trajectory from the ego location of the vehicle (an outdoor car in this case) to these locally generated waypoints, while a low-level controller executes these plans faithfully. The efficacy of the pipeline is verified in the CARLA simulator environment as well as on local semantic maps built from the real-world KITTI dataset. In both these environments (simulated and real-world) we show the ability of the WGN to generate waypoints accurately by mapping NLE of varying sequence lengths and levels of complexity. We compare with baseline approaches and show significant performance gains over them. Finally, we show real implementations on our electric car, verifying that the pipeline lends itself to practical and tangible realizations in uncontrolled outdoor settings. In-loop execution of the proposed pipeline, which involves repeated invocations of the network, is critical for any such language-based navigation framework; this effort accomplishes it successfully, thereby bypassing the need for prior metric maps or strategies for metric-level localization during traversal.
Nose, eyes and ears: Head pose estimation by locating facial keypoints
ARYAMAN GUPTA,KALPIT THAKKAR,Vineet Gandhi,Narayanan P J
International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2019
@inproceedings{bib_Nose_2019, AUTHOR = {ARYAMAN GUPTA, KALPIT THAKKAR, Vineet Gandhi, Narayanan P J}, TITLE = {Nose, eyes and ears: Head pose estimation by locating facial keypoints}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}. YEAR = {2019}}
Monocular head pose estimation requires learning a model that computes the intrinsic Euler angles of pose (yaw, pitch, roll) from an input image of a human face. Annotating ground-truth head pose angles for images in the wild is difficult and requires ad-hoc fitting procedures (which provide only coarse and approximate annotations). This highlights the need for approaches which can train on data captured in a controlled environment and generalize to images in the wild (with varying appearance and illumination of the face). Most present-day deep learning approaches, which learn a regression function directly on the input images, fail to do so. To this end, we propose to use a higher-level representation to regress the head pose while using deep learning architectures. More specifically, we use uncertainty maps in the form of 2D soft localization heatmap images over five facial keypoints, namely left ear, right ear, left eye, right eye and nose, and pass them through a convolutional neural network to regress the head pose. We show head pose estimation results on two challenging benchmarks, BIWI and AFLW, and our approach surpasses the state of the art on both datasets.
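A sketch of the intermediate representation, five Gaussian keypoint heatmaps stacked into a multi-channel image that a small CNN maps to (yaw, pitch, roll), is given below; the network and heatmap parameters are illustrative assumptions, not the paper's architecture:

```python
# Illustrative sketch: facial-keypoint heatmaps as input to a pose-regression CNN.
import torch
import torch.nn as nn

def keypoint_heatmaps(keypoints, size=64, sigma=3.0):
    """keypoints: (5, 2) pixel coordinates -> (5, size, size) Gaussian heatmaps."""
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()               # (H, W, 2)
    d2 = ((grid[None] - keypoints[:, None, None, :]) ** 2).sum(-1)
    return torch.exp(-d2 / (2 * sigma ** 2))

pose_net = nn.Sequential(                                      # toy stand-in regressor
    nn.Conv2d(5, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 3),                                          # yaw, pitch, roll
)

heatmaps = keypoint_heatmaps(torch.tensor([[10., 20.], [50., 20.], [20., 40.],
                                           [44., 40.], [32., 52.]]))
print(pose_net(heatmaps.unsqueeze(0)).shape)                   # torch.Size([1, 3])
```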
Learning unsupervised visual grounding through semantic self-supervision
Syed Ashar Javed,Shreyas Saxena,Vineet Gandhi
International Joint Conference on Artificial Intelligence, IJCAI, 2019
@inproceedings{bib_Lear_2019, AUTHOR = {Syed Ashar Javed, Shreyas Saxena, Vineet Gandhi}, TITLE = {Learning unsupervised visual grounding through semantic self-supervision}, BOOKTITLE = {International Joint Conference on Artificial Intelligence}. YEAR = {2019}}
Localizing natural language phrases in images is a challenging problem that requires joint understanding of both the textual and visual modalities. In the unsupervised setting, the lack of supervisory signals exacerbates this difficulty. In this paper, we propose a novel framework for unsupervised visual grounding which uses concept learning as a proxy task to obtain self-supervision. The intuition behind this idea is to encourage the model to localize to regions which can explain some semantic property in the data, in our case, the property being the presence of a concept in a set of images. We present thorough quantitative and qualitative experiments to demonstrate the efficacy of our approach and show a 5.6% improvement over the current state of the art on the Visual Genome dataset, a 5.8% improvement on the ReferItGame dataset and performance comparable to the state of the art on the Flickr30k dataset.
Long-term visual object tracking benchmark
ABHINAV MOUDGIL,Vineet Gandhi
Asian Conference on Computer Vision, ACCV, 2018
@inproceedings{bib_Long_2018, AUTHOR = {ABHINAV MOUDGIL, Vineet Gandhi}, TITLE = {Long-term visual object tracking benchmark}, BOOKTITLE = {Asian Conference on Computer Vision}. YEAR = {2018}}
We propose a new long video dataset (called Track Long and Prosper - TLP) and benchmark for single object tracking. The dataset consists of 50 HD videos from real-world scenarios, encompassing a duration of over 400 minutes (676K frames), making it more than 20-fold larger in average duration per sequence and more than 8-fold larger in terms of total covered duration, as compared to existing generic datasets for visual tracking. The proposed dataset paves the way to suitably assess long-term tracking performance and train better deep learning architectures (avoiding/reducing augmentation, which may not reflect real-world behaviour). We benchmark the dataset on 17 state-of-the-art trackers and rank them according to tracking accuracy and run-time speeds. We further present thorough qualitative and quantitative evaluation highlighting the importance of the long-term aspect of tracking. Our most interesting observations are (a) existing short sequence benchmarks fail to bring out the inherent differences in tracking algorithms, which widen while tracking on long sequences, and (b) the accuracy of trackers drops abruptly on challenging long sequences, suggesting the potential need for research efforts in the direction of long-term tracking.
Watch to Edit: Video Retargeting using Gaze
RACHAVARAPU KRANTHI KUMAR,MONEISH KUMAR,Vineet Gandhi,Ramanathan Subramanian
Computer Graphics Forum, CGF, 2018
@inproceedings{bib_Watc_2018, AUTHOR = {RACHAVARAPU KRANTHI KUMAR, MONEISH KUMAR, Vineet Gandhi, Ramanathan Subramanian}, TITLE = {Watch to Edit: Video Retargeting using Gaze}, BOOKTITLE = {Computer Graphics Forum}. YEAR = {2018}}
We present a novel approach to optimally retarget videos for varied displays with differing aspect ratios by preserving salient scene content discovered via eye tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions, and (ii) adhere to the principles of cinematography. Our approach is (a) content-agnostic, as the same methodology is employed to re-edit a wide-angle video recording or a close-up movie sequence captured with a static or moving camera, and (b) independent of video length and can in principle re-edit an entire movie in one shot. Our algorithm consists of two steps. The first step employs gaze transition cues to detect time stamps where new cuts are to be introduced in the original video via dynamic programming. A subsequent step optimizes the cropping window path (to create pan and zoom effects), while accounting for the original and new cuts. The cropping window path is designed to include maximum gaze information, and is composed of piecewise constant, linear and parabolic segments. It is obtained via L1-regularized convex optimization which ensures a smooth viewing experience. We test our approach on a wide variety of videos and demonstrate significant improvement over the state of the art, both in terms of computational complexity and qualitative aspects. A study performed with 16 users confirms that our approach results in a superior viewing experience as compared to gaze-driven re-editing [JSSH15] and letterboxing methods, especially for wide-angle static camera recordings.
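The second step, the cropping-window path optimization, can be sketched in cvxpy for a 1D horizontal path: the window centre tracks per-frame gaze centroids while L1 penalties on the first three differences encourage the piecewise constant, linear and parabolic segments mentioned above. The weights and frame sizes are assumptions:

```python
# Hedged cvxpy sketch of an L1-regularised cropping-window path (1D simplification).
import numpy as np
import cvxpy as cp

def crop_path(gaze_x, frame_w=1920, crop_w=640, l1=20.0, l2=200.0, l3=2000.0):
    x = cp.Variable(len(gaze_x))
    cost = cp.sum_squares(x - gaze_x)                         # follow gaze centroids
    cost += l1 * cp.norm1(cp.diff(x, 1))                      # static segments
    cost += l2 * cp.norm1(cp.diff(x, 2))                      # linear (constant-velocity) segments
    cost += l3 * cp.norm1(cp.diff(x, 3))                      # parabolic (ease-in/out) segments
    constraints = [x >= crop_w / 2, x <= frame_w - crop_w / 2]  # window stays inside the frame
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return x.value

gaze = 960 + 200 * np.sin(np.linspace(0, 3, 120)) + 30 * np.random.randn(120)
print(crop_path(gaze)[:5])
```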
AN ITERATIVE APPROACH FOR SHADOW REMOVAL IN DOCUMENT IMAGES
VATSAL SHAH,Vineet Gandhi
International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2018
@inproceedings{bib_AN_I_2018, AUTHOR = {VATSAL SHAH, Vineet Gandhi}, TITLE = {AN ITERATIVE APPROACH FOR SHADOW REMOVAL IN DOCUMENT IMAGES}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}. YEAR = {2018}}
Uneven illumination and shadows in document images pose a challenge for digitization applications and automated workflows. In this work, we propose a new method to recover unshadowed document images from images with shadows/uneven illumination. We pose this problem as one of estimating the shading and reflectance components of the given original image. Our method first estimates the shading and uses it to compute the reflectance. The output reflectance map is then used to improve the shading, and the process is repeated in an iterative manner. The iterative procedure allows for gradual compensation and enables our algorithm to handle even difficult hard shadows without introducing artifacts. Experiments over two different datasets demonstrate the efficacy of our algorithm, and its low computational complexity makes it suitable for most practical applications.
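A heavily simplified sketch of the alternating shading/reflectance refinement is shown below; the paper's shading estimator is more sophisticated, so the Gaussian-blur stand-in and the iteration count are assumptions:

```python
# Simplified sketch of iterative shading/reflectance estimation for shadow removal.
import numpy as np
from scipy.ndimage import gaussian_filter

def remove_shadows(gray, iterations=3, sigma=21):
    """gray: float image in [0, 1]; returns an illumination-compensated estimate."""
    shading = gaussian_filter(gray, sigma)                              # initial smooth illumination
    for _ in range(iterations):
        reflectance = np.clip(gray / (shading + 1e-6), 0, 1.5)          # divide out the shading
        shading = gaussian_filter(gray / (reflectance + 1e-6), sigma)   # refine shading from reflectance
    return np.clip(reflectance / reflectance.max(), 0, 1)

page = np.clip(0.7 + 0.2 * np.random.rand(256, 256), 0, 1)              # stand-in document image
print(remove_shadows(page).shape)
```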
Document Quality Estimation using Spatial Frequency Response
PRANJAL KUMAR RAI,SAJAL MAHESHWARI,Vineet Gandhi
International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2018
@inproceedings{bib_Docu_2018, AUTHOR = {PRANJAL KUMAR RAI, SAJAL MAHESHWARI, Vineet Gandhi}, TITLE = {Document Quality Estimation using Spatial Frequency Response}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}. YEAR = {2018}}
The current Document Image Quality Assessment (DIQA) algorithms directly relate the Optical Character Recognition (OCR) accuracies with the quality of the document to build supervised learning frameworks. This direct correlation has two major limitations: (a) OCR may be affected by factors independent of the quality of the capture and (b) it cannot account for blur variations within an image. An alternate possibility is to quantify the quality of capture using human judgement, however, it is subjective and prone to error. In this work, we build upon the idea of Spatial Frequency Response (SFR) to reliably quantify the quality of a document image. We present through quantitative and qualitative experiments that the proposed metric leads to significant improvement in document quality prediction in contrast to using OCR as ground truth.
MergeNet: A Deep Net Architecture for Small Obstacle Discovery
KRISHNAM GUPTA,Syed Ashar Javed,Vineet Gandhi,K Madhava Krishna
International Conference on Robotics and Automation, ICRA, 2018
@inproceedings{bib_Merg_2018, AUTHOR = {KRISHNAM GUPTA, Syed Ashar Javed, Vineet Gandhi, K Madhava Krishna}, TITLE = {MergeNet: A Deep Net Architecture for Small Obstacle Discovery}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2018}}
We present MergeNet, a novel network architecture for discovering small obstacles in on-road scenes in the context of autonomous driving. The basis of the architecture rests on the central consideration of training with a small amount of data, since the physical setup and the annotation process for small obstacles are hard to scale. To make effective use of the limited data, we propose a multi-stage training procedure involving weight-sharing, separate learning of low- and high-level features from the RGBD input, and a refining stage which learns to fuse the obtained complementary features. The model is trained and evaluated on the Lost and Found dataset and achieves state-of-the-art results with just 135 images, in comparison to the 1000 images used by the previous benchmark. Additionally, we also compare our results with recent methods trained on 6000 images and show that our method achieves comparable performance with only 1000 training samples.
Automated top view registration of broadcast football videos
RAHUL ANAND SHARMA,BHARATH BHAT,Vineet Gandhi,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2018
@inproceedings{bib_Auto_2018, AUTHOR = {RAHUL ANAND SHARMA, BHARATH BHAT, Vineet Gandhi, Jawahar C V}, TITLE = {Automated top view registration of broadcast football videos}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2018}}
In this paper, we propose a novel method to register football broadcast video frames on the static top view model of the playing surface. The proposed method is fully automatic in contrast to the current state of the art which requires manual initialization of point correspondences between the image and the static model. Automatic registration using existing approaches has been difficult due to the lack of sufficient point correspondences. We investigate an alternate approach exploiting the edge information from the line markings on the field. We formulate the registration problem as a nearest neighbour search over a synthetically generated dictionary of edge map and homography pairs. The synthetic dictionary generation allows us to exhaustively cover a wide variety of camera angles and positions and reduce this problem to a minimal per-frame edge map matching procedure. We show that the per-frame results can be improved in videos using an optimization framework for temporal camera stabilization. We demonstrate the efficacy of our approach by presenting extensive results on a dataset collected from matches of football World Cup 2014.
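The per-frame registration step reduces to a nearest-neighbour lookup: pre-render a dictionary of synthetic edge maps with known homographies, then match the observed edge map and reuse the homography of the closest entry. The raw Euclidean distance and the random stand-ins below are assumptions for illustration (the paper uses a purpose-built edge-map matching procedure):

```python
# Conceptual sketch: nearest-neighbour search over a synthetic (edge map, homography) dictionary.
import numpy as np

def build_dictionary(render_edge_map, homographies):
    """homographies: list of 3x3 arrays; render_edge_map(H) -> binary edge image."""
    return [(render_edge_map(H).astype(np.float32).ravel(), H) for H in homographies]

def register_frame(edge_map, dictionary):
    query = edge_map.astype(np.float32).ravel()
    dists = [np.linalg.norm(query - entry) for entry, _ in dictionary]
    return dictionary[int(np.argmin(dists))][1]          # homography of the nearest entry

# Toy usage with random stand-ins for rendered edge maps.
rng = np.random.default_rng(0)
Hs = [np.eye(3) * s for s in (1.0, 1.1, 1.2)]
dictionary = build_dictionary(lambda H: rng.random((90, 160)) > 0.9, Hs)
print(register_frame(rng.random((90, 160)) > 0.9, dictionary))
```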
Learning Unsupervised Visual Grounding Through Semantic Self-Supervision
Syed Ashar Javed,Shreyas Saxena,Vineet Gandhi
Technical Report, arXiv, 2018
@inproceedings{bib_Lear_2018, AUTHOR = {Syed Ashar Javed, Shreyas Saxena, Vineet Gandhi}, TITLE = {Learning Unsupervised Visual Grounding Through Semantic Self-Supervision}, BOOKTITLE = {Technical Report}. YEAR = {2018}}
Localizing natural language phrases in images is a challenging problem that requires joint understanding of both the textual and visual modalities. In the unsupervised setting, the lack of supervisory signals exacerbates this difficulty. In this paper, we propose a novel framework for unsupervised visual grounding which uses concept learning as a proxy task to obtain self-supervision. The intuition behind this idea is to encourage the model to localize to regions which can explain some semantic property in the data, in our case, the property being the presence of a concept in a set of images. We present thorough quantitative and qualitative experiments to demonstrate the efficacy of our approach and show a 5.6% improvement over the current state of the art on the Visual Genome dataset, a 5.8% improvement on the ReferItGame dataset and performance comparable to the state of the art on the Flickr30k dataset.
3D Region Proposals For Selective Object Search.
ARRABOTU SHEETAL REDDY,Vineet Gandhi,K Madhava Krishna
International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applicat, VISIGRAPP, 2017
@inproceedings{bib_3D_R_2017, AUTHOR = {ARRABOTU SHEETAL REDDY, Vineet Gandhi, K Madhava Krishna}, TITLE = {3D Region Proposals For Selective Object Search.}, BOOKTITLE = {International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applicat}. YEAR = {2017}}
The advent of indoor personal mobile robots has clearly demonstrated their utility in assisting humans at various places such as workshops, offices, and homes. One of the most important cases in such autonomous scenarios is where the robot has to search for certain objects in large rooms. Exploring the whole room would prove to be extremely expensive in terms of both computing power and time. To address this issue, we demonstrate a fast algorithm to reduce the search space by identifying possible object locations as two classes, namely Support Structures and Clutter. Support Structures are plausible object containers in a scene, such as tables, chairs, sofas, etc. Clutter refers to places where there seem to be several objects that cannot be clearly distinguished; it can also be identified as unorganized regions which can be of interest for tasks such as robot grasping, fetching and placing objects. The primary contribution of this paper is to quickly identify potential object locations using a Support Vector Machine (SVM) learnt over features extracted from the depth map and the RGB image of the scene, which further culminates in a densely connected Conditional Random Field (CRF) formulated over the image of the scene. The inference over the CRF leads to the assignment of the labels support structure, clutter, or other to each pixel. There have been reliable outcomes even during challenging scenarios, such as the support structures being far from the robot. The experiments demonstrate the efficacy and speed of the algorithm irrespective of alterations to camera angles, appearance changes, lighting, and distance from locations.
Automated Top View Registration of Broadcast Football Videos
RAHUL ANAND SHARMA,BHARATH BHAT,Vineet Gandhi,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2017
@inproceedings{bib_Auto_2017, AUTHOR = {RAHUL ANAND SHARMA, BHARATH BHAT, Vineet Gandhi, Jawahar C V}, TITLE = {Automated Top View Registration of Broadcast Football Videos}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2017}}
In this paper, we propose a novel method to register football broadcast video frames on the static top view model of the playing surface. The proposed method is fully automatic in contrast to the current state of the art which requires manual initialization of point correspondences between the image and the static model. Automatic registration using existing approaches has been difficult due to the lack of sufficient point correspondences. We investigate an alternate approach exploiting the edge information from the line markings on the field. We formulate the registration problem as a nearest neighbour search over a synthetically generated dictionary of edge map and homography pairs. The synthetic dictionary generation allows us to exhaustively cover a wide variety of camera angles and positions and reduce this problem to a minimal per-frame edge map matching procedure. We show that the per-frame results can …
Automatic analysis of broadcast football videos using contextual priors
RAHUL ANAND SHARMA,Vineet Gandhi,Jawahar C V
Signal,Image and Video Processing, SIViP, 2017
@inproceedings{bib_Auto_2017, AUTHOR = {RAHUL ANAND SHARMA, Vineet Gandhi, Jawahar C V}, TITLE = {Automatic analysis of broadcast football videos using contextual priors}, BOOKTITLE = {Signal,Image and Video Processing}. YEAR = {2017}}
The presence of standard video editing practices in broadcast sports videos, like football, effectively means that such videos have stronger contextual priors than most generic videos. In this paper, we show that such information can be harnessed for automatic analysis of sports videos. Specifically, given an input video, we output per-frame information about camera angles and the events (goal, foul, etc.). Our main insight is that in the presence of temporal context (camera angles) for a video, the problem of event tagging (fouls, corners, goals, etc.) can be cast as a per-frame multi-class classification problem. We show that even with simple classifiers like a linear SVM, we get significant improvement in the event tagging task when contextual information is included. We present extensive results for 10 matches from the recently concluded Football World Cup to demonstrate the effectiveness of our approach.
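The insight that event tagging becomes per-frame multi-class classification once camera-angle context is available can be mimicked with a few lines of scikit-learn; the synthetic features, labels and one-hot context encoding below are stand-ins, not the paper's pipeline:

```python
# Minimal sklearn sketch: appearance features + one-hot camera-angle context -> linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_frames, feat_dim, n_angles = 500, 64, 5
appearance = rng.random((n_frames, feat_dim))
camera_angle = rng.integers(0, n_angles, n_frames)          # assumed per-frame context labels
context = np.eye(n_angles)[camera_angle]                    # one-hot contextual prior
X = np.hstack([appearance, context])
y = rng.integers(0, 4, n_frames)                            # e.g. goal / foul / corner / none
clf = LinearSVC().fit(X[:400], y[:400])
print("held-out accuracy:", clf.score(X[400:], y[400:]))
```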
Long-term visual object tracking benchmark
ABHINAV MOUDGIL,Vineet Gandhi
Asian Conference on Computer Vision, ACCV, 2017
@inproceedings{bib_Long_2017, AUTHOR = {ABHINAV MOUDGIL, Vineet Gandhi}, TITLE = {Long-term visual object tracking benchmark}, BOOKTITLE = {Asian Conference on Computer Vision}. YEAR = {2017}}
We propose a new long video dataset (called Track Long and Prosper - TLP) and benchmark for single object tracking. The dataset consists of 50 HD videos from real-world scenarios, encompassing a duration of over 400 minutes (676K frames), making it more than 20-fold larger in average duration per sequence and more than 8-fold larger in terms of total covered duration, as compared to existing generic datasets for visual tracking. The proposed dataset paves the way to suitably assess long-term tracking performance and train better deep learning architectures (avoiding/reducing augmentation, which may not reflect real-world behaviour). We benchmark the dataset on 17 state-of-the-art trackers and rank them according to tracking accuracy and run-time speeds. We further present thorough qualitative and quantitative evaluation highlighting the importance of the long-term aspect of tracking. Our most interesting observations are (a) existing short sequence benchmarks fail to bring out the inherent differences in tracking algorithms, which widen while tracking on long sequences, and (b) the accuracy of trackers drops abruptly on challenging long sequences, suggesting the potential need for research efforts in the direction of long-term tracking.
Beyond ocrs for document blur estimation
PRANJAL KUMAR RAI,SAJAL MAHESHWARI,MEHTA ISHIT BHADRESH,Parikshit Sakurikar,Vineet Gandhi
International Conference on Document Analysis and Recognition, ICDAR, 2017
@inproceedings{bib_Beyo_2017, AUTHOR = {PRANJAL KUMAR RAI, SAJAL MAHESHWARI, MEHTA ISHIT BHADRESH, Parikshit Sakurikar, Vineet Gandhi}, TITLE = {Beyond ocrs for document blur estimation}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2017}}
The current document blur/quality estimation algorithms rely on the OCR accuracy to measure their success. A sharp document image, however, at times may yield lower OCR accuracy owing to factors independent of blur or quality of capture. The necessity to rely on OCR is mainly due to the difficulty in quantifying the quality otherwise. In this work, we overcome this limitation by proposing a novel dataset for document blur estimation, for which we physically quantify the blur using a capture set-up which computationally varies the focal distance of the camera. We also present a selective search mechanism to improve upon the recently successful patch-based learning approaches (using codebooks or convolutional neural networks). We present a thorough analysis of the improved blur estimation pipeline using correlation with OCR accuracy as well as the actual amount of blur. Our experiments demonstrate that our method outperforms the current state-of-the-art by a significant margin.
Small obstacle detection using stereo vision for autonomous ground vehicle
KRISHNAM GUPTA,SARTHAK UPADHYAY,Vineet Gandhi,K Madhava Krishna
Advances in Robotics, AIR, 2017
@inproceedings{bib_Smal_2017, AUTHOR = {KRISHNAM GUPTA, SARTHAK UPADHYAY, Vineet Gandhi, K Madhava Krishna}, TITLE = {Small obstacle detection using stereo vision for autonomous ground vehicle}, BOOKTITLE = {Advances in Robotics}. YEAR = {2017}}
Small and medium-sized obstacles such as rocks, small boulders, and bricks left unattended on the road can pose hazards for autonomous as well as human driving. Many times these objects are too small and go unnoticed in depth and point cloud maps obtained from state-of-the-art range sensors such as 3D LIDAR. We propose a novel algorithm that fuses both appearance and 3D cues, such as image gradients, curvature potentials and depth variance, into a Markov Random Field (MRF) formulation that segments the scene into obstacle and non-obstacle regions. Appearance and depth data obtained from a ZED stereo pair mounted on a Husky robot are used for this purpose. While accurately identifying true-positive obstacles such as rocks and large stones, our algorithm is simultaneously robust to false-positive sources such as appearance changes on the road, papers and road markings. High-accuracy detection in challenging scenes, such as when the foreground obstacle blends with the background road scene, vindicates the efficacy of the proposed formulation.
Zooming on all actors: Automatic focus+ context split screen video generation
MONEISH KUMAR,Vineet Gandhi,Remi Ronfard,Michael Gleicher
Computer Graphics Forum, CGF, 2017
@inproceedings{bib_Zoom_2017, AUTHOR = {MONEISH KUMAR, Vineet Gandhi, Remi Ronfard, Michael Gleicher}, TITLE = {Zooming on all actors: Automatic focus+ context split screen video generation}, BOOKTITLE = {Computer Graphics Forum}. YEAR = {2017}}
Recordings of stage performances are easy to capture with a high-resolution camera, but are difficult to watch because the actors’ faces are too small. We present an approach to automatically create a split-screen video that transforms these recordings to show both the context of the scene and close-up details of the actors. Given a static recording of a stage performance and tracking information about the actors' positions, our system generates videos showing a focus+context view based on computed close-up camera motions using crop-and-zoom. The key to our approach is to compute these camera motions such that they are cinematically valid close-ups and to ensure that the set of views of the different actors are properly coordinated and presented. We pose the computation of camera motions as a convex optimization that creates detailed views and smooth movements, subject to cinematic constraints such as not cutting faces with the edge of the frame. Additional constraints link the close-up views of each actor, causing them to merge seamlessly when actors are close. Generated views are placed in a resulting layout that preserves the spatial relationships between actors. We demonstrate our results on a variety of staged theater and dance performances.
Document blur detection using edge profile mining
SAJAL MAHESHWARI,PRANJAL KUMAR RAI,Gopal Sharma,Vineet Gandhi
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2016
@inproceedings{bib_Docu_2016, AUTHOR = {SAJAL MAHESHWARI, PRANJAL KUMAR RAI, Gopal Sharma, Vineet Gandhi}, TITLE = {Document blur detection using edge profile mining}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2016}}
We present an algorithm for automatic blur detection of document images using a novel approach based on edge intensity profiles. Our main insight is that the edge profiles are a strong indicator of the blur present in the image, with steep profiles implying sharper regions and gradual profiles implying blurred regions. Our approach first retrieves the profiles for each point of intensity transition (each edge point) along the gradient and then uses them to output a quantitative measure indicating the extent of blur in the input image. The real-time performance of the proposed approach makes it suitable for most applications. Additionally, our method works for both handwritten and digital documents and is agnostic to font types and sizes, which gives it a major advantage over the currently prevalent learning-based approaches. Extensive quantitative and qualitative experiments over two different datasets show that our method outperforms almost all algorithms in the current state of the art by a significant margin, especially in cross-dataset experiments.
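A simplified version of the edge-profile measurement, estimating how many pixels each strong intensity transition takes to rise from 10% to 90%, is sketched below; the thresholds and the horizontal-only scan are assumptions:

```python
# Simplified edge-profile blur score: wide average transition widths indicate blur.
import numpy as np

def edge_profile_blur(gray, grad_thresh=0.15):
    widths = []
    for row in gray:
        grad = np.abs(np.diff(row))
        for c in np.where(grad > grad_thresh)[0]:            # strong horizontal transitions
            lo, hi = max(0, c - 5), min(len(row) - 1, c + 5)
            seg = row[lo:hi + 1]
            span = seg.max() - seg.min()
            if span < 1e-3:
                continue
            norm = (seg - seg.min()) / span
            widths.append(np.count_nonzero((norm > 0.1) & (norm < 0.9)))
    return float(np.mean(widths)) if widths else 0.0          # higher = more blur

sharp = np.tile(np.repeat([0.1, 0.9], 16), (32, 1))           # synthetic perfectly sharp edge
print(edge_profile_blur(sharp))                               # 0.0
```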
A Computational Framework for Vertical Video Editing
Vineet Gandhi,Rémi Ronfard
Eurographics Workshop on Intelligent Cinematography and Editing, WICED, 2016
@inproceedings{bib_A_Co_2016, AUTHOR = {Vineet Gandhi, Rémi Ronfard}, TITLE = {A Computational Framework for Vertical Video Editing}, BOOKTITLE = {Eurographics Workshop on Intelligent Cinematography and Editing}. YEAR = {2016}}
Vertical video editing is the process of digitally editing the image within the frame as opposed to horizontal video editing, which arranges the shots along a timeline. Vertical editing can be a time-consuming and error-prone process when using manual key-framing and simple interpolation. In this paper, we present a general framework for automatically computing a variety of cinematically plausible shots from a single input video suitable to the special case of live performances. Drawing on working practices in traditional cinematography, the system acts as a virtual camera assistant to the film editor, who can call novel shots in the edit room with a combination of high-level instructions and manually selected keyframes.
Capturing and indexing rehearsals: the design and usage of a digital archive of performing arts
Rémi Ronfard,Benoit Encelle,Nicolas Sauret,P.-A. Champin,Thomas Steiner,Vineet Gandhi,Cyrille Migniot,Florent Thiery
International Conference on Digital Heritage, ICDH, 2015
@inproceedings{bib_Capt_2015, AUTHOR = {Rémi Ronfard, Benoit Encelle, Nicolas Sauret, P.-A. Champin, Thomas Steiner, Vineet Gandhi, Cyrille Migniot, Florent Thiery}, TITLE = {Capturing and indexing rehearsals: the design and usage of a digital archive of performing arts}, BOOKTITLE = {International Conference on Digital Heritage}. YEAR = {2015}}
Preserving the cultural heritage of the performing arts raises difficult and sensitive issues, as each performance is unique by nature and the juxtaposition between the performers and the audience cannot be easily recorded. In this paper, we report on an experimental research project to preserve another aspect of the performing arts—the history of their rehearsals. We have specifically designed non-intrusive video recording and on-site documentation techniques to make this process transparent to the creative crew, and have developed a complete workflow to publish the recorded video data and their corresponding meta-data online as Open Data using state-of-the-art audio and video processing to maximize non-linear navigation and hypervideo linking. The resulting open archive is made publicly available to researchers and amateurs alike and offers a unique account of the inner workings of the worlds of theater and opera.
Multi-Clip Video Editing from a Single Viewpoint
Vineet Gandhi,Remi Ronfard,Michael Gleicher
Conference for Visual Media Production, CVMP, 2014
@inproceedings{bib_Mult_2014, AUTHOR = {Vineet Gandhi, Remi Ronfard, Michael Gleicher}, TITLE = {Multi-Clip Video Editing from a Single Viewpoint}, BOOKTITLE = {Conference for Visual Media Production}. YEAR = {2014}}
We propose a framework for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Assuming important actors and objects can be localized using computer vision techniques, our method requires only minimal user input to define the subject matter of each sub-clip. The composition of each sub-clip is automatically computed in a novel L1-norm optimization framework. Our approach encodes several common cinematographic practices into a single convex cost function minimization problem, resulting in aesthetically pleasing sub-clips which can easily be edited together using off-the-shelf multi-clip video editing software. We demonstrate our approach on five video sequences of a live theatre performance by generating multiple synchronized subclips for each sequence.
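A rough cvxpy sketch of one sub-clip's virtual pan (1D horizontal only) is given below. The quadratic smoothness term stands in for the paper's L1-norm objective, and the frame sizes, margins and weights are assumptions:

```python
# Hedged cvxpy sketch: a virtual pan-tilt-zoom crop centre follows a tracked actor
# under inclusion and smoothness constraints (1D simplification of the composition step).
import numpy as np
import cvxpy as cp

def virtual_pan(actor_x, frame_w=3840, crop_w=960, margin=80, lam=300.0):
    T = len(actor_x)
    c = cp.Variable(T)                                  # crop-window centre per frame
    cost = cp.sum_squares(c - actor_x) + lam * cp.sum_squares(cp.diff(c, 2))
    constraints = [
        c - crop_w / 2 + margin <= actor_x,             # actor not cut by the left edge
        c + crop_w / 2 - margin >= actor_x,             # actor not cut by the right edge
        c >= crop_w / 2, c <= frame_w - crop_w / 2,     # crop stays inside the frame
    ]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return c.value

track = 1800 + 600 * np.sin(np.linspace(0, 2, 90))      # toy actor track in pixels
print(virtual_pan(track)[:5])
```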