Abstract
In recent years, computer vision has made remarkable progress in understanding visual scenes, including tasks such as object detection, human pose estimation, semantic segmentation, and instance
segmentation. These advancements are largely driven by high-capacity models, such as deep neural
networks, trained in fully supervised settings with large-scale labeled data sets. However, reliance on
extensive annotations poses scalability challenges due to the significant human effort required to create
these data sets. Fine-grained annotations, such as pixel-level segmentation masks, keypoint coordinates
for pose estimation, or detailed object instance boundaries, provide the high precision needed for many
tasks but are extremely time-consuming and costly to produce. Coarse annotations, on the other hand,
such as image-level labels or approximate scribbles, are much easier and faster to create but lack the
granularity required for detailed model supervision.
To address these challenges, researchers have increasingly explored alternatives to traditional supervised learning, with weakly supervised learning emerging as a promising approach. It mitigates annotation costs by using coarse annotations (cheaper and less detailed) during training, while the model must still produce fine-grained predictions at test time. Despite its potential, weakly supervised learning faces the challenge of transferring information from coarse annotations to fine-grained predictions, a process that often involves ambiguity and uncertainty. Existing
methods rely on various priors and heuristics to refine annotations, which are then used to train models
for specific tasks. This involves managing uncertainty in latent variables during training and ensuring
accurate predictions for both latent and output variables at test time.
This thesis introduces a unified approach to weakly supervised learning in computer vision, address-
ing tasks such as human pose estimation, object detection, and instance segmentation. Central to this
work is a framework based on the dissimilarity coefficient loss, which models uncertainty in the loca-
tion of objects and human poses using coarse annotations. The approach employs two key probability
distributions:
• Conditional Distribution: Captures output probabilities using coarse annotations (e.g., action la-
bels, image-level labels, object counts), modeled with deep generative models for efficient sam-
pling.
• Prediction Distribution: Provides test-time predictions independent of coarse annotations.
The framework minimizes the difference between these distributions using the dissimilarity coefficient loss, facilitating the transfer of information from coarse annotations to accurate predictions. This
methodology is consistently applied across diverse computer vision tasks, showcasing its versatility.
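For concreteness, the dissimilarity coefficient (in the sense of Rao) between the conditional distribution and the prediction distribution can be sketched as follows. The notation here is illustrative rather than taken from this abstract: $\Pr_c$ and $\Pr_p$ denote the conditional and prediction distributions over outputs $y$, $\Delta$ is a task-specific loss between two outputs, and $\gamma \in [0, 1]$ is a mixing weight.

```latex
\mathrm{DISC}_{\gamma}(\mathrm{Pr}_c, \mathrm{Pr}_p)
  = H(\mathrm{Pr}_c, \mathrm{Pr}_p)
  - \gamma\, H(\mathrm{Pr}_c, \mathrm{Pr}_c)
  - (1 - \gamma)\, H(\mathrm{Pr}_p, \mathrm{Pr}_p),
\qquad
H(P, Q) = \sum_{y_1} \sum_{y_2} P(y_1)\, Q(y_2)\, \Delta(y_1, y_2).
```

Minimizing this quantity pulls the prediction distribution toward the annotation-aware conditional distribution under the task loss $\Delta$, while the self-similarity terms discourage degenerate solutions.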
The efficacy of the proposed framework is demonstrated across three progressively complex visual
scene recognition tasks:
• Human Pose Estimation: A probabilistic framework is introduced for learning human poses from
still images using data sets with costly ground-truth pose annotations and inexpensive action la-
bels. By aligning the conditional and prediction distributions through the dissimilarity coefficient
loss, the method achieves significant improvements over baselines on the MPII and JHMDB data
sets, effectively leveraging action information.
• Object Detection: The framework addresses weakly supervised object detection (WSOD) by mod-
eling uncertainty in object locations using a dissimilarity coefficient-based objective. Leveraging
discrete generative models, it efficiently samples from annotation-aware conditional distributions
and integrates coarse annotations, such as image-level labels, object counts, points, and scribbles.
Spatial cluster regularization and curriculum learning further enhance performance, achieving
state-of-the-art results on benchmarks like PASCAL VOC and MS COCO.
• Instance Segmentation: The framework models uncertainty in pseudo-label generation using se-
mantic class-aware, boundary-aware, and annotation-consistent higher-order terms. By aligning
conditional and prediction distributions, it generates accurate pseudo-labels and trains Mask R-
CNN-like architectures effectively. Experiments on the PASCAL VOC 2012 data set demonstrate
state-of-the-art performance, with improved object boundary alignment and significant gains over
baselines.