Abstract
Monocular 3D human reconstruction is a highly relevant problem owing to its numerous applications in the entertainment industry, e-commerce, health care, mobile-based AR/VR platforms, etc. However, it is severely ill-posed due to self-occlusions arising from complex body poses and shapes, occlusions from clothing, lack of surface texture, background clutter, the single available view, etc. Conventional approaches address these challenges by using different sensing systems: marker-based systems, marker-less multi-view cameras, inertial sensors, and 3D scanners. Although effective, such methods are often expensive and have limited wide-scale applicability. In an attempt to produce scalable solutions, a few works have focused on fitting statistical body models to monocular images, but these remain susceptible to a costly optimization process.
Recent efforts focus on data-driven algorithms such as deep learning to learn priors directly from data. However, they address template model recovery or rigid object reconstruction, or propose paradigms that don't directly extend to recovering personalized models. As our first attempt at predicting accurate surface geometry, we proposed VolumeNet, which predicts a 3D occupancy grid from a monocular image and was the first model of its kind for non-rigid human shapes at the time. To circumvent the ill-posed nature of this problem (aggravated by an unbounded 3D representation), we follow the ideology of providing maximal priors at training time, through our unique training paradigms, so as to enable testing with minimal information.
As we did not impose any body-model-based constraint, we were able to recover deformations induced by free-form clothing. Further, we extended VolumeNet to PoShNet by decoupling Pose and Shape: we first learn the volumetric pose and then use it as a prior for learning the volumetric shape, thereby recovering a more accurate surface.
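A minimal PyTorch sketch of this two-stage idea follows. It is illustrative only: the module names (VoxelDecoderSketch, PoShNetSketch), the toy encoders, and the 32^3 grid resolution are assumptions, not the architecture used in the thesis.

import torch
import torch.nn as nn

class VoxelDecoderSketch(nn.Module):
    """Decodes a feature vector into a coarse occupancy grid (here 32^3)
    via 3D transposed convolutions."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 256 * 4 * 4 * 4)
        self.up = nn.Sequential(
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(64, 1, 4, stride=2, padding=1),  # -> 1 x 32^3 logits
        )

    def forward(self, z):
        return self.up(self.fc(z).view(-1, 256, 4, 4, 4))

class PoShNetSketch(nn.Module):
    """Stage 1 predicts a volumetric pose from image features; stage 2
    predicts the volumetric shape conditioned on that pose volume."""
    def __init__(self):
        super().__init__()
        self.image_enc = nn.Sequential(      # stand-in for a CNN backbone
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 512),
        )
        self.pose_dec = VoxelDecoderSketch()
        self.pose_enc = nn.Sequential(       # encodes the pose volume prior
            nn.Conv3d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, 512),
        )
        self.shape_dec = VoxelDecoderSketch(feat_dim=1024)

    def forward(self, img):
        z = self.image_enc(img)
        pose_logits = self.pose_dec(z)                   # stage 1: pose volume
        zp = self.pose_enc(torch.sigmoid(pose_logits))   # pose as a prior
        shape_logits = self.shape_dec(torch.cat([z, zp], dim=1))  # stage 2
        return pose_logits, shape_logits

Both volumes can be supervised with a per-voxel binary cross-entropy loss (e.g., nn.BCEWithLogitsLoss) against ground-truth occupancy grids.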
Although volumetric regression enables a more accurate surface reconstruction, such methods do so without an animatable skeleton. Further, they yield low-resolution reconstructions at a higher computational cost (regression over a cubic voxel grid) and often suffer from inconsistent topology in the form of broken or partial body parts. Hence, statistical body models become a natural choice to offset the ill-posed nature of this problem. Although such models are theoretically low dimensional, learning them has been challenging due to the complex non-linear mapping from the image to the relative axis-angle representation. Consequently, most solutions rely on different projections of the underlying mesh (2D/3D keypoints, silhouettes, etc.). To simplify the learning process, we propose the CR framework
that uses classification as a prior to guide the regression's learning process. Although recovering personalized models with high-resolution meshes isn't possible in this space, the framework demonstrates that learning even such template models is difficult without additional supervision.
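To illustrate how a classification output can act as a prior for regression, here is a minimal PyTorch sketch, assuming image features have already been extracted by a backbone. The module name, the 72-parameter/16-bin discretization, and the feature dimension are hypothetical choices for illustration, not the exact CR design.

import torch
import torch.nn as nn

class ClassifyThenRegressSketch(nn.Module):
    """Discretizes each pose parameter into bins and feeds the classifier's
    soft bin probabilities, together with the image features, into the
    regression branch, so that classification guides the regression."""
    def __init__(self, feat_dim=512, n_params=72, n_bins=16):
        super().__init__()
        self.n_params, self.n_bins = n_params, n_bins
        self.cls_head = nn.Linear(feat_dim, n_params * n_bins)
        self.reg_head = nn.Sequential(
            nn.Linear(feat_dim + n_params * n_bins, 512), nn.ReLU(),
            nn.Linear(512, n_params),
        )

    def forward(self, feats):
        bin_logits = self.cls_head(feats).view(-1, self.n_params, self.n_bins)
        probs = bin_logits.softmax(dim=-1)             # soft bin prior
        theta = self.reg_head(torch.cat([feats, probs.flatten(1)], dim=1))
        return bin_logits, theta                       # class logits + continuous pose

Training would typically combine a cross-entropy loss on the binned ground-truth parameters with an L2 loss on the regressed values, so the discrete prediction constrains the continuous one.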
As an alternative to directly learning parametric models, we propose HumanMeshNet to learn an “implicitly structured point cloud”, in which we make use of the mesh topology as a prior to enable
better learning. We hypothesize that instead of learning the highly non-linear SMPL parameters, learning
its corresponding point cloud (although high dimensional) and enforcing the same parametric template
topology on it is an easier task. The proposed paradigm can theoretically learn local surface deformations that the body-model-based PCA space can't capture. Going ahead, producing high-resolution meshes (with accurate geometric detail) is a natural extension that is easier in 3D space than in the parametric one.
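As a sketch of how the template topology can serve as a prior on a predicted point cloud, the following hypothetical PyTorch loss combines a per-vertex L2 term with an edge-consistency term over the fixed template connectivity; the function name, the equal loss weighting, and the edge formulation are illustrative assumptions rather than the exact HumanMeshNet objective.

import torch

def topology_prior_loss(pred_verts, gt_verts, edges):
    """Loss for an "implicitly structured point cloud": per-vertex L2 to the
    ground-truth surface, plus an edge term that enforces the template mesh
    topology by matching edge vectors of the fixed connectivity.
    pred_verts, gt_verts: (B, V, 3) tensors; edges: (E, 2) long tensor of
    vertex-index pairs taken from the template mesh (e.g. SMPL's topology)."""
    vert_loss = (pred_verts - gt_verts).pow(2).sum(-1).mean()
    i, j = edges[:, 0], edges[:, 1]
    pred_e = pred_verts[:, i] - pred_verts[:, j]   # predicted edge vectors
    gt_e = gt_verts[:, i] - gt_verts[:, j]         # ground-truth edge vectors
    edge_loss = (pred_e - gt_e).pow(2).sum(-1).mean()
    return vert_loss + edge_loss

Because the edge list is fixed (e.g., taken from SMPL's template mesh), every predicted point cloud inherits the same connectivity, which is what makes it “implicitly structured”.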
In summary, in this thesis, we attempt to address several of the aforementioned challenges and
empower machines with the capability to interpret a 3D human body model (pose and shape) from a
single image in a manner that is non-intrusive, inexpensive, and scalable. In doing so, we explore different
3D representations that are capable of producing accurate surface geometry, with a long-term goal of
recovering personalized 3D human models.