Abstract
Autonomously navigating to a goal pose is central to several robotic applications such as inspection, surgery, object manipulation and transportation. Navigation frameworks often require the goal pose to be specified in Cartesian space and rely on a localization module that estimates the robot's current pose. However, in the absence of a metric-scale map and an accurate position-estimation module, it is difficult to attain the correct goal pose. Visual servoing approaches therefore represent the goal pose as an image, which is more intuitive for humans and easier to obtain than 6-DoF coordinates. The image feedback from the robot's camera is then used to localize the robot in image space and navigate it to the goal pose.
Existing visual servoing approaches control the robot so that the overlap between the current image and the goal image is iteratively maximized. Consequently, they do not generalize when servoing to different objects, or even to different instances of the same category. We introduce the novel problem of instance-invariant visual servoing, where the robot must servo to a desired pose relative to a given instance, with the pose represented by a goal image of a different instance from the same category. This setting is more relevant in practical scenarios, since the same robot can handle multiple object instances without changing the goal image.
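The overlap-maximizing behaviour of classic approaches is usually realized as image-based visual servoing: a feature error between the current and goal images is driven to zero through an interaction (image Jacobian) matrix. A minimal numpy sketch of this standard control law (function names and the gain are illustrative, not from the thesis):

```python
import numpy as np

def ibvs_velocity(s, s_star, L, gain=0.5):
    """Classic image-based visual servoing law.

    s, s_star : current and goal feature vectors (e.g. stacked pixel
                coordinates of tracked points)
    L         : interaction matrix mapping camera velocity to feature
                velocity
    Returns the camera velocity command v = -gain * pinv(L) @ (s - s_star),
    which drives the feature error, and hence the image overlap gap, to zero.
    """
    error = s - s_star
    return -gain * np.linalg.pinv(L) @ error
```

When the current features coincide with the goal features the commanded velocity vanishes, which is exactly why this scheme fails to generalize: the goal image must come from the same instance the robot is looking at.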
We first tackle instance-invariant visual servoing using a geometric approach. We propose a novel visual servoing control law based on a linear combination of 3D models representing the given object category. Our approach accommodates shape deformations among object instances, and the resulting controller smoothly navigates the robot to the desired pose even when the instances differ. To handle the large variation in appearance across object instances, we further propose part-based visual features that uniquely capture the locations of an object's parts in images. The advantages of such part-aware semantics are two-fold: (i) they conceal illumination and textural variations from the visual servoing algorithm, and (ii) semantic keypoints enable accurate matching of descriptors across instances.
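The linear-combination idea can be sketched as expressing an unseen instance's 3D part locations as a mean category shape plus a weighted sum of deformation basis shapes. The array layout and names below are assumptions for illustration, not the thesis's exact formulation:

```python
import numpy as np

def instance_shape(mean_shape, basis, coeffs):
    """Approximate an unseen instance by a linear combination of
    category shape models.

    mean_shape : (K, 3) mean 3D locations of the K semantic parts
    basis      : (B, K, 3) deformation basis built from known instances
    coeffs     : (B,) instance-specific deformation coefficients

    Returns the (K, 3) deformed shape: mean + sum_b coeffs[b] * basis[b].
    """
    return mean_shape + np.tensordot(coeffs, basis, axes=1)
```

Fitting `coeffs` to observed keypoints lets a single category-level model absorb shape deformation between instances, which is what allows the geometric controller to servo across instances.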
To circumvent the 3D models required by the geometric solution above, we propose a discriminative-learning framework for visual servoing across object instances. This approach learns the desired pose from previously seen examples of goal images and navigates the robot despite appearance and shape variations. Specifically, we learn a binary classifier that discriminates the goal image from images captured from other viewpoints of that object category. The classification error is then used to navigate the robot towards the desired pose. Furthermore, we design controllers for linear, kernel, and exemplar Support Vector Machines (SVMs) and empirically discuss their performance in the visual servoing context. To address large intra-category variation in appearance, we introduce Principal Oriented Glyph (POG) features, which are easier to obtain than part-based features.
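The goal-versus-other-viewpoints classifier can be sketched with a tiny Pegasos-style linear SVM trained by stochastic subgradient descent; the classifier score then acts as a servoing error that shrinks as the robot approaches the goal viewpoint. This is a generic stand-in, not the thesis's actual training procedure or controller:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1, seed=0):
    """Pegasos-style linear SVM separating goal-view features (y=+1)
    from features of other viewpoints (y=-1)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for t in range(1, epochs + 1):
        i = rng.integers(n)
        step = lr / t
        if y[i] * (X[i] @ w + b) < 1:   # hinge-loss subgradient step
            w = (1 - step * lam) * w + step * y[i] * X[i]
            b += step * y[i]
        else:                            # regularization-only step
            w = (1 - step * lam) * w
    return w, b

def servo_error(w, b, features):
    """Classification-based servoing error: the margin shortfall
    1 - score, which approaches zero near the goal viewpoint."""
    return 1.0 - (features @ w + b)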
We next consider the problem of navigating a robot in unstructured environments using deep learning. Motivated by recent breakthroughs of data-driven methods on recognition and detection tasks, we aim to learn visual feature representations suited to servoing in unstructured and unknown environments. In contrast to existing visual servoing approaches that require knowledge of scene geometry, especially depth, and of camera parameters, we present an end-to-end learning-based approach for visual servoing in diverse scenes that is agnostic to both. This is achieved by training a Convolutional Neural Network (CNN) on color images with synchronized camera poses.
After assessing the capability of CNNs on classical visual servoing tasks, we apply them to our instance-invariant visual servoing task. However, catering to shape, appearance, and viewpoint variations all at once is difficult for a CNN. Thus, instead of directly regressing to the desired pose, we use CNNs to predict the locations of part-based keypoints in images, so that appearance variations are decoupled. To tackle geometric and viewpoint variations, we further present a pose-induction strategy based on the part-based keypoints predicted by our network. Specifically, we reconstruct the 3D locations of the given instance's parts from a sequence of 2D keypoints estimated by the network, and align this reconstruction with a template instance. The aligned reconstruction is then used to localize the robot and navigate it to the desired pose.
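The alignment step of this pose-induction strategy amounts to a rigid registration of the reconstructed part locations against the template instance, for which the standard Kabsch/Procrustes solution applies. A minimal sketch under that assumption (the thesis may additionally estimate scale or handle missing parts):

```python
import numpy as np

def align_to_template(points, template):
    """Rigidly align reconstructed 3D part locations to a template
    instance (Kabsch algorithm).

    points, template : (K, 3) matched 3D part locations (rows are parts)
    Returns R (3x3 rotation) and t (3,) such that R @ p + t best matches
    the corresponding template part, in the least-squares sense.
    """
    pc, tc = points.mean(0), template.mean(0)
    H = (points - pc).T @ (template - tc)     # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tc - R @ pc
    return R, t
```

The recovered (R, t) places the observed instance in the template's frame, from which the robot's pose relative to the desired viewpoint can be read off and servoed upon.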
Finally, we employ our framework for the autonomous inspection of vehicles using Micro Aerial Vehicles (MAVs), which is vital