Abstract
Re-identification (ReID) is a critical challenge in computer vision, predominantly studied in the context of pedestrians and vehicles. However, robust object-instance ReID, which has significant implications for tasks such as autonomous exploration, long-term perception, and scene understanding, remains underexplored. In this work, we address this gap by proposing a novel dual-path object-instance re-identification transformer architecture that integrates multimodal RGB and depth information. By leveraging depth data, we demonstrate improvements in ReID across scenes that are cluttered or have varying illumination conditions. Additionally, we develop a ReID-based localization framework that enables accurate camera localization and pose identification across different viewpoints. We validate our methods using two custom-built RGB-D datasets, as well as multiple sequences from the open-source TUM RGB-D datasets. Our approach demonstrates significant improvements in both object-instance ReID (mAP of 75.18) and localization accuracy (success rate of 83% on TUM RGB-D), highlighting the essential role of object ReID in advancing robotic perception. Our models, frameworks, and datasets have been made publicly available.