Abstract
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities worldwide performed these activities in 131 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,422 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions, including a novel “expert commentary” done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources will be open sourced to fuel new research in the community.