Learning to Perceive the 4D World
Perceiving the 4D world (i.e., 3D space over time) from visual input is essential for human interaction with the physical environment. While computer vision has made remarkable progress in 3D scene understanding, much of it remains piecemeal—for example, focusing solely on static scenes or specific categories of dynamic objects. How can we model diverse dynamic scenes in the wild? How can we achieve online perception with human-like capabilities? In this talk, I will first discuss holistic scene representations that enable long-range motion estimation and 4D reconstruction. I will then introduce a unified learning-based framework for online dense 3D perception, which continuously refines scene understanding with new observations. I will conclude by discussing future directions and challenges in advancing spatial intelligence.