
The field of computer vision has been transformed by the success of deep Convolutional Neural Networks (ConvNets). State-of-the-art deep models have driven major advances in visual tasks such as object detection and segmentation. One key ingredient behind this success is the large amount of human supervision used to train ConvNets. However, can we really annotate every task we want to solve? As computer vision moves toward more difficult and structured AI tasks, it becomes increasingly challenging for humans to provide training supervision.
In this talk, I will argue that we need to go beyond images and exploit the spatiotemporal structure in videos. In videos, millions of pixels are linked to each other across time, and these links define visual correspondence. I will discuss how to learn this correspondence from continuous observation of videos without any human supervision. Once learned, the correspondence can in turn serve as supervision for training ConvNets, eliminating the need for manual labels. Going beyond visual recognition, the spatiotemporal structure in videos also provides supervision signals for learning visual interactions. I will present our recent efforts on learning scene affordances by passively watching human interactions in videos, and on learning visual navigation by actively interacting with the environment.
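
To make the idea of using temporal correspondence as free supervision concrete, here is a minimal, illustrative sketch only, not the method presented in the talk: one common self-supervised formulation tracks a point forward from one frame to a later frame and then back again, and penalizes any drift from the starting location (cycle-consistency in time). The PyTorch-style code below assumes that formulation; all names (`FeatureEncoder`, `soft_argmax_match`, `cycle_consistency_loss`) are hypothetical placeholders.

```python
# Toy sketch: temporal cycle-consistency as a label-free training signal.
# Track a point from frame_a to frame_b and back; the cycle should close.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureEncoder(nn.Module):
    """Small ConvNet producing a dense feature map used for matching."""

    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1),
        )

    def forward(self, frame):            # frame: (B, 3, H, W)
        feat = self.net(frame)           # (B, dim, H/4, W/4)
        return F.normalize(feat, dim=1)  # unit-norm features for cosine matching


def soft_argmax_match(query, feat_map):
    """Match a query feature (B, C) against a feature map (B, C, H, W) and
    return the expected (x, y) match location via a soft argmax."""
    B, C, H, W = feat_map.shape
    sim = torch.einsum("bc,bchw->bhw", query, feat_map).flatten(1)   # (B, H*W)
    attn = F.softmax(sim * 20.0, dim=1).view(B, 1, H, W)
    ys = torch.linspace(0, 1, H, device=feat_map.device).view(1, 1, H, 1)
    xs = torch.linspace(0, 1, W, device=feat_map.device).view(1, 1, 1, W)
    y = (attn * ys).sum(dim=(1, 2, 3))
    x = (attn * xs).sum(dim=(1, 2, 3))
    return torch.stack([x, y], dim=1)    # (B, 2), normalized coordinates


def cycle_consistency_loss(encoder, frame_a, frame_b, start_xy):
    """Track a point forward (a -> b) and backward (b -> a); penalize drift."""
    feat_a, feat_b = encoder(frame_a), encoder(frame_b)
    B, C, _, _ = feat_a.shape
    # sample the feature at the starting point in frame_a
    grid_a = start_xy.view(B, 1, 1, 2) * 2 - 1            # [0,1] -> [-1,1] for grid_sample
    q_a = F.grid_sample(feat_a, grid_a, align_corners=False).view(B, C)
    fwd_xy = soft_argmax_match(q_a, feat_b)               # where the point lands in frame_b
    grid_b = fwd_xy.view(B, 1, 1, 2) * 2 - 1
    q_b = F.grid_sample(feat_b, grid_b, align_corners=False).view(B, C)
    back_xy = soft_argmax_match(q_b, feat_a)              # track back to frame_a
    return F.mse_loss(back_xy, start_xy)                  # the cycle should close


# Usage with random tensors standing in for two nearby video frames.
encoder = FeatureEncoder()
frame_a, frame_b = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
start = torch.rand(2, 2)                                  # normalized (x, y) query points
loss = cycle_consistency_loss(encoder, frame_a, frame_b, start)
loss.backward()
```

The only training signal here is whether the forward-backward track returns to its starting point, so no human annotation enters the loss; the encoder's features improve simply because better features make the cycle close more reliably.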