An image contains a huge amount of pixel data, and a video stream is a massive flow of pixel data. A robot, by contrast, typically has only a few inputs: the positions or velocities of its joints. How do we go from all that camera data to the small amount of data the robot really needs?