Summary of 3D vision


Let's summarize the main points from the lecture.

When we take an image of a scene, it's a perspective projection, and the depth information has been lost.
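To make the loss of depth concrete, here is a minimal Python sketch. It assumes an ideal pinhole camera looking along the Z axis; the function name `project` and the unit focal length are illustrative choices, not something taken from the lecture.

```python
import numpy as np

def project(point_xyz, focal_length=1.0):
    """Perspective (pinhole) projection of a 3D point onto the image plane.

    Assumption: the camera sits at the origin and looks along +Z, with the
    image plane at distance focal_length.
    """
    x, y, z = point_xyz
    return np.array([focal_length * x / z, focal_length * y / z])

near = np.array([1.0, 0.5, 2.0])  # a point 2 units in front of the camera
far = 3.0 * near                  # a point on the same viewing ray, 6 units away

print(project(near))
print(project(far))
```

Both points land on the same image coordinates, so a single image cannot tell us which of the two depths is the true one.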

Human beings use a number of tricks, or visual cues, to reinstate that missing depth information. The cues that we use depend on the circumstances and also on the distance: is the object within our personal space, or is it out towards the horizon?

A particularly important technique that we use, as do many other animals and also robots, is called binocular disparity. It relies on the small, subtle changes between images of the same scene taken from different vantage points.

Here's an example of a stereo image pair. The left-hand image and the right-hand image correspond to what we would see with our left eye and our right eye. In general these two images look very similar, but if we look more carefully there are some subtle but important differences between them, and it's these small differences that allow us to extract information about the three-dimensional structure of the world.

Just as we have two eyes, we can build a stereo camera, which is a compact assembly that contains effectively two independent cameras a known distance apart. An important concept in stereo vision is disparity. If we identify a point in the left image and find the same point in the right image, it will appear to have shifted somewhat to the left, and that leftward shift is what we refer to as disparity.

Importantly, disparity is a function of distance. For an object that's close to us there is a large leftward shift, a large disparity, whereas something that's far away has a smaller disparity.

So if we can measure the disparity, then using the geometry of stereo vision we can relate the disparity to the distance to the object. I use the two images to compute disparity, and from that I can compute the distance Z.
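The summary does not spell out the formula, so the following is a sketch under stated assumptions. For two parallel, rectified cameras, a commonly used relation is Z = f b / d, where f is the focal length in pixels, b is the distance between the cameras (the baseline) and d is the disparity in pixels. The function name and the numbers below are hypothetical and chosen only for illustration.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Distance Z (metres) from disparity, assuming parallel rectified cameras.

    Uses Z = f * b / d. A zero or negative disparity has no finite depth
    under this model, so it is rejected.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive to give a finite depth")
    return focal_length_px * baseline_m / disparity_px

# Hypothetical camera: 700 pixel focal length, 0.12 m between the cameras.
focal_length_px = 700.0
baseline_m = 0.12

for disparity_px in (84.0, 42.0, 8.4):
    z = depth_from_disparity(disparity_px, focal_length_px, baseline_m)
    print(f"disparity {disparity_px:5.1f} px -> Z = {z:.2f} m")
```

The loop shows the inverse relationship described above: as the disparity shrinks, the computed distance grows.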

Computing disparity is not trivial. For every pixel in the left-hand image, I take a square window of pixels surrounding it and then search for that same pattern of pixels in the right-hand image, and that involves making a lot of image comparisons. It's the template-matching problem that we looked at earlier.
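The window search described above can be sketched for a single pixel as follows. This is a minimal illustration, not the lecture's implementation: it assumes the images are rectified so the search runs along the same row, it uses the sum of absolute differences (SAD) as the similarity measure, and the function name, window size and search range are all hypothetical. The test images are synthetic random texture with a known shift.

```python
import numpy as np

def match_disparity(left, right, row, col, half=2, max_disp=16):
    """Estimate the disparity of left[row, col] by window matching.

    Takes the (2*half+1)-square window around the pixel in the left image
    and slides it leftwards along the same row of the right image, keeping
    the shift with the smallest sum of absolute differences.
    """
    template = left[row - half:row + half + 1, col - half:col + half + 1].astype(float)
    best_disp, best_cost = 0, np.inf
    for disp in range(max_disp + 1):
        c = col - disp                      # candidate column in the right image
        if c - half < 0:                    # window would fall off the image
            break
        candidate = right[row - half:row + half + 1, c - half:c + half + 1].astype(float)
        cost = np.abs(template - candidate).sum()
        if cost < best_cost:
            best_cost, best_disp = cost, disp
    return best_disp

# Synthetic pair: every column of the left image equals the right image
# column 5 pixels further left, so the true disparity is 5.
rng = np.random.default_rng(0)
right = rng.random((20, 60))
left = np.roll(right, 5, axis=1)

print(match_disparity(left, right, row=10, col=30))  # prints 5
```

Repeating this search for every pixel is what makes the computation expensive: each pixel requires one window comparison per candidate disparity.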

So given a left image and a right image of a scene, I can compute the disparity at every single pixel and display the result as a disparity image, or disparity map. In a disparity map, things are bright if they're near and darker if they are further away. The converse is that if I have a stereo pair, two images taken from slightly different viewpoints, and I can present those images to my left and right eyes, I will have a very vivid sense of the three-dimensionality of the scene, a very strong sense of depth.

Human beings have been fascinated with the idea of 3D vision for a long, long time, and we discussed an array of technologies that can perform this function.

Let’s recap the important points from the topics we have covered about human depth perception, display of 3D images and estimating 3D scene structure using stereo and other types of sensors.

Professor Peter Corke

Professor of Robotic Vision at QUT and Director of the Australian Centre for Robotic Vision (ACRV). Peter is also a Fellow of the IEEE, a senior Fellow of the Higher Education Academy, and on the editorial board of several robotics research journals.

Skill level

This content assumes an understanding of high school level mathematics; for example, trigonometry, algebra, calculus, physics (optics) and experience with MATLAB command line and programming, for example workspace, variables, arrays, types, functions and classes.
