Image Formation


Now we're going to look at a more mathematical way to describe the projection process: the projection of a point from the real world onto the image plane. In the last lecture we looked at it from a very intuitive and geometric perspective; this time we're going to look at it more mathematically. 

We're going to use a different projection model from the one we used last time, a model referred to as the "central projection" model. A key element of this model is the camera's coordinate frame, which we denote by C. The image plane is parallel to the camera's x- and y-axes and positioned at a distance f in the positive z direction. f is equivalent to the focal length of the lens that we talked about in the last lecture. 

Now, in order to project the point, we cast a ray from the point in the world through the image plane to the origin of the camera. With the central projection model, you'll note that the image is non-inverted, unlike the inverted image in the model we talked about in the last lecture. We can write an equation for the point P in homogeneous coordinates: we multiply the world coordinates X, Y, Z by a 3 × 4 matrix in order to get the homogeneous coordinates of the projected point on the image plane. 

Let's look at this equation in a little more detail. It's quite straightforward to write an expression for x̃, ỹ, z̃ in terms of the focal length and the world coordinates X, Y, Z. We can transform the homogeneous coordinates to Cartesian coordinates using the rule that we talked about in the last section, and with a little rearrangement we can bring the equation into this form, which is exactly the same form we derived in the last lecture by looking at similar triangles. 
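The lesson itself contains no code, but the similar-triangles result can be checked numerically. Below is a minimal Python sketch; the focal length and the world point are assumed example values, and the function name `project` is mine, not from the lesson.

```python
# Central projection in homogeneous form (sketch; f and the point are assumed values).
f = 0.008  # focal length in metres (8 mm, an assumed example value)

def project(X, Y, Z, f):
    """Project a camera-frame world point (metres) onto the image plane."""
    # Homogeneous image-plane point: multiply (X, Y, Z, 1) by the 3x4 matrix
    # [[f, 0, 0, 0], [0, f, 0, 0], [0, 0, 1, 0]]
    x_t, y_t, z_t = f * X, f * Y, Z
    # Homogeneous -> Cartesian: divide the first two elements by the third
    return x_t / z_t, y_t / z_t

x, y = project(1.0, 2.0, 5.0, f)
# Matches the similar-triangles form x = f X / Z, y = f Y / Z
```

Note that the division by Z only happens at the very last step, when converting back to Cartesian form; the projection itself is a single linear (matrix) operation.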

What's really convenient and useful about this homogeneous representation of the image formation process is that it is completely linear. We don't have an explicit division by Z, the distance between the camera and the object; it's implicit in the way we write the equations in homogeneous form. 

Let's look at this equation again: we can factor this matrix into two. The matrix on the right has elements that are either 0, 1 or f, the focal length of the lens, so this matrix performs the scaling and zooming; it's a function of the focal length of our lens. The matrix on the left has an interesting shape: it's only 3 × 4, and it performs the dimensionality reduction, crunching points from three dimensions down to two.
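As a sketch of this factorization (again not from the lesson, and with an assumed focal length), we can verify that the 3 × 4 dimensionality-reduction matrix times the scaling matrix reproduces the single projection matrix:

```python
# Factoring the 3x4 projection matrix into two (sketch; f is an assumed value).
f = 0.008

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Left factor: 3x4, crunches points from three dimensions down to two
reduce_dim = [[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0]]

# Right factor: 4x4, its elements are only 0, 1 or f (scaling and zooming)
scale = [[f, 0, 0, 0],
         [0, f, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1]]

P = matmul(reduce_dim, scale)
# P is the familiar 3x4 projection matrix [[f,0,0,0],[0,f,0,0],[0,0,1,0]]
```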

So far, we have considered the image plane to be continuous. In reality, the image plane is quantized: it consists of a massive array of light-sensing elements which correspond to the pixels in the output image. The dimensions of each pixel in this grid I'm going to denote by the Greek letter ρ, so the pixels are ρu wide and ρv high. Pixels are really, really small: the width and height of a pixel is typically on the order of 10 microns, maybe a bit bigger, maybe a bit smaller. 

What we need to do now is to convert the coordinate of P, which we computed previously in units of metres with respect to the origin of the image plane.

We need to convert it to units of pixels, and our pixel coordinate system has a different origin, as we talked about in earlier lectures. Pixel coordinates are measured from the top-left corner of the image, so we need to do a scaling and a shifting, and that's a simple linear operation. So if we have the Cartesian x and y coordinates of the point P on the image plane, we can convert them to the equivalent pixel coordinates, which we denote by u and v, and we can represent that again in homogeneous form.

Here we multiply by a matrix whose elements are the dimensions of the pixel, ρu and ρv, and the coordinates of what's called the principal point. The principal point is the pixel coordinate where the z-axis of the camera frame pierces the image plane.
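This metres-to-pixels conversion can be sketched in a few lines of Python. All the parameter values here (pixel size, principal point, and the test point) are assumed examples, not values from the lesson:

```python
# Image-plane point (metres) -> pixel coordinates (sketch; parameter values assumed).
rho_u = rho_v = 10e-6    # pixel width and height in metres (10 microns)
u0, v0 = 640.0, 512.0    # principal point in pixels (assumed example)

def to_pixels(x, y):
    """Scale by the pixel dimensions, then shift to the top-left pixel origin."""
    # In homogeneous form this is multiplication of (x, y, 1) by
    # [[1/rho_u, 0, u0], [0, 1/rho_v, v0], [0, 0, 1]]
    return x / rho_u + u0, y / rho_v + v0

u, v = to_pixels(0.0016, 0.0032)
```

Dividing by the pixel dimension converts metres to pixel counts, and adding the principal point shifts the origin from the centre of projection to the top-left corner.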

The homogeneous pixel coordinates can be converted to the more familiar Cartesian pixel coordinates u and v by the transformation rule that we covered earlier.

Essentially, we take the first and second elements of the homogeneous vector and divide them by the third element of the homogeneous vector.

Now we can put all these pieces together and write the complete camera model in terms of three matrices. The product of the first two matrices is typically denoted by the symbol K, and we refer to these as the intrinsic parameters. All the numbers in these two matrices are functions of the camera itself. It doesn't matter where the camera is or where it's pointing; they're only a function of the camera. These numbers include the height and width of the pixels on the image plane, the coordinates of the principal point, and the focal length of the lens.
The third matrix describes the extrinsic parameters, which describe where the camera is but say nothing about the type of camera.

The elements of this matrix are a function of the relative pose of the camera with respect to the world origin frame; in fact, it is the inverse of the camera pose ξC.

The product of all of these matrices together is referred to as the camera matrix and it’s often given the symbol C.

So this single 3 × 4 matrix is all we need to describe the mapping from a world coordinate X, Y, Z through to a homogeneous representation of the pixel coordinate on the image plane. That homogeneous image-plane coordinate can be converted to the familiar Cartesian image-plane coordinate using the transformation rule we covered earlier. So this is a very simple and concise way of performing perspective projection.
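To make the whole pipeline concrete, here is a hedged Python sketch that composes assumed intrinsics with a trivially simple extrinsic matrix (camera at the world origin, looking along z) into one 3 × 4 camera matrix and projects a point. Every numeric value is a made-up example:

```python
# Complete camera model as one 3x4 matrix (sketch; every numeric value is assumed).
f = 0.008              # focal length, metres
rho = 10e-6            # square pixels, 10 microns on a side
u0, v0 = 640.0, 512.0  # principal point, pixels

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Intrinsic parameters K: pixel scaling, principal point, focal length
K = [[f / rho, 0, u0],
     [0, f / rho, v0],
     [0, 0, 1]]

# Extrinsic parameters: identity pose here, i.e. world frame == camera frame
extrinsics = [[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0]]

C = matmul(K, extrinsics)  # the 3x4 camera matrix

def project_to_pixels(X, Y, Z):
    """World point -> Cartesian pixel coordinates via the camera matrix."""
    u_t, v_t, w_t = (sum(C[i][j] * p for j, p in enumerate((X, Y, Z, 1)))
                     for i in range(3))
    return u_t / w_t, v_t / w_t  # homogeneous -> Cartesian

u, v = project_to_pixels(1.0, 2.0, 5.0)
```

The same two-step pattern applies for any camera pose: only the `extrinsics` matrix changes, while K stays fixed for a given camera and lens.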

Let's consider now what happens when I introduce a non-zero scale factor λ. The homogeneous coordinate elements ũ, ṽ and w̃ will all be scaled by λ. When I convert them to Cartesian form, the λ term appears in both the numerator and the denominator, so it cancels and the result is unchanged. This is a particular advantage of writing the relationship in homogeneous form: it gives us what's called scale invariance.
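This cancellation is easy to demonstrate numerically. A small sketch (with made-up example values; the names are mine):

```python
# Scale invariance of homogeneous coordinates (sketch with assumed example values).
def to_cartesian(u_t, v_t, w_t):
    # Homogeneous -> Cartesian: divide the first two elements by the third
    return u_t / w_t, v_t / w_t

p = (4000.0, 4160.0, 5.0)          # a homogeneous pixel coordinate
lam = 7.5                          # any non-zero scale factor
scaled = tuple(lam * e for e in p)

# The lambda appears in both numerator and denominator, so it cancels out
same = to_cartesian(*p) == to_cartesian(*scaled)
```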

Because we can multiply the camera matrix by an arbitrary scale factor, we can write it in a slightly simplified form, which we refer to as the normalized camera matrix. We do that by choosing one particular element of the matrix to have a value of 1; typically we choose the bottom-right element and set it to 1.
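Normalization is just dividing the whole matrix by its bottom-right element; because of scale invariance the projected pixel coordinates are unchanged. A sketch, with an entirely made-up camera matrix:

```python
# Normalising the camera matrix (sketch; all matrix values here are assumed).
C = [[400.0, 0.0, 320.0, 100.0],
     [0.0, 400.0, 256.0, 50.0],
     [0.0, 0.0, 0.5, 2.0]]

# Divide every element by the bottom-right element so that it becomes 1
norm = [[e / C[2][3] for e in row] for row in C]

def project(M, P):
    """Apply a 3x4 camera matrix to a world point, return Cartesian pixels."""
    u_t, v_t, w_t = (sum(M[i][j] * p for j, p in enumerate(P + (1.0,)))
                     for i in range(3))
    return u_t / w_t, v_t / w_t

# C and norm map any world point to the same pixel coordinate
```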

This normalized camera matrix still contains all the information needed to completely describe the image formation process: the focal length of the lens, the dimensions of the pixels, the coordinates of the principal point, and the position and orientation of the camera in three-dimensional space. And finally, we can convert the homogeneous pixel coordinates to the more familiar Cartesian pixel coordinates, which we denote by u and v.


There is no code in this lesson.

We can describe the relationship between a 3D world point and a 2D image plane point, both expressed in homogeneous coordinates, using a linear transformation – a 3×4 matrix. Then we can extend this to account for an image plane which is a regular grid of discrete pixels.

Professor Peter Corke

Professor of Robotic Vision at QUT and Director of the Australian Centre for Robotic Vision (ACRV). Peter is also a Fellow of the IEEE, a senior Fellow of the Higher Education Academy, and on the editorial board of several robotics research journals.

Skill level

This content assumes an understanding of high-school-level mathematics (for example trigonometry, algebra, calculus and physics/optics) and experience with the MATLAB command line and programming, for example workspace, variables, arrays, types, functions and classes.

