Template matching


Let’s consider the problem of trying to find a face in a crowd and, in particular, I want to find the face of the robot Bender. Here he is: he is hiding behind a Dalek and standing next to Eve. So rather than us having to scan the crowd to find Bender, let’s see if we can automate the task.

We are going to use the spatial operator approach in order to solve this problem. What we are going to do is for every input window we are going to compare it with a template image, the template being the image, the pattern we are looking for and our function now is a similarity function. We are going to look at the similarity between the particular input window which marches across the image left to right, top to bottom. Every input window, we are going to compare it with our template and the similarity score is going to be placed into the output image. So the template will be the face of Bender and we are going to compare the face of Bender at every single location across the scene. So how do we compare two images, how do we tell that a particular input window W looks like Bender or doesn’t look like Bender?

To do this we need an image similarity measure and the simplest one is called the sum of absolute differences, sometimes abbreviated to SAD or sad() and quite simply we take the difference between the corresponding pixels in the two images that we are trying to compare. Of course, they must be the same size, take their absolute value and sum it up. So if the images are identical, the similarity measure will have a value of 0. If they are dissimilar, S will have a value greater than 0. So 0 means perfect match; a big value means a less perfect match. A quite similar approach is the operator called ssd(), sum of square differences, and instead of using the absolute value operation we are taking the sum of the squares.

Another measure is the zero — mean normalised cross correlation, commonly abbreviated to zncc(). It looks much more complicated, but it has some advantages over the two simple measures above. The zncc() measure varies from -1 to +1. +1 means that the images are identical and -1 means that one image is the negative of the other one. 0 means that the two images are not very well correlated.

Typically, a value above 0.8 is considered to be a reasonable match. Here we revisit the original puzzle and there is Bender highlighted. So we first of all need to create the template, so here I have chopped Bender out of the scene, so it is mostly Bender and a few of his neighbours as well.

And then what I am going to do is to use a Photoshop-like program to remove all of the robots in that scene that are not Bender. I am going to zoom in as much as I can on his head and blacken out all of the other robots that I am not interested in. So this is the template, this is the pattern that I am going to go looking for in the image.

The first thing I am going to do is load the crowd scene into a workspace variable which I am going to call crowd. I am going to load it from the file wheres_walle.png, it is a PNG format file. I want to convert the colour image to grey scale and I want to convert all the pixels to double precision values. That is, they lie between 0 and 1 and then let’s display that.

And there we see the crowd scene, next I am going to load an image of the robot Bender and that is in a PNG file also; going to convert that to double precision and I have loaded that into the workspace variable Bender. We can see that it is a very small image; it has only got 55 rows of pixels and 41 columns of pixels. Let’s display that in a new window.

And that is our template. So the problem to hand is to look for that particular template pattern in every possible location in the input image. So this is a really big searching problem to do that I am going to use the tool box function called isimilarity and I parse in the image of Bender, the template that we are looking for, the scene in which I am searching for the template and the last argument is the similarity measure — we just talked about similarity measures — and I am going to use the zncc measure, zero mean normalise cross correlation, which is perhaps the Rolls Royce of image similarity measures, and I am going to compute that on my rather nice laptop. This calculation takes about 90 something seconds so we are going to skip through to the end. Great, we are done. So we have computed the variable S. It is a large matrix and we can consider it to be an image and display it as such and this is a similarity image. Now what we have here is this similarity measure, so let’s just click around in a few points and we can see that these pixels that I am clicking on have got quite low similarity measure. It means that where I have clicked the template is not at all a good fit.

Down here we can see some similarity values that are a little bit bigger, 0.44, 0.47. The maximum value similarity is 1. So at this location it is possible that the template exists but it is not a marvellous fit. We can see some places that are quite dark where it is certainly is very poor fit. In fact it is a negative similarity measure. But somewhere in here there will be some very bright pixels which represent where the template fits best. So this is a very large matrix: there’s almost a million numbers in here and what we have to do is to find the biggest number within this matrix.

Now we have a tool box function which can do that for us and it’s used like this. So it has got two output variables and I will explain them in just a moment.

The function is called peak2 and we parse in the similarity image. Parse in the argument one which is saying that we want to find the brightest pixel, that is, the brightest pixel with respect to all of the pixels who are one pixel away from it. So it is like, it’s the brightest within a very local region and let’s ask for the five brightest peaks in descending order. So that’s the brightest pixel, the second brightest pixel and so on. Takes a moment to compute; and what we see is we have computed two output variables, now the first one, the one we call MX or maximum is the value of that brightest pixel and the corresponding column in the matrix P is the coordinate of that brightest pixel.

So what it is saying is within that similarity image the greatest similarity that it has found has got a value of 0.6107 and it occurs at a U coordinate of 331 and a V coordinate of 364. The second best fit has got a similarity of 0.5580 and we can see its coordinate, third best fit, fourth best fit, fifth best fit and so on.

So in order to visualise that, what we can do is to select the original image and what I am going to do is to plot some circles on this image that indicate where we found these good fits. So this is rather a lot of arguments to this function, but basically it is saying plot a circle of radius 30 pixels and plot one of these circles for every column in the matrix P. So it is going to plot 5 circles. I want the circles to be coloured blue. I want them to be translucent, so we set the alpha value for these circles to 0.3, which means that they are see through circles and that they do not have any edge colour and here we go, there is a blue circle on every place where we think the robot Bender is.

Now let’s also indicate the rank of the selections; another rather complex function call, but it’s basically saying for every point P, I want to put a sequential number in a bold font text size 36, in yellow writing and here we go, so we can see the first best fit, second best fit, third, fourth and fifth and we can see that the blue circle with the number one next to it is indeed the robot Bender … we are looking closely enough; we can see Bender’s happy face there.

We are going to extract a region of the crowd scene around where we located the robot Bender. We are going to do that by using the iroi function, the region of interest function. And to parse in the entire image, the crowd image, we are going to parse in the coordinate of the point where the template had the best match, that is, the first column of the matrix P. Each column of the matrix P represents a point in the image where the template fitted either first best, second best and so on. So the first column is the point where the template fitted the best and we are going to extract a region which is 50 pixels square and so here we see the region of the crowd scene where we found the robot Bender and on the right we see the template of Bender. We can see that these regions are not quite the same. But we are showing on the left a pattern of pixels where the template fitted the best.

So what I have done is to take the original image of the crowd scene and I have cut out a bunch of pixels which has Bender’s face in it, I have gone into Photoshop and I have blacked out some of the background which is not relevant — they are not part of Bender, they are actually part of the robots around Bender — and that is how I found the template.

So if I applied this template to a different robot crowd scene and perhaps, in that crowd scene, Bender didn’t have a cigar; perhaps he was looking in another direction; perhaps he was closer to the camera and appeared to be bigger, or he was further away and appeared to be smaller. Then a simple-minded template matching technique like this is not going to work particularly well.

Imagine trying to find a face in a crowd. If we know what the face looks like we could search for it at every possible location — this is the essence of template matching. To make it work we need to describe how similar each area we are checking is to the reference face image and we discuss a number of similarity measures such as sum of absolute differences (SAD, ZSAD), sum of squared differences (SSD, ZSSD) and normalized cross correlation (NCC, ZNCC). We also invesigatate the effect of intensity scale and offset error on the performance of these measures.

Professor Peter Corke

Professor of Robotic Vision at QUT and Director of the Australian Centre for Robotic Vision (ACRV). Peter is also a Fellow of the IEEE, a senior Fellow of the Higher Education Academy, and on the editorial board of several robotics research journals.

Skill level

This content assumes an understanding of high school level mathematics; for example, trigonometry, algebra, calculus, physics (optics) and experience with MATLAB command line and programming, for example workspace, variables, arrays, types, functions and classes.

More information...

Rate this lesson


Check your understanding

Leave a comment