Image-Based Visual Servoing


Let’s imagine we have a robot and we want to position its gripper with respect to this purple triangle. We want the gripper directly above the centre of the triangle, at a height something like this.

So if we had a camera on the robot’s gripper, it would see a view something like this. In the particular view that the robot is seeing from here, the pose of the gripper with respect to the object is implicit. For instance, if the robot gripper was initially in a position like this, we can see that the image acquired by the camera is quite different. So if we could come up with a control algorithm that would change the view from this to this, then implicitly the robot is going to perform the desired motion in the world coordinate frame.
Here’s a graphical representation of the problem that we just discussed. We have the initial view of the object. In this case, our object is a triangle and it’s well-defined by the coordinates of its three corners. This is what the object looks like initially. But I want the triangle to look like this in my image.

It’s also well-defined by the coordinates of its three corners. So we can define our problem as changing the view of the object from this to this. What we want to do is to move the yellow circles to where the red circles are. Just like this.

In order to move, these yellow circles need to have some velocity. They need to have a velocity towards the corresponding red circle. So here is the coordinate of one of the yellow circles and here is its velocity. It needs to have this velocity, which would make it move from where it is to where it needs to go.
We’ve already discussed a relationship between pixel velocity and camera velocity. It’s given by the image Jacobian Matrix. We can write the matrix as a function. I call it Jp and that stands for Jacobian for a point feature and it’s a function of the pixel coordinate, u and v, also how far the point is away in the world, which is given by capital Z.

We refer to that as the depth of the point. For three points (remember, we have a point on each corner of the triangle), we can describe the velocity that we want point one, point two and point three to have. All that we’re doing is stacking three sets of equations, one above the other, and they share a common factor, which is the camera velocity.
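For reference, the single-point relationship can be written out explicitly. This is a sketch of the usual pinhole-camera form, and several symbols are assumptions not named in the lecture: f is the focal length in pixels (square pixels assumed), (u_0, v_0) is the principal point, \bar{u} = u - u_0 and \bar{v} = v - v_0 are pixel coordinates relative to it, and \nu = (v_x, v_y, v_z, \omega_x, \omega_y, \omega_z) is the camera's spatial velocity.

```latex
\begin{pmatrix} \dot{u} \\ \dot{v} \end{pmatrix}
= J_p(u, v, Z)\,\nu,
\qquad
J_p(u, v, Z) =
\begin{pmatrix}
-\dfrac{f}{Z} & 0 & \dfrac{\bar{u}}{Z} & \dfrac{\bar{u}\bar{v}}{f} & -\dfrac{f^2 + \bar{u}^2}{f} & \bar{v} \\[2mm]
0 & -\dfrac{f}{Z} & \dfrac{\bar{v}}{Z} & \dfrac{f^2 + \bar{v}^2}{f} & -\dfrac{\bar{u}\bar{v}}{f} & -\bar{u}
\end{pmatrix}
```

Only the first three columns, the translational part, depend on the depth Z.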

In the middle, we have a stack of these image Jacobians. So for three points, the relationship between the pixel velocities, camera velocity, looks like this. This middle matrix, the stack of image Jacobians, is now a six by six matrix.

Each of them is two by six. Stack three of them on top of each other and the result is a six by six matrix. Because this matrix is square, we can invert it, provided it is not singular. We have a relationship that looks like this. This is the desired camera velocity. If the camera moved like this, then all the points would have the desired pixel velocity. They would move from where they are, shown by the yellow circle, to where you want them to be, indicated by the red circle.
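Written as equations, the stacked relationship and its inverse look roughly as follows. The symbol \nu for the six-element camera velocity and the subscripts 1 to 3 for the three corners are notational assumptions made for this sketch.

```latex
\begin{pmatrix} \dot{u}_1 \\ \dot{v}_1 \\ \dot{u}_2 \\ \dot{v}_2 \\ \dot{u}_3 \\ \dot{v}_3 \end{pmatrix}
=
\begin{pmatrix} J_p(u_1, v_1, Z_1) \\ J_p(u_2, v_2, Z_2) \\ J_p(u_3, v_3, Z_3) \end{pmatrix} \nu
\quad\Longrightarrow\quad
\nu =
\begin{pmatrix} J_p(u_1, v_1, Z_1) \\ J_p(u_2, v_2, Z_2) \\ J_p(u_3, v_3, Z_3) \end{pmatrix}^{-1}
\begin{pmatrix} \dot{u}_1 \\ \dot{v}_1 \\ \dot{u}_2 \\ \dot{v}_2 \\ \dot{u}_3 \\ \dot{v}_3 \end{pmatrix}
```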

So this is a relationship between the velocity that we want and the velocity that the camera needs to have in order to make the change in the image look like this. It’s computed via the inverse of the stack of image Jacobian matrices.
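The stacking step can be sketched in a few lines of plain Python, used here instead of MATLAB only to keep the example dependency-free. The function names, the focal length of 800 pixels, the principal point at (512, 512) and the pixel coordinates are all illustrative assumptions, not values from the lecture.

```python
def point_jacobian(u, v, Z, f=800.0, u0=512.0, v0=512.0):
    """2x6 image Jacobian for one point feature.

    Assumes a pinhole camera with focal length f in pixels and principal
    point (u0, v0). The default values are illustrative only.
    """
    ub, vb = u - u0, v - v0  # pixel coordinates relative to the principal point
    return [
        [-f / Z, 0.0, ub / Z, ub * vb / f, -(f * f + ub * ub) / f, vb],
        [0.0, -f / Z, vb / Z, (f * f + vb * vb) / f, -ub * vb / f, -ub],
    ]


def stacked_jacobian(points, depths, **camera):
    """Stack one 2x6 Jacobian per point; three points give a 6x6 matrix."""
    J = []
    for (u, v), Z in zip(points, depths):
        J.extend(point_jacobian(u, v, Z, **camera))
    return J


# Three hypothetical corner projections, each assumed to be 5 m away.
points = [(300.0, 400.0), (700.0, 350.0), (550.0, 750.0)]
J = stacked_jacobian(points, [5.0, 5.0, 5.0])
print(len(J), "x", len(J[0]))
```

Each point contributes two rows, one for its u velocity and one for its v velocity, which is why three points are exactly enough to fill a square six by six matrix.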
Let’s look now at how we might compute the desired pixel velocity. We’re going to introduce some extra notation. I’m going to use the asterisk or star to indicate the desired value of a quantity. So where we want the pixel to be is indicated by u* and v*, and where it currently is is indicated by the coordinate (u, v). So if I simply take the difference between where I want it to be and where it is now, and multiply that by some arbitrary scale factor lambda, I have an equation which describes the desired pixel velocity.

It’s a vector parallel to a line between where I am and where I want to be. I use the subscript “i” because this is a general relationship that we can apply to any of the corner points of our object. In this case, it’s a triangular object so “i” varies from one to three.

I introduce a bit more notation. I use the symbol “p”, which is a vector, to represent the coordinate of a point. Substituting that into the previous relationship, I now have an expression for the desired velocity of the camera, as indicated by the star. It is equal to lambda, an arbitrary gain, multiplied by the inverse of the stack of image Jacobians, multiplied by another vector which is made up of the differences between where each point is and where I’d like it to be.
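Put into symbols, the per-point desired velocity and the resulting control law can be sketched as below. The symbol \nu^* for the desired camera velocity is a notational assumption, and each p_i = (u_i, v_i) is the current image coordinate of corner i with depth Z_i.

```latex
\dot{p}_i^* = \lambda\,(p_i^* - p_i), \quad i = 1, 2, 3
\qquad\qquad
\nu^* = \lambda
\begin{pmatrix} J_p(p_1, Z_1) \\ J_p(p_2, Z_2) \\ J_p(p_3, Z_3) \end{pmatrix}^{-1}
\begin{pmatrix} p_1^* - p_1 \\ p_2^* - p_2 \\ p_3^* - p_3 \end{pmatrix}
```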
Let’s have a look at a simulation of this. On the left, we have a simulated image plane. So the circles indicate where the object is currently projected and the asterisks indicate where it is that I would like those points to be projected to. The asterisks are the destination coordinates. On the right hand side, we have a 3D view of what’s going on. The three red spheres indicate the points in three-dimensional space, the corners of the triangle. In blue, we have a simple 3D representation of a camera with an attached camera coordinate frame.

When I start the simulation, we can see that the camera is moving towards the three points in the world and initially it moves quite quickly and then slows down.

This is a characteristic of the proportional controller that we’re using: it demonstrates asymptotic convergence. We can see the circles, the current view of those world points on the image plane, moving slowly but surely towards the desired goals.

What’s happened is that by simply specifying what needs to happen on the image plane, that this coordinate needs to move to this coordinate, the task has been completed in the real, three-dimensional world. The camera has moved to the desired pose.
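The fast-then-slow behaviour follows directly from the proportional law. The sketch below is not the simulation shown in the lecture: it tracks a single pixel and assumes the camera realises the requested pixel velocity exactly at every step. The gain, time step and coordinates are made-up values.

```python
import math

lam = 0.5                # gain lambda (assumed value)
dt = 0.1                 # time step in seconds (assumed value)
p = [300.0, 400.0]       # current pixel coordinate (u, v), illustrative
p_star = [512.0, 512.0]  # desired pixel coordinate (u*, v*), illustrative

errors = []
for _ in range(100):
    # Desired pixel velocity: lambda * (p* - p)
    p_dot = [lam * (ps - pc) for ps, pc in zip(p_star, p)]
    # Assumption: the pixel moves with exactly this velocity for one step
    p = [pc + pd * dt for pc, pd in zip(p, p_dot)]
    errors.append(math.hypot(p_star[0] - p[0], p_star[1] - p[1]))

# The error shrinks by the same factor (1 - lam * dt) at every step
print(errors[1] / errors[0])
```

Because the velocity is proportional to the remaining error, each step removes a fixed fraction of it, so the motion is quick at first and the error approaches zero without ever overshooting.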
But notice in achieving that task, we haven’t had to say anything about the pose of the camera or the pose of the objects in the real world. This whole task has been completed simply by driving to zero the error on the image plane.

This is an important characteristic of what we call image-based visual servoing.

Let’s look at a concrete example here. This is my MATLAB workspace which I've initialized with a few variables.

I've created a camera object with the default settings we've seen before. I've defined the vertices of a triangle in world coordinates, and they’re represented by the columns of the matrix “P”. I have also created a homogeneous transformation which represents the pose of the camera within the world space, defined by a position and rotations around the X, Y and Z axes.

So I can plot the projection of the world points on to the image plane using the plot method of the camera object. I pass in the coordinates of the three world points, the columns of the matrix P, and I tell the object that the camera pose is given by this variable here, T-cam. Now we see the projection of those three points on the image plane. We can see that the camera is some distance away from the triangle and it’s looking at the triangle somewhat obliquely.
The plot method has returned the projections on the image plane and they’re given by the columns of the matrix little p. Now I can compute an image Jacobian from these three points. I do that by using the visjac_p method, visual Jacobian for point features. I’m going to pass in the point features, little p, which contains three points, and I’m going to specify that all the points are five meters away from the camera.

This is not strictly true. I don’t actually know how far away the points are but for the moment bear with me and we’ll just put in the number five.

Now we have the visual Jacobians for each of the three points. Each point has a two by six image Jacobian and now they’re all stacked up so the result is a six by six matrix in the workspace. This is a matrix that I could invert. It’s a square matrix. If I use the inverse of this matrix to multiply a vector of desired image plane velocities, the result will be the spatial velocity that the camera needs in order to achieve that.
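The same calculation can be sketched outside MATLAB. The plain-Python version below is an illustration and not the Toolbox's visjac_p: the helper names, the camera parameters (focal length 800 pixels, principal point at (512, 512)), the pixel coordinates and the gain are all assumptions. Only the depth of five meters comes from the example above.

```python
def point_jacobian(u, v, Z, f=800.0, u0=512.0, v0=512.0):
    """2x6 image Jacobian for one point (pinhole model, f in pixels)."""
    ub, vb = u - u0, v - v0
    return [
        [-f / Z, 0.0, ub / Z, ub * vb / f, -(f * f + ub * ub) / f, vb],
        [0.0, -f / Z, vb / Z, (f * f + vb * vb) / f, -ub * vb / f, -ub],
    ]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[piv][c]) < 1e-12:
            raise ValueError("stacked Jacobian is singular")
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            k = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= k * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


p = [(300.0, 400.0), (700.0, 350.0), (550.0, 750.0)]       # current pixels (made up)
p_star = [(412.0, 412.0), (612.0, 412.0), (512.0, 612.0)]  # desired pixels (made up)
lam = 0.1                                                  # gain (assumed)

# Stack the three 2x6 Jacobians, guessing a depth of 5 m for every point
J = [row for (u, v) in p for row in point_jacobian(u, v, Z=5.0)]
# Desired pixel velocities, ordered (u1, v1, u2, v2, u3, v3)
p_dot = [lam * (d - c) for ps, pc in zip(p_star, p) for d, c in zip(ps, pc)]
# Camera velocity (vx, vy, vz, wx, wy, wz) that would produce them
nu = solve(J, p_dot)
print(nu)
```

Solving the linear system directly gives the same result as multiplying by the inverse, without forming the inverse explicitly.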
So in summary, image-based visual servoing is all about describing a task in terms of how we want points to move in the image. We then construct a simple controller that moves the camera so as to create the point motion that we require.

As the points move from where they are now to where we’d like them to be, the useful task is achieved in the 3D world. But the task is not described in terms of pose of objects or of the robot. The task is in fact implicit in the motion that we require on the image plane.

In some ways, we’re talking about relative motion of the camera. We’re steering the camera towards the goal location. It’s not defined in terms of absolutes. We don’t need to know the pose of the camera or the pose of the objects with respect to any kind of world coordinate frame.

This technique works really well in this particular case where we have got three points in the scene. It doesn't work for just one or two points and it’s much more complex in a case where we have four or more points.

It’s certainly doable but a little more complicated and beyond the scope of this particular lecture.



We use MATLAB and some Toolbox functions to create a robot controller that moves a camera so the image matches what we want it to look like. We call this an image-based visual servoing system.

Professor Peter Corke

Professor of Robotic Vision at QUT and Director of the Australian Centre for Robotic Vision (ACRV). Peter is also a Fellow of the IEEE, a senior Fellow of the Higher Education Academy, and on the editorial board of several robotics research journals.

Skill level

This content assumes high school level mathematics and requires an understanding of undergraduate-level mathematics; for example, linear algebra - matrices, vectors, complex numbers, vector calculus and MATLAB programming.
