Why a scene in perspective is linear in 1/Z, rather than in Z

This note explains why a perspective matrix transformation causes points to step evenly in 1/Z, rather than in Z, as you march along the pixels of a scan line in the image of an object seen in perspective.

You already know that a point (X,Y,Z) lies on a plane (a,b,c,d) when aX + bY + cZ + d = 0. To make it easier to manipulate three dimensional points with plane equations, it is useful to represent each point in homogeneous coordinates (x,y,z,w), where x = wX, y = wY, and z = wZ. We call w the homogeneous coordinate. Multiplying the plane equation through by w shows that point (x,y,z,w) lies in plane (a,b,c,d) when ax + by + cz + dw = 0.

So the actual three dimensional point (X,Y,Z), represented by the homogeneous vector (x,y,z,w), is (x/w, y/w, z/w). We can freely scale (x,y,z,w) by any nonzero factor; it will still represent the same point. For example, (1,2,3,1) and (2,4,6,2) represent the same point.
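As a minimal sketch of this convention, the following Python divides through by w and checks the homogeneous plane equation. The function names to_cartesian and on_plane are made up for this illustration, and exact fractions are used only so that the equality tests are not disturbed by rounding.

```python
from fractions import Fraction

def to_cartesian(h):
    """Return the 3D point (x/w, y/w, z/w) represented by homogeneous (x, y, z, w)."""
    x, y, z, w = (Fraction(c) for c in h)
    if w == 0:
        raise ValueError("w = 0 represents a point at infinity")
    return (x / w, y / w, z / w)

def on_plane(h, plane):
    """True when ax + by + cz + dw = 0 for the plane (a, b, c, d)."""
    return sum(Fraction(p) * Fraction(c) for p, c in zip(plane, h)) == 0

# Scaling all four coordinates leaves the represented point unchanged.
print(to_cartesian((1, 2, 3, 1)) == to_cartesian((2, 4, 6, 2)))  # True

# The plane Z = 3 is (a, b, c, d) = (0, 0, 1, -3); both vectors lie on it.
print(on_plane((1, 2, 3, 1), (0, 0, 1, -3)))  # True
print(on_plane((2, 4, 6, 2), (0, 0, 1, -3)))  # True
```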

If the only transformations you ever want to do are translate, rotate and scale, then you can just stick with the special case where w = 1. But the general case, where w can take on other values, is useful for describing the points at infinity that appear when you do perspective. Points go to infinity whenever w goes to zero.

Perspective is caused by linear transformations in which the transformation matrix contains something in its last row other than (0,0,0,1). This is when the transformed w coordinate can become zero.

For example, the following is a perspective matrix:

1 0 0 0
0 1 0 0
0 0 0 1
0 0 1 0

Notice that the only effect of the above matrix is to swap the homogeneous z and w coordinates, so that, for example, (x,y,z,1) is transformed to (x,y,1,z).

You can see that this is indeed a perspective transformation, since it transforms any point on the Z=0 plane to a point at infinity.

This, in fact, is the particular variety of matrix which describes what happens when you do camera perspective, if you place the camera at the origin and aim it along the Z axis.

Remember that the homogeneous vector (x,y,z,w) really represents the point in space (x/w, y/w, z/w). If we swap z and w, then (x,y,z,1) is transformed to (x,y,1,z), which represents (x/z, y/z, 1/z). This is the sort of space we find ourselves in when we do a perspective transformation.
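To make the swap concrete, here is a small Python sketch that applies the matrix above to one sample point and then divides through by w. The names SWAP_ZW and transform, and the sample point (6,4,2), are assumptions chosen for illustration; the matrix is applied to column vectors, as in the text.

```python
from fractions import Fraction

# The z/w-swapping perspective matrix from the text, stored row by row.
SWAP_ZW = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]

def transform(m, h):
    """Multiply the 4x4 matrix m by the homogeneous column vector h."""
    return tuple(sum(Fraction(a) * Fraction(b) for a, b in zip(row, h)) for row in m)

x, y, z = 6, 4, 2
hx, hy, hz, hw = transform(SWAP_ZW, (x, y, z, 1))  # equals (x, y, 1, z)
point = (hx / hw, hy / hw, hz / hw)                # equals (x/z, y/z, 1/z)
print(point == (Fraction(x, z), Fraction(y, z), Fraction(1, z)))  # True

# A point on the plane Z = 0 ends up with w = 0, that is, at infinity.
print(transform(SWAP_ZW, (6, 4, 0, 1))[3] == 0)    # True
```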

The precise perspective matrix actually depends on the focal length of the camera (i.e., how far the camera is from the plane in which objects appear at actual size). If the focal length is f, then the proper perspective matrix to use is:

1 0 0   0
0 1 0   0
0 0 0   1/f
0 0 1/f 0

Let us represent (X,Y,Z) in the homogeneous form (X,Y,Z,1). This point is transformed by the above perspective matrix to (X,Y,1/f,Z/f). Dividing through by the homogeneous coordinate, we find that the transformed point (X',Y',Z') is given by (fX/Z, fY/Z, 1/Z).

This means, geometrically, that any object which is at the focal distance (that is, when Z=f) is going to appear at actual size, since this is the case where the f and the Z will cancel out in the X' and Y' coordinates. Any object that is nearer than f will be magnified, and any object which is farther away than f will appear smaller.
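The derivation above can be checked numerically. The sketch below builds the focal-length matrix, applies it to (X,Y,Z,1), and divides by the homogeneous coordinate. The helper names perspective_matrix and project are hypothetical, and the sample points and f = 2 are arbitrary choices.

```python
from fractions import Fraction

def perspective_matrix(f):
    """The perspective matrix from the text, with 1/f in the swapped z and w slots."""
    inv_f = 1 / Fraction(f)
    return [
        [1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, inv_f],
        [0, 0, inv_f, 0],
    ]

def project(point, f):
    """Map (X, Y, Z) to (X', Y', Z') = (fX/Z, fY/Z, 1/Z)."""
    h = (*point, 1)
    x, y, z, w = (
        sum(Fraction(a) * Fraction(b) for a, b in zip(row, h))
        for row in perspective_matrix(f)
    )
    if w == 0:
        raise ValueError("points on the plane Z = 0 are sent to infinity")
    return (x / w, y / w, z / w)

f = 2
at_focal = project((3, 1, 2), f)  # Z = f:  X' and Y' equal X and Y (actual size)
farther = project((3, 1, 4), f)   # Z > f:  X' and Y' shrink
nearer = project((3, 1, 1), f)    # Z < f:  X' and Y' are magnified
print(at_focal[:2] == (3, 1))     # True
print(farther[0] < 3 < nearer[0]) # True
```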

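Note what happens to Z' as we regularly increment X'. The perspective transformation is linear in homogeneous coordinates, so it maps straight lines to straight lines, and along a straight line in the transformed space Z' varies linearly with X'. So if we march along evenly in X' (that is, we increase fX/Z in even steps), then we must also march along evenly in Z', which is 1/Z in the original (untransformed) space.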
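Here is a rough numerical check of that claim, assuming a single straight segment in front of the camera (Z > 0) that is not seen end-on. The function name depth_along_scanline and the sample segment are made up for this sketch. For each evenly spaced X', it solves for the point on the 3D segment that projects there and reports both Z and 1/Z.

```python
from fractions import Fraction

def depth_along_scanline(p0, p1, f, steps):
    """Return (X', Z, 1/Z) at evenly spaced X' across the image of segment p0-p1."""
    (x0, _, z0), (x1, _, z1) = p0, p1
    f = Fraction(f)
    dx, dz = Fraction(x1 - x0), Fraction(z1 - z0)
    xs0, xs1 = f * x0 / z0, f * x1 / z1  # projected endpoints, X' = fX/Z
    rows = []
    for i in range(steps + 1):
        s = xs0 + (xs1 - xs0) * Fraction(i, steps)  # even step in X'
        # Solve f*(x0 + t*dx) = s*(z0 + t*dz) for the parameter t along the segment.
        t = (s * z0 - f * x0) / (f * dx - s * dz)
        z = z0 + t * dz
        rows.append((s, z, 1 / z))
    return rows

rows = depth_along_scanline((0, 0, 1), (4, 0, 2), f=1, steps=4)
z_values = [z for _, z, _ in rows]
inv_z_values = [inv_z for _, _, inv_z in rows]

z_steps = [b - a for a, b in zip(z_values, z_values[1:])]
inv_z_steps = [b - a for a, b in zip(inv_z_values, inv_z_values[1:])]

print(len(set(inv_z_steps)) == 1)  # True: 1/Z changes by the same amount each pixel
print(len(set(z_steps)) == 1)      # False: Z itself does not step evenly
```

For this sample segment every step in 1/Z is -1/8, while the steps in Z grow as the segment recedes.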

In other words, a scene viewed in perspective is linear in 1/Z, not in Z.