Instanced Model Sample

This sample shows how to efficiently render many copies of the same model by using GPU instancing techniques to reduce the cost of repeated draw calls.

Sample Overview

Games often need to render many copies of the same model, for instance covering a landscape with trees or filling a room with crates. The calls needed to render a model are relatively expensive, and can quickly add up if you are drawing hundreds or thousands of models in a row. This sample demonstrates how to reduce the overhead of drawing many copies of the same model.

This instancing technique can dramatically reduce the amount of CPU work required to draw models, but it makes little difference to the GPU cost and in some cases may even slightly increase it. The CPU cost of drawing a model is constant regardless of how complex the model may be. However, the GPU cost increases in proportion to the number of triangles and the shader complexity. For this reason, drawing low-polygon models with simple shaders is likely to be limited mainly by CPU performance, while more detailed meshes are likely to be bottlenecked on the GPU side. If the GPU is your bottleneck, there is nothing to be gained from using instancing. Instancing yields the most dramatic performance gains when used with relatively small and simple models, somewhere in the ballpark of 1000 triangles or fewer.

Sample Controls

This sample uses the following keyboard and gamepad controls.

Action	Keyboard control	Gamepad control
Change techniques	A	A
Add instances	X	X
Remove instances	Y	Y
Exit the sample	ESC or ALT+F4	BACK

How the Sample Works

This sample implements three different rendering techniques.

No instancing or state batching: just calling ModelMesh.Draw many times in a loop.
No instancing: not using any special GPU tricks, but being smarter about repeatedly setting device states.
Hardware instancing: the most efficient technique.

No Instancing or State Batching

This rendering technique does not use instancing at all, and it is not a smart approach or something that you should copy in your own code! This is included only for comparison purposes, so you can see the performance gain achieved by the following techniques.

This technique just loops over all the active instances, calling ModelMesh.Draw (which sets all the rendering state and then calls DrawIndexedPrimitives) once for each copy of the model.

In pseudocode, this technique is implemented as follows:

C#

        foreach (Matrix instance in instances)
        {
            SetVertexBuffer();
            SetIndexBuffer();
            SetVertexDeclaration();

            SetWorldTransform(instance);

            foreach (EffectPass pass in effect.CurrentTechnique.Passes)
            {
                pass.Apply();

                DrawIndexedPrimitives();
            }
        }

Note the repetitiveness of setting the same state and calling the same effect methods every time around the loop.

No Instancing

This is still not a proper instancing technique. It just rearranges the C# drawing code to hoist repeated operations out of the inner loop. Instead of looping over all the instances and repeating the exact same draw code for each one, the DrawModelNoInstancing method takes an array of instance transform matrices, so it can draw many copies of the same model all in one go, doing the absolute minimum of repeated work per copy.

In pseudocode, the algorithm is:

C#

        SetVertexBuffer();
        SetIndexBuffer();
        SetVertexDeclaration();

        foreach (EffectPass pass in effect.CurrentTechnique.Passes)
        {
            foreach (Matrix instance in instances)
            {
                SetWorldTransform(instance);
                pass.Apply();

                DrawIndexedPrimitives();
            }
        }

The interesting thing about this technique is that it does not require any shader changes at all, so it works alongside whatever other shaders you may already be using. It achieves a significant performance improvement over naively issuing many calls to ModelMesh.Draw, simply by being smarter about the order in which it does things. Moral: do not be afraid to replace the built-in model drawing code if you can take advantage of more specialized knowledge to optimize your particular scenario!
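
To make the difference between the two loop orderings concrete, the following sketch counts how many calls each ordering issues. It is only an illustration: DrawCallCounter and its methods are hypothetical names, nothing here touches a real GraphicsDevice, and the individual calls are not equally expensive in practice, so the totals show the shape of the saving rather than a measured speedup.

```csharp
using System;

// Hypothetical helper (not part of the sample): tallies the calls made by
// each pseudocode loop shown above.
public static class DrawCallCounter
{
    // Naive ordering: all state is set again for every instance.
    public static int CountNaive(int instanceCount, int passCount)
    {
        int calls = 0;
        for (int i = 0; i < instanceCount; i++)
        {
            calls += 3;         // SetVertexBuffer, SetIndexBuffer, SetVertexDeclaration
            calls += 1;         // SetWorldTransform
            for (int p = 0; p < passCount; p++)
            {
                calls += 2;     // pass.Apply, DrawIndexedPrimitives
            }
        }
        return calls;
    }

    // Batched ordering: buffers and declaration are set once, outside both loops.
    public static int CountBatched(int instanceCount, int passCount)
    {
        int calls = 3;          // SetVertexBuffer, SetIndexBuffer, SetVertexDeclaration
        for (int p = 0; p < passCount; p++)
        {
            for (int i = 0; i < instanceCount; i++)
            {
                calls += 3;     // SetWorldTransform, pass.Apply, DrawIndexedPrimitives
            }
        }
        return calls;
    }
}
```

For 1000 instances and a single effect pass, CountNaive returns 6000 and CountBatched returns 3003. The per-instance work that remains (world transform, pass.Apply, and the draw call itself) is what hardware instancing removes.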

Hardware Instancing

This technique performs the instancing work entirely on the GPU. It has extremely low CPU load regardless of how many instances you are drawing. It requires a HiDef game using shader model 3.0 vertex and pixel shaders.

In conventional non-instanced rendering, every set of three indices forms a triangle. The indices are used to look up into a vertex buffer, which provides data such as position, normal, and texture coordinates. The following diagram shows the data layout for a simple model that consists of just one rectangle (two triangles, specified by six indices, and four vertices).

Hardware instancing does not require any changes to that data layout, but it adds a new source of information. This is a second vertex buffer that holds transform matrices, one per instance. This second vertex buffer should be as large as the maximum number of instances you are planning to draw in a single call. It does not need to be the same size as your main vertex buffer. In each frame you call SetData to update this second vertex buffer with the latest transform matrices for each instance.
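
The sizing rule for the second buffer can be sketched with a plain C# stand-in. This is not XNA code: InstanceTransformBuffer is a hypothetical class that uses System.Numerics.Matrix4x4 in place of the XNA Matrix type, and the grow-on-demand policy is an assumption, one reasonable way to keep the buffer at least as large as the biggest batch drawn so far.

```csharp
using System;
using System.Numerics;

// Hypothetical stand-in for the per-instance vertex buffer (not an XNA type).
public sealed class InstanceTransformBuffer
{
    private Matrix4x4[] storage = new Matrix4x4[0];

    // Number of instance slots currently allocated.
    public int Capacity { get { return storage.Length; } }

    // Number of instances written by the most recent SetData call.
    public int Count { get; private set; }

    // Called once per frame: copies the latest transforms in, and
    // reallocates only when they no longer fit in the existing storage.
    public void SetData(Matrix4x4[] instances)
    {
        if (instances.Length > storage.Length)
        {
            storage = new Matrix4x4[instances.Length];
        }
        Array.Copy(instances, storage, instances.Length);
        Count = instances.Length;
    }

    // Reads back the transform stored for one instance.
    public Matrix4x4 this[int index] { get { return storage[index]; } }
}
```

Because the storage only ever grows, removing instances never triggers a reallocation, and the size is independent of how many vertices the model itself has.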

To pass both vertex buffers into your shader, you must specify a VertexDeclaration that includes this additional transform matrix data. Unfortunately, there is no VertexElementFormat for matrix data, so instead we must split the matrix up into four channels of type Vector4.
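
The split into four Vector4 channels can be illustrated without any graphics API. The sketch below uses System.Numerics types as stand-ins for the XNA Matrix and Vector4, and InstanceMatrixLayout is a hypothetical helper name. It shows the byte offset each channel would occupy in a 64-byte instance record, which is the information a vertex declaration for the second buffer has to describe.

```csharp
using System.Numerics;

// Hypothetical helper: describes one per-instance record as four Vector4 channels.
public static class InstanceMatrixLayout
{
    public const int BytesPerVector4 = 4 * sizeof(float);   // 16 bytes
    public const int StrideInBytes = 4 * BytesPerVector4;   // 64 bytes per instance

    // Byte offset of channel 0..3 within one instance record: 0, 16, 32, 48.
    public static int ChannelOffset(int channel)
    {
        return channel * BytesPerVector4;
    }

    // Splits a matrix into its four rows, in the order they are stored in memory.
    public static Vector4[] Split(Matrix4x4 m)
    {
        return new[]
        {
            new Vector4(m.M11, m.M12, m.M13, m.M14),
            new Vector4(m.M21, m.M22, m.M23, m.M24),
            new Vector4(m.M31, m.M32, m.M33, m.M34),
            new Vector4(m.M41, m.M42, m.M43, m.M44),
        };
    }

    // Rebuilds the matrix from the four channels, as the vertex shader must do.
    public static Matrix4x4 Join(Vector4[] rows)
    {
        return new Matrix4x4(
            rows[0].X, rows[0].Y, rows[0].Z, rows[0].W,
            rows[1].X, rows[1].Y, rows[1].Z, rows[1].W,
            rows[2].X, rows[2].Y, rows[2].Z, rows[2].W,
            rows[3].X, rows[3].Y, rows[3].Z, rows[3].W);
    }
}
```

With this row layout, a translation matrix carries its offset in the fourth channel, so Split(Matrix4x4.CreateTranslation(1, 2, 3))[3] is the vector (1, 2, 3, 1).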

We must now set both vertex buffers onto the GraphicsDevice (the first holding our actual geometry data, and the second holding the instance transform matrices), and then draw by using the DrawInstancedPrimitives API method, as opposed to the usual DrawIndexedPrimitives. This is handled by the DrawModelHardwareInstancing method in this sample.

Finally, we must add the per-instance transform matrix as an input parameter to our vertex shader. See the HardwareInstancingVertexShader function in InstancedModel.fx for an example of how to do this.

Once we have completed these setup operations, the GPU handles everything else for us. We can just call DrawInstancedPrimitives, and the GPU draws the specified number of copies of our model data. It reuses the same index buffer for each copy. It uses these indices to look up into the same geometry vertex buffer. It also keeps track of which instance it is currently drawing, automatically looking up the transform matrix from the appropriate part of our second vertex buffer and passing this into our vertex shader along with the position, normal, and so on.

Using the same two-triangle example as the previous diagram, the following diagram shows the data flow when drawing two copies of the model using hardware instancing:

Data used for the first instance is shown in blue, and the second instance is green. Grey represents data that is shared between both instances. Note how triangles 0/2 and 1/3 share the same indices, which reference the same information from vertex buffer stream 0, but how each instance pulls a different transform matrix from vertex buffer stream 1.
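
The same data flow can be emulated on the CPU to make the lookups explicit. This is a sketch of what the GPU does conceptually, not how the sample draws anything: InstancingEmulator is a hypothetical name, System.Numerics types stand in for the XNA math types, and only vertex positions are modeled.

```csharp
using System.Collections.Generic;
using System.Numerics;

// Hypothetical CPU emulation of an instanced draw (positions only).
public static class InstancingEmulator
{
    // Returns the transformed positions in the order the triangles are emitted.
    public static List<Vector3> DrawInstanced(
        Vector3[] stream0Positions,      // shared geometry (vertex buffer stream 0)
        int[] indices,                   // shared index buffer
        Matrix4x4[] stream1Transforms)   // one matrix per instance (stream 1)
    {
        var output = new List<Vector3>(indices.Length * stream1Transforms.Length);

        for (int instance = 0; instance < stream1Transforms.Length; instance++)
        {
            // Stream 1 advances once per instance, not once per vertex.
            Matrix4x4 world = stream1Transforms[instance];

            // The same indices are replayed for every instance.
            foreach (int index in indices)
            {
                Vector3 position = stream0Positions[index];
                output.Add(Vector3.Transform(position, world));
            }
        }
        return output;
    }
}
```

Drawing the rectangle (four vertices, six indices) with two transforms produces twelve output vertices, or four triangles. Triangles 0 and 2 come from the same three indices, and differ only in which matrix was read from stream 1.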

Extending the Sample

Apart from the instancing behavior, the shader used in this sample is not very interesting. It just implements a simple Lambert diffuse lighting model. You could extend this to add more interesting features such as specular lighting, multiple light sources, per-pixel lighting, normal mapping, and so on.

This sample specifies only a 4×4 matrix per instance, so although each instance can be positioned differently, all the copies look exactly identical. You could extend this by adding further per-instance parameters as extra data channels of the secondary vertex stream used for hardware instancing. You could then use these extra parameters to give each instance a different tint, or to perform selective color replacement or decal selection to vary their appearance.
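
As a rough sketch of the first suggestion, a per-instance record could carry a tint after the transform. The InstanceData struct and the helper below are hypothetical and use System.Numerics types rather than XNA ones; a real implementation would also need a matching extra element in the vertex declaration and an extra input in the vertex shader.

```csharp
using System.Numerics;
using System.Runtime.InteropServices;

// Hypothetical per-instance record: the existing transform plus one extra channel.
[StructLayout(LayoutKind.Sequential)]
public struct InstanceData
{
    public Matrix4x4 Transform;   // four Vector4 channels, 64 bytes
    public Vector4 Tint;          // one additional Vector4 channel, 16 bytes
}

public static class InstanceDataLayout
{
    // Size of one record in the secondary vertex stream.
    public static int StrideInBytes
    {
        get { return Marshal.SizeOf<InstanceData>(); }
    }

    // Byte offset at which the tint channel starts within a record.
    public static int TintOffset
    {
        get { return Marshal.OffsetOf<InstanceData>("Tint").ToInt32(); }
    }

    // What a shader might do with the value: component-wise modulation.
    public static Vector4 ApplyTint(Vector4 baseColor, Vector4 tint)
    {
        return baseColor * tint;
    }
}
```

Each record grows from 64 to 80 bytes, with the tint starting at byte offset 64, so the cost of the extra variation is one more Vector4 per instance uploaded each frame.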