Animation Systems 101

So, here’s a post I made on a forum – this is a basic primer on animation systems for video game development. The idea is that it should explain how they are made and what components are part of them.

Animation Systems – a Primer

Bone based animations – at its root, an animation is basically just a bunch of hierarchical bone rotations (and possibly positions, if you want the capability to scale bones).

An animation file tends to contain several things -
The skeleton default pose – this is a list of bones, in tree order, with positions, when each bone is considered to have “no rotation” – the angles for each bone are 0,0,0 or the matrix is at identity. Usually this pose looks like the T Pose for humanoid characters, but really it’s whatever your animation guys think it should be. The only caveat is that all animation values you ever get are offsets from this original default position.

Then, for each frame of animation, you have a list of bones (in hierarchical tree order) with rotations (and, as said, possibly new positions) for each bone. This can be represented in several ways. Simple Euler angles work, although if you use them you have to indicate what the order of angle operations is (do you apply Roll first? Or Pitch?), just like you do when creating a matrix from Euler angles (which is the final step of the per-frame skeleton generation). Almost all systems use Quaternions for per-bone rotation because a) they are interpolatable (you can take two rotations, one for frame A and one for frame B, and just do a linear interpolation of values between the two, renormalizing the result if you store unit quaternions), which you cannot do with a pure matrix representation, b) they are smaller in size (3 floats for the quat vector and one for rotation, as opposed to 9 floats for a pure rotation matrix) and c) you can actually encode scale into the vector (ie it’s not normalized) if you want to lengthen bones.

For each frame you wish to display, it’s a very simple calculation to work out which frame you are at inside the animation (well, between – usually you are between two frames – I’ll get to frame frequency in a moment). Take the two frames’ rotations for any given bone and interpolate between them, based on how close you are to Frame A versus Frame B (Frame A is at 300ms, Frame B is at 400ms, we are at 320ms, so we get 80% of Frame A and 20% of Frame B. Bogus example, but you get the idea).
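As a rough sketch of that calculation, here is the 300ms / 400ms example in Python. The function names and the (x, y, z, w) quaternion layout are assumptions made for this example, and it uses a normalized linear interpolation, which is one common choice rather than the only one.

```python
import math

def nlerp(qa, qb, t):
    """Normalized linear interpolation between two unit quaternions (x, y, z, w)."""
    # If the quaternions point into opposite hemispheres, flip one so we
    # interpolate along the shorter arc.
    dot = sum(a * b for a, b in zip(qa, qb))
    if dot < 0.0:
        qb = tuple(-c for c in qb)
    q = tuple(a + (b - a) * t for a, b in zip(qa, qb))
    length = math.sqrt(sum(c * c for c in q))
    return tuple(c / length for c in q)

def blend_factor(time_ms, frame_a_ms, frame_b_ms):
    """Fraction of the way from frame A to frame B (0.0 = all A, 1.0 = all B)."""
    return (time_ms - frame_a_ms) / (frame_b_ms - frame_a_ms)

# Frame A at 300ms, frame B at 400ms, current time 320ms.
t = blend_factor(320.0, 300.0, 400.0)  # 0.2, i.e. 80% of A and 20% of B

# Hypothetical bone rotations: no rotation in A, a quarter turn about Z in B.
frame_a = (0.0, 0.0, 0.0, 1.0)
frame_b = (0.0, 0.0, math.sin(math.pi / 4), math.cos(math.pi / 4))
bone_rotation = nlerp(frame_a, frame_b, t)
```

The same function gets reused later for blending between whole animations, which is why it is worth keeping it independent of how the two input rotations were found.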

From this, you now have a quaternion for a specific bone. You then generate a matrix from this quaternion and multiply that by the parent bone’s matrix (remember, you are traversing a tree here, from root to extremity, and it’s hierarchical, so you need to multiply by the parent to get an absolute rotation for each bone, rather than a parent relative one), and there’s your skeleton.
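A minimal sketch of that tree walk, assuming bones are stored parent-before-child, that `parents[i]` is the index of bone i's parent (-1 for the root), and that `offsets[i]` is the bone's default-pose position relative to its parent. All of these names and layouts are made up for illustration.

```python
import math

def quat_to_matrix(q):
    """3x3 rotation matrix (column-vector convention) from a unit quaternion (x, y, z, w)."""
    x, y, z, w = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def mat_vec(m, v):
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))

def build_skeleton(parents, offsets, local_quats):
    """Turn parent-relative rotations into absolute rotations and positions."""
    world_rot, world_pos = [], []
    for bone, parent in enumerate(parents):
        local = quat_to_matrix(local_quats[bone])
        if parent < 0:
            world_rot.append(local)
            world_pos.append(offsets[bone])
        else:
            # Absolute rotation = parent's absolute rotation * local rotation.
            world_rot.append(mat_mul(world_rot[parent], local))
            rotated = mat_vec(world_rot[parent], offsets[bone])
            world_pos.append(tuple(p + r for p, r in zip(world_pos[parent], rotated)))
    return world_rot, world_pos

# Two-bone chain: the root turns 90 degrees about Z, the child sits 1 unit along X.
half = math.pi / 4
parents = [-1, 0]
offsets = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
local_quats = [(0.0, 0.0, math.sin(half), math.cos(half)), (0.0, 0.0, 0.0, 1.0)]
world_rot, world_pos = build_skeleton(parents, offsets, local_quats)
# The child ends up at roughly (0, 1, 0): the root's rotation swung it around.
```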


To skin, you have a mesh which is mapped to the skeleton default pose (as described above) and for every vertex you have two sets of values (beyond simple XYZ position). A list of bones that affect this vertex and percentages for each bone. Usually there are limits to how many bones can affect a given vertex (4 or 8 because of shader usage, but we’ll get to that later. For now, this is a software skinning system.)

So, for each vertex you need to do a transform. What you need to do is transform each vertex by each bone that it says affects it, and then blend together the resulting position based on percentage of the amount each bone affects it.
So if vertex 38 is affected by bone 3 at 20%, 7 at 15%, 9 at 15% and 10 at 50%, you transform the original vertex position by bones 3, 7, 9 and 10, then blend the result using the percentages given. NOTE – you need to blend the result of each of those transforms, not the actual matrices for each bone. You need to physically transform the vertex for each bone – giving you 4 positions, in this example – and then do a weighted blend between the RESULTS, not the initial matrices.

Also note that if you have a per vertex normal, you need to transform that by each bone and do the same blending you are doing for the vertex position.
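Here is a small sketch of that per-vertex loop, using the vertex 38 weights from the example. It assumes each bone has already been reduced to a (3x3 rotation, translation) pair that takes a default-pose position straight to its animated position; that representation and all the names are assumptions for illustration only.

```python
import math

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def transform_point(bone, p):
    rot, trans = bone
    return tuple(sum(rot[i][k] * p[k] for k in range(3)) + trans[i] for i in range(3))

def transform_direction(bone, n):
    rot, _ = bone  # normals are rotated but never translated
    return tuple(sum(rot[i][k] * n[k] for k in range(3)) for i in range(3))

def skin_vertex(position, normal, influences, bones):
    """Transform the vertex by every influencing bone, then blend the RESULTS."""
    out_p = [0.0, 0.0, 0.0]
    out_n = [0.0, 0.0, 0.0]
    for bone_index, weight in influences:
        p = transform_point(bones[bone_index], position)
        n = transform_direction(bones[bone_index], normal)
        for i in range(3):
            out_p[i] += weight * p[i]
            out_n[i] += weight * n[i]
    length = math.sqrt(sum(c * c for c in out_n))
    if length > 0.0:
        out_n = [c / length for c in out_n]  # blended normals need renormalizing
    return tuple(out_p), tuple(out_n)

# Hypothetical bones: 3 has moved 1 unit in X, 10 has moved 2 units in Y.
bones = {
    3: (IDENTITY, (1.0, 0.0, 0.0)),
    7: (IDENTITY, (0.0, 0.0, 0.0)),
    9: (IDENTITY, (0.0, 0.0, 0.0)),
    10: (IDENTITY, (0.0, 2.0, 0.0)),
}
influences = [(3, 0.20), (7, 0.15), (9, 0.15), (10, 0.50)]
pos, nrm = skin_vertex((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), influences, bones)
# pos is about (0.2, 1.0, 0.0): 20% of bone 3's move plus 50% of bone 10's.
```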

There are more elegant ways of doing this, such as generating one matrix to multiply the vertex by, but in my experience the work required to generate that matrix tends to outweigh just doing the transform repeatedly (particularly if you are using SSE instructions). Others may well be cleverer than I am.
And you need to do this for every vertex.

That’ll give you your mesh.

At root, that’s the most basic dynamic meshed animation system.

Now it can start getting very complicated very fast.

Optimizing Animation Storage

Firstly, you can vary your key frame rate inside the animation. A key frame is each frame actually stored in the animation. So, assuming the game updates at 60hz, a key frame rate of 30hz means that you have an actual frame of animation every other tick. 60hz means a frame per frame.
Most animation systems allow for variable rates – so the animation itself tells you “I’m 30hz” or “I’m 20hz” because some animations need higher frame rates than others – an idle pose can be 15fps no problem, but gun animations in an FPS (because they are up close and always on screen at full size) need to be higher and smoother.
Even though you are interpolating in software, it’s relatively easy to miss an extreme frame if you are at a low frame rate (e.g. you miss the frame with the hand outstretched to its fullest in a punch because it falls on a frame that is skipped at 15hz), so your animation doesn’t look right. It also makes circular motions look wrong because you are interpolating between absolute positions – instead of a nice curve you end up with very angular animations.

So what can you do about this? Well, more advanced animation systems allow you to have a variable key frame rate within the animation itself. So although the animation is set at 15hz, there are some extra, animator marked, frames within it to ensure that extremes of motion aren’t missed. Each frame has a time stamp on it so you know exactly when the previous frame and the next frame are meant to occur. This is more work for the animator but ultimately results in smoother and better looking animation.
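With time stamps on every key, finding the two frames to interpolate between becomes a search rather than a division. A minimal sketch, assuming the key times are kept in a sorted list (the times below are a made-up 15hz track with one extra animator-marked key at 100ms):

```python
from bisect import bisect_right

def find_bracketing_keys(key_times_ms, time_ms):
    """Return (index_a, index_b, t) for the keys either side of time_ms."""
    if time_ms <= key_times_ms[0]:
        return 0, 0, 0.0
    last = len(key_times_ms) - 1
    if time_ms >= key_times_ms[last]:
        return last, last, 0.0
    b = bisect_right(key_times_ms, time_ms)  # first key strictly after time_ms
    a = b - 1
    t = (time_ms - key_times_ms[a]) / (key_times_ms[b] - key_times_ms[a])
    return a, b, t

# Regular keys roughly every 66.7ms, plus an extra key at 100ms for the extreme pose.
key_times = [0.0, 66.7, 100.0, 133.3, 200.0]
a, b, t = find_bracketing_keys(key_times, 110.0)
# Interpolates between the extra key (index 2) and the next regular key (index 3).
```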

Then there’s per bone LOD and compression – you might actually want to store each component of the per bone quat as 16bit fixed point instead of a float (4 x 16 bits rather than 4 floats) – this halves the size and still gives you pretty good results, although less so for dramatic large motions.
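To put numbers on the size saving, here is one possible 16-bit scheme: scale each component of a unit quaternion from [-1, 1] to a signed short. The scale factor and function names are assumptions for this sketch; real systems use a variety of packing schemes.

```python
import math
import struct

SCALE = 32767  # largest signed 16-bit value

def quantize_quat(q):
    """Map each component from [-1, 1] to a signed 16-bit integer."""
    return tuple(int(round(max(-1.0, min(1.0, c)) * SCALE)) for c in q)

def dequantize_quat(packed):
    """Back to floats, renormalized to absorb the rounding error."""
    q = [c / SCALE for c in packed]
    length = math.sqrt(sum(c * c for c in q))
    return tuple(c / length for c in q)

original = (0.0, 0.0, math.sin(0.3), math.cos(0.3))
packed = quantize_quat(original)

as_shorts = struct.pack("<4h", *packed)    # 8 bytes
as_floats = struct.pack("<4f", *original)  # 16 bytes

restored = dequantize_quat(packed)
worst_error = max(abs(a - b) for a, b in zip(original, restored))
```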

Lots of animation systems also don’t store per bone values for every bone per frame. They may well have systems that, when generating the animation in game ready format, say “Hmm, bone 3 didn’t move more than X degrees between frame 10 and frame 11. I’m just going to ignore that and not write any value out at all”. This approach can dramatically reduce the per frame data for a given skeleton, depending on what your low threshold is for “Any motion below this value, I don’t care about” (or your Epsilon Delta). However, this also can have the effect of losing certain very subtle aspects of human motion by effectively smoothing them out. It’s usually what makes mocap look like shit – because someone is smoothing out very subtle motion to make the curves look nicer and make the data smaller.
It also has the effect of making the code for generating skeletons have to look harder to find the last / next Quat per bone for frame to frame interpolation, since you can’t guarantee it actually IS in any given frame – you have to keep looking for it in every frame back from where you are till you actually find one (there are ways around this, with caching of data from frame to frame, but it’s a hack to get around thrashing your CPU cache looking for animation data for a given bone).
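A sketch of what such an export-time pass might look like for a single bone. One deliberate difference from the description above, chosen for this example: it compares each frame against the last key actually written rather than the immediately preceding frame, so that lots of tiny movements can't silently add up to a large one. The names and the 1 degree threshold are illustrative only.

```python
import math

def rot_z(degrees):
    half = math.radians(degrees) / 2.0
    return (0.0, 0.0, math.sin(half), math.cos(half))

def angle_between(qa, qb):
    """Angle in radians of the rotation that takes qa to qb (unit quaternions)."""
    dot = abs(sum(a * b for a, b in zip(qa, qb)))
    return 2.0 * math.acos(min(1.0, dot))

def reduce_keys(frames, threshold_radians):
    """Keep (frame_index, quat) pairs only where the bone moved enough to matter."""
    kept = [(0, frames[0])]
    for index in range(1, len(frames)):
        if angle_between(kept[-1][1], frames[index]) > threshold_radians:
            kept.append((index, frames[index]))
    # Always keep the final frame so the track has a defined end point.
    if kept[-1][0] != len(frames) - 1:
        kept.append((len(frames) - 1, frames[-1]))
    return kept

# One bone's rotation about Z over six frames, in degrees.
frames = [rot_z(d) for d in (0.0, 0.1, 0.2, 5.0, 5.1, 10.0)]
kept = reduce_keys(frames, math.radians(1.0))
kept_indices = [index for index, _ in kept]  # frames 1, 2 and 4 are dropped
```

Because each kept key carries its original frame index, the runtime can search the kept list for the keys either side of the current time instead of walking backwards frame by frame.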

Anyway, suffice to say there are lots of ways to be clever about bone storage.


Then there’s blending. This can get really complicated, really fast.
So, you’ve already got per frame interpolation going on, so there’s blending going on already. What about blending two animations together? From a standing idle to walk? Well, it’s not really that hard. You just generate a skeleton (as quaternions) for each of the two animations, then blend between those two skeletons (using the exact same code you used for generating the interpolation between frame A and frame B).
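As a sketch, with a pose represented as a plain list of per-bone quaternions (an assumption for this example, as are the function names), blending two animations is just the per-frame interpolation applied bone by bone:

```python
import math

def rot_z(degrees):
    half = math.radians(degrees) / 2.0
    return (0.0, 0.0, math.sin(half), math.cos(half))

def nlerp(qa, qb, t):
    """Normalized linear interpolation between two unit quaternions (x, y, z, w)."""
    dot = sum(a * b for a, b in zip(qa, qb))
    if dot < 0.0:
        qb = tuple(-c for c in qb)
    q = tuple(a + (b - a) * t for a, b in zip(qa, qb))
    length = math.sqrt(sum(c * c for c in q))
    return tuple(c / length for c in q)

def blend_poses(pose_a, pose_b, weight_b):
    """Blend two skeletons bone by bone; weight_b = 0 gives A, 1 gives B."""
    return [nlerp(a, b, weight_b) for a, b in zip(pose_a, pose_b)]

# Hypothetical two-bone skeletons sampled from an idle and a walk animation.
idle_pose = [rot_z(0.0), rot_z(0.0)]
walk_pose = [rot_z(90.0), rot_z(-30.0)]

halfway = blend_poses(idle_pose, walk_pose, 0.5)
# Bone 0 ends up at a 45 degree turn, bone 1 at -15 degrees.
```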

Although that does tend to suppose you are blending two full body animations. What if you want to run two different animations? One upper body, one lower body? Then what?
Well, now your animation system has to cope with the fact that, effectively, you might be running any animation on any bone (this is what Ghoul2 does) at differing frame rates.
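One way to picture this (not how Ghoul2 actually does it, just a sketch with made-up names) is a list of channels, each saying "this clip, at this playback time, drives these bones". Each clip keeps its own frame rate. Strings stand in for the per-bone quaternions so the result is easy to read.

```python
def make_clip(name, rate_hz, frame_count, bone_count):
    """Hypothetical clip: placeholder strings stand in for per-bone quaternions."""
    frames = [[f"{name}{f}_b{b}" for b in range(bone_count)] for f in range(frame_count)]
    return {"rate_hz": rate_hz, "frames": frames}

def sample_clip(clip, time_s):
    """Nearest-earlier-key lookup on a looping clip (no interpolation, for brevity)."""
    frame = int(time_s * clip["rate_hz"]) % len(clip["frames"])
    return clip["frames"][frame]

def build_pose(default_pose, channels):
    """Each channel overrides only the bones it owns; the rest keep the default."""
    pose = list(default_pose)
    for channel in channels:
        sampled = sample_clip(channel["clip"], channel["time_s"])
        for bone in channel["bones"]:
            pose[bone] = sampled[bone]
    return pose

# Bones: 0 = pelvis, 1 = leg, 2 = spine, 3 = arm.
run = make_clip("run", 30, 3, 4)    # 30hz clip
wave = make_clip("wave", 15, 2, 4)  # 15hz clip
default_pose = ["default"] * 4

channels = [
    {"clip": run, "time_s": 0.05, "bones": [0, 1]},   # lower body
    {"clip": wave, "time_s": 0.10, "bones": [2, 3]},  # upper body
]
pose = build_pose(default_pose, channels)
```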

You will also, probably, want to allow direct override control at any given bone too – so if you have a camera with a rotation bone for the root of the camera, you can direct it to where YOU want it to look, rather than where the animation does.

Again, not hard, but a fair amount of code to get right, and 85% of the time it will go totally unused.

But all this blending does come at a price, and I don’t mean CPU wise. I mean animation wise. The problem is that animations are hierarchical, which means if you override one specific bone, it’s still starting its animation at whatever point the parent was at. So if you override a wrist animation, it’s occurring on the end of an arm that is controlled by something else. You might want to flip someone off, but the flip animation, while running on the hand just fine, is running during a run cycle, so the arm is moving according to the run animation.

It gets worse. The example of the upper and lower body above is a good case in point. Say I want to have someone sitting in a chair, but I also want that person to be watching someone else move around a room, so he’s constantly looking at them. You could override the bone at the bottom of the spine, but that’s not realistic – you just twist people around. What you actually have to do is override several bones in the back and twist them each a bit, so the torso rotates more realistically.
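A toy version of "twist them each a bit": split a total look-at angle across several spine bones according to a share per bone. This sketch assumes every spine bone twists about the same up axis, which is a simplification (in a real skeleton each bone's axis has already been tilted by the animation), and all names are hypothetical.

```python
import math

UP = (0.0, 1.0, 0.0)

def axis_angle(axis, angle):
    """Unit quaternion (x, y, z, w) for a rotation of angle radians about a unit axis."""
    s = math.sin(angle / 2.0)
    return (axis[0] * s, axis[1] * s, axis[2] * s, math.cos(angle / 2.0))

def quat_mul(a, b):
    """Hamilton product of two quaternions stored as (x, y, z, w)."""
    ax, ay, az, aw = a
    bx, by, bz, bw = b
    return (
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
        aw * bw - ax * bx - ay * by - az * bz,
    )

def spread_twist(total_angle, bone_shares):
    """Give each spine bone its share of the total twist."""
    total_share = sum(bone_shares)
    return [axis_angle(UP, total_angle * share / total_share) for share in bone_shares]

# Turn the torso 60 degrees using three spine bones, the top one doing the most.
twists = spread_twist(math.radians(60.0), [1.0, 2.0, 3.0])

# Chaining the three small twists gives the full 60 degree turn.
combined = (0.0, 0.0, 0.0, 1.0)
for twist in twists:
    combined = quat_mul(combined, twist)
```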

At this point you are now into very primitive IK – which is really outside of the animation system itself. The animation system should provide hooks that an IK system can use (like ragdoll) but shouldn’t be doing that work itself, because that means it has knowledge of the meshes / skeletons it’s rendering, which it shouldn’t.

Anyway, you get the point that it starts getting very blurry.

NOTE – there are some very cool things you can do with blending regardless of knowledge. Imagine having a baseball batter. Imagine you have two swings, one high and one low (at the same frame rate). You can run both animations on the batter with a provided percentage blend and what you get out of it is a swing that is at the exact height you want it, based on your percentage. So if the high swing is at 5 feet and the low swing is at 4 feet, a percentage of 50% will give you a swing at 4.5 feet. That’s actually a really cool thing, if you think about it.

Shader Skinning

So what about hardware skinning? Using a shader to do all that matrix math? That’s what vertex shaders are for, right?
Well, yes. However, there are two downsides to this. The first is that you don’t get to see the resulting mesh inside the CPU. If it’s all occurring inside the vertex shader, you don’t get the resulting mesh, which means you can’t be doing per-triangle collision detection against it (well, you can if you do the work on the CPU as WELL as the GPU and there are some higher performance optimization reasons for doing this too which I won’t go into now) – which mostly doesn’t matter but occasionally does.
The second is that now you have to break up your skin. Usually a mesh tends to be one large mesh, all referencing the same texture page. You might break it into two or three – skin being a different mesh / material grouping from clothing – but unless your game has a large runtime dynamic system which generates NPC’s or player characters from lots of smaller pieces, they tend to be 1-3 meshes.

However because of that, if you have a mesh that has a skeleton of 150 bones in it (which is very possible, particularly with articulated faces) you don’t know per vertex which bones that vertex might need; effectively you have to feed all 150 bones into the vertex shader running on that mesh. Which is too many – vertex shaders can’t take that many (that’s why I said there tended to be practical limits on how many bones a given vertex will reference). If any given vertex can reference any given bone (and all bones WILL get referenced by some part of the mesh vertex array) you MUST feed them all in.

The only way around this is to pre-process the mesh and break it up into smaller groups – where group A is known to reference bones 1-16 and you can feed 16 bones into a vertex shader easily. So for each mesh submission you feed in different groups of bones into the vertex shader (and often you’ll submit the same bone several times for different mesh arrays).
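A simple greedy version of that pre-process, as a sketch: walk the triangles, and put each one into the first group whose bone set can absorb the bones that triangle needs without going over the limit. The data layout and names are assumptions, and a production tool would try harder to minimize the number of groups. The example uses a limit of 4 bones to stay readable; the text's figure would be 16.

```python
def partition_triangles(triangle_bones, max_bones):
    """Group triangles so that each group references at most max_bones bones.

    triangle_bones[i] is the set of bones used by any vertex of triangle i.
    """
    groups = []
    for tri_index, bones in enumerate(triangle_bones):
        bones = set(bones)
        if len(bones) > max_bones:
            raise ValueError(f"triangle {tri_index} alone needs {len(bones)} bones")
        for group in groups:
            if len(group["bones"] | bones) <= max_bones:
                group["bones"] |= bones
                group["triangles"].append(tri_index)
                break
        else:
            # No existing group had room, so start a new one (one more render call).
            groups.append({"bones": bones, "triangles": [tri_index]})
    return groups

triangle_bones = [{0, 1}, {1, 2}, {2, 3}, {4, 5}, {5, 6, 7}, {0, 3}]
groups = partition_triangles(triangle_bones, max_bones=4)
# Two groups come out: one using bones 0-3, one using bones 4-7.
```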

This breaks a render into several render calls, which doesn’t sound like much, but given you’ve effectively got about 2000 render calls per frame at 60fps, can make a difference. If you’ve got 20 people on screen, but now you’ve broken the mesh into 5 pieces you’ve just gone from 20 render calls to 100. That *does* make a difference.

Generally you don’t want more than 10 mesh render calls per mesh – and in my experience, those meshes that are about 2000 – 3000 polys with a skeleton of about 100 odd bones tend to come out around that – 7-10 mesh groups.

Other stuff.

Blending frames – generally, my experience has been to blend from one animation to the next over about 200-250ms.

You can do some clever lod systems for animations based on distance – in Ghoul2, we did a camera to object distance check and actually halved the rate at which we animated an object based on how far away it was. We just kept the skeleton / mesh around from the last frame and re-used that every other frame. We probably could have gotten away with more if I’d really pushed it.
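The idea can be sketched in a few lines. This is not the Ghoul2 code, just an illustration with made-up names and a single distance threshold: far objects only re-evaluate their pose on even frames and otherwise hand back the cached one.

```python
def update_interval(distance, half_rate_distance):
    """Animate every frame when close, every other frame when far away."""
    return 1 if distance < half_rate_distance else 2

class AnimatedObject:
    def __init__(self):
        self.cached_pose = None
        self.updates = 0

    def tick(self, frame_number, distance, half_rate_distance, evaluate_pose):
        interval = update_interval(distance, half_rate_distance)
        if self.cached_pose is None or frame_number % interval == 0:
            self.cached_pose = evaluate_pose()  # the expensive skeleton / mesh work
            self.updates += 1
        return self.cached_pose  # otherwise re-use last frame's result

def fake_evaluate():
    return "pose"  # stand-in for the real skeleton generation

near, far = AnimatedObject(), AnimatedObject()
for frame in range(10):
    near.tick(frame, distance=5.0, half_rate_distance=50.0, evaluate_pose=fake_evaluate)
    far.tick(frame, distance=80.0, half_rate_distance=50.0, evaluate_pose=fake_evaluate)
# Over 10 frames the near object animates 10 times and the far one 5 times.
```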

Other stuff you can do is just not animate all extremities – mark certain bones as “do not animate in low lod situations” – we did experiment with having different skeletons / meshes for low lod situations, and that really does work, however you can get cheap purely skeletal animation returns by simply not animating the fingers of a hand, for example. As long as you keep the last frame’s values around (or even just use the default frame’s bone settings) you are golden. Realistically though, unless you are doing a hell of a lot of blending, this doesn’t buy you that much, because skeleton generation isn’t that expensive – it pales next to vertex transformation for example.

Also, another thing that most animation systems have is an event track. This is a timed set of events that can occur over the length of the animation. So, for example, when a foot fully hits the floor, a “foot fall” event is generated. This could fire off a sound, or make a gun fire, or whatever.

What’s interesting is that the event track is actually processed separately from normal animations, because you may be doing clever shit like LODing the animations so not every frame gets an update – however you DO need to process the event track every frame because game events may be happening off the animation (e.g. a weapon firing, or a punch connecting with an NPC).
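A minimal sketch of per-frame event processing, assuming events are stored as sorted (time_ms, name) pairs and the animation loops. The key detail is that you fire everything between the previous tick's animation time and this tick's, so nothing is skipped on a long frame or when the animation wraps. Names and times are made up.

```python
def events_between(events, prev_ms, now_ms):
    """Names of events with prev_ms < time <= now_ms, handling a loop wrap.

    If now_ms is less than prev_ms, the animation is assumed to have looped once.
    """
    if now_ms >= prev_ms:
        return [name for t, name in events if prev_ms < t <= now_ms]
    # Wrapped: fire the tail of the old cycle, then the head of the new one.
    tail = [name for t, name in events if t > prev_ms]
    head = [name for t, name in events if t <= now_ms]
    return tail + head

# Hypothetical 600ms walk cycle with two foot fall events.
walk_events = [(100.0, "foot_fall_left"), (400.0, "foot_fall_right")]

mid_cycle = events_between(walk_events, 350.0, 450.0)  # crosses the 400ms event
wrapped = events_between(walk_events, 550.0, 120.0)    # looped past the 100ms event
quiet = events_between(walk_events, 150.0, 200.0)      # nothing in this window
```

An event sitting exactly at time 0 needs a little care with this scheme on the very first tick (start the previous time just below zero, for instance).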

Note though, event tracks can get a little tricky to handle. If you end up blending two animations together, but you have an event that’s right at the front of the second animation being blended (like fire weapon), what happens? I mean, you are still blending – you have no idea where the weapon is actually pointed at that point. It can get really tricky to know whether to actually fire an event or not based on blend state.

Similarly, the problem of internal motion within an animation can be handled with a per-frame motion track.

The problem is this. If you have a motion captured walk, the per frame motion in it is variable per frame. So, either you retain the motion in the animation itself (ie each frame moves further and further from the 0,0,0 starting point of the animation), which means you aren’t actively moving the object in the world per frame of animation, just moving them at the end of the animation to the new position in the world because the animation is looping OR you try and massage the animation so the motion forward is constant, so you can be moving your game object forward in the world at a constant and game controlled rate.

This is what most people do and where foot sliding comes from.

There is a better solution. Your pre-processor needs to look at the range of animation motion for the root bone of the object over the life of the animation (e.g. a complete walk cycle. It starts at 0,0,0 and ends at 100,0,0). From that you know the delta of the total animation (in this case, 100 in x).
Divide that by the number of frames (in this case 10) to get the average per-frame delta (here, 10 in x), then subtract frame number * per-frame delta from the position of the root bone on each frame.

This moves the animation roughly over 0,0,0 for every frame, offset by whatever little difference there is frame to frame to keep the ‘real’ motion in it.

Now you know what forward motion the average frame has, which your animation can then inform the game of. The game is moving the object through the world at the speed it was always intended to move at, and there’s no foot sliding. As a side effect, you can now increase or decrease that speed and then simply multiply the animation playback speed by the same value, and voilà, you’ve made the animation walk faster but again with no foot sliding. It’s really very simple.
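Here is the walk cycle example worked through as a sketch. It assumes the root bone's x position is sampled at both ends of the cycle (11 samples for 10 frame steps) and a 30hz key frame rate; the sample values and names are invented for illustration.

```python
def extract_root_motion(root_x):
    """Split root bone motion into an average per-frame delta and a residual.

    Returns (per_frame_delta, residual), where residual[i] is what stays
    in the animation for frame i once the steady forward motion is removed.
    """
    steps = len(root_x) - 1
    per_frame = (root_x[-1] - root_x[0]) / steps
    residual = [x - root_x[0] - i * per_frame for i, x in enumerate(root_x)]
    return per_frame, residual

def playback_scale(desired_speed, per_frame, rate_hz):
    """How much to speed up the animation so the feet match a new game speed."""
    natural_speed = per_frame * rate_hz  # units per second at normal playback
    return desired_speed / natural_speed

# Mocap-style walk: starts at 0, ends at 100, but unevenly paced in between.
root_x = [0.0, 8.0, 19.0, 31.0, 40.0, 50.0, 58.0, 69.0, 81.0, 90.0, 100.0]
per_frame, residual = extract_root_motion(root_x)
# per_frame is 10.0; residual hovers around 0 and keeps the 'real' wobble.

scale = playback_scale(450.0, per_frame, rate_hz=30.0)
# Natural speed is 300 units/s, so moving at 450 units/s means playing at 1.5x.
```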

It doesn’t work well with jumps, or anything that moves a large amount in the animation but returns to the starting position at the end of it. The way you sort that out is by breaking the animation in two, at the point of most extreme motion.

There’s a LOAD more about animation systems you can do, but that’s enough for now I think.

