Game Engine Gems 2
Edited by Eric Lengyel
A K Peters, Ltd.
Natick, Massachusetts
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but
the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to
trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained.
If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical,
or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without
written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright
Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a
variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to
infringe.
Preface xv
Chapter 22 GPGPU Cloth Simulation Using GLSL, OpenCL, and CUDA 365
Marco Fratarcangeli
22.1 Introduction 365
22.2 Numerical Algorithm 366
22.3 Collision Handling 368
22.4 CPU Implementation 369
22.5 GPU Implementations 371
31.3 A First Approach: Using Win32 Semaphores and Critical Sections 477
31.4 A Second Approach: Lock-Free Algorithms 482
31.5 Processor Architecture Overview and Memory Models 483
31.6 Lock-Free Algorithm Design 486
31.7 Lock-Free Implementation of a Free List 487
31.8 Lock-Free Implementation of a Queue 491
31.9 Interprocess Communication 496
Index 509
Preface
The word gem has been coined in the fields of computer graphics and game
development as a term for describing a short article that focuses on a particular
technique, a clever trick, or practical advice that a person working in these fields
would find interesting and useful. Several book series containing the word
“Gems” in their titles have appeared since the early 1990s, and we continued the
tradition by establishing the Game Engine Gems series in 2010.
This book is the second volume of the Game Engine Gems series, and it
comprises a collection of new game engine development techniques. A group of
29 experienced professionals, several of whom also contributed to the first
volume, have written down portions of their knowledge and wisdom in the form
of the 31 chapters that follow.
The topics covered in these pages vary widely within the subject of game
engine development and have been divided into the three broad categories of
graphics and rendering, game engine design, and systems programming. The first
part of the book presents a variety of rendering techniques and dedicates four
entire chapters to the increasingly popular topic of stereoscopic rendering. The
second part contains several chapters that discuss topics relating to the design of
large components of a game engine. The final part of the book presents several
gems concerning topics of a “low-level” nature for those who like to work with
the nitty-gritty of engine internals.
Audience
The intended audience for this book includes professional game developers,
students of computer science programs, and practically anyone possessing an
interest in how the pros tackle specific problems that arise during game engine development.
The Website
The official website for the Game Engine Gems series can be found at the
following address:
http://www.gameenginegems.net/
Supplementary materials for many of the gems in this book are posted on this
website, and they include demos, source code, examples, specifications, and
larger versions of some figures. For chapters that include project files, the source
code can be compiled using Microsoft Visual Studio.
Any corrections to the text that may arise will be posted on the website. This
is also the location at which proposals will be accepted for the next volume in the
Game Engine Gems series.
Acknowledgements
Many thanks are due to A K Peters for quickly assuming ownership of the Game
Engine Gems series after it had lost its home well into the period during which
contributing authors had been writing their chapters. Of course, thanks also go to
these contributors, who took the transition in stride and produced a great set of
gems for this volume. They all worked hard during the editing process so that we
would be able to stay on the original schedule.
Part I
Graphics and Rendering

1
Fast Computation of Tight‐Fitting
Oriented Bounding Boxes
Thomas Larsson
Linus Källberg
Mälardalen University, Sweden
1.1 Introduction
Bounding shapes, or containers, are frequently used to speed up algorithms in
games, computer graphics, and visualization [Ericson 2005]. In particular, the
oriented bounding box (OBB) is an excellent convex enclosing shape since it
provides good approximations of a wide range of geometric objects [Gottschalk
2000]. Furthermore, the OBB has reasonable transformation and storage costs,
and several efficient operations have been presented such as OBB-OBB
[Gottschalk et al. 1996], sphere-OBB [Larsson et al. 2007], ellipsoid-OBB [Lars-
son 2008], and ray-OBB [Ericson 2005] intersection tests. Therefore, OBBs can
potentially speed up operations such as collision detection, path planning, frus-
tum culling, occlusion culling, ray tracing, radiosity, photon mapping, and other
spatial queries.
To leverage the full power of OBBs, however, fast construction methods are
needed. Unfortunately, the exact minimum volume OBB computation algorithm
given by O’Rourke [1985] has O(n³) running time. Therefore, more practical
methods have been presented, for example techniques for computing a (1 + ε)-
approximation of the minimum volume box [Barequet and Har-Peled 1999]. An-
other widely adopted technique is to compute OBBs by using principal compo-
nent analysis (PCA) [Gottschalk 2000]. The PCA algorithm runs in linear time,
but unfortunately may produce quite loose-fitting boxes [Dimitrov et al. 2009].
By initially computing the convex hull, better results are expected since this
keeps internal features of the model from affecting the resulting OBB orientation.
However, this makes the method superlinear.
The goal of this chapter is to present an alternative algorithm with a simple
implementation that runs in linear time and produces OBBs of high quality. It is
immediately applicable to point clouds, polygon meshes, or polygon soups, with-
out any need for an initial convex hull generation. This makes the algorithm fast
and generally applicable for many types of models used in computer graphics
applications.
1.2 Algorithm
The algorithm is based on processing a small constant number of extremal verti-
ces selected from the input models. The selected points are then used to construct
a representative simple shape, which we refer to as the ditetrahedron, from which
a suitable orientation of the box can be derived efficiently. Hence, our heuristic is
called the ditetrahedron OBB algorithm, or DiTO for short. Since the chosen
number of selected extremal vertices affects the running time of the algorithm as
well as the resulting OBB quality, different instances of the algorithm are called
DiTO-k, where k is the number of selected vertices.
The ditetrahedron consists of two irregular tetrahedra connected along a
shared interior side called the base triangle. Thus, it is a polyhedron having six
faces, five vertices, and nine edges. In total, counting also the interior base trian-
gle, there are seven triangles. Note that this shape is not to be confused with the
triangular dipyramid (or bipyramid), which can be regarded as two pyramids with
equal heights and a shared base.
For most input meshes, it is expected that at least one of the seven triangles
of the ditetrahedron will be characteristic of the orientation of a tight-fitting
OBB. Let us consider two simple example meshes—a randomly rotated cube
with 8 vertices and 12 triangles and a randomly rotated star shape with 10 verti-
ces and 16 triangles. For these two shapes, the DiTO algorithm finds the mini-
mum volume OBBs. Ironically, the PCA algorithm computes an excessively
large OBB for the canonical cube example, with a volume approximately two to
four times larger than the minimum volume, depending on the orientation of the
cube mesh. Similarly, it also computes a loose-fitting OBB for the star shape,
with a volume approximately 1.1 to 2.2 times larger than the optimum, depend-
ing on the given orientation of the mesh. In Figure 1.1, these two models are
shown together with their axis-aligned bounding box (AABB), OBB computed
using PCA, and OBB computed using DiTO for a random orientation of the
models.
Figure 1.1. Computed boxes for a simple cube mesh (12 triangles) and star mesh (16
triangles). The first column shows the AABB, the second column shows the OBB com-
puted by PCA, and the last column shows the OBB computed by DiTO. The meshes were
randomly rotated before the computation.
To keep the extremal-point computations cheap to evaluate with simple dot products, normals with many 0s and 1s may be preferable, given that they
sample the direction space in a reasonable manner.
Clearly, the DiTO-k algorithm relies on the choice of an appropriate normal
set N_s, and simply by choosing a different normal set a new instance of DiTO-k
is created. In the experiments described later, five normal sets are used, yielding
five algorithm instances. The normal sets are listed in Table 1.1. The normals in
N₆, used in DiTO-12, are obtained from the vertices of a regular icosahedron
with the mirror vertices removed. Similarly, the normals in N₁₀, used in DiTO-20,
are taken from the vertices of a regular dodecahedron. The normal set
N₁₆ = N₆ ∪ N₁₀ is used in DiTO-32. The normals in N₇ and N₁₃, used in DiTO-14
and DiTO-26, are not uniformly distributed, but they are still usually regarded as
good choices for computing k-DOPs [Ericson 2005]. Therefore, they are also ex-
pected to work well in this case.
  N₆             N₁₀                N₇             N₁₃
  (0, 1, a)      (0, a, 1+a)        (1, 0, 0)      (1, 0, 0)
  (0, 1, −a)     (0, a, −(1+a))     (0, 1, 0)      (0, 1, 0)
  (1, a, 0)      (a, 1+a, 0)        (0, 0, 1)      (0, 0, 1)
  (1, −a, 0)     (a, −(1+a), 0)     (1, 1, 1)      (1, 1, 1)
  (a, 0, 1)      (1+a, 0, a)        (1, 1, −1)     (1, 1, −1)
  (a, 0, −1)     (1+a, 0, −a)       (1, −1, 1)     (1, −1, 1)
                 (1, 1, 1)          (1, −1, −1)    (1, −1, −1)
                 (1, 1, −1)                        (1, 1, 0)
                 (1, −1, 1)                        (1, −1, 0)
                 (1, −1, −1)                       (1, 0, 1)
                                                   (1, 0, −1)
                                                   (0, 1, 1)
                                                   (0, 1, −1)

Table 1.1. Efficient normal sets N₆, N₁₀, N₇, and N₁₃ used for DiTO-12, DiTO-20, DiTO-14, and DiTO-26, respectively, with the value a = (√5 − 1)/2 ≈ 0.61803399. The normals in N₆ and N₁₀ are uniformly distributed.
The first two vertices of the base triangle are chosen as the pair of extremal points a_j and b_j that maximizes the distance ‖a_j − b_j‖ over the sampled directions. Call these points p₀ and p₁. Then a third point p₂ is selected from S that lies furthest away from the infinite line through p₀ and p₁. An example of a constructed large base triangle is shown on the left in Figure 1.2.
The base triangle is then used to generate three different candidate orienta-
tions, one for each edge of the triangle. Let n be the normal of the triangle, and
e₀ = p₁ − p₀ be the first edge. The axes are then chosen as

$$u_0 = \frac{e_0}{\lVert e_0 \rVert}, \qquad u_1 = \frac{n}{\lVert n \rVert}, \qquad u_2 = u_0 \times u_1.$$
The axes are chosen similarly for the other two edges of the triangle. For each
computed set of axes, an approximation of the size of the resulting OBB is com-
puted by projecting the points in S on the axes, and the best axes found are kept.
In Figure 1.3, an example of the three considered OBBs for the base triangle is
shown.
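To make this concrete, the following is a minimal C/C++ sketch, not the chapter's accompanying source code, of deriving one candidate axis set from a base-triangle edge and scoring it by projecting the extremal point set S onto the axes. The vector helpers (Vec3, Dot, Cross, Sub, Normalize) and the use of box surface area as the size measure are assumptions.

#include <cfloat>
#include <cmath>

struct Vec3 { float x, y, z; };

static float Dot(const Vec3 &a, const Vec3 &b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 Sub(const Vec3 &a, const Vec3 &b) { Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z }; return r; }
static Vec3 Cross(const Vec3 &a, const Vec3 &b)
{
    Vec3 r = { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
    return r;
}
static Vec3 Normalize(const Vec3 &v)
{
    float len = sqrtf(Dot(v, v));
    Vec3 r = { v.x / len, v.y / len, v.z / len };
    return r;
}

// Surface area of the box obtained by projecting the points in S onto u0, u1, u2.
static float ProjectedBoxArea(const Vec3 *S, int count,
                              const Vec3 &u0, const Vec3 &u1, const Vec3 &u2)
{
    const Vec3 axes[3] = { u0, u1, u2 };
    float minP[3] = { FLT_MAX, FLT_MAX, FLT_MAX };
    float maxP[3] = { -FLT_MAX, -FLT_MAX, -FLT_MAX };

    for (int i = 0; i < count; ++i)
        for (int j = 0; j < 3; ++j)
        {
            float d = Dot(S[i], axes[j]);
            if (d < minP[j]) minP[j] = d;
            if (d > maxP[j]) maxP[j] = d;
        }

    float e0 = maxP[0] - minP[0], e1 = maxP[1] - minP[1], e2 = maxP[2] - minP[2];
    return 2.0F * (e0 * e1 + e0 * e2 + e1 * e2);
}

// One candidate orientation derived from the triangle edge (p0, p1) and normal n.
// Returns the score of the candidate and writes the axes to u0, u1, u2.
static float CandidateFromEdge(const Vec3 *S, int count,
                               const Vec3 &p0, const Vec3 &p1, const Vec3 &n,
                               Vec3 *u0, Vec3 *u1, Vec3 *u2)
{
    *u0 = Normalize(Sub(p1, p0));
    *u1 = Normalize(n);
    *u2 = Cross(*u0, *u1);
    return ProjectedBoxArea(S, count, *u0, *u1, *u2);
}

Calling CandidateFromEdge once per edge of the triangle and keeping the axes with the smallest score mirrors the selection described above.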
Next, the algorithm proceeds by constructing the ditetrahedron, which con-
sists of two connected tetrahedra sharing the large base triangle. For this, two
additional points q₀ and q₁ are computed by searching S for the points furthest
Figure 1.2. Illustration of how the large base triangle spanning extremal points (left) is
extended to tetrahedra in two directions by finding the most distant extremal points below
and above the triangle surface (right).
above and below the plane of the base triangle. An example of a ditetrahedron
constructed in this way is shown on the right in Figure 1.2. This effectively gen-
erates six new triangles, three top triangles located above the plane of the base
triangle and three bottom triangles located below the base triangle. For each one
of these triangles, candidate OBBs are generated in the same way as already de-
scribed above for the large base triangle, and the best axes found are kept.
After this, all that remains is to define the final OBB appropriately. A final
pass through all n vertices in P determines the true size of the OBB, that is, the
smallest projection values s_u, s_v, and s_w, as well as the largest projection values l_u,
Figure 1.3. The three different candidate orientations generated from the normal and
edges of the large base triangle. In each case, the box is generated from the edge drawn
with a solid line.
l_v, and l_w of P along the determined axes u, v, and w. The final OBB parameters besides the best axes found are then given by

$$h_u = \frac{l_u - s_u}{2}, \qquad h_v = \frac{l_v - s_v}{2}, \qquad h_w = \frac{l_w - s_w}{2},$$

$$m = \frac{l_u + s_u}{2}\,u + \frac{l_v + s_v}{2}\,v + \frac{l_w + s_w}{2}\,w.$$
The parameters h_u, h_v, and h_w are the half-extents, and m is the midpoint of the box. Note that m needs to be computed in the standard basis, rather than as the midpoint in the box's own basis.
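As an illustration, a sketch of this final pass over all n vertices might look as follows; it reuses the Vec3 and Dot helpers assumed in the earlier sketch and is not taken from the chapter's source code.

#include <cfloat>

struct Obb
{
    Vec3  mid;              // midpoint m, expressed in the standard basis
    Vec3  u, v, w;          // best axes found
    float hu, hv, hw;       // half-extents
};

static Obb FinalizeObb(const Vec3 *P, int n, const Vec3 &u, const Vec3 &v, const Vec3 &w)
{
    float su = FLT_MAX, sv = FLT_MAX, sw = FLT_MAX;     // smallest projections
    float lu = -FLT_MAX, lv = -FLT_MAX, lw = -FLT_MAX;  // largest projections

    for (int i = 0; i < n; ++i)
    {
        float du = Dot(P[i], u), dv = Dot(P[i], v), dw = Dot(P[i], w);
        if (du < su) su = du;
        if (du > lu) lu = du;
        if (dv < sv) sv = dv;
        if (dv > lv) lv = dv;
        if (dw < sw) sw = dw;
        if (dw > lw) lw = dw;
    }

    Obb box;
    box.u = u;
    box.v = v;
    box.w = w;
    box.hu = 0.5F * (lu - su);
    box.hv = 0.5F * (lv - sv);
    box.hw = 0.5F * (lw - sw);

    // m = ((lu + su)/2) u + ((lv + sv)/2) v + ((lw + sw)/2) w, i.e., converted back
    // to the standard basis rather than left as box-space coordinates.
    float cu = 0.5F * (lu + su), cv = 0.5F * (lv + sv), cw = 0.5F * (lw + sw);
    box.mid.x = cu * u.x + cv * v.x + cw * w.x;
    box.mid.y = cu * u.y + cv * v.y + cw * w.y;
    box.mid.z = cu * u.z + cv * v.z + cw * w.z;
    return box;
}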
A final check is also made to make sure that the OBB is still smaller than the
initially computed AABB; otherwise, the OBB is aligned with the AABB in-
stead. This may happen in some cases, since the final iteration over all n points in
P usually grows the OBB slightly compared to the best-found candidate OBB,
whose size only depends on the subset S.
This completes our basic presentation of the DiTO algorithm. Example
source code in C/C++ for the DiTO-14 algorithm is available, which shows how
to efficiently implement the algorithm and how the low-level functions work.
A degenerate case can occur if the point q₀ or q₁ lies (almost) in the plane of the already-found base triangle. When this happens, the arising triangles of the
degenerate tetrahedron are simply ignored by the algorithm; that is, they are not
used in the search for better OBB axes.
1.3 Evaluation
To evaluate the DiTO algorithm, we compared it to three other methods referred
to here as AABB, PCA, and brute force (BF). The AABB method simply com-
putes an axis-aligned bounding box, which is then used as an OBB. While this
method is expected to be extremely fast, it also produces OBBs of poor quality in
general.
The PCA method was first used to compute OBBs by Gottschalk et al.
[1996]. It works by first creating a representation of the input model’s shape in
the form of a covariance matrix. High-quality OBB axes are then assumed to be
given by the eigenvectors of this matrix. As an implementation of the PCA meth-
od, we used code from Gottschalk et al.’s RAPID source package. This code
works on the triangles of the model and so has linear complexity in the input size.
The naive BF method systematically tests 90 × 90 × 90 different orientations
by incrementing Euler angles one degree at a time using a triple-nested loop. This
method is of course extremely slow, but in general it is expected to create OBBs
of high quality, which is useful for comparison to the other algorithms. To avoid
having the performance of BF break down completely, only 26 extremal points
are used in the iterations. This subset is selected initially in the same way as in
the DiTO algorithm.
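For reference, a hedged sketch of such a brute-force loop is shown below. The Euler-angle convention (rows of Rz·Ry·Rx used as candidate axes) and helper names are assumptions, and ProjectedBoxArea is the scoring helper sketched earlier.

#include <cfloat>
#include <cmath>

// Build three orthonormal candidate axes from Euler angles given in degrees.
static void RotationFromEulerDegrees(float ax, float ay, float az, Vec3 axes[3])
{
    const float d2r = 3.14159265F / 180.0F;
    float cx = cosf(ax * d2r), sx = sinf(ax * d2r);
    float cy = cosf(ay * d2r), sy = sinf(ay * d2r);
    float cz = cosf(az * d2r), sz = sinf(az * d2r);

    axes[0] = Vec3{ cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx };
    axes[1] = Vec3{ sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx };
    axes[2] = Vec3{ -sy, cy * sx, cy * cx };
}

// Test 90 x 90 x 90 orientations on the pre-selected extremal points and
// return the smallest projected box area found.
static float BruteForceBestArea(const Vec3 *extremal, int count)
{
    float best = FLT_MAX;
    for (int x = 0; x < 90; ++x)
        for (int y = 0; y < 90; ++y)
            for (int z = 0; z < 90; ++z)
            {
                Vec3 axes[3];
                RotationFromEulerDegrees((float) x, (float) y, (float) z, axes);
                float area = ProjectedBoxArea(extremal, count, axes[0], axes[1], axes[2]);
                if (area < best) best = area;
            }
    return best;
}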
All the algorithms were implemented in C/C++. The source code was com-
piled using Microsoft Visual Studio 2008 Professional Edition and run single-
threaded using a laptop with an Intel Core2 Duo T9600 2.80 GHz processor and
4 GB RAM. The input data sets were triangle meshes with varying shapes and
varying geometric complexity. The vertex and triangle counts of these meshes
are summarized in Table 1.2. Screenshots of the triangle meshes are shown in
Figure 1.4.
To gather statistics about the quality of the computed boxes, each algorithm
computes an OBB for 100 randomly generated rotations of the input meshes. We
then report the average surface area A_avg, the minimum and maximum surface
areas A_min and A_max, as well as the average execution time t_avg in milliseconds. The
results are given in Table 1.3.
The BF algorithm is very slow, but it computes high-quality OBBs on aver-
age. The running times lie around one second for each mesh since the triple-
nested loop acts like a huge hidden constant factor. The quality of the boxes varies slightly due to the testing of somewhat unevenly distributed orientations arising from the incremental stepping of the Euler angles. The quality of boxes computed by the other algorithms, however, can be measured quite well by comparing them to the sizes of the boxes computed by the BF method.

The DiTO algorithm is very competitive. For example, it runs significantly faster than the PCA algorithm, although both methods are fast linear algorithms. The big performance difference is mainly due to the fact that the PCA method needs to iterate over the polygon data instead of iterating over the list of unique vertices. For connected triangle meshes, the number of triangles is roughly twice the number of vertices, and each triangle has three vertices. Therefore, the total size of the vertex data that the PCA method processes is roughly six times as large as the corresponding data for the DiTO algorithm.

Figure 1.4. Visualizations of the triangle meshes used for evaluation of the algorithms.
Pencil
Method      A_avg    A_min    A_max    t_avg
AABB        1.4155   0.2725   2.1331   0.02
PCA         0.2359   0.2286   0.2414   0.51
BF          0.2302   0.2031   0.2696   1009
DiTO-12     0.2316   0.1995   0.2692   0.11
DiTO-14     0.2344   0.1995   0.2707   0.09
DiTO-20     0.2306   0.1995   0.2708   0.14
DiTO-26     0.2331   0.1995   0.2707   0.15
DiTO-32     0.2229   0.1995   0.2744   0.20

Chair
Method      A_avg    A_min    A_max    t_avg
AABB        7.1318   4.6044   8.6380   0.07
PCA         4.7149   4.7139   4.7179   1.87
BF          3.6931   3.6106   4.6579   1047
DiTO-12     3.8261   3.6119   4.1786   0.35
DiTO-14     3.8094   3.6129   4.2141   0.25
DiTO-20     3.8213   3.6164   3.9648   0.37
DiTO-26     3.8782   3.6232   4.0355   0.35
DiTO-32     3.8741   3.6227   3.9294   0.49

Teddy
Method      A_avg    A_min    A_max    t_avg
AABB        3.9655   3.5438   4.3102   0.02
PCA         4.0546   4.0546   4.0546   0.60
BF          3.3893   3.3250   3.5945   1043
DiTO-12     3.7711   3.5438   4.0198   0.14
DiTO-14     3.7203   3.5438   3.9577   0.12
DiTO-20     3.7040   3.5438   3.8554   0.19
DiTO-26     3.7193   3.5438   3.8807   0.16
DiTO-32     3.7099   3.5438   3.8330   0.22

Bunny
Method      A_avg    A_min    A_max    t_avg
AABB        5.7259   4.7230   6.4833   0.19
PCA         5.2541   5.2540   5.2541   8.76
BF          4.6934   4.5324   4.9091   1041
DiTO-12     4.9403   4.5635   5.7922   1.13
DiTO-14     4.9172   4.5810   5.6695   0.98
DiTO-20     4.8510   4.5837   5.5334   1.55
DiTO-26     4.7590   4.5810   5.3967   1.42
DiTO-32     4.7277   4.6552   5.1037   2.04

Frog
Method      A_avg    A_min    A_max    t_avg
AABB        4.6888   3.0713   5.7148   0.07
PCA         2.6782   2.6782   2.6782   1.14
BF          2.7642   2.6582   3.5491   1037
DiTO-12     2.7882   2.6652   3.0052   0.28
DiTO-14     2.7754   2.6563   2.9933   0.24
DiTO-20     2.7542   2.6602   2.9635   0.40
DiTO-26     2.7929   2.6579   3.0009   0.36
DiTO-32     2.7685   2.6538   2.9823   0.44

Hand
Method      A_avg    A_min    A_max    t_avg
AABB        2.8848   2.4002   3.2693   1.98
PCA         2.5066   2.5062   2.5069   86.6
BF          2.3071   2.2684   2.4531   1067
DiTO-12     2.3722   2.2946   2.5499   11.8
DiTO-14     2.3741   2.2914   2.5476   10.0
DiTO-20     2.3494   2.2805   2.4978   15.5
DiTO-26     2.3499   2.2825   2.5483   14.5
DiTO-32     2.3372   2.2963   2.4281   20.6

Table 1.3. The average, minimum, and maximum area as well as the average execution time in ms over 100 random orientations of the input meshes.
The DiTO algorithm also produces oriented boxes of relatively high quality.
For all meshes except Frog, the DiTO algorithm computes OBBs with smaller
surface areas than the PCA method does. For some of the models, the difference
is significant, and for the Teddy model, the PCA method computes boxes that are
actually looser fitting than the naive AABB method does. The DiTO algorithm,
however, is in general more sensitive than the PCA method to the orientation of
the input meshes as can be seen in the minimum and maximum area columns.
Among the included DiTO instances, there seems to be a small quality im-
provement for increasing k values for some of the models. DiTO-32 seems to
compute the best boxes in general. The quality difference of the computed boxes,
however, is quite small in most cases. Therefore, since DiTO-14 is approximately
twice as fast as DiTO-32, it is probably the preferable choice when speed is a
prioritized factor.
Figure 1.5. Levels 0, 6, 9, and 12 of OBB hierarchies built using AABBs (leftmost col-
umn), PCA (middle column), and DiTO-20 (rightmost column). As can be seen in the
magnified pictures in the bottom row, PCA and DiTO both produce OBBs properly
aligned with the curvature of the model, but the boxes produced by PCA have poor mutu-
al orientations with much overlap between neighboring boxes.
The algorithm used for building the tree structure is a top-down algorithm,
where the set of primitives is recursively partitioned into two subsets until there
is only one primitive left, which is then stored in a leaf node. Before partitioning
the primitives in each step, the selected OBB fitting procedure is called to create
an OBB to store in the node. This means that the procedure is called once for
every node of the tree. To partition the primitives under a node, we use a strategy
that tries to minimize the tree’s total surface area, similar to that used by Wald et
al. [2007] for building AABB hierarchies.
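The following sketch outlines the overall recursion. For brevity it splits at the object median along the longest axis of the node's OBB rather than implementing the surface-area-minimizing partitioning described above, and the FitObb and PrimitiveCentroidAlong callbacks are assumed to be supplied by the engine.

#include <algorithm>
#include <vector>

struct Node
{
    Obb   box;                 // OBB fitted by the selected fitting procedure
    Node *left  = nullptr;
    Node *right = nullptr;
    int   primitive = -1;      // valid for leaf nodes only
};

Node *BuildTree(std::vector<int> &prims,
                Obb (*FitObb)(const std::vector<int> &),
                float (*PrimitiveCentroidAlong)(int prim, const Obb &box, int axis))
{
    Node *node = new Node;
    node->box = FitObb(prims);              // the fitting procedure is called once per node

    if (prims.size() == 1)                  // one primitive left: store it in a leaf
    {
        node->primitive = prims[0];
        return node;
    }

    // Pick the longest axis of the node's OBB and split at the median centroid.
    float h[3] = { node->box.hu, node->box.hv, node->box.hw };
    int axis = (h[0] > h[1] && h[0] > h[2]) ? 0 : (h[1] > h[2] ? 1 : 2);

    std::sort(prims.begin(), prims.end(), [&](int a, int b)
    {
        return PrimitiveCentroidAlong(a, node->box, axis) <
               PrimitiveCentroidAlong(b, node->box, axis);
    });

    std::vector<int> leftSet(prims.begin(), prims.begin() + prims.size() / 2);
    std::vector<int> rightSet(prims.begin() + prims.size() / 2, prims.end());

    node->left = BuildTree(leftSet, FitObb, PrimitiveCentroidAlong);
    node->right = BuildTree(rightSet, FitObb, PrimitiveCentroidAlong);
    return node;
}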
Table 1.4. Total surface areas of OBB hierarchies built using the different OBB fitting
algorithms.
As Table 1.4 shows, the DiTO algorithms create better trees than the other
two algorithms do. An implication of the increasing flatness on the lower hierar-
chy levels is that the first base triangle more accurately captures the spatial ex-
tents of the geometry, and that the two additional tetrahedra get small heights. It
is therefore likely that the chosen OBB most often is found from the base triangle
and not from the triangles in the tetrahedra. The PCA fitter frequently produces
poor OBBs, and in three cases (Chair, Bunny, and Hand), produces even worse
OBBs than the AABB method does. It is also never better than any of the DiTO
versions.
Interesting to note is that there is a weak correspondence between the number
of extremal directions used in DiTO and the tree quality. This can be partly ex-
plained by the fact that the directions included in a smaller set are not always in-
cluded in a larger set, which, for example, is the case in DiTO-14 versus
DiTO-12. This means that for some models, fewer directions happen to give bet-
ter results than a larger number of directions. Another part of the explanation is
that the tested OBBs are only fitted to the set of extracted extremal points S,
which means that good-quality OBBs might be missed because worse OBBs get
better surface areas on the selected extremal points. All this suggests that execu-
tion time can be improved by using fewer extremal directions (see Table 1.3),
while not much OBB quality can be gained by using more.
Note that the AABB hierarchies sometimes have somewhat undeservedly
good figures because the models were kept in their local coordinate systems dur-
ing the construction. This gives the AABB an advantage in, for example, Pencil
and Chair, where much of the geometry is axis-aligned.
Model     t       t_v     s
Pencil    0.09    0.035   2.6
Teddy     0.12    0.04    3.0
Frog      0.24    0.08    3.0
Chair     0.25    0.08    3.1
Bunny     0.98    0.21    4.7
Hand      10.0    1.99    5.0

Table 1.5. The execution time t for the scalar version of DiTO-14 versus t_v for the vectorized SSE version. All timings are in ms, and the speed-up factors s are listed in the last column.
Speed-up factors of more than four times are achieved for the most complex models, Bunny and Hand. To be as
efficient as possible for models with fewer input points, the remaining parts of
the algorithm have to be converted to SSE as well. This is particularly true when
building a bounding volume hierarchy, since most of the boxes in the hierarchy
only enclose small subsets of vertices.
A possible improvement would be to utilize the entire ditetrahedron shape, rather than its individual triangles, when deriving candidate
OBB axes. As it is now, we have only considered the triangles of this shape one
at a time. Furthermore, the construction of some simple shape other than the
ditetrahedron may be found to be more advantageous for determining the OBB
axes.
Finally, note that DiTO can be adapted to compute oriented bounding rectan-
gles in two dimensions. The conversion is straightforward. In this case, the large
base triangle simplifies to a base line segment, and the ditetrahedron simplifies to
a ditriangle (i.e., two triangles connected by the base line segment). There are
better algorithms available in two dimensions such as the rotating calipers meth-
od, which runs in O(n) time, but these methods require the convex hull of the
vertices to be present [Toussaint 1983].
References
[Barequet and Har-Peled 1999] Gill Barequet and Sariel Har-Peled. “Efficiently Approx-
imating the Minimum-Volume Bounding Box of a Point Set in Three Dimen-
sions.” SODA ’99: Proceedings of the Tenth Annual ACM-SIAM Symposium on
Discrete Algorithms, 1999, pp. 82–91.
[Dimitrov et al. 2009] Darko Dimitrov, Christian Knauer, Klaus Kriegel, and Günter
Rote. “Bounds on the Quality of the PCA Bounding Boxes.” Computational Ge-
ometry: Theory and Applications 42 (2009), pp. 772–789.
[Ericson 2005] Christer Ericson. Real-Time Collision Detection. San Francisco: Morgan
Kaufmann, 2005.
[Gottschalk et al. 1996] Stefan Gottschalk, Ming Lin, and Dinesh Manocha. “OBBTree:
A Hierarchical Structure for Rapid Interference Detection.” Proceedings of SIG-
GRAPH 1996, ACM Press / ACM SIGGRAPH, Computer Graphics Proceedings,
Annual Conference Series, ACM, pp. 171–180.
[Gottschalk 2000] Stefan Gottschalk. “Collision Queries using Oriented Bounding Box-
es.” PhD dissertation, University of North Carolina at Chapel Hill, 2000.
[Larsson et al. 2007] Thomas Larsson, Tomas Akenine-Möller, and Eric Lengyel. “On
Faster Sphere-Box Overlap Testing.” Journal of Graphics Tools 12:1 (2007), pp.
3–8.
[Larsson 2008] Thomas Larsson. “An Efficient Ellipsoid-OBB Intersection Test.” Jour-
nal of Graphics Tools 13:1 (2008), pp. 31–43.
[O’Rourke 1985] Joseph O’Rourke. “Finding Minimal Enclosing Boxes.” International
Journal of Computer and Information Sciences 14:3 (June 1985), pp. 183–199.
[Toussaint 1983] Godfried Toussaint. “Solving Geometric Problems with the Rotating
Calipers.” Proceedings of IEEE Mediterranean Electrotechnical Conference 1983,
pp. 1–4.
[Wald et al. 2007] Ingo Wald, Solomon Boulos, and Peter Shirley. “Ray Tracing De-
formable Scenes Using Dynamic Bounding Volume Hierarchies.” ACM Transac-
tions on Graphics 26:1 (2007).
2

Modeling, Lighting, and Rendering
Techniques for Volumetric Clouds

Frank Kane
Sundog Software, LLC

Pregenerated sky box textures aren't sufficient for games with varying times of day, or games where the camera may approach the clouds. Rendering clouds as real 3D objects, as shown in Figure 2.1, is a challenging task that many engines shy away from, but techniques exist to produce realistic results with good performance. This gem presents an overview of the procedural generation of cloud layers, the simulation of light transport within a cloud, and several volumetric rendering techniques.

Figure 2.1. Volumetric clouds at dusk rendered using splatting. (Image from the SilverLining SDK, courtesy of Sundog Software, LLC.)
■ VAPOR_BIT, indicates whether the cell contains enough water vapor to form a
cloud.
if (i + 1 < cellsAcross)
    phaseStates |= cells[i + 1][j][k]->states;
if (j + 1 < cellsDeep)
    phaseStates |= cells[i][j + 1][k]->states;
if (k + 1 < cellsHigh)
    phaseStates |= cells[i][j][k + 1]->states;
if (i - 1 >= 0)
    phaseStates |= cells[i - 1][j][k]->states;
if (j - 1 >= 0)
    phaseStates |= cells[i][j - 1][k]->states;
if (k - 1 >= 0)
    phaseStates |= cells[i][j][k - 1]->states;
if (i - 2 >= 0)
    phaseStates |= cells[i - 2][j][k]->states;
if (i + 2 < cellsAcross)
    phaseStates |= cells[i + 2][j][k]->states;
if (j - 2 >= 0)
    phaseStates |= cells[i][j - 2][k]->states;
if (j + 2 < cellsDeep)
    phaseStates |= cells[i][j + 2][k]->states;
if (k - 2 >= 0)
    phaseStates |= cells[i][j][k - 2]->states;
bool phaseActivation =
(phaseStates & PHASE_TRANSITION_BIT) != 0;
if (phaseTransition)
cells[i][j][k]->states |= PHASE_TRANSITION_BIT;
else
cells[i][j][k]->states &= ~PHASE_TRANSITION_BIT;
if (vapor)
cells[i][j][k]->states |= VAPOR_BIT;
else
cells[i][j][k]->states &= ~VAPOR_BIT;
if (hasCloud)
cells[i][j][k]->states |= HAS_CLOUD_BIT;
else
cells[i][j][k]->states &= ~HAS_CLOUD_BIT;
}
}
}
You may have noticed that the probabilities for spontaneous acquisition of
the vapor or phase transition states, as well as the probability for cloud extinc-
tion, are actually stored per cell rather than being applied globally. This is how
we enforce the formation of distinct clouds within the automaton; each cloud
within the layer is defined by an ellipsoid within the layer’s bounding volume.
Within each cloud’s bounding ellipsoid, the phase and vapor probabilities ap-
proach zero toward the edges and the extinction probability approaches zero to-
ward the center. Simply multiply the extinction probability by
$$\frac{x^2}{a^2} + \frac{y^2}{b^2} + \frac{z^2}{c^2},$$

where (x, y, z) is the position of the cell relative to the ellipsoid center, and (a, b, c) are the semiaxis lengths of the ellipsoid. Subtract this expression from
one to modulate the phase and vapor probabilities with distance from the ellip-
soid center. As an optimization, the cellular automaton may be limited to cells
contained by these ellipsoids, or each ellipsoid may be treated as independent
cellular automata to eliminate the storage and rendering overhead of cells that are
always empty. If you’re after cumulus clouds with flattened bottoms, using
hemiellipsoids as bounding volumes for the clouds instead of ellipsoids is also
more efficient.
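A minimal sketch of this per-cell modulation, with an assumed structure and field names rather than the chapter's actual code, might look like this:

typedef struct
{
    float extinctionProbability;
    float vaporProbability;
    float phaseTransitionProbability;
} CellProbabilities;

// Normalized squared distance from the ellipsoid center: 0 at the center,
// approaching 1 at the ellipsoid surface.
static float EllipsoidFalloff(float x, float y, float z, float a, float b, float c)
{
    return (x * x) / (a * a) + (y * y) / (b * b) + (z * z) / (c * c);
}

static void ModulateCellProbabilities(CellProbabilities *cell,
                                      float x, float y, float z,
                                      float a, float b, float c)
{
    float d = EllipsoidFalloff(x, y, z, a, b, c);

    // Extinction approaches zero toward the center of the cloud...
    cell->extinctionProbability *= d;

    // ...while the vapor and phase-transition probabilities approach zero
    // toward the edges of the ellipsoid.
    cell->vaporProbability *= 1.0f - d;
    cell->phaseTransitionProbability *= 1.0f - d;
}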
You may grow your simulated clouds by placing a few random phase transi-
tion seeds at the center of each ellipsoid and iterating over the cellular automaton
a few times. The resulting 3D array of cloud states may then be stored for render-
ing, or you may continue to iterate at runtime, smoothing the cloud states in the
time domain to produce real-time animations of cloud growth. In reality, howev-
er, clouds change their shape very slowly—their growth and extinction is gener-
ally only noticeable in time-lapse photography.
{
currentD -= epsilon;
if (currentD <= GetMinimumSize()) return (false);
currentN = 0;
targetN = (int) (((2.0 * GetDesiredArea() * epsilon * alpha *
alpha * alpha * GetDesiredCoverage()) / (PI * chi)) *
exp(-alpha * (currentD)));
}
currentN++;
return (true);
}
Light traveling through a cloud typically bounces off a single cloud droplet within a distance of roughly 100 meters. This means we may approxi-
mate light transport through a cloud by dividing it up into chunks of 100 meters
or less, computing how much light is scattered and absorbed by single scattering
within each chunk, and passing the resulting light as the incident light into the
next chunk. This technique is known as multiple forward scattering [Harris
2002]. It benefits from the fact that computing the effects of single scattering is a
well-understood and simple problem, while the higher orders of scattering may
only be solved using computationally expensive Monte Carlo ray tracing tech-
niques. These higher orders of scattering are increasingly diffuse, meaning we
can reasonably approximate the missing scattered light with a simple ambient
term.
As you might guess, the chunks of cloud we just described map well to the
voxels we generated from the cellular automaton above. Multiple forward scat-
tering computes the color and transparency of a given voxel by shooting a ray
from the light source toward the voxel and iteratively compositing the scattering
and extinction from each voxel we pass through. Essentially, we accumulate the
scattering and absorption of light on a voxel-by-voxel basis, producing darker
and more opaque voxels the deeper we go into the cloud.
To compute the transparency of a given voxel in isolation, we need to com-
pute its optical depth [Blinn 1982], given by
$$\tau = n \pi p^2 T.$$
Here, n is the number of water droplets per unit volume, p is the effective radius
of each droplet, and T is the thickness of the voxel. Physically realistic values of
p in cumulus clouds are around 0.75 μm, and n is around 400 droplets per cubic
centimeter [Bouthors 2008]. The extinction of light within this voxel is then
$$\alpha = 1 - e^{-\tau}.$$
This informs us as to the transparency of the voxel, but we still need to compute
its color due to forward scattering. The voxel’s color C is given by
$$C = \frac{a\,\tau\,L\,P(\cos\Theta)}{4\pi}.$$
Here, a is the albedo of the water droplets, which is very high—between 0.9 and
1.0. L is the light color incident on the voxel (which itself may be physically
simulated [Kane 2010]), and P(cos Θ) is the phase function of the cloud, which is
a function of the dot product between the view direction and the light direction.
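Putting these quantities together, a small C/C++ sketch of the per-voxel single-scattering math might look like the following; the structure and function names are illustrative, and the phase function value is simply taken as an input:

#include <math.h>

typedef struct { float r, g, b; } Color;

// Optical depth tau = n * pi * p^2 * T  [Blinn 1982].
float OpticalDepth(float dropletsPerUnitVolume, float dropletRadius, float voxelThickness)
{
    const float PI = 3.14159265f;
    return dropletsPerUnitVolume * PI * dropletRadius * dropletRadius * voxelThickness;
}

// Extinction alpha = 1 - exp(-tau).
float Extinction(float tau)
{
    return 1.0f - expf(-tau);
}

// Scattered color C = a * tau * L * P(cos Theta) / (4 pi), applied per channel.
Color ScatteredColor(float albedo, float tau, Color incidentLight, float phase)
{
    const float PI = 3.14159265f;
    float s = albedo * tau * phase / (4.0f * PI);
    Color c = { incidentLight.r * s, incidentLight.g * s, incidentLight.b * s };
    return c;
}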
The phase function is where things get interesting. Light has a tendency to
scatter in the forward direction within a cloud; this is what leads to the bright
“silver lining” you see on clouds that are lit from behind. The more accurate a
phase function you use, the more realistic your lighting effects are.
A simple phase function is the Rayleigh function [Rayleigh 1883]. Although
it is generally used to describe the scattering of atmospheric molecules and is
best known as the reason the sky is blue, it turns out to be a reasonable approxi-
mation of scattering from cloud droplets under certain cloud densities and wave-
lengths [Petty 2006] and has been used successfully in both prior research [Harris
2002] and commercial products.1 The Rayleigh function is given by
$$P(\cos\Theta) = \frac{3}{4}\left(1 + \cos^2\Theta\right).$$
The Rayleigh function is simple enough to execute in a fragment program,
but one problem is that it scatters light equally in the backward direction and the
forward direction. For the larger particles that make up a typical cloud, the Heny-
ey-Greenstein function [Henyey and Greenstein 1941] provides a better approxi-
mation:
$$P(\cos\Theta) = \frac{1 - g^2}{\left(1 + g^2 - 2g\cos\Theta\right)^{3/2}}.$$
The parameter g describes the asymmetry of the function, and is typically high
(≈ 0.99). Positive values of g produce forward scattering, and negative values
produce backward scattering. Since a bit of both actually occurs, more sophisti-
cated implementations actually use a double-lobed Henyey-Greenstein function.
In this case, two functions are evaluated—one with a positive value of g and one
with a negative value, and they are blended together, heavily favoring the posi-
tive (forward) scattering component.
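A possible double-lobed implementation is sketched below; the chapter does not prescribe exact constants, so the g values and the blend weight shown here are assumptions:

#include <math.h>

// Single-lobe Henyey-Greenstein phase function.
static float HenyeyGreenstein(float cosTheta, float g)
{
    float g2 = g * g;
    return (1.0f - g2) / powf(1.0f + g2 - 2.0f * g * cosTheta, 1.5f);
}

// Blend a strong forward lobe with a weak backward lobe, heavily favoring
// the forward-scattering component.
float CloudPhase(float cosTheta)
{
    const float gForward = 0.99f;       // assumed forward asymmetry
    const float gBackward = -0.3f;      // assumed backward asymmetry
    const float forwardWeight = 0.9f;   // assumed blend weight

    return forwardWeight * HenyeyGreenstein(cosTheta, gForward) +
           (1.0f - forwardWeight) * HenyeyGreenstein(cosTheta, gBackward);
}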
The ultimate phase function is given by Mie theory, which simulates actual
light waves using Maxwell’s equations in three-dimensional space [Boh-
ren and Huffman 1983]. It is dependent on the droplet size distribution within the
cloud, and as you can imagine, it is very expensive to calculate. However, a free
tool called MiePlot2 is available to perform offline solutions to Mie scattering,
which may be stored in a texture to be looked up for a specific set of conditions.
Mie scattering not only gives you silver linings but also wavelength-dependent
1. For example, SilverLining. See http://www.sundog-soft.com/.
2. See http://www.philiplaven.com/mieplot.htm.
effects such as fogbows and glories. It has been applied in real-time successfully
by either chopping the function’s massive forward peak [Bouthors et al. 2006] or
restricting it to phase angles where wavelength-dependent effects occur, and us-
ing simpler phase functions for other angles [Petty 2006].
If you rely exclusively on multiple forward scattering of the incident sunlight
on a cloud, your cloud will appear unnaturally dark. There are other light sources
to consider—skylight, reflected light from the ground, and the light from higher-
order scattering should not be neglected. We approximate these contributions
with an ambient term; more sophisticated implementations may use hemisphere
lighting techniques [Rost and Licea-Kane 2009] to treat skylight from above and
light reflected from the ground below independently.
Tone mapping and gamma correcting the final result are also vitally im-
portant for good image quality. We use a gamma value of 2.2 together with the
simplest form of the Reinhard tone-mapping operator [Reinhard et al. 2002] with
good results:
$$L_d = \frac{L}{1 + L}.$$
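For example, a per-channel version of this operator combined with gamma 2.2 encoding might be written as follows (a sketch, not the chapter's code):

#include <math.h>

// Simplest Reinhard tone-mapping operator followed by gamma 2.2 encoding.
float ToneMapChannel(float L)
{
    float Ld = L / (1.0f + L);
    return powf(Ld, 1.0f / 2.2f);
}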
For added realism, you’ll also want to simulate atmospheric perspective effects
on distant clouds. Exponentially blending the clouds into the sky with distance is
a simple approach that’s generally “good enough,” although more rigorous ap-
proaches are available [Preetham et al. 1999].
Volumetric Splatting
The simplest technique is called splatting and is illustrated in Figure 2.2. Each
voxel is represented by a billboard that represents an individual cloud puff.
Mathematically, this texture should represent a Gaussian distribution, but adding
some wispy detail and randomly rotating it produces visually appealing results.
Figure 2.3 illustrates how a single texture representing a cloud puff is used to
generate a realistic scene of cumulus clouds.
Figure 2.2. Volumetric cloud data rendered using splatting. (Image courtesy of Sundog Software, LLC.)

Figure 2.3. Wireframe overlay of splatted clouds with the single cloud puff texture used (inset). (Image courtesy Sundog Software, LLC.)
Lighting and rendering are achieved in separate passes. In each pass, we set
the blending function to (ONE, ONE_MINUS_SRC_ALPHA) to composite the voxels
together. In the lighting pass, we set the background to white and render the
voxels front to back from the viewpoint of the light source. As each voxel is ren-
dered, the incident light is calculated by multiplying the light color and the frame
buffer color at the voxel’s location prior to rendering it; the color and transparen-
cy of the voxel are then calculated as above, stored, and applied to the billboard.
This technique is described in more detail by Harris [2002].
When rendering, we use the color and transparency values computed in the
lighting pass and render the voxels in back-to-front order from the camera.
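An outline of the two passes in OpenGL terms is sketched below; the sorting and drawing helpers are assumed to be supplied by the application and are declared here only so the sketch is complete:

#include <GL/gl.h>

void SortVoxelsFrontToBackFromLight();
void SortVoxelsBackToFrontFromCamera();
void DrawVoxelBillboardsWithLightingReadback();  // reads back incident light per voxel
void DrawVoxelBillboards();                      // uses colors stored by the lighting pass

void LightingPass()
{
    glClearColor(1.0f, 1.0f, 1.0f, 1.0f);        // white background
    glClear(GL_COLOR_BUFFER_BIT);

    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA); // composite the voxels together

    // Render front to back from the light's point of view; each voxel's incident
    // light is the frame buffer color at its position times the light color.
    SortVoxelsFrontToBackFromLight();
    DrawVoxelBillboardsWithLightingReadback();
}

void RenderPass()
{
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA);

    // Render back to front from the camera using the stored per-voxel results.
    SortVoxelsBackToFrontFromCamera();
    DrawVoxelBillboards();
}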
To prevent breaking the illusion of a single, cohesive cloud, we need to en-
sure the individual billboards that compose it aren’t perceptible. Adding some
random jitter to the billboard locations and orientations helps, but the biggest is-
sue is making sure all the billboards rotate together in unison as the view angle
changes. The usual trick of axis-aligned billboards falls apart once the view angle
approaches the axis chosen for alignment. Our approach is to use two orthogonal
axes against which our billboards are aligned. As the view angle approaches the
primary axis (pointing up and down), we blend toward using our alternate (or-
thogonal) axis instead.
To ensure good performance, the billboards composing a cloud must be ren-
dered as a vertex array and not as individual objects. Instancing techniques
and/or geometry shaders may be used to render clouds of billboards from a single
stream of vertices.
While splatting is fast for sparser cloud volumes and works on pretty much
any graphics hardware, it suffers from fill rate limitations due to high depth com-
plexity. Our lighting pass also relies on pixel read-back, which generally blocks
the pipeline and requires rendering to an offscreen surface in most modern
graphics APIs. Fortunately, we only need to run the lighting pass when the light-
ing conditions change. Simpler lighting calculations just based on each voxel’s
depth within the cloud from the light direction may suffice for many applications,
and they don’t require pixel read-back at all.
Volumetric Slicing
Instead of representing our volumetric clouds with a collection of 2D billboards,
we can instead use a real 3D texture of the cloud volume itself. Volume render-
ing of 3D textures is the subject of entire books [Engel et al. 2006], but we’ll give
a brief overview here.
The general idea is that some form of simple proxy geometry for our volume
is rendered using 3D texture coordinates relative to the volume data. We then get
GPU Ray Casting

Real-time ray casting may sound intimidating, but in many ways it's the simplest and most elegant technique for rendering clouds. It does involve placing most of the computing in a fragment program, but if optimized carefully, high frame rates are achievable with precise per-fragment lighting. Figure 2.5 shows a dense, 60 square kilometer stratocumulus cloud layer rendering at over 70 frames per second on consumer-grade hardware using GPU ray casting.

The general idea is to just render the bounding box geometry of your clouds (with back-face culling enabled) and let the fragment processor do the rest. For each fragment of the bounding box, our fragment program shoots a ray through it from the camera and computes the ray's intersection with the bounding volume. We then sample our 3D cloud texture along the ray within the volume from front to back, compositing the results as we go.

Figure 2.5. Volumetric cloud data rendered from a single bounding box using GPU ray casting. (Image courtesy Sundog Software, LLC.)
The color of each sample is determined by shooting another ray to it from the
light source and compositing the lighting result using multiple forward scatter-
ing—see Figure 2.6 for an illustration of this technique.
It’s easy to discard this approach, thinking that for every fragment, you need
to sample your 3D texture hundreds of times within the cloud, and then sample
each sample hundreds of more times to compute its lighting. Surely that can’t
scale! But, there’s a dirty little secret about cloud rendering that we can exploit to
keep the actual load on the fragment processor down: clouds aren’t really all that
transparent at all, and by rendering from front to back, we can terminate the pro-
cessing of any given ray shortly after it intersects a cloud.
Recall that we chose a voxel size on the order of 100 meters for multiple
forward scattering-based lighting because in larger voxels, higher orders of scat-
tering dominate. Light starts bouncing around shortly after it enters a cloud, mak-
ing the cloud opaque as soon as light travels a short distance into it. The mean
free path of a cumulus cloud is typically only 10 to 30 meters [Bouthors et al.
2008]—beyond that distance into the cloud, we can safely stop marching our ray
into it. Typically, only one or two samples per fragment really need to be lit,
which is a wonderful thing in terms of depth complexity and fill rate.
The first thing we need to do is compute the intersection of the viewing ray
for the fragment with the cloud’s axis-aligned bounding box. To simplify our
calculations in our fragment program, we work exclusively in 3D texture coordi-
nates relative to the bounding box, where the texture coordinates of the box range
from 0.0 to 1.0. An optimized function for computing the intersection of a ray
with this unit cube is shown in Listing 2.3.
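Listing 2.3 itself is not reproduced in this excerpt; the following is a minimal sketch of one way such a ray/unit-cube intersection can be written with the slab method, assuming the ray origin and direction are already expressed in the box's 0-1 texture space:

#include <algorithm>
#include <cmath>

// Slab-method intersection of a forward ray with the unit cube [0,1]^3.
// Returns false if the ray misses; otherwise tNear and tFar bound the segment
// of the ray that lies inside the cube.
bool IntersectUnitCube(const float origin[3], const float dir[3], float *tNear, float *tFar)
{
    float t0 = 0.0f;            // clamp to the forward half of the ray
    float t1 = 1.0e30f;

    for (int i = 0; i < 3; ++i)
    {
        if (fabsf(dir[i]) < 1.0e-8f)
        {
            // Ray parallel to this slab: it must already lie between the planes.
            if (origin[i] < 0.0f || origin[i] > 1.0f) return false;
        }
        else
        {
            float invD = 1.0f / dir[i];
            float tA = (0.0f - origin[i]) * invD;
            float tB = (1.0f - origin[i]) * invD;
            if (tA > tB) std::swap(tA, tB);
            t0 = std::max(t0, tA);
            t1 = std::min(t1, tB);
            if (t0 > t1) return false;
        }
    }

    *tNear = t0;
    *tFar = t1;
    return true;
}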
Figure 2.6. GPU ray casting. We shoot a ray into the volume from the eye point, termi-
nating once an opacity threshold is reached. For each sample along the ray, we shoot an-
other ray from the light source to determine the sample’s color and opacity.
With the parameters of our viewing ray in hand, the rest of our fragment
program becomes straightforward. We sample this ray in front-to-back order
from the camera; if a sample contains a cloud voxel, we then shoot another ray
from the light source to determine the sample’s color. We composite these
samples together until an opacity threshold is reached, at which point we
terminate the ray early. Listing 2.4 illustrates this technique.
// in texcoords
uniform vec3 lightSampleDimensions; // Size of light sample,
// texcoords
uniform vec3 skyLightColor; // RGB sky light component
uniform vec3 multipleScatteringTerm; // RGB higher-order
// scattering term
uniform vec4 lightColor; // RGBA direct sun light color
void main()
{
vec3 texCoord = gl_TexCoord[0].xyz;
if (lightSample != 0)
{
// Multiple forward scattering:
vec4 srcColor;
srcColor.xyz = accumulatedColor.xyz * scattering
* phaseFunction(1.0);
srcColor.w = extinction;
srcColor *= lightSample;
samplePos -= lightSampleInc;
}
vec4 fragSample;
Note that the texture lookup for each sample isn’t just a simple texture3D
call—it calls out to a getCloudDensity() function instead. If you rely on the
3D volume data alone, your clouds will look like nicely shaded blobs. The get-
CloudDensity() function needs to add in procedural noise for realistic results—
we upload a 32³-texel RGB texture of smoothed random noise, and apply it as a
displacement to the texture coordinates at a couple of octaves to produce fractal
effects. Perlin noise [Perlin 1985] would also work well for this purpose. An
example implementation is shown in Listing 2.5; the noiseOffset uniform
vector is used to animate the noise over time, creating turbulent cloud animation
effects.
Hybrid Approaches
Although GPU ray casting of clouds can be performant on modern graphics
hardware, the per-fragment lighting calculations are still expensive. A large
number of computations and texture lookups may be avoided by actually
performing the lighting on the CPU and storing the results in the colors of the
voxels themselves to be used at rendering time. Recomputing the lighting in this
manner results in a pause in framerate whenever lighting conditions change but
makes rendering the cloud extremely fast under static lighting conditions. By
eliminating the lighting calculations from the fragment processor, we’re just left
with finding the first intersection of the view ray with a cloud and terminating the
ray early—we’ve now rendered our cloud volume with an effective depth
complexity close to one!
Volumetric slicing may also benefit from a hybrid approach. For example, a
fragment program may be employed to perform lighting of each fragment using
GPU ray casting, while still relying on the slice geometry to handle compositing
along the view direction.
Other approaches render a cloud as a mesh [Bouthors et al. 2008], again
taking advantage of the low mean free path of cumulus clouds. This allows more
precise lighting and avoids the intersection problems introduced by proxy
geometry in volume rendering.
Ultimately, choosing a cloud rendering technique depends on the trade-offs
you’re willing to make between hardware compatibility, physical realism, and
performance—fortunately, there are a variety of techniques to choose from.
References
[Blinn 1982] James F. Blinn. “Light Reflection Functions for Simulation of Clouds and
Dusty Surfaces.” Computer Graphics (Proceedings of SIGGRAPH 82) 16:3, ACM,
pp. 21–29.
[Bohren and Huffman 1983] Craig F. Bohren and Donald R. Huffman. Absorption and
Scattering of Light by Small Particles. New York: Wiley-Interscience, 1983.
[Bouthors et al. 2006] Antoine Bouthors, Fabrice Neyret, and Sylvain Lefebvre. “Real-
Time Realistic Illumination and Shading of Stratiform Clouds.” Eurographics
Workshop on Natural Phenomena, 2006, pp. 41–50.
[Bouthors et al. 2008] Antoine Bouthors, Fabrice Neyret, Nelson Max, Eric Bruneton,
and Cyril Crassin. “Interactive Multiple Anisotropic Scattering in Clouds.” ACM
Symposium on Interactive 3D Graphics and Games (I3D), 2008.
[Dobashi et al. 2000] Yoshinori Dobashi, Kazufumi Kaneda, Hideo Yamashita, Tsuyoshi
Okita, and Tomoyuki Nishita. “A Simple, Efficient Method for Realistic Anima-
tion of Clouds.” Proceedings of SIGGRAPH 2000, ACM Press / ACM SIG-
GRAPH, Computer Graphics Proceedings, Annual Conference Series, ACM, pp.
19–28.
[Engel et al. 2006] Klaus Engel, Markus Hadwiger, Joe M. Kniss, Christof Rezk-Salama,
and Daniel Weiskopf. Real-Time Volume Graphics. Wellesley, MA: A K Peters,
2006.
[Harris 2002] Mark Harris. “Real-Time Cloud Rendering for Games.” Game Developers
Conference, 2002.
[Henyey and Greenstein 1941] L. Henyey and J. Greenstein. “Diffuse Reflection in the
Galaxy.” Astrophysical Journal 93 (1941), p. 70.
[Ikits et al. 2007] Milan Ikits, Joe Kniss, Aaron Lefohn, and Charles Hansen. “Volume
Rendering Techniques.” GPU Gems, edited by Randima Fernando. Reading, MA:
Addison-Wesley, 2007.
[Kajiya and Herzen 1984] James T. Kajiya and Brian P. Von Herzen. “Ray Tracing Vol-
ume Densities.” Computer Graphics 18:3 (July 1984), ACM, pp. 165–174.
[Kane 2010] Frank Kane. “Physically-Based Outdoor Scene Lighting.” Game Engine
Gems 1, edited by Eric Lengyel. Sudbury, MA: Jones and Bartlett, 2010.
[Nagel and Raschke 1992] K. Nagel and E. Raschke. “Self-Organizing Criticality in
Cloud Formation?” Physica A 182:4 (April 1992), pp. 519–531.
[Nyquist 1928] Harry Nyquist. “Certain Topics in Telegraph Transmission Theory.”
AIEE Transactions 47 (April 1928), pp. 617–644.
[Perlin 1985] Ken Perlin. “An Image Synthesizer.” Computer Graphics (Proceedings of
Siggraph 85) 19:3, ACM, pp. 287–296.
[Petty 2006] Grant W. Petty. A First Course in Atmospheric Radiation. Madison, WI:
Sundog Publishing, 2006.
[Plank 1969] Vernon G. Plank. “The Size Distribution of Cumulus Clouds in Repre-
sentative Florida Populations.” Journal of Applied Meteorology 8 (1969), pp.
46–67.
[Preetham et al. 1999] Arcot J. Preetham, Peter Shirley, and Brian Smits. “A Practical
Analytic Model for Daylight.” Proceedings of SIGGRAPH 1999, ACM Press /
ACM SIGGRAPH, Computer Graphics Proceedings, Annual Conference Series,
ACM, pp. 91–100.
[Rayleigh 1883] Lord Rayleigh. “Investigation of the Character and Equilibrium of an
Incompressible Heavy Fluid of Variable Density.” Proceedings of the London
Mathematical Society 14 (1883), pp. 170–177.
[Reinhard et al. 2002] Erik Reinhard, Michael Stark, Peter Shirley, and James Ferwerda.
“Photographic Tone Reproduction for Digital Images.” Proceedings of SIG-
GRAPH 2002, ACM Press / ACM SIGGRAPH, Computer Graphics Proceedings,
Annual Conference Series, ACM, pp. 267–276.
[Rost and Licea-Kane 2009] Randi J. Rost and Bill Licea-Kane. OpenGL Shading Lan-
guage, Third Edition. Reading, MA: Addison-Wesley Professional, 2009.
[Wang 2004] Niniane Wang. “Realistic and Fast Cloud Rendering.” Journal of Graphics
Tools 9:3 (2004), pp. 21–40.
3
Simulation of Night‐Vision and
Infrared Sensors
Frank Kane
Sundog Software, LLC
Many action games simulate infrared (IR) and night-vision goggles (NVG) by
simply making the scene monochromatic, swapping out a few textures, and turn-
ing up the light sources. We can do better. Rigorous simulations of IR and NVG
sensors have been developed for military training and simulation applications,
and we can apply their lessons to game engines. The main differences between
visible, IR, and near-IR wavelengths are easily modeled. Sensors may also in-
clude effects such as light blooms, reduced contrast, blurring, atmospheric trans-
mittance, and reduced resolution that we can also simulate, adding to the realism.
$$j^{*} = \varepsilon \sigma T^{4}.$$

Here, T is the absolute temperature of the object (in Kelvins), ε is the thermal emissivity of the material, and σ is the Stefan-Boltzmann constant, 5.6704 × 10⁻⁸ J s⁻¹ m⁻² K⁻⁴. If the ambient temperature and temperature of the objects in your
scene remain constant, the radiation emitted may be precomputed and baked into special IR versions of your texture maps. Figure 3.1 illustrates texture-based IR simulation of a tank (note the treads and engine area are hot). If you want to simulate objects cooling off over time, just store the emissivity in your materials and/or textures and compute the equation above in a vertex program as the temperature varies.

Table 3.1 lists emissivity values for some common materials at 8 µm. Note that most organic materials have high emissivity and behave almost like ideal black bodies, while metals are more reflective and have lower IR emissions.

It is the subtle differences in emissivity that distinguish different materials in your scene that are at the same ambient temperature; although the differences may be small, they are important for adding detail. For objects that emit heat of their own (i.e., the interior of a heated house or living organisms), the changes in temperature are the main source of contrast in your IR scene. Stop thinking about modeling your materials and textures in terms of RGB colors, but rather in terms of emissivity and absolute temperature. An alternate set of materials and/or textures is required.

To add additional detail to the IR scene, there is a physical basis to blending in the visible light texture as well—in a pinch, this lets you repurpose the visible-
light textures of your objects as detail for the thermal information. There is a
range of about 0.15 in the emissivity between white and black objects; light col-
ors reflect more heat, and dark colors absorb it. Your fragment shader may con-
vert the RGB values of the visible-light textures to monochromatic luminance
values and perturb the final emissivity of the fragment accordingly. Listing 3.1
illustrates a snippet of a fragment shader that might approximate the thermal ra-
diation of a fragment given its visible color and knowledge of its underlying ma-
terial’s emissivity and temperature. This is a valid approach only for objects that
do not emit their own heat; for these objects, emissivities and temperatures
should be encoded directly in specialized IR textures, rather than blending the
visible-light texture with vertex-based thermal properties.
void main()
{
vec2 texCoords = gl_TexCoord[0].xy;
vec3 visibleColor = texture2D(visibleTexture, texCoords).xyz;
// Stefan–Boltzmann equation:
float radiation = finalEmissivity * stefanBoltzmannConstant *
materialTemperature * materialTemperature *
materialTemperature * materialTemperature;
Listing 3.1. A simplified fragment shader for IR sensor simulation using a hybrid of visible-light
textures with material-based thermal properties.
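The emissivity perturbation that the snippet above omits could be sketched as follows; the luminance weights, names, and exact mapping are assumptions, with only the roughly 0.15 emissivity range coming from the text:

// Perturb a material's base emissivity using the luminance of its visible-light
// color, exploiting the roughly 0.15 emissivity spread between white and black
// objects. Constants and names are illustrative.
float PerturbedEmissivity(float baseEmissivity, float r, float g, float b)
{
    // Rec. 709 luma weights give a monochromatic luminance value in [0, 1].
    float luminance = 0.2126f * r + 0.7152f * g + 0.0722f * b;

    // Lighter (more reflective) colors lower the effective emissivity slightly,
    // while darker colors raise it.
    float emissivity = baseEmissivity + (0.5f - luminance) * 0.15f;

    if (emissivity < 0.0f) emissivity = 0.0f;
    if (emissivity > 1.0f) emissivity = 1.0f;
    return emissivity;
}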
pheric transmittance models for this purpose such as LOWTRAN1 and MOD-
TRAN,2 but we can get by with a simpler approximation. Atmospheric transmit-
tance from water vapor for a wavelength in micrometers is approximated
[Bonjean et al. 2006] by
$$\tau(\lambda) = \tau_\omega(\lambda)^{\,m\omega/20}.$$
The value of m represents the air mass, which is a measure of how much atmos-
phere the heat source has passed through before reaching the sensor. This is a
unitless value, normalized to one for a vertical path through the atmosphere at sea
level. More sophisticated implementations for “serious games” may also take
atmospheric scattering and scattering from dust into account.
1. See http://www1.ncdc.noaa.gov/pub/software/lowtran/.
2. See http://www.modtran.org/.
Fixed-pattern noise reflects variances in the individual elements of the sensor that remain static over time; this may be implemented through an overlay added to the image at the sensor's resolution. You may also wish to simulate dead elements of the sensor, where a few random pixels remain statically cold.
3.3 Night-Vision Goggle Simulation
NVGs are often described as image intensification devices, but modern NVGs do not simply amplify visible light—they actually operate in the near-IR part of the spectrum with very little overlap with the visible spectrum at all. This is because most of the natural radiation at night occurs within the near-IR band. This doesn't mean that all the thermal effects discussed above also apply to NVGs; near-IR behaves more like visible light and has little to do with heat emission. As such, simulating night vision is simpler from a physical standpoint. Figure 3.3 shows one example of a simulated NVG scene.
We could get by with converting the visible-light textures to luminance values as we did in Listing 3.1 and mapping this to the monochromatic green colors typical in NVGs. There is one material-based effect worth simulating, however, and that is the chlorophyll effect [Hogervorst 2009]. The leaves of plants and trees containing chlorophyll are highly reflective in the near-IR band. As a result, they appear much brighter than you would expect in modern NVGs. Simply apply a healthy boost to ambient light when rendering the plants and trees in your night-vision scene to accomplish this.

Figure 3.3. Simulated NVG viewpoint with depth-of-field effects. (Image courtesy of SDS International.)
The same techniques for simulating auto gain, sensor resolution, and random "shot noise" discussed for IR sensors apply to NVGs as well. It's especially important to increase the noise as a function of gain; a moonless night has significant noise compared to a scene illuminated by a full moon. Many NVGs also have noticeable fixed-pattern noise in the form of honeycomb-shaped sensor elements that may be added to the final output. Figure 3.4 compares the same scene under low-light and high-light conditions, with noise varying accordingly. Note also the loss of contrast under low-light conditions.
There are anomalies specific to NVGs worth simulating as well. One is depth of field; NVGs are focused just like a pair of binoculars—as a result, much of the resulting scene is blurry. Figure 3.3 illustrates depth of field in a simulated NVG scene; note that the distant buildings are blurred. Depth of field may be implemented by jittering the view frustum about the focal point a few times and accumulating the results [Shreiner 2006].
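A minimal sketch of this accumulation approach is shown below; it jitters the eye within an aperture while keeping it aimed at the focal point and averages the renders in the accumulation buffer, in the spirit of the technique described in [Shreiner 2006]. The aperture size, jitter table, and drawScene() callback are illustrative assumptions.

#include <GL/gl.h>
#include <GL/glu.h>

// Assumed jitter offsets in [-1, 1]; a uniformly distributed set works well.
extern const float jitterX[], jitterY[];

// Average several renders, each with the eye jittered within the aperture but
// always aimed at the focal point, so geometry near the focal plane stays sharp.
void RenderDepthOfField(int numSamples, float aperture, const float eye[3],
                        const float focalPoint[3], void (*drawScene)(void))
{
    glClear(GL_ACCUM_BUFFER_BIT);

    for (int i = 0; i < numSamples; ++i)
    {
        float dx = jitterX[i] * aperture;
        float dy = jitterY[i] * aperture;

        glMatrixMode(GL_MODELVIEW);
        glLoadIdentity();
        gluLookAt(eye[0] + dx, eye[1] + dy, eye[2],
                  focalPoint[0], focalPoint[1], focalPoint[2],
                  0.0, 1.0, 0.0);

        drawScene();                          // render the NVG scene once
        glAccum(GL_ACCUM, 1.0f / numSamples);
    }

    glAccum(GL_RETURN, 1.0f);                 // write the averaged result
}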
The prominence of stars on a clear night should also be simulated, and this can also be seen in Figure 3.4. While there are only about 8000 stars visible on a clear night to the naked eye, many more are visible with NVGs. Individual stars may saturate entire pixel elements in the sensor.
Light blooms are an important effect specific to NVGs. Bright light sources in a scene saturate the sensor; for point light sources, this manifests itself as bright halos surrounding the light. These halos may be simulated by rendering billboards over the position of point lights in the scene. These halos typically cover a viewing angle of 1.8 degrees, irrespective of the distance of the light from the viewer [Craig et al. 2005]. For non-point-light sources, such as windows into illuminated buildings, the halo effect manifests itself as a glow radiating 1.8 degrees from the shape of the light. These effects are illustrated in Figure 3.5.

Figure 3.4. Identical scene rendered under high ambient light (left) and low ambient light (right). (Images courtesy of SDS International.)
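Because the halo subtends a constant 1.8-degree angle, the world-space size of the halo billboard grows linearly with the distance to the light. A small helper makes the relationship explicit (the function name is illustrative):

#include <cmath>

// World-space diameter of a halo billboard that subtends a constant
// 1.8-degree viewing angle, regardless of the light's distance.
float HaloBillboardSize(float distanceToLight)
{
    const float haloAngleRadians = 1.8f * 3.14159265f / 180.0f;
    return 2.0f * distanceToLight * std::tan(0.5f * haloAngleRadians);
}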
Finally, you may want to use a stencil or overlay to restrict the field of view to a circular area, to match the circular collectors at the end of the NVG tubes.
References
[Bhatia and Lacy 1999] Sanjiv Bhatia and George Lacy. "Infra-Red Sensor Simulation." I/ITSEC, 1999.

[Bonjean et al. 2006] Maxime E. Bonjean, Fabian D. Lapierre, Jens Schiefele, and Jacques G. Verly. "Flight Simulator with IR and MMW Radar Image Generation Capabilities." Enhanced and Synthetic Vision 2006 (Proceedings of SPIE) 6226.

[Craig et al. 2005] Greg Craig, Todd Macuda, Paul Thomas, Rob Allison, and Sion Jennings. "Light Source Halos in Night Vision Goggles: Psychophysical Assessments." Proceedings of SPIE 5800 (2005), pp. 40–44.

[Hogervorst 2009] Maarten Hogervorst. "Toward Realistic Night Vision Simulation." SPIE Newsroom, March 27, 2009. Available at http://spie.org/x34250.xml?ArticleID=x34250.

[Lengyel 2010] Eric Lengyel. "Motion Blur and the Velocity-Depth-Gradient Buffer." Game Engine Gems 1, edited by Eric Lengyel. Sudbury, MA: Jones and Bartlett, 2010.
[Shreiner 2006] Dave Shreiner. OpenGL Programming Guide, Seventh Edition. Read-
ing, MA: Addison-Wesley, 2006.
[Thomas 2003] Lynn Thomas. “Improved PC FLIR Simulation Through Pixel Shaders.”
IMAGE 2003 Conference.
4
Screen‐Space Classification for
Efficient Deferred Shading
Balor Knight
Matthew Ritchie
George Parrish
Black Rock Studio
4.1 Introduction
Deferred shading is an increasingly popular technique used in video game ren-
dering. Geometry components such as depth, normal, material color, etc., are
rendered into a geometry buffer (commonly referred to as a G-buffer), and then
deferred passes are applied in screen space using the G-buffer components as
inputs.
A particularly common and beneficial use of deferred shading is for faster
lighting. By detaching lighting from scene rendering, lights no longer affect sce-
ne complexity, shader complexity, batch count, etc. Another significant benefit of
deferred lighting is that only relevant and visible pixels are lit by each light, lead-
ing to less pixel overdraw and better performance.
The traditional deferred lighting model usually includes a fullscreen lighting
pass where global light properties, such as sun light and sun shadows, are ap-
plied. However, this lighting pass can be very expensive due to the number of
onscreen pixels and the complexity of the lighting shader required.
A more efficient approach would be to take different shader paths for differ-
ent parts of the scene according to which lighting calculations are actually re-
quired. A good example is the expensive filtering techniques needed for soft
shadow edges. It would improve performance significantly if we only performed
this filter on the areas of the screen that we know are at the edges of shadows.
This can be done using dynamic shader branches, but that can lead to poor per-
formance on current game console hardware.
Swoboda [2009] describes a technique that uses the PlayStation 3 SPUs to
analyze the depth buffer and classify screen areas for improved performance in
post-processing effects, such as depth of field. Moore and Jefferies [2009] de-
scribe a technique that uses low-resolution screen-space shadow masks to classi-
fy screen areas as in shadow, not in shadow, or on the shadow edge for improved
soft shadow rendering performance. They also describe a fast multisample anti-
aliasing (MSAA) edge detection technique that improves deferred lighting per-
formance.
These works provided the background and inspiration for this chapter, which
extends things further by classifying screen areas according to the global light
properties they require, thus minimizing shader complexity for each area. This
work has been successfully implemented with good results in Split/Second, a rac-
ing game developed by Disney’s Black Rock Studio. It is this implementation
that we cover in this chapter because it gives a practical real-world example of
how this technique can be applied.

4.2 Overview of Method

We classify each pixel according to the following seven global light properties:

1. Sky. These are the fastest pixels because they don't require any lighting calculations at all. The sky color is simply copied directly from the G-buffer.
2. Sun light. Pixels facing the sun require sun and specular lighting calculations
(unless they’re fully in shadow).
3. Solid shadow. Pixels fully in shadow don’t require any shadow or sun light
calculations.
4. Soft shadow. Pixels at the edge of shadows require expensive eight-tap per-
centage closer filtering (PCF) unless they face away from the sun.
5. Shadow fade. Pixels near the end of the dynamic shadow draw distance fade
from full shadow to no shadow to avoid pops as geometry moves out of the
shadow range.
6. Light scattering. All but the nearest pixels have a light scattering calculation
applied.
7. Antialiasing. Pixels at the edges of polygons require lighting calculations for
both 2X MSAA fragments.
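These seven properties map naturally onto a 7-bit classification ID, one bit per property. A minimal sketch of such flags (the names are illustrative, not the ones used in Split/Second) might be:

// One bit per global light property; a tile's classification ID is the
// bitwise OR of the properties required by any pixel in that tile.
enum ClassificationFlags
{
    CLASS_SKY              = 1 << 0,
    CLASS_SUN_LIGHT        = 1 << 1,
    CLASS_SOLID_SHADOW     = 1 << 2,
    CLASS_SOFT_SHADOW      = 1 << 3,
    CLASS_SHADOW_FADE      = 1 << 4,
    CLASS_LIGHT_SCATTERING = 1 << 5,
    CLASS_MSAA_EDGE        = 1 << 6
};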
We calculate which light properties are required for each 4 × 4 pixel tile and store the result in a 7-bit classification ID. Some of these properties are mutually exclusive for a single pixel, such as sky and sunlight, but they can exist together when properties are combined into 4 × 4 pixel tiles.

Once we've generated a classification ID for every tile, we then create an index buffer for each ID that points to the tiles with that ID and render it using a shader with the minimum lighting code required for those light properties.

We found that a 4 × 4 tile size gave the best balance between classification computation time and shader complexity, leading to best overall performance. Smaller tiles meant spending too much time classifying the tiles, and larger tiles meant more lighting properties affecting each tile, leading to more complex shaders. A size of 4 × 4 pixels also conveniently matches the resolution of our existing screen-space shadow mask [Moore and Jefferies 2009], which simplifies the classification code, as explained later. For Split/Second, the use of 4 × 4 tiles adds up to 57,600 tiles at a resolution of 1280 × 720. Figures 4.1 and 4.2 show screenshots from the Split/Second tutorial mode with different global light properties highlighted.
Figure 4.1. A screenshot from Split/Second with soft shadow edge pixels highlighted in green.
4.3 Depth-Related Classification
Tile classification in Split/Second is broken into two parts. We classify four of the seven light properties during our screen-space shadow mask generation, and we classify the other three in a per-pixel pass. The reason for this is that the screen-space shadow code is already generating a one-quarter resolution (320 × 180) texture, which perfectly matches our tile resolution of 4 × 4 pixels, and it is also reading depths, meaning that we can minimize texture reads and shader complexity in the per-pixel pass by extending the screen-space shadow mask code to perform all depth-related classification.

Moore and Jefferies [2009] explain how we generate a one-quarter resolution screen-space shadow mask texture that contains three shadow types per pixel: pixels in shadow, pixels not in shadow, and pixels near the shadow edge. This work results in a texture containing zeros for pixels in shadow, ones for pixels not in shadow, and all other values for pixels near a shadow edge. By looking at this texture for each screen-space position, we can avoid expensive PCF for all areas except those near the edges of shadows that we want to be soft.

For tile classification, we extend this code to also classify light scattering and shadow fade since they're both calculated from depth alone, and we're already reading depth in these shaders to reconstruct world position for the shadow projections.
Listing 4.1. Classifying light scattering and shadow fade in the first-pass shadow mask shader.
Recall that the shadow mask is generated in two passes. The first pass calcu-
lates the shadow type per pixel at one-half resolution (640 360) and the second
pass conservatively expands the pixels marked as near shadow edge by down-
sampling to one-quarter resolution. Listing 4.1 shows how we add a simple light
scattering and shadow fade classification test to the first-pass shader.
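A minimal sketch of the kind of depth-based test this describes is shown below; the threshold names are illustrative assumptions, and the green/blue channel assignment matches the channels read back by Listing 4.2.

// Depth-based classification written into the green and blue channels of the
// shadow-mask output; the threshold uniforms are illustrative assumptions.
uniform float scatteringStartDepth;
uniform float shadowFadeStartDepth;

vec2 ClassifyDepth(float worldDepth)
{
    // Light scattering is applied to all but the nearest pixels.
    float scattering = (worldDepth > scatteringStartDepth) ? 1.0 : 0.0;

    // Shadow fade kicks in near the end of the dynamic shadow draw distance.
    float fade = (worldDepth > shadowFadeStartDepth) ? 1.0 : 0.0;

    return vec2(scattering, fade);   // stored in the mask's green and blue channels
}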
Listing 4.2 shows how we extend the second expand pass to pack the classi-
fication results together into four bits so they can easily be combined with the
per-pixel classification results later on.
if (rgb.r == 0.0)
bits += RAW_SHADOW_SOLID / 255.0;
else if (rgb.r < 4.0)
bits += RAW_SHADOW_SOFT / 255.0;
if (rgb.b != 0.0)
bits += RAW_SHADOW_FADE / 255.0;
if (rgb.g != 0.0)
bits += RAW_LIGHT_SCATTERING / 255.0;
Listing 4.2. Packing classification results together in the second-pass shadow mask shader. Note
that this code could be simplified under shader model 4 or higher because they natively support
integers and bitwise operators.
Table 4.1. The Split/Second G-buffer format. Note that each component has an entry for
both 2X MSAA fragments.
Listing 4.3. Classifying MSAA edge, sky, and sun light. This code could also be simplified in
shader model 4 or higher.
This is done in a very different way on each platform in order to make the most of their
particular strengths and weaknesses, as explained in Section 4.9.
Listing 4.4. This code counts the number of tiles using each classification ID.
Listing 4.5. This code builds the index buffer offsets. We store a pointer per shader for index
buffer generation and an index per shader for tile rendering.
*indexBufferPtrs[id]++ = index0;
*indexBufferPtrs[id]++ = index0 + 1;
*indexBufferPtrs[id]++ = index0 + TILE_WIDTH + 2;
*indexBufferPtrs[id]++ = index0 + TILE_WIDTH + 1;
}
}
Listing 4.6. This code builds the index buffer using the QUAD primitive.
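A compact sketch of the counting and offset-building steps that Listings 4.4 and 4.5 describe might look like the following; the array and constant names are illustrative assumptions.

// Count how many tiles use each classification ID.
unsigned int tileCounts[NUM_SHADER_COMBINATIONS] = { 0 };

for (unsigned int tile = 0; tile < numTiles; ++tile)
{
    ++tileCounts[classificationIDs[tile]];
}

// Give each shader a contiguous range of the shared index buffer (four
// indices per tile quad) and remember where each range starts.
unsigned short *writePtr = indexBufferBase;

for (unsigned int id = 0; id < NUM_SHADER_COMBINATIONS; ++id)
{
    indexBufferPtrs[id] = writePtr;
    startIndices[id] = (unsigned int) (writePtr - indexBufferBase);
    writePtr += tileCounts[id] * 4;
}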
#else
#endif
#endif
// Apply sunlight.
#if defined(SUN_LIGHT) && !defined(SOLID_SHADOW)
#endif
#endif
Listing 4.7. This example shader code illustrates how we generate a shader for sunlight and soft
shadow only.
0 1 2 3 16 17 18 19
4 5 6 7 20 21 22 23
8 9 10 11 24 25 26 27
12 13 14 15 28 29 30 31
Figure 4.3. Xbox 360 tile classification IDs are arranged in blocks of 4 × 4 tiles, giving us 80 × 45 blocks in total. The numbers show the memory offsets, not the classification IDs.
4.9 Platform Specifics
Figure 4.4. Xbox 360 classification flow: the GPU performs the G-buffer render, the depth-related classification, and the pixel classification; a GPU callback then wakes a CPU worker thread, which handles the XPS callback and the XPS render submit.
PlayStation 3
On the PlayStation 3, the pixel classification pass is piggybacked on top of an
existing depth and normal restore pass as an optimization to avoid needing a spe-
cific pass. This pass creates non-antialiased, full-width depth and normal buffers
for later non-antialiased passes, such as local lights, particles, post-processing,
etc., and we write the classification results to the unused w component of the
normal buffer.
Once we’ve rendered the normals and pixel classification to a full-resolution
texture, we then trigger a series of SPU downsample jobs to convert this texture
into a one-quarter resolution buffer containing only the pixel classification re-
sults. Combination with the depth-related classification results is performed later
on during the index buffer generation because those results aren’t ready yet. This
is due to the fact that we start the depth-related classification work on the GPU at
the same time as these SPU downsample jobs to maximize parallelization be-
tween the two.
We spread the work across four SPUs. Each SPU job takes 64 × 64 pixels of classification data (one main memory frame buffer tile), ORs each 4 × 4 pixel area together to create a 16 × 16 block of classification IDs, and streams them
back to main memory. Figure 4.5 shows how output IDs are arranged in main
memory. We take advantage of this block layout to speed up index buffer genera-
tion by coalescing neighboring tiles with the same ID, as explained in Section
4.10. Using 16 × 16 tile blocks also allows us to send the results back to main
memory in a single DMA call. Once this SPU work and the depth related classi-
fication work have both finished, a GPU callback triggers SPU jobs to combine
both sets of classification results together and perform the index buffer genera-
tion and draw call patching.
The first part of tile rendering is to fill the command buffer with a series of
shader activates interleaved with enough padding for the draw calls to be inserted
later on, once we know their starting indices and counts. This is done on the CPU
during the render submit phase.
Index buffer generation and tile rendering is spread across four SPUs, where
each SPU runs a single job on a quarter of the screen. The first thing we do is
combine the depth-related classification with the pixel classification. Remember
that we couldn’t do it earlier because the depth-related classification is rendered
on the GPU at the same time as the pixel classification downsample jobs are run-
ning on the SPUs. Once we have final 7-bit IDs, we can create the final draw
calls. Listings 4.5 and 4.6 show how we calculate starting indices and counts for
each shader, and we use these results to patch the command buffer with each
draw call.
Figure 4.5. PlayStation 3 tile classification IDs are arranged in blocks of 16 × 16 tiles, giving us 20 × 12 blocks in total. The numbers show the memory offsets, not the classification IDs.
Figure 4.6. PlayStation 3 classification flow: the GPU performs the G-buffer render and the pixel classification, a GPU callback triggers the SPU classification downsample jobs, the GPU waits on a signal, and a second GPU callback triggers the SPU jobs that combine the classification results.
Figure 4.6 shows how it all fits together, particularly the classification flow
between GPU and SPU jobs. The dotted arrow represents other work that we do
to keep the GPU busy while the SPU generates the index buffer.
4.10 Optimizations
Reducing Shader Count
We realized that some of the 7-bit classification combinations are impossible,
such as sun light and solid shadow together, no sunlight and soft shadow togeth-
er, etc., and we were able to optimize these seven bits down to five by collapsing
the four sun and shadow bits into two. This reduced the number of shaders from
128 to 32 and turned out to be a very worthwhile optimization.
■ 00 = solid shadow.
■ 01 = solid shadow + shadow fade + sunlight.
■ 10 = sun light (with no shadows).
■ 11 = soft shadow + shadow fade + sunlight.
The only caveat to collapsing these bits is that we’re now always calculating
shadow fade when soft shadows are enabled. However, this extra cost is negligi-
ble and is far outweighed by the benefits of reducing the shader count to 32.
// If in solid shadow.
if (iCombo & RAW_SHADOW_SOLID)
{
    // Set bit 0 if fading to sun (01 = solid shadow + shadow fade + sunlight).
    if ((iCombo & RAW_SHADOW_FADE) && (iCombo & RAW_SUN_LIGHT))
        bits |= SUN_0;
}
else if (iCombo & RAW_SUN_LIGHT)    // else if in sun
{
    // Set bit 1 (10 = sunlight with no shadows).
    bits |= SUN_1;

    // Also set bit 0 for soft shadow + shadow fade + sunlight (11).
    if (iCombo & RAW_SHADOW_SOFT)
        bits |= SUN_0;
}

// Write output.
output[iCombo] = bits;

Listing 4.8. This builds a lookup table to convert from raw to optimized material IDs.
Tile Coalescing
We mentioned earlier that we take advantage of the tile block layout to speed up
index buffer generation by coalescing adjacent tiles. This is done by comparing
each entry in a block row, which is very fast because they’re contiguous in
memory. If all entries in a row are the same, we join the tiles together to make a
single quad. Figure 4.7 shows an example for the Xbox 360 platform. By coa-
lescing a row of tiles, we only have to generate one primitive instead of four. On
the PlayStation 3, the savings are even greater because a block size is 16 × 16 tiles.
We extend this optimization even further on the Xbox 360 and coalesce an entire
block into one primitive if all 16 IDs in the block are the same.
10 10 18 10
34 34 34 34
34 18 10 10
10 10 10 10
Figure 4.7. Coalescing two rows of tiles within a single block on the Xbox 360.
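A sketch of the row test is shown below; the function and parameter names are illustrative.

// A row of tiles within a block can be coalesced into a single quad when
// every tile in the row carries the same classification ID.
bool RowHasUniformID(const unsigned char *rowIDs, int tilesPerRow)
{
    for (int i = 1; i < tilesPerRow; ++i)
    {
        if (rowIDs[i] != rowIDs[0])
        {
            return false;
        }
    }

    return true;
}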
4.11 Performance Comparison
Using the scene shown in Figure 4.8, we compare performance between a naive implementation that calculates all light properties for all pixels and our tile classification method. The results are shown in Table 4.2. The pixel classification cost on the PlayStation 3 is very small because we're piggybacking onto an existing pass, and the extra shader cost is minimal.
References
[Swoboda 2009] Matt Swoboda. “Deferred Lighting and Post Processing on Play-
Station 3.” Game Developers Conference, 2009.
[Moore and Jefferies 2009] Jeremy Moore and David Jefferies. “Rendering Techniques
in Split/Second.” Advanced Real-Time Rendering in 3D Graphics and Games,
ACM SIGGRAPH 2009 course notes.
5
Delaying OpenGL Calls
Patrick Cozzi
Analytical Graphics, Inc.
5.1 Introduction
It is a well known best practice to write an abstraction layer over a rendering API
such as OpenGL. Doing so has numerous benefits, including improved portabil-
ity, flexibility, performance, and above all, ease of development. Given
OpenGL’s use of global state and selectors, it can be difficult to implement clean
abstractions for things like shader uniforms and frame buffer objects. This chap-
ter presents a flexible and efficient technique for implementing OpenGL abstrac-
tions using a mechanism that delays OpenGL calls until they are finally needed at
draw time.
5.2 Motivation
Since its inception, OpenGL has relied on context-level global state. For exam-
ple, in the fixed-function days, users would call glLoadMatrixf() to set the en-
tries of one of the transformation matrices in the OpenGL state. The particular
matrix modified would have been selected by a preceding call to glMatrix-
Mode(). This pattern is still in use today. For example, setting a uniform’s value
with glUniform1f() depends on the currently bound shader program defined at
some earlier point using glUseProgram().
The obvious downside to these selectors (e.g., the current matrix mode or the
currently bound shader program) is that global state is hard to manage. For ex-
ample, if a virtual call is made during rendering, can you be certain the currently
bound shader program did not change? Developing an abstraction layer over
OpenGL can help cope with this.
diffuse->SetValue(0.5F);
specular->SetValue(0.2F);
context.Draw(program, /* ... */);
Our goal is to implement an efficient abstraction layer that does not expose
selectors. We limit our discussion to setting shader uniforms, but this technique is
useful in many other areas, including texture units, frame buffer objects, and ver-
tex arrays. To further simplify things, we only consider scalar floating-point uni-
forms. See Listing 5.1 for an example of what code using such an abstraction
layer would look like.
In Listing 5.1, the user creates a shader program, sets two floating-point uni-
forms, and eventually uses the program for drawing. The user is never concerned
with any globals like the currently bound shader program.
In addition to being easy to use, the abstraction layer should be efficient. In
particular, it should avoid needless driver CPU validation overhead by eliminat-
ing redundant OpenGL calls. When using OpenGL with a language like Java or
C#, eliminating redundant calls also avoids managed to native code round-trip
overhead. With redundant calls eliminated, the code in Listing 5.2 only results in
two calls to glUniform1f() regardless of the number of times the user sets uni-
form values or issues draw calls.
diffuse->SetValue(0.5F);
specular->SetValue(0.2F);
context.Draw(program, /* ... */);
diffuse->SetValue(0.5F);
specular->SetValue(0.2F);
context.Draw(program, /* ... */);
context.Draw(program, /* ... */);
Listing 5.2. An abstraction layer should filter out these redundant calls.
One straightforward implementation makes the corresponding OpenGL calls immediately, before finally issuing a draw call. In this case, glUseProgram() and glUni-
form1f() would be called three times each. The other downside is that
glUseProgram() is called each time a uniform changes. Ideally, it would be
called only once before all of a program’s uniforms change.
Of course, it is possible to keep track of the currently bound program in addi-
tion to the current value of each uniform. The problem is that this becomes diffi-
cult with multiple contexts and multiple threads. The currently bound program is
context-level global state, so each uniform instance in our abstraction needs to be
aware of the current thread’s current context. Also, tracking the currently bound
program in this fashion is error prone and susceptible to thrashing when different
uniforms are set for different programs in an interleaved manner.
class ICleanable
{
public:
virtual ~ICleanable() {}
virtual void Clean() = 0;
};
class ICleanableObserver
{
public:
virtual ~ICleanableObserver() {}
virtual void NotifyDirty(ICleanable *value) = 0;
};
[Figure: Uniform implements ICleanable and holds a weak reference to one ICleanableObserver; ShaderProgram implements ICleanableObserver and has zero to many Uniforms (static) and zero to many dirty ICleanables (dynamic).]
virtual ~ShaderProgram()
{
// Delete shader objects, program, and m_uniforms.
}
void Clean()
{
std::for_each(m_dirtyUniforms.begin(),
m_dirtyUniforms.end(), std::mem_fun(&ICleanable::Clean));
m_dirtyUniforms.clear();
}
// ICleanableObserver implementation.
virtual void NotifyDirty(ICleanable *value)
{
m_dirtyUniforms.push_back(value);
}
private:
GLuint m_handle;
std::vector<ICleanable *> m_dirtyUniforms;
std::map<std::string, Uniform *> m_uniforms;
};
The other half of the implementation is the code for Uniform, which is
shown in Listing 5.7. A uniform needs to know its OpenGL location
(m_location), its current value (m_value), if it is dirty (m_dirty), and the pro-
gram that is observing it (m_observer). When a program creates a uniform, the
program passes the uniform’s location and a pointer to itself to the uniform’s
constructor. The constructor initializes the uniform’s value to zero and then noti-
fies the shader program that it is dirty. This has the effect of initializing all uni-
forms to zero. Alternatively, the uniform’s value can be queried with
glGetUniform(), but this has been found to be problematic on various drivers.
The bulk of the work for this class is done in SetValue() and Clean().
When the user provides a clean uniform with a new value, the uniform marks
itself as dirty and notifies the program that it is now dirty. If the uniform is al-
ready dirty or the user-provided value is no different than the current value, the
program is not notified, avoiding adding duplicate uniforms to the dirty list. The
Clean() function synchronizes the uniform’s value with OpenGL by calling
glUniform1f() and then marking itself clean.
void SetValue(GLfloat value)
{
    // Notify the observer only when a clean uniform receives a new value;
    // a uniform that is already dirty is already in the program's dirty list.
    if (!m_dirty && (m_currentValue != value))
    {
        m_dirty = true;
        m_observer->NotifyDirty(this);
    }

    m_currentValue = value;
}

// ICleanable implementation.
virtual void Clean()
{
    glUniform1f(m_location, m_currentValue);
    m_dirty = false;
}

private:

GLint m_location;
GLfloat m_currentValue;
bool m_dirty;
ICleanableObserver *m_observer;
};
The final piece of the puzzle is implementing a draw call that cleans a shader
program. This is as simple as requiring the user to pass a ShaderProgram in-
stance to every draw call in your OpenGL abstraction (you’re not exposing a
separate method to bind a program, right?), then calling glUseProgram(), fol-
lowed by the program’s Clean() method, and finally calling the OpenGL draw
function. If the draw calls are part of a class that represents an OpenGL context,
it is also straightforward to factor out redundant glUseProgram() calls.
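A minimal sketch of such a draw call is shown below; the Context class, the Handle() accessor on ShaderProgram, and the redundant-bind check are illustrative assumptions rather than the chapter's exact code.

class Context
{
public:
    Context() : m_boundProgram(0) {}

    void Draw(ShaderProgram *program /*, vertex array, primitive type, ... */)
    {
        // Bind only when the program actually changes.
        if (program->Handle() != m_boundProgram)
        {
            glUseProgram(program->Handle());
            m_boundProgram = program->Handle();
        }

        // Flush any dirty uniforms before drawing.
        program->Clean();

        // Finally issue the actual OpenGL draw call, e.g., glDrawElements().
    }

private:
    GLuint m_boundProgram;
};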
An added benefit of this technique is that setting uniform values does not require a current OpenGL context. For games with multiple contexts, this can min-
imize bugs caused by incorrect management of the current context. Likewise, if
you are developing an engine that needs to play nice with other libraries using
their own OpenGL context, Uniform::SetValue() has no context side effects
and can be called anytime, not just when your context is current.
Our technique can also be extended to minimize managed to native code
round-trip overhead when using OpenGL with languages like Java or C#. Instead
of making fine-grained glUniform1f() calls for each dirty uniform, the list of
dirty uniforms can be passed to native C++ code in a single coarse-grained call.
On the C++ side, glUniform1f() is called for each uniform, thus eliminating the
per-uniform round trip. This can be taken a step further by making all the re-
quired OpenGL calls for a draw in a single round trip.
Without direct state access (DSA) [Kilgard 2009], setting a single uniform requires binding the program first:

glUseProgram(m_handle);
glUniform1f(m_location, value);

With the EXT_direct_state_access extension, the same update can be written as a single call to glProgramUniform1fEXT(m_handle, m_location, value).
As of this writing, DSA is not a core feature of OpenGL 3.3, and as such, is not
available on all platforms, although glProgramUniform*() calls are mirrored in
the separate shader objects extension [Kilgard et al. 2010] which has become
core functionality in OpenGL 4.1.
Delaying selector-based OpenGL calls until draw time has a lot of benefits,
although there are some OpenGL calls that you do not want to delay. It is im-
portant to allow the CPU and GPU to work together in parallel. As such, you
would not want to delay updating a large vertex buffer or texture until draw time
because this could cause the GPU to wait, assuming it is not rendering one or
more frames behind the CPU.
Finally, I’ve had great success using this technique in both commercial and
open source software. I’ve found it quick to implement and easy to debug. An
excellent next step for you is to generalize the code in this chapter to support all
uniform types (vec2, vec3, etc.), uniform buffers, and other areas of OpenGL
with selectors. Also consider applying this technique to higher-level engine com-
ponents, such as when the bounding volume of a model in a spatial data structure
changes.
Acknowledgements
Thanks to Kevin Ring and Sylvain Dupont from Analytical Graphics, Inc., and Chris-
tophe Riccio from Imagination Technologies for reviewing this chapter.
References
[Gamma et al. 1995] Erich Gamma, Richard Helm, Ralph Johnson, and John M. Vlissi-
des. Design Patterns. Reading, MA: Addison-Wesley, 1995.
[Kilgard 2009] Mark Kilgard. EXT_direct_state_access OpenGL extension, 2009.
Available at http://www.opengl.org/registry/specs/EXT/direct_state_access.txt.
[Kilgard et al. 2010] Mark Kilgard, Greg Roth, and Pat Brown. ARB_separate_shader_
objects OpenGL extension, 2010. Available at http://www.opengl.org/registry/
specs/ARB/separate_shader_objects.txt.
6
A Framework for GLSL Engine Uniforms
Patrick Cozzi
Analytical Graphics, Inc.
6.1 Introduction
The OpenGL 3.x and 4.x core profiles present a clean, shader-centric API. Many
veteran developers are pleased to say goodbye to the fixed-function pipeline and
the related API entry points. The core profile also says goodbye to the vast ma-
jority of GLSL built-in uniforms, such as gl_ModelViewMatrix and gl_Proj-
ectionMatrix. This chapter addresses the obvious question: what do we use in
place of GLSL built-in uniforms?
6.2 Motivation
Our goal is to design a framework for commonly used uniforms that is as easy to
use as GLSL built-in uniforms but does not have their drawback: global state.
GLSL built-in uniforms were easy to use because a shader could just include a
built-in uniform, such as gl_ModelViewMatrix, and it would automatically pick
up the current model-view matrix, which may have been previously set with calls
like the following:
glMatrixMode(GL_MODELVIEW);
glLoadMatrixf(modelViewMatrix);
A shader could even use built-in uniforms derived from multiple GL states, such
as gl_ModelViewProjectionMatrix, which is computed from the current mod-
el-view and projection matrices (see Listing 6.1(a)). Using built-in uniforms
makes it easy to use both fixed-function and shader-based rendering code in the
same application. The drawback is that the global state is error prone and hard to
manage.
#version 120
void main()
{
gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
}

Listing 6.1(a). Pass-through vertex shader using GLSL built-in uniforms.
#version 330
in vec4 position;
uniform mat4 u_ModelViewProjectionMatrix;
void main()
{
gl_Position = u_ModelViewProjectionMatrix * position;
}
Listing 6.1(b). Pass-through vertex shader using our engine uniforms framework.
class State
{
public:
Listing 6.2. Minimal interface for state used to automatically set engine uniforms.
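The listing above shows only the shell of the class; a minimal sketch of the accessors such a state object might expose (assuming a Matrix44 math type and the GetModelView() accessor used later in this chapter) is:

class State
{
public:
    const Matrix44& GetModel() const          { return m_model; }
    const Matrix44& GetView() const           { return m_view; }
    const Matrix44& GetProjection() const     { return m_projection; }

    // Derived state; could be cached and recomputed only when a dependent changes.
    Matrix44 GetModelView() const             { return m_view * m_model; }
    Matrix44 GetModelViewProjection() const   { return m_projection * m_view * m_model; }

    void SetModel(const Matrix44& m)          { m_model = m; }
    void SetView(const Matrix44& m)           { m_view = m; }
    void SetProjection(const Matrix44& m)     { m_projection = m; }

private:
    Matrix44 m_model, m_view, m_projection;
};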
We also assume a matrix class that provides the usual transformations, operator overloads, and a method called Pointer() that gives us di-
rect access to the matrix’s elements for making OpenGL calls.
By encapsulating the state required for engine uniforms in a class, different
instances of the class can be passed to different draw methods. This is less error
prone than setting some subset of global states between draw calls, which was
required with GLSL built-in uniforms. An engine can even take this one step fur-
ther and define separate scene state and object state classes. For example, the
view and projection matrices from our State class may be part of the scene state,
and the model matrix would be part of the object state. With this separation, an
engine can then pass the scene state to all objects, which then issue draw com-
mands using the scene state and their own state. For conciseness, we only use one
state class in this chapter.
6.3 Implementation
To implement this framework, we build on the shader program and uniform ab-
stractions discussed in the previous chapter. In particular, we require a Uniform
class that encapsulates a 4 × 4 matrix uniform (mat4). We also need a Shader-
Program class that represents an OpenGL shader program and contains the pro-
gram’s uniforms in a std::map, like the following:
std::map<std::string, Uniform *> m_uniformMap;
class IEngineUniform
{
public:
virtual ~IEngineUniform() {}
virtual void Set(const State& state) = 0;
};
class IEngineUniformFactory
{
public:
virtual ~IEngineUniformFactory() {}
virtual IEngineUniform *Create(Uniform *uniform) = 0;
};
Listing 6.3. Interfaces used for setting and creating engine uniforms.
Our framework uses a list of engine uniforms’ names to determine what en-
gine uniforms a program uses. A program then keeps a separate list of engine
uniforms, which are set using a State object before every draw call. We start our
implementation by defining two new interfaces: IEngineUniform and IEngine-
UniformFactory, shown in Listing 6.3.
Each engine uniform is required to implement both classes: IEngineUniform
is used to set the uniform before a draw call, and IEngineUniformFactory is
used to create a new instance of the engine uniform when the shader program
identifies an engine uniform.
Implementing these two interfaces is usually very easy. Let’s consider im-
plementing them for an engine uniform named u_ModelViewMatrix, which is
the model-view matrix similar to the built-in uniform gl_ModelViewMatrix. The
implementation for IEngineUniform simply keeps the actual uniform as a mem-
ber and sets its value using State::GetModelView(), as shown in Listing 6.4.
The implementation for IEngineUniformFactory simply creates a new Mod-
elViewMatrixUniform object, as shown in Listing 6.5.
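A minimal sketch of the two classes just described could look like the following; it assumes the Uniform type from the previous chapter exposes a SetValue() overload that accepts a mat4 value.

class ModelViewMatrixUniform : public IEngineUniform
{
public:
    explicit ModelViewMatrixUniform(Uniform *uniform) : m_uniform(uniform) {}

    // IEngineUniform implementation.
    virtual void Set(const State& state)
    {
        m_uniform->SetValue(state.GetModelView());
    }

private:
    Uniform *m_uniform;
};

class ModelViewMatrixFactory : public IEngineUniformFactory
{
public:
    // IEngineUniformFactory implementation.
    virtual IEngineUniform *Create(Uniform *uniform)
    {
        return new ModelViewMatrixUniform(uniform);
    }
};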
The relationship among these interfaces and classes is shown in Figure 6.1.
Implementing additional engine uniforms, such as a model-view-projection ma-
trix, is usually just as straightforward as this case. In isolation, classes for engine
uniforms are not all that exciting. They need to be integrated into our
ShaderProgram class to become useful. In order to identify engine uniforms, a
program needs access to a map from uniform name to engine uniform factory.
The program can use this map to see which of its uniforms are engine uniforms.
Each program could have a (potentially different) copy of this map, which
would allow different programs access to different engine uniforms. In practice, I
have not found this flexibility useful. Instead, storing the map as a private static
member and providing a public static InitializeEngineUniforms() method,
as in Listing 6.6, is generally sufficient. The initialization method should be
called once during application startup so each program has thread-safe, read-only access to the map afterward. It is also a best practice to free the memory for the factories on application shutdown, which can be done by calling DestroyEngineUniforms(), also shown in Listing 6.6.

Figure 6.1. The relationship among engine uniform interfaces and the model-view matrix implementation: ModelViewMatrixUniform implements IEngineUniform, ModelViewMatrixFactory implements IEngineUniformFactory, and the factory creates instances of the uniform.
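A minimal sketch of the static factory map and its setup/teardown methods is shown below; everything beyond the names mentioned above is an illustrative assumption.

class ShaderProgram : public ICleanableObserver
{
public:
    static void InitializeEngineUniforms()
    {
        // Register a factory for each engine uniform name.
        m_engineUniformFactories["u_ModelViewMatrix"] = new ModelViewMatrixFactory();
        // ... additional engine uniforms are registered here.
    }

    static void DestroyEngineUniforms()
    {
        std::map<std::string, IEngineUniformFactory *>::iterator i;

        for (i = m_engineUniformFactories.begin();
             i != m_engineUniformFactories.end(); ++i)
        {
            delete i->second;
        }

        m_engineUniformFactories.clear();
    }

    // ...

private:
    static std::map<std::string, IEngineUniformFactory *> m_engineUniformFactories;
};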
There are only two steps left to our implementation: identify and store a pro-
gram’s engine uniforms and set the engine uniforms before a draw call. As stated
earlier, each program keeps a map from a uniform’s name to its uniform imple-
mentation, m_uniformMap. This map is populated after the program is linked. To
implement engine attributes, a program also needs to keep a list of engine uni-
forms:
std::vector<IEngineUniform *> m_engineUniformList;
The relationship among a program, its various pointers to uniforms, and the
uniform factories is shown in Figure 6.2. Using the uniform map and factory map, the list of engine uniforms can be populated by creating an engine uniform for each uniform that has a factory for the uniform's name, as shown in Listing 6.7. In our implementation, engine uniforms are not reference counted, so the program's destructor should delete each engine uniform.

Figure 6.2. The relationship among a program, its uniforms, and the engine uniform factories: all programs use the same map of factories, and a program has a Uniform object (an ICleanable) for each active uniform.

for (i = m_uniformMap.begin(); i != m_uniformMap.end(); ++i)
{
    j = m_engineUniformFactories.find(i->first);

    if (j != m_engineUniformFactories.end())
    {
        m_engineUniformList.push_back(j->second->Create(i->second));
    }
}

Listing 6.7. Creating an engine uniform for each uniform that has a factory registered under its name.
Now that a program has a list of its engine uniforms, it is easy to set them
before a draw call. Assuming we are using the delayed technique for setting uni-
forms introduced in the previous chapter, a shader program has a Clean() meth-
od that is called before each OpenGL draw call to set each dirty uniform. To set
engine uniforms, this method is modified to take a State object (which presum-
ably is also passed to the draw method that calls this) and then call the Set()
method for each automatic uniform, as shown in Listing 6.8.
void Clean(const State& state)
{
    // Set each engine uniform from the current state.
    for (i = m_engineUniformList.begin();
         i != m_engineUniformList.end(); ++i)
    {
        (*i)->Set(state);
    }

    // Then synchronize any remaining dirty uniforms with OpenGL.
    std::for_each(m_dirtyUniforms.begin(), m_dirtyUniforms.end(),
        std::mem_fun(&ICleanable::Clean));
    m_dirtyUniforms.clear();
}

Listing 6.8. The Clean() method extended to set engine uniforms before cleaning dirty uniforms.
Listing 6.9. C++ templates can reduce the amount of handwritten factories.
A factory template can generate these classes for you. For example, the ModelViewMatrixFactory class would be replaced with the EngineUniformFactory<ModelViewMatrixUniform> class using the template shown in Listing 6.9.
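A sketch of such a template, consistent with the IEngineUniformFactory interface in Listing 6.3, could be:

template <typename T>
class EngineUniformFactory : public IEngineUniformFactory
{
public:
    virtual IEngineUniform *Create(Uniform *uniform)
    {
        return new T(uniform);
    }
};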
If you are up for parsing GLSL code, you could also eliminate the need for
shader authors to declare engine uniforms by carefully searching the shader’s
source for them. This task is nontrivial considering preprocessor transformations,
multiline comments, strings, and compiler optimizations that could eliminate uni-
forms altogether.
Finally, a careful implementation of the State class can improve perfor-
mance. Specifically, derived state, like the model-view-projection matrix, can be
cached and only recomputed if one of its dependents change.
Acknowledgements
Thanks to Kevin Ring and Sylvain Dupont from Analytical Graphics, Inc., and Chris-
tophe Riccio from Imagination Technologies for reviewing this chapter.
2. See http://www.openscenegraph.org/projects/osg.
3. See http://developer.amd.com/gpu/rendermonkey/Pages/default.aspx.
7
A Spatial and Temporal Coherence
Framework for Real‐Time Graphics
Michał Drobot
Reality Pump Game Development Studios
Figure 7.1. (a) A conventional four-tap SSAO pass. (b) A four-tap SSAO pass using the spatiotemporal framework.
7.1 Introduction
The most recent generation of game consoles has brought some dramatic improvements in graphics rendering quality. Several new techniques were introduced, like deferred lighting, penumbral soft shadows, and screen-space ambient occlusion.
Bilateral Upsampling
We can assume that many complex shader operations are low-frequency in na-
ture. Visual data like ambient occlusion, global illumination, and soft shadows
tend to be slowly varying and, therefore, well behaved under upsampling opera-
tions. Normally, we use bilinear upsampling, which averages the four nearest
samples to a point being shaded. Samples are weighted by a spatial distance func-
tion. This type of filtering is implemented in hardware, is extremely efficient, and
yields good quality. However, a bilinear filter does not respect depth discontinui-
ties, and this creates leaks near geometry edges. Those artifacts tend to be dis-
turbing due to the high-frequency changes near object silhouettes. The solution is
to steer the weights by a function of geometric similarity obtained from a high-
resolution geometry buffer and coarse samples [Kopf et al. 2007]. During inter-
polation, we would like to choose certain samples that have a similar surface ori-
entation and/or a small difference in depth, effectively preserving geometry
edges. To summarize, we weight each coarse sample by bilinear, normal-
similarity, and depth-similarity weights.
Sometimes, we can simplify bilateral upsampling to account for only depth discontinuities when normal data for coarse samples is not available. This solution is less accurate, but it gives plausible results in most situations when dealing with low-frequency data.
Listing 7.1 shows pseudocode for bilateral upsampling. Bilateral weights are
precomputed for a 2 × 2 coarse-resolution tile, as shown in Figure 7.2. Depending
on the position of the pixel being shaded (shown in red in Figure 7.2), the correct
weights for coarse-resolution samples are chosen from the table.
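A compact GLSL sketch of depth-aware bilateral upsampling is shown below. The uniform names, the depth-similarity falloff, and the use of a single fixed set of bilinear weights (rather than selecting weights per pixel position within the tile) are illustrative simplifications.

uniform sampler2D coarseTexture;     // low-resolution result (e.g., SSAO)
uniform sampler2D coarseDepth;       // low-resolution depth
uniform sampler2D fullDepth;         // full-resolution depth
uniform vec2 coarseTexelSize;        // size of one coarse texel in UV space
uniform float depthFalloff;          // steers the depth-similarity weight

void main()
{
    vec2 uv = gl_TexCoord[0].xy;
    float centerDepth = texture2D(fullDepth, uv).r;

    // Offsets and bilinear weights for one position within the 2x2 coarse tile.
    vec2 offsets[4];
    offsets[0] = vec2(0.0, 0.0);
    offsets[1] = vec2(coarseTexelSize.x, 0.0);
    offsets[2] = vec2(0.0, coarseTexelSize.y);
    offsets[3] = coarseTexelSize;

    float bilinear[4];
    bilinear[0] = 9.0 / 16.0; bilinear[1] = 3.0 / 16.0;
    bilinear[2] = 3.0 / 16.0; bilinear[3] = 1.0 / 16.0;

    float upsampledResult = 0.0;
    float totalWeight = 0.0;

    for (int i = 0; i < 4; ++i)
    {
        float sampleDepth = texture2D(coarseDepth, uv + offsets[i]).r;

        // Depth-similarity weight suppresses samples across discontinuities.
        float depthWeight = 1.0 / (abs(centerDepth - sampleDepth) * depthFalloff + 1.0);
        float weight = bilinear[i] * depthWeight;

        upsampledResult += texture2D(coarseTexture, uv + offsets[i]).r * weight;
        totalWeight += weight;
    }

    upsampledResult /= totalWeight;
    gl_FragColor = vec4(vec3(upsampledResult), 1.0);
}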
Reprojection Caching
Another optimization concept is to reuse data over time [Nehab et al. 2007]. Dur-
ing each frame, we would like to sample previous frames for additional data, and
thus, we need a history buffer, or cache, that stores the data from previous
frames. With each new pixel being shaded in the current frame, we check wheth-
er additional data is available in the history buffer and how relevant it is. Then,
Figure 7.2. Precomputed bilinear weights for the four coarse-resolution samples (0–3) surrounding a high-resolution pixel, for each of the four pixel positions within a 2 × 2 coarse-resolution tile:

Position   Sample 0   Sample 1   Sample 2   Sample 3
0          9/16       3/16       3/16       1/16
1          3/16       9/16       1/16       3/16
2          3/16       1/16       9/16       3/16
3          1/16       3/16       3/16       9/16
we decide if we want to use that data, reject it, or perform some additional com-
putation. Figure 7.3 presents a schematic overview of the method.
For each pixel being shaded in the current frame, we need to find a corre-
sponding pixel in the history buffer. In order to find correct cache coordinates,
we need to obtain the pixel displacement between frames. We know that pixel
movement is a result of object and camera transformations, so the solution is to
reproject the current pixel coordinates using a motion vector. Coordinates must
be treated with special care. Results must be accurate, or we will have flawed
history values due to repeated invalid cache resampling. Therefore, we perform
computation in full precision, and we consider any possible offsets involved
when working with render targets, such as the half-pixel offset in DirectX 9.
Coordinates of static objects can change only due to camera movement, and
the calculation of pixel motion vectors is therefore straightforward. We find the pixel position in camera space and project it back to screen space using the previous frame's projection matrix. This calculation is fast and can be done in one fullscreen pass, but it does not handle moving geometry.

Figure 7.3. Overview of reprojection caching: for each pixel we calculate the cache coordinates, look them up in the history buffer, and check the result; a hit is reused or accumulated and the cache updated, while a miss is rejected or recomputed.
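A minimal GLSL sketch of this camera-based reprojection in a fullscreen pass could look like the following; the matrix and texture names are illustrative assumptions, and the reconstruction goes through world space for brevity.

uniform sampler2D depthTexture;      // current frame depth
uniform mat4 invViewProjection;      // current frame, clip space to world space
uniform mat4 prevViewProjection;     // previous frame, world space to clip space

// Returns the history-buffer coordinates for the pixel at 'uv'.
vec2 ReprojectToHistory(vec2 uv)
{
    // Reconstruct the world-space position of the current pixel.
    float depth = texture2D(depthTexture, uv).r;
    vec4 clipPos = vec4(uv * 2.0 - 1.0, depth * 2.0 - 1.0, 1.0);
    vec4 worldPos = invViewProjection * clipPos;
    worldPos /= worldPos.w;

    // Project it with the previous frame's matrices.
    vec4 prevClip = prevViewProjection * worldPos;
    vec2 prevNDC = prevClip.xy / prevClip.w;

    return prevNDC * 0.5 + 0.5;      // back to [0, 1] texture coordinates
}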
Dynamic objects require calculation of per-pixel velocities by comparing the
current camera-space position to the last one on a per-object basis, taking skin-
ning and transformations into consideration. In engines calculating a frame-to-
frame motion field (i.e., for per-object motion blur [Lengyel 2010]), we can reuse
the data if it is pixel-correct. However, when that information is not available, or
the performance cost of an additional geometry pass for velocity calculation is
too high, we can resort to camera reprojection. This, of course, leads to false
cache coordinates for an object in motion, but depending on the application, that
might be acceptable. Moreover, there are several situations where pixels may not
have a history. Therefore, we need a mechanism to check for those situations.
Cache misses occur when there is no history for the pixel under considera-
tion. First, the obvious reason for this is the case when we are sampling outside
the history buffer. That often occurs near screen edges due to camera movement
or objects moving into the view frustum. We simply check whether the cache
coordinates are out of bounds and count it as a miss.
The second reason for a cache miss is the case when a point p visible at time
t was occluded at time t − 1 by another point q. We can assume that it is impossible for the depths of q and p to match at time t − 1. We know the expected depth of p at time t − 1 through reprojection, and we compare that with the cached depth
at q. If the difference is bigger than an epsilon, then we consider it a cache miss.
The depth at q is reconstructed using bilinear filtering, when possible, to account
for possible false hits at depth discontinuities, as illustrated in Figure 7.4.
Figure 7.4. Possible cache-miss situation. The red area lacks history data due to occlusion in the previous frame. Simple depth comparison between projected p and q from t − 1 is sufficient to confirm a miss.
If there is no cache miss, then we sample the history buffer. In general, pixels
do not map to individual cache samples, so some form of resampling is needed.
Since the history buffer is coherent, we can generally treat it as a normal texture
buffer and leverage hardware support for linear and point filtering, where the
proper method should be chosen for the type of data being cached. Low-
frequency data can be sampled using the nearest-neighbor method without signif-
icant loss of detail. On the other hand, using point filtering may lead to disconti-
nuities in the cache samples, as shown in Figure 7.5. Linear filtering correctly
handles these artifacts, but repeated bilinear resampling over time leads to data
smoothing. With each iteration, the pixel neighborhood influencing the result
grows, and high-frequency details may be lost. Last but not least, a solution that
guarantees high quality is based on a higher-resolution history buffer and nearest-
neighbor sampling at a subpixel level. Nevertheless, we cannot use it on consoles
because of the additional memory requirements.
Motion, change in surface parameters, and repeated resampling inevitably
degrade the quality of the cached data, so we need a way to refresh it. We would
like to efficiently minimize the shading error by setting the refresh rate propor-
tional to the time difference between frames, and the update scheme should be
dependent on cached data.
If our data requires explicit refreshes, then we have to find a way to guaran-
tee a full cache update every n frames. That can easily be done by updating one
of n parts of the buffer every frame in sequence. A simple tile-based approach or
jittering could be used, but without additional processing like bilateral upsam-
pling or filtering, pattern artifacts may occur.
The reprojection cache seems to be more effective with accumulation func-
tions. In computer graphics, many results are based on stochastic processes that
combine randomly distributed function samples, and the quality is often based on
the number of computed samples. With reprojection caching, we can easily
amortize that complex process over time, gaining in performance and accuracy.
Figure 7.5. Resampling artifacts arising in point filtering (frames t = 0, t = −1, t = −2). A discontinuity occurs when the correct motion flow (yellow star) does not match the approximating nearest-neighbor pixel (red star).
This method is best suited for rendering stages based on multisampling and low-
frequency or slowly varying data.
When processing frames, we accumulate valid samples over time. This leads
to an exponential history buffer that contains data that has been integrated over
several frames back in time. With each new set of accumulated values, the buffer
is automatically updated, and we control the convergence through a dynamic fac-
tor and exponential function. Variance is related to the number of frames con-
tributing to the current result. Controlling that number lets us decide whether
response time or quality is preferred.
We update the exponential history buffer using the equation
$h_t = h_{t-1}\,w + r_t\,(1 - w)$,

where $h_t$ is the value stored in the history buffer at frame t, $r_t$ is the result computed for the current frame, and w is the convergence weight.

Figure 7.6. We prepare the expected sample distribution set. The sample set is divided into N subsets (four in this case), one for each consecutive frame. With each new frame, missing subsets are sampled from previous frames by exponential history buffer look-up. Older samples lose weight with each iteration, so the sampling pattern must repeat after N frames to guarantee continuity.

We can adjust the convergence
weight according to changes in the scene, and this could be done based on the
per-pixel velocity. This would provide a valid solution for temporal ghosting arti-
facts and high-quality convergence for near-static elements.
Special care must be taken when new samples are calculated. We do not
want to have stale data in the cache, so each new frame must bring additional
information to the buffer. When rendering with multisampling, we want to use a
different set of samples with each new frame. However, sample sets should re-
peat after N frames, where N is the number of frames being cached (see Figure
7.6). This improves temporal stability. With all these improvements, we obtain a
highly efficient reprojection caching scheme. Listing 7.2 presents a simplified
solution used during our SSAO step.
Bilateral Filtering
Bilateral filtering is conceptually similar to bilateral upsampling. We perform
Gaussian filtering with weights influenced by a geometric similarity function
[Tomasi and Manduchi 1998]. We can treat it as an edge-aware smoothing filter
or a high-order reconstruction filter utilizing spatial coherence. Bilateral filtering
proves to be extremely efficient for content-aware data smoothing. Moreover,
with only insignificant artifacts, a bilateral filter can be separated into two direc-
tions, leading to O(n) running time. We use it for any kind of slowly-varying
data, such as ambient occlusion or shadows, that needs to be aware of scene ge-
ometry. Moreover, we use it to compensate for undersampled pixels. When a
pixel lacks samples, lacks history data, or has missed the cache, it is reconstruct-
ed from spatial coherency data. That solution leads to more plausible results
compared to relying on temporal data only. Listing 7.3 shows a separable, depth-
aware bilateral filter that uses hardware linear filtering.
float2 color;
float4 pSamples, nSamples;
float4 diffIp, diffIn;
float4 pTaps[2];
color.r *= centerWeight;
color.r = Wp * (dot(diffIp, float4(pTaps[0].x, pTaps[0].z,
pTaps[1].x, pTaps[1].z)) + color.r);
return (color.r);
}
Spatiotemporal Coherency
We would like to combine the described techniques to take advantage of the spa-
tiotemporal coherency in the data. Our default framework works in several steps:
6. The CB is bilaterally filtered with a higher smoothing rate for pixels with a
lower convergence rate to compensate for smaller numbers of samples or
cache misses.
7. The CB is bilaterally upsampled to the original resolution for further use.
8. The CB is swapped with the HB.
The buffer format and processing steps differ among specific applications.
7.3 Applications
Our engine is composed of several complex pixel-processing stages that include
screen-space ambient occlusion, screen-space soft shadows, subsurface scattering
for skin shading, volumetric effects, and a post-processing pipeline with depth of
field and motion blur. We use the spatiotemporal framework to accelerate most
of those stages in order to get the engine running at production-quality speeds on
current-generation consoles.
Screen-Space Ambient Occlusion

The ambient occlusion at a point p is defined as

$AO = \frac{1}{\pi} \int_{H} V_{p,\omega}\,(N \cdot \omega)\, d\omega$,

where N is the surface normal and $V_{p,\omega}$ is the visibility function at p (such that $V_{p,\omega} = 0$ when occluded in the direction ω, and $V_{p,\omega} = 1$ otherwise). It can be effi-
ciently computed in screen space by multiple occlusion checks that sample the
depth buffer around the point being shaded. However, it is extremely taxing on
the GPU due to the high sample count and large kernels that trash the texture
cache. On current-generation consoles, it seems impractical to use more than
eight samples. In our case, we could not even afford that many because, at the
time, we had only two milliseconds left in our frame time budget.
After applying the spatiotemporal framework, we could get away with only
four samples per frame, and we achieved even higher quality than before due to
amortization over time. We computed the SSAO at half resolution and used bi-
lateral upsampling during the final frame combination pass. For each frame, we
changed the SSAO kernel sampling pattern, and we took care to generate a uni-
formly distributed pattern in order to minimize frame-to-frame inconsistencies.
Due to memory constraints on consoles, we decided to rely only on depth information, leaving the surface normal vectors available only for SSAO computation. Furthermore, since we used only camera-based motion blur, we lacked per-pixel motion vectors, so an additional pass for motion field computation was out of the question. During caching, we resorted to camera reprojection only. Our cache-miss detection algorithm compensated for that by calculating a running convergence based on the distance between a history sample and the predicted valid position. That policy tended to give good results, especially considering the additional processing steps involved. After reprojection, ambient occlusion data was bilaterally filtered, taking convergence into consideration when available (PC only). Pixels with high temporal confidence retained high-frequency details, while others were reconstructed spatially depending on the convergence factor. It is worth noticing that we were switching history buffers after bilateral filtering. Therefore, we were filtering over time, which enables us to use small kernels without significant quality loss. The complete solution required only one millisecond of GPU time and enabled us to use SSAO in real time on the Xbox 360. Figure 7.7 shows final results compared to the default algorithm.
Soft Shadows

Our shadowing solution works in a deferred manner. We use the spatiotemporal framework for sun shadows only since those are computationally expensive and visible all the time. First, we draw sun shadows to an offscreen low-resolution buffer. While shadow testing against a cascaded shadow map, we use a custom percentage closer filter. For each frame, we use a different sample from a well-distributed sample set in order to leverage temporal coherence [Scherzer et al. 2007]. Reprojection caching accumulates the samples over time in a manner similar to our SSAO solution. Then the shadow buffer is bilaterally filtered in screen space and bilaterally upsampled for the final composition pass. Figures 7.8 and 7.9 show our final results for the Xbox 360 implementation.
Figure 7.9. The spatiotemporal framework efficiently handles shadow acne and other flickering artifacts (right) that appear in the original scene (left).
Shadows and Ambient Occlusion Combined

Since our shadowing pipeline is similar to the one used during screen-space ambient occlusion, we integrate both into one pass in our most efficient implementation running on consoles. Our history buffer is half the resolution of the back buffer, and it is stored in RG16F format. The green channel stores the minimum depth of the four underlying pixels in the Z-buffer. The red channel contains shadowing and ambient occlusion information. The fractional part of the 16-bit floating-point value is used for occlusion because it requires more variety, and the integer part holds the shadowing factor. Functions for packing and unpacking these values are shown in Listing 7.4.

Every step of the spatiotemporal framework runs in parallel on both the shadowing and ambient occlusion values using the packing and unpacking functions. The last step of the framework is bilateral upsampling combined with the main deferred shading pass. Figure 7.10 shows an overview of the pipeline. The performance gained by using our technique on the Xbox 360 is shown in Table 7.1.
Listing 7.4. Code for shadow and occlusion data packing and unpacking.
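The shader code of the original listing is not reproduced here. A minimal sketch of the packing idea, written in plain C++ for clarity and assuming the shadow factor is quantized to a small number of integer steps (the exact scheme in the original listing may differ), could look like the following.

#include <algorithm>
#include <cmath>

// Hypothetical helpers: the integer part of the packed value carries a quantized
// shadow factor and the fractional part carries ambient occlusion, matching the
// scheme described in the text. kShadowSteps is an assumed quantization level.
const float kShadowSteps = 7.0f;

float PackShadowAO(float shadow, float ao)
{
    float shadowInt = std::floor(shadow * kShadowSteps + 0.5f);   // 0 .. kShadowSteps
    float aoFrac = std::min(std::max(ao, 0.0f), 0.999f);          // keep strictly below 1
    return shadowInt + aoFrac;
}

void UnpackShadowAO(float packed, float &shadow, float &ao)
{
    float shadowInt = std::floor(packed);
    shadow = shadowInt / kShadowSteps;
    ao = packed - shadowInt;
}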
Figure 7.10. Schematic diagram of our spatiotemporal framework used with SSAO and
shadows.
Table 7.1. Performance comparison of various stages and a reference solution in which
shadowing is performed in full resolution with 2 × 2 jittered PCF, and SSAO uses 12 taps
and upsampling. The spatiotemporal (ST) framework is 2.5 times faster than the refer-
ence solution and still yields better image quality.
Postprocessing
Several postprocessing effects, such as depth of field and motion blur, tend to
have high spatial and temporal coherency. Both can be expressed as a multisam-
pling problem in time and space and are, therefore, perfectly suited for our
framework. Moreover, the mixed frequency nature of both effects tends to hide
any possible artifacts. During our tests, we were able to perform production-
ready postprocessing twice as fast as with a normal non-cached approach.
Additionally, blurring is an excellent candidate for use with the spatiotem-
poral framework. Normally, when dealing with extremely large blur kernels, hi-
erarchical downsampling with filtering must be used in order to reach reasonable
performance with enough stability in high-frequency detail. Using importance
sampling for downsampling and blurring with the spatiotemporal framework, we
are able to perform high-quality Gaussian blur, using radii reaching 128 pixels in
a 720p frame, with no significant performance penalty (less than 0.2 ms on the
Xbox 360). The final quality is shown in Figure 7.11.
First, we sample nine points with linear filtering and importance sampling in
a single downscaling pass to 1/64 of the screen size. Stability is sustained by the
reprojection caching, with different subsets of samples used during each frame.
The resulting image is blurred, cached, and upsampled. Bilateral filtering is used
when needed by the application (e.g., for depth-of-field simulation where geome-
try awareness is required).
Figure 7.11. The bottom image shows the result of applying a large-kernel (128-pixel) Gaussian blur used for volumetric water effects to the scene shown in the top image. This process is efficient and stable on the Xbox 360 using the spatiotemporal coherency framework.
7.4 Future Work

There are several interesting concepts that use the spatiotemporal coherency, and we performed experiments that produced surprisingly good results. However, due to project deadlines, additional memory requirements, and lack of testing, those concepts were not implemented in the final iteration of the engine. We would like to present our findings here and improve upon them in the future.

Antialiasing

The spatiotemporal framework is also easily extended to full-scene antialiasing (FSAA) at a reasonable performance and memory cost [Yang et al. 2009]. With deferred renderers, we normally have to render the G-buffer and perform lighting computation at a higher resolution. In general, FSAA buffers tend to be twice as big as the original frame buffer in both the horizontal and vertical directions. When enough processing power and memory are available, higher-resolution antialiasing schemes are preferred.
used to avoid instability artifacts, which tend to be disturbing when dealing with
high-frequency details. We found it useful with the whole light accumulation
buffer, allowing us to perform lighting four times faster with similar quality.
Still, several instability issues occurred that we would like to solve.
References
[Kopf et al. 2007] Johannes Kopf, Michael F. Cohen, Dani Lischinski, and Matt Uyttendaele. "Joint Bilateral Upsampling." ACM Transactions on Graphics 26:3 (July 2007).
[Lengyel 2010] Eric Lengyel. “Motion Blur and the Velocity-Depth-Gradient Buffer.”
Game Engine Gems 1, edited by Eric Lengyel. Sudbury, MA: Jones and Bartlett,
2010.
[Nehab et al. 2007] Diego Nehab, Pedro V. Sander, Jason Lawrence, Natalya Tatarchuk, and John R. Isidoro. "Accelerating Real-Time Shading with Reverse Reprojection Caching." Graphics Hardware, 2007.
[Scherzer et al. 2007] Daniel Scherzer, Stefan Jeschke, and Michael Wimmer. “Pixel-
Correct Shadow Maps with Temporal Reprojection and Shadow Test Confidence.”
Rendering Techniques, 2007, pp. 45–50.
[Tomasi and Manduchi 1998] Carlo Tomasi, and Roberto Manduchi. “Bilateral Filtering
for Gray and Color Images.” Proceedings of International Conference on Comput-
er Vision, 1998, pp. 839–846.
[Yang et al. 2009] Lei Yang, Diego Nehab, Pedro V. Sander, Pitchaya Sitthi-Amorn,
Jason Lawrence, and Hugues Hoppe. “Amortized Supersampling.” Proceedings of
SIGGRAPH Asia 2009, ACM.
8
Implementing a Fast DDOF Solver
Holger Grün
Advanced Micro Devices, Inc.
This gem presents a fast and memory-efficient solver for the diffusion depth-of-
field (DDOF) technique introduced by Michael Kass et al. [2006] from Pixar.
DDOF is a very high-quality depth-of-field effect that is used in motion pictures.
It essentially works by simulating a heat diffusion equation to compute depth-of-
field blurs. This leads to a situation that requires solving systems of linear
equations.
Kass et al. present an algorithm for solving these systems of equations, but
the memory footprint of their solver can be prohibitively large at high display
resolutions. The solver described in this gem greatly reduces the memory foot-
print and running time of the original DDOF solver by skipping the construction
of intermediate matrices and the execution of associated rendering passes. In con-
trast to the solver presented by Shishkovtsov and Rege [2010], which uses Di-
rectX 11, compute shaders, and a technique called parallel cyclic reduction
[Zhang et al. 2010], this gem utilizes a solver that runs only in pixel shaders and
can be implemented on DirectX 9 and DirectX 10 class hardware as well.
8.1 Introduction
DDOF essentially relies on solving systems of linear equations described by
tridiagonal systems like the following:
$$\begin{pmatrix}
b_1 & c_1 & & & 0 \\
a_2 & b_2 & c_2 & & \\
& a_3 & b_3 & c_3 & \\
& & \ddots & \ddots & \ddots \\
0 & & & a_n & b_n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}. \qquad (8.1)$$
It needs to solve such a system of equations for each row and each column of the
input image. Because there are always only three coefficients per row of the ma-
trix, the system for all rows and columns of an input image can be packed into a
four-channel texture that has the same resolution as the input image.
The x_i represent the pixels of one row or column of the input image, and the y_i represent the pixels of one row or column of the yet unknown image that is diffused by the heat equation. The values a_i, b_i, and c_i are derived from the circle of confusion (CoC) in a neighborhood of pixels in the input image [Riguer et al. 2004]. Kass et al. take a single time step from the initial condition y_i = x_i and arrive at the equation

$$y_i - x_i = \beta_i (y_{i+1} - y_i) - \beta_{i-1} (y_i - y_{i-1}). \qquad (8.2)$$

Here, each β_i represents the heat conductivity of the medium at the position i. Further on, β_0 and β_n are set to zero for an image row of resolution n so that the row is surrounded by insulators. If one now solves for x_i, then the resulting equation reveals a tridiagonal structure:

$$x_i = -\beta_{i-1} y_{i-1} + (\beta_{i-1} + \beta_i + 1) y_i - \beta_i y_{i+1}. \qquad (8.3)$$

It is further shown that β_i is the square of the diameter of the CoC at a certain pixel position i. Setting up a_i, b_i, and c_i now is a trivial task.

As in Kass et al., the algorithm presented in this gem first calculates the diffusion for all rows of the input image and then uses the resulting horizontally diffused image as an input for a vertical diffusion pass.
Kass et al. use a technique called cyclic reduction (CR) to solve for the y_i. Consider the set of equations

$$\begin{aligned}
a_{i-1} y_{i-2} + b_{i-1} y_{i-1} + c_{i-1} y_i &= x_{i-1} \\
a_i y_{i-1} + b_i y_i + c_i y_{i+1} &= x_i \\
a_{i+1} y_i + b_{i+1} y_{i+1} + c_{i+1} y_{i+2} &= x_{i+1}.
\end{aligned} \qquad (8.4)$$
CR relies on the simple fact that a set of three equations containing five unknowns y_{i-2}, y_{i-1}, y_i, y_{i+1}, and y_{i+2}, like those shown in Equation (8.4), can be reduced to one equation by getting rid of y_{i-1} and y_{i+1}. This is done by multiplying the first equation by -a_i / b_{i-1} and multiplying the third equation by -c_i / b_{i+1} to produce the equations

$$\begin{aligned}
-\frac{a_i a_{i-1}}{b_{i-1}}\, y_{i-2} - a_i y_{i-1} - \frac{a_i c_{i-1}}{b_{i-1}}\, y_i &= -\frac{a_i}{b_{i-1}}\, x_{i-1} \\
-\frac{c_i a_{i+1}}{b_{i+1}}\, y_i - c_i y_{i+1} - \frac{c_i c_{i+1}}{b_{i+1}}\, y_{i+2} &= -\frac{c_i}{b_{i+1}}\, x_{i+1}.
\end{aligned} \qquad (8.5)$$
When these are added to the second line of Equation (8.4), we arrive at the equation
since the a_i and the c_i need to be set to zero to implement the necessary insulation. This now allows us to trivially solve for y_i.
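To make the reduction step concrete, the following CPU-side sketch performs one cyclic-reduction level for a single row, directly following Equations (8.4) and (8.5). The names and the std::vector storage are illustrative; the actual solver performs the same arithmetic in pixel shaders on the abc and x textures.

#include <vector>

// One cyclic-reduction level for a single row. The 0-based even indices kept here
// correspond to the odd 1-based indices kept in the text; only the labeling differs.
struct TriSystem
{
    std::vector<float> a, b, c, x;
};

TriSystem ReduceOnce(const TriSystem &in)
{
    const size_t n = in.a.size();
    TriSystem out;
    for (size_t i = 0; i < n; i += 2)   // keep every other unknown
    {
        float a = in.a[i], b = in.b[i], c = in.c[i], x = in.x[i];

        if (i > 0)          // add the row above, scaled by -a_i / b_{i-1}
        {
            float k = in.a[i] / in.b[i - 1];
            b -= k * in.c[i - 1];
            x -= k * in.x[i - 1];
            a = -k * in.a[i - 1];
        }
        if (i + 1 < n)      // add the row below, scaled by -c_i / b_{i+1}
        {
            float k = in.c[i] / in.b[i + 1];
            b -= k * in.a[i + 1];
            x -= k * in.x[i + 1];
            c = -k * in.c[i + 1];
        }

        out.a.push_back(a);
        out.b.push_back(b);
        out.c.push_back(c);
        out.x.push_back(x);
    }
    return out;
}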
For modern GPUs, reducing the system of all image rows to size one does not make use of all the available hardware threads, even for high display resolutions. In order to generate better hardware utilization, the shaders presented in this chapter actually stop at two or three systems of equations and directly solve for the remaining unknown y_i. Each of the two or three results gets computed, and the proper one is chosen based on the output position (SV_POSITION) of the pixel in order to avoid dynamic flow control. The shaders need to support solving for the three remaining unknown values as well because input resolutions that are not a power of two are never reduced to a point where two unknown values are left.

If we reduce the rows of an input image and a texture containing the a_i, b_i, and c_i, then the resulting y_i now represent one column of the resulting horizontally diffused image. The y_i values are written to a texture that is used to create a full-resolution result. In order to achieve this, the y_i values need to be substituted back up the resolution chain to solve for all the other unknown y_i. This means running a number of solving passes that blow the results texture up until a full-resolution image with all y_i values is reached. Each such pass doubles the number of known y_i.

If we assume that a particular value y_i kept during reduction is on an odd index i, then we just have to use the value of y_i that is available one level higher. What needs to be computed are the values of y_i for even indices. Figure 8.2 illustrates this back-substitution process and shows which values get computed and which just get copied.
In fact, the first line from Equation (8.4) can be used again because y_i and y_{i-2} are already available from the last solving pass. We can now just compute y_{i-1} as follows:

$$y_{i-1} = \frac{x_{i-1} - a_{i-1} y_{i-2} - c_{i-1} y_i}{b_{i-1}}. \qquad (8.9)$$
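A matching back-substitution step, again as a self-contained CPU-side sketch rather than the shader implementation, fills in the unknowns that were eliminated at a given level using Equation (8.9). The coefficient arrays passed in are those of the level being expanded, and yKept holds the solution of the reduced system.

#include <vector>

// Back-substitution for one level. The kept unknowns (every other index) are copied
// through; the eliminated unknowns are computed from their immediate neighbors.
std::vector<float> BackSubstituteOnce(const std::vector<float> &a,
                                      const std::vector<float> &b,
                                      const std::vector<float> &c,
                                      const std::vector<float> &x,
                                      const std::vector<float> &yKept)
{
    const size_t n = a.size();
    std::vector<float> y(n, 0.0f);

    for (size_t i = 0; i < n; i += 2)        // copy the kept unknowns
        y[i] = yKept[i / 2];

    for (size_t i = 1; i < n; i += 2)        // compute the eliminated unknowns
    {
        float yPrev = y[i - 1];
        float yNext = (i + 1 < n) ? y[i + 1] : 0.0f;   // c is zero at the boundary
        y[i] = (x[i] - a[i] * yPrev - c[i] * yNext) / b[i];
    }
    return y;
}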
Figure 8.2. The back-substitution process: values kept during reduction are simply copied from the coarser level, while the values in between are computed.
struct ReduceO
{
float4 abc : SV_TARGET0;
float4 x : SV_TARGET1;
};
// Use the minimum CoC as the real CoC as described in Kass et al.
float fRealCoC_4 = min(fCoC_4, fCoC_3);
float fRealCoC_3 = min(fCoC_3, fCoC_2);
3. Perform a final one-to-four solving pass to deal with the initial four-to-one
reduction pass. Again, a very hands-on approach for solving the problem at
hand is used, and it also has three phases. Since an initial four-to-one reduc-
tion shader was used, we don’t have all the data available to perform the
needed one-to-four solving pass. Phase 1 of the shader therefore starts to re-
construct the missing data from the unchanged and full-resolution input data
in the same fashion that was used in Listing 8.1. Phase 2 uses this data to per-
form several one-to-two solving steps to produce the missing y i values of the
intermediate pass that we skip. Phase 3 finally uses all that data to produce
the final result. Listing 8.2 shows a shader model 4 code fragment imple-
menting the corresponding algorithm for that final solver stage. Again, only
the code for the horizontal version of the algorithm is shown.
4. Stop at two or three unknowns instead of reducing it all down to just one un-
known. Given that the number of hardware threads in a modern GPU is in the
thousands, this actually makes sense because it keeps a lot more threads of a
modern GPU busy compared to going down to just one unknown. Cramer’s
rule is used to solve the resulting 2 × 2 or 3 × 3 equation systems.
5. Optionally pack the evolving y i and the a i , bi, and c i into just one four-
channel uint32 texture to further save memory and to gain speed since the
number of texture operations is cut down by a factor of two. This packing us-
es Shader Model 5 instructions (see Listing 8.3) and relies on the assumption
that the x i values can be represented as 16-bit floating-point values. It further
assumes that one doesn’t need the full mantissa of the 32-bit floating-point
values for storing a i , bi, and c i, and it steals the six lowest mantissa bits of
each one to store a 16-bit x i channel.
// Pack six floats into a uint4 variable. This steals six mantissa bits
// from the three 32-bit FP values that hold abc to store x.
uint4 pack(float3 abc, float3 x)
{
uint z = f32tof16(x.z);
return (uint4(((asuint(abc.x) & 0xFFFFFFC0) | (z & 0x3F)),
((asuint(abc.y) & 0xFFFFFFC0) | ((z >> 6) & 0x3F)),
((asuint(abc.z) & 0xFFFFFFC0) | ((z >> 12) & 0x3F)),
(f32tof16(x.x) + (f32tof16(x.y) << 16))));
}
struct ABC_X
{
float3 abc;
float3 x;
};
ABC_X unpack(uint4 d)
{
    ABC_X res;

    // Recover a, b, and c by masking off the six stolen mantissa bits.
    res.abc = float3(asfloat(d.x & 0xFFFFFFC0),
                     asfloat(d.y & 0xFFFFFFC0),
                     asfloat(d.z & 0xFFFFFFC0));

    // Reassemble the 16-bit x values: x.x and x.y come from d.w, and the bits
    // of x.z are gathered from the low six bits of d.x, d.y, and d.z.
    uint z = (d.x & 0x3F) | ((d.y & 0x3F) << 6) | ((d.z & 0x3F) << 12);
    res.x = float3(f16tof32(d.w & 0xFFFF), f16tof32(d.w >> 16), f16tof32(z));

    return res;
}

Listing 8.3. Packing/unpacking all solver variables into/from one rgba32_uint value.
8.3 Results
Table 8.1 shows how various implementations of the DDOF solver perform at
various resolutions and how much memory each solver consumes. These perfor-
mance numbers (run on a system with an AMD HD 5870 GPU with 1 GB of vid-
eo memory) show that the improved solver presented in this gem outperforms the
traditional solvers in terms of running time and also in terms of memory re-
quirements.
In the settings used in these tests, the packing shown in Listing 8.3 does not
show any obvious differences (see Figure 8.3). Small differences are revealed in
Figure 8.4, which shows the amplified absolute difference between the images in
Figure 8.3. If these differences stay small enough, then packing should be used in
DirectX 11 rendering paths in games that implement this gem.
Resolution     Solver                             Running Time (ms)   Memory (~MB)
1280 × 1024    Standard solver                    2.46                 90
1280 × 1024    Standard solver + packing          1.97                 70
1280 × 1024    Four-to-one reduction              1.92                 50
1280 × 1024    Four-to-one reduction + packing    1.87                 40
1600 × 1200    Standard solver                    3.66                132
1600 × 1200    Standard solver + packing          2.93                102
1600 × 1200    Four-to-one reduction              2.87                 73
1600 × 1200    Four-to-one reduction + packing    2.75                 58
1920 × 1200    Standard solver                    4.31                158
1920 × 1200    Standard solver + packing          3.43                122
1920 × 1200    Four-to-one reduction              3.36                 88
1920 × 1200    Four-to-one reduction + packing    3.23                 70
2560 × 1600    Standard solver                    7.48                281
2560 × 1600    Standard solver + packing          5.97                219
2560 × 1600    Four-to-one reduction              5.80                156
2560 × 1600    Four-to-one reduction + packing    5.59                125

Table 8.1. Comparison of solver efficiency.
Figure 8.4. Absolute difference between the images in Figure 8.3, multiplied by 255 and inverted.
References

[Kass et al. 2006] Michael Kass, Aaron Lefohn, and John Owens. "Interactive Depth of Field Using Simulated Diffusion on a GPU." Technical report, Pixar Animation Studios, 2006. Available at http://www.idav.ucdavis.edu/publications/print_pub?pub_id=898.

[Shishkovtsov and Rege 2010] Oles Shishkovtsov and Ashu Rege. "DX11 Effects in Metro 2033: The Last Refuge." Game Developers Conference, 2010. Available at http://developer.download.nvidia.com/presentations/2010/gdc/metro.pdf.

[Riguer et al. 2004] Guennadi Riguer, Natalya Tatarchuk, and John Isidoro. "Real-Time Depth of Field Simulation." ShaderX2, edited by Wolfgang Engel. Plano, TX: Wordware, 2004. Available at http://ati.amd.com/developer/shaderx/shaderx2_real-timedepthoffieldsimulation.pdf.

[Zhang et al. 2010] Yao Zhang, Jonathan Cohen, and John D. Owens. "Fast Tridiagonal Solvers on the GPU." Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010. Available at http://www.idav.ucdavis.edu/func/return_pdf?pub_id=978.
9
Automatic Dynamic Stereoscopic 3D
Jason Hughes
Steel Penny Games
Stereoscopic rendering is not new. In film, it has been used for years to make
experiences more immersive and exciting, and indeed, the first theatrical showing
of a 3D film was in 1922 [Zone 2007]. A cinematographer can design scenes for
film that take advantage of the 3D projection without unduly taxing the audience,
intentionally avoiding the various inherent issues that exist with a limited field
projection of stereoscopic content. Further, an editor can trim unwanted frames
or even modify the content to mask objectionable content within a frame.
Interactive stereoscopic 3D (S3D), however, is fairly new territory. Those
same problems that filmmakers consciously avoid must be handled by clever
programmers or designers. Sometimes, this means changing some camera angles,
or adjusting a few parameters in certain areas of an environment. Sometimes this
is not an option, or it may be infeasible to change the content of a title for S3D
purposes. It’s likely that some visual issues that arise from player control are bet-
ter dealt with rather than avoided, such as window violations. While some re-
search is available to guide an implementation, practical examples that handle
dynamic environments well are hard to come by. This is an exciting time for
game programmers: some experimentation and creativity is actually required on
our parts until best practices become standard. Here follows the results of our
work with Bluepoint Games in porting the excellent Shadow of the Colossus and
ICO to PlayStation 3 in full HD and S3D.
Figure 9.1. A window violation occurs when one eye sees part of an object but the other does not.
Window Violations

A window violation occurs when an object clips against the left or right edges of the screen while not exactly at the focal plane distance. While this happens in front of and behind the focal plane, it is more disturbing when in negative parallax (closer to the viewer). This very irritating, unnatural feeling is the result of the brain failing to merge two images of an object that it perceives partially but has no visual occlusion to account for the missing information. Figure 9.1 demonstrates a window violation.
Convergence Fatigue

Unlike when viewing 2D images, with S3D, the audience must use eye muscles to refocus on objects as they change in depth. It is uncommon in natural settings for people to experience radical changes in focal depth, certainly not for extended periods of time. This muscular fatigue may not be noticed immediately, but once the viewer's eyes are stressed, they will be hard-pressed to enjoy content further.

It is worth noting that the eye's angular adjustment effort decreases as objects grow more distant and increases as they approach the viewer. With this in mind, it is generally preferable to keep objects of interest at a distance rather than constantly poking out of the screen. This reduces eye strain, but still yields a nice 3D effect.
The film industry has already learned that the audience becomes fatigued if
rapid shot changes force a refocusing effort to view the scene. It is unclear
whether this fatigue is due to the eye muscles being unable to cope with contin-
ued rapid convergence changes or whether the brain is more efficient at tracking
objects spatially over time but perhaps must work harder to identify objects ini-
tially and then tires quickly when constantly reparsing the scene from different
viewpoints. In any case, giving those shot changes relatively similar convergence
points from shot to shot helps reduce the stress placed on the audience.
Accommodation/Convergence Deviation
Another problem with S3D is when the brain interprets the focal distance to be
significantly closer or farther than the location of the TV screen. This happens
when tension placed on the muscles inside the eye that controls the focal point of
the eyes differs significantly from the expected tension placed on muscles outside
the eye that controls angular adjustment of the eyes. Figure 9.2 demonstrates a
situation where accommodation and convergence are different. When the brain
notices these values being in conflict with daily experience, the illusion of depth
begins to break down. Related factors seem to involve the content being dis-
played, the size of the screen, the distance that the viewer sits from the screen,
and the spacing between the viewer’s eyes. It’s complicated, but in short, don’t
push the effect too much, or it becomes less convincing. For a deeper examina-
tion of visual fatigue and how people perceive different content, see the report by
Mikšícek [2006].
Figure 9.2. Accommodation is the physical distance between the viewer and the display
surface. Convergence is the apparent distance between the observed object and the
viewer.
Keystone Distortions
It should be noted that simply rendering with two monoscopic projection matri-
ces that tilt inward around a forward axis creates keystone distortions. Figure 9.3
shows an exaggerated example of how incorrectly set up viewing matrices ap-
pear. This kind of distortion is similar to holding two sheets of paper slightly ro-
tated inward, then overlapped in 3D—the intersection is a line rather than a
plane, and it warps the image incorrectly for stereoscopic viewing.

Figure 9.3. Keystone distortion, also known as the tombstone effect, occurs when a rectangle is warped into a trapezoidal shape, stretching the image. This can be seen when pointing a projector at a wall at any nonperpendicular angle. This image shows how two monoscopic projections tilted inward produce a final image that the brain cannot merge as a single checkerboard square.

For the brain to merge or fuse two images correctly requires that the focal plane be aligned in a
coplanar fashion between the two images. Instead, the correct projection matrices
are asymmetrical, off-axis projection matrices. These contain shearing so that the
corners of the focal plane exactly match between the left and right stereo images
but have a different point of origin. We don’t reproduce the math here because it
can be found elsewhere with better descriptions of the derivation [Bourke 1999,
Schertenleib 2010, Jones et al. 2001].
2D Image‐Based Effects
Image-based effects, commonly called "post effects," are a staple of modern
monoscopic games. These cheap adjustments to rendered images make up for
the lack of supersampling hardware, low-resolution rendering buffers, inability
for game hardware to render high-quality motion blurs, poor-quality lighting and
shadowing with image-space occlusion mapping, blooms and glows, tone map-
ping, color gamut alterations, etc. It is a major component of a good-quality ren-
dering engine but, unfortunately, is no longer as valuable in S3D. Many of these
tricks do not really work well when the left/right images deviate significantly
because the alterations are from each projection’s fixed perspective. Blurring
around the edges of objects, for example, in an antialiasing pass causes those
edges to register more poorly with the viewer. Depending on the effect, it can be
quite jarring or strange to see pixels that appear to be floating or smeared in
space because they do not seem to fit. The result can be somewhat akin to win-
dow violations in the middle of the screen and, in large enough numbers, can be
very distracting. When considering that post effects take twice the processing
power to modify both images, a suggestion is to dial back these effects during
S3D rendering or remove them entirely since the hardware is already being
pushed twice as hard just to render the scene for both eyes separately.
However, any object in 3D that draws in front of the focal plane at that location
acts like a window violation because the UI is (typically) rendered last and drawn
over that object. The sort order feels very wrong, as if a chunk of the object was
carved out and a billboard was placed in space right at the focal plane. And since
the parts of the near object that are occluded differ, the occluded edges appear
like window violations over the UI. All in all, the experience is very disruptive.
The two other alternatives are to always render the UI in the world, or to
move the UI distance based on the nearest pixel. In short, neither works well.
Rendering a UI in world space can look good if the UI is composed of 3D ob-
jects, but it can be occluded by things in the world coming between the eye and
the UI. This can be a problem for gameplay and may irritate users. The second
alternative, moving the UI dynamically, creates tremendous eye strain for the
player because the UI is constantly flying forward to stay in front of other ob-
jects. This is very distracting, and it causes tremendous ghosting artifacts on cur-
rent LCD monitors.
The best way to handle subtitles and UIs is to remove them as much as pos-
sible from gameplay or ensure that the action where they are present exists en-
tirely in positive parallax (farther away than the monitor).
ly place this box around your monitor. By decreasing the focal distance, we slide
the box farther away from the viewer, which causes more objects in the scene to
appear behind the focal plane (the monitor) and fewer to appear in front. By in-
creasing the focal distance, we slide the box toward the viewer and bring more
objects in front of the focal plane (the monitor).
ParallaxSeparation = min(max((FocalDistance - 300.0F) / 4800.0F, 0.0F) + 0.05F, 0.35F);
In plain English, the parallax separation varies linearly between 0.050 and 0.350 as the focal distance goes from 300 cm to 4800 cm. This was based on the observation that the character, when close to the camera, did not need much parallax separation to make the backgrounds feel 3D. However, in a vast and open room, when the character was far from the camera and no walls were near the camera, the parallax separation value needed to be cranked up to show much 3D detail at large focal distances. The problem with this setup is that anytime a torch or rock or wall comes into the frame from the side, the parallax separation is so great that the object seems to protrude incredibly far out of the TV, and the viewer almost certainly sees window violations. Without taking into account objects pushed heavily toward the viewer, there would inevitably be uncomfortable areas in each game. See Figure 9.4 for an example of this kind of artifact.
Figure 9.4. The simple algorithm always keeps the character in focus. Generally, this is pleasing, except that objects nearer to the camera appear disturbingly close to the viewer. Worse yet, extreme window violations are possible, such as with the tree on the right. (Image courtesy of Sony Computer Entertainment, Inc.)
In the end, we realized that the relationship between focal distance and paral-
lax separation is not simple because for any given bothersome scene, there were
two ideal settings for these parameters. Nearly any scene could look good by ei-
ther significantly reducing the strength of the effect by lowering the parallax sep-
aration value or by moving the focal plane to the nearest object and reducing the
parallax separation value less.
Figure 9.5. The focal plane, where the TV is, has a range indicated in blue where com-
fortable viewing occurs with minimal stress. The red areas show where discomfort oc-
curs. Notice the tolerance for extremely near objects is very low, whereas distant objects
are acceptable.
handed use of S3D incorrectly indicated to players that certain areas were peri-
lously high.
The other observation was that objects that come between the focal plane and
the camera with high stereoscopic deviation tend to upset players, especially
when they reach the left or right edges of the screen (window violations). Other
researchers have suggested variable-width black bars on the sides of the screen to
solve the window violation problem [Gunnewiek and Vandewalle 2010]. While
this works for film, it is not an ideal solution for games. Doing
this causes a loss of effective resolution for the player, still does not address the
issue that objects can feel uncomfortably close at times, and requires tracking all
objects that might be drawn to determine which is closest to the camera and visi-
ble. This is not an operation most games typically do, and it is not a reasonable
operation for world geometry that may be extremely large and sprawling within a
single mesh. The only good way to determine the closest object is to do so by
examining the depth buffer, at which point we recommend a better solution
anyway.
Our approach is to categorize all pixels into three zones: comfortable, far,
and close. Then, we use this information to adjust the focal plane nearer to the
viewer as needed to force more of the pixels that are currently close into a cate-
gory corresponding to farther away. This is accomplished by capturing the depth
buffer from the renderer just before applying post effects or UI displays, then
measuring the distance for each pixel and adding to a zone counter based on its
categorization. Figure 9.6 shows an example of a typical scene with this algo-
rithm selecting the parameters. On the PlayStation 3, this can be done quickly on
SPUs, and it can be done on GPUs for other platforms.
To simplify the construction of this comfortable zone, we define the half-
width of the comfortable zone to extend halfway between the focal distance and
the closest pixel drawn (last frame). This is clamped to the focal distance in case
the closest pixel is farther away. Since the comfortable zone is symmetrical
around the focal plane, the transition between the comfortable zone and far zone
is trivial to compute.
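A sketch of this categorization, with assumed names and with the depth buffer already converted to view-space distances, is shown below. In practice the same counting runs on SPUs or on the GPU rather than in a plain CPU loop.

#include <algorithm>
#include <cstdint>

struct ZoneCounts
{
    uint32_t nearCount, comfortCount, farCount;
};

ZoneCounts CategorizePixels(const float *viewDistance, int pixelCount,
                            float focalDistance, float closestLastFrame)
{
    // The comfortable zone extends halfway from the focal plane toward the closest
    // pixel drawn last frame (clamped so that the half-width is never negative).
    float nearest = std::min(closestLastFrame, focalDistance);
    float halfWidth = 0.5f * (focalDistance - nearest);

    ZoneCounts counts = { 0, 0, 0 };
    for (int i = 0; i < pixelCount; ++i)
    {
        float d = viewDistance[i];
        if (d < focalDistance - halfWidth)
            ++counts.nearCount;
        else if (d > focalDistance + halfWidth)
            ++counts.farCount;
        else
            ++counts.comfortCount;
    }
    return counts;
}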
Once categorized, we know the ratios of close pixels to comfortable pixels to
far pixels. We assume all the pixels in the near zone are going to cause some dis-
comfort if they are given any significant stereoscopic deviation. Given the pixel
distribution, we have to react to it by changing S3D parameters to render it better
next frame. (The one-frame delay in setting S3D parameters is unfortunate, but
game logic can be instrumented to identify hard cuts of the camera a priori and
force the S3D parameters back to a nearly 2D state, which are otherwise distress-
ing glitches.)
Similar to the first algorithm presented, the parallax separation value is adjusted based on the focal distance. However, two important changes allow for content-adaptive behavior: the focal distance is reduced quickly when the near zone has a lot of pixels in it, and the parallax separation is computed based on a normalized weighting of the pixel depth distribution. This additional control dimension is crucial because it allows us to tune the strength of the stereoscopic effect based on the ratio of near and far pixels. Figure 9.7 shows how the adaptive algorithm handles near pixels to avoid window violations.

There are two situations worth tuning for: the focal plane is near the camera, and the focal plane is relatively far from the camera. In each case, we want to specify how near and far pixels affect the parallax separation, so we use the focal distance to smoothly interpolate between two different weightings. (Weighting the distribution is necessary because a few pixels in the near zone are very, very important to react strongly to when the focal plane is far away, whereas this is not so important when the focal plane is very close to the viewer. This nonuniform response to pixels at certain depths is crucial for good results. See Figure 9.8 for details.) The resultant weighting is a three-vector (near, comfortable, far) that is multiplied componentwise against the pixel distribution three-vector and then renormalized. Finally, take the dot product between the weighted vector and a three-vector of parallax separation values, each element of which corresponds to the strength of the S3D effect at the current focal distance if all pixels were to fall exclusively inside that zone. This gives context to the overall stereoscopic rendering based on how a scene would look if the focal plane is near or far and based on how many pixels are too close for comfort, how many are very far from the player, or any combination thereof. Figure 9.9 shows a situation where window violations are avoided with the adaptive dynamic algorithm.
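Putting the pieces together, a sketch of the weighted parameter selection might look like the following. The endpoint weightings, the focal-distance range, and the per-zone separation strengths are placeholders, not the shipped tuning values (see Figure 9.8 for those).

#include <algorithm>

struct Vec3 { float x, y, z; };

static float Dot(const Vec3 &a, const Vec3 &b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

static Vec3 Lerp(const Vec3 &a, const Vec3 &b, float t)
{
    return { a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t, a.z + (b.z - a.z) * t };
}

float ComputeParallaxSeparation(float nearCount, float comfortCount, float farCount,
                                float focalDistance)
{
    // Normalized pixel distribution over the (near, comfortable, far) zones.
    float total = std::max(nearCount + comfortCount + farCount, 1.0f);
    Vec3 dist = { nearCount / total, comfortCount / total, farCount / total };

    // Interpolate the zone weighting between the near-focal and far-focal tunings.
    const Vec3 kWeightsNearFocal = { 1.0f, 1.0f, 1.0f };   // placeholder values
    const Vec3 kWeightsFarFocal  = { 8.0f, 1.0f, 0.5f };   // placeholder: react hard to near pixels
    float t = std::min(std::max((focalDistance - 300.0f) / 4500.0f, 0.0f), 1.0f);
    Vec3 w = Lerp(kWeightsNearFocal, kWeightsFarFocal, t);

    // Weight the distribution componentwise and renormalize.
    Vec3 wd = { dist.x * w.x, dist.y * w.y, dist.z * w.z };
    float sum = std::max(wd.x + wd.y + wd.z, 1e-6f);
    wd = { wd.x / sum, wd.y / sum, wd.z / sum };

    // Dot with the per-zone separation strengths for the current focal distance.
    const Vec3 kSeparationPerZone = { 0.05f, 0.20f, 0.35f };   // placeholder values
    return Dot(wd, kSeparationPerZone);
}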
Figure 9.8. This shows actual values in use for two shipped games. As the focal plane floats between these endpoints, the values shown interpolate smoothly to prevent jarring S3D pops.

Figure 9.9. In this case, a large rock is very close to the camera. Setting the focal distance farther away than the rock would increase the pixels that would fall in the near category, causing the focal distance to shorten significantly. This self-correcting feedback is what allows for extremely dynamic environments to be handled comfortably. It effectively stops window violations in most cases, and it prevents uncomfortable protrusions toward the viewer. (Image courtesy of Sony Computer Entertainment, Inc.)
Another small but critical detail is that a single frame of objectionable con-
tent is not especially noticeable, but some circumstances could exist where the
focal distance and parallax separation could bounce around almost randomly if
objects came into and out of frame rapidly, such as birds passing by the camera
or a horse that comes into frame at certain parts of an animation. The best way to
handle these situations is to allow “deadening” of the 3D effect quickly, which
tends not to bother the player but makes increases to the effect more gradual. In
this way, the chance of objectionable content is minimized immediately, and the
increase of the stereoscopic effect is subtle rather than harsh. Our specific im-
plementation allows for the focal distance and parallax separation to decrease
instantaneously, whereas they may only increase in proportion to their current
value, and we have seen good results from this.
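A sketch of this asymmetric smoothing rule, with an assumed per-frame growth rate, is shown below.

#include <algorithm>

// The focal distance and parallax separation may drop to their target immediately,
// but may only grow by a fraction of their current value each frame.
float SmoothStereoParameter(float current, float target, float kGrowthPerFrame = 0.05f)
{
    if (target < current)
        return target;                                            // deaden the effect instantly
    return std::min(target, current * (1.0f + kGrowthPerFrame));  // increase gradually
}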
One problem that we ran into while implementing this algorithm was low-
valued oscillation. Assuming the units for the near and far focal distances match
your game, the oscillation should be observed at less than one percent for focal
distance and parallax separation. At this low level, no visual artifacts are appar-
ent. However, if the near and far focal distances do not coincide with reasonable
camera-to-focal-object distances, or if the weights on each category are made
more extreme, stability cannot be certain. This is because the depth of each ren-
dered frame influences the depth of the next rendered frame. Heavy weights and
poorly tuned distances can cause extreme jumps in the focal distance, causing an
oscillation. If there is a visible artifact that appears to be an oscillation, it is an
indication that some of the control parameters are incorrect.
References
[Zone 2007] Ray Zone. Stereoscopic Cinema & the Origins of 3-D Film. University
Press of Kentucky, 2007.
[Mikšícek 2006] František Mikšícek. “Causes of Visual Fatigue and its Improvements in
Stereoscopy.” Technical Report No. DCSE/TR-2006-04, University of West Bo-
hemia in Pilsen, 2006.
[Bourke 1999] Paul Bourke. “Calculating Stereo Pairs.” 1999. Available at http://local.
wasp.uwa.edu.au/~pbourke/miscellaneous/stereographics/stereorender/.
[Jones et al. 2001] Graham Jones, Delman Lee, Nicolas Holliman, and David Ezra.
“Controlling Perceived Depth in Stereoscopic Images.” Technical Report, Sharp
Laboratories of Europe Ltd., 2001.
[Gunnewiek and Vandewalle 2010] René Klein Gunnewiek and Patrick Vandewalle.
“How to Display 3D Content Realistically.” Technical Report, Philips Research
Laboratories, VPQM, 2010.
10
Practical Stereo Rendering
Matthew Johnson
Advanced Micro Devices, Inc.
This chapter discusses practical stereo rendering techniques for modern game
engines. New graphics cards by AMD and Nvidia enable application developers
to utilize stereoscopic technology in their game engines. Stereo features are ena-
bled by middleware, driver extensions, or the 3D API itself. In addition, new
consumer-grade stereoscopic displays are coming to the market, fueled by the
excitement over 3D movies such as Avatar and How to Train Your Dragon.
The stereo algorithm described in this article takes advantage of binocular dispar-
ity to achieve the desired stereo effect.
Figure 10.1. Common stereo display formats. Frame packing, side-by-side, and top-
bottom are all supported by HDMI 1.4.
Several standards exist for storing the stereo pair content to be delivered to
the target display. Frame packing is the most promising format, enabling applica-
tions to use full-resolution back buffers for both the left eye and right eye.
A few of the common stereo display formats are shown in Figure 10.1. Some
formats, such as side-by-side (half) and top-bottom (half), can be generated with
the same display bandwidth and resolution by using half the horizontal or vertical
resolution per eye.
Figure 10.2. A stereo camera configuration utilizing two parallel asymmetric frustums.
At zero horizontal parallax, the left eye and right eye overlap completely.
negative parallax does not exceed the interaxial distance. Another alternative is to
avoid negative parallax completely and set the near plane distance equal to the
zero-parallax distance.
In real life, eye convergence introduces vertical parallax, but this can also
cause eye discomfort. To avoid vertical parallax, the frustums should not be ro-
tated to the same look-at point. Using parallel symmetric frustums avoids vertical
parallax, but at the cost of introducing excessive negative parallax. Therefore, the
preferred way of rendering stereoscopic scenes is utilizing two parallel asymmet-
ric frustums.
Figure 10.3. The relationship between camera distances and horizontal parallax, based on
the projected view-space point P.
The general rule of thumb is for the interaxial width W_a between the eyes to be equal to 1/30th of the distance to the horizontal zero-parallax plane, where θ ≈ 1.9° (the angle between the focus point and each eye). This relationship can be visualized in Figure 10.3. The manufacturers of shutter glasses generally prefer a maximum parallax of θ = 1.5°, which comes out to roughly 1/38th the distance to zero parallax.
As an example, suppose that the bounding sphere of a scene in view space has center C and radius 45.0 and that we need to calculate the interaxial distance necessary to achieve zero parallax at the center of the scene. In view space, the camera location is (0, 0, 0). Therefore, the distance from the camera to the center of the sphere is |C_z|, and thus z = |C_z|. For a maximum parallax of θ = 1.5°, we have the relationship

$$\tan\!\left(\frac{1.5°}{2}\right) = \frac{W_a}{2z}.$$

So W_a ≈ 0.0262 z. Thus, one can find the desired interaxial distance for a given zero-parallax distance.
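The same relationship is easy to express directly in code; the small helper below simply rearranges the formula above (the name is illustrative).

#include <cmath>

// Interaxial distance that produces a given maximum parallax angle (in degrees)
// for geometry at view-space distance z, from tan(theta / 2) = Wa / (2 z).
float InteraxialForParallax(float thetaDegrees, float z)
{
    float theta = thetaDegrees * 3.14159265f / 180.0f;
    return 2.0f * z * std::tan(0.5f * theta);   // theta = 1.5 degrees gives about 0.0262 * z
}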
Setting the near plane n of the viewing frustum can be done based on the de-
sired acceptable negative parallax. The maximum negative parallax should not
exceed the interaxial distance. A relationship between these distances can be es-
tablished by using similar triangles and limits.
As another example, we calculate the parallax at z = z_0, z = z_0/2, z → ∞, and z → 0. By definition, the parallax at z_0 is zero. The parallax can be solved for the other values by similar triangles:

$$\frac{W_0}{W_a} = \frac{z - z_0}{z}, \qquad W_0 = W_a\left(1 - \frac{z_0}{z}\right).$$

For z = z_0/2,

$$W_0 = W_a\left(1 - \frac{z_0}{z}\right) = -W_a.$$

For z → ∞,

$$W_0 = \lim_{z \to \infty} W_a\left(1 - \frac{z_0}{z}\right) = W_a.$$

For z → 0,

$$W_0 = \lim_{z \to 0} W_a\left(1 - \frac{z_0}{z}\right) = -\infty.$$
Figure 10.4. Offsetting the left and right frustum planes to form an asymmetric matrix
with zero parallax. (For clarity, the right eye is not shown.)
To set up the left and right cameras, offset each camera horizontally by half the interaxial distance. For projection, offset the left and right frustum planes for each eye in order to set up the asymmetric frustum.
Given a horizontal camera offset of ΔW_a, we can calculate the ΔW_n offset for the left and right projection planes by using similar triangles, as follows:

$$\frac{\Delta W_n}{\Delta W_a} = \frac{n}{z_0}.$$

Therefore,

$$\Delta W_n = \Delta W_a \frac{n}{z_0}.$$
Listing 10.1 outlines the pseudocode for modifying the view frustum and
camera transformation for stereoscopic viewing.
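The listing itself is not reproduced here; the sketch below computes the two per-eye quantities it relies on. The caller applies eyeOffsetX to the camera position and shifts the left/right frustum planes by -frustumShiftX before building its usual off-center projection matrix. All names are illustrative.

// Per-eye offsets for stereoscopic rendering, using dWn = dWa * n / z0.
struct StereoEyeOffsets
{
    float eyeOffsetX;     // horizontal camera offset: +/- half the interaxial distance
    float frustumShiftX;  // offset applied to the l and r frustum planes at the near plane
};

StereoEyeOffsets ComputeStereoOffsets(bool leftEye,
                                      float interaxial,     // Wa
                                      float zeroParallax,   // z0
                                      float nearPlane)      // n
{
    StereoEyeOffsets o;
    o.eyeOffsetX = (leftEye ? -0.5f : 0.5f) * interaxial;

    // Both frustums must coincide at the zero-parallax plane.
    o.frustumShiftX = o.eyeOffsetX * nearPlane / zeroParallax;
    return o;
}

For the left eye, for example, the camera moves to x = -W_a/2, and the near-plane frustum bounds move by +ΔW_n so that the two frustums overlap exactly at the zero-parallax plane.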
each eye. This rendering technique may improve performance, especially if the
geometry shader is being used anyway.
One method could be to use geometry amplification to do this. This requires
that we perform the view-projection transformations in the geometry shader,
which may not be as efficient as instancing if the geometry shader is the bottle-
neck. Another way is to use geometry instancing and write a pass-through geom-
etry shader. Sample HLSL code for Direct3D 10 is shown in Listing 10.2. To
render with instancing, set the geometry instanced count to two, as follows:
pD3DDevice->DrawIndexedInstanced(numIndices, 2, 0, 0, 0);
matrix WorldViewProjLeft;
matrix WorldViewProjRight;

struct PS_INPUT
{
    float4 Pos : SV_POSITION;
    float4 Color : COLOR0;
    uint InstanceId : INSTANCE;
};

struct GS_OUTPUT
{
    float4 Pos : SV_POSITION;
    float4 Color : COLOR0;
    uint InstanceId : INSTANCE;
    uint Viewport : SV_ViewportArrayIndex;
};

// Vertex shader (signature reconstructed): each instance selects the transform
// for its own eye.
PS_INPUT VSStereo(float4 Pos : POSITION, float4 Color : COLOR0,
                  uint InstanceId : SV_InstanceID)
{
    PS_INPUT input;
    input.Pos = Pos;

    if (InstanceId == 0)
    {
        input.Pos = mul(input.Pos, WorldViewProjLeft);
    }
    else
    {
        input.Pos = mul(input.Pos, WorldViewProjRight);
    }

    input.Color = Color;
    input.InstanceId = InstanceId;
    return (input);
}

// Pass-through geometry shader: route each instance's triangle to its viewport.
[maxvertexcount(3)]
void GSStereo(triangle PS_INPUT In[3],
              inout TriangleStream<GS_OUTPUT> TriStream)
{
    GS_OUTPUT output;

    for (int v = 0; v < 3; ++v)
    {
        output.Pos = In[v].Pos;
        output.Color = In[v].Color;
        output.InstanceId = In[v].InstanceId;
        output.Viewport = In[v].InstanceId;
        TriStream.Append(output);
    }
}
Listing 10.2. This HLSL code renders a stereo pair using the geometry shader with instancing.
References
[AMD 2009] AMD. “AMD Advances 3D Entertainment: Demonstrates Blu-Ray Stereo-
scopic 3D Playback at 2010 International CES”. December 7, 2009. Available at
http:// www.amd.com/us/press-releases/Pages/amd-3d-2009dec7.aspx.
[Lengyel 2004] Eric Lengyel. Mathematics for 3D Game Programming & Computer
Graphics. Hingham, MA: Charles River Media, 2004.
[McAllister 2006] David F. McAllister. “Display Technology: Stereo & 3D Display
Technologies.” Encyclopedia of Imaging Science and Technology, Edited by Jo-
seph P. Hornak, Wiley, 2006.
[HDMI 2010] HDMI Licensing, LLC. “HDMI Licensing, LLC Makes 3D Portion of
HDMI Specification Version 1.4 Available for Public Download.” February 3,
2010. Available at http://www.hdmi.org/press/press_release.aspx?prid=119.
[National Instruments 2010] National Instruments. “3D Video: One of Seven New Fea-
tures to Test in HDMI 1.4.” 2010. Available at http://zone.ni.com/devzone/cda/
tut/p/id/11077.
[Ramm 1997] Andy Ramm. “Stereoscopic Imaging.” Dr. Dobbs Journal. September 1,
1997. Available at http://www.drdobbs.com/184410279.
[Bourke 1999] Paul Bourke. “Calculating Stereo Pairs.” July 1999. Available at http://
local.wasp.uwa.edu.au/~pbourke/miscellaneous/stereographics/stereorender/.
11
Making 3D Stereoscopic Games
Sébastien Schertenleib
Sony Computer Entertainment Europe
11.1 Introduction
With the large variety of 3D content being made available (sports events, movies,
TV, photos, games, etc.), stereoscopic 3D is gaining momentum. With the sup-
port for 3D content on the PC and game consoles such as the PlayStation 3 and
Nintendo 3DS, it is likely that it will become even more widespread. In this chap-
ter, we present some topics that need to be considered when creating or convert-
ing a game to stereoscopic 3D. We also present some optimization techniques
that are targeted to improving both the run-time performance and visual fidelity.
■ Active shutter glasses. The screen alternately displays the left and right im-
ages and sends a signal to the LCD screen in the lens for each eye, blocking
or transmitting the view as necessary.
■ Passive polarized glasses. The screen is paired with adjacent right and left
images using orthogonal polarizations. The filter on each eye blocks the or-
thogonally polarized light, allowing each eye to see only the intended image.
■ Parallax barrier. The screen features a layer of material with some slits
placed in front of it, allowing each eye to see a different set of pixels without
glasses, but with restricted view angles.
Figure 11.2. (a) A simple offset is applied to the left and right cameras. (b) Both cameras are rotated inward. (c) The cameras have parallel view directions through the use of asymmetric projections. Configurations (a) and (b) lead to issues that deteriorate the stereoscopic 3D experience. Configuration (c) avoids those shortcomings by using asymmetric projection matrices.
$$\mathbf{M}_{\text{proj}} =
\begin{pmatrix}
\dfrac{2n}{r-l} & 0 & \dfrac{r+l}{r-l} & 0 \\
0 & \dfrac{2n}{t-b} & \dfrac{t+b}{t-b} & 0 \\
0 & 0 & -\dfrac{n+f}{f-n} & -\dfrac{2nf}{f-n} \\
0 & 0 & -1 & 0
\end{pmatrix}$$
Figure 11.3. The view volume for an asymmetric frustum. The left and right values rep-
resent the minimum and maximum x values of the view volume, and the bottom and top
values represent the minimum and maximum y values of the view volume, respectively.
Having the ability to alter the camera properties every frame provides a much
larger degree of freedom for controlling the 3D scene. For instance, we can ad-
just the convergence of the cameras to control the depth and size of the objects
within the environment. The convergence corresponds to areas of the left and
right projected images that superimpose perfectly and therefore have zero paral-
lax, appearing in the plane of the screen. We can also adjust the interaxial dis-
tance, which is the separation between both cameras, in order to push back
foreground objects. This is very important because it allows us to offer a much
more comfortable experience.
Figure 11.4. Objects very close to and far from the image plane are difficult to focus on
and are uncomfortable. Ideally, most of the scene should reside in the safe and comforta-
ble area.
Figure 11.5. Frustum culling for stereoscopic 3D. Instead of combining the view frustum
of both cameras, we want to cull any object that would be visible only to one camera,
reducing window violations.
Figure 11.6. Without a floating window, some objects might be more visible to one eye,
but by using a floating window we can prevent such situations.
where d is the distance to the viewing plane, s is the separation distance between
the left and right cameras, and fov is the horizontal field-of-view angle.
It is also possible to go one step further and use a dynamic floating window
that has a more negative parallax than the closest object, where we avoid that part
of an object that becomes more visible to one eye, as shown in Figure 11.6. This
creates the illusion of moving the screen surface forward. It is also possible to
minimize the difference between the frame and its surface using a graduation or
motion blur at the corners of the screen.
We might also have to consider limiting the maximum parallax to prevent any divergence that occurs when the separation of an object for both eyes on the screen is larger than the gap between our eyes (≈ 6.4 cm). Thankfully, the HDMI 1.4 specifications allow retrieving the size of the TV, which can be used to calibrate the camera separation. Depending on the screen size, the number of pixels N that are contained within this distance varies as

$$N = \frac{d_{\text{interocular}} \, w_{\text{pixels}}}{w_{\text{screen}}},$$

where d_interocular is the distance between the eyes measured in centimeters, w_pixels is the width of the screen in pixels, and w_screen is the width of the screen measured in centimeters. For example, for a 46-inch TV and a resolution of 1920 × 1080 pixels, the number of pixels N for a typical human interocular distance is about 122 pixels.
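This formula translates directly into a one-line helper; the function name and the assumed screen width in the example are illustrative.

// Number of screen pixels spanned by the interocular distance; all physical widths
// are in the same unit (centimeters here).
float InterocularInPixels(float interocularCm, float screenWidthPixels, float screenWidthCm)
{
    return interocularCm * screenWidthPixels / screenWidthCm;
}

// Example: a 46-inch 16:9 display is roughly 101 cm wide, so
// InterocularInPixels(6.4f, 1920.0f, 101.0f) gives approximately the 122 pixels quoted above.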
while (notdead)
{
updateSimulation(time);
renderShadowMaps();
renderScene(LeftEye, RightEye);
renderHUD(LeftEye, RightEye);
vsyncThenFlip();
}
Figure 11.8 presents ways to minimize the impact for both the GPU and the
CPU by ensuring that view-independent render targets are shared. Some effects
that are view-dependent, such as reflections, can sometimes be shared for both
views if the surface covered is relatively small, as it often is for mirrors. This
leads to artifacts, but they might be acceptable. On some platforms like the
PlayStation 3, it is also possible to perform some effects asynchronously on the
SPU, such as cascaded shadow maps. In particular, the CPU overhead can also be
reduced by caching the relevant rendering states.
It is also possible to use multiple render targets (MRTs) to write to both left
and right frame buffers in a single pass. This technique can be used to write to
both render targets in a single pass when rendering objects at the screen level or
when applying full-screen postprocessing effects, such as color enhancement or
crosstalk reduction. This is depicted in Figure 11.9.
Render for each eye:
■ Back buffer
■ Depth/stencil buffer
■ HDR
■ Blur
■ Bloom
■ Mirrors
■ Parallax mapping
■ Depth of field
■ ...

Render once for both eyes:
■ Shadow maps
■ Spot light maps projected in the scene
■ Offscreen surfaces

Figure 11.8. Scene management where view-independent render targets are computed once.
Figure 11.9. Multiple render targets allow us to write to both the left and right frame
buffers in a single pass.
Some GPUs flush the rendering pipeline when a new surface is bound as a
render target. This might lead to a performance hit if the renderer frequently
swaps surfaces for the left and right eyes. A simple solution for avoiding this
penalty consists of binding a single surface for both eyes and then moving the
viewport between left and right rendering positions, as illustrated in Listing 11.1.
setRenderStates();
setLeftEyeProjection();
setLeftEyeViewport(); // surface.x = 0, surface.y = 0
Drawcall();
setRightEyeProjection();
setRightEyeViewport(); // surface.x = 0, surface.y = 720 + 30
Drawcall();
// You can carry on with the same eye for the next object to
// minimize the change of projection matrix and viewport.
setRenderStates();
Drawcall();
172 11. Making 3D Stereoscopic Games
setLeftEyeProjection();
setLeftEyeViewport();
Drawcall();
Listing 11.1. This code demonstrates how the images for the left and right eyes can be combined
in a single render target by moving the viewport.
$$\max\big(N_x (x - x_L),\; N_x (x - x_R)\big) + N_y y + N_z z$$

operations. For checking back-facing triangles for both views at once, the efficiency of this test is improved by around 33 percent.

¹ For example, see http://www.trioviz.com/.
fore, keeping good texture filtering and antialiasing ensures a good correlation
between both images by reducing the difference of the pixel intensities. The hu-
man visual system extracts depth information by interpreting 3D clues from a 2D
picture, such as a difference in contrast. This means that large untextured areas
lack such information, and a sky filled with many clouds produces a better result
than a uniform blue sky, for instance. Moreover, and this is more of an issue with
active shutter glasses, a large localized contrast is likely to produce ghosting in
the image (crosstalk).
Stereo Coherence
Both images need to be coherent in order to avoid side effects. For instance, if
both images have a different contrast, which can happen with postprocessing ef-
fects such as tone mapping, then there is a risk of producing the Pulfrich effect.
This phenomenon is due to the signal for the darker image being received later by
the brain than for the brighter image, and the difference in timing creates a paral-
lax that introduces a depth component. Figure 11.10 illustrates this behavior with
a circle rotating at the screen distance and one eye looking through a filter (such as
sunglasses). Here, the eye that looks through the filter receives the image with a
slight delay, leading to the impression that the circle rotates around the up axis.
Other effects, such as view-dependent reflections, can also alter the stereo
coherence.
Figure 11.10. The Pulfrich effect.
Fast Action
The brain needs some time to accommodate the stereoscopic viewing in order to
register the effect. As a result, very fast-moving objects might be difficult to in-
terpret and can cause discomfort to the user.
3D Slider
Using a 3D slider allows the user to reduce the stereoscopic effect to a comforta-
ble level. The slider controls the camera properties for interaxial distance and
convergence. For instance, reducing the interaxial distance makes foreground
objects move further away, toward the comfortable area (see Figure 11.4), while
reducing the convergence moves the objects closer. The user might adjust it to
accommodate the screen size, the distance he is sitting from the screen, or just for
personal tastes.
Color Enhancement
With active shutter glasses, the complete image is generally darkened due to the
LCD brightness level available on current 3D glasses. It is possible to minimize
this problem by implementing a fullscreen postprocessing pass that increases the
quality, but it has to be handled with care because the increased contrast can in-
crease the crosstalk between both images.
Crosstalk Reduction
Crosstalk is a side effect where the left and right image channels leak into each
other, as shown in Figure 11.11. Some techniques can be applied to reduce this
problem by analyzing the color intensity for each image and subtracting them
from the frame buffer before sending the picture to the screen. This is done by
creating calibration matrices that can be used to correct the picture. The concept
consists of fetching the left and right scenes in order to extract the desired and
unintended color intensities so that we can counterbalance the expected intensity
leakage, as shown in Figure 11.12. This can be implemented using multiple ren-
der targets during a fullscreen postprocessing pass. This usually produces good
results, but unfortunately, there is a need to have specific matrices tuned to each
display, making it difficult to implement on a wide range of devices.
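As a rough sketch of the counterbalancing step only (the calibration data itself still has to be measured for each display), the correction can be expressed as subtracting a calibrated fraction of the opposite view's intensity from each channel. The 3×3 leakage matrix below is a made-up example, not real calibration data.

#include <algorithm>

struct Color { float r, g, b; };

// Hypothetical per-display calibration: leak[i][j] is the fraction of the other
// view's channel j that bleeds into this view's channel i.
static const float leak[3][3] = { { 0.05f, 0.00f, 0.00f },
                                  { 0.00f, 0.04f, 0.00f },
                                  { 0.00f, 0.00f, 0.06f } };

// Counterbalance the expected leakage from the other eye's image.
Color ReduceCrosstalk(const Color& desired, const Color& otherEye)
{
    Color c;
    c.r = std::max(0.0f, desired.r - (leak[0][0] * otherEye.r + leak[0][1] * otherEye.g + leak[0][2] * otherEye.b));
    c.g = std::max(0.0f, desired.g - (leak[1][0] * otherEye.r + leak[1][1] * otherEye.g + leak[1][2] * otherEye.b));
    c.b = std::max(0.0f, desired.b - (leak[2][0] * otherEye.r + leak[2][1] * otherEye.g + leak[2][2] * otherEye.b));
    return c;
}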
Figure 11.11. Intensity leakage, where a remnant from the left image is visible in the right view.
Figure 11.12. Crosstalk reduction: the unintended intensity leaking from the other view is estimated per color channel using a calibration texture and subtracted from the desired intensity of the right scene texture before the corrected image is sent to the display.
11.11 Conclusion
Converting a monoscopic game to stereoscopic 3D requires less than twice the amount of processing, but some optimization is needed to absorb the additional rendering overhead. However, runtime performance is only one component of a stereoscopic 3D game, and more effort is needed to ensure the experience is
comfortable to watch. A direct port of a monoscopic game might not create the
best experience, and ideally, a stereoscopic version would be conceived in the
early stages of development.
12
A Generic Multiview Rendering
Engine Architecture
M. Adil Yalçın
Tolga Çapın
Department of Computer Engineering, Bilkent University
12.1 Introduction
Conventional monitors render a single image, which is generally observed by the
two eyes simultaneously. Yet, the eyes observe the world from slightly different
positions and form different images. This separation between the eyes provides
an important depth cue in the real world. Multiview rendering aims to exploit this
fundamental feature of our vision system for enhanced 3D rendering.
Technologies that allow us to send different images to the two eyes have
been around for years, but it is only now that they can reach the consumer level
with higher usability [Bowman et al. 2004]. The existing technologies vary
among the different types of 3D displays, and they include shutter glasses, binoc-
ular head-mounted displays, and the more recent and popular autostereoscopic
displays that require no special glasses.
Recent techniques for multiview rendering differ in terms of visual character-
istics, fidelity, and hardware requirements. Notably, multiview rendering engines
should be able to support more than two simultaneous views, following recent
3D display technologies that can mix a higher number of simultaneous views
than traditional stereo view [Dodgson 2005].
Currently, many available multiview applications are configured for the ste-
reo-view case, and the routines that manage stereo rendering are generally im-
plemented as low-level features targeted toward specific APIs and displays. We
present a higher-level multiview rendering engine architecture that is generic,
robust, and easily configurable for various 3D display platforms, as illustrated in Figure 12.1.
Figure 12.1. Mono, lenticular-based, parallax-based, and anaglyph-based display configurations.
1
See http://openreng.sourceforge.net/.
Figure 12.2. Stereo rendering techniques that require wearing glasses. From left to right, anaglyph
glasses, polarized glasses, shutter glasses, and head-mounted displays.
■ Anaglyph glasses. These are based on multiplexing color channels. The two
views are filtered with different colors and then superimposed to achieve the
final image.
■ Head-mounted displays (HMDs). These are based on displaying both views
synchronously to separate display surfaces, typically as miniaturized LCD,
organic light-emitting diode (OLED), or CRT displays.
■ Shutter glasses. These are based on temporal multiplexing of the two views. These glasses work by alternately blocking the left and right eyes in sync with the refresh rate of the display, while the display alternately shows a different view for each eye.
■ Polarized glasses. With passive and active variants, these glasses are based
on presenting and superimposing the two views onto the same screen. The
viewer wears a special type of eyeglasses that contain filters in different ori-
entations.
Figure 12.3. Autostereoscopic displays: a lenticular sheet (left) and a parallax barrier
(right).
A lenticular sheet is an array of small lenses that direct light in different directions, and a parallax barrier is an opaque layer with openings that allow light to pass through only in certain directions. These two
technologies are illustrated in Figure 12.3. In both cases, the intensity of the light
rays passing through the filter changes as a function of the viewing angle, as if
the light is directionally projected. The pixels for both eyes are combined in a
single rendered image, but each eye sees the array of display pixels from a differ-
ent angle and thus sees only a fraction of the pixels, those precisely conveying
the correct left or right view.
The number of views supported by autostereoscopic displays varies. The
common case is two-view, which is generally called stereo-view or stereo-
rendering. Yet, some autostereoscopic 3D displays can render 4, 8, or 16 or more
views simultaneously. This allows the user to move his head side to side and ob-
serve the 3D content from a greater number of viewpoints. Another basic varia-
ble is the size of the display. Three-dimensional TVs, desktop LCD displays, and
even mobile devices with multiview support are becoming popular and accessible
to the mass market.
As a result, it is a challenge to build applications that run on these different
types of devices and 3D displays in a transparent way. There is a need for a mul-
tiview rendering architecture that hides the details of multiplexing and displaying
processes for each type of display.
To be able to support both single-view and multiview rendering seamlessly, the multiview system architecture is integrated into a viewport abstraction over display surfaces. This further allows multiview rendering in multiple viewports on the screen, even with different multiview configurations for each multiview viewport, as shown in Figure 12.4. With this approach, you can add picture-in-picture multiview regions to your screen, or you can show your single-view 2D graphical user interface (GUI) elements over multiview 3D content by rendering it only a single time after the multiview content is merged into the frame buffer. Briefly, with this approach, you have control over where and how you want your multiview content to be rendered. An overview of our architecture is shown in Figure 12.5.

Multiview rendering is enabled by attaching a multiview camera, a multiview buffer, and a multiview compositor to a viewport in a render window. Since most of these components are configurable on their own and provide abstraction over a distinct set of features, the system can be adjusted to fit into many target scenarios. At render time, the multiview rendering pipeline is activated if the attached components form a complete multiview configuration.
Figure 12.4. The same scene rendered to different viewports with different multiview configurations: anaglyph using color-mode (bottom-left), parallax using both off-target and on-target (stenciling), on-target-wiggle, and on-target-separated.
// Create a stereo camera attached to an existing camera scene node.
CameraStereoView& camera(CameraStereoView::create(*camNode));

// Configure an off-target multiview buffer: two views rendered to offscreen
// RGBA color targets that share their depth-stencil surfaces and frame buffer.
mvbParams.viewCount = 2;
mvbParams.type = MVBufferType_Offtarget;
mvbParams.offtarget.colorFormat = ImageFormat_RGBA;
mvbParams.offtarget.sharedDepthStencilTargets = true;
mvbParams.offtarget.sharedFrameBuffer = true;
Listing 12.2. Projection and view matrix management for stereo rendering.
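A minimal sketch of the idea behind Listing 12.2: the stereo camera caches a projection and view matrix per view and returns the pair selected by the active view index (the mActiveView member of the original listing). The Matrix4 type, the class shape, and the remaining names are illustrative only, not the engine's actual interface.

struct Matrix4 { float m[16]; };

class CameraStereoView
{
public:
    enum { ViewLeft = 0, ViewRight = 1, ViewCount = 2 };

    void setActiveView(unsigned char view) { mActiveView = view; }

    // Matrices are cached per view so that switching the active view is cheap,
    // which matters when each object is rendered to all views back to back.
    const Matrix4& projectionMatrix() const { return mProjection[mActiveView]; }
    const Matrix4& viewMatrix() const       { return mView[mActiveView]; }

protected:
    unsigned char mActiveView = ViewLeft;
    Matrix4 mProjection[ViewCount];
    Matrix4 mView[ViewCount];
};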
The remainder of this section focuses on the special case of stereo cameras
and describes the basic parameters that can allow easy and intuitive manipulation
of the camera matrices. The concepts described in this section can be extended to
multiview configurations that require more views.
As shown in Figure 12.6, different approaches can be applied when creating
stereo image pairs. In the figure, d denotes the separation between the eyes. Par-
allel and oriented frustums use the basic symmetrical perspective projection setup
for each view and offset the individual camera positions (and the camera orienta-
tions, in the oriented frustum case). The skewed frustum approach modifies the
perspective projection matrix instead of updating the camera orientation, so the
two image planes are parallel to the zero-parallax plane.
The camera view position offset depends on the right direction vector of the
base multiview camera and the d parameter. If the base camera is assumed to be
in the middle of the individual viewpoints, the offset distance is simply d/2 for
each view. To generate skewed frustum projection pairs, assuming that the frus-
tum can be specified with left-right, bottom-top, and near-far values, as is done
Figure 12.6. Basic approaches for setting up stereo camera projection frustum pairs: parallel frustums (no focal distance), oriented frustums, and skewed (asymmetric) frustums.
for the OpenGL function glFrustum(), only the left and right values need to be
modified. The offset Δx for these values can be calculated using the formula

Δx = dn / (2f),
where n is the distance to the near plane, and f is the focal distance. Note that the
projection skew offset Δx is added for the right camera and subtracted for the left
camera.
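A direct transcription of this offset into code might look like the following sketch; the function name and the way the frustum values are passed are illustrative only.

// Apply the stereo skew offset Δx = dn/(2f) to the glFrustum()-style left and
// right values of a symmetric frustum. Pass +1 for the right camera and -1 for
// the left camera, following the sign convention described above.
void SkewFrustumForEye(float d,        // eye separation
                       float n,        // near plane distance
                       float f,        // focal distance
                       int   eyeSign,  // +1 for the right camera, -1 for the left
                       float& left, float& right)
{
    const float dx = d * n / (2.0f * f);
    left  += eyeSign * dx;
    right += eyeSign * dx;
}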
Figure 12.7 shows a simple diagram that can be used to derive the relationship among the angle shift s, the eye separation d, the half field-of-view angle θ, and the focal distance f. By using trigonometric relationships and the fact that the lines intersect at the point p in the figure, the following equation can be derived:

f = d (1 − tan²θ tan²s) / (2 tan s (1 + tan²θ)).
As expected, a smaller angle shift s results in a larger focal distance, and the eye
separation parameter d affects the focal distance linearly.
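Transcribed into code under the same definitions (angles in radians), the relationship might look like this; the function name is illustrative.

#include <cmath>

// Focal distance implied by the angle shift s, the eye separation d, and the
// half field-of-view angle theta, following the formula derived above.
float FocalDistanceFromAngleShift(float d, float theta, float s)
{
    const float t2 = std::tan(theta) * std::tan(theta);
    const float ts = std::tan(s);
    return d * (1.0f - t2 * ts * ts) / (2.0f * ts * (1.0f + t2));
}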
Figure 12.7. Derivation of the relationship between the angle shift s, the eye separation d,
and the focal distance f.
An on-target multiview buffer uses the attached viewport’s render surface instead
of creating any new (offscreen) surfaces. A final compositing phase may not be
needed when an on-target multiview buffer is used because the multiview render-
ing output is stored in a single surface. The rendering pipeline can still be spe-
cialized for per-view operations using the multiview compositor attachments. For
example, to achieve on-target anaglyph-based rendering, an attached compositor
can select per-view color write modes, in turn separating the color channels of
each view, or a compositor can select different rendering regions on the same
surface. Also, OpenGL quad-buffer stereo mode can be automatically managed
as an on-target multiview buffer since no additional surfaces need to be set up
other than the operating system window surface, and its usage depends on the target surface's native support for left and right view buffering.
An off-target multiview buffer renders to internally managed offscreen sur-
faces instead of the attached viewport’s render surface, and it can thus be config-
ured more flexibly and independently from the target viewport. Offscreen
rendering, inherent in off-target multiview buffers, allows rendering the content
of each view to different surfaces (such as textures). The application viewport
surface is later updated with the composite image generated by the attached mul-
tiview compositor. If the composition (merge) step of the multiview display de-
vice requires that complex patterns be sampled from each view, as is common in
lenticular-based displays, or if the per-view outputs need to be stored in separate
resources with different configurations (sizes, component types, etc.) as a mul-
tiview optimization step, using an off-target multiview buffer is required.
Some additional aspects of off-target buffer configurations are the following:
Since the multiview buffers use GPU textures to store render results, the mul-
tiview compositors can process the texture data on the GPU with shaders, as
shown in Listings 12.3 and 12.4. Using a shader-driven approach, the view buff-
ers can be upsampled or downsampled in the shaders, using available texture fil-
tering options provided by the GPUs (such as nearest or linear filtering).
in vec2 vertexIn;
out vec2 textureCoord;
void main()
{
textureCoord = vertexIn.xy * 0.5 + 0.5;
gl_Position = vec4(vertexIn.xy, 0.0, 1.0);
}
Listing 12.3. A sample vertex shader for a parallax-based multiview rendering composition phase.
uniform sampler2D viewL;
uniform sampler2D viewR;
in vec2 textureCoord;

void main()
{
    vec4 colorL = texture2D(viewL, textureCoord);
    vec4 colorR = texture2D(viewR, textureCoord);

    // Interleave the two views column by column, a simple parallax pattern.
    gl_FragColor = (mod(gl_FragCoord.x, 2.0) < 1.0) ? colorL : colorR;
}
Listing 12.4. A sample fragment shader for a parallax-based multiview rendering composition
phase.
After all objects are rendered to all of the views, the multiview compositor for
the viewport can process the view outputs and generate the final multiview im-
age.
Once the rendering requirements of an object-view pair are known, there are
two options for rendering the complete scene, as shown in Figure 12.8. In the
first case, a specific view is activated only once, and all of the visible objects are
rendered for that view. This process is continued until all of the views are com-
pleted, and such an approach keeps the frame target “hot” to avoid frequent
frame buffer swapping. In the second case, each object is activated only once,
and it is rendered to all of the views sequentially, this time keeping the object “hot.”
This approach can reduce vertex buffer or render state switches if view-specific
geometry/render state data is not set up. Also, with this approach, the camera
should cache projection and view matrix values for each view since the active
view is changed very frequently. Depending on the setup of the scene and the
number of views, the fastest approach may differ. A mixed approach is also pos-
sible, where certain meshes in the scene are processed once into multiple views
and the rest are rendered as a view-specific batch.
Figure 12.8. The two rendering orders: for each view, render all objects in the scene (left); for each object in the scene, render it to all views (right).
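In pseudocode terms, the two orderings of Figure 12.8 differ only in which loop is outermost; the types and helpers below are placeholders for whatever the engine provides.

#include <vector>

struct View   { /* per-view camera parameters */ };
struct Object { /* mesh, material, transform */ };

void activateView(const View&) { /* set projection, view matrix, and viewport */ }
void draw(const Object&)       { /* issue the draw call for the active view */ }

// First ordering: keep the render target "hot" by finishing one view at a time.
void RenderViewMajor(std::vector<View>& views, std::vector<Object>& objects)
{
    for (View& v : views)
    {
        activateView(v);
        for (Object& o : objects)
            draw(o);
    }
}

// Second ordering: keep the object "hot" by rendering it to every view before
// moving on; this is cheap only if the camera caches its per-view matrices.
void RenderObjectMajor(std::vector<View>& views, std::vector<Object>& objects)
{
    for (Object& o : objects)
        for (View& v : views)
        {
            activateView(v);
            draw(o);
        }
}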
■ Most importantly, the same scene data is used to render the 3D scene (while
this can be extended by using a multiview object LOD system). As a result,
animations modifying the scene data need to be only applied once.
■ Object or light culling can be applied once per frame using a single shared
frustum for multiview camera objects, containing the frustums of view-
specific internal cameras.
this parameter to select the vertex data and the rendering state of an object. Lev-
el-of-detail can then be implemented in one or more of the following ways:
12.9 Discussion
Multiview Scene Setup
Our architecture supports customization and extensible parameterization, but
does not further provide guidelines on how to set the multiview camera parame-
ters and scene in order to achieve maximum viewing comfort. In the first volume
of Game Engine Gems, Hast [2010] describes the plano-stereoscopic view mech-
anisms, common stereo techniques such as anaglyph, temporal multiplexing
(shutter glasses), and polarized displays and discusses their pros and cons. Some
key points are that contradicting depth cues should be avoided and that special
care needs to be directed at skyboxes and skydomes, billboards and impostors,
GUIs, cursors, menus in virtual 3D space, frame rate, view synchronization, and
scene-to-scene camera setup consistency (such as focal distance). Viewers may
have different eye separation distances and display sizes, and the distance of the
viewer to the display can differ among different platforms. It should be kept in
mind that creating the right 3D feeling is a process that requires a scalable tech-
nical infrastructure (as presented in this chapter) and an analysis of the target
platforms, the virtual scene, and animations.
Postprocessing Pipelines
Post-processing pipelines are commonly used, and their adaptation to multiview
rendering can present a challenge. Most of the post-processing filters use spatial
information about a fragment to calculate the output. The spatial information is
partly lost when different views are merged into a single image. Thus, applying
the same post-processing logic to the single composited image may not produce
the expected output. If spatial data is not used, such as in color filters, the post-
processing can natively interact with the results in separate views. However, fil-
ters like high dynamic range and bloom may interact with spatial data and special
care may need to be taken [Hast 2010]. In our architecture, the post-processing
logic can be integrated into multiview compositor logic (shaders) to provide an-
other rendering pass optimization.
3D Video Playback
To be able to play back 3D video over our architecture, it is possible to send the
decoded 3D video data for separate views to their corresponding multiview buff-
er color render targets and specify the composition by defining your own mul-
tiview compositors. It is also possible to skip the multiview buffer interface and
perform the composition work directly using the decoded video data inside the
multiview compositor merge routines.
Acknowledgements
This project has been supported by 3DPHONE, a project funded by the European Union
EC 7th Framework Programme.
References
[Bowman et al. 2004] Doug A. Bowman, Ernst Kruijff, Joseph J. LaViola, and Ivan
Poupyrev. 3D User Interfaces: Theory and Practice. Reading, MA: Addison-
Wesley, 2004.
[Bulbul et al. 2010] Abdullah Bulbul, Zeynep Cipiloglu, and Tolga Çapın. “A Perceptual
Approach for Stereoscopic Rendering Optimization.” Computers & Graphics 34:2
(April 2010), pp. 145–157.
[Dodgson 2005] Neil A. Dodgson. “Autostereoscopic 3D Displays.” Computer 38:8
(August 2005), pp. 31–36.
[Hast 2010] Anders Hast. “3D Stereoscopic Rendering: An Overview of Implementation
Issues.” Game Engine Gems 1, edited by Eric Lengyel. Sudbury, MA: Jones and
Bartlett, 2010.
[Pellacini 2005] Fabio Pellacini. “User-Configurable Automatic Shader Simplification.”
ACM Transactions on Graphics 24:3 (July 2005), pp. 445–452.
[Stelmach et al. 2000] L. Stelmach, Wa James Tam, D. Meegan, and A. Vincent. “Stereo
Image Quality: Effects of Mixed Spatio-Temporal Resolution.” IEEE Transactions
on Circuits and Systems for Video Technology 10:2 (March 2000), pp. 188–193.
[Yang et al. 2009] Baoguang Yang, Jieqing Feng, Gaël Guennebaud, and Xinguo Liu.
“Packet-Based Hierarchal Soft Shadow Mapping.” Computer Graphics Forum
28:4 (June–July 2009), pp. 1121–1130.
13
3D in a Web Browser
Rémi Arnaud
Screampoint Inc.
are still growing despite the internet bubble bursting circa 2000. Currently, there
are over two dozen web browsers available for virtually all platforms, including
desktop and laptop computers, mobile phones, tablets, and embedded systems.
The browser war started in 1995, and Microsoft (with Internet Explorer) won the
first round against Netscape to dominate the market by early 2000. The browser
wars are not over as Google (Chrome), Mozilla (Firefox), Opera (Opera) and
Apple (Safari) are now eroding Microsoft’s dominance.
During the same period of time, 3D has grown significantly as a mass-market
medium and has generated large revenues for the entertainment industry through
games and movies. 3D display systems have materialized in movie theaters and
generate additional revenues. So the question remains: Why has VRML/X3D not
had the same pervasive path as HTML? The web is filled with tons of opinions as
to why this did not work out. (Note: X3D is still being proposed to the W3C
HTML Working Group for integration with HTML 5.) Mark Pesce himself of-
fered his opinion in an interview published in 2004, ten years after introducing
VRML to the WWW conference [2]:
This comment is of particular interest in the context of this book since its
target audience is game developers. According to Mark Pesce, game developers
should have been all over VRML and creating content for it. Indeed, content is
what makes a medium successful, and games represent a significant amount of
3D interactive content, although not all games require 3D graphics.
Game developers are important because they are recognized for pushing the
limits of the technology in order to provide the best possible user experience.
Game technology needs to empower artists and designers with tools to express
their creativity and enable nonlinear interactive storytelling that can address a
good-sized audience and build a business case for 3D on the web. Game devel-
opers do not care if a technology is recognized by ISO as a standard. They are
more interested in the availability of tools they can take immediate advantage of,
and they require full control and adaptability of the technology they use.
The problem is that these design goals have not been proven to be universal
or able to solve the needs for all representations of and interaction with 3D con-
tent. In fact, another scene graph technology was developed at SGI at the same
time as Open Inventor in 1991: Iris Performer [5]. Its principal design goals were
to allow application developers to more easily obtain maximum performance
from 3D graphics workstations featuring multiple CPUs and to support an imme-
diate-mode rendering library. In other words, both Open Inventor and Iris Per-
former used a scene graph technology, but one was designed for performance and
the other for object-oriented user interactivity:
SGI tried several times to combine both the performance of Performer and
usability of Open Inventor. First, SGI introduced the Cosmo 3D library, which
offered an Open Inventor-style scene graph built on an Iris Performer-style low-
level graphics API. After the first beta release, SGI joined with Intel and IBM to
push OpenGL++ on the OpenGL architecture review board (ARB) [6] as a stand-
ard scene graph layer on top of OpenGL that could be used to port Performer or
Inventor (or Optimizer, yet another scene graph for the computer-assisted design
(CAD) market). The ARB was also interested in seeing OpenGL++ become a
standard for the web. This project died when SGI turned their attention to an al-
most identical project with Microsoft named Fahrenheit. The idea was that SGI
would focus on the high-level API (scene graph) while Microsoft worked on the
low-level Fahrenheit (FLL) API that would eventually replace both OpenGL and
Direct3D. Fahrenheit was killed when it became clear Microsoft was playing SGI
and was instead focused on releasing DirectX 7 in 1999 [7].
Sun Microsystems was also interested in creating a standard scene graph API
that could be universal and bridge desktop applications with web applications:
Java3D [8], based on the cross-platform Java development language and run-time
library. The first version was released in December 1998, but the effort was dis-
continued in 2004. It was restarted as a community source project but then “put
on hold” in 2008 [9]. Project Wonderland, a Java virtual world project based on
Java3D, was ported to the Java Monkey Engine (jME) API, a Java game engine
used by NCSoft that produced better visuals and performance and reduced design
constraints.
Today, game engines are closer in design to Iris Performer than to Open In-
ventor. For instance, game engines provide offline tools that can preprocess the
data into a format that is closer to the internal data structures needed on the target
platform, thus eliminating complex and time-consuming data processing in the
game application itself in order to save precious resources, such as CPU time,
memory, and user patience. Also, game engines often create interactivity and
user interfaces with native programming through scripting languages such as
Lua. This provides maximum flexibility for tuning the user experience without
having to spend too much time in designing the right object model and file
format.
This new business has evolved very fast, so fast that major game publishers
are having trouble adapting from a culture of management of mostly multimillion
dollar, multiyear-development AAA titles for game consoles to a very different
culture of low-budget games distributed and marketed primarily through social
networking. Some prominent game developers predict this current trend will
have a profound impact on the overall game industry business model [14]. The
video game industry has been through massive changes in the past, including a
crash in 1984. This time, the industry experts are not predicting a crash but a
massive shift that is poised to push more game publishing onto the web.
The need for superior interactive animated graphical content running in a
web page has been growing since the introduction of HTML. In 1997, Macrome-
dia (part of Adobe since April 2005) released Flash 1.0, commonly used to create
animations and advertisements, to integrate video into web pages such as
YouTube, and more recently, to develop rich internet applications (RIAs). Flash
is a growing set of technologies that includes an editing tool, a scripting language
closely related to JavaScript called ActionScript, a file format called .swf, and a
browser plug-in available for many platforms.
In order to visualize Flash content, a plug-in needs to be installed by the cli-
ent in the web browser. In order to maximize their market reach, Macromedia
provided the plug-in for free and worked out several deals to ensure the plug-in
(a.k.a. Flash Player) came preinstalled on all computers. In 2001, 98% of web
browsers came preinstalled with the Flash player (mostly because market-
dominant Microsoft Internet Explorer included the Flash player), so that users
could directly visualize Flash content.
This created enough fertile ground for games to start spreading on the web,
and the term “Flash game” quickly became popular. Thousands of such games
exist today and can be found on aggregator sites such as addictinggames.com,
owned by Nickelodeon. Some games created with the first releases of Flash, such
as Adventure Quest (launched in 2002, see Figure 13.2), are still being updated
and are played by thousands.
Most of the games created with Flash are 2D or fake 3D since Flash does not
provide a 3D API. Still, there are several 3D engines, open source and commer-
cial, that have been developed in ActionScript and are used to create 3D games.
We look into this in more detail in Section 13.3.
Another technology created at about the same time as Flash, with the goal of
enhancing web development, is Java, first released in 1996. Java is a program-
ming language for which a virtual machine (VM) executes the program in a safe
environment regardless of the hardware platform and includes a just-in-time (JIT)
compiler that provides good performance. Unfortunately, Sun Microsystems did
not enjoy a good relationship with Microsoft, since they saw Java as a competitor rather than an enhancement to Microsoft's products [15], and it was usually the case that a plug-in had to be installed. Unlike Flash, Java does offer bindings to 3D hardware acceleration and therefore offers much better performance than Flash. We explore this technology in Section 13.4.

Still, if a plug-in has to be downloaded, why not package a real game engine as a browser plug-in in order to provide the best possible performance and user experience? There is a reduction in the addressable market because the game distribution websites have to agree to support the technology, and the user has to download and install the plug-in, as well as agree to the license, but this may be the only viable technology available immediately to enable the 3D experience. Several game engines are available as plug-ins, and these are discussed in Section 13.5.

One way to deal with the plug-in situation is to improve the mechanism by which a browser is extended, and that is what Google Native Client is about. It also introduces a hardware-accelerated graphics API secure enough to be included in a web browser. This new technology is explored in Section 13.6.

Recently, HTML5 was pushed to the front of the scene, specifically when Apple CEO Steve Jobs took a public stand about why he doesn't allow Flash on
Apple’s mobile platforms [16] and instead pushed for either native applications
or HTML5 technologies. The HTML5 suite of standards is not yet published, but
some portions are considered stable, such as the Canvas 3D API, which provides
graphics hardware acceleration to JavaScript. Canvas 3D is now implemented as
the WebGL API [17], a standard developed by the Khronos Group, and is a close
adaptation of OpenGL ES (another Khronos Group standard) available for mo-
bile phones (such as iPhone/iPad, Android, etc.). The Khronos Group is also the
home of the COLLADA standard, which, coupled with WebGL, provides a
standard solution for both content and an API for 3D on the web [18]. At the time
of this writing, WebGL has not yet been released to the general public, but we
describe it in Section 13.7.
Figure 13.3. PreFab3D, an Adobe AIR tool for Away3D. See http://www.closier.nl/prefab/. (Image © Fabrice Closier.)
can bake lights and shadows into textures using ray tracing, create low polygon models with normal maps, refine UV mapping, and apply real-time shading techniques.
■ Alternativa3D. This company develops and licenses a 3D engine for Flash with support for multiuser real-time connection of clients and players. Their commercial product, now in version 5.6.0, is aimed at game developers. They offer a multiplayer tank battle game (see Figure 13.4) as a live demonstration of the capabilities of their engine.
All of these projects have a very active developer community,1 and they provide tutorials, samples, documentation, and books [20, 21]. They enable loading of 3D assets using COLLADA and 3DS Max formats. The engine's capabilities are related to the performance of the virtual machine, which does not provide any native 3D hardware acceleration. Flash 10 introduced some hardware-accelerated features that are used to accelerate composition and rendering on a 2D surface.
1 At the time of this writing, Away3D and Alternativa3D are the preferred solutions for performance and features because they have a more active development community, but this is not a measure of future performance.
package
{
import org.papervision3d.Papervision3D;
import org.papervision3d.cameras.*;
import org.papervision3d.materials.*;
import org.papervision3d....
// add
addChild(viewport);
// create a renderer
renderer = new BasicRenderEngine();
model.addEventListener(FileLoadEvent.LOAD_COMPLETE,
OnModelLoaded);
DAE(model).load("duck_triangulate.dae");
}
/**
* show model once loaded
*/
private function OnModelLoaded(e : FileLoadEvent) : void
{
e.target.removeEventListener(FileLoadEvent.LOAD_COMPLETE,
OnModelLoaded);
scene.addChild(model);
}
/**
* Render!
*/
private function handleRender(event : Event = null) : void
{
// orbit the camera
camera.orbit(_camTarget, _camPitch, _camYaw, _camDist);
// render
renderer.renderScene(scene, camera, viewport);
}
}
}
Listing 13.1. Papervision3D code snippet for loading and displaying a COLLADA model.
Figure 13.6. Poisonville, a jME Grand Theft Auto-like game launched from the web, but running in its own (small) window. See http://poisonville.bigpoint.com/. (Image © Bigpoint.)
probably ignore the high-level scene graph technologies that limit the game en-
gine to a predefined behavior, narrowing down our choice to JOGL. This wrap-
per library provides full access to the APIs in the specifications for OpenGL 1.3–
3.0, OpenGL 3.1 and later, OpenGL ES 1.x, and OpenGL ES 2.x, as well as near-
ly all vendor extensions. It is currently an independent open source project under
the BSD license [23]. Note that other similar bindings have also been created for
audio and GPU compute APIs—OpenAL (JOAL) and OpenCL (JOCL) are
available on the same web page as JOGL.
This is the choice made by the few game engines that have been built with
this technology, such as the jME, now in version 3 (jME3), which can run inside the browser as an applet or be launched as an external JVM application that can run fullscreen.
ation with Java on the web, but the jME is probably the only mature technology
available to make games with Java plus JOGL (see Figure 13.6).
Despite its long existence, its advanced technology providing 3D hardware
acceleration, and both its desktop and mobile availability, Java has had more suc-
cess as a language and framework to run on servers than it has on clients, and it
has had no real success in the game space. The confusion created by the Sun-
Microsoft dispute over the Java technology (which created incompatibilities), the
confusion as to which style of 3D graphics API to use, the perceived lack of sup-
Figure 13.7. JavaFX, a new language running inside a JVM. (Image © 2010 Oracle Corporation.)
port for OpenGL on Windows, and the very poor support by integrated chipsets
[24] may have, until recently, caused a lot of damage to Java’s popularity. Java is
also handicapped by its 80 percent installed base (compared to 97 percent for
Flash [25]), vulnerabilities, large download size (about 15 MB), and long startup
times.
The main issue seems to be the design of Java itself. It is very structured and
object-oriented but is not well suited for the creation of applications with intense
graphical user interfaces. Web programming requires more dynamic structures
and dynamic typing. Enter JavaFX [26], a completely new platform and language
(see Figure 13.7) that includes a declarative syntax for user interface develop-
ment. However, there is no clear path stated about when 3D will become part of
this new JVM technology, so JavaFX is not suitable for writing 3D games for the
web for the time being.
and mobile platforms. Therefore, the opportunity was available for new companies with a new business model to provide those engines. Engines such as Unity and ShiVa (see Figure 13.8) are already powering hundreds of games on the web or natively on Windows, Mac OS, Linux, and mobile platforms such as iPhone, iPad, Android, and Palm. The business model is adapted to the market: the plug-in is always free and small in size (Unity is 3.1 MB, and ShiVa is 700 kB), there are no royalties, there is a completely free entry-level or learning edition, and upgrades to the professional full-featured versions cost less than $2,000. So even the most expensive version is well within the budget of any game developer and is in the same range as a modeling tool that is necessary to create the 3D content to be used in the game.

Responding to game development team needs, these two engines also offer asset management servers and collaboration tools. The motto for these engines dedicated to web and mobile development is ease-of-use, cost effectiveness, and performance. They give a new meaning to cross-platform development, which used to mean developing a game across all the game consoles. The success of ShiVa and Unity is bringing additional competition from the engines that used to serve exclusively the console game developers (Xbox 360, PlayStation 3, and Wii), such as the recently announced Trinigy WebVision engine, but it is not clear whether they will be able to adopt the same low price model or whether they bring enough differentiation to justify a major difference in price. In fact,
Figure 13.8. ShiVa Editor with one-click play button. See http://www.stonetrip.com/. (Image © 2010 Stonetrip.)
it could very well be the other way around, as ShiVa and Unity (both already on
Wii) could be made available for cross-console development in the future.
Regarding developing and delivering 3D games for the web, those two tech-
nologies today provide by far the best performance and quality compared to the
other technologies studied in this chapter. They also provide essential tools that
provide ease-of-use, real-time feedback, editing, and tuning that are mandatory
for the productivity demanded by short schedules and tight budgets in web de-
velopment. But they also enable deployment of the game outside of the web as
mobile phone and tablet native applications, as well as PC/Mac standalone appli-
cations. (However, PC/Mac versions of these engines do not offer the same level
of quality as PC-specific engines.) Except for the market-limiting fact that on the
web the user has to install a plug-in, the cross-platform aspect of those technolo-
gies is sometimes the only choice that makes sense when publishing a game on
both the web and mobile devices in order to grow potential revenues without
multiplying the development cost by the number of target platforms. As the ven-
dors of those engines are focusing on ease-of-use and are very responsive to
game developer needs, the results are quite astonishing in quality in terms of
what non-highly-specialized developers can do in a few weeks.
To maximize flexibility and performance, Unity is taking advantage of the
JIT compiler technology of .NET (Mono for cross-platform support) and provid-
ing support for three scripting languages: JavaScript, C#, and a dialect of Python.
ShiVa is using a fast version of Lua. Scripts are compiled to native code and,
therefore, run quite fast. Unity scripts can use the underlying .NET libraries,
which support databases, regular expressions, XML, file access and networking.
ShiVa provides compatibility with JavaScript (bidirectional interaction), PHP,
ASP, Java, etc., using XML (send, receive, and simple object access protocol).
Integration with Facebook is also possible, opening the door to the half-
billion customers in that social network. This is done through an integration with
Flash,2 taking advantage of the fact that Flash is on 97 percent of the platforms
and available as supported content by most websites.
As a tip: don’t miss the Unity statistics page (see Figure 13.9) that provides
up-to-date information about what hardware platforms are being used. This pro-
vides a good indication of the range of performance and types of hardware used
to play 3D games in a browser. (Note that these statistics show that most web
game players have very limited GPUs and would not be able to run advanced
engines from Epic or Crytek.)
2
For example, see http://code.google.com/p/aquiris-u3dobject/.
Figure 13.9. Unity graphics card statistics for Q2 2010. See http://unity3d.com/webplayer/hardware-stats.html. (Image © 2010 Unity Technologies.)
Figure 13.11. AS3 versus JavaScript (JS) performance test (Data from http://www.
JacksonDunstan.com/articles/618, March 2010).
Even if it is not the fastest in all cases, Flash provides consistent and good performance regardless of which browser it is running inside. Flash content can require the browser to update the Flash plug-in to the AS3-capable version if needed, but there is no such mechanism for native JavaScript because the entire browser would have to be upgraded.
So equipped with a fast scripting language and a strong desire to add hard-
ware-accelerated 3D for web application developers, Google, Apple, Mozilla,
and Opera announced during Siggraph 2009 that they would create the WebGL
working group under the intellectual property (IP) protection umbrella of the
Khronos Group [17] and joined the many working groups already working on
graphics standards, such as OpenGL, OpenCL, and COLLADA. When Vladimir
Vukićević (Mozilla) was thinking about where to create the new standard work-
ing group he basically had two choices: the W3C, home of all web standards, and
Khronos, home of the graphics standards. Since his group was composed of web
browser specialists, he thought they should join the standard body, where they
could meet with graphic specialists, because expertise in both areas is required to
create this new standard. This also enables complementary standards to be used
conjointly and solve a bigger piece of the puzzle, such as how COLLADA and
WebGL can be used together to bring content to the web [18].
Technically speaking, WebGL is an extension to the HTML canvas element
(as defined by the W3C’s WHATWG HTML5 specification), being specified and
standardized by the Khronos Group. The HTML canvas element represents an
element on the page into which images can be rendered using a programmatic
interface. The only interface currently standardized by the W3C is the Can-
vasRenderingContext2D. The Khronos WebGL specification describes another
interface, WebGLRenderingContext, which faithfully exposes OpenGL ES 2.0
functionalities. WebGL brings OpenGL ES 2.0 to the web by providing a 3D
drawing context to the familiar HTML5 canvas element through JavaScript ob-
jects that offer the same level of functionality.
This effort proved very popular, and a public mailing list was established to
keep up general communication with the working group, working under strict IP
protection, which was quite a new way of functioning for the Khronos Group.
Even though the specification has not yet been released, several implementations
are already available for the more adventurous web programmers, and at Sig-
graph 2010, a dozen applications and frameworks were already available for
demonstration. This is quite an impressive community involvement effort, indi-
cating the large interest in having 3D acceleration without the need for a plug-in.
After the demonstration, “finally” was whispered in the audience, since some
have been waiting for this since the first HTML demo was made almost two dec-
ades ago.
The inclusion of native 3D rendering capabilities inside web browsers, as
witnessed by the interest and participation in the Khronos Group’s WebGL work-
ing group, aims at simplifying the development of 3D for the web. It does this by
eliminating the need for a 3D web plug-in and for a nontrivial user download with manual installation before any 3D content can be viewed by the user.
When creating a 3D game for the web, graphics is fundamental but is only a
subset of the full application. Other features are to be provided by various
nongraphics technologies that together form the HTML5 set of technologies that
web browsers are implementing (see Figure 13.12). In order to bring data into the
web browser, the game application will have to either embed the content in the
HTML page or use the XMLHttpRequest API to fetch content through the web
browser. Unfortunately, the programmer will also have to obey the built-in secu-
rity rules, which in this case restricts access to content only from the same server
from which the HTML page containing the JavaScript code was obtained. A pos-
sible workaround then needs to be implemented on the server in order to request
external content, possibly through a simple script that relays the request.
One major issue with JavaScript is the fact that the entire source code of the
application is downloaded to the browser. There are utilities to obfuscate the code, which also makes it impossible to debug, but obfuscated code is still not too hard to reverse engineer. This is not ideal, since game developers are not necessarily happy to expose their source code.
Figure 13.12. Vladimir Vukićević (Mozilla) presents how WebGL sits in the context of
the HTML5 suite of standards at the Siggraph 2010 WebGL BOF.
Another issue with HTML5 is that there are a lot of hacks involved for cross-
browser compatibility. It is not clear if this issue will be resolved anytime soon,
so there is a need for libraries and other tools to isolate the developer from these
issues so 3D game development won’t be as painful on native HTML5. Flash,
Unity, and ShiVa are doing a good job at isolating the developer from those
browser compatibility issues.
WebGL is a cutting-edge technology with many things to be discovered be-
fore it can safely be used to develop 3D games for the web. The lack of tools for
game developers is probably the most problematic point because this makes it
impractical and certainly not cost effective. Many WebGL-supporting initiatives
are under way (and more are coming along every month), such as GLGE, Spi-
derGL, CopperLicht, the Seneca College Canvas 3D (C3DL) project, and the
Sirikata project. From these efforts and those as yet unforeseen, new and compel-
ling content will be developed.
13.8 Conclusion
The conclusion to this chapter is not the one I had hoped for. But the research is
clear: there is no ideal solution to what technology to use today to write 3D
games for the web.
Unity and ShiVa are a primary choice based on their performance, success
stories, and professional tools. But they still require the installation of a plug-in,
which may be a nonstarter because the client paying for the development of the
game may dictate that either Flash or no plug-in at all is to be used.
This is why Flash, especially if Adobe decides to add game-related features,
will always be viable in the near future. But 3D acceleration is not enough be-
cause there are many more features required by the game engine (including phys-
ics and artificial intelligence) that would have to be provided and optimized for
Flash, and Adobe may have no interest in developing these.
HTML5 definitely has a bright future. 3D has lots of usages besides games
that will be covered by WebGL, and the no-plug-in story will be of interest mov-
ing forward. But this is still in theory, and it will depend on how Internet Explor-
er maintains market share and whether they provide WebGL (right now, there is
no sign of this support). So that means WebGL will require a plug-in installation
on Internet Explorer, which defeats the purpose.
Google Native Client is perhaps the one technology I had not considered at
first, but it seems to be the most promising. It will provide optimum performance
and will require a plug-in in many cases, but that plug-in will be required by so many applications that it is likely to end up widely installed.
References
[1] XMediaLab. “Mark Pesce, Father of Virtual Reality Markup Language.” Video
interview. November 25, 2005. Available at http://www.youtube.com/watch?v=
DtMyTn8nAig.
[2] 3d-test. “VRML: The First Ten Years.” Interview with Mark Pesce and Tony
Parisi. March 21, 2004. Available at http://www.3d-test.com/interviews/media
machines_2.htm.
[3] Paul S. Strauss and Rikk Carey. “An Object-Oriented 3D Graphics Toolkit.”
Computer Graphics 26:2 (July 1992), pp. 341–349.
[4] Web3D Consortium. “Open Standards for Real-Time 3D Communication.”
Available at http://www.web3d.org/.
[5] John Rohlf and James Helman. “IRIS Performer: A High Performance
Multiprocessing Toolkit for Real-Time 3D Graphics.” Proceedings of Siggraph
1994, ACM Press / ACM SIGGRAPH, Computer Graphics Proceedings, Annual
Conference Series, ACM, pp. 381–394.
[6] OpenGL ARB. “OpenGL++-relevant ARB meeting notes.” 1996. Available at
http://www.cg.tuwien.ac.at/~wimmer/apis/opengl++_summary.html.
[7] Randall Hand. “What Led to the Fall of SGI? – Chapter 4.” Available at http://
www.vizworld.com/2009/04/what-led-to-the-fall-of-sgi-chapter-4/.
[8] Oracle. Maintenance of the Java 3D specification. Available at http://www.jcp.
org/en/jsr/detail?id=926.
[9] Sun Microsystems. “ANNOUNCEMENT: Java 3D plans.” 2008. Available at
http://forums.java.net/jive/thread.jspa?threadID=36022&start=0&tstart=0.
[10] The Khronos Group. “COLLADA – 3D Asset Exchange Schema.” Available at
http://khronos.org/collada/.
[11] Rémi Arnaud and Mark Barnes. COLLADA: Sailing the Gulf of 3D Digital
Content Creation. Wellesley, MA: A K Peters, 2006.
[12] Rémi Arnaud. “The Game Asset Pipeline.” Game Engine Gems 1, edited by Eric
Lengyel. Sudbury, MA: Jones and Bartlett, 2010.
[13] Tim Merel. “Online and mobile games should generate more revenue than console
games.” GamesBeat, August 10, 2010. Available at http://venturebeat.com/2010/
08/10/online-and-mobile-games-should-generate-more-revenue-than-console-
games/.
[14] Jason Rubin. “Naughty Dog founder: Triple-A games ‘not working’.” CVG,
August 3, 2010. Available at http://www.computerandvideogames.com/article.
php?id=258378.
[15] “Microsoft and Sun Microsystems Enter Broad Cooperation Agreement; Settle
Outstanding Litigation.” Microsoft News Center, April 2, 2004. Available at http://
www.microsoft.com/presspass/press/2004/apr04/04-02SunAgreementPR.mspx.
[16] Steve Jobs. “Here’s Why We Don’t Allow Flash on the iPhone and iPad.”
Business Insider, April 29, 2010. Available at http://www.businessinsider.com/
steve-jobs-heres-why-we-dont-allow-flash-on-the-iphone-2010-4.
[17] The Khronos Group. “WebGL – OpenGL ES 2.0 for the Web.” Available at
http://khronos.org/webgl/.
[18] Rita Turkowski. “Enabling the Immersive 3D Web with COLLADA & WebGL.”
The Khronos Group, June 30, 2010. Available at http://www.khronos.org/collada/
presentations/WebGL_Collada_Whitepaper.pdf.
[19] Steve Jenkins. “Five Questions with Carlos Ulloa.” Web Designer, September 22,
2009. Available at http://www.webdesignermag.co.uk/5-questions/five-questions-
with-carlos-ulloa/.
[20] Paul Tondeur and Jeff Winder. Papervision3D Essentials. Birmingham, UK:
Packt Publishing, 2009.
[21] Richard Olsson and Rob Bateman. The Essential Guide to 3D in Flash. New York:
friends of ED, 2010.
[22] 3D Radar. “Adobe Flash 11 to Get 3D Functionality.” Available at http://3dradar.
techradar.com/3d-tech/adobe-flash-11-get-3d-functionality-09-07-2010.
[23] JogAmp.org. The JOGL project hosts the development version of the Java Binding
for the OpenGL API (JSR-231). Available at http://jogamp.org/jogl/www/.
[24] Linden Research. “Intel chipsets less than a 945 are NOT compatible.” Quote from
“Second Life hardware compatibility.” Available at http://secondlife.com/support/
system-requirements/.
[25] Stat Owl. “Web Browser Plugin Market Share.” Available at http://www.statowl.
com/plugin_overview.php.
[26] Oracle. “Better Experiences for End-Users and Developers with JavaFX 1.3.1.”
Available at http://javafx.com/.
[27] Ars Technica. “Worldwide OS Share Trend.” Available at http://arstechnica.com/
microsoft/news/2010/08/windows-7-overtakes-windows-vista.ars.
[28] Ken Russell and Vangelis Kokkevis. “WebGL in Chrome.” Siggraph 2010
WebGL BOF, The Khronos Group. Available at http://www.khronos.org/
developers/library/2010_siggraph_bof_webgl/WebGL-BOF-2-WebGL-in-
Chrome_SIGGRAPH-Jul29.pdf.
14
2D Magic
Daniel Higgins
Lunchtime Studios, LLC
Magician street performers don’t require a big stage, scantily clothed assistants,
or saw boxes to amaze an audience. They are extraordinary in their ability to en-
tertain with a coin or a deck of cards. Many modern-day game developers are
like street performers and do amazing things on a smaller stage. They often work
on mobile platforms and in small groups instead of giant teams. Small budgets
and limited time are no excuses for developers producing subpar products, how-
ever. Limited resources mean that these teams must stretch what little they have
in order to produce magic on screen.
The magic described in this chapter is simple, powerful, and requires only a
small amount of graphics knowledge, such as how to construct and render a quad
[Porter and Duff 1984]. This isn’t an introduction to graphics programming, and
it isn’t about using shaders or the latest API from DirectX or OpenGL. Instead,
it’s a collection of easy-to-implement graphical tricks for skilled (non-graphics
specialist) programmers. In this article, we first explore very basic concepts that
are central to graphics programming, such as vertices, colors, opacity, and texture
coordinates, as well as creative ways to use them in two dimensions. Next, we
examine how to combine these building blocks into a powerful and complex
structure that allows for incredible 2D effects (with source code included on the
website). Finally, we conclude by conjuring our own magic and demonstrating
some simple examples of this structure in action.
fingers. Once we have a good understanding of the coin’s properties, we can de-
velop tricks that suit its potential. Likewise in graphics, our props are vertices.
Knowing what composes a vertex gives us clues about how to develop tricks that
harness its powers.
In its most basic form, a vertex is a point with properties that are used while
rendering. It’s graphics information for a drawing location somewhere on (or off)
the screen. In 2D, it’s most commonly used as one of four points that make up a
quad (two triangles sharing an edge) and is generally used in the rendering of text
or an image. A vertex contains position, color, opacity, and often texture coordi-
nates. Let’s examine each property in detail.
14.2 Position
A vertex’s position determines where it is rendered on the screen. Location on its
own may not seem interesting, but when combined with the other vertices of a
quad, we open up a window for scale, motion, and perspective effects.
Scale Effects
Imagine a boat race game where we want to display a countdown prior to the
start of a race. Displaying the count involves starting each number small, then
increasing its size over the course of one second. We accomplish this by altering
the position of each vertex in a quad relative to the others in such a way that the
screen space becomes larger. Consider dressing this up with some acceleration
modifications that affect how fast it scales. Does it overshoot its destination scale
and have to snap back, producing a wobbling jelly-like effect? Perhaps it starts
quickly and eases into its destination scale, like a car pulling into a parking spot.
Regardless of what acceleration modifier you choose, don’t choose linear
acceleration—how boring! Adding unique scaling techniques, such as the smooth
step shown in Figure 14.1, adds character to a game. Smooth step is just a simpli-
fication of a Hermite spline [Pipenbrinck 1998] but produces elegant results with
little CPU usage. Other functions to consider when modifying percentages are
sine, cosine, power, square, and cube. Once you determine a scaling acceleration
theme, develop a cohesive style by using it throughout the user interface (UI).
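As one way to drive such an animation, the countdown scale can be fed through the smooth step curve instead of a linear ramp; SetQuadScale is a hypothetical stand-in for whatever resize call the UI system provides.

// Smooth step: a cheap Hermite-style ease with zero slope at both ends.
float SmoothStep(float x)                  // expects x in [0, 1]
{
    return x * x * (3.0f - 2.0f * x);
}

void SetQuadScale(float scale) { /* hypothetical UI call that resizes the quad */ }

// Grow a countdown digit from 10% to 100% of its final size over one second.
void UpdateCountdownScale(float elapsedSeconds)
{
    float t = elapsedSeconds;
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    SetQuadScale(0.1f + 0.9f * SmoothStep(t));
}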
Motion Effects
Motion effects, such as sliding or bouncing a quad across the screen, are
achieved by applying a physics force to a quad (and eventually each vertex). We
are not limited to maintaining a flat border in 2D space. That is, the y coordinate
Figure 14.1. The smooth step function, smoothstep(x) = x²(3 − 2x), plotted over the range [0, 1].
for our
o top-left veertex and the y coordinate for our top-riight vertex doo not have to
be thhe same. We can just as eaasily alter thee orientation oof our objectss, spin them,
or fllip them upsid de down. Wee accomplish this by storinng an orientattion value in
the range
r 0, 2π on
o the quad. When
W the orieentation channges, the quadd updates the
posittion of each of o its four verrtices to be re lative to its nnew orientatioon. We could
also store the new orientatio on in a trannsform matrixx that is appplied before
rend
dering.
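A minimal sketch of the first approach, recomputing the four corner positions from a stored orientation (the structure and names are illustrative):

#include <cmath>

struct Vertex2D { float x, y; };

// Rotate the quad's corners around its center by 'orientation' radians.
// halfWidth and halfHeight describe the quad's unrotated extents.
void UpdateQuadOrientation(Vertex2D corners[4], float centerX, float centerY,
                           float halfWidth, float halfHeight, float orientation)
{
    const float c = std::cos(orientation);
    const float s = std::sin(orientation);

    // Unrotated corner offsets: top-left, top-right, bottom-right, bottom-left.
    const float offsets[4][2] =
    {
        { -halfWidth, -halfHeight }, {  halfWidth, -halfHeight },
        {  halfWidth,  halfHeight }, { -halfWidth,  halfHeight }
    };

    for (int i = 0; i < 4; ++i)
    {
        corners[i].x = centerX + offsets[i][0] * c - offsets[i][1] * s;
        corners[i].y = centerY + offsets[i][0] * s + offsets[i][1] * c;
    }
}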
Perspective Effects
A big advantage 3D has over 2D is lighting. In the 2D world, we're often stuck with the lighting rendered into the art during production. We shouldn't bemoan this problem, however; while we cannot change the lighting in the art, we can enhance it at run time. A perfect example of this is using shadows with our 2D objects. Consider what happens in Figure 14.2. Here we take a rendering of one of our characters, darken it, blur it, and squash its vertices, projecting it down to create a slightly angled shadow. The result is an almost magical, subtle effect that provides depth to our game world visuals.
Figure 14.2. (a) The original image. (b) A shadow generated by darkening and blurring its pixels. (c) Squashing the vertices to build perspective. (d) The final images rendered together.
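One way to sketch the vertex side of this trick, assuming the renderer modulates each vertex's color and opacity with the texture (all names are illustrative; the blur itself is applied to the shadow's texture, not to the vertices):

struct QuadVertex { float x, y, u, v, r, g, b, a; };

// Build a shadow quad from a character's quad: flatten it toward the ground
// line, skew it slightly for an angled look, and darken it. The shadow is
// drawn first, then the character is drawn on top of it.
void BuildShadowQuad(const QuadVertex source[4], QuadVertex shadow[4],
                     float squashFactor, float skew, float groundY)
{
    for (int i = 0; i < 4; ++i)
    {
        shadow[i] = source[i];

        // Squash toward the ground and push sideways in proportion to height.
        float heightAboveGround = groundY - source[i].y;
        shadow[i].y = groundY - heightAboveGround * squashFactor;
        shadow[i].x += heightAboveGround * skew;

        // Darken and make translucent; the texture's alpha still cuts the silhouette.
        shadow[i].r = shadow[i].g = shadow[i].b = 0.0f;
        shadow[i].a = 0.5f;
    }
}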
Colored UI Model
Optimization isn’t just for frame rates; its principles should be applied to a com-
pany’s development processes. With that in mind, look to vertex coloring as a
means for small, budget-conscious teams to reduce art pipelines, promote reuse
of textures, and develop consistency in UI design.
Traditionally, UI art is built in third-party tools and rendered verbatim to the
screen. When it’s time to render, each vertex is rendered white (no visible
change) and may use its alpha value as the vertex’s transparency. This has the
advantage that artists know what the outcome will be, but it does not promote art
reuse and requires third-party tools if color changes are needed. What if we want
to reuse much of our game’s art, save memory, or want our UI to adapt to the
player’s preferences? We need to look at our UI rendering differently than we
have in the past and move to the colored UI model.
The colored UI model involves building reusable pieces of UI in grayscale,
focusing on making them as white as possible, with darkened areas to indicate
shadow and depth, then coloring these pieces when rendering. This allows for
UIs to quickly change their color and opacity based on run-time data or be initial-
ized from data files during load. Consider looking into luminance format textures
for more performance improvements.
Black and white is all about predictability. Render a colorful texture to a
green-colored quad, and the result is predictably a modulation of source texel and
destination color but is likely not the programmer’s intended result. If, however,
our UI is black and white, we know that the ending colors will all be shades of
the original colors we applied through the vertex. The catch with this is that we
are changing only the purely white areas of the original texture to our desired
color, and all other final display colors are darker depending on the shade of gray
in the original texture.
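The modulation itself is just a per-channel multiply; a sketch of what the hardware effectively computes when a texel is combined with the interpolated vertex color:

struct Color { float r, g, b, a; };

// Standard modulate blend of a texel with the vertex color. A white texel
// (1, 1, 1) yields exactly the vertex color; a gray texel yields a darker
// shade of it, which is why grayscale art tints so predictably.
Color Modulate(const Color &texel, const Color &vertexColor)
{
    Color out;
    out.r = texel.r * vertexColor.r;
    out.g = texel.g * vertexColor.g;
    out.b = texel.b * vertexColor.b;
    out.a = texel.a * vertexColor.a;
    return out;
}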
The most common example of white art and vertex coloring is text rendering.
When rendering a string that uses a font where each character has been preren-
dered in white onto a transparent background, we have the ability to color text
dynamically. To accomplish this, we look up a character’s texture coordinates,
choose the color for each character, and then render it on a quad of that color.
Our result? Characters from the same font file are shown on screen but colored
differently, with no extra texture memory required. We now extend this reuse to
our entire UI.
Imagine building a sports game where each team has a light and dark color.
Our designer wants to connect our gamers with the identity of their team through
the use of team colors. We could build a uniquely colored UI for each team, but
that takes a lot of effort and resources. Instead, artists design a black and white
UI, with the knowledge that some UI pieces will use the light team colors and
other areas will use the dark team colors. Now we have one art set instead of
many, resulting in a savings in memory, art development time, and likely load
time as well.
This isn’t to say an image of a person somersaulting off a building should be
in black and white and tinted when rendered. Use white art and tint it when there
is an asset that can be reused if colored differently. Often, buttons, frames, bor-
ders, scroll bars, tab controls, and other basic UI components are easily colored
to fit into different scenes.
Interpolation
We’re not done with colored UIs yet. A white UI and colored vertices give us
another advantage, dynamic interpolation. Initialization time isn’t the only time
we alter color. Game-state changes we want to communicate should also affect
color. For example, if we have a button we want the player to press, we can make
it pulse back and forth between two colors (and opacities). Interpolating between
two color values using a parameter t in the range [0, 1] is easy using a simple ex-
pression such as the following:
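A standard per-channel linear interpolation serves here; a minimal sketch (the Color structure is illustrative):

struct Color { float r, g, b, a; };

// Linear interpolation between two colors; t = 0 gives 'from', t = 1 gives 'to'.
Color LerpColor(const Color &from, const Color &to, float t)
{
    Color out;
    out.r = from.r + t * (to.r - from.r);
    out.g = from.g + t * (to.g - from.g);
    out.b = from.b + t * (to.b - from.b);
    out.a = from.a + t * (to.a - from.a);
    return out;
}

Driving t with a value that ping-pongs between 0 and 1 over time produces the pulsing button described above.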
UI Frames
Using the wrap texture address mode is particularly useful when building reusa-
ble UI components. Consider UI objects such as window frames where sizes may
be unknown or where dynamic scaling is desirable. If we want to build a frame
that supports many sizes, we would construct art for each corner and then create
tileable sections for the top middle, bottom middle, left middle, and right middle.
In Figure 14.3, we demonstrate how we can render complete corner images hav-
ing (u, v) values from (0, 0) to (1, 1), but then repeat thin slices from our middle sections to create a dynamically sized box. We repeat the middle section by increasing its u or v value depending on its direction. For example, if the middle section of a frame should be 15 pixels in x, and the art is only 10 pixels wide, then a u value of 1.5 and a texture-addressing mode of wrap would repeat half of the texture.
Figure 14.3. Frame where the middle areas of the top, bottom, left, and right are repeatable.
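The repeat factor is simply the desired on-screen size divided by the tileable art's size; a small sketch of that calculation:

// Texture coordinate needed to tile a middle section across 'desiredPixels'
// of screen space when the tileable art is 'artPixels' wide (or tall). With
// the wrap addressing mode, a value of 1.5 repeats the art one and a half times.
float ComputeRepeatCoordinate(float desiredPixels, float artPixels)
{
    return desiredPixels / artPixels;   // e.g., 15 / 10 = 1.5
}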
Texture Scrolling
While most motion is done using the positions of vertices, UVs have their role,
especially with tileable textures. Given a repeatable (tileable) texture such as wa-
ter or sky and slowly adding to its texture coordinates, known as texture scroll-
ing, we create the appearance of a flowing river or clouds moving across a sky.
This is particularly effective when done at different speeds to indicate depth,
such as water scrolling slowly in the distance while a river close to the player’s
perceived location appears to scroll faster; this effect is known as parallax scroll-
ing [Balkan et al. 2003].
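A minimal sketch of scrolling several tileable layers at different speeds (the structure is illustrative):

struct ScrollingLayer
{
    float u, v;       // Current texture-coordinate offset.
    float speedU;     // Scroll speed in texture units per second.
    float speedV;
};

// Advance each layer's texture offset; distant layers use smaller speeds,
// which produces the parallax impression of depth.
void UpdateScrollingLayers(ScrollingLayer *layers, int count, float deltaSeconds)
{
    for (int i = 0; i < count; ++i)
    {
        layers[i].u += layers[i].speedU * deltaSeconds;
        layers[i].v += layers[i].speedV * deltaSeconds;

        // Keep offsets small; with wrap addressing the rendered result is identical.
        if (layers[i].u > 1.0f) layers[i].u -= 1.0f;
        if (layers[i].v > 1.0f) layers[i].v -= 1.0f;
    }
}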
Texture Sheets
Why waste texture space? Some graphics APIs require texture dimensions to be a power of two, and even when they don't, power-of-two sizes are often better for performance anyway. So what do we do with images that don't fit nicely into a power-of-two size? By building in a
simple addressing system, we put multiple objects in one texture (known as a
texture sheet) and store their coordinates in an atlas (such as an XML file). In
Figure 14.4, we show such a texture sheet from the mobile phone application
atPeace. The texture sheet is divided into many 32 × 32-pixel blocks, with each individual image's starting block, width, and height stored in the texture atlas, as in the following entry:

<TextureEntries>
  <Entry id="mm_sheephead" tex="m1.png">
    <BlockStartXY x="13" y="10"/>
    <BlocksWH x="1" y="1"/>
  </Entry>

When the final image is made, the background blocks are hidden, and the background is transparent. Anytime an image in the texture is needed, we use an ID and consult our texture atlas. There we find in which texture an image is located and what its texture coordinates are.
Having blocks of 32 × 32 pixels does waste some texture space, but the size is simple to understand and use. Luckily, there are numerous resources available online to help with generating texture sheets and their corresponding atlases [Ivanov 2006].
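Converting an atlas entry like the one shown earlier into normalized texture coordinates is just a scale by the block size; a sketch using a hypothetical entry structure:

struct AtlasEntry
{
    int blockStartX, blockStartY;   // Starting block within the sheet.
    int blocksWide, blocksHigh;     // Size of the image in blocks.
};

// Compute normalized (u, v) bounds for an entry in a texture sheet that is
// divided into 32 x 32-pixel blocks.
void ComputeAtlasUVs(const AtlasEntry &entry, int sheetWidthPixels, int sheetHeightPixels,
                     float &u0, float &v0, float &u1, float &v1)
{
    const float blockSize = 32.0f;
    u0 = (entry.blockStartX * blockSize) / sheetWidthPixels;
    v0 = (entry.blockStartY * blockSize) / sheetHeightPixels;
    u1 = ((entry.blockStartX + entry.blocksWide) * blockSize) / sheetWidthPixels;
    v1 = ((entry.blockStartY + entry.blocksHigh) * blockSize) / sheetHeightPixels;
}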
Enough of these basic properties! It is time to put everything together and make some magic.
14.5 What a Mesh!
As in life, building blocks can be combined into systems that become both incredible and complex. While vertices aren't exactly atoms or strands of DNA, they are powerful building blocks for graphics programming. They link together and form wonderful tapestries of interesting and powerful structures. One such structure is a mesh. Seen often in 3D games, a mesh allows programmers to manipulate many areas of a texture, resulting in breathtaking effects that far surpass
what can be done when simply rendering as a quad. These structures are formed through the dissection of a simple image, composed of four vertices, and subsequent partitioning into a mesh of many vertices, as shown in Figure 14.5.
The partitioning process creates a series of related quads known as a mesh. Each quad is responsible for rendering a portion of the texture, but it's up to the mesh to manage the quads intelligently. With the mesh holding these quads together, we can now apply a global intelligence across the image to produce a synchronized system of effects, such as flickering skies, flashlight effects, river ripples, trampoline physics, and water illusions.
So why not just build separate texture objects and let them act independently? Having a mesh gives us the ability to add structure, intelligence, and fine-grain control over our texture. Creating the illusion of a treasure chest beneath the water would be difficult without a mesh. If designed for easy vertex modification, a mesh makes seemingly impossible tasks like water easy to implement.
14.6 Mesh Architecture
Accompanying this chapter is a folder on the website full of goodies, including source code. We provide a fully implemented C++ mesh that is cross-platform and ready to be dropped into a new code base and used with minimal effort. Optimization was not the primary goal of this implementation, although recommendations for performance are included later in this chapter. Ease of understanding and extensibility were our focus. With minimal effort, new effects can be added to our mesh that go well beyond the several examples we provide here.
The mesh architecture consists of three important items: vertex, mesh, and
modifiers. First, we use a vertex class to hold each vertex’s position, color, opaci-
ty, and texture coordinates. Next, we store all our vertices inside a mesh class and
use the mesh class as a way to coordinate all the vertices, acting as a type of brain
for the overall image. Last, our modifiers do the dirty work and alter the mesh in
such a way that we conjure magic on the screen.
Mesh
The mesh is a class that contains a 2D array of vertex objects. Every place where
there is a line intersection in Figure 14.5, there exists a vertex. Using Cartesian
coordinates, a vertex can be quickly retrieved and modified. Our mesh design has
a base Mesh class and a derived MeshUV class. In the base implementation, we do
not support texture coordinates. It’s useful for times when we want to use colored
lighting without a texture, as demonstrated in our double rainbow example later
in this chapter.
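The real classes ship with the chapter's source code; the following is only a rough sketch of their shape (member names are illustrative):

#include <vector>

struct MeshVertex
{
    float x, y;         // Position as a percentage offset into the mesh.
    float r, g, b, a;   // Color and opacity.
};

// Base mesh: a 2D grid of vertices addressed by (column, row).
class Mesh
{
public:
    Mesh(int columns, int rows)
        : m_Columns(columns), m_Rows(rows), m_Vertices(columns * rows) {}

    MeshVertex &At(int column, int row)
    {
        return m_Vertices[row * m_Columns + column];
    }

protected:
    int m_Columns, m_Rows;
    std::vector<MeshVertex> m_Vertices;
};

// Derived mesh that adds texture coordinates for each vertex.
class MeshUV : public Mesh
{
public:
    MeshUV(int columns, int rows)
        : Mesh(columns, rows), m_TexCoords(columns * rows * 2, 0.0f) {}

protected:
    std::vector<float> m_TexCoords;   // Interleaved (u, v) per vertex.
};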
Space
There is a good argument to be made about whether a mesh should exist in
screen (or local) space or in world space. On the side of screen space, if a mesh
changes size or position, then we don’t need to recompute any of our vertices. On
the side of world space, everything with the mesh requires just a bit more work,
calculation, and CPU time on a frame-by-frame basis. That is to say, if we’re not
moving or scaling our mesh, then we’re wasting CPU time by going to and from
screen space. Our implementation assumes screen space; that is, the mesh con-
siders its top-left corner to be (0, 0), and vertices see their (x, y) positions as per-
centage offsets into the mesh. Therefore, only if we change the number of mesh
points per row or column do we need to rebuild our mesh from scratch.
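Under that assumption, resolving a vertex's percentage offset to a screen position at render time is a simple scale and offset; a sketch:

// Convert a vertex's percentage offset into screen coordinates for rendering.
// meshLeft and meshTop give the mesh's placement; meshWidth and meshHeight its size.
void VertexToScreen(float percentX, float percentY,
                    float meshLeft, float meshTop, float meshWidth, float meshHeight,
                    float &screenX, float &screenY)
{
    screenX = meshLeft + percentX * meshWidth;
    screenY = meshTop + percentY * meshHeight;
}

Moving or scaling the mesh changes only the placement values passed in; the stored vertex offsets remain untouched.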
State
Another contention point in mesh design is whether to build vertices that contain
state. Having state means that a vertex has two sets of properties, original and
working. The original properties could be considered the normal, “resting” prop-
erties that would make the texture appear as if it had no mesh. The working prop-
erties are copies of the originals that have been modified by an effect and, thus,
have altered the state of the vertex. The vertex then fights its way back towards
its original properties by using mesh-defined physics and other parameters to al-
ter its working set.
One side of the argument contends that without state, vertices are simple ob-
jects, and the entire system is easier to understand. The problem with this ap-
proach, however, is that combination effects can often get dicey, and we have to
recompute modifications on a vertex many more times than we would if there
was state.
The argument for having state is quite persuasive. If a vertex contains a des-
tination set of properties (position, color, etc.) and uses mesh parameters to
morph from their current states toward the destination states, then we end up with
a very consistent and predictable set of movements. It allows easy control of ver-
tex physics and provides the ability to apply a mesh-wide set of physical proper-
ties. It’s a “fire-and-forget” mentality where a force gets applied to the mesh, and
the mesh is responsible for reacting. Our result can be quite interesting consider-
ing that we can alter a texture’s physics on a mesh-wide basis. Take a texture that
appears as rubble and a texture that appears as cloth. Both can now react to the
same force in very different ways. A texture of rubble would be rigid and morph
little given a physics force, while a cloth texture would have drastic mesh-
morphing results. While both have their uses, for simplicity, our version uses the
stateless implementation, and we recompute our effects each rendering cycle. In
accordance with that, the burden of vertex modification is passed from vertex
physics onto modifier objects.
Modifiers
Modifiers are what the mesh is all about. After we split an image into a mesh, it
still looks exactly the same to the player. It’s when we apply modifier objects to
the mesh that we change its appearance with effects that amaze.
Modifier objects are given the mesh each frame update and are allowed to
modify the vertex properties prior to rendering. Each mesh holds onto a collec-
tion of these modifiers, which are sorted into priorities since some may need to
happen prior to others. For example, if deforming a vertex, one may want the
deformation to happen before normal colored lighting so that the darkening effect
of simulated shadow happens last.
Modifiers derive from a modifier base class, which consists of the data mem-
bers shown in Listing 14.1.
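As a rough, purely illustrative sketch of what such a base class might hold (the definitive data members are those of Listing 14.1 in the accompanying source code):

class Mesh;

class MeshModifier
{
public:
    virtual ~MeshModifier() {}

    // Called once per frame, before rendering, to alter vertex properties.
    virtual void Apply(Mesh &mesh, float deltaSeconds) = 0;

protected:
    int  m_Priority;   // Modifiers are sorted so that some run before others.
    bool m_Enabled;    // Lets a modifier be switched off cheaply.
};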
Point Light
We often color our vertices. A simple way to achieve this is to develop a light
source similar to a 3D point light. A point light contains a position, radius, color,
and fall-off function, as shown in Listing 14.2. The fall-off function determines
how hard or soft the light’s boundary is at its radius.
Modifiers that use lighting derive from a common MeshModifierLightBase
class that handles generic point-light lighting, as shown in Listing 14.3.
class PointLight
{
protected:
    // Position, radius, color, and fall-off function members; the full
    // declaration appears in Listing 14.2 in the chapter's source code.
};

// Abridged excerpts from the modifier's per-vertex update; the surrounding
// loop and declarations are elided (see the full listing in the source code).

    // Update dirty.
    UpdateDirtyIfNeeded();

    // Virtual call, gets us the modifier (or mesh) color for each vertex.
    const float *theBaseColor = GetBaseColor();

    // Is it in the bounds?
    if (!Contains(vert.GetWorldX(), vert.GetWorldY()))
        continue;

    return (res);
Listing 14.4. Example of how a modifier with many lights interacts with a mesh.
Ultimately, our mesh makes many effects possible with minimal coding. It’s
designed around building modifier objects, similar to functors in the standard
template library. Certainly, instead of a mesh, unique systems can be generated
for many of the effects listed below. Particles work for some, while independent-
Sunrise and Double Rainbows
Many of us wonder what seeing a double rainbow means. We can give our players the chance to ponder that question with a simple lighting modifier effect. This effect requires no texture and only renders colored quads. It consists of adding multiple point lights in an arc to a lighting group modifier, then changing per-light properties over time, altering the color, radius, fall-off, and position of each point light, as shown in Figure 14.6.
This type of effect is incredibly useful in adding a dynamic feel to background images. It can apply to a nighttime sky, a space scene, or the flicker of a sunrise. Incorporating lighting with standard textures is particularly effective. For example, the aforementioned flickering sunrise is implemented as a texture containing a sun in the foreground and a sunbeam-style point light behind it. Both rise together, while the point light modifies the background screen's mesh in seemingly perfect concert with the rising-sun texture. To the player, the sun and the point light behind it are one and the same.
There is almost no limit to what visual pairings work, given a mesh and normal foreground texture. Imagine a rocket whose exhaust trail is a series of point lights spat out from its engines. These lights ejected from the back of the rocket would spin, gently fall, and fade away as the rocket ship sped off into space.
Figure 14.6. Lighting effects on a mesh, from the mobile phone application atPeace.
Taking this one step further, you can light multiple objects in a world by having many objects with meshes and adding lighting modifiers on each one. Creating and modifying the point lights outside of the mesh means that they can be applied to multiple mesh objects, in a sense providing a global light source. That light would correctly impact the meshes of objects around it, regardless of whether they each have their own mesh. Sound confusing? Consider that any 2D object can have a mesh and that each mesh has its own modifier. The key is that any point light can be passed into each mesh's modifiers. In other words, we can control point lights from the application and change the lighting modifiers on each mesh prior to rendering. That provides global lighting for the scene, which is superb for world lighting consistency.
Flashlight
The mesh has seemingly no limit to what it offers. Instead of moving lights across a mesh and altering colors, it provides us the basis for negative effects, such as the flashlight. Instead of coloring vertices as a light moves through a mesh, we have the ability to hide the texture and use the point light as a way to uncover an image. Figure 14.7 shows several lights that, when applied to an image, uncover it. One could imagine how lights could be used in different strengths and colors such as a hard-beamed flashlight or a dim torch. Indeed, a casual game could be based largely on this concept of uncovering an image by using different types of lights that have varied shapes, colors, and sizes.
Figure 14.7. Using lights to uncover an image.
The implementation of this effect takes only a few lines of code since it derives from our lighting-modifier group class. The only change is that every vertex is colored black and our point light color value is pure white. Normal fall-off distances and screen-radius parameters used in our point light class still apply without alteration.
Pinch
Colors are just the beginning. Where the rubber meets the road is in the modification of vertex positions. It's here where we can make an image seem as if it's underwater by applying a water morphing effect, or where we can make our mesh respond as if it's a trampoline. Our pinch example is about as simple as it gets. Given a point, radius, and fall-off function (similar to a point light), we can bend vertices in toward the center of our pinch point as demonstrated in Figure 14.8. Note how the debugging lines rendered on the image show how each affected vertex is bent in toward the pinch point and how we color our vertices in the pinch to indicate depth. The opposite effect would be equally as simple, and we could create bulges in our texture instead.
Figure 14.8. The pinch effect pulls vertices toward the pinch point.
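A sketch of such a pinch applied to a single vertex, using a simple linear fall-off (names are illustrative):

#include <cmath>

// Pull a vertex toward the pinch point. Vertices inside 'radius' move a
// fraction of the way to the center; the fraction fades to zero at the edge.
void ApplyPinch(float &vertexX, float &vertexY, float pinchX, float pinchY,
                float radius, float strength)
{
    float dx = pinchX - vertexX;
    float dy = pinchY - vertexY;
    float distance = std::sqrt(dx * dx + dy * dy);
    if (distance >= radius)
        return;                                   // Outside the pinch's influence.

    float falloff = 1.0f - (distance / radius);   // 1 at the center, 0 at the edge.
    vertexX += dx * falloff * strength;
    vertexY += dy * falloff * strength;
}

The same fall-off value can also darken the vertex color, which produces the depth cue visible in Figure 14.8.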
Optimization
As mentioned earlier, when implementing the mesh that accompanies this article,
optimization was a low priority. That’s not to say it shouldn’t be a high priority
prior to shipping, since it greatly affects the frame rate. This mesh is fast enough
to support many mobile platforms, especially when used selectively as a single-
textured sky background. However, given multiple meshes and many modifier
objects, the number of vertices that require updating (and rendering) grows sub-
stantially. If optimization is a priority, here are some tips to remember:
■ As with every optimization, it pays to make sure you’re fixing what is bro-
ken. Use a profiler to gather real data on bottlenecks.
■ A major issue is the number of vertices affected per update or frame. Consid-
er the following:
□ Reduce the mesh points per row or column.
□ Use bounding areas to eliminate large groups of vertices. The best
situation is one where we never need to even look at a vertex to
know it’s unaffected by our modifiers and we can skip updating (or
rendering) the vertex. We achieve this by combining the bounding
box of all modifiers together. With these combined bounds, we then
determine which vertices are unaffected by our modifiers and can optimize accordingly (a small sketch of this idea follows the list).
□ Since each modifier has the ability to affect every vertex on the
mesh, limit the number of modifier objects or have ways to short cir-
cuit an update. This is extremely important since every modifier add-
ed has the potential to update every vertex on the mesh. Something
has to give in order to reduce the workload.
■ Consider building the mesh using state-based vertex objects where modifiers
exist as “fire-and-forget” objects that modify the vertices of the mesh once,
and let each vertex find its way back to its original state.
■ Use texture sheets. Rendering objects on the same texture at the same time is
an optimization over texture swapping. Avoid loading in many texture sheets
that are only required for a small piece of texture. Instead, try to include as
many assets on a texture sheet for a given scene as possible.
■ Graphics bottlenecks need to be considered. A detailed mesh has many
quads. Consider things such as grouping vertices into large quads for unaf-
fected areas and rendering those areas as chunks. Think of it this way: an
unmodified mesh is easily rendered with only the typical four corners.
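A small sketch of the combined-bounds early-out mentioned in the list above (the type is illustrative):

struct Bounds2D
{
    float minX, minY, maxX, maxY;

    void Merge(const Bounds2D &other)
    {
        if (other.minX < minX) minX = other.minX;
        if (other.minY < minY) minY = other.minY;
        if (other.maxX > maxX) maxX = other.maxX;
        if (other.maxY > maxY) maxY = other.maxY;
    }

    bool Contains(float x, float y) const
    {
        return x >= minX && x <= maxX && y >= minY && y <= maxY;
    }
};

Each frame, merge every active modifier's bounds into a single Bounds2D; any vertex whose position fails Contains() can skip its per-modifier update, and possibly its rendering, entirely.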
14.8 Conclusion
Life has a way of taking small building blocks and generating colossal structures.
We can do the same in 2D if we closely examine our own building blocks in
graphics. Just because it’s 2D doesn’t mean we’re limited to simple textures or
“sprites.” We need to exploit every power they provide. Small development
shops building 2D games need to be particularly aggressive in the pursuit of
quality and an edge as they often face competition that is bigger and better fund-
ed. Using a colored UI or 2D mesh is a perfect place to put some wind in the sails
and get a jump on the competition. Big results with minimal effort, an almost
magical situation.
Acknowledgements
Special thanks to Rick Bushie for the art in the examples from the game atPeace.
References
[Balkan et al. 2003] Aral Balkan, Josh Dura, Anthony Eden, Brian Monnone, James
Dean Palmer, Jared Tarbell, and Todd Yard. “Flash 3D Cheats Most Wanted.”
New York: friends of ED, 2003.
[Dietrich 2000] Sim Dietrich. “Texture Addressing.” Nvidia, 2000. Available at http://
developer.nvidia.com/object/Texture_Addressing_paper.html.
[Ivanov 2006] Ivan-Assen Ivanov. “Practical Texture Atlases.” Gamasutra, 2006. Avail-
able at http://www.gamasutra.com/features/20060126/ivanov_01.shtml.
[Pipenbrinck 1998] Nils Pipenbrinck. “Hermite Curve Interpolation.” 1998. Available at
http://www.cubic.org/docs/hermite.htm.
[Porter and Duff 1984] Thomas Porter and Tom Duff. “Compositing Digital Images.”
Computer Graphics (Proceedings of SIGGRAPH 84) 18:3, ACM, pp. 253–259.
Part II
15
High‐Performance Programming
with Data‐Oriented Design
Noel Llopis
Snappy Touch
[Figure: relative performance of CPU versus DRAM, 1980 to 2005, on a logarithmic scale. CPU performance roughly doubles every two years, while DRAM performance roughly doubles every six years, so the gap between them keeps widening.]
With these kinds of access times, it’s very likely that the CPU is going to
stall waiting to read data from memory. All of a sudden, performance is not de-
termined so much by how efficient the program executing on the CPU is, but
how efficiently it uses memory.
Barring a radical technology change, this is not a situation that’s about to
change anytime soon. We’ll continue getting more powerful, wider CPUs and
larger memories that are going to make memory access even more problematic in
the future.
Looking at code from a memory access point of view, the worst-case situa-
tion would be a program accessing heterogeneous trees of data scattered all over
memory, executing different code at each node. There we get not just the con-
stant data cache misses but also bad instruction cache utilization because it’s call-
ing different functions. Does that sound like a familiar situation? That’s how
most modern games are architected: large trees of different kinds of objects with
polymorphic behavior.
What’s even worse is that bad memory access patterns will bring a program
down to its metaphorical knees, but that’s not a problem that’s likely to appear
anywhere in the profiler. Instead, it will result in the common situation of every-
thing being slower than we expected, but us not being able to point to a particular
spot. That’s because there isn’t a single place that we can fix. Instead, we need to
change the whole architecture, preferably from the beginning, and use a data-
oriented approach.
15.2 Principles of Data‐Oriented Design 253
Table 15.1. Access times for different levels of the memory hierarchy for modern plat-
forms.
have you had just one player in the game? Or one enemy? One vehicle? One bul-
let? Never! Yet somehow, we insist on treating each object separately, in isola-
tion, as if it were the only one in the world. Data-oriented design encourages
optimizing for the common case of having multiple objects of the same type.
1. Cache utilization. This is the big one that motivated us to look at data in the
first place. Because we can concentrate on data and memory access instead
of the algorithms themselves, we can make sure our programs have as close
to an ideal memory access pattern as possible. That means avoiding hetero-
geneous trees, organizing our data into large sequential blocks of homogene-
ous memory, and processing it by running the same code on all of its
elements. This alone can bring a huge speed-up to our code.
2. Parallelization. When we work from the data point of view, it becomes a lot
easier to divide work up into parts that different cores can process simultane-
ously with minimal synchronization. This is true for almost any kind of par-
allel architecture, whether each core has access to main memory or not.
3. Less code. People are often surprised at this one. As a consequence of look-
ing at the data and only writing code to transform input data into output data,
there is a lot of code that disappears. Code that before was doing boring
bookkeeping, or getter/setters on objects, or even unnecessary abstractions,
all go away. And simplifying code is very much like simplifying an algebraic
equation: once you make a simplification, you often see other ways to simpli-
fy it further and end up with a much smaller equation than you started with.
1. Easier to test. When your code is something that simply transforms input
data into output data, testing it becomes extremely simple. Feed in some test
input data, run the code, and verify the output data is what you expected.
There are no pesky global variables to deal with, calls to other systems, inter-
action with other objects, or mocks to write. It really becomes that simple.
2. Easier to understand. Having less code means not just higher performance
but also less code to maintain, understand, and keep straight in our heads. Al-
so, each function in itself is much simpler to understand. We’re never in the
situation of having to chase function call after function call to understand all
the consequences of one function. Everything you want to know about it is
there, without any lower-level systems involved.
To be fair and present all the sides, there are two disadvantages to data-
oriented design:
Finally, the most important step is to look at the data you’ve identified as
input and figure out how it can be transformed into the output data in an efficient
way. How does the input data need to be arranged? Normally, you’ll want a large
block of the same data type, but perhaps, if there are two data types that need to
be processed at the same time, interleaving them might make more sense. Or
maybe, the transformation needs two separate passes over the same data type, but
the second pass uses some fields that are unused in the first pass. In that case, it
might be a good candidate for splitting it up into two types and keeping each of
them sequentially in memory.
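As a tiny illustration of the splitting case, separating a type whose fields are used by different passes keeps each pass touching only the memory it needs (the particle fields here are invented for the example):

#include <vector>

// Before: one struct, so both passes stride over fields they never read.
struct Particle
{
    float x, y, z;         // Used by the movement pass.
    float vx, vy, vz;      // Used by the movement pass.
    float age, lifetime;   // Used only by the expiry pass.
};

// After: two tightly packed arrays, one per pass.
struct ParticleMotion { float x, y, z, vx, vy, vz; };
struct ParticleLife   { float age, lifetime; };

struct ParticleSystem
{
    std::vector<ParticleMotion> motion;   // The movement pass walks only this block.
    std::vector<ParticleLife>   life;     // The expiry pass walks only this block.
};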
Once you have decided on the transformation, the only thing left is gathering
the inputs from the rest of the system and filling the outputs. When you’re transi-
tioning from a more traditional architecture, you might have to perform an ex-
plicit gathering step—query some functions or objects and collect the input data
in the format you want. You’ll have to perform a similar operation with the out-
put, feeding it into the rest of the system. Even though those extra steps represent
a performance hit, the benefits gained usually offset any performance costs. As
more systems start using the data-oriented approach, you’ll be able to feed the
output data from one system directly into the input of another, and you’ll really
be able to reap the benefits.
Heterogeneous Data
Game entities are the perfect example of why the straightforward particle ap-
proach doesn’t work in other game subsystems. You probably have dozens of
different game entity types. Or maybe you have one game entity, but have doz-
ens, or even hundreds, of components that, when grouped together, give entities
their own behavior.
One simple step we can take when dealing with large groups of heterogene-
ous data like that is to group similar data types together. For example, we would
lay out all the health components for all entities in the game one right after the
other in the same memory block. Same thing with armor components, and every
other type of component.
If you just rearranged them and still updated them one game entity at a time,
you wouldn’t see any performance improvements. To gain a significant perfor-
mance boost, you need to change the update from being entity-centric to being
component-centric. You need to update all health components first, then all ar-
mor components, and proceed with all component types. At that point, your
memory access patterns will have improved significantly, and you should be able
to see much better performance.
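A minimal sketch of the component-centric update, with made-up component types:

#include <cstddef>
#include <vector>

struct HealthComponent { float current, maximum, regenPerSecond; };
struct ArmorComponent  { float value, decayPerSecond; };

// Each loop runs the same code over one contiguous block of identical data,
// which is close to the ideal memory access pattern.
void UpdateComponents(std::vector<HealthComponent> &healths,
                      std::vector<ArmorComponent> &armors, float deltaSeconds)
{
    for (std::size_t i = 0; i < healths.size(); ++i)
    {
        HealthComponent &h = healths[i];
        h.current += h.regenPerSecond * deltaSeconds;
        if (h.current > h.maximum)
            h.current = h.maximum;
    }

    for (std::size_t i = 0; i < armors.size(); ++i)
        armors[i].value -= armors[i].decayPerSecond * deltaSeconds;
}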
lots of grouped ray casts, it might be beneficial to first sort the ray casts spatially,
and when they’re being resolved, you’re more likely to hit data that is already
cached.
Conditional Execution
Another common situation is that not all data of the same type needs to be updat-
ed the same way. For example, the navigation component doesn’t always need to
cast the same number of rays. Maybe it normally casts a few rays every half a
second, or more rays if other entities are closer by.
In that case, we can let the component decide whether it needs a second-pass
update by whether it creates a ray-cast query. Now we’re not going to have a
fixed number of ray queries per entity, so we’ll also need a way to make sure we
associate the ray cast with the entity it came from.
After all ray casts are performed, we iterate over the navigation components
and only update the ones that requested a ray query. That might save us a bit of
CPU time, but chances are that it won’t improve performance very much because
we’re going to be randomly skipping components and missing out on the benefits
of accessing memory linearly.
If the amount of data needed by the second update is fairly small, we could
copy that data as an output for the first update. That way, whenever we’re ready
to perform the second update, we only need to access the data generated this way,
which is sequentially laid out in memory.
If copying the data isn’t practical (there’s either too much data or that data
needs to be written back to the component itself), we could exploit temporal co-
herence, if there is any. If components either cast rays or don’t, and do so for
several frames at a time, we could reorder the components in memory so all nav-
igation components that cast rays are found at the beginning of the memory
block. Then, the second update can proceed linearly through the block until the
last component that requested a ray cast is updated. To be able to achieve this, we
need to make sure that our data is easily relocatable.
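One way to sketch that reordering, assuming the navigation components are relocatable and record whether they requested a ray cast this frame:

#include <algorithm>
#include <cstddef>
#include <vector>

struct NavigationComponent
{
    bool  castRayThisFrame;
    float lastRayDistance;
    // ... other relocatable data ...
};

static bool RequestedRayCast(const NavigationComponent &c)
{
    return c.castRayThisFrame;
}

// Move every component that requested a ray cast to the front of the block.
// The second update pass can then walk components [0, count) linearly and stop.
std::size_t PartitionRayCasters(std::vector<NavigationComponent> &components)
{
    std::vector<NavigationComponent>::iterator firstNonCaster =
        std::stable_partition(components.begin(), components.end(), RequestedRayCast);
    return static_cast<std::size_t>(firstNonCaster - components.begin());
}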
Polymorphism
Whenever we’re applying data-oriented design, we explicitly traverse sets of data
of a known type. Unlike an object-oriented approach, we would never traverse a
set of heterogeneous data by calling polymorphic functions in each of them.
Even so, while we’re transforming some well-known data, we might need to
treat another set of data polymorphically. For example, even though we’re updat-
ing the bullet data (well-known type), we might want to deliver damage to any
entity it hits, independent of the type of that entity. Since using classes and inher-
itance is not usually a very data-friendly approach, we need to find a better alter-
native.
There are many different ways to go about this, depending on the kind of
game architecture you have. One possibility is to split the common functionality
of a game entity into a separate data type. This would probably be a very small
set of data: a handle, a type, and possibly some flags or indices to components. If
every entity in the world has one corresponding set of data of this type, we can
always count on it while dealing with other entities. In this case, the bullet data
update could check whether the entity has a damage-handling component, and if
so, access it and deliver the damage.
If that last sentence left you a bit uncomfortable, congratulations, you’re
starting to really get a feel for good data access patterns. If you analyze it, the
access patterns are less than ideal: we’re updating all the current bullets in the
world. That’s fine because they’re all laid out sequentially in memory. Then,
when one of them hits an entity, we need to access that entity’s data, and then
potentially the damage-handling component. That’s two potentially random ac-
cesses into memory that are almost guaranteed to be cache misses.
We could improve on this a little bit by having the bullet update not access
the entity directly and, instead, create a message packet with the damage it wants
to deliver to that entity. After we’re done updating all of the bullets, we can make
another pass over those messages and apply the damage. That might result in a
marginal improvement (it’s doubtful that accessing the entity and its component
is going to cause any cache misses on the following bullet data), but most im-
portantly, it prevents us from having any meaningful interaction with the entity.
Is the entity bullet proof? Maybe that kind of bullet doesn’t even hit the entity,
and the bullet should go through unnoticed? In that case, we really want to access
the entity data during the bullet update.
In the end, it’s important to realize that not every data access is going to be
ideal. Like with all optimizations, the most important ones are the ones that hap-
pen more frequently. A bullet might travel for hundreds of frames, and it will hit
something at most in one frame. It’s not going to make much of a difference if,
during the frame when it hits, we have a few extra memory accesses.
15.6 Parallelization
Improving memory access patterns is only part of the performance benefits pro-
vided by data-oriented design. The other half is being able to take advantage of
multiple cores very easily.
15.7 Conclusion
Data-oriented design is a departure from traditional code-first thinking. It ad-
dresses head-on the two biggest performance problems in modern hardware:
memory access and parallelization. By thinking about programs as instructions to
transform data and thinking first about how that data should be laid out and
worked on, we can get huge performance boosts over more traditional software
development approaches.
16
Game Tuning Infrastructure
Wessam Bahnassi
Electronic Arts, Inc.
16.1 Introduction
Every game has to go through a continuous cycle of tuning and tweaking during
its development. To support that, many of today’s game engines provide some
means for editing and tuning “objects” (or “entities”) and other settings in the
game world. This article provides food for thought on implementing tuning capa-
bilities in a game engine. We cover the infrastructure options available and dis-
cuss their particularities and the scenarios that would benefit from each one of
them. Finally, we conclude with case studies from a published game and a com-
mercial engine.
within the context of tuning, as user-interface discussions are a very big topic on
their own and are outside the scope of this book.
For a while, editors had to provide only the means to modify game data and
build this data to become ready for use by the game run-time code. However, a
situation similar to executable build times has risen. Level build times have also
become increasingly lengthy, making the turnaround time of visualizing data
modifications too long and less productive. The need for a faster method for vis-
ualizing changes has thus become necessary, and a new breed of the so-called
what-you-see-is-what-you-get (WYSIWYG) editors has also become available.
However, it is important to not get caught up in industry frenzy. Indeed, not
all projects need or have the opportunity to utilize an engine with a full-scale edi-
tor. Except in the case of using a licensed engine, the game team might not have
the time to build anything but the most primitive tool to do data editing and tun-
ing. For such cases, this article also considers approaches that lack the presence
of a separate editor application.
■ Acceptable turnaround time for each class of tunable parameters. This is the
time wasted between the user making the modification and seeing its effect.
■ Convenience and ease of use with regard to the target system users and fre-
quency of usage. A system that is going to be used daily should receive more
focus than a system used rarely.
■ Amount of development work involved and how intrusive code changes are
allowed to be. The engine’s code base is a primary factor in this area.
■ Potential for system reuse in other projects, or in other words, generality. For
example, is it needed to serve one game only? Or games of similar genre? Or
any game project in general?
■ Type of data to be tuned and its level of sophistication, complexity, and mul-
tiplicity (e.g., global settings, per-entity, per-level).
■ Cross-platform tuning support—certain parameters are platform-specific and
require tuning on the target platform directly.
One additional point that was deliberately left out of the list above is stability and
data safety. This should not be a “consideration,” but rather a strict requirement.
An editor that crashes in the middle of work is not considered a valid piece of
software.
Next, we go into detail about the available options for implementing a tuning
infrastructure in a game engine. The options are categorized into four main sec-
tions: The Tuning Tool, Data Exchange, Schema and Exposure, and Data Stor-
age. A choice from each section can be made and then mixed and matched with
choices from the other sections to finally form the definition of the tuning system
that would best serve the team’s needs. The list of considerations above should
be kept in mind when reading through the following sections in order to help
make the right decisions.
Listing 16.1. Pseudo C++ code showing the proper setup for interactive global variable tuning.
¹ Digital content creation tool is a term used to refer to programs such as Softimage, 3DS Max, Maya, etc.
for the pipeline to build them in order to view the results in the game's renderer, this approach offers them instant visualization, which can boost productivity.
DCC tools come with a large palette of utilities that can be used to lay out a level and art exactly as one wants, thus providing an excellent tuning experience. However, the programming effort required by this technique is not to be underestimated.
There are two methods of approaching this option. One method is to implement the game renderer inside the DCC tool (possible in Autodesk Softimage), and the other is to have the tool communicate with the game executable in order to send it updates about changes happening in the DCC tool to directly display it to the user. In this venue, an engine can go as far as implementing asset hot loading so changes to models and textures could be directly visualized in the game, too. The storage of tuned data (models, levels, and custom properties) then becomes the role of the DCC tool.
Dedicated Editor Application
Another sophisticated method is to build a dedicated editor application that works with the game. Such a method puts total control and flexibility in the
hands of the team to make the tool tune virtually any value deemed tunable. The widgets and controls for tuning can all be customized to be as suitable to the data as possible (e.g., color pickers, range-limited value sliders, or curve editors). A full-blown implementation can go as far as providing instantaneous live tuning through the actual game's executable, such as in Unreal and CryEngine.
Depending on how sophisticated a team wants to get with this, implementing the user interface and controls for a successful editor can be a difficult task that requires background and experience in developing user interfaces, which is a specialization of its own. If not considered well, the result most probably will be inconvenient (e.g., widgets not behaving in a standard way, bad interface layout, missing shortcuts, or missing undo or redo).
It can now be understood why this method, while being most sophisticated, is most difficult to get right. It can involve a huge amount of programmer work and the dedication of a number of members of the team. Going with third-party solutions can be a wise decision here.
On the implementation side, the editor application, being separate from the game, is free to use its own programming language. .NET languages have been found to be excellent for rapidly developing desktop control-rich applications. However, using a different language than the game can complicate data exchange and communication, but solutions exist [Bahnassi 2008].
Some implementations, such as the Unreal editor shown in Figure 16.2, in-
volve the editor directly within the actual game’s executable. This simplifies the
task of data exchange and provides access to the game’s most up-to-date render-
ing features and data structures. But on the other hand, such approaches can
complicate the game’s code by involving the editor’s code in many of the game’s
internal aspects and forcing it to handle situations that are not faced when launch-
ing the game in standalone mode.
Another important issue is stability. By binding the editor to the game code
itself, the editor inherits the same level of instability as the game (which can be
high during game production), causing grief and lost work to the entire team.
Thus, it is advisable to implement the editor as a separate application running in
its own address space and using its own codebase. It can be made so that even if
the game or visualization tool crashes, then it would still be possible to save data
so that no work is lost.
Interprocess Communication
When the tuning tool is implemented as a separate application, it can communi-
cate with the game process through interprocess communication methods. This
first requires writing a game-side module to handle the communication and exe-
cute the tuning commands. Second, a protocol must be devised to identify tuna-
ble values and serialize them between the game and the tuning tool. As is shown
in Section 16.6, this can be a complicated task when the tuned data structure is
complex or the game does not offer proper identification of the objects living in
its world.
Cross‐Platform Communication
If an interprocess communication system has been put in place, then it is relative-
ly easy to extend it to handle cross-platform communication. With this feature,
the tuning tool can be run on the development platform (e.g., PC) and tuning can
occur live inside the game running on the target platform (e.g., PlayStation 3).
All major console SDKs provide some means to communicate programmatically
with the console from a PC. Implementing this method is the only way for dedi-
cated editors to support on-console tuning.
Reflection
If C++ had code reflection capabilities, then the task of exposing tunables would
have been much simpler. This lack of reflection has strongly influenced some
engines to extend their systems with an accompanying reflected language in their
engines [Sweeny 1998]. The presence of reflection information simplifies matters
a lot and makes for uniform tuning code.
The .NET languages are a good role model in this area. The great news is
that C++/CLI has become a well-developed solution that can remedy such a situ-
ation elegantly. For example, any class can become reflected if declared as a
CLR type; and the rest of the class can be left untouched in frequent cases. This
reflection can be easily turned off for the final game executable if the dependen-
cy on .NET is to be avoided in the final product. The tuning tool can then read
the structure of the tuned object and deal directly with its members.
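A minimal illustration of the idea, compiled with /clr; the type and its properties are invented for the example:

// Declaring the tunable type as a CLR ref class makes its members visible to
// .NET reflection, so a tuning tool can enumerate and edit them generically.
public ref class WeaponTuning
{
public:
    property float ChargeRate;
    property float SustainTime;
    property int   MaxAmmo;
};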
Explicit Definition
Another alternative is to define the properties of tunable game objects in a sepa-
rate place outside the game code and rely on a tool to generate data definition and
access information for both the editor and the game to use. For example, a data
schema can be written once and then passed to a code generation utility to gener-
ate a C++ class for in-game use, along with an accompanying reflection infor-
mation class usable by the tuning tool. The code generation utility can go all the
way to even implementing the necessary serialization code between the native
object and its tuning service proxy.
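As a sketch of what such a pipeline might produce, here is an invented schema fragment and the kind of C++ a generator could emit from it:

#include <cstddef>

// Schema (e.g., authored once in XML):
//   <class name="Weapon">
//     <field name="sustainTime" type="float"/>
//     <field name="chargeRate"  type="float"/>
//   </class>

// Generated class for in-game use.
class Weapon
{
public:
    float sustainTime;
    float chargeRate;
};

// Generated reflection table for the tuning tool.
struct FieldInfo
{
    const char *name;
    const char *type;
    std::size_t offset;
};

static const FieldInfo g_WeaponFields[] =
{
    { "sustainTime", "float", offsetof(Weapon, sustainTime) },
    { "chargeRate",  "float", offsetof(Weapon, chargeRate)  },
};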
Text or Binary
Text file storage is highly recommended. First, the tunable values are usually not
overwhelmingly dense, which drops the necessity of storing them in a binary
format. Second, such a human readable format allows for easy differencing to see
a history of changes that went through a file. This is a valuable feature for de-
bugging.
If loading efficiency is to be maximized, a text-to-binary conversion process
can optionally be supported for the release version of the game. Interestingly,
some games actually prefer to ship those files in a human readable format for
players to tweak and modify.
team members are involved in tuning different areas of the game because they
are forced to wait for each other to release the file and continue working. Al-
though it is usually possible to merge changes from different versions of the file
and thus restore the capability of parallel work, it is not a good idea to involve
yet an additional step in the workflow of game tuning. Thus, it is advisable that
tunable values be stored in different files in accordance with logical sectioning,
thus reducing the possibility of conflicts and the need to resolve such conflicts.
File Format
There are well-known formats that can be used for storing tuned data. Those in-
clude the .INI file format (sectioned name-value pairs) and the more expressive
XML format. Adopting a well-known format can save development time by us-
ing libraries that are already available for reading and writing such file formats.
[weapon]
name = "Tachyon Gun"
sustainTime = 10
chargeRate = 6
[upgrade]
name = "Heat Dispenser"
effect = sustainTime
value = +5
[upgrade]
Listing 16.2. Sample tuned data stored with explicit naming. The order of data appearance is
irrelevant in this format.
Writing Data
One final aspect of data storage is considering what it is that actually writes the
data. Depending on all the possibilities mentioned above, a game might have a
dedicated PC-side tool communicating with it. In this case, storage can be initiat-
ed by the PC tool, requesting values for all tunable parameters and writing them
down in the project’s directory on the PC. The developer can then submit this
updated file to the source control system to be shared with the rest of the team.
This actually might be the sole possible option for some console systems.
Alternatively, for systems lacking interprocess communication but still run-
ning on console platforms, the game itself can write the values to its local folder,
and then the developer can manually copy the generated files by hand from the
console’s file system to the project’s directory.
Quraish
Quraish is a 3D real-time strategy game, shown in Figure 16.3, developed from 2003 to 2006 for the PC platform by a small team made up of one game
designer, one level designer, three core programmers, and about eight artists and
animators. The game engine offers a separate world editor application for build-
ing maps, as well as for editing all entity properties and defining new races.
Both the game and the world editor share the core rendering and animation
engine code base. Thus, the editor provides an exact instantaneous view of the
game world when building maps. Since the core code is shared, data exchange is
very easy because the same C++ structures had to be written only once and then
used in both the game and the editor verbatim. The editor was written in C++
using the MFC framework.
Figure 16.3. Image from the game Quraish. (Image courtesy of Dar Al-Fikr Publishing.)
The editor stores all of its data in a single custom database file that can be loaded in sections by the game, and those sections are packed and prepared for game use ahead of time. A pointer fix-up pass is the only post-load operation needed for the data to become usable (maps were also stored similarly).
A major amount of tuning code was written in the editor in a game-specific way. That is, there is no generality towards code structures. Editors had to be written explicitly for each entity type (e.g., units, weapons, buildings, animals). This was feasible because the number of entity types was quite limited and manageable. However, this could have become a problem if the number of entity types grew to something above seven or eight.
In-game tuning was possible through the use of a console system built into the game (see Figure 16.1), and it connected to C++ functions and variables defined to be tunable manually in code through a registration process.
Tunable values could be written to a C-script file, like the example shown in Listing 16.3. The game can execute both compiled and interpreted C-script files using a text-parsing system like that described by Boer [2001]. This was to allow players to tweak and modify the game in a flexible manner. The files that were not to be exposed to the players were compiled, and the rest of them were left open in text format.
#include "DataBase\\QurHighAIDif.c"
ApplyWaterTax(SecondsToAITime(25), 1);
DayTimeScale = 0.1;
CompleteUpgrade(AI_UPGRADEID_MONEYPRINT);
CompleteUpgrade(AI_UPGRADEID_ANAGLYPH);
able to bugs and crashes due to continuous development work occurring in the
game’s code.
The engine uses binary package files to store its information, which is neces-
sary for geometry and texture data, but it suffers from the binary data issues men-
tioned earlier in Section 16.7. Still, although the data is stored in binary format,
the engine is capable of handling data structure changes conveniently. For exam-
ple, variable order changes do not corrupt the data, and removing existing varia-
bles or adding new ones works just fine, without any additional effort needed to
patch existing package files.
The engine also allows in-game tuning outside of the editor through a con-
sole window that can access virtually any UnrealScript property or function. Val-
ues tuned this way are not meant to be persistent, though; they exist only for
temporary experimentation. Additionally, the engine uses .INI files to read set-
tings across all areas of the game and the editor. The intention seems to be that
.INI files are used for global settings, and UnrealScript and UnrealEd are used
for per-object data.
Acknowledgements
I would like to thank Homam Bahnassi, Abdul Rahman Lahham, and Eric Lengyel for
proofreading this article and helping me out with its ideas.
References
[Bahnassi 2005] Homam Bahnassi and Wessam Bahnassi, “Shader Visualization Sys-
tems for the Art Pipeline.” ShaderX3, edited by Wolfgang Engel. Boston: Charles
River Media, 2005.
[Bahnassi 2008] Wessam Bahnassi. “3D Engine Tools with C++/CLI.” ShaderX6, edited
by Wolfgang Engel. Boston: Cengage Learning, 2008.
[Boer 2001] James Boer, “A Flexible Text Parsing System.” Game Programming Gems
2, edited by Mark DeLoura. Hingham, MA: Charles River Media, 2001.
[Jensen 2001] Lasse Staff Jensen. “A Generic Tweaker.” Game Programming Gems 2,
edited by Mark DeLoura. Hingham, MA: Charles River Media, 2001.
[Sweeney 1998] Tim Sweeney. UnrealScript Language Reference. Available at http://
unreal.epicgames.com/UnrealScript.htm.
[Woo 2007] Kim Hyoun Woo. “Shader System Integration: Nebula2 and 3ds Max.”
ShaderX5, edited by Wolfgang Engel. Boston: Charles River Media, 2007.
17
Placeholders beyond Static
Art Replacement
Olivier Vaillancourt
Richard Egli
Centre MOIVRE, Université de Sherbrooke
are a common sight for many developers. They usually require very little time to
produce, and little effort is made to make them look or behave like the final asset.
It can also be said that the majority of placeholders seen during development in-
volve 3D game objects in some ways, since the construction of this particular
type of asset requires a large amount of work from multiple trades (modeling,
texturing, animation, game logic programming, sound editing, etc.). However,
it’s important to keep in mind that placeholders aren’t limited to 3D objects and
also extend to other asset types such as sound, text, or even code.
The motivation behind placeholders arises from the fact that multiple devel-
opers of different trades often work on the exact same part of the game at differ-
ent moments in time, or they iterate over a given task at different paces. In these
cases, some of the developers involved in a task usually have to wait until it at-
tains a certain level of completion before being able to start working on it. These
small wait times can add up and cause important production bottlenecks when
considered as a whole. Fortunately, in the cases where these dependencies in-
volve the creation of game assets, the wait time can be greatly reduced by the
creation of placeholder assets. They provide a minimal working resource to the
developers needing it, without overly affecting the schedule of the rest of the
team. In the end, the developers might still have to revisit their work to ensure
that it fits the final asset, but they nonetheless save more time than if they had
remained idle while waiting for the completed resource.
Unfortunately, despite having a generally positive effect on development
time, placeholders can become a source of confusion when they are incorrectly
crafted. For example, a common practice is to reuse an existing asset from the
production or a previous production instead of building a dedicated placeholder
asset. While this has the advantage of inserting more attractive and functional
temporary objects, it greatly increases the risk of forgetting that the asset is, in
fact, a placeholder. In some extreme cases, licensed or branded content from a
previous game could even make its way through to another client’s game. Com-
paratively, the presence of temporary assets from previous productions in a game
sometimes proves to be an important source of distraction during testing. The
fact that a resource doesn’t match the game’s style or quality, but still appears
final, can end up being more confusing than anything else for an unaware tester.
In these cases, the testers often waste time reporting art problems with the place-
holder, thinking they are part of the game’s assets. This becomes especially true
in the case of focus group tests not involving professional testers. Obviously, the
testers can be explicitly told that some of the art isn’t final, but this reduces their
confidence in reporting legitimate art problems. There are ways to “tweak” these
reused assets to remove potential confusion, but at this point, the time involve-
■ Provide an accurate estimate of the look and behavior of the final asset. This
one is a bit tricky and can be considered a bit more loosely. The idea of “ac-
curate estimate” in terms of look and behavior differs a lot from one project
to another. Using a cylinder with a sphere sitting on top of it might be
enough to picture a character in a real-time strategy game. To express the be-
havior of the final resource, a simple arrow can be added to indicate in which
direction the unit is facing. However, in a platformer or an adventure game,
the placeholder character might need to be a bit more detailed. If the anima-
tions are already available, that character could even be articulated and ani-
mated. The amount of work that needs to be done to correctly represent the
look and behavior of the final asset amounts to the quantity of details re-
quired to understand the placeholder.
■ Can be created or generated rapidly with no particular skill required. By
definition, placeholders are temporary and do not involve detailed work at
all. If creating them becomes a time burden or requires fine tuning, the whole
point of building a placeholder gets somewhat lost. Moreover, if special
skills are required to create the placeholder, such as artistic or programming
skills, the idea of reducing coupling by creating them also gets lost in the
process, since someone depends on the expertise and schedule of someone
else to have the placeholder created. Ideally, everyone on a development
team should be able to create a placeholder within a matter of minutes.
■ Facilitate communication between team members. While a placeholder will
never replace a design document or a good old explanation, it does not nec-
essarily mean it cannot help on the communication side of things. Placehold-
ers have a shape or a surface through which information can be conveyed.
What information to convey and how to do it should be one of the primary
questions when building a placeholder system.
■ Integrate easily in the already existing resource pipeline. One of the im-
portant aspects of a placeholder system is for it to have a small footprint on
production time. This applies to the creation of the placeholder itself but also
to the creation of the system. If integrating placeholders into your 3D pipe-
line requires three months of refactoring, you might want to rethink your
original placeholder idea (or rethink the flexibility of your resource pipeline,
but that’s another discussion).
■ Scalable and flexible enough to work on multiple platforms. Placeholders are
a bit like code. If they’re done correctly, they should be able to scale and
adapt to any kind of platform. This is especially true now, since game studios
have become more and more likely to build games on multiple platforms at
the same time. Some studios have even started working on game engines that
run on PCs, consoles, handhelds, and cell phones alike. Avoid using plat-
form-specific features when creating placeholders, and ensure that the tech-
niques used to generate them can be scaled and adapted to multiple
platforms.
Figure 17.1. Depiction of the articulated placeholder generation technique. (a) The ani-
mation skeleton is used as the basis for the (b) mesh generation using implicit surface. (c)
The mesh is skinned to the skeleton, which can then be animated, animating the articulat-
ed placeholder mesh itself.
employs popular computer graphics techniques that are very easy to implement
and well documented. Special care was also put into ensuring that the placehold-
er construction uses popular rendering engine infrastructures when possible to
further reduce the time required to integrate the system into a production pipe-
line. The generation process is sufficiently detailed to ensure that a user with no
prior experience with skeletal animation and implicit surface generation can still
build the system. In this regard, advanced users can feel comfortable skipping
some of the entry-level explanations.
Skeleton Construction
The first step is to create the animation skeleton itself. The animation skeleton is
the root of a widely used animation technique called “skeletal animation” [Kavan
and Žára 2003], which is arguably the most popular animation technique current-
ly used in the videogame industry.
Skeletal animation is based on a two-facet representation of an animated 3D
mesh: the skin and the skeleton. The skin is the visual representation of the object
to be animated, which consists of a surface representation of the object. This sur-
face is often, but not always, made of tightly knit polygons called a mesh. The
other part of skeletal animation is the skeleton, which is a representation of the
underlying articulated structure of the model to animate. By animating the skele-
ton and then binding the surface to it, it is possible to animate the surface itself.
Before delving into the intricacies of the latter part of the process, we first begin
by looking at the construction of the skeleton.
The skeleton consists of a hierarchical set of primitives called bones. A bone
is made up of a 3D transformation (position, scale, and orientation) and a refer-
ence to a parent bone. (The parent bone is optional, and some bones, such as the
topmost bone of the hierarchy, have no parent.) The parent-child relationship be-
tween bones creates the skeletal hierarchy that determines how a bone is trans-
formed. For example, if an upper arm bone is rotated upward, the forearm bone
rotates upward accordingly since it is attached (through the hierarchy) to the up-
per arm bone.
Even if the final bone animation only has one transformation, the bones, during the animation creation, must have two separate transforms: the bone-space transform $\mathbf{B}^{-1}$ and the pose transform $\mathbf{P}$. Since these transforms apply to a particular bone of the skeleton, we identify the bones as the j-th bone of the skeleton, and denote the transforms by $\mathbf{B}_j^{-1}$ and $\mathbf{P}_j$. These transforms can be represented as a $4 \times 4$ homogeneous matrix having the general form

$$\mathbf{T}_j = \begin{bmatrix} \mathbf{T}_j^{\text{rot}} & \mathbf{T}_j^{\text{trans}} \\ \mathbf{0} & 1 \end{bmatrix}, \tag{17.1}$$

Composing the pose transform with the bone-space transform gives the final skinning transform for bone j,

$$\mathbf{M}_j = \mathbf{P}_j \mathbf{B}_j^{-1}. \tag{17.2}$$

The matrix $\mathbf{M}_j$ is used later to transform the mesh vertices according to the animation during the skinning process.
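In code, this step usually amounts to one matrix product per bone each frame. The sketch below is a minimal illustration under that reading; the Bone structure, its member names, and the Matrix4 type are assumptions rather than this chapter's classes.

#include <array>
#include <vector>

// Minimal 4x4 row-major matrix type; a real engine would use its own math library.
using Matrix4 = std::array<float, 16>;

Matrix4 Multiply(const Matrix4& a, const Matrix4& b)
{
    Matrix4 result{};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            for (int k = 0; k < 4; ++k)
                result[row * 4 + col] += a[row * 4 + k] * b[k * 4 + col];
    return result;
}

struct Bone
{
    Matrix4 inverseBindTransform;   // B_j^-1, fixed when the skeleton is built
    Matrix4 poseTransform;          // P_j, updated by the animation each frame
    Matrix4 skinningTransform;      // M_j, consumed later by the skinning stage
};

// Equation (17.2): M_j = P_j * B_j^-1 for every bone of the skeleton.
void UpdateSkinningTransforms(std::vector<Bone>& bones)
{
    for (Bone& bone : bones)
        bone.skinningTransform = Multiply(bone.poseTransform, bone.inverseBindTransform);
}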
At this point, you should have the required components and mathematics to
build your own animation skeleton. However, it remains a hierarchy of 3D trans-
Figure 17.2. Bone j from the skeleton is brought to the origin by the transformation $\mathbf{B}_j^{-1}$. The transformation $\mathbf{P}_j$ then transforms the bone j to its animated pose.
formations, and this can be rather hard to visualize. To help in viewing your skeleton and ensuring that your transformations are applied correctly, it is suggested that you visually represent a skeleton as a set of bones and joints, where the differences between the translations of two transformation matrices (the bones) are modeled as line segments, and the connections between bones (the joints) are visualized as spheres. If you already work in a development environment, this representation is probably already built into your 3D modeling software or into your engine's debugging view.
try (triangle count) that constructs the mesh, to emulate animation and rendering
costs. Finally, it must be able to generate the mesh rapidly and automatically. The
technique we suggest to satisfy all of these requirements is to use implicit surfac-
es, also known as level sets, to generate the mesh geometry.
A level set, generally speaking, is a set of points $(x_1, \ldots, x_n)$ for which a real-valued function f of n variables satisfies the constraint $f(x_1, \ldots, x_n) = c$, where c is a constant. The entire set is then

$$\left\{ (x_1, \ldots, x_n) \mid f(x_1, \ldots, x_n) = c \right\}. \tag{17.3}$$

When $n = 3$, the set of points defined by the constraint creates what is called a level surface, also commonly known as an isosurface or an implicitly defined surface. A classic example of a 3D implicit surface is the sphere, which can be defined as a set of points $(x, y, z)$ through the general equation

$$\left\{ (x, y, z) \mid f(x, y, z) = x^2 + y^2 + z^2 = r^2 \right\}. \tag{17.4}$$

In this case, our constant c would be $r^2$, and our implicit surface would be defined as all the 3D points $(x, y, z)$ located at a distance r from the origin, effectively creating a sphere.
Another way to picture an implicit surface is to imagine that the function $f(x_1, \ldots, x_n)$ defines a density field in the space where it is located and that all the points having a certain density value c within the field are part of the set (which is a surface in the 3D case). Finding the density value at a certain point is done simply by evaluating f with the coordinates of that point.
In the case of our placeholder system, we define a mesh by generating an
implicit surface from the skeleton itself. The idea is to define a density field
based on an inverse squared distance function defined from the bones of the skel-
eton. As we saw earlier, a bone only consists of a transformation matrix and a
reference to its parent. However, to create a distance function for a given bone,
the bone must be mathematically represented as a line segment rather than a sin-
gle transformation. To obtain a line segment from the bone, we use the bone-
space transform. The translation of the bone, $\mathbf{B}_j^{\text{trans}}$, is used to create one end of the line segment, while the parent's translation $\mathbf{B}_{\text{parent}(j)}^{\text{trans}}$ is used as the other end of the segment, where parent(j) maps the index j to the index of the parent bone. (Note that we're using $\mathbf{B}$ and not $\mathbf{B}^{-1}$ here.) With this formulation, the squared distance
function can be defined as

$$\operatorname{dist}(\mathbf{p}, j) = \frac{1}{d_j^{\,2}}, \tag{17.5}$$
where $\mathbf{p} = (x, y, z)$ and

$$d_j = \frac{\left\| \left( \mathbf{B}_{\text{parent}(j)}^{\text{trans}} - \mathbf{B}_j^{\text{trans}} \right) \times \left( \mathbf{B}_j^{\text{trans}} - \mathbf{p} \right) \right\|}{\left\| \mathbf{B}_{\text{parent}(j)}^{\text{trans}} - \mathbf{B}_j^{\text{trans}} \right\|}. \tag{17.6}$$
Once the distance can be calculated for a single bone, evaluating the density
field at a certain point amounts to summing the distance function for all bones
and determining whether that point is on the surface by setting a certain distance
threshold d. The implicit surface generated from our skeleton can therefore be
formulated using the distance function and the set
$$\left\{ \mathbf{p} \;\middle|\; f(\mathbf{p}) = \sum_j \operatorname{dist}(\mathbf{p}, j) = d \right\}. \tag{17.7}$$
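To make this concrete, the following is a minimal sketch of how the density field $f(\mathbf{p})$ might be evaluated against the threshold d; the Vector3 and BoneSegment types, the helper names, and the small epsilon guard are illustrative assumptions rather than code from this chapter.

#include <vector>

struct Vector3 { float x, y, z; };

struct BoneSegment
{
    Vector3 start;   // the bone's translation
    Vector3 end;     // the parent's translation
};

// Squared distance from point p to the segment [a, b].
float SquaredSegmentDistance(const Vector3& p, const Vector3& a, const Vector3& b)
{
    Vector3 ab = { b.x - a.x, b.y - a.y, b.z - a.z };
    Vector3 ap = { p.x - a.x, p.y - a.y, p.z - a.z };
    float abLengthSquared = ab.x * ab.x + ab.y * ab.y + ab.z * ab.z;
    float t = (abLengthSquared > 0.0f)
        ? (ap.x * ab.x + ap.y * ab.y + ap.z * ab.z) / abLengthSquared
        : 0.0f;
    if (t < 0.0f) t = 0.0f;          // clamp to the segment's endpoints
    if (t > 1.0f) t = 1.0f;
    Vector3 d = { ap.x - t * ab.x, ap.y - t * ab.y, ap.z - t * ab.z };
    return d.x * d.x + d.y * d.y + d.z * d.z;
}

// f(p): sum of the per-bone inverse squared distances.
float EvaluateDensity(const Vector3& p, const std::vector<BoneSegment>& bones)
{
    const float epsilon = 1.0e-6f;   // avoids division by zero on the bone itself
    float density = 0.0f;
    for (const BoneSegment& bone : bones)
        density += 1.0f / (SquaredSegmentDistance(p, bone.start, bone.end) + epsilon);
    return density;
}

// A grid sample is flagged as inside the surface when its density reaches d.
bool IsInside(float density, float threshold)
{
    return density >= threshold;
}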
One way to define the whole 3D surface from the field generated by the skel-
eton would be to sample every point in our 3D space and build a polygonal sur-
face from those that happen to fall directly on the threshold d. This, however,
would be extremely inefficient and would most likely produce an irregular sur-
face unless the sampling is very precise. A smarter way to generate an implicit
surface from a density field is to use the marching cubes algorithm1 [Lorensen
and Cline 1987] or its triangular equivalent, marching tetrahedrons [Doi and
Koide 1991, Müller and Wehle 1997], which we favor here for its simplicity and
robustness.
The marching tetrahedrons algorithm, depicted in Figure 17.3, starts by sam-
pling the density field at regular intervals, creating a 3D grid. Every discrete
point is then evaluated using the density function and flagged as inside or outside
depending on the computed value of the density field in regard to the threshold
(inside if the value is higher than the threshold and outside if the value is lower).
The whole 3D grid can be seen as an array of cells defined from groups of eight points ($2 \times 2 \times 2$ points) forming a cube. Each of these cells can be further divided into six tetrahedrons that are the base primitives defining our polygonal implicit
1. The marching cubes algorithm is explained in detail in the chapter “Volumetric Representation of Virtual Environments” in Game Engine Gems 1 [Williams 2010].
Figure 17.3. Representation of the marching tetrahedrons algorithm. Every cell of $2 \times 2 \times 2$ points is split into six tetrahedrons. Each of these tetrahedrons is matched to one of the eight cases depicted on the right, and the geometry is generated accordingly.
surface. Every tetrahedron is built with four different points, and each of these points can have two different states (inside or outside). Therefore, there exist 16 possible different configurations of inside and outside values for each tetrahedron. (These 16 configurations can be reduced to two mirrored sets of eight configurations.) At this point, all of the tetrahedrons containing a transition (a transition occurs when some points of the tetrahedron are higher than the threshold and other points are lower) create one or two triangles based on their configuration. When all of the tetrahedrons have been evaluated, have been classified,
and have generated their triangles, the resulting geometry is a fully defined im-
plicit surface based on the density field. In our case, with a density field generat-
ed from the squared distance to our skeleton, the implicit surface should look like
a party balloon having roughly the shape of our skeleton, as shown in Fig-
ure 17.4.
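The case selection can be sketched compactly. The snippet below only classifies a single tetrahedron and reports how many triangles it would emit; the corner-to-bit mapping and the count table are illustrative assumptions, and a complete implementation also needs the edge-interpolation and triangle-index tables.

#include <array>

// Triangles emitted for each of the 16 inside/outside corner configurations:
// none when all four corners agree, two when exactly two corners are inside,
// and one otherwise.
static const std::array<int, 16> kTriangleCountPerCase =
    { 0, 1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 0 };

// Classify one tetrahedron from the density sampled at its four corners.
int CountTrianglesForTetrahedron(const std::array<float, 4>& cornerDensity,
                                 float threshold)
{
    int caseIndex = 0;
    for (int corner = 0; corner < 4; ++corner)
    {
        if (cornerDensity[corner] >= threshold)   // corner is inside the surface
            caseIndex |= (1 << corner);
    }
    return kTriangleCountPerCase[caseIndex];
}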
One of the advantages of using this technique is that the amount of generated
geometry can be easily adjusted by increasing or decreasing the sampling preci-
sion. A larger sampling grid generates shapes with less geometry, whereas a finer
sampling grid increases the polygon count. This comes in handy and generates a
placeholder mesh that provides similar performance to the final desired asset. It
also becomes particularly useful when working under polygon constraints for
game objects, as the generated mesh provides a good performance estimate.
weights are given relative to their distance. (It is advisable to use the cubed or squared distance because they give better visual results than the direct Euclidean distance.) In some cases, the number of bones within the threshold might be lower than n, which is normal if fewer bones are located near the vertex. This whole two-step process ensures that spatially close bones that are far in the skeletal structure, and thus are physiologically unrelated, do not share weights all over the mesh. To recapitulate, the automatic weight assignment algorithm requires the following steps:
Figure 17.5. Weight distribution for a nontrivial case in an animation skeleton. The presented weight distribution is for the right femoral bone of the skeleton (highlighted in the inset). Brighter colors on the mesh surface indicate a greater influence of the middle bone, whereas completely dark parts of the surface represent no influence.
AssignationMap[NearestDist].bone = nearestBone;
AssignationMap[NearestDist].dist = nearestDist;
nearestBone->SetVisited(true);
return (assignationMap);
}
Skinning
At this point in the process, we now know which bone affects a particular vertex
and to what extent it affects it. The only thing left to do is to actually grab the
bones’ matrices and apply them to our mesh’s vertices in order to transform the
mesh and animate it. The act of deforming a mesh to fit on a given skeleton’s
animation is called skinning. Multiple skinning techniques exist in the industry,
the most popular being linear-blend skinning [Kavan and Žára 2003] and spheri-
cal-blend skinning [Kavan and Žára 2005]. Both of the techniques have been im-
plemented with a programmable vertex shader in the demo code on the website,
but only the linear-blend skinning technique is explained in this section. Spheri-
cal-blend skinning requires some more advanced mathematics and could be the
subject of a whole gem by itself. However, keep in mind that if you can afford it,
spherical-blend skinning often provides better visual quality than does linear-
blend skinning. Again, also note that if you are already operating in a 3D devel-
opment environment, skinning techniques are almost certainly already available
and reusing them is preferable, as stated in our list of desired placeholder fea-
tures.
Linear-blend skinning is a very simple and widely popular technique that has
been in use since the Jurassic era of computer animation. While it has some visi-
ble rendering artifacts, it has proven to be a very efficient and robust technique
that adapts very well to a wide array of graphics hardware. The details given here
apply to programmable hardware but can be easily adapted for nonprogrammable
GPUs where the same work can be entirely performed on the CPU.
The idea behind linear-blend skinning is to linearly blend the transformation
matrices. This amounts to multiplying every bone’s matrix by its weight for a
given vertex and then summing the multiplied matrices together. The whole pro-
cess can be expressed with the equation

$$\mathbf{v}'_i = \sum_j w_{ij}\, \mathbf{M}_j \mathbf{v}_i, \tag{17.8}$$

where $\mathbf{v}_i$ is the i-th untransformed vertex of the mesh in its bind pose, $\mathbf{v}'_i$ is the transformed vertex after it has been skinned to the skeleton, $\mathbf{M}_j$ is the transformation matrix of the j-th bone, and $w_{ij}$ is the weight of bone j when applied to vertex i. (In the case where only the n closest bones are kept, you can view all the $w_{ij}$ as being set to zero except for the n closest ones.)
Implementing linear-blend skinning on programmable rendering hardware
remains equally straightforward and can be completely done in the vertex shader
stage of the pipeline. Before looking at the code, and to ensure that the imple-
mentation remains simple, the number of bones that can affect a single vertex has to be limited to four. (The bone count affecting a vertex rarely goes higher than four in practice, as shown in Figure 17.6.)
The standard input values for the linear-blend skinning vertex shader are the vertex position (attribute), the model-view-projection matrix (uniform), and the bone transformation matrices (uniforms). (The normal vector and texture coordinates aren't required if you only want to perform linear-blend skinning.) Three more inputs dedicated to skinning have to be added to the shader. The first one is a uniform array of $4 \times 4$ matrices that contains the transformation matrix for each of the animation skeleton's bones (the $\mathbf{M}_j$ matrices) at the animated pose to be rendered. The size of the array determines how many bones the animation skeleton can have (40 is usually a good number). The second input is a four-dimensional integer vector attribute that encodes the IDs of the four closest bones for each vertex. If the influencing bone count is lower than four, the supplementary matrix identifiers and weights can be set to zero, which ensures they have no influence on the final vertex position. The last input is a four-dimensional floating-point vector attribute storing the weight values for the four closest bones. The weights stored in this vector must be in the same order as the corresponding bone
// Input attributes
attribute vec4 vertex;
attribute vec3 normal;
attribute vec4 weight;
attribute vec4 boneId;
void main(void)
{
vec4 Position = vertex;
Position.w = 1.0;
Listing 17.2. GLSL implementation of the linear blend skinning vertex shader.
IDs in the previous vector. With this information, the linear-blend skinning given
by Equation (17.8) can be directly implemented, giving us the skinning shader
shown in Listing 17.2 (with an added calculation for the normal vector).
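As a rough sketch of how Equation (17.8) maps onto such a vertex shader (this is not the chapter's Listing 17.2; the uniform names, the bone-array size of 40, and the omission of the normal calculation are assumptions):

// Sketch only: positions are blended with the four closest bone matrices
// according to Equation (17.8).
attribute vec4 vertex;
attribute vec4 weight;     // weights of the four closest bones
attribute vec4 boneId;     // indices of the four closest bones

uniform mat4 boneTransforms[40];    // the M_j matrices for the current pose
uniform mat4 modelViewProjection;

void main(void)
{
    vec4 Position = vertex;
    Position.w = 1.0;

    // Linearly blend the four bone matrices by their weights.
    mat4 blended = boneTransforms[int(boneId.x)] * weight.x
                 + boneTransforms[int(boneId.y)] * weight.y
                 + boneTransforms[int(boneId.z)] * weight.z
                 + boneTransforms[int(boneId.w)] * weight.w;

    gl_Position = modelViewProjection * (blended * Position);
}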
With skinning now integrated, the mesh can be loaded in your game, the animations can be run, and you should now be able to see what looks like an articulated balloon running in your scene. The completed technique can be seen in Figure 17.7. From this point on, it should be pretty obvious that the placeholder is not a final asset; other programmers are able to test their game logic by using
existing animations, and they will most likely be able to accomplish relatively accurate performance tests by tweaking the amount of geometry and using different skinning methods on the placeholder. Most importantly, the addition of this new placeholder in your production environment should decouple the work of modelers, animators, and programmers since only a very basic skeleton is now required to obtain convenient and useful placeholder geometry.
Limitations
The method presented above should give results of sufficient quality for a placeholder type of asset. However, it has some shortcomings that should be underscored. The weakest part of the process, the automatic weight assignment, provides good results for a placeholder asset but in no way replaces the precise and detailed work a good animator or modeler would do. The problem becomes mostly apparent in regions with multiple bone influences (three or four), where subtle tearing can be seen in the mesh. The automatic weight assignment could be improved to reduce this artifact by using heat diffusion methods or geodesic distance methods on the mesh.
Another artifact that should be mentioned is the “ballooning” effect of the skeleton joints on the generated mesh. As can be noticed, round artifacts appear in the geometry around the skeleton's articulations. This is because each skeleton bone is considered individually, and their density fields add up near the articulations. This artifact can be attenuated by reducing the intensity of the density field
near the extremities of a bone. Keep in mind, though, that some people tend to
prefer this look since it clearly defines the presence of an articulation.
Finally, the generated mesh might appear a bit flat, with undefined features,
if every bone has the same effect on the density field. A possible improvement to
this is to add a simple “influence” property to every bone and manually increase
the influence of some of them (such as the head of a humanoid character to make
it appear rounder). This is, however, a purely aesthetic improvement and other-
wise affects the system in no way.
Placeholders will not revolutionize the way you build games, and many peo-
ple on your team might never realize someone had to actually build the place-
holder systems they’re using. The positive effect of these systems is more subtle
and only appears in the long run when you look at the big picture. The challenges
of today’s game industry are all about efficiency, creating more in less time, with
fewer people, and at lower costs. Thinking about how development can be ren-
dered more efficient plays a big part in a team’s success, and every developer has
to do his share of work on this front. Study your everyday routine, identify your
production bottlenecks, and verify whether a simple placeholder system could
alleviate the pressure. They will not make their way into the final build, but these
placeholders will definitely show their true value before that.
17.5 Implementation
The code provided on the website allows you to compile and use the articulated
placeholder system described in this chapter. A Visual Studio solution with the
required library and header files is located on the website, and it contains every-
thing necessary to compile the project.
References
[Aguiar et al. 2008] Edilson de Aguiar, Christian Theobalt, Sebastian Thrun, and Hans-
Peter Seidel. “Automatic Conversion of Mesh Animations into Skeleton-Based
Animations.” Computer Graphics Forum 27:2 (April 2008), pp. 389–397.
[Baran and Popović 2007] Ilya Baran and Jovan Popović. “Automatic Rigging and Ani-
mation of 3D Characters.” ACM Transactions on Graphics 26:3 (July 2007).
[Doi and Koide 1991] Akio Doi and Akio Koide. “An Efficient Method of Triangulating
Equi-Valued Surface by Using Tetrahedral Cells.” IEICE Transactions E74:1
(January 1991), pp. 214–224.
[Kavan and Žára 2003] Ladislav Kavan and Jiří Žára. “Real Time Skin Deformation
with Bones Blending.” WSCG Short Papers Proceedings, 2003.
[Kavan and Žára 2005] Ladislav Kavan and Jiří Žára. “Spherical Blend Skinning: A Re-
al-time Deformation of Articulated Models.” Proceedings of the 2005 Symposium
on Interactive 3D Graphics and Games 1 (2005), pp. 9–17.
[Lally 2003] John Lally. “Giving Life to Ratchet & Clank.” Gamasutra. February 11,
2003. Available at http://www.gamasutra.com/view/feature/2899/giving_life_to_
ratchet__clank_.php.
[Lorensen and Cline 1987] William E. Lorensen and Harvey E. Cline. “Marching Cubes:
A High Resolution 3D Surface Construction Algorithm.” Computer Graphics
(Proceedings of SIGGRAPH 87) 21:4, ACM, pp. 163–169.
[Müller and Wehle 1997] Heinrich Müller and Michael Wehle. “Visualization of Implic-
it Surfaces Using Adaptive Tetrahedrizations.” Dagstuhl ’97 Proceedings of the
Conference on Scientific Visualization, pp. 243–250.
[Rosen 2009] David Rosen. “Volumetric Heat Diffusion Skinning.” Gamasutra Blogs,
November 24, 2009. Available at http://www.gamasutra.com/blogs/DavidRosen/
20091124/3642/Volumetric_Heat_Diffusion_Skinning.php.
[Speyrer and Jacobson 2006] David Speyrer and Brian Jacobson. “Valve’s Design Pro-
cess for Creating Half-Life 2.” Game Developers Conference, 2006.
[Williams 2010] David Williams. “Volumetric Representation of Virtual Environments.”
Game Engine Gems 1, edited by Eric Lengyel. Sudbury, MA: Jones and Bartlett,
2010.
18
Believable Dead Reckoning
for Networked Games
Curtiss Murphy
Alion Science and Technology
18.1 Introduction
Your team’s producer decides that it’s time to release a networked game, saying
“We can publish across a network, right?” Bob kicks off a few internet searches
and replies, “Doesn’t look that hard.” He dives into the code, and before long,
Bob is ready to begin testing. Then, he stares in bewilderment as the characters
jerk and warp across the screen and the vehicles hop, bounce, and sink into the
ground. Thus begins the nightmare that will be the next few months of Bob’s life,
as he attempts to implement dead reckoning “just one more tweak” at a time.
This gem describes everything needed to add believable, stable, and efficient
dead reckoning to a networked game. It covers the fundamental theory, compares
algorithms, and makes a case for a new technique. It explains what’s tricky about
dead reckoning, addresses common myths, and provides a clear implementation
path. The topics are demonstrated with a working networked game that includes
source code. This gem will help you avoid countless struggles and dead ends so
that you don’t end up like Bob.
18.2 Fundamentals
Bob isn’t a bad developer; he just made some reasonable, but misguided, as-
sumptions. After all, the basic concept is pretty straight forward. Dead reckoning
is the process of predicting where an actor is right now by using its last known
position, velocity, and acceleration. It applies to almost any type of moving actor,
including cars, missiles, monsters, helicopters, and characters on foot. For each
remote actor being controlled somewhere else on the network, we receive updates about its kinematic state that include its position, velocity, acceleration, orientation, and angular velocity. In the simplest implementation, we take the last position we received on the network and project it forward in time. Then, on the next update, we do some sort of blending and start the process all over again. Bob is right that the fundamentals aren't that complex, but making it believable is a different story.
Myth Busting—Ground Truth
Let's start with the following fact: there is no such thing as ground truth in a networked environment. “Ground truth” implies that you have perfect knowledge of the state of all actors at all times. Surely, you can't know the exact state of all remote actors without sending updates every frame in a zero packet loss, zero latency environment. What you have instead is your own perceived truth. Thus, the goal becomes believable estimation, as opposed to perfect re-creation.
Basic Math
To derive the math, we start with the simplest case: a new actor comes across the network. In this case, one of our opponents is driving a tank, and we received our first kinematic state update as it came into view. From here, dead reckoning is a straightforward linear physics problem, as described by Aronson [1997]. Using the values from the message, we put the vehicle at position $\mathbf{P}_0$ and begin moving it at velocity $\mathbf{V}_0$ with acceleration $\mathbf{A}_0$, as shown in Figure 18.1. The dead-reckoned position $\mathbf{Q}_t$ at a specific time T is calculated with the equation

$$\mathbf{Q}_t = \mathbf{P}_0 + \mathbf{V}_0 T + \frac{1}{2} \mathbf{A}_0 T^2.$$
Continuing our scenario, the opponent saw us, slowed his tank, and took a hard right. Soon, we receive a message updating his kinematic state. At this point, we have conflicting realities. The first reality is the position $\mathbf{Q}_t$ where we guessed he would be using the previous formula. The second reality is where he actually went, our new $\mathbf{P}'_0$, which we refer to as the last known state because it's the last thing we know to be correct. This dual state is the beginning of Bob's nightmares. Since there are two versions of each value, we use the prime notation (e.g., $\mathbf{P}'_0$) to indicate the last known, as shown in Figure 18.2.
To resolve the two realities, we need to create a believable curve between where we thought the tank would be, and where we estimate it will be in the future. Don't bother to path the remote tank through its last known position, $\mathbf{P}'_0$. Instead, just move it from where it is now, $\mathbf{P}_0$, to where we think it is supposed to be in the future, $\mathbf{P}'_1$.
This technique actually works pretty well. It is simple and gives a nice curve between our points. Unfortunately, it has oscillations that are as bad as or worse than the spline techniques. Upon inspection, it turns out that with all of these techniques, the oscillations are caused by the changes in velocity ($\mathbf{V}_0$ and $\mathbf{V}'_0$). Maybe if we do something with the velocity, we can reduce the oscillations. So, let's try it again, with a tweak. This time, we compute a linear interpolation between the old velocity $\mathbf{V}_0$ and the last known velocity $\mathbf{V}'_0$ to create a new blended velocity $\mathbf{V}_b$. Then, we use this to project forward from where we were.
The technique, projective velocity blending, works like this:

$$\mathbf{P}_t = \mathbf{P}_0 + \mathbf{V}_b T_t + \frac{1}{2} \mathbf{A}'_0 T_t^2 \quad \text{(projecting from where we were)},$$

$$\mathbf{P}'_t = \mathbf{P}'_0 + \mathbf{V}'_0 T_t + \frac{1}{2} \mathbf{A}'_0 T_t^2 \quad \text{(projecting from last known)},$$

$$\mathbf{Q}_t = \mathbf{P}_t + \left( \mathbf{P}'_t - \mathbf{P}_t \right) \hat{T} \quad \text{(combined)}.$$

And the red lines in Figure 18.3 show what it looks like in action.
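A rough sketch of the blend follows; the Vector3 type and helper functions are minimal stand-ins, and using the same normalized time for the velocity interpolation is an assumption about the blend factor rather than a detail taken from the demo code.

struct Vector3
{
    float x, y, z;
};

static Vector3 Add(const Vector3& a, const Vector3& b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static Vector3 Scale(const Vector3& v, float s)        { return { v.x * s, v.y * s, v.z * s }; }
static Vector3 Lerp(const Vector3& a, const Vector3& b, float t)
{
    // a + (b - a) * t
    return Add(a, Scale(Add(b, Scale(a, -1.0f)), t));
}

// Projective velocity blending: project from both the old state and the last
// known state, then blend the two projections by T_hat = Tt / Tdelta.
Vector3 ProjectiveVelocityBlend(
    const Vector3& oldPos,       const Vector3& oldVel,        // P0, V0
    const Vector3& lastKnownPos, const Vector3& lastKnownVel,  // P0', V0'
    const Vector3& lastKnownAccel,                             // A0'
    float timeSinceUpdate, float timeBetweenUpdates)           // Tt, Tdelta
{
    float tHat = (timeBetweenUpdates > 0.0f)
        ? timeSinceUpdate / timeBetweenUpdates
        : 1.0f;
    if (tHat > 1.0f)
        tHat = 1.0f;                                           // clamp T_hat at one

    Vector3 blendedVel = Lerp(oldVel, lastKnownVel, tHat);     // Vb
    Vector3 accelTerm  = Scale(lastKnownAccel, 0.5f * timeSinceUpdate * timeSinceUpdate);

    // Pt  = P0  + Vb  * Tt + (1/2) A0' * Tt^2   (projecting from where we were)
    Vector3 fromWhereWeWere = Add(Add(oldPos, Scale(blendedVel, timeSinceUpdate)), accelTerm);
    // Pt' = P0' + V0' * Tt + (1/2) A0' * Tt^2   (projecting from last known)
    Vector3 fromLastKnown   = Add(Add(lastKnownPos, Scale(lastKnownVel, timeSinceUpdate)), accelTerm);

    // Qt = Pt + (Pt' - Pt) * T_hat              (combined)
    return Lerp(fromWhereWeWere, fromLastKnown, tHat);
}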
Prove It!
So it sounds good in theory, but let's get some proof. We can perform a basic test by driving a vehicle in a repeatable pattern (e.g., a circle). By subtracting the real location from the dead-reckoned location, we can determine the error. The images in Figure 18.4 and statistics in Table 18.1 show the clear result. The projective velocity blending is roughly five to seven percent more accurate than cubic Bézier splines. That ratio improves a bit more when you can't publish acceleration. If you want to test it yourself, the demo application on the website has implementations of both projective velocity blending and cubic Bézier splines.
Table 18.1. Improvement using projective velocity blending. Dead-reckoning (DR) error is measured in meters.
update rate varies significantly depending on the type of game, the network con-
ditions, and the expected actor behavior. As a general rule of thumb, an update
rate of three per second looks decent and five or more per second looks great.
$$\hat{T} = \frac{T_t}{T_\Delta}.$$
Now we have all the times we need to compute the projective velocity blending equations. That leaves just one final wrinkle in time. It happens when we go past $T_\Delta$ (i.e., $T_t > T_\Delta$). This is a very common case that can happen if we miss an update, have any bit of latency, or even have minor changes in frame rate. From earlier,

$$\mathbf{Q}_t = \mathbf{P}_t + \left( \mathbf{P}'_t - \mathbf{P}_t \right) \hat{T}.$$

Because $\hat{T}$ is clamped at one, the $\mathbf{P}_t$ term drops out, leaving the original equation

$$\mathbf{Q}_t = \mathbf{P}'_0 + \mathbf{V}'_0 T_t + \frac{1}{2} \mathbf{A}'_0 T_t^2.$$

The math simplifies quite nicely and continues to work for any value of $\hat{T} \geq 1.0$.
■ Due to the nature of networking, you can receive updates at any time, early or late. In order to maintain $C^1$ continuity, you need to calculate the instantaneous velocity between this frame's and the last frame's dead-reckoned position, $\left( \mathbf{P}_t - \mathbf{P}_{t-1} \right) / T_f$. When you get the next update and start the new curve, use this instantaneous velocity for $\mathbf{V}_0$. Without this, you will see noticeable changes in velocity at each update.
■ Actors send updates at different times based on many factors, including crea-
tion time, behavior, server throttling, latency, and whether they are moving.
Therefore, track the various times separately for each actor (local and re-
mote).
■ If deciding your publish rate in advance is problematic, you could calculate a
run-time average of how often you have been receiving network updates and
use that for T Δ . This works okay but is less stable than a predetermined rate.
■ In general, the location and orientation get updated at the same time. Howev-
er, if they are published separately, you’ll need separate time variables for
each.
■ It is possible to receive multiple updates in a single frame. In practice, let the
last update win. For performance reasons, perform the dead reckoning calcu-
lations later in the game loop, after the network messages are processed. Ide-
ally, you will run all the dead reckoning in a single component that can split
the work across multiple worker threads.
■ For most games, it is not necessary to use time stamps to sync the clocks be-
tween clients/servers in order to achieve believable dead reckoning.
When to Publish?
Let’s go back and consider the original tank scenario from the opponent’s per-
spective. The tank is now a local actor and is responsible for publishing updates
on the network. Since network bandwidth is a precious resource, we should re-
duce traffic if possible. So the first optimization is to decide when we need to
publish. Naturally, there are times when players are making frequent changes in
direction and speed and five or more updates per second are necessary. However,
there are many more times when the player’s path is stable and easy to predict.
For instance, the tank might be lazily patrolling, might be heading back from a
respawn, or even sitting still (e.g., the player is chatting).
The first optimization is to only publish when necessary. Earlier, we learned
that it is better to have a constant publish rate (e.g., three per second) because it
keeps the remote dead reckoning smooth. However, before blindly publishing
every time it’s allowed (e.g., every 0.333 s), we first check to see if it’s neces-
return (forceUpdateResult);
}
sary. To figure that out, we perform the dead reckoning as if the vehicle was re-
mote. Then, we compare the real and the dead-reckoned states. If they differ by a
set threshold, then we go ahead and publish. If the real position is still really
close to the dead-reckoned position, then we hold off. Since the dead reckoning
algorithm on the remote side already handles Tt T Δ, it’ll be fine if we don’t up-
date right away. This simple check, shown in Listing 18.1, can significantly re-
duce network traffic.
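A minimal sketch of such a check is shown below; the structure mirrors the idea rather than the contents of Listing 18.1, and the Vector3 type, parameter names, and the heading-only rotation test are assumptions.

struct Vector3 { float x, y, z; };

// Decide whether a local actor has drifted far enough from its own
// dead-reckoned state to justify publishing a new kinematic update.
bool ShouldForceUpdate(const Vector3& realPos, const Vector3& deadReckonedPos,
                       float realHeading, float deadReckonedHeading,
                       float maxTranslationError, float maxRotationError)
{
    bool forceUpdateResult = false;

    // Translation check: compare squared distance to a squared threshold.
    float dx = realPos.x - deadReckonedPos.x;
    float dy = realPos.y - deadReckonedPos.y;
    float dz = realPos.z - deadReckonedPos.z;
    if (dx * dx + dy * dy + dz * dz > maxTranslationError * maxTranslationError)
        forceUpdateResult = true;

    // Rotation check: absolute heading error against a threshold (radians).
    float headingError = realHeading - deadReckonedHeading;
    if (headingError < 0.0f)
        headingError = -headingError;
    if (headingError > maxRotationError)
        forceUpdateResult = true;

    return forceUpdateResult;
}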
What to Publish
Clearly, we need to publish each actor’s kinematic state, which includes the posi-
tion, velocity, acceleration, orientation, and angular velocity. But there are a few
things to consider. The first, and least obvious, is the need to separate the actor’s
real location and orientation from its last known location and orientation. Hope-
fully, your engine has an actor property system [Campbell 2006] that enables you
to control which properties get published. If so, you need to be absolutely sure
you never publish (or receive) the actual properties used to render location and
orientation. If you do, the remote actors will get an update and render the last
known values instead of the results of dead reckoning. It’s an easy thing to over-
look and results in a massive one-frame discontinuity (a.k.a. blip). Instead, create
publishable properties for the last known values (i.e., location, velocity, accelera-
tion, orientation, and angular velocity) that are distinct from the real values.
The second consideration is partial actor updates, messages that only contain
a few actor properties. To obtain believable dead reckoning, the values in the
kinematic state need to be published frequently. However, the rest of the actor’s
properties usually don’t change that much, so the publishing code needs a way to
swap between a partial and full update. Most of the time, we just send the kine-
matic properties. Then, as needed, we send other properties that have changed
and periodically (e.g., every ten seconds) send out a heartbeat that contains eve-
rything. The heartbeat can help keep servers and clients in sync.
With this in mind, the third consideration is determining what the last known
values should be. The last known location and orientation come directly from the
actor’s current render values. However, if the velocity and acceleration values
from the physics engine are giving bad results, try calculating an instantaneous
velocity and acceleration instead. In extreme cases, try blending the velocity over
two or three frames to average out some of the sharp instantaneous changes.
Publishing Tips
Below are some final tips for publishing:
■ If an actor isn’t stable at speeds near zero due to physics, consider publishing
a zero velocity and/or acceleration instead. The projective velocity blend will
resolve the small translation change anyway.
■ If publishing regular heartbeats, be sure to sync them with the partial updates
to keep the updates regular. Also, try staggering the heartbeat time by a ran-
dom amount to prevent clumps of full updates caused by map loading.
■ Some types of actors don’t really move (e.g., a building or static light). Im-
prove performance by using a static mode that simply teleports actors.
■ In some games, the orientation might matter more than the location, or vice
versa. Consider publishing them separately and at different rates.
■ To reduce the bandwidth using ShouldForceUpdate(), you need to dead
reckon the local actors in order to check against the threshold values.
■ Evaluate the order of operations in the game loop to ensure published values
are computed correctly. An example order might include: handle user input,
tick local (process incoming messages and actor behaviors), tick remote (per-
form dead reckoning), publish dead reckoning, start physics (background for
next frame), update cameras, render, finish physics. A bad order will cause
all sorts of hard-to-debug dead reckoning anomalies.
■ There is an optional damping technique that can help reduce oscillations
when the acceleration is changing rapidly (e.g., zigzagging). Take the current
and previous acceleration vectors and normalize them. Then, compute the dot
product between them and treat it as a scalar to reduce the acceleration before
publishing (shown in the ComputeCurrentVelocity() function in List-
ing 18.2).
■ Acceleration in the up/down direction can sometimes cause floating or sink-
ing. Consider publishing a zero instead.
    mSecsSinceLastUpdateSent += elapsedTime;
    mTimeUntilHeartBeat -= elapsedTime;

    if (forceUpdate)
    {
        SetLastKnownValuesBeforePublish(pos, rot);

        if (fullUpdate)
        {
            mTimeUntilHeartBeat = HEARTBEAT_TIME; // +/- random offset
            NotifyFullActorUpdate();
        }
        else
        {
            NotifyPartialActorUpdate();
        }

        mSecsSinceLastUpdateSent = 0.0F;
    }
}

        SetCurrentAcceleration(computedAccel);
        SetCurrentVelocity(mComputedLinearVel);
    }

    mLastPos = pos;
    mPrevFrameTime = deltaTime;
}
the velocity is going to project the dead-reckoned position under the ground. Few
things are as disconcerting as watching a tank disappear halfway into the dirt. As
an example, the demo on the website allows mines to fall under ground.
Use H as the final clamped ground height for the actor and use the normal to de-
termine the final orientation. While not appropriate for all cases, this technique is
fast and easy to implement, making it ideal for distant objects.
Other Considerations
■ Another possible solution for this problem is to use the physics engine to
prevent interpenetration. This has the benefit of avoiding surface penetration
in all directions, but it can impact performance. It can also create new prob-
lems, such as warping the position, the need for additional blends, and sharp
discontinuities.
■ Another way to minimize ground penetration is to have local actors project
their velocities and accelerations into the future before publishing. Then,
damp the values as needed so that penetration will not occur on remote actors
(a method known as predictive prevention). This simple trick can improve
behavior in all directions and may eliminate the need to check for interpene-
tration.
■ When working with lots of actors, consider adjusting the ground clamping
based on distance to improve performance. You can replace the three-point
ray multicast with a single point and adjust the height directly using the inter-
section normal for orientation. Further, you can clamp intermittently and use
the offset from prior ground clamps.
■ For character models, it is probably sufficient to use single-point ground
clamping. Single-point clamping is faster, and you don’t need to adjust the
orientation.
■ Consider supporting several ground clamp modes, as sketched below. For flying or underwater actors, there should be a “no clamping” mode. For vehicles that can jump, consider an “only clamp up” mode. The last mode, “always clamp to ground,” would force the clamp both up and down.
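A small sketch of how those modes might be represented and applied follows; the enumerator names and the ApplyGroundClamp() helper are assumptions.

enum class GroundClampMode
{
    None,                 // flying or underwater actors
    OnlyClampUp,          // vehicles that can jump: never push them down
    AlwaysClampToGround   // force the clamp both up and down
};

// Apply the chosen mode to a dead-reckoned height, given the ground height
// found by the ray cast or multicast described earlier.
float ApplyGroundClamp(GroundClampMode mode, float actorHeight, float groundHeight)
{
    switch (mode)
    {
    case GroundClampMode::None:
        return actorHeight;
    case GroundClampMode::OnlyClampUp:
        return (actorHeight < groundHeight) ? groundHeight : actorHeight;
    case GroundClampMode::AlwaysClampToGround:
    default:
        return groundHeight;
    }
}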
18.7 Orientation
Orientation is a critical part of dead reckoning. Fortunately, the basics of orienta-
tion are similar to what was discussed for position. We still have two realities to
resolve: the current drawn orientation and the last known orientation we just re-
ceived. And, instead of velocity, there is angular velocity. But that’s where the
similarities end.
Hypothetically, orientation should have the same problems that location had.
In reality, actors generally turn in simpler patterns than they move. Some actors
turn slowly (e.g., cars) and others turn extremely quickly (e.g., characters). Either
way, the turns are fairly simplistic, and oscillations are rarely a problem. This means $C^1$ and $C^2$ continuity is less important and explains why many engines don't bother with angular acceleration.
Myth Busting—Quaternions
Your engine might use HPR (heading, pitch, roll), XYZ vectors, or full rota-
tion matrices to define an orientation. However, when it comes to dead reck-
oning, you’ll be rotating and blending angles in three dimensions, and there is
simply no getting around quaternions [Hanson 2006]. Fortunately, quaterni-
ons are easier to implement than they are to understand [Van Verth and Bish-
op 2008]. So, if your engine doesn’t support them, do yourself a favor and
code up a quaternion class. Make sure it has the ability to create a quaternion
from an axis/angle pair and can perform spherical linear interpolations
(slerp). A basic implementation of quaternions is provided with the demo
code on the website.
Vec3 angVelAxis(mLastKnownAngularVelocityVector);
With this in mind, dead reckoning the orientation becomes pretty simple:
project both realities and then blend between them. To project the orientation, we
need to calculate the rotational change from the angular velocity. Angular veloci-
ty is just like linear velocity; it is the amount of change per unit time and is usual-
ly represented as an axis of rotation whose magnitude corresponds to the rate of
rotation about that axis. It typically comes from the physics engine, but it can be
calculated by dividing the change in orientation by time. In either case, once you
have the angular velocity vector, the rotational change $\mathbf{R}_{\Delta t}$ is computed as shown in Listing 18.3.
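As a rough sketch of that computation (the Vec3 and Quat types, their members, and the small-rate cutoff are assumptions, not the chapter's classes), the per-frame rotational change can be derived from the angular velocity's axis and magnitude:

#include <cmath>

struct Vec3 { float x, y, z; };

struct Quat
{
    float w, x, y, z;

    // Build a quaternion from a rotation axis (unit length) and an angle in radians.
    static Quat FromAxisAngle(const Vec3& axis, float angle)
    {
        float halfAngle = 0.5f * angle;
        float s = std::sin(halfAngle);
        return { std::cos(halfAngle), axis.x * s, axis.y * s, axis.z * s };
    }
};

// Rotational change over elapsedTime seconds from an angular velocity vector
// whose direction is the rotation axis and whose magnitude is the rotation
// rate in radians per second.
Quat ComputeRotationalChange(const Vec3& angularVelocity, float elapsedTime)
{
    float rate = std::sqrt(angularVelocity.x * angularVelocity.x +
                           angularVelocity.y * angularVelocity.y +
                           angularVelocity.z * angularVelocity.z);
    if (rate < 1.0e-6f)
        return { 1.0f, 0.0f, 0.0f, 0.0f };   // no rotation this frame

    Vec3 axis = { angularVelocity.x / rate,
                  angularVelocity.y / rate,
                  angularVelocity.z / rate };
    float rotationAngle = rate * elapsedTime; // add angular acceleration here if available
    return Quat::FromAxisAngle(axis, rotationAngle);
}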
If you also have angular acceleration, just add it to rotationAngle. Next,
compute the two projections and blend using a spherical linear interpolation. Use
the last known angular velocity in both projections, just as the last known accel-
eration was used for both equations in the projective velocity blending technique:
This holds true for $\hat{T} \leq 1.0$. Once again, $\hat{T}$ is clamped at one, so the math simplifies when $\hat{T} \geq 1.0$:
Server Validation
Dead reckoning can be very useful for server validation of client behavior. The
server should always maintain a dead-reckoned state for each player or actor.
With each update from the clients, the server can use the previous last known
state, the current last known state, and the ongoing results of dead reckoning as
input for its validation check. Compare those values against the actor’s expected
behavior to help identify cheaters.
Articulations
Complex actors often have articulations, which are attached objects that have
their own independent range of motion and rotation. Articulations can generally
be lumped into one of two groups: real or fake. Real articulations are objects
whose state has significant meaning, such as the turret that’s pointing directly at
you! For real articulations, use the same techniques as if it were a full actor. For-
tunately, many articulations, such as turrets, can only rotate, which removes the
overhead of positional blending and ground clamping. Fake articulations are
things like tires and steering wheels, where the state is either less precise or
changes to match the dead-reckoned state. For those, you may need to implement
custom behaviors, such as for turning the front tires to approximate the velocity
calculated by the dead reckoning.
Subscription Zones
Online games that support thousands of actors sometimes use a subscription-
zoning technique to reduce rendering time, network traffic, and CPU load [Cado
2007]. Zoning is quite complex but has several impacts on dead reckoning. One
significant difference is the addition of dead reckoning modes that swap between
simpler or more complex dead reckoning algorithms. Actors that are far away or
unimportant can use a low-priority mode with infrequent updates, minimized
ground clamping, quantized data, or simpler math and may take advantage of
delayed dead reckoning. The high-priority actors are the only ones doing frequent
updates, articulations, and projective velocity blending. Clients are still responsi-
ble for publishing normally, but the server needs to be aware of which clients are
receiving what information for which modes and publish data accordingly.
18.9 Conclusion
Dead reckoning becomes a major consideration the moment your game becomes
networked. Unfortunately, there is no one-size-fits-all technique. The games in-
dustry is incredibly diverse and the needs of a first-person MMO, a top-down
RPG, and a high-speed racing game are all different. Even within a single game,
different types of actors might require different techniques.
The underlying concepts described in this gem should provide a solid foun-
dation for adding dead reckoning to your own game regardless of the genre. Even
so, dead reckoning is full of traps and can be difficult to debug. Errors can occur
anywhere, including the basic math, the publishing process, the data sent over the
network, or plain old latency, lag, and packet issues. Many times, there are mul-
tiple problems going on at once and they can come from unexpected places, such
as bad values coming from the physics engine or uninitialized variables. When
you get stuck, refer back to the tips in each section and avoid making assump-
tions about what is and is not working. Believable dead reckoning is tricky to
achieve, but the techniques in this gem will help make the process as easy as it
can be.
Acknowledgements
Special thanks to David Guthrie for all of his contributions.
References
[Aronson 1997] Jesse Aronson. “Dead Reckoning: Latency Hiding for Networked
Games.” Gamasutra, September 19, 1997. Available at http://www.gamasutra.
com/view/feature/3230/dead_reckoning_latency_hiding_for_.php.
[Cado 2007] Olivier Cado. “Propagation of Visual Entity Properties Under Bandwidth
Constraints.” Gamasutra, May 24, 2007. Available at http://www.gamasutra.com/
view/feature/1421/propagation_of_visual_entity_.php.
[Campbell 2006] Matt Campbell and Curtiss Murphy. “Exposing Actor Properties Using
Nonintrusive Proxies.” Game Programming Gems 6, edited by Michael Dick-
heiser. Boston: Charles River Media, 2006.
[Fiedler 2009] Glenn Fiedler. “Drop in COOP for Open World Games.” Game Developers Conference, 2009.
[Hanson 2006] Andrew Hanson. Visualizing Quaternions. San Francisco: Morgan
Kaufmann, 2006.
[Koster 2005] Raph Koster. A Theory of Fun for Game Design. Paraglyph Press, 2005.
[Lengyel 2004] Eric Lengyel. Mathematics for 3D Game Programming & Computer
Graphics, Second Edition. Hingham, MA: Charles River Media, 2004.
[Moyer and Speicher 2005] Dale Moyer and Dan Speicher. “A Road-Based Algorithm
for Dead Reckoning.” Interservice/Industry Training, Simulation, and Education
Conference, 2005.
[Sayood 2006] Khalid Sayood. Introduction to Data Compression, Third Edition. San
Francisco: Morgan Kaufmann, 2006.
[Van Verth and Bishop 2008] James Van Verth and Lars Bishop. Essential Mathematics
in Games and Interactive Applications: A Programmer’s Guide, Second Edition.
San Francisco: Morgan Kaufmann, 2008.
19
An Egocentric Motion
Management System
Michael Ramsey
Ramsey Research, LLC
The egocentric motion management system (ECMMS) is both a model for agent
movement and an application of a behavioral theory. Any game that features
agents (e.g., animals, soldiers, or tanks) that move around in a 3D scene has a
need for an agent movement solution. A typical movement solution provides
mechanisms that allow for an agent to move through a scene, avoiding geometry,
all the while executing some sort of behavior.
This article discusses not only how focusing on the agent drives the immedi-
ate interactions with the environment but also, more importantly, that by gather-
ing some information about the environment during locomotion, we gain the
ability to generate spatial semantics for use by the agent’s behavior system. Por-
tions of the ECMMS were used in a cross-platform game entitled World of Zoo
(WOZ), shown in Figure 19.1. WOZ is an animal simulator that requires various
zoo animals to move through their environments in an incredibly compelling
manner while the players constantly alter the environment. So the proving ground
for this system was in an environment that could be changed around the agents at
any particular moment.
Figure 19.1. Screenshots from World of Zoo.
Convincing behavior is not a one-way street—what the agent does is just as important as what is perceived by the user. The immediate question that comes to mind is how we control the perception of an agent in a scene in such a way that facilitates perceived intent. Burmedez [2007] delineates between simple mindreading and perceptual mindreading. Simple mindreading is fundamentally behavioral coordination—this gets us nowhere, as we are obviously not going to expect the user to mimic the agent's behavior in order to understand it. However, perceptual mindreading is slightly different in that the focus is on the perceptual states of others, and we accomplish this by providing mechanisms to inform the user of not only the agent's intended state but its eventual state (i.e., goal-directed behavior, also known as propositional attitudes). This is critical because humans have a criterion for understanding behavior. If they witness the attribution of desire to belief, then it is likely that the behavior is justifiable. It's simply not enough to represent the states of a behavior—humans need to understand how they fit together.
One of the higher-order goals when designing a motion management system is that we ideally would like to observe an agent in a scene responding to stimuli in an appropriate manner. What is appropriate is open to interpretation, but what is not open for interpretation is the desire that we perceive the agent acting in a purposeful manner. To help facilitate this, we need to understand how physics, animation, and artificial intelligence are interwoven into a shadowy substance to imbue our agents with these characteristics. As such, this chapter focuses on the components that form the system and its resultant end product, but the takeaway should be about the process through which these results were obtained.
Figure 19.2. A top view of a bird with one collision sensor. When an object overlaps
with the collision sensor, a callback is invoked to register the object with the agent.
interact with the object is the following: the current behavior, the available space
for an action, and the ability of the agent to interact with the object. We discuss
this later on, but note that this system is predicated on a very fundamental dic-
tum, and agents need to influence behavior in an egocentric manner. Behaviors
need to be understood from the perspective of the agent—what may be usable by
an elephant may have an entirely different use or purpose for a mouse.
Figure 19.3. (a) A side view of a query space with three levels of collision sensors.
(b) The same query space viewed from directly above.
agent to not only receive callbacks due to spatially proximal bodies but also al-
lows for the agent to receive more distal callbacks that could influence its behav-
ior. As mentioned above, the query space is used to generate the semantics for
the agent relative to the geometry in the environment (which we discuss in more
detail in Section 19.7), but what the query space fundamentally allows for is the
ability to ask behavioral questions such as, “Is there anything on my right?” or
“Do I need to modify my gait to keep pace with the pack?” These are questions
that the agent should ask itself based upon its orientation with objects in the envi-
ronment.
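To make this concrete, the following sketch shows one way such questions could be answered from the sensor callbacks; the QuerySpace and SensorHit types and the region names are invented for this illustration and are not taken from the WOZ codebase.

#include <vector>

struct GameObject;

// Each collision sensor reports the overlapping object together with the
// sensor's placement relative to the agent and the sensor level it belongs to.
enum class SensorRegion { Left, Right, Front, Behind, Above, Below };

struct SensorHit
{
    const GameObject  *object;   // object registered by the sensor callback
    SensorRegion       region;   // where the sensor sits relative to the agent
    int                level;    // which ring of sensors reported the overlap
};

class QuerySpace
{
public:

    // Invoked by the physics engine when an object overlaps a collision sensor.
    void OnSensorOverlap(const GameObject *object, SensorRegion region, int level)
    {
        m_hits.push_back(SensorHit{object, region, level});
    }

    // Behavioral query such as "Is there anything on my right?".
    bool AnythingInRegion(SensorRegion region) const
    {
        for (const SensorHit& hit : m_hits)
        {
            if (hit.region == region) return true;
        }
        return false;
    }

    void Clear() { m_hits.clear(); }

private:

    std::vector<SensorHit> m_hits;
};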
behavior is a series of events that may be externally visible to the player (or per-
haps not), but nonetheless influences the agent’s intention. Now it makes obvious
sense to make as much as possible visible to the player, so agents should provide
visual cues to what is transpiring in their minds (e.g., facial movements to more
exaggerated actions like head looks or body shifts).
(Figure: the ECMMS architecture relates the navigation manager, the ECMMS manager, the agent, and the agent brain.)
bodies with the bones of an agent’s skeleton. Then when the agent animates, the
rigid bodies are keyframed to their respective bone positions. This in itself allows
for nice granular interactions with dynamic game objects. The collision sensors
placed around an agent do and should differ based upon aspects such as the
agent’s size, turning radius, and speed. The layout of the query space needs to be
done in conjunction with the knowledge of corresponding animation information.
If an agent is intended to jump long distances, then the query space generally
needs to be built such that the collision sensors receive callbacks from overlap-
ping geometry in time to not only determine the validity of the actions but also
the intended behavioral result.
that you can build your own behavioral system. An agent’s observed behavior
provides significant insight into its emotional state, attitudes, and attention, and
as a result, a considerable amount of perceived behavior originates from how an
agent moves through the world relative to not only the objects but also to the
available space within that environment. How an agent makes use of space has
been covered [Ramsey 2009a, Ramsey 2009b]—here we focus on how the
ECMMS provides the underpinnings for a behavioral model that embraces ego-
centric spatial awareness.
As we’ve mentioned before, the ECMMS is a system that allows for an agent
to gather information about its environment as it moves through it. What an agent
needs is the ability to classify this information and generate what we call spatial
semantics; spatial semantics allow the higher-order systems to make both short-
term as well as long-term decisions based upon the spatial orientation of an agent
relative to the geometry in the scene. Spatial semantics signify an important
distinction from the typical approach of agent classification in games, which
relies upon methods of perhaps too fine a granularity to drive the immediate action
of an agent. To that end, we want to build the basis for informing behavioral de-
cisions from one aspect of the situation at hand, that being the relative spatial
orientation of an agent with the elements in its environment.
Figure 19.7 shows an example of an agent that is next to a wall, as well as
what its query space looks like. In general, we came up with a series of funda-
mental categories that allowed us to generate a meaning from the raw collision
sensor information. The syntax we allowed consisted of SOLeft, SORight, SOBe-
hind, SOInFront, SOAbove, SOBelow, SONear, and SOFar. If something was im-
Figure 19.7. (a) A bird next to a wall. (b) The bird’s query space.
Figure 19.8. (a) A bird in a corner. (b) The bird’s generated syntax. (c) The semantic of
the spatial situation that the bird is in.
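As a rough illustration of how the raw sensor information might be reduced to this syntax, consider the sketch below; the QueryFlags structure is a stand-in for whatever the query space actually reports, and the chapter does not prescribe a specific implementation.

#include <vector>

enum SpatialSemantic
{
    SOLeft, SORight, SOBehind, SOInFront, SOAbove, SOBelow, SONear, SOFar
};

// Hypothetical summary of which sensor regions currently contain geometry.
struct QueryFlags
{
    bool left, right, behind, inFront, above, below, nearby, distant;
};

std::vector<SpatialSemantic> GenerateSemantics(const QueryFlags& flags)
{
    std::vector<SpatialSemantic> semantics;

    if (flags.left)    semantics.push_back(SOLeft);
    if (flags.right)   semantics.push_back(SORight);
    if (flags.behind)  semantics.push_back(SOBehind);
    if (flags.inFront) semantics.push_back(SOInFront);
    if (flags.above)   semantics.push_back(SOAbove);
    if (flags.below)   semantics.push_back(SOBelow);
    if (flags.nearby)  semantics.push_back(SONear);
    if (flags.distant) semantics.push_back(SOFar);

    return semantics;
}

A bird standing in a corner, as in Figure 19.8, would then produce a set containing, for example, SOInFront and SORight, which the higher-order behavior systems can match against known situations.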
Table 19.1. Puppy response pool: get food, get water, go outside, play with user, solicit for toy/interaction.
agent’s internal attributes along with any spatial semantics generated from the
ECMMS. The single agent behavioral response algorithm selects a response to a
perceived situation in the scene; the responses are already associated with specif-
ic macro situations that may occur in a scene. For our example, we have a simple
puppy game where the puppy has three attributes: hunger, water, and fun. Ta-
ble 19.1 shows the puppy’s response pool, and Table 19.2 lists the perceived sit-
uations that a puppy can find itself in.
The following is a straightforward algorithm for classifying responses gener-
ated from not only typical behavioral considerations but also from the spatial ori-
entations that the ECMMS has provided.
Table 19.2. Situation pool that contains behavioral characteristics, as well as possible spatial orientations that a puppy might find itself in: next to sofa, next to wall, cornered, playing with user, hungry, thirsty, play, explore.
The single-agent response algorithm allows for the prioritization of the pup-
py’s spatial situations and its needs at the same time. This allows for the envi-
ronment to have an immediate influence on the puppy’s behavior. The response
algorithm’s perceived situations list is initially populated by information from the
agent itself (this would include how hungry the puppy is, how thirsty, etc.). The
ECMMS then inserts the situational semantics into the list (this may include in-
formation similar to: wall on the right, there is a wall behind me, I’m in a corner,
etc.). The puppy then scores each of the situation and response entries (in this
case, there is some game code that evaluates the entries and generates a priority),
and the list is sorted. The behavior system decides whether the highest-priority
entry is appropriate, and if so, executes the response. It is expected that not every
situation will have a response, and this is definitely okay because there are (and
should be) several default behaviors that the puppy goes into.
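A compact sketch of that flow is shown below; the entry layout and the three helper functions are placeholders for the game-specific code mentioned above.

#include <algorithm>
#include <vector>

// Hypothetical entry pairing a perceived situation with a candidate response.
struct SituationEntry
{
    int    situationId;   // e.g., cornered, next to wall, hungry
    int    responseId;    // response associated with the situation
    float  priority;      // filled in by game-specific scoring code
};

float ScoreEntry(const SituationEntry& entry);       // game-specific scoring
bool  IsAppropriate(const SituationEntry& entry);    // behavior-system check
void  ExecuteResponse(int responseId);               // kicks off the response

void SelectResponse(std::vector<SituationEntry>& entries)
{
    // The puppy's own attributes and the ECMMS spatial semantics have already
    // been inserted into the entry list at this point.

    for (SituationEntry& entry : entries)
        entry.priority = ScoreEntry(entry);

    std::sort(entries.begin(), entries.end(),
              [](const SituationEntry& a, const SituationEntry& b)
              { return (a.priority > b.priority); });

    // The behavior system decides whether the top entry is appropriate; if it
    // is not, the puppy falls back to one of its default behaviors.
    if (!entries.empty() && IsAppropriate(entries.front()))
        ExecuteResponse(entries.front().responseId);
}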
Table 19.3. Response to situation mapper. This mapping has the responses to the situa-
tion without any contextual prioritization factored in.
The responses in Table 19.3 contain both a priority and an objective. For the
example in our puppy game, food and water would receive a higher priority over
activities such as play, but then again, that choice is context dependent since the
game may have a specific area just for extensive play periods where we don’t
want our puppy to get hungry or thirsty. So it makes sense to have response pri-
orities contextually modifiable. The generated response also has an objective that
is used to fulfill that specific response; this is another area in which the ECMMS
can aid the behavioral model by providing a list of suitable objectives that satisfy
the response, in essence creating some variability as opposed to always executing
the same response. If no suitable objects are within the query space of the puppy,
then the ECMMS can suggest to the behavioral model to seek out the desired
objective. What this behavioral example provides is an agent that is exhibiting a
nice ebb and flow between itself and its environment, as well as providing the
players with an interesting window into the agent’s perceived and intended be-
havior.
References
[Bermudez 2007] Jose Luis Bermudez. Thinking Without Words. Oxford University
Press, 2007.
[Gibson 1986] James J. Gibson. The Ecological Approach to Visual Perception. Hills-
dale, NJ: Lawrence Erlbaum Associates, 1986.
[Hart et al. 1968] Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. "A Formal Basis
for the Heuristic Determination of Minimum Cost Paths." IEEE Transactions on
Systems Science and Cybernetics SSC-4:2 (July 1968), pp. 100–107.
[Ramsey 2009a] Michael Ramsey. “A Unified Spatial Representation for Navigation
Systems.” Proceedings of The Fifth AAAI Artificial Intelligence and Interactive
Digital Entertainment Conference, 2009, pp. 119–122.
[Ramsey 2009b] Michael Ramsey. “A Practical Spatial Architecture for Animal and
Agent Navigation.” Game Programming Gems 8, edited by Adam Lake. Boston:
Charles River Media, 2010.
20
Pointer Patching Assets
Jason Hughes
Steel Penny Games, Inc.
20.1 Introduction
Console development has never been harder. The clock speeds of processors
keep getting higher, the number of processors is increasing, the number of
megabytes of memory available is staggering, even the storage size of optical
media has ballooned to over 30 GB. Doesn’t that make development easier, you
ask? Well, there’s a catch. Transfer rates from optical media have not improved
one bit and are stuck in the dark ages at 8 to 9 MB/s. That means that in the best
possible case of a single contiguous read request, it still takes almost a full
minute to fill 512 MB of memory. Even with an optimistic 60% compression,
that’s around 20 seconds.
As long as 20 seconds sounds, it is hard to achieve without careful planning.
Most engines, particularly PC engines ported to consoles, tend to have the
following issues that hurt loading performance even further:
Bad Solutions
There are many ways to address these problems. One popular old-school way to
improve the disk seeks between files is to log out all the file requests and
rearrange the file layout on the final media so that seeks are always forward on
the disk. CD-ROM and DVD drives typically perform seeks forward more
quickly than backward, so this is a solution that only partially addresses the heart
of the problem and does nothing to handle the time wasted processing the data
after each load occurs. In fact, loading individual files encourages a single-
threaded mentality that not only hurts performance but does not scale well with
modern multithreaded development.
The next iteration is to combine all the files into a giant metafile for a level,
retaining a familiar file access interface, like the FILE type, fopen() function,
and so on, but adding a large read-ahead buffer. This helps cut down further on
the bandwidth stalls, but again, suffers from a single-threaded mentality when
processing data, particularly when certain files contain other filenames that need
to be queued up for reading. This spider web of dependencies further complicates
any attempt to optimize the file I/O.
The next iteration in a system like this is to make it multithreaded. This
basically requires some accounting mechanism using threads and callbacks. In
this system, the order of operations cannot be assured because threads may be
executed in any order, and some data processing occurs faster for some items
than others. While this does indeed allow for continuous streaming in parallel
with the loaded data initialization, it also requires a far more complicated scheme
of accounting for objects that have been created but are not yet “live” in the game
because they depend on other objects that are not yet live. In the end, there is a
single object called a level that has explicit dependencies on all the subelements,
and they on their subelements, recursively, which is allowed to become live only
after everything is loaded and initialized. This undertaking requires clever
management of reference counts, completion callbacks, initialization threads, and
a lot of implicit dependencies that have to be turned into explicit dependencies.
Analysis
We’ve written all of the above solutions, and shipped multiple games with each,
but cannot in good faith recommend any of them. In our opinion, they are
bandages on top of a deeper architectural problem, one that is rooted in a
failure to practice a clean separation between what is run-time code and what is
tools code.
How do we get into these situations? Usually, the first thing that happens on
a project, especially when an engine is developed on the PC with a fast hard disk
drive holding files, is that data needs to be loaded into memory. The fastest and
easiest way to do that is to open a file and read it. Before long, all levels of the
engine are doing so, directly accessing files as they see fit. Porting the engine to a
console then requires writing wrappers for the file system and redirecting the
calls to the provided file I/O system. Performance is poor, but it’s working. Later,
some sad optimization engineer is tasked with getting the load times down from
six minutes to the industry standard 20 seconds. He’s faced with two choices:
1. Track down and rewrite all of the places in the entire engine and game where
file access is taking place, and implement something custom and appropriate
for each asset type. This involves making changes to the offline tools
pipeline, outputting data in a completely different way, and sometimes
grafting existing run-time code out of the engine and into tools. Then, deal
with the inevitable ream of bugs introduced at apparently random places in
code that has been working for months or years.
2. Make the existing file access system run faster.
Oh, and one more thing—his manager says there are six weeks left before the
game ships. There’s no question it’s too late to pick choice #1, so the intrepid
engineer begins the journey down the road that leads to the Bad Solution. The
game ships, but the engine’s asset-loading pipeline is forever in a state of
maintenance as new data files make their way into the system.
Considerations
Unfortunately, pointer patching of assets is relatively hard to retrofit into existing
engines. The tools pipeline must be written to support outputting data in this
format. The run-time code must be changed to expect data in this format,
generally implying the removal of a lot of initialization code, but more often than
not, it requires breaking explicit dependencies on loading other files directly
during construction and converting that to a tools-side procedure. This sort of
dynamic loading scheme tends to map cleanly onto run-time lookup of symbolic
pointers, for example. In essence, it requires detangling assets from their disk
access entirely and relegating that chore to a handful of higher-level systems.
If you’re considering retrofitting an existing engine, follow the 80/20 rule. If
20 percent of the assets take up 80 percent of the load time, concentrate on those
first. Generally, these should be textures, meshes, and certain other large assets
that engines tend to manipulate directly. However, some highly-processed data
sets may prove fruitful to convert as well. State machines, graphs, and trees tend
to have a lot of pointers and a lot of small allocations, all of which can be done
offline to dramatically improve initialization performance when moved to a
pointer patching pipeline.
(a) Concatenate all the structures into a single contiguous block of memory,
remembering the location to which each structure was relocated.
(b) Iterate through all the pointers in each relocated structure and convert the
raw addresses stored into offsets into the concatenated block. See
Figure 20.1 for an example of how relative offsets are used as relocatable
pointers. Any pointers that cannot be resolved within the block are
indications that the serialized data structure is not coherent.
(c) Append a table of pointer locations within the block that will need to be
fixed up after the load has completed.
(d) Append a table of name/value pairs that allows the run-time code to
locate the important starting points of structures.
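A block produced by these steps might be laid out on disk roughly as follows; the field names are invented for this sketch, and the sample code on the website defines its own format.

#include <cstdint>

// Hypothetical on-disk layout produced by steps (a) through (d). All offsets
// are relative, so the block can be loaded anywhere in memory and patched.
struct AssetBlockHeader
{
    std::uint32_t dataSize;          // size of the concatenated structures (a)
    std::uint32_t pointerCount;      // entries in the pointer patch table (c)
    std::uint32_t nameCount;         // entries in the name/value table (d)
};

struct NameTableEntry
{
    std::uint32_t nameOffset;        // offset of a null-terminated asset name
    std::uint32_t structOffset;      // offset of the named structure's start
};

// Block layout on disk:
//   [AssetBlockHeader]
//   [data: structures with pointers stored as relative offsets (b)]
//   [patch table: pointerCount offsets of pointers within the data (c)]
//   [name table: nameCount NameTableEntry records (d)]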
Figure 20.1. This shows all the flavors of pointers stored in a block of data on disk.
Forward pointers are relative to the pointer’s address and have positive values. Backward
pointers are relative to the pointer’s address and have negative values. Null pointers store
a zero offset, which is a special case when patched and is left zero.
Desirable Properties
There are many ways to get from the above description to actual working code.
We’ve implemented this three different ways over the years and each approach
has had different drawbacks and features. The implementation details of the
following are left as an exercise to the reader, but here are some high-level
properties that should be considered as part of your design when creating your
own pointer patching asset system:
■ Relocatable assets. If you can patch pointers, you can unpatch them as well.
This affords you the ability to defragment memory, among other things—the
holy grail of memory stability.
■ General compression. This reduces load times dramatically and should be
supported generically.
■ Custom compression for certain structures. A couple of obvious candidates
are triangle list compression and swizzled JPEG compression for textures.
There are two key points to consider here. First, perform custom compression
before general compression for maximum benefit and simplest
implementation. This is also necessary because custom compression (like
JPEG/DCT for images) is often lossy and general compression (such as LZ77
with Huffman or arithmetic encoding) is not. Second, perform in-place
decompression, which requires the custom compressed region to take the
same total amount of space, only backfilled with zeros that compress very
well in the general pass. You won’t want to be moving memory around
during the decompression phase, especially if you want to kick off the
decompressor functions into separate jobs/threads.
■ Offline block linking. It is very useful to be able to handle assets generically
and even combine them with other assets to form level-specific packages,
without having to reprocess the source data to produce the pointer patched
asset. This can lead to a powerful, optimized tools pipeline with minimal
load times and great flexibility.
■ Symbolic pointers with delayed bindings. Rather than all pointers having
physical memory addresses, some studios use named addresses that can be
patched at run time after loading is complete. This way you can have pointers
that point to the player’s object or some other level-specific data without
needing to write custom support for each case.
■ Generic run-time asset caches. Once loading is handled largely through
pointer-patched blocks with named assets, it is fairly simple to build a
generic asset caching system on top of this, even allowing dynamic reloading
of assets at run time with minimal effort.
■ Simple tools interface that handles byte swapping and recursion. Writing out
data should be painless and natural, allowing for recursive traversals of live
data structures and minimal intrusion.
■ Special pointer patching consideration for virtual tables. A method for
virtual table patching may be necessary to refresh class instances that have
virtual functions. Or choose ways to represent your data without using virtual
functions.
■ Offline introspection tools. Not only is it very useful for debugging the asset
pipeline, but a generic set of introspection tools can help perform vital
analyses about the game’s memory consumption based on asset type,
globally, and without even loading the game on the final platform!
■ Propagate memory alignment requirements. Careful attention to data
alignment allows hardware to receive data in the fastest possible way. Design
your writing interface and linking tools to preserve and propagate alignments
so that all the thinking is done in tools, even if it means inserting some
wasted space to keep structures starting on the right address. All the run-time
code should need to know is the alignment of the block as a whole.
While all the above properties have their merits, a complete and full-featured
system is an investment for studios to undertake on their own. A very basic
system is provided on the website. It supports byte swapping of atomic types,
pointer patching, and a clean and simple interface for recursion.
them at run time. While this example only contains a few nodes, it is nonetheless
a nontrivial example given the comparative complexity of any reasonable and
flexible alternative.
Tools Side
The first order of business is to declare the data structure we want in the run-time
code. Note that there are no requirements placed on the declarations—minimal
intrusion makes for easier adoption. You should be able to use the actual run-
time structure declaration in those cases where there are limited external
dependencies.
Listing 20.1 shows a simple example of how to dump out a live tree data
structure into a pointer-patched block in just a few lines of code. As you can see,
the WriteTree() function just iterates over the entire structure—any order is
actually fine—and submits the contents of each node to the writer object. Each
call to ttw.Write*() is copying some data from the Node into the writer’s
memory layout, which matches exactly what the run-time code will use. As
written, WriteTree() simply starts writing a tree recursively down both the left
and right branches until it exhausts the data. The writer interface is designed to
handle data in random order and has an explicit finalization stage where
addresses of structures are hooked to pointers that were written out. This
dramatically improves the flexibility of the tools code.
struct Node
{
    Node(float v) : mLeft(NULL), mRight(NULL), mValue(v) {}

    Node    *mLeft;
    Node    *mRight;
    float    mValue;
};

void WriteTree(Node *n, ToolsTimeWriter& ttw)
{
    // Tell the writer which live structure the following fields belong to.
    // (StartStruct() is assumed here; the writer needs to know the source
    // address so that pointers to this node can be resolved during finalization.)
    ttw.StartStruct(n);

    if (n->mLeft)
        WriteTree(n->mLeft, ttw);
    ttw.WritePtr();

    if (n->mRight)
        WriteTree(n->mRight, ttw);
    ttw.WritePtr();

    ttw.Write4();
    ttw.EndStruct();
}

ToolsTimeWriter ttw(false);
ttw.StartAsset("TreeOfNodes", root);
WriteTree(root, ttw);
std::vector<unsigned char> packedData = ttw.Finalize();

Listing 20.1. The structure declaration for the sample tree, along with the code that writes a live tree out through the ToolsTimeWriter interface.
Figure 20.2. This is the in-memory layout of our sample tree. Notice it requires five
memory allocations and various pointer traversals to configure.
Addr: 00   04   08   0C   10   14   18   1C   20   24   28   2C   30   34   38
Data: 000C 002C 3.14 0000 0008 5.0  000C 0000 1.0  0000 0000 0.01 0000 0000 777.0
Figure 20.3. This is the finalized data from the writer, minus the header and format
details. Notice that all offsets are relative to the pointer’s address, rather than an absolute
index. This vastly simplifies bookkeeping when merging multiple blocks together.
Next, there is a block of code that creates a live tree structure using typical
allocation methods you would use in tools code, graphically depicted in
Figure 20.2. Finally, we write out the live tree structure and finalize it down to a
single block of opaque data. This is done by declaring a writer object, marking
the start of the specific structure that the run-time code will later find by name,
calling our data writing function above, and then retrieving the baked-down
block of bytes that should be written to disk. Amazingly, Figure 20.3 shows the
entire structure in just a handful of bytes.
The ToolsTimeWriter class shown in Listing 20.2 is only a toy
implementation. Some basic limitations in this implementation are that it only
supports 32-bit pointers, doesn’t handle alignment requirements per structure,
etc. Still, it is educational to see the approach taken by this one of many possible
interfaces.
class ToolsTimeWriter
{
public:

    ToolsTimeWriter(bool byteSwap);

    // Mark the start of a named, top-level asset or of a nested structure.
    // (These declarations, and Finalize() below, are assumed from their use
    // in Listing 20.1; the sample code on the website is the authoritative
    // interface.)
    void StartAsset(const char *name, void *s);
    void StartStruct(void *s);
    void EndStruct(void);

    // Once a struct has been started, you call these to pump out
    // data. These are needed to handle byte swapping and to
    // measure struct sizes.
    void Write1(void);
    void Write2(void);
    void Write4(void);
    void Write8(void);
    void WritePtr(void);
    void WriteRaw(int numBytes);

    // Resolve all pointers to relative offsets and return the packed block.
    std::vector<unsigned char> Finalize(void);
};

Listing 20.2. This basic interface is close to the minimum requirements for a pointer patching
asset interface.
Figure 20.4. This is a very simple file format and is fully loaded into RAM in a single
read before the fix-up phase occurs.
One important characteristic is that the fix-up process only modifies the
pointers stored within the data block and not in the metadata. This is partly
because patching is generally done only once, and therefore, there is no value in
causing a lot of cache lines to be flushed out to RAM by writing them out. It is also partly
because the pointer patch table, among other relative pointers, has to be handled
outside the normal patching mechanism (otherwise the addresses of the pointers
in the table would need to be in the table, see?). Since those pointers must be dealt
with explicitly anyway, patching them as well would require a flag indicating whether
the pointers are patched or not, along with two code paths, depending on whether
patching is necessary. So, we leave them unpatched all the time and enjoy
reduced code complexity and the strongest possible run-time performance while
patching pointers.
Listing 20.3. The following function simply walks the table of offsets to pointers, then adds each
pointer’s address to the offset stored at that location, constructing a properly patched pointer.
Listing 20.4. Assuming that the variable loadedFromDiskPtr points to the address of a coherent
block stored on disk, these two lines of code are all that is necessary to reconstruct a full tree data
structure.
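Those two lines would look something like the following; the PatchedAssetBlock wrapper and its FindAsset() method are hypothetical stand-ins for the interface shipped with the sample code.

// Patch the block in place, then look up the named root written by StartAsset().
PatchedAssetBlock block(loadedFromDiskPtr);
Node *root = static_cast<Node *>(block.FindAsset("TreeOfNodes"));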
21
Data‐Driven Sound Pack
Loading and Organization
Simon Franco
The Creative Assembly
21.1 Introduction
Typically, a game’s audio data, much like all other game data, cannot be fully
stored within the available memory of a gaming device. Therefore, we need to
develop strategies for managing the loading of our audio data. This is so we only
store in memory the audio data that is currently needed by the game. Large audio
files that are too big to fit into memory, such as a piece of music or a long sound
effect that only has one instance playing at a time, can potentially be streamed in.
Streaming a file in this way results in only the currently needed portion of the file
being loaded into memory. However, this does not solve the problem of how to
handle audio files that may require multiple instances of an audio file to be
played at the same time, such as gunfire or footsteps.
Sound effects such as these need to be fully loaded into memory. This is so
the sound engine can play multiple copies of the same sound effect, often at
slightly different times, without needing to stream the same file multiple times
and using up the limited bandwidth of the storage media.
To minimize the number of file operations performed, we typically organize
our audio files into sound packs. Each sound pack is a collection of audio files
that either need to be played together at the same time or within a short time pe-
riod of each other.
Previously, we would package up our audio files into simplified sound packs.
These would typically have been organized into a global sound pack, character
sound packs, and level sound packs. The global sound pack would contain all
audio files used by global sound events that occur across all levels of a game.
This would typically have been player and user interface sound events. Character
sounds would typically be organized so that there would be one sound pack per
type of character. Each character sound pack would contain all the audio files
used by that character. Level sound packs would contain only the audio files
used by sound events found on that particular level.
However, this method of organization is no longer applicable, as a single
level’s worth of audio data can easily exceed the amount of sound RAM availa-
ble. Therefore, we must break up our level sound packs into several smaller
packs so that we can fit the audio data needed by the current section of the game
into memory. Each of these smaller sound packs contains audio data for a well-
defined small portion of the game. We then load in these smaller sound packs
and release them from memory as the player progresses through the game. An
example of a small sound pack would be a sound pack containing woodland
sounds comprising bird calls, trees rustling in the wind, forest animals, etc.
The problem with these smaller sound packs is how we decide which audio
files are to be stored in each sound pack. Typically, sound packs, such as the ex-
ample woodlands sound, pack are hand-organized in a logical manner by the
sound designer. However, this can lead to wasting memory as sounds are
grouped by their perceived relation to each other, rather than an actual require-
ment that they be bound together into a sound pack.
The second problem is how to decide when to load in a sound pack. Previ-
ously, a designer would place a load request for a sound pack into the game
script. This would then be triggered when either the player walks into an area or
after a particular event happens. An example of this would be loading a burning
noises sound pack and having this ready to be used after the player has finished a
cut-scene where they set fire to a building.
This chapter discusses methods to automate both of these processes. It allows
the sound designer to place sounds into the world and have the system generate
the sound packs and loading triggers. It also alerts the sound designer if they
have overpopulated an area and need to reduce either the number of variations of
a sound or reduce the sample quality.
Sound emitters are typically points within 3D space representing the posi-
tions from which a sound will play. As well as having a position, an emitter also
contains a reference to a sound event. A sound event is a collection of data dictat-
ing which audio file or files to play, along with information on how to control
how these audio files are played. For example, a sound event would typically
store information on playback volume, pitch shifts, and any fade-in or fade-out
durations.
Sound designers typically populate the game world with sound emitters in
order to build up a game’s soundscape. Sound emitters may be scripted directly
by the sound designer or may be automatically generated by other game objects.
For example, an animating door may automatically generate wooden_door_
open and wooden_door_close sound emitters.
Once all the sound emitters have been placed within the game world, we can
begin our data collection. This process should be done offline as part of the pro-
cess for building a game’s level or world data.
Each sound event has an audible range defined by its sound event data. This
audible range is used to calculate both the volume of the sound and whether the
sound is within audible range by comparing against the listener’s position. The
listener is the logical representation of the player’s ear in the world—it’s typical-
ly the same as the game’s camera position. We use the audible range property of
sound emitters to see which sound emitters are overlapping.
We construct a sound map to store an entry for each sound emitter found
within the level data. The sound map can be thought of as a three-dimensional
space representing the game world. The sound map contains only the sound emit-
ters we’ve found when processing the level data. Each sound emitter is stored in
the sound map as a sphere, with the sphere’s radius being the audible distance of
the sound emitter’s event data.
Once the sound map is generated, we can construct an event table containing
an entry for each type of sound event found in the level. For each entry in the
table, we must mark how many instances of that sound event there are within the
sound map, which other sound events overlap with them (including other in-
stances of the same sound event), and the number of instances in which those
events overlap. For example, if a single Bird_Chirp sound emitter overlaps with
two other sound emitters playing the Crickets sound event, then that would be
recorded as a single occurrence of an overlap between Bird_Chirp and Crick-
ets for the Bird_Chirp entry. For the Crickets entry, it would be recorded as
two instances of an overlap. An example table generated for a sample sound map
is shown in Table 21.1. From this data, we can begin constructing our sound
packs.
Table 21.1. Sound events discovered within a level and details of any other overlapped
events.
Each entry in the streamed sound event table contains the name of a streamed sound event
and a list of other streamed sound events whose sound emitters overlap with a
sound emitter having this type of sound event. To decide whether a sound event
should be streamed, we must take the following rules into account (a short decision sketch follows the list):
■ Is there only one instance of the sound event in the table? If so, then stream-
ing it would be more efficient. The exception to this rule is if the size of the
audio file used by the event is too small. In this eventuality, we should load
the audio file instead. This is so our streaming resources are reserved for
more suitable files. A file is considered too small if its file size is smaller
than the size of one of your streaming audio buffers.
■ Does the sound event overlap with other copies of itself? If not, then it can be
streamed because the audio data only needs to be processed once at any giv-
en time.
■ Does the audio file used by the sound event have a file size bigger than the
available amount of audio RAM? If so, then it must be streamed.
■ Are sufficient streaming resources available, such as read bandwidth to the
storage media, to stream data for this sound event? This is done after the oth-
er tests have been passed because we need to see which other streamed sound
events may be playing at the same time. If too many are playing, then we
need to prioritize the sound events. Larger files should be streamed, and
smaller-sized audio files should be loaded into memory.
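A sketch of that decision is given below; the thresholds, the per-event statistics, and the resource check are assumptions standing in for whatever the offline event table and the streaming system actually provide.

#include <cstddef>

// Statistics gathered offline for one sound event.
struct EventStats
{
    int          instanceCount;    // emitters that use this event
    bool         overlapsItself;   // overlaps other copies of the same event?
    std::size_t  audioFileSize;    // size of the audio data in bytes
};

bool ShouldStream(const EventStats& stats, std::size_t streamBufferSize,
                  std::size_t availableAudioRam, bool streamingResourcesFree)
{
    // A file larger than the available sound RAM has no choice but to stream.
    if (stats.audioFileSize > availableAudioRam)
        return true;

    // A file smaller than one streaming buffer is not worth a stream handle.
    if (stats.audioFileSize < streamBufferSize)
        return false;

    // A single instance, or an event that never overlaps itself, is a good
    // candidate, provided the read bandwidth is actually available.
    if ((stats.instanceCount == 1 || !stats.overlapsItself) && streamingResourcesFree)
        return true;

    return false;
}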
Using the data from Table 21.1, we can extract a streamed sound event table
similar to that shown in Table 21.2. In Table 21.2, we have three sound events
that we believe to be suitable candidates for streaming. Since Water_Mill and
NPC_Speech1 overlap, we should make sure that the audio files for these events
are placed close to each other on the storage media. This reduces seek times
when reading the audio data for these events.
Table 21.3. Remaining sound events that need to be placed into sound packs.
■ Sound events that overlap approximately two-thirds of the time are potential
candidates for having their audio files placed into the same sound pack. The
overlap count should be a high percentage both ways so that sound event A
overlaps with sound event B a high number of times, and vice versa. Other-
wise, we may end up with redundant audio data being loaded frequently for
one of the sound events.
■ All audio files used by a sound event should be placed into the same sound
pack.
■ The file size of a sound pack should not exceed the amount of available
sound RAM.
■ The ratio between the size of a sound event’s audio data and the file size of
the sound pack should closely match the percentage of event overlaps. For
instance, if we have a sound event whose audio data occupies 80 percent of a
sound pack’s file size, but is only used 10 percent of the time, then it should
be placed in its own sound pack.
next to each other (five occurrences of an overlap) and next to a single cricket
(five occurrences of an overlap with Crickets) in our table. For the Crickets
entry, we have only a single cricket that was next to the frogs, so there is only
one instance of an overlap.
There were no instances of the Bird_Chirp event overlapping with the
Frogs event, so we should put Bird_Chirp and Crickets into a single sound
pack and put Frogs into a separate sound pack.
mark_all_sound_packs_for_removal(m_loaded_sound_pack_list);
get_list_of_packs_needed(required_pack_list, m_listener_position);
/*
Iterate through all the loaded sound packs and retain
those which are in the required list.
*/
for (each required_pack in required_pack_list)
{
for (each loaded_pack in m_loaded_sound_pack_list)
{
if (loaded_pack == required_pack)
{
retain_sound_pack(loaded_pack);
remove_pack_from_required_list(required_pack);
break;
}
}
}
unload_sound_packs_not_retained(m_loaded_sound_pack_list);
/*
Now all the sound packs remaining in required_pack_list
are those which are not yet loaded.
*/
for (each required_pack in required_pack_list)
{
load_sound_pack_and_add_to_loaded_list(required_pack);
}
21.5 Conclusion
While not all sound events can have their data packaged up by the process de-
scribed in this chapter, it still helps simplify the task of constructing and manag-
ing sound packs used by a game’s environment. Sound events that typically can’t
take advantage of this process are global sounds, such as the player being hit or a
projectile being fired, because they can occur at any time during a level.
22
GPGPU Cloth Simulation Using GLSL, OpenCL, and CUDA
Marco Fratarcangeli
Taitus Software Italia
22.1 Introduction
This chapter provides a comparison study between three popular platforms for
generic programming on the GPU, namely, GLSL, CUDA, and OpenCL. These
technologies are used for implementing an interactive physically-based method
that simulates a piece of cloth colliding with simple primitives like spheres, cyl-
inders, and planes (see Figure 22.1). We assess the advantages and the drawbacks
of each different technology in terms of usability and performance.
Figure 22.1. A piece of cloth falls under the influence of gravity while colliding with a
sphere at interactive rates. The cloth is composed of 780,000 springs connecting 65,000
particles.
Figure 22.2. A 4 × 4 grid of particle vertices and the springs for one of the particles.
$$\mathbf{F}_{\text{spring}} = -k\,(l - l_0) - b\,\dot{\mathbf{x}},$$

where l represents the current length of the spring (i.e., its magnitude is the dis-
tance between the connected particles), $l_0$ represents the rest length of the spring
at the beginning of the simulation, k is the stiffness constant, $\dot{\mathbf{x}}$ is the velocity of
the particle, and b is the damping constant. This equation means that a spring
always applies a force that brings the distance between the connected particles
back to its initial rest length. The more the current distance diverges from the rest
length, the larger the applied force. This force is damped proportionally to
the current velocity of the particles by the last term in the equation. The blue
springs in Figure 22.2 simulate the stretch stress of the cloth, while the longer red
ones simulate the shear and bend stresses.
For each particle, the numerical algorithm that computes its dynamics is
schematically illustrated in Figure 22.3: starting from the initial state $\mathbf{x}(t_0)$, $\dot{\mathbf{x}}(t_0)$,
the loop computes the acceleration $\ddot{\mathbf{x}}(t) = \mathbf{F}(t)/m$, updates the state to time $t + \Delta t$,
and handles collisions. For each step of the dynamic simulation,
the spring forces and other external forces (e.g., gravity) are applied to the parti-
cles, and then their dynamics are computed according to the Verlet method [Mül-
ler 2008] applied to each particle in the system through the following steps:
1. $\dot{\mathbf{x}}(t) = \dfrac{\mathbf{x}(t) - \mathbf{x}(t - \Delta t)}{\Delta t}$.
2. $\ddot{\mathbf{x}}(t) = \dfrac{\mathbf{F}(\mathbf{x}(t), \dot{\mathbf{x}}(t))}{m}$.
3. $\mathbf{x}(t + \Delta t) = 2\,\mathbf{x}(t) - \mathbf{x}(t - \Delta t) + \ddot{\mathbf{x}}(t)\,\Delta t^2$.
Here, $\mathbf{F}(t)$ is the current total force applied to the particle, m is the particle mass,
$\ddot{\mathbf{x}}(t)$ is its acceleration, $\dot{\mathbf{x}}(t)$ is the velocity, $\mathbf{x}(t)$ is the current position, and $\Delta t$ is
the time step of the simulation (i.e., how much time the simulation is advanced
for each iteration of the algorithm).
The Verlet method is very popular in real-time applications because it is
simple and fourth-order accurate, meaning that the error for the position compu-
tation is $O(\Delta t^4)$. This makes the Verlet method two orders of magnitude more
precise than the explicit Euler method, and at the same time, it avoids the compu-
where c and r are the center and the radius of the sphere, respectively. If a colli-
sion occurs, then it is handled by moving the particle into a valid state by moving
its position just above the surface of the sphere. In particular, the particle should
be displaced along the normal of the surface at the impact point. The position of
the particle is updated according to the formula
$$\mathbf{d} = \frac{\mathbf{x}(t + \Delta t) - \mathbf{c}}{\left\| \mathbf{x}(t + \Delta t) - \mathbf{c} \right\|}, \qquad \bar{\mathbf{x}}(t + \Delta t) = \mathbf{c} + \mathbf{d}\,r,$$

where $\bar{\mathbf{x}}(t + \Delta t)$ is the updated position after the collision. If the particle does not
penetrate too far, d can be considered as an acceptable approximation of the
normal to the surface at the impact point.
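In the CPU implementation this projection translates almost directly into code; the fragment below is a sketch that assumes a simple three-component vector type.

#include <cmath>

struct Vec3 { float x, y, z; };

// Push a penetrating particle back onto the sphere surface along the
// direction d from the sphere center to the particle, as in the formula above.
void ResolveSphereCollision(Vec3& particlePos, const Vec3& center, float radius)
{
    Vec3  d = { particlePos.x - center.x,
                particlePos.y - center.y,
                particlePos.z - center.z };
    float length = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);

    if (length < radius && length > 0.0f)        // penetration detected
    {
        float scale = radius / length;           // normalize d and scale by r
        particlePos.x = center.x + d.x * scale;
        particlePos.y = center.y + d.y * scale;
        particlePos.z = center.z + d.z * scale;
    }
}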
Even though the normal vector is computed during the simulation, it is used
only for rendering purposes and does not affect the simulation dynamics. Here,
the normal vector of a particle is defined to be the average of the normal vectors
of the triangulated faces to which the particle belongs. A different array is created
for storing the current positions, previous positions, and normal vectors. As ex-
plained in later sections of this chapter, for the GPU implementation, these at-
tributes are loaded as textures or buffers into video memory. Each array stores
the attributes for all the particles. The size of each array is equal to the size of an
attribute (four floating-point values) multiplied by the number of particles. For
example, the position of the i-th particle $p_i$ is stored in the positions array and
accessed as follows:
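The access itself is a one-liner along the following lines, assuming the four-float-per-particle layout just described.

// Each attribute stores four floats (x, y, z, w) per particle, so the
// position of particle i starts at index i * 4.
inline const float *GetPosition(const float *positions, int i)
{
    return (&positions[i * 4]);
}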
A particle is identified by its array index i, which is related to the row and
the column in the grid as follows:
$$\mathrm{row}(i) = \lfloor i / n \rfloor, \qquad \mathrm{col}(i) = i \bmod n.$$
From the row and the column of a particle, it is easy to access its neighbors by
simply adding an offset to the row and the column, as shown in the examples in
Figure 22.2.
The pseudocode for calculating the dynamics of the particles in an n × n grid
is shown in Listing 22.1. In steps 1 and 2, the current and previous positions of
the i-th particle are loaded in the local variables $\mathrm{pos}_i^t$ and $\mathrm{pos}_i^{t-1}$, respectively,
and then the current velocity $\mathrm{vel}_i^t$ is estimated in step 3. In step 4, the total force
$\mathrm{force}_i$ is initialized with the gravity value. Then, the for loop in step 5 iterates
over all the neighbors of $p_i$ (steps 5.1 and 5.2), spring forces are computed (steps
5.3 to 5.5), and they are accumulated into the total force (step 5.6). Each neigh-
bor is identified and accessed using a 2D offset $(x_{\text{offset}}, y_{\text{offset}})$ from the position of
$p_i$ within the grid, as shown in Figure 22.2. Finally, the dynamics are computed
in step 6, and the results are written into the output buffers in steps 7 and 8.
Listing 22.1. Pseudocode to compute the dynamics of a single particle i belonging to the n × n
grid.
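A compact C++ rendering of the same eight steps is sketched below; the neighbor offsets, the gravity constant, and the AccumulateSpringForce() helper are assumptions made for this illustration.

// Hypothetical helper that evaluates the spring equation of Section 22.2
// between particles i and j and adds the result to 'force'.
void AccumulateSpringForce(const float *curPos, int i, int j,
                           const float vel[3], float force[3]);

void UpdateParticle(int i, int n, const float *curPos, const float *prevPos,
                    float *outPos, float dt)
{
    const float mass = 1.0f;

    // Steps 1-3: load current and previous positions and estimate the velocity.
    float pos[3], old[3], vel[3];
    for (int c = 0; c < 3; ++c)
    {
        pos[c] = curPos[i * 4 + c];
        old[c] = prevPos[i * 4 + c];
        vel[c] = (pos[c] - old[c]) / dt;
    }

    // Step 4: initialize the total force with gravity.
    float force[3] = { 0.0f, -9.81f * mass, 0.0f };

    // Step 5: accumulate spring forces from the neighbors of particle i.
    static const int offsets[][2] =
    {
        { 1, 0}, {-1, 0}, {0, 1}, {0, -1},      // stretch springs
        { 1, 1}, {-1, 1}, {1, -1}, {-1, -1},    // shear springs
        { 2, 0}, {-2, 0}, {0, 2}, {0, -2}       // bend springs (assumed set)
    };

    int row = i / n, col = i % n;
    for (int s = 0; s < 12; ++s)
    {
        int r = row + offsets[s][1];
        int c = col + offsets[s][0];
        if (r < 0 || r >= n || c < 0 || c >= n) continue;
        AccumulateSpringForce(curPos, i, r * n + c, vel, force);
    }

    // Steps 6-8: Verlet integration and write-out of the new position.
    for (int c = 0; c < 3; ++c)
        outPos[i * 4 + c] = 2.0f * pos[c] - old[c] + (force[c] / mass) * dt * dt;
}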
Figure 22.4. The ping-pong technique on the GPU. The output of a simulation step be-
comes the input of the following step. The current output buffer is mapped to a VBO for
fast visualization.
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo->fb);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT,
GL_COLOR_ATTACHMENT0_EXT, GL_TEXTURE_2D, texid[0], 0);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT,
GL_COLOR_ATTACHMENT1_EXT, GL_TEXTURE_2D, texid[1], 0);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT,
GL_COLOR_ATTACHMENT2_EXT, GL_TEXTURE_2D, texid[2], 0);
In the initialization phase, both of the FBOs holding the initial state of the
particles are uploaded to video memory. When the algorithm is run, one of the
FBOs is used as input and the other one as output. The fragment shader reads
the data from the input FBO and writes the results in the render targets of the
output FBO (stored in the color buffers). We declare the output render targets by
using the following code, where fb_out is the FBO that stores the output:
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fb_out);
GLenum mrt[] = {GL_COLOR_ATTACHMENT0_EXT,
GL_COLOR_ATTACHMENT1_EXT, GL_COLOR_ATTACHMENT2_EXT};
glDrawBuffers(3, mrt);
In the next simulation step, the pointers to the input and output FBOs are
swapped so that the algorithm uses the output of the previous iteration as the cur-
rent input.
The two FBOs are stored in the video memory, so there is no need to upload
data from the CPU to the GPU during the simulation. This drastically reduces the
amount of data bandwidth required on the PCI-express bus, improving the per-
formance. At the end of each simulation step, however, position and normal data
is read out to a pixel buffer object that is then used as a VBO for drawing pur-
poses. The position data is stored into the VBO directly on the GPU using the
following code:
glReadBuffer(GL_COLOR_ATTACHMENT0_EXT);
glBindBuffer(GL_PIXEL_PACK_BUFFER, vbo[POSITION_OBJECT]);
glReadPixels(0, 0, texture_size, texture_size,
GL_RGBA, GL_FLOAT, 0);
First, the color buffer of the FBO where the output positions are stored is select-
ed. Then, the positions’ VBO is selected, specifying that it will be used as a pixel
buffer object. Finally, the VBO is filled with the updated data directly on the
GPU. Similar steps are taken to read the normals’ data buffer.
It is important to note that this instruction does not cause a buffer upload from
the CPU to the GPU because the buffer is already stored in video memory. The
In the initialization phase, we declare that we are sharing data in video memory
with OpenGL VBOs through CUDA graphical resources. Then, during the exe-
cution of the algorithm kernel, we map the graphical resources to buffer pointers.
The kernel computes the results and writes them in the buffer. At this point, the
graphical resources are unmapped, allowing the VBOs to be used for drawing.
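In code, the map/unmap sequence looks roughly like the following; the resource is assumed to have been registered at initialization with cudaGraphicsGLRegisterBuffer(), and the kernel launch itself is elided.

#include <cuda_gl_interop.h>
#include <cuda_runtime.h>

void RunSimulationStep(cudaGraphicsResource_t positionResource)
{
    float  *devPositions = 0;
    size_t  numBytes = 0;

    // Map the shared VBO so the kernel can write the new positions directly.
    cudaGraphicsMapResources(1, &positionResource, 0);
    cudaGraphicsResourceGetMappedPointer((void **) &devPositions, &numBytes,
                                         positionResource);

    // ... launch the cloth simulation kernel on devPositions here ...

    // Unmap the resource so the VBO can be used by OpenGL for drawing.
    cudaGraphicsUnmapResources(1, &positionResource, 0);
}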
Now suppose we pass input data of some different type (e.g., a float) in the fol-
lowing way:
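The offending call would be something along these lines; the kernel handle, the argument index, and the parameter name are made up for this example, and the kernel itself declares the parameter as an int.

// clothKernel was created earlier with clCreateKernel(); argument 3 is
// declared as an int in the kernel source. The sizes match, so no error is
// reported, but the kernel silently misinterprets the bits of the float.
float springStiffness = 0.5f;
clSetKernelArg(clothKernel, 3, sizeof(float), &springStiffness);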
When executed, the program will fail silently without giving any error message
because it expects an int instead of a float. This made the OpenCL implemen-
tation rather complicated to develop.
22.9 Results
The described method has been implemented and tested on two different ma-
chines:
We collected performance times for each GPU computing platform, varying the
numbers of particles and springs, from a grid resolution of 32 × 32 (1024 particles
and 11,412 springs) to 256 × 256 (65,536 particles and approximately 700,000
springs). Numerical results are collected in the plots in Figures 22.5 and 22.6.
From the data plotted in Figures 22.5 and 22.6, the computing superiority of
the GPU compared with the CPU is evident. This is mainly due to the fact that
this cloth simulation algorithm is strongly parallelizable, like most of the particle-
based approaches. While the computational cost on the CPU keeps growing line-
arly with the number of particles, the computation time on the GPU remains rela-
tively low because the particle dynamics are computed in parallel. On the
GTS250 device, this leads to a performance gain ranging from 10 to 40 times,
depending on the number of particles.
It is interesting to note that in this case, GLSL has a much better performance
than CUDA does. This can be explained by considering how the memory is ac-
cessed by the GPU kernels. In the GLSL fragment program, images are em-
ployed to store particle data in texture memory, while in CUDA and OpenCL,
this data is stored in the global memory of the device. Texture memory has two
main advantages [Nvidia 2010]. First, it is cached, and thus, video memory is
accessed only if there is a cache miss. Second, it is built in such a way as to op-
timize the access to 2D local data, which is the case because each particle corre-
sponds to a pixel, and it must have access to the positions of its neighbors, which
are stored in the immediately adjacent texture pixels. Furthermore, the results in
GLSL are stored in the color render targets that are then directly mapped to
VBOs and drawn on the screen. The data resides in video memory and does not
need to be copied between different memory areas. This makes the entire process
extremely fast compared with the other approaches.
The plots also highlight the lower performance of OpenCL compared with
CUDA. This difference is caused by the fact that it has been rather difficult to
tune the number of global and local work items due to causes requiring further
Figures 22.5 and 22.6. Computation time in milliseconds for the CPU, GLSL, OpenCL (OCL), and CUDA implementations at increasing grid resolutions.
investigation. OpenCL is a very young standard, and both the specification and
the driver implementation are likely to change in the near future in order to avoid
such instabilities.
The GLSL program works on relatively old hardware, and different from
CUDA, it does not require Nvidia hardware. CUDA, on the other hand, is a more
flexible architecture that has been specifically devised for performing computing
tasks (not only graphics, like GLSL), which is easier to debug and provides ac-
cess to hardware resources, like the shared memory, allowing for a further boost
to the performance. OpenCL has the same features as CUDA, but its implementa-
tion is rather naive at the moment, and it is harder to debug. However, different
from CUDA, it has been devised to run on the widest range of hardware plat-
forms (including consoles and mobile phones), not limited to Nvidia ones, and
thus, it is the main candidate for becoming the reference platform for GPGPU in
the near future.
The main effort when dealing with GPGPU is in the design of the algorithm.
The challenging task that researchers and developers are currently facing is how
to redesign algorithms that have been originally conceived to run in a serial man-
ner for the CPU, to make them parallel and thus suitable for the GPU. The main
disadvantage of particle-based methods is that they require a very large number
of particles to obtain realistic results. However, it is relatively easy to parallelize
algorithms handling particle systems, and the massive parallel computation capa-
bilities of modern GPUs now makes it possible to simulate large systems at inter-
active rates.
22.11 Demo
An implementation of the GPU cloth simulation is provided on the website, and
it includes both the source code in C++ and the Windows binaries. The demo
allows you to switch among the computing platforms at run time, and it includes
a hierarchical profiler. Even though the source code has been developed for Win-
dows using Visual Studio 2008, it has been written with cross-platform compati-
bility in mind, without using any Windows-specific commands, so it should
compile and run on *nix platforms (Mac and Linux). The demo requires a ma-
chine capable of running Nvidia CUDA, and the CUDA Computing SDK 3.0
needs to have been compiled. A video is also included on the website.
Acknowledgements
The shader used for rendering the cloth is “fabric plaid” from RenderMonkey 1.82 by
AMD and 3DLabs. The author is grateful to Professor Ingemar Ragnemalm for having
introduced him to the fascinating world of GPGPU.
References
[Müller 2008] Matthias Müller, Jos Stam, Doug James, and Nils Thürey. “Real Time
Physics.” ACM SIGGRAPH 2008 Course Notes. Available at http://www.
matthiasmueller.info/realtimephysics/index.html.
[Nvidia 2010] “NVIDIA CUDA Best Practices Guide,” Version 3.0, 2010. Available at
http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_
CUDA_BestPracticesGuide.pdf.
[Tejada 2005] Eduardo Tejada and Thomas Ertl. “Large Steps in GPU-Based Deforma-
ble Bodies Simulation.” Simulation Modelling Practice and Theory 13:8 (Novem-
ber 2005), pp. 703–715.
23
A Jitter‐Tolerant Rigid Body
Sleep Condition
Eric Lengyel
Terathon Software
23.1 Introduction
One of the primary optimizations employed by any physics engine is the ability
to put a rigid body to sleep when it has reached a resting state. A sleeping rigid
body is not processed by the physics engine until some kind of event occurs, such
as a collision or broken contact, necessitating that it is woken up and simulated
once again. The fact that most rigid bodies are in the sleeping state at any one
time is what allows a large number of simulated objects to exist in a game level
without serious performance problems.
A problem faced by all physics engines is how to decide that a rigid body has
actually come to rest. If we put an object to sleep without being sure that it has
come to rest, then we risk accidentally freezing it in mid-simulation, which can
look odd to the player. On the other hand, if we are too conservative and wait for
too strict of a condition to be met before we put an object to sleep, then we can
run into a situation where too many objects are being simulated unnecessarily
and performance suffers.
The sleep decision problem is complicated by the fact that all physics en-
gines exhibit some jitter no matter how good the constraint solver is. If the right
sleep condition isn’t chosen in the design of a physics engine, then jitter can pre-
vent an object from ever going to sleep. Or at the very least, jitter can delay an
object from entering the sleep state, and this generally causes a higher number of
objects to be simulated at any given time. This chapter discusses a simple condi-
tion that can be used to determine when it is the proper time to put a rigid body to
sleep, and it is highly tolerant to jitter.
Figure 23.1. Bounding boxes are maintained for the center of mass C and the two points
C + x and C + y, where x and y are the first two columns of the matrix transforming the
rigid body into world space.
The number of steps n and the bounding box size threshold t are parameters
that can be adjusted in the physics engine. Typically, if the three test points re-
main inside small enough bounding boxes for one second, then it’s safe to put a
rigid body to sleep. The size threshold for the two points away from the center of
mass can be different from that used for the center of mass in order to impose
angular velocity limits.
Whenever the sleep test fails because one of the bounding boxes has grown
too large during a particular simulation step, the whole test should be reset. That
is, the current values of C, C + x, and C + y should be used to reinitialize the
bounding boxes, and the resting step count should be restarted at one. For a rigid
body that is actively moving in some way, this reset occurs on every simulation
step.
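A sketch of the complete test follows; the vector and box types, the thresholds, and the function boundaries are placeholders for whatever the physics engine already provides.

struct Point3 { float x, y, z; };

// Axis-aligned box that grows to enclose every position a tracked point
// has occupied since the test was last reset.
struct TrackingBox
{
    Point3 minCorner, maxCorner;

    void Reset(const Point3& p)
    {
        minCorner = maxCorner = p;
    }

    void Grow(const Point3& p)
    {
        if (p.x < minCorner.x) minCorner.x = p.x;
        if (p.y < minCorner.y) minCorner.y = p.y;
        if (p.z < minCorner.z) minCorner.z = p.z;
        if (p.x > maxCorner.x) maxCorner.x = p.x;
        if (p.y > maxCorner.y) maxCorner.y = p.y;
        if (p.z > maxCorner.z) maxCorner.z = p.z;
    }

    float MaxExtent() const
    {
        float dx = maxCorner.x - minCorner.x;
        float dy = maxCorner.y - minCorner.y;
        float dz = maxCorner.z - minCorner.z;
        float m = (dx > dy) ? dx : dy;
        return ((m > dz) ? m : dz);
    }
};

// Called once per simulation step with the current positions of C, C + x, and
// C + y. Returns true when the rigid body may be put to sleep.
bool UpdateSleepTest(TrackingBox box[3], int& restingSteps, const Point3 point[3],
                     float centerThreshold, float axisThreshold, int sleepStepCount)
{
    for (int i = 0; i < 3; ++i) box[i].Grow(point[i]);

    bool withinLimits = (box[0].MaxExtent() < centerThreshold)
                     && (box[1].MaxExtent() < axisThreshold)
                     && (box[2].MaxExtent() < axisThreshold);

    if (!withinLimits)
    {
        // One of the boxes grew too large, so restart the whole test.
        for (int i = 0; i < 3; ++i) box[i].Reset(point[i]);
        restingSteps = 1;
        return false;
    }

    return (++restingSteps >= sleepStepCount);
}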
Part III
Systems Programming
24
Bit Hacks for Games
Eric Lengyel
Terathon Software
Game programmers have long been known for coming up with clever tricks that
allow various short calculations to be performed more efficiently. These tricks
are often applied inside tight loops of code, where even a tiny savings in CPU
clock cycles can add up to a significant boost in speed overall. The techniques
usually employ some kind of logical bit manipulation, or “bit twiddling,” to ob-
tain a result in a roundabout way with the goal of reducing the number of instruc-
tions, eliminating expensive instructions like divisions, or removing costly
branches. This chapter describes a variety of interesting bit hacks that are likely
to be applicable to game engine codebases.
Many of the techniques we describe require knowledge of the number of bits
used to represent an integer value. The most efficient implementations of these
techniques typically operate on integers whose size is equal to the native register
width of the CPU running the code. This chapter is written for integer registers
that are 32 bits wide, and that is the size assumed for the int type in C/C++. All
of the techniques can be adapted to CPUs having different native register widths
(most commonly, 64 bits) by simply changing occurrences of the constants 31
and 32 in the code listings and using the appropriately sized data type.
to the C++ standard1, but all compilers likely to be encountered in game devel-
opment give the expected outcome—if the bits are shifted to the right by n bit
positions, then the value of the highest bit of the input is replicated to the highest
n bits of the result.
Absolute Value
Most CPU instruction sets do not include an integer absolute value operation.
The most straightforward way of calculating an absolute value involves compar-
ing the input to zero and branching around a single instruction that negates the
input if it happens to be less than zero. These kinds of code sequences execute
with poor performance because the branch prevents good instruction scheduling
and pollutes the branch history table used by the hardware for dynamic branch
prediction.
A better solution makes clever use of the relationship

-x = ~x + 1,   (24.1)

where the unary operator ~ represents the bitwise NOT operation that inverts
each bit in x. In addition to using the ~ operator in C/C++, the NOT operation can
be performed by taking the exclusive OR between an input value and a value
whose bits are all 1s, which is the representation of the integer value -1. So we
can rewrite Equation (24.1) in the form

-x = (x ^ -1) - (-1),   (24.2)

where the reason for subtracting -1 at the end will become clear in a moment.
If we shift a signed integer right by 31 bits, then the result is a value that is
all ones for any negative integer and all zeros for everything else. Let m be the
value of x shifted right by 31 bits. Then a formula for the absolute value of x is
given by

|x| = (x ^ m) - m.   (24.3)
1
See Section 5.8, Paragraph 3 of the C++ standard.
Listing 24.1. These functions calculate the absolute value and negative absolute value.
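A minimal sketch in the spirit of this listing, following Equation (24.3) and assuming a 32-bit int with an arithmetic right shift, is:

inline int Abs(int x)
{
    int m = x >> 31;            // m is -1 if x is negative, 0 otherwise
    return ((x ^ m) - m);       // Equation (24.3)
}

inline int Nabs(int x)
{
    int m = x >> 31;
    return (m - (x ^ m));       // always returns -|x|
}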
Note that the absolute value function breaks down if the input value is
0x80000000. This is technically considered a negative number because the most
significant bit is a one, but there is no way to represent its negation in 32 bits.
The value 0x80000000 often behaves as though it were the opposite of zero, and
like zero, is neither positive nor negative. Several other bit hacks discussed in
this chapter also fail for this particular value, but in practice, no problems typical-
ly arise as a result.
Sign Function
The sign function sgn x is defined as
1, if x 0;
sgn x 0, if x 0; (24.4)
1, if x 0.
This function can be calculated efficiently without branching by realizing that for
a nonzero input x, either x >> 31 is all ones or -x >> 31 is all ones, but not
both. If x is zero, then both shifts also produce the value zero. This leads us to the
code shown in Listing 24.2, which requires four instructions. Note that on Pow-
erPC processors, the sign function can be evaluated in three instructions, but that
sequence makes use of the carry bit, which is inaccessible in C/C++. (See [Hoxey
et al. 1996] for details.)
Listing 24.2. This function calculates the sign function given by Equation (24.4).
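A sketch consistent with the description above, again assuming a 32-bit int and an arithmetic right shift for signed values, is:

inline int Sgn(int x)
{
    // (x >> 31) is -1 for negative x and 0 otherwise; ((unsigned) -x >> 31)
    // is 1 for positive x and 0 otherwise. At most one of them is nonzero.
    return ((x >> 31) | ((unsigned) -x >> 31));
}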
Sign Extension
Processors typically have native instructions that can extend the sign of an 8-bit
or 16-bit integer quantity to the full width of a register. For quantities of other bit
sizes, a sign extension can be achieved with two shift instructions, as shown in
Listing 24.3. An n-bit integer is first shifted left by 32 - n bits so that the value
occupies the n most significant bits of a register. (Note that this destroys the bits
of the original value that are shifted out, so the state of those bits can be ignored
in cases when an n-bit quantity is being extracted from a larger data word.) The
result is shifted right by the same number of bits, causing the sign bit to be
smeared.
On some PowerPC processors, it’s important that the value of n in Listing
24.3 be a compile-time constant because shifts by register values are microcoded
and cause a pipeline flush.
Listing 24.3. This function extends the sign of an n-bit integer to a full 32 bits.
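A sketch matching that description, with n ideally a compile-time constant as noted above, is:

inline int ExtendSign(int x, int n)
{
    // Shift the n-bit value up into the high bits, then arithmetic-shift it
    // back down so the sign bit is smeared across the upper 32 - n bits.
    return ((x << (32 - n)) >> (32 - n));
}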
24.2 Predicates
We have seen that the expression x >> 31 can be used to produce a value of 0 or
-1 depending on whether x is less than zero. There may also be times when we
want to produce a value of 0 or +1, and we might also want to produce these val-
ues based on different conditions. In general, there are six comparisons that we
can make against zero, and an expression generating a value based on these com-
parisons is called a predicate.
Table 24.1 lists the six predicates and the branchless C/C++ code that can be
used to generate a 0 or 1 value based on the boolean result of each comparison.
Table 24.2 lists negations of the same predicates and the code that can be used to
generate a mask of all 0s or all 1s (or a value of 0 or -1) based on the result of
each comparison. The only difference between the code shown in Tables 24.1
and 24.2 is that the code in the first table uses unsigned shifts (a.k.a. logical
shifts), and the second table uses signed shifts (a.k.a. arithmetic or algebraic
shifts).
Table 24.1. For each predicate, the code generates the value 1 if the condition is true and
generates the value 0 if the condition is false. The type of a and x is signed integer.
Table 24.2. For each predicate, the code generates the value -1 if the condition is true
and generates the value 0 if the condition is false. The type of a and x is signed integer.
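As a flavor of such predicates, the following sketches (illustrative only, assuming a 32-bit int and an arithmetic right shift for signed values) generate a few of these values:

inline int IsNegative(int a)   { return ((unsigned) a >> 31); }          // a < 0 ? 1 : 0
inline int IsNonZero(int a)    { return ((unsigned) (a | -a) >> 31); }   // a != 0 ? 1 : 0
inline int IsZero(int a)       { return ((unsigned) ~(a | -a) >> 31); }  // a == 0 ? 1 : 0

inline int NegativeMask(int a) { return (a >> 31); }                     // a < 0 ? -1 : 0
inline int NonZeroMask(int a)  { return ((a | -a) >> 31); }              // a != 0 ? -1 : 0
inline int ZeroMask(int a)     { return (~(a | -a) >> 31); }             // a == 0 ? -1 : 0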
Listing 24.4. These functions increment and decrement the input value modulo n.
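One common branchless formulation of these operations, offered as a sketch under the assumption that 0 <= x < n and that signed right shifts are arithmetic, is:

inline int IncMod(int x, int n)     // (x + 1) % n for 0 <= x < n
{
    return ((x + 1) & ((x - (n - 1)) >> 31));   // mask is -1 unless x == n - 1
}

inline int DecMod(int x, int n)     // (x - 1 + n) % n for 0 <= x < n
{
    x--;
    return (x + ((x >> 31) & n));               // adds n only when x went negative
}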
Clamping to Zero
Another use of masks is clamping against zero. The minimum and maximum
functions shown in Listing 24.5 take a single input and clamp to a minimum of
zero or a maximum of zero. On processors that have a logical AND with com-
plement instruction, like the PowerPC, both of these functions generate only two
instructions.
Listing 24.5. These functions take the minimum and maximum of the input with zero.
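A sketch matching that description (given an arithmetic right shift) is:

inline int MinZero(int x)
{
    return (x & (x >> 31));     // x if x < 0, otherwise 0
}

inline int MaxZero(int x)
{
    return (x & ~(x >> 31));    // x if x > 0, otherwise 0
}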
Listing 24.6. These functions return the minimum and maximum of a pair of integers when we
can assume two bits of sign.
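The usual difference-sign trick alluded to here, valid only when x - y cannot overflow, can be sketched as follows (names are illustrative):

inline int MinNoOverflow(int x, int y)
{
    int d = x - y;
    return (y + (d & (d >> 31)));   // y + (d if d < 0, else 0)
}

inline int MaxNoOverflow(int x, int y)
{
    int d = x - y;
    return (x - (d & (d >> 31)));   // x - (d if d < 0, else 0)
}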
24.3 Miscellaneous Tricks
Listing 24.8. The case code for a cell is constructed by shifting the sign bits from the eight corner
voxel values into specific bit positions. One of the voxel values is shifted seven bits right to
produce a mask of all 0s or all 1s, and it is then exclusive ORed with the case code to determine
whether a cell contains triangles.
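A hedged reconstruction of that idea, assuming corner[] holds the eight cell corners as signed 8-bit voxel samples, might be:

bool CellHasTriangles(const signed char corner[8], unsigned long *caseCodeOut)
{
    // Shift each corner's sign bit into its own position of the case code.
    unsigned long caseCode = ((corner[0] >> 7) & 0x01)
                           | ((corner[1] >> 6) & 0x02)
                           | ((corner[2] >> 5) & 0x04)
                           | ((corner[3] >> 4) & 0x08)
                           | ((corner[4] >> 3) & 0x10)
                           | ((corner[5] >> 2) & 0x20)
                           | ((corner[6] >> 1) & 0x40)
                           | (corner[7] & 0x80);

    *caseCodeOut = caseCode;

    // Shifting one corner value right by seven bits yields all 0s or all 1s;
    // the exclusive OR is nonzero exactly when the corner signs are not all
    // the same, i.e., when the cell contains triangles.
    return ((caseCode ^ ((corner[7] >> 7) & 0xFF)) != 0);
}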
Table 24.3. This table lists all possible combinations for the truth values b0, b1, and b2 relating the
values v0, v1, and v2. The sum of (b0 | b1) and (b1 & b2) gives the index of the largest value.
Listing 24.9. This function returns the index, in the range [0, 2], corresponding to the largest value
in a set of three.
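A sketch reconstructed from the table caption, using the illustrative comparisons b0 = (v1 > v0), b1 = (v2 > v0), and b2 = (v2 > v1), is:

int IndexOfLargest(int v0, int v1, int v2)
{
    int b0 = (v1 > v0);
    int b1 = (v2 > v0);
    int b2 = (v2 > v1);
    return ((b0 | b1) + (b1 & b2));     // 0, 1, or 2
}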
x & (~x - 1) Create mask for bits right of lowest 0 bit, exclusive. 1 remains 1.
Table 24.4. Logic formulas and their effect on the binary representation of a signed integer.
References
[Anderson 2005] Sean Eron Anderson. “Bit Twiddling Hacks.” 2005. Available at
http://graphics.stanford.edu/~seander/bithacks.html.
[Ericson 2008] Christer Ericson. “Advanced Bit Manipulation-fu.” realtimecollisiondetection.net –
the blog, August 24, 2008. Available at http://realtimecollisiondetection.net/blog/?p=78.
[Hoxey et al. 1996] Steve Hoxey, Faraydon Karim, Bill Hay, and Hank Warren, eds. The
PowerPC Compiler Writer’s Guide. Palo Alto, CA: Warthman Associates, 1996.
25
Introspection for C++ Game Engines
Jon Watte
IMVU, Inc.
25.1 Introduction
This gem describes a mechanism for adding general-purpose introspection of
data types to a C++ program, using a minimum of macros, mark-up, or repetition.
Introspection is the capability of a computer program to look at its own data and
make modifications to it. Any high-level language of note has a rich introspec-
tion API, generally as part of an even larger reflection API, but users of C++
have had to make do with the built-in class type_info ever since the early days.
A conversation with the introspection system of a language such as C#, Java, or
Python might look something like this:
“Hello, introspection system, I’d like you to tell me a bit about this here
piece of data I have!”
“Why, certainly, game program! It’s a data structure you’d like to call a
TreasureChest.”
“That’s useful to know, but what I really want to know is whether it has a
property called ‘Position’?”
“Yes, it does! It’s of type float3, containing the position in the world.”
“That’s great! Actually, now that I think about it, the designer just
clicked on it, and wants to edit all the properties. What are they?”
“Well, we’ve got ‘Name’, which is a string, and ‘Contents’, which is a
list of references to Object templates, and ‘Model’, which is a string
used to reference the 3D mesh used to render the object, and …”
In code, it looks something more like Listing 25.1 (using C#/.NET).
So, what can we do in C++? Using only the standard language and library,
we can’t do much. You can find out the name of a type (using typeid), and the
size of a type (using sizeof), and that’s about it. You can also find out whether a
given object instance derives from a given base class (using the horribly expen-
sive dynamic_cast<>), but only if the type has a virtual table, and you can’t iter-
ate over the set of base classes that it derives from, except perhaps by testing
against all possible base classes. Despite these draw-backs, C++ game program-
mers still need to deal with data, display it in editors, save it to files, send it over
networks, wire compatible properties together in object/component systems, and
do all the other data-driven game development magic that a modern game engine
must provide.
Trying to solve this problem, many engines end up with a number of differ-
ent mechanisms to describe the data in a given object, entity, or data structure.
For example, you may end up having to write code that’s something like what is
shown in Listing 25.2.
/* properties */
string name;
float3 position;
list<ObjectRef> contents;
string model;
public:
/* saving */
void WriteOut(Archive &ar)
{
ar.beginObject("TreasureChest");
GameObject::WriteOut(ar);
ar.write(name);
ar.write(position);
ar.write(contents.size());
ar.write(model);
ar.endObject();
}
/* loading */
void ReadIn(Archive &ar)
{
ar.beginObject("TreasureChest");
GameObject::ReadIn(ar);
ar.read(name);
ar.read(position);
size_t size;
ar.read(size);
ar.read(model);
ar.endObject();
}
#if EDITOR
/* editing */
void AddToEditor(Editor &ed)
{
ed.beginGroup("TreasureChest");
GameObject::AddToEditor(ed);
ed.addString(name, "name", "The name of the object");
ed.addPosition(position, "position",
"Where the object is in the world");
ed.addObjectRefCollection(contents, "contents",
"What's in this chest");
ed.addFilename(model, "model",
"What the chest looks like", "*.mdl");
ed.endGroup();
}
#endif
}
This approach has several problems, however. For example, each addition of
a new property requires adding it in several parts of the code. You have to re-
member the order in which properties are serialized, and you generally have to
write even more support code to support cloning, templating, and other common
operations used by most game engines and editors. More than one game has
shipped with bugs caused by getting one of these many manual details wrong. If
you apply the technique in this gem, you will be able to avoid this whole class of
bugs—as well as avoid a lot of boring, error-prone typing. Next to having the
language run-time library do it for you, this is the best you can get!
■ A test program, which exercises the API mechanisms one at a time to let you
easily verify the functionality of the different pieces.
■ A server program, which lets you host a simple chat server (from the com-
mand line).
■ A client program, which lets you connect to a running chat server and ex-
change chat messages with other users (again, from the command line).
■ An editor program, which lets you edit the list of users used by the server
program, using the introspection mechanisms outlined in this chapter.
These are baked into two executables: the introspection test program, to verify
that the API works as intended, and the simplechat program, to act as a client,
server, or user list editor for the simple chat system.
To build the programs, either use the included solution and project files for
Microsoft Visual Studio 2010 (tested on Windows 7) or use the included GNU
make file for GCC (tested on Ubuntu Linux 10.04). Run the sample programs
from the command line.
#include <introspection/introspection.h>
struct UserInfo
{
std::string name;
std::string email;
std::string password;
int shoe_size;
INTROSPECTION
(
UserInfo,
MEMBER(name, "user name")
MEMBER(email, "e-mail address")
MEMBER(password, "user password")
MEMBER(shoe_size,
introspection::int_range("shoe size (European)", 30, 50))
);
};
settings (or the makefile) to reference the parent of the introspection directory.
Note that include files in the samples are included using angle brackets, naming
the introspection directory <introspection/introspection.h>.
Given the above declaration, you can now do a number of interesting things
to any instance of struct UserInfo. Most importantly, you can get a list of the
members of the structure, as well as their type and offset within the structure,
programmatically. The function test_introspection() in the sample
main.cpp file shows how to do this and is illustrated in Listing 25.4.
void test_introspection()
{
std::stringstream ss;
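// ptr iterates over the member descriptions of the introspected type;
// each member is described by a member_t instance (see below).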
if ((*ptr).access().compound())
{
ss << " [compound]";
}
if ((*ptr).access().collection())
{
ss << " {collection}";
}
ss << std::endl;
std::cout << ss.str();
ss.str("");
}
}
The name of the type that contains information about each compound type is
type_info_base. It contains standard-template-library-style iterator accessors
begin() and end() to iterate over the members of the type, each of which is de-
scribed using a member_t instance. Additionally, type_info_base contains an
access() accessor for the member_access_base type, which implements opera-
tions on the compound type itself, such as serializing it to and from a binary
stream, converting it to and from a text representation, and creating and destroy-
ing instances in raw memory. Each type (including the basic types like int and
float) has a corresponding member_access_base, so this structure tells you
whether a type is compound (such as struct UserInfo) and whether it is a col-
lection (such as a list or a vector).
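As a purely illustrative sketch of the shape these types take (not the actual declarations from the accompanying library), one might picture something like:

struct member_access_base
{
    virtual bool compound() const = 0;      // struct-like type?
    virtual bool collection() const = 0;    // list- or vector-like type?
    // ...plus binary/text serialization and construct/destroy in raw memory
};

struct member_t
{
    const char *name;               // member name
    unsigned int offset;            // offset within the containing type
    member_access_base *access;     // operations for the member's own type
};

struct type_info_base
{
    const member_t *begin() const;          // first member descriptor
    const member_t *end() const;            // one past the last descriptor
    member_access_base &access() const;     // operations on the compound type
};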
return (info); \
}
This keeps the namespace of the introspected type (such as UserInfo) clean,
only introducing the typedef self_t and the member_info() member function.
Access to each member of the structure is done using pointer-to-member syntax.
However, the actual pointer to member does not need to be dereferenced because
template metaprogramming can pick it apart and turn it into an offset-and-
typecast construct, to save run-time CPU cycles.
Using the size() and create() functions, it inflates an instance of the appropri-
ate structure into an allocated array of bytes and constructs it using the appropri-
ate C++ constructor, after which it passes the instance through a dispatcher
selected by the integer code (type) of the packet. When the dispatcher returns, the
destroy() function calls the appropriate C++ destructor, and the memory can be
returned to the system as an array of bytes again.
This is illustrated in more detail in the client and server programs, where
it is implemented in the protocol_t::encode() and protocol_t::decode()
functions, respectively, using the memory stream functions of the
simple_stream class. A good example use of this mechanism is the
send_a_message() function from the client.cpp file, as shown in Listing 25.6.
ssp.message = line;
25.6 In Closing
Finally, I have set up a forum for discussion about this gem, as well as any errata
and code update releases, that you can find at my website:
http://www.enchantedage.com/geg2-introspection. I hope you find that this gem saves you a
lot of error-prone typing, and I hope to hear your experiences and feedback in the
forum!
26
A Highly Optimized Portable
Memory Manager
Jason Hughes
Steel Penny Games, Inc.
26.1 Introduction
Every game has a memory manager of some sort. On PCs, this tends to be Mi-
crosoft’s C run-time library memory manager. On the various consoles, it’s likely
to either be the platform-specific memory manager that was written by the hard-
ware vendor or the one provided by the compiler company in its run-time library.
Do they all work the same way? No. Do they exhibit the same performance char-
acteristics? Absolutely not. Some allow the heap to become fragmented very
quickly, while others may be very slow for small allocations or when the number
of allocations becomes quite large, and still others may have a high per-allocation
overhead that invisibly eats away at your memory.
Memory allocation is a fundamental operation; thus, it has to satisfy a wide
number of use cases robustly and efficiently. This is a serious technical chal-
lenge. Even a good implementation can harm a game’s performance if exercised
in just the wrong way. A naive implementation can utterly cripple performance
or cause crashes due to artificial low-memory situations (e.g., fragmentation or
overhead). The good news is that most of the provided memory managers are
relatively efficient and work well enough for simple cases with few allocations.
After enough experiences where cross-platform stability and performance
came down strictly to the memory manager, however, you may be tempted to
implement your own that is scalable and easy to tune for best performance. These
days, with so many platforms to support, it’s a mark of quality for an engine to
run well across all machines.
Desirable Properties
After writing about eight or nine different memory managers, a list of priorities
emerges from recognizing the importance of certain attributes that make a
memory manager good. Here are a few:
1. Must not thrash the CPU cache. High performance comes with limiting the
number of bytes that have to interact with RAM. Whatever is touched should
fit within a minimum number of cache lines, both to reduce memory latency
and to limit the amount of cache disruption to the rest of the program.
2. No searching for free space. The naive implementation of a memory manag-
er is a linked list of blocks that are marked free or empty. Scanning this list is
hugely expensive and slow under all circumstances. Good memory managers
have lookup tables of some sort that point to places where free blocks of
memory of various sizes are.
3. Minimum overhead per allocation. Reducing overhead in a memory manager
is almost like adding compression to memory—you can fit more data in the
same physical space.
4. Should be easy to debug. Sometimes memory problems happen, and it’s im-
portant to consider the difficulty in tracking down such issues. Sometimes
this means temporarily adding features to the memory manager. Ideally, the
debug build should do some basic instrumentation as it runs that determines
whether memory has been trampled without slowing down the system.
5. Should resist corruption. Most memory managers are organized such that
blocks of program data are sandwiched between heap tracking information.
26.2 Overview
The following design is one such system that satisfies the minimum criteria, as
specified in the introduction, though there are many others possible. There are
certain trade-offs that are often made during large design processes that ultimate-
ly shape the final software product. Although we describe a specific instance of
the memory manager, the version found on the website is actually a template-
based solution that allows reconfiguration of many parameters so you can easily
experiment with what is right for your specific situation. In fact, it would be easy
to capture all allocations and deletions in a log file with your current memory
manager, then replay the operations in a unit test setting to measure exactly what
performance and memory fragmentation would be under realistic heap
conditions.
Figure 26.1. From left to right: the heap is free and unfragmented, one large allocation is
made, one small allocation is made, the large allocation is freed. In the rightmost dia-
gram, the heap is partitioned in half and exhibits an approximately 0.5 fragmentation
ratio.
to be considerably more limited, and players are more likely to play the game
without resetting for hours or even days on end. Thus, the dreaded “uptime
crash” is almost unavoidable without special precautions, careful planning, and
perhaps a custom memory management strategy that attempts to avoid fragmen-
tation automatically.
For further clarification of how fragmentation can occur within a program,
see Figure 26.1. Fragmentation can be demonstrated in three simple operations:
allocate twice and free once. If one allocation is quite large and is followed by a
small allocation, then once you release the large block back to the system, the
memory manager now has an approximate 0.5 fragmentation ratio. If the next
allocation is slightly larger than 50 percent of the total available memory, it will
fail.
Most decent memory managers have allocation strategies that react different-
ly based on the size of the allocation. Often, there is a specific handler for tiny
allocations (under 256 bytes, for instance) as well as a general allocation method
for everything larger. The idea behind this is primarily to reduce overhead in al-
locating very small blocks of memory that tend to represent the lion’s share of
allocations in most C/C++ programs. One happy consequence is that it prevents
that group of allocations from possibly splitting the big block in half and causing
fragmentation by preventing them from coming from the same memory
pool.
By extending this understanding to the design of a memory manager, it is
possible to reduce external fragmentation, at some small expense of internal
fragmentation. To illuminate by exaggeration, if you round up all of your alloca-
tions to the largest size you’ll ever allocate, then no external fragmentation is
important because no allocation will ever fail due to a fragmented heap, and any
allocation that gets freed can always fit any other allocation that may come in the
future. Of course, all of your memory would be exhausted before that could hap-
pen! Preposterous as it is, scaling back the idea and applying it to smaller alloca-
tions, where rounding up the size is less significant, makes the concept viable.
The trick is to make it fast and limit the amount of internal fragmentation (i.e.,
wasted space) the strategy produces.
Our approach is to make a very fast small block allocator (SBA) for alloca-
tions under 256 bytes, a reasonably fast medium block allocator (MBA) for allo-
cations that are larger than 256 bytes but smaller than a large allocation, and a
large block allocator (LBA) for any allocation of at least 4096 bytes. As men-
tioned above, the code on the website is templatized, so these numbers can be
modified trivially for your purposes, and when set to equal sizes, you can com-
pletely remove the SBA, the MBA, or both.
Paged Allocation
It is fine to suggest different allocation strategies for different-sized memory re-
quests, but that memory has to come from somewhere. The simplest method is to
preallocate a fixed number of blocks for each allocation size and hope you don’t
exceed that limit. Having done this on shipped titles, we honestly don’t recom-
mend it. However, it has one benefit in that you can determine the size of an allo-
cation based on the address alone (because it is either in the small, medium, or
large region of memory). Still, it is far better to have a flexible solution that
doesn’t require tweaking for every project having slightly different memory allo-
cation characteristics.
The solution is to preallocate in large batches, or pages, and string these pag-
es together to serve as small- and medium-sized memory heaps. Ideally, these are
quite large, granular pages of equal sizes for both SBA and MBA pages, or MBA
pages are at least an even multiple of the page size used for SBA. This is prefera-
ble, as mentioned above, due to the inherent resistance to fragmentation when
allocations are all the same size. By making page sizes large, there are fewer of
them allocated, leading to fewer chances for failure and less overall CPU time
spent managing pages.
Inevitably, a single page does not hold all of the allocations the game needs
for a specific-sized allocation, so pages must be linked together by some data
structure. We selected a linked list for pages to make the traversal cheap and to
make it easy to extract or insert pages with minimal cache line misses. Searching
a linked list is slow, so we come up with ways to prevent the search entirely.
Figure 26.2. This is the memory layout for an example SBA page that totals 4 kB seg-
mented into 32-byte allocations. Notice the tail is the last block and is quite small relative
to the amount of memory managed.
Performance Analysis
The two major operations, allocation and deallocation, are very fast. Deallocation
is a constant-time O(1) operation and has no loops once the SBA page has been
located. Locating the page is a separate operation, described in Section 26.6. Al-
location requires scanning for set bits in a bit mask, on average n/2 tests, where a page
is divided into n blocks. A naive C implementation of scanning for set bits using
bytes can be this slow; however, we can trivially test 32, 64, or even 128 bits at a
time by loading the mask into a register and comparing against zero, generally in
just two instructions. Intel CPUs can use the bit scan forward (BSF) instruction to
check for the first set bit in either a 32- or 64-bit register in just a few cycles. In
addition, for most practical page and block sizes, the entire Tail structure and
free block bit mask fits within a single L1 cache line, ensuring the fastest possi-
ble allocation performance.
As a specific example, given a 4-kB page with 32 bytes per block, there are
127 allocatable blocks per page plus one reserved block for the Tail structure.
This is 127 bits for the bit mask, which fits in only four 32-bit words. Worst case,
obtaining an allocation requires four 4-byte requests from memory, but on aver-
age finds a free block in two requests. The first load causes a delay as a cache
line is flushed and another is pulled into the L2 and L1 caches, but the second
load requires only one or two cycles on typical hardware.
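As a rough illustration of that scan (a hypothetical helper, not the code from the book's website), a word-at-a-time search over the free-block bit mask might look like this:

// Returns the index of the first free block, or -1 if the page is full.
// 'words' is the number of 32-bit words in the free-block bit mask.
int FindFirstFreeBlock(const unsigned int *mask, int words)
{
    for (int w = 0; w < words; ++w)
    {
        unsigned int bits = mask[w];
        if (bits != 0)                      // 32 blocks tested in one comparison
        {
            int bit = 0;
            while ((bits & 1U) == 0)        // a BSF/CTZ intrinsic can replace this loop
            {
                bits >>= 1;
                ++bit;
            }

            return (w * 32 + bit);
        }
    }

    return (-1);
}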
Memory Overhead
The SBA scheme described is very memory-efficient. For a given page, there is a
fixed amount of overhead due to the Tail structure of exactly 12 bytes. There is
also a variable amount of overhead for the bit mask, which is always rounded up
to the next four-byte boundary, and occupies one bit per block in the page. For
the 4-kB page with 32 bytes per block above, this works out to 16 bytes for the
bit mask. So the total amount of management overhead within a single SBA page
is less than two bits per allocation. In practical settings, pages are probably larger
than 4 kB, resulting in overhead that rapidly approaches one bit per allocation.
Contrast this with dlmalloc, which has a relatively efficient 16 bytes of overhead
per allocation.
Figure 26.3. This is the memory layout of an MBA page. This particular example is a
16 kB page with 128-byte blocks. All of the management metadata fits within a single
block at the end of the page.
Instead, we could store a value in an array at the end of the page, similar to how
the free block bit mask is stored, although that is quite expensive as well. Con-
sidering most of the allocations are larger than one block, many of the entries in
this array would be unused at any given time. Even if we stored the length per
block in one byte, it would be far too wasteful!
The solution: store a single bit in a bit mask that identifies whether a block is
at the end of an allocation. The pointer passed in to the Free() function identi-
fies the start of the block already. Using these two pieces of information, we can
extrapolate the length of an allocation simply by scanning the bit mask forward
looking for a set bit. As a matter of standards, in the code provided, unallocated
blocks have their end-of-allocation bit set to zero, but it won’t ever be checked so
it shouldn’t matter. See Figure 26.3 for a more visual explanation.
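A sketch of that length computation, using hypothetical helper names, follows:

static bool BitIsSet(const unsigned int *mask, int bit)
{
    return (((mask[bit >> 5] >> (bit & 31)) & 1U) != 0);
}

// Counts the blocks in an allocation by scanning the end-of-allocation bit
// mask forward from the allocation's first block until the set bit is found.
int AllocationLengthInBlocks(const unsigned int *endMask, int firstBlock)
{
    int block = firstBlock;
    while (!BitIsSet(endMask, block))
    {
        block++;
    }

    return (block - firstBlock + 1);
}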
the Alloc() function. Again, the memory manager requests this updated infor-
mation from the page and reorders its internal structures based on the big block
size in this page to optimize the next allocation.
Performance Analysis
Allocation and deallocation are very fast. The act of taking possession of the re-
quested memory once a page is found is simply a few pointer additions and a
couple of memory dereferences. Setting a few bits in the free block bit mask is a
relatively quick n/32 operation when using 32-bit bitset operations, where n is
the number of blocks that the allocation spans. Free() performs the same opera-
tions as Alloc(), except that it reads bits from the free block bit mask rather than
sets them. However, searching for the big block takes m/32 memory accesses,
where m is the number of blocks in the page. Since m is a constant for a given
page and block size, technically both of these analyses are bound by a constant
time (O(1)), meaning the number of cycles that is required can be computed at
compile time. Practically speaking, it is very fast, requiring only one L1 cache
line to be loaded if page and block sizes are carefully chosen for your architec-
ture.
Here’s a specific example: given a 16-kB page with 128 bytes per block,
there are 127 allocatable blocks per page plus one reserved block for the Tail
structure. This is 127 bits for the bit mask, which fits in only four 32-bit words. A
good target is to make all MBA pages at least two times larger than the maxi-
mum medium block allocation size to improve memory performance when allo-
cating close to the maximum. In the worst case, a single allocation might be half
the size of a page, thus writing 64 bits, 32 bits at a time. The first load causes a
delay as a cache line is flushed and another is pulled into the L2 and L1 caches,
but the second load takes only one or two cycles on typical hardware.
Memory Overhead
The MBA scheme is almost as efficient as the SBA approach, given that the
memory requirements are very similar with an extra bit mask and a couple more
words stored in the Tail structure. For any page, the Tail structure is 20 bytes in
size. There is a fixed amount of overhead for the two bit masks, which is two bits
per block, aligned and rounded up to the next 32-bit word. For a 16-kB page
with 128 bytes per block, this is 32 bytes for bit masks. Management overhead
for this example is under three bits per block, but since allocations tend to span
many blocks, this could add up. The largest reasonable allocation in this example
page size would be 8 kB (half the page size), which covers 64 blocks. At two bits
OSAPI because it may have special knowledge of determining that based on the
page handling implementation.
SBA Pages
Page management for small block pages is relatively straightforward. Since each
page has a specific block size that it can allocate, the memory manager simply
keeps two tables of linked lists of pages. One table contains pages that have abso-
lutely no free space in them (and are thus useless when trying to allocate
memory). The other table contains pages that have at least one free block availa-
ble. As shown in Figure 26.4, each list contains pages that have equal block sizes,
so whenever a page fills up, it can be moved to the full table, and the next page in
the available list will service the next request. Any time a block is freed from a
page in the full list, that page is moved back to the available list for that block
size. Because of this, no searching for an allocation is ever required!
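A hedged sketch of this two-table bookkeeping (illustrative names only, not the templated implementation from the website) looks something like this:

struct SBAPage
{
    SBAPage      *next;         // next page with the same block size
    unsigned int blockSize;     // size of every block in this page
    unsigned int freeBlocks;    // number of free blocks remaining
    // ...the free-block bit mask and Tail data live inside the page itself
};

struct SBAPageTables
{
    static const int kNumSizeClasses = 8;   // e.g., 32, 64, ..., 256 bytes
    SBAPage *available[kNumSizeClasses];    // pages with at least one free block
    SBAPage *full[kNumSizeClasses];         // pages with no free blocks
};

// When a page fills up, it is unlinked from available[] and pushed onto
// full[]; when a block is freed from a full page, the page moves back to
// available[] for its block size.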
It should be noted that the current implementation of the page management
feature for SBA always requires allocations be as tightly fitting as possible. As a
result, any time there are no pages of exactly the correct block size, a new page
must be allocated for that block size and added to the available list. In low-
memory conditions, however, this could cause an early failure to allocate because
larger block sizes might be available but not considered for allocation. You may
wish to implement this as a fallback in case of page allocation failure to extend
the life of your game in the event of memory exhaustion. Alternatively, you
might always return a piece of memory that is already in an available page rather
than allocating a new page when a block size runs out of pages, which may help
utilize memory more effectively (although it is quite wasteful) because it will
wait until the absolute last allocation is used before requesting a new page. Or
use some hybrid approach in which the next three or four block sizes are consid-
ered valid if the correct one is full.

Figure 26.4. The SBA always allocates a single sized object from each list of pages. So
to prevent searching, full pages are removed from the list from which we allocate.
MBA Pages
Medium block page management is more complicated only because the lists are
constructed based on the size of the largest available block in each page. Conse-
quently, for nearly every allocation or deallocation, the affected page most likely
has its list pointers adjusted. Unlike SBA, it is entirely possible that no big block
matches the requested allocation size exactly (see Figure 26.5 for an example of a
sparse page list). In fact, a brand-new page would show the big block being the
full size of the page (minus the metadata overhead in the Tail structure), so re-
questing a new page is also not likely to add a page in the list to which the alloca-
tion first indexes. So, whenever a specific-sized list is empty, the page manager
must scan the larger lists looking for a page, guaranteeing that the allocation
function succeeds. However, since list pointers are four bytes each in a 32-bit
machine, an array of pointers actually requires multiple cache lines and far too
many fetches from memory for good performance. Instead, we implement the
lookup first as a bit mask that uses 0 to indicate a null pointer and 1 to indicate
any non-null pointer. In this way, a single cache line can track the availability of
hundreds of lists with just a few memory fetches.
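As a small illustration of that lookup (hypothetical names, not the actual implementation), scanning the availability bit mask upward from the requested size class might look like:

// Bit c of the mask is 1 when the list for size class c is non-empty.
int FindNonEmptyList(const unsigned int *availableMask, int firstClass, int numClasses)
{
    for (int c = firstClass; c < numClasses; ++c)
    {
        if (availableMask[c >> 5] & (1U << (c & 31)))
        {
            return (c);     // a page whose big block is large enough exists here
        }
    }

    return (-1);            // no existing page can satisfy the request
}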
Figure 26.5. This MBA page table shows two pages that are completely full, one with
384 bytes in a single block, and another with 640 bytes in a single block. It makes no
statement about how many blocks of that size might exist or how many smaller ones
exist.
27
Simple Remote Heaps

27.1 Introduction
A remote heap is an old tool in the memory management toolbox. It is appropri-
ate to consider using a remote heap whenever a memory architecture is segment-
ed, such as on the Wii, where MEM1 is much faster than MEM2 and can benefit
from having all the allocation and deletion work done with lower latency. A re-
mote heap is also useful when a processor has to handle feeding other custom
processors with their own local memory, such as pushing audio data to a sound
chip, when the only way to touch that memory is via bulk DMA calls that have
significant latency. Similarly, on the PlayStation 3, the PPU may want to treat the
memory associated with each SPU as a remote heap so that it can play traffic
cop, with data whizzing back and forth, without needing to have direct access to
their individual memories. Conceptually, remote heaps could also be used in a
unified memory architecture using multiple processors, such as the Xbox 360,
where each heap is mutually distinct and partitions memory for each CPU to
work on, with a guarantee that no other processor will cause mutex blocking at
the central memory manager.
Considerations
There are limitations to every technique. Unfortunately, remote heaps suffer from
a significant overhead in performance and local memory footprint, which can be
traced primarily back to the fact that the only data that a typical program has to
refer to an allocation is its address. In a standard heap, metadata describing each
allocation is stored a few bytes before each allocated block of memory, so opera-
tions such as Free() tend to do some quick pointer arithmetic to do their work.
This is not as straightforward with a remote heap because that metadata has to be
associated with an address, without the benefit of storing data at that address.
Designing a good remote heap means flirting with hash tables, red-black trees, or
similar data structures.
Unfortunately, hash tables of addresses are terrible when searching for
memory to allocate (sequential allocations would be randomly scattered through
memory and have O(n) allocation time, where n scales with the hash table rather
than with the entries used) and equally bad for merging adjacent free blocks.
Red-black trees of addresses (such as the standard template library map) are
memory hungry and somewhat tedious to set up with allocators from fixed
memory pools but do work very well when freeing blocks in O(log n) time,
where n is the number of current allocations, and merging adjacent free blocks is
a constant-time operation (neglecting the reordering of the tree when a node is
deleted). Unfortunately, they have O(n) running time for allocation because best-
fit and first-fit strategies must scan all blocks to find free space.
One solution is to carefully pack bits so that the O(n) searches are as fast as
possible. Here, we accept the worst-case performance and try to make the best of
it. There are some benefits to this method, which is why we present it. Some-
times it is ideal. In general, though, it lacks performance. We call this the “bit-
wise remote heap” approach because each unit of memory is represented as a
single bit.
The other solution is to realize that a single address-based data structure is
not ideal for all functions. However, keeping two structures in sync tends to force
each data structure to have the worst-case performance of the other. Our solution
is to allow them to fall out of sync and correct mistakes as they are detected. We
call this the “blockwise remote heap” because it tracks individual allocations as
single units.
can find the next set bit in a single instruction,1 and we’re always looking for free
blocks. On other processors, the opposite might be true.
One downside of this approach is that all allocations must be rounded to an
even granularity of the chunk size, which can be wasteful if this size is set too
large. Another disadvantage is that, when finding an allocation, the implementa-
tion may need to scan all of the bits in the bitmap, only to fail, and this failing
scan becomes slower as the number of bits in the bitmap grows.
Consequently, this kind of implementation is ideal when relatively few allo-
cations are required, they can be easily rounded to a specific size, and the amount
of memory managed is relatively small. This sounds like a lot of restrictions to
place on a remote heap, but in reality, they are satisfied quite frequently. Most
remote heaps dealing with custom hardware, such as audio processors or special-
ized processors (vector units, GPUs, etc.), have alignment requirements anyway,
and they often have a relatively small block of memory that needs to be
managed.
Algorithm
The bitwise remote heap has only two important functions:
■ Alloc(). The obvious method for finding a free block of memory in a bit-
map is to scan for a sequence of bits that are set. This can be accomplished
by loading 32 bits at a time, scanning for the first 1 bit, noting that location,
then scanning for the first 0 bit after it. Often this requires spanning multiple
32-bit values in sequence, so an optimized routine is required. Also, many
processor architectures have assembly language mnemonics that do this kind
of search in a single instruction for further performance enhancement. Clear-
ly, this operation has worst-case n/32 memory fetches, where n is the number
of bits in the heap bitmap. Once the range is found, the bits in the range are
set to zero. A single bit is set in an end-of-allocation bitmap that tells what
the last block in the allocation is. Think of it like null-terminating a string.
■ Free(). When deleting a chunk of memory, we simply turn a range of 0 bits
into 1 bits, scanning forward until we find the allocation terminator bit in the
end-of-allocation bitmap. Once found, we clear the terminating bit as well.
Coalescing free blocks is completely unnecessary because adjacent free bits
implicitly represent contiguous free memory. Thus, there is relatively little
management overhead in deallocation.
1 See the bit scan forward (BSF) instruction in Intel 64 and IA-32 Architectures Software
Developer’s Manual Volume 2A: Instruction Set Reference, A-M. Many details on bit
scan operations are also available at http://chessprogramming.wikispaces.com/BitScan.
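The following sketch illustrates the two operations just described on a naive bit-by-bit basis; the names are hypothetical, and a real implementation would scan a 32-bit word at a time or use a bit-scan instruction as noted above.

static bool TestBit (const unsigned int *m, int i) { return (((m[i >> 5] >> (i & 31)) & 1U) != 0); }
static void SetBit  (unsigned int *m, int i)       { m[i >> 5] |=  (1U << (i & 31)); }
static void ClearBit(unsigned int *m, int i)       { m[i >> 5] &= ~(1U << (i & 31)); }

// Returns the first chunk index of a run of 'count' free chunks, or -1.
int BitwiseAlloc(unsigned int *freeMap, unsigned int *endMap, int totalChunks, int count)
{
    int run = 0;
    for (int i = 0; i < totalChunks; ++i)
    {
        run = TestBit(freeMap, i) ? run + 1 : 0;
        if (run == count)
        {
            int first = i - count + 1;
            for (int j = first; j <= i; ++j) ClearBit(freeMap, j);  // mark chunks used
            SetBit(endMap, i);                                      // allocation terminator
            return (first);
        }
    }

    return (-1);    // the scan failed; the heap cannot satisfy the request
}

void BitwiseFree(unsigned int *freeMap, unsigned int *endMap, int firstChunk)
{
    int i = firstChunk;
    while (!TestBit(endMap, i))     // scan forward to the terminator bit
    {
        SetBit(freeMap, i);
        ++i;
    }

    SetBit(freeMap, i);             // free the last chunk...
    ClearBit(endMap, i);            // ...and clear its terminator bit
}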
address in user-memory. Since we cannot use the address to quickly find metada-
ta about allocations, the naive implementation leaves us with a linked list of
tracking blocks, ordered by address. A naive implementation forces all alloca-
tions to scan the heap’s tracking linked list before allocating or freeing any
memory. A naive implementation has a slow O(n) running time, where n is the
number of allocations in the heap.
The implementation on the website does not attempt to make Free() as fast
as possible but, rather, does a simple linked list search. This could be improved
by creating a second data structure that keeps the tracking block pointer associat-
ed with each allocated address, perhaps in a hash table or balanced tree, but that
is an exercise left to the reader. The memory requirements for such a structure
are substantial and may be entirely unnecessary for applications where Free() is
rarely or never called, but rather, the entire heap is dropped all at once.
However, since fast allocation is actually a complicated problem, we present
a solution that is nearly constant-time for the majority of allocations and is linear
for the rest. We do this by maintaining a bookmark table that essentially points to
the first free block of each power-of-two–sized block of free memory. It remem-
bers the last place the code found a free block of each size. Once we allocate a
block, that entry in the table may no longer be accurate. Updating the entry may
require a full traversal of the heap, a very slow operation, so we allow it to be-
come stale. Instead, during allocation and free calls, we store any blocks we
come across in the table at the appropriate (rounded down to the next) power of
two. While there is no guarantee that the table is updated, in practice it tends to
be close, at no additional performance cost. During an allocation, if a table entry
appears invalid, we can always check the next-higher power of two, or the next,
until one is found to be valid. In most cases, this works very well. In empirical
tests, 65 percent of the allocations have positive hits on the cached references in
the bookmark table, which means 65 percent of the long searches for free
memory were avoided.
The tracking data for allocations is stored in a pool with a threaded free-list,
making the location of a valid metadata block during allocation and deletion an
O(1) operation. Threaded free-lists act like a stack, since we simply want a blank
node to write to when allocating (pop) and want to return a node to the unused
pool when freeing (push). As in any other standard pool implementation, the used
and unused structures occupy the same physical memory at different times, and
we just cast to one or the other, depending on the current state of that block. To
aid in debugging, as well as to facilitate lazy bookmarking, unused metadata
nodes are marked with a sentinel value that cannot appear in an allocation, so we
can repair the bookmark table when it gets out of date.
struct AllocatedBlock
{
unsigned int mAddress; // This is where the block starts.
unsigned int mAllocBytes; // Allocated bytes are at the BEGINNING
// of the allocation.
unsigned int mFreeBytes; // Free bytes are at the END.
unsigned int mNextBlockIndex; // This is a linked list in
// address-order and makes
// allocation and merging quick.
};
Listing 27.1. An AllocatedBlock represents an allocated chunk of memory in the remote address
space and some number of unallocated bytes that immediately follow it.
Many heap implementations treat free and used blocks with separate tracking
data types and store a flag or sentinel to know which metadata is which. I always
simplify mine by combining tracking data for used and free blocks into a single
structure. This has two benefits: first, the memory requirements are extremely
minimal to do so, and second, half the number of linked list iterations are re-
quired to walk the entire heap. If you think about it, just about every allocation
will have at least a few bytes of free space after it (for hardware alignment rea-
sons), and the heap has to decide whether to assign each tiny piece of memory its
own node in the heap or consider that memory to be part of the allocation. These
decisions need not be made at all if there is no distinction made, and thus, no cost
added.
Even when there is no free memory between two allocations, this is no less
efficient. It does have a minor downside in that the memory requirements for
each metadata node are slightly higher, but not significantly so. See Listing 27.1
for an example of a metadata node.
In a standard heap, a metadata block, such as AllocatedBlock, is stored just
prior to the actual address of the allocation. In a remote heap, we can’t do that.
Instead, we store these blocks in a typical allocation pool, where some of the
blocks are currently in use, and some are not (see the UnusedBlock declaration in
Listing 27.2). Note that both structures have “next” pointers. An Allocated-
Block only points to other AllocatedBlock structures, so scans of the heap are
limited to real allocations. The UnusedBlock only points to other UnusedBlock
structures and is effectively a threaded free list.
Recall that the lazy power-of-two table that helps with allocation keeps
pointers into this pool to know where free spaces are. Under normal circumstanc-
struct UnusedBlock
{
    // This is a threaded free list inside the mAllocations pool.
    // Doing this allows for O(1) location of AllocatedBlock
    // objects in the pool.
    unsigned int mNextUnusedBlockIndex;

    // Set to UINT_MAX so that a stale bookmark-table entry pointing at an
    // unused node can be detected (see the discussion of mSentinel below).
    unsigned int mSentinel;
};

Listing 27.2. These are placeholders in the local memory pool of AllocatedBlock structures, the
metadata for the remote heap.
es, a pool structure would never link to an unused node, but with a lazy book-
mark table, we can and do. We must, therefore, have a way to detect an unused
node and handle it gracefully. This is the reason for the sentinel value. Whenever
we look for an allocation using the bookmark table, we can tell if the block is still
allocated by casting the block to an UnusedBlock and checking the mSentinel
variable. If it is UINT_MAX, clearly the table is out of date, and we should look
elsewhere. Otherwise, we cast to an AllocatedBlock and see if the number of
free bytes is sufficient to satisfy the new allocation.
One other trick that simplifies the code is the use of a dummy block in the
linked list of allocations. Since the initial block has all the free memory and no
allocations, we need to have some way to represent that for the Free() function.
Rather than write a hairy mess of special-case code for the first allocation, we
just initialize one node to have the minimum allocated block and all the remain-
ing free bytes and stick it in the list of allocations. This node always stays at the
head of the list, and consequently, all other allocations come after it. Note, how-
ever, we never count that node as a user allocation, so the number of real alloca-
tions available to the user is one fewer than what is actually present.
Algorithm
The blockwise remote heap also has only two interesting functions:
■ Alloc(). First, we compute what power of two can contain the allocation,
use that as an index into the bookmark table, and find the metadata block for
the associated slot in the table. If we find a valid block with enough free
space, we’re set. If not, we iteratively search the next-highest power of two
until we run out of table entries. (This scan is why fragmentation is so high,
because we’re more likely to cut a piece out of the big block rather than
search the heap looking for a tighter fit someplace else.) If no large block is
verified, we search the entire heap from the start of the memory address
space looking for a node with enough free bytes to afford this new allocation.
Then, the new allocation information is stored in local memory in a metadata
node. The allocation of a metadata block is quite simple. We pop an Allo-
catedBlock off the metadata pool (which is always the first UnusedBlock),
fill out the structure with information about our new allocation, reduce the
free bytes of the source block to zero and assign them to the new block, and
link the new block into the list after the source block, returning the remote
memory pointer to the caller. Since we have a dummy block in the linked list
that is always at the head, we never need to worry about updating the head
pointer of this list.
■ Free(). The operation that hurts performance most is searching the heap for
a memory address (or free block of sufficient size). Free() has to search to
figure out where an address is in the heap. This is quite slow and dominates
the running time of the heap implementation. While searching, we keep the
pointer to the previous node so we can collapse all the memory into it once
the address is found. The merging operation simply adds the free and allocat-
ed bytes to the previous node, links around the deleted node, and releases the
metadata back to the pool. Consequently, freeing data is an O(n) operation,
where n is the number of live allocations.
The bookmark table is updated after every allocation and deletion so that any
time a piece of free memory is available, the appropriate power-of-two indexed
slot is given the address to its metadata node. There is no sense in checking hun-
dreds or thousands of nodes if they are in the bookmark table during every opera-
tion, so updates are limited to overwriting what is currently stored there, not
cleaning out the table when sizes change. As a result, the table can sometimes
point to metadata nodes that are smaller or larger than expected, or have even
been completely deleted. So, anywhere the code consults the bookmark table, it
verifies the data is good and corrects the data (by marking it empty) if not.
For better performance, an improvement could be made by using a doubly
linked list of metadata nodes and a hash table or tree structure that maps address-
es to metadata pointers. At some expense to memory, you can get as good as
O(1) running time by doing this. We have not done this, but it is a straightfor-
ward extension.
27.4 Testing Results
Memory Efficiency
The results in Table 27.1 show that the bitwise method is very good at packing
the maximum amount of data into a remote heap, regardless of the granularity.
Comparatively worse, the blockwise method is not as economical on memory use
(i.e., it results in fewer successful allocations) and scales even more poorly as the
number of managed allocations drops, most likely because the metadata table is
full.
The allocation count reservations for the blockwise tests are chosen such that
the amount of local memory used is roughly equal between the two heap types.
We chose this metric not only because it’s a valid comparison but also because
we found that there is a hard limit above which adding more allocation space to
the blockwise heap yields no benefit whatsoever, and indeed, begins to lose its
CPU performance edge. The second key point is that the bitwise method is sig-
nificantly more compact in memory, if memory is tight.
Table 27.1. Memory allocation efficiency comparison between remote heap types.
Performance
Table 27.2 shows how drastically different the performance is between bitwise
remote heaps and traditional metadata remote heaps. The two bitwise tests differ
solely on the basis of whether the cached starting point was used. These test re-
sults clearly show that avoiding a complete search by caching the last free
block’s location is at least 50 percent better and improves performance further as
the allocation bitmap gets longer. Some CPU architectures are much faster or
much slower, depending on how the bit scan operation is handled. Moving to a
wider data path, such as 128-bit MMX instructions, or using 64-bit registers,
could potentially double or quadruple the performance, making bitwise an excel-
lent choice. Hierarchical bitmasks could also take numerous O(n) operations and
make them O(log n).
With these performance characteristics in mind, it does appear that a block-
wise heap is far superior in terms of performance; however, be aware that the
number of allocations that the blockwise heap can fit is reduced significantly as
the allocation count is reduced. The time required to allocate with the blockwise
heap is relatively constant, but the current implementation causes deallocation to
scale linearly with the number of allocations in the heap, hence the linear relation
between allocation count and speed.
28
A Cache-Aware Hybrid Sorter

Sorting is one of the most basic building blocks of many algorithms. In graphics,
a sort is commonly used for depth-sorting for transparency [Patney et al. 2010] or
to get better Z-cull performance. It is a key part of collision detection [Lin 2000].
Dynamic state sorting is critical for minimizing state changes in a scene graph
renderer. Recently, Garanzha and Loop [2010] demonstrated that it is highly
profitable to buffer and sort rays within a ray tracer to extract better coherency,
which is key to high GPU performance. Ray sorting is one example of a well-
known practice in scientific computing, where parallel sorts are used to handle
irregular communication patterns and workloads.
Well, can’t we just use the standard template library (STL) sort? We can, but
we can also do better. How about up to six times better? Quicksort is probably
the best comparison-based sort and, on average, works well. However, its worst-
case behavior can be O(n²), and its memory access pattern is not very cache-
friendly. Radix sort is the only practical O(n) sort out there (see the appendix for
a quick overview of radix sort). Its memory access pattern during the first pass,
where we are building counts, is very cache-friendly. However, the final output
phase uses random scatter writes. Is there a way for us to use radix sort but min-
imize its weaknesses?
Modern parallel external sort (e.g., AlphaSort [Nyberg et al. 1995]) almost
always uses a two-pass approach of in-memory sort followed by a merge. Each
item only has to be read from disk twice. More importantly, the merge phase is
very I/O friendly since the access is purely sequential. Substitute “disk” with
“main memory” and “memory” with “cache,” and the same considerations ap-
ply—we want to minimize reads from main memory and also love the sequential
access pattern of the merge phase.
Hence, if we partition the input into substreams that fit into the cache and
sort each of them with radix sort, then the scatter writes now hit the cache, and
our main concern is addressed. One can substitute shared memory for cache in
the above statement and apply it to a GPU-based sort. Besides the scattering con-
cern, substreams also enable us to keep the output of each pass in cache so that it
is ready for the next pass without hitting main memory excessively.
Our variant of radix sort first makes one pass through the input and accumu-
lates four sets of counters, one for each radix digit. We are using radix-256,
which means each digit is one byte. Next, we compute the prefix sums of the
counters, giving us the final positions for each item. Finally, we make several
passes through the input, one for each digit, and scatter the items into the correct
order. The output of the scattering pass becomes the input to the next pass.
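As a point of reference (not the book's code), a single radix-256 counting-and-scatter pass over 32-bit keys can be sketched as follows; a full sort chains four such passes, or builds all four histograms in one counting pass as described above.

void RadixPass(const unsigned int *in, unsigned int *out, int n, int shift)
{
    unsigned int count[256] = {0};

    for (int i = 0; i < n; ++i)                 // counting pass
    {
        ++count[(in[i] >> shift) & 0xFF];
    }

    unsigned int sum = 0;                       // prefix sums give final positions
    for (int d = 0; d < 256; ++d)
    {
        unsigned int c = count[d];
        count[d] = sum;
        sum += c;
    }

    for (int i = 0; i < n; ++i)                 // scatter pass
    {
        out[count[(in[i] >> shift) & 0xFF]++] = in[i];
    }
}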
Radix sort was originally developed for integers since it relies on extracting
parts of it using bit operations. Applying it directly to floating-point values works
fine for positive numbers, but for negative numbers, the results are sorted in the
wrong order. One common approach is to treat the most significant radix digit as
a special case [Terdiman 2000]. However, that involves a test in the inner loop
that we would like to avoid. A nice bit hack by [Herf 2001] solves this nasty
problem for radix sort.
For efficient merging, we use an oldie but goodie called the loser tree. It is a
lot more efficient than the common heap-based merger.
At the end we get a sorter that is two to six times faster than STL and has
stable performance across a wide range of datasets and platforms.
cx = cachelevel; // select which cache level to query (input to CPUID leaf 4)
// (CPUID with EAX = 4 executes here, returning its results in ax, bx, cx, dx.)
U32 sets = cx + 1; // ECX = number of sets - 1
U32 linesize = bx & 0x0FFF; // EBX[11:0] = line size - 1
U32 partitions = (bx >> 12) & 0x03FF; // EBX[21:12] = partitions - 1
U32 ways = (bx >> 22) & 0x03FF; // EBX[31:22] = ways - 1
return ((ways + 1) * (partitions + 1) * (linesize + 1) * sets);
}
To determine our substream size, we have to consider the critical step for our
radix sorter, the scatter phase. To perform scattering, the code has to access the
count table using the next input radix digit to determine where to write the item
to in the output buffer. The input stream access is sequential, which is good for
the prefetcher and for cache hits. The count table and output buffer writes are
both random accesses using indexing, which means they should both be in the
cache. This is very important since we need to make several passes, one for each
radix digit. If we are targeting the L1 cache, then we should reserve space for the
counters (1–2 kB) and local temporaries (about four to eight cache lines). If we
are targeting the L2 cache, then we might have to reserve a little more space in
L2 since the set-associative mechanism is not perfect.
For the counting phase, the only critical data that should be in the cache is
the counts table. Our use of radix-256 implies a table size of 256 *
sizeof(int), which is 1 kB or 2 kB. For a GPU-based sort that has limited
shared memory, one might consider a lower radix and a few extra passes.
Rule 1 turns 2.0 into 0xC0000000, -2.0 into 0x40000000, and -4.0 into
0x40800000, which implies -2.0 < -4.0 < 2.0. This is better, but still wrong. Ap-
plying rule 2 turns -2.0 into 0x3FFFFFFF and -4.0 into 0x3F7FFFFF, giving
-4.0 < -2.0 < 2.0, which is the result we want. This bit hack has great implica-
tions for radix sort since we can now treat all the digits the same. Interested read-
ers can compare Terdiman's and Herf's code to see how much this simple hack
helps to simplify the code.
The code in Listing 28.2 is taken from Herf's webpage. Like Herf's code, we
build all four histograms in one pass. If you plan on sorting a large number of
items and you are running on the CPU, you can consider Herf's radix-2048
optimization, which reduces the number of scattering passes from four to three.
The histogram table is bigger since it needs 2^11 counters rather than 2^8, and
you need three of them. Herf reported a speedup of 40 percent. Keep in mind
that the higher radix demands more L1 cache and increases the fixed overhead of
the sorter. Our substream split strategy reduces the benefit of a higher radix since
we strive to keep the output of each pass within the cache.

Listing 28.2. Herf's original bit hack for floats and its inverse (IFloatFlip).
One small but important optimization for FloatFlip is to utilize the natural
sign extension while shifting signed numbers. Instead of the following line:
we can write
The right shift smears the sign bit into a 32-bit mask that is used to flip the input
if the sign bit is set, hence implementing rule 2. Strictly speaking, this behavior is
not guaranteed by the C++ standard. Practically all compilers we are likely to
encounter during game development do the right thing. Please refer to the chapter
“Bit Hacks for Games” in this book for more details. The same idea can be ap-
plied to IFloatFlip, as follows:
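The exact lines from Listing 28.2 are not reproduced in this excerpt. The following
is a minimal sketch of the trick, not Herf's verbatim code: an arithmetic right shift
of the value, viewed as a signed integer, replicates the sign bit across all 32 bits,
producing the flip mask without a branch (U32 is the unsigned 32-bit type used
elsewhere in the chapter):

// Sketch of the sign-extension trick (illustrative; not the book's listing).
// FloatFlip maps raw float bits to an unsigned value that sorts correctly.
inline U32 FloatFlip(U32 f)
{
    // The arithmetic shift smears the sign bit: all ones for negative inputs
    // (rule 2), zero for positive inputs; OR-ing 0x80000000 always flips the
    // sign bit (rule 1). Shifting a negative signed value is implementation-
    // defined, as noted above.
    U32 mask = U32(int(f) >> 31) | 0x80000000;
    return (f ^ mask);
}

// IFloatFlip undoes the mapping. After FloatFlip, originally negative values
// have a clear sign bit, so the input is inverted before smearing.
inline U32 IFloatFlip(U32 f)
{
    U32 mask = U32(int(~f) >> 31) | 0x80000000;
    return (f ^ mask);
}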
struct Node
{
float key; // our key
int value; // always the substream index 'key' is from
};
1
For an excellent graphical illustration of a loser tree in action, please refer to the course
notes by Dr. Thomas W. Bennet, available at http://sandbox.mc.edu/~bennet/cs402/lec/
losedex.html.
// Build the loser tree after populating all the leaf nodes:
int InitWinner(int root)
{
if (root >= kStreams)
{
// leaf reached
return (root); // leaf index
}
else
{
int left = InitWinner(root * 2);
int right = InitWinner(root * 2 + 1);
Key lk = m_nodes[left].m_key;
Key rk = m_nodes[right].m_key;
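The remainder of the listing is not reproduced here. As a rough sketch only (the
m_loserIndex member below is hypothetical and stands in for whatever the chapter
actually stores at an internal node), the tournament step typically finishes by
keeping the loser at this node and passing the winner upward:

// Hypothetical completion of the tournament step: the smaller key wins.
if (lk <= rk)
{
    m_nodes[root].m_loserIndex = right; // remember the loser at this node
    return (left);                      // the winner plays the next round
}
else
{
    m_nodes[root].m_loserIndex = left;
    return (right);
}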
from the root of the tree to the original position of the winner at the bottom. We
repeat those matches with the new item, and a new winner emerges, as demon-
strated in Listing 28.5.
loserslot = Parent(loserslot);
}
return (winner);
}
The next little trick to speed up the merge is to mark each substream with an
end-of-stream marker that is guaranteed to be larger than all keys. The marker
stops the merger from pulling from that substream once the marker reaches the
merger. We chose infinity() as our marker. This also allows us to handle the uneven
substream lengths when the input is not exactly divisible by the number of
substreams. In the sample code on the website, we copy the input and allocate the
extra space needed by the marker. With some careful coding, we only need the
extra space for the last substream. In a production environment, if one can be
sure the input buffer always has extra space at the end, then we can avoid the
extra buffer and copying. For example, we can define the interface to the sorter
as such or take an extra input argument that defines the actual size of the input
buffer and copy only if no room is reserved.
Our merge phase is very cache friendly since all the memory operations are
sequential. The sorted streams are accessed sequentially, and the output is
streamed out from the merger. The merger is small since it only needs space to
store one item from each stream. The size of the tree is 2 * kNumStreams and,
therefore, fits into the L1 cache easily. One can even consider keeping the merger
entirely within registers for maximum speed. For small datasets, our sort is 2.1 to
3.5 times faster than STL sort. The relative disadvantage of quicksort is smaller
when the dataset is smaller and fits nicely into the caches.
The loser tree might not be the best approach if you are writing GPU-based
applications. An even–odd or bitonic merge network is probably a better way to
exploit the wide SIMD parallelism. That being said, the merge phase is only a
small fraction of the total processing time (~25%). The sorter is encapsulated in a
class to hide details, like the stream markers and substream sizes, and to reuse
temporary buffers for multiple sorts.
our sort runs in 16 to 17 ms. The picture is very different on a Core i7, which
shows better scalability: four threads are about three times faster than serially
sorting the four substreams. The runtime for STL sort is 58 ms on the Core i7,
which is a six-times speedup for the hybrid sorter.
28.5 Conclusion
We now have a sorter that is fast in a single-threaded application and also func-
tions well in a multicore setting. The sort is a stable sort and has a predictable run
time, which is critical for real-time rendering environments. It can also be used as
an external sorter, where inputs are streamed from the disk and each core per-
forms the radix sort independently. The merging step would be executed on a
single core, but the I/O is purely sequential. The merger is very efficient; it took
6 milliseconds to merge one million items on a single core of a 2.4-GHz Q6600.
It can be further improved by simple loop unrolling since the number of tourna-
ments is always equal to the height of the tree.
This chapter is heavily influenced by research in cache-aware and cache-
oblivious sorting. Interested readers are encouraged to follow up with excellent
works like Brodal et al. [2008], where funnel sort is introduced. While this work
was conceived independently, a paper by Satish et al. [2009] shares many of the
same ideas and shows great results by carefully using radix sort to extract
parallelism on a GPU.
Appendix
Radix sort is a refinement of distribution sort. If we have a list of n input bytes,
we need 256 bins of size n to distribute the inputs into, since in the worst case, all
the inputs can be the same byte value. Of course, we could have dynamically
grown each of the bins, but that would incur a large overhead in calls to the
memory allocator and would also fragment our heap. The solution is a two-pass ap-
proach, where we allocate 256 counters and only count how many items belong
in each bin during the first pass. Next, we form the prefix sum within the array of
counters and use that to distribute each input digit to an output buffer of size n.
We have just discovered counting sort.
The most commonly used form of radix sort is called the least-significant-
digit sort. That is, we start from the least-significant digit and work our way
towards more significant digits. For a 32-bit integer, the most common practice is
to use radix-256—i.e., we break the integer into four 8-bit digits and perform the
sort in four passes. For each pass, we perform the above counting sort. Since
counting sort is a stable sort, the ordering of a pass is preserved in the later
passes.
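To make the preceding description concrete, here is a minimal, self-contained
sketch of such an LSD radix-256 sort. It is an illustration only, not the chapter's
cache-aware implementation (which, among other things, builds all four
histograms in a single pass over the input):

#include <algorithm>
#include <cstddef>

typedef unsigned int U32;   // assumed 32-bit unsigned type, as in the text

// Minimal LSD radix-256 sort sketch. Each pass is a counting sort on one
// 8-bit digit; because counting sort is stable, each pass preserves the order
// produced by the previous, less significant digits.
void RadixSort256(U32 *keys, U32 *temp, size_t n)
{
    for (int pass = 0; pass < 4; ++pass)
    {
        size_t count[256] = {0};
        int shift = pass * 8;

        // First pass over the data: histogram the current digit.
        for (size_t i = 0; i < n; ++i)
            ++count[(keys[i] >> shift) & 0xFF];

        // The prefix sum turns the counts into starting offsets for each bin.
        size_t sum = 0;
        for (int d = 0; d < 256; ++d)
        {
            size_t c = count[d];
            count[d] = sum;
            sum += c;
        }

        // Second pass: scatter each key into its bin in the output buffer.
        for (size_t i = 0; i < n; ++i)
            temp[count[(keys[i] >> shift) & 0xFF]++] = keys[i];

        // The output of this pass becomes the input of the next one. After an
        // even number of passes, the result is back in the original array.
        std::swap(keys, temp);
    }
}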
Acknowledgements
We would like to thank Michael Herf for his permission to include his radix sort code and
for inventing the bit hack for floating-point values. We would also like to thank Warren
Hunt for supplying his excellent radix sort implementation for another project that uses
negative indexing and the improved version of the bit hack.
References
[Brodal et al. 2008] Gerth Stølting Brodal, Rolf Fagerberg, and Kristoffer Vinther. “En-
gineering a Cache-Oblivious Sorting Algorithm.” Journal of Experimental Algo-
rithms 12 (June 2008).
[Herf 2001] Michael Herf. “Radix Tricks.” 2001. Available at http://www.stereopsis.
com/radix.html.
[Knuth 1973] Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and
Searching. Reading, MA: Addison-Wesley, 1973.
[Garanzha and Loop 2010] Kirill Garanzha and Charles Loop. “Fast Ray Sorting and
Breadth-First Packet Traversal for GPU Ray Tracing.” Computer Graphics Forum
29:2 (2010).
[Lin 2000] Ming C. Lin. “Fast Proximity Queries for Large Game Environments.” Game
Developers Conference Course Notes. Available at http://www.cs.unc.edu/~lin/
gdc2000_files/frame.htm.
[Marin 1997] Mauricio Marin. “Priority Queues Operations on EREW-PRAM.” Pro-
ceedings of Euro-Par ’97 Parallel Processing. Springer, 1997.
[Nyberg et al. 1995] Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, and Dave
Lomet. “AlphaSort: A Cache Sensitive Parallel External Sort.” The VLDB Journal
4:4 (October 1995), pp. 603–627.
[Patney et al. 2010] Anjul Patney, Stanley Tzeng, and John D. Owens. “Fragment-
Parallel Composite and Filter.” Computer Graphics Forum 29:4 (June 2010), pp.
1251–1258.
[Satish et al. 2009] Nadathur Satish, Mark Harris, and Michael Garland. “Designing Ef-
ficient Sorting Algorithms for Manycore GPUs.” Proceedings of the 2009 IEEE
International Symposium on Parallel & Distributed Processing.
[Terdiman 2000] Pierre Terdiman. “Radix Sort Revisited.” April 1, 2000. Available at
http://codercorner.com/RadixSortRevisited.htm.
[Wright 2006] Christopher Wright. “Using CPUID for SIMD Detection.” 2006. Availa-
ble at http://softpixel.com/~cwright/programming/simd/cpuid.php.
29
Thread Communication Techniques
Julien Hamaide
Fishing Cactus
sections. But most of the time, those variables are falsely shared. If a first-in-
first-out (FIFO) queue is used to communicate commands between threads, the
item count is a shared variable. If only a single thread writes to the collection,
and only a single thread is reading from the collection, should both threads share
that variable? The item count can always be expressed with two variables: one
that counts the inserted elements and one that counts the removed elements, the
difference being the number of items currently in the queue. This strategy is used
in the simple structures presented here.
return (true);
}
return (false);
}
memory and ensures the item is completely copied to memory before the index is
incremented. Let’s imagine that the CPU had reordered the writes and the
WriteIndex has already updated to its new value, but the item is not yet copied.
If at the same time, the consumer queries the WriteIndex and detects that a new
item is available, it might start to read the item, although it is not completely
copied.
When the WriteIndex is incremented, the modulo operator is not applied to
it. Instead, the modulo operator is applied only when accessing the item table.
The WriteIndex is then equal to the number of items ever pushed onto the
queue, and the ReadIndex is equal to the number of items ever popped off the
queue. So the difference between these is equal to the number of items left in the
queue. As shown in Listing 29.3, the queue is full if the difference between the
indices is equal to the maximum item count. If the indices wrap around to zero,
the difference stays correct as long as indices are unsigned. If ItemCount is not a
power of two, the modulo operation is implemented with a division. Otherwise,
the modulo can be implemented as index & (ItemCount - 1).
return (false);
}
The Push() method only writes to WriteIndex and ItemTable. The write
index is only written to by the producer thread, so no conflict can appear. But the
item table is shared by both threads. That’s why we check that the queue is not
full before copying the new item.
At the other end of the queue, the Pop() method is used by the consumer
thread (see Listing 29.4). A similar mechanism is used to pop an item. The
IsEmpty() method ensures there is an item available. If so, the item is read. A read
barrier is then inserted, ensuring ReadIndex is written to after reading the item.
If incremented ReadIndex was stored before the end of the copy, the item con-
tained in the item table might be changed by another Push().
A question that may arise is what happens if the consumer has already con-
sumed an item, but the read index is not updated yet? Nothing important—the
producer still thinks the item is not consumed yet. This is conservative. At worst,
the producer misses a chance to push its item. The latency of this system is then
dependent on the traffic.
For this technique to work, a fundamental condition must be met: writing an
index to the shared memory must be atomic. For x86 platforms, 32-bit and 64-bit
writes are atomic. On PowerPC platforms, only 32-bit writes are atomic. But on
the PlayStation 3 SPU, there are no atomic writes under 128 bytes! That means
each index must use 128 bytes, even though only four bytes are really used. If
some undefined behavior occurs, always verify the generated code.
We now have a simple SWSR FIFO queue using no locks. Let’s see what can
be built with this structure.
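For reference, the following is a condensed sketch of such a single-writer,
single-reader queue, assembled from the description above; it is not the chapter's
Listings 29.1 through 29.4. The _WriteBarrier()/_ReadBarrier() intrinsics used
here are Microsoft compiler barriers; on weakly ordered hardware a CPU barrier
(e.g., lwsync on the PowerPC) may also be required.

#include <intrin.h>

// Illustrative single-writer, single-reader FIFO. ItemCount must be a power
// of two so the modulo reduces to a mask and stays correct when the indices
// wrap around.
template <typename T, unsigned int ItemCount>
class SWSRQueue
{
public:
    SWSRQueue() : WriteIndex(0), ReadIndex(0) {}

    bool Push(const T& item)   // called only by the producer thread
    {
        if (WriteIndex - ReadIndex == ItemCount)
            return (false);    // queue is full
        ItemTable[WriteIndex % ItemCount] = item;
        _WriteBarrier();       // the item must be stored before the index
        WriteIndex = WriteIndex + 1;
        return (true);
    }

    bool Pop(T& item)          // called only by the consumer thread
    {
        if (WriteIndex == ReadIndex)
            return (false);    // queue is empty
        item = ItemTable[ReadIndex % ItemCount];
        _ReadBarrier();        // finish reading the item before freeing the slot
        ReadIndex = ReadIndex + 1;
        return (true);
    }

private:
    T ItemTable[ItemCount];
    volatile unsigned int WriteIndex;   // total number of items ever pushed
    volatile unsigned int ReadIndex;    // total number of items ever popped
};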
void Aggregator::InitializeWriterThread()
{
    int data_index = GetNextAvailableFifo();
    assert((data_index != -1) && "No buffer left");
    ThreadFifo.SetValue(&ThreadFifoTable[data_index]);
}
return (false);
}
If no element is available at all after checking each thread’s FIFO, the func-
tion returns false. The function does not wait for an element, and lets the caller
decide what to do if none are available. It can decide to sleep, to do some other
work, or to retry if low latency is important. For example, in a sound system, the
update might process all commands until there are no commands left and then
update the system.
To prevent any other thread from reading from the aggregator, we also regis-
ter the reader thread and store its identifier. The identifiers are then matched in
Pop() in debug mode.
This implementation does not allow more than a fixed number of threads.
But generally in a game application, the number of threads is fixed, and a thread
pool is used, thus limiting the need for a variable-size architecture.
From a performance point of view, care should be taken with memory usage.
Depending on the architecture, different FIFO structures should be placed on dif-
ferent cache lines. To ensure consistency, when a thread writes to a memory slot,
the corresponding cache line is invalidated in all other processor caches. An up-
dated version must then be requested by any thread reading the same cache line.
The traffic on the bus can increase substantially, and the waiting threads are
stalled by cache misses. The presented implementation does not take this problem
into account, but an ideal solution would be to align the FIFO items with cache
lines, padding each item up to the cache line size so that neighboring items are not
stored in the same cache line.
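One way to express that padding is sketched below; the 64-byte line size is an
assumption and should match the target hardware:

#include <cstddef>

// Pad each queue slot out to a full cache line so that the producer's and the
// consumer's writes never touch the same line (assumes 64-byte cache lines).
static const size_t kCacheLineSize = 64;

template <typename T>
struct PaddedItem
{
    T Item;
    char Padding[kCacheLineSize - (sizeof(T) % kCacheLineSize)];
};

// Usage: declare the FIFO's item table with PaddedItem<T> entries instead of
// packing the raw items back to back.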
ple, in a job system, this thread can pop all jobs in local variables and sort them
by priority before pushing them to each worker thread. It might also try to pre-
vent two ALU-intensive jobs from being assigned to threads that share the same
ALU.
While implementing this structure, some important questions drive its details.
Do all writer threads also read from the queue? Is the central thread one of the
writer or reader threads? Or is it a special thread that only works on dispatching
the items? No generic implementation of the gateway is practical. A custom im-
plementation is recommended for each system.
29.6 Debugging
Multithreading and debugging frequently can’t get along. Every queue we have
presented is made from an SWSR queue. This means that a condition must be
satisfied: for a single SWSR queue, only a single thread can read and only a sin-
gle thread can write. The queue should be able to detect if an unexpected thread
accesses its data. Comparing the ID of the accessing thread with the expected
thread ID and asserting their equality is strongly advised. If some crash continues
to occur despite the correct thread accessing the collection, it might be the barri-
er. Check compiler-generated code and validate that the correct instructions are
issued to prevent reordering (e.g., lwsync on the PowerPC).
References
[Acton 2009] Mike Acton. “Problem #1: Increment Problem.” CellPerformance. August
7, 2009. Available at http://cellperformance.beyond3d.com/articles/index.html.
30
A Cross‐Platform
Multithreading Framework
Martin Fleisz
Thinstuff s.r.o.
Over the last couple of years, a new trend in game engine design has started due
to changing processor designs. With the increasing popularity of multicore CPUs,
game engines have entered the world of parallel computing. While older engines
tried to avoid multithreading at all costs, it is now a mandatory technique in order
to be able to take full advantage of the available processing power.
To be able to focus on the important parts of a game engine, like the graphics
engine or the networking code, it is useful to have a robust, flexible, and easy to
use multithreading framework. This framework also serves as an abstraction lay-
er between the platform-dependent thread and the platform-independent game
engine code. While the C++0x standard provides support for threading, develop-
ers still have to refer to external libraries like Boost to gain platform-independent
threading support. The design and interface of the Boost threading framework
strongly resembles the POSIX Threads (or Pthreads) library, which is a rather
low-level threading library. In contrast, our framework offers a higher-level inter-
face for threading and synchronization. What we provide is a collection of easy
to use synchronization objects, a flexible and simple way of handling threads,
and additional features like deadlock detection. We also show a few examples of
how to extend and customize our framework to your own needs.
30.1 Threading
ThreadManager
Let us first take a look at the core component of our threading framework, the
ThreadManager class. As can be seen in the sample code on the website, this
ThreadInfo
The most important component in our framework, when working with threads, is
the ThreadInfo class. In order to be able to uniquely identify a thread, our
ThreadInfo class stores a thread name and a thread ID. This information can be
used during logging in order to identify which thread wrote which log entry.
ThreadInfo also offers methods to wait for a thread, to stop it, or to kill it, alt-
hough killing a thread should usually be avoided.
Another important feature of ThreadInfo is the recording of wait object in-
formation. Whenever a synchronization object changes its wait state, it stores that
information in the ThreadInfo’s wait object information array. With this infor-
mation, the ThreadManager is able to construct a complete snapshot of the
threading framework’s current state. Additionally, each thread is able to detect
whether there are any pending wait operations left before it is stopped (e.g., a
mutex might still be locked after an exception occurred). In this case, it can issue
a warning as such a scenario could potentially lead to a deadlock.
Another feature that we have built into our ThreadInfo objects is active
deadlock detection. Using a simple trigger system, we can easily create a watch-
dog service thread that keeps checking for locked threads. In order for this fea-
ture to work, a thread has to continuously call the TriggerThread() method of
its ThreadInfo instance. If the gap between the last TriggerThread() call and
the current timestamp becomes too large, a thread might be deadlocked. Do not
use too small values (i.e., smaller than a second) in order to prevent accidental
timeouts. If a thread is known to be blocking, you can temporarily disable the
trigger system using the SetIgnoreTrigger() method. You can also use this
The Pthreads standard does not provide us with an exact critical section
equivalent. In order to provide the same functionality, our BasicSection class
can either use a mutex or a spin lock. Spin locks, depending on their implementa-
tion, can cause various problems or performance issues [Sandler 2009]. There-
fore, we decided to go with the safer option and implemented the BasicSection
class using a mutex. Of course, this doesn’t result in any performance improve-
ment compared to using the BasicMutex class.
1
See sem_pack.h, available on Joseph Kee Yin’s home page at http://www.comp.hkbu.
edu.hk/~jng/comp2320/sem_pack.h.
int sem;
existed. In case a new semaphore set was created, we set the semaphore value to
p_InitialValue and the semaphore reference count to a large integer value (we
cannot count up from zero because we use this state to determine whether the set
has been initialized). Finally, we decrement our reference counter and release the
lock on semaphore 2 to complete the initialization.
In Listings 30.1 and 30.2, we can also see the power of the semop() function.
Beside the semaphore set ID, this function also expects an array of semaphore
operations. Each entry in this array is of type sembuf and has the data fields de-
scribed in Table 30.1.2
int semval;
2
See also The Single UNIX Specification, Version 2, available at http://opengroup.org/
onlinepubs/007908799/xsh/semop.html.
{
// Initialize semaphore value.
if (semctl(sem, 0, SETVAL, p_InitialValue) < 0)
THROW_LAST_UNIX_ERROR();
The operation member specifies the value that should be added to the sema-
phore counter (or subtracted in case it is negative). If we subtract a value that is
greater than the semaphore’s current value, then the function suspends execution
of the calling thread. When we are using an operation value of zero, semop()
checks whether the given semaphore’s value is zero. If it is, the function returns
immediately, otherwise, it suspends execution. Additionally, there are two flags
that can be specified with each operation, IPC_NOWAIT and SEM_UNDO. If
IPC_NOWAIT is specified, then semop() always returns immediately and never
blocks. Each operation performed with SEM_UNDO is recorded internally in the
semaphore set. In case the program terminates unexpectedly, the kernel uses this
information to reverse all effects of the recorded operations. Using this mecha-
nism, we avoid any situation in which a semaphore remains locked by a dead
process.
Table 30.1. Data fields of the sembuf structure.

Member     Description
sem_num    Identifies a semaphore within the current semaphore set.
sem_op     Semaphore operation.
sem_flg    Operation flags.
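As a small illustration of these operations and flags (a sketch, not the
framework's actual code), locking and unlocking a binary semaphore with semop()
and SEM_UNDO looks roughly like this:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

// Sketch: treat semaphore 0 of the set 'semId' as a binary mutex. SEM_UNDO
// tells the kernel to reverse the operation if the process dies while holding
// the lock.
bool LockBinarySemaphore(int semId)
{
    struct sembuf op;
    op.sem_num = 0;         // first semaphore in the set
    op.sem_op = -1;         // wait until the value is positive, then decrement
    op.sem_flg = SEM_UNDO;
    return (semop(semId, &op, 1) == 0);
}

bool UnlockBinarySemaphore(int semId)
{
    struct sembuf op;
    op.sem_num = 0;
    op.sem_op = 1;          // increment the value, releasing one waiter
    op.sem_flg = SEM_UNDO;
    return (semop(semId, &op, 1) == 0);
}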
Cleanup of our semaphore set works in a way similar to its creation. First, we
acquire the lock of our binary semaphore with ID 2 and increment the reference
count of semaphore 1. Then, we compare the current value of semaphore 1 to the
constant integer we used to initialize it. If they have the same value, then we can
go ahead and delete the semaphore from the system using semctl() with the
IPC_RMID flag. If the values do not match, we know that there are still references
to the semaphore set, and we proceed by freeing our lock on semaphore 2.
Our implementation of BasicSemaphore makes use of both POSIX and Sys-
tem V semaphores for various reasons. For named semaphores, we use the Sys-
tem V API because of its SEM_UNDO feature. While developing the framework, it
so happened that an application with a locked POSIX semaphore crashed. After
restarting the application and trying to obtain a lock to the same semaphore, we
became deadlocked because the lock from our previous run was still persisting.
This problem doesn’t arise when we use System V semaphores because the ker-
nel undoes all recorded operations and frees the lock when an application crash-
es. A problem that both solutions still suffer from is that semaphore objects
remain alive in case of an application crash since there is no automatic cleanup
performed by the operating system. POSIX semaphores are used for process-
internal communication because they have a little smaller overhead than private
System V semaphores.
Lock Mutex
Loop infinitely
    Wait on Condition Variable
    if return code is 0
        if signal flag set to true
            if manual reset disabled
                Set signal flag to false
            Unlock Mutex and return true
    else
        Handle error
End Loop
Unlock Mutex
The pseudocode in Listing 30.3 is one of three separate cases that require
distinct handling, depending on the timeout value passed to the Wait() method.
It shows the execution path when passing an infinite timeout value. If we use a
timeout value of zero, then we just test the current signal state and return imme-
diately. In this case, we only execute the first block and skip the loop. If neither
zero nor infinite is used as the timeout value, then we can use the same code as in
the listing with two minor changes. First, we use pthread_cond_timedwait() to
wait on the condition variable with a given timeout. The second adaption is that
our Wait() method has to return false in case pthread_cond_timedwait() re-
turns ETIMEDOUT.
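For illustration, a condensed Pthreads sketch of the infinite-timeout path
follows; the member names are assumptions, not necessarily the framework's:

#include <pthread.h>

// Sketch of an auto/manual-reset event built on a Pthreads condition variable.
class EventSketch
{
public:
    EventSketch(bool manualReset)
        : m_signaled(false), m_manualReset(manualReset)
    {
        pthread_mutex_init(&m_mutex, NULL);
        pthread_cond_init(&m_condition, NULL);
    }

    bool Wait()   // infinite-timeout path only
    {
        pthread_mutex_lock(&m_mutex);
        while (!m_signaled)   // the event may already be signaled on entry
        {
            if (pthread_cond_wait(&m_condition, &m_mutex) != 0)
            {
                pthread_mutex_unlock(&m_mutex);   // handle error
                return (false);
            }
        }
        if (!m_manualReset)
            m_signaled = false;   // an auto-reset event consumes the signal
        pthread_mutex_unlock(&m_mutex);
        return (true);
    }

    void Set()
    {
        pthread_mutex_lock(&m_mutex);
        m_signaled = true;
        pthread_cond_broadcast(&m_condition);
        pthread_mutex_unlock(&m_mutex);
    }

private:
    pthread_mutex_t m_mutex;
    pthread_cond_t m_condition;
    bool m_signaled;
    bool m_manualReset;
};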
The last method, which is rather easy to implement, is used to set an event’s
state to nonsignaled and is called Reset(). All we need to do is to obtain the lock
on our mutex and set the event’s signal state to false.
Thanks to the flexibility of System V semaphores, we can easily rebuild the
functionality of an event using a binary semaphore. Construction and cleanup is
performed exactly as described for the BasicSemaphore class. The value of our
semaphore is directly mapped to the event state—a value of one indicates non-
signaled and a value of zero represents a signaled state. The Set() method is a
simple call to semop() that decrements the semaphore value. We also specify the
IPC_NOWAIT flag to avoid blocking in case the event is already in the signaled
state. If the semaphore value is zero, then the semop() call simply fails with EA-
GAIN, which means that Set() can be called multiple times without any unex-
pected side effects. The Reset() method works almost as simply, but uses two
operations. The first operation checks whether the event is currently in signaled
state using the IPC_NOWAIT flag. In case the semaphore value is zero, the first
operation succeeds and performs the second operation, which increments the val-
ue. In case the event is already reset, the semaphore value is already one, and the
first operation fails with EAGAIN. In this case, the second operation is not execut-
ed, which ensures the coherence of our event status in the semaphore.
The implementation of the Wait() method for our semaphore-based event is
also quite easy. Again, we have three different scenarios, depending on the speci-
fied timeout value. In order to check whether the event is in signaled state, we
use semop() with either one or two operations, depending on whether the event
is a manual reset event. The first operation simply checks whether the semaphore
value equals zero, with or without the IPC_NOWAIT flag (for a zero or an infinite
timeout, respectively). If we are not using a manual reset event, we in-
crement the semaphore value after testing for zero, thus setting the state back to
nonsignaled. This is almost the same trick we used in the Reset() method, again
using the flexibility and power of semop(). Some systems also support a timed
version of semop(), called semtimedop(), which can be used for the remaining
timeout scenario. Our implementation does not use semtimedop() in order to
maintain a high level of portability. Instead, we poll the semaphore at a prede-
fined interval using the nanosleep() function to put the thread to sleep between
attempts. Of course, this is not the best solution and, if semtimedop() is availa-
ble on your platform, you should favor using it.
Wait Objects
Now that we have the basic implementation of our synchronization mechanism,
we have to extend our objects in order to add state recording functionality. To do
so, we introduce a new class called WaitObject that implements state recording
using the wait state array from our ThreadInfo class. This array is a simple inte-
ger array with a constant size, and it belongs to exactly one thread. Because only
the owner thread is allowed to modify the content of the array, we do not have to
use any synchronization during state recording.
The format of a single state record is shown in Table 30.2. The first field
specifies the current state of the object, which can be either STATE_WAIT or
STATE_LOCK. The next two fields specify the time at which the object entered the
current state and, in case of a wait operation, the specified timeout. The object
count field defines how many object IDs are to follow, which might be more than
one in case of a WaitForMultipleObjects() operation. Finally, the last field
specifies the total size of the record in integers, which in most cases is six. You
might wonder why we put the state size field at the end and not at the beginning
of a record. In most cases, we remove a record from the end of our wait
state array, and it therefore makes sense that we start our search from the end of
the integer array, which means the state size field is actually the first element in
our record.
In order to set an object’s state to wait, we use the SetStateWait() function
from our WaitObject class, which simply fills the array with the required infor-
mation. If an object’s state becomes locked, then we need to call the
SetStateLock() method. This method requires a parameter that specifies
whether the object immediately entered the locked state or whether it made a
state transition from a previous wait. In the first case, again we can simply fill the
array with the lock state information. However, in the second case, we have to
find the wait state record of our object in the wait state array, change the state to
locked, and set the timestamp value. Thanks to the state size field, scanning the
array for the right entry can be done very quickly. RemoveStateWait() and
RemoveStateLock() work in a similar fashion. First, we search for the wait or lock
entry of the current object in the array. Then we delete it by setting all fields to
zero. Of course, it might happen that the entry is not the last in the array, in
which case we have to relocate all of the following entries.
WaitObject() also defines four important abstract methods that need to be
implemented by derived classes: Lock(), Unlock(), LockedExtern(), and Un-
lockedExtern(). We provide classes derived from WaitObject for each of our
basic synchronization objects. Each class implements Lock() and Unlock() by
forwarding the call to the underlying object and setting the according wait and
lock states. LockedExtern() and UnlockedExtern() are called if a WaitObject's
internal synchronization object is locked or unlocked outside the class. These
methods are helper functions used by our WaitForMultipleObjects()
implementation in order to keep each WaitObject's state correctly up-
to-date. In order to prevent someone else from messing up our wait states with
these methods, we declared them private and made the WaitList class a friend of
WaitObject.
With the recorded wait state information, our ThreadManager is now able to
construct a complete snapshot of the current program state as in the example
shown in Listing 30.4. The output shows us that we have three different threads.
We start with the first one with the ID 27632, which is our main program thread.
Because our ThreadManager did not create this thread but only attached itself to
it, no thread name is given. We can see that our main thread owns a lock (since
the timestamp is 500 ms) on a wait object with ID 1013 named WaitMutex.
WaitThread is the name of the next thread in the list, and this thread is blocked
in a WaitForMultipleObjects() call. It waits for an event named WaitEvent
with ID 1012, and it waits for the same mutex that our main thread has locked.
We can also see that we have already been waiting for 500 ms and that the speci-
fied timeout value is infinite. Finally, the last thread in the list is called
DumpThread. This is the thread that was used to output the status information, and it
has neither waits nor locks.
BasicMutex mtx;
mtx.Lock();
... // do some work here
if (errorOccured) return (false);
mtx.Unlock();
return (true);
BasicMutex mtx;
Listing 30.6. Example for using a mutex with the LockT helper class
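The body of Listing 30.6 is not reproduced above; the idea behind a scoped
helper like LockT is the usual RAII pattern. A minimal sketch, assuming LockT
simply locks in its constructor and unlocks in its destructor (its real interface
may differ), looks like this:

// Sketch of a scoped-lock helper: the destructor releases the mutex on every
// path out of the scope, including early returns and exceptions.
template <typename MutexType>
class LockT
{
public:
    explicit LockT(MutexType& mutex) : m_mutex(mutex) { m_mutex.Lock(); }
    ~LockT() { m_mutex.Unlock(); }

private:
    MutexType& m_mutex;

    LockT(const LockT&);              // non-copyable
    LockT& operator=(const LockT&);
};

bool DoWork(BasicMutex& mtx, bool errorOccured)
{
    LockT<BasicMutex> lock(mtx);      // locked here
    // ... do some work here ...
    if (errorOccured)
        return (false);               // the mutex is unlocked automatically
    return (true);                    // and here as well
}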
introduce a certain overhead due to the additional wait state recording. Therefore,
you should carefully choose what synchronization object type to use when per-
formance matters.
Figure 30.1. State diagram of the WaitForMultipleObjects() implementation:
Lock() first tries the lock; if the lock fails, the thread waits for a signal or a
timeout; when a signal is received, it tries the lock again; on lock success, it exits.
Of course, the easiest solution is to simply poll our wait object list and try to
lock each object with a zero timeout (nonblocking). If one of our attempts suc-
ceeds, then we can return true immediately. Otherwise, if all attempts fail, then
we put our thread to sleep for a short period. This solution is not the most elegant
one since we keep waking up the thread and polling the wait objects even though
none of their states have changed. Therefore, our implementation uses a different
approach, slightly based on [Nagarajayya and Gupta 2000].
Figure 30.1 shows the different states and the respective transitions in our
WaitForMultipleObjects() implementation. We start off by trying to lock one
of the wait objects in our list. If this step is not successful, then we put our thread
to sleep, waiting on a condition variable. We are using a single global condition
variable that notifies us whenever a synchronization object has changed its state
to signaled. Every synchronization object in our framework sets this variable
when its Unlock() or Reset() method is called. Only if such a state transition
has occurred do we go through our list of wait objects again and try to lock each
one. The advantage of this solution is that it only resumes the thread and
checks whether a lock can be obtained after an object has actually changed to a
signaled state. The overhead of our solution is also relatively small because we
only use one additional condition variable in the whole framework. Of course, it
can still happen that the thread performs unnecessary checks in the case that an
object is unlocked but is not part of the WaitList. Therefore, this function
should be used with caution on UNIX systems.
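A stripped-down sketch of this loop follows. It is illustrative only: it ignores
timeouts, and the WaitObject::Lock() timeout parameter, the global
g_stateMutex/g_stateChanged pair, and the g_stateGeneration counter are all
assumed names rather than the framework's. Every Unlock() or Reset() is assumed
to bump the generation counter and broadcast the condition variable while
holding the mutex.

#include <pthread.h>
#include <vector>

class WaitObject
{
public:
    virtual ~WaitObject() {}
    virtual bool Lock(unsigned int timeoutMs) = 0;   // 0 = non-blocking try
};

extern pthread_mutex_t g_stateMutex;
extern pthread_cond_t g_stateChanged;
extern unsigned int g_stateGeneration;

int WaitForAnyObject(std::vector<WaitObject *>& objects)
{
    pthread_mutex_lock(&g_stateMutex);
    for (;;)
    {
        unsigned int generation = g_stateGeneration;
        pthread_mutex_unlock(&g_stateMutex);

        // Try to lock each object without blocking.
        for (size_t i = 0; i < objects.size(); ++i)
        {
            if (objects[i]->Lock(0))
                return ((int) i);   // index of the object that was acquired
        }

        // Nothing was signaled; sleep until some object reports a change.
        pthread_mutex_lock(&g_stateMutex);
        while (g_stateGeneration == generation)
            pthread_cond_wait(&g_stateChanged, &g_stateMutex);
    }
}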
30.3 Limitations
Now that we have provided an overview of our threading framework and its fea-
tures, it is time to discuss some limitations and potential pitfalls. The first thing
we want to discuss is the possibility of a Mac OS X port. The source code pro-
vided on the website is basically ready to be built on a Mac, but there are a few
things to consider. The first one is that the Mac OS X kernel only supports a total
of 10 System V semaphores, using the SEM_UNDO flag, in the whole system
[Huyler 2003]. This means that if no other application is using a System V sema-
phore, we can have a maximum of three named semaphores, mutexes, or events
with our framework on Mac OS X. A simple solution to this problem is to use
only POSIX semaphores for both named and unnamed synchronization objects
on Mac OS X platforms. However, a problem that is introduced by this fix is that
the named auto-reset event implementation in our framework requires the flexi-
bility of System V’s semop() function. A POSIX implementation could be made
similar to the one for an unnamed event, but it has to keep the signal flag in a
shared memory segment. Another problem we experienced on Mac OS X was
that the sem_init() function always failed with an ENOSYS error. This is because
the POSIX semaphores implementation on Mac OS X only supports named sem-
aphores and does not implement sem_init() [Jew 2004]. Apart from these limi-
tations, the framework should already work perfectly on Mac OS X, without
requiring any code changes.
In this chapter, we also showed how to implement a wait function similar to
the Windows API WaitForMultipleObjects() function. However, our imple-
mentation currently only works with unnamed events. This limitation is caused
by the condition variable used to signal that an object has been unlocked, which
is only visible to a single process. This means that if a named mutex is unlocked
in process A, our wait function in process B won’t be notified of the state change
and will remain blocked. A possible solution to this problem is to use a System V
semaphore to signal a state change across process boundaries. However, this
should be used with care since our processes might end up receiving many notifi-
cations, resulting in a lot of polling in the WaitList. On Windows, you can also
specify whether you want to wait for any or all objects in the WaitList. Our cur-
rent implementation only supports the first case, but adding support for the se-
cond case should be straightforward. Finally, if you are really concerned about
performance, you might want to add a BasicWaitList class that works with the
basic synchronization classes, instead of WaitObject-derived ones.
The last potential pitfall that we want to highlight occurs when using named
mutexes. On Windows, mutexes can be locked by the same thread multiple times
without blocking. With Pthreads, you can specify whether the mutex should be
recursive using the PTHREAD_MUTEX_RECURSIVE type. Unfortunately, our imple-
mentation does not offer this functionality using our System V semaphore with-
out introducing additional overhead to keep track of the current thread and its
lock count. Therefore, if you really require recursive mutexes, then you should
try to avoid using named mutex instances. One more thing left to do when inte-
grating the framework into your own projects is to implement logging. If you
search through the code, you will find some comments containing TRACE state-
ments. These lines should be replaced with your engine’s own logging facilities.
References
[Meyers 1995] Scott Meyers. More Effective C++: 35 New Ways to Improve Your
Programs and Designs. Reading, MA: Addison-Wesley, 1995.
[Sandler 2009] Alexander Sandler. “pthread mutex vs pthread spinlock.” Alex on Linux,
May 17, 2009. Available at http://www.alexonlinux.com/pthread-mutex-vs-
pthread-spinlock.
[Nagarajayya and Gupta 2000] Nagendra Nagarajayya and Alka Gupta. “Porting of
Win32 API WaitFor to Solaris.” September 2000. Available at http://developers.
sun.com/solaris/articles/waitfor_api.pdf.
[Huyler 2003] Christopher Huyler. “SEM_UNDO and SEMUME kernel value issues.”
osdir.com, June 2003. Available at http://osdir.com/ml/macosx.devel/2003-06/
msg00215.html.
[Jew 2004] Matthew Jew. “semaphore not initialized - Question on how to implement.”
FreeRADIUS Mailing List, October 28, 2004. Available at http://lists.cistron.nl/
pipermail/freeradius-devel/2004-October/007620.html.
31
Producer‐Consumer Queues
Matthew Johnson
Advanced Micro Devices, Inc.
31.1 Introduction
The producer-consumer queue is a common multithreaded algorithm for handling
a thread-safe queue with first-in-first-out (FIFO) semantics. The queue may be
bounded, which means the size of the queue is fixed, or it may be unbounded,
which means the size can grow dynamically based on available memory. Finally,
the individual items in the queue may be fixed or variable in size. Typically, the
implementation is derived from a circular array or a linked list data structure. For
simplicity, this chapter describes bounded queues with elements of the same size.
Multithreaded queues are a common occurrence in existing operating system
(OS) APIs. One example of a thread-safe queue is the Win32 message model,
which is the main communication model for applications to process OS and user
events. Figure 31.1 shows a diagram of a single-producer and single-consumer
model. In this model, the OS produces events such as WM_CHAR, WM_MOUSEMOVE,
WM_PAINT, etc., and the Win32 application consumes them.
Producer Consumer
The applications for producer and consumer queues extend beyond input
messaging. Imagine a real-time strategy game where the user produces a list of
tasks such as “build factory,” “move unit here,” “attack with hero,” etc. Each task
is consumed by a separate game logic thread that uses pathfinding knowledge of
the environment to execute each task in parallel. Suppose the particular consumer
thread uses an A* algorithm to move a unit and hits a worst-case performance
path. The producer thread can still queue new tasks for the consumer
without having to stall and wait for the algorithm to finish.
The producer-consumer queue naturally extends to data parallelism. Imagine
an animation engine that uses the CPU to update N animated bone-skin skeletons
using data from static geometry to handle collision detection and response. As
pictured in Figure 31.2, the animation engine can divide the work into several
threads from a thread pool by producing multiple “update bone-skin” tasks, and
the threads can consume the tasks in parallel. When the queue is empty, the ani-
mation thread can continue, perhaps to signal the rendering thread to draw all of
the characters.
In Figure 31.3, only one item at a time is consumed from the queue (since the
queue is read atomically from the tail); however, the actual time the consumer
thread starts or finishes the task may be out of order because the OS thread
scheduler can preempt the consumer thread before useful work begins.
Consumer 2
Consumer 1
Consumer 0
Producer
Producer
Consumer 0
Consumer 1
Consumer 2
Figure 31.3. Sample timeline for producer-consumer model from Figure 31.2.
~Fifo()
{
    delete[] m_items;
}
bool IsEmpty()
{
return (m_tail == m_head);
}
bool IsFull()
{
return (m_tail == m_head + m_maxSize);
}
T Remove()
{
assert(!IsEmpty());
return (m_items[m_head++ % m_maxSize]);
}
private:
T *m_items;
UINT m_head;
UINT m_tail;
enum
{
MaxItemsInQueue = 256
};
ProducerConsumerQueue() : m_queue(MaxItemsInQueue)
{
InitializeCriticalSection(&m_cs);
if (m_queue.IsFull())
{
LeaveCriticalSection(&m_cs);
SwitchToThread();
continue; // Queue full
}
else if (SUCCEEDED(ReleaseSemaphore(m_sm, 1, NULL)))
{
m_queue.Insert(item);
}
LeaveCriticalSection(&m_cs);
break;
}
}
T Remove()
{
T item;
for (;;)
{
if (WAIT_OBJECT_0 != WaitForSingleObject(m_sm, INFINITE))
break;
EnterCriticalSection(&m_cs);
if (!m_queue.IsEmpty())
{
item = m_queue.Remove();
LeaveCriticalSection(&m_cs);
break;
}
else
{
LeaveCriticalSection(&m_cs); // Queue empty
}
}
return (item);
}
DWORD LastError()
{
return (::GetLastError());
}
private:
Fifo<T> m_queue;
CRITICAL_SECTION m_cs;
HANDLE m_sm;
};
When the semaphore count is zero, the consumer(s) enter into a low-
overhead wait state until a producer inserts an item at the head of the queue. Note
that in this implementation, the Insert() function can still return without insert-
ing an item, for example, if the ReleaseSemaphore() function fails. Finally, the
Remove() function can still return without removing an item, for example, if the
WaitForSingleObject() function fails. These failure paths are rare, and the
application can call the LastError() function to get the Win32 error result.
Some designs may want to defer to the application to retry the operation on
the producer side if the queue is full. For example, one may want to return false if
an item can’t be inserted at that moment in time because the queue is full. In the
implementation of a thread-safe queue, one needs to be careful how to proceed if
the queue is empty or full. For example, when removing an object, the code in
Listing 31.2 first requests exclusive access to check if the queue has at least one
item, but backs out and tries again if it has no items to remove. If the check was
done before entering the critical section, a thread context switch could occur and
another consumer could remove the last item right before entering the critical
section, which would result in trying to remove an item from an empty queue!
This particular class has a few limitations. For example, there is no way to
signal to all the consumers to stop waiting for the producer(s) to insert more data.
One way to handle this is by adding a manual reset event that is initially non-
signaled and calling the WaitForMultipleObjects() function on both the stop
event and semaphore, with the bWaitAll set to FALSE. After a SetEvent() call,
all consumers would then wake up, and the return value of the WaitForMulti-
pleObjects() function indicates whether it was the event that signaled it.
Finally, the semaphore object maintains an exact physical count of the ob-
jects, but this is only internal to the OS. To modify the consumer to get the num-
ber of items currently in the queue, we can add a member variable that is
incremented or decremented whenever the queue is locked. However, every time
the count is read, it would only be a logical snapshot at that point in time since
another thread could insert or remove an item from the queue. This might be use-
ful for gauging a running count to see if the producer is producing data too quick-
ly (or the consumer is consuming it too quickly), but the only way to ensure the
count would remain accurate would be to lock the whole queue before checking
the count.
ly do not suffer from deadlocks, starvation, or priority inversion. There is, how-
ever, no guarantee that a lock-free algorithm will outperform one with locks, and
the additional complexity makes it not worth the headache in many cases. Final-
ly, another disadvantage is that the range of algorithms that can be converted to a
lock-free equivalent is limited. Luckily, queues are not among them.
One of the challenges when accessing variables shared among threads is the
reordering of read and write operations, and this can break many multithreaded
algorithms. Both the compiler and processor can reorder reads and writes. Using
the volatile keyword does not necessarily fix this problem, since there are no
guarantees in the C++ standard that a memory barrier is created between instruc-
tions. A memory barrier is either a compiler intrinsic or a CPU instruction that
prevents reads or writes from occurring out of order across the execution point. A
full barrier is a barrier that operates on the compiler level and the instruction
(CPU) level. Unfortunately, the volatile keyword was designed for memory-
mapped I/O where access is serialized at the hardware level, not for modern mul-
ticore architectures with shared memory caches.
In Microsoft Visual C++, one can prevent the compiler from reordering
instructions by using the _ReadBarrier(), _WriteBarrier(), or
_ReadWriteBarrier() compiler intrinsics. In Microsoft Visual C++ 2003 and
beyond, volatile variables act as a compiler memory barrier as well. Finally, the
MemoryBarrier() macro can be used for inserting a CPU memory barrier.
Interlocked instructions such as InterlockedIncrement() or
InterlockedCompareExchange() also act as full barriers.
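For instance (a sketch, not the chapter's code), publishing data to another
thread with these intrinsics might look like this:

#include <windows.h>
#include <intrin.h>

// Sketch: make the payload visible before the index that announces it.
int g_payload[16];
volatile LONG g_readyIndex = -1;

void Publish(int slot, int value)
{
    g_payload[slot] = value;   // write the data first
    _WriteBarrier();           // compiler barrier: keep the stores in order
    MemoryBarrier();           // CPU barrier, for architectures that need it
    g_readyIndex = slot;       // now announce the slot to the consumer
}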
On x86 and x64 multiprocessor architectures, reads are not reordered relative to
reads, and writes are not reordered relative to writes. However, reads may be
reordered with a previous write if the read accesses a different memory address.
During write buffering, it is possible that the order of writes across all pro-
cessors appears to be committed to memory out of order to a local processor. For
example, if processor A writes A.0, A.1, and A.2 to three different memory loca-
tions and processor B writes B.3, B.4, and B.5 to three other memory locations in
parallel, the actual order of the writes to memory could appear to be written or-
dered A.0, A.1, B.3, B.4, A.2, B.5. Note that neither the A.0, A.1, and A.2 writes
nor the B.3, B.4, and B.5 writes are reordered locally, but they may be interleav-
ed with the other writes when committed to memory.
Since a read can be reordered with a write in certain circumstances, and
writes can be done in parallel on multiprocessor machines, the default memory
model breaks for certain multithreaded algorithms. Due to these memory model
intricacies, certain lock-free algorithms could fail without a proper CPU memory
barrier such as Dekker’s mutual exclusion algorithm, shown in Listing 31.3.
void threadA()
{
    a = true;
    while (b)
    {
        if (turn)
        {
            a = false;
            while (turn) { /* spin */ };
            a = true;
        }
    }
    // critical section
    turn = true;
    a = false;
}
void threadB()
{
    b = true;
    while (a)
    {
        if (!turn)
        {
            b = false;
            while (!turn) { /* spin */ };
            b = true;
        }
    }
    // critical section
    turn = false;
    b = false;
}
Processor A Processor B
mov dword ptr [a], 1 mov dword ptr [b], 1
cmp dword ptr [b], 0 cmp dword ptr [a], 0
Figure 31.4. On modern x86 or x64 processors, a read instruction can be reordered with a
previous write instruction if they are addressing different memory locations.
struct ProcessorSupport
{
    ProcessorSupport()
    {
        int cpuInfo[4] = {0};
        __cpuid(cpuInfo, 1);

        // CPUID leaf 1: EDX bit 8 reports CMPXCHG8B support, and ECX bit 13
        // reports CMPXCHG16B support.
        supportsCmpXchg64 = (cpuInfo[3] & (1 << 8)) != 0;
        supportsCmpXchg128 = (cpuInfo[2] & (1 << 13)) != 0;
    }

    bool supportsCmpXchg64;
    bool supportsCmpXchg128;
};

ProcessorSupport processorSupport;

if (processorSupport.supportsCmpXchg64) { /*...*/ }
This chapter uses the 32-bit CAS and 64-bit CAS2 in its design. The ad-
vantage of using a CAS operation is that one can ensure, atomically, that the
memory value being updated is exactly as expected before updating it. This pro-
motes data integrity. Unfortunately, using the CAS operation alone in the design
of a lock-free stack and queue isn’t enough. Many lock-free algorithms suffer
from an “ABA” problem, where an operation on another thread can modify the
state of the stack or queue but not appear visible to the original thread, since the
original item appears to be the same.
This problem can be solved using a version tag (also known as a reference
counter) that is automatically incremented and stored atomically with the item
when a CAS operation is performed. This makes the ABA problem extremely
unlikely (although technically possible, since the version tag could eventually
wrap around). The implementation requires the use of a CAS2 primitive to re-
serve room for an additional version tag.
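The chapter's CAS and CAS2 primitives themselves are not reproduced in this
excerpt. On Win32 they are typically thin wrappers over the interlocked
intrinsics; the sketch below is an assumption, and its parameter order
(destination, expected value, new value) may differ from the chapter's
convention:

#include <windows.h>

// Hedged sketch of 32-bit CAS and 64-bit CAS2. Both return true if the
// destination held the expected value and was atomically replaced.
inline bool CAS(volatile UINT32 *dest, UINT32 expected, UINT32 value)
{
    return (InterlockedCompareExchange((volatile LONG *) dest,
        (LONG) value, (LONG) expected) == (LONG) expected);
}

inline bool CAS2(volatile UINT64 *dest, UINT64 expected, UINT64 value)
{
    return (InterlockedCompareExchange64((volatile LONGLONG *) dest,
        (LONGLONG) value, (LONGLONG) expected) == (LONGLONG) expected);
}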
union AllocRef
{
    AllocRef(volatile AllocRef& a)
    {
        val = a.val;
    }

    struct
    {
        UINT64 arrayIndex : 20;  // index into the item array
        UINT64 stackIndex : 20;  // index into the free-list stack
        UINT64 version : 24;     // version tag guarding against the ABA problem
    };

    UINT64 val;                  // all 64 bits at once, for CAS2
};
date the CAS2 primitive. In addition, it includes a version tag to prevent the ABA
problem.
The Allocator class in Listing 31.7 implements a free list, which is an algo-
rithm that manages a preallocated pool of memory. The code uses an array-based
lock-free implementation of a stack algorithm by Shafiei [2009]. Since this algo-
rithm also requires a stack index, we pack the stack index, stack value (array in-
dex), and version number in one 64-bit value. This limits the maximum number
of items in the stack and queue to 1,048,575 entries (2^20 - 1) and the version
number to 24 bits. The 20-bit index 0x000FFFFF is reserved as a null index.
m_pFreeList[0].arrayIndex = AllocRef::NullIndex;
m_pFreeList[0].stackIndex = 0;
m_pFreeList[0].version = 0;
Allocator::~Allocator()
{
delete[] m_pFreeList;
}
private:
UINT32 Allocator::Alloc()
{
for (;;)
{
AllocRef top = m_top;
AllocRef stackTop = m_pFreeList[top.stackIndex];
CAS2(&m_pFreeList[top.stackIndex].val,
AllocRef(stackTop.arrayIndex, stackTop.stackIndex,
top.version - 1).val,
AllocRef(top.arrayIndex, stackTop.stackIndex,
top.version).val);
return (top.arrayIndex);
}
}
}
CAS2(&m_pFreeList[top.stackIndex].val,
AllocRef(stackTop.arrayIndex, stackTop.stackIndex,
top.version - 1).val,
AllocRef(top.arrayIndex, stackTop.stackIndex,
top.version).val);
- The allocator atomically retrieves a free index from the free pool.
- This index points to an unused entry in the item array.
- The item is copied into the array at the index.
- The index is inserted atomically at the queue's tail and is now visible.
- The index will remain unavailable until a consumer atomically removes it.
- When that happens, the index will be placed back on the free pool.
// Array of Items
m_pItems = new T[maxItemsInQueue];
assert(m_pItems != NULL);
}
~Queue()
{
delete[] m_pItems;
delete[] m_pQueue;
}
private:
do
{
// Obtain free index from free list.
index = m_allocator.Alloc();
} while (index == AllocRef::NullIndex); // Spin until free index.
m_pItems[index] = item;
if (alloc.arrayIndex == AllocRef::NullIndex)
{
if (CAS2(&m_pQueue[tail % m_maxQueueSize].val, alloc.val,
AllocRef(index, alloc.version + 1).val))
{
CAS(&m_tail, tail, tail+1);
return;
}
}
else if (m_pQueue[tail % m_maxQueueSize].arrayIndex !=
AllocRef::NullIndex)
{
CAS(&m_tail, tail, tail+1);
}
}
}
if (head == m_tail)
{
if (m_pQueue[tail % m_maxQueueSize].arrayIndex ==
AllocRef::NullIndex)
if (tail == m_tail) continue; // Queue is empty.
if (alloc.arrayIndex != AllocRef::NullIndex)
{
if (CAS2(&m_pQueue[head % m_maxQueueSize].val, alloc.val,
AllocRef(AllocRef::NullIndex, alloc.version + 1).val))
{
CAS(&m_head, head, head+1);
T item = m_pItems[alloc.arrayIndex];
When inserting an item, both the stack and queue continue spinning until
they are not full. When removing an item, both the stack and queue continue
spinning until they are not empty. To modify these methods to return an error
code instead, replace continue with return at the appropriate locations. This is
a better strategy when data is being inserted and removed from the queue sporad-
ically, since the application can decide how best to wait. For example, the
application can call SwitchToThread() to allow the OS thread scheduler to yield
execution to another thread instead of pegging the processor at 100% with un-
necessary spinning.
Contributor Biographies
Rémi Arnaud
remi@acm.org
Rémi Arnaud is Chief Software Architect at Screampoint International, working on interoperable 5D digital models. Rémi's involvement with real-time graphics started at Thales Simulation, where he designed the Space Magic real-time visual system and finalized his PhD. In 1996, Rémi relocated to California to join Silicon Graphics's IRIS Performer team before co-founding Intrinsic Graphics, an advanced technology cross-platform middleware company for PlayStation 2, Xbox, GameCube, and PC. In 2003, he joined Sony Computer Entertainment as Graphics Architect, working on the PlayStation 3 SDK, and joined the Khronos Group, creating the COLLADA standard. More recently, Rémi was building a game technology team for the Larrabee project at Intel.
Wessam Bahnassi
wbahnassi@inframez.com
Wessam Bahnassi is a software engineer with a background in building architecture. This combination is believed to be the reason behind Wessam's passion for game engine design. He has written and dealt with a variety of engines throughout a decade of game development. Currently, he is in charge of animation and rendering engineering at Electronic Arts Inc., and he is a supervisor of the Arabic Game Developer Network.
Tolga Çapın
tcapin@cs.bilkent.edu.tr
Tolga Çapın is an assistant professor in the Department of Computer Engineering at Bilkent University. Before joining Bilkent, he worked at the Nokia Research Center as a Principal Scientist. He has contributed to various mobile graphics standards.
Patrick Cozzi
pjcozzi@siggraph.org
Patrick Cozzi is a senior software developer on the 3D team at Analytical Graphics, Inc. (AGI) and coauthor of the book 3D Engine Design for Virtual Globes. His interests include real-time rendering, GPU programming and architecture, OpenGL, and software architecture. Patrick has an MS in Computer and Information Science from the University of Pennsylvania and a BS in Computer Science from Penn State. He is also a contributor to SIGGRAPH. Before joining AGI in 2004, he participated in IBM's Extreme Blue internship program at the Almaden Research Lab, interned with IBM's z/VM operating system team, and interned with the chipset validation group at Intel.
Michał Drobot
hello@drobot.org
Michał Drobot has been working in game development for five years. He started as a Technical Artist, switching later to Effect Programmer and Visual Technical Director at Reality Pump. Currently, he is creating some cutting-edge rendering technology as Senior Tech Programmer at Guerrilla Games. In his spare time, Michał frequently speaks on the topic of game development and graphics programming, having presented at GDC, GCDC, and SFI, and having given a series of lectures at universities. Moreover, he is the author of several articles on real-time rendering published in books and magazines. He is also the main consultant and an active teacher at EGA, the European Games Academy in Krakow, Poland. He loves eating pixels for breakfast.
Richard Egli
richard.egli@usherbrooke.ca
Richard Egli has been a professor in the Department of Computer Science at the University of Sherbrooke since 2000. He received his BSc in Computer Science and his MSc in Computer Science at the University of Sherbrooke. He received his PhD in Computer Science from the University of Montréal in 2000. He is the director of the centre MOIVRE (MOdélisation en Imagerie, Vision et RÉseaux de neurones). His research interests include computer graphics, physical simulations, and digital image processing.
Martin Fleisz
martin.fleisz@kabsi.at
Martin Fleisz received an MS degree in Computer Science from the University of Edinburgh, Scotland, and a BS degree in Computer Science from the University of Derby, England. He has worked as a professional C++ developer for more than six years. Martin spends most of his spare time enlarging his knowledge about game and graphics related programming techniques. In his remaining time, he plays drums in a band, tries to learn to play the guitar, and sometimes enjoys playing a video game.
Simon Franco
simon_franco@hotmail.com
Simon Franco began his love of computer programming shortly after receiving a Commodore Amiga, and he has been coding ever since. He joined the games industry in 2000 after completing a degree in Computer Science. He started at The Creative Assembly in 2004, where he has been to this day. During his spare time, Simon can be found playing the latest PC strategy game or writing assembly code for the ZX Spectrum. His baby girl Astrid was recently born.
Marco Fratarcangeli
marco@fratarcangeli.net
Marco Fratarcangeli is a Senior Software Engineer at Taitus Software Italia, developing cross-platform visual tools for the representation of space mission data like planet rendering and information visualization. In 2009, he obtained a PhD in Computer Engineering from University of Rome Sapienza, Italy, jointly with the Institute of Technology of Linköping, Sweden. During his academic activities, Marco researched mainly novel methods for automatic rigging of facial animation through physically-based animation and motion retargeting. His earliest memories are of programming Basic on the ZX Spectrum 48K.
Holger Grün
holger.gruen@amd.com
Holger Grün ventured into 3D real-time graphics right after university, writing fast software rasterizers in 1993. Since then, he has held research and development positions in the middleware, games, and simulation industries. He began working in developer relations in 2005 and now works for AMD's product group. Holger, his wife, and his four kids live in Germany, close to Munich and near the Alps.
Julien Hamaide
julien.hamaide@fishingcactus.com
Julien Hamaide graduated as a multimedia electrical engineer at the Faculté Polytechnique de Mons, Belgium. After two years working on speech and image processing at TCTS/Multitel and three years leading a team on next-generation consoles at Elsewhere Entertainment (www.elsewhereentertainment.com), Julien started his own studio called Fishing Cactus (www.fishingcactus.com) with three associates. He has published several articles in the Game Programming Gems series and in AI Programming Wisdom 4. He spoke about 10Tacle's movement and interaction system at the Game Developers Conference in 2008 and about multithreading applied to AI in 2009. His experience revolves around multithreaded and scripted systems, mostly applied to AI systems. He is now leading development at Fishing Cactus.
Daniel Higgins
dan@lunchtimestudios.com
Over ten years ago, Dan Higgins began his career in games at Stainless Steel Studios, where he was one of the original creators of the Titan game engine. As one of the chief AI programmers on Empire Earth, Empires: Dawn of the Modern World, and Rise & Fall: Civilizations at War, he spent years designing and innovating practical AI solutions for difficult problems. Later, he enjoyed working for Tilted Mill Entertainment on Caesar IV and SimCity Societies.
Dan's coding domain extends well beyond AI, as he enjoys all aspects of game engine development and is often called on for his optimization skills both inside and outside of the games industry. Today, along with his wife, he is owner and manager of Lunchtime Studios, LLC.
Jason Hughes
jhughes@steelpennygames.com
Jason Hughes is an industry veteran game programmer of 16 years and has been actively coding for a decade longer. His background runs the gamut from modem drivers in 6502 assembly and fluid dynamics on the Wii to a proprietary multiplatform 3D engine and tools suite. When not working as a hired gun for game developers, Jason tinkers with exotic data structures, advanced compression algorithms, and various tools and technology relating to the games industry. Prior to founding Steel Penny Games, he spent several years at Naughty Dog, developing pipeline tools and technology for the ICE and Edge libraries on the PlayStation 3.
Matthew Johnson
matt.johnson@amd.com
Matthew Johnson is a Software Engineer at Advanced Micro Devices, Inc. with over 12 years of experience in the computer industry. He wrote his first game as a hobby in Z80 assembly language for the TI-86 graphic calculator. Today, he is a member of the DirectX 11 driver team and actively involved in developing software for future GPUs. Matthew currently lives in Orlando, Florida with his lovely wife.
Linus Källberg
linus.kallberg@mdh.se
Linus Källberg is a teacher in Computer Science at Mälardalen University in Sweden. He has previously worked with program analysis as a research engineer on the ALL-TIMES project. His MSc in Computer Science was completed in March 2010.
Frank Kane
fkane@sundog-soft.com
Frank Kane is the owner of Sundog Software, LLC, makers of the SilverLining SDK for real-time rendering of skies, clouds, and precipitation effects (see http://www.sundog-soft.com/ for more information). Frank's game development experience began at Sierra On-Line, where he worked on the system-level software of a dozen classic adventure game titles including Phantasmagoria, Gabriel Knight II, Police Quest: SWAT, and Quest for Glory V. He's also an alumnus of Looking Glass Studios, where he helped develop Flight Unlimited III. Frank developed the C2Engine scene-rendering
Manny Ko
mannyk90@yahoo.com
Manny Ko is currently working in the Rendering Group for DreamWorks Animation. Prior to that, he worked for Naughty Dog as a member of the ICE team, where he worked on next-generation lighting and GPU technologies.
Balor Knight
Balor.Knight@Disney.com
Balor Knight is a Senior Principal Programmer at Black Rock Studio in Brighton, UK. He has been writing games for over 20 years, having started on the ZX Spectrum in the late 1980s. Since then, he has worked on many titles, including Re-Volt, the Moto GP series, Pure, and most recently, Split/Second. His main area of expertise lies in console game rendering engines.
Thomas Larsson
thomas.larsson@mdh.se
Thomas Larsson is an Assistant Professor at Mälardalen University in Sweden, where he has been teaching Computer Science since 1996. His PhD thesis about efficient intersection queries was completed in January 2009. Currently, he gives courses in C/C++, multimedia, and computer graphics.
Eric Lengyel
lengyel@terathon.com
Eric Lengyel is a veteran of the computer games industry with over 16 years of experience writing game engines. He has a PhD in Computer Science from the University of California, Davis, and he has an MS in Mathematics from Virginia Tech. Eric is the founder of Terathon Software, where he currently leads ongoing development of the C4 Engine.
Eric was the Lead Programmer for Quest for Glory V at Sierra Online, he worked on the OpenGL team for Apple, and he was a member of the Advanced Technology Group at Naughty Dog, where he designed graphics driver software used on the PlayStation 3. Eric is the author of the bestselling book Mathematics for 3D Game Programming and Computer Graphics and several chapters in other books including the Game Programming Gems series. His articles have also been published in the Journal of Game Development, in the journal of graphics tools, and on Gamasutra.com.
Noel Llopis
noel@snappytouch.com
Noel Llopis is following his lifelong dream of being an indie developer. He founded Snappy Touch to focus exclusively on iPhone development and released Flower Garden and Lorax Garden. He writes about game development regularly, from a monthly column in Game Developer Magazine to the Game Programming Gems series or his book C++ for Game Programmers. Some of his past games include The Bourne Conspiracy, Darkwatch, and the MechAssault series. He earned an MS in Computer Science from the University of North Carolina at Chapel Hill and a BS in Computer Systems Engineering from the University of Massachusetts Amherst.
Curtiss Murphy
cmmurphy@alionscience.com
Curtiss Murphy is a Senior Project Engineer at Alion Science and Technology. He manages the game-based training and 3D visualization development efforts for the Norfolk-based AMSTO Operation of Alion. He is responsible for the serious game efforts for a variety of Marine, Navy, Air Force, and Joint DoD customers. He is an author and frequent speaker at conferences and leads the game development team that created the award-winning Damage Control Trainer. He has been developing and managing software projects for 18 years and currently works in Norfolk, VA. Curtiss holds a BS in Computer Science from Virginia Tech.
George Parrish
George.Parrish@Disney.com
George Parrish is a Senior Engine Programmer at Black Rock Studio. George started programming on his Commodore 64 when he was 13. He then went on to a career in IT, developing network servers and web applications for clients such as the National Health Service and the Ministry of Defence. He finally entered the games industry in 2002 and has worked on a number of games, including Pure and Split/Second.
Michael Ramsey
mike@ramseyresearch.com
Mike Ramsey is the principal programmer on the GLR AI Engine. Mike has developed core technologies for the Xbox 360, PC, and Wii at various companies. He has also shipped a variety of games, including World of Zoo (PC and Wii), Men of Valor (Xbox and PC), Master of the Empire, several Zoo Tycoon 2 products, and other titles. Mike has contributed multiple articles to both the Game Programming Gems and AI Game Programming Wisdom series, and he has presented at the AIIDE conference at Stanford on uniform spatial representations for dynamic environments. Mike has a BS in Computer Science from Metropolitan State College of Denver, and his publications can be found at http://www.masterempire.com/. He also has a forthcoming book entitled A Practical Cognitive Engine for AI. When Mike isn't working, he enjoys playing speedminton, drinking mochas, and having thought-provoking discussions with his fantastic wife and daughter, Denise and Gwynn!
Matthew Ritchie
Matthew.Ritchie@Disney.com
Matthew Ritchie is a Graphics Engine Programmer at Black Rock Studio. He joined the games industry when he was 18. Since then, he has worked in the areas of networking, sound, and tools, but he currently specializes in graphics. His most recent titles are the MotoGP 07 series and Split/Second.
Sébastien Schertenleib
sscherten@bluewin.ch
Sébastien Schertenleib has been involved in academic research projects creating 3D mixed-reality systems using stereoscopic visualization while completing his PhD in Computer Graphics at the Swiss Institute of Technology in Lausanne. In 2006, he joined Sony Computer Entertainment Europe R&D, and since then, he has been trying to support game developers on all PlayStation platforms by providing technical training, presenting at various games conferences, and working directly with game developers via on-site technical visits and code sharing.
Olivier Vaillancourt
olivier.vaillancourt@usherbrooke.ca
Olivier Vaillancourt is an MSc student in the Department of Computer Science at the University of Sherbrooke. His research interests include computer graphics and scientific visualization. He is a lecturer for undergraduate students in interaction design and computer graphics for the University of Sherbrooke. He has contributed to multiple handheld game productions as a software engineer at Artificial Mind and Movement, where he worked on high-profile licenses such as The Sims, Lord of the Rings, and Ironman. When he's not working on his LCD tan, he's spending time on various home improvement projects and enjoying the Canadian great outdoors.
Jon Watte
jwatte@gmail.com
Jon Watte started programming at the age of 10 on a Z80-based computer named "ABC-80." After evolving through a variety of micro- and mini-computer platforms through the 1980s and attending the Master of Computer Science program at KTH, Stockholm, Sweden, Jon moved from the arctic winter to the sweltering summer of Austin, Texas, where he worked on CodeWarrior products for game consoles and the alternative BeOS operating system.
One thing led to another, and Jon combined his love of music and video with his love of systems programming by taking the reins of the media group at Be, building a foundation for low-latency, high-throughput multiprocessing computing on desktop computers.
After Be was successfully sold to Palm, Jon moved to help build avatar, networking, and simulation technology for the There.com virtual world, a job that then led to a position as CTO of enterprise virtual world provider Forterra Systems, where he contributed to the future of virtual worlds for entertainment and business.
Currently, Jon is enjoying adding some exciting new technology to power the spectacular growth of the connected entertainment company IMVU, Inc., which is powered by an amazing collection of user-generated 2D and 3D content.
Jon also moderates online forums on game development and networking for Microsoft XNA Creators and the independent game developer site GameDev.net, and he has published both code and articles that help power several successful games, both independent and published.
M. Adil Yalçın
yalcin@umd.edu
M. Adil Yalçın is a PhD student in the Department of Computer Science at the University of Maryland, College Park. He received his MS in Computer Engineering at Bilkent University, Ankara, Turkey, in June 2010 with a thesis focused on deformations of height field structures. If something is about computer graphics and programming, more specifically real-time shading techniques, graphic engines, GPU programming, or physically based animations, it will probably catch his attention. He tries to experiment with new stuff and organize most of his studies under an open-source modern rendering engine development effort.