AMD Developer Central

Doom on AMD: Experiment with Multi-threaded Game Development 

Rather than creating yet another Hello World app or dull accounting software to illustrate some basic concepts, why not use a real product to put these tricks into play and have some fun while you're at it? For example, the gaming classic "Doom" runs on source code that's now become highly accessible, yet is sophisticated enough to let you try out new programming techniques. The source code is free, and has been for almost 10 years now. Developing for multi-core processor architectures doesn't have to hurt.
Justin Whitney  3/5/2009 

Aching for a playground for your burgeoning expertise at multi-threading on the AMD platform? What you need is a fun, graphics-heavy game with plenty of optimization opportunity but not so much code that it's overwhelming. Free source code would be nice, too. Here's just the thing: Doom.

Classic Doom started a revolution in gaming in 1993. Even the source code release for this antique first-person shooter celebrated its 10-year anniversary in 2007. After first being released under a proprietary license, the Doom source was re-released in 1999 under the GNU General Public License and has since been ported to dozens of platforms from Windows to Nokia handhelds to Xbox.

The wide availability of different incarnations makes the classic Doom engine an ideal testbed for experimenting with optimization techniques for AMD's multi-core platforms. The examples used here come from the original Linux code release (watch out for pop-ups). But a quick perusal of other ports should give you whatever source you'd prefer to use. Note: this article also covers relevant differences for Windows environments.

AMD recommends several different techniques for coding to two distinct aspects of AMD processor technology: multiple cores and 64-bitness. Currently, processors come in single-core, triple-core and quad-core flavors, some of which are available in 2P or larger platforms (mostly for servers). Learning to code to an unknown and potentially large number of processors is necessary for building scalable apps.

Multi-Threading

Coding to multi-core comes down to one key concept: multi-threading. By both architecting your app to be highly threadable and coding for n-threads, your app will be prepared to run on any number of cores. Threading in C++ isn't quite as easy as with managed code because Threading classes aren't included the same way as they are in, for instance, C# and Java libraries. Also, how you code for threading will depend on your OS and its threading capabilities. AMD Dev Central has copious resources for coding to both Linux and Windows platforms.

The main thing to understand is that you have two primary techniques for multi-threading: functional threading and data-parallel threading. When coding games for Windows, you also have the option of running DirectX on a separate thread. But since that's platform-specific and doesn't scale beyond two threads anyway, it won't be covered here. Look for future articles on DX11 which expands threading capabilities.

Functional Threading

With functional threading, you split off threads based on function. For example, render textures while calculating physics while generating particles while playing sound while managing the network connection. With foresight and advanced planning, you can architect your game to spin off each of these functions into separate threads.

For instance, take a look at the following snippet, taken from p_pspr.c:

void P_FireWeapon (player_t* player)
{
    statenum_t newstate;

    if (!P_CheckAmmo (player))
        return;

    P_SetMobjState (player->mo, S_PLAY_ATK1);
    newstate = weaponinfo[player->readyweapon].atkstate;
    P_SetPsprite (player, ps_weapon, newstate);
    P_NoiseAlert (player->mo, player->mo);
}

P_NoiseAlert checks mobs in the vicinity to see if they heard you firing your weapon. This can be spun off onto a separate thread while redrawing (P_SetPsprite) and at the same time setting mob state (P_SetMobjState). Though the latter may happen so fast that it's not worth the overhead of creating a new thread, you get the idea—separate functions can happen concurrently, if they don't depend on one another.
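As a rough illustration of the idea, here is a minimal sketch using POSIX threads. Everything in it is a hypothetical stand-in rather than Doom code: noise_alert_task plays the role of P_NoiseAlert, update_weapon_sprite plays the role of the sprite update, and noise_job_t is an invented type that holds the data the worker thread owns.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the data a noise alert would touch. */
typedef struct {
    int monsters_in_range;   /* input, set before the thread starts */
    int monsters_alerted;    /* output, written only by the worker thread */
} noise_job_t;

/* Hypothetical stand-in for P_NoiseAlert: runs on its own thread. */
void *noise_alert_task(void *arg)
{
    noise_job_t *job = (noise_job_t *)arg;
    job->monsters_alerted = job->monsters_in_range;
    return NULL;
}

/* Hypothetical stand-in for the sprite update done on the calling thread. */
int update_weapon_sprite(int current_state)
{
    return current_state + 1;
}

/* Alert monsters on a worker thread while the caller updates the sprite,
   then join before anyone reads the worker's results.
   Returns the new sprite state, or -1 if the thread could not be created. */
int fire_weapon(noise_job_t *job, int sprite_state)
{
    pthread_t worker;
    int new_state;

    assert(job != NULL);
    if (pthread_create(&worker, NULL, noise_alert_task, job) != 0)
        return -1;

    new_state = update_weapon_sprite(sprite_state);  /* runs concurrently */

    pthread_join(worker, NULL);  /* results in *job are now safe to read */
    return new_state;
}
```

The two pieces of work share no data, which is what makes this safe. With gcc, build with the -pthread option. Creating a thread per shot is likely to cost more than it saves, so a real game would probably hand the job to a long-lived worker instead.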

Take a look at the full Doom source code, though, and you'll see your first obstacle. This code was written years before threading was standardized, much less widely practiced. You may see a few easy-to-grab opportunities for functional threading, but to do it right, you'd most likely have to rewrite the code extensively.

Another consideration: this technique may work well right now for dual-core. And it may even work for quad-core if you can find enough functions to run concurrently. But you're limiting yourself by hard-coding the number of threads you run. And it's very unlikely that the workloads on the different threads will be well balanced, so the cores won't be fully utilized. A more robust option would be to leave that question unanswered and let the OS decide how many threads to kick off. And the way to do that is with data-parallel threads.

Data-Parallel Threading

Data-parallel threading splits up the same functionality across multiple threads, so that each thread handles the same task with a different set of data. It’s easy to see how data-parallel threading can potentially scale up and utilize many processor cores. Balancing the workload on the various threads is usually straightforward. In AMD multi-core processors, each core has independent L1 and L2 caches. This allows each core to handle its own data stream concurrently, which improves performance.

You also have more opportunity for letting the code expand to more than two, or four, cores. You can query the OS for the number of available cores and factor that into your routines. (Note: you should always ask the OS, not the hardware. Depending on system configuration, virtualization, and other factors, the answer may not be the same.)
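One way to ask the OS is sketched below. The helper names available_cores and worker_count are made up for this example. On Linux and most other Unix-like systems, sysconf(_SC_NPROCESSORS_ONLN) reports the processors currently online (it is a widely supported extension rather than a strict POSIX requirement); on Windows, GetSystemInfo fills in dwNumberOfProcessors.

```c
#include <assert.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif

/* Ask the operating system how many logical processors are available.
   Falls back to 1 if the OS cannot say. */
int available_cores(void)
{
#ifdef _WIN32
    SYSTEM_INFO info;
    GetSystemInfo(&info);
    return info.dwNumberOfProcessors > 0 ? (int)info.dwNumberOfProcessors : 1;
#else
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
#endif
}

/* Pick a worker count for a loop with `iterations` independent steps:
   never more threads than cores, never more threads than work items. */
int worker_count(int iterations)
{
    int cores = available_cores();

    assert(iterations >= 0);
    if (iterations < 1)
        return 1;
    return iterations < cores ? iterations : cores;
}
```

Capping the count at the number of work items avoids starting threads that would have nothing to do.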

Here's one example. The following code handles the BFG explosion, from p_pspr.c:

// A_BFGSpray
// Spawn a BFG explosion on every monster in view
//
void A_BFGSpray (mobj_t* mo)
{
    int i;
    int j;
    int damage;
    angle_t an;

    // offset angles from its attack angle
    for (i=0 ; i<40 ; i++)
    {
        an = mo->angle - ANG90/2 + ANG90/40*i;
        // mo->target is the originator (player)
        // of the missile
        P_AimLineAttack (mo->target, an, 16*64*FRACUNIT);

        if (!linetarget)
            continue;

        P_SpawnMobj (linetarget->x,
                     linetarget->y,
                     linetarget->z + (linetarget->height>>2),
                     MT_EXTRABFG);

        damage = 0;
        for (j=0;j<15;j++)
            damage += (P_Random()&7) + 1;

        P_DamageMobj (linetarget, mo->target,mo->target, damage);
    }
}

This would be a great place to spin off multiple threads. Here, the code loops through the angle of attack, looking for targets and applying the MT_EXTRABFG effect to their sorry arses. You could split the for-loop into multiple wedges, factoring in the number of cores available. This requires that the code in the loop is "thread safe", which in this case it isn't. For example, the variable "linetarget" needs either to be made local, to be invariant across threads, or to be protected with a lock. You could also, with a bit of rewriting, set each mob redraw on its own thread, a technique that applies to the game as a whole.
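Here is one way the wedge idea could look with POSIX threads. It is only a sketch: aim_ray is a hypothetical stand-in for P_AimLineAttack that hands back its target as a return value instead of writing the global linetarget, and wedge_t, spray_wedge, spray_parallel and MAX_WORKERS are names invented for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SPRAY_RAYS  40   /* same iteration count as the loop above */
#define MAX_WORKERS 16   /* arbitrary cap for this sketch */

typedef struct {
    int begin, end;   /* half-open range of ray indices for this wedge */
    int hits;         /* per-thread result; no other thread writes it */
} wedge_t;

/* Hypothetical stand-in for P_AimLineAttack: returns a target id,
   or -1 for a miss, rather than setting a global. */
int aim_ray(int ray)
{
    return (ray % 4 == 0) ? ray : -1;
}

/* Thread entry point: process one wedge of rays. */
void *spray_wedge(void *arg)
{
    wedge_t *w = (wedge_t *)arg;
    int ray;

    for (ray = w->begin; ray < w->end; ray++) {
        int target = aim_ray(ray);   /* local, so threads cannot clash */
        if (target >= 0)
            w->hits++;
    }
    return NULL;
}

/* Split the rays into `workers` nearly equal wedges, run them in
   parallel, and return the total number of hits. */
int spray_parallel(int workers)
{
    pthread_t tid[MAX_WORKERS];
    wedge_t wedge[MAX_WORKERS];
    int created[MAX_WORKERS];
    int total = 0, i;

    assert(workers >= 1 && workers <= MAX_WORKERS);
    for (i = 0; i < workers; i++) {
        wedge[i].begin = SPRAY_RAYS * i / workers;
        wedge[i].end   = SPRAY_RAYS * (i + 1) / workers;
        wedge[i].hits  = 0;
        created[i] = (pthread_create(&tid[i], NULL, spray_wedge, &wedge[i]) == 0);
        if (!created[i])
            spray_wedge(&wedge[i]);   /* fall back to running it inline */
    }
    for (i = 0; i < workers; i++) {
        if (created[i])
            pthread_join(tid[i], NULL);
        total += wedge[i].hits;
    }
    return total;
}
```

The result is the same for any worker count, which is a handy sanity check when you parallelize a loop. In the real routine, P_Random, P_SpawnMobj and P_DamageMobj also touch shared game state, so they would need the same treatment as linetarget before the loop body could run on several threads.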

Multi-Threading with OpenMP

In 1998, the OpenMP Architecture Review Board published the C/C++ OpenMP standard for multi-processing. Now that multi-core processors exist, developers have the perfect opportunity to implement some surprisingly simple, yet surprisingly powerful, syntax to help with parallel threading.

Supported by Visual Studio 2005 and Visual Studio 2008 and other IDEs, OpenMP uses a "fork and join" model, in which a single thread executes until it reaches a fork. At that point, it splits into multiple threads, awakened from a pool of dormant threads, and execution proceeds in parallel. Once all the threads finish, they join back into a single thread until the next fork.

Use OpenMP to fork processor-intensive activities, such as particle rendering, into multiple parallel threads. You can do this by architecting your code to allow for parallel threading, then adding a single line of code:

#pragma omp parallel for
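To see the pragma in context, consider this small sketch. The particle_t type and step_particles function are hypothetical and not part of the Doom source; the point is only that each iteration touches its own array element, so the iterations are independent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical particle type; not part of the Doom source. */
typedef struct {
    float x, y;     /* position */
    float vx, vy;   /* velocity per tic */
} particle_t;

/* Advance every particle by one tic. Each iteration reads and writes
   only particles[i], so iterations can run in any order on any thread. */
void step_particles(particle_t *particles, int count)
{
    int i;

    assert(particles != NULL || count == 0);

    #pragma omp parallel for
    for (i = 0; i < count; i++) {
        particles[i].x += particles[i].vx;
        particles[i].y += particles[i].vy;
    }
}
```

Compile with OpenMP enabled (for example -fopenmp with gcc or /openmp with Visual Studio). Without that switch the pragma is ignored and the loop simply runs on one thread, which makes it easy to compare the two builds.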

An example: here's the routine for redrawing player sprites, from p_pspr.c:

// P_MovePsprites
// Called every tic by player thinking routine.
//
void P_MovePsprites (player_t* player)
{
    int i;
    pspdef_t* psp;
    state_t* state;

    psp = &player->psprites[0];
    for (i=0 ; i<NUMPSPRITES ; i++, psp++)
    {
        // a null state means not active
        if ( (state = psp->state) )
        {
            // drop tic count and possibly change state

            // a -1 tic count never changes
            if (psp->tics != -1)
            {
                psp->tics--;
                if (!psp->tics)
                    P_SetPsprite (player, i, psp->state->nextstate);
            }
        }
    }

    player->psprites[ps_flash].sx = player->psprites[ps_weapon].sx;
    player->psprites[ps_flash].sy = player->psprites[ps_weapon].sy;
}

Multi-threading the for-loop using OpenMP will let you redraw the player sprites faster, except for one problem. The OpenMP specification requires a canonical form for the for-loop, including a single loop variable, whereas the original loop steps the psp pointer alongside i. You'll need to adjust the code slightly, such as by indexing psp with the loop variable i inside the loop instead of incrementing the pointer. Anything the loop body calls, such as P_SetPsprite, must also be safe to run from several threads at once. After that, just add the pragma above the for-loop and you now have parallel threading:

#pragma omp parallel for
for (i=0 ; i<NUMPSPRITES ; i++)
{
    // a null state means not active
    if (psp[i].state)
    {
        if (psp[i].tics != -1)
        {
            psp[i].tics--;
            if (!psp[i].tics)
                P_SetPsprite (player, i, psp[i].state->nextstate);
        }
    }
}
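If you want to try the canonical-form conversion without the rest of the engine, the following self-contained sketch mimics it. The sprite_t type, NUM_SPRITES and move_sprites are simplified, hypothetical stand-ins for pspdef_t, NUMPSPRITES and P_MovePsprites, and an `expired` flag replaces the call to P_SetPsprite.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SPRITES 2   /* stand-in for NUMPSPRITES */

/* Hypothetical, simplified stand-in for pspdef_t. */
typedef struct {
    int active;    /* stands in for a non-null state pointer */
    int tics;      /* -1 means the tic count never changes */
    int expired;   /* set when the tic count reaches zero */
} sprite_t;

void move_sprites(sprite_t *sprites)
{
    int i;

    assert(sprites != NULL);

    /* Canonical form: one integer loop variable, a simple test and a
       simple increment. The element pointer is derived from i inside
       the body, so every iteration has its own copy. */
    #pragma omp parallel for
    for (i = 0; i < NUM_SPRITES; i++) {
        sprite_t *sp = &sprites[i];

        if (!sp->active || sp->tics == -1)
            continue;
        sp->tics--;
        if (sp->tics == 0)
            sp->expired = 1;
    }
}
```

Declaring sp inside the loop body makes it private to each iteration. A pointer declared outside the loop, like psp in the original routine, would be shared between threads unless you mark it private.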

If you need an OpenMP compiler, you'll find a list at www.compunity.org/resources/compilers.

Porting to 64-Bit

While optimizing for multi-core nets you the biggest long-term gains, every coder these days needs a few 64-bit optimization tricks. Doom helps you practice that, as well.

Since the Doom code is pretty old, first get an overview of potential porting problems by compiling it using the Visual Studio 2008 x64 compiler, and running on any 64-bit Windows.  For Linux, try compiling using 64-bit gcc or Open64 compiler.  Also check out the switch and option recommendations in the Compiler Usage Guidelines document.

In his 2005 GDC presentation, Mike Wall outlined several great techniques for optimizing for 64-bit. Here are a couple of examples, as applied to Doom.

"Polymorphic" Data Types

On the Windows platform, use "polymorphic" data types like LONG_PTR and INT_PTR to take advantage of 64-bit computing without having to code for it directly. Like size_t, these data types change size based on the compiler mode used. For example, on Windows, LONG_PTR maps to data type long in 32-bit mode and __int64 in 64-bit mode. This is useful because Windows programs often follow the highly dubious practice of storing pointers as type LONG, which works in 32-bit mode but breaks in 64-bit mode. (If you just use ordinary pointer types and don’t mix pointers and INTs or LONGs, you don’t need to worry about LONG_PTR and the like.)
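Outside the Windows headers, C99's <stdint.h> offers intptr_t and uintptr_t, which play a similar role. The sketch below is not taken from Doom; callback_t, set_user_data and get_user_data are hypothetical names that show a pointer surviving a round trip through an integer field in both 32-bit and 64-bit builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback record that keeps a context pointer in an
   integer field, as older Windows-style code often does. */
typedef struct {
    intptr_t user_data;   /* pointer-sized in 32-bit and 64-bit builds */
} callback_t;

/* Store a pointer in the integer field. */
void set_user_data(callback_t *cb, void *ptr)
{
    assert(cb != NULL);
    cb->user_data = (intptr_t)ptr;
}

/* Recover the pointer stored by set_user_data. */
void *get_user_data(const callback_t *cb)
{
    assert(cb != NULL);
    return (void *)cb->user_data;
}
```

Had user_data been declared long, this would still work on 64-bit Linux, where long is eight bytes, but it would truncate the pointer on 64-bit Windows, where long stays at four bytes.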

Take a look at the Doom source code. In info.h you'll see the following struct:

typedef struct
{
    spritenum_t sprite;
    long frame;
    long tics;
    actionf_t action;
    statenum_t nextstate;
    long misc1, misc2;
} state_t;

To use polymorphic data types, just change the code to the following:

typedef struct
{
    spritenum_t sprite;
    LONG_PTR frame;
    LONG_PTR tics;
    actionf_t action;
    statenum_t nextstate;
    LONG_PTR misc1, misc2;
} state_t;

Natural Alignment

Another optimization technique is to order the elements of your structs so that they don't accidentally include inefficient padding between them. In a 32-bit environment, long types and pointers are both four bytes. But in 64-bit, pointers become eight bytes, and in Linux/gcc longs also become eight bytes, causing excess padding in structures if the elements are not ordered most efficiently.

For example, when coding to 32-bit, you may have had a structure that looked like this:

struct biggun {
    int a;   /* 4 bytes */
    char *b; /* 4 bytes */
    long c;  /* 4 bytes */
    int d;   /* 4 bytes */
};           /* Total 16 bytes */

When compiled to 64-bit, this same structure bloats unnecessarily:

struct biggun {
    int a;   /* 4 bytes + 4 bytes padding */
    char *b; /* 8 bytes */
    long c;  /* 8 bytes (gcc/Linux), 4 bytes (Windows) */
    int d;   /* 4 bytes + 4 bytes padding (gcc/Linux), 4 bytes (Windows) */
};           /* Total 32 bytes (gcc/Linux), 24 bytes (Windows) */

With a little rearranging, you can reduce the total size of the structure:

struct biggun {
    char *b; /* 8 bytes */
    long c;  /* 8 bytes in gcc/Linux, 4 bytes in Windows */
    int a;   /* 4 bytes */
    int d;   /* 4 bytes (+ 4 bytes tail padding in Windows) */
};           /* Total 24 bytes */
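Rather than counting bytes by hand, you can let the compiler report the layout. This sketch renames the two versions of the struct to biggun_original and biggun_reordered (names chosen for the example) so that both can live in one file, then uses sizeof and offsetof to compare them; bytes_saved and print_layout are helper names invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct biggun_original {
    int   a;
    char *b;
    long  c;
    int   d;
};

struct biggun_reordered {
    char *b;
    long  c;
    int   a;
    int   d;
};

/* Bytes saved per struct by reordering the members. */
size_t bytes_saved(void)
{
    /* Guard the unsigned subtraction below. */
    assert(sizeof(struct biggun_original) >= sizeof(struct biggun_reordered));
    return sizeof(struct biggun_original) - sizeof(struct biggun_reordered);
}

/* Print sizes and a few member offsets for the current compiler mode. */
void print_layout(void)
{
    printf("original:  %lu bytes, b at offset %lu, d at offset %lu\n",
           (unsigned long)sizeof(struct biggun_original),
           (unsigned long)offsetof(struct biggun_original, b),
           (unsigned long)offsetof(struct biggun_original, d));
    printf("reordered: %lu bytes, a at offset %lu, d at offset %lu\n",
           (unsigned long)sizeof(struct biggun_reordered),
           (unsigned long)offsetof(struct biggun_reordered, a),
           (unsigned long)offsetof(struct biggun_reordered, d));
    printf("saved:     %lu bytes\n", (unsigned long)bytes_saved());
}
```

Call print_layout from a 32-bit build and a 64-bit build and compare the output. The numbers depend on the compiler and target, so treat the byte counts in the comments above as what to expect from typical 64-bit gcc/Linux and Windows compilers rather than as guarantees.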

Now take a look at the Doom code. Here's an example from d_player.h (comments have been removed for brevity):

typedef struct player_s
{
    mobj_t* mo;
    playerstate_t playerstate;
    ticcmd_t cmd;
    fixed_t viewz;
    fixed_t viewheight;
    fixed_t deltaviewheight;
    fixed_t bob;
    int health;
    int armorpoints;
    int armortype;
    int powers[NUMPOWERS];
    boolean cards[NUMCARDS];
    boolean backpack;
    int frags[MAXPLAYERS];
    weapontype_t readyweapon;
    weapontype_t pendingweapon;
    boolean weaponowned[NUMWEAPONS];
    int ammo[NUMAMMO];
    int maxammo[NUMAMMO];
    int attackdown;
    int usedown;
    int cheats;
    int refire;
    int killcount;
    int itemcount;
    int secretcount;
    char* message;
    int damagecount;
    int bonuscount;
    mobj_t* attacker;
    int extralight;
    int fixedcolormap;
    int colormap;
    pspdef_t psprites[NUMPSPRITES];
    boolean didsecret;
} player_t;

As you can see, you have a lot of opportunities here. The good news is that this is one of the biggest structs in the game. Track down these data types and you'll have a good handle on what you can do with the rest of the code, as well, all of which will help prepare you for coding today's (and tomorrow's) 64-bit games. Sun and Microsoft both have additional information on alignment and packing.

In addition to saving space by intelligently ordering struct members for alignment, you can maximize cache utilization by ensuring that commonly used members reside on the same cache line. This improves cache locality and thus performance, as long as "false sharing" concerns are taken into consideration.
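Both ideas can be sketched in a few lines. The example assumes 64-byte cache lines, which you should confirm for your target processor, and every name in it (CACHE_LINE, hot_player_state_t, per_thread_counter_t, counter_for) is hypothetical rather than taken from Doom.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Fields that are read together every tic sit next to each other, so
   they are likely to arrive in the same cache-line fill. */
typedef struct {
    int health;
    int armorpoints;
    int readyweapon;
    int ammo;
} hot_player_state_t;

/* One counter per worker thread. The padding spaces consecutive
   counters a full cache line apart, so two threads never write to the
   same line ("false sharing"). */
typedef struct {
    long hits;
    char pad[CACHE_LINE - sizeof(long)];
} per_thread_counter_t;

/* Return the counter slot that belongs to worker `index`. */
per_thread_counter_t *counter_for(per_thread_counter_t *slots, int index)
{
    assert(slots != NULL && index >= 0);
    return &slots[index];
}
```

Because each hits field starts exactly CACHE_LINE bytes after the previous one, the counters land on different lines even if the array itself is not aligned to a line boundary. The cost is memory, so padding like this is usually reserved for small, heavily written per-thread data.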

Compiler Options

The tools you use for compiling will have the most impact on how well optimized you are for 64-bit. Use the latest drivers, optimized libraries, and 64-bit compilers, including Visual Studio 2008, GNU compilers, Open64, and others. For example, in Visual Studio you'll get the benefit of optimized Libc functions, like memcpy, memset, etc.

You'll also have a ton of compiler options at your disposal, some for Windows, some for Linux, some for both. Focus on /O1, /O2, /Ob2, /fp:fast, /GL, and /LTCG. Experiment and test the performance of your builds with CodeAnalyst (see below). For example, /O2 is supposed to optimize for speed, but ironically, in 64-bit mode, using /O1 for small code size can sometimes give higher performance than /O2 because it improves instruction cache utilization. You can find information on these and more for both Windows and Linux platforms. Also check out Microsoft's "Optimization Best Practices" and see AMD's tip sheets on preferred compiler options for Windows, Linux and Sun.

Optimize with CodeAnalyst

When optimizing code for 64-bit mode, multi-core, and general performance, AMD CodeAnalyst, a free download, will be a useful tool. Use CodeAnalyst to analyze and optimize performance, specifically with timer-based and event-based profiling, instruction-based sampling, and thread profiling to see if, and how efficiently, multi-threading is taking place.

To read more about using CodeAnalyst for Linux on AMD, read "Increased performance with AMD CodeAnalyst software and Instruction-Based Sampling (on Linux)." Windows developers should check out "Optimizing for Multi-Core with AMD CodeAnalyst," which focuses on threading.

Justin Whitney is a regular contributor to DevX.com and Jupitermedia. He currently lives in San Francisco, where he consults for leading high-tech firms and writes about emerging technologies.

© 2009 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon, AMD Turion, AMD Sempron, AMD LIVE!, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Linux is a registered trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.
