For two consecutive nights, I slept on the floor of a vault. This was in 1997, if I recall correctly. At that time, I was one of the few Microsoft Certified Exchange Specialists in the area. (I may have been the only one still at that point.) I had come off of the process of designing and beginning the rollout of the world-wide email system for global fiber-optic pioneer MFS Communications. (Before they got absorbed by WorldCom.) This vault belonged to a significantly smaller and less far-flung enterprise. As a public utility serving a few million people spread across tens of thousands of square miles, however, they were a little prickly about their data center. That explains the vault.
The reason for my nocturnal pseudo-imprisonment in said vault was something relatively insignificant and yet managed to cripple their email system for days. In fact, it took us a few days to even ferret out what the problem was. My series of phone calls to Microsoft’s technical support generally ended up at the Exchange development team themselves.
In the end, we found the cause of the problem by tracking back through the path of damage that it had done. It came down to the RAID controller (the hard drive controller for those that don’t do server hardware). Specifically, there was something called “lazy write cache” that was in effect. Rather than the drive controller getting data from the database engine, writing that bit of data, verifying that it was written properly, and then asking for more work from the database engine, it says “just give it all to me… I’ll get to it when I can.” The problem with this approach is that, if there is a hiccup in writing the data, the drive controller can’t report back to the database and say, “Uh, dude? Like… we had a bit of a problem… can you give me that one piece again?”
“Being out of sync sucks.”
In the situation with my client, the ramifications of this was a massive database corruption that started with one tiny blurb of data and spread throughout their whole email system with undirected but amazingly annoying persistence. So I slept on the data center floor in 20-minute chunks, entombed by mind-boggling amounts of concrete and steel, restoring from mind-numbingly slow backup tapes one day further into the past at a time until we could find the day where the corruption actually started.
To sum up, the cause of the whole problem was that the database and the hard drives were allowed, even for a moment to be out of sync. That RAID controller was saying “trust me… all will be well in the end.” Well… we now know how well that worked out. The moral of the story? Being out of sync is a disaster waiting to happen.
Old McDonald Had a Core…
I am reminded of this delightful adventure because of a recent thread (which borders on being titled a spectacular rant) on the AiGameDev.com forums about a certain hardware manufacturer’s outrageous lies about game AI (free forum registration and introduction required to view). The participants in the discussion lash out at what seems to be Intel’s implication that game AI sucks and will continue to do so until we can have dozens, hundreds or even thousands of cores to play with. I will let you read the responses that ensued in the forum itself.
In the beginning, all we had to worry about was a single processor. Everything was done on that one processor — with the eventual exception of stuff that was offloaded to graphics cards. All the different parts of the game had to share that single processor — input, logic, rendering, etc. Anything you could think of. With the advent of Windows (et al), we were introduced to the idea that multiple programs could be running on a single processor at one time. However, we still thought of a program (be it game or otherwise) as being a single stream of commands run in order.
The idea of multiple threads for the same application in a single processor environment sometimes was a little anachronistic. After all, the different threads had to take turns on the same processor anyway. The only benefits at that time was that NiftyTaskA could interrupt NiftyTaskB even before NiftyTaskB was done. One common use for this was to keep from having sometimes sluggish world-processing calculations from making your control scheme visibly sporadic. The bottom line, however, is that they were all sharing the same processor no matter how many threads there were. In fact, you could actually slow things down just because of the overhead involved in switching between threads.
Here a Core, There a Core…
And then came the much vaunted “Dual Core” systems. And I’m not entirely sure that the game world was ready for it. As much as we like to think we are on the cutting edge of technology, there were are startling number of games that were not at all “optimized for multiple cores.” And why were they not optimized? Largely, because they were running in a single thread. I remember playing the same RPG (which shall remain unidentified) on my old laptop (P4 2.6 Ghz) and then on my new one (Dual Core 1.7 Ghz) and being startled that it was running slower. C’mon… isn’t (1.7 x 2) > 2.6 ?
A quick peek at the performance monitor realized that the game was beating the snot outta one processor while the other one skipped along at a leisurely 3% — seemingly unaware of the flurry of activity that its neighbor was churning out. So, in effect, the game was running at 1.7 Ghz. Wonderful. I wanted desperately for them to simply cut the input over to a separate thread so my cursor didn’t stutter around the screen like it was on drugs. (Or like I was.) Thankfully, they eventually released a patch that was “optimized for multi-core machines”. (Brilliant!)
Everywhere a Core! Core!
For the most part, the only benefit that was garnered from multiple cores was that some of the OS and background applications were split up so as not to compete with the game (at least not as much). But that benefit can only go so far. Even with processor hogs such as Vista and the seemingly endless onslaught of “completely necessary stuff” that wants to install itself in our system trays, the effect on the processor is relative loose change compared to the electronic bombardment of a modern PC game.
The same can be said for consoles. With the emergence of Sony’s use of SPE’s in their multi-core PS3, there is a lot of talk about how to use those efficiently. (Of course, there is also a lot of griping about how each SPE only has 256k of RAM sitting around.) One way of using them is to process little chunks of information on a temporary basis. Wow… doesn’t this look like multi-threading?
So now that we are likely going to be surrounded by billions and billions of cores (nod to Carl Sagan), what good will they do us? I suppose that you can eventually get one cute little system tray app on each one so they don’t fight with each other. And some of the crap that I’m running right now uses a startling number of threads (a glance at my Task Manager shows Outlook using 32, Skype with 27, and Steam with 61?!?). But still splitting up our games to even half this many threads still frightens us as game developers. And yet will we need to do it to expand in a technology world that may not care that we would prefer a single processor running at 32 Ghz rather than 32 processors running at 1 Ghz?
So Where Do We Cut, Doctor?
Regardless of platform or architecture, the issue that keeps coming up is “where do we split that single homogeneous flow?” The reason that it is so scary is the same moral that we gleaned from my nights behind two-foot thick steel doors. “Being out of sync sucks.” And yet that’s what we face when we start breaking things off from the main data flow.
On of the advantages of having everything in one thread was that our game processes were forced to march in order. NiftyTaskA. NiftyTaskB. NiftyTaskC… every time through the game loop, it was the same parade. Boring, yet rigidly predictable. You knew that NiftyTaskB could depend completely on everything in NiftyTaskA being completed. There is a sense of security in that. Having a separate thread for just moving the cursor around the screen is one thing. After all, we are just going to cache any input until such time as it is needed in the next frame. Anything that has to do with the data, however, gets a little more dicey.
“Debugging across threads? You get used to it.”
In 2003, my late colleague and friend Eric Dybsand glibly shrugged off questions about multi-threading. He stated quite simply that he typically offloaded an A* pathfinder to a separate thread so that agents could request paths without bogging down the rest of the game loop. In that same conversation, Eric even fielded the question of debugging data in a multi-threaded environment with the calm response of “you get used to it after a while.” (I’m waiting for Eric to respawn so that I can ask him more about his techniques in this capacity.)
Needless to say, Eric’s proclamations alarmed many people. “What if…?” was the common cry. The thought was that, by the time a path was returned, it would no longer be valid. I won’t get into the array of questions and answers here. (Entire books can be written on the caveats.) Even something as simple as pathfinding — which only reads world data… it doesn’t change it — causes some significant consternation. Imagine what would happen if we wanted to go even further than this?
The Needs of the Many…
I do have to give Intel’s claim some credit in this respect… as AI programmers, we always gripe about not having enough processing power at our disposal. However, in the past, we could rely on Dr. Moore to come to the rescue by just bestowing more processor speed on us every 8 months or so. That may no longer something we can rely on… at least not in a purely linear sense. Just flipping open your latest PC catalog over the past few years has revealed the trend that clock speeds of the processors have somewhat leveled off. Instead, we see more processors — sometimes even running slower than their single-core predecessors. Sure, we have more “processor time” available to us… it’s just no longer in one place. If we want to use it, we are going to have to get over our industry-endemic fear of threads.
So… what is the possibility and/or feasibility of multi-threading? As it inevitable? Is it necessary? Does it scare the heck out of you? Or have you even given it a shot and used it successfully? Can we do it without getting “out of sync?” Or, for that matter, does being out of sync really suck that badly?
(Alex saw my topic for this week and warned that it might be a bit touchy. So, after I hit save, I think I may pathfind myself to go hide in a big, data center vault for a few days.)









Comments
Comment on this article. | Show full forum thread.- Programming the SPEs is not quite the same as multithreading in most cases, since SPEs operate on their own little chunk of RAM and are typically used for streamed or batched numerical computation.
- Delayed algorithms like Eric's threaded A* were commonly used in non-threaded scenarios. A* in particular is very easy to timeslice across multiple frames. The scary part was not that the path might not be valid after a few frames (that's going to happen anyway), but the loss of determinism and the typical issues with accessing data while it's being modified.
- Multithreading the AI code across many cores (beyond stuff like physics raycasts, visibility queries and such) has to go hand in hand with multithreading the entire gameplay codebase, or else Ahmdal comes to bite your ass. Perhaps there's a discussion pending on what areas of gameplay code are considered AI and what aren't (hint hint). :) For example props, mission scripts, effects, environment, etc.
- It's hard to really design and implement a massively multithreaded version of your gameplay code when you don't have massively multicore hardware. Right now doing so has to be a research project, because, as you say, multithreading far beyond the number of physical cores reduces performance and therefore harms games in production. Few teams can afford the manpower to go very far with this level of research if it won't improve their current or near-future projects.
Since designing a massively multithreaded gameplay engine is a very hard problem with many unknowns, the way I typically think about it is without regard to performance or resources. Imagine if you could lock-protect anything that is going to be shared; imagine if you could afford to double-buffer everything. That at least gives me a starting point.. not that I have thought too far beyond that, but heh. :)
With that said, AI for many games isn't that hard to subdivide. Path planning is an obvious place to start, but really only a tiny part. Many games already timeslice their path planning (meaning that you can't expect a result in the same update anyway). Putting that in a separate thread is a no-brainer.
Going beyond that - it's usually not necessary for every AI to update every frame. In Kohan 2, our strategic AIs were only updating once every 20-30 seconds. We could easily have run each in its own thread if we wanted to - in fact, we did run each in its own lightweight fiber. This made it easy to timeslice them arbitrarily, while still allowing us to retain control of when they released the CPU (thus solving most, but not all, of our synchronization issues). We also had separate simulation and graphics threads (all of the AI fibers were part of the simulation thread). Again, I don't believe there was all that much pain as a result (at least, I didn't see our graphics programmer ripping his hair out over it), and I don't think this is all that uncommon of an approach.
One thing to be aware of if you do go down this path... and the following is based only on my vague understanding of how things work (and significantly oversimplified even so), so if you're planning to work on these machines definitely take it with a grain of salt...
Modern PCs and the XBox 360 have multiple cores which can run separate threads on each core fairly cleanly and easily, so they are fairly well suited to this sort of approach. You can imagine them as having two separate CPUs which share the same access to everything else, so it doesn't much matter where a thread runs. The problem here, of course, is that if the CPUs are fighting for access to other resources then you obviously don't get all that much speedup.
The PS3 is an entirely different beast. As I understand it, it has only one main CPU, with a bunch of little SPUs that have their own memory and little if any access to other resources. Basically the way the SPU works is that you send it a job, and then wait for it to send the answer back. Sending it a job is expensive, because the bandwidth to the SPU is low, so you have to send it a task that is going to run a long time in order to make the startup cost worth it. However each SPU has only a relatively small amount of memory (the 256k that Dave alludes to), so it needs to do that crunching with a limited data set.
I personally think that writing AI that was optimized for the PS3 would be a really fun challenge - but only as a technical problem. From a professional point of view, I think it's unlikely that I would ever choose to go there.
I think the key to debugging threads for me is simply a good debugger. Finding bugs in multi-threaded java code when you're debugging with Eclipse, for example, is relatively easy. Finding any kind of bug at all with the tools you get for some of the consoles, on the other hand is horrifyingly difficult (naming no names!).
The biggest challenge you tend to get is race conditions - where a bug occurs sporadically depending on whether the conditions which cause it have enough time to occur (or vice versa, whether the conditions in which the code works have enough time to occur).
You end up writing a lot of state-based multi-step code to wait for other threads, as below:
bool ARoutine()
{
if(state == INITIALISATION)
{
//do initialisation, start task
state = WAITING;
}
if(state == WAITING)
{
//wait for task in other thread to finish
if(!threadedTask.Finished())
return false;
if(threadedTask.Error())
state = ERROR;
state == READYTOGO;
}
if(state == READYTOGO)
{
//Do the stuff you were waiting to do
state == DONE;
}
if(state == DONE)
{
return true;
}
return false;
}
A little simplified maybe, and maybe a little overstated, but I can't count the number of times I've written a method like that. You then just keep calling the method until it's done.
Once you get the hang of writing multi-threaded code, it gets tempting to make new threads all the time. I'd suggest you only make a new thread when there's a clear benefit to being able to start a task and ignore it until it's done. I've heard of people suggesting that an mmog server should have a thread for each player connected before - imagine how many threads would be running! In a lot of cases you can just do your own time management in a similar way to the way you'd wait for a thread, but just checking whether you've used up more than your allotted time for a part of a task, and not doing the next part of the task until next cycle.
I think pathfinding *is* something that could be farmed out to another thread, but if you always want a result within one frame, there's no point. You have to use the CPU time anyway, so why waste any thread-swapping if you're going to wait for the result anyway? If, however, you're happy to have your character or whatever either stand around and wait for a path, or follow a worse guess until the pathfinding is done, then by all means put your slow pathfinding into a different thread (I love this approach, and I wish I saw it more often - I'd love to see strategy units pull out a map or get on the comms while they try and work out their route!)
I think that there is no possibility to increase processor's speed up to 32 Ghz. That's why we must submit that further increasing of power is beyound the parallel computing.
"It is not impossible, it is inevitable" (c)
Cheers,
Andrey.
My first thought: threads are not the answer! If you need to think about parallelization, then read about tasks (I heavily recommend looking into Intel Threading Building Blocks library) and streams (google for RapidMind, CUDA, Emergent's Floodgate, and Insomniac's SPU shaders) - methods that steer your thinking into a direction where you identify chunks of processing that can be computed independently and therefore in parallel but don't think about threads themselves.
Threads are a low-level solution which should be hidden for most usage by a task pool or task graph or a streaming framework so you don't rely too much on a resource that is heavily dependent on the actual hardware. Too many threads hurt your performance. Too few hurt your performance. Streams and task pools hide the actual number of CPU cores and therefore the number of threads that are reasonable to use. Therefore these solutions tend to scale better with growing numbers of cores - and scalability is the main topic of the Intel blog referenced above.
You need to use threads directly when using certain APIs - for example an OpenGL display context is bound to one thread and all calls to this context should be made from this thread. This is a low-level restriction and your core rendering thread should/might be worth to hide behind a message queue that hides the thread-context-binding for the rest of your app.
Personally, I like to think about pathfinding as a service. My agents post pathfinding request to a pathfinding service. This service is hidden behind an interface - so the users of the service don't know if the service is using threads, a task pool, SPUs, or GPGPU techniques for its doings.
Such a service offers specific phases that coordinate reading and writing of data: a phase to post pathfinding queries, a phase to extract finished queries, a phase to manipulate the maps, etc.
If I need determinism two ways of using such a service come to my mind (there might be and are for sure more ways!):
1. Don't put the pathfinding into an asynchrone thread but call it synchronously during your main loop and allow the pathfinding service to farm of as many tasks as it needs to calculate the paths for all queries, or even to use many tasks per query - and map tasks to hierarchical path refinements and planning.
In this way pathfinding will be deterministic as you sync on it finishing the queries directly.
2. Allow pathfinding to take up to k-frames. Queue your query to the pathfinding service that might run asynchronously to your main loop. However you don't expect the query result for k-frames. If you haven't received a query reply after waiting for k frames block your main loop from advancing until the path arrives.
This might lead to hiccups in your frame-rate - but it leads to deterministic results - and handling performance in this case is very similar to handling it in a purely sequential case - use a faster algorithm, tune your maps/nav meshes/grids, deploy iterative or hierarchical pathfinding, etc.
Ugh - I hope that I don't sound to much like preaching or teaching - I get carried away by this topic - sorry ;-)
By the way: the main point of this text is: embrace deferring computations and receiving processing results to enable more asynchronicity.
Both are functional languages (to varying degrees) so they encourage a side-effect free style of programming. Without side-effects, anything can be parallelized.
However, working in games, we have lots of side-effects that we actually do want. In Erlang this is done by message passing, recursion, or database writes. Haskell gives you the means to explicitly say "this is a function that produces side effects" and gives you the means to specify "ok, I need these steps to be done in a specific order".
Erlang works in an environment where multithreading is easy because no memory is shared, it's duplicated across each core, and can't be changed once written. This works, but isn't very intuitive for game programmers and on may not be feasible on consoles, where memory is scarce. Haskell uses shared memory with atomic operations. This is much more game programmer friendly and acts much like single threaded code, with the promise of no race conditions or memory corruption.
I'm not saying we'll switch over to functional languages. C/C++/C#/Java and other imperative languages have been adding a lot of the features of functional languages over the years and may end up taking all of those we care about to give us good multithreading. But if those languages refuse to move fast enough, there's bound to be a startup that tests the waters, or an industry heavyweight (Microsoft is putting a lot of research into both the Haskell and Erlang way of concurrency) that has the deep pockets to make us actually consider re-writing those old C/C++ libraries.
Check out http://www.multicoreinfo.com for latest developments, programming resources, and research papers in the area of multicore.
Politically motivated academics have called the project a failure, largely because it failed to produce a commercially viable product, however, the project did demonstrate the parallelization of common AI tasks: ultimately the concept was vindicated by IBM's "Deep Blue" which used parallel search to outperform humans at chess.
I'm looking forward to a "Perception Engine" that applies an architecture similar to that of GPU's or the PS3 to tasks in machine vision, speech recognition and machine vision.
Also, I too agree with Eric; you get used to it.
Anything that is already time sliced (and as other commentors have pointed out, pathing is a great example of this) would be fairly straight forward to spin off into another thread as your agent is awaiting the Pathing.Complete or Pathing.CannotComplete events to be sent back before it does anything anyway. Checking the validity of a time sliced path is an already solved problem, you just look into the resulting path and see if any of the steps are blocked and if one is, calculate a new path from the last unblocked location to the target location again.
So pathing is (by its nature) a good fit for multi-threading because it is expensive to calculate and chews up CPU cycles.
Anything else that is expensive could, in theory, be a good fit for spinning off into other threads. E.g. a turn based strategy game could be running its planning phase while the player is playing his turn, and using a number of assertions, it could check the validity of these assertions at the start of its turn to determine how good (valid) the resulting plan is and how much of it should be used. In fact, I think that this base idea of setting up assertions and conditions would be useful for any type of AI calculation that has been made to run in parallel to the rest of the game engine, whether it be time-sliced, threaded or just the result of an asynchronous request (web service, perhaps).
Do I think we need a thousand cores? No, not at all. I suspect that after 8 we'll hit diminishing returns. Intel has a problem, quantum effects are preventing them from scaling up clock speeds on processors and so their answer is to increase the number of cores instead. Will that really benefit us in the long run? Probably not.
Intel doesn't care what you think - we're going to end up with a thousand cores anyway ;-)
Maybe with heavy used of fuzzy-logic to add a bit of number crunching to lessen the data accesses... but is it enough to lower cache miss rates nad prevent most of the cores from being idle alot of the time??