To Thread or not to Thread? That is the Question.

Dave Mark on July 16, 2008

In this week’s developer discussion, Dave Mark heads straight for controversial ground with this article about multi-threading and concurrency. Do you think it’s solved? Do we even need to solve it? Let him know and reply below!

Ooohh… look at all the pretty colors! I wonder which one is the A* engine?

For two consecutive nights, I slept on the floor of a vault. This was in 1997, if I recall correctly. At that time, I was one of the few Microsoft Certified Exchange Specialists in the area. (I may have been the only one still at that point.) I had come off of the process of designing and beginning the rollout of the world-wide email system for global fiber-optic pioneer MFS Communications. (Before they got absorbed by WorldCom.) This vault belonged to a significantly smaller and less far-flung enterprise. As a public utility serving a few million people spread across tens of thousands of square miles, however, they were a little prickly about their data center. That explains the vault.

The reason for my nocturnal pseudo-imprisonment in said vault was something relatively insignificant that nonetheless managed to cripple their email system for days. In fact, it took us a few days to even ferret out what the problem was. My series of phone calls to Microsoft’s technical support generally ended up at the Exchange development team themselves.

In the end, we found the cause of the problem by tracking back through the path of damage that it had done. It came down to the RAID controller (the hard drive controller, for those that don’t do server hardware). Specifically, there was something called a “lazy write cache” in effect. Rather than the drive controller getting data from the database engine, writing that bit of data, verifying that it was written properly, and then asking for more work from the database engine, it said, “just give it all to me… I’ll get to it when I can.” The problem with this approach is that, if there is a hiccup in writing the data, the drive controller can’t report back to the database and say, “Uh, dude? Like… we had a bit of a problem… can you give me that one piece again?”

“Being out of sync sucks.”

In the situation with my client, the result of this was a massive database corruption that started with one tiny blurb of data and spread throughout their whole email system with undirected but amazingly annoying persistence. So I slept on the data center floor in 20-minute chunks, entombed by mind-boggling amounts of concrete and steel, restoring from mind-numbingly slow backup tapes one day further into the past at a time, until we could find the day when the corruption actually started.

To sum up, the cause of the whole problem was that the database and the hard drives were allowed, even for a moment, to be out of sync. That RAID controller was saying “trust me… all will be well in the end.” Well… we now know how well that worked out. The moral of the story? Being out of sync is a disaster waiting to happen.

Old McDonald Had a Core…

Uh… you said the strategic behavior thread was… what color?

I am reminded of this delightful adventure because of a recent thread (which borders on being a spectacular rant) on the forums about a certain hardware manufacturer’s outrageous lies about game AI (free forum registration and introduction required to view). The participants in the discussion lash out at what seems to be Intel’s implication that game AI sucks and will continue to do so until we can have dozens, hundreds, or even thousands of cores to play with. I will let you read the responses that ensued in the forum itself.

In the beginning, all we had to worry about was a single processor. Everything was done on that one processor — with the eventual exception of stuff that was offloaded to graphics cards. All the different parts of the game had to share that single processor — input, logic, rendering, etc. Anything you could think of. With the advent of Windows (et al), we were introduced to the idea that multiple programs could be running on a single processor at one time. However, we still thought of a program (be it game or otherwise) as being a single stream of commands run in order.

The idea of multiple threads for the same application in a single-processor environment sometimes seemed a little redundant. After all, the different threads had to take turns on the same processor anyway. The only benefit at that time was that NiftyTaskA could interrupt NiftyTaskB even before NiftyTaskB was done. One common use for this was to keep sometimes-sluggish world-processing calculations from making your control scheme visibly sporadic. The bottom line, however, is that the threads were all sharing the same processor no matter how many of them there were. In fact, you could actually slow things down just because of the overhead involved in switching between threads.
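Just to make that one benefit concrete, here is a toy sketch (Python for brevity; the task names are made up): the sluggish world-processing calculation runs on its own thread so the “main loop” can keep servicing input. On a single processor the two still just take turns — there is no extra horsepower, only responsiveness.

```python
import threading

result = {}

def nifty_task_b():
    # Stand-in for a long, sluggish world-processing calculation.
    result["world"] = sum(x * x for x in range(100_000))

worker = threading.Thread(target=nifty_task_b)
worker.start()

# Meanwhile, the "main loop" keeps running instead of stalling...
frames = 0
while worker.is_alive():
    frames += 1  # poll input, update the cursor, etc.

worker.join()
```

The operating system preempts each thread in turn, which is exactly why the control scheme stays smooth — and exactly where the context-switching overhead comes from.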

Here a Core, There a Core…

The PS3 cell processor. Adult assembly required.

And then came the much-vaunted “Dual Core” systems. And I’m not entirely sure that the game world was ready for them. As much as we like to think we are on the cutting edge of technology, there were a startling number of games that were not at all “optimized for multiple cores.” And why were they not optimized? Largely because they were running in a single thread. I remember playing the same RPG (which shall remain unidentified) on my old laptop (a 2.6 GHz P4) and then on my new one (a 1.7 GHz dual core) and being startled that it was running slower. C’mon… isn’t (1.7 x 2) > 2.6?

A quick peek at the performance monitor revealed that the game was beating the snot outta one processor while the other one skipped along at a leisurely 3% — seemingly unaware of the flurry of activity that its neighbor was churning out. So, in effect, the game was running at 1.7 GHz. Wonderful. I wanted desperately for them to simply cut the input over to a separate thread so my cursor didn’t stutter around the screen like it was on drugs. (Or like I was.) Thankfully, they eventually released a patch that was “optimized for multi-core machines”. (Brilliant!)

Everywhere a Core! Core!

For the most part, the only benefit that was garnered from multiple cores was that some of the OS and background applications were split up so as not to compete with the game (at least not as much). But that benefit can only go so far. Even with processor hogs such as Vista and the seemingly endless onslaught of “completely necessary stuff” that wants to install itself in our system trays, the effect on the processor is relatively loose change compared to the electronic bombardment of a modern PC game.

The same can be said for consoles. With the emergence of Sony’s use of SPEs in their multi-core PS3, there is a lot of talk about how to use those efficiently. (Of course, there is also a lot of griping about how each SPE only has 256K of local memory to work with.) One way of using them is to process little chunks of information on a temporary basis. Wow… doesn’t this look like multi-threading?

So now that we are likely going to be surrounded by billions and billions of cores (nod to Carl Sagan), what good will they do us? I suppose that you can eventually get one cute little system tray app on each one so they don’t fight with each other. And some of the crap that I’m running right now uses a startling number of threads (a glance at my Task Manager shows Outlook using 32, Skype with 27, and Steam with 61?!?). But splitting our games into even half that many threads still frightens us as game developers. And yet, will we need to do it to keep up in a technology world that may not care that we would prefer a single processor running at 32 GHz rather than 32 processors running at 1 GHz?

So Where Do We Cut, Doctor?

Make sure you don’t cut the data!

Regardless of platform or architecture, the issue that keeps coming up is “where do we split that single homogeneous flow?” The reason that it is so scary is the same moral that we gleaned from my nights behind two-foot thick steel doors. “Being out of sync sucks.” And yet that’s what we face when we start breaking things off from the main data flow.

One of the advantages of having everything in one thread was that our game processes were forced to march in order. NiftyTaskA. NiftyTaskB. NiftyTaskC… every time through the game loop, it was the same parade. Boring, yet rigidly predictable. You knew that NiftyTaskB could depend completely on everything in NiftyTaskA being completed. There is a sense of security in that. Having a separate thread for just moving the cursor around the screen is one thing. After all, we are just going to cache any input until such time as it is needed in the next frame. Anything that has to do with the data, however, gets a little more dicey.
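That parade is easy to picture in code. A toy sketch (Python, with hypothetical task names) of the old single-threaded loop: input events are cached in a queue and only consumed at the top of the next frame, so every task in the parade sees a stable world.

```python
from collections import deque

input_queue = deque()  # input is cached here between frames
trace = []             # records the order in which work happens

def nifty_task_a(events):
    trace.append(("A", list(events)))  # e.g. apply the buffered input

def nifty_task_b():
    trace.append("B")  # e.g. world simulation; can rely on A being done

def nifty_task_c():
    trace.append("C")  # e.g. rendering; can rely on A and B being done

def run_frame():
    # Drain the cached input exactly once, at the top of the frame.
    events = list(input_queue)
    input_queue.clear()
    nifty_task_a(events)
    nifty_task_b()
    nifty_task_c()

input_queue.append("mouse_move")  # arrives sometime mid-frame
run_frame()
```

Boring, yet rigidly predictable — which is precisely the guarantee we give up when tasks start running concurrently.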

“Debugging across threads? You get used to it.”

In 2003, my late colleague and friend Eric Dybsand glibly shrugged off questions about multi-threading. He stated quite simply that he typically offloaded an A* pathfinder to a separate thread so that agents could request paths without bogging down the rest of the game loop. In that same conversation, Eric even fielded the question of debugging data in a multi-threaded environment with the calm response of “you get used to it after a while.” (I’m waiting for Eric to respawn so that I can ask him more about his techniques in this capacity.)
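This is certainly not Eric’s code, but the shape of what he described might look something like the following sketch: agents post path requests to a queue, a worker thread services them, and results are picked up later. The “pathfinder” here is a trivial placeholder where a real A* search would go.

```python
import queue
import threading

requests = queue.Queue()
finished = {}                     # request id -> computed path
finished_lock = threading.Lock()

def trivial_path(start, goal):
    # Placeholder for A*: diagonal grid steps toward the goal.
    x, y = start
    path = [start]
    while (x, y) != goal:
        x += (goal[0] > x) - (goal[0] < x)
        y += (goal[1] > y) - (goal[1] < y)
        path.append((x, y))
    return path

def pathfinder_worker():
    while True:
        req_id, start, goal = requests.get()
        if req_id is None:
            break  # shutdown sentinel
        path = trivial_path(start, goal)
        with finished_lock:
            finished[req_id] = path

worker = threading.Thread(target=pathfinder_worker, daemon=True)
worker.start()

requests.put((1, (0, 0), (3, 3)))  # an agent asks for a path...
requests.put((None, None, None))   # ...and we shut down for the demo
worker.join()
```

The agent does not block; it simply checks `finished` on later frames — which is exactly what raises the “what if the path is stale by then?” question below.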

Needless to say, Eric’s proclamations alarmed many people. “What if…?” was the common cry. The thought was that, by the time a path was returned, it would no longer be valid. I won’t get into the array of questions and answers here. (Entire books can be written on the caveats.) Even something as simple as pathfinding — which only reads world data; it doesn’t change it — causes some significant consternation. Imagine what would happen if we wanted to go even further than this.

The Needs of the Many…

I do have to give Intel’s claim some credit in this respect… as AI programmers, we always gripe about not having enough processing power at our disposal. In the past, however, we could rely on Dr. Moore to come to the rescue by bestowing more processor speed on us every 18 months or so. That may no longer be something we can rely on… at least not in a purely linear sense. Flipping open your latest PC catalog any time over the past few years reveals the trend: processor clock speeds have somewhat leveled off. Instead, we see more processors — sometimes even running slower than their single-core predecessors. Sure, we have more “processor time” available to us… it’s just no longer in one place. If we want to use it, we are going to have to get over our industry-endemic fear of threads.

So… what is the possibility and/or feasibility of multi-threading? Is it inevitable? Is it necessary? Does it scare the heck out of you? Or have you given it a shot and used it successfully? Can we do it without getting “out of sync?” Or, for that matter, does being out of sync really suck that badly?

(Alex saw my topic for this week and warned that it might be a bit touchy. So, after I hit save, I think I may pathfind myself to go hide in a big, data center vault for a few days.)

Discussion 8 Comments

bknafla on July 16th, 2008

Dave - nice topic to stir the seemingly calm water ;-) My first thought: threads are not the answer!

If you need to think about parallelization, then read about tasks (I heavily recommend looking into Intel's Threading Building Blocks library) and streams (google for RapidMind, CUDA, Emergent's Floodgate, and Insomniac's SPU shaders) - methods that steer your thinking toward identifying chunks of processing that can be computed independently, and therefore in parallel, without thinking about threads themselves. Threads are a low-level solution that should, for most uses, be hidden behind a task pool, a task graph, or a streaming framework, so you don't rely too much on a resource that is heavily dependent on the actual hardware. Too many threads hurt your performance. Too few hurt your performance. Streams and task pools hide the actual number of CPU cores, and therefore the number of threads it is reasonable to use. These solutions thus tend to scale better with growing numbers of cores - and scalability is the main topic of the Intel blog referenced above.

You need to use threads directly when using certain APIs - for example, an OpenGL display context is bound to one thread, and all calls to that context should be made from that thread. This is a low-level restriction, and your core rendering thread might be worth hiding behind a message queue that conceals the thread-context binding from the rest of your app.
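bknafla's task-centric advice can be sketched in a few lines of Python using a standard thread pool (the work function and data here are invented; a real engine would reach for TBB or a job system). The point is that the code expresses independent chunks of work, while the pool — not the code — decides how many threads actually run them.

```python
from concurrent.futures import ThreadPoolExecutor

def score_agent(agent_state):
    # Stand-in for an independent chunk of AI processing
    # that touches no shared mutable state.
    return agent_state * 2 + 1

agents = list(range(8))

# max_workers=None lets the library size the pool for the machine,
# which is the scalability property bknafla describes: the task count
# stays fixed while the thread count tracks the hardware.
with ThreadPoolExecutor(max_workers=None) as pool:
    scores = list(pool.map(score_agent, agents))
```

Because each task is independent, the same code runs correctly on one core or sixteen — only the wall-clock time changes.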

Jare on July 16th, 2008

A few nitpicks before bed. :)

- Programming the SPEs is not quite the same as multithreading in most cases, since SPEs operate on their own little chunk of RAM and are typically used for streamed or batched numerical computation.

- Delayed algorithms like Eric's threaded A* were commonly used in non-threaded scenarios; A* in particular is very easy to timeslice across multiple frames. The scary part was not that the path might no longer be valid after a few frames (that's going to happen anyway), but the loss of determinism and the typical issues with accessing data *while* it's being modified.

- Multithreading the AI code across many cores (beyond stuff like physics raycasts, visibility queries and such) has to go hand in hand with multithreading the entire gameplay codebase, or else Amdahl comes to bite your ass. Perhaps there's a discussion pending on what areas of gameplay code are considered AI and what aren't (hint hint). :) For example props, mission scripts, effects, environment, etc.

- It's hard to really design and implement a massively multithreaded version of your gameplay code when you don't have massively multicore hardware. Right now doing so has to be a research project because, as you say, multithreading far beyond the number of physical cores reduces performance and therefore harms games in production. Few teams can afford the manpower to go very far with this level of research if it won't improve their current or near-future projects.

Since designing a massively multithreaded gameplay engine is a very hard problem with many unknowns, the way I typically think about it is without regard to performance or resources. Imagine if you could lock-protect anything that is going to be shared; imagine if you could afford to double-buffer everything. That at least gives me a starting point… not that I have thought too far beyond that, but heh. :)
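Jare's double-buffering thought experiment reduces to a small sketch (Python, invented names): readers only ever see the "front" copy of the world, writers fill the "back" copy, and a single swap at the end of the frame publishes all the changes at once.

```python
class DoubleBufferedWorld:
    def __init__(self, initial):
        self.front = dict(initial)  # what AI/render code reads
        self.back = dict(initial)   # what the simulation writes

    def write(self, key, value):
        self.back[key] = value      # never touches what readers see

    def read(self, key):
        return self.front[key]

    def swap(self):
        # End of frame: the just-written copy becomes the published
        # state, and the old front is recycled as the next back buffer.
        self.front, self.back = self.back, dict(self.back)

world = DoubleBufferedWorld({"orc_pos": (0, 0)})
world.write("orc_pos", (1, 0))
mid_frame = world.read("orc_pos")   # still (0, 0): readers see old state
world.swap()
next_frame = world.read("orc_pos")  # now (1, 0): the update is published
```

The memory cost doubles, which is exactly why Jare frames it as "imagine if you could afford to" — but it sidesteps the reading-while-modifying problem entirely.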

bknafla on July 16th, 2008

Thinking about parallelization means thinking (and deciding) about determinism. Farming out pathfinding to run in parallel to the main loop logic is a decision where you allow non-determinism (when does the independent parallel path computation finish - in one frame, two, three, ...?) to gain independence of computations and performance.

Personally, I like to think about pathfinding as a service. My agents post pathfinding requests to a pathfinding service. This service is hidden behind an interface - so the users of the service don't know whether the service is using threads, a task pool, SPUs, or GPGPU techniques for its doings. Such a service offers specific phases that coordinate reading and writing of data: a phase to post pathfinding queries, a phase to extract finished queries, a phase to manipulate the maps, etc.

If I need determinism, two ways of using such a service come to mind (there might be, and surely are, more ways!):

1. Don't put the pathfinding into an asynchronous thread, but call it synchronously during your main loop and allow the pathfinding service to farm off as many tasks as it needs to calculate the paths for all queries - or even to use many tasks per query, mapping tasks to hierarchical path refinements and planning. This way pathfinding is deterministic, as you sync on it finishing the queries directly.

2. Allow pathfinding to take up to k frames. Queue your query to the pathfinding service, which might run asynchronously to your main loop, but don't expect the query result for k frames. If you haven't received a reply after waiting k frames, block your main loop from advancing until the path arrives. This might lead to hiccups in your frame rate, but it leads to deterministic results - and handling performance in this case is very similar to handling it in the purely sequential case: use a faster algorithm, tune your maps/nav meshes/grids, deploy iterative or hierarchical pathfinding, etc.

Ugh - I hope I don't sound too much like I'm preaching or teaching - I get carried away by this topic - sorry ;-) By the way, the main point of this text is: embrace deferring computations and the receipt of their results to enable more asynchronicity.
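Option 2 above can be sketched in Python (all names are invented, and a trivial stand-in sits where the real pathfinder would go): a query may age for up to K frames, and only if it still hasn't arrived by then does the main loop block on it, trading a possible frame hiccup for deterministic results.

```python
import queue
import threading

K = 3  # a query is allowed to take up to K frames

class DeferredPathService:
    def __init__(self):
        self.requests = queue.Queue()
        self.results = {}
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            req_id, start, goal, done = self.requests.get()
            self.results[req_id] = [start, goal]  # stand-in "path"
            done.set()  # published-before-set, so wait() is safe

    def post(self, req_id, start, goal):
        done = threading.Event()
        self.requests.put((req_id, start, goal, done))
        return done

service = DeferredPathService()
pending = {1: (service.post(1, (0, 0), (5, 5)), 0)}  # id -> (event, age)
paths = {}

for frame in range(10):  # the "main loop"
    for req_id, (done, age) in list(pending.items()):
        if done.is_set() or age + 1 >= K:
            done.wait()  # block only when the K-frame deadline hits
            paths[req_id] = service.results[req_id]
            del pending[req_id]
        else:
            pending[req_id] = (done, age + 1)
```

Every run collects the result by frame K at the latest, so the game logic that consumes the path behaves identically regardless of how fast the worker happened to be.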

Kevin Dill on July 16th, 2008

I don't agree that we AI guys are always screaming for more processing power. I don't think I've ever worked on a game where the AI was significantly out of control in terms of performance, or where I felt squeezed. Academic AI guys might feel that way - but if they do, they're probably working on very different problems than the ones that keep me up late at night. I don't think more processing power would help all that much with the biggest problems we face. Better reasoner architectures, better ways of encoding decision-making parameters, better tools to automate or simplify data creation, an animation revolution that does for AI what the existing revolution has done for physics... these things we could use. More processor time? Only to the extent that those other things need it (which they might well not).

With that said, AI for many games isn't that hard to subdivide. Path planning is an obvious place to start, but really only a tiny part. Many games already timeslice their path planning (meaning that you can't expect a result in the same update anyway). Putting that in a separate thread is a no-brainer. Going beyond that - it's usually not necessary for every AI to update every frame. In Kohan 2, our strategic AIs were only updating once every 20-30 seconds. We could easily have run each in its own thread if we wanted to - in fact, we did run each in its own lightweight fiber. This made it easy to timeslice them arbitrarily, while still allowing us to retain control of when they released the CPU (thus solving most, but not all, of our synchronization issues). We also had separate simulation and graphics threads (all of the AI fibers were part of the simulation thread). Again, I don't believe there was all that much pain as a result (at least, I didn't see our graphics programmer ripping his hair out over it), and I don't think this is all that uncommon an approach.

One thing to be aware of if you do go down this path... and the following is based only on my vague understanding of how things work (and significantly oversimplified even so), so if you're planning to work on these machines definitely take it with a grain of salt... Modern PCs and the Xbox 360 have multiple cores which can run separate threads fairly cleanly and easily, so they are fairly well suited to this sort of approach. You can imagine them as having two separate CPUs which share the same access to everything else, so it doesn't much matter where a thread runs. The problem here, of course, is that if the CPUs are fighting for access to other resources then you obviously don't get all that much speedup.

The PS3 is an entirely different beast. As I understand it, it has only one main CPU, with a bunch of little SPUs that have their own memory and little if any access to other resources. Basically the way an SPU works is that you send it a job and then wait for it to send the answer back. Sending it a job is expensive, because the bandwidth to the SPU is low, so you have to send it a task that is going to run a long time in order to make the startup cost worth it. However, each SPU has only a relatively small amount of memory (the 256k that Dave alludes to), so it needs to do that crunching with a limited data set. I personally think that writing AI optimized for the PS3 would be a really fun challenge - but only as a technical problem. From a professional point of view, I think it's unlikely that I would ever choose to go there.
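Kevin's fibers can be imitated with Python generators - a toy sketch (certainly not Kohan 2's actual scheduler; every name here is invented) in which each strategic AI yields the CPU voluntarily and the scheduler resumes each one on a fixed period, retaining full control over when they run.

```python
UPDATE_PERIOD = 3  # think "once every 20-30 seconds" in the real game

def strategic_ai(name, log):
    # A fiber: an AI that does one slice of work, then releases
    # the CPU on its own terms by yielding.
    step = 0
    while True:
        step += 1
        log.append((name, step))  # expensive strategic reasoning here
        yield

log = []
fibers = [strategic_ai("red", log), strategic_ai("blue", log)]

for tick in range(1, 10):  # the simulation thread's tick loop
    if tick % UPDATE_PERIOD == 0:
        for fiber in fibers:
            next(fiber)  # resume each fiber for exactly one slice
```

Because a fiber only loses the CPU at a `yield`, the scheduler knows its state is consistent at every handoff - which is the synchronization win Kevin describes, without preemptive threads.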

PaulTozour on July 21st, 2008

In the long run, everything is going to be multicore. Dealing with that is just part of our job. Also, I too agree with Eric; you get used to it.

kingius on July 21st, 2008

I think that you have to identify the choke points of your AI engine in order to understand what may benefit from being spun off into another thread. Anything that is already time-sliced (and as other commenters have pointed out, pathing is a great example of this) would be fairly straightforward to spin off into another thread, as your agent is awaiting the Pathing.Complete or Pathing.CannotComplete events to be sent back before it does anything anyway. Checking the validity of a time-sliced path is an already-solved problem: you just look into the resulting path and see if any of the steps are blocked, and if one is, calculate a new path from the last unblocked location to the target location. So pathing is (by its nature) a good fit for multi-threading because it is expensive to calculate and chews up CPU cycles.

Anything else that is expensive could, in theory, be a good fit for spinning off into other threads. For example, a turn-based strategy game could be running its planning phase while the player is playing his turn, and, using a number of assertions, it could check the validity of those assertions at the start of its turn to determine how good (valid) the resulting plan is and how much of it should be used. In fact, I think this basic idea of setting up assertions and conditions would be useful for any type of AI calculation made to run in parallel to the rest of the game engine, whether it be time-sliced, threaded, or just the result of an asynchronous request (a web service, perhaps).

Do I think we need a thousand cores? No, not at all. I suspect that after 8 we'll hit diminishing returns. Intel has a problem: quantum effects are preventing them from scaling up clock speeds, so their answer is to increase the number of cores instead. Will that really benefit us in the long run? Probably not.
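The repair scheme kingius describes - scan the cached path for a blocked step, then replan from the last unblocked location to the target - can be sketched with a tiny grid BFS standing in for the real (expensive) planner. All names are invented, and the sketch assumes the agent's current cell (step 0) is never blocked.

```python
from collections import deque

def bfs_path(start, goal, blocked, size=5):
    # Tiny breadth-first grid search standing in for the real planner.
    prev = {start: None}
    frontier = deque([start])
    while frontier:
        cur = frontier.popleft()
        if cur == goal:
            path = []
            while cur is not None:  # walk the prev-links back to start
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in blocked and nxt not in prev):
                prev[nxt] = cur
                frontier.append(nxt)
    return None  # no route exists

def repair_path(path, blocked, goal):
    # Walk the cached path; on the first blocked step, replan from the
    # last unblocked location to the target, as described above.
    for i, step in enumerate(path):
        if step in blocked:
            prefix = path[:i]
            return prefix + bfs_path(prefix[-1], goal, blocked)[1:]
    return path  # still valid; keep using it as-is

cached = bfs_path((0, 0), (3, 0), blocked=set())
repaired = repair_path(cached, blocked={(2, 0)}, goal=(3, 0))
```

The cheap validity scan runs every frame on the game side; the expensive replan only happens when the world actually changed under the path - which is what makes this pattern friendly to asynchronous pathfinding.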

Hodgman on July 23rd, 2008

"Do I think we need a thousand cores? No, not at all." Intel doesn't care what you think - we're going to end up with a thousand cores anyway ;-)

Dave Mark on July 23rd, 2008

And 5 years from now we will be saying "damn... I wish I had more cores to play with."
