When I started to write this column over a week ago, I was in the middle of a minor crunch-worthy catastrophe on the project that I was doing for a client. I’m working on a database project for a local retailer and we are in the process of trying to export a very large amount of data to his clients. (This is all relevant, stick with me here!) When we dumped the over 32,000 result rows of data from the database, my client decided to quickly check to see if a few select items were included. Much to his consternation, he couldn’t find the items that he was looking for. In the process of exploring why, I noticed that some of the data that was in there seemed like it might possibly not be correct. My client looked at my samples and confirmed that, indeed, they were showing errors in dollar amounts that were subtle but significant. And this is where my weekend went to heck… and why Alex covered for me in this column last week. (Again, hang in there, this description is more than an excuse for not being here!)
Ironically, the false starts that I made on this column were already outlining its theme: automated testing. I didn’t make the connection until later in the week, when my client asked me, “and how do we know that this new run is completely correct?” My short answer truly had to be, “unless we are willing to hand-check every one of the 32,000 rows of data, we can’t be sure.”
What it came down to was that we had to trust all the layers of data sources, queries, and algorithms — some of which were not even in our control. I had to check each step in the process, confirm that it was doing what it was supposed to do, check a quick sample of what it was spitting out, and move on to the next. I had to proceed under the tenuous premise that, if each step along the way was correct, the end result would be as well. As for that black-box data we were getting from elsewhere? Well, we had to trust that the programmer responsible for it had done his due diligence as well. Unless we wanted to check all 32,000 rows by hand and eye. And how do you tell at a glance if the numbers that spill out of a formula are correct or if they are off by some amount? Wow… too bad we didn’t have a way of automating the testing of the data! Or did we?
While none of the adventure above had anything at all to do with games, AI, bots, agents, simulations or even a lot of mathematical processes, there are some parts that may seem familiar to my gentle readers.
Game developers live in a world where we are often constructing our products in a less-than-holistic way. We often create a number of little black box processes that fit together like algorithmic Lego. If each block does its job, we end up at the ultimate goal of having created “a behavior” or “a sequence of behaviors”. But we are not just searching for “a behavior” in a particular situation - we want “the specific behavior” that is called for. And that specific behavior may be one of dozens, scores, or even hundreds of possible behaviors. To make matters worse, our little algorithmic Lego don’t always fit together quite as comfortably as we would like. They can twist and turn and morph… and often aren’t really meant to click together at all.
Oh… and to make matters worse? As game developers, we are working in 4 dimensions, not just 3. Our solutions must be correct in the dimension of time - not just in a snapshot slice. Not only do we concern ourselves with sequences that necessarily must occur over time, but the distances measured in time are important as well. Did this happen quickly enough? Was there enough of a delay? Was it in sync with another related but unconnected behavior elsewhere?
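As a hedged sketch of how that time dimension might be checked automatically: record a timestamped event log during a test run, then assert that the gaps between related events fall inside designer-specified windows. The event names and the reaction window here are entirely hypothetical.

```python
# Sketch: testing the *time* dimension of a behavior log.
# Event names and the 0.5-2.0 second reaction window are hypothetical.
events = [
    (0.0, "alarm_raised"),
    (1.4, "guard_reacts"),
    (3.0, "guard_reaches_player"),
]

def time_between(log, first, second):
    """Seconds elapsed between the first occurrence of two named events."""
    t1 = next(t for t, name in log if name == first)
    t2 = next(t for t, name in log if name == second)
    return t2 - t1

# Did the guard react quickly enough -- but not implausibly fast?
delay = time_between(events, "alarm_raised", "guard_reacts")
assert 0.5 <= delay <= 2.0, f"reaction delay of {delay}s is out of range"
```

The same pattern extends to synchronization checks: assert that two related events on different agents land within some tolerance of each other.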
Another caveat… Unlike the project I was working on, where the inclusion of certain rows and specific, checkable numbers were the output, in game AI the numbers themselves are only a means to an end. The behaviors we are looking for are sometimes a little more obscure - more subjective than objective. While you can test to see whether a result number is what you were hoping for when it gets spit out of the other end, no test you can create can replace the very subjective, sense-based jury of “that looks right”.
How Big IS the Big Bang?
How complex could agent-based AI really get? It’s only a few behaviors after all… I was dealing with 32,000 rows of data culled from a batch of over 400,000 items taken from multiple systems, through multiple processes and multiple filters. Could game AI really be that complicated?
Let’s start with a simple hypothetical:
1 agent with 5 states
5 states with 1 transition each = 5 transitions (never mind that…)
5 states with 1 transition to each of the other 4 states = 5 * 4 = 20 transitions?
5 interacting agents, each with 5 states = 5^5 = 3125 combinations of the agents’ states
3125 agent-state combinations * 4 potential transitions for each of 5 agents (20) = 62500 potential individual transitions at any given moment.
At an average of one state transition per agent per second, over a 5 second period, there could be 3125 potential sequences of the 5 states. The combination of sequences over the 5 seconds between all 5 agents is… uh… 3125^5 = 298,023,223,876,953,125
So if we change that one parameter for that one transition threshold for that one agent by 0.5%, it’s only a small change, right? If we wanted to test the ramifications of that parameter and how the 5 agents interact over time, we would only have to test… how many situations? 298 * 10^15? You know what? Never mind. My 32,000 rows of simple data start to look attractive.
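For the skeptical, the arithmetic above can be reproduced in a few lines, using the same hypothetical agent and state counts:

```python
# Reproducing the back-of-the-envelope numbers from the hypothetical above.
STATES = 5   # states per agent
AGENTS = 5   # interacting agents
SECONDS = 5  # one transition per agent per second

transitions_per_agent = STATES * (STATES - 1)  # 5 * 4 = 20
state_combos = STATES ** AGENTS                # 5^5 = 3125
transitions_now = state_combos * transitions_per_agent  # 62,500

sequences_per_agent = STATES ** SECONDS        # 5^5 = 3125 over 5 seconds
sequences_all = sequences_per_agent ** AGENTS  # 3125^5

print(transitions_per_agent)  # 20
print(state_combos)           # 3125
print(transitions_now)        # 62500
print(sequences_all)          # 298023223876953125
```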
Photo 1: An artist’s rendition of what the combinatorial explosion of game AI feels like. Does anyone here want to claim that any one of those millions of stars is out of place?
A few weeks ago in this column, I mentioned that, at an E3 panel discussion, the luminaries Will Wright, Peter Molyneux and David Jones stated that automated testing was an important tool in taming the emergent behavior in their respective games. Automated testing has increasingly become a staple of game development; in fact, it is a required component of methodologies such as Agile development. Being able to run unit tests on various systems is a solid production method that helps ensure (or at least increases confidence) that each of the building blocks is performing as advertised. But at what point can you no longer automate the testing? Again, when you get to the subjective nature of the way AI looks, you begin to realize that automated testing gets a little dicey.
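What does a unit test for one of those building blocks look like in practice? A minimal sketch (the Guard class, its states, and its threat thresholds are all hypothetical, not from any real code base):

```python
# Hypothetical state machine for a guard agent, plus a unit test
# that pins down its transition rules.
class Guard:
    def __init__(self):
        self.state = "patrol"

    def update(self, threat_level):
        # Transition rules under test: flee above 0.8,
        # attack above 0.5, otherwise keep patrolling.
        if threat_level > 0.8:
            self.state = "flee"
        elif threat_level > 0.5:
            self.state = "attack"
        else:
            self.state = "patrol"

def test_guard_transitions():
    g = Guard()
    g.update(0.3)
    assert g.state == "patrol"
    g.update(0.6)
    assert g.state == "attack"
    g.update(0.9)
    assert g.state == "flee"

test_guard_transitions()
print("all transition tests passed")
```

Each block verified this way earns a little of that “performing as advertised” confidence - but note that the test says nothing about whether fleeing at 0.8 *looks* right in the game.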
Human vs. Computer?
But if you leave the testing to human eyeballs alone, there are other issues. Can you play through a battle, a level, or an entire game and point to a particular behavior that never showed up? Like the rows that my client found quite by accident, if you aren’t specifically looking for something, you aren’t going to notice its absence. And what of the subtle differences such as the prices that were incorrect in my client’s data? They were a string of numbers that were in the right range - but weren’t exact. We wouldn’t have known unless we looked for it specifically. There are behaviors that may look rightish (to mimic the vernacular of my teenagers) but aren’t exactly right. While that may be quickly testable by a designer on the fly at development time, what about that combinatorial explosion I illustrated above?
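One way to catch an absence automatically is to log every behavior the agents actually exhibit during a long test run and diff it against the full roster the designers expect. A sketch, with hypothetical behavior names:

```python
# Sketch: detecting behaviors that *never* showed up in a play-through.
# The behavior roster and the observed log are both hypothetical.
expected = {"patrol", "attack", "flee", "search", "call_for_help"}

observed_log = ["patrol", "attack", "patrol", "attack", "flee"]

missing = expected - set(observed_log)
if missing:
    print("behaviors never observed:", sorted(missing))
    # prints: behaviors never observed: ['call_for_help', 'search']
```

This flags the “never showed up” case, but it cannot flag the “showed up looking rightish but not right” case - that still takes eyeballs.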
How far do you have to wade into that morass until you can be certain that those little innocuous vagaries aren’t going to develop into something completely obnoxious? Can you truly test them all? Can you guarantee that you have checked your work enough times and in a broad enough manner to cover all the contingencies? Let’s face it… no matter what sort of QA budget we secure, it will be nothing compared to the QA of even a hundred thousand users playing through millions of times and encountering hundreds of millions of iterations of events. And, in the days of blogs, Digg and YouTube, you can be quite sure that every little unaccounted-for parameter will be well publicized.
So what’s the answer? Certainly automated AI testing has a place - more so in some genres and applications than others. Is this something that needs to be explored further, however? And what are some potential solutions for finding things that are not there, making sure that behaviors fall within parameters, or simply look reasonable? And most importantly, how do we make sure that we have explored all the dark nooks and crannies of the potential state space at the far reaches of that combinatorial explosion, so that our delicate cosmic balance doesn’t get sucked into an algorithmic black hole?
Join this week’s discussion and post a comment below or in the forums!