Pointers and Vectors and Lists, OH MY!

Dave Mark on September 29, 2008

Dave Mark is back to his regular Tuesday slot, and also based on popular demand, he’s digging deeper into the process of developing AI. Let him know how you manage your AI data and post a comment below!

A vector of lists containing pointers...

I have a 13-year background in relational database design. For those that aren’t familiar with the basic tenets of relational database models, here’s one of the major principles. You cluster like data in tables. Records in any given table can point to related records in other tables. You don’t necessarily have to point to other data, but if you do point to data in another table, it damn well better be there. In fact, the technical term for such a rule is “referential integrity.” If a record gets deleted, anything that points to it from other tables needs to be deleted as well. Thankfully, the tools in such platforms as Access and SQL Server take care of a lot of the heavy lifting in that regard… often as simple as clicking boxes such as “Enforce Referential Integrity” and “Cascade Delete”.

Coming from that database background, it was a peculiar transition to not have that model in place for me. As I started working on my very data-intensive airline management simulation in C++, I tossed about between various models of how to arrange my classes and objects. I had large collections of like items and there were certainly well-defined relationships between different types of items. I longed for the simplicity of a database model… in fact, at first it was difficult for me to break out of that mentality.

The two questions that I needed to solve were:

  1. How do I store these collections of information?

  2. How do I reference individual items in that collection from elsewhere in the game?

What a Wide Array!

Obviously, one method that is a staple of programming was to use arrays. Let’s face it, storing data in an array is pretty darn simple. And accessing it later is as easy as referencing the array by index. However, while that worked well for data that had a cap on it like the number of airlines, cities and airports in the game, it faltered somewhat on data that could wildly change in size — such as the number of aircraft, flights or passengers in the game at any point. I hated to define a fixed size array — chewing up an associated fixed amount of memory — for something that may never be used.

“[They] not only changed in numbers but also tended to be rather short-lived.”

On the other side of the spectrum, aircraft, flights and especially the passengers was something that not only changed in numbers but also tended to be rather short-lived… particularly in the case of the passengers. I could place arbitrary limits on the numbers, but they would have to be suitably large to accommodate the maximum at any one time. (In fact, I do place limits on the maximum numbers but it is done via the simulation model rather than the data structure itself… don’t ask.)

Perhaps I should explain something with regards to my passenger model so as to really set the tone of this issue. With my extreme focus on realism in this situation, I am generating hundreds of thousands of passenger blocks (each of which can represent between 1 and about 25 passengers). Sometimes these pax (pardon my abbreviation… when you have typed “passengers” thousands of times, it gets old) will be around for the few moments it takes them to determine that they can’t find a suitable itinerary and then are removed. Sometimes they will be traveling in only a few days or even hours of the simulation time. Sometimes they are booking up to 8 weeks in advance. That means that I needed a data structure that would allow me to have pax records jumping in and out very often.

These pax also needed to referenced individually by the other entities in the game. For example, a flight scheduled weeks into the future had to know who the pax were that had booked on it (to know how full it was, for example). A flight that was about to depart had to keep track of its “pax manifest” so that it could update the states and locations of the pax that were on board. This lends itself intuitively to an array setup. When you want a pax, you look him up by index.

Gimme a Vector, Victor

A nice happy medium was using an STL vector container. It would dynamically expand as necessary to accommodate the growing (and shrinking) lists. I didn’t have to pre-define the numbers and therefore pre-reserve the memory. I could still reference items by number as well. However, there was one further problem.

If I used a vector or an array, however, things would start out nice and contiguous as I generate my initial records… but eventually I would have more holes than filled spots as records got deleted. That was partially dealt with by not removing the record itself but changing its state to “empty” and adding it’s index to a list of empty slots. When I created a new record, I would pop the first record in the “empty” list and store the new record in the referenced slot. That also helped to keep the memory from getting splashed all over the inside of my machine.

I did have one concern with the array/vector-like structure, however. With all of the brutal processing that I do when dealing with that many pax, I worried that search times in an array should start to get costly. When you are working with a few dozen pax blocks, you don’t care that the entire list is 600,000 records long… you just want to go straight to the ones you are concerned with. That led me to store the references to those pax via pointers rather than by their index numbers.

Could You Give Me Some Pointers?

Untangle THIS!

Of course, pointers have their own caveats… especially with regard to saving and loading. As we all know, if I were to save and reload the game, there is no guarantee (and that’s putting the odds kinda lightly) that a particular pax record is going to end up at exactly the same place on my silicon. Let’s face it… it simply won’t end up in the same spot. That means that I would have to serialize the data when I save it into something indexed and then store the index when I save the records for the flights, etc. When I load, I would have to do the process in reverse — “OK, Pax number 117,832… grab a seat over there in the memory and then tell me what memory number is on your chair so I can tell your flight where you are sitting.” While this was not a pleasant prospect, I would rather kill the time on a save or load sequence if it meant speed while running the game.

I’ve toyed around with a hybrid of both. Hold the index number of a pax for reference, but also hold the direct memory pointer for doing the actual labor. That got kinda silly, however.

All Together Now!

For purposes of this column, I’ve actually made things even simpler than they are. For example, my Pax actually contain other objects such as their itinerary which is a list of pointers to the flights they are booked on. It may be empty or it may contain between one and three flights. (I had to cap the number of flights they were willing to take at 3 because the combinatorial explosion on booking itineraries when there are thousands of possible combinations of flights was a complete… well… you know.) So, since the sizes were different, that makes storing the objects in the array not feasible. So I have such arrangements as a CPaxCollection which has a list of pointers to CPax which has (among other things) an object CItinerary which as a list of pointers to CSchedFlight. (All CSchedFlight records are also held in a vector in CSchedFlightCollection.) There’s more where that came from, too.

I don’t claim to have the knottiest problem in game development. I simply used my experience with my game to show some of the pitfalls that I encountered as I progressed. I often thought that there had to be an easier way! How different this all is from just having database tables for Pax, Flights, Itineraries and what-not!

“We are pushing a lot of interrelated data around these days.”

Of course, data structures is certainly a topic that cuts right down to the core of game development in general but also very much to AI. As the numbers of both the entities that we manage and the world parameters that we reference increase in numbers and types, we are continually going to be faced with how best to manage those links. What’s more, we are increasingly relying on layers of abstraction to support such concepts as knowledge representation and group coordination. We are pushing a lot of interrelated data around these days.

There have been a handful of speeches, papers and examples of how to handle these issues. There was even a column in the latest Game Developer magazine by Noel Llopis on “Managing Data Relationships” where he talked about “smart pointers”. And yet no one seems to have the proverbial “silver bullet” in this regard.

What strategies have you employed in your own work to handle complex interactions between large amounts of data? Do you prefer the orderly arrays and vectors or the “straight to the point” pointers? How do you manage serialization? Your thrashed memory? Your migraine?

Discussion 9 Comments

memoni on September 30th, 2008

For dynamic data, freelists are awesome. Sometimes I like to hide them inside the data items too to save memory (and to shoot my toes off later when debugging). I tend to use handles when someone needs reference to data which is hidden inside a system (say a reference to pathfinder node). I usually use two types of handles, index+1 for static data (so that zero handle means NULL) and index+1|refcount*65536 for dynamic data. Refcount is incremented when an item is deleted so handles to deleted data can be identified and indices reused. I think I picked that up from fmod API once upon a time. The reason I like handles is that they can be serialized and that with the refcount trick, I do not need to have callbacks (and trash cache) when an object is deleted. Additionally the handle can be used as unique identifier in cases when an object is registered into some other data structure like some spatial partitioning system. The user of the handle needs to also handle the case that the API returns null object when using the handle, pointers are guaranteed to be valid within only for one tick. Sometimes it is not possible to use index in the handle. In that case I usually use some sort of hash table, that is flat array of linked lists. The array is accessed using hash of the "index" value and the correct item is found via linear search through the linked list. That is good for stuff like closed lists or with some extra work it can be used as simple spatial partitioning system too. I think in my cases I usually have few orders of magnitude less data, but I want to access them all efficiently and the query constraints can vary a great deal, like that I wan to iterate over all nodes around certain location or iterate over all nodes of certain type. Then store references to those nodes and later lookup them really fast.

Mangara on September 30th, 2008

One of the reasons I love Java. Deleting stuff? No problem, the garbage collector will clean it up. Want to save stuff and restore it to exactly the same state? Just mark everything as Serializable. Want fast random access to a dynamic set of items? Use the built-in (type-safe, using generics) HashSet or ArrayList. Of course using Java for games development brings along a whole new set of issues, mainly concerning performance, but as long as you don't really need that it's a great language because it takes so many work out of your hands, letting you focus on the interesting stuff instead of things like storage.

Ian Morrison on September 30th, 2008

Damn good article, Dave. I'm personally getting bitten in the ass by my own poor architecture decisions early on. However, I've got a few ideas to fix it. I'm becoming partial to object handles too... it slows down indirection a bit, but you can get rid of most of that by using the STL map class to keep a map of object ID numbers and the object pointers. Bad references become a little easier to deal with, since you can have the cleanup automated on the other end the moment the object handle doesn't give useful information. It makes serialization/deserialization pretty easy, too.

jamesford42 on October 1st, 2008

So lets see, you have lots of data and need unique identifiers, lots of cross referencing with automatic verification, and saving / loading of it all. How about a database? '-) Seriously though, why aren't you using a database like sqlite for example, this game really sounds like a perfect match for some database backend. If its a speed issue? Maybe you already went through all this, I'm just curious. It if was me, I'd try sqlite first (since I have some limited experience with it and its good for single-player / no server required), and only give up on it if it really, really wasn't fast enough. There's also FastDB a memory-resident-only database, sounds interesting.

memoni on October 1st, 2008

While there are many pros using stl::map, I don't like how it hides the amount of memory it needs, uses dynamic allocations and has super slow iterators. I stumbled upon a presentation yesterday which touches this topic and many others which are important in order to get your code run fast: Machine Architecture: Things Your Programming Language Never Told You

Ian Morrison on October 1st, 2008

I don't know much about the memory allocations (honestly, I keep forgetting that memory is finite... I always think in terms of speed), but why would you use the iterator on the map at all? Why not just keep the usual list or vector for iterating through objects, and just use a map for object lookups? Sure, it's another reference you need to keep track of and remove when objects disappear, but that seems a small price to pay for having quick access to objects via an ID.

memoni on October 2nd, 2008

Sometimes it is just tempting to use iterators :) Using map as just lookup sounds like the right thing to do. In modern platforms memory usage patterns (and hence cache usage) and are super important for speed. For various other reasons (say, fragmentation and just the time it takes to alloc memory) mallocs are bad too. The lecture I mentioned above explain that stuff pretty well. I've seen an order of magnitude improvements in code where I have dropped stl maps or sets in favor of simple preallocated arrays + some extra manual labor and that definitely has made me to think the whole programming thing again in more spartan way (straight C, know your memory usage) than the more "philosophian" way (the kind of thing that happens to people when they use java ;). I don't mean that you should drop all the cool stuff out there but the changes are that your tailor made lookup is going to be faster because you know the limitations of your data, than a more generic algorithm that is designed to be used in many different situations.

Ian Morrison on October 2nd, 2008

Fair enough, but certainly it would be possible to build your own hash tables for the job if the generic STL stuff was a performance bottleneck? Preallocated arrays would eventually run out of slots to point to objects with.... The article you linked to sounds interesting, but the PDF keeps crashing on me.

glopes on October 2nd, 2008

If only for the sake of expressiveness and ease of coding, looking up the abstractions built up around the .NET framework is amazingly refreshing and inspiring. The LINQ (Language Integrated Query) extensions for .NET basically allows one to express relational queries on regular data structures (and databases or XML for that matter, since it's an abstract layer). I won't press the C++ expressivity war, but I sincerely believe that in the long run, as we begin to care less and less about processing power, we'll properly appreciate the expressive power of higher-level languages like C#. Much like what happened with C++ versus C not so many years ago...

If you'd like to add a comment or question on this page, simply log-in to the site. You can create an account from the sign-up page if necessary... It takes less than a minute!