Loadtesting for Open Beta, Part 2

In our last exciting episode of Loadtesting for Open Beta, we did some initial profiling to see how the lobbyserver held up under attack by a phalanx of loadtesting robots spawned in the cloud. It didn’t hold up, obviously, or the beta would already be open.

Specifically, it failed by saturating the server’s 100Mbps network link, which turned out to be a great way to fail because it meant there were some pretty simple things I could do to optimize the bandwidth utilization.  I had done the initial gamelobby protocol in the simplest way possible, so every time any player state changed, like a new connection, or switching from chatting in the lobby to playing, it sent out the entire list of player states to everybody.  This doesn’t scale at all, since as you add more players, most aren’t changing state, but you’re sending all of their states out to everybody even if only one changes.  This doesn’t mean it was the wrong way to program it initially; it’s really important when you’re writing complicated software1 to do things the simplest way possible, as long as you have a vague plan for what you’ll do if it turns into a problem later.  In this case, I knew what I was doing was probably not going to work in the long run, but it got things up and running more quickly than overengineering some fancy solution I might not have needed, and I waited until it actually was a problem before fixing it.

Tell Me Something I Don’t Know

The solution to this problem is pretty obvious: differential state updates.  Or, in English, only send the stuff that’s changed to the people who care about it.  Doing differential updates is significantly more complicated than just spamming everybody with everything, however.  You still have to send the initial state of all the curent players when new players log in, you have to be able to add and remove players in the protocol, which you didn’t have to before because you were just sending the complete new state every time, etc.

This was going to be a fairly large change, so I took it by steps.  I knew that I’d have to send out the complete state of everybody to new logins, so it made sense to start by optimizing that initial packet using normal data size optimization techniques.  I pretty easily got it from about 88 bytes per player down to 42 bytes per player, which is nice, because my goal for these optimizations is 1000 simultaneous players, and at 88 bytes they wouldn’t all fit in my 64kb maximum packet size, where at 42 bytes they should fit, no problem, so I don’t have to add any kind of break-up-the-list-across-packets thing.  However, it turns out I actually got the ability to send the entire list across multiple packets while I was doing this, because I had to program the ability to add players as part of the differential updates, so now I could just use that packet type to send any clients in a really large player list that didn’t fit in a single packet.  But, like I said in the last episode, although I don’t think I’ll hit 1000 simultaneous outside of load testing for a while, it’s always nice to know you have that sort of thing in your back pocket for the future.

Once I’d tested the new optimized player list, I started making the updates differential.  New players get the initial list, and then they’re considered up-to-date and just get updates along with everybody else.  The list of new players is sent as additions to players already in the lobby.  For each player, I track some simple flags about what’s been updated in their state, so if they set or clear their /away message for example, that flag is set, and I only send that information.

In programming, usually when you’ve got the right design, you get some unintentional upside, and this case was no different.  Previously, I was not sending live updates to player stats (wins, game time, etc.) to the players in the lobby until the player was done playing the match, or some other state changed that caused everybody’s state to be re-sent.  Now, since the differential updates are efficient, I’m updating player stats in real time as well, so people in the lobby can see wins as they accumulate for players in matches, which is nice and how you’d expect it to work.

Results

It basically worked exactly as planned.  After lots of debugging, of course.  Here you can see the profiles for one of the loadtests, which got to 340 simultaneous players in the lobby:

I really need to have the robot Sniper win sometimes.

 

atop in memory mode

 

atop in cpu mode

Look ma, 3% network utilization!  That’s whats so awesome about a really spiky profile…when you pound one of the spikes down, things just get better!

Here’s the new table of packet sizes for this run.  If you compare this with the previous results, you can see the PLAYER_LIST packets are way way way smaller, and this table was accumulated from two longer test runs, so it’s not even a fair comparison!  It’s interesting, because the TYPE_LOBBY_MESSAGE_PACKET is smaller as well, and I think that’s because now the robots can actually start games since the network isn’t saturated, and this means they don’t broadcast chats to the entire lobby while they’re playing, so that’s a nice side effect of optimizing the bandwidth.

Packet Type Total Bytes
TYPE_LOBBY_MESSAGE_PACKET 58060417
TYPE_LOBBY_PLAYER_LIST_UPDATE_PACKET 29751413
TYPE_CLIENT_GAME_JOURNAL_PACKET 18006186
TYPE_LOBBY_ROOM_LIST_PACKET 16674479
TYPE_LOBBY_PLAYER_LIST_ADDITION_PACKET 4280563
TYPE_LOBBY_PLAYER_LIST_PACKET 3482691
TYPE_CLIENT_MESSAGE_PACKET 1501822
TYPE_CLIENT_LOGIN_PACKET 477356
TYPE_CLIENT_INVITE_PACKET 435368
TYPE_LOBBY_INVITE_PACKET 275781
TYPE_LOBBY_LOGIN_PACKET 235878
TYPE_LOBBY_GAME_ID_PACKET 96000
TYPE_LOBBY_GAME_OVER_PACKET 68901
TYPE_CLIENT_GAME_ID_CONFIRM_PACKET 40257
TYPE_LOBBY_PLAY_PACKET 32498
TYPE_CLIENT_IN_MATCH_PACKET 25714
TYPE_LOBBY_IN_MATCH_PACKET 21204
TYPE_CLIENT_CANDIDATE_PACKET 16089
TYPE_CLIENT_PLAY_PACKET 12419
TYPE_CLIENT_GAME_ID_REQUEST_PACKET 9610
TYPE_LOBBY_WELCOME_PACKET 4494
TYPE_CLIENT_JOIN_PACKET 4494
TYPE_KEEPALIVE_PACKET 1011
TYPE_CLIENT_IDLE_PACKET 24

Hmm, I just noticed as I’m writing this that the resident memory utilization in the atop screenshot is way lower now than before…I wonder why… On the application side I take about 250kb per player right now, which at 340 players should be about 85MB.  Looking at the lobbyserver logs, right about when the screenshot was taken, the lobby self-reported this data:

2013/03/03-02:13:15: MEMORY_POSIX 348/757/409: resident 12808/12808, virtual 160276/160276
2013/03/03-02:13:15: MEMORY_NEW 348/757/409: bytes 91766974, news 45707, deletes 36155

The MEMORY_NEW stats looks about right for this load and my quick math, but the MEMORY_POSIX stats—which are read from /proc/pid/status—match the atop results: expected virtual but low resident.   Maybe it was just paged out for a second, or maybe I’m not touching much of that 250kb and so it doesn’t stay resident.  A lot of it is network buffers, so it makes some sense with this lower bandwidth protocol that it wouldn’t be resident compared to last profile because less buffering is having to be done.  I’ll have to investigate this more.

Up Next, The Case of the Missing Robots

So, the bandwidth optimizations were a resounding success!  Plus, both the CPU and memory utilization of the lobbyserver are really reasonable and haven’t been optimized at all, so we’re sitting pretty for getting to 1000 simulataneous robots…

Except, where are the remaining 160 robots?  In the test above, I ran 10 EC2 instances, each with 50 robots, thinking the optimizations might let me get to 500 simultaneous and find the next performance issue…but it never got above 340 in the lobby.  I updated my perl loadtesting framework and had each instance output how many lobbyclients were running every two seconds with this shell command over ssh:

'while true; do echo `date +%T`,`pidof lobbyclient | wc -w`; sleep 2; done'

And then I loaded that into gnuplot,2 and graphed the number of robots on each instance:

The number of loadtest robots running on each EC2 instance.

You can see that they all started up with 50, but then a bunch of them lost clients until they found a steady state.   Something is killing my robots, and I need to figure out what it is…

Turn the page to Part 3…

  1. especially by yourself! []
  2. …which I hate, but I forgot to install excel on my new laptop, and Google’s spreadsheet sucks at pivottables, and the Office for Web excel doesn’t even have them as far as I could tell! []

Loadtesting for Open Beta, Part 1

Way back in 2011, right before I opened up Early-Access Beta signups, I loadtested and optimized the signup page to make sure it wouldn’t crash if lots of people were trying to submit their name and email and confirm their signup. I always intended to write up a technical post or two about that optimization process because it was an interesting engineering exercise, but I have yet to get around to it. However, I can summarize the learnings here pretty quickly: WordPress is excruciatingly slow, Varnish is incredibly fast, I ♥ Perl,1 Apache with plain old mod_php (meaning not loading WordPress) was actually way faster than I expected, slightly faster even than nginx + php-fpm in my limited tests, CloudFront is pretty easy to use,2 and even cheap and small dedicated servers can handle a lot of traffic if you’re smart about it.

Like with any kind of optimization, Assume Nothing, so you should always write the loadtester first, and run it to get a baseline performance profile, and continue running it as you optimize the hotspots. When I started, the signup submission could only handle 2 or 3 submits per second. When I was done, it could handle 400 submissions per second. I figured that was enough.3 If more than 400 people were signing up for the SpyParty beta every second, well, let’s file that under “good problem to have”.

After all the loadtesting and optimizing, the signups went off without a hitch.

Loadtesting and optimizing the beta signup process was important, because the entire reason I took signups instead of just letting people play immediately was “fear of the unknown”. I couldn’t know in advance how many people would be interested in the game, and getting a couple web forms scalable in case that number was “a lot” was much easier than getting the full game and its server scalable, and that’s ignoring the very real need to exert some control over the growth of the community, to make sure the game wasn’t incredibly buggy on different hardware configurations or that there wasn’t some glaring balance issue, etc. Overall, starting with signups and a closed beta was great for the game, even if it’s meant frustrating people who signed up and want to play.

But it’s been long enough, and I’m now finally actively loadtesting and optimizing for opening the beta!

Lobby Loadtesting Framework

Like with the signup form, I’m loadtesting first. This will tell me where I need to optimize, and allow me to test my progress against the baseline. However, loadtesting a game lobby server is a lot more complicated than loadtesting a web form, so it’s a bit slower-going. I’ve had to create a robot version of the game client that logs into the lobby, chats, invites other robots to play, and then reports on the results of the fake games played. I build this on top of the game’s client interface, so it looks just like a real game to the lobby.

As with all testing, you need to make sure you aren’t Heisenberg-ing4 your results, so I wanted to get fairly close to the same load that would happen with multiple real game clients hitting the server. This means I had to have a good number of machines running these robots hitting the test lobby at the same time, and that means using cloud computing. I was inspired by the bees with machine guns article about using Amazon Web Services’s Elastic Compute Cloud (EC2) to launch a bunch of cheap http load testers. I use AWS for SpyParty already, distributing updates and uploading crashdumps using S3, so this seemed like a good fit. At first I tried modifying the bees code to do what I want, but I found the Python threading technique they used for controlling multiple instances didn’t scale well running on Windows, and since I wanted more control over the instances anyway and the core idea was not terribly difficult to implement, I wrote my own version in Perl, which I’m much more familiar with. The code uses Net::Amazon::EC2 to talk to AWS to start, list, and stop EC2 instances, and Net::SSH2 to talk to the instances themselves, executing commands and waiting for exit codes, downloading logs, and whatnot. I just use an existing CentOS EC2 AMI5 and then have the scripts download and install my robots onto it from S3 every time I start one up; I didn’t want to bother with creating a custom AMI when my files are pretty small. I’m going to post all the loadtest framework code once I’ve got it completely working so others can use it.

How Much is Enough?

In loadtesting the loadtesters, I found that an m1.small instance could run about 50 loadtest bots simultaneously with my current client code. I can switch to larger and more expensive EC2 instance types if I need to run more robots per instance, and as I optimize the server I’m pretty sure the client code will get optimized as well, which will allow more concurrency. Amazon limits accounts to 20 simultaneous EC2 instances until you apply for an exception, so I’ve done that,6 but even with that limitation, I can loadtest to about 1000 concurrent clients, which seems like more than enough for now.

I still don’t know exactly what to expect when I open up the beta, but I don’t think I’ll hit 1000 simultaneous SpyParty players outside of loadtesting anytime soon. If you look at the Steam Stats page, 1000 simultaneous players is right in the middle of the top 100 games on the entire service, including some pretty popular mainstream games with mature player communities. In the current closed beta, I think our maximum number of simultaneous players has been around 25, and it’s usually between 10 and 15 on any given night at peak times, assuming there’s no event happening and I haven’t just sent out a big batch of invites. I still have about 6000 people left to invite for the first time from the signup list, and 9000 who didn’t register on their first invite to re-invite, all of whom I’ll use for live player loadtesting after the 1000 robots are happily playing without complaints. I think the spike from those last closed invites will be bigger than the open beta release spike, unless there are a ton of people who didn’t want to sign up with their email address, but who will buy the game once the beta is open. I guess that’s possible, but who knows? Again, if we go over 1000 simultaneous, I guess I will scramble to move the lobby to a bigger server, and keep repeating the “good problem to have” mantra over and over again, but I’m betting it’s not going to happen and things will go smoothly.

After open beta there will be a long list of awesome stuff coming into the game, including new maps and missions, spectation and replays, the new art, and lots more, but once things are open it’ll be easier to predict the size of those spikes and plan accordingly. Eventually I’ll probably (hopefully?) have to move the lobby off my current server, but I’m pretty sure based on my initial testing that the old girl can keep things going smoothly a bit longer.

Initial Loadtesting Baseline

Okay, so what happens when I unleash the robots? Well, I haven’t let 1000 of them loose yet, but I’ve tried 500, and things fall over, as you might expect. It looks like around 250 is the maximum that can even connect right now, which is actually more than I thought I’d start out with.

The loadtesting robots are not very good conversationalists.

Things don’t work very well even with 250 clients, though, with connections failing, and match invites not going through.7 However, when I looked at atop while the robots were pounding on the lobby, a wonderful thing was apparent:

atop in CPU mode

 

atop in memory mode

Neither the CPU utilization nor the memory utilization was too terrible, but the lobbyserver was saturating the 100 Mbps ethernet link! That’s awesome, because that’s going to be easy to fix!

Before I explain, let me say that the best kind of profile is one with a single giant spike, one thing that’s obviously completely slow and working poorly. The worse kind of profile is a flat line, where everything is taking 3% of the time and there’s no single thing you can optimize. This is a great profile, because it points right towards the first thing I need to fix, which is the network bandwidth.

My protocol between the game clients and the lobby server is really pretty dumb in a lot of ways, but the biggest way it’s dumb is that on any state change of any client, it sends the entire list of clients and their current state to every client. This is the simplest thing to do and means there’s no need to track which clients have received which information, and this in turn means it’s the right thing to do first when you’re getting things going, but it’s also terribly wasteful performance-wise compared to just sending out the clients who changed each tick. So, I was delighted to see that bandwidth was my first problem, because it’s easy to see that I have to fix the protocol. I’m guessing switching to a differential player state update will cut the bandwidth by 50x, which will then reveal the next performance spike.

I can’t wait to find out what it will be!8

Oh, and the total EC2 bill for my loadtesting over the past few days: $5.86

So…Open Beta?

Within weeks! Weeks, I tell you!

Oh, and as I’ve said before, everybody who is signed up will get invited in before open beta. I will then probably have a short “quiet period” where I let things settle down before really opening it up, so if you want in before open beta, sign up now.

Update: Assuming More Nothing…Er, Less Nothing?

After posting this article, I was about to start optimizing the client list packets, when it occurred to me I wasn’t assuming enough nothing, because I was assuming it was the client list taking all the bandwidth. This made me a bit nervous, which is the right feeling to have when you’re not following your own advice,9 so I implemented a really simple bit of code that accumulated the per-packet send and recieve sizes, and printed them on exit, and then threw another 250 robots at the server for 60 seconds. The results validated the client list assumption, it’s by far the biggest bandwidth consumer, sending 1.6GB in 60 seconds.10 However, it did show that the lobby sending chat and status messages to the clients is also maybe going to be a problem, so yet again: measuring things is crucial.

Packet Type Total Bytes
TYPE_LOBBY_PLAYER_LIST_PACKET 1632549877
TYPE_LOBBY_MESSAGE_PACKET 66687600
TYPE_LOBBY_ROOM_LIST_PACKET 9474937
TYPE_CLIENT_INVITE_PACKET 303056
TYPE_CLIENT_MESSAGE_PACKET 226779
TYPE_CLIENT_LOGIN_PACKET 157795
TYPE_LOBBY_INVITE_PACKET 131667
TYPE_LOBBY_LOGIN_PACKET 77951
TYPE_KEEPALIVE_PACKET 43032
TYPE_CLIENT_GAME_JOURNAL_PACKET 5478
TYPE_LOBBY_PLAY_PACKET 1888
TYPE_LOBBY_WELCOME_PACKET 1491
TYPE_CLIENT_JOIN_PACKET 1491
TYPE_CLIENT_PLAY_PACKET 836
TYPE_CLIENT_IN_MATCH_PACKET 713
TYPE_LOBBY_IN_MATCH_PACKET 532
TYPE_CLIENT_CANDIDATE_PACKET 490
TYPE_LOBBY_GAME_ID_PACKET 300
TYPE_CLIENT_GAME_ID_REQUEST_PACKET 30

It’s interesting that the clients are only sending 300KB worth of chat messages to the lobby, but it’s sending 66MB back to them, but 66MB is around 250 * 300KB, so it makes back-of-the-envelope sense. I’m probably going to need to investigate that more once I’ve hammered the player list traffic down. Maybe I’ll have to accumulate them every tick, compress them all, and send them out.

This way to Part 2…

  1. See this thread for how I wrote the dynamic loadtesting form submission in a way that would saturate the network link. []
  2. I use CF for images and other static stuff, with W3 Total Cache to keep them synced to S3, but I only use W3TC for this CDN sync, since Varnish blows it out of the water for actual caching. []
  3. Let me be clear, I think 400 submissions per second is really pretty slow for raw performance on a modern computer, but web apps these days have so many layers that you lose a ton of performance relative to what would happen if you wrote the whole thing in C. For an interesting example of this, there’s a wacky high performance web server called G-WAN that gets rid of all the layers and lets you write the pages directly in compiled C. []
  4. I just read on wikipedia that the uncertainty principle is often confused with the observer effect, and so on the surface this verbing of Heisenberg’s name isn’t correct, except he apparently also confused the two, so I’m going to keep on verbing. []
  5. ami-c9846da0 []
  6. although they haven’t gotten back to me so I guess I’ll apply again…sigh, customer service “in the cloud”  Update: Woot!  My limit has been increased, now I can DDOS myself to my heart’s content! []
  7. Let’s ignore the lobby UI also drawing all over itself for now. []
  8. You can see the CPU usage is pretty high relative to the memory usage, and seeing slapd and krb5kdc in there is a bit worrying, since that’s kerberos and ldap, which are used for the login and client authentication and are going to be a bit harder to optimize if they start poking their heads up too high, but both of them have very battle-tested enterprise-scale optimization solutions via replication, so worst-case is I’ll have to get another machine for them, I think. If the lobbyserver itself is still CPU-bound after fixing the bandwidth issue, then I’ll start normal code optimization for it, including profiling, of course. I’ll basically recurse on the lobbyserver executable! []
  9. …let alone Mike Abrash’s advice! []
  10. Or actually trying to send, since 1.6GB in 60 seconds is 200Mbps, which is not happening on a 100Mbps link! []

Happy Valentine’s Day, Belated

The ever-awesome ZeroTKA had an awesome idea for a Valentine’s Day surprise, and then my daughter gave me a cold and I couldn’t get it done for the actual day.  But hey, it’s still Valentine’s Day in Damon’s heart.

This build, v0.1.2487.0, was supposed to be a quicky, but it snowballed bigtime.  Here are the release notes from the private beta forums:

Uh, this was supposed to be a quick update.

  • add idle timestamp to invite confirmation if player is idle
  • s/In Lobby/In Room/ in lobby status
  • auto-idle state in lobby after 5 minutes of no input to SpyParty
  • display away message when playing
  • tiebreaker for both players hitting spy or sniper at the same time in match menu – kind of a hack
  • wait a bit after putting down briefcase
  • couriers never stand on a pad to pick up or put down briefcase
  • ambassador won’t put down briefcase unless it can be picked up without standing on a pad
  • display game type and level on match results screen
  • Happy Valentine’s Day (belated due to a cold) – thank zerotka for the awesome idea!
  • pointier convo arrow
  • reenable previous talker arrow
  • add concept of “important” NPCs (like green statue swapper)
  • sort important NPCs higher/lower than unknown cast
  • put a grey bar on them in the portraits on result screen
  • play microfilm failure animation if bailing on action test
  • drink events for accept, reject, offer, waiter giving up
  • okay/good/bad -> white/green/red in mission preview text
  • much more visible sniper crosshairs
  • try harder to find a spot if you’re forced to go to pedestal
  • delay forcing swapper until outside min radius and then random timer
  • finally solved the mystery of the floor pad rotations, old click-to-move ui vestige, thanks kcmmmmm!
  • round event for cast member picking up pending statue
  • handle room name escapes before eating whitespace
  • ignore escape until endgame results synced to avoid lobby kicks and disabled Spy/Sniper menu items
  • setting to not print /help on lobby entry
  • allow big lobby font, allow ellipsis in settings and lobby escape menu
  • Damon at least looks at the corpse, for god’s sake
  • arrow ui for conversation talking turns
  • version number in screenshot name
  • toby offers drink and waits until accept or refuse
  • break up too-long chat lines correctly with wrapping
  • settings in lobby escape menu uses new lobby layout
  • small timestamp font, only do time, not date (clipboard still gets full date)
  • sync up /time format
  • block actions while putting down or handing off briefcase
  • allow Toby to go more places and offer more drinks (like the back side of Balcony, which he basically was afraid to visit)
  • new walking bug tuning, midpoint, tighter radii, etc
  • display autolock animation times correctly in spy disk
  • move ambassador personal space radius in a bit so not as jumpy
  • bulge cy2 pillar out a bit more, but NPCs really hug it, better practice this or get shot by r7
  • whiter whites for your username
  • fix bug if spawn at convo center you’re going to
  • if you’re finishing convo because you’re lonely, and somebody shows up, go back to listening
  • retry look areas if your spot is taken
  • clear the customer’s queue so don’t take drinks long after waiter leaves
  • make veranda front a little wider for pathing sanity
  • walking people count as violating ambassador’s personal space, so spy can’t twitch
  • version check content package to avoid weird bugs
  • fix lobby chat crash bug
  • draw cast bars with quads, size properly
  • sort known-to-sniper cast members low, unknown (like seduction target, no-suspected double agents) high
  • splash screen works on wineskin (non-transparent)
  • expand results screen portraits on mouseover now that we have a mouse on that screen
  • fix results display for multiple consecutive time-adds
  • only push round events if playing
  • push round event on shift-m add time
  • flicker ambassador ui during leaving, not go to red (ST color)
  • grey outlines in results, not yellow

So, the briefcase/floor-pad thing is kinda huge. Having to actively reject Toby is kinda huge. New green swap is big. I don’t even remember what else. Yikes.

Chris

One Bug’s Story, or, Assume it’s a bug!

This is the story of a bug in SpyParty.  This story has a happy ending, because the SpyParty beta testers are amazing, and they are constantly helping find bugs, of course, but they are also constantly helping me reproduce bugs, and narrow down the potential causes of bugs, and triage them, and are generally providing me with incredible support so I can make the game better.

This bug also has an interesting story, because it turned out to have a very subtle cause, one that manifested itself intermittently in ways that looked almost random, and for a long time there was no “repro case”.  Getting a repro on a bug is the key to fixing it, as I discuss in the How to Report Bugs the SpyParty Way post.  It can make the difference between a 10 minute fix and a 10 day fix…or never managing to find and fix it. 

*shudder*

But let’s start at the beginning…

I first noticed the bug long before I’d invited people into the beta, but it was so rare that I didn’t prioritize finding it, and in fact I would forget about it for stretches of time.  Yes, I make notes of bugs I see, but I’ve got so much high priority stuff to do right now that I don’t go back and look at that list very often…the important bugs get fixed immediately, but a bug like this can stay in for a long time.

So, what’s the bug?  Well, let’s go to the videotape:

This video was from a Spy Commentary game played between buxx and dieffenbachj, two early beta testers who were some of the first to upload gameplay videos.

It turns out, in addition to just being plain awesome for games overall, the rise of videos and streams is also an amazing resource for bug finding and fixing!  The more people record their games, the more they’ll be able to point to a video of exactly what went wrong so the developers can see it almost first-hand.  The bane of a developer’s life is a bug report that says, “the game broke” with no other description.  You know something’s probably wrong, but it’s basically useless for finding a problem.  With a video, you can usually see exactly what’s going on, so that problem is eliminated, or at least massively reduced.

So, as you can see in that video, the Spy is facing the wrong way on the floor pad.  The Spy should be facing the pedestal with the statue on it in that position, but in this case the Spy is turned 90 degrees.

I watched buxx‘s videos when he posted them1 and I noticed this, and so I posted a bug in the Bugs Forum on the private beta website myself on May 6th, 2012.

Obviously, trying to repro it by replaying the steps in that video was fruitless.

Next up, bishop caught it with a screenshot and posted on May 13th, 2012:

It’s a perfect shot, but his next post is “I haven’t had much luck on the repro.”

ardonite chimes in the next day:

I got the rotation bug once at a statue.

I think I was rriiiiight in the bounding box. So maybe if it’s on a border pixel of the box then it glitches out?

Edit: no, hypothesis incorrect.

That’s how it goes: make a hypothesis, check it, repeat.

More shots every couple months over the summer from bishop and r7stuart:

Still no repro.

This entire time—in fact since I first saw it myself pre-beta—I’ve been trying to resist thinking it’s some kind of “numerical issue” with the code that handles the facing angle.  Yes, angles are finicky to deal with due to wrapping, but I’ve found programmers, including myself, tend to immediately go to vague concepts like “floating point error” for anything like this.  To fight this tendency back when we were working on physical simulation code together, Casey Muratori and I developed a mantra:  “Assume it’s a bug!”  It means that instead of assuming it’s some subtle floating point error creeping in, or anything mysterious like that, it’s almost certainly just some dumb programming bug.  That mantra has never failed me.  It’s always just a plain old bug.

Onward…

In the fall of 2012, streaming SpyParty took off bigtime, and so people were recording their games more often, and we started to get more videos, this one from r7stuart in October:

And tytalus in November:

I saw this last one live on tytalus’s stream, and grimaced when it happened, but also was happy to have more data to some day find a repro, or just have a random brainwave and fix it by intuition.2

At this point, people were reporting NPCs doing it, which at least made me happy, because it meant it wasn’t a tell.  Tells and anti-tells3 are the most serious SpyParty bugs, because they undermine the delicate balance of the game, so I proritize them highest, even above crash bugs sometimes!

A couple on New Year’s Eve from jorjon:

Then two clips from streams, the first from slappydavis, who’s Seduction Target appears to do it at the bookshelf on January 6th:

And then from james1221 on January 12th during the SpyParty New Years Cup Tournament:

Both of these are different, however.  In both cases, the NPC is being blocked by another character, and instead of repathing to a new place, they just wait until the blocking character leaves.  This is both good news and bad news.  It means it’s easy to project this bug onto other bugs, it means there are other bugs,4 and it means all the real examples of this bug so far have involved the Spy.  I don’t mention this last part in the hopes that nobody notices.  Luckly it’s rare enough that it’s not going to be a game balance changer even if it is a tell.

Finally, kcmmmmm finds a reliable repro on February 7th, two days ago, and 8 months after the first post in the Bugs thread!  These pictures are beautiful to me:

You can stand at that position, with that camera angle, and repro the bug most tries.  He also figures out that it’s very camera angle dependent, which is another clue, but once I could repro it locally, its remaining time on this earth was measured in minutes.

I had some trouble reproing it reliably here, including a couple wild-goose chases where I thought it wouldn’t repro with the debugger running5 or in my debugging modes,6 but in the end I got a case where I could catch it in the debugger, and I looked at the source, and there it was, suspiciously rotten code.

It was an old check from when I used to support click-to-move, as opposed to direct-control of the spy.  There was a case in the code that would check if you were clicking on the bookshelf itself, rather than the floor pad in front of the bookshelf, and it would helpfully direct you to the floor pad.  The position part of this got taken out long ago (I think), but the angle part remained in, so when the Spy stopped moving, if the mouse was over the bookshelf that code would return the angle for facing the bookshelf.

Wait, you say, there’s no mouse cursor in Spy mode?  Ah, yes there is, it’s just hidden and forced into the middle of the screen.  So, most of the time it hits your back, but sometimes, if you’re turning or leaning down or whatever when you stop, it’ll miss you and hit what’s behind you, and if it hits a bookshelf on the frame when you stop, you get the wrong facing angle.

Now, with this knowledge, go back and watch the videos and look a the pictures above.  Always a nearby guilty bookcase, isn’t there? 

But wait, you say again, what about the very first video, the green bookshelf is nowhere near the middle of the screen!  Ah, but buxx uses a controller,7 and the mouse gets hidden but doesn’t get centered if you’re using a controller!  It probably should, but it doesn’t.  So, buxx‘s hidden mouse pointer is probably off to the right of the window, over the green bookshelf, until he moves one pedestal pad to the left, and then the mouse pointer is no longer on the bookshelf, and he faces the right way!

Awesome, all the cases explained, and the bug was trivial to fix!

Okay, there’s actually one more case in the Bugs thread I didn’t post here, because it’s funny enough that I’m going to make an entire post about it soon.

So, just remember, always, Assume it’s a bug!

  1. …because they’re great for learning, the only thing better than video is commentated video! []
  2. This happens in programming a lot, but you don’t want to count on it if you don’t have to. []
  3. Where the NPCs can do something the Spy can’t. []
  4. duh []
  5. Sometimes this happens with an uninitialized variable, since the debugger initializes most memory to zero for you. []
  6. Could mean some debug code was correcting for the bug? []
  7. You can see the action UI in the video uses the controller icons! No detail is too small to matter in a bug! []

Full Commentated Live Casting for Tournament Finals, January 19th, 7pm PST!

Update:  Here’s the video of the finals match between  james1221 and bl00dw0lf with the full commentary!1

I am ludicrously excited about this!  If you follow this blog and SpyParty on Twitter or Facebook,2 you probably know I’ve been really interested in live streaming recently.  In fact, just the other day I set up the SpyParty Streams Notifier and I’m writing up a longer post about that and streaming in general, and I was recently interviewed by Chris Priestman on Indie Statik about the growth of streaming for indie games, along with Erik Svedäng who’s working on the awesome Clairvoyance, which makes great use of online replays. 

You may also know we’re at the final game of beta-tester insight’s SpyParty New Year’s Cup, which has been a really fun and exciting set of matches, most of which have been live streamed or at least recorded and posted after the match.  Watching the live streams and chatting with the other viewers during the games is incredible.

Even grizzled live streaming veterans are having a blast with the tournament streams!

So, what better to do than combine the tournament and the focus on live streaming and go all-in?

We are going to live commentate the finals match of the tournament between james1221 and bl00dw0lf, re-streaming both sides of the match to a single stream we can fade between the two views dynamically based on context, and we’ll have expert level commentary by virifaux and ky, with occasional color by me.  ky is going to play director while he casts, picking the best view from the two streams, just like the dude in the video truck at the Super Bowl!  We tested all the technical and social aspects of this today, it worked incredibly well, and you can see the results here, with a test game between james1221 and slappydavis:

Check out ky‘s awesome temporary name tags for this test, whipped up in mspaint while we were streaming!

Tune In

So, at 10pm EST / 7pm PST / 3am GMT,3 on Saturday, January 19th4 point your browser to the SpyParty twitch.tv channel:

http://twitch.tv/spyparty

If today’s practice session is any indication, it’s going to be totally awesome!

Assuming this goes as well as I hope, we’re going to start doing this kind of casting more often, with a rotating mix of commentators, so leave comments here on this post with feedback on what you saw and let us know if you liked it, or if there’s anything we can change to make it even better!

  1. As a youtube commentor pointed out, I shouldn’t have had the winner above the video because it ruins the surprise! []
  2. And now Google+, because the world definitely needed another social network! []
  3. sorry! []
  4. 20th if you’re GMT []