Loadtesting for Open Beta, Part 2

In our last exciting episode of Loadtesting for Open Beta, we did some initial profiling to see how the lobbyserver held up under attack by a phalanx of loadtesting robots spawned in the cloud. It didn’t hold up, obviously, or the beta would already be open.

Specifically, it failed by saturating the server’s 100Mbps network link, which turned out to be a great way to fail because it meant there were some pretty simple things I could do to optimize the bandwidth utilization.  I had done the initial gamelobby protocol in the simplest way possible, so every time any player state changed, like a new connection, or switching from chatting in the lobby to playing, it sent out the entire list of player states to everybody.  This doesn’t scale at all, since as you add more players, most aren’t changing state, but you’re sending all of their states out to everybody even if only one changes.  This doesn’t mean it was the wrong way to program it initially; it’s really important when you’re writing complicated software[1] to do things the simplest way possible, as long as you have a vague plan for what you’ll do if it turns into a problem later.  In this case, I knew what I was doing was probably not going to work in the long run, but it got things up and running more quickly than overengineering some fancy solution I might not have needed, and I waited until it actually was a problem before fixing it.
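To make the scaling problem concrete, here’s a minimal sketch of that original scheme, with made-up names (PlayerState, BuildFullPlayerListPacket, SendToPlayer) standing in for whatever the lobbyserver actually calls these things: any single change triggers a full-list rebuild and a send to every connected client, so the bytes on the wire grow roughly with the square of the player count.

// Sketch of the original "send everything to everybody" approach.  All names here
// are placeholders for illustration, not the actual SpyParty lobbyserver code.
#include <cstdint>
#include <vector>

struct PlayerState
{
    uint32_t id;
    // name, lobby status, stats, /away message, etc.
};

struct Packet
{
    std::vector<uint8_t> bytes;
};

Packet BuildFullPlayerListPacket(const std::vector<PlayerState> &players)
{
    Packet p;
    p.bytes.resize(players.size() * 88);  // ~88 bytes per player before optimization
    return p;
}

void SendToPlayer(const PlayerState &, const Packet &)
{
    // write the packet to that player's connection
}

// Any one player changing state costs O(players^2) bytes across the whole lobby.
void OnAnyPlayerStateChanged(const std::vector<PlayerState> &players)
{
    Packet packet = BuildFullPlayerListPacket(players);
    for (const PlayerState &recipient : players)
        SendToPlayer(recipient, packet);
}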

Tell Me Something I Don’t Know

The solution to this problem is pretty obvious: differential state updates.  Or, in English, only send the stuff that’s changed to the people who care about it.  Doing differential updates is significantly more complicated than just spamming everybody with everything, however.  You still have to send the initial state of all the current players when new players log in, you have to be able to add and remove players in the protocol (which you didn’t need before, because you were just sending the complete new state every time), and so on.
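Looking ahead at the packet table later in this post, you can see the pieces this requires showing up as distinct packet types.  Roughly, and hedging since this is just my reading of the names (the real enum presumably has many more entries and different values, and a removal type doesn’t appear in the table at all):

// Rough interpretation of the player-list packet types from the table below.
enum LobbyPacketType
{
    TYPE_LOBBY_PLAYER_LIST_PACKET,           // complete snapshot, sent to a newly logged-in client
    TYPE_LOBBY_PLAYER_LIST_ADDITION_PACKET,  // players added to the lobby
    TYPE_LOBBY_PLAYER_LIST_UPDATE_PACKET,    // differential update: only the fields that changed
    // plus, presumably, some way to remove players who leave (hypothetical, not in the table)
};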

This was going to be a fairly large change, so I took it by steps.  I knew that I’d have to send out the complete state of everybody to new logins, so it made sense to start by optimizing that initial packet using normal data size optimization techniques.  I pretty easily got it from about 88 bytes per player down to 42 bytes per player, which is nice, because my goal for these optimizations is 1000 simultaneous players, and at 88 bytes they wouldn’t all fit in my 64kb maximum packet size, whereas at 42 bytes they should fit, no problem, so I don’t have to add any kind of break-up-the-list-across-packets thing.  However, it turns out I actually got the ability to send the entire list across multiple packets while I was doing this, because I had to program the ability to add players as part of the differential updates, so now I could just use that packet type to send clients any really large player list that didn’t fit in a single packet.  But, like I said in the last episode, although I don’t think I’ll hit 1000 simultaneous outside of load testing for a while, it’s always nice to know you have that sort of thing in your back pocket for the future.
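The arithmetic behind that “should fit, no problem” is simple enough to check mechanically.  This is just the numbers from the paragraph above plugged into a couple of compile-time asserts, not real lobbyserver code:

#include <cstddef>

// Numbers from the text: 64kb maximum packet size, 1000-player goal,
// ~88 bytes per player before optimization, ~42 bytes after.
constexpr std::size_t kMaxPacketBytes    = 64 * 1024;  // 65536
constexpr std::size_t kTargetPlayers     = 1000;
constexpr std::size_t kOldBytesPerPlayer = 88;         // 88000 bytes: too big for one packet
constexpr std::size_t kNewBytesPerPlayer = 42;         // 42000 bytes: fits with room to spare

static_assert(kOldBytesPerPlayer * kTargetPlayers > kMaxPacketBytes,
              "old format needs the list split across packets");
static_assert(kNewBytesPerPlayer * kTargetPlayers < kMaxPacketBytes,
              "optimized format fits 1000 players in a single packet");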

Once I’d tested the new optimized player list, I started making the updates differential.  New players get the initial list, and then they’re considered up-to-date and just get updates along with everybody else.  The list of new players is sent as additions to players already in the lobby.  For each player, I track some simple flags about what’s been updated in their state, so if they set or clear their /away message for example, that flag is set, and I only send that information.
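Here’s a minimal sketch of that flag scheme, under the assumption that it’s a plain bitfield per player; the names (DIRTY_AWAY_MESSAGE, BuildDifferentialUpdate, and so on) are made up for illustration, and the actual serialization is elided:

#include <cstdint>
#include <string>
#include <vector>

// One bit per piece of player state that can change independently.
enum DirtyFlags : uint32_t
{
    DIRTY_NONE         = 0,
    DIRTY_AWAY_MESSAGE = 1 << 0,
    DIRTY_LOBBY_STATUS = 1 << 1,  // chatting, playing a match, etc.
    DIRTY_STATS        = 1 << 2,  // wins, game time, ...
};

struct LobbyPlayer
{
    uint32_t    id = 0;
    std::string awayMessage;
    uint32_t    wins = 0;
    uint32_t    dirty = DIRTY_NONE;
};

// Changing state just marks the corresponding flag...
void SetAwayMessage(LobbyPlayer &p, const std::string &msg)
{
    p.awayMessage = msg;
    p.dirty |= DIRTY_AWAY_MESSAGE;
}

// ...and the update pass serializes only the flagged fields of the flagged players,
// then clears the flags.  Unchanged players cost zero bytes on the wire.
std::vector<uint8_t> BuildDifferentialUpdate(std::vector<LobbyPlayer> &players)
{
    std::vector<uint8_t> packet;
    for (LobbyPlayer &p : players)
    {
        if (p.dirty == DIRTY_NONE)
            continue;
        // real code would append p.id, p.dirty, and each flagged field here
        p.dirty = DIRTY_NONE;
    }
    return packet;
}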

In programming, usually when you’ve got the right design, you get some unintentional upside, and this case was no different.  Previously, I was not sending live updates to player stats (wins, game time, etc.) to the players in the lobby until the player was done playing the match, or some other state changed that caused everybody’s state to be re-sent.  Now, since the differential updates are efficient, I’m updating player stats in real time as well, so people in the lobby can see wins as they accumulate for players in matches, which is nice and how you’d expect it to work.

Results

It basically worked exactly as planned.  After lots of debugging, of course.  Here you can see the profiles for one of the loadtests, which got to 340 simultaneous players in the lobby:

I really need to have the robot Sniper win sometimes.


atop in memory mode


atop in cpu mode

Look ma, 3% network utilization!  That’s what’s so awesome about a really spiky profile…when you pound one of the spikes down, things just get better!

Here’s the new table of packet sizes for this run.  If you compare this with the previous results, you can see the PLAYER_LIST packets are way way way smaller, and this table was accumulated from two longer test runs, so it’s not even a fair comparison!  It’s interesting, because the TYPE_LOBBY_MESSAGE_PACKET is smaller as well, and I think that’s because now the robots can actually start games since the network isn’t saturated, and this means they don’t broadcast chats to the entire lobby while they’re playing, so that’s a nice side effect of optimizing the bandwidth.

Packet Type                              Total Bytes
TYPE_LOBBY_MESSAGE_PACKET                58060417
TYPE_LOBBY_PLAYER_LIST_UPDATE_PACKET     29751413
TYPE_CLIENT_GAME_JOURNAL_PACKET          18006186
TYPE_LOBBY_ROOM_LIST_PACKET              16674479
TYPE_LOBBY_PLAYER_LIST_ADDITION_PACKET   4280563
TYPE_LOBBY_PLAYER_LIST_PACKET            3482691
TYPE_CLIENT_MESSAGE_PACKET               1501822
TYPE_CLIENT_LOGIN_PACKET                 477356
TYPE_CLIENT_INVITE_PACKET                435368
TYPE_LOBBY_INVITE_PACKET                 275781
TYPE_LOBBY_LOGIN_PACKET                  235878
TYPE_LOBBY_GAME_ID_PACKET                96000
TYPE_LOBBY_GAME_OVER_PACKET              68901
TYPE_CLIENT_GAME_ID_CONFIRM_PACKET       40257
TYPE_LOBBY_PLAY_PACKET                   32498
TYPE_CLIENT_IN_MATCH_PACKET              25714
TYPE_LOBBY_IN_MATCH_PACKET               21204
TYPE_CLIENT_CANDIDATE_PACKET             16089
TYPE_CLIENT_PLAY_PACKET                  12419
TYPE_CLIENT_GAME_ID_REQUEST_PACKET       9610
TYPE_LOBBY_WELCOME_PACKET                4494
TYPE_CLIENT_JOIN_PACKET                  4494
TYPE_KEEPALIVE_PACKET                    1011
TYPE_CLIENT_IDLE_PACKET                  24

Hmm, I just noticed as I’m writing this that the resident memory utilization in the atop screenshot is way lower now than before…I wonder why… On the application side I take about 250kb per player right now, which at 340 players should be about 85MB.  Looking at the lobbyserver logs, right about when the screenshot was taken, the lobby self-reported this data:

2013/03/03-02:13:15: MEMORY_POSIX 348/757/409: resident 12808/12808, virtual 160276/160276
2013/03/03-02:13:15: MEMORY_NEW 348/757/409: bytes 91766974, news 45707, deletes 36155

The MEMORY_NEW stats look about right for this load and my quick math, but the MEMORY_POSIX stats, which are read from /proc/pid/status, match the atop results: expected virtual but low resident.  Maybe it was just paged out for a second, or maybe I’m not touching much of that 250kb, so it doesn’t stay resident.  A lot of it is network buffers, so it makes some sense that with this lower-bandwidth protocol it wouldn’t be as resident as in the last profile, because less buffering has to be done.  I’ll have to investigate this more.
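Since the MEMORY_POSIX numbers come from /proc/pid/status, reading them is just string parsing.  Here’s one straightforward way to do it on Linux, as a sketch; this is not necessarily how the lobbyserver’s actual logging code works:

#include <cstdio>
#include <cstring>

// Parse VmRSS (resident) and VmSize (virtual) out of /proc/self/status, in kB.
// Returns false if the file can't be read or the fields aren't found.
bool GetProcessMemoryKB(long *residentKB, long *virtualKB)
{
    *residentKB = -1;
    *virtualKB = -1;

    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return false;

    char line[256];
    while (fgets(line, sizeof(line), f))
    {
        if (strncmp(line, "VmRSS:", 6) == 0)
            sscanf(line + 6, "%ld", residentKB);   // e.g. "VmRSS:     12808 kB"
        else if (strncmp(line, "VmSize:", 7) == 0)
            sscanf(line + 7, "%ld", virtualKB);    // e.g. "VmSize:   160276 kB"
    }
    fclose(f);

    return *residentKB >= 0 && *virtualKB >= 0;
}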

Up Next, The Case of the Missing Robots

So, the bandwidth optimizations were a resounding success!  Plus, both the CPU and memory utilization of the lobbyserver are really reasonable and haven’t been optimized at all, so we’re sitting pretty for getting to 1000 simultaneous robots…

Except, where are the remaining 160 robots?  In the test above, I ran 10 EC2 instances, each with 50 robots, thinking the optimizations might let me get to 500 simultaneous and find the next performance issue…but it never got above 340 in the lobby.  I updated my perl loadtesting framework and had each instance output how many lobbyclients were running every two seconds with this shell command over ssh:

'while true; do echo `date +%T`,`pidof lobbyclient | wc -w`; sleep 2; done'

And then I loaded that into gnuplot,[2] and graphed the number of robots on each instance:

The number of loadtest robots running on each EC2 instance.

You can see that they all started up with 50, but then a bunch of them lost clients until they found a steady state.   Something is killing my robots, and I need to figure out what it is…

Turn the page to Part 3…

  1. especially by yourself!
  2. …which I hate, but I forgot to install Excel on my new laptop, and Google’s spreadsheet sucks at pivot tables, and the Office for Web Excel doesn’t even have them as far as I could tell!

5 Comments

  1. keith says:

    “Something is killing my robots, and I need to figure out what it is…”

    Easy, Players. Some people just shoot the AI and not the spy. I even heard rumors they do it ‘for fun’.

  2. gerafin says:

    This is awesome, Chris! I’m impressed that you’re doing all the networking stuff yourself, this is the point where I would be calling a friend (artwork & networking aren’t my favorites).
    Quick question: if this isn’t in open beta for PAX East (March 22), would it be OK if I set up a tournament / open play area with a few copies of the closed beta? We’re planning a pre-PAX game night and there are quite a few people interested in playing, and some people in the beta willing to use their accounts for the card. I know you don’t have an NDA or anything like that, I just wanted to make sure it was OK with you. It will be a casual setup, there won’t be any pretense of affiliation or general official-ness.

  3. dataferret says:

    You probably figured it out by now, but could it have something to do with packet per second limitations on EC2? Last I heard, it was 100k pps.

    • checker says:

      I think all the problems are on my end so far, but I hope to eventually fix all my problems and run into somebody else’s problems. :)

I have temporarily disabled blog comments due to spammers; come join us on the SpyParty Discord if you have questions or comments!