In our last exciting episode of Loadtesting for Open Beta, we did some initial profiling to see how the lobbyserver held up under attack by a phalanx of loadtesting robots spawned in the cloud. It didn’t hold up, obviously, or the beta would already be open.
Specifically, it failed by saturating the server’s 100Mbps network link, which turned out to be a great way to fail, because it meant there were some pretty simple things I could do to optimize the bandwidth utilization. I had done the initial game↔lobby protocol in the simplest way possible, so every time any player’s state changed, like a new connection or switching from chatting in the lobby to playing, it sent the entire list of player states to everybody. This doesn’t scale at all: as you add more players, most of them aren’t changing state, but you’re still sending all of their states to everybody whenever any one of them changes. That doesn’t mean it was the wrong way to program it initially; it’s really important when you’re writing complicated software to do things the simplest way possible, as long as you have a vague plan for what you’ll do if it turns into a problem later. In this case, I knew what I was doing probably wasn’t going to work in the long run, but it got things up and running more quickly than overengineering some fancy solution I might not have needed, and I waited until it actually was a problem before fixing it.
Tell Me Something I Don’t Know
The solution to this problem is pretty obvious: differential state updates. Or, in English: only send the stuff that’s changed, to the people who care about it. Doing differential updates is significantly more complicated than just spamming everybody with everything, however. You still have to send the initial state of all the current players when new players log in, and you have to be able to add and remove players in the protocol, which you didn’t have to do before because you were just sending the complete new state every time.
This was going to be a fairly large change, so I took it in steps. I knew I’d have to send the complete state of everybody to new logins, so it made sense to start by optimizing that initial packet using normal data size optimization techniques. I pretty easily got it from about 88 bytes per player down to 42 bytes per player, which is nice, because my goal for these optimizations is 1000 simultaneous players; at 88 bytes they wouldn’t all fit in my 64KB maximum packet size, whereas at 42 bytes they should fit, no problem, so I don’t have to add any kind of break-up-the-list-across-packets thing. However, it turns out I got the ability to send the entire list across multiple packets anyway, because I had to program the ability to add players as part of the differential updates, so now I can just use that packet type to send clients a really large player list that doesn’t fit in a single packet. But, like I said in the last episode, although I don’t think I’ll hit 1000 simultaneous players outside of load testing for a while, it’s always nice to know you have that sort of thing in your back pocket for the future.
Once I’d tested the new optimized player list, I started making the updates differential. New players get the initial list, and then they’re considered up-to-date and just get updates along with everybody else. The list of new players is sent as additions to players already in the lobby. For each player, I track some simple flags about what’s been updated in their state, so if they set or clear their /away message for example, that flag is set, and I only send that information.
In programming, usually when you’ve got the right design, you get some unintentional upside, and this case was no different. Previously, I was not sending live updates to player stats (wins, game time, etc.) to the players in the lobby until the player was done playing the match, or some other state changed that caused everybody’s state to be re-sent. Now, since the differential updates are efficient, I’m updating player stats in real time as well, so people in the lobby can see wins as they accumulate for players in matches, which is nice and how you’d expect it to work.
It basically worked exactly as planned. After lots of debugging, of course. Here you can see the profiles for one of the loadtests, which got to 340 simultaneous players in the lobby:
Look ma, 3% network utilization! That’s what’s so awesome about a really spiky profile…when you pound one of the spikes down, things just get better!
Here’s the new table of packet sizes for this run. If you compare this with the previous results, you can see the PLAYER_LIST packets are way way way smaller, and this table was accumulated from two longer test runs, so it’s not even a fair comparison! It’s interesting, because the TYPE_LOBBY_MESSAGE_PACKET is smaller as well, and I think that’s because now the robots can actually start games since the network isn’t saturated, and this means they don’t broadcast chats to the entire lobby while they’re playing, so that’s a nice side effect of optimizing the bandwidth.
[Table: total bytes by packet type]
Hmm, I just noticed as I’m writing this that the resident memory utilization in the atop screenshot is way lower now than before…I wonder why… On the application side I take about 250kb per player right now, which at 340 players should be about 85MB. Looking at the lobbyserver logs, right about when the screenshot was taken, the lobby self-reported this data:
2013/03/03-02:13:15: MEMORY_POSIX 348/757/409: resident 12808/12808, virtual 160276/160276
2013/03/03-02:13:15: MEMORY_NEW 348/757/409: bytes 91766974, news 45707, deletes 36155
The MEMORY_NEW stats look about right for this load given my quick math, but the MEMORY_POSIX stats (which are read from /proc/pid/status) match the atop results: the expected virtual size, but low resident. Maybe it was just paged out for a second, or maybe I’m not touching much of that 250kb per player, so it doesn’t stay resident. A lot of it is network buffers, so it makes some sense that with this lower-bandwidth protocol less of it would be resident than in the last profile, since less buffering needs to be done. I’ll have to investigate this more.
Up Next, The Case of the Missing Robots
So, the bandwidth optimizations were a resounding success! Plus, both the CPU and memory utilization of the lobbyserver are really reasonable and haven’t been optimized at all, so we’re sitting pretty for getting to 1000 simultaneous robots…
Except, where are the remaining 160 robots? In the test above, I ran 10 EC2 instances with 50 robots each, thinking the optimizations might let me get to 500 simultaneous players and find the next performance issue…but it never got above 340 in the lobby. I updated my Perl loadtesting framework to have each instance output how many lobbyclients were running every two seconds, with this shell command over ssh:
'while true; do echo `date +%T`,`pidof lobbyclient | wc -w`; sleep 2; done'
And then I loaded that into gnuplot and graphed the number of robots on each instance:
You can see that they all started up with 50, but then a bunch of them lost clients until they found a steady state. Something is killing my robots, and I need to figure out what it is…