Loadtesting for Open Beta, Part 4: Done optimizing the lobbyserver!

Check out Loadtesting for Open Beta, Part 1, Part 2, and Part 3 to read the previous installments of this epic tale!

It’s been a while since the last update in this series, sorry about that!  At the end of Part 3, I mentioned the SimCity launch giving me pause about my goal of testing the SpyParty lobbyserver to 1000 simultaneous robots.  Well, I got scared enough after their launch that I increased my optimization target to 2000 simultaneous robots on my old and slow server, and then I also decided to bite the bullet and upgrade the server hardware after I hit 2000 to give myself some extra headroom.  I really don’t think I’m going to hit these numbers at Open Beta launch or even for a long time after that, but I’d rather err on the conservative side and have it purr along nicely.

Since I waited so long to post this Part 4, I can’t really give a play-by-play of all the optimizations I did as they happened, so I’m going to give the general arc I followed, and then talk about some of the interesting stops along the way.

iprof, atop, oprofile, et al.

As I mentioned at the end of the last post, I’d fixed some of the huge and obvious things with the network bandwidth usage, so it was time to start profiling the CPU usage.  There are lots of different kinds of profilers, but the one I use the most is based on Sean Barrett’s iprof.  I’ve modified it a fair bit over the years,1 but the core of the system is still the same.  It’s a runtime profiler that requires instrumenting your code into blocks, it’s efficient enough that you can leave it on all the time as long as you don’t stick a “prof block” in an inner loop, and you can generally see where you’re spending your time hierarchically.  It can draw to the screen, but I also have it output to a string, so on the lobbyserver I can dump it to the log after a spike, and I can also send the process a signal to force a prof dump.  Here’s an example:

2013/04/17-16:12:10: 85.156 ms/frame (fps: 11.74)  sort self - current frame
2013/04/17-16:12:10: zone                                                     self     hier    count
2013/04/17-16:12:10:  ProcessMessages                                      59.2910  59.2910     1.00
2013/04/17-16:12:10: +Send                                                 17.8164  18.4493  1120.69
2013/04/17-16:12:10: +ClientsUpkeepAndCloseLoop                             2.3034   2.6989   793.15
2013/04/17-16:12:10:  Log                                                   1.3559   1.3559    25.97
2013/04/17-16:12:10:  unpack_bytes                                          0.7311   0.7311    26.56
2013/04/17-16:12:10:  unpack                                                0.6674   0.6674  3551.71
2013/04/17-16:12:10: +ClientsPacketLoop                                     0.5492   3.7160   792.58
2013/04/17-16:12:10: +ClientsUpdated                                        0.4023  10.9843     0.56
2013/04/17-16:12:10: +SendQueuedClientRoomMessages                          0.2524   7.3086     1.00
2013/04/17-16:12:10:  FindClientByID                                        0.2494   0.2494   267.05
2013/04/17-16:12:10: +JournalQueuedSave                                     0.2374   0.3882     1.43
2013/04/17-16:12:10: +Tick                                                  0.2112  25.6313     1.00
2013/04/17-16:12:10:  iprof_update                                          0.2051   0.2051     1.00
2013/04/17-16:12:10: +JournalSavePrep                                       0.1560   0.5442     1.43

As you can see, it’s pretty easy to read, and you can drill down on individual blocks and see who calls them and who they call:

2013/04/17-16:12:10: 85.156 ms/frame (fps: 11.74)  sort graf - current frame
2013/04/17-16:12:10: zone                                                     self     hier    count
2013/04/17-16:12:10:     LoginReply                                         0.0006   0.0006     0.01
2013/04/17-16:12:10:     JOINING                                            0.0007   0.0008     0.03
2013/04/17-16:12:10:    +TYPE_CLIENT_GAME_ID_REQUEST_PACKET                 0.0011   0.0014     0.44
2013/04/17-16:12:10:    +NewWaitingForJoinClients                           0.0019   0.0019     0.01
2013/04/17-16:12:10:    +TYPE_CLIENT_PLAY_PACKET                            0.0086   0.0087     0.12
2013/04/17-16:12:10:    +TYPE_CLIENT_INVITE_PACKET                          0.0213   0.0225     2.37
2013/04/17-16:12:10:    +TYPE_CLIENT_IN_MATCH_PACKET                        0.0280   0.0282     0.49
2013/04/17-16:12:10:    +JournalQueuedSave                                  0.0563   0.0571     1.18
2013/04/17-16:12:10:    +RoomsChanged                                       0.0852   0.0939    15.38
2013/04/17-16:12:10:    +NewInLobbyClients                                  0.1155   0.1342    14.55
2013/04/17-16:12:10:    +ClientsUpkeepAndCloseLoop                          0.1362   0.2093   150.55
2013/04/17-16:12:10:    +TYPE_CLIENT_MESSAGE_PACKET                         0.3122   0.3170     8.87
2013/04/17-16:12:10:    +SendQueuedClientRoomMessages                       6.7122   7.0013   501.99
2013/04/17-16:12:10:    +ClientsUpdated                                    10.3361  10.5720   424.67
2013/04/17-16:12:10: -Send                                                 17.8164  18.4493  1120.69
2013/04/17-16:12:10:     unpack                                             0.6329   0.6329  3362.06

This is super useful.  The biggest downside is that it’s not thread-aware, but I’ve made it thread-safe via the brute-force method of having it ignore all threads that aren’t the main thread.  My code is mostly single-threaded, but the threadedness increased a fair bit during these optimizations, so I hope to eventually modify iprof to be thread-aware without losing too much simplicity and performance.  Until I make those modifications, work done on background threads only shows up indirectly, as time the main thread spends waiting on them.  You can still get useful data, you just have to be aware of this.  For example, ProcessMessages in the dump above is hiding a WaitForMultipleObjectsEx call on Windows, or a call to select/epoll on POSIX, so it’s not actually taking that much active time on the main thread.
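
To make the dumps above a little more concrete, here’s a rough sketch of what an instrumented block looks like in use.  This is just my guess at the general shape, with made-up names (ProfScope, ProfEnter, ProfLeave, PROF_BLOCK), not Sean Barrett’s actual API or my modified version:

    // Sketch of an iprof-style instrumented block; all names here are made up.
    void ProfEnter(const char *zone);   // assumed hooks into the profiler core
    void ProfLeave(const char *zone);

    struct ProfScope {
        explicit ProfScope(const char *zone) : zone_(zone) { ProfEnter(zone_); }
        ~ProfScope() { ProfLeave(zone_); }
        const char *zone_;
    };
    #define PROF_BLOCK(name) ProfScope prof_scope_obj(name)  // one per scope in this sketch

    void ProcessMessages() {
        PROF_BLOCK("ProcessMessages");  // shows up as a zone in the dumps above
        // ...receive packets, dispatch handlers, send replies...
    }

Nested blocks are what produce the hierarchical “hier” column in the dumps.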

I also used oprofile, which is a nice sampling profiler on Linux that can profile per-thread using just the debug information in an application, and atop for keeping track of things happening on the machine as a whole.

Here’s a list of the stuff I ended up optimizing:

  • I was originally sending chat messages out to all the clients as they came in, but I switched to queuing them up and sending them all out at once each tick, to reduce the number of send calls.  Of course, once you do this, you have to make sure you don’t overflow the network packet when a lot of messages have queued up that tick, which makes the code more complicated and harder to modify.  That’s a tradeoff you often have to make while optimizing, and it’s why you want to put off most optimization until you need it…although you should have a rough plan for how you’ll optimize a piece of code in the future even if you write it the dumb way first.
  • I made more threads, including putting network sending and receiving on separate threads, making a separate thread for logging, and a thread for saving files to the disk.  There were already threads for talking to the database and Kerberos, for receiving network packets, and for checking for new client builds.  These are all relatively simple threads to add, because they’re all just throwing data into a queue on one thread and taking it out on another, although multithreading a program always makes it harder to understand.  I discovered a fair number of deadlock bugs in libjingle, the library I’m using for NAT traversal and some cross platform threading stuff, and I’ve fixed some of them.  I’ve veered far enough from the original libjingle code that I’m probably just going to have to put my version up as a fork, sadly.
  • I timesliced the login phase for the clients (there’s a sketch of the idea just after this list).  Previously, when a client would log in, I’d process a bunch of stuff immediately, including some authentication work that can be somewhat time-consuming.  In a loadtest where hundreds of clients log in to the server at the same time, this would bog down, so I now process a maximum of 20ms worth of clients each tick.  This makes some clients wait a bit longer before they’re logged in, but it avoids the positive feedback loop where a really long tick means a lot of packets arrive while it’s happening, which makes the next tick really long too, and so on.
  • Like the player list packets, I also made the room list packets incremental, and able to span multiple network packets.  This way all the lists of players and rooms that the lobby sends to the clients can be differential and arbitrarily long, so there’s no more hard limit on the number of clients that can join the lobby.  I think there’s actually a bug in this code, but I’ve only ever seen it once, even after tens of thousands of robot sessions, so I just hope it shows up more at some point.
  • I switched the POSIX networking inner loop in libjingle from select to epoll (also sketched below).  This was not so much an optimization as it was simply a way to allow more than 1024 sockets to work at all.  epoll is also a lot faster, but I’m currently kinda using it in a dumb way, so I’m not benefiting from that speed boost much yet.
  • There were also a bunch of smaller traditional code optimizations, like using maps to cache lookups, using free lists to avoid some allocations, and whatnot.  Oh, and don’t forget to change the ulimit -n settings in limits.conf on Linux (example below), so your process can actually accept a lot of connections!
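
Here’s a minimal sketch of the login timeslicing idea from the list above.  The names (PendingLogin, ProcessLogin, GetTimeSeconds) are hypothetical stand-ins, not the actual lobbyserver code:

    // Process queued logins until a per-tick time budget is used up; whatever
    // is left over waits for the next tick.  All names here are hypothetical.
    #include <deque>

    struct PendingLogin { /* whatever is needed to finish logging a client in */ };
    double GetTimeSeconds();                   // assumed high-resolution timer
    void ProcessLogin(const PendingLogin &);   // authentication, initial state, etc.

    const double kLoginBudgetSeconds = 0.020;  // ~20ms of login work per tick

    void ProcessQueuedLogins(std::deque<PendingLogin> &pending_logins) {
        const double start = GetTimeSeconds();
        while (!pending_logins.empty()) {
            ProcessLogin(pending_logins.front());
            pending_logins.pop_front();
            if (GetTimeSeconds() - start > kLoginBudgetSeconds)
                break;                         // the rest wait until the next tick
        }
    }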
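
And here’s roughly what the epoll side looks like.  This is a generic level-triggered sketch, which is more or less the “dumb way” I mentioned, not the actual libjingle integration; HandleReadable is a hypothetical dispatch function:

    // Generic level-triggered epoll usage: register each socket once, then ask
    // for the ready ones every tick.  Unlike select, there's no 1024 fd limit.
    #include <sys/epoll.h>

    void HandleReadable(int fd);              // hypothetical per-socket handler

    int epfd = epoll_create1(0);

    void WatchSocket(int fd) {
        struct epoll_event ev = {};
        ev.events = EPOLLIN;                  // wake when the socket is readable
        ev.data.fd = fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    void WaitAndDispatch(int timeout_ms) {
        struct epoll_event ready[64];
        int n = epoll_wait(epfd, ready, 64, timeout_ms);
        for (int i = 0; i < n; ++i)
            HandleReadable(ready[i].data.fd);
    }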
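
For the ulimit -n note, the relevant lines in /etc/security/limits.conf look something like this (the user name and the exact limit are just examples, pick whatever fits your setup):

    # /etc/security/limits.conf -- raise the open file descriptor limit so the
    # process can hold thousands of sockets (user name and value are examples)
    lobbyuser   soft   nofile   65536
    lobbyuser   hard   nofile   65536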

As I was doing these optimizations, I would run a loadtest with a bunch of robots and profile the lobby.  I was at 500 robots at the end of Part 3, and I slowly raised the ceiling as I improved the code over the weeks:  569 robots…741 robots…789 robots, 833, 923, 942, 990, 997, 1008, 1076, 1122, 1158, 1199, 1330, 1372, 1399, 1404, 1445, 1503, 1614, 1635, 1653, 1658, 1659…

When I hit 1659 it was late one night, and so I stopped for the day.  When I resumed work and did the next couple of optimizations, I figured I’d get it to 1800 or something.  I always launched 20% or so more robots than I was hoping to support in a given test to account for internet and EC2 variation, and for plain old bugs in the clients that would sometimes manifest themselves, so this time I must have launched 2500 robots, because when I looked up from the profiles running in ssh terminals and over to my SpyParty client logged into the test lobby, I saw this:

This wasn’t supposed to happen yet.

Uh, I guess I was done optimizing?

I was actually kind of disappointed, to be honest.  I had all sorts of cool ideas for optimizations I was planning to do that I’d come up with while testing and profiling the code, and now, if I was going to follow my own plan and stop when I hit 2000 simultaneous robots, I would have to just take a bunch of notes for next time I optimized so I could pick up where I left off, and move on.  The good news is I’m pretty sure I can make the lobby almost twice as efficient if and when the time comes to do that!

Room at the Inn

If you look closely at that screenshot, you’ll see the thumb on the scrollbar for the player list is pretty small.  That’s because all 2010 players are in a single room, which is not going to work very well for a lobby full of real people.  In fact, the only reason there were 2010 players in that room was that I’d limited the room size to 2010 at the time, because I didn’t want to bother teaching the robots how to use rooms.  There were actually a few hundred more robots knocking on the door, but they couldn’t get in.  But, now that I’d hit my 2k target, it was time to fix that.

I immediately realized I had a problem.  Currently, when you connect to the lobby, it sends you a list of rooms, and you have to pick one to log into.  But, what if all the rooms are full?  Oops, you can’t log in.  So, as soon as I set the room size down to something more reasonable, like 100, the first 100 robots got in and the rest just sat there failing to join.

It seemed like there were a number of solutions to this problem, including allowing players to create new rooms before logging in, but in the end I went with the simplest and most robust solution, which is to have the lobby create a new empty room if all the current rooms are full.  The initial room is always called Headquarters, so I named these new dynamic rooms Headquarters 2 and onward.  Very creative, I know.  Somebody suggested using spy movie titles for these room names, but I figured that wouldn’t scale very well, ignoring the potential copyright issues.  If the lobby ever finds one of these dynamic rooms empty, it kills it, unless all the other rooms are full.  I also have the lobby automatically put you in a now-guaranteed-to-exist-non-full-room if you log in and try to join a full room, even if it wasn’t full when you clicked on it, so this eliminated a login race condition too, which is always a good sign.
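
The whole policy boils down to something like the following sketch.  The names (Room, CreateRoom, dynamic_rooms, and friends) are hypothetical stand-ins, but the logic matches what I just described:

    // Sketch of the dynamic room policy; Room is a stand-in for the real room
    // class, and all names here are made up.
    #include <string>
    #include <vector>

    struct Room {
        bool IsFull() const;
        bool IsEmpty() const;
    };

    std::vector<Room *> all_rooms;       // every room, static and dynamic
    std::vector<Room *> dynamic_rooms;   // just the "Headquarters N" overflow rooms
    int next_headquarters_index = 2;     // "Headquarters 2" is the first dynamic room

    Room *CreateRoom(const std::string &name);   // assumed to exist elsewhere
    void DestroyRoom(Room *room);
    bool AllOtherRoomsAreFull(const Room *except);

    Room *FindOrCreateNonFullRoom() {
        for (Room *room : all_rooms)
            if (!room->IsFull())
                return room;
        // every room is full, so spin up the next "Headquarters N"
        return CreateRoom("Headquarters " + std::to_string(next_headquarters_index++));
    }

    // If the room a client asked for is full (or filled up since they clicked
    // on it), put them somewhere guaranteed to have space instead.
    Room *RoomForJoin(Room *requested) {
        return requested->IsFull() ? FindOrCreateNonFullRoom() : requested;
    }

    // Empty dynamic rooms get cleaned up, unless every other room is full.
    void UpkeepDynamicRooms() {
        for (Room *room : dynamic_rooms)
            if (room->IsEmpty() && !AllOtherRoomsAreFull(room))
                DestroyRoom(room);
    }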

This last bit also made it so I didn’t need to teach the loadtesting robots very much about rooms:  they always try to join Headquarters and if they don’t end up there, oh well.  As they join, they kind of spill over into the latest dynamic room until it fills up, and then they continue to the next, kind of like filling up an ice tray with water from one end.  I should probably make them test the actual room features by creating and changing rooms and whatnot, but the single giant 2010-player room was a way more intense loadtest than having 20 rooms with 100 players in each, due to the chat broadcasting.

I don’t know if 100 is the right limit for room populations.  100 would still be way too many people to have in a single reasonable conversation, but I didn’t want to put too low a limit on the size before I’ve tested things with humans instead of just robots.

The Client

There’s this annoying thing that happens when you’re testing computer code, and it’s that you encounter problems and bugs not only in the code you’re trying to test, but also in your test code.  This was no different.  I was constantly fixing various bugs in the robots that would keep them from all connecting correctly, and I even made sure some of the optimizations helped the client side so I could run more robots on a given EC2 server.  Plus, just making sure the robots keep trying to connect and log in was important, because if there was a timeout due to an initial burst, you want them to try again automatically after it dies down, rather than just sitting there not doing anything.
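
The retry behavior itself is dead simple, something like this sketch; the names are hypothetical, and the real robots might just retry on a fixed delay rather than backing off:

    // Keep trying to connect and log in until it works or the test is over,
    // backing off a bit so a burst of failures doesn't just hammer the lobby.
    // Robot, TestIsOver, and SleepSeconds are hypothetical stand-ins.
    #include <algorithm>

    struct Robot {
        bool LoggedIn() const;
        bool TryConnectAndLogin();
    };
    bool TestIsOver();
    void SleepSeconds(int seconds);

    void RobotConnectLoop(Robot &robot) {
        int retry_delay_seconds = 5;
        while (!robot.LoggedIn() && !TestIsOver()) {
            if (!robot.TryConnectAndLogin()) {
                SleepSeconds(retry_delay_seconds);   // let the burst die down
                retry_delay_seconds = std::min(retry_delay_seconds * 2, 60);
            }
        }
    }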

As I said in Part 1, I started out running about 50 robots on each m1.small EC2 instance.  For some reason I’m still trying to figure out, that didn’t scale:  it worked okay with a low number of instances, but as I increased the number of instances, I had to lower the number of robots on each one, eventually to around 20 per m1.small.  An AWS account starts out only being able to run 20 instances, so I made a total of two instance-limit requests to Amazon, first to 100 and then to 300.  It’s scary to have 300 instances running…even though m1.small instances are only 6¢ an hour each, that’s still $18 an hour when there are 300 of them running, and Amazon rounds up to the hour, so if you miss shutting them down by a minute you just lost a large pizza!  It looks like Google’s new Compute Engine thing is about twice as expensive for their somewhat similar low-end machine (ignoring performance differences), but it charges in 1-minute increments after the first 10, which might be cheaper for this very transient use-case.

I seem to remember reading somewhere that Amazon allocates instances for the same account to the same physical machine if possible, which might explain this scaling problem, since it means I was probably maxing out a given piece of server hardware with too many instances bursting at the same time.  It’s hard to tell if this is the case, and I need to do more testing before saying for sure.  A commenter said there might be a packets-per-second limitation in EC2, as well, but I haven’t verified that.  Once I’ve tried a few different things, I’ll do a long technical post on chrishecker.com about EC2, linode, and my dedicated host machine, comparing the different results I got.

Finally, I had to do some optimization on the SpyParty game client when the numbers started getting high.  I went a little nuts with the chat system early on and it has completion on all commands, room names, and player names, but the code that builds the completion tree was calling the memory allocator 35k times per update when the number of players got high, so I had to remove some of the stupid in that code as well.

The New Server

With all that done, and 2010 robots running on the old server, I haggled with my hosting provider and started renting a newer and much faster server.  I use SoftLayer for dedicated hosting, and have for years.2 My old server was a Pentium 4 with a single hyperthreaded core, 1GB ram, and a 100Mbps uplink, and the new server is a Xeon 3460 with four hyperthreaded cores, 4GB ram, and a 1Gbps uplink, so it’s slightly more expensive but a lot faster.  That said, everybody seems to be using VPS hosts these days.  I talked to some other indie game developers, but I didn’t have time to do a full evaluation of the tradeoffs, so I went with the devil I knew, so to speak.  It seems like VPS is going to be a bit slower but also a bit cheaper, but the big advantage of VPS to me is that you can move the virtual machine image to faster hardware and have it up and running again in minutes.  That’s a pretty great scaling sweet spot between having a single physical server and praying it doesn’t melt, and a scalable system that elastically uses cloud computing The Right Way™, and it’s also hundreds of times easier to get a VPS image working and then move it to a faster machine than it is to scale elastically.  So, I dunno, it’s definitely something worth looking into more during the year as I see how things are scaling.

The new server ate the robots for lunch:

The new server works pretty well.

For reference, 4850 simultaneous players is pretty far up the top 100 Steam games by player count, so I don’t think I have to worry about those numbers for a while.  Here’s atop’s view of things:

Well within parameters.

What’s Next?

So, that’s it for the lobbyserver loadtesting.  Now I need to move the website and registration system over to the new server, test them a bit, and start inviting everybody in in big batches.  Soon I’ll send out email to the beta testers to set up some scheduled human loadtests as well.  The robots will be jealous, left out in the cold, looking in at all the humans actually playing the game. 

Open Beta is fast approaching.

  1. I’ll release my changes at some point.
  2. Well, they were ServerMatrix when I started, and then The Planet, and now SoftLayer.

Come ask questions of me and Jonathan Blow in San Francisco on Wednesday!

Update:  Okay, the video is up from this talk, it’s a mix of general indie game stuff, some SpyParty and The Witness stuff, and miscellaneous stuff:

Jonathan Blow and I will be answering—or trying to answer—questions from the audience on Wednesday, May 8th, at 7pm PDT in San Francisco as guests of the SF IGDA.  We did a kind of similar thing a while back on Kotaku, and it turned out pretty well, so hopefully it’ll work again this time.  It’s open to the public, and it’s hosted by Dolby Labs:

Dolby Laboratories Inc.
100 Potrero Ave, San Francisco, CA

Here’s the official announcement.  I assume the questions will mostly be about indie game development.  I hope most of the questions are about indie game development.  Come ask questions about indie game development, please.

This post was really just an excuse to show you this picture.

One Year of SpyParty

Wow, SpyParty has the best beta testers in the world!  Not only are they really patient while they’re waiting for invites, patient while finding good repros for bugs, even more patient while waiting for fixes for all the bugs they find, and patient and super helpful while mentoring new players…but they’re also hilarious and fun and creative and just generally great to hang out with!

I’ve been heads-down trying to get the lobby scalable for open beta, so I didn’t even realize today is the one-year anniversary of the first batch of invites I sent to strangers!  To celebrate this auspicious day, zerotka made this amazing video and posted it on the forums; it’s simply wonderful:

Thank you so much, I love you all!  More invites soon!

Loadtesting for Open Beta, Part 3

Read Loadtesting for Open Beta, Part 1 and Part 2 to catch up on the spine-tingling story so far!

When we last left our hero, our differential state update change was a resounding success and reduced the network bandwidth utilization from 98% to 3%, and it looked like we could move on to optimizing the lobbyserver code itself to get to our goal of 1000 simultaneous loadtesting robots, until we noticed some of our robots were missing!  This led me on a wild and woolly chase through the code, which I will recount for you now…

Where’d the robots go?

The first order of business was to figure out why some robots were dying when they weren’t supposed to, and some weren’t dying when they were supposed to.  Robots: they never do what you tell them.

If you look at this graph of the number of running robots from last time, you can see that right off the bat, a bunch of them die on all the machines, and then they keep dying for about 30 seconds, and then it stabilizes.  Each of these machines should have 50 robots running solidly during the test period.

The number of loadtest robots running on each EC2 instance.

Then, to make matters worse, some of them don’t die when they’re supposed to at the end of the test.  In the graph above, they only all finally die when I kill them manually from a separate script at 03:16:30.  This points towards two different problems I’m going to have to debug on the test machines…that only manifest themselves intermittently…with gdb…in the cloud. Good times!

Okay, first things first, let’s see if the robots will tell me where they’re going.  The lobbyclient robots can output verbose log files, but I had them turned off because I was worried about bogging down the client machines.  It turns out this isn’t much of a problem, as I’ll discuss below, so I turned on logging and re-ran a test.  Then I ssh’d into one of the servers, and looked at the log files.  Well, before I looked at the files themselves, I just did an ls of the directory:

-rw-r--r-- 1 root root 258577 Mar  5 03:02 out59
-rw-r--r-- 1 root root 332320 Mar  5 03:02 out60
-rw-r--r-- 1 root root 177743 Mar  5 03:02 out61
-rw-r--r-- 1 root root 181639 Mar  5 03:02 out62
-rw-r--r-- 1 root root 264535 Mar  5 03:02 out63
-rw-r--r-- 1 root root 333515 Mar  5 03:02 out64
-rw-r--r-- 1 root root 282875 Mar  5 03:02 out65
-rw-r--r-- 1 root root 271040 Mar  5 03:02 out66
-rw-r--r-- 1 root root    264 Mar  5 03:01 out67
-rw-r--r-- 1 root root    264 Mar  5 03:01 out68
-rw-r--r-- 1 root root 284838 Mar  5 03:02 out69
-rw-r--r-- 1 root root 332967 Mar  5 03:02 out70
-rw-r--r-- 1 root root 303352 Mar  5 03:02 out71
-rw-r--r-- 1 root root 310596 Mar  5 03:02 out72
-rw-r--r-- 1 root root 194669 Mar  5 03:02 out73
-rw-r--r-- 1 root root 313193 Mar  5 03:02 out74
-rw-r--r-- 1 root root 238246 Mar  5 03:02 out75
-rw-r--r-- 1 root root 264190 Mar  5 03:02 out76
-rw-r--r-- 1 root root 198096 Mar  5 03:02 out77
-rw-r--r-- 1 root root 233980 Mar  5 03:02 out78
-rw-r--r-- 1 root root    264 Mar  5 03:01 out79
-rw-r--r-- 1 root root    264 Mar  5 03:01 out80
-rw-r--r-- 1 root root 301029 Mar  5 03:02 out81
-rw-r--r-- 1 root root 299694 Mar  5 03:02 out82
-rw-r--r-- 1 root root    264 Mar  5 03:01 out83
-rw-r--r-- 1 root root 351158 Mar  5 03:02 out84
-rw-r--r-- 1 root root 188071 Mar  5 03:02 out85
-rw-r--r-- 1 root root 242228 Mar  5 03:02 out86

Well, there’s a clue, at least for the early-dyers.  The contents of those 264-byte log files look like this:

Lobby Standalone Client: 1000.0.0.5
init genrand w/0, first val is 1178568022
Running for 61 seconds.
LobbyClient started, v1000.0.0.5 / v12
LobbyClient UDP bound to port 32921
lobbyclient: sendto_kdc.c:617: cm_get_ssflags: Assertion `i < selstate->nfds' failed.

A-ha!  sendto_kdc.c is a file in the Kerberos libraries, which I use for login authentication.

I really love Kerberos.  The architecture just feels right to me; the API is simple, clean, and flexible; it’s cross-platform and open source, so I’ve been able to contribute features and bug fixes as I’ve used it and trace into the code when I was confused about something; and the folks at MIT who develop it are smart, knowledgeable, open-minded, and don’t mind some crazy indie game developer asking dumb questions about the best way to do things that were pretty clearly not part of the original university and enterprise use-cases.  Most importantly, it’s battle-tested; it’s used by tons of different applications, and it’s the foundation of the modern Windows domain and Xbox authentication systems, so I know it works.  The last thing you ever want to do is roll your own authentication system.

So, that assert’s the first place to look for the early-dying robots.

Next, I looked into the never-dying robots.  I logged into one of the machines that still had zombie robots1 running, ran pidof lobbyclient to figure out the process ID of one of them, and attached gdb to the robot.  A quick thread apply all backtrace full and I found the thread that was hanging while the main thread was trying to join them and exit cleanly.  It looked like the bad code was in a call to poll, and it just so happened it was in sendto_kdc.c as well! I realized I was going to need some debug symbols, but this was easy since I build the Kerberos libraries myself,2 so a quick scp of the debuginfo rpm and reattaching gdb and I could dig down a bit deeper.

The Kerberos libraries are built with optimizations on, which always makes debugging interesting, but I think it builds programming character to debug optimized code, so I don’t mind.3  Here’s the code in question:

    if (in->end_time.tv_sec == 0)
        timeout = -1;
    else {
        e = k5_getcurtime(&now);
        if (e)
            return e;
        timeout = (in->end_time.tv_sec - now.tv_sec) * 1000 +
            (in->end_time.tv_usec - now.tv_usec) / 1000;
    }
    /* We don't need a separate copy of the selstate for poll, but use one
     * anyway for consistency with the select wrapper. */
    *out = *in;
    *sret = poll(out->fds, out->nfds, timeout);

Well, these loadtesting machines are under some load themselves so they can be a bit sluggish, and there’s a problem with this code in that scenario:  if the call to k5_getcurtime() returns a time later than the in->end_time passed in by the caller, the computed timeout comes out negative.  As it says on the poll manpage, “Specifying a negative value in timeout means an infinite timeout.”  Digging around on the stack verified the timeout was negative.
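
The minimal guard for this particular symptom is just to clamp the timeout before handing it to poll.  I don’t remember whether this is exactly what the upstream commit does, but it’s the obvious local fix:

    /* If the machine is sluggish enough that "now" is already past end_time,
     * don't let the negative result turn into poll()'s "wait forever". */
    if (timeout < 0)
        timeout = 0;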

Okay, so now we have a pretty good clue for each of the problems.  The second problem with the poll timeout seemed easy to fix, but the first one was pretty mysterious and might take some real debugging.  I decided to check with the krbdev mailing list to see if they had any ideas while I looked into the problems more deeply.  While doing so, I looked at the main Kerberos source repository and found a commit for the timeout problem, so it had already been fixed in a later version.  I was hoping maybe this was true of the assert as well.  True to form, the most excellent Greg Hudson replied with three more commits he thought might help.  Meanwhile, I hacked the code to loop on a call to sleep() instead of asserting to convert the early-dyers into never-dying zombies so I could attach the debugger, since that’d worked so well on the second problem.

Sadly, although the negative-timeout-check fixed the original zombies, none of the fixes prevented the assert problem.  It wasn’t asserting anymore because the asserters were now looping, so now I had more zombies to deal with.

Lots of zombie robots!

Time to get down and dirty and debug it for real.

As an aside, it’s a weird feeling when you’re debugging something on an EC2 instance, since you’re paying for it hourly.  I felt a definite pressure to hurry up and debug faster…oh no, there went another $0.06 * 5 instances!

Too deep we delved there, and woke the nameless fear!

Like I said, debugging optimized code builds character, and I built a lot of character with this bug.  The assert was in a function that was inlined by the optimizer, which was in a function that was inlined by the optimizer, which was in a loop, which looked like it had been unrolled.  It was slow going, with lots of restarts and stuffing values into memory and registers so the code would execute again.  At one point, I thought I’d narrowed it down to a compiler bug in gcc, because it seemed like a variable wasn’t getting reloaded from the stack correctly sometimes, but it was really hard to tell with all the inlining.  Even thinking it was a compiler bug was pretty silly and that thought always violates Assume it’s a Bug, so I should have known better, but it happens. 

Finally, a combination of stepping through the code, and looking at the code, and modifying the code revealed the problem.  Here’s the source file at the version I was debugging, linked to the area of the code where the bug lurked.  If you search for “host+1”, you will see that it occurs twice, once inside the loop, and once outside the loop.  This is what threw me when I was debugging…initially I didn’t notice there were two separate calls to service_fds(), so in the debugger I thought it was looping again but loading weird values.  I can only assume that, for anybody but me, the second call almost never occurred in the wild after the inner loop on hosts completed, because in that case host+1 is n_conns+1, which is out of bounds for the connections.4  This bug was easy for me to fix locally, and it looks like it was (inadvertently?) fixed in this commit in the main Kerberos code.

Thank goodness for open source code, where you can modify it and debug it when you run into troubles!

No more zombies!

Moar Robots!

Now that I (thought I) was done debugging the robots, and I still had 5 EC2 instances running, I decided to see how well the instances did with 100 robots on each.  My original tests indicated I could only run about 50 per m1.small instance, but the client also got a lot more efficient with the differential state update change described last time, and it turns out 100 per instance is no problem, as you can see here:

Top on an m1.small instance running 100 robots at only 20% CPU.

 The lobby was a little more grim with 501 clients:

500 robots and me.

 Here’s how the CPU looks with all these robots in the lobby, chatting at each other:

atop in CPU mode with 500 robots in the lobby jabbering.

There are two logical cores in this machine (it’s that single hyperthreaded Pentium 4), which is why the lobbyserver is at 115% CPU.  It’s mostly single-threaded for simplicity, but it uses threads for servicing network connections.

However, once the robots start playing each other, the CPU usage drops a bunch:

Stop talking, start playing!

This is pretty good news.  I think it means the chat system needs some work, because when everybody’s in the lobby all the chats go to all the players, but when people are in a match, chats only go between those two players, and they don’t get any of the lobby chats.  We’ll find out soon, as I describe below.  Memory looks pretty good with 501 clients, staying at about 256KB per client:

2013/03/16-04:53:11: MEMORY_POSIX 501/993/492: resident 25540/25540, virtual 198000/198000
2013/03/16-04:53:11: MEMORY_NEW 501/993/492: bytes 132098963, news 69166, deletes 55478

One last atop screenshot…this one is while the robots are starting up and connecting, but before they’re in the lobby:

Loadtest startup performance.

This one shows Kerberos and OpenLDAP taking a fair amount of time at the start of a new loadtest.  I use LDAP as the database backend for Kerberos, among other things, and when all of these robots are trying to get login tickets at the same time, it bogs down a bit.  I’m not too worried about this profile, since this scenario of 500 people all needing tickets at the same time is going to be rare (the tickets last a while, so this doesn’t happen every time), and there are well-known ways of scaling Kerberos and OpenLDAP if I need them.

Finally, here’s a shot of the 100 robots per instance:

Wait a second…

Oh no!  Who the hell is that single zombie robot at the end on instance 4!?!  Sigh.  I find that machine, log in, attach the debugger, and check it out.  It looks like I have a pretty rare deadlock between two threads during shutdown.  I’m just going to ignore it for now and deal with it later.  All the bugs above were preventing robots from doing a good job at loadtesting, while this one is just preventing 1 out of 500 from shutting down completely…it can wait.  Here’s a shot of this guy, still in the lobby, mocking me:

At least I have one more Sniper win on this debug server than this troll!

There’s actually another bug I found in the new differential state update code while I was testing this, where the server will sometimes send a duplicate client, but I had a comment in the code saying I thought it might be possible, and now I know it is.  It turns out when you have 500 clients pounding on a server, you find bugs.

Coming Up Next Time

Okay, so now I’ve got things to the point where I can easily run a predictable number of loadtesting robots against the debug lobbyserver, and I’ve got some high-level profiles telling me that I’m now CPU-bound inside the server itself.  That points to a clear next step:  profile the code.  I use an old hacked-up version of Sean Barrett’s iprof for all my client runtime profiling, so my next task is to integrate that into the server code and get it running on Linux.  That shouldn’t be too hard, and then I’ll be able to tell what’s actually taking the time5 when a lot of clients are in the lobby.

My prediction, based on the above, is that the chat message handling is going to be the main culprit.  If so, it’ll be easy to queue up the chats and send them out in bunches, but I need to be careful here, because the robots chat a lot more than real humans would right now, so I don’t want to spend too much time optimizing this.  I think I’ll keep the robots as they are for the initial profiles, and then dial back their chattiness to more realistic levels after I’ve plucked the low-hanging chat fruit.  I also need to teach the robots how to use lobby rooms for a more realistic test.
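
For the record, the queue-and-flush shape I have in mind looks something like this.  The names (queued_chats, ChatMessage, Packet, SendPacket) are hypothetical stand-ins rather than actual lobbyserver code, and the tricky part in practice is splitting across packets when a lot of messages pile up in one tick:

    // Sketch of chat batching: collect chat messages during the tick, then
    // flush them to each client in as few packets as possible, starting a new
    // packet whenever the current one would overflow.  All names are made up.
    #include <cstddef>
    #include <vector>

    struct ChatMessage { size_t SerializedSize() const; };
    struct Client;
    struct Packet {
        size_t BytesRemaining() const;
        bool   Empty() const;
        void   Append(const ChatMessage &msg);
        void   Clear();
    };
    void SendPacket(Client *client, const Packet &packet);

    std::vector<ChatMessage> queued_chats;

    void FlushQueuedChats(Client *client) {
        Packet packet;
        for (const ChatMessage &msg : queued_chats) {
            if (packet.BytesRemaining() < msg.SerializedSize()) {
                SendPacket(client, packet);   // this one is full, send it and start another
                packet.Clear();
            }
            packet.Append(msg);
        }
        if (!packet.Empty())
            SendPacket(client, packet);
    }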

Finally, I’m wondering if my usage of select() is going to be an issue as I get close to 1000 robots.  I may need to port to epoll().  We shall see!

“Assume Nothing!”

And finally, the SimCity launch has given me pause…I’m still forging ahead with my 1000 simultaneous goal, but I really hope it’s enough and things go smoothly.  I would much rather have a slow buildup of players over the next year as I roll out more cool stuff than a giant spike that melts everything and makes players grumpy.

On to Part 4…

  1. ZOMBIE ROBOTS!!!
  2. I have some local patches I haven’t cleaned up enough to contribute yet.
  3. gdb is not the best for assembly language debugging, but I did learn about “layout asm”, which helps a bit.
  4. It never crashed because conns has a preallocated number of connections that was always bigger than n_conns+1.
  5. This is only partially true, because iprof is single-threaded…I really wish there was a good cross-platform lightweight way to get per-thread timings.

Come to drawnonward’s 100th stream party on Thursday, March 7th at midnight PST!

I guess Thursday midnight is technically Friday morning, unless midnight is defined as 00:00-ε or something, but anyway, the Official SpyParty Late Night Streamer™ drawnonward is celebrating his 100th stream on Thursday night at midnight, Pacific Standard Time (GMT-8), which means it’s Friday at 8am GMT (I think, unless I messed that math up), on his twitch.tv channel:

http://www.twitch.tv/drawnonward

It should be fun and rather silly…he posted in the beta forums asking for ideas, and people are gifting him random Steam stuff to give away to people in stream chat; he’s going to dress up as one of the current SpyParty characters (he’s not telling which one!), and I’ll be joining him from midnight to 1am to play on his stream.  Sadly, my internet this week is kind of slow and terrible, so I don’t think I can stream the other side on the SpyParty channel, but I should be able to join him via voice chat.  My personal dream is that he’ll actually start using a non-default icon/image for his channel by then.

Come along, hang out in the stream chat, and it should be a good time. The big question is whether I’ll have this build done and tested before then. Maybe you can watch a brand new build fail spectacularly, live, while I curse over TeamSpeak!

PS. I think I have figured out who is killing my robots, so Part 3 of the Loadtesting for Open Beta series will be posted soon!