<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>SpyParty - A Spy Game About Subtle Behavior &#187; programming</title>
	<atom:link href="http://www.spyparty.com/category/programming/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.spyparty.com</link>
	<description>Chris Hecker&#039;s new espionage game about subtle behavior, performance, perception, and deception.</description>
	<lastBuildDate>Sun, 16 Mar 2014 05:01:44 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.8.2</generator>
	<item>
		<title>In-game Replays Update and Preview</title>
		<link>http://www.spyparty.com/2014/01/20/in-game-replays-update-and-preview/</link>
		<comments>http://www.spyparty.com/2014/01/20/in-game-replays-update-and-preview/#comments</comments>
		<pubDate>Mon, 20 Jan 2014 21:50:14 +0000</pubDate>
		<dc:creator><![CDATA[checker]]></dc:creator>
				<category><![CDATA[bugs]]></category>
		<category><![CDATA[competitive gaming]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[streams]]></category>

		<guid isPermaLink="false">http://www.spyparty.com/?p=4229</guid>
		<description><![CDATA[One of the things I want to do going forward is to talk about what I&#8217;m working on and what my near-term priorities and plans are&#8230;basically I want to put my todo list up on the blog somehow so you all can see what&#8217;s coming down the pike for SpyParty. Some indies, like Klei, actually [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>One of the things I want to do going forward is to talk about what I&#8217;m working on and what my near-term priorities and plans are&#8230;basically I want to put my todo list up on the blog somehow so you all can see what&#8217;s coming down the pike for <strong>SpyParty</strong>. Some indies, like <a href="http://kleientertainment.com/">Klei,</a> actually schedule their updates and make an event out of releasing them on a certain day, but I don&#8217;t think I have the production competence to hit those kinds of tight dates consistently, so I figure the next best thing is to at least talk about what I&#8217;m doing and how it&#8217;s going more regularly. While I figure out how to do that effectively here on the blog, here&#8217;s another ad hoc post on my current focus, &#8220;in-game replays&#8221;.</p>
<a name="What+are+replays%3F"></a><h3>What are replays?</h3>
<p>In a skill-based competitive game like <strong>SpyParty</strong> (or Starcraft or Counter-Strike or Go or Poker&#8230;), watching and studying games is an important part of learning and improving your skills. Right now, the only way to watch a game of <strong>SpyParty</strong> that you&#8217;re not participating in as a player is to watch a stream or a video. I stream on the <a href="http://twitch.tv/spyparty"><strong>SpyParty</strong> twitch.tv channel</a>, and I post videos to the <a href="http://youtube.com/spypartygame"><strong>SpyParty</strong> YouTube channel</a>, and lots of other beta testers stream and post videos too. You can catch a lot of streams by signing up for the <a title="SpyParty Streams Lists and Notification Sign Up" href="http://www.spyparty.com/streams/">SpyParty Streams Notifier</a>, and you can check out <a href="http://www.youtube.com/results?search_query=spyparty">all the Let&#8217;s Plays by searching YouTube</a>, but if somebody didn&#8217;t capture their game, there&#8217;s no way to go back and study it and it&#8217;s gone forever. Even if somebody did capture the game to video, if their camera angle wasn&#8217;t right, you might miss the thing you want to see.</p>
<p>Replays are the solution to these problems. A replay is a recording of the game, but it&#8217;s a recording of the stream of animation commands and events and movements instead of just a stream of images like a video, so you can move around in the replay while it&#8217;s playing, freeze it and look at the layout of the party, where the Sniper&#8217;s laser is relative to the Spy, and even rewind and study a section from different camera angles. Once replays are in, they&#8217;re going to revolutionize the study of elite <strong>SpyParty</strong> games; the plan is to capture a replay of every game ever played, and add them to a database that can be queried by any beta tester to study any game. About to play in a tournament against <a title="On EVO 2013, Interviewing kcmmmmm, and losing a bet with Seth Killian" href="http://www.spyparty.com/2013/07/10/on-evo-2013-interviewing-kcmmmmm-and-losing-a-bet-with-seth-killian/"><strong>kcmmmmm</strong></a>? Study the last 100 of his games against other high level players and try to get a feel for his play style. Heck, you can even sort of play the games from the Sniper&#8217;s point-of-view, trying to find the Spy, although since the Spy won&#8217;t respond to the laser sight it&#8217;s not going to be a real test. I don&#8217;t know whether this is going to benefit Spies or Snipers more,<sup><a href="http://www.spyparty.com/2014/01/20/in-game-replays-update-and-preview/#footnote_0_4229" id="identifier_0_4229" class="footnote-link footnote-identifier-link" title="my hunch is Snipers will benefit a bit more, but I don&rsquo;t know">1</a></sup> but it&#8217;s definitely going to raise the level of play across the board.</p>
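<p>To illustrate the difference between recording events and recording images, here is a tiny Python sketch. It is not how <strong>SpyParty</strong> stores replays; the <code>Replay</code> class, the guest names, and the positions are all hypothetical, and rewinding by replaying the log from the start is just the simplest approach I could assume.</p>

```python
class Replay:
    """A toy event log: each entry says where a guest moved and when."""

    def __init__(self):
        # (time, guest, position) tuples, appended in time order.
        self.events = []

    def record(self, time, guest, position):
        self.events.append((time, guest, position))

    def state_at(self, time):
        """Rebuild the party layout at `time` by replaying events from the start."""
        state = {}
        for event_time, guest, position in self.events:
            if event_time > time:
                break  # everything after this point has not happened yet
            state[guest] = position
        return state


replay = Replay()
replay.record(0.0, "spy", (0, 0))
replay.record(1.0, "bartender", (5, 2))
replay.record(2.0, "spy", (3, 1))

# "Rewinding" is just asking for the state at an earlier time.
layout_now = replay.state_at(2.0)
layout_earlier = replay.state_at(1.5)
```

<p>Because the log holds positions rather than pixels, the same data can be drawn from any camera angle, which is what a video cannot offer.</p>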
<p>After I get replays working, the same technology will be used to implement &#8220;spectation&#8221;, which will allow you to log onto a game in-progress and watch it live, which is like watching a stream, but you can move the camera and see the action from either side, or even from a different camera position. At that point, if you join the lobby and everybody else is playing, you can just go spectate until somebody else joins to play. This will be huge for streaming, since it will allow people to cast other games, and commentate on the play! I&#8217;ll even implement mini-tournaments and simple betting within a spectation match, like some of the Starcraft mods do.</p>
<p>First, though, will come raw replays saved locally. You&#8217;ll be able to review your games, but to see somebody else&#8217;s you&#8217;ll have to get the replay file from them. The files should be pretty small, like hopefully one or two megabytes. After the bugs are worked out for that, I&#8217;ll get the replay database server up and running, and then spectation.</p>
<p>I&#8217;m really excited about replays, and I think they&#8217;ll increase the depth of the meta-game and help the community share and discuss strategies.</p>
<a name="Video+Previews"></a><h3>Video Previews</h3>
<p>Here&#8217;s a video I recorded from last night&#8217;s stream for <a href="https://twitter.com/drawnonward"><strong>drawnonward</strong></a>&#8217;s 10,000th game (!). I gave a short preview of the current state of the replay system, which is still buggy, but the hard part (rewind) is mostly working:</p>
<p><a href="http://www.spyparty.com/2014/01/20/in-game-replays-update-and-preview/"><em>Click here to view the embedded video.</em></a></p>
<p>If you&#8217;d like to see videos of the terrible things I did to the partygoers as I was getting rewind working, you can check out these two videos:</p>
<p><a href="http://www.spyparty.com/2014/01/20/in-game-replays-update-and-preview/"><em>Click here to view the embedded video.</em></a></p>
<p><a href="http://www.spyparty.com/2014/01/20/in-game-replays-update-and-preview/"><em>Click here to view the embedded video.</em></a></p>
<hr/><ol class="footnotes"><li id="footnote_0_4229" class="footnote">my hunch is Snipers will benefit a bit more, but I don&#8217;t know</li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.spyparty.com/2014/01/20/in-game-replays-update-and-preview/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Release Notes for 3076 and 3091! (Let&#8217;s forget about 3075, shall we?)</title>
		<link>http://www.spyparty.com/2013/12/21/release-notes-for-3076-and-3091-lets-forget-about-3075-shall-we/</link>
		<comments>http://www.spyparty.com/2013/12/21/release-notes-for-3076-and-3091-lets-forget-about-3075-shall-we/#comments</comments>
		<pubDate>Sun, 22 Dec 2013 00:13:43 +0000</pubDate>
		<dc:creator><![CDATA[checker]]></dc:creator>
				<category><![CDATA[beta]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[release notes]]></category>
		<category><![CDATA[streams]]></category>

		<guid isPermaLink="false">http://www.spyparty.com/?p=4174</guid>
		<description><![CDATA[Here&#8217;s the latest release notes stream, co-hosted by virifaux.  It is just release notes.  It is not 2 hours and 44 minutes of me trying to fix showstopper bugs live on stream. These builds were a long time coming because I had to re-do the entire inside of the game in preparation for spectation and [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Here&#8217;s the latest release notes stream, co-hosted by <strong>virifaux</strong>.  It is just release notes.  It is not 2 hours and 44 minutes of me trying to fix showstopper bugs live on stream.<sup><a href="http://www.spyparty.com/2013/12/21/release-notes-for-3076-and-3091-lets-forget-about-3075-shall-we/#footnote_0_4174" id="identifier_0_4174" class="footnote-link footnote-identifier-link" title="If you really want to see that, here is your link.">1</a></sup></p>
<p>These builds were a long time coming because I had to re-do the entire inside of the game in preparation for spectation and replays, which aren&#8217;t quite done yet, but are getting closer.  However, I put in a bunch of really important bug fixes, the cool looking &#8220;menu party&#8221;, the oft-requested &#8220;back button&#8221; on the game setup screens, and Practice Mode is now handled properly so it can be used for <a title="How to Report Bugs the SpyParty Way" href="http://www.spyparty.com/2012/04/12/how-to-report-bugs-the-spyparty-way/">bug repros</a> much more reliably.  Also, you can see some nascent spectation action developing with the addition of hitting &lt;tab&gt; to switch between Spy and Sniper views in Practice Mode.</p>
<p>Oh, and of course:</p>
<div id="attachment_4177" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/12/santalanterns.png"><img class="size-large wp-image-4177 " alt="santalanterns!" src="http://cdn.spyparty.com/wp-content/uploads/2013/12/santalanterns-600x251.png" width="600" height="251" /></a><p class="wp-caption-text">Santalanterns! Inspiration courtesy of zerotka, as per usual with our holiday Easter Eggs!</p></div>
<p><a href="http://www.spyparty.com/2013/12/21/release-notes-for-3076-and-3091-lets-forget-about-3075-shall-we/"><em>Click here to view the embedded video.</em></a></p>
<hr/><ol class="footnotes"><li id="footnote_0_4174" class="footnote">If you really want to see that, <a href="http://www.twitch.tv/spyparty/b/487857396">here is your link</a>.</li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.spyparty.com/2013/12/21/release-notes-for-3076-and-3091-lets-forget-about-3075-shall-we/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>This Week in SpyParty, Week 2: A Bug In Plain Sight</title>
		<link>http://www.spyparty.com/2013/11/22/this-week-in-spyparty-week-2-a-bug-in-plain-sight/</link>
		<comments>http://www.spyparty.com/2013/11/22/this-week-in-spyparty-week-2-a-bug-in-plain-sight/#comments</comments>
		<pubDate>Fri, 22 Nov 2013 20:39:35 +0000</pubDate>
		<dc:creator><![CDATA[ZeroTKA]]></dc:creator>
				<category><![CDATA[beta]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[streams]]></category>
		<category><![CDATA[thisweek]]></category>

		<guid isPermaLink="false">http://www.spyparty.com/?p=4057</guid>
		<description><![CDATA[This week there was a recent bug discovery that caught my attention. With any game, especially one in beta, there are going to be bugs regardless of how hard you try to prevent them. The nature of the beast is if you have a game then you have bugs in your code. Some bugs are [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>This week there was a recent bug discovery that caught my attention. With any game, especially one in beta, there are going to be bugs regardless of how hard you try to prevent them. The nature of the beast is if you have a game then you have bugs in your code. Some bugs are minor and some are absolutely game-breaking. This one in particular seems like it should have been found ages ago.</p>
<a name="What+is+This+Bug%3F"></a><h3><strong>What is This Bug?</strong></h3>
<p>The bug that was discovered allows your laser to pass through certain sections of certain characters. This makes it seem like you miss some of your shots even though you really shouldn&#8217;t. This bug is easy to reproduce as well. All you have to do is aim at these imaginary holes and that&#8217;s it. For example, <a href="http://www.twitch.tv/virifaux/b/479531360">you can aim your laser at the stomach region</a> of Alphonse &#8220;Snaps&#8221; McGee, or at his hat:</p>
<p style="text-align: center;"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/11/shoot-through-bug.png"><img class="size-medium wp-image-4087 aligncenter" alt="shoot-through-bug" src="http://cdn.spyparty.com/wp-content/uploads/2013/11/shoot-through-bug-300x168.png" width="300" height="168" /></a></p>
<p>Since there was a good <a href="http://www.spyparty.com/2012/04/12/how-to-report-bugs-the-spyparty-way/">repro</a>, <strong>checker</strong> was able to find and fix it pretty quickly. I asked him if he could give us a more technical side to this bug and he agreed. Here is his response:</p>
<blockquote>
<p><em>Okay, so the characters are made out of triangles. For the old characters, there are about 1.5k tris, and for the new ones there are about 11k tris.  It&#8217;s slow to do a raycast against all these triangles, so you do a check on the overall bounding box of the character first, since that&#8217;s much faster.  If your bounding box is conservative, then that means all the triangles are inside it, so if the ray doesn&#8217;t hit the bounding box, you don&#8217;t have to test all the triangles.  But, since that test assumes all the triangles are inside the bounding box (hence the word &#8220;bounding&#8221;), if that&#8217;s not actually true, you won&#8217;t test the triangles at all. </em></p>
</blockquote>
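<p>To make that early-out concrete, here is a minimal Python sketch. It is not the actual <strong>SpyParty</strong> code: the function names, the coordinate convention (z up, y pointing forward), and all of the numbers are assumptions made up for illustration. It uses a standard slab test for the box and the M&ouml;ller-Trumbore test for a single triangle.</p>

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_hits_box(origin, direction, box):
    """Slab test: does the ray (t >= 0) pass through the axis-aligned box?"""
    box_min, box_max = box
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this pair of planes: it must start between them.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def ray_hits_triangle(origin, direction, tri, eps=1e-9):
    """Moller-Trumbore ray/triangle intersection."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return False  # ray is parallel to the triangle
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return False
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return False
    return dot(e2, q) * inv >= 0.0

def ray_hits_character(origin, direction, box, triangles):
    # Cheap test first; only correct if the box really bounds every triangle.
    if not ray_hits_box(origin, direction, box):
        return False
    return any(ray_hits_triangle(origin, direction, t) for t in triangles)

# Hypothetical "stomach" triangle sticking out in front (+y) of a thin box.
BELLY = ((0.0, 0.1, 0.9), (0.0, 0.4, 0.9), (0.0, 0.25, 1.2))
BUSTED_BOX = ((-0.3, -0.05, 0.0), (0.3, 0.05, 1.8))  # too thin to hold BELLY
FIXED_BOX = ((-0.3, -0.05, 0.0), (0.3, 0.4, 1.8))    # actually bounds BELLY
ORIGIN, DIRECTION = (-5.0, 0.25, 1.0), (1.0, 0.0, 0.0)  # laser from the side

shot_with_busted_box = ray_hits_character(ORIGIN, DIRECTION, BUSTED_BOX, [BELLY])
shot_with_fixed_box = ray_hits_character(ORIGIN, DIRECTION, FIXED_BOX, [BELLY])
```

<p>In this toy setup the ray really does cross the triangle, but with the too-thin box the box test fails first and the triangle is never checked, which is the same kind of miss described above.</p>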
<p>This is what the bounding box looked like before the fix:</p>
<p style="text-align: center;"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/11/snaps-busted.png"><img class="size-medium wp-image-4083 aligncenter" alt="snaps-busted" src="http://cdn.spyparty.com/wp-content/uploads/2013/11/snaps-busted-205x300.png" width="205" height="300" /></a></p>
<p style="text-align: left;">This is what it looks like after the fix:</p>
<p style="text-align: center;"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/11/snaps-fixed.png"><img class="aligncenter size-medium wp-image-4089" alt="snaps-fixed" src="http://cdn.spyparty.com/wp-content/uploads/2013/11/snaps-fixed-202x300.png" width="202" height="300" /></a></p>
<p>I looked into this bug so I could gain my own understanding of it. I noticed the holes seemed to disappear during animations and while holding a drink. I asked <strong>checker</strong> if he could give some insight into what&#8217;s happening. Here is what he had to say:</p>
<blockquote>
<p><em>With animated characters, you really don&#8217;t want to build the bounding box from the triangles as they move around because that&#8217;s slow, and you often don&#8217;t have access to the posed triangles because that happens on the video card, so you build the bounding box from the bones themselves, and then expand it a bit (20% in this case) to get all the triangles. Well, with the old skeleton, it&#8217;s almost flat on the xz plane in the rest pose, so the expansion doesn&#8217;t do much.  When you&#8217;re holding a drink, your arm is forward, and that makes the bounding box actually bound the whole body! </em></p>
</blockquote>
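<p>Here is a small Python sketch of why the 20% expansion does so little for a flat skeleton. Again, this is only an illustration: the joint positions are invented, and I am assuming the expansion scales each axis of the box about its center, which may not be how the game does it.</p>

```python
def bounds_of(points):
    """Axis-aligned bounding box (min corner, max corner) of some 3D points."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def expand(box, fraction=0.2):
    """Grow each axis of the box by `fraction` of its own extent, about its center."""
    lo, hi = box
    new_lo, new_hi = [], []
    for a, b in zip(lo, hi):
        center = (a + b) / 2.0
        half = (b - a) / 2.0 * (1.0 + fraction)
        new_lo.append(center - half)
        new_hi.append(center + half)
    return tuple(new_lo), tuple(new_hi)

def contains(box, point):
    lo, hi = box
    return all(b >= p >= a for a, p, b in zip(lo, point, hi))

# Made-up joints for a rest pose that is nearly flat front-to-back (y is tiny).
REST_BONES = [(0.0, 0.0, 0.0), (0.0, 0.02, 1.0),
              (-0.8, 0.0, 1.4), (0.8, 0.0, 1.4), (0.0, -0.02, 1.8)]
# Same skeleton, plus a hand held forward with a drink.
DRINK_BONES = REST_BONES + [(0.2, 0.5, 1.2)]
# A made-up mesh vertex on the stomach, well in front of the spine.
BELLY_VERTEX = (0.0, 0.3, 1.0)

rest_box = expand(bounds_of(REST_BONES))
drink_box = expand(bounds_of(DRINK_BONES))

belly_in_rest_box = contains(rest_box, BELLY_VERTEX)
belly_in_drink_box = contains(drink_box, BELLY_VERTEX)
```

<p>Twenty percent of almost nothing is still almost nothing, so the rest-pose box stays paper thin and leaves the stomach outside, while a single joint held forward stretches the box enough to cover it.</p>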
<p>These are images with the old bounding box built from the bones:</p>
<p style="text-align: center;"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/11/snaps-busted-bones.png"><img class="size-medium wp-image-4090 alignnone" alt="snaps-busted-bones" src="http://cdn.spyparty.com/wp-content/uploads/2013/11/snaps-busted-bones-205x300.png" width="205" height="300" /></a> <a href="http://cdn.spyparty.com/wp-content/uploads/2013/11/snaps-busted-bones-drink.png"><img class="size-medium wp-image-4091 alignnone" alt="snaps-busted-bones-drink" src="http://cdn.spyparty.com/wp-content/uploads/2013/11/snaps-busted-bones-drink-191x300.png" width="191" height="300" /></a></p>
<p>So what was the fix?</p>
<blockquote>
<p><em>I looked at the code, and with the current (read: slow) animation code I decided it wouldn&#8217;t actually be any slower if I just computed the triangle-accurate bounding box.  Once I optimize the animation code, I&#8217;ll have to come up with a better solution, potentially having bounding boxes for each bone authored in Maya, or something smarter, I&#8217;m not sure yet.</em></p>
</blockquote>
<a name="How+was+it+Discovered%3F"></a><h3><strong>How was it Discovered?</strong></h3>
<p><strong>krazycaley </strong>took on challenger <strong>virifaux</strong> in <em><a title="Welcoming Zero &amp; This Week in SpyParty, Week 1: Spy on the Hill" href="http://www.spyparty.com/2013/11/12/welcoming-zero-this-week-in-spyparty-week-1-spy-on-the-hill/">Spy on the Hill</a> </em>week 7. It was during one of these games the Sniper appeared to shoot through one of the characters to both the stream viewers and to himself.  Interestingly enough, he did!</p>
<p>You can hear <strong>krazycaley</strong> exclaim &#8220;Noooo! Miss!&#8221; right after the shot. He knew who <strong>virifaux</strong> was and <strong>virifaux</strong> knew he was dead to rights. Yet somehow he was still alive. This was enough to pique the interest of <b>virifaux</b> and he chose to pursue it more deeply a day later. Once he discovered a reliable way <a title="How to Report Bugs the SpyParty Way" href="http://www.spyparty.com/2012/04/12/how-to-report-bugs-the-spyparty-way/">to reproduce the bug</a>, he made a quick video showcasing it in action, marking the beginning of the end for this bug.</p>
<p>Streaming was a big help in squashing this bug. <strong>SpyParty</strong> doesn&#8217;t support replays yet, which makes bug hunting difficult sometimes. Streaming is probably the next best thing. If pictures are worth a thousand words then videos are probably worth a couple million. In this case, the video gave clues to <b>virifaux</b> for where to start looking. He knew exactly what it looked like when the bug occurred. He could watch it over and over until he formulated his hypothesis. Then he could go into practice mode and test it out.</p>
<p>Streaming and videos give us great insight into reproducing pesky bugs. Streaming may not always be the main tool in bug hunting, but streams and videos are almost always helpful for gathering more evidence and figuring out a clean repro.   This was the case with <a title="One Bug’s Story, or, Assume it’s a bug!" href="http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/">another bug&#8217;s story</a>. I don&#8217;t think streaming was the main reason this bug was caught and squashed but it certainly played a role. </p>
<a name="Why+is+This+Interesting%3F"></a><h3><strong>Why is This Interesting?</strong></h3>
<p>I find it very interesting the bug was discovered only now. When first looking over the new bug report, <strong>checker</strong> made a post stating he remembers this happened a long time ago and that it was super rare, so we know this bug has been in the game for quite some time.</p>
<p>I am going to throw some numbers your way to help give an idea of the scope of this thing. There are currently over 11,000 people who have <a title="SpyParty Beta Registration" href="http://www.spyparty.com/beta/">registered with the beta</a>. The top 100 players have a combined total of 185,520 games played. It&#8217;s safe to say there have been a lot of games played across the entire community. This isn&#8217;t even counting the number of people who have played at conventions. In all of the games that have been played, this bug is just now being talked about and reproduced.</p>
<p>That&#8217;s not the only thing about this particular bug that makes it so interesting to me. Not only have there been tons of games in which it could have been discovered, the steps needed to reproduce the bug are really simple. When you combine these two characteristics together, it&#8217;s crazy how long this bug has lasted! However, thinking about it more deeply, maybe it&#8217;s not so crazy after all&#8230;</p>
<p>There are a few things that need to align correctly in order for this to be spotted. The main thing is looking at the character from the side. Next, the character can&#8217;t be doing the vast majority of animations, nor can he or she be holding a drink. Finally, the Sniper has to aim in the correct spots. Perhaps you also need to be streaming for <a href="http://www.twitch.tv/spyparty/b/477977498?t=22m45s"><em>Spy on the Hill</em> week 7</a>.</p>
<p>Whatever the case, the bug is nearing the end of its lifetime since it&#8217;s fixed on <strong>checker&#8217;s</strong> local copy, but until the patch goes live you might be able to utilize this bug in a game mode. As beta tester <strong>virifaux </strong>said, &#8220;Now that we have a reliable reproduction, we&#8217;ll have skill shots where we shoot through McGee to hit the spy.&#8221; Bugs can be fun to play with so get in your skill shots before it&#8217;s too late!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.spyparty.com/2013/11/22/this-week-in-spyparty-week-2-a-bug-in-plain-sight/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Release notes v0.1.2769.0 &#8211; Veranda Double Trouble</title>
		<link>http://www.spyparty.com/2013/06/21/release-notes-v0-1-2769-0-veranda-double-trouble/</link>
		<comments>http://www.spyparty.com/2013/06/21/release-notes-v0-1-2769-0-veranda-double-trouble/#comments</comments>
		<pubDate>Fri, 21 Jun 2013 07:28:38 +0000</pubDate>
		<dc:creator><![CDATA[checker]]></dc:creator>
				<category><![CDATA[beta]]></category>
		<category><![CDATA[playtests]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[release notes]]></category>

		<guid isPermaLink="false">http://www.spyparty.com/?p=3424</guid>
		<description><![CDATA[New build! This is a relatively quick update since the last one, but with one fairly small change with potentially large consequences on the Veranda map: I&#8217;ll write more about this change later, but the impetus for doubling up the statues at the front of Veranda was a series of player-designed and requested game modes. First, [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>New build! This is a relatively quick update since <a title="Release notes v0.1.2758.0 – lobby stuff, mostly" href="http://www.spyparty.com/2013/06/17/release-notes-v0-1-2758-0-lobby-stuff-mostly/">the last one</a>, but with one fairly small change with potentially large consequences on the <em>Veranda</em> map:</p>

<a href='http://www.spyparty.com/2013/06/21/release-notes-v0-1-2769-0-veranda-double-trouble/veranda-old/'><img width="300" height="150" src="http://cdn.spyparty.com/wp-content/uploads/2013/06/veranda-old-300x150.png" class="attachment-medium" alt="Before..." /></a>
<a href='http://www.spyparty.com/2013/06/21/release-notes-v0-1-2769-0-veranda-double-trouble/veranda/'><img width="300" height="150" src="http://cdn.spyparty.com/wp-content/uploads/2013/06/veranda-300x150.png" class="attachment-medium" alt="after." /></a>

<p>I&#8217;ll write more about this change later, but the impetus for doubling up the statues at the front of <em>Veranda</em> was a series of player-designed and requested game modes. First, people have been playing &#8220;Known 3 Soft Tells&#8221; on <em>Ballroom</em> lately. The Spy always chooses <em>Contact Double Agent</em>,<sup><a href="http://www.spyparty.com/2013/06/21/release-notes-v0-1-2769-0-veranda-double-trouble/#footnote_0_3424" id="identifier_0_3424" class="footnote-link footnote-identifier-link" title="aka. &ldquo;Banana Bread&rdquo;">1</a></sup> <em>Inspect Statues</em>, and <em>Seduce Target</em>. The Sniper knows these are the three missions, which is usually a camping deathtrap, but you can&#8217;t really camp &#8220;soft tell missions&#8221;, and so the mode is fairly balanced, and has a great feel, still well within <a href="http://www.spyparty.com/faq/#What+are+your+aesthetic+goals+for+the+game%3F">the aesthetics of the game</a>, but with a distinctly different flavor. Of course, this led to more experimentation, and the question became whether it was possible on <em>Veranda</em>, <a title="A Deathtrap and a Walk in the Park" href="http://www.spyparty.com/2011/01/19/a-deathtrap-and-a-walk-in-the-park/">the map that&#8217;s really hard for the Sniper</a>. Well, after adding &#8220;Known 3&#8221; as an available game mode, it turns out &#8220;Known 3 Hard Tells&#8221; works on <em>Veranda</em>, since it&#8217;s such a big and busy map, but the soft tells version didn&#8217;t work because all the statues were far apart, so you had to inspect them one at a time, and NPCs rarely visit the statues three times on <em>Veranda</em>. The soft tells version was harder than the hard tells version on the easiest map for Spies?! Game design is confusing. Anyway, I doubled up the statues and we&#8217;ll see how that changes the balance of the map. 
It&#8217;s one of the least-played maps, mostly because it&#8217;s such a big time and attention investment, and so I think this will breathe some life into it, for a while, at least.</p>
<p>I also managed to stream the release of this build, which I&#8217;m going to try to make a regular thing. I went through the bug fixes, and then played the always-awesome <strong>kate</strong> for a bunch of games of &#8220;k3 soft&#8221; and &#8220;k3 hard&#8221; on the new <em>Veranda</em>, and had a blast and a pretty big audience considering it was 2am PDT. I&#8217;ve uploaded the video to the <a href="http://www.youtube.com/user/SpyPartyGame"><strong>SpyParty</strong> YouTube Channel</a>, and here it is (the games start at 18:23):</p>
<p><a href="http://www.spyparty.com/2013/06/21/release-notes-v0-1-2769-0-veranda-double-trouble/"><em>Click here to view the embedded video.</em></a></p>
<p>There are a few things I&#8217;ll fix next time I do this, and please post a comment below if you have additional things you think I should do:</p>
<ul class="tightlist">
<li><span style="line-height: 13px;">mic farther from mouth!</span></li>
<li>don&#8217;t bother going through the old build&#8217;s bugs, it&#8217;s not incredibly interesting, it takes forever, and OBS does a horrible epilepsy inducing flicker (at 14:40) during restart</li>
<li>get TeamSpeak and whatnot set up before I start streaming :/</li>
</ul>
<p>Here are the official release notes:</p>
<ul class="tightlist">
<li>double up front statues in veranda, yikes!</li>
<li>push modal message background farther back in the hopes of making text visible</li>
<li>try to fix more lobby list scroll bugs by clearing selected client</li>
<li>finally (?) fix the lobby state sort</li>
<li>don&#8217;t set last keyboard time on repeats, so now menus work with keys down for teamspeak</li>
<li>mask spy/normal keys for HL/LL if typing chat string</li>
<li>flirt % in seduction HUD</li>
<li>display number of required inspects in mission name</li>
<li>added --nosound command line for debugging directsound issues&#8230;better to use --mute if you just don&#8217;t want sounds</li>
<li>true up lobby details and /who so away is last (for easy /who -&gt; ctrl-c action)</li>
</ul>
<hr/><ol class="footnotes"><li id="footnote_0_3424" class="footnote">aka. &#8220;Banana Bread&#8221;</li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.spyparty.com/2013/06/21/release-notes-v0-1-2769-0-veranda-double-trouble/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Release notes v0.1.2758.0 &#8211; lobby stuff, mostly</title>
		<link>http://www.spyparty.com/2013/06/17/release-notes-v0-1-2758-0-lobby-stuff-mostly/</link>
		<comments>http://www.spyparty.com/2013/06/17/release-notes-v0-1-2758-0-lobby-stuff-mostly/#comments</comments>
		<pubDate>Mon, 17 Jun 2013 09:16:35 +0000</pubDate>
		<dc:creator><![CDATA[checker]]></dc:creator>
				<category><![CDATA[beta]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[release notes]]></category>

		<guid isPermaLink="false">http://www.spyparty.com/?p=3332</guid>
		<description><![CDATA[I just realized that since I finally opened the beta, I can now post the release notes for the new builds here publicly without getting a torrent of &#8220;Send me an invite!&#8221; blog and facebook comments and tweets! Yay, I am so happy to be out of the invites business! I&#8217;ll do a post with numbers from [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I just realized that since I finally <a title="SpyParty Beta Registration" href="http://www.spyparty.com/beta/">opened the beta</a>, I can now post the release notes for the new builds here publicly without getting a torrent of &#8220;Send me an invite!&#8221; blog and <a href="http://facebook.com/spyparty">facebook</a> comments and <a href="http://twitter.com/spyparty">tweets</a>! Yay, I am so happy to be out of the invites business!</p>
<p>I&#8217;ll do a post with numbers from the open beta launch sometime next week, but it&#8217;s going great, and I&#8217;m super happy and relieved. I was really worried about the community, but all the new players are blending in really well, and it all feels very healthy. Phew!</p>
<p>This build was almost exclusively bugs and minor features for supporting lobby rooms with lots of people in them. Most of these were discovered as the result of the <a title="Human Loadtest Tonight, Monday June 3, 10pm PDT!" href="http://www.spyparty.com/2013/06/03/human-loadtest-tonight-monday-june-3-10pm-pdt/">human load test</a> we did right before opening the beta. The main problem was that with lots of people in the lobby, you couldn&#8217;t really mouse-over somebody to reliably invite them into a game, so I struggled with how to solve this until somebody in the lobby said, &#8220;hey, there&#8217;s a problem because the player under my mouse keeps changing&#8221;, which pointed me to a reasonable and obvious solution: keep the player under your mouse fixed, even if they&#8217;re moving around in the sort order (due to other players joining or leaving, or due to them changing state). This is slightly weird, because it means you can get into a state where the lobby has scrolled such that there&#8217;s a blank space above or below, but it works pretty well.</p>
<div id="attachment_3353" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/06/SpyParty-v0.1.2758.0-20130617-02-19-58-0.png"><img class="size-large wp-image-3353" alt="Edge behavior for fixed-point mouse selection." src="http://cdn.spyparty.com/wp-content/uploads/2013/06/SpyParty-v0.1.2758.0-20130617-02-19-58-0-600x337.png" width="600" height="337" /></a><p class="wp-caption-text">Edge behavior for fixed-point mouse selection.</p></div>
<p>I created a new &#8220;<a href="http://www.spyparty.com/category/release-notes/">release notes</a>&#8221; tag, so you&#8217;ll be able to find all of the posts about new builds. I&#8217;m also going to start &#8220;streaming the release notes&#8221;, meaning I&#8217;m going to start up a live stream on the <a href="http://twitch.tv/spyparty"><strong>SpyParty</strong> twitch.tv channel</a>, and walk through the changes in the build on video. Then I can put that up on the <a href="http://www.youtube.com/user/SpyPartyGame"><strong>SpyParty</strong> YouTube channel</a> for later viewing. The idea here is to capture a bit of the coolness of a well-edited video of changes, <a href="http://blog.wolfire.com/">like the Overgrowth folks do</a>, but without spending the time editing the video. It&#8217;ll be rough, but it will keep me from triggering my perfectionism neurons and fiddling forever with the clip. So, hopefully it&#8217;ll be a good compromise. I wanted to do a stream for this build, but it was 2:30am when I finished it, so I couldn&#8217;t pull it off. Luckily, <a href="http://twitch.tv/drawnonward"><strong>drawnonward</strong></a> and <strong><a href="http://twitch.tv/canadianbac0nz">canadianbacon</a></strong> were streaming when I was working on it, and they convinced me to add <em>3 Known Missions</em> on <em>Veranda</em> for craziness. You can see <a href="http://www.twitch.tv/drawnonward/b/417667244?t=55m25s">that video here</a>, it&#8217;s pretty close to the kind of release notes streaming I&#8217;m going to do&#8230;casual, chatting in TeamSpeak, but previewing the features in real time.<sup><a href="http://www.spyparty.com/2013/06/17/release-notes-v0-1-2758-0-lobby-stuff-mostly/#footnote_0_3332" id="identifier_0_3332" class="footnote-link footnote-identifier-link" title="And, watching it myself from his point-of-view I just found a couple bugs with the update system!">1</a></sup></p>
<p>Here are all the changes for version 0.1.2758.0:</p>
<ul class="tightlist">
<li>k3 enabled on veranda&#8230;k3 hard tells anyone?</li>
<li>true up the triangles and the text in selected but not necessarily enabled missions</li>
<li>better /who /stats formatting</li>
<li>/help take command for single line help</li>
<li>shrink lobby a bit more to avoid red text and make /away visible</li>
<li>fix 00:00 event time bugs in pending mission state</li>
<li>stop accidentally filtering suspected da cast events</li>
<li>make lobby details a tiny bit more compact&#8230;use /stats username for full stats</li>
<li>practice mode on lobby escape menu</li>
<li>invites beep on whisper settings or chat settings</li>
<li>make &#8211;console &#8211;logstderr dump log to console window for debugging startup problems</li>
<li>better /who stats, idle, away on own line for copy</li>
<li>/statsroom /whoroom /wr</li>
<li>sort by column in lobby by clicking on the column titles</li>
<li>sort /who clients correctly, by current lobby sort</li>
<li>new fixed-point list ui for list stability, the player under your mouse will stay under your mouse, even as they change state, it&#8217;s weird, but will avoid misclicks</li>
<li>fix bug with not setting changed number on destroys</li>
<li>don&#8217;t output warning if can keep current completion string</li>
<li>fix whisper (lobby) bug</li>
<li>correct single-line chat paging with room masking&#8230;goes back in time in all rooms simultaneously, which is slightly weird</li>
<li>beeps respect room mask too</li>
<li>allow tab room changes in lobby settings and room chooser</li>
<li>make lobby say messages room based, so can be masked</li>
<li>only change chat usernames for completion if clients join/leave</li>
<li>clients stick around exiting if leave lobby, and leaving room for 2 seconds</li>
<li>sort lobby list by state with invites and exits in prev state, playing by timestamp</li>
</ul>
<p>Those are just the raw notes I post in the private beta forums,<sup><a href="http://www.spyparty.com/2013/06/17/release-notes-v0-1-2758-0-lobby-stuff-mostly/#footnote_1_3332" id="identifier_1_3332" class="footnote-link footnote-identifier-link" title="&hellip;which I might make publicly read-only some day, they&rsquo;re such a wealth of information about the game!">2</a></sup> so they might not mean much if you&#8217;re not a player. The way to solve that problem is obviously to <a title="SpyParty Beta Registration" href="http://www.spyparty.com/beta/">become a player</a>!</p>
<p>Up next, a couple super-minor fixes to this build, and then a brand new mission! I haven&#8217;t added a mission in a long time, so this will really change things. I&#8217;ll talk more about that in another post, but I&#8217;m hoping to have it stood up this week.</p>
<hr/><ol class="footnotes"><li id="footnote_0_3332" class="footnote">And, watching it myself from his point-of-view I just found a couple bugs with the update system!</li><li id="footnote_1_3332" class="footnote">&#8230;which I might make publicly read-only some day, they&#8217;re such a wealth of information about the game!</li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.spyparty.com/2013/06/17/release-notes-v0-1-2758-0-lobby-stuff-mostly/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Loadtesting for Open Beta, Part 4: Done optimizing the lobbyserver!</title>
		<link>http://www.spyparty.com/2013/05/21/loadtesting-for-open-beta-part-4-done-optimizing-the-lobbyserver/</link>
		<comments>http://www.spyparty.com/2013/05/21/loadtesting-for-open-beta-part-4-done-optimizing-the-lobbyserver/#comments</comments>
		<pubDate>Tue, 21 May 2013 06:13:11 +0000</pubDate>
		<dc:creator><![CDATA[checker]]></dc:creator>
				<category><![CDATA[beta]]></category>
		<category><![CDATA[indie games]]></category>
		<category><![CDATA[metrics]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.spyparty.com/?p=3210</guid>
		<description><![CDATA[Check out Loadtesting for Open Beta, Part 1, Part 2, and Part 3 to read the previous installments of this epic tale! It&#8217;s been a while since the last update in this series, sorry about that!  At the end of Part 3, I mentioned the SimCity launch giving me pause about my goal of testing [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><em>Check out <a title="Loadtesting for Open Beta, Part 1" href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/">Loadtesting for Open Beta, Part 1</a>, <a title="Loadtesting for Open Beta, Part 2" href="http://www.spyparty.com/2013/03/03/loadtesting-for-open-beta-part-2/">Part 2</a>, and <a title="Loadtesting for Open Beta, Part 3" href="http://www.spyparty.com/2013/03/18/loadtesting-for-open-beta-part-3/">Part 3</a> to read the previous installments of this epic tale!</em></p>
<p>It&#8217;s been a while since the last update in this series, sorry about that!  At the end of <a title="Loadtesting for Open Beta, Part 3" href="http://www.spyparty.com/2013/03/18/loadtesting-for-open-beta-part-3/">Part 3</a>, I mentioned the <a href="http://kotaku.com/tag/sim-city">SimCity launch</a> giving me pause about my goal of testing the <strong>SpyParty</strong> lobbyserver to 1000 simultaneous robots.  Well, I got scared enough after their launch that I increased my optimization target to 2000 simultaneous robots on my old and slow server, and then I also decided to bite the bullet and upgrade the server hardware after I hit 2000 to give myself some extra headroom.  I really don&#8217;t think I&#8217;m going to hit these numbers at Open Beta launch or even for a long time after that, but I&#8217;d rather err on the conservative side and have it purr along nicely.</p>
<p>Since I waited so long to post this Part 4, I can&#8217;t really give a play-by-play of all the optimizations I did as they happened, so I&#8217;m going to give the general arc I followed, and then talk about some of the interesting stops along the way.</p>
<a name="iprof%2C+atop%2C+oprofile%2C+et+al."></a><h3>iprof, atop, oprofile, et al.</h3>
<p>As I mentioned at the end of the last post, I&#8217;d fixed some of the huge and obvious things with the network bandwidth usage, so it was time to start profiling the CPU usage.  There are lots of different kinds of profilers, but the one I use the most is based on <a href="http://silverspaceship.com/src/iprof/">Sean Barrett&#8217;s iprof</a>.  I&#8217;ve modified it a fair bit over the years,<sup><a href="http://www.spyparty.com/2013/05/21/loadtesting-for-open-beta-part-4-done-optimizing-the-lobbyserver/#footnote_0_3210" id="identifier_0_3210" class="footnote-link footnote-identifier-link" title="I&rsquo;ll&nbsp; release my changes at some point.">1</a></sup> but the core of the system is still the same.  It&#8217;s a runtime profiler that requires instrumenting your code into blocks, it&#8217;s efficient enough that you can leave it on all the time as long as you don&#8217;t stick a &#8220;prof block&#8221; in an inner loop, and you can generally see where you&#8217;re spending your time hierarchically.  It can draw to the screen, but I also have it output to a string, and so on the lobbyserver I can have it output to the log after a spike, and also catch a signal I send and it&#8217;ll force a prof dump.  Here&#8217;s an example:</p>
<pre style="padding-left: 30px;"><span style="font-size: x-small;">2013/04/17-16:12:10: 85.156 ms/frame (fps: 11.74)  sort self - current frame</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10: zone                                                     self     hier    count</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:  ProcessMessages                                      59.2910  59.2910     1.00</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10: +Send                                                 17.8164  18.4493  1120.69</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10: +ClientsUpkeepAndCloseLoop                             2.3034   2.6989   793.15</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:  Log                                                   1.3559   1.3559    25.97</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:  unpack_bytes                                          0.7311   0.7311    26.56</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:  unpack                                                0.6674   0.6674  3551.71</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10: +ClientsPacketLoop                                     0.5492   3.7160   792.58</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10: +ClientsUpdated                                        0.4023  10.9843     0.56</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10: +SendQueuedClientRoomMessages                          0.2524   7.3086     1.00</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:  FindClientByID                                        0.2494   0.2494   267.05</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10: +JournalQueuedSave                                     0.2374   0.3882     1.43</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10: +Tick                                                  0.2112  25.6313     1.00</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:  iprof_update                                          0.2051   0.2051     1.00</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10: +JournalSavePrep                                       0.1560   0.5442     1.43</span></pre>
<p>As you can see, it&#8217;s pretty easy to read, and you can drill down on individual blocks and see who calls them and who they call:</p>
<pre style="padding-left: 30px;"><span style="font-size: x-small;">2013/04/17-16:12:10: 85.156 ms/frame (fps: 11.74)  sort graf - current frame</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10: zone                                                     self     hier    count</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:     LoginReply                                         0.0006   0.0006     0.01</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:     JOINING                                            0.0007   0.0008     0.03</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:    +TYPE_CLIENT_GAME_ID_REQUEST_PACKET                 0.0011   0.0014     0.44</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:    +NewWaitingForJoinClients                           0.0019   0.0019     0.01</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:    +TYPE_CLIENT_PLAY_PACKET                            0.0086   0.0087     0.12</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:    +TYPE_CLIENT_INVITE_PACKET                          0.0213   0.0225     2.37</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:    +TYPE_CLIENT_IN_MATCH_PACKET                        0.0280   0.0282     0.49</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:    +JournalQueuedSave                                  0.0563   0.0571     1.18</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:    +RoomsChanged                                       0.0852   0.0939    15.38</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:    +NewInLobbyClients                                  0.1155   0.1342    14.55</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:    +ClientsUpkeepAndCloseLoop                          0.1362   0.2093   150.55</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:    +TYPE_CLIENT_MESSAGE_PACKET                         0.3122   0.3170     8.87</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:    +SendQueuedClientRoomMessages                       6.7122   7.0013   501.99</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:    +ClientsUpdated                                    10.3361  10.5720   424.67</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10: -Send                                                 17.8164  18.4493  1120.69</span><br /><span style="font-size: x-small;">2013/04/17-16:12:10:     unpack                                             0.6329   0.6329  3362.06</span></pre>
<p>This is super useful.  The biggest downside to it is that it&#8217;s not thread-aware, but I&#8217;ve made it thread-safe via the brute force method of having it ignore all threads that aren&#8217;t the &#8220;main&#8221;.  My code is mostly single-threaded, but the threadedness increased a fair bit during these optimizations, so I hope to eventually modify iprof to be thread-aware without losing too much simplicity and performance.  However, until I make those modifications, any background thread activity will show up attributed to one of these main-thread blocks.  You can still get useful data, you just have to be aware of this.  For example, ProcessMessages in the profile above is hiding a WaitForMultipleObjectsEx call on Windows, or a call to select/epoll on POSIX, so it&#8217;s not actually taking that much active time on the main thread.</p>
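<p>To make the &#8220;prof block&#8221; idea concrete, here&#8217;s a toy version of an instrumented hierarchical profiler. It is not iprof and doesn&#8217;t use iprof&#8217;s API; the <code>Profiler</code> class and its <code>block</code> and <code>report</code> methods are names I made up for this sketch. It tracks the same three columns as the dumps above: self time, hierarchical time (self plus children), and call count. Like the brute force approach described above, it assumes everything runs on one thread.</p>

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Profiler:
    """Toy instrumented profiler (single-threaded, hypothetical API)."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.stack = []                       # open blocks: [name, start, child_time]
        self.self_time = defaultdict(float)   # time spent in the block itself
        self.hier_time = defaultdict(float)   # block plus everything it called
        self.count = defaultdict(int)

    @contextmanager
    def block(self, name):
        frame = [name, self.clock(), 0.0]
        self.stack.append(frame)
        try:
            yield
        finally:
            self.stack.pop()
            elapsed = self.clock() - frame[1]
            self.hier_time[name] += elapsed
            self.self_time[name] += elapsed - frame[2]
            self.count[name] += 1
            if self.stack:
                # Charge this block's time to its parent's child total.
                self.stack[-1][2] += elapsed

    def report(self):
        """One line per zone, sorted by self time, in milliseconds."""
        lines = ["%-24s %10s %10s %6s" % ("zone", "self", "hier", "count")]
        order = sorted(self.self_time, key=self.self_time.get, reverse=True)
        for name in order:
            lines.append("%-24s %10.4f %10.4f %6d" % (
                name,
                self.self_time[name] * 1000.0,
                self.hier_time[name] * 1000.0,
                self.count[name],
            ))
        return "\n".join(lines)


prof = Profiler()
with prof.block("Tick"):
    for _ in range(3):
        with prof.block("Send"):
            sum(range(10000))   # stand-in for real work
print(prof.report())
```

<p>Each block costs two clock reads and a few dictionary updates, which is why this style of profiler is cheap enough to leave on as long as the blocks stay out of inner loops.</p>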
<p>I also used <a href="http://oprofile.sourceforge.net/news/">oprofile</a>, which is a nice sampling profiler on Linux that can profile per-thread using just the debug information in an application, and <a href="http://atoptool.nl/">atop</a> for keeping track of things happening on the machine as a whole.</p>
<p>Here&#8217;s a list of the stuff I ended up optimizing:</p>
<ul>
<li>I was originally sending out the chat messages to all clients as they came in, but I started queuing them up and sending them all out at once to reduce send calls.  Of course, once you do this, you have to make sure you don&#8217;t overflow the network packet if you have queued a lot of messages that tick, so that makes the code more complicated and harder to modify, which is a tradeoff one often has to make while optimizing, and it&#8217;s why you want to put off most optimization until you need it&#8230;although you should have a rough plan for how you&#8217;ll optimize a piece of code in the future even if you write it the dumb way first.</li>
<li>I made more threads, including putting network sending and receiving on separate threads, making a separate thread for logging, and a thread for saving files to the disk.  There were already threads for talking to the database and Kerberos, for receiving network packets, and for checking for new client builds.  These are all relatively simple threads to add, because they&#8217;re all just throwing data into a queue on one thread and taking it out on another, although multithreading a program always makes it harder to understand.  I discovered a fair number of deadlock bugs in <a href="https://developers.google.com/talk/libjingle/">libjingle</a>, the library I&#8217;m using for <a href="http://en.wikipedia.org/wiki/NAT_traversal">NAT traversal</a> and some cross platform threading stuff, and I&#8217;ve fixed some of them.  I&#8217;ve veered far enough from the original libjingle code that I&#8217;m probably just going to have to put my version up as a fork, sadly.</li>
<li>I timesliced the login phase for the clients.  Previously, when a client would log in, I&#8217;d process a bunch of stuff immediately, including some authentication stuff which can be somewhat time consuming.  In a load test where hundreds of clients log in to the server at the same time, this would bog down, so I now process a maximum of 20ms worth of clients each tick.  This makes some clients wait a bit longer before they&#8217;re logged in, but doesn&#8217;t result in a positive feedback loop where there&#8217;s a really long tick, so a lot of packets will have arrived while it was happening, so the next tick is really long too, etc.</li>
<li>Like the player list packets, I also made the room list packets incremental, and able to span multiple network packets.  This way all the lists of players and rooms that the lobby sends to the clients can be differential and arbitrarily long, so there&#8217;s no more hard limit on the number of clients that can join the lobby.  I think there&#8217;s actually a bug in this code, but I&#8217;ve only ever seen it once, even after tens of thousands of robot sessions, so I just hope it shows up more at some point.</li>
<li>I switched the POSIX networking inner loop in libjingle from <a href="http://linux.die.net/man/2/select">select</a> to <a href="http://linux.die.net/man/4/epoll">epoll</a>.  This was not so much an optimization as it was simply to allow more than 1024 sockets to work at all.  epoll is also a lot faster, but I&#8217;m currently kinda using it in a dumb way, so I&#8217;m not benefiting from that speed boost much yet.</li>
<li>There were also a bunch of smaller traditional code optimizations, like using maps to cache lookups, using free lists to avoid some allocations, and whatnot.  Oh, and don&#8217;t forget to <a href="https://twitter.com/checker/status/335503826939424768">change the ulimit -n settings in limits.conf</a> on Linux, so your process can actually accept a lot of connections!</li>
</ul>
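<p>Two of the items above, batching chat messages without overflowing a packet and timeslicing logins, are easy to sketch. This is a rough illustration under assumptions of my own, not the lobbyserver&#8217;s code: <code>pack_messages</code>, <code>process_logins</code>, and the 1200 byte packet budget are all hypothetical, and only the 20ms login budget comes from the description above.</p>

```python
import time
from collections import deque

MAX_PACKET_BYTES = 1200   # assumed payload budget, not the real limit


def pack_messages(queued, max_bytes=MAX_PACKET_BYTES):
    """Group queued chat messages into as few packets as fit the budget.

    A single message larger than the budget still gets a packet of its
    own; a real server would have to reject or fragment it.
    """
    packets, current, size = [], [], 0
    for msg in queued:
        if current and size + len(msg) > max_bytes:
            packets.append(current)       # this packet is full, start another
            current, size = [], 0
        current.append(msg)
        size += len(msg)
    if current:
        packets.append(current)
    return packets


def process_logins(pending, handle, budget=0.020, clock=time.perf_counter):
    """Handle pending logins until this tick's time budget is spent.

    At least one login is handled per call, so the queue always drains
    even if a single login blows the whole budget.
    """
    deadline = clock() + budget
    done = 0
    while pending:
        handle(pending.popleft())
        done += 1
        if clock() >= deadline:
            break
    return done


chat = [b"a" * 500, b"b" * 500, b"c" * 500]
print([len(p) for p in pack_messages(chat)])    # messages per packet

waiting = deque("robot%d" % i for i in range(1000))
handled = process_logins(waiting, handle=lambda name: None)
print(handled, "logins this tick,", len(waiting), "still waiting")
```

<p>Whatever is left in the queue just waits for the next tick, which is what keeps one long tick from snowballing into the next.</p>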
<p>As I was doing these optimizations, I would run a loadtest with a bunch of robots and profile the lobby.  I was at 500 robots at the end of Part 3, and I slowly raised the ceiling as I improved the code over the weeks:  569 robots&#8230;741 robots&#8230;789 robots, 833, 923, 942, 990, 997, 1008, 1076, 1122, 1158, 1199, 1330, 1372, 1399, 1404, 1445, 1503, 1614, 1635, 1653, 1658, 1659&#8230;</p>
<p>When I hit 1659 it was late one night, and so I stopped for the day.  When I resumed work and did the next couple of optimizations, I figured I&#8217;d get it to 1800 or something.  I always launched 20% or so more robots than I was hoping to support in a given test to account for internet and <a href="http://aws.amazon.com/ec2/">EC2</a> variation, and for plain old bugs in the clients that would sometimes manifest themselves, so this time I must have launched 2500 robots, because when I looked up from the profiles running in ssh terminals and over to my <strong>SpyParty</strong> client logged into the test lobby, I saw this:</p>
<div id="attachment_3224" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/05/SpyParty-v0.1.2681.1-20130501-15-23-57-0.png"><img class="size-large wp-image-3224" title="SpyParty-v0.1.2681.1-20130501-15-23-57-0" alt="" src="http://cdn.spyparty.com/wp-content/uploads/2013/05/SpyParty-v0.1.2681.1-20130501-15-23-57-0-600x415.png" width="600" height="415" /></a><p class="wp-caption-text">This wasn&#8217;t supposed to happen yet.</p></div>
<p>Uh, I guess I was done optimizing?</p>
<p>I was actually kind of disappointed, to be honest.  I had all sorts of cool ideas for optimizations I was planning to do that I&#8217;d come up with while testing and profiling the code, and now, if I was going to follow my own plan and stop when I hit 2000 simultaneous robots, I would have to just take a bunch of notes for next time I optimized so I could pick up where I left off, and move on.  The good news is I&#8217;m pretty sure I can make the lobby almost twice as efficient if and when the time comes to do that!</p>
<a name="Room+at+the+Inn"></a><h3>Room at the Inn</h3>
<p>If you look closely at that screenshot, you&#8217;ll see the thumb on the scrollbar for the player list is pretty small.  That&#8217;s because all 2010 players are in a single room, which is not going to work very well for a lobby full of real people.  In fact, the only reason there were 2010 players in that room was because at the time I&#8217;d limited the room size to 2010 because I didn&#8217;t want to bother teaching the robots how to use rooms.  There were actually a few hundred more robots knocking on the door but they couldn&#8217;t get in.  But, now that I&#8217;d hit my 2k target, it was time to fix that.</p>
<p>I immediately realized I had a problem.  Currently, when you connect to the lobby, it sends you a list of rooms, and you have to pick one to log in.  But, what if the rooms are full?  Oops, you couldn&#8217;t log in.  So, as soon as I set the room size down to something more reasonable, like 100, then the first 100 robots got in and the rest just sat there failing to join.</p>
<p>It seemed like there were a number of solutions to this problem, including allowing players to create new rooms before logging in, but in the end I went with the simplest and most robust solution, which is to have the lobby create a new empty room if all the current rooms are full.  The initial room is always called <em>Headquarters</em>, so I named these new dynamic rooms <em>Headquarters 2</em> and onward.  Very creative, I know.  Somebody suggested using spy movie titles for these room names, but I figured that wouldn&#8217;t scale very well, ignoring the potential copyright issues.  If the lobby ever finds one of these dynamic rooms empty, it kills it, unless all the other rooms are full.  I also have the lobby automatically put you in a now-guaranteed-to-exist-non-full-room if you log in and try to join a full room, even if it wasn&#8217;t full when you clicked on it, so this eliminated a login race condition too, which is always a good sign.</p>
<p>This last bit also made it so I didn&#8217;t need to make the loadtesting robots know very much about rooms:  they always try to join Headquarters and if they don&#8217;t end up there, oh well.  As they join, they kind of spill over into the latest dynamic room until it fills up, and then they continue to the next, kind of like filling up an ice tray with water from one end.  I should probably make them test the actual room features by creating and changing rooms and whatnot, but the single giant 2010 player room was a way more intense loadtest than having 20 rooms with 100 players in each due to the chat broadcasting.</p>
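<p>Here&#8217;s a small sketch of those room rules: always keep a non-full room around, redirect joins away from full rooms, and kill empty dynamic rooms unless every other room is full. The <code>Lobby</code> class and its method names are hypothetical, and details like how room numbers get reused are my own guesses rather than what the real lobby does.</p>

```python
class Lobby:
    """Toy model of the dynamic room rules (hypothetical names)."""

    def __init__(self, limit=100):
        self.limit = limit
        self.rooms = {"Headquarters": set()}   # dicts keep creation order
        self.next_number = 2

    def has_space(self, name):
        return name in self.rooms and self.limit > len(self.rooms[name])

    def ensure_space(self):
        """Create a new empty room if every current room is full."""
        if not any(self.has_space(n) for n in self.rooms):
            name = "Headquarters %d" % self.next_number
            self.next_number += 1
            self.rooms[name] = set()

    def join(self, player, wanted="Headquarters"):
        """Join the wanted room, or the first room with space if it is full."""
        if not self.has_space(wanted):
            wanted = next(n for n in self.rooms if self.has_space(n))
        self.rooms[wanted].add(player)
        self.ensure_space()
        return wanted

    def leave(self, player, room):
        self.rooms[room].discard(player)
        self.prune()

    def prune(self):
        """Kill empty dynamic rooms, unless all the other rooms are full."""
        for name in list(self.rooms):
            if name == "Headquarters" or self.rooms[name]:
                continue
            others = [n for n in self.rooms if n != name]
            if any(self.has_space(n) for n in others):
                del self.rooms[name]


lobby = Lobby(limit=2)
# Robots only ever ask for Headquarters and spill over like an ice tray.
print([lobby.join("robot%d" % i) for i in range(5)])
```

<p>Because <code>ensure_space</code> runs after every join, the redirect in <code>join</code> always has somewhere to go, which is the property that removes the login race.</p>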
<p>I don&#8217;t know if 100 is the right limit for room populations.  100 would still be way too many people to have in a single reasonable conversation, but I didn&#8217;t want to put too low of a limit on the size before I have tested things with humans instead of just robots.</p>
<a name="The+Client"></a><h3>The Client</h3>
<p>There&#8217;s this annoying thing that happens when you&#8217;re testing computer code, and it&#8217;s that you encounter problems and bugs not only in the code you&#8217;re trying to test, but also in your test code.  This was no different.  I was constantly fixing various bugs in the robots that would keep them from all connecting correctly, and I even made sure some of the optimizations helped the client side so I could run more robots on a given EC2 server.  Plus, just making sure the robots keep trying to connect and log in was important, because if there was a timeout due to an initial burst, you want them to try again automatically after it dies down, rather than just sitting there not doing anything.</p>
<p>As I said in <a title="Loadtesting for Open Beta, Part 1" href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/">Part 1</a>, I started out running about 50 robots on each m1.small EC2 instance.  That worked okay with a low number of instances, but it didn&#8217;t scale, for some reason I&#8217;m still trying to figure out:  as I increased the number of instances, I had to lower the number of robots on each instance, eventually to around 20 per m1.small.  An AWS account starts out limited to 20 instances, so I made two instance limit requests to Amazon, first to 100 and then to 300.  It&#8217;s scary to have 300 instances running&#8230;even though m1.small instances are only 6¢ an hour each, that&#8217;s still $18 an hour when there are 300 of them running, and Amazon rounds up to the hour, so if you miss shutting them down by a minute you just lost a large pizza!  It looks like Google&#8217;s new <a href="https://cloud.google.com/pricing/compute-engine">Compute Engine</a> thing is about twice as expensive for their somewhat similar low end machine (ignoring performance differences), but charges in 1 minute increments after the first 10, which might be cheaper for this very transient use-case.</p>
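<p>The billing arithmetic is worth spelling out. In this sketch the 6¢ hourly price and the 300 instances come from the numbers above, while the 12¢ figure is just &#8220;about twice as expensive&#8221; turned into a number and the function names are made up, so treat the output as an illustration of the rounding rules rather than real pricing.</p>

```python
import math


def hourly_billed_cents(instances, minutes, cents_per_hour):
    """Cost when every started hour is billed in full."""
    return instances * math.ceil(minutes / 60) * cents_per_hour


def minute_billed_cents(instances, minutes, cents_per_hour, minimum=10):
    """Cost when billing is per minute, with a minimum charge in minutes."""
    billed = max(minutes, minimum)
    return instances * billed * cents_per_hour / 60


for minutes in (20, 60, 61):
    hourly = hourly_billed_cents(300, minutes, 6)       # 6 cents per hour, rounded up
    by_minute = minute_billed_cents(300, minutes, 12)   # assumed 12 cents per hour
    print("%3d min: $%.2f hourly vs $%.2f per-minute"
          % (minutes, hourly / 100, by_minute / 100))
```

<p>With these assumed prices the per-minute scheme wins for any test under half an hour, and missing the hour boundary by a minute doubles the hourly bill from $18 to $36.</p>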
<p>I seem to remember reading somewhere that Amazon allocates instances for the same account to the same physical machine if possible, which might explain this scaling problem, since it means I was probably maxing out a given piece of server hardware with too many instances bursting at the same time.  It&#8217;s hard to tell if this is the case, and I need to do more testing before saying for sure.  A <a href="http://www.spyparty.com/2013/03/03/loadtesting-for-open-beta-part-2/comment-page-1/#comment-67637">commenter</a> said there might be a packets-per-second limitation in EC2, as well, but I haven&#8217;t verified that.  Once I&#8217;ve tried a few different things, I&#8217;ll do a long technical post on <a href="http://chrishecker.com">chrishecker.com</a> about EC2, <a href="https://www.linode.com/">linode</a>, and my dedicated host machine, comparing the different results I got.</p>
<p>Finally, I had to do some optimization on the <strong>SpyParty</strong> game client when the numbers started getting high.  I went a little nuts with the chat system early on and it has completion on all commands, room names, and player names, but the code that builds the completion tree was calling the memory allocator 35k times per update when the numbers of players got high, so I had to remove some of the stupid in that code as well.</p>
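<p>One common way to cut that kind of churn is to rebuild the completion data only when the set of names actually changes, instead of on every update. The sketch below shows that idea with a sorted list and a dirty flag; it&#8217;s an assumption about the general shape of such a fix, not the client&#8217;s actual completion tree, and <code>NameCompleter</code> is a made-up name.</p>

```python
import bisect


class NameCompleter:
    """Prefix completion that only rebuilds its index on join/leave."""

    def __init__(self):
        self.members = set()
        self.sorted_names = []
        self.dirty = False
        self.rebuilds = 0      # counts rebuilds, for the demo below

    def join(self, name):
        self.members.add(name)
        self.dirty = True

    def leave(self, name):
        self.members.discard(name)
        self.dirty = True

    def complete(self, prefix):
        """Return all names starting with prefix, in sorted order."""
        if self.dirty:
            self.sorted_names = sorted(self.members)
            self.dirty = False
            self.rebuilds += 1
        start = bisect.bisect_left(self.sorted_names, prefix)
        matches = []
        for name in self.sorted_names[start:]:
            if not name.startswith(prefix):
                break
            matches.append(name)
        return matches


names = NameCompleter()
for player in ("carol", "chuck", "dave"):
    names.join(player)
print(names.complete("c"))
print(names.complete("ch"))
print(names.rebuilds, "rebuild for two lookups")
```

<p>Typing in the chat box then costs a binary search and a short scan per keystroke, and the expensive rebuild only happens when somebody joins or leaves.</p>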
<a name="The+New+Server"></a><h3>The New Server</h3>
<p>With all that done, and 2010 robots running on the old server, I haggled with my hosting provider and started renting a newer and much faster server.  I use <a href="http://www.softlayer.com/">SoftLayer</a> for dedicated hosting, and have for years.<sup><a href="http://www.spyparty.com/2013/05/21/loadtesting-for-open-beta-part-4-done-optimizing-the-lobbyserver/#footnote_1_3210" id="identifier_1_3210" class="footnote-link footnote-identifier-link" title="Well, they were servermatrix when I started, and then The Planet, and now SoftLayer.">2</a></sup> My old server was a Pentium 4 with a single hyperthreaded core, 1GB ram, and a 100Mbps uplink, and the new server is a Xeon 3460 with four hyperthreaded cores, 4GB ram, and 1Gbps uplink, so it&#8217;s slightly more expensive but a lot faster.  That said, everybody seems to be using <a href="http://en.wikipedia.org/wiki/Virtual_private_server">VPS</a> hosts these days.  I talked to some other indie game developers, but I didn&#8217;t have time to do a full evaluation of the tradeoffs, so I went with the devil I knew, so to speak.  It seems like VPS is going to be a bit slower but also a bit cheaper, but the big advantage of VPS to me is that you can move the virtual machine image to faster hardware and have it up and running again in minutes.  That&#8217;s a pretty great scaling sweet spot between having a single physical server and praying it doesn&#8217;t melt, and a scalable system that elastically uses cloud computing The Right Way™, but it&#8217;s also hundreds of times easier to get a VPS image working and then move it to a faster machine than it is to scale elastically.  So, I dunno, it&#8217;s definitely something worth looking into more during the year as I see how things are scaling.</p>
<p>The new server ate the robots for lunch:</p>
<div id="attachment_3230" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/05/SpyParty-v0.1.2703.1-20130518-20-57-40-0.png"><img class="size-large wp-image-3230" title="SpyParty-v0.1.2703.1-20130518-20-57-40-0" alt="" src="http://cdn.spyparty.com/wp-content/uploads/2013/05/SpyParty-v0.1.2703.1-20130518-20-57-40-0-600x421.png" width="600" height="421" /></a><p class="wp-caption-text">The new server works pretty well.</p></div>
<p>For reference, 4850 simultaneous players is pretty far up the <a href="http://store.steampowered.com/stats/">top 100 Steam games by player count</a>, so I don&#8217;t think I have to worry about those numbers for a while.  Here&#8217;s atop&#8217;s view of things:</p>
<div id="attachment_3237" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/05/2013-05-18-20_47_06-atop.png"><img class="size-large wp-image-3237" title="2013-05-18 20_47_06-atop" alt="" src="http://cdn.spyparty.com/wp-content/uploads/2013/05/2013-05-18-20_47_06-atop-600x344.png" width="600" height="344" /></a><p class="wp-caption-text">Well within parameters.</p></div>
<a name="What%26%238217%3Bs+Next%3F"></a><h3>What&#8217;s Next?</h3>
<p>So, that&#8217;s it for the lobbyserver loadtesting.  Now I need to move the website and registration system over to the new server, test them a bit, and start inviting everybody in in big batches.  Soon I&#8217;ll send out email to the beta testers to set up some scheduled human loadtests as well.  The robots will be jealous, left out in the cold, looking in at all the humans actually playing the game. </p>
<p>Open Beta is fast approaching.</p>
<hr/><ol class="footnotes"><li id="footnote_0_3210" class="footnote">I&#8217;ll  release my changes at some point.</li><li id="footnote_1_3210" class="footnote">Well, they were servermatrix when I started, and then The Planet, and now SoftLayer.</li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.spyparty.com/2013/05/21/loadtesting-for-open-beta-part-4-done-optimizing-the-lobbyserver/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Loadtesting for Open Beta, Part 3</title>
		<link>http://www.spyparty.com/2013/03/18/loadtesting-for-open-beta-part-3/</link>
		<comments>http://www.spyparty.com/2013/03/18/loadtesting-for-open-beta-part-3/#comments</comments>
		<pubDate>Mon, 18 Mar 2013 06:37:06 +0000</pubDate>
		<dc:creator><![CDATA[checker]]></dc:creator>
				<category><![CDATA[beta]]></category>
		<category><![CDATA[indie games]]></category>
		<category><![CDATA[metrics]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.spyparty.com/?p=3139</guid>
		<description><![CDATA[Read Loadtesting for Open Beta, Part 1 and Part 2 to catch up on the spine-tingling story so far! When we last left our hero, our differential state update change was a resounding success and reduced the network bandwidth utilization from 98% to 3%, and it looked like we could move on to optimizing the [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><em>Read <a title="Loadtesting for Open Beta, Part 1" href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/">Loadtesting for Open Beta, Part 1</a> and <a title="Loadtesting for Open Beta, Part 2" href="http://www.spyparty.com/2013/03/03/loadtesting-for-open-beta-part-2/">Part 2</a> to catch up on the spine-tingling story so far!</em></p>
<p><a title="Loadtesting for Open Beta, Part 2" href="http://www.spyparty.com/2013/03/03/loadtesting-for-open-beta-part-2/">When we last left our hero</a>, our differential state update change was a resounding success and reduced the network bandwidth utilization from 98% to 3%, and it looked like we could move on to optimizing the lobbyserver code itself to get to our goal of 1000 simultaneous loadtesting robots, until we noticed <a title="Loadtesting for Open Beta, Part 2" href="http://www.spyparty.com/2013/03/03/loadtesting-for-open-beta-part-2/#Up+Next,+The+Case+of+the+Missing+Robots">some of our robots were missing</a>!  This led me on a wild and wooly chase through the code, which I will recount for you now&#8230;</p>
<a name="Where%26%238217%3Bd+the+robots+go%3F"></a><h3>Where&#8217;d the robots go?</h3>
<p>The first order of business was to figure out why some robots were dying when they <em>weren&#8217;t</em> supposed to, and some weren&#8217;t dying when they <em>were</em> supposed to.  Robots: they never do what you tell them.</p>
<p>If you look at this graph of the number of running robots from last time, you can see that right off the bat, a bunch of them die on all the machines, and then they keep dying for about 30 seconds, and then it stabilizes.  Each of these machines should have 50 robots running solidly during the test period.</p>
<div id="attachment_3116" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/03/ec2-loadtest-client-counts.png"><img class="size-large wp-image-3116" title="ec2-loadtest-client-counts" src="http://cdn.spyparty.com/wp-content/uploads/2013/03/ec2-loadtest-client-counts-600x344.png" alt="" width="600" height="344" /></a><p class="wp-caption-text">The number of loadtest robots running on each EC2 instance.</p></div>
<p>Then, to make matters worse, some of them don&#8217;t die when they&#8217;re supposed to at the end of the test.  In the graph above, they only all finally die when I kill them manually from a separate script at 03:16:30.  This points towards two different problems I&#8217;m going to have to debug on the test machines&#8230;that only manifest themselves intermittently&#8230;with gdb&#8230;in the cloud. Good times!</p>
<p>Okay, first things first, let&#8217;s see if the robots will tell me where they&#8217;re going.  The lobbyclient robots can output verbose log files, but I had them turned off because I was worried about bogging down the client machines.  It turns out this isn&#8217;t much of a problem, as I&#8217;ll discuss below, so I turned on logging and re-ran a test.  Then I ssh&#8217;d into one of the servers, and looked at the log files.  Well, before I looked at the files themselves, I just did an <span style="font-family: courier new,courier;">ls</span> of the directory:</p>
<pre style="padding-left: 30px;">-rw-r--r-- 1 root root 258577 Mar  5 03:02 out59<br />-rw-r--r-- 1 root root 332320 Mar  5 03:02 out60<br />-rw-r--r-- 1 root root 177743 Mar  5 03:02 out61<br />-rw-r--r-- 1 root root 181639 Mar  5 03:02 out62<br />-rw-r--r-- 1 root root 264535 Mar  5 03:02 out63<br />-rw-r--r-- 1 root root 333515 Mar  5 03:02 out64<br />-rw-r--r-- 1 root root 282875 Mar  5 03:02 out65<br />-rw-r--r-- 1 root root 271040 Mar  5 03:02 out66<br />-rw-r--r-- 1 root root    264 Mar  5 03:01 out67<br />-rw-r--r-- 1 root root    264 Mar  5 03:01 out68<br />-rw-r--r-- 1 root root 284838 Mar  5 03:02 out69<br />-rw-r--r-- 1 root root 332967 Mar  5 03:02 out70<br />-rw-r--r-- 1 root root 303352 Mar  5 03:02 out71<br />-rw-r--r-- 1 root root 310596 Mar  5 03:02 out72<br />-rw-r--r-- 1 root root 194669 Mar  5 03:02 out73<br />-rw-r--r-- 1 root root 313193 Mar  5 03:02 out74<br />-rw-r--r-- 1 root root 238246 Mar  5 03:02 out75<br />-rw-r--r-- 1 root root 264190 Mar  5 03:02 out76<br />-rw-r--r-- 1 root root 198096 Mar  5 03:02 out77<br />-rw-r--r-- 1 root root 233980 Mar  5 03:02 out78<br />-rw-r--r-- 1 root root    264 Mar  5 03:01 out79<br />-rw-r--r-- 1 root root    264 Mar  5 03:01 out80<br />-rw-r--r-- 1 root root 301029 Mar  5 03:02 out81<br />-rw-r--r-- 1 root root 299694 Mar  5 03:02 out82<br />-rw-r--r-- 1 root root    264 Mar  5 03:01 out83<br />-rw-r--r-- 1 root root 351158 Mar  5 03:02 out84<br />-rw-r--r-- 1 root root 188071 Mar  5 03:02 out85<br />-rw-r--r-- 1 root root 242228 Mar  5 03:02 out86</pre>
<p>Well, there&#8217;s a clue, at least for the early-dyers.  The contents of those 264 byte log files look like this:</p>
<pre style="padding-left: 30px;">Lobby Standalone Client: 1000.0.0.5<br />init genrand w/0, first val is 1178568022<br />Running for 61 seconds.<br />LobbyClient started, v1000.0.0.5 / v12<br />LobbyClient UDP bound to port 32921<br />lobbyclient: sendto_kdc.c:617: cm_get_ssflags: Assertion `i &lt; selstate-&gt;nfds' failed.</pre>
<p>A-ha!  sendto_kdc.c is a file in the <a href="http://web.mit.edu/Kerberos/">Kerberos</a> libraries, which I use for login authentication.</p>
<p>I really love Kerberos, <a href="http://web.mit.edu/kerberos/dialogue.html">the architecture just feels right to me</a>, the API is simple, clean, and flexible, it&#8217;s cross-platform and open source, so I&#8217;ve been able to contribute features and bug fixes as I&#8217;ve used it and trace into the code when I was confused about something, and the folks at MIT that develop it are smart, knowledgeable, open-minded, and <a href="http://www.google.com/search?hl=en&amp;q=%2B&quot;Chris Hecker&quot; site%3Amail-archive.com kerberos">don&#8217;t mind some crazy indie game developer asking dumb questions</a> about the best way to do things that were pretty clearly not part of the original university and enterprise use-cases.  Most importantly, it&#8217;s battle-tested; it&#8217;s used by tons of different applications, and it&#8217;s the foundation of the modern Windows domain and Xbox authentication systems, so I know it works.  <strong>The last thing you ever want to do is roll your own authentication system.</strong></p>
<p>So, that assert&#8217;s the first place to look for the early-dying robots.</p>
<p>Next, I looked into the never-dying robots.  I logged into one of the machines that still had zombie robots<sup><a href="http://www.spyparty.com/2013/03/18/loadtesting-for-open-beta-part-3/#footnote_0_3139" id="identifier_0_3139" class="footnote-link footnote-identifier-link" title="ZOMBIE ROBOTS!!!">1</a></sup> running, ran <span style="font-family: Courier New,Courier,mono;">pidof lobbyclient</span> to figure out the process ID of one of them, and attached gdb to the robot.  A quick <span style="font-family: Courier New,Courier,mono;">thread apply all backtrace full</span> and I found the thread that was hanging while the main thread was trying to join them and exit cleanly.  It looked like the bad code was in a call to <a href="http://linux.die.net/man/2/poll">poll</a>, and it just so happened it was in sendto_kdc.c as well! I realized I was going to need some debug symbols, but this was easy since I build the Kerberos libraries myself,<sup><a href="http://www.spyparty.com/2013/03/18/loadtesting-for-open-beta-part-3/#footnote_1_3139" id="identifier_1_3139" class="footnote-link footnote-identifier-link" title="I have some local patches I haven&rsquo;t cleaned up enough to contribute yet">2</a></sup> so a quick scp of the debuginfo rpm and reattaching gdb and I could dig down a bit deeper.</p>
<p>The Kerberos libraries are built with optimizations on, which always makes debugging interesting, but I think it builds programming character to debug optimized code, so I don&#8217;t mind.<sup><a href="http://www.spyparty.com/2013/03/18/loadtesting-for-open-beta-part-3/#footnote_2_3139" id="identifier_2_3139" class="footnote-link footnote-identifier-link" title="gdb is not the best for assembly language debugging, but I did learn about &ldquo;layout asm&rdquo;, which helps a bit.">3</a></sup>  Here&#8217;s the code in question:</p>
<pre>    if (in-&gt;end_time.tv_sec == 0)<br />        timeout = -1;<br />    else {<br />        e = k5_getcurtime(&amp;now);<br />        if (e)<br />            return e;<br />        timeout = (in-&gt;end_time.tv_sec - now.tv_sec) * 1000 +<br />            (in-&gt;end_time.tv_usec - now.tv_usec) / 1000;<br />    }<br />    /* We don't need a separate copy of the selstate for poll, but use one<br />     * anyone for consistency with the select wrapper. */<br />    *out = *in;<br />    *sret = poll(out-&gt;fds, out-&gt;nfds, timeout);</pre>
<p>Well, these loadtesting machines are under some load themselves so they can be a bit sluggish, and there&#8217;s a problem with this code in that scenario if the call to k5_getcurtime() happens later than the in-&gt;end_time passed in by the caller.  As it says on the <a href="http://linux.die.net/man/2/poll">poll manpage</a>, <em>&#8220;Specifying a negative value in timeout means an infinite timeout.&#8221;</em>  Digging around on the stack verified the timeout was negative.</p>
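<p>Here&#8217;s a minimal sketch of the kind of guard that avoids the accidental infinite wait.  It is not the actual Kerberos commit; the helper name <span style="font-family: Courier New,Courier,mono;">deadline_to_poll_timeout</span> is hypothetical, and it simply assumes a deadline that has already passed should make poll return immediately:</p>

```c
#include <assert.h>
#include <sys/time.h>

/* Hypothetical helper, not the upstream Kerberos fix: convert an absolute
 * deadline into a poll() timeout in milliseconds.  A zero tv_sec means
 * "no deadline" (-1, wait forever), as in the snippet above.  A deadline
 * that has already passed is clamped to 0 so poll() returns right away
 * instead of treating the negative value as an infinite timeout. */
int deadline_to_poll_timeout(const struct timeval *end_time,
                             const struct timeval *now)
{
    long ms;

    assert(end_time != 0 && now != 0);

    if (end_time->tv_sec == 0)
        return -1;                      /* no deadline: infinite timeout */

    ms = (long)(end_time->tv_sec - now->tv_sec) * 1000 +
         (long)(end_time->tv_usec - now->tv_usec) / 1000;

    return ms < 0 ? 0 : (int)ms;        /* never hand poll() a negative value */
}
```

<p>The result would be passed as the third argument to poll in place of the raw subtraction, so a sluggish machine that calls k5_getcurtime() after the deadline gets a timeout of 0 rather than a hang.</p>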
<p>Okay, so now we have a pretty good clue for each of the problems.  The second problem with the poll timeout seemed easy to fix, but the first one was pretty mysterious and might take some real debugging.  I decided to <a href="http://mailman.mit.edu/pipermail/krbdev/2013-March/011451.html">check with the krbdev mailing list</a> to see if they had any ideas while I looked into the problems more deeply.  While doing so, I looked at the main Kerberos source repository and <a href="http://mailman.mit.edu/pipermail/krbdev/2013-March/011452.html">found a commit for the timeout problem</a>, so it had already been fixed in a later version.  I was hoping maybe this was true of the assert as well.  True to form, the most excellent Greg Hudson <a href="http://mailman.mit.edu/pipermail/krbdev/2013-March/011453.html">replied with three more commits</a> he thought might help.  Meanwhile, I hacked the code to loop on a call to sleep() instead of asserting to convert the early-dyers into never-dying zombies so I could attach the debugger, since that&#8217;d worked so well on the second problem.</p>
<p>Sadly, although the negative-timeout-check fixed the original zombies, none of the fixes prevented the assert problem.  It wasn&#8217;t asserting anymore because the asserters were now looping, so now I had more zombies to deal with.</p>
<div id="attachment_3151" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-17-16_45_07-50-not-working.png"><img class="size-large wp-image-3151" title="2013-03-17 16_45_07-50-not-working" src="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-17-16_45_07-50-not-working-600x302.png" alt="" width="600" height="302" /></a><p class="wp-caption-text">Lots of zombie robots!</p></div>
<p>Time to get down and dirty and debug it for real.</p>
<p>As an aside, it&#8217;s a weird feeling when you&#8217;re debugging something on an <a href="http://aws.amazon.com/ec2/">EC2 instance</a>, since you&#8217;re paying for it hourly.  I felt a definite pressure to hurry up and debug faster&#8230;oh no, there went another $0.06 * 5 instances!</p>
<a name="Too+deep+we+delved+there%2C+and+woke+the+nameless+fear%21"></a><h3>Too deep we delved there, and woke the nameless fear!</h3>
<p>Like I said, debugging optimized code builds character, and I built a lot of character with this bug.  The assert was in a function that was inlined by the optimizer, which was in a function that was inlined by the optimizer, which was in a loop, which looked like it had been unrolled.  It was slow going, with lots of restarts and stuffing values into memory and registers so the code would execute again.  At one point, I thought I&#8217;d <a href="http://mailman.mit.edu/pipermail/krbdev/2013-March/011466.html">narrowed it down to a compiler bug in gcc</a>, because it seemed like a variable wasn&#8217;t getting reloaded from the stack correctly sometimes, but it was really hard to tell with all the inlining.  Even thinking it was a compiler bug was pretty silly and that thought always violates <a title="One Bug’s Story, or, Assume it’s a bug!" href="http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/">Assume it&#8217;s a Bug</a>, so I should have known better, but it happens. </p>
<p>Finally, a combination of stepping through the code, and looking at the code, and modifying the code revealed the problem. Here&#8217;s <a href="https://github.com/krb5/krb5/blob/krb5-1.9.2-final/src/lib/krb5/os/sendto_kdc.c#L1255">the source file at the version I was debugging</a>, linked to the area of the code where the bug lurked.  If you search for &#8220;host+1&#8221;, you will see that it occurs twice, once inside the loop, and once outside the loop.  This is what threw me when I was debugging&#8230;initially I didn&#8217;t notice there were two separate calls to service_fds(), so in the debugger I thought it was looping again but loading weird values.  I can only assume the second call almost never occurred in the wild for anybody but me after the inner loop on hosts completed, because in that case host+1 is n_conns+1, which is out-of-bounds for the connections.<sup><a href="http://www.spyparty.com/2013/03/18/loadtesting-for-open-beta-part-3/#footnote_3_3139" id="identifier_3_3139" class="footnote-link footnote-identifier-link" title="It never crashed because conns has a preallocated number of connections that was always bigger than n_conns+1">4</a></sup>  This bug was easy for me to fix locally, and it looks like it was (inadvertently?) fixed in <a href="https://github.com/krb5/krb5/commit/8b9d249e40601047e69c92d7acb578fd0bbafc00">this commit</a> in the main Kerberos code.</p>
<p>Thank goodness for open source code, where you can modify it and debug it when you run into troubles!</p>
<div id="attachment_3150" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-17-16_36_15-50-working.png"><img class="size-large wp-image-3150" title="2013-03-17 16_36_15-50-working" src="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-17-16_36_15-50-working-600x302.png" alt="" width="600" height="302" /></a><p class="wp-caption-text">No more zombies!</p></div>
<a name="Moar+Robots%21"></a><h3>Moar Robots!</h3>
<p>Now that I (thought I) was done debugging the robots, and I still had 5 EC2 instances running, I decided to see how well the instances did with 100 robots on each.  My original tests indicated I could only run about 50 per <a href="http://aws.amazon.com/ec2/instance-types/">m1.small</a> instance, but the client also got a lot more efficient with the differential state update change described last time, and it turns out 100 per instance is no problem, as you can see here:</p>
<div id="attachment_3147" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-16-02_53_11-100-robots.png"><img class="size-large wp-image-3147" title="2013-03-16 02_53_11-100-robots" src="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-16-02_53_11-100-robots-600x379.png" alt="" width="600" height="379" /></a><p class="wp-caption-text">Top on an m1.small instance running 100 robots at only 20% CPU.</p></div>
<p>The lobby was a little more grim with 501 clients:</p>
<div id="attachment_3153" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/03/SpyParty-v0.1.2602.1-20130316-02-53-39-0.png"><img class="size-large wp-image-3153" title="SpyParty-v0.1.2602.1-20130316-02-53-39-0" src="http://cdn.spyparty.com/wp-content/uploads/2013/03/SpyParty-v0.1.2602.1-20130316-02-53-39-0-600x415.png" alt="" width="600" height="415" /></a><p class="wp-caption-text">500 robots and me.</p></div>
<p>Here&#8217;s how the CPU looks with all these robots in the lobby, chatting at each other:</p>
<div id="attachment_3148" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-16-02_53_25-100-in-lobby.png"><img class="size-large wp-image-3148" title="2013-03-16 02_53_25-100-in-lobby" src="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-16-02_53_25-100-in-lobby-600x322.png" alt="" width="600" height="322" /></a><p class="wp-caption-text">atop in CPU mode with 500 robots in the lobby jabbering.</p></div>
<p>There are two cores in this machine, which is why the lobbyserver is at 115% CPU.  It&#8217;s mostly single-threaded for simplicity, but it uses threads for servicing network connections.</p>
<p>However, once the robots start playing each other, the CPU usage drops a bunch:</p>
<div id="attachment_3149" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-16-02_53_49-100-playing.png"><img class="size-large wp-image-3149" title="2013-03-16 02_53_49-100-playing" src="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-16-02_53_49-100-playing-600x322.png" alt="" width="600" height="322" /></a><p class="wp-caption-text">Stop talking, start playing!</p></div>
<p>This is pretty good news.  I think it means the chat system needs some work, because when everybody&#8217;s in the lobby all the chats go to all the players, but when people are in a match, chats only go between those two players, and they don&#8217;t get any of the lobby chats.  We&#8217;ll find out soon, as I describe below.  Memory looks pretty good with 501 clients, staying at about 256kb per client:</p>
<pre style="padding-left: 30px;">2013/03/16-04:53:11: MEMORY_POSIX 501/993/492: resident 25540/25540, virtual 198000/198000<br />2013/03/16-04:53:11: MEMORY_NEW 501/993/492: bytes 132098963, news 69166, deletes 55478</pre>
<p>One last atop screenshot&#8230;this one is while the robots are starting up and connecting, but before they&#8217;re in the lobby:</p>
<div id="attachment_3146" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-16-02_52_57-startup.png"><img class="size-large wp-image-3146" title="2013-03-16 02_52_57-startup" src="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-16-02_52_57-startup-600x322.png" alt="" width="600" height="322" /></a><p class="wp-caption-text">Loadtest startup performance.</p></div>
<p>This one shows Kerberos and <a href="http://www.openldap.org/">OpenLDAP</a> taking a fair amount of time at the start of a new loadtest.  I use LDAP as the database backend for Kerberos, among other things, and when all of these robots are trying to get login tickets at the same time, it bogs down a bit.  I&#8217;m not too worried about this profile, since this scenario of 500 people all needing tickets at the same time is going to be rare (the tickets last a while, so this doesn&#8217;t happen every time), and there are well-known ways of scaling Kerberos and OpenLDAP if I need them.</p>
<p>Finally, here&#8217;s a shot of the 100 robots per instance:</p>
<div id="attachment_3152" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-17-16_47_28-100-working-plus-deadlock.png"><img class="size-large wp-image-3152" title="2013-03-17 16_47_28-100-working-plus-deadlock" src="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-17-16_47_28-100-working-plus-deadlock-600x302.png" alt="" width="600" height="302" /></a><p class="wp-caption-text">Wait a second&#8230;</p></div>
<p>Oh no!  Who the hell is that single zombie robot at the end on instance 4!?!  Sigh.  I find that machine, log in, attach the debugger, and check it out.  It looks like I have a pretty rare deadlock between two threads during shutdown.  I&#8217;m just going to ignore it for now and deal with it later.  All the bugs above were preventing robots from doing a good job at loadtesting, while this one is just preventing 1 out of 500 from shutting down completely&#8230;it can wait.  Here&#8217;s a shot of this guy, still in the lobby, mocking me:</p>
<div id="attachment_3154" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/03/SpyParty-v0.1.2602.1-20130316-02-55-54-0.png"><img class="size-large wp-image-3154" title="SpyParty-v0.1.2602.1-20130316-02-55-54-0" src="http://cdn.spyparty.com/wp-content/uploads/2013/03/SpyParty-v0.1.2602.1-20130316-02-55-54-0-600x415.png" alt="" width="600" height="415" /></a><p class="wp-caption-text">At least I have one more Sniper win on this debug server than this troll!</p></div>
<p>There&#8217;s actually another bug I found in the new differential state update code while I was testing this, where the server will send a duplicate client sometimes, but I had a comment in the code that I thought it might be possible, and now I know it is.  It turns out when you have 500 clients pounding on a server, you find bugs.</p>
<a name="Coming+Up+Next+Time"></a><h3>Coming Up Next Time</h3>
<p>Okay, so now we&#8217;ve got things where I can easily run a predictable number of loadtesting robots against the debug lobbyserver, and I&#8217;ve got some high level profiles telling me that I&#8217;m now CPU bound inside the server itself.  That points to a clear next step:  profile the code.  I use an old hacked up version of <a href="http://silverspaceship.com/src/iprof/">Sean Barrett&#8217;s iprof</a> for all my client runtime profiling, so my next task is to integrate that into the server code, and get it running on Linux.  That shouldn&#8217;t be too hard, and then I&#8217;ll be able to tell what&#8217;s actually taking the time<sup><a href="http://www.spyparty.com/2013/03/18/loadtesting-for-open-beta-part-3/#footnote_4_3139" id="identifier_4_3139" class="footnote-link footnote-identifier-link" title="This is only partially true, because iprof is single-threaded&hellip;I really wish there was a good cross-platform light-weight way to get per-thread timings.">5</a></sup> when a lot of clients are in the lobby.</p>
<p>My prediction, based on the above, is that the chat message handling is going to be the main culprit.  If so, it&#8217;ll be easy to queue up the chats and send them out in bunches, but I need to be careful here, because the robots chat a lot more than real humans would right now, so I don&#8217;t want to spend too much time optimizing this.  I think I&#8217;ll keep the robots as they are for the initial profiles, and then dial back their chattiness to more realistic levels after I&#8217;ve plucked the low-hanging chat fruit.  I also need to teach the robots how to use lobby rooms for a more realistic test.</p>
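<p>As a rough illustration of what &#8220;queue up the chats and send them out in bunches&#8221; could look like, here&#8217;s a small sketch.  Everything in it is an assumption made for the example (the <span style="font-family: Courier New,Courier,mono;">ChatBatch</span> type, the one-byte length prefix, the 1024-byte capacity); it is not the lobbyserver&#8217;s actual chat code or wire format:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHAT_BATCH_CAPACITY 1024    /* assumed batch buffer size in bytes */

/* Hypothetical batch buffer: messages are stored back to back, each as a
 * one-byte length followed by the message text (no terminator). */
typedef struct {
    unsigned char data[CHAT_BATCH_CAPACITY];
    size_t used;    /* bytes currently queued */
    size_t count;   /* messages currently queued */
} ChatBatch;

void chat_batch_init(ChatBatch *b)
{
    b->used = 0;
    b->count = 0;
}

/* Queue one message.  Returns 1 on success, or 0 if the message is too
 * long for the one-byte length or does not fit (flush, then retry). */
int chat_batch_add(ChatBatch *b, const char *text)
{
    size_t len = strlen(text);

    if (len > 255 || b->used + 1 + len > CHAT_BATCH_CAPACITY)
        return 0;
    b->data[b->used++] = (unsigned char)len;
    memcpy(b->data + b->used, text, len);
    b->used += len;
    b->count++;
    return 1;
}

/* Copy every queued message into one outgoing packet and reset the batch.
 * Returns the number of bytes written, or 0 if the batch was empty or the
 * packet buffer is too small (in which case the batch is left untouched). */
size_t chat_batch_flush(ChatBatch *b, unsigned char *packet, size_t packet_size)
{
    size_t size = b->used;

    if (size > packet_size)
        return 0;
    memcpy(packet, b->data, size);
    chat_batch_init(b);
    return size;
}
```

<p>The idea is that the server would call chat_batch_add() as chats arrive and chat_batch_flush() once per tick, so each lobby client gets one packet per tick instead of one per chat line.</p>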
<p>Finally, I&#8217;m wondering if my usage of select() is going to be an issue as I get close to 1000 robots.  I may need to port to epoll().  We shall see!</p>
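<p>For context, select() can only watch descriptors numbered below FD_SETSIZE, which is typically 1024 on Linux, so it gets tight around 1000 clients if each one holds its own descriptor.  Here&#8217;s a minimal, Linux-only sketch of the epoll equivalent of &#8220;wait until something is readable&#8221;; the two helper names are hypothetical and this is not the lobbyserver&#8217;s code:</p>

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd for read-readiness on an existing epoll instance.
 * Returns 0 on success, -1 on error (same as epoll_ctl). */
int watch_for_read(int epfd, int fd)
{
    struct epoll_event ev = {0};

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Wait up to timeout_ms milliseconds for one ready descriptor.
 * Returns the ready fd, or -1 on timeout or error. */
int wait_for_ready(int epfd, int timeout_ms)
{
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);

    return n == 1 ? ev.data.fd : -1;
}
```

<p>Unlike select(), the interest list lives in the kernel, so the server registers each connection once with watch_for_read() instead of rebuilding an fd_set on every pass through the loop.</p>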
<p>&#8220;Assume Nothing!&#8221;</p>
<p>And finally, the SimCity launch has given me pause&#8230;I&#8217;m still forging ahead with my 1000 simultaneous goal, but I really hope it&#8217;s enough and things go smoothly.  I would much rather have a slow buildup of players over the next year as I roll out more cool stuff than a giant spike that melts everything and makes players grumpy.</p>
<p><a title="Loadtesting for Open Beta, Part 4: Done optimizing the lobbyserver!" href="http://www.spyparty.com/2013/05/21/loadtesting-for-open-beta-part-4-done-optimizing-the-lobbyserver/">On to Part 4&#8230;</a></p>
<hr/><ol class="footnotes"><li id="footnote_0_3139" class="footnote">ZOMBIE ROBOTS!!!</li><li id="footnote_1_3139" class="footnote">I have some local patches I haven&#8217;t cleaned up enough to contribute yet</li><li id="footnote_2_3139" class="footnote">gdb is not the best for assembly language debugging, but I did learn about &#8220;layout asm&#8221;, which helps a bit.</li><li id="footnote_3_3139" class="footnote">It never crashed because conns has a preallocated number of connections that was always bigger than n_conns+1</li><li id="footnote_4_3139" class="footnote">This is only partially true, because iprof is single-threaded&#8230;I really wish there was a good cross-platform light-weight way to get per-thread timings.</li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.spyparty.com/2013/03/18/loadtesting-for-open-beta-part-3/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Loadtesting for Open Beta, Part 2</title>
		<link>http://www.spyparty.com/2013/03/03/loadtesting-for-open-beta-part-2/</link>
		<comments>http://www.spyparty.com/2013/03/03/loadtesting-for-open-beta-part-2/#comments</comments>
		<pubDate>Sun, 03 Mar 2013 23:28:11 +0000</pubDate>
		<dc:creator><![CDATA[checker]]></dc:creator>
				<category><![CDATA[beta]]></category>
		<category><![CDATA[indie games]]></category>
		<category><![CDATA[metrics]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.spyparty.com/?p=3109</guid>
		<description><![CDATA[In our last exciting episode of Loadtesting for Open Beta, we did some initial profiling to see how the lobbyserver held up under attack by a phalanx of loadtesting robots spawned in the cloud. It didn&#8217;t hold up, obviously, or the beta would already be open. Specifically, it failed by saturating the server&#8217;s 100Mbps network [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In our <a title="Loadtesting for Open Beta, Part 1" href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/">last exciting episode of <em>Loadtesting for Open Beta</em></a>, we did some initial profiling to see how the lobbyserver held up under attack by a phalanx of loadtesting robots spawned in the cloud. It didn&#8217;t hold up, obviously, or the beta would already be open.</p>
<p>Specifically, it failed by saturating the server&#8217;s 100Mbps network link, which turned out to be a great way to fail because it meant there were some pretty simple things I could do to optimize the bandwidth utilization.  I had done the initial game<span style="font-size: medium;">↔</span>lobby protocol in the simplest way possible, so every time any player state changed, like a new connection, or switching from chatting in the lobby to playing, it sent out the entire list of player states to everybody.  This doesn&#8217;t scale at all, since as you add more players, most aren&#8217;t changing state, but you&#8217;re sending all of their states out to everybody even if only one changes.  This doesn&#8217;t mean it was the wrong way to program it initially; it&#8217;s really important when you&#8217;re writing complicated software<sup><a href="http://www.spyparty.com/2013/03/03/loadtesting-for-open-beta-part-2/#footnote_0_3109" id="identifier_0_3109" class="footnote-link footnote-identifier-link" title="especially by yourself!">1</a></sup> to do things the simplest way possible, as long as you have a vague plan for what you&#8217;ll do if it turns into a problem later.  In this case, I knew what I was doing was probably not going to work in the long run, but it got things up and running more quickly than overengineering some fancy solution I might not have needed, and I waited until it actually <em>was</em> a problem before fixing it.</p>
<a name="Tell+Me+Something+I+Don%26%238217%3Bt+Know"></a><h3>Tell Me Something I Don&#8217;t Know</h3>
<p>The solution to this problem is pretty obvious: differential state updates.  Or, in English, only send the stuff that&#8217;s changed to the people who care about it.  Doing differential updates is significantly more complicated than just spamming everybody with everything, however.  You still have to send the initial state of all the current players when new players log in, you have to be able to add and remove players in the protocol, which you didn&#8217;t have to before because you were just sending the complete new state every time, etc.</p>
<p>This was going to be a fairly large change, so I took it by steps.  I knew that I&#8217;d have to send out the complete state of everybody to new logins, so it made sense to start by optimizing that initial packet using normal data size optimization techniques.  I pretty easily got it from about 88 bytes per player down to 42 bytes per player, which is nice, because my goal for these optimizations is 1000 simultaneous players, and at 88 bytes they wouldn&#8217;t all fit in my 64kb maximum packet size, where at 42 bytes they should fit, no problem, so I don&#8217;t have to add any kind of break-up-the-list-across-packets thing.  However, it turns out I actually got the ability to send the entire list across multiple packets while I was doing this, because I had to program the ability to add players as part of the differential updates, so now I could just use that packet type to send any clients in a really large player list that didn&#8217;t fit in a single packet.  But, like I said in the last episode, although I don&#8217;t think I&#8217;ll hit 1000 simultaneous outside of load testing for a while, it&#8217;s always nice to know you have that sort of thing in your back pocket for the future.</p>
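<p>The packet-size arithmetic above is easy to check.  This sketch assumes &#8220;64kb&#8221; means 65536 bytes and ignores any packet header overhead; the function names are made up for the example:</p>

```c
#include <assert.h>
#include <stddef.h>

/* The 64kb packet limit from the post, assumed here to mean 65536 bytes. */
#define MAX_PACKET_BYTES (64u * 1024u)

/* Returns 1 if player_count entries of bytes_per_player bytes each fit in
 * a single packet (header overhead ignored), 0 otherwise. */
int player_list_fits(size_t player_count, size_t bytes_per_player)
{
    return player_count * bytes_per_player <= MAX_PACKET_BYTES;
}

/* Largest number of players that fit in one packet at this entry size. */
size_t max_players_per_packet(size_t bytes_per_player)
{
    assert(bytes_per_player > 0);
    return MAX_PACKET_BYTES / bytes_per_player;
}
```

<p>At 88 bytes per player, 1000 players need 88,000 bytes and the limit is reached at 744 players; at 42 bytes they need 42,000 bytes, and a single packet has room for 1560.</p>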
<p>Once I&#8217;d tested the new optimized player list, I started making the updates differential.  New players get the initial list, and then they&#8217;re considered up-to-date and just get updates along with everybody else.  The list of new players is sent as additions to players already in the lobby.  For each player, I track some simple flags about what&#8217;s been updated in their state, so if they set or clear their /away message for example, that flag is set, and I only send that information.</p>
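<p>The per-player flag tracking described above can be sketched like this.  The flag names, structs, and functions are all hypothetical, since the post doesn&#8217;t show the real protocol; the point is just that only players with a nonzero flag word generate any traffic:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-field dirty bits; the real field list is not shown. */
enum {
    DIRTY_AWAY_MESSAGE = 1u << 0,
    DIRTY_PLAY_STATE   = 1u << 1,
    DIRTY_STATS        = 1u << 2
};

typedef struct {
    int id;
    unsigned dirty;     /* OR of DIRTY_* bits changed since the last update */
} Player;

typedef struct {
    int id;
    unsigned fields;    /* which DIRTY_* fields this update carries */
} PlayerUpdate;

/* Record that one field of a player changed. */
void player_mark_dirty(Player *p, unsigned field)
{
    p->dirty |= field;
}

/* Collect one update per player with pending changes into out (at most
 * max_out entries), clearing those players' dirty bits as if the update
 * had been sent.  Unchanged players are skipped.  Returns the count. */
size_t collect_updates(Player *players, size_t n,
                       PlayerUpdate *out, size_t max_out)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < n && count < max_out; i++) {
        if (players[i].dirty == 0)
            continue;
        out[count].id = players[i].id;
        out[count].fields = players[i].dirty;
        count++;
        players[i].dirty = 0;
    }
    return count;
}
```

<p>With 500 players in the lobby and one of them setting an /away message, collect_updates() yields a single entry with only DIRTY_AWAY_MESSAGE set, rather than 500 full player states.</p>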
<p>In programming, usually when you&#8217;ve got the right design, you get some unintentional upside, and this case was no different.  Previously, I was not sending live updates to player stats (wins, game time, etc.) to the players in the lobby until the player was done playing the match, or some other state changed that caused everybody&#8217;s state to be re-sent.  Now, since the differential updates are efficient, I&#8217;m updating player stats in real time as well, so people in the lobby can see wins as they accumulate for players in matches, which is nice and how you&#8217;d expect it to work.</p>
<a name="Results"></a><h3>Results</h3>
<p>It basically worked exactly as planned.  After lots of debugging, of course.  Here you can see the profiles for one of the loadtests, which got to 340 simultaneous players in the lobby:</p>
<div id="attachment_3117" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/03/SpyParty-v0.1.2553.1-20130303-00-13-24-0.png"><img class="size-large wp-image-3117" title="SpyParty-v0.1.2553.1-20130303-00-13-24-0" src="http://cdn.spyparty.com/wp-content/uploads/2013/03/SpyParty-v0.1.2553.1-20130303-00-13-24-0-600x447.png" alt="" width="600" height="447" /></a><p class="wp-caption-text">I really need to have the robot Sniper win sometimes.</p></div>
<p>&nbsp;</p>
<div id="attachment_3115" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-03-00_13_34-atop-mem.png"><img class="size-large wp-image-3115" title="2013-03-03 00_13_34-atop-mem" src="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-03-00_13_34-atop-mem-600x243.png" alt="" width="600" height="243" /></a><p class="wp-caption-text">atop in memory mode</p></div>
<p>&nbsp;</p>
<div id="attachment_3114" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-03-00_13_31-atop-cpu.png"><img class=" wp-image-3114" title="2013-03-03 00_13_31-atop-cpu" src="http://cdn.spyparty.com/wp-content/uploads/2013/03/2013-03-03-00_13_31-atop-cpu-600x243.png" alt="" width="600" height="243" /></a><p class="wp-caption-text">atop in cpu mode</p></div>
<p>Look ma, 3% network utilization!  That&#8217;s what&#8217;s so awesome about a really spiky profile&#8230;when you pound one of the spikes down, things just get better!</p>
<p>Here&#8217;s the new table of packet sizes for this run.  If you compare this with the <a title="Loadtesting for Open Beta, Part 1 - Packet Size Table" href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/#Update:+Assuming+More+Nothing&amp;#8230;Er,+Less+Nothing?">previous results</a>, you can see the PLAYER_LIST packets are way way way smaller, and this table was accumulated from two longer test runs, so it&#8217;s not even a fair comparison!  It&#8217;s interesting, because the TYPE_LOBBY_MESSAGE_PACKET is smaller as well, and I think that&#8217;s because now the robots can actually start games since the network isn&#8217;t saturated, and this means they don&#8217;t broadcast chats to the entire lobby while they&#8217;re playing, so that&#8217;s a nice side effect of optimizing the bandwidth.</p>
<table border="0" cellspacing="0" cellpadding="0" align="center">
<thead>
<tr>
<td><strong>Packet Type</strong></td>
<td align="right"><strong>Total Bytes</strong></td>
</tr>
</thead>
<tbody>
<tr>
<td>TYPE_LOBBY_MESSAGE_PACKET</td>
<td align="RIGHT">58060417</td>
</tr>
<tr>
<td>TYPE_LOBBY_PLAYER_LIST_UPDATE_PACKET</td>
<td align="RIGHT">29751413</td>
</tr>
<tr>
<td>TYPE_CLIENT_GAME_JOURNAL_PACKET</td>
<td align="RIGHT">18006186</td>
</tr>
<tr>
<td>TYPE_LOBBY_ROOM_LIST_PACKET</td>
<td align="RIGHT">16674479</td>
</tr>
<tr>
<td>TYPE_LOBBY_PLAYER_LIST_ADDITION_PACKET</td>
<td align="RIGHT">4280563</td>
</tr>
<tr>
<td>TYPE_LOBBY_PLAYER_LIST_PACKET</td>
<td align="RIGHT">3482691</td>
</tr>
<tr>
<td>TYPE_CLIENT_MESSAGE_PACKET</td>
<td align="RIGHT">1501822</td>
</tr>
<tr>
<td>TYPE_CLIENT_LOGIN_PACKET</td>
<td align="RIGHT">477356</td>
</tr>
<tr>
<td>TYPE_CLIENT_INVITE_PACKET</td>
<td align="RIGHT">435368</td>
</tr>
<tr>
<td>TYPE_LOBBY_INVITE_PACKET</td>
<td align="RIGHT">275781</td>
</tr>
<tr>
<td>TYPE_LOBBY_LOGIN_PACKET</td>
<td align="RIGHT">235878</td>
</tr>
<tr>
<td>TYPE_LOBBY_GAME_ID_PACKET</td>
<td align="RIGHT">96000</td>
</tr>
<tr>
<td>TYPE_LOBBY_GAME_OVER_PACKET</td>
<td align="RIGHT">68901</td>
</tr>
<tr>
<td>TYPE_CLIENT_GAME_ID_CONFIRM_PACKET</td>
<td align="RIGHT">40257</td>
</tr>
<tr>
<td>TYPE_LOBBY_PLAY_PACKET</td>
<td align="RIGHT">32498</td>
</tr>
<tr>
<td>TYPE_CLIENT_IN_MATCH_PACKET</td>
<td align="RIGHT">25714</td>
</tr>
<tr>
<td>TYPE_LOBBY_IN_MATCH_PACKET</td>
<td align="RIGHT">21204</td>
</tr>
<tr>
<td>TYPE_CLIENT_CANDIDATE_PACKET</td>
<td align="RIGHT">16089</td>
</tr>
<tr>
<td>TYPE_CLIENT_PLAY_PACKET</td>
<td align="RIGHT">12419</td>
</tr>
<tr>
<td>TYPE_CLIENT_GAME_ID_REQUEST_PACKET</td>
<td align="RIGHT">9610</td>
</tr>
<tr>
<td>TYPE_LOBBY_WELCOME_PACKET</td>
<td align="RIGHT">4494</td>
</tr>
<tr>
<td>TYPE_CLIENT_JOIN_PACKET</td>
<td align="RIGHT">4494</td>
</tr>
<tr>
<td>TYPE_KEEPALIVE_PACKET</td>
<td align="RIGHT">1011</td>
</tr>
<tr>
<td>TYPE_CLIENT_IDLE_PACKET</td>
<td align="RIGHT">24</td>
</tr>
</tbody>
</table>
<p>Hmm, I just noticed as I&#8217;m writing this that the resident memory utilization in the atop screenshot is way lower now than before&#8230;I wonder why&#8230; On the application side I take about 250kb per player right now, which at 340 players should be about 85MB.  Looking at the lobbyserver logs, right about when the screenshot was taken, the lobby self-reported this data:</p>
<pre style="padding-left: 30px;">2013/03/03-02:13:15: MEMORY_POSIX 348/757/409: resident 12808/12808, virtual 160276/160276<br />2013/03/03-02:13:15: MEMORY_NEW 348/757/409: bytes 91766974, news 45707, deletes 36155</pre>
<p>The MEMORY_NEW stats look about right for this load and my quick math, but the MEMORY_POSIX stats, which are read from /proc/pid/status, match the atop results: expected virtual but low resident.   Maybe it was just paged out for a second, or maybe I&#8217;m not touching much of that 250kb and so it doesn&#8217;t stay resident.  A lot of it is network buffers, so it makes some sense that with this lower-bandwidth protocol less of it would be resident than in the last profile, because less buffering has to be done.  I&#8217;ll have to investigate this more.</p>
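<p>For anyone who wants to pull the same numbers themselves, here is a rough sketch of reading resident and virtual size out of a Linux <code>/proc/&lt;pid&gt;/status</code> file. The <code>VmRSS</code> and <code>VmSize</code> fields are standard on Linux and reported in kB; the function names are made up, and which fields the lobbyserver actually logs in its MEMORY_POSIX line is an assumption on my part.</p>

```python
def parse_status_kb(text, keys=("VmRSS", "VmSize")):
    """Extract selected kB-valued fields from /proc/<pid>/status text."""
    result = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in keys:
            # Lines look like "VmRSS:\t   12808 kB"; keep the number only.
            result[name] = int(rest.split()[0])
    return result


def read_self_memory():
    """Read the current process's own status file (Linux only)."""
    with open("/proc/self/status") as status_file:
        return parse_status_kb(status_file.read())


def expected_app_kb(players, kb_per_player=250):
    """Quick math from the post: about 250kb of state per player."""
    return players * kb_per_player
```

<p>With the figures above, <code>expected_app_kb(340)</code> gives 85000 kB of application-side state against a reported resident size of 12808 kB, which is the gap that still needs explaining.</p>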
<a name="Up+Next%2C+The+Case+of+the+Missing+Robots"></a><h3>Up Next, The Case of the Missing Robots</h3>
<p>So, the bandwidth optimizations were a resounding success!  Plus, both the CPU and memory utilization of the lobbyserver are really reasonable and haven&#8217;t been optimized at all, so we&#8217;re sitting pretty for getting to 1000 simultaneous robots&#8230;</p>
<p>Except, where are the remaining 160 robots?  In the test above, I ran 10 EC2 instances, each with 50 robots, thinking the optimizations might let me get to 500 simultaneous and find the next performance issue&#8230;but it never got above 340 in the lobby.  I updated my perl loadtesting framework and had each instance output how many lobbyclients were running every two seconds with this shell command over ssh:</p>
<pre style="padding-left: 30px;">'while true; do echo `date +%T`,`pidof lobbyclient | wc -w`; sleep 2; done'</pre>
<p>And then I loaded that into gnuplot,<sup><a href="http://www.spyparty.com/2013/03/03/loadtesting-for-open-beta-part-2/#footnote_1_3109" id="identifier_1_3109" class="footnote-link footnote-identifier-link" title="&hellip;which I hate, but I forgot to install excel on my new laptop, and Google&rsquo;s spreadsheet sucks at pivottables, and the Office for Web excel doesn&rsquo;t even have them as far as I could tell!">2</a></sup> and graphed the number of robots on each instance:</p>
<div id="attachment_3116" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/03/ec2-loadtest-client-counts.png"><img class="size-large wp-image-3116" title="ec2-loadtest-client-counts" src="http://cdn.spyparty.com/wp-content/uploads/2013/03/ec2-loadtest-client-counts-600x344.png" alt="" width="600" height="344" /></a><p class="wp-caption-text">The number of loadtest robots running on each EC2 instance.</p></div>
<p>You can see that they all started up with 50, but then a bunch of them lost clients until they found a steady state.   Something is killing my robots, and I need to figure out what it is&#8230;</p>
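<p>The shell loop above prints one <code>HH:MM:SS,count</code> line every two seconds, so boiling a single instance&#8217;s log down to &#8220;how many did it start with and how many survived&#8221; takes very little code. This is a sketch in Python rather than the Perl framework itself, and the function names are hypothetical.</p>

```python
def parse_counts(log_text):
    """Turn 'HH:MM:SS,count' lines into (timestamp, count) tuples."""
    samples = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        stamp, _, count = line.partition(",")
        samples.append((stamp, int(count)))
    return samples


def summarize(samples):
    """Peak client count, last observed count, and how many were lost."""
    counts = [count for _, count in samples]
    peak = max(counts)
    final = counts[-1]
    return {"peak": peak, "final": final, "lost": peak - final}
```

<p>Running <code>summarize(parse_counts(log))</code> on each instance&#8217;s output gives one row per instance, which is enough to spot which machines are shedding robots without opening a plotting tool.</p>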
<p><a title="Loadtesting for Open Beta, Part 3" href="http://www.spyparty.com/2013/03/18/loadtesting-for-open-beta-part-3/">Turn the page to Part 3&#8230;</a></p>
<hr/><ol class="footnotes"><li id="footnote_0_3109" class="footnote">especially by yourself!</li><li id="footnote_1_3109" class="footnote">&#8230;which I hate, but I forgot to install excel on my new laptop, and Google&#8217;s spreadsheet sucks at pivottables, and the Office for Web excel doesn&#8217;t even have them as far as I could tell!</li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.spyparty.com/2013/03/03/loadtesting-for-open-beta-part-2/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Loadtesting for Open Beta, Part 1</title>
		<link>http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/</link>
		<comments>http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/#comments</comments>
		<pubDate>Thu, 28 Feb 2013 03:21:24 +0000</pubDate>
		<dc:creator><![CDATA[checker]]></dc:creator>
				<category><![CDATA[beta]]></category>
		<category><![CDATA[indie games]]></category>
		<category><![CDATA[metrics]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.spyparty.com/?p=3072</guid>
		<description><![CDATA[Way back in 2011, right before I opened up Early-Access Beta signups, I loadtested and optimized the signup page to make sure it wouldn&#8217;t crash if lots of people were trying to submit their name and email and confirm their signup. I always intended to write up a technical post or two about that optimization [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><a title="Here we go…it’s SpyParty Beta time!" href="http://www.spyparty.com/2011/05/10/here-we-go-its-spyparty-beta-time/">Way back in 2011</a>, right before I opened up <a title="Sign Up for the SpyParty Early-Access Beta!" href="http://www.spyparty.com/beta-sign-up/"><em>Early-Access Beta</em></a> signups, I loadtested and optimized the signup page to make sure it wouldn&#8217;t crash if lots of people were trying to submit their name and email and confirm their signup. I always intended to write up a technical post or two about that optimization process because it was an interesting engineering exercise, but I have yet to get around to it. However, I can summarize the learnings here pretty quickly: <a href="http://wordpress.org/">WordPress</a> is excruciatingly slow, <a href="https://www.varnish-cache.org/">Varnish</a> is incredibly fast, I ♥ <a href="http://www.perl.org/">Perl</a>,<sup><a href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/#footnote_0_3072" id="identifier_0_3072" class="footnote-link footnote-identifier-link" title="See this thread for how I wrote the dynamic loadtesting form submission in a way that would saturate the network link.">1</a></sup> <a href="http://httpd.apache.org/">Apache</a> with plain old mod_php (meaning <em>not</em> loading WordPress) was actually <em>way</em> faster than I expected, slightly faster even than <a href="http://nginx.org/">nginx</a> + php-fpm in my limited tests, <a href="http://aws.amazon.com/cloudfront/">CloudFront</a> is pretty easy to use,<sup><a href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/#footnote_1_3072" id="identifier_1_3072" class="footnote-link footnote-identifier-link" title="I use CF for images and other static stuff, with W3 Total Cache to keep them synced to S3, but I only use W3TC for this CDN sync, since Varnish blows it out of the water for actual caching.">2</a></sup> and even cheap and small dedicated servers 
can handle a lot of traffic if you&#8217;re smart about it.</p>
<p>Like with any kind of optimization, <em><a href="http://www.phatcode.net/res/224/files/html/ch03/03-01.html">Assume Nothing</a></em>, so you should always write the loadtester first, and run it to get a baseline performance profile, and continue running it as you optimize the hotspots. When I started, the signup submission could only handle 2 or 3 submits per second. When I was done, it could handle 400 submissions per second. I figured that was enough.<sup><a href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/#footnote_2_3072" id="identifier_2_3072" class="footnote-link footnote-identifier-link" title="Let me be clear, I think 400 submissions per second is really pretty slow for raw performance on a modern computer, but web apps these days have so many layers that you lose a ton of performance relative to what would happen if you wrote the whole thing in C. For an interesting example of this, there&rsquo;s a wacky high performance web server called G-WAN that gets rid of all the layers and lets you write the pages directly in compiled C.">3</a></sup> If more than 400 people were signing up for the <strong>SpyParty</strong> beta every second, well, let&#8217;s file that under &#8220;good problem to have&#8221;.</p>
<p>After all the loadtesting and optimizing, the signups <a title="Beta Data" href="http://www.spyparty.com/2011/05/12/beta-data/">went off without a hitch</a>.</p>
<p>Loadtesting and optimizing the beta signup process was important, because the entire reason I took signups instead of just letting people play immediately was &#8220;fear of the unknown&#8221;. I couldn&#8217;t know in advance how many people would be interested in the game, and getting a couple web forms scalable in case that number was &#8220;a lot&#8221; was much easier than getting the full game and its server scalable, and that&#8217;s ignoring the very real need to exert some control over the growth of the community, to make sure the game wasn&#8217;t incredibly buggy on different hardware configurations or that there wasn&#8217;t some glaring balance issue, etc. Overall, starting with signups and a closed beta was great for the game, even if it&#8217;s meant frustrating people who signed up and want to play.</p>
<p>But it&#8217;s been long enough, and I&#8217;m now finally actively loadtesting and optimizing for opening the beta!</p>
<a name="Lobby+Loadtesting+Framework"></a><h3>Lobby Loadtesting Framework</h3>
<p>Like with the signup form, I&#8217;m loadtesting first. This will tell me where I need to optimize, and allow me to test my progress against the baseline. However, loadtesting a game lobby server is a lot more complicated than loadtesting a web form, so it&#8217;s a bit slower-going. I&#8217;ve had to create a robot version of the game client that logs into the lobby, chats, invites other robots to play, and then reports on the results of the fake games played. I built this on top of the game&#8217;s client interface, so it looks just like a real game to the lobby.</p>
<p>As with all testing, you need to make sure you aren&#8217;t <a href="http://en.wikipedia.org/wiki/Werner_Heisenberg">Heisenberg</a>-ing<sup><a href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/#footnote_3_3072" id="identifier_3_3072" class="footnote-link footnote-identifier-link" title="I just read on wikipedia that the uncertainty principle is often confused with the observer effect, and so on the surface this verbing of Heisenberg&rsquo;s name isn&rsquo;t correct, except he apparently also confused the two, so I&rsquo;m going to keep on verbing.">4</a></sup> your results, so I wanted to get fairly close to the same load that would happen with multiple real game clients hitting the server. This means I had to have a good number of machines running these robots hitting the test lobby at the same time, and that means using cloud computing. I was inspired by the <a href="http://blog.apps.chicagotribune.com/2010/07/08/bees-with-machine-guns/"><em>bees with machine guns</em></a> article about using Amazon Web Services&#8217;s Elastic Compute Cloud (EC2) to launch a bunch of cheap http load testers. I use AWS for <strong>SpyParty</strong> already, distributing updates and uploading crashdumps using S3, so this seemed like a good fit. At first I tried modifying the bees code to do what I want, but I found the Python threading technique they used for controlling multiple instances didn&#8217;t scale well running on Windows, and since I wanted more control over the instances anyway and the core idea was not terribly difficult to implement, I wrote my own version in Perl, which I&#8217;m much more familiar with. 
The code uses <a href="http://search.cpan.org/~mallen/Net-Amazon-EC2-0.23/lib/Net/Amazon/EC2.pm">Net::Amazon::EC2</a> to talk to AWS to start, list, and stop EC2 instances, and <a href="http://search.cpan.org/~rkitover/Net-SSH2-0.48/lib/Net/SSH2.pm">Net::SSH2</a> to talk to the instances themselves, executing commands and waiting for exit codes, downloading logs, and whatnot. I just use an existing CentOS EC2 AMI<sup><a href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/#footnote_4_3072" id="identifier_4_3072" class="footnote-link footnote-identifier-link" title="ami-c9846da0">5</a></sup> and then have the scripts download and install my robots onto it from S3 every time I start one up; I didn&#8217;t want to bother with creating a custom AMI when my files are pretty small. I&#8217;m going to post all the loadtest framework code once I&#8217;ve got it completely working so others can use it.</p>
<a name="How+Much+is+Enough%3F"></a><h3>How Much is Enough?</h3>
<p>In loadtesting the loadtesters, I found that an <a href="http://aws.amazon.com/ec2/instance-types/"><em>m1.small</em> instance</a> could run about 50 loadtest bots simultaneously with my current client code. I can switch to larger and more expensive EC2 instance types if I need to run more robots per instance, and as I optimize the server I&#8217;m pretty sure the client code will get optimized as well, which will allow more concurrency. Amazon limits accounts to 20 simultaneous EC2 instances until you apply for an exception, so I&#8217;ve done that,<sup><a href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/#footnote_5_3072" id="identifier_5_3072" class="footnote-link footnote-identifier-link" title="although they haven&rsquo;t gotten back to me so I guess I&rsquo;ll apply again&hellip;sigh, customer service &ldquo;in the cloud&rdquo; &nbsp;Update: Woot! &nbsp;My limit has been increased, now I can DDOS myself to my heart&rsquo;s content!">6</a></sup> but even with that limitation, I can loadtest to about 1000 concurrent clients, which seems like more than enough for now.</p>
<p>I still don&#8217;t know exactly what to expect when I open up the beta, but I don&#8217;t think I&#8217;ll hit 1000 simultaneous <strong>SpyParty</strong> players outside of loadtesting anytime soon. If you look at <a href="http://store.steampowered.com/stats/">the Steam Stats page</a>, 1000 simultaneous players is right in the middle of the top 100 games on the entire service, including some pretty popular mainstream games with mature player communities. In the current closed beta, I think our maximum number of simultaneous players has been around 25, and it&#8217;s usually between 10 and 15 on any given night at peak times, assuming there&#8217;s no event happening and I haven&#8217;t just sent out a big batch of invites. I still have about 6000 people left to invite for the first time from the signup list, and 9000 who didn&#8217;t register on their first invite to re-invite, all of whom I&#8217;ll use for live player loadtesting after the 1000 robots are happily playing without complaints. I think the spike from those last closed invites will be bigger than the open beta release spike, unless there are a ton of people who didn&#8217;t want to sign up with their email address, but who will buy the game once the beta is open. I guess that&#8217;s possible, but who knows? Again, if we go over 1000 simultaneous, I guess I will scramble to move the lobby to a bigger server, and keep repeating the &#8220;good problem to have&#8221; mantra over and over again, but I&#8217;m betting it&#8217;s not going to happen and things will go smoothly.</p>
<p>After open beta there will be a long list of awesome stuff coming into the game, including new maps and missions, spectation and replays, the <a title="The New SpyParty Character Art Style" href="http://www.spyparty.com/2012/08/27/the-new-spyparty-character-art-style/">new art</a>, and lots more, but once things are open it&#8217;ll be easier to predict the size of those spikes and plan accordingly. Eventually I&#8217;ll probably (hopefully?) have to move the lobby off my current server, but I&#8217;m pretty sure based on my initial testing that the old girl can keep things going smoothly a bit longer.</p>
<a name="Initial+Loadtesting+Baseline"></a><h3>Initial Loadtesting Baseline</h3>
<p>Okay, so what happens when I unleash the robots? Well, I haven&#8217;t let 1000 of them loose yet, but I&#8217;ve tried 500, and things fall over, as you might expect. It looks like around 250 is the maximum that can even connect right now, which is actually more than I thought I&#8217;d start out with.</p>
<div id="attachment_3075" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/02/SpyParty-v0.1.2532.0-20130227-11-56-33-0.png"><img class="size-large wp-image-3075" title="SpyParty-v0.1.2532.0-20130227-11-56-33-0" src="http://cdn.spyparty.com/wp-content/uploads/2013/02/SpyParty-v0.1.2532.0-20130227-11-56-33-0-600x505.png" alt="" width="600" height="505" /></a><p class="wp-caption-text">The loadtesting robots are not very good conversationalists.</p></div>
<p>Things don&#8217;t work very well even with 250 clients, though, with connections failing, and match invites not going through.<sup><a href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/#footnote_6_3072" id="identifier_6_3072" class="footnote-link footnote-identifier-link" title="Let&rsquo;s ignore the lobby UI also drawing all over itself for now.">7</a></sup> However, when I looked at <a href="http://www.atoptool.nl">atop</a> while the robots were pounding on the lobby, a wonderful thing was apparent:</p>
<div id="attachment_3096" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/02/2013-02-27-12_32_00-atop-cpu.png"><img class="size-large wp-image-3096" title="2013-02-27 12_32_00-atop-cpu" src="http://cdn.spyparty.com/wp-content/uploads/2013/02/2013-02-27-12_32_00-atop-cpu-600x256.png" alt="" width="600" height="256" /></a><p class="wp-caption-text">atop in CPU mode</p></div>
<p>&nbsp;</p>
<div id="attachment_3095" style="width: 610px" class="wp-caption aligncenter"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/02/2013-02-27-12_31_53-atop-mem.png"><img class="size-large wp-image-3095" title="2013-02-27 12_31_53-atop-mem" src="http://cdn.spyparty.com/wp-content/uploads/2013/02/2013-02-27-12_31_53-atop-mem-600x256.png" alt="" width="600" height="256" /></a><p class="wp-caption-text">atop in memory mode</p></div>
<p>Neither the CPU utilization nor the memory utilization was too terrible, but the lobbyserver was saturating the 100 Mbps ethernet link! That&#8217;s awesome, because that&#8217;s going to be easy to fix!</p>
<p>Before I explain, let me say that the best kind of profile is one with a single giant spike, one thing that&#8217;s obviously completely slow and working poorly. The worst kind of profile is a flat line, where everything is taking 3% of the time and there&#8217;s no single thing you can optimize. This is a great profile, because it points right towards the first thing I need to fix, which is the network bandwidth.</p>
<p>My protocol between the game clients and the lobby server is really pretty dumb in a lot of ways, but the biggest way it&#8217;s dumb is that on any state change of any client, it sends the entire list of clients and their current state to every client. This is the simplest thing to do and means there&#8217;s no need to track which clients have received which information, and this in turn means it&#8217;s the right thing to do first when you&#8217;re getting things going, but it&#8217;s also terribly wasteful performance-wise compared to just sending out the clients who changed each tick. So, I was delighted to see that bandwidth was my first problem, because it&#8217;s easy to see that I have to fix the protocol. I&#8217;m guessing switching to a differential player state update will cut the bandwidth by 50x, which will then reveal the next performance spike.</p>
<p style="text-align: left;">I can&#8217;t wait to find out what it will be!<sup><a href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/#footnote_7_3072" id="identifier_7_3072" class="footnote-link footnote-identifier-link" title="You can see the CPU usage is pretty high relative to the memory usage, and seeing slapd and krb5kdc in there is a bit worrying, since that&rsquo;s kerberos and ldap, which are used for the login and client authentication and are going to be a bit harder to optimize if they start poking their heads up too high, but both of them have very battle-tested enterprise-scale optimization solutions via replication, so worst-case is I&rsquo;ll have to get another machine for them, I think. If the lobbyserver itself is still CPU-bound after fixing the bandwidth issue, then I&rsquo;ll start normal code optimization for it, including profiling, of course. I&rsquo;ll basically recurse on the lobbyserver executable!">8</a></sup></p>
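<p>To see why the send-everything protocol hurts so much, here is a toy model of the traffic caused by a single state change. It is only a sketch: the 88-byte player record is an assumed size, packet headers and per-tick batching are ignored, and the function names are invented for the example.</p>

```python
def full_broadcast_bytes(num_clients, bytes_per_player):
    """Every client receives the whole player list on one state change."""
    return num_clients * num_clients * bytes_per_player


def differential_bytes(num_clients, bytes_per_player):
    """Every client receives only the one player that changed."""
    return num_clients * bytes_per_player


def savings_ratio(num_clients, bytes_per_player):
    """How many times smaller the differential traffic is in this model."""
    return (full_broadcast_bytes(num_clients, bytes_per_player)
            // differential_bytes(num_clients, bytes_per_player))


if __name__ == "__main__":
    print(full_broadcast_bytes(250, 88), differential_bytes(250, 88))
```

<p>In this model the full broadcast grows with the square of the client count while the differential update grows linearly, so the ratio between them is simply the number of clients. Real traffic will not be that clean, so treat the ratio as a ceiling for this simplified case, not a prediction.</p>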
<p>Oh, and the total EC2 bill for my loadtesting over the past few days: $5.86</p>
<a name="So%26%238230%3BOpen+Beta%3F"></a><h3>So&#8230;Open Beta?</h3>
<p>Within weeks! Weeks, I tell you!</p>
<p>Oh, and as I&#8217;ve said before, everybody who is signed up will get invited in before open beta. I will then probably have a short &#8220;quiet period&#8221; where I let things settle down before really opening it up, so if you want in before open beta, <a title="Sign Up for the SpyParty Early-Access Beta!" href="http://www.spyparty.com/beta-sign-up/">sign up now</a>.</p>
<a name="Update%3A+Assuming+More+Nothing%26%238230%3BEr%2C+Less+Nothing%3F"></a><h3>Update: Assuming More Nothing&#8230;Er, Less Nothing?</h3>
<p>After posting this article, I was about to start optimizing the client list packets, when it occurred to me I wasn&#8217;t <a href="http://www.phatcode.net/res/224/files/html/ch03/03-01.html">assuming enough nothing</a>, because I was assuming it was the client list taking all the bandwidth. This made me a bit nervous, which is the right feeling to have when you&#8217;re not following your own advice,<sup><a href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/#footnote_8_3072" id="identifier_8_3072" class="footnote-link footnote-identifier-link" title="&hellip;let alone Mike Abrash&rsquo;s advice!">9</a></sup> so I implemented a really simple bit of code that accumulated the per-packet send and receive sizes, and printed them on exit, and then threw another 250 robots at the server for 60 seconds. The results validated the client list assumption: it&#8217;s by far the biggest bandwidth consumer, sending 1.6GB in 60 seconds.<sup><a href="http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/#footnote_9_3072" id="identifier_9_3072" class="footnote-link footnote-identifier-link" title="Or actually trying to send, since 1.6GB in 60 seconds is 200Mbps, which is not happening on a 100Mbps link!">10</a></sup> However, it did show that the lobby sending chat and status messages to the clients is also maybe going to be a problem, so yet again: <em>measuring things is crucial</em>.</p>
<table border="0" cellspacing="0" cellpadding="0" align="center">
<thead>
<tr>
<td><strong>Packet Type</strong></td>
<td align="right"><strong>Total Bytes</strong></td>
</tr>
</thead>
<tbody>
<tr>
<td>TYPE_LOBBY_PLAYER_LIST_PACKET</td>
<td align="right">1632549877</td>
</tr>
<tr>
<td>TYPE_LOBBY_MESSAGE_PACKET</td>
<td align="right">66687600</td>
</tr>
<tr>
<td>TYPE_LOBBY_ROOM_LIST_PACKET</td>
<td align="right">9474937</td>
</tr>
<tr>
<td>TYPE_CLIENT_INVITE_PACKET</td>
<td align="right">303056</td>
</tr>
<tr>
<td>TYPE_CLIENT_MESSAGE_PACKET</td>
<td align="right">226779</td>
</tr>
<tr>
<td>TYPE_CLIENT_LOGIN_PACKET</td>
<td align="right">157795</td>
</tr>
<tr>
<td>TYPE_LOBBY_INVITE_PACKET</td>
<td align="right">131667</td>
</tr>
<tr>
<td>TYPE_LOBBY_LOGIN_PACKET</td>
<td align="right">77951</td>
</tr>
<tr>
<td>TYPE_KEEPALIVE_PACKET</td>
<td align="right">43032</td>
</tr>
<tr>
<td>TYPE_CLIENT_GAME_JOURNAL_PACKET</td>
<td align="right">5478</td>
</tr>
<tr>
<td>TYPE_LOBBY_PLAY_PACKET</td>
<td align="right">1888</td>
</tr>
<tr>
<td>TYPE_LOBBY_WELCOME_PACKET</td>
<td align="right">1491</td>
</tr>
<tr>
<td>TYPE_CLIENT_JOIN_PACKET</td>
<td align="right">1491</td>
</tr>
<tr>
<td>TYPE_CLIENT_PLAY_PACKET</td>
<td align="right">836</td>
</tr>
<tr>
<td>TYPE_CLIENT_IN_MATCH_PACKET</td>
<td align="right">713</td>
</tr>
<tr>
<td>TYPE_LOBBY_IN_MATCH_PACKET</td>
<td align="right">532</td>
</tr>
<tr>
<td>TYPE_CLIENT_CANDIDATE_PACKET</td>
<td align="right">490</td>
</tr>
<tr>
<td>TYPE_LOBBY_GAME_ID_PACKET</td>
<td align="right">300</td>
</tr>
<tr>
<td>TYPE_CLIENT_GAME_ID_REQUEST_PACKET</td>
<td align="right">30</td>
</tr>
</tbody>
</table>
<p>It&#8217;s interesting that the clients are only sending 300KB worth of chat messages to the lobby, but it&#8217;s sending 66MB back to them, but 66MB is around 250 * 300KB, so it makes back-of-the-envelope sense. I&#8217;m probably going to need to investigate that more once I&#8217;ve hammered the player list traffic down. Maybe I&#8217;ll have to accumulate them every tick, compress them all, and send them out.</p>
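<p>The byte accounting behind the table is about as simple as instrumentation gets. Here is a rough Python sketch of the idea; the <code>PacketStats</code> class and its method names are hypothetical, not the lobbyserver&#8217;s actual code, and only the packet type names and the byte total come from the table above.</p>

```python
from collections import Counter


class PacketStats:
    """Accumulate total bytes per packet type, to be printed on exit."""

    def __init__(self):
        self.bytes_by_type = Counter()

    def record(self, packet_type, size):
        """Call once for every packet sent or received."""
        self.bytes_by_type[packet_type] += size

    def report(self):
        """Packet types sorted from most bytes to fewest."""
        return self.bytes_by_type.most_common()


def megabits_per_second(total_bytes, seconds):
    """Average bit rate implied by a byte total over a time window."""
    return total_bytes * 8 / seconds / 1_000_000


if __name__ == "__main__":
    stats = PacketStats()
    stats.record("TYPE_LOBBY_PLAYER_LIST_PACKET", 22000)
    stats.record("TYPE_KEEPALIVE_PACKET", 4)
    for packet_type, total in stats.report():
        print(packet_type, total)
```

<p>Feeding the player list total from the table into <code>megabits_per_second(1632549877, 60)</code> gives roughly 218 Mbps, which is why that traffic could never actually fit through a 100 Mbps link.</p>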
<p><a title="Loadtesting for Open Beta, Part 2" href="http://www.spyparty.com/2013/03/03/loadtesting-for-open-beta-part-2/">This way to Part 2&#8230;</a></p>
<hr/><ol class="footnotes"><li id="footnote_0_3072" class="footnote">See <a href="http://www.perlmonks.org/?node_id=901638">this thread</a> for how I wrote the dynamic loadtesting form submission in a way that would saturate the network link.</li><li id="footnote_1_3072" class="footnote">I use CF for images and other static stuff, with <a href="http://wordpress.org/extend/plugins/w3-total-cache/">W3 Total Cache</a> to keep them synced to S3, but I only use W3TC for this CDN sync, since Varnish blows it out of the water for actual caching.</li><li id="footnote_2_3072" class="footnote">Let me be clear, I think 400 submissions per second is really pretty slow for raw performance on a modern computer, but web apps these days have so many layers that you lose a ton of performance relative to what would happen if you wrote the whole thing in C. For an interesting example of this, there&#8217;s a wacky high performance web server called <a href="http://gwan.com/benchmark/babel.html">G-WAN</a> that gets rid of all the layers and lets you write the pages directly in compiled C.</li><li id="footnote_3_3072" class="footnote">I just read on wikipedia that the <a href="http://en.wikipedia.org/wiki/Uncertainty_principle">uncertainty principle</a> is often confused with the <a href="http://en.wikipedia.org/wiki/Observer_effect_%28physics%29">observer effect</a>, and so on the surface this verbing of Heisenberg&#8217;s name isn&#8217;t correct, except he apparently also confused the two, so I&#8217;m going to keep on verbing.</li><li id="footnote_4_3072" class="footnote">ami-c9846da0</li><li id="footnote_5_3072" class="footnote">although they haven&#8217;t gotten back to me so I guess I&#8217;ll apply again&#8230;sigh, customer service &#8220;in the cloud&#8221;  Update: Woot!  
My limit has been increased, now I can DDOS myself to my heart&#8217;s content!</li><li id="footnote_6_3072" class="footnote">Let&#8217;s ignore the lobby UI also drawing all over itself for now.</li><li id="footnote_7_3072" class="footnote">You can see the CPU usage is pretty high relative to the memory usage, and seeing slapd and krb5kdc in there is a bit worrying, since that&#8217;s <a href="http://web.mit.edu/Kerberos/">kerberos</a> and <a href="http://www.openldap.org/">ldap</a>, which are used for the login and client authentication and are going to be a bit harder to optimize if they start poking their heads up too high, but both of them have very battle-tested enterprise-scale optimization solutions via replication, so worst-case is I&#8217;ll have to get another machine for them, I think. If the lobbyserver itself is still CPU-bound after fixing the bandwidth issue, then I&#8217;ll start normal code optimization for it, including profiling, of course. I&#8217;ll basically recurse on the lobbyserver executable!</li><li id="footnote_8_3072" class="footnote">&#8230;let alone Mike Abrash&#8217;s advice!</li><li id="footnote_9_3072" class="footnote">Or actually <em>trying to send</em>, since 1.6GB in 60 seconds is 200Mbps, which is not happening on a 100Mbps link!</li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.spyparty.com/2013/02/27/loadtesting-for-open-beta-part-1/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>One Bug&#8217;s Story, or, Assume it&#8217;s a bug!</title>
		<link>http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/</link>
		<comments>http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/#comments</comments>
		<pubDate>Sun, 10 Feb 2013 03:07:44 +0000</pubDate>
		<dc:creator><![CDATA[checker]]></dc:creator>
				<category><![CDATA[beta]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[streams]]></category>

		<guid isPermaLink="false">http://www.spyparty.com/?p=3024</guid>
		<description><![CDATA[This is the story of a bug in SpyParty.  This story has a happy ending, because the SpyParty beta testers are amazing, and they are constantly helping find bugs, of course, but they are also constantly helping me reproduce bugs, and narrow down the potential causes of bugs, and triage them, and are generally providing [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>This is the story of a bug in <strong>SpyParty</strong>.  This story has a happy ending, because the <strong>SpyParty</strong> beta testers are amazing, and they are constantly helping find bugs, of course, but they are also constantly helping me reproduce bugs, and narrow down the potential causes of bugs, and triage them, and are generally providing me with incredible support so I can make the game better.</p>
<p>This bug also has an interesting story, because it turned out to have a very subtle cause, one that manifested itself intermittently in ways that looked almost random, and for a long time there was no &#8220;repro case&#8221;.  Getting a repro on a bug is the key to fixing it, as I discuss in the <a title="How to Report Bugs the SpyParty Way" href="http://www.spyparty.com/2012/04/12/how-to-report-bugs-the-spyparty-way/">How to Report Bugs the <strong>SpyParty</strong> Way</a> post.  It can make the difference between a 10-minute fix and a 10-day fix&#8230;or never managing to find and fix it.</p>
<p><em>*shudder*</em></p>
<p>But let&#8217;s start at the beginning&#8230;</p>
<p>I first noticed the bug long before I&#8217;d invited people into the beta, but it was so rare that I didn&#8217;t prioritize finding it, and in fact I would forget about it for stretches of time.  Yes, I make notes of bugs I see, but I&#8217;ve got so much high priority stuff to do right now that I don&#8217;t go back and look at that list very often&#8230;the important bugs get fixed immediately, but a bug like this can stay in for a long time.</p>
<p>So, what&#8217;s the bug?  Well, let&#8217;s go to the videotape:</p>
<p><a href="http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/"><em>Click here to view the embedded video.</em></a></p>
<p>This video was from a <a href="http://www.youtube.com/watch?v=ASkd9vpner4">Spy Commentary game</a> played between <strong>buxx</strong> and <strong>dieffenbachj</strong>, two early beta testers who were some of the first to upload gameplay videos.</p>
<p>It turns out, in addition to just being plain awesome for games overall, the rise of videos and streams is also an amazing resource for bug finding and fixing!  The more people record their games, the more they&#8217;ll be able to point to a video of exactly what went wrong so the developers can see it almost first-hand.  The bane of a developer&#8217;s life is a bug report that says, &#8220;the game broke&#8221; with no other description.  You know something&#8217;s probably wrong, but it&#8217;s basically useless for finding a problem.  With a video, you can usually see exactly what&#8217;s going on, so that problem is eliminated, or at least massively reduced.</p>
<p>So, as you can see in that video, the Spy is facing the wrong way on the floor pad.  The Spy should be facing the pedestal with the statue on it in that position, but in this case the Spy is turned 90 degrees.</p>
<p>I watched <strong>buxx</strong>&#8217;s videos when he posted them<sup><a href="http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/#footnote_0_3024" id="identifier_0_3024" class="footnote-link footnote-identifier-link" title="&hellip;because they&rsquo;re great for learning, the only thing better than video is commentated video!">1</a></sup> and I noticed this, and so I posted a bug in the Bugs Forum on the private beta website myself on May 6th, 2012.</p>
<p>Obviously, trying to repro it by replaying the steps in that video was fruitless.</p>
<p>Next up, <strong>bishop</strong> caught it with a screenshot and posted on May 13th, 2012:</p>
<p><a href="http://cdn.spyparty.com/wp-content/uploads/2013/02/SpyParty-20120512-08-45-22.png"><img class="aligncenter size-large wp-image-3034" title="SpyParty-20120512-08-45-22" alt="" src="http://cdn.spyparty.com/wp-content/uploads/2013/02/SpyParty-20120512-08-45-22-600x349.png" width="600" height="349" /></a></p>
<p>It&#8217;s a perfect shot, but his next post is &#8220;I haven&#8217;t had much luck on the repro.&#8221;</p>
<p><strong>ardonite</strong> chimes in the next day:</p>
<blockquote>
<p>I got the rotation bug once at a statue.</p>
<p>I think I was rriiiiight in the bounding box. So maybe if it&#8217;s on a border pixel of the box then it glitches out?</p>
<p>Edit: no, hypothesis incorrect.</p>
</blockquote>
<p>That&#8217;s how it goes: make a hypothesis, check it, repeat.</p>
<p>More shots every couple months over the summer from <strong>bishop</strong> and <strong>r7stuart</strong>:</p>
<p style="text-align: center;"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/02/SpyParty-20120527-15-36-37.png"><img class="size-medium wp-image-3040 aligncenter" title="SpyParty-20120527-15-36-37" alt="" src="http://cdn.spyparty.com/wp-content/uploads/2013/02/SpyParty-20120527-15-36-37-300x174.png" width="300" height="174" /></a></p>
<p style="text-align: center;"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/02/2012-08-09_00003.jpg"><img class="size-medium wp-image-3032 aligncenter" title="2012-08-09_00003" alt="" src="http://cdn.spyparty.com/wp-content/uploads/2013/02/2012-08-09_00003-300x159.jpg" width="300" height="159" /></a></p>
<p><a href="http://cdn.spyparty.com/wp-content/uploads/2013/02/2012-08-10_00001.jpg"><img class="aligncenter size-medium wp-image-3033" title="2012-08-10_00001" alt="" src="http://cdn.spyparty.com/wp-content/uploads/2013/02/2012-08-10_00001-300x159.jpg" width="300" height="159" /></a></p>
<p>Still no repro.</p>
<p>This entire time—in fact since I first saw it myself pre-beta—I&#8217;ve been trying to resist thinking it&#8217;s some kind of &#8220;numerical issue&#8221; with the code that handles the facing angle.  Yes, angles are finicky to deal with due to wrapping, but I&#8217;ve found programmers, including myself, tend to immediately go to vague concepts like &#8220;floating point error&#8221; for anything like this.  To fight this tendency back when we were working on physical simulation code together, <a href="http://www.mollyrocket.com/136">Casey Muratori</a> and I developed a mantra:  <strong>&#8220;Assume it&#8217;s a bug!&#8221;</strong>  It means that instead of assuming it&#8217;s some subtle floating point error creeping in, or anything mysterious like that, it&#8217;s almost certainly just some dumb programming bug.  That mantra has never failed me.  It&#8217;s always just a plain old bug.</p>
<p>Onward&#8230;</p>
<p>In the fall of 2012, <a href="http://www.spyparty.com/streams/">streaming <strong>SpyParty</strong></a> took off bigtime, and so people were recording their games more often, and we started to get more videos, this one from <strong>r7stuart</strong> in October:</p>
<p><a href="http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/"><em>Click here to view the embedded video.</em></a></p>
<p>And <strong>tytalus</strong> in November:</p>
<p><a href="http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/"><em>Click here to view the embedded video.</em></a></p>
<p>I saw this last one live on <a href="http://www.twitch.tv/tytaluswarden">tytalus&#8217;s stream</a>, and grimaced when it happened, but also was happy to have more data to some day find a repro, or just have a random brainwave and fix it by intuition.<sup><a href="http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/#footnote_1_3024" id="identifier_1_3024" class="footnote-link footnote-identifier-link" title="This happens in programming a lot, but you don&rsquo;t want to count on it if you don&rsquo;t have to.">2</a></sup></p>
<p>At this point, people were reporting NPCs doing it, which at least made me happy, because it meant it wasn&#8217;t a <em>tell</em>.  Tells and anti-tells<sup><a href="http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/#footnote_2_3024" id="identifier_2_3024" class="footnote-link footnote-identifier-link" title="Where the NPCs can do something the Spy can&rsquo;t.">3</a></sup> are the most serious <strong>SpyParty</strong> bugs, because they undermine the delicate balance of the game, so I prioritize them highest, even above crash bugs sometimes!</p>
<p>A couple on New Year&#8217;s Eve from <strong>jorjon</strong>:</p>
<p><a href="http://cdn.spyparty.com/wp-content/uploads/2013/02/SpyParty-20121231-12-40-42-0.png"><img class="aligncenter size-medium wp-image-3036" title="SpyParty-20121231-12-40-42-0" alt="" src="http://cdn.spyparty.com/wp-content/uploads/2013/02/SpyParty-20121231-12-40-42-0-300x159.png" width="300" height="159" /></a></p>
<p><a href="http://cdn.spyparty.com/wp-content/uploads/2013/02/SpyParty-20121231-12-49-28-0.png"><img class="aligncenter size-medium wp-image-3029" title="SpyParty-20121231-12-49-28-0" alt="" src="http://cdn.spyparty.com/wp-content/uploads/2013/02/SpyParty-20121231-12-49-28-0-300x159.png" width="300" height="159" /></a></p>
<p>Then two clips from streams, the first from <strong>slappydavis</strong>, whose Seduction Target appears to do it at the bookshelf on January 6th:</p>
<p><a href="http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/"><em>Click here to view the embedded video.</em></a></p>
<p>And then from <strong>james1221</strong> on January 12th during the <a title="SpyParty New Years Cup Tournament Starting Tonight!" href="http://www.spyparty.com/2013/01/02/spyparty-new-years-cup-tournament-starting-tonight/"><strong>SpyParty</strong> New Years Cup Tournament</a>:</p>
<p><a href="http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/"><em>Click here to view the embedded video.</em></a></p>
<p>Both of these are different, however.  In both cases, the NPC is being blocked by another character, and instead of repathing to a new place, they just wait until the blocking character leaves.  This is both good news and bad news.  It means it&#8217;s easy to project this bug onto other bugs, it means there are other bugs,<sup><a href="http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/#footnote_3_3024" id="identifier_3_3024" class="footnote-link footnote-identifier-link" title="duh">4</a></sup> and it means all the real examples of this bug so far have involved the Spy.  I don&#8217;t mention this last part in the hopes that nobody notices.  Luckily it&#8217;s rare enough that it&#8217;s not going to be a game balance changer even if it is a tell.</p>
<p>Finally, <strong>kcmmmmm</strong> finds a reliable repro on February 7th, two days ago, and 9 months after the first post in the Bugs thread!  These pictures are beautiful to me:</p>
<p style="text-align: center;"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/02/SpyParty-20130207-12-09-59-0.png"><img class="aligncenter  wp-image-3031" title="SpyParty-20130207-12-09-59-0" alt="" src="http://cdn.spyparty.com/wp-content/uploads/2013/02/SpyParty-20130207-12-09-59-0-600x556.png" width="360" height="334" /></a></p>
<p style="text-align: center;"><a href="http://cdn.spyparty.com/wp-content/uploads/2013/02/SpyParty-20130207-12-03-01-0.png"><img class="aligncenter  wp-image-3030" title="SpyParty-20130207-12-03-01-0" alt="" src="http://cdn.spyparty.com/wp-content/uploads/2013/02/SpyParty-20130207-12-03-01-0-600x556.png" width="360" height="334" /></a></p>
<p>You can stand at that position, with that camera angle, and repro the bug most tries.  He also figures out that it&#8217;s very camera angle dependent, which is another clue, but once I could repro it locally, its remaining time on this earth was measured in minutes.</p>
<p>I had some trouble reproing it reliably here, including a couple wild-goose chases where I thought it wouldn&#8217;t repro with the debugger running<sup><a href="http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/#footnote_4_3024" id="identifier_4_3024" class="footnote-link footnote-identifier-link" title="Sometimes this happens with an uninitialized variable, since the debugger initializes most memory to zero for you.">5</a></sup> or in my debugging modes,<sup><a href="http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/#footnote_5_3024" id="identifier_5_3024" class="footnote-link footnote-identifier-link" title="Could mean some debug code was correcting for the bug?">6</a></sup> but in the end I got a case where I could catch it in the debugger, and I looked at the source, and there it was, suspiciously rotten code.</p>
<p>It was an old check from when I used to support click-to-move, as opposed to direct-control of the spy.  There was a case in the code that would check if you were clicking on the bookshelf itself, rather than the floor pad in front of the bookshelf, and it would helpfully direct you to the floor pad.  The position part of this got taken out long ago (I think), but the angle part remained in, so when the Spy stopped moving, if the mouse was over the bookshelf that code would return the angle for facing the bookshelf.</p>
<p>Wait, you say, there&#8217;s no mouse cursor in Spy mode?  Ah, yes there is, it&#8217;s just hidden and forced into the middle of the screen.  So, most of the time it hits your back, but sometimes, if you&#8217;re turning or leaning down or whatever when you stop, it&#8217;ll miss you and hit what&#8217;s behind you, and if it hits a bookshelf on the frame when you stop, you get the wrong facing angle.</p>
<p>Now, with this knowledge, go back and watch the videos and look at the pictures above.  Always a nearby guilty bookcase, isn&#8217;t there?</p>
<p>But wait, you say again, what about the very first video, the green bookshelf is nowhere near the middle of the screen!  Ah, but <strong>buxx</strong> uses a controller,<sup><a href="http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/#footnote_6_3024" id="identifier_6_3024" class="footnote-link footnote-identifier-link" title="You can see the action UI in the video uses the controller icons! No detail is too small to matter in a bug!">7</a></sup> and the mouse gets hidden but doesn&#8217;t get centered if you&#8217;re using a controller!  It probably should, but it doesn&#8217;t.  So, <strong>buxx</strong>&#8217;s hidden mouse pointer is probably off to the right of the window, over the green bookshelf, until he moves one pedestal pad to the left, and then the mouse pointer is no longer on the bookshelf, and he faces the right way!</p>
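<p>To make the mechanism concrete, here is a rough Python sketch of the logic described above.  It is not the actual <strong>SpyParty</strong> code: the names (<code>Rect</code>, <code>hidden_cursor</code>, <code>facing_on_stop</code>), the screen size, and the angle values are all invented for illustration, and the real picking test is presumably a 3D ray cast rather than a 2D rectangle check.</p>

```python
from dataclasses import dataclass

# Hypothetical values; nothing here comes from the game's source.
SCREEN_W, SCREEN_H = 1280, 720
FACE_PEDESTAL = 0.0     # degrees: the angle the pedestal floor pad wants
FACE_BOOKSHELF = 90.0   # degrees: the angle the stale click-to-move check returns


@dataclass
class Rect:
    """Screen-space bounds of something the hidden cursor can hit."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x, y):
        return self.right >= x >= self.left and self.bottom >= y >= self.top


def hidden_cursor(using_controller, last_mouse_pos):
    """The cursor is hidden in Spy mode, but only mouse users get it re-centered."""
    if using_controller:
        return last_mouse_pos  # stays wherever the mouse was left
    return (SCREEN_W / 2, SCREEN_H / 2)


def facing_on_stop(cursor, spy, bookshelf):
    """Choose the facing angle on the frame the Spy stops on a pedestal pad."""
    x, y = cursor
    if spy.contains(x, y):
        return FACE_PEDESTAL   # the usual case: the cursor hits the Spy's back
    if bookshelf.contains(x, y):
        return FACE_BOOKSHELF  # leftover click-to-move helper overrides the angle
    return FACE_PEDESTAL


# Mouse player: the centered cursor slips past a leaning Spy onto the shelf behind.
leaning_spy = Rect(660, 250, 740, 600)
shelf_behind = Rect(500, 100, 800, 650)
mouse_case = facing_on_stop(hidden_cursor(False, (0, 0)), leaning_spy, shelf_behind)

# Controller player: the cursor was left off to the right, over another bookshelf.
centered_spy = Rect(600, 250, 680, 600)
shelf_right = Rect(1000, 100, 1250, 600)
pad_case = facing_on_stop(hidden_cursor(True, (1200, 300)), centered_spy, shelf_right)
```

<p>In this toy model both <code>mouse_case</code> and <code>pad_case</code> come out as the bookshelf angle, even though the Spy is standing on a pedestal pad, which is the symptom in the screenshots.  Deleting the bookshelf branch is one way the sketch could be corrected; the post doesn&#8217;t say exactly what the real fix was.</p>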
<p>Awesome, all the cases explained, and the bug was trivial to fix!</p>
<p>Okay, there&#8217;s actually one more case in the Bugs thread I didn&#8217;t post here, because it&#8217;s funny enough that I&#8217;m going to make an entire post about it soon.</p>
<p>So, just remember, always, <strong>Assume it&#8217;s a bug!</strong></p>
<hr/><ol class="footnotes"><li id="footnote_0_3024" class="footnote">&#8230;because they&#8217;re great for learning, the only thing better than video is commentated video!</li><li id="footnote_1_3024" class="footnote">This happens in programming a lot, but you don&#8217;t want to count on it if you don&#8217;t have to.</li><li id="footnote_2_3024" class="footnote">Where the NPCs can do something the Spy can&#8217;t.</li><li id="footnote_3_3024" class="footnote">duh</li><li id="footnote_4_3024" class="footnote">Sometimes this happens with an uninitialized variable, since the debugger initializes most memory to zero for you.</li><li id="footnote_5_3024" class="footnote">Could mean some debug code was correcting for the bug?</li><li id="footnote_6_3024" class="footnote">You can see the action UI in the video uses the controller icons! No detail is too small to matter in a bug!</li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.spyparty.com/2013/02/09/one-bugs-story-or-assume-its-a-bug/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
	</channel>
</rss>
