Hello everyone. Today, we’re going to talk about mobile app load testing with HeadSpin. My name is Aaron Gibson – I’m a Software Developer here at HeadSpin. I work primarily on backend stuff, and I’ve worked on single sign-on. I’ve worked on other infrastructure-related stuff [as well].
Today, we’re going to talk about load testing for mobile applications and how, at HeadSpin, we use load testing to better understand how we can improve app development for our users and customers. We’re going to talk quickly about mobile app load testing with HeadSpin – what we support and offer. We’re going to talk about playbooks, which are the central piece of load testing. Basically, a playbook is the script – or set of scripts, if you will – that runs on the different targets and simulates the load. And then, we’re going to have a demo to show how this all plays together. We can schedule tests, we’ll demo running a test, and we’ll demo how this integrates a little with the performance monitoring UI as well.
The main question here to start with is: “why would we load test an application?”
I think most of the time developers get caught up in stuff like unit testing and integration testing and smoke testing, and these are extremely important. They verify the functionality of an app, they verify that the app behaves as expected, they verify that the user-expected input results in expected output. But in real life, when you actually deploy an application, a lot of things that get overlooked [revolve around] what happens when multiple users hit the system at the same time.
Maybe one or two users or a couple of users is not a big deal. But what happens when a lot of users hit the system, and how does this affect the server backend? Maybe the server stops serving as many requests per second because it’s bogged down, or maybe the server just doesn’t respond quickly and there are connectivity drops, especially in the mobile application environment.
At HeadSpin, we’re concerned with how the server load might integrate with the user experience. Load testing the applications produces metrics that are valuable in their own right, but we’re also concerned with how these metrics translate into better or worse user experience depending on the conditions. At HeadSpin, we support running tests from a bunch of different targets. Each target can spawn up to 200 threads or 200 agents – this is really an upper bound because it depends on the location of the different targets.
It depends on various factors. If the target is in a remote location that has a spotty Internet connection, for example, then maybe it won’t scale fully to 200 just because the connection is already dodgy. But basically, anywhere that we have PBoxes, and in more locations even, we can actually spawn a target. The way you scale the load test from a HeadSpin perspective is you specify multiple targets, then you set each target to run so many agents, and then you basically multiply the two and that’s the expected throughput. We also support running the tests on a regular basis – scheduling them – so you can set a Crontab-type format if you want to schedule it continuously. Of course, you can manually start the test too – that’s what we’ll do for the demo. We’re working on integrating this with our performance monitoring UI.
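As a quick illustration of that multiplication, here is a minimal sketch. The helper function and the target labels are hypothetical; only the ceiling of 200 agents per target comes from the talk, and real targets may not reach it.

```python
# Hypothetical helper: nominal number of concurrent agents in a load test.
MAX_AGENTS_PER_TARGET = 200  # upper bound per target mentioned in the talk


def expected_agents(targets, agents_per_target):
    """Return (number of targets) x (agents per target)."""
    if not 0 < agents_per_target <= MAX_AGENTS_PER_TARGET:
        raise ValueError("agents_per_target must be between 1 and 200")
    return len(targets) * agents_per_target


# Made-up target labels, purely for illustration.
targets = ["target-a", "target-b", "target-c"]
print(expected_agents(targets, 150))  # 450
```

The result is a nominal figure; as noted above, a target on a weak connection may deliver fewer agents than requested.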
So, to actually run a mobile application load test with HeadSpin, the biggest step is to actually create a playbook. The playbook, at HeadSpin, is a YAML file – it supports the YAML format – and it has different parameters that let you specify the targets, the agent count, the duration, even the Crontab format, the Crontab schedule, the performance task ID, and some other factors like that. We’ll discuss the playbook in more detail in a moment. Basically you upload this playbook and then you execute the test. There’s a UI to do that, and of course at HeadSpin, we can also help with that process as it goes along.
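To make the list of parameters concrete, here is a rough sketch of a playbook expressed as a Python dictionary. The real playbook is a YAML file whose exact schema is not shown in this transcript, so every key name below is an assumption, as is the choice of which keys count as required.

```python
# All key names are hypothetical stand-ins for the real YAML fields.
playbook = {
    "targets": ["target-a", "target-b"],  # where agents are spawned (made-up labels)
    "agent_count": 50,                    # agents per target
    "duration": 300,                      # seconds at full load
    "schedule": "0 * * * *",              # crontab-style: minute 0 of every hour
    "perf_task_id": None,                 # optional link to performance monitoring
}

# Assumed for illustration: a test cannot run without these three.
REQUIRED_KEYS = ("targets", "agent_count", "duration")


def missing_keys(pb):
    """Return the required keys that are absent from a playbook-like dict."""
    return [key for key in REQUIRED_KEYS if key not in pb]


print(missing_keys(playbook))                   # []
print(missing_keys({"targets": ["target-a"]}))  # ['agent_count', 'duration']
```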
A sample playbook looks like this. As I was saying before, you can notice here we have this duration field, and we have these ramp-up and ramp-down fields. The ramp up indicates how long it takes to ramp up the test case to the full agent count, and the ramp down is how long it takes to ramp down from the full agent count back to zero. These are approximations (especially ramp down) because if you have an agent that’s misbehaving or maybe your server is stalling for a long time, there are various timeouts that can take effect. So sometimes the ramp-down time is actually a bit longer.
The duration, though, is the actual length of what we call the saturation region, where the full load is applied: the maximum number of agents, give or take various oddities, runs for the full time given in the duration. You can also specify the frequency here, and you can specify the perf task ID, which is for integrating this with our performance monitoring UI (we’ll discuss that a little bit in the demo). And then you specify a list of targets.
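The three phases described above (ramp up, saturation, ramp down) can be sketched as a simple function of elapsed time. This assumes the ramps are linear, which the talk does not state, and it ignores the timeouts that can stretch the ramp down in practice. The function and parameter names are illustrative only.

```python
def agents_at(t, ramp_up, duration, ramp_down, max_agents):
    """Nominal agent count t seconds into a test.

    Assumes a linear ramp up, a flat saturation region lasting
    `duration` seconds, and a linear ramp down.
    """
    if t < 0:
        return 0
    if t < ramp_up:
        return round(max_agents * t / ramp_up)
    if t < ramp_up + duration:
        return max_agents
    end = ramp_up + duration + ramp_down
    if t < end:
        return round(max_agents * (end - t) / ramp_down)
    return 0


# 30 s ramp up, 300 s of saturation, 30 s ramp down, 100 agents at full load.
print(agents_at(15, 30, 300, 30, 100))   # 50  (halfway up the ramp)
print(agents_at(100, 30, 300, 30, 100))  # 100 (saturation region)
print(agents_at(345, 30, 300, 30, 100))  # 50  (halfway down the ramp)
```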
Then, on the other side, you have the agent specs, and this is the meat of the playbook. This is the part of the playbook that indicates what to actually do, what scripts to actually run. The contents of the agent specs are a list – this is YAML syntax for a list. It’s actually a list of scripts that you can run. Each script has a name – that’s what this “About” field is. It has a relative weight and then it has the script itself, which has the different commands to run. The relative weight works like this: if one script has a relative weight of two and another has a relative weight of one, the one with a weight of two is going to run twice as often.
Then we have the actual script itself, and then we have different commands (we’ll talk about that in a moment). These slides summarize what I talked about there. I’ll note in passing that in the sample playbook, we really only show HTTP requests, but we also support WebSockets. So, you can imagine, instead of making HTTP requests, you connect to a WebSocket, and then while the WebSocket is open throughout the duration, you can receive and then send data, which is an asynchronous way of thinking about it. When you receive a WebSocket message, you can send a message in response to it – we support that as well.
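One way to picture relative weights is weighted random selection: each time an agent needs a script, it picks one with probability proportional to its weight. Whether HeadSpin selects scripts randomly or in some fixed rotation is not stated in the talk, so treat this as an assumption; the script names are invented.

```python
import random

# Hypothetical agent specs: names and weights are invented for illustration.
agent_specs = [
    {"name": "browse_home", "weight": 2},
    {"name": "call_api", "weight": 1},
]


def pick_script(specs, rng=random):
    """Choose one script with probability proportional to its relative weight."""
    weights = [spec["weight"] for spec in specs]
    return rng.choices(specs, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the run is repeatable
counts = {spec["name"]: 0 for spec in agent_specs}
for _ in range(30_000):
    counts[pick_script(agent_specs, rng)["name"]] += 1

# browse_home should come out roughly twice as often as call_api.
print(counts)
```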
We also support match expressions. In the command above, or in the sample playbook, there’s this match field. This generates a special event if you match whatever content is here. You can envision: a basic example of match expression would be a Regex (or a regular expression), right? Something like this where you’re just matching some Regex and some data and if it matches, you generate an event, right?
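A regular-expression match that produces an event could look roughly like this. The event dictionary and the function name are made up for illustration; the talk does not describe what a HeadSpin match event actually contains.

```python
import re


def match_event(pattern, body):
    """Return an event dict if `pattern` is found in `body`, else None."""
    found = re.search(pattern, body)
    if found is None:
        return None
    return {"event": "match", "pattern": pattern, "matched": found.group(0)}


print(match_event(r"status: (ok|ready)", "server status: ok"))
print(match_event(r"status: (ok|ready)", "server status: down"))  # None
```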
We support regular expressions, but we also support JMESPath, which is a JSON selector syntax. Our observation is that a lot of APIs have a JSON-based format. So, it’s sometimes easier to reason about using a JMESPath-type selector.
An example of this [JMESPath-type] selector: “key.nested” would be the selector, and that would select “thing” from this sample JSON here in data. The full specification is here at this URL for more detail.
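To show what that selector does, here is a small standard-library sketch that handles only the simplest case, a chain of dotted keys. The shape of the sample JSON is assumed from the description above. For the full selector language, the `jmespath` package on PyPI provides `jmespath.search(expression, data)`.

```python
import json


def select(path, data):
    """Follow a dotted path such as "key.nested" through nested dicts.

    Covers only chained identifiers, the simplest selector form.
    Returns None when any step is missing.
    """
    current = data
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


# Assumed shape of the sample JSON on the slide.
data = json.loads('{"key": {"nested": "thing"}}')
print(select("key.nested", data))   # thing
print(select("key.missing", data))  # None
```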
We support http requests. So, you can generate a match expression based on this. [Let’s say] you want to match some input – totally hypothetical URL – you want to match this expression and then you can also specify “should_match” which means that in addition to selecting whatever it’s supposed to, you want it to actually match whatever this value is directly, so we support that. All right, so let’s go into the demo.
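The `should_match` idea described above can be sketched as a second step after selection: first pull a value out of the response, then compare it with an expected value. The exact semantics in HeadSpin playbooks are not spelled out in the talk, so the behavior below (and the made-up response body) is an assumption.

```python
import json


def evaluate(body, selector, should_match=None):
    """Select a value from a JSON body and optionally compare it.

    Returns (selected_value, passed). Without `should_match`,
    selecting any non-null value counts as a pass.
    """
    value = json.loads(body)
    for part in selector.split("."):
        value = value.get(part) if isinstance(value, dict) else None
    if should_match is None:
        return value, value is not None
    return value, value == should_match


body = '{"user": {"role": "admin"}}'  # made-up response body
print(evaluate(body, "user.role"))                        # ('admin', True)
print(evaluate(body, "user.role", should_match="guest"))  # ('admin', False)
```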
For the purposes of the demo, I wrote a very simple Flask application that was intentionally designed poorly. In this application, we have a basic index.html – it’s a basic webpage. It has a basic image file – the image file is around 2 MB. Then we have this API call, which intentionally has this time.sleep.
In reality, you wouldn’t have this time.sleep, right? Most servers would actually be doing some form of computation, but this is to simulate the case where the server is bogged down for some reason. We have a very simple test server. Let me spin it up over here.
This is extremely dumb, extremely simple, and it has a basic UI. You can see that the server is being hit when we make these requests, right? If you click, you have this delayed API call, so you can already see that when you start making multiple API calls, the performance lags a little bit.
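The demo server's source is not reproduced in this transcript. As a rough stand-in, the sketch below uses Python's standard-library `http.server` instead of Flask so that it runs without extra packages. The `/api` route name, the one-second delay constant, and the helper names are assumptions.

```python
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

API_DELAY_SECONDS = 1.0  # stands in for the time.sleep in the demo's API route


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/":
            self._send(200, "text/html", b"<h1>demo</h1>")
        elif self.path == "/api":  # hypothetical route name
            time.sleep(API_DELAY_SECONDS)  # simulate a bogged-down backend
            self._send(200, "application/json", json.dumps({"ok": True}).encode())
        else:
            self._send(404, "text/plain", b"not found")

    def _send(self, status, content_type, body):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the console quiet


def timed_get(url):
    """Return (status, seconds) for one GET request."""
    start = time.perf_counter()
    with urllib.request.urlopen(url) as response:
        response.read()
        status = response.status
    return status, time.perf_counter() - start


if __name__ == "__main__":
    # Port 0 asks the OS for any free port.
    server = ThreadingHTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_address[1]}"
    print(timed_get(base + "/"))     # returns quickly
    print(timed_get(base + "/api"))  # takes at least API_DELAY_SECONDS
    server.shutdown()
```

Timing a single request to the delayed route gives the baseline that the load test is later compared against.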
We can see, just ad hoc playing with this, that it’s kind of slow, that the server is bogged down in some sense. We can actually simulate this by running this on our device. So, just [to see how it pans out], let’s record a basic session that hits the same server on our mobile device.
You see that the phone renders the same thing, but the phone takes a little bit longer probably because it has a different form factor. This website is again very, very simple. The phone had to resize the image to even fit. You would have to zoom in a little bit to see, but it’s basically the same as you’d expect, behaves the same way as the desktop application. So, we’ll stop recording this session and we’ll view.
In this recorded session, we can see the metrics generated. We can see there was a slow download, which is probably the image file, and we had some duplicate messages. This is probably again the image file, because we refreshed and we didn’t properly cache. You can see that the server was slow and there are various other exceptions.
If you look right here, you can see the two button clicks, the two API calls – they took about a second and some change, which makes sense because we had a time.sleep call in our server, which was going to stall for a second.
Now, we want to see how this performs under the load. I’m going to add this to the user flow, just so we can keep track of the session for later. Now we want to test this server with the bigger load. I wrote this script right here – demo_ssl – that basically runs various commands that you’d expect. It runs, it connects to the server, it loads the dog image, and then we also hit this route (does not exist, just to see if it catches error codes correctly). Every so often we’ll also run the script – instead of running the basic usage we’ll do that inside our script.
I set these fields here so that it actually uploads this directly to the performance task. If you don’t set this field, we at HeadSpin can also upload the performance task – it’s not an issue – but this makes it easier for the short term.
If you go to the load test UI, you see this. After uploading our playbook, we can validate it and then we can just upload it directly. Now, let’s run it. You see over here that we have one IP, the current target was in Mountain View, USA. You can see that this over here is toggling. It says it’s generating a lot of events. If we scroll down here, we can see the events generating.
As our application scales upward, you can see that our mean http send time is actually increasing from our load. This is expected because again, our server is not very optimized. I think the playbook right now is going to run for five minutes. While we’re running this load test, we’re also going to run the session in this app again. [Let’s start a different tab now.]
Yeah, same device, same everything. We’ll go here and now we’ll visit the same application we did before. [*long pause*] Now we can see that it’s a little bit slower than it was last time, which is sort of what we expect. We’ll click this twice like we did before. The clicking is seriously lagged. You can tell that, again, it’s behaving really slowly. Then, we’ll stop recording this session and we’ll view it.
And this time we have the same problems from before, but this time our wait massively, massively increased. Now it’s six and a half seconds versus the one, maybe two, seconds that it was before. We can see that this correlates – that the server loading slower is affecting the user experience because now this request is taking a full six seconds to respond and we can see based on the slow time here that it takes longer to actually render the page.
I’m going to add this again to the same user flow we did before for comparison. Now, let’s go view it. From all the issues that we had, there was a slight increase in time, but if we look at specifically the wait time, the average wait, we can see that there is a massive difference. It’s more than twice what it was before. [Demo.]
The other thing we can do is we can view previous sessions. I’ll demo that right now. You can see all the data from before, you can see that our time spiked right here, you can see that you’d see all these different stats, and then you can actually replay it. If you click this times 10, it’ll actually play it in real time, like so. Of course, if you want to just cut it out earlier, like we will here, [that’s an option].
[One-minute pause to set-up demo.]
We’ll let this load test continue running. There it goes – it just took a little time to load in. If you notice, the first time we ran this, this is our first device – this Nexus 6P device – and we ran it in Mountain View. We’ll do the average wait times, because that’s the one that we noticed when we were doing the session UI, we saw the difference. This is the test here before we ran the load tests, and this is afterwards, but we also collect information about the load test itself so that we can actually reference this time.
It’s a special load test device: it’s a placeholder, not a real device. The load test started around here. And when this other load test that we’re running in the background completes, it’s going to add another dot to the test here. So when you look at this device and look at any of these other metrics (again, we’ll choose the average wait time because that’s what we care about), you’ll see that after the load test had started, the performance was a lot worse. And so here at HeadSpin, that’s why we are interested in load testing: we can relate stressing the server with this load to the actual user experience.
When we run a second load test, more points would appear from each load test. Our current load test is running, and in about four minutes, there would be another dot from our second load test. Thank you.