When important becomes vital: Lessons from California’s biggest wildfire of 2019

HeadSpin tested how the utility’s outage-tracking mobile web site held up under an exponential surge of anxious visitors from around the area, the country, and the world.

Even in the best of times, visitors are notoriously impatient waiting for mobile web sites and apps to load. During an emergency, patience is understandably even thinner. Every moment waiting for vital information – service status, storm updates, rescue efforts – can seem agonizing. 

Millions of people in northern California depend on Pacific Gas and Electric (PG&E) for energy and for timely information on outages. Power cut-offs are triggered both as a controversial prevention measure and during actual wildfires, making real-time visibility into service availability a life-and-death issue for residents, hospitals, and responders. 

So when a major blaze broke out in Sonoma County on October 23, HeadSpin wanted to check whether mandatory blackouts had affected our company offices in Silicon Valley, about 75 miles south of the fires. We couldn’t get information from the public utility’s outage-tracking mobile website, so we decided to run tests. Specifically, we wanted to measure how the site was handling an exponential surge in anxious visitors from around the area, country, and world. What we found is instructive for any organization offering crucial public information via a mobile web site or app. 

Exclusive results and suggested fixes are presented below. The goal here is not to pile onto PG&E, but to share a dramatic example of how proactively finding and fixing mobile technology performance issues can help avoid serious potential consequences during an emergency.

The Test

HeadSpin decided to run simple performance and availability tests of the PG&E Outage Center. We chose Sunday, Oct. 27, four days after the fire started and with the blaze only 5% contained. Specifically, we would check service for 3200 Ash St., Palo Alto, CA 94306 (our corporate HQ). 

We ran three tests: 

  1. At 9:33 a.m. PST from Mountain View, CA, over Wi-Fi on an Apple iPhone 6s running Firefox 19.1.

  2. At 9:35 a.m. PST from Miami, FL, on the Verizon network, using an Apple iPhone 8 and Chrome 77.0.3865.103. 

  3. At 4:33 p.m. PST, a follow-up to the Mountain View test using the same setup as before.

HeadSpin can test on 22,000+ device combinations; these are a good, typical selection. The HeadSpin test team generated performance sessions, captured API calls from each phone, then used our AI-driven system to analyze issues, pinpoint causes, and suggest remedies. 

(You can follow along with the tests from Mountain View and Miami. Real-time video at the top right of the screen shows you what the user saw, alongside interactive time-series, waterfall, and burst views of 14 different test categories. The dashboard offers high-level session-wide metrics, domain metrics, burst metrics, and host metrics. You can drill down into results, view packet captures, and download test logs. A quick tip for identifying trouble spots: look for peaks in “issue curves,” then click over and zoom down for more detail. Try it!)  

The Experience

It’s a Herculean task to safely provide electric service during high-wind conditions across hundreds of thousands of acres of drought-dry, rural California. Ensuring fast and reliable access to emergency information proved daunting in its own way. 

We’ll drill into the results in a second. But the short version is this: after firing up the browser and manually entering the URL, we waited more than a minute and seven seconds for the site to respond. 

Eventually, we do get to the Google Maps-based search page. All right. We enter our Palo Alto test address and hit Enter. And we wait. And wait. Then, after 59.16 seconds of waiting and watching a small, slender white rectangle flick back and forth, we see this: 

As a customer, developer, or executive, that’s a screen you never want to see. 

Especially after more than two minutes of waiting under what would be, for a “real” visitor, incredibly stressful circumstances. When a response finally comes, we’re thrown back to the search form to try the whole process over again. No thanks.

Identifying the Problem

Now, it’s possible that some PG&E customers and site visitors had better, faster, successful experiences than we did. It’s also possible some had even worse, slower encounters. 

While it’s easy to snipe, our aim was to identify and fix the root causes of poor performance. So we started scanning the HeadSpin test results. 

Here’s a high-level overview of the Mountain View test. (Miami was pretty much identical.)

In some test runs, it takes a bit to untangle what’s going on. Not here. Each of the areas measured points to a pretty similar story. Following the time-series from left to right across the graphs reveals the peaks and valleys that pinpoint trouble spots. Specifically:

“Issue cards” in the left-hand column show us the biggest impacts on user experience, starting with the most serious. HeadSpin’s AI engine pinpoints the problems. The culprits are starkly clear: loading animation, slow servers, and low page content. 

Highlights in yellow are the problem areas. The green bar further isolates the problem. There are those back-to-back one-minute loading times. They take up pretty much the entire session, and are almost surely the main cause of the delays. 

Notice the 502 error at the top right. Here’s the tell-tale message that lets us identify the main suspect: elasticbeanstalk. 

The following API call is slow or unresponsive, returning a 502 “Proxy Error”: http://outages-prod.elasticbeanstalk.com/cweb/outages/getOutagesRegions?regionType=city&expand=true&ts=1572194037 

We can drill down further. Copying the URL picks up this detail:

Changing the query parameters (regionType, expand, ts) produced no change in the response. That strongly suggests a backend bug: the service is not reading the query parameters correctly.
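
To make that check concrete, here’s a rough sketch in Python (our own illustration, not the HeadSpin tooling): re-issue the getOutagesRegions call with different parameter values and compare the response bodies. Identical bodies across different parameters point to the backend ignoring them. Only the regionType=city request above comes from our capture; the alternate values are hypothetical.

    # Probe whether the backend honors the query parameters: identical response
    # bodies for different parameter values suggest it does not.
    import requests

    BASE = "http://outages-prod.elasticbeanstalk.com/cweb/outages/getOutagesRegions"

    variants = [
        {"regionType": "city", "expand": "true", "ts": "1572194037"},    # from our capture
        {"regionType": "county", "expand": "false", "ts": "1572194999"}, # hypothetical values
    ]

    bodies = []
    for params in variants:
        resp = requests.get(BASE, params=params, timeout=30)
        print(params, "->", resp.status_code, len(resp.content), "bytes")
        bodies.append(resp.content)

    if len(set(bodies)) == 1:
        print("Same body for every parameter set; the backend appears to ignore the parameters.")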

The elasticbeanstalk hostname tells us the site is using AWS Elastic Beanstalk (https://aws.amazon.com/elasticbeanstalk/) to route requests to its API servers.

The logs also show exceptions: “memcached.OperationTimeoutException: Timeout waiting for value”. 

In short, we see again how the call took too long to fulfill. The database waits, waits, waits, then hangs. 
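
To picture that failure mode, here’s a small illustrative sketch (not PG&E’s code; the slow_backend_lookup() stand-in simply plays the role of the overloaded query). Without a deadline, every request hangs for as long as the backend takes; with one, the server can at least fail fast and fall back to the last known answer instead of leaving visitors staring at a spinner. The fuller fixes come in the next section.

    import time
    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    def slow_backend_lookup(address):
        """Stand-in for the overloaded outage query that takes far too long."""
        time.sleep(60)  # simulates the minute-long waits we observed
        return {"address": address, "outage": False}

    executor = ThreadPoolExecutor(max_workers=4)
    LAST_GOOD_RESULT = {"address": "unknown", "outage": None, "stale": True}

    def get_outage_status(address, deadline_s=5.0):
        """Answer within deadline_s seconds, serving stale data rather than hanging."""
        future = executor.submit(slow_backend_lookup, address)
        try:
            return future.result(timeout=deadline_s)
        except TimeoutError:
            # Fail fast: the worker keeps running in the background, but the
            # visitor gets an immediate (if stale) answer instead of a hang.
            return dict(LAST_GOOD_RESULT, note="backend timed out; showing last known status")

    print(get_outage_status("3200 Ash St, Palo Alto, CA 94306"))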

The Suggested Fix

So thanks to detailed test data, the problem and root cause seem pretty clear. 

Here’s what appears to have happened: A massive surge of people sought outage data during the raging fires and power shutdowns. This created paralyzing loads on server databases. The combination of too many requests and too-large results basically choked the system. 

In our case, the best the site could do was to bomb out, then offer to do the search again. I wonder how many visitors braved a second try… 

In fairness, it should be noted that PG&E normally doesn’t have many such incidents. But clearly, the site failed this disaster-driven stress test. To be clear: the results would not have been much, if at all, better with different devices. Our test gear was as fast as any customer’s – and faster than many. In fact, a cell phone user on LTE in a rural area would probably have had an even worse experience. No, unfortunately these appear undeniably to be systems issues.

So what’s the fix? Based on the test data and analysis, here is HeadSpin’s high-level recommendation: cache search results and narrow down the initial search scope. That will improve the scalability of the application, ensuring good performance even under heavy load.

More specifically, to scale the API for such situations, we recommend the following (a brief code sketch of the first two ideas appears after the list): 

  1. Using a cache to store results, on the assumption that the same request produces the same results. Serving repeat queries from the cache reduces load on the database and saves significant computation resources.
  2. Limiting search results with pagination and filters such as county, incident region, and map zoom level, so that each query returns fewer rows and the database is not overloaded.
  3. Using multiple read replicas so that database queries are not blocked by other ongoing queries.
  4. Conducting 24x7 automated load testing. 
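
To make the first two recommendations concrete, here’s a minimal cache-aside-plus-pagination sketch in Python. It is illustrative only: the local Redis instance, the parameter names, and the query_outages_db() helper are assumptions, not PG&E’s actual schema or code.

    import hashlib
    import json

    import redis  # assumes a Redis server is reachable at localhost:6379

    cache = redis.Redis(host="localhost", port=6379)
    CACHE_TTL_SECONDS = 60  # outage data changes quickly, so keep cached results short-lived

    def query_outages_db(region_type, region_id, limit, offset):
        """Stand-in for the real outage query; replace with the actual database call."""
        return [{"region_type": region_type, "region_id": region_id, "row": offset + i}
                for i in range(limit)]

    def get_outages(region_type, region_id, page=0, page_size=50):
        """Return one bounded page of outage records, serving repeat requests from cache."""
        key_src = json.dumps([region_type, region_id, page, page_size])
        key = "outages:" + hashlib.sha256(key_src.encode()).hexdigest()

        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)  # cache hit: identical request, no database work

        # Cache miss: hit the database, but only for one limited page of results.
        rows = query_outages_db(region_type, region_id,
                                limit=page_size, offset=page * page_size)
        cache.setex(key, CACHE_TTL_SECONDS, json.dumps(rows))
        return rows

The short TTL keeps outage data reasonably fresh while still absorbing the flood of identical searches that a surge creates. Read replicas (item 3) and round-the-clock load testing (item 4) are infrastructure and process choices rather than code, so they are not shown here.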

Postscript and Takeaways

Even in the best of times, visitors are notoriously impatient waiting for mobile web sites and apps to load. During an emergency, patience is, understandably, even thinner. Every moment waiting for vital information – service status, storm updates, rescue efforts – can seem agonizing.

The Kincade wildfire was California’s largest of 2019 (so far) and the biggest in Sonoma County history. By the time it was fully contained on Nov. 6, the blaze had burned 77,758 acres. The fire threatened 90,000 structures, destroyed 374, injured four responders, and forced the evacuation of more than 200,000 people. 

All told, six phased “de-energizations” (industry-speak for “blackouts”) affected nearly three million customers. The fire, and the difficulty of getting up-to-the-minute outage status, could fuel new criticism of the utility and pressure on its stock.

The devastating wildfires and electricity shutdowns bedeviling California offer many lessons. For any organization that depends on online presence, especially utilities, public safety and government, the biggest might be this: 

Make sure your mobile apps and mobile web site are in top shape. Being fast and available is crucial during normal times – and vital when calamity strikes.

Want to launch flawless products faster? Identify and fix bottlenecks? Get started with HeadSpin’s mobile experience platform today.