Why Holiday Site Crashes Are Still Happening: Lessons Learned from Black Friday 2018

Jan 09, 2019


It’s the start of a new year and another holiday shopping season is officially in the books. Online retailers overall can once again be commended for delivering strong web performance (speed, reliability) under crushing traffic loads – a real engineering challenge, make no mistake about it.

However, as has become customary in recent years, we did see a few high-profile outages occurring on Black Friday. In a world of instant cloud scalability and capacity on demand, this has us asking, why do these outages keep happening, and what can be done about them as we look ahead? Here are some clues:

Traffic Volumes are Exploding

Online traffic volumes during the Black Friday-Cyber Monday holiday weekend just keep on growing, at staggering rates. While visits by shoppers to physical stores were down by 1.7% compared to Black Friday 2017, online sales hit a new record of $6.22 billion, up 23.6% from a year ago, according to Adobe Analytics.

All major retailers know they’re going to be hit with massive traffic on Black Friday – so what explains the outages? The answer may lie (in part) in retailers’ desire to maximize Black Friday profitability. For profit-challenged retailers, IT is often a significant cost center, and they may be applying a lowest common denominator (LCD) approach. They ask themselves, based on historical traffic trends, what is the least infrastructure that can deliver the uptime guarantees we need? Or put another way, what is the most downtime we can afford?

There is danger here because you can’t predict what the traffic volumes will be, and as traffic volumes increase, so does the potential for crash-related profit losses. Retailers are constantly struggling to strike the right balance that translates to profitability, but the unpredictable traffic surges of recent years may be throwing some off-kilter.
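The trade-off described above can be made concrete with a little arithmetic. The following is a minimal sketch, not a pricing model: the function names, the option fields (`infra_cost`, `outage_minutes`, `outage_probability`) and any figures passed in are hypothetical and would need to come from a retailer's own data.

```python
def expected_outage_loss(revenue_per_minute, outage_minutes, outage_probability):
    """Expected revenue lost to an outage during the peak window."""
    return revenue_per_minute * outage_minutes * outage_probability


def cheapest_option(options, revenue_per_minute):
    """Pick the option with the lowest infrastructure cost plus expected outage loss.

    Each option is a dict with hypothetical keys:
    name, infra_cost, outage_minutes, outage_probability.
    """
    def total_cost(option):
        loss = expected_outage_loss(
            revenue_per_minute,
            option["outage_minutes"],
            option["outage_probability"],
        )
        return option["infra_cost"] + loss

    return min(options, key=total_cost)
```

The point of the sketch is that the cheapest infrastructure is not necessarily the cheapest option overall: as revenue per minute rises on peak days, the expected loss term can dominate the infrastructure savings.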

It seems the right solution would be automatic, pay-as-you-go, elastic, horizontal scalability achieved through the cloud.  But isn’t that what all cloud users have? Not necessarily.

Cloud Can Create a False Sense of Security

Somewhere in the midst of the cloud fervor of recent years we’ve been led to believe the cloud is a panacea – limitless resources, on-demand scalability for all, and spotless performance. But this is not always the case, especially during peak traffic periods like the holidays.

Early in the day on Black Friday, site traffic to a group of retail sites was analyzed and estimated to be nearly double that of a normal week.  Imagine you run an 8-minute mile, and at a second’s notice, you need to adjust your pace to a 4-minute mile. That’s exactly the challenge these online retailers face.

Traffic spikes occurring on peak days are a double-whammy of both sudden and huge. This is quite a departure from the normal process of cloud users getting a notification that traffic volumes are slowly creeping up, and it might be time to spin up another instance. On peak days, site administrators often don’t have the luxury of advance warning, and that’s a problem. 
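To illustrate why a sudden doubling of traffic is harder than a slow creep, here is a rough sketch of the capacity arithmetic. The function names, the per-instance capacity figure and the 25% headroom default are assumptions chosen for illustration, not values from any particular cloud provider.

```python
import math


def instances_needed(requests_per_second, capacity_per_instance, headroom=0.25):
    """Instances required to serve the load with a safety margin on top."""
    target = requests_per_second * (1.0 + headroom)
    return max(1, math.ceil(target / capacity_per_instance))


def unserved_requests_per_second(requests_per_second, running_instances, capacity_per_instance):
    """Load that cannot be served until additional instances finish starting."""
    return max(0, requests_per_second - running_instances * capacity_per_instance)
```

Under these assumptions, a site running 10 instances at 100 requests per second each that jumps from 1,000 to 2,000 requests per second leaves 1,000 requests per second unserved until new instances are up. That gap is why scaling out ahead of a known peak, rather than reacting to it, is often the safer choice.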

Cloud users also need to consider pooled versus dedicated cloud resources. Pooled resources— where an online retailer shares a resource with other companies who may be coming under heavier load themselves—are more prone to being tapped out than dedicated resources. In many cases, dedicated resources may be the better option, though they come with a higher price tag.

The bottom line is that instant scalability achieved through the cloud is not as automatic, easy and guaranteed as we are often led to believe, and you need to look closely at the fine print and ensure your cloud resource is optimally configured to deliver on your needs.

Third-Party Services Continue to Creak Under Load

Third-party services are external components incorporated into modern sites, but originating from beyond one’s own firewall. Examples include marketing and analytics tags, social plug-ins, site search tools, photo display services, and more.

Third-party services can add rich features and functionality to a site, but when they misbehave they can create major problems for all the sites incorporating them. During peak traffic periods, it is not uncommon for these services (supporting multiple retailers under heavy load) to falter. And when a third-party service slows down or crashes, it can produce a cascading effect across all the websites it supports, causing them all to slow way down or become unavailable altogether.

We’ve seen this occur in past holiday seasons, and through our monitoring, we saw it happen yet again this year. The takeaway is to be selective about third-party services.  Keep the ones you really need but avoid the others, as each one introduces risk. Include them in your load tests, monitor them closely and in real-time and have contingency plans in place so you can quickly and easily remove any offenders that start causing problems.
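One common way to implement such a contingency plan is a circuit breaker that stops calling a misbehaving third-party service and serves a fallback instead. The sketch below is an illustration only: the class name `ThirdPartyBreaker`, its thresholds and the `fetch`/`fallback` callables are hypothetical, and a production site would more likely rely on an established library or its tag manager.

```python
import time


class ThirdPartyBreaker:
    """Skips a failing third-party call for a cooldown period after repeated errors."""

    def __init__(self, max_failures=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # time the breaker opened, or None while closed

    def call(self, fetch, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Breaker is open: do not touch the third party at all.
                return fallback()
            # Cooldown elapsed: allow one trial call through.
            self.opened_at = None
        try:
            result = fetch()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

Here `fetch` would wrap the third-party request (ideally with its own timeout) and `fallback` would return something harmless, such as an empty widget, so the rest of the page still renders.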

Inconsistent Load Testing Across Distributed Applications

In addition to third-party services, it is also important to include all application components in load tests. The desire to run end-to-end applications in the most efficient manner—leveraging economies of scale when possible—has resulted in applications increasingly being broken apart and spread across multiple platforms, clouds, and even virtualized containers.

Consider, for example, a transactional e-commerce application: an organization may choose to run the front end of this application in the cloud, but keep the mission-critical, highly data-sensitive transaction processing component on an on-premises mainframe.

This approach has many benefits, but one challenge is that as different developers assume responsibility for different application components, they often prioritize load and overall quality testing inconsistently. This can result in one application component delivering exceptional performance under load, while another falters. All the end user sees is a broken application.
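One way to apply a consistent standard is to hold every component to the same latency budget in the same load test. The following is a minimal sketch under stated assumptions: the function names, the choice of the 95th percentile and the sample data layout (a dict mapping component name to latency samples in milliseconds) are illustrative, not a prescribed methodology.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (len(ordered) * pct + 99) // 100  # ceiling of n * pct / 100
    return ordered[rank - 1]


def components_over_budget(latencies_by_component, budget_ms, pct=95):
    """Return {component: observed percentile latency} for components exceeding the budget."""
    over = {}
    for name, samples in latencies_by_component.items():
        observed = percentile(samples, pct)
        if observed > budget_ms:
            over[name] = observed
    return over
```

Run against results from a single end-to-end load test, a report like this makes it visible when, for instance, the cloud front end meets its budget while the back-end transaction component does not.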

Monitoring Perspectives Are Limited

In recent years we’ve seen a rise in micro-outages, in which a website or app goes down only in a specific geography, for site visitors served by one ISP, or for some other limited group. These outages or slowdowns don’t make the news, but our tracking shows they are among the most frequent forms of downtime and were once again a common occurrence during the 2018 online holiday shopping season.

We believe micro-outages are on the rise because organizations are monitoring site performance from too few vantage points. Synthetic monitoring, which simulates “dummy” traffic and pings websites and applications at regular intervals to check speed and reliability, is the way to maintain full visibility. One form of synthetic monitoring, cloud-only monitoring, has been receiving a lot of attention lately, and like many things cloud-related, many erroneously believe it is the answer to all their monitoring challenges.

Cloud monitoring, however, only measures the speed of packet round-trips from the cloud to a website or app, and back to the cloud. It can provide a useful, supplementary perspective, but in many instances, it skips over all the performance-impacting elements (ISPs, CDNs, user browsers and more) that real end users encounter. So, on its own, cloud monitoring leaves many performance blind spots.

So as you deploy synthetic monitoring, do so from as many vantage points as possible, including backbone, broadband, ISP, last mile, and wireless network locations that are closest (geographically) to your key end-user segments. This is the only way to have comprehensive visibility into all the various components in the “internet wild” that can ultimately degrade end-user web performance.
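As a simple illustration of why multiple vantage points matter, the sketch below classifies the latest round of synthetic checks. The function name, the status labels and the vantage point names in the usage note are hypothetical; real monitoring tools expose far richer data than a pass/fail flag.

```python
def summarize_checks(results):
    """Classify one round of synthetic checks.

    results maps a vantage point name to True if its latest check passed.
    Returns (status, sorted list of failing vantage points).
    """
    failing = sorted(name for name, ok in results.items() if not ok)
    if not failing:
        status = "healthy"
    elif len(failing) == len(results):
        status = "full outage"
    else:
        status = "micro-outage"
    return status, failing
```

With checks from, say, `cloud-us-east`, `last-mile-chicago` and `wireless-dallas`, a failure at only the last-mile location is reported as a micro-outage. A cloud-only setup would have a single entry and would report the same situation as healthy.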

Conclusion

Given the increasing number of variables and challenges online retailers face during peak traffic periods, the fact that we don’t see more outages is actually remarkable. Furthermore, most online retailers are able to recover quickly, which is a testament to IT teams’ triage and remediation capabilities.

But for those online retailers that did experience problems, the pain is likely still fresh in their minds. As they begin planning for next year’s holiday season, it would be a mistake not to take advantage of the valuable lessons afforded by the current season. More holiday shopping is going online, which is a good thing – but it’s important not to become a victim of one’s own success.

