Many of you would have heard of the recent ClickFrenzy event, dubbed a 1-day Australia-wide internet marketing sale.
Many of you would also have heard of the widely reported fact that the website went down within minutes of the start of the sale.
It also remained down for nearly all of the Tuesday evening (the sale started at 7pm on Tuesday 20th November, 2012).
While the PR spin tried to make the most of the “large” volumes of traffic, the fact remains that the volumes were tiny compared to traffic generated by similar overseas events. What follows is some simple analysis of the event and the mistakes made.
Of course, one wouldn’t think for a moment that they had deliberately allowed this to happen to create a “shortage” in consumer minds, or extended press coverage! Assuming that it wasn’t deliberate, the incident is interesting because we rarely think about large traffic volumes in Australia, yet they need to be planned for. Most website visitors, when not caught up in the heat of a heavily advertised “sale”, will quickly move on to the next website if yours is slow or unavailable. And if your site is down while you have TV coverage or an event running, you have essentially thrown away a huge opportunity. With that in mind, let’s look at an overview of some of the reported causes of the failure.
The first cause was that Clickfrenzy used Magento for their site. Magento is truly a wonderful open source eCommerce platform (arguably the industry leader), designed for selling large numbers of products in a powerful, usable and highly configurable environment. However, it needs careful optimization to run fast, and it’s pretty well known in the industry for being a “resource hog”, which in practice means slow!
That the site had not been security hardened was evident from the fact that the site’s internal database passwords were stolen from it within hours of it opening; and one might reasonably surmise from the fact it crashed under pressure that it had not been optimized for performance. This is interesting as the site was supposedly built by one of Australia’s top Magento companies, who I understand have a deservedly good reputation. Possibly the job was done at the last minute, or wasn’t funded sufficiently; one can only guess.
Also interesting is the fact that the Clickfrenzy site itself wasn’t selling products at all, so they could have used another platform just as easily. In fact, given the millions of visitors expected in a very short timeframe, they might have been better off converting the site into straight HTML (even if they had a Magento backend). Static HTML needs no server-side processing per request and caches very well using strategies we’ll talk about below. While Magento is a great tool when it is fit for purpose, it’s simply not the best choice for a high traffic site that isn’t selling goods.
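To make the “straight HTML” idea concrete, here is a minimal sketch in Python of rendering pages to static files once, ahead of the event, so that the web server only has to hand out files. The template, the field names (retailer, offer) and the function name are all made up for illustration; nothing here reflects how the Clickfrenzy site was actually built.

```python
import html
from pathlib import Path
from string import Template

# Hypothetical page template; a real site would have far richer markup.
PAGE_TEMPLATE = Template(
    "<html><body><h1>$retailer</h1><p>$offer</p></body></html>"
)


def export_static_pages(deals, output_dir):
    """Render each deal to its own .html file and return the paths written.

    `deals` maps a URL slug to a dict with 'retailer' and 'offer' keys
    (assumed field names, for illustration only).
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for slug, deal in deals.items():
        # Escape values so text like "Acme & Co" is valid HTML.
        safe = {key: html.escape(value) for key, value in deal.items()}
        path = out / (slug + ".html")
        path.write_text(PAGE_TEMPLATE.substitute(safe), encoding="utf-8")
        written.append(path)
    return written
```

You would run something like this whenever the underlying data changes, then point the web server (or a CDN) at the output directory. The expensive work happens once per change rather than once per visitor.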
What’s a CDN?
A CDN, or “Content Delivery Network” (sometimes called a “Content Distribution Network”), is a distributed caching network used by sites that get large amounts of highly variable traffic, like a one-day sale. In effect, a CDN is a huge distributed cache, with hundreds or even thousands of cache “nodes” spread across the areas it serves, and with these nodes heavily optimized towards providing simple files at very high speeds. A CDN is built to cope with very, very large amounts of traffic and effectively offloads nearly all processing from the “real” site, apart from checking back periodically to see if the pages it is “caching” have changed.
CDNs are widely used all over the world to help newspapers and other heavily visited sites cope with large traffic bursts, and are a core technique for building in reliability (and even some security) when you have to deal with large amounts of bandwidth. In fact, a CDN can prevent nearly all traffic from hitting your actual server and can even keep cached pages available if your actual server crashes!
There are a number of CDNs in wide use. A few examples are Akamai, which is widely used and long established; Amazon’s CloudFront (my personal favourite); and Cloudflare, which is targeted at smaller sites and is my choice for them. Each of these systems has strengths and weaknesses, and some thought and experience is needed from your developers to get the best out of them.
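How long a CDN keeps a page before checking back is usually driven by HTTP caching headers sent by the origin server. As a rough sketch, the Python WSGI application below marks a page as cacheable by shared caches for ten minutes using the standard Cache-Control header. The page content and the 600-second lifetime are assumptions for illustration, and how any particular CDN treats these headers depends on how it is configured.

```python
# A minimal origin page that tells caches (including a CDN) they may
# keep and reuse the response for a fixed time.

PAGE = b"<html><body><h1>Sale starts at 7pm</h1></body></html>"
MAX_AGE_SECONDS = 600  # assumed lifetime: ten minutes


def app(environ, start_response):
    """WSGI application that serves one cacheable HTML page."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(PAGE))),
        # "public" allows shared caches to store the response;
        # "max-age" is how many seconds it may be reused.
        ("Cache-Control", "public, max-age=%d" % MAX_AGE_SECONDS),
    ]
    start_response("200 OK", headers)
    return [PAGE]
```

To try it locally, the standard library’s wsgiref.simple_server.make_server("", 8000, app) will serve it. With a header like this, a cache sitting in front of the site needs to fetch the page from the origin only about once per ten minutes per node, however many visitors arrive.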
There’s no doubt in my mind that the old-school techniques often bring the biggest bang where performance is required, as developers often lose touch with how much pressure their programming techniques can put on a server. As just one simple example, a home page can often require 300 SQL lookups to display various products. The results of these lookups rarely change, so they could very safely be done once every 10 minutes rather than on every hit. At even 100 hits per second, that is one set of lookups instead of 60,000, which cuts the database load for that page by several orders of magnitude.
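The “look it up every 10 minutes instead of every hit” idea can be sketched in a few lines of Python. The class and function names below are hypothetical, and load_home_page_products simply counts how many queries would have run rather than touching a real database.

```python
import time


class TTLCache:
    """Keep the result of an expensive loader for a fixed number of seconds."""

    def __init__(self, loader, ttl_seconds=600, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means nothing has been loaded yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: do the expensive work once.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value


query_count = 0


def load_home_page_products():
    """Hypothetical stand-in for the ~300 SQL lookups behind a home page."""
    global query_count
    query_count += 300
    return ["product-%d" % i for i in range(3)]


home_page_cache = TTLCache(load_home_page_products, ttl_seconds=600)

# Simulate 1,000 page hits arriving within the same 10-minute window.
for _ in range(1000):
    home_page_cache.get()

print(query_count)  # 300 lookups in total, not 300,000
```

This is a single-process sketch. On a real site with several web servers you would more likely keep the cached result in a shared store, but the principle is the same.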
Testing and checking
In IT practice, it’s a well-worn truth that you can never rely on theory. IT assets have a history of not performing under pressure as one would expect, sometimes because of the smallest error in configuration, which can be extremely hard to remedy if not caught beforehand.
This appeared to be the case on Tuesday night: scuttlebutt suggests that they started out using a CDN only for images, then switched to a CDN for pages as well later in the evening. The Clickfrenzy CEO later said that the site had been designed for a million hits a day, but that’s one twenty-fourth of what they should have been designing for. If I were designing for a million hits in an hour (not a day), I’d want every single possible part of my site to be coming from an internationally recognized CDN.
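The difference between a million hits a day and a million hits an hour is easy to put into numbers. The short Python sketch below works out the average request rate in each case. These are averages only, and the real peak at the start of a sale would be considerably higher.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def average_hits_per_second(total_hits, window_seconds):
    """Average request rate if the hits were spread evenly over the window."""
    return total_hits / window_seconds


per_day = average_hits_per_second(1_000_000, SECONDS_PER_DAY)
per_hour = average_hits_per_second(1_000_000, SECONDS_PER_HOUR)

print("Million hits in a day:   %.1f hits/second" % per_day)
print("Million hits in an hour: %.1f hits/second" % per_hour)
print("Ratio: %.0fx" % (per_hour / per_day))
```

That works out to roughly 12 hits per second versus roughly 278 hits per second, which is the factor of 24 mentioned above.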
There’s little doubt that this performance crisis could have been uncovered by thorough testing at an early stage. Unfortunately, it’s also true that such testing takes quite some time and costs money, and we have no idea what problems or issues came up during the development of the site. Nevertheless, the techniques discussed in this article are in very wide use overseas (e.g. for Cyber Monday or Black Friday in the US), and there are a lot of experts with years of experience in implementing them if one cares to look.
Finally, there are a number of solid providers offering very expandable technology platforms. Amazon is one we use and recommend, but there are many others. Amazon itself hosts some 40% of internet traffic in the US, as well as turning over an estimated $2 billion annually in its scalable hosting business. They are at least one player that knows their game well, but keeps a low profile.
As an example of what I mean, Amazon’s architecture allows you to add 20 or more servers at a button touch. If your application has been architected correctly, this makes expanding the site to cover a large event almost trivial. The hard part here is knowing enough to design the architecture appropriately so the expansion can be meaningfully used.
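Knowing how many servers to add is part of that design work. As a back-of-envelope sketch only, the Python function below estimates a server count from an expected request rate. The per-server capacity (50 requests per second) and the headroom factor (2x) are made-up numbers for illustration; real figures would have to come from load testing your own application.

```python
import math


def servers_needed(peak_hits_per_second, per_server_capacity, headroom=2.0):
    """Estimate how many identical servers are needed for a given load.

    headroom multiplies the expected load to leave a safety margin.
    """
    if per_server_capacity <= 0:
        raise ValueError("per_server_capacity must be positive")
    return math.ceil(peak_hits_per_second * headroom / per_server_capacity)


# A million hits in an hour averages about 277.78 hits per second.
# 50 hits/second per server is an assumed figure, not a measurement.
estimate = servers_needed(277.78, per_server_capacity=50, headroom=2.0)
print(estimate)
```

An estimate like this only holds if the application scales out cleanly, that is, if adding a server really does add capacity rather than just moving the bottleneck to a shared database.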
This article is intended to give a brief, high-level overview of the technologies available for supporting high performance sites.
If you have a small site, you might want to try Cloudflare, who make it very easy to get up and running. However, if you have a larger site that will be under significant pressure, don’t spare the expense of getting qualified architectural advice at the start of your project, and experienced support and review during it. Mistakes in site architecture can be hugely expensive: they hurt your site’s performance, and they cost a lot to fix after a system has gone live.
Sometimes you only have one chance to impress, as Clickfrenzy found out rather publicly last Tuesday!