Over the past few years I’ve become increasingly involved in the UK testing community, which has been a real eye-opener. I’ve met loads of interesting people and had lots of really helpful conversations. One thing has been pretty obvious from the start though – while the people I meet work in all sorts of interesting spaces, not many work on products similar to the ones I test. It seems that the majority of testers out there these days are working on web applications of one sort or another.
So, I thought it might be interesting to share a few of the differences that telecoms testing involves. I won’t try to cover them all now, but let’s start with a significant one: “Five Nines”.
The term “five nines” may be familiar to some; for those who haven’t come across it, we’re talking about reliability and the need to attain 99.999% uptime. The need for seriously high reliability is one of the key things that makes telecoms testing a bit different from some other technology spaces. Don’t get me wrong; I’m not suggesting that no-one else cares about reliability, or that downtime on websites isn’t a big deal, but it is right at the top of my priority list. Let’s face it, if an application on your PC crashes, you might be annoyed (particularly if you’ve lost work) but you’re probably not that surprised. If a website you try to use is unavailable for a few minutes, you probably wouldn’t give it much further thought. But when was the last time you picked up your land-line phone and didn’t get dial tone? It can happen, but it’s a pretty rare occurrence.
Some of the products I test deliver phone service for 250,000 subscribers. If one of these things goes down, you might have a whole town or more of people who can’t make phone calls any more. What’s more, there are occasions when being able to make a call is literally a matter of life and death. If you need to dial 999, you’d be more than a bit cross if the phone service wasn’t working because of a software bug.
So what does five nines reliability mean in practice? There are 525,600 minutes in a non-leap year. 0.001% of this is 5.256, which means that we need to limit downtime to no more than about five minutes per year. That’s a pretty big ask, especially when you consider that problems could be the result of either hardware or software failures.
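To make the arithmetic concrete, here’s a quick sketch of how the downtime budget shrinks as you add nines. The helper function and its name are just illustrative, not anything from a real tool:

```python
# Allowed annual downtime for a given number of "nines" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def allowed_downtime_minutes(nines: int) -> float:
    """Maximum downtime (minutes per year) for N-nines availability."""
    unavailability = 10 ** -nines   # e.g. 5 nines -> 0.00001 (0.001%)
    return MINUTES_PER_YEAR * unavailability

for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
# Five nines works out to roughly 5.26 minutes per year;
# three nines, by contrast, allows well over 500 minutes.
```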
There’s a whole set of measures we take to ensure that these reliability figures are met, and I won’t go into them all now. On the hardware side, everything is redundant – there are no single points of failure. The software runs with redundancy too – we typically have two or more instances of the software running, with important state being continuously replicated between them.
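The active/standby idea can be sketched in a few lines. This is a deliberately simplified toy – all the class and method names are my own inventions, and a real telecoms system replicates call state over dedicated links with much stricter timing and consistency guarantees:

```python
# Toy sketch of active/standby redundancy with synchronous state replication.
class Instance:
    """One running copy of the software, holding replicated state."""
    def __init__(self, name: str):
        self.name = name
        self.state: dict = {}  # e.g. records of calls in progress

class RedundantPair:
    """An active instance plus a standby that mirrors its state."""
    def __init__(self):
        self.active = Instance("active")
        self.standby = Instance("standby")

    def update_state(self, key, value):
        # Write to the active instance, then replicate to the standby
        # immediately, so a failover doesn't lose in-progress calls.
        self.active.state[key] = value
        self.standby.state[key] = value

    def failover(self):
        # Promote the standby; its replicated state lets service continue.
        self.active, self.standby = self.standby, self.active
        return self.active.state

pair = RedundantPair()
pair.update_state("call-1", "connected")
surviving_state = pair.failover()
print(surviving_state["call-1"])  # the call record survives the switchover
```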
On the testing side, we devote a lot of time and effort to testing the resilience and reliability of the system under a whole range of conditions. These range from normal maximum-load conditions through to the worst-case scenarios we can dream up. We also need to make sure that any potential outage issues seen during testing are carefully investigated. Waiting to see if it happens again just isn’t good enough. This might be the one chance we have to debug and fix it before it happens in production.
I think this focus on reliability is a double-edged sword for a tester. On the one hand it adds a lot of pressure, and the testing can be quite painful at times. On the other hand, we get to dream up interesting and evil testing scenarios and then go hunting! As a tester, what could be more fun than that?