A degree of risk (part 2)

I was talking with a colleague a couple of weeks ago about risk in software development projects and I said, “So, the aim of a mature development process is to reduce the number of errors that happen after a code release as far as possible”. He said, “No, that’s not right – the aim is to optimise the number of errors. You can have too few errors”. I think I understand what he meant and, on reflection, I agree with him. This is my attempt to explain why.

Unless you’re a TV presenter, it’s quite easy to understand that risk is nearly always present but is mostly manageable. Often some residual risk is inescapable – no amount of effort will reduce it to zero. The question is, should we ever try? Should zero risk be the goal of any undertaking?

It depends, of course. You will struggle to make a business profit without taking risks. That’s accepted. But what if you are developing the control system for a passenger aircraft? What is the acceptable risk in that undertaking? One fault in your output could have fatal consequences. Should you pursue zero risk in that sort of project? Well, aircraft control systems are complex and there is a limit to how thoroughly they can be tested (although that limit is quite high). You can take action to ensure that many sources of error are eliminated systematically, but I don’t believe you can ever get to 100% certainty that an aircraft control system is safe. At some point you have to install your system on a plane and watch it take off.

The things you can do to make that experience less nerve-racking are ordinary bits of good engineering practice. Lessons learned from centuries of building stuff from wood, stone, iron, steel, concrete and glass can be translated into software projects fairly seamlessly: manage the project transparently, make processes repeatable, eliminate unnecessary manual tasks, identify valuable metrics, keep everything as simple as possible, and so on. The important point is to understand where the risk arises and mitigate it whenever it is sensible to do so.

And that’s the point I’ve been trying to get to. There are many sources of risk and not all are equal. If a risk can be mitigated cost-effectively then you should do it. The key phrase is “cost-effectively” and the trick is to get the cost-benefit sums right. This can lead you to some surprising results. The point my colleague was making was that if you have too few errors after a code release, you may be spending too much on mitigating your project risks. You could have delivered the release earlier or more cheaply. How do you compare the value of expediting the release with the downside of delivering buggy software?

The savings are relatively clear: if you deliver your software earlier or with fewer resources then there are direct and measurable cost savings. The other factor is opportunity cost – the sooner you deliver your software, the sooner it can start earning money for you or your customer. The revenue lost to a later delivery (or the downstream cost savings forgone) can be very significant and is certainly a factor in your calculations.

How do you measure the downside? This is less clear because the errors are not predictable. One disastrous error could wipe out all the supposed benefits of the entire project. You could analyse the maximum loss from a disastrous error then try to estimate how likely it is to occur. For a trading system this should be doable – what is the effect of mispricing a trade by an order of magnitude? The maximum loss could be terrifying, but you can take steps to reduce the likelihood of it happening: (i) release code frequently so each release is only an incremental change from the previous one, (ii) pilot your new version with a restricted set of expert users, (iii) build sanity checks into the software to raise an alert when an unlikely event happens.
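As an illustration of step (iii), here is a minimal sketch of a sanity check for the mispricing example. The function name, the reference price and the factor-of-ten threshold are all assumptions made for this sketch; a real trading system would choose its own reference and limits.

```python
def is_price_suspicious(price: float, reference_price: float,
                        max_ratio: float = 10.0) -> bool:
    """Return True if a price looks wrong by roughly an order of magnitude.

    Hypothetical check: 'reference_price' could be the last traded price or a
    model price, and 'max_ratio' is an assumed threshold, not a standard.
    """
    # Non-positive prices cannot be compared as a ratio, so flag them.
    if price <= 0 or reference_price <= 0:
        return True
    ratio = price / reference_price
    # Flag prices that are too high or too low relative to the reference.
    return ratio >= max_ratio or ratio <= 1.0 / max_ratio


if __name__ == "__main__":
    for candidate in (101.5, 1000.0, 9.0):
        if is_price_suspicious(candidate, reference_price=100.0):
            print(f"ALERT: price {candidate} looks wrong against reference 100.0")
```

The check does not block the trade; it only raises an alert so that a person can decide, which keeps the cost of the mitigation small.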

I guess you are comparing the direct savings of your aggressive delivery schedule with the premium you would pay to insure yourself against the loss due to delivering bad software. The insurance premium will be lower if you can make the event less likely. You should take the risk if the savings exceed the premium. That’s about as clear as I can make it.
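The comparison can be sketched in a few lines. This assumes the insurance premium is approximated by the expected loss (maximum loss multiplied by the estimated probability); a real insurer would charge more than that, and every figure below is invented purely for illustration.

```python
def expected_loss(max_loss: float, probability: float) -> float:
    """Approximate the 'insurance premium' as maximum loss times probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return max_loss * probability


def should_take_risk(direct_savings: float, max_loss: float,
                     probability: float) -> bool:
    """Take the risk only if the savings exceed the approximate premium."""
    return direct_savings > expected_loss(max_loss, probability)


if __name__ == "__main__":
    savings = 200_000        # assumed saving from the aggressive schedule
    worst_case = 10_000_000  # assumed maximum loss from a disastrous error

    # Before mitigation: assume a 5% chance of the disastrous error.
    print(should_take_risk(savings, worst_case, 0.05))

    # After frequent releases, a pilot and sanity checks: assume 1%.
    print(should_take_risk(savings, worst_case, 0.01))
```

With these made-up numbers the risk is not worth taking at a 5% likelihood but is at 1%, which is the sense in which making the event less likely lowers the premium.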

So anything you can do to mitigate risk with only a small cost, you should do. Definitely. Errors that arise from easily avoidable risk are just plain stupid and unprofessional. Doing things manually that you could easily automate, writing code in a bad style, using new technology just because it’s new – these are commonplace sources of error which can be eliminated by adopting simple hygiene measures.

On the other hand, taking expensive traders off a desk to spend time testing your new software in a test environment may not be such a good investment. It prevents them earning money (so it is not popular with them) and you are unlikely to get their full attention. So long as they understand and accept the risk of using software they have not tested personally, that is a business decision for them to make. This is an acceptable, managed risk.

