Root cause, part 2

I read an excellent essay by Joel Spolsky today about Problem Management (except that he cleverly doesn’t mention Problem Management by name at all).

Some things I took away:

1. Joel’s approach of talking about a real-life incident is exactly the strategy I’m using to implement Problem Management at my organisation. Nobody wants to sit through PowerPoint slides with process flow charts on them, but get them talking about a real problem and they are hooked. Thanks for the confirmation that my approach is not totally mad.

2. He talks about root cause as the process by which you decide where to target your long-term solution. This is exactly the pragmatic approach to root cause analysis I was trying to describe in my last post.

I wasn’t aware of Sakichi Toyoda’s Five Whys maxim, but it makes some sort of sense. The first couple of questions you ask are going to be about the immediate symptoms of a problem. Go beyond five questions and you are in danger of looking too deeply at the mysteries of the universe. You could well end up trying to implement too general a solution, at great expense, which fixes problems you don’t actually have.

So my amended principle is this: a root cause is that which we can alter to effect a permanent fix to a problem such that the solution addresses more than just the immediate symptoms of the problem.

The important points again are that the solution domain is limited to the space we can directly affect ourselves and that the solution should be a permanent or long-term fix, not a palliative for the symptoms.
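
As a rough illustration of that principle, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Cause` class, the `pick_root_cause` function, the `max_whys` limit of five and the power-outage chain are all hypothetical, not part of ITIL or of any tool. It walks a chain of "why?" answers, skips the immediate symptom, and returns the deepest cause that is still something we can alter ourselves.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Cause:
    """One link in a chain of "why?" answers (hypothetical structure)."""
    description: str
    within_our_control: bool


def pick_root_cause(chain: Sequence[Cause], max_whys: int = 5) -> Optional[Cause]:
    """Return the deepest cause we can alter, or None if there is none.

    chain[0] is the immediate symptom; each later entry answers "why?"
    for the entry before it.
    """
    candidate = None
    # Skip chain[0] (the symptom itself) and look at no more than
    # max_whys answers, so the analysis does not wander off forever.
    for cause in chain[1 : max_whys + 1]:
        if cause.within_our_control:
            candidate = cause
    return candidate


# Illustrative chain only; the details are invented for the example.
outage_chain = [
    Cause("hosts lost power", within_our_control=True),
    Cause("UPS battery ran out", within_our_control=True),
    Cause("mains supply is unreliable", within_our_control=True),
    Cause("substation transformer failed", within_our_control=False),
]

root = pick_root_cause(outage_chain)
print(root.description if root else "no cause within our control")
```

In this sketch the unreliable mains supply counts as within our control because we could change supplier or condition the power ourselves, whereas the substation does not. Whether a given cause really is within your control is a judgment call that the code simply takes as input.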


5 Responses to “Root cause, part 2”

  1. Dominic Sayers June 12, 2008 at 15:04

    Hi Kathy and thanks for the learned contribution.

    “Five Whys” is clearly not the right answer in every case. Sig says you might occasionally stumble upon a radical solution if you allow your analysis to wander out to the outer reaches of the imagination. That’s right, and I would support his approach so long as the cost of the analysis is not prohibitive.

    In your final example you hit on the point I was trying to make in both posts: it doesn’t matter what caused the substation to malfunction because it’s not your problem. All you need to take away is that the power supply is liable to be dodgy. Possible solutions: install power smoothing, change your power supplier etc. But fixing the substation is not a possible solution for you – it’s outside your control. Hence any further analysis is unlikely to yield a potential solution (sorry, Sig!). Stop analysing the problem at that point and start thinking about solutions.

  2. kathyreid June 12, 2008 at 12:55

    Hi there,
    I work in problem management in a university environment and found this post very interesting. “True” root cause is a concept we have struggled with too – and Dominic, you’re right, ITIL defines this very poorly – ie as the underlying cause of one or more incidents.

    There are a number of contentious points surrounding root cause:

    1. There is one primary root cause of each problem, rather than multiple causes

    2. Root cause is a tangible ‘event’ rather than something abstract

    3. The definitive root cause is something within the sphere of influence, or control, of the analyst

    If you follow the ‘Five whys’ model then you are left with one true root cause – the one at the bottom of the five whys. However, adequately preventing a recurrence may require actions for each of the five whys. Many ITIL tools such as Openview force you into using one root cause classification as well – ie you are driven to categorise the problem as having one root cause. A different approach (which is also referred to in the ITIL books) is Ishikawa or cause and effect diagrams. This allows multiple root causes to be graphed, and their causal linkage shown. Kepner Tregoe in their Problem Solving and Decision Making training doesn’t really tackle this at all – perhaps because KT is based on trying to get to the bottom of defects or ‘deviations’, rather than having a process engineering focus.

    My gut feel on this is that there is always going to be one root cause which contributed most.

    The other debate I’ve heard is that root cause is an event. For instance, a host stopped responding BECAUSE the log filled, or a network switch died BECAUSE it reached capacity limits, or a database performed poorly BECAUSE a locking concurrency condition was encountered.

    If you use the five whys, it suggests the root cause is much deeper – the network switch died BECAUSE it reached capacity limits BECAUSE the network failed traffic over to the switch as designed BECAUSE a route was marked bad BECAUSE there were CRC errors in an ATM circuit BECAUSE there was failed hardware BECAUSE a business decision was made to not replace hardware which had reached mean time to failure. Which one is the root cause? If you take the event approach, root cause is ‘capacity exceeded’. If you take the five whys approach, then root cause is a bad business decision.

    Another characteristic of root cause that is often debated is that the root cause is something within your sphere of control. I don’t necessarily agree with this approach – just because something is not within your sphere of control doesn’t mean that it’s not the root cause. Take for example a power outage, where a UPS lasts for its specified uptime and then drops power, causing an outage of the hosts to which it is connected. The IT equipment has behaved as designed and specified – so there is no deviation (and therefore no root cause) with the IT side of things. But the root cause lies with the power company – what caused the power outage? Is “power outage” as a root cause sufficient, or to find root cause do you need to know from the power company that Substation X exceeded voltage on a line and blew a transformer? True, you can’t fix this root cause (the power company has to), but is it the real root cause?

    A key question that has to be asked is “do you need to know root cause to take effective action?”

  3. sig January 24, 2008 at 08:26

    True, true – but nevertheless impossible to know if there was a solution present somewhere in the land of ridiculous questions unless you pay it a visit.

    Such a “visit” should not take much time, minutes only, it’s after all just quick questions and a few moments of pondering, unless one goes corporate with standardised questionnaires *grin*

    I wonder though; when a child asks impossible questions we’re bemused, but when the boss / colleague / subordinate does it, it could easily drift into ever so slightly embarrassing… and that’s kind of sad because the questions are indeed provoking (EdB used the term ‘po’ for Provocative Operation) and have very little cost while perhaps bringing some occasional surprising rewards.

    Why not make it into a standard behaviour – whys are going to be asked even for seemingly obvious situations, all the way into the ridiculous – nobody feels stupid and it might even be fun too? Payoff if such questioning is standard, the carrier-of-problem may ask himself a few questions before being grilled and perhaps even solve the issue – would save some time that… hehe.

  4. Dominic Sayers January 23, 2008 at 20:49

    Hi Sig,

    Thanks for dropping by again.

    I am wary of continuing to ask “why?” until no more whys can be asked. I think you get from the concrete to the abstract to the metaphysical quite quickly! Often there is no limit to the chain of whys unless you set some sort of stop criterion, as we know from answering young children’s questions. I am only interested in an efficient and pragmatic way of discovering the optimum point at which to target a solution.

    Note to self: must read some Edward de Bono :-)

  5. sig January 23, 2008 at 17:23


    I was not aware of the “five whys” either, but met the concept many years ago when reading Edward de Bono – where he quite rightly points out the all too human tendency to “define” the problem including the reasons, and how asking “why?” drills down. He did not have a limit of five though, more of an “until no more whys can be asked” rule.

    The thing is that it can be used beyond “problems”, it can be used during a process sometimes – image being that you’re half way up a wall having ten more nodes to the top, but a retreat down to the bottom could give you a new view of the wall perhaps making a five-node route visible…

    Guess that’s why (heh) I never take any “truth” for granted, such nonsense (truths) have to be questioned of course
