
In search of a denominator


Winston Churchill 1942

Winston Churchill is frequently quoted as saying,  ‘Democracy is the worst form of government, except for all the other ones.’  I feel a similar thing might be said about risk management: ‘Risk management is the worst form of predictive analysis for making IT, Information Management, and cybersecurity decisions, except for all the other ones.’

Because there is so much complexity and so many factors and systems to consider in IT, Information Management, and cybersecurity, we resort to risk management as a technique to help us make decisions. As we consider options and problem-solve, we can’t logically follow every possible approach and solution through and analyze each one on its relative merits.  There are simply way too many.  So we try to group things into buckets of like types and analyze them very generally with estimated likelihoods and estimated impacts.

We resort to risk management methods because we know so little about the merits of future options that we have to ‘resort to’ this guessing game. That said, it’s the best game in town. We have to remind ourselves that the reason we are doing risk management to begin with is that we know so little.

In some ways, it is a system of reasoned guesses — we take our best stabs at the likelihood of events and at what we think their impacts might be. We accept the imprecision and approximations of risk management because we have no better alternatives.

Approximate approximations

It gets worse. In the practice of managing IT and cybersecurity risk, we find that we have trouble even doing the things that produce the approximations that are supposed to guide us. Our approximations are themselves approximations.

Managing risk is about establishing fractions. In theory, we establish the domain over which we intend to manage risk — the denominator — and we consider various subsets of that domain as we group things into buckets for simpler analysis — the numerator. The denominator might be all of the possible events in a particular domain, all of the pieces of equipment upon which we want to reflect, all of the users, or some other measure.
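
To make the fraction idea concrete, here is a toy sketch in Python (both counts are made-up numbers, purely for illustration):

    # Toy sketch of the numerator/denominator framing.
    # Both counts are hypothetical, purely for illustration.
    identified_assets = 412   # things we have actually found and counted
    estimated_domain = 650    # best guess at the true size of the domain

    coverage = identified_assets / estimated_domain
    print(f"We can only reason about {coverage:.0%} of our domain")  # ~63%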

When we manage risk, we want to do something like this:

  1. Identify the stuff, the domain — what’s the scope of things that we’re concerned about? How many things are we talking about? Where are these things? Count these things
  2. Figure out how we want to quantify the risk of that stuff — how do we figure out the likelihood of an event, how do we want to quantify the impact?
  3. Do that work — quantify it, prioritize it
  4. Come up with a plan for mitigating the risks that we’ve identified, quantified, & prioritized
  5. Execute that mitigation (do that work)
  6. Measure the outcome (we did some work to make things less bad; were we successful at making things less bad?)
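
A minimal sketch of that loop in Python (the register entries and the crude probability-times-impact scoring are my own illustrative assumptions, not a standard):

    # Minimal sketch of the six steps; entries and scoring are illustrative.
    from dataclasses import dataclass

    LEVELS = {"Low": 1, "Medium": 2, "High": 3}

    @dataclass
    class Risk:
        description: str   # step 1: an identified thing that can go wrong
        probability: str   # steps 2-3: estimated likelihood (Low/Medium/High)
        impact: str        # steps 2-3: estimated impact (Low/Medium/High)

        def score(self) -> int:
            # crude priority: probability x impact on a 1-9 scale
            return LEVELS[self.probability] * LEVELS[self.impact]

    register = [
        Risk("No backup for critical data", "Medium", "High"),
        Risk("Cloud services provider failure", "Medium", "Medium"),
    ]

    # Step 4: prioritize. Steps 5 and 6 (mitigate, then measure) happen
    # out in the real world, not in the script.
    for risk in sorted(register, key=lambda r: r.score(), reverse=True):
        print(risk.score(), risk.description)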

The problem is that in this realm of increasingly complex problems, the first step — the one in which we establish the denominator — doesn’t get done. We’re having a heck of a time with #1, identifying and counting the stuff. In fact, sometimes we don’t even realize that we’re failing at step #1, because it’s supposed to be the easy part of the process. However, the effects of BYOD, hundreds of new SaaS services that don’t require substantial capital investment, regular fluctuations in users, evolving and unclear organizational hierarchies, and other factors all contribute to this inability to establish the denominator. A robust denominator, one that fully establishes our domain, is missing.

Solar flares and cyberattacks

Unlike industries such as finance, shipping, and some health sectors, which sometimes have hundreds of years of data to analyze, the IT, Information Management, and cybersecurity industries are very new and are evolving very rapidly. There is no substantial body of historical data to assist with predictive analysis.

However, even firms such as Lloyd’s of London must deal with emerging risks for which there is little to no historical data. Speaking on attempting to underwrite risks from solar flare damage (of all things), emerging risks manager Neil Smith says, “In the future types of cover may be developed but the challenge is the size of the risk and getting suitable data to be able to quantify it … The challenge for underwriters of any new risk is being able to quantify and price the risk. This typically requires loss data, but for new risks that doesn’t really exist.” While he is speaking about solar flare risk, the point applies readily to the risks we face in cybersecurity and Information Management.

In an interview with Lloyds.com, Oliver Dlugosch, head of global markets at reinsurer Swiss Re Corporate Solutions, says that he considers the following questions when evaluating an emerging risk:

  1. Is the risk insurable?
  2. Does the company have the risk appetite for this particular emerging risk?
  3. Can the risk be quantified (particularly in the absence of large amounts of historical data)?

Whence the denominator?

So back to the missing denominator. What to do? How do we do our risk analysis?  Some options for addressing this missing denominator might be:

  1. Go no further until that denominator is established (however, this is not likely to be fruitful, as we may never get to do any analysis if we get hung up on step 1), OR
  2. Make the problem space very small so that we have a tractable denominator (but then our risk analysis is probably so narrow that it is not useful to us), OR
  3. Move forward with only a partial denominator and be ready to continually backfill it as we learn more about our evolving environment.

I think there is something to approach #3. The downside is that we may feel we are departing from what we perceive as intellectual honesty, scientific method, or possibly even ‘truthiness.’ However, it is better than #1 or #2, and better than doing nothing. I think we can square the intellectual honesty part with ourselves by 1) acknowledging that this approach will take more work (i.e., constantly updating the quantity and quality of the denominator) and 2) acknowledging that we’re going to know even less than we were hoping for.

So, more work to know less than we were hoping for. It doesn’t sound very appetizing, but as the world becomes increasingly complex, I think we have to accept that we will know less about the future and that this approach is our best shot.

 

“Not everything that can be counted counts. Not everything that counts can be counted.”

William Bruce Cameron (probably)

 

[Images: WikiMedia Commons]

Frog Boiling and Shifting Baselines

There is an apocryphal anecdote that a frog can be boiled to death if it is placed in a pan of room-temperature water and the water is then slowly heated to boiling. The idea is that the frog continually acclimates to the new temperature, which is only slightly warmer than the previous one, so it is never in distress and never tries to jump out. A less PETA-evoking metaphor is that of a ‘shifting baseline,’ originally used to describe a problem in measuring rates of decline in fish populations. There, skewed research results stemmed from choosing a starting point that did not reflect changes in fish populations that had already occurred, leaving an order of magnitude of change unaccounted for.

We can also see this sort of effect in risk management activities where risks that were initially identified with a particular impact and likelihood (and acceptability) seem to slide to a new place over time on the risk heat map. This can be a problem because often the impact and the probability did not change and there’s no logical reason for the level of risk acceptability to have changed. I believe this slide can happen for a number of reasons:

  • we get accustomed to having the risk; it seems more familiar and (illogically) less uncertain the longer we are around it
  • we tell ourselves that because the risk event didn’t happen yesterday or the day before, it probably won’t happen today either. This is flawed thinking! (In aviation we called this complacency.)
  • we get overloaded, fatigued, or distracted and our diligence erodes
  • external, financial, and/or political pressures create an (often insidious) change in how we determine what risk is acceptable and what is not. That is, criteria for impact and/or probability are changed without recognition of that change.

Challenger Disaster


Challenger plume after explosion

In 1986, the Space Shuttle Challenger exploded 73 seconds after liftoff. The Rogers Commission, established to investigate the mishap, produced a report and panel debrief that included testimony by the iconic Richard Feynman. Chapter 6 of the report, entitled “An Accident Rooted in History,” opens with, “The Space Shuttle’s Solid Rocket Booster problem began with the faulty design of its joint and increased as both NASA and contractor management first failed to recognize it as a problem, then failed to fix it and finally treated it as an acceptable flight risk.” (italics added) Let’s recap that:

  • initially failed to formally recognize it as a problem (but later did)
  • then failed to fix it
  • finally treated it as an acceptable flight risk

When further testing confirmed O-ring issues, instead of fixing the problem, “the reaction by both NASA and Thiokol (the contractor) was to increase the amount of damage considered ‘acceptable’.” That is, in effect, they changed the criteria for risk. Implicitly, and by extension, loss of life and vehicle was now more acceptable.

 


For risk to become acceptable, the criteria for impact and/or probability would have had to change — even if implicitly

Physicist Richard Feynman, who served on the Commission, observed:

“a kind of Russian roulette. … (The Shuttle) flies (with O-ring erosion) and nothing happens. Then it is suggested, therefore, that the risk is no longer so high for the next flights. We can lower our standards a little bit because we got away with it last time. … You got away with it, but it shouldn’t be done over and over again like that.”

This is the sort of frog boiling or baseline shifting that we talked about earlier. Though there were many contributing factors, I believe that one of them was that they simply got familiar with and comfortable with the risk. In a way, the risk was ‘old news’.

Frog Boiling in Information Risk Management

This kind of frog boiling or baseline shifting can happen with Information Risk Management as well. As we become inundated with tasks, complexity, and knowledge of new threats and vulnerabilities, it can be tempting to reduce the importance of risk issues that were established earlier. That is, we can have a tendency to put a higher priority on the issue(s) that we have been most recently dealing with. If we are not careful, the perceptions of earlier risks can actually change.  Gregory Berns discusses perception change in his book, Iconoclast.

It’s one thing to consciously re-evaluate our tolerance for different types of risk because we have new information and new things to consider, but it is another to let our tolerance for risk slide because of fatigue, ‘information overload,’ political pressure, or distraction.

More than once I’ve picked up an older Information Risk Register & Heat Map and been reminded that there were unmitigated risks still on it. Yes, since that time I had new risks to deal with, but that didn’t mean that the original ones went away.

One way to help us address this non-overt perception change is to use our earlier Risk Registers and Heat Maps.

      • Look at them again.
      • Did the original issues really go away?
      • Did their impacts or likelihoods really lessen?

It may well be that with new knowledge of threats, vulnerabilities, and capabilities, that we have to revisit and adjust (usually increase) our tolerance for risk given our limited resources. That is fine and appropriate. However, this needs to be a conscious effort. Changing the importance of a risk issue, whether it be its impact or probability, because we are tired, overwhelmed, or because the issue is ‘old news’ doesn’t help us. We need to be methodical and consistent with our approach. Otherwise, we are just fooling ourselves.

 

Have you had to change your risk tolerance in your organization? Have you had to change your criteria for impact and probability? What were the driving factors for re-assessing how you measure risk impact?

 

Risk Mapping: Determining Impact & Probability in the movie Aliens

In describing the Alien threat that they (Ripley and the tactical team) are facing as the evening wears on, the young female protagonist, Newt,  states:

“They mostly come out at night. Mostly.”

When Ripley tries to assuage her by sharing that there are armed soldiers to protect her, Newt states:

“It won’t make any difference.”


Newt and Ripley

Analysis: 

Let’s define this particular Risk Register item: “Alien attack at night.”

We have it from a first-hand observer of multiple instances of Alien attack, Newt, that they mostly come out at night. (Mostly.) If we’re using Low, Medium, & High to represent probabilities 0 ≤ p < 33%, 33% ≤ p < 67%, and 67% ≤ p ≤ 100%, respectively, then I think we can safely convert these observations to a High likelihood of Alien attack at night.

We also have direct observations of destruction, plus further interviewing of a first-hand observer, who shares that having several well-armed soldiers in place “won’t make any difference.” Let’s define Impact as: Low = no fatalities/minor injuries; Medium = 1 to 3 fatalities (or capture with subsequent implantation & cocooning) and severe injuries; High = more than 3 fatalities (or capture with subsequent implantation & cocooning) and very severe injuries (including Alien acid-blood burns). Given this information, I believe we can call this a High Impact.
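
For fun, here are the bucket definitions above as a small Python helper (reading “mostly” as roughly 80% is my own assumption, not canon):

    # The Low/Medium/High probability buckets defined above.
    def likelihood_bucket(p: float) -> str:
        if p < 1/3:
            return "Low"
        elif p < 2/3:
            return "Medium"
        return "High"

    # "They mostly come out at night. Mostly." -- call it p = 0.8 (my guess)
    print(likelihood_bucket(0.8))  # High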

If Hudson were keeping his Risk Register current, I think he could reasonably update with:

      • Risk Description: Alien attack @ night
      • Probability: High
      • Impact: High

 

How would you assess this situation?

Communicate risk in a single page (the first time)

One of the most challenging aspects of work for an IT or information management professional is communicating risk. If you are in a resource-constrained business, e.g., a small or medium-sized business (SMB), that hasn’t analyzed information risk before, consider communicating it the first time in a single page.

The reason for a single-page communication is that risk can be complicated and obscure, and IT technologies, concepts, and vocabulary can be equally complicated and obscure; the combination of the two can go well beyond mystifying for an audience not familiar with either or both (which is most people).

A few years ago I was in a position to try to communicate information risk to a number of highly educated, highly accomplished, and high-performing professionals with strong opinions (doctors). I only had a tiny sliver of their time and attention for my pitch on the information risk in their work environment. If I had tried some sort of multi-page analysis and long presentation, I would have heard the ‘clunk’ as their eyes rolled back in their heads.

Clearly, there was no lack of intellectual capacity for this group, but there was a lack of available bandwidth for this topic and I had to optimize the small amount that I could get.

After several iterations and some informal trials (which largely consisted of me pitching the current iteration of my information risk presentation while walking with a doc in the hall on the way to the operating room), I came up with my single page approach.  It consists of three components:

  • (truncated) risk register 
  • simple heat map
  • 2 – 3 mitigation tasks or objectives

Communicate risk in a single page

I put the attention-getting colorful heat map in the upper left hand corner, the risk register in the upper right, and a proposed simple mitigation plan at the bottom of the page.

This ended up being pretty successful.  I actually managed to engage them for 5 – 10 minutes (which is a relatively large amount of time for them) and get them thinking about information risk in their environment.

To communicate risk in a single page, I am choosing to leave information out. This goes against our nature of wanting to be very detailed, comprehensive, and thorough in everything that we do. However, that level of detail would actually impede communication. And I need to communicate risk. By leaving information out, I actually increase the communication that occurs.

Also, notice in the Proposed Mitigation section, I am not proposing to solve everything in the register.  I am proposing to solve things that are important and feasible in a given time frame (three months in this case).

In three, six, or nine months, we can come back with a new presentation that includes results from the proposed mitigation in this presentation.

Notice that I put “Sensitive” in a couple of places on the document to try to remind people that we don’t want to share our weak spots with the world.

If at some point, your company leadership or other stakeholders want more detail, that’s fine.  If they ask for it, they are much more likely to be able and willing to consume it.

To communicate risk, start simple. If they want more, you’ll be ready, with your working risk register as a source. I’m willing to bet, though, that most will be happy with a single page.

 

Have you presented information risk to your constituents before? What techniques did you use?  How did it go?

Impact vs Probability

Climbing aboard the helicopter for a training flight one evening was probably the first time that I thought about the difference between probability and impact as components of risk.   For whatever reason, I remember looking at the tail rotor gearbox that particular evening and thinking, “What if that fails? There aren’t a lot of options.”

Tail rotor failures in helicopters are not good. The tail rotor provides the counterbalance for the torque generated by the main rotor, which generates all of the lift. It’s what keeps the fuselage of the helicopter from spinning in the opposite direction of the main rotor in flight. If the tail rotor fails in some way, not only are you no longer in controlled flight (because the fuselage wants to spin around in circles), but the emergency procedures (EPs) are pretty drastic and probably won’t help much.


tail rotor gearbox

So I found myself thinking, “Why in the world would I (or anyone) climb on board when something so catastrophic could happen?”  And then the answer hit me, “because it probably won’t fail.”  That is, the impact of the event is very bad, but the probability of it happening is low. This particular possibility represented the extremes — very high impact and generally low probability.


nose gear

But there are possibilities in between, too. For example, what if the nose gear gets stuck and won’t come back down when I want to land? While not desirable, it’s certainly not as bad as the tail rotor failing. I could come back and hover with the nose gear off the deck while someone comes out and tries to pull it down. Or they could put something underneath the nose of the helicopter (like a stack of wooden pallets) and set it down on that. While not a high likelihood of occurrence, a stuck nose gear happens more often than a tail rotor failure, so let’s call it a medium probability for the sake of argument.

While the impact of the stuck-nose-gear event is much less than that of a tail rotor failure, it is not trivial, because recovery requires extra people on the ground who are placed in harm’s way. So maybe this is a medium impact event.


multiple components/multiple systems

Similarly, what if the main gear box overheats or has other problems? Or other systems have abnormalities, problems or failures? What are the probabilities and impacts of each of these?

There are multiple pieces to the puzzle, and each piece needs to be considered in terms of both impact and likelihood. Even as commercial airline passengers:

If we based our decision to fly purely on an analysis of the impact of an adverse event (crash), few people would ever fly.

We do board the plane, though,  because we know or believe that the probability of that particular plane crashing is low.  So, in making our decision to fly, we consider two components of risk: the probability of a mishap and the impact of a mishap.

We have the same kind of thing in managing risk for IT and Information Management services.  We have many interconnected and complex systems and each has components with various probabilities of failure and resulting impacts from failure.


multiple components and multiple systems in IT & Information Management systems as well

What if I’m working in a healthcare environment, users store Protected Health Information (PHI) on a cloud service with a shared user name and password, and PHI leaks out into the wild? This might have risk components of: Probability — High; Impact — Medium to High, depending on the size of the leak. What about theft or loss of a laptop containing PHI? The same for USB thumb drives. What is the probability? What is the impact? What about malware infestation of workstations on your network because of a lack of configuration standards on BYOD devices? What is the likelihood? What is the impact?

It’s possible that our server or data center could be hit with an asteroid. The impact would be very high. Maybe multiple asteroids hit our failover servers and redundant data centers, too! That would surely shut us down. But does that mean we should divert our limited business funds to put our server in an underground bunker? Probably not — because the likelihood of server failure due to asteroid impact is very low.

As with flying, when we analyze risks in IT and Information Management operations, we have to dissect and review events in terms of their respective impacts and probabilities. Only when we have these two components, impact vs probability, can we start to do meaningful analysis and planning.

 

What events do you plan for that have Low Probability but High Impact? What about High Probability but Low Impact?

Creating simple information risk management heat maps

Visualization is a powerful tool for simple information risk analysis. The simple act of placing risks in spatial relationship to each other allows a quick overview of the essential elements of your risk profile. Just as importantly, it allows you to communicate that simple risk profile to others who aren’t as versed in information security, IT, and information management. A popular risk visualization tool is the information risk heat map.

A couple of posts ago, I talked about creating a simple risk register.  In a nutshell, this is a list of things that can go wrong in the IT and information management part of your business with an estimate of how bad it would be and how likely you think it is to happen.  An example might be as simple as, “business shutdown for greater than 3 days because no backup for critical data — medium probability, high impact” or “data inaccessibility because of failure of cloud services provider — medium probability, medium impact.”

As a reminder, keep your analysis of probability and impact simple. I suggest just Low, Medium, or High to start. As a small or medium-sized company, you probably don’t have a lot of data to drive estimates for probability, so just make your best educated estimate. For example, if I’ve got a server with a RAID 5 configuration, I think the chance of two disks failing simultaneously (resulting in data loss) is Low. Similarly, keep the analysis of impact pretty simple to start. For example, the amount of time offline might be a guideline for you — maybe 4 hours or less offline is Low impact, 4 hours to 2 days is Medium, and greater than 2 days is High. How you define impact is a statement of your business’s ‘risk tolerance’ and will vary from business to business. The main thing to remember is to not make it overly complicated.
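
If you like, that downtime guideline can be written as a tiny Python helper (the thresholds are the example ones above; adjust them to your own risk tolerance):

    # Example downtime-based impact criteria; thresholds will vary with
    # each business's risk tolerance.
    def impact_from_downtime(hours_offline: float) -> str:
        if hours_offline <= 4:
            return "Low"
        elif hours_offline <= 48:  # up to 2 days
            return "Medium"
        return "High"

    print(impact_from_downtime(2))   # Low
    print(impact_from_downtime(24))  # Medium
    print(impact_from_downtime(72))  # High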

The heat map is going to be a simple 3 x 3 grid with probability on one axis and impact on the other.  The cells inside the grid will contain the actual risks.

In the original risk register example, there were four columns: risk description, probability, impact, and then a column for date added. To create the heat map, we’re going to add one more column to the far left simply called Risk #.  This is just a number to identify the risk.  It doesn’t indicate any sort of risk priority.


information risk register with numeric indices

Once you have the risk register that includes the Risk # column (which is just an index to the risk description, not a priority), start your heat map by creating a 3 x 3 grid with the probability of the event happening on one axis and the impact of the event on the other.


basic information risk management heat map grid

If you’d like to add some color:


information risk management heat map grid with color

Finally, add the risk index #’s from your risk register:

 


information risk management heat map grid with risk indices
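
If your register lives in a script or spreadsheet, placing the risk numbers into their cells is mechanical. Here is a minimal Python sketch (the three register rows are invented examples, not from the figures above):

    # Place risk index numbers into a 3 x 3 grid; rows are invented examples.
    register = {
        1: ("Medium", "High"),    # risk # -> (probability, impact)
        2: ("Medium", "Medium"),
        3: ("Low", "High"),
    }

    levels = ["Low", "Medium", "High"]
    grid = {(p, i): [] for p in levels for i in levels}
    for risk_id, (prob, impact) in register.items():
        grid[(prob, impact)].append(risk_id)

    # Print with High probability at the top, impact increasing to the right
    for prob in reversed(levels):
        cells = [",".join(map(str, grid[(prob, imp)])) or "-" for imp in levels]
        print(f"{prob:>6} | " + " | ".join(f"{c:^6}" for c in cells))
    print(" " * 9 + " | ".join(f"{imp:^6}" for imp in levels) + "  (impact)")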

Once people orient to the axes on the heat map, their eyes generally go directly to the upper right-hand corner, where the higher impact and probability events are. This is not a bad thing.

With your simple heat map, and without a lot of work, you now have some insight into your operation’s risk profile and are in a better position to make informed business decisions.

 

 

Have you created information risk registers or information risk heat maps before? What did you use for criteria for impact?

Where to begin with IT risk management

Starting an IT risk management program in the traditional sense appears daunting to a small or medium-sized business, and usually is. This is one of the reasons they often don’t get started. To make the insurmountable surmountable, start a simple risk register. If you haven’t already, start one today. Napkins, yellow legal pads, Moleskine notebooks, Evernote notes, etc. all work to start.

Starting and developing a risk register will:

  • increase your situational awareness of your environment
  • serve as the basis of a communication tool to others
  • demonstrate some intent and effort towards due care to auditors and regulators
  • provide the basis for future, more in-depth analysis (as resources allow)

The simplest risk register will have three columns — a simple risk description, the likelihood of the event happening, and the impact if it does. (Adding a fourth column that contains the date you added the risk can be helpful, but is not required.)

Moleskine risk register

Start with writing down the risks as soon as you think of them. If you haven’t done this before, several will probably pop into your head right off.  The act of writing something down is deceptively powerful. It makes you articulate the problem and maybe revisit a couple of your assumptions about the problem.  That said, don’t go nuts analyzing any particular risk when you start. Just get the core idea down, maybe something like, “PII loss resulting from laptop theft” or “reduced support effectiveness because of lack of BYOD policy.”

After you’ve got a dozen or so, take a break (you’ll add more later) and review the whole list. Make two columns next to this list. Label one column ‘probability’ and the other column ‘impact.’ Next to each risk, write down what you think the likelihood of that event occurring is — just High, Medium, or Low. Nothing fancier than that. Same thing with impact — how bad would it be if this event occurred? What’s the impact? Again, just High, Medium, or Low.
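
In code form, that starter list with its two added columns is just a few rows (the two risks are the examples from above; the High/Medium/Low calls are my own guesses):

    # The starter register as data; the H/M/L values are illustrative guesses.
    risk_register = [
        # (description, probability, impact)
        ("PII loss resulting from laptop theft", "Medium", "High"),
        ("Reduced support effectiveness from lack of BYOD policy", "High", "Low"),
    ]

    for description, probability, impact in risk_register:
        print(f"{probability:<8} {impact:<6} {description}")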

When you have a few minutes, you can structure this a little bit more by putting it in a table. I like using PowerPoint or Keynote over Excel/Numbers for this stage. By using PowerPoint’s cartoonish and colorful tables, I stay oriented to the fact that I’ll be communicating these risks (or some of these risks) later on. If I use Excel, I tend to get overly analytical and detailed. It starts to become more of a math problem versus something that I will be communicating to others.

Sample risk register

Keep in mind that it is very easy to over-design beyond the point that is useful to you right now. And you want it to be useful to you right now. At this point, you are creating a simple document that informs you in the brief moment of time that you have to look at it. You don’t want a document that taxes you right now. It needs to give you a quick, easily digestible, and broad view of your risk picture. If the document gets too complicated or goes into too much detail, you increase the likelihood that you won’t pick it up again tomorrow, or in a week, or in a month.

In an upcoming post, we’ll create a simple visualization tool, called a heat map, that can be very helpful in providing a profile of your risk picture.

Do you use a risk register now? How did you create it? How do you maintain it?