The Hysteresis of Hysteria – The 11-Step Cycle of a System Crisis

Oct 5, 2021 | Gradient of Terror

If you’ve been a project manager for any time, you’ve probably experienced an emergency.

Sometimes it happens during a project, say during system testing. The test configuration was working fine, and then “bam!” something changes, and it flips over to being wholly unworkable or unusable. Backing out recent changes doesn’t work, and no one seems to know the root cause of the problem.

If the system is in production, perhaps there’s some early warning in a metric or alarm. But then “boom!” the system becomes unstable and unusable, and alarm storms rage. Customers start complaining, and soon enough, it makes its way into the media. Brand damage becomes real.

How do you get out of this? There are established ways to respond to this situation: bring in more experienced people, escalate to more senior levels of the company (voluntarily or otherwise), hold crisis meetings with more and more people, and set up war rooms.

We see more and more intense focus from a larger and larger group of people.

Eventually, the system responds to this management intensity. You find root causes, apply fixes, bring the system under control, and restore it to normal operating parameters.

What does the team experience during that cycle?

They experience the “Hysteresis of Hysteria”.

Let’s look at this graphically on a two-dimensional chart. We’ll look at the two dimensions first and then walk through the cycle.

The dimensions

The high-level description of the two dimensions is below. You can read more detail after “The Bottom Line”.

Below is the basic plot for our crisis analysis.

[Figure: the basic plot, with Management Intensity on the horizontal axis and System Coherence on the vertical axis]

Management Intensity

The horizontal scale is “Management Intensity”, i.e. the strength of management focus and the weight of resources engaged in addressing the system’s state.

“Low Intensity” on the right is the steady state, or business as usual (BAU). “High Intensity” on the left is the multi-level management focus plus additional resources assigned to find and fix the problem in a crisis.

System Coherence

The vertical dimension of System Coherence measures how well the system coheres to its intended design and operation. “High Coherence” at the top means that the solution is operating normally, as per design. There are few, if any, system problems, and it responds to commands and configuration changes. “Low Coherence” at the bottom indicates that the system is either completely down or operating in a degraded state.

The Steps

By my reckoning, the cycle from degradation back to restoration follows a hysteresis curve, as follows:

[Figure: the 11-step hysteresis curve plotted on the Management Intensity / System Coherence chart]

The high-level description of each step is below. You can read more detail after “The Bottom Line”.

# | Step | Description
1 | Steady State | The maximum level of coherence and the minimum level of management.
2 | Emergence | The system becomes affected by an underlying problem, but there are few symptoms. Local teams (e.g. admins and operators) are handling the issue.
3 | Tipping Point | The local team has applied “standard” diagnoses and remediations. Superficially, the system may respond, but unseen problems are growing and the trend is about to accelerate. Management control is now with the senior team member in the local team.
4 | The Drop | The problem impacts/system instability are growing much faster than the team’s ability to observe and report the status. A broader community of users is experiencing impacts. Problem reports are streaming in via non-technical channels: support calls/emails/chat sessions are spiking and swamping the support channels’ capacity. Executives become aware of the problem either through escalations or via external channels. Management control passes rapidly up the chain to the team leader, then the department manager, until the director is hands-on in the crisis. More and more senior/experienced people are dragged onto the problem.
5 | Early Traction | If a war room were ever on the cards, it would be operating by now. Teams are in full crisis mode and have abandoned all non-critical tasks. People are working long hours and looking ragged. Teams have to balance problem-solving against competing status reporting and questions such as “when will there be a fix?”. But, at some point, the weight of people starts to produce hypotheses for the problems and potential fixes.
6 | Low Point | Stale pizza and tired faces are everywhere. But the team knows the problems and has some fixes, some of which are being applied or developed. If the teams forecast non-instant recovery tasks (e.g. developing software or standing up new infrastructure), they will inevitably have to participate in the justification/trade-off discussions around short-term (but ugly) fixes vs longer-term solutions.
7 | Encouraging Response | The collective team has now planned out the recovery process and started work. The system is responding to the recovery work. The management process is descaling gracefully: it requires fewer people at meetings, and those meetings become less frequent.
8 | Problem / Solution Locked | The very focused and intense activity begins to pay off. The level of management force and focus starts to seem a bit like overkill. People start to ask, “do you really need me at the next status meeting?” Extreme measures are wound back.
9 | Rapid Recovery | The trend to recovery is clear: the team sees significant progress in developing fixes and deploying them successfully. The management resources and intensity are wound back.
10 | Light at the End | Rapid continued improvement results in an almost complete de-focus on the problem. The local team (admins or ops) takes over to complete the job.
11 | All the Way Back (Almost?) | The system should be back to a steady state of maximum coherence. Did we get “All the way back” or “almost all the way back”? Are there any lasting effects on the system that remain, or is it actually in better shape than before? For example, has the management system added new alarming infrastructure or updated procedures or even modified the system to make it more resistant to whatever happened last time?
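
To make the “hysteresis” idea concrete, here is a minimal Python sketch. Treat it as an illustration only: the coordinates are invented for the example (they are not read off the chart), and the names STEPS, coherence_at and hysteresis_gap are hypothetical. It shows the defining property of a hysteresis loop: at the same Management Intensity, System Coherence differs depending on whether you are on the way down (steps 1 to 6) or on the way back (steps 6 to 11).

```python
# Illustrative only: these coordinates are made up for the sketch and are
# NOT taken from the author's chart.
# Each entry: (step number, name, management_intensity, system_coherence),
# both dimensions on an assumed 0..1 scale.
STEPS = [
    (1, "Steady State", 0.0, 1.0),
    (2, "Emergence", 0.1, 0.9),
    (3, "Tipping Point", 0.2, 0.8),
    (4, "The Drop", 0.5, 0.4),
    (5, "Early Traction", 0.8, 0.15),
    (6, "Low Point", 1.0, 0.1),
    (7, "Encouraging Response", 0.9, 0.3),
    (8, "Problem / Solution Locked", 0.7, 0.6),
    (9, "Rapid Recovery", 0.4, 0.85),
    (10, "Light at the End", 0.2, 0.95),
    (11, "All the Way Back (Almost?)", 0.0, 1.0),
]

decline = STEPS[:6]    # steps 1-6: intensity rising, coherence falling
recovery = STEPS[5:]   # steps 6-11: intensity falling, coherence rising


def coherence_at(branch, intensity):
    """Linearly interpolate coherence at a given intensity along one branch.

    Assumes intensity changes monotonically along the branch, which holds
    for the invented coordinates above.
    """
    pts = sorted((x, y) for _, _, x, y in branch)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= intensity <= x1:
            t = (intensity - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("intensity outside the branch's range")


def hysteresis_gap(intensity):
    """Coherence on the recovery branch minus coherence on the decline branch."""
    return coherence_at(recovery, intensity) - coherence_at(decline, intensity)


if __name__ == "__main__":
    for level in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"intensity {level:.2f}: gap {hysteresis_gap(level):+.2f}")
```

With these made-up numbers the gap is zero at both ends of the loop and positive in between: the same level of management attention coincides with a healthier system on the way back than on the way down. Whether a real crisis opens the loop in that direction, and by how much, is exactly the question step 11 asks.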

The Bottom Line

All big system crises that I’ve observed have followed a similar pattern. They seem to start with a “sleepwalking” perspective on systems that are operating well:

  • Management tends to ignore things that are in control and spend time on other problems.
  • There are few situations needing people to look at manuals or work instructions.
  • Local “myths” develop about how the system operates. Like any myth, each has a core of truth but misinforms.
  • Expectations that the system “works well” and will continue to do so lead to a reduction in resource budgets.
  • People’s skill levels in operating the solution decline as cheaper resources are tried, and no impact is noted. Efficiency dividends are achieved and claimed.
  • Constant re-organisation and personnel loss hollow out knowledge.

Once this stasis situation creeps in, any unexpected or discontinuous event can rapidly exceed the experience and knowledge of the BAU team.

What is your experience? Is it any different?

If you’re interested in more detail, check out the “blow-by-blow” below.

The Blow-by-Blow

The Dimensions

Management Intensity

The horizontal scale is “Management Intensity”, i.e. the strength of management focus and the weight of resources that are engaged to address the system’s state.

Low Intensity (on the right end of the scale)

This end of the scale is “steady-state” or “business as usual”. Nothing is particularly wrong, and so the lowest levels of the management and experience hierarchy are assigned. Think of this as “maintenance mode” – if there is any pro-active checking, it is on a low-frequency cycle.

Most likely, those who manage this situation rely on alarms or event notifications to know something is wrong.

High Intensity (on the left end of the scale)

This end of the scale is when you’re in full-blown crisis mode, say a “major incident”, and there is a significant escalation in resource levels. There are also typically higher levels of visibility and intervention across the management hierarchy.

Most likely, you’ve put extraordinary intervention measures in place (crisis rooms, workshops, special assignments), and people are often working around the clock. The pizza boxes and coffee cups are building up around the offices.

Status checking across team members is high frequency, with lots of coordination and communications: a significant overhead.

System Coherence

The vertical dimension of System Coherence measures how well the system coheres to its intended design and operation. In other words, it captures how “in control” or “out of control” the system is as it operates and responds to management commands.

High Coherence

The solution is operating normally, as per design. There are few, if any, system problems, and it responds to commands and configuration changes.

Low Coherence

The system is either completely down or operating in a degraded state. The system responds in ways that are unexpected and misunderstood: predictions and diagnoses don’t hold.

If the system is in this state, it is at the point of incipient or actual brand damage or irretrievable system collapse. If the system is not in production, then project timelines are being trashed.

The Steps

1. Steady State

This point is at the maximum level of coherence and the minimum level of management. Exception-based monitoring is the norm: any built-in alarms are quiet, and pro-active checks are infrequent.

2. Emergence

At this point, the system becomes affected by an underlying problem, but there are few symptoms. An expert with years of experience might recognise the early warning signs, but they have been promoted or are working on “more important” projects.

Early warning signals are easily misdiagnosed or ignored by day-to-day staff, or automated systems (which may be outdated) do not detect the problems. Without management focus, the problem spreads its effects very slowly at first.

3. Tipping Point

At this point, “standard” management responses have been applied by those closest to the system, e.g. admins and operators. Superficially, the system may respond, but the problems are growing and the trend is about to accelerate invisibly.

The project manager probably won’t have the complete picture due to delays in reporting clear symptoms/root cause as the situation emerges.

Alarms and escalations at this point are becoming more visible outside the core teams and immediate management.

A more experienced project manager may recognise the symptoms of an emerging crisis and suggest more aggressive responses. But there is often resistance to formalising the response and a reluctance to disturb the current task load in order to service crisis response actions.

But things are about to get worse.

4. The Drop

At this point, the problem impacts/system instability are now growing much faster than the team’s ability to observe and report the status. Impacts are experienced across a broader community of users, and problem reports are streaming in via non-technical channels: calls, emails and chat sessions to support channels are spiking and beginning to swamp their capacity.

Social media channels start to light up with complaints, often not about the system issues themselves but about the lack of response to problem reports, or the difficulty of even getting through.

The problem is now visible to multiple management layers, and executives often hear about it from outside the company, not through internal escalations. The team handling the problem begins to get swamped by status queries from multiple points within the company.

The teams can’t set up status meetings quickly enough: by the time a meeting happens, the problem has spread or evolved.

And so:

  1. the information in the discussion is out of date; and
  2. not all the necessary people are at the meeting.

Crisis meetings grow and become more frequent, and the problem’s symptoms spread. Daily status sessions become twice daily, then three times and more.

Talk of “war rooms” has begun; whereas such measures were previously seen as a disruption, everyone is now just too frantic to object.

5. Early Traction

Teams are now in full crisis mode, and all pretence of maintaining normal activities has been dropped. People are working long hours and looking ragged. If a war room was ever going to be set up, it has been set up by now.

Whereas in previous stages coordination of the problem response was left to technical leads, by now project managers have been brought in to do the heavy lifting of organising the flow of information, tracking issues and responses, and running crisis meetings. The PM is also working with more senior managers to coordinate communications with outside parties. If the problem results in customer impacts and is large enough, coordination will include the executive level, corporate communications and potentially the legal office.

But, within the noise, patterns are emerging. Some teams start to see responses to their fixes and remediation work. The situation is not resolved, but there are green shoots.

Nobody has really started to think about recovery yet, just finding the root cause and arresting the slide.

6. Low Point

The problems have been identified, and fixes have been identified and are being applied or possibly developed. If the teams forecast long recovery periods (e.g. to put in new physical infrastructure), they get tied up in justification sessions to trade off short-term (but ugly) fixes against longer-term solutions.

There is more work to do to recover, but the slide has been stopped. At this point, we find the maximum number of people engaged in the crisis management process. Crisis management has probably developed a bit of a rhythm. Meetings are running relatively smoothly, and people are responding to actions quickly and reliably.

Just when the crisis machinery is operating smoothly, the need for it starts to reduce.

The decay has stopped, but we need to recover.

7. Encouraging Response

At this point, the recovery process is planned out and has started. The system is responding to the recovery work.

The management process is descaling gracefully. Fewer people are required at meetings, and those meetings become less frequent.

All those involved are operating well due to the repeated cycles and the emerging success. People are getting sleep and becoming less fractious.

Management feels vindicated that the extra efforts have been effective and brought the situation back under control.

8. Problem / Solution Locked

At this point, the focused and intense activity begins to restore balance.

There is a plan in place for the remediation activities, and that plan is under close management.

The level of management force and focus begins to seem a bit like overkill. Extreme measures are wound back. Senior management/executive focus has moved on to other problems, and any senior management reporting on the problem is rolled back into regular reporting cycles.

9. Rapid Recovery

At this point, the trend to recovery is clear – significant progress has been made – and the system is responding to the interventions.

If there were procurement or development activities needed to restore the system fully, all or nearly all have been completed, and the rest are seen as routine repetitions of solution steps already completed.

10. Light at the End

At this point, rapid continuous improvement results in an almost complete de-focus on the problem.

Pretty much everything is turned over to the local admin or operations team to complete the job unless there are continuing technical changes needed to complete the restoration.

11. All the Way Back (Almost?)

At this point, the system should be back to a steady state of maximum coherence.

Did we get “All the way back”, or “almost all the way back”? Are there any lasting effects on the system that remain, or is it actually in better shape than before?

For example, has the management system added new alarming infrastructure or updated procedures or even modified the system to make it more resistant to whatever happened last time?

Or did it decide that was the last time and it was time for a new system to replace the old?

 
