Service Level Agreements (SLAs), or Service Levels defined in contracts, typically define maximum delays between certain events in the procedural interaction between two enterprises. They are intended to create clarity and precision in the dealings between the two enterprises, and to balance their business priorities against the cost of implementing the support.

A typical example is the time between a “customer” enterprise notifying a “supplier” enterprise of a problem, and the “supplier” enterprise responding to that notification. It is not uncommon to have several interactions defined in sequence, such as: notification of problem, administrative response, diagnosis, solution and long term fix. The further into the sequence you look, the less common fixed timeframes become, or the wider the range of the timeframe becomes. After all, how can you guarantee a time to fix when you don’t know the problems that will be raised? However, there are some cases where these are locked down.
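As a minimal sketch of such a sequence, the stages can be written down as a list of maximum delays. Everything specific here is an assumption made for illustration: the name `STAGE_LIMITS`, the function `stage_deadlines`, the individual limits, and the choice to measure every limit from the moment of notification. A real contract may instead measure each stage from the completion of the previous one.

```python
from datetime import datetime, timedelta

# Hypothetical limits, each measured from the time of notification.
# None means the contract leaves that stage without a fixed timeframe.
STAGE_LIMITS = [
    ("administrative response", timedelta(hours=4)),
    ("diagnosis", timedelta(days=1)),
    ("solution", timedelta(days=3)),
    ("long term fix", None),
]


def stage_deadlines(notified_at, limits=STAGE_LIMITS):
    """Return the latest acceptable time for each stage, or None if open-ended."""
    deadlines = {}
    for stage, limit in limits:
        deadlines[stage] = None if limit is None else notified_at + limit
    return deadlines


deadlines = stage_deadlines(datetime(2024, 1, 1, 9, 0))
for stage, deadline in deadlines.items():
    print(stage, "->", deadline if deadline is not None else "no fixed timeframe")
```

The later stages carry longer limits, and the last carries none, mirroring the way fixed timeframes become rarer further along the sequence.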

The SLA should define the longest time within which the event can reasonably be expected to happen before some degradation of performance begins to occur in the “customer’s” operations. Such degradation may not (and in most cases does not) occur immediately at that point, but the lead time towards it has been consumed almost completely: there is no more wiggle room.

But in practice, once these SLAs have been embedded in the operational cycle, a strange thing often happens: one or both of the teams responsible for managing the interactions begin to “optimise” the process, and in doing so can start a gradual but persistent drift of operational practice away from the intent that was in the minds of those who negotiated the contract. In the worst (and probably the most common) case, the micro-competition between the two enterprises results in a sub-optimal or even dysfunctional process. Depending on the relationship between the two enterprises, the “optimisation” may end up as “gaming” or even “exploitation”.

For example, it is common for SLA timeframes to become habituated as the expected time for the event to happen, rather than the outer limit of performance. In such cases you may start hearing discussions that include statements like: “that request isn’t out of SLA, so you cannot escalate”. The code phrase “not out of SLA” means that the request is still within the contracted timelines. But this is both inappropriate and dangerous, primarily to the “customer” enterprise, and it greatly advantages the “supplier” enterprise.

If everything is targeted to the outer limit of the SLA, it is inevitable that some requests begin to slip over the line to the other side. The distribution curve of performance develops a peak at the SLA timeframe, with a spread on either side of it. The “long tail” of performance falls away on the inside (within SLA) side of the curve. Depending on whether there are penalties for “out of SLA” requests, the tail on the “out of SLA” side may drop off very steeply or may tail away slowly.
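One way to see whether performance has drifted towards the limit is to express each request’s duration as a fraction of the SLA timeframe and look at where the requests cluster. The sketch below is illustrative only: the function name `sla_profile`, the 0.8 “near the limit” threshold, and the sample durations are all assumptions, not measurements from any real process.

```python
from statistics import median


def sla_profile(durations_hours, sla_hours, near_limit=0.8):
    """Summarise how much of the SLA window each request consumed."""
    if not durations_hours:
        raise ValueError("no requests to summarise")
    # 1.0 means the request finished exactly at the SLA limit.
    fractions = [d / sla_hours for d in durations_hours]
    total = len(fractions)
    breached = sum(1 for f in fractions if f > 1.0)
    near = sum(1 for f in fractions if near_limit <= f <= 1.0)
    return {
        "median_fraction": median(fractions),
        "near_limit_share": near / total,
        "breached_share": breached / total,
    }


# Made-up durations (hours) for requests under a hypothetical 8 hour SLA.
sample = [2, 6, 7, 7.5, 8, 8, 8.5, 9]
profile = sla_profile(sample, sla_hours=8)
print(profile)
```

A median fraction close to 1.0, together with a large share of requests near the limit, suggests that the SLA is being treated as the target rather than the outer bound, even when the share of breaches still looks acceptable.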

The only real way that this kind of progressive degradation can be avoided is for there to be a regular “all hands” review that includes the accountable executives.

This review needs to be based on accurate data collection and on clear business criteria for assessing performance. The “business criteria” include the intent of the contract and some assessment of the value that the process is creating for the organisation. By “all hands” we mean all the participants in the management and execution of the process, the commercial and legal folks who own the contract upon which the process is based, key representatives of the consumers of the process (e.g. Project Managers, System Owners or even end users), and the accountable executives.

By focusing on business value, it is much easier to strip away the gamesmanship, process habituation, and often emotional baggage that can exist between teams, and to focus on what the process is intended to achieve.