Threshold implementation is a standard practice used to regulate the number of events that need to be managed. It is unusual for the gravity of these values to be re-evaluated by either party. They just get set when the monitoring tools are installed, and the service provider tweaks them behind the scenes. What possible availability thresholds exist? “Service must be down for more than X minutes”? Does it make sense to wait for that threshold before triggering a ticket? Why not launch the ticket immediately and recall it if necessary?
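To make the trade-off concrete, here is a minimal sketch of the two policies in Python. The `Outage` record, the function names, and the five-minute figure are all assumptions for illustration, not taken from any real monitoring tool.

```python
from dataclasses import dataclass

@dataclass
class Outage:
    start_min: int  # minute the service went down
    end_min: int    # minute the service recovered

    @property
    def duration(self) -> int:
        return self.end_min - self.start_min

def threshold_policy(outages, threshold_min):
    """Open a ticket only once an outage has outlasted the threshold."""
    return [(o.start_min + threshold_min, "open")
            for o in outages if o.duration > threshold_min]

def immediate_policy(outages, recall_min):
    """Open a ticket at once; recall it if the service recovers quickly."""
    events = []
    for o in outages:
        events.append((o.start_min, "open"))
        if o.duration <= recall_min:
            events.append((o.end_min, "recall"))
    return events

# Hypothetical month: one 2-minute blip and one 30-minute outage.
outages = [Outage(0, 2), Outage(10, 40)]
print(threshold_policy(outages, 5))   # [(15, 'open')]
print(immediate_policy(outages, 5))   # [(0, 'open'), (2, 'recall'), (10, 'open')]
```

In this made-up case the immediate policy gets someone looking at the long outage five minutes sooner, at the cost of one ticket that is opened and then recalled.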
In theory, the data collection process and result repository should be reviewed periodically. I’ve never seen service level data on the service level management components. People don’t think about it and it adds a layer of overhead that people can’t afford. But it is appropriate in some situations, especially where multiple service providers are contributing to the same deal and one of them has the responsibility to develop overall service level reports.
Everyone should have confidence in the data. The monthly SLA report preparation cycle includes a period in which the various staff (client and providers) review the draft report. Most of the focus for these reviews is on understanding the data points that exceed the target or can trigger a penalty. The atmosphere of these discussions is defensive and essentially non-constructive as everyone scrambles to justify their performance.
These hassles could be moderated with a safety valve that takes issues that are not statistically relevant off the metric review table and pipes them to a Problem Mgmt process for further evaluation and improvement. Clients need to take the aberrations out of the metrics rationalization process and handle them in an escalation context. The end result will be more effective and the service level reports will get out on time.
Availability objectives must be clearly defined and communicated to the client. OK! Sounds good.
This requirement is normally satisfied by repeatedly pinging a key server box (or boxes). There’s not a lot of information there, but at least we can conclude that the application was possibly up when the ping response was received.
A lightweight, programmatic test transaction would be more informative. These should be triggered from the remote segments used by the key application end users, so that we pick up all the network dependencies along the way. The only missing piece in the view of availability is the end user workstation.
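Here is a rough sketch of such a test transaction, assuming the application exposes an HTTP page that exercises the path we care about. `http_transaction`, `run_probe`, and the example URL are hypothetical. The point is only that the probe records success and elapsed time for a real request rather than a bare ping reply.

```python
import time
import urllib.request

def http_transaction(url, timeout_s=5.0):
    """Hypothetical test transaction: fetch a page and require HTTP 200."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        if resp.status != 200:
            raise RuntimeError(f"unexpected status {resp.status}")

def run_probe(transaction):
    """Run one test transaction; return (succeeded, elapsed_seconds)."""
    started = time.monotonic()
    try:
        transaction()
        ok = True
    except Exception:
        # Any failure (timeout, DNS, refused connection, bad status) counts as down.
        ok = False
    return ok, time.monotonic() - started

# Example call, to be scheduled from a machine on the end users' network segment:
# ok, seconds = run_probe(lambda: http_transaction("https://app.example.com/health"))
```

Running the same probe from each remote segment gives one result per location, which is closer to what the end users see than a ping from the data center.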
How can the outsourcer require the service provider to actively monitor availability? A real time view of services would only yield a Boolean result. A real time average seems to miss the point. How about just relying on automated alerts (traditional monitoring) to identify and resolve issues?
The data collection methods should be available to the people who consume the service level metrics. I’ve seen an appendix that serves as a handy reference. If service level reports are distributed via a portal, a “metrics definition page” also helps remind people.
These definitions should also document the data collection intervals. Bonus points if the service level objectives recognize the business cycle, and tighten at critical times of the day, week, month, or quarter.
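As a small illustration of an objective that tightens with the business cycle, the sketch below returns a stricter target during business hours in the last three days of the month. The two target values, the definition of "critical", and the name `target_for` are assumptions made up for the example.

```python
import calendar
from datetime import datetime

DEFAULT_TARGET = 0.99    # assumed baseline availability objective
CRITICAL_TARGET = 0.999  # assumed tighter objective for critical periods

def target_for(moment: datetime) -> float:
    """Return the availability objective that applies at the given moment."""
    last_day = calendar.monthrange(moment.year, moment.month)[1]
    month_end = moment.day > last_day - 3  # last three days of the month
    business_hours = moment.weekday() < 5 and 8 <= moment.hour < 18
    return CRITICAL_TARGET if (month_end and business_hours) else DEFAULT_TARGET

print(target_for(datetime(2024, 1, 31, 10)))  # 0.999 (Wednesday, month end)
print(target_for(datetime(2024, 1, 15, 10)))  # 0.99
```

A table like this belongs on the metrics definition page, next to the collection intervals.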
Availability requirements usually change depending on the weekly, monthly or annual cycle. We notice this because most everyone does maintenance on Saturday night: stuff needs to be greased and the filters need to be cleaned. It’s also a good time to do a few upgrades. Don’t include this time in the service level calculation!
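One way to leave maintenance out of the calculation is sketched below. It assumes times are minutes from the start of the reporting period, that maintenance windows do not overlap each other, and that outages do not overlap each other. The function names and the sample week are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Minutes shared by two intervals (0 if they do not intersect)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def availability(period_min, outages, maintenance):
    """Availability over agreed service time, ignoring maintenance windows.

    outages and maintenance are lists of (start_min, end_min) tuples.
    """
    maintenance_total = sum(end - start for start, end in maintenance)
    agreed = period_min - maintenance_total
    counted_down = 0
    for o_start, o_end in outages:
        excluded = sum(overlap(o_start, o_end, m_start, m_end)
                       for m_start, m_end in maintenance)
        counted_down += (o_end - o_start) - excluded
    return (agreed - counted_down) / agreed

week = 7 * 24 * 60                    # 10080 minutes
maintenance = [(0, 240)]              # one four-hour window
outages = [(120, 300), (1000, 1030)]  # the first starts inside the window
print(round(availability(week, outages, maintenance), 4))  # about 0.9909
```

Only the 60 minutes of the first outage that run past the window, plus the 30-minute outage, count against the service level.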
It might be possible to offer a piece of the standard service during the maintenance window, perhaps supporting the client’s original vision of overall availability, while encouraging the client to remember that maintenance is a good thing.
Training for the service desk is an example of maintenance that has to happen, and it favors the client. We can reduce production capacity temporarily for the benefit of the client, service provider, and the agent. The idea is to keep the risk of missed service levels low.
The math discourages people from calculating availability all the way out to the end user. It seems impractical to insist that all the components in the chain support an end result of 95+%.
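The discouraging math looks roughly like this, assuming the components sit in series and fail independently, so that their availabilities multiply. The component counts and the 99% figure are illustrative only.

```python
import math

def end_to_end(availabilities):
    """Availability of a chain of independent components in series."""
    return math.prod(availabilities)

def required_per_component(target, n):
    """Availability each of n equal components needs to hit the target."""
    return target ** (1 / n)

print(round(end_to_end([0.99] * 5), 4))            # about 0.951
print(round(end_to_end([0.99] * 10), 4))           # about 0.9044
print(round(required_per_component(0.95, 5), 4))   # about 0.9898
```

Five links at 99% each barely clear 95% end to end, and ten links fall to roughly 90%, which is why nobody wants to sign up for the whole chain.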
Being careful about change scheduling and understanding the cycle of business-critical transactions are ways to raise the effective availability of the service.
Configuration mgmt databases are notoriously difficult to maintain. Every change should touch the CMDB in the planning, approval, and implementation stages. Hopefully the records get updated during this cycle.
Maybe because we think about SLAs in terms of availability and we think of availability in terms of boxes, we end up registering the boxes as the most meaningful CIs in the CMDB.
There are two glaring weaknesses in this approach: we don’t recognize the context of the service delivered to the end user and we don’t recognize the context of the business outcome delivered to the business.
Suggestion: register business outcomes as the CI root. Then identify all the infrastructure elements, the participating service providers, and the end users as supporting CIs.
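As a sketch of what that structure might look like, the snippet below models a business outcome as the root CI with its supporting CIs underneath. The `CI` class, the `kind` labels, and the sample records are hypothetical and not tied to any particular CMDB product.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CI:
    name: str
    kind: str  # e.g. "business_outcome", "infrastructure", "service_provider", "end_user"
    supporting: List["CI"] = field(default_factory=list)

def walk(ci, depth=0):
    """Yield (depth, ci) for the root CI and everything that supports it."""
    yield depth, ci
    for child in ci.supporting:
        yield from walk(child, depth + 1)

# Hypothetical records: the business outcome is the root, not a server.
root = CI("Monthly invoices issued", "business_outcome", [
    CI("billing-db", "infrastructure"),
    CI("Hosting provider", "service_provider"),
    CI("Finance clerks", "end_user"),
])

for depth, ci in walk(root):
    print("  " * depth + f"{ci.kind}: {ci.name}")
```

With the outcome at the top, a change record against any supporting CI can be traced straight up to the business result it puts at risk.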
- approved hardware and software components
- no single point of failure: all hardware components are redundant
- all components are actively monitored 24×7
- “do not touch” production windows defined
- all components have spares on-site
- guaranteed 30 minute response time for hardware engineers
- fail-over rehearsals conducted every quarter
- service re-build rehearsals conducted every quarter