Service providers establish service-level objectives (SLOs) to help them meet customer expectations and contractual service-level agreements (SLAs). For example, an internet service provider (ISP) might promise 99.5% uptime that keeps clients connected to the online resources they need. Setting a stricter internal SLO of 99.9% uptime gives teams some room to fail without breaking the agreement.
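To make that room to fail concrete, here is a minimal sketch that converts the two uptime targets above into allowed downtime. The 30-day window and the helper name `allowed_downtime_minutes` are assumptions chosen for illustration, not part of any contract or standard.

```python
# Minimal sketch: how much downtime each uptime target tolerates.
# Assumption: a 30-day measurement window; the helper name is hypothetical.
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_percent: float, window_days: int = 30) -> float:
    """Return the minutes of downtime an uptime target tolerates in the window."""
    total_minutes = window_days * MINUTES_PER_DAY
    return total_minutes * (100.0 - uptime_percent) / 100.0


sla_budget = allowed_downtime_minutes(99.5)  # contractual promise to clients
slo_budget = allowed_downtime_minutes(99.9)  # stricter internal objective
buffer_minutes = sla_budget - slo_budget     # slack between missing the SLO and breaching the SLA

print(f"SLA 99.5%: {sla_budget:.1f} minutes of downtime allowed")
print(f"SLO 99.9%: {slo_budget:.1f} minutes of downtime allowed")
print(f"Buffer:    {buffer_minutes:.1f} minutes")
```

Under these assumptions, a team that misses the 99.9% objective still has close to three hours of additional downtime in the window before the 99.5% agreement is broken.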
Setting goals and monitoring SLAs should make it easier for companies to know when they meet client expectations. Even when service providers fall short, they can use metrics to identify opportunities for improvement.
Falling short today could highlight inaccurate assumptions and point to a more reliable path forward.
We usually find that companies have reasonable SLOs. Unfortunately, we also talk to plenty of service providers that don’t know how to enforce an SLO action plan. That often means they fail to reach the expected results.
Since we work with diverse technologies, we have some insights into how you can create and enforce an SLO action plan that gets the results you need and your clients expect.
Every company needs SLOs it can meet. Otherwise, it misleads potential clients and will likely face negative consequences.
We’ve already made a list of metrics software companies should track. Some metrics that stand out include mean time to failure (MTTF), availability, mean time between failures (MTBF), mean time to repair (MTTR), and probability of failure on demand (POFOD). Read our earlier post for a more in-depth look at important metrics to track.
The metrics you track will depend on the products and services your SLOs cover. If the right metrics aren’t obvious now, you will quickly learn about industry expectations when you start courting clients.
Critically, you must set goals you can reach. Don’t promise a five-hour MTTR unless you’re certain your team can meet that objective. Keep in mind, however, that a metric like MTTR is an average. If four repairs take a combined 20 hours, you have a five-hour MTTR, even if one of those repairs took far longer than five hours.
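The arithmetic behind that average can be sketched in a few lines. The repair durations below are hypothetical, picked only so that four repairs add up to 20 hours.

```python
from statistics import mean

# Hypothetical repair durations, in hours, for four incidents (20 hours in total).
repair_hours = [2.0, 3.0, 4.0, 11.0]

mttr_target_hours = 5.0          # the objective discussed in the text
mttr = mean(repair_hours)        # total repair time divided by number of repairs
worst_repair = max(repair_hours)
over_target = [h for h in repair_hours if h > mttr_target_hours]

print(f"MTTR: {mttr:.1f} hours")
print(f"Longest single repair: {worst_repair:.1f} hours")
print(f"Repairs slower than the target: {len(over_target)}")
```

In this made-up data set the MTTR matches the five-hour objective even though one repair took 11 hours, which is why it can help to review individual incidents alongside the average.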
How do you know how to set a reasonable objective? Start by picking a specific timeframe to measure performance. Then, segment your objectives so you can get a granular view of the data you collect. The results should give you a good idea of what SLOs make sense for your company.
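One way to picture the timeframe-plus-segments approach is the rough sketch below. It assumes you can export downtime records tagged with a service name and a measurement period; the record layout, the service names, and the fixed 30-day period length are all hypothetical simplifications.

```python
from collections import defaultdict

# Assumption: every measurement period is treated as 30 days long.
MINUTES_PER_PERIOD = 30 * 24 * 60

# Hypothetical downtime records: (service, period, downtime in minutes).
incidents = [
    ("api", "period-1", 30.0),
    ("api", "period-1", 20.0),
    ("api", "period-2", 10.0),
    ("checkout", "period-1", 240.0),
]

# Sum downtime per (service, period) segment.
downtime = defaultdict(float)
for service, period, minutes in incidents:
    downtime[(service, period)] += minutes

# Convert each segment's downtime into an availability percentage.
availability = {
    segment: 100.0 * (1 - minutes / MINUTES_PER_PERIOD)
    for segment, minutes in downtime.items()
}

for (service, period), percent in sorted(availability.items()):
    print(f"{service:<10} {period}: {percent:.3f}% available")
```

With this sample data, the `api` segments stay above 99.8% while `checkout` drops to roughly 99.44% in its one period, the kind of gap a single company-wide figure would hide. Numbers like these, gathered from your own history, suggest which objectives are realistic for each segment.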
If the data shows that you can’t meet the SLOs your competitors offer, consider upgrading your technology.
Most companies have development teams and operations teams. These teams often have contradictory goals because the development team wants to release new products and updates quickly, while the operations team wants to ensure stability.
We’ve seen organizations handle this situation in various ways. Simply establishing SLOs could make it easier for you to find a balance between the two teams’ goals. We find that encouraging cross-team collaboration also helps.
Site reliability engineering (SRE) can play a critical role in meeting SLOs. SRE combines aspects of development and operations engineering to give you a more holistic approach to performance management. When you encourage teams to embrace SRE and work together, you should see your technology become more resilient and efficient.
Importantly, SRE leans heavily on automation, often supported by artificial intelligence and machine learning, to maintain system performance. Since these technologies can handle many of the common issues that arise, you can keep your systems working well without forcing team members to do a lot of extra work. They will need to learn a few new skills and communicate with other teams, but they won’t need to set aside much time for SRE duties.
Encouraging teams to collaborate will help you achieve your SLOs. Ultimately, though, you need someone to take charge and accept responsibility for making sure your company fulfills its SLOs.
Ideally, you already have a staff member with enough experience to serve as a site reliability engineer. If not, you might want to find someone interested in learning new skills. Alternatively, you could hire someone to fill the role.
Assigning a reliable professional as your site reliability engineer means you know who accepts responsibility for fulfilling SLAs and SLOs. When assets in your IT ecosystem don’t meet expectations, this person can review alternatives and choose one that will help you meet your service-level targets. Perhaps you need to invest in new hardware or cloud services to reduce latency, respond to incidents more quickly, or prepare for changes in user behavior.
Without a designated site reliability engineer, this work will probably become your CTO’s responsibility. You can take that approach, assuming the CTO has enough time to do the work, but we find that many CTOs already have too many tasks to handle. Adding SRE responsibilities could mean the CTO never gets around to those jobs.
Make sure your employees get something in return for helping the company meet or exceed its SLOs. You have plenty of options. For example, you could tie annual performance bonuses to meeting SLOs. The company stands to make more money by meeting client expectations, so it makes sense to distribute revenue to the people keeping clients happy.
You could also offer short-term incentives for meeting objectives. If the company fulfills its obligations for an entire month, maybe everyone gets some additional paid time off.
Regardless of the incentive you choose, make sure you frame it in a positive way. We’ve seen some companies tell employees that they “will lose” their bonuses if they don’t reach specific goals. We believe the carrot works better than the stick. Communicate that everyone will receive higher bonuses for meeting objectives. That approach should motivate people to work harder instead of making them feel unhappy that they will lose money they think they deserve.
Still unsure of how to enforce an SLO action plan that works for your business? Reach out to the team at Adservio. We can review your issues, discuss any barriers to success, and help you implement a plan.