IT resilience – How to improve reliability, tolerability, and disaster recovery

A great IT resilience plan often involves improving reliability, tolerability, and disaster recovery. We share some insights into embracing resilience.

All organizations with IT infrastructure face potential technology failures. Common points of failure include network switches, on-premises servers, internet service providers (ISPs), and critical business applications. When something goes wrong, downtime can prevent employees from working, cause data loss, and disrupt customer experiences.

We’ve worked with a lot of companies that want to improve their IT resilience, including aspects of their reliability, tolerability, and disaster recovery plans. Giving clients accurate, actionable advice requires extensive research. In 2022, Gartner researchers Ron Blair, Belinda Wilson, Mrudula Bangera, and Josh Chessman published an influential paper called “IT Resilience — 7 Tips for Improving Reliability, Tolerability and Disaster Recovery.” (You can find a summary on the Gartner, Inc website.)

We recently incorporated the opinions of Gartner experts into our approach to serving clients. The following tips should help improve aspects of your IT resilience to achieve high availability and improved functionality throughout your IT ecosystem.

Focus on disaster recovery efforts that work

All companies need disaster recovery plans that anticipate the worst possible scenarios and provide ways to minimize downtime. Somewhat surprisingly, Gartner's research shows that:

Eighty-six percent of I&O leaders self-assessed their recovery capabilities as meeting or exceeding CIO expectations. Yet only 27% of that group consistently undertook three of the most basic elements expected of a DR program — formalizing scope, performing a BIA to acquire business requirements, and creating detailed recovery procedures. This means over 54% are likely suffering from mirages of overconfidence.

We call this “disaster recovery theater.” People put in a lot of work to create disaster recovery plans, but they don’t do the right research and testing to ensure those plans work well.

What can you do to focus on disaster recovery efforts that work? We recommend creating plans that anyone, including third-party providers, can execute. That way, you can reach your recovery time objective even during nights and weekends.

We also like building exercises into disaster recovery plans. Difficult training exercises will expose vulnerabilities within your team and technologies. The results will tell you what types of training your team members need to minimize damage caused by outages.

Recognize IT resilience as a cross-team effort

It might seem obvious that the site reliability engineering (SRE) team should lead IT resilience efforts. In reality, IT resilience is a cross-team effort that often involves managing distributed systems.

Bring in people from all of your IT teams so they can contribute to a plan that serves diverse needs. Someone on the SRE team will recognize the importance of building resilient software. However, they might not fully understand how their plan will affect on-site and cloud servers. A network administrator can fill in those knowledge gaps and help ensure the hardware performs as the SRE team expects.

Similarly, you need security experts trained to scan systems for vulnerabilities. If a DDoS attack makes it impossible for clients to access your website, you will need a security specialist to find a way to block the attack without also barring legitimate clients from using the site.

No individual can know everything about today's IT landscape. Get all of your teams involved so they can contribute to a plan that improves uptime and prevents interruptions.

Take advantage of failover and failback automations

Do your teams have access to tools that make it easier for them to achieve the metrics set by your recovery time objective (RTO) and recovery point objective (RPO)? Gartner identifies automation as an important part of failover and failback, especially at the application level.
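To make the two objectives concrete, here is a minimal sketch of how a team might score a recovery drill against them. It is our own illustration rather than anything from the Gartner paper: the function names `meets_rpo` and `meets_rto` and every timestamp are hypothetical. RPO compares the failure time with the last good backup (how much data could be lost), while RTO compares it with the moment service returned (how long users were down).

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """True if the data written since the last good backup fits within the RPO."""
    return failure_time - last_backup <= rpo


def meets_rto(failure_time: datetime, restored_time: datetime, rto: timedelta) -> bool:
    """True if the downtime between failure and restoration fits within the RTO."""
    return restored_time - failure_time <= rto


# Hypothetical drill figures, for illustration only.
failure = datetime(2023, 9, 29, 2, 0)
last_backup = datetime(2023, 9, 29, 1, 30)   # 30 minutes of data at risk
restored = datetime(2023, 9, 29, 5, 0)       # 3 hours of downtime

print("RPO met:", meets_rpo(last_backup, failure, rpo=timedelta(hours=1)))
print("RTO met:", meets_rto(failure, restored, rto=timedelta(hours=2)))
```

In this made-up drill the backup schedule satisfies a one-hour RPO, but the three-hour restore misses a two-hour RTO, which is the kind of gap that automation is meant to close.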

Anything from adding new features to getting a sudden influx of users could disrupt application-level performance. Automations can watch performance metrics and adapt as needed. For example, an app that needs to scale up to meet user demand could automatically provision additional resources from cloud servers and virtual machines. No one on your team has to decide to scale. Scaling on demand is such a well-understood response that you can trust automated processes to handle it with little oversight.
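One common way to automate the scaling decision is a proportional rule: grow or shrink the instance count by the ratio of observed load to target load, then clamp the result to safe bounds. The sketch below is a simplified illustration of that idea (the Kubernetes Horizontal Pod Autoscaler documents a similar formula). The function name `desired_instances`, the CPU percentages, and the bounds are all assumptions made for the example, not settings from any particular platform.

```python
import math


def desired_instances(current: int, observed_load: float, target_load: float,
                      min_instances: int = 1, max_instances: int = 10) -> int:
    """Proportional scaling: current * observed / target, rounded up and clamped."""
    if current < 1 or target_load <= 0:
        raise ValueError("current must be >= 1 and target_load must be positive")
    raw = math.ceil(current * observed_load / target_load)
    # Clamp so a metric spike cannot scale past the budget or below the floor.
    return max(min_instances, min(max_instances, raw))


# Hypothetical readings: average CPU utilization in percent, target of 60%.
print(desired_instances(current=4, observed_load=90.0, target_load=60.0))   # scale up
print(desired_instances(current=6, observed_load=20.0, target_load=60.0))   # scale down
print(desired_instances(current=8, observed_load=100.0, target_load=50.0))  # hits the cap
```

The upper bound matters as much as the rule itself: it keeps a traffic spike, or a faulty metric, from provisioning resources without limit.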

Additionally, make greater use of cloud services hosted outside your geographic area. Automated cloud backups and recovery can protect your most sensitive data and essential services during chaotic events like storms and power outages. Even if you can't serve everyone during the event, you won't lose data that would otherwise exist only on local servers.

Devote resources to topology mapping and performance monitoring

The more you know about your IT ecosystem, the better you can prepare for disruptions. For many companies, getting to know their IT ecosystems starts with topology mapping. Topology mapping generates a visual diagram of your network, including the locations of physical resources and the dependencies between components. With an accurate view of your network, you know what to expect when a disruption occurs at any specific location.
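Once a topology map exists as data, "what to expect" becomes a question you can compute. The sketch below assumes a small, invented dependency map (the component names are hypothetical) and walks it in reverse to list everything that would be affected if one component failed. Real mapping tools discover these relationships automatically; this only illustrates the kind of query a map makes possible.

```python
from collections import deque

# Hypothetical topology: each key depends on the components in its list.
DEPENDS_ON = {
    "web-app": ["app-server", "isp-link"],
    "app-server": ["database", "core-switch"],
    "database": ["core-switch"],
    "reporting": ["database"],
    "core-switch": [],
    "isp-link": [],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every component that directly or indirectly depends on `failed`."""
    # Invert the map: for each component, who depends on it?
    dependents: dict[str, set[str]] = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    # Breadth-first walk outward from the failed component.
    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


print(sorted(impacted_by("database", DEPENDS_ON)))
print(sorted(impacted_by("core-switch", DEPENDS_ON)))
```

In this invented network, losing the database takes out three services, while losing the core switch affects everything except the ISP link, which suggests where redundancy would pay off first.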

Performance monitoring helps keep teams ahead of outages and other problems. Does degraded performance in one area make it likely that certain features will stop working? Spotting these warning signs early gives IT teams time to act before users notice a problem.
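A simple way to spot trouble early is to compare each new measurement with a rolling baseline and flag readings that sit far above it. The sketch below is a deliberately basic illustration, not how any specific monitoring product works: the function name `find_anomalies`, the window size, the threshold, and the latency figures are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


def find_anomalies(samples: list[float], window: int = 5,
                   threshold: float = 3.0, min_spread: float = 1.0) -> list[int]:
    """Return indexes of samples far above the mean of the preceding window.

    A sample is flagged when it exceeds mean + threshold * spread, where
    spread is the window's standard deviation (floored at min_spread so a
    perfectly flat baseline does not flag tiny fluctuations).
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        spread = max(pstdev(baseline), min_spread)
        if samples[i] > mean(baseline) + threshold * spread:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, one reading per minute.
latencies = [100, 102, 98, 101, 99, 100, 180, 101]
print(find_anomalies(latencies))
```

Here the only reading flagged is the jump to 180 ms. A production setup would typically add alert routing and suppression on top of a check like this.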

AIOps makes performance monitoring even more effective. As signs of a potential disruption emerge, artificial intelligence can trigger corrective actions that prevent downtime.

AI can also analyze massive amounts of performance data to help humans make choices. In fact, the combination of human and artificial intelligence tends to get the best results. We trust AI to make some choices. Ultimately, though, we know that organizations get better results when they combine AIOps with IT infrastructure monitoring (ITIM), network performance monitoring and diagnostics (NPMD), application performance monitoring (APM), and digital experience monitoring (DEM) overseen by human professionals.

Make IT resilience an essential part of your organization

General advice can help improve IT resilience initiatives, but it can’t anticipate all of the factors that influence your organization’s technology. Feel free to reach out to Adservio for more detailed, personalized instructions from I&O leaders who have served clients in diverse industries. We’d love to help you develop a plan that improves your resilience posture.

Published on September 29, 2023