Bringing together software engineers and site reliability engineers

Are your software engineering teams and site reliability engineers working together effectively? Learn more about coordinating their efforts from Adservio.

Most of us have worked in places with departmental silos, so we know the negative effect they can have on software development and ongoing operations. Bringing together software engineering teams and site reliability engineers (SREs) helps break down boundaries, creating more opportunities for success.

How can you make it easier for SRE teams and DevOps teams to work together? We have some practices and tools that improve both communication and system performance.

Design DevOps processes with site reliability in mind

Site reliability engineers want to find approaches that work as well as possible. They don’t usually worry about whether the client wants to use a specific programming language or cloud service provider. They just want things to work, so they favor IT operations practices that improve uptime and system reliability.

Keep this flexibility in mind when designing DevOps processes. We’ve experienced the headaches that come from taking rigid approaches to product development. Releasing new features can feel like an impossible feat when CI/CD pipelines use a language unfamiliar to members of the team. We know the frantic work that erupts during an incident response that drags down metrics.

By keeping site reliability in mind, you can hope for the best while planning for the worst.

Along the way, you’ll find a test automation framework that works well for the project. You’ll discover that some open-source code components work better than others in your production environments.

We often find that it makes sense to shake up DevOps roles depending on which team members have experience in a project’s most critical areas. We put our egos aside and take on the tasks that improve the end user’s experience and meet required metrics.

When team members encounter problems, note them and fold the lessons into your continuous delivery strategy. If you have the right incident management processes in place, you can evolve with a project’s needs without experiencing excessive downtime.

Build resilience and observability into every project

Let’s step away from the philosophy of how software engineering teams should work with site reliability engineers. From a technical perspective, how can you build resilience and observability into every project, making it easier for team members to reach goals?

Use microservices when possible

Shorten software development lifecycles by avoiding monolithic software and embracing microservices. With microservices, you get a more granular view of why products fail. We find that microservices make it much easier to track metrics, troubleshoot issues, monitor behaviors, and create system logs.
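As a rough illustration of that granular view, here is a minimal sketch in which each service tags its own log lines so a failure can be traced to one component. The service names (billing, auth), the JSON field names, and the helper make_service_logger are assumptions made for this example; it uses only Python's standard logging module and is not tied to any particular platform.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Formats each record as one JSON object tagged with the service name."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": self.service,
            "level": record.levelname,
            "message": record.getMessage(),
        })


def make_service_logger(service: str, stream) -> logging.Logger:
    """Hypothetical helper: builds one logger per microservice."""
    logger = logging.getLogger(f"svc.{service}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output confined to our handler
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers.clear()
    logger.addHandler(handler)
    return logger


# Both services write to a shared stream, standing in for a log collector.
stream = io.StringIO()
billing = make_service_logger("billing", stream)
auth = make_service_logger("auth", stream)

billing.error("payment provider timeout")
auth.info("login ok")

# Because every line names its service, finding the failing one is a filter.
entries = [json.loads(line) for line in stream.getvalue().splitlines()]
failing_services = [e["service"] for e in entries if e["level"] == "ERROR"]
```

With a monolith, the same error would arrive without a clear owner. Here the filter points straight at the service that needs attention.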

Additionally, you can reuse microservices in future projects. You don’t need to “reinvent the wheel” when building the next product. Instead, you can package together microservices that have already proven themselves useful.

Embrace event-driven architecture

Updating apps can always have unexpected consequences. We like event-driven architecture because some features can fail without crashing the entire platform. It’s an adaptable architecture that adjusts to changes instead of ruining the entire user experience.

For example, an update might mean an app no longer connects to an integral API that lets users send SMS messages. Routing those users to other options — such as email and Slack messages — doesn’t disrupt the experience nearly as much as rejecting all types of messages.
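The fallback described above can be sketched with a tiny in-process event bus. Everything here is hypothetical: the EventBus class, the event names (message.requested, channel.failed, message.undeliverable), and the three sender functions stand in for whatever message broker and SMS, email, or Slack integrations a real product would use.

```python
from typing import Callable, Dict, List, Tuple


class ChannelUnavailable(Exception):
    """Raised when a delivery channel cannot be reached."""


def send_sms(user: str, text: str) -> str:
    # Simulates the SMS integration that broke after an update.
    raise ChannelUnavailable("SMS API unreachable after update")


def send_email(user: str, text: str) -> str:
    return f"email to {user}: {text}"


def send_slack(user: str, text: str) -> str:
    return f"slack to {user}: {text}"


class EventBus:
    """Minimal publish/subscribe bus, for illustration only."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[dict], None]]] = {}

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._handlers.get(event_type, []):
            handler(payload)


Channel = Tuple[str, Callable[[str, str], str]]


def make_message_handler(bus: EventBus, channels: List[Channel], delivered: List[str]):
    """Tries each channel in order and emits an event for every failure."""

    def handle(payload: dict) -> None:
        for name, send in channels:
            try:
                delivered.append(send(payload["user"], payload["text"]))
                return  # first working channel wins
            except ChannelUnavailable as exc:
                bus.publish("channel.failed", {"channel": name, "reason": str(exc)})
        bus.publish("message.undeliverable", payload)

    return handle


bus = EventBus()
delivered: List[str] = []
failures: List[dict] = []

bus.subscribe("channel.failed", failures.append)
bus.subscribe(
    "message.requested",
    make_message_handler(
        bus, [("sms", send_sms), ("email", send_email), ("slack", send_slack)], delivered
    ),
)

bus.publish("message.requested", {"user": "ana", "text": "Build finished"})
```

In this sketch the user still gets the message by email, while the channel.failed event records the SMS problem for whoever subscribes to it.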

Event-driven architecture can also notify on-call operations teams of emerging problems. But more about that in the next section.

Send alert notifications to SRE and development teams

Companies must adhere to their service-level agreements (SLAs). Failing to do so could mean losing revenue and damaging your brand reputation.

For example, a development company that makes niche software for clients might have an SLA that requires responding to outages within two hours. Failing to meet terms like that could mean the client pays a lower rate. Essentially, the development company didn’t meet its contract obligations, so the client gets a partial “refund.”

A software or IT company might have a broad range of service-level indicators (SLIs), service-level objectives (SLOs), and SLAs to track. SLIs measure how a service behaves, SLOs set targets for those measurements, and SLAs write the commitments into contracts. Of course, you usually have an error budget, derived from the SLO, that allows for a certain amount of downtime. It’s nearly impossible to provide 100% perfect service every second of the day. Exceeding the error budget, however, can mean making less money. It could also make it harder to attract future clients.

Correcting problems quickly to stay within error budgets

Error budgets rarely give service providers much wiggle room. Perhaps the budget allows about 20 minutes of downtime per month. That means your teams need to respond to events quickly.
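To put numbers on that, the arithmetic behind an error budget can be sketched as follows. The 30-day window and the function names are assumptions chosen for this example; contracts may define the measurement window differently. Under that assumption, a 20-minute monthly budget works out to roughly 99.95% availability.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in the assumed window


def allowed_downtime_minutes(slo: float, window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Downtime permitted by an availability SLO (e.g. 0.9995) over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_minutes


def implied_slo(budget_minutes: float, window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Availability target implied by a downtime budget over the window."""
    return 1.0 - budget_minutes / window_minutes


def remaining_budget_minutes(
    slo: float, downtime_so_far: float, window_minutes: int = MINUTES_PER_30_DAYS
) -> float:
    """Budget left after the downtime already spent; negative means overspent."""
    return allowed_downtime_minutes(slo, window_minutes) - downtime_so_far


budget = allowed_downtime_minutes(0.9995)        # about 21.6 minutes
target = implied_slo(20)                         # about 0.99954
left = remaining_budget_minutes(0.9995, 15.0)    # about 6.6 minutes
```

A single 15-minute outage uses up most of the month's allowance in this sketch, which is why fast response matters so much.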

An automated notification system that alerts members of the SRE and development teams gives product and service providers a better chance of meeting their goals.

It isn’t enough to notify one group. You’ll likely need the SRE team working on long-term scaling and infrastructure solutions while your software engineers focus on implementing code that gets the product back online as soon as possible.

Having the two groups working in coordination provides the best path forward and helps avoid future issues.
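A minimal sketch of that kind of notification fan-out might look like the following. The Alert and Team classes, the severity label, and the task wording are all hypothetical; a real setup would deliver through a paging or chat tool instead of an in-memory inbox.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Alert:
    service: str
    summary: str
    severity: str  # e.g. "page" or "ticket"; labels are assumptions


@dataclass
class Team:
    name: str
    inbox: List[str] = field(default_factory=list)

    def notify(self, alert: Alert, task: str) -> None:
        # Stand-in for a pager, email, or chat integration.
        self.inbox.append(
            f"[{alert.severity}] {alert.service}: {alert.summary} -> {task}"
        )


# Assumed split of responsibilities, following the text above.
TASKS: Dict[str, str] = {
    "sre": "check scaling and infrastructure",
    "dev": "ship a fix or rollback",
}


def fan_out(alert: Alert, teams: Dict[str, Team]) -> List[str]:
    """Sends the same alert to every team, each with its own task."""
    notified = []
    for key, team in teams.items():
        team.notify(alert, TASKS[key])
        notified.append(team.name)
    return notified


teams = {"sre": Team("SRE"), "dev": Team("Development")}
alert = Alert(service="checkout", summary="error rate above threshold", severity="page")
notified = fan_out(alert, teams)
```

Both teams receive the same facts at the same moment, but each message carries the task that team owns, which keeps the response coordinated.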

Give teams the tools they need for better collaboration

We’ve tested several collaboration platforms over the years. Some standout options include Jira, Asana, Slack, and Trello. All of these tools help streamline workflows for efficient software delivery.

Plus, engineering leaders can use them to assign tasks to specific people or teams, so everyone can share accountability while working on the same project. Two people might work on separate teams, but they can still strive for the same goals when writing code, solving latency problems, and looking for bug fixes.

Realistically, it doesn’t matter which collaboration tool you choose as long as it meets the teams’ needs. Most offer free trials, so try a few before you commit to one. Of course, we’re happy to help you review some options so you can save time.

Promote cross-team collaboration with help from Adservio

Getting software engineering teams and site reliability teams to work together effectively often relies on having the right policies and tools. Make it easier for your teams to reach their goals by reaching out to Adservio. Our experience working with diverse companies gives us insight into how you can help your teams collaborate and exceed client expectations.

Published on

July 14, 2024