The golden signals of API monitoring form the foundation of observability at the service level.
These four signals are essential for large-scale applications because they help measure user experience, business impact, and service abandonment.
As we dive into these signals, you'll learn why it's important to measure them and how to begin measuring them effectively.
The Four Golden Signals
The golden signals of monitoring come down to latency, traffic, errors, and saturation.
You'll find them championed by countless thought leaders, most notably in Google's Site Reliability Engineering book, and by the SRE community at large.
Even if you set aside the growing popularity of site reliability engineering (SRE) in the tech industry, the golden signals remain fundamental metrics for tracking service performance and health. Here's what they boil down to.
Latency (Time taken to serve a request)
Latency is defined as the amount of time required to service a request.
For this signal, it's important to distinguish between the latency of a failed request and that of a successful one.
Latency should be monitored for every API call, which can be done by observing client requests for individual endpoints and monitoring server responses.
However, a single latency figure per endpoint can hide important differences and lead to incorrect conclusions, so consider tracking latency separately for each class of server response code.
An HTTP 500 error that fails quickly is better than one that fails slowly.
Along those lines, even an HTTP 200 OK response can be considered an error if it's delayed.
Comparing the latency of different errors and responses allows for better visibility into response time.
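To make the comparison concrete, here is a minimal sketch that groups request latencies by response status class and reports a percentile for each group. The function names, the (status, latency) tuple format, and the nearest-rank percentile method are assumptions chosen for illustration, not part of any particular monitoring tool.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_status_class(requests, pct=95):
    """Return the pct-th percentile latency for each status class.

    `requests` is an iterable of (http_status, latency_ms) tuples,
    a hypothetical record shape used only for this example.
    """
    groups = defaultdict(list)
    for status, latency_ms in requests:
        # 200 -> "2xx", 404 -> "4xx", 500 -> "5xx"
        groups[f"{status // 100}xx"].append(latency_ms)
    return {cls: percentile(values, pct) for cls, values in groups.items()}


if __name__ == "__main__":
    sample = [(200, 120), (200, 80), (200, 100), (500, 30), (500, 900)]
    print(latency_by_status_class(sample, pct=50))
```

Keeping successful and failed requests in separate groups prevents a burst of fast failures from making the overall latency look healthier than it is.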
Traffic (The stress from demand on the system)
Traffic measures the amount of demand for a given service.
It's a high-level but service-specific metric; for a web service, it is typically the number of HTTP requests per second.
Also referred to as "throughput," traffic is usually expressed in requests per second (RPS), broken down by REST endpoint or other dimensions.
It's important to go beyond a summarized number, though, and further break down this information by request type, host, status code, body content, and so on.
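As a rough illustration of such a breakdown, the sketch below computes requests per second for each (endpoint, status code) pair observed in a fixed time window. The event shape and the function name are hypothetical; a real system would normally read these values from access logs or a metrics library.

```python
from collections import Counter


def requests_per_second(events, window_seconds):
    """Compute RPS per (endpoint, status) pair.

    `events` is an iterable of (endpoint, http_status) tuples collected
    during a window of `window_seconds` seconds (assumed record shape).
    """
    counts = Counter(events)
    return {key: count / window_seconds for key, count in counts.items()}


if __name__ == "__main__":
    events = (
        [("/users", 200)] * 120
        + [("/users", 500)] * 6
        + [("/orders", 200)] * 60
    )
    for (endpoint, status), rps in requests_per_second(events, 60).items():
        print(f"{endpoint} {status}: {rps:.2f} req/s")
```

The same pattern extends to other dimensions, such as host or request type, by adding them to the tuple used as the key.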
Errors (Rate of requests that are failing)
Errors are defined as the rate of requests that do not succeed.
Errors may be explicit, like HTTP 500 errors, or implicit, like HTTP 200 OK responses that return the wrong content.
Oftentimes, error rates are monitored by tracking server response codes, but those alone may not offer sufficient information to identify a failure. When that's the case, you can identify a failure by checking other metrics.
For instance, you can watch for anomalous response sizes: a malformed request is likely to produce a response that is much smaller, say 2 KB, than a successful one, which may average 500 KB.
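The response-size check described above could be sketched as follows. The 10 KB cutoff, the choice to count only 5xx codes as explicit failures, and the function names are all assumptions for illustration; a real threshold should come from the sizes your own service normally returns.

```python
def is_failure(status, body_bytes, min_expected_bytes):
    """Classify a response as failed, explicitly or implicitly."""
    if status >= 500:
        # Explicit failure: the server reported an error.
        return True
    # Implicit failure: a "successful" response that is suspiciously small.
    return status == 200 and body_bytes < min_expected_bytes


def error_rate(responses, min_expected_bytes=10_000):
    """Fraction of failed responses.

    `responses` is a list of (http_status, body_bytes) tuples
    (hypothetical record shape). Returns 0.0 for an empty list.
    """
    if not responses:
        return 0.0
    failures = sum(
        1 for status, size in responses
        if is_failure(status, size, min_expected_bytes)
    )
    return failures / len(responses)


if __name__ == "__main__":
    responses = [(200, 500_000), (200, 2_000), (500, 300), (200, 480_000)]
    print(f"error rate: {error_rate(responses):.0%}")
```

Counting only status codes here would report one failure out of four; adding the size check also catches the 2 KB response that came back with a 200.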
Saturation (The overall capacity of the service)
Lastly, saturation describes how full a service is, measuring overall system utilization while emphasizing the most constrained resources.
Services tend to perform poorly as they approach high saturation.
Important saturation metrics relate to infrastructure, namely network I/O, disk I/O, and system memory.
Of course, monitoring saturation is often the most difficult, as it requires both significant flexibility and a variety of utilization metrics.
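One simple way to emphasize the most constrained resource is to express each resource as a fraction of its capacity and report the highest one. The sketch below does that; the resource names, units, and capacity figures are made up for the example, and collecting the actual usage values is left to whatever agent or exporter you already run.

```python
def most_constrained(usage, capacity):
    """Return (resource_name, utilization) for the fullest resource.

    `usage` and `capacity` are dicts keyed by resource name, using the
    same unit per resource. Utilization is usage / capacity.
    """
    utilization = {name: usage[name] / capacity[name] for name in capacity}
    bottleneck = max(utilization, key=utilization.get)
    return bottleneck, utilization[bottleneck]


if __name__ == "__main__":
    # Hypothetical readings for a single host.
    usage = {"network_io_mbps": 400, "disk_io_mbps": 90, "memory_gb": 12}
    capacity = {"network_io_mbps": 1000, "disk_io_mbps": 100, "memory_gb": 16}
    name, value = most_constrained(usage, capacity)
    print(f"most constrained: {name} at {value:.0%}")
```

Reporting the single fullest resource keeps the saturation signal easy to read, while the per-resource figures remain available for deeper investigation.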
How to Utilize the Golden Signals
Implementing the golden signals represents a major step, but it's far from the last.
Once the signals are in place, you can use them to determine additional monitoring required by your platform or application.
Since the golden signals are key to observability, you can apply the signals to monitor runtimes and user experience and use them to create meaningful dashboards for stakeholders.
Moreover, use the golden signals to collect and store data that establishes normal performance markers and trends. Additionally, use that data to explore hypotheses and test capabilities.
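As one possible way to turn stored data into a "normal" marker, the sketch below builds a baseline from historical values of a signal and flags new values that fall far outside it. Using the mean and standard deviation with a three-sigma cutoff is an assumption made for simplicity; seasonal or skewed metrics usually need a more careful model.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Return (mean, standard deviation) for at least two historical values."""
    return mean(history), stdev(history)


def is_anomalous(value, baseline, k=3.0):
    """True if `value` is more than k standard deviations from the mean."""
    mu, sigma = baseline
    return abs(value - mu) > k * sigma


if __name__ == "__main__":
    # Hypothetical p95 latency readings in milliseconds.
    history = [100, 102, 98, 101, 99]
    baseline = build_baseline(history)
    for reading in (101, 150):
        print(reading, "anomalous" if is_anomalous(reading, baseline) else "normal")
```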
Not only are these metrics a valuable data source, but they can contribute to dynamic dashboards and offer extremely useful information in the event of an incident.
Additionally, you can use the golden signals to help send more actionable alerts to your organization.
No team member likes dealing with the toil created by false alarms, and alerts that don't contain enough actionable information are just as wasteful.
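A minimal sketch of one way to address both problems: fire only when a signal stays above its threshold for several consecutive samples, which filters out brief spikes, and attach the context a responder needs. The function name, the payload fields, and the default of three consecutive samples are illustrative assumptions rather than the behavior of any specific alerting product.

```python
def evaluate_alert(signal, samples, threshold, min_consecutive=3):
    """Return an alert payload, or None if no alert should fire.

    An alert fires only when the last `min_consecutive` samples all
    exceed `threshold`, so a single spike does not page anyone.
    """
    recent = samples[-min_consecutive:]
    if len(recent) < min_consecutive or any(v <= threshold for v in recent):
        return None
    return {
        "signal": signal,
        "threshold": threshold,
        "recent_values": recent,
        "message": (
            f"{signal} above {threshold} "
            f"for {min_consecutive} consecutive samples"
        ),
    }


if __name__ == "__main__":
    print(evaluate_alert("error_rate", [0.01, 0.20, 0.01], 0.05))
    print(evaluate_alert("error_rate", [0.01, 0.08, 0.09, 0.12], 0.05))
```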
When used correctly, golden signals will help your company improve productivity, efficiency, and performance across the board.
Improving Your Cloud-Native Applications
With the implementation of golden signals, you can be well on your way to improving the performance of your organization's cloud-native applications.
However, it's important to recognize that the golden signals on their own aren't enough.
They form the foundation of observability, but that doesn't mean they can work alone.
For the best results, you should implement the golden signals and then look for gaps to identify the other monitoring solutions your systems may require.
Once your systems are being fully monitored, the next step is utilizing all of the valuable data these monitoring tools are collecting for your organization.
Aside from filling dynamic dashboards, these signals can help transform workflows if used correctly.
By implementing the golden signals and following best practices, you stand to improve productivity and efficiency across the board.
Do you need help monitoring and maintaining the health of your systems? Even with the golden signals, it can be a lot to handle.
Adservio works with the world's top brands each and every day to keep their tech stack strong. Contact us today to learn more.