How to Simplify Monitoring for Modern Applications

SRE and the Four Golden Signals of Monitoring

Let’s look at the Site Reliability Engineering (SRE) discipline and how we can apply it to simplify monitoring for complex modern applications. This will help us identify root causes more quickly and drastically reduce the mean time to recovery (MTTR), so that we can maintain the end-user performance we want for our applications.

First, let’s take a look at what happens before we’ve applied these SRE principles to our monitoring. Say I’m the owner of an application, and I’ve gotten an alert that says I’m having a latency issue. My application is critical for the business, so I need to find the root cause quickly. But because my application is part of a complex microservice topology, it can be really difficult to figure out exactly where the root cause is coming from. To make things more complex, all of my dependencies could be based on different technologies: say one is built on Node.js, one is a Db2 database, another is written in Swift, and so on.

Each of these has different metrics that are typically monitored, and I may not be an expert in any of these technologies, so it may be difficult for me personally to go in and figure out what the problem is. I would have to call in an expert for each technology. As you can imagine, it is time-consuming for everyone to go through their service and figure out whether there is a problem or whether I need to keep going downstream, and all the while my users are still experiencing this latency issue.

Now, what if there was a better way? This is what we can learn from the SRE discipline, which tells us that there are really only four key performance indicators that we need to monitor, not all the different metrics for each technology. We call these the golden signals.

The golden signals are latency, which is the time it takes to service a request; errors, which is a view of the request error rate; traffic, which is the demand placed on the system; and saturation, which is our utilization versus max capacity.

Now let’s go back to our initial example and see how this would work when applying the golden signals. We’ll call my service service A, and we know it has a latency issue. Latency is typically a symptom rather than a cause, so we start by examining service A itself. Let’s say we’re not seeing any of the causes there, so we know we have to keep looking downstream. But we don’t want to go back to this complicated microservice topology and try to figure it all out. Some APM tools can help with this by identifying only the services that are one hop away from the service in question.

Let’s say services B, C and D are connected to service A, the one that’s having the problem. No matter what technology these services are built on, all we need to do is look at their golden signals. We look at the golden signals for service B and everything looks fine, so we know service B is not the problem. Service C is the same scenario: we don’t see any issues, so we can eliminate it as well. For service D, let’s say we’re seeing an issue with saturation, which is trending upwards. Right there, after only a few minutes, we’ve identified service D as our likely root cause.

Now, instead of having to pull in the experts for each of these different services, we can go directly to the owners of service D and let them know we’ve identified their service as a likely cause of the issue we’re having, and they can go about fixing it. What’s even better is that if they’re using golden signals to monitor their service, it’s very likely they’ve already identified the problem and are already working on the fix.

As you can see, this process drastically reduces the time it takes to work through a complex topology and many different technologies to figure out where your root cause is and how to fix it. So when you’re choosing an APM tool, make sure it offers the ability to use these golden signals and this one-hop dependency view, so that you can quickly identify root causes and get your service restored as quickly as possible.
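To make the one-hop check concrete, here is a minimal Python sketch of the triage described above. It is an illustration under stated assumptions, not how any particular APM tool works: the `GoldenSignals` class, the `LIMITS` thresholds, the function names and all of the numbers are hypothetical. It also simplifies "trending upwards" to a fixed threshold on saturation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenSignals:
    """One snapshot of the four golden signals for a service."""
    latency_ms: float    # time taken to service a request
    error_rate: float    # fraction of requests that fail (0.0 to 1.0)
    traffic_rps: float   # demand on the system, in requests per second
    saturation: float    # utilization divided by max capacity (0.0 to 1.0)


# Hypothetical limits; real values depend on each service's objectives.
LIMITS = GoldenSignals(latency_ms=500.0, error_rate=0.01,
                       traffic_rps=1000.0, saturation=0.80)

SIGNAL_NAMES = ("latency_ms", "error_rate", "traffic_rps", "saturation")


def breached_signals(signals, limits=LIMITS):
    """Return the names of the signals that exceed their limit."""
    return [name for name in SIGNAL_NAMES
            if getattr(signals, name) > getattr(limits, name)]


def triage_one_hop(dependencies, signals_by_service):
    """Check only the services one hop away; return {service: breached signals}."""
    suspects = {}
    for service in dependencies:
        breached = breached_signals(signals_by_service[service])
        if breached:
            suspects[service] = breached
    return suspects


# Made-up numbers mirroring the walkthrough: B and C look fine, D is saturated.
signals_by_service = {
    "B": GoldenSignals(latency_ms=120.0, error_rate=0.001, traffic_rps=300.0, saturation=0.40),
    "C": GoldenSignals(latency_ms=90.0, error_rate=0.002, traffic_rps=250.0, saturation=0.35),
    "D": GoldenSignals(latency_ms=200.0, error_rate=0.003, traffic_rps=400.0, saturation=0.93),
}

suspects = triage_one_hop(["B", "C", "D"], signals_by_service)
print(suspects)  # {'D': ['saturation']}
```

The point of the sketch is that the same four fields are compared for every dependency, regardless of whether it is a Node.js service, a Db2 database or a Swift application, so the owner of service A does not need technology-specific expertise to narrow down the suspect.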


DevSecOps Engineer