In Part 2, I pointed out that live site monitoring is about answering two questions: “is everything working?” and “is everything healthy?”, and that invariants, logs, and synthetic transactions are the three ways to find the answers.
For those who are building, using, or improving live site monitoring, knowing how to find the answers is not enough; they also need to be aware of and consider the following four aspects:
I. Knowing what has changed in production helps the monitoring answer "is everything working" and "is everything healthy" more effectively.
Some changes are triggered by humans, for example a code change or a configuration change. Empirical data shows that about two-thirds of live site incidents are triggered by code changes and configuration changes (including the case of a pre-existing bug that remains dormant until a new version rollout or a configuration change surfaces it). So after rolling out a new version or flipping a configuration setting on or off, we want to get the answers as soon as possible (it's OK and reasonable to make a quick preliminary assessment first, then take a bit more time to get a fuller assessment). Similarly, when a live site issue is fixed, we also want to get the answer ASAP to confirm the fix. For manually triggered changes, we know the delta (in code or in configuration), so we can look first at the areas related to the delta when trying to answer "is everything working" and "is everything healthy" (see the sketch below).
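Here is a minimal sketch of that "look at the delta first" idea. Every name in it (the component-to-check mapping, the check names, the `run_check` callable) is hypothetical, not from any particular monitoring system:

```python
# Hypothetical mapping from changed components to the checks that cover them.
CHECKS_BY_COMPONENT = {
    "checkout-service": ["checkout_error_rate", "payment_latency"],
    "auth-config":      ["login_success_rate", "token_issuance"],
}

ALL_CHECKS = ["checkout_error_rate", "payment_latency",
              "login_success_rate", "token_issuance", "storage_free_space"]

def checks_for_delta(changed_components):
    """Return the checks related to the delta, to be run first."""
    related = []
    for component in changed_components:
        for check in CHECKS_BY_COMPONENT.get(component, []):
            if check not in related:
                related.append(check)
    return related

def assess_rollout(changed_components, run_check):
    """Preliminary assessment on the delta first, then a fuller pass.
    `run_check(name)` stands in for whatever evaluates one health check."""
    preliminary = {c: run_check(c) for c in checks_for_delta(changed_components)}
    full = {c: run_check(c) for c in ALL_CHECKS}
    return preliminary, full
```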
Other changes are not triggered by humans. They happen naturally as time goes by, or under the influence of the external environment. For example: the message queue length grows as incoming traffic picks up; the data store size grows over time; a system slowly leaks resources; a password expires, a certificate expires, customer accounts are deleted by an automated backend job after 90 days of non-payment, etc. Some of these changes can build up very quickly, such as the surge in simultaneous connections during the men's ice hockey semifinal between the USA and Canada at the 2014 Winter Olympics. Knowing what is changing even when no one is touching the system helps target the monitoring more precisely; the sketch below shows the idea.
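A minimal sketch of watching two such no-human-involved changes, certificate expiry and steady queue growth. The thresholds and sampling details are invented for illustration:

```python
from datetime import datetime, timezone

def days_until_expiry(cert_not_after: datetime) -> float:
    """How many days remain before a certificate expires."""
    return (cert_not_after - datetime.now(timezone.utc)).total_seconds() / 86400

def queue_is_trending_up(samples, min_growth_per_sample=0):
    """True if the queue length grew between every pair of consecutive samples."""
    return all(b - a > min_growth_per_sample for a, b in zip(samples, samples[1:]))

# Illustrative usage: warn two weeks before a certificate expires, and flag a
# queue whose hourly length samples only ever grow.
#   if days_until_expiry(cert_not_after) < 14: ...warn...
#   if queue_is_trending_up(last_24_hourly_lengths): ...warn...
```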
II. Understand the confidence level of the answer.
Anyone can give an answer to the question "is everything working". I could answer "yes" out of ignorance ("I haven't heard about any incidents"). That's a legitimate answer, but a very low-confidence one. When it comes to live site monitoring and answering the questions "is everything working" and "is everything healthy", we need the answers to come with higher confidence, to reduce false alarms (false positives).
The bar of "high confidence" may vary. For example, we may tune the monitoring system to be a little bit more conservative during the nights, so that we don’t wake people up too easily and prematurely. We could be more aggressive (lowering the bar) during working hours or special occasions (e.g. the winter Olympic).
Time is the key factor for a monitoring system to gain confidence in the answer (either positive or negative). For a slow-building issue, it usually takes hours or days to confirm. To differentiate a sustained jump from a single spike, the system needs to collect data for a bit longer. In live site monitoring, we often do a quick sanity check to give a preliminary answer (low confidence), then spend more time to be sure (higher confidence), as in the sketch below.
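A minimal sketch of trading time for confidence: one sample over the threshold is only a spike (a preliminary, low-confidence signal), while several consecutive samples over it are treated as a sustained jump. The window size and threshold are arbitrary illustrations:

```python
def classify(samples, threshold, window=5):
    """Return 'jump' only after `window` consecutive breaches, 'spike' for a
    breach we have not yet confirmed, and 'normal' otherwise."""
    if len(samples) >= window and all(s > threshold for s in samples[-window:]):
        return "jump"      # higher confidence, but it cost `window` samples of time
    if samples and samples[-1] > threshold:
        return "spike"     # preliminary answer, low confidence
    return "normal"
```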
In short, it takes time to gain higher confidence. That's why a shorter MTTD (mean time to detect) and a lower noise ratio are, in general, at odds with each other. That seems obvious when stated here, but in reality many people forget it in day-to-day work, especially in complex contexts. I have seen it as a common pitfall: people produce designs that try to achieve a shorter MTTD and a lower noise ratio at the same time, and leadership sometimes challenges the team to improve both. It's not unachievable, but it's harder than most people think.
III. Understand the different levels of turnaround time (usually referred to as MTTD, mean time to detect) and what kind of monitoring approach (invariants, logs, or synthetic transactions) we should invest in, either to move to a higher level of responsiveness or to improve within the same level.
The basic level of turnaround time is to learn about the issue after customers have run into it. Synthetic transactions may not be the best place to invest if we want to shorten the time from when the first few customers hit the issue to when we know about it. Instead, we should rely more on detecting anomalies and outliers in aggregated logs, as in the sketch below.
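A minimal, entirely illustrative sketch of the aggregated-logs approach: count errors per time window and flag the latest window when it is an outlier against a rolling baseline of recent windows:

```python
from statistics import mean, stdev

def is_anomalous(error_counts_per_window, k=3.0):
    """Flag the latest window if it sits more than k standard deviations
    above the mean of the previous windows."""
    if len(error_counts_per_window) < 3:
        return False                      # not enough history for a baseline yet
    *history, latest = error_counts_per_window
    baseline, spread = mean(history), stdev(history)
    return latest > baseline + k * max(spread, 1.0)
```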
It is better if we know about the issue before it affects any customer. That's a much better turnaround time. To get ahead of the customers, we must use the synthetic transaction approach; the other two approaches (invariants and logs) cannot help when there is no customer impact yet. However, as pointed out in Part 2, synthetic transactions can become very expensive if we want to use them to cover more granular cases. Which means, to balance cost and benefit, it is more practical and realistic to invest only in catching major problems ahead of customers (a sketch follows below) and to let issues in the granular cases remain until some customers are affected. In other words, catching all live site issues ahead of customers should not be the North Star.
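A minimal sketch of one synthetic transaction probing a major scenario. The URL and the failure criteria are placeholders, not a real endpoint:

```python
import time
import urllib.request

def probe_checkout(url="https://example.com/api/checkout/health", timeout=5):
    """Run one synthetic transaction; return (ok, latency_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False
    return ok, time.monotonic() - start

# A scheduler would call probe_checkout() every few minutes and alert when it
# fails repeatedly. Covering every granular scenario with its own probe is
# where the cost balloons, which is why it should not be the North Star.
```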
Some may ask: shouldn't the ideal turnaround time be negative, detecting the issue before it even exists in production? Of course that's even better. But that is no longer a live site monitoring concern. Preventing issues from getting into the live site is a QA responsibility, a software testing matter.
IV. How will the answers be delivered?
Many people equate this with sending alerts. But sending alerts is just one way to deliver the answers to "is it working" and "is it healthy". There are many other delivery mechanisms. Some groups have a big flat-screen TV on the wall in their hallway, showing a bunch of real-time numbers and charts. When an issue happens, the numbers turn red or start flashing and the bar or line in the chart shoots up, so people walking by notice it. Such a flat-screen TV is also a mechanism to deliver the answer. Sometimes the answer is delivered without being requested, such as when a threshold is breached in the night and the on-call person gets called.
The differences between the delivery mechanisms are:
Is the delivery guaranteed? A flat-screen TV is not guaranteed delivery, since we can't make sure the right people will walk by and notice the red numbers. Emails and text messages are not guaranteed delivery either; people may not be checking email and text messages all the time. Calling the on-call person's cell phone is guaranteed delivery: if the on-call person doesn't answer, the call is retried multiple times and falls back to the secondary on-call and the backup on-call, until someone answers the phone (sketched after this list).
Is the delivery explicit or implicit? Implicit: no bad news is good news. Explicit: we keep receiving good news ("yes, everything has been working fine in the last 30 minutes") in order to have peace of mind about the delivery channel itself (to avoid wondering whether there really is no bad news, or the bad news got lost on its way); this is also sketched after the list.
How soon is the answer delivered? It depends on factors including how bad the issue is (severity, impact assessment) and how sure we are about it (confidence level). Usually there is a conflict between confidence level and speed: we can send alerts aggressively, at the cost of a high noise ratio, or we can tune the monitoring system to wait until it is quite sure about the issue before alerting, at the cost of a longer MTTD (mean time to detect).
Who is the recipient? A phone call is guaranteed delivery, but only one person receives it at a time. Emails can be sent to a group of people of our choice. Flashing screens in the hallway will be seen, but we don't know exactly who will see them. We also want the message delivered to the right people: those who need to know and/or can do something about it.
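A minimal sketch of two of the ideas above: guaranteed delivery via an on-call escalation chain with retries, and explicit "good news" heartbeats so that silence is never ambiguous. All names, retry counts, and wait times are invented, and `call` stands in for whatever paging integration is actually in use:

```python
import time

def page_until_answered(oncall_chain, call, retries_per_person=3, wait_seconds=60):
    """Guaranteed delivery: retry the primary, then fall back to the secondary
    and backup on-call, until someone answers. `call(person)` is the paging
    integration and returns True when the call is answered."""
    for person in oncall_chain:            # e.g. ["primary", "secondary", "backup"]
        for _ in range(retries_per_person):
            if call(person):
                return person
            time.sleep(wait_seconds)
    raise RuntimeError("no one answered; escalate further")

def heartbeat_message(checks_passed: int, window_minutes: int = 30) -> str:
    """Explicit delivery: keep sending good news so we know the channel itself
    is alive, not just that there happens to be no bad news."""
    return f"All {checks_passed} checks passed in the last {window_minutes} minutes."
```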
Among all these ways, none is the best way. There is only the right way for a given situation. "Right" means delivering the message to the right audience, with the right balance between shorter delay and higher confidence, and containing the right level of detail and actionable data.
Summary: this blog series (Part 1, Part 2, Part 3) captures the mental model that I use to look at live site monitoring. This model helps me better see the context of each topic related to monitoring, where each fits in the bigger picture, and how they relate to each other.
//the end